Monday, November 7, 2011

MySQL with Cloudera VM for Hadoop

This post covers how to configure Hive to keep its metastore in a MySQL database instead of the default Derby database in Cloudera's VM.

The first step is to install mysql-server, which is available in the package repositories.
Open a terminal and run:

$ sudo apt-get install mysql-server
This starts the MySQL installation.

If the installer prompts for a root password, set it to root (this walkthrough uses root/root throughout).

Once it is installed, you can check that it works. The mysql client can be run from any directory (the configuration files live in /etc/mysql):

$ mysql -uroot -proot

or, if you don't want the password to show on the command line:

$ mysql -uroot -p

(it will prompt for the password)

Then, if you want a separate user for Hadoop, you can create one:
create user 'hadoop'@'localhost' identified by 'hadoop';
grant all privileges on *.* to 'hadoop'@'localhost' with grant option;

I did everything with the root user, though, so I did not need to grant any permissions explicitly (root has them by default).



Then you can 'exit' from the mysql prompt.

After this, you need to change the Hive configuration so that it uses MySQL.
(/etc/hive/conf/hive-site.xml, or wherever you have installed it)
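The exact settings are not listed above, so here is a minimal sketch of the metastore properties that usually need to change. It writes them to a scratch file so they can be reviewed and then pasted inside the <configuration> element of hive-site.xml. The four property names are Hive's standard JDO connection options; the scratch file name, the database name "metastore", the port 3306, and the driver class (the one shipped with Connector/J 5.x) are assumptions, and root/root simply follows the setup in this post.

```shell
# Scratch file (hypothetical name); nothing here touches the real hive-site.xml.
snippet="${TMPDIR:-/tmp}/hive-mysql-snippet.xml"

cat > "$snippet" <<'EOF'
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- "metastore" is an assumed database name -->
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
</property>
EOF

# Show what was written so it can be copied into hive-site.xml.
cat "$snippet"
```

If you created the separate hadoop user, put its name and password in the last two properties instead.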




Now you need the MySQL JDBC driver to make the connection. Go to this site to get the latest version. If it is not compatible with your MySQL version, try an earlier release.

After downloading the file, untar it and copy the jar file inside it to Hive's lib folder (the default is /usr/lib/hive/lib/).
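The untar-and-copy step can be sketched as a small helper. The function name install_jdbc_jar is made up for this example, and the jar name pattern mysql-connector-java-*.jar is an assumption based on how the Connector/J archives are usually laid out; adjust it if your download differs.

```shell
# install_jdbc_jar TARBALL HIVE_LIB_DIR  (hypothetical helper)
# Extracts the archive into a scratch directory and copies any connector jar
# found inside it into the given Hive lib directory.
install_jdbc_jar() {
    tarball=$1
    hive_lib=$2
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/jdbc.XXXXXX") || return 1
    tar -xzf "$tarball" -C "$workdir" || return 1
    # Assumed jar naming; nothing is copied if no file matches.
    find "$workdir" -name 'mysql-connector-java-*.jar' -exec cp {} "$hive_lib"/ \;
    rm -rf "$workdir"
}

# Example call (replace the archive name with the file you downloaded):
# install_jdbc_jar mysql-connector-java-VERSION.tar.gz /usr/lib/hive/lib
```

Writing into /usr/lib/hive/lib normally needs root, so the real call may have to be run from a root shell or with the copy step done under sudo.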

Now you can check that Hive connects to the MySQL database. From a terminal, start the Hive shell and list the tables:

$ hive

hive> show tables;

You will see that the tables you created in the Derby database are no longer listed. Hive is now connected to the MySQL database.

You can now start playing around with Hive backed by MySQL.





Tuesday, November 1, 2011

Making files in UNIX using scripting


Let's say that I need to create some files on a UNIX box. The file names and their contents are listed in a text file. Say I have a text file, “fileContents.txt”, which has the name of a file to be created, followed by its contents, followed by the next file name and its contents, and so on.

The file “fileContents.txt” looks like this:



sds_md_Acc_ContactRead.pl

#!/bin/perl
require "./directory_working/bin/package.pl";
        $exec_dir = $ARGV[0];
        $REPO = $ARGV[1];
----------------------------------------------------------------------------------------

sds_md_Acc_Datatype.pl

#!/bin/perl
require "./directory_working/bin/package.pl";
        $exec_dir = $ARGV[0];
        $REPO = $ARGV[1];
-------------------------------------------------------------------------------------------

sds_md_Bin_anno.pl

#!/bin/perl
………..
………..
………..
And so on and on


Now, we need to write a script that reads each name from this text file, creates that file, and inserts the respective contents into it.

First, we need to get the names of the files we want to create. For that we can use the following command:

cat fileContents.txt | grep '\.pl$'

Here, the symbol ‘\’ escapes the dot so that grep matches a literal ‘.pl’. Without the escape, grep treats ‘.’ as a regular-expression metacharacter that matches any single character.
“The . (period) means any character(s) in this position, for example, ton. will find tons, tone and tonneau but not wanton because it has no following character”

The symbol ‘$’ anchors the match to the end of the line, so grep only picks lines that end in ‘.pl’ (no character after it).
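A small sketch of the difference the backslash makes. The three sample lines are made up for illustration: a real .pl name, a name where an unescaped dot would match the letter x, and a .pl that is not at the end of the line.

```shell
sample=$(printf '%s\n' 'report.pl' 'reportxpl' 'report.pl backup')

# Unescaped: "." matches any single character, so "reportxpl" also gets through.
unescaped=$(printf '%s\n' "$sample" | grep '.pl$')

# Escaped: only lines that end in a literal ".pl" match.
escaped=$(printf '%s\n' "$sample" | grep '\.pl$')

printf 'unescaped:\n%s\n' "$unescaped"
printf 'escaped:\n%s\n' "$escaped"
```

In both cases the third line is rejected, because the ‘$’ anchor requires the match to sit at the end of the line.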

Now, we also need the line number at which each file name occurs. This is required because we know that the contents lie between two file names. So, for line numbers, use:

cat fileContents.txt | grep -n '\.pl$'

Here, -n also prints the line number. Note that in the output the line number comes first and is separated from the file name by a “:”.
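To make the output format concrete, here is a sketch on a tiny scratch file. The file names first.pl and second.pl are made up; the layout mimics fileContents.txt.

```shell
# Scratch copy of the layout used in this post.
demo=$(mktemp "${TMPDIR:-/tmp}/demo.XXXXXX")
printf '%s\n' 'first.pl' '' '#!/bin/perl' '-----' '' 'second.pl' > "$demo"

# grep -n prefixes each matching line with "<line number>:".
matches=$(grep -n '\.pl$' "$demo")
echo "$matches"

# Split the last match on ":" with cut.
last=$(echo "$matches" | tail -n 1)
lineNum=$(echo "$last" | cut -d ":" -f1)
fileName=$(echo "$last" | cut -d ":" -f2)
echo "line=$lineNum name=$fileName"

rm -f "$demo"
```

Field 1 is the line number and field 2 is the file name, which works as long as the file names themselves contain no “:”.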

We can store this result in a file, say “saurabh.txt”:

cat fileContents.txt | grep -n '\.pl$' > saurabh.txt


You can open saurabh.txt to check the result.

Now we will write the script. Create a file, say saurabh.sh, with the following contents:

1.   #!/bin/bash
2.   while read -r LINE
3.   do
4.       end_lineNum=`echo "$LINE" | cut -d ":" -f1`    # line number of this file name
5.       end_lineNum=`echo "$end_lineNum - 3" | bc`     # last content line of the previous file
6.       fileName=`echo "$LINE" | cut -d ":" -f2`
7.       [ -n "$OldFile" ] && sed -n "${start_lineNum},${end_lineNum}p" fileContents.txt > "$OldFile"    # skipped on the first pass: no previous file yet
8.       end_lineNum=`echo "$end_lineNum + 5" | bc`     # first content line of this file
9.       start_lineNum=$end_lineNum
10.      OldFile=$fileName
11.  done < saurabh.txt
12.  end_lineNum=`wc -l fileContents.txt | awk '{print $1}'`
13.  sed -n "${start_lineNum},${end_lineNum}p" fileContents.txt > "$OldFile"


The code is mostly self-explanatory. We first read ‘saurabh.txt’ line by line and split each line into the line number and the file name by cutting on ‘:’. The contents of a file lie between its own name and the next file name, so we keep the position just after the previous file name as ‘start_lineNum’ and the position just before the new one as ‘end_lineNum’. The offsets follow from the layout of ‘fileContents.txt’: the contents of the previous file end three lines above the new file name (a separator line and a blank line sit in between), and the contents of the new file start two lines below its name, which is why the script subtracts 3 and then adds 5. The ‘sed’ command prints the lines between the two line numbers from ‘fileContents.txt’, and the output is redirected into the file, which is created on the fly.

The loop does not create the last file, because it terminates before writing it. Lines 12 and 13 take care of that: line 12 gets the total number of lines in ‘fileContents.txt’, which is the last line of the last file's contents, and line 13 creates that last file and stores the contents.
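To see the offsets at work, here is a self-contained sketch that runs the same idea on a two-file sample in a scratch directory. It is a compact variant, not the script above: it splits on “:” with IFS and read instead of cut, and uses shell arithmetic instead of bc. The names one.pl, two.pl and names.txt are made up for the example.

```shell
work=$(mktemp -d "${TMPDIR:-/tmp}/split.XXXXXX")

# Same layout as fileContents.txt: name, blank line, contents, separator, blank line.
cat > "$work/fileContents.txt" <<'EOF'
one.pl

#!/bin/perl
print "one\n";
-----

two.pl

#!/bin/perl
print "two\n";
EOF

(
    cd "$work" || exit 1
    grep -n '\.pl$' fileContents.txt > names.txt
    prev=
    start=
    while IFS=: read -r n name
    do
        # n - 3 is the last content line of the previous file (skipped on the first pass).
        [ -n "$prev" ] && sed -n "${start},$((n - 3))p" fileContents.txt > "$prev"
        start=$((n + 2))    # contents begin two lines below the name
        prev=$name
    done < names.txt
    # The last file runs to the end of fileContents.txt.
    last=$(wc -l fileContents.txt | awk '{print $1}')
    sed -n "${start},${last}p" fileContents.txt > "$prev"
)

cat "$work/one.pl" "$work/two.pl"
```

The file names match on lines 1 and 7 of the sample, so one.pl receives lines 3 to 4 and two.pl receives lines 9 to 10.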

Execute the script as
sh saurabh.sh


It will create the files along with their contents.

An important point to note: if you use
cat saurabh.txt | while read LINE
do
…….
…….
…….
done


instead of :
while read LINE
do
……
……
……
done < saurabh.txt

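You will find that variables set inside the loop are not accessible after the loop. This is because, in bash and most sh implementations, each command in a pipeline runs in a subshell: the while loop on the right-hand side of the pipe works on its own copy of the variables, and its changes are lost when the pipeline ends. Redirecting the file into the loop keeps the loop in the current shell, so the variables survive.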
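A quick way to check this on your own machine is the sketch below, which counts lines with both forms of the loop. The scratch file and the variable names are made up. The result of the piped form depends on the shell: bash (with default settings) and dash run the loop in a subshell, while some other shells, such as ksh, run the last pipeline stage in the current shell.

```shell
list=$(mktemp "${TMPDIR:-/tmp}/list.XXXXXX")
printf '%s\n' a b c > "$list"

# Piped form: the loop runs in a subshell, so its count is thrown away.
count=0
cat "$list" | while read -r LINE; do count=$((count + 1)); done
piped=$count

# Redirected form: the loop runs in the current shell, so the count survives.
count=0
while read -r LINE; do count=$((count + 1)); done < "$list"
redirected=$count

echo "piped: $piped, redirected: $redirected"
rm -f "$list"
```

This is why the script reads saurabh.txt with a redirect: start_lineNum and OldFile have to be visible to the two sed lines after the loop.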