Monday, November 7, 2011

MySQL with Cloudera VM for Hadoop..

In this post I will discuss how to configure Hive to use a MySQL database instead of the default Derby DB in Cloudera's VM.

The first step is to install mysql-server, which is available in the package repositories.
Open a terminal and run:

$ sudo apt-get install mysql-server
This starts the installation of MySQL.

When prompted, keep the username and password as root.

After it is installed, you can check that it works fine. Go to the location where it is installed (/etc/mysql) and run:

$ mysql -uroot -proot

or, if you don't want to show the password on the command line:

$ mysql -uroot -p

(it will ask for the password)

Then, if you want a separate user for Hadoop, you can create one:

create user 'hadoop'@'localhost' identified by 'hadoop';
grant all privileges on *.* to 'hadoop'@'localhost' with grant option;

I have done it using the root user only, so I don't need to grant the permissions explicitly (root has them by default).



Then you can 'exit' from the MySQL prompt.

After this, you need to change the Hive configuration so that it uses MySQL.
Edit /etc/hive/conf/hive-site.xml (or wherever your Hive configuration is installed).
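At a minimum you need to point Hive's metastore at MySQL through the JDO connection properties. A sketch of what goes inside the configuration tag (the database name "metastore" and port 3306 are assumptions; use the user and password you created above):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
</property>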




Now you need the MySQL JDBC driver (Connector/J) to make the connection. Go to this site to get the latest version. If it is not compatible with your MySQL version, check the previous versions.

After downloading the file, untar it. Copy the jar file from inside it to Hive's lib folder (the default is /usr/lib/hive/lib/).
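For example, something along these lines (the exact archive and jar file names depend on the Connector/J version you downloaded):

$ tar -xzf mysql-connector-java-*.tar.gz
$ sudo cp mysql-connector-java-*/mysql-connector-java-*-bin.jar /usr/lib/hive/lib/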

Now you can check that Hive connects to the MySQL DB. Go to Hive (/etc/hive) and type the command:

$ hive

hive> show tables;

You will see that the tables you created while using the Derby DB are no longer listed; Hive is now connected to the MySQL metastore.
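You can also confirm it from the MySQL side. A quick check (assuming you used a metastore database named "metastore" in the ConnectionURL above):

$ mysql -uroot -p
mysql> show databases;
mysql> use metastore;
mysql> show tables;

You should see the metastore database and the tables Hive created inside it.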

You can now start playing around with Hive using MySQL as the metastore DB.





Tuesday, November 1, 2011

Making files in UNIX using scripting


Let's say I need to create some files on a UNIX box. The names of the files and their respective contents are written in a text file. Say I have a text file "fileContents.txt" which has the name of the file to be created, followed by its contents, which is followed by the next file name and its contents, and so on.

The file “fileContents.txt” is like:



sds_md_Acc_ContactRead.pl

#!/bin/perl
require "./directory_working/bin/package.pl";
        $exec_dir = $ARGV[0];
        $REPO = $ARGV[1];
----------------------------------------------------------------------------------------

sds_md_Acc_Datatype.pl

#!/bin/perl
require "./directory_working/bin/package.pl";
        $exec_dir = $ARGV[0];
        $REPO = $ARGV[1];
-------------------------------------------------------------------------------------------

sds_md_Bin_anno.pl

#!/bin/perl
………..
………..
………..
And so on and on


Now we need to make a script that will read each file name from this text file, create that file, and insert the respective contents into it.

First, we need to get the names of the files we want to create. For that we can use the following command:

cat fileContents.txt | grep '\.pl$'

Here the '\' escapes the dot, because we want to match a literal '.pl'; without the escape, grep treats '.' as a metacharacter:
"The . (period) means any character in this position; for example, ton. will find tons, tone and tonneau but not wanton because it has no following character."

The '$' anchors the match so that grep only takes lines that end with '.pl' (no characters after it).

Now we also need the line number of each file-name occurrence. This is required because we know that a file's contents lie between two file names. So, for line numbers, use:

cat fileContents.txt | grep -n '\.pl$'

Here, -n prints the line number as well. An important thing to note is that the line number and the file name are separated by a ":".

We can store this result in some file, say "saurabh.txt":

cat fileContents.txt | grep -n '\.pl$' > saurabh.txt


You can open and check the file saurabh.txt.
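For the example layout above, saurabh.txt would look roughly like this (the exact line numbers depend on how many blank lines your fileContents.txt has):

2:sds_md_Acc_ContactRead.pl
10:sds_md_Acc_Datatype.pl
18:sds_md_Bin_anno.pl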

Now we will write the script. Make a file, say saurabh.sh, with the following contents:

1.   #!/bin/bash
2.   while read LINE
3.   do
4.       end_lineNum=`echo $LINE | cut -d ":" -f1`            # line number of the current file name
5.       end_lineNum=`echo "$end_lineNum - 3" | bc`            # -3 backs up past the blank line and dashed separator to the end of the previous file's contents
6.       fileName=`echo $LINE | cut -d ":" -f2`                # name of the current file
7.       [ -n "$OldFile" ] && sed -n "$start_lineNum","$end_lineNum"p fileContents.txt > "$OldFile"   # write the previous file (skipped on the very first pass)
8.       end_lineNum=`echo "$end_lineNum + 5" | bc`            # +5 moves forward to the first content line of the current file
9.       start_lineNum=$end_lineNum
10.      OldFile=$fileName
11.  done < saurabh.txt
12.  end_lineNum=`wc -l fileContents.txt | awk '{print $1}'`   # last line of the whole file
13.  sed -n "$start_lineNum","$end_lineNum"p fileContents.txt > "$OldFile"   # write the final file


The code is mostly self-explanatory. We first read the file 'saurabh.txt' and split each line into the line number and the file name by cutting on ':'. The contents of a file lie between the line numbers of two consecutive file names, so we keep the adjusted line number of the previous file name as 'start_lineNum' and that of the current one as 'end_lineNum', and use 'sed' to pull the lines in between out of 'fileContents.txt' into a file that is created on the fly (the check on line 7 simply skips the very first iteration, where there is no previous file yet). The -3 and +5 adjustments account for the blank lines and the dashed separator around each file name in the layout shown above. The last file is not created by the loop, because the loop terminates before writing it; that is what lines 12 and 13 are for. Line 12 gets the total number of lines in 'fileContents.txt', which is the last line of the last file's contents, and line 13 creates that last file and stores its contents.

Execute the script as
sh saurabh.sh


It will create the files along with their contents. :-)

An important point to note: if you use

cat saurabh.txt | while read LINE
do
…….
…….
…….
done


instead of :
while read LINE
do
……
……
……
done < saurabh.txt

you will notice that variables defined inside the loop are not accessible outside it. This is because piping cat into the loop runs the while loop in a subshell, so any variables set inside it are lost when that subshell exits; with the 'done < saurabh.txt' redirection the loop runs in the current shell and the variables survive.
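A minimal sketch of the difference (any small text file will do as input):

# pipe version: the loop runs in a subshell, so count is lost afterwards
count=0
cat saurabh.txt | while read LINE; do count=`expr $count + 1`; done
echo "pipe version: $count"        # prints 0

# redirection version: the loop runs in the current shell, so count survives
count=0
while read LINE; do count=`expr $count + 1`; done < saurabh.txt
echo "redirect version: $count"    # prints the number of lines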

Thursday, October 27, 2011

Installing Hadoop on Windows using Cygwin...


Hi all, this time we will be discussing installing Hadoop on Windows using Cygwin. The first step is to install Cygwin on the Windows system. You can download Cygwin from here (http://www.cygwin.com/). Download and install it. Make sure you select the "OpenSSH" package when it asks which packages to include, as shown:



Now set the environment variable (if the installer did not do it for you) so that your PATH points to the Cygwin installation.

Configure the ssh daemon. For it, execute the command:
ssh-host-config

When asked if privilege separation should be used, answer no.
When asked if sshd should be installed as a service, answer yes.
When asked about the value of CYGWIN environment variable, enter ntsec.

Now, if the Cygwin sshd service is not started, open "services.msc" from Run and start it.

Then you need to set up key-based authorization so that ssh does not ask for a password every time you connect. For this, follow these steps.
Execute this command in Cygwin:
ssh-keygen

When it asks for the file names and passphrases, just press ENTER (without typing anything) to accept the default values.

After that, once the command executes fully, type the following:
cd ~/.ssh

To check the keys, you can do "ls". You will see the id_rsa.pub and id_rsa files.

If this is the first time you are using Cygwin on your system, that is fine; otherwise you will need to append the public key to the authorization file.

For that execute:
cat id_rsa.pub >> authorized_keys

Now try to log in to your own localhost by executing:
ssh localhost

The first time it will ask for confirmation; answer yes, and from the next time onwards you will not have to do it again.

It's all done. Your Cygwin is ready for you to install Hadoop in it and start working. So you need to download Hadoop; you can get it here (http://www.gtlib.gatech.edu/pub/apache//hadoop/common/).

You will also need Java installed on your machine, because Hadoop runs on Java and you will have to set the JAVA_HOME environment variable. I have used Java 1.6 and the tutorial is based on that version.

After you have downloaded Hadoop, you need to extract it (it comes as a compressed tarball). To extract it, type the following command at the Cygwin prompt:
tar -xzf hadoop-0.20.203.0.tar.gz

(The file name may be different depending on the version you downloaded.) I have extracted Hadoop into the Cygwin folder under the C drive, so commands and paths may differ in your case.

Once that is done, you can see the contents of the Hadoop folder using the command "ls".

If you want to work with the Hadoop file system, go to the location where the extracted Hadoop is present and type:
$ cd ../../hadoop-0.20.203.0
$ bin/hadoop fs -ls

What does it say? It complains that the environment variable JAVA_HOME is not set. So let's set it, along with the other configuration files.

Go to the extracted Hadoop folder -> conf -> hadoop-env.sh (open it to edit).

This is the file where the environment variables are set. In it you will find one commented line (a line beginning with #) such as:
#export JAVA_HOME=
You need to uncomment it and give the path where Java is installed on your system.

When you give the path, you have to prefix it with /cygdrive to access the Windows drive. For example, if you have Java at C:\Java, you would give the path as:
export JAVA_HOME=/cygdrive/c/Java/jdk1.6.0_27


If there is a space in the path, you need to escape it with a backslash. For example:

export JAVA_HOME=/cygdrive/c/Program\ Files/Java/jdk1.6.0_27


All the remaining environment variables are set by default.

Now we go to the second configuration file, hdfs-site.xml, in the same location. In it, write the following in between the configuration tags:
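A sketch of the property (dfs.replication is the standard setting; we keep it at 1 for this single-node setup):

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>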



Here we set the number of replicas to keep for every file. You can set the value as you want; here we have set it to 1.

Now open the file core-site.xml and give the following properties in it:
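Something along these lines (fs.default.name is the standard property name in Hadoop 0.20; the port 4440 matches the note below, but any free port works):

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:4440</value>
</property>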




You can give any port number instead of 4440, but it should be free.


Now we have to change the mapred-site.xml file, located in the same place. In it, write the following:
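A sketch (mapred.job.tracker is the standard property; the port shown here, 4441, is just an example of a free port):

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:4441</value>
</property>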



Here we give the port number on which we want to run the job tracker. You can give any free port.


All the configuration is done. You can look at the remaining files; note that in both the "masters" and "slaves" files it is the local machine (localhost) that is acting as both.


The last thing you need to do is make the namenode available and ready. First you need to format it, because it is being used for the first time. So type the following command.

I am assuming you are currently in the Hadoop folder. If not, cd into it, e.g.:
$ cd ../../hadoop-0.20.203.0

$ bin/hadoop namenode -format

It will print some information and finally format the namenode. Now you need to run it and see. So type:
$ bin/hadoop namenode

It will start running. Don't stop it; instead, open a new Cygwin command prompt, go to the Hadoop location, and type:
$ bin/hadoop fs
It will show the options you can use with the fs command. Try some of them, like -ls.

Now, in another command prompt, run the secondary name node as:
$ bin/hadoop secondarynamenode

In the third command prompt, run job tracker as:
$ bin/hadoop jobtracker

In fourth, run the datanode as:
$ bin/hadoop datanode

And in the last one, run the task tracker as:
$ bin/hadoop tasktracker
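If you prefer not to juggle five command prompts, a shortcut (assuming the standard scripts shipped with this Hadoop release behave under your Cygwin setup) is to start all of the daemons in one go:

$ bin/start-all.sh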

While you were starting these daemons, you must have noticed output appearing in the namenode window. Try reading through those messages.

The next task is to run Hive on this system. Download Hive from here (http://mirror.olnevhost.net/pub/apache//hive/). Try using it until my next post.. :-)




Thursday, September 29, 2011

Hive... start n have fun...

Hive is a data warehousing infrastructure based on the Hadoop. Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

(Reference: the Apache Hive wiki)

Now we will start directly by performing some operations in Hive. Based on my earlier post, I hope you have all downloaded Cloudera's VM for Hadoop and become at least somewhat familiar with it. The best thing about this VM, and what I like most, is that all the configuration for Hadoop, Hive, HDFS etc. is already done, so we do not need to configure anything.

The commands I am going to write are the same ones I tried. I searched many times for small things, which I will try to highlight in this post. Most of these I could not find anywhere; I finally had to mail someone or work them out by trial and error. So I thought of sharing them.

This is the basic structure of the types supported.
The types are organized in the following hierarchy (where the parent is a supertype of all the child instances):

    Number
        DOUBLE
            FLOAT
            BIGINT
                INT
                    TINYINT
        STRING
    BOOLEAN

This type hierarchy defines how types are implicitly converted in the query language. Implicit conversion is allowed from a type to any of its ancestors: when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows one special-case conversion:

    * STRING to DOUBLE

Explicit type conversion can be done using the cast operator.
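For example (a quick sketch; t1 here is any table with an integer column id, like the one we create below):

select cast(id as double) from t1;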

 

Now, enough theory; let's try some things out:

Go to that VM and log in using either account, i.e. root/cloudera or cloudera/cloudera.

Go to the terminal, type hive, and press Enter:
$>hive


It will enter the Hive QL interface.

Then type the command
show tables;
It will show all the current tables in the database.
Sometimes it may give an error like "error in metadata......".
I actually faced that problem, and after a lot of effort I found the solution.
You should know that Hive's default embedded (Derby) metastore allows only one session at a time, so you need to close Hive wherever else it is open, i.e., check whether it is open in another terminal. If not, it might be running in another user's session, so log in with the other user's credentials and check.
You can test this yourself: try running Hive from two different terminals under the same user login. In the second one, when you type any command such as:
show tables;
it will show the error mentioned above. You will have to exit from the first Hive session to make it go away.


Now, since this is the first time we are running Hive on this box, there are no tables yet, so it doesn't show any; note that the execution time is also shown.

We can start creating tables (type all the following commands at the hive prompt):

create table t1(id int, name string) row format delimited fields terminated by '44';
Here, fields terminated by '44' means that the field values are separated by the character with ASCII value 44, i.e. the comma. By default the delimiter is Ctrl+A.
show tables;
It will show the table t1 now.
To view its structure:
describe t1;
Or, to see the full structure:
describe extended t1;

Now we can load data into the table we created.

Before that, create a local text file, say 'input.txt', with the values:
1,saurabh
2,maheshwari
3,bangalore

load data local inpath 'input.txt' into table t1; 
 
This loads the data into the table. Here 'input.txt' is the text file.
If it is not in the current directory (i.e. root --> home --> cloudera), we need to give the whole path, as in:

load data local inpath '/<path to file>/input.txt' into table t1;

Another important thing (again, I searched for this long enough and finally got it, so it's worth mentioning):

The keyword 'local' indicates that the file is on the local file system and not on the Hadoop file system. If the file from which you want to load the data is on HDFS (the Hadoop File System), you have to skip the keyword 'local'.
(You can try this by using the 'copyFromLocal' command at the hadoop prompt (after exiting hive) and then loading the data.)

You can see the contents of the table with:

select * from t1;

Now give the command again:
 
load data local inpath 'input.txt' into table t1;

Now see the contents:

select * from t1;
 
You will see that the contents are duplicated (because the load appends the rows rather than replacing them).


now type:

load data local inpath 'input.txt' overwrite into table t1;

and see the contents. The result is self-explanatory: 'overwrite' replaces the existing data instead of appending to it.


------------------------------------------
now,

load data local inpath 'input.txt' into table t1;
load data local inpath 'input.txt' into table t1;

Now run:

select distinct name from t1;

Notice that when you do a select distinct from t1, Hive runs it as a MapReduce job; it breaks the task into map and reduce phases by itself. Here it does not matter much, but when you do it on large data (which is the best use case for Hive) you will save a lot of time.
-------------------------------------------

Partitioned Tables:

I found examples of partitioned tables, but I could not work out the real meaning and use of partitions. After a long quest, I finally found a doc which says:
"Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory. Suppose data for table T is in the directory /wh/T. If T is partitioned on columns ds and ctry, then data with a particular ds value 20090101 and ctry value US will be stored in files within the directory /wh/T/ds=20090101/ctry=US."

I hope this clears up your confusion as well. Now we will try it.

First, drop the table:
drop table t1;
show tables;
create table t1 (id int, name string) comment 'This is test file' partitioned by(id) clustered by (id) sorted by (id) into 2 buckets row format delimited fields terminated by '44' stored as sequencefile;

It will give an error, because the partition column must be different from the columns already declared in the table schema.

so:

create table t1 (id int, name string) comment 'This is test file' partitioned by(dt string) clustered by (id) sorted by (id) into 2 buckets row format delimited fields terminated by '44' stored as sequencefile;

 
Now see the table details with:
describe t1;

You will see 3 columns; one of them is the partition column, which is kept separate.
Now load the data:

load data local inpath 'input.txt' overwrite into table t1 partition (dt = 'q');

It will show an error. This is also an important point to mention, and again it was one of the problems I faced. When you load the file, it says something like: Exception, wrong file format, check the file format, and so on. So what's the error?
The error really is the file format, exactly as the message says. The data we are loading is plain text, so the table's storage format should be textfile instead of sequencefile.


So, do the following:
 
drop table t1;
create table t1 (id int, name string) comment 'This is test file' partitioned by(dt string) clustered by (id) sorted by (id) into 2 buckets row format delimited fields terminated by '44' stored as textfile;
load data local inpath 'input.txt' overwrite into table t1 partition (dt = 'q');

select * from t1;
or
select * from t1 where .....;

You will see that the partition column is filled with the value you gave while loading.
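You can also see this on HDFS itself. A quick check (assuming the default warehouse location used by the Cloudera VM):

$ hadoop fs -ls /user/hive/warehouse/t1

This should list a sub-directory named dt=q holding the data for that partition.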

------------------------------------------------
 

By now you must have gained a bit of confidence in dealing with Hive; it is much like writing SQL queries.
 
Now do the following:

Create a text file input1.txt with data similar to input.txt, then run:
create table t2 (id_t2 int, name_t2 string, addr_t2 string) row format delimited fields terminated by '44' stored as textfile;
load data local inpath 'input1.txt' overwrite into table t2;

insert overwrite table t1 partition (dt='q') select id_t2, name_t2 from t2 where id_t2 < 5;


Look at the output; the first few lines are like:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
.........
.........
Then at the end a statement comes like:
* rows loaded to t1

--------------------------------------------

Now we will try some more commands:

alter table t1 rename to tb1;
---------------------------------------------
alter table t1 drop partition (dt='q');
describe t1;
select * from t1;
 
No data will be there... cool, right?

alter table t1 add partition (dt='q');
describe t1;
----------------------------------------------
alter table t1 add columns (c int comment 'a new column');
select * from t1;
 
See that the existing data remains and the new column is added with NULL values.
-----------------------------------------------------
There is no command to drop a single column. Instead you can use replace columns:

alter table t1 replace columns(c int);
See the effect of it by using:
describe t1;

Look at the output... ha ha... replace columns has replaced all the columns with just c. Copy the commands from above to re-create the table and load the data again.
-----------------------------------------------
Now suppose you want to see the table and its contents graphically... this is one more feature which I like and admire. All you have to do is open the browser in the VM and go to the address:

http://localhost:8088/

You will get a nice view of what is called the Hue shell. Along the bottom there are options. First, you can go into the File Browser and look for your table; it is usually inside /user/hive/warehouse.

Now when you look at t1 (the partitioned table) you will find one more folder inside it, named after the partition value you provided. That is how a partitioned table is stored, so now you can believe what I mentioned about partitioned tables above.



The rest of the shells are:

Flume Shell: Flume is a distributed, reliable, and available service for efficiently moving large amounts of data as the data is produced.

Pig Shell: Apache Pig enables you to analyze large amounts of data using Pig's query language called Pig Latin. Pig Latin queries run in a distributed way on a Hadoop cluster.

HBase Shell: Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS).

There is also a User Admin section on that web page, with admin tasks; you can do things like creating users, etc. These are easy and can be tried out.
---------------------------------------------------------
I hope you have all found these things easy and interesting. The real fun comes when we have to deal with more than one cluster; I will write about that soon. Till then, keep reading, discuss here if you have any modifications or issues, and share if someone has new "knowledge database" entries...

Saturday, September 24, 2011

Hadoop n Map-Reduce...

Hi. I am writing after a long break!! This time I am going to discuss Hadoop. Why Hadoop? Because the need of the market is "BIG DATA"...

I will not talk much about the theory, such as why Hadoop and what Hadoop is; I will just assume you all know the pros and cons of Hadoop. We will quickly start by installing and working with Hadoop...


Installing Hadoop on Windows using the Cloudera demo VM

  •  First, download the Hadoop version of the Cloudera VM (I used version 0.3.7, but I recently found that this version is no longer available) here.
  •  Then you will have to download a virtual machine player to run it. So, download VMware Player at this location.
  •  Now, since you have downloaded the Cloudera image, you needn't do anything to configure the environment variables or JAVA_HOME. You will need to do this if you set it up using Cygwin or anything else (that will be covered in the next part).
  •  Now your setup is ready, so just start VMware Player and point it at the extracted Cloudera Hadoop VM that you just downloaded.
Version 0.3.7 looks like this:


  •  Assuming you have internet access on the Windows machine, you don't need to configure anything for it to work in the VM as well. So, download Eclipse in the VM to run your first program.

If you already have Eclipse downloaded on your Windows box, you can get it onto the VM as follows: in the VM, go to Places --> Connect To Server --> Service Type --> Windows Share --> give the IP address --> Done.

  •  Then create a Java application project in Eclipse. Include the jars for Hadoop; they should be under /usr/lib. Include the Hadoop 0.20 jars and their lib files.

If your program takes files as arguments, add them in Eclipse under Run --> Run Configurations --> Java Application --> Arguments, and give the file names as (the files live in HDFS) hdfs://localhost/user/Cloudera/…/filename1.txt hdfs://localhost/user/Cloudera/…/filename2.txt

  •  To run it from the terminal, you first need to make a jar file of the main class. To do it, right-click on the main class --> Export --> as JAR --> browse and give a name.
  •  Then in the terminal, change your working directory to your Eclipse workspace and then to the location of the jar. Type the following command to run it:
            hadoop jar {jar file name} {main class name} {parameters}
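For example, for a word-count style job the invocation might look like this (the jar, class, and HDFS paths here are just placeholders for whatever you built and uploaded):

            hadoop jar WordCount.jar WordCount hdfs://localhost/user/cloudera/input.txt hdfs://localhost/user/cloudera/output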

 


I hope you are feeling proud after running your first MapReduce program in Cloudera's Hadoop VM.


Please write if you have any doubts. I will cover installing Hadoop and running it on Windows using Cygwin in my next post. This is important because the Cloudera VM is single-node, i.e., you can't have more than one node; for more than one node you need your own Hadoop installation. Also, since it is single-node, whatever replication factor you give in the configuration files, it will not replicate data more than once.