Thursday, September 29, 2011

Hive... start n have fun...

Hive is a data warehousing infrastructure built on top of Hadoop. Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Ref: the Apache Hive wiki

Now we will start directly by performing some operations on Hive. Based on my earlier post, I hope you all have downloaded Cloudera's VM for Hadoop and have got at least somewhat familiar with it. You know, the best thing about this VM, which I really like, is that all the configuration for Hadoop, Hive, HDFS etc. is already done. So we do not need to configure anything.

The commands I am going to write here are the same ones I tried. I had to search many times for small things, which I will try to highlight in this post. Most of these I did not find anywhere, and finally I had to mail someone or get the answer by a lot of hit-and-trial. So I thought of sharing them.

These are the basic data types supported:
The types are organized in the following hierarchy (where the parent is a super type of all the children instances):

    Number
        DOUBLE
            BIGINT
                INT
                    TINYINT
            FLOAT
                INT
                    TINYINT
    STRING
    BOOLEAN

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed from a child type to an ancestor type. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows one special case of conversion:

    * STRING to DOUBLE

Explicit type conversion can be done using the cast operator.
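For example, a minimal sketch (the table emp and its columns are made up just for illustration; assume id is an INT column and salary is a DOUBLE column):

select * from emp where id > salary;
here id is implicitly converted to DOUBLE before the comparison, because DOUBLE is an ancestor of INT.

select cast(id as string) from emp;
here id is explicitly converted to STRING using the cast operator.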

 

Now, enough of theory; we will do and see some of these things:

Go to that VM and log in using either of the accounts, i.e., root/cloudera or cloudera/cloudera.

Go to the terminal, type hive and press Enter:
$>hive


It will enter the Hive QL interface.

Then type the command
show tables;
It will show all the current tables in the database.
Sometimes it may give an error like "error in metadata......"
Actually I faced that problem, and after a lot of effort I found the solution.
For this you should know that only one Hive session can be open at a time with this setup (the default embedded metastore allows only a single connection), so you need to close it wherever else it is open, i.e., check whether it is open in another terminal. If it is not, it might be running in another user's session, so log in with the other user's credentials and check.
You can test this yourself as well. Try running hive from 2 different terminals in the same user login. In the second one, when you type any command like:
show tables;
it will show the above mentioned error. You will have to exit from the first hive interface for it to work.


Now, since this is the first time we are running Hive on this box, it does not have any tables, so nothing is listed; note that the execution time is also shown.

We can start creating tables (type all the following commands at the hive prompt):

create table t1(id int, name string) row format delimited fields terminated by '44';
Here fields terminated by '44' means that the field values are separated by the character with ASCII value 44, i.e., a comma. The default field delimiter is Ctrl+A.
show tables;
It will show the table t1 now.
To view its structure:
describe t1;
Or to see the full structure:
describe extended t1;

Now we can load data into the table we created.

Before that, create one local text file, say 'input.txt', with the values:
1,saurabh
2,maheshwari
3,bangalore

load data local inpath 'input.txt' into table t1; 
 
This loads the data into the table. Here 'input.txt' is the text file.
If it is not at the current path, i.e., root --> home --> cloudera, we need to give the whole path, as in:
 
load data local inpath '/<full path to the file>/input.txt' into table t1;

Another important thing (again, I searched for this long enough and finally got it, so it is worth mentioning):
 
The keyword 'local' indicates that the file is on the local file system and not on the Hadoop file system. If the file from which you want to load the data is on HDFS (the Hadoop File System), then you have to skip the keyword 'local'.
(You can try this by using the command 'copyFromLocal' in the hadoop interface (after exiting hive) and then loading the data.)
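A rough sketch of what that looks like (the HDFS path /user/cloudera below is only an example; use whatever path you copy the file to):

From the normal terminal (outside hive):
hadoop fs -copyFromLocal input.txt /user/cloudera/input.txt

Then inside hive, note there is no 'local' keyword:
load data inpath '/user/cloudera/input.txt' into table t1;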

You can see the contents of the table as:

select * from t1;

Again, give the same command:
 
load data local inpath 'input.txt' into table t1;

Now see the contents:

select * from t1;
 
You will see that the contents are duplicated (because load ... into appends the new data instead of replacing the old).


now type:

load data local inpath 'input.txt' overwrite into table t1;

and see the contents. It is self explanatory (overwrite replaces the existing data instead of appending).


------------------------------------------
now,

load data local inpath 'input.txt' into table t1;
load data local inpath 'input.txt' into table t1;

Now see:

select distinct name from t1;

See, when you do select distinct from t1, it runs as MapReduce. Hive breaks the task into map and reduce stages by itself. Here it is not so important, but when you do it with large data (which is the best use case for Hive) you will save a lot of time.
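Other aggregating queries behave the same way; for example (just an illustration on our small table):

select name, count(1) from t1 group by name;

This also gets compiled into a MapReduce job by Hive.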
-------------------------------------------

Partitioned Tables:

I found examples about partitioned tables, but I was not getting the real meaning and use of partitions. After a long quest, I finally found a doc which says:
Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory. Suppose data for table T is in the directory /wh/T. If T is partitioned on columns ds and ctry, then data with a particular ds value 20090101 and ctry value US, will be stored in files within the directory /wh/T/ds=20090101/ctry=US

I hope this clears your confusion too. Now we will try it.

First drop the table:
drop table t1;
show tables;
create table t1 (id int, name string) comment 'This is test file' partitioned by(id) clustered by (id) sorted by (id) into 2 buckets row format delimited fields terminated by '44' stored as sequencefile;

It will give an error, because the partition column needs to be a different column, i.e., one other than those provided explicitly (and it needs its own type).

so:

create table t1 (id int, name string) comment 'This is test file' partitioned by(dt string) clustered by (id) sorted by (id) into 2 buckets row format delimited fields terminated by '44' stored as sequencefile;

 
Now see the table details with:
describe t1;

You will see 3 columns; one of them is the partition column, which is kept separate.
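It should look roughly like this (the exact formatting may differ a bit across Hive versions):

id        int
name      string
dt        string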
now load the data:

load data local inpath 'input.txt' overwrite into table t1 partition (dt = 'q');

It will show an error. This is also an important point to mention (and needless to say, it was one of the problems I faced). When you load the file, it will throw an exception saying wrong file format, check the file format, and so on. So what is the error?
The error really is the file format, yes, the thing mentioned in the error details. The data we are loading is text data, so the file format should be textfile instead of sequencefile.


So, do the following:
 
drop table t1;
create table t1 (id int, name string) comment 'This is test file' partitioned by(dt string) clustered by (id) sorted by (id) into 2 buckets row format delimited fields terminated by '44' stored as textfile;
load data local inpath 'input.txt' overwrite into table t1 partition (dt = 'q');

select * from t1;
or
select * from t1 where .....;

You will see that the partition column gets filled with the value which you gave while loading.
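For example, to read back just the partition we loaded:

select * from t1 where dt = 'q';

Since dt is a partition column, Hive only needs to read the files under that partition's directory.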

------------------------------------------------
 

You must have got a bit of confidence about dealing with Hive by now; it is much like writing SQL queries.
 
Now do the following:
 
Create a text file input1.txt with data similar to input.txt (add a third field too, since t2 below has an address column).
create table t2 (id_t2 int, name_t2 string, addr_t2 string) row format delimited fields terminated by '44' stored as textfile;
load data local inpath 'input1.txt' overwrite into table t2;

insert overwrite table t1 partition (dt='q') select id_t2, name_t2 from t2 where id_t2 < 5;


See the output; the first few lines are:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
.........
.........
Then at the end a statement comes like:
* rows loaded to t1

--------------------------------------------

Now we will try some more commands:

alter table t1 rename to tb1;
---------------------------------------------
alter table t1 drop partition (dt='q');
describe t1;
select * from t1;
 
No data will be there... nice, isn't it?

alter table t1 add partition (dt='q');
describe t1;
----------------------------------------------
alter table t1 add columns (c int comment 'a new column');
select * from t1;
 
See, the data remains and the new column is added with NULL values.
-----------------------------------------------------
There is no command to drop a single column. You can use the following instead:
 
alter table t1 replace columns(c int);
See the effect of it by using:
describe t1;

See the output...
Ha Ha... replace columns replaces the whole column list, so only the column c is left now.
Copy the commands from above to create the table and load the data again.
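That is, something like (these are just the same commands from earlier in this post):

drop table t1;
create table t1 (id int, name string) comment 'This is test file' partitioned by(dt string) clustered by (id) sorted by (id) into 2 buckets row format delimited fields terminated by '44' stored as textfile;
load data local inpath 'input.txt' overwrite into table t1 partition (dt = 'q');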
-----------------------------------------------
Now, if you want to see the table and its contents graphically... this is one more feature which I like and admire. All you have to do is go into your browser in the VM and type the address:

http://localhost:8088/

You will get a nice view of what is called Hue. There are options down there. First you can go into the File Browser and look for your table. It is mostly inside /user/hive/warehouse.

Now when you look inside t1 (the partitioned table) you will find one more folder inside it, whose name is the same as the partition value you provided. So that is how the partitioned table gets stored... Now you can believe what I mentioned about partitioned tables above.
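You can see the same thing from the terminal as well (a small sketch; /user/hive/warehouse is the default warehouse directory and dt=q is the partition we created above):

hadoop fs -ls /user/hive/warehouse/t1

You should see a sub-directory like /user/hive/warehouse/t1/dt=q containing the data files of that partition.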



The rest of the shells are:

Flume Shell : Flume is a distributed, reliable, and available service for efficiently moving large amounts of data as the data is produced.

Pig Shell: Apache Pig enables you to analyze large amounts of data using Pig's query language called Pig Latin. Pig Latin queries run in a distributed way on a Hadoop cluster.

HBase Shell: Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS).

There is also a User Admin on that web page. You have admin tasks over there; you can do things like creating users, etc. These are easy and can be tried out...
---------------------------------------------------------
Hope you all found these things easy and interesting. The real things come when we have to deal with more than one cluster; I will write about that soon. Till then keep reading, and discuss here if there are any modifications or issues, or share if someone has new "knowledge database" entries...

Saturday, September 24, 2011

Hadoop n Map-Reduce...

Hi. I am writing after a long break!! This time I am going to discuss Hadoop. Why Hadoop??? Because the need of the market is "BIG DATA"...

I will not talk much about the theory part, i.e., why Hadoop and what Hadoop is, etc. I just assume that you all know the pros and cons of Hadoop. We will quickly start by installing and working with Hadoop...


Installing Hadoop on Windows with Cloudera's Demo VM

  •  First, download the Hadoop version of the Cloudera VM (I used version 0.3.7, but recently I found that this version is no longer available) here.
  •  Then you will have to download a virtual machine player to run it. So, download VMWare Player at this location.
  •  Now, since you have downloaded the Cloudera VM, you needn't do anything to configure the environment variables or JAVA_HOME. You will need to do that if you set things up using Cygwin or anything else (that will be covered in the next part).
  •  Now your setup is ready, so just start VMWare Player and point it to the extracted file of the Cloudera Hadoop VM which you downloaded just now.
The version 0.3.7 VM looks like this (screenshot not shown here):

  •  Assuming that you have internet on the Windows machine, you don't need to configure anything for it to work on the VM as well. So, download Eclipse on the VM to run your first program.

If you already have Eclipse downloaded on your Windows box, you can get it onto your VM as follows: in the VM, go to Places --> Connect To Server --> Service Type --> Windows Share --> give the IP address --> Done.

  •  Then make a Java application project in Eclipse. Include the jars for Hadoop; they should be in the folder /usr/lib. Include the Hadoop 0.20 jars and its lib files.

If your program needs any files as arguments, include them as follows: in Eclipse go to Run --> Run Configurations --> Java Application --> Arguments and give the file names (the files are in the HDFS system) as hdfs://localhost/user/Cloudera/…/filename1.txt hdfs://localhost/user/Cloudera/…/filename2.txt

  •  To run it from the terminal, you will first need to make a jar of the main class. To do that, right-click on the main class --> Export --> JAR file --> browse and give a name.
  •  Then on the terminal, first change your working directory to your Eclipse workspace and then to the location of the jar. Type the following command to run it on the terminal:
            hadoop jar {jar file name} {main class name} {parameters}
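For example (only a sketch; the jar name, class name and HDFS paths below are made up, use your own):

            hadoop jar wordcount.jar WordCount hdfs://localhost/user/cloudera/input.txt hdfs://localhost/user/cloudera/output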

 


I hope you must be feeling proud now after running your first Map Reduce program in Cloudera's version of the Hadoop VM.


Please write if you have any doubts. I will cover installing Hadoop and running it on Windows using Cygwin in my next post. This is important because this Cloudera version is single-node based, i.e., you can't have more than one node. For having more than one node, you need to install Hadoop yourself. Also, since it is single-node based, whatever replication factor you give in the configuration files (the dfs.replication property), it will not bother with it and will not replicate the data more than once.