Big Data and Hadoop: Introduction to Hive

Hive is a data warehouse framework built on top of hadoop. Hive was developed by facebook. Due to developer support and usage of Hive, it became one of the top project of Apache foundation.

The main reason of Hive's popularity is because of its SQL like language for accessing the data. Hive is very handy for data analysts who are more comfortable with SQL than MapReduce programs written in java. Queries written in Hive are converted into MapReduce programs under the hood.

So when you are working with Hive, you would be creating a table and running queries on the table to retrieve required results. Tables are fundamental entity in Hive. The tables that are created in Hive are stored in HDFS as directories under /user/hive/warehouse by default.

Traditional databases use indexing for faster retrieval of data. Similarly, Hive has a concept of Partitioning. Hive allows user to specify the partitioning columns and based on those columns the data is partitioned and stored in various directories. Partitioning log data based on date is very useful and is stored in various directories under the table’s main directory segregated as folders based on dates. You can specify more than one partitioning column.

Another concept supported in Hive is buckets. It is used to provide a random sample of data which might be useful in some scenarios. For e.g. – aggregate of a sample can provide a good approximation of the entire data set. Buckets work on top of partitions of data created. So if user specifies 32 buckets based on login_id on log data partitioned by date, then for say date 31 Jan 2014, will have 32 buckets and so on.

Metastore

Metastore is a database that stores hive's metadata. Metastore is a service and the backing store for the data. There are 3 configurations of metastore:

Embedded metastore: By default, metastore contains an imbedded Derby database which is located on the local disk and the metastore service runs in the same JVM as the Hive service.

Local metastore: In order to support multiple sessions and users, any JDBC complaint db running in a separate process either in a local machine or a remote machine is connected to metastore service running in same JVM as the hive service.

Remote Metastore: In this configuration Metastore service is run independent of Hive service i.e. in a separate process and connects to the JDBC compliant db. Because metastore is separated from the hive service so it brings more manageability and security.

Access:

Hive is accessed by Command Line Interface (CLI), web GUI and JDBC/ODBC connection.

Hive’s Schema on Read versus Traditional database’s Schema on Write

Traditional databases need to have a schema in place for the tables before you insert data into it. This is called schema on write as data is verified when it is loaded. Hive on the other hand doesn't verify the data when it is loaded. Data is simply copied to the file store, no transformation is required. Hive tables are created when user want to analyze the data and when user reads data from the table then the schema for the table is verified against the data. This is called schema on read. There are some advantages and disadvantages in both the approaches.

Schema on Write takes more time at loading but querying of data is very fast and efficient.

Schema on Read supports saving any kind of data without any botheration about the type, length or any other constraints of the schema. New data with different set of columns can start flowing any time. So, data loading part is very quick and flexible/agile. But at the same time, querying can take more time as your schema verification against data is happening at the time of querying of the data. Also, based on kind of analysis to be performed on data user can have multiple schemas written for the same underlying data.

Big Data and Hadoop

Tuesday, 26 August 2014

Introduction to Hive

No comments:

Post a Comment