Hive is a data warehouse framework built on top of Hadoop. It was originally developed by Facebook. Thanks to strong developer support and wide usage, it became one of the top-level projects of the Apache Software Foundation.
The main reason for Hive's popularity is its SQL-like language for accessing data. Hive is very handy for data analysts who are more comfortable with SQL than with MapReduce programs written in Java. Queries written in Hive are converted into MapReduce programs under the hood.
When you work with Hive, you create a table and run queries against it to retrieve the results you need. Tables are the fundamental entity in Hive. By default, the tables created in Hive are stored in HDFS as directories under /user/hive/warehouse.
Traditional databases use indexing for faster retrieval of data. Hive has a comparable concept called partitioning. Hive lets the user specify partitioning columns; the data is split by the values of those columns and stored in separate directories, so a query that filters on a partitioning column only has to read the matching directories. Partitioning log data by date is a common and useful case: each date's data is stored in its own folder under the table's main directory. You can specify more than one partitioning column.
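
The directory layout described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not Hive code: the table name `logs` and the partitioning columns `dt` and `country` are made up for the example, and it assumes the default warehouse directory and Hive's usual `column=value` folder names.

```python
from datetime import date
from posixpath import join

WAREHOUSE_ROOT = "/user/hive/warehouse"  # Hive's default warehouse directory


def partition_path(table, partition_values):
    """Build the HDFS directory for one partition of a table.

    partition_values is an ordered list of (column, value) pairs; each pair
    becomes one nested `column=value` directory level under the table.
    """
    parts = [f"{column}={value}" for column, value in partition_values]
    return join(WAREHOUSE_ROOT, table, *parts)


# Hypothetical table `logs`, partitioned by date and then by country.
print(partition_path("logs", [("dt", date(2014, 1, 31)), ("country", "US")]))
```

With two partitioning columns the second column simply adds one more directory level below the first.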
Another concept supported in Hive is buckets. Bucketing splits the data into a fixed number of buckets by hashing a column, which makes it easy to take a sample of the data. This is useful in some scenarios; for example, an aggregate over a sample can provide a good approximation for the entire data set. Buckets work on top of partitions. If a user specifies 32 buckets based on login_id on log data partitioned by date, then each date partition, say 31 Jan 2014, will have 32 buckets.
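
To make the sampling idea concrete, here is a small Python sketch. It is a simplification under stated assumptions: the rows are invented, login_id is assumed to be an integer, and a plain modulo stands in for whatever hash Hive applies to the bucketing column.

```python
NUM_BUCKETS = 32


def bucket_for(login_id, num_buckets=NUM_BUCKETS):
    # Stand-in for Hive's hashing: a plain modulo on an integer id.
    return login_id % num_buckets


def mean(values):
    return sum(values) / len(values)


# Invented log rows for a single date partition: (login_id, bytes_sent).
rows = [(login_id, 100 + (login_id % 7) * 10) for login_id in range(1, 3201)]

# Distribute the rows of this partition into its 32 buckets.
buckets = {}
for login_id, bytes_sent in rows:
    buckets.setdefault(bucket_for(login_id), []).append(bytes_sent)

full_mean = mean([bytes_sent for _, bytes_sent in rows])
sample_mean = mean(buckets[0])  # aggregate over just one of the 32 buckets

print(f"all rows: {full_mean:.2f}, bucket 0 only: {sample_mean:.2f}")
```

Reading one bucket touches roughly 1/32 of the partition's rows, yet the two averages come out close, which is the approximation the paragraph above refers to.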
Metastore
The metastore holds Hive's metadata. It consists of a service and the backing database that stores the metadata. There are three metastore configurations:
Embedded metastore: By default, the metastore uses an embedded Derby database located on the local disk, and the metastore service runs in the same JVM as the Hive service.
Local metastore: To support multiple sessions and users, the metastore service still runs in the same JVM as the Hive service but connects to a JDBC-compliant database running in a separate process, either on the local machine or on a remote one.
Remote metastore: In this configuration the metastore service runs independently of the Hive service, i.e. in a separate process, and connects to the JDBC-compliant database. Separating the metastore from the Hive service improves manageability and security.
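
As a rough way to tell the three configurations apart, the sketch below classifies a set of configuration properties. It assumes the commonly used property names `hive.metastore.uris` and `javax.jdo.option.ConnectionURL`; the host names and the helper function itself are made up for illustration, and a real deployment may set further properties.

```python
# Connection URL Hive uses for the embedded Derby database by default.
DEFAULT_DERBY_URL = "jdbc:derby:;databaseName=metastore_db;create=true"


def metastore_mode(props):
    """Classify a property dict as 'embedded', 'local', or 'remote'."""
    # A metastore URI means Hive talks to a separate metastore service.
    if props.get("hive.metastore.uris"):
        return "remote"
    url = props.get("javax.jdo.option.ConnectionURL", DEFAULT_DERBY_URL)
    # An in-process Derby URL has no "//host" part after "jdbc:derby:".
    if url.startswith("jdbc:derby:") and not url.startswith("jdbc:derby://"):
        return "embedded"
    # Any other JDBC database, reached from a metastore in Hive's own JVM.
    return "local"


# Hypothetical hosts, for illustration only.
print(metastore_mode({}))
print(metastore_mode({"javax.jdo.option.ConnectionURL": "jdbc:mysql://dbhost/metastore"}))
print(metastore_mode({"hive.metastore.uris": "thrift://metastorehost:9083"}))
```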
Access:
Hive can be accessed through the Command Line Interface (CLI), a web GUI, and JDBC/ODBC connections.
Hive’s Schema on Read versus Traditional Databases’ Schema on Write
Traditional databases need a schema in place for a table before you insert data into it. This is called schema on write, because the data is verified against the schema when it is loaded. Hive, on the other hand, doesn't verify the data when it is loaded. Data is simply copied into the file store; no transformation is required. Hive tables are created when a user wants to analyze the data, and the table's schema is checked against the data only when the user reads from the table. This is called schema on read. Both approaches have advantages and disadvantages.
Schema on write takes more time at load, but queries are fast and efficient because the data has already been validated against the schema.
Schema on read lets you store any kind of data without worrying about the type, length, or other constraints of the schema. New data with a different set of columns can start flowing in at any time, so loading is quick and flexible. In return, queries can take longer, because the schema is checked against the data at query time. Also, depending on the kind of analysis to be performed, a user can define multiple schemas over the same underlying data.
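
The difference between the two approaches can be illustrated with a toy Python sketch. Nothing here is Hive code: the comma-separated lines, the column names, and both functions are invented for the example, and the read path only loosely mirrors how a mismatched field surfaces as a missing value at query time.

```python
RAW_LINES = [
    "1,alice,512",
    "2,bob,not-a-number",  # third field does not match an int column
    "3,carol",             # third field is missing
]


def cast(text, type_):
    """Convert text to type_, or return None if it does not fit."""
    try:
        return type_(text)
    except ValueError:
        return None


def load_schema_on_write(lines, schema):
    """Validate every row at load time; reject the load on any mismatch."""
    table = []
    for line in lines:
        fields = line.split(",")
        if len(fields) != len(schema):
            raise ValueError(f"rejected at load: {line!r}")
        row = {}
        for (name, type_), text in zip(schema, fields):
            value = cast(text, type_)
            if value is None:
                raise ValueError(f"rejected at load: {line!r}")
            row[name] = value
        table.append(row)
    return table


def read_schema_on_read(lines, schema):
    """Apply the schema only when reading; bad or missing fields become None."""
    for line in lines:
        fields = line.split(",")
        fields += [None] * (len(schema) - len(fields))  # pad missing columns
        yield {
            name: None if text is None else cast(text, type_)
            for (name, type_), text in zip(schema, fields)
        }


# Two different schemas over the same stored lines.
USAGE_SCHEMA = [("login_id", int), ("user", str), ("bytes", int)]
USER_SCHEMA = [("login_id", str), ("user", str)]

for row in read_schema_on_read(RAW_LINES, USAGE_SCHEMA):
    print(row)
for row in read_schema_on_read(RAW_LINES, USER_SCHEMA):
    print(row)
```

Calling `load_schema_on_write(RAW_LINES, USAGE_SCHEMA)` raises on the second line, whereas the read path accepts all three lines as stored and pays the parsing cost on every query.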