Tuesday, 26 August 2014

Introduction to Hive

Hive is a data warehouse framework built on top of Hadoop. It was originally developed at Facebook and, thanks to strong developer support and widespread usage, became one of the top-level projects of the Apache Software Foundation.
The main reason for Hive's popularity is its SQL-like language for accessing data. Hive is very handy for data analysts who are more comfortable with SQL than with MapReduce programs written in Java. Queries written in Hive are converted into MapReduce programs under the hood.
So when you work with Hive, you create a table and run queries against it to retrieve the required results. Tables are the fundamental entity in Hive. The tables created in Hive are stored in HDFS as directories, under /user/hive/warehouse by default.
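For example, a simple table can be created and queried with a couple of HiveQL statements (the table and column names here are made up for illustration):

CREATE TABLE logs (login_id STRING, action STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT * FROM logs LIMIT 10;

After the CREATE TABLE, a directory /user/hive/warehouse/logs appears in HDFS to hold the table's data.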
Traditional databases use indexing for faster retrieval of data. Along similar lines, Hive has a concept of partitioning. Hive allows the user to specify partitioning columns, and based on those columns the data is split up and stored in separate directories. Partitioning log data by date is very useful: the data is stored under the table's main directory, segregated into folders by date. You can specify more than one partitioning column.
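As a sketch, the hypothetical log table above could be declared with a date partition (the dt column name and the load path are assumptions):

CREATE TABLE logs_by_date (login_id STRING, action STRING)
PARTITIONED BY (dt STRING);

LOAD DATA LOCAL INPATH '/tmp/log-2014-01-31.txt'
INTO TABLE logs_by_date PARTITION (dt='2014-01-31');

The loaded data lands in /user/hive/warehouse/logs_by_date/dt=2014-01-31 rather than directly under the table's directory.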
Another concept supported in Hive is buckets. It is used to provide a random sample of data which might be useful in some scenarios. For e.g. – aggregate of a sample can provide a good approximation of the entire data set. Buckets work on top of partitions of data created. So if user specifies 32 buckets based on login_id on log data partitioned by date, then for say date 31 Jan 2014, will have 32 buckets and so on.
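A sketch of that 32-bucket layout in HiveQL, again with illustrative names:

CREATE TABLE logs_bucketed (login_id STRING, action STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (login_id) INTO 32 BUCKETS;

-- sample roughly 1/32 of the data by reading a single bucket
SELECT * FROM logs_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON login_id);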

Metastore
The metastore is the database that stores Hive's metadata. It consists of the metastore service and the backing store for the data. There are three configurations of the metastore:
Embedded metastore: By default, the metastore uses an embedded Derby database located on the local disk, and the metastore service runs in the same JVM as the Hive service.
Local metastore: To support multiple sessions and users, any JDBC-compliant database running in a separate process, on either the local machine or a remote machine, is connected to the metastore service, which still runs in the same JVM as the Hive service.
Remote metastore: In this configuration the metastore service runs independently of the Hive service, i.e. in a separate process, and connects to the JDBC-compliant database. Because the metastore is separated from the Hive service, this arrangement brings better manageability and security.
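A minimal sketch of the hive-site.xml property that points Hive at a remote metastore service (the host name here is a placeholder):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>

The metastore service itself can be started separately with hive --service metastore.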

Access:
Hive can be accessed through a Command Line Interface (CLI), a web GUI, or a JDBC/ODBC connection.
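For example, a query can be run from the shell through the CLI without even entering the interactive prompt (the query and table name are illustrative):

hive -e 'SELECT * FROM logs LIMIT 10;'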

Hive’s Schema on Read versus Traditional database’s Schema on Write
Traditional databases need to have a schema in place for a table before you insert data into it. This is called schema on write, because the data is verified against the schema as it is loaded. Hive, on the other hand, doesn't verify the data when it is loaded: the data is simply copied into the file store, and no transformation is required. A Hive table is created when the user wants to analyse the data, and the schema is verified against the data only when the user reads from the table. This is called schema on read. Both approaches have their advantages and disadvantages.
Schema on Write takes more time at loading, but querying the data is very fast and efficient.
Schema on Read lets you save any kind of data without worrying about the type, length or any other constraints of a schema. New data with a different set of columns can start flowing in at any time, so the data-loading part is very quick and flexible/agile. At the same time, querying can take longer, because the schema is verified against the data at query time. Also, depending on the kind of analysis to be performed, the user can define multiple schemas over the same underlying data.
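As a sketch of this flexibility, two EXTERNAL tables can project different schemas onto the same files already sitting in HDFS (the location and columns are assumptions):

CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/logs';

CREATE EXTERNAL TABLE parsed_logs (login_id STRING, action STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

Neither statement moves or verifies the data; each schema is applied only when the corresponding table is queried.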

Sunday, 3 August 2014

HDFS File Commands

Hadoop Distributed File System (HDFS) is the distributed storage component of the Hadoop framework. To interact with HDFS, the user must know the Hadoop file shell commands.
Hadoop commands take the form of
hadoop fs -command <args>

Whenever you write a Hadoop file command, you have to prefix it with 'hadoop fs', followed by a hyphen and the command name. This is required so that the system recognizes that the user is accessing files in HDFS and not in the local file system (the underlying operating system's).

Let us see some commonly used HDFS commands:

·     copyFromLocal OR put: This command is used to copy/put files from the local (underlying operating system's) file system to HDFS.
Syntax: hadoop fs -copyFromLocal (or -put) <path of file at local file system: source> <HDFS path: target>
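Example (both paths are hypothetical):
hadoop fs -put /home/user/sales.log /user/hadoop/sales.log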

·     copyToLocal OR get: This command is used to copy files from HDFS to the local (underlying operating system's) file system.
Syntax: hadoop fs -copyToLocal (or -get) [-ignorecrc] [-crc] <HDFS path: source> <path of file at local file system: target>

The parameters [-ignorecrc] and [-crc] are important when you are copying files out of HDFS. HDFS creates a checksum value for each block of every file. When a user gets a file from HDFS to the local file system, the checksum is verified by default; -ignorecrc skips that verification, and -crc additionally copies the checksum file itself to the local file system.
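Example of getting a file with and without checksum verification (paths are hypothetical):
hadoop fs -get /user/hadoop/sales.log /home/user/sales.log
hadoop fs -get -ignorecrc /user/hadoop/sales.log /home/user/sales.log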

·     ls: This command lists the files/directories in HDFS. For each entry it shows the name, permissions, owner, group, replication factor and modification date.
Syntax: hadoop fs -ls <path>
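Example, listing the default warehouse directory mentioned earlier:
hadoop fs -ls /user/hive/warehouse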

·     lsr: This command lists all files and directories recursively. Each entry shows the same information as ls.
Syntax: hadoop fs -lsr <path>
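Example (same path as above):
hadoop fs -lsr /user/hive/warehouse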

·     moveFromLocal: This command is used to move file(s) from the local file system to HDFS. The source file is deleted from the local file system after it has been successfully copied to HDFS.
Syntax: hadoop fs -moveFromLocal <local file path: source> <HDFS path: target>
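Example (paths are hypothetical):
hadoop fs -moveFromLocal /home/user/sales.log /user/hadoop/sales.log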

·     rm: This command is used to delete files or directories.
Syntax: hadoop fs -rm <path>
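Example (path is hypothetical):
hadoop fs -rm /user/hadoop/sales.log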

·     rmr: This command works the same as rm but recursively, deleting a directory together with all its contents.
Syntax: hadoop fs -rmr <path>
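Example, deleting a hypothetical directory and everything in it:
hadoop fs -rmr /user/hadoop/old_logs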