Big Data and Hadoop: October 2014

MongoDB is an open-source, document-oriented database. It is a NoSQL database.

NoSQL means 'Not Only SQL', i.e. a database that is more flexible than a SQL database: flexible in its schema, its data model, and its agility in handling new data.

10gen is the initiator of the MongoDB project.

A document-oriented database is a database where each record of data is a document. A document in MongoDB is similar to a JSON object. JSON is short for JavaScript Object Notation. In fact, the data that MongoDB stores in the database is a BSON object, which is Binary JSON.

JSON is a text format based on the way objects are written in JavaScript. JSON objects are made up of key-value pairs.
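
To make the key-value idea concrete, here is a small sketch in plain JavaScript (no MongoDB needed). The object and its values are only for illustration; JSON.stringify and JSON.parse are the standard JavaScript functions for converting between an object and its JSON text.

```javascript
// A JavaScript object made of key-value pairs (illustrative values).
const record = { name: "vijay", email: "vijay@gmail.com" };

// JSON is the text form of such an object.
const text = JSON.stringify(record);
console.log(text);

// Parsing the text gives back an object with the same keys and values.
const parsed = JSON.parse(text);
console.log(parsed.name);
```

This only shows the JSON side of things. BSON, the binary form MongoDB actually stores, is not shown here.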

A database in MongoDB is made up of collections. A collection is similar to a table in an RDBMS. Each collection contains documents. A document is similar to a record in an RDBMS. Each document can have its own unique set of fields. Documents in MongoDB are very rich and can contain embedded documents, arrays of documents, or lists of values.

MongoDB maintains data in a non-normalized way, and that is one of the reasons it doesn't support joins. Joins perform poorly when scaling out, and since MongoDB is designed for horizontal scaling, it leaves them out.

Let me give you a simple example that shows how data is maintained in MongoDB.

Let's take the example of a blog: the main page of a blog (similar to, say, the one you are reading) usually has author details, post details, comment details and tags.

If you have to maintain this kind of data in an RDBMS, you need separate tables for author details, posts, comments and tags. There will be foreign keys between the tables, e.g. the posts table will have a tag_id foreign key or an author_id foreign key, and there can be a couple more relationships between the tables. So, in order to show all the details when someone opens the home page of a blog, there need to be a couple of joins between the tables. This is how an RDBMS will store data for a blog.
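
As a rough sketch of what those joins involve, the snippet below models the tables as plain JavaScript arrays and follows the foreign keys by hand. The table layouts, the id values and the loadPost helper are all made up for illustration; a real RDBMS would do this with SQL joins.

```javascript
// Hypothetical relational layout: each "table" is an array of rows.
const authors = [{ id: 1, name: "geetanjali badhla" }];
const posts = [{ id: 10, title: "MongoDB", author_id: 1 }];
const comments = [
  { id: 100, post_id: 10, name: "vijay" },
  { id: 101, post_id: 10, name: "pooja" },
];

// Assembling the home page means following the foreign keys,
// which is the work a join does in an RDBMS.
function loadPost(postId) {
  const post = posts.find((p) => p.id === postId);
  const author = authors.find((a) => a.id === post.author_id);
  const postComments = comments.filter((c) => c.post_id === post.id);
  return {
    title: post.title,
    author: author.name,
    comments: postComments.map((c) => c.name),
  };
}

const page = loadPost(10);
console.log(page);
```

Every extra table (tags, for example) adds another lookup of this kind.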

Now, looking at the same scenario in MongoDB, we will typically have one collection, say posts. Within this collection we will have documents, lists of documents and lists of values that can represent all the data that goes into making the home page of a blog, something like this:

posts:

{
  title: "MongoDB",
  body: "....",
  author: "geetanjali badhla",
  date: "26 Oct 2014",
  comments: [{name:"vijay",email:"vijay@gmail.com",comment:"..."},{name:"pooja",....},...],
  tags: ["MongoDB","NoSql","Document-Oriented"]
}

So, if you look closely, the comments field is a list of embedded documents and not a separate table, as it probably would be in an RDBMS.

Also, tags is a list of values inside the same document.

This kind of structure is very handy and efficient because all the data that is usually required together is stored together on disk. There is no need for multiple joins in this kind of scenario, and because of this, scaling the database is easier and more efficient.
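
Here is a minimal sketch of reading such a document, again in plain JavaScript rather than against a real MongoDB server. The values that were elided in the example are replaced with placeholder strings (or left out) so that the snippet runs; commenterNames and tagLine are names I made up.

```javascript
// The post document from the example; elided values are placeholders.
const post = {
  title: "MongoDB",
  body: "(placeholder body)",
  author: "geetanjali badhla",
  date: "26 Oct 2014",
  comments: [
    { name: "vijay", email: "vijay@gmail.com", comment: "(placeholder)" },
    { name: "pooja", comment: "(placeholder)" },
  ],
  tags: ["MongoDB", "NoSql", "Document-Oriented"],
};

// Everything the page needs is reachable from this one document,
// so no lookups into other tables are required.
const commenterNames = post.comments.map((c) => c.name);
const tagLine = post.tags.join(", ");

console.log(commenterNames);
console.log(tagLine);
```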

This represents one document of the 'posts' collection. The posts collection can have multiple documents, which can have different fields as well. So, you can have something like the below in the 'posts' collection as well:

posts:

{
  title: "NoSql",
  body: "....",
  author: "geetanjali badhla",
  date: "26 Oct 2014",
  tags: ["MongoDB","NoSql","Document-Oriented"]
}

The above example shows that one document in the 'posts' collection has no comments field, while another document in the same collection can have one. This is how MongoDB dynamically maintains data.
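
To see what that means in practice, the sketch below models the collection as a plain JavaScript array holding two documents with different fields. It is only an illustration, not MongoDB's API; postsCollection, withComments and fieldSets are hypothetical names, and the documents are trimmed to a few fields.

```javascript
// A collection modeled as a plain array; the two documents
// deliberately have different sets of fields.
const postsCollection = [
  {
    title: "MongoDB",
    author: "geetanjali badhla",
    comments: [{ name: "vijay" }, { name: "pooja" }],
    tags: ["MongoDB", "NoSql", "Document-Oriented"],
  },
  {
    title: "NoSql",
    author: "geetanjali badhla",
    tags: ["MongoDB", "NoSql", "Document-Oriented"],
  },
];

// Titles of the documents that actually carry a comments field.
const withComments = postsCollection
  .filter((doc) => "comments" in doc)
  .map((doc) => doc.title);

// The field names of each document, to show that they differ.
const fieldSets = postsCollection.map((doc) => Object.keys(doc));

console.log(withComments);
console.log(fieldSets);
```

Code that reads from such a collection has to be ready for a field to be missing, which is the price of this flexibility.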

I will continue writing more, probably about MongoDB or the general topic of NoSQL. Stay tuned!

Sunday, 26 October 2014

Introduction to MongoDB