Apache HBase Overview

Apache HBase
  • HBase is a column-oriented NoSQL database built on top of HDFS.
  • NoSQL stands for "Not Only SQL".
  • Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data.
  • HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS).
  • HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. 
  • It is well suited for real-time data processing. 
  • It is well suited for random read/write access to large volumes of data.
  • Unlike relational database systems, HBase does not support a structured query language like SQL.
  • HBase applications are written in Java™, much like a typical Apache MapReduce application. HBase also supports writing applications through Apache Avro, REST, and Thrift.
  • Each table must have an element defined as a primary key (the row key in HBase), and all access to HBase tables must go through this key.
  • HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into HBase, but if you’re running a production cluster, it’s suggested that you have a dedicated ZooKeeper cluster that’s integrated with your HBase cluster.
  • Avro supports a rich set of primitive data types, including numeric, binary data, and strings, and a number of complex types, including arrays, maps, enumerations, and records.
  • Apache HBase was first released in February 2007. Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Apache HBase Data Model

Relational databases are row-oriented, while HBase is column-oriented. Row-oriented databases store table records as a sequence of rows, whereas column-oriented databases store table records as a sequence of columns.


The layout of the HBase data model eases data partitioning and distribution across the cluster. The data model consists of several logical components: column identifier, row, column family, table, and row key.

Column Identifier - Each column's name is known as its column identifier. (column name in RDBMS)

Row - Data is stored in rows. (row in RDBMS)

Column family - Various columns are combined in a column family. The columns of a family are stored together, which makes searching faster. (Table in RDBMS)

Table - Data is stored in table format in HBase, but the tables are column-oriented. Each table consists of column families and rows. There is no concept of a database in HBase; the table takes that role. (DB or Schema in RDBMS)

RowKey - Every entry in an HBase table is identified and indexed by a RowKey. It is generally used to look up records, which makes searches fast. The row key acts as the primary key in HBase. (Rowid or PK in RDBMS)
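
The way these components nest can be sketched with plain Python dictionaries. This is only an in-memory illustration, not the HBase client API; the column family name `info` and the column identifiers `name` and `qty` are made up for the example.

```python
# Toy model of the logical components (hypothetical names, not the HBase API).
# table -> row key -> column family -> column identifier -> value
table = {
    "1001": {                                  # row key
        "info": {                              # column family (assumed name)
            "name": "Laundry detergent",       # column identifier -> value
            "qty": "12",
        },
    },
    "1007": {
        "info": {"name": "Toothpaste", "qty": "3"},
    },
}


def get_cell(table, row_key, family, column):
    """Return the value at (row key, column family, column), or None if absent."""
    return table.get(row_key, {}).get(family, {}).get(column)


print(get_cell(table, "1001", "info", "name"))  # Laundry detergent
print(get_cell(table, "9999", "info", "name"))  # None
```

Every lookup starts from the row key, which is why the row key plays the role of the primary key.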

Let us take an example and consider the table below.


If this table is stored in a row-oriented database, the records are stored as shown below.

1001, Laundry detergent, 12
1007, Toothpaste, 3
1010, Chlorine bleach, 4
1024, Toothpaste, 3

In row-oriented databases, data is stored row by row, as you can see above. Column-oriented databases store the same data as shown below.

1001,1007,1010,1024
Laundry detergent,Toothpaste,Chlorine bleach,Toothpaste
12,3,4,3


In a column-oriented database, the values of each column are stored together: the first column's values are stored together, then the second column's values, and the remaining columns follow in the same manner.
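
The two layouts can be reproduced with a short Python sketch. It is a conceptual illustration only and does not reflect how any particular database lays out bytes on disk; the function names are made up for the example.

```python
# The four records from the example table, one tuple per row.
rows = [
    (1001, "Laundry detergent", 12),
    (1007, "Toothpaste", 3),
    (1010, "Chlorine bleach", 4),
    (1024, "Toothpaste", 3),
]


def to_row_layout(rows):
    """Row-oriented: each stored line holds all values of one record."""
    return [", ".join(str(value) for value in row) for row in rows]


def to_column_layout(rows):
    """Column-oriented: each stored line holds all values of one column."""
    # zip(*rows) transposes the list of rows into a sequence of columns.
    return [",".join(str(value) for value in column) for column in zip(*rows)]


for line in to_row_layout(rows):
    print(line)
for line in to_column_layout(rows):
    print(line)
```

The only difference between the two functions is the transposition step, which is the whole idea behind column-oriented storage.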

Refer to the images below for a better understanding of RDBMS vs HBase data storage.

Example 1,


Example 2,


Example 3,


Storage Mechanism in HBase
  • HBase is a column-oriented database and the tables in it are sorted by row. 
  • The table schema defines only column families, which hold key-value pairs.
  • A table can have multiple column families, and each column family can have any number of columns.
  • Subsequent column values are stored contiguously on disk.
  • Each cell value of the table has a timestamp. 
  • In short, in HBase:
    • Table is a collection of rows. 
    • Row is a collection of column families. 
    • Column family is a collection of columns. 
    • Column is a collection of key-value pairs.
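
The points above can be put together in a small Python sketch: rows kept sorted by row key, a schema that fixes only the column families, and a timestamp on every cell value. `TinyTable` and its methods are hypothetical names for this illustration and are not part of HBase.

```python
import bisect


class TinyTable:
    """Toy model: sorted rows -> column families -> columns -> timestamped values."""

    def __init__(self, families):
        self.families = set(families)  # the schema defines only column families
        self.row_keys = []             # kept sorted by row key
        self.rows = {}                 # row key -> family -> column -> {timestamp: value}

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self.rows:
            bisect.insort(self.row_keys, row_key)  # insert while keeping order
            self.rows[row_key] = {}
        versions = self.rows[row_key].setdefault(family, {}).setdefault(column, {})
        versions[timestamp] = value    # each cell value carries a timestamp

    def get(self, row_key, family, column):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_key, stop_key):
        """Return row keys in [start_key, stop_key), in sorted order."""
        lo = bisect.bisect_left(self.row_keys, start_key)
        hi = bisect.bisect_left(self.row_keys, stop_key)
        return self.row_keys[lo:hi]


t = TinyTable(["info"])                       # "info" is an assumed family name
t.put("1010", "info", "qty", "4", timestamp=1)
t.put("1001", "info", "qty", "12", timestamp=1)
t.put("1001", "info", "qty", "11", timestamp=2)  # newer version of the same cell

print(t.get("1001", "info", "qty"))  # 11
print(t.scan("1000", "1010"))        # ['1001']
```

New columns can be added to a row at any time without changing the schema, while writing to an undeclared column family is rejected.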
