Posts

Showing posts from August, 2023

Compaction Techniques & Crash Recovery in HBase

Compaction in HBase: The recommended maximum region size is 10-20 GB. For HBase clusters running version 0.90.x, the recommended maximum region size is 4 GB and the default is 256 MB. Compaction is the process by which HBase cleans up its own storage files. HBase is a distributed data store optimized for read performance, and reads are fastest when each column family is stored in a single file. Under heavy writes that is not always possible, because each MemStore flush creates a new HFile. This is why HBase tries to combine the HFiles into a single large HFile, which reduces the maximum number of disk seeks needed for a read. This process is known as compaction. Compactions can cause HBase to block writes to prevent JVM heap exhaustion. There are two types of compaction: minor compaction and major compaction. Both take time to merge the files and generate network traffic. To avoid that traffic at busy times, compaction is generally scheduled during low peak load timi
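The difference between the two compaction types can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase's actual implementation: store files are modeled as sorted Python lists of `(row_key, timestamp, value)` cells, a value of `None` stands in for a delete marker, and the names `merge_files`, `minor_compaction` and `major_compaction` are hypothetical. Real HBase chooses which files to compact using its own selection policy.

```python
import heapq

TOMBSTONE = None  # hypothetical marker for a deleted cell in this toy model


def merge_files(files, drop_tombstones):
    """Merge sorted lists of (row_key, timestamp, value) into one sorted list.

    Only the newest version of each row key is kept. Tombstones are removed
    only when drop_tombstones is True (the major-compaction case here).
    """
    # Order cells by row key, then newest timestamp first.
    merged = heapq.merge(*files, key=lambda cell: (cell[0], -cell[1]))
    result, last_key = [], object()
    for key, ts, value in merged:
        if key == last_key:
            continue  # an older version of a key that was already handled
        last_key = key
        if value is TOMBSTONE and drop_tombstones:
            continue  # the delete has been applied, so the cell disappears
        result.append((key, ts, value))
    return result


def minor_compaction(store_files, count=2):
    """Merge only the last `count` files; keep tombstones for older files."""
    older, newest = store_files[:-count], store_files[-count:]
    return older + [merge_files(newest, drop_tombstones=False)]


def major_compaction(store_files):
    """Merge every file into a single file and drop deleted cells."""
    return [merge_files(store_files, drop_tombstones=True)]


store_files = [
    [("row1", 1, "a"), ("row2", 1, "b")],
    [("row2", 2, "B"), ("row3", 2, "c")],
    [("row1", 3, TOMBSTONE), ("row4", 3, "d")],
]
after_minor = minor_compaction(store_files)
after_major = major_compaction(store_files)
print(len(after_minor), "files after minor compaction")
print(len(after_major), "file after major compaction:", after_major[0])
```

In this model the minor compaction has to keep the delete marker for `row1`, because an older file that was not part of the merge still holds a value for that row. Only the major compaction, which reads every file, can safely drop both the marker and the deleted value.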

Apache HBase Architecture

HBase is a column-oriented NoSQL database built on top of HDFS for storage, with YARN commonly used for processing. In this chapter we will discuss the HBase architecture. Architecture: HBase has no separate concept of a database; what an RDBMS would call a database is simply a table. In HBase, tables are split into regions, and the regions are served by region servers. The major components of HBase are: 1. Regions (MemStore, .META., -ROOT-); 2. HBase Region Server (Regions, HLog); 3. HMaster Server; 4. ZooKeeper. HMaster, the Region Servers and ZooKeeper coordinate and manage the regions and perform various operations on them. We will discuss the HBase components one by one and how they help to store and process large sets of data. Region: HBase tables (the schema/DB of an RDBMS) can be divided into a number of regions. All the columns of a column family are stored in a single MemStore of the region. A single region can contain more than one MemStore. A Gr
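The idea that a table is split into regions, each served by one region server, can be sketched as a lookup over sorted region start keys. This is a minimal illustration, assuming each region covers a contiguous range of row keys; the region boundaries, the server names and the function `locate_region` are hypothetical and are not part of any HBase API.

```python
import bisect

# Hypothetical region layout: (start_key, region_server). The first region
# starts at the empty key, so every row key falls into some region.
REGIONS = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-3"),
]
START_KEYS = [start for start, _ in REGIONS]


def locate_region(row_key):
    """Return (start_key, server) of the region whose range holds row_key."""
    # Find the last region whose start key is <= row_key.
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index]


for key in ("apple", "grape", "zebra"):
    print(key, "->", locate_region(key))
```

Because the start keys are kept sorted, the lookup is a binary search, so locating the right region stays cheap even when a table has many regions.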

Features / Advantage / Disadvantage / use case of HBase

Features of HBase NoSQL DB. Scalability: HBase scales in both a linear and a modular way. Sharding: HBase supports automatic sharding of tables, and the sharding is configurable. Distributed storage: HBase stores its data on distributed storage such as HDFS. Consistency: It supports consistent read and write operations. Failover support: HBase supports automatic failover. API support: HBase provides Java APIs, so clients can access it easily. MapReduce support: HBase supports MapReduce for parallel processing of large volumes of data. Backing MapReduce jobs: HBase tables can back Hadoop MapReduce jobs as their input and output. Real-time processing: It supports a block cache and Bloom filters, which help with real-time query processing. Apart from the major features above, HBase also supports RESTful web services, a JRuby-based shell, Ganglia and JMX. So, HBase has a very strong presence in the NoSQL database world. Advantages of HBase: Can store large data sets. Database can be shared. Cost-effective from

Why HBase ?

Why/When do we need Apache HBase: When the amount of data is very large, in the range of petabytes or exabytes, we use a column-oriented approach, because the data of a single column is stored together and can be accessed faster. A row-oriented database handles smaller amounts of data and stores it in a structured format. When we need to store and analyze a large set of semi-structured or unstructured data, we use a column-oriented approach. Quick access to data: If you need random, real-time access to your data, then HBase is a suitable candidate. It is also a good fit for storing large tables with multi-structured data. It gives 'flashback' support to queries, which makes it well suited to fetching data as it was at a particular instant in time. HBase clusters expand by adding RegionServers; growing a cluster from 10 to 20 RegionServers, for example, doubles both its storage and its processing capacity. HBase provides fast record lookups (and updates) for large tables. HBase internally uses Hash tables and provides ran
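The 'flashback' behaviour mentioned above comes from keeping several timestamped versions of a cell. The following is a minimal sketch of that idea, assuming each cell keeps its versions sorted by timestamp; the class `VersionedCells`, its methods, and the sample row and column names are hypothetical and do not mirror the HBase client API.

```python
import bisect
from collections import defaultdict


class VersionedCells:
    """Toy store that keeps every timestamped version of each cell."""

    def __init__(self):
        # (row, column) -> parallel lists, both ordered by timestamp.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, row, column, timestamp, value):
        cell = (row, column)
        index = bisect.bisect_left(self._timestamps[cell], timestamp)
        self._timestamps[cell].insert(index, timestamp)
        self._values[cell].insert(index, value)

    def get(self, row, column, as_of=None):
        """Return the newest value, or the newest one written at or before as_of."""
        cell = (row, column)
        stamps = self._timestamps.get(cell, [])
        if as_of is None:
            index = len(stamps)
        else:
            index = bisect.bisect_right(stamps, as_of)
        if index == 0:
            return None  # nothing had been written by that time
        return self._values[cell][index - 1]


cells = VersionedCells()
cells.put("user1", "info:city", 100, "Chennai")
cells.put("user1", "info:city", 200, "Bangalore")
print(cells.get("user1", "info:city"))             # latest version
print(cells.get("user1", "info:city", as_of=150))  # value as of timestamp 150
```

A read without `as_of` returns the most recent version, while a read with `as_of` looks back to the version that was current at that timestamp.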

Apache HBase Overview

Apache HBase: HBase is a column-oriented NoSQL database built on top of HDFS. NoSQL stands for Not Only SQL. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable, as described in "Bigtable: A Distributed Storage System for Structured Data". It runs on top of the Hadoop Distributed File System (HDFS) and provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing and for random read/write access to large volumes of data. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical Apache MapReduce application, and HBase also supports writing applications through Apache Avro, REST and Thrift. Each table must have an element defined as a primary key, and all access attempts to HBase tabl