Posts

Showing posts from 2023

Features of Scala

Below are the main features of Scala: type inference, singleton objects, immutability, lazy computation, case classes and pattern matching, concurrency control, string interpolation, higher-order functions, traits, and a rich collection set.

Type inference: In Scala, you don't need to mention the data type or the return type of a function explicitly; Scala is smart enough to deduce the type of the data. The return type of a function is determined by the type of the last expression/value present in the function.

Singleton object: In Scala, there are no static variables or methods. Scala uses singleton objects, which are essentially classes with only one object in the source file. A singleton object is declared by using the object keyword instead of class.

Immutability: Scala uses the immutability concept. Each declared variable is immutable by default. Immutable means you can't modify its value. You can also create mutable variables, which can be changed. Immutable data helps to manage concurrency control, which requires man…
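The features listed above can be sketched in one small program. All names here (FeaturesDemo, Point, Shape, etc.) are hypothetical examples chosen for illustration, not from the post:

```scala
object FeaturesDemo {                        // singleton object: Scala has no static members
  val language = "Scala"                     // type inferred as String; immutable (val)
  var counter = 0                            // mutable variable (var) can be reassigned

  lazy val expensive: Int = {                // lazy computation: evaluated on first access only
    counter += 1
    42
  }

  def square(x: Int) = x * x                 // return type inferred from the last expression

  case class Point(x: Int, y: Int)           // case class: structural equality, pattern matching

  def describe(p: Point): String = p match { // pattern matching
    case Point(0, 0) => "origin"
    case Point(x, 0) => s"on x-axis at $x"   // string interpolation
    case _           => "somewhere else"
  }

  def applyTwice(f: Int => Int, x: Int): Int = f(f(x)) // higher-order function

  trait Shape { def area: Double }           // trait: interface-like, can carry members
  class Circle(r: Double) extends Shape { def area: Double = math.Pi * r * r }

  def main(args: Array[String]): Unit = {
    println(square(5))             // 25
    println(describe(Point(3, 0))) // on x-axis at 3
    println(applyTwice(_ + 1, 0))  // 2
    println(expensive)             // 42 (computed now)
    println(counter)               // 1 (the lazy body ran exactly once)
  }
}
```

Running `FeaturesDemo.main` exercises each feature; note that `counter` stays at 1 no matter how often `expensive` is read afterwards, which is the point of `lazy val`.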

Scala Overview & History & Popularity & Usage

Scala Overview

Scala is an object-oriented and functional programming language. Scala is a general-purpose language: it supports object-oriented, functional and imperative programming approaches. It is a pure object-oriented language in the sense that every value is an object, and a functional language in the sense that every function is a value. In Scala, everything is an object, whether it is a function or a number. It is a strongly and statically typed language, and it does not have the concept of primitive data types. The name Scala is derived from the word "scalable", meaning it can grow with the demands of its users. Scala is not an extension of Java, but it is completely interoperable with it. During compilation, a Scala file is translated to Java bytecode and runs on the JVM (Java Virtual Machine).

History

Scala was created and developed by Martin Odersky. Martin started working on Scala in 2001 at the Ecole Polytechnique Federale de Lausanne (EPFL). It was officially released on January 20, 2004. It was…
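The claim that "every function is a value" can be seen directly. A minimal sketch (names are illustrative):

```scala
object FunctionsAsValues {
  // A function stored in a val, like any other value
  val inc: Int => Int = x => x + 1

  // An ordinary method lifted to a function value (eta-expansion)
  def double(x: Int): Int = x * 2
  val doubled: Int => Int = double

  def main(args: Array[String]): Unit = {
    println(inc(41))                // 42
    println(doubled(21))            // 42
    println(List(1, 2, 3).map(inc)) // List(2, 3, 4)
  }
}
```

Because `inc` and `doubled` are plain values, they can be passed to `map`, stored in collections, or returned from other functions.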

Compaction Techniques & Crash Recovery in HBase

Compaction in HBase

The recommended maximum region size is 10 to 20 GB. For HBase clusters running version 0.90.x, the maximum recommended region size is 4 GB and the default is 256 MB. Compaction in HBase is a process by which HBase cleans itself. HBase is a distributed data store optimized for read performance. Optimal read performance comes from having one file per column family, but it is not always possible to maintain one file per column family during heavy writes. That is why HBase tries to combine all HFiles into a single large HFile, reducing the maximum number of disk seeks needed for a read. This process is known as compaction. Compactions can cause HBase to block writes to prevent JVM heap exhaustion. Compaction is of two types: minor HBase compaction and major HBase compaction. Minor and major compactions take time to merge these files and generate network traffic, so compaction is generally scheduled during low peak load timi…
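The region-size and compaction behavior described above are tunable in hbase-site.xml. A hedged sketch with commonly used properties (the values shown are illustrative, not recommendations; check your HBase version's defaults):

```xml
<configuration>
  <!-- Interval between automatic major compactions, in ms (0 disables
       scheduled major compactions so they can be triggered off-peak). -->
  <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>604800000</value> <!-- 7 days -->
  </property>
  <!-- Number of StoreFiles in a store that triggers a minor compaction. -->
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>3</value>
  </property>
  <!-- Maximum HFile size before a region is split. -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>10737418240</value> <!-- 10 GB -->
  </property>
</configuration>
```

Setting `hbase.hregion.majorcompaction` to 0 and running `major_compact` from the HBase shell during low-traffic windows is the usual way to avoid the network load the text mentions.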

Apache HBase Architecture

HBase is a column-oriented NoSQL database built on top of HDFS for storage and YARN for processing. In this chapter we will discuss the HBase architecture.

Architecture

There is no concept of a DB in HBase; a database is simply called a Table. In HBase, tables are split into regions, and those regions are served by the region servers. The major components of HBase are:

1. Regions (MemStore, .META., -ROOT-)
2. HBase Region Server (Regions, HLog)
3. HMaster Server
4. Zookeeper

HMaster, the Region Servers and Zookeeper coordinate and manage the regions and perform various operations inside the regions. We will discuss the HBase components one by one and how they help store and process large sets of data.

Region

HBase tables (schema/DB in an RDBMS) can be divided into a number of regions. All the columns of a column family are stored in a single MemStore of a region. A single region can contain more than one MemStore. A Gr…

Features / Advantages / Disadvantages / Use Cases of HBase

Features of HBase NoSQL DB

Scalability: HBase supports scalability in both linear and modular form.
Sharding: HBase supports automatic sharding of tables. It is also configurable.
Distributed storage: HBase supports distributed storage like HDFS.
Consistency: It supports consistent read and write operations.
Failover support: HBase supports automatic failover.
API support: HBase supports Java APIs, so clients can access it easily.
MapReduce support: HBase supports MapReduce for parallel processing of large volumes of data.
Backup support: HBase supports backup of Hadoop MapReduce jobs in HBase tables.
Real-time processing: It supports block cache and Bloom filters, so real-time query processing is easy.

Apart from the above major features, HBase also supports RESTful web services, a JRuby-based shell, Ganglia and JMX. So, HBase has a very strong presence in the NoSQL database world.

Advantages of HBase

Can store large data sets. Database can be shared. Cost-effective from…

Why HBase ?

Why/When do we need Apache HBase

When the amount of data is very huge, in terms of petabytes or exabytes, we use the column-oriented approach, because the data of a single column is stored together and can be accessed faster. A row-oriented database handles a smaller amount of data and stores it in a structured format. When we need to store and analyze a large set of semi-structured or unstructured data, we use the column-oriented approach.

Quick access to data: If you need random, real-time access to your data, then HBase is a suitable candidate. It is also a perfect fit for storing large tables with multi-structured data. It gives 'flashback' support to queries, which makes it more suitable for fetching data at a particular instant of time. HBase clusters expand by adding RegionServers, which increases both storage and processing capacity. HBase provides fast record lookups (and updates) for large tables. HBase internally uses hash tables and provides ran…

Apache HBase Overview

Apache HBase

HBase is a column-oriented NoSQL database built on top of HDFS. NoSQL stands for "Not Only SQL". Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data. HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing and for random read/write access to large volumes of data. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical Apache MapReduce application. It also supports writing applications in Apache Avro, REST and Thrift. Each table must have an element defined as a primary key, and all access attempts to HBase tabl…
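Conceptually, an HBase table behaves like a map sorted by row key, where a Get is a key lookup and a Scan is a range read over sorted keys. This sketch models that behavior in plain Scala; it is a simplified illustration, not the HBase client API (real HBase also versions every cell by timestamp and groups columns into families):

```scala
import scala.collection.immutable.TreeMap

object RowKeyModel {
  // Hypothetical model: row key -> (column -> value)
  type Row = Map[String, String]

  val table: TreeMap[String, Row] = TreeMap(
    "row1" -> Map("cf:name" -> "alice"),
    "row2" -> Map("cf:name" -> "bob"),
    "row9" -> Map("cf:name" -> "carol")
  )

  // Get: random read by row key (the table's "primary key")
  def get(rowKey: String): Option[Row] = table.get(rowKey)

  // Scan: range read over sorted row keys, start inclusive, stop exclusive
  def scan(start: String, stop: String): Iterable[Row] =
    table.range(start, stop).values

  def main(args: Array[String]): Unit = {
    println(get("row2"))               // Some(Map(cf:name -> bob))
    println(scan("row1", "row9").size) // 2 (row1 and row2; row9 excluded)
  }
}
```

Keeping rows sorted by key is what makes both point lookups and contiguous range scans cheap, which is why row-key design matters so much in real HBase schemas.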

YARN

YARN Overview

YARN stands for Yet Another Resource Negotiator. YARN was introduced in Hadoop 2.x. It is purely a data processing layer, sometimes called the Data Processing Framework (DPF). YARN allows different data processing engines, such as graph processing, interactive processing, stream processing and batch processing, to run and process data stored in HDFS. Apart from resource management, YARN is also used for job scheduling.

YARN Architecture

The Apache YARN framework consists of a master daemon known as the "Resource Manager" and a slave daemon called the "Node Manager" (one per slave node). The Resource Manager and Node Manager are the two daemons of YARN. The Resource Manager runs on the name node, while a Node Manager runs on each data node; every data node has its own separate Node Manager.

Resource Manager

In a general view, the Resource Manager (RM) is responsible for tracking the resources in a cluster and scheduling applications (e.g., MapReduce jobs). Prior to…

HDFS Architecture

HDFS Services/Daemons

1. Name Node (NN)
2. Data Node (DN)
3. Secondary Name Node (SNN)
4. Standby Name Node (Standby)

These nodes deal with HDFS, which is used to store the data.

V1 MapReduce daemons: 4. Job Tracker, 5. Task Tracker
YARN daemons (MapReduce V2): 4. Resource Manager, 5. Node Manager

These daemons are responsible for running the MapReduce jobs. The start-dfs.sh script starts the HDFS daemons and start-yarn.sh starts the YARN daemons separately; start-all.sh starts all of the above daemons.

Name Node (NN)

Only one Name Node is available for HDFS. It is the master node, and it stores only the metadata of all the data. If the Name Node fails, then HDFS will be down; this is called a Single Point Of Failure (SPOF).

High Availability (HA)

Here, if the active Name Node goes down, the standby Name Node takes over as the 'Active Name Node'.

How is memory managed in the NN? When we are doing any tra…

HDFS Overview

HDFS

Hadoop Distributed File System (HDFS) is a specially designed file system. It is a data storage framework: the storage layer, designed only for storing data. It is used to store huge datasets on a cluster of commodity hardware with a streaming access pattern.

Why do we say HDFS is specially designed? A file system is a way of storing files and directories. A hard disk has storage space of, say, 500 GB, and by default one block is 4 KB. If I store a 2 KB file, the remaining space in that block is wasted. This is the normal process for storing files, as followed in an RDBMS. In HDFS, by default each block has a size of 64 MB. When I store 35 MB of data in a block, the remaining space can be used to store other files or directories. This process is called sharding. That is why we call HDFS a specially designed file system. If we wasted this remaining space, we would need more systems to store huge datasets. You can make…
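The block arithmetic above can be sketched as a quick calculation, assuming the 64 MB default mentioned in the text (newer Hadoop versions default to 128 MB):

```scala
object BlockMath {
  val blockSizeMB = 64

  // Number of HDFS blocks needed for a file, rounding up
  def blocksNeeded(fileSizeMB: Int): Int =
    (fileSizeMB + blockSizeMB - 1) / blockSizeMB

  // Megabytes actually occupied by the last, partially filled block
  def lastBlockMB(fileSizeMB: Int): Int = {
    val rem = fileSizeMB % blockSizeMB
    if (rem == 0) blockSizeMB else rem
  }

  def main(args: Array[String]): Unit = {
    println(blocksNeeded(35))  // 1 block: a 35 MB file occupies 35 MB, not 64
    println(lastBlockMB(35))   // 35
    println(blocksNeeded(200)) // 4 blocks: three full 64 MB blocks plus one 8 MB block
    println(lastBlockMB(200))  // 8
  }
}
```

The key point the post is making: unlike a fixed-allocation scheme, an HDFS block smaller than the block size only consumes the bytes it actually holds, so the remaining capacity stays usable.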