Posts

Serializer and Deserializer in Hive

What is SerDe? A serializer/deserializer, or SerDe for short, handles record parsing for a Hive table. Hive uses the SerDe interface for IO. The deserializer converts a record (string or binary) into a Java object that Hive can process. The serializer takes that Java object and converts it into a format that can be stored in HDFS. In short, a SerDe is responsible for converting record bytes into something Hive can use, and back again. Read path: HDFS files -> InputFileFormat -> <key, value> -> Deserializer -> Row object (Java object). Write path: Row object (Java object) -> Serializer -> <key, value> -> OutputFileFormat -> HDFS files. Why do we need SerDe? A SerDe lets us load and handle semi-structured data. If we have unstructured data, then we use RegEx SerDe which will instruct hiv...
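The read and write paths in the excerpt above can be illustrated with a minimal sketch. It is not Hive's SerDe interface; the `Row` class, the two-column layout, and the comma delimiter are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical row object standing in for the Java object Hive works on."""
    id: int
    name: str


def deserialize(record: bytes, delimiter: str = ",") -> Row:
    # Read path: record bytes -> row object.
    fields = record.decode("utf-8").split(delimiter)
    return Row(id=int(fields[0]), name=fields[1])


def serialize(row: Row, delimiter: str = ",") -> bytes:
    # Write path: row object -> record bytes ready to be stored.
    return f"{row.id}{delimiter}{row.name}".encode("utf-8")


row = deserialize(b"1,alice")
print(row)
print(serialize(row))
```

Serializing a deserialized record returns the original bytes, which is the round trip a SerDe is expected to support.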

Hive Joins

Map Join Apache Hive has a feature for speeding up queries called the map join, also known as auto map join, map-side join, or broadcast join. It applies when one of the tables in the join is small enough to be loaded into RAM. The join is then performed inside the mapper, without a reduce step, which is why map joins are much faster than regular joins. When joining two tables with a map join, one table should hold far less data than the other. A table of 25 MB or less is considered a small table. If the table exceeds 25 MB but is still small relative to the other one (say 1 TB against 10 TB), a different map join method is needed to handle that case. Because a map join requires no reduce task, it improves query performance significant...
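The idea behind a map join can be sketched outside Hive as a broadcast hash join: the small table is loaded into an in-memory lookup, and the large table is streamed past it once. This is a simplified illustration, not Hive's implementation; the function name `map_side_join` and the sample tables are hypothetical.

```python
from collections import defaultdict


def map_side_join(small_table, big_table, key):
    """Join two lists of dict rows on `key`, holding only small_table in memory."""
    # Build phase: hash the small table (the part that must fit in RAM).
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)

    # Probe phase: stream the big table; no shuffle or reduce step is needed.
    for row in big_table:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


departments = [{"dept_id": 1, "dept": "eng"}, {"dept_id": 2, "dept": "ops"}]
employees = [
    {"emp": "ann", "dept_id": 1},
    {"emp": "bob", "dept_id": 3},
    {"emp": "cy", "dept_id": 2},
]

for joined in map_side_join(departments, employees, "dept_id"):
    print(joined)
```

Rows of the large table with no match in the small table are dropped, which corresponds to an inner join.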

HDFS Infographic

As we already know, HDFS stores each block on 3 data nodes (replication) to protect against failures and corruption. Storing the same data 3 times on the same disk or rack is useless: if that disk or rack is corrupted, every replica is lost and replication has achieved nothing. Storing the 3 replicas on different data nodes in the same rack does not help either, because losing the whole rack loses all of them. Replicas therefore have to be stored on different data nodes in different racks to protect the data. Here we discuss how to store data on different data nodes in different racks and how to handle fault tolerance. Replica Placement Strategy The name node follows the Rack Awareness Policy algorithm to replicate data onto 3 different data nodes. This Rack Awareness Policy follows the replica placement strategy. Here th...
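A rough sketch of rack-aware placement, assuming the commonly described HDFS default for a replication factor of 3: the first replica goes on the writer's node, and the second and third go on two different nodes of one other rack. The function `place_replicas`, the topology layout, and the deterministic node choice are assumptions for illustration; the real name node chooses among eligible nodes differently.

```python
def place_replicas(writer_node, topology):
    """Pick 3 data nodes for one block.

    topology maps a rack name to the list of data nodes in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Find another rack with at least two nodes for the second and third replicas.
    remote_rack = next(
        (rack for rack, nodes in topology.items()
         if rack != local_rack and len(nodes) >= 2),
        None,
    )
    if remote_rack is None:
        raise ValueError("need a second rack with at least two data nodes")

    second, third = topology[remote_rack][:2]
    return [writer_node, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))
```

With this placement, losing any single node or any single rack still leaves at least one replica available.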

Features of Scala

Scala has the following features: type inference, singleton objects, immutability, lazy computation, case classes and pattern matching, concurrency control, string interpolation, higher-order functions, traits, and a rich collection set. Type Inference In Scala, you are not required to state data types and function return types explicitly. Scala is smart enough to deduce the type of data. The return type of a function is determined by the type of the last expression in the function. Singleton object In Scala, there are no static variables or methods. Scala uses singleton objects instead; a singleton object is essentially a class with exactly one instance. It is declared with the object keyword instead of class. Immutability Scala favors immutability. A variable declared with val is immutable, which means you can't modify its value. You can also create mutable variables, declared with var, which can be changed. Immutable data helps to manage concurrency control which requir...

Scala Overview & History & Popularity & Usage

Scala Overview Scala is a general-purpose programming language that supports object-oriented, functional, and imperative programming approaches. It is a pure object-oriented language in the sense that every value is an object, and a functional language in the sense that every function is a value. In Scala, everything is an object, whether it is a function or a number. It is a strongly, statically typed language. It has no concept of primitive data. The name Scala is derived from the word scalable, meaning it can grow with the demands of its users. Scala is not an extension of Java, but it is completely interoperable with it. During compilation, a Scala file is translated to Java bytecode, which runs on the JVM (Java Virtual Machine). History Scala was created and developed by Martin Odersky. Martin started working on Scala in 2001 at the École Polytechnique Fédérale de Lausanne (EPFL). It was officially released on ...

Compaction Techniques & Crash Recovery in HBase

Compaction in HBase The recommended maximum region size is 10-20 GB. For HBase clusters running version 0.90.x, the maximum recommended region size is 4 GB and the default is 256 MB. Compaction is the process by which HBase cleans itself. HBase is a distributed data store optimized for read performance. Optimal read performance comes from having one file per column family, but that is not always possible during heavy writes. That is why HBase tries to combine all HFiles into a single large HFile, reducing the maximum number of disk seeks needed for a read. This process is known as compaction. Compactions can cause HBase to block writes to prevent JVM heap exhaustion. There are two types of compaction: minor compaction and major compaction. Both take time to merge the files and generate network traffic. To avoid that traffic, compaction is generally scheduled during low peak ...
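The merge step at the heart of compaction can be sketched as a k-way merge of sorted files. This is an illustrative model only, not HBase code: each "HFile" is assumed to be a list of `(row_key, timestamp, value)` cells sorted by row key and then by newest timestamp first, and the sketch keeps only the newest cell per row key, which is a simplification.

```python
import heapq


def compact(hfiles):
    """Merge several sorted cell lists into one, keeping the newest cell per key."""
    # Order by row key, then by newest timestamp first.
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))

    result = []
    for row_key, timestamp, value in merged:
        # The first cell seen for a key is its newest version; skip older ones.
        if not result or result[-1][0] != row_key:
            result.append((row_key, timestamp, value))
    return result


hfile_1 = [("a", 1, "x"), ("c", 1, "z")]
hfile_2 = [("a", 2, "x2"), ("b", 1, "y")]
print(compact([hfile_1, hfile_2]))
```

After the merge a read for any row key touches one file instead of several, which is the reduction in disk seeks the excerpt describes.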

Apache HBase Architecture

HBase is a column-oriented NoSQL database built on top of HDFS for storage and YARN for processing. In this chapter we discuss the HBase architecture. Architecture HBase has no concept of a database; what other systems call a database is simply a table here. Tables are split into regions, which are served by region servers. The major components of HBase are: 1. Regions (MemStore, .META., -ROOT-) 2. HBase Region Server (Regions, HLog) 3. HMaster Server 4. ZooKeeper. HMaster, Region Server, and ZooKeeper coordinate and manage regions and perform various operations inside them. We will discuss the HBase components one by one and how each helps to store and process large data sets. Region HBase tables (schema/DB in RDBMS) can be divided into a number of regions. All th...