
Spark Architecture

Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed to handle large-scale data processing and analytics tasks efficiently, and its architecture is structured to maximize performance and scalability.

Architecture

Driver Program
The driver program is the entry point of any Spark application. It is responsible for translating the user's code into tasks and distributing them to the executors. The Driver Program is the process that runs the main() function of the application and creates the SparkContext object, which represents the connection to the Spark cluster. When a user submits a Spark application, the Driver Program receives the application code and any associated configurations; the rest of the technical work is handled through the SparkContext. The Driver Program interacts with the cluster manager (such as Apache Mesos, YARN, or Spark's standalone manager) to acquire resources for the executors.

Apache Spark

What is Apache Spark?
Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley's AMP Lab and open sourced in 2010 as an Apache project. Apache Spark is a single-node or cluster computing platform that sits on top of any storage layer for large-scale data processing. It is an end-to-end analytics platform (data ingestion, ETL, analytics, and streaming), developed to overcome the limitations of Hadoop/MapReduce. Spark has several advantages compared with other big data and MapReduce technologies such as Hadoop and Storm. First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements with data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data). Major companies like Amazon, eBay, and Yahoo use Spark.

Why do we prefer Spark over Hadoop?
Hadoop as a big data processing framework is built around MapReduce, which writes intermediate results to disk between stages, whereas Spark keeps most of its processing in memory.

Hive File Formats - Questions & Answers

01. What is the primary use of the Textfile, Sequencefile, RCfile, and ORCfile formats?
You can choose among these four file formats depending on your data. For example:
If your data is delimited by some parameter, you can use the TEXTFILE format.
If your data is in small files whose size is less than the block size, you can use the SEQUENCEFILE format.
If you want to perform analytics on your data and store it efficiently for that purpose, you can use the RCFILE format.
If you want to store your data in an optimized way that reduces storage and increases performance, you can use the ORCFILE format.
(Example table definitions for all four formats are sketched after question 02.)

02. What is the difference between a block and a stripe?
HDFS blocks are the lowest level and ORC stripes are an upper level; the two levels are completely independent, and stripes in ORC do not care about the lower storage layer.
HDFS blocks: HDFS blocks are the lowest level, independent of the file format. HDFS splits files into blocks to optimize storage. One stripe can therefore span more than one HDFS block, and a single block can contain parts of several stripes.
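To make question 01 concrete, here is a rough sketch (the table and column names are hypothetical, not from the original post) of how each format is selected in the STORED AS clause of the table definition:

create table sales_text (id int, amount double)
row format delimited fields terminated by ','
stored as textfile;

create table sales_seq (id int, amount double) stored as sequencefile;
create table sales_rc (id int, amount double) stored as rcfile;
create table sales_orc (id int, amount double) stored as orc;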

Hive File Formats

Hive supports several file formats:
Text File
SequenceFile
RCFile
Avro Files
ORC Files
Parquet
Custom INPUTFORMAT and OUTPUTFORMAT

The hive.default.fileformat configuration parameter, which is available in hive-site.xml, determines the format to use if it is not specified in a CREATE TABLE or ALTER TABLE statement. Text file is the parameter's default value.

What is a File Format?
A file format is a way in which information is stored or encoded in a computer file. In Hive it refers to how records are stored inside the file. These file formats mainly vary in data encoding, compression rate, usage of space, and disk I/O. Hive does not verify whether the data that you are loading matches the schema for the table. However, it does verify whether the file format matches the table definition.

Text File
TEXTFILE is a common input/output format used in Hadoop. In Hive, if we define a table as TEXTFILE it can load data from csv (Comma Separated Values), tsv, and txt files.
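As a minimal sketch (the table name emp_csv and the file path are hypothetical), a delimited text file can be mapped to a TEXTFILE table like this:

-- show the current default file format (TextFile unless overridden in hive-site.xml)
set hive.default.fileformat;

create table emp_csv (id int, name string)
row format delimited fields terminated by ','
stored as textfile;

load data local inpath '/tmp/emp.csv' into table emp_csv;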

Hive Data Types

All the data types in Hive are classified into the following categories: Column Types, Literals, Null Values, Misc Types, and Complex Types.

Column Types

Integral Types
Integer type data can be specified using the integral data types, such as INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you can use SMALLINT. TINYINT is smaller than SMALLINT. The data ranges for TINYINT, SMALLINT, INT, and BIGINT are:
TINYINT (1-byte signed integer, from -128 to 127)
SMALLINT (2-byte signed integer, from -32,768 to 32,767)
INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)

Example
create table test1(id tinyint) row format delimited fields terminated by ',';
insert into test1 values(1);
insert into test1 values(127);
insert into test1 values(128);

hive> select * from test1;
OK
1
127
NULL
Time taken: 0.055 seconds, Fetched: 3 row(s)

The value 128 falls outside the TINYINT range, so Hive stores it as NULL.
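As a quick follow-up sketch (the table name test2 is hypothetical), the same out-of-range value is stored correctly once a wider integral type such as SMALLINT is used:

create table test2(id smallint) row format delimited fields terminated by ',';
insert into test2 values(128);

hive> select * from test2;
OK
128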

Serializer and Deserializer in Hive

What is SerDe?
The record parsing of a Hive table is handled by a serializer/deserializer, or SerDe for short. Hive uses the SerDe interface for IO. The Hive deserializer converts a record (string or binary) into a Java object that Hive can process (modify). The Hive serializer then takes this Java object and converts it into a suitable format that can be stored in HDFS. So, basically, a SerDe is responsible for converting the record bytes into something that can be used by Hive.

HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object (Java object)
Row object (Java object) –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files

Why do we need a SerDe?
We use a custom SerDe when we need to handle/load semi-structured data. If we have unstructured data, then we use the RegEx SerDe, which instructs Hive how to parse each record. A SerDe allows Hive to read data from a table, and write it back out to HDFS in any custom format.
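A minimal sketch of a RegEx SerDe table definition (the table name, columns, and regular expression below are hypothetical illustrations, not from the original post):

-- each capturing group in input.regex maps to one table column, in order
create table web_log (
  host string,
  request_path string,
  status string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "([^ ]*) ([^ ]*) ([0-9]*)"
)
stored as textfile;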

Hive Joins

Map Join
In Apache Hive, there is a feature that we use to speed up Hive queries: the map join. Apache Hive Map Join is also known as Auto Map Join, Map Side Join, or Broadcast Join. We use a map-side join when one of the tables in the join is a small table that can be loaded into RAM. A map join is performed within the mapper without a separate Map/Reduce step, so map joins in Hive are much faster than regular joins since no reducers are necessary. When joining two tables using a map join, one table should hold minimal data compared with the other. If that smaller table holds 25 MB of data or less, it is considered a small table. If it exceeds 25 MB but is still small (say 1 TB) compared with the other table (10 TB), then we have to use a different map-join technique to handle this case. Because no reduce task is required, a map join improves query performance significantly.
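A rough sketch of how a map join is commonly enabled (the table names big_tbl and small_tbl are hypothetical):

-- let Hive convert the join to a map join automatically when the smaller
-- table is below hive.mapjoin.smalltable.filesize (about 25 MB by default)
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- the MAPJOIN hint marks small_tbl as the table to load into memory
-- (newer Hive versions rely on the automatic conversion instead)
select /*+ MAPJOIN(small_tbl) */ b.id, s.name
from big_tbl b
join small_tbl s on b.id = s.id;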