Posts

Showing posts from December, 2022

Pig Datatypes

Numeric datatypes
int: signed 32-bit integer. Example: 8 or 5 (any whole number).
long: signed 64-bit integer. Example: 5L or 5l.
float: 32-bit floating point. Example: 5.5F, 5.5f or 5.0f (any number with a decimal part and an f/F suffix).
double: 64-bit floating point. Example: 10.5, 5.0, 1.5, 1.5e2 or 1.5E2 (any number with a decimal part).

Non-numeric (array) datatypes
chararray: character array (string). Example: 'karpagaraj' or 'hello there' (any string).
bytearray: blob. Example: any binary data.

Other non-numeric datatypes
boolean: true or false (case insensitive).
datetime: Example: 1970-01-01T00:00:00.000+00:00
null: unknown or missing data; any data element can be null.

Complex datatypes
tuple: ordered set of fields. Example: (1,3)
bag: collection of tuples. Example: {(1,3),(4,6)}
map: set of key/value pairs. Example: [k1#v1, k2#v2]
atom: any single value in Pig Latin, irrespective of their data, t...
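The complex datatypes above can be pictured with ordinary Python containers. This is only a rough sketch of the shapes involved, not Pig code: the variable names and the `field` helper are made up for illustration, and a Python list stands in for a bag because a Pig bag may hold duplicate tuples.

```python
# Rough Python analogues of Pig's complex types (illustrative only; not Pig code).

# tuple: an ordered set of fields, like Pig's (1,3)
record = (1, 3)

# bag: a collection of tuples, like Pig's {(1,3),(4,6)}
# A list is used because a Pig bag may contain duplicate tuples.
bag = [(1, 3), (4, 6)]

# map: a set of key/value pairs, like Pig's [k1#v1, k2#v2]
mapping = {"k1": "v1", "k2": "v2"}

# null: unknown or missing data; None plays that role in this sketch
missing = None


def field(t, position):
    """Hypothetical helper: return a tuple field by position, similar to Pig's $0, $1 notation."""
    return t[position]


# Pull the first field out of every tuple in the bag
first_fields = [field(t, 0) for t in bag]
print(first_fields)
```

In Pig itself the same positional access is written with `$0`, `$1` and so on, and map values are looked up with `#`, as in the `[k1#v1, k2#v2]` notation above.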

Apache Pig Running & Execution Mode

Apache Pig Running Mode
You can run Pig Latin statements and commands in three modes:
Interactive mode
Batch mode
Embedded mode (UDF)

Interactive mode
For interactive processing, we can use Apache Tez or Apache Spark as the processing engine. Interactive processing means that the user supplies instructions while the computer is doing the processing, i.e. the user 'interacts' with the computer to complete the task. For example, if you use an online system to book a hotel room, you fill in a web form, submit it, and the system comes back to tell you which room you have booked. You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command and then enter your Pig Latin statements and Pig commands interactively at the command line.

Batch mode
We can process the data set by set, or batch by batch, with a limited number of records/rows in each batch, as in MapReduce and Hive. We will do this in pig usi...

Apache Pig Operators

Load & Store Operators: Load, Store
Diagnostic Operators: Dump, Describe, Explain, Illustrate
Grouping & Joining Operators: Group, Cogroup, Join (single key | multi key), Cross
Combining & Splitting Operators: Union, Split
Filtering Operators: Distinct, Filter, Foreach
Sorting Operators: Order by, Limit
Built-In Functions: Concat, tokenize, flatten, bag, tuple, map, top, PigStorage, TextLoader, BinStorage
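To get a feel for what a few of these operators do to a relation, here is a small sketch in plain Python rather than Pig Latin. The sample data, field layout and variable names are all assumptions made for the example, and it only approximates the semantics: Pig's DISTINCT, for instance, does not promise any output order.

```python
from collections import defaultdict

# Hypothetical sample relations as lists of tuples:
# employees = (name, dept, salary), departments = (dept, city)
employees = [
    ("asha", "eng", 50),
    ("ravi", "ops", 40),
    ("mani", "eng", 70),
    ("ravi", "ops", 40),  # duplicate row
]
departments = [("eng", "Chennai"), ("ops", "Madurai")]

# DISTINCT: drop duplicate tuples (first-seen order is kept here for readability)
distinct_rows = list(dict.fromkeys(employees))

# FILTER: keep only the tuples that match a condition
well_paid = [row for row in distinct_rows if row[2] >= 50]

# GROUP: one entry per key, each holding a bag of the matching tuples
grouped = defaultdict(list)
for row in distinct_rows:
    grouped[row[1]].append(row)

# JOIN (single key): pair up tuples whose key fields are equal
joined = [e + d for e in distinct_rows for d in departments if e[1] == d[0]]

# ORDER BY salary descending, then LIMIT 2
top_two = sorted(distinct_rows, key=lambda row: row[2], reverse=True)[:2]

print(well_paid)
print(dict(grouped))
print(joined)
print(top_two)
```

Note that the joined tuples carry the key field from both sides, which mirrors how a Pig join keeps the fields of both input relations.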

Apache Pig Overview

Apache Pig
Apache Pig is another Hadoop framework, aimed at non-Java developers. It was originally developed at Yahoo! (2007). Pig can eat anything, meaning it can handle both structured and semi-structured data. It uses the Pig Latin language, which is a data-flow language, and it sits between Java and Hive as an intermediate language.

If you want to play around with data in a Hadoop cluster without having to write hundreds or thousands of lines of Java MapReduce code, you will most likely use either Hive (using the Hive Query Language, HQL) or Pig. Hive is a SQL-like language which compiles to Java MapReduce code, while Pig is a data-flow language which lets you specify your MapReduce data pipelines using high-level abstractions.

Why Pig?
MapReduce requires programmers; Pig needs far less programming.
No Java knowledge is required.
Development time is much shorter.
Can process any kind of data (structured, semi-structured, unstructured).
Good for ad-hoc queries.
Extensible by UDF by Java...

Impala

What is Impala
Impala server is a distributed, massively parallel processing (MPP) database engine.
Developed by Cloudera.
Based on Google's 2010 Dremel paper.
Direct access to HDFS/HBase data via the Impala engine.
Hive and Impala are similar. Both are meant for “SQL on Hadoop”.
Any table created in Impala can be accessed from both Hive and Impala.
Impala SQL is a subset of Hive Query Language.
It provides high performance and low latency compared to other SQL engines for Hadoop.

Why Impala when we have Hive
Hive queries incur the overhead of starting MapReduce jobs:
Job setup and creation
Starting JVMs
Slot assignment
Input split creation
Map task generation
Hive inherits the latency of MapReduce. The latency of a MapReduce job is higher because of the overhead of creating a job, starting JVMs, calculating input splits, etc. The result is higher latency and poor performance, so Hive cannot be used interactively by the user. Impala does not build on MapReduce. It is its...