Apache Pig Overview

Apache Pig

  • Pig is another Hadoop framework, aimed at developers who do not want to write Java.
  • It was originally developed at Yahoo! and became an Apache project in 2007.
  • "Pig can eat anything": it can handle structured, semi-structured, and unstructured data.
  • Programs are written in the Pig Latin language.
  • Pig Latin is a data-flow language.
  • In level of abstraction, it sits between Java MapReduce and Hive.
  • If you want to play around with data in a Hadoop cluster without having to write hundreds or thousands of lines of Java MapReduce code, you will most likely use either Hive (with the Hive Query Language, HiveQL) or Pig.
  • Hive offers a SQL-like language that compiles to MapReduce jobs, while Pig is a data-flow language that lets you specify your MapReduce data pipelines using high-level abstractions.

Why Pig?

  • Writing MapReduce directly requires experienced Java programmers.
  • Pig needs far less code for the same job.
  • No Java knowledge is required.
  • Development time is much shorter.
  • It can process any kind of data (structured, semi-structured, unstructured).
  • It is good for ad-hoc queries.
  • It is extensible through user-defined functions (UDFs) written in Java, Python, JavaScript, or Ruby.

Use cases

  • Suppose you have user data in one file and website data in another, and you need to find the top 5 most visited pages by users aged 18 to 25. In Pig this is a short pipeline of load, filter, join, group, order, and limit steps.
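
To make the data flow in this use case concrete, here is a minimal Python sketch of the same steps on in-memory data. It is not Pig Latin and not how Pig executes on a cluster. The sample records, the field layout, and the function name top_pages are all assumptions made for illustration. The comments name the Pig Latin operator that each step roughly corresponds to.

```python
from collections import Counter

# Hypothetical sample data standing in for the two input files.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 25)]
visits = [
    ("alice", "/home"), ("alice", "/news"), ("bob", "/home"),
    ("carol", "/home"), ("carol", "/news"), ("dave", "/shop"),
    ("carol", "/home"),
]

def top_pages(users, visits, min_age=18, max_age=25, n=5):
    # FILTER: keep only users in the requested age range.
    in_range = {name for name, age in users if min_age <= age <= max_age}
    # JOIN + GROUP + COUNT: count visits per URL made by those users.
    # (A set lookup stands in for the join; it assumes user names are unique.)
    counts = Counter(url for user, url in visits if user in in_range)
    # ORDER ... DESC + LIMIT: most visited first, ties broken by URL.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

top5 = top_pages(users, visits)
print(top5)
```

Each intermediate value (in_range, counts, ranked) plays the role of a named relation in a data-flow script, which is the main difference from writing one nested SQL query.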



We can use Pig in ETL for:

  • Processing large amounts of log data.
  • Cleaning bad data.
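
As a rough illustration of the "clean bad data" step, the Python sketch below drops malformed log records before further processing. The three-field, tab-separated layout (user, URL, status code) and the helper name parse_line are assumptions for this example, not a format defined by Pig. In a Pig script the same idea would typically be expressed as a filter over the loaded relation.

```python
def parse_line(line):
    """Return (user, url, status) or None if the line is malformed."""
    # Strip only the trailing newline so a leading empty field is preserved.
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None  # wrong number of fields
    user, url, status = parts
    if not user or not status.isdigit():
        return None  # empty user or non-numeric status code
    return user, url, int(status)

# Hypothetical raw log lines, some of them bad.
raw_lines = [
    "alice\t/home\t200",
    "bob\t/news",          # missing field
    "\t/shop\t200",        # empty user
    "carol\t/home\tOK",    # non-numeric status
    "dave\t/shop\t404",
]

# Keep only the records that parsed cleanly.
clean = [rec for rec in map(parse_line, raw_lines) if rec is not None]
print(clean)
```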

We can use Pig for research on raw data:

  • User audit logs.
  • Data whose schema may be unknown or inconsistent.
