Apache Pig Overview
Apache Pig
- Pig is another Hadoop framework, aimed at developers who do not write Java.
- Originally developed at Yahoo! (2007).
- Pig "eats anything": it can handle structured, semi-structured, and unstructured data.
- Scripts are written in the Pig Latin language.
- Pig Latin is a data-flow language.
- In level of abstraction it sits between Java MapReduce and Hive.
- If you want to explore data in a Hadoop cluster without writing hundreds or thousands of lines of Java MapReduce code, you will most likely use either Hive (with the Hive Query Language, HiveQL) or Pig.
- Hive offers a SQL-like language that compiles to MapReduce jobs, while Pig is a data-flow language that lets you specify your MapReduce data pipelines using high-level abstractions.
Why Pig?
- Writing MapReduce jobs directly requires Java programming skills.
- Pig needs far less code for the same job.
- No Java knowledge is required.
- Development time is much shorter.
- Can process any kind of data (structured, semi-structured, unstructured).
- Good for ad-hoc queries.
- Extensible through UDFs (user-defined functions) written in Java, Python, JavaScript, or Ruby.
Use cases
- Suppose you have user data in one file and website visit data in another, and you need to find the top 5 most visited pages among users aged 18 to 25.
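To make the data flow in this use case concrete, here is a minimal sketch in plain Python (not Pig Latin) of the steps such a pipeline would chain together: filter the users by age, join them to the visits, group and count by page, order by count, and keep the top entries. The sample records, the field names (`name`, `age`, `user`, `url`), and the function `top_pages` are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical sample data; in a real job these would be two files on HDFS.
users = [
    {"name": "amy", "age": 22},
    {"name": "bob", "age": 31},
    {"name": "cat", "age": 19},
    {"name": "dan", "age": 25},
]
visits = [
    {"user": "amy", "url": "/home"},
    {"user": "amy", "url": "/news"},
    {"user": "amy", "url": "/home"},
    {"user": "bob", "url": "/home"},
    {"user": "bob", "url": "/shop"},
    {"user": "cat", "url": "/news"},
    {"user": "cat", "url": "/sports"},
    {"user": "cat", "url": "/home"},
    {"user": "dan", "url": "/shop"},
]


def top_pages(users, visits, min_age=18, max_age=25, n=5):
    """Return the n most visited (url, count) pairs for users in the age range."""
    # Filter step: keep only users whose age is in [min_age, max_age].
    # Assumes user names are unique, so a set is enough for the join key.
    selected = {u["name"] for u in users if min_age <= u["age"] <= max_age}
    # Join + group + count step: count visits per URL for the selected users.
    clicks = Counter(v["url"] for v in visits if v["user"] in selected)
    # Order + limit step: highest count first, URL as a tie-breaker.
    ranked = sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


if __name__ == "__main__":
    for url, count in top_pages(users, visits):
        print(url, count)
```

In Pig, each of these steps would be one statement in the script, and Pig would translate the whole chain into MapReduce jobs for you.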
We can use Pig in ETL for:
- Processing large amounts of log data.
- Cleaning bad data.
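As a rough illustration of what "cleaning bad data" means for logs, the plain-Python sketch below (again, not Pig Latin) keeps well-formed records and counts the rest. The log layout (tab-separated user, URL, and numeric status code) and the names `parse_line` and `clean` are assumptions chosen for this example, not a format the post prescribes.

```python
def parse_line(line):
    """Return (user, url, status) for a well-formed line, or None if it is bad."""
    # Strip only the trailing newline so an empty first field is preserved.
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None  # wrong number of fields
    user, url, status = parts
    if not user or not url.startswith("/") or not status.isdecimal():
        return None  # missing user, malformed URL, or non-numeric status
    return user, url, int(status)


def clean(lines):
    """Split raw lines into parsed good records and a count of bad ones."""
    good, bad = [], 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            bad += 1
        else:
            good.append(record)
    return good, bad


if __name__ == "__main__":
    # Hypothetical raw log with three malformed lines.
    raw_log = [
        "amy\t/home\t200",
        "bob\t/shop",
        "\t/news\t200",
        "cat\t/news\tOK",
        "dan\t/shop\t404\n",
    ]
    records, rejected = clean(raw_log)
    print(records, rejected)
```

In a Pig script the same idea would typically be a load followed by a filter, with the rejected rows optionally stored separately for inspection.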
We can use Pig for research on raw data:
- User audit logs.
- Data whose schema may be unknown or inconsistent.