Apache Pig Running & Execution Mode
Apache Pig Running Modes
You can run/execute Pig Latin statements and commands in three modes:
- Interactive mode
- Batch mode
- Embedded mode (UDF)
Interactive mode
- For interactive processing, we can use Apache Tez or Apache Spark as the processing engine.
- Interactive processing means that the user provides the computer with instructions while it is processing, i.e. the user 'interacts' with the computer to complete the processing.
- For example, if you use an online system to book a hotel room, you will fill in a web form, submit it and it will come back to inform you of the room you have booked.
- You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command and then enter your Pig Latin statements and Pig commands interactively at the command line.
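A minimal Grunt session might look like the following sketch (the file name and field names are illustrative, not part of any standard example):

```pig
$ pig -x local          -- start the Grunt shell in local mode
grunt> A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, marks:int);
grunt> B = FILTER A BY marks >= 60;
grunt> DUMP B;          -- prints the filtered records to the console
grunt> quit;
```

Each statement is parsed as you enter it, and execution is triggered by commands such as DUMP or STORE.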
Batch mode
- In batch mode, we process the data set by set (batch by batch), each batch holding a limited number of records/rows, much like MapReduce or Hive.
- In Pig, we do this by placing our Pig Latin statements in a script file and using MapReduce as the processing engine.
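As a sketch, batch mode means writing the statements into a .pig script file and passing it to the pig command (the script and path names below are illustrative):

```pig
-- wordcount.pig (illustrative batch script)
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```

Run the whole script non-interactively with: $ pig -x mapreduce wordcount.pig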
Embedded mode(UDF)
- Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, Python, and Ruby, and using them in our scripts.
- This means running Pig logic, loading data, or manipulating data in Pig using UDFs written in Java, Python, or Ruby.
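As a minimal sketch of a Python (Jython) UDF, the function below could live in a file such as myudfs.py (the file, function, and field names are illustrative):

```python
# myudfs.py -- a Jython UDF for Pig
# @outputSchema tells Pig the schema of the value the function returns
@outputSchema("name:chararray")
def to_upper(word):
    return word.upper()
```

It would then be registered and used from a Pig script like this:

```pig
REGISTER 'myudfs.py' USING jython AS myfuncs;
A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, marks:int);
B = FOREACH A GENERATE myfuncs.to_upper(name);
```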
Execution Modes of Pig
- Local Mode
- Tez Local Mode
- Spark Local Mode
- Mapreduce Mode
- Tez Mode
- Spark Mode
Local Mode
- In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
$ pig -x local
Tez Local Mode
- Use this command to run Pig in Tez local mode.
- It is similar to local mode, except that internally Pig will invoke the Tez runtime engine.
$ pig -x tez_local
Note: Tez local mode is experimental. There are some queries which just error out on bigger data in local mode.
Spark Local Mode
- Use this command to run Pig in Spark local mode.
- It is similar to local mode, except that internally Pig will invoke the Spark runtime engine.
$ pig -x spark_local
Note: Spark local mode is experimental. There are some queries which just error out on bigger data in local mode.
MapReduce Mode
- MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
- You need access to a Hadoop Cluster and HDFS installation.
- MapReduce mode is the default mode, so you don't need to specify it when starting Pig.
$ pig -x mapreduce (or) $ pig
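As an illustrative sketch (the HDFS paths and field names are assumptions), in MapReduce mode the statements read from and write to HDFS, and each DUMP or STORE triggers a MapReduce job in the back-end:

```pig
$ pig -x mapreduce
grunt> A = LOAD '/user/hadoop/input/sales.txt' USING PigStorage('\t')
           AS (item:chararray, amount:double);
grunt> B = GROUP A BY item;
grunt> C = FOREACH B GENERATE group AS item, SUM(A.amount) AS total;
grunt> STORE C INTO '/user/hadoop/output/sales_totals';  -- runs as a MapReduce job
```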
Tez Mode
- To run Pig in Tez mode, you need access to a Hadoop cluster and an HDFS installation.
- Internally, Pig will invoke the Tez runtime engine for processing on the cluster.
$ pig -x tez
Spark Mode
- To run Pig in Spark mode, you need access to Spark as well as a Hadoop cluster and an HDFS installation.
- Internally, Pig will invoke the Spark runtime engine for processing on the cluster.
$ pig -x spark
- Pig scripts run on Spark can take advantage of the "dynamic allocation" feature. It can be enabled by setting the property inside the Pig script:
SET spark.dynamicAllocation.enabled true;
- In general, all properties in the Pig script prefixed with "spark." are copied to the Spark application configuration.
- Please note that the Spark shuffle service needs to be enabled as a YARN auxiliary service for dynamic allocation to work.
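A sketch of a Pig-on-Spark script that sets Spark properties via Pig's SET command (the property values and paths are illustrative assumptions):

```pig
-- Properties prefixed with "spark." are forwarded to the Spark application
SET spark.dynamicAllocation.enabled true;
SET spark.executor.memory '2g';

A = LOAD '/user/hadoop/input/data.txt' AS (line:chararray);
DUMP A;   -- executed by the Spark runtime engine on the cluster
```

Run it with: $ pig -x spark script.pig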