Apache Pig Running & Execution Modes

Apache Pig Running Mode

You can run Pig Latin statements and commands in three modes.

  1. Interactive mode
  2. Batch mode
  3. Embedded mode (UDF)

Interactive mode

  • For interactive processing, we can use Apache Tez or Apache Spark as the processing engine.
  • Interactive processing means that the user provides the computer with instructions while it is doing the processing, i.e. the user 'interacts' with the computer to complete the processing.
    • For example, if you use an online system to book a hotel room, you fill in a web form, submit it, and the system comes back to inform you of the room you have booked.
  • You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell with the "pig" command and then enter your Pig Latin statements and Pig commands interactively at the prompt.
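A short Grunt session might look like the following sketch (the file name and schema here are illustrative, not from the original):

```pig
grunt> students = LOAD 'student.txt' USING PigStorage(',')
                  AS (id:int, name:chararray, marks:int);
grunt> passed   = FILTER students BY marks >= 40;
grunt> DUMP passed;
```

Each statement is parsed as you enter it, and DUMP (or STORE) triggers the actual execution.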

Batch mode

  • In batch mode we process the data set by set (batch by batch) with a limited number of records/rows per batch, as in MapReduce or Hive: the Pig Latin statements are written into a script file and the whole script is executed at once.
  • In Pig we do this using MapReduce as the processing engine.
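As a sketch, the statements can be collected into a script file and submitted as one batch (the name wordcount.pig and the paths are illustrative):

```pig
-- wordcount.pig: count occurrences of each word in the input
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```

The whole script is then run in one go with `$ pig wordcount.pig`.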

Embedded mode (UDF)

  • Apache Pig lets us define our own functions (User Defined Functions) in programming languages such as Java, Python, or Ruby, and use them in our scripts.
  • We can run Pig logic, load data, or manipulate data in Pig through UDFs written in Java, Python, or Ruby.
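As a sketch, a Python (Jython) UDF could look like this. The file name, function, and schema are illustrative; the outputSchema decorator is supplied by Pig's Python UDF runtime, so it is stubbed here only so the function can also run standalone:

```python
# myudfs.py -- a minimal Python UDF sketch for Apache Pig (names illustrative).
try:
    from pig_util import outputSchema  # available when run under Pig
except ImportError:
    def outputSchema(schema):
        # Stand-in for Pig's decorator: record nothing, return the
        # function unchanged so it can be tested outside Pig.
        def wrap(func):
            return func
        return wrap

@outputSchema('upper_name:chararray')
def to_upper(name):
    """Return the input string in upper case (None-safe)."""
    if name is None:
        return None
    return name.upper()
```

In a Pig script the file would then be registered with `REGISTER 'myudfs.py' USING jython AS myfuncs;` and called as `myfuncs.to_upper(name)`.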

Execution Modes of Pig

  1. Local Mode
  2. Tez Local Mode
  3. Spark Local Mode
  4. Mapreduce Mode
  5. Tez Mode
  6. Spark Mode

Local Mode

  • In this mode, all the files are installed and run from your local host and local file system. There is no need of Hadoop or HDFS. This mode is generally used for testing purpose.

            $ pig -x local

Tez Local Mode

  • Use this option to run Pig in Tez local mode.
  • It is similar to local mode, except that internally Pig invokes the Tez runtime engine.

    $ pig -x tez_local

Note: Tez local mode is experimental. There are some queries which just error out on bigger data in local mode.

Spark Local Mode

  • Use this option to run Pig in Spark local mode.
  • It is similar to local mode, except that internally Pig invokes the Spark runtime engine.

    $ pig -x spark_local

Note: Spark local mode is experimental. There are some queries which just error out on bigger data in local mode.

MapReduce Mode

  • MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
  • You need access to a Hadoop Cluster and HDFS installation.
  • MapReduce mode is the default mode, so you don't need to specify it when starting Pig.

    $ pig -x mapreduce (or) $ pig
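In this mode, LOAD and STORE paths refer to HDFS and each script compiles into one or more MapReduce jobs. A sketch (the paths and schema are illustrative):

```pig
-- reads from and writes to HDFS; executed as MapReduce jobs on the cluster
sales  = LOAD '/user/hadoop/sales.txt' USING PigStorage('\t')
         AS (item:chararray, amount:double);
grpd   = GROUP sales BY item;
totals = FOREACH grpd GENERATE group AS item, SUM(sales.amount) AS total;
STORE totals INTO '/user/hadoop/sales_totals';
```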

Tez Mode

  • To run Pig in Tez mode, you need access to a Hadoop cluster and an HDFS installation.
  • Internally, Pig invokes the Tez runtime engine for processing on the cluster.

    $ pig -x tez

Spark Mode

  • To run Pig in Spark mode, you need access to Spark as well as a Hadoop cluster and an HDFS installation.
  • Internally, Pig invokes the Spark runtime engine for processing on the cluster.

    $ pig -x spark

  • Pig scripts run on Spark can take advantage of Spark's "dynamic allocation" feature. It can be enabled by setting the following property in the Pig script.

    set spark.dynamicAllocation.enabled true;

  • In general, all properties in the Pig script prefixed with "spark" are copied to the Spark application configuration.
  • Please note that for dynamic allocation to work, the Spark shuffle service must be enabled as a YARN auxiliary service.
