04 - Pig Programming

4.1 What is Pig

Yahoo developed Pig and is one of the heaviest users of Hadoop, run 40 percent of all its Hadoop jobs with Pig.Twitter is also another well-known user of Pig.

Pig is a high-level Map Reduce based programming language built up over Hadoop platform. The language is commonly known as Pig Latin. Pig Latin works on the same way as the JAVA works for the implementation of Map Reduce; the only difference is that it is a high level language. Pig is a good starting point for writing some programs for beginners so that they can get familiarize with Hadoop eco system. Pig provides its own set of functions for programmers to use as toolbox. However on top of it they can build their own UDF (User Defined Functions). These UDFs can be written in Java, Python, Ruby and etc. and then call directly from the Pig Latin.

It is not easy task to implement Map Reduce Programs in any language. Till now we were thinking always in a procedural manner and try to accomplish any task. We start our program with defining some variables using some functions, array, loops, and conditional statements to write program. But now while using Hadoop we need to change our mindset because now we will write our program flow to implement Map Reduce programming model. We need to think at the level of mapper and reducer function. Pig simplifies the Hadoop programming by giving a high-level data processing language while keeps the scalability and reliability. Pig majorly has two major components –

-A high-level data processing language known as Pig Latin

-A compiler that compiles and runs Pig Latin script

 

Pig Latin has a very good capability to build the program with a far better ease and comforts with its powerful tool and internal architecture. Particularly its compiler helps to automatically achieve optimization; these definitely lower down the burden of programmer for having to do tuning in program.

Here are the major bullet points that differentiate Pig from SQL:

-Pig makes use of lazy evaluation

-ETL is being used

-Pipeline splits are being supported

-Execution plans can be declared

Pig Latin is a procedural language in nature in the pipeline paradigm while SQL is instead declarative. To have data in a single table from two or more different tables in SQL it must be joined in one or other way while in Pig you have several ways to keep data from two or more different tables. Pig Latin programming is very much similar to specifying a query execution plan;it makes it very easy for the programmer to explicitly control the flow of their data processing task. SQL handles trees naturally, but has no built in mechanism for splitting a data processing stream and applying different operators to each sub-stream. Pig Latin script describes a directed acyclic graph rather than a pipeline. Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development.

4.2 Pig Installation And Configuration

Pig Installation is on the same line as was that of Hadoop. Download the latest release of Pig from Apache website. Copy in into a folder,un tar the file using TAR command and this will take care of your Pig installation. Do the following settings so that to make it work –

  • Using command vi $HOME/.bashrc, Set the PIG_HOME environment variable
  • Set export PIG_HOME=/<<path>>/<<name of the pig unzipped pig folder>>
  • #export PATH=$PATH:$PIG_HOME/bin
  • #bash
  • $ cd /<<path>>/<<name of the pig unzipped pig folder>>
  • Set the value to local for exectype in file pig.properties which is present in conf directory

Start Pig using command $ bin/pig -x local and you will see the GRUNT shell that looks like grunt>

Run the help command to make sure Pig is working

grunt> HELP

Commands:

<piglatin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig

File system commands:

fs<fs arguments> - Equivalent to Hadoopdfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic commands:

describe<alias>[::<alias] - Show the schema for the alias. Inner aliases can be described as A::B.

explain [-script <pigscript>] [-out <path>] [-brief] [-dot] [-param<param_name>=<param_value>]

        [-param_file<file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.

        -script - Explain the entire script.

        -out - Store the output into directory rather than print to stdout.

        -brief - Don't expand nested plans (presenting a smaller graph for overview).

        -dot - Generate the output in .dot format. Default is text format.

-param<param_name - See parameter substitution for details.

-param_file<file_name> - See parameter substitution for details.

alias - Alias to explain.

dump<alias> - Compute the alias and writes the results to stdout.

Utility Commands:

exec [-param<param_name>=param_value] [-param_file<file_name>] <script> -

        Execute the script with access to grunt environment including aliases.

-param<param_name - See parameter substitution for details.

-param_file<file_name> - See parameter substitution for details.

script - Script to be executed.

run [-param<param_name>=param_value] [-param_file<file_name>] <script> -

        Execute the script with access to grunt environment.

-param<param_name - See parameter substitution for details.

-param_file<file_name> - See parameter substitution for details.

script - Script to be executed.

sh<shell command> - Invoke a shell command.

kill<job_id> - Kill the hadoop job specified by the hadoop job id.

set<key><value> - Provide execution parameters to Pig. Keys and values are case sensitive.

        The following keys are supported:

default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.

debug - Set debug on or off. Default is off.

        job.name - Single-quoted name for jobs. Default is PigLatin:<script name>

job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal

stream.skippath - String that contains the path. This is used by streaming.

anyhadoop property.

help - Display this message.

quit - Quit the grunt shell.

 

4.3 Writing Simple to Complex Pig Programs

The language of Pig is also known as Pig Latin. First we will visit the data types:

Simple Data Type

Type

Size

Int

Signed 32-bit integer

Long

Signed 64-bit integer

Float

32-bit floating point

Double

64-bit floating point

Bytearray

Binary object

chararray

String

 

Complex Data Type                                                                       

Type

Description

Tuple

When some fields gets arranged in order and form a row then it is known as Tuple, represented by fields separated by commas in a parentheses

(Apple, 15, 1000)

Bag

Bag is a collection of Tuples. This definitely resembles with our SQL Tables. Tuples can be duplicated in a Bag. Two Tuple in a Bag may or may not have same number of fields. A Bag is represented by curly braces that contains tuples

{<Apple, 15, 1000>,<Banana, 20, 1500>}

Map

Map is a set of key/value pairs. In Map:

Key –Should be unique and of type string

Value – Can be of any type

[Key#Value]

Let us perform some programming operations. We have created a space delimited file that contains following columns Emp_Name, Emp_Id, and Emp_Age. The records in file looks like -:

Employee Table

Emp_Name

Emp_Id

Emp_Age

John

10011

32

Johnathan

10012

28

Sally

10013

35

Salvia

10014

26

Case Sensitivity Nature of Pig

Here is the distinction of Case Sensitive and Case InSensitiveInstruction for Pig –

Case Sensitive –

  • Name of Relations
  • Name of Fields
  • Name of Pig Latin Functions (PigStorage, COUNT)

Case InSensitive –

  • Name of Parameters
  • Pig Latin Keywords (LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, DUMP)

Instruction for Identifier

Identifiers includes following:

  • Name of relations (aliases)
  • Name of fields
  • Name of variables

Identifier name should start with a letter and then it can be followed by any number of letters, digits, or underscores.

LOAD

The first task is to read this file into Pig, for this purpose we will use the LOAD command

    grunt> Employee = LOAD '/SOFT/PigExercise/EmployeeData' USING PigStorage(' ') as (Emp_Name:chararray, Emp_Id:int, Emp_Age:int);

After execution of this query an Employee table got created, Note that our input file EmployeeData was not having any extension like .txt or etc hence we have simply written the file name without extension (Pig accept it). We have passed single space as parameter in PigStorage() function, This is because our input file was delimited by space.

PigStorage loads and stores data in the structured text files. Note that the default field delimiter for PigStorage is tab ('\t'). We definitely have option to specify other characters as field delimiters by enclosing them in single quotes.

DUMP

To check the table content in Pig DUMP command needs to be used

     grunt> DUMP Employee;

It is showing the output as

    (John,10011,32)

    (Johnathan,10012,28)

    (Sally,10013,35)

    (Salvia,10014,26)

This confirms that Employee table is created. We had 4 records in the file and all 4 are being displayed over here. Sometimes we might have thousands of records in input file and after loading in to a table, if we want to run the DUMP command to check whether table got created successfully then this command will display all thousands records, to restricts the number of records that should display on the screen we can use LIMIT command.

LIMIT

In the example above if we want to display only 3 records out of 4 then we can do it as

    grunt>Lmt = LIMIT Employee 3;

    grunt> DUMP Lmt;

It is showing the output as

    (John,10011,32)

    (Sally,10013,35)

    (Johnathan,10012,28)                  

LIMIT command limits the number of tuples to a specified number. When we use it right after a table that was processed by an ORDER by clause operator then LIMIT returns the first n tuples. Note that here there’s no guarantee which tuples are returned. The LIMIT command does not perform categorization since it is not a read write operator. Please note that after executing of LIMIT command in our above example the records for John, Johnathan and Sally (the first 3 records) does not return in an order. Instead it displays records in order of John, Sally and Johnathan.

STORE

Now the next is to save Employee table in Pig, for this purpose we will use STORE command

     grunt> STORE Employee INTO '/SOFT/PigOutput';

This statement will store the Employee table in the directory /SOFT/PigOutput. Make sure this folder PigOutputshould not be present before running above statement otherwise an error would be prompted as

     2014-02-19 13:08:20,731 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN

     2014-02-19 13:08:20,736 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000:

     <line 4, column 0> Output Location Validation Failed for: '/SOFT/PigOutput More info to follow:

     Output directory /SOFT/PigOutput already exists

DESCRIBE

When you want to check the column name and type of column name then use DESCRIBE command. While executing the statement for Employee table

      grunt> DESCRIBE Employee;

Output received as

     Employee: {Emp_Name: chararray,Emp_Id: int,Emp_Age: int}

Like DESCRIBE we can use EXPLAIN and ILLUSTRATE as diagnostic operators in Pig language.

EXPLAIN – It displays the execution plan used to compute a relation.

ILLUSTRATE – It displaysthe step by step how data is transformed, starting with the load command and till the data come to a final resulting relation. In order to keep the display and processing manageable only a sample of the input data is used to simulate theexecution.

FOREACH GENERATE

Suppose we want to select few columns from an existing table then we can use FOREACH and GENERATE to accomplish this. We already have an Employee table and now to select only the Name and Age column then we can write something like

       grunt>Emp_Subset = FOREACH Employee GENERATE Emp_Name, Emp_Age;  

       grunt> DESCRIBE Emp_Subset;

Output of above statement shows only Name and Age column information.

       Emp_Subset: {Emp_Name: chararray,Emp_Age: int}

In the statement “FOREACH Employee GENERATE Emp_Name, Emp_Age“ Emp_Name and Emp_Age can be replaced by $0 and $2,  hence new statement can be “FOREACH Employee GENERATE $0, $2“, as Emp_Name and Name_Age are the first and third column in original Employee table. The rule is column1 is represented by $0, column2 by $1, column3 by $2, column4  by $3 and so on.

AVG

AVG function Calculates the average of numeric values in a single-column bag.

Suppose we have following data in table FruitAvg column1, column2 are FruitName, FruitCount respectively.

Apple 16

Banana 20

Orange 22

Grapes 12

Fruit Fruit

Apple 10

Banana 16

Orange 14

Grapes 12

Note that row5 in above table is Fruit Fruit (Here FruitCount column assigned a word ‘Fruit’ instead of FruitCount value). We have taken this example intentionally, This is just to show you the power of Pig, in output this row would get assigned a null value as average since Pig is performing calculation only over numeric data. Our task is to identify the average of each fruit from the table. First we will make the group on FruitName column in table FruitAvg and form a new table FruitAvgGrp using query

      grunt>FruitAvgGrp = GROUP FruitAvg BYFruitName; 

Now to generate the average on column FruitCount, for this we will use the following query that is using AVG function and generating output in a new table FruitAvgCalc.

      grunt>FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, AVG(FruitAvg.FruitCount) AS FruitCount;

Dumping FruitAvgCalc to check what data got stored in it

      grunt> DUMPFruitAvgCalc;

({(Apple),(Apple)},13.0)

({(Fruit)},)

({(Banana),(Banana)},18.0)

({(Grapes),(Grapes)},12.0)

({(Orange),(Orange)},18.0)

Along with the average the output is also showing the number of instance a FruitNameappered for generating average, in our case it was 2.

COUNT

COUNT Calculate the number of tuples in a bag.

In the above example we are doing a slight modification in FruitAvg table and removing row5 that contains ‘Fruit Fruit’ data. We have already grouped on FruitName in FruitAvgGrptable, now to generate the count we will use the query

       grunt>FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, COUNT(FruitAvg.FruitCount) AS FruitCount;

Dumping to check the output

({(Apple),(Apple)},2)

({(Banana),(Banana)},2)

({(Grapes),(Grapes)},2)

({(Orange),(Orange)},2)

Note that while counting tuples COUNT function ignores the null values and count only the not null values.

MAX

Calculate the maximum value in a bag.

In table FruitAvg we are doing some modification so the table looks like as

Apple 16

Banana 20

Orange 22

Grapes 12

Apple 10

Banana 16

Orange 14

Grapes 12

Grapes 32

Watermelon 8

We have already grouped on FruitName in FruitAvgGrptable, now to generate the max we will use the query

       grunt>FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, MAX(FruitAvg.FruitCount) AS FruitCount;

Dumping to check the output

({(Apple),(Apple)},16)

({(Banana),(Banana)},20)

({(Grapes),(Grapes),(Grapes)},32)

({(Orange),(Orange)},22)

({(Watermelon)},8)

MIN

Calculate the minimum value in a bag.

Referencing above MAX example we are replacing MAX by MIN and let us see what output it generates

        grunt>FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, MIN(FruitAvg.FruitCount) AS FruitCount;

Dumping to check the output

({(Apple),(Apple)},10)

({(Banana),(Banana)},16)

({(Grapes),(Grapes),(Grapes)},12)

({(Orange),(Orange)},14)

({(Watermelon)},8)

SUM

Calculate the sum of numeric values in a bag.

Referencing above MAX example we are replacing MAX by SUM and let us see what output it generates

       grunt>FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, SUM(FruitAvg.FruitCount) AS FruitCount;

Dumping to check the output

({(Apple),(Apple)},26)

({(Banana),(Banana)},36)

({(Grapes),(Grapes),(Grapes)},56)

({(Orange),(Orange)},36)

({(Watermelon)},8)

CONCAT

Concatenate two strings of type chararray or bytearrays.

We have following two columns A and B stored in a table InputSample

Apache

Pig

Apache

Hive

Apache

HBase

Apache

Sqoop

Apache

ZooKeeper

 

Following query would be run to concatenate the two columns using CONCAT function

       grunt>CONCATAandB = FOREACH InputSample GENERATE A, B, CONCAT(A,B) AS C;

Let us see what we get if we DESCRIBE table CONCATAandB

       grunt> DESCRBECONCATAandB;                                            

CONCATAandB: {A: chararray,B: chararray,C: chararray}

A new column C got created with the same data type of chararray.

Dumping to check the output

       grunt> DUMP CONCATAandB;

(Apache,Pig,ApachePig)

(Apache,Hive,ApacheHive)

(Apache,HBase,ApacheHBase)

(Apache,Sqoop,ApacheSqoop)

(Apache,ZooKeeper,ApacheZooKeeper)

Both first and second column got concatenated in third column.

COUNT_STAR

COUNT_STAR function is used to calculate the number of elements in a bag. Following is must for COUNT_STAR function to work –

-Preceding GROUP ALL statement for global counts

-GROUP BY statement for group counts.

COUNT_STAR includes NULL values in the count computation, this is very unlike to the COUNT function that ignores NULL values.

DIFF

DIFF function performs comparison between two bags and returned the difference of tuples between two bags. We need to pass bags as arguments and DIFF function will compare them. If any tuples that are in one bag but not present in other bag then those tuples would be returned in a bag. In the case of when all the tuples of both bag matches then an empty bag would be returned.

SIZE

It calculates the size of the data contained; Note that for a bag it counts the number of tuples while for a tuple it counts the number of elements. Referencing above table InputSample which we used in CONCAT example and we are using SIZE function in it for column B. the query goes like as

        grunt>ExampleSIZE = FOREACH InputSample GENERATE B, SIZE(B) AS BSIZE;

Dumping to check the output

(Pig,3)

(Hive,4)

(HBase,5)

(Sqoop,5)

(ZooKeeper,9)

As clear from the output that in case of working with chararray SIZE function returns the number of character in data. Similarly it works for other data type as:

-For int it returns 1

-For long it returns 1

-For float it returns 1

-For double it returns 1

-We have already seen for chararray

-For bytearray it returns the number of bytes in the array

-For tuple it returns the number of fields in the tuple

-For bag it returns number of tuple in that bag

-For map it returns the number of key value pair in that map

TOKENIZE

It Split a string of type chararray into a bag of words where each word is a tuple in the bag.Word separators could be space, double quote ("), comma, parentheses and asterisk (*).

We have tuple with data “My First Pig Program” in a table String. To get this entire string into a single tuple we cannot use space as separator since this string itself contains embedded space. So we will use ‘\n’ new line character separator. The query goes like as

        grunt> String = LOAD '/SOFT/PigExercise/String' USING PigStorage('\n') as (A:chararray);

Let us use TOKENIZE function to tokenize this string.

       grunt> TOKEN = FOREACH String GENERATE A, TOKENIZE(A);

Dumping to check the output

(My First Pig Program,{(My),(First),(Pig),(Program)})

IsEmpty

It Check whether a bag or map is empty; Note that the IsEmpty function checks if bag or map has no data. This function is normally used to filter data.

Math Functions (Alphabetically Sorted)

ABS

ABS function is used toget the absolute value from an expression. Note that when using the ABS function If the returned result is positive then the result is returned, otherwise if the result is negative then the negation of the result is returned.

ACOS

ACOS is a mathematical trigonometric function that is used to return the arc cosine part from an expression.

ASIN

ASIN is a mathematical trigonometric function that is used to return the arc sine part from an expression.

ATAN

ASIN is a mathematical trigonometric function that is used to return the arc tangentpart from an expression.

CBRT

CBRT function returns the cube root from an expression.

CEIL

CEIL function is used to return the rounded and nearest to the integervalue from an expression. CEIL will never ever lower the result. Let us have look below where we have listed many possibilities and outcome after using CEIL –

If a = 2.3 then CEIL(a) = 3

If a = 7.5 then CEIL(a) = 8

If a = 4.7 then CEIL(a) = 5

If a = 2.0 then CEIL(a) = 2

If a = -2.3 then CEIL(a) = -2

If a = -7.5 then CEIL(a) = -7

If a = -4.7 then CEIL(a) = -4

COS

COS function is used to return the trigonometric cosine from an expression.

COSH

COSH function is used to return the hyperbolic cosine from an expression.

EXP

EXP function is used to calculate the exponential value. EXP(x) will calculate the Euler’s number e raised to the power x.

FLOOR

FLOOR function is used to return the rounded and farthest to the integer value from an expression. FLOOR will always lower the result.Let us have look below where we have listed many possibilities and outcome after using FLOOR –

If a = 2.3 then FLOOR(a) = 2

If a = 7.5 then FLOOR(a) = 7

If a = 4.7 then FLOOR(a) = 4

If a = 2.0 then FLOOR(a) = 2

If a = -2.3 then FLOOR(a) = -3

If a = -7.5 then FLOOR(a) = -8

If a = -4.7 then FLOOR(a) = -5

LOG

LOG function is used to return the natural logarithm from an expression.

LOG10

LOG10 function is used to return the base 10 logarithm from an expression.

RANDOM

RANDOM function is used to return a pseudo random number of data type double that is greater than or equal to 0.0 and less than 1.0. It is basically used to generate the random number.

ROUND

ROUND function is used to perform rounding of an expression that is rounded to an integer. Let us have look below where we have listed many possibilities and outcome after using ROND –

If a = 2.3 then ROUND(a) = 2

If a = 7.5 then ROUND(a) = 8

If a = 4.7 then ROUND(a) = 5

If a = 2.0 then ROUND(a) = 2

If a = -2.3 then ROUND(a) = -2

If a = -7.5 then ROUND(a) = -7

If a = -4.7 then ROUND(a) = -5

SIN

SIN function is used to return the sine from an expression.

SINH

SINH function is used to return the hyperbolic sine from an expression.

SQRT

SQRT function is used to calculate the positive square root from an expression.

TAN

TAN function is used to return the trigonometric tangent from an angle.

TANH

TANH function is used to return the hyperbolic tangent from an expression.

Date Time Functions (Alphabetically Sorted)

AddDuration

AddDuration function is used to createa new datetime object by adding some datetimeduration to given datetime object.

CurrentTime

CurrentTime function is used tocreate a datetime object of current time with having accuracy in millisecond.

DaysBetween

DaysBetween function is used to get the number of days between two datetime objects.

GetDay

GetDay function is used to extract the day of a month from the datetime object.

GetHour

GetHour function is used to extract the hour of a day from the datetime object.

GetMilliSecond

GetMilliSecond function is used to retrieve the millsecond of a second from the datetime object.

GetMinute

GetMinute function is used to retrieve the minute of anhour from the datetime object.

GetMonth

GetMonth function is used to retrieve the month of a year from the datetime object.

GetSecond

GetSecond function is used to retrieve the second of a minute from the datetime object.

GetWeek

GetWeek function is used to retrieve the week of a year from the datetimeobject.

GetYear

GetYear function is used to get the year from the datetime object.

HoursBetween

HoursBetween function is used to retrievethe number of hours between two given datetime objects.

MilliSecondsBetween

MilliSecondsBetween function is used to extract the number of millseconds between two datetime objects.

MinutesBetween

MinutsBetween function is used to extract the number of minutes between two datetime objects.

MonthsBetween

MonthsBetween function is used to extract the number of months between two datetime objects.

SecondsBetween

SecondsBetween function is used to retrieve the number of seconds between two datetime objects.

SubtractDuration

SubtractDuration function is used to create a new datetime object by subtracting some duration from a datetime object.

ToDate

ToDate function is used to create a DateTime object. This function uses the default Time Zone when the timezone is not specified.

ToMilliSeconds

ToMilliSeconds function converts the DateTime to the number of milliseconds Since January 1, 1970 00:00:00.000 GMT.

ToString

ToString function converts the DateTime to the string format.

ToUnixTime

ToUnixTime function is used to convert the DateTime to the Unix Time.

WeeksBetween

WeeksBetween function is used to extractthe number of weeks between two datetime objects.

YearsBetween

YearsBetween function is used to extractthe number of years between two datetime objects.

Like us on Facebook