Pig UDFs can be easily implemented in Java. Below are the steps to create a UDF using Eclipse.
First, download the latest pig.jar and include it in the build path; otherwise the code will not compile. Every new function has to extend either the ‘EvalFunc’ class or another base class such as ‘LoadFunc’. All of these base classes reside in pig.jar.
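The source of the UDF itself is not shown here. As a minimal sketch of what a trimming function might contain, the null-safe core logic can be written and tried out in plain Java before it is wrapped in a class that extends EvalFunc. The names TrimLogic and trimField are hypothetical and chosen only for illustration; the commented wrapper at the end shows roughly how the logic would sit inside exec() and assumes pig.jar is on the classpath.

```java
// Hypothetical helper holding the core logic of a trimming UDF.
// It has no Pig dependency, so it compiles and runs without pig.jar.
class TrimLogic {

    /** Returns the field as a trimmed string, or null when the field is null. */
    public static String trimField(Object field) {
        if (field == null) {
            return null; // Pig fields can be null; pass nulls through unchanged
        }
        return field.toString().trim();
    }

    public static void main(String[] args) {
        // Brackets make the removed whitespace visible in the output.
        System.out.println("[" + trimField("  IBM ") + "]");
    }
}

// Inside Pig, a wrapper along these lines would expose the logic as a UDF
// (assumption: pig.jar on the build path; imports are java.io.IOException,
// org.apache.pig.EvalFunc and org.apache.pig.data.Tuple):
//
// public class Trim extends EvalFunc<String> {
//     @Override
//     public String exec(Tuple input) throws IOException {
//         if (input == null || input.size() == 0) {
//             return null;
//         }
//         return TrimLogic.trimField(input.get(0));
//     }
// }
```

Keeping the logic in a plain static method is only one possible layout; it makes the function easy to check outside of a Pig session.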
Once the UDF is compiled and exported as a JAR, register the JAR from the Grunt shell:
grunt> register 'your_path_to_jar/NewUDF.jar';
The define statement can assign a short alias to the fully qualified name of the UDF. This makes the script more readable and avoids writing the full package and class name everywhere the function is used.
grunt> define TRIM com.hadoop.pig.Trim();
grunt> divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
       date:chararray, dividends:float);
grunt> trimmed = foreach divs generate TRIM(symbol);
Alternatively, the udf.import.list property can be set on the command line to give Pig a list of packages to search when looking for UDFs.
So we change our invocation to:
pig -Dudf.import.list=org.apache.pig.piggybank.evaluation.string register.pig
Using yet another property, we can get rid of the register command as well.
If we add the following option to the command line, the register command is no longer necessary.
-Dpig.additional.jars=/usr/local/pig/piggybank/piggybank.jar
Creating a UDF (without Eclipse)
The source file should be named TrimTo.java, matching the class name TrimTo. The package name should be the same as the name of the folder where the Java file resides, i.e. 'myNewUdf'.
$ cd myNewUdf/
$ ls -l
total 8
-rw-rw-r-- 1 userName userName 1162 Feb 21 16:33 TrimTo.java
:~/pig/myNewUdf$ javac -classpath /home/userName/pig/trunk/pig.jar TrimTo.java
The compiled class file is now visible:
:~/pig/myNewUdf$ ls -l
total 8
-rw-rw-r-- 1 userName userName 1917 Feb 21 16:45 TrimTo.class
-rw-rw-r-- 1 userName userName 1162 Feb 21 16:33 TrimTo.java
:~/pig/myNewUdf$ cd ..
:~/pig$ jar cf myNewUdf.jar myNewUdf
grunt> REGISTER /home/userName/pig/myNewUdf.jar;
grunt> GrocPricesTrim = FOREACH GrocPrices GENERATE myNewUdf.TrimTo(PRODUCTNAME);
grunt> ILLUSTRATE GrocPricesTrim;
Creating and using Macros
Macros are declared with the define statement. A macro takes a set of input parameters; the string values passed in are substituted for the parameters when the macro is expanded. The name of the output relation is given in the returns clause, and the body of the macro is enclosed in braces ({}). Inside the body, parameters and the output relation are referenced with a $ prefix, for example $daily or $analyzed.
-------- macro.pig --------
define dividend_analysis (daily, year, daily_symbol, daily_open, daily_close)
returns analyzed {
divs = load 'NYSE_dividends' as(exchange:chararray,symbol:chararray,
date:chararray, dividends:float);
divsthisyear = filter divs by date matches '$year-.*';
dailythisyear = filter $daily by date matches '$year-.*';
jnd = join divsthisyear by symbol, dailythisyear by $daily_symbol;
$analyzed = foreach jnd generate dailythisyear::symbol, $daily_close
- $daily_open;
};
------- on the Grunt shell ---------
daily = load '/home/share/Customer-Bigdata-Analysis/NYSE_daily.txt'
as (exchange:chararray, symbol:chararray,date:chararray, open:float,
high:float, low:float, close:float,volume:int, adj_close:float);
import '/home/cs246/PigPPT/macro.pig';
results = dividend_analysis(daily, '2009', 'symbol', 'open', 'close');
describe results;
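To make the substitution step concrete, here is a small Java sketch that replaces $name placeholders in macro body lines with the arguments from the call above. It is an illustration only and not Pig's actual implementation, which does more than plain text replacement. The class name MacroExpansionSketch and the method expand are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of macro parameter substitution; not how Pig itself is implemented.
class MacroExpansionSketch {

    // A placeholder is a '$' followed by an identifier, e.g. $daily or $daily_close.
    private static final Pattern PARAM =
            Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    /** Replaces every known $name in the body with its argument value. */
    public static String expand(String body, Map<String, String> args) {
        Matcher m = PARAM.matcher(body);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            // Unknown placeholders are left untouched.
            String value = args.containsKey(name) ? args.get(name) : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] unused) {
        // Arguments from: results = dividend_analysis(daily, '2009', 'symbol', 'open', 'close');
        Map<String, String> args = new LinkedHashMap<>();
        args.put("daily", "daily");
        args.put("year", "2009");
        args.put("daily_symbol", "symbol");
        args.put("daily_open", "open");
        args.put("daily_close", "close");
        args.put("analyzed", "results"); // output relation named by the caller

        System.out.println(expand(
                "dailythisyear = filter $daily by date matches '$year-.*';", args));
        System.out.println(expand(
                "$analyzed = foreach jnd generate dailythisyear::symbol, $daily_close - $daily_open;",
                args));
    }
}
```

With these arguments, the filter line becomes `dailythisyear = filter daily by date matches '2009-.*';`, which shows why the year is passed as a string and why column names such as 'open' and 'close' are quoted in the call.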