
SAS AND HADOOP:

Integration of Data and Analytics

This white paper covers the basics of using Hadoop for data storage and
retrieval via SAS, and how these two technologies merge.

Gyan Gaurav
12/22/2016

SAS AND HADOOP: Integration of Data and Analytics
For many years, mining huge volumes of data and extracting meaningful information from
it has been a great challenge. Over the years the volume of data has increased
exponentially and analytics has become more comprehensive. In this domain, Hadoop has
revolutionized the way data is stored and retrieved. SAS, on the other hand, has long
been a leader in analytics.

When Hadoop was introduced, it provided a completely new approach to storing and
retrieving data, and SAS was quick to integrate Hadoop within its programming
framework.

This paper explores how SAS has seamlessly integrated Hadoop within its programming
framework, and how a SAS user can write code that runs on a Hadoop cluster and takes
advantage of Hadoop's massive parallel processing power.

SAS has the capability to easily process all data residing on Hadoop. It can process
feeds and format data in a meaningful way to get optimum performance from analytics.
One can use SAS to query Hive and Impala data and to run the SAS DATA step against it.

Initial configuration setting with Hadoop


For this paper, let us assume that the standard Hadoop configuration file (config.xml)
resides at the location \\machine\config.xml.

Now, if we want to move data into Hadoop, the process is very simple in SAS, and for a
native SAS programmer it is very easy to understand and use.


The SAS statement to move the data into Hadoop would look like:

filename cfg "\\machine\config.xml";

proc hadoop options=cfg authdomain="HADOOP" verbose;
   hdfs copyfromlocal="C:\user\data\file.txt"
        out="/user/sas/file_output.txt" overwrite;
run;

This program will do the following:

•  Connect to Hadoop using the authentication domain named "HADOOP" that is
   defined in SAS Metadata.
•  Copy the file from the local file system to HDFS at the location /user/sas.
•  Because the AUTHDOMAIN= option is used and the "HADOOP" authentication domain
   is already set up, there is no need to specify USERNAME= and PASSWORD=.

These simple steps place the data within HDFS, and then all the power of Hadoop can be
unleashed to process the data optimally.
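
If an authentication domain has not been defined in metadata, the same copy can also be
performed by supplying credentials directly with the USERNAME= and PASSWORD= options of
PROC HADOOP. Below is a minimal sketch of that variant; the user ID, password and target
directory are placeholders, and the HDFS MKDIR= option is used here only to illustrate
creating the target directory first:

filename cfg "\\machine\config.xml";

/* placeholder credentials, for illustration only */
proc hadoop options=cfg username="myUserID" password="myPassword" verbose;
   /* create the target directory in HDFS */
   hdfs mkdir="/user/sas";
   /* copy the local file into HDFS, overwriting any existing copy */
   hdfs copyfromlocal="C:\user\data\file.txt"
        out="/user/sas/file_output.txt" overwrite;
run;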


Executing MAPREDUCE code from SAS

The next logical step is to see how we can run a MapReduce program on Hadoop from SAS.
Let's have a look at the approach:

proc hadoop options=cfg verbose authdomain="HADOOP";
   hdfs delete="/user/sas/outputs";
   mapreduce
      input="/user/sas/hamlet.txt"
      output="/user/sas/outputs"
      jar="C:\Public\tmp\wcount.jar"
      map="org.apache.hadoop.examples.WordCount$TokenizerMapper"
      reduce="org.apache.hadoop.examples.WordCount$IntSumReducer"
      combine="org.apache.hadoop.examples.WordCount$IntSumReducer"
      outputvalue="org.apache.hadoop.io.IntWritable"
      outputkey="org.apache.hadoop.io.Text";
run;

Here again it is very evident that SAS has greatly simplified the process of writing
MapReduce code within SAS. Two points are worth mentioning:

1. The overhead of creating the connection is completely hidden from the programmer;
the first three simple lines of SAS code do all the hard work.
2. The basic flavor of configuring MapReduce code is kept the same.

The bottom line is that if a user knows basic SAS coding and the steps to configure a
MapReduce job, this method of coding is very simple for the programmer.
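
Once the job has finished, the reducer output can be pulled back out of HDFS and read
with an ordinary DATA step. The sketch below assumes the typical WordCount output file
name (part-r-00000) and a local target path chosen purely for illustration:

proc hadoop options=cfg verbose authdomain="HADOOP";
   /* copy the reducer output from HDFS to the local file system        */
   /* (part-r-00000 is the usual WordCount output name, not guaranteed) */
   hdfs copytolocal="/user/sas/outputs/part-r-00000"
        out="C:\Public\tmp\wordcount_results.txt";
run;

/* read the tab-delimited word counts into a SAS data set */
data work.wordcount;
   infile "C:\Public\tmp\wordcount_results.txt" dlm='09'x truncover;
   input word :$50. count;
run;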


Executing PIG LATIN code from SAS


The same is also true for the PIG LATIN language. Let's have a look at that as well.

filename W2A8SBAK temp;

data _null_;
   file W2A8SBAK;
   put "A = load '/user/sas/hamlet.txt';";
   put "B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;";
   put "C = filter B by (word matches 'The');";
   put "D = group C by word;";
   put "E = foreach D generate COUNT(C), group;";
   put "store E into '/user/sas/pig_theCount';";
run;

proc hadoop options=cfg verbose authdomain="HADOOP";
   pig code=W2A8SBAK;
run;
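
If the Pig Latin script already exists as a file on disk, the DATA _NULL_ step is not
needed at all; the fileref can simply point at the script file. A minimal sketch,
assuming a hypothetical script location:

/* point a fileref at an existing Pig Latin script (hypothetical path) */
filename pigsrc "C:\Public\scripts\wordcount.pig";

proc hadoop options=cfg verbose authdomain="HADOOP";
   pig code=pigsrc;
run;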

Executing SAS procedures on HDFS data

Up to this point we have seen only one side of the story. There is another aspect we
need to explore: what if the data is already stored in HDFS and we want to execute SAS
procedures on it? SAS has a solution for this as well. SAS uses its distributed data
storage format with the SAS Scalable Performance Data Engine (SPDE) for Hadoop. The
SPDE provides parallel access to partitioned data files from the SAS client.

Let’s have a look at a sample code.

libname testdata spde '/user/dodeca' hdfshost=default;

libname cars base '\\sash\cardata';

proc copy in=cars out=testdata;
   select cardata;
run;


This is a regular SAS program that copies a data set from one library to another. The
only point to note here is that the target library (testdata) resides in HDFS through
the SPDE, while the source (cars) is an ordinary Base SAS library.

Similarly, it is very simple to execute normal SAS procedures against HDFS data.
Programmers can simply point a libref at the data path and execute the procedures as
usual.

proc freq data=testdata.cardata;
   tables Color;
run;
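
Any other standard procedure can be run against the SPDE library in exactly the same
way. For example, a numeric column could be summarized with PROC MEANS; the variable
name below is hypothetical, since the columns of the cardata table are not listed in
this paper:

/* summarize a numeric column of the HDFS-resident table     */
/* (MSRP is an assumed variable name, for illustration only) */
proc means data=testdata.cardata n mean min max;
   var MSRP;
run;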

Accessing HIVE Data Using SAS/ACCESS Interface to HADOOP

HIVE is the data warehouse layer of Hadoop that provides SQL-like access to data stored
in HDFS. Just as with other databases, SAS provides a connector to it via the
SAS/ACCESS interface.

Let's have a look at the sample code to understand the procedure.

proc sql;
   connect to hadoop (server=duped user=myUserID);
   execute (create table myUserID_store_cnt
            row format delimited fields terminated by '\001'
            stored as textfile
            as
            select customer_rk, count(*) as total_orders
            from order_fact group by customer_rk)
   by hadoop;
   disconnect from hadoop;
quit;

/* simple libname statement */
libname myhdp hadoop server=duped user=myUserID;


/* Create a SAS data set from Hadoop data */
proc sql;
   create table work.join_test as
   select c.customer_rk, o.store_id
   from myhdp.customer_dim c, myhdp.order_fact o
   where c.customer_rk = o.customer_rk;
quit;
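
Because the Hive tables are surfaced through an ordinary libref, their metadata can
also be inspected like any other SAS library, for example with PROC CONTENTS:

/* inspect the column names and types of a Hive table through the libref */
proc contents data=myhdp.order_fact;
run;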

Below is a PROC FREQ example. The Hadoop LIBNAME exposes standard SAS
functionality such as PROC FREQ against Hadoop data.

/* PROC FREQ example */

/* Create a Hive table */
data myhdp.myUserID_class;
   set sashelp.class;
run;

/* Run PROC FREQ on the class table */
proc freq data=myhdp.myUserID_class;
   tables sex * age;
   where age > 9;
   title 'Frequency';
run;

Conclusion

As described in this paper, we can clearly see that SAS has seamlessly integrated the
Hadoop framework and the HDFS file system within its processing.

For a SAS user, using a Hadoop system as a source or target is as simple as using any
other data source. Also, the basic structure for configuring the HDFS framework and
MapReduce jobs is kept as it is.

SAS also has products such as SAS VISUAL ANALYTICS, SAS EVENT STREAM MANAGEMENT and
SAS REAL TIME DATA MANAGEMENT, which allow users to connect seamlessly to Hadoop and
process big data. Please refer to the SAS website for further details.
