Blog | heysoftware

SAS: Writing Stored Process Web Applications

Updated: Sep 19, 2018

With SAS being arguably one of the world’s leading analytics software platforms, most know SAS for its ability to analyze large amounts of data. As a Data Scientist, I utilize SAS programming on a daily basis, along with other programming languages and platforms such as R and Hadoop. Most do not know that SAS programs can be integrated into applications. A stored process in SAS is a SAS program that is stored on a server that can be executed by requesting applications such as web reports, analytics, and web applications. There are a number of benefits to storing your SAS programs on a server and executing it within applications, I would advise you to Google SAS stored processes for more information.

The purpose of this article is to demonstrate how to write several SAS stored processes to produce a web application for client use. It is assumed that you have already created your SAS server and metadata using SAS Metadata Server.

If you want to get straight to the SAS code, skip to the bottom of the Blog post to view both the Initial Page and Report 1 scripts.

First Stored Process:

A good tip to avoid initializing every libname in SAS is to create an individual SAS program that contains all the metadata information for all your library references, such as physical name, libname ref, and data source engine. Use libname _all_ list; to view metadata information for your libraries.

This SAS file can be saved to your local drive, shared drive, or server, which can be referenced by using the convention \\SERVER\SHARE\FOLDER\FILEPATH

This will stream line the process of calling data tables from different libraries without cluttering your SAS program with unnecessary code. This is something that I include in all of my SAS programs at the top of the page. My file is saved on my local shared drive so I reference it with the following convention.

%include “../../../data/share/folder1/LibnamesToInclude.sas”;

Instantiate your global variables along with your stored process reserved macro variables.

*Declare reserved server environment macro variables;
%global _GOPT_DEVICE _GOPTIONS _ODSOPTIONS;
%Global Var1 Var2 text;

We can now generate the top of the initial page by creating a null data set using the datalines4 statement. The datalines4 statement is used with the input statement to read data that contains semicolons that you enter directly in the program after the datalines4. The statement is closed out with 4 semicolons before the RUN statement. The _webout function serves as a way to execute and create the webpage while executing the stored process.

data _null_;
infile datalines;
file _webout;
input;
length Text $ 28000;
Text=resolve(_infile_);
PUT Text;
datalines4;
<HTML>
<BODY>
<hr/>
<h2>Welcome</h2>
<h4>The Welcome Page</h4>
<hr/>
<style>
.loader {
border: 16px solid #f3f3f3;
border-radius: 50%;
border-top: 5px solid blue;
border-right: 5px solid blue;
border-bottom: 5px solid purple;
border-left: 5px solid purple;
width: 20px;
height: 20px;
-webkit-animation: spin 2s linear infinite;
animation: spin 2s infinite linear;
}
@keyframes spin {
from {
transform:rotate(0deg);
}
to{
transform:rotate(360deg);
}
}

The loader CSS style is a class used to create a loader icon after the submit button is activated. The buttons CSS style sets the style of the buttons on our webpage.

.buttons
{
border-right:#ffffff 1px solid;
border-left:#ffffff 1px solid;
border-top:#ffffff 1px solid;
border-bottom:#ffffff 1px solid;
font-family:verdana;
font-weight:bold;
font-size:12px;
padding: 5
cursor:hand;
background-color:Blue;
color:White;
width: 150px;
height: 25px;
margin-right:5px;
}
input{
width: 250px;
height: 25px;
margin-right:10px;
}
input[type=submit] {
}
@media print
{#noprint {display:none;}}
</style>

The below javascript fetches the id element on the submit buttons then the validation functions are used to execute the loader CSS style we created above along with the message “Please wait…” to be displayed while the next page loads, which contains the second or third stored process.

<script type=”text/javascript”>
function getId(id) {
return document.getElementById(id);
}
function validation() {
getId(“submit_btn”).style.display=”none”;
getId(“wait_tip”).style.display=””;
return true;
}
function validation2() {
getId(“submit_btn2”).style.display=”none”;
getId(“wait_tip2”).style.display=””;
return true;
}
</script>

Inside the action argument of the form tag you will notice the global macro variable &_URL. &_URL is a reserved SAS stored process macro variable. It is created during the execution of the stored process on the server. The _URL specifies the location on the server of the stored process. This is used in conjunction with the value &_METAFOLDER macro variable to post the current request to the second or third stored process you have created (i.e., Report1 or Report2).

<FORM onsubmit=”return validation();” action=”&_URL” method=”POST”>
<label>
Number 1
<input type=”text” name=”CBox”/>
</label>
<label>
<input id =”submit_btn” type=”submit” class=”buttons” value=”Number 1"/><br>
<input type=”hidden” name=”_program” value=”&_METAFOLDER.Report1"</input>
</label>
<label>
<span id=”wait_tip” style=”display:none;” class=”loader”>Please wait…</span>
</FORM>
<FORM onsubmit=”return validation2();” action=”&_URL” method=”POST”>
Number 2
<input type=”text” name=”CBox”/>
</label>
<label>
<input id =”submit_btn2" type=”submit” class=”buttons” value=”Number 2"/><br>
<input type=”hidden” name=”_program” value=”&_METAFOLDER.Report2"</input>
</label>
<span id=”wait_tip2" style=”display:none;” class=”loader”>Please wait…</span>
</FORM>
<TABLE cellpadding=”5" cellspacing=”5">
<hr/>
<hr/>
;;;;
RUN;

The following code is optional if you would like to update the user on the current status of the web page being loaded. For more information regarding this and other SAS macros for SAS Stored Processes using Javascript visit: http://support.sas.com/resources/papers/proceedings11/257-2011.pdf

%let text=“Your Page is loading”;
%macro message_js(text) ;
%if %symexist(_odsdest) %then
%if %upcase(&_odsdest)=RTF or
%upcase(&_odsdest)=PDF %then
%return ;
data _null_ ;
file _webout;
put ‘<script type=”text/JavaScript”>’ ;
put “window.status = “”&text””;” ;
put ‘</script>’ ;
run ;
%mend message_js;
%message_js;

Next we are going to create the bottom of the webpage using the same approach we took by creating a null data set above.

data _null_;
infile datalines;
file _webout;
input;
length Text $ 28000;
Text=resolve(_infile_);
PUT Text;
datalines4;
</tbody>
</table>
</div>
</body>
</FORM>
</BODY>
</HTML>
;;;;
RUN;

Now that we have created the initial stored process, the webpage generated will display two textboxes and two submit buttons. Dependent upon which textbox the user enters the value, the submit button will redirect the SAS stored process to one of two additional stored processes, which will gather data from the specified tables in the SAS libraries.

For the sake of time we are going to assume the user entered a value into the first textbox.

OPTIONS MPRINT SYMBOLGEN; *macro debugging options;%include “../../../data/share/folder1/LibnamesToInclude.sas”;
*Declare reserved server environment macro variables;
%global _GOPT_DEVICE _GOPTIONS _ODSOPTIONS;
*ProcessBody;
*Declare input parameters;
%global Texts Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 ODSPUTOUT _ODSDEST _ODSSTYLE _ODSOPTIONS DATASET;

The Macro below queries several data tables within several SAS libraries. The input argument specified in the keyword parameter A is the value passed over from the initial page stored process when the user input the value into the textbox. You can insert additional keyword parameters if needed when you modify this script, but for now we will only use one. The &CBox value was passed from the initial page stored process when the submit button was clicked.

%macro TestA(A=&CBox);
%let CBox=&CBox;
%let Length=%sysfunc(lengthn(&CBox));
%if &Length >=12 %then %do;
%if “&CBox” ne “” %Then;
PROC SQL Noprint;
SELECT t1.Col1,t1.Col2,t1.Col3, min(t1.DateCol4) FORMAT DateTime20.,t1.Col5,t1.Col6,t1.Col7,t1.Col8,t1.Col9,t1.Col10 INTO :Var1, :Var2, :Var3, :Var4, :Var5,:Var6,:Var7,:Var8,:Var9,:Var10
FROM libname1.Table1 t1
WHERE t1.ID =”&CBox” and t1. flg = ‘N’;
QUIT;

The Proc SQL NOPRINT is used to create macro variables from the table specified in the query. The into statement assigns the values from the columns specified in the query to macro variables you create. The %put statement allows you to see the resolved values of several new macro variables in the SAS log.

%put &Var4 &Var5 &Var6;
%let EntryDate=%sysevalf(“&Var4”d);%put &EntryDate;

The data _null_ step is used to create macro variables such as the &Comp variable that can be used with other macro variables created above. We are going to treat &Var1 as a column that contains one unique ID referenced to the number entered in the text box that holds the name CBox on Initial Page.

data _null_;
Compare=intnx(“month”,today(),-3,’s’);
Call symputx(‘Comp’,Compare);
run;
%put &Comp;
%if (&EntryDate ge &Comp) %then %do;
Proc Sql;
create Table Work.TempTB1 as
select t1.* from libname1.Table1 t1
where t1.ID=”&CBox”;
quit;
PROC SQL;
CREATE TABLE WORK.TempTB2 AS
SELECT “&Var1” as Num, t1.* FROM libname2.Table2 t1
where t1.ID = &Var1;
Quit;
PROC SQL;
CREATE TABLE WORK.TempTB3 as
Select t1.* from libname3.Table3 t1 where t1.FKID=&Var1 and t1.ID=”&CBox”;
quit;
Proc SQL;
Create table work.TempTB4 as
select t1.* from libname4.Table4t1 where t1.FKID=&Var1 and t1.IDr=”&CBox”;
quit;
proc sql;
CREATE TABLE work.TempTB5 as
select t1.* from libname4.Table5 t1 where t1.FKID=&Var1 and t1.ID=”&CBox”
Order by t1.Date DESC;
quit;
proc sql;
create table work.TempTB6 as
select t1.* from libname5.Table6 t1 where t1.ID=”&CBox”;
quit;
Proc SQL;
Create Table Work.TempTB7 as
select t1.* from libname5.Table7 t1 where t1.ID=”&CBox”;
quit;
proc sql;
Create table Work.TempTB8 as
select t1.* from libname8.Table2 t1 where t1.ID=”&CBox”;
quit;
PROC SQL;
CREATE TABLE WORK.TempTB9 as select t1.* FROM libname8.Table3 t1
where t1.ID=”&CBox”;
quit;
%end;
%else %if (&EntryDate le &Comp) %then %do;
Data NoObs;
set libname1.Table1 libname2.Table1;
texts=””;
stop;
Run;

The above data _null_ statement creates a new Data table called NoObs that takes on the same structure as the two tables referenced. We stop the execution of merging the data in both tables by using the stop statement.

proc sql; insert into NoObs(Texts)
values(“The Number you entered &CBox data cannot be downloaded it is outside of the date range”);
quit;
proc sql noprint;
select Texts into :Texts
from NoObs;
quit;

The above Proc SQL NOPRINT creates a macro variable called Texts. Remember that the macro variable was declared as a global macro variables before its creation, which took place at the beginning of the script.

%put &Texts;
%end;
%end;
%else %if &Length <12 %then %do;
Data NoObs;
set libname1.Table1 libname2.Table1;
texts=””;
stop;
run;
proc sql; insert into NoObs(Texts)
values(“The number you entered &CBox is not the correct length”);
quit;
proc sql noprint;
select Texts into :Texts
from NoObs;
quit;
%put &Texts;
%end;
%mend TestA;
%TestA;

The purpose of the macro TestA is to query several data tables across multiple libraries such that the queries return the correct information that matches the input value of &CBox. We create macro variables in the initial Proc SQL NOPRINT statement that are also used within the macro to further query the data tables. Based on &CBox length and date ranges we execute different steps within the macro. Notice above when we call our macro TestA we do not have to specify the arguments for keyword parameter A. We do not have to include them because the arguments were already declared inside the macro with macro variable &CBox.

%macro PrintData;
%stpbegin;
%if (&Length >= 12) %then %do;
%if (&EntryDate ge &Comp) %then %do;
proc print data=Work.TempTB1 NOOBS;
run;
proc print data= Work.TempTB2 NOOBS;
run;
proc print data= Work.TempTB3 NOOBS;
run;
proc print data= Work.TempTB4 NOOBS;
run;
proc print data= Work.TempTB5 NOOBS;
run;
proc print data= Work.TempTB6 NOOBS;
run;
proc print data= Work.TempTB7 NOOBS;
run;
proc print data= Work.TempTB8 NOOBS;
run;
proc print data= Work.TempTB9 NOOBS;
run;
%end;
%else %if (&EntryDate le &Comp) %then %do;
proc print data=Work.NoObs NOOBS;
run;
%end;
%end;
%else %if (&Length < 12) %then %do;
proc print data=Work.NoObs NOOBS;
run;
%end;
ods html close;
%stpend;
%mend PrintData;

The macro PrintData is used to print all of the data tables we created using our macro TestA. You will notice the usage of %STPBEGIN and %STPEND in the PrintData macro. These two macros initialize and terminate the ODS to generate output from the stored process. Notice before the %STPEND statement the ods html close statement is used to close out our display of results as html. An ods html statement is not required at the beginning of the macro because we will be generating the html below in a data _null_ step and not saving the PrintData output to a file.

Similar to creating the welcome page we can construct the second webpage. Recall that we reference this second webpage after we submit our value in the first text box on the initial page.

*Create output;
data _null_;
infile datalines;
file _webout;
input;
length Text $ 2800;
Text=resolve(_infile_);
PUT Text;
datalines4;
<html>
<head>
<script src=”https://cdnjs.cloudflare.com/ajax/libs/xlsx/0.10.3/xlsx.full.min.js"></script>
<script src=”https://cdnjs.cloudflare.com/ajax/libs/FileSaver.js/1.3.3/FileSaver.min.js"></script>
<style type=”text/css”>
.container{
display: flex;
overflow: auto;
min-height: 50vh;
padding-top: 0px;
width: 100%;
backface-visibility: hidden;
will-change: overflow;
background-color: white;
border: 1px solid black;
}
.topnav {
overflow: hidden;
background-color: #4B0082;
}
.topnav a {
float: left;
color: white;
text-align: center;
padding: 14px 16px;
text-decoration: none;
font-size: 16px;
}
.topnav a:hover {
background-color: #C0C0C0;
color: black;
}
.topnav a:active {
background-color:gray;
color: white;
}
.topnav-center {
float: center;
}
.topnav-left
{
float:left;
}
.buttons{
float: left;
color: white;
text-align: center;
padding: 14px 16px;
text-decoration: none;
font-size: 16px;
}
.buttons:hover {
background-color: #C0C0C0;
color: black;
}
.buttons:active{
background-color:gray;
color: white;
}
display: inline-block;
input{
width: 25px;
height: 25px;
margin-right:10px;
}
input[type=submit] {
}
@media print
{#noprint {display:none;}}
</style>

The style CSS above specifies the display of the webpage generated after the Initial Page is executed and the request is posted to the SAS server to execute the Report1 stored process. Include javascript sources above if you plan on using javascript. The javascript included in this stored process extracts the 9 tables created in PrintData, stores the data into a Blob and writes the Blob to an excel file with multiple sheets. This javascript dynamically creates a blob based on the CBox value entered and the Work Tables that are created.

When dealing with large data, all use cases have to be taken into consideration along with business cases associated with the project. The business case associated with Report1 stored process is that not every table will contain information about &CBox or &Var1 due to the flow of data through the vast number of systems encountered. This limits us to range of 1 to 9 tables that could possibly be returned pertaining to the CBox value passed to the TestA macro keyword parameter A. When adapting the SAS code within this article, adjust it based on your business and use cases.

<script type=”text/javascript”>
function s2ab(s) {
var buf = new ArrayBuffer(s.length);
var view = new Uint8Array(buf);
for (var i=0; i!=s.length; ++i) view[i] = s.charCodeAt(i) & 0xFF;
return buf;
}
var csvToString=[];
var csv=[];
function PrepareCSV(i){
var csvToString=[];
var tables=document.getElementsByClassName(“table”);
var rows = document.querySelectorAll(“.table tr”);
var rowsinT=tables[i].querySelectorAll(“tr”);
//console.log(rowsinT);
var csv = [];
for (var l = 0; l < rowsinT.length; l++) {
var row = [], cols = rowsinT[l].querySelectorAll(“th.l.header,th.r.header,td.l.data,td.r.data”);
//console.log(cols);
for (var k=0; k<cols.length; k++)
row.push(cols[k].innerText);
csv.push(row);
}
return csv;
}
function exportTableToCSV() {
var tables=document.getElementsByClassName(“table”);
var blob,
wb={SheetNames:[],Sheets:{}};
var SheetName=[“Sheet1”,” Sheet2",” Sheet3",” Sheet4",”Sheet5",”Sheet6",”Sheet7",”Sheet8",”Sheet9"];
for(l=0;l<tables.length;l++){
var rowsinT=tables[l].querySelectorAll(“tr”);
//console.log(rowsinT);
var ws1 = XLSX.utils.aoa_to_sheet(PrepareCSV(l));//XLSX.read(PrepareCSV(l), {type:”array”}).Sheets.Sheet1;
console.log(ws1);
console.log(PrepareCSV(l));
wb.SheetNames.push(SheetName[l]); wb.Sheets[SheetName[l]]=ws1;
//console.log(wb.Sheets[SheetName[l]]);
//console.log(wb);
}
blob = new Blob([s2ab(XLSX.write(wb, {bookType:”xlsx”, type:”binary”,compression:false}))],{type: “application/octet-stream”});
console.log(blob);
saveAs(blob, “Report1Results.xlsx”);
}
</script>
<body>
<div class=”topnav”>
<div class=”topnav-left”>
<FORM action=”&_URL” method=”PUT”>
<label>
<input class=”buttons” type=”submit” value=”Home”</input>
<input type=”hidden” name=”_program” value=”&_METAFOLDER.Initial_Page”/>
</label>

This form button redirects the user back to the initial page after they have received their results on the display page (i.e., Export 1 to 9 of the Work.Temp tables we created using PrintData macro).

</FORM>
</div>
<div class=”topnav-center”>
<a type=”submit” onclick=”exportTableToCSV()”>Download Data</a>
</div>
</div>
<h1 align=”center” style=”font-family:verdana;”><b> Report 1 Results</b></h1>
<h2 align=”center” style=”font-family:verdana;”><b>Your results from your Number Search are below</b></h2>
<h3 align=”center” style=”font-family:verdana;”><b><font size=”6" color=”red”> &Texts</font></b></h3>

&Texts will display within the webpage dependent upon the value and length of CBox. This is based on the TestA and PrintData macros.

<div class=”container”>
<b align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Number You Entered: &CBox </font><b>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Var1 Number: &Var1 </font><p>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Date of &CBox: &Var4 </font><p>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Var2: &Var2 </font><p>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Var3: &Var3 </font><p>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Var5: &Var5 </font><p>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Var6: &Var6 </font><p>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Var7: &Var7 </font><p>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Var8: &Var8</font><p>
<p align=”left” style=”font-family:verdana;”><font size=”4" color=”black”> Var9: &Var9 </font><p>
</div>
</hr>
</hr>
</body>

Macro variables can be used within html only if they have been declared as global variables at the beginning of the script.

<meta http-equiv=”Content-Type” content=”text/html; charset=utf-8"/>
<div style=”display:none”>
;;;;
run;

Given the current business case we can have 1 to 9 tables created based on the value of &CBox and other macro variables such as &Var1 the tables can vary in length and dimension. To make the user’s experience easier and seamless, it is not a good idea to display all data tables on a single webpage. So before executing our PrintData macro, we will create a div tag and set the style display to none.

%PrintData;

The PrintData macro has been executed and the data tables have html code associated with them (again remember this is because of our ods html close statement inside of PrintData). This creates the html code for each of the 1 to 9 Work.Temp data tables dynamically given the value entered in the first textbox on the initial page stored process and queried in the TestA macro.

data _null_;
infile datalines;
file _webout;
input;
length Text $ 2800;
Text=resolve(_infile_);
PUT Text;
datalines4;
</div>
</Head>
</HTML>
;;;;
RUN;

We finally create the bottom of the webpage by using the data _null_ data step. Be sure to close out the div tag we created that encloses the PrintData macro results, this ensures that the tables do not display on the webpage.

Questions or feedback regarding this article, SAS stored processes, the html and/or javascript utilized in this article can be directed to breya@heysoftwareinnovation.com.

Breya is the Lead Data Scientist at HeySoftWare!,LLC — A Memphis based software consulting firm specializing in innovative solutions for customers of all kinds.

Below is what the Welcome Page will look like.

Below is what the second webpage will look like when we execute Report1.

React Http Request via Axios

Updated: Sep 18, 2018

This is a quick tutorial to get you up and running with a working React application as quickly as possible. When learning a new technology there are TONS of documentation that can be overkill for getting your feet wet in an effective way. This tutorial is for people that prefer to learn by Reverse Engineering and tinkering. I will I will not not go very deep in depth of how things work, but rather give you everything you need to have a 100% functioning result. This will allow you to tear it apart and in order to learn rapidly. So lets get started!

Generate a React app

First we want to install Node by going here: https://nodejs.org/en/

Lets the latest create-react-app functionality so that we can easily generate a basic app template to work with. Go to you command terminal and type:

npm install -g create-react-app

Paste the following command into your terminal in order to generate a new React app.

create-react-app ReactHttpDemo

Start your server

cd into you newly created application and type the following command into your terminal:

npm start

Navigate to http://localhost:3000/ within your browser and you should see something similar to the image below. Good job! you have a basic React app running.

Installing Axios (Will be needed to run your code)

Run the following command within terminal to install Axios:

npm install axios -S

React is a library and not a full featured framework. We are responsible for leveraging a third party package for handling http. Axios was chosen in this example. Axios leveraged promises which might be worth studying if you don’t know about them.

Getting to the code

Under the src directory, create an new folder named components. Inside this folder create a blank JavaScript file named Item.js. Your folder structure should now look similar to the below image.

Replace the contents of the src/App.css file with:

Replace the contents of the src/App.js file with:

Replace the contents of the src/components/Item.js file with:

Lets include bootstrap into our pasting the following cdn within the head of the public/index.html file:

<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.0/css/bootstrap.min.css"
integrity="sha384-9gVQ4dYFwwWSjIDZnLEWnxCjeSWFphJiwGPXr1jddIhOegiu1FwO5qRGvFXOdJZ4"
crossorigin="anonymous">

Running the Result

Type the following code into your terminal in order to run your solution:

npm start

Navigate to http://localhost:3000/ within your web browser and you should see something similar to the image below:

Click the button and you should see something similar to the image below:

Understanding what happened

Your application should now be working. Your base React component is called App which is defined in the src/App.js file. It’s job is to render the purple button that is displayed in your web browser. Upon clicking this button the handle click function is called.

Handle click uses Axios to make an HTTP request to the following endpoint, https://jsonplaceholder.typicode.com/users. The response data is then stored into the component state variable to be used by the Item component. Give this endpoint a look if you are curios as to what the data looks like. Ounce the component state contains data for persons state variable, it is then passed as a property to the Item component.

The data is used to populate the Item component with data that it needs to render.

Thats it!

This tutorial is meant to kick start the process of learning React. The goal was to get you to a working app as fast as possible. This way we can do what we do best…. Reverse Engineer!!!! I personally learn extremely fast this way and wanted to provide this for others who happen to appreciate a similar approach to learning.

Cadarrius McGlown

Lead Software Architect

Heysoftwaresolutions.com

Map Reduce example (Hadoop)

Updated: Jul 23, 2018

Introduction

Hadoop MapReduce is a software that easily writes applications that handle large amounts of data in parallel across distributed clusters of nodes (AKA computers). MapReduce is a single concept associated with Hadoop’s Distributed File System (HDFS). MapReduce usually splits the input data-set into independent subsets which are processed in a parallel manner. The outputs of the each map, which are then input into the reduce task. Both the input and output are both stored in a ﬁle-system. Typically the computing nodes and the storage nodes are the same. So the MapReduce framework and the HDFS are running on the same nodes. The concept of MapReduce is diﬀerent from that of Spark, for example, simply because MapReduce utilizes disk space to partition and perform computation whereas Spark uses RAM and residual disk space if needed.

The concept of distributed ﬁle systems is important to understand when dealing with big data because traditionally data was small enough to be handled on personal machines in single ﬁles. Now, data can be so large that the data itself has to be housed on a sever and computed across distributed systems and allocated disk space (MapReduce) or RAM space (Spark) when computations are performed. Below I demonstrate how to get started with the concept of MapReduce in Hadoop using Ubuntu OS. MapReduce is a very fundamental concept to master in data science because all other distributed ﬁle systems and memory allocated techniques stem from MapReduce.

Tutorial

1. Open Command Terminal

2. Head over to https://drive.google.com/drive/folders/1dXBCs_VUi9z3bhcDrtpvxxQLLg9miIac?usp=sharing and download the following ﬁles needed for the example. The ﬁles are compiled jar ﬁles that contain the MapReduce java code needed for Hadoop.

3. Create the following folders on your home directory: Hadoop_MapReduceExamples and the sub folder MapReduceTutorial and MapReduceData

4. Inside the MapReduceData folder place the following jar ﬁles inside: SaleCountry.jar, SalesCountry2.jar, and SalesJan2009.csv

5. Change your working directory similar to what is below

6. Get a list of the files in your newly made MapReduceData folder. Similar to what is below:

7. Start Hadoop by running the following line of code in the command terminal (direct to your local folder where Hadoop is installed):

8. Make the following directory inside of MapReduceData folder:

9. Inside of the newly created inputSalesExample folder we want to insert the following ﬁles inside: SalesJan2009.csv

10. Run yarn by calling the following code:

11. Change directories to a new folder on your desktop called: mapreduce_output_sales

12. Next you will run Mapreduce using the code inside of the SalesCountry2.jar ﬁle located in the MapReduceData folder. The output will be sent to the mapreduce_output_sales folder you created.

13. You should see output similar to this:

14. Lets review the results inside the mapreduce_output_sales folder.

15. Output should look similar to below

16. The results from Map reduce are located in the part-00000 file inside of the mapreduce_output_sales folder.

Breya Walker

Data Scientist - HeySoftware!