L & T Mahout Practice Examples

Logistic Regression
Create a folder in your home directory with the following commands:
cd $HOME
mkdir bank_data
cd bank_data
Download the data into the bank_data directory:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
Unzip the file:
unzip bank-additional.zip
cd bank-additional
Inside the bank-additional directory, list and inspect the files using the ls and gedit commands.
sed is a powerful stream editor on Linux that can be used as a data pre-processing tool. The general syntax is:
sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName
sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName
The commands to replace ; with , and to remove the " characters are as follows:
sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv
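As an optional sanity check (plain shell, not part of the Mahout workflow), view the first few lines of the converted file; the semicolons should now be commas and the double quotes gone:
head -3 input_bank_data.csv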
Remove the header line from the dataset:
sed -i '1d' input_bank_data.csv
Create a new directory and copy the file into it:
mkdir input_bank
cp input_bank_data.csv input_bank
Set Mahout to run in local mode rather than distributed mode:
export MAHOUT_LOCAL=TRUE
Split the dataset into training and test datasets using the Mahout split command:
mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30
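As an optional check with standard shell tools, compare the line counts; with --randomSelectionPct 30, the test file should hold roughly 30% of the records:
wc -l input_bank/input_bank_data.csv train_data/input_bank_data.csv test_data/input_bank_data.csv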

Restore the header line in the training and test datasets:
sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' train_data/input_bank_data.csv
sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' test_data/input_bank_data.csv
Train the model:
mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed --types n w w w w w w w w w n n n n w n n n n n --features 20 --passes 100 --rate 50 --categories 2
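For readability, the positional mapping between --predictors and --types in the command above works out as follows (n and w are shorthand for Mahout's numeric and word, i.e. categorical, predictor types):
# age=n, job=w, marital=w, education=w, default=w, housing=w, loan=w,
# contact=w, month=w, day_of_week=w, duration=n, campaign=n, pdays=n,
# previous=n, poutcome=w, emp.var.rate=n, cons.price.idx=n, cons.conf.idx=n,
# euribor3m=n, nr.employed=n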
Get the re-substitution error:
mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model
To get the scores for each instance, we use the --scores option as follows:
mahout runlogistic --scores --input train_data/input_bank_data.csv --model model
To test the model on the test data, we pass in the test file created during the split process as follows:
mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model
Use the --scores option to obtain the predicted scores on the test data as well, as sketched below.
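A minimal sketch of that command, combining the --scores usage and the test file path shown above:
mahout runlogistic --scores --input test_data/input_bank_data.csv --model model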
LDA (Latent Dirichlet Allocation)
On the command line, first set up the working directory as follows:
mkdir /tmp/lda
export WORK_DIR=/tmp/lda
Then we download the data to a location on the hard drive, create the input directory, and extract the downloaded file into it (tar's -C option requires that the target directory already exist):

wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
mkdir -p $WORK_DIR/input
tar xvzf reuters21578.tar.gz -C $WORK_DIR/input
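As an optional sanity check, assuming the standard Reuters-21578 layout of reut2-*.sgm files (22 SGML files in the usual distribution), confirm the extraction:
ls $WORK_DIR/input/*.sgm | wc -l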
We will use the Mahout class ExtractReuters to extract the files:
mahout org.apache.lucene.benchmark.utils.ExtractReuters $WORK_DIR/input $WORK_DIR/reutersfinal
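ExtractReuters writes each article out as its own small text file, so the output directory should now hold thousands of files; a quick count with standard shell tools:
ls $WORK_DIR/reutersfinal | wc -l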
Set Mahout to run on the Hadoop cluster instead of locally. The mahout driver script runs locally whenever MAHOUT_LOCAL is set to any non-empty value (even FALSE), so the variable must actually be cleared:
unset MAHOUT_LOCAL
Alternatively, close and reopen the terminal so the earlier export MAHOUT_LOCAL=TRUE is forgotten.
Transfer the reutersfinal directory from the local filesystem to the Hadoop cluster:
hadoop fs -put /tmp/lda/reutersfinal reutersfinal
Check that the files were transferred to HDFS:
hadoop fs -ls reutersfinal
The next step is to convert the files to the sequence format. We will use the Mahout
command seqdirectory for that:
mahout seqdirectory -i reutersfinal -o sequencefiles -c UTF-8 -chunk 5
To view one of the sequence files, we will use the seqdumper utility:
mahout seqdumper -i sequencefiles/part-m-00000 -o part-m-00000.txt
gedit part-m-00000.txt
The next step is to convert the sequence files into a term-frequency matrix, using the Mahout utility seq2sparse. This matrix can then be used to perform topic modeling:
mahout seq2sparse -i sequencefiles/ -o vectors/ -wt tf --namedVector
Check the files created; the listing should include tf-vectors and dictionary.file-0, both of which are used in later steps:
hadoop fs -ls vectors
Use rowid to convert the sparse vectors into the form needed by cvb (i.e., to change the Text key to an Integer):
mahout rowid -i vectors/tf-vectors -o reuters-out-matrix
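The rowid job writes a matrix file plus a docIndex mapping row ids back to document names (per Mahout's RowIdJob); it is worth verifying both before the next step:
hadoop fs -ls reuters-out-matrix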
We execute the Mahout cvb command to perform topic modeling on the input dataset:
mahout cvb -i reuters-out-matrix/matrix -o reuterslda -k 20 -ow -x 20 -dict vectors/dictionary.file-0 -dt reuters-lda-topics -mt reuters-lda-model
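A quick gloss of the flags, paraphrased from Mahout's cvb usage text:
# -k 20  : number of latent topics to learn
# -x 20  : maximum number of iterations
# -ow    : overwrite any existing output
# -dict  : dictionary mapping term ids back to words
# -dt    : output path for document-topic distributions
# -mt    : working path for the topic model's intermediate state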


Verify the output directories:
hadoop fs -ls reuters-lda-topics
To view the results, we will use the Mahout vectordump utility:
mahout vectordump -i reuterslda/part-m-00000 -o reutersldaop/vectordump -vs 10 -p true -d vectors/dictionary.file-0 -dt sequencefile -sort reuterslda/part-m-00000
Check the dump created; with -vs 10 and -sort, each topic vector is truncated to its ten highest-weighted terms:
ls reutersldaop
gedit reutersldaop/vectordump
