LT Mahout Exercises
Logistic Regression
Create a folder in your home directory with the following command:
cd $HOME
mkdir bank_data
cd bank_data
Download the data into the bank_data directory:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
Unzip the file and enter the extracted directory:
unzip bank-additional.zip
cd bank-additional
List the extracted files with the ls command and inspect them with a text editor such as gedit.
sed is a powerful stream editor on Linux that can be used as a data pre-processing tool. The general syntax is:
sed -e 's/STRING_TO_REPLACE/REPLACEMENT_STRING/g' fileName > outputFileName
sed -i 's/STRING_TO_REPLACE/REPLACEMENT_STRING/g' fileName
The -e form writes the result to a new file; the -i form edits the file in place.
The commands to replace ; with , and to remove the double quotes are as follows:
sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv
Remove the header line from the dataset:
sed -i '1d' input_bank_data.csv
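As a quick sanity check, the same substitutions can be tried on a single hypothetical sample line (the field values below are invented for illustration, not taken from the real dataset):

```shell
# Hypothetical raw line: quoted fields separated by ';', as in the raw CSV.
sample='"58";"management";"married"'
# Apply the same substitutions as above: ';' -> ',' and strip the quotes.
cleaned=$(printf '%s\n' "$sample" | sed -e 's/;/,/g' -e 's/"//g')
printf '%s\n' "$cleaned"
```

The output should be a plain comma-separated line with no quotes.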
Create a new directory and copy the file into it:
mkdir input_bank
cp input_bank_data.csv input_bank
Set Mahout to run in local mode rather than distributed mode:
export MAHOUT_LOCAL=TRUE
Split the dataset into training and test datasets using the Mahout split command:
mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30
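To see what a random percentage split does, here is a rough shell-only illustration (plain awk, not the Mahout implementation) that sends roughly 30% of the rows to a test file:

```shell
# Generate 1000 dummy rows as stand-ins for dataset records.
seq 1 1000 > all_rows.txt
# Route each row at random: ~30% to test_rows.txt, the rest to train_rows.txt.
awk 'BEGIN { srand(42) }
     { if (rand() < 0.30) print > "test_rows.txt"; else print > "train_rows.txt" }' all_rows.txt
wc -l train_rows.txt test_rows.txt
```

Every input row lands in exactly one of the two output files, so the line counts sum to the original total.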
Topic Modeling
Download and extract the Reuters-21578 dataset:
wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
tar xvzf reuters21578.tar.gz -C $WORK_DIR/input
We will use the Mahout class ExtractReuters to extract the files:
mahout org.apache.lucene.benchmark.utils.ExtractReuters $WORK_DIR/input $WORK_DIR/reutersfinal
Set Mahout to run on the Hadoop cluster instead of locally:
export MAHOUT_LOCAL=FALSE
Close the terminal and reopen it.
Transfer the reutersfinal directory from the local filesystem to HDFS:
hadoop fs -put $WORK_DIR/reutersfinal reutersfinal
Check that the files were transferred to HDFS:
hadoop fs -ls reutersfinal
The next step is to convert the files to the sequence format. We will use the Mahout
command seqdirectory for that:
mahout seqdirectory -i reutersfinal -o sequencefiles -c UTF-8 -chunk 5
To view one of the sequence files, we will use the seqdumper utility:
mahout seqdumper -i sequencefiles/part-m-00000 -o part-m-00000.txt
gedit part-m-00000.txt
The next step is to convert the sequence files into a term-frequency matrix using the Mahout utility seq2sparse. This matrix can then be used to perform topic modeling:
mahout seq2sparse -i sequencefiles/ -o vectors/ -wt tf --namedVector
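To make the term-frequency (tf) idea concrete, here is a minimal shell sketch (plain coreutils, not Mahout) that counts how often each term occurs in a small invented sample text:

```shell
# Hypothetical sample text, invented for illustration.
text='the cat sat on the mat near the cat'
# One word per line, then count occurrences of each distinct term.
printf '%s\n' "$text" | tr ' ' '\n' | sort | uniq -c | sort -rn
```

Each output line pairs a count with a term; seq2sparse produces the same kind of counts, but stored as sparse vectors keyed by a dictionary.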
Check the files that were created:
hadoop fs -ls vectors
Use the rowid utility to convert the sparse vectors into the form needed for cvb clustering (i.e., to change the Text key to an Integer key):
mahout rowid -i vectors/tf-vectors -o reuters-out-matrix
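The effect of rowid can be pictured with a tiny awk analogy (made-up keys, not the actual implementation): each Text key is paired with a sequential integer id:

```shell
# Hypothetical Text keys (e.g., document paths) mapped to 0-based integer ids.
printf '%s\n' /docs/reut-001.txt /docs/reut-002.txt /docs/reut-003.txt |
  awk '{ print NR-1 "\t" $0 }' > docIndex.txt
cat docIndex.txt
```

rowid likewise writes a matrix keyed by integers plus a docIndex mapping the integers back to the original Text keys.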
We execute the Mahout cvb command to perform LDA topic modeling on the input dataset (-k sets the number of topics and -x the number of iterations):
mahout cvb -i reuters-out-matrix/matrix -o reuterslda -k 20 -ow -x 20 -dict vectors/dictionary.file-0