
You are asked to write a MapReduce program with Py...

chegg.com/homework-help/questions-and-answers/asked-write-mapreduce-program-python-cluster-trips-tripstxt-based-pickup-locations-code-im-q118482218

Question


You are asked to write a MapReduce program with Python to cluster trips in Trips.txt based
on pickup locations. Your code should implement the k-medoid clustering algorithm known as
the Partitioning Around Medoids (PAM) algorithm, which is described below:
1. Initialize: randomly select 𝑘 of the 𝑛 data points as the medoids.
2. Assignment step: associate each data point with the closest medoid.
3. Update step: for each medoid 𝑚 and each data point 𝑜 associated with 𝑚, swap 𝑚 and 𝑜,
and compute the total cost of the configuration (that is, the average dissimilarity of 𝑜 to
all the data points associated with 𝑚). Select the medoid 𝑜 with the lowest cost of the
configuration.
4. Repeat steps 2 and 3 alternately until there is no change in the assignments or after a
given number 𝑣 of iterations.
The code must work for 3 reducers, for different settings of 𝑘, and for different settings of
𝑣. Also, you should write up a shell script named task2-run.sh. Running the shell script
performs the task, where the shell script and code files are in the same folder (no
subfolders). Note that 𝑘 and 𝑣 must be passed to task2-run.sh as arguments when it is
executed.
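
Before distributing the work, the four PAM steps in the question can be sketched in plain Python on an in-memory list of points. This is only a minimal sketch under assumptions the question does not state: dissimilarity is taken to be Euclidean distance, points are (x, y) tuples, and the names `assign`, `update`, and `pam` are illustrative.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two (x, y) points (assumed dissimilarity)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def assign(points, medoids):
    """Step 2: map each medoid to the list of points closest to it."""
    clusters = {m: [] for m in medoids}
    for p in points:
        closest = min(medoids, key=lambda m: distance(p, m))
        clusters[closest].append(p)
    return clusters


def update(clusters):
    """Step 3: in each cluster, pick the member with the lowest average distance."""
    new_medoids = []
    for members in clusters.values():
        def cost(o):
            return sum(distance(o, p) for p in members) / len(members)
        new_medoids.append(min(members, key=cost))
    return new_medoids


def pam(points, k, v, seed=0):
    """Steps 1 to 4: alternate assignment and update at most v times."""
    rng = random.Random(seed)
    # Step 1: sample k distinct points so that no two medoids coincide.
    medoids = rng.sample(sorted(set(points)), k)
    for _ in range(v):
        clusters = assign(points, medoids)
        new_medoids = update(clusters)
        # Unchanged medoids imply unchanged assignments in the next round.
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return assign(points, medoids)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for medoid, members in pam(points, k=2, v=10).items():
    print(medoid, members)
```

The update step here costs quadratic time per cluster, which is acceptable for a reference implementation but is the part that a MapReduce version spreads across reducers.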

Expert Answer

Step 1/2
The following MapReduce program in Python clusters the trips in Trips.txt by pickup location,
implementing the k-medoid clustering algorithm known as Partitioning Around Medoids
(PAM):
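import math
import os
import random
import subprocess
import sys

# Assumed side file: one "x,y" medoid per line, shipped to every task via -files.
MEDOIDS_FILE = 'medoids.txt'


def distance(a, b):
    # Euclidean distance between two (x, y) points.
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def mapper(lines):
    # Assignment step: key each trip by the index of its closest medoid.
    with open(MEDOIDS_FILE) as f:
        medoids = [tuple(float(c) for c in row.split(',')) for row in f if row.strip()]
    for line in lines:
        if not line.strip():
            continue
        trip_id, pickup_x, pickup_y = line.strip().split(',')
        pickup_location = (float(pickup_x), float(pickup_y))
        costs = [distance(pickup_location, medoid) for medoid in medoids]
        min_medoid = costs.index(min(costs))
        sys.stdout.write('%s\t%s,%s,%s\n' % (min_medoid, trip_id, pickup_x, pickup_y))


def update_cluster(trips):
    # Update step: the member with the lowest average distance to the cluster
    # becomes its medoid; every trip is written out labelled with that medoid.
    points = [(float(x), float(y)) for _, x, y in trips]
    best = min(points, key=lambda o: sum(distance(o, p) for p in points) / len(points))
    for trip_id, _, _ in trips:
        sys.stdout.write('%s,%s,%s\n' % (trip_id, best[0], best[1]))


def reducer(lines):
    # Streaming sends lines sorted by key, so a change of key ends a cluster.
    current, trips = None, []
    for line in lines:
        key, value = line.rstrip('\n').split('\t', 1)
        if key != current and trips:
            update_cluster(trips)
            trips = []
        current = key
        trips.append(value.split(','))
    if trips:
        update_cluster(trips)


def hdfs_cat(path):
    return subprocess.check_output(['hadoop', 'fs', '-cat', path]).decode().splitlines()


def driver(k, v, input_file, output_file):
    script = os.path.abspath(__file__)
    run_self = 'python3 ' + os.path.basename(script)
    # Initialize: k distinct pickup points chosen at random are the first medoids.
    points = {tuple(float(c) for c in row.split(',')[1:])
              for row in hdfs_cat(input_file) if row.strip()}
    medoids = set(random.sample(sorted(points), k))
    for _ in range(v):
        with open(MEDOIDS_FILE, 'w') as f:
            f.writelines('%s,%s\n' % m for m in sorted(medoids))
        subprocess.call(['hadoop', 'fs', '-rm', '-r', '-f', output_file])
        # HADOOP_STREAMING_JAR is an assumed environment variable holding the
        # path to the hadoop-streaming jar of the local installation.
        subprocess.check_call([
            'hadoop', 'jar', os.environ['HADOOP_STREAMING_JAR'],
            '-D', 'mapreduce.job.reduces=3',
            '-files', script + ',' + MEDOIDS_FILE,
            '-input', input_file,
            '-output', output_file,
            '-mapper', run_self + ' map',
            '-reducer', run_self + ' reduce'])
        new_medoids = {tuple(float(c) for c in row.split(',')[1:])
                       for row in hdfs_cat(output_file + '/part-*') if row.strip()}
        # Unchanged medoids mean the next assignment step would change nothing.
        if new_medoids == medoids:
            break
        medoids = new_medoids


if __name__ == '__main__':
    if sys.argv[1] == 'map':
        mapper(sys.stdin)
    elif sys.argv[1] == 'reduce':
        reducer(sys.stdin)
    else:
        k = int(sys.argv[1])
        v = int(sys.argv[2])
        input_file = sys.argv[3]
        output_file = sys.argv[4]
        driver(k, v, input_file, output_file)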
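
Hadoop Streaming does not hand a reducer one key together with a list of its values; it pipes text lines sorted by key, and the reducer has to detect the group boundaries itself. A local simulation makes that contract visible and lets the assignment and update logic be checked without a cluster. The sketch below assumes tab-separated `key<TAB>value` lines, Euclidean distance, and a simple modulo split of keys over 3 reducers as a stand-in for Hadoop's hash partitioner; `map_lines`, `reduce_lines`, and `simulate` are hypothetical names, not part of Hadoop.

```python
import itertools
import math

NUM_REDUCERS = 3


def map_lines(lines, medoids):
    """Emit 'medoid_index<TAB>trip_id,x,y' for the closest medoid of each trip."""
    for line in lines:
        trip_id, x, y = line.strip().split(',')
        point = (float(x), float(y))
        dists = [math.hypot(point[0] - m[0], point[1] - m[1]) for m in medoids]
        yield '%d\t%s,%s,%s' % (dists.index(min(dists)), trip_id, x, y)


def reduce_lines(sorted_lines):
    """Group consecutive lines by key and yield (key, new_medoid) pairs."""
    keyed = (line.split('\t', 1) for line in sorted_lines)
    for key, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        points = []
        for _, value in group:
            _, x, y = value.split(',')
            points.append((float(x), float(y)))

        def total(o):
            # Minimising the sum also minimises the average within one cluster.
            return sum(math.hypot(o[0] - p[0], o[1] - p[1]) for p in points)

        yield key, min(points, key=total)


def simulate(lines, medoids):
    """Map, split keys over NUM_REDUCERS, sort each partition, then reduce."""
    partitions = [[] for _ in range(NUM_REDUCERS)]
    for out in map_lines(lines, medoids):
        key = out.split('\t', 1)[0]
        partitions[int(key) % NUM_REDUCERS].append(out)
    result = {}
    for part in partitions:
        result.update(reduce_lines(sorted(part)))
    return result


trips = ['t1,0,0', 't2,0,1', 't3,1,0', 't4,10,10', 't5,10,11', 't6,11,10']
print(simulate(trips, medoids=[(0.0, 1.0), (10.0, 11.0)]))
```

Because all lines with the same key land in the same partition, each cluster is updated by exactly one reducer, which is what allows the update step to run on 3 reducers in parallel.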

Step 2/2
The shell script, task2-run.sh, to run the MapReduce program:

#!/bin/bash
# Usage: ./task2-run.sh k v input_file output_file

k=$1
v=$2
input_file=$3
output_file=$4

# Assumption: the Python program from Step 1 is saved as task2.py next to this
# script; it launches the Hadoop Streaming job with 3 reducers.
cd "$(dirname "$0")" || exit 1
run_command=(python3 task2.py "$k" "$v" "$input_file" "$output_file")

echo "${run_command[@]}"

"${run_command[@]}"

Explanation:

To run the MapReduce program, you need to have Hadoop installed on your machine. Once
Hadoop is installed, create an input directory in HDFS and copy the Trips.txt file into it.
Do not create the output directory yourself: Hadoop creates it and refuses to start a job
whose output directory already exists. Next, run the shell script, passing the k and v
values as the first two arguments, followed by the input and output paths. For example, to
run the program with k=3 and v=10, you would run the following command:

./task2-run.sh 3 10 input/Trips.txt output

Final answer

The MapReduce program will cluster the trips in Trips.txt based on pickup locations, using
the k-medoid clustering algorithm. Hadoop writes the result to the output directory as one
part file per reducer (part-00000 to part-00002 for 3 reducers); hadoop fs -getmerge can
combine them into a single local file such as output.txt. The output contains one line per
trip, with the trip ID and the medoid of the cluster that the trip belongs to.
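
To inspect the result, the output lines can be grouped by medoid. The sketch below assumes each line has the form `trip_id,medoid`, where everything after the first comma identifies the medoid; the sample lines are made up for illustration and `group_by_medoid` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_medoid(lines):
    """Return {medoid: [trip_id, ...]} from 'trip_id,medoid' output lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first comma so a medoid written as "x,y" stays whole.
        trip_id, medoid = line.split(',', 1)
        clusters[medoid].append(trip_id)
    return dict(clusters)


# Made-up sample in place of the real job output.
sample = ['t1,0.0,0.0', 't2,0.0,0.0', 't4,10.0,10.0']
clusters = group_by_medoid(sample)
for medoid, trip_ids in sorted(clusters.items()):
    print(medoid, len(trip_ids), trip_ids)
```

For a run with k medoids, the resulting dictionary should have k keys, which is a quick check that no cluster was lost.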