Welcome to Scribd!

Cs426 Mining Massive Data Sets Project 2: Pagerank: Links-Simple-Sorted - Zip

Uploaded by

0% found this document useful (0 votes)

14 views2 pages

This document provides instructions for Project 2 of the CS426 Mining Massive Data Sets course. Students are tasked with using Hadoop and MapReduce to compute PageRank scores on Wikipedia link and page title datasets totaling over 350MB. Students must report the top 100 pages by PageRank when the teleport parameter is set to 0.85, compare results to top pages by in-links, and experiment with other parameter settings. Novel approaches beyond the basic task will receive a 20% bonus. Students must write a report including background, algorithm details, results, team contributions, and references. The final package is due by July 02, 2013.

Original Description:

Original Title

CS426-Project2

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

0% found this document useful (0 votes)

14 views2 pages

Cs426 Mining Massive Data Sets Project 2: Pagerank: Links-Simple-Sorted - Zip

Uploaded by

Ammar Najjar

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

Jump to Page

You are on page 1of 2

Search inside document

CS426 Mining Massive Data Sets

Project 2: PageRank

Task: Compute the PageRank scores on the Wikipedia dataset.

Dataset:
links-simple-sorted.zip (323MB)
(http://users.on.net/~henry/pagerank/links-simple-sorted.zip)

titles-sorted.zip (28MB)
(http://users.on.net/~henry/pagerank/titles-sorted.zip)

The above files contain all links between Wikipedia pages (English page only). In
links-simple-sorted.zip, there is one line for each page which contains the destination pages of
the out-links from one specific page. The format of the lines is as follows:

from1: to11 to12 to13 ...

from2: to21 to22 to23 ...
…

Here from1 is an integer denoting the source page, and to11 to12… are integers denoting the
destination pages of the out-links from the corresponding source page. The n-th line in the file
titles-sorted.zip is the page title of page n.

In this project, you need to install your Hadoop environment and use MapReduce to complete
the PageRank calculation on the given datasets. You need to report the titles of Top 100 pages
and the corresponding PageRand scores. You can choose different parameters, such as the
teleport parameter, of PageRank algorithm and do comparison on different results. One result
you must report is the result when the teleport parameter (beta) is set to 0.85. In addition, you
need to compare the PageRank results with the top pages ranked by the number of in-links
and give some analysis.

Of course, you can do more interesting work on the dataset. You are greatly encouraged to
implement your own ideas on the datasets. Teams with novel ideas or more trials will get a 20%
bonus in the final score.

You need to write a report about this project. The report should include but not limited to the
following contents:
1. Background, understanding and analysis of the task;

2. Details of the algorithm and platform;

3. Experimental results and findings;

4. The contribution of each member of the group;

5. Reference.

The deadline for Project 2 is July 02, 2013 (23:59pm).

Package your clean code, final report and other relevant files, name the package as

MMDS-PRO2-[成员姓名].[rar/zip].

Some notes:

1. You may find useful information from:

The official website: http://hadoop.apache.org

Course slides: http://www.cs.sjtu.edu.cn/~liwujun/course/mmds.html

2. Final package should be sent to TA (xiaoyu199175@gmail.com) by the team leader.

3. Only cluster mode will be accepted. It is strongly recommended to use real Ethernet

instead of virtual machines. If you do so, please mention it in your report.

All your submitted assignments must be entirely your own (or your own group's). Any student
found cheating or performing plagiarism will receive a final score of zero for this course.

Respuestas Ejercicios Netcraker
Document6 pages
Respuestas Ejercicios Netcraker
JOSE LUIS
No ratings yet
Visa Contactless Payment Specification (VCPS) 2.2 Updates List
Document34 pages
Visa Contactless Payment Specification (VCPS) 2.2 Updates List
martin rupp
No ratings yet
Task - Data Engineering
Document2 pages
Task - Data Engineering
Last Over
No ratings yet
Django
Document93 pages
Django
Tushar Jogani
No ratings yet
Learn R By Coding
From Everand
Learn R By Coding
Thomas Kurnicki
No ratings yet
Study of Plcs and CNC Machines at Bhel,: Register No: 18becxxxx Name: V Ajay Kumar
Document33 pages
Study of Plcs and CNC Machines at Bhel,: Register No: 18becxxxx Name: V Ajay Kumar
manjeet kumar
No ratings yet
OpenMPCoursework2
Document5 pages
OpenMPCoursework2
Jaokd
100% (1)
JtdmoMJK64 hw4
Document10 pages
JtdmoMJK64 hw4
Daniel Felipe Rambaut Lemus
No ratings yet
Dynamo For Schematic Design
Document23 pages
Dynamo For Schematic Design
Abel Oliveira
No ratings yet
HW 0
Document3 pages
HW 0
noman iqbal (nomylogy)
No ratings yet
Map Combine Reduce Assignment (Updated)
Document5 pages
Map Combine Reduce Assignment (Updated)
Akash Kundu
No ratings yet
CW COMP1556 28 Ver1 1617
Document6 pages
CW COMP1556 28 Ver1 1617
dharshinishurya
No ratings yet
Project Assignment.2024
Document2 pages
Project Assignment.2024
tin nguyen
No ratings yet
R Packages 1
Document3 pages
R Packages 1
tejukmr
No ratings yet
Useful R Packages For Data Analysis
Document3 pages
Useful R Packages For Data Analysis
tejukmr
No ratings yet
Final Project
Document2 pages
Final Project
rp7028694956
No ratings yet
Peer To Peer
Document6 pages
Peer To Peer
Scholar 4you
No ratings yet
Unit Iv GCC
Document13 pages
Unit Iv GCC
Gobi Nathan
No ratings yet
Project Report
Document71 pages
Project Report
Kalishavali Shaik
0% (1)
DataGrokr-Software Development Internship Assignment
Document5 pages
DataGrokr-Software Development Internship Assignment
greatsashi vardhan
No ratings yet
CSE 5311: Design and Analysis of Algorithms Programming Project Topics
Document3 pages
CSE 5311: Design and Analysis of Algorithms Programming Project Topics
Fgv
No ratings yet
STA304 Assignment2 Instructions
Document6 pages
STA304 Assignment2 Instructions
rchen500
No ratings yet
HW 9
Document2 pages
HW 9
jovanhuang
No ratings yet
Assignment 2 Task Sheet
Document3 pages
Assignment 2 Task Sheet
陈二二
No ratings yet
Assignment 1
Document3 pages
Assignment 1
Manjish Pal
No ratings yet
EGRE526/CMSC506 - Design Project Guidelines: March 27, 2016 April 14, 2016 May 10, 2016
Document2 pages
EGRE526/CMSC506 - Design Project Guidelines: March 27, 2016 April 14, 2016 May 10, 2016
Mustafa Nyaz
No ratings yet
2.3 - Best Practices - Native Hadoop Tool Sqoop
Document24 pages
2.3 - Best Practices - Native Hadoop Tool Sqoop
Cherrie Ann Bautista Domingo-Dillera
No ratings yet
DSBDSAssingment 11
Document20 pages
DSBDSAssingment 11
403 Chaudhari Sanika Sagar
No ratings yet
Milestone
Document7 pages
Milestone
kirankashif1118
No ratings yet
Predictive Analytics Exam-December 2020: Exam PA Home Page
Document10 pages
Predictive Analytics Exam-December 2020: Exam PA Home Page
justtestit
No ratings yet
Course Project Guideline - New
Document6 pages
Course Project Guideline - New
Caiyi Xu
No ratings yet
Mapreduce Ii: Permalink Comments (25) Trackbacks
Document6 pages
Mapreduce Ii: Permalink Comments (25) Trackbacks
Ravin Ravin
No ratings yet
Predictive Analytics Exam-June 2020: Exam PA Home Page
Document10 pages
Predictive Analytics Exam-June 2020: Exam PA Home Page
justtestit
No ratings yet
BD7 Assignment S22
Document4 pages
BD7 Assignment S22
Course Zila
No ratings yet
Seminar Report: Dryad: Distributed Data-Parallel Programs From Sequential Building Blocks and
Document7 pages
Seminar Report: Dryad: Distributed Data-Parallel Programs From Sequential Building Blocks and
amrutadeshmukh222
No ratings yet
Linked Data For Information Extraction Challenge 2014
Document6 pages
Linked Data For Information Extraction Challenge 2014
J. Albert Bowden II
No ratings yet
Himanshu Gupta Configuration Manual
Document16 pages
Himanshu Gupta Configuration Manual
Sheethal K. S
No ratings yet
05 Movies Data Analysis Using Mapreduce
Document20 pages
05 Movies Data Analysis Using Mapreduce
mohammadkhaja.shaik
No ratings yet
Assignment 1 Spec
Document5 pages
Assignment 1 Spec
Anitha Reddy
No ratings yet
Spark Interview Questions PDF 2
Document19 pages
Spark Interview Questions PDF 2
Varun
No ratings yet
ADA Report
Document12 pages
ADA Report
Nihal Jayachandran
No ratings yet
Qualitative Analysis Using R - A Free Analytic Tool
Document15 pages
Qualitative Analysis Using R - A Free Analytic Tool
aa 267
No ratings yet
Data Stage PDF
Document37 pages
Data Stage PDF
pappujaiswal
No ratings yet
Cse6242 HW3
Document8 pages
Cse6242 HW3
Richard Ding
No ratings yet
Hadoop Pagerank PDF
Document17 pages
Hadoop Pagerank PDF
Nawaal Ali
No ratings yet
Predictive Analytics Exam-June 2021: Exam PA Home Page
Document10 pages
Predictive Analytics Exam-June 2021: Exam PA Home Page
hninyi theint
No ratings yet
Synopsis
Document8 pages
Synopsis
Sahil Gupta
No ratings yet
CS628 - Assignment 4
Document2 pages
CS628 - Assignment 4
kumarhrishav3
No ratings yet
COMP529 Streaming Analytics (2019) 2
Document3 pages
COMP529 Streaming Analytics (2019) 2
Anonymous w3mfE7D6
No ratings yet
Correction Mock 2023
Document9 pages
Correction Mock 2023
Eva Tomi
No ratings yet
Structure
Document34 pages
Structure
boubidi.anisnjm
No ratings yet
Comp 90015 Assignment 1
Document4 pages
Comp 90015 Assignment 1
Bradley Jackson
100% (1)
RIP Riverbed Lab
Document13 pages
RIP Riverbed Lab
neka
No ratings yet
Informatica Assessment - 6D Case Studyu
Document5 pages
Informatica Assessment - 6D Case Studyu
Venkata Chaitanya Gannavarapu
No ratings yet
TM5140 Data-Driven Decision Making: Assignment 1
Document3 pages
TM5140 Data-Driven Decision Making: Assignment 1
Utpali Bhagya
No ratings yet
Introduction: Terminology: Code Chunks: Comparing Sweave and Knitr
Document15 pages
Introduction: Terminology: Code Chunks: Comparing Sweave and Knitr
gaby-01
No ratings yet
Fast R
Document43 pages
Fast R
Hannah Muricho
No ratings yet
Leverage The BinRank's Performance Using HubRank
Document4 pages
Leverage The BinRank's Performance Using HubRank
WARSE Journals
No ratings yet
SocialMediaLab Package Tutorial
Document17 pages
SocialMediaLab Package Tutorial
Shinji Ikari
No ratings yet
CSC431 AI - ProjectSLU2023
Document2 pages
CSC431 AI - ProjectSLU2023
muazu muhammad
No ratings yet
Part Ii: Applications of Gas: Ga and The Internet Genetic Search Based On Multiple Mutation Approaches
Document31 pages
Part Ii: Applications of Gas: Ga and The Internet Genetic Search Based On Multiple Mutation Approaches
Srikar Chintala
No ratings yet
Our Crawler Implementation: 7.1 Programming Environment and Dependencies
Document14 pages
Our Crawler Implementation: 7.1 Programming Environment and Dependencies
Nitin Malhotra
No ratings yet
Accelerated Computing with HIP
From Everand
Accelerated Computing with HIP
Yifan Sun
Rating: 4 out of 5 stars
4/5 (1)
Deep Belief Nets in C++ and CUDA C: Volume 3: Convolutional Nets
From Everand
Deep Belief Nets in C++ and CUDA C: Volume 3: Convolutional Nets
Timothy Masters
No ratings yet
9 Distance Measures in Data Science
Document9 pages
9 Distance Measures in Data Science
Ammar Najjar
No ratings yet
Practicing Tasks in Data Mining: Techniques For The Subject
Document22 pages
Practicing Tasks in Data Mining: Techniques For The Subject
Ammar Najjar
No ratings yet
XO02-EUGLOH-ISGEE Enterpoly Week-Fostering Entrepreneurship Competences With Gamification
Document2 pages
XO02-EUGLOH-ISGEE Enterpoly Week-Fostering Entrepreneurship Competences With Gamification
Ammar Najjar
No ratings yet
Moosavi-Dezfooli Et Al. 2017 Universal Adversarial Perturbations
Document11 pages
Moosavi-Dezfooli Et Al. 2017 Universal Adversarial Perturbations
Ammar Najjar
No ratings yet
Wu Et Al. 2019, Making The Invisibility Cloak
Document16 pages
Wu Et Al. 2019, Making The Invisibility Cloak
Ammar Najjar
No ratings yet
2019 Adversarial Examples in Modern Machine Learning - A Review
Document97 pages
2019 Adversarial Examples in Modern Machine Learning - A Review
Ammar Najjar
No ratings yet
Carlini and Wagner 2017 Towards - Evaluating - The - Robustness - of - Neural - Networks
Document19 pages
Carlini and Wagner 2017 Towards - Evaluating - The - Robustness - of - Neural - Networks
Ammar Najjar
No ratings yet
Hashing Functions Properties
Document9 pages
Hashing Functions Properties
Rashmi vb
No ratings yet
Submit For Assessment Administrative Guide
Document9 pages
Submit For Assessment Administrative Guide
Nadia F. Zoppas
No ratings yet
Giga Data Sheet
Document2 pages
Giga Data Sheet
abdurraufrafli.2021
No ratings yet
Embedding Flash Programming Into TMS320LF240x Applications: Rev 1.3 23 Jan 2002 1
Document28 pages
Embedding Flash Programming Into TMS320LF240x Applications: Rev 1.3 23 Jan 2002 1
ASOCIACION ATECUBO
No ratings yet
Rashid Beach Resort Online Reservation A
Document13 pages
Rashid Beach Resort Online Reservation A
John Joseph Pasawa
No ratings yet
Unit III - PDF MICROPROCESER
Document7 pages
Unit III - PDF MICROPROCESER
Bonsa Mesganaw
No ratings yet
VEIKK A30 Instruction Manual
Document22 pages
VEIKK A30 Instruction Manual
LQ530
No ratings yet
Canon SmartBase Mpc400, 600 Parts Manual
Document76 pages
Canon SmartBase Mpc400, 600 Parts Manual
okeinfo
No ratings yet
The Most Comprehensive Digital IC Back-End Design and Implementation Traini Orial in History (Organized Version)
Document6 pages
The Most Comprehensive Digital IC Back-End Design and Implementation Traini Orial in History (Organized Version)
Agnathavasi
No ratings yet
SIL 3 Mechanical Devices, Really - Risknowlogy
Document6 pages
SIL 3 Mechanical Devices, Really - Risknowlogy
scottt_84
No ratings yet
MicroBit CULMINATING Project Rubric - ROBOTICS Space Exploration
Document3 pages
MicroBit CULMINATING Project Rubric - ROBOTICS Space Exploration
jaspal.ughra
No ratings yet
Maxgoogle Xtreamiptv
Document20 pages
Maxgoogle Xtreamiptv
sfhk
100% (1)
CA4101 Lecture 3 Business Architecture - BPMN
Document40 pages
CA4101 Lecture 3 Business Architecture - BPMN
Manoj Venkat
No ratings yet
SFTP Vscode
Document9 pages
SFTP Vscode
Paulo Almeida
No ratings yet
CS ZG524ES ZG524MEL ZG524 - Real Time Operating System - EC2 Regular
Document3 pages
CS ZG524ES ZG524MEL ZG524 - Real Time Operating System - EC2 Regular
2022ht80553
No ratings yet
Report
Document156 pages
Report
Fabricio Ochoa
No ratings yet
Catia DMU Kinematics Simulator
Document189 pages
Catia DMU Kinematics Simulator
ccooxxyy
100% (1)
Display Simulator Help Guide v1.5: GS2 1800 Display GS3 2630 Display
Document15 pages
Display Simulator Help Guide v1.5: GS2 1800 Display GS3 2630 Display
Focss
No ratings yet
SUSE Linux 11 SP2 - Fundamentals Workbook
Document86 pages
SUSE Linux 11 SP2 - Fundamentals Workbook
emcvilt
No ratings yet
Questionnaire For MS Excel 2016 (Part 4)
Document4 pages
Questionnaire For MS Excel 2016 (Part 4)
Cindy Craus
No ratings yet
Task 1
Document5 pages
Task 1
Dirty Rajan
No ratings yet
Artemis MK V Artemis MK V: Features and Benefits Features and Benefits
Document1 page
Artemis MK V Artemis MK V: Features and Benefits Features and Benefits
Elettore
No ratings yet
NB-MED - 2.2 - Rec - 4-2010 Software and Medical Devices
Document16 pages
NB-MED - 2.2 - Rec - 4-2010 Software and Medical Devices
Ivan Popovic
No ratings yet
Competitor Feature Analysis
Document6 pages
Competitor Feature Analysis
Vivek Tyagi
No ratings yet
Delmia Robotics Simulation
Document4 pages
Delmia Robotics Simulation
Bas Ramu
0% (1)
PeckShield-Audit-Report-SushiSwap-v1 (4th Copy) .0 PDF
Document47 pages
PeckShield-Audit-Report-SushiSwap-v1 (4th Copy) .0 PDF
SidneySissaoui
No ratings yet