
DATA ANALYTICS LABORATORY (21CSL66)

6. Utilize Apache Spark to achieve the following tasks for the given dataset:

a) Find the movie with the lowest average rating with RDD.

b) Identify users who have rated the most movies.

c) Explore the distribution of ratings over time.

d) Find the highest-rated movies with a minimum number of ratings.

Steps to be followed:

Prerequisites:

• Apache Spark Setup: Ensure you have Apache Spark installed and configured.

(If not, open a terminal and install PySpark with pip install pyspark.)

• Dataset: The MovieLens small dataset (ml-latest-small), consisting of two CSV files.

o The movies dataset has columns: movieId, title, genres.

o The ratings dataset has columns: userId, movieId, rating, timestamp.

• Loading Data: First, load the data into Spark RDDs or DataFrames.

from pyspark.sql import SparkSession


# Initialize Spark session

spark = SparkSession.builder.appName("MovieRatingsAnalysis").getOrCreate()

# Load datasets

movies_df = spark.read.csv("/Users/amithpradhaan/Desktop/ml-latest-small/movies.csv",
                           header=True, inferSchema=True)

ratings_df = spark.read.csv("/Users/amithpradhaan/Desktop/ml-latest-small/ratings.csv",
                            header=True, inferSchema=True)

# Create RDDs

movies_rdd = movies_df.rdd

ratings_rdd = ratings_df.rdd
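
Before starting the tasks, a quick (optional) sanity check confirms that both files loaded correctly:

# Inspect the inferred schema and a few sample rows
movies_df.printSchema()
ratings_df.show(5)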

a) Find the Movie with the Lowest Average Rating Using RDD.

# Compute average ratings

avg_ratings_rdd = ratings_rdd.map(lambda x: (x['movieId'], (x['rating'], 1))) \
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
    .mapValues(lambda x: x[0] / x[1])

# Find the movie with the lowest average rating

lowest_avg_rating = avg_ratings_rdd.sortBy(lambda x: x[1]).first()

print(f"Movie with the lowest average rating: {lowest_avg_rating}")

b) Identify Users Who Have Rated the Most Movies.

# Compute number of ratings per user

user_ratings_count = ratings_rdd.map(lambda x: (x['userId'], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .sortBy(lambda x: x[1], ascending=False)

# Get top users

top_users = user_ratings_count.take(10)
print(f"Top users by number of ratings: {top_users}")

c) Explore the Distribution of Ratings Over Time.

from pyspark.sql.functions import from_unixtime, year, month

# Convert timestamp to date and extract year and month

ratings_df = ratings_df.withColumn("year", year(from_unixtime(ratings_df['timestamp']))) \
    .withColumn("month", month(from_unixtime(ratings_df['timestamp'])))

# Group by year and month to get rating counts

ratings_over_time = ratings_df.groupBy("year", "month").count().orderBy("year", "month")

# Show distribution

ratings_over_time.show()
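
To visualize the distribution, here is a minimal plotting sketch, assuming pandas and matplotlib are installed locally (the aggregated counts are small enough to collect to the driver):

import matplotlib.pyplot as plt

# Collect the aggregated counts into a pandas DataFrame
pdf = ratings_over_time.toPandas()
pdf["period"] = pdf["year"].astype(str) + "-" + pdf["month"].astype(str).str.zfill(2)

pdf.plot(x="period", y="count", kind="line", figsize=(12, 4), legend=False)
plt.ylabel("number of ratings")
plt.title("Ratings per month")
plt.show()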

d) Find the Highest-Rated Movies with a Minimum Number of Ratings.

Set a minimum number of ratings, e.g. 100, so that movies rated only a handful of times do not dominate the list.

# Compute average ratings and count ratings per movie

movie_ratings_stats = ratings_rdd.map(lambda x: (x['movieId'], (x['rating'], 1))) \
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
    .mapValues(lambda x: (x[0] / x[1], x[1]))  # (avg_rating, count)

# Filter movies with a minimum number of ratings

min_ratings = 100

qualified_movies = movie_ratings_stats.filter(lambda x: x[1][1] >= min_ratings)

# Find the highest-rated movies

highest_rated_movies = qualified_movies.sortBy(lambda x: x[1][0], ascending=False).take(10)

print(f"Highest-rated movies with at least {min_ratings} ratings: {highest_rated_movies}")
