Advance Data Mining Assignment


ASSIGNMENT

Course Name: Advance Data Mining

Name: Umar Yameen

Father Name: Muhammad Yameen

Student ID: 20903

Date: 25th June, 2024


Question No:1
To collect the data of interest, I will use BeautifulSoup, a Python library for parsing HTML and XML that is often combined with other libraries, such as requests, to build web crawlers.

Topic: Latest Technology News

We'll scrape a popular technology news website to gather titles, publication dates, and brief descriptions
of the latest articles.

Steps:

1. Install required libraries.

2. Identify the website to scrape.

3. Write a script to collect the data.

4. Display the collected data.

Step 1: Install Required Libraries

Ensure you have BeautifulSoup and requests installed. If not, you can install them using pip:

pip install beautifulsoup4 requests

Step 2: Identify the Website

We will use TechCrunch as the website to scrape.

Step 3: Write a Script

Below is a script to scrape the latest technology news from TechCrunch:

import requests
from bs4 import BeautifulSoup

# Define the URL of the TechCrunch startups page
url = "https://techcrunch.com/startups/"

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all articles
    articles = soup.find_all('article')

    # List to store the scraped data
    news_data = []

    # Extract details from each article
    for article in articles:
        title_tag = article.find('h2')
        description_tag = article.find('p')
        date_tag = article.find('time')

        # Skip articles that are missing any of the expected elements
        if not (title_tag and description_tag and date_tag):
            continue

        # Append the data to the list
        news_data.append({
            'title': title_tag.text.strip(),
            'description': description_tag.text.strip(),
            'date': date_tag.get('datetime', '')
        })

    # Display the scraped data
    for news in news_data:
        print(f"Title: {news['title']}")
        print(f"Date: {news['date']}")
        print(f"Description: {news['description']}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Step 4: Display the Collected Data

Running the script will output the latest technology news articles from TechCrunch, including titles,
publication dates, and brief descriptions.

Sample Output

Title: TechCrunch Startup Battlefield


Date: 2024-06-18T12:00:00Z

Description: The best startups are competing for the coveted Disrupt Cup and a $50,000 prize.

Title: AI Startup Secures $10M Funding

Date: 2024-06-17T14:30:00Z

Description: An AI startup specializing in natural language processing has raised $10 million in a Series A
funding round.

Title: New Tech Hub in Silicon Valley

Date: 2024-06-16T09:00:00Z

Description: A new tech hub has been established in Silicon Valley, offering resources and support to
innovative startups.
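
As a small optional extension that is not part of the assignment script, the collected news_data list can be saved for later analysis. The snippet below is a sketch assuming pandas is installed; one illustrative record (taken from the sample output above) is included so it runs on its own, and the output filename is hypothetical:

import pandas as pd

# 'news_data' would normally come from the scraping script above;
# a single illustrative record is used here so the snippet is self-contained
news_data = [
    {"title": "TechCrunch Startup Battlefield",
     "description": "The best startups are competing for the coveted Disrupt Cup and a $50,000 prize.",
     "date": "2024-06-18T12:00:00Z"},
]

news_df = pd.DataFrame(news_data)
news_df.to_csv('techcrunch_news.csv', index=False)  # hypothetical output file
print(f"Saved {len(news_df)} articles to techcrunch_news.csv")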

Question No:2

Let's use Twint, an advanced Twitter scraping tool written in Python, to extract some Twitter data. Twint
is powerful because it doesn't require access to the Twitter API, which can be restrictive due to rate
limits and other constraints.

Installation

First, let's install Twint. Twint requires Python 3.6 or higher. You can install Twint using pip:

pip install twint

Extracting Data with Twint

We'll extract tweets containing a specific hashtag, for example, #AI, and display the tweet text,
username, and date of each tweet.

Step-by-Step Script

1. Import the Required Library

2. Configure Twint

3. Run the Twint Search

4. Display the Results


Here's a Python script using Twint:

import twint

# Configure Twint to search for tweets containing the hashtag #AI
c = twint.Config()
c.Search = "#AI"
c.Limit = 10           # Limit to 10 tweets for demonstration purposes
c.Lang = "en"          # Search for English tweets
c.Store_object = True  # Store tweets in a Python object
c.Hide_output = True   # Hide output in the terminal

# Run the search
twint.run.Search(c)

# Retrieve tweets from Twint's internal storage
tweets = twint.output.tweets_list

# Display the collected tweets
for tweet in tweets:
    print(f"Username: {tweet.username}")
    print(f"Date: {tweet.datestamp} {tweet.timestamp}")
    print(f"Tweet: {tweet.tweet}\n")

Explanation

1. Import the Required Library: We import Twint to use its functionality.

2. Configure Twint: We set up the search parameters:

o c.Search specifies the search query, which is #AI in this case.

o c.Limit limits the number of tweets to retrieve.


o c.Lang filters tweets to a specific language.

o c.Store_object tells Twint to store the results in a Python object.

o c.Hide_output hides the output in the terminal.

3. Run the Twint Search: We execute the search using twint.run.Search(c).

4. Display the Results: We access the stored tweets and print the username, date, and tweet text.
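
As an optional refinement that is not part of the assignment script, the same configuration can be restricted to a date window using Twint's Since and Until options (dates in "YYYY-MM-DD" format); the specific dates below are illustrative assumptions:

import twint

# Same #AI search as above, limited to a specific date range
c = twint.Config()
c.Search = "#AI"
c.Lang = "en"
c.Limit = 10
c.Since = "2024-06-01"  # only tweets posted on or after this date
c.Until = "2024-06-19"  # only tweets posted before this date
c.Store_object = True
c.Hide_output = True

twint.run.Search(c)
print(f"Collected {len(twint.output.tweets_list)} tweets")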

Sample Output

The output will be similar to this:


Username: ai_expert

Date: 2024-06-19 12:34:56

Tweet: AI is transforming the world in unprecedented ways. #AI

Username: tech_guru

Date: 2024-06-19 12:30:22

Tweet: Exciting developments in AI technology! #AI #MachineLearning

Username: datascientist

Date: 2024-06-19 12:28:15

Tweet: How AI is revolutionizing healthcare. #AI #HealthTech

Question No:3
To demonstrate the general steps involved in collecting data for analysis using crawlers, let's create a
practical example using Twint to scrape Twitter data related to the hashtag #AI.

Step 1: Identify the Data Source

We will use Twitter as our data source to collect tweets containing the hashtag #AI.

Step 2: Understand the Data Structure

We aim to collect the following elements from each tweet (an example record is sketched below):

 Username (e.g., Umarchauhdry)

 Date (e.g., 2024-06-19 12:28:15)

 Tweet text

 Likes (e.g., 89)

 Retweets (e.g., 112)
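
A minimal sketch of a single target record, using the illustrative values above; the dictionary keys are hypothetical and simply mirror the fields we plan to extract:

# Hypothetical example of one collected record (values are illustrative only)
example_record = {
    "username": "Umarchauhdry",
    "date": "2024-06-19 12:28:15",
    "tweet": "How AI is revolutionizing healthcare. #AI #HealthTech",
    "likes": 89,
    "retweets": 112,
}
print(example_record)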

Step 3: Choose a Crawler

We will use Twint, a Python library that allows scraping Twitter without API limitations.

Step 4: Build or Configure the Crawler

We will configure Twint to search for tweets containing #AI and set the necessary parameters.

Step 5: Implement Data Extraction Logic

We'll use Twint's configuration options to specify our data extraction needs.

Step 6: Handle Pagination

Twint automatically handles pagination, so we don't need to write extra code for this.
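
By contrast, a crawler built directly on requests and BeautifulSoup (as in Question 1) must follow pagination itself. Below is a minimal sketch of that pattern, assuming a hypothetical site whose listing pages are addressed with a ?page=N query parameter:

import requests
from bs4 import BeautifulSoup

# Hypothetical paginated listing; the URL pattern is an assumption for illustration
base_url = "https://example.com/articles"
collected = []

for page in range(1, 4):                      # crawl the first three pages
    response = requests.get(base_url, params={"page": page})
    if response.status_code != 200:
        break                                 # stop if a page is missing or blocked
    soup = BeautifulSoup(response.content, "html.parser")
    collected.extend(a.text.strip() for a in soup.find_all("h2"))

print(f"Collected {len(collected)} headlines across pages")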

Step 7: Ensure Compliance and Respect

We'll include a delay between requests to avoid overloading Twitter's servers and ensure compliance
with Twitter's terms of service.
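
The combined script below issues a single search, so no explicit delay appears in it; when several searches are run back to back, a simple way to space them out is time.sleep. The extra hashtags and the 10-second pause in this sketch are illustrative assumptions:

import time
import twint

# Be polite: pause between successive searches instead of firing them back to back
for hashtag in ["#AI", "#MachineLearning", "#DataMining"]:
    c = twint.Config()
    c.Search = hashtag
    c.Limit = 20
    c.Lang = "en"
    c.Store_object = True
    c.Hide_output = True
    twint.run.Search(c)
    time.sleep(10)  # wait 10 seconds before the next request

print(f"Collected {len(twint.output.tweets_list)} tweets in total")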

Step 8: Execute the Crawler

We will run the Twint script to start collecting data.

Step 9: Clean and Preprocess the Data

After collecting the data, we'll clean and preprocess it to remove any irrelevant information and format it
for analysis.

Step 10: Analyze the Data

We'll analyze the data using Python libraries like pandas and matplotlib to gain insights.

Python Script for Steps 4-8

Here's a complete Python script to collect and preprocess Twitter data using Twint:

import twint
import pandas as pd

# Configure Twint to search for tweets containing the hashtag #AI
c = twint.Config()
c.Search = "#AI"
c.Limit = 100          # Limit to 100 tweets for demonstration purposes
c.Lang = "en"          # Search for English tweets
c.Store_object = True  # Store tweets in a Python object
c.Hide_output = True   # Hide output in the terminal
c.Pandas = True        # Enable saving to a pandas DataFrame

# Run the search
twint.run.Search(c)

# Retrieve tweets from Twint's internal pandas storage
tweets_df = twint.storage.panda.Tweets_df

# Display the first few rows of the collected data
print(tweets_df.head())

# Clean and preprocess the data
# Select relevant columns (Twint's pandas storage stores the counts as 'nlikes' and 'nretweets')
tweets_cleaned = tweets_df[['date', 'username', 'tweet', 'nlikes', 'nretweets']]
tweets_cleaned = tweets_cleaned.rename(columns={'nlikes': 'likes_count',
                                                'nretweets': 'retweets_count'})

# Remove duplicates
tweets_cleaned = tweets_cleaned.drop_duplicates()

# Save to a CSV file for further analysis
tweets_cleaned.to_csv('ai_tweets.csv', index=False)

# Display the cleaned data
print(tweets_cleaned.head())

Explanation of the Script

1. Configure Twint: We set the search parameters, including the hashtag #AI, a limit of 100 tweets, English as the language, and storage of the results in a pandas DataFrame.
2. Run the Search: We execute the Twint search with the configured parameters.

3. Retrieve Data: We retrieve the data from Twint's internal storage and convert it to a pandas
DataFrame.

4. Clean and Preprocess Data: We select relevant columns, remove duplicates, and save the
cleaned data to a CSV file.

5. Display Data: We print the first few rows of the collected and cleaned data.

Step 9: Clean and Preprocess the Data

The script already includes data cleaning steps such as selecting relevant columns and removing
duplicates.
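
If the tweet text itself needs further preprocessing before analysis, URLs and @mentions can be stripped and the text lowercased. This is an optional extra step, not part of the assignment script; it assumes the ai_tweets.csv file saved above:

import re
import pandas as pd

# Load the cleaned tweets saved earlier
tweets_cleaned = pd.read_csv('ai_tweets.csv')

def normalize_tweet(text):
    """Lowercase a tweet and strip URLs and @mentions."""
    text = re.sub(r'https?://\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)          # remove @mentions
    return text.lower().strip()

# Add a normalized text column for downstream analysis
tweets_cleaned['tweet_clean'] = tweets_cleaned['tweet'].apply(normalize_tweet)
print(tweets_cleaned[['tweet', 'tweet_clean']].head())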

Step 10: Analyze the Data

Here's a simple analysis using pandas and matplotlib to visualize the number of likes and retweets:

import pandas as pd
import matplotlib.pyplot as plt

# Read the cleaned data from the CSV file
tweets_cleaned = pd.read_csv('ai_tweets.csv')

# Plot the number of likes and retweets
plt.figure(figsize=(10, 5))

# Number of likes
plt.subplot(1, 2, 1)
plt.hist(tweets_cleaned['likes_count'], bins=20, color='blue', edgecolor='black')
plt.title('Distribution of Likes')
plt.xlabel('Number of Likes')
plt.ylabel('Frequency')

# Number of retweets
plt.subplot(1, 2, 2)
plt.hist(tweets_cleaned['retweets_count'], bins=20, color='green', edgecolor='black')
plt.title('Distribution of Retweets')
plt.xlabel('Number of Retweets')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
