Advance Data Mining Assignment


ASSIGNMENT

Course Name: Advance Data Mining

Name: Umar Yameen

Father Name: Muhammad Yameen

Student ID: 20903

Date: 25th June, 2024


Question No:1
To collect the data of interest, I will use BeautifulSoup, a Python library for parsing HTML and XML that is often combined with other libraries, such as requests, to build web crawlers.

Topic: Latest Technology News

We'll scrape a popular technology news website to gather titles, publication dates, and brief descriptions
of the latest articles.

Steps:

1. Install required libraries.

2. Identify the website to scrape.

3. Write a script to collect the data.

4. Display the collected data.

Step 1: Install Required Libraries

Ensure you have BeautifulSoup and requests installed. If not, you can install them using pip:

pip install beautifulsoup4 requests

Step 2: Identify the Website

We will use TechCrunch as the website to scrape.

Step 3: Write a Script

Below is a script to scrape the latest technology news from TechCrunch:

import requests
from bs4 import BeautifulSoup

# Define the URL of the TechCrunch startups page
url = "https://techcrunch.com/startups/"

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all articles
    articles = soup.find_all('article')

    # List to store the scraped data
    news_data = []

    # Extract details from each article
    for article in articles:
        title_tag = article.find('h2')
        description_tag = article.find('p')
        date_tag = article.find('time')

        # Skip articles that are missing any of the expected elements
        if not (title_tag and description_tag and date_tag):
            continue

        # Append the data to the list
        news_data.append({
            'title': title_tag.text.strip(),
            'description': description_tag.text.strip(),
            'date': date_tag.get('datetime', '')
        })

    # Display the scraped data
    for news in news_data:
        print(f"Title: {news['title']}")
        print(f"Date: {news['date']}")
        print(f"Description: {news['description']}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Step 4: Display the Collected Data

Running the script will output the latest technology news articles from TechCrunch, including titles,
publication dates, and brief descriptions.

Sample Output

Title: TechCrunch Startup Battlefield


Date: 2024-06-18T12:00:00Z

Description: The best startups are competing for the coveted Disrupt Cup and a $50,000 prize.

Title: AI Startup Secures $10M Funding

Date: 2024-06-17T14:30:00Z

Description: An AI startup specializing in natural language processing has raised $10 million in a Series A
funding round.

Title: New Tech Hub in Silicon Valley

Date: 2024-06-16T09:00:00Z

Description: A new tech hub has been established in Silicon Valley, offering resources and support to
innovative startups.
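
As a small optional extension that is not part of the assignment script, the collected news_data list can be saved for later analysis. The snippet below is a sketch assuming pandas is installed; one illustrative record (taken from the sample output above) is included so it runs on its own, and the output filename is hypothetical:

import pandas as pd

# 'news_data' would normally come from the scraping script above;
# a single illustrative record is used here so the snippet is self-contained
news_data = [
    {"title": "TechCrunch Startup Battlefield",
     "description": "The best startups are competing for the coveted Disrupt Cup and a $50,000 prize.",
     "date": "2024-06-18T12:00:00Z"},
]

news_df = pd.DataFrame(news_data)
news_df.to_csv('techcrunch_news.csv', index=False)  # hypothetical output file
print(f"Saved {len(news_df)} articles to techcrunch_news.csv")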

Question No:2

Let's use Twint, an advanced Twitter scraping tool written in Python, to extract some Twitter data. Twint
is powerful because it doesn't require access to the Twitter API, which can be restrictive due to rate
limits and other constraints.

Installation

First, let's install Twint. Twint requires Python 3.6 or higher. You can install Twint using pip:

pip install twint

Extracting Data with Twint

We'll extract tweets containing a specific hashtag, for example, #AI, and display the tweet text,
username, and date of each tweet.

Step-by-Step Script

1. Import the Required Library

2. Configure Twint

3. Run the Twint Search

4. Display the Results


Here's a Python script using Twint:

import twint

# Configure Twint to search for tweets containing the hashtag #AI
c = twint.Config()
c.Search = "#AI"
c.Limit = 10           # Limit to 10 tweets for demonstration purposes
c.Lang = "en"          # Search for English tweets
c.Store_object = True  # Store tweets in a Python object
c.Hide_output = True   # Hide output in the terminal

# Run the search
twint.run.Search(c)

# Retrieve tweets from Twint's internal storage
tweets = twint.output.tweets_list

# Display the collected tweets
for tweet in tweets:
    print(f"Username: {tweet.username}")
    print(f"Date: {tweet.datestamp} {tweet.timestamp}")
    print(f"Tweet: {tweet.tweet}\n")

Explanation

1. Import the Required Library: We import Twint to use its functionality.

2. Configure Twint: We set up the search parameters:

o c.Search specifies the search query, which is #AI in this case.

o c.Limit limits the number of tweets to retrieve.


o c.Lang filters tweets to a specific language.

o c.Store_object tells Twint to store the results in a Python object.

o c.Hide_output hides the output in the terminal.

3. Run the Twint Search: We execute the search using twint.run.Search(c).

4. Display the Results: We access the stored tweets and print the username, date, and tweet text.
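
As an optional refinement that is not part of the assignment script, the same configuration can be restricted to a date window using Twint's Since and Until options (dates in "YYYY-MM-DD" format); the specific dates below are illustrative assumptions:

import twint

# Same #AI search as above, limited to a specific date range
c = twint.Config()
c.Search = "#AI"
c.Lang = "en"
c.Limit = 10
c.Since = "2024-06-01"  # only tweets posted on or after this date
c.Until = "2024-06-19"  # only tweets posted before this date
c.Store_object = True
c.Hide_output = True

twint.run.Search(c)
print(f"Collected {len(twint.output.tweets_list)} tweets")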

Sample Output

The output will be similar to this:


Username: ai_expert

Date: 2024-06-19 12:34:56

Tweet: AI is transforming the world in unprecedented ways. #AI

Username: tech_guru

Date: 2024-06-19 12:30:22

Tweet: Exciting developments in AI technology! #AI #MachineLearning

Username: datascientist

Date: 2024-06-19 12:28:15

Tweet: How AI is revolutionizing healthcare. #AI #HealthTech

Question No:3
To demonstrate the general steps involved in collecting data for analysis using crawlers, let's create a
practical example using Twint to scrape Twitter data related to the hashtag #AI.

Step 1: Identify the Data Source

We will use Twitter as our data source to collect tweets containing the hashtag #AI.

Step 2: Understand the Data Structure

We aim to collect the following elements from each tweet (an example record is sketched below):

 Username (e.g., Umarchauhdry)

 Date (e.g., 2024-06-19 12:28:15)

 Tweet text

 Likes (e.g., 89)

 Retweets (e.g., 112)
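
A minimal sketch of a single target record, using the illustrative values above; the dictionary keys are hypothetical and simply mirror the fields we plan to extract:

# Hypothetical example of one collected record (values are illustrative only)
example_record = {
    "username": "Umarchauhdry",
    "date": "2024-06-19 12:28:15",
    "tweet": "How AI is revolutionizing healthcare. #AI #HealthTech",
    "likes": 89,
    "retweets": 112,
}
print(example_record)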

Step 3: Choose a Crawler

We will use Twint, a Python library that allows scraping Twitter without API limitations.

Step 4: Build or Configure the Crawler

We will configure Twint to search for tweets containing #AI and set the necessary parameters.

Step 5: Implement Data Extraction Logic

We'll use Twint's configuration options to specify our data extraction needs.

Step 6: Handle Pagination

Twint automatically handles pagination, so we don't need to write extra code for this.
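
By contrast, a crawler built directly on requests and BeautifulSoup (as in Question 1) must follow pagination itself. Below is a minimal sketch of that pattern, assuming a hypothetical site whose listing pages are addressed with a ?page=N query parameter:

import requests
from bs4 import BeautifulSoup

# Hypothetical paginated listing; the URL pattern is an assumption for illustration
base_url = "https://example.com/articles"
collected = []

for page in range(1, 4):                      # crawl the first three pages
    response = requests.get(base_url, params={"page": page})
    if response.status_code != 200:
        break                                 # stop if a page is missing or blocked
    soup = BeautifulSoup(response.content, "html.parser")
    collected.extend(a.text.strip() for a in soup.find_all("h2"))

print(f"Collected {len(collected)} headlines across pages")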

Step 7: Ensure Compliance and Respect

We'll include a delay between requests to avoid overloading Twitter's servers and ensure compliance
with Twitter's terms of service.
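
The combined script below issues a single search, so no explicit delay appears in it; when several searches are run back to back, a simple way to space them out is time.sleep. The extra hashtags and the 10-second pause in this sketch are illustrative assumptions:

import time
import twint

# Be polite: pause between successive searches instead of firing them back to back
for hashtag in ["#AI", "#MachineLearning", "#DataMining"]:
    c = twint.Config()
    c.Search = hashtag
    c.Limit = 20
    c.Lang = "en"
    c.Store_object = True
    c.Hide_output = True
    twint.run.Search(c)
    time.sleep(10)  # wait 10 seconds before the next request

print(f"Collected {len(twint.output.tweets_list)} tweets in total")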

Step 8: Execute the Crawler

We will run the Twint script to start collecting data.

Step 9: Clean and Preprocess the Data

After collecting the data, we'll clean and preprocess it to remove any irrelevant information and format it
for analysis.

Step 10: Analyze the Data

We'll analyze the data using Python libraries like pandas and matplotlib to gain insights.

Python Script for Steps 4-8

Here's a complete Python script to collect and preprocess Twitter data using Twint:

import twint
import pandas as pd

# Configure Twint to search for tweets containing the hashtag #AI
c = twint.Config()
c.Search = "#AI"
c.Limit = 100          # Limit to 100 tweets for demonstration purposes
c.Lang = "en"          # Search for English tweets
c.Store_object = True  # Store tweets in a Python object
c.Hide_output = True   # Hide output in the terminal
c.Pandas = True        # Enable saving to a pandas DataFrame

# Run the search
twint.run.Search(c)

# Retrieve tweets from Twint's internal pandas storage
tweets_df = twint.storage.panda.Tweets_df

# Display the first few rows of the collected data
print(tweets_df.head())

# Clean and preprocess the data
# Select relevant columns (Twint's pandas storage stores the counts as 'nlikes' and 'nretweets')
tweets_cleaned = tweets_df[['date', 'username', 'tweet', 'nlikes', 'nretweets']]
tweets_cleaned = tweets_cleaned.rename(columns={'nlikes': 'likes_count',
                                                'nretweets': 'retweets_count'})

# Remove duplicates
tweets_cleaned = tweets_cleaned.drop_duplicates()

# Save to a CSV file for further analysis
tweets_cleaned.to_csv('ai_tweets.csv', index=False)

# Display the cleaned data
print(tweets_cleaned.head())

Explanation of the Script

1. Configure Twint: We set the search parameters, including the hashtag #AI, a limit of 100 tweets, English as the language, and storage of the results in a pandas DataFrame.
2. Run the Search: We execute the Twint search with the configured parameters.

3. Retrieve Data: We retrieve the data from Twint's internal storage and convert it to a pandas
DataFrame.

4. Clean and Preprocess Data: We select relevant columns, remove duplicates, and save the
cleaned data to a CSV file.

5. Display Data: We print the first few rows of the collected and cleaned data.

Step 9: Clean and Preprocess the Data

The script already includes data cleaning steps such as selecting relevant columns and removing
duplicates.
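
If the tweet text itself needs further preprocessing before analysis, URLs and @mentions can be stripped and the text lowercased. This is an optional extra step, not part of the assignment script; it assumes the ai_tweets.csv file saved above:

import re
import pandas as pd

# Load the cleaned tweets saved earlier
tweets_cleaned = pd.read_csv('ai_tweets.csv')

def normalize_tweet(text):
    """Lowercase a tweet and strip URLs and @mentions."""
    text = re.sub(r'https?://\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)          # remove @mentions
    return text.lower().strip()

# Add a normalized text column for downstream analysis
tweets_cleaned['tweet_clean'] = tweets_cleaned['tweet'].apply(normalize_tweet)
print(tweets_cleaned[['tweet', 'tweet_clean']].head())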

Step 10: Analyze the Data

Here's a simple analysis using pandas and matplotlib to visualize the number of likes and retweets:

import pandas as pd
import matplotlib.pyplot as plt

# Read the cleaned data from the CSV file
tweets_cleaned = pd.read_csv('ai_tweets.csv')

# Plot the number of likes and retweets
plt.figure(figsize=(10, 5))

# Number of likes
plt.subplot(1, 2, 1)
plt.hist(tweets_cleaned['likes_count'], bins=20, color='blue', edgecolor='black')
plt.title('Distribution of Likes')
plt.xlabel('Number of Likes')
plt.ylabel('Frequency')

# Number of retweets
plt.subplot(1, 2, 2)
plt.hist(tweets_cleaned['retweets_count'], bins=20, color='green', edgecolor='black')
plt.title('Distribution of Retweets')
plt.xlabel('Number of Retweets')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
