Advanced Data Mining Assignment
Question No:1
We'll scrape a popular technology news website to gather titles, publication dates, and brief descriptions
of the latest articles.
Steps:
Ensure you have BeautifulSoup and requests installed. If not, you can install them using pip:
pip install beautifulsoup4 requests
We'll use TechCrunch's startup section as our example source.
import requests
from bs4 import BeautifulSoup

url = "https://techcrunch.com/startups/"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = soup.find_all('article')

    # List to store the scraped data
    news_data = []

    for article in articles:
        title = article.find('h2').text.strip()
        description = article.find('p').text.strip()
        date = article.find('time')['datetime']
        news_data.append({
            'title': title,
            'description': description,
            'date': date
        })

    for news in news_data:
        print(f"Title: {news['title']}")
        print(f"Date: {news['date']}")
        print(f"Description: {news['description']}\n")
else:
    print(f"Failed to fetch the page: status code {response.status_code}")
Running the script will output the latest technology news articles from TechCrunch, including titles,
publication dates, and brief descriptions.
Sample Output

Description: The best startups are competing for the coveted Disrupt Cup and a $50,000 prize.
Date: 2024-06-17T14:30:00Z

Description: An AI startup specializing in natural language processing has raised $10 million in a Series A funding round.
Date: 2024-06-16T09:00:00Z

Description: A new tech hub has been established in Silicon Valley, offering resources and support to innovative startups.
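The extraction logic above can be exercised without a network request. Below is a minimal sketch that runs the same find/find_all calls against a hard-coded HTML snippet; the snippet and its values are invented for illustration, not real TechCrunch markup:

```python
from bs4 import BeautifulSoup

# Invented snippet mimicking the <article> markup the script expects
html = """
<article>
  <h2>Disrupt 2024 Kicks Off</h2>
  <time datetime="2024-06-17T14:30:00Z">June 17</time>
  <p>Startups compete for the Disrupt Cup.</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")
title = article.find("h2").text.strip()
date = article.find("time")["datetime"]
description = article.find("p").text.strip()
print(title, "|", date, "|", description)
```

Testing against a fixed snippet like this is a quick way to verify the selectors before pointing the script at the live site, whose markup may change.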
Question No:2
Let's use Twint, an advanced Twitter scraping tool written in Python, to extract some Twitter data. Twint
is powerful because it doesn't require access to the Twitter API, which can be restrictive due to rate
limits and other constraints.
Installation
First, let's install Twint. Twint requires Python 3.6 or higher. You can install Twint using pip:
pip install twint
We'll extract tweets containing a specific hashtag, for example, #AI, and display the tweet text,
username, and date of each tweet.
Step-by-Step Script

import twint

# 1. Configure Twint
c = twint.Config()
c.Search = "#AI"
c.Limit = 20
c.Store_object = True
c.Hide_output = True

# 2. Run the search
twint.run.Search(c)

# 3. Retrieve the stored tweets
tweets = twint.output.tweets_list

# 4. Display the results
for tweet in tweets:
    print(f"Username: {tweet.username}")
    print(f"Date: {tweet.datestamp}")
    print(f"Tweet: {tweet.tweet}\n")
Explanation
1. Configure Twint: We set the search query to the hashtag #AI and enable Store_object so results are kept in memory.
2. Run the Search: We execute the Twint search with the configured parameters.
3. Retrieve the Tweets: We read the stored tweets from twint.output.tweets_list.
4. Display the Results: We access the stored tweets and print the username, date, and tweet text.
Sample Output
Username: ai_expert
Username: tech_guru
Username: datascientist
Question No:3
To demonstrate the general steps involved in collecting data for analysis using crawlers, let's create a
practical example using Twint to scrape Twitter data related to the hashtag #AI.
We will use Twitter as our data source to collect tweets containing the hashtag #AI.
An example of a collected tweet record:
Username: Umarchauhdry
Date: 2024-06-19 12:28:15
Likes: 89
Retweets: 112
1. Tool: We will use Twint, a Python library that allows scraping Twitter without API limitations.
2. Configuration: We will configure Twint to search for tweets containing #AI and set the necessary parameters.
3. Data Extraction: We'll use Twint's configuration options to specify our data extraction needs.
4. Pagination: Twint automatically handles pagination, so we don't need to write extra code for this.
5. Rate Limiting: We'll include a delay between requests to avoid overloading Twitter's servers and ensure compliance with Twitter's terms of service.
6. Cleaning and Preprocessing: After collecting the data, we'll clean and preprocess it to remove any irrelevant information and format it for analysis.
7. Analysis: We'll analyze the data using Python libraries like pandas and matplotlib to gain insights.
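As far as I know, Twint's configuration does not expose a request-delay setting, so one simple way to space out successive searches is a small throttle around them. In this sketch the `throttled_search` helper and its `search_fn` parameter are illustrative assumptions; `print` stands in for a real `twint.run.Search` call so the sketch runs on its own:

```python
import time

def throttled_search(queries, delay, search_fn=print):
    """Run search_fn for each query, sleeping `delay` seconds between calls.

    In the real script search_fn would wrap twint.run.Search; print is a
    stand-in so this sketch is self-contained.
    """
    for i, query in enumerate(queries):
        if i > 0:
            time.sleep(delay)  # polite pause between requests
        search_fn(query)

start = time.monotonic()
throttled_search(["#AI", "#MachineLearning"], delay=0.2)
elapsed = time.monotonic() - start
print(f"Two searches took {elapsed:.2f}s")
```

With two queries and a 0.2-second delay, the run takes at least 0.2 seconds; a real scraper would use a delay of a few seconds.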
Here's a complete Python script to collect and preprocess Twitter data using Twint:
import twint
import pandas as pd

# Configure Twint
c = twint.Config()
c.Search = "#AI"
c.Limit = 100
c.Lang = "en"
c.Pandas = True

# Run the search
twint.run.Search(c)

# Retrieve the data from Twint's internal storage as a DataFrame
tweets_df = twint.storage.panda.Tweets_df
print(tweets_df.head())

# Select the relevant columns
tweets_cleaned = tweets_df[['date', 'username', 'tweet', 'nlikes', 'nretweets']]

# Remove duplicates
tweets_cleaned = tweets_cleaned.drop_duplicates()

# Save the cleaned data to a CSV file
tweets_cleaned.to_csv('ai_tweets.csv', index=False)
print(tweets_cleaned.head())
1. Configure Twint: We set the search parameters: the hashtag #AI, a limit of 100 tweets, English as the language, and storage of results in a pandas DataFrame.
2. Run the Search: We execute the Twint search with the configured parameters.
3. Retrieve Data: We retrieve the data from Twint's internal storage and convert it to a pandas
DataFrame.
4. Clean and Preprocess Data: We select relevant columns, remove duplicates, and save the
cleaned data to a CSV file.
5. Display Data: We print the first few rows of the collected and cleaned data.
The script already includes data cleaning steps such as selecting relevant columns and removing
duplicates.
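Those cleaning steps can be seen in isolation on a small toy DataFrame. The column names below follow Twint's convention (`nlikes`, etc.), but the rows are invented for illustration:

```python
import pandas as pd

# Toy rows standing in for Twint's Tweets_df; values are invented
raw = pd.DataFrame({
    "username": ["ai_expert", "tech_guru", "ai_expert"],
    "tweet": ["AI is eating software", "New #AI chip", "AI is eating software"],
    "nlikes": [10, 25, 10],
    "language": ["en", "en", "en"],
})

# Keep only the relevant columns, then drop exact duplicate rows
cleaned = raw[["username", "tweet", "nlikes"]].drop_duplicates()
print(cleaned)
```

Here the third row is an exact copy of the first across the selected columns, so drop_duplicates leaves two rows.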
Here's a simple analysis using pandas and matplotlib to visualize the number of likes and retweets:
import pandas as pd
import matplotlib.pyplot as plt

tweets_cleaned = pd.read_csv('ai_tweets.csv')

plt.figure(figsize=(10, 5))

# Number of likes
plt.subplot(1, 2, 1)
plt.hist(tweets_cleaned['nlikes'], bins=20)
plt.title('Distribution of Likes')
plt.xlabel('Number of Likes')
plt.ylabel('Frequency')

# Number of retweets
plt.subplot(1, 2, 2)
plt.hist(tweets_cleaned['nretweets'], bins=20)
plt.title('Distribution of Retweets')
plt.xlabel('Number of Retweets')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
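Beyond histograms, simple aggregates give a quick first insight. This sketch computes average engagement on invented toy numbers rather than the real ai_tweets.csv:

```python
import pandas as pd

# Invented engagement numbers; the real script would read ai_tweets.csv
tweets = pd.DataFrame({
    "nlikes": [89, 10, 25, 40],
    "nretweets": [112, 3, 7, 18],
})

avg_likes = tweets["nlikes"].mean()
avg_retweets = tweets["nretweets"].mean()
print(f"Average likes: {avg_likes}, average retweets: {avg_retweets}")
# → Average likes: 41.0, average retweets: 35.0
```

The same `mean()` calls applied to the CSV's `nlikes` and `nretweets` columns would summarize real engagement levels for the #AI hashtag.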