
Name: Greeshma Hedvikar Class: BE CMPN A2 Roll No: 26

Experiment No. 3

Aim: Data Cleaning and Storage: preprocess, filter, and store social media data for business
(using Python, MongoDB, R, etc.).

Theory:
1. What is preprocessing?
Ans: Preprocessing of data refers to the manipulation and preparation of raw data before it is
used for analysis. The goal of preprocessing is to transform the raw data into a form that
is more suitable for analysis, making it easier to extract meaningful insights and patterns.

Preprocessing can include various steps depending on the type and quality of the data, as
well as the analysis objectives. Some of the most common preprocessing techniques
include:

Data Cleaning: This involves identifying and correcting or removing any errors, missing
values, or inconsistencies in the data. For example, this could include removing duplicate
records or filling in missing data points.
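
For instance, a minimal sketch in Python using pandas (the DataFrame below is purely hypothetical):

import pandas as pd

# Hypothetical raw data containing a duplicate record and a missing value
df = pd.DataFrame({"user": ["a", "b", "b", "c"], "age": [25, 30, 30, None]})

df = df.drop_duplicates()                        # remove duplicate records
df["age"] = df["age"].fillna(df["age"].mean())   # fill in missing data points
print(df)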

Data Transformation: This involves transforming the data to a more appropriate form
for analysis. For example, this could include converting data from one measurement unit
to another or normalizing data to have a mean of zero and standard deviation of one.
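
As a minimal sketch, standardization to zero mean and unit standard deviation can be done with scikit-learn; the values below are purely illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements on an arbitrary scale
values = np.array([[10.0], [20.0], [30.0], [40.0]])

# z-score normalization: (x - mean) / std
scaler = StandardScaler()
standardized = scaler.fit_transform(values)
print(standardized)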

Data Integration: This involves combining data from multiple sources into a single
dataset. For example, this could include merging data from different databases or files.
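
A minimal sketch using pandas (the table and key names here are hypothetical):

import pandas as pd

# Hypothetical data from two sources sharing a common "user_id" key
profiles = pd.DataFrame({"user_id": [1, 2], "name": ["Ana", "Raj"]})
posts = pd.DataFrame({"user_id": [1, 1, 2], "post": ["hi", "news", "sale"]})

# Merge the two sources into a single dataset
merged = profiles.merge(posts, on="user_id", how="inner")
print(merged)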

Data Reduction: This involves reducing the amount of data to be analyzed while still
preserving the important information. For example, this could include using feature
selection techniques to identify the most relevant variables for analysis or applying
dimensionality reduction techniques to reduce the number of variables while preserving
as much of the original data as possible.
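
For example, a minimal sketch of dimensionality reduction with PCA in scikit-learn (the random data is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 samples with 10 features
X = np.random.rand(100, 10)

# Keep the 3 components that preserve the most variance
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)   # (100, 3)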

Data Discretization: This involves converting continuous data into discrete categories.
For example, this could include categorizing ages into age groups or grouping incomes
into income brackets.
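
A minimal sketch using pandas (the age bins chosen below are arbitrary examples):

import pandas as pd

ages = pd.Series([5, 17, 25, 42, 68])

# Convert continuous ages into discrete age groups
age_groups = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                    labels=["child", "young adult", "adult", "senior"])
print(age_groups)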

Preprocessing is a critical step in the data analysis process as it helps ensure the accuracy
and relevance of the data used in analysis. By performing appropriate preprocessing
techniques, analysts can reduce errors, improve the quality of results, and increase the
efficiency of their analysis.

2. Why is preprocessing needed?

Ans: Preprocessing is needed for several reasons, including:

Data Quality: Raw data may contain errors, missing values, or inconsistencies that need
to be corrected or removed before analysis. Preprocessing helps to identify and handle
such issues to ensure data quality.

Data Relevance: Raw data may contain irrelevant or redundant features that do not
contribute to the analysis or may even introduce noise. Preprocessing helps to identify
and remove such features, reducing the dimensionality of the dataset and improving the
accuracy of analysis.

Data Normalization: Raw data may have different scales, units, or ranges, making it
difficult to compare or combine features. Preprocessing helps to normalize the data to a
common scale, unit, or range, making it easier to compare and combine features.

Data Integration: Raw data may come from multiple sources and may not be in a format
that can be easily combined. Preprocessing helps to integrate data from multiple sources
into a single dataset, making it easier to analyze.

Data Reduction: Raw data may be large and complex, making it difficult to analyze
efficiently. Preprocessing helps to reduce the size and complexity of the dataset while
retaining its important features, making it easier to analyze and extract meaningful
insights.

Overall, preprocessing is essential to ensure the accuracy, relevance, and efficiency of
data analysis. By addressing issues with data quality, relevance, normalization,
integration, and reduction, preprocessing helps to prepare the data for analysis and extract
valuable insights from it.

3. Explain 7-8 functions (Python/R) with syntax for preprocessing.

Ans: I. Lowercase conversion:
This technique involves converting all the text to lowercase. This is useful because it
helps to reduce the dimensionality of the data and makes it easier to compare and analyze
text.

text = "The quick brown Fox JUMPS over the Lazy Dog."
preprocessed_text = text.lower()
print(preprocessed_text)

II. Removing Punctuation:
This technique involves removing all the punctuation marks from the text. This is useful
because punctuation marks do not add any value to the text analysis.

import string
text = "Hello, World!"
preprocessed_text = text.translate(str.maketrans("", "", string.punctuation))
print(preprocessed_text)

III. Removing Numbers:
This technique involves removing all the numerical digits from the text. This is useful
because numerical digits do not provide any contextual meaning to the text.

import re
text = "The price of the product is $50"
preprocessed_text = re.sub(r'\d+', '', text)
print(preprocessed_text)

IV. Removing Stopwords:
This technique involves removing common words that do not provide any meaningful
information for text analysis. Stopwords include words like "the", "a", "an", "and", etc.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
text = "The quick brown fox jumps over the lazy dog."
words = text.split()
preprocessed_text = [word for word in words if word.casefold() not in stop_words]
preprocessed_text = ' '.join(preprocessed_text)
print(preprocessed_text)

V. Tokenization:
This technique involves breaking the text into individual words or tokens. This is useful
because it helps to analyze text on a more granular level.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog."
preprocessed_text = word_tokenize(text)
print(preprocessed_text)

VI. Stemming:
This technique involves reducing words to their root form by removing suffixes. This is
useful because it helps to group words with similar meanings together.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
ps = PorterStemmer()
text = "I am going to the park to play."
words = word_tokenize(text)
preprocessed_text = ' '.join([ps.stem(word) for word in words])
print(preprocessed_text)

VII. Lemmatization:
This technique is similar to stemming, but it involves reducing words to their base form,
which can be a real word. Here's an example of how to do it in Python using the NLTK
library:

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
text = "This is an example text with some words to be lemmatized."
lemmatized_words = [lemmatizer.lemmatize(word) for word in text.split()]
print(lemmatized_words)
VIII. Removing HTML tags:
This technique involves removing any HTML tags from the text data. Here's an example
of how to do it in Python using the BeautifulSoup library:

from bs4 import BeautifulSoup

html_text = "<p>This is an example <strong>HTML</strong> text.</p>"
soup = BeautifulSoup(html_text, 'html.parser')
text_no_html = soup.get_text()
print(text_no_html)

IX. Removing special characters:
This technique involves removing any special characters like "@", "#", "$", "%" from the
text data. Here's an example of how to do it in Python using regular expressions:

import re
text = "This is an example text with @special characters#."
text_no_special_chars = re.sub(r'\W+', ' ', text)
print(text_no_special_chars)

Implementation:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')
# from google.colab import files

from google.colab import drive

drive.mount('/content/drive/', force_remount=True)

# Load the dataset
%cd '/content/drive/MyDrive/BE Lab'
df = pd.read_csv('Sem 8/SMA/Exp 3/dataset.csv')

# Extract a particular column
column_name = "review"
data_to_pre_process = df[column_name].head(5)

# Display the extracted column
print(data_to_pre_process)

# 1. Lowercase conversion
def to_lowercase(text):
    return text.lower()

# 2. Remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

# 3. Remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# 4. Remove stopwords
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    return " ".join([word for word in tokens if word not in stop_words])

# 5. Tokenization
def tokenize(text):
    return word_tokenize(text)

# 6. Stemming
def stem(text):
    stemmer = SnowballStemmer("english")
    tokens = word_tokenize(text)
    return " ".join([stemmer.stem(word) for word in tokens])

# 7. Lemmatization
def lemmatize(text):
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    return " ".join([lemmatizer.lemmatize(word) for word in tokens])

# 8. Remove HTML tags
def remove_html_tags(text):
    return re.sub(r'<.*?>', '', text)

# 9. Remove special characters
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9 ]', '', text)

print(" ")
print("1.Lower Case\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(to_lowercase(text))

print(" ")
print("2.Remove punctuation\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(remove_punctuation(text))

print(" ")
print("3.Remove numbers\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(remove_numbers(text))

print(" ")
print("4.Remove stopwords\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(remove_stopwords(text))

print(" ")
print("5.Tokenization\n")
for text in data_to_pre_process:
Name: Greeshma Hedvikar Class: BE CMPN A2 Roll No: 26

print(" ",end=" ")


print(tokenize(text))

print(" ")
print("6.Stemming\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(stem(text))

print(" ")
print("7.Lemmatization\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(lemmatize(text))

print(" ")
print("8.Remove HTML tags\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(remove_html_tags(text))

print("9.Remove special characters\n")


for text in data_to_pre_process:
print(" ",end=" ")
print(remove_special_characters(text))
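
To cover the storage part of the aim, the cleaned reviews could be written to MongoDB. The following is a minimal sketch assuming a MongoDB instance is running locally and the pymongo driver is installed; the database and collection names are hypothetical:

from pymongo import MongoClient

# Assumes MongoDB is running on the default local port
client = MongoClient("mongodb://localhost:27017/")
collection = client["sma_lab"]["cleaned_reviews"]   # hypothetical names

# Store each review along with its cleaned form as one document
for text in data_to_pre_process:
    cleaned = remove_special_characters(remove_html_tags(to_lowercase(text)))
    collection.insert_one({"original": text, "cleaned": cleaned})

print(collection.count_documents({}))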

Output:

Conclusion: Using various NLTK tools and Python string-manipulation methods, we were able to
clean, preprocess, filter, and store social media data for business analysis.
