
Name: Greeshma Hedvikar Class: BE CMPN A2 Roll No: 26

Experiment No. 3

Aim: Data Cleaning and Storage: preprocess, filter, and store social media data for business
(using Python, MongoDB, R, etc.).

Theory:
1. What is preprocessing?
Ans: Preprocessing of data refers to the manipulation and preparation of raw data before it is
used for analysis. The goal of preprocessing is to transform the raw data into a form that
is more suitable for analysis, making it easier to extract meaningful insights and patterns.

Preprocessing can include various steps depending on the type and quality of the data, as
well as the analysis objectives. Some of the most common preprocessing techniques
include:

Data Cleaning: This involves identifying and correcting or removing any errors, missing
values, or inconsistencies in the data. For example, this could include removing duplicate
records or filling in missing data points.
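
For instance, a minimal sketch in Python using pandas (the DataFrame below is purely hypothetical):

import pandas as pd

# Hypothetical raw data containing a duplicate record and a missing value
df = pd.DataFrame({"user": ["a", "b", "b", "c"], "age": [25, 30, 30, None]})

df = df.drop_duplicates()                        # remove duplicate records
df["age"] = df["age"].fillna(df["age"].mean())   # fill in missing data points
print(df)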

Data Transformation: This involves transforming the data to a more appropriate form
for analysis. For example, this could include converting data from one measurement unit
to another or normalizing data to have a mean of zero and standard deviation of one.
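
As a minimal sketch, standardization to zero mean and unit standard deviation can be done with scikit-learn; the values below are purely illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements on an arbitrary scale
values = np.array([[10.0], [20.0], [30.0], [40.0]])

# z-score normalization: (x - mean) / std
scaler = StandardScaler()
standardized = scaler.fit_transform(values)
print(standardized)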

Data Integration: This involves combining data from multiple sources into a single
dataset. For example, this could include merging data from different databases or files.
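
A minimal sketch using pandas (the table and key names here are hypothetical):

import pandas as pd

# Hypothetical data from two sources sharing a common "user_id" key
profiles = pd.DataFrame({"user_id": [1, 2], "name": ["Ana", "Raj"]})
posts = pd.DataFrame({"user_id": [1, 1, 2], "post": ["hi", "news", "sale"]})

# Merge the two sources into a single dataset
merged = profiles.merge(posts, on="user_id", how="inner")
print(merged)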

Data Reduction: This involves reducing the amount of data to be analyzed while still
preserving the important information. For example, this could include using feature
selection techniques to identify the most relevant variables for analysis or applying
dimensionality reduction techniques to reduce the number of variables while preserving
as much of the original data as possible.
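
For example, a minimal sketch of dimensionality reduction with PCA in scikit-learn (the random data is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 samples with 10 features
X = np.random.rand(100, 10)

# Keep the 3 components that preserve the most variance
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)   # (100, 3)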

Data Discretization: This involves converting continuous data into discrete categories.
For example, this could include categorizing ages into age groups or grouping incomes
into income brackets.
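
A minimal sketch using pandas (the age bins chosen below are arbitrary examples):

import pandas as pd

ages = pd.Series([5, 17, 25, 42, 68])

# Convert continuous ages into discrete age groups
age_groups = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                    labels=["child", "young adult", "adult", "senior"])
print(age_groups)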

Preprocessing is a critical step in the data analysis process as it helps ensure the accuracy
and relevance of the data used in analysis. By performing appropriate preprocessing
techniques, analysts can reduce errors, improve the quality of results, and increase the
efficiency of their analysis.

2. Why is preprocessing needed?

Ans: Preprocessing is needed for several reasons, including:

Data Quality: Raw data may contain errors, missing values, or inconsistencies that need
to be corrected or removed before analysis. Preprocessing helps to identify and handle
such issues to ensure data quality.

Data Relevance: Raw data may contain irrelevant or redundant features that do not
contribute to the analysis or may even introduce noise. Preprocessing helps to identify
and remove such features, reducing the dimensionality of the dataset and improving the
accuracy of analysis.

Data Normalization: Raw data may have different scales, units, or ranges, making it
difficult to compare or combine features. Preprocessing helps to normalize the data to a
common scale, unit, or range, making it easier to compare and combine features.

Data Integration: Raw data may come from multiple sources and may not be in a format
that can be easily combined. Preprocessing helps to integrate data from multiple sources
into a single dataset, making it easier to analyze.

Data Reduction: Raw data may be large and complex, making it difficult to analyze
efficiently. Preprocessing helps to reduce the size and complexity of the dataset while
retaining its important features, making it easier to analyze and extract meaningful
insights.

Overall, preprocessing is essential to ensure the accuracy, relevance, and efficiency of
data analysis. By addressing issues with data quality, relevance, normalization,
integration, and reduction, preprocessing helps to prepare the data for analysis and extract
valuable insights from it.

3. Explain 7-8 functions (Python/R) with syntax for preprocessing.

Ans: I. Lowercase conversion:
This technique involves converting all the text to lowercase. This is useful because it
helps to reduce the dimensionality of the data and makes it easier to compare and analyze
text.

text = "The quick brown Fox JUMPS over the Lazy Dog."
preprocessed_text = text.lower()
print(preprocessed_text)

II. Removing Punctuation:
This technique involves removing all the punctuation marks from the text. This is useful
because punctuation marks do not add any value to the text analysis.

import string
text = "Hello, World!"
preprocessed_text = text.translate(str.maketrans("", "", string.punctuation))
print(preprocessed_text)

III. Removing Numbers:
This technique involves removing all the numerical digits from the text. This is useful
because numerical digits do not provide any contextual meaning to the text.

import re
text = "The price of the product is $50"
preprocessed_text = re.sub(r'\d+', '', text)
print(preprocessed_text)

IV. Removing Stopwords:
This technique involves removing common words that do not provide any meaningful
information for text analysis. Stopwords include words like "the", "a", "an", "and", etc.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
text = "The quick brown fox jumps over the lazy dog."
words = text.split()
preprocessed_text = [word for word in words if word.casefold() not in stop_words]
preprocessed_text = ' '.join(preprocessed_text)
print(preprocessed_text)

V. Tokenization:
This technique involves breaking the text into individual words or tokens. This is useful
because it helps to analyze text on a more granular level.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog."
preprocessed_text = word_tokenize(text)
print(preprocessed_text)

VI. Stemming:
This technique involves reducing words to their root form by removing suffixes. This is
useful because it helps to group words with similar meanings together.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
ps = PorterStemmer()
text = "I am going to the park to play."
words = word_tokenize(text)
preprocessed_text = ' '.join([ps.stem(word) for word in words])
print(preprocessed_text)

VII. Lemmatization:
This technique is similar to stemming, but it involves reducing words to their base form,
which can be a real word. Here's an example of how to do it in Python using the NLTK
library:

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
text = "This is an example text with some words to be lemmatized."
lemmatized_words = [lemmatizer.lemmatize(word) for word in text.split()]
print(lemmatized_words)
VIII. Removing HTML tags:
This technique involves removing any HTML tags from the text data. Here's an example
of how to do it in Python using the BeautifulSoup library:

from bs4 import BeautifulSoup

html_text = "<p>This is an example <strong>HTML</strong> text.</p>"
soup = BeautifulSoup(html_text, 'html.parser')
text_no_html = soup.get_text()
print(text_no_html)

IX. Removing special characters:
This technique involves removing any special characters like "@", "#", "$", "%" from the
text data. Here's an example of how to do it in Python using regular expressions:

import re
text = "This is an example text with @special characters#."
text_no_special_chars = re.sub(r'\W+', ' ', text)
print(text_no_special_chars)

Implementation:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')
# from google.colab import files

from google.colab import drive

drive.mount('/content/drive/', force_remount=True)

# Load the dataset
%cd '/content/drive/MyDrive/BE Lab'
df = pd.read_csv('Sem 8/SMA/Exp 3/dataset.csv')

# Extract a particular column
column_name = "review"
data_to_pre_process = df[column_name].head(5)

# Display the extracted column
print(data_to_pre_process)

# 1. Lowercase conversion
def to_lowercase(text):
    return text.lower()

# 2. Remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

# 3. Remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# 4. Remove stopwords
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    return " ".join([word for word in tokens if word not in stop_words])

# 5. Tokenization
def tokenize(text):
    return word_tokenize(text)

# 6. Stemming
def stem(text):
    stemmer = SnowballStemmer("english")
    tokens = word_tokenize(text)
    return " ".join([stemmer.stem(word) for word in tokens])

# 7. Lemmatization
def lemmatize(text):
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    return " ".join([lemmatizer.lemmatize(word) for word in tokens])

# 8. Remove HTML tags
def remove_html_tags(text):
    return re.sub(r'<.*?>', '', text)

# 9. Remove special characters
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9 ]', '', text)

print(" ")
print("1.Lower Case\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(to_lowercase(text))

print(" ")
print("2.Remove punctuation\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(remove_punctuation(text))

print(" ")
print("3.Remove numbers\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(remove_numbers(text))

print(" ")
print("4.Remove stopwords\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(remove_stopwords(text))

print(" ")
print("5.Tokenization\n")
for text in data_to_pre_process:
Name: Greeshma Hedvikar Class: BE CMPN A2 Roll No: 26

print(" ",end=" ")


print(tokenize(text))

print(" ")
print("6.Stemming\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(stem(text))

print(" ")
print("7.Lemmatization\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(lemmatize(text))

print(" ")
print("8.Remove HTML tags\n")
for text in data_to_pre_process:
print(" ",end=" ")
print(remove_html_tags(text))

print("9.Remove special characters\n")


for text in data_to_pre_process:
print(" ",end=" ")
print(remove_special_characters(text))
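
To cover the storage part of the aim, the cleaned reviews could be written to MongoDB. The following is a minimal sketch assuming a MongoDB instance is running locally and the pymongo driver is installed; the database and collection names are hypothetical:

from pymongo import MongoClient

# Assumes MongoDB is running on the default local port
client = MongoClient("mongodb://localhost:27017/")
collection = client["sma_lab"]["cleaned_reviews"]   # hypothetical names

# Store each review along with its cleaned form as one document
for text in data_to_pre_process:
    cleaned = remove_special_characters(remove_html_tags(to_lowercase(text)))
    collection.insert_one({"original": text, "cleaned": cleaned})

print(collection.count_documents({}))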

Output:

Conclusion: Using various NLTK tools and Python string-manipulation methods, we were able to
clean, preprocess, filter, and store social media data for business analysis.
