
LIST OF EXERCISES:

1. Install the following Data Mining and Data Analysis tools: Weka, KNIME, Tableau
Public.

2. Perform exploratory data analysis (EDA) on datasets such as an email dataset. Export all
your emails as a dataset, import them into a pandas data frame, visualize them, and derive
different insights from the data.

3. Perform Time Series Analysis with datasets like Open Power System Data.

4. Build a time-series model on a given dataset and evaluate its accuracy.

5. Perform Data Analysis and representation on a Map using various Map data sets with
Mouse Rollover effect, user interaction, etc.

6. Build cartographic visualization for multiple datasets involving various countries of the
world; states and districts in India, etc.

7. Perform text mining on a set of documents and visualize the most important words in a
visualization such as word cloud.

8. Use a case study on a data set and apply the various visualization techniques and present
an analysis report.
EX.NO.1 Install the following Data Mining and Data Analysis tools: Weka, KNIME, Tableau
Public.

KNIME: a tool for data analysis, manipulation, visualization, and reporting

 Based on the graphical programming paradigm
 Provides a diverse array of extensions:
 Text Mining
 Network Mining
 Cheminformatics
 Many integrations
Weka stands for Waikato Environment for Knowledge Analysis. It is free software used in
the data science field for data mining. It is written in Java, so it can run on any system that
supports Java; hence Weka runs on different operating systems such as Windows, Linux,
and Mac. Weka provides a collection of visualization tools that can be used for data analysis,
cleaning, and predictive modeling, and it can perform a number of tasks such as data
preprocessing, clustering, classification, regression, visualization, and feature selection.

Installing Weka on Windows:

Follow the steps below to install Weka on Windows:


Step 1: Visit the official Weka download website using any web browser and click on Free Download.
Step 2: It will redirect to a new webpage; click on Start Download. The download of the
executable file will start shortly. It is a large file (about 118 MB), so it may take a few
minutes.

Step 3: Now locate the executable file in the Downloads folder on your system and run it.
Step 4: It will prompt for confirmation to make changes to your system. Click on Yes.

Step 5: The setup screen will appear; click on Next.

Step 6: The next screen shows the License Agreement; click on I Agree.

Step 7: The next screen is for choosing components. All components are already selected,
so don't change anything; just click on the Install button.
Step 8: The next screen asks for the installation location, so choose a drive with sufficient
free space. The installation requires about 301 MB of disk space.

Step 9: The next screen is for choosing the Start menu folder; leave the default and just
click on the Install button.
Step 10: After this, the installation process will start and should take about a minute to
complete.

Step 11: Click on the Next button after the installation process is complete.
Step 12: Click on Finish to finish the installation process.

Step 13: Weka is now successfully installed on the system, and an icon is created on the
desktop.
Step 14: Run the software and see the interface.
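
Optionally, the install can also be sanity-checked by launching Weka's bundled JAR from Python; this is a minimal sketch, assuming Java is on the PATH, and the install directory shown is an assumption based on the installer defaults:

import subprocess

# Hedged sketch: launch the Weka GUI Chooser to verify the install.
# The path below is an assumption based on the default installer settings.
weka_jar = r"C:\Program Files\Weka-3-8\weka.jar"
subprocess.run(["java", "-jar", weka_jar])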

Result: Thus the Weka tool is successfully installed.


EX.NO.2:
Perform exploratory data analysis (EDA) on datasets such as an email dataset. Export all your
emails as a dataset, import them into a pandas data frame, visualize them, and derive different
insights from the data.

Step 1: Load necessary libraries and data set


import pandas as pd
import numpy as np

# for visualization
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objs as go
from wordcloud import WordCloud

# nltk, used for NLP


import nltk
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Preprocessing (sklearn)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Modeling
from sklearn.ensemble import RandomForestClassifier
from lightgbm.sklearn import LGBMClassifier
import xgboost as xgb
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Neural Network
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, GlobalMaxPooling1D, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# scoring
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, RocCurveDisplay

# styling
plt.style.use('ggplot')

Read the dataset and visualize the distribution of missing values:

df = pd.read_csv('../input/spam-email/spam.csv')
msno.matrix(df).set_title('Distribution of missing values',fontsize=20)
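
Before plotting, it helps to glance at the raw records and the class balance; a quick check, assuming the Category and Message columns used in the rest of this exercise:

# Quick sanity checks on the loaded dataframe
print(df.shape)
print(df.head())
print(df['Category'].value_counts())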

Pie chart of the category distribution (spam or not):


category_ct = df['Category'].value_counts()

fig = px.pie(values=category_ct.values,
             names=category_ct.index,
             color_discrete_sequence=px.colors.sequential.OrRd,
             title='Pie Graph: spam or not')
fig.update_traces(hoverinfo='label+percent', textinfo='label+value+percent',
                  textfont_size=15,
                  marker=dict(line=dict(color='#000000', width=2)))
fig.show()
Length distribution of spam and ham messages:

categories = pd.get_dummies(df["Category"])
spam_or_not = pd.concat([df, categories], axis=1)
spam_or_not.drop('Category', axis=1, inplace=True)

df["length"] = df["Message"].apply(len)

ham = df.loc[np.where(spam_or_not['ham'] == 1)].reset_index()
spam = df.loc[np.where(spam_or_not['ham'] == 0)].reset_index()

ham.drop('index', axis=1, inplace=True)
spam.drop('index', axis=1, inplace=True)

hist_data = [ham['length'], spam['length']]
group_labels = ['ham', 'spam']
colors = ['black', 'red']

# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot(hist_data, group_labels, show_hist=False, colors=colors)

# Add title
fig.update_layout(title_text='Length distribution of ham and spam messages',
                  template='simple_white')
fig.show()

RESULT: Thus exploratory data analysis (EDA) on an email dataset is successfully
executed.
EX.NO.3: Perform Time Series Analysis with datasets like Open Power System Data.

1. Importing all the necessary libraries
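
The import code itself is not shown in the original listing; a minimal sketch of what this step might look like, assuming only pandas and matplotlib are needed for the code below:

# Minimal imports for this time series exercise
import pandas as pd
import matplotlib.pyplot as plt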

2. Loading the dataset

Code:
data = pd.read_csv("Open Power Systems Data.csv", index_col=0, parse_dates=True)
data.head()

3. Finding out the columns and describing the dataset

Code:
print(data.head())
print('\n')
print(data.columns)
print('\n')
print(data.info())
print('\n')
print(data.describe())

The output of the above code appears below.
4. Plotting the Consumption, Solar, and Wind columns:

cols_plot = ['Consumption', 'Solar', 'Wind']
axes = data[cols_plot].plot(marker='o', alpha=1, linestyle='None', figsize=(11, 9), subplots=True)
for x in axes:
    x.set_ylabel('Daily Totals (GWh)')

The output appears as shown below.
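
As an optional further step, the daily series can be resampled to a coarser frequency to expose longer-term trends; a sketch, assuming the DatetimeIndex created by parse_dates above:

# Resample the daily totals to monthly means to smooth out day-to-day noise
monthly = data[cols_plot].resample('M').mean()
monthly.plot(figsize=(11, 5))
plt.ylabel('Monthly Mean (GWh)')
plt.show()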

Result:
Thus the time series analysis is performed and visualized.
EX.NO.4 Build a time-series model on a given dataset and evaluate its accuracy.

Import the necessary Packages and Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
data = pd.read_csv('Healthcare-Diabetes.csv')
data.head()
The output of the code:

X = data.drop("Outcome", axis=1)
y = data["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression()

# Train the model


model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)

correct_diabetes = ((y_test == 1) & (y_pred == 1)).sum()
correct_no_diabetes = ((y_test == 0) & (y_pred == 0)).sum()

total_diabetes = (y_test == 1).sum()
total_no_diabetes = (y_test == 0).sum()

# Per-class accuracy (i.e., recall for each class)
accuracy_diabetes = correct_diabetes / total_diabetes
accuracy_no_diabetes = correct_no_diabetes / total_no_diabetes
import matplotlib.pyplot as plt
# Create a simple bar plot
labels = ['Diabetes', 'No Diabetes']
accuracies = [accuracy_diabetes, accuracy_no_diabetes]

plt.bar(labels, accuracies)
plt.ylabel('Accuracy')
plt.title('Accuracy for Diabetes and No Diabetes')
plt.ylim(0, 1) # Set y-axis range to 0-1 for percentages
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns

# Create a DataFrame with actual and predicted outcomes


results_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

# Count the occurrences of each class


class_counts = results_df['Actual'].value_counts()
plt.figure(figsize=(6, 4))
sns.set(style="whitegrid")
sns.countplot(x='Actual', data=results_df, palette='Set1')
plt.title('Actual Outcome Counts')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1], labels=['No Diabetes', 'Diabetes'])
plt.show()

RESULT: Thus the time-series model is built and its accuracy is visualized.
EX.NO.5
Perform Data Analysis and representation on a Map using various Map data sets with Mouse
Rollover effect, user interaction, etc.
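
The listings below rely on the folium library and on a DataFrame named daytime_robberies, neither of which is set up in the original listing. A minimal setup sketch follows; the CSV file name and the column names (OFFENSE_CODE_GROUP, HOUR, Lat, Long) are assumptions modeled on a typical Boston crimes dataset:

import math

import pandas as pd
import folium
from folium import Marker
from folium.plugins import MarkerCluster

# Hypothetical crime records; the file name and columns are assumptions
crimes = pd.read_csv('crimes.csv')
daytime_robberies = crimes[(crimes['OFFENSE_CODE_GROUP'] == 'Robbery') &
                           (crimes['HOUR'].isin(range(9, 18)))]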

# Create a map
m_1 = folium.Map(location=[42.32, -71.0589], tiles='openstreetmap', zoom_start=10)

# Display the map
m_1

# Create a map
m_2 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)

# Add points to the map
for idx, row in daytime_robberies.iterrows():
    Marker([row['Lat'], row['Long']]).add_to(m_2)

# Display the map
m_2
# Create the map
m_3 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)

# Add points to the map as a marker cluster
mc = MarkerCluster()
for idx, row in daytime_robberies.iterrows():
    if not math.isnan(row['Long']) and not math.isnan(row['Lat']):
        mc.add_child(Marker([row['Lat'], row['Long']]))
m_3.add_child(mc)

# Display the map
m_3
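
The mouse rollover effect called for in the exercise can be obtained by attaching a tooltip to each marker, so that hovering shows details; a sketch, reusing the daytime_robberies DataFrame and its assumed HOUR column:

# Show the incident hour on hover (rollover) via a marker tooltip
m_4 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
for idx, row in daytime_robberies.iterrows():
    Marker([row['Lat'], row['Long']],
           tooltip=f"Hour: {row['HOUR']}").add_to(m_4)

# Display the map
m_4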

Result: Thus the rollover effect on the map is executed successfully.


EX.NO.6 Build cartographic visualization for multiple datasets involving various countries of
the world; states and districts in India etc.

import pandas as pd
import altair as alt
from vega_datasets import data

Geographic Data: GeoJSON and TopoJSON

Up to this point, we have worked with JSON and CSV formatted datasets that correspond to
data tables made up of rows (records) and columns (fields). In order to represent geographic
regions (countries, states, etc.) and trajectories (flight paths, subway lines, etc.), we need to
expand our repertoire with additional formats designed to support rich geometries.

GeoJSON models geographic features within a specialized JSON format. A GeoJSON feature
can include geometric data – such as the longitude, latitude coordinates that make up a country
boundary – as well as additional data attributes.

Here is a GeoJSON feature object for the boundary of the U.S. state of Colorado:

{
"type": "Feature",
"id": 8,
"properties": {"name": "Colorado"},
"geometry": {
"type": "Polygon",
"coordinates"
:[
[[-106.32056285448942,40.998675790862656],[-106.19134826714341,40.99813863734313],[-
105.27607827344248,40.99813863734313],[-104.9422739227986,40.99813863734313],[-
104.05212898774828,41.00136155846029],[-103.57475287338661,41.00189871197981],[-
103.38093099236758,41.00189871197981],[-102.65589358559272,41.00189871197981],[-
102.62000064466328,41.00189871197981],[-102.052892177978,41.00189871197981],[-
102.052892177978,40.74889940428302],[-102.052892177978,40.69733266640851],[-
102.052892177978,40.44003613055551],[-102.052892177978,40.3492571857556],[-
102.052892177978,40.00333031918079],[-102.04930288388505,39.57414465707943],[-
102.04930288388505,39.56823596836465],[-102.0457135897921,39.1331416175485],[-
102.0457135897921,39.0466599009048],[-102.0457135897921,38.69751011321283],[-
102.0457135897921,38.61478847120581],[-102.0457135897921,38.268861604631],[-
102.0457135897921,38.262415762396685],[-102.04212429569915,37.738153927339205],[-
102.04212429569915,37.64415206142214],[-102.04212429569915,37.38900413964724],[-
102.04212429569915,36.99365914927603],[-103.00046581851544,37.00010499151034],[-
103.08660887674611,37.00010499151034],[-104.00905745863294,36.99580776335414],[-
105.15404227428235,36.995270609834606],[-105.2222388620483,36.995270609834606],[-
105.7175614468747,36.99580776335414],[-106.00829426840322,36.995270609834606],[-
106.47490250048605,36.99365914927603],[-107.4224761410235,37.00010499151034],[-
107.48349414060355,37.00010499151034],[-108.38081766383978,36.99903068447129],[-
109.04483707103458,36.99903068447129],[-109.04483707103458,37.484617466122884],[-
109.04124777694163,37.88049961001363],[-109.04124777694163,38.15283644441336],[-
109.05919424740635,38.49983761802722],[-109.05201565922046,39.36680339854235],[-
109.05201565922046,39.49786885730673],[-109.05201565922046,39.66062637372313],[-
109.05201565922046,40.22248895514744],[-109.05201565922046,40.653823231326896],[-
109.05201565922046,41.000287251421234],[-107.91779872584989,41.00189871197981],[-
107.3183866123281,41.00297301901887],[-106.85895696843116,41.00189871197981],[-
106.32056285448942,40.998675790862656]]
]
}
}

The feature includes a properties object, which can include any number of data fields, plus
a geometry object, which in this case contains a single polygon that consists
of [longitude, latitude] coordinates for the state boundary.

Let’s load a TopoJSON file of world countries (at 110 meter resolution):

world = data.world_110m.url
world
'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/world-110m.json'
world_topo = data.world_110m()
world_topo.keys()
dict_keys(['type', 'transform', 'objects', 'arcs'])
world_topo['type']
'Topology'
world_topo['objects'].keys()
dict_keys(['land', 'countries'])

As TopoJSON is a specialized format, we need to instruct Altair to parse the TopoJSON
format, indicating which named feature object we wish to extract from the topology. The
following code indicates that we want to extract GeoJSON features from the world dataset for
the countries object:

alt.topo_feature(world, 'countries')

This alt.topo_feature method call expands to the following Vega-Lite JSON:

{
"values": world,
"format": {"type": "topojson", "feature": "countries"}
}

Geoshape Marks

To visualize geographic data, Altair provides the geoshape mark type. To create a basic map, we
can create a geoshape mark and pass it our TopoJSON data, which is then unpacked into
GeoJSON features, one for each country of the world:

alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape()
In the example above, Altair applies a default blue color and uses a default map projection
(mercator). We can customize the colors and boundary stroke widths using standard mark
properties. Using the project method we can also add our own map projection:

alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape(
    fill='#2a1d0c', stroke='#706545', strokeWidth=0.5
).project(
    type='mercator'
)

By default Altair automatically adjusts the projection so that all the data fits within the width
and height of the chart. We can also specify projection parameters, such as scale (zoom level)
and translate (panning), to customize the projection settings. Here we adjust
the scale and translate parameters to focus on Europe:
alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape(
    fill='#2a1d0c', stroke='#706545', strokeWidth=0.5
).project(
    type='mercator', scale=400, translate=[100, 550]
)
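
Since the exercise involves multiple datasets, a common next step is to join a per-country data table against the TopoJSON shapes by country id to build a choropleth. The sketch below uses a small hypothetical DataFrame keyed by ISO numeric country codes; the values are invented purely for illustration:

import pandas as pd

# Hypothetical per-country values keyed by ISO numeric id (356 = India, 840 = USA, 156 = China)
values = pd.DataFrame({'id': [356, 840, 156], 'value': [10, 20, 30]})

alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape(
    stroke='#706545', strokeWidth=0.5
).transform_lookup(
    lookup='id', from_=alt.LookupData(values, 'id', ['value'])
).encode(
    color='value:Q'
).project(type='naturalEarth1')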

RESULT: Thus the program is executed successfully.


EX.NO.7 Perform text mining on a set of documents and visualize the most important words in
a visualization such as a word cloud.

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

#loading all necessary libraries


import numpy as np
import pandas as pd

import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
%matplotlib inline

Reading the File and Understanding the Data

# loading the data file
df = pd.read_csv('emails2.csv')

#shape of the dataframe


print('The shape of the dataframe is :',df.shape)
# first few records
df.head()

OUTPUT

spam1 = df[df.spam == 1]
print(spam1.shape)
Word Cloud for 'spam' Emails

Let us build the first word cloud. The first line of code generates the word cloud on the
'final_text_spam' corpus, while the remaining lines display it.
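
Note that final_text_spam is not constructed anywhere in the listing; a minimal sketch, assuming the message bodies live in a column named text in the spam1 subset (the column name is an assumption):

# Join all spam message bodies into a single corpus string
final_text_spam = ' '.join(spam1['text'].astype(str))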
wordcloud_spam = WordCloud(background_color="white").generate(final_text_spam)

plt.figure(figsize=(20, 20))
plt.imshow(wordcloud_spam, interpolation='bilinear')
plt.axis("off")
plt.show()

RESULT: Thus the most important words in a set of documents are visualized as a word
cloud.
