M191861 Data Exploration Assignment


Name Ashton Takudzwa

Surname Sibanda

Level 4.1

Reg-number M191861

Programme Information Systems

Module Data Mining

Course code ISH414

Assignment 1
1) Summary statistics

Summary statistics are quantities that describe and summarize a set of data values.

Summary statistics measure:

a. Location – these are summary statistics that describe the central or typical value in a
data set. They give a sense of the middle or center of the data set.

Measures of location are:

• Mean, which is the average of the values in a data set.
• Median, which is the middle value once the data set is sorted: half of the values are less
than the median and half are greater. With an even number of values, it is the average of the
two middle values.
• Mode, which is the value that is repeated the most in a data set. A data set can have more
than one mode.

Example of LOCATION statistics:

Fig1

Value   Value (sorted)
35      21
42      22
78      35
22      42
56      42
50      50
42      56
78      78
21      78
87      87

MEAN = 51.1
MEDIAN = 46
MODE = 42, 78
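The location statistics for Fig1 can be checked with a short Python sketch. It assumes Python 3.8 or later, since `statistics.multimode` is used to return every mode rather than only the first one; the variable names are illustrative only.

```python
import statistics

# Values from Fig1, in their original (unsorted) order
values = [35, 42, 78, 22, 56, 50, 42, 78, 21, 87]

mean = statistics.mean(values)        # sum of the values divided by their count
median = statistics.median(values)    # average of the two middle sorted values
modes = statistics.multimode(values)  # every value tied for the highest frequency

print(f"MEAN = {mean}")
print(f"MEDIAN = {median}")
print(f"MODE = {modes}")
```

Because there are ten values, the median is the average of the fifth and sixth sorted values (42 and 50).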

b. Spread – these are summary statistics that describe how dispersed or varied the data is.

Measures of spread are:

• Standard deviation, which describes the amount of variation in the given data set.
• Minimum, which is the smallest value in a data set.
• Maximum, which is the largest value in a data set.
• Variance, which is the square of the standard deviation.
• Range, which is the difference between the maximum and the minimum.

Example of SPREAD statistics using Fig1:

Range = 87 - 21 = 66

Variance = 548.767 (sample variance, dividing by n - 1)

Standard deviation = 23.426 (square root of the variance)
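The spread figures for Fig1 can be reproduced with the sketch below. It assumes the sample formulas (dividing by n - 1), which is what `statistics.variance` and `statistics.stdev` compute and which matches the values quoted here; the population versions would give smaller numbers.

```python
import statistics

# Values from Fig1
values = [35, 42, 78, 22, 56, 50, 42, 78, 21, 87]

value_range = max(values) - min(values)  # maximum minus minimum
variance = statistics.variance(values)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)       # square root of the sample variance

print(f"Range = {value_range}")
print(f"Variance = {variance:.3f}")
print(f"Standard deviation = {std_dev:.3f}")
```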

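c. Shape – these are summary statistics that describe the shape of the distribution of a given
set of values.

Measures of shape are:

• Skewness, which describes whether the data values are asymmetrically distributed. A positive
skew means the longer tail is on the right of the distribution; a negative skew means it is on
the left.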
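As a rough illustration, skewness can be computed for the Fig1 values. The sketch below assumes the simple moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5); other definitions apply a sample-size correction and give slightly different numbers.

```python
# Values from Fig1
values = [35, 42, 78, 22, 56, 50, 42, 78, 21, 87]

n = len(values)
mean = sum(values) / n

# Second and third central moments (population form, dividing by n)
m2 = sum((x - mean) ** 2 for x in values) / n
m3 = sum((x - mean) ** 3 for x in values) / n

# Moment coefficient of skewness: positive means a longer right tail
skewness = m3 / m2 ** 1.5

print(f"Skewness = {skewness:.3f}")
```

For this data set the result is a small positive number, which suggests a mild right skew.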

2) Visualisation

Data visualization is a graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to see
and understand trends, outliers, and patterns in data.

The uses of data visualization are as follows:

• It is a powerful way to explore data with presentable results.

• Its primary use is in the pre-processing portion of the data mining process.

• It supports the data cleaning process by finding incorrect and missing values.

• It supports variable derivation and selection, that is, deciding which variables to include
in the analysis and which to discard.

• It also plays a role in combining categories as part of the data reduction process.

Data Visualization Techniques

• Box plots

• Histograms

• Heat maps

• Charts

• Tree maps

• Word Cloud/Network diagram

An example of a visualisation is a pie chart.
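To make one of the listed techniques concrete, the sketch below draws a plain-text histogram of the Fig1 values from section 1. The bin width of 20 is an assumption chosen for illustration; a plotting library would normally be used instead of text output.

```python
from collections import Counter

# Values from Fig1
values = [35, 42, 78, 22, 56, 50, 42, 78, 21, 87]

bin_width = 20  # assumed bin width, chosen only for this illustration

# Map each value to the lower edge of its bin and count the values per bin
counts = Counter((v // bin_width) * bin_width for v in values)

lines = []
for start in range(min(counts), max(counts) + bin_width, bin_width):
    label = f"{start}-{start + bin_width - 1}"
    lines.append(f"{label:<6}| {'#' * counts[start]}")

histogram = "\n".join(lines)
print(histogram)
```

Each `#` stands for one value, so the tallest bar shows where most of the data lies.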

3) Online analytical processing (OLAP)

Online Analytical Processing can be defined as a set of tools and approaches for representing data along
multiple dimensions. In a broader sense, it includes a set of practices aimed at modelling
data/databases and creating specific analytical solutions. OLAP systems are capable of combining
classic tables into a sort of table of tables, which can be visualized as a 3D OLAP cube for simplicity.

Examples of analysis include financial modelling, budget forecasting, production planning, and
determining broad sales and distribution trends.

OLAP databases are divided into one or more cubes. The cubes are designed in such a way that
creating and viewing reports becomes easy.
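The cube idea can be sketched in a few lines of Python. The sales records, dimension names, and the helper functions `roll_up` and `slice_cube` below are all invented for illustration; they are not taken from any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact records: (year, region, product, sales). Invented data.
DIMENSIONS = ("year", "region", "product")
facts = [
    (2022, "North", "Laptops", 120),
    (2022, "North", "Phones", 80),
    (2022, "South", "Laptops", 60),
    (2023, "North", "Laptops", 150),
    (2023, "South", "Phones", 90),
    (2023, "South", "Laptops", 70),
]


def roll_up(records, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for *dims, sales in records:
        totals[tuple(dims[i] for i in positions)] += sales
    return dict(totals)


def slice_cube(records, dimension, value):
    """Keep only the records where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return [r for r in records if r[position] == value]


by_year = roll_up(facts, ["year"])
by_region_product = roll_up(facts, ["region", "product"])
regions_2023 = roll_up(slice_cube(facts, "year", 2023), ["region"])

print(by_year)
print(by_region_product)
print(regions_2023)
```

Rolling up produces the kind of summary a report would show, while slicing fixes one dimension (here the year) before summarizing the rest.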

Types of OLAP Servers

There are four types of OLAP servers:

• Relational OLAP (ROLAP)

ROLAP servers are placed between a relational back-end server and client front-end tools. To
store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.

• Multidimensional OLAP (MOLAP)

MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With
multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore,
many MOLAP servers use two levels of data storage representation to handle dense and sparse data
sets.

• Hybrid OLAP (HOLAP)

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP
and the faster computation of MOLAP. HOLAP servers can store large volumes of detailed
information, while the aggregations are stored separately in a MOLAP store.

• Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL
queries over star and snowflake schemas in a read-only environment.
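The MOLAP point about low storage utilization on sparse data can be illustrated with a small sketch. The cube shape and cell values below are hypothetical, and the two structures are only a simplified stand-in for the dense and sparse representations a real MOLAP server might use.

```python
# Hypothetical 4 x 4 x 4 cube in which only three cells hold data
SHAPE = (4, 4, 4)
cells = {(0, 1, 2): 120, (3, 0, 0): 80, (2, 2, 1): 60}

# Dense representation: every cell is allocated, including the empty ones
dense = [
    [[0 for _ in range(SHAPE[2])] for _ in range(SHAPE[1])]
    for _ in range(SHAPE[0])
]
for (i, j, k), value in cells.items():
    dense[i][j][k] = value

# Sparse representation: only the non-empty cells are stored, keyed by coordinates
sparse = dict(cells)

total_cells = SHAPE[0] * SHAPE[1] * SHAPE[2]
utilization = len(sparse) / total_cells

print(f"Dense cells allocated: {total_cells}")
print(f"Sparse cells stored: {len(sparse)}")
print(f"Utilization of the dense array: {utilization:.1%}")
```

With only 3 of 64 cells filled, the dense array is mostly empty, which is why a separate representation for sparse regions can save space.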
