
DATA COLLECTION AND MANAGEMENT
Methods of Data Collection
• Interviews
• Questionnaires
• Observations
• Documents and records
• Focus groups
• Master and transactional data
• Transactional data relates to the transactions of the organization and includes data that is captured, for example, when a product is sold or purchased. Master data is referred to by different transactions; examples are customer, product, and supplier data. Generally, master data does not change and does not need to be created with every transaction.
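The split can be sketched in Python: master records are created once and referenced by ID from each transaction, rather than copied into every sale. All names and IDs below are made up for illustration.

```python
# Master data: created once, referenced many times.
master_products = {
    "P-001": {"name": "Oil Pump", "supplier": "ACME Parts"},
}
master_customers = {
    "C-203": {"name": "David Rivers"},
}

# Transactional data: one record per business event, pointing at
# master records via IDs instead of duplicating them.
transactions = [
    {"order": "00315", "customer_id": "C-203", "product_id": "P-001", "qty": 1},
    {"order": "00316", "customer_id": "C-203", "product_id": "P-001", "qty": 2},
]

for t in transactions:
    customer = master_customers[t["customer_id"]]
    product = master_products[t["product_id"]]
    print(t["order"], customer["name"], product["name"], t["qty"])
```

If the supplier of the Oil Pump changes, only one master record is updated; no transaction needs to be touched.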
Sampling
• Data sampling is a statistical analysis
technique used to select, manipulate and
analyze a representative subset of data points
to identify patterns and trends in the larger
data set being examined.
• Sampling is a method that allows us to get
information about the population based on the
statistics from a subset of the population
(sample), without having to investigate every
individual.
Sampling Techniques
• Probability Sampling: In probability sampling, every element of the population has a known, non-zero chance of being selected. Probability sampling gives us the best chance to create a sample that is truly representative of the population.
• Non-Probability Sampling: In non-probability sampling, elements do not all have a known chance of being selected. Consequently, there is a significant risk of ending up with a non-representative sample which does not produce generalizable results.
Probability Sampling Techniques
•Simple Random Sampling
• This is a type of sampling technique you must
have come across at some point. Here, every
individual is chosen entirely by chance and each
member of the population has an equal chance of
being selected.
•Systematic Sampling
• In this type of sampling, the first individual is
selected randomly and others are selected using
a fixed ‘sampling interval’. Let’s take a simple
example to understand this.
Sampling Techniques
• Say our population size is x and we have to select a sample of size n. The sampling interval is then k = x/n: after the randomly chosen first individual, we select every k-th individual until the sample is complete.
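The two techniques above can be sketched with Python's standard random module; the population of 100 IDs and the sample size of 10 are made-up values for illustration.

```python
import random

population = list(range(100))   # made-up population of 100 IDs
random.seed(42)                 # fixed seed so the sketch is repeatable

# Simple random sampling: every member has an equal chance.
simple_sample = random.sample(population, 10)

# Systematic sampling: random start, then every k-th member,
# where k = population size / sample size (the sampling interval).
k = len(population) // 10       # k = 100 / 10 = 10
start = random.randrange(k)     # random first individual in [0, k)
systematic_sample = population[start::k]

print(len(simple_sample), len(systematic_sample))
```

Note that the systematic sample is evenly spaced: each selected individual is exactly k positions after the previous one.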
• Stratified Sampling
• In this type of sampling, we divide the population
into subgroups (called strata) based on different
traits like gender, category, etc. And then we
select the sample(s) from these subgroups:
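A stratified draw can be sketched as follows; the trait (gender), the population size, and the 10% sampling fraction are all illustrative assumptions.

```python
import random

# Made-up population with one stratifying trait.
population = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(100)]

# Divide the population into strata by the trait.
strata = {}
for person in population:
    strata.setdefault(person["gender"], []).append(person)

# Draw a proportional random sample (10%) from each stratum.
random.seed(0)
sample = []
for group in strata.values():
    n = len(group) // 10
    sample.extend(random.sample(group, n))

print(len(sample))  # 5 from each of the two strata
```

Because every stratum contributes proportionally, traits that matter (here gender) are guaranteed to be represented in the sample.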
Sampling Techniques
Probability Sampling Techniques
•Cluster Sampling
• In a clustered sample, we use the subgroups of
the population as the sampling unit rather than
individuals. The population is divided into
subgroups, known as clusters, and a whole
cluster is randomly selected to be included in the
study:
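A sketch of drawing whole clusters, using a made-up population of 10 clusters (say, schools) of 20 students each:

```python
import random

# Made-up clusters: 10 schools with 20 students each.
clusters = {c: [f"student-{c}-{i}" for i in range(20)] for c in range(10)}

random.seed(1)
chosen = random.sample(list(clusters), 2)   # randomly select 2 whole clusters

# Every member of each chosen cluster enters the sample.
sample = [s for c in chosen for s in clusters[c]]

print(len(sample))
```

Unlike stratified sampling, which draws a few individuals from every subgroup, cluster sampling takes all individuals from a few subgroups.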
Sampling Techniques
• Cluster Sampling
• In a clustered sample, we use the subgroups of
the population as the sampling unit rather than
individuals. The population is divided into
subgroups, known as clusters, and a whole
cluster is randomly selected to be included in the
study:
Sampling Techniques
Types of Non-Probability Sampling
•Convenience Sampling
• This is perhaps the easiest method of sampling
because individuals are selected based on their
availability and willingness to take part.
•Quota Sampling
• In this type of sampling, we choose items based
on predetermined characteristics of the
population. Consider that we have to select
individuals having a number in multiples of four
for our sample:
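The multiples-of-four example can be sketched directly. Note there is no randomness: items are taken in order until the quota is filled, which is exactly why quota samples may not generalize.

```python
# Quota sampling sketch: select items matching a predetermined
# characteristic (multiples of four) until a fixed quota is met.
population = range(1, 101)
quota = 10

sample = []
for x in population:
    if x % 4 == 0:
        sample.append(x)
    if len(sample) == quota:
        break

print(sample)  # [4, 8, 12, ..., 40]
```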
Sampling Techniques
Types of Non-Probability Sampling
•Judgment Sampling
• It is also known as selective sampling. It depends
on the judgment of the experts when choosing
whom to ask to participate.
•Snowball Sampling
• Existing people are asked to nominate further
people known to them so that the sample
increases in size like a rolling snowball. This
method of sampling is effective when a sampling
frame is difficult to identify.
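Snowball recruitment can be sketched as a traversal of a nomination network; the names and nominations below are entirely made up.

```python
# Each recruited participant nominates acquaintances; the sample
# grows like a rolling snowball from a single known seed.
nominations = {
    "ana": ["ben", "carl"],
    "ben": ["dora"],
    "carl": ["dora", "eve"],
    "dora": [],
    "eve": [],
}

sample, frontier = [], ["ana"]   # "ana" is the only person we know at the start
while frontier:
    person = frontier.pop(0)
    if person not in sample:     # each person joins the sample only once
        sample.append(person)
        frontier.extend(nominations[person])

print(sample)
```

Starting from one seed, the whole reachable network is recruited, which is why this works even without a sampling frame.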
Web Data Crawling vs Web Data
Scraping
• Data crawling means dealing with large data sets: you develop crawlers (or bots) that crawl to the deepest levels of the web. Data scraping, on the other hand, refers to retrieving information from any source (not necessarily the web). Irrespective of the approach involved, extracting data from the web is often called scraping (or harvesting), and that is a serious misconception.
Data Crawling vs Data Scraping –
Key Differences
• 1. Scraping data does not necessarily involve the web. Data scraping tools could extract information from a local machine or a database; even a simple "Save as" of a web page is a subset of the data scraping universe. Data crawling, on the other hand, differs immensely in scale as well as in range: crawling means web crawling, so we can only "crawl" data on the web. The programs that perform this job are called crawl agents, bots, or spiders. Some web spiders are algorithmically designed to reach the maximum depth of a page and crawl it iteratively. Although web scraping and web crawling may seem similar, the two are not the same.
• 2. The web is an open world and the quintessential platform for exercising our right to freedom. Thus a lot of content gets created and then duplicated; for instance, the same blog post might appear on different pages, and our spiders don't understand that. Hence, data de-duplication (affectionately, "dedup") is an integral part of any web data crawling service. Dedup achieves two things: it keeps clients happy by not flooding their machines with the same data more than once, and it saves our servers some space. Deduplication, however, is not necessarily a part of web data scraping.
• 3. One of the most challenging things in the web crawling space is coordinating successive crawls. Our spiders have to be polite to the servers, so that they do not overwhelm them when hit. This creates an interesting situation to handle. Over time, our spiders have to get more intelligent (and not crazy!): they learn when and how often to hit a server, and how to crawl its data feeds while complying with its politeness policies.
• 4. Finally, different crawl agents are used to crawl different websites, and hence you need to ensure they don't conflict with each other in the process. This situation never arises when you intend to just scrape data.
Data Scraping vs Data Crawling:
• Scope – Data scraping: involves extracting data from various sources, including the web. Data crawling: refers to downloading pages from the web.
• Scale – Data scraping: can be done at any scale. Data crawling: mostly done at a large scale.
• Deduplication – Data scraping: not necessarily a part. Data crawling: an essential part; it eliminates duplicate copies of repeating data, which improves storage utilization and may in turn lower capital expenditure by reducing the amount of storage media required to meet capacity needs.
• Components – Data scraping: needs a crawl agent and a parser. Data crawling: needs only a crawl agent.

On a concluding note, when talking about web scraping vs web crawling: "scraping" represents a very superficial node of crawling, which we call extraction, and that again requires a few algorithms and some automation in place.
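The crawl-agent vs parser split can be sketched offline. The tiny in-memory "site" and the regex-based extraction below are illustrative stand-ins for a real fetcher and HTML parser; note how the crawler de-duplicates visited pages, while the scraper just extracts values from one document.

```python
import re

# Made-up "site": page path -> HTML, so no network access is needed.
site = {
    "/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<p>price: 69.90</p> <a href="/b">B</a>',
    "/b": '<p>price: 149.99</p> <a href="/">home</a>',
}

def crawl(start):
    """Crawl agent: follows links, de-duplicating visited pages."""
    visited, frontier = [], [start]
    while frontier:
        page = frontier.pop(0)
        if page in visited:        # dedup: never process a page twice
            continue
        visited.append(page)
        frontier.extend(re.findall(r'href="([^"]+)"', site[page]))
    return visited

def scrape(html):
    """Parser: extracts the data of interest from a single document."""
    return re.findall(r"price: ([\d.]+)", html)

pages = crawl("/")
prices = [p for page in pages for p in scrape(site[page])]
print(pages, prices)
```

Scraping alone is the `scrape` step; a crawling service is the loop around it, with frontier management and dedup.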
Basics of Databases
• Relational Data
• SQL
• NoSQL
Relational Databases
• A relational database is a collection of
information that organizes data points with
defined relationships for easy access. In the
relational database model, the data structures
-- including data tables, indexes and views --
remain separate from the physical storage
structures, enabling database administrators to
edit the physical data storage without affecting
the logical data structure.
Data Management
When Do We Need a Database?
Storage vs. Management

SALES RECEIPT
Date: 07/16/2016
Order #: 00315
Customer: David Rivers
Product: Oil Pump (S/N: OP147-0623)
Unit Price: 69.90
Qty: 1
Total: 69.90
Storage vs. Management (2)
• Storing data is not the primary reason to use
a Database
• Flat storage eventually runs into issues with
• Size
• Ease of updating
• Accuracy
• Security
• Redundancy
• Importance
Databases and RDBMS
• A database is an organized collection of
information
• It imposes rules on the contained data
• The relational storage model was first proposed by Edgar Codd in 1970
• A Relational Database Management System (RDBMS) provides tools to manage the database
• It parses requests from the user and takes the appropriate action
• The user doesn't have direct access to the stored data
Database Engine
Database Engine Flow
• SQL Server uses the client-server model: clients send queries over the network to the query engine, which accesses the database and returns the data
Top Database Engines
• The DB-Engines Ranking ranks database
management systems according to their
popularity. The ranking is updated monthly.
https://db-engines.com/en/ranking
Download Clients & Servers
• Download SQL Server Express Edition from Microsoft: https://www.microsoft.com/en-us/sql-server/sql-server-editions-express
• The package includes SQL Server Management Studio
SQL Server Architecture
• Logical storage: an instance contains databases, a database contains schemas, and a schema contains tables
• Physical storage: data files and log files, organized into data pages
Database Table Elements
• The table is the main building block of any database

CustomerID | FirstName | Birthdate  | CityID
1          | Brigitte  | 03/12/1975 | 101
2          | August    | 27/05/1968 | 102
3          | Benjamin  | 15/10/1988 | 103
4          | Denis     | 07/01/1993 | 104

• Each row is called a record, entity, or tuple
• Each column defines the type of data it contains; columns are also known as fields or attributes
• Each cell holds a single data value
Structured Query Language
(SQL)
• To communicate with the Engine we use SQL
• Declarative language
• Logically divided in four sections
• Data Definition – describe the structure of our
data
• Data Manipulation – store and retrieve data
• Data Control – define who can access the data
• Transaction Control – bundle operations and allow
rollback
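Three of the four sections can be demonstrated with Python's built-in sqlite3 module (SQLite has no Data Control statements such as GRANT); the table and column names below are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data Definition: describe the structure of our data.
conn.execute("CREATE TABLE Customers (ID INTEGER PRIMARY KEY, Name TEXT)")

# Data Manipulation: store and retrieve data.
conn.execute("INSERT INTO Customers (Name) VALUES ('David Rivers')")
conn.commit()

# Transaction Control: bundle operations and allow rollback.
try:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO Customers (Name) VALUES ('Sarah Thorne')")
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass        # Sarah's insert was rolled back

rows = conn.execute("SELECT Name FROM Customers").fetchall()
print(rows)  # only David remains
```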
Engine’s Behaviour

• Queries arrive over the network and are queued as tasks
• Each query is parsed, compiled, and optimized into an execution plan
• Plans and data are cached, so repeated queries run faster
• A pool of worker threads executes the plans and returns results over the network
Table Relationships
Why Split Related Data?

Empty records:

First   | Last    | Registered | Email                   | Email2
David   | Rivers  | 05/02/2016 | drivers@mail.cx         | david@homedomain.cx
Sarah   | Thorne  | 07/17/2016 | sarah@mail.cx           | NULL
Michael | Walters | 11/23/2015 | walters_michael@mail.cx | NULL

Redundant information:

OrderID | Date       | Customer        | Product        | S/N        | Price
00315   | 07/16/2016 | David Rivers    | Oil Pump       | OP147-0623 | 69.90
00315   | 07/16/2016 | David Rivers    | Accessory Belt | AB544-1648 | 149.99
00316   | 07/17/2016 | Sarah Thorne    | Wiper Fluid    | WF000-0001 | 99.90
00317   | 07/18/2016 | Michael Walters | Oil Pump       | OP147-0623 | 69.90
Related Tables
• We split the data and introduce relationships between the tables to avoid repeating information

Users (Primary Key: UserID):

UserID | First   | Last    | Registered
203    | David   | Rivers  | 05/02/2016
204    | Sarah   | Thorne  | 07/17/2016
205    | Michael | Walters | 11/23/2015

Emails (Foreign Key: UserID):

UserID | Email
203    | drivers@mail.cx
204    | sarah@mail.cx
205    | walters_michael@mail.cx
203    | david@homedomain.cx

• The connection is established via a Foreign Key in one table pointing to the Primary Key in another
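The Users/Emails split can be reproduced with Python's sqlite3 module; the schema below is a minimal sketch of the two tables, with the Foreign Key in Emails pointing at the Primary Key in Users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce FK constraints in SQLite
conn.executescript("""
    CREATE TABLE Users  (UserID INTEGER PRIMARY KEY, First TEXT, Last TEXT);
    CREATE TABLE Emails (UserID INTEGER REFERENCES Users(UserID), Email TEXT);
    INSERT INTO Users  VALUES (203, 'David', 'Rivers');
    INSERT INTO Emails VALUES (203, 'drivers@mail.cx');
    INSERT INTO Emails VALUES (203, 'david@homedomain.cx');
""")

# A JOIN reassembles the split data without any duplicated name columns.
rows = conn.execute("""
    SELECT u.First, e.Email
    FROM Users AS u JOIN Emails AS e ON u.UserID = e.UserID
""").fetchall()
print(rows)  # two emails for one user, stored without repetition
```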
E/R Diagrams
Indices
• Indices make data lookup faster
• Clustered – bound to the primary key, physically
sorts data
• Non-Clustered – can be any field, references the
clustered index
• Structured as an ordered tree
• Example: the index keys are partitioned into ordered ranges (0–99, 100–199, 200–299), and each range points to its data pages
Views
• Views are prepared queries for displaying sections of our data
• Evaluated at run time: they do not increase performance

CREATE VIEW v_EmployeeNames AS
SELECT [EmployeeID]
      ,[FirstName]
      ,[LastName]
FROM [SoftUni].[dbo].[Employees]

SELECT * FROM v_EmployeeNames
Procedures, Functions and Triggers
• A database can further be customized with
reusable code
• Procedures – carry out a predetermined action
• E.g. calculate and store the weekly revenue based
on recorded sales in the database
• Functions – receive parameters and return a
result
• E.g. get the age of a person using their birthdate
and current date
• Triggers – watch for activity in the database
and react to it
• E.g. when a record is deleted, write it to an archive
Procedures
CREATE PROCEDURE p_LoadEmployees AS
BEGIN
SELECT [EmployeeID]
,[FirstName]
,[LastName]
,[Name] AS [DepartmentName]
INTO [Test].[dbo].[EmployeeDepartments]
FROM [SoftUni].[dbo].[Employees] AS e
LEFT OUTER JOIN [SoftUni].[dbo].[Departments] AS d
ON e.DepartmentID = d.DepartmentID
END

EXEC p_LoadEmployees
Functions

CREATE FUNCTION f_GetAge (@Birthday date)


RETURNS int
AS
BEGIN
DECLARE @Result int;
SET @Result = DATEDIFF(YYYY, @Birthday, GETDATE());
RETURN @Result
END

SELECT [dbo].[f_GetAge]('2004/12/26')
Triggers
CREATE TRIGGER t_DepartmentHistory ON [SoftUni].[dbo].[Departments]
FOR DELETE AS
BEGIN
    -- DELETED can contain multiple rows, so a set-based INSERT is used;
    -- scalar variables would fail on multi-row deletes
    INSERT INTO [SoftUni].[dbo].[DepartmentHistory]
        ([DepartmentID], [Name], [ManagerID])
    SELECT [DepartmentID], [Name], [ManagerID]
    FROM DELETED;

    PRINT CAST(@@ROWCOUNT AS varchar(10)) + ' department(s) have been recorded';
END
Summary
1. RDBMS store and manage data
2. Table relations reduce repetition and complexity
3. Databases can be customized with functions and procedures
NoSQL Databases
What is a NoSQL Database?
NoSQL = "No SQL" or "Not Only SQL"?
• Non-relational database
• Schema-free document storage
• Still supports indexing, querying and CRUD operations
• Still supports concurrency and transactions
• Can have nested values
• Highly optimized for append / retrieve
• Great performance and scalability
Non-Relational Data Models
• Document model
• Set of documents, e.g. JSON strings
• Key-value model
• Set of key-value pairs
• Column model
• Stores data tables as sections of columns of data
• Graph model
• Use a graph structure
Document Model
• Implementations of the document model may include:
• Collections
• Tags
• Non-visible metadata
• Directory hierarchies
• Buckets

Example document:
Name: John Cena
Gender: male
Phone: +263778562825
Address:
  - Street: 16 Nzembe Rd
  - Post Code: 26354
  - Town: Gweru
  - Country: Zimbabwe
Email: R28625x@students.msu.ac.zw
Site: https://msu.ac.zw
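Such a document maps naturally onto JSON, as this sketch with Python's standard json module shows; the values come from the example above, while the storage machinery of a real document database is omitted.

```python
import json

# The example document, including a nested Address value.
doc = {
    "Name": "John Cena",
    "Gender": "male",
    "Phone": "+263778562825",
    "Address": {
        "Street": "16 Nzembe Rd",
        "Post Code": "26354",
        "Town": "Gweru",
        "Country": "Zimbabwe",
    },
    "Email": "R28625x@students.msu.ac.zw",
    "Site": "https://msu.ac.zw",
}

text = json.dumps(doc)            # a document is a schema-free string
restored = json.loads(text)       # round-trips without any table schema
print(restored["Address"]["Town"])  # nested values are reached by key
```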
Key-Value Model
• Data is stored in unstructured records
• A record consists of a key plus the values associated with that record
• Not adequate for complex apps
• The simplest form of DBMS
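Conceptually, a key-value store behaves like a dictionary persisted at scale. This sketch (with made-up keys) shows both its simplicity and its main limitation: the store can only be queried by exact key, never by the contents of a value.

```python
# A key-value store sketched as a plain dict: keys are opaque strings,
# and values can be anything the application cares to put in them.
store = {}

store["user:203"] = "David Rivers"           # put
store["session:abc"] = {"expires": 3600}     # heterogeneous values are fine

print(store.get("user:203"))                 # get works by exact key only
print(store.get("user:999", "not found"))    # missing keys need a default
```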
Column Model
• Stores data tables as sections of columns of data rather than as rows of data
• Has advantages for:
• Data warehouses
• CRM
• Ad hoc query systems
• More efficient for computations over many rows
Graph Model
• Graph databases employ nodes, edges and
properties
• Based on graph theory
• Nodes represent entities
• Edges are the lines that connect nodes to
nodes
• Properties are pertinent information that
relate to nodes
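The three elements can be sketched with plain Python structures; the people, properties, and relationships below are made up.

```python
# Nodes represent entities, with properties attached to each node.
nodes = {
    "alice": {"age": 34},
    "bob":   {"age": 29},
    "carol": {"age": 41},
}

# Edges are the lines that connect nodes to nodes.
edges = [("alice", "knows", "bob"), ("bob", "knows", "carol")]

def neighbours(node):
    """Follow outgoing edges from a node to its connected nodes."""
    return [dst for src, rel, dst in edges if src == node]

print(neighbours("alice"), nodes["alice"]["age"])
```

Traversals like "who does alice know?" become simple edge lookups, which is the core strength of the graph model.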
Advantages of NoSQL
• Cheap and easy to implement
• Data are replicated and can be partitioned
• Easy to distribute
• Don't require a schema
• Can Scale up and down
• Quickly process large amounts of data
Disadvantages of NoSQL
• Data is generally duplicated, potential for
inconsistency
• No standardized schema
• No standard format for queries
• No standard language
• Difficult to impose complicated structures
• Depend on the application layer to enforce
data integrity
• No guarantee of support
Relational vs. NoSQL Databases
• Relational databases
• Data stored as table rows
• Relationships between related rows
• Single entity spans multiple tables
• RDBMS are very mature, rock solid
• NoSQL databases
• Data stored as documents or other values
• Single entity (document) is a single record
• Documents do not have a fixed structure
Relational vs. NoSQL Models
Relational Model Document Model
When to use NoSQL?
• The data is not structured or structure is
changing
• You need to have a denormalized
representation of your data
• You need massive write performance
• You need fast key-value access
• You need flexible schema/data types
• You need schema migration
• You need easier maintainability
http://highscalability.com/.../what-the-heck-are-you-actually-using-nosql-for.html
Comparison of NoSQL Databases
Summary
1. How do column-oriented databases store data?
2. What are the main elements of the graph model?
3. What are three main advantages of NoSQL?
4. What are three main disadvantages of NoSQL?
5. When should you use NoSQL?
Data Cleaning/Pre-processing
• Assignment
