Data Collection and Management
Methods of Data Collection
• Interviews
• Questionnaires
• Observations
• Documents and records
• Focus groups
• Master and transactional data
• Transactional data relates to the transactions of the organization and
includes data that is captured, for example, when a product is sold or
purchased. Master data is referred to in different transactions, and
examples are customer, product, or supplier data. Generally, master
data does not change and does not need to be created with every
transaction.
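The distinction above can be sketched in a few lines of Python. This is an illustrative sketch only; the record fields and keys are hypothetical, not from the source:

```python
# Master data: relatively static reference records, created once.
customers = {"C001": {"name": "David Rivers"}}
products = {"P147": {"name": "Oil Pump", "price": 69.90}}

# Transactional data: one record per business event (e.g. a sale),
# referencing master data by key rather than duplicating it.
sales = [
    {"order": "00315", "customer_id": "C001", "product_id": "P147", "qty": 1},
]

for sale in sales:
    product = products[sale["product_id"]]
    total = product["price"] * sale["qty"]
    print(customers[sale["customer_id"]]["name"], product["name"], total)
```

Note how each new sale only adds a transactional record; the customer and product master records are reused, not recreated.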
Sampling
• Data sampling is a statistical analysis
technique used to select, manipulate and
analyze a representative subset of data points
to identify patterns and trends in the larger
data set being examined.
• Sampling is a method that allows us to get
information about the population based on the
statistics from a subset of the population
(sample), without having to investigate every
individual.
Sampling Techniques
• Probability Sampling: In probability sampling,
every element of the population has a known,
non-zero chance of being selected. Probability
sampling gives us the best chance to create a
sample that is truly representative of the
population.
• Non-Probability Sampling: In non-probability
sampling, elements do not all have a known
chance of being selected. Consequently, there
is a significant risk of ending up with a
non-representative sample which does not
produce generalizable results.
Probability Sampling Techniques
•Simple Random Sampling
• This is a type of sampling technique you must
have come across at some point. Here, every
individual is chosen entirely by chance and each
member of the population has an equal chance of
being selected.
•Systematic Sampling
• In this type of sampling, the first individual is
selected randomly and others are selected using
a fixed ‘sampling interval’. Let’s take a simple
example to understand this.
Sampling Techniques
• Say our population size is x and our sample size
is n. The sampling interval is then k = x/n: the
next individual we select is k positions away
from the first, and we select the rest in the
same way.
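The two techniques above can be sketched as follows. The population of 100 numbered individuals is illustrative:

```python
import random

population = list(range(1, 101))  # 100 individuals, labelled 1..100
n = 10                            # desired sample size

# Simple random sampling: every individual has an equal chance of selection.
simple_sample = random.sample(population, n)

# Systematic sampling: pick a random start, then take every k-th individual,
# where k = population size / sample size (k = 10 here).
k = len(population) // n
start = random.randrange(k)
systematic_sample = population[start::k]

print(sorted(simple_sample))
print(systematic_sample)
```

Both yield a sample of 10, but the systematic sample is evenly spaced through the population while the simple random sample is not.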
• Stratified Sampling
• In this type of sampling, we divide the population
into subgroups (called strata) based on different
traits like gender, category, etc. And then we
select the sample(s) from these subgroups:
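A minimal sketch of stratified sampling, assuming a hypothetical population split into two gender strata:

```python
import random

# Hypothetical population: 50 in each of two strata.
population = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(100)]

def stratified_sample(items, key, fraction):
    """Group items into strata by key, then sample each stratum in proportion."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(random.sample(members, size))
    return sample

sample = stratified_sample(population, lambda p: p["gender"], 0.10)
print(len(sample))  # 10: five drawn from each stratum
```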
Sampling Techniques
Probability Sampling Techniques
•Cluster Sampling
• In a clustered sample, we use the subgroups of
the population as the sampling unit rather than
individuals. The population is divided into
subgroups, known as clusters, and a whole
cluster is randomly selected to be included in the
study:
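Cluster sampling can be sketched like this; the subgroup labels ("schools") are illustrative assumptions:

```python
import random

# Hypothetical population of 100 people spread over 5 clusters (schools).
population = [{"id": i, "school": f"school_{i % 5}"} for i in range(100)]

# Group individuals into clusters.
clusters = {}
for person in population:
    clusters.setdefault(person["school"], []).append(person)

# The sampling unit is the cluster: randomly pick 2 of the 5 clusters
# and include every member of each chosen cluster.
chosen = random.sample(sorted(clusters), 2)
sample = [p for name in chosen for p in clusters[name]]
print(len(sample))  # 40: two whole clusters of 20 members each
```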
Sampling Techniques
Types of Non-Probability Sampling
•Convenience Sampling
• This is perhaps the easiest method of sampling
because individuals are selected based on their
availability and willingness to take part.
•Quota Sampling
• In this type of sampling, we choose items based
on predetermined characteristics of the
population. Consider that we have to select
individuals having a number in multiples of four
for our sample:
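The slide's "multiples of four" quota can be sketched as follows; note that the selection is deterministic, not random, which is exactly what makes this non-probability sampling:

```python
# Quota sampling sketch: from an ordered pool, take individuals whose
# number is a multiple of four until a preset quota is filled.
population = list(range(1, 41))
quota = 5

sample = [x for x in population if x % 4 == 0][:quota]
print(sample)  # [4, 8, 12, 16, 20]
```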
Sampling Techniques
Types of Non-Probability Sampling
•Judgment Sampling
• It is also known as selective sampling. It depends
on the judgment of the experts when choosing
whom to ask to participate.
•Snowball Sampling
• Existing people are asked to nominate further
people known to them so that the sample
increases in size like a rolling snowball. This
method of sampling is effective when a sampling
frame is difficult to identify.
Web Data Crawling vs Web Data Scraping
• Data crawling means dealing with large data sets,
where you develop your own crawlers (or bots)
which crawl to the deepest of the web pages. Data
scraping, on the other hand, refers to retrieving
information from any source (not necessarily the
web). Irrespective of the approach involved,
extracting data from the web is often loosely
called scraping (or harvesting), and that is a
serious misconception.
Data Crawling vs Data Scraping –
Key Differences
• 1. Scraping data does not necessarily involve the
web. Data scraping tools can extract information
from a local machine or a database; even a mere
"Save as" link on a page is a subset of the data
scraping universe. Data crawling, on the other
hand, differs immensely in scale as well as in
range. Crawling almost always means web crawling:
it is only on the web that we "crawl" data. The
programs that perform this job are called crawl
agents, bots, or spiders. Some web spiders are
algorithmically designed to reach the maximum
depth of a page and crawl it iteratively.
• 2. The web is an open world and the
quintessential platform for practising our right
to freedom, so a lot of content gets created
and then duplicated. For instance, the same
blog post might appear on different pages, and
our spiders don't understand that. Hence, data
de-duplication (affectionately, "dedup") is an
integral part of any web data crawling service.
This achieves two things: it keeps our clients
happy by not flooding their machines with the
same data more than once, and it saves our
servers some space. De-duplication, however,
is not necessarily a part of web data scraping.
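One common dedup approach is content hashing, sketched below; the URLs and page texts are hypothetical:

```python
import hashlib

# Pages a crawler might fetch; the first two have identical content.
pages = [
    ("http://a.example/post", "Same blog post body."),
    ("http://b.example/mirror", "Same blog post body."),
    ("http://a.example/other", "A different article."),
]

# Hash each page's normalized text and keep only the first copy of each hash.
seen = set()
unique_pages = []
for url, text in pages:
    digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique_pages.append((url, text))

print(len(unique_pages))  # 2: the mirrored copy is dropped
```

Real crawlers often use fuzzier techniques (e.g. shingling) to catch near-duplicates, but exact content hashing is the simplest form of dedup.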
• 3. One of the most challenging things in the web
crawling space is coordinating successive crawls.
Our spiders have to be polite to the servers they
hit, so as not to overload them. This creates an
interesting situation to handle. Over time, our
spiders have to get more intelligent (and not
crazy!): they must learn when and how often to
hit a server, and how to crawl the data feeds on
its pages while complying with its politeness
policies. While the two seem different, in
everyday usage web scraping and web crawling are
often treated as the same thing.
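A minimal sketch of a politeness policy: enforce a minimum delay between successive hits to the same host. The class name, delay value, and host names are all illustrative assumptions:

```python
import time

class PoliteScheduler:
    """Tracks per-host hit times and computes the required wait before the next hit."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay  # seconds between hits to the same host
        self.last_hit = {}

    def wait_time(self, host, now):
        """Seconds to wait before this host may be hit again (0 if never hit)."""
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record_hit(self, host, now):
        self.last_hit[host] = now

sched = PoliteScheduler(min_delay=2.0)
now = time.monotonic()
sched.record_hit("example.com", now)
print(sched.wait_time("example.com", now + 0.5))  # 1.5
print(sched.wait_time("other.org", now))          # 0.0 (never hit before)
```

Production crawlers also honour robots.txt and per-site crawl-delay directives; this sketch shows only the timing mechanism.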
• 4. Finally, different crawl agents are used to
crawl different websites, so you need to ensure
they don't conflict with each other in the
process. This situation never arises when you
only scrape data.
Data Scraping vs Data Crawling
Example: a raw sales receipt and the structured record scraped from it.

Raw receipt:
SALES RECEIPT
Date: 07/16/2016
Order#: 00315
Customer: David Rivers
Product: Oil Pump
1 x 69.90
S/N: OP147-0623
Total: 69.90

Extracted record: 00315 – 07/16/2016, David Rivers, Oil Pump (OP147-0623), 1 x 69.90, Total: 69.90
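Scraping the receipt above into a structured record can be sketched with regular expressions; the field names chosen for the record are illustrative:

```python
import re

receipt = """SALES RECEIPT
Date: 07/16/2016
Order#: 00315
Customer: David Rivers
Product: Oil Pump
S/N: OP147-0623
Total: 69.90"""

# Pull each labelled field out of the raw text into a dictionary.
record = {
    "date": re.search(r"Date:\s*(\S+)", receipt).group(1),
    "order": re.search(r"Order#:\s*(\S+)", receipt).group(1),
    "customer": re.search(r"Customer:\s*(.+)", receipt).group(1),
    "serial": re.search(r"S/N:\s*(\S+)", receipt).group(1),
    "total": float(re.search(r"Total:\s*([\d.]+)", receipt).group(1)),
}
print(record)
```

Real scrapers face messier input (OCR noise, varying layouts), but the principle is the same: turn semi-structured text into named fields.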
Storage vs. Management
• Storing data is not the primary reason to use
a Database
• Flat storage eventually runs into issues with
• Size
• Ease of updating
• Accuracy
• Security
• Redundancy
• Importance
Databases and RDBMS
• A database is an organized collection of
information
• It imposes rules on the contained data
• The relational storage model was first proposed by
Edgar Codd in 1970
Top Database Engines
• The DB-Engines Ranking ranks database
management systems according to their
popularity. The ranking is updated monthly.
https://db-engines.com/en/ranking
Download Clients & Servers
• Download SQL Server Express Edition from
Microsoft
https://www.microsoft.com/en-us/sql-server/sql-server-editions-express
Query Execution
[Diagram: a query passes through Parsing, Compilation, Optimization, and Execution stages]
[Diagram: data divided into ranges – Range 1: 0-99, Range 2: 100-199, Range 3: 200-299]
Views
• Views are prepared queries for displaying
sections of our data
• Evaluated at run time – they do not increase
performance
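A small sketch of a view using SQLite from Python (table, view, and column names are illustrative, not from the slides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Salary REAL)")
conn.executemany("INSERT INTO Employees VALUES (?, ?)",
                 [("Ana", 1500.0), ("Boris", 900.0), ("Vera", 2100.0)])

# A view is a stored query, not a copy of the data: it is evaluated
# at run time, so later changes to Employees are visible through it.
conn.execute(
    "CREATE VIEW v_HighPaid AS SELECT Name FROM Employees WHERE Salary > 1000")

rows = conn.execute("SELECT Name FROM v_HighPaid ORDER BY Name").fetchall()
print(rows)  # [('Ana',), ('Vera',)]
```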
Stored Procedures
EXEC p_LoadEmployees
Functions
SELECT [dbo].[f_GetAge]('2004/12/26')
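The T-SQL body of f_GetAge is not shown on the slide; a sketch of a comparable scalar function, registered as a user-defined SQL function in SQLite, might look like this (the fixed "today" date is an assumption so the result is reproducible):

```python
import sqlite3
from datetime import date

def get_age(birth_date, today="2024-12-26"):
    """Whole years between birth_date ('YYYY/MM/DD') and a fixed reference date."""
    b = date.fromisoformat(birth_date.replace("/", "-"))
    t = date.fromisoformat(today)
    return t.year - b.year - ((t.month, t.day) < (b.month, b.day))

conn = sqlite3.connect(":memory:")
conn.create_function("f_GetAge", 1, get_age)  # expose it to SQL queries

age = conn.execute("SELECT f_GetAge('2004/12/26')").fetchone()[0]
print(age)  # 20 with the fixed reference date above
```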
Triggers
CREATE TRIGGER t_DepartmentHistory ON [SoftUni].[dbo].[Departments]
FOR DELETE AS
BEGIN
    DECLARE @DepartmentID int;
    DECLARE @Name nvarchar(50);
    DECLARE @ManagerID int;
    -- The rest of the body is truncated in the source; a minimal completion
    -- (the history table name is assumed) would copy the deleted row:
    SELECT @DepartmentID = DepartmentID, @Name = Name, @ManagerID = ManagerID FROM deleted;
    INSERT INTO [dbo].[DepartmentsHistory] VALUES (@DepartmentID, @Name, @ManagerID);
END
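The same idea can be demonstrated end to end with a DELETE trigger in SQLite (the history table name is assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (DepartmentID INTEGER, Name TEXT, ManagerID INTEGER);
CREATE TABLE DepartmentsHistory (DepartmentID INTEGER, Name TEXT, ManagerID INTEGER);

-- When a department row is deleted, copy it into the history table.
CREATE TRIGGER t_DepartmentHistory AFTER DELETE ON Departments
BEGIN
    INSERT INTO DepartmentsHistory
    VALUES (OLD.DepartmentID, OLD.Name, OLD.ManagerID);
END;

INSERT INTO Departments VALUES (1, 'Sales', 10);
DELETE FROM Departments WHERE DepartmentID = 1;
""")

history = conn.execute("SELECT * FROM DepartmentsHistory").fetchall()
print(history)  # [(1, 'Sales', 10)]
```

Unlike the T-SQL version, SQLite triggers fire once per row, so no explicit cursor or variable declarations are needed.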