Professional Documents
Culture Documents
Unit Iii
Unit Iii
1
Dr S Palaniappan , Associate Professor , Department of ADS
Text Book 3
• Catherine Marsh, Jane Elliott, “Exploring Data: An Introduction to Data Analysis for
Social Scientists”, Wiley Publications, 2nd Edition, 2008
INTRODUCTION TO DATA
EXPLORATION
3
Dr S Palaniappan , Associate Professor , Department of ADS
Lesson Plan
Planned Relevant CO Highest Cognitive
Description of Portion to be Covered
Hour Nos level**
2 Introduction to Single variable CO1 K1
2 Distributions and Variables CO1 K1
2 Numerical Summaries of Level and spread CO1 K1
1 Scaling and Standardizing CO1 K1
1 Inequality CO1 K1
1 Smoothing Time Series CO1 K1
• The analysis of univariate data is thus the simplest form of analysis since the
information deals with only one quantity that changes
• The main purpose of the analysis is to describe the data and find patterns that exist
within it.
• Travel Time (minutes): 15, 29, 8, 42, 35, 21, 18, 42, 26
• Find how spread out it is using range, quartiles and standard deviation
2. Variables
• Historians may collect information on a series of years, and look for similar patterns
in such things as the unemployment rate and the incidence of suicide or depression;
here years are the cases (or units of analysis).
• In a survey of individuals, their age, income, sex and satisfaction with life are some
of the variables that might be recorded.
• We mean that the individual cases do not all have the same income, the same sex and
age etc.
• The main aim of the survey is to collect data on a range of core topics, covering
household, family and individual information
• The two datasets we will use in this chapter provide information from the GHS:
• First about a sample of individuals
• And second, about a sample of households. (No. of dependent Children)
• Third unit - A family unit is taken to consist of single people or married couples living
with dependent children
Dr S Palaniappan , Associate Professor , Department of ADS 12
Individual Data
• In large-scale surveys such as the GHS different numbers are routinely used to
indicate different types of missing data.
• A dot'.' was used to indicate that data had not been collected since it was a child.
• In addition it can be seen that person number 12 has a '97' recorded for the variable
NS-SECS - Info was insufficient
134
121
167
• Bar chart, a visual display in which bars are drawn to represent each category of a
variable such that the length of the bar is proportional to the number of cases in the
category.
• A pie chart categorizes various data into segments. Bar charts display them
horizontally or vertically.
• Level
• Spread
• Shape
• Outliers
Spread : How widely dispersed are the values? Do they differ very much from
one another?
• Profit of a business
• Sales of a product in month
• Test Scores
• We shall recode the interval variable into an ordinal variable with three
categories:
For Women For Men
1. No alcohol drunk (none) 1. No alcohol drunk (none)
2. 1-21 units drunk (moderate 2. 1-14 units drunk (moderate
drinking) drinking)
3. Over 21 units drunk (heavy 3. Over 14 units drunk (heavy
drinking) drinking)
outliers
It is an interval-level variable
• The measurement of hours that people work is important when analyzing a variety of
economic and social phenomena
• The patterns of hours worked and comparisons of the hours worked by different
groups - lifestyles, the labour market and social changes.
Dr S Palaniappan , Associate Professor , Department of ADS 35
Working hours of couples in Britain
• Level, Spread, Shape And Outliers
• Level is expressed on the previous example running from 1 hour per week to 100
hours per week
• To summarize these values, one number must be found to express the typical hours
worked by men, for example.
• The mean and the median will tend to be the same in symmetrical datasets.
IQR = 15
55
• Men are typically working 4 hours more per week than the women.
• Middle 50 per cent of the population of men work between 37 and 42.9 hours per week -
everyone would understand the statement
• We have already noted that means and standard deviations are more influenced by unusual
data values than medians and midspreads.
• Means and SD more influenced by a change in any individual data point then Median and
dQ
Dr S Palaniappan , Associate Professor , Department of ADS 50
• If we were entirely confident that the numbers collected were accurate, we might
prefer measures that used more information.
• The sampling theory of the mean and standard deviation is more developed than that
of the median and midspread,
• The range has all of the disadvantages discussed above and none of the
advantages.
Dr S Palaniappan , Associate Professor , Department of ADS 51
Individual and aggregate level data
• Some analysis has already been carried out, and that the data are summarized
• Annual Survey of Hours and Earnings in Britain - are not generally available at the
individual level.
Distribution of
hours worked in
terms of deciles
• The change made to the data by adding or subtracting a constant is fairly trivial.
• Only the level is affected; spread, shape and outliers remain unaltered.
• Multiplying or dividing each of the values has a more powerful effect than adding or
subtracting.
• Scale the entire variable by a factor, evenly stretching or shrinking the axis like a
piece of elastic
The whole
distribution has
simply been
scaled by a
constant factor
• Neither the level nor the spread takes the same numeric value as previously.
• The original scale has been lost temporarily, but it could always be retrieved by
rescaling
• A variable which has been standardized in this way is forced to have a mean or
median of 0 and a standard deviation or midspread of 1.
It might be
useful to try to
turn them into a
single measure
of that construct
• The Act was amended with effect from the beginning of January 1984,
• What gap between the richest and the poorest in society is shifting over time?
• Household disposable income per head, adjusted for inflation ↑ 1/3rd 1971 to 2003
£100 a household had to spend in 1971, by 2003 they had to spend £234
• Extra income has a bigger impact on increasing the happiness of the poor than the
rich - Layard (2005)
• On the other hand, wealth is the market price of the stock of asset possessed by an
individual or household.
1. Follow accounting and tax practices, and make a clear distinction between income
and additions to wealth
2. Treat income as the value of goods and services consumed in a given period plus
net changes in personal wealth during that period.
• Earned income
- from either employment or self-employment;
• Unearned income
- which accrues from ownership of investments, property, rent and so on;
• Transfer income
- that is benefits and pensions transferred on the basis of entitlement, not on the
basis of work or ownership, mainly by the government but occasionally by
individuals (e.g. alimony).
• Income are generally shared across broader units - food, housing, fuel and electricity.
• A £40,000 annual income is a “lower” income for a family of four than for a single
adult
• For example, a household with a married couple and one child aged six would have
an equivalence number of
• 1.0 + 0.21 = 1.21
• In the example above the household's disposable income is £40,000, and so its
equalized disposable income is £33,058 (i.e. £40,000/1.21).
• Income inequality will appear greater the shorter the time period considered.
1. The Department for Work and Pensions uses the Family Resources Survey (FRS)
- 'Households Below Average Income‘
2. Office for National Statistics uses the Expenditure and Food Survey (EFS) to
produce a series on the redistribution of income published annually as 'The effects
of taxes and benefits on household incomes'.
• The last two rows of the first column of figure shows the Government Didn't
eradicate inequality in original or 'pre-tax' incomes.
• For example, the top 10 per cent of income units had over thirty times as much
income as the bottom 10 per cent of units.
• The graph plots percentiles of the population on the horizontal axis according to
income or wealth and plots cumulative income or wealth on the vertical axis.
Gini coefficient = A / (A + B)
• Smoothing is usually done to help us better see patterns, trends for example, in time
series.
• Sharp variations in a time series can occur for many reasons - it could be sampling
error.
• The process of smoothing time series also produces such a decomposition of the
data.
• Blair first promised that government would use both punishment and prevention to
tackle the rising crime rates
• The results of opinion polls which aim to capture individuals political allegiances
and voting intentions.
• This is measured by asking respondents to rate their certainty to vote on a scale from
1 to 10,
• where '1' means absolutely certain not to vote and
• '10' means absolutely certain to vote, and
• In March, April and May 2003, the median is 43 per cent, so April's value is unchanged.
• In April, May and June 2003, the median is 41 per cent, so the value for May is altered
to 41
Dr S Palaniappan , Associate Professor , Department of ADS 100
1 2 3 4 5
Data Medians of 3 Residuals Means of 3 Residuals
Mar-04 35 35 0 35.0 -0.3
Apr-04 36 35 1 35.3 0.7
May-04 35 35 0 35.0 0.0
Jun-04 34 34 0 33.7 0.3
Jul-04 32 34 -2 34.0 -2.0
Aug-04 36 32 4 33.3 2.7
Sep-04 32 36 -4 35.7 -3.7
Oct-04 39 35 4 35.3 3.7
Nov-04 35 35 0 36.3 -1.3
Dec-04 35 35 0 36.0 -1.0
Jan-05 38 38 0 38.0 0
• A procedure called hanning, named after its protagonist is used to smooth further