Big Data - Sources and Opportunities

WRM
What is Big Data?
• Big data is a process to deliver decision-making insights.
• The process uses people and technology to quickly analyze large
amounts of data of different types (traditional table structured data
and unstructured data, such as pictures, video, email, transaction
data, and social media interactions) from a variety of sources to
produce a stream of actionable knowledge.
• “Big Data” refers to datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, and
analyze.
• This definition is intentionally subjective and incorporates a moving
definition of how big a dataset needs to be in order to be considered
Big Data – i.e., we don’t define big data in terms of being larger than a
certain number of terabytes (thousands of gigabytes).
• We assume that, as technology advances over time, the size of
datasets that qualify as big data will also increase.
• Also note that the definition can vary by sector, depending on what
kinds of software tools are commonly available and what sizes of
datasets are common in a particular industry.
THREE Vs
• Gartner indicates the following :
• Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making.
Four primary technologies for accelerating
processing of huge data sets :
• Grid Computing : A centrally managed grid infrastructure provides dynamic
workload balancing, high availability and parallel processing for data
management, analytics and reporting.
• In-Database Processing : Moving relevant data management, analytics and
reporting tasks to where the data resides improves speed to insight,
reduces data movement and promotes better data governance.
• In-Memory Analytics : Quickly solves complex problems using big data and
sophisticated analytics in an unfettered manner.
• Visual Analytics : Using Visual Analytics, one can very quickly see
correlations and patterns in big data, identify opportunities for further
analysis and easily publish reports
Measures of Location
• Measures of location provide estimates of a single value that in some
fashion represents the “centering” of a set of data. The most common
is the average.
• We all use averages routinely in our lives, for example :
• To measure student accomplishment in college (e.g., grade point
average).
• To measure the performance of sports teams (e.g., batting average).
• To measure performance in business (e.g., average delivery time).
Arithmetic Mean
• The arithmetic mean is the sum of the observations divided by the number of
observations. The Excel function AVERAGE (data range) computes it.
• The sum of the deviations above the mean is the same as the sum of the
deviations below the mean; essentially, the mean “balances” the values on
either side of it.
• However, this does not imply that half the data lie above the mean and half
below it, which is a common misconception.
• In addition, the mean is unique for every set of data and is meaningful for
both interval and ratio data.
• However, it can be affected by outliers—observations that are radically
different from the rest—which pull the value of the mean toward these values.
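The balancing property and the pull of an outlier can be checked with a short Python sketch. The delivery times are made-up illustrative values, not data from these slides.

```python
from statistics import mean

# Hypothetical delivery times in days (illustrative values only).
delivery_times = [2, 3, 3, 4, 5, 7]

avg = mean(delivery_times)  # 24 / 6 = 4
deviations = [x - avg for x in delivery_times]

# Deviations above and below the mean balance each other.
sum_above = sum(d for d in deviations if d > 0)
sum_below = -sum(d for d in deviations if d < 0)

# The mean does not split the data in half: count each side.
count_above = sum(1 for x in delivery_times if x > avg)
count_below = sum(1 for x in delivery_times if x < avg)

# A single extreme observation pulls the mean toward it.
avg_with_outlier = mean(delivery_times + [60])  # 84 / 7 = 12

print(avg, sum_above, sum_below, count_above, count_below, avg_with_outlier)
```

Here two values lie above the mean and three below it, yet the deviations on each side both sum to 4.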
Median
• The measure of location that specifies the middle value when the data are
arranged from least to greatest is the median.
• Half the data are below the median, and half the data are above it. For an odd
number of observations, the median is the middle of the sorted numbers.
• For an even number of observations, the median is the mean of the two
middle numbers.
• We could use the Sort option in Excel to rank-order the data and then
determine the median.
• The Excel function MEDIAN (data range) could also be used. The median is
meaningful for ratio, interval, and ordinal data. As opposed to the mean, the
median is not affected by outliers.
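As a rough illustration of the odd and even cases, the sketch below uses Python's standard statistics module in place of Excel. The numbers are made up for the example.

```python
from statistics import mean, median

# Illustrative values only.
odd_count = [7, 1, 5, 3, 9]   # sorted: 1 3 5 7 9, middle value is 5
even_count = [7, 1, 5, 3]     # sorted: 1 3 5 7, middle pair is 3 and 5

median_odd = median(odd_count)    # the middle value
median_even = median(even_count)  # the mean of the two middle values

# Replace the largest value with an extreme one: the median stays put,
# while the mean moves a long way.
with_outlier = [7, 1, 5, 3, 900]
median_outlier = median(with_outlier)
mean_before = mean(odd_count)
mean_after = mean(with_outlier)

print(median_odd, median_even, median_outlier, mean_before, mean_after)
```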
Mode
• A third measure of location is the mode. The mode is the observation that occurs
most frequently.
• The mode is most useful for data sets that contain a relatively small number of
unique values.
• For data sets that have few repeating values, the mode does not provide much
practical value.
• You can easily identify the mode from a frequency distribution by identifying the
value having the largest frequency or from a histogram by identifying the highest
bar.
• You may also use the Excel function MODE.SNGL (data range).
• For frequency distributions and histograms of grouped data, the mode is the group
with the greatest frequency.
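A minimal sketch of finding the mode from a frequency count, using Python's standard library rather than Excel. The ratings are hypothetical survey responses.

```python
from collections import Counter
from statistics import mode, multimode

# Hypothetical survey ratings with few unique values (illustrative only).
ratings = [3, 4, 4, 5, 4, 2, 5, 4, 3]

# Frequency distribution: each unique value mapped to its count.
frequency = Counter(ratings)

# The mode is the value with the largest frequency.
most_frequent = mode(ratings)

# When several values tie for the largest frequency, multimode lists them all.
tied_modes = multimode([1, 1, 2, 2, 3])

print(dict(frequency), most_frequent, tied_modes)
```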
Measures of Variability
• Variance : A more commonly used measure of dispersion is
the variance, whose computation depends on all the data. The larger
the variance, the more the data are spread out from the mean and
the more variability one can expect in the observations. The formula
used for calculating the variance is different for populations and
samples.
• The Excel function VAR.S (data range) may be used to compute the
sample variance, s², whereas the Excel function VAR.P (data range) is
used to compute the variance of a population, σ².
• Note that the dimension of the variance is the square of the
dimension of the observations. So for example, the variance of the
cost per order is not expressed in dollars, but rather in dollars
squared. This makes it difficult to use the variance in practical
applications. However, a measure closely related to the variance that
can be used in practical applications is the standard deviation.
• The Excel function STDEV.P (data range) calculates the standard
deviation for a population (σ); the function STDEV.S (data range)
calculates it for a sample (s).
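The difference between the population and sample formulas can be sketched in Python. The population variance divides the sum of squared deviations by n, and the sample variance divides by n - 1. The cost values are made up, and the standard-library functions are used here as rough counterparts of VAR.P, VAR.S, STDEV.P and STDEV.S.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical cost per order in dollars (illustrative values only).
costs = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(costs)
avg = mean(costs)
sum_sq_dev = sum((x - avg) ** 2 for x in costs)  # sum of squared deviations

# By hand: divide by n for a population, by n - 1 for a sample.
population_variance = sum_sq_dev / n
sample_variance = sum_sq_dev / (n - 1)

# Same quantities from the standard library.
lib_population_variance = pvariance(costs)
lib_sample_variance = variance(costs)

# Standard deviations are the square roots, back in dollars rather than
# dollars squared.
population_sd = pstdev(costs)
sample_sd = stdev(costs)

print(population_variance, sample_variance, population_sd, sample_sd)
```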
Linear Regression
• In Statistical Modeling, Regression Analysis is used to estimate the relationships
between two or more variables :
• Dependent variable : (aka criterion variable) the main factor you are trying to
understand and predict.
• Independent variables : (aka explanatory variables, or predictors) the factors
that might influence the dependent variable.
• Regression Analysis helps you understand how the dependent variable changes
when one of the independent variables varies and allows you to mathematically
determine which of those variables really has an impact.
• Technically, a regression analysis model is based on the sum of squares, which is a
mathematical way to find the dispersion of data points. The goal of a model is to get
the smallest possible sum of squares and draw a line that comes closest to the data.
• Statistics distinguishes between simple and multiple linear
regression.
• Simple linear regression models the relationship between a
dependent variable and one independent variable using a linear
function.
• If you use two or more explanatory variables to predict the
dependent variable, you deal with multiple linear regression.
• So, in Excel, you do linear regression using the least squares method and seek
coefficients a (the intercept) and b (the slope) such that :

y = bx + a
• For example, to predict umbrella sales from rainfall, the linear regression
equation takes the following shape :
• Umbrellas_Sold = b * rainfall + a
• There exist a handful of different ways to find a and b.
• The three main methods to perform linear regression analysis in Excel are :
• Regression tool included with Analysis ToolPak
• Scatter chart with a trendline
• Linear regression formula
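Outside Excel, the least squares coefficients a and b can be computed directly from their closed-form expressions. The sketch below is only an illustration: the rainfall and sales figures are invented, and fit_line is a hypothetical helper name, not part of any library.

```python
from statistics import mean

# Hypothetical monthly data (illustrative only, not from the slides).
rainfall = [80, 100, 120, 140, 160]       # e.g. millimetres per month
umbrellas_sold = [15, 19, 26, 29, 36]


def fit_line(x, y):
    """Return (a, b) that minimise the sum of squared residuals of y = b*x + a."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(rainfall, umbrellas_sold)

# Use the fitted equation Umbrellas_Sold = b * rainfall + a to predict.
predicted_sales = b * 130 + a

print(a, b, predicted_sales)
```

For these values the slope is about 0.26 and the intercept about -6.2, so each extra unit of rainfall is associated with roughly 0.26 more umbrellas sold.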
Interpret Regression Analysis Output
• Multiple R : It is the Correlation Coefficient that measures the strength of a
linear relationship between two variables. The correlation coefficient can be
any value between -1 and 1, and its absolute value indicates the relationship
strength. The larger the absolute value, the stronger the relationship:
• 1 means a perfect positive linear relationship
• -1 means a perfect negative linear relationship
• 0 means no linear relationship at all
• R Square : It is the Coefficient of Determination, which is used as an indicator
of the goodness of fit. It shows the proportion of the variance in the dependent
variable that the regression model explains. R² is calculated as 1 minus the ratio
of the residual sum of squares to the total sum of squares, where the total sum of
squares is the sum of the squared deviations of the original data from the mean.
• Adjusted R Square : It is the R square adjusted for the number of
independent variables in the model. You will want to use this value
instead of R square for multiple regression analysis.
• Standard Error : It is another goodness-of-fit measure that shows the
precision of your regression analysis - the smaller the number, the more
certain you can be about your regression equation. While R² represents
the percentage of the dependent variable's variance that is explained by
the model, Standard Error is an absolute measure that shows the
average distance that the data points fall from the regression line.
• Observations : It is simply the number of observations in your model.
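The summary statistics above can be reproduced by hand for a simple regression. This sketch assumes one independent variable (k = 1) and reuses invented rainfall and sales figures. regression_statistics is a hypothetical helper name, and the formulas are the standard textbook ones rather than anything taken from Excel's internals.

```python
from math import sqrt
from statistics import mean


def regression_statistics(x, y):
    """Summary statistics for a simple (one-predictor) least squares fit."""
    n, k = len(x), 1  # observations, number of independent variables
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar

    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    r_square = 1 - ss_residual / ss_total
    return {
        # Multiple R is reported as a non-negative value; for simple
        # regression it equals the absolute correlation coefficient.
        "multiple_r": sqrt(r_square),
        "r_square": r_square,
        "adjusted_r_square": 1 - (1 - r_square) * (n - 1) / (n - k - 1),
        "standard_error": sqrt(ss_residual / (n - k - 1)),
        "observations": n,
    }


# Hypothetical data (illustrative only).
rainfall = [80, 100, 120, 140, 160]
umbrellas_sold = [15, 19, 26, 29, 36]
stats = regression_statistics(rainfall, umbrellas_sold)

for name, value in stats.items():
    print(name, value)
```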
Regression Analysis Output: ANOVA
• The ANOVA (analysis of variance) table splits the sum of squares into individual
components that give information about the levels of variability within your
regression model :
• df is the number of the degrees of freedom associated with the sources of
variance.
• SS is the sum of squares. The smaller the Residual SS compared with the
Total SS, the better your model fits the data.
• MS is the mean square.
• F is the F statistic for the F-test of the null hypothesis that all slope
coefficients are zero. It is used to test the overall significance of the model.
• Significance F is the P-value of F.
• The ANOVA part is rarely used for a simple linear regression analysis
in Excel, but you should definitely have a close look at the last
component. The Significance F value gives an idea of how reliable
(statistically significant) your results are. If Significance F is less than
0.05 (5%), the model is statistically significant at the 5% level. If it is
greater than 0.05, you should probably choose another independent variable.
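The ANOVA components can be sketched for a simple regression as follows. The data are invented and anova_table is a hypothetical helper name. Significance F is left out because it needs the F distribution, which Python's standard library does not provide; a statistics package would be needed for that step.

```python
from statistics import mean


def anova_table(x, y):
    """Split the total sum of squares for a one-predictor least squares fit."""
    n, k = len(x), 1
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar

    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_regression = ss_total - ss_residual

    df_regression, df_residual = k, n - k - 1
    ms_regression = ss_regression / df_regression  # mean square = SS / df
    ms_residual = ss_residual / df_residual
    return {
        "df": (df_regression, df_residual, n - 1),
        "ss": (ss_regression, ss_residual, ss_total),
        "ms": (ms_regression, ms_residual),
        "f": ms_regression / ms_residual,
    }


# Hypothetical data (illustrative only).
rainfall = [80, 100, 120, 140, 160]
umbrellas_sold = [15, 19, 26, 29, 36]
table = anova_table(rainfall, umbrellas_sold)

for name, value in table.items():
    print(name, value)
```

In this example the Residual SS (about 3.6) is small compared with the Total SS (274), which is the pattern a well-fitting model shows.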
How to do linear regression in Excel with Analysis ToolPak
• In your Excel, click File > Options.
• In the Excel Options dialog box, select Add-ins on the left sidebar,
make sure Excel Add-ins is selected in the Manage box, and click Go.
• In the Add-ins dialog box, tick Analysis ToolPak, and
click OK:
………………………..
TUTORIALS
• Using vehicle production data produce line and bar graphs (clustered)
• Using bank closures data produce pivot tables and charts for states, then for
Nevada and Florida in selected time periods
• Using stat score data produce a regression output
• THANK YOU!
