
The module consists of ten topics that focus on key areas of the fundamentals of statistics and statistical data mining that are required for data science. For each topic, the key concepts and the learning outcomes are listed below.

Topic 1: Exploratory data analysis (EDA)

Key concepts:
• Overview of statistics and scales of measurement
• Different types of variables in data
• Measures of location and variation (e.g. mean, median, mode, percentiles, interquartile range)
• Basic data visualisation

Learning outcomes:
• Understand the basics of statistics
• Explain different data types
• Describe data via basic statistics, such as mean, median and so on
• Explain the concept of variability in data
• Generate basic statistical and visual summaries of data using Python
• Understand how to interpret box, histogram and bar plots
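The measures of location and variation listed above can be computed with Python's standard `statistics` module; this is a minimal sketch with made-up sample values, not the module's own exercise code.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # measure of location: 5.0
median = statistics.median(data)        # robust measure of location: 4.5
mode = statistics.mode(data)            # most frequent value: 4
stdev = statistics.pstdev(data)         # population standard deviation: 2.0
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                           # interquartile range
```

In practice the module would more likely use pandas and matplotlib for the visual summaries (box, histogram and bar plots), but the statistics themselves are the same.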

Topic 2: Data preprocessing, correlation and probability overview

Key concepts:
• Typical problems with data, such as missing values and the need for type transformation
• Correlation (and causation) and covariance
• Scatter plots and heatmaps
• Introduction to probability theory
• Conditional probability

Learning outcomes:
• Be able to deal with missing values in data
• Transform data into different representations
• Describe the relationship between numeric variables
• Understand the purpose and importance of probabilities
• Work out probability values in various situations (e.g. conditional events, independent events and so on)
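Two of the ideas above can be sketched in a few lines of plain Python: the Pearson correlation coefficient (covariance scaled by the two standard deviations) and a conditional probability. The numbers are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linear, so r = 1.0

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_and_b = 0.12
p_b = 0.30
p_a_given_b = p_a_and_b / p_b             # 0.4
```

Note that a strong correlation here says nothing about causation, which is exactly the distinction the topic draws.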

Topic 3: Sampling and hypothesis tests

Key concepts:
• Data sampling and why it is important and useful
• Random sampling, bias and the central limit theorem
• Confidence intervals
• Resampling and bootstrapping
• Various data distributions
• Hypothesis tests

Learning outcomes:
• Understand sampling, random selection and bias
• Describe the sampling distribution of a statistic and the central limit theorem
• Explain confidence intervals
• Differentiate between data distributions (i.e. normal distribution, binomial distribution and so on)
• Understand hypothesis tests, the null and alternative hypotheses
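Resampling and confidence intervals come together in the bootstrap: resample the data with replacement many times and read a confidence interval off the percentiles of the resampled statistics. A minimal standard-library sketch, with a made-up sample and seed:

```python
import random
import statistics

random.seed(42)
sample = [12, 15, 14, 10, 18, 20, 11, 16, 13, 17]

# Resample with replacement and record the mean of each resample
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(10_000)
)

# The 2.5th and 97.5th percentiles give an approximate 95% CI for the mean
lower = boot_means[249]     # 250th of 10,000 sorted values
upper = boot_means[9749]    # 9,750th of 10,000 sorted values
```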

Topic 4: Significance tests

Key concepts:
• Critical appraisal and its purposes
• Hypothesis tests: the null hypothesis and the alternative hypothesis
• Statistical significance and p-values
• t-tests, the ANOVA test, the chi-square test, and power and sample size analysis

Learning outcomes:
• Understand the concept of hypothesis testing
• Understand and interpret the p-value
• Explain various types of significance tests and select the correct test based on the data
• Perform hypothesis tests in Python
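In practice the tests above would typically be run with `scipy.stats` (e.g. `ttest_ind`). As a library-free illustration of the same null-hypothesis logic, here is a two-sample permutation test of "the two group means are equal"; the group values and seed are made up.

```python
import random

random.seed(0)
group_a = [6.1, 7.0, 6.6, 7.2, 6.8, 7.4]
group_b = [5.2, 5.9, 5.5, 6.0, 5.7, 5.4]

# Observed test statistic: absolute difference in group means
observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

# Under the null hypothesis the group labels are exchangeable, so shuffle
# them and see how often a difference this large arises by chance
pooled = group_a + group_b
n_a = len(group_a)
count = 0
n_perm = 10_000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
    if diff >= observed:
        count += 1

p_value = count / n_perm   # a small p-value is evidence against the null
```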

Topic 5: Linear regression

Key concepts:
• Correlation and simple linear regression
• The method of least squares
• The null and alternative hypotheses in linear regression
• Error and evaluation methods and metrics
• Model interpretation
• Multiple linear regression
• The problem of collinearity

Learning outcomes:
• Understand regression as a modelling method
• Apply the method of least squares and calculate useful metrics for diagnostics and evaluation
• Be aware of the two uses of regression: as a hypothesis test and as an aid to accurate prediction
• Generate simple or multiple linear regression models in Python and interpret them correctly
• Understand how model selection works and why it is useful
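The method of least squares can be sketched directly for the simple (one-predictor) case: the slope is cov(x, y) / var(x) and the line passes through the point of means. The data here are made up; in the module this would usually be done with statsmodels or scikit-learn.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.2, 5.9, 8.1, 9.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# R^2: proportion of the variance in y explained by the fitted line
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
```

Multiple regression generalises this to several predictors (solved as a matrix equation), which is where collinearity between predictors becomes a concern.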
Topic 6: Logistic regression

Key concepts:
• A brief overview of regression and classification
• The sigmoid function, odds and the odds ratio
• Maximum likelihood estimation (MLE)
• Interpreting the results of logistic regression
• Variable (i.e. feature) importance based on logistic regression

Learning outcomes:
• Explain odds and the odds ratio, and explain the effect of the sigmoid function
• Understand the concept of maximum likelihood estimation (MLE)
• Extract and interpret class probabilities from logistic regression
• Obtain feature importance from a logistic regression model
• Logistic regression in Python
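The building blocks above fit together as follows: the model produces a linear score (the log-odds), the sigmoid maps that score to a class probability, and exponentiating a coefficient gives an odds ratio. A minimal sketch with made-up (hypothetical, not fitted) coefficients:

```python
import math

def sigmoid(z):
    """Map a log-odds score z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

intercept, coef = -1.0, 0.8    # hypothetical model: log-odds = -1 + 0.8 * x

x = 2.0
log_odds = intercept + coef * x        # 0.6
prob = sigmoid(log_odds)               # P(class = 1 | x)
odds = prob / (1 - prob)               # equals exp(log_odds)
odds_ratio = math.exp(coef)            # odds multiplier per unit increase in x
```

In real use the coefficients come from maximum likelihood estimation, e.g. via scikit-learn's `LogisticRegression`.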

Topic 7: Extreme gradient boosting (XGBoost)

Key concepts:
• Overview of gradient boosting, decision trees and bagging
• How XGBoost works
• XGBoost for classification and regression
• Overview of hyperparameters and how to tune them for XGBoost
• Variable (i.e. feature) importance based on XGBoost

Learning outcomes:
• Understand why XGBoost is powerful
• Become familiar with hyperparameter tuning
• Obtain feature importance from an XGBoost model
• XGBoost in Python
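XGBoost itself is used via the `xgboost` package; as a hedged sketch of the underlying gradient-boosting idea only (not of XGBoost's actual regularised algorithm), this fits a sequence of depth-1 "stumps", each one trained on the residuals of the ensemble so far. The toy data, learning rate and round count are made up.

```python
def fit_stump(xs, residuals):
    """Best single-split predictor minimising squared error on the residuals."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lv, rv)
    _, split, lv, rv = best
    return lambda x: lv if x <= split else rv

xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.1, 1.3, 3.9, 4.1, 4.0]

learning_rate = 0.5
prediction = [sum(ys) / len(ys)] * len(xs)    # start from the mean
stumps = []
for _ in range(20):                           # 20 boosting rounds
    residuals = [y - p for y, p in zip(ys, prediction)]
    stump = fit_stump(xs, residuals)
    stumps.append(stump)
    prediction = [p + learning_rate * stump(x)
                  for p, x in zip(prediction, xs)]
```

The learning rate and number of rounds are exactly the kind of hyperparameters the topic covers tuning in XGBoost (alongside tree depth, subsampling and regularisation).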

Topic 8: Working with imbalanced data

Key concepts:
• Why class imbalance is a challenging problem and the drawbacks of accuracy
• Other useful classification evaluation metrics
• Data sampling methods to balance imbalanced data
• Cost-sensitive classification (with example models)

Learning outcomes:
• Understand the problem of class imbalance
• Choose a suitable metric (other than accuracy)
• Explain how data sampling can help in dealing with class-imbalanced data
• Understand the confusion matrix and the cost matrix
• Understand cost-sensitive classification
• Python code with a real-world example problem (tuning the probability threshold using a cost matrix)
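The threshold-tuning idea in the last outcome can be sketched as follows: given predicted probabilities for the positive (minority) class and a cost matrix, choose the probability threshold that minimises total misclassification cost. The probabilities, labels and costs below are all made up.

```python
# cost[actual][predicted]: a false negative costs 5x a false positive
COST = {0: {0: 0, 1: 1}, 1: {0: 5, 1: 0}}

probs = [0.05, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]

def total_cost(threshold):
    """Total cost if we predict class 1 whenever prob >= threshold."""
    return sum(COST[y][1 if p >= threshold else 0]
               for p, y in zip(probs, labels))

# Sweep candidate thresholds and keep the cheapest
thresholds = [i / 100 for i in range(1, 100)]
best_threshold = min(thresholds, key=total_cost)
```

Because false negatives are so much more expensive here, the chosen threshold ends up well below the default of 0.5 — the classifier is pushed to flag more cases as positive.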

Topic 9: Unsupervised learning and feature selection

Key concepts:
• Revision of k-means clustering
• Hierarchical clustering
• Self-organising map (SOM)
• Outlier detection
• Feature selection

Learning outcomes:
• Compare different types of clustering techniques
• Implement a simple k-means clustering method
• Explain how SOM works
• Understand what outliers are and automatically identify them
• Understand the importance and application of feature selection
• Python code examples for some of these techniques
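A simple k-means implementation (the second outcome above) alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A one-dimensional sketch with k = 2 and made-up points:

```python
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids = [0.0, 6.0]   # deliberately poor starting guesses

for _ in range(10):      # a few assignment/update rounds
    # Assignment step: attach each point to its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]
```

This toy version ignores practical details the full method handles, such as multiple random restarts, empty clusters and choosing k.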

Topic 10: Machine learning in the cloud – AWS as an example

Key concepts:
• Cloud computing and its benefits
• AWS’s SageMaker service
• Several artificial intelligence and machine learning services on AWS

Learning outcomes:
• Use SageMaker to launch a Jupyter Notebook instance
• Use AWS’s Transcribe, Translate and Polly services
• Become familiar with AWS’s Rekognition service
• Become familiar with AWS’s Comprehend service
