
The module consists of ten topics that focus on key areas of the fundamentals of statistics and statistical data mining that are required for data science. For each topic, the key concepts and the learning outcomes are listed below.

Topic 1: Exploratory data analysis (EDA)

Key concepts:
• Overview of statistics and scales of measurement
• Different types of variables in data
• Measures of location and variation (e.g. mean, median, mode, percentiles, interquartile range)
• Basic data visualisation

Learning outcomes:
• Understand the basics of statistics
• Explain different data types
• Describe data via basic statistics, such as mean, median and so on
• Explain the concept of variability in data
• Generate basic statistical and visual summaries of data using Python
• Understand how to interpret box, histogram and bar plots
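The measures of location and variation listed above can be computed with Python's standard `statistics` module; this is a minimal sketch with made-up sample values, not the module's own exercise code.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # measure of location: 5.0
median = statistics.median(data)        # robust measure of location: 4.5
mode = statistics.mode(data)            # most frequent value: 4
stdev = statistics.pstdev(data)         # population standard deviation: 2.0
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                           # interquartile range
```

In practice the module would more likely use pandas and matplotlib for the visual summaries (box, histogram and bar plots), but the statistics themselves are the same.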

Topic 2: Data preprocessing, correlation and probability overview

Key concepts:
• Typical problems with data, such as missing values and the need for type transformation
• Correlation (and causation) and covariance
• Scatter plots and heatmaps
• Introduction to probability theory
• Conditional probability

Learning outcomes:
• Be able to deal with missing values in data
• Transform data into different representations
• Describe the relationship between numeric variables
• Understand the purpose and importance of probabilities
• Work out probability values in various situations (e.g. conditional events, independent events and so on)
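Two of the ideas above can be sketched in a few lines of plain Python: the Pearson correlation coefficient (covariance scaled by the two standard deviations) and a conditional probability. The numbers are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linear, so r = 1.0

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_and_b = 0.12
p_b = 0.30
p_a_given_b = p_a_and_b / p_b             # 0.4
```

Note that a strong correlation here says nothing about causation, which is exactly the distinction the topic draws.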

Topic 3: Sampling and hypothesis tests

Key concepts:
• Data sampling and why it is important and useful
• Random sampling, bias and the central limit theorem
• Confidence intervals
• Resampling and bootstrapping
• Various data distributions
• Hypothesis tests

Learning outcomes:
• Understand sampling, random selection and bias
• Describe the sampling distribution of a statistic and the central limit theorem
• Explain confidence intervals
• Differentiate between data distributions (i.e. normal distribution, binomial distribution and so on)
• Understand hypothesis tests, the null and alternative hypotheses
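Resampling and confidence intervals come together in the bootstrap: resample the data with replacement many times and read a confidence interval off the percentiles of the resampled statistics. A minimal standard-library sketch, with a made-up sample and seed:

```python
import random
import statistics

random.seed(42)
sample = [12, 15, 14, 10, 18, 20, 11, 16, 13, 17]

# Resample with replacement and record the mean of each resample
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(10_000)
)

# The 2.5th and 97.5th percentiles give an approximate 95% CI for the mean
lower = boot_means[249]     # 250th of 10,000 sorted values
upper = boot_means[9749]    # 9,750th of 10,000 sorted values
```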

Topic 4: Significance tests

Key concepts:
• Critical appraisal and its purposes
• Hypothesis tests: the null hypothesis and the alternative hypothesis
• Statistical significance and p-values
• t-tests, the ANOVA test, the chi-square test, and power and sample size analysis

Learning outcomes:
• Understand the concept of hypothesis testing
• Understand and interpret the p-value
• Explain various types of significance tests and select the correct test based on the data
• Perform hypothesis tests in Python
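In practice the tests above would typically be run with `scipy.stats` (e.g. `ttest_ind`). As a library-free illustration of the same null-hypothesis logic, here is a two-sample permutation test of "the two group means are equal"; the group values and seed are made up.

```python
import random

random.seed(0)
group_a = [6.1, 7.0, 6.6, 7.2, 6.8, 7.4]
group_b = [5.2, 5.9, 5.5, 6.0, 5.7, 5.4]

# Observed test statistic: absolute difference in group means
observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

# Under the null hypothesis the group labels are exchangeable, so shuffle
# them and see how often a difference this large arises by chance
pooled = group_a + group_b
n_a = len(group_a)
count = 0
n_perm = 10_000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
    if diff >= observed:
        count += 1

p_value = count / n_perm   # a small p-value is evidence against the null
```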

Topic 5: Linear regression

Key concepts:
• Correlation and simple linear regression
• The method of least squares
• The null and alternative hypotheses in linear regression
• Error and evaluation methods and metrics
• Model interpretation
• Multiple linear regression
• The problem of collinearity

Learning outcomes:
• Understand regression as a modelling method
• Apply the method of least squares and calculate useful metrics for diagnostics and evaluation
• Be aware of the two uses of regression: as a hypothesis test and as an aid to accurate prediction
• Generate simple or multiple linear regression models in Python and interpret them correctly
• Understand how model selection works and why it is useful
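The method of least squares can be sketched directly for the simple (one-predictor) case: the slope is cov(x, y) / var(x) and the line passes through the point of means. The data here are made up; in the module this would usually be done with statsmodels or scikit-learn.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.2, 5.9, 8.1, 9.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# R^2: proportion of the variance in y explained by the fitted line
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
```

Multiple regression generalises this to several predictors (solved as a matrix equation), which is where collinearity between predictors becomes a concern.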
Topic 6: Logistic regression

Key concepts:
• A brief overview of regression and classification
• The sigmoid function, odds and the odds ratio
• Maximum likelihood estimation (MLE)
• Interpreting the results of logistic regression
• Variable (i.e. feature) importance based on logistic regression

Learning outcomes:
• Explain odds and the odds ratio, and explain the effect of the sigmoid function
• Understand the concept of maximum likelihood estimation (MLE)
• Extract and interpret class probabilities from logistic regression
• Obtain feature importance from a logistic regression model
• Logistic regression in Python
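The building blocks above fit together as follows: the model produces a linear score (the log-odds), the sigmoid maps that score to a class probability, and exponentiating a coefficient gives an odds ratio. A minimal sketch with made-up (hypothetical, not fitted) coefficients:

```python
import math

def sigmoid(z):
    """Map a log-odds score z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

intercept, coef = -1.0, 0.8    # hypothetical model: log-odds = -1 + 0.8 * x

x = 2.0
log_odds = intercept + coef * x        # 0.6
prob = sigmoid(log_odds)               # P(class = 1 | x)
odds = prob / (1 - prob)               # equals exp(log_odds)
odds_ratio = math.exp(coef)            # odds multiplier per unit increase in x
```

In real use the coefficients come from maximum likelihood estimation, e.g. via scikit-learn's `LogisticRegression`.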

Topic 7: Extreme gradient boosting (XGBoost)

Key concepts:
• Overview of gradient boosting, decision trees and bagging
• How XGBoost works
• XGBoost for classification and regression
• Overview of hyperparameters and how to tune them for XGBoost
• Variable (i.e. feature) importance based on XGBoost

Learning outcomes:
• Understand why XGBoost is powerful
• Become familiar with hyperparameter tuning
• Obtain feature importance from an XGBoost model
• XGBoost in Python
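XGBoost itself is used via the `xgboost` package; as a hedged sketch of the underlying gradient-boosting idea only (not of XGBoost's actual regularised algorithm), this fits a sequence of depth-1 "stumps", each one trained on the residuals of the ensemble so far. The toy data, learning rate and round count are made up.

```python
def fit_stump(xs, residuals):
    """Best single-split predictor minimising squared error on the residuals."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lv, rv)
    _, split, lv, rv = best
    return lambda x: lv if x <= split else rv

xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.1, 1.3, 3.9, 4.1, 4.0]

learning_rate = 0.5
prediction = [sum(ys) / len(ys)] * len(xs)    # start from the mean
stumps = []
for _ in range(20):                           # 20 boosting rounds
    residuals = [y - p for y, p in zip(ys, prediction)]
    stump = fit_stump(xs, residuals)
    stumps.append(stump)
    prediction = [p + learning_rate * stump(x)
                  for p, x in zip(prediction, xs)]
```

The learning rate and number of rounds are exactly the kind of hyperparameters the topic covers tuning in XGBoost (alongside tree depth, subsampling and regularisation).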

Topic 8: Working with imbalanced data

Key concepts:
• Why class imbalance is a challenging problem and the drawbacks of accuracy
• Other useful classification evaluation metrics
• Data sampling methods to balance imbalanced data
• Cost-sensitive classification (with example models)

Learning outcomes:
• Understand the problem of class imbalance
• Choose a suitable metric (other than accuracy)
• Explain how data sampling can help in dealing with class-imbalanced data
• Understand the confusion matrix and the cost matrix
• Understand cost-sensitive classification
• Python code with a real-world example problem (tuning the probability threshold using a cost matrix)
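The threshold-tuning idea in the last outcome can be sketched as follows: given predicted probabilities for the positive (minority) class and a cost matrix, choose the probability threshold that minimises total misclassification cost. The probabilities, labels and costs below are all made up.

```python
# cost[actual][predicted]: a false negative costs 5x a false positive
COST = {0: {0: 0, 1: 1}, 1: {0: 5, 1: 0}}

probs = [0.05, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]

def total_cost(threshold):
    """Total cost if we predict class 1 whenever prob >= threshold."""
    return sum(COST[y][1 if p >= threshold else 0]
               for p, y in zip(probs, labels))

# Sweep candidate thresholds and keep the cheapest
thresholds = [i / 100 for i in range(1, 100)]
best_threshold = min(thresholds, key=total_cost)
```

Because false negatives are so much more expensive here, the chosen threshold ends up well below the default of 0.5 — the classifier is pushed to flag more cases as positive.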

Topic 9: Unsupervised learning and feature selection

Key concepts:
• Revision of k-means clustering
• Hierarchical clustering
• Self-organising map (SOM)
• Outlier detection
• Feature selection

Learning outcomes:
• Compare different types of clustering techniques
• Implement a simple k-means clustering method
• Explain how SOM works
• Understand what outliers are and automatically identify them
• Understand the importance and application of feature selection
• Python code examples for some of these techniques
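A simple k-means implementation (the second outcome above) alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A one-dimensional sketch with k = 2 and made-up points:

```python
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids = [0.0, 6.0]   # deliberately poor starting guesses

for _ in range(10):      # a few assignment/update rounds
    # Assignment step: attach each point to its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]
```

This toy version ignores practical details the full method handles, such as multiple random restarts, empty clusters and choosing k.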

Topic 10: Machine learning in the cloud – AWS as an example

Key concepts:
• Cloud computing and its benefits
• AWS’s SageMaker service
• Several artificial intelligence and machine learning services on AWS

Learning outcomes:
• Use SageMaker to launch a Jupyter Notebook instance
• Use AWS’s Transcribe, Translate and Polly services
• Become familiar with AWS’s Rekognition service
• Become familiar with AWS’s Comprehend service
