
TMLS20 Machine Learning

Coursework

Niklas Lavesson
Niklas.Lavesson@ju.se
January 13, 2020

1 Introduction
This document provides information about the mandatory and optional coursework for the course
TMLS20 Machine Learning. The document may be updated frequently before the course starts, so students need to make sure to download the latest version from the course homepage in the learning management system. Once the course has started, the document is to be considered frozen or locked: students can assume that the version downloaded at the start of the course remains applicable until the course officially ends. For mandatory coursework, refer to Section 2 and Section 3. For optional coursework, refer to Section 4 and Section 5.

2 Assignments
It is possible to use the following programming languages and environments for the assignments: Jupyter Notebook1 with Python or Swift Playground Book2. Students need to ensure that the source code can be interpreted by Python 3.8+ or compiled with Swift 5+. Additional programming languages may be supported, but it is always the responsibility of the student to ensure that the selected programming environment is accepted by the examiner. Python source may depend on the following libraries only: default installation libraries, scikit-learn, numpy, and pandas. Swift source may depend on the Foundation library only.

Submission Format
Assignments containing more than one file must be compressed and archived using the Zip format. Students must ensure that the archive can be decompressed on Unix-compatible systems (Linux variants or BSD variants, including Darwin). The source code must be documented clearly and concisely. A README file with complete compilation and running instructions is required. If the source code is embedded in a Jupyter Notebook or Swift Playground Book (as it should be), the need for instructions is minimal.
1 https://jupyter.org
2 https://developer.apple.com/documentation/swift_playgrounds/

Data Set Format
Data sets must conform to the ARFF standard or the TMLS20 Machine Learning Data Set
Standard described in this document. Alternatively, if Python is used for development, it is
possible to use datasets that can be loaded by scikit-learn utility functions. For the TMLS20
Machine Learning Data Set Standard, data sets are stored as comma-separated files with two
header rows. The top header row (the first line in the file) provides the list of features (sometimes
referred to as attributes or variables), including potential target features. The bottom header
row (the second line in the file) provides the type for each listed feature. The following types are
available: n (nominal), r (real). The last feature represents the default target. The following
file is an example of a data set with five real input features and one nominal target feature:
a,b,c,d,e,f
r,r,r,r,r,n
0.1,0.7,218.3,17,?,yes

The file includes one data instance, classified as yes. The second-to-last feature has a missing value. Any white space, excluding end-of-line characters, must be skipped by a data reader. The comma symbol is used to separate features. The period symbol is used as the decimal separator. Students should expect the examiner to test assignment submissions with data sets that are unavailable to the students but adhere to one of the standards above.
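A minimal reader sketch for the TMLS20 format is given below. The function name, the return structure, and the decision to represent missing values as None are illustrative assumptions, not part of the standard.

import csv

def read_tmls20(path):
    """Read a file in the TMLS20 Machine Learning Data Set Standard.

    Returns (feature_names, feature_types, rows), where rows hold floats
    for 'r' features, strings for 'n' features, and None for missing
    values ('?'). Missing-value handling is an assumption in this sketch.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        names = [n.strip() for n in next(reader)]   # first header row: feature names
        types = [t.strip() for t in next(reader)]   # second header row: n or r
        rows = []
        for record in reader:
            row = []
            for value, kind in zip(record, types):
                value = value.strip()               # skip white space around values
                if value == "?":
                    row.append(None)
                elif kind == "r":
                    row.append(float(value))
                else:
                    row.append(value)
            rows.append(row)
    return names, types, rows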

Assignment 1 (1.5 credits)


The aim of the assignment is to implement from scratch a) a Naive Bayes learning algorithm for classification tasks, b) a cross-validation test, and c) a plot of the average ROC curve from a 10-fold cross-validation test. The submitted code should demonstrate the 10-fold cross-validation average area under the ROC curve and the ROC plot for Naive Bayes on three different versions of the iris dataset, which is directly available via scikit-learn. The dataset contains three classes (categories); for each scenario, remove one class (50 instances) while keeping the 100 instances from the two remaining classes. There are a number of different ways to implement Naive Bayes. The source code should include motivations for the design choices taken.
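One possible way to construct the three binary versions of the iris dataset is sketched below; the Naive Bayes implementation, the cross-validation test, and the ROC plotting are the core of the assignment and are not shown, and the variable names are illustrative only.

from sklearn.datasets import load_iris

# Load iris: 150 instances, 3 classes (0, 1, 2), 50 instances per class.
X, y = load_iris(return_X_y=True)

# Build the three binary scenarios by dropping one class at a time,
# keeping the 100 instances from the two remaining classes.
scenarios = {}
for dropped in (0, 1, 2):
    mask = y != dropped
    scenarios[f"without_class_{dropped}"] = (X[mask], y[mask])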

Assignment 2 (1.5 credits)


The aim of the assignment is to implement a decision tree induction algorithm for classification
tasks and to demonstrate that it works as expected. The algorithm shall process data sets
according to an approved standard (see above). It must be able to handle real-valued and
nominal features. The algorithm does not need to handle missing values or real-valued target
features (regression tasks). The student chooses whether to use information entropy or gini
impurity as the split criterion. To calculate binary splits for real-valued features, the following rule must be applied: an instance with a feature value lower than the mean feature value follows the left edge from the split node, while all other instances follow the right edge from the split node. Demonstrate that the algorithm works as expected on three classification data sets: Iris3, Wine4, and one additional data set of your own choice from the UCI machine learning repository.
3 http://archive.ics.uci.edu/ml/datasets/Iris
4 http://archive.ics.uci.edu/ml/datasets/Wine
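A minimal sketch of the mandated split rule for real-valued features is shown below. The function name and the return structure are illustrative assumptions; the rest of the tree induction (split-criterion computation, nominal features, recursion) is left to the student.

import numpy as np

def split_real_feature(X, y, feature_index):
    """Binary split on a real-valued feature at its mean value.

    Instances with a feature value lower than the mean follow the left
    edge from the split node; all other instances follow the right edge.
    """
    values = X[:, feature_index]
    threshold = values.mean()
    left = values < threshold
    right = ~left
    return (X[left], y[left]), (X[right], y[right]), threshold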

Assignment 3 (1.5 credits)
The aim of this assignment is to implement a multi-layer feed-forward neural network with back-
propagation for classification tasks. It should be possible for the user to specify (in the code)
the number of hidden layers and the number of neurons in each hidden layer. Choose at least
three benchmark datasets from a public repository and four hyperparameters in order to perform
parameter tuning to optimize predictive performance (accuracy) for each data set. The source
code should include justifications for the choice of hyperparameters as well as the interval and
step size used for parameter tuning.
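One way to let the user specify the number of hidden layers and neurons in code is to drive weight initialization from a list of layer sizes, as in the sketch below. The function name, the initialization scheme, and the use of numpy are assumptions; forward propagation, backpropagation, and the hyperparameter tuning itself are left to the student.

import numpy as np

def init_network(n_inputs, hidden_layers, n_outputs, seed=0):
    """Initialize weights and biases for a feed-forward network.

    hidden_layers is a list such as [16, 8], i.e. the user specifies in
    the code the number of hidden layers and the neurons in each layer.
    """
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + list(hidden_layers) + [n_outputs]
    weights = [rng.normal(0.0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]
    return weights, biases

# Example: a network with two hidden layers of 16 and 8 neurons.
weights, biases = init_network(n_inputs=4, hidden_layers=[16, 8], n_outputs=3)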

3 Project
The main deliverable for the project is the project report. Project reports should be prepared and typeset in LaTeX or Word using the IEEE conference proceedings template5. The recommended length of a report is four pages, but students are allowed up to six pages, excluding references. For the project, it is possible to use any freely available open-source libraries and software platforms. It is also possible to use any data set format. However, if the project depends on other data set formats or any additional software and libraries compared to what is accepted for the assignments, students may be asked to book an appointment with the examiner after submission to demonstrate compilation and running of the project code using their own computer and equipment.
Students are recommended to work in pairs on projects. For pair projects, a section entitled
Disclosure of Contribution must be included in the project report. In that section, the students
clarify the individual contributions of each student. Both students need to submit identical files
for examination in pair projects.

Machine Learning Project (3 credits)


The project should be of sufficient size and complexity. You should describe, in the project report, which activities were necessary to perform and the approximate time spent on each activity. The total time of a project of sufficient size and complexity is 160 hours (1.5 credits is roughly equivalent to one week of full-time study, i.e. 40 hours; 3 credits therefore equal 80 hours, and two students together have 160 hours). Note that the total time includes the time required to study a topic and write the report.
The aim of the project is to choose a machine learning task, identify an appropriate learning problem, identify a reasonable model type and learning algorithm, and choose a systematic approach to evaluate a solution to a real-world problem. The task, learning problem, model, evaluation procedure, and learning algorithm should be described and justified in the report. The student group is encouraged to choose an application of interest (natural language processing, computer vision, data mining, pattern recognition, etc.). Kaggle hosts a variety of competitions that can be used as inspiration, either as-is or elaborated upon.
A project should demonstrate the results of an independent investigation into an advanced machine learning topic or application. In most cases, the project report could serve as a preliminary study before taking on a Master’s thesis.
5 https://www.ieee.org/conferences/publishing/templates.html

4 Laboratory Exercises
Exercise 1 – Instance-based Learning
Implement the K-Nearest Neighbor algorithm for classification and regression from scratch and
verify that you achieve comparable results to the scikit-learn implementation of the algorithm,
using different K values, for various standard datasets available through scikit-learn. Use
cross-validation to compute average performance scores. Use accuracy (for classification) and
mean squared error (for regression) to compute performance scores.
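A possible evaluation harness for the classification part is sketched below. MyKNN (shown only as a comment) stands for the student's from-scratch implementation, and the dataset, K values, and fold count are examples; the regression part can reuse the same harness with mean squared error instead of accuracy.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

def cv_accuracy(make_model, X, y, k_folds=10):
    """Average accuracy over a k-fold cross-validation test."""
    scores = []
    for train, test in KFold(n_splits=k_folds, shuffle=True, random_state=0).split(X):
        model = make_model()
        model.fit(X[train], y[train])
        scores.append(accuracy_score(y[test], model.predict(X[test])))
    return np.mean(scores)

for k in (1, 3, 5, 7):
    reference = cv_accuracy(lambda: KNeighborsClassifier(n_neighbors=k), X, y)
    # mine = cv_accuracy(lambda: MyKNN(k), X, y)   # MyKNN: the from-scratch implementation
    print(f"k={k}: scikit-learn accuracy {reference:.3f}")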

Exercise 2 – K-Means Clustering


Implement the K-Means Clustering algorithm from scratch and evaluate your solution, using
different K values, for various standard datasets available through scikit-learn. Use cross-
validation to compute average performance scores. Search the scientific literature to find a suitable evaluation measure by which to evaluate your solution.
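As a starting point, the sketch below scores the scikit-learn K-Means implementation with the silhouette coefficient. Whether the silhouette coefficient or some other measure from the literature is the most suitable choice, and how to apply the chosen measure to the from-scratch implementation, is part of the exercise.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# Score the reference implementation for a range of K values.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette {silhouette_score(X, labels):.3f}")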

Exercise 3 – Hyperparameter Tuning


Use the scikit-learn implementation of Random Forests and pick at least two hyperpa-
rameters to optimize. Choose three datasets from a public repository. Perform systematic hyperparameter tuning to optimize performance for each dataset using a suitable performance measure.
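A minimal sketch of a systematic search over two Random Forest hyperparameters is given below. The chosen hyperparameters, value ranges, dataset, and performance measure are examples only.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

# Grid search over two hyperparameters with 10-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]},
    scoring="accuracy",
    cv=10,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)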

Exercise 4 – Generating Explanations with LIME


Use LIME to generate explanations for one natural language processing task and one image
recognition task.
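The sketch below illustrates a typical use of the lime package for the text part of the exercise. The tiny training set and the scikit-learn pipeline are placeholders for the student's actual natural language processing model and data; the image part follows the same pattern with lime's image explainer.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from lime.lime_text import LimeTextExplainer

# Placeholder model and data; replace with the actual task and classifier.
texts = ["great acting and a great plot", "dull plot and poor acting",
         "great film", "poor film"]
labels = [1, 0, 1, 0]
model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "the plot was predictable but the acting was great",
    model.predict_proba,           # maps raw strings to class probabilities
    num_features=6,
)
print(explanation.as_list())       # (word, weight) pairs behind the prediction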

Exercise 5 – Reinforcement Learning


Description will be available soon.

5 Seminars
Seminar 1 – Experiments in ML
This seminar focuses on empirical machine learning and the use of experiments to explore topics
and to advance the field. The idea is to discuss the motivation for experimentation in computer
science in general and machine learning in particular. The seminar should bring up discussions
on the maturity and quality of published results from machine learning experiments, the need to perform scientific experiments in machine learning, and the overarching question of whether experiments are relevant to computer science as a discipline.
Learning outcomes addressed: i) Demonstrate the ability to plan and conduct machine learning experiments and to describe algorithmic performance and behavior through analysis of experimental results; ii) Demonstrate the ability to evaluate algorithms and algorithm parameter configurations for a concrete task.

Seminar 2 – Explainable AI
This seminar focuses on the area of explainable artificial intelligence (XAI) and, more broadly:
fairness, accountability, explainability, and ethics (FATE) in artificial intelligence and machine
learning. The idea is to discuss the motivation for XAI, including potential trade-offs with
other important factors to consider when implementing AI and machine learning in real-world
applications.
Learning outcomes addressed: i) Demonstrate knowledge of the machine learning area of research; ii) Demonstrate the ability to suggest a suitable machine learning approach for a problem or real-world challenge; iii) Demonstrate the ability to motivate the potential costs and benefits of applying machine learning in a given context.

Seminar 3 – Lifelong and Transfer Learning


Description will be available soon.

Seminar 4 – Machine Learning for Manufacturing


Description will be available soon.

Seminar 5 – The Data-driven Industry and Society


Description will be available soon.
