
HOMEWORK ASSIGNMENT

NAZARBAYEV UNIVERSITY | SCHOOL OF SCIENCE AND TECHNOLOGY



PROJECT 1
In this project, students are required to apply the knowledge of association rules acquired in the
Statistical Methods and Machine Learning class to learn rules from the large dataset provided. Data preprocessing
skills, programming ability, and reporting of the completed work are also evaluated as part of the assignment.

DUE DATE
Thursday, 4th of September

METHOD OF DELIVERY
Assignment deliverables should be submitted via Moodle to the course instructor before the due date.

LEVEL OF COLLABORATION ALLOWED
Collaboration is not allowed on this assignment; each student should complete the assignment individually.

ESTIMATED TIME FOR COMPLETION
20 hours

ADDITIONAL SUPPORT
Please contact the course instructor if you need any assistance or have any concerns about this assignment.

ASSIGNMENT DELIVERABLES
- Matlab (or C++) program written by the student to accomplish the task
- Report describing in detail how the student approached and solved the problem. The association rules
learned from the data must be discussed.

GRADING CRITERIA
- 60% - implementation, functionality and documentation of the work
- 30% - student ranking, as determined by an evaluation program that implements a metric
- 10% - discussion of the limitations of the implemented approach and suggestion of another one with
proper justification



ASSIGNMENT DETAILS
The Computer Age has provided people with an enormous amount of digital information and introduced the concept
of Big Data, which is associated with the difficulties that traditional data processing applications face.
Association rule learning is a popular and widespread machine learning technique that extracts
important and/or interesting relations between elements in large datasets. The concept provides tools to
identify strong rules using different measures, for example support and/or confidence. These strong rules then
describe regularities between variables in transaction datasets, which are very important in applications such
as market basket analysis, web data mining, bioinformatics, etc.

In this assignment you are provided with a large dataset of 989818 transactions. The data comes from Internet
Information Server (IIS) logs of msnbc.com and news-related portions of msn.com for the entire day of
September 28, 1999 (http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data). Each
transaction (row sequence) represents the page views of one user during a 24-hour period. Sequences consist of
numbers that represent web page categories as follows:
1 = Frontpage
2 = News
3 = Tech
4 = Local
5 = Opinion
6 = On Air
7 = Misc
8 = Weather
9 = MSN-News
10 = Health
11 = Living
12 = Business
13 = MSN-Sports
14 = Sports
15 = Summary
16 = BBS
17 = Travel


For instance, the sequence 4,3,5,1,10,1 means that the person visited the Local, Tech, Opinion, Frontpage, Health and
then again Frontpage pages. You are required to implement an association rule algorithm (for example the Apriori
algorithm, but don't forget that there might be better ones) to learn 50 strong rules from the dataset. This will
involve adjusting the support and confidence thresholds and analyzing the results. Rules should show which
web page categories are commonly accessed together, for example:
Frontpage -> News : (support, confidence)
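
As a rough illustration of the two measures, the sketch below parses sequences into sets of distinct categories and
evaluates one rule. It assumes the standard definitions (support = fraction of transactions containing the antecedents
and the consequent together; confidence = that count divided by the number of transactions containing the
antecedents), assumes comma-separated input as in the example sequence above, and treats repeat visits to a category
as a single item. The names parseTransaction, countContaining, evaluateRule and RuleStats are hypothetical helpers,
not part of the assignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using ItemSet = std::set<int>;

// Parse one sequence such as "4,3,5,1,10,1" into the set of distinct
// categories it contains. Assumption: values are separated by commas.
ItemSet parseTransaction(const std::string& line) {
    ItemSet items;
    std::stringstream ss(line);
    std::string token;
    while (std::getline(ss, token, ',')) {
        if (!token.empty()) items.insert(std::stoi(token));
    }
    return items;
}

// Count the transactions that contain every item of `items`.
std::size_t countContaining(const std::vector<ItemSet>& transactions,
                            const ItemSet& items) {
    std::size_t count = 0;
    for (const ItemSet& t : transactions) {
        // std::includes requires sorted ranges; std::set iterates in order.
        if (std::includes(t.begin(), t.end(), items.begin(), items.end())) {
            ++count;
        }
    }
    return count;
}

struct RuleStats {
    double support;
    double confidence;
};

// Evaluate the rule "antecedents -> consequent" over all transactions.
RuleStats evaluateRule(const std::vector<ItemSet>& transactions,
                       const ItemSet& antecedents, int consequent) {
    ItemSet both = antecedents;
    both.insert(consequent);
    const double total = static_cast<double>(transactions.size());
    const double nBoth = static_cast<double>(countContaining(transactions, both));
    const double nAnte = static_cast<double>(countContaining(transactions, antecedents));
    RuleStats stats;
    stats.support = total > 0 ? nBoth / total : 0.0;
    stats.confidence = nAnte > 0 ? nBoth / nAnte : 0.0;
    return stats;
}
```

Calling evaluateRule(transactions, {1}, 2) would evaluate Frontpage -> News. Scanning every transaction for each
candidate rule is only practical for small experiments; an Apriori-style implementation would instead count candidate
itemsets level by level and prune those below the support threshold.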

Your program must generate an output file in txt format with the following values listed on each line,
separated by commas:

Support (float), Confidence (float), Product of Support and Confidence values (float), Consequent (integer), list
of Antecedents (integers)

Note that lines may have different numbers of entries because rules have different numbers of Antecedents. The rules
should be listed in sorted order with the strongest rule first. The strength of a rule is determined by the
product of its support and confidence.
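
One possible way to produce the required ordering and line layout is sketched below. The Rule struct and the functions
sortByStrength, formatRule and writeRules are hypothetical names chosen for illustration, and the default stream
formatting of floats is an assumption, since the assignment does not specify the number of decimal places.

```cpp
#include <algorithm>
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// One learned rule: antecedents -> consequent.
struct Rule {
    double support;
    double confidence;
    int consequent;
    std::vector<int> antecedents;

    // Strength as defined in the assignment: support times confidence.
    double strength() const { return support * confidence; }
};

// Order rules from strongest to weakest; ties keep their original order.
void sortByStrength(std::vector<Rule>& rules) {
    std::stable_sort(rules.begin(), rules.end(),
                     [](const Rule& a, const Rule& b) {
                         return a.strength() > b.strength();
                     });
}

// Build one output line:
// support,confidence,product,consequent,antecedent1,antecedent2,...
std::string formatRule(const Rule& rule) {
    std::ostringstream out;
    out << rule.support << ',' << rule.confidence << ',' << rule.strength()
        << ',' << rule.consequent;
    for (int antecedent : rule.antecedents) {
        out << ',' << antecedent;
    }
    return out.str();
}

// Write all rules, strongest first, one per line, to any output stream
// (for example a std::ofstream opened on the txt file).
void writeRules(std::ostream& os, std::vector<Rule> rules) {
    sortByStrength(rules);
    for (const Rule& rule : rules) {
        os << formatRule(rule) << '\n';
    }
}
```

Passing a std::ofstream opened on the output file to writeRules would produce the txt file; taking a generic
std::ostream keeps the function easy to check against a string stream first.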
