This homework assignment requires students to:
1) Apply knowledge of association rule learning to analyze a large dataset and learn 50 strong rules using an algorithm like Apriori.
2) Write a program to generate an output file listing the support, confidence, and product of support and confidence for each rule.
3) Submit a report discussing their approach, the rules learned from the data, limitations of their method, and a suggested alternative approach.
NAZARBAYEV UNIVERSITY | SCHOOL OF SCIENCE AND TECHNOLOGY
PROJECT 1 In this project, students are required to apply the knowledge of association rules acquired in the Statistical Methods and Machine Learning class to learn rules from the large dataset provided. Data preprocessing skills, programming ability and the reporting of the completed work are also evaluated as part of the assignment.
DUE DATE Thursday, 4th of September
METHOD OF DELIVERY Assignment deliverables should be submitted via Moodle to the course instructor before the due date.
LEVEL OF COLLABORATION ALLOWED Collaboration is not allowed on this assignment; each student should perform the assignment individually.
ESTIMATED TIME FOR COMPLETION 20 hours
ADDITIONAL SUPPORT Please contact the course instructor if you need any assistance or have any concerns about this assignment.
ASSIGNMENT DELIVERABLES
- Matlab (or C++) program written by a student to accomplish the task
- Report which reveals in detail how a student approached the problem and solved it. Association rules learned from the data must be discussed.
GRADING CRITERIA
- 60% - implementation, functionality and documentation of the work
- 30% - based on student ranking provided using an evaluation program implementing a metric
- 10% - discussion of the limitations of the implemented approach and suggestion of another one with proper justification
ASSIGNMENT DETAILS The Computer Age has provided people with an enormous amount of digital information and introduced the concept of Big Data, which is associated with the difficulties that traditional data processing applications face at this scale. Association rule learning is a popular and widespread machine learning technique that allows the extraction of important and/or interesting relations between elements in large datasets. The concept provides tools to identify strong rules using different measures, for example support and/or confidence. These strong rules then describe regularities between variables in transaction datasets, which are very important in applications such as market basket analysis, web data mining, bioinformatics, etc.
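The two measures named above can be illustrated with a minimal sketch. It uses the standard definitions (support of an itemset is the fraction of transactions that contain it; confidence of a rule A -> B is the support of A and B together divided by the support of A). Python is used here only for illustration, since the deliverable itself must be written in Matlab or C++. The function names and the toy transactions are assumptions made up for this example and are not part of the assignment data.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[int], transactions: List[FrozenSet[int]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[int], consequent: Iterable[int],
               transactions: List[FrozenSet[int]]) -> float:
    """support(antecedent and consequent together) / support(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    sup_a = support(a, transactions)
    return support(both, transactions) / sup_a if sup_a > 0 else 0.0


# Toy transactions, invented for illustration only.
toy = [frozenset(s) for s in ({1, 2}, {1, 2, 3}, {1}, {2, 3})]

sup = support({1, 2}, toy)        # {1, 2} appears in 2 of the 4 transactions
conf = confidence({1}, {2}, toy)  # 2 of the 3 transactions with item 1 also contain item 2
print(sup, conf)
```

Scanning every transaction for every query, as done here, is only practical for small examples; a real implementation would count itemsets in bulk.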
In this assignment you are provided with a large dataset of 989818 transactions. The data comes from Internet Information Server (IIS) logs of msnbc.com and news-related portions of msn.com for the entire day of September 28, 1999 (http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data). Each transaction (row sequence) represents the page views of a user during a 24-hour period. Sequences consist of numbers which represent web page categories as follows:
1 = Frontpage
2 = News
3 = Tech
4 = Local
5 = Opinion
6 = On Air
7 = Misc
8 = Weather
9 = MSN-News
10 = Health
11 = Living
12 = Business
13 = MSN-Sports
14 = Sports
15 = Summary
16 = BBS
17 = Travel
For instance, the sequence 4,3,5,1,10,1 means that the person visited the Local, Tech, Opinion, Frontpage, Health and then again Frontpage pages. You are required to implement an association rule algorithm (for example the Apriori algorithm, but don't forget that there might be better ones) to learn 50 strong rules from your dataset. This will involve adjusting the support and confidence thresholds and your own analysis of the results. Rules should depict which web page categories are commonly accessed together, for example Frontpage -> News : (support, confidence)
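As a rough illustration of the Apriori idea mentioned above, the following Python sketch turns each sequence into a set of categories (repeat visits and visit order are dropped, which is an assumption about how transactions should be built), mines frequent itemsets level by level, and derives rules with a single consequent. Python is used only for illustration, since the deliverable must be in Matlab or C++. All function names, the thresholds, the comma-separated input format and every toy sequence except the first are assumptions for this example; the delimiter used in the actual data file should be checked.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[int]
Rule = Tuple[float, float, int, List[int]]  # support, confidence, consequent, antecedents


def to_transaction(sequence: str) -> Itemset:
    """Turn '4,3,5,1,10,1' into the set of visited categories."""
    return frozenset(int(x) for x in sequence.split(","))


def frequent_itemsets(transactions: List[Itemset], min_support: float) -> Dict[Itemset, float]:
    """Level-wise (Apriori-style) search; returns itemset -> support."""
    n = len(transactions)
    counts: Dict[Itemset, int] = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c / n for s, c in counts.items() if c / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def single_consequent_rules(frequent: Dict[Itemset, float], min_confidence: float) -> List[Rule]:
    """Rules of the form antecedents -> one consequent item."""
    rules: List[Rule] = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # The antecedent is frequent too, since it is a subset of a frequent itemset.
            conf = sup / frequent[antecedent]
            if conf >= min_confidence:
                rules.append((sup, conf, consequent, sorted(antecedent)))
    return rules


# The first sequence is the example from the text; the others are invented.
sequences = ["4,3,5,1,10,1", "1,2", "1,2,2", "2"]
transactions = [to_transaction(s) for s in sequences]
freq = frequent_itemsets(transactions, min_support=0.5)
rules = single_consequent_rules(freq, min_confidence=0.6)
for rule in rules:
    print(rule)
```

Counting candidates with a nested loop over all transactions is the simplest option but also the slowest; on 989818 transactions a more careful counting scheme, or a different algorithm altogether, may be worth considering.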
Your program must generate an output file in txt format with the following values listed on each line, separated by commas:
Support (float), Confidence (float), Product of Support and Confidence values (float), Consequent (integer), list of Antecedents (integer)
Note that lines may have different numbers of entries because rules may have different numbers of Antecedents. The rules should be listed in sorted order with the strongest rule first. The strength of a rule is determined by the product of its support and confidence.
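The output format and ordering described above could be produced along the following lines. This is a hedged Python sketch for illustration only (the deliverable must be in Matlab or C++). The function names, the six-decimal float formatting, the file path argument and the toy rules are all assumptions; the assignment does not specify a number of decimals.

```python
from typing import List, Tuple

Rule = Tuple[float, float, int, List[int]]  # support, confidence, consequent, antecedents


def format_rule(sup: float, conf: float, consequent: int, antecedents: List[int]) -> str:
    """One output line: support, confidence, product, consequent, antecedents..."""
    fields = [f"{sup:.6f}", f"{conf:.6f}", f"{sup * conf:.6f}", str(consequent)]
    fields += [str(a) for a in antecedents]
    return ",".join(fields)


def rank_rules(rules: List[Rule], top_n: int = 50) -> List[Rule]:
    """Sort by support * confidence, strongest first, and keep the top_n rules."""
    return sorted(rules, key=lambda r: r[0] * r[1], reverse=True)[:top_n]


def write_rules(rules: List[Rule], path: str, top_n: int = 50) -> None:
    """Write the ranked rules to a text file, one rule per line."""
    with open(path, "w") as f:
        for rule in rank_rules(rules, top_n):
            f.write(format_rule(*rule) + "\n")


# Toy rules, invented for illustration only.
toy_rules: List[Rule] = [
    (0.2, 0.5, 2, [1]),
    (0.4, 0.5, 1, [2, 3]),
    (0.1, 0.9, 3, [1]),
]
ranked = rank_rules(toy_rules)
lines = [format_rule(*r) for r in ranked]
print("\n".join(lines))
```

Each line has a variable length because the antecedents are appended after the four fixed fields, which matches the note about differing numbers of entries per line.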