Professional Documents
Culture Documents
Weka A Study On The Application of Data Mining Methods in The Analysis of Transcripts
Weka A Study On The Application of Data Mining Methods in The Analysis of Transcripts
Weka A Study On The Application of Data Mining Methods in The Analysis of Transcripts
Transcripts
Luis Raunheitte*, Rubens de Camargo*, TAkato Kurihara*, Alan Heitokotter*, Juvenal J. Duarte*
*School of Computer and Informatics (FCI) – Universidade Presibiteriana Mackenzie – São Paulo – SP -
Brasil
m
with a view to search for tacit information [6].
er as
incorporation of knowledge to improve techniques and Through data mining, it is possible to avoid massive
technologies used in the production of goods and data exploitation and to use the knowledge acquired
co
services has caused a major demand for highly by the interpretation of ascertained patterns.
eH w
qualified professionals and, in order to meet that need,
the teaching process must understand and adapt to the The transcript represents a source of information that
o.
profile of the students. The transcript is the most used allows depicting not only the individual performance
rs e
document to measure the performance of a student. Its characteristics of students, but also their profiles, in
ou urc
digital storage combined with data mining addition to providing several details about the course
methodologies can contribute not only to the analysis at issue. Among the several potentials of data mining,
of performances, but also to the identification of it is possible to recognize distinct groups of students,
significant information about student's profiles and distinguishing their difficulties and virtues; to evaluate
o
deficiencies in the structure of a course. This study academic disciplines that concentrate a greater need
shows an example of the application of data mining for efforts aiming at interlinking them; and to set forth
aC s
techniques in transcripts, based on the real Computer right at the beginning the academic disciplines in
vi y re
Sciences course of Universidade Presbiteriana which each student, in average, will face greatest
Mackenzie, and the use of the open source tool challenges, as well as to establish secondary areas in
WEKA. the course through which the student will have more
chances of achieving a successful career.
Introduction
ed d
several countries, mainly within the academic sphere, analysis of transcripts: cluster analysis and
is to handle the difficulties of students during the classification. For that, there were used an open
learning process, which in many cases might result in source tool named WEKA and a sample of actual data
the lack of motivation and even university drop-out. regarding the Computer Sciences course of
sh is
In this sense, better understanding the students and Universidade Presbiteriana Mackenzie.
their characteristics is crucial for the application of
Th
pedagogical techniques with a specific focus, aiming Knowledge Discovery In Database (KDD)
at reaching optimum productivity during the learning
process. In the 80‟s, managers of major organizations started to
be concerned about the volume of stored data and its
Since that up to recent times the storage of transcripts uselessness for their purposes [1]. In tandem, as the
used to be made on paper and their analysis had to be market became increasingly dynamic and competitive,
manually processed, a deeper evaluation on these companies had to quickly make strategic decisions,
documents would be rejected and deemed unfeasible. even though this could imply potential risks [2]. Given
However, the implementation of computer processing this scenario, studies on data exploitation became
technology in several branches has been contributing more intense and, in 1989, the term Knowledge
to the increasing usage of digital data storage in view Discovery in Databases (KDD) was created by [4]
of the savings on material resources and space, as well with reference to the entire process of extraction of
as the ease of handling allowed by this format, which useful information from large data volumes [4]. Ever
also foments environmental sustainability. since, the entire KDD process arouses the interest of
people both within the academic and the corporate
spheres.
https://www.coursehero.com/file/10213786/Weka-A-Study-on-the-application-of-Data-Mining-Methods-in-the-analysis-of-Transcripts/
12 SYSTEMICS, CYBERNETICS AND INFORMATICS VOLUME 10 - NUMBER 3 - YEAR 2012
As opposed to data mining itself, KDD is a complex of Time Series, which operate through supervised
methodology with focus on the production of learning techniques. As to description, the goal is to
consistent and construable knowledge, comprising find construable patterns that describe the input data,
issues such as storage and access, development of based on classic methods such as Cluster Analysis,
efficient algorithms, visualization and interpretation of Discovery of Association Rules and Sequential Pattern
results and the way the user is able to interact with Analysis, which use non-supervised learning
them [4]. techniques [4], [3].
The definition of KDD as a process composed of Classification, as the name suggests, consists basically
sequential steps is quite renowned and, in its essence, of mapping a log back to existing predefined
it comes down to the preparation of data, and mining categories. A powerful but simple classification
and interpretation of the results. technique consists of building decision trees, which is
accomplished through a series of questions carefully
The preparation is an elemental step for the success of developed and focused on the training set [8].
a KDD project, insomuch that 80% of the efforts for
understanding the data are focused on its cleansing When certain classes cannot be previously identified
and preparation [7]. There are three subjects that in a clear fashion, clusters provide a good alternative
provide grounds for the importance of data by identifying items that will naturally group together
preparation: [6]. The analysis of clusters aims at analyzing a set of
m
er as
logs and dividing it so that the values of the elements
Real world data are impure like noise and within one group are more similar inwardly than as
co
manifested as errors which, when provided as an compared to other groups [9]. Clusters oftentimes
eH w
input to a mining algorithm, tend to spend the indicate tacit data patterns.The WEKA (Waikato
results acquired from reality, hiding relevant Environment for Knowledge Analysis) is a widespread
o.
patterns or even compromising the credibility of open source tool that consists in the creation of a
rs e
ascertained patterns.
High performance mining applications require
compilation with machine learning and data pre-
processing algorithms, comprising interfaces for the
ou urc
quality data. operation of input data, statistical validation of
Quality data produce quality patterns. learning schemes and visualization tools.
It is crucial for the representation of the results The WEKA environment gathers the most important
o
acquired in the search for patterns to be simple and data mining methods: Regression, Classification,
aC s
construable. [8] base this need on the fact humans Cluster Analysis, Association Rules and Attribute
have the ability to analyze large amounts of
vi y re
Selection.
information when such information is presented in a
visually organized manner. Among the distinct data Preparation Of The Data Used
analysis methods, the majority derives from the
statistics, and the graphic analysis is more specifically The data used in this research were acquired through
related to the branch of Exploratory Data Analysis
ed d
Data mining represents the most important step of the period between the first semester of 1999 and the
process of Knowledge Discovery in Database (KDD), second semester of 2010.
as in this step the patterns are effectively exploited and
analyzed. Once the data have been organized and The records were found in an Excel spreadsheet, in the
their quality assured, mining is applicable with a view xlsx format, comprising the fields Student, Discipline
to extract their relevant patterns. Code, Name of the Academic discipline, Final
Average, Absences, Year/Semester and Final
Under a simplified perspective, the goals of data Situation, as shown by Figure 1. The attributes
mining might be resumed as prediction or description. Academic discipline Code and Name of the Academic
When the purpose is prediction, the database is discipline represent the discipline to which the grades
analyzed so that unknown values can be found or are related, and Year/Semester is the attribute that
future ones predicted based on previously existing indicates the period in which the student attended the
data. The most common examples of predictive classes. The final averages, absences and final
methods are Classification, Regression and Analysis situation are the target information and the patterns are
analyzed according to these measures.
https://www.coursehero.com/file/10213786/Weka-A-Study-on-the-application-of-Data-Mining-Methods-in-the-analysis-of-Transcripts/
SYSTEMICS, CYBERNETICS AND INFORMATICS VOLUME 10 - NUMBER 3 - YEAR 2012 13
some information relevant for the interpretation of
In addition to the spreadsheet, it was possible to results and not for the mining algorithm.
acquire the description of the curricula on the page of
the university and a definition of the theme areas of Before the tests, the data went through a cleansing and
the academic disciplines from the course coordination. transformation process in which the duplicate lines
were removed and the uncommon values treated, as
The description of the 1999 curriculum acquired from the case of averages 22.22 and 99.99 attributed to
the website of the university is the most complete one, certain records. Furthermore, the data had their format
comprising code, name, stage and the total, theory and changed to a new structure as shown in Figure 2.
laboratory credit hours for each of the academic
disciplines in the curriculum. For the 2004
curriculum, the same information is available, except
for the code of the academic disciplines, which could
be acquired through the Informative Academic
Terminal (known as TIA). For the 2009 curriculum,
the codes of the academic disciplines are also not
present; however it shows all other attributes and the
indication of the new academic disciplines regarding
the previous curriculum. The most relevant data are
m
the code of the academic discipline and the stage,
er as
through which it is possible to associate a discipline to
co
a curriculum and the level that this discipline is taught;
eH w
this information is not available in the original data.
o.
Studente Dicip. Discipline Name Final Absence year Final
Code average % attended situation
rs e
ou urc
FIGURE 2. Part of the data is provided in its original format
o
https://www.coursehero.com/file/10213786/Weka-A-Study-on-the-application-of-Data-Mining-Methods-in-the-analysis-of-Transcripts/
14 SYSTEMICS, CYBERNETICS AND INFORMATICS VOLUME 10 - NUMBER 3 - YEAR 2012
Tests
m
analysis of the results acquired for each of the K
er as
values.
co
eH w
When running the algorithm with the K value set to 2,
an enhanced performance of one of the groups could
o.
be observed, which is shown as blue dots in Figure 3.
rs e
ou urc
FIGURE 4. Results acquired with the analysis of seven
o
clusters.
aC s
Note that cluster 0 integrates most of the dots to the By replacing the distance function of the K-means
ed d
right of the horizontal axis of Figure 3, where the algorithm from Euclidian to Manhattan, using the
value 10 as parameter K, the results indicated
ar stu
of the horizontal axis. Algorithms and Programming Basics, while the logs
of cluster 8, interpreted as having difficulties, show its
Th
When the K value is gradually increased, these aptitude in dealing with the English language.
patterns are preserved and, beginning with K equals 5,
the groups start to present extra peculiarities.
https://www.coursehero.com/file/10213786/Weka-A-Study-on-the-application-of-Data-Mining-Methods-in-the-analysis-of-Transcripts/
SYSTEMICS, CYBERNETICS AND INFORMATICS VOLUME 10 - NUMBER 3 - YEAR 2012 15
accuracy this tree has defined that Average2 is under
the academic discipline “Instrumental English <= 6.8”
and “Ethics and Citizenship I <= 8.1”. Moreover, the
instances of the class WithDifficulty have been placed
in the academic discipline “Instrumental English >
6.8” and “Differential and Integral Calculation I <=
6.1”.
Conclusion
m
Manhattan distance were used for classification, and feasibility of data mining in transcripts, stressing that
er as
gained an extra attribute in which the cluster several patterns appear quantitatively during the day-
associated to the instance is defined.
co
by-day of courses, allowing the application of
eH w
algorithms and the identification of trends, not always
The instances that belong to clusters 4 and 7 were so evident, regarding the characteristics of the students
grouped under the label Average1, and cluster 3 was
o.
and the courses. The method presented here might
renamed to Average2. Clusters 0, 1 and 2 became
rs e
WithAptitude1, WithAptitude2 and WithAptitude3.
function as a base to the strategic development of new
courses, if used as a tool to aid the learning process.
ou urc
Lastly, cluster 8 was once again labeled
WithDifficulty. REFERENCES
When running WEKA‟s J48 algorithm, which is an
1. Amo, S. D. (2004) “Técnicas de Mineração de
o
https://www.coursehero.com/file/10213786/Weka-A-Study-on-the-application-of-Data-Mining-Methods-in-the-analysis-of-Transcripts/
16 SYSTEMICS, CYBERNETICS AND INFORMATICS VOLUME 10 - NUMBER 3 - YEAR 2012
[S.l.]: Addison Wesley, 2005. cap. 3, p. 97_144.
ISBN 0321321367.
9. Wu, X.; Kumar, V. The Top Ten Algorithms in
Data Mining. 1st. ed. [S.l.]: Chapman &
Hall/CRC, 2009. ISBN 1420089641,
9781420089646.
m
er as
co
eH w
o.
rs e
ou urc
o
aC s
vi y re
ed d
ar stu
sh is
Th
https://www.coursehero.com/file/10213786/Weka-A-Study-on-the-application-of-Data-Mining-Methods-in-the-analysis-of-Transcripts/
SYSTEMICS, CYBERNETICS AND INFORMATICS VOLUME 10 - NUMBER 3 - YEAR 2012 17
Copyright of Journal of Systemics, Cybernetics & Informatics is the property of International Institute of
Informatics & Cybernetics, IIIS and its content may not be copied or emailed to multiple sites or posted to a
listserv without the copyright holder's express written permission. However, users may print, download, or
email articles for individual use.
m
er as
co
eH w
o.
rs e
ou urc
o
aC s
vi y re
ed d
ar stu
sh is
Th
https://www.coursehero.com/file/10213786/Weka-A-Study-on-the-application-of-Data-Mining-Methods-in-the-analysis-of-Transcripts/