Table of Contents

1.0 Introduction
2.0 Selection of data files and brief description
2.1 District
2.2 Division
2.3 Area
2.4 Club Number
2.5 Club Name
2.6 Charter Date or Suspend Date
2.7 Distinguished Status
2.8 Club Status
3.0 ETL Process and Data Pre-processing
3.1 Clean up data
3.2 Dropping irrelevant dimension
3.3 Dealing with missing value
3.4 Dealing with outliers
3.5 Generating a new dimension
4.0 Data Analytics
4.1 Question 1
4.2 Question 2
4.3 Question 3
4.4 Question 4
4.5 Question 5
5.0 Conclusion
6.0 Evaluation and Personal Reflection
A
B
C
D
E
7.0 References
8.0 Appendixes
9.0 PowerPoint Slides

1.0 Introduction

Toastmasters International is a non-profit educational organization headquartered in Englewood, Colorado, United States. It helps people from different backgrounds become confident speakers, communicators and leaders through a worldwide network of clubs. It has more than 300,000 members in over 18,800 clubs in about 149 countries. Toastmasters’ mission states that it empowers people to be effective communicators and leaders by increasing their confidence and skills of

The mission of Toastmasters International has two parts: the district mission and the club mission. The mission of the district is to build new clubs and support the clubs in achieving excellent results. The mission of the club is to provide a supportive and positive learning experience that enables members to develop communication and leadership skills that increase self-confidence and personal growth. To sustain itself in the long term, Toastmasters International aims to become the premier provider of dynamic, high-value, experiential communication and leadership skills development.

Toastmasters International was founded by Ralph C. Smedley while he was working for the Young Men’s Christian Association (YMCA) in the United States. On 22 October 1924, a club was set up from a series of courses designed to improve the communication skills of young people. Smedley named it the Toastmasters Club, after the popular term for people who give toasts at banquets and other occasions. By 1930, about 30 clubs had been formed and they expanded into an organization. On 19 December 1932, Toastmasters International was formed as a non-profit organization in California (Toastmasters International, 2022).

2.0 Selection of data files and brief description

The data files that we refer to are the Toastmasters International Club Performance and Club Renewal files. The Club Performance file has 1,862 rows and 29 dimensions, while the Club Renewal file has 1,862 rows and 14 dimensions.

       2.1 District

In the Toastmasters data, the District dimension is stored as an integer, which is the numeric data type for numbers that do not have a fractional component. The Toastmasters organization is subdivided geographically, with each higher level covering a wider geographic scope. Every district has its own population, number of local clubs and officers.

       2.2 Division

The Division dimension uses the character data type. Eleven uppercase letters, such as A, B, C, I and J, are used to tell the divisions apart. Divisions are made up of several areas. There are about 10 divisions, which range in size from a single area to six, seven or more areas.

        2.3 Area

The Area dimension is stored as an integer, which is a number with no decimal or fractional part. The numbers 1 to 7 are used to distinguish the areas, and each area includes three to eight clubs together with the performance of those clubs. Each Area has its own Area Governor, a member of one of the clubs who is nominated by the District Governor to represent the Area.

      2.4 Club Number

The Club Number dimension is stored as an integer, a numerical value without decimals such as 800, 2000 or 23. Every club in Toastmasters has its own name and a unique number. The club number allows people to find and identify the club.

      2.5 Club Name

The Club Name dimension is stored as a string, which can hold a list of characters of any length. It contains the full name of the club or society. Each name is unique and reflects the area, the district or the function of the club, so people can choose a club to join just by looking at a name that already indicates the area, district or goals they want to achieve.

             2.6 Charter Date or Suspend Date

The Charter Date or Suspend Date dimension combines string and date data: it contains a list of characters together with a date. A charter is the grant that allows a club to be established and to operate. Hence, the charter date is the date on which a club was newly established, and the suspend date is the date on which a club was stopped from operating.

           2.7 Distinguished Status

The Distinguished Status dimension uses the character data type, which holds single letters such as P, S and D. Each letter has a specific meaning. P stands for President’s Distinguished Club, a club that has achieved 9 of the 10 goals and meets the membership requirements. S stands for Select Distinguished Club, a club that has achieved 7 of the 10 goals and meets the membership requirements. D stands for Distinguished Club, a club that has achieved 5 of the 10 goals and meets the membership requirements.
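The thresholds above can be written as a small rule. The sketch below is only an illustration in Python, not part of the Rapid Miner process; the function name is hypothetical, and it assumes that "9 of the 10 goals" means "at least 9" (and likewise for 7 and 5).

```python
def distinguished_status(goals_met: int, meets_membership: bool) -> str:
    """Return 'P', 'S', 'D' or '' (no status) for one club.

    Assumption: each threshold is read as "at least N goals".
    """
    if not meets_membership:
        return ""   # membership requirement not met, so no status
    if goals_met >= 9:
        return "P"  # President's Distinguished Club
    if goals_met >= 7:
        return "S"  # Select Distinguished Club
    if goals_met >= 5:
        return "D"  # Distinguished Club
    return ""       # fewer than 5 goals: no status


if __name__ == "__main__":
    for goals in (10, 8, 5, 4):
        print(goals, repr(distinguished_status(goals, True)))
```

A club with no status corresponds to the blank (“?”) values that appear in the Distinguished Status column.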

       2.8 Club Status 

The Club Status dimension is stored as a string, which can hold any combination of characters that appear on a keyboard, such as letters, numbers and symbols. The values of club status include active, ineligible and suspended. Active status reflects how active the members are. Ineligible status means that the members are no longer paying renewal fees and the club remains at this status. Lastly, a club with suspended status has six months to reinstate, starting from the date of suspension.

The first question, provided by A, is to analyse the charter date or suspend date of all clubs. The second question, prepared by B, is to identify the status of clubs based on active, ineligible, low and suspended. The third question, provided by C, is to study the clubs' distinguished status. The fourth question, prepared by D, is to calculate the ratio of the total renewal member rate compared to the total members to date. The fifth question, prepared by E, is to calculate the rate of goals met by all clubs as a percentage.

3.0 ETL Process and Data Pre-processing

      3.1 Clean up data 
Firstly, we selected the data files named ‘club performance’ and ‘club renewal’, which is the first step of cleaning up the data while uploading it to Rapid Miner. This step is also known as data preparation: it means selecting the data that is relevant to the assignment requirements so that the remaining cleaning steps can follow. Next, we insert the ‘join operator’. This operator joins two example sets using one or more attributes of the input example sets as key attributes. In other words, the ‘join operator’ helps the user combine data files that contain a large amount of data. Hence, the data files ‘club performance’ and ‘club renewal’ will be grouped together.

Moving on to the step of cleaning up the data. After inserting the ‘join operator’, we click on the ‘join’ icon, and a parameter named ‘join type’ appears on the right side. In ‘join type’, we select ‘left’. Next, we edit the list under ‘key attributes’, just below ‘join type’, and choose ‘club name’ for both files. This completes the process of cleaning up the data while uploading it to Rapid Miner.
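For readers who want to see what a left join on ‘club name’ does, the sketch below reproduces the idea in plain Python. It is not the Rapid Miner implementation; the function name and the two sample clubs are hypothetical, and it assumes each club name appears at most once in the right-hand file.

```python
def left_join(left_rows, right_rows, key):
    """Keep every left row; add right-hand columns where the key matches."""
    # Assumption: the key is unique in right_rows (last one wins otherwise).
    right_by_key = {row[key]: row for row in right_rows}
    right_cols = {col for row in right_rows for col in row if col != key}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for col in right_cols:
            if col not in merged:
                merged[col] = match.get(col)  # None plays the role of "?"
        joined.append(merged)
    return joined


# Hypothetical sample rows, not taken from the real data files.
performance = [
    {"Club Name": "Alpha Club", "Goals Met": 7},
    {"Club Name": "Beta Club", "Goals Met": 3},
]
renewal = [{"Club Name": "Alpha Club", "Total Renewal": 20}]

joined = left_join(performance, renewal, "Club Name")
```

With a left join, every club from the first file is kept even when the second file has no matching row, which is why unmatched clubs end up with missing values.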

      3.2 Dropping irrelevant dimension

The step after uploading the relevant files is dropping irrelevant dimensions. This is an important step because it keeps only the attributes that are needed for the analysis, which makes the result look tidy. The operator used for it can also select attributes with a regular expression.

First, we drag the file that is relevant to our topic, the Toastmasters Club Renewal file, into the ‘Process’ panel. To drop the irrelevant dimensions, we use ‘Select Attributes’: search for ‘Select Attributes’ in the operators panel and drag it into the ‘Process’ panel. Then connect the Toastmasters Club Renewal file to ‘Select Attributes’.

In the parameters, set the attribute filter type to subset. Two lists are then shown: ‘Attributes’ on the left side and ‘Selected Attributes’ on the right side. The dimensions left in the ‘Attributes’ list will be removed, while the dimensions in the ‘Selected Attributes’ list will be retained. The dimensions that we selected were Area, Club Number, Club Name, Late Renewal, Oct. Renewal, Apr. Renewal, Total Renewal, Distinguished Status and Charter Date/Suspend Date. Connect the output port of ‘Select Attributes’ to the res port. Then apply and run. The result is shown in the result view.

The same steps apply to the Toastmasters Club Performance file. The only difference is in the attributes chosen in ‘Select Attributes’. The dimensions that we selected for Club Performance were District, Division, Area, Club Number, Club Name, Club Status, Member Base, Active Members, Goals Met, Members dues on time Oct, Members dues on time April and Club Distinguished Status.
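Selecting a subset of attributes amounts to keeping only the named columns of each row. A minimal Python sketch of that idea follows; the function name and the sample row (including the ‘Extra Column’ attribute) are hypothetical and are not taken from the data files.

```python
def select_attributes(rows, keep):
    """Return copies of the rows that contain only the columns in `keep`."""
    return [{col: row[col] for col in keep if col in row} for row in rows]


# Hypothetical row with one column we do not need.
rows = [{"Club Name": "Alpha Club", "Area": 3, "Extra Column": "x"}]
tidy = select_attributes(rows, ["Club Name", "Area"])
```

Columns that are not listed in `keep` are dropped, just like the dimensions left in the ‘Attributes’ list.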

      3.3 Dealing with missing value

Both files, club performance and club renewal, contain missing values. The missing values appear in “Distinguished club” and “Charter Date or Suspend Date”. To find them, we first import the club performance and club renewal files from the repository panel and save them under data. In the club performance file, the missing values under Club Distinguished Status are shown with the symbol “?” in some rows. The symbol “?” represents inactive clubs under Club Distinguished Status. The data shows a total of 775 clubs that did not achieve the minimum performance status. Among the clubs that did achieve a status, 835 achieved the President’s status, followed by 146 that achieved Distinguished status and 106 that achieved Select status.
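As a quick check, the four counts reported above should add up to the 1,862 rows in the file. The sketch below does that arithmetic and also shows one way such a tally could be produced outside Rapid Miner; the helper name and the five sample values are hypothetical.

```python
from collections import Counter


def tally_status(values, missing_marker="?"):
    """Split a status column into a missing count and per-status counts."""
    counts = Counter(values)
    missing = counts.pop(missing_marker, 0)
    return missing, dict(counts)


# Counts as reported in the text for Club Distinguished Status.
reported = {"missing": 775, "P": 835, "D": 146, "S": 106}
total_reported = sum(reported.values())

# Hypothetical mini-column to show the helper in use.
missing, by_status = tally_status(["P", "?", "D", "P", "?"])
```

Here `total_reported` equals 1,862, which matches the number of rows in the Club Performance file.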

Next, the club renewal file also has missing values under Distinguished club, shown with the symbol “?” in some rows. As before, the symbol “?” represents inactive clubs under Distinguished clubs. The club renewal file also has missing values under Charter Date/Suspend Date. Here the symbol “?” means that neither a charter date (the date on which a club was newly established) nor a suspend date (the date on which a club was stopped from operating) is recorded. Based on the Charter Date/Suspend Date data, there are 1,539 missing values and 323 values with a charter or suspend date. Of these, there is 1 charter date for the year 2019, 80 suspend dates for the year 2020 and 242 values with a mixture of charter and suspend dates for both years 2019 and 2020.

      3.4 Dealing with outliers

According to the club performance and club renewal files, there are no outliers in either of the two files. An outlier is an abnormal value that lies very far away from the other values in the sample (Lemonaki, 2021). For example, in the district column all values are very constant, at around 100, and in the division column we can see only A and B and no other value. The area column consists only of the digits 1, 2 and 3, so there is no large value in this column. The club number and club name, on the other hand, are presented in text format and not as digits. Furthermore, columns such as Late Renewal, Oct. Renewal and Apr. Renewal show values in a constant range between 1 and 99. This shows that there are no missing values or large deviations in the data, and hence we can conclude that there are no outliers in the data files. There is no need for us to deal with outliers since the data in this spreadsheet is very consistent and normal.
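The conclusion above is based on inspecting the columns by eye. If a numeric check is wanted, one common option is the interquartile range (IQR) rule, which is a technique we did not use in Rapid Miner and is shown here only as a sketch. The function name, the 1.5 multiplier and the sample numbers are assumptions for illustration.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (the IQR rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical renewal counts; 95 lies far from the rest.
sample = [10, 12, 11, 13, 12, 11, 95]
flagged = find_outliers(sample)
```

A column where this function returns an empty list would support the statement that the column has no outliers.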

      3.5 Generating a new dimension for analytics purpose

The fifth step is generating a new dimension for analytics purposes. To generate the new dimension, ‘Generate Attributes’ is used to work out our objective, which is the rate of April renewal against the member base. The attributes used to measure the rate come from different data files, namely club performance and club renewal. We therefore first choose suitable attributes with ‘Select Attributes’ to make the data clean and tidy. We keep the necessary attributes, such as club name, club ID, April renewal, October renewal, Total renewal and Member base, and drop the irrelevant ones.

The function is April Renewal divided by Member Base, times 100%. ‘Generate Attributes’ uses it to calculate the rate of April renewal compared to the member base as a percentage. After applying the function [(April renewal/Member base) x 100%] in ‘Generate Attributes’, a new dimension is formed that lists the April renewal percentage of every club (Appendix 7). In the result, another column has been added for the April renewal percentage. In conclusion, generating a new dimension should be done after cleaning up the data, dropping unnecessary dimensions, and checking missing values and outliers. Once this data preparation, or ETL processing, is complete, the further step of generating a new dimension can be done (Appendix 8).
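The function [(April renewal/Member base) x 100%] can be sketched in Python as below. This is an illustration only; the function name is hypothetical, and returning no value when the member base is zero or missing is our assumption, since the text does not say how such clubs are handled.

```python
def april_renewal_rate(april_renewal, member_base):
    """Return April renewals as a percentage of the member base."""
    if not member_base:
        return None  # assumption: no rate when the member base is 0 or missing
    return april_renewal / member_base * 100


# Hypothetical club: 10 April renewals out of a member base of 20.
rate = april_renewal_rate(10, 20)
```

For this hypothetical club the rate is 50.0, which would appear in the new percentage column.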

4.0 Data Analytics

      4.1 Question 1

First Question: To analyse the charter date or suspend date of all clubs. The dates are recorded in the Charter Date/Suspend Date field in the Club Renewal data file, which contains a Charter Date, a Suspend Date or a blank that represents not granted.

To answer the question, we analyse the missing values of the charter date or suspend date of all clubs. Rapid Miner shows that the charter date or suspend date column has many missing values and otherwise contains either a charter with a date or a suspend with a date. To analyse the data file that joins the two files, club performance and club renewal, we cleaned up the data files and selected the necessary attributes. The dimensions or attributes that we used are Club Name, Club ID, Division, Area and Charter Date or Suspended Date.

To solve the problem of missing values in the Charter Date or Suspended Date column, we used ‘Generate Attributes’ to work out which clubs are granted and which are not. A club with a charter date or suspended date is granted, and a club with a missing value is not granted. Therefore, in ‘Generate Attributes’ we entered the function shown in Appendix 9, which is “[if(missing(CharterDateorSuspendDate),“Not granted”,“Granted”)]”, and used “Grant” as the attribute name. After applying ‘Generate Attributes’, we click the “Run” button to generate the result. The result in Appendix 10 clearly shows that a missing value with the “?” symbol is not granted, while a charter with a date or a suspend with a date is granted. With this new dimension, the organization can easily tell granted clubs from not granted clubs and so manage the clubs more easily.
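The logic of the “Grant” attribute is a simple missing-value test. The Python sketch below mirrors it for a single value; the function name is hypothetical, and the sample date string is made up because the exact format of the field is not shown here.

```python
MISSING = "?"  # how a missing value is displayed in the result view


def grant_flag(charter_or_suspend_date):
    """Return 'Not granted' for a missing value, otherwise 'Granted'."""
    if charter_or_suspend_date in (None, "", MISSING):
        return "Not granted"
    return "Granted"


# Hypothetical values; the real date format may differ.
flags = [grant_flag(v) for v in ["Charter 07/01/19", "?", None]]
```

Any non-missing value counts as granted, whether it is a charter date or a suspend date, which matches the expression used in ‘Generate Attributes’.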

     4.2 Question 2

Second Question: To identify the status of clubs based on active, ineligible, low, and suspended.

First and foremost, we import the data file into the Rapidminer software to start the identification process. We do this by pressing the ‘import data’ button and selecting the files that we want to import. Based on the question I set, I have to import the data file named ‘club performance’ so that I have the related data, such as club status, to continue with the identification process. After the import, the data appears in the ‘local repository’ under ‘repository’. The following step is to drag the ‘club performance’ data file into the ‘process’.

Next, we type ‘map’ in the search box of the ‘operators’ panel. The ‘map’ operator can be used to replace nominal values. For example, we can use it to replace the word ‘ineligible’ with the word ‘active’. Since the function of ‘map’ is to replace one value with another, it is suitable for my assignment question. Once we have the ‘map operator’, we connect the ‘club performance’ data file to it.

Before we type the replacement values, we need to select the ‘attribute filter type’ and the ‘attribute’. This is because one use of the operator can do mappings for attributes of only one type. A single mapping can be specified using the parameters replace what and replace by, as in the Replace operator. Thus, for ‘attribute filter type’ we select ‘single’, and for ‘attribute’ we select ‘club status’. We then key in the old values that we want to replace and the new values that we want to replace them with. For the old values, I typed ‘ineligible’ and ‘low’. For the new values, I keyed in ‘suspended’ and ‘active’ respectively. The reason is that, based on my assignment question, I need to reduce the club status from its 4 initial types to 2 categories.

After applying the old and new values, we just need to press the ‘play’ icon to run the process. Once the process has completed successfully, you can open the visualizations, which give a bigger version of the graph, to see it clearly. As a result, only two types of ‘club status’ appear on the graph.
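The replacement described above is a lookup from old values to new values. A minimal Python sketch of the same mapping is given below; the names are hypothetical, and lower-casing the input is an assumption made so that ‘Low’ and ‘low’ are treated alike.

```python
# Old value -> new value, as keyed into the operator.
STATUS_MAP = {"ineligible": "suspended", "low": "active"}


def map_status(status):
    """Replace 'ineligible' and 'low'; leave every other status unchanged."""
    key = status.strip().lower()  # assumption: matching ignores letter case
    return STATUS_MAP.get(key, key)


# Hypothetical column containing the four original status types.
mapped = [map_status(s) for s in ["Active", "Ineligible", "Low", "Suspended"]]
remaining_types = sorted(set(mapped))
```

After the mapping only two categories remain, which is what the graph of ‘club status’ shows.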

     4.3 Question 3

Third Question: To study the club's distinguished status. The status is recorded in the Distinguished Status field. If the Distinguished Status is blank, the club did not achieve any status.

The aim of this question is to find out which clubs in the data file fall under Achieved and which under Not Achieved. Firstly, we import the club performance file from the repository panel and save it under data. When the club performance file is opened, the Distinguished Status is the last of the columns. Before the process, a number of rows show the symbol “?”, which represents inactive clubs under the Distinguished Status.

Next, to start the process, we drag the club performance file to the process panel. Then we search for “select attributes” in the operators panel, drag it to the process panel beside the club performance file and connect the two. For the attribute filter type, we choose subset and pick the attributes. As shown in Appendix 15, eight attributes are chosen.

After the filtering is complete, we search for “generate attributes” in the operators panel and connect it after the club performance file and “select attributes”. Then, in the edit list, we name the attribute Distinguished Status and enter the formula in the expression box: if(missing(Club Distinguished Status), “Not Achieved”, “Achieved”). The message “expression is syntactically correct” appears, which means the function is correct, so we press apply and run the process. As a result, the new attribute called “Distinguished Status” shows “Not Achieved” beside the rows that contain “?”.

4.4 Question 4

Fourth Question: To calculate the rate of late renewal compared to the total renewal.

To find the ratio between late renewal and total renewal members, the first step is to import the relevant data, the club renewal file, from our device to the local repository. We can then drag the club renewal file from the local repository to the Process. The Process is a collection of Operators that work together to change and analyse data.

After that, we search for Select Attributes in Operators. Select Attributes can also choose attributes by a regular expression. Here, we connect the club renewal file to Select Attributes. In the parameters, we select ‘subset’ as the attribute filter type, which lets us choose from the list of available attributes. We select Club Name, Club Number, Late Renewal and Total Renewal.

Next, drag Generate Attributes from Operators and connect it to the previous operator, Select Attributes. After clicking the edit list, we can insert the attribute name and function expression for the new attribute. In the parameters of the Generate Attributes operator, we enter the new attribute name, Late Renewal Rate (%), and the function expression [Late Renewal]/[Total Renewal]*100. After applying the attribute name and function expression, connect Generate Attributes to the result port and press the run button.

The results are shown in the Result View. We can see a new dimension, the late renewal rate, in the last column. The result contains 1,862 examples. The late renewal rate of every club has been calculated by the function expression. For example, Chittagong Speakers Club (CSC) has a late renewal rate of 100%, while WNS Chennai has 50%. As a result, we can easily generate the data and significantly speed up data exploration by using this tool.

       4.5 Question 5

Fifth Question: To calculate the rate of goals met by all clubs that achieved 5 or more.

First and foremost, we import the required files into the process section. We can do this by clicking import data under the repository or by dragging and dropping the data directly into the process section. Once the required files are inside the process section, we move to the panel on the bottom left named operators. We drag ‘select attributes’ into the process section and choose the attributes that we want to include in the final result. In this case we choose club name, club number, district, division and goals met. The reason for adding information other than goals met alone is to give us a better understanding of what each row refers to when the result comes out.

After we have gathered and filtered the attributes we want, we add ‘generate attributes’ from the operators and insert our function expression. Since we want to know which clubs met 5 or more goals, we name the attribute ‘Goals Met=>5’ and insert the expression “if([Goals Met]>=5, “Yes”, “No”)”. As the last step after inserting the function expression, we click apply and run, and the result is displayed right away.
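The flag and the resulting rate can be sketched in Python as follows. This is only an illustration of the expression above; the function names and the four sample values are hypothetical, and expressing the rate as the percentage of clubs flagged “Yes” is our reading of the question.

```python
def goals_met_flag(goals_met, threshold=5):
    """Return 'Yes' when a club met at least `threshold` goals, else 'No'."""
    return "Yes" if goals_met >= threshold else "No"


def rate_meeting_threshold(goals_list, threshold=5):
    """Percentage of clubs that met at least `threshold` goals."""
    if not goals_list:
        return None  # no clubs, so no rate
    hits = sum(1 for g in goals_list if g >= threshold)
    return hits / len(goals_list) * 100


# Hypothetical goals-met values for four clubs.
goals = [5, 9, 3, 0]
flags = [goals_met_flag(g) for g in goals]
rate = rate_meeting_threshold(goals)
```

With these hypothetical values, two of the four clubs are flagged “Yes”, giving a rate of 50.0.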

5.0 Conclusion
In conclusion, I think that Rapid Miner is a very useful tool because it reduces the workload in the workplace, especially when we need to classify data files and filter out what is important to us. For example, if we have spreadsheet data that consists of thousands of records, such as customer age, gender, address and so on, it is very difficult to locate only the information that matters. Such data includes a lot of unnecessary information, and we do not have time to scroll through it one by one, especially when we are handling more than 10,000 records. In this case, Rapid Miner comes in handy, as it not only helps us filter the data but also performs data analytics. It speeds up the process of finding data, and functions such as Select Attributes help us focus only on the data that is important to us. This is very useful because we are now in the midst of the big data era, in which information is collected almost every second but there is often no proper way to handle it.

Furthermore, Rapid Miner is also very handy for analysing multiple types of data, such as customer data patterns including demographic, behavioural, geographic and psychographic segmentation. Rapid Miner is also very user-friendly and suits many groups of people. It is simple to operate, especially for first-time users, because it has a built-in step-by-step tutorial to help us get used to it, and it has a community with 500,000 active members where we can browse useful threads, tips or tricks and even ask questions if we do not know how to operate it. Rapid Miner suits almost all types of industry, including services, manufacturing, construction, automotive and banking. For your information, we need to sign up for an account before we can use Rapid Miner, and it is available to download for both the Mac and Windows operating systems.

6.0 Evaluation and Personal Reflection

Name: A

Evaluation of assignment

In my opinion, the strength of this assignment is that it develops our skills in managing separate data files, such as club performance and club renewal, with Rapid Miner, and gives us a better understanding of how to use Rapid Miner. The assignment requires us to process the data files so that relevant information can be found easily. However, the weakness of this assignment is that most of the group members or students are not familiar with the Rapid Miner software because of a lack of experience and skills. Some of the information requires different operators and parameters, and the students are very concerned about using them in the correct way. Another reason is that there were too few practical classes, which left the students without enough practice in using Rapid Miner.

Personal Reflection

During the assignment, I faced some problems that took me a very long time to solve, such as a lack of relevant information, a lack of practice and technical problems. Firstly, relevant information was lacking because some parts were difficult to find in the learning material, and even when they could be found on the internet, they were difficult to understand. It took a long time to understand the formulas or functions used in Generate Attributes. Secondly, I lacked practice because there were too few practical classes, and I would like to learn more skills and knowledge in using Rapid Miner. Lastly, the technical problem was my biggest concern every time I opened Rapid Miner: it caused my computer to lag, so I could not perform my tasks as fast as possible.

Name: B

Our assignment involves data rearrangement, data grouping, data analysis and so on. Hence, the software named Rapidminer is needed to help us complete the assignment. Briefly, Rapidminer is a data science and machine learning platform for data mining. If a company has to find useful information in a huge amount of data, the data mining function in Rapidminer is what it needs. In other words, data mining is used to turn data that initially has no meaning into information, and then the
information becomes knowledge. Data Mining, also known as Knowledge-Discovery-in-

In my opinion, this assignment was very interesting, and it taught me a lot. First of all, it required the software Rapidminer to complete the task, and this software is really well designed for users nowadays. Rapidminer is really fast at reading all kinds of databases. It can also perform all kinds of transformations, calculations, dates, percentages, joins and filters without coding. Hence, we as users can work with several different databases, and this makes our life a lot easier. Most importantly, users can teach themselves online because this software is on-trend and popular, so many learning videos have been uploaded to the internet.

Nevertheless, there were still weaknesses throughout the assignment process. With the online class policy, students staying and studying at home ran into a lot of problems. The most serious issue for some students was the Rapidminer software itself, which we needed most of the time to do the assignment. The software is a good invention, yet when the lecturer is not together with the students, students may miss some important parts, and hence the process gets delayed. Also, sometimes when we followed the lecturer’s steps while practising, the software suddenly ran into bugs or technical issues and caused the student to miss the lesson.

Name: C


In my opinion, the strength of this assignment is that it gives us a better understanding of how to use RapidMiner to process data. The assignment focuses on processing data, and there are a few steps that we need to follow in order to complete the process. It requires some operators to process the data and to do the data collection. Data cleaning is needed when gathering information to make the process convenient. However, there are weaknesses in using RapidMiner because we lack experience with it. When processing the data, the result that the process generates can be correct or wrong because of this lack of experience. Even though notes are provided on how to do the process, it is still hard to do the work in RapidMiner without a good enough understanding of the whole RapidMiner system.

Personal reflection

I had problems with the analysis part of the assignment because I was not familiar with RapidMiner. I spent a lot of time searching for how to use RapidMiner in order to complete my task. I had difficulty carrying out the data processing steps and had to keep repeating the same thing many times because I was not getting the correct data for my assignment. I was not confident whether the result that I got from processing the data was correct or not. Moreover, I could not connect what I learned in the practical class with my assignment, as the data was not the same and the formulas used in the practical class were different from the ones I needed for the assignment.

To solve these problems, I did a lot of research on how to use RapidMiner and watched a lot of videos on how to process the data correctly. I also watched all the practical class videos again to get used to RapidMiner. I attended all my practical classes, put my full focus on how the teacher taught RapidMiner and screen-recorded all the steps in RapidMiner. Besides that, I also met my friend on Google Meet to discuss how to use RapidMiner to solve the problems. While using RapidMiner, I learnt how to use the correct operator to process the data and the correct formula to get the right data. I can say that RapidMiner was very helpful for me because it helped me learn how to solve the error of not being able to generate the data.

Name: D

Evaluation of Assignment

         Overall, we chose the dimensions most relevant to our topic to evaluate this task.
These dimensions came from the Toastmasters Club Performance and Club Renewal. Although I
had already learned some functions that can be used in my assignment and practical work, I
am still not familiar with some of the parts. Besides, RapidMiner Studio takes a long time
to log in, and it becomes slow when it takes too much

Personal Reflection

         RapidMiner Studio was a bit hard for me to operate at the beginning. Yet, after
exploring it and being taught by my friends, I started to understand the concept of data
mining and was able to use the functions in RapidMiner. The application is easy to use, and
it was a great tool for me because I do not have strong programming experience. I like the
‘Generate Attributes’ operator since it allows me to handle my workflow without having to
write code. When a function expression is wrong, ‘Generate Attributes’ reports an expression
evaluation error and suggests what type of symbol or number can be used in the expression.
Also, I can calculate the data I need straight away by clicking apply and run. I really like
dragging and dropping operators related to my work to generate models. This saves me time
because I do not need to build them myself. Thus, I learned a lot of knowledge and skills
that I had never been exposed to before, and they may be useful for me in the future.

Name: E
Evaluation of Assignment
After completing this assignment, I have definitely gained extra knowledge of data analysis.
I understand the importance of handling data by processing it in RapidMiner. Although I
sometimes got errors when using RapidMiner, I was able to solve them after some trial and
error. In addition, since RapidMiner is a large piece of software, some of our computers
could not run it smoothly when handling the data. This made it difficult for us to complete
the assignment because of the lag. However, using RapidMiner to complete this assignment
gave us a whole new learning experience, as we learned not only from the lecture notes but
also from hands-on experience, which is really effective.
Personal Reflection
Personally, I think I need to familiarise myself more with operating RapidMiner. This is
because it is my first time using this software, and there are a lot of things that I am
still not used to. However, I think it is just a matter of time, because after rewatching
the videos on how to operate RapidMiner a few times, I was able to operate it without any
issues. Moreover, I am very grateful to all my groupmates, who were willing to help when I
did not know how to use RapidMiner. I like this subject because it combines practical
tutorials and lectures, so we get hands-on experience while understanding the theory. Last
but not least, I think using RapidMiner was overall a very good experience because I have
learned something that will be very useful for my future.

7.0 References
1. Lemonaki, D 2021, ‘What is an Outlier?’, viewed on 26 February 2022.
2. Monica, S, Angeles, W and Angeles, G 2022, ‘Toastmasters Organization and
Structure’, viewed on 26 February 2022.
3. Toastmasters International 2022, ‘Toastmasters International’, viewed on 26 February
2022.

8.0 Appendixes
Appendix 1: Clean up the data files which in part

Appendix 2: Result of Part A

Appendix 3: Process of Part B

Appendix 4: Result of Part B

Appendix 5: Missing value of Part C

Appendix 6: Dealing with outliers in Part D

Appendix 7: Process of generating new dimension in Part E

Appendix 8: Result of Part E

Appendix 9: Process of the first question

Appendix 10: Result of the first question

Appendix 11: Process of the second question

Appendix 12: Result of the second question

Appendix 13: Graph of the second question

Appendix 14: Process of the third question

Appendix 15: Result of the third question

Appendix 16: Result of the fourth question

Appendix 17: Process of the fifth question

Appendix 18: Result of the fifth question


