
CHAPTER I

1. Measurement
1.1 The Definition of Measurement

E. L. Thorndike stated that anything that exists in some quantity is capable of being measured (Thorndike, 2016, p. 4). Quoting the definition provided by James M. Bradfield, measurement "is a process of assigning symbols to the dimensions of a phenomenon in order to characterize the status of the phenomenon as precisely as possible" (Adom et al., 2020, p. 111).1

Measurement is the process of assigning numbers, that is, of gathering quantitative data with one or more instruments, in order to determine the degree to which students possess certain characteristics. According to Cangelosi (1995),2 measurement is the process of collecting data through empirical observation in order to obtain information relevant to predetermined goals.

1.2 Purpose of Measurement

The purposes of measurement can be categorized as serving quality, monitoring, safety, fit (design, assembly), and problem solving. Measurement sometimes serves several of these purposes at once: for example, a lab technician may measure the concentration of potassium in drinking water at a bottling plant in order to monitor the production process for both quality control and safety. The purpose mentioned most frequently is quality, followed by monitoring, making something fit, and solving problems; safety is mentioned much less frequently.

1.3 Types of Measurement

1
Adom, D., Mensah, J. A., & Dake, D. A. (2020). Test, measurement, and evaluation: Understanding and use of the concepts in education. International Journal of
Evaluation
2
Cangelosi, James S. (1995). Merancang Tes untuk Menilai Prestasi Siswa [Designing Tests to Assess Student Achievement]. Bandung: ITB
In statistics, there are four scales of data measurement: nominal, ordinal, interval, and ratio (My Market Research Methods, 2020).3 These are simply ways to sub-categorize different types of data. The topic is usually discussed in the context of academic teaching and less often in the "real world." If you are brushing up on this concept for a statistics test, thank the psychologist Stanley Stevens for coming up with these terms.

These four data measurement scales (nominal, ordinal, interval, and ratio) are best understood with examples, as you'll see below.

1. Nominal
Let's start with the easiest one to understand. Nominal scales are used for labeling variables, without any quantitative value. "Nominal" scales could simply be called "labels." Examples include gender, hair colour, and place of residence. Notice that all such categories are mutually exclusive (no overlap) and that none of them carry any numerical significance. A good way to remember all of this is that "nominal" sounds a lot like "name," and nominal scales are kind of like "names" or labels.

Note: a sub-type of nominal scale with only two categories (e.g. male/female) is called “dichotomous.” If you are a student, you can use
that to impress your teacher.
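To make the idea concrete, here is a minimal sketch (using hypothetical survey data, not an example from the text) of what analysis of a nominal variable looks like in Python: counting category frequencies and reporting the mode, since those are the only meaningful summaries.

```python
from collections import Counter

# Hypothetical nominal data: eye colours recorded in a survey.
# Only equality checks and counts are meaningful; the labels have no order.
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(eye_colours)
mode_label, mode_count = counts.most_common(1)[0]

print(mode_label, mode_count)  # brown 3
```

Sorting or averaging these labels would be meaningless, which is exactly what distinguishes a nominal scale.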
2. Ordinal
With ordinal scales, the order of the values is what's important and significant, but the differences between them are not really known. Consider a four-point satisfaction rating. We know that a #4 is better than a #3 or a #2, but we don't know, and cannot quantify, how much better it is. For example, is the difference between "OK" and "Unhappy" the same as the difference between "Very Happy" and "Happy"? We can't say.

Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort, etc.
3
My Market Research Methods. In Data Analysis, 2020

"Ordinal" is easy to remember because it sounds like "order," and that's the key to remember with ordinal scales: it is the order that matters, but that's all you really get from them.

Advanced note: the best way to determine central tendency on a set of ordinal data is to use the mode or median; a purist will tell you that the mean cannot be defined for an ordinal set.
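That advanced note can be sketched in code. Assuming a hypothetical 5-point ordinal rating coded 1 to 5, the mode and median are defensible summaries, while taking the mean would implicitly treat the codes as interval-scaled:

```python
import statistics

# Hypothetical satisfaction ratings coded 1 (Very Unhappy) .. 5 (Very Happy).
# The order of the codes matters, but the gaps between them are unknown.
ratings = [1, 2, 2, 3, 4, 4, 4, 5]

print(statistics.median(ratings))  # 3.5
print(statistics.mode(ratings))    # 4
```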

3. Interval
Interval scales are numeric scales in which we know both the order and the exact differences between the values. The classic example of
an interval scale is Celsius temperature because the difference between each value is the same. For example, the difference between 60
and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees.

Interval scales are nice because the realm of statistical analysis on these data sets opens up. For example, central tendency can be
measured by mode, median, or mean; standard deviation can also be calculated.

Like the others, you can remember the key points of an “interval scale” pretty easily. “Interval” itself means “space in between,” which is
the important thing to remember–interval scales not only tell us about order, but also about the value between each item.

Here's the problem with interval scales: they don't have a "true zero." For example, there is no such thing as "no temperature," at least not with Celsius. On an interval scale, zero doesn't mean the absence of value; it is just another number on the scale, like 0 degrees Celsius. Negative numbers also have meaning. Without a true zero, it is impossible to compute ratios. With interval data, we can add and subtract, but we cannot multiply or divide.

Confused? Ok, consider this: 10 degrees C + 10 degrees C = 20 degrees C. No problem there. 20 degrees C is not twice as hot as 10
degrees C, however, because there is no such thing as “no temperature” when it comes to the Celsius scale. When converted to
Fahrenheit, it’s clear: 10C=50F and 20C=68F, which is clearly not twice as hot. I hope that makes sense. Bottom line, interval scales are
great, but we cannot calculate ratios, which brings us to our last measurement scale…
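The Celsius arithmetic above can be checked directly. A small sketch (the temperatures are chosen for illustration): differences survive conversion between interval scales, but a meaningful ratio only appears once we move to Kelvin, which has a true zero:

```python
# Celsius is interval-scaled: differences are meaningful, ratios are not.
def c_to_f(c):
    return c * 9 / 5 + 32

def c_to_k(c):
    return c + 273.15  # Kelvin has a true zero, so ratios become meaningful

print(c_to_f(10), c_to_f(20))   # 50.0 68.0 -- clearly not "twice as hot"
print(c_to_k(20) / c_to_k(10))  # roughly 1.035, the physically meaningful ratio
```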

4. Ratio
Ratio scales are the ultimate nirvana when it comes to data measurement scales because they tell us about the order, they tell us the exact
value between units, AND they also have an absolute zero–which allows for a wide range of both descriptive and inferential statistics to
be applied. At the risk of repeating myself, everything above about interval data applies to ratio scales, plus ratio scales have a clear
definition of zero. Good examples of ratio variables include height, weight, and duration.

Ratio scales provide a wealth of possibilities for statistical analysis. These variables can be meaningfully added, subtracted, multiplied, and divided (hence "ratio"). Central tendency can be measured by mode, median, or mean, and measures of dispersion, such as the standard deviation and the coefficient of variation, can also be calculated from ratio scales.
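As an illustration, here is a short sketch (with made-up height data) of the extra statistics a ratio scale licenses, such as the coefficient of variation, which divides one ratio-scale quantity by another:

```python
import statistics

# Hypothetical heights in cm: a ratio variable with a true zero.
heights = [150.0, 160.0, 170.0, 180.0]

mean = statistics.mean(heights)
sd = statistics.stdev(heights)  # sample standard deviation
cv = sd / mean                  # coefficient of variation (only valid with a true zero)

print(round(mean, 1), round(cv, 3))  # 165.0 0.078
```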

Example of ratio scales: a clinical measuring device, such as a physician's scale with a height rod, provides two examples of ratio variables (height and weight).

In summary, nominal variables are used to "name" or label a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the ultimate: order, interval values, and the ability to calculate ratios, since a "true zero" can be defined.

1.4 Models of Measurement

One measurement model that we can use is the Rasch model. The Rasch model, named after Georg Rasch, is a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or responses to a questionnaire, as a function of the trade-off between (a) the respondent's abilities, attitudes, or personality traits and (b) the difficulty of the items.[1] For example, it can be used to estimate a student's reading ability, or the extremity of a person's attitude towards capital punishment, from questionnaire responses.4

In addition to psychometrics and educational research, the Rasch model and its extensions are used in other fields, including the health professions,[2] agriculture,[3] and market research,[4] because of their general applicability.[5] The mathematical theory underlying Rasch models is a special case of item response theory and, more generally, a special case of a generalized linear model. However, there are important differences in the interpretation of the model parameters and in their philosophical implications[6] that separate proponents of the Rasch model from the broader item response modeling tradition. A central aspect of this divide concerns the role of specific objectivity,[7] a defining property of the Rasch model according to Georg Rasch, as a requirement for successful measurement.

In the Rasch model, the probability of a specified response (e.g., a right or wrong answer) is modeled as a function of person and item parameters. Specifically, in the original Rasch model, the probability of a correct response is modeled as a logistic function of the difference between the person parameter and the item parameter. In most contexts, the model parameters characterize respondents' proficiency and the items' difficulty as locations on a continuous latent variable. For example, in educational

4
Mokshein, S. E., Ishak, H., & Ahmad, H. (2019). The use of rasch measurement model in English testing. Cakrawala Pendidikan, 38(1), 16–32.
https://doi.org/10.21831/cp.v38i1.22750

tests, the item parameters represent the difficulty of the items, while the person parameters represent the ability or level of achievement of the
people evaluated. The higher a person's ability relative to the difficulty of an item, the greater the probability of a correct answer on that item.
When a person's location on the latent trait equals the difficulty of the item, there is by definition a 0.5 probability of a correct answer in the
Rasch model.
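The dichotomous Rasch model described above can be written down in a few lines. This is a minimal sketch of the response function only (parameter estimation is a separate, larger task), and the example ability and difficulty values are hypothetical:

```python
import math

def rasch_p_correct(theta, b):
    """Probability of a correct response in the dichotomous Rasch model:
    a logistic function of the difference between person ability (theta)
    and item difficulty (b), both expressed on the same latent (logit) scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the probability is exactly 0.5.
print(rasch_p_correct(0.0, 0.0))  # 0.5

# An able person on an easy item: the probability approaches 1.
print(rasch_p_correct(2.0, -1.0))
```

Note how the code reproduces the closing claim of the section: when a person's location equals the item's difficulty, theta - b = 0 and the logistic function returns 0.5.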

2. Assessment
2.1 The Definition of Assessment

Assessment is a process of obtaining data or information relative to some known goal or target. Assessment is a broad term that includes testing: a test is a special form of assessment. Assessment also refers to the various ways of gathering information about language ability or achievement during a student's learning process.

According to Stiggins (1994), assessment is the appraisal of students' learning processes, progress, and outcomes. Meanwhile, according to Kumano (2001), assessment is defined as "the process of collecting data which shows the development of learning."

According to Jerrold (2012), assessment is how teachers identify learning needs, document students' progress, and determine how they are performing as teachers and planners.

So, it can be concluded that assessment is a term for the systematic process of appraising a student's learning: collecting information, making decisions based on it, and providing an overview of the development of the student's learning outcomes.

2.2 The Purpose of Assessment

The purpose of assessment is to gather relevant information about student performance or progress, or to determine student interests to
make judgments about their learning process. After receiving this information, teachers can reflect on each student’s level of achievement, as
well as on specific inclinations of the group, to customize their teaching plans.

Conceptualizing the purpose of assessment as threefold provides us with a solid basis from which to interrogate how assessment supports quality in education. This purpose-based grouping was proposed by Newton (2007)5 and is adapted here following Brookhart (2001)6 and Black et al. (2003, 2010).7 Some functions are served exclusively by an assessment developed for a single purpose, but there are cases where assessments are designed to fulfill multiple purposes. The latter exacerbates the risk of not appropriately fulfilling any of the roles of the assessment, or of only paying lip service to one purpose while actually focusing on another. This is, however, not always the case. In the following paragraphs, the definitions and conceptualizations of the purposes of assessment are explored.

1. Assessment to Support Learning. Assessment to support learning is often referred to as formative assessment. This chapter, however, follows the arguments of Brookhart (2001) and Black et al. (2003, 2010), who emphasize the pedagogical role that summative assessment also plays in supporting learning. Summative assessment has been judged by both students (Brookhart, 2001) and educators (Black et al., 2010) to serve not merely as a tool for reporting learners' progress but also to support learning. Assessment to support learning refers to a forward-going interaction between learning and assessment: assessment data are employed diagnostically to determine competence, gaps, and progress, so that learners may adapt their learning strategies and teachers their teaching strategies (Black and Wiliam, 1998; Black, 1998). This role is usually, but not solely, associated with formative assessment. This type of assessment, be it formative or summative, may be a distinct event or integrated into teaching practice. It is employed to determine the degree of mastery attained to that point

5
Testing, Friend or Foe? The Theory and Practice of Assessment and Testing. London: The Falmer Press.
6
Clarifying the purposes of educational assessment. Assess. Educ. Princ. Pol. Pract. 14, 149–170. doi:10.1080/09695940701478321
7
Data-informed curriculum reform: which data, what purposes, and promoting and hindering factors. Teach. Teach. Educ. 26, 482–496. doi:10.1016/j.tate.2009.06.007

and to inform the learning required to move toward mastery. Formative assessment in particular usually has a high frequency and focuses
on smaller units of instruction (Bloom et al., 1971; Newton, 2007).8
2. Assessment for Accountability. Assessment for accountability is a function of the responsibility of educational institutions to the public
and government for the funding received (Black, 1998; Pityana, 2017) 9. This is mainly achieved through providing evidence that learning
is being promoted. The most viable manner in which to do this on a wide scale is through aggregated learner results. International and
national comparative and benchmark studies have also been popularized as a means of providing accountability (Jansen, 2001; Howie,
2012; Archer and Howie, 2013).10
3. Assessment for Certification, Progress, and Transfer. Assessment for certification, progress, and transfer needs to be served on both an institutional and an individual level. Programs and qualifications need to be certified and acknowledged by accreditation bodies in order to have value for further studies or employability (Altbach et al., 2009).11 The certification of an institution is therefore an acknowledgment by the accreditation body, such as a national education system or professional board, that a qualification meets the requirements set by the authority. On an individual level, certification is necessary to endorse the attainment of certain skills and knowledge. This certification

8
See notes 5–7 above.
9
"Leadership and ethics in higher education: some perspectives from experience," in Ethics in Higher Education: Values-Driven Leaders for the Future, 1st Edn, eds D. Singh and C. Stückelberger (Geneva: Globethics.net), 133–161.
10
On the politics of performance in South African Education: autonomy, accountability and assessment. Prospects 31, 553–564. doi:10.1007/BF03220039

High-stakes testing in South Africa: friend or foe? Assess. Educ. Princ. Pol. Pract. 19, 81–98. doi:10.1080/0969594X.2011.613369

“South Africa: optimising a feedback system for monitoring learner performance in primary schools,” in Educational Design Research – Part B: Illustrative Cases, eds T.
Plomp, and N. Nieveen (Enschede: Netherlands Institute for Curriculum Development (SLO)), 71–93.
11
Trends in Global Higher Education: Tracking an Academic Revolution. Paris: United Nations Educational, Scientific and Cultural Organization (UNESCO).

then serves as the entrance criteria to the next grade or level of learning. Assessment data are also required to attest to progress and
facilitate transfer to a different institution. In a similar fashion, assessment data should facilitate movement between different institutions,
even if these are in different territories or countries. The certification is required to allow the receiving institution to make a decision as to
whether or not previous learning will be recognized and credits transferred (Black, 1998; Garnett and Cavaye, 2015).12

2.3 The Types of Assessment

In general, the purpose of a teacher's assessment is to determine what students should know, understand, and be able to do. The assessment must provide clear information about students' progress in relation to their learning outcomes during classroom learning activities. This information can later help us as teachers to make the right decisions about supplementing whatever is lacking while we educate students. There are two main types of assessment, each occurring at a different point in the learning process: formative assessment, which occurs both before and during the learning process, and summative assessment, which occurs at the end of a cycle or at the end of the learning process.

These types of assessments also have different purposes and uses.

1. Formative Assessment
According to Popham (2011),13 formative assessment is a process that involves

12
Recognition of prior learning: opportunities and challenges for higher education. J. Work Appl. Manage. 7, 28–37. doi:10.1108/JWAM-10-2015-001
13
Boraie, D. (2018). Types of Assessment. The TESOL Encyclopedia of English Language Teaching, 1–7. https://doi.org/10.1002/9781118784235.eelt0350

collecting and analyzing the evidence obtained from assessments in order to determine when and how to adjust instructional activities or learning tactics so as to achieve the learning objectives. To achieve this goal, formative assessment is divided into two types: pre-assessment and continuous assessment.

a. Pre-assessment
Pre-assessment is a type of formative assessment that occurs before learning activities begin. Whether formal or informal, a pre-assessment is never graded; it is purely diagnostic in nature. When conducting a pre-assessment, the teacher should try to find out:
what the students already know, in order to select the right material in relation to what they have already understood; and
whether any earlier material is incomplete or poorly understood, before presenting the next material, so that the teacher can judge whether students have understood the material given and the learning objectives can be achieved.
b. Continuous assessment. Continuous assessment is what most people think of when they think of formative assessment. Ongoing assessments occur at various intervals throughout the learning process. The aim is to determine the extent to which students are "with" the teacher in terms of meeting the learning objectives, so that teaching methods, the delivery of material, the classroom atmosphere, and the skills taught can be adapted to further facilitate student growth.

Formative assessment, which may be formal or informal, includes, for example, homework, quizzes, and class discussions.

2. Summative Assessment.

Summative assessment occurs at the end of the learning process and is usually graded. Some examples of summative assessments include tests, projects, demonstrations, presentations, and performance assignments. The purpose of summative assessment is to find out the extent to which students have mastered the knowledge, understanding, and skills of the unit, in the form of tangible evidence.

Experts such as Wiggins and McTighe (2011)14 recommend that summative assessment be planned prior to instruction. In this way, the teacher can manage the learning system and students' skills to build the understanding and knowledge that will lead to the students' success in the summative assessment.

2.4 Models of Assessment


The goal of carrying out an assessment is usually to identify levels of need or risk, or to build a shared understanding during the first contact with the user of a service. Depending on the type of information we need to collect, Smale et al. (1993) offer three models, the Procedural, the Questioning, and the Exchange, to guide us in carrying out assessments.
 The procedural model, often combined with guidance related to legislation, involves the use of systems designed to ensure consistency and completeness of data collection. Eligibility for and allocation of services is then often decided as a result of the data collected. This can provide only a snapshot assessment, drawing attention away from the examination of the individual's strengths and abilities, and can deflect from individual rights or concerns about quality of life (Milner and O'Byrne, 2009). The concern is that such systems can replace, rather than support or inform, the judgments made by professionals (Barry, 2007, cited in Milner and O'Byrne, 2009), and can be seen as rigid, time-consuming, and one-sided, in that they meet the needs of the worker and agency rather than those of the service user. Difficulty arises when information about an individual is collected by different professionals with different foci (i.e., health, housing, etc.) but stored separately. This results in an

14
Boraie, D. (2018). Types of Assessment. The TESOL Encyclopedia of English Language Teaching, 1–7. https://doi.org/10.1002/9781118784235.eelt0350

inadequate understanding of the total experience of any one individual by a single professional. Workers can get caught up in the information-gathering process instead of trying to understand what the service user needs. On a more positive note, this systematic way of collecting large amounts of data has also contributed to the evidence base for social work practice.
 The questioning assessment model focuses on the nature of the questions and how the information is used. In this approach, the problems and solutions reside with the individual, and the social worker's job is to identify the problem and highlight the most appropriate approach to solving it. One criticism of this model is that it can be seen as oppressive, since the social worker assumes the role of expert and makes the final decision. However, if questions are asked in order to understand what is affecting the current situation, and a variety of perspectives is sought, it need not be.
 By adopting the Exchange model, service users become the experts regarding their own needs and are empowered through participation in their own assessment. The model recognizes that the worker's expertise lies in their ability to solve problems. The objective, through the development of trust, is to seek a compromise between options and needs through the participation of all parties. The worker assumes responsibility for managing the assessment process. The focus is on a holistic assessment of the individual's context over time.

3. Evaluation
3.1 The Definition of Evaluation

Evaluation is an act or process that assigns ‘value’ to a measure. When we are evaluating, we are making a judgment as to the suitability,
desirability or value of a thing. In the teaching–learning situation, evaluation is a continuous process and is concerned with more than just the
formal academic achievement of students. Evaluation refers to the assessment of a student’s progress towards stated objectives, the
efficiency of the teaching and the effectiveness of the curriculum.

Evaluation is a broad concept dealing not just with the classroom examination system, but also with evaluating the cognitive, affective, and psychomotor domains of students. The success or failure of teaching depends upon teaching strategies, tactics, and aids. Thus, an evaluation approach improves the instructional procedure. Glaser's basic model of teaching refers to this step as a 'feedback function'.

In simplistic terms, evaluation is making a judgment or determination of the quality or worth of an object, subject, or phenomenon. Evaluation is the determination of how successful a programme, a curriculum, a series of experiments, etc., has been in achieving the goals laid out for it at the outset (Coleman, 2020, p. 111). Evaluation is the assignment of symbols to a phenomenon in order to characterize its worth or value, usually with reference to some social, cultural, or scientific standards (Bradfield, 2020, p. 7).15

Evaluation is one of the most commonly used assessment tools in education: the activity of collecting information and making decisions based on the results of an assessment. Another definition treats evaluation as a program in the context of a goal, namely a process of assessing the extent to which educational goals have been achieved (Tayibnapis, 2002). Broadly speaking, evaluation assigns a value to the quality of something. According to Purwanto (2002), evaluation is a systematic process for determining or deciding the extent to which teaching objectives have been achieved by students. Arikunto (2003) likewise states that evaluation is a series of activities that aim to measure the success of an educational program.16

In general, every evaluation process requires information about the situation in question, taking into account such things as goals, objectives, and procedures, in order to produce information related to feasibility, suitability, and so on.

3.2 Purpose of Evaluation

15
Adom, D., Mensah, J. A., & Dake, D. A. (2020). Test, measurement, and evaluation: Understanding and use of the concepts in education. International Journal of
Evaluation
16
Wulan, A. R. (2001). Pengertian Dan Esensi Konsep Evaluasi, Asesmen, Tes, Dan Pengukuran. FMIPA Universitas Pendidikan Indonesia, 1–12.

In general, there are three reasons why evaluations are conducted: to determine adequacy, plausibility, or probability. Resources for evaluations are limited, and determining the reason for the evaluation can save both time and money in program budgets.
1. Adequacy
An adequacy assessment is conducted when stakeholders and evaluators are interested only in whether the goals set by program developers were met. For example, if a child health program seeks to reduce child mortality to 25% in selected villages, an adequacy assessment will attempt to show whether or not this 25% target was reached. The benefit of an adequacy assessment is that it does not require a control group, which can significantly cut the budget of an evaluation as well as the time and effort involved. However, without randomization or a control group, many indicators cannot be appropriately linked to the program activities. Although limited in what may be inferred from them, adequacy assessments do show progress toward pre-determined targets, which may be sufficient to argue for increased or continued funding (Habicht et al., 1999).17

2. Plausibility
A plausibility assessment similarly determines whether a program has attained its expected goals, but identifies changes as potential effects of the program activities rather than of external or confounding sources. This is made possible by the use of a control group. Ideally, a plausibility assessment will also incorporate baseline and post-intervention data points to show improvements in target indicators explicitly. Again, the benefit of having a control group is the ability to link program activities to program outcomes; without one, it is not possible to determine whether decreased child mortality rates, for example, are associated with a particular intervention. Without measuring identical indicators in control villages, there is no way to link a decrease in
17
Habicht, J.P., Victora, C.G., and Vaughan, J.P. (1999). Evaluation designs for adequacy, plausibility, and probability of public health programme performance and impact.
International Journal of Epidemiology, 28:10-18.

child mortality to program activities, because other external and confounding factors may have contributed. In plausibility assessments, the control groups are not required to be truly randomized; they can be chosen from historical epidemiological databases or internally (i.e., individuals who were chosen for an intervention but chose not to participate in the program activities). This, however, allows for certain selection-bias confounders that are not accounted for in the analysis. Therefore, the results of a plausibility assessment only truly determine that there was a difference between the control and intervention groups that can most likely, but not wholly, be attributed to the intervention (Habicht et al., 1999).18
3. Probability
Like both plausibility and adequacy assessments, probability evaluations look to determine the success of a program's activities and outcomes. Unlike the two previously discussed assessments, probability assessments use the most robust study design, the randomized controlled trial (RCT), to determine the true effect of the intervention on the indicators of interest. Due to the complexity of determining causal relationships in public health, this type of assessment is the most expensive and time-consuming of the three, so it should be used only when evaluators and stakeholders have found it necessary for funding or research purposes. RCTs involve more data collection and place more emphasis on compliance, so they often increase personnel costs for data collection, while incentives or vouchers to improve participation and compliance rates may raise costs further. In some cases, due to the nature of the project or intervention, it may be impossible or unethical to conduct a true RCT. This strategy may also not be feasible if the evaluation is not discussed in the initial phases of program planning, since a randomized control group is required and is difficult to assemble mid-intervention. An RCT involves complete randomization when selecting the intervention and control groups, to reduce the influence of bias on the data. For example, in a plausibility assessment, if the evaluators choose to use an internally created control group, there is a risk of the control group being influenced by others in the household or village who are participating in the program (also called spillover
18
Habicht, J.P., Victora, C.G., and Vaughan, J.P. (1999). Evaluation designs for adequacy, plausibility, and probability of public health programme performance and impact.
International Journal of Epidemiology, 28:10-18.

effects). With a probability assessment, evaluators take this into consideration when choosing their control groups, for example by ensuring physical and social distance between the groups (Black, 1996).19
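Of the three designs, the adequacy assessment is the simplest to operationalize: it reduces to comparing an observed indicator against a pre-set target, with no control group. A minimal sketch (the rates are hypothetical, echoing the 25% child-mortality target used as an example above):

```python
# Adequacy assessment sketch: did the programme reach its pre-set target?
# Lower mortality is better, so the target is met when the observed
# rate is at or below the target rate. No control group is involved,
# so nothing here attributes the change to the programme itself.
def target_met(observed_rate, target_rate):
    return observed_rate <= target_rate

print(target_met(0.23, 0.25))  # True  -- target reached
print(target_met(0.30, 0.25))  # False -- target missed
```

Plausibility and probability designs add, respectively, a comparison group and randomization on top of this basic target check.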

3.3 Types of Evaluation


The Common European Framework (2002)20 distinguishes the following types of evaluation:
a. Achievement assessment / Proficiency assessment
 Achievement assessment
Achievement assessment evaluates what has been taught: the learner's work within a lesson, a term, or the content of the course book.
It is oriented to a particular course, and the advantage of this assessment is its close connection to the learner's recent experience.

 Proficiency assessment
Proficiency assessment evaluates the learner's ability to apply his knowledge in practice and helps to gauge his current abilities
overall, which is a great advantage.

b. Norm-referencing / Criterion-referencing assessment


 Norm-referenced assessment
Norm-referenced assessment ranks learners according to their achievement; each learner is evaluated in comparison with his
classmates. In this situation a group of slower learners would be assessed by different, easier standards than a group of
19
Black, N., (1996). Why we need observational studies to evaluate the effectiveness of health care. BMJ, 312:1215-8.

20
Malcová, P. (2006). The Role of Evaluation in Enhancing the Language Competences Diploma Thesis Brno 2006.

faster learners. This type of assessment is often used in placement tests, when learners are sorted into classes. Heaton
(1990) proposes the group as the norm: students are told whether they fall in the top or bottom part of the class. In
competitive testing situations, norm-referenced assessment is used and the candidates compete against one another; if there
are many excellent students, the average student will have a lower probability of passing the examination.
 Criterion-referencing assessment
The learner is assessed solely on how well he performs relative to his own previous performance, or relative to an estimate of
his individual ability. Heaton (1990) suggests that criterion-referencing may be fairer from a student's point of view, since it
compares the student's results with fixed criteria. Students are judged on how well they can perform a task in itself.
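The distinction can be illustrated with a small scoring sketch. This is not from the source text: the cutoff of 70 and the sample scores are invented for illustration only.

```python
# Illustrative sketch: norm-referenced scoring ranks a learner within the
# group, while criterion-referenced scoring compares against a fixed standard.

def percentile_rank(score, group_scores):
    """Norm-referenced view: percentage of the group scoring below this learner."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)

def meets_criterion(score, cutoff=70):
    """Criterion-referenced view: pass/fail against a fixed cutoff (assumed: 70)."""
    return score >= cutoff

scores = [55, 62, 70, 78, 85, 91]  # hypothetical class results
for s in (62, 85):
    print(s, round(percentile_rank(s, scores), 1), meets_criterion(s))
```

Note how the norm-referenced figure changes if the group changes, while the criterion-referenced verdict depends only on the fixed standard.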

c. Mastery learning CR / Continuum criterion-referenced assessment


 Mastery learning CR
This type determines an explicit minimum level of language proficiency. Learners are divided into two groups: those who
'mastered' and those who 'did not master' the course.
 Continuum criterion-referenced assessment
The learner's proficiency is compared with the scale of abilities applied in the tested area. Criterion-referenced assessment is
centred on the learner's actual proficiency and relates to the content of the particular language course.

d. Continuous assessment / Fixed assessment points


 Continuous assessment

This assessment reflects the learner's results in lessons and in his written or project work. The final grade covers the work done
during the whole course, year, or term. Continuous assessment takes account of the learner's individual creativity and needs, and
the teacher is more objective in this type of assessment.
 Fixed assessment points
On a particular day (usually at the end or the beginning of the course) learners are given grades on the basis of a test. Previous
performance is generally not taken into account. This type of assessment disadvantages certain types of learners, because they are
under a great deal of pressure.

e. Formative / Summative assessment


 Formative assessment
Formative assessment takes place during the course or instruction period and is designed to guide teachers in adjusting their
teaching where needed. It is the most common type of assessment in higher education and constitutes the bulk of teachers'
efforts to assess students.
 Summative assessment
Summative assessment contrasts with formative assessment above all in its purpose. Typically it has to fit the administrative
requirements of an institution, for example a private language school, or a school curriculum in which all subjects are required
to be assessed. In some cases, the results of summative assessments are used for quality control; for instance, results from
schools and institutions may be compared nationally or regionally to set standards.

f. Direct / Indirect assessment

 Direct assessment
This type of assessment evaluates the learner's ongoing activity. Direct assessment is restricted to the speaking, writing, and
listening skills. A typical example of a direct test is an interview.
 Indirect assessment
Indirect assessment, on the other hand, is carried out through written tests. It typically evaluates reading skills, where
learners must demonstrate that they have understood a text. Typical examples of indirect tests are comprehension
or cloze tests.

g. Performance / Knowledge assessment


 Performance assessment
This type of assessment requires the learner to demonstrate his written or oral language proficiency directly. The learner's
performance involves his language production; his level of language mastery can be regarded as an ability applied in
practice.
 Knowledge assessment
The learner answers various kinds of questions on various subjects to demonstrate his language ability. Such tests evaluate only
the learner's performance, but as an indication of the underlying competence.

h. Subjective / Objective assessment


 Subjective assessment

Subjective assessment reflects the teacher's subjective attitude towards the learner. It corresponds to marking done on the basis
of the teacher's judgment. The advantage of subjective assessment is that language and communication are complicated
matters which cannot be judged mechanically.
 Objective assessment
Objective assessment disregards the teacher's subjective view of the learner and is mostly carried out as indirect tests with one
correct answer, for example multiple choice.

i. Checklist rating / Performance rating


 Checklist rating
The learner is classified into a particular level or scale he has achieved. The effect of this rating runs along a vertical line: it
concerns the level of performance the learner can reach (maximum, middle, minimum).
 Performance rating
The learner is rated against a list of items relevant to a certain level of mastery or a model situation. The effect of this rating
runs along a horizontal line: it concerns the learner's direct approach and usually takes the form of questionnaires or yes/no answers.

j. Impression assessment / Guided judgment


 Impression assessment
This type of assessment depends on the teacher's subjective judgment about the learner's performance in lessons. It is carried out
without specific criteria for specific evaluation.
 Guided judgment

The teacher's subjective impression is controlled and complemented by specific criteria, which can be defined by points or
marks. The advantage of guided judgment is its fair approach to learners.

k. Holistic / Analytic assessment


 Holistic assessment
Holistic assessment involves the learner's total performance, which is evaluated as a global category resulting in
one grade. The assessor evaluates the various aspects of the performance intuitively.
 Analytic assessment
Analytic assessment evaluates the various aspects of the learner's performance separately. At times it can be hard for the
teacher to concentrate on individual categories and remain free of the total impression.
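The contrast between the two approaches can be made concrete with a small sketch. The category names, weights, and scores below are invented for illustration and are not taken from the source.

```python
# Illustrative sketch: a holistic rater gives one global grade intuitively,
# while an analytic rater scores separate categories and combines them.

analytic_scores = {"grammar": 3, "vocabulary": 4, "fluency": 2, "content": 4}
weights = {"grammar": 0.3, "vocabulary": 0.2, "fluency": 0.2, "content": 0.3}

holistic_grade = 3  # a single intuitive judgment of the whole performance

# Analytic: weighted combination of the separately judged aspects.
analytic_grade = sum(analytic_scores[c] * weights[c] for c in analytic_scores)
print(holistic_grade, analytic_grade)
```

The analytic figure makes the strengths and weaknesses visible per category, at the cost of the rater having to keep the categories independent of the overall impression.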

l. Series / Category assessment


 Series assessment
In this type of assessment, several tasks are usually evaluated separately by grades from 1-4. The results of the individual
categories do not influence one another.
 Category assessment
Only one task is evaluated in this type of assessment. The performance is judged according to criteria stated in an assessment grid.

m. Assessment by others / Self-assessment


 Assessment by others

The learner is evaluated by the teacher.
 Self-assessment
The learner assesses his own language mastery, which encourages him to become more involved in and responsible for his
learning. The Common European Framework (2002) proposes a great many assessment scales the teacher can choose from;
however, it would be difficult for the teacher to attend to a large number of them. Assessment should above all consider the
learner's needs and his physical and mental condition, and every evaluation procedure must be applicable in practice.

3.4 Models of Evaluation


a. Kirkpatrick evaluation model
The four-level evaluation model was first introduced in 1959, when Donald L. Kirkpatrick wrote a series of four articles entitled
"Techniques for Evaluating Training Programs", published in Training and Development, the journal of the American Society
for Training and Development (ASTD)21. The four-level method represents a sequence of steps for evaluating a training
program: the levels must be worked through in order, because each level in the model is important and each level affects the
next. Kirkpatrick's four-level evaluation model essentially measures:
 Reactions: what participants think and feel about the training.
 Learning: the resulting increase in knowledge or ability.
 Behavior: the extent to which behavior, capability, and implementation/application improve.
 Results: the effect on the business or environment resulting from the participant's performance.
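The stepwise rule (each level is evaluated only after the previous ones) can be sketched as a small ordered structure. The code itself is a hypothetical illustration, not part of Kirkpatrick's publications; only the level names and ordering come from the model.

```python
# Illustrative sketch of Kirkpatrick's four levels as an ordered sequence;
# the model requires working through the levels in order.

KIRKPATRICK_LEVELS = [
    (1, "Reaction", "what participants think and feel about the training"),
    (2, "Learning", "the resulting increase in knowledge or ability"),
    (3, "Behavior", "how far behavior and application improve on the job"),
    (4, "Results", "the effect on the business or environment"),
]

def next_level(completed):
    """Return the name of the next level to evaluate, given levels already done.

    Enforces the stepwise rule: level n is evaluated only after levels 1..n-1.
    """
    for number, name, _ in KIRKPATRICK_LEVELS:
        if number not in completed:
            return name
    return None  # all four levels have been evaluated

print(next_level({1, 2}))  # after Reaction and Learning comes Behavior
```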

Level 1 reaction

21
Owston, R. (2008). Models and Methods for Evaluation EVALUATION MODELS. Handbook of Research on Educational Communications and Technology, 605–617

The reaction level is essentially an evaluation of participants' satisfaction with the training activities they have taken part in.
Participants' reactions can indicate how far the objectives of the training organization have been achieved: a training
program is considered successful if the participants are satisfied with all the elements involved in its implementation.
The success of the learning process cannot be separated from the interest, attention, and motivation of the participants.
Participants learn best when they react positively to the learning environment. There are two types of instruments for
evaluating Level 1 reactions, namely the participants' reactions to the organization and to the instructors.

Level 2 learning

At the learning level, participants acquire the knowledge or skills conveyed in the teaching activities. Measuring learning
means determining one or more things related to the objectives of the training, such as what knowledge has been learned, what
skills have been developed or improved, and what attitudes have changed (Syafril Ramadhon, Journal of the ESDM Oil and Gas
Training Center, 2012).

The stages of evaluation at the learning level are: (a) evaluate improvements in knowledge and skills and changes in attitude
before and after training; (b) measure attitudes using tests with agreed indicators; (c) measure knowledge using a pretest and
posttest; (d) measure skills using performance tests; and (e) use the results of these measurements to take appropriate action.
Appropriate action here means confirming the findings against the evaluation results at the reaction level: whether the
teacher was insufficiently communicative in delivering the material, whether the learning strategies did not match
participants' expectations, or whether other Level 1 factors caused participants to become demotivated, so that weaknesses
identified in the feedback can receive immediate attention.
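Where the text recommends pretest/posttest measurement, one common way to quantify the change is a normalized gain score. The following sketch is illustrative only; the formula and the maximum score of 100 are assumptions, not something prescribed by the source.

```python
# Illustrative sketch: measuring learning at Level 2 with a pretest/posttest
# comparison. The normalized gain expresses the improvement as a fraction of
# the improvement that was still possible before training.

def normalized_gain(pre, post, max_score=100):
    """(post - pre) / (max_score - pre); undefined if the pretest is at ceiling."""
    if pre >= max_score:
        raise ValueError("pretest already at the maximum score")
    return (post - pre) / (max_score - pre)

# A hypothetical participant scored 40 before training and 70 after:
g = normalized_gain(40, 70)
print(round(g, 2))  # 0.5 -> half of the possible improvement was realized
```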

Level 3 Behavior

Kirkpatrick (1998) defines behavior as the extent to which changes in behavior occur because participants have followed the
training program. Level-3 evaluation is conducted to identify the extent to which the training material is applied in
participants' jobs and workplaces. According to Tan and Newman (2013), behavioral evaluation measures what knowledge,
skills, or attitudes learned are applied or transferred to the job. According to Kirkpatrick (1998), four conditions are needed for
this behavior change to occur: (1) the individual must have a desire to change; (2) the individual must know what to do and
how to do it; (3) the individual must work in a suitable work environment; and (4) the individual must be rewarded for
changing. A learning program can supply the first and second conditions by supporting attitude change in line with the
training objectives and by providing material on the relevant knowledge, skills, or attitudes. The third condition, the right
working environment, however, depends directly on the supervisor and the participant's surroundings.

Level 4 Impact

The implementation of learning naturally aims at good results, such as improved quality or productivity. Kirkpatrick
(2006, p. 134) defines evaluation of results as the final outcomes that occur because participants have taken part in the
training program. The steps in conducting a level-4 evaluation are:

1. Conduct the level-3 evaluation first.
2. Allow time for the impact to emerge or be achieved; there is no fixed time for evaluating results, so the various factors
involved must be considered when choosing the time of evaluation.
3. Use a survey method, with questionnaires or interviews with training participants and company leaders.
4. Take measurements both before and after the training program where possible.
5. Repeat the evaluation at an appropriate time.
6. Weigh the costs incurred against the results obtained.
7. Use secondary data, such as sales data, production data, and other data that support the survey results, in analyzing the results.
b. Anderson’s Value of Learning

Anderson's Value of Learning model encourages us to focus evaluation on the alignment between the learning program's goals
and the strategic goals of the organization; only once the goals are aligned can we evaluate the success of the learning program in
meeting them. The model is a three-stage learning evaluation cycle designed to be applied at an organizational level. While other
training evaluation models focus on specific learning interventions, Anderson's model is more concerned with aligning the
training goals with the organization's strategic goals.

Anderson's model is a three-step cycle that helps an organization determine the best training strategy for its needs. The three stages are:
Step 1: Determine the current alignment of the training with the strategic priorities of the organization.
Step 2: Use a range of methods to assess the contribution of learning.
Step 3: Establish the most relevant approaches for your organization.
The most relevant approach for a given organization will depend on the goals and values of its stakeholders. Anderson's model
offers four categories of metrics: a focus on short-term benefits; a focus on long-term benefits; senior leadership's confidence in
the contribution of learning; and the organization's need for measures of the value of learning.

3.5 Steps of Evaluation

Logically we can identify three phases: the planning phase, the process phase, and the product phase.

The planning phase

This initial phase of evaluation takes place prior to actual implementation and involves making decisions about what course of action
will be taken and toward what ends (Gafoor, 2015)22. The planning phase involves a number of processes, which are discussed below.

22
Gafoor, A., & Associate, K. (2015). Types and Phases of Evaluation in Educational Practice. ResearchGate, (February), 1–9. https://doi.org/10.13140/2.1.3801.1680

a. Situation analysis
The first step is to analyze the situation as it presently exists in order to establish the parameters of the effort. This step includes
activities such as the collection of background information and assessment of existing constraints. For a teacher this may involve
examination of the cumulative records of his or her students in order to get a frame of reference based on their abilities and histories.
After the parameters have been established, more realistic goals and objectives can be formulated.
b. Specification of objectives
Goals are general statements of purpose, or desired outcomes and not as such directly measurable. Each goal must be translated
into one or more specific objectives which are measurable. Thus, objectives are specific statements of what is to be accomplished
and how well, and are expressed in terms of quantifiable, measurable outcomes. Objectives may be process-oriented or product-
oriented. Process objectives describe outcomes desired during the execution of the effort, and they relate to its development and
execution. Product objectives, on the other hand, describe outcomes intended as a result of the effort. Objectives give direction
to all subsequent activities and achievement of objectives is ultimately measured. Objectives, whether instructional or program
objectives, form the foundation of all subsequent evaluation activities, and therefore it is critical that they themselves be evaluated in
terms of relevance, measurability, substance, and technical accuracy.
c. Specification of pre-requisites
Objectives entail a unique procedure with respect to student evaluation. In most cases, specification of a given set of instructional
objectives is based on the assumption that students have already acquired certain skills and knowledge. If the assumption is
incorrect, then the objectives are not appropriate. The assumed behaviours are referred to as pre-requisites or entry behaviours.
Systematic instruction and evaluation require that these pre-requisites be specified and measured. Assessment of entry behaviour is
especially important at the beginning of any instructional unit. To arrive at pre-requisites, you simply ask yourself the following
question: what must students know or be able to do prior to instruction in order to benefit from it and achieve the
objectives?

d. Selection and development of measuring instruments
Collection of data to determine degree of achievement of objectives requires administration of one or more instruments.
Such instruments must either be developed or selected. Selection of an instrument involves examining those that are available and
selecting the best one. Best, in this case, means the one that is most appropriate for your objectives and users. Introduction to
Educational Measurement and evaluation. Prepared by Dr K Abdul Gafoor, Associate Professor, Department of Education 23.
Development of a good instrument takes considerable time, effort and skill. Training in measurement in the process is necessary
for such end.
e. Delineation of strategies
Strategies are generally approaches to promoting achievement of one or more objectives. There may be instructional strategies,
curriculum strategies, and program strategies. Each strategy entails a number of specific activities, and there are typically a number of
strategies to choose from. Execution of these strategies must be planned for, to ensure the availability of necessary resources.
Strategies which must be thoroughly thought through before evaluation is conducted include: task analysis, review of concepts,
sequencing, and provision of feedback and practice.
f. Preparation of time schedule
Preparation of realistic time schedule is important for all types of evaluation; rarely do we have as long as we please to conduct
evaluation. Basically a time schedule includes a list of the major activities of the proposed evaluation effort and corresponding
expected initiation and completion times for each activity. You should allow yourself enough time, so that if an unforeseen
minor delay occurs, you can still meet your final deadline.

The process phase


The process phase involves making decisions based on events which occur during the actual implementation of the planned instruction,
program, or project. The first step in the process phase is to administer pre-tests, if such are appropriate. Based on the pre-test results,
decisions may be made concerning the appropriateness of the objectives already specified. Following initial testing, the planned
strategies and activities are executed in the predetermined sequence. Data collected during this phase provide feedback on whether
execution is taking place as planned and whether strategies and activities are being effective. The basic purposes of this phase are to
determine whether the effort is being executed as intended, to determine the degree of achievement of process objectives, and to
identify ways in which improvements can be made. The process phase is also referred to as formative evaluation.

The product phase


The product phase involves making decisions at the end, or more likely at the end of one cycle, of instruction, a program, or a project.
Decisions made during the product phase are based on the results of post-tests and on other cumulative types of data. The major
purpose of the product phase is to collect data in order to make decisions about the overall effectiveness of the instruction, program,
or project. During this phase it is determined whether, and to what degree, the intended product objectives were achieved. Data
analysis and interpretation are followed by the preparation of a report describing the objectives, procedures, and outcomes of the effort.

The results of the product phase of evaluation are used in at least two major ways:
1) They provide feedback and direction to all who were involved in the effort;
2) They provide feedback to outside decision makers, such as parents, principals, school board members, and funding sources.

Results of the product phase need to be interpreted with care. Failure to meet objectives, for example, is not necessarily fatal; degree of
achievement needs to be considered. The product phase of evaluation is referred to as summative evaluation.

It is important to consider the following, if evaluation procedures are to bear fruit:
 Deciding when to evaluate;
 Deciding what precisely to evaluate;
 Deciding whom the evaluation is intended to serve;

 Deciding who should conduct the evaluation;


 Deciding what questions the evaluation should address;
 Planning the evaluation study;
 Deciding how to report the evaluation study; and
 Dealing with the political, ethical and interpersonal issues in evaluation.

4.1 Summary

There is a lot of confusion over these two terms as well as other terms associated with assessment, testing, and evaluation. The big
difference can be summarized as this: assessment is information gathered by the teacher and student to drive instruction, while evaluation is
when a teacher uses some instrument (such as the CMT or an end-of-unit test) to rate a student so that this information can be used to compare or
sort students. Assessment is for the student and the teacher in the act of learning while evaluation is usually for others. “If mathematics teachers
were to focus their efforts on classroom assessment that is primarily formative in nature, students’ learning gains would be impressive. These
efforts would include gathering data through classroom questioning and discourse, using a variety of assessment tasks, and attending primarily to
what students know and understand” (Wilson & Kenney, page 55).24
Assessment is far more important because it is integral to instruction. Unfortunately, it is being hampered by the demands of evaluation.
The biggest demand for evaluation is grading, or report cards. There shouldn't be a problem with that, except that historically grades were
determined exclusively by computing a student's numeric average on paper-and-pencil assessments called quizzes or tests. "Most experienced
24
Sohnata H, B. (2021). Compilation of English Language Testing

teachers will say that they know a great deal about their students in terms of what the students know, how they perform in different situations,
their attitudes and beliefs, and their various levels of skill attainment. Unfortunately, when it comes to grades, they often ignore this rich
storehouse of information and rely on test scores and rigid averages that tell only a small fraction of the story.
The myth of grading by statistical number crunching is so firmly ingrained in schooling at all levels that you may find it hard to abandon.
But it is unfair to students, to parents, and to you as the teacher to ignore all of the information you get almost daily from a problem-based
approach in favor of a handful of numbers based on tests that usually focus on low-level skills” (Van de Walle and Lovin, page 35). The reason
this is a problem is that students learn what is valued, and they strive to do well on those things. If the end-of-unit tests are what is used to
determine the grade, guess what kids want to do well on: the end-of-unit test! You can do all the great activities you want, but if the bottom line
is the test, then that is what is going to be valued most by everyone: teachers, students, and parents alike.
GROUP 2

A. Definition of the Nature and Target of a Language Test

This section explains what nature, target, and language test mean. Nature can denote essence, basic character, or innate quality, while a
target is a benchmark of achievement, an aim, or a goal; the nature and target of a language test are thus its basic character and its purpose.
"A test may be defined as a task or series of tasks used to obtain systematic observations presumed to be representative of educational or
psychological traits or attributes" (Sax, 1980)25. In addition, "a test is defined as an instrument or systematic procedure for observing and
describing one or more characteristics of a student using a numerical scale or a classification scheme" (Nitko & Brookhart, 2007)26. On the
basis of these definitions there is a common thread: an achievement test of learning outcomes is one way of probing the abilities students
already possess after following the

25
Sax (1980)
26
Nitko, Brookhart (2007)

learning process for a certain period of time. A test also aims to obtain, through examination, information on how great and how deep
students' aptitude in the field of instruction is. An examination whose primary target is the level of language is known in language teaching
as a language test.

Testing is a natural part of teaching: through tests a teacher can evaluate his or her system of teaching, and through tests the teacher can
also determine whether or not the objectives of that teaching have been achieved. Iskandar (2013) states that a good system of testing and
assessment will encourage students to improve their motivation and achievement in learning (Iskandar, 2013, as cited in Iskandar and Rizal,
2017). The ultimate purpose of using language as a method of communication is to enable a person to convey ideas and their content to
others. That people can use a language without understanding much about it shows that this belongs to language ability rather than to
language knowledge. Those who have language ability and those who have language knowledge must therefore be distinguished, even
though both relate to language.

In linguistics, language ability is divided into two categories: language competence and language skill. Competence enables language
users to understand what others say or to express themselves through language. Language competence cannot be seen, heard, or read
because of its abstract nature, although it always lies behind language use. Language skill, on the other hand, is concrete and concerns
actual language use, whether in a spoken form that can be heard or a written form that can be read.

Language examinations can be divided into four categories according to their target: listening tests, reading tests, speaking tests, and
writing tests; the skills of listening, reading, speaking, and writing are thus all covered by language examinations, which are usually
targeted at the language skills. Although there is abundant literature on tests and testing, the subject is still an area much neglected by
many language teachers and by some test designers. This lack of attention stems from limited knowledge of testing techniques and limited
awareness of the importance of testing among the teaching community. Many teachers still neglect it, which also makes it possible that
many of the tests administered do not meet the criteria of what can be called

a good test for examining students; as Hughes (1995) states, a highly reliable test is typically consistent and dependable (Hughes, 1995, as
cited in Siddiek, 2010). From this we may conclude that not every test administered is a trustworthy one, since many teachers still neglect
the matter of language testing for their students.

Place and Function of the Language Test

A test is conducted to assess something related to the quality of a thing. In language testing, an examination is conducted to see how well a
person has mastered the language he uses every day, or how deep his knowledge and skill are in a newly learned language (which may also
be called a second language). "To fit the needs in the field, learners will be measured against the applicable provisions and the existing
reality" (Dimova, Yan and Ginther, 2020)27. A language test is conducted for a particular purpose, depending on the party administering
the test or the needs of the person taking it. The main purpose of language testing, however, is that every domain of life can be filled by
people whose knowledge and skill in language are adequate, so that the information in circulation brings benefit and does not give rise to
things that could harm many parties.

Today, language tests have become important for most people in the world. The interconnection of places brought about by globalization
has encouraged the world community to take an interest in languages other than the mother tongue for many purposes, such as adding
knowledge, expanding business relations, and even gathering as many supporters as can be found. As a result, matters related to language
testing have shifted somewhat. "Besides being a benchmark of change, assessment is also regarded as the process of change itself" (Barros

27
Dimova, Yan, Ginther (2020)

33
and Vine, 2019)28. Hal penting yang menjadi inti dari tes yaitu penilaian telah mendapatkan posisi strategis di tengah-tengah masyarakat dan
memiliki makna yang lebih luas lagi. Untuk itu, seseorang yang mengikuti tes bahasa setelah mengikuti atau melalui proses pembelajaran yang
panjang tidak peduli apapun tujuannya ketika belajar dan mengambil tes mampu memasuki peristiwa yang akan menjadi pengalaman luar biasa
dan bekalnya di masa depan nanti dengan perubahan-perubahan yang terjadi pada individu tersebut.

The Purpose of the Language Test

In general, evaluation is a test carried out systematically and as a process to determine the value of something against a predetermined standard through assessment. By undergoing evaluation, learners come to know their ability to absorb and understand the language lessons they have studied: if they obtain a satisfying score, it becomes an encouragement in learning the language, while if their score is less satisfying, they are pushed to study the language harder.

Language evaluation is an activity that teachers carry out deliberately and with a purpose. Besides revealing learners' abilities, language evaluation also aims to determine whether the instruction the teacher delivered has succeeded or not, so that the teacher knows which steps or methods to use next in teaching the language.

According to Sudirman et al., the purposes of assessment in the learning process are:

a. To make decisions about learning outcomes.
b. To understand learners.
c. To improve and develop the learning program.

28 Barros and Vine (2019)

Purposes and Forms of Language Tests:

Alabi and Babatunde (2001) identify three purposes of language testing:

1. To determine how much has been learned from a particular syllabus (achievement tests fall into this category).
2. To discover the strengths and weaknesses in students' language (diagnostic tests are an example here).
3. To enable teachers to understand other kinds of tests (these include proficiency and aptitude tests).

Approaches in Language Testing

In general, an approach refers to the point of view or foundation one holds in doing something. According to Depdikbud (1990:180), an approach can be understood as the procedure, behavior, and steps directed toward something. Going further, Brown (2000) defines an approach as "theoretically well-informed positions and beliefs about the nature of language, the nature of language learning, and the applicability of both to pedagogical settings."29

Language testing and language teaching are closely related, and various approaches influence how language teaching is applied in linguistics. Approaches to language testing are generally classified into five: (1) the traditional approach, (2) the discrete approach, (3) the integrative approach, (4) the pragmatic approach, and (5) the communicative approach.

a. Traditional Approach

29 Brown (2000)

The Grammar Translation Method (GTM) had been in use up to the early nineteenth century. "GTM is chiefly characterized by the central use of grammar rules as the basis for translating from the second language into the native language" (Brown, 2000). Remarkably, GTM withstood the attempts at the turn of the twentieth century to "reform" language-teaching methodology (see the Series Method and the Direct Method), and to this day it is still practiced in many school contexts. Prator and Celce-Murcia (1978:3) note the dominant characteristics of Grammar Translation:

1. Classes are taught in the mother tongue, with little active use of the target language.
2. Much vocabulary is taught in the form of lists of isolated words.
3. Long, elaborate explanations of the intricacies of grammar are given.
4. Grammar provides the rules for putting words together, and instruction often focuses on the form and inflection of words.
5. Little attention is paid to the content of texts, which may be treated as exercises in grammatical analysis.
6. Often the only drills are exercises in translating disconnected sentences from the target language into the mother tongue.
7. Reading of difficult classical texts is begun early.
8. Little or no attention is given to pronunciation.

This approach is characterized by the use of translation of words, sentences, or paragraphs from the first language into the target language, or vice versa, and by grammatical analysis, such as asking students to find or parse parts of speech, or to state how to change singular nouns into plurals. Scores are determined through the subjective judgment of the instructor; the teacher needs no special skills to construct the test (Brown, 2000).30

30 Brown (2000)

b. Discrete Approach

According to Khosiyono (2021), testing is discrete in units: in the test, each problem to be solved rests on a single unit of language. Students typically have to sort out several words and form a sentence from the words they have sorted.

According to Dewi and Natiti (2012), in the discrete approach the test contains several of the teaching points covered in instruction. Such tests are usually highly reliable because they are neutral, their evaluation process is consistent, and electronic media can be used to score students' results. The discrete approach also has a weakness: developing discrete test items can drain time and energy. Tests under this approach do not rely on interaction among students during the test, but on students' interaction with the general public in their environment.
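Because each discrete-point item targets a single language unit with a single keyed answer, such tests lend themselves to the electronic scoring mentioned above. A minimal Python sketch, in which the item labels and answer key are invented purely for illustration:

```python
# Illustrative machine scoring of discrete-point items.
# Item names and the answer key are hypothetical examples,
# not taken from any published test.

def score_discrete(responses, answer_key):
    """Count correct answers; each item tests one language unit."""
    correct = sum(
        1 for item, answer in answer_key.items()
        if responses.get(item, "").strip().lower() == answer.lower()
    )
    return correct, len(answer_key)

answer_key = {
    "plural_of_child": "children",
    "past_of_go": "went",
}
responses = {"plural_of_child": "children", "past_of_go": "goed"}
correct, total = score_discrete(responses, answer_key)
print(f"{correct}/{total}")  # 1/2
```

Each item is scored independently, which is what makes the approach both highly reliable and easy to automate.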

Khosiyono (2021) states that some things in language tests are not used when testing the rules of language use. This is because there is an imbalance between the methods used in language testing and the methods used in applying language directly; the material in a language test sometimes does not match language use in real life.

c. Integrative Approach

As we know, the integrative approach is an approach that combines two elements or activities, such as process and concept. For example, the concept may integrate one subject with another, or combine one method with another.

The 2006 elementary school curriculum guidelines state that "basic competencies cover the aspects of listening, speaking, reading, writing, literature, and language, and are implemented in an integrated manner" (Depdiknas, 2006:14). The integrated approach, then, is instruction that unifies and presents learning materials in an integrated way, connecting or linking the teaching materials so that they do not stand alone or in isolation.

According to Frazee and Rosse (1995), this is directed at shaping students' thinking as a whole: learners at this age naturally perceive things in a whole, holistic way. Another view holds that in daily life learners do not use knowledge piece by piece, but as a whole.

The learning process is carried out using several texts related to students' needs; students are then given practice with various texts until they can produce texts without the teacher's help and guidance (Richard and Bohlke, 2011).31 This is what is meant by the integrative approach combining two elements or activities, so that students themselves can work more efficiently in the learning process they want.

d. Pragmatic Approach

In Arabic the pragmatic approach is called madkhal, which concerns the nature of language and the nature of language teaching and learning. Levinson explains that pragmatics focuses on the study of the relationship between language and its context (as cited in Kusyowo, 2015). The context in question is so interwoven that it must not be detached from the language system. Abdurrahman (2006) states that the pragmatic approach is the branch of linguistics that explores language use as something strongly determined by the context behind the language itself. The pragmatic approach rests on the function of language as a tool of communication. Communication is everyone's need: whatever is said will be taken as a message by others who attend to one's language.

31 Richard and Bohlke (2011)

Principles of pragmatics:

1. The role of whom the utterance is addressed to, directed at, meant to be heard by, and intended for.
2. Grice's cooperative principle, with the parameters of quantity, quality, relevance, and manner: speakers must speak truthfully and relevantly from beginning to end.
3. Politeness: courtesy in speaking, so as to create goodness, harmony, and peace.
4. The principle of local interpretation, which gives meaning to the talk, and the principle of analogy, which does not change the meaning of the talk.
5. Discourse: variation according to context and situation.
6. Socialization pragmatics: politeness in language, local and interlocal norms.
7. Discourse pragmatics: speech acts assume cohesion, coherence, and a variety of registers.

e. Communicative Approach

The communicative approach is an approach that, in the language-learning process, emphasizes the child's ability to communicate using the languages they know and the newly learned language, the second language after the mother tongue they have mastered. This follows from language being a tool of communication, so that a child's progress in mastering a language can be seen directly through this approach. "Language learning has a strong relationship with 'language' as the 'content of learning,' and so does linguistics" (Seken, 2017)32. Language as a communication tool strongly shapes the materials used in the teaching and learning process, since every new thing or piece of knowledge the teacher gives the students is delivered in that language. Linguistics, too, as the domain in which language is developed and detailed, gives teachers and students room to seek out, study, and learn the available material within their respective scopes: for example, students and teachers from primary through secondary school cover the general basics of the language, while students and lecturers in higher education cover its advanced aspects in detail. As a result, it is hoped that by the end of the lessons the teacher will have finished transferring what they know of the language being taught, and the students will be able to use the language they have learned in their activities.
32 Seken (2017)

Therefore, in practice the communicative approach is directed more toward mastery and use of the language the child knows than toward whether the sentences the child knows or utters are formally standard. "The fundamental element of a relationship is conversation with deep feeling" (Mulyodiharjo, 2010).33 The communicative approach is closely tied to the conversations of teacher with student, student with teacher, and even student with student, which will naturally involve the emotional side of each individual. Although it does not focus on the standardness of words or sentences, the learning process must still observe the ethics and rules that apply in society and, on a wider scale, in the state. This activity runs well when the matters discussed carry positive value and are spoken with sincere, positive feeling. Offending others or harming a particular party will not make a student someone who takes pride in such bad behavior; rather, it leaves an unpleasant experience the student learns to avoid later. So it can be said that a good process will in the end produce a good impression or a positive result, and vice versa, with the note that whether the process was good or bad, both yield valuable lessons for students and teachers.

33 Mulyodiharjo (2010)

Teachers should understand that in applying the communicative approach, they must not keep focusing only on the goals they set beforehand in the lesson plan. "It should be noted that beyond a person's words there are other aspects a qualified mentor must consider" (Ros05). A teacher must not rest only on their own interest in delivering the lesson material; the teacher is also obliged to educate by listening to and understanding everything the students express during the learning process. Merely hearing students' words without focusing the mind on understanding them more deeply can hinder the building of the student-teacher relationship, which in the end also hinders the learning process. This happens because students will feel unattended to, the teacher will seem to play favorites, and students may even lose their purpose in joining the learning process. In turn, the test results the students obtain will not reach satisfying marks.

The obstacles encountered in language testing under the communicative approach certainly need solutions. Besides building a relationship of mutual respect and affection among the members of the class, as in a family, the teacher can also overcome them with the skills they possess as an instructor and educator. "Failure to convey information to others can be minimized with understandable expression and a sociable self" (Oet14). The skill of providing variety, which every teacher should possess, can be used to smooth the language test to come. When material is not understood merely by being read aloud, the teacher can bring in props or present still or moving images (video) accompanied by explanation to improve students' understanding; for example, showing a video of people in dialogue, or playing its audio through a loudspeaker, for material on the sounds of words as part of the language lesson. After that, students are given room to practice it in their own creative way, to build further confidence that they can give their best performance on the test to be given later.

The benefit of a language test under the communicative approach, then, is that it shows plainly how far students are able to express what they know or have learned using the language they already know. Still, a problem arises after the results of a communicative test come out: teachers often find a group of students who can pass the test the teacher gives yet still feel nervous about using the language they have learned in daily life. This arises from a lack of thorough understanding within the students, so that they can only produce or practice the parts they have mastered and hide, or stay silent about, what they have not. Teachers must therefore keep ever closer watch on each child's development, so that any progress or obstacle a student experiences can be detected quickly and a solution found, both inside and outside the learning process.

GROUP 3

A. Definition Of Test

"The test is an assessment tool in written form to record or observe student achievement in line with the assessment target"34. Sax (1980:13), meanwhile, suggests that "a test may be defined as a task or series of tasks used to obtain systematic observations presumed to be representative of educational or psychological traits or attributes."

According to F.L. Goodenough, in Sudijono (2008:67), "a test is a task or a series of tasks given to an individual or group of individuals, with the intention of comparing their skills, one with another."

A test is "a tool or procedure used to find out or measure something using a predetermined method or rule"35. Here we must distinguish between test, testing, testee, and tester. Testing refers to the activities that take place while the test is being carried out; as Gabel (1993) states, "testing shows the process of implementing the test." The testee is the subject or target who does the task, while the tester is the person who takes the measurements or administers the test to respondents.

From all the explanations above, it can be concluded that a test is a form of measurement, assessment, and data collection used to find out and observe the level of ability of an object.

B. Types Of Test

 According to Dr. S. Prakash & Dr. E. Dhivyadeepa

According to Dr. S. Prakash & Dr. E. Dhivyadeepa, tests are divided into three types:

1. Oral Test:

Oral tests are regularly used in the lower classes and for diagnostic work to determine learning difficulties. They are used for testing individuals: the teacher meets the students face to face and asks questions. Questions that are not understood can be repeated or reworded so that students grasp the question clearly. Oral testing can be a highly flexible instrument in the hands of a skillful teacher, but oral tests are not free from defects: they are time-consuming and subjective.

For example: an interview test for a job or a scholarship.

34 Jacobs & Chase (1992; Alwasilah, 1996)
35 Arikunto and Jabar (2004)

2. Written Test:

These are tests in which the answers to the questions are recorded on sheets of paper, to be read and assessed by the teacher at leisure afterwards. Long-answer essay questions and short-answer paragraph questions can test language proficiency as well as the specificity of the points of presentation, which are assessed largely subjectively. Objective selected-response questions, whether recall or recognition, test the accuracy and correctness of responses without subjectivity.

For example: writing an essay on a given title.

1) Essay Type Test:

The essay test is probably the most popular of teacher-made tests. In general, a classroom essay test consists of a small number of questions, in response to which the student is expected to demonstrate the ability to (a) recall factual, conceptual, or procedural knowledge, (b) organize this knowledge, and (c) interpret the information critically in a coherent, integrated answer to the question. An essay test item can be classified as either extended response or short (restricted) response; the latter calls for a more restricted or limited answer in terms of form or scope.

Example: Mention the benefits of learning English in everyday life and give an example.

2) Short Answer Type Tests:

Short-answer questions are typically composed of a brief prompt that demands a written answer varying in length from one or two words to a few sentences. They are most often used to test basic knowledge of key facts and terms.

Example:

This place is located on the northwest coast of Java, the world's most populous island, and is the capital city of Indonesia. It is the country's economic, cultural, and political center, with a population of 10,075,310 as of 2014, and its official metropolitan area, known as Jabodetabek (a name formed by combining the initial syllables of Jakarta, Bogor, Depok, Tangerang, and Bekasi), is the second largest in the world.

What is the name of the city described in the text above?

Answer: Jakarta

3) Objective Type Tests:

The conventional system of examination has failed to bring about integration between teaching, learning, and assessment, because it depended more on the memorizing capacity of the students. The development of objective-type tests is aimed at correcting the defects of essay-type questions.

4) Multiple Choice Types:

Multiple-choice questions are composed of one question (the stem) with several possible answers (the choices), including the correct answer and several incorrect answers (distractors). Typically, students select the correct answer by circling the associated number or letter, or by filling in the associated circle on a machine-readable response sheet.

Example:

The first singer sings …, but the second one sings better.

a. Beautifully
b. Better
c. Well
d. Bad
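Processing a machine-readable response sheet, as described above, amounts to comparing each recorded choice against an answer key. A brief sketch; the item numbers and key below are hypothetical:

```python
# Illustrative grading of multiple-choice items against an answer key,
# as a scanner-backed system would do. Items and key are invented.

def grade_multiple_choice(key, responses):
    """Return the number of items answered with the keyed choice."""
    return sum(1 for item, correct in key.items()
               if responses.get(item) == correct)

key = {1: "c", 2: "a", 3: "d"}     # e.g. item 1's keyed choice is "c"
student = {1: "c", 2: "b", 3: "d"} # one distractor chosen for item 2
score = grade_multiple_choice(key, student)
print(score)  # 2
```

Unanswered items simply fail the comparison, so they score zero without special handling.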

5) Matching Type:

Students respond to matching questions by pairing each of a set of stems (e.g., definitions) with one of the choices provided on the exam. These questions are often used to assess recognition and recall, and so appear most often in courses where acquisition of detailed knowledge is an important objective. They are generally quick and easy to create and to mark, but students need more time to answer them than a comparable number of multiple-choice or true/false items.

Example:

Match the word on the left to the word with the same meaning.

A.            B.
1. angry      1. fortunate
2. lucky      2. capable
3. forbid     3. mad
4. able       4. stay
5. sojourn    5. prohibit

6) True/False Type:

True/false questions are composed of a single statement. Students respond to the question by indicating whether the statement is true or false.

Example:

(T)-(F). The Prophet was born in 571 CE, coinciding with the Year of the Elephant.

7) Completion Type:

The completion item requires the student to answer a question or to finish an incomplete statement by filling in a blank with the correct word or phrase.

Example:

Fill in the blank below with the correct answer. The prime factors of 15 are ………
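Because completion answers are typed freely, a grader usually normalizes case and spacing before comparing the response with the key. A small illustrative sketch; the set of accepted answers for the example above is an assumption:

```python
# Illustrative grading of a fill-in-the-blank item. Free-typed answers
# are normalized (case, whitespace) before comparison. The accepted
# answers listed here are assumed for the example, not authoritative.

def normalize(text):
    return " ".join(text.lower().split())

def grade_completion(accepted_answers, response):
    """True if the response matches any accepted answer after normalization."""
    return normalize(response) in {normalize(a) for a in accepted_answers}

# "The prime factors of 15 are ..." — accept either ordering.
accepted = ["3 and 5", "5 and 3"]
print(grade_completion(accepted, "  3 AND 5 "))  # True
```

Listing every acceptable phrasing in the key is the usual design choice when exact string matching would be too strict.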

8) Analogy Type:

Initially, analogy was used to refer to the relationship between a target and a source of information. It is used as a process of transferring information from a particular target to another target that is similar to it. To be precise, analogy is treated as the identification of the relationship between two terms or conditions.

Example:

Court is to judge as classroom is to . . . .

a. Teacher
b. School
c. Learning
d. House

3. Performance Tests:

These are tests concerning psychomotor assessment. The skills of observation, drawing, experimentation, manipulation, verbalization, etc. are all tested in performance tests.

Example: Show the class how to teach using the jigsaw type of active learning model.36

 According to Dr. Aradhani Wani

Tests are divided into different types taking into consideration their content, objective,
administration system, scoring style etc.

According to mode of administration, tests are of two types:

- Individual test: a psychological test given to a single individual at a given time.

- Group test: a test given to a group of individuals at a given time.

According to the method of scoring, test are of two types:

- Machine-scored tests: tests that are graded by machines such as computers.

- Hand-scored tests: tests that are assessed by humans. A classroom achievement test is an example of a hand-scored test.

36 Dr. S. Prakash et al. (2016). Evaluation in Education, pp. 21–31.

According to the ability of the student, tests are of two types:

- Speed test: a test applied to individuals to find out students' speed. Here the test time is limited, the number of questions is larger, and all questions are at the same level of difficulty.
- Power test: a test applied to individuals to determine students' strength or ability. Here there is no time limit, and individuals may take as much time as they like to answer the questions. An essay contest run by the media is a clear example of a power test.
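The operational difference between the two types can be sketched in code: a speed test counts only answers given within the time limit, while a power test counts them all. The item numbers, answers, and timings below are invented for illustration:

```python
# Illustrative contrast between speed-test and power-test scoring.
# All items, answers, and elapsed times are hypothetical.

def score_speed_test(responses, key, time_limit):
    """responses: list of (item, answer, seconds_elapsed).
    Only answers submitted within the time limit count."""
    return sum(1 for item, ans, t in responses
               if t <= time_limit and key.get(item) == ans)

def score_power_test(responses, key):
    """Every correct answer counts, regardless of when it was given."""
    return sum(1 for item, ans, _ in responses if key.get(item) == ans)

key = {1: "b", 2: "d", 3: "a"}
responses = [(1, "b", 40), (2, "d", 55), (3, "a", 90)]
print(score_speed_test(responses, key, time_limit=60))  # 2
print(score_power_test(responses, key))                 # 3
```

The same correct responses yield different scores, which is exactly why the two test types measure different traits.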

According to the principle of test construction, tests are of two types:

- Teacher-made Tests: Teacher-made tests are tests made by teachers to assess student
growth.
- Standardized Tests: Standardized tests are tests used to measure the general objectives of various schools. They are generally prepared by specialists and subject experts, and are tried out and selected on the basis of relevance and effectiveness so that the tests are of the highest quality. Such tests are also useful for measuring educational development in general, determining student progress, analyzing learning difficulties, and comparing achievement with learning capacity.

C. Kinds of Test

1. Achievement Test

According to Anne Anastasi (Psychological Testing, 1976), "the test is basically an objective and standard measurement of a sample of behavior," while achievement is evidence of the effort that has been achieved37. In line with this understanding, an achievement test is a structured test designed to reveal the subject's maximum performance in mastering the material that has been taught. In formal educational activities in the classroom, learning achievement tests can take the form of daily tests, formative tests, summative tests, and even EBTANAS (national final examinations) and university entrance exams38. From this it can be concluded that an achievement test is a test administered to students to determine how far they understand the material that has been taught; it can take the form of daily reviews or tests.

37 W.S. Winkel (1996:165)

Learning achievement tests can be divided into two types: the power test and the speed test. The two differ in application. The power test, as a principle, sets no time limit, because when time is limited students cannot work on the test optimally and obtain scores below their true level; with a time limit, we cannot see students' full ability. Unlike the power test, the speed test sets a time limit: what it measures is students' speed of thinking in completing the test. The items presented are usually relatively easy, so that what is really measured is speed of thinking and students' skill in solving problems.39
The functions of the learning achievement test, as stated by Ebel (1991), are as follows:

a) The main function of an achievement test is to measure students' success in learning; with the test, we can find out how far students understand the material that has been taught.
b) Tests also help teachers and instructors assign accurate marks and meaning; through the test, the teacher can determine students' performance and then the score they have obtained.
c) Learning achievement tests also serve to motivate and direct students in learning. Students will usually study more actively when they face an exam; in other words, they study more seriously the material they expect to be tested. In this way, students' motivation to learn increases, and so does their understanding of the material.40

2. Placement Test

A placement test is an examination given to students who are about to enter an institution, in order to determine their skill level in a particular field so that groups can be formed according to students' abilities; an example is a test for choosing a major in senior high school. The purpose of placement assessment is to place students where they truly belong, based on their talents, interests, abilities, and circumstances, so that they experience no obstacles in following the lessons or any program of material the teacher presents. The placement test helps teachers give students a quick start in English language learning41.

38 Azwar (2007).
39 Suharman (2018). Tes sebagai alat ukur prestasi akademik. Jurnal Ilmiah Pendidikan Agama Islam, 93–115.
40 Merli, S. R. (1981). Profil persepsi terhadap tes prestasi belajar pada mahasiswa Fakultas Psikologi Universitas Mercu Buana Yogyakarta. Journal of Chemical Information and Modeling, 53(9), 1689–1699.

The placement test can take the form of written and oral tests or interviews. Written tests can be multiple-choice tests or tests prepared beforehand. The test package is divided into several levels of items, from basic 1 up to advanced (referring to skill level), while the oral exam is designed to give a further picture of the productive abilities of the placement-test participants.
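The grouping step of placement, from a basic level up to an advanced one, can be sketched as a simple mapping from a placement score to a level group. The cut scores and level names here are hypothetical; real programs set their own:

```python
# Illustrative mapping from a placement-test score to a level group.
# Cut scores and level names are invented for the sketch.

def place_student(score, bands=((0, "Basic 1"), (40, "Basic 2"),
                                (60, "Intermediate"), (80, "Advanced"))):
    """Return the highest band whose cut score the student reaches.
    `bands` must be sorted by ascending cut score."""
    level = bands[0][1]
    for cut, name in bands:
        if score >= cut:
            level = name
    return level

print(place_student(72))  # Intermediate
```

An oral interview would then refine the written-test placement, as the text describes.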

3. Proficiency Test

A proficiency test is a test used to measure a person's ability to do something. It is usually administered to measure the language skills a person possesses, based on the aspects involved in learning the language.

One proficiency test is the ITP TOEFL, which has three sections: listening comprehension, structure and written expression, and reading comprehension42. This shows that the ability to listen and understand is one of the important aspects of proficiency testing: where a proficiency test includes a listening section, a combination of vocabulary mastery and understanding of the meaning of each vocabulary item is required.

Of course, a proficiency test is carried out because there is a goal to be achieved: to measure the mastery of the language a person possesses. To increase language ability or mastery, training in better reading techniques must be carried out (Yuyun et al., 2018). It is therefore important to conduct proficiency tests, in order to gauge the extent of students' ability in, or mastery of, a language.

Proficiency tests are also used to test the skills possessed by teachers: one study discusses an investigation of multiple-choice items used to measure the ability of Indonesian secondary-school teachers who teach English43. This shows that proficiency tests can be used by various groups who want to measure their ability in, or mastery of, a language.

41 Cambridge Assessment English
42 Yuyun et al. (2018)
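A total proficiency score is typically derived from the section scores. As an assumed convention only (the official TOEFL ITP conversion tables are published by the test maker and are not reproduced here), paper-based TOEFL-style totals are often described as ten times the mean of the three scaled section scores:

```python
# Illustrative composite for a three-section proficiency test
# (listening, structure/written expression, reading). The formula is
# an assumed convention, not an official scoring specification.

def composite_score(listening, structure, reading):
    return round(10 * (listening + structure + reading) / 3)

print(composite_score(55, 52, 58))  # 550
```

The point of the sketch is that no single section determines proficiency: a weak listening score pulls the composite down even when reading is strong.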

4. Diagnostic Test

As a teacher, of course, really want quality students. In order to improve the quality of
children's self, teachers need to know the various obstacles that students have in learning in
order to provide the correct learning method according to the needs of students. Therefore, an
educator needs a test to test the ability of each student. This test is called a diagnostic test.
According to Sudijono: 2008, in order to provide proper handling of learning for students,
teachers need to know the weaknesses of students through a test called a diagnostic test.

Diagnostic tests are tests conducted to identify the problems students face in learning, the causes of their learning difficulties, and how educators can overcome those difficulties. Ambiyar (2011) states that, to uncover students' various learning problems, teachers need to construct the test questions accordingly. Diagnostic evaluation aims to analyze the learning difficulties faced by students and the various factors that cause them.44 Diagnostic tests are thus used to identify and examine the types of learning difficulties students face, the factors that cause them, and ways to overcome them.45

From the opinions above, it can be concluded that a diagnostic test is a test carried out to identify students' learning weaknesses so that they can be handled properly, according to the learning difficulties they face.

The function of the diagnostic test is to investigate students' learning difficulties and to develop a design for efforts to solve them.

Characteristics of Diagnostic Tests

a. Designed to detect student learning difficulties
b. Developed based on an analysis of the sources of errors or difficulties that may cause student problems
c. Uses short-answer questions so as to capture complete information
d. Accompanied by a follow-up plan appropriate to the identified difficulties
Diagnostic Test Development Steps

The five steps of developing a diagnostic test aimed at cognitive assessment, according to Labudasari & Rochmah (2019), are:
43 Tamah and Lie (2019)
44 Widyanto (2018)
45 Sofyan et al. (2019)

1. Construction of a substantive theory. Substantive theory is the basis for developing tests grounded in research or research reviews.
2. Design selection. A measurement design can be used to create item constructs that test takers can respond to well, based on specific knowledge, skills, or characteristics.
3. Test administration, covering several aspects: the format of the evidence, the technology used to produce the test kits, and the environmental conditions at the time of testing.
4. Scoring of the test results, i.e., determining the scores for the tests that have been administered.
5. Revision: the process of adjusting between theory and model, i.e., checking whether the test as developed supports the theory; if not, it must be revised.

Students who have learning difficulties show low achievement. According to Burton, in Syamsudin (2005: 308), students are said to have learning difficulties:

1. If within a certain time limit the person concerned does not reach the minimum level
of mastery in certain teaching.

2. If the person concerned cannot perform or achieve the proper level of performance.

3. If the person concerned does not succeed in achieving the required level of mastery as
a prerequisite for the next lesson.

Rajeswari (2004) names “six conditions that must be considered in the implementation of diagnostic tests. The six conditions are: (1) done to improve student achievement, not to determine graduation; (2) carried out in a comfortable and pleasant atmosphere; (3) done honestly by students independently; (4) in the diagnostic test students can ask about things that are not clear; (5) the teacher encourages students to work on all the questions; and (6) the implementation schedule is not strict, i.e., students can take the test according to the time they have (p. 46)”.

Diagnostic test items tend to have a relatively low level of difficulty. According to Suwarto, a diagnostic test is a test used to find out weaknesses (misconceptions) on certain topics and to obtain input about student responses in order to remedy those weaknesses.46

46 Suwarto (2013).

GROUP 4

A. CRITERION REFERENCE TEST

1. The Definition of Criterion Reference Test (CRT)

A criterion-referenced test (CRT) is designed to provide a measure of performance that is interpretable in terms of a clearly defined and delimited domain of learning tasks.47 The term criterion-referenced test simply means that the items on the test are referenced to, or drawn from, a carefully specified set of subordinate skills that make up the goal. On a domain-referenced test, the items are referenced to, or drawn from, a carefully delineated domain of tasks. Thus, students' performance on such tests is referenced to the criterion set of skills or domain. Criterion-referenced measures group student performances and are rated to develop a statement about the student's performance ability.

A CRT informs us about a student's level of competence, determining whether a student requires additional or less work on a set of abilities, but it says nothing about the student's position in relation to other students (Bachman 1995; Kubiszyn & Borich 2007). A criterion-referenced test is a type of evaluation in which every student's performance in the classroom is evaluated against the same standard. For example, if the evaluation refers to a target attainment (instructional) criterion that has been set in advance, it is referred to as a criterion-referenced test. The values pupils acquire relate to their own level of accomplishment.

2. The Purpose of Criterion Reference Test

Basically, the purpose of criterion-referenced assessment is to see whether students have gained the necessary knowledge or skill sets for admission. If a student receives a score that is higher than the expected standard, they receive a passing grade; if their performance falls short of expectations, they fail.48 The aim of a criterion-referenced test is thus to measure performance against specific objectives or skills defined as standards of progress.
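The pass/fail decision described above can be sketched in a few lines of Python. This is a hypothetical illustration only; the cut score and candidate names are invented for the example, not taken from the text:

```python
# Hypothetical cut score: the fixed criterion every test taker is judged against.
CUT_SCORE = 75

def crt_result(score: float, cut_score: float = CUT_SCORE) -> str:
    """Pass/fail depends only on the criterion, not on other test takers."""
    return "pass" if score >= cut_score else "fail"

# Every candidate can pass (or fail) at the same time; no ranking is involved.
scores = {"Ani": 82, "Budi": 74, "Citra": 91}
results = {name: crt_result(s) for name, s in scores.items()}
print(results)  # {'Ani': 'pass', 'Budi': 'fail', 'Citra': 'pass'}
```

Note that the decision for each candidate is independent of everyone else's score, which is exactly what distinguishes a criterion-referenced from a norm-referenced interpretation.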

3. The Characteristic of Criterion Reference Test

47 Advances in Social Sciences Research Journal, Vol. 4, No. 21, publication date Nov. 25, 2017. DOI: 10.14738/assrj.421.3831.
48 https://www.theedadvocate.org/criterion-referenced-test/

The characteristics of a criterion-referenced test are:

1. Authority. It genuinely evaluates whether the test measures what it claims to measure, and whether the scenarios and performances described in the goal apply to the item.

2. Consistency. This concerns accuracy: whether or not one can have a high level of confidence in the results. Any unpredictability in the tool can be caused by random error.

3. Practicality. Because of cost and time constraints, not all assessments are feasible; the decision should therefore be based on a number of crucial variables.

4. Subject mastery. This aids the progression of students' academic success throughout their studies. The criterion-referenced test also assesses the student's knowledge and understanding of the subject.

5. Managed locally. At the classroom level, the teacher can quickly determine whether or not the standards have been met, and detect shortfalls. Test results are obtained immediately in order to give students useful feedback on their performance.49

4. The Steps of Criterion Reference Test

The steps a teacher needs to know in carrying out criterion-referenced assessment are:

a) Checking the answers
Specifically, mark which answers are incorrect and, for the purposes of review, which answers are correct.

b) Scoring
During the counting stage, double-check the responses and record the numbers. The raw score merely reflects the answers given by the test participant; it must still be transformed into a final score, which gives a more precise picture of the participant's skill level than the raw score does. The final score can then be interpreted against the performance criterion.
49 https://www.toppr.com/bytes/criterion-referenced-test/

5. Tests in which we can use the Criterion-Referenced Test, namely:

1. Achievement Test

Achievement testing is a method of determining a student's progress. An achievement test determines how much material students have learned since beginning the learning process/instruction. It consists of formative, summative, and post-tests. A formative test is one that is given after one subject has been completed; it is used to determine whether or not the pupils have grasped the material at hand. The purpose of a formative test is to monitor students' progress during the learning process, to provide feedback for improving the teaching and learning process, and to identify weaknesses that need to be addressed, so that the teaching and learning process becomes more efficient. The major goal is to enhance the learning process.50 Examples include daily and block exams, quizzes, and homework. In judging students' work, for example, "superficial analysis" is more relevant to the "bad" performance criterion than to the "sufficient" level of performance. In the light of experience, the readability of the success criteria will improve over time51 (Carlson, MacDonald, Gorely, Hanrahan, Burgess, Limerick, 2000, p. 110).

6. The Advantage of Criterion Reference Test

Criterion-referenced tests provide the advantage of comparing a score to a performance benchmark rather than to the scores of other people. To put it another way, when criterion-referenced assessments are administered, everyone who takes the test may do well or poorly. The result has individual significance: not in comparison to other candidates, but in terms of what an individual can do. In fact, each item is picked to represent a specific piece of the skill or information that the candidate should acquire, as determined by the teacher. The examiner is looking to see whether the candidate has met the criterion skill requirements. Above all, this strategy is the only one that pinpoints the student's and the teacher's strengths and weaknesses for remedial purposes. As a result, a major benefit of this type of testing is that it offers considerably more frequent information on student progress than norm-referenced testing and can help diagnose specific strengths and weaknesses. Such tests do not, however, offer a national comparison, and the test strategy and assessment tend to be less precise than on norm-referenced assessments52 (Bude & Lewin, 1997).

50 Burton, Kelley J. (2006). Designing criterion-referenced assessment. Journal of Learning Design, 1(2), pp. 73-82.
51 International Journal of Humanities and Social Science Invention, ISSN (Online) 2319-7722, ISSN (Print) 2319-7714, www.ijhssi.org, Volume 4, Issue 10, October 2015, pp. 24-30.

7. The Weakness of Criterion Reference Test

Criterion-referenced assessments in schools have flaws of their own. They slow down teaching, since the teacher is frequently obliged to reteach until acceptable levels are reached. Criterion-referenced tests should not be overdone, because they are formative and ongoing. Students' interest in what they are doing can be harmed if they become preoccupied with how they are doing, and excessive performance anxiety can suffocate curiosity. Furthermore, because the test must account for content validity and reliability, developing the test can be difficult for the teacher. Finally, the ranking of performance scores within each class, which a CRT does not provide, is an important part of the teaching and learning process.53

B. NORM REFERENCE TEST

1. The Definition of Norm referenced Test (NRT)


52 Mashingaidze, S. (2012). Criterion and Norm Referenced Tests in the Education System: Livening Up the Debate. The Dyke, 6(2), p. 84.
53 Mashingaidze, S. (2012). Criterion and Norm Referenced Tests in the Education System: Livening Up the Debate. The Dyke, 6(2), p. 84.

A norm-referenced test evaluates (and grades) students' learning by assigning them a value (and position) relative to the performance of their peers (Brown 2003; Hughes 2003; Huitt 1996; Wojctzak 2002, as cited in Kamal). In other words, norm-referenced tests (NRTs) compare a student's score against the scores of a group of people who have taken the same test, called the "norming group".
An NRT is a test that measures how the performance of a particular test taker or group of test takers compares with the performance of another test taker or group of test takers whose scores are given as the norm. Local, state, or national standards can be used as a foundation for norm-referenced standardized examinations. As a result, rather than being judged against an agreed-upon criterion, a test taker's score is interpreted in relation to the scores of other test takers or groups of test takers (Hussain et al., 2015).

As a result, an NRT is a method of assessment in which a learner's relative rank is compared to that of other pupils in the classroom (Brown 1976; Mrunalini 2013; Salvia & Ysseldik 2007). If a student obtains a percentile rank of 34, for example, it implies that he or she outperformed 34 percent of the students in the norm group (Bond 1996).

Norms are used to compare the performance of a person or a group of students against the norm group. Norms may be broken down by factors such as age, grade, region, and special-need group on a test (Brown 1976; Noll, Scannell & Craig, 1979). To understand how a person compares with others, the examination score is then referred to the test norms. For example, if a child answers 3 questions correctly on a vocabulary test, and the norms tell us that most children of the same age answer about 8 to 10 questions correctly, we can conclude that this child's vocabulary ability is not good enough compared with most children of the same age.54
From the definitions above, we can conclude that a norm-referenced test is a test used to compare one pupil with another at a similar level in a group. The standard of an NRT depends on the norm/achievement of the group.55

54 Brown 1976; Noll, Scannell & Craig, 1979: International Journal of Humanities and Social Science Invention, ISSN (Online) 2319-7722, p. 26; International Journal of Humanities and Social Science Invention, ISSN (Online) 2319-7722, ISSN (Print) 2319-7714, www.ijhssi.org, Volume 4, Issue 10, October 2015, pp. 24-30; Sukardi, E., dan Maramis, W. F. (1986). Penilaian Keberhasilan Belajar. Jakarta: Erlangga University Press.
55 Kamaluddin, Firdaus, Albert (2019)

2. The Purpose of Norm Reference Test

The purpose of an NRT is to know the position or rank of the learners in the group (in class). The test questions the teacher uses should be neither too difficult nor too easy, so the difficulty index should range from about 0.3 to 0.7. Besides that, an NRT can distinguish the more able learners from the less able ones56 (Djemari 2008).
It aids in the identification of people who may require special care, such as those with mental retardation, learning difficulties (autism, dyslexia), attention deficit hyperactivity disorder (ADHD), and conduct disorder (Stiggins 1994). Such a test discourages bias and partiality among pupils. Because NRTs are produced by testing professionals, piloted, and refined before being used with students, their quality is typically excellent, and they are trustworthy and stable for what they are meant to assess. An NRT is also useful for administrative tasks, such as ranking and categorizing pupils (Anastasi 1988). Its purpose is to assess school performance and hold schools accountable for meeting learning standards and sustaining educational quality. Such exams can also be used to assess whether or not a young child is ready for preschool or kindergarten; oral language abilities, visual-motor skills, and cognitive and social development may all be assessed with them (Klein, 1990).
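The 0.3-0.7 difficulty-index guideline mentioned above is easy to compute: the difficulty index of an item (often called its p-value) is simply the proportion of test takers who answered it correctly. The following sketch is a hypothetical illustration; the item names and counts are invented:

```python
def difficulty_index(correct: int, total: int) -> float:
    """Proportion of test takers who answered the item correctly (the p-value)."""
    return correct / total

# Hypothetical items: number of correct answers out of 30 test takers.
correct_counts = {"item1": 18, "item2": 27, "item3": 5}
TOTAL_TAKERS = 30

# Keep only items whose difficulty index falls in the recommended 0.3-0.7 range.
usable = [name for name, c in correct_counts.items()
          if 0.3 <= difficulty_index(c, TOTAL_TAKERS) <= 0.7]
print(usable)  # ['item1']  (0.6 is in range; 0.9 and ~0.17 are not)
```

Items outside the range are either too easy or too hard to spread test takers apart, which is why they are avoided on a norm-referenced test.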

3. The Characteristics of Norm-Referenced Test

Here are some characteristics of norm-referenced tests:


1. The result/assessment of a norm-referenced test is used to determine the status of each learner relative to the ability of other learners. That is, normative assessment is used when we want to know learners' ability within their community, such as a classroom or school.
2. A norm-referenced test uses "relative" criteria. This means that the values always change according to the conditions or needs at the time.
3. The value resulting from a norm-referenced test does not reflect the student's level of ability or mastery of the teaching material covered by the test; rather, it indicates the learner's position (rank) in their community (group).
4. A norm-referenced test gives a score that describes the mastery of that group.
5. Norm-referenced scores are generally reported as a percentage or percentile ranking. Scores can also be reported as "grade equivalents", "stanines", and "normal curve equivalents"; i.e., the scores range from the 1st percentile to the 99th percentile, with the average student's score set at the 50th percentile. If Jamal scored at the 63rd percentile, it means he scored higher than 63% of the test takers in the norming group. A student who scores in the seventieth percentile performed as well as or better than seventy percent of other test takers of the same age or grade level, while thirty percent of students performed better (as determined by norming-group scores). Another example is a race.

56 Djemari (2008).
6. Norm-referenced tests often use a multiple-choice format, though some include short-answer
questions. They are usually based on some form of national standards, not locally determined
standards or curricula. IQ tests are norm-referenced tests, because their goal is to see which test
taker is more intelligent than the other test takers. The median IQ is set to 100, and all test
takers are ranked up or down in comparison to that level.
7. One more question right or wrong can cause a big change in the student’s score. In some cases,
having one more correct answer can cause a student’s reported percentile score to jump more
than ten points. It is very important to know how much difference in the percentile rank would
be caused by getting one or two more questions right.
8. NRTs usually have to be completed within a time limit. Some students do not finish, even if they know the material. This can be particularly unfair to students whose first language is not English or who have learning disabilities. This "speededness" is one way test makers sort people out.
9. Norm-referenced tests can be used not only in school but also outside it, for example in theater auditions and job interviews. Theater auditions and job interviews are norm-referenced tests, because their goal is to identify the best.57
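The percentile ranks described in point 5 can be computed directly from the norming group's scores. The sketch below uses the common "percent scoring strictly below" convention (other conventions count ties as half); the norming-group scores are invented for illustration:

```python
def percentile_rank(score: float, norm_scores: list[float]) -> float:
    """Percent of the norming group scoring strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

# Hypothetical norming group of 20 scores.
norm_group = [55, 60, 62, 64, 65, 67, 68, 70, 71, 72,
              73, 74, 75, 77, 78, 80, 82, 85, 88, 92]

print(percentile_rank(76, norm_group))  # 65.0: a score of 76 beats 13 of the 20 norm scores
```

With a small norming group like this one, answering one or two more questions correctly can shift the percentile rank by several points, which illustrates point 7 above.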

4. The Examples of Norm Referenced Test

The meaning of norm in this case is the capacity or performance of the group, where the group consists of all the students who take the test. Moreover, the value of an NRT does not reflect students' level of ability and mastery of the material being tested; it only indicates the student's status in the group rankings. For example, in an English subject, a student who gets a score of 80 in class B may receive an A, while in class C a student who gets a score of 65 may also receive an A. Why? Because, as stated before, a norm-referenced test uses a relative criterion: the result or value relates only to the norm of that group. In class C the norm is low, so a score of 65 can earn an A, whereas in class B the norm is high, so a student needs a score of 80 to get an A. Other examples are the SPMB and SBMPTN.58

57 http://jurnalbidandiah.blogspot.co.id/2012/04/resume-penilaian-acuan-norma-pandan.html; International Journal of Humanities and Social Science Invention, ISSN (Online) 2319-7722, ISSN (Print) 2319-7714, www.ijhssi.org, Volume 4, Issue 10, October 2015, pp. 24-30; FairTest, 342 Broadway, Cambridge, MA 02139, (617) 864-4810, fax (617) 497-2224, web site: www.fairtest.org, e-mail: FairTest@aol.org; https://reference.com/education/advantages
IQ tests, developmental-screening tests (used to discover learning problems in young children or assess eligibility for special educational services), cognitive capacity tests, readiness tests, and so on are examples of NRTs. Well-known NRTs include the SAT (Stanford Achievement Test), CAT (California Achievement Test), MAT (Metropolitan Achievement Test), TOEFL, IELTS, and others. To put it another way, theatrical auditions, course placement, program eligibility, school admissions, and job interviews are all norm-referenced, since their objective is to choose the best applicant from a group of candidates, not to see how many of them fulfill a set of criteria (Brigance, 2004).
Based on the examples above, we can conclude that the standard of an NRT depends on the norm or achievement of the learners after the test has finished, and this is the last step of learning. The highest score becomes the standard for the other students/learners.

5. The Advantages of Norm Reference Test

There are several advantages of norm-referenced assessment (PAN), as presented below:

• NRT results enable teachers to treat students positively, as unique individuals.
• NRT results provide good information about a student's position in the group.
• NRTs can be used to select prospective students rigorously.
• Norm-referenced tests are easiest to use when comparing students' progress and performance.
• High and reliable test quality, standardized procedures, and meaningful information about average performance.

6. The Disadvantages of Norm Reference Test/ Drawbacks of NRT


58 Kamaluddin, Firdaus, Albert (2019)
There are some disadvantages of NRTs, as presented below:
1. They say little about students' competence, i.e., about what they know or can do.
2. Multiple-choice tests, the dominant NRT format, promote rote learning and memorization in schools over more sophisticated cognitive skills, such as argumentation, conceptualization, decision making, writing, critical reading, analytical thinking, problem solving, and creativity (Corbett & Wilson, 1991).
3. The ranking is not fair to students, because a rank depends not only on a student's level of achievement but also on the achievements of other students.
4. They do not determine whether someone passes, but rather the ranking of students in a particular group.
5. They underline the differences in achievement between students.
6. The assessment is based on the score distribution (bell curve) by using a formula.
7. They lack transparency, because the basis of a student's final assessment result is unknown.
8. NRT scores are often misused in schools when making critical educational decisions, such as grade promotion or retention, which can have potentially harmful consequences for some students (Huitt 1996).
9. Overreliance on NRT results can lead to inadvertent discrimination against minority groups and low-income student populations, both of which tend to face more educational obstacles than non-minority students and higher-income households (Bond 1996).60
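Point 6 refers to grading on the score distribution (the bell curve). One common way to do this, shown here as a sketch under invented assumptions (the grade cutoffs and raw scores are illustrative, not taken from the text), is to standardize each raw score against the group's own mean and standard deviation and assign grades by z-score:

```python
import statistics

def z_scores(scores: list[float]) -> list[float]:
    """Standardize raw scores against the group's own mean and spread."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population standard deviation
    return [(s - mean) / sd for s in scores]

def curve_grade(z: float) -> str:
    """Illustrative cutoffs: the grade depends on position in the distribution."""
    if z >= 1.0:
        return "A"
    if z >= 0.0:
        return "B"
    if z >= -1.0:
        return "C"
    return "D"

raw = [50, 60, 65, 70, 85]     # hypothetical class: mean 66, pstdev ~11.58
grades = [curve_grade(z) for z in z_scores(raw)]
print(list(zip(raw, grades)))  # [(50, 'D'), (60, 'C'), (65, 'C'), (70, 'B'), (85, 'A')]
```

Because the mean and spread come from the group itself, the same raw score can earn different grades in different groups, which is exactly the relative-criterion behavior (and the transparency complaint) described in the list above.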
7. The Use of Norm Reference Test

Norm-referenced tests are a form of standardized testing that compares "normal" skill levels to those of individual students of the same age. By comparing students to one another, it is possible to determine whether, how, and to what degree a particular student is ahead of or behind the norm. These tests help to diagnose learning disorders and also help special education teachers and other professionals develop appropriate program planning for students with disabilities. In addition, schools use norm-referenced tests because:

60 Sukardi, E., dan Maramis, W. F. (1986). Penilaian Keberhasilan Belajar. Jakarta: Erlangga University Press; Bistok Sirait (1985). Menyusun Tes Hasil Belajar. Semarang Press; http://jurnalbidandiah.blogspot.co.id/2012/04/resume-penilaian-acuan-norma-pandan.html; International Journal of Humanities and Social Science Invention (on the drawbacks of NRT), p. 28.

1. To compare students, it is often easiest to use a norm-referenced test, because such tests were created to rank test takers. If there are limited places (such as in a "Gifted and Talented" program) and choices have to be made, it is tempting to use a test constructed to rank students, even if the ranking is not very meaningful and keeps out some qualified children.
2. NRTs are a quick snapshot of some of the things most people expect students to learn. They are relatively cheap and easy to administer. If they were used only as one additional piece of information, and not much importance was placed on them, they would not be much of a problem.

GROUP 5
2.1 Definition of Validity
Validity is defined as the extent to which something measures what it is intended to measure. "Validity is a standard or basic measure that demonstrates appropriateness, usability, and soundness, all of which contribute to the interpretation of an evaluation technique in accordance with the assessment's goal."61 Experts' definitions of validity include the following. In Sudjana's opinion (2004: 12), validity is an assessment tool's accuracy with respect to the concept being assessed: actually assessing what should be assessed. According to Azwar (1987: 173), it is "the degree to which the accuracy and precision of a measuring device (test) fulfil its measuring function."

Validity in language testing concerns how logical and true the interpretations and decisions made on the basis of scores (or, more generally, data) from assessments are. Validity has traditionally been considered a trait of tests: a test is valid if it measures what it is supposed to measure and nothing more (Brown & Abeywickrama, 2010; Lado, 1961).62 However, many teachers and experts no longer use this view as the basis of educational measurement or of language testing in general.

The following definition of validity in assessment is from the American Educational Research Association (AERA), the American Psychological Association, and the National Council on Measurement in Education (NCME; 2014, p. 11): 'the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests'. Earlier, Messick (1989, p. 13)63 provided a similar definition that was welcomed in language testing from its inception. To him, validity is 'an overall evaluative judgement of the degree to which evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores'.

When an instrument performs its measuring function correctly, or produces a measurement result consistent with the measurement's purpose, it is said to have high validity. According to Arikunto (1999: 65), validity is a metric that indicates a test's level of soundness: if a test measures what it claims to measure, it is considered valid. Validity is a metric that indicates the dependability or accuracy of an instrument.64 Thus, validity refers to the extent to which an instrument carries out its function: an instrument is valid if it can be used to measure what is to be measured (Achmad Nurmadi, 2019: 277). The importance of validity in measurement cannot be overstated. If a measure measures what it is supposed to measure, it is said to be valid; as a result, validity is closely linked to concepts and definitions (Euis Sunarti, 2021: 4). Validity also refers to the ability of data-collection instruments to measure what should be measured, in order to obtain data relevant to what is being measured (Dempsey, 2002: 79).65

61 Nurmadi dkk. (2019). Guidelines for SPMI PTMA. Yogyakarta: PP Muhamadiyah Diktilitang Council.
62 Brown, H. D. (1994). Teaching by Principles: An Interactive Approach to Language Pedagogy. Englewood Cliffs, NJ: Prentice Hall Regents.
63 American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing.
64 Euis Sunarti (2021). Family Measurement Inventory. Bogor: IPB Press.
65 Juhana Nasrudin (2019). Educational Research Methodology. Bandung: Terra Firma.

2.2 Types of Validity
1. Content validity
Content validity refers to the condition of an instrument being grounded in the content of the subject matter. According to Sukardi (2008), it is the degree to which a test measures the scope of the substance to be measured. According to Suharsimi Arikunto (2013: 82), a test is said to have content validity if it measures specific objectives in accordance with the material or lessons given. Based on the above understanding, the content or material to be tested must be relevant to the experience, abilities, and background of the subjects. As researchers, we provide material appropriate to the subjects' ability. For example, if we conduct research on students to assess their learning abilities, the material we give them must be relevant to, and consistent with, their learning material, e.g. high-school-level English subjects.66 By doing this, the content validity of what we are testing is valid and appropriate. Content validity for high school students can be said to be invalid if we use English material that is not at the students' level, because it is then not relevant to the object of the research.

2. Predictive validity
To predict means to forecast; predictions always concern things that have not yet happened. According to Sukardi (2008), predictive validity is the degree to which a test can predict how well a person will perform a planned task or a prospective job.67 According to Suharsimi Arikunto (2013), a test is said to have predictive (or forecast) validity if it is able to predict what will happen in the future. From the understanding described above, predictive validity in general means predicting a behavior of a person or subject that is expected to occur in reality, based on a theory or on an instrument that has been given, whether directly or in the future. The prediction must, of course, be borne out in reality as expected and meet the criteria of the individual or group under careful study; in other words, it then becomes predictively valid. For example, if we give a general knowledge test, we expect the test results to indicate a high or low value, as determined by predictive validity.

3. Empirical validity
The term empirical validity contains the word empirical, which means based on experience. According to Suharsimi Arikunto (2013), an instrument can be said to have empirical validity if it has been tested through experience.68 Empirical validity rests on an explanation that is known from experience and is arranged based on the background possessed by the object; in other words, an instrument can be said to be empirically valid once it has been tested against experience. For example, a person can be said to be creative or honest if, in daily life and from experience, that person is proven to be honest and creative.

66 Arikunto, Suharsimi. (2005). Dasar-dasar Evaluasi Pendidikan. Jakarta: Bumi Aksara.
67 Sudijono, Anas. (1996). Pengantar Evaluasi Pendidikan. Jakarta: PT Raja Grafindo Persada.
68 Thoha, M. Chabib. (2003). Teknik Evaluasi Pendidikan. Jakarta: PT Raja Grafindo Persada.

4. Construct validity
According to Djaali and Pudji (2008), construct validity concerns how far the test items are able
to measure what they are really intended to measure, in accordance with a specific concept or
concept definition that has been established,69 while according to Suharsimi Arikunto (2013) a
test has construct validity if the items that make up the test measure every aspect of thinking
stated in the instructional objectives. 70The explanations above show that an instrument or
material is constructively valid if it corresponds to the theoretically established concepts and
predetermined aspects. For example, if we give an English test to high school students, we must
construct the test according to the aspects and concepts of the English material that have been
set out for that level.

5. Appearance validity

Face validity is the part of content validity with the lowest significance, because it is based
only on an assessment of the appearance (format) of the test and the appropriateness of its
context. 71Face validity indicates that, judged by its appearance alone, the instrument appears
to measure what it is intended to measure. This form of validity is generally considered when
instruments are used to measure a person's intelligence, skill, or talent.

69
Arikunto, Suharsimi. 2011. Dasar-Dasar Evaluasi Pendidikan (Edisi Revisi). Jakarta : PT. Bumi Aksara.

70
Arikunto, S. (1997). Dasar-dasar Evaluasi Pendidikan. Jakarta: Bumi Aksara

71
Sukardi. (2008). Evaluasi Pendidikan. Jakarta: Bumi Aksara

6. Cultural validity
Cultural validity, or more precisely inter-cultural validity, is very important for research
conducted in countries whose ethnic groups vary widely. In addition, research conducted
simultaneously in several countries with the same measuring instrument will also face the problem
of cultural validity.72 A measuring instrument that is valid for research in one country will not
necessarily be valid in another country with a different culture. Establishing cultural validity
is therefore important, as it tests whether an instrument transfers across cultures: the
validation we carry out must follow the culture under study, and validity evidence gathered in
one country's culture should not simply be reused in another. 73For example, if we have validated
an instrument in the United States and then use it in Australia on the basis of that earlier
American validation alone, the result may be invalid, because every country has a different
culture.

7. Criterion Validity
Criterion validity, or concrete validity, is the extent to which a measure is related to an
outcome: it indicates how well one measure predicts an outcome on another measure. A test has
this type of validity if it is useful for predicting performance or behavior in another situation
(past, present, or future). Criterion validity is an alternative perspective that does not
emphasize the conceptual meaning or interpretation of test scores. Test users simply want to use
a test to distinguish between groups of people or to make predictions about future outcomes. For
example, a director of human resources may need a test to predict which applicants are likely to
perform well as employees. From a very practical point of view, she focuses on the test's ability
to distinguish good employees from bad ones; if the test does this well, it is "valid" enough for
her purposes. From the traditional three-part view of validity, criterion validity refers to the
extent to which test scores can predict specific criterion variables. From this perspective, the
key to validity is the empirical relationship between test scores and scores on the relevant
criterion variable, such as "performance". Messick (1989) suggests that "even for purposes of
applied decision-making, reliance on criterion validity or content coverage is not sufficient.
The meaning of the criterion, and therefore its validity, must always be pursued - not only to

72
Arikunto, Suharsimi. 2007. Dasar-dasar Evaluasi Pendidikan. Jakarta: Bumi Aksara.

73
Amir Daien Indrakusuma. 1975. Evaluasi Pendidikan. Jilid I, terbitan sendiri.

support test interpretation, but also to justify test use". 74There are three types of
criterion-related validity evidence: predictive validity, concurrent (simultaneous) validity, and
postdictive validity.

A. Predictive validity

A test is predictively valid if it accurately predicts what it is supposed to predict. The term
also refers to designs in which scores on the predictor measure are taken first and the criterion
data are collected later. In other words, it is the ability of one assessment tool to predict
future performance, either in an activity or on another assessment of the same construct.75 The
most direct way to establish predictive validity is to conduct a long-term validation study: for
example, administering employment tests to job applicants and then checking whether the test
scores correlate with the later performance of the employees who were hired. Predictive validity
studies take a long time to complete and require fairly large samples to yield meaningful
results. In short, predictive validity evaluates the operationalization's ability to predict
something that it should, in theory, be able to predict.
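As an illustration (not drawn from the sources cited above), the predictive validity coefficient is commonly estimated as the Pearson correlation between test scores and the criterion collected later; the following sketch uses invented data:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Invented data: selection-test scores at hiring time, and supervisors'
# performance ratings for the same people one year later.
test_scores = [55, 62, 70, 74, 80, 85, 90]
performance = [2.1, 2.8, 3.0, 3.4, 3.2, 3.9, 4.2]

r = pearson_r(test_scores, performance)
print(f"predictive validity coefficient r = {r:.2f}")
```

A coefficient near 1 suggests the test orders candidates much as the later criterion does; a coefficient near 0 would suggest the test has little predictive value for that criterion.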

B. Simultaneous/Concurrent validity

Concurrent (simultaneous) validity is a type of evidence that can be gathered to defend the use
of a test for predicting other outcomes.76 It refers to the extent to which the results of a
particular test or measurement correspond to those of a previously established measure of the
same construct. In short, concurrent validity assesses the operationalization's ability to
distinguish between groups that it should, in theory, be able to distinguish.
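Since concurrent validity is often checked by seeing whether a test separates groups it should theoretically separate, one simple known-groups check is a standardized mean difference. This is a hypothetical sketch with invented scores, not a procedure prescribed by the sources above:

```python
from math import sqrt

def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups, using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / na, sum(group_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd

# Invented scores on the new test for learners already classified as
# "advanced" or "beginner" by an established proficiency measure.
advanced = [78, 83, 85, 90, 88, 80]
beginners = [55, 60, 52, 64, 58, 61]

d = cohens_d(advanced, beginners)
print(f"standardized group difference d = {d:.2f}")
```

A large positive difference supports the claim that the new test distinguishes the groups the established measure distinguishes; a difference near zero would count against its concurrent validity.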

C. Post dictive validity


74
Saifuddin azwar. 1997. Reliabilitas dan Validitas. Yogyakarta: Pustaka Pelajar.

75
Arikunto, Suharsimi. 2005. Dasar-dasar Evaluasi Pendidikan. Jakarta: Bumi Aksara.

76
Sudijono, Anas. 1996. Pengantar Evaluasi Pendidikan. Jakarta: PT Raja Grafindo Persada.

For this type of validity, the criterion measure lies in the past: the criterion (for example,
another test) was administered at an earlier time. 77It is a form of criterion-related validity,
determined by the extent to which scores on a given test are related to scores on another,
already established test or criterion administered at a previous time.

2.1 Validate Assessment

There are several steps you can take to validate your assessment. Keep in mind that you should
also ask yourself the following questions to make the assessment easier to validate.78

1. Have you understood the learning information you got?

2. Have you previously received this information in a different place?

This is because you must first answer the questions you set for yourself; once you can answer
them, you can be said to be in a position to validate your assessment. Then consider:

1. What will it be like when it's easy to get?

2. Who did you do all that for?

3. What will happen if you give up without knowing what will happen?

4. Do you know what kind of experience you will get?

5. What is the purpose of the information you get?

6. What skills did you get after getting the information?

7. What are the benefits after you get the information or the concept of the assessment?

77
Thoha, M. Chabib. 2003. Teknik Evaluasi Pendidikan. Jakarta: PT Raja Grafindo Persada.

78
Kamaluddin, Sale Firdaus, & Alberth. Language Assessment Principles.

How to Validate Assessment

Validation is a process.

Validation refers to the process of collecting validity evidence to evaluate the appropriateness
of the interpretations, uses, and decisions based on assessment results. This definition
highlights several important points. 79First, validation is a process, not an endpoint. Labeling
an assessment as "validated" means only that the validation process has been applied, i.e., that
evidence has been collected. It does not tell us what process was used, the direction or
magnitude of the evidence (was it favorable or unfavorable, and to what degree?), what gaps
remain, or for what context (learner group, learning objectives, educational setting) the
evidence is relevant.
Second, validation involves the collection of validity evidence, as we discuss in a following
section.

Why is assessment validation important?

Rigorous validation of educational assessments is critically important for at least two reasons.
First, those using an assessment must be able to trust its results. Validation does not give a
simple yes/no answer regarding validity; rather, a judgment of validity depends on the intended
application and context and is usually a matter of degree. Validation provides the evidence
needed to make such judgments, together with an appraisal of remaining gaps. Second, the number
of assessment instruments, tools, and activities is essentially infinite, since every new
multiple-choice question, scale item, or test station creates a de facto new instrument.

79
Cook, D. A., & Hatala, R. (2016). Validation of educational assessments: a primer for simulation and beyond. Advances in
Simulation, 1(1), 1-12.

Steps To Validate

According to Mitchell, in the journal article entitled "Surgical Education, Simulation, and
Simulators - Updating the Concept of Validity," there are several steps to validation:80

1. Define the construct and proposed interpretation. Validation begins by considering the
construct of interest.

2. Make explicit the intended decision(s). Without a clear idea of the decisions we anticipate
making based on those interpretations, we will be unable to craft a coherent validity argument.

3. Define the interpretation-use argument, and prioritize needed validity evidence.

4. Identify candidate instruments and/or create/adapt a new instrument. We should identify a
measurement format that aligns conceptually with our target construct and then search for
existing instruments that meet, or could be adapted to, our needs. A rigorous search provides
content evidence to support our final assessment.

5. Appraise existing evidence and collect new evidence as needed. Although existing evidence
does not, strictly speaking, apply to our situation, for practical purposes we will rely heavily
on existing evidence as we decide whether to use an instrument.

6. Keep track of practical issues, including cost. An important yet often poorly appreciated and
under-studied aspect of validation concerns the practical issues surrounding development.

7. Formulate/synthesize the validity argument in relation to the interpretation-use argument.

8. Make a judgment: does the evidence support the intended use? The final step in validation is
to judge the sufficiency and suitability of the evidence, i.e., whether the validity argument and
the associated evidence meet the demands of the proposed interpretation-use argument.

Factors Affecting Validity


80
Goldenberg, M., & Lee, J. Y. (2018). Surgical education, simulation, and simulators—updating the
concept of validity. Current Urology Reports, 19(7), 1-5.

Based on Asaad, Abubakar (2004), in Measurement and Evaluation: Concepts and Applications, the
factors that affect test validity include:81

1. Inaccuracy of test items.

This means that in measuring understanding, thinking skills, and other complex types of
achievement, a test form that is suitable only for measuring factual knowledge cannot be used,
because doing so will invalidate the results.

2. Direction of the questions.

This means that instructions which are not displayed clearly will confuse students when
responding to the questions and recording their answers, which tends to reduce the validity of
the items.

3. Reading vocabulary and sentence structure.

Vocabulary and sentence structures that are not appropriate to the students' level will result in
an invalid measure of reading comprehension or intelligence.

4. The level of difficulty of the items.

The questions used in the test must match the students' level: if the questions are too easy or
too difficult, the test will not be able to distinguish students who have mastered the material
from those who have not, which reduces the validity of the test.

5. Poorly constructed test items.

Test items that inadvertently provide clues to the answer will tend to measure students'
alertness in detecting clues, and the important aspects of student performance that the test is
meant to measure will be affected.

6. Length of test questions.

81
Asaad, Abubakar S. (2004). Measurement and Evaluation Concepts and Application (Third Edition). Manila: Rex
Bookstore Inc.

A test must have a sufficient number of items to measure what it is supposed to measure. If a
test is too short to provide a representative sample of the performance to be measured, its validity
will decrease.

7. Arrangement of items.

The test questions should be arranged in order of increasing difficulty. Placing difficult items
at the beginning of the test can cause mental blocks and may take up so much of the students'
time that they never reach items they could easily answer. Inappropriate arrangement can
therefore also affect validity by having a detrimental effect on student motivation.

8. Pattern of answers.

A structured, predictable pattern of correct answers will likewise reduce the validity of the
test.

9. Ambiguity.

Ambiguous statements in test items contribute to misinterpretation and confusion. Ambiguity
sometimes confuses bright students more than poor students, causing items to discriminate in a
negative direction.
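Several of the factors above, especially the difficulty level of items, can be inspected numerically through a classical item analysis. The following sketch is illustrative only, with an invented 0/1 response matrix; it is not a procedure prescribed by Asaad (2004):

```python
# Invented response matrix: one row per student, one column per item,
# 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
]

n_students = len(responses)
n_items = len(responses[0])

# Difficulty index p: the proportion of students answering the item correctly.
# Items near p = 1 (too easy) or p = 0 (too hard) cannot separate students
# who have mastered the material from those who have not.
for item in range(n_items):
    p = sum(row[item] for row in responses) / n_students
    flag = " <- too easy" if p > 0.9 else " <- too hard" if p < 0.2 else ""
    print(f"item {item + 1}: p = {p:.2f}{flag}")
```

In this invented data, the item answered correctly by everyone and the item answered by almost no one would both be candidates for revision, for exactly the reason given under the difficulty factor above.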

GROUP 6

a. 1. Validity
This refers to the extent to which the test measures what it purports to measure. For example,
when an intelligence test is developed to assess the level of intelligence, it should assess the
intelligence of the person, not other factors.

It means measuring what it is supposed to measure. It tests what it is supposed to test. A good
test that measures grammatical control should not have difficult lexical items. Validity tells us
whether the test meets its development goals.

A test is valid if it measures what we want to measure and nothing else. Validity is a concept
that depends on the test and its purpose, whereas reliability is a purely statistical parameter.

1.1 Definitions of Validity According to Experts

 According to Goodwin, a test was considered to be either valid or not as evidenced by the
correlations between the test and some other "external criterion measure"82.
 According to Hughes, validity in testing and assessment has traditionally been understood
to mean discovering whether a test "measures accurately what it is intended to measure"
(1989, p. 4).
 According to Messick,

a. Validity is an overall evaluative judgment, founded on empirical evidence and
theoretical rationales, of the adequacy and appropriateness of inferences and actions
based on test scores (1998, p. 33).
b. Validity is thus an overall evaluative judgement of the degree to which empirical
evidence and theoretical rationales support the adequacy and appropriateness of
interpretations and actions based on test scores or other methods of assessment
(1989, p. 245).
82
EL.LE Vol.1-Num.1-Marzo 2012. New Views of Validity In Language Testing. Claudia D’este.
Davies, A.; Elder, C. (2005). Validity and validation in language testing. In: Hinkel, E. (ed.), Handbook of research
in second language teaching and learning. Long Beach: Lawrence Erlbaum Associates,
Lado, R. (1961). Language testing: The construction and use of foreign language tests. London: Longman.
Messick, S. (1989). Validity. In: Linn, R.L. (ed.), Educational measurement. 3rd edn. New York: Macmillan, pp. 13-
104.

32
 According to Sugiyono, validity is the degree of genuineness of a measuring instrument. A
valid instrument means that the instrument used to collect the data is valid, that is, it
can be used to measure what ought to be measured (2007, p. 17).

 According to Lado, Validity is a matter of relevance: a test is considered valid when test
content and test conditions are relevant and there are no «irrelevant problems which are
more difficult than the problems being tested» (1961, p. 321).

 According to Alderson, Clapham, and Wall, the first type of validity study refers to the
perceived content of the test and its perceived effect, whilst the second type relates to
studies comparing students' test scores with measures of their ability collected from outside
the test (1995, p. 171).

From the definitions above, we can conclude that:

 Validity refers to the extent to which a test, an instrument, or a process measures what it
is intended to measure.

 Validity uncovers the appropriateness of a given test, or any of its component parts, as a
measure of what it is purposed to measure83.

1.2. Types of Validity

 Content Validity: the extent to which the test measures a representative sample of the
content to be tested, at the level of learning desired by the teacher.

 Criterion-Related Validity: validity established against an existing criterion; it is
investigated through the correspondence between the scores obtained from the newly developed
test and the scores obtained from one or more independent external criteria.

 Construct Validity: this type refers to the measurement of certain properties of theoretical
constructs. It is based on the extent to which the problems in the test reflect important
aspects of the theory on which the test is based.

b. 2. Reliability
2.1. Definition of Reliability
83
Chapelle, Carol A. (1999). Volume 19, pp. 254-272. doi: https://doi.org/10.1017/S0267190599190135.
Published online: 08 August 2003.

According to Anastasia and Susana (1997), reliability refers to the consistency of the scores
achieved by the same people when they are retested with the same test on different occasions,
with different sets of equivalent items, or under different test conditions.[84]

According to Masri Singarimbun, reliability is an index that indicates the extent to which a
measuring instrument is reliable or unreliable. If a measuring instrument is used twice to
measure the same phenomenon and the measurement results obtained are relatively consistent, then
the instrument is reliable. In other words, reliability shows the consistency of a measuring
instrument in measuring the same phenomenon.

According to Sumadi Suryabrata, reliability indicates the extent to which the results obtained
with an instrument can be trusted. Measurement results must be reliable in the sense of having an
adequate level of consistency and stability.

According to Clark (1975), reliability is in fact a prerequisite to validity in performance
assessment, in the sense that the test must provide consistent, replicable information about
candidates' language performance. Jones (1979) adds that no test can achieve its intended purpose
if its results are unreliable.

In the view of Aiken (1987: 42), a test is said to be reliable if the scores obtained by
participants are relatively the same despite repeated measurement. [85]

Based on the definitions mentioned above, we can conclude that reliability is, in general, an
indicator of the consistency and constancy of a series of measurements: learners who take a test
obtain essentially the same score even when the test is administered repeatedly.

For example, if teacher A gives a test to Johnny on Monday morning and scores him with
90/100, a reliable test will give the same result to Johnny if he had taken the test on Monday

84[]
https://p4mristkippgrisda.wordpress.com/2011/05/10/uji-validitas-dan-reliabilitas/
85[]
https://mmeri3328.blogspot.com/2016/01/reliabilitas-dan-validitas-tes.html?m=1
Alderson, J.C.; Clapham, C.; Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge
University Press

afternoon with teacher B, or if a student similar to Johnny takes the test on Tuesday with
another teacher. [86]
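As a statistical illustration (not part of the definitions above), one widely used operationalization of consistency across a set of equivalent items is Cronbach's alpha; the scores below are invented for the sketch:

```python
def cronbach_alpha(item_scores):
    """Internal-consistency reliability; rows are students, columns are items."""
    k = len(item_scores[0])  # number of items

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented 1-5 ratings from five students on four equivalent items.
scores = [
    [4, 5, 4, 4],
    [2, 3, 3, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
]
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```

An alpha close to 1 indicates that the equivalent items rank students consistently, in the same spirit as the Johnny example: the score should not depend on which occasion or which equivalent items were used.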

c. 3. Practicality87

3.1 Definition of Practicality


Practicality in a test refers to whether the test is manageable: it is neither too long nor too
short, it is easy to grade, and it is appropriate for the students. An effective test is
practical. This means that it
 Is not excessively expensive
 Stays within appropriate time constraints
 Is relatively easy to administer
 Has a scoring/evaluation procedure that is specific and time-efficient.

A test that is prohibitively expensive is impractical. A test of language proficiency that takes
a student five hours to complete is impractical: it consumes more time (and money) than necessary
to accomplish its objective. A test that requires individual one-on-one proctoring is impractical
for a group of several hundred test-takers and only a handful of examiners. A test that takes a
few minutes for a student to take and several hours for an examiner to evaluate is impractical
for most classroom situations.

Validity and reliability are not enough to build a test. The test should also be practical in
terms of time, cost, and energy. With regard to time and energy, tests should be efficient to
construct, administer, and evaluate. The tests must also be affordable: a valid and reliable test
is of little use if it cannot be administered in remote areas because it requires an expensive
computer (Heaton, 1975: 158-159; Weir, 1990: 34-35; Brown, 2004: 19-20).

3.2 In Terms of Practicality88


Brown (2004: 19) defines practicality in terms of:

1. Cost

86[]
Zamani, Kamaluddin. 2017. Language Testing. Kendari : Halu Oleo University, p. 20.

87
(Bachman and Palmer, 1996: 35-36; Brown, 2004; Unit 8A of Language Testing and Assessment: An
Advanced Resource Book by Glenn Fulcher and Fred Davidson; Davies et al., 1999: 148)
88
(Brown 2004:19)

a.) The test should not be too expensive to conduct.
In the process of making and administering a test, the teacher should keep costs down so that
the test is not too expensive to conduct.
b.) The cost of the test has to stay within the budget.
Every school in Indonesia has a budget estimate for testing, so a teacher who makes a test
should keep within that estimate.
c.) Avoid conducting a test that requires an excessive budget.
In this case, the teacher should work within the estimate in order to avoid conducting a test
that requires an excessive budget.

For example:

What do you think if a teacher conducts "daily tests" for one junior-high-school class of 30
students that cost IDR 500,000 per student? Is that practical in terms of cost? It is not,
because it spends far too much money; the cost should be in proportion to what the test or
assessment actually needs, so that the test remains easy to conduct.

2. Time

a) The test should stay within appropriate time constraints.

In this case, the teacher should manage the process of constructing the test so that it is
finished within appropriate time constraints.

b) The test should not be too long or too short.

This means the test should be appropriate to the lesson the teacher is covering, and its length
should match the time available: if the test is not long, the time allotted to it should also
not be long.

For example:

What do you think if a teacher wants to conduct a test of language proficiency that will take a
student hours to complete? Is that practical in terms of time? It is not, because it spends too
much time. The teacher 89 should manage time wisely, because if too much time is used for the
test, that time is wasted.
89
(Brown 2004:19)

3.3 Effective Practical Tests90

An effective practical test means that it:

1) is not excessively expensive. This statement refers to cost: the cost of the test should not
be high, and when teachers make a test they should stay within the estimate given by the
school. For example, in making a test the teacher should spend money in line with the budget
the school has provided.
2) stays within appropriate time constraints. This refers to time: how the teacher allocates
time both for administering the test and for constructing it. For example, when the teacher
makes a test, the time it requires should not be too long, so that enough time can be
allotted to the students; likewise, the process of finishing the test should not take too
long.
3) is relatively easy to administer. If the test is easy to administer, students can easily
understand what to do, so the teacher should make the test uncomplicated to conduct. For
example, if the teacher gives students a test they can readily understand, they can complete
it comfortably.
4) has a scoring/evaluating procedure that is specific and time-efficient. In scoring and
evaluation, the teacher's procedure should be specific and use time efficiently, so that
scoring does not consume too much time. For example, when evaluating students' learning in
class, the teacher should still score within an efficient amount of time.

d. 4. Washback

4.1 Definition

The effects of tests on teaching and learning are called washback. Teachers must be able
to create classroom tests that serve as learning devices through which washback is achieved.
Washback enhances intrinsic motivation, autonomy, self-confidence, language ego,
interlanguage, and strategic investment in the students. Instead of giving letter grades and
numerical scores which give no information about the students' performance, giving generous and
specific
90
(Alderson and Wall, 1993, see Unit B5; Wall, 2000; Cheng and Watanabe, 2004)

comments is a way to enhance washback (Brown 2004: 29).Heaton (1975: 161-162) mentions
this as backwash effect which falls into macro and micro aspects. In macro aspect, tests impact
society and education system such as development of curriculum. In micro aspect, tests impact
individual student or teacher such as improving teaching and learning process.

The term washback is used to talk about the effect that tests have on what goes on in the
classroom. For many years it was simply assumed that 'good tests' would produce 'good washback'
and, conversely, that 'bad tests' would produce 'bad washback'. We now know that it is much more
complex than this, as Alderson and Wall (1993) have shown (see Unit B5).

The whole notion of washback extends beyond the immediate impact on the classroom
to how tests affect schools, educational systems and even countries, although this is sometimes
referred to as the study of impact, rather than washback (see Wall, 2000; Cheng and Watanabe,
2004).

The study of washback is also concerned with the political use of tests to 91implement changes in
classrooms that governments regard as improvements. The term washback or backwash refers to the
effect of testing on teaching and learning (Hughes, 2003, p. 1).

The concept of washback is therefore part of what Messick (1989) calls consequential
validity. As part of consequential validity, Messick (1996:241) says that: Washback refers to the
extent to which the introduction and use of a test influences language teachers and learners to do
things that they would not otherwise do that promote or inhibit language learning.

Though washback is often described as "the influence of testing on teaching and learning"
(Anderson & Wall, 1993), other researchers have underscored different dimensions of this
concept.

1. The term washback is commonly used in applied linguistics; it rarely appears in dictionaries.

2. However, the word washback or backwash can be found in certain dictionaries, where it is
defined as "an effect that is not the direct result of something" by the Cambridge Advanced
Learner's Dictionary.

91
(Hughes, 2003, p. 1; Messick, 1996: 241; Anderson & Wall, 1993; Brown, 2004; or Backwash:
Heaton, 1990; Cheng et al. (Eds), 2008: 7-11; Cheng and Curtis, 2008: 10)

3. In dealing with principles of language assessment, these two words somehow can be
interchangeable.
4. Washback (Brown, 2004) or Backwash (Heaton, 1990) refers to the influence of testing on
teaching and learning.
5. The influence can be positive or negative (Cheng et al. (Eds), 2008: 7-11).

From these experts, we can conclude that washback refers to the outcomes of assessment for
learners and teachers, and describes the effect that tests have on what goes on in the learning
and teaching process in the classroom.

4.2 Positive Washback


Positive washback (Cheng and Curtis, 2008: 10):
 Positive washback has a beneficial influence on teaching and learning. It means teachers and
students have a positive attitude toward the examination or test and work willingly and
collaboratively towards its objectives.
 Assessment can motivate students to learn more, positively influence the 92teacher in what
and how to teach, and improve the classroom environment for further learning.
 A test motivates students to work harder, gives them a sense of accomplishment, and thus
enhances learning.

Positive washback refers to expected test effects. For example, a test may encourage
students to study more or may promote a connection between standards and instruction.

4.3 Negative Washback


Negative washback (Cheng and Curtis) and (Fulcher & Davidson):

a) Negative washback does not give any beneficial influence on teaching and learning
(Cheng and Curtis, 2008:9).
b) Test that are designed to punish or fail many students do not result in positive washback
(Cheng and Curtis, 2008:9).

92
(Cheng and Curtis: 2008:9) and (Fulcher & Davidson, 2007:225, 229).

c) Tests which have negative washback are considered to have a negative influence on teaching
and learning (Cheng and Curtis, 2008: 9).
d) The quality of washback might be independent of the quality of the test (Fulcher &
Davidson, 2007:225).
e) It seems that there is no way to generalize about washback at the present time. Teaching
and learning will be impacted in many different ways depending on the variables at play in a
specific context. What these variables are, how they are to be weighted, and whether we can
discover patterns of interaction that hold steady across contexts is a matter for ongoing
research (Fulcher & Davidson, 2007: 229).

Negative washback refers to the unexpected, harmful consequences of a test. For example,
instruction may focus too heavily on test preparation at the expense of other activities.
Washback from tests can involve individual teachers and students as well as whole classes and
programs.

One way to ensure positive washback is through instructional planning that links teaching
and testing. By selecting a test that reflects your instructional and program goals, you can more
closely align testing with instruction. The principles of backward design (Wiggins & McTighe,
2005) provide a helpful model for integrating teaching and testing.

4.4 Ways to Improve Washback (Cheng and Curtis, 2008:9)


There are several ways to improve washback:
1) Comment generously and specifically on test performance.
2) Specify the numerical scores on the various subsections of the test.
3) Distinguish formative from summative tests:
a) Formative tests provide washback in the form of information to the learner on progress
towards goals.
b) Summative tests provide washback for learners to initiate further pursuits, more
learning, more goals, and more challenges to face.

5. Authenticity

5.1 Definition of Authenticity


1) General Definition
Authenticity comes from “authentic”, meaning worthy of acceptance or belief as
conforming to or based on fact.

2) Definition in Language Testing


“An authentic text is a stretch of real language produced by a real speaker or writer for a real
audience and designed to convey a real message of some sort” (Morrow, 1977, p. 13).
Authentic materials are “spoken or written language data that has been produced in the course of
genuine communication and not specifically written for the purposes of language
teaching” (Nunan, 1999).

Bachman and Palmer (as cited in Brown, 2004:28) defined authenticity as the degree of
correspondence of the characteristics of a given language test task to the features of a target
language task. Several things must be considered in making an authentic test: the language used
in the test should be natural, the items should be contextual, the topics brought into the test
should be meaningful and interesting for the learners, the items should be organized
thematically, and the test must be based on the real world.

5.2 The Authenticity Debate


In applied linguistics, the term ‘authenticity’ originated in the mid-1960s with a concern
among materials writers such as Close (1965) and Broughton (1965) that language learners were
being exposed to texts which were not representative of the target language they were learning.
Close (1965), for example, stressed the authenticity of his materials in the title of his book The
English we use for science, which utilized a selection of published texts on science from a
variety of sources and across a range of topics. Authenticity at the time was seen as a simple
notion distinguishing texts extracted from ‘real-life’ sources from those written for pedagogical
purposes.
Others (Rea, 1978; Morrow, 1978; 1979; 1983; 1991; Carroll, 1980) equated authenticity with
what Widdowson identified as genuine input and focused on the need to use texts that had not
been simplified and tasks that simulated those that test takers would be expected to perform in
‘the real world’ outside the language classroom.

This understanding of authenticity, detailed in Morrow’s groundbreaking report of 1978,
gradually began to filter through to language testing practices of the 1980s and 1990s. In 1981,
for example, in response to Morrow’s (1978) report, the Royal Society of Arts introduced the
Communicative Use of English as a Foreign Language examination. This was the first large-
scale test to focus on students’ ability to use language in context (language use) rather than their
knowledge about a language (language usage) (Hargreaves, 1987); it was also the precursor to
the Certificates in Communicative Skills in English introduced in the 1990s by the University of
Cambridge Local Examinations Syndicate (UCLES).

5.3. Reconceptualization of Authenticity

In language teaching, the debate was taken forward by Breen (1985), who suggested
that authenticity may not be a single unitary notion, but one relating to texts (as well as to
learners’ interpretation of those texts), to tasks and to social situations of the language
classroom. Breen drew attention to the fact that the aim of language learning is to be able to
interpret the meaning of texts, and that any text which moves towards achieving that goal could
have a role in teaching. He proposed that the notion of authenticity was a fairly complex one
and that it was oversimplistic to dichotomize authentic and inauthentic materials, particularly
since authenticity was, in his opinion, a relative rather than an absolute quality.

Bachman (1990; 1991) distinguished situational authenticity – that is, the perceived
match between the characteristics of test tasks and target language use (TLU) tasks – from
interactional authenticity – that is, the interaction between the test taker and the test task.
Like Breen (1985), Bachman also recognized the complexities of authenticity, arguing that
neither situational nor interactional authenticity was absolute. A test task could be situationally
highly authentic but interactionally low on authenticity, or vice versa. This reconceptualization
of authenticity into a complex notion pertaining to test input as well as the nature and quality of
test outcome was not dissimilar to the view of authenticity emerging in the field of general
education.

Although there was no single view of what constituted authentic assessment, there
appears to have been general agreement that a number of factors would contribute to the
authenticity of any given task. For an overview of how learning theories determined
interpretations of authentic assessment, see Cumming and Maxwell (1999). Furthermore, there
was a recognition, at least by some (for example, Anderson et al. (1996), cited by Cumming and
Maxwell, 1999), that tasks would not necessarily be either authentic or inauthentic but would lie
on a continuum which would be determined by the extent to which the assessment task related
to the context in which it would be normally performed in real-life. This construction of
authenticity as being situated within a specific context can be compared to situational
authenticity discussed above.

Authenticity, as the above overview suggests, has been much debated in the literature. In
fact, there have been two parallel debates on authenticity which have remained largely
ignorant of each other. Discussions within the fields of applied linguistics and general education
– as Lewkowicz (1997) suggests – need to come closer together. Furthermore, such discussions
need to be empirically based to inform what has until now been a predominantly theoretical
debate. The questions identified earlier demonstrate that there is still much that is unknown
about authenticity. As Peacock (1997: 44) has argued with reference to language teaching:
‘research to date on this topic is inadequate, and further research is justified by the importance
accorded authentic materials in the literature’. This need is equally true for language testing,
which is the primary focus of this article.
One aspect of authenticity which has been subject to considerable speculation, but
which has remained under-researched, is test takers’ perceptions of authenticity. The
following study was set up to understand more fully the importance test takers accord to this test
characteristic, and to determine whether their perceptions of authenticity affect their
performance on a test.

5.4. Types of Authenticity

a. Emotional Authenticity

One speaker mentioned that she genuinely loves people, which obviously includes her
audiences. Her definition of being authentic is sharing those feelings of being open and honest
with people – and being a little bit vulnerable too.

This is how many people feel about authenticity – that if you put your true self out there you will
resonate with those you were meant to make connections with. Emotional authenticity comes
from the heart.

b. Logical Authenticity

In contrast, logical authenticity comes more from the head, which some might argue is not the
source of true authenticity. Nevertheless, for those who strongly believe in who they are, this
type of authenticity is the operational software of their lives.

The speaker who fits this type of authenticity suggested that the audience take a stand on
what they believe in – and not waver from it.

His authenticity comes from a place that says – this is who I am and you aren’t going to
change me.  It’s actually pretty good advice if you want to carve out a unique identity or brand
that will get you hired more frequently. Whether the source of your authenticity is from the
heart or the head, they both strengthen your identity and alignment with your ideal clients
or customers.

c. Practical Authenticity

When you combine emotional and logical authenticity you get a hybrid.  While it would be
easy to view this as being inauthentic, it is arguably the most common form of authenticity – for
very significant reasons.

To illustrate this with an example, let’s look at it from the standpoint of a professional
speaker.

Not every audience will be comfortable with all-out authenticity that comes from the heart,
especially if that audience is predominantly male.

So, as a practical matter, it may be necessary to dial things back a bit to respect the group and
the circumstances that are presently before you.

As the speaker who uses this approach noted, when he is on stage he plays a character that is
very much the same as the original – but nevertheless different.

The difference is customizing his authenticity for each individual audience – while respecting
the degree of authenticity that they are capable of receiving.

5.5. Three Standards of Authentic Instruction

Why do many innovations fail to improve the quality of instruction or student
achievement? In 1990, we began to explore this question by studying schools that have tried to
restructure. Unfortunately, even the most innovative activities—from school councils and shared
decision making to cooperative learning and assessment by portfolio—can be implemented in
ways that undermine meaningful learning, unless they are guided by substantive, worthwhile
educational ends. We contend that innovations should aim toward a vision of authentic student
achievement, and we are examining the extent to which instruction in restructured schools is
directed toward authentic forms of student achievement. We use the word authentic to
distinguish between achievement that is significant and meaningful and that which is trivial and
useless. To define authentic achievement more precisely, we rely on three criteria that are
consistent with major proposals in the restructuring movement:

(1) students construct meaning and produce knowledge,

(2) students use disciplined inquiry to construct meaning, and

(3) students aim their work toward production of discourse, products, and performances
that have value or meaning beyond success in school.

5.6. The Use of Authenticity in English Skill

Authentic learning is based on students' use of reading and writing for purposes other
than satisfying the teacher's assignments. Authentic tasks are those that students might do even if
they are not trying to improve their reading or writing. In this activity, you will develop a list of
authentic reading and writing activities that support comprehension and active learning. When
you have finished, save your written work to submit as an assignment. Think about five topics,
themes, and/or books you will be teaching this year. What reading and writing activities will
promote successful learning of these topics/themes? Now make a list of at least two authentic
activities/experiences for each topic or theme.

 Authentic Reading is reading a variety of texts for real purposes. Authentic reading is
most like that which occurs in everyday life. Example: reading a story from a novel or
reading a newspaper.

Authentic Reading includes: • reading that is meaningful, relevant, and useful to the
reader; • supporting readers with a print-rich environment; • providing choice within a
variety of forms and genres; • having the opportunity to interact with others in
response to the text; • focusing on communicating ideas or shared understandings; •
providing authentic meaning-making experiences: for pleasure, to be informed, or to
perform a task.

 Authentic Writing is writing for real purposes and real audiences. Authentic writing
is writing that is most like that which occurs in everyday life. Example: writing an
argumentative text to persuade people to believe in the author's opinion, or writing a
story based on a real-life situation.

Authentic Writing includes: • writing that is meaningful, relevant and useful to the
writer; • supporting writers with a print-rich environment; • providing choice within a
variety of forms and genres; • understanding that the writing process is recursive; •
having the opportunity to interact with others in response to text.

 Purposeful Speaking and Listening is the foundation of reading and writing
development in which students, formally and informally, comprehend, express, and
exchange ideas for a variety of authentic purposes. Example: delivering an appropriate
kind of message to an appropriate audience.

Purposeful Speaking and Listening includes: • making relevant statements and
asking questions; • listening actively and responding; • sharing personal connections
related to the topic; • elaborating and explaining.

 Word Study is the active teaching of words and their meanings within authentic reading
and writing experiences. Example: reading words from a book or dictionary and
trying to memorize them.

Word Study includes: • building a word-rich environment which allows for
incidental and intentional learning of words; • developing students' vocabulary through
intentional instruction using a variety of strategies and tools; • studying spelling
patterns in words such as rhyming, root words, suffixes, prefixes; • providing
opportunities for beginning readers to develop phonics and phonemic awareness.

 Authentic Listening is listening to audio that reflects real-life conditions in a variety of
places and situations. Example: listening to a conversation between two people in the
market bargaining over groceries.

Authentic Listening includes: • listening that is meaningful, relevant, and useful to
the student and their study; • supporting students with a print-rich environment; •
providing choice within a variety of places and situations.

In brief, the characteristics of a good test are:
o Validity
o Reliability
o Practicality
o Washback
o Authenticity

GROUP 7.

A. DEFINITION
The item analysis is an important phase in the development of an exam program. In
this phase statistical methods are used to identify any test items that are not working well. If
an item is too easy, too difficult, failing to show a difference between skilled and unskilled
examinees, or even scored incorrectly, an item analysis will reveal it. The two most common
statistics reported in an item analysis are the item difficulty, which is a measure of the
proportion of examinees who responded to an item correctly, and the item discrimination,
which is a measure of how well the item discriminates between examinees who are
knowledgeable in the content area and those who are not. An additional analysis that is often
reported is the distractor analysis. The distractor analysis provides a measure of how well
each of the incorrect options contributes to the quality of a multiple choice item. Once the
item analysis information is available, an item review is often conducted.

B. LEVEL OF DIFFICULTY
A good test is neither too easy nor too difficult for students. Its answer options should
be plausible and should not give the key away. Very easy items can build affective feelings of
“success” among lower-ability students and serve as warm-up items, while very difficult items
can provide a challenge to the highest-ability students (Brown, 2004:59).
If the tests a teacher gives are consistently too easy or too difficult, students come to
know and anticipate the character of that teacher's tests. Thus, a test should be standard and
fulfill the characteristics of a good test. The number that shows the difficulty level of a test
item is called the difficulty index (Arikunto, 2006:207). This index has minimum and maximum
values: the lower the index, the more difficult the item; conversely, the higher the index, the
easier it is. There are several factors that every test constructor must consider when setting the
difficulty level of test items.
Mehrens and Lehmann (1984) point out that how difficult a test should be depends on
a variety of factors, notably 1) the purpose of the test, 2) the ability level of the students, and
3) the age or grade level.
The formula that can be used to measure it is:

F = B / JS    (Brown, 2004:59)
Another formula for measuring item difficulty (P-value), given by Gronlund (1993:103)
and Garrett (1981:363), is:

P = (R / N) × 100

where:
P = the percentage of examinees who answered the item correctly;
R = the number of examinees who answered the item correctly;
N = the total number of examinees who tried the item.
In measuring the difficulty level of essay tests or short-answer items, a different
formula is used:

P = mean / maximum score    (Zulaiha, 2008:34)
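The two difficulty formulas above can be sketched in a few lines of Python. This is a minimal illustration; the function names and sample data are assumptions, not from the source.

```python
# A minimal sketch of the two item-difficulty formulas above.
# Function names and sample data are illustrative, not from the source.

def difficulty_dichotomous(num_correct: int, num_examinees: int) -> float:
    """P = R / N: the proportion of examinees who answered the item correctly."""
    return num_correct / num_examinees

def difficulty_essay(scores: list, max_score: float) -> float:
    """P = mean score / maximum score, for essay or short-answer items."""
    return (sum(scores) / len(scores)) / max_score

# 70 of 100 examinees answered a multiple-choice item correctly:
print(difficulty_dichotomous(70, 100))  # 0.7
# An essay item scored out of 10, with scores 4, 6, and 8 (mean 6.0):
print(difficulty_essay([4, 6, 8], 10))  # 0.6
```

Note that the Gronlund/Garrett version simply multiplies the same proportion by 100 to express it as a percentage.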

C. DISCRIMINATING POWER (ITEM DISCRIMINATION)


Item discrimination is the extent to which an item differentiates between high- and low-
ability test-takers. Discrimination is important because the more the test items discriminate,
the more reliable they will be (Hughes, 2005:226). It can also be defined as the ability of a test
to separate master students from non-master students (Arikunto, 2006:211).
A master student is a student with a higher score on the test, and a non-master student
is a student with a lower score. As with difficulty level, discrimination has its own index: an
indicator of how well an item discriminates between weak candidates and strong candidates
(Hughes, 2005:226). This index is used to measure the ability of an item to discriminate
between the upper and lower groups of students: the upper group consists of students who
answer the item correctly, and the lower group of students who answer incorrectly. This index
can take negative values.
Unlike the difficulty index, a negative discrimination index shows that the item
identifies high-group students as poor students and low-group students as smart students. A
good item is one that can be answered correctly by the upper group but not by the lower
group. The difficulty index is commonly divided into five categories: too difficult, difficult,
sufficient, easy, and too easy. Following the standards stated by Brown (2004:59) and
Arikunto (2006:208), a test item is called too difficult if P (the index of difficulty) is 0.00, and
difficult if the index is between 0.00 and 0.30.
A test item is in the sufficient range if the index of difficulty is between 0.30 and 0.70.
It is called easy if the index is between 0.70 and 1.00, and too easy if P equals 1.00. An
appropriate test item will generally have a P that ranges from 0.15 to 0.85 (Brown, 2004:59).
An item has a poor difficulty index if it cannot differentiate between smart students and poor
students, which happens when both groups score the same on the item. Conversely, an item
that garners correct responses from most of the high-ability group and incorrect responses
from most of the low-ability group has good discrimination power (Brown, 2004:59).
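The discrimination index contrasted above can be computed as D = (U − L) / n, the difference between the numbers of upper-group and lower-group students answering correctly, divided by the group size (this is the calculation used in the worked table in section D). A minimal sketch with hypothetical counts; the function name and data are my own:

```python
# A minimal sketch of the discrimination index: D = (U - L) / n.
# Hypothetical counts; a positive D favors the strong group, a negative
# D flags a flawed item.

def discrimination_index(upper_correct: int, lower_correct: int,
                         group_size: int) -> float:
    """Difference between upper- and lower-group proportions answering correctly."""
    return (upper_correct - lower_correct) / group_size

print(discrimination_index(25, 10, 30))  # 0.5: discriminates well
print(discrimination_index(10, 18, 30))  # negative: review the item
```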

D. ITEM DISTRACTORS
In addition to calculating discrimination indices and facility values, it is necessary to
analyze the performance of distractors (Hughes, 2005:228). Distractor analysis examines the
distribution of testees' choices among the optional answers (distractors) in multiple-choice
questions (Arikunto, 2006:219).
This analysis is as important as the others, considering that nearly 50 years of research
show a relationship between the distractors students choose and their total test score (Nurulia,
2010:57).

Distractor analysis is an extension of item analysis, using techniques that are similar
to item difficulty and item discrimination. In distractor analysis, however, we are no longer
interested in how test takers select the correct answer, but in how well the distractors were able
to function by drawing the test takers away from the correct answer. The number of
times each distractor is selected is noted in order to determine the effectiveness of the
distractor. We would expect each distractor to be selected by enough candidates for it to be a
viable distractor. What exactly is an acceptable value? This depends to a large extent on the
difficulty of the item itself and what we consider to be an acceptable item difficulty value for
the test items. If we assume that 0.7 is an appropriate item difficulty value, then we should
expect the remaining 0.3 to be about evenly distributed among the distractors.
The distribution can be obtained by counting the number of testees choosing each
distractor, which we can see from the students' answer sheets. A distractor is good if it is
chosen by at least 5% of the test takers. One way to study responses to distractors is with a
frequency table that tells us the proportion of students who selected a given distractor.
Remove or replace distractors selected by few or no students, because students find them
implausible (Nurulia, 2010:57).
Distractors that are not chosen by any examinees should be replaced or removed.
Distractors that do not work, for example those chosen by very few test-takers, should be
replaced by better ones, or the item should be otherwise modified or dropped (Hughes,
2005:228).
Such distractors do not contribute to the test's ability to discriminate the good students
from the poor students (Nurulia, 2010:57). They should be discarded because they are chosen
by very few test-takers from both groups, which means they cannot function properly.
Let us see the following example of an item's distractors:
In the story, she was happy because ...
A. She gets a present
B. Her mother gave her permission to watch a concert
C. She found money on the street
D. She met her old friend
Let us assume that 100 students took the test. If we assume that A is the answer and
the item difficulty is 0.7, then 70 students answered correctly. What about the remaining 30
students and the effectiveness of the three distractors? If all 30 selected D, the distractors B
and C are useless in their role as distractors. Similarly, if 15 students selected D and another
15 selected B, then C is not an effective distractor and should be replaced. The ideal situation
would be for each of the three distractors to be selected by 10 students. Therefore, for an item
which has an item difficulty of 0.7, the ideal effectiveness of each distractor can be
quantified as 10/100 or 0.1.
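The frequency-table approach described above can be sketched as follows. The response data are hypothetical and match the 100-student example: each option's share of responses sums to 1.0, and the key's share equals the item difficulty.

```python
# A sketch of a simple distractor analysis using a frequency count.
# Hypothetical responses: 70 chose the key A, 10 each chose B, C, D.

from collections import Counter

key = "A"
responses = ["A"] * 70 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10

counts = Counter(responses)
n = len(responses)
for option in "ABCD":
    share = counts[option] / n
    role = "key" if option == key else "distractor"
    print(f"{option} ({role}): {share:.2f}")

# Flag distractors below the 5% floor mentioned above as candidates
# for replacement.
weak = [o for o in "ABCD" if o != key and counts[o] / n < 0.05]
print("weak distractors:", weak)  # [] here: all reach the 5% floor
```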

One important element in the quality of a multiple choice item is the quality of the
item’s distractors. However, neither the item difficulty nor the item discrimination index
consider the performance of the incorrect response options, or distractors. A distractor
analysis addresses the performance of these incorrect response options. Just as the key, or
correct response option, must be definitively correct, the distractors must be clearly incorrect
(or clearly not the "best" option). In addition to being clearly incorrect, the distractors must
also be plausible. That is, the distractors should seem likely or reasonable to an examinee
who is not sufficiently knowledgeable in the content area. If a distractor appears so unlikely
that almost no examinee will select it, it is not contributing to the performance of the item. In
a simple approach to distractor analysis, the proportion of examinees who selected each of
the response options is examined. For the key, this proportion is equivalent to the item p-
value, or difficulty. If the proportions are summed across all of an item’s response options
they will add up to 1.0, or 100% of the examinees' selections. The proportion of examinees
who select each of the distractors can be very informative. For example, it can reveal an item
mis-key. Whenever the proportion of examinees who selected a distractor is greater than the
proportion of examinees who selected the key, the item should be examined to determine if it
has been mis-keyed or double-keyed. A distractor analysis can also reveal an implausible
distractor. In CRTs (criterion-referenced tests), where item p-values are typically high, the
proportions of examinees selecting the distractors are correspondingly low. Nevertheless, if
examinees consistently fail to select a given distractor, this may be evidence that the distractor
is implausible or simply too easy.
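The mis-key check described above (a distractor drawing more examinees than the keyed answer) can be sketched as a small helper. The function name and counts are hypothetical:

```python
# A sketch of a mis-key check: flag any option selected by more
# examinees than the keyed answer, signalling a possible mis-key or
# double-key. Hypothetical function name and counts.

def flag_possible_miskey(counts, key):
    """Return the options selected more often than the key."""
    return [opt for opt, c in counts.items()
            if opt != key and c > counts[key]]

# 55 examinees chose B although A is keyed: the item deserves review.
print(flag_possible_miskey({"A": 30, "B": 55, "C": 10, "D": 5}, "A"))  # ['B']
```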
Another example:

Distractor selected                                  Upper group   Lower group   Discrimination index   Value
A. She gets a present *                              20            10            (20-10)/30             0.33
B. Her mother gave her permission to watch concert   3             3             (3-3)/30               0.00
C. She found money on the street                     4             16            (4-16)/30              -0.40
D. She met her old friend                            3             1             (3-1)/30               0.07

* Correct answer; 30 is n, the number of examinees in each group.
From the table, alternative A is the key, and a positive value is what we would want.
However, the value of 0.33 is rather low considering the maximum value is 1. The value for
distractor B is 0, which tells us that the distractor did not discriminate between the proficient
students in the upper group and the weaker students in the lower group; hence its effectiveness
is questionable. Distractor C, on the other hand, seems to have functioned effectively: more
students in the lower group than in the upper group selected it. As our intention in distractor
analysis is to identify distractors that seem to be the correct answer to weaker students,
distractor C has done its job. The same cannot be said of the final distractor. In fact, the
positive value obtained here indicates that more of the proficient students selected this
distractor, which is not what we would hope for.
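The discrimination values in the table above can be reproduced with a short sketch. The counts come from the worked example (n = 30 per group); the variable names are my own:

```python
# Reproducing the per-option discrimination values from the worked
# example: D = (upper - lower) / n, with n = 30 examinees per group.

upper_lower = {
    "A": (20, 10),  # key
    "B": (3, 3),
    "C": (4, 16),
    "D": (3, 1),
}
n = 30
values = {opt: (u - l) / n for opt, (u, l) in upper_lower.items()}
for opt, d in values.items():
    print(f"{opt}: {d:+.2f}")  # A: +0.33, B: +0.00, C: -0.40, D: +0.07
```

As in the discussion, only C shows the desired negative value for a distractor (more lower-group than upper-group students chose it).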

E. FOLLOW-UP
On the basis of the calculations of the difficulty level and discrimination level of each
test item, the test as a whole can be examined and analyzed further. The characteristics of the
test revealed through these studies and analyses need to be kept fully and neatly in their own
records. As we know, a lecturer is generally in charge of teaching the same field from one
academic year to the next. Therefore, notes about the tests ever administered will give more
opportunities to develop a better test.
If indices between 0.20 and 0.80 are used as the range of acceptable difficulty levels,
then it is immediately apparent that item number 13 (D = 0.13) falls well below it. This test
item need not be kept. Test item number 15 (D = 0.19) might still be retained, since its
difficulty level almost reaches the lowest limit. From the study and further analysis of
difficulty levels it can be concluded that, except for number 13, which did not qualify, and
number 15, which may still be considered, all other items have acceptable difficulty.
From the study and analysis of the difficulty level and discrimination level it is
inferred that test items 2, 5, 13 and 14 are doubtful and should not be preserved, because they
have negative discrimination levels.
The answer key is part of the complete record, regardless of the form of the test. For a
multiple-choice test, it is a note containing the list of item numbers and the letters of the
correct answers. For essay tests, the note takes the form of short answers for short-answer
tests, or the core of the answer and marking guidelines for long-answer essay tests.
In brief, item analysis comprises three strands — item distractors, item discrimination,
and difficulty level — and each of these is followed by a follow-up stage.

For multiple-choice exams, distractors play a significant role. Do exam questions
effectively distract test takers from the correct answer? For example, if a multiple-choice
question has four possible answers, and two of the answers are obviously incorrect, the
question is left with a 50/50 chance of a correct response. When distractors are obviously
incorrect rather than disguised, they become ineffective in assessing student knowledge. An
effective distractor will attract test takers with a lower overall score rather than those with a
higher overall score.

F. CAUTIONS IN ITEM ANALYSIS


 Item analysis data are just a reflection of internal consistency and therefore should not
be treated as item validity, which requires external criteria, e.g. experts’ opinions,
to accurately judge the validity of test items.
 A low discrimination index does not mean an item must be dropped from a test, because
extremely difficult or easy items may have a low ability to discriminate; such
items can still be included in a test to sample course content adequately. Similarly, an item
may have low discrimination because of the multidimensionality of the test.
 Item analysis data are tentative, due to influences from factors like the sample of students,
the quality of instruction, and chance errors.
 Distractor analysis involves the examination of item responses by ability groupings for
each option in a selected-response item.

