
Studies in Educational Evaluation 59 (2018) 278–287

Contents lists available at ScienceDirect

Studies in Educational Evaluation


journal homepage: www.elsevier.com/locate/stueduc

Are the test scores of the Programme for International Student Assessment
(PISA) and the National Educational Panel Study (NEPS) science tests
comparable? An assessment of test equivalence in German Schools
Helene Wagner*, Inga Hahn, Katrin Schöps, Jan Marten Ihme, Olaf Köller

IPN - Leibniz Institute for Science and Mathematics Education at Kiel University, Olshausenstraße 62, D-24118 Kiel, Germany

ARTICLE INFO

Keywords:
Scientific literacy
Missing data
Linking study
Equipercentile equating
PISA
NEPS

ABSTRACT

The aim of this study is to link the science scale of the German National Educational Panel Study (NEPS) with the science scale of the Programme for International Student Assessment (PISA). One requirement for a strong linking of test scores from different studies is a sufficient similarity of the tests regarding their constructs. The present study aims to assess the similarity of the operationalized constructs of the NEPS and PISA scientific literacy tests with the aim to link the scales of the two tests. A linking study was carried out for this purpose in which 1079 students worked on the tasks of both studies. The results of the comparison between NEPS and PISA indicated a high overlap regarding their constructs. However, both studies deal with missing responses differently. The linking via equipercentile equating showed a high classification consistency which was highest when missing responses were ignored in both studies.

1. Introduction

In 1997, the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany decided on Germany's regular participation in international large-scale assessments. Germany participates in the Trends in International Mathematics and Science Study (TIMSS) every four years at the end of primary school as well as in the Programme for International Student Assessment (PISA) every three years at the end of lower secondary education.

However, these studies only allow cross-sectional analyses and only address specific age groups. Until recently, no large-scale study measuring the development of competencies over the lifespan had been carried out in Germany. The National Educational Panel Study (NEPS; Blossfeld, 2008), which started in 2009, is the first German attempt to close this gap by assessing the development of skills and competencies over the lifespan (Hahn et al., 2013). NEPS strives to connect with national and international large-scale assessment studies to achieve a common interpretation of scores (Blossfeld, 2008). However, comparing test results from different studies is a challenge because they are based on different frameworks and their results are not reported on the same scale. Therefore, the test instruments have to be linked to a common scale.

This study examines the comparability of the Grade 9 NEPS science test with the PISA science test. Connecting both tests could extend the interpretation of their test scores. Until now, no proficiency levels have been defined in NEPS. Hence, the NEPS results cannot be interpreted and reported in a criterion-based manner. The link between the NEPS and PISA tests can allow for classification of the NEPS test scores in the criterion-based international reference framework of PISA, which is well established in the public educational debate in Germany. The longitudinal design in NEPS could help to identify the determinants of competence acquisition which can predict the performance in the PISA test. Furthermore, the link between NEPS and PISA could be used to investigate in NEPS samples to what extent the performance on the international PISA scale can predict success in upper secondary education and the further professional career.

Due to the fact that the NEPS and PISA studies deal differently with missing responses, we also investigated how the different treatments of missing values affect the comparability of the test scores and the quality of the linking. To link the NEPS and PISA tests to a common scale, 1079 9th grade students took both tests in a linking study.

According to Kolen and Brennan (2004) the linking of test scores from different studies requires sufficient similarities of the tests with regard to:


* Corresponding author.
E-mail addresses: h.wagner@ipn.uni-kiel.de (H. Wagner), hahn@ipn.uni-kiel.de (I. Hahn), schoeps@ipn.uni-kiel.de (K. Schöps),
ihme@ipn.uni-kiel.de (J.M. Ihme), koeller@ipn.uni-kiel.de (O. Köller).

https://doi.org/10.1016/j.stueduc.2018.09.002
Received 6 October 2017; Received in revised form 2 August 2018; Accepted 5 September 2018
Available online 11 October 2018
0191-491X/ © 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

• Inferences: To what extent are the scores for the two tests used to draw similar types of inferences? In other words, to what extent do the two tests share common measurement goals?
• Populations: To what extent are the two tests designed for testing similar populations?
• Characteristics and conditions of the measurement: To what extent do the two tests share common measurement conditions, for example, with regard to test format, administration conditions, test length, etc.?
• Constructs: To what extent do the two tests measure the same construct?

To closely examine these aspects, the next section will look into the similarities and differences of the NEPS and PISA frameworks.

2. Comparing the scientific literacy tests of NEPS and PISA

2.1. Inferences

NEPS and PISA have different objectives. The aim of PISA is to monitor educational systems at the end of lower secondary school in terms of student performance (OECD, 2013). This goal is realized every three years by a cross-sectional overview of the educational level of 15-year-old students. The aim of NEPS is to provide longitudinal data of the competence development from early childhood to late adulthood in Germany. In order to achieve this goal the data collection in NEPS is embedded in a multicohort sequence design (von Maurice, Sixt, & Blossfeld, 2011) which makes it possible to compare the educational level of 9th grade students from different cohorts. In other words, despite the different objectives of NEPS and PISA, the measurements of these studies allow to assess the educational level of students at the end of lower secondary school.

2.2. Target populations

The target population of the NEPS test are 9th grade students (von Maurice et al., 2011). PISA examines the competence of 15-year-old students (15 years and 3 months to 16 years and 2 months of age). In Germany the target population of 15-year-old students for PISA 2012 was defined as the persons born in 1996. The analysis of the composition of the PISA sample in Germany in 2012 showed that 48% of the selected students attended Grade 9, 33% of them attended Grades 10 and 11, and 19% of them attended Grades 7 and 8 (Sälzer & Prenzel, 2013). Hence, the target populations in NEPS and PISA are not identical, but the overlap of both selected samples is high.

2.3. Characteristics and conditions of the measurement

PISA is a cross-sectional study which in 2012 assessed mathematics, reading, science and financial literacy of 15-year-old students (OECD, 2014b). The 53 items of the science test were split into three clusters and presented to students with seven mathematics clusters and three reading clusters in thirteen test booklets. Each booklet consisted of four clusters with each cluster representing 30 min of test time. Each student worked on one to two science clusters so that each item was processed by a sufficient number of students.

NEPS provides longitudinal data on educational processes and competence development in information and communication technologies, mathematics, reading and science (von Maurice et al., 2011). The 28 items of the NEPS science test were presented in 2010 in 28 min and each person got the same items in a fixed sequence (Schöps & Saß, 2013).

In Germany data collection and processing for PISA 2012 and NEPS 2010 were coordinated by the IEA Hamburg. Both tests examined in this linking study were administered as a paper-pencil test. The majority of the items in NEPS 2010 and PISA 2012 had a closed-constructed response format (OECD, 2014a; Schöps & Saß, 2013). However, PISA 2012 also used an open-constructed response format (32% of the total number of items).

NEPS and PISA deal differently with missing responses. PISA 2012 used a two-stage procedure for handling missing responses (OECD, 2014b): in the first step, not-reached and not valid items were ignored and omitted items were scored as incorrect when estimating the item parameters. In the next step, the estimated item parameters were used for the estimation of person parameters, where missing responses were scored as incorrect. In contrast, NEPS 2010 ignored all missing responses for the estimation of item and person parameters (Pohl & Carstensen, 2012).

A number of studies (De Ayala, Plake, & Impara, 2001; Pohl, Gräfe, & Rose, 2014; Rose, von Davier, & Xu, 2010) showed that scoring missing responses as incorrect leads to a bias in the estimation of parameters and to the overestimation of the reliability. Based on these results we assume that ignoring the missing responses in NEPS and PISA will increase the comparability of their test scores (hypothesis two) and their scales (hypothesis four), and hence the quality of linking (hypothesis six).
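To make the two scoring rules concrete, the following minimal Python sketch (not the authors' code; the response matrix is invented) scores the same responses once with missing entries ignored and once with missing entries counted as incorrect, and shows that the second rule lowers the per-item proportion correct, i.e. makes the items look more difficult.

```python
# Minimal sketch of the two missing-response rules; data are invented.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 1000, 10
responses = (rng.random((n_students, n_items)) < 0.6).astype(float)
# Introduce 15% missing responses (NaN), omissions and not-reached combined.
responses[rng.random((n_students, n_items)) < 0.15] = np.nan

# Rule 1 (NEPS 2010 style): missing responses are ignored, so the
# proportion correct per item is based on valid responses only.
p_ignored = np.nanmean(responses, axis=0)

# Rule 2 (PISA 2012 person-parameter step): missing responses count as incorrect,
# i.e. they enter the denominator with a score of zero.
p_incorrect = np.mean(np.nan_to_num(responses, nan=0.0), axis=0)

# Scoring missings as incorrect systematically lowers the proportion correct
# per item, which is the bias direction discussed above.
print(np.round(p_ignored - p_incorrect, 3))
```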
2.4. Operationalized constructs: comparing the contents of the science tests of NEPS and PISA

The definition of scientific literacy used by NEPS includes aspects of the concept of competence as defined by Weinert (2001), and of the concepts of scientific literacy developed by the American Association for the Advancement of Science (American Association for the Advancement of Science, 2009) and by PISA (OECD, 2006). Therefore, the NEPS scientific literacy framework has a substantial overlap with the scientific literacy framework from PISA 2012 (Fig. 1).

Fig. 1 shows that the frameworks of both studies differ in the number of components used for assessing scientific literacy: The framework of NEPS only considers the content-related components, which are related to the knowledge of science (KOS) in PISA, and process-related components, which are related to the knowledge about science (KAS) in PISA. The PISA framework differentiates further and also distinguishes between the competencies identifying scientific issues, explaining phenomena scientifically and using scientific evidence. At this point, it can be concluded that the frameworks of the two studies differ in their conceptual scope. But how different are they on the task level?

This question can be examined using the theory of bias and equivalence of van de Vijver (1998), which was modified for the purposes of equivalence research in linking studies by Pietsch, Böhme, Robitzsch, and Stubbe (2009). Pietsch et al. (2009) suggested assessing the similarity of the operationalized constructs of two tests by regarding their conception, their dimensional structures and their scales.

According to Pietsch et al. (2009), two tests are equivalent regarding their conception when their constructs have equivalent frameworks. In order to analyze the conceptual equivalence of scientific literacy in NEPS and PISA (Wagner, Schöps, Hahn, Pietsch, & Köller, 2014), seven experts in the field of science didactics familiar with large-scale assessments classified the NEPS items according to the categories of KOS and KAS and to the competencies in PISA. The results showed that 79 percent of the NEPS items could be assigned to the contents of the PISA framework. However, according to five of the seven raters some of the KOS components in PISA (earth and space systems and technology systems) were not covered by NEPS items.

3. Linking methods and linking studies

Different methods of linking can be applied depending on the level of equivalence of the two tests. Mislevy (1992) and Linn (1993) differentiate between five types of linking: equating, vertical scaling, concordance, projection and moderation.


Fig. 1. Frameworks of scientific literacy in NEPS and PISA. Corresponding equivalents are underlined.

Fig. 2 illustrates the different linking methods which, in terms of their applicability, depend on the degree of similarity of the test instruments to be linked. In addition, Fig. 2 shows the linking continuum (Ryan & Brockmann, 2009), which reflects the above-mentioned methods' strength of linking. The linking continuum can be interpreted as follows: the more similar the tests are in terms of inferences, target populations, measurement characteristics, and operationalized constructs, the stronger a type of linking method can be applied, and thus the closer these tests can be connected.

According to Ryan and Brockmann (2009) equating is the strongest type of linking. Equating can only be applied if the tests are similar in all features named in the Kolen and Brennan (2004) approach. The linking via equating makes it possible to predict the scores from one test to another and vice versa. However, this method also has one disadvantage: it is sensitive to irregularities in the distribution of the test scores. For example, the equating relationship cannot be determined for score ranges that exceed the highest observed score and scores that fall below the lowest observed score. One possibility of approaching the problem is by pre- or post-smoothing the equipercentile equivalents (Livingston, 2004).

Worldwide there have been different approaches to link international data with national data since the first PISA survey in 2000. Depending on the goal of the linking studies, they can be classified into three groups (Nissen, Ehmke, Köller, & Duchhardt, 2015): studies with the goal of explaining differences in outcome (Wu, 2010), studies with the goal of explaining differences in proficiency levels (Hambleton, Sireci, & Smith, 2009), and lastly, studies to locate the outcome of the national study in an international reference (Cartwright, Lalancette, Mussio, & Xing, 2003; Pietsch et al., 2009).

In their study, Nissen et al. (2015) also wanted to locate mathematical scores of NEPS on the TIMSS scale. This study is relevant to our study insofar as it aimed to link tests developed for similar but not the same populations. Nissen et al. (2015) compared different linking methods by applying them to the mathematics tests from TIMSS 2011 (students at the end of Grade 4) and NEPS 2010 (beginning of Grade 5). The comparison of the equipercentile equating with the IRT linking method (based on a sample of 733 4th graders) showed a higher classification consistency between the NEPS and TIMSS results on the TIMSS proficiency scale when the equipercentile equating was used (percentage matching of 44%).

Fig. 2. Dependence of the linking methods on the degree of similarity of the test instruments.


Furthermore, the referred study can be used as a comparison for our linking results, as the target populations of NEPS and PISA showed a high overlap but are not identical.

Worldwide there have been different approaches to link the tests in mathematics and reading. Hanushek and Wößmann (2015) showed that the scientific literacy measure in PISA is a far better predictor for economic growth of a country than the results for mathematics and reading. The linking of the scientific scales of NEPS and PISA would allow examining the development of scientific competence in Germany in the context of its economic growth. Therefore, the linking of scientific literacy in NEPS and PISA is important not only with regard to educational research but also with regard to economic research questions.

4. Research questions

A basic requirement for linking two tests by equating is a high similarity of their test constructs. According to Pietsch et al. (2009) the similarity of the operationalized constructs can be assessed by regarding their conception, their dimensional structures and their scales. The assessment of the first aspect – the conceptual equivalence of the NEPS and PISA tests – was part of another study and showed a high overlap of the scientific literacy frameworks (Wagner et al., 2014). Analyses of the two other aspects – the equivalence of the dimensional structures and scales – are part of the present study. The comparison of the NEPS and PISA frameworks in the present paper showed that the studies differ in the response formats they use and in the treatment of missing data. In particular, the influence of this last aspect has to be investigated regarding the dimensional and the scalar equivalence of NEPS and PISA and the linking of their scales.

According to Pietsch et al. (2009), dimensional equivalence means that two tests have the same factor structure, whereas scalar equivalence means that the scores of two tests have the same meaning. To address our questions concerning the dimensional equivalence, we examine (1) whether the items of the NEPS and PISA tests measure the same construct of scientific literacy and (2) to what extent a different handling of missing data in NEPS and PISA influences the comparability of their science scores.

To analyze the scalar equivalence of both tests, we investigate (3) to what extent the scales in NEPS and PISA are equivalent regarding their descriptive statistics such as means, standard deviations, skewness, and kurtosis. Furthermore, we inspect (4) to what extent a different handling of missing data in NEPS and PISA influences the comparability of their scales.

Subsequently, the scales of the two studies are linked and compared regarding their descriptive statistics. (5) We also gauge the classification consistency between the linked test scores in NEPS and PISA according to the PISA 2012 international benchmarks and examine (6) to what extent a different handling of missing data in NEPS and PISA influences the linking of the science scores.

5. Method

5.1. Data collection

Although the assessments of NEPS and PISA showed an overlap concerning their framework conceptions, they have no common items. Accordingly, the science scores of these studies cannot be reported on the same scale. Therefore, a linking study was conducted in four federal states in Germany using a single group design (Fig. 3) to assess the construct equivalence in NEPS and PISA and to link their scientific literacy scales.

Fig. 3. Single group design in the linking study.

The study was carried out on two consecutive days parallel to the PISA survey in the spring of 2012. On the first day, every student completed one of the five PISA booklets with science and mathematics items (two booklets with two science clusters and three booklets with one science cluster) within two hours. In total, students worked on 53 science items split into three clusters. On the second day, the students took the NEPS science test which consisted of 28 items. The test booklets used in the presented study were the same as in the main studies of PISA 2012 and NEPS 2010.

The data collection and processing in the main NEPS and PISA studies as well as in the linking study was coordinated by the IEA Hamburg. The sample consisted of 1528 9th grade students from 65 schools who participated in the secondary school program Increasing the Efficiency of Teaching in Mathematics and Science Education in Secondary School (SINUS; Prenzel & Ostermeier, 2006).

Overall, N = 1079 ninth grade students (50% female) took both the NEPS and the PISA science tests. The age mean of the sample was M = 15.5 years and the standard deviation was SD = 0.55. The analysis of the age composition of the sample showed that 46% of the tested students were born in 1996 and thus corresponded to the selection criterion for PISA 2012 in Germany. Eight percent of students were born before 1996 and 46% of students were born after 1996.

5.2. Scoring and data processing

Germany has participated in PISA (which takes place every three years) since the year 2000. Each PISA cycle examines a major domain in depth, so two-thirds of the testing time is devoted to this domain; the other domains provide a summary profile of skills. Major domains were reading literacy in 2000 and 2009, mathematical literacy in 2003 and 2012, and scientific literacy in 2006 and 2015. PISA 2012 reused some of the items developed in 2006, therefore the scoring of the PISA data in this study was conducted by applying the coding rules of PISA 2006 (OECD, 2009). The scoring of the NEPS data was carried out using the NEPS 2010 coding rules (Schöps & Saß, 2013).

5.3. Analysis of the dimensional equivalence

Our first research question addressed the extent to which the items of NEPS and PISA measure the same construct of scientific literacy. In order to assess this question, the software ConQuest (Wu, Adams, Wilson, & Haldane, 2007) was used to compute the item parameters of both studies by assigning the items to the corresponding tests as two dimensions. In a first step, the two dimensions were correlated to analyze the dimensional equivalence of NEPS and PISA. Afterwards the two dimensional model (NEPS and PISA) was compared with the one dimensional model (scientific literacy) with regard to the information criteria.

The information criteria (AIC, BIC, SABIC) are measures of the goodness of fit of an estimated statistical model. The Akaike information criterion (AIC) takes into account the number of estimated parameters in this model in addition to the likelihood of the model to be analyzed. Compared to the AIC, the Bayesian information criterion (BIC) strongly penalizes the number of estimated parameters.


The sample-size adjusted BIC (SABIC) places a penalty for adding parameters based on sample size but less strongly than the BIC. Given a set of candidate models for the data, the preferred model is the one with the minimum information criterion value. Raftery (1995) proposes to interpret a difference of ΔBIC ≥ 10 as very strong evidence for a better model fit.
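As a rough illustration of how these criteria trade off fit against model size, the following sketch computes AIC, BIC and SABIC from a model's log-likelihood and number of free parameters; the log-likelihood values below are invented, and only the formulas and the ΔBIC reading follow the description above.

```python
# Minimal sketch of the three information criteria; loglik values are hypothetical.
from math import log

def information_criteria(loglik: float, k: int, n: int) -> dict:
    """AIC, BIC and sample-size adjusted BIC (SABIC) for a model with
    log-likelihood `loglik`, `k` free parameters and sample size `n`."""
    return {
        "AIC": -2 * loglik + 2 * k,
        "BIC": -2 * loglik + k * log(n),
        # SABIC replaces n in the BIC penalty by the adjusted size (n + 2) / 24.
        "SABIC": -2 * loglik + k * log((n + 2) / 24),
    }

n = 1079  # students in the linking study
two_dim = information_criteria(loglik=-32786.0, k=109, n=n)  # hypothetical values
one_dim = information_criteria(loglik=-32840.0, k=107, n=n)  # hypothetical values

# The model with the smaller criterion is preferred; a BIC difference of 10 or
# more is read as very strong evidence for the better-fitting model (Raftery, 1995).
print(round(one_dim["BIC"] - two_dim["BIC"], 1))
```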
In the next step the dimensional equivalence of NEPS and PISA was analyzed by examining the assumption of local item independence (LII), one of the cornerstones of the item response theory model (Embretson & Reise, 2000). The LII means that the observed items are conditionally independent of each other given an individual score on the latent variable (Henning, 1989). If the items of different tests measure different constructs, the LII is violated. Any statistical analysis based on the Rasch model is unjustified if the assumption of local item independence is violated. Moreover, a violation of local item independence is evidence that the tests measure different constructs and are not equal.

One of the statistical procedures to identify local item dependency is the computation of the partial correlation index (PRT) from Huynh, Michaels, and Ferrara (1995): the predicted value from the linear regression on the raw score of the total test (in our study the weighted likelihood estimates (WLE)) is subtracted from the item raw score. The correlations between the residuals of items are partial correlations, and the mean of the partial correlations is the index of local item dependency. A PRT index above the critical value of 0.2 (Chen & Thissen, 1997) indicates a violation of local independence.
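The residual-correlation logic of this index can be sketched as follows; the simulated responses and the use of the generating ability as a stand-in for the WLE are illustrative assumptions, not the study's implementation.

```python
# Minimal sketch of the partial-correlation (PRT) index with simulated data.
import numpy as np

rng = np.random.default_rng(1)
n_students, n_items = 500, 20
ability = rng.normal(size=n_students)
# Dichotomous responses driven by a single latent ability (Rasch-like simulation).
prob = 1 / (1 + np.exp(-(ability[:, None] - rng.normal(size=n_items))))
items = (rng.random((n_students, n_items)) < prob).astype(float)
total = ability  # in the study: weighted likelihood estimates (WLE)

# Regress each item on the person score and keep the residuals.
residuals = np.empty_like(items)
for j in range(n_items):
    slope, intercept = np.polyfit(total, items[:, j], deg=1)
    residuals[:, j] = items[:, j] - (slope * total + intercept)

# Correlate the residuals of all item pairs and average them.
r = np.corrcoef(residuals, rowvar=False)
pairs = r[np.triu_indices(n_items, k=1)]
print(round(pairs.mean(), 3))       # PRT index: mean partial correlation
print((np.abs(pairs) > 0.2).sum())  # item pairs above the 0.2 criterion
```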
In order to examine the comparability of scores in the main studies of NEPS and PISA, the dimensional equivalence in the linking study was analyzed by handling missing data in a study-specific way when estimating item parameters. This means that not-reached and not valid items were ignored in both studies, whereas the omitted items were ignored in NEPS and were scored as incorrect in PISA. To examine research question two, addressing the influence of handling missing data on the comparability of the NEPS and PISA scores, the dimensional analyses described earlier were carried out in two additional ways. In the first case the missing responses in NEPS and PISA were ignored because we assumed that this would increase the comparability. To ensure that this increase cannot be explained only by the equal handling of missing data, the analysis of the dimensional equivalence was also performed when missing responses in both studies were scored as incorrect.

5.4. Analysis of the scalar equivalence

To assess the third research question regarding the scalar equivalence in NEPS and PISA, their raw scores were analyzed based on the 1PL Rasch model with fixed item parameters taken from NEPS 2010 (Schöps & Saß, 2013) and PISA 2012 (OECD, 2014a). Due to the aim of this study to classify the NEPS scores into the international reference framework of PISA, PISA items were fixed to the international item parameters. To estimate scientific literacy in NEPS and PISA, five plausible values were drawn per student and test and linearly transformed to a mean of 500 and a standard deviation of 100. The mean, skewness and kurtosis were compared to assess the scalar equivalence of the NEPS and PISA science tests.
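The linear transformation of the plausible values is a plain rescaling; a minimal sketch, assuming the target metric is defined by a mean of 500 and a standard deviation of 100 in the analyzed sample and using invented values, could look like this.

```python
# Minimal sketch of rescaling plausible values to mean 500 and SD 100 (invented data).
import numpy as np

plausible_values = np.random.default_rng(2).normal(loc=0.1, scale=1.2, size=1079)
transformed = 500 + 100 * (plausible_values - plausible_values.mean()) / plausible_values.std()
print(round(transformed.mean(), 1), round(transformed.std(), 1))  # 500.0 100.0
```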
As for the testing of dimensional equivalence, the scalar equivalence in the presented study was analyzed by handling missing responses in a study-specific way when estimating person parameters. This means that the missing responses in NEPS were ignored and the missing responses in PISA were scored as incorrect. However, the fourth research question in our work addressed the influence of handling missing data on the comparability of the NEPS and PISA scales. Consequently, when analyzing the scalar equivalence the missing responses in NEPS and PISA were first ignored and, in a second step, scored as incorrect.

5.5. Linking procedures

The linking of the NEPS and PISA scores in the present study was conducted via equipercentile equating (e.g. Cartwright et al., 2003; Nissen et al., 2015; van den Ham, Ehmke, Nissen, & Roppelt, 2016). This procedure is based on the idea that the scores of two tests with the same percentile rank are declared as equivalent (Kolen & Brennan, 2004). For example, if a score of 425 is equal to a percentile rank of 10 on the NEPS scale, and a score of 496 is equal to a percentile rank of 10 on the PISA scale, then the scores 425 and 496 are considered equivalent. Due to the sensitivity of equipercentile equating to irregularities in the distribution of the test scores (Livingston, 2004), the NEPS equivalents were post-smoothed with a value of 0.3. Post-smoothing means that equipercentile equating is performed on the basis of observed distributions and the equating relation is smoothed afterwards. This step produces a smoothed distribution with nonzero probability at the highest and lowest score levels (Kolen & Brennan, 2004). In our study equipercentile equating was carried out for each plausible value by using the computer software LEGS (Brennan, 2003). Afterwards the linking results were averaged.
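A bare-bones sketch of the equipercentile idea is given below; it uses invented score distributions, maps a score via matching percentile ranks, and omits the post-smoothing and the per-plausible-value repetition performed with LEGS in the study.

```python
# Minimal sketch of equipercentile equating on invented score distributions.
import numpy as np

rng = np.random.default_rng(3)
neps = rng.normal(539, 87, size=1079)   # illustrative NEPS science scores
pisa = rng.normal(581, 82, size=1079)   # illustrative PISA science scores

def equipercentile(x_scores, y_scores, x_new):
    """Map a score from the x scale to the y scale via matching percentile ranks."""
    # Percentile rank of the new x score within the x distribution ...
    rank = np.searchsorted(np.sort(x_scores), x_new, side="right") / len(x_scores)
    # ... and the y score at the same percentile rank.
    return np.quantile(y_scores, np.clip(rank, 0.0, 1.0))

# A NEPS score of 425 is assigned the PISA score with the same percentile rank.
print(round(float(equipercentile(neps, pisa, 425.0)), 1))
```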
5.6. Analysis of the classification consistency

In the last step, the students were classified according to the international benchmarks of PISA 2012 (OECD, 2013) based on the PISA scores and the NEPS equivalents on the PISA metric. To analyze research question five regarding the classification consistency between the linked test scores in NEPS and PISA, the percentage matching (the percentage of the students assigned to the same benchmark) was calculated. Another measure of the classification consistency in our study was Cohen's coefficient kappa (Cohen, 1960). Landis and Koch (1977) propose the following labels for the corresponding ranges of kappa: .21–.40 fair, .41–.60 moderate, .61–.80 substantial and .81–1.00 almost perfect agreement.

As for the testing of equivalence of the NEPS and PISA results, the missing data were handled study-specifically by ignoring the missing responses in NEPS and scoring the missing responses in PISA as incorrect when linking the scales of both studies. In order to examine research question six, addressing the influence of handling missing data on the classification consistency between NEPS and PISA, the missing responses in both studies were first ignored and then scored as incorrect.
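Both consistency measures can be computed directly from the two benchmark assignments; the sketch below uses invented level assignments (not the study's data) together with the standard definition of Cohen's kappa as chance-corrected agreement.

```python
# Minimal sketch of percentage matching and Cohen's kappa with invented assignments.
import numpy as np

rng = np.random.default_rng(4)
levels = np.arange(1, 7)                       # PISA benchmarks I-VI
pisa_level = rng.integers(1, 7, size=1079)
neps_level = np.where(rng.random(1079) < 0.55, pisa_level, rng.integers(1, 7, size=1079))

# Percentage of students placed on the same benchmark by both tests.
percentage_matching = np.mean(pisa_level == neps_level) * 100

# Cohen's kappa: observed agreement corrected for chance agreement.
p_observed = np.mean(pisa_level == neps_level)
p_chance = sum(np.mean(pisa_level == l) * np.mean(neps_level == l) for l in levels)
kappa = (p_observed - p_chance) / (1 - p_chance)

print(round(percentage_matching, 1), round(kappa, 2))
```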
6. Results

6.1. Assessing the dimensional equivalence

The aim of this section is to examine the first research question regarding the dimensional equivalence of the NEPS and PISA scientific literacy tests. To achieve this objective, the comparability of the NEPS and PISA scores was analyzed in a first step by correlating the two dimensions. The correlation between NEPS and PISA was r = .85, indicating a high relation between the scores of both tests. Furthermore, the estimated correlation showed that the majority (72%) of the variance in each test can be explained by the other test.

In the next step, the dimensional equivalence of the NEPS and PISA tests was analyzed by comparing the fit indices of the two dimensional model with those of the one dimensional model, which considered all items as indicators of scientific literacy and did not differentiate between the tests. Table 1 shows the results of the factor analyses.

The information criteria (AIC, BIC and SABIC) of the one dimensional model were larger than the information criteria of the two dimensional model. This result indicates that the factor of scientific literacy cannot explain the systematic item variance completely. This might be caused by the difference of the measurement conditions between NEPS and PISA. In order to examine this assumption, we address the influence of the different handling of missing responses on the comparability of the NEPS and PISA scores in the next section.


Table 1
Summary of factor analysis of the NEPS and PISA data.

                                      N of parameters   Log likelihood   AIC     BIC     SABIC
2 dimensions (NEPS & PISA)            109               65572            65790   66283   65937
1 dimension (scientific literacy)     107               65680            65894   66378   66038

Note. AIC = Akaike information criterion; BIC = Bayesian information criterion; SABIC = Sample-Size Adjusted BIC; * = p < .05.

The factor analysis showed that the NEPS and PISA scores cannot be mapped onto one common dimension. But does this result indicate that the NEPS and PISA tests measure different constructs? This question was analyzed by examining the assumption of local item independence for the NEPS and PISA items. Fig. 4 provides information about the correlations between the item residuals in NEPS and PISA after the individual score on the latent variable (scientific literacy) was subtracted from the item raw score.

Fig. 4. The partial correlations of item pairs in NEPS and PISA.

The diagram shows that only few correlations exceeded the critical value of .2. The mean of the partial correlations was −.02 and indicated the one-dimensionality of the NEPS and PISA items. Thus, the analysis of local item independence in our study showed no evidence that the NEPS and PISA tests measure different constructs.

6.2. Assessing the influence of handling missing data on the dimensional equivalence

Our second research question addressed the influence of handling missing data on the dimensional equivalence of the scientific literacy scores in NEPS and PISA. Figs. 5 and 6 provide the average percentage distribution of missing and invalid responses in NEPS and PISA. They show that the PISA test has significantly (p < .05) more missing data than the NEPS test. In PISA 2012 missing responses were scored as incorrect when estimating the person parameters (OECD, 2009). However, NEPS 2010 ignored missing responses when estimating the item and person parameters (Pohl & Carstensen, 2012).

Fig. 5. Average percentage distribution of missing responses in NEPS and PISA.

Fig. 6. Average percentage distribution of invalid responses in NEPS and PISA.

In order to assess the influence of handling missing data on the comparability of the NEPS and PISA scores (research question two), the analyses of the dimensional equivalence presented in the previous section were conducted by handling missing data in two ways. In the first case the missing responses in both tests were ignored and in the second case they were scored as incorrect.

Our analyses showed that the relation between the two dimensions seems to depend on the handling of missing data: when missing responses were ignored in both tests, the correlation between NEPS and PISA increased to r = .90. Thus, the NEPS and PISA tests shared 9% more of the variance when missing responses were ignored. This increase cannot be explained only by the equal handling of missing data. When missing responses in both tests were scored as incorrect, the correlation between NEPS and PISA decreased slightly (r = .83).

Table 2
Summary of factor analysis of the NEPS and PISA data (missing responses are ignored).

                                      N of parameters   Log likelihood   AIC     BIC     SABIC
2 dimensions (NEPS & PISA)            109               63806            64024   64517   64171
1 dimension (scientific literacy)     107               66865            64079   64563   64223

Note. AIC = Akaike information criterion; BIC = Bayesian information criterion; SABIC = Sample-Size Adjusted BIC; * = p < .05.


The analyses of the factor structure also showed a relation between the handling of missing data and the comparability of the NEPS and PISA scores. If the missing responses in both tests were ignored (Table 2), the two dimensional model fitted the observed data better than the one dimensional model, similar to the analysis in the previous section. But the difference between the information criteria of the models decreased when missing responses were ignored.

Then again, scoring missing responses as incorrect led to the highest difference of information criteria and thus to the lowest comparability of the NEPS and PISA scores (Table 3). Compared to the results in the previous section (when missing responses were handled study-specifically), the results of this section showed that ignoring missing responses in both tests increased the relation between the NEPS and PISA scores. At the same time, scoring missing responses in both tests as incorrect decreased the relation between the NEPS and PISA scores.

Table 3
Summary of factor analysis of the NEPS and PISA data (missing responses are scored as incorrect).

                                      N of parameters   Log likelihood   AIC     BIC     SABIC
2 dimensions (NEPS & PISA)            109               69702            69920   70413   70067
1 dimension (scientific literacy)     107               69843            70057   70541   70201

Note. AIC = Akaike information criterion; BIC = Bayesian information criterion; SABIC = Sample-Size Adjusted BIC; * = p < .05.

6.3. Assessing the scalar equivalence

In order to examine the construct equivalence of the NEPS and PISA tests, the third research question regarding the comparability of their scales is analyzed in this section. The average correlation of the NEPS and PISA scales was .85. The descriptive statistics for the two dimensions are provided in the upper part of Table 4. The statistics showed significant differences (p < .05) between the means of person parameters in NEPS and PISA. The students achieved higher proficiency values in the PISA test than in the NEPS test. The data of both tests were normally distributed and did not differ in skewness and kurtosis.

Table 4
Descriptive statistics for NEPS and PISA science scores and results of the equipercentile linking.

              Mean (SD)        SE     Skewness   SE     Kurtosis   SE
NEPS scores   539.45 (86.64)   2.64   −0.01      0.07   −0.14      0.15
PISA scores   580.62 (81.64)   2.49   −0.07      0.07   −0.10      0.15
NEPSPISA      580.69 (81.64)   2.49   −0.06      0.07   −0.09      0.15

Note. SD = standard deviation; SE = standard error.

Furthermore, the analyses showed that the mean of person parameters in PISA was higher than the ability mean of the German sample in 2012 (524; OECD, 2014a). This might be caused by the participation of selected schools in the primary school program SINUS (Dalehefte et al., 2014), which might have led to a higher average ability than the ability of the reference sample.

6.4. Assessing the influence of handling missing data on the scalar equivalence

In order to assess the influence of handling missing data on the comparability of the NEPS and PISA scales (research question four), the descriptive statistics reported in the previous section were computed by firstly ignoring missing responses in both tests and by secondly scoring them as incorrect.

The average correlation of the competency scales in NEPS and PISA increased to .91 when missing responses in both tests were ignored. Table 5 (upper two lines) provides the descriptive statistics for the NEPS and PISA scores. Compared to the results computed by handling missing data study-specifically, the difference between the means of the person parameters in NEPS and PISA increased.

Table 5
Descriptive statistics for NEPS and PISA science scores and results of the equipercentile linking (missing responses are ignored).

              Mean (SD)        SE     Skewness   SE     Kurtosis   SE
NEPS scores   539.45 (86.64)   2.64   −0.01      0.07   −0.14      0.15
PISA scores   600.87 (75.12)   2.29   −0.06      0.07   −0.13      0.15
NEPSPISA      600.86 (75.05)   2.28   −0.06      0.07   −0.13      0.15

Note. SD = standard deviation; SE = standard error.

The average correlation of the competency scales in NEPS and PISA decreased to .82 when missing responses in both tests were scored as incorrect. The descriptive statistics in Table 6 (upper two lines) show the same difference between the means of person parameters in NEPS and PISA as when ignoring missing responses in both tests.

Table 6
Descriptive statistics for NEPS and PISA science scores and results of the equipercentile linking (missing responses are scored as incorrect).

              Mean (SD)        SE     Skewness   SE     Kurtosis   SE
NEPS scores   521.92 (88.21)   2.69   −0.05      0.07    0.02      0.15
PISA scores   580.62 (81.64)   2.49   −0.07      0.07   −0.10      0.15
NEPSPISA      580.91 (81.94)   2.49   −0.07      0.07   −0.01      0.15

Note. SD = standard deviation; SE = standard error.

6.5. Assessing the classification consistency between NEPS and PISA

The aim of this study was to link the scientific literacy scales of NEPS and PISA. The results of the linking are provided in the last line of Table 4. The descriptive statistics showed that the mean as well as the skewness and kurtosis of the NEPS equivalents on the PISA metric strongly resembled the PISA scale statistics.

In the next step the students were classified according to the PISA international benchmarks in science. The last line (for NEPS) and the last column (for PISA) in Table 7 show that both tests assigned approximately the same number of students to the PISA 2012 international benchmarks. The distribution of students on the proficiency levels based on the linked NEPS and PISA scores was equivalent (χ² = 0.72; df = 5, p > .05).

Table 7
Percentage of students classified according to the PISA 2012 international benchmarks in science.

                                  NEPSPISA
                                  I      II     III    IV     V      VI     Total number of persons
PISA    I                         42     11                                  23
        II                        42     50     13     1                    105
        III                       16     36     55     22     5             293
        IV                               3      30     55     28     2      361
        V                                       1      21     57     44     240
        VI                                                    10     54      58
        Total                     100    100    100    100    100    100
Total number of persons           24     105    290    362    240    59

The matching of the students according to the PISA 2012 international benchmarks was calculated in order to analyze the classification consistency between NEPS and PISA addressed in research question five. This was done based on the linked NEPS and PISA scores. In contrast to the distribution of students on the proficiency levels, the individual assignment of students differed between the tests. The percentage matching of NEPS and PISA lay between 42% on proficiency Level I and 57% on proficiency Level V. The mean percentage matching was 52%. Cohen's coefficient kappa (a measure of consistency) was k = .40, so the matching between the tests can be rated as fair (Landis & Koch, 1977).


6.6. Assessing the influence of handling missing data on the classification consistency

Research question six addressed the influence of handling missing data on the classification consistency. The missing data were handled in two ways to examine this question: in the first case the missing responses in both tests were ignored and in the second case they were scored as incorrect.

The distribution of students on the proficiency levels of PISA based on the linked scores was also equivalent (χ² = 0.95; df = 5, p > .05) when missing responses were ignored in both tests. However, the mean of the percentage matching increased up to 60% (Min. 29%, Max. 70%) when missing responses were ignored. Also, Cohen's coefficient kappa increased to k = .55 and showed a higher classification consistency compared to the study-specific handling of missing data. According to Landis and Koch (1977) the matching between the tests in this case can be rated as moderate.

Similar to ignoring and to handling missing data study-specifically, scoring missing responses as incorrect led to an equivalent distribution of students on the proficiency levels of PISA (χ² = 3.25; df = 5, p > .05). Compared to the previous results, the average of the percentage matching between NEPS and PISA decreased to 48% (Min. 33%, Max. 53%). The analysis of the classification consistency with Cohen's kappa (k = .36) showed a fair matching between NEPS and PISA.

7. Summary of results and discussion

The aim of this study was to link the NEPS 2010 science test with the PISA 2012 science test. As a basis for this analysis, the comparability of the scientific literacy scores was examined for both tests, taking into account the study-specific way of handling missing data. The dimensional equivalence and the scalar equivalence of the test scores were assessed for this purpose.

7.1. Are the NEPS and PISA tests equivalent?

In order to examine the possibility of linking the scientific literacy scales of NEPS and PISA, their tests were compared with regard to inferences, target populations, measurement conditions, and operationalized constructs. The comparison of the inferences in NEPS and PISA showed an overlap between the studies concerning the assessment of the students' educational level at the end of secondary school. Furthermore, the target populations in these studies were not equivalent, but our analyses showed that their overlap was high.

The next aspect in assessing the equivalence of NEPS and PISA was the comparison of the measurement's characteristics, showing that the tests differ with regard to the response format and the handling of missing data. A number of studies (De Ayala et al., 2001; Pohl et al., 2014; Rose et al., 2010) showed that scoring missing responses as incorrect can lead to a bias in the estimation of parameters. Our analysis of the dimensional equivalence indicated that ignoring the missing responses in NEPS and PISA led to the highest comparability of their test scores. Furthermore, the presented results showed that the equal handling of missing data alone cannot explain the increase of comparability between the tests. From our point of view these results can be interpreted as evidence that ignoring missing responses leads to more accurate estimates of the person parameters. Thus, the findings in this study support the conclusions of the studies mentioned in this paragraph.

Also, the difference in the response formats may influence the comparability of the tests. The comparison of the average percentage distribution of missing and invalid responses in NEPS and PISA showed that the PISA test had significantly more missing data than the NEPS test. This difference in the distribution of missing responses between the tests might partly be explained by the difference in response formats used in NEPS and PISA, because test persons often tend to omit items with open-constructed response formats (De Leeuw, Hox, & Huisman, 2003). In connection with scoring missing responses as incorrect in PISA, this tendency could lead to an overestimation of the difficulty of open-response items. Since the linking study focuses on the competence of students and not on the characteristics of the individual items, the difference in the response formats should not have played a major role in linking the results of NEPS and PISA.

The last aspect in assessing the equivalence of NEPS and PISA was the investigation of the construct equivalence of the tests. The analyses in our study showed that the constructs in the NEPS and PISA tests are not equivalent but that they are comparable. This is indicated by a high relation and a high amount of shared variance between the tests, as well as by one common construct of scientific literacy in both tests. The investigation of the scalar equivalence of NEPS and PISA also indicated a high comparability of their scales.

Our analyses showed that although the tests of NEPS and PISA are not equivalent, they are nevertheless comparable with regard to their inferences, their target populations, their measurement conditions, and their operationalized constructs. In this regard, the comparability of the NEPS and PISA tests provides a basis for linking their scientific literacy scores by equipercentile equating.

7.2. The linking and classification consistency

The analyses of the classification consistency addressed in research question five showed an equivalent distribution of students on the proficiency levels of PISA on the basis of the linked NEPS and PISA scores. Also, Cohen's kappa of k = .40 and the percentage matching of 52% showed a good approximation between the tests.

The classification consistency is influenced by the correlation between the tests, their reliabilities and the number of proficiency levels (Ercikan & Julian, 2002; Pietsch et al., 2009). Pietsch et al. (2009) showed that the expected consistency between two tests with five proficiency levels is 40% if the tests have reliabilities of Rel₁ = 1 and Rel₂ = .8 and the correlation between the tests is r = .85. Furthermore, the classification accuracy decreases by 10% for an increase by one proficiency level (Ercikan & Julian, 2002). PISA has six proficiency levels. Therefore, the percentage matching of 52% exceeded the expected value and can be rated as very satisfactory. This value also exceeded the percentage matching between NEPS and TIMSS found in the Nissen et al. (2015) study. Thus, the assessed classification consistency in our study indicates that linking the NEPS and PISA scientific literacy scales leads to a high concordance regarding the competence of students. However, similar to the results of Cartwright et al. (2003) and Nissen et al. (2015), there are some minor differences between NEPS and PISA regarding the individual classification into the benchmarks. Thus, inferences from the NEPS and PISA results can be drawn for groups of students but not for individual students.

The classification consistency between NEPS and PISA addressed in research question six was analyzed under different conditions of handling missing data. The analyses showed that the percentage matching between NEPS and PISA was higher when missing responses were ignored. This way of handling missing data led to an increase in the average consistency by 8% compared to the study-specific handling of missing data. According to the study of Pietsch et al. (2009), a growth of 8% in the classification consistency can be expected for the NEPS and PISA tests by increasing the correlation between the two tests to 1. Thus the results of linking in this study indicate that ignoring missing responses not only increased the comparability of the tests but also the consistency of the classification.

The last PISA assessment (OECD, 2016) partially eliminated the differences between PISA and NEPS in handling missing data by ignoring not-reached items when estimating person parameters. This change in the framework of PISA 2015 increased the comparability of the NEPS and PISA results.

7.3. Limitations

This study examined the equivalence of the NEPS test from 2010 and the PISA test from 2012. The NEPS and PISA tests have changed in many ways since 2012: new items were developed for the NEPS assessment in 2014 in addition to the items of 2010 so that a three-stage (easy, average and difficult) test could be administered and could still be linked to the test from 2010 by using link items. PISA 2015 introduced the following changes in the test administration and scaling (OECD, 2016): the assessment mode (computer-based instead of paper-pencil), the scaling model (two-parameter model instead of one-parameter model), the handling of differential item functioning across countries (calibration of a number of country-by-cycle-specific deviations from the international item parameters instead of ignoring the "dodgy" items for some countries), the handling of not-reached items (treated as not administered instead of as wrong answers when estimating the person parameters) and finally the changes in the framework of scientific literacy (e.g., the KOS component technology systems was excluded from the framework). These changes are not relevant for the linking carried out in this work, but they must be taken into account in the future implementation of linking NEPS and PISA.

According to the PISA 2015 report (OECD, 2016) the scientific literacy tests from 2006, 2009, 2012 and 2015 use the same science performance scale and therefore the comparison of the scores across time is possible. Robitzsch et al. (2016), however, showed in their work that the change from paper-pencil to computer tests could have biased the trend estimation of the German data in science. They suggested using the field test data for the trend estimation. We would follow this recommendation and suggest including field test data for the future linking of NEPS and PISA.

Robitzsch et al. (2016) also investigated the question of how far the new interactive PISA 2015 tasks could have biased the trend estimation of the German data in science. For this purpose, the authors examined the German trend in science based only on the old (less interactive) items from PISA 2012. The analyses showed no change in the trend estimation compared to the trend estimation based on all items. Furthermore, the dimension formed by the old items correlates close to one with the dimension consisting of new items. It can be concluded that the interactive format of the new tasks did not change the science performance scale in PISA 2015.

A second limitation of the study concerns the selectivity of the tested sample. The choice of the sample can be justified by the second goal of this study, namely the investigation of the long-term effect of the SINUS program. The aim of this school program is to more efficiently support teachers in teaching mathematics and science. This could lead to a higher performance in science by students at these schools. At the same time, the choice of the sample offers the opportunity to examine which contents of this program have an influence on the students' performance. Another limitation concerns the test design, namely that the NEPS and PISA tests were administered over the course of two days. A training effect or a decrease in motivation might be a possible consequence.

7.4. Practical implications

Linking the NEPS and PISA scientific literacy scores allows classifying the NEPS test scores in the criterion-based reference framework of PISA. This and the longitudinal assessment in NEPS could help to learn more about the development of scientific literacy. For example, the linking function presented in our study could be used to identify the factors of competence development measured longitudinally in the NEPS main studies, which can predict performance in the PISA test. Furthermore, the presented linking function could be used to investigate to what extent the performance in PISA can predict the success in upper secondary education and in the further professional career measured by NEPS.

The linking function between the NEPS and PISA science scores can be used to classify the NEPS scores within the international benchmarks of PISA. Until now, proficiency levels have not been defined in NEPS. Haschke, Kampa, Hahn, and Köller (2017) developed proficiency standards for adults based on the NEPS scientific literacy test using the Item Descriptor Matching method. The linking function presented here provides a basis for the proficiency standards in NEPS with regard to the outcome that students should have achieved in secondary school. In connection with the longitudinal design of NEPS, the proficiency standards can be used to analyze the influence of the educational development in Germany on the competencies internationally considered crucial for secondary school.

Our study showed the influence of handling missing data on the comparability of two tests and their linking. The influence of this aspect on the comparability of two tests has not yet been investigated. Therefore, the conducted study is highly relevant with regard to future comparisons of the results from different studies.

Acknowledgements

The study was funded by the Centre for International Student Assessment (ZIB) and the Federal Ministry of Education and Research (BMBF) in Germany. The authors would like to thank the ZIB and BMBF for their support.

References

American Association for the Advancement of Science (2009). Benchmarks for science literacy. Project 2061. New York: Oxford University Press.
Blossfeld, H.-P. (2008). Education as a lifelong process. A proposal for a national educational panel study (NEPS) in Germany. Part B: Theories, operationalization and piloting strategies for the proposed measurements. Bamberg: Universität Bamberg.
Brennan, R. L. (2003). LEGS: A computer program for linking with the randomly equivalent groups or single-group design. Version 2.0. Iowa City: University of Iowa, Center of Advanced Studies in Measurement and Assessment.
Cartwright, F., Lalancette, D., Mussio, J., & Xing, D. (2003). Linking provincial student assessments with national and international assessments. Education, skills and learning, research papers, Bd. 005. Ottawa: Statistics Canada.
Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Dalehefte, I. M., Wendt, H., Köller, O., Wagner, H., Pietsch, M., Döring, B., ... Bos, W. (2014). Bilanz von neun Jahren SINUS in deutschen Grundschulen: Evaluation im Rahmen der TIMSS 2011-Erhebung. Zeitschrift für Pädagogik, 60, 245–263.
De Ayala, R. J., Plake, B. S., & Impara, J. C. (2001). The impact of omitted responses on the accuracy of ability estimation in item response theory. Journal of Educational Measurement, 38, 213–234.
De Leeuw, E. D., Hox, J., & Huisman, M. (2003). Prevention and treatment of item nonresponse. Journal of Official Statistics, 19(2), 153–176.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Multivariate applications books series. Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15(3), 269–294.
Hahn, I., Schöps, K., Rönnebeck, S., Martensen, M., Hansen, S., Saß, S., ... Prenzel, M. (2013). Assessing scientific literacy over the lifespan – A description of the NEPS science framework and the test development. Journal for Educational Research Online, 5(2), 110–138.
Hambleton, R. K., Sireci, S. G., & Smith, Z. R. (2009). How do other countries measure up to the mathematics achievement levels on the National Assessment of Educational Progress? Applied Measurement in Education, 22(4), 376–393.
Hanushek, E. A., & Wößmann, L. (2015). The knowledge capital of nations: Education and the economics of growth. Cambridge, MA: MIT Press.
Haschke, L. I., Kampa, N., Hahn, I., & Köller, O. (2017). Setting standards to a scientific literacy test for adults using the item-descriptor (ID) matching method. In B. P. Veldkamp, & M. von Davier (Eds.), Methodology of educational measurement and assessment. Berlin: Springer-Verlag.
Henning, G. (1989). Meanings and implications of the principle of local independence. Language Testing, 6(1), 95–108.
Huynh, H., Michaels, H. R., & Ferrara, S. (1995). Comparison of three statistical procedures to identify clusters of items with local dependency. Annual Meeting of the National Council on Measurement in Education.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 4, 185–207.
Livingston, S. A. (2004). Equating test scores (without IRT). Princeton, NJ: ETS Educational Testing Service.
Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: ETS Policy Information Center.
Nissen, A., Ehmke, T., Köller, O., & Duchhardt, C. (2015). Comparing apples with oranges? An approach to link TIMSS and the National Educational Panel Study in Germany via equipercentile and IRT methods. Studies in Educational Evaluation, 47, 58–67. https://doi.org/10.1016/j.stueduc.2015.07.003.
OECD (2006). Assessing scientific, reading and mathematical literacy: A framework for PISA 2006. Paris: OECD.
OECD (2009). PISA 2006 technical report. Paris: OECD.
OECD (2013). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy. Paris: OECD Publishing.
OECD (2014a). PISA 2012 results. Paris: OECD Publishing.
OECD (2014b). PISA 2012 technical report. Paris: OECD Publishing.
OECD (2016). PISA 2015 results (Volume I): Excellence and equity in education. Paris: OECD Publishing.
Pietsch, M., Böhme, K., Robitzsch, A., & Stubbe, T. C. (2009). Das Stufenmodell zur Lesekompetenz der länderübergreifenden Bildungsstandards im Vergleich zu IGLU 2006. In D. Granzer, O. Köller, A. Bremerich-Vos, M. van den Heuvel-Panhuizen, K. Reiss, & G. Walther (Eds.), Bildungsstandards Deutsch und Mathematik. Leistungsmessung in der Grundschule (pp. 393–416). Weinheim und Basel: Beltz Verlag.
Pohl, S., & Carstensen, C. H. (2012). NEPS technical report – Scaling the data of the competence tests (NEPS Working Paper No. 14). Bamberg: Otto-Friedrich-Universität.
Pohl, S., Gräfe, L., & Rose, N. (2014). Dealing with omitted and not reached items in competence tests: Evaluating approaches accounting for missing responses in IRT models. Educational and Psychological Measurement, 74, 423–452.
Prenzel, M., & Ostermeier, C. (2006). Improving mathematics and science instruction: A program for the professional development of teachers. In F. K. Oser, F. Achtenhagen, & U. Reynolds (Eds.), Competence oriented teacher training. Old research demands and new pathways (pp. 79–96). Rotterdam: Sense Publishers.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163.
Robitzsch, A., Lüdtke, O., Köller, O., Kröhne, U., Goldhammer, F., & Heine, J.-H. (2016). Herausforderungen bei der Schätzung von Trends in Schulleistungsstudien. Diagnostica. https://doi.org/10.1026/0012-1924/a000177.
Rose, N., von Davier, M., & Xu, X. (2010). Modeling nonignorable missing data with item response theory (IRT) (ETS Research Report No. RR-10-11). Princeton, NJ: Educational Testing Service.
Ryan, J., & Brockmann, F. (2009). A practitioner's introduction to equating with primers on classical test theory and item response theory. Washington, DC: CCSSO.
Sälzer, C., & Prenzel, M. (2013). PISA 2012 – eine Einführung in die aktuelle Studie. In M. Prenzel, C. Sälzer, E. Klieme, & O. Köller (Eds.), PISA 2012. Fortschritte und Herausforderungen in Deutschland (pp. 11–46). Münster: Waxmann.
Schöps, K., & Saß, S. (2013). NEPS technical report for science. Scaling results of starting cohort 4 in ninth grade (NEPS Working Paper No. 23). Bamberg: University of Bamberg, National Educational Panel Study.
van de Vijver, F. J. R. (1998). Towards a theory of bias and equivalence. In J. Harkness (Ed.), ZUMA-Nachrichten Spezial, 3 (pp. 41–65). Mannheim: ZUMA.
van den Ham, A.-K., Ehmke, T., Nissen, A., & Roppelt, A. (2016). Assessments verbinden, Interpretationen erweitern? Zeitschrift für Erziehungswissenschaft, 20(1), 89–111. https://doi.org/10.1007/s11618-016-0686-2.
von Maurice, J., Sixt, M., & Blossfeld, H.-P. (2011). The German National Educational Panel Study: Surveying a cohort of 9th graders in Germany (NEPS Working Paper No. 3). Bamberg: Otto-Friedrich-Universität, Nationales Bildungspanel.
Wagner, H., Schöps, K., Hahn, I., Pietsch, M., & Köller, O. (2014). Konzeptionelle Äquivalenz von Kompetenzmessungen in den Naturwissenschaften zwischen NEPS, IQB-Ländervergleich und PISA. Unterrichtswissenschaft, 42(4), 301–320.
Weinert, F. E. (2001). Concept of competence: A conceptual clarification. In D. S. Rychen, & L. H. Salganik (Eds.), Defining and selecting key competencies. Göttingen: Hogrefe and Huber Publishers.
Wu, M. (2010). Comparing the similarities and differences of PISA 2003 and TIMSS (OECD Education Working Papers, No. 32). Paris: OECD Publishing.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2: Generalised item response modelling software. Camberwell: Australian Council for Educational Research.

