Volume
37
Professional Teaching Articles
July 2009
Article 2
Title
What Item Response Theory (IRT) Can Reveal to Us:
An Analysis of a Twenty-Item Vocabulary and Structure Test
Authors
Zhang Jianmin
Chen Zhiteng
Xiao Xi
Bio
Zhang Jianmin is an associate professor at the School of International Studies, Zhejiang University, P.R. China.
Chen Zhiteng is a senior teacher of English at Longshan High School, Ruian, Zhejiang Province, P.R. China.
Xiao Xi is a primary teacher of English at Lingxi No.2 Middle School, Cangnan, Zhejiang Province, P.R. China.
Abstract
A language test serves two basic functions: 1) it tries to measure the true language ability of a student; 2) it aims to evaluate classroom teaching. Based on the results of an English test given to one class at a high school, this paper aims to answer two questions: 1) are the scores of some students from the test compatible with their regular performance in English? 2) is the test good enough to give us useful and reliable information about the test and the test takers? By comparing the test scores with those from other tests and by using the Rasch model, the authors find that IRT shows a very strong capacity in interpreting the test scores and predicting the language ability of the students in question.
Keywords: language test, scores, item response theory, interpret, revelation
I. Introduction
The most fundamental objective of administering a language test to students and other learners is to measure their language ability, though technically and theoretically this has often proved difficult. In analyzing and interpreting test results, two testing theories are normally used: Classical Test Theory (CTT) and Item Response Theory (IRT). CTT is based on true score theory, which assumes that the observed score is composed of the true score and the error score. The observed score is thus seen as an estimate of the test taker's true score plus or minus some unobservable measurement error (Crocker & Algina, 1986; Hambleton & Swaminathan, 1985). According to many researchers in language testing, CTT was long the leading framework for analyzing and developing standardized tests. Since the beginning of the 1970s, IRT has largely taken over the role CTT had and is now the major theoretical framework used in this field (Crocker & Algina, 1986; Hambleton & Rogers, 1990; Hambleton, Swaminathan, & Rogers, 1991). One of the major weak points of CTT is that it is sample-dependent (the sample here being the test takers): item parameters are obtained from the mean score on each item and from its correlation with the total test score. In other words, because different samples of test takers differ in ability, indices such as the facility values and discriminatory powers of the items will differ from sample to sample. As a consequence, it is almost impossible to establish a norm against which to measure the language ability of test takers unless we have a fairly accurate scale or yardstick for the purpose. If, for instance, test takers with very different ability levels take a test, their scores will differ widely. Moreover, neither a very easy test nor a very difficult one can discriminate well among the test takers. Therefore, it is difficult to compare test takers' results across different tests.
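To make the sample dependence concrete, here is a minimal Python sketch that computes the classical facility value (proportion correct) of one item separately for two subgroups. The responses are the Item 3 column of Table 1 in Section II; the function name `facility` is ours and is introduced only for illustration.

```python
# Responses to Item 3 copied from Table 1 (1 = correct, 0 = wrong).
item3_upper = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]  # persons A-J
item3_lower = [1, 0, 1, 1, 1, 0, 0, 0, 0, 1]  # persons K-T


def facility(responses):
    """Classical facility value: proportion of correct responses."""
    return sum(responses) / len(responses)


p_upper = facility(item3_upper)
p_lower = facility(item3_lower)
p_all = facility(item3_upper + item3_lower)

print(f"upper group: {p_upper:.2f}")
print(f"lower group: {p_lower:.2f}")
print(f"all 20:      {p_all:.2f}")
```

The same item appears easier or harder depending on which test takers happen to answer it, which is the weakness of CTT described above.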
Item Response Theory is based on an entirely different philosophy of psychometrics in that
in practical test development work, we need to be able to predict the statistical and psychometric properties of any test that we may build when administered to any target group of examinees. We need to describe the items by item parameters and examinees by examinee parameters in such a way that we can predict probabilistically the response of any examinee to any item (Lord, 1980, p. 11).
In other words, from the response patterns of the test takers to the items, we can judge fairly accurately and independently both item performance and the ability of each test taker. Certainly, as Wright and Stone (1979) put it, each person’s response pattern must be assessed to determine whether the person was responding in an acceptably predictable way given the expected hierarchy of responses (i.e. the items are arranged in order so as to form a hierarchy). Unlike CTT, therefore, the emphasis falls less on the reliability or validity of the test as a whole and more on the response patterns of individual test takers. According to Bond and Fox (2001, p. 8), the Rasch model provides us with useful approximations of measures that help us understand the processes underlying why people (test takers included) and items behave in a particular way.
IRT does not need large numbers of test items to obtain scores from which to determine test takers' language abilities. With ten calibrated test items, the abilities of some test takers can be revealed by referring to the response patterns they show on these items. (By "some", we mean that we chose only 20 of the 45 test takers in our research, hoping thereby to avoid complicated calculations on the scores of all 45.) In this sense, the latent ability of a test taker is independent of the content of a test. The relationship between the probability of answering an item correctly and the ability of a test taker can be modeled in different ways depending on the nature of the test (Hambleton et al., 1991). According to IRT, a test taker with high ability should have a high probability of answering an item correctly. Theoretically, if the items used to test a test taker's ability are arranged according to their facility values, we can expect to see an orderly response pattern: the test taker is unlikely to answer correctly the items beyond his or her ability. If he or she nevertheless gets such items correct while missing easier ones, the response pattern is disorderly, and from this disorderliness we can infer that guesswork was probably involved.
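One simple way to quantify how orderly a response pattern is, though it is not a calculation reported in this article, is to count Guttman-style reversals: pairs of items in which the easier one was missed while the harder one was answered correctly. The sketch below applies this count to the response rows of Persons R and S from Table 1, where the items are already ordered from easiest to hardest. The function name `guttman_errors` is hypothetical, and items tied in facility are simply treated as ordered left to right, which is a simplification.

```python
def guttman_errors(responses):
    """Count (easier item wrong, harder item right) pairs.

    `responses` must be ordered from the easiest item to the hardest.
    """
    errors = 0
    wrong_so_far = 0
    for r in responses:
        if r == 0:
            wrong_so_far += 1
        else:
            # Every earlier (easier) miss is a reversal against this success.
            errors += wrong_so_far
    return errors


# Rows of Table 1; item order: 17, 9, 3, 15, 20, 4, 12, 16, 18, 1,
# 7, 8, 10, 11, 5, 2, 6, 14, 13, 19.
person_r = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
person_s = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print("R:", guttman_errors(person_r))
print("S:", guttman_errors(person_s))
```

A perfectly ordered pattern (all successes on the easiest items, then only failures) scores zero; the larger the count, the more disorderly the pattern.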
IRT seeks to reveal latent psychological constructs in terms of observable item responses. This information is useful in developing and evaluating tests, as well as in interpreting examinees’ scores on the latent characteristics in question. If a small number of test items can tell us a lot about the test takers, the advantages of IRT become clear, as Henning points out:
If only those items are used that approximate in difficulty the known ability region of the examinees, then fewer items will be required. … If items are pre-calibrated, banked and randomly summoned for any given measurement task, then there is less risk of a security breakdown that would disqualify large numbers of items for future use. All these advantages add up to greater economy of items over time and use (Henning, 2001, p. 111).
And this is what we want to reveal from the results of a test by using the one-parameter model, or Rasch model. We chose the Rasch one-parameter model for two reasons: first, we are mainly concerned with a single parameter, the common scale of person ability and item difficulty; second, this model is not constrained by sample size (the number of test takers). The two-parameter model requires a sample of more than 200 test takers, and the three-parameter model a sample of more than 1000 (Henning, 1987, p. 116). The objective of our research is simple: we intend to find out whether the scores from the test are compatible with 1) the regular performance of these test takers and 2) the assessment of their language teachers. At the same time, we want to test whether IRT is powerful in making sound judgments of test takers’ abilities and of the quality of a test. Moreover, we attempt to see whether IRT can offer better interpretations of test results and of the test takers themselves.
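For reference, the one-parameter (Rasch) model is conventionally written as below, where \theta_n is the ability of person n and b_i is the difficulty of item i, both expressed in logits. These symbols are the standard ones in the IRT literature and are our addition; the article itself does not state the formula.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}
```

When ability equals difficulty, the probability of a correct response is 0.5; it rises toward 1 as ability exceeds difficulty and falls toward 0 in the opposite case.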
II. Methods and Materials
1. Subjects and Test Items
A total of 45 students from Longshan High School in Zhejiang Province, China took an English test. They were 17 to 18 years of age at the beginning of the school year in September 2006. The test, officially known as the Final English Test Paper for 10 Schools in Wenzhou, Zhejiang Province for the first term of the 2006 School Year, contains four sections: listening comprehension, reading comprehension, vocabulary & structure, and essay writing. The participants were aware that the test was given just as a quiz; it was neither a mid-term test nor a final test. They were told that the results would not count towards their final grade in English. The English of the students in our study was below the average of their peers, but they showed willingness and cooperation in taking the test. To simplify calculation, as is common practice, we selected 20 of the 45 students who took the test, 10 in the upper group and 10 in the lower group according to their scores in the test in question. Furthermore, we used only the scores of the vocabulary and structure part for our analysis and discussion. We chose this part of the test for two reasons: first, the items are context-independent of one another; second, they cover a wider range of language points. That part contains 20 multiple-choice items, each worth one point, so the full score of this section is 20.
Table 1 Scoring Matrix for a 20-Item Vocabulary and Structure Test
Item
person |
17 |
9 |
3 |
15 |
20 |
4 |
12 |
16 |
18 |
1 |
7 |
8 |
10 |
11 |
5 |
2 |
6 |
14 |
13 |
19 |
score |
A |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
0 |
1 |
0 |
0 |
0 |
13 |
Upper Group |
B |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
0 |
0 |
0 |
0 |
13 |
C |
1 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
1 |
0 |
1 |
1 |
0 |
1 |
0 |
0 |
1 |
1 |
0 |
1 |
13 |
D |
1 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
1 |
0 |
0 |
0 |
0 |
1 |
1 |
0 |
12 |
E |
1 |
0 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
0 |
0 |
1 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
10 |
F |
1 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
1 |
0 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
10 |
G |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
1 |
1 |
1 |
1 |
0 |
1 |
0 |
1 |
0 |
0 |
0 |
00 |
0 |
10 |
H |
1 |
0 |
1 |
1 |
1 |
0 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
9 |
I |
0 |
0 |
1 |
1 |
0 |
1 |
1 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
1 |
1 |
0 |
0 |
0 |
0 |
9 |
J |
1 |
1 |
1 |
0 |
1 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
1 |
0 |
1 |
0 |
8 |
K |
1 |
1 |
1 |
0 |
0 |
1 |
0 |
1 |
0 |
1 |
0 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
8 |
Lower Group |
L |
1 |
0 |
0 |
1 |
0 |
1 |
1 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
1 |
1 |
0 |
0 |
0 |
7 |
M |
0 |
1 |
1 |
0 |
1 |
1 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
7 |
N |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
1 |
0 |
1 |
0 |
1 |
0 |
0 |
0 |
7 |
O |
0 |
1 |
1 |
0 |
1 |
0 |
0 |
0 |
1 |
0 |
0 |
1 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
6 |
P |
1 |
1 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
1 |
1 |
0 |
0 |
0 |
0 |
6 |
Q |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
5 |
R |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
4 |
S |
1 |
1 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
3 |
T |
1 |
0 |
1 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
3 |
total |
16 |
15 |
12 |
11 |
10 |
9 |
9 |
9 |
9 |
8 |
8 |
8 |
8 |
7 |
6 |
5 |
5 |
4 |
2 |
2 |
|
2. Data Analysis
The selection of 20 students as subjects, and of their scores on the 20-item test as data for analysis, was made by referring to their regular performance in similar English tests over the past school year. These students were then arranged in order according to their scores on the 20-item test, as shown in Table 1 above. Here 1 stands for a correct choice and 0 for a wrong choice. Persons or items for which all responses are correct or all are incorrect are usually eliminated, because they have no discriminating power, but none of the examinees or items in our case belonged to this category.
With this matrix ready, we can now calculate the logit incorrect value for each possible number correct, following the procedure proposed by Henning (1987, p. 119). The logit incorrect value for each item is computed as the natural logarithm of the ratio of the proportion incorrect to the proportion correct. In fact, we do not need to do such tedious calculation by hand: by consulting Table F in Appendix A (Henning, 1987, p. 171), we can easily read off the logit correct and logit incorrect values. This procedure places item difficulties on an interval scale and eliminates the boundaries inherent in the zero-to-unity range of classical p-values. The zero point, or origin, of the item difficulty calibrations is arbitrarily set at the mean of the logit incorrect values for all analyzable items. This is done by subtracting the mean adjustment value, computed as the sum of the frequency times logit values divided by the sum of the frequencies, from the logit incorrect value for each item (Henning, 1987, p. 119). We do so in order to calculate the difficulty level of the items, as shown in Table 2 below.
Table 2 provides us with information about the items. For example, the second column shows that 16 students out of 20 got Item 17 correct, so the proportion correct is 0.80 (16/20) and the proportion incorrect is 0.20. The Item Frequency is 1, which means that only one item (Item 17) was answered correctly by exactly 16 students. Logit Incorrect can be obtained by consulting Table F in Appendix A (Henning, 1987, p. 171). Finally, the Initial Item Difficulty can be calculated easily.
Table 2 only reveals information about item difficulty. To know more about the 20 students’ performance on this test, we need to calculate the initial person ability by using logit correct values instead of logit incorrect values. This is because we are effectively subtracting item difficulties from person abilities in order to place both estimates on the same single ability-difficulty continuum (Henning, 2001, p. 120). Table 3 shows the result of this calculation.
Table 2 Calculation of Item Difficulty Calibrations
item name | number correct | item freq. | prop. corr. | prop. incor. | logit incor. | freq. x logit | freq. x logit² | item diff.
17 | 16 | 1 | 16/20=0.80 | 0.20 | -1.39 | -1.39 | 1.93 | -1.83
9 | 15 | 1 | 15/20=0.75 | 0.25 | -1.10 | -1.10 | 1.21 | -1.54
3 | 12 | 1 | 12/20=0.60 | 0.40 | -0.41 | -0.41 | 0.17 | -0.85
15 | 11 | 1 | 11/20=0.55 | 0.45 | -0.20 | -0.20 | 0.04 | -0.64
20 | 10 | 1 | 10/20=0.50 | 0.50 | 0.00 | 0.00 | 0.00 | -0.44
4,12,16,18 | 9 | 4 | 9/20=0.45 | 0.55 | 0.20 | 0.80 | 0.16 | -0.24
1,7,8,10 | 8 | 4 | 8/20=0.40 | 0.60 | 0.41 | 1.64 | 0.67 | -0.03
11 | 7 | 1 | 7/20=0.35 | 0.65 | 0.62 | 0.62 | 0.38 | 0.18
5 | 6 | 1 | 6/20=0.30 | 0.70 | 0.85 | 0.85 | 0.72 | 0.41
2,6 | 5 | 2 | 5/20=0.25 | 0.75 | 1.10 | 2.20 | 2.42 | 0.66
14 | 4 | 1 | 4/20=0.20 | 0.80 | 1.39 | 1.39 | 1.93 | 0.95
13,19 | 2 | 2 | 2/20=0.10 | 0.90 | 2.20 | 4.40 | 9.68 | 1.76
Total | | 20 | | | | 8.79 | 19.31 |
Logit Incorrect = ln(Proportion Incorrect / Proportion Correct)
Mean Adjustment = Sum of (Frequency x Logit) / Sum of Frequency
Initial Item Difficulty = Logit Incorrect - Mean Adjustment
For example: Mean Adjustment = 8.79/20 = 0.44
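The calculation behind Table 2 can be sketched in a few lines of Python. This is only an illustration under our own naming (`logit_incorrect`, `item_difficulty` and so on are not from Henning or from the article); it computes the logarithms directly instead of looking them up in Henning's Table F.

```python
import math

N_PERSONS = 20

# Number of persons answering each item correctly (bottom row of Table 1).
number_correct = {
    17: 16, 9: 15, 3: 12, 15: 11, 20: 10,
    4: 9, 12: 9, 16: 9, 18: 9,
    1: 8, 7: 8, 8: 8, 10: 8,
    11: 7, 5: 6, 2: 5, 6: 5, 14: 4, 13: 2, 19: 2,
}


def logit_incorrect(correct, n=N_PERSONS):
    """ln(proportion incorrect / proportion correct)."""
    p = correct / n
    return math.log((1 - p) / p)


logits = {item: logit_incorrect(c) for item, c in number_correct.items()}

# Averaging over individual items is equivalent to the
# frequency-weighted sum used in Table 2.
mean_adjustment = sum(logits.values()) / len(logits)

item_difficulty = {
    item: value - mean_adjustment for item, value in logits.items()
}

print(f"mean adjustment = {mean_adjustment:.2f}")
for item, difficulty in item_difficulty.items():
    print(f"item {item:>2}: {difficulty:+.2f}")
```

Because the sketch keeps full precision instead of two-decimal table look-ups, a few values can differ from Table 2 in the second decimal place (Item 3, for instance, comes out near -0.84 and not -0.85).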
Table 3 Calculation of Initial Person Ability Measures
persons | person score | person freq. | prop. corr. | logit corr. | freq. x logit | freq. x logit² | person measure
T,S | 3 | 2 | 3/20=0.15 | -1.74 | -3.48 | 6.06 | -2.16
R | 4 | 1 | 4/20=0.20 | -1.39 | -1.39 | 1.93 | -1.81
Q | 5 | 1 | 5/20=0.25 | -1.10 | -1.10 | 1.21 | -1.52
P,O | 6 | 2 | 6/20=0.30 | -0.85 | -1.70 | 1.45 | -1.27
N,M,L | 7 | 3 | 7/20=0.35 | -0.62 | -1.86 | 1.15 | -1.04
K,J | 8 | 2 | 8/20=0.40 | -0.41 | -0.82 | 0.34 | -0.83
I,H | 9 | 2 | 9/20=0.45 | -0.20 | -0.40 | 0.08 | -0.62
G,F,E | 10 | 3 | 10/20=0.50 | 0.00 | 0.00 | 0.00 | 0.42
D | 12 | 1 | 12/20=0.60 | 0.41 | 0.41 | 0.17 | 0.83
C,B,A | 13 | 3 | 13/20=0.65 | 0.62 | 1.86 | 1.15 | 1.04
Total | | 20 | | | -8.48 | 13.54 |
Logit Correct = ln(Proportion Correct / Proportion Incorrect)
Mean Adjustment = Sum of (Frequency x Logit) / Sum of Frequency
Initial Person Measure = Logit Correct - Mean Adjustment
For example: Mean Adjustment = -8.48/20 = -0.42
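The person-side calculation can be sketched the same way. The Python below applies the three formulas exactly as stated, using the raw scores from the score column of Table 1; all names are ours. Applied literally, the formula reproduces the mean adjustment of -0.42 and the tabulated measures for scores of 10 and above (0.42, 0.83, 1.04). For scores below 10 it yields values closer to zero than those listed in Table 3 (about -0.96 for Person R, not -1.81), because subtracting a negative mean adjustment raises every logit. The sketch follows the formula as written.

```python
import math

N_ITEMS = 20

# Raw scores of persons A-T, taken from the score column of Table 1.
scores = {
    "A": 13, "B": 13, "C": 13, "D": 12, "E": 10, "F": 10, "G": 10,
    "H": 9, "I": 9, "J": 8, "K": 8, "L": 7, "M": 7, "N": 7,
    "O": 6, "P": 6, "Q": 5, "R": 4, "S": 3, "T": 3,
}


def logit_correct(score, n=N_ITEMS):
    """ln(proportion correct / proportion incorrect)."""
    p = score / n
    return math.log(p / (1 - p))


person_logit = {name: logit_correct(s) for name, s in scores.items()}

# Mean of the person logits (same as the frequency-weighted sum / 20).
mean_adjustment = sum(person_logit.values()) / len(person_logit)

# Initial person measure = logit correct - mean adjustment.
person_measure = {
    name: value - mean_adjustment for name, value in person_logit.items()
}

print(f"mean adjustment = {mean_adjustment:.2f}")
for name, measure in person_measure.items():
    print(f"person {name}: {measure:+.2f}")
```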
III. Findings and Discussion
The scoring matrix for the 20-item multiple-choice test (Table 1) shows that no student obtained a score of less than 3 or more than 13. The actual range of scores was therefore 3-13, or 10, against a possible range of 0-20, or 20. Since this was a multiple-choice test with four options per item, an examinee could be expected to score about 25 percent, that is 5 out of 20, by mere guessing. Thus, a score of 5 or below is of little use for discriminating among the participants on the ability being tested. Hence, we conclude that the effective range of this test was 5-13, or 8. Generally speaking, the broader the range, the more effective a test will be in discriminating among examinees on the ability under consideration. From Table 1, we know that some items are too difficult for these students (Items 13, 14, 19), and that the level of several students is too low. We need to determine whether there is some correlation between the two.
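The arithmetic behind the chance score and the ranges can be checked with a short Python sketch. The binomial calculation at the end is our addition and assumes pure, independent guessing on every item; it is not part of the authors' analysis.

```python
import math

N_ITEMS = 20
N_OPTIONS = 4

# Expected score from blind guessing: one chance in four on each item.
chance_score = N_ITEMS / N_OPTIONS

observed_min, observed_max = 3, 13
actual_range = observed_max - observed_min

# Scores at or below the chance level carry little information.
effective_min = max(observed_min, chance_score)
effective_range = observed_max - effective_min

# Probability that a pure guesser scores at least the chance score.
first_k = math.ceil(chance_score)
p_guess = 1 / N_OPTIONS
p_at_least_chance = sum(
    math.comb(N_ITEMS, k) * p_guess**k * (1 - p_guess) ** (N_ITEMS - k)
    for k in range(first_k, N_ITEMS + 1)
)

print(f"chance score:    {chance_score:.0f}")
print(f"actual range:    {actual_range}")
print(f"effective range: {effective_range:.0f}")
print(f"P(guesser scores >= {first_k}): {p_at_least_chance:.2f}")
```

Under this assumption a pure guesser reaches 5 or more a little over half the time, which is why scores near 5 say little about ability.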
The response patterns of some persons in Table 1 are informative. The response pattern of Person R, for example, is typical of a person making wild guesses, for it is highly improbable. With an estimated ability of -1.81, this person missed Item 3, which has a difficulty level of only -0.85, yet got Item 7 correct, and was one of only two test takers who got Item 19 correct, an item with a difficulty level of 1.76. Given these disorderly responses, it is natural to conclude that guesswork was involved in making the choices. This was later confirmed when we interviewed Person R. Consider also Person Q. He got only five items correct, one of which was Item 14, with a difficulty level of 0.95. His ability measure is -1.52, which means that Item 14 lies far beyond his ability. The most plausible conclusion is that he chose the correct option by mere guessing, because he failed to answer correctly much easier items such as Items 3, 15, 20, 4 and 12. What we need to point out here is that CTT makes no provision for identifying such persons and therefore increases the possibility that invalid scores may be reported for certain examinees in the sample. IRT overcomes this weakness and can reveal this information to us.
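To put rough numbers on how unlikely these correct answers are, the following Python sketch evaluates the standard Rasch success probability for the person and item estimates quoted above (abilities from Table 3, difficulties from Table 2). The function name `rasch_probability` is ours, and the article does not report these probabilities, so this is an illustration and not a reproduction of the authors' procedure.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# (person, ability, item, difficulty) as reported in Tables 2 and 3.
cases = [
    ("R", -1.81, 19, 1.76),
    ("R", -1.81, 7, -0.03),
    ("Q", -1.52, 14, 0.95),
]

probabilities = {}
for person, ability, item, difficulty in cases:
    p = rasch_probability(ability, difficulty)
    probabilities[(person, item)] = p
    print(f"Person {person}, Item {item}: P(correct) = {p:.2f}")
```

With these estimates, the model gives Person R only a few chances in a hundred of answering Item 19 correctly, and Person Q less than one chance in ten on Item 14, which is consistent with the guessing interpretation.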
To verify whether the ability of these 20 test takers matches their regular performance, we checked their regular records and their ranks among the 45 students in the class. We also checked the assessment of these 20 test takers by their English teacher, but we did not tell the teacher that we were doing this kind of research, for fear that the information he gave us might be biased in our favor. The regular record of their English performance and their assessment by the teacher match very well with the scores they obtained on the 20-item test, as shown in Table 4 below.
This 20-item test is of good quality, except that some items are too difficult for these students. For example, for Items 13 and 19, the average final calibration is 0.06, which means the calibration is appropriate; this corresponds to 0.5-0.6 in classical test theory (CTT) terms. However, the level of the participants is comparatively low: the average of the final measures is -0.45, which is far below 1.
Table 4 Matrix of Scores from Four Tests and the Testees’ Ranks
test | 20-item test | | mid term exam | | final exam | | monthly exam | |
persons | rank | score | rank | score | rank | score | rank | assessment | group
A | 1 | 13 | 1 | 106 | 3 | 98 | 4 | 102 | Upper
B | 1 | 13 | 2 | 104 | 2 | 102 | 3 | 105 | Upper
C | 1 | 13 | 3 | 103 | 4 | 97 | 6 | 96 | Upper
D | 2 | 12 | 4 | 101 | 1 | 106 | 2 | 107 | Upper
E | 3 | 10 | 5 | 99 | 5 | 95 | 7 | 95 | Upper
F | 3 | 10 | 6 | 99 | 1 | 106 | 5 | 97 | Upper
G | 3 | 10 | 7 | 96 | 6 | 83 | 10 | 81 | Upper
H | 4 | 9 | 8 | 94 | 8 | 83 | 8 | 91 | Upper
I | 4 | 9 | 8 | 94 | 6 | 92 | 12 | 75 | Upper
J | 5 | 8 | 9 | 92 | 7 | 85 | 9 | 87 | Upper
K | 5 | 8 | 10 | 59 | 9 | 70 | 14 | 65 | Lower
L | 6 | 7 | 11 | 58 | 11 | 62 | 17 | 40 | Lower
M | 6 | 7 | 12 | 56 | 10 | 69 | 18 | 39 | Lower
N | 6 | 7 | 13 | 54 | 13 | 52 | 1 | 112 | Lower
O | 7 | 6 | 14 | 48 | 15 | 48 | 11 | 80 | Lower
P | 7 | 6 | 15 | 47 | 11 | 68 | 15 | 61 | Lower
Q | 8 | 5 | 16 | 46 | 15 | 48 | 16 | 45 | Lower
R | 9 | 4 | 17 | 45 | 14 | 50 | 18 | 39 | Lower
S | 10 | 3 | 18 | 45 | 12 | 54 | 13 | 69 | Lower
T | 10 | 3 | 19 | 38 | 16 | 29 | 19 | 37 | Lower
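One way to put a number on how well the 20-item scores in Table 4 agree with the other tests is a Pearson correlation. The article does not report such a coefficient, so the Python sketch below is only a supplementary illustration using the Table 4 scores for the 20-item test, the mid-term exam and the final exam; the helper `pearson` is our own.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Scores of persons A-T, copied from Table 4.
twenty_item = [13, 13, 13, 12, 10, 10, 10, 9, 9, 8,
               8, 7, 7, 7, 6, 6, 5, 4, 3, 3]
mid_term = [106, 104, 103, 101, 99, 99, 96, 94, 94, 92,
            59, 58, 56, 54, 48, 47, 46, 45, 45, 38]
final_exam = [98, 102, 97, 106, 95, 106, 83, 83, 92, 85,
              70, 62, 69, 52, 48, 68, 48, 50, 54, 29]

r_mid = pearson(twenty_item, mid_term)
r_final = pearson(twenty_item, final_exam)

print(f"20-item vs mid-term: r = {r_mid:.2f}")
print(f"20-item vs final:    r = {r_final:.2f}")
```

A coefficient close to 1 would support the statement that the 20-item scores match the students' regular record; with only 20 persons, any such figure should be read as indicative only.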
IV. Conclusions
From the above discussion, we may now come to these conclusions:
- IRT is more powerful in revealing the latent traits of the test takers;
- IRT is more economical so far as language testing is concerned because it does not need too many items to measure the true language ability of test takers;
- From the data of other tests given to the same group of students, we can judge that classroom teaching is going on smoothly as the participants of our study show fairly compatible results of their English studies.
Nevertheless, this kind of study needs to be widely replicated before we can safely say that IRT is a powerful tool for measuring and assessing language teaching and testing.
References
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.
Baker, F. B. (2001). The basics of item response theory (2nd ed.). ERIC Clearinghouse on Assessment and Evaluation.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-472). Reading, MA: Addison-Wesley.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model. New Jersey: Lawrence Erlbaum.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.
Hambleton, R. K., & Rogers, H. J. (1990). Using item response models in educational assessments. In W. Schreiber & K. Ingenkamp (Eds.), International developments in large-scale assessment (pp. 155-184). England: NFER-Nelson.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Harvey, R. J., & Hammer, A. L. (1999). Item response theory. The Counseling Psychologist, 27(3), 353-383.
Henning, G. (2001). A guide to language testing: Development, evaluation and research. Foreign Language Teaching and Research Press & Heinle/Thomson Learning Asia.
Jaeger, R. M. (1991). Series editor’s foreword. In R. K. Hambleton, H. Swaminathan, & H. J. Rogers, Fundamentals of item response theory. Newbury Park, CA: Sage.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. NJ: Lawrence Erlbaum.
Wright, B., & Stone, M. (1979). Best test design. Chicago: MESA Press.