Abstract
The construct
validation of a multiple-choice listening test requires some evidence that text
and text associated variables play a significant role in predicting item difficulty.
The purpose of this study is to investigate the effects of task features on test
performance in EFL listening tests by determining how well item difficulty can
be accounted for by text factors, item factors and text-item factors. A sample
of 159 items of CET listening tests was analyzed, based on which a summary of
task features of CET listening passages is presented. Furthermore, the results
of correlation and regression analyses indicate that text-by-item interaction
variables contribute significantly to item difficulty, thereby providing evidence
favoring the construct validity of CET listening tests. The two best predictors of
item difficulty are the redundancy of necessary information and lexical overlap
between words in the text and words in an item's options.
Key words:
test task, construct validity, CET, EFL listening tests
Introduction
In
the field of language testing, there is a steadily growing interest in the identification
and characterization of those factors which affect the test performance of the
language learner with the objective of achieving more informed construct validation
results (Bachman 1990; Foster & Skehan 1996; McNamara 1996). Bachman (2002:
471) points out that we should clearly distinguish among three sets of factors
that can affect test performance:
1) Characteristics inherent in the task
itself
2) Attributes of test takers
3) Interactions between test takers
and task characteristics
Language test performance can be attributed in part to
test task features. Their effects may reduce the effect on test performance of
the language abilities we want to measure, and hence the interpretability of test
scores (ibid.). It is, therefore, vitally important for language testing
researchers to determine what the nature of the relationship between test tasks
and test performance is, and how it affects the interpretation of test results.
The information can be used as the basis for the improvement of test reliability
and validity, and more specifically, for the design of tests for particular populations.
It is out of these considerations that the present study undertakes an in-depth
analysis of the relationship between major test task features and
students' test performance in EFL listening tests. The choice of EFL listening
tests as the focus of the study is of particular significance in the context of
China's college English teaching. Developing students' ability to use English
as a tool of communication, especially their listening and speaking abilities,
is clearly specified as the objective of college English teaching in China.
The
main purpose of the present study is to examine the construct validity of multiple-choice
listening comprehension tests. To be valid, a multiple-choice test of listening
must demonstrate sensitivity to the information in the text passages. One serious
criticism regarding the construct validity of listening tests maintains that examinees
do not, or need not, listen to the passage in order to answer the items.
Freedle and Kostin (1999) point out that one could counteract such criticisms
by showing that some variables that reflect the structure and content of the text
passage are significantly correlated with item difficulty. Finding such significant
correlations would indicate that examinees are probably paying attention to text
information and are using this information to guide their selection of answers
to the items. In particular, they suggest that the minimum level of validity for a
multiple-choice test requires finding some significant support for the effect
of text variables on test item difficulty. Therefore, a related purpose of the
study is to examine whether text and text associated variables play a significant
role in predicting item difficulty.
The following two research questions
are addressed:
1) What are the major task features of EFL listening tests?
2) How can task features affect performance in EFL listening tests?
Review of related studies
Task features can be further categorized into those
related to task input (or text) and those related to the test item. A review of studies examining
task features and test performance suggests that variations in the specific characteristics
of task input and test item affect the difficulty of items. In listening comprehension
we could find only a few empirical studies in which factors that
may affect listening task difficulty are examined and identified. Freedle and
Kostin (1996) examined 337 TOEFL items, which asked a small number of multiple-choice
comprehension questions on short spoken passages. They found that a different
set of attributes worked better for each item type. For example, in the case of
items that asked for the identification of the main idea, three attributes identified
were lexical overlap, rhetorical structure of the passage, and topic.
Nissan
et al. (1996) analyzed TOEFL dialogue items and found five significant variables
relating to listening performance. The three best predictor variables were (a)
the presence of two or more negatives in the dialogues, (b) the need to draw an
inference beyond what is explicitly stated in the dialogue, and (c) the pattern
of utterances in the dialogue.
Brindley and Slatyer (2002) reported on
an exploratory study that examined the effects of task characteristics and task
conditions on learners' performance in competency-based listening assessment tasks.
Key variables investigated included the nature of the input and the response mode,
namely speech rate, text type, number of hearings, input source (live vs. audio-recorded)
and item format. Quantitative and qualitative analyses of test scores indicated
that speech rate and item format could affect task and item difficulty.
Kostin
(2004) explored the relationship between a set of item characteristics and the
difficulty of TOEFL dialogue items. This study replicated some of the significant
findings in Nissan et al. (1996). In particular, it found that the lexical
overlap between words in the text and words in an item's options affects listening
item difficulty.
Buck and Tatsuoka (1998) were concerned with identifying
cognitive abilities needed to perform short-answer comprehension questions. Three
structural components of the listening test tasks have been singled out as influencing
item difficulty.
1) The necessary information (NI): This refers to "information
in the text which the listener must understand to be certain of the correct answer"
(Buck & Tatsuoka 1998: 134). The location of the NI and its linguistic characteristics
are found to be key factors affecting item difficulty and candidate responses.
2)
The surrounding text: This refers to the text immediately surrounding the necessary
information. The characteristics of this part of the text are found to have a
greater effect on item difficulty than the characteristics of the whole text.
3)
The stem: This is defined as the written text on the answer sheet which test takers
have in front of them as they listen and which serves both as a listening guide
and a structure for the written response. In constructed-response tasks, the stem
would be the beginning of the short answer question (SAQ) to be answered.
The
present study builds on these findings and explores their applicability to EFL
listening comprehension tests in China. On the basis of the literature review,
a framework of variables assessing test task features was developed, comprising
four groups of variables: text variables, item variables, text/item variables,
and item type.
Text variables characterize the content and structure of
the listening passage itself and these variables can be further classified in
terms of word-level, sentence-level, and discourse-level factors. These variables
are related to the linguistic characteristics which have been traditionally associated
with comprehension difficulty. Item variables constitute the so-called "pure"
item variables which can be coded without reference to the contents of the listening
passage. Only the contents of the item itself are used to quantify these particular
variables. Three types of item were studied (Freedle and Kostin 1999): detail
explicit, detail implicit and main idea items. Text-by-item (or, alternatively, text/item
overlap) variables are defined as variables that necessarily reflect the contents
of both the test items and the text to which those items apply. These factors
typically involve an interaction between features of the text and features of
the item. Item types are a special type of text/item overlap and they refer to
the response expected from the test taker to the task. In general, there are two
types of response: selected and constructed (Bachman 1990: 129).
Materials and method
1. Item sample
The objective of the analysis was to
investigate whether two factors of listening tasks (text and test items) exercise
a systematic influence on item difficulty. Items were coded on the factors believed
to affect performance (vocabulary frequency, syntactic complexity, topic, etc.), and
the item scores on these factors were then used to predict item difficulty.
The
159 listening comprehension items taken from 16 disclosed post-1992 CET Band-4
forms comprise the total item sample. The National College English Test of China
(CET) is a national standardized test of English proficiency administered to Chinese
college students. Listening comprehension is the first part in the CET. Students
should be able to get the gist of the discourse, understand the main points and
important details, and recognize the opinion and attitude of the speaker. The
listening sub-test has two sections and lasts 20 minutes. Section A contains ten
short conversations and Section B contains three passages.
After each
passage, there are three or four questions about it. Each recording is played
once only. The passages in Section B are stories, talks, etc on personal life,
social and cultural issues, and popular science. Item type includes multiple-choice
questions and compound dictation. A more detailed description of the current listening
comprehension sub-test is presented in Appendix A. In this study, the correct
option will be referred to as the key, and the incorrect options will be referred
to as the distracters.
The item sample included 19 inference and 140 explicit
questions. As each test form contains three passages and 10 items, there should
be 48 passages and 160 items. However, one item was deleted since it is a true-or-false
question and does not fit the two question types under investigation. The original
data on item difficulty for the 159 items were collected from three different
test centers in China and involved approximately 1000 college students learning
English as a foreign language. These students were randomly selected from a much
larger pool of test takers who responded to each College English Test (CET) Band-4
test form.
2. Study variables
The content and structure of the
items and their associated text passages were represented by a set of predictor
variables that included a wide variety of text and item characteristics identified
from the experimental language-comprehension literature. Given the practical difficulties
involved in investigating the effects of all of these variables simultaneously,
it was decided to narrow the range of investigation to 23 key variables that seemed
most relevant in the context of the EFL listening tests under investigation. At the
same time, from a theoretical perspective the study presented an opportunity to
investigate some of the hypotheses that have been advanced in the research literature
concerning those variables that affect second and foreign language listening comprehension.
Below is a summary of the 23 coded variables for initial investigation (22 predictors
and the dependent variable).
Not all variables were used in the analyses. Because of low frequencies of occurrence,
defined as two or fewer occurrences in the N = 159 sample, the variables V02,
V03 and V13 were deleted. Thus a total of 19 predictor variables were retained, including 10
text variables, 2 item variables, and 7 text/item variables.
Text variables
Word-level variables
V01: number of words with more than two syllables among the first 100 words
V02ª: presence of an infrequent word which is relevant to responding correctly
An infrequent word is a word not listed in The Most Common 100,000 Words Used in Conversations (Berger 1977).
V03ª: presence of an idiom which is relevant to responding correctly
An idiom is defined, following the American Heritage Dictionary (2000), as an
expression consisting of two or more words having a meaning that cannot be deduced
from the meanings of its constituent parts.
Sentence-level variables
V04: average number of words per sentence in text
V05: number of dependent clauses in text
V06: number of words in the longest T-unit
A T-unit is defined as an independent clause
with any attached dependent clauses (Hatch & Lazaraton 1994).
Discourse-level variables
V07: number of negations in text
Negative markers (e.g., no and not) are counted, as well as negative prefixes
(e.g., un- and dis-). Negative tags are also counted, even if their meaning is not negative.
V08: number of interrogative sentences
V09: number of references
V10: coherence (1 = min coherence; 3 = max coherence)
High coherence means that elements of the opening text sentence are densely
represented throughout the text, etc.
V11: position of main idea in text (0 = main idea implicit; 1 = in last text sentence;
2 = in middle of text; 3 = among first three sentences)
V12: rhetorical organization (description, causation, comparison)
V13ª: topic of text (0 = non-academic topic; 1 = academic topic)
Item variables
V14: explicit (e.g., What is the boiling point of lead?)
V15: inference (e.g., According to the passage, one can infer ...)
Text/Item variables
V16: position of necessary information (1 = among the last three sentences;
2 = in middle of text; 3 = among first three sentences)
Necessary information (NI) refers to "information
in the text which the listener must understand to be certain of the correct answer"
(Buck & Tatsuoka 1998: 134).
V17: indication of necessary information (explicit indication that NI is coming next)
V18: redundancy of necessary information (all or part of NI is repeated in text)
V19: number of words in the key
V20: lexical overlap in the key (the key has more words that overlap with words
in the text than the distracters do)
V21: lexical overlap in distracters (the distracters have more words that overlap
with words in the text than the key does)
V22: use of background knowledge to infer the answer
Dependent variable
V23: item difficulty (equated delta, a standardized measure of difficulty)
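The text does not spell out how the delta values were computed. As a rough illustration only, the sketch below assumes the conventional ETS delta scale, on which an item's proportion correct p is mapped to 13 - 4z, with z the standard normal deviate for p, so that harder items receive larger values. The function name `delta_from_p` is hypothetical, and the equating step that places deltas from different test forms on a common scale is not shown.

```python
from statistics import NormalDist


def delta_from_p(p_correct: float) -> float:
    """Map a proportion correct to an (unequated) delta value.

    Assumption: the conventional ETS delta scale, 13 - 4 * z, where z is
    the standard normal deviate corresponding to the proportion correct.
    """
    if not 0.0 < p_correct < 1.0:
        raise ValueError("p_correct must lie strictly between 0 and 1")
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)


# An item answered correctly by half of the examinees sits at the midpoint.
print(round(delta_from_p(0.5), 2))  # 13.0
```

Under this assumed scale, an item that 30% of examinees answer correctly receives a higher delta than one that 70% answer correctly.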
Finally, it should be noted that
this study did not examine phonological features of test tasks, although previous
studies have demonstrated effects of acoustic input on listening comprehension
(e.g., Yong Zhao 1997). The reason is that phonological factors including accent,
speech rate and sandhi are under strict control in test design of CET listening.
3. Procedure
The first data analysis task involved coding each
of the 48 passages for task input features. The analysis was based
on the coding of the researcher. A second coder was recruited to establish inter-coder
reliability for those variables requiring subjective judgment. The correlation
coefficient between the two coders on a sample of 12 passages and 40 items was
.86. This high inter-coder reliability was taken to justify relying on a single coder for the
rest of the coding.
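The inter-coder check described above can be sketched as a Pearson correlation between the two coders' ratings. The ratings below are invented for illustration (they are not the study's data), and the text does not state which correlation coefficient was actually used, so Pearson's r is an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical coherence ratings (V10, scale 1-3) from two coders
# on the same eight passages.
coder_a = [3, 3, 2, 3, 1, 2, 3, 2]
coder_b = [3, 2, 2, 3, 1, 2, 3, 3]
print(round(pearson_r(coder_a, coder_b), 2))
```

For ordinal codes such as V10, a rank-based coefficient or an agreement index might be preferred; the choice here is only for simplicity.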
As preliminary procedures, descriptive statistics were
first generated to check the central tendency
and dispersion of the data and to confirm that the distributions were approximately
normal, so that the subsequent analyses would be valid for the research questions.
A
series of ANOVAs was conducted with text organization as the grouping factor
to discover whether passages of different text organizations
vary in text features. Afterwards, correlations between three sets of task factors
(i.e., text variables, item variables, and text/item variables) and item difficulty
were computed. Multiple regressions were subsequently used to identify the best
predictors of item difficulty from these sets of variables considered together.
The aim was to identify the variables predictive of item difficulty, or more
specifically, to explore specific task features associated with certain levels
of item difficulty.
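The first analytic step, a one-way ANOVA with text organization as the grouping factor, can be sketched as follows. The clause counts are invented for illustration, and the helper `one_way_anova_f` is a hypothetical name; a statistics package would normally also supply the p value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical numbers of dependent clauses (V05) in passages grouped
# by rhetorical organization (V12).
description = [4, 5, 6]
causation = [8, 9, 10]
comparison = [5, 6, 7]
f_value, df1, df2 = one_way_anova_f([description, causation, comparison])
print(round(f_value, 2), df1, df2)  # 13.0 2 6
```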
Results and discussion
1. Overall results of text materials
In response
to the first research question, "What are the major task input features of
EFL listening tests?", CET listening passages were analyzed in terms of text
variables which characterize the content and structure of the passage itself.
The results obtained allow us to summarize the task input features of the listening
comprehension passages (see Appendix 1). Among the 48 passages, description is the
most common text organization, followed by causation and comparison.
Listening passages show no significant difference in a number of text features,
including text length, vocabulary frequency, syntactic structure, and content.
Moreover, most passages are highly coherent and the main idea is explicitly stated
among the first three text sentences.
Meyer's (1985) framework of rhetorical
organization was modified to define passage groups in the study. During the coding
procedure, it was recognized that there is a certain amount of overlap in the
text organization. For example, the problem-solution might contain elements of
causation, whereas the listing structure might contain elements of both. In addition,
since too many text types would complicate the research design, it was therefore
decided to adopt only three types of rhetorical organization: description, comparison
and causation.
The variables of coherence and text organization present
a highly centralized distribution around the median, suggesting consistency
in the text types used. ANOVA results indicate that the number of dependent clauses differs
significantly among rhetorically different texts (see Appendix 2). Causation texts
contain significantly more dependent clauses. It is also worth noting that significant
differences exist in the number of negations between causation and comparison texts.
Another
interesting finding involves the topic of passages. The variable V13 was developed
to reflect academic vs. nonacademic topics. Differential familiarity with different
topics covered by listening passages may play a role in accounting for listening
performance. It seems likely that items that inquire about nonacademic topics
may, because of their greater general familiarity, be easier than items about
academic topics. However, only three passages in the sample involve an academic
topic, suggesting that CET listening passages are not field-specific. Thus the
construct-irrelevant variance due to topical familiarity is likely to be small, which
supports the content validity of the test.
In summary, the findings
concerning text variables provide evidence for the construct validation
of CET-4 listening tests. Validity centers on the extent to which inferences and
interpretations drawn from test scores are supported by the available evidence about what
the assessment instrument measures. Bachman (1996) describes validation as a general
process that consists of the gathering of evidence to support a given interpretation
or use, a process that is based on logical, empirical and ethical considerations.
Thus validation should ensure that the differences in test performance of different
test taker groups are related primarily to the abilities that are being assessed
and not to construct-irrelevant factors.
Construct-irrelevant factors
in terms of content bias include topical knowledge and technical terminology,
specific cultural content and dialect variations. Format bias could include multiple-choice,
constructed response, computer-based responses, and multi-media materials. Other
key construct-irrelevant factors include insensitive or offensive test materials
and materials that stereotype and show certain test taker groups in unfavorable
light (Kunnan 2000: 3). Our results demonstrate that construct-irrelevant factors
in terms of test materials are not related to performance in the context of CET-4
listening tests.
2. Correlations between task variables and item difficulty
Table
1 presents the variables that correlate significantly with
item difficulty. Of the 19 variables examined, five yielded a significant
correlation (p < .05) with item difficulty (equated delta).
V10: coherence
of text
V15: inferencing
V18: redundancy of necessary information
V20: lexical overlap in the key
V21: lexical overlap in distracters
As
expected, other task features (e.g., linguistic and discourse features of passages)
did not significantly contribute to the listening item model. Overall, the correlation
results suggest that many of those variables found to influence comprehension
in the experimental language comprehension literature also influence our multiple-choice
listening data.
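The significance of a correlation coefficient with N = 159 can be checked by converting r to a t statistic with N - 2 degrees of freedom. The sketch below applies this to the coefficients reported for the five variables in this section; the two-tailed critical value of about 1.98 for 157 degrees of freedom is taken from standard t tables and is an assumption, as the text does not report the test used.

```python
from math import sqrt


def t_for_r(r: float, n: int) -> float:
    """t statistic for testing a Pearson r against zero (df = n - 2)."""
    return r * sqrt(n - 2) / sqrt(1.0 - r * r)


# Correlations with item difficulty as reported in this section (N = 159).
reported = {"V10": 0.195, "V15": -0.219, "V18": 0.388, "V20": -0.356, "V21": 0.404}

CRITICAL_T = 1.98  # assumed two-tailed .05 critical value for df = 157

for name, r in reported.items():
    t_value = t_for_r(r, 159)
    print(name, round(t_value, 2), abs(t_value) > CRITICAL_T)
```

Even the smallest coefficient (r = .195) gives a t of about 2.49, which exceeds the assumed critical value.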
Table 1 (available in the MS Word version of this document)
The
first variable whose p value is less than the critical probability is V10
(coherence of text); the correlation (r = .195*, N = 159) indicates that texts with high
coherence were associated with easier listening items, as expected. Coherence is
characterized as the degree of unity, or how well a text holds together. A well-organized
text would be better recalled, and a tight top-level rhetorical organization would
enhance comprehension because the ideas in the text are closely interlinked (Meyer
& Freedle 1984; Meyer et al. 1993).
The variable V15 (inferencing)
is significantly correlated with item difficulty (r = - .219**, N = 159),
indicating that items are more difficult when an inference is required to respond
correctly. This result was expected in that making inferences is more cognitively
demanding, and consequently, may impede listening comprehension performance.
With
regard to this result, the question arises whether question type might threaten
test validity. If scores on a listening comprehension test reflect only language
comprehension, item scores should be predictable only from linguistic features
of the items and from the language comprehension skills of the students. Other
item features, such as question type, are not supposed, or even allowed, to
influence the performance of students. However, only 19 of the 159 items in this
study, about 12% of the items, were coded for this variable. This makes it difficult
to draw firm conclusions about the effect of question type on item difficulty.
The
third variable meeting the critical probability criterion is V18 (redundancy of
necessary information). V18 correlates positively with item difficulty (r
= .388**, N = 159). When the necessary information was repeated, items were easier.
This is consistent with previous studies. Necessary information refers to information
in the text that listeners must understand to respond correctly. Its location
and linguistic characteristics are found to be key factors affecting item difficulty
and candidate responses (e.g., Buck & Tatsuoka 1998).
In addition,
some researchers maintain that redundancy has a significant effect on listening
comprehension. For example, Chiang and Dunkel (1992) found that redundancy does
play a significant role in comprehension; Parker and Chaudron (1987) found that
repetition of the information plus clear segmenting of the thematic structure
enhanced oral comprehension. The present finding that repetition of necessary information
is associated with easier items is therefore in line with this literature.
There are substantial lexical
overlap effects operating in listening. The two lexical overlap variables (V20,
V21) yielded significant coefficients for prediction of item difficulty. Lexical
overlap between words in the key and words in the relevant text sentence was significantly
related to item difficulty for listening passages (r = -.356**, N = 159). A significant and
fairly strong positive correlation exists between item difficulty and lexical overlap between
words in the distracters and words in the relevant text sentence (r = .404**, N =
159).
The variable V20 (lexical overlap in the key) was negatively related
to item difficulty, indicating that items with a high percentage of lexical overlap
in the key tend to be easier items. Similar findings in regard to percentage of
lexical overlap in the key have been reported for TOEFL mini-talks (Freedle &
Kostin 1999) and for TOEFL reading (Freedle & Kostin 1993). One might be concerned
that a test taker having little or no comprehension of a passage could nevertheless
perform well on CET items by simply choosing the option that had the most lexical
overlap with the passage. Some information relevant to this concern is provided
by the results regarding V20. Only 36 of the 159 items in this study, about 23% of
the items, were coded for this variable. Thus, a strategy of always selecting the
option with the most lexical overlap would be unlikely to yield a good score
on these items.
The findings also suggest that item difficulty is
related to lexical overlap between words in the distracters and words in the passage.
The correlation for V21 (lexical overlap in distracters) indicates that items
tend to be easier when no distracter has more words that overlap with the passage
than does the key. This suggests that if distracters have more lexical overlap
with the passage than the key does, the item is harder. Items tend
to be harder when all three distracters have more words overlapping with the passage
than does the key.
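The coding of the two lexical overlap variables can be illustrated with a small sketch. It rests on several assumptions not stated in the text: overlap is counted over distinct lowercase word forms with no stemming or stop-word removal, V20 is set when the key overlaps more than every distracter, and V21 is set when any distracter overlaps more than the key. The passage and options are invented, and the function names are hypothetical.

```python
import re


def word_set(text):
    """Distinct lowercase word forms in a string."""
    return set(re.findall(r"[a-z']+", text.lower()))


def overlap_count(option, passage):
    """Number of distinct option words that also occur in the passage."""
    return len(word_set(option) & word_set(passage))


def code_overlap(passage, key, distracters):
    """Return (V20, V21) as 0/1 flags under the assumptions stated above."""
    key_n = overlap_count(key, passage)
    distracter_n = [overlap_count(d, passage) for d in distracters]
    v20 = int(all(key_n > n for n in distracter_n))
    v21 = int(any(n > key_n for n in distracter_n))
    return v20, v21


# Invented example item.
passage = "The library opens at nine and closes at five on weekdays."
key = "It closes at five."
distracters = ["It is closed on Sunday.", "It opens at noon.", "It has many books."]
print(code_overlap(passage, key, distracters))  # (1, 0)
```

Here the key shares three words with the passage and no distracter shares more than two, so the item would be coded V20 = 1 and V21 = 0.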
The direction in which these five variables correlated
with item difficulty is consistent with the findings in the research literature.
This suggests that the results regarding at least some of these variables
are likely to replicate.
3. Regression analyses
In response
to the second research question "how do task features affect performance
in EFL listening tests", regression analyses were performed with item difficulty
as the dependent variable.
Linear regression was employed to model the
value of the dependent variable (item difficulty) based on its linear relationship
to the predictors (V01, V04, V05, V06, V07, V08, V09 and V19). As is shown in Table
2, the small value of R squared indicates that the model does not fit the data
well. Only 4.7% of the variation in the dependent variable could be explained by the
regression model. As expected, average sentence length and syntactic complexity
effects were not significant for listening items. The ANOVA table summarizes the results
of the variance analysis. The significance value of F is larger than 0.05, indicating
that these text variables at the word and sentence levels cannot explain the variation
in item difficulty.
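The link between the small R squared and the non-significant F can be made concrete. The sketch below recovers an overall F value from R squared, assuming that all eight listed predictors were entered together with an intercept and that all 159 items were used; the F value itself is not reported in the text, so the result is only an illustration.

```python
def f_from_r_squared(r_squared: float, n_items: int, n_predictors: int):
    """Overall F statistic of a linear regression computed from its R squared."""
    df_model = n_predictors
    df_resid = n_items - n_predictors - 1
    f_value = (r_squared / df_model) / ((1.0 - r_squared) / df_resid)
    return f_value, df_model, df_resid


# R squared = .047 as reported; N = 159 items; 8 predictors (assumed).
f_value, df_model, df_resid = f_from_r_squared(0.047, 159, 8)
print(round(f_value, 2), df_model, df_resid)  # 0.92 8 150
```

An F below 1 falls well short of the critical value of roughly 2.0 for these degrees of freedom (a figure taken from standard F tables), which agrees with the reported significance value above 0.05.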
Table 2

The
categorical nature of the variables V10, V11, V12, V15, V16, V17, V18, V20, V21,
and V22 and the nonlinear relationship between these task input variables and
item difficulty suggest that nonlinear regression may perform better than standard
regression. When all the independent variables were entered as a block, the fit
of the model improved markedly. Measures of the model fit are displayed in Table
3. The overall F = 6.302, p < .001; the multiple R = .613 and the R-squared
= .316, which accounts for 31.6% of the item difficulty variance. The significance
value of the F statistic means that the variation explained by the model is unlikely to be
due to chance. The independent variables thus explain a substantial share of
the variation in the dependent variable, and the multiple R shows that the overall correlation
between predictors and the dependent variable is fairly strong.
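The text does not name the software or the exact procedure behind the nonlinear regression. One common way to let a categorical predictor such as V12 enter a regression is dummy (indicator) coding, shown here as a simpler stand-in rather than as the procedure actually used. The function name and the sample values are hypothetical.

```python
def dummy_code(values, reference):
    """Indicator-code a categorical variable, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    rows = [[int(v == level) for level in levels] for v in values]
    return levels, rows


# Hypothetical rhetorical organization codes (V12) for four passages.
organization = ["description", "causation", "comparison", "description"]
levels, rows = dummy_code(organization, reference="description")
print(levels)  # ['causation', 'comparison']
print(rows)    # [[0, 0], [1, 0], [0, 1], [0, 0]]
```

Each non-reference category then receives its own coefficient, so the model no longer assumes that the categories are equally spaced on a numeric scale.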
Table 3

The
statistical results reported here clearly demonstrate that task input and test
item both contribute to the prediction of item difficulty. The regression procedure
yielded three significant predictors of item difficulty:
V18: redundancy
of necessary information
V20: lexical overlap in the key
V21: lexical
overlap in distracters
Lexical overlap in distracters is the best predictor
of item difficulty (β = 0.32).
The second best predictor is redundancy of necessary information (β = -0.25),
followed by lexical overlap in the key (β = -0.23).
The direction in which these variables correlated with item difficulty
is consistent with the previous findings. It should be noted that although the
standardized coefficients were statistically significant, they were quite small
in absolute value, ranging from 0.23 to 0.32. In general, it seems fair to say that the
findings from this study are to a certain degree consistent with Freedle and Kostin's
(1996) assertion that lexical overlap and necessary information can be singled
out as influencing item difficulty.
Although the simple correlation between
coherence of text (a text variable) and item difficulty is significant, regression
analyses indicate that coherence does not contribute significantly to the prediction
of item difficulty. This is understandable, since the centralized distribution
of the variable leaves little variance and may thus limit its effect on item difficulty.
It is also worth
noting that a pure item variable such as question type appears to play a weak role
in influencing item difficulty, while text and text-associated (text/item overlap)
variables play by far the major role in accounting for passage item difficulty.
We are led to conclude that there is modest evidence to support the claim that
the CET listening passages and their associated items have construct validity.
Conclusion
1. Limitations of the study
There are some serious limitations to the design
of the present study. First, item difficulty is not the dependent variable of theoretical
interest. We are generally far more interested in understanding person performance
ability than item difficulty (Buck & Tatsuoka 1998: 126). Our regression analysis
puts the emphasis on item characteristics rather than performance ability. Another
drawback of regression is that it only provides information about
group performance; it cannot tell us what factors specific test takers have mastered.
Finally, it is appropriate to note that the variables measured in this study are
far from being exhaustive or comprehensive. These variables simply come from a
survey of the research literature. Nevertheless, the findings are suggestive and merit
further investigation.
2. Summary of major findings
In this study
we have been interested primarily in determining how well the difficulty of listening
items can be accounted for by a set of task features which involve text factors,
item factors and text-item factors. The results concerning task input variables
provide clear evidence for the construct validation of CET-4 listening tests.
Listening passages used in CET Band-4 show no significant differences in linguistic
characteristics such as vocabulary frequency, syntactic complexity, and content.
Particularly, construct-irrelevant factors such as topical familiarity and dialect
variations are minimized in the test materials, suggesting that test takers' performance
on the test is primarily related to the abilities that are being measured.
More
importantly, the empirical results demonstrate the effect of text variables on
difficulty of test items and thereby provide evidence of test validity of CET.
Two text associated factors (text-item factors) are directly tied to item difficulty
in EFL passage listening:
1) Necessary information refers to the
information in the text which the listener must understand to be certain of the
correct answer, and its redundancy clearly contributes to item difficulty.
2) Lexical overlap between words in the text and words in an item's options
may impact listening item difficulty. Easier items are characterized by a greater
amount of lexical overlap between words in the text and words in the correct option.
In contrast, if there is a greater degree of lexical overlap between words in
the text and words in the incorrect options as compared to the correct option,
the item tends to be more difficult.
These findings will, hopefully, inform
language test developers and researchers regarding the task features that may
influence listening test performance, and therefore, about the construct validation
of listening tests. Our results provide clear evidence that examinees do attend
to the text passages in answering the test items.
References
Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1-42.
Bachman, L.F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453-476.
Bejar, I., Douglas, D., Nissan, S. & Turner, J. (2000). TOEFL 2000 listening framework: A working paper (TOEFL Monograph Series MS-19). Princeton, NJ: Educational Testing Service.
Brindley, G. & Slatyer, H. (2002). Exploring task difficulty in ESL listening assessment. Language Testing, 19(4), 369-394.
Buck, G. & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15(2), 119-157.
Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.
Foster, P. & Skehan, P. (1996). The influence of planning on performance in task-based learning. Studies in Second Language Acquisition, 18, 299-324.
Freedle, R. & Kostin, I. (1993). The prediction of TOEFL reading item difficulty: Implications for construct validity. Language Testing, 10(2), 133-167.
Freedle, R. & Kostin, I. (1996). The prediction of TOEFL listening comprehension item difficulty for minitalk passages: Implications for construct validity (TOEFL Research Report RR-96-29). Princeton, NJ: Educational Testing Service.
Kostin, I. (2004). Exploring item characteristics that are related to the difficulty of TOEFL dialogue items (TOEFL Research Report RR-04-11). Princeton, NJ: Educational Testing Service.
Kunnan, A.J. (2000). Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida. Cambridge: Cambridge University Press.
McNamara, T.F. (1996). Measuring second language performance. London: Longman.
Meyer, B.J.F. & Freedle, R.O. (1984). Effects of discourse type on recall. American Educational Research Journal, 21, 121-143.
Nissan, S., DeVincenzi, F. & Tang, K.L. (1996). An analysis of factors affecting the difficulty of dialogue items in TOEFL listening comprehension (TOEFL Research Report RR-95-37). Princeton, NJ: Educational Testing Service.
Appendix 1 and 2: see the MS Word document or PDF file.