Abstract
In this article, I describe the development and trial of three measurement techniques, each of which provides a different degree of context for the assessment of the subjects’ lexical knowledge. These are the word-definition matching task, which provides no context; the gap-filling task, which provides reduced context at the sentence level; and the rational cloze, which provides discoursal clues extending to the whole of the text. Three language levels were included in the study to establish a range of language ability groups in tertiary education, each of which was given these three assessment types at its own level of language difficulty. The scores were compared within groups and across groups to provide empirical evidence for the hypotheses introduced by the study. The researcher investigated how much context was conducive to success in vocabulary tests at different stages of linguistic ability. Item/global comparisons yielded information on the discrimination power of each test format for each language ability level. The data were also submitted to a Principal Components analysis to see whether each assessment task had a separate construct underlying it. The results showed that the number and the magnitude of the factors that emerged from the analysis were determined by the language proficiency of the group and their ability to exploit the contextual information in the linguistic environment of the word.
Key words: vocabulary assessment; contextualized words; matching; cloze test; sentence-level gap filling
1. Introduction
Scholars in the field of testing argue that assessing vocabulary is necessary in the sense that words are the basic building blocks of language, the units of meaning from which larger structures such as sentences, paragraphs and whole texts are formed (Read, 2000; Schmitt, 2000). Vocabulary is regarded as a priority area in language teaching, especially with recent communicative language teaching techniques and comprehension-based teaching methodologies which promote the monitoring of learners’ progress in vocabulary learning and assessment of how adequate their vocabulary knowledge is in meeting their communication needs.
In the vocabulary testing literature, considerable effort has been devoted to the use of receptive vocabulary tests (also known as vocabulary size tests), which assess the breadth of a learner’s word knowledge (Nation, 1990; Goulden, Nation and Read, 1990). Measurements of vocabulary size have been shown to correlate positively with proficiency levels in reading (Qian, 1999; Anderson and Freebody, 1981; Laufer, 1997), writing (Engber, 1995) and general language proficiency (Meara and Jones, 1988). Other studies have investigated the use of production tests which assess the learners’ depth of vocabulary knowledge (Waring, 1999; Meara and Fitzpatrick, 2000; Verhallen and Schoonen, 1993; Wesche and Paribakht, 1996; Henriksen, 1999; McNeil, 1996; Nation, 2001; Laufer & Nation, 1995; Qian & Schedl, 2004). For many purposes, it is essential to know how well words are known, not just how many words are known (see Meara, 1999 for a discussion on what it means to know a word). Nation (1990) listed eight kinds of native-speaker word knowledge: knowledge of a word’s meaning, spoken form, written form, grammatical patterns, collocations, frequency, associations, and stylistic restrictions (e.g., levels of formality). A third area of concern has been how to measure the receptive and productive dimensions of vocabulary knowledge at the same time. Laufer and Nation (2001) and Laufer and Nation (1999) have carried out such studies and concluded that measures of subjects’ productive ability are reliable complements to receptive measures of vocabulary size and strength.
Showing that vocabulary knowledge is a separate component of language ability, which in turn has its own subknowledges, involves the kind of research called construct validation of tests. Bachman and Palmer (1996) identify two approaches to construct definition: syllabus-based and theory-based. A syllabus-based definition is appropriate when vocabulary assessment takes place within a course of study. Within this framework, the lexical items and the vocabulary skills to be assessed are specified in accordance with the learning objectives of the course. For research purposes, the definition of the construct needs to be based on theory. When we try to define what any particular vocabulary test is measuring, we investigate what we mean by a “word”, what it means to “know” a word, how words are influenced by context, and so on. A brief discussion of construct and contextualization will bring to attention other dimensions of vocabulary assessment.
Construct and Context-related Issues in Vocabulary Assessment
Carrell (1987) defines construct as a particular set of mental tasks that an individual is required to perform on a given test. For any language assessment purpose, one has to recognize two major factors that influence test scores: the knowledge or ability represented by the construct and the testing task. In construct validation, these are generally referred to as trait and method, respectively (Bachman, 1990). Vocabulary knowledge has been defined differently by different researchers. It has often been viewed as the sum of interrelated “subknowledges” such as the knowledge of spoken and written form, morphological knowledge, knowledge of word meaning, collocational and grammatical knowledge, connotative and associational knowledge, and the knowledge of social and other factors that constrain the use of a word (Qian, 1999; Richards, 1976; Nation, 1990). Meara (1999) and Nation (2001) have suggested adding yet another dimension to the existing ones: automaticity of access (i.e. the speed at which one can perform some kind of operation on a word). Another researcher, Henriksen (1999), perceives lexical knowledge as a continuum of three levels: partial to precise knowledge, shallow to deep knowledge, and receptive to productive knowledge.
All of the studies above handle vocabulary knowledge as the knowledge of discrete word items independent of context, which Chapelle (1998) refers to as the “trait” view. In contrast, Read (2000) and Read and Chapelle (2001) argue that lexical skill should incorporate communicative competence in addition to the knowledge of discrete items. As such, a vocabulary test should be defined in relation to a particular context typical of the test takers’ needs, which in turn will lead them to develop effective communicative strategies. Traditional vocabulary testing has been dominated by trait definitions, operationalized in discrete, selective and context-independent tests. For concerns that relate purely to the learner’s acquisition of specific, preselected vocabulary items covered in the language program, it might seem reasonable to assume that test items should contain a single target word without the risk of turning the task into one of reading comprehension. However, the long-term benefit of any vocabulary learning should be to give the learners the incentive to “develop effective communication strategies” (Read and Chapelle, 2001, p.23).
Read (2000) develops an extensive discussion on matters that relate to the use of context by L1 and L2 learners in making inferences about unknown words in texts. Among the valuable observations and evaluations he has made of the research results on this topic, the ones that directly relate to L2 vocabulary acquisition are as follows:
- level of proficiency in the language will determine the reader’s ability to make use of existing contextual clues;
- the presence of context, at whatever length or degree, does not necessarily make it easier for readers to understand the meaning of unknown lexis;
- partial knowledge of the learner and failure to confirm preliminary guesses against the developing context will lead to wrong guesses;
- successful lexical inferencing does not necessarily produce successful retention of the meaning of that word.
The same author has proposed three dimensions in relation to vocabulary assessment to account for a wide variety of testing procedures. These are discrete to embedded, selective to comprehensive, and context-independent to context-dependent (2000, p.9).
Context-related studies since the 1940s have set out to identify and classify the contextual clues that can assist both first and second language readers in inferring the meanings of unknown words. Learners were found to be operating at four linguistic levels: syntactic (the structure of the sentence in which the words occurred), semantic (meaning found in the immediate and wider context of the word), lexical (the form of the word) and stylistic (the exact usage of the word in the given context) (Van Parreren and Schouten-Van Parreren, 1981). For both reduced context and extended context, what is tested is the ability to use contextual clues to determine which particular word fits a blank. At the syntactic level, the learner needs to identify the part of speech of the target word and search for grammatical clues in the clause and sentence the word occurs in. At the discoursal level, the learner can search for expressions of language functions such as definition, comparison and contrast, cause-effect, question-answer, and main idea-details (see also Kitao and Kitao, 1996).
Context studies have mostly employed the Cloze procedure. Some of these researchers (Chihara, Oller, Weaver and Chavez-Oller, 1977) have demonstrated that discourse structure of the text made a significant contribution to performance in the Cloze test while others (Alderson, 1979; Porter, 1976) have produced evidence for the argument that it would be sufficient to refer only to the immediate linguistic context of the Cloze blank for its successful completion.
According to Bachman’s (1985) classification, clues could lie at any of these four levels of context: a. within the clause in which the blank occurs, b. within the sentence, c. beyond the sentence, and d. beyond the text. Bachman suggests that in any fixed-ratio Cloze test, the proportion of items requiring lexical and other linguistic competence would be about the same in each category. Jonz (1990) made a small modification in the first level of Bachman’s classification. The within-the-clause category was split into two categories of clause-level syntactic clues and clause-level lexical clues. Among Jonz’ categories, the ones which strongly involved vocabulary knowledge were clause-level lexis (33.0%) and beyond the text (8.9%), where blanks had to be filled with content words for which there were few or no clues in the text.
If we consider the implications of Bachman’s classification and Jonz’ research results, it would be possible to argue that in a fixed-ratio Cloze test, almost half (42%) of the blanks would explicitly call for lexical knowledge. A very important outcome of this argument would be that if, in any assessment, test-takers are given a discoursal (extended) context which extends beyond the borders of a single sentence, this could mean giving them the advantage of using additional sources of clues, which according to Jonz’ study (1990) is no less than a 32% increase (beyond the sentence and beyond the text) in one’s chances of identifying a certain vocabulary item correctly. Chapelle and Abraham (1990), however, examined scores on four different types of Cloze test and concluded that the fixed-ratio Cloze had the strongest correlation with the writing test and relatively weak relationships with the reading and vocabulary test. The Cloze in Jonz’ study (1976) had a low correlation of 0.54 with the vocabulary subtest in the examination as compared to that of 0.61 with the reading subtest and 0.70 with the composition.
Alderson (1979) observed that fixed-ratio Cloze procedures produced different tests, resulting in different degrees of success, depending on whether the eighth rather than the sixth word is chosen for deletion. From this he concludes that maybe “the principle of randomness needs to be abandoned in favor of the rational selection of deletions” (p. 226). Read (2000) draws attention to the need for research that investigates the rational Cloze as a means of assessing the learners’ ability to supply missing content words on the basis of contextual clues. Despite the shift away from decontextualized discrete-point tests, there is still a lack of research evidence on the role of context in vocabulary assessment. Examining the issue from another perspective, Kanatlar (1995) looked for the difference between intermediate and elementary level students’ use of contextual clues in vocabulary learning and found that the latter group used contextual clues more frequently than the former. Subjects in Utar’s study (2005) benefited from a reading passage which accompanied the two vocabulary assessment tasks of Multiple-choice and Matching, as they scored significantly higher than subjects who were deprived of this discoursal context.
Campbell and Fiske (1959) developed a methodology known as multitrait multimethod (MTMM) construct validation, which provides a way of evaluating separately the contributions of traits and methods to test scores. A few studies (Hale et al., 1989; Corrigan and Upshur, 1982; Arnaud, 1989) have attempted to demonstrate that vocabulary knowledge is a distinct trait by employing the MTMM methodology, but they failed to produce evidence for the construct validity of their vocabulary tests. Such results illustrate the difficulty of isolating particular elements of the language for assessment purposes.
In language testing, the intended effects and consequences of testing on its users (i.e. washback) should also be seen as an integral part of test design and its evaluation (Read and Chapelle, 2001). Hence, the actual consequences of implementing a test for a particular purpose should later be evaluated in relation to its intended effects. In our classroom progress and achievement tests, the intended effect is to encourage the students to study and revise the vocabulary items presented in each unit of their course textbook. Assessing vocabulary knowledge integratively in a discourse context will encourage more contextualized study of vocabulary by learners who will take such tests.
2. The present study
The findings of relevant studies previously discussed point to the necessity of carrying out research to investigate the effects of varying the amount of context surrounding the target words. None of the published studies, to the author’s knowledge, has involved a rational Cloze designed just to measure vocabulary. Such a design would aim to assess the learners’ ability to supply the missing content words on the basis of contextual clues provided through an extended (i.e. discoursal) context, as operationalized through the rational-deletion Cloze test. The author also aimed to assess the same set of vocabulary items under reduced context and zero context conditions, which are in turn operationalized through a sentence-level Gap-filling task and Matching of target words with their dictionary meanings, respectively. The comparison of data from these three tasks could yield valuable information about how much context will lead to greater success and how much will turn the task into one of grammar and reading comprehension. Subjects are expected to be operating on the given context at the syntactic and semantic level. When no context is given, the subject will be functioning mainly at the lexical level.
The present study was specifically designed to answer the following questions:
1) Will increasing the amount of context surrounding the target words from no context to reduced context to extended context produce significantly different scores for the same set of vocabulary items on related measures of Matching, sentence-level Gap-filling and rational-deletion Cloze test?
2) Will Intermediate, Upper-intermediate and Advanced language ability groups all have developed the linguistic and semantic skills which will enable them to process contextual information provided in assessment measures of Gap-filling and Cloze?
3) Will the same vocabulary items have different discriminatory powers under different contextual environments?
4) Will scores on three different measures of vocabulary knowledge, such as rational Cloze, sentence-level Gap-filling, and Matching, correlate with one another if they are in fact measuring different underlying traits relating to context?
5) Will the ability that each measurement task is designed to measure statistically prove to be a distinct construct?
The following hypotheses were proposed within the framework of these questions:
Hypothesis 1 - For the same set of preselected vocabulary items, there will be a significant difference in subjects’ mean scores obtained from three different measurement techniques of Matching, Gap-filling and Cloze for each of the language proficiency groups.
Hypothesis 2 - Subjects in all language ability groups will get higher scores in proportion to the amount of context provided in the three assessment measures.
Hypothesis 3 – All three measures will allow for mean item/total correlation indices higher than the minimum acceptable of .30 for all language ability groups.
Hypothesis 4 – There will be no significant correlations between subjects’ scores across different measurement tools but there will be a significant relationship between scores on the same measurement tool.
Hypothesis 5 - In a principal components analysis, the data obtained from the three measures will load on more than one major factor, providing evidence that they are in essence measuring more than a single ability.
The present study aims to validate and/or contribute to the scarce amount of research carried out on the effects of context in vocabulary assessment. The scope of this study is therefore the comparison of three testing techniques (Matching, Gap-filling and rational-deletion Cloze) as assessment measures of vocabulary attainment for three different language ability groups. Based on Messick’s (1989) suggestion that a demonstration of the validity of a test should include both logical argumentation and empirical evidence, the author of this study carefully constructed the three measurement tools, administered them to 189 learners of English, and explored the results through item analysis, factor analysis, item/total comparisons of each item and test type, as well as comparison of the subjects’ performances on all test types and texts through correlating their scores. Alderson and Banerjee (2001), in their extensive review article, state that much research has been carried out on large-scale international tests while the more localized tests of the achievement type are neglected. Thus, “the language testing and more general educational communities lack empirical evidence” (p. 221) relating to the value of their assessment instruments.
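The item/total correlation index named in Hypothesis 3 can be sketched in a few lines. The following is a minimal illustration, assuming dichotomously scored items (1 = right, 0 = wrong or missing) and an uncorrected index in which each item is correlated with the total score that includes it; the study does not state which variant was used, and the function names and the response matrix are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # An item (or total) with no variance has no defined correlation.
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_total_correlations(scores):
    """scores: one row per subject, one 0/1 column per item.

    Returns the uncorrected item/total correlation for every item.
    """
    totals = [sum(row) for row in scores]
    n_items = len(scores[0])
    return [pearson([row[j] for row in scores], totals) for j in range(n_items)]


# Hypothetical responses: 4 subjects x 3 items (not data from the study).
demo_scores = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
demo_indices = item_total_correlations(demo_scores)
mean_index = sum(demo_indices) / len(demo_indices)
print([round(r, 3) for r in demo_indices], mean_index > 0.30)
```

Under this reading, Hypothesis 3 holds for a test when the mean of the returned indices exceeds .30.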
3. Methodology
Subjects
The subjects of the study represented three language ability groups. Two of these groups were enrolled in a one-year preparatory English program during the academic year of 2005-2006 at the School of Foreign Languages at Gaziantep University, Turkey. As a result of the placement test administered at the beginning of the academic year, subjects were placed into their proper track as Group A (consisting of the prospective students of the Engineering Faculty and those of the Department of English Language and Literature), Group B (only the students of the seven departments of the Engineering Faculty), and Group C (a heterogeneous language ability group not included in the study). The medium of education for the English Literature department and all the departments of the Engineering Faculty is English. The third group of the study, freshman Literature students, had the greatest exposure to English since some of them had already completed the intensive English program or had otherwise been exempt from it. This third group of students had already taken two courses - ELL 107 Reading I (3-0) and ELL Reading II (3-0) – which primarily aimed at advanced vocabulary development and reading skills. All the subjects are assumed to be instrumentally motivated to learn English because they acquire it for academic and career purposes. Group B subjects had 25 hours of instruction a week whereas Group A students had 20 hours. Literature students took department courses of 18 credit hours a week. The Literature students are called the “Advanced group” in this study by virtue of having been exposed to language skills and practice for a longer period of time, while Group A is comparatively the “Upper-intermediate” group and Group B is the “Intermediate” group.
A total of 189 students participated in the main study. Almost all the students from Groups A and B and the Literature group took part in the study, with the exception of those absent on the day. The tests were given to Groups A and B as a quiz during their class hour by their own instructors. Literature students were informed that the tests were part of a research study and took them in their class hour under the supervision of the researcher. At the time of the study, the subjects had reached the middle of the second semester in their academic programs. The prep school students had taken a track-passing examination in the meantime and were reassigned to their appropriate levels based on their achievements.
Materials and Measurement Tools
Data for this descriptive study were collected through three measurement tasks – rational Cloze, Gap-filling and Matching – to meet the requirements of three context types – extended (discoursal) context, reduced context, and zero context, respectively. One important decision in the design of a selective vocabulary test is how to choose the target words. Usually the test writer must use his judgment in choosing the lexical items with a view to the learning objectives of the assessment. For a classroom progress test, the teachers normally make a selection from the list of words recently studied. In achievement and proficiency tests where vocabulary is to be tested in context, the starting point for selection is likely to be a particular text which contains the kind of lexical items that are required for testing purposes. This study followed the same method in specifying the text and target vocabulary items while constructing the measurement tools described below.
The Rational Deletion Cloze Test: The author’s search for texts that would serve as rational deletion Cloze passages began with identifying short reading passages in an EFL reading textbook (Öndeş, 2004) that naturally contained vocabulary items within the range of the subjects’ lexical knowledge. These words had to be either explicitly or, as with the Literature group, implicitly taught as part of the language learning programs. Thus, it was necessary to get teachers’ opinions on which words to select for assessment. Classroom teachers had a choice of over 20 reading passages at the proper linguistic level of their students and were asked to mark in each text the words that could be considered as learnt by their students. The several passages that were marked with at least 10 familiar items of different parts of speech were later analyzed with a program designed by Paul Nation and programmed by Alex Heatley (2002).
Nation’s RANGE and FREQUENCY program used in this study included three base lists containing the most frequent 1000, 2000, and 3000 words of the English language, respectively. Baseword 3 included words not in the first 2000 words of English but which are frequent in upper secondary school and university texts from a wide range of subjects. The sources of these lists are A General Service List of English Words by Michael West (Longman, London 1953) for the first 2000 words, and The Academic Word List by Coxhead (1998, 2000) containing 570 word families. RANGE provides a table which shows how much coverage of a text each of the three base lists provides. The program can be used with up to ten word lists; however, only three base lists were accessible to the researcher. Words at the above-3000 level belong to lower-frequency lists beyond these three base lists.
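The kind of coverage table described above can be approximated with a short script. This is only a sketch and not the RANGE program itself: it matches exact word forms rather than word families, and the function name `coverage_by_list`, the tiny base lists and the sample sentence are all hypothetical.

```python
import re
from collections import Counter


def coverage_by_list(text, base_lists):
    """Return the percentage of running words in `text` covered by each list.

    base_lists: dict mapping a list name to a set of lower-case word forms,
    ordered from the most frequent list to the least frequent one.
    Tokens found in no list are counted under "not in lists".
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter()
    for token in tokens:
        for name, words in base_lists.items():
            if token in words:
                counts[name] += 1
                break  # credit the first (most frequent) list only
        else:
            counts["not in lists"] += 1
    total = len(tokens)
    if total == 0:
        return {}
    return {name: 100.0 * count / total for name, count in counts.items()}


# Hypothetical miniature base lists (real base lists hold thousands of forms).
demo_lists = {
    "base one": {"the", "man", "was"},
    "base two": {"idle"},
}
demo_coverage = coverage_by_list("The idle man was executed", demo_lists)
print(demo_coverage)
```

A passage whose running words fall almost entirely in the first two lists would, on this measure, under-represent the vocabulary attainment of a more advanced group.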
Table 1 RANGE information on the six passages employed in the study

Of the several passages initially deemed proper by teachers’ judgments, some were eliminated because they contained a high percentage of high frequency words in the 1000 or 2000 word lists, thus under-representing the group’s vocabulary attainment. The vocabulary frequency profile of the reading texts and the pilot data allowed the researcher and teachers to reach a compromise on two texts for each level. The vocabulary items and distractors that would be assessed through the Cloze procedure were also the ones to be assessed with the Matching and Gap-filling measures. Table 1 gives information on the vocabulary frequency profile of the six texts used for the assessment of the three language groups. It was equally important to decide from which frequency level the target vocabulary items would be selected. The distribution of these words across frequency levels is shown in Table 1.
Matching Test: The basic Matching task, as practiced in this study, requires learners to match the target words with their synonyms or dictionary definitions. Therefore, it is a recognition rather than a recall task, focusing on the word meanings covered in students’ course books. The target items determined for the two Cloze passages were grouped under two sections with ten target and three distractor items under each column and definitions to match them with. Despite the economy and low guessing factor involved in the practice of Matching tests (Brown & Hudson, 2002), presenting words in isolation with only a single meaning may seem quite unrealistic; however, recognizing the sense meaning of a word form is accepted as a major component of vocabulary knowledge (Read, 2000).
Gap-filling Test: Gap-filling measures present the testees with a number of unconnected sentences with a blank in each. From a list of given words (usually more words than blanks), the testees choose the word that best fits each blank. The target vocabulary words that were identified in the manner described above were put into a box at the bottom of a set of sentences which would provide the correct syntactic and semantic context for them. These sentences were mainly chosen from the illustrative statements provided in various dictionaries (The New Merriam-Webster; Longman Contemporary English; Oxford Advanced Learner’s Dictionary of Current English). The target items for the two different sections were grouped separately with three distractors in each set to make a total of 13 items for one section and 26 items for the test as a whole. Each statement had a blank of fixed length and was marked as right, wrong or missing. The following examples taken from the Advanced group test paper illustrate how the target words of each text were assessed with the three test types and show the relatedness of the three instruments.
Text 1: British Prisons
(Cloze) In England the first use of prisons was to house vagrants and other ......(idle) persons. Later, minor offenders and debtors were imprisoned – major offenders, on the other hand, were ......(executed).
(Gap-filling)
1. The whole team stood _______(idle)_________, waiting for the trainer to come.
2. Thousands have been ______(executed)______ without any trial or legal process.
(Matching)
1. not doing anything; jobless (idle)
2. be killed as punishment for a crime (executed)
The three test types described above were chosen specifically because they establish a good range on the context dimension of vocabulary assessment, moving from the context-independent Matching to the context-dependent rational deletion Cloze with the Gap-filling test taking its place somewhere in the middle of the continuum.
Pilot study
A pilot study was carried out in order to detect any possible flaws in the testing instruments. During the pilot study, the vocabulary tests designed for the freshman Literature department students (Advanced group) were administered to second-year students in the program. For the Intermediate group, one of the three groups existing in the program was used for the pilot study while the remaining two groups provided data for the main study. For the Upper Intermediate group, since there was only one group to collect data from, freshman Literature students took the pilot study tests and the main study was administered to the target group.
As a result of the pilot data, the tools for the Intermediate and Upper Intermediate groups were corrected for words which showed no variance. The number of vocabulary items was reduced to 18 for the Upper Intermediate group. Four texts were piloted for the Advanced group before deciding on the best two, as it was more difficult to estimate their vocabulary store. The Cronbach alpha reliability values for the Intermediate group were .65, .62, and .78 for the total scores on the three test types, respectively. Alpha reliability values for the Upper Intermediate group were .80, .75, and .48, respectively. For the Advanced group, the alpha reliability values were .60, .64, and .58 for each test, respectively. There was no attempt to improve these reliability values because the main purpose was to keep the target words fixed while experimenting with the context dimension of the measures. An alpha reliability coefficient of .50 was considered a satisfactory cut-off point for inclusion of the measurement tools in the main study. There are several reasons why this coefficient is lower than the desired levels. Firstly, subjects taking the test at each level were quite a homogeneous group in terms of their level of vocabulary attainment. Secondly, all the items that constituted these tests were believed to be either explicitly or implicitly taught within the related programs and therefore were not expected to discriminate too strongly. Thirdly, the study could have benefited from an increase in the number of its subjects and target vocabulary items, but the circumstances for the piloting and the main study were stretched to the limits.
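For readers who wish to reproduce reliability estimates of this kind, the standard Cronbach alpha formula, alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), can be sketched as follows. The function name and the response matrix are hypothetical, and the sketch assumes a complete subjects-by-items matrix of item scores; it is not the software used in the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """scores: one row per subject, one column per item (e.g. 0/1 marks).

    Uses sample variances throughout; the ratio in the formula is the same
    as long as item and total variances use the same denominator.
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([row[j] for row in scores]) for j in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores show no variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 4 subjects x 3 items (not data from the study).
demo_matrix = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
demo_alpha = cronbach_alpha(demo_matrix)
print(round(demo_alpha, 2))
```

Because alpha depends on the variance of the total scores, a homogeneous group of subjects will tend to produce lower values, which is consistent with the first reason given above.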
4. Statistical Analyses
Descriptive statistics for all test types and language level groups
As can be seen in Tables 2-4, the data for the three test types of Matching, Gap-filling and Cloze tests were analyzed separately for the three language groups to investigate whether the total mean values for all test types differed significantly from one another. Since each language ability group was tested with different vocabulary items of different difficulty levels, comparisons were not made among the language groups but within the groups to see the differences in their performance on measures with varying context.
Table 2 Descriptive Statistics for the Advanced Group

A one-way ANOVA showed that for the Advanced group (Table 2), the total means for the three different tests differed significantly from each other, F (2, 137.767) = 7.066, p<.002. A post-hoc Scheffé test showed that the Cloze mean (5.86) differed significantly from the means of Gap-filling (10.47) and Matching (11.92) (p<.028 and p<.004, respectively); however, the means of the latter two did not differ significantly (p<.960). The standard deviations varied considerably, ranging between 2.11 and 3.10. The reliability coefficient for the Cloze (.83) was somewhat higher than those of Matching (.79) and Gap-filling (.76).
Table 3 illustrates the values for the Upper-intermediate group. One-way ANOVA and Scheffé tests did not point to any significant difference among the three test types, F (2,3.385)= .117, p<.890. On average, the Gap-filling test yielded the highest mean score (10.41) with the greatest spread of scores (4.71) and the highest reliability (.89), which recommends it as the most reliable measure of vocabulary for this language group. The Cloze test yielded a rather low reliability coefficient of .69 compared with .80 and .89 for the other measures.
Table 3 Descriptive Statistics for the Upper-intermediate Group
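The F ratio reported by a one-way ANOVA is the between-groups mean square divided by the within-groups mean square. The sketch below computes it from raw scores under the usual independent-groups assumptions; the function name and the score lists are hypothetical and do not reproduce the study's data or its software output. A p-value would additionally require the F distribution (for example via `scipy.stats.f_oneway`), which is left out here to keep the example dependency-free.

```python
from statistics import mean


def one_way_anova(groups):
    """groups: list of score lists, one per condition (e.g. test type).

    Returns (F, df_between, df_within).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical total scores on three test types (not data from the study).
demo_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
demo_f, demo_df_between, demo_df_within = one_way_anova(demo_groups)
print(demo_f, demo_df_between, demo_df_within)
```

A significant F only indicates that at least one mean differs; a post-hoc procedure such as the Scheffé test is then needed to locate the differing pairs.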

The low standard deviations for Matching Text 2 and Cloze Text 1 make it difficult to interpret the group’s performance on these two texts and test types. Another interesting observation concerns the high standard deviations for Cloze Text 2 and Matching Text 1. Taken together, the answer could lie in the difficulty level of the target items (i.e. their word frequency levels) and of the text they are embedded in (Table 1). The more difficult Text 1 appears to have a wide range of score distributions for the Gap-filling and Cloze tests, accompanied by high reliability values, but low range, reliability and distribution values for Matching, which points to the homogeneous performance of the subjects.
For the Intermediate group (Table 4), a one-way ANOVA indicated a significant difference among the total means of the three test types, F (2, 415.427) = 31.186, p<.001. As with the Advanced group, the Scheffé test showed that the Cloze mean (6.59) differed significantly from the means of Gap-filling (12.59) and Matching (12.23) (p<.001 and p<.001, respectively), but the difference between the latter two was not significant (p=.914). The alpha reliability value for Gap-filling (.78) was considerably higher than those of Matching (.62) and Cloze (.60). Judging by the standard deviations of the three tests at this linguistic ability level, it is possible to infer that Matching scores distinguish more strongly among the subjects (4.22) because of the lack of context; as context is added through Gap-filling, the differences among the subjects diminish somewhat (3.97) because subjects begin to make use of contextual clues. The standard deviation for the Cloze test falls once again because subjects are equally incapable of making use of extended context and discoursal clues at this level.
Table 4 Descriptive Statistics for the Intermediate Group

The significance of these data as regards Hypothesis 1 is that for all language ability groups there is no significant difference between subjects’ scores on Matching and Gap-filling, meaning that subjects are able to handle vocabulary with reduced or zero context when they are tested on their recall of known items. Cloze scores, on the other hand, differed significantly from the other scores for the Intermediate and Advanced groups but not for the Upper-intermediate group. The Upper-intermediate group was able to process all types of context with items they had learnt explicitly, while the Advanced group had only implicitly learned the tested items and thus functioned differently with varying degrees of context. Standard deviations, however, tended to change with test type, enabling one test type to distinguish more strongly than the others.
The descriptive data provided here do not support Hypothesis 2, which states that subjects’ scores will increase in proportion to the context provided by the measurement tools. For all language ability groups, mean scores for Matching did not increase with the additional context of Gap-filling but remained the same; mean scores for the extended context of the Cloze dropped significantly for the Intermediate and Advanced groups, and remained the same for the Upper-intermediate group. Extended context does not appear to result in higher scores for any of the ability groups.
Item-total correlations
Item-total correlation values are an important aspect of the reliability of test items. An examination of the item-total correlations (Rit) gives us an idea of the quality of the items constituting the test. When a number of items address the same underlying construct, these items are expected to relate to the construct in the same way. The researcher wished to examine item/global comparisons in order to see whether the discrimination quality of the items changed when they appeared in different contextual surroundings. Following the examples of previous researchers (Laufer & Goldstein, 2004; Nation, 1990), the assumption is made that each item is independent although the items appear in sets of ten (thirteen with the distractors). Ebel (1979, p. 267) developed a useful guideline for determining the value of individual items when considering item-total correlation statistics: .40 and higher are good items; .30 to .39 are reasonably good items possibly subject to improvement; .20 to .29 are marginal items in need of improvement; and below .19 are poor items which need to be revised or eliminated. When we apply Ebel’s criteria to the item-total correlation indices in Table 5, the following picture emerges.
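The procedure described above can be sketched as follows: each item is correlated with the subjects’ total score, and the resulting index is labelled according to Ebel’s bands. The function names and the response matrix are hypothetical. The sketch uses the uncorrected item-total correlation (the item is included in the total), treats an item with no variance as having a zero index, and labels everything below .20 as poor; the source does not state which of these conventions the study followed, so they are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def item_total_correlations(scores):
    """Correlate each item column with the subjects' total scores."""
    totals = [sum(row) for row in scores]
    return [pearson([row[i] for row in scores], totals)
            for i in range(len(scores[0]))]

def ebel_label(r):
    """Label an item-total correlation using Ebel's (1979) bands."""
    if r >= 0.40:
        return "good"
    if r >= 0.30:
        return "reasonably good"
    if r >= 0.20:
        return "marginal"
    return "poor"

# Hypothetical data: 5 subjects x 4 items (illustration only, not study data)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for index, r in enumerate(item_total_correlations(responses), start=1):
    print(index, round(r, 3), ebel_label(r))
```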
For the Intermediate group, the mean item-total correlation values for the three test types were .449, .412 and .035 for the Darwin text, and .241, .352, and .111 for the Bigger May Not be Better text. For Matching, 55% of the items had item-total correlation indices above .30 (.303 - .769) and 30% of the items were poor items. For Gap-filling, 65% of the items had indices above .30 (.325 - .681) and two items had zero values. The Cloze test values were disappointing: only 25% of the items had indices over .30 (.360 - .734) and 55% of the items had indices of zero or below. For the Intermediate subjects, the Cloze test did not appear to be an efficient testing tool when compared with the Matching and Gap-filling assessment techniques. The mean discrimination indices point to Gap-filling as the most reliable measure of the three.
For the Upper-intermediate group, the discrimination values across the three test types and two texts were not found to be significantly different (p=.213, p=.186). Mean discrimination values were above .30 in all cases. This points to a great uniformity in subjects’ performance across items. For the Matching test, 69% of the items had discrimination indices above .30 (.338 - .903) and one item had a negative value. The indices were extremely high for the English text. For the Gap-filling test, 78% of the discrimination values were above .30 (.309 - .866), while one item had a negative value. The Cloze test items also discriminated quite successfully, since 67% of the indices were above .30 (.321 - .908); three items had either zero or negative values. In general, items in the Gap-filling test correlated best with the test total, with the Cloze items being next best. These values more or less agree with the Cronbach reliability values.
When the mean discrimination indices were compared for the Advanced group, the differences did not appear to be significant (p=.310 and p=.445). Items in the Theseus text appeared to be more difficult for these subjects than items in the British Prisons text, as seen in the differences between their means (see Table 2). This explains the high mean discrimination values of the first text (.507, .333, .410) compared to the low indices of the second text (.271, .276, .422). Among the three test types, the Cloze test had the highest average discrimination index. For the Matching task, 72% of the items had discrimination indices above .30 (.300 - .878) and two items had negative values. For the Gap-filling task, 47% of the items had indices above .30 (.352 - .683).

Cloze items yielded very high discrimination values: 80% of the items were reasonably good (.306 - .711) and two items were poor. Overall, the items in the extended context environment of the Cloze test yielded high item-total correlations, which recommends this test type as the most discriminating measure for the more proficient subjects. Although subjects had higher mean scores for Gap-filling, there is a randomness in their performance over the whole of that test.
Only Gap-filling seems to support Hypothesis 3 in producing item/total correlations above .30 for all cases, while the Cloze test failed to discriminate, possibly because it is too difficult a task at all levels of learning. A richer and longer context could be perceived as functioning in two opposing ways: it could be a hindrance in the sense of requiring understanding of a greater number of words in the text before deciding on how to fill in a specific gap, or it could be helpful in the sense of providing many more clues to the meaning sense required by the gap. The student with a larger store of vocabulary may be better able to use context to his/her advantage. At the Upper-intermediate level, subjects seemed capable of demonstrating their knowledge of vocabulary irrespective of the test type but with some inconsistencies between Matching scores (.561 vs. .302) and Cloze scores (.349 vs. .502). For the Advanced group, the strongest discrimination was achieved through the Cloze test.
Correlating Measurement Tools
An examination of the patterns of correlations among test scores obtained from the various measurement tools allows us to give further support to the construct validity of these tools. A correlation coefficient tells us to what extent variations in one measure agree with variations in another. However, our use and interpretation of correlation coefficients must be guided by the construct defined for the study: our hypotheses about which measures should correlate highly with each other are determined by the theory behind the construct. Our assumption is that there will be no significant correlations between different measures, since subjects are expected to function differently under different contextual conditions, whereas subjects are expected to function similarly when vocabulary items are assessed through the same contextual environment. In this study, a confirmatory mode is used in order to identify the abilities and traits that influence performance on our measures. For all language ability groups, correlations are computed separately for combined texts (K=18 or K=20) and for separate texts (K=9 or K=10) so that one configuration does not confound results for the other.
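As an illustration of how the agreement between two measures is quantified, the sketch below computes a Pearson correlation between two sets of scores together with the t statistic (with n - 2 degrees of freedom) on which its significance test is based. The function name and the score lists are hypothetical and do not come from the study data.

```python
from math import sqrt

def correlation_with_t(x, y):
    """Return (r, t, df) for two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    # t statistic for testing whether r differs from zero
    t = r * sqrt((n - 2) / (1 - r ** 2))
    return r, t, n - 2

# Hypothetical scores of five subjects on the same task with two texts
cloze_text1 = [5, 7, 6, 9, 8]
cloze_text2 = [4, 6, 7, 8, 9]

r, t, df = correlation_with_t(cloze_text1, cloze_text2)
print(round(r, 3), round(t, 2), df)  # 0.822 2.5 3
```

With so few subjects even a high coefficient yields a modest t value, which is one reason why coefficients from small groups should be interpreted with caution.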
For the Advanced group, Matching, Gap-filling and Cloze scores for combined texts (K=20) did not correlate significantly, a fact which could support the multidimensional nature of vocabulary knowledge. As further evidence for this argument, the Cloze tests for the two texts correlated moderately with each other at .686 (p=.001) and the Gap-filling tests correlated moderately with each other at .663 (p<.01), which could be interpreted as indicative of a common construct or a common mental ability that is required of the test-takers. Matching scores did not correlate with each other. This result could be explained by the difficulty of the vocabulary items in Text 1 as against the ease of the items in Text 2. Without context, subjects scored higher for the easier items and lower for the more difficult items, with no correlation between the two. Easier items in British Prisons also gave rise to a negative relationship between Gap-filling and Cloze (-.555, p<.05). For the Advanced group, the evidence suggests that these tests are multidimensional and as such are measuring three separate psychometric constructs with regard to context. The provision of context allowed the scores on two measures to correlate with each other because both tests entailed the use of contextual clues surrounding the blanks.
For the Upper-intermediate group, an unusual correlation was observed: Matching scores on one text correlated strongly with Cloze scores on the other text at .741 (p<.01). As will be remembered, scores from these tests yielded comparatively high standard deviation values (2.60 and 2.40 respectively) and very high discrimination indices (.561 and .502 respectively), and will later be seen to load strongly on the same factor (.934 and .915 respectively). The sources of this pattern could lie in the interaction of item, text and task difficulty, as will be discussed later in the paper. Gap-filling tests correlated strongly with each other at .719 (p<.001) but not with the other two test types, which once again implies a separate construct.
For the Intermediate group, as for the Advanced group, none of the test types correlated with each other for combined texts (K=20). Taken separately, Matching tests correlated moderately with each other at .631 (p<.01) and Gap-filling tests correlated moderately with each other at .579 (p<.01). Cloze tests did not correlate significantly with each other.
As regards hypothesis 3, the assumptions of the researcher were confirmed by the results of the Intermediate and Advanced groups, where no significant correlation was observed among the three test types but moderate to strong relations were found between scores for the same test type. In other words, Matching tests and Gap-filling tests correlated within their own categories for the Intermediate group, Gap-filling and Cloze tests correlated within their own categories for the Advanced group, and Gap-filling tests correlated with each other for the Upper-intermediate group.
Factor Analysis
In the context of construct validation, test scores are considered as observed variables, while the hypothetical variables are what we wish to interpret as constructs, test methods, and other influences on test takers’ performance on language tests (Bachman, 1990). These hypothetical variables (i.e. communalities) that underlie the observed correlations are called “factors”. The factor-analytic procedure produces factor “loadings” that indicate the degree of relationship between observed scores and the various factors that emerge from the analysis. The underlying construct that the three measuring instruments aim to tap into is the usefulness of context in retrieving the form, and the syntactic and semantic functioning, of the target vocabulary items. The second underlying hypothesis is that higher language ability groups will be more sensitive to or perceptive of contextual clues. Since this study experiments with three types of context for three language ability groups, the author expects to obtain loadings on more than one major factor for the different sections. The results support the element of context as an important factor for all groups.
When the scores from all the sections were submitted to a Principal Components analysis, three factors emerged for the Intermediate group, representing 29.73%, 26.68% and 20.09% of the variance respectively; three factors emerged for the Upper-intermediate group, representing 29.43%, 28.50% and 25.78% of the variance respectively; and two factors emerged for the Advanced group, representing 42.40% and 26.86% of the variance respectively. This is in agreement with the constructs assumed to be underlying each task: first, the Matching task is a relatively discrete construct requiring no other skill than to display receptive knowledge of the given meaning sense of the target words; second, the Gap-filling task presumably requires the subject to make use of the syntactic and semantic clues existing in the sentence-level environment of the target word; and third, the Cloze task probably requires some degree of reading ability and advanced discoursal knowledge in addition to simple vocabulary knowledge in order to make use of inter-sentential clues to the meaning of the target words. Language difficulty was not taken as a major factor in this study because the frequency levels of the target words and the target texts were carefully chosen to match the ability level of the groups and their syllabus. And yet, of the two sets of items and texts presented to the same language group, one appeared to be slightly more difficult than the other for all three groups. Although unintentionally so, language difficulty might have become a factor after all, and this question needs to be resolved by further research.
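The percentages of variance reported above can be related to the eigenvalues of the analysis by a simple worked equation. In a Principal Components analysis of standardized scores, the total variance equals the number of observed variables p, and component k accounts for the share λ_k / p. The calculation below assumes p = 6 observed scores per group (three tasks by two texts, as suggested by the six loadings reported for each group); this value of p is an inference and not a figure stated in the study.

```latex
\[
\text{proportion of variance}_k = \frac{\lambda_k}{p},
\qquad
\lambda_1 \approx 0.2973 \times 6 \approx 1.78
\quad \text{(Intermediate group, first factor)},
\]
\[
\sum_{k=1}^{3} \frac{\lambda_k}{p} \approx 0.2973 + 0.2668 + 0.2009 = 0.7650 .
\]
```

Under this assumption, the three factors retained for the Intermediate group together account for about 76.5% of the total variance in the six scores.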
Based on the present data, three different interpretations will be proposed here, one for each language group. Table 6 shows that for the Intermediate subjects the data load strongly on three factors, each factor representing one of the three context conditions: Matching (.892 and .891), Gap-filling (.881 and .873) and Cloze (.781 and .749). This could mean that the three tasks of Matching, Gap-filling and Cloze each required a mental or linguistic ability which distinguished the subjects on the basis of whether or not they had this required ability. According to the research questions and hypotheses of this study, this key trait is the ability to make use of the context provided in the assessment. Identifying the meaning of an item is one ability and processing it under semantic, syntactic and discoursal constraints is another.
For the Upper-intermediate group three factors emerged; Gap-filling task for both texts loaded strongly on Factor 2 (.889 and .942); the second and third factors are a little difficult to interpret. Performance on different texts and tasks seemed to load on the same factor; that is, Matching (Text 1) with Cloze (Text 2) loaded on Factor 1 (.934 and .915 respectively), and Matching (Text 2) with Cloze (Text 1) loaded on Factor 2 (.889 and .942 respectively). Several interpretations could be offered to explain this incongruity. One reason could be the difficulty level of the target items, as they differed from one text to the other. As the mean discrimination indices have already shown, the Matching task for Text 1 discriminated well (.561) between the better and poorer subjects, whereas the same task for Text 2 failed to discriminate with the same success (.302) because neither the good nor the poor subjects could score any higher due to the difficulty of the low frequency target words. A second reason could be the difficulty of the measurement task. For Matching, even the better learners found it difficult to identify the low frequency words through short dictionary definitions, but the same learners outperformed the poorer learners in the Cloze task because they could take advantage of the contextual clues to word meanings in the longer discourse. The general conclusion to be reached from all this is that Upper-intermediate subjects hold a middle ground in terms of their vocabulary achievement because, when faced with a more difficult set of vocabulary items or a more difficult task which requires more than just discrete vocabulary knowledge, they do not perform much better than the Intermediate group or as well as the Advanced group.
Relating these two interpretations to the key trait being researched, one could argue that with high frequency words the better Upper-intermediate subjects can do well without too much context, but with target words of low frequency the same subjects need to use extended context and only the more proficient subjects are equipped with this ability.
The data for the Advanced group loaded on two factors, with all the Matching and Gap-filling scores loading on Factor 1 (.857, -.808, .691 and -.663 respectively) and Cloze scores loading on Factor 2 (.917 and .752). At an advanced stage of language development the distinction between zero context and reduced context seems to have disappeared for the receptive retrieval of known vocabulary. The subjects at this stage have perhaps reached a certain linguistic and/or lexical threshold and thus employ the underlying mental processes with equal success. In other words, placing target words in their proper linguistic and semantic context at the sentence level may not be more challenging (or more distinguishing) than matching the items with their dictionary definitions. However, the ability to fill in rationally deleted gaps with the target words may, even at this relatively advanced stage, require a different mental ability which only the better students have acquired. Unlike the Upper-intermediate group, the difference in the difficulty parameters of the two texts did not lead to incoherent results for the Advanced group learners. The more difficult the text with lower frequency words, the higher were the discrimination values. Hence, extended context requiring discoursal knowledge was probably calling for a separate ability which went beyond the demands of merely syntactic and lexical knowledge.

5. Conclusion
This paper has used the research data described and analyzed above to develop a discussion on contextual influences on vocabulary assessment. As learners’ language proficiency grows, they are expected to be able to exploit contextual information to recognize the specific meaning of a learnt word or infer the sense of an unfamiliar lexical item when they encounter it in use. Three measurement tools were developed with the purpose of allowing students to work on vocabulary items under three different contextual environments: zero context (Matching), reduced context (Gap-filling) and extended context (rational Cloze). The study aimed to look into the question of how much context is necessary and conducive to success when assessing the lexical knowledge of learners, and to what extent lexical knowledge as distinct from other forms of language knowledge accounts for the learners’ success (or otherwise) in performing the task.
The article has described the construction of three different test types and has provided initial evidence that they each measure a different construct, as validated by the factor analysis and correlation procedures applied to the data. Research results for the Intermediate and Advanced groups are especially supportive of the hypotheses advanced by the researcher. The Upper-intermediate group yielded correlation results which were inconsistent and a little difficult to interpret. This could be explained by a lack of sincerity or serious effort on the part of the subjects, by the difficulty level of the texts and target words, or by their combined effect on subjects’ performance. Correlation coefficients did not point to a strong relationship between scores obtained separately from the three measurement tools with the exception of the Intermediate group. However, strong to moderate relations were observed between the two versions of the Gap-filling test for all language groups, which points to a unidimensionality in the ability being measured by a single test type. Cloze test scores and Matching scores correlated well within themselves for the Advanced and Intermediate groups, respectively.
Item analysis has shown that in the majority of cases mean scores from the three measures differed significantly from each other, especially when Cloze means were compared with the means of Matching and Gap-filling. An investigation into the discrimination indices of the items in each test group and language ability level gave rise to the conclusion that for the lower language ability groups the rational Cloze format was not a very good discriminator of subjects’ lexical ability, whereas for the more advanced levels it functioned as a strong discriminator. An important observation made from the present data was that the discriminatory power of a measurement appeared to interact strongly with the lexical frequency content of the text and the task being employed. When the text and the task consisted of low frequency items, the rational Cloze item discrimination indices for the high language ability group seemed to benefit from this challenge, but Upper-intermediate and Intermediate subjects were under-discriminated when items did not match their lexical knowledge. In general, all subjects regardless of their language abilities benefited from the reduced context environment at the sentential level, but the expectation of greater success with the extended discoursal context at higher language ability levels was not supported by this study. Other abilities, such as reading comprehension and discoursal knowledge, seem to bear the load of the work when dealing with the Cloze. It requires more than the mere knowledge of the sense meaning of an item or the ability to process it according to the contextual requirements and constraints of a sentence.
It is possible to argue on the basis of the results of this study that it is one thing to match vocabulary items with their dictionary definitions and yet another skill to determine which words can fill in a certain blank based on the structural context surrounding the blank. The preference for one assessment tool over another should be based on the test user’s ultimate aim and the language level of the testees. If the aim is to obtain a test with higher reliability and discrimination power for use with intermediate to higher language ability groups, gap-filling could be the better choice. With lower language ability groups, using the rational cloze to assess the knowledge of content words runs the risk of assessing abilities other than the recognition of word meanings and is definitely measuring a different trait.
Although the findings of this study are far from conclusive, it is hoped that they will pave the way for other studies to investigate the impact of context in assessing the receptive vocabulary knowledge of foreign language learners in the schooling context. Measures other than the ones used in this study could be investigated for their contextual contribution. It would also be useful to carry out retrospective interviews with subjects after the administration of the tests to find out what kinds of contextual clues they actually attend to and what patterns of behavior they follow in making lexical decisions.
Acknowledgments
I am greatly indebted to my student and colleague Suna Utar for her most valuable assistance in administering the assessment tools of the present research in her classrooms and helping me collect a part of the data that went into the making of this paper. My most grateful feelings will be with her forever.
References
Alderson, J. C. (1979). The Cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13, 219 – 227.
Alderson, J. C. & Banerjee, J. (2001). State-of-the-art review: Language testing and assessment (part 1). Language Teaching, 34, 213 – 236.
Anderson, R. C. & Freebody, P. (1981). Vocabulary knowledge. In J. T. Guthrie (Ed.), Comprehension and teaching: Research reviews (pp. 77 – 117). Newark, DE: International Reading Association.
Arnaud, P. (1989). Vocabulary and grammar: A multitrait-multimethod investigation. AILA Review, 6, 56 – 65.
Bachman, L. F. (1985). Performance on Cloze tests with fixed-ratio and rational deletions. TESOL Quarterly, 19, 535 – 556.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Brown, J.D. & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University Press.
Campbell, D. T. & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81 – 105.
Carrell, P. L. (1987). Readability in ESL. Reading in a Foreign Language, 4, 21 – 40.
Chapelle, C. A. & Abraham, R. G. (1990). Cloze method: What difference does it make? Language Testing, 7, 121 – 146.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32 – 70). Cambridge: Cambridge University Press.
Chihara, T. J., Oller, J. W. Jr., Weaver, K. A. & Chávez-Oller, M. A. (1977). Are Cloze items sensitive to constraints across sentences? Language Learning, 27, 63 – 73.
Corrigan, A. & Upshur, J. A. (1982). Test method and linguistic factors in foreign language tests. IRAL, 20, 313 – 321.
Coxhead, A. (1998). An academic word list. English Language Institute Occasional Publication, No. 18. Wellington: School of Linguistics and Applied Language Studies, Victoria University of Wellington.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34, 213 – 238.
Ebel, R. L. (1979). Essentials of educational measurement. 3rd ed. Englewood Cliffs, NJ: Prentice Hall.
Engber, C. A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4, 139 – 155.
Goulden, R., Nation, P. & Read, J. (1990). How large can a receptive vocabulary be? Applied Linguistics, 11, 341 – 363.
Hale, G. A., Stansfield, D. A., Rock, D. A., Hicks, M. M., Butler, F. A. & Oller, J. W. Jr. (1989). The relation of multiple-choice Cloze items to the Test of English as a Foreign Language. Language Testing, 6, 47 – 76.
Heatley, A., Nation, I. S. P., & Coxhead, A. (2002). RANGE and FREQUENCY programs. Available at: http://www.vuw.ac.nz/lals/staff/Paul_Nation
Henriksen, B. (1999). Three dimensions of vocabulary development. Studies in Second Language Acquisition, 21, 303 – 317.
Hornby, A. S. (1974). Oxford Advanced learner’s dictionary of current English. Oxford University Press.
Jonz, J. (1976). Improving on the basic egg: The m-c Cloze. Language Learning, 26, 255 – 265.
Jonz, J. (1990). Another turn in the conversation: What does Cloze measure? TESOL Quarterly, 24, 61 – 83.
Kanatlar, M. (1995). Guessing words in context: Strategies used by beginning and Upper-intermediate level EFL students. Unpublished Master Thesis. Bilkent University, Turkey.
Kitao, S. K. & Kitao, K. (1996). Testing vocabulary. ERIC Document Reproduction Service, No ED 398 254. Washington DC: Eric Clearinghouse on Languages and Linguistics.
Laufer, B. & Nation, P. (1995). Vocabulary size and use: lexical richness in L2 written production. Applied Linguistics, 16, 307 – 322.
Laufer, B. (1997). The lexical plight in second language reading: words you don’t know, words you think you know, and words you can’t guess. In J. Coady & T. Huckin (Eds.), Second Language Vocabulary Acquisition: A Rationale for Pedagogy (pp. 20 – 34). Cambridge: Cambridge University Press.
Laufer, B. & Nation, P. (1999). A vocabulary-size test of controlled productive ability. Language Testing, 16, 33 – 51.
Laufer, B. & Nation, P. (2001). Passive vocabulary size and speed of meaning recognition: are they related? In S. Foster-Cohen & A. Nizegorodcew (Eds.), EUROSLA Yearbook (pp. 7 – 28).
Laufer, B. & Goldstein, Z. (2004). Testing vocabulary knowledge: size, strength, and computer adaptiveness. Language Learning, 54(3), 399 – 436.
Longman Dictionary of Contemporary English (2003). New edition. Longman.
McNeill, A. (1996). Vocabulary knowledge profiles: evidence from Chinese-speaking ESL teachers. Hong Kong Journal of Applied Linguistics, 1, 39 – 63.
Meara, P. & Jones, G. (1988). Vocabulary size as a placement indicator. In P. Grunwell (Ed.), Applied Linguistics in Society (pp. 80 – 87). London: CILT.
Meara, P. (1999). The vocabulary knowledge framework. Available at: www.swan.ac.uk/cals/vlibrary/pm96d.
Meara, P. & Fitzpatrick, T. (2000). Lex 30: an improved method of assessing productive vocabulary in an L2. System, 28, 19 – 30.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement. 3rd Edition. New York: Macmillan.
Nation, P. (1990). Teaching and Learning Vocabulary. Boston, MA: Heinle and Heinle.
Nation, P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Öndeş, N. S. (2004). ELS English language studies: English through reading. İstanbul: ELS Yayıncılık.
Porter, D. (1976). Modified Cloze procedure: a more valid reading comprehension test. ELT Journal, 30, 151 – 155.
Qian, D. D. (1999). Assessing the roles of depth and breadth of vocabulary knowledge in reading comprehension. Canadian Modern Language Review, 56 (2).
Qian, D. D. & Schedl, M. (2004). Evaluation of an in-depth vocabulary knowledge measure for assessing reading performance. Language Testing, 21(1), 28 – 52.
Read, J. (2000). Assessing Vocabulary. Cambridge: Cambridge University Press.
Read, J. & Chapelle, C. A. (2001). A framework for second language vocabulary assessment. Language Testing, 18 (1), 1 – 32.
Richards, J. C. (1976). The role of vocabulary teaching. TESOL Quarterly, 10, 77 – 89.
Schmitt, N. (2000). Vocabulary in language teaching. Cambridge: Cambridge University Press.
The New Merriam-Webster Dictionary (1989). Merriam-Webster Inc., Publishers: Springfield, Massachusetts.
Utar, S. (2005). The interactive effects of language proficiency level and context on subjects’ performance in vocabulary tests of Matching and Gap-filling. M.S. Thesis. University of Gaziantep, Gaziantep, Turkey.
Van Parreren, C. F. & Schouten-Van Parreren M. C. (1981). Contextual guessing: a trainable reader strategy. System, 9, 235 – 241.
Verhallen, M. & Schoonen, R. (1993). Lexical knowledge of monolingual and bilingual children. Applied Linguistics, 14, 344 – 363.
Waring, R. (1999). Tasks for Assessing Second Language Receptive and Productive Vocabulary. Ph.D. Thesis. The University of Wales. Available at: http://www1.haranet.ne.jp/~warring/vocabindex.htlm
Wesche, M. & Paribakht, T. S. (1996). Assessing second language vocabulary knowledge: depth vs. breadth. Canadian Modern Language Review, 53, 13 – 39.
West, M. (1953). A general service list of English words. London: Longman.