Tuesday, November 6, 2007

G-Study of iBT

The most recent paper that I finished reading is:

Lee, Y. (2006). Dependability of scores for a new ESL speaking assessment consisting of integrated and independent tasks. Language Testing, 23, 131-166.

Summary: Lee conducts a generalizability study of prototype speaking tasks for the iBT (Internet-based TOEFL). The purpose of the study is to assess how much score reliability increases as the number of tasks and the number of ratings increase.

The tasks are of three types: integrated reading-speaking, integrated listening-speaking, and independent. The rater effect was ignored because of the difficulty of orchestrating a fully crossed design; however, rating was treated as a facet, so all ratings were analyzed in a double-rating model.

Lee concludes that increasing the number of tasks does more to increase reliability than increasing the number of raters. Lee also found sufficient evidence to justify collapsing analytic scores into a single holistic speaking score, given that correlations among task types were quite high.

This has implications for the iBT speaking component. First, including more than one speaking task is more likely to increase the reliability of scores than adding extra raters. In other words, Lee suggests that scores are more reliable when raters have a greater number of speaking samples from each examinee than when multiple raters are used. In essence, many raters can still come up with divergent scores when only one sample is used, but when many samples are provided, even a small number of raters can produce a dependable score for the examinee.
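To make the tasks-versus-ratings trade-off concrete, here is a minimal sketch (in Python) of the D-study logic for a fully crossed person x task x rating design. The variance components are purely hypothetical numbers I made up for illustration, not Lee's estimates: the idea is that when person-by-task variance is much larger than person-by-rating variance, spreading it across more tasks shrinks the error term faster than adding a second rating does.

    # Minimal D-study sketch for a crossed p x t x r design (illustrative only).
    # Relative error = var_pt/n_t + var_pr/n_r + var_ptr_e/(n_t * n_r)
    # Generalizability coefficient E(rho^2) = var_p / (var_p + relative error)

    def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_ratings):
        rel_error = (var_pt / n_tasks
                     + var_pr / n_ratings
                     + var_ptr_e / (n_tasks * n_ratings))
        return var_p / (var_p + rel_error)

    # Hypothetical variance components (my own illustrative numbers, not Lee's):
    components = dict(var_p=0.50, var_pt=0.20, var_pr=0.02, var_ptr_e=0.15)

    for n_tasks, n_ratings in [(1, 1), (1, 2), (4, 1), (4, 2), (6, 2)]:
        ghat = g_coefficient(**components, n_tasks=n_tasks, n_ratings=n_ratings)
        print(f"tasks={n_tasks}, ratings={n_ratings}: E(rho^2) = {ghat:.3f}")

With numbers in that spirit, going from one task to four tasks raises the coefficient far more than going from one rating to two, which mirrors the pattern Lee reports.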

The second implication affects the use of holistic scoring. Rather than requiring raters to score each sample separately (and then averaging the separate scores into a composite score for each examinee), Lee suggests that, since all task types tend to correlate highly, there is sufficient evidence to warrant a single holistic score for each examinee based on an overall impression rather than a combined average. The benefit is that this reduces the time it takes raters to score a single examinee.

Critique: Admittedly, I am not impressed that Lee did not attempt a fully crossed design. I think this weakens the claim that raters have little effect on increasing reliability, given that we have nothing more than a one-rating versus two-ratings comparison. I recognize that fully crossed designs are really hard to manage, but Lee was funded by ETS (which creates the TOEFL), so I don't know why ETS didn't demand a more rigorous research design.

I am also concerned that Lee does not describe the testing occasions. Did students take all 12 tasks at once, leading to enormous test fatigue? Or were the testing occasions spread out over several weeks, thereby possibly confounding the results, since students may have improved in proficiency over the testing period? I need to email Lee and ask about this specifically.

Connection: Thankfully I have been introduced to generalizability theory in LING 660 (Language Testing) and IP&T 752 (Measurement Theory), because it is hard for a researcher to really explain it well in a short article. I had also previously read a study (for my thesis) written by Rob Schoonen (who is a much better writer than Lee) and this helped me to understand the analysis and results that Lee described.

Lee does a decent job of connecting the results of this study to others. I found myself coming to similar conclusions and making connections to my limited experience with G-studies (such as Schoonen). I also found it valuable to read this study in connection with the iBT validation studies by Cumming. It helped me gain a bigger picture of the iBT validation project and how all these different elements function together to inform the development of this high-stakes exam.

Additional Reflections: This article sparked a lot of ideas for me.
  • Could my study include a comparison of analytic/holistic scoring? Or, at the least, could I compare portfolio scores (holistic) with scores for just the integrated tasks? This may justify the need for separate scores for draft writing and timed writing.
  • I like that Lee suggests that test developers need to state the purpose of the task. Are independent tasks just about language, and integrated tasks just about content? In our case, probably not. They are both about language. This needs to be clarified.
  • A scoring-related validity component would be a valuable part of my evaluation that might also include teacher and student content/context validity evidence.
