Wednesday, November 7, 2007

Teacher Verification of iBT

While in line for tickets for the university's production of Little Women the Musical (we're going on my birthday for a pre-Thanksgiving holiday warm-up), I finished another integrated tasks evaluation study:

Cumming, A., Grant, L., Mulcahy-Ernt, P., & Powers, D. (2004). A teacher verification study of speaking and writing prototype tasks for a new TOEFL. Language Testing, 21, 107-145.

You may notice that this article was written by Alister Cumming, who also wrote the iBT integrated tasks textual analysis article I read earlier. I referred to Cumming's other writing-assessment work for my MA thesis as well as for my rater decision-making study (which I need to finish up and then develop into an article). Cumming seems like a very busy man, but he has been kind enough to respond to a couple of emails I have sent him about his research and about the UToronto program.

Summary: This article attempts to provide content-, context-, and concurrent-related validity evidence regarding the speaking and writing tasks for the new TOEFL (now known as the iBT). As with the previous studies I have reviewed, this one examines both integrated tasks (combining writing/speaking with listening or reading) and independent ones (in which examinees use their personal experience or opinions to complete the speaking/writing tasks).

Whereas the previous articles focused on the language content (Cumming, 2005) and the scoring procedures (Lee, 2006), this one examines how teachers of ESL students perceive the structure of the iBT tasks and their students' performance on these prototype items.

The primary questions are:
  • Is the content domain of integrated tasks perceived to correspond to the demands of academic English requirements of university study?
  • Is the performance of examinees on these tasks perceived to be consistent with their classroom performance?
  • Are the tasks perceived to be adequate evidence for making decisions about examinee language ability?
The researchers found that the teachers felt these tasks were fairly authentic and represented a variety of language skills required at university (insofar as a test is able to recreate authentic situations). There were some concerns that the tasks were not fair to lower-ability students, who performed poorly on integrated tasks when they did not understand the input material; for the most part, however, teachers felt that student performance on the tasks was indicative of classroom ability. The greatest concerns teachers had with the evidence claims of the tasks were that some students may do poorly if they feel uncomfortable sharing personal opinions (for the independent tasks) or if they struggle with the stimulus content (for the integrated tasks). Cumming et al. conclude with some suggestions for improving the task format and content. They also suggest that students would benefit from exam preparation in order to understand the rhetorical and educational aims of the exam. Lastly, they suggest that more research be done into standardized ratings and into teacher perceptions of students' ability in the classroom versus on an exam.

Critique: It's a shame that this article is stuck in the middle between an extensive evaluation (involving a variety of teachers and locations) and an intimate study (focusing on one or two individuals and their unique situation). As a result, we get neither the generalizability of a large statistical study nor the valuable insight into a specific context. Instead we end up with a bunch of incomplete thoughts and ideas; it's as if Cumming et al. scratch the surface of several interesting ideas but never dig into a single one. Still, for all its limitations, this study provided me with a good model for some qualitative research (including a survey that I adapted), and it is also likely to help me form some of my content and context validity questions. I also decided, as I read its account of teacher-based evaluation, that a student-based evaluation would be a great complement to such a study. We'll see if I can make it work, or if I will have bitten off more than I can chew.

Connections: This study showed me that I need to read a lot more about teacher evaluation studies of tests. Was this a good example? How have others approached this avenue of validation? I did enjoy seeing how this person-focused (as opposed to text-focused or score-focused) study complemented other, more quantitative, studies of the iBT in order to give a more complete and well-rounded assessment of the exam.

Additional Reflections:
  • I would really like to do an evaluation of the integrated writing tasks at the ELC that incorporates myriad sources of validity evidence, including teacher evaluation, student evaluation, textual analysis, score analysis, and possibly more. Am I crazy to want to do this much? Will TREC approve such a study? Will I find a dissertation committee who will approve such a multi-faceted approach?
  • Who will I choose as teachers for my teacher evaluation? I think that I would prefer to use all of them (provided that they agree to participate in the study). I don't want to exclude someone because their reluctance to volunteer may actually be connected to their feelings about the exam, which is exactly the kind of concern I need to take into account.
  • Will I have time to do this for speaking as well as writing? My tentative schedule, which I need to present to potential committee members next week, outlines an evaluation of writing for winter 2008 and speaking in summer 2008. I think I can be ready, but the issue is as much about analyzing data as it is about collecting it. Who will I get to help, especially when I am so busy around exam time with administrative duties?
  • What kind of concurrent validity claims might I make? Will I ask teachers to rate their students and then compare performance with expectations, or will I use data that we already collect (classroom scores or rated evaluations)? Are teachers very good at predicting student ability? Lee's 2005 thesis on the L/S tests at the ELC says no (as did major portions of this study), but I think that both of those are problematic: Lee because the teacher ratings were probably based on Speaking when she was comparing those to Listening scores, and in Cumming's case, he admits that BICS/CALP may have a lot to do with false impressions. Of course BICS/CALP may be an issue for me too, but if students and teachers practice items in the classroom more than once, then teachers ought to be good at predicting success, shouldn't they?
  • How will tasks be adjusted for lower levels (addressing the concerns Sara raised in this study)? Laura (our Level 1 teacher) has already told me that she thinks Level 1 students need to be able to hear the integrated listening passage twice. As it is, they barely understand it before it's over. A second listening for Levels 1 and 2 might be appropriate and justified (given the findings of this study and Laura's experience).
