Tuesday, November 27, 2007

Responding to the counterargument using academic templates

"I know what to do, but I just don't know how to do it." A common frustration in any writing class.

In these last few weeks of the semester, my upper-level writing class and I are writing argumentative research papers. This assignment is pushing the students to exercise new skills, such as developing their own organizational outlines, and encouraging them to respond to critics of their own positions.

I say "we" because I am doing this along with my students. In conjunction with the L2WRG (Second Language Writing Research Group), a few other researchers and I are "responding" to a recent debate in error correction. As my class and I discuss how to respond to research, it means as much to them as it does to me.

So how does an instructor teach how to write an argumentative research paper? Good question, and I wish I knew; this university - for all the emphasis that it places on writing - offers few courses on the teaching of writing. But despite my lack of training, I have done my best to take advantage of writing-pedagogy professional development opportunities. One such opportunity was a Writing Matters Seminar (sponsored by the university writing initiative) with a guest speaker who recently wrote the "They Say, I Say" composition help book.

"They Say, I Say" is based on the premise that new university students need to learn the language of academic English. Rather than simply expect students to learn this language implicitly, the authors suggest that university instructors need to raise students' awareness of these phrases and forms in order to facilitate this type of "language acquisition."

Although this book is intended for native speakers of English, it is even more relevant in our ESL classes. If native speakers struggle to put academic ideas into academic words, then it is even more imperative that ESL students be explicitly taught these phrases of academic language.

The book offers templates for phrases and their functions, but it does not explain how to teach them. So I'm trying to figure this out.

There are a few techniques I have used to teach phrases/vocabulary.
  1. Allow students to play vocabulary games with partners using the AWL (Academic Word List)
  2. Require the use of AWL words in their essays (a portion of their grade is tied to this; a rough coverage check is sketched below)
  3. Model, show examples, and encourage practice with academic phrases in a lecture format
I wish there were more. I hope to improve my teaching techniques in the near future.
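For technique 2, the kind of quick coverage check I have in mind could be scripted in a few lines. This is only a sketch under assumptions: the AWL headword file name and the sample sentence are hypothetical, and a real check would need to handle word families rather than exact headword matches.

    import re

    # Hypothetical quick check of AWL coverage in a student essay.
    # "awl_headwords.txt" is assumed to contain one AWL headword per line.
    def awl_coverage(essay_text, awl_path="awl_headwords.txt"):
        with open(awl_path, encoding="utf-8") as f:
            awl = {line.strip().lower() for line in f if line.strip()}
        tokens = re.findall(r"[a-z]+", essay_text.lower())
        used = sorted({t for t in tokens if t in awl})
        return len(used), used

    count, words = awl_coverage("The data indicate a significant trend in the analysis.")
    print(f"{count} AWL headwords used: {', '.join(words)}")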

Monday, November 19, 2007

Vocabulary Measures

Because I plan to have some type of linguistic measurement as part of my validation study, I have been reading about types of language statistics. The most recent is a study by Batia Laufer and Paul Nation.

Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16, 307-322.

I am not familiar with Laufer, but I most certainly have heard about Nation. He is the man behind the Academic Word List (AWL), a list of words commonly used in university writing. We use the AWL at our center to help our students improve their reading and writing of academic texts.

This study appears to be pre-AWL, since Laufer and Nation make no reference to it and instead refer to the General Service List and other university word lists. I imagine that the AWL was created not long after this study, since its beginnings are already evident in this work.

The primary focus of the study is an argument in favor of a new kind of vocabulary measure: the Lexical Frequency Profile (LFP). Although I don't entirely buy the case for the LFP (the explanation of its use was verbose and confusing), I did appreciate the succinct discussion of existing measures of vocabulary use. In about two pages, Laufer and Nation clearly explain the pros and cons of several commonly used lexical measures. Even if I didn't get much out of the rest of the paper, I enjoyed their assessment of these measures.
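For my own notes, here is a minimal sketch of the LFP idea as I understand it: report what percentage of a text's tokens fall into each frequency band (first 1,000 words, second 1,000 words, a university word list, and everything else). The band file names are hypothetical, and I am simplifying to raw tokens where Laufer and Nation work with word families.

    import re

    # Hypothetical band files, one headword per line; not the actual GSL/UWL files.
    BANDS = [("first_1000.txt", "1st 1000"),
             ("second_1000.txt", "2nd 1000"),
             ("university_words.txt", "UWL")]

    def lexical_frequency_profile(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        band_sets = []
        for path, label in BANDS:
            with open(path, encoding="utf-8") as f:
                band_sets.append((label, {line.strip().lower() for line in f if line.strip()}))
        counts = {label: 0 for label, _ in band_sets}
        counts["not in lists"] = 0
        for token in tokens:
            for label, words in band_sets:
                if token in words:
                    counts[label] += 1
                    break
            else:
                counts["not in lists"] += 1
        total = len(tokens) or 1
        return {label: round(100 * n / total, 1) for label, n in counts.items()}

    print(lexical_frequency_profile("The committee will analyse the data and report its findings."))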

The question it raises for me is: what linguistic measures will I use in my study? I would almost certainly like to have at least one lexical measure, and hopefully this study will help me explain and justify my decision.

Tuesday, November 13, 2007

Research Proposal Draft 1

I finished a draft of my dissertation research proposal yesterday and took it by the department to show to potential committee members. It was well received, and everyone agreed to look it over with the promise that I would check back with them later in the week. And that started today.

I met with the potential chair of the committee, whose first comment was, “Do you remember what advice I give to anyone who attempts to do a validity study for a dissertation?”

Of course I knew. The answer is, “Don’t do it.”

“It’s not that I think that validity studies are a bad idea; they are very important and should be done,” he said. “Just not as dissertations.”

“Why exactly is that?” I asked.

“They are complicated and intensive. It’s hard to get it done.” Which I knew. I recognize it. In fact, as I explained to him, I have already done the lit review and collected the data for a study that might work in place of this validity one. It would be an easy route. But I’ve already done it.

The integrated skills study needs to be done. It is complicated, but it’s real. I’d rather do something that is needed than something that is easy. So, yes, in other words, I am crazy and stupid.

In our conversation, this professor and I discussed my proposal and how I can limit/focus my efforts while still maintaining the multi-faceted approach of a good validation study.

The next step is to flesh out specific research questions, a solid literature review, and a clear methods section (including all proposed analyses). If I can get that done by January, I think that I will be in a good position to still do this complicated study while aiming for a realistic graduation timeline.

Saturday, November 10, 2007

Language acquisition as construction refinement

Nick Ellis, a linguist from UMich, has been on campus this week discussing his theories of second language acquisition (L2A) and I was finally able to attend a session Friday afternoon. Although I would have preferred to attend an earlier session in which he discussed the need for an Academic Phrase List (in addition to Paul Nation's Academic Word List), I found his linguistically-heavy discussion of L2A theories fairly interesting. Although I have only ever skimmed the surface of L2A theories (I took my graduate course in L2A during my first summer semester of grad school), I was still able to follow some of it while I sat near the back and worked on my laptop.

As I listened, it seemed like Ellis's theories of L2A validate my teaching approach. From what I understand, Ellis is claiming that L2A is less about learning rules than it is about noticing and refining phrases (aka constructions). This is what I have been trying to get my more advanced levels to do. They frequently ask me to give them rules about language - and I try whenever possible - but the truth is that many of their questions cannot be answered by rules. Language isn't really a collection of rules, as much as grammarians would lead us to believe. Instead, language is ruled by usage, not by grammar. If students want answers about complex language, a rule book will not help them.

I answer these questions with corpus research. I show students that seeing how language is used is the best way for them to build and refine their own constructions of English. The easiest way to do this is to type an example phrase into a corpus viewer (such as Mark Davies's viewer). Even better, I encourage them to pay attention to constructions as they read and listen to authentic academic material. Of course students don't like this method because it takes more work, but in truth this is how I learned academic language, and this is how all native speakers learn language: noticing and refining constructions based on reading and listening.
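As a rough illustration of what a corpus viewer is doing under the hood, a keyword-in-context search can be sketched in a few lines. This assumes a plain-text corpus file; the file name and search phrase are made up for the example.

    import re

    def kwic(corpus_path, phrase, width=40):
        """Print keyword-in-context lines for a phrase in a plain-text corpus."""
        with open(corpus_path, encoding="utf-8") as f:
            text = f.read()
        for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            left = text[max(0, match.start() - width):match.start()].replace("\n", " ")
            right = text[match.end():match.end() + width].replace("\n", " ")
            print(f"{left:>{width}} [{match.group(0)}] {right}")

    # Hypothetical usage: how does "play a role in" behave in academic prose?
    kwic("academic_corpus.txt", "play a role in")

A real viewer adds frequency counts and genre information, but the basic lookup is the same idea: see the phrase in its surrounding contexts and notice the patterns.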

Yet, as I came to this conclusion, a part of me questioned whether Ellis really was supporting my theory of learning, or whether I was interpreting his lecture in order to justify my own approach. Either way, I'll share my thoughts with my students next week and see what they think.

So what was I working on during the lecture? I was trying to finish the integrated writing tasks for this semester's final exams. I had written Levels 1-4 earlier in the day, but due to the second language writing research group (L2WRG) meeting I had to put Level 5 on hold. So when a group of us moved directly from L2WRG to the Ellis lecture, I opened my laptop and wrapped up. Even though there was no internet access in the basement, I had taken enough notes on the selected exam prompt topic that I was able to write the reading passage without the source text. And as I wrote, I couldn't help but realize that everything I was writing was, in fact, a series of constructions that I had learned exactly as Ellis explained.

Wednesday, November 7, 2007

Teacher Verification of iBT

While in line for tickets for the university's production of Little Women the Musical (we're going on my birthday for a pre-Thanksgiving holiday warm-up), I finished another integrated tasks evaluation study:

Cumming, A., Grant, L., Mulcahy-Ernt, P., & Powers, D. (2004). A teacher verification study of speaking and writing prototype tasks for a new TOEFL. Language Testing, 21, 107-145.

You may notice that this article was written by Alister Cumming, who also wrote the iBT integrated tasks textual analysis article that I read earlier. I referred to Cumming's other writing-test work for my MA thesis as well as for my rater decision-making study (which I need to finish up and then develop into an article). Cumming seems like a very busy man, but he has been kind enough to respond to a couple of emails that I have sent him about his research and about the UToronto program.

Summary: This article attempts to provide content-, context-, and concurrent-related validity evidence regarding the speaking and writing tasks for the new TOEFL (now known as the iBT). As with the previous studies that I have reviewed, this one focuses on both integrated tasks (combining writing/speaking with listening or reading) and independent ones (in which examinees use their personal experience or opinions to complete the speaking/writing tasks).

Whereas the previous articles focused on the language content (Cumming, 2005) and the scoring procedures (Lee, 2006), this one focuses on how teachers of ESL students feel about the structure of the iBT tasks and their students' performance on these prototype items.

Primary questions are:
  • Is the content domain of integrated tasks perceived to correspond to the demands of academic English requirements of university study?
  • Is the performance of examinees on these tasks perceived to be consistent with their classroom performance?
  • Are the tasks perceived to be adequate evidence for making decisions about examinee language ability?
The researchers found that the teachers felt that these tasks were fairly authentic and represented a variety of language skills required at university (so far as a test is able to recreate authentic situations). There were some concerns that the tasks were not fair to lower-ability students, who performed poorly on integrated tasks if they did not understand the input material; however, for the most part, teachers felt that student performance on the tasks was indicative of classroom ability. The greatest concerns that raters had with the evidence claims of the tasks were that some students may do poorly if they feel uncomfortable sharing personal opinions (for the independent tasks) or if they struggle with the stimulus content (for the integrated tasks). Cumming et al. conclude with some suggestions for improving the task format and content. They also suggest that students would benefit from exam preparation in order to understand the rhetorical and educational aims of the exam. Lastly, they suggest that more research be done into standardized ratings and into teacher perceptions of students' ability in the classroom versus on an exam.

Critique: It's a shame that this article is stuck in the middle between an extensive evaluation (involving a variety of teachers and locations) and an intimate study (focusing on one or two individuals and their unique situation). As a result, we have neither the generalizability of a large statistical study nor the valuable insight into a specific context. Instead we end up with a bunch of incomplete thoughts and ideas. It's as if Cumming et al. scratch the surface of several interesting ideas but never uncover a single one. Still, for all its limitations, this study provided me with a good model for some qualitative research (including a survey that I adapted), and it is also likely to help me form some of my content and context validity questions. I also decided, as I read its account of teacher-based evaluation, that a student-based evaluation would be a great complement to such a study. We'll see if I can make it work, or if I will have bitten off more than I can chew.

Connections: This study showed me that I need to read a lot more about teacher-evaluation studies of tests. Was this a good example? How have others approached this avenue of validation? I did enjoy seeing how this person-focused (as opposed to text-focused or score-focused) study complemented other, more quantitative, studies of the iBT in order to give a more complete and well-rounded assessment of the exam.

Additional Reflections:
  • I would really like to do an evaluation of the integrated writing tasks at the ELC that incorporates myriad sources of validity evidence, including teacher evaluation, student evaluation, textual analysis, score analysis, and possibly more. Am I crazy to want to do this much? Will TREC approve such a study? Will I find a dissertation committee who will approve such a multi-faceted approach?
  • Who will I choose as teachers for my teacher evaluation? I think that I would prefer to use all of them (provided that they agree to participate in the study). I don't want to exclude someone because their reasons for not being eager to volunteer may actually be connected to their feelings about the exam and are therefore a real concern that I need to take into account.
  • Will I have time to do this for speaking as well as writing? My tentative schedule, which I need to present to potential committee members next week, outlines an evaluation of writing for winter 2008 and speaking in summer 2008. I think I can be ready, but the issue is as much about analyzing data as it is about collecting it. Who will I get to help, especially when I am so busy around exam time with administrative duties?
  • What kind of concurrent validity claims might I make? Will I ask teachers to rate their students and then compare performance with expectations, or will I use data that we already collect (classroom scores or rated evaluations)? Are teachers very good at predicting student ability? Lee's 2005 thesis on the L/S tests at the ELC says no (as did major portions of this study), but I think that both of those are problematic: Lee because the teacher ratings were probably based on Speaking when she was comparing them to Listening scores, and in Cumming's case, he admits that BICS/CALP may have a lot to do with false impressions. Of course BICS/CALP may be an issue for me too, but if students and teachers practice items in the classroom more than once, then teachers ought to be good at predicting success, shouldn't they?
  • How will tasks be adjusted for lower levels (addressing the concerns raised by Sara in this study)? Laura (our Level 1 teacher) has already told me that she thinks Level 1 students need to be able to hear the integrated listening passage twice. As it is, they barely understand it before it's over. A second listening for Levels 1 and 2 might be appropriate and justified (given the findings of this study and Laura's experience).

Tuesday, November 6, 2007

G-Study of iBT

The most recent paper that I finished reading is:

Lee, Y. (2006). Dependability of scores for a new ESL speaking assessment consisting of integrated and independent tasks. Language Testing, 23, 131-166.

Summary: Lee conducts a generalizability study of prototype speaking tasks for the iBT (internet-based TOEFL). The purpose of the study was to assess to what degree reliability increased with the number of tasks and ratings.

The tasks included three types: integrated reading-speaking, integrated listening-speaking, and independent. The effect of individual raters was ignored due to the difficulties of orchestrating a fully-crossed design; however, rating was considered as a facet, so all ratings were analyzed in a double-rating model.

Lee concludes that increasing tasks has a greater influence on increasing reliability than does increasing raters. Lee also found that there was sufficient evidence to justify collapsing analytic scores into a holistic score of speaking given that correlations among task types were quite high.

This has implications for the iBT speaking component. First, including more than one speaking task is more likely to increase the reliability of scores than adding extra raters. In other words, Lee suggests that more reliable scores are given when raters have a greater number of speaking samples from the examinees than when multiple raters are used. In essence, many raters can still come up with divergent scores when only one sample is used, but when many samples are provided, even a small number of raters can score the examinee accurately.
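To make sure I understood why tasks matter more than ratings, I sketched the D-study arithmetic with made-up variance components for a person x task x rating design (these numbers are purely illustrative, not Lee's):

    # Hypothetical D-study projection for a person x task x rating design.
    # Variance components are illustrative only, not taken from Lee (2006).
    var_p   = 0.50   # person (true-score) variance
    var_pt  = 0.20   # person-by-task interaction
    var_pr  = 0.05   # person-by-rating interaction
    var_ptr = 0.25   # residual (person-by-task-by-rating plus error)

    def g_coefficient(n_tasks, n_ratings):
        """Generalizability coefficient for relative decisions."""
        rel_error = (var_pt / n_tasks
                     + var_pr / n_ratings
                     + var_ptr / (n_tasks * n_ratings))
        return var_p / (var_p + rel_error)

    for n_tasks in (1, 2, 4, 6):
        for n_ratings in (1, 2):
            print(f"tasks={n_tasks}, ratings={n_ratings}: G = {g_coefficient(n_tasks, n_ratings):.2f}")

With components like these, going from one task to four raises the coefficient far more than adding a second rating to a single task, which is the pattern Lee reports.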

The second implication affects the use of holistic scoring. Rather than requiring raters to score each sample separately (and then averaging the separate scores into a composite score for each examinee), Lee suggests that since all task types tend to correlate highly, there is sufficient evidence to warrant the use of a holistic score for each examinee based on an overall impression rather than a combined average. The benefit is that this reduces the time it takes raters to score a single examinee.

Critique: Admittedly, I am not impressed that Lee did not attempt a fully-crossed design. I think this weakens the claim that raters have little effect on increasing reliability, given that we have nothing more than a one-rating versus two-ratings comparison. I recognize that fully-crossed designs are really hard to manage, but Lee was funded by ETS (which creates the TOEFL), so I don't know why ETS didn't demand a more rigorous research design.

I am also concerned that Lee does not describe the testing occasions. Did students take all 12 tasks at once, leading to enormous test fatigue? Or were the testing occasions spread out over several weeks, thereby possibly confounding the results, since students may have improved in proficiency over the period of testing? I need to email Lee and ask about this specifically.

Connection: Thankfully I have been introduced to generalizability theory in LING 660 (Language Testing) and IP&T 752 (Measurement Theory), because it is hard for a researcher to really explain it well in a short article. I had also previously read a study (for my thesis) written by Rob Schoonen (who is a much better writer than Lee) and this helped me to understand the analysis and results that Lee described.

Lee does a decent job of connecting the results of this study to others. I found myself coming to similar conclusions and making connections to my limited experience with G-studies (such as Schoonen). I also found it valuable to read this study in connection to the iBT validation studies by Cumming. It helped me gain a bigger picture of the iBT validation project and how all these different elements function together to inform the development of this high-stakes exam.

Additional Reflections: This article sparked a lot of ideas for me.
  • Could my study include a comparison of analytic/holistic scoring? Or at the least, could I compare portfolio scores (holistic) with scores for just the integrated tasks? This may justify the need for a separate score for draft writing and timed writing.
  • I like that Lee suggests that test developers need to state the purpose of the task. Are independent tasks just about language, and integrated tasks just about content? In our case, probably not. They are both about language. This needs to be clarified.
  • A scoring-related validity component would be a valuable part of my evaluation that might also include teacher and student content/context validity evidence.

Friday, November 2, 2007

Research Progress

I just spoke with Sharon (our programmer) about how the integrated writing tasks are going (we are piloting them with Levels 1, 3, and 5 today). It was good to see that Level 1 (the only level that handwrote) was able to summarize – and even synthesize – the listening and reading passages. We talked about how the Level 3 task might be especially difficult, given that it has not been adjusted for Level 3 and is at TOEFL level with just a few vocabulary changes to make it easier.

We also discussed writing ratings and the evaluation study. She is very open to programming an interface for online writing rating feedback, since she previously developed something similar (for speaking ratings). I explained that raters would want to see what others had rated a portfolio after they submitted their own rating to the system. She thought I meant all current raters, but I explained that it would be a database of previous ratings (collected from previous semesters and maybe including a recent set of ratings that I solicit from current teachers to be added to the database of benchmark portfolios).

I also think that the database should include more than just scores; information from the evaluation surveys indicates that raters want to see/hear the justifications for a rating. So, perhaps I could get current raters to write justifications (or maybe audio record them and then transcribe them) and add those to the database along with their scores.
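To help me think about what to ask Sharon for, here is a minimal sketch of the kind of record such a benchmark database might store. The names and fields are my own placeholders, not her design.

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical record for a benchmark-portfolio rating database.
    # Field names are placeholders, not the actual system design.
    @dataclass
    class BenchmarkRating:
        portfolio_id: int                     # which benchmark portfolio was rated
        rater_id: int                         # anonymized rater identifier
        semester: str                         # e.g. "Fall 2007"
        score: int                            # holistic score on the rating scale
        justification: Optional[str] = None   # written or transcribed rationale
        audio_path: Optional[str] = None      # optional recording of the rationale

    # Example of an entry a rater might see after submitting their own rating
    example = BenchmarkRating(portfolio_id=12, rater_id=3, semester="Fall 2007",
                              score=4, justification="Clear synthesis of both sources.")
    print(example)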

Until this week, the evaluation had been progressing slowly. At the end of last semester I distributed surveys to all raters, but my research assistants never picked up the surveys before they both left for out-of-town jobs at the end of the semester. Frustrated but not discouraged, I resent the surveys at the beginning of this semester, but half of the raters had left the ELC (due to graduation and full-time job offers), so I was only able to send out seven surveys. At the beginning of this week I had received only two of the seven, so I sent out another reminder, and by this morning I had gotten two more. I will follow up with the remaining raters in order to get the maximum number of responses possible.

The next step for the evaluation? I need to schedule interviews or a focus group with the raters to follow up on the surveys. TREC counseled that I not lead the interviews/focus group, in an effort to minimize rater inhibition and encourage free expression. I could ask a grad student to do it for me, but I want to use someone who is familiar with the rating system. I could also use a current teacher who is one of my raters, but will her personal bias influence how she interprets feedback from others?

The next step for the dissertation study will involve sending a pilot survey to teachers and students based on the practice LAT being piloted this week and next. The survey piloting will help me revise the survey for actual use at the end of the semester during LATs. I have received IRB approval, but I am still awaiting final approval of my release form (the ORCA office wants me to update it).