Jay Michela addresses misconceptions in Alex Usher’s analysis of the Ryerson arbitration decision.
Guest post by Jay Michela, Psychology.
Alex Usher of Higher Education Strategy Associates (HESA) has offered his analysis of an arbitration decision at Ryerson University which ruled against conventional use of students’ course ratings for personnel decisions (tenure and promotion decisions). It has been circulated within our university and elsewhere (e.g., to OCUFA), and appears on the HESA website under the headline “Time to Talk Teaching Assessments.”
I was moved to respond to Usher’s statement because it expresses many of the misconceptions that exist around summative use of students’ ratings of courses and instructors.
What follows is the full text of Alex Usher’s analysis, with my responses interspersed. I hope this format for explaining the urgent need to change university practices around student questionnaires turns out to be more engaging and pithy than some of the literature reviews and other research reports on which this material is based.
USHER: Something very important happened over the summer: The Ryerson Faculty Union won its case against the university in Ontario Superior Court against the use of student teaching evaluations in tenure and promotion decisions (it was silent on merit pay, but I’m fairly sure that’s because Ryerson academics don’t have it – as legal precedent I’m 100% certain merit pay is affected, too). This means literally every university in the country is going to have to re-think the evaluation of teaching – which is a fantastic opportunity to have some genuinely interesting, important national conversations on the subject.
Let’s talk about the decision itself. Technically, it did not tell Ryerson to stop using teaching evaluations in tenure/promotion decisions. What it said was that the university could not use averages from teaching evaluations in tenure/promotion decisions because the averages are meaningless.
MICHELA: We are in agreement that the arbitration decision is an important one, with potential for expanding and hopefully raising the level of “national conversations on the subject.” Let’s bear in mind that arbitrators are chosen by disputing parties as someone with no axe to grind and with the sophistication to receive and evaluate relevant facts and analysis dispassionately, if not fully expertly.
Where we disagree initially is in terms of “what it said,” most fundamentally. The arbitrator’s report was crystal clear: no one should consider students’ ratings, expressed either by averages or score distributions, as measures of teaching effectiveness.
Usher is nonetheless correct that the decision allows use of the ratings in tenure/promotion decisions, as described next.
USHER: It left the door open to using distributions of scores, and I think it left the door open to adjusting the averages for various factors.
MICHELA: Yes, it left the door open to using distributions of scores, but for what purpose? The arbitrator said reasonably clearly that student questionnaires measure student satisfaction or related aspects of student experience. We must conclude that the arbitrator thus is allowing the university to use student satisfaction as a factor in personnel decisions including tenure/promotion decisions.
The decision may also have left the door open to adjusting the averages for various factors; this was certainly not crystal clear, either to Usher, as reflected in his language “I think,” or to me (this responding writer, J. Michela). In any case, it would be an enormous mistake to make those adjustments and then consider the ratings information to have been enhanced. To allow adjustments is to acknowledge in the first place that the initial rank order of instructors is incorrect (rank order from low to high on student satisfaction or whatever is measured, that is).
Adjustments are believed to improve the accuracy of the rankings. This belief is mistaken for two reasons explained further in my fuller statement in this report (PDF) produced by myself and my colleagues in the Department of Psychology here at Waterloo. First, psychological research on bias and on decision making tells us that when people try in a subjective manner to adjust for bias, they are not at all equipped to do this (even if they think they are). Second, statistical or arithmetic adjustments, such as giving some amount of score increment to all female instructors, compound the inaccuracy in rank orders of instructors because of statistical interactions that go far beyond what any statistical analysis can handle. It was shown at the University of Waterloo, for example, that female instructors were evaluated less favourably than male instructors by students with low marks, but this effect of instructor gender disappeared among students with high marks. This difficulty with interactions probably applies to all or nearly all of the biasing factors. Are morning classes acceptable for students in later years of a program but not for those in early years? Are classes taken mostly by out-of-major students rated low if required by a different department, but not if their topic is of general interest? And so forth. Thus, statistical adjustments will merely add additional scrambling to the already scrambled positions of instructors on whatever dimension is being measured in the first place.
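The interaction problem can be made concrete with a small numerical sketch. The numbers and the simple “penalty” model below are invented for illustration (they only mimic, and are not drawn from, the Waterloo finding): a flat score increment for female instructors cannot restore the correct rank order when the size of the gender penalty depends on the mix of low-mark students in each course.

```python
# Hypothetical illustration only: the numbers and the linear "penalty" model
# are invented, not taken from the University of Waterloo data.
def observed_rating(true_quality, frac_low_mark, is_female):
    """Rating on a 5-point scale, where only low-mark students apply a
    0.5-point penalty to female instructors (a statistical interaction)."""
    penalty = 0.5 * frac_low_mark if is_female else 0.0
    return true_quality - penalty

# Three equally effective instructors (true quality 4.0) in different courses.
female_many_low_markers = observed_rating(4.0, frac_low_mark=0.8, is_female=True)
female_few_low_markers  = observed_rating(4.0, frac_low_mark=0.2, is_female=True)
male                    = observed_rating(4.0, frac_low_mark=0.5, is_female=False)

# A flat "+0.5 for all women" adjustment ignores the interaction:
adjusted = [female_many_low_markers + 0.5, female_few_low_markers + 0.5, male]
print(adjusted)  # approximately [4.1, 4.4, 4.0]
```

Three equally effective instructors end up with three different adjusted scores; the flat correction has re-scrambled rather than restored the rank order.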
USHER: The experts brought in by the Ryerson Faculty Association showed convincingly (see here (PDF) and here (PDF)) that student evaluations have been shown to have biases concerning (among other things) race and gender.
MICHELA: The language here of “among other things” could seem to minimize the problem of bias. The list of potentially biasing factors in the arbitrator’s report alone is quite long. Moreover, because those factors do not always operate (i.e., there is the vexing complication of statistical interactions), they are all potentially problematic yet insoluble in actual instances and practice.
USHER: I think it’s within the spirit of the decision to at least allow the university to use adjusted scores from the student evaluations. For instance, if women are systematically ranked lower by (say) 0.5 on a 5-point scale (or say by a third of a standard deviation if you prefer to calculate it that way), just tack on that amount to the individual’s score. Really not that difficult.
MICHELA: In this response it has been argued that truly helpful adjustment (adjustment that restores the proper rank order of instructors) is beyond difficult; it is impossible. A crucial question arises here: Where does the burden of proof lie for those who make such sweeping statements about the solubility of the problems of bias? The evidence for the sizable operation of bias is overwhelming, to the point of not even having been contested in the Ryerson case. This evidence comes from a very large body of empirical research, which, however, does not find a statistically significant effect for every potential biasing factor in every instance in which it has been tested. This is exactly the pattern we would expect to see in the literature if many unknown statistical interaction factors are also in play. What contrary evidence, analysis, and argument does Usher offer for the likely solubility of bias through adjustment?
USHER: The problem, I think, is that there are a lot of voices out there that actually want to do away with student input on teaching altogether. To them, the fact that bias can be corrected is irrelevant.
MICHELA: Although there have been articles along the lines of “Student ratings are worthless,” this is neither my position nor that of any others I’ve spoken with at my university. In drafting our statement previously referenced, my colleagues from psychology and I urged a shift toward designing these questionnaires truly to be for teaching improvement, by designing and using them for formative evaluation instead of summative evaluation. We took the strongest possible position that the problems with use of students’ ratings for summative evaluation are both insoluble and very, very serious. These serious problems from summative use involve not only severe injustice stemming from bias, but also definite harms to student instruction and learning, as when innovative teaching is penalized or when intellectual or emotional challenge (e.g., uncomfortable facts or perspectives) is involved. We documented that injustice and those harms to learning in our statement.
USHER: The fact that you can ask way better questions about teaching than are currently being asked, or that you can use questionnaires to focus more on the learning experience than on the instructor, is irrelevant. To them, any student evaluation is just a “satisfaction survey” and what do students know anyway?
MICHELA: It is a fantasy that asking better questions either will solve the problem of bias that creates injustice or will remove the perverse incentives that yield ultimate harm to student learning. No one has been able to explain to me what is solved (in relation to the true nature of the problem) except for when the questions are truly garbled, which is infrequent.
The argument against item improvement as a solution follows from a deeper understanding of the problem here—how students arrive at their biased ratings on questionnaires. For example, why are today’s students, who are clearly concerned with justice and injustice collectively, prone to giving lower ratings to women? There are many reasons that have little to do with the precise wording of the questions. One line of analysis involves how students heuristically apply cognitive schemas such as stereotypes when they arrive at their ratings—partly because students do not know a lot about what would constitute an appropriate instructional method, course design, and so forth. But they do hold the (unfounded) stereotype that, for example, men are more competent than women. Given uncertainty, bias creeps in. Another line of analysis involves how ratings are influenced by “affect,” such as when it is difficult to feel enthusiastic in an 8 a.m. course offering or in a course that involves a lot of memorization as opposed to ideas or self-expression. Compounding such effects is the operation of “halo” effects or bias—the tendency to bring all survey item ratings into some degree of conformance with an overall reaction to the course, positively or negatively. Thus, for example, there have been empirical demonstrations that even when an instructor goes to great lengths to return all marked assignments completely on time, even to the point of documenting students’ ongoing (real time) acknowledgments of timely return of assignments, ratings of this matter on a reasonably-well-worded item do not square with this reality. (See Nilson, 2012, p. 218.)
While on the topic of the bases of students’ ratings, it should be acknowledged that some students will deliberately give ratings that are “inaccurate” for reasons such as retaliation for a poor grade. (A relatively strong correlate of overall ratings is grade expectation or award.) Nilson (2012, p. 212) addresses instances in which ratings do not square with evident facts:
“Were these misrepresentations of the truth due to students’ forgetting, misunderstanding, or lying? Clayson and Haley (2011) surveyed students about their honesty in their ratings and written comments, and the disturbing results confirmed Stanfel’s and Spoule’s worst suspicions: about one-third of the students confessed to “stretching the truth,” 56 percent said they knew peers who had, and 20 percent admitted to lying in their comments. Moreover, half the students did not think that what they did constituted a kind of cheating.”
These are not the only empirical reports of this kind.
USHER: Now, there are good ways to evaluate teaching without surveys. Nobel-prize-winning physicist Carl Wieman (formerly – if briefly – of UBC) has suggested evaluating professors based on a self-compiled inventory of their (hopefully evidence-informed) teaching practices. Institutions can make greater use of peer assessment of teaching, either formative or summative, although this requires a fair bit of work to train and standardize assessors (people sometimes forget that one argument in favour of student evaluations is that they place almost no work burden on professors, whereas the alternatives all do – be careful of what you wish for).
MICHELA: Yes, peer assessment is worthwhile, though demanding. In various places Philip Stark has noted that peer assessment need not be nearly as onerous as some fear, because conducting it selectively (at career milestones) instead of annually can serve its purposes.
Usher’s source Wieman has given great service to academia by seeking to promote active learning and other forms of instructional innovation, which means moving away from the traditional approaches with which students are most comfortable–and therefore rate most highly (to the detriment of students themselves!). It is worth quoting Wieman (2015, p. 10) to show that he not only favours alternative assessment as Usher pointed out; he decries summative use of students’ ratings:
“Faculty almost universally express great cynicism about student evaluations and about the institutional commitment to teaching quality when student evaluations are the dominant measure of quality. At every institution I visit, this sentiment is voiced.”
In this vein, Maryellen Weimer (2010, p. 75) adds:
“Formative feedback as described in this chapter of the book is more likely than summative feedback to motivate change and to make the changes faculty implement more likely to improve learning.”
USHER: But I personally think it is untenable that student voices be ignored completely. Students spend half their lives in class rooms. They know good teaching when they see it.
MICHELA: Another esteemed science educator, Eric Mazur, concurs in his rejection of what he called “the standard way of evaluating teaching.” In his courses, students “teach one another.” This can count against the instructor when students think the instructor is not doing his or her job. Yet many teaching and learning writers and coaches promote this and other forms of “active learning.”
Documenting how students do not see active learning and other innovations as good teaching, a PhD dissertation (by the head of the University of Waterloo’s Centre for Teaching Excellence, D. Ellis, 2013) quoted the following:
“Felder and Brent indicate that ‘when confronted with the need to take more responsibility for their own learning, students may grouse that they are paying tuition—to be taught, not to teach themselves… course-end ratings may initially drop. It is tempting for professors to give up in the face of all that, and many unfortunately do’ (p. 43). Hockings (2005) corroborates this finding….” (p. 10)
When a student told me (verbatim) that he knows good teaching when he sees it, the student had no answers to (gently asked) questions about how he would know what went into choosing a textbook or into establishing an overall arc or theme for a course; whether presented material is factually accurate; and so forth. The literature reveals many difficulties in defining good teaching in the first place, and the typical instructor is not highly articulate about it. Students know much less.
On reflection there is nothing immediately logical or compelling about the assertion that “students know good teaching when they see it.” They know likable or agreeable or attractive teaching (or instructors) when they see it (or them). And in this sense student questionnaires do indeed provide information about the “learning experience,” that is, mostly about satisfaction. Correlated but potentially distinct, additional aspects of the learning experience may also be pertinent, such as class climate or whether the course increased one’s interest in the topic. However, any such information, at a minimum, should be interpreted in context (e.g., was part of the point to make students uncomfortable or otherwise challenged?) and with awareness about contaminating factors (including the massive halo effects) often in play.
Student “voices” are useful when the context is one of formative evaluation. For example, an instructor can look for changes in student ratings, from term to term, in course aspects or components that are altered on the basis of prior student ratings or other information. Specifically, the textbook could have been rated previously as not well liked, and a different text could be substituted.
Personally, I could also support use of students’ ratings for an “early warning” or “canary in a coal mine” function. That is, if a course is “off the rails” in some regard, students’ ratings may reflect this with notably low ratings. Thus, it could be justifiable to structure the processing of student ratings data so that such instances would come to the attention of someone least directly involved in personnel decisions, such as an associate chair of undergraduate education within a department (as distinct from members of a committee for performance assessment, in particular). This would instigate a discussion between the departmental official and the instructor, potentially leading to instructional improvement planning or to an understanding that non-instructional factors explain the ratings. If procedures like this were to be adopted, great care would be needed to isolate them from performance assessment such as assigning annual scores that translate to pay. Otherwise users of these procedures would remain in the same pickle as exists today with respect to bias, harm to learning, and so forth.
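Such an “early warning” routing could be sketched as a simple screening step. Everything concrete below is a hypothetical assumption for illustration: the 2.5 threshold, the course names, and the data layout (course to mean overall rating on a 5-point scale) are invented, and the output would go only to a departmental official, not to a performance-assessment committee.

```python
# Hypothetical sketch of the "canary in a coal mine" routing. The threshold,
# course names, and data layout are assumptions for illustration only.
FLAG_THRESHOLD = 2.5  # "notably low" on a 5-point scale (illustrative choice)

def flag_for_review(course_ratings):
    """Return courses whose mean overall rating suggests something may be
    "off the rails" -- intended for a follow-up conversation with, e.g.,
    an associate chair, NOT for performance assessment."""
    return sorted(course for course, rating in course_ratings.items()
                  if rating < FLAG_THRESHOLD)

print(flag_for_review({"PSYCH 101": 4.2, "PSYCH 238": 2.1, "PSYCH 340": 3.8}))
# ['PSYCH 238']
```

The design point is the isolation: only the flagged list leaves the screening step, so committees assigning annual scores never see the underlying ratings.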
USHER: They may have a whole bunch of implicit biases which they transfer to their assessment, but the idea that their input is worthless, that they are simply too ignorant to give valuable feedback – which is what a lot of the dismissals of their value amount to – is, frankly, arrogant snobbery. Anyone pushing that line is probably less against the concept of student teaching evaluation than he/she is against the concept of evaluation tout court.
MICHELA: Putting aside the strident language, let’s get back to the central questions:
- Are students’ ratings too “flawed” (to use the gentle term) to be sufficiently accurate as a basis for hugely consequential personnel decisions for instructors, and
- Does summative use of students’ ratings have perverse effects for student learning, connected with disincentive for teaching innovation and for challenging of students?
Strong evidence for an answer of “yes” to the first question comes from various writings of the external experts commissioned for the Ryerson arbitration. Strong evidence for “yes” to the second is most accessibly reviewed by Stroebe (2016).
USHER: (Don’t dismiss this point. The Ontario Confederation of University Faculty Associations has been very vocal in its campaign against teaching evaluations in the last few years, yet not once to my knowledge has it suggested an alternative. I get the problems with the current system, but if you’re not putting forward alternatives, you’re not arguing for better accountability, you’re arguing for less accountability).
MICHELA: The concern by OCUFA of which I am aware is with use of teaching “evaluations” (as Usher calls student ratings from questionnaires) for highly consequential personnel decisions (summative use). This is also the present concern in this reply to Usher. See: OCUFA’s Briefing note on student questionnaires on courses and teaching.
Two alternatives are very well-known and salient: Peer evaluation and teaching dossiers. These alternatives usually receive little or no use because of the mistaken belief that students’ ratings are up to the task of teaching “evaluation.” Students’ ratings are cheap and easy and numerical. Thus, they will continue to dominate until there is an understanding of their total inappropriateness for summative evaluation.
The tide is turning toward such an understanding, beyond Ryerson and thus beyond Ontario. The provost at the University of Southern California declared summative use to be inadmissible. According to a May 2018 article in Inside Higher Ed:
“[The Provost] just said, ‘I’m done. I can’t continue to allow a substantial portion of the faculty to be subject to this kind of bias,’” said Ginger Clark, assistant vice provost for academic and faculty affairs and director of USC’s Center for Excellence in Teaching. “We’d already been in the process of developing a peer-review model of evaluation, but we hadn’t expected to pull the Band-Aid off this fast.”
The former head of Rice University’s instructional support service, E. Barre, described her 180-degree turn on the appropriateness of summative use of student questionnaires in a web posting entitled “Research on Student Ratings Continues to Evolve. We Should, Too”:
“The most important recommendation I would now make is the following: we should put a moratorium on using student ratings results to rank and compare individual faculty to one another.”
Barre goes on to recommend precisely what I have been recommending locally at my university and to the OCUFA panel on which I am a member:
“Second, while comparing faculty to one another is dangerous, the quantitative scores can still be valuable if used to chart growth of a single instructor over time. Presuming that most of the noise in the measure is the result of variables unique to each instructor and the courses they teach, there is likely to be much less variability over time unless there is genuine improvement. It will be important to not over-interpret small differences in this case, as well (dropping from a 4.3 to a 4.2 average is not a cause for concern!), but if an instructor moves from a 2.5 average to a 4.5 average over the course of their career, we can be fairly confident that there was real and significant growth in their teaching performance.”
In recent literature (over the past five or more years) one category of the remaining defenses of summative use is non-empirical opinion pieces much like Usher’s. Two of these leaned heavily on Barre’s outmoded position (!), one with the not-very-reassuring title “In Defense (Sort of) of Student Evaluations of Teaching” (K. Gannon, May 6, 2018, Chronicle of Higher Education). The other is from people who sell student-based course evaluation services to educational institutions. In these pieces I have seen claims similar to Usher’s about the solubility of the inherent problems, but nothing that addresses the statistical reasons why those solutions can’t work, nor anything that provides evidence that they can. Similarly, those articles do not address the issues raised here involving disincentive to innovation and challenge to students. Stark has noted that these proponents have not responded to experiments with strong (experimental or quasi-experimental) designs that show no correlation (or negative correlation) between students’ ratings and their actual levels of learning in a course.
USHER: There are alternatives, however. One that universities could consider is the system in use at the University of California Merced, which Beckie Supiano profiled in a great little piece in the Chronicle of Higher Education last year. The Merced program, known as SATAL (Students Assessing Teaching and Learning) trains students in classroom observation, interviewing and reporting techniques. Small teams of students then assess individual classes – some focussing on instructor behavior, others focussing on gathering and synthesizing student feedback. In other words, it professionalizes student feedback.
MICHELA: Aside from practical implementation questions, there is an empirical question as to whether student feedback can be made sufficiently accurate for summative use for personnel decisions in this manner. For example, there is considerable risk of continued bias in favour of younger, more attractive, white, male instructors who behave more enthusiastically, mark more leniently, and so forth. Most empirical research is not encouraging, but improvement certainly is possible where there is so much room for it.
Another potential improvement, suggested and then recommended by the arbitrator (based on the external input acquired), is to avoid interpretation of means or averages and, instead, interpret only score distributions (the proportions of ratings of “excellent,” “good,” etc. from a given rating scale) during summative evaluation. This advice is deeply misguided in the context of summative evaluation. The arbitrator and his sources have not made clear that the primary effect of bias is to shift a distribution of responses upward or downward among the available rating responses. Consequently, gender bias, halo bias, and other biases (and errors or deliberate distortions) are still entirely present in a frequency distribution. So is the incentive to teach more pleasingly instead of more effectively. The related suggestion to avoid numerical designation of the available rating categories similarly does nothing to solve the fundamental problem here.
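A toy simulation, with invented response counts, shows why switching from the mean to the frequency distribution does not remove a bias: a bias that pushes some fraction of responses down one rating category shifts the whole distribution, so the distribution-based report carries exactly the same distortion that the average would.

```python
from collections import Counter

# Invented numbers, purely illustrative: 100 students whose "true" reactions
# are split evenly between "good" and "excellent."
DOWN_ONE = {"excellent": "good", "good": "fair", "fair": "poor", "poor": "poor"}

def apply_bias(responses, fraction):
    """Push the first `fraction` of responses down one rating category,
    mimicking a bias that lowers some students' ratings."""
    n = int(len(responses) * fraction)
    return [DOWN_ONE[r] for r in responses[:n]] + responses[n:]

true_responses = ["good"] * 50 + ["excellent"] * 50
observed = apply_bias(true_responses, 0.4)

print(Counter(true_responses))  # true picture: 50 good, 50 excellent
print(Counter(observed))        # reported distribution: 40 fair, 10 good, 50 excellent
```

Reporting the observed frequencies instead of the mean changes nothing: the biased shift is visible in, and inseparable from, the distribution itself.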
It is in the context of formative evaluation that reporting in terms of frequency distributions has potential to be more informative than mere averages. An instructor and his or her teaching supporter can look to see, for example, whether implementation of innovative instruction yields both high and low ratings, producing a seemingly unfavourable average. This discovery could not only be reassuring to some degree, but also point toward a better understanding of the reasons for this pattern and toward methods to address it.
Overall, any summative use of students’ ratings, whether framed as involving measures of “teaching effectiveness,” “student experience,” or “customer satisfaction,” thus risks harm to student learning and injustice to instructors.
USHER: The real answer here of course is that multiple perspectives on teaching are required both for formative and summative purposes. The Ryerson Faculty Association was right to push back on using averages. The trick now is to use the opportunity this ruling provides to put the assessment of teaching on a more solid footing right across the country. It’s a particular opportunity for student unions: a once-in-a-generation chance to really define what is meant by good teaching and putting it at the heart of the tenure and promotion process. Any student union thinking about focussing on any other issue for the next 24 months is wasting a golden opportunity.
MICHELA: If student unions draw upon the empirical evidence that summative use of students’ ratings promotes less learning, not more, and that this use generates the considerable injustice so widely discussed on campus currently, this will indeed be a golden opportunity to shift to the more productive and justifiable emphasis on formative use.
An associate professor of psychology at the University of Waterloo, with extensive research and practice experience with survey research, John (Jay) Michela is also the designated methods analyst for an OCUFA panel that was convened to examine various issues connected with student questionnaires.