Student Evaluations Paint A Faulty Picture

Berklee student evaluations paint a faulty picture. How can we regain focus?

"We’re confusing consumer satisfaction with product value" -Philip B. Stark, UC Berkley

Spring is here, and so is the season for finals, recitals, grading, and… student course evaluations. As faculty and students crunch toward a finish line four weeks away, stress levels and workloads are high for everyone. Candid conversations with colleagues and students reveal that neither group is satisfied with the existing evaluation process.

Some students complain that there are simply too many evaluations to fill out each semester, one for every course. Some believe that evaluations are for complaints only and won’t take the time to comment when their experience has been positive. One student confesses, “I close my eyes and fill out the dots without reading the questions, so maybe I’ll get the free iPad.” Another believes the results “go straight into the trash.” Another sees no difference between filling out the Berklee survey and rating a professor on RateMyProfessor.com.

Faculty members note that the survey gives them no place to respond to anonymous negative comments, and that they cannot offer their supervisors any hard information beyond a guessing game. Some feel deeply frustrated, even threatened, when their semester performance review is based primarily, if not solely, on student evaluations.

This is not an argument for avoiding evaluation. Student feedback is a very important part of our pedagogy and teaching goals. But at present, Berklee has a serious disconnect in how students, faculty, chairs, and deans interpret the existing surveys, and particularly in how haphazardly their data are used in the faculty review process.

This dilemma seems to echo nationally, according to two scholars of statistics at the University of California, Berkeley, Philip B. Stark and Richard Freishtat, who write in their 2014 study “An Evaluation of Course Evaluations”:

The common practice of relying on averages of student teaching evaluation scores as the primary measure of teaching effectiveness for promotion and tenure decisions should be abandoned for substantive and statistical reasons: There is strong evidence that student responses to questions of “effectiveness” do not measure teaching effectiveness. Response rates and response variability matter. And comparing averages of categorical responses, even if the categories are represented by numbers, makes little sense.

Dan Barrett, who interviewed the authors for his Chronicle of Higher Education article “Scholars Take Aim at Student Evaluations,” writes that [Stark and Freishtat] recommend methods of evaluating teachers including a “teaching portfolio, syllabi, notes, websites, assignments, exams, videos, and statements on mentoring, along with students’ comments on course evaluations and their distribution.”

Stark notes that "it’s totally valuable to ask [students] about their experience… but it’s not synonymous with good teaching.”

If you don't have time to read Freishtat and Stark's study, here are some bullet points:

· Response rates themselves say little about teaching effectiveness. In reality, if the response rate is low, the data should not be considered representative of the class as a whole, and no explanation of the low rate changes that.

· Averages of small samples are more susceptible to “the luck of the draw” than averages of larger samples. This can make student evaluations of teaching in small classes more extreme than evaluations in larger classes, even if the response rate is 100%. And students in small classes might imagine their anonymity to be more tenuous, perhaps reducing their willingness to respond truthfully or to respond at all. (A short simulation after this list illustrates the small-sample point.)

· Personnel reviews routinely compare instructors’ average scores to departmental averages. Such comparisons make no sense, as a matter of statistics.

· While some student comments are informative, one must be quite careful interpreting the comments: faculty and students use the same vocabulary quite differently, ascribing quite different meanings to words such as “fair,” “professional,” “organized,” “challenging,” and “respectful” (Lauer, 2012).

· Moreover, it is not easy to compare comments across disciplines (Cashin, 1990; Cashin & Clegg, 1987; Cranton & Smith, 1986; Feldman, 1978), because the depth and quality of students’ comments vary widely by discipline.
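
For readers who want to see the “luck of the draw” point in action, here is a minimal Python sketch. The class sizes, five-point scale, and response probabilities are illustrative assumptions, not Berklee data; it simply shows how much more an average rating swings when the same pool of hypothetical students is sampled in small numbers.

    # Minimal sketch: averages from small classes swing more than averages
    # from large classes, even when every student draws a rating from the
    # same underlying distribution. All numbers are illustrative assumptions,
    # not Berklee data.
    import random
    import statistics

    random.seed(1)

    SCALE = [1, 2, 3, 4, 5]
    # Hypothetical response distribution: most students rate 4 or 5.
    WEIGHTS = [0.05, 0.10, 0.20, 0.35, 0.30]

    def simulate_average(class_size):
        """Average rating for one simulated class of the given size."""
        responses = random.choices(SCALE, weights=WEIGHTS, k=class_size)
        return statistics.mean(responses)

    for class_size in (8, 80):
        averages = [simulate_average(class_size) for _ in range(10_000)]
        print(f"class of {class_size:2d}: averages spread from "
              f"{min(averages):.2f} to {max(averages):.2f} "
              f"(standard deviation {statistics.pstdev(averages):.2f})")

Run this a few times and the small class’s average wanders far more from semester to semester than the large class’s, which is exactly why comparing one instructor’s small-section average to a departmental average is so treacherous.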
 

Additionally, instructor gender has been shown to influence student evaluations, according to a study at North Carolina State University.

An info session on course evaluations, sponsored by Institutional Research and Assessment and Faculty Development, was held on April 24. Look for a report on that event in this space soon.