AMBER ALERT

Friday

Economic Policy Institute Briefing Paper

The following is verbatim from the Executive Summary of one of the most important policy briefs about education in recent years. At a time when the Dept. of Education is pushing to tie teacher evaluation and compensation to student test scores, this Economic Policy Institute Briefing Paper (which is available in pdf) pulls together the extensive relevant research that demonstrates the dangers of pursuing such a path. Please continue reading as I explore this important document, released at 12:01 AM today, August 29.

First, let me clarify several things. 

This is a very long diary, because I am trying to cover the contents of an extremely important document reasonably thoroughly. My purpose in doing so is to convince people of the document's importance. Thus I will be perfectly happy should you decide you do not need to read further what I have written below. You can follow the link for the brief (which is provided here), download the pdf, and begin reading. The executive summary is only four pages; the brief itself, without the critical apparatus of footnotes and sources, runs another 17. So if you want, one more time, follow this link.


This document has been in the works for several months, and was NOT hurriedly put together as a response to the recent series by the Los Angeles Times which used value-added assessment to label teachers in the Los Angeles Unified School District. Second, the ten scholars whose names are on the document are some of the most eminent in educational circles. Among them are former Presidents of the American Educational Research Association and the National Council on Measurement in Education, two of the three professional organizations most involved with psychological measurement, of which school-related testing is a subset. One of the scholars, Robert Linn, has not only presided over both of those organizations, he has also served as chair of the National Research Council's Board on Testing and Assessment. The group also includes the immediate past president of the National Academy of Education, Lorrie Shepard, Dean of the School of Education at Colorado. A brief, relevant curriculum vitae for each of the ten authors can be found at the end of the document, and briefer descriptions appear at the beginning, where each author is listed, along with the following statement:
Authors, each of whom is responsible for this brief as a whole, are listed alphabetically.
An email address is provided for further contact.

The ten authors, alphabetically, are as follows:
Eva L. Baker
Paul E. Barton
Linda Darling-Hammond
Edward Haertel
Helen F. Ladd
Robert E. Linn
Diane Ravitch
Richard Rothstein
Richard J. Shavelson
Lorrie A. Shepard

Let me be blunt. I do not know how anyone who knows the work of these scholars and who reads this brief can accept the idea of attaching any stakes, whether firing or the awarding of merit pay, to the current state of Value-Added Assessment methodologies. The document is thorough. It reviews all the relevant studies, including one not yet in print. Those include studies by Mathematica for the US Department of Education; by RAND; by the Educational Testing Service; done for the National Center for Education Statistics of the Institute of Education Sciences of the U.S. Dept. of Education; issued by the Board on Testing and Assessment of the Division of Behavioral and Social Sciences and Education of the National Academy of Sciences; and so on. There are citations from books and from peer-reviewed journals.

I am not a scholar. I am a high school social studies teacher. During now-abandoned doctoral studies in educational policy I got interested in value-added assessment and devoured what studies there were in the educational literature. I also talked extensively with the technical person for one organization that offered a value-added methodology, who cautioned me that the approach was not stable enough to be used as the basis for decisions with any kind of meaningful stakes. That was about a decade ago. What I have read since, and what I have absorbed from this study, convince me that the situation is not significantly better now.

But you do not have to take my word for it. Let me offer a few key examples from the study. Those who follow me on Daily Kos already have seen in the study by Mathematica the high rate of error in determining superior and inferior teachers beyond the broad middle. In this diary, written on August 27, I noted that the error rate with 2 years of data was 36%, with 3 years 26%, and even with 10 years of data still 12%. 

But that is just the tip of the iceberg of the technical problems with using such an approach. 

Without recapitulating the entire brief, let me offer a couple of other key points.

1. Results for individual teachers are not stable:
One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year.
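To get a feel for what "4% to 16% of the variation" means, here is a rough sketch of my own, not anything taken from the brief. It assumes that ratings in two consecutive years behave like a pair of normally distributed scores with a fixed correlation, which is a simplification, and the function names are made up for illustration. Explaining 4% to 16% of the variation corresponds to a year-to-year correlation of only 0.2 to 0.4, and the simulation estimates how many top-20% teachers would still be in the top 20% a year later at that level of correlation.

```python
import math
import random

def correlation_from_r2(r2):
    """Year-to-year correlation implied by a share of variance explained."""
    return math.sqrt(r2)

def share_staying_in_top(r, top=0.20, n=20000, seed=0):
    """Simulate n teachers whose ratings in two years correlate at r.

    Returns the fraction of the year-one top group that is still in the
    top group in year two. Assumes bivariate normal ratings (illustrative).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # y shares a fraction r of x; the rest is independent noise.
        y = r * x + math.sqrt(1 - r * r) * rng.gauss(0, 1)
        pairs.append((x, y))
    k = int(n * top)
    cut_year1 = sorted(p[0] for p in pairs)[n - k]
    cut_year2 = sorted(p[1] for p in pairs)[n - k]
    top_year1 = [p for p in pairs if p[0] >= cut_year1]
    still_top = sum(1 for p in top_year1 if p[1] >= cut_year2)
    return still_top / len(top_year1)

for r2 in (0.04, 0.16):
    r = correlation_from_r2(r2)
    print(f"R^2={r2:.2f} -> correlation {r:.1f}, "
          f"share staying in top 20%: {share_staying_in_top(r):.2f}")
```

Under these assumptions most of the teachers labeled "top" in one year fall out of that group the next, which is at least consistent in spirit with the instability the studies report.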


2. One key question is whether one is really accounting for teacher effects and excluding other influences in the results one gets from value-added assessment. Jesse Rothstein reported something interesting, about which I quote from the Executive Summary:
A study designed to test this question used VAM methods to assign effects to teachers after controlling for other factors, but applied the model backwards to see if credible results were obtained. Surprisingly, it found that students’ fifth grade teachers were good predictors of their fourth grade test scores. Inasmuch as a student’s later fifth grade teacher cannot possibly have influenced that student’s fourth grade performance, this curious result can only mean that VAM results are based on factors other than teachers’ actual effectiveness.


3. The brief addresses the argument that the private sector evaluates professional employees using parallel quantitative measures. The authors point out that such quantitative measures are rarely the sole or even the primary factor, noting that management experts warn against using such measures for making salary or bonus decisions. They remind us that some of the distortion on Wall Street was the result of emphasizing short-term gains that could be easily measured. They also touch on medicine:
In both the United States and Great Britain, governments have attempted to rank cardiac surgeons by their patients’ survival rates, only to find that they had created incentives for surgeons to turn away the sickest patients.


4. Students are not randomly assigned to teachers. While some control for school effects is possible, scholars are reluctant to place any weight on comparisons for teachers in different schools even within the same system. And even within a school, teachers may have varying numbers of students who are learning English or have learning disabilities or are homeless or who move multiple times, each of which is a factor that can affect learning.

5. Sample sizes are often too small. Even if the class makeup stays stable during the year, and all the students show up regularly, the N=30 of a large elementary class is too small a sample to provide a result that can allow strong inferences to be drawn. Often the makeup of the class changes during the year. If you exclude students who were not there all year, or whose absences exceed some designated level, the N decreases, providing a result of even less reliability. 
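The sample-size point can be put in rough numbers with the textbook formula for the standard error of an average, which is the spread of individual scores divided by the square root of the number of students. This is my own back-of-the-envelope sketch, not a calculation from the brief. It assumes student scores are independent and measures everything in units of one student-level standard deviation (SD); the function name is made up for illustration.

```python
import math

def standard_error_of_mean(student_sd, class_size):
    """Standard error of a class average, assuming independent students."""
    return student_sd / math.sqrt(class_size)

for n in (30, 25, 20, 15):
    se = standard_error_of_mean(1.0, n)
    # A rough 95% margin is about 1.96 standard errors on either side.
    print(f"N={n}: standard error {se:.3f} SD, 95% margin +/- {1.96 * se:.3f} SD")
```

Even with a full class of 30, the class average carries a standard error of about 0.18 SD under these assumptions, and it grows as students are excluded.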

6. Some argue that statewide data banks can address the question of student mobility. But if you derive results on a year or two years of data where the student has moved, how much of the improvement can properly be assigned to any one teacher? Even in elementary school, do we account for pull-out instruction, or possible tutoring (that could in some cases be counterproductive) as a possible influence on the test results upon which we base our analysis?

7. Even with value-added analysis, to date scholars have not been able to isolate the impact of outside learning experiences, home and school supports, and differences in student characteristics and starting points when trying to measure their growth. 

8. A proper system of value-added assessment would have vertically scaled tests. Most states do not currently have such tests; neither New York nor California does, for example. Without vertical scaling, the tests in one grade are not necessarily congruent with those of the next along a continuum from year to year - we are not testing the same thing each year. As testing expert Dan Koretz of Harvard is quoted as noting,
"because of the need for vertically scaled tests, value-added systems may be even more incomplete than some status or cohort-to-cohort systems"
Here it is worth noting that cohort-to-cohort means comparing this year's fourth graders to last year's, which is how Adequate Yearly Progress under No Child Left Behind has been calculated.

9. If measuring end of year to end of year, even if there are vertically scaled tests, there is still the well-documented issue of summer learning loss, which falls disproportionately upon those of lesser economic means, which also means it falls disproportionately upon those of color, who are more heavily represented at the lower end of the economic scale. If we do not control for summer learning loss, our results are skewed. Allow me to quote a relevant portion of the study:
researchers have found that three-fourths of schools identified as being in the bottom 20% of all schools, based on the scores of students during the school year, would not be so identified if differences in learning outside of school were taken into account. Similar conclusions apply to the bottom 5% of all schools.
The authors also cite a study that shows "two-thirds of the difference between the ninth grade test scores of high and low socioeconomic status students can be traced to summer learning differences over the elementary years."

There is more, but this should give a real sense of how much there is in this paper, how thoroughly the authors examine relevant material to demonstrate that value-added assessment, the supposed magic bullet to allow us to tie student learning back to the effectiveness of teachers, cannot properly fulfill the task some wish to give to it.

The authors acknowledge that value-added approaches are superior to some of the alternative methods of using test scores to evaluate teachers. These are:

status test-score comparisons - compare average scores of students of one teacher to those of another

change measures - compare the average test results of a single teacher from one year to the next - remember, these are different students

growth measures - a comparison of the scores of the students of the teacher this year to the scores of those same students the previous year when they had different teachers.
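For readers who find the three alternatives easier to tell apart in concrete terms, here is a toy sketch of my own. The function names and scores are invented for illustration and are not from the brief, which describes these measures only in words.

```python
from statistics import mean

def status_comparison(teacher_a_scores, teacher_b_scores):
    """Difference in average scores between two teachers' current students."""
    return mean(teacher_a_scores) - mean(teacher_b_scores)

def change_measure(this_year_scores, last_year_scores):
    """Change in one teacher's class average from last year to this year.

    The two lists hold different students.
    """
    return mean(this_year_scores) - mean(last_year_scores)

def growth_measure(scores_by_student):
    """Average gain for the same students.

    scores_by_student maps a student to (last year's score, this year's score).
    """
    return mean(now - before for before, now in scores_by_student.values())

# Made-up scores, for illustration only.
print(status_comparison([70, 80, 90], [60, 70, 80]))
print(change_measure([70, 80, 90], [65, 75, 85]))
print(growth_measure({"s1": (60, 70), "s2": (75, 80)}))
```

Only the growth measure follows the same students across years. None of the three adjusts for who those students are, which is the gap value-added models try to fill.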

Each of these approaches has serious problems with it. One can read the detailed explanation on p. 9. Value-added assessments may be an improvement, but
the claim that they can “level the playing field” and provide reliable, valid, and fair comparisons of individual teachers is overstated. Even when student demographic characteristics are taken into account, the value-added measures are too unstable (i.e., vary widely) across time, across the classes that teachers teach, and across tests that are used to evaluate instruction, to be used for the high-stakes purposes of evaluating teachers.

