Neither higher, deeper, nor more rigorous: One-type-fits-all national tests, Part 2

May 23, 2016 by Richard P. Phelps

Let's assume for the moment that the Common Core consortia tests, PARCC and SBAC, can validly measure all that is claimed for them: mastery of the high school curriculum and success in further education and in the workplace. The fact is, no evidence has yet been produced that verifies any of these things. And, remember, the proof of, and the claims about, a new test's virtues are supposed to be provided before the test is put to operational use.

Neither higher, deeper, nor more rigorous: One-type-fits-all national tests, Part 1

Sure, Common Core proponents claim to have just recently validated their consortia tests for correlation with college outcomes[i], for alignment with elementary and secondary school content standards, and for technical quality[ii]. The clumsy studies they cite do not match the claims made for them, however.

SBAC and PARCC cannot be validated for their purpose of predicting college and career readiness until data are collected in the years to come on the college and career outcomes of those who have taken the tests in high school. The study cited by Common Core proponents uses the words “predictive validity” in its title. Only in the fine print does one discover that, at best, the study measured “concurrent” validity—high school tests were administered to current rising college sophomores and compared to their freshman-year college grades. Calling that “predictive validity” is, frankly, dishonest.

It might seem less of a stretch to validate SBAC and PARCC as high school exit exam replacements. After all, they are supposedly aligned to the Common Core Standards, so in any jurisdiction where the Common Core Standards prevail, they would be retrospectively aligned to the high school curriculum. Two issues tarnish this rosy picture. First, the Common Core Standards are narrow in subject coverage, spanning just mathematics and English Language Arts, with no attention paid to the majority of the high school curriculum.

Second, common adherence to the Common Core Standards across the states has deteriorated to the point of dissolution. As the Common Core consortia's grip on compliance (i.e., alignment) continues to loosen, states, districts within states, and schools within districts are teaching how they want and what they want. The less aligned the Common Core Standards become, the less valid the consortia tests become as measures of past learning.

As for technical quality, the Fordham Institute, which is paid handsomely by the Bill & Melinda Gates Foundation to promote Common Core and its consortia tests,[iii] published a report that purports to be an "independent" comparative standards alignment study. Among its several fatal flaws: instead of evaluating the tests against the industry-standard Standards for Educational and Psychological Testing, or any of dozens of other freely available and well-vetted test evaluation standards, guidelines, or protocols used around the world by testing experts, the authors employed "a brand new methodology" developed specifically for Common Core and its copyright owners, and paid for by Common Core's funders.

Though the Common Core consortia test sales pitches may be the most disingenuous, SAT and ACT spokespersons haven't been completely forthright either. To those concerned about the inevitable degradation of predictive validity if their tests are truly aligned to the K-12 Common Core Standards, public relations staffs assure us that predictive validity is a foremost consideration. To those concerned about the inevitable loss of alignment to the Common Core Standards if predictive validity is optimized, the same staffs assure us of complete alignment.

So, all four of the test organizations have been muddling the issue. It is difficult to know what we are going to get with any of the four tests. They are all straddling or avoiding questions about the trade-offs. Indeed, we may end up with four, roughly equivalent, muddling tests, none of which serve any of their intended purposes well.

This is not progress. We should want separate tests, each optimized for a different purpose, be it measuring high school subject mastery, or predicting success in 4-year college, in 2-year college, or in a skilled trade. Instead, we may be getting several one-size-fits-all, watered-down tests that claim to do all but, as a consequence, do nothing well. Instead of a skilled tradesperson’s complete tool set, we may be getting four Swiss army knives with roughly the same features. Instead of exploiting psychometricians’ advanced knowledge and skills to optimize three or more very different types of measurements, we seem to be reducing all of our nationally normed end-of-high-school tests to a common, generic muddle.

[i] PARCC officials claim to have conducted a predictive validity study, through a contract with Mathematica Policy Research (see Nichols-Barrer, Place, Dillon, & Gill) but, in fact, it was not a predictive but, rather, a concurrent study. They administered the test to a convenience sample of current college students and correlated their test scores with their college grades. That is not at all the same thing as a predictive validity study, which would correlate high school students' test scores with their own performance in college a few years later.

In sum, the report’s shortcomings:

  • First, the report attempts to calculate only general predictive validity. The type of predictive validity that matters most is “incremental predictive validity”—the amount of predictive power left over when other predictive factors are controlled. If a readiness test is highly correlated with high school grades or class rank, it provides the college admission counselor no additional information beyond what the other measures already provide. It adds no value. The real value of the SAT or ACT is in the information it provides college admission counselors above and beyond what they already know from other measures available to them.
  • Second, the study administered grade 10 MCAS and grade 11 PARCC tests to college students at the end of their freshman years in college, and compared those scores to their first-year grades in college. Thus, the study measures what students learned in one year of college and in their last two years of high school more than it measures what they knew as of grade 10. The study does not actually compute predictive validity; it computes "concurrent" validity.
  • Third, the student test-takers were not representative of Massachusetts's tenth graders. All were volunteers, and we do not know how they learned about the study or why they chose to participate. Students not going to college, not going to college in Massachusetts, or not going to these particular Massachusetts colleges could not have participated. The top colleges, where the SAT would have been most predictive, were not included in the study (e.g., U. Mass-Amherst, any private college, or elite colleges outside the state). For students not going to college, or those attending occupational-certificate training programs or apprenticeships, one would expect the MCAS to be most predictive.
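The "incremental predictive validity" point in the first bullet can be made concrete with a small simulation. The sketch below is illustrative only: it uses entirely synthetic data (none of the numbers come from the Mathematica study or any real test), constructing a hypothetical admissions test score that is highly correlated with high school GPA, and showing that adding such a test to a regression predicting college GPA raises the explained variance only slightly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data (illustrative only): a single latent "ability" factor
# drives both high school GPA and the test score, so the two predictors
# are highly correlated; college GPA depends on ability plus noise.
ability = rng.normal(size=n)
hs_gpa = ability + 0.4 * rng.normal(size=n)
test = ability + 0.4 * rng.normal(size=n)       # largely redundant with hs_gpa
college_gpa = ability + 1.0 * rng.normal(size=n)

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (with intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_gpa_only = r_squared(hs_gpa.reshape(-1, 1), college_gpa)
r2_both = r_squared(np.column_stack([hs_gpa, test]), college_gpa)

# Incremental predictive validity: variance in college GPA explained by
# the test beyond what high school GPA already explains.
print(f"R^2, GPA alone:  {r2_gpa_only:.3f}")
print(f"R^2, GPA + test: {r2_both:.3f}")
print(f"Incremental R^2: {r2_both - r2_gpa_only:.3f}")
```

The closer the test tracks the predictors an admissions counselor already has, the smaller the incremental R-squared, which is the precise sense in which a redundant test "adds no value."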

For more, see McQuillan, Phelps, & Stotsky.

[ii] For more, see Phelps, R.P. (2016, February). Fordham Institute’s pretend research. Policy Brief. Boston: Pioneer Institute.

For an extended critique of the CCSSO Criteria employed in the Fordham report, see “Appendix A. Critique of Criteria for Evaluating Common Core-Aligned Assessments” in McQuillan, Phelps, & Stotsky, pp. 62-68.

