Part 1 and Part 2 of this blog series dealt with some of the criticisms aimed at testing programs. There are certainly many complaints: too much testing, irrelevant tests, and tests that are not difficult enough. However, much of the discussion rests on a basic misunderstanding of a test’s intended purpose. Most of the criticism of current accountability (summative) tests reflects a desire for those tests to serve a purpose they were not designed to serve.
Accountability tests under No Child Left Behind (NCLB) were designed to provide an overall picture of student performance on many academic standards in a very short amount of time and with a limited number of items. For example, prior to the Common Core State Standards, the National Council of Teachers of Mathematics (NCTM) standards were adopted by most states as their mathematics standards. There were approximately 100–125 standards per grade, while a typical accountability test contains between 50 and 70 items, which works out to fewer than one item per standard on average. This type of test is not instructionally sensitive and does not provide diagnostic information, but it does provide a fairly reliable classification of students into performance levels.
The criticism arises because researchers and teachers want more from assessments than a classification of students into performance levels. They also want assessments to be instructionally sensitive and to provide more information about what students know and can do. However, this requires considerably more items and a mix of item types and, thus, more testing time. For example, the Partnership for Assessment of Readiness for College and Careers (PARCC) consortium has designed its test to be closer to what people would like to see, but it has been criticized for the proposed amount of testing time.
So, what is the bottom line? A test must be designed with a specific purpose that stems from the decisions or inferences the test user wishes to make with the test results. A test may support a few inferences, but it will not support all the inferences we might wish to make about a student, teacher, school, district, state, or country. The latest version of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) continues to stress that multiple measures are needed for any decision involving the use of test scores. Therefore, developing any test requires a defined purpose that takes into account the inferences or decisions to be made from the results. This is the cornerstone of a valid assessment: it measures what it is intended to measure.
To tie the series together so far: if people want accountability tests to be more difficult (i.e., more cognitively challenging and more instructionally useful), it is not a matter of item type but one of design. Careful design produces test and item specifications that go beyond alignment and examine how items evaluate declarative and procedural knowledge as well as surface and deep understanding of the content area. Design in this sense covers the test itself, the item specifications, the reporting, and the research that supports the inferences the test was designed to provide. It bears repeating: no single test can do everything. That is true of tests used in medicine, and it is true in education as well.
American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: AERA.