A number of educational researchers (e.g., Diane Ravitch) have begun to argue that tests should be made more difficult, that tests should measure higher-order thinking skills and challenge our children more. Although I do not disagree that our children should learn to think critically and master a broad set of content at very high levels, much of the debate seems to be focused on item types. Specifically, these pundits question the use of selected-response items (i.e., multiple-choice items). The arguments in the media about testing seem spurious at best.
Let us look closer at test difficulty. The question might be this: would most people question the rigor of the ACT, SAT, or GRE? I suspect the answer is no. Most people would say these tests are good predictors of success in the first years of college or graduate school. Furthermore, they would argue these tests are very difficult. These tests consist primarily of multiple-choice items and are considered norm-referenced tests.
Now consider what these tests actually look like. Typically they include a large number of items per administration, covering several content areas such as reading, English language arts, mathematics, and science. The items are written and field-tested on the appropriate population, then selected for both content and psychometric characteristics. For most norm-referenced tests the items are distributed so that the majority fall in the range where approximately 50 percent of examinees answer them correctly. This criterion is typically corrected for the guessing rate. Using the simplest measure of difficulty, the p-value (the proportion of examinees who answer an item correctly), the majority of items will center on a p-value of approximately .65. For most norm-referenced tests the distribution of item difficulties is approximately normal from least difficult to most difficult, with the majority in the middle of the distribution.
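To make the guessing correction concrete, here is a minimal sketch, assuming multiple-choice items on which examinees who do not know the answer guess uniformly at random; the option counts are illustrative, not drawn from any particular test. The corrected values land near the .65 figure cited above.

```python
# A minimal sketch of the guessing correction, assuming examinees who do
# not know the answer guess uniformly at random among n_options choices.
# The option counts below are illustrative assumptions.

def corrected_p(p_known: float, n_options: int) -> float:
    """Observed p-value when a proportion p_known truly know the answer
    and the remainder guess correctly with probability 1/n_options."""
    return p_known + (1.0 - p_known) / n_options

# The norm-referenced target: half of the examinees truly know the item.
for n in (4, 5):
    print(f"{n}-option item: observed p-value = {corrected_p(0.50, n):.3f}")
# 4-option item: observed p-value = 0.625
# 5-option item: observed p-value = 0.600
```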
With the press to use criterion-referenced tests under NCLB (No Child Left Behind), the construction of state-level tests changed. Several competing ideas were now at work. For example, if we are measuring what is being taught, what should the distribution of item difficulties resemble? And what criteria might we apply to the psychometric characteristics of items? That is, would it be acceptable to have items that most examinees answer correctly? For most NCLB tests a uniform distribution of item difficulties would be reasonable. This would limit both the floor and ceiling effects of the test: there would be some less difficult items so that almost all students could demonstrate some achievement (limiting the floor effect), and some very difficult items that would allow well-prepared students to demonstrate superior achievement (limiting the ceiling effect).
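The contrast between the two designs can be sketched numerically. The following is an illustration under invented assumptions; the item count, distribution parameters, and the thresholds for "very easy" and "very hard" are all mine, not taken from any operational test.

```python
# Illustrative comparison of the two item-difficulty designs described
# above. All numbers are invented assumptions, not operational values.

import numpy as np

rng = np.random.default_rng(seed=1)
n_items = 60

# Norm-referenced design: p-values approximately normal around .65.
norm_ref = np.clip(rng.normal(loc=0.65, scale=0.10, size=n_items), 0.05, 0.99)
# Criterion-referenced design: p-values roughly uniform across a wide range.
crit_ref = rng.uniform(low=0.30, high=0.95, size=n_items)

for name, p in (("norm-referenced", norm_ref), ("criterion-referenced", crit_ref)):
    easy = (p > 0.85).mean()  # items almost everyone answers (limit the floor effect)
    hard = (p < 0.45).mean()  # items only well-prepared students answer (limit the ceiling effect)
    print(f"{name}: {easy:.0%} very easy items, {hard:.0%} very hard items")
```

On a typical run the uniform design yields a noticeably larger share of both very easy and very hard items, which is exactly the floor- and ceiling-limiting property described above.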
Thus changing the distribution of item difficulties will, in fact, change the difficulty of the test. Of course, this is but one way to change the difficulty of a test. Writing better items that require students to think critically, compute, and evaluate content is the primary method; however, such items seldom have statistics indicating they are easy. Finally, all item types can measure simple facts and knowledge, and well-written, well-thought-out items of any type can go well beyond simple facts and knowledge. Simply changing item types or using technology will not improve teaching and learning. Neither will making tests more difficult. It is how the information is used that can change teaching and learning.
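As a closing toy illustration of that first claim: under classical test theory the expected raw score is simply the sum of the item p-values, so shifting the whole difficulty distribution downward makes the test measurably harder. The p-values below are invented for illustration.

```python
# Expected raw score = sum of item p-values, so shifting the difficulty
# distribution shifts test difficulty directly. P-values are invented.

easier_form = [0.60, 0.65, 0.70, 0.75, 0.80]  # difficulties centered near .70
harder_form = [0.35, 0.40, 0.45, 0.50, 0.55]  # same shape, shifted downward

print("expected raw score, easier form:", sum(easier_form))  # 3.5 of 5 items
print("expected raw score, harder form:", sum(harder_form))  # 2.25 of 5 items
```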
– Timothy R. Vansickle, Ph.D.
Chief Academic Officer and Senior Vice President of Assessment Services