In my last post I discussed how a test author or developer could make a test more or less difficult simply by using items with a different distribution of item difficulties. Although I used as examples the ACT Assessment and the SAT, two tests that rely primarily on selected-response items, the concept applies to all item types that might be used in a test. In this post I continue the discussion of how a test can be made more difficult, or at least made to appear more difficult, and then change focus to what really matters.
Currently, all accountability tests used in states (i.e., summative tests) are criterion-referenced. That is, the student is not compared to other students but rather to some criterion of achievement. The criterion is typically a series of cut scores on the test that place students into one of several levels. The current version of the Elementary and Secondary Education Act, known as No Child Left Behind (NCLB), requires at least three levels of proficiency (e.g., below proficient, proficient, and advanced). Most states have more than the required three, such as below basic, basic, proficient, and advanced.
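To make the idea of cut scores concrete, here is a minimal sketch of how a series of cut scores places a student into a level. The cut score values (20, 30, 40), the score scale, and the function name `performance_level` are all hypothetical and chosen only for illustration; they are not taken from any state's test.

```python
from bisect import bisect_right

# Hypothetical cut scores: the lowest score needed to reach each level above the first.
CUT_SCORES = [20, 30, 40]
LEVELS = ["below basic", "basic", "proficient", "advanced"]


def performance_level(score):
    """Map a score to a level by counting how many cut scores it meets or exceeds."""
    return LEVELS[bisect_right(CUT_SCORES, score)]


for score in (12, 20, 35, 40):
    print(score, performance_level(score))
```

Note that the student is compared only to the cut scores, never to other students, which is what makes this a criterion-referenced classification.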
If one wishes to make a test more difficult, it is a simple matter of setting the cut scores for these levels at different places along the score distribution. By doing so, an easy test can be made to look difficult and a difficult test can be made to look easy. Below is a typical distribution of student results on an Algebra I test. If we set the passing cut score at the red line, very few students would be considered passing. If we instead used the green line as the cut score, many more students would be considered passing. Yet neither the test nor the students’ performance has changed. We will have a post on standard setting in the future, but for now, this example and the previous post show that test difficulty is not a function of item type.
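The effect of moving the cut score can be sketched in a few lines. The scores below are made up for illustration and are not real Algebra I results; the two cut scores simply stand in for the green and red lines, and the helper `pass_rate` is a hypothetical name.

```python
# Hypothetical scores for 20 students; illustrative only, not real test data.
scores = [12, 15, 18, 20, 21, 23, 24, 25, 26, 27,
          28, 29, 30, 31, 33, 34, 36, 38, 41, 45]


def pass_rate(scores, cut_score):
    """Return the fraction of scores at or above the cut score."""
    passing = sum(1 for s in scores if s >= cut_score)
    return passing / len(scores)


low_cut = 22   # stands in for the "green line"
high_cut = 37  # stands in for the "red line"

print(f"Cut score {low_cut}: {pass_rate(scores, low_cut):.0%} passing")
print(f"Cut score {high_cut}: {pass_rate(scores, high_cut):.0%} passing")
```

The same list of scores yields very different passing rates under the two cut scores, even though no item and no student response has changed.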
So the question at hand is: what should a good test measure? That question has to be answered by asking a second question: what inferences does the test user or test author wish to make based on the results of the test? This is the very beginning of test design. In response to the many folks who talk about making tests more difficult in order to improve instruction, it may be that what they are really asking is that the tests focus more on process and production than on surface understanding of the content area being measured. This again is not a function of item type but rather of the structure and content of the items written and used in a given test. We will explore surface and deep content understanding in a future post. But for now, we should be asking: what are we going to do with the results of any test? Answering that will help us formulate a better design, one that meets the needs of the test user and the test taker.