Many people have never heard the term psychometrics, even though the passage of No Child Left Behind in 2001 brought it into the public consciousness. In spite of that, a New York Times article from 2006 stated that psychometrics is “…one of the most obscure, esoteric and cerebral of professions…”1

While this is likely true, I think the definition of psychometrics and a bit more about its purpose are worth explaining.

Psychometrics is a specialized branch of statistics that deals with the design, administration, and interpretation of assessments that measure psychological and educational constructs, which are attributes or abilities, such as intelligence, that cannot be measured directly. In other words, psychometrics is the process by which a student’s cognitive ability is measured and a relative score is assigned. An example of such an assessment is an English language learner assessment that measures the construct of English language proficiency. Psychometrics ensures that the results of a standardized assessment are fair, reliable, valid, and usable. Without psychometric underpinnings, assessment results are subjective, vary widely, and lack meaning.

Psychometrics can differ depending on the assessment program, client, resources, and other factors that affect how a test is developed, analyzed, and reported. It is a constantly evolving field that requires collaboration among psychometricians and an understanding of the needs of assessment programs. To draw conclusions about student performance and ability from assessment results, we use models such as item response theory (IRT) to analyze item and test data. Many procedures and methods can be used, and no single best practice exists. As psychometricians produce more research, new methods and procedures, such as multidimensional item response theory, continue to emerge.
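To give a feel for what an IRT model looks like, here is a minimal sketch of one common form, the two-parameter logistic (2PL) model. The function name prob_correct and the parameter values are my own illustrations, not taken from any particular assessment program; theta stands for student ability, a for item discrimination, and b for item difficulty.

```python
import math

def prob_correct(theta, a=1.0, b=0.0):
    """2PL item response function: probability of a correct answer.

    theta: student ability, a: item discrimination, b: item difficulty.
    All values used below are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the model gives a 50% chance of success.
p_equal = prob_correct(theta=0.0, a=1.0, b=0.0)

# A more able student has a higher chance on the same item.
p_higher = prob_correct(theta=1.0, a=1.0, b=0.0)

print(p_equal, p_higher)
```

Operational IRT programs estimate a and b from response data rather than assuming them, which is where calibration comes in.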

Using this variety of methods, psychometricians run statistical analyses to calculate features of the test such as item difficulty (for example, the p-value, which in this context is the proportion of students who answer an item correctly), item discrimination (for example, the point-biserial correlation), and the reliability of an assessment. Psychometricians also estimate the probability of something occurring or being true based on assessment data, such as a student being placed in a certain performance level based on his or her total test score. All of this information translates into numbers reported as test scores to students, teachers, and parents.
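As a rough illustration of these classical statistics, the sketch below computes item difficulty, a point-biserial correlation, and one common reliability estimate (Cronbach's alpha) for a tiny made-up response matrix. The data and function names are hypothetical, the point-biserial here correlates each item with the total score including that item, and alpha is only one of several reliability coefficients in use.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical scored responses: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

def column(responses, j):
    """Scores of all students on item j."""
    return [row[j] for row in responses]

def item_difficulty(responses, j):
    """Classical p-value: proportion of students answering item j correctly."""
    return mean(column(responses, j))

def pearson(x, y):
    """Pearson correlation using population moments."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def point_biserial(responses, j):
    """Correlation between item j (0/1) and the total test score."""
    totals = [sum(row) for row in responses]
    return pearson(column(responses, j), totals)

def cronbach_alpha(responses):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(responses[0])
    item_vars = sum(pvariance(column(responses, j)) for j in range(k))
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_vars / total_var)

for j in range(len(responses[0])):
    print(j, item_difficulty(responses, j), point_biserial(responses, j))
print("alpha:", cronbach_alpha(responses))
```

With only six students and four items these numbers mean very little; the point is simply to show what is being calculated.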

Statisticians of all types, including psychometricians, typically talk about two types of significance in their research: statistical and practical. In an ideal world we would have both statistical significance and practical significance when performing research studies, although that is not always possible. A professor once told me, “Statistical significance will get you published and tenure, but practical significance will make you famous.” The first part of that statement might be true but the latter is not…at least not in my experience. However, it does point to the importance of knowing when a set of metrics has real meaning, and when it does not.

Let me use a story to illustrate what I mean. During one of those monthly brown-bag seminars common in workplaces these days, the presenters described how they had improved the IRT calibration on a dataset by making a few interesting changes to the Newton and EM cycles of their IRT program. They claimed it made the calculation more reliable, which it did. However, the differences in the numbers were at the sixth decimal place: for example, -3.123421 versus -3.123423. While this made student test scores more reliable, the increased reliability did not change any student’s score. If a student scored 24 before this recalibration, that student would still score 24 after the recalibration was applied. In other words, the differences at the sixth decimal place in the IRT calculations were statistically significant but did not have any practical impact on the students’ reported scores.

What accounted for this statistically significant event being such a non-event in practical terms? It had to do with the large number of students who took the test for which the data were calculated. With large n-counts, very small differences can produce statistically significant results that may have little or no practical value. Of course, the converse is also true. An event may not be statistically significant but still have enormous practical significance. Such is the case in medical research, where a small change might mean the difference between life and death.
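The effect of n-counts is easy to see with a small sketch. The numbers below are invented for illustration and are not from the seminar I described: the same tiny difference in means (0.1 points on a scale with a standard deviation of 10) is tested with 100 students per group and then with one million per group. I use a simple two-sided z-test and report Cohen's d as a rough stand-in for practical significance; both choices are my own simplifying assumptions.

```python
import math

def two_sample_z(mean_diff, sd, n_per_group):
    """Two-sided z-test for a difference in means.

    Assumes equal, known standard deviations and equal group sizes.
    Returns (z, p_value, cohens_d).
    """
    se = sd * math.sqrt(2.0 / n_per_group)          # standard error of the difference
    z = mean_diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided p-value under normality
    cohens_d = mean_diff / sd                       # effect size, independent of n
    return z, p_value, cohens_d

# Hypothetical scenario: identical tiny difference, very different sample sizes.
small = two_sample_z(mean_diff=0.1, sd=10.0, n_per_group=100)
large = two_sample_z(mean_diff=0.1, sd=10.0, n_per_group=1_000_000)

print("n = 100 per group:      ", small)
print("n = 1,000,000 per group:", large)
```

The effect size is identical in both runs; only the p-value changes, and it changes only because the sample grew.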

The bottom line is that numbers are very important to psychometrics. The goal is to achieve both statistical and practical significance; make test results as reliable as possible; have valid evidence that supports the use of test scores; and ultimately allow people to make better decisions about teaching and learning than they would without those results.