This article was written for the New York Analysis of Policy and Government by noted author and researcher Alex Bugaeff.
Large educational test publishers often make claims that their tests are “valid.” But, what is a valid test? Are their claims true? These questions can be asked of Common Core test publishers, as well as educational test publishers generally.
Before accepting or rejecting such claims, parents, school boards and the educational community should know the basics of test validity. That way, they can ask questions of test companies and make informed decisions about them. Here are the key principles of valid tests.
The terms “valid” and “validity” have specific meaning in the educational testing world. They mean that a test has been subjected to accepted statistical analyses and has satisfied them. These analyses “test the test” to see how well it does what it claims to do. So, the first question parents, school boards and the community should ask of test publishers is, “Have you tested your tests for the validity that you claim they have?”
There are three main levels of validity that a test can achieve, each stronger than the one before it. These levels are:
- Face Validity.
- Content Validity.
- Criterion-Related Validity.
Let’s take each in turn.
- Face Validity. Face Validity merely means that a test appears valid on its face. That is, the test has words that most people would associate with what is being tested.
For example, a test might be titled “Test of Historical Knowledge” and have a question such as, “What are the three main forms of rocks found in the American Colonial period?” The question really tests knowledge of geology and has nothing to do with history, but because the reference is to the American Colonial period, the test writer could claim that it is a valid test of history knowledge.
Most tests have Face Validity, but if that is the only level of validity they have, they are worthless for the purpose of meaningfully testing student learning. Face Validity does not require any analysis of the test’s ability to assess learning; the test is meant only to appear to test it, on its “face.”
If a test provider says that its tests are obviously valid, one could answer, “So, you claim your tests are valid on their face and that is enough?” (What such a provider is really saying, of course, is that you are not smart enough to understand.) Face Validity alone is not enough in almost any application.
- Content Validity. Content Validity means that a test contains questions about the subject being tested. To write a content-valid test, one would study the curriculum of the course being tested and would write questions reflecting the information presented in it.
In the test above, “Test of Historical Knowledge,” a question might be, “What metal was used to make Continental Army uniform buttons in the American Revolution?” The question “contains” a reference to American history, but has nothing to do with any important historical aspect. It is trivial. The test writer could claim that it is a valid test of history knowledge, but the question would not matter in the assessment of student knowledge and understanding of history.
If a test publisher says that its tests cover information about the subject and that they are, therefore, valid, one could answer, “What impact does this knowledge have on a student becoming educated?” Simple knowledge of a subject may be valuable, but given scarce time, money and school resources, how does it show a student’s progress toward a goal? Certainly, knowledge of the Constitution, law and government operations is valuable to civics education, but how valuable is knowledge of abacus operation to high school algebra, say?
- Criterion-Related Validity. Criterion-Related Validity means that a test has been analyzed to determine how well it tests what it is claimed to test and that the results predict a desired outcome (that is, the test is worth giving). This requires substantial time, expense and cooperation to demonstrate. It is much more demanding than the first two levels of validity, but is much more valuable to decision-makers.
First, a criterion must be defined. For example, for students entering their senior year in high school, one such criterion might be “Scores on SAT or ACT tests, if taken.” Another criterion might be “Acceptance into an accredited trade school or apprenticeship, if pursued.”
Then, a measure of that criterion must be established, such as the percentage of applicants who were accepted, or each student’s score on his or her final taking of the SAT/ACT.
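To make the idea of a criterion measure concrete, here is a minimal sketch in Python. The student counts and scores are invented purely for illustration; they do not come from any real study.

```python
# Hypothetical criterion measures for one graduating class (invented numbers).

# Criterion measure 1: trade-school acceptance rate among students who applied.
applicants = 40        # assumed number of students who applied
accepted = 28          # assumed number of those students who were accepted
acceptance_rate = accepted / applicants
print(f"Acceptance rate: {acceptance_rate:.0%}")   # prints "Acceptance rate: 70%"

# Criterion measure 2: each student's score on the final taking of the SAT,
# keyed by an anonymized student ID (scores are invented).
final_sat_scores = {"S001": 1310, "S002": 1180, "S003": 1420}
```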
Next, the curriculum of the course in question would be studied and the test of student performance would be designed and written. Then, the test would be tried out on a sample of eligible students, the results analyzed and the test edited to reflect the sample testing.
The test would then be administered to a larger group of students and, once those students have completed the criterion measure against which the test is to be compared (the SAT/ACT or a trade school application, say), the results would be analyzed statistically. The statistical measure includes standards that show whether the test is valid or not (see below).
As the test is administered to larger numbers of students over years, the results are incorporated into the statistical analyses and the test is edited or discontinued. Those results can be used to demonstrate the ability of the test to predict the performance of students in reaching the desired goal(s).
For a large test publisher, Criterion-Related Validity should be the standard, given the stakes and their ability to invest in the necessary research to analyze it. Representatives of these large publishers should be able to report the results of the relevant statistics to their clients/customers. The following is a summary of the statistics and what to look for.
Criterion-Related Validity Statistics “by the Numbers.” Researchers apply many statistical measures to their data as they develop their tests, but the last and most revealing is the Correlation Coefficient. Simply put, this statistic shows the extent to which the test results vary as the criterion results vary. If students who score high on the test also score high on the SAT, say, and students who score low on the test also score low on the SAT, then it can be said that the test has a high level of Criterion-Related Validity. It can predict SAT scores.
Correlation Coefficient results look complicated but, stripped of their detail, they can be interpreted directly. There are two numbers to look at: the Coefficient and the Reliability of the Coefficient.
The Coefficient is a representation of the extent to which the test scores and the criterion scores vary together. The Coefficient varies from -1.0 to +1.0. A Coefficient of +1.0 means that the higher the test score, the higher will be the SAT score (say). The lower the test score, the lower will be the SAT score. They vary in perfect relation to one another (or correlation, in statistics terms).
A Coefficient of -1.0 means that the higher the test score, the lower will be the SAT score, and so on. A Coefficient of 0.0 means the test and the SAT bear no relation to each other. A high test score can mean a high or low SAT. To have a test that is valuable in predicting the criterion score, you want a Coefficient as close to +1.0 as possible.
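As a rough illustration of what “varying together” means in practice, the sketch below computes a Correlation Coefficient between a set of hypothetical test scores and the same students’ SAT scores. The numbers are invented, and NumPy’s corrcoef function is simply one common way to compute a Pearson correlation; this is not a description of any publisher’s actual procedure.

```python
import numpy as np

# Invented paired scores for eight students: publisher's test vs. SAT (illustrative only).
test_scores = np.array([62, 55, 78, 90, 48, 70, 85, 66])
sat_scores  = np.array([1150, 1080, 1290, 1410, 1010, 1230, 1370, 1190])

# Pearson correlation coefficient: +1.0 = perfect positive relation,
# -1.0 = perfect negative relation, 0.0 = no linear relation between test and criterion.
coefficient = np.corrcoef(test_scores, sat_scores)[0, 1]
print(f"Correlation coefficient: {coefficient:+.2f}")
```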
The second number in the Correlation Coefficient report is the test’s Reliability. This is the extent to which one can expect the same result each time the test is administered, rather than a result produced by random circumstances. The Reliability score varies from 0.0 to +1.0. A +1.0 means that you will get the exact same Coefficient result every time you administer the test in like circumstances. A 0.0 means that there is no way to predict that the Coefficient will be the same in identical test administrations. That is, there is no way to tell whether a test result is the product of chance or of a genuine relation to the Criterion.
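One common way to estimate this kind of consistency is a test-retest check: give the same test to the same students twice under like circumstances and correlate the two sets of scores. The sketch below assumes that approach and uses invented scores; publishers may rely on other reliability methods (split-half or internal-consistency estimates, for example), so treat it only as an illustration of the idea.

```python
import numpy as np

# Invented scores for the same six students on two administrations of the same test.
first_administration  = np.array([72, 65, 88, 54, 91, 77])
second_administration = np.array([70, 68, 85, 56, 90, 79])

# Test-retest reliability estimate: how consistently the test reproduces its own results.
# Values near +1.0 mean nearly identical results each time; values near 0.0 mean the
# scores cannot be trusted to repeat.
reliability = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability: {reliability:.2f}")
```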
The Correlation Coefficient, then, is reported as two numbers: the Coefficient and its Reliability. For example, a test might be said to have Criterion-Related Validity if the Coefficient were +0.9, say, with a Reliability of 0.95. That is, the test scores track the Criterion scores very closely, and that relationship can be expected to hold up across repeated administrations. The lower the Coefficient, the less able the test is to predict the desired criterion (an SAT score, say). In educational testing, the standard Reliability number would be 0.95 or higher. That is, as the Reliability sinks toward 0.90 and below, the ability of the test to produce consistent results over time comes into question.
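To tie the two numbers together, here is a small, hypothetical decision helper that applies the rules of thumb described above (a Coefficient near +1.0 and a Reliability of at least 0.95). The cutoffs and the function itself are assumptions made for illustration, not an industry standard.

```python
def judge_validity(coefficient: float, reliability: float,
                   min_coefficient: float = 0.9, min_reliability: float = 0.95) -> str:
    """Apply this article's rule-of-thumb cutoffs to a reported Coefficient and Reliability."""
    if reliability < min_reliability:
        return "Questionable: results may not be consistent across administrations."
    if coefficient < min_coefficient:
        return "Weak: the test does not track the criterion closely enough to be predictive."
    return "Defensible: strong, repeatable relationship between test and criterion."

# Example using the figures cited in the text above.
print(judge_validity(coefficient=0.9, reliability=0.95))
```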
Parents, boards and the educational community can demand that test producers back up their claims that their tests are valid. So far, it appears that such claims have not been questioned and these publishers have been able to sell their tests without accounting for them. These publishers can first be asked with authority, “What, specifically, do your tests test?” and, second, “What validity do they have in terms of educational testing?” Follow-up questions, based on knowledge of test validity and its statistics, can then reveal the extent to which a test is defensible. Let’s hold these test publishers’ feet to the fire.