Validity

Bruce Frey

_____
A good test measures what it is intended to measure. A survey which is supposed to find out how often high school students wear seatbelts should, obviously, contain questions about seatbelt use. A survey without these items could reasonably be criticized as not having validity. Validity is the extent to which something measures whatever it is expected to measure.
____
Surveys, tests and experiments all require validity to be acceptable. Validity is not a yes-no question, however. It is not something which a test score either has, or does not have. Validity is an argument that is made by the test designer, those relying on the test's results, or anyone else who has a stake in the acceptance of the test and its results. Consider a "spelling" test which consists of math problems. Clearly, a test with math problems is not a valid spelling test. While it is not a valid spelling test, though, it might well be a valid math test. The validity of a test or survey is not in the instrument itself, but in the interpretation of the results. A test may be valid for one purpose, but not another. Interpreting a child's score on a spelling test as an indication of her math ability would not be appropriate. The score may be valid as a measure of verbal ability, but not as a measure of numerical fluidity. The score itself is neither valid nor invalid; it is the meaning attached to the score which is arguably valid or not valid.
_____
Because validity is an argument, it can never be completely established. There are, however, accepted ways in which evidence for the validity of a test can be provided. The most commonly accepted type of validity evidence is also, interestingly, theoretically the weakest argument one can make for validity. This argument is one of face validity. The face validity argument runs as follows: This test is valid because it looks (on its face) like it measures what it is supposed to measure. Those presenting or accepting an argument for face validity believe that the test in question has the sort of items that one would expect to find on such a test. The seatbelt use survey example above is accepted as valid if it has items asking about seatbelt use. It is a weak argument because it relies on human judgment alone, but it can be a compelling argument. Common sense is a strong argument, perhaps the strongest, for convincing someone to accept any aspect of an assessment. Though face validity seems less scientific than other types of validity evidence, and in a real sense it is less scientific, few test instruments would be acceptable to those who make and use them if face validity evidence were lacking. If you, as a test developer or user, cannot supply the types of validity evidence which we discuss below, then you would be expected to provide a test with, at least, face validity.
_____
Three other types of validity evidence are generally accepted by those who ask for assessments. They are all part of the range of arguments which can be made for validity. The three types of evidence are content, criterion and construct.
_____
Content-Related Evidence - If you decide to measure a concept, then there are many aspects of that concept, and many different questions which can be asked on a test. Some demonstration that the items you choose for your test are representative of all possible items which you could have chosen would be an argument for validity using content evidence. This sounds like a daunting requirement. Traditionally, this sort of evidence has been considered more important for tests of achievement. In areas of achievement (medicine, law, English, mathematics) there are fairly well-defined domains and content areas from which a valid test should sample. A classroom teacher also, presumably, has defined a set of objectives or content areas that a test should measure. Such concisely defined aspects of a subject are rarely available, however, when testing a range of behaviors, knowledge or attitudes. Consequently, making a reasonable argument that you have selected questions which are representative of some imaginary pool of all possible questions is difficult. So, what would be necessary for content evidence of validity in test construction? It seems that, at a minimum, some organized method of question selection or construction would be necessary. When measuring self-esteem, for example, questions might cover how people feel about themselves in different environments (work, home, school), or while performing different tasks (sports, academics, job duties), or how they feel about different aspects of themselves (their appearance, their intelligence, their social skills). For a classroom teacher measuring how much students have learned during the last week, a table of specifications, or an organized list of topics covered and weights indicating their importance, works as a good method. The choice of how to organize a concept, or break it down into components, belongs to the test developer. The developer may have been inspired by research or other tests or may just be following a common sense scheme. The key is to convince yourself, so that you might convince others, that you are covering the vital aspects of whatever area you are measuring.
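As an illustration of the table of specifications idea, the following is a minimal sketch in Python that turns a list of topics and importance weights into a number of items per topic. The topic names, the weights, the test length of 20 items and the function name allocate_items are all hypothetical, chosen only for the example; rounding by largest remainder is one reasonable choice among several, not a prescribed method.

```python
from math import floor


def allocate_items(weights, total_items):
    """Split total_items across topics in proportion to their weights.

    weights: dict mapping topic name -> positive importance weight.
    Returns a dict mapping topic name -> whole number of items.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    # Exact (possibly fractional) share of items for each topic.
    exact = {t: total_items * w / total_weight for t, w in weights.items()}

    # Start from the whole-number part of each share.
    counts = {t: floor(x) for t, x in exact.items()}

    # Give the leftover items to the topics with the largest fractional parts.
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


# Hypothetical table of specifications for one week of instruction.
table_of_specifications = {
    "vocabulary": 3,
    "spelling rules": 2,
    "sentence structure": 1,
}

print(allocate_items(table_of_specifications, 20))
```

A sketch like this only documents that items were chosen by an organized scheme; whether the topics and weights themselves are the right ones remains a matter of judgment for the test developer.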
____
Criterion evidence of validity demonstrates that responses on a test predict performance in some other situation. "Performance" can mean success in a job, a test score, ratings by others, and so on. If responses on the test are related to performance on criteria which can be measured immediately, the validity evidence is referred to as concurrent validity. If responses on the test are related to performance on criteria which cannot be measured until some future time (eventual college graduation, treatment success, eventual drug abuse), the validity evidence is called predictive validity. It may go without saying that the measures you choose to support criterion validity should be relevant; the criteria should be measures of concepts which are theoretically related to what the test measures. This form of validity evidence is most persuasive and important when the express purpose of a test is to estimate or predict performance on some other measure. It is less persuasive, and perhaps irrelevant, for tests which do not make this claim, like a weekly spelling test.
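The relationship between test responses and a criterion is commonly summarized with a correlation coefficient. The following is a minimal sketch, assuming that each person has one test score and one criterion score and that a Pearson correlation is an appropriate summary. The function name pearson_r and the two lists of numbers are hypothetical, invented for illustration; they are not real results.

```python
from math import sqrt


def pearson_r(test_scores, criterion_scores):
    """Return the Pearson correlation between two equal-length lists."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")

    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n

    # Sum of cross-products and sums of squares around the means.
    cross = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(test_scores, criterion_scores))
    ss_x = sum((x - mean_x) ** 2 for x in test_scores)
    ss_y = sum((y - mean_y) ** 2 for y in criterion_scores)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list is constant")

    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: an admissions test and a criterion measured later
# (first-year grade point average), one pair of values per student.
admission_scores = [52, 61, 70, 75, 88]
first_year_gpa = [2.1, 2.8, 2.6, 3.2, 3.7]

r = pearson_r(admission_scores, first_year_gpa)
print(f"criterion correlation: {r:.2f}")
```

If the criterion is collected at the same time as the test, the same calculation would support a concurrent argument; if it is collected later, as in this made-up example, it would support a predictive one. How large a correlation is large enough remains part of the validity argument, not something the calculation settles.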
_____

The third category of validity evidence is construct evidence. A construct (pronounced with an emphasis on the first syllable) is the theoretical concept or trait that a test is designed to measure. We know that we can never measure a construct like intelligence or self-esteem directly. The methods of psychological measurement are indirect. We ask a series of questions which we hope will require respondents to use the part of their mind which we are measuring, or reference the portion of their memory which contains information on past behaviors or knowledge, or, at the very least, direct them to examine their attitudes and feelings on a particular topic. We further hope that they respond to test items accurately and honestly. Test results are often, in practice, treated as a direct measure of a construct, but we shouldn't forget that they are educated guesses only. The success of this whole process depends on another assumption. We hope that we have correctly defined the construct we are trying to measure and that our test mirrors that definition.

If you decide to measure teachers' perceptions of aggressive behavior in the classroom, your definition of aggression should be reflected in the items which appear on your test. For example, aggression is often divided into two subcategories: instrumental aggression, which means using aggression for some clear advantage, like forcing your way in front of someone in line for the drinking fountain, and hostile aggression, which means being aggressive for its own sake, like tripping a stranger. You might be interested in only hostile aggression, but your items may not ask teachers to consider only that type of aggression. In this case, an argument could be made that your survey lacks construct validity because teachers' responses do not reflect how they feel about hostile aggression, but, instead, reflect how they feel about aggression in general. This argument focuses on how well a survey samples a construct as it is defined. Another construct validity argument concerns the quality of the definition of the construct itself. If your survey did reflect two distinct types of aggression with, for example, two subscales, one for hostile aggression and one for instrumental aggression, one could still argue that even though you did a good job of reflecting your definition of the construct, the definition itself is invalid. One could argue that it is not useful to break aggression down into two subcategories; aggressive behavior all serves the same function for a student, so a general measure would be more appropriate.

Construct evidence, then, often includes both a defense of the construct itself as defined and a claim that the instrument used reflects that definition. Evidence presented for construct validity can include a demonstration that responses behave as theory would expect them to behave. Construct validity evidence is always accumulating, whenever a survey or test is used, and, like all validity arguments, can never be fully convincing. In a sense, construct validity arguments include both content and criterion validity arguments because all validity evidence seeks to establish a link between a concept and the activity which claims to measure it.