Personality Assessment
© 2011
This eText is the property of Toru Sato. All rights reserved © 2011. This eText is not to be copied, distributed, or downloaded without permission of the author. Any violation of copyright found in this eText is unintentional. Please notify the author if copyrighted material is found and not appropriately referenced.
To determine the quality of a personality test, there are many things we need to consider. The two main criteria we need to pay special attention to are reliability and validity. Below is a brief explanation of commonly examined forms of reliability and validity.
Reliability
When we assess people's personality using personality tests or behavioral observations, we need to know that these assessments are reliable. Reliability means that the test, as well as the measures within it, yields repeatable and consistent results.
There are many things to consider when we examine the reliability of a personality test. One is called test-retest reliability (or temporal stability). This can be examined by administering the test to the same group of people at least two times. If the first scores of the participants correlate highly with the second ones, we can be more confident that the personality test has good test-retest reliability (Gregory, 2000). Personality tests usually measure personality traits and personality traits are assumed to be relatively stable over time. If you were very extraverted two weeks ago, we assume that you are very extraverted today. Therefore, a personality test with high test-retest reliability helps us become more confident that the test is actually measuring a personality trait rather than something temporary such as mood. Like positive correlations, test-retest reliability of a personality test can range from 0 to 1. The closer it is to 1, the better the test-retest reliability of the measure.
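As a minimal sketch of the calculation described above, the following correlates two administrations of the same test. The scores are made up for illustration, and the helper name `pearson_r` is an assumption of this example rather than anything from the source.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical extraversion scores for six people, tested two weeks apart.
time_1 = [12, 18, 25, 30, 22, 15]
time_2 = [14, 17, 27, 29, 20, 16]

# A value close to 1 suggests the scores are stable over time.
test_retest = pearson_r(time_1, time_2)
print(f"Test-retest reliability: {test_retest:.2f}")
```

With real data, the two lists would hold each participant's first and second scores in the same order.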
Another form of reliability that is important to consider is internal consistency. This can be examined by administering the test to a large group of people. We examine the responses that the participants have made to each question measuring one personality trait and see how they correlate with one another. The assumption is that if numerous items are used to measure one personality trait, the responses to those items should be correlated. A common way to calculate internal consistency is Cronbach's alpha. It is calculated by examining the correlations among the responses to all of the items measuring the same personality trait (Gregory, 2000). Like positive correlations, the Cronbach's alpha of a personality test typically ranges from 0 to 1. The closer it is to 1, the better the internal consistency of the measure. If the internal consistency of a personality test is very low, we would most likely need to modify the test by taking out items (or questions) that do not correlate with the other items that are intended to measure that personality trait. In order to do this, we need to look at item-total correlations (sometimes loosely called inter-item correlations). An item-total correlation tells us how each item is correlated with the total score of the remaining items in the measure. If we find an item with a low item-total correlation, we may take it out and, if necessary, replace it with another. If we add new items to the measure, we need to collect data with this new measure and examine the test's internal consistency again. This process is repeated until the internal consistency of the personality test reaches an acceptable level.
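The item-screening process described above can be sketched as follows. This example assumes the standard variance-based formula for Cronbach's alpha, alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The item responses are invented, and the function names are illustrative only.

```python
from math import sqrt
from statistics import variance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mean_x) ** 2 for a in x)
                      * sum((b - mean_y) ** 2 for b in y))

def cronbach_alpha(items):
    """items: one list of scores per item, same respondents in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

def item_total_correlations(items):
    """Correlate each item with the summed scores of the remaining items."""
    results = []
    for i, item in enumerate(items):
        rest = [sum(scores) - scores[i] for scores in zip(*items)]
        results.append(pearson_r(item, rest))
    return results

# Hypothetical responses of six people to four items meant to measure one trait.
items = [
    [1, 2, 3, 4, 5, 4],
    [2, 2, 3, 5, 5, 4],
    [1, 3, 3, 4, 4, 5],
    [3, 5, 1, 4, 2, 3],  # an item that does not track the others
]

alpha_all = cronbach_alpha(items)
item_total = item_total_correlations(items)
weakest = item_total.index(min(item_total))

# Drop the weakest item and recompute alpha on the remaining items.
trimmed = [item for i, item in enumerate(items) if i != weakest]
alpha_trimmed = cronbach_alpha(trimmed)

print(f"alpha with all items: {alpha_all:.2f}")
print(f"weakest item index: {weakest}")
print(f"alpha after removing it: {alpha_trimmed:.2f}")
```

In practice the trimmed or revised measure would be given to a new sample before its internal consistency is reported, as the paragraph above notes.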
Inter-rater reliability is a type of reliability we examine when we use open-ended questions or behavioral observations for personality assessment. With these types of assessments, we need to code the responses to the questions or the observed behaviors. This coding can be somewhat subjective, so we need to be confident that the ratings are reliable. In order to do this, we have two or more independent observers rate the responses in the questionnaire or the participants' behaviors. For example, if we are assessing aggressiveness in children using behavioral observations, we might have two or more raters observe each child's behavior on a video recording and rate the child's level of aggressiveness independently of each other. If the ratings of these raters are similar, we can be more confident that our assessment is a relatively objective indication of the child's personality. This is how inter-rater reliability is typically examined. This same procedure can be used for examining inter-rater reliability in responses to open-ended questions. For example, the responses to an open-ended question asking, "Suppose your mother reprimanded you for something you did not do, how do you think you would respond to her?" may be examined by having two or more independent raters rate each participant's response based on certain criteria such as aggressiveness on a scale of 1 to 10. If the ratings of the independent raters are highly positively correlated, we are more confident that we are assessing the responses relatively objectively. When this happens, we say that the measure has some inter-rater reliability (Gregory, 2000).
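One simple way to carry out the check described above is to correlate every pair of raters and look at the average. The sketch below assumes three raters and invented ratings on the 1 to 10 aggressiveness scale. The rater labels and the averaging step are assumptions of this example, not a procedure prescribed by the source.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mean_x) ** 2 for a in x)
                      * sum((b - mean_y) ** 2 for b in y))

# Hypothetical aggressiveness ratings (1-10) that three raters gave
# independently to the same five children, listed in the same order.
ratings = {
    "rater_a": [2, 5, 7, 3, 9],
    "rater_b": [3, 5, 8, 2, 9],
    "rater_c": [2, 6, 7, 4, 8],
}

# Correlate every pair of raters.
pairwise = {
    (first, second): pearson_r(ratings[first], ratings[second])
    for first, second in combinations(ratings, 2)
}
mean_agreement = sum(pairwise.values()) / len(pairwise)

for pair, r in pairwise.items():
    print(pair, f"{r:.2f}")
print(f"mean pairwise correlation: {mean_agreement:.2f}")
```

A correlation reflects whether raters rank the children similarly. It does not show whether they use the same absolute numbers, so a rater who is consistently two points harsher than another would still correlate highly with them.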
Validity
When we assess people's personality using personality tests, we also need to know that these assessments are valid. Validity is the extent to which a test measures what we are intending it to measure. There are many ways we test for the validity of a personality measure.
One of them is known as criterion validity. With criterion validity, we examine whether the scores of the personality test relate to other variables (such as biological or behavioral data) that are assumed to be associated with the personality trait that the test is supposed to be measuring (Gregory, 2000). For example, if we are creating a personality test to measure aggression, we might correlate the scores with testosterone levels. We might also see if the aggression scores correlate with the number of times a child hits another child in the playground. If the scores correlate highly with these variables, we are more confident that this personality test is actually measuring aggression levels.
Another type of validity is called convergent validity. Convergent validity is examined by seeing whether the results of our personality test correlate positively with scores on personality tests measuring similar personality traits (Gregory, 2000). For example, if we are constructing a personality test to measure extraversion, we might ask research participants to complete our extraversion test as well as a personality test measuring assertiveness. Even though extraversion and assertiveness are not the same personality trait, people who are highly assertive are more likely to be extraverted than introverted. If the extraversion scores on our test are positively correlated with assertiveness scores, we can say that the extraversion test has some convergent validity.
Discriminant validity is another important type of validity and this is, in many ways, the opposite of convergent validity. Sometimes we need to make sure that the personality test we are using is not measuring something else by mistake. For example, if we are measuring extraversion among schoolchildren, we may want to make sure that the test is not measuring verbal skill level by mistake, since children with higher levels of verbal skill tend to be more talkative and therefore may look more extraverted in school. To test for this, we might give a large number of children the extraversion test and a test of verbal skill. If the scores of the two tests do not correlate very highly, we are more confident that the extraversion test is not accidentally measuring verbal skill. This is an example of testing for discriminant validity.
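The extraversion examples for convergent and discriminant validity can be placed side by side in a short sketch. All scores below are invented, and the variable names are assumptions of this example. The aim is only to show the pattern we would hope to see: a high correlation with a related trait and a correlation near zero with an unrelated one.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mean_x) ** 2 for a in x)
                      * sum((b - mean_y) ** 2 for b in y))

# Hypothetical scores for the same six children on three tests.
extraversion = [10, 14, 18, 22, 26, 30]
assertiveness = [11, 13, 20, 19, 27, 28]   # a similar trait
verbal_skill = [50, 42, 55, 44, 52, 47]    # a trait we do not want to measure

convergent_r = pearson_r(extraversion, assertiveness)
discriminant_r = pearson_r(extraversion, verbal_skill)

print(f"extraversion vs. assertiveness: {convergent_r:.2f}")
print(f"extraversion vs. verbal skill:  {discriminant_r:.2f}")
```

How high or low these correlations need to be is a judgment call that depends on the traits involved. The sketch does not propose a threshold.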
Other Factors to Consider
Social Desirability
Although reliability and validity are important factors to consider, there are numerous other factors we need to consider when we assess people's personality. It is human nature to want to look good or socially desirable to others. It is also human nature to try to look good to ourselves. We all have the desire to think that we are good, reasonable, decent people. Sometimes this gets in the way of answering surveys accurately and honestly. We might lie to look good to others or simply be in denial of some of the negative characteristics we may have. To minimize this as much as we can, most researchers try to collect data from individuals anonymously. This means that none of the data is attached to the identity of the participants. Nobody knows who filled out which survey. When we answer questions anonymously, we are less likely to lie because nobody will know that we are the ones who answered these questions. This helps us minimize the inaccuracy caused by our motivation to look good to others, but it does not help researchers obtain accurate results if people are in denial of certain characteristics they have. To deal with this denial issue, researchers sometimes mix items from a lie scale (or social desirability scale) into the survey questions. These items are used to measure the likelihood that participants are answering dishonestly because they want to look good not only to others but also to themselves. A common scale used for this purpose is the Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1960). It includes items like "I am always willing to admit when I make a mistake" and, as a reverse-scored item, "I like to gossip at times" (for which "yes" or a high rating means low social desirability). When a participant scores above a certain cutoff score on the Social Desirability Scale, their data is omitted from the analysis because there is a high likelihood that the participant is responding to the questions inaccurately. This helps minimize inaccuracies not only from our motivation to look good to others but also from our motivation to look good to ourselves.
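The scoring and screening logic described above might look roughly like the sketch below. It uses the two example items quoted in the text plus two placeholder items. The placeholder items, the yes/no answer format, the cutoff value, and all names are assumptions made for illustration and are not taken from the published scale.

```python
# For each item, True means "yes" is the socially desirable answer.
# False marks a reverse-scored item, where "no" is the desirable answer.
DESIRABLE_ANSWER_IS_YES = {
    "admits_mistakes": True,      # "I am always willing to admit when I make a mistake"
    "gossips_at_times": False,    # "I like to gossip at times" (reverse-scored)
    "placeholder_item_3": True,   # hypothetical item
    "placeholder_item_4": False,  # hypothetical reverse-scored item
}

CUTOFF = 3  # hypothetical cutoff: scores above this are excluded

def desirability_score(answers):
    """Count the answers given in the socially desirable direction."""
    return sum(
        answers[item] == desirable_is_yes
        for item, desirable_is_yes in DESIRABLE_ANSWER_IS_YES.items()
    )

# Hypothetical yes (True) / no (False) answers from three anonymous participants.
participants = {
    "p1": {"admits_mistakes": True, "gossips_at_times": True,
           "placeholder_item_3": False, "placeholder_item_4": True},
    "p2": {"admits_mistakes": True, "gossips_at_times": False,
           "placeholder_item_3": True, "placeholder_item_4": False},
    "p3": {"admits_mistakes": False, "gossips_at_times": True,
           "placeholder_item_3": True, "placeholder_item_4": False},
}

# Keep only participants at or below the cutoff for the main analysis.
kept = [pid for pid, answers in participants.items()
        if desirability_score(answers) <= CUTOFF]
print(kept)
```

Here the participant who gives the socially desirable answer to every item is the one dropped from the analysis.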
Cultural Bias
Another important factor to consider is cultural bias. People with different cultural backgrounds may interpret words or phrases in different ways. This may make people who have similar personalities respond to the same items on a personality test in different ways. Part of this has to do with comparison groups. When we are asked to rate how much we agree with the item, "It is important to consider the opinion of my parents when making important decisions," we tend to compare ourselves with peers in our own cultural group. If we tend to consider what our parents think more than our peers do, we might give ourselves a high rating. However, how we respond depends on the cultural norm of the society we live in. Someone living in a culture that emphasizes valuing the opinions of parents highly may have a similar personality but rate themselves low on this item because almost everyone else around them values their parents' opinions more than they do.
Cultural variables can also cause response biases. For example, people in some cultures are more likely to think of things in a categorical "yes or no" manner. It is not uncommon to find that people in these cultures have a tendency to rate themselves on the extreme ends when asked to respond to questions on a 7-point scale (i.e., most of their responses are 1 or 7). Response biases can also occur in other ways. People in some cultures are raised in ways that make them very agreeable to others. People in these cultures are more likely to openly accept most things people say. This can result in acquiescence bias, a tendency to agree (or respond "yes") to most questions. If we compare the scores of people in these cultures with the scores of people in cultures that promote critical and independent thinking, it is not uncommon to find that the scores of the former group are almost always higher than those of the latter, regardless of the personality trait being measured.
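The two response styles described above can be summarized with simple proportions. The sketch below assumes a 7-point scale, treats 1 and 7 as extreme responses, and treats any rating above the midpoint of 4 as agreement. These definitions, the data, and the function names are illustrative assumptions rather than established scoring rules.

```python
def extreme_response_rate(responses, low=1, high=7):
    """Share of responses that fall on either end of the scale."""
    return sum(r in (low, high) for r in responses) / len(responses)

def agreement_rate(responses, midpoint=4):
    """Share of responses above the scale midpoint (agreeing answers)."""
    return sum(r > midpoint for r in responses) / len(responses)

# Hypothetical answers from two people to the same eight 7-point items.
extreme_responder = [1, 7, 7, 1, 7, 1, 7, 7]
agreeable_responder = [6, 5, 7, 6, 5, 6, 6, 5]

for label, responses in [("extreme", extreme_responder),
                         ("agreeable", agreeable_responder)]:
    print(label,
          f"extreme rate = {extreme_response_rate(responses):.3f},",
          f"agreement rate = {agreement_rate(responses):.3f}")
```

A high agreement rate is easier to interpret as acquiescence when the items include both positively worded and reverse-scored statements, because a person agreeing with both kinds cannot simply be high on the trait.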
References
Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 24, 349-354.
Gregory, R. J. (2000). Psychological Testing (3rd ed.). Needham Heights, MA: Allyn & Bacon.