Designing & Reporting Experiments in Psychology 3/e, Peter Harris
 
Reporting studies that include questionnaires

 

E9 The reliability and validity of your measures

From time to time in the book and on this Web site I have mentioned the importance of reliability and validity. It is impossible to overemphasise this. Reliability and validity are central to psychological research. If our measures are unreliable or invalid, we are wasting our time.

Reliability

One form of reliability is the test-retest reliability of a scale or measure: scores on it should remain pretty much constant across two different time points when we believe that whatever it is supposed to measure has also remained pretty much constant. Where this is the case, you can obtain participants’ scores on the variable on two occasions and correlate them (see Sections 4.6.2 and 4.6.3 of the book and Section B2 of this Web site). With Pearson’s product moment correlation coefficient, a value of r = .70 or above is generally thought to be necessary for the measure to be considered reliable.
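
As a rough illustration of the test-retest check just described, here is a minimal Python sketch that correlates scores from two occasions and compares r with the .70 guideline. The function name pearson_r and the six pairs of scores are invented for the example; they are not data from the book.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented scores for six hypothetical participants on two occasions
time1 = [12, 15, 9, 20, 17, 11]
time2 = [13, 14, 10, 19, 18, 10]

r = pearson_r(time1, time2)
print(f"test-retest r = {r:.2f}")
print("meets the .70 guideline" if r >= 0.70 else "below the .70 guideline")
```

With these made-up scores r comes out at about .96. The sketch does not handle the case where everyone has the same score on one occasion (the correlation is then undefined).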

If you collect two or more scores of someone’s performance or attitude or intentions and so on at any given time, you can assess how well the scores measure the same thing by calculating their internal consistency. If the internal consistency is high enough, you can combine the scores into a scale to measure the performance or attitude. Scales with high internal consistency are generally more reliable than single measures of the same variable. A commonly used statistic for assessing internal consistency is Cronbach’s coefficient alpha. Cronbach’s alpha normally ranges from 0 to 1 (a negative value can occur and usually signals a problem, such as an item that needs reverse scoring). Scales with Cronbach’s alpha of .70 or above are generally considered to be reliable.
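
To make the idea of internal consistency concrete, the sketch below computes Cronbach’s alpha from the standard formula: alpha = (k / (k - 1)) × (1 - sum of the item variances / variance of the total scores), where k is the number of items. The function name cronbach_alpha and the scores (three items, five hypothetical participants) are invented for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: a list of k lists, one per item, each with one score per participant."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Each participant's total score across the k items
    totals = [sum(scores) for scores in zip(*items)]
    # Sample variances (n - 1 in the denominator) of each item and of the totals
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

# Invented responses: three items, five hypothetical participants
item1 = [4, 5, 2, 3, 5]
item2 = [4, 4, 2, 3, 5]
item3 = [5, 5, 1, 3, 4]

alpha = cronbach_alpha([item1, item2, item3])
print(f"Cronbach's alpha = {alpha:.2f}")
```

For these made-up scores alpha is about .94, which would pass the .70 guideline.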

Cronbach’s alpha is easily calculated using statistical software packages. In SPSS, for example, you can find it under Analyze > Scale > Reliability Analysis. All you need to do is to put the items you wish to comprise the scale into the analysis and see whether the alpha is at least .70. The Statistics button in that dialog also has a box labelled “Scale if item deleted”; ticking it shows what happens to alpha if any one of your items is deleted. You can use this to see whether removing a given item would improve the alpha.
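
If you are curious what the “alpha if item deleted” check is doing, the following Python sketch recomputes alpha with each item left out in turn. It is an illustration only, not a description of how SPSS is implemented; the function names and the four sets of item scores are invented, and it assumes a scale of at least three items so that two remain after a deletion.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of k item-score lists (k >= 2)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))

def alpha_if_item_deleted(scale):
    """scale: dict of item name -> scores. Returns alpha with each item removed."""
    result = {}
    for name in scale:
        remaining = [scores for other, scores in scale.items() if other != name]
        result[name] = cronbach_alpha(remaining)
    return result

# Invented responses: four items, five hypothetical participants
scale = {
    "item1": [4, 5, 2, 3, 5],
    "item2": [4, 4, 2, 3, 5],
    "item3": [5, 5, 1, 3, 4],
    "item4": [3, 2, 3, 4, 3],  # made deliberately out of step with the others
}

full_alpha = cronbach_alpha(list(scale.values()))
deleted = alpha_if_item_deleted(scale)
print(f"alpha with all items = {full_alpha:.2f}")
for name, value in deleted.items():
    print(f"alpha if {name} deleted = {value:.2f}")
```

With these made-up scores, alpha for all four items is about .71, and deleting item4 raises it to about .94, whereas deleting any of the other three lowers it. That is the pattern to look for when deciding whether an item is weakening your scale.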

Validity

Measures should also of course be valid – that is they should measure what they are supposed to measure. There are a variety of types of validity and once you start to think about this issue you will quickly come to realise that establishing that a measure is a valid one is not straightforward. The key thing is to be aware of the importance of validity, of the different aspects of validity, and wherever possible to use measures of your variables that have been validated. So, wherever possible use an established, validated measure of your variable, rather than make one up yourself; where there is no such measure available, attempt to validate your own; where this is not possible (e.g., time or the other resources available to you do not allow it) pay special attention in the DISCUSSION to the possibility that your measure is not measuring what you think it is measuring (i.e., may not be valid).

One of the least trustworthy types of validity is face validity. Face validity is the assumption that a measure is valid if it looks valid – i.e., if it looks as if it is a measure of whatever it is supposed to be measuring. For example, items assessing someone’s intentions to take exercise may combine items talking about their intentions, expectations, and plans to take exercise. On the face of it, these items look like they are tapping into the person’s underlying intentions. However, without further evidence of their validity, to rely on the appearance of such items alone as an index of their validity is to rely on face validity.

How might we establish the validity of such a set of items? One possibility is to demonstrate concurrent validity. With concurrent validity we take our measure and examine its relationship with another measure of the same variable. So, we could examine how well scores on our new intentions measure correlate with somebody else’s (validated) measure of intentions. The better they correlate, the stronger the evidence that our new measure is valid. However, if there is already a measure, why do we need to develop a new one – especially if the two are very highly positively correlated (and therefore more or less measuring exactly the same thing)? (There can be reasons – for example, our measure may be briefer or more user friendly or in a different language.) What do we do, however, if there are no such pre-existing measures with which to assess concurrent validity?

An additional aspect of validity is predictive validity. With predictive validity we test whether our measure predicts the things it should (e.g., according to theory or previous research) and does not predict the things it should not. Thus our theories suggest that a measure of intentions to take exercise should be correlated positively with measures of attitude towards taking exercise, beliefs about one’s ability to take exercise successfully, beliefs about what people who are important to us (such as our friends and members of our family) think about us taking exercise and, ultimately, with the amount of exercise (all things being equal) we subsequently take. We can therefore look to see whether this is the case – if it is, we have evidence of predictive validity. In many studies we tend to rely on predictive validity. However, you may have spotted here that there is a potential problem – a lack of relationship (e.g., between intentions and subsequent behaviour) may reflect a flaw in our theory rather than the invalidity of our measure.

Demonstrating that your measure correlates with the things it should is known as convergent validity. Demonstrating that your measure does not correlate with the things it should not correlate with is known as discriminant validity. These are both aspects of establishing construct validity. Construct validity involves hypothesising what the underlying construct should do and what it should not do and testing to see whether our measure performs in this way. For example, given how we think “intentions” function, how would we expect a valid measure of intentions to behave? What would it predict? What would it be correlated with negatively? What would it not be correlated with? The more our measure behaves in accordance with these expectations, the more we are confident that it has construct validity.

These are but some of the terms available for understanding validity – there are many others – and the true meaning of some of them, such as construct validity, is controversial. The main thing – as discussed above – is to think about what evidence you have that your measures are valid and, where there are grounds to question their validity, to bear this in mind when interpreting your findings in the DISCUSSION.

 

 

 

 

Open University Press, McGraw-Hill