Chapter 4: Observational Research
You should know the definitions of the following terms. You should also be able to apply
these concepts (i.e., recognize examples of them in several contexts and use
them to critically evaluate a study).
- subject (participant) roles
We have seen that science is concerned with accurately describing
a phenomenon (identifying the important dimensions and relevant variables),
specifying the relationships between two or more variables (e.g., the
effect of an IV on a DV), and explaining why these relationships exist
(developing theories that organize our observations and generate predictions
about future observations). Thus, the first step in the scientific process
is observing/describing the phenomenon.
As the text's examples of perceptual illusions illustrate, although
it may be that "seeing is believing," it is certainly not the case
that our perceptions always bear an accurate correspondence to objective reality.
What amazes me to this day is how compelling these illusions are--even when
they have been explained (so we understand, or "know" their basis),
we can't overcome them--our brain persists in misperceiving these stimuli.
Thus, as scientists, we must be constantly vigilant of the various threats
to the validity (i.e., accuracy) of our observations.
We saw in Chapter 3 that reliability (consistency) of
measurement is essential to scientific observation. Reliability is a
necessary, but not a sufficient, condition for validity (accuracy) of measurement.
That is, reliability is a "prerequisite" of validity (a measure
can't be valid if it is not consistent), but it is possible for a scale
to be reliable, but still not valid. For example, we could develop an "algebra
test," that would yield very similar scores each time it is administered,
but if the test purported to measure "reading comprehension," then
although the test is reliable, it is not an accurate measure of reading comprehension.
So we need to be careful to ensure the validity as well as the
reliability of our observations. I like to play darts, so I'll give a "dart
board" example of the relationship between reliability and
validity. Think of observation/measurement as trying to "hit
the bullseye" on the board (i.e., that's the variable, or construct, we
are trying to measure). In the worst-case scenario, a measure is neither reliable
nor valid--my friend, Jody, illustrates this: her darts land all over the
board (if they hit it at all!), and if she hits the bullseye, it's a random
event! Next, a measure can be reliable, yet still be invalid--my friend, Ray,
is very consistent: all of his darts land close to each other (near the same
place), but the problem is that they miss the bullseye! What we strive for
is a measure that is both reliable and valid: Now, when I
throw the darts (of course!), most of them consistently hit the target (bullseye),
so I am both consistent and accurate--thereby winning the game whenever we
play (well, ok--maybe I'm stretching the truth a bit, but hopefully you get the idea!).
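The dart-board analogy can even be sketched numerically. Below is a minimal Python simulation (the thrower names echo the examples above, and all the numbers are invented) in which reliability shows up as low *spread* across repeated throws, and validity as an average landing near the bullseye (low *bias*):

```python
# Sketch of the dart-board analogy: reliability = low spread of repeated
# "measurements"; validity = the average landing near the bullseye.
import random
import statistics

random.seed(1)
BULLSEYE = 0.0  # the "true" value of the construct

def throws(true_center, spread, n=100):
    """Simulate n one-dimensional dart throws (repeated measurements)."""
    return [random.gauss(true_center, spread) for _ in range(n)]

def describe(name, data):
    spread = statistics.stdev(data)               # low spread -> reliable
    bias = abs(statistics.mean(data) - BULLSEYE)  # low bias   -> valid
    print(f"{name}: spread={spread:.2f}, bias={bias:.2f}")

describe("Jody (unreliable, invalid)", throws(BULLSEYE, spread=5.0))
describe("Ray  (reliable, invalid)", throws(true_center=4.0, spread=0.5))
describe("Me   (reliable, valid)", throws(BULLSEYE, spread=0.5))
```

Note that Ray's throws have a *small* spread but a *large* bias: the simulation's version of a measure that is reliable but not valid.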
Your authors touch on this issue in Chapter 3, where they discuss
the "predictive validity" of measurement scales (a good test
will yield strong, positive correlations between scores on the test and scores
on some relevant behavior the test should be able to predict). For example,
we become more confident of the SAT if it reliably predicts college GPA (high
SAT scores are related to high GPA). The present chapter discusses some other
aspects of validity, but they all relate to the accuracy--or truth--of our observations:
- Construct Validity
- Is the independent variable a "true" cause of the dependent
variable, and are the scores on the DV "true" measures of the construct?
- These issues were discussed in Chapters 1 & 2 (that is, confounding
variables threaten the construct validity of the IV), as well
as in Chapter 3 (i.e., random error threatens the validity
of our manipulations and measurements). BTW, there is a lot of repetition
in this text, a feature I think reinforces learning!
- As we have seen, careful operational definitions (procedures
for manipulating the IV and measuring the DV that control for possible
extraneous variables) minimize confounding, and protocols
(strictly-followed uniform procedures for treating participants and measuring
the DV) minimize random error.
- External Validity
- Do the results of
the study generalize outside of the context of this particular
study--to other settings, situations and populations?
- Utilizing random sampling helps here, but there is no substitute for
replication--repeating the study using a different sample--to
demonstrate the generality of the results. Recall that this was discussed
in the context of "experimental reliability," i.e., will the
results consistently occur with replications?
- We will return to discuss this issue, but as your authors point out,
utilizing different types of tasks to manipulate the IV and measure the
DV is a good way to establish the generality of a relationship.
- Internal Validity
- Were the differences in the DV caused by the differences
in the levels of the IV, or could the differences be due to some other (extraneous) variable?
- An internally-valid experiment is one in which potential confounding
extraneous variables have been ruled out as possible
causal influences on the DV.
- This is accomplished by experimental control: holding extraneous
variables constant across the levels of the IV--this leaves
the IV as the only possible "true" cause of the differences,
thereby allowing us to infer a cause-effect relationship
between the IV & DV.
Much research in psychology can be classified as "descriptive observation,"
that is, empirical data that have been obtained from systematic observation.
This section describes several examples of observational research and discusses
issues to ensure the reliability and validity of these observations.
- Naturalistic Observations
- As the term implies, this research involves directly observing naturally-occurring
behavior "outside the laboratory" in natural environments (sometimes
called "field research," a term borrowed from
biologists' observations "in the field"). It is often the first
step in research on a phenomenon, and is therefore frequently exploratory.
- Important issues here include deciding on which behaviors
are important, identifying a classification system for categorizing
specific behaviors into broader units, and determining a protocol
for deciding what behaviors count as instances of a given category.
- The ethogram described in the text is a good example of
this type of research. As another example, I once constructed an ethogram
of my pet dog's communication behaviors for a term paper. First, I had
to create a protocol to decide which behaviors constituted "communication"
(i.e., which activities involved eliciting a response from me), so I eliminated
"asocial" behaviors (e.g., scratching), and only counted "social"
behaviors (e.g., approaching me with tail wagging). Then I established
a classification system for categorizing specific behaviors (e.g., I created
"behavioral categories," such as general attention-getting,
food-getting, getting outside, etc.). Then I classified specific actions
(e.g., jumping at the door as an example of communicative behavior in
the category of getting outside).
- It is important to establish the reliability of such observations. Interobserver
reliability is established by correlating the observations of
two people independently watching the same behaviors. So I asked another
person to independently observe the frequency of specific communication
behaviors of my dog, then correlated our separate observations of the
same behaviors. As usual, a strong positive correlation
between the independent raters demonstrates the reliability of the observations.
- Actually, although this example illustrates ethology, it would be better
considered a "case study," in that it involved many observations
of one organism. Naturalistic studies typically involve many observations
of numerous organisms of the same species.
- The Case Study
- As indicated above, the case study involves intensive observations of
a single individual over time. Case studies are typically conducted by
practitioners, such as clinicians. The rules of careful, systematic observation
discussed above also apply to the case study, but as with all research
that is descriptive (as opposed to experimental), one cannot infer cause-effect relationships.
- This is especially difficult in case studies, but one way of increasing
our understanding of cause-effect is to use the deviant-case analysis.
This is like using another individual as the "control" (or comparison
standard) for understanding what causes the deviance in the case of interest.
- The researcher makes the same observations on two individuals who are
similar in most respects, but differ in one important way (such as
comparing the memory loss of a scientist who suffered from alcoholism
with that of a similar scientist who did not have this disorder).
- Note that this is similar to the experimental psychologist's effort
to rule out extraneous variables by holding them constant (in this example,
both were scientists of similar age, so these variables were "constant"
between the two individuals).
- Survey Research
- As the term implies, a large sample of individuals from the larger population
is "surveyed" (i.e., a questionnaire or interview is used) to
determine the frequency of particular behaviors. For example, we might
be interested in the frequency of alcohol use at VU by measuring alcohol-related
behaviors in a sample of students.
- Good survey development is a complex process involving issues such as
the appropriate scale to use (e.g., Guttman vs. Likert), construction
of individual scale items, and protocols for administering and scoring the instrument.
- A critical issue in survey research is sample representativeness--we
attempt to draw large, random samples from the population so that the
sample is representative of the larger population (note that this is related
to external validity, as discussed in Chapter 3).
- Large random samples are often difficult to obtain, so researchers frequently
use stratified sampling: the sample is drawn to reflect
the relative proportions of important segments ("strata") in
the population. For example, in opinion polling, the population is often
divided into socioeconomic strata (e.g., upper, middle, lower income people).
Then each stratum is sampled randomly in direct relation to its proportion
in the population.
- Another example from VU: if we wanted a representative sample of students,
we would draw 60% of our respondents from female students and 40% from
male students, reflecting the ratio of females to males in the population.
Note that this saves us time (e.g., we don't need as many male students
as female students), and makes the sample more likely to be representative
of the population than would a simple random sample--a prof of mine once
called stratified samples "good-and-random" samples (that
is, they are random samples in which each stratum is represented in
proportion to its share of the population).
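Proportional stratified sampling is easy to sketch in code. The Python below assumes the hypothetical 60/40 female/male split described above; the roster data and function name are invented for illustration.

```python
# Sketch of proportional stratified sampling: sample each stratum
# randomly, in proportion to its share of the population.
import random

random.seed(0)

# Hypothetical student roster: 600 females, 400 males
roster = [("F", i) for i in range(600)] + [("M", i) for i in range(400)]

def stratified_sample(population, strata_key, n):
    """Draw n units, sampling each stratum in proportion to its size."""
    strata = {}
    for unit in population:
        strata.setdefault(strata_key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(population))
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(roster, strata_key=lambda u: u[0], n=100)
females = sum(1 for sex, _ in sample if sex == "F")
print(f"sample of {len(sample)}: {females} female, {len(sample) - females} male")
# → sample of 100: 60 female, 40 male
```

Within each stratum the draw is still random; only the stratum *sizes* are fixed, which is exactly what makes the sample "good-and-random."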
- Meta-Analysis
- This is a recently-developed
statistical technique for assessing the generality of a particular
empirical relationship. Individual studies determine whether differences
are "real" (i.e., statistically "significant," meaning
they are due to the IV), or whether the differences are due to "chance."
Meta-analysis uses multiple studies which have investigated the same empirical
relationship, and determines whether there is a "real" difference
across all the studies (that is, a reliable difference) or whether there
is no clear pattern across studies (i.e., a "chance" pattern).
- This technique goes
a step further, by yielding information about "effect size."
That is, just because a difference is "real" (that is, statistically
significant/reliable) does not mean that it is necessarily a "big"
difference (i.e., some differences are real, but are still small differences).
An example of this is research on gender differences in verbal ability.
Meta-analyses have revealed that although this appears to be a "real"
difference, the average "effect size" across studies is relatively small.
- This technique helps
to establish generality in two ways: by assessing the reliability of the
effect, and by yielding information about how large (important) the effect is.
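The core arithmetic of a meta-analytic average is simple enough to sketch. The toy Python below uses Cohen's d (a common standardized effect size: the mean difference divided by the pooled standard deviation) and weights each study's d by its sample size; all the per-study numbers are invented for illustration.

```python
# Toy sketch of meta-analytic averaging: combine a standardized effect
# size (Cohen's d) across several studies, weighting by sample size.
def cohens_d(mean1, mean2, pooled_sd):
    """Standardized mean difference between two groups."""
    return (mean1 - mean2) / pooled_sd

# (d, N) pairs for several hypothetical studies of the same comparison
studies = [(0.15, 120), (0.05, 300), (0.20, 80), (0.10, 200)]

weighted = sum(d * n for d, n in studies) / sum(n for _, n in studies)
print(f"average effect size d = {weighted:.2f}")  # → average effect size d = 0.10
```

This mirrors the verbal-ability example above: the individual studies all point the same direction (a "real," reliable difference), yet the combined effect size is small.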
ADVANTAGES OF OBSERVATION
- Observational research can be useful in the early stages of research,
helping to identify the parameters of a phenomenon
- Some studies can't be done experimentally, for practical or ethical reasons
- The relationships between variables observed can inductively lead
to theories that can subsequently be tested experimentally
- Observational research tends to be high in external validity,
since observations are made in natural settings (your authors refer to this
point as "ecological validity," a term similar to
external validity. The former, however, refers to what some have called "mundane
realism," or the extent to which the setting is similar to the "real world").
SOURCES OF ERROR IN OBSERVATION
Despite its advantages,
observational research is limited in that it is purely descriptive, and does
not allow inferences about cause-effect relationships. We do not have
the control found in an experiment, so there are a host of uncontrolled
extraneous variables that could account for the relationships
observed. Thus, descriptive research tends to be low on internal validity.
Sometimes we can't be confident of the reliability of the observations, especially
if they are difficult to replicate by other observers. The tendency to anthropomorphize
(attributing human characteristics to animals or inanimate objects) is hard
to resist in many observational studies.
The remainder of this chapter considers other sources of error in observation,
as well as ways to mitigate against these problems.
- Reactivity in Descriptive Research
- This concept was
introduced in Chapter 3 in the context of "rater errors"
(e.g., certain response biases, such as lenience), and is discussed later
in the context of "response styles." The fact
is that humans react to the very process of being observed,
and this reactivity could lead to behavior that the participant would
not exhibit if s/he were unaware that s/he was being observed.
- An example of reactivity is the phenomenon of "social desirability."
Most people want to appear to be "normal" and psychologically-healthy,
so when they know they are being observed, they may alter their behavior
to exhibit socially desirable behavior that they might not usually do.
Further, people often adopt a particular "role" as a participant
and may try to guess what the researcher's hypothesis is. Depending on
the role they assume, they may then behave according to what they think
the researcher expects.
- Clues in the research situation that suggest the purpose of the study
are called "demand characteristics," because early
researchers suggested that these cues "demand" a particular
response from the participant, which the participant then obeys.
- One solution to this problem is to employ "unobtrusive
observations," so called because they are made without
the participant's awareness. Thus, the participant is unlikely to
exhibit unnatural behavior.
- Hidden cameras and participant-observation are examples of ways
to unobtrusively observe behavior, but of course, there are ethical
problems with their use.
- "Unobtrusive Measures" avoid these problems
by measuring the behavior indirectly (e.g., examining route traveling
by looking for paths worn in the grass), or using "archival"
records (e.g., past crime records for a community).
- Reactivity in Case Studies is especially a problem due
to normal forgetting and motivated forgetting (e.g., people tend to forget
things that are potentially threatening to their self-esteem).
- Attempting to obtain independent verification of participant's self-reports
is a way to address this issue.
- Response Styles refer to "habits," or biases
that people exhibit when completing self-report measures, including response
acquiescence (this is sometimes called "yea-saying,"
because of the tendency to say "yes" more often than "no"),
response deviation (i.e., the opposite--this is called "nay-saying"),
and social desirability (described above).
- Using forced-choice tests solves these problems by
forcing a person to choose between two equally-desirable alternatives
for each question on the survey.
- The "volunteer problem" (the differences between
people who volunteer for research and those who don't) is a threat to
the representativeness of the sample.
- Providing incentives for people to participate (e.g., extra credit
in Intro. Psych.) is one way to avoid this problem.