Visayas State University
College of Education
Department of Teacher Education
VisCA, Baybay City, Leyte
“Semantic Differential”
(A Written Report in PrEd 161 – Assessment of Learning 2)
Prepared by:
Ahldeter S. Mantua
Mary Luz Zuela
Submitted to:
Helmar G. Ycong
Instructor
Summer 2014
I. OBJECTIVES
At the end of the lesson, the students are expected to:
· Define semantic differentials.
· Determine the theoretical background of semantic differentials.
· Determine the construction methods and uses of SD.
· Identify the descriptive measures and procedures involved in SD.
· Identify the advantages and disadvantages of SD.
· Determine SD’s statistical properties.
II. INTRODUCTION
The semantic differential measurement technique is a form of
rating scale that is designed to identify the connotative meaning of objects,
words, and concepts. The technique
was created in the 1950s by psychologist Charles E. Osgood. The semantic
differential technique measures an individual's unique, perceived meaning of an
object, a word, or an individual. The semantic differential can be thought of
as a sequence of attitude scales. The scales are designed such that the left
side is generally positive and the right is generally negative. This allows the
semantic differential to measure intensity and directionality. The rating scale
consists of a list of bipolar responses.
III. BODY
Semantic differential (SD) is a type of rating scale designed to measure the
connotative meaning of objects, events, and concepts. The connotations are used
to derive the attitude towards the given object, event or concept. Osgood's
semantic differential was an application of his more general attempt to measure
the semantics or meaning of words, particularly adjectives, and their referent
concepts. The respondent is asked to choose where his or her position lies, on a
scale between two bipolar adjectives (for example: "Adequate-Inadequate",
"Good-Evil" or "Valuable-Worthless"). An example of an SD scale is:
good ___:___:___:___:___:___:___ bad
      3   2   1   0   1   2   3
Usually, the position marked 0 is labeled "neutral," the 1 positions are
labeled "slightly," the 2 positions "quite," and the 3
positions "extremely." A scale like this one measures directionality
of a reaction (e.g., good versus bad) and also intensity (slight through
extreme). Typically, a person is presented with some concept of interest, e.g.,
Red China, and asked to rate it on a number of such scales. Ratings are
combined in various ways to describe and analyze the person's feelings.
THEORETICAL BACKGROUND
Nominalists and Realists
Theoretical underpinnings of Charles E. Osgood's semantic differential have roots in the
medieval controversy between the nominalists and realists. Nominalists
asserted that only real things are entities and that abstractions from these
entities, called universals, are mere words. The realists held that universals
have an independent objective existence either in a realm of their own or in
the mind of God. Osgood’s theoretical work also bears affinity to linguistics and general semantics and relates to Korzybski's structural
differential.
Use of Adjectives
The development of this instrument provides an interesting
insight into the border area between linguistics and psychology. People have
been describing each other since they developed the ability to speak. Most
adjectives can also be used as personality descriptors. The occurrence of
thousands of adjectives in English is an attestation of the subtleties in
descriptions of persons and their behavior available to speakers of English. Roget's Thesaurus is
an early attempt to classify most adjectives into categories and was used
within this context to reduce the number of adjectives to manageable subsets, suitable
for factor analysis.
Evaluation, Potency, and
Activity (EPA)
Osgood and his colleagues performed a factor analysis of large collections
of semantic differential scales and found three recurring attitudes that people
use to evaluate words and phrases: evaluation, potency, and activity.
Evaluation loads highest on the adjective pair 'good-bad'. The 'strong-weak'
adjective pair defines the potency factor. Adjective pair 'active-passive'
defines the activity factor. These three dimensions of affective meaning were
found to be cross-cultural universals in a study of dozens of cultures.
The EPA Structure
One
of the distinctive features of the SD is its reduction of ratings to three
basic dimensions of variation. A number of early studies were conducted to
determine the dimensions of bipolar adjective ratings. Of special importance
was the thesaurus study in which 76 adjective pairs were chosen from Roget's
Thesaurus to represent a great variety of semantic contrasts and the
corresponding bipolar scales were used by one hundred college students to rate
20 different concepts. Correlations between the ratings on different scales
were calculated and factored. The EPA structure was clearly evident in the
results of this and other early analyses; in the thesaurus study the EPA
dimensions accounted for more than two-thirds of the common variance. Some
additional dimensions were found in the early studies, and several scales that
made distinctions too narrowly descriptive or too highly abstract were found to
be unrelated to any of the major dimensions. Yet, for the most part, early work
with the SD revealed that ratings on most scales are highly predictable in the
three EPA dimensions alone.
The
EPA structure holds up with a wide variety of subjects, concepts, and scales.
Bopp (reported in Osgood, et al., pp. 223-226) had 40 schizophrenics
rate 32 words on a 13 scale form; the usual EPA structure was recognizable.
Wright (1958) had 40 concepts rated on a 30 scale SD by a survey sample of
2,000 men and women distributed over the spectrum of socioeconomic status. In
this study each concept was rated by a different sample of 50 persons so the
mean ratings for different concepts were entirely independent. Wright found
four factors in his data, the first three of which clearly were EPA. Heise
(1965) had 1,000 concepts rated on eight scales by Navy enlistees; factor
analyses of the data based on mean ratings for the 1,000 different words
yielded the usual EPA structure. DiVesta (1966) had 100 concepts rated on 27
scales by subjects in grades two through seven (20 subjects for each concept).
The usual EPA structure emerged, though there was some tendency for Potency and
Activity to merge into a single Dynamism dimension up until the fifth grade.
DiVesta also reports another study in which grade school children used 21
scales to rate 100 different concepts (this time with 100 subjects rating each
concept) and, combining the data for all grades, the usual EPA structure was
found.
Characterization of the EPA Dimensions
Considering
the generality of the EPA dimensions and their importance in research using the
SD, it is worth considering in more detail the distinctions that are involved.
In the following paragraphs the EPA dimensions are characterized in two ways.
First, some of the typical adjective contrasts that define each dimension are
presented. Second, a number of concepts which typically are rated near the
extremes of each dimension are given.
Evaluation
is associated with the adjective contrasts: nice-awful, good-bad, sweet-sour,
and helpful-unhelpful. Some concepts which lie on the positive (good) side of
this dimension are: DOCTOR, FAMILY, GOD, CHURCH, HAPPY, PEACE, SUCCESS, TRUTH,
BEAUTY, and MUSIC. Some concepts which lie toward the negative (bad) pole are:
ABORTION, DEVIL, DISCORDANT, DIVORCE, FRAUD, HATE, DISEASE, SIN, WAR, ENEMY,
and FAILURE.
Some
scales which define the Potency dimension are big-little, powerful-powerless,
strong-weak, and deep-shallow. Concepts which lie toward the positive
(powerful) pole are: WAR, ARMY, BRAVE, COP, MOUNTAIN, ENGINE, BUILDING, DUTY,
LAW, STEEL, POWER, and SCIENCE. Concepts which lie toward the negative
(powerless) pole are: GIRL, BABY, WIFE, FEATHER, KITTEN, KISS, LOVE, and ART.
Activity
scales are fast-slow, alive-dead, noisy-quiet, and young-old. Some concepts
high in Activity are: DANGER, ANGER, ATTACK, CITY, ENGINE, FIRE, SWORD,
TORNADO, WAR, WIN, CHILD, and PARTY. Among concepts which lie toward the
negative pole on the Activity dimension are: CALM, SNAIL, DEATH, EGG, REST,
STONE, and SLEEP.
SD Space
Sometimes it is convenient to think of the EPA dimensions as forming a
three-dimensional space. The SD, or affective, space is illustrated in Figure 1;
the origin or center of this space represents neutrality on all three
dimensions. Treating EPA measurements of a stimulus as coordinates allows the
stimulus to be positioned as a point in the space, and this point graphically
represents the affective response to the stimulus.
Figure 1. The SD Space.
Construction and Use of SD
The
following sections discuss how one makes and uses an SD for research purposes,
and what kinds of information are provided for analyses. This discussion serves
to introduce vocabulary which will be helpful later on.
The primary question in constructing an SD is what scales should be used. Two basic
criteria enter into scale selection: relevance and factorial composition.
Scale Relevance
Subjects
find it easier to use scales which relate meaningfully to the concepts being
judged and which make distinctions that are familiar. For example, in rating
persons, sweet-sour is less relevant, and thus harder to use, than
helpful-unhelpful; among laymen, talkative-quiet would be a better scale than
manic-depressive. Furthermore (and more important), relevant scales provide
more sensitive measurements. More variance is obtained in using relevant scales
and the variance of ratings involves less random error.
There
are two approaches to identifying scales which are relevant for a given class
of concepts and a given sample of persons. On the one hand, subjects can be
presented with a set of scales and asked to rank them in terms of their
meaningfulness in thinking about x, where x is a class of concepts to be rated
like People, Newspapers, Organizations, etc. One then would use the scales
ranking highest in meaningfulness for a given population of raters.
A
second, more meticulous approach would be to present pairs or triads of
concepts from the stimulus concept domain and ask subjects how these concepts
differ. One would make up bipolar scales from the distinctions respondents
make, omitting any purely denotative distinctions (e.g., blond versus
brunette). For example, if subjects frequently drew the distinction of
crudeness, an appropriate scale might be crude-gracious. This approach,
developed for the study of individuals by Kelly (1955), has been applied
successfully in SD studies (e.g., Triandis).
Factorial Composition
The
basic goal in an SD study is to get measurements on the EPA dimensions, and
since factor analyses show these dimensions to be independent, one seeks
measurements that are independent. This means that appropriate scales will
measure the dimensions (i.e., scales that have high factor loadings on the EPA
dimensions) and will give relatively pure measures of the dimensions (i.e.,
each scale has a high loading on just one dimension). The only objective way to
select factorially pure scales is on the basis of actual factor analyses.
Researchers experienced with the SD are aware that intuition is an unreliable
guide in selecting factorially pure scales. One can conduct ad hoc factor
analyses to learn the factorial composition of new scales, but this is an
expensive procedure since studies based on fewer than 30 concepts and hundreds
of subjects are likely to be misleading. The most common procedure is to select
scales on the basis of published factor analyses; the following are some
available reports which indicate the factorial composition of SD scales. The
thesaurus study has been a standard source of factor analytic information on SD
scales. Because of the large number of scales considered (76), this is an
important source, but the factor loadings should be treated only as rough
indicators because of the unusual method of factoring and because only 20
concepts were rated in this study. Wright presents the factorial structure for
30 scales based on data from a survey sample of 2,000 adults rating 40
concepts. DiVesta gives the factor loadings of 27 scales used to rate 100
concepts by a large sample of children. Jakobovits gives the highest loading EPA
scales for 15 languages (including English) as derived from the pan-cultural
factor analyses.
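The selection rule described above (a high loading on one EPA dimension and low loadings on the others) can be sketched as a simple filter. The loadings below are invented for illustration and do not come from any of the cited studies; the cut-off values `high` and `low` are likewise assumptions, since the text does not specify thresholds.

```python
# Hypothetical loadings on (Evaluation, Potency, Activity); illustrative values only.
LOADINGS = {
    "good-bad": (0.88, 0.05, 0.09),
    "strong-weak": (0.19, 0.62, 0.20),
    "fast-slow": (0.01, 0.00, 0.70),
    "hot-cold": (0.45, 0.40, 0.35),  # loads moderately on everything: not pure
}
DIMENSIONS = ("E", "P", "A")


def pure_scales(loadings, high=0.6, low=0.3):
    """Keep scales whose largest loading is >= high and whose other loadings are <= low."""
    selected = {}
    for scale, values in loadings.items():
        magnitudes = [abs(v) for v in values]
        top = max(range(3), key=lambda i: magnitudes[i])
        others = [m for i, m in enumerate(magnitudes) if i != top]
        if magnitudes[top] >= high and all(m <= low for m in others):
            selected[scale] = DIMENSIONS[top]
    return selected


print(pure_scales(LOADINGS))
# {'good-bad': 'E', 'strong-weak': 'P', 'fast-slow': 'A'}
```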
The
published factor analytic studies provide a large fund of scales to draw on and
usually one can obtain a subset of scales which are relevant to the concept domain
of interest. It should be noted, however, that another problem arises in
selecting scales from previous studies—the matter of semantic stability. When
applied to a special class of concepts, the words in a scale may take on
special meanings and thus the scale is literally a different one than
previously studied. For example, the words HOT and COLD are used connotatively
in rating many concepts (like PEOPLE) but may be used denotatively in rating
physical objects. Since the scale takes on different meanings with different
concepts, its factorial composition may be different for the special class of
objects. The problem of semantic stability is (along with the problem of
relevance) the primary impetus for carrying out special factor analyses for
each new content area.
Number of Scales
Assuming
that one has a set of relevant scales, each of which loads on one and only one
of the EPA factors, the next question is how many scales should be included in
the final instrument. More than one scale for each dimension is desirable since
this improves the reliability of factor scores. On the other hand, reliability
characteristics of SD scales are such that it would rarely be useful to include
more than ten scales to measure a dimension, and generally speaking, four
scales per dimension can give adequate sensitivity for most purposes.
Contrary
to the practice in many published studies, the number of Evaluation scales
should not be more than the number of Potency and Activity scales. Evaluation
scales always are found to be more reliable than Potency or Activity scales and
thus fewer, not more, are needed for a given level of precision.
Equivalent Forms
In
research it is often necessary or desirable to do repeated measurement. This
introduces the question of equivalent forms. There is evidence that subjects
may recall the SD rating they have made previously when the time periods
between repeated measurements are short. Consequently, such repeated
measurements using the same form may not be independent. An example of how this
could confound research is given by Coyne and Holzman (1966) who had subjects
give SD ratings for their voice at points before and after listening to
themselves on a tape recorder. No significant differences were found when the
same SD form was used in all ratings, but highly significant changes appeared
when subjects used alternate forms of the SD for the different points of time.
This experiment suggests that equivalent forms of the SD are necessary in
experiments dealing with short range changes in attitudinal reaction.
The
primary problem in the development and use of equivalent forms is the large
fund of factor analyzed scales that is required; making up two equivalent forms
calls for twice as many scales. Given a fund of scales to draw on, one should
try to match factor loadings of scales in different forms. Then an experimental
design should be used such that some subjects should use Form A at time 1 and
Form B at time 2 while other subjects use Form B at time 1 and Form A at time
2.
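The counterbalanced design described above can be sketched as follows. Simple alternation is used here to keep the example deterministic; in an actual study one would more likely assign the two orders at random. The function name and subject identifiers are hypothetical.

```python
def assign_form_order(subject_ids):
    """Alternate subjects between the A-then-B and B-then-A orders."""
    orders = {}
    for index, subject in enumerate(subject_ids):
        # Even-indexed subjects take Form A at time 1; odd-indexed take Form B first.
        orders[subject] = ("A", "B") if index % 2 == 0 else ("B", "A")
    return orders


orders = assign_form_order(["s1", "s2", "s3", "s4"])
for subject, (time1, time2) in orders.items():
    print(subject, "time 1: Form", time1, "| time 2: Form", time2)
```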
Format of SD Test Booklets
There are three
possible ways of graphically setting up scales and the concepts to be rated:
(1) Concepts can be presented one at a time, with
each concept followed by all of the scales on which it is to be rated;
typically, the concept is printed at the top of a page and the scales are
arrayed below, one after another, and centered on the page.
(2) A concept and one of the scales on which it
is to be rated can be presented as a single item with the various concept-scale
combinations arrayed randomly one after another. For example, item 1 might be
NEGRO followed by the good-bad scale, item 2 RUSSIAN followed by the
passive-active scale, item 3 JEW followed by helpful-unhelpful, etc. (It is
immaterial whether the stimulus word is placed to the left or right of the
scale.)
(3) A single scale can be presented along with
all of the concepts which are to be rated on it; for example, the good-bad
scale could be presented at the top of the page and concepts listed down along
the side, each followed by scale marking positions.
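As an illustration of format 2, the sketch below pairs every concept with every scale and arrays the resulting items in random order. The concepts, scales, and fixed seed are placeholders chosen for the example, not items from any published form.

```python
import itertools
import random


def format_two_items(concepts, scales, seed=0):
    """Pair every concept with every scale, then shuffle the concept-scale items."""
    items = list(itertools.product(concepts, scales))
    # A seeded generator keeps the printed booklet reproducible.
    random.Random(seed).shuffle(items)
    return items


concepts = ["DOCTOR", "MOUNTAIN", "TORNADO"]
scales = ["good-bad", "passive-active", "helpful-unhelpful"]
for number, (concept, scale) in enumerate(format_two_items(concepts, scales), start=1):
    print(f"item {number}: {concept} / {scale}")
```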
Studies
show that measurements differ very little in going from one format to another,
although format 3 is least desirable since there is some slight tendency for
ratings of one concept to affect ratings on another concept. From the
standpoint of data handling, format 1 is preferable since it groups the data
for a single concept, facilitating keypunching and statistical analyses.
When
format 1 is used, the order of concepts in the test booklet is immaterial since
anchoring or order effects are not evident using this format. Sommer (1965)
made a determined effort to produce anchor effects and found none: for example,
POLITICIAN was rated the same whether preceded by JANITOR, GARBAGE COLLECTOR,
FARMER or whether preceded by STATESMAN, SCHOLAR and SCIENTIST.
To
disguise the nature of an SD test and to prevent subjects from developing
response sets which could reduce sensitivity of measurements, it is customary
to mix the scales as much as possible. This means alternating Evaluation,
Potency and Activity scales rather than presenting them in blocks and
alternating directionality so that the scales' good poles, strong poles, or
active poles are not always on the same side.
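The mixing rule just described can be sketched as below: scales are interleaved by dimension and the printed direction of every other scale is flipped. The scale pool is a small assumed example, and `build_form` is a hypothetical helper, not a published procedure.

```python
# Assumed scale pool; each pair is written (positive pole, negative pole).
POOL = {
    "E": [("good", "bad"), ("nice", "awful")],
    "P": [("strong", "weak"), ("big", "little")],
    "A": [("fast", "slow"), ("alive", "dead")],
}


def build_form(pool):
    """Interleave E, P, A scales and reverse the printed direction of every other scale."""
    form = []
    rounds = max(len(pairs) for pairs in pool.values())
    for r in range(rounds):
        for dim in ("E", "P", "A"):
            if r < len(pool[dim]):
                positive, negative = pool[dim][r]
                flip = len(form) % 2 == 1
                left, right = (negative, positive) if flip else (positive, negative)
                form.append({"dim": dim, "left": left, "right": right, "flipped": flip})
    return form


for line in build_form(POOL):
    print(line["dim"], line["left"], "___:___:___:___:___:___:___", line["right"])
```

With three dimensions and strict alternation, each dimension ends up with its positive pole on the left in one round and on the right in the next, so neither blocks of scales nor a constant "good side" appear on the page.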
Adverbial quantifiers
To
facilitate the rating of intensity, SD scale positions usually are labeled with
adverbs like "extremely," "quite," and
"slightly." The study by Wells and Smith inquired into whether the
adverbs serve any useful function. SD scales with and without adverbial
quantifiers were employed with a survey sample of 400 housewives. It was found
that the amount of differentiation in SD ratings was substantially greater when
adverbial labels were used: no labels led to many more ratings at the
end-points of the scales. Furthermore, interviewers reported that the labeled
scales were better understood by the respondents and led to greater cooperation
in the rating task. Hence, use of adverbial quantifiers is justified.
The metric characteristics of adverbial quantifiers have been investigated in a
number of studies. The results indicate that adverbs "extremely,"
"quite," and "slightly" do define rating positions which
are about equidistantly spaced. The results from these studies also suggest
some other adverbs which might be used in some SD studies. For example, the
adverbs of frequency—"seldom," "often,"
"always"—might be meaningful in SD studies of roles (i.e., is a
LAWYER sometimes powerful, usually powerful, always powerful; is a MOTHER
sometimes nice, usually nice, always nice). However, the relationship between
such frequency ratings and intensity ratings using the customary adverbs is not
known.
Administration of an SD
SDs are easily administered to groups and, when possible, this is certainly the
most efficient way to obtain SD data. However, an SD also can be administered
successfully on an individual basis by survey interviewers.
Instructions
should routinely contain a statement that the purpose of the SD is to find out
how people feel about things and so the respondent should rate the way he
feels. He should use his first impressions and not try to figure out the
"right answer" or the answer that makes most sense. Instructions also
should contain an example in which the concept presented would elicit a
unanimous response from the subjects, for example, TORNADO. The concept is
rated by the test administrator, who explains while making the ratings what the
scale positions mean. It has been suggested that subjects should be urged to
work quickly; however, Miron found that subjects could be urged to work slowly
and thoughtfully and the same results were obtained, mainly because after the
first few ratings, subjects worked quickly, regardless of what they were
instructed to do.
For
many subject populations, one can turn to the literature to check the
experiences and procedures of others who have worked with similar groups, for
example: college students—Osgood, et al.; children—DiVesta (1966), and
Kagan, Hosken and Watson (1961); survey respondents—Wright, and Wells and
Smith; factory workers—Triandis; juvenile delinquents—Gordon, et al.
(1963); illiterates—Suci (1960).
Test length
Osgood, et al. (p. 80) suggested that a subject should be allowed about one hour
to make 400 SD judgments (for example, to rate 40 concepts on ten scales). Most
college students work faster than this and the allowance is generous for even
the stragglers in a college student population. On the other hand, this timing
estimate is a convenient round figure, and it is perhaps minimal for subjects
not accustomed to taking tests. In any case, the patience and endurance of
unpaid subjects can rarely be strained beyond 400 judgments, and for
non-college subjects (such as survey respondents), the maximum number of
ratings undoubtedly is far less—probably more like 50 judgments.
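The timing allowance quoted above reduces to simple arithmetic, sketched here. Only the figure of 400 judgments per hour comes from the text; the function name is hypothetical, and the rate should be treated as an adjustable assumption for other populations.

```python
JUDGMENTS_PER_HOUR = 400  # allowance quoted in the text above


def minutes_needed(n_concepts, n_scales, judgments_per_hour=JUDGMENTS_PER_HOUR):
    """Estimated testing time in minutes for a concepts-by-scales design."""
    judgments = n_concepts * n_scales
    return judgments * 60 / judgments_per_hour


print(minutes_needed(40, 10))      # 60.0 minutes for 400 judgments
print(3600 / JUDGMENTS_PER_HOUR)   # 9.0 seconds allowed per judgment
```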
Descriptive Measures and Procedures
A
typical SD study dealing with a number of concepts, using several scales for
each EPA dimension, and employing a sample of respondents, results in thousands
of ratings. Various statistics and procedures are available to compress this
data to a comprehensible set of measurements.
Factor Scores
The
first step in data reduction is to combine ratings on the separate scales into
factor scores. This involves first assigning numerical values to the scale
positions; for example, -3, -2, -1, 0, 1, 2, 3, going from one end of the scale to
the other. (To simplify calculations, numerical values should be adjusted for
the directionality of scales; for example, the positions numbered 3 through -3
for the scale nice-awful, and -3 through 3 for the scale bad-good.) The
responses that were obtained are then coded and a subject's ratings on a
concept averaged over all the scales representing a single factor. The product
is a single number representing one subject's reaction to one concept on one of
the SD dimensions.
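The coding and averaging steps described above can be sketched as follows. The form layout, the 1-to-7 position numbering, and the helper names are assumptions made for the example; the logic (adjust for scale direction, then average the scales of each dimension) follows the description in the text.

```python
# Hypothetical form: each scale is named left pole first, with its EPA dimension.
# "positive_left" marks scales whose positive pole is printed on the left.
SCALES = {
    "nice-awful": {"dim": "E", "positive_left": True},
    "bad-good": {"dim": "E", "positive_left": False},
    "strong-weak": {"dim": "P", "positive_left": True},
    "slow-fast": {"dim": "A", "positive_left": False},
}


def code_position(position, positive_left):
    """Map a marked position 1..7 (left to right) onto -3..3 with the positive pole at +3."""
    value = position - 4
    return -value if positive_left else value


def factor_scores(marks, scales=SCALES):
    """Average one subject's coded ratings of one concept within each dimension."""
    by_dim = {}
    for scale, position in marks.items():
        info = scales[scale]
        by_dim.setdefault(info["dim"], []).append(
            code_position(position, info["positive_left"])
        )
    return {dim: sum(values) / len(values) for dim, values in by_dim.items()}


# One subject's marks for one concept (position counted from the left).
marks = {"nice-awful": 2, "bad-good": 7, "strong-weak": 4, "slow-fast": 5}
print(factor_scores(marks))  # {'E': 2.5, 'P': 0.0, 'A': 1.0}
```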
Scale Weights. Assuming that the factor loadings of the scales for a given
dimension are all high and comparable in size, that all the scales load mainly
on the one dimension, and that all the scales are of approximately equal
relevance so that the rating variances are approximately equal, then it is
reasonable to weight the scales equally in calculating the factor scores (i.e.,
find the simple mean of the ratings). Only if these assumptions are seriously
violated is a more complicated procedure of differential weighting desirable;
this could involve weighting each scale by the squared factor loading or the
use of multiple regression formulas.
Textbooks on factor analysis provide information on the more complicated
procedures.
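A minimal sketch of the squared-loading option mentioned above is given below. Normalizing by the sum of the weights (so the result stays on the original rating scale) is a choice made for this example; the text does not prescribe it, and the ratings and loadings are invented.

```python
def weighted_factor_score(ratings, loadings):
    """Weighted mean of coded ratings, each weighted by its squared factor loading."""
    weights = [loading * loading for loading in loadings]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)


# Two scales for one dimension: a strongly loading scale and a weakly loading one.
print(weighted_factor_score([3, 1], [0.9, 0.3]))  # about 2.8, pulled toward the first scale
# With equal loadings the result reduces to the simple mean of the ratings.
print(weighted_factor_score([3, 1], [0.8, 0.8]))  # about 2.0
```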
Group Means. A frequent second step in data reduction is finding the group means
for the factor scores corresponding to different concepts. This simply involves
averaging the factor scores over the subjects in the sample. The group means
can be viewed as estimates of true factor scores for the particular concept in
the particular group or culture—they are the points around which individuals
vary. Group means computed from the SD tend to be extremely stable.
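Averaging factor scores over subjects can be sketched as below. The three subjects' scores are made-up values for a single concept, and the dictionary layout is an assumption of this example.

```python
def group_means(scores_by_subject):
    """Average each dimension's factor score over all subjects in the sample."""
    n = len(scores_by_subject)
    return {
        dim: sum(scores[dim] for scores in scores_by_subject) / n
        for dim in ("E", "P", "A")
    }


# Hypothetical factor scores of three subjects for one concept.
sample = [
    {"E": 2.5, "P": 0.0, "A": 1.0},
    {"E": 1.5, "P": 1.0, "A": 0.0},
    {"E": 2.0, "P": -1.0, "A": 2.0},
]
print(group_means(sample))  # {'E': 2.0, 'P': 0.0, 'A': 1.0}
```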
Polarization
The factor scores for a concept
constitute a complete description of an affective reaction in terms of the EPA
dimensions. For some purposes one might not want such detailed information but
simply a general measure of the intensity of the affective response independent
of its character. This kind of measure, the emotionality of the concept, is
given by the polarization measure—the distance between the neutral point or
origin of the SD three-dimensional space and the particular concept under
consideration. If the neutral point of the scale was assigned a value of zero
in the coding process, the factor scores also have their neutral point at zero
and polarization is calculated simply by squaring the factor scores, adding,
and taking the square root of the sum. That is:
P = \sqrt{e^2 + p^2 + a^2}
where e, p, and a are factor-score measurements of a given concept on the
three dimensions.
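The polarization calculation described above (square the three factor scores, add them, and take the square root) can be sketched directly. The input values are illustrative, and the neutral point is assumed to be coded as zero.

```python
import math


def polarization(e, p, a):
    """Distance of a concept from the neutral origin of the EPA space."""
    return math.sqrt(e * e + p * p + a * a)


print(polarization(2.0, 1.0, 2.0))  # 3.0
print(polarization(0.0, 0.0, 0.0))  # 0.0 for a completely neutral concept
```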
Profile Analyses
The
majority of SD studies involve some hypothesis about differences in affective
reaction. For example, one might be interested in reactions to NEGROES versus
JEWS; the difference in reaction to NEGROES before and after seeing The
Birth of a Nation, or the difference in reactions to NEGROES among
Southerners and Northerners. Various approaches for analyzing differences in
affective response have been developed.
Dimensions treated separately. One approach examines the differences on each EPA
dimension separately. That is, one would compare the means for concept a
versus concept b, for time 1 versus time 2, or for group x versus
group y on each of the three dimensions separately. This approach
provides the most detailed results, the statistical procedures for comparing
means are well-studied and relatively non-problematic, and it is definitely the
preferred procedure in most SD studies. In any case, it should accompany other
types of profile analysis.
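A sketch of the dimension-by-dimension comparison follows. It only computes the difference in group means on each dimension; a real analysis would add a significance test for each difference, which is omitted here. The two profiles are invented values.

```python
def dimension_differences(profile_a, profile_b):
    """Difference in mean factor scores (a minus b) on each EPA dimension."""
    return {dim: profile_a[dim] - profile_b[dim] for dim in ("E", "P", "A")}


# Hypothetical group means for two concepts (or two times, or two groups).
concept_a = {"E": 2.0, "P": 0.5, "A": 1.0}
concept_b = {"E": 0.5, "P": 0.5, "A": -1.0}
print(dimension_differences(concept_a, concept_b))  # {'E': 1.5, 'P': 0.0, 'A': 2.0}
```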
D scores. There are instances in which one would like to have a measure of the
combined differences on all three EPA dimensions, that is, a summary measure of
the total difference in affective reactions. D scores have come to serve this
purpose in SD research. These represent the distance between two sets of SD
measurements when both are plotted as points in the three-dimensional SD space.
The formula for calculating D scores is as follows: let e_1, p_1, a_1 be the
factor scores for concept 1 (or time 1 or group 1); e_2, p_2, a_2, the
measurements for concept 2 (or time 2 or group 2). Then
D = \sqrt{(e_1 - e_2)^2 + (p_1 - p_2)^2 + (a_1 - a_2)^2}
The
meaning of D scores can be illustrated by an example. The average EPA
factor scores for the concepts HOME, OFFICE, and WORK were drawn from Heise
(1965) and entered into the formula for D. It was found that the
distance between HOME and WORK is about 3.8 units while the distance between
OFFICE and WORK is .8 units. Thus, the affective reaction to WORK is more
similar to that for OFFICE than to that for HOME.
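The distance calculation can be sketched as below. The factor scores used are made-up placeholders, not the values from the study cited above, so the printed distance does not correspond to the figures reported in the text.

```python
import math


def d_score(first, second):
    """Distance between two (e, p, a) profiles in the three-dimensional SD space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(first, second)))


# Hypothetical (e, p, a) factor scores for two concepts.
concept_1 = (1.0, 2.0, 3.0)
concept_2 = (1.0, 6.0, 6.0)
print(d_score(concept_1, concept_2))  # 5.0
```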
Considerations in using D. The reliability of D scores based on group means
(where N = 30) is adequate; the correlations between test and retest or between
alternate groups are above .90 (Norman, 1959). The random distribution of D
under various conditions of rating errors has been studied by Cozens and Jacobs
(1961).
Despite
the simplicity and the reliability of the measure, D scores should be
employed conservatively. D scores completely hide the character of a
difference, and a large D could be due to a big difference on one
dimension or small differences on all three dimensions. When only the D
scores are presented, a reader has no way of determining which is the case.
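The point that D hides the character of a difference can be seen with two invented pairs of profiles: one pair differs only on Evaluation, the other differs on all three dimensions, yet both give the same D.

```python
import math


def d_score(first, second):
    """Distance between two (e, p, a) profiles in the SD space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(first, second)))


reference = (0.0, 0.0, 0.0)
one_dimension = (3.0, 0.0, 0.0)     # differs from the reference on Evaluation only
all_dimensions = (2.0, 2.0, 1.0)    # differs on Evaluation, Potency, and Activity

print(d_score(reference, one_dimension))   # 3.0
print(d_score(reference, all_dimensions))  # 3.0
```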
Beyond
that, however, they can be misinterpreted and lead to artifactual findings. For
example, at one time a popular project was to show that the difference (D)
between evaluative ratings of Ideal Self and Actual Self is greater for
neurotics than for normals. To simplify matters, suppose that all persons see
their ideal selves as quite good (an assumption which is realistic). Now
suppose that neurotics have low evaluation of their actual selves, rating their
actual selves as slightly bad, whereas normals rate their actual selves as
slightly good. Since both groups have the same rating of the ideal self, it
inevitably follows that the neurotics are further from their ideal selves. It could
be a serious error to say that "what's wrong" with neurotics is the
discrepancy between their actual and ideal selves, since perhaps what is really
wrong with them is merely their low evaluation of actual self, which produces
the discrepancy as an artifact or inevitable side effect. In fact, Bass and
Fiedler (1959) did find that D scores added very little to the basic
factor scores in predicting maladjustment. Pitfalls involved in D scores
are discussed in greater detail in a series of articles by Cronbach.
Reliability of
SD Measurements
A
study of the absolute deviations between ratings of a concept in test and
retest (with retest up to three months later) was reported in Osgood, et al.
(p. 127). For evaluation scales it was found that the average difference
between ratings on the test and retest was somewhat more than one-half scale
units. For Potency and Activity scales the average difference between the test
and retest ranged from .7 to 1.0 scale units. The authors concluded from their
data that a difference of 3 scale units or more between two ratings on the same
scale could be considered statistically significant at the .05 level in a
two-tailed test.
DiVesta
and Dick (1966) studied the test-retest reliabilities of SD ratings made by
grade school children. In their study each subject rated a different concept on
a series of scales, and reliabilities were determined by correlating the
ratings made on a first test with ratings made one month later on a second
test. The correlations for different scales ranged from .27 to .56. DiVesta and
Dick found that reliabilities are somewhat higher in the higher grades and also
that Evaluation scales tend to be somewhat more reliable at all grade levels.
A
reliability study by Norman (1959) gives information on how much shift occurs
in ratings, relative to what might be expected if the ratings were purely
random. Norman had 30 subjects rate 20 concepts on 20 scales in a test and
retest spaced four weeks apart. On the average he found that the amount of
shift in ratings was about 50 per cent of what would be expected if the ratings
were completely random. More specifically, his results showed that 40 per cent
of the scale ratings do not shift at all from test to retest, 35 per cent of
the ratings shift by one scale unit, and 25 per cent of the ratings shift two
or more scale units. Norman found that ratings are more stable for some
concepts than for others, and this seems to be related to the number of
meanings for a concept. This may also be a function of how extremely the word
is rated. Other studies suggest that concepts whose true values are neutral are
rated with less reliability (Peabody, 1962; Luria, 1959). Norman also found
that some subjects were more stable than others in making their ratings; in
particular, there is a tendency for those who use the end-points of scales more
often to have lower test-retest stability. Finally, he found that certain
scales are associated with greater stability; in particular, Evaluation scales
evoke fewer shifts.
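The kind of tabulation reported above can be sketched as follows. Two assumptions are made here that the text does not state: that "purely random" means two independent ratings drawn uniformly from the seven scale positions, and that shifts are measured as absolute differences. The test and retest ratings are invented.

```python
from collections import Counter


def shift_distribution(test, retest):
    """Proportion of ratings that shift by 0, 1, or 2-or-more scale units."""
    shifts = Counter(min(abs(a - b), 2) for a, b in zip(test, retest))
    n = len(test)
    return {"0": shifts[0] / n, "1": shifts[1] / n, "2+": shifts[2] / n}


def expected_random_shift(n_positions=7):
    """Mean absolute difference between two independent, uniformly random ratings."""
    total = sum(abs(i - j) for i in range(n_positions) for j in range(n_positions))
    return total / n_positions ** 2


test = [1, 2, 3, -1, 0]
retest = [1, 3, 3, 1, -3]
print(shift_distribution(test, retest))  # {'0': 0.4, '1': 0.2, '2+': 0.4}
print(expected_random_shift())           # 16/7, roughly 2.29 scale units
```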
The
general impression produced by these test-retest reliability studies is that a
person's rating of a single concept on a single scale constitutes a
measurement, albeit not an extremely delicate one. Such results may be somewhat
misleading, however, because test-retest statistics measure stability as well
as reliability. Consequently, low correlations may be due to actual changes in
subjects' reactions as well as random errors. In any case, single ratings
rarely are used in SD research; instead, factor scores, which should be more
reliable because they are the averages of ratings on several scales, are more
commonly employed.
Factor score reliability. A study is reported in Osgood in which several
controversial topics were rated on six evaluation scales, and factor scores,
representing each subject's evaluative reaction to a given topic, were obtained
by summing the ratings on the six scales. The correlations between test and
retest factor scores ranged from 0.87 to 0.97, with a mean of 0.91. DiVesta and
Dick in their study of SD reliability among children made up factor scores by
averaging ratings on two scales for a given dimension and correlating the
measurements from the first test with those from the second test given one
month later. For children in the fourth grade or higher, the correlations ranged
between 0.5 and 0.8 and were highest for Evaluation factor scores; for students
in the third grade or lower, test-retest correlations ranged between 0.4 and 0.5.
DiVesta and Dick found that test-retest correlations were somewhat higher when
the retest followed the first test immediately. In this case the r's
ranged between 0.6 and 0.8. Norman examined the effect of making up factor
scores from various numbers of scales. His results indicate that factor scores
are more reliable than single ratings and that most of the gain in precision is
accomplished by averaging just three or four scales; going up to an eight-scale
factor score seems to add very little additional stability when looking at data
from a test and retest spaced one month apart.
The
various studies indicate that there is indeed a significant gain in test-retest
correlations when factor scores are used rather than individual scale ratings.
Furthermore, it appears that most of the possible improvement can be obtained
using relatively few scales in making up the factor scores.
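The comparison between single-scale ratings and factor scores can be illustrated with a short computation. This is only a sketch under stated assumptions: the ratings are invented, the three scales are assumed to belong to the same dimension (for example, Evaluation), and `factor_scores` and `pearson_r` are hypothetical helper names, not part of any SD software.

```python
from math import sqrt

def factor_scores(ratings):
    """Average each subject's ratings across the scales of one dimension."""
    return [sum(row) / len(row) for row in ratings]

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: 5 subjects x 3 scales, coded 1..7, at test and retest.
test = [[6, 7, 6], [2, 3, 2], [4, 4, 5], [7, 6, 7], [3, 2, 3]]
retest = [[7, 6, 6], [3, 2, 2], [4, 5, 4], [6, 7, 6], [2, 3, 4]]

# Test-retest correlation of the averaged (factor) scores.
r_factor = pearson_r(factor_scores(test), factor_scores(retest))
# Test-retest correlation of the first scale taken alone.
r_single = pearson_r([row[0] for row in test], [row[0] for row in retest])

print(round(r_factor, 3), round(r_single, 3))
```

In this invented data set the factor-score correlation is higher than the single-scale correlation because the unit-sized disagreements on individual scales largely cancel out when the scales are averaged.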
Group
means. Many SD studies do not focus on an individual's rating of a concept
but on a group mean. That is, interest is in the average score in a certain
group rather than the score for any one person. In such cases, there is averaging
both across scales (factor scores) and across persons, and reliabilities should
be even higher.
DiVesta
and Dick calculated factor score means for groups of three to five children.
The immediate test-retest correlations ranged from 0.73 to 0.94, figures that are
significantly higher than the correlations based on individual subjects. Norman
calculated scale means for 20 concepts using groups of 30 raters. The test-retest
correlation between means was 0.96, and the correlation between means produced
by two different samples of student respondents was 0.94. Miron averaged the
factor scores for 20 concepts across 112 subjects and obtained test-retest
correlations of 0.98 or more.
These
studies reveal that group means on the EPA dimensions are highly reliable and
stable even when the samples of subjects involved in calculating the means are
as small as 30.
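Why group means are more stable than individual scores can be shown with a small simulation. This is a toy model, not a reanalysis of the studies cited: it assumes each rating equals a fixed "true" concept score plus independent random error, it treats scores as continuous rather than as 7-point ratings, and the parameter values (20 concepts, 30 raters, error standard deviation of 1.5) are chosen only for illustration.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simulate(n_concepts=20, n_raters=30, noise_sd=1.5, seed=0):
    """Return (mean individual r, r between group means) for one simulated study."""
    rng = random.Random(seed)
    # Assumed "true" score of each concept on a -3..+3 continuum.
    true_scores = [rng.uniform(-3, 3) for _ in range(n_concepts)]

    def session():
        # One row per rater: true score plus independent random error.
        return [[t + rng.gauss(0, noise_sd) for t in true_scores]
                for _ in range(n_raters)]

    def concept_means(rows):
        # Mean score of each concept across all raters.
        return [sum(row[c] for row in rows) / len(rows)
                for c in range(n_concepts)]

    test, retest = session(), session()
    # Test-retest correlation computed separately for each rater.
    individual = [pearson_r(a, b) for a, b in zip(test, retest)]
    # Test-retest correlation of the group means.
    group = pearson_r(concept_means(test), concept_means(retest))
    return sum(individual) / len(individual), group

mean_individual_r, group_r = simulate()
print(f"mean individual r = {mean_individual_r:.2f}, group-mean r = {group_r:.2f}")
```

Under these assumptions the random errors of the individual raters tend to cancel in the mean, so the correlation between group means is much closer to 1 than the typical individual correlation.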
Advantages
The SD is designed to measure both the direction and the intensity
of attitudes simultaneously. Because the form of an SD is basically the same
whatever the stimulus, established bipolar adjective pairs can often be reused
rather than written anew for each study. The scale may also permit finer
discrimination in measuring attitudes.
Disadvantages
The semantic differential suffers from a lack of standardization. The number of
divisions on the scale is also a problem. If too few divisions are used, the scale
is crude and lacks meaning; if too many are used, the scale goes beyond the
ability of most people to discriminate.
Statistical Properties
Five items, or five bipolar pairs of adjectives, have been shown to yield
reliable findings that correlate highly with alternative Likert numerical
measures of the same attitude.
One problem with this scale is that its psychometric properties and level of
measurement are disputed. The most general approach is to treat it as an ordinal
scale, but it can be argued that the neutral response (i.e., the middle
alternative on the scale) serves as an arbitrary zero point and that the
intervals between the scale values can be treated as equal, making it an
interval scale.
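The two readings of the scale lead to different summary statistics. The sketch below is illustrative only: it assumes a 7-point scale whose positions are numbered 1 to 7 from left to right, recodes them so that the neutral midpoint becomes zero, and then reports the median (defensible under an ordinal reading) alongside the mean (which presupposes equal intervals). The function name `recode` and the sample ratings are hypothetical.

```python
from statistics import mean, median

def recode(position, points=7):
    """Map a 1..points scale position to a score centered on the neutral midpoint."""
    if points % 2 == 0:
        raise ValueError("an odd number of points is needed for a neutral midpoint")
    if not 1 <= position <= points:
        raise ValueError("position is outside the scale")
    return position - (points + 1) // 2

# Hypothetical ratings of one concept on one scale by seven respondents.
ratings = [1, 2, 2, 4, 7, 6, 2]
scores = [recode(p) for p in ratings]

print(scores)          # positions recoded to the range -3..+3
print(median(scores))  # ordinal reading: only the order of the categories is used
print(mean(scores))    # interval reading: distances between categories are used
```

Which end of the scale receives the positive sign is a coding convention; reversing it changes the sign of the scores but not their distance from the neutral point.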
IV.
SUMMARY and CONCLUSION
The
SD is a general procedure for assessing affective responses. The technique has
three features that distinguish it as an instrument for social psychological
research. First, SDs are easy to set up, administer, and code. This, in
conjunction with the demonstrated reliability and validity of the procedure,
gives it favorable cost-effectiveness. Second, the EPA structure, which has an
unprecedented amount of cross-cultural validation, is interesting
theoretically, and measurements on all three dimensions yield a wealth of
information about affective responses to a stimulus. The information that the
three independent scores give about the character of responses inevitably is
lost with alternative measures depending on unidimensionality. Third, since the
form of an SD is basically the same whatever the stimulus, research using the
SD (and methodological research about the SD) can cumulate.
The
SD has been applied frequently as a technique for attitude measurement. Its
usefulness in this respect is indicated by the wide variety of meaningful
results that have been obtained. Further, SD measurements have been found to
correlate highly with measurements on traditional attitude scales. There are,
however, a number of questions in the use of SDs for attitude measurement.
When
subjects are highly invested in a topic and want to give socially desirable
answers, it may be advisable to use an instrument that is less direct than the
SD. Social desirability ratings of SD scales correlate very highly with the
Evaluation factor loadings of the scales. Thus, if subjects choose to distort
their responses toward social desirability, Evaluation scores would be biased
upward. If one does use the SD with especially sensitive topics (or
respondents) it is worth taking some precaution to guard against social
desirability effects (e.g., giving anonymity to respondents). Note, however,
that Potency and Activity measurements should be free of this problem since the
typical scales for measuring these dimensions are essentially free of social
desirability contamination.
Thus
far almost all applications of the SD to attitude measurement have relied only
on Evaluation measurements. This appears to be an unfortunate tradition. A
subjective examination of items in traditional attitude scales suggests that
Potency and Activity do get involved in traditional attitude measurements. Furthermore,
the multiple correlations of EPA ratings with traditional scales often are much
higher than the correlations of Evaluation ratings only with the scales. In the
future it would be advisable to obtain ratings on all three dimensions when one
is interested in attitudes. Almost certainly the full EPA information will
increase the power of analyses.
Perhaps
the most important general contribution of the SD is the provision of a single
attitude space for all stimuli. This permits analyses, comparisons, and
insights that were virtually impossible with traditional instruments.
V.
REFERENCES
· Barclay, A., and F. J. Thumin. 1963. "A modified semantic differential approach to attitudinal assessment." Journal of Clinical Psychology 19:376-378.
· DiVesta, F. J., and W. Dick. 1966. "The test-retest reliability of children's ratings of the semantic differential." Educational and Psychological Measurement 26:605-616.
· Himmelfarb, S. 1993. "The measurement of attitudes." In A. H. Eagly and S. Chaiken (Eds.), Psychology of Attitudes, 23-88. Thomson/Wadsworth.
· Howe, E. S. 1962. "Probabilistic adverbial qualifications of adjectives." Journal of Verbal Learning and Verbal Behavior 1:225-242.
· Miron, M. S. 1961. "The influence of instruction modification upon test-retest reliabilities of the semantic differential." Educational and Psychological Measurement 21:883-893.
· Osgood, C. E., G. Suci, and P. Tannenbaum. 1957. The Measurement of Meaning. Urbana, IL: University of Illinois Press.
· Snider, J. G., and C. E. Osgood. 1969. Semantic Differential Technique: A Sourcebook. Chicago: Aldine.