Reprinted from Interchange, Vol. 12, Winter, 1981, with permission.
Follow Through is a large compensatory education program that operated in scores of communities across the United States throughout the seventies and that continues, on a reduced scale, today. During its most active phase, it was conducted as a massive experiment, involving planned variation of educational approaches and collection of uniform data at each site. The main evaluation of outcomes was carried out by Abt Associates, Inc. (a private consulting firm based in Cambridge, Massachusetts) on the second and third cohorts of children who reached third grade in the program, having entered in kindergarten or first grade. In a series of voluminous reports, Abt Associates presented analyses indicating that among the various educational approaches tried, only those emphasizing "basic skills" showed positive effects when compared to Non-Follow Through treatments. House, Glass, McLean, and Walker (1978a) published a critique of the Abt Associates evaluation, along with a small reanalysis that found essentially no significant differences in effectiveness among the planned variations in educational approaches. Because of the great social importance attached to educational programs for disadvantaged groups and because no other large-scale research on the topic is likely to materialize in the near future, the Follow Through experiment deserves continuing study. The study reported here is an attempt, through more sharply focused data analysis, to obtain a more definitive answer to the question of whether different educational approaches led to different achievement outcomes.
Is it possible that the Follow Through planned variation experiment has
yielded no findings of value? Is it possible, after years of effort and millions
of dollars spent on testing different approaches, that we know nothing more than
we did before about ways to educate disadvantaged children? This is the implicit
conclusion of the widely publicized critique by House, Glass, McLean, and Walker
(1978a, 1978b). House et al found no evidence that the various Follow Through
models differed in effectiveness from one another or from Non-Follow Through
programs. The only empirical finding House et al were willing to credit was that
there was great variation in results from one Follow Through site to another.
This conclusion, as we shall show, is no more supportable than the conclusions
House et al rejected. Accordingly, if we were to follow House, Glass, McLean,
and Walker's lead, we should have to conclude that there are no substantive
findings to be gleaned from the largest educational experiment ever
conducted.
It would be a serious mistake, however, to take the critique
by House et al as any kind of authoritative statement about what is to be
learned from Follow Through. The committee assembled by House was charged with
reviewing the Abt Associates evaluation of Follow Through (Stebbins et al,
1977), not with carrying out an inquiry of their own. More or less, the
committee stayed within the limits of this charge, criticizing a variety of
aspects of the design, execution, and data analysis of the experiment. Nowhere
in their report do the committee take up the constructive problem that Abt
Associates had to face or that any serious inquiries will have to face. Given
the weaknesses of the Follow Through experiment, how can one go about trying to
extract worthwhile findings from it?
In this paper we try to deal
constructively with one aspect of the Follow Through experiment: the comparison
of achievement test results among the various sponsored approaches. We try to
show that if this comparison is undertaken with due cognizance of the
limitations of the Follow Through experiment, it is possible to derive some
strong, warranted, and informative conclusions. We do not present our research
as a definitive, and certainly not as a complete, inquiry into Follow Through
results. We do hope to show, however, that the conclusion implied by the House
committee-that the Follow Through experiment is too flawed to yield any positive
findings-is gravely mistaken.
Delimiting the Problem
Although Project Follow Through has numerous shortcomings as an experiment,
the seriousness of these shortcomings varies greatly depending on what questions
are asked of the data. One shortcoming was in the outcome measures used,
particularly in their limited range compared to the range of objectives pursued
by Follow Through sponsors. The House committee devotes the largest part of its
critique to this shortcoming, although it is a shortcoming that limits only the
range of conclusions that may be drawn. House et al allow, for instance, that
the Metropolitan Achievement Test was "certainly a reasonable choice for the
material it covers" (1978a, p. 138). Accordingly, Follow Through's shortcomings
as to outcome measures ought not to stand in the way of answering questions that
are put in terms appropriate to the measures that were used.
Another
shortcoming, recognized by all commentators on Follow Through, is the lack of
strictly comparable control groups. Follow Through and Non-Follow Through groups
at the same site differed from one another in uncontrolled and only partly
measurable ways, and the differences themselves varied from site to site. This
circumstance makes it difficult to handle questions having to do with whether
children benefited from being in Follow Through, because such questions require
using Non-Follow Through data as a basis for inferring how Follow Through
children would have turned out had they not been in Follow Through.
Much
of the bewildering complexity of the Abt Associates' analyses results from
attempts to make up statistically for the lack of experimental comparability. We
do not intend to examine those attempts except to note one curiosity. The
difficulty of evaluating "benefits" holds whether one is asking about the
effects of Follow Through as a whole, the effects of a particular model, or the
effect of a Follow Through program at a single site. The smaller the unit,
however, the more vulnerable the results are likely to be to a mismatch between
Follow Through and Non-Follow Through groups. On the one hand, to the extent
that mismatches are random, they should tend to average out in larger
aggregates. On the other hand, at a particular site, the apparent success or
failure of a Follow Through program could depend entirely on a fortuitously
favorable or unfavorable match with a Non-Follow Through group.
For
unknown reasons, both the Abt Associates and the House committee analysts have
assumed the contrary of the point just made. While acknowledging, for instance,
that the prevalence of achievement test differences in favor of Non-Follow
Through groups could reflect mismatch, they are able to make with confidence
statements like "Seven of the ten Direct Instruction sites did better than the
comparison classes but three of the Direct Instruction sites did worse" (House
et al, 1978a, p. 154). Such a statement is nonsense unless one believes that at
each of the ten sites a valid comparison between Follow Through and Non-Follow
Through groups could be made. But if House et al believe that, how could they
then believe that the average of those ten comparisons is invalid? This is like
arguing that IQ tests give an invalid estimate of the mean intelligence level of
disadvantaged children and then turning around and using those very tests to
classify individual disadvantaged children as retarded.
There is an
important class of questions that may be investigated, however, without having
to confront the problem of comparability between Follow Through and Non-Follow
Through groups. These are questions involving the comparison of Follow Through
models with one another. A representative question of this kind would be-how did
the Follow Through models compare with one another in reading achievement test
scores at the end of third grade? There are problems in answering such a
question, but the lack of appropriate control groups is not one of them. We can,
if we choose, simply ignore the Non-Follow Through groups in dealing with
questions of this sort.
Questions about the relative performance of
different Follow Through models are far from trivial. The only positive
conclusions drawn by Abt Associates relate to questions of this kind, and the
House committee's report is largely devoted to disputing those conclusions, that
is, disputing Abt's conclusions that Follow Through models emphasizing basic
skills achieved better results than others in basic skills and in self-concept.
The models represented in Follow Through cover a wide range of educational
philosophies and approaches to education. Choose any dimension along which
educational theories differ and one is likely to find Follow Through models in
the neighborhood of each extreme. This is not to say that the Follow Through
models are so well distinguished that they provide clean tests of theoretical
issues in education. But the differences that are there-like, for instance, the
difference between an approach based on behavior modification principles and an
approach modeled on the English infant school-offer at least the possibility of
finding evidence relevant to major ideological disputes within education.
Unscrambling the Methodology
The Abt Associates analysts were under obligation to try to answer the whole
range of questions that could be asked about Follow Through effects. In order to
do this in a coherent way, they used one kind of statistic that could be put to
a variety of uses. This is the measure they called "effect size," an adjusted
mean difference between the Follow Through and Non-Follow Through subjects at a
site. Without getting into the details of how effect size was computed, we may
observe that this measure is more suitable for some purposes than for others.
For answering questions about benefits attributable to Follow Through, some such
measure as effect size is necessary. For comparing one Follow Through model with
another, however, the effect size statistic has the significant disadvantage
that unremoved error due to mismatch between a Follow Through and Non-Follow
Through group is welded into the measure itself. As we noted in the preceding
section, comparisons of the effectiveness of Follow Through models with one
another do not need to involve Non-Follow Through data. Because effect size
measures will necessarily include some error due to mismatch (assuming that
covariance adjustments cannot possibly remove all such error), these measures
will contain "noise" that can be avoided when making comparisons among Follow
Through models.
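The point can be put in one line. If the site-level Follow Through and Non-Follow Through means carry independent error components (an idealization; in practice mismatch adds a further systematic term as well), then

$$
\operatorname{Var}\!\left(\bar{Y}_{\mathrm{FT}} - \bar{Y}_{\mathrm{NFT}}\right)
= \operatorname{Var}\!\left(\bar{Y}_{\mathrm{FT}}\right) + \operatorname{Var}\!\left(\bar{Y}_{\mathrm{NFT}}\right)
\;\ge\; \operatorname{Var}\!\left(\bar{Y}_{\mathrm{FT}}\right),
$$

so a between-model comparison built on Follow Through/Non-Follow Through difference scores starts from a noisier dependent variable than one built on Follow Through means alone.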
The Abt Associates analysts used several different ways
of computing effect size, the simplest of which is called the "local" analysis.
This method amounts to using the results for each cohort of subjects at each
site as a separate experiment, carrying out a covariance analysis of Follow
Through and Non-Follow Through differences as if no other sites or cohorts
existed. Although this analysis has a certain elegance, it clearly does not take
full advantage of the information available; the "pooled" analysis used by Abt,
which uses data on whole cohorts to calculate regression coefficients and at the
same time includes dummy variables to take care of site-specific effects, is
much superior in this respect. The House committee, however, chose to use effect
size measures based on the "local" analysis in their own comparison of models.
In doing so, they used the least powerful of the Abt effect size measures, all
of which are weakened (to unknown degrees) by error due to mismatch.
In
their comparisons of Follow Through models, Abt Associates analysts calculated
the significance of effects at different sites, using individual subjects at the
sites as the units of analysis, and then used the distribution of significant
positive and negative effects as an indicator of the effectiveness of the
models. The House committee argued, on good grounds we believe, that the
appropriate unit of analysis should have been sites rather than individual
children. To take only the most obvious argument on this issue, the manner of
implementing a Follow Through model is a variable of great presumptive
significance, and it is most reasonably viewed as varying from site to site
rather than from child to child. Having made this wise decision, however, the
House committee embarked on what must be judged either an ill-considered or an
excessively casual reanalysis of Follow Through data. Although the reanalysis of
data by the House committee occupies only a small part of their report and is
presented by them with some modesty, we believe their reanalysis warrants severe
critical scrutiny. Without that reanalysis, the House committee's report would
have amounted to nothing more than a call for caution in interpreting the
findings of the Abt Associates analysts. With the reanalysis, the House
committee seems to be declaring that there are no acceptable findings to be
interpreted. Thus a great deal hinges on the credibility of their
reanalysis.
Let us therefore consider carefully what the House committee
did in their reanalysis. First, they used site means rather than individual
scores as the unit of analysis. This decision automatically reduced the Follow
Through planned variation experiment from a very large one, with an N of
thousands, to a rather small one, with an N in the neighborhood of one hundred.
As previously indicated, we endorse this decision. However, it seems to us that
when one has opted to convert a large experiment into a small one, it is
important to make certain adjustments in strategy. This the House committee
failed to do. If an experiment is very large, one can afford to be cavalier
about problems of power, since the large N will presumably make it possible to
detect true effects against considerable background noise. In a small
experiment, one must be watchful and try to control as much random error as
possible in order to avoid masking a true effect.
However, instead of
trying to perform the most powerful analysis possible in the circumstances, the
House committee weakened their analysis in a number of ways that seem to have no
warrant. First, they chose to compare Follow Through models on the basis of
Follow Through/Non-Follow Through differences, thus unnecessarily adding error
variance associated with the Non-Follow Through groups. Next, they chose to use
adjusted differences based on the "local" analysis, thus maximizing error due to
mismatch. Next, they based their analysis on only a part of the available data.
They excluded data from the second kindergarten-entering cohort, one of the
largest cohorts, even though these data formed part of the basis for the
conclusions they were criticizing. This puzzling exclusion reduced the number of
sites considered, thus reducing the likelihood of finding significant
differences. Finally, they divided each effect-size score by the standard
deviation of test scores in the particular cohort in which the effect was
observed. This manipulation served no apparent purpose. And minor though its
effects may be, such effects as there are would be in the direction of adding further
error variance to the analysis.
The upshot of all these methodological
choices was that, while the House group's reanalysis largely confirmed the
ranking of models arrived at by Abt Associates, it showed the differences to be
small and insignificant. Given the House committee's methodology, this result is
not surprising. The procedures they adopted were not biased in the sense of
favoring one Follow Through model over another; hence it was to be expected that
their analysis, using the same effect measures as Abt, would replicate the
rankings obtained by Abt. (The rank differences shown in Table 7 of the House
report are probably mostly the result of the House committee's exclusion of data
from one of the cohorts on which the Abt rankings were based.) On the other
hand, the procedures adopted by the House committee all tended in the direction
of maximizing random error, thus tending to make differences appear small and
insignificant.
The analysis to be reported here is of the same general
type as that carried out by the House committee. Like the House committee, we
use site means rather than scores for individuals as the unit of analysis. The
differences in procedure all arise from our effort to minimize random error and
thus achieve the most powerful analysis possible. The following are the main
differences between our analysis and the House et al analysis:
1. We used site means for Follow Through groups as the dependent variable, using other site-level scores as covariates. The House committee used locally adjusted site-level differences between Follow Through and Non-Follow Through groups as the dependent variable, with covariance adjustments having been made on an individual basis. Our procedure appears to have been endorsed in advance by the House committee. They state: "For the sake of both inferential validity and proper covariance adjustment, the classroom is the appropriate unit of analysis" (House et al, 1978a, p. 153). While the House committee followed their own prescription in using site-level scores as dependent variables, they failed to follow it when it came to covariance adjustments.
2. When we used Non-Follow Through scores, we entered them as covariates along with other covariates. The procedure adopted by the House committee amounted, in effect, to arbitrarily assigning Non-Follow Through mean scores a regression weight of 1 while giving all other variables empirically determined regression weights. We could not see any rational basis for such a deviation from ordinary procedures for statistical adjustment.
3. We combined all data from one site as a single observation, regardless of cohort. The House committee appear to have treated different cohorts from the same site as if they were different sites. This seemed to us to violate the rationale for analyzing data at the site level in the first place.
4. We restricted the analysis to models having data on 6 or more sites. To include in the analysis models having as few as 2 sites, as the House committee did, would, it seemed to us, reduce the power of the statistical tests to an absurd level.
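To make decisions 3 and 4 concrete, the aggregation they imply can be sketched as follows. This is a minimal illustration only: the file and column names are hypothetical, and the site-by-cohort means would have to be keyed in from the published Abt tables.

```python
import pandas as pd

# Hypothetical site-by-cohort table keyed in from the published Abt reports.
# Columns: model, site, cohort, n, SES, EL, WRAT, NFT_reading (etc.), plus
# the MAT subtest means for that cohort at that site.
rows = pd.read_csv("follow_through_site_cohort_means.csv")  # hypothetical file

# Decision 3: combine all cohorts from one site into a single observation,
# using n-weighted means for covariates and outcomes alike.
def n_weighted(group):
    w = group["n"]
    cols = [c for c in group.columns if c not in ("model", "site", "cohort", "n")]
    out = {c: (group[c] * w).sum() / w.sum() for c in cols}
    out["n"] = w.sum()
    return pd.Series(out)

sites = rows.groupby(["model", "site"]).apply(n_weighted).reset_index()

# Decision 4: keep only models represented by six or more sites.
site_counts = sites.groupby("model")["site"].count()
sites = sites[sites["model"].isin(site_counts[site_counts >= 6].index)]
```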
The data analysis that followed from the above-mentioned decisions was
quite straightforward and conventional. The dependent variable was always the
mean score for a site on one or more Metropolitan Achievement Test subtests,
averaged over all subjects in cohorts II and III for whom data were reported in
the Abt Associates reports. Models, which ranged from 6 to 12 in number of
sites, were compared by analysis of covariance, using some or all of the
following covariates:
SES-An index of socio-economic status calculated by
Abt for each cohort at each site. When more than one cohort represented a site,
an n-weighted mean was computed.
EL-An index of ethnic and linguistic
difference from the mainstream-treated in a manner similar to
SES.
WRAT-Wide Range Achievement Test, administered near time of entry to
Follow Through students. Taken as a general measure of academic
readiness.
NFT-Mean score of local Non-Follow Through students on the
dependent variable under analysis. As a covariate, NFT scores may be expected to
control for unmeasured local or regional characteristics affecting scholastic
achievement.
Two other covariates were tried to a limited extent: Raven
Progressive Matrices scores (which, though obtained after rather than before
treatment, might be regarded as primarily reflecting individual differences
variance not affected by treatment) and a score indicating the number of years
of Follow Through treatment experienced by subjects at a site (most Follow
Through groups entered in kindergarten, thus receiving four years of Follow
Through treatment; but some entered in first grade and received only three
years). Our overall strategy for use of analysis of covariance was as follows:
recognizing that reasonable cases could be made for and against the use of this
covariate or that, we would try various combinations and, in the end, would take
seriously only those results that held up over a variety of reasonable covariate
sets.
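The covariance analyses themselves can then be run in a few lines. The sketch below uses the site-level file from the preceding sketch and an ordinary least-squares formula interface, with "reading" and "NFT_reading" standing in for whichever MAT subtest is under analysis; the names are ours, not Abt's, and the sketch is illustrative rather than a reconstruction of the original computations.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# "Full" analysis for one dependent variable: site-level reading means,
# adjusted for the four main covariates, with model as the factor of interest.
full = smf.ols("reading ~ C(model) + SES + EL + WRAT + NFT_reading",
               data=sites).fit()
print(sm.stats.anova_lm(full, typ=2))   # F test for the model factor

# "Conservative" analysis: drop the contested covariates (WRAT and NFT).
conservative = smf.ols("reading ~ C(model) + SES + EL", data=sites).fit()
print(sm.stats.anova_lm(conservative, typ=2))
```

Running both, and intermediate covariate sets besides, is what the strategy described above amounts to: only results that hold up across the reasonable sets are taken seriously.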
Results
Differences in achievement test performance-Two analyses of covariance will
be reported here, with others briefly summarized. Figure 1 displays adjusted and
standardized means from what we call the "full" analysis of covariance-that is,
an analysis using the four main covariates (SES, EL, WRAT, and NFT) described in
the preceding section. The virtue of this analysis is that it controls for all
the main variables that previous investigators have tried, in one way or other,
to control for in comparing Follow Through models.
Table 1* notes
pair-wise differences which are significant at the .05 level by Newman-Keuls
tests.
Figure 2* and Table 2* show comparable
data for what we call the "conservative" analysis.
This analysis is conservative in the sense that it
eliminates covariates for which there are substantial empirical and/or rational
grounds for objection. Grounds for objecting to the NFT variable as a covariate
have been amply documented in Abt reports and echoed in the report of the House
committee (House et al, 1978a); they will not be repeated here. Use of WRAT as a
covariate has been objected to on grounds that it is not, as logically required,
antecedent to treatment (Becker & Carnine, Ref. Note 1)-that is, the WRAT,
though nominally a pretest, was in fact administered at a time when at least one
of the models had already purportedly taught a significant amount of the content
touched on by the WRAT. While we would not suppose the SES and EL variables to
be above reproach, we have not encountered criticisms suggesting their use would
seriously bias results-whereas not to control for these variables would
unquestionably leave the results biased in favor of models serving less
disadvantaged populations. Accordingly, we have chosen them as the conservative
set of covariates.
Other analyses, not reported, used different
combinations of covariates from among those mentioned in the preceding section.
In every case, these analyses yielded adjusted scores intermediate between those
obtained from the "full" and the "conservative" analyses. Consequently, the
results shown in Figures 1 and 2 may be taken to cover the full range of those
observed.
In every analysis, differences between models were significant
at or beyond the .05 level on every achievement variable-almost all beyond the
.01 level. As Figures 1 and 2 show, models tended to perform about the same on
every achievement variable. Thus there is little basis for suggesting that one
model is better at one thing, another at another.
The relative standing
of certain models, particularly the Tucson Early Education Model, fluctuated
considerably depending on the choice of covariates.1 Two models, however, were
at or near the top on every achievement variable, regardless of the covariates
used; these were Direct Instruction and Behavior Analysis. Two models were at or
near the bottom on every achievement variable, regardless of the covariates
used; these were the EDC Open Education Model and Responsive Education.
Differences between the two top models and the two bottom models were in most
cases statistically significant by Newman-Keuls tests.
Variability
between sites-The only empirical finding that the House committee was willing to
credit was that there was enormous variability of effects from site to site
within Follow Through models. In their words: "Particular models that worked
well in one town worked poorly in another. Unique features of the local settings
had more effect on achievement than did the models" (House et al, 1978a, p.
156). This conclusion has recently been reiterated by the authors of the Abt
evaluation report (St. Pierre, Anderson, Proper, & Stebbins, 1978) in almost
the same words.
The ready acceptance of this conclusion strikes us as
most puzzling. It is conceivable that all of the variability between sites
within models is due to mismatch between Follow Through and Non-Follow Through
groups. This is unlikely, of course, but some of the variability between sites
must be due to this factor, and unless we know how much, it is risky to make
statements about the real variability of effects. Furthermore there is, as far
as we are aware, no evidence whatever linking achievement to "unique features of
the local setting." This seems to be pure conjecture-a plausible conjecture, no
doubt, but not something that should be paraded as an empirical
finding.
Our analyses provide some basis for looking at the between-site
variability question empirically. Follow Through sites varied considerably in
factors known to be related to achievement-socioeconomic status, ethnic
composition, WRAT pretest scores, etc. To say that the variance in achievement
due to these factors was greater than the variance due to model differences may
be true but not very informative. It amounts to nothing more than the
rediscovery of individual differences and is irrelevant to the question of how
much importance should be attached to variation among Follow Through models. To
say that differences in educational method are trivial because their effects are
small in comparison to the effect of demographic characteristics is as absurd as
saying that diet is irrelevant to children's weight because among children
weight variations due to diet are small in comparison to weight variations due
to age.
Figure 1*
Standardized adjusted mean Metropolitan Achievement Test scores obtained
from "full" covariance analysis (rounded to the nearest even
tenth).
The variability issue may be more cogently formulated as
follows: considering only the variance in achievement that cannot be accounted
for by demographic and other entering characteristics of students, what part of
that variance can be explained by differences in Follow Through models and what
part remains unexplained? Our analyses provide an approximate answer to this
question, since covariance adjustments act to remove variance among sites due to
entering characteristics. Depending on the achievement test variable considered
and on the covariates used, we found model differences to account for roughly
between 17 and 55 per cent of the variance not attributable to covariates (as
indexed by ω²).
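For the record, a figure of this kind can be recovered from a covariance table like the one produced in the earlier sketch. The denominator below counts only the model and residual sums of squares, which is our reading of "variance not attributable to covariates"; the exact convention used in the original computation is an assumption on our part.

```python
import statsmodels.api as sm

# `full` is the fitted site-level covariance model from the earlier sketch.
table = sm.stats.anova_lm(full, typ=2)
ss_model = table.loc["C(model)", "sum_sq"]
df_model = table.loc["C(model)", "df"]
ss_error = table.loc["Residual", "sum_sq"]
ms_error = ss_error / table.loc["Residual", "df"]

# Omega-squared for the model factor, taken against only the variance that
# the covariates leave unexplained (model + residual sums of squares).
omega_sq = (ss_model - df_model * ms_error) / (ss_model + ss_error + ms_error)
print(round(omega_sq, 2))
```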
Figure 2*
Standardized adjusted mean Metropolitan Achievement Test scores obtained
from "conservative" covariance analysis (rounded to the nearest even
tenth).
These results are shown graphically in Figures 1 and 2.
Adjusted mean scores are displayed there in units of the standard deviation of
residual site means. Thus, to take the most extreme case, in Figure 2 the
adjusted mean score of Direct Instruction sites on Language Part B is 3.6
standard deviations above the adjusted mean score of EDC Open Education
sites-that is, 3.6 standard deviations of between-site residual variability; in
other words, an enormous difference compared to differences between sites within
models. That is the most extreme difference, but in no case is the adjusted
difference between highest and lowest model less than 1.4 standard deviations.
Although what constitutes a "large" effect must remain a matter of judgment, we
know of no precedent according to which treatment effects of this size could be
considered small in relation to the unexplained variance.
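The standardization behind Figures 1 and 2 can be reconstructed in the same spirit: hold the covariates at their overall means, average the resulting adjusted predictions within each model, and express the result in units of the residual site-level standard deviation. The sketch below continues the earlier hypothetical analysis; centering at the grand adjusted mean is our choice for display purposes.

```python
import numpy as np
import pandas as pd

# Adjusted model means: predict each site's score with the covariates held
# at their overall means, then average the predictions within each model.
at_means = sites.copy()
for cov in ["SES", "EL", "WRAT", "NFT_reading"]:
    at_means[cov] = sites[cov].mean()
adjusted = pd.Series(np.asarray(full.predict(at_means)), index=sites.index)
by_model = adjusted.groupby(sites["model"]).mean()

# Scale: the standard deviation of residual site means from the same model.
residual_sd = np.sqrt(full.mse_resid)
standardized = (by_model - by_model.mean()) / residual_sd
print(standardized.round(1))
```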
Treatment
effects on other variables-Although the principal concern of this study was with
achievement test differences, the method of analysis is adaptable to studying
differences in other outcomes as well. Accordingly we ran several briefer
analyses, looking at what Abt Associates call "cognitive / conceptual" and
"affective" outcomes.
Two kinds of measures used in the Follow Through
evaluation were regarded by Abt Associates as reflecting "cognitive /
conceptual" outcomes- Raven's Progressive Matrices (a nonverbal intelligence
test) and several Metropolitan subtests judged to measure indirect cognitive
consequences of learning. The House committee objected to Progressive Matrices
on grounds that it is insensitive to school instruction. This rather begs the
question of effects of cognitively-oriented teaching, however. True, Progressive
Matrices performance may be insensitive to ordinary kinds of school instruction,
but does that mean it will be insensitive to novel instructional approaches
claiming to be based on cognitive theories and declaring such objectives as "the
ability to reason" and "logical thinking skills in four major cognitive areas
(classification, seriation, spatial relations and temporal relations)"? It seems
that this should be an empirical question.
If it is an empirical
question, the answer is negative. Using the same kinds of covariance analyses as
were used on the achievement test variables, we found no statistically
significant differences between Follow Through models in Progressive Matrices
performance. This finding is consistent with the Abt Associates' analyses, which
show few material effects on this test, and more negative than positive
ones.
Among Metropolitan subtests the most obviously "cognitive" are
Reading (which is, in effect, paragraph comprehension) and Mathematics
Problem-Solving. As indicated in Figures 1 and 2, our analyses show differences
among models on these subtests that are similar in trend to those found on the
other subtests. They tend, however, to be of lesser magnitude. The most obvious
explanation for the lesser magnitude of difference on these subtests is the same
as that offered by House et al for the absence of differences on Progressive
Matrices-that these subtests, reflecting more general differences in
intellectual ability, are less sensitive to instruction. There is, however, a
further hypothesis that should be tested. Conceivably, certain models-let us say
those that avowedly emphasize "cognitive" objectives-are doing a superior job of
teaching the more cognitive aspects of reading and mathematics, but the effects
are being obscured by the fact that performance on the appropriate subtests
depends on mechanical proficiency as well as on higher-level cognitive
capabilities. If so, these hidden effects might be revealed by using performance
on the more "mechanical" subtests as covariates.
This we did. Model
differences in Reading (comprehension) performance were examined, including Word
Knowledge as a covariate. Differences in Mathematics Problem Solving were
examined, including Mathematics Computation among the covariates. In both cases
the analyses of covariance revealed no significant differences among models.
This is not a surprising result, given the high correlation among Metropolitan
subtests. Taking out the variance due to one subtest leaves little variance in
another. Yet it was not a foregone conclusion that the results would be negative.
If the models that proclaimed cognitive objectives actually achieved those
objectives, it would be reasonable to expect those achievements to show up in
our analyses.
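In terms of the earlier sketches, this check amounts to adding the "mechanical" subtest to the covariate set; the column names below are again hypothetical, and the sketch shows only the form of the analysis, not the actual variables reported by Abt.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Same site-level file as before; word_knowledge, math_problem_solving, and
# math_computation are hypothetical column names for the relevant subtests.
reading_adj = smf.ols("reading ~ C(model) + SES + EL + WRAT + NFT_reading"
                      " + word_knowledge", data=sites).fit()
math_adj = smf.ols("math_problem_solving ~ C(model) + SES + EL + WRAT"
                   " + NFT_math + math_computation", data=sites).fit()
print(sm.stats.anova_lm(reading_adj, typ=2))
print(sm.stats.anova_lm(math_adj, typ=2))
```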
The same holds true for performance on the affective
measures included in the Follow Through evaluation. The Abt Associates' analyses
show that the ranking of models on affective measures corresponds closely to
their ranking on achievement measures. House et al point out, however, that the
instruments used place heavy demands on verbal skills. Conceivably, therefore,
if reading ability were controlled statistically, the results might tell a
different story. We analyzed scores on the Coopersmith Self-Concept Inventory,
including reading subtest scores along with the other covariates. The result
showed no significant difference among models on the Coopersmith. This finding
could mean either that there are no differences between models in effects on
self-concept or that self-concept among disadvantaged third-graders is
sufficiently dependent on reading ability that, when one statistically removes
reading ability differences, one at the same time removes genuine self-concept
differences. We know of no way to resolve this ambiguity with the available
data. One thing is clear, however: removing effects due to reading achievement
does not in any way yield results either favoring models that emphasize
self-concept or disfavoring models that emphasize academic objectives.
Discussion
Before attempting to give any interpretation of Follow Through results, we
must emphasize the main finding of our study-that there were results. Follow
Through models were found to differ significantly on every subtest of the
Metropolitan Achievement Test.
Let us briefly compare our findings with
those of Abt Associates and the House committee.
1. We disagree with both
Abt and House et al in that we do not find variability among sites to be so
great that it overshadows variability among models. It appears that a large part
of the variability observed by Abt and House et al was due to demographic
factors and experimental error. Once this variability is brought under control,
it becomes evident that differences between models are quite large in relation
to the unexplained variability within models.
2. Our findings on the
ranking of Follow Through models on achievement variables are roughly in accord
with those of the House committee, but we differ from the House committee in
finding significant differences among models on all achievement variables
whereas they found almost none. The similarities are no doubt due to the fact
that the two analyses used the same basic units-site-level means. The difference
in significance of outcomes is apparently due to the variety of ways (previously
discussed) in which our analysis was more powerful than theirs.
3. The
Abt Associates' results indicate that among major Follow Through models, there
is only one "winner" in the sense of having a preponderance of positive
effects-namely, Direct Instruction. All other models showed predominantly null
or negative effects. Our results are not exactly comparable in that we compared
Follow Through models only with one another and not with Non-Follow Through
groups; consequently we cannot speak of "positive" or "negative" effects.
However, our results show two models to be above average on all achievement
subtests and two models to be below average on all subtests. Thus our results
may be said to indicate two "winners"-Direct Instruction and Behavior Analysis-
and two "losers"-EDC Open Education and Responsive Education.
We put the
words "winners" and "losers" in quotation marks because, of course, Follow
Through was not a contest with the object of attaining the highest possible
achievement test scores. It simply happens that the outcomes on which Follow
Through models are found to differ are achievement test scores. That other
criteria might have shown different winners and losers (a point heavily
emphasized by the House committee) must remain a conjecture for which all the
available evidence is negative. What we have are achievement test differences,
and we must now turn to the question of what those differences might
mean.
It lies outside the scope of this paper to discuss the importance
of scholastic achievement itself. The more immediate issue is whether the
observed differences in achievement test scores reflect actual differences in
mastery of reading, mathematics, spelling, and language.
One obvious
limitation that must be put on the results is that the Metropolitan Achievement
Test, like all other standardized achievement batteries, covers less than the
full range of achievement objectives. As House et al point out, the test does
not cover "even such straightforward skills as the ability to read aloud, to
write a story, or to translate an ordinary problem into numbers" (1978b, p.
473). This much is certainly true, but House et al then go on to say, "it would
be reckless to suppose that the results of the testing indicate the attainment
of these broader goals" (p. 473). "Reckless" is far too strong a word here.2
From all we know about the intercorrelation of scholastic skills, one could be
fairly confident in assuming that children who perform above average on the MAT
would also perform above average on tests of the other skills mentioned. A
glance again at Figures 1 and 2 tells us that achievements in a variety of areas
tend to go together. Given the homogeneous drift of scores downward from left to
right in those figures, it is hard to imagine another set of achievement
measures in mathematical and language skills that would show a trend in the
opposite direction. Such a trend cannot be declared impossible, of course, but
if House et al expect us to take such a possibility seriously, then they ought
to provide some evidence to make it plausible.
A more serious kind of
charge is that the MAT is biased in favor of certain kinds of programs. If true,
this could mean that the observed test score differences between models reflect
test bias and not true differences on the achievement variables that the test is
supposed to measure. We must be very careful, however, in using the term bias.
One sometimes hears in discussions of Follow Through statements that the MAT is
biased in favor of models that teach the sort of content measured by the MAT.
This is a dangerous slip in usage of the word bias and must be avoided. It makes
no sense whatever to call it bias when an achievement test awards higher scores
to students who have studied the domain covered by the test than to students who
have not. It would be a very strange achievement test if it did not.
It
is meaningful, however, to say that an achievement test is biased in its
sampling of a domain of content, but even here one must be careful not to abuse
the term. The Mathematics Concept subtest of the MAT, for instance, is a
hodge-podge of knowledge items drawn from "old math," "new math," and who knows
what. For any given instructional program, it will likely be found that the test
calls for knowledge of material not covered by that program-but that doesn't
mean the test is biased against the program. The test obviously represents a
compromise that cannot be fully satisfactory to any program. The only ground for
a charge of bias would be that the compromise was not even-handed. Investigating
such a charge would require a thorough comparison of content coverage in the
test and content coverage in the various Follow Through programs. It does no
good to show that for a particular program there are discrepancies between
content covered and content tested. The same might be equally true of every
program.
As far as the Follow Through evaluation goes, the only MAT
subtest to which a charge of content bias might apply (we have no evidence that
it does) is Mathematics Concepts. The other subtests all deal with basic skills
in language and mathematics. Different programs might teach different methods of
reading or doing arithmetic, and they might give different amounts of emphasis
to these skills, but the skills tested on the MAT are all ones that are
appropriate to test regardless of the curriculum. Even if a particular Follow
Through model did not teach arithmetic computation at all, it would still be
relevant in an assessment of that program to test students' computational
abilities; other people care about computation, even if the Follow Through
sponsor does not. The reason why Mathematics Concepts may be an exception is
that, while everyone may care about mathematical concepts, different people care
about different ones, and so a numerical score on a hodge-podge of concepts may
not be informative.
While such skill tests as those making up the bulk of
the MAT are relatively immune to charges of content bias, they can be biased in
other ways. They may, perhaps, be biased in the level of cognitive functioning
that they tap within a skill area. The House committee implies such a bias when
they say, "the selection of measures favors models that emphasize rote learning
of the mechanics of reading, writing, and arithmetic" (House et al, 1978a, p.
145). This is a serious charge and, if true, would go some way toward
discrediting the findings.
But House et al offer no support for this
charge, and on analysis it seems unlikely that they could. Their statement rests
on three assumptions for which we know of no support: (1) that "the mechanics of reading, writing, and arithmetic" can be successfully taught by rote; (2) that there were Follow Through models that emphasized rote learning (the model descriptions provided by Abt give no suggestion that this is true);3 and (3) that
the MAT measures skills in such a way that the measurement favors children who
have learned those skills by rote rather than through a meaningful process. We
must conclude, in fact, that since the House committee could not have been so
naive as to hold all three of these assumptions, they must have introduced the
word "rote" for rhetorical effect only. Take the word out and their statement
reduces to an unimpressive complaint about the limited coverage of educational
objectives in the Follow Through evaluation.
A final way in which skill
tests might be biased is in the form of the test problems. Arithmetic
computation problems, for instance, might be presented in notation that was
commonly employed in some programs and not in others; or reading test items
might use formats similar to those used in the instructional materials of one
program and not another. Closely related to this is the issue of "teaching for
the test"-when this implies shaping the program to fit incidental features of a
test such as item formats. We may as well throw in here the issue of
test-wiseness itself as a program outcome-that is, the teaching of behaviors
which, whether intended to do so or not, help children perform well on
tests-since it bears on the overall problem of ways in which a program might
achieve superior test scores without any accompanying superiority in actual
learning of content. In short, children in some programs might simply get better
at taking tests.
If one looks at the Direct Instruction and Behavior
Analysis models, with their emphasis on detailed objectives and close monitoring
of student progress, and compares them to EDC Open Education, with its disavowal
of performance objectives and repudiation of standardized testing, it is
tempting to conclude in the absence of any evidence that the former models must
surely have turned out children better prepared to look good on tests,
regardless of the children's true states of competence. Without wishing to
prejudge the issue, we must emphasize that it is an empirical question to what
extent children schooled in the various Follow Through models were favored or
disfavored with respect to the process of testing itself.
In general,
children involved in the Follow Through evaluation were subjected to more
standardized testing than is normal. Since studies of test-wiseness indicate
rapidly diminishing returns from increasing amounts of familiarization with
testing (Cronbach, 1960), there is presumptive evidence against claims that
differential amounts of test-taking among models could be significant in
accounting for test-score differences. It should be possible to investigate this
matter with Follow Through data, though not from the published data. Children in
the final Follow Through evaluation had been subjected to from two to five
rounds of standardized testing. Accordingly it should be possible to evaluate
the effect of frequency of previous testing on third-grade test
scores.
There are, however, numerous ways in which Follow Through
experience could affect children's behavior during testing. The amount of
experience that children in any program had with actual test-taking is probably
trivial in comparison to the amount of experience some children got in doing
workbook pages and similar sorts of paper-and-pencil activities. And the nature
of these activities might have varied from ones calling for constructed
responses, quite unlike those on a multiple-choice test, to ones that amounted
virtually to a daily round of multiple-choice test-taking. Programs vary not
only in the amount of evaluation to which children are subjected but also in the
manner of evaluation, be it covert, which might have little effect on the
children, or face-to-face and oral, or carried out through group testing.
Finally, given that testing conditions in the Follow Through evaluation were not
ideal, it is probably relevant how well children in the various programs learned
to cheat effectively, that is, to copy from the right neighbor.
Some
or most of these variables could be extracted from available information, and it
would be then possible to carry out analyses showing the extent to which they
account for test scores and for the score differences between models. Only
through such a multivariate empirical investigation could we hope to judge how
seriously to take suggestions that the score differences among models were
artifactual. Until that time, insinuations about "teaching for the test" must be
regarded as mere prejudice.
What Do The Results Mean?
What we have tried to establish so far is that there are significant
achievement test differences between Follow Through models and that, so far as
we can tell at present, these test score differences reflect actual differences
in school learning. Beyond this point, conclusions are highly conjectural.
Although our main purpose in this paper has been simply to clarify the empirical
results of the Follow Through experiment, we shall venture some interpretive
comments, if for no other purpose than to forestall possible
misinterpretations.
The two high-scoring models according to our analysis
are Direct Instruction and Behavior Analysis; the two low-scoring are EDC Open
Education and Responsive Education. If there is some clear meaning to the Follow
Through results, it ought to emerge from a comparison of these two pairs of
models. On the one hand, distinctive characteristics of the first pair are easy
to name: sponsors of both the Direct Instruction and Behavior Analysis models
call their approaches "behavioral" and "structured" and both give a high
priority to the three R's. EDC and Responsive Education, on the other hand, are
avowedly "child-centered." Although most other Follow Through models could also
claim to be child-centered, these two are perhaps the most militantly so and
most opposed to what Direct Instruction and Behavior Analysis stand
for.
Thus we have, if we wish it, a battle of the philosophies, with the
child-centered philosophy coming out the loser on measured achievement, as it
has in a number of other experiments (Bennett, 1976; Stallings, 1975; Bell and
Switzer, 1973; Bell, Zipousky & Switzer, 1976). This is interesting if one
is keen on ideology, but it is not very instructive if one is interested in
improving an educational program. Philosophies don't teach kids. Events teach
kids, and it would be instructive to know what kinds of events make the
difference in scholastic achievement that we have observed.
The teaching
behavior studies of Brophy & Good (1974), Rosenshine (1976), and Stallings
& Kaskowitz (1974) are helpful on this point. Generally they contrast direct
with informal teaching styles, a contrast appropriate to the two kinds of models
we are comparing. Consistently it is the more direct methods, involving clear
specifications of objectives, clear explanations, clear corrections of wrong
responses, and a great deal of "time on task," that are associated with superior
achievement test performance. The effects tend to be strongest with
disadvantaged children.
These findings from teacher observation studies
are sufficiently strong and consistent that we may reasonably ask what, if
anything, Follow Through results add to them. They add one very important
element, the element of experimental change. The teacher observation studies are
correlational. They show that teachers who do x get better achievement results
than those who do y. The implication is that if the latter teachers switched
from doing y to doing x, they would get better results, too; but correlational
studies can't demonstrate that. Perhaps teachers whose natural inclination is to
do y will get worse results if they try to do x. Or maybe teachers who do y
can't, or worse, won't do x. Or maybe x and y don't even matter; they only serve
as markers for unobserved factors that really make the difference.
The
Follow Through experiment serves, albeit imperfectly, to resolve these
uncertainties. Substantial resources were lavished on seeing to it that teachers
didn't just happen to use direct or informal methods according to their
inclinations but rather that they used them according to the intent of the model
sponsors. The experimental control was imperfect because communities could
choose what Follow Through model to adopt, and in some cases, we understand,
teachers could volunteer to participate. Nevertheless, it seems safe to assume
that there was some sponsor effect on teacher behavior in all instances, so that
some teachers who would naturally do x were induced to do y and vice versa.
Thus, with tentativeness, we can infer from Follow Through results that getting
teachers of disadvantaged children to use more direct instructional methods as
opposed to more informal ones will lead to superior achievement in commonly
tested basic skills.
Before concluding, however, that what accounts for
the superior achievement test scores of Direct Instruction and Behavior Analysis
sites is their use of direct teaching methods, we should consider a more
profound way in which these two models are distinguished from the others. These
models are distinctive not only at the level of immediately observable teacher
behavior but also at a higher level which may be called the systemic. One may
observe a lesson in which the teacher manifests all the usual signs of direct
teaching: a lively manner, clear focus on instructional objectives, frequent
eliciting of responses from students, etc. One may return weeks later to find the
same teacher with the same class manifesting the same direct teaching
behavior-and still teaching the same lesson! The fault here is at the systemic
level: the teacher is carrying out the sorts of activities that should result in
learning but is failing to organize and regulate them in such a way as to
converge on the intended objectives.
More effective teachers (and this includes the great majority) function according to a convergent system. Consider
a bumbling Mr. Chips introducing his pupils to multiplication by a two-digit
multiplier. He demonstrates the procedure at the chalkboard and then discovers
that most of the students cannot follow the procedure because they have
forgotten or never learned their multiplication facts. So he backs up and
reviews these facts, then demonstrates the algorithm again and assigns some
practice problems. Performance is miserable, so he teaches the lesson again. By
this time some children get it, and they teach others. With a bit of help, most
of the class catches on. Mr. Chips then gives special tutoring, perhaps with use
of supplementary concrete materials, to the handful of students who haven't yet
got it. Finally everyone has learned the multiplication algorithm except for the
slowest pupils in the class-who, as a matter of fact, haven't yet learned to add
either.
Although none of the procedures used by Mr. Chips are very
efficient, he applies them in a convergent way so that eventually almost all the
children reach the instructional objective. Some of his procedures may not have
a convergent effect at all. For instance, he may assign practice worksheets to
pupils who haven't yet grasped the algorithm, and the result is that they merely
practice their mistakes (a divergent activity). But the overall effect is
convergent. Given more efficient activities, convergence on the instructional
goal might be more rapid and it might include the pupils who fail at the hands
of Mr. Chips. But the difference in effectiveness, averaged over all pupils,
would probably not be great. This convergent property of teaching no doubt
contributes, as Stephens (1967) has suggested, to the scarcity of significant
differences between teaching methods. Unless severely constrained, most teachers
will see to it that, one way or another, their students reach certain goals by
the end of the term.
We suggest that teaching performance of the kind
just described be taken as baseline and that innovative educational practices,
such as those promoted by the Follow Through sponsors, be judged in relation to
that baseline. What would happen to the teaching of our Mr. Chips if he came
under the supervision of a Follow Through sponsor? It seems fairly clear that
his system for getting students to reach certain goals by the end of the term
would be enhanced if he took guidance from a Direct Instruction or Behavior
Analysis sponsor but that it might well be disrupted by guidance from one of the
more child-centered sponsors.
What Direct Instruction and Behavior
Analysis provide are more fully developed instructional systems than teachers
normally employ. They provide more systematic ways of determining whether
children have the prerequisite skills before a new step in learning is
undertaken, more precise ways of monitoring what each child is learning or
failing to learn, and more sophisticated instructional moves for dealing with
children's learning needs. Open Education and Responsive Education, on the other
hand, because of their avowed opposition to making normative comparisons of
students or thinking in terms of deficits, will tend to discourage those
activities whereby teachers normally discover when children are not adequately
prepared for a new step in learning or when a child has mislearned or failed to
learn something. Also, because of their preference for indirect learning
activities, these models will tend to make teaching less sharply focused on
achieving specific learnings and remedying specific lacks.
Of course,
child-centered educators will wish to describe the matter differently, arguing
that they do have a well-developed system for promoting learning; but it is a
different kind of system pursuing different kinds of goals from those pursued by
the direct instructional approaches. They will point out that child-centered
teachers devote a great deal of effort to identifying individual pupils'
learning needs and to providing learning experiences to meet these needs; it is
just that their efforts are more informal and intuitive, less programmed.
Child-centered education, they will argue, is different, not
inferior.
One is inclined automatically to assent to this
live-and-let-live assessment, which relegates the differences between
educational methods to the realm of personal values and ideology. But surely the
Follow Through experiment and any comparative evaluation will have been in vain
if we take this easy way out of the dilemma of educating disadvantaged
children.
This easy way of avoiding confrontation between the two
approaches can be opposed on both empirical and theoretical grounds.
Empirically, child-centered approaches have been unable to demonstrate any
off-setting advantages to compensate for their poor showing in teaching the
three R's. House et al (1978a) have argued that the selection of measures used
in the Follow Through evaluation did not give child-centered approaches adequate
opportunity to demonstrate their effects. This may be true to a degree, but it
is certainly not true that child-centered approaches had no opportunity to
demonstrate effects relevant to their purposes. One had better not be a
perfectionist when it comes to educational evaluation. No measure is perfectly
correlated to one's objectives. The most one can hope for is a substantial
correlation between obtained scores on the actual measures and true scores on
the ideally appropriate measures that one wishes existed but do not.
When
child-centered educators purport to increase the self-esteem of disadvantaged
children and yet fail to show evidence of this on the Coopersmith Self-Concept
Inventory, we may ask: what real and substantial changes in self-esteem would one
expect to occur that would not be reflected in changes on the Coopersmith?
Similarly for reasoning and problem-solving. If no evidence of effect shows on a
test of non-verbal reasoning, or a reading comprehension test loaded with
inferential questions, or on a mathematical problem solving test, we must ask
why not? What kinds of real, fundamental improvements in logical reasoning
abilities would fail to be reflected in any of these tests?
If these
remarks are harsh, it is only because we believe that the question of how best
to educate disadvantaged children is sufficiently serious that a policy of
live-and-let-live needs to be replaced by a policy of put-up-or-shut-up.
Certainly the cause of educational betterment is not advanced by continual
appeal to nonexistent measures having zero or negative correlations with
existing instruments purporting to measure the same thing. Among the numerous
faults that we have found with the House committee's report, their use of this
appeal is the only one that deserves the label of sophistry.
Critique of the Child-centered Approach
What follows is an attempt at a constructive assessment of the child-centered
approach as embodied in the Open Education and Responsive Education models. By
constructive we mean that we take seriously the goals of these models and that
our interest is in realizing the goals rather than in scrapping them in favor of
others. These remarks are by way of preface to the following observation:
child-centered approaches have evolved sophisticated ways of managing informal
educational activities but they have remained at a primitive level in the design
of means to achieve learning objectives.
We are here distinguishing
between two levels at which a system of teaching may be examined. At the
management level, an open classroom and a classroom running according to a token
economy, for example, are radically different, and while there is much to
dispute in comparing them, it is at least clear that both represent highly
evolved systems. When we consider the instructional design level, however, the
difference is more one-sided. Child-centered approaches rely almost exclusively
on a form of instruction that instructionally oriented approaches use only when
nothing better can be found.
This primitive form of instruction may be
called relevant activity. Relevant activity is what teachers must resort to when
there is no available way to teach children how to do something, no set of
learning activities that clearly converge on an objective. This is the case, for
instance, with reading comprehension. Although there are some promising
beginnings, there is as yet no adequate "how-to-do-it" scheme for reading
comprehension. Accordingly, the best that can be done is to engage students in
activities relevant to reading comprehension: for instance, reading selections
and answering questions about the selections. Such activities are relevant in
that they entail reading comprehension, but they cannot be said to teach reading
comprehension.
For many other areas of instruction, however, more
sophisticated means have been developed. There are, for instance, ways of
teaching children how to decode in reading and how to handle equalities and
inequalities in arithmetic (Engelmann, Ref. Note 2). The instructional
approaches used in Direct Instruction and Behavior Analysis reflect years of
analysis and experimentation devoted to finding ways of going beyond relevant
activity to forms of instruction that get more directly at cognitive skills and
strategies. This effort has been successful in some areas, not so successful in
others, but the effort goes on. Meanwhile, child-centered approaches have tended
to fixate on the primitive relevant activities form of instruction for all their
instructional objectives.
The contrast of sophistication in management
and naiveté in instruction is visible in any well-run open classroom. The
behavior that meets the eye is instantly appealing (children quietly absorbed in
planning, studying, experimenting, making things), and one has to marvel at the
skill and planning that have achieved such a blend of freedom and order. But
look at the learning activities themselves and one sees a hodge-podge of the
promising and the pointless, of the excessively repetitious and the excessively
varied, of tasks that require more thinking than the children are capable of and
tasks that have been cleverly designed to require no mental effort at all (like
exercise sheets in which all the problems on the page have the same answer). The
scatteredness is often appalling. There is a little bit of phonics here and a
little bit of phonics there, but never a sufficiently coherent sequence to
enable a kid to learn how to use this valuable tool. Materials have been chosen
for sensorial appeal or suitability to the system of management. There is a
predilection for cute ideas. The conceptual analysis of learning problems tends
to be vague and irrelevant, big on name-dropping and low on
incisiveness.
There does not appear to be any intrinsic reason why
child-centered educators should have to remain committed to primitive
instructional approaches. So far, child-centered educators have been able to
gain reassurance from the fact that, for the objectives they emphasize (objectives
in comprehension, thinking, and feeling), their approaches are no more ineffective
than anyone else's. But even this defense may be crumbling. Instructional
designers, having achieved what appears to be substantial success in improving
the teaching of decoding in reading, basic mathematical concepts and operations,
spelling, and written English syntax, are now turning more of their attention to
the kinds of goals emphasized by child-centered educators. Unless thinkers and
experimenters committed to child-centered education become more sophisticated
about instruction and start devoting more attention to designing learning
activities that actually converge on objectives, they are in danger of becoming
completely discredited. That would be too bad. Child-centered educators have
evolved a style of school life that has much in its favor. Until they develop an
effective pedagogy to go with it, however, it does not appear to be an
acceptable way of teaching disadvantaged children.
*Graphs and tables in
this article could not be reproduced clearly in electronic
format.
Notes:
1. Reduced analyses were performed, dropping TEEM and
Cognitive Curriculum from the analysis. These were the two most unstable models
in the sense of shifting most in relative performance depending on the choice of
covariates. Moreover, Cognitive Curriculum had deviant relations between
criteria and covariates, showing for instance negative relationships between
achievement and SES. The only effect of removing these models, however, was to
increase the number of significant differences between the two top-scoring
models and the other models. (A minimal illustrative sketch of this kind of
reduced, covariate-adjusted comparison appears after these notes.)
2. Examined closely, the House et al
statement is a bit slippery. Since the MAT is a norm-referenced (not a
criterion-referenced) test, it is of course "reckless" to infer any particular
attainments at all from test scores. All we know is how a person or group
performs in comparison to others. If, for example, the criterion for "ability to
write a story" is set high enough, it would be reckless to suppose that any
third-grader had attained it.
3. The obvious targets for the charge of
emphasizing rote learning are Direct Instruction and Behavior Analysis. However,
the Direct Instruction sponsors explicitly reject rote memorization (Bock,
Stebbins, & Proper, 1977, p. 65) and the Behavior Analysis model description
makes no mention of it. House, Glass, McLean, and Walker seem to have fallen
into the common fallacy here of equating direct instruction with rote learning.
If they are like most university professors, they probably rely extensively on
direct instruction themselves and yet would be offended by the suggestion that
this means they teach by rote.
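To make the reduced analysis described in Note 1 concrete, the following is a minimal sketch of a covariate-adjusted comparison refit after dropping two models. It is illustrative only, not the analysis actually performed in the Follow Through evaluation or in this paper; the data frame, the column names (model, ses, pretest, posttest), and the use of Python with the pandas and statsmodels libraries are assumptions introduced for the example.

# Hypothetical sketch only: not the original analysis. Assumed columns:
# 'model' (Follow Through model name), 'ses', 'pretest', and 'posttest'
# (third-grade achievement score).
import pandas as pd
import statsmodels.formula.api as smf

def reduced_comparison(df: pd.DataFrame,
                       drop=("TEEM", "Cognitive Curriculum")):
    """Refit a covariate-adjusted comparison of models after dropping the two
    models whose relative standing shifted most with the choice of covariates."""
    kept = df[~df["model"].isin(drop)]
    # Achievement adjusted for SES and pretest; C(model) contrasts the models.
    return smf.ols("posttest ~ C(model) + ses + pretest", data=kept).fit()

# Example use, assuming a suitable data frame is available:
# fit = reduced_comparison(follow_through_df)
# print(fit.summary())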
Reference Notes:
1. Becker, W.C.,
& Carnine, D.W. Direct Instruction: A behavior-based model for comprehensive
educational intervention with the disadvantaged. Paper presented at the VIII
Symposium on Behavior Modification, Caracas, Venezuela, February, 1978. Division
of Teacher Education, University of Oregon, Eugene, Oregon.
2. Engelmann,
S. Direct Instruction. Seminar presentation. AERA, Toronto, March, 1978.
References
Bell, A.E., & Switzer, F. (1973). Factors related to pre-school prediction of academic achievement: Beginning reading in open area vs. traditional classroom systems. Manitoba Journal of Education, 8, 22-27.
Bell, A.E., Zipursky, M.A., & Switzer, F. (1977). Informal or open-area education in relation to achievement and personality. British Journal of Educational Psychology, 46, 235-243.
Bennett, N. (1976). Teaching styles and pupil progress. Cambridge, Mass.: Harvard University Press.
Brophy, J.E., & Good, T.L. (1974). Teacher-student relationships: Causes and consequences. New York: Holt, Rinehart & Winston.
Cronbach, L.J. (1960). Essentials of psychological testing (2nd ed.). New York: Harper & Brothers.
House, E.R., Glass, G.V., McLean, L.F., & Walker, D.F. (1978a). No simple answer: Critique of the "Follow Through" evaluation. Harvard Educational Review, 48(2), 128-160.
House, E.R., Glass, G.V., McLean, L.F., & Walker, D.F. (1978b). Critiquing a Follow Through evaluation. Phi Delta Kappan, 59(7), 473-474.
Rosenshine, B. (1976). Classroom instruction. In Seventy-fifth yearbook of the National Society for the Study of Education (Part 1). Chicago: University of Chicago Press.
St. Pierre, R.G., Anderson, R.B., Proper, E.C., & Stebbins, L.B. (1978). That Follow Through evaluation. Phi Delta Kappan, 59(10), 729.
Stallings, J.A., & Kaskowitz, D.H. (1974). Follow Through classroom observation evaluation, 1972-1973. Menlo Park, Cal.: Stanford Research Institute.
Stallings, J. (1975). Implementation and child effects of teaching practices in Follow Through classrooms. Monographs of the Society for Research in Child Development, 40(7-8, Serial No. 163).
Stebbins, L.B., St. Pierre, R.G., Proper, E.C., Anderson, R.B., & Cerva, T.R. (1977). A planned variation model. Vol. IV-A: Effects of Follow Through models. U.S. Office of Education.
Stephens, J. (1967). The process of schooling. New York: Holt, Rinehart & Winston.