In the first post of this series, I discussed the validity of A-ROI as a measure of cost-effectiveness. In this post, I focus on the uncertainty embedded in A-ROI results.
In the business world, ROI is largely treated as an accounting measure with certainty, but the certainty only applies to the accounting period. That is, for a three-year investment, the ROI result is both an accurate and a precise representation of its profitability over those three years only. Consumers of ROI information are admonished about risks when applying it to the future or to other contexts. Generally, little is provided about those risks beyond the fine print of "Past performance is no guarantee of future results," which can be found in almost every mutual fund prospectus and other investment disclosures.
In educational contexts, however, the A-ROI formulated by Levenson and traditional CER have been approached as point estimates whose accuracy and precision we cannot know with one-hundred percent certainty, due to both random errors (e.g., sampling error, measurement error, data entry error) and potentially systematic errors (e.g., omitted variables, model misspecification, research design flaws). This uncertainty has two direct implications for the use of A-ROI in high-stakes budgetary decision making. One is that we cannot simply judge the cost-effectiveness of an investment by its A-ROI result. The other is that assessing the worthiness of multiple investments is more complex than a straightforward comparison of A-ROI face values.
For a single point estimate of program effect, the conventional way of dealing with uncertainty associated with sampling error is to conduct null hypothesis significance testing (NHST), which involves using the 5% significance level to guard against false positives and conducting power analysis to gauge the likelihood of false negatives. Uncertainty associated with systematic errors is best addressed through rigorous research designs such as randomized controlled trials and quasi-experimental studies (Campbell & Stanley, 1963). There are also various sensitivity analysis techniques (Frank, 2000; Rosenbaum, 2010) to quantify the potential bias in, and robustness of, single point estimates.
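As a minimal sketch of the first two steps, here is how they might look in Python with simulated outcome data; the group means, sample sizes, and target effect size are all hypothetical.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Simulated outcome scores for program participants and a comparison group
# (all parameters are made up for illustration).
rng = np.random.default_rng(0)
treatment = rng.normal(loc=72, scale=10, size=120)
comparison = rng.normal(loc=70, scale=10, size=120)

# NHST: two-sample t-test at the conventional 5% significance level.
t_stat, p_value = stats.ttest_ind(treatment, comparison)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant at 5%: {p_value < 0.05}")

# Power analysis: chance of detecting a small effect (d = 0.2) with these sample sizes.
power = TTestIndPower().power(effect_size=0.2, nobs1=120, ratio=1.0, alpha=0.05)
print(f"Power to detect d = 0.2: {power:.2f}")
```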
With respect to comparing multiple point estimates, it is tempting but problematic to compare their statistical significance to assess which programs are more investment-worthy because, as explained by Gelman & Stern (2006), the difference between "significant" and "not significant" is not itself statistically significant. For effect sizes from two or more independent studies, their heterogeneity can instead be examined with the method developed by Rosenthal & Rubin (1982), provided estimates of variance are available.
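For concreteness, a heterogeneity test in the spirit of Rosenthal & Rubin (1982) might look like the sketch below; the effect sizes and variances are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical effect sizes and variances from three independent program evaluations.
effects = np.array([0.30, 0.12, 0.45])
variances = np.array([0.010, 0.012, 0.015])

# Precision-weighted mean and a chi-square heterogeneity statistic (df = k - 1).
weights = 1 / variances
weighted_mean = np.sum(weights * effects) / np.sum(weights)
Q = np.sum(weights * (effects - weighted_mean) ** 2)
df = len(effects) - 1
p_value = stats.chi2.sf(Q, df)

print(f"Q = {Q:.2f}, df = {df}, p = {p_value:.3f}")
```

A small p-value would indicate that the effect sizes differ by more than sampling fluctuation alone would explain.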
The aforementioned design and statistical methods are part of a large body of scholarly work that addresses uncertainty in scientific inquiries. These research designs and methods have undoubtedly improved our ability to reduce or quantify uncertainty in analysis results used for decision making. For practical purposes, however, they suffer from two main drawbacks. One is the complexity involved in employing these techniques, which makes their use rare in school districts, most of which do not have researchers or analysts on staff.
The more pressing issue, however, is communicating quantified uncertainty to practitioners effectively so that they can interpret statistical inference results properly for contextualized decisions, which requires clear, mutually understood terms (Fischhoff & Davis, 2014). Given the widespread misunderstanding and misinterpretation of statistical results among social science researchers (Mittag & Thompson, 2000; Nelson, Rosenthal, & Rosnow, 1986), it is hard to imagine the task will be any easier with practitioners.
In spite of their predominance in the social sciences, those sophisticated techniques are not the only weapon in our battle against uncertainty. Replication, which provides "critical information about the veracity and robustness of research findings" (The National Science Foundation & The Institute of Education Sciences, 2018), is another powerful tool for researchers.
In recent years, interest in the so-called replication crisis (or reproducibility crisis) reported in multiple disciplines has intensified, prompting debates and discussions that focus on different aspects of replication and rest on different definitions of the term (Schmidt, 2016). For this discussion of uncertainty in A-ROI results, we adopt the replication standard proposed by Clemens (2017), which differentiates four types of replication studies.
Observing the confusion and harm resulting from the lack of a consensus standard for determining what constitutes a replication, Clemens proposed classifying follow-up studies as either "Replication" or "Robustness" (see Table 1). Based on how the methods of a follow-up study compare with those reported in the original, "Replication" is further divided into "Verification" and "Reproduction," and "Robustness" into "Reanalysis" and "Extension." For the rest of this discussion, replication refers to "Reproduction."
Table 1. Clemens's classification of follow-up studies: "Replication" (Verification, Reproduction) versus "Robustness" (Reanalysis, Extension).
Note. From "The Meaning of Failed Replications: A Review and Proposal," by M. A. Clemens, 2015, Journal of Economic Surveys, Vol. 00, No. 0, p. 3. Copyright 2015 by John Wiley & Sons. Adapted with permission.
With the estimate from each replication, each of which conducts the same analysis on a different sample from the same population, we obtain a sampling distribution of the program effect. Based on that distribution, we can use the average to estimate the program effect and calculate the standard error (the standard deviation of the sampling distribution) to gauge the degree of sample-to-sample variability we can expect if the investment continues.
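As a minimal sketch with made-up numbers, the calculation itself is simple; the effect estimates below stand in for the results of several annual replications.

```python
import numpy as np

# Hypothetical effect estimates from four annual replications ("Reproduction")
# of the same program evaluation; the values are made up for illustration.
estimates = np.array([0.21, 0.18, 0.25, 0.20])

mean_effect = estimates.mean()      # average of the replication estimates
std_error = estimates.std(ddof=1)   # standard deviation of the sampling distribution,
                                    # i.e., the expected sample-to-sample variability

print(f"Estimated program effect: {mean_effect:.3f}")
print(f"Standard error: {std_error:.3f}")
```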
As a result, NHST is no longer necessary for dealing with the uncertainty associated with sampling error as long as there are sufficient replications (Kline, 2004). In addition, "Reproduction" helps reduce uncertainty associated with measurement error, coding errors, and low power (Clemens, 2017). This type of replication may also help address uncertainty associated with systematic errors such as confounding, since both confirmations and disconfirmations can reduce uncertainty (McGrath & Brinberg, 1983). For example, suppose that after consistent results[1] from three replication studies of an intervention program in a school, the fourth replication yields a significant deviation that coincides with the discontinuation of a practice in the school. The deviation suggests that the original and subsequent consistent findings may have been confounded, which necessitates further investigation.
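One simple way such a deviation might be flagged (not the only way, and all numbers are hypothetical) is to compare the new estimate against the precision-weighted pooled estimate of the earlier, consistent replications, assuming each replication reports an effect estimate and its standard error.

```python
import numpy as np
from scipy import stats

# Hypothetical (estimate, standard error) pairs from the first three replications,
# plus the estimate from the fourth; all numbers are made up for illustration.
prior = [(0.21, 0.05), (0.18, 0.05), (0.25, 0.06)]
new_est, new_se = 0.02, 0.05

# Inverse-variance (precision-weighted) pooled estimate of the earlier replications.
weights = np.array([1 / se**2 for _, se in prior])
pooled = np.sum([w * est for (est, _), w in zip(prior, weights)]) / weights.sum()
pooled_se = np.sqrt(1 / weights.sum())

# Two-sided z-test: does the fourth estimate deviate from the pooled estimate?
z = (new_est - pooled) / np.sqrt(new_se**2 + pooled_se**2)
p = 2 * stats.norm.sf(abs(z))
print(f"pooled = {pooled:.3f}, z = {z:.2f}, p = {p:.3f}")
```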
It is somewhat ironic that as we become more certain about the potential inaccuracy of our earlier results due to the disconfirmation, we become less certain about the effect of the program. At the very least, however, we know we cannot put much confidence in the result without further investigation. This is in contrast to the situation with a single study, where we practically deem the effect "scientifically proven" without the replications that could reveal potential biases. In certain circumstances, we may even be able to identify, with additional information at hand, which scenario the bias falls under (see Table 2 in Part I – Validity) and thus use the biased result if it is acceptable (see Table 3 in Part I – Validity).
By nature, an average derived from a sampling distribution carries less uncertainty than a point estimate from a single study, as long as there are sufficient replications. The standard error can be translated into a statement describing the percentage of times we can expect the program effect to fall within one or two standard errors of the mean. That information should be more relevant, informative, and accessible for decision makers than a confidence interval, which gives the range encompassing the true program effect with a high probability (Howell, 2012).
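Continuing the hypothetical numbers above, and assuming the replication estimates are roughly normally distributed, the translation is mechanical.

```python
from scipy import stats

# Hypothetical mean effect and standard error carried over from the sketch above.
mean_effect, std_error = 0.21, 0.03

for k in (1, 2):
    pct = 100 * (stats.norm.cdf(k) - stats.norm.cdf(-k))
    low, high = mean_effect - k * std_error, mean_effect + k * std_error
    print(f"About {pct:.0f}% of results expected between {low:.2f} and {high:.2f}")
```

This prints roughly the "68% within one standard error, 95% within two" statements the paragraph describes.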
Epistemologically, NHST and the various sensitivity and robustness methods are our solutions to the problem of making inferences from a single study. Despite repeated calls to lessen our dependence on this inductive logic and strengthen scientific inquiry through replications (Cohen, 1994; Edlund, 2016; King, 1995; Schneider, 2004), the situation has not improved significantly. This is especially true in educational research, where a recent study found a replication rate of 0.13% among articles published in the current top 100 education journals ranked by 5-year impact factor (Makel & Plucker, 2014).
There are a number of structural barriers that discourage replications in academia, including editorial bias (Neuliep & Crandall, 1990, 1993), grant culture (Lilienfeld, 2017), reputation and career advancement norms (The National Science Foundation & The Institute of Education Sciences, 2018), and feasibility constraints (Open Science Collaboration, 2015). These barriers are difficult to overcome, which helps explain the persistent lack of replications across multiple social science fields despite various efforts to improve the situation.
In contrast, K-12 school systems provide friendly environments and near-optimal conditions for replication when it comes to evaluating a program's effect. In many school districts, some programs are implemented year after year without change. For these programs, an evaluation conducted each year with the same design and methods can be treated as a replication ("Reproduction"), as long as the assumption that program participants in those years come from the same population is not violated.
Within a school district, that assumption seems quite plausible when there are no boundary changes in school assignment, no large student migration into or out of the school system, no adjustments to program entrance and exit criteria, and no significant improvement or deterioration in student achievement (or in factors that affect it, such as motivation) attributable to another program or district policy.
It is important to point out that the above discussion by no means suggests that "Reproduction" is superior to the other types of replication. However, as far as assessing the cost-effectiveness of an investment, or comparing the cost-effectiveness of multiple investments, in a K-12 setting is concerned, we think that, whenever feasible, "Reproduction" should be the preferred approach to gauging and communicating uncertainty in A-ROI results.
[1] There is no consensus on the criteria that should be used to determine whether a replication has occurred (Subcommittee on Replicability in Science, 2015). Here, consistent results refer to findings that are not statistically significantly different from each other, with the variation among them attributable to sampling fluctuation.
REFERENCES
Campbell, D. T., & Stanley, J. (1963). Experimental and Quasi-Experimental Designs for Research (1st ed.). Boston: Cengage Learning.
Clemens, M. A. (2017). The Meaning of Failed Replications: A Review and Proposal. Journal of Economic Surveys, 31(1), 326–342. https://doi.org/10.1111/joes.12139
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997
Edlund, J. E. (2016). Invited editorial: Let’s do it again: A call for replications in Psi Chi Journal of Psychological Research. Psi Chi Journal of Psychological Research, 21(1), 59–61.
Fischhoff, B., & Davis, A. L. (2014). Communicating scientific uncertainty. Proceedings of the National Academy of Sciences, 111(Supplement 4), 13664–13671. https://doi.org/10.1073/pnas.1317504111
Frank, K. A. (2000). Impact of a Confounding Variable on a Regression Coefficient. Sociological Methods & Research, 29(2), 147–194. https://doi.org/10.1177/0049124100029002001
Gelman, A., & Stern, H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 60(4), 328–331. https://doi.org/10.1198/000313006X152649
Howell, D. C. (2012). Statistical Methods for Psychology (8th ed.). Belmont, CA: Cengage Learning.
King, G. (1995). Replication, Replication. PS: Political Science & Politics, 28(3), 444–452. https://doi.org/10.2307/420301
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. https://doi.org/10.1037/10693-000
Lilienfeld, S. O. (2017). Psychology’s Replication Crisis and the Grant Culture: Righting the Ship. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 12(4), 660–664. https://doi.org/10.1177/1745691616687745
Makel, M. C., & Plucker, J. A. (2014). Facts Are More Important than Novelty: Replication in the Education Sciences. Educational Researcher, 43(6), 304–316. https://doi.org/10.3102/0013189X14545513
McGrath, J. E., & Brinberg, D. (1983). External Validity and the Research Process: A Comment on the Calder/Lynch Dialogue. Journal of Consumer Research, 10(1), 115–124.
Mittag, K. C., & Thompson, B. (2000). A National Survey of AERA Members’ Perceptions of Statistical Significance Tests and Other Statistical Issues. Educational Researcher, 29(4), 14–20. https://doi.org/10.2307/1176454
Nelson, N., Rosenthal, R., & Rosnow, R. L. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist, 41(11), 1299–1301. https://doi.org/10.1037/0003-066X.41.11.1299
Neuliep, J. W., & Crandall, R. (1990). Editorial bias against replication research. Journal of Social Behavior & Personality, 5(4), 85–90.
Neuliep, J. W., & Crandall, R. (1993). Reviewer bias against replication research. Journal of Social Behavior & Personality, 8(6), 21–29.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Rosenbaum, P. R. (2010). Design of Observational Studies. Retrieved from https://www.springer.com/us/book/9781441912121
Rosenthal, R., & Rubin, D. B. (1982). Comparing Effect Sizes of Independent Studies. Psychological Bulletin, 92(2), 500–504.
Schmidt, S. (2016). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. https://doi.org/10.1037/14805-036
Schneider, B. (2004). Building a Scientific Community: The Need for Replication. Teachers College Record, 106(7), 1471–1483. https://doi.org/10.1111/j.1467-9620.2004.00386.x
Subcommittee on Replicability in Science. (2015). Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science. Retrieved from National Science Foundation website: https://www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf
The National Science Foundation, & The Institute of Education Sciences. (2018). Companion Guidelines on Replication and Reproducibility in Education Research (No. nsf19022). Retrieved from https://www.nsf.gov/pubs/2019/nsf19022/nsf19022.pdf