Data, Questions, and Comparisons: Three Things Every Impact Evaluation Needs

Most people who request an impact evaluation do not know what they are actually asking for or what they need to have ready. They know they want to know whether a program worked, and they know they need to collect data. But when the evaluator starts asking questions about comparison groups, dosage data, and construct definitions, the conversation stalls.

This post gives you the practical steps. Not the statistical methods; those belong to the evaluator. What belongs to you is the groundwork: a clearly articulated question, clean data, and an understanding of what comparison is possible. If you have those three things, a rigorous evaluation can happen. If you do not, even the best evaluator, armed with the most sophisticated techniques, cannot save it.

What impact means in evaluation and what it does not

The word impact gets used loosely. Program managers talk about the impact of a tutoring program, the impact of a professional development series, the impact of a new curriculum. What they usually mean is that something changed: attendance went up, scores improved, teachers reported feeling more confident.

That is not impact in the evaluation sense; it is an outcome. The distinction is worth explaining before anything else.

Impact, defined

In the evaluation and social science literature, impact refers specifically to the causal effect of an intervention. The definition is worth reading carefully: impact is the change in an outcome that can be attributed to the program itself, after ruling out alternative explanations. The foundational framework here is the counterfactual: what would have happened to participants if they had not received the program?

The World Bank's Impact Evaluation in Practice (Gertler et al., 2016) defines impact evaluation as "a specific type of evaluation that tries to answer questions about cause and effect." The key word is cause. An impact evaluation does not just document that change occurred. It asks whether the program caused the change.

Shadish, Cook, and Campbell (2002) in Experimental and Quasi-Experimental Designs for Generalized Causal Inference frame this as a question of internal validity: can the observed effect be attributed to the treatment, or could it be explained by something else? Maturation, selection bias, historical events, and regression to the mean are all alternative explanations that an impact evaluation must rule out or account for.

What impact is not

Outputs are not impact. The number of students served, the number of sessions delivered, the percentage who completed the program, and the number of parents who attended school events are all measures of reach and implementation, not of effect. They document that the activity happened, but they cannot show that outcomes changed because of it.

Outcomes are not automatically impact. A test score gain is an outcome. Whether the program caused that gain is an impact question. Students can improve for many reasons: time passes, teachers change, and curriculum shifts. Without a comparison, you cannot separate the program's contribution from everything else happening at the same time.

Satisfaction is also not impact. Survey results showing that participants found the program useful or enjoyable are perception data. They tell you how the program was received. They do not tell you whether it changed anything.

Michael Scriven, a pioneer of the evaluation field, made the same point in his work on causal attribution: evaluation that stops at documenting change, without addressing the causal question, cannot be called impact evaluation.

Why the distinction matters in practice

When a program manager says, "our program had a huge impact on student reading," what they often mean is that reading scores went up during the program year. That is worth knowing, but it is not the same as demonstrating that the program caused the improvement.

The practical consequence is this: if you are asked to demonstrate impact, you need a design that can support a causal claim. That requires some form of comparison. Pre/post data alone is not enough. Outcome data without a counterfactual is not enough. Good intentions and positive feedback are not enough.

The following sections walk through what you do need.

A question the data can answer

Everything starts here. Before data is pulled, before a design is selected, before anything is analyzed, you need an evaluation question that is answerable.

"Did the program work?" is not an answerable question. An answerable question names a population, an outcome, a comparison, and a timeframe. For example: "Did third grade students who participated in at least 20 sessions of the reading intervention show greater gains in reading proficiency from fall to spring compared to similar students who did not participate?"

That question can be designed around. The vaguer version cannot.

Below is a set of questions to ask yourself:

  • Who received the program and who did not?

  • What outcome were you trying to move?

  • Over what period of time?

  • What would a meaningful result actually look like?

If you can answer all four, you have a workable question.

The data infrastructure you need

Impact evaluation runs on data, but not just any data. The right data, collected at the right time, structured in a way that allows records to be linked and outcomes to be traced back to participation. Here is what that requires.

  • A well-defined construct

Before collecting or linking any data, you need to be clear about what you are measuring. This is the construct; simply put, it is the underlying concept the evaluation is intended to capture.

For example: engagement and attendance are not the same thing. A student can be present and disengaged; or a parent can visit school frequently for varied reasons but not be engaged in their child's learning process. Another example: reading proficiency and reading growth are not the same thing. A student can show proficiency without showing growth, or show growth without reaching proficiency. Academic performance and academic achievement are not the same thing either. The list goes on.

When the construct is vague, the measure selected to represent it often captures something adjacent but not quite right. That misalignment between the concept and the measure is one of the most common sources of uninterpretable findings in program evaluation. Before the evaluation begins, define in plain language what the program was designed to change, in whom, and how you would know if it changed. That definition drives every decision that follows.

  • Unique identifiers

This applies to student IDs, employee numbers, and similar fields; it is also the foundation of any data linkage. Very often you will have multiple reports or sources of data. It would be ideal to have one organized, cleaned dataset, but that is rarely the reality. Records are collected in different systems and formats, and you need a unique identifier in each source to connect the data points. If you cannot connect a student's participation record to their outcome record, you cannot evaluate the program. That connection requires a consistent, unique identifier across every data source you plan to use.

In practice, this means checking that your program roster uses the same student ID as your assessment system, your attendance system, and any other data source the evaluation will draw from. It sounds simple. It is frequently not. IDs get entered manually, formatted differently across systems, or omitted entirely in program logs built outside the district's official data infrastructure.

A scenario I encounter often: the assessment record carries the state ID number, while the program implementation record carries the local ID number. The result is two files that cannot be merged directly, and with thousands of records, matching them by hand is not practical. In that case, I look for a third report that contains both the state and local IDs and use it as a crosswalk to merge everything into one file.
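For readers who handle the merge themselves, here is a minimal sketch of that crosswalk approach in Python with pandas. The file and column names (assessment_scores.csv, program_sessions.csv, enrollment_report.csv, state_id, local_id) are placeholders; substitute whatever your systems actually export.

```python
import pandas as pd

# Hypothetical exports -- swap in your own file and column names.
assessments = pd.read_csv("assessment_scores.csv")    # keyed by state_id
participation = pd.read_csv("program_sessions.csv")   # keyed by local_id
crosswalk = (
    pd.read_csv("enrollment_report.csv")[["state_id", "local_id"]]
    .drop_duplicates()
)

# Attach the state ID to the participation records via the crosswalk,
# then link participation onto the assessment file.
participation = participation.merge(crosswalk, on="local_id", how="left")
merged = assessments.merge(participation, on="state_id", how="left")

# Records that failed to pick up a state ID cannot be linked and are worth
# investigating before any analysis begins.
print(participation["state_id"].isna().sum(), "participation records could not be linked")
```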

Tip: Audit your data sources before the evaluation begins. If IDs are inconsistent, resolving them early saves significant time and prevents data loss later.

  • Outcome data

You need at least one measurable outcome that was collected before the program began and again after it ended. Pre and post measures are the minimum. Without a pre-measure, you cannot assess change. Without a post-measure, you have nothing to evaluate.

The outcome measure also needs to be consistent. If different schools used different assessments, or if the same assessment was administered under different conditions, the scores cannot be meaningfully compared. Consistency across campuses and across time is what makes the data usable.
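If you want to verify this yourself before handing the data over, a quick check along these lines can help. It is a sketch only: it assumes a long-format file (reading_scores.csv) with columns named state_id, campus, assessment, window, and score, and fall/spring window labels; your names will differ.

```python
import pandas as pd

# Hypothetical long-format outcome file: one row per student per test administration.
scores = pd.read_csv("reading_scores.csv")  # state_id, campus, assessment, window, score

# Was the same assessment used in every window? More than one name per window is a red flag.
print(scores.groupby("window")["assessment"].nunique())

# How many students have both a pre (fall) and a post (spring) score?
wide = scores.pivot_table(index="state_id", columns="window", values="score", aggfunc="first")
complete = wide.dropna(subset=["fall", "spring"])  # assumes "fall" and "spring" labels
print(f"{len(complete)} of {len(wide)} students have both a pre and a post score")
```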

  • Participation and dosage data

A program roster tells you who was enrolled. It does not tell you who actually participated, at what intensity, or for how long. Dosage matters. A student who attended two sessions and a student who attended twenty sessions did not receive the same intervention. Treating them the same in the analysis produces misleading results.

You need attendance or session logs that capture who showed up, when, and for how long. If that data does not exist, the evaluation will either have to make assumptions that weaken the findings or exclude dosage from the analysis entirely.
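If session logs do exist, summarizing them into a per-student dosage table is straightforward. A minimal sketch, assuming a hypothetical log file (session_log.csv) with one row per student per session and columns state_id, session_date, and minutes:

```python
import pandas as pd

# Hypothetical session log: one row per student per session attended.
sessions = pd.read_csv("session_log.csv")  # state_id, session_date, minutes

# Total sessions and minutes per student -- a simple dosage summary.
dosage = (
    sessions.groupby("state_id")
    .agg(sessions_attended=("session_date", "count"),
         total_minutes=("minutes", "sum"))
    .reset_index()
)

# The spread tells you whether "participated" meant the same thing for every student.
print(dosage["sessions_attended"].describe())
```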

  • Demographic and covariate data

Impact evaluation rarely happens in a vacuum. Students who participated in a program may differ from students who did not in ways that matter: prior achievement, attendance history, grade level, socioeconomic status, language background. If those differences are not accounted for in the analysis, the evaluation cannot cleanly attribute outcomes to the program. Also think about nested structures: students within classrooms and within schools.

Two earlier posts in this series go deeper on that topic.

Prior achievement is usually the most important covariate. If you can provide one prior year of outcome data for every student in the dataset, the analysis has a much stronger foundation. Demographic variables help further, both as statistical controls and for examining whether the program worked differently for different groups of students.

  • Timing

When data was collected relative to program delivery matters more than most people realize. A post-assessment administered three months after the program ended captures something different than one administered the week after. A pre-assessment administered before the program begins is more useful than one administered partway through.

Know your data collection timeline. Know when the program started and ended. Know when assessments were administered. If the timing does not align, the evaluator needs to know that up front so the design can account for it.
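A quick way to check alignment is to compare each test date against the program window. A sketch, assuming a scores file with a test_date column and made-up program dates:

```python
import pandas as pd

# Hypothetical program window -- replace with your actual calendar.
program_start = pd.Timestamp("2024-10-01")
program_end = pd.Timestamp("2025-04-15")

scores = pd.read_csv("reading_scores.csv", parse_dates=["test_date"])

# Flag administrations that fall inside the program window or long after it ended.
scores["during_program"] = scores["test_date"].between(program_start, program_end)
scores["late_post"] = scores["test_date"] > program_end + pd.Timedelta(days=90)
print(scores[["during_program", "late_post"]].mean().round(2))
```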

How to isolate the effect: some common comparison options

The core challenge in impact evaluation is ruling out alternative explanations. If students improved, was it the program, or would they have improved anyway? Answering that question requires some form of comparison. Here are the most common options in education settings, along with what each one requires and what it can and cannot support.

  • Pre/post (before and after)

The simplest design. You compare student outcomes before the program to outcomes after it. It tells you whether change occurred. It cannot tell you whether the program caused the change. Students grow over time regardless of intervention. Other factors in the school year may have contributed. Pre/post is a starting point, not a strong causal design on its own.

What you need: Pre-measure and post-measure for the same students, consistent across time.

  • Historical trends

Using data from prior cohorts as a baseline. If students in previous years followed a predictable trajectory and this cohort diverged from it after the program was introduced, that divergence is worth examining. Historical trend analysis is stronger than pre/post alone because it accounts for typical growth patterns.

What you need: At least two to three years of prior outcome data for comparable student populations.
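If multi-year data is available, the baseline trajectory is easy to summarize before the evaluator takes over. A minimal sketch, assuming a hypothetical file with one row per student per year and columns year, grade, and score:

```python
import pandas as pd

# Hypothetical multi-year spring score file: one row per student per year.
history = pd.read_csv("spring_scores_by_year.csv")  # year, grade, score

# Average spring score by year for the grade the program serves.
# A clear break from the prior trajectory in the program year is what
# the evaluator will want to examine more formally.
trend = history[history["grade"] == 3].groupby("year")["score"].mean()
print(trend)
```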

  • Comparison group

Similar students or schools that did not receive the program during the same period. This is one of the strongest non-experimental designs available in education because it directly addresses the counterfactual: what would have happened without the program? The comparison group serves as a proxy for that answer.

The key is similarity. The comparison group needs to resemble the program group on the characteristics most likely to affect the outcome. That is where covariate data becomes critical.

What you need: A pool of non-participating students or schools with outcome and covariate data for the same time period.
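Before handing the files off, it is worth a rough look at how similar the two groups actually are. A sketch, assuming a merged analysis file with a participation flag, a prior-year score, and an attendance rate already in place (all column names are placeholders):

```python
import pandas as pd

# Hypothetical analysis file: one row per student with covariates already merged in.
df = pd.read_csv("analysis_file.csv")  # state_id, participated, prior_score, attendance_rate

# Compare group means and sizes on the covariates most likely to affect the outcome.
balance = df.groupby("participated")[["prior_score", "attendance_rate"]].agg(["mean", "count"])
print(balance)

# Large gaps in prior achievement or attendance signal that the comparison pool may
# need to be narrowed or adjusted -- a decision the evaluator makes, but one that
# goes better when the gap is known up front.
```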

  • Waitlist or phased rollout

When a program rolls out in waves, the students or schools waiting to receive it become a natural comparison group. This is one of the cleanest designs available in practice because the not-yet-served group is, by definition, similar to the served group in intent if not yet in outcome.

If your program is expanding and you have any control over the rollout sequence, preserving this structure by design is worth the effort. Once everyone is served, the comparison is gone.

What you need: A documented rollout schedule and outcome data collected before the second wave receives the program.

What gets in the way and how to anticipate it

Even well-designed evaluations run into data problems. Knowing what to watch for gives you time to address issues before they compromise the study; a short data audit, sketched after the list below, can surface most of them early.

  • Missing or inconsistent IDs. Records cannot be linked. Resolve ID formatting across systems before data is pulled.

  • No comparison group. The evaluation is limited to pre/post or trend analysis. If a comparison is important, plan for it before the program launches.

  • Inconsistent outcome measures. Scores that cannot be compared across campuses or years cannot be pooled. Standardize the measure before the program begins if possible.

  • Program records that do not match student records. Rosters with names but no IDs, or IDs that do not match the student information system, create matching problems that cost time and reduce the usable sample.

  • Timing mismatches. Outcome data collected at the wrong point in the program cycle reduces the interpretability of the findings. Document when every data source was collected.
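Most of these issues can be caught with a short audit before the evaluation starts. A sketch of what that might look like, assuming a program roster and an assessment file that are both supposed to carry a state_id column (names are placeholders):

```python
import pandas as pd

# Hypothetical source files -- swap in whatever your systems export.
roster = pd.read_csv("program_roster.csv")    # expects a state_id column
scores = pd.read_csv("reading_scores.csv")    # expects state_id plus outcome columns

roster_ids = set(roster["state_id"].dropna())
score_ids = set(scores["state_id"].dropna())

print("Roster records:               ", len(roster))
print("Roster rows missing an ID:    ", roster["state_id"].isna().sum())
print("Duplicate IDs on the roster:  ", roster["state_id"].duplicated().sum())
print("Roster IDs with no score data:", len(roster_ids - score_ids))
```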

What you do not need to figure out on your own

You do not need to select the statistical model. You do not need to run the analysis. You do not need to interpret every output or understand every technical decision the evaluator makes.

What you need is a clearly defined construct, clean and linked data, a workable comparison, and a question grounded in a real decision. Bring those to the table and the evaluation has a foundation to stand on.

Requesting a rigorous impact evaluation is a skill. The more prepared you are going in, the more useful the findings will be coming out.

References

Gertler, P. J., Martinez, S., Premand, P., Rawlings, L. B., & Vermeersch, C. M. J. (2016). Impact evaluation in practice (2nd ed.). World Bank. https://openknowledge.worldbank.org/handle/10986/25030

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.

White, H., & Sabarwal, S. (2014). Quasi-experimental design and methods (Methodological Briefs: Impact Evaluation No. 8). UNICEF Office of Research. https://www.unicef-irc.org/publications/750
