Pretesting, Pilots, and Refinement: How I’m Approaching Agentic AI Development
I started experimenting with Custom GPTs and later Copilot Agents last year as a way to automate complex, multi-step, and time-consuming processes connected to the federal programs I support in my current role. At some point, the researcher in me took over the planning and development, and I found myself pulling directly from my measurement training to guide how these tools were designed, tested, and refined.
This all began as a small idea that slowly evolved into a working tool, one that started showing promise through early use and feedback from initial users. As the tool grew, I found myself applying the same logic I use when designing instruments or evaluation tools: develop the idea, pretest it, pilot it in real settings, refine based on what emerges, and repeat. Once the cycle is complete, I work on written guidance. This blog is about that process. More specifically, it outlines the strategies I’ve used to develop and deploy AI agents for Title I schools. There are benefits and cautions that come with developing agentic AI, and I plan to explore those in a future post. For now, the focus here is on the process itself and the practical strategies that have shaped this work so far.
Strategy 1: Develop the Idea
As I mentioned earlier, this all started as a small idea. As an internal evaluator in a very large school district, I’m always thinking about how to better support all the different federal programs. That support happens at different levels, school and district, and across very different contexts. Even within Title I alone, there are multiple strategies, requirements, and moving pieces to keep track of. I’m sharing that background because in evaluation, we really do have to pay attention to the context of each program.
As I was working on the comprehensive needs assessment (CNA) materials and resources, I kept thinking about how the CNA is a multi-step process that requires a lot of brain power, and it repeats every year. The biggest challenges usually come back to data: collecting it, cleaning it, analyzing it, and then trying to turn it into something meaningful through root cause analysis. My previous blog touched on the data analysis piece that becomes the backbone of the CNA, and you can read it here:
At some point, I started wondering whether this was the kind of work that could be automated in a way that makes the workload lighter without compromising integrity. So I started sketching the idea out by laying out each step and thinking through what was supposed to happen at each point. What decisions are being made here? What information is needed? What should the output look like if this step is done well?
Basically, I was setting the guardrails and parameters for how the agent should perform. My background in measurement and evaluation, quantitative data analysis, and a lot of time spent living in federal programs requirements all shaped how I approached this.
This part of the process felt familiar because it mirrors early instrument design almost exactly. I spent time turning what was initially a pretty loose idea into a more structured protocol, knowing that everything that came next would depend on getting this first part right. Before you worry about how the tool works, you have to be clear about what it’s supposed to produce.
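To make that sketching step a little more concrete, here is a minimal illustration of how a step-by-step protocol could be laid out, pairing each step with its decisions, inputs, and expected output. The step names and fields below are simplified placeholders, not the actual CNA protocol.

```python
# Illustrative sketch only: one way to lay out each step of a process
# before building an agent around it. Step names, decisions, inputs,
# and outputs are placeholders, not the real CNA protocol.

protocol = [
    {
        "step": "Collect data",
        "decision": "Which data sources are required for this school?",
        "inputs_needed": ["enrollment file", "assessment results"],
        "expected_output": "A complete, clearly labeled data inventory",
    },
    {
        "step": "Analyze data",
        "decision": "Which trends and gaps matter for this program?",
        "inputs_needed": ["cleaned data inventory"],
        "expected_output": "A short summary of trends with supporting figures",
    },
    {
        "step": "Root cause analysis",
        "decision": "What underlying factors explain the priority gaps?",
        "inputs_needed": ["trend summary", "stakeholder input"],
        "expected_output": "Prioritized root causes tied to evidence",
    },
]

# Reading the protocol back as "what should exist if this step is done well"
for step in protocol:
    print(f"{step['step']}: expect -> {step['expected_output']}")
```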
Strategy 2: Pretesting Through Repeated Runs
Once the idea was more structured and the guardrails were in place, the next step was pretesting; this part was very repetitive. I started by running the agent on my own many (yes, many!) times. On each run, I was testing whether the agent was following the steps in order, without skipping sections. I also used different prompts, slightly different wording, incomplete inputs, and scenarios I knew could come up in real life. I observed whether the agent was behaving consistently or not. Did the outputs make sense given the input? Where did it get confused, vague, or overly confident? This helped me identify patterns and areas where more guardrails were needed.
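For readers who like to see the idea in concrete terms, here is a minimal sketch of what those repeated-run checks could look like if scripted. The run_agent function, required sections, and prompt variations are placeholders for illustration; in practice, this checking happened through hands-on runs rather than a script.

```python
# Illustrative sketch only: repeated-run checks on an agent's output.
# run_agent() is a stand-in for the real agent call, and the sections
# and prompts below are placeholders, not the actual protocol.

REQUIRED_SECTIONS = ["Data Summary", "Trends", "Root Causes", "Next Steps"]

PROMPT_VARIATIONS = [
    "Run the needs assessment for my school.",
    "Start the CNA process.",
    "Help me with the comprehensive needs assessment. I only have partial data.",
]


def run_agent(prompt: str) -> str:
    """Stand-in for the actual agent call; replace with a real agent session."""
    # Canned response so the sketch runs end to end.
    return "Data Summary ... Trends ... Root Causes ... Next Steps ..."


def follows_step_order(response: str) -> bool:
    """Check that every required section appears, and in the expected order."""
    positions = [response.find(section) for section in REQUIRED_SECTIONS]
    return all(p >= 0 for p in positions) and positions == sorted(positions)


def pretest(n_runs: int = 10) -> None:
    for prompt in PROMPT_VARIATIONS:
        passes = sum(follows_step_order(run_agent(prompt)) for _ in range(n_runs))
        print(f"{prompt!r}: {passes}/{n_runs} runs followed the step order")


if __name__ == "__main__":
    pretest()
```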
After those initial runs, I shared the agent within my office and asked colleagues to use it on their own. This step helped a lot because people interact with these tools differently than the person who built them. The way others phrased prompts, interpreted instructions, or expected outputs immediately highlighted assumptions I didn’t realize I had embedded into the design.
When I think about it, this stage felt a lot like pretesting an instrument. Before anything is used at scale, you want to know whether the instructions are clear, whether different users interpret them similarly, and whether the tool produces outputs that are usable and aligned with its purpose. Pretesting helped surface issues early, before they turned into bigger problems later. More importantly, it gave me confidence that the agent could handle variation without falling apart, which is critical before moving into any kind of pilot testing.
Strategy 3: Pilot Testing in Real Workflows
After pretesting internally and within my office, the next step was moving into small pilot testing. I ran two smaller pilots with different audiences on purpose. I wanted to see how the same agent behaved across different roles, expectations, and levels of familiarity with federal programs work. At this stage, I was paying close attention to how people actually used the tool and the errors that were coming up.
During these pilots, I tracked the runs, followed the flow of tasks from start to finish, and spent time troubleshooting in real time. I also started to notice friction points: Where do users pause? Where do they get stuck? Where do they ask follow-up questions that signal the output wasn’t as clear or complete as it needed to be?
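As a rough illustration of that kind of tracking, the sketch below shows one way a simple run log could be summarized to surface friction points. The roles, steps, and entries are made-up examples, not actual pilot data.

```python
# Illustrative sketch only: summarizing a pilot run log to find friction points.
# The roles, steps, and rows below are invented examples, not real pilot data.

from collections import Counter

pilot_runs = [
    {"role": "school admin", "stuck_at": "data upload", "asked_follow_up": True},
    {"role": "program specialist", "stuck_at": None, "asked_follow_up": False},
    {"role": "school admin", "stuck_at": "root cause step", "asked_follow_up": True},
    {"role": "school admin", "stuck_at": "data upload", "asked_follow_up": False},
]

# Count where users paused or got stuck across all tracked runs.
friction = Counter(run["stuck_at"] for run in pilot_runs if run["stuck_at"])
for step, count in friction.most_common():
    print(f"{step}: {count} runs paused or got stuck here")
```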
These tests allowed me to distinguish the normal learning curve of introducing new technology from areas where the agent needed further refinement. This part also made me realize I needed to develop a side manual with instructions and common pitfalls. In all honesty, I did not think about creating a manual until we started pilot testing the tool; watching users work through it made me realize that some issues can be prevented by educating users in advance.
Strategy 4: Refinement After Every Stage
Refinement happened after every stage: after my own runs, after sharing the agent in the office, after the smaller pilots, and again after a full-scale pilot. Some refinements were just adjustments because instructions needed to be clearer. Other times it was an output that looked fine on its own but didn’t quite work when placed back into a larger workflow. I also had to tighten the boundaries of what the agent could do so it didn’t try to be overly helpful in ways that weren’t appropriate for the intended use. A few funny things also happened, like the agent refusing to save new instructions, and I had to troubleshoot how to accomplish the change by wording the instructions differently.
Changing core instructions, adjusting task flow, or refining outputs all affect how the agent behaves, so testing again after each change was very time-consuming but necessary. All these steps combined helped improve the agent’s consistency over time. A few important steps I totally recommend doing while managing refinements are:
First, document all changes for future reference. This helps tremendously as you go through agent development because sometimes changes do not work the way you intended, and you may need to go back to a previous version.
Second, track each test for consistency against a list of performance indicators. Over time I developed a list of performance indicators: those areas where I need the agent to perform exactly as instructed. By doing this, I was able to quantitatively measure consistency, which gave me a range of performance.
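As a simple illustration of that idea, the sketch below shows how pass rates on a few performance indicators could be tallied across test runs. The indicator names and results are placeholders, not my actual indicator list.

```python
# Illustrative sketch only: tallying pass rates on performance indicators.
# The indicators and pass/fail results below are placeholders.

test_runs = [
    {"follows_step_order": True,  "cites_only_provided_data": True,  "stays_in_scope": True},
    {"follows_step_order": True,  "cites_only_provided_data": False, "stays_in_scope": True},
    {"follows_step_order": True,  "cites_only_provided_data": True,  "stays_in_scope": False},
]

# For each indicator, report how many runs performed exactly as instructed.
for indicator in test_runs[0]:
    passed = sum(run[indicator] for run in test_runs)
    rate = passed / len(test_runs)
    print(f"{indicator}: {passed}/{len(test_runs)} runs ({rate:.0%})")
```

Even a simple tally like this makes it easier to see whether a refinement improved or hurt consistency from one version to the next.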
Strategy 5: Training and Written Guidance as Part of the Tool
After the multiple runs, pilot tests, and refinements, the tool was made available to all school administrators. The launch came with a training that went from the basics of what a Copilot Agent is to a step-by-step walkthrough of the tool, including challenges, common pitfalls, and the solutions I’ve identified so far. Alongside the training, I developed written guidance and troubleshooting resources. These materials were meant to reduce variation in use and give users a place to go when something didn’t look right. Clear examples, step-by-step guidance, and common issues helped users feel more confident using the agent independently rather than relying on informal workarounds.
This step helped ensure the tool is used as intended and that its outputs remain consistent and reliable across schools and programs. At this point, the process came full circle. The same attention given to idea development, testing, and refinement carried through to how the agent was introduced, supported, and sustained in practice.
A Closing Reflection
What this whole process has reinforced for me is that building agentic AI is not that different from building any other tool meant to support complex work. IT. TAKES. TIME! It also takes familiarity with the work itself. And it takes a willingness to slow down and test assumptions before scaling anything up.
Pretesting, pilot testing, refinement, and clear guidance are what make a tool usable, defensible, and trustworthy, especially in a federal programs context where consistency is extremely important. More than anything, this approach has shown me that when AI is developed with clear guardrails, tested in real conditions, and refined based on how people actually use it, it can become a great supporting tool.
This is still evolving work for me, and I continue to learn from every iteration. But grounding AI development in the same principles we use in evaluation and measurement has given me a framework that is both practical and responsible.
If you’re thinking through how to apply a similar approach in your own context, I’m always open to conversations about design, testing, and implementation.