I think everyone knows something is changing, but each of us feels the effects differently. In my academic department, over the past few weeks there seems to be a latent excitement and incredulity about how coding agents will change the process of science. Chatbots are impressive and useful, but coding agents in 2026 feel like discovering a glitch in a video game. It doesn’t seem like such a productivity gain should be possible, especially one that appears so suddenly. And I do think the scientific process changes when implementation becomes so cheap.

I think, in retrospect, this will feel like a pivotal moment. Or, given the pace of progress, some kind of foreshadowing. I don’t have a particularly unique perspective on AI or its use — it’s not my field and I definitely don’t use today’s tools to their fullest potential. At the same time, I want a record of how it began to diffuse, a snapshot of what I found useful in this moment of the transition.

The first real benefit I got from Claude Code was last summer using Sonnet 4. I had all this research code that had accumulated throughout graduate school, saved in multiple different places and not ready for public consumption. Before coding agents, another grad student and I had tried to organize our code, document it, and release it as a Python package. We abandoned the idea after getting completely lost in software development land, where you have to keep track of configuration files, build scripts, wheels, I don’t even know. This is standard stuff for a software developer, but we did not want to be software developers. This was a perfect first task for Claude Code: basic software engineering that is trivial for practitioners but that I had no experience with. I could focus on the actual scientific content. Now that the code is public, I have much more incentive to maintain it and ensure it’s well tested, which has had significant benefits for our research group. Another package I released last summer included functions translated from a different language, one that I don’t know well enough to translate myself. That was another basic task that would not have been done without LLMs, and one I’ve since reaped large benefits from having finished. That was my theme for 2025: LLMs do the rote, basic tasks that I can’t be bothered to learn but that have significant downstream advantages in maintainability, sharability, and documentation.

The next round of model releases (e.g. Sonnet 4.5) was good enough to start writing nontrivial code. At the time, I was brainstorming algorithms that could be used to simulate cloud fields. A pattern I developed was to run quick numerical experiments: I would tell the model exactly how to construct a candidate cloud field, and to plot it or calculate some statistic of interest (Initialize a field of random numbers using [specific function] ... transform using [specific operation] ... interpret as [atmospheric variable] ... visualize using [optical relationship]).
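A hypothetical stand-in for this kind of experiment might look like the sketch below. The specific recipe (a power-law spectral filter, the slope, interpreting the field as liquid water path, cloud fraction as the statistic) is entirely my own illustrative choice, not the author’s actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128

# 1. Initialize a field of random numbers.
white = rng.normal(size=(n, n))

# 2. Transform: impose a power-law spectrum in Fourier space.
kx = np.fft.fftfreq(n)[:, None]
ky = np.fft.fftfreq(n)[None, :]
k = np.sqrt(kx**2 + ky**2)
k[0, 0] = 1.0  # avoid dividing by zero at the zero-frequency mode
field = np.fft.ifft2(np.fft.fft2(white) * k**-1.5).real

# 3. Interpret as an atmospheric variable: liquid water path, clipped at zero.
lwp = np.clip(field, 0.0, None)

# 4. Calculate a statistic of interest: cloud fraction.
cloud_fraction = float((lwp > 0).mean())
print(cloud_fraction)
```

The point is less the particular recipe than the turnaround: a candidate construction, a statistic, and a plot, all in one short script an agent can write in minutes.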

I could get feedback about whether I was on the right track in 5 minutes. This is probably the same thing I would have done before AI, but it would have taken a day to write the kind of numerical experiments I was doing. That felt like an acceleration.

I also found the models could now do simple data analysis. The main utility to me was to gain intuition, so errors were not important. I could point to a dataset and say “Plot metric [X]” and get a result in a minute, then discard the whole analysis. Repeating this with different datasets or different coding agents (Codex) gave me enough intuition to believe an analysis or to anticipate failure modes that were important for the work I was doing, like interpolating missing data.

Around the same time, I was also starting to make interactive visualizations of key concepts and accumulate them on my website. At first, I only did this for the most important concepts because it was fairly annoying to guide the agent through making a website (“This text box overlaps the data ... this plot isn’t showing anything ... that plot looks wrong, how did you calculate it ...”). Many of these issues seemed to go away with Opus 4.5, and I leaned heavily into visualizations, particularly for a seminar I gave in early January. I think the pedagogical utility here is incredible, both for myself and for teaching concepts to others. Imagine a university course where every class is just a discussion and walkthrough of a 3blue1brown-level interactive simulation. I think this is completely possible now, particularly with Opus 4.6 (I find the GPT models, including 5.4, need substantially more hand-holding here).

Yesterday, I did exactly this. With an hour of prep, I built a visualization of how 3D Monte Carlo ray tracing works in atmospheric rendering. The class was then just a discussion and demonstration of photons bouncing around in a cloudy atmosphere. In this case, I was able to point Claude Code at a research Monte Carlo renderer I’m developing and tell it to reimplement it in two dimensions in JavaScript. But if the concepts are sufficiently well known, today’s agents are smart enough not to need a reference. I probably did not need to give it my research code here.

For research, I’ve recently moved from the more exploratory phase I described above to an implementation phase, where I want to produce publishable and reliable code. Personally, I have to use Claude Opus 4.6 for this, because I find GPT models write inscrutable code that I can’t read, even if it’s correct. My process has been to write a document describing the algorithm I want in sufficient detail that there is no ambiguity in how it works. This will eventually be an appendix/supplement to a publication. I then tell Claude to implement the algorithm, read its code, and inevitably realize I was not specific enough and go back to the document to add detail. Once I’ve gotten the basic structure of the code finished, I move on to testing.

Writing extensive tests is a commonly given tip for coding agents, and it’s good advice. I also find that for my work it’s not quite so simple. If you just tell an agent to “write tests,” it will do so, but I’ve found the tests tend to be fairly basic (Does this function return an array of the correct shape? Does it fail properly when given incorrect inputs?). These are useful and basically free, so you should include them, but the most useful testing for research purposes is of a more abstract variety. For my cloud simulation, I want to ensure that the resulting simulated clouds are “realistic”, which is ultimately subjective. What metric do you use to measure the realism of simulated clouds? Given a metric, how many clouds need to be checked before the metric converges? What tolerances in your chosen metric are acceptable?

Each of these can be quantified, but doing so requires subjective decisions about thresholds and methodology. This kind of subjectivity is an inherent part of the research process. For these reasons, testing resists full automation, but coding agents still speed up the process. Plotting and basic post-processing code is now free, so you can plot and examine anything very quickly. I spend much of my time coming up with different ways of plotting things, and generally trying to answer the question of “Is this cloud field realistic?” from as many different angles as I can think of. This was just as important before coding agents, and now it’s just faster.
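To make those subjective choices concrete, here is a sketch of what one such abstract “realism” test could look like. The metric (a fitted spectral slope), the target value, the sample count, and the tolerance are all hypothetical placeholders that a researcher would have to choose and defend, not the author’s actual test suite:

```python
import numpy as np

def spectral_slope(field):
    """One candidate realism metric: the log-log slope of the 2-D power spectrum."""
    n = field.shape[0]
    power = np.abs(np.fft.fft2(field)) ** 2
    kx = np.fft.fftfreq(n)[:, None]
    ky = np.fft.fftfreq(n)[None, :]
    k = np.sqrt(kx**2 + ky**2).ravel()
    p = power.ravel()
    mask = k > 0  # drop the zero-frequency mode before taking logs
    slope, _ = np.polyfit(np.log(k[mask]), np.log(p[mask]), 1)
    return slope

def check_realism(simulate, n_samples=50, target_slope=-3.0, tol=0.5):
    """Every parameter here encodes a subjective decision: which metric,
    how many samples until it converges, and what tolerance is acceptable."""
    slopes = [spectral_slope(simulate()) for _ in range(n_samples)]
    return abs(np.mean(slopes) - target_slope) < tol
```

A test like this is qualitatively different from a shape check: it can fail for statistical reasons, and whether a failure matters depends on judgment calls baked into `target_slope` and `tol`.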

This is also the reason I’m not too worried about agent reliability in science. To feel confident of any given scientific result, you should always have gotten the result in multiple different ways, whether that be with different datasets, metrics, or approaches to the problem. Incorrect code is only one failure mode. You could also be wrong because you asked the wrong question or took a flawed conceptual approach toward answering it. Your research process should be redundant in order to catch as many mistakes as possible. These principles do not change with coding agents, except that you can dramatically scale up the number of redundancy checks.

Rather than taking time to review every line of code, I’ll add a new dataset, reimplement an analysis pipeline from scratch with a different agent, or have the agent provide a walkthrough of exactly what the algorithm does. It is only sometimes faster to read the code directly. I also find GPT is more thorough for code review, and in addition to just telling it to “review,” it can be helpful to do carefully scoped reviews (“In this analysis it’s important missing data is preserved and passed to [function name] rather than being dropped or interpolated”). Since different LLMs have different strengths, having multiple “eyes” on any piece of code is beneficial.

I’ve also been experimenting with longer-running agents on a dedicated desktop where they have full system access. The challenge here is to specify a sufficiently robust optimization target for your code. But once that’s done, it’s fairly straightforward to set an agent on a hill climb. I just do a while loop:

```bash
while true; do
  codex exec "Open prompt.md and finish ALL the tasks described before completing. Do not ask me for confirmation."
done
```

In the prompt.md, I’ll tell the agent what the optimization target is, how to save the result and its notes after each iteration, and what tests should pass. The simplest example is to point it at a low-level bottleneck function and tell it to make it run faster. But you can optimize anything that can be quantified — I’ve had success with improving the accuracy of simulations of scale invariant fields. This works because the degree to which a field is scale invariant can be quantified and therefore optimized.
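A minimal prompt.md for such a loop might look something like the following. The file names, script, and rules are all invented for illustration; the essential ingredients are the ones the text lists: a target, a place to record results and notes, and tests that must pass:

```markdown
# Task
Improve the score reported by `python evaluate.py` for `simulate_field()`
in `simulation.py` (lower is better).

# Rules
- All tests in `tests/` must still pass (`pytest tests/`).
- After each attempt, append the score and a short summary of what you
  tried to `notes.md`.
- Commit changes that improve the score; revert anything that makes it worse.
```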

It’s not a very complex setup, but it can work if you have good tests and the optimization target is actually what you want. However, these are not trivial requirements. In general I’ve found that agents initially find real optimizations before delivering code that is more and more tuned to your specific tests, and is therefore not useful. For the scale-invariance optimization loop, the agent first found some real improvements to the simulation method, but once real optimizations got harder to find, it started implementing modifications carefully tuned to exploit imperfections in my optimization target, rather than improving the simulations in a more general and robust way. This “over-optimization” is a common problem in machine learning, and it probably means you need a better or more general way to quantify performance.

It’s interesting to compare an optimization loop like this to traditional machine learning methods. Both are essentially methods to optimize a function against a specified metric of performance. The main difference is that a neural network is a black box, and it’s very hard to know what the network is doing to process inputs into outputs. This makes it hard to determine whether a solution is good and general, or whether the network optimized too hard against the quirks of your evaluation. In my agent loop case, the function is not a black-box neural network, but a Python function which I can read and understand. I can therefore spot obviously bad methods like caching data to decrease runtime. If you know the method, it’s much easier to trust good optimizations while discarding bad ones. This is not usually possible with a neural network-based function. I feel this distinction is important enough that it’s what finally motivated me to start this blog:

The new AI for science is different

The most important thing I feel I haven’t figured out is how to get LLMs to help me write. They are definitely useful for editing and feedback, but I’m still writing everything myself. I’m not sure I’ll ever fully stop writing like I’ve fully stopped coding, because writing feels like an important part of the thinking process as a scientist. Perhaps others feel this way about coding. But this is not true of all writing and I would think LLMs could usefully write some things for me. Right now, LLM writing still feels lacking in some difficult-to-quantify way. Maybe I should fine-tune a local LLM on my writing or something, but I think there is still a basic problem of quantification. LLMs are increasingly good at anything you can quantify, and we don’t know how to quantify good writing.