
When is Code Generation Useful?

Published at 09:48 PM

Over the past two weeks I’ve built a couple of projects (AutoTransform and BetaTester) that involved this basic workflow (sketched in code below):

  1. LLM (Large Language Model) generates code to solve a problem
  2. Code is executed
  3. If code fails, LLM is called upon again to generate a new solution
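
In code, the loop is roughly the following. This is a minimal sketch, not the actual implementation of either project: the `llm` callable, the prompt wording, and the retry limit are all assumptions.

```python
import subprocess
import sys
import tempfile

MAX_ATTEMPTS = 3

def solve_with_codegen(task: str, llm) -> str:
    """Generate-execute-retry loop. `llm` is any callable that maps a
    prompt string to Python source code (a hypothetical interface)."""
    prompt = f"Write a Python script that solves: {task}"
    for _ in range(MAX_ATTEMPTS):
        code = llm(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True
        )
        if result.returncode == 0:
            return result.stdout  # step 2 succeeded: use the output
        # Step 3: feed the failure back so the next attempt can correct it.
        prompt = (
            f"This script:\n{code}\nfailed with:\n{result.stderr}\n"
            f"Write a corrected script for the task: {task}"
        )
    raise RuntimeError("no working solution after retries")
```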

In both cases, much of the generated code’s value came from cost and latency savings: calling a model like GPT-4 over and over again quickly becomes prohibitively expensive and slow.

But as LLMs get smarter and faster, is code generation still necessary? Wouldn’t it be better to just call the LLM every time (i.e. only do step 1 above) and have it generate a solution on the fly, rather than producing code as an intermediate representation? You would drop the extra steps of generating, executing, and correcting code, making the system less complex and potentially easier to maintain.
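
To make the trade-off concrete, here is a hedged sketch contrasting the two approaches (the `llm` callable and the prompts are hypothetical): the direct approach pays one model call per input, while code generation amortizes a single call across every input.

```python
# Direct approach: one LLM call per record; cost and latency scale
# linearly with data volume.
def transform_direct(records, llm):
    return [llm(f"Transform this record: {r}") for r in records]

# Codegen approach: one LLM call total; the generated function is
# executed locally and reused for every record.
def transform_via_codegen(records, llm):
    source = llm("Write a Python function transform(record) that ...")
    namespace = {}
    exec(source, namespace)  # in practice, sandbox untrusted generated code
    return [namespace["transform"](r) for r in records]
```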

Potential Benefits of Generated Code

While the timing of this cheap-enough, fast-enough LLM future is unclear, and it may be far away given compute / memory / energy constraints, I do think there are other advantages to code generation that could make it useful even in that future.

Interfacing with Humans

For AutoTransform, the generated code provides a way for humans to reason about what the system is doing / will do and to alter the system’s behavior to their liking - it can be easier to edit code to get the exact behavior you want than to mess with a prompt. It’s not clear how sustainable this paradigm is as machines generate code at a scale that humans can’t keep up with, but it seems within the realm of possibility that deterministic code will always be valuable in certain use cases due to safety / regulatory / liability constraints.
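
As a toy illustration of what that editing might look like (hypothetical generated output, not AutoTransform’s actual format):

```python
# Generated transform a human can audit line by line; pinning down an
# edge case is a one-line edit rather than a prompt renegotiation.
def normalize_phone(raw: str) -> str:
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 10:
        digits = "1" + digits  # human edit: assume US numbers by default
    return "+" + digits
```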

Interfacing with Existing Systems

For BetaTester, the system uses Playwright to interact with a website by generating code compliant with Playwright’s API (technically, it outputs values that are deterministically converted to Playwright code). Since code is the dominant paradigm for integrating with most software systems today, it seems reasonable to assume that LLMs will continue to generate code to interact with existing systems over the long term. However, software systems that communicate autonomously over natural language (in the BetaTester example, the system would just tell the browser, “click this button”) will also exist, in which case whether code or natural language is used is probably determined by the specific system’s constraints (safety, robustness, etc.).
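
A minimal sketch of that pattern using Playwright’s Python API (the structured action format below is an assumption; BetaTester’s actual schema may differ):

```python
from playwright.sync_api import sync_playwright

# Hypothetical structured output from the model, converted
# deterministically into Playwright calls.
action = {"kind": "click", "selector": "text=Sign up"}

def run_action(page, action):
    if action["kind"] == "click":
        page.click(action["selector"])
    elif action["kind"] == "fill":
        page.fill(action["selector"], action["value"])
    else:
        raise ValueError(f"unsupported action: {action['kind']}")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    run_action(page, action)
    browser.close()
```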

Improving Reasoning

I find that writing code can help me reason about the business logic of a system, revealing gaps in the design / specification of the system from an operational perspective. Writing code might similarly help LLMs detect flaws in their own reasoning: I have seen cases where an LLM produced the incorrect output when prompted directly, but the code it generated produced the correct output. It’s hard to know how true this will remain as LLMs get smarter, but it does seem like something that could be tested empirically (maybe it already has been?).
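
A toy version of that empirical test (the `llm` callable is hypothetical): asked directly, the model predicts a number as text; asked for code, the interpreter does the arithmetic, which makes the answer checkable.

```python
from datetime import date

question = "How many days between 2024-02-25 and 2024-03-03?"

# Direct: answer = llm(question), a bare token prediction with no way
# to verify it.
# Via code: generated = llm(f"Write Python that prints the answer to: {question}")
# A plausible generation, executed here directly:
print((date(2024, 3, 3) - date(2024, 2, 25)).days)  # -> 7 (spans a leap day)
```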