Notes on an iterative prompt-driven development experiment with an AI coding assistant

By Gabriel, 28 Feb 2026, updated 09 Mar 2026

This post is about a prompt development experiment using several models provided by GitHub Copilot to develop Python scripts that download some images and upload them to S3.

The experiment

The main idea of this experiment is to compare several models at generating code for a real-world task purely from instructions, without any manual code change. I have qualified it as iterative because, as with most real problems, the requirements were not fully defined at the beginning.

I have used the name prompt-driven development (PDD).

I have not used the term “vibe coding” because I wanted to be quite specific in the prompts I used (which library to use, log messages to output, steps to follow, …). But maybe this is still a form of it!

I have not used Spec-Driven Development (SDD) either. SDD seems stricter: particular care appears to be applied to structuring the specifications and storing them with the code.

Anyway, all of these being current AI coding buzzwords, their definitions are still in flux and already semantically diffuse. But at this time, and with my current knowledge of the AI coding jargon, PDD seems a good description of this experiment.

The AI coding assistant is NOT used in an agentic loop. Only one LLM call is made at each round.

GitHub Copilot plan

Models

At the time of writing this post, the list of models provided by GitHub Copilot in my situation (Business plan - using GitHub Copilot plugin for JetBrains IDE) is the following:

GitHub Copilot models February-2026

16 Premium Models

Premium models consume “Request” quota. The multiplier (e.g., 0.33x) indicates how many credits are deducted per request compared to a standard baseline.

Claude Haiku 4.5: 0.33x
Claude Opus 4.5: 3.0x
Claude Opus 4.6: 3.0x
Claude Sonnet 4: 1.0x
Claude Sonnet 4.5: 1.0x
Claude Sonnet 4.6: 1.0x
GPT-5.1: 1.0x
GPT-5.1-Codex: 1.0x
GPT-5.1-Codex-Max: 1.0x
GPT-5.1-Codex-Mini: 0.33x
GPT-5.2: 1.0x
GPT-5.2-Codex: 1.0x
Gemini 2.5 Pro: 1.0x
Gemini 3 Flash (Preview): 0.33x
Gemini 3 Pro (Preview): 1.0x
Grok Code Fast 1: 0.25x

Each Copilot feature (Copilot Chat, Copilot CLI, …) except for Spark (…) uses one premium request per user prompt, multiplied by the model’s rate. See Requests in GitHub Copilot.

The Business Plan, 19 USD per user / month, includes 300 premium requests per user / month. See GitHub Copilot plans & pricing
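
To make the multipliers concrete, here is a small sketch of the arithmetic. It assumes that every prompt costs exactly one premium request times the model's multiplier and that the 300-request allowance is the only budget. The names MONTHLY_PREMIUM_REQUESTS, MULTIPLIERS and prompts_per_month are hypothetical, chosen for this example only.

```python
from fractions import Fraction  # exact arithmetic, no float rounding surprises

# Business plan allowance quoted above (premium requests per user per month).
MONTHLY_PREMIUM_REQUESTS = 300

# Multipliers copied from the model list above (a subset, for illustration).
MULTIPLIERS = {
    "Claude Opus 4.6": Fraction(3),
    "Claude Sonnet 4.6": Fraction(1),
    "GPT-5.2": Fraction(1),
    "Claude Haiku 4.5": Fraction(33, 100),
    "Grok Code Fast 1": Fraction(1, 4),
}


def prompts_per_month(model: str, allowance: int = MONTHLY_PREMIUM_REQUESTS) -> int:
    """Whole number of prompts the allowance covers for one model.

    Assumption of this sketch: cost per prompt = 1 request x multiplier.
    """
    return int(allowance / MULTIPLIERS[model])  # truncates, i.e. rounds down
```

Under these assumptions the allowance covers 100 prompts per month with Claude Opus 4.6, 300 with Claude Sonnet 4.6 or GPT-5.2, and 1200 with Grok Code Fast 1.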

3 Standard Models

Standard models are included in the base subscription and do not count against my premium request limit.

GPT-4.1: included
GPT-4o: included
GPT-5 mini: included

Comment

As a side note, considering how fast the models evolve, it is interesting to step back and take stock of the situation today (Feb-2026):

WHICH AI frontier labs are the selected providers: OpenAI, Anthropic, Google (and xAI)
WHAT are the currently most valued models: Claude Opus 4.5 and Claude Opus 4.6
WHAT is a basic model: GPT-5 mini
- GPT-4.1: the other standard GPT model, probably still available for legacy reasons
- GPT-4o: a multimodal model (text, audio, image), so a slightly different use case

Compare that with what it was in the past:

< 2023: initially powered by only one model, OpenAI Codex, a modified production version of GPT-3
Nov-2023: updated to use OpenAI’s GPT-4 model
2024: started to use multiple models, including external ones

And come back to this list in 6 months, 1 year or a little more to see how it has evolved. Spoiler: the most valued models of today could be the basic models of tomorrow, or even disappear from the list of available models!

The task

List of models reviewed:

GPT-5 mini (0x)
GPT-5.2 (1.0x)
Claude Sonnet 4.6 (1.0x)
Claude Opus 4.6 (3.0x)

The topic of the task is purposefully simple, way simpler than the tasks that AI coding assistants can handle today, even with a simple model. This allows the reviewer (me!) to perform a quicker and more thorough code review. I believe that a lot of the learnings and insights from this experiment can be extrapolated to more complex tasks. Actually, as I went through the experiment, reading about a model and then using it here, I changed my mind a bit. I realised that this overly simple task would fail to capture all the benefits of the most advanced models published in the last 3 months (GPT-5.2 and Claude 4.6). But let’s pursue this experiment as it is, draw some insights and then maybe do another, more complex one later.

See the git repository to check details of the experiment conditions, all the prompt rounds and the generated (and tested) code for each reviewed model: https://github.com/danrit/exp-gh-copilot-compare

Findings

Here is a list of findings I have collected, in no particular order. It may sometimes appear harsh toward the output of the LLM, but I think the comments I’ve raised would have been raised to me as a developer, or by me as a stakeholder, if I had produced that code.

I found myself doing a lot of “is it possible to do that?” or “how to do that?” checks PRIOR to formulating the prompt. Whether it is an unnecessary I-want-the-control habit or a wise step to avoid the model getting lost, I’m not sure!
Sometimes the model (GPT-5 mini) is not able to remove complexity that has become unnecessary after a change.
I tried hard NOT to edit anything myself in the code, but I did it twice:
- updating the DEFAULT_TRANSFORMATION_PREFIX (Cloudinary transformation), because it was simple enough and can be considered as NOT part of the experiment
- temporarily disabling the last-modified date check to RE-RUN the script
GPT (5 mini and 5.2): the choice to implement the requirement “start and end message ONLY on the console, all messages in the file” with a filter based on the message string starting with “start” or “end” is too literal IMO. It won’t be extensible. I would have preferred code duplication here.
GPT-5.2 was, on one occasion, not consistent in its usage of timezones (mixing local time and UTC). It fixed this when I pointed it out, but went overboard by specifying the timezone too much (some cases were already handled nicely by the Python defaults).
Claude Sonnet 4.6: interestingly, it implemented the console/file log requirement (round 2.1) by defining a custom level. IMO this is unnecessarily complex, and discouraged by the official Python logging documentation.
Claude Sonnet 4.6 and Claude Opus 4.6, steps 4 and 5: interestingly, for the step 4 change on upload.py the models reused the same logging pattern as previously done in download.py. This is good, as it simplifies the prompt for the next step (5).
Claude Opus 4.6 is the only model that understood, on the first try, the requirement to log the start and end of the process in the console and all messages in the file.
Claude Opus 4.6: I specified the same requirement for step 6 (progress bar), but that requirement had already been implemented in step 4 (I didn’t pay enough attention to the output of the previous round). The model correctly picked that up and didn’t make any change to the code.
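
To illustrate the console/file logging requirement discussed in the findings above, here is a minimal sketch of one possible alternative that avoids both the message-string filter and the custom level. It is not the code produced by any of the reviewed models. The ConsoleFlagFilter class, the build_logger function and the "console" flag are hypothetical names invented for this example; only the standard logging module is used.

```python
import logging
import sys


class ConsoleFlagFilter(logging.Filter):
    """Let a record through only if the caller flagged it with extra={"console": True}."""

    def filter(self, record: logging.LogRecord) -> bool:
        # "console" is a hypothetical attribute set through the `extra` argument.
        return bool(getattr(record, "console", False))


def build_logger(log_path: str, name: str = "download") -> logging.Logger:
    """Return a logger writing every message to a file and flagged ones to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicated handlers if called twice
    logger.propagate = False  # keep records away from the root logger

    # The file handler has no filter: it receives every message.
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(file_handler)

    # The console handler only shows records explicitly flagged by the caller.
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.addFilter(ConsoleFlagFilter())
    logger.addHandler(console_handler)
    return logger
```

With this sketch, logger.info("Download started", extra={"console": True}) goes to both the console and the file, while a plain logger.info("Fetched image 1") goes to the file only. The decision is explicit at each call site instead of depending on the wording of the message.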

Results

With the technique used (multiple rounds until success, with manual testing) and given the simplicity of the task, all models were able to implement the requirements, but with very different implementations line by line. In fact, it is challenging to compare two implementations. So how can we compare the “quality” of the output?

In the tables below I have tried to report on some measurable differences.

download.py:

branch_name	total_lines	code_lines	comment_lines	function_definitions	max_nesting_level
gpt-5-mini	125	85	40	2	8
gpt-5-2	150	122	28	6	4
claude-sonnet-4-6	116	84	32	4	3
claude-opus-4-6	82	65	17	1	3

upload.py:

branch_name	total_lines	code_lines	comment_lines	function_definitions	max_nesting_level
gpt-5-mini	256	190	66	7	10
gpt-5-2	230	196	34	8	4
claude-sonnet-4-6	193	149	44	6	7
claude-opus-4-6	165	144	21	2	6

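More advanced models tend to produce shorter code (fewer lines of code) and simpler code (fewer nesting levels). GPT-5.2’s download.py is the exception on length: it is longer than GPT-5 mini’s, although much less deeply nested.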

What’s next

Here is a list of ideas I would explore next if I were to pursue this experiment:

Repeating the task several times with the same model, the same requirements and no or little change in the prompt or in the model settings (temperature, top_p…), because I do suspect that the output can be quite different.
Giving those models more challenging tasks.
Experimenting with, and commenting on, an AI coding assistant in agentic loop mode (giving the assistant tools to validate the code and letting it iterate).
Finding more ways to validate the code: more specific requirements (logging, performance…), test coverage, code quality, code style…

Conclusion

First, those models are globally good at implementing the requirements, with little trial and error (i.e. here with just a few extra rounds).

When I went to compare the generated code of two models, I found a lot of little differences (spacing, quotes, choice of comments, preferred util function…) that make the code hard to compare. It made me realise that with AI-generated code we should still apply a code style guide and formatting rules, probably even more than before. It is important to make the code as stable as possible!
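
As an illustration of how purely cosmetic differences could be factored out before comparing two generated scripts, here is a minimal sketch that round-trips both sources through the standard ast module (ast.unparse, Python 3.9+) and diffs the result with difflib. This was not part of the experiment; normalise and structural_diff are hypothetical names, and a real project would more likely rely on a shared formatter and linter configuration.

```python
import ast
import difflib


def normalise(source: str) -> list[str]:
    """Re-emit the source from its AST.

    Spacing, quote style and comments are lost in the round trip,
    so two sources differing only in those aspects become identical.
    """
    return ast.unparse(ast.parse(source)).splitlines()


def structural_diff(a: str, b: str) -> list[str]:
    """Unified diff of two sources after normalisation (empty if equivalent)."""
    return list(
        difflib.unified_diff(normalise(a), normalise(b), lineterm="", n=0)
    )
```

Two snippets that differ only in spacing, quotes or comments produce an empty diff, while a real change (another function call, another argument) still shows up.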

The pattern for developers using an AI agent is now this: you write “plain English” specifications and then review the generated code. If you are aiming for a productivity gain (which is the main selling point of those tools), you have to relinquish some control over the generated code. You cannot possibly spend as much time reviewing the implementation (reading, thinking, imagining scenarios) as you would have spent writing it yourself. In the same way, you would have to relinquish some control if you were to delegate the task to another developer. (But in this case it is not someone you can have a conversation with and ask for clarification, so it is a bit different.) I suppose there are ways to make it work: more high-level reviews, a focus on the important implementation choices, a good understanding of the models, good test coverage, … but that is another subject.

I do realise that inference pricing for advanced proprietary models is significantly higher than for open-weight models (whose pricing, because of the competition, is simply related to inference cost, hence lower). I feel that sometimes it is worth it and sometimes it isn’t:

When a more advanced model significantly lowers the number of implementation issues, which carry code-review and maintenance costs, then it is worth it. This applies to complex and not fully scoped tasks…
When the requirement is clear and the logic can have good automated test coverage, then smaller (cheaper) models have their place and should be preferred. The same holds when the code generation can be reproduced very similarly by another run in almost the same conditions. I believe there are still a lot of such use cases where a smaller model can bring a development productivity gain.

I started this experiment with a very early-2025 view of the capabilities of AI coding assistants. What I mean by that is that, while I knew they could already provide a productivity gain, I was sceptical about the quality of the generated code and the complexity of the tasks they could handle. But various readings during that time made me realise that recent models (GPT-5.2, Claude Opus 4.5 and above) have improved a lot. We have not yet figured out how to use them in the best way on mission-critical software. However, I do believe that soon developers will not be required to hand-write code line by line. For that to happen, we need to figure out ways to architect, structure, channel and validate the generated code in a codebase over the long term, and still deliver mission-critical software safely.