Research preview · Method note
A self-editing knowledge base for music generation
An evidence-based method for asking an AI what knowledge it actually needs to do its job — applied to music as a first test case.
Most working AI systems are scaffolded by hand-written context — system prompts, knowledge files, style guides, retrieval blocks. Authors choose what to include based on intuition. This experiment proposes a method for replacing that intuition with evidence: let the model rewrite its own knowledge files one line at a time, evaluate each edit through blind human comparison of downstream output, and keep only the changes that win. After thirty rounds, the surviving cookbook — and crucially the distribution of edits across files — is a measurement of which kinds of context actually matter for the task. We apply the method to a music generation agent.
If we let an AI rewrite its own context one line at a time and grade each line by whether downstream output beats the previous version in blind A/B testing, the lines that survive will reveal which kinds of knowledge are actually load-bearing for the task — and which were never doing any work at all.
One round, seven steps.
Each step is explained in detail below, and the label after each step tells you who is acting; a minimal code sketch of the full loop follows the list.
1. Editor proposes one edit · AI
2. Edit applied to cookbook · System
3. Generation prompt written · AI
4. Music model generates track · AI
5. Friends blind-listen and vote · Humans
6. Auto-resolve at 5 votes · System
7. Repeat · System
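The loop is small enough to state in code. Below is a minimal sketch of one round, assuming the concrete steps (the Editor call, the Lyria call, vote collection) are supplied as callables. Every name here is a placeholder of ours, not the orchestrator's actual internals.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Round:
    # Hypothetical callables standing in for the real implementations.
    propose_edit: Callable      # step 1: Editor proposes one edit
    apply_edit: Callable        # step 2: orchestrator writes it to disk, returns a snapshot
    write_prompt: Callable      # step 3: Editor reads the new cookbook, writes the prompt
    generate_track: Callable    # step 4: Lyria turns the prompt into audio
    collect_votes: Callable     # step 5: friends blind-listen, pick A or B
    resolve: Callable           # step 6: keep or revert once five votes are in
    restore: Callable           # rollback path used when the verdict is not "keep"

    def run(self, cookbook: dict, brief: str, history: list) -> str:
        edit = self.propose_edit(cookbook, brief, history)
        snapshot = self.apply_edit(cookbook, edit)
        prompt = self.write_prompt(cookbook, brief)
        audio = self.generate_track(prompt)
        votes = self.collect_votes(audio)
        verdict = self.resolve(votes)
        if verdict != "keep":
            self.restore(snapshot)   # reverted or tied: the file goes back
        history.append({"edit": edit, "prompt": prompt, "verdict": verdict})
        return verdict               # step 7: the caller loops until a stop condition
```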
Who controls what.
The most important question in any human-AI loop is where the decision boundary sits. Here it is, drawn explicitly. The AI never touches the ground truth — humans always do.
Ollama (Editor)
- Proposes which file to edit
- Decides what to add, change, or remove
- Writes the generation prompt
- Reads the change history
Orchestrator
- Applies edits to disk
- Runs the music model
- Captures audio output
- Tallies votes, decides keep/revert
- Logs every iteration
You + your friends
- Set the brief at the start
- Listen blind and pick A or B
- Provide the noise-floor signal
- Read the change log
Five files. The AI can edit any of them.
We started each file deliberately short. The whole point is to see how the AI fills them in, and which files end up mattering. (A sketch of how an orchestrator might represent these files on disk follows the list.)
agent.md · Who the AI is and what it cares about. The agent's job description, written for itself to read on every iteration.
Why it matters · Does telling a model who it is change what it makes? This is the most testable claim about identity priming. We expect this file to receive few kept edits.
prompt-craft.md · How to write effective prompts to a music generator. The mechanical craft of converting a brief into something a generator can act on.
Why it matters · If the bottleneck is prompting skill, this file will accumulate edits. If the bottleneck is musical knowledge, it won't. Either way the result is informative.
emotional-mapping.md · How abstract feelings translate into concrete musical choices: key, tempo, instrumentation, density.
Why it matters · The brief is in human language (mood words). The generator wants technical specifications. This file is the bridge — and we expect it to be load-bearing.
genre-craft.md · The signature elements of specific genres. Tempo ranges, instrumentation, production conventions, characteristic moves.
Why it matters · Genre fluency is the difference between a track that feels like the genre and one that just gestures at it. Whether models need this written down is an open question.
anti-patterns.md · What to avoid. The mistakes that make AI music sound generic, fatiguing, or like AI music.
Why it matters · Knowing what not to do may be more powerful than knowing what to do. This file tests whether explicit anti-patterns shift output away from local optima.
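For concreteness, here is one way an orchestrator might represent the cookbook on disk. The file names come from the list above; the one-line descriptions are paraphrases, and the loader is a hypothetical helper, not the experiment's actual code.

```python
from pathlib import Path

# The five cookbook files and what each is for (paraphrased from the list above).
COOKBOOK_FILES = {
    "agent.md": "who the AI is and what it cares about",
    "prompt-craft.md": "how to write effective prompts to a music generator",
    "emotional-mapping.md": "how feelings map to key, tempo, instrumentation, density",
    "genre-craft.md": "signature elements of specific genres",
    "anti-patterns.md": "what to avoid",
}

def read_cookbook(root: Path) -> dict[str, str]:
    """Load all five files so the Editor sees the whole cookbook every turn."""
    return {name: (root / name).read_text() for name in COOKBOOK_FILES}
```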
One file. One change. One hypothesis.
The Editor — an open-weights model running on Ollama Cloud (currently gpt-oss:120b-cloud) — reads the entire cookbook (all five files), the fixed brief for the experiment, and the full history of previous edits with their verdicts. It then proposes a single edit: which file to change, what the new content should be, and a short hypothesis for why this edit will improve the next track.
The constraint is strict: one file per turn. Multiple-file edits are rejected by the orchestrator. The change can be additive (a new line, a new paragraph), revisional (rewriting an existing line), or a deletion. The Editor sees its own change history so it can avoid repeating reverted ideas and can pursue threads that have already won keep verdicts.
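One plausible shape for that proposal, assuming the Editor is asked for structured output with exactly these fields. The field names and the validation below are our sketch of the one-file constraint, not a published schema.

```python
from dataclasses import dataclass

@dataclass
class EditProposal:
    file: str          # which single cookbook file to change
    new_content: str   # the full post-edit content of that file
    hypothesis: str    # why this edit should improve the next track

def validate(proposals: list[EditProposal], allowed_files: set[str]) -> EditProposal:
    """Enforce the one-file-per-turn rule before anything touches disk."""
    if len(proposals) != 1:
        raise ValueError("multiple-file edits are rejected")
    (proposal,) = proposals
    if proposal.file not in allowed_files:
        raise ValueError(f"not a cookbook file: {proposal.file}")
    return proposal
```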
The orchestrator writes the change to disk.
Before applying the edit, the orchestrator snapshots the file's current content into the database. This snapshot is what gets restored if the iteration is later reverted. The Editor's proposed new content is then written verbatim to the markdown file.
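A sketch of that snapshot-then-write step, assuming a SQLite table for per-iteration snapshots. The note only says the prior content is stored in the database and restored on revert; the schema and function names here are assumptions.

```python
import sqlite3
from pathlib import Path

SNAPSHOT_DDL = """CREATE TABLE IF NOT EXISTS snapshots
                  (iteration INTEGER, filename TEXT, content TEXT)"""

def apply_edit(db: sqlite3.Connection, root: Path, iteration: int,
               filename: str, new_content: str) -> None:
    """Snapshot the current file into the database, then write the edit verbatim."""
    db.execute(SNAPSHOT_DDL)
    path = root / filename
    old = path.read_text() if path.exists() else ""
    db.execute("INSERT INTO snapshots VALUES (?, ?, ?)", (iteration, filename, old))
    db.commit()
    path.write_text(new_content)

def revert(db: sqlite3.Connection, root: Path, iteration: int) -> None:
    """Restore the pre-edit content if the iteration is later reverted."""
    row = db.execute("SELECT filename, content FROM snapshots WHERE iteration = ?",
                     (iteration,)).fetchone()
    if row is not None:
        (root / row[0]).write_text(row[1])
```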
The Editor reads the new cookbook and writes a music prompt.
With the cookbook in its updated state, the Editor is asked to produce a single dense prompt for the music generator — one or two paragraphs covering instruments, tempo, mood, structure, and production. This is the output of the cookbook in this iteration: how do all five files, taken together, translate into a concrete prompt?
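A sketch of that request, assuming the ollama Python client and the gpt-oss:120b-cloud model named above. The instruction wording is ours, not the experiment's actual system prompt, and pointing the client at Ollama Cloud (host and auth) is a configuration detail omitted here.

```python
import ollama

def write_generation_prompt(cookbook: dict[str, str], brief: str,
                            model: str = "gpt-oss:120b-cloud") -> str:
    """Ask the Editor for one dense prompt, given the freshly edited cookbook."""
    files = "\n\n".join(f"## {name}\n{text}" for name, text in cookbook.items())
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system",
             "content": "You write prompts for a music generation model."},
            {"role": "user",
             "content": (f"Brief:\n{brief}\n\nCookbook:\n{files}\n\n"
                         "Write one dense prompt, one or two paragraphs, covering "
                         "instruments, tempo, mood, structure, and production.")},
        ],
    )
    return response["message"]["content"]
```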
Google Lyria 3 produces audio via the Gemini API.
The generation prompt is sent directly to Google's Lyria 3 music model (lyria-3-clip-preview) through the Generative Language API. Lyria returns the track as base64-encoded MP3 in the response payload. The audio is decoded, saved alongside the iteration row in the database, and synced to the live site so listeners can stream it from a comparison link.
If Lyria's content filter rejects a prompt — it sometimes flags phrases that resemble specific artists or releases — the orchestrator automatically retries once with a more abstract version of the prompt before failing the iteration.
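A sketch of that call. The endpoint path, request body, and response field name below are assumptions modelled on the description above (Generative Language API, base64 MP3 in the payload); they are not taken from Lyria documentation, and the content-filter check and prompt-abstraction step are crude placeholders for the real retry logic.

```python
import base64
import requests

API = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/lyria-3-clip-preview:generate")   # assumed endpoint shape

def abstract(prompt: str) -> str:
    # Placeholder: the real retry rewrites the prompt in more abstract terms.
    return "Original instrumental piece, no references to specific artists. " + prompt

def generate_track(prompt: str, api_key: str) -> bytes:
    def call(p: str) -> requests.Response:
        return requests.post(API, params={"key": api_key},
                             json={"prompt": p}, timeout=120)

    resp = call(prompt)
    if resp.status_code == 400:          # crude stand-in for a content-filter rejection
        resp = call(abstract(prompt))    # one automatic retry before failing the iteration
    resp.raise_for_status()
    b64_mp3 = resp.json()["audio"]       # assumed field carrying the base64 MP3
    return base64.b64decode(b64_mp3)
```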
Two players, A and B, in randomized order.
Listeners visit a comparison URL and see two audio players. They don't know which is the new track and which is the previous winner. They listen, click whichever they would rather hear again, and optionally leave a one-line note. There is no rubric, no scale, no score. Just A or B.
The deliberate constraint is that voters are untrained. The experiment is not asking experts to grade craft; it is asking real listeners which track they prefer when stripped of context. That preference is the only signal we accept.
Keep, revert, or tie.
Once a comparison has accumulated five votes, the system resolves the iteration automatically. If 60% or more of votes prefer the new track, the edit is kept. If 60% or more prefer the previous track, the edit is reverted and the file is restored from snapshot. A tie also reverts — a tie is no signal.
Every resolved iteration is permanently logged: the file edited, the before-and-after diff, the hypothesis, the generation prompt, the audio, the vote tally, and the verdict.
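The resolution rule is simple enough to write down directly. A minimal version, using the vote count and threshold stated above (function and argument names are ours):

```python
def resolve(votes_new: int, votes_prev: int,
            min_votes: int = 5, threshold: float = 0.6) -> str | None:
    """Auto-resolve a comparison: keep, revert, or keep collecting votes."""
    total = votes_new + votes_prev
    if total < min_votes:
        return None                      # not yet resolved
    if votes_new / total >= threshold:
        return "keep"
    return "revert"                      # previous track preferred, or a tie: no signal
```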
Thirty rounds, or until the experiment plateaus.
The next iteration begins immediately. The Editor sees its updated history — including the verdict that just landed — and proposes the next edit. The experiment terminates at the earliest of: thirty resolved iterations, five consecutive iterations with no kept edit, or a noise-floor check that exceeds 70%.
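The stopping rule, as described, in a sketch; the verdict strings and the shape of the history records are hypothetical.

```python
def should_stop(history: list[dict], latest_noise_floor: float | None) -> bool:
    """Terminate at 30 resolved iterations, 5 straight non-keeps, or a >70% noise floor."""
    if len(history) >= 30:
        return True
    last_five = [h["verdict"] for h in history[-5:]]
    if len(last_five) == 5 and "keep" not in last_five:
        return True
    if latest_noise_floor is not None and latest_noise_floor > 0.70:
        return True
    return False
```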
The artifact is a ranking, not a track.
The final track produced after thirty iterations is interesting but not the point. The point is the distribution of kept edits across the five files. If emotional-mapping.md accumulates eight kept edits and agent.md accumulates zero, the experiment has produced a finding: emotional vocabulary was load-bearing context; self-description was not.
The dashboard renders this distribution live as a per-file impact table. It is the actual research output of the experiment.
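Computing that table from the iteration log takes only a few lines; the record keys here are assumptions about how the log is shaped.

```python
from collections import Counter

def impact_table(history: list[dict]) -> dict[str, dict[str, int]]:
    """Per-file count of kept vs. reverted edits: the experiment's real output."""
    counts: dict[str, Counter] = {}
    for record in history:
        counts.setdefault(record["file"], Counter())[record["verdict"]] += 1
    return {f: {"kept": c["keep"], "reverted": c["revert"]}
            for f, c in counts.items()}
```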
Most wins are luck. We measure the noise floor.
Generative music models are stochastic. Two tracks generated from an identical prompt may sound noticeably different, and listeners will often prefer one over the other for reasons unrelated to any edit. A naive 60% B-preference threshold is not enough on its own to claim an edit was responsible for an improvement.
To address this we run a noise-floor check every ten iterations: two tracks generated from the same current-best prompt, with no edit, presented to listeners as A and B. The percentage of listeners who prefer one over the other places a lower bound on what a real-edit win must beat.
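A sketch of that check. The scheduling and the single-number summary are our choices: the floor is taken as the majority share in a comparison of two tracks generated from the identical prompt with no edit applied.

```python
def is_noise_check(iteration: int) -> bool:
    """Every tenth iteration: same prompt, two generations, no edit."""
    return iteration % 10 == 0

def noise_floor(prefer_a: int, prefer_b: int) -> float:
    """Majority share in a no-edit comparison: the bound a real-edit win must beat."""
    total = prefer_a + prefer_b
    return max(prefer_a, prefer_b) / total if total else 0.5
```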
The method is not really about music.
Music is a convenient first domain because untrained listeners can form a preference between two thirty-second clips in seconds. The same loop applies wherever output can be human-judged through pairwise comparison: writing, design, summarization, code review.
The deliverable in every case is the same: a ranking of which kinds of context actually moved a downstream evaluator. That ranking can then be used to compress, simplify, or audit any hand-written agent scaffolding — replacing intuition with measured signal.
If a friend sent you a comparison link.
A comparison link looks like /compare/<n>. You will see two audio players, A and B, in randomized order. Listen to some of each. Click whichever you would rather hear again. An optional one-line comment is welcome but not required.
You do not need to know anything about music. The whole point of the design is that you don’t. Untrained ears are the control we want. Five votes from people like you are enough to resolve a single iteration.
Two musicians in San Francisco trying to understand it from the inside.
This is independent research. Neither of us is doing it on behalf of a lab, a company, or a grant. We’re running it because we want to understand a technology that’s about to reshape an art form we both love — and the only honest way to understand it is to use it, every week, until intuitions form.
Aaron Ta
Builder, listener, San Francisco. Spends a lot of time inside the city's electronic music scene as a fan. Treats this project as a way to get a feel — not a take — for what generative models can and cannot do for the music he already loves.
Derick Ngan
San Francisco DJ. The practitioner side of this collaboration — the person whose ear is the long-term test of whether any of this is producing music that holds up next to the human-made tracks he plays out.
Why we’re actually doing this
The conversation about generative music is dominated by people who either dismiss the technology entirely or hype it without using it. We don’t want to be in either camp. The honest position lives somewhere quieter: hands on the tools, ear on the output, doing the work of forming a real opinion through use.
The motivating question for both of us isn’t can AI make music? — it’s what does this mean for the artists we care about? Working musicians, DJs, producers, the underground electronic scene in cities like ours.