Co-Writing with an LLM: Critical Code Studies and Building an Oxford TSA App

David M. Berry


Figure 1: Final version (2.5) of the Oxford TSA QM

In the field of critical code studies, we seek to understand how software code, as a technical and cultural object, shapes our world. In contrast to thinkers like Friedrich Kittler, who argued that media technologies determine our situation, here I analyse the co-constitutive logic, structure, and materiality of code in conjunction with the co-creation capabilities of generative AI (Kittler, 1999; Berry, 2011). As Marino (2020) argues, "code means more than merely what it does; we must also consider what it means". Until now, however, this has typically been a painstaking process of reading and interpreting complex and often historical codebases. Today, I find myself at a new juncture. With the emergence of powerful Large Language Models (LLMs), we can develop new ways to engage with code, not just as an object of study, but as a medium for collaborative creation and study.

In this article I look at the creation of what I call the Oxford TSA Question Master, a small piece of educational software I developed to address a practical problem.[1] Students preparing for the University of Oxford's Thinking Skills Assessment face a significant challenge. Official past papers are available in PDF format, but there is no interactive way to practise under timed conditions, receive immediate feedback, or track progress across the fifty questions in Section 1. The manual process of checking answers against a separate answer sheet is tedious and error-prone, so I wanted to automate it, creating a browser-based tool that would display the official questions, accept answers, auto-mark the Section 1 results, and provide feedback, all whilst simulating the actual test conditions with correct timers. It should also be noted that the Oxford TSA becomes completely digital in 2025, which makes this a more realistic test environment.[2]


Figure 2: Karpathy (2025), https://x.com/karpathy/status/1886192184808149383?lang=en

The entire application was developed through a series of 67 text prompts with Google's Gemini model (version 2.5 Pro as of October 2025). Karpathy (2025) calls this process "vibe coding", a rapid, iterative dialogue where one steers the high-level direction and the LLM handles the low-level implementation (see also Karpathy 2017, 2025b). We might note that the term "vibe coding" itself demands critical scrutiny, as Karpathy's neologism performs significant ideological work through its studied casualness. By foregrounding "vibe", an affective, intuitive register, it effaces the technical complexity and material labour involved in LLM development and deployment. It seems to suggest a frictionless, almost magical process where one simply articulates intentions and code materialises, avoiding questions about whose knowledge the model has appropriated, under what conditions that knowledge was produced, and what forms of labour the interface obscures. As such, we should view vibe coding less as a neutral description than as a rhetorical term, naturalising a mode of development that brackets questions about authorship, intellectual property, reproducibility, and critical distance. Nevertheless, I retain Karpathy's term throughout this article precisely because its casual framing reveals something important about how generative AI encourages users to adopt an unreflective, instrumental stance towards computational collaboration. By examining what happens when one engages in vibe coding whilst maintaining critical attention to its limitations and ideological functions, this case study aims to keep us alert to what the term conceals.

This case study, therefore, has two aims. First, to show how to produce a useful tool for students. Second, and more importantly, to critically document the co-writing process itself, exploring what it means to analyse software by participating directly in its formation through a tight feedback loop with a non-human partner. This approach builds on Marino's (2020) insight that critical code studies "applies critical hermeneutics to the interpretation of computer code, program architecture, and documentation within a socio-historical context", but extends it into the realm of collaborative code generation with artificial intelligence.

The development proceeded through three distinct phases, each representing a shift in approach in response to the LLM's inability to operationalise the requested prompt, and each offering an instructive experience of the limits and possibilities of automated code generation.

Phase One: The Failed PDF Extraction Approach

My initial idea was straightforward. I would have the program extract the questions and answers directly from the Section 1 question and answer PDFs, convert them into structured text, and present the questions to the user one by one in a single interface. The user would select their answers, and the system would mark them automatically from the answer sheet. This seemed elegant and simple.

My first prompt reflected this idea:

PROMPT:  
"Create a test program in html that will take the questions for the Oxford TSA from three files (Section 1 questions, Section 1 answers, Section 2 questions) that I can run in the browser and asks the user the questions, then on submission marks their answers"

The LLM then generated the code to parse the PDF and extract text. Initially, this appeared to work. When opening the question file PDF, questions 1 through 20 would parse correctly. But then, consistently, somewhere between questions 30 and 40, the parsing would fail, generating an error. The structure of the PDF, it turned out, was not as uniform as I had assumed. Formatting inconsistencies, page breaks, and subtle layout changes caused the regular expressions to misfire. It is a striking feature of vibe coding that rather than normalising the PDF documents, extracting the text beforehand, or even reading through the document format, you can simply throw the documents at the LLM and see what happens. This has good and bad sides. On the positive side, it is creatively liberating not to have to worry about the details and to give only the abstract program logic without the coding detail otherwise needed. On the negative side, one is likely to lead the LLM astray, or to give it a task to which it is not suited or that exceeds its capacities, or even one that is barely formulated as a program at all, offloading the problem onto the LLM. This we might call the tendency towards cognitive augmentation that LLMs encourage in vibe coding.
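
To give a sense of what this looked like in practice, the following is a minimal sketch of the kind of extraction logic the LLM produced at this stage. It assumes the pdf.js library (pdfjsLib) is available in the page; the function name and the regular expression are my illustrative reconstructions rather than the generated code itself.

javascript
// Sketch of the fragile extraction approach, assuming pdf.js (pdfjsLib) is loaded.
async function extractQuestions(file) {
    const buffer = await file.arrayBuffer();
    const pdf = await pdfjsLib.getDocument({ data: new Uint8Array(buffer) }).promise;
    let fullText = '';
    for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const content = await page.getTextContent();
        fullText += content.items.map(item => item.str).join(' ') + '\n';
    }
    // Assumes each question is introduced by its number and followed by options
    // A-E in order: precisely the assumption that breaks when the layout varies.
    const questionRegex = /(\d{1,2})\s+(.+?)\s+A\s+(.+?)\s+B\s+(.+?)\s+C\s+(.+?)\s+D\s+(.+?)\s+E\s+(.+?)(?=\s+\d{1,2}\s+|$)/gs;
    const questions = [];
    let match;
    while ((match = questionRegex.exec(fullText)) !== null) {
        questions.push({
            number: parseInt(match[1], 10),
            stem: match[2],
            options: { A: match[3], B: match[4], C: match[5], D: match[6], E: match[7] }
        });
    }
    return questions;
}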

PROMPT: 
"fix Failed to parse all 50 questions or answers. Parsed 31 questions and 50 answers. Please check PDF format." make the error handling more robust if the PDF document is slightly out of format."

I refined my prompts, asking for more robust parsing logic. The LLM generated increasingly complex regular expressions and error-handling routines. Yet even when the parsing succeeded technically, a more fundamental problem emerged. The LLM would subtly alter the questions. It would paraphrase, shorten, or "clean up" the wording. This was unacceptable as it would completely change the meaning of the text or make the question unanswerable. The TSA questions are carefully worded, and even small changes could alter their meaning or difficulty.

Figure 3: Original Q8 question in PDF
Figure 4: Question 8 after LLM extraction showing text corruption


But this design flaw was something I should have anticipated from the start. Many TSA questions contain diagrams, tables, or visual elements. A text extraction approach could not reproduce these, as the formatting is inconsistent across the document and the generated software is a simple JavaScript application running in the browser. Question 23 might present a complex logical diagram. Question 31 might include a data table. These are integral to the questions, not decorative or ornamental additions. No amount of prompt engineering could overcome this limitation; there are simply too many edge cases and exceptions to the rule. I therefore had to abandon the text extraction approach.

This failure, however, proved instructive. It revealed the boundary between what an LLM can reliably do (generate code structure, handle state, implement UI logic) and what it cannot (create software that can parse semi-structured documents or preserve visual information through text conversion). This experience confirms Manovich's (2001) observation that digital media operates through a tension between automation and variability. The LLM could automate code generation, but it could not maintain the necessary fidelity to extract the original source material. As Kirschenbaum (2008) demonstrates in his forensic approach to digital texts, all documents possess forensic materiality that resists abstraction. The PDF's encoding of page breaks, column layouts, embedded images, and typographic emphasis constituted a material substrate that the LLM's text extraction necessarily destroyed. Indeed, the LLM's tendency to paraphrase and "clean up" extracted text reveals how pattern-matching differs from interpretive understanding. Where a human reader recognises that precise wording matters in logical reasoning questions, the LLM treated variation as acceptable optimisation. I needed a different strategy, one that acknowledged the materiality of the PDF format itself, what media archaeologists like Kittler (1999) would recognise as the medium's specific materiality.

This failure reveals a critical dimension of what I termed earlier the tendency towards cognitive augmentation in vibe coding. The ease of offloading problems to the LLM created a false sense of their capabilities. I had assumed the LLM could parse complex, semi-structured documents because previous prompts had succeeded with simpler tasks. The LLM, for its part, generated increasingly elaborate solutions without signalling that the approach was flawed. This dynamic shows how cognitive augmentation can become cognitive delegation: rather than improving my architectural decisions, the LLM's responsiveness encouraged me to persist with what turned out to be an unworkable strategy. The human architect or curator of software in AI mode must therefore maintain critical distance from the LLM's apparent capabilities, recognising that its willingness to generate code does not indicate the approach is necessarily sound.

This raises interesting methodological questions about the role of failure in vibe coding. In traditional software development, failure provides diagnostic information, including error messages, stack traces, and debugger output that reveal where problems lie in the implementation. In vibe coding, failure operates differently. The LLM rarely refuses a prompt or signals that an approach is flawed. Instead, it generates plausible-looking code that fails in practice, shifting the burden of judgement entirely onto the human co-writer. This asymmetry, where the LLM appears willing to attempt anything whilst lacking the capacity to evaluate its soundness, requires that we theorise failure not merely as technical breakdown but as an epistemological signal about the limits of pattern-matching systems. Failure in vibe coding reveals not just bugs but the category boundaries of what LLMs can coherently model. As such, documenting and analysing failures may be as important methodologically as documenting successes, particularly for understanding when human judgement remains key to these systems' successful use and integration into social processes.

Phase Two: A PDF Display Solution

The breakthrough came from a simple realisation. I did not need to extract the questions at all. I could simply display the actual PDF directly in a browser frame, allow the user to read the document, and ask them to click their answer. Each question in the TSA has five possible answers: A, B, C, D, or E, so this made the programming much simpler. The user could read the question in its original, unaltered form in a left-hand panel, complete with diagrams and tables, and then indicate their choice in a right-hand multiple-choice (A-E) selection panel.

This change could be said to exhibit what Bolter and Grusin (1999) call remediation, the process by which new media refashion older media forms. Rather than attempting to translate the PDF into a new digital format (which inevitably introduced too many errors and losses), I would remediate the PDF by embedding it within an interactive interface. The PDF would remain intact as a medium, but would be given new functionality through the surrounding software layer. This approach builds on the idea of a protocol of the document, its conventions and material properties, whilst extending its capabilities through algorithmic augmentation (Gitelman 2006).

My prompt reflected this new approach:

PROMPT:
"ok this isn't working. instead rewrite the original TSA testing program. Make it so that when first run it asks for the section 1 question and then the section 1 answer sheet and the section 2 question sheet. Parse only the Section 1 answer sheet so that the program know the answers. Then instead of asking the questions, get it to show the section 1 questions PDF and ask for answers (a-e) for each of the questions (always showing the PDF in a frame of the page) and then move through the entire question sheet (1-50). Then in section 2 show the pdf for that section and offer a text box to write the answer."

This required a complete architectural restructuring. The LLM generated a two-panel layout using an embedded PDF viewer on the left and an answer panel on the right. 
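
A minimal sketch of how such a layout can be wired up in the browser, assuming hypothetical element IDs (pdfFrame for the embedded viewer, s1QuestionFile for the file input); this reconstructs the general approach rather than reproducing the generated code.

javascript
// Sketch only: the element IDs are assumptions, not the program's actual names.
const pdfFrame = document.getElementById('pdfFrame');             // left-hand <embed> or <iframe>
const s1QuestionFile = document.getElementById('s1QuestionFile'); // file input for the Section 1 PDF

s1QuestionFile.addEventListener('change', (e) => {
    const file = e.target.files[0];
    if (!file) return;
    // Display the official PDF unaltered in the left-hand panel.
    pdfFrame.src = URL.createObjectURL(file);
});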

The move from Phase One to Phase Two shows how cognitive augmentation can function productively. Here, my architectural judgement, the recognition that extraction was flawed, directed the LLM toward a more realistic solution. The LLM then augmented my implementation suggestions, rapidly generating the interface code I wanted but which would have taken me considerably longer to write manually. This shows, I think, the optimal human-LLM division of labour, with human strategic thinking combined with machine code generation. When we finally hit on the right solution, the augmentation remained genuinely supplementary rather than substitutive, with each actor contributing capabilities the other lacked.

The user would navigate through the PDF manually or using browser controls, whilst the software tracked which question they were answering through a simple counter and buttons.
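
A sketch of that counter logic, reusing the names that appear in the code excerpts later in this article (currentQuestion, updateS1Display, nextBtn, prevBtn); the bodies are my reconstruction, not the generated source.

javascript
// Reconstruction: clamp the question counter to the fifty Section 1 questions.
let currentQuestion = 1;

nextBtn.addEventListener('click', () => {
    currentQuestion = Math.min(50, currentQuestion + 1);
    updateS1Display();
});

prevBtn.addEventListener('click', () => {
    currentQuestion = Math.max(1, currentQuestion - 1);
    updateS1Display();
});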

The LLM generated answer-sheet parsing logic that could handle this more structured format reliably. Even when the PDF-to-text conversion introduced spacing inconsistencies, I could prompt for corrections.

The LLM responded with a more robust regular expression:
javascript
// Parse the Section 1 answer key: a list of question numbers and letters A-E.
function parseS1Answers(text) {
    const answers = {};
    // Match a question number (1-50) followed by a single answer letter A-E,
    // tolerating stray characters introduced by the PDF-to-text conversion.
    const answerRegex = /(\d{1,2})\s*[^A-Z0-9]*?\s*([A-E])(?![a-zA-Z])/g;
    let match;
    while ((match = answerRegex.exec(text)) !== null) {
        const questionNum = parseInt(match[1]);
        const answer = match[2];
        if (questionNum >= 1 && questionNum <= 50) {
            answers[questionNum] = answer;
        }
    }
    return answers;
}

The beauty of this approach was that it sidestepped most of the problems of Phase One. The questions were displayed exactly as the Oxford PDF presents them. No information was lost, no paraphrasing occurred, and visual elements remained intact. The software became a thin layer of interactivity wrapped around the original materials. 

Timing was also crucial. The real TSA is strictly timed: 90 minutes for Section 1 and 30 minutes for Section 2. I prompted:

PROMPT:
"Add a timer to this that ticks in the right hand top corner. It should count down from 90 mins for section 1 and 30 mins for section 2. It should stop the test at that point and calculate the results."

The LLM generated countdown logic that displayed remaining time and automatically submitted the test when time expired.
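
The countdown is straightforward to sketch. The following is a minimal reconstruction, assuming a timerDisplay element and an onExpiry callback that are my naming rather than the program's:

javascript
// Sketch: count down from the given number of minutes, updating the display each second.
function startTimer(minutes, timerDisplay, onExpiry) {
    let remaining = minutes * 60;
    const interval = setInterval(() => {
        remaining--;
        const mins = Math.floor(remaining / 60);
        const secs = remaining % 60;
        timerDisplay.textContent = `${mins}:${String(secs).padStart(2, '0')}`;
        if (remaining <= 0) {
            clearInterval(interval);
            onExpiry(); // e.g. auto-submit the current section when time runs out
        }
    }, 1000);
    return interval;
}

// Usage matching the test's timings: 90 minutes for Section 1, 30 for Section 2, e.g.
// startTimer(90, document.getElementById('timerDisplay'), submitSection1);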

Figure 5: Side-by-side PDF and answer panel (early version)

But I still needed to mark the answers automatically. Here, the LLM proved remarkably effective. The Section 1 answer sheet is a much simpler document, just a list of question numbers and corresponding letters. I prompted:

PROMPT: 
"In section 2, add a text box so the user can write in which question they are answering. When you go back to review the results start the review at the first question. Under the section 1 results box give a 5x10 overview using numbered green and red circles to give a top level view of the answers that were right and wrong."

The new version of the software now produced this results screen for the student to check their answers. The LLM generated a grid layout that displayed all fifty questions, each cell coloured appropriately. This allowed for quick pattern recognition. Had the student struggled with a particular section? Were errors clustered or scattered?
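
A minimal sketch of how such marking and overview rendering might be implemented, assuming a container element and CSS classes (overview-circle, correct, incorrect) of my own invention; it reconstructs the general approach rather than the generated code.

javascript
// Sketch: mark Section 1 and render a numbered overview of coloured circles.
function showSection1Results(container, userAnswers, correctAnswers) {
    let score = 0;
    container.innerHTML = '';
    for (let q = 1; q <= 50; q++) {
        const isCorrect = userAnswers[q] === correctAnswers[q];
        if (isCorrect) score++;
        const cell = document.createElement('span');
        cell.textContent = q;
        // Hypothetical CSS classes: 'correct' renders green, 'incorrect' renders red.
        cell.className = 'overview-circle ' + (isCorrect ? 'correct' : 'incorrect');
        container.appendChild(cell);
    }
    return score; // out of 50
}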


Figure 6: Results screen: green correct/red incorrect (early version)


Taken together, the parsing and marking logic show the LLM overcoming the friction between structured data (our goal) and semi-structured data (the PDF text). The answer sheet could be parsed, the user's responses compared against it, and a score generated automatically. Phase Two was complete. The software now worked.

Phase Three: Refinement and Usability

With the software now functioning as I wanted, I turned my attention to the user experience. This phase was characterised by incremental additions, each prompted by imagining how a student would actually use the software under exam conditions and how best to present the results to them.

Next, I added output functionality. Students might want to save or share their results. I prompted for print and email options. The print function generated a formatted page with the score, the visual grid, and the essay from Section 2. The email function pre-populated a mailto link with the same information. These were small features, but they transformed the software from a practice tool into something with persistence and shareability.
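
A minimal sketch of the email option, assuming a hypothetical buildResultsSummary() helper that assembles the score, the grid, and the Section 2 essay text; the print option simply calls the browser's window.print() on a print-formatted view.

javascript
// Sketch: pre-populate an email with the results via a mailto link.
function emailResults() {
    const body = buildResultsSummary(); // hypothetical helper assembling score, grid and essay
    const subject = 'Oxford TSA practice results';
    window.location.href = 'mailto:?subject=' + encodeURIComponent(subject) +
                           '&body=' + encodeURIComponent(body);
}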

The marking functionality itself required careful thought. In the real TSA, students often mark questions they are unsure about and return to them later. I prompted:

PROMPT:
"add a function to be able to mark a question in section 1 that you want to come back to to check or that you are not sure about."

The LLM introduced a new state variable and integrated it into the display logic:
javascript
// Toggle the 'mark for review' flag on the current question.
markReviewBtn.addEventListener('click', () => {
    if (s1Submitted) return; // no marking once Section 1 has been submitted
    markedQuestions[currentQuestion] = !markedQuestions[currentQuestion];
    // If the question has already been answered, marking it clears the answer
    // so that it is flagged as still needing attention.
    if (userAnswers[currentQuestion] && markedQuestions[currentQuestion]) {
        delete userAnswers[currentQuestion];
    }
    updateS1Display();
});

This simple mechanism allowed students to flag questions and see at a glance which items needed a second look.

Then I sought to improve visual feedback. The screen needed a better layout, and I wanted students to see at a glance which questions they had answered correctly, which they had missed, and to be able to print the results. I also attempted to improve usability by adding keyboard support. Keyboard controls were essential for working through the test, since clicking through fifty questions with a mouse would be tedious. The prompt was:

PROMPT:
"In the section 1 mode. Move the overview circles lower down the page so it is not so confusing about where to answer the current question. Perhaps use colour on the background to make it clearer where the user should be clicking. Change the Mark for Review button switchable On/Off so the user can mark and then clear it if not needed. Add keyboard controls using arrow left and right to move through the questions and allow the user to type A-E as well as click on the circles to answer the questions. If the user clicks RETURN then move to the next question. If the user clicks <escape> then move to the first question. If the user uses <ctrl>+an arrow move to the left or right most question as appropriate. If they use the shift modifier jump 10 questions in either direction as appropriate. If the user clicks TAB then pause the exam and freeze the screen until they click or press a key (make this clear by greying out the exam) also add a PAUSE button next to the quit button. update the version to document this."

The LLM generated a comprehensive event listener:
javascript
document.addEventListener('keydown', (e) => {
    // Any key resumes the test if it is paused.
    if (isPaused) {
        togglePause();
        return;
    }
    switch (e.key) {
        case 'ArrowRight':
            // Shift+arrow jumps ten questions; a plain arrow moves one question.
            if (e.shiftKey) currentQuestion = Math.min(50, currentQuestion + 10);
            else nextBtn.click();
            updateS1Display();
            break;
        case 'ArrowLeft':
            if (e.shiftKey) currentQuestion = Math.max(1, currentQuestion - 10);
            else prevBtn.click();
            updateS1Display();
            break;
        case 'a': case 'A':
            // Typing a letter records it as the answer to the current question.
            selectAnswer('A');
            break;
        // ... cases for B, C, D, E
    }
});


This made navigation much faster and more fluid. A student could move through the test, answer questions, and mark items for review without ever touching the mouse.

It should be noted that at this point the context window for the conversation with the LLM was becoming unwieldy, and the LLM often became confused about what I required. This context degradation represents a third mode of cognitive augmentation, what we might call cognitive overhead. Rather than simply augmenting my capabilities, the LLM now required active management, for example in tracking version numbers, constraining scope with phrases like "do only these changes," and preventing unsolicited modifications. The cognitive resources freed by not writing implementation code were partially consumed by managing the LLM's context and preventing regressions and mistakes. This shows that cognitive augmentation is not simply additive but involves real trade-offs, where the implementation speed gained must be set against the frustration of managing context by hand, work that the LLM itself should arguably have been looking after.

In fact, at this point in the development the LLM would often get stuck reimplementing the last change, or would roll it back to a previous version. Careful prompting was required to remind it of the version number it was actually working on, e.g. "fix only this. update the version number", "add the 2.4 comment to this file", "update to version 2.5", etc.

The initial PAUSE implementation was buggy. After testing, I clarified:

PROMPT:
"Pausing is broken and can't be unpaused please fix it"

This led to corrected logic:
javascript
function togglePause() {
    isPaused = !isPaused;
    if(isPaused) {
        pauseOverlay.classList.remove('hidden');
        pauseBtn.textContent = "Resume";
    } else {
        pauseOverlay.classList.add('hidden');
        pauseBtn.textContent = "Pause";
    }
}

As can be seen above, this feedback loop is the essence of the co-writing process with an LLM: I identify a logical failure, prompt the LLM to correct the technical implementation, test again, and repeat the cycle.

Finally, I layered on contextual feedback. A raw score is useful, but students need to understand what it means. TSA results are typically interpreted against three approximate threshold scores. I prompted:

PROMPT:
"do only these changes. update the version number to 2.3. allow any key to unpause the test. remove Ctrl+arrow function as it doesn't work correctly. When the final submission is made of section 2 if the section 1 score is below 27 tell them they did not reach the threshold, if 28-32 - inform the user that they have reached the minimum threshold or required average score. If they achieve over 38 then say this is in the upper quintile and a very good score."

I then asked it to make these messages a little more interesting:

PROMPT:
"fix only this. Add the threshold message to the PRINT and the EMAIL functions. When the section 2 is submitted put the threshold message under the section 1 summary. Give the score, and the threshold reached ([cross]Below/[tick thumbs up emoji]Average/[tick tick and a winners cup emoji and mortarboard emoji]Above) and then summarise the different thresholds that might be met. If the result is above 45 then state excellent result with three ticks and a star emoji and 2 mortarboard emoji."

The LLM generated a function mapping scores to qualitative messages:
javascript
function getScoreFeedback(score) {
    if (score > 45) {
        return { message: "Excellent result!", emoji: "✅✅✅⭐🎓🎓" };
    } else if (score > 38) {
        return { message: "This is in the upper quintile and a very good score.", emoji: "✅✅🏆🎓" };
    } else if (score >= 28) {
        return { message: "You have reached the minimum threshold or required average score.", emoji: "✅👍" };
    } else {
        return { message: "Unfortunately, you did not reach the typical threshold score.", emoji: "❌" };
    }
}

Throughout this phase, I also added documentation directly into the HTML file. Version numbers, changelogs, usage instructions, and disclaimers were all embedded. Each significant change was logged, for example,

v1.0: Initial creation with PDF viewer
v1.1: Added countdown timers
v1.2: Corrected timer logic
v1.3: Enabled review of Section 1 results
...
v2.4: Added score threshold summary text

This versioning served both as documentation and as a chronological record of the development process itself. Finally, I made Section 2 more functional. This section required a 30-minute essay response. The initial text box was too small. I prompted to enlarge it, making it suitable for extended writing. With these refinements, the software felt complete.

Conclusions

Figure 7: Flowchart logic of the code 

The creation of the Oxford TSA Question Master demonstrates that vibe coding is remarkably effective for small, personal software projects. Undertaking this design and creation using generative AI has helped develop the idea of distinct modes of cognitive augmentation. Phase One showed cognitive delegation, where the LLM's responsiveness created false confidence in an unworkable approach. Phase Two demonstrated productive augmentation, where human architectural decisions directed the generative AI implementation toward workable solutions. Phase Three revealed cognitive overhead, where managing the LLM's context and preventing errors required human control that offset some efficiency gains. This tripartite framework of delegation, augmentation, and overhead offers a more nuanced account of vibe coding than the assumption that a generative AI can straightforwardly implement a solution. The 67 prompts I used reflect not just iterative improvements but an ongoing negotiation between these three modes, as I learned to maintain architectural control while keeping the LLM focussed on improving the software in a logical fashion.

However, this analysis of cognitive augmentation and hermeneutic co-creation risks obscuring the material conditions that make vibe coding possible. The ease with which I prompted Gemini through 67 iterations depends upon vast infrastructures of extraction and exploitation that remain invisible in the collaborative interface. As Horkheimer and Adorno (2002) argued, instrumental reason presents itself as neutral and efficient whilst concealing relations of domination. Similarly, the apparent neutrality of vibe coding, its promise of frictionless collaboration between human and machine, masks the global supply chains of labour, resources, and energy upon which it depends. It is certainly the case that critical code studies needs to attend not only to the epistemological and hermeneutic dimensions of LLM collaboration but also to its political economy, asking not just how we understand code differently when co-creating with AI, but whose labour and which planetary resources make such co-creation materially possible (see Berry 2014). It is interesting to note that Gemini calculated that although the development work spanned two days, actual interaction amounted to only 4 hours, which worked out, across 67 prompts, at around 3 minutes of compute time, or 0.047 GPU-hours on Google's custom TPU v5p hardware, costing approximately $0.20 (taking the associated infrastructure into account, a commercial cost might be £5-£15). For every 80 minutes I spent on high-level direction, critique, and testing, the LLM spent just 1 minute on low-level implementation. This is a staggering productivity multiplier.

Yet this process also raises questions about the nature of understanding itself when it involves a non-human collaborator. Gadamer's (1975) concept of the hermeneutic circle describes how the interpreter must "project before himself" a meaning for the text as a whole as soon as some initial meaning emerges in the text. But what happens when the "text" is code, and when that code is co-constructed through conversation with an artificial intelligence? The hermeneutic circle, in its classic formulation, operates through a continuous movement between part and whole, between the interpreter's prejudgements and the text itself, gradually achieving what Gadamer terms a "fusion of horizons" between reader and author.

In the case of vibe coding, however, this circle becomes altered. I brought my prejudgements about what the software should do, my understanding of the problem domain, and my experience with similar applications. The LLM brought its training on vast corpora of code, patterns abstracted from millions of programming examples, and statistical relationships between natural language descriptions and code implementations. Our conversation using prompts created, perhaps, a hermeneutic spiral rather than a circle, where each iteration produced not just refined understanding but new code that itself became an object of interpretation. For example, when I reviewed the LLM's generated functions for parsing PDF answers, I was simultaneously reading code (e.g. interpreting its technical operation) and reading my own intentions reflected back to me in computational form. The code became a kind of mirror, but a distorted one, showing me what I had asked for but also revealing the gaps and ambiguities in my own thinking and expression.

This introduces a kind of triadic hermeneutic structure. Where Gadamer describes a dialogue between interpreter and text (or between two human subjects in conversation), vibe coding involves a three-way exchange between human intention, computer generation, and executable code. The code is neither purely mine nor purely the LLM's. It exists in a strange ontological space, simultaneously authored and discovered, intentional and emergent. Each prompt I gave was an interpretive act, an attempt to describe my understanding of what the software should be. Each response from the LLM was both an interpretation of my prompt and a new text requiring interpretation. The resulting code was then executed in the browser, producing behaviour that itself required interpretation and often prompted further refinement, particularly when it failed or the functionality was flawed.

Gadamer (1975) argues that understanding is not merely a reproductive but always a productive activity and in co-writing with an LLM, this productivity becomes explicit and material. The computer produces code, but it does so by interpreting my natural language prompts through the lens of its training data. I then interpret that code, not just by reading it but by testing it, observing its behaviour, and deciding whether it matches my intentions. This process reveals something interesting about hermeneutics in the age of AI. Understanding is no longer simply about bridging the gap between two human consciousnesses, or between a reader and a historical text. It now involves negotiating meaning with non-human agents whose "understanding," if we can even call it that, is different in kind from our own, yet functionally similar enough to produce possible dialogue.

The implications for critical code studies are, I think, significant. If code is, as Marino (2020) suggests, something we must read for meaning beyond mere function, then co-generated code presents a unique hermeneutic challenge. Whose intentions are embedded in this code? What does it mean to interpret software whose authorship is distributed between human and machine? The traditional hermeneutic project assumed that understanding meant recovering or reconstructing the intentions of a human author, or at least engaging with meanings produced by human consciousness. But the LLM has no intentions in any meaningful sense. It has patterns, probabilities, and statistical associations. Yet these patterns, when prompted by human intention and shaped through iterative dialogue, produce code that functions and has contexts and meanings (e.g. drawn from the data set that trained it, the selection of that data, curation and reinforcement learning, and of course from the human user who is shaping and directing the final output).

This suggests we need to rethink our hermeneutic frameworks to account for hybrid authorship and distributed intentionality. The hermeneutic circle becomes a kind of hermeneutic network, a complex topology of meaning-making that includes human prejudgements, machine-learned patterns, executable code, observable behaviour, and the material constraints of computational systems. Understanding code in this context means understanding not just what it does or what it means, but how it emerged from a collaborative process that troubles the very categories through which we have traditionally understood authorship, interpretation, and meaning.

The cognitive augmentation notion developed here connects productively to historical concerns about anthropomorphising computational systems. Weizenbaum (1976), creator of the ELIZA chatbot in 1966, warned that "extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people". What Turkle (1984) and others later termed the "ELIZA effect" described users attributing genuine understanding to simple pattern-matching scripts. As my colleagues and I discovered in uncovering ELIZA's original source code, Weizenbaum's program operated through remarkably minimal linguistic transformations, yet users invested it with therapeutic insight and emotional intelligence (Berry and Marino 2024, Ciston et al 2026). Contemporary vibe coding represents both continuity and rupture with this phenomenon. The ELIZA effect described users misrecognising computational processing as understanding. 

In vibe coding, however, the misrecognition operates differently and potentially more dangerously. Where the ELIZA effect involved users projecting understanding onto simple computing systems, what we might call the "competence effect" emerges when the functional capability of an LLM masks the absence of semantic understanding. The LLM produces code that compiles, executes, and often does exactly what was requested, creating a feedback loop that reinforces assumptions about the system's capabilities. This competence effect is potentially more dangerous than the ELIZA effect because the evidence appears to support the user's anthropomorphism. The code works, bugs get fixed through iterative prompting, and the system appears to learn from corrections. Unlike ELIZA users who had to ignore the system's obvious repetitiveness, vibe coders receive constant reinforcement that the LLM "gets it". The competence effect thus represents a qualitatively different epistemological trap, one where generative ability becomes confused with real comprehension, and where the boundary between pattern-completion and understanding becomes increasingly difficult to maintain.

The three modes of cognitive augmentation can be mapped onto different degrees of the ELIZA effect. Cognitive delegation occurs when the LLM's apparent responsiveness convinces us it grasps our architectural goals, leading us to persist with potentially flawed approaches because the system generates plausible-looking code. Productive augmentation, conversely, involves maintaining critical distance, treating the LLM as a sophisticated pattern-completion engine rather than a genuine collaborator with understanding. Cognitive overhead emerges partly from the mental labour required to resist the ELIZA effect continuously, correcting misrecognitions and managing the system's context because we cannot rely on shared understanding. 

Where ELIZA's users anthropomorphised a simple script, vibe coders face the more subtle challenge of distinguishing between genuine collaboration and simulated comprehension. The LLM produces code that functions, answers prompts appropriately, and even appears to "learn" from corrections across iterations. Yet this surface competence conceals differences between pattern-matching and semantic understanding, between statistical probabilities and intentional meaning-making. Weizenbaum's warning thus remains important as the danger lies not in crude anthropomorphism but in subtle misattributions of understanding that lead us to delegate judgement inappropriately to computers. The cognitive augmentation framework I develop here offers analytical tools for resisting this misattribution, helping us recognise when we are treating the LLM as genuinely comprehending rather than statistically processing. It suggests that critical engagement with AI systems requires constant vigilance against the ELIZA effect when using contemporary AIs, maintaining analytical distance even whilst collaborating productively with these powerful machine learning systems.[3]

I think that these questions are also central to the future of digital humanities itself. As I have argued in Understanding Digital Humanities (Berry, 2012), computational techniques pose "theoretical and practical challenges" for traditional humanities disciplines, requiring us to rethink questions about interpretation and knowledge production. Hayles (2012) frames this even more directly, arguing that "we think through, with, and alongside media", something that becomes particularly urgent when the medium is an artificial intelligence capable of generating code from natural language. If digital humanities scholars must seek to understand how computational methods transform humanistic inquiry, then co-coding with LLMs represents a limit case, a situation where the traditional hermeneutic circle is fundamentally restructured by the presence of a non-human collaborator whose "thinking" operates according to different principles than our own and yet has important consequences for it.[4]

More significantly, however, this process represents a new method for critical code studies. As I have argued elsewhere, code is not just a technical artefact but a form of writing that actively shapes subjectivity and knowledge (Berry 2011). By engaging in this tight feedback loop with an LLM, I was forced to constantly articulate goals, identify logical inconsistencies, and make high-level architectural decisions. It is a form of active, interpretative reading where one comes to understand the software by co-creating it.

The cognitive augmentation framework I have introduced here helps explain these boundaries more precisely. Vibe coding succeeds when human judgement identifies workable problems and the LLM augments implementation. It fails when the ease of prompting encourages cognitive delegation, through the offloading of judgement about problem design to a system incapable of strategic assessment. The context management overhead in Phase Three further suggests that cognitive augmentation has scaling limits. Indeed, beyond a certain complexity threshold, the mental resources required to manage LLM context may exceed the resources saved by not writing code directly. It would be useful for future research to investigate these thresholds empirically, identifying when cognitive augmentation transitions from net benefit to net burden. In this program of roughly 1,000 lines, it was surprising to me how quickly the cognitive overhead of managing context and versioning ballooned.

An LLM can generate robust code structures, but it is weak at making strategic decisions about when an approach is flawed. That insight, the recognition that text extraction would never work for these documents, had to come from me. The machine handled implementation, and I handled architectural structure and abstract reasoning. Interestingly, this division of labour reflects McLuhan's (1964) observation that media are extensions of human faculties. The LLM extended my coding capacity, but the critical, architectural thinking remained human, all too human.

This division of labour suggests a new relationship between critic and code. The great challenge for critical code studies has always been the scale and complexity of software. An LLM can act as an expert assistant, navigating the codebase, extracting relevant snippets, and explaining their function. It collapses the distance between the critic and the code. The future of the field, I believe, will involve developing methods that use LLMs not as a replacement for critical thought, but as a powerful hermeneutic enabler.

The three modes of cognitive augmentation identified here in terms of delegation, productive augmentation, and overhead, have implications beyond this case study. If critical code studies scholars increasingly use LLMs to understand complex codebases, they must remain aware of which mode they occupy. Cognitive delegation risks accepting LLM-generated interpretations without sufficient critical scrutiny. Productive augmentation could accelerate the analysis of large-scale software systems currently beyond practical scholarly reach. Cognitive overhead might limit these gains, particularly when managing multiple analytical threads across extended research projects. The field requires not just tools for LLM-assisted code analysis but also methods for recognising and managing these distinct augmentation modes. This suggests that future critical code studies must include meta-critical reflection on how computational tools shape the interpretive process itself, a kind of reflexive augmented hermeneutics where we critically examine our own use of interpretive machines.


Notes


[1] The software is available to download and run from https://github.com/dmberry/Oxford_TSA_Question_Master. The code is released under the Creative Commons Share-Alike license, permitting free use, modification, and redistribution. By making the software freely available, I intended to reduce rather than exacerbate educational inequalities around TSA preparation. Commercial test preparation services can charge quite substantial fees, creating advantage for those with financial resources. Open-source release potentially democratises access to practice tools like this. However, this gesture towards equality contains its own contradictions. Availability does not guarantee accessibility. Students must know the repository exists, possess technical competence to download and run this browser-based application, and have access to devices and internet connectivity. Moreover, the software depends upon possession of official TSA past papers, which Oxford provides freely but which require knowledge of where to look and confidence to engage with university materials. The tool thus operates within an ecology of access where each component, whilst technically free, demands forms of literacy and confidence unequally distributed across social classes. There is also an ethical question about automating preparation for high-stakes examinations. The TSA functions as a gatekeeping mechanism for Oxford admission. By creating software that makes practice more efficient, am I simply accelerating an arms race in test preparation, where advantage accrues to those who discover such tools first? Future work around these types of educational tool might investigate whether and how such tools actually get discovered and are used by students from different backgrounds, and whether open availability translates into equal access or merely adds another layer to existing hierarchies of educational advantage.

[2] The idea for this project was originally practical rather than theoretical. Watching my son practise for the Oxford TSA, I observed the cumbersome workflow imposed by the examination documents. Oxford provides past papers as separate PDF files: one for the fifty Section 1 questions, another containing just the answer key, and a third for the Section 2 essay questions. A student attempting realistic timed practice must check each response against the answer sheet after the test is over. This process is not only time-consuming but pedagogically weak, as the delayed feedback and manual checking introduce opportunities for error and discourage repeated practice. A possible solution was automation of the marking with a system that could present the questions in their original format whilst handling timing, answer recording, and immediate marking automatically. This need, born from parental observation of an unnecessarily tedious study process, became the catalyst for exploring how quickly an LLM could transform such a practical requirement into working software using vibe coding. The theoretical implications of the co-coding process emerged retrospectively for me, as I reflected on how this collaborative co-creation challenged notions of authorship, interpretation, and the hermeneutic relationship between human intention and generative AI.

[3] Weizenbaum (1976) argued that computers should never be given tasks requiring "wisdom" precisely because they lack the embodied, affective dimensions of human judgement. His distinction between "deciding" (computational) and "choosing" (requiring judgement) still remains relevant. Recent scholarship has extended this analysis, as I and colleagues in studying the original ELIZA source code reveal how minimal the actual "intelligence" was, yet it produced profound effects on users (see http://findingeliza.org) (Berry and Marino 2024; Ciston et al 2026). The contemporary challenge is that LLMs, unlike ELIZA, demonstrate emergent capabilities that make the boundary between simulation and understanding increasingly difficult to discern. This requires new critical frameworks that can account for what might be called "synthetic hermeneutics", where interpretation is distributed across human and machine agents in ways that trouble traditional theories of meaning and understanding.

[4] The hermeneutic questions raised by co-coding with AI connect to broader concerns in critical theory about technology and interpretation (Berry 2014). Adorno and Horkheimer's (2002) analysis of the "culture industry" warned of instrumental rationality colonising aesthetic experience; similarly, Habermas's (1984) theory of communicative action distinguished between strategic and communicative rationality. The challenge posed by LLMs is whether they represent a form of pseudo-communication, a simulation of dialogue that lacks the intersubjective recognition essential to genuine understanding (Berry 2014). As Marcuse (1964) argued, technological rationality can become a form of domination precisely by appearing neutral and efficient. The question for digital humanities is whether computational methods undermine the interpretative possibilities key to humanistic inquiry. The cognitive augmentation framework introduced in this article extends these concerns to the specific domain of vibe coding. Where Marcuse identified technological rationality as domination through apparent neutrality, cognitive delegation represents a similar pattern in which the LLM's apparent helpfulness masks its inability to make strategic judgements, potentially leading developers toward workable but suboptimal results. See also Feenberg (2002) on critical theory of technology, and Stiegler (2016) on tertiary retention and grammatisation.


Bibliography


Adorno, T. W. and Horkheimer, M. (2002) Dialectic of Enlightenment: Philosophical Fragments, Stanford University Press.

Berry, D. M. (2011) The Philosophy of Software: Code and Mediation in the Digital Age, Palgrave Macmillan.

Berry, D. M. (2012) Understanding Digital Humanities, Palgrave Macmillan.

Berry, D. M. (2014) Critical Theory and the Digital, Bloomsbury Publishing.

Berry, D.M. and Marino, M.C. (2024) ‘Reading ELIZA: Critical Code Studies in Action’, Electronic Book Review. Available at: https://electronicbookreview.com/essay/reading-eliza-critical-code-studies-in-action/ (Accessed: 4 November 2024).

Bolter, J. D. and Grusin, R. (1999) Remediation: Understanding New Media, MIT Press.

Ciston, S. et al. (2026) Inventing ELIZA: How the First Chatbot Shaped the Future of AI. MIT Press (Software Studies series).

Feenberg, A. (2002) Transforming Technology: A Critical Theory Revisited, Oxford University Press.

Gadamer, H. G. (1975) Truth and Method, Continuum.

Gitelman, L. (2006) Always Already New: Media, History and the Data of Culture, MIT Press.

Habermas, J. (1984) The Theory of Communicative Action, Volume 1: Reason and the Rationalization of Society, Beacon Press.

Hayles, N. K. (2012) How We Think: Digital Media and Contemporary Technogenesis, University of Chicago Press.

Karpathy, A. (2017). 'Software 2.0'. Medium. Available at: https://medium.com/@karpathy/software-2-0-a64152b37c35.

Karpathy, A. (2025). 'Vibe coding' [X/Twitter post]. 2 February.  Available at: https://x.com/karpathy/status/1886192184808149383 (Accessed: 20 October 2025).

Karpathy, A. (2025b) 'Software Is Changing (Again)', YouTube. Available at: https://www.youtube.com/watch?v=LCEmiRjPEtQ (Accessed: 23 June 2025).

Kirschenbaum, M. G. (2008) Mechanisms: New Media and the Forensic Imagination, MIT Press.

Kittler, F. A. (1999) Gramophone, Film, Typewriter, Stanford University Press.

Manovich, L. (2001) The Language of New Media, MIT Press.

Marcuse, H. (1964) One-Dimensional Man: Studies in the Ideology of Advanced Industrial Society, Beacon Press.

Marino, M. C. (2020) Critical Code Studies, MIT Press.

McLuhan, M. (1964) Understanding Media, McGraw-Hill.

Stiegler, B. (2016) Automatic Society: The Future of Work, Polity Press.

Turkle, S. (1984) The Second Self: Computers and the Human Spirit, Simon & Schuster.

Weizenbaum, J. (1976) Computer Power and Human Reason: From Judgment to Calculation, W. H. Freeman.
