LLMbench: A Comparative Close Reading Workbench For Large Language Models

David M. Berry


"The meaning of a text lies not behind the text but in front of it. The meaning is not something hidden but something disclosed."

Paul Ricoeur, 1981.


The proliferation of large language models (LLMs) has generated an equally prolific effort to measure them. Kahng et al. (2024) name part of the problem directly in their discussion of Google PAIR's LLM Comparator. Side-by-side evaluation of models, they argue, is a key practice, and existing tools for it tend to be quantitative or rely on presenting user-rating metrics. The system they present is a useful piece of engineering, but it is designed for model developers (especially those making products), not for the hermeneutic work of close reading what a model has generated. What is missing is a workbench for comparative close reading, an environment in which the outputs of two models can be subjected to the kinds of attention the humanities bring to primary sources: annotation, structural analysis, rhetorical analysis, and, especially in relation to LLMs, inspection of the probabilistic structure from which the text emerged.

The digital humanities (DH) have a long tradition of tool building for textual analysis, Voyant Tools (Sinclair and Rockwell 2016) probably being the most familiar, alongside TAPoR, AntConc, MALLET, CATMA, Recogito, and a lineage of corpus-reading environments designed for scholarly rather than evaluative work. These tools count, visualise, index, and annotate, and they support both close and distant reading. They were built for a fixed object, a human-authored text or corpus whose meaning is explored through structural and statistical description. What they were not built to do is treat the text itself as a probabilistic object (i.e. a draw from a probability distribution that could just as easily have given different tokens at the same position). LLMbench inherits the DH tradition of browser-based, annotation-rich tool building, but the hermeneutic problem it takes on is different: how to read a text that could have been otherwise, and how to read it alongside another text that could have been otherwise in different ways.

LLMbench, part of what I call the Vector Lab tools, is built to make those differences readable. It has six modes. The Compare mode is the primary close-reading area, where two model responses to the same prompt sit side by side in annotatable panels with four analytical overlays. In addition to this are five Analyse modes that can be used to run empirical probes: (1) Stochastic Variation, (2) Temperature Gradient, (3) Prompt Sensitivity, (4) Token Probabilities, and (5) Cross-Model Divergence. The figures that follow use a prompt about Italo Calvino's Cybernetics and Ghosts, comparing Gemini 2.0 Flash with GPT-4o.[1]

Token Probabilities in Compare Mode

The Probs overlay is the analytical heart of LLMbench.[2] Activating "Probs" in the Compare toolbar re-sends the current prompt to the model API and requests what is called "logprob" data alongside the response. This takes a few seconds. Once loaded, a continuous heatmap overlays both panels. The idea is simple. Tokens the model chose with high confidence (above roughly 70% probability) are given no highlight. Below that threshold the background colours progressively, from pale yellow through orange to deep red for positions where the model was "uncertain". The full probability gradient is visible across both panels at once, so confidence patterns in each response can be compared.
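The threshold logic can be sketched in a few lines. The band boundaries below the 70% cut-off here are illustrative assumptions, not the tool's exact palette stops:

```python
# Sketch of the heatmap mapping from chosen-token probability to colour band.
# Only the ~70% threshold is taken from the description above; the lower
# boundaries are invented for illustration.

def heat_class(prob: float, threshold: float = 0.70) -> str:
    """Map a chosen-token probability to a heat band (illustrative names)."""
    if prob >= threshold:
        return "none"          # confident: no highlight
    if prob >= 0.45:
        return "pale-yellow"
    if prob >= 0.20:
        return "orange"
    return "deep-red"          # highly uncertain position

# Colour a short sequence of chosen-token probabilities
probs = [0.95, 0.60, 0.30, 0.08]
bands = [heat_class(p) for p in probs]
# bands == ["none", "pale-yellow", "orange", "deep-red"]
```

The same function applied to both panels' token streams yields the comparable gradient the overlay displays.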

The navigation strip below the toolbar provides the analytical controls. Three buttons summarise the data at a glance. "Uncertain" jumps to the positions with the highest entropy, where the model was most genuinely uncertain across alternatives (i.e. where no single token was clearly favoured). "Forks" jumps to positions where the chosen token had less than 70% probability, a lower bar that surfaces a broader population of possibilities. "Diverge", available when both panels have logprob data, jumps to positions where Panel A and Panel B chose different tokens at the same sequence position. 

In the Calvino comparison shown in Figure 1, those numbers show 399 uncertain positions, 174 forks, and 281 diverge points across the two responses.

Figure 1: Compare mode with Probs overlay active across both panels. The continuous token heatmap is visible on each panel, ranging from uncoloured (confident) through yellow and orange to deep red (uncertain). The navigation strip is visible below the toolbar, showing Uncertain (399), Forks (174), and Diverge (281) counts. Both panels display Gemini 2.0 Flash and GPT-4o responses to the Calvino prompt, with the Tone analysis side panel on each side.

Click any token and a probability distribution panel opens alongside the text, showing the full top-k alternatives the model was considering at that position. This is where the reading becomes interesting. The inspector shows both the chosen token and the probability mass the model assigned to everything it did not choose. At position 26 in the Gemini response, for instance, the model was working with an entropy of 2.315 bits and chose its token with only 11.78% probability, meaning several other tokens were live and nearly as probable. The GPT-4o response at the same prompt position shows a very different distribution, entropy of 1.567 bits, chosen probability 49.27%. The two models are not equally uncertain at the same moment. They encounter different distributions of difficulty as they traverse the same conceptual territory, and the inspector makes that visible at the token level.
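The entropy figures the inspector reports can be reproduced directly from the logprob payload. The sketch below assumes natural-log probabilities over the returned top-k alternatives; probability mass outside the top-k is ignored, so this slightly underestimates the true next-token entropy:

```python
import math

def entropy_bits(logprobs: list[float]) -> float:
    """Shannon entropy in bits of a (top-k) next-token distribution.

    `logprobs` are natural-log probabilities, as typical chat-completion
    APIs return them. The list is renormalised over the top-k slice, so
    this is an approximation of the full-vocabulary entropy.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A flat distribution over four tokens gives exactly 2 bits;
# a distribution with all mass on one token gives 0 bits.
flat = [math.log(0.25)] * 4
```

A position like the Gemini example above (2.315 bits, chosen token at 11.78%) corresponds to a distribution spread across many live alternatives; the GPT-4o position (1.567 bits, 49.27%) is roughly half as dispersed.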

Command or Control-click a second token to pin two distributions side by side, enabling direct comparison of two moments of uncertainty within the same response, or of the same position across both panels.

Figure 2: The probability inspector panel pinned in both Panel A and Panel B. Panel A shows Position 26/399, Entropy 2.315 bits, Chosen 11.78%, with probability distribution bars for the top alternatives. Panel B shows Position 26/267, Entropy 1.567 bits, Chosen 49.27%, with its own distribution. The divergence annotation is visible below each inspector, noting that the two models chose different tokens here.

Three optional visualisation bands extend the Probs analysis beyond the inline heatmap. Each represents the same underlying probability data through different spatial visualisations, and each makes different patterns more visible.

The "Graph" band renders an entropy curve, an SVG sparkline of per-token entropy plotted across the entire sequence, with Panel A and Panel B overlaid in different colours. The horizontal axis is token position, the vertical axis entropy in bits. Reading the curve reveals the rhythmic structure of each model's uncertainty, where it was consistently committed across a passage, where its distribution suddenly spread, and crucially, where the two models diverge in their entropy profiles despite having received the same prompt. Click any point on the curve to jump the cursor in both panels to that token position.

Figure 3: The Graph band active, showing the entropy curve at the top of the interface. Blue and orange sparklines (Panel A and B respectively) trace per-token entropy across all token positions, with the horizontal axis showing token position up to approximately 398 and the vertical axis measuring bits (0 to 2). Below the curve, both panel heatmaps remain visible with the probability inspector open. The Uncertain and Diverge navigation chips are active in the strip.

The "Pixels" band offers something different, a bird's-eye summary of both responses at once. Each token becomes a coloured cell in a dense grid, with the same heat palette as the inline heatmap. Where the heatmap embedded in the prose requires reading to navigate, the pixel map collapses the entire response into a single glance-level view. Clusters of red cells indicate passages of sustained uncertainty, whereas pale or uncoloured expanses indicate confident stretches. Because both panels use the same cell size, the spatial extent of each response is immediately comparable, and the distribution of uncertainty visible across each grid can be read against the other.

Figure 4: The Pixels band active, showing the token pixel map for both panels above the regular text view. Each cell represents one token, coloured by probability using the Heat palette. Panel A (Gemini 2.0 Flash, 398 tokens) and Panel B (GPT-4o, 267 tokens) are displayed side by side, with their different response lengths visible in the different widths of the grids. The colour distribution varies noticeably between the two panels, particularly in the density of red cells.

The "Net" band renders the probability data as a three-dimensional terrain. Each token position becomes a vertex on a mesh surface and the vertical displacement of each vertex corresponds to its entropy. High-entropy positions become peaks and confident stretches become flat plains. The mesh is translucent with a wireframe net overlay and can be rotated freely by dragging. The top five highest-entropy peaks carry floating labels showing the token text and its entropy value. Click any point on the surface to jump the cursor to that position in both panels.

The three bands, curve, pixels, and net, give different ways to visualise the probability data. The curve emphasises temporal dynamics, how uncertainty evolves across the sequence as a linear narrative. The pixel map offers a glance-level spatial summary. The net turns uncertainty into navigable terrain that can be inspected from any angle. Each foregrounds different features of the same distribution, and moving between them allows the user to surface insights that no single view would have produced.

Figure 5: The Net band active, showing two 3D probability skyline meshes side by side, one for each panel. The WebGL terrain is visible with peaks corresponding to high-entropy token positions and flat areas corresponding to confident passages. Floating labels identify the top-5 highest-entropy points on each mesh. The standard text panels with heatmap overlay are visible below. The label bar at top identifies this as the Uncertainty Net.

Diff and Tone

The Probs family of views allows one to examine the model's internal probability distributions. The other two overlays in Compare mode work on the finished text as text, applying different analytical tools to make structural and rhetorical features visible.[3]

The "Diff" overlay computes word-level differences between the two responses. Words present in one panel but absent from the other are highlighted, with unique-word counts appearing in each panel header. Both panels scroll in synchronisation so corresponding passages stay aligned. That the models said something different is obvious without any overlay, but the ability to diff across the same model (by selecting the same model in both panels via the settings panel) means that you can see how the same model generates different versions of the text. What the diff surfaces is the shape of that difference. Some models diverge lexically at the periphery, synonyms and minor phrasing variations. Others choose entirely different vocabularies for the same conceptual territory. The diff reveals which kind of divergence you are looking at.
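The unique-word sets reported in each panel header can be sketched as a set difference over the two vocabularies; the tokenisation regex here is an assumption, and the sample sentences are invented:

```python
import re

def word_set(text: str) -> set[str]:
    """Lowercase word tokens (illustrative tokenisation)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def unique_words(a: str, b: str) -> tuple[set[str], set[str]]:
    """Words present in one response but absent from the other."""
    wa, wb = word_set(a), word_set(b)
    return wa - wb, wb - wa

panel_a = "Calvino treats literature as a combinatorial machine"
panel_b = "Calvino describes writing as a combinatorial process"
only_a, only_b = unique_words(panel_a, panel_b)
# only_a == {'treats', 'literature', 'machine'}
# only_b == {'describes', 'writing', 'process'}
```

The sizes of `only_a` and `only_b` are the per-panel unique-word counts; the overlay then highlights those words in place.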

Figure 6: Compare mode with the Diff overlay active. Both panels display the Calvino responses with word-level highlighting. Words unique to each panel are highlighted in the respective panel's colour. Unique-word counts are visible in each panel header (Panel A: 52 unique; Panel B: 49 unique). The numbered sentence markers from the Struct view are visible in the gutter alongside the diff highlighting. Both panels are in synchronised scroll.

The "Tone" overlay applies Ken Hyland's (2005) metadiscourse model to both generated texts. Seven register categories are applied across the text. These are: (1) Hedges, words like "might", "perhaps", "arguably", are in blue, (2) Boosters, "clearly", "certainly", "must", are in green, (3) Limiting terms, "not", "never", "without", are in orange, (4) Attitude markers, "important", "surprising", "problematic", are in purple, (5) Intensifiers, "very", "extremely", "highly", are in amber, (6) Self-mentions, "I", "we", "our", are in rose, and (7) Engagement markers, "you", "consider", "note", "imagine", are in teal.[4]
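A toy version of the category matching can be sketched as a lexicon lookup. The word lists below are tiny illustrative subsets, and the sample sentence is invented; the tool's actual lexicons are larger:

```python
import re
from collections import Counter

# Illustrative subsets of three of the seven Hyland categories
CATEGORIES = {
    "hedges": {"might", "perhaps", "arguably"},
    "boosters": {"clearly", "certainly", "must"},
    "self_mentions": {"i", "we", "our"},
}

def tag_metadiscourse(text: str) -> Counter:
    """Count metadiscourse markers per category in a text."""
    counts: Counter = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for category, lexicon in CATEGORIES.items():
            if word in lexicon:
                counts[category] += 1
    return counts

sample = "We might argue that Calvino clearly anticipated, perhaps, our machines."
```

Here `tag_metadiscourse(sample)` counts two hedges, one booster, and two self-mentions; the balance bar at the foot of each panel normalises such counts into a proportional distribution.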

Click any chip in the navigation bar to toggle a category. The question mark beside each chip opens the Hyland definition. Hover any marked word for its surrounding context, frequency count, and a brief linguistic note on its function at that position.

What the Tone view shows is the rhetorical stance each model adopts toward its own claims and toward its reader. A response dense with hedges is a cautious one. A response heavy with boosters is assertive, perhaps overconfident. When the Calvino prompt is sent to Gemini and GPT-4o, the distributions differ in ways that would be difficult to characterise without the overlay making them visible. The balance bar at the foot of each panel shows proportional category distribution, a summary of each model's rhetorical style.

Figure 7: Compare mode with the Tone overlay active. Both panels display the Calvino responses with Hyland's metadiscourse categories applied as colour-coded highlights throughout the text. The category count chips are visible at the top of each panel (Hedges, Boosters, Limiting, Attitude, Intensifiers, Self-mentions, Engagement), with numerical counts. The register balance bar at the foot of each panel shows proportional distribution across categories. Highlighted words in different colours are distributed throughout both responses.

The Analyse Modes

Where Compare mode supports close reading of individual outputs, the five Analyse modes run empirical probes. Each poses a specific question about model behaviour and returns quantitative results. The most productive workflow would probably use both together, using close reading to identify what seems interesting about a given response and the Analyse modes to test how that feature varies across runs, temperatures, and phrasings.

The "Stochastic Variation" mode sends the same prompt to the same model between three and twenty times in succession and reports how much the outputs differ from one another. That identical prompts produce different outputs is among the most counterintuitive things a model does, and observing it directly is very helpful for analysis. Temperature above zero means the model samples from its probability distribution rather than always selecting the highest-probability token, and sampling is by definition stochastic.

In the Calvino example, five runs of Gemini 2.0 Flash produce an average word count of 386, average vocabulary diversity of 54.8%, and average pairwise word overlap of 42.4%. Each run appears as a result card with its own metrics. The Deep Dive expands to show a pairwise overlap matrix across all runs, colour-coded from green for high overlap through yellow to red, giving a quick visual summary of where the model is consistent and where it is different.
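The pairwise matrix can be sketched as below. The normalisation used here (shared unique words over the smaller vocabulary) is an assumption, since the tool may normalise differently, and the run vocabularies are invented:

```python
# Pairwise word-overlap matrix across n runs of the same prompt.
def overlap(a: set[str], b: set[str]) -> float:
    """Shared unique words over the smaller vocabulary (assumed metric)."""
    return len(a & b) / min(len(a), len(b))

# Toy vocabularies standing in for three runs' unique-word sets
runs = [
    {"calvino", "literature", "machine", "combinatorial"},
    {"calvino", "literature", "ghosts", "cybernetics"},
    {"calvino", "writing", "machine", "cybernetics"},
]

# Symmetric matrix: diagonal is 1.0, off-diagonal cells show
# run-to-run consistency, ready for green/yellow/red colour-coding.
matrix = [[overlap(x, y) for y in runs] for x in runs]
```

Averaging the off-diagonal cells gives a single consistency figure comparable to the 42.4% average pairwise overlap reported above.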

Figure 8: Stochastic Variation mode showing five runs of the Calvino prompt to Gemini 2.0 Flash. Summary statistics at the top: 386 average words, 54.8% average vocabulary diversity, 42.4% average pairwise overlap, 5/5 runs complete. Three result cards are visible in the main area: Run 1 (442 words, 52% lexical diversity), Run 2 (409 words, 50% lexical diversity), Run 3 (368 words, 56% lexical diversity), with Run 4 and Run 5 cards partially visible below. The word count and lexical diversity metrics are labelled under each card.

The "Temperature Gradient" mode runs the same prompt across six fixed sampling temperatures, from 0.0 through 0.3, 0.7, 1.0, 1.5, to 2.0. Temperature 0.0 is deterministic, always selecting the highest-probability token. At 2.0, sampling is randomised enough that the output begins to explore quite unlikely regions of the model's vocabulary. Reading the six result cards in sequence provides a record of how randomness shapes output.
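The mechanism behind the gradient is the standard softmax rescaling: logits are divided by the temperature before normalisation, so low temperatures sharpen the distribution towards the argmax and high temperatures flatten it towards uniform. The logits below are invented for illustration:

```python
import math

def softmax_with_temperature(logits: list[float], t: float) -> list[float]:
    """Convert logits to a probability distribution at temperature t."""
    scaled = [l / t for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]                  # illustrative next-token logits
cold = softmax_with_temperature(logits, 0.3)   # near-deterministic
hot = softmax_with_temperature(logits, 2.0)    # much flatter
```

At t = 0.3 the top token takes over 95% of the mass; at t = 2.0 it takes roughly half, which is why high-temperature runs wander into less likely vocabulary.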

The Calvino prompt at temperature 0.0 produces 456 words with 48% lexical diversity; at 2.0, 382 words with 61% diversity. The Deep Dive provides a per-temperature metrics table covering word count, sentence count, average sentence length, vocabulary diversity, and response time, with contextual notes on how low- and high-temperature behaviour typically differs.

Figure 9: Temperature Gradient mode showing six results for Gemini 2.0 Flash at temperatures 0.0, 0.3, 0.7, 1.0, 1.5, and 2.0. The summary strip at the top shows word count range 393-458, diversity range 48-61%, and 6/6 temperatures complete. Six result cards are arranged in two rows of three. Each card shows temperature value, word count, lexical diversity percentage, and a preview of the response text. The Full Text expand link is visible beneath each preview.

"Prompt Sensitivity" tests how minor changes to a prompt affect model outputs. The mode auto-generates variations from the base prompt, adding "please", changing punctuation, rephrasing as a question, adding "step by step", and so on, then ranks each variation by its word overlap with the base output. Custom user-defined variations can be added to the set. The results reveal which prompt tweaks produce the largest divergence from the base response. 

In the Calvino example, the prompt variations range from 33% overlap with the base (adding a period) to 41% overlap (adding step by step), with the question-form variant and the please variant sitting in between. This is a fast empirical check on prompt brittleness. If small phrasings produce large divergences, the responses being read in Compare mode should be treated as draws from a probability distribution.

Figure 10: Prompt Sensitivity mode showing the base Calvino prompt and four auto-generated variations. The summary strip shows 389 base words, 6/7 variations complete, 37.8% average overlap with base, 6 successful runs. The Base Prompt card is visible at the top (389 words, labelled Base). Below it, four variation cards: Add "Please" (458 words, 35% overlap), Add period (383 words, 33% overlap), Add "Step by step" (423 words, 41% overlap), and Question form (369 words, 38% overlap). The Details expand link is visible under each card.

The "Token Probabilities" mode provides an environment for single-response "logprob" analysis. Where the Probs overlay in Compare mode is designed for comparative work across two responses, this mode is for extended inspection of a single output. The summary bar at the top reports mean entropy, average probability, the token with the maximum entropy, and total token count. Below it, an entropy distribution histogram divides all tokens into five confidence bands from Very Low to Very High and clicking any bar lists the exact tokens that fall into it. A Sentence Entropy view colour-codes each sentence by its mean token entropy, surfacing which sentences carry the most uncertainty at a structural level. The Uncertainty Deep Dive at the bottom provides hotspot lists and the most frequently considered alternatives across all positions.
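The banding behind the histogram can be sketched as a simple binning of per-token entropies. The band edges here are illustrative assumptions, not the tool's thresholds; note that low entropy means high confidence:

```python
# Five confidence bands keyed by per-token entropy (edges are assumed).
BANDS = [
    ("very_high", 0.0, 0.5),   # very high confidence: near-zero entropy
    ("high", 0.5, 1.0),
    ("medium", 1.0, 1.5),
    ("low", 1.5, 2.0),
    ("very_low", 2.0, float("inf")),  # very low confidence: spread distribution
]

def band_for(entropy: float) -> str:
    for name, lo, hi in BANDS:
        if lo <= entropy < hi:
            return name
    return BANDS[-1][0]

# Bin a toy sequence of per-token entropies into the histogram
entropies = [0.1, 0.7, 1.2, 1.8, 2.4, 0.3]
histogram = {name: 0 for name, _, _ in BANDS}
for e in entropies:
    histogram[band_for(e)] += 1
```

Clicking a histogram bar in the tool lists the tokens binned into that band; a sentence-level version of the same idea averages each sentence's token entropies before banding.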

Figure 11: Token Probabilities standalone mode for Gemini 2.0 Flash responding to the Calvino prompt. The summary bar shows Mean Entropy 0.704, Avg Probability 73.2%, Max Entropy Token "the", Total Tokens 449. Below, the Entropy Distribution histogram is visible with five confidence bands. The Token Heatmap tab is active, showing the coloured response text with a probability inspector pinned to the right showing Position 26/449, Entropy 1.644 bits, Chosen 1.63% (intersects...), with probability distribution bars for the top alternatives. The Sentence Entropy tab selector is visible for switching views.

"Cross-Model Divergence" provides the quantitative frame for what Compare mode examines qualitatively. The headline metrics are Jaccard similarity, the proportion of unique words shared relative to the union of both vocabularies, and word overlap percentage. Structural metrics for each panel appear beneath: word count, sentence count, average sentence length, and vocabulary diversity.

In the Calvino comparison, Gemini 2.0 Flash produces 322 words across 16 sentences with average sentence length 20.1 and 58% vocabulary diversity. GPT-4o produces 262 words across 10 sentences, average sentence length 26.2, 63% vocabulary diversity. The Jaccard similarity between them is 22.8%, meaning the two models are drawing on largely different lexical territory despite responding to the same prompt. The Vocabulary Analysis Deep Dive partitions the vocabulary into what is unique to A, shared, and unique to B, alongside top-word frequency bar charts and unique bigram candidates for each panel.
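The Jaccard figure is straightforward to reproduce from the two vocabularies; the tokenisation regex is an assumption, and the two sample strings are invented:

```python
import re

def vocab(text: str) -> set[str]:
    """Unique lowercase word tokens (illustrative tokenisation)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Shared unique words over the union of both vocabularies."""
    va, vb = vocab(a), vocab(b)
    return len(va & vb) / len(va | vb)

a = "the machine writes the ghost"
b = "the ghost haunts the machine"
# shared: {the, machine, ghost}; union adds {writes, haunts}
# jaccard(a, b) == 3 / 5 == 0.6
```

At 22.8% for the Calvino comparison, fewer than a quarter of the words either model used are shared between them.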

Figure 12: Cross-Model Divergence mode comparing Gemini 2.0 Flash and GPT-4o on the Calvino prompt. The Divergence Metrics panel shows: Jaccard Similarity 22.8%, Word Overlap 37.1%, Shared Words 65, Unique to A 121, Unique to B 99. Below, per-panel structural metrics: Panel A (Gemini 2.0 Flash) 322 words, 16 sentences, 20.1 avg sent length, 58% vocab diversity; Panel B (GPT-4o) 262 words, 10 sentences, 26.2 avg sent length, 63% vocab diversity. The Vocabulary Analysis expandable section is visible. Both full response text panels are visible at the bottom.

Conclusion

What LLMbench makes possible is a form of reading that is comparative but also not reducible to the numbers the engineering tools report.[5] The token probability views reveal the model's counterfactual history, every token is a road taken and the distribution shows the roads not taken. The high-entropy positions, the forks, the diverge points, are the moments where the text could plausibly have gone several different ways, where what appears in the panel as a settled word or phrase was, a fraction of a second before sampling, a genuinely open question. I argue that logprobs are an underutilised tool in humanistic and social scientific readings of AI, possibly the closest we get to seeing which parts of the text the model was committed to and which parts could just as easily have come out differently, where it was, we might say, effectively rolling dice within a set of similar words.

The variorum principle from 10 PRINT (Montfort et al. 2013) is a key influence on the design of this tool. Different variants of the same text, whether produced by different models or by the same model at different temperatures, are analytically productive to explore. They reveal what is deterministic and what is contingent, what depends on particular training decisions and what the models share. The Analyse modes make the within-model variants visible, and the Compare mode makes the across-model comparison more visible and more amenable to close reading. Together they are a workbench for the comparative hermeneutics of AI-generated text, which, I would argue, can contribute to the scholarly practice of working with and critiquing generative AI models.[6] What such reading finally shows, to borrow from Ricoeur, is that the meaning of these texts is not hidden behind them, in benchmark scores or alignment metrics, but disclosed in front of them, in the cloud of near-equivalent alternatives the model passed over to arrive at the words it wrote.


The tool is browser-based and requires only an API key from a supported provider. OpenRouter gives the broadest model access through a single key. The deployed version is at https://llm-bench-mu.vercel.app/. Code, documentation, and the full architecture and description are at https://github.com/dmberry/LLMbench. Contributions, issues, and forks are welcome. LLMbench is MIT licensed.


Notes

[1] The twelve figures above are screenshots from the LLMbench tool. The prompt used throughout is "Tell me about Calvino's Cybernetics and Ghosts and its relevance for AI today," with Gemini 2.0 Flash in Panel A and GPT-4o in Panel B. The same prompt and pairing run across all figures so that patterns across modes can be read as belonging to a single comparison rather than as separate examples.

[2] Logprob data is available from Google Gemini (version 2.0 and later), OpenAI models directly, and OpenAI models via OpenRouter. GPT-4o and GPT-4o-mini return reliable logprob data through OpenRouter. Other models hosted on OpenRouter do not currently expose choice.logprobs through the routing layer. Selected Hugging Face models via Fireworks and Together inference backends also support logprobs. A "logprobs-compatible only" checkbox in Provider Settings greys out providers and models that do not expose token probabilities. I recommend using the OpenRouter or Gemini APIs.

[3] There is an annotation system in Compare mode that supports six typed categories drawn from the Critical Code Studies methodology developed across Marino (2020) and the ELIZA reading project (Berry and Marino 2024), Observation, Question, Metaphor, Pattern, Context, and Critique. Select any text in either panel to activate the annotation widget. Annotations persist to browser localStorage and export with the comparison as JSON, plain text, or PDF with coloured annotation badges.

[4] The Hyland (2005) metadiscourse framework was developed for the analysis of academic writing. Its application here to AI-generated text is an experimental approach in this tool. LLMs are not writing academic prose exactly, but they have been trained on vast quantities of it, and the patterns of hedging, boosting, and reader engagement that Hyland identifies are present in many model outputs and vary systematically across models, prompts, and training regimes. I would be very interested to hear of alternatives to this framework; readers are welcome to contact me.

[5] A code-editor substrate (CodeMirror) is used to display the prose text, as discussed in the README. Briefly, the aim was to support a gutter and line-based annotation system taken from another tool I developed called CCS Workbench. The result is a reading environment positioned between a word processor and a code editor, a document-analysis environment with a scholarly annotation system built in.

[6] The Vector Lab tools draw on a theoretical position I have been developing elsewhere (Berry 2025, 2026a, 2026b), that LLMs inhabit a high-dimensional vector space or manifold and that reading their outputs is partly a matter of reading against the structure of that vector space, not only against the surface of the text. The tools in the series are attempts to make aspects of that space legible to humanistic close reading.

Bibliography

Berry, D. M. (2025) 'Synthetic media and computational capitalism: towards a critical theory of artificial intelligence', AI & Society, 40(7), pp. 5257-5269. https://doi.org/10.1007/s00146-025-02265-2.

Berry, D. M. (2026a) 'Vector Theory', Stunlaw. https://stunlaw.blogspot.com/2026/02/vector-theory.html.

Berry, D. M. (2026b) 'The Vector Medium', Stunlaw. https://stunlaw.blogspot.com/2026/03/the-vector-medium.html.

Berry, D. M. and Marino, M. C. (2024) 'Reading ELIZA: Critical Code Studies in Action', Electronic Book Review. https://electronicbookreview.com/essay/reading-eliza-critical-code-studies-in-action/.

Hyland, K. (2005) Metadiscourse: Exploring Interaction in Writing. Continuum.

Kahng, M., Fourney, A., Diaz, M., Simard, P. and Amershi, S. (2024) 'LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models', IEEE Transactions on Visualization and Computer Graphics.

Marino, M. C. (2020) Critical Code Studies. MIT Press.

Montfort, N., Baudoin, P., Bell, J., Bogost, I., Douglass, J., Marino, M.C., Mateas, M., Reas, C., Sample, M. and Vawter, N. (2013) 10 PRINT CHR$(205.5+RND(1)); : GOTO 10. MIT Press.

Ricoeur, P. (1981) 'Metaphor and the central problem of hermeneutics', in Hermeneutics and the Human Sciences, ed. J. B. Thompson. Cambridge University Press, pp. 127-143.

Sinclair, S. and Rockwell, G. (2016) Voyant Tools. https://voyant-tools.org/.



