LLMbench: A Comparative Close Reading Workbench for Large Language Models
David M. Berry

"The meaning of a text lies not behind the text but in front of it. The meaning is not something hidden but something disclosed." Paul Ricoeur, 1981.

The proliferation of large language models (LLMs) has generated an equally prolific effort to measure them. Kahng et al. (2024) name part of the problem directly in their discussion of Google PAIR's LLM Comparator. Side-by-side evaluation of models, they argue, is a key practice, yet existing tools for it tend to be quantitative or to rely on aggregate user-rating metrics. The system they present is a useful piece of engineering, but it is designed for model developers (especially those building products), not for the hermeneutic work of close reading what a model has generated. What is missing is a workbench for comparative close reading: an environment in which the outputs of two models can be subjected to the kinds of attention the humanities bring to primary sources — annotation, structural differences, rhetori...