
Efficient References with LLMs

Published at 06:10 PM

Many applications include references (for the purpose of this post, specific portions of text within a document) with LLM responses. One strategy that I’ve found to be particularly effective is to inject markers into the document text and then ask the LLM to identify the markers when referencing the document.

Issues with the Naive Approach

When I first started dealing with this problem, I’d ask the LLM to just output the entire text of the reference.

Full Text

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage
computation are ultimately the most effective, and by a large margin. The ultimate reason for this is
Moore's law, or rather its generalization of continued exponentially falling cost per unit of
computation. Most AI research has been conducted as if the computation available to the agent were
constant (in which case leveraging human knowledge would be one of the only ways to improve
performance) but, over a slightly longer time than a typical research project, massively more
computation inevitably becomes available. Seeking an improvement that makes a difference in the
shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing
that matters in the long run is the leveraging of computation. These two need not run counter to each
other, but in practice they tend to. Time spent on one is time not spent on the other. There are
psychological commitments to investment in one approach or the other. And the human-knowledge
approach tends to complicate methods in ways that make them less suited to taking advantage of
general methods leveraging computation.  There were many examples of AI researchers' belated
learning of this bitter lesson, and it is instructive to review some of the most prominent.

Reference

Seeking an improvement that makes a difference in the
shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing
that matters in the long run is the leveraging of computation.

One problem with this approach is that even the best LLMs will inevitably hallucinate some sort of subtle change (e.g. spacing, punctuation, etc.) that can make it difficult to match the reference to the actual text. For some use cases this may be acceptable, but if you want to be able to, for example, highlight the document reference in a web application, this won’t work. You’ll either be stuck begging the LLM to not hallucinate or writing more and more complex matching logic.

Another problem with this approach is its token cost. If you need to reference a large portion or many portions of the document, you are paying for every single token in those portions even though they exist in the source document already.

Marker Injection Approach

To overcome these issues, we can inject markers into the document so the LLM can reference specific portions of the text by outputting just those short markers. This eliminates the subtle hallucination problem and drastically reduces the number of tokens the LLM needs to output.
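
As a rough illustration of the savings, you can compare the output-token cost of a full quotation against a short marker with a tokenizer library such as tiktoken (a sketch; cl100k_base is just one common encoding, and exact counts depend on your model):

# Rough output-token comparison; assumes tiktoken is installed
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

full_reference = (
    "Seeking an improvement that makes a difference in the shorter term, "
    "researchers seek to leverage their human knowledge of the domain, but the "
    "only thing that matters in the long run is the leveraging of computation."
)
marker_reference = "<#3>"  # the marker format used in the example below

print(len(enc.encode(full_reference)))    # dozens of tokens
print(len(enc.encode(marker_reference)))  # a handful of tokens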

A few considerations with this approach: the granularity of the markers (sentence-level here, though paragraphs or fixed-size chunks also work) and choosing a marker format that is unlikely to collide with text already in the document.

Example

In this example we’ll use nltk to split the document into sentences and inject markers at the beginning of each sentence. The markers themselves are numbers surrounded by <# and > to make them visually distinct, e.g. <#1>.

from nltk.tokenize import sent_tokenize

# Requires the punkt sentence tokenizer data, e.g. via nltk.download("punkt")

# `text` holds the original document shown above
sentences: list[str] = sent_tokenize(text)

reformatted_text = text
current_pos = 0

for i, sentence in enumerate(sentences):
    # Find the next occurrence of the sentence starting from the current position
    next_pos = reformatted_text.find(sentence, current_pos)
    if next_pos != -1:
        # Prepend the marker to only this specific occurrence
        reformatted_text = (
            reformatted_text[:next_pos] +
            f"<#{i}>{sentence}" +
            reformatted_text[next_pos + len(sentence):]
        )
        current_pos = next_pos + len(f"<#{i}>{sentence}")

From this we get the following reformatted text:

<#0>The biggest lesson that can be read from 70 years of AI research is that general methods that leverage
computation are ultimately the most effective, and by a large margin. <#1>The ultimate reason for this is
Moore's law, or rather its generalization of continued exponentially falling cost per unit of
computation. <#2>Most AI research has been conducted as if the computation available to the agent were
constant (in which case leveraging human knowledge would be one of the only ways to improve
performance) but, over a slightly longer time than a typical research project, massively more
computation inevitably becomes available. <#3>Seeking an improvement that makes a difference in the
shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing
that matters in the long run is the leveraging of computation. <#4>These two need not run counter to each
other, but in practice they tend to. <#5>Time spent on one is time not spent on the other. <#6>There are
psychological commitments to investment in one approach or the other. <#7>And the human-knowledge
approach tends to complicate methods in ways that make them less suited to taking advantage of
general methods leveraging computation.  <#8>There were many examples of AI researchers' belated
learning of this bitter lesson, and it is instructive to review some of the most prominent.
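
To get references in this format, include the reformatted text in the prompt and ask the model to cite sentences by marker instead of quoting them. A minimal prompt sketch (the instruction wording and the question are illustrative, not the only way to phrase it):

# A minimal prompt sketch; the instruction wording and the question are illustrative
prompt = (
    "The document below contains sentence markers like <#0>, <#1>, ...\n"
    "When citing the document, output only the marker(s) of the supporting "
    "sentence(s), e.g. <#2>. Do not quote the sentence text.\n\n"
    f"Document:\n{reformatted_text}\n\n"
    "Question: What do researchers do when seeking short-term improvements?"
)
# Send `prompt` to your chat model of choice; its reply is parsed below.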

The reference output of the LLM is now:

<#3>

It’s then easy to parse this reference to get the raw text:

# extract the integer index from the reference
reference_index = int(llm_output.strip('<>').split('#')[1])

# use `sentences` from the previous block to get the raw text now that we know the index
reference_text = sentences[reference_index]
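
If you also want to highlight the reference in the original, unmarked document (the web-application use case mentioned earlier), you can record each sentence’s character offsets while splitting. A sketch building on the variables from the previous blocks (the offsets mapping is an addition for illustration):

# Map each marker index to the sentence's character span in the ORIGINAL text
offsets: dict[int, tuple[int, int]] = {}
search_pos = 0
for i, sentence in enumerate(sentences):
    start = text.find(sentence, search_pos)
    if start != -1:
        offsets[i] = (start, start + len(sentence))
        search_pos = start + len(sentence)

# The LLM's reference now maps directly to a span that can be highlighted
start, end = offsets[reference_index]
assert text[start:end] == reference_text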