Character-Level Processing Limitations in Large Language Models

Picture this: a cascading stream of text flows into the valves of a large language model, where not words, but characters, are the initial droplets. Yet these characters don't stay solitary for long. In the heart of this linguistic machinery, the process of tokenization swiftly corrals them into coherent groups.

Each token is mapped into a high-dimensional space, where it gains a new identity as a vector—a numerical representation that encapsulates the meaning of the original text in a form the model can manipulate and interpret.
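Mechanically, this mapping is just a table lookup. The sketch below is a toy illustration with made-up dimensions and a random matrix, not the internals of any real model:

```python
import numpy as np

# Toy dimensions: a 50,000-token vocabulary, 128-dimensional vectors.
# Real models use far larger matrices learned during training.
vocab_size, embed_dim = 50_000, 128
embedding_matrix = np.random.randn(vocab_size, embed_dim)

# A tokenized input is a sequence of integer token IDs (hypothetical here)...
token_ids = [464, 2068, 7586]

# ...and vectorization is a row lookup into the embedding matrix.
vectors = embedding_matrix[token_ids]
print(vectors.shape)  # (3, 128)
```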

These vectors are the currency of models like GPT-4, enabling them to perform feats of language generation and comprehension. The training data is subjected to the same tokenization and vectorization process, instilling in the model the knowledge and patterns drawn from a vast corpus of textual information.

This is the technical alchemy that powers a model’s ability to engage with human language—not through a one-to-one correspondence with the characters that form the words, but by interpreting the statistical patterns within the tokens that represent them.

Dialing Up the Microscope

While large language models excel at most natural language processing tasks, they tend to lack the granularity needed for character-level processing.

The architecture and design of large language models like GPT-4 prioritize the processing of higher-level linguistic structures over character-level details. These models are built using transformer architectures that excel in capturing the context and semantics of word sequences, operating predominantly on the level of tokens—these tokens can represent whole words, common phrases, or parts of words, depending on the model's vocabulary and how it was trained.

When text is tokenized for a model like GPT-4, a complex and semantically rich token can encompass much more information than a single character. This efficiency in information representation allows the model to grasp meaning, predict subsequent text, and generate language with remarkable coherence over lengthy passages. However, this design inherently glosses over the nuances of individual character manipulation, which is essential for tasks like the ciphers, anagrams, and character sorting demonstrated below.

The tokenization step in GPT-4 can lose character-level information when it groups characters into tokens. For example, a unique sequence of characters could be represented as a single, opaque token, rendering the individual characters within that token inaccessible to the model's processing layers. This means that the model may not always be aware of individual character changes that would be evident if it processed text character by character.
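This is easy to see with the open-source tiktoken library, which implements the cl100k_base byte-pair encoding used by GPT-4; a quick sketch:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding

ids = enc.encode("confidential")
pieces = [enc.decode([i]) for i in ids]
print(pieces)  # multi-character fragments, not individual letters
```

Each fragment is an opaque unit to the model; the letters inside it are never seen individually.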

Due to these design choices, GPT-4 is more naturally suited to tasks like completing sentences, translating language, providing explanations, and generating thematically relevant content—all of which depend on understanding and manipulating text at the token level rather than the character level.

The Achilles’ Heel(s)

The following are a few demonstrations of this limitation.

Caesar Ciphers

The Caesar cipher is a type of substitution cipher in which each letter in the plaintext is 'shifted' a certain number of places down or up the alphabet. Named after Julius Caesar, this cipher is one of the simplest and most widely known encryption techniques.

To encrypt a message, you choose a shift value (for example, shifting three places) and apply this shift uniformly across your message. So with a shift of three, the letter 'A' would become 'D', 'B' would become 'E', 'C' would turn into 'F', and so on, wrapping around to the start of the alphabet after 'Z'.

Decryption is simply the reverse process, where the same shift value is used, but in the opposite direction.
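As a point of contrast, the character-level operation itself is trivial to express in code; a minimal sketch:

```python
def caesar(text: str, shift: int) -> str:
    """Shift each letter by `shift` places, wrapping around the
    alphabet; non-letters pass through unchanged."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)
    return ''.join(result)

encrypted = caesar("If he had anything confidential to say", 3)
print(encrypted)              # Li kh kdg dqbwklqj frqilghqwldo wr vdb
print(caesar(encrypted, -3))  # decryption reverses the shift
```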

Here is our input:

If he had anything confidential to say, he wrote it in cipher, that is, by so changing the order of the letters of the alphabet, that not a word could be made out.

[Screenshot: conversation with GPT-4]

When prompted to decrypt an encoded message, GPT-4 fails: the result is inaccurate and nonsensical.

Prompting the model to produce the tokenization of the input reveals the issue: the encoded message is processed in groups of letters, which leads to the incorrect translation.

Substitution Ciphers

Continuing the theme of ciphers, in a substitution cipher, each letter is replaced systematically by another letter or symbol. To decode it, one must examine each character individually and consistently apply a key or mapping.

Input as placeholder copy:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Encoded with an emoji substitution cipher:

a= 🌷b= 💐c= 🌹d= 🌺e= 🌸f= 🌼g= 🌻h= 🌞i= 🌟j= ✨k= 🌈l= 💕m= 🍬n= 🎁o= 🎈p= 🎀q= 🎊r= 🎉s= 🍷t= 💛u= 💚v= 💙w= 💟x= 💖y= 💝z= 🔅

[Screenshot: prompt]

[Screenshot: response]

Once again, GPT-4 fails. This time there seems to be an attempt to separate the input into individual characters, evidenced by the spacing between them. However, the retrieval of the substitutions falls short, with the character denoted by the blue heart left entirely untranslated.
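For comparison, decoding this deterministically is a per-character dictionary lookup; a minimal sketch using the key above:

```python
# The emoji key from above, expressed as a mapping.
KEY = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "🌷💐🌹🌺🌸🌼🌻🌞🌟✨🌈💕🍬🎁🎈🎀🎊🎉🍷💛💚💙💟💖💝🔅",
))
DECODE = {emoji: letter for letter, emoji in KEY.items()}

def encode(text: str) -> str:
    return ''.join(KEY.get(c, c) for c in text.lower())

def decode(cipher: str) -> str:
    return ''.join(DECODE.get(c, c) for c in cipher)

print(decode(encode("Lorem ipsum dolor sit amet")))
# lorem ipsum dolor sit amet
```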

Anagrams

An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once. For example, the word "listen" can be turned into "silent," and "a gentleman" can be rearranged into "elegant man." Anagrams are a popular form of linguistic play and have been used historically for various purposes, from coded messages and pseudonyms to generating humor or irony. Crucially, anagrams depend on character-level manipulation; the meaning of the words is secondary to the possible permutations of their letters.

Here is GPT-4 attempting to create anagrams of the word “score":

[Screenshot: conversation with GPT-4]

While the model produced twenty anagrams, it fell short of producing all of the anagrams possible.

For a 5-letter word where all the letters are unique, the number of anagrams is determined by the number of permutations of the 5 letters. To find the number of permutations, you would use the factorial function.

For a 5-letter word, the number of anagrams would be the factorial of 5:

5! = 5 × 4 × 3 × 2 × 1 = 120

So, a 5-letter word with all unique letters would have 120 anagrams.
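A deterministic check is a few lines, assuming (as the prompt does) that every letter permutation counts as an anagram:

```python
from itertools import permutations

word = "score"
anagrams = {''.join(p) for p in permutations(word)}
print(len(anagrams))  # 120, since all five letters are unique
```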

Character Sorting

Here we test if the model can sort a string of characters alphabetically. This task is straightforward but requires character-level manipulation.

[Screenshot: conversation with GPT-4]

Again the model fails, and does not even retain the length of the input text.
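In code the task is a single built-in call, and length is preserved by construction; a sketch with a hypothetical input string:

```python
s = "tokenization"  # hypothetical input; the screenshot uses a different string
result = ''.join(sorted(s))
print(result)                 # aeiiknnoottz
assert len(result) == len(s)  # sorting never changes the length
```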

Not a Bug, but a Feature… Sometimes

The token-level processing intrinsic to models like GPT-4 can be unexpectedly beneficial when dealing with the vagaries of human communication, such as typos or imperfect typing. When a user makes a typographical error, especially within the middle of a word, the overall structure of the word can often remain recognizable due to the context provided by the surrounding words. Because GPT-4 is designed to prioritize context and the patterns of language use at a higher level than individual letters, it is remarkably resilient to such errors, and can often correctly infer the intended word despite the presence of scrambled or incorrect letters.

Moreover, the token-based approach extends to the process of deciphering leetspeak—a form of writing where conventional letters are often replaced with numerals and other characters that resemble the letters they replace.

For instance, the leetspeak word "h4x0r" is intended to be read as "hacker." GPT-4 can often interpret such substitutions correctly because it doesn't rely strictly on the character representation of words, but rather on the broader context and patterns they fit into. So even if letters are replaced with visually similar numbers or symbols, as long as the overall pattern of the word roughly matches a token within the model's repertoire, the intended meaning can be preserved. This ability reflects an understanding of language not as a series of strict character sequences, but as flexible and context-dependent streams of information.

Precision in Coding

In the context of code, the issue of token versus character-level processing is less pronounced due to the inherently structured and precise nature of programming languages, and how language models like GPT-4 are trained to handle them.

Coding often involves single-letter variable names, such as "x" or "i", which are ubiquitous in codebases and algorithms, making it likely that these have been sufficiently represented during the training phase of the model.

Consequently, the tokenizers used within these models are more attuned to the importance of individual characters in the syntax and semantics of code. This heightened sensitivity to character-level details in coding tasks allows models like GPT-4 to perform well with code completion, debugging, and even generation of small programs, because the tokenization process has been optimized to capture the granular distinctions that can be critical to the correct functioning of software.

The same holds for mathematical notation, although GPT-4 lacks advanced reasoning capabilities in that domain.

Band-Aid, or Stitches?

While the core architecture of large language models like GPT-4 is unlikely to shift away from token-based processing, given its efficacy in understanding and generating natural language at scale, there is potential for augmentation with specialized input parsers or tokenizers.

These augmented systems could be designed to activate when the nature of a prompt indicates that character-level analysis is required. For instance, upon detecting cues or keywords within a prompt that suggest a need for character-level manipulation—such as "cipher," "anagram," or "palindrome"—the model could trigger an auxiliary character-based tokenizer. This specialized tokenizer would break the input into individual characters instead of the usual word-level tokens, allowing the model to apply more granular operations to the text as needed.
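In rough pseudocode, the routing might look like the sketch below; the cue list and the whitespace-splitting stand-in for the default tokenizer are purely illustrative assumptions, not any real API:

```python
# Hypothetical cues suggesting character-level manipulation is needed.
CHAR_LEVEL_CUES = ("cipher", "anagram", "palindrome", "sort the letters")

def tokenize(prompt: str) -> list[str]:
    if any(cue in prompt.lower() for cue in CHAR_LEVEL_CUES):
        return list(prompt)  # fallback: one token per character
    return prompt.split()    # stand-in for ordinary subword tokenization

print(tokenize("Decrypt this Caesar cipher: Li kh kdg"))
# ['D', 'e', 'c', 'r', 'y', 'p', 't', ...]
```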

Implementing such a solution would necessitate a degree of adaptive pre-processing of inputs, where the model dynamically selects an appropriate parsing strategy based on the task at hand—analogous to how a human might switch between skimming an article for the main idea and scrutinizing a legal document for specific details.

By integrating this capability, the augmented model could maintain the strengths of its primary token-based design while expanding its flexibility to handle tasks ordinarily outside its purview. The success of this hybrid approach would hinge on the precision of the activation mechanism in correctly identifying when to engage the character-level tokenizer, ensuring that the model's responses consistently align with the requirements of the user's query and that character-based linguistic challenges are addressed effectively.
