Model A: "Thank goodness for you, or I would have scored zero." Model B: "Same here."
January 19th, 2024

Large models have now learned to leverage synergy.

A dazzling array of LEGO bricks, pieced together one by one, can become lifelike characters and landscapes. And combining different LEGO creations can spark new ideas for enthusiasts.

Let's broaden our perspective. In this era of large language model (LLM) breakthroughs, can we assemble different models together like LEGO bricks, without affecting their original functions, and even achieve an effect where 1 + 1 > 2?

Google has already realized this idea. Their research provides a new direction for the future development of language models, especially in terms of resource conservation and model adaptability.

Today's large language models (LLMs) are akin to all-round warriors, capable of common-sense and factual reasoning, possessing worldly knowledge, and generating coherent text. On top of these basic functionalities, researchers have made concerted efforts to fine-tune these models for domain-specific functions, such as code generation, copy editing, and solving mathematical problems.

However, these domain-specific models have begun to encounter tricky issues. For instance, some excel in standard code generation but are not adept at general logical reasoning, and vice versa.

This raises the question: Can we combine anchor models (those with foundational functionalities) with domain-specific enhancement models to unlock new capabilities? For example, can we merge an enhancement model that understands code with the anchor model's language generation abilities to achieve code-to-text generation capabilities?

Previously, the typical solution to this problem involved further pre-training or fine-tuning the anchor model on the data initially used to train the enhancement model. However, this approach is often impractical due to the high computational costs of training large models. Additionally, processing data from multiple sources may be unfeasible due to data privacy concerns, among others.

To address the challenges posed by training costs and data, Google proposed and explored a practical setup for combining models. The setup assumes: (i) researchers have access to one or more enhancement models and an anchor model; (ii) the weights of any model may not be altered; and (iii) only a small amount of data representing the combined capabilities of the given models is available.

Here is how the research was carried out: the team introduced a novel framework called CALM (Composition to Augment Language Models) to address this model-composition setup. Rather than a shallow combination of the enhancement and anchor LMs, CALM introduces a small number of trainable parameters over the intermediate layer representations of both models.

This approach is resource-efficient, requiring only a small number of additional parameters and a small amount of data to extend to new tasks, and it is far more economical than retraining models from scratch. Moreover, it enables more accurate execution of new, challenging tasks than either model alone, while still retaining the capabilities of each individual model. CALM also offers better support for specific tasks and low-resource languages.

This innovation in expanding model capabilities through combination has been well-received:

"The research, along with similar MoE studies, is truly astonishing. It's like stacking models together just like LEGO bricks!"

Another person commented: "We're one step closer to the AI singularity!"

Method Introduction

For a given anchor model m_B and an enhancement model m_A, CALM aims to compose the two models to form m_{A⊕B}, such that the new model's capabilities are a combination of the two independent models' abilities.

During the research process, the developers made the following assumptions: (i) they can access the weights and run forward and backward passes of the models, and they have access to the intermediate representations of m_A and m_B; (ii) changing the weights of either model is not allowed; (iii) they do not have access to the training data, hyperparameters, or training states of the two base models; (iv) they can provide some examples from the target composition domain.

Under these assumptions, the study aims to learn a composition m_{A⊕B} = f(m_A, m_B, θ_C, D_C) that achieves the joint task C. In this setup, the weights of m_A and m_B are frozen; θ_C denotes the additional set of trainable parameters introduced to learn the composition, and D_C is the set of examples used to learn it.
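Putting the definitions together, the learning problem can be summarized as below. This is only a sketch: the factorization into f and the use of a standard next-token cross-entropy loss L_CE over D_C are assumptions consistent with the description above, not details spelled out in this article.

```latex
% Composition: a new model built from the frozen m_A and m_B plus theta_C,
% fit on the combination examples D_C (sketch under the stated assumptions).
m_{A \oplus B} = f\big(m_A,\, m_B,\, \theta_C,\, D_C\big),
\qquad
\theta_C^{\ast} = \arg\min_{\theta_C}
  \sum_{(x,\, y)\,\in\, D_C}
  \mathcal{L}_{\mathrm{CE}}\big(m_{A \oplus B}(x;\theta_C),\; y\big),
\quad \text{with } m_A,\ m_B \text{ frozen.}
```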

Trainable Parameters

The study operates on selected layers of m_A and m_B. Specifically, it learns two sets of additional parameters over these layers: (i) a set of simple linear transformations, f_proj(·), which map representations from the i-th layer of m_A to the dimensionality of m_B's representations, and (ii) a set of cross-attention layers, f_cross(·,·), applied between these linearly transformed representations and the j-th layer representations of m_B.
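To make the mechanism concrete, below is a minimal PyTorch-style sketch of one such composition block, assuming hidden sizes d_a for m_A and d_b for m_B; the class name, single-head attention, and residual connection are illustrative choices, not the paper's exact implementation.

```python
# Sketch of one CALM-style composition block: f_proj maps layer-i states of
# m_A into m_B's hidden size, and f_cross lets layer-j states of m_B attend
# over the projected states. Only these parameters would be trained.
import torch
import torch.nn as nn


class CompositionBlock(nn.Module):
    def __init__(self, d_a: int, d_b: int, n_heads: int = 1):
        super().__init__()
        # f_proj: map enhancement-model states (width d_a) to anchor width d_b
        self.f_proj = nn.Linear(d_a, d_b)
        # f_cross: anchor states query the projected enhancement states
        self.f_cross = nn.MultiheadAttention(d_b, n_heads, batch_first=True)

    def forward(self, h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
        """h_a: [batch, len_a, d_a] from layer i of the frozen m_A.
        h_b: [batch, len_b, d_b] from layer j of the frozen m_B.
        Returns updated anchor states passed on to m_B's next layer."""
        h_a_proj = self.f_proj(h_a)                        # [batch, len_a, d_b]
        attn_out, _ = self.f_cross(query=h_b, key=h_a_proj, value=h_a_proj)
        return h_b + attn_out                              # residual into m_B


if __name__ == "__main__":
    block = CompositionBlock(d_a=512, d_b=1024)
    h_a = torch.randn(2, 16, 512)    # toy m_A layer-i representations
    h_b = torch.randn(2, 16, 1024)   # toy m_B layer-j representations
    print(block(h_a, h_b).shape)     # torch.Size([2, 16, 1024])
```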

As shown in Figure 1, the diagram illustrates m_A (yellow blocks) with different capabilities: key-value mapping (left), a low-resource language (middle), and code (right). Both m_A and m_B remain unchanged during composition; the extra parameters are learned over the models' layer representations. The leftmost panel shows an m_A trained on a set of string-to-integer mappings, e.g., {x_1: 10, ..., x_n: 2}, while m_B is a large LM with arithmetic capabilities. CALM composes these two frozen models to solve the arithmetic-on-keys task, a challenge neither model could solve independently. Notably, even though the composition is trained on arithmetic examples covering only 20% of the keys, it still generalizes to the entire key-value set.

Training Example Construction

Since the target model m_{A⊕B} involves the composition of the two models m_A and m_B, the study also constructs a set of training examples D_C that describes the model's combined skills.

Ideally, the combined task involves two sub-tasks, t_1 and t_2. For example, the combined task C might be to perform arithmetic operations over a set of keys: the enhancement model m_A learns the given key-value pairs (task t_1), while the anchor model m_B is a general model that handles numeric arithmetic well (task t_2).

To learn the combined parameters θ_C, the study defines D_C so that it covers the combined skills of the two models. Compared with methods like LoRA, which require fine-tuning on the entire knowledge source (here, the key-value pairs) during training, the paper finds that training the composition on only a small portion of the keys generalizes to all of them.
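As an illustration of this idea, here is a hypothetical sketch of how such a D_C could be assembled for the key-value arithmetic task; the key names, prompt format, and 20% training split are assumptions made for the example, not the authors' actual data pipeline.

```python
# Hypothetical construction of D_C for key-value arithmetic: each example
# pairs an arithmetic expression over keys with its numeric answer, but only
# a small fraction of the keys ever appears in training.
import random

random.seed(0)

# Task t_1 knowledge memorized by m_A: string keys mapped to integers.
key_values = {f"x_{i}": random.randint(1, 100) for i in range(1, 101)}

# Train the composition on only a small slice of keys (here, 20%)...
train_keys = random.sample(sorted(key_values), k=20)


def make_example(k1: str, k2: str) -> dict:
    """One D_C example: arithmetic over keys, answered using their values."""
    return {
        "input": f"What is {k1} + {k2}?",
        "target": str(key_values[k1] + key_values[k2]),
    }


D_C = [make_example(a, b) for a in train_keys for b in train_keys if a != b]
print(len(D_C), D_C[0])
# ...while evaluation covers expressions over the full key set.
```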

Experimental Results

Key-Value Arithmetic

The authors first studied a scenario with a small enhancement LM (m_A) trained to memorize key-value (KV) mappings from strings to integers, and a large anchor LM (m_B) capable of performing arithmetic operations on integers. They aimed to use CALM to combine the two and gain the new ability to solve arithmetic expressions containing these keys.

Table 1 shows the performance of m_A, m_B, and m_{A⊕B} on these tasks. First, note that the enhancement model m_A achieves 98.1% on the KV-Substitution task, indicating that it memorizes D_KV well. Next, its poor performance on Numeric-Arithmetic (4.2%) shows that it lacks arithmetic capability; consequently, the model cannot solve arithmetic expressions containing keys from D_KV.

As expected, the anchor model m_B scored 0% accuracy on both the KV-Substitution and KV-Arithmetic tasks, as it had never seen data from D_KV. However, it performed well on Numeric-Arithmetic (73.7%), demonstrating its ability to perform arithmetic operations on numbers.

Finally, the combined model m_{A⊕B} was able to solve all tasks with high accuracy, especially the KV-Arithmetic task (84.3%), which neither of the underlying models could solve on its own. This indicates that the combined model can leverage the relevant abilities of both the enhancement model and the anchor model to solve complex tasks.

Next, the authors explored whether a large anchor LM m_B could be combined with a small enhancement LM m_A pretrained on low-resource languages, in order to perform translation and solve math word problems posed in these low-resource languages.

Table 2 shows the performance of the models on the FLORES-200 dataset. For the 10 low-resource languages shown in the table, both base models m_A and m_B were outperformed by the combined model m_{A⊕B}. The authors found that in 175 out of all 192 languages, the combined model m_{A⊕B} performed better than m_B (see Figure 2).

Table 3 shows the performance of these models on elementary math word problems in low-resource and high-resource languages from the GSM8K task. Initially, it can be observed that the enhancement model m_A performs poorly on this task due to its limited mathematical reasoning capabilities. On the other hand, the anchor model m_B, with its math reasoning abilities and transfer learning from high-resource languages, performs much better. Finally, the authors found that the combined model m_{A⊕B} outperformed both m_A and m_B in 18 out of 25 low-resource languages and in 9 out of 10 high-resource languages, demonstrating the effectiveness of model composition. Please refer to Table 6 for the complete evaluation results. Note that the last row of Table 3 shows that m_B fine-tuned on D_NTL performs worse than the pretrained m_B, indicating a forgetting issue. Composing the domain-specific model m_A with m_B using CALM avoids this problem.

Code Understanding and Generation

Code understanding and generation require two distinct types of capability: (a) knowledge of code syntax and semantics, and (b) knowledge of the world that the code manipulates. While LLMs possess rich world knowledge, they often lack specific knowledge of code syntax because code data is underrepresented in their pre-training corpora. Conversely, small models trained specifically on code data understand code syntax well but may lack broad world knowledge and reasoning abilities. CALM can achieve the best of both worlds.

Table 4 presents a performance comparison of the individual models m_A and m_B, the combined model m_{A⊕B}, and a fine-tuned anchor baseline.

Firstly, the evaluation conducted on the HumanEval dataset indicates that m_A, having undergone additional training on D_Code, has a stronger understanding of code syntax. On the other hand, owing to its larger scale and general pre-training, m_B excels at general language understanding, resulting in better performance on the Text-to-Code (T2C) and Code-to-Text (C2T) tasks.

When using CALM to combine these two models, the authors observed a clear transfer and combination of capabilities through significant performance improvements: compared to m_B, the combined model showed an absolute performance increase of 6.1% and 3.6% on the CC (Code Completion) and T2C (Text-to-Code) tasks, respectively. They noted that fine-tuning m_B on D_Code leads to a significant drop in C2T (Code-to-Text) performance due to catastrophic forgetting. Across all languages, CALM maintained performance and was slightly superior to m_B. The authors also studied qualitative examples in the C2T task and observed interesting common patterns, detailed in Appendix B.

Ablation Study

The Impact of m_A

The authors first investigated the impact of m_A by replacing it with vanilla and random variants in the composition process. Table 5 shows how performance on the NTL and code tasks changes when the specialized m_A is replaced by a vanilla PaLM2-XXS checkpoint or an untrained (randomly initialized) version of the model. They found that performance dropped significantly on all tasks for these variants. On the FLORES-200 XX-En task, the number of languages for which the composition outperforms m_B dropped to 115 and 43 for the vanilla and random models, respectively. The vanilla model performed slightly better than m_B, suggesting that even a non-specialized model (trained differently from m_B) may contribute orthogonal capabilities that enhance performance. This finding validates that CALM's gains come from leveraging m_A rather than from merely adding the θ_C parameters.

Impact of Iterative Decoding

The authors also explored a variant in which m_A is used purely as an encoder, meaning the tokens decoded at each time step are not added to m_A's input; only m_A's representations of the prefix are used. This setup mirrors past work on combining encoder and decoder models for image and text. They observed a significant drop in performance across tasks with this encoder-only variant.
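The contrast between the two setups can be sketched as a decoding loop like the one below; the function names and toy stand-ins are hypothetical, and only the branch on `iterative` reflects the ablation described above.

```python
# Sketch of the two ways m_A can feed the composition during decoding. In the
# default "iterative" setup, every newly decoded token is also appended to
# m_A's input, so its layer representations are recomputed each step. In the
# "encoder-only" variant, m_A sees just the prefix once and its cached
# representations are reused for all steps.
from typing import Callable, List

Token = int
# Hypothetical stand-ins: one returns per-token hidden states for a sequence,
# the other picks the next token given the sequence and m_A's states.
AugmentEncoder = Callable[[List[Token]], List[List[float]]]
ComposedStep = Callable[[List[Token], List[List[float]]], Token]


def decode(prompt: List[Token],
           run_m_a: AugmentEncoder,
           composed_next_token: ComposedStep,
           steps: int,
           iterative: bool = True) -> List[Token]:
    """Greedy decoding loop for the composed model m_{A+B}."""
    output = list(prompt)
    prefix_reps = run_m_a(prompt)      # m_A representations of the prefix only
    for _ in range(steps):
        if iterative:
            # Default setup: m_A also consumes the tokens decoded so far.
            a_reps = run_m_a(output)
        else:
            # Ablation: m_A acts as a frozen encoder over the prefix.
            a_reps = prefix_reps
        output.append(composed_next_token(output, a_reps))
    return output


if __name__ == "__main__":
    # Toy stand-ins just to exercise the loop shape.
    run_m_a = lambda toks: [[float(t)] for t in toks]
    next_tok = lambda toks, reps: (toks[-1] + 1) % 100
    print(decode([1, 2, 3], run_m_a, next_tok, steps=4, iterative=False))
```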

Comparison with LoRA

Finally, the authors assessed a parameter-efficient fine-tuning baseline by training LoRA layers to adapt m_B. In all experiments, they set the LoRA rank so that the number of added parameters equaled the number introduced by CALM. LoRA was also trained on the same data as CALM (i.e., D_C). They found substantial differences in performance between the two methods across all tasks and metrics.
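For intuition on how such a parameter-matched baseline can be set up, here is a small illustrative calculation; the parameter counts and the assumption of square adapted weight matrices are placeholders, not figures from the paper.

```python
# Illustrative arithmetic for a parameter-matched LoRA baseline: pick the
# LoRA rank r so that the adapters add roughly as many trainable parameters
# as the composition parameters theta_C.
def lora_params(rank: int, d_model: int, n_adapted_matrices: int) -> int:
    # Each adapted weight matrix gets two low-rank factors: d_model x r and
    # r x d_model (square projections assumed for simplicity).
    return n_adapted_matrices * 2 * d_model * rank


def matching_rank(calm_params: int, d_model: int, n_adapted_matrices: int) -> int:
    # Largest rank whose adapter size does not exceed the CALM budget.
    return max(1, calm_params // (2 * d_model * n_adapted_matrices))


if __name__ == "__main__":
    calm_params = 50_000_000            # hypothetical size of theta_C
    r = matching_rank(calm_params, d_model=4096, n_adapted_matrices=96)
    print(r, lora_params(r, d_model=4096, n_adapted_matrices=96))
```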
