Behind Meta's open-source commercial large model: giants fighting for survival while Musk and Apple take a different path
August 9th, 2023

On July 19th, Meta announced on its official website that Llama 2, the latest version of its large language model and its first open-source commercial large language model, had been officially released. At the same time, Microsoft announced that Azure would partner deeply with Meta on Llama 2.

According to Meta's official data, Llama 2 was trained on 40% more data than its predecessor and comes in three versions with 7 billion, 13 billion, and 70 billion parameters. The Llama 2 pretrained models were trained on 2 trillion tokens, with a context length twice that of Llama 1, and the fine-tuned models were trained on more than 1 million human annotations.

Its performance is said to be comparable to that of GPT-3.5, and it has been called the best open-source large model. As soon as the news broke, media and industry observers asserted that the commercialization of Llama 2 was about to change the competitive landscape of large models. How big is the impact of this event? What will it mean for the industry? We invited two industry insiders to discuss: Zhou Songtao, deputy general manager of the product development center at Leo Group Digital Technology Co., Ltd., who has led a team that evaluated most of the mainstream large models at home and abroad; and Jiao Juan, president of the Anxin Universe Research Institute, who has closely observed the technology industry ecosystem at home and abroad for many years.

The following are the main points of view:

① Considering model parameters, training time, compute consumption, and other factors together, Llama 2 has solid grounds to be compared with GPT-3.5.

② Generative AI will bring a sea change to the whole open source system.

③ For some time to come, open source and closed source will swing back and forth, and this field will see a pattern of mutual gaming and competition for quite a long while.

④ The commercial open-sourcing of Llama 2 will not necessarily reduce the cost of using large models for entrepreneurs, but it is likely to push large-model service providers into price wars, which is good news for application builders and entrepreneurs.

⑤ Competition among overseas giants in AI is no longer as simple as developing a second growth curve; it is fierce and decisive, even a fight for survival, and the reasons behind this are worth pondering.

The following is a selection of the conversation:

Tencent Technology: From the perspective of an industry practitioner or application builder, how do you go about evaluating a large model?

Zhou Songtao: The evaluation framework commonly used internationally is MMLU, which covers 57 subjects spanning the humanities, the social sciences, and science and technology, and most of our evaluations are based on this framework. However, our industry is advertising, so given the industry's characteristics, we add some evaluation items of our own.

As we said at the group's management meeting, the focus of the advertising industry is not creativity but control. Generated results must reproduce the advertiser's product with 100% fidelity: its performance, appearance, logo, and so on. Only once that fidelity is achieved is there room for divergence and imagination. So we run separate tests on how well a large model's hallucinations can be controlled. However, most of the large language models and diffusion-based image-generation models on the market can hardly meet advertisers' needs 100%. After general-purpose large models are released, there is still a long way to go before they are fully commercialized.

In addition, the most important factor we consider is cost. Closed-source models have a direct quotation system, and we generally measure the cost per thousand tokens. For open-source models, there are more links in the chain to measure: from deployment to fine-tuning to final inference in production, how much compute is consumed, and how much development cost and data cost goes into maintaining the model.
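The two cost models Zhou contrasts can be sketched roughly as follows. All prices and figures here are hypothetical placeholders for illustration, not real quotes from any vendor:

```python
# Hypothetical per-1,000-token API price -- a placeholder, not a real quote
CLOSED_PRICE_PER_1K_TOKENS = 0.002  # USD

def closed_source_cost(tokens: int) -> float:
    """Closed-source model: cost scales directly with tokens consumed."""
    return tokens / 1000 * CLOSED_PRICE_PER_1K_TOKENS

def open_source_cost(gpu_hours: float, gpu_hourly_rate: float,
                     dev_cost: float, data_cost: float) -> float:
    """Open-source model: compute for deployment, fine-tuning, and inference,
    plus the engineering and data costs of maintaining the model."""
    return gpu_hours * gpu_hourly_rate + dev_cost + data_cost

# One million tokens through a metered API vs. self-hosting overheads
api_bill = closed_source_cost(1_000_000)
self_host = open_source_cost(gpu_hours=100, gpu_hourly_rate=2.0,
                             dev_cost=500.0, data_cost=300.0)
```

The crossover point between the two curves is what makes "the more cost-effective, the more popular" a concrete, computable comparison rather than a slogan.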

Model quality feedback plus cost estimates together form our evaluation of a model. In one simple sentence: the more cost-effective, the more popular.

Jiao Juan: From our point of view, we put more emphasis on how to define vertical requirements. Globally speaking, whether among hard-tech companies or Internet companies, few have the ability to define demand. So the proposition becomes: can large-model companies define demand for certain vertical segments, and if not, can they join hands with ecosystem partners to explore better niche directions? Of course, if some companies have their own data and experience accumulated in specific directions, so much the better. That is our perspective: from the application side, defining the needs of vertical segments.

Tencent Technology: Can Llama 2 really surpass or match GPT-3.5 in performance?

Zhou Songtao: We are still evaluating Llama 2, which will take about two more weeks. But from our study of the paper, and some simple evaluations conducted so far, we can make some general comparisons.

There are a few differences from GPT's original pretraining approach, changes that no other model company had made before. The first is that in the pretraining phase, the traditional Transformer multi-head attention mechanism is changed to a grouped mechanism. It is somewhat like the sharding techniques from earlier big-data work on massively parallel processing: a large number of queries are divided into groups, and each group goes into a training unit, so that parallel efficiency and speed should, in theory, improve greatly. I think this is Meta applying its prior experience with massively parallel processing to make a new change.

Based on this change, I think they are theoretically far better than existing large models in compute requirements and time consumption. According to Meta, Llama 2 training began in January; judging from the release date, pretraining should have taken less time than Llama 1 even though the parameter count is larger, which means the cycle available for multiple training rounds was compressed. This is closely related to the GQA mentioned in the paper. On this point it should even exceed GPT-4, although we don't know exactly what GPT-4 is; outside speculation puts it far above GPT-3 and GPT-3.5.

As for GQA, our current feeling is that it does improve training speed for users with enough accelerator cards, especially parallel GPU compute resources. However, testing and peer feedback found that it places high demands on the size of the compute pool and on hardware, and for well-known reasons, few developers in mainland China have large-scale parallel GPU resources, so GQA may be of limited practical value to us.
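The grouping Zhou describes matches grouped-query attention (GQA), in which several query heads share a single key/value head, shrinking the KV cache and memory traffic. A toy NumPy sketch of the idea (the shapes and naming here are illustrative, not Meta's implementation):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_groups):
    """Toy GQA: query heads within a group share one set of K/V projections.

    q: (n_q_heads, seq, d)   -- one projection per query head
    k, v: (n_groups, seq, d) -- one projection per group (n_groups < n_q_heads)
    Assumes n_q_heads is a multiple of n_groups.
    """
    n_q_heads, seq, d = q.shape
    heads_per_group = n_q_heads // n_groups
    out = np.empty_like(q)
    for h in range(n_q_heads):
        g = h // heads_per_group  # all query heads in a group reuse k[g], v[g]
        scores = q[h] @ k[g].T / np.sqrt(d)
        # numerically stable softmax over the key dimension
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ v[g]
    return out
```

With `n_groups` equal to the number of query heads this reduces to ordinary multi-head attention; with `n_groups = 1` it becomes multi-query attention. GQA sits between the two, trading a small quality cost for much less K/V memory.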

The second point is in the normalization stage. We know the GPT family does its normalization at the layer level of data processing, which makes training results very accurate but consumes a lot of compute. Llama 2 uses a different scheme: it adds learned weight coefficients on top of the layer-wise processing, which helps efficiency and accuracy while also saving compute. These two points are the optimizations made in the pretraining phase.

The paper also mentions that the position embedding in Llama 1 was fixed and could not be modified, whereas in Llama 2 it is dynamically adjustable, which is another highlight. We are interested in this one too and would like to see what practical effect it produces.
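The Llama family encodes position with rotary position embeddings (RoPE), which rotate each pair of features by a position-dependent angle instead of adding a fixed lookup table; whether this is what the speaker means by "dynamically adjustable" is an interpretation on our part. A minimal sketch of the rotation:

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Minimal RoPE sketch: rotate each (even, odd) feature pair by an angle
    that grows with position, so relative position emerges in dot products.

    x: (seq, d) with d even.
    """
    seq, d = x.shape
    pos = np.arange(seq)[:, None]              # (seq, 1) token positions
    freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    angle = pos * freq                         # (seq, d/2) rotation angles
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because the operation is a pure rotation, it preserves vector norms, and the angle between two rotated vectors depends only on their relative distance, which is what makes schemes like context-length extension by rescaling the angles possible.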

Beyond these, Llama 2 has clearly absorbed engineering experience from Llama 1 and the GPT series; the successful practices of the RLHF stage have been reused, which should bring big improvements.

The last thing is the parameters. So far, what we see are the figures Meta itself has publicized on its official website, including roughly more than 1 million human reinforcement-feedback annotations and more than 100,000 fine-tuning samples. If Meta dares to publish these figures, it means the company is confident in the model's parameters, time consumption, compute consumption, and other comprehensive considerations.
