Tie Zhen: I'm a former Google engineer from the TensorFlow team. I joined Hugging Face last November and have witnessed the changes brought by LLMs and AIGC. Today I am honored to be invited by CFG, and I would like to show you some interesting models I have seen on Hugging Face, hoping to inspire you. I am not a researcher, so I will use easy-to-understand language and practical ideas to walk you through them.
Hugging Face was founded a long time ago to build a chatbot, before LLMs had appeared, and it was hard to compete with what later became ChatGPT. Then Google released BERT, implemented in TensorFlow, but the community had gradually shifted to PyTorch. So we made a PyTorch version of BERT, converting the weights to PyTorch rather than retraining, and from that work we gradually built Transformers. Transformer is a model architecture; transformers is our library, covering all the commonly used models built on the Transformer structure. Developers and researchers can easily add new models, and users can work with many different kinds of models through the same interface. Not only NLP (natural language processing) but also CV (computer vision) and many other fields now use architectures based on the Transformer. After the rise of text-to-image, we built a similar library for diffusion models, called diffusers. These are our two main libraries; another product is the Hugging Face Hub, at huggingface.co. I'll share my screen.
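To make the "same interface for different models" idea concrete, here is a minimal sketch of the pattern. The model functions below are stubs, not the real library; in the actual transformers library, `pipeline("text-generation", model="gpt2")` returns a similar callable.

```python
# Sketch of the unified-interface idea behind transformers' pipeline():
# one entry point per task, returning a callable with the same call shape
# regardless of which model sits behind it. Stubs only, for illustration.

def pipeline(task):
    registry = {
        "text-generation": lambda text: text + " ... [generated]",
        "sentiment-analysis": lambda text: {"label": "POSITIVE", "score": 0.99},
    }
    if task not in registry:
        raise ValueError(f"unknown task: {task}")
    return registry[task]

generator = pipeline("text-generation")
print(generator("My name is"))  # same call shape for every task
```

The point is that swapping in a different model behind a task leaves user code unchanged.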
It's similar to GitHub and covers several sections. 1) Models: there are all kinds of models; we host 150,000 models now. If you want to learn about NLP models, or just want to play with something, you can find it through the filters on the left, which classify models by task. My personal view is that the smaller NLP models are less useful now, since GPT is so powerful that it has taken over much of the NLP market share, but in other areas such as audio, computer vision, and multimodal, you can still find some good models. Any model you have trained can be uploaded. The biggest difference between us and GitHub is that once you upload, we store the large files for you. For example, GPT-2, a very early OpenAI model, has a TFLite file of more than 400 megabytes (MB) and a PyTorch file of more than 500 MB. GitHub can't host such large files; our service includes large-file storage and a CDN. There is a hosted inference API on the right side of the page: once you upload a model, it will guess what the model does. When it sees GPT-2, it knows it's probably a text-generation model and sets up the widget for you, so you can interact with your model right there. Here's an example starting from "what my name is", followed by the blue text that was generated.
Chloe: If you keep feeding it bad data, will it affect the model? How do you solve that?
Tie Zhen: Definitely. When Microsoft launched XiaoIce, everyone played with it, and feeding it strange things affected XiaoIce in negative ways. We can take countermeasures to keep things balanced, such as data-level auditing, and decide which user feedback gets recorded into long-term memory and which does not.
If you are planning to use Hugging Face, pay attention to the Trending page, which shows you the most popular models. This is the spirit of open source, since many models are hosted on Hugging Face directly. Looking at the models, Tsinghua's chatglm-6b is hot, with more than 10,000 downloads and 200 likes. Note that the downloads shown are the last two weeks' downloads, not the all-time total. In addition, ControlNet, Stable Diffusion, and OrangeMixs, which I will introduce soon, are hot text-to-image models. Spaces is very interesting: once a model is released, many people will write a UI for it. We just went through OpenChatKit.
The Space you see here has both a front end and a back end, and I can interact with it — for example, I can use it to generate images. Is this hard to build? I type "A high tech solarpunk utopia in the Amazon rainforest". If we look at the code of this Stable Diffusion Space — .gitattributes, README, you don't need to look at those; look at app.py only — it's actually very simple. The core is the infer function, about 15 lines: once the user clicks the button, it generates the images. This is done with Gradio (gradio.app). The core idea is that you can build a UI app with only a few lines of code, which greatly improves the productivity of AI apps. That's a brief introduction to Hugging Face.
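The app.py pattern described above can be sketched as follows. This is not the Space's actual source; the diffusion call is a stub standing in for a real pipeline (e.g. a diffusers `StableDiffusionPipeline`), so only the structure is shown.

```python
# Conceptual sketch of a Gradio Space's app.py: a single `infer`
# function wired to a UI. The model call is stubbed for illustration.

def run_pipeline(prompt, num_images):
    # Stand-in for a real call such as pipe(prompt).images
    return [f"<image for: {prompt!r}>" for _ in range(num_images)]

def infer(prompt, num_images=4):
    """Called once per button click; returns a gallery of images."""
    if not prompt.strip():
        raise ValueError("prompt must be non-empty")
    return run_pipeline(prompt, num_images)

# In a real Space this is wired up with Gradio, roughly:
#   import gradio as gr
#   gr.Interface(fn=infer, inputs="text", outputs="gallery").launch()
```

The whole app really is this small: one function plus a declarative UI binding.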
Previously I kept an awesome-Spaces list; now I believe it is no longer awesome enough — GPT-4 is really strong, while other Spaces are far from commercialization. But there are still some interesting ones. For example, YOLO is a very commonly used detection model in CV. The core task is to find what is inside an image — for example, letting a city camera count the number of cars on the street. You can run a YOLO model to find the bounding boxes of all the cars. In the case I show here, the author has written a Space where it is very intuitive to see how YOLOv8 differs from v7 and other versions. This is probably even more true for text-to-image models: in the past, it may have been enough to look at the accuracy rate to decide whether a model was good, but now, for many models, you need to try them out to see which works better, or whether one meets your needs in a specific scenario. The result has come out — it runs on CPU, so it's slower, but the effect is still good.
Wang Yi: Does it use the browser's computing resources to run?
Tiezhen: No, this runs entirely in the backend. If you create a Space, we give you an instance in the backend for free — a 2-core CPU and 16 GB of memory — and upgraded tiers are available for a fee.
This is the large language model API: you can call different LLMs. I'll show you a few demos that I think are the most impressive. We know that ChatGPT has a UI and also provides an API, so you can use the API to interact with it. Some people said: "I am not very satisfied with the default UI — can't I write a fully functional UI of my own on top of the API? I could write a Flutter app that runs on my phone and computer." 川虎ChatGPT (Chuanhu ChatGPT) is such an implementation, reproducing the UI's functionality entirely through the public API. For example, when you run the program, you may see "Queue: 1/1 13.5/14.0s" followed by the estimated time. Queue means how many people are ahead of you: 1/1 means there is one person ahead, and your turn comes when they finish. Because there is a queue, it shows the estimated wait. At this point you can duplicate the Space, which creates a copy that runs entirely on your own resources. What does that mean? Say this is a 2-core, 16 GB VM; after you duplicate it, you have your own VM for your own use, or you can make it public to share with others. I have already created one here — I can find it now and chat with it.
I think the biggest value of this, if you are not satisfied with the default UI and want to design a better one — for example, including better prompts or tips as templates — is that you can build on this interface. You can go look at the code; if you need a template feature, you can change the code yourself, which I think is the biggest charm of open source software. You can customize the interface and the interaction. If Word lacks a function you want, it's very difficult for you to add it; but in the open source world, especially with relatively lightweight apps like this, feel free to get creative and build lots of things.
WangYi: Can I understand this as being very much like WordPress, with the domain name and server handled for you? So basically huggingface.co manages it, and you can do incremental customization.
Tie Zhen: You can understand it that way, but it is much more powerful than WordPress. The customizability is very strong: you can use the Gradio I just mentioned, and you can even upload a Dockerfile and have it executed directly on the virtual machine, so the extensibility is very good.
I think AIGC is very similar to 空耳 (mondegreens — mishearing foreign lyrics as words in your own language), so let me give you an example. When you listen to a Korean or Indian song for the first time, you may think it is noise; you can't understand what it says, you may only notice some rhymes. Then you read the subtitles, which are actually someone else's mishearing, written out in Chinese words that sound very similar. When you listen again, you may feel these really are the lyrics, even though they are nonsensical, such as "you do not mind not to shower". Once you accept that sort of lyric, when you listen to the song again, you will find it seems to be singing exactly that: no longer noise, but subtitles saying something. I believe AIGC works the same way.
What's going on in a diffusion model is actually a denoising process. Imagine the process of adding noise: you start with an image, add a little noise, add a little more, and slowly the image becomes unrecognizable. The diffusion model does the reverse: the information inside the noise needs to be extracted. How do you extract it if you give no direction at all? With no direction — like making someone listen to a Korean song 100 times — they still won't understand it. So you give the model a prompt, a control, an embedding: additional information. You let it go through the "mondegreen" process: it feels as if it sees the second picture from the right, and then moves a little from the rightmost picture toward the left. "Here I seem to see a cat." Looking at the second picture from the right, and then the first, I can vaguely make out a cat. After taking the first step, you can take the second, and finally restore the cat completely. If the model believes the cluttered picture is a cat, then it comes to see the cat. The truth — what the original picture was, or what the Korean song actually says — is not important.
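The add-noise / guided-denoise loop described above can be illustrated with a toy 1-D "image". This is a deliberately crude sketch: real diffusion models use a learned noise predictor and a proper noise schedule, whereas here the "guidance" is simply a fixed guess of the clean signal.

```python
import random

random.seed(0)

# Toy 1-D "image": a clean signal we gradually corrupt with noise,
# mirroring the forward diffusion process.
clean = [1.0 if 3 <= i <= 6 else 0.0 for i in range(10)]

def add_noise(x, t, steps=10):
    # Mix the signal toward pure noise as t grows (crude forward process).
    a = 1.0 - t / steps
    return [a * v + (1 - a) * random.gauss(0, 1) for v in x]

def denoise_step(x, guess):
    # One reverse step: nudge the noisy signal toward the current guess
    # of the underlying image -- the prompt is what shapes that guess.
    return [0.5 * v + 0.5 * g for v, g in zip(x, guess)]

noisy = add_noise(clean, t=8)
x = noisy
for _ in range(10):
    x = denoise_step(x, clean)  # guidance: "I think I see a cat here"

err_before = sum((a - b) ** 2 for a, b in zip(noisy, clean))
err_after = sum((a - b) ** 2 for a, b in zip(x, clean))
print(err_after < err_before)  # True: guided denoising recovered the image
```

Without `guess` there is nothing pulling the signal back — which is exactly the "listening 100 times without subtitles" situation.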
License: In fact, people doing text-to-image rarely think about licenses, and the same goes for open source software — it's still a gray area. But we have seen some lawsuits, and in jurisdictions such as the United States those carry weight; once the first case law is established, the impact could be huge. The license for Stable Diffusion is still relatively permissive. For example, if you draw a portrait of someone without their consent, a hand-painted one should not be a problem in our view; but if you use an AIGC tool such as Stable Diffusion to create that portrait without consent, my understanding is that it may violate the license. The reason I call it relatively permissive is that it allows commercial use. In contrast, Midjourney restricts free users to personal, non-commercial use of generated images, and the company retains rights to the images you generate — a big difference from Stable Diffusion. Why do I say this is still a gray area? Because although there is a license, U.S. copyright law, as it currently stands, only applies to works created by people; AI-generated works without a human author may carry no copyright at all. I can't predict the future, but the impact will be significant.
Not only artists will encounter this problem; code generation and ChatGPT will too. Whether you can train on arbitrary data, or on data without a copyright license, is still an open question, as is who owns the trained model and the generated content. This is a fairly large problem and I don't have the answer; I'm just giving you a warning before you use any model.
Wang Yi: My understanding is that the training data is certainly copyrighted; as long as the copyright question over the training data is resolved, it's fine. There are two available solutions. The first is a buyout: if the training data belongs to you, then anything re-created or regenerated from it still belongs to you. The second is that you only have a right of use: you can use the data for training, but you don't own it, so anything you generate also carries only a right of use, not unlimited commercial rights — those belong to the owner of the underlying data. The relationship just needs to be confirmed, because it is indirect. If someone finds a way — for example, through image comparison — to show that your output is very similar to their picture, such that a court believes it came from the training data rather than genuine inference, you are likely to be sued.
Tie Zhen: It's possible. I'm not a lawyer, but I personally think it's a little more complicated, because different parties have different interests. First there is the original creator, the human artist; then the creator of the dataset, such as LAION or the Pile (eleuther.ai), who collected the data into a dataset; then the trainer of the model, such as Stable Diffusion's authors, who took the dataset containing copyrighted data and trained the model. If I then use Stable Diffusion to run inference, my copyright and Stable Diffusion's are both involved. And after I generate an image with AIGC, I may make some changes, or someone else may do something with my image. So there is a long chain in between.
As you know, Stable Diffusion has its own base models — for example Stable Diffusion v1-5 — and we can use the original model to generate something. We can also take a fine-tuned model such as Anything v4 (fantasy.ai) to compare. Different models have different training sets, training methods, and weights, so their styles differ too.
WangYi: Can you do animation through prompts now? Fix a character with ChatGPT, generate different actions and scenes for that character, and produce animation, even short films?
Tie Zhen: I think this involves two problems. First, can I generate one character in different pictures? You see this Anything v4 model: it is the same portrait generated in different forms, and after overlaying ControlNet, you can pose her. Second, can I make this figure move? Basically you need to insert countless frames in between to produce the motion, and to be honest this technology is not yet mature. Stable Diffusion has only been around for half a year, so maybe there will be another model in another half year. Let's be patient.
Frank: For 3D, Netflix is doing it. Runway is also doing that.
Tie Zhen: I'm in some AIGC groups, and it's totally indistinguishable whether an image is AI-generated or a real person. Professionals can really blur the line, which is very impressive. We have AIGC models and example images in the Diffusers Gallery. Some models use LoRA, which you can treat as a tool: a LoRA model is equivalent to one concept, or several concepts, applied in AIGC-generated pictures.
For example, the base model knows the general form of a dog, so you can say "draw a picture of a dog". But that dog is still different from your dog — each dog has something unique. Through DreamBooth, we can let the model learn what this particular dog looks like, and afterwards generate pictures of this dog in different scenes. Note that the training set contains the dog photographed from different angles.
DreamBooth is a fine-tuning technique whose selling point is the effect achievable with just a handful of photos. With more photos and tuned parameters you may get better results, but achieving a good result with only 3-5 photos is already very good. The technique was first invented by Google, and it is not the only one — there is also Textual Inversion and others. We have hosted a related hackathon; here are the DreamBooth models people uploaded. The winner was a 国潮 (China-chic) style model: what they trained was not a specific object but the China-chic style itself.
The problem with DreamBooth is that the resulting model is too big — 4-5 GB after training. So there is a newer technique where the two are combined, making the fine-tuned model very small: LoRA + DreamBooth.
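The size difference comes down to simple arithmetic. Instead of saving a full fine-tuned weight matrix, LoRA saves two low-rank factors per adapted layer. The dimensions below are illustrative, not the exact Stable Diffusion layer shapes.

```python
# Why LoRA checkpoints are tiny: a full fine-tune stores the whole
# updated matrix W (d_out x d_in); LoRA stores only the low-rank update
# factors A (d_out x r) and B (r x d_in), with rank r << d_out, d_in.

def full_params(d_out, d_in):
    return d_out * d_in

def lora_params(d_out, d_in, r):
    return d_out * r + r * d_in

# Illustrative numbers for one large projection layer:
d_out = d_in = 4096
r = 8
print(full_params(d_out, d_in))     # 16777216 parameters
print(lora_params(d_out, d_in, r))  # 65536 parameters -- 256x smaller
```

Summed over all adapted layers, this is how a multi-gigabyte fine-tune shrinks to a few megabytes.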
Wang Yi: For the diffusion model, a single 11 GB GPU can host inference, and for training, a 24 GB A30 could probably handle it, right?
Tiezhen: Yes. I think the reason Stable Diffusion is so popular is 1) it's open source and 2) it can run on home graphics cards rather than requiring an A100. With DreamBooth + LoRA stacked together, the model can be as small as 3.3 MB. What's the benefit? You can combine your LoRA models and customize the effect.
We ran a contest with PaddlePaddle, and Baidu provided the computing power and the code. All you have to do is prepare a few photos, enter the contest, choose a GPU, drag the photos to Paddle's compute center, run the job, and you have your model.
This is a planetary engine from The Wandering Earth. Let's look at this contestant's entry: a man looking at the wandering Earth. Because it's LoRA, you can string different concepts together. Imagine someone made a concept called "moon orbiter"; prompt "a man looking at the wandering Earth + moon orbiter" and the moon orbiter will appear. You can combine different concepts to achieve the effect — you can even say "China-chic style planetary engine". Of course there are many technical details inside, but the general idea is this.
The next technology I am optimistic about is ELITE. Until now you needed a training process to make the model learn a new concept — in traditional machine-learning terms, few-shot. Now zero-shot becomes possible: I give you a few photos and you don't need to fine-tune at all (fine-tuning takes ten to twenty minutes). Can the model look at this one photo, know what I want, and draw it directly? For example, I chose a photo of a kitten and gave it a mask, telling it that the kitten in this position is what I want regenerated in the new picture. This example defaults the concept name to S — previously we named concepts things like the China-chic style or the wandering Earth; here it is simply called S. My prompt is "S in jar". We see the effect is still good. If we used DreamBooth and LoRA the results would be even better — after all, they take much longer to train and use many more photos. Here, one photo is enough to achieve this effect.
Another one is ControlNet. Just now we said that with mondegreens, someone gives you the lyrics and you imagine what the song is singing based on them. Now instead of lyrics, I give you a painting — some other form of control. In the image above, you give the model an edge map and tell it: from this nothingness, you see an edge — what can you make of it? Then you add a prompt and generate. The prompt I gave was "a boy", and several outputs are shown; although the one on the left is a girl, the photo on the right morphologically satisfies my requirements, and it also respects the black-and-white edge map. So if I can draw a sketch and tell the model what it depicts, it can fill in the color, make modifications, and handle light and shadow — and I can use techniques like LoRA to bring in specific concepts.
This is another kind of control. Before we used edges; here it is pose estimation: we call another AI model to identify the skeletal keypoints — where the head is, where the hands are. After recognition, we add the prompt. Overall the effect is good: the hands are not handled well, but the head and the position of the feet are fine. It generated four images altogether. This is ControlNet: you give the model more control and let it follow your ideas during generation.
Extending this idea, we can do more — for example, list all the possibilities and let me choose what I need. Tencent recently released T2I-Adapter, which can use style & color, structure, sketch, pose, depth, and edges all together. In this example, I take the figure from The Scream as the style, I want to use my own pose, and I want to draw a monk. Again the hands are a bit weird; if I gave a full-body shot including the hands, the effect should be better. The interesting thing is that you can put all these controls together in any combination. ControlNet can also do this via Multi-ControlNet, but the UI is not as good.
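The "stack several controls" idea above can be sketched abstractly. In the real ControlNet and T2I-Adapter, each control injects features inside the network at multiple layers; here each control is just a flat vector and combination is a weighted sum, purely to show the shape of the idea.

```python
# Conceptual sketch of multi-control conditioning: several control
# signals (edge, pose, depth, ...) combined with per-control weights
# before guiding generation. Vectors and weights are made up.

def combine_controls(controls, weights):
    assert len(controls) == len(weights), "one weight per control"
    combined = [0.0] * len(controls[0])
    for ctrl, w in zip(controls, weights):
        for i, v in enumerate(ctrl):
            combined[i] += w * v
    return combined

edge = [1.0, 0.0, 1.0, 0.0]   # pretend edge-map features
pose = [0.0, 1.0, 0.0, 1.0]   # pretend pose-keypoint features
mixed = combine_controls([edge, pose], weights=[0.7, 0.3])
print(mixed)  # [0.7, 0.3, 0.7, 0.3]
```

Varying the weights is what lets you say "follow the pose strictly, but the style only loosely".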
Alibaba has a new work called Composer, which can extract the sketch, color palette, mask, and so on from a picture; with different combinations of these, you can generate different pictures. It is not yet open-sourced.
Next, let me introduce some cutting-edge productivity tools that are not models themselves. The first is robust face restoration and upscaling. The scenario: Stable Diffusion generates 512x512 images by default, and with an upscaler you can produce 2048x2048 images. This is one such upscaler. You can try it this way, see how the result looks, and if it's promising, switch to a slower upscaler to improve the picture further.
The tools above are all single-purpose. If you make a living from this or develop a deep interest in the area, you should learn the Stable Diffusion web UI, an open source tool that many people deploy on Hugging Face. It provides a lot of features; what I show here is only a trimmed-down version with some plug-ins not installed. You can do text-to-image: give it a prompt, even a negative prompt, choose among many parameters, and generate, say, a landscape painting. With local editing, instruct it to draw an airplane at a certain position and it will: there is an airplane among the clouds and mountains, and it feels as if it is about to fly out of the painting. There is also image-to-image, which you can explore yourself.
Finally, some examples of sound generation. Here you can use the voice of a Genshin Impact character to speak arbitrary words. For singing, we can generate songs: you can customize the pitch and duration, as I was doing when changing the lyrics. The style of a song or a character's voice still requires a lot of data to train. In the future, techniques similar to LoRA may appear, letting us compose different styles — for example, have G.E.M. sing a Jay Chou song with our own lyrics. Or zero-shot approaches in the spirit of ELITE may emerge: I give the model one song, it learns the style on the spot, breaks it down, and creates new songs.
Chloe: A lot of papers I read before talked about training parameters, training methods, datasets, hardware requirements, and so on. But what you shared made me feel that this wave of AI is really going to reach all of us, and everyone should learn how to use these model tools. Microsoft's CEO also said at one point that not knowing how to use these AI tools in the future will be like not knowing how to use a smartphone now. You've made some complex, academic material very accessible. For us investors, what's especially important is how to take these obscure things and translate them for people who aren't immersed in AI. Previously the AI industry's threshold was relatively high — I remember everyone around me studying AI was a PhD, often in mathematics. Now generative AI has greatly lowered that threshold: by mastering these core skills and tools, everyone can train their own models on their own data. I think the future target is to reach a billion users, if not more. The British government has also announced plans to spend billions of pounds to establish a supercomputing center — whether it is an individual, a country, a religion, or a culture, each may need its own idiosyncratic model. As a front-line practitioner, Tie Zhen has talked with a large number of startups and technology companies running different AI models. We also find the iteration rate very fast: the development of AI in the past few months has exceeded the growth of the past few years. Back when AlphaGo appeared, many people in the industry already predicted that AI would usher in a huge development opportunity within ten years (around 2024-2025). This is why we are positioning ourselves in advance.
Wang Yi: I am now in charge of algorithms at Graphcore, and we make chips to support the training of various models. Your talk helped me understand Stable Diffusion and its ecosystem; we have also run the Stable Diffusion model before. I think the Hugging Face tool chain is interesting. We need to learn how to control the generated content so that it meets our expectations, and I believe the future will bring more natural forms of control through prompts and interactive tools.