GLIDE for image augmentation aka ToadVerse technical details

We had an idea of shipping a derivative of the Cryptoadz NFT collection because we like its art, vibe, and community. We also wanted to play with the idea of parallel blockchains and integrate it into the lore of our collection. So the ToadVerse, a Universe of Toads, seemed an optimal setting for the collection. We only needed the art.

The first approach was the obvious one: StyleGAN. We were not satisfied with the results it produced, so we dropped the idea of fine-tuning a model like StyleGAN in some obscure way to produce stylized toadz. Instead, we concentrated on approaches that could be adapted to our case without any fine-tuning.

The most promising direction appeared to be text-conditional image generation models, i.e., models that take an input text and generate an image based on it. If we could make the resulting image look similar to some toad from the OG collection, then generating the ToadVerse collection would require the following steps (a rough sketch follows the list):

  • Take a random Toad
  • Come up with a text description of the style
  • Generate a new toad from the ToadVerse based on the original toad and style
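In code, the intended pipeline could look roughly like this. This is a sketch only: `collection`, `stylize`, and the example prompts are hypothetical placeholders, not actual project code.

```python
# A rough sketch of the intended pipeline; `collection` and `stylize` are
# hypothetical placeholders, and the prompts are only examples.
import random

STYLE_PROMPTS = ["crayon doodle", "watercolor painting"]

def generate_toadverse_item(collection, stylize):
    toad = random.choice(collection)           # 1. take a random Toad
    prompt = random.choice(STYLE_PROMPTS)      # 2. come up with a style description
    return stylize(image=toad, prompt=prompt)  # 3. generate a new toad from it
```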

If you have followed AI artists, you may already know potentially suitable tools such as CLIP-guided VQGAN, CLIP-guided diffusion, etc. All of these approaches had already been used in various NFT collections, but we were looking for something special and new that was not widely adopted yet. Luckily for us, OpenAI released GLIDE, a large diffusion generative model conditioned on text. So, what is a diffusion model, and how did we use it to create ToadVerse?

As Y. Song and D. P. Kingma once pointed out, turning data into noise is easy, but turning noise into data is generative modeling. This concise view captures the idea of diffusion models. Consider that we have an image. We can add noise and obtain a noised version of the original picture; through a little noise, we can still see the original image. However, we can repeat the process and add more and more noise, making the image disappear into the grain.

(Figure: adding noise step by step turns an image into pure grain, and the reverse process denoises it back; source: https://arxiv.org/pdf/2011.13456.pdf)

As we pointed out above, adding such noise is easy. Now consider reversing this process: take an image of pure noise, pass it through a denoising process, and obtain a high-fidelity image. That is what diffusion models try to achieve.
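As a concrete illustration, here is a minimal sketch of the forward (noising) process in PyTorch, assuming a standard DDPM-style linear noise schedule; the schedule values and image size are illustrative, not the ones from GLIDE.

```python
# A minimal sketch of the forward (noising) process, assuming a DDPM-style
# linear noise schedule; the schedule values and image size are illustrative.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise added at each step
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # cumulative signal kept up to step t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# The larger t is, the more the image disappears into the grain:
x0 = torch.rand(3, 64, 64) * 2 - 1   # a stand-in image scaled to [-1, 1]
slightly_noised = q_sample(x0, t=50)
almost_pure_noise = q_sample(x0, t=950)
```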

To train such a model, one can take a dataset of images, generate a sequence of noised versions of each image, and then optimize the following loss:

$$ L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right] $$

(equation from https://arxiv.org/pdf/2112.10741.pdf)

While it may look scary at first sight, it simply forces the model to predict the direction in which the noisy image has to be updated to get a clean one.
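A minimal sketch of that objective, assuming `eps_model` is any network that takes a noised image and a timestep and predicts the added noise, and reusing `q_sample` and `T` from the previous snippet:

```python
# A minimal sketch of the training objective: the model has to predict the noise
# that was added, i.e., the direction back towards the clean image. `eps_model`
# is a placeholder for any network taking (x_t, t); q_sample and T come from the
# previous snippet.
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0):
    t = torch.randint(0, T, (1,)).item()  # pick a random timestep
    noise = torch.randn_like(x0)          # the noise we are about to add
    x_t = q_sample(x0, t, noise)          # noised version of the image
    predicted = eps_model(x_t, t)         # model's guess of that noise
    return F.mse_loss(predicted, noise)   # || eps - eps_theta(x_t, t) ||^2
```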

Such a model can already sample images: we take random noise and pass it through the model several times to get a sample (a minimal loop is sketched below). However, we have not yet included any conditioning, so we cannot describe what we want to generate. While this could be done with a CLIP model guiding the sampling, OpenAI proposed to train a model that omits CLIP guidance and instead relies purely on its own text representations.
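A minimal sketch of such an unconditional sampling loop, reusing the schedule from the forward-process snippet; this is plain DDPM-style ancestral sampling, and the exact update rule GLIDE uses differs in details.

```python
# A minimal sketch of unconditional DDPM-style ancestral sampling: start from
# pure noise and repeatedly apply the denoising model, stepping t from T-1 to 0.
# betas, alphas_cumprod and T come from the forward-process snippet above.
import torch

@torch.no_grad()
def p_sample_loop(eps_model, shape=(3, 64, 64)):
    x = torch.randn(shape)                   # start from pure noise
    alphas = 1.0 - betas
    for t in reversed(range(T)):
        eps = eps_model(x, t)                # predicted noise at this step
        coef = betas[t] / (1.0 - alphas_cumprod[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn(shape)  # add fresh noise
        else:
            x = mean                                         # final, clean sample
    return x
```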

The proposed method is to train such a model on a labeled dataset of images and their text captions. However, naive training in a purely supervised manner leads to unsatisfactory results, since in most cases the captions describe the images poorly.

To overcome this issue, a simple trick was applied: instead of training purely in a supervised way, the model is sometimes not provided with the textual information for an image, so it also has to learn the prior distribution of images. Now consider that we want to generate an image from a text description. We can get two predicted directions, a conditioned one and an unconditioned one, and weigh them in the following way:

$$ \hat{\epsilon}_\theta(x_t \mid c) = \epsilon_\theta(x_t \mid \emptyset) + s \cdot \left(\epsilon_\theta(x_t \mid c) - \epsilon_\theta(x_t \mid \emptyset)\right) $$

(equation from https://arxiv.org/pdf/2112.10741.pdf)

In practice, "s" works like the strength of following the text description. The bigger it becomes, the more we move away from the prior distribution and towards conditioned energy.

This is the model we used for the ToadVerse collection. However, as we pointed out earlier, we have to make the model generate samples conditioned not only on a text description but also on an external image, so that we can stylize it. Happily, diffusion models allow us to do so in a straightforward way.

While generating an image from scratch starts from pure noise, to sample from a certain condition image we can initialize the diffusion sampling with that image. The only difference in the generation process is that instead of starting the sampling from the beginning, we tell the model that the original image is something it has already produced after some number of generation steps and let it finish the remaining ones. Varying the insertion step of the condition image controls the "force" of stylization: the earlier in the chain we insert it, the more the original image changes, and vice versa.
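A minimal sketch of this stylization trick, reusing the helpers from the previous snippets; the exact insertion mechanics in our pipeline may differ, this only illustrates the idea of finishing the last steps from the original toad.

```python
# A minimal sketch of the stylization trick: treat the original toad as if it
# were an intermediate sample x_k produced by the model itself, and run only
# the remaining denoising steps. Larger k (inserting "earlier" in the chain)
# changes the original image more. Reuses betas, alphas_cumprod and guided_eps
# from the snippets above.
import torch

@torch.no_grad()
def stylize(eps_model, toad_image, c, k=500, s=3.0):
    alphas = 1.0 - betas
    x = toad_image.clone()            # pretend the original toad is x_k
    for t in reversed(range(k + 1)):  # finish only the last steps of the chain
        eps = guided_eps(eps_model, x, t, c, s)
        coef = betas[t] / (1.0 - alphas_cumprod[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x
```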

We also experimented with other ways of conditioning GLIDE on an external image, e.g., starting the generation process from a noised version of the original image, although that strategy produced less stable results.

One of the first results of ToadVerse generation, with the image caption "crayon doodle" and the noised-initial-image strategy. Although it may look nice, all results had this colorful grain in the background regardless of the text description, making the collection less diverse.

Once the approach was fixed, generating the collection came down to selecting appropriate sampling hyperparameters and cherry-picking the best results (a sweep over them is sketched after the list below).

The sampling hyperparameters are:

  • Text description
  • Text guidance strength
  • Step of initial image insertion
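A minimal sketch of what a sweep over these hyperparameters could look like; the prompts, values, and helper names (`eps_model`, `toad_image`, `encode_text`) are illustrative placeholders, not the ones used for the collection.

```python
# A rough sketch of a sweep over the sampling hyperparameters; the prompts,
# values, and helpers (eps_model, toad_image, encode_text) are placeholders.
from itertools import product

prompts = ["crayon doodle", "watercolor painting", "pixel art"]
guidance_strengths = [2.0, 3.0, 5.0]   # text guidance strength s
insertion_steps = [300, 500, 700]      # step k at which the original toad is inserted

candidates = []
for prompt, s, k in product(prompts, guidance_strengths, insertion_steps):
    sample = stylize(eps_model, toad_image, c=encode_text(prompt), k=k, s=s)
    candidates.append((prompt, s, k, sample))
# The best-looking candidates are then cherry-picked by hand.
```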

That is how we managed to tune and improve the generation to produce results that met and even exceeded our expectations in terms of art quality. Exceeding those expectations made us pivot from the original idea of just experimenting to doing a real NFT collection.

That is why we decided to do limited drops on alternative blockchains, which lets us research L2 solutions and allows participants of those drops to take part in the pre-sale of the main collection.

Follow our Twitter https://twitter.com/toadversenft for updates on the first upcoming Polygon drop.
