DreamBooth fine-tuning using a Decentralized Object Store

Submission for the 2023 Hugging Face DreamBooth Hackathon by Kevin Leffew

To follow along step by step in a notebook, check out: https://colab.research.google.com/drive/1EnqpDiKOVYhR0c6f4CgmDg2zqcbYZJpB?usp=sharing

The dataset and card for this model (bethecloud/golf-courses) are located here: https://huggingface.co/datasets/bethecloud/golf-courses

A Quick Overview

Fine-Tuning Stable Diffusion for Mythological Golf Course Images with Storj DCS

Stable Diffusion is a revolutionary and enormously popular text-to-image foundation model used to generate realistic images. Applications like Lensa AI are generating millions of dollars a month acting as simple UX wrappers on top of Stable Diffusion, producing custom, guided, fine-tuned output for users looking to create exciting and beautiful avatars.

Stable Diffusion works by encoding English-language prompts into embeddings (using CLIP) and iteratively denoising a latent representation conditioned on those embeddings, gradually bringing the output closer to the input prompt.
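As a rough illustration of that iterative process, here is a toy sketch in plain NumPy (not the real UNet or scheduler math) in which a latent vector is nudged a little closer to a conditioning target at every step:

```python
import numpy as np

def toy_denoise(latent: np.ndarray, text_embedding: np.ndarray, steps: int = 50) -> np.ndarray:
    """Crude stand-in for the diffusion loop: each step moves the latent a
    small fraction of the way toward the conditioning signal, the way each
    real denoising step moves the latent toward an image matching the prompt."""
    for _ in range(steps):
        latent = latent + 0.1 * (text_embedding - latent)
    return latent

# Start from pure noise and "denoise" toward a fixed target.
rng = np.random.default_rng(0)
start = rng.normal(size=4)
target = np.ones(4)
result = toy_denoise(start, target)
```

After 50 steps the result sits very close to the target, mirroring how the real sampler converges on a prompt-consistent image over a similar number of scheduler steps.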

  • A key benefit of Stable Diffusion is that, as a simple, open-source text-to-image model, it can be fine-tuned to generate high-quality images of a specific subject or style.

For this application, we fine-tuned a model to create scenic landscapes of golf courses, with historic and mythological buildings in the background.

How (and why) do we “fine-tune” a Stable Diffusion model?

To fine-tune a Stable Diffusion model, you first need to select a small number of images to use as training data. These images should be high quality and representative of the type of images you want the model to generate.

Creating a training dataset to fine-tune the model

For this example, we pull the training data from Hugging Face, and mirror it to Storj DCS for increased performance and content-addressability.

You can access the public bucket of training images yourself through Storj DCS edge services (see the direct download link in the Dataset Structure section below).

An example of one of the 21 “golf-course”-tagged Unsplash images stored in the decentralized cloud.

The object is named “dean-SuGEzQkeJno-unsplash.jpg” in the dataset.

Once the model has been trained, you can evaluate its performance on a held-out validation set. This will give you an idea of how well the model can generate images similar to the training images. If the model's performance is not satisfactory, you can adjust the hyperparameters and train again.

Next up: Defining a data collator

Now that we have a training dataset, the next thing we need is to define a data collator.

A data collator is a function that collects elements in a batch of data and applies some logic to form a single tensor we can provide to the model.
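A minimal sketch of such a collator for DreamBooth-style training (assuming each example already carries a preprocessed pixel_values tensor and tokenized input_ids) might look like:

```python
import torch

def collate_fn(examples):
    """Stack per-example tensors into a single batch for the model."""
    # Image tensors: (C, H, W) each -> (B, C, H, W)
    pixel_values = torch.stack([e["pixel_values"] for e in examples])
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()

    # Tokenized prompts: (seq_len,) each -> (B, seq_len)
    input_ids = torch.stack([e["input_ids"] for e in examples])

    return {"pixel_values": pixel_values, "input_ids": input_ids}
```

The function is then passed to a PyTorch DataLoader via its collate_fn argument.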

If you'd like to learn more, you can check out this video from the Hugging Face Course: https://hf.co/course

Last Step: Load the components of the Stable Diffusion pipeline

The pipeline is composed of several models, all of which use the 🤗 Diffusers and 🤗 Transformers libraries:

  1. A text encoder that converts the prompts into text embeddings. Here we're using CLIP, since it's the encoder used to train Stable Diffusion v1-4.

  2. A VAE, or variational autoencoder, that converts images to compressed representations (i.e. latents) and decompresses them at inference time.

  3. A UNet that applies the denoising operation to the latents produced by the VAE.

For the Hugging Face DreamBooth hackathon, we fine-tuned the base model using a simple Hugging Face dataset (bethecloud/golf-courses). The compute ran easily in a public Google Colab notebook (linked in the header), and the model was trained on a collection of 21 landscape images of golf courses (with the images replicated to the decentralized cloud via Storj DCS).

Here is an example of fine-tuned DreamBooth output for:

prompt = "a photo of {golf} {course} with the Acropolis from Ancient Greece in the background"
Source: https://link.storjshare.io/ju7ym5x7k3cqktqbfuk3mpg2h7sq/golf-course-output%2Fgolf-acropolis.png

Dataset Summary: golf-courses

This dataset (bethecloud/golf-courses) includes 21 unique images of golf courses pulled from Unsplash.

The dataset is a collection of photographs taken at various golf courses around the world. The images depict a variety of scenes, including fairways, greens, bunkers, water hazards, and clubhouse facilities. The images are high resolution and have been carefully selected to provide a diverse range of visual content for fine-tuning a machine learning model.

The dataset is intended to be used in the context of the Hugging Face DreamBooth hackathon, a competition that challenges participants to build innovative applications with Hugging Face libraries. This submission is for the landscape category.

Overall, this dataset provides a rich source of visual data for models that need to understand and classify elements of golf courses. Its diverse imagery and high resolution make it well suited to fine-tuning models for tasks such as image classification, object detection, and image segmentation.

By using the golf course images as part of their training data, participants can fine-tune their models to recognize and classify specific features and elements commonly found on golf courses. The ultimate goal after the hackathon is to serve this dataset from decentralized cloud storage (like Storj DCS), increasing its accessibility, performance, and resilience by distributing it across an edge network of over 17,000 uncorrelated participants.

Usage

The golf-courses dataset can be used by modifying the instance_prompt: a photo of golf course.

Languages

The language data in golf-courses is English (BCP-47 en).

Dataset Structure

The complete dataset consists of 21 objects.

Accelerated download using Decentralized Object Storage (Storj DCS)

A direct download for the dataset is located at https://link.storjshare.io/juo7ynuvpe5svxj3hh454v6fnhba/golf-courses.

In the future, Storj DCS will be used to download large datasets (exceeding 1 TB) in a highly parallel, highly performant, and highly economical manner, by utilizing a network of over 17,000 diverse and economically incentivized datacenter node endpoints.

  • My previous post covered a specific architecture for combining Stable Diffusion with distributed cloud technologies:

    https://mirror.xyz/bitkevin.eth/F3cZoh630VvRKgmPzVQ3QXM5gQu_s9kd3YhW6Qt2mRc

  • The synthesis between web3 composability (through on-chain notarization of image output) and text2img tools like Stable Diffusion is obvious and will be covered in future posts and app demos.

Source Data

The source data for the dataset is simply pulled from Unsplash.

Licensing Information

MIT License

Thanks to John Whitaker and Lewis Tunstall

Thanks to John Whitaker and Lewis Tunstall for writing out and describing the initial hackathon parameters at https://huggingface.co/dreambooth-hackathon.
