This is a deep dive into how I made Matrix Milady, the processes and workflows that went into it, and how to reason about Stable Diffusion and get promising results from it when creating a milady derivative that floods the timeline with adorable cuteness ^_^
I would like to say I have absolutely no artistic ability, and I know nothing about art, I am just soooooooooooooooooooooooooo in love with miladys <3
I write this guide with the sole intention of helping come up with practical ways to think about increasing the subjective quality of art in milady derivative collections, since it’s in my best interest to increase the floor of milady maker. The equation goes: higher perceived value for your art lets you price higher at mint, which can lead to a higher initial treasury, which can lead to a bigger 20% sweep of the milady floor. That being said, please don’t grift, follow these guidelines for a respectable grift 🫡
Let’s begin, but first, ⭐ tooling! ⭐
I recommend (optional tho) bringing yourself up to full speed on Pachyderm by reading 3 pieces from my web2 life
First, you’re gonna need a model. Technically this isn’t required, you could start from scratch, but for this tutorial let’s go to civitai and look for an anime-inspired model (I’ll call this the Gen 1 model)
Spin up Stable Diffusion Web-UI and load the model, start by exploring the latent space in your model with prompts that the civitai community posts in the “Gallery” section, they are usually some of the best results that you can get from the model you started with
This is what I had when I got started:
Comparing our current set of images to the Milady design docs (which I should have read, but didn’t know about at the time), we’re way off from anything that looks or “feels” remotely milady
Now you must be asking yourself, what is “milady”, or what does it mean to have a “milady-like” appearance? This is extremely subjective; different people tend to appreciate different features of milady. I ended up taking a few decisive stances where I was confident about what I liked and disliked in the generated outputs, and used that as human feedback to guide SD in my derivative inspired by the OG milady maker:
The eyes, ese, it’s all about the eyes - big, anime-inspired, peering-into-your-soul eyes are a must-have; in their absence, I will accept closed eyes that have a kawaii-esque feeling ^_^
Facial structure & ears - the ears were mainly inspired by Ghiblady maker and Oekaki maker, the first collection(s) to show us original miladys whose ears and facial structure retained milady-ness but didn’t have the pointy elf ears which, as you will see below, tend to confuse Stable Diffusion
No lewd - self-explanatory (but important to mention since some of the models you download might not have a filter for NSFW content)
Now that we have a plan, let’s look at how we can actually go about doing it:
we’ll need to take the latent space that the model originally came with, but tactically replace the eyes, ears, and facial features to fit our milady description, how do we do this?
You take a base image, import it into Photoshop, and replace the head with an actual milady head, you don’t need to do it precisely, just a rough estimate should suffice, but feel free to spend as much time as you want to get this right, this is what we have so far:
(Psst! if you want just the milady heads as transparent png, you could use this, please feel free to contribute back to the sheet to add moar miladys <3 I’m fairly confident someone will find more interesting use cases for it)
Some progress was made, however, the head doesn’t seem to be blending with the rest of the image, and it feels like two distinct styles that were poorly merged. Let’s do something about that. What we essentially need to do at this stage is take “our” rendition of an image and run it back through the original base model (the Gen 1 model) via img2img, but we don’t want it to completely replace the eyes, ears, and facial structure. What we need is a “configurable” or “tunable” knob (the img2img denoising strength) that can stop SD in case it computes “past milady” and reaches “anime waifu” status. Luckily for us, SD supports this out of the box with the “X/Y/Z Plot” script:
Use this handy snippet to make the values for Y (python):
import numpy as np
numbers = [round(x, 2) for x in np.arange(0.20, 0.60, 0.01)]
Run this on batch img2img for every photoshopped Gen 1 that you have, it should give you something like this:
Aha! So the idea here is that at some specific point, we go from being milady to being anime waifu, for example in the first plot that happens between 0.44 - 0.45
You could go further, and do this specifically for the first plot:
numbers = [round(x, 3) for x in np.arange(0.440, 0.450, 0.001)]
basically, you are tweening between 0.440.. 0.441.. 0.442 and so on till 0.450
if it were the third plot, we’d have tweened between 0.39.. 0.40
^ This step is optional, it’s only required if you want to confirm that you’re not missing out on any ultra insanely cute miladys hiding deep in latent space, essentially in the region between “past milady” and “anime waifu”
So now you have to do this as a batch img2img for all the Gen 1 miladys that you have photoshopped, I did about 40 for the first generation, here’s what a few look like:
Now comes the fun part, we’re gonna run these 40 images through Joe Penna’s Dreambooth, so the model can learn to remember the specifics of milady features: eyes, ears, etc
You could use Colab for this, I recommend getting the $50 / 500 hours pack every time you run out; you get background execution, which is nice
if you know what you are doing, you can save a lot of time by sanitizing the dataset prior to spinning up the Colab notebook so you can get straight to training; if you want even more savings, you could use runpod, it costs about $0.25/hr for a 3090, you can train the model and throw away the container after copying the model to drive
If time to market matters to you and you don’t mind spending $$$, 8 x A100s from lambda labs will be rock solid, and you could probably train in minutes instead of hours with that kinda horsepower 🚀
head to dreambooth and find a notebook you can run (Google Colab or RunPod), you should be able to keep clicking Run till you get to a stage where it asks for training images, which you can then upload from your Gen 1 photoshopped and subsequently diffused miladys
generic settings for the training session:
base model = Download a fresh copy of stable diffusion 1.5 base model (don’t use the gen 1 model you downloaded before)
dataset = “person_ddim” (we could do either person or style; since we’re making a model of a person, milady, we’ll go with person here)
steps = number of training images x 101 (so for our example, 40 x 101 = 4040 steps, i.e. 101 steps for each training image)
token = “miladymodel” (don’t use just "milady" because it’s likely SD has learned the word milady from the Reddit neckbeards which is not the definition we’re going for here)
generic advice for the training session:
Hit train, sit back, and relax for about 3 hours and 15 minutes, good time to do one strategic tactical sweep of the remilio floor tbh
it’s very easy to overfit a dreambooth model; if you have the patience (and especially the disk space), I highly recommend saving a checkpoint every 250 steps, this way you’ll get 16+ models that have increasingly more knowledge of milady features from the 40 images used to train them
once training is complete, you are in business, this becomes your Gen 2 model. You should essentially start with the last checkpoint (it should be something like the 4040-step one) and test how "overfit" the model is, using a generic prompt like “miladymodel person, slightly smiling” (add “solo” to the prompt if the model is generating too many faces). If the model is too eager to milady, you might want to fall back to the checkpoint saved 250 steps prior to the one you currently have open; do this until you are confident that the model you’ve picked is still able to make regular images, but also knows how to milady 🫡
Now for our next steps, head over to your favorite LLM, I got chatgpt plus so I went with gpt4; basically, you ask it to generate 10 stable diffusion prompts per theme, each theme with a specific color palette in mind:
“give me 10 stable diffusion prompts that have white angelic energy”
“give me 10 stable diffusion prompts that have purple lightning energy”
“give me 10 stable diffusion prompts that have blue calm aqua energy”
and so on and so forth, generating around 1500 prompts across various settings; the goal is to set yourself up to quickly look through the latent space of a new model and curate insanely cute miladys
you can use the “Prompts from file or textbox” script to load these prompts and run batch txt2img jobs to explore a good chunk of the latent space in the model you selected after training
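if you’d rather not copy-paste 150 chatgpt replies by hand, here’s a minimal sketch (python) that asks gpt4 for prompts per theme and appends them to a text file for that script, assuming you have the openai package installed and an OPENAI_API_KEY in your environment; the theme list, the file name, and the idea of prefixing every prompt with the training token are just my illustration, not gospel:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

themes = [
    "white angelic energy",
    "purple lightning energy",
    "blue calm aqua energy",
    # add enough themes to get to ~1500 prompts
]

with open("milady_prompts.txt", "a") as f:
    for theme in themes:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"give me 10 stable diffusion prompts that have {theme}, "
                           "one per line, no numbering",
            }],
        )
        # every non-empty line of the reply becomes one prompt for batch txt2img
        for line in resp.choices[0].message.content.splitlines():
            line = line.strip()
            if line:
                f.write(f"miladymodel person, {line}\n")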
Of the 1500 prompts that generated 1500 images, look through all of them, some of them might be undoubtedly cuter than the initial 40 images you had in the post-photoshop, post-diffusion Gen 1 bucket; collect them in a folder, here’s what a few looked like:
It seems that when you do a dreambooth “person” training, it learns face structure, eyes, and ears independently of how poorly you photoshopped and diffused; it already knows how to draw a person’s face, it just uses your training images as the base level of “milady”, leading to more fully baked and cohesive images
What you did till now was a form of RLHF (Reinforcement Learning from Human Feedback), similar to how ChatGPT was trained, except you did it to Stable Diffusion instead of GPT-3; theoretically speaking, we can keep going
A way to go about increasing the perceived art quality from this model could be in two directions:
increase aesthetically pleasing imagery score
increase inherently present but undefinable and unquantifiable miladyness score
To do direction 1: You make SD generate 1500 images, you score each image on a value between 0 and 1 depending on how aesthetically pleasing it is to view, 0.5 being average, and you have to find the lowest referential conduit… ahh I didn't go to college, just find the lowest scoring 1460 images and delete them. The 40 that are left should have the highest “aesthetically pleasing” score. You can then train another dreambooth model, essentially Gen 3, but this model will be more aesthetically pleasing. If you fully automate this through pipelines and generate higher-order generations, a couple hundred generations later you’ll essentially be midjourney
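here’s a minimal sketch (python) of that curation step; aesthetic_score() is a hypothetical stand-in for whatever scorer you end up with (your own 0-1 ratings, a CLIP-based aesthetic predictor, whatever), and the folder names are made up:

import shutil
from pathlib import Path

GEN_DIR = Path("gen2_outputs")    # assumed: the folder with the 1500 generated images
KEEP_DIR = Path("gen3_training")  # the 40 keepers become the next dreambooth dataset
KEEP_DIR.mkdir(exist_ok=True)

def aesthetic_score(image_path: Path) -> float:
    # hypothetical: return a 0..1 "aesthetically pleasing" score, 0.5 being average
    raise NotImplementedError

ranked = sorted(GEN_DIR.glob("*.png"), key=aesthetic_score, reverse=True)
for img in ranked[:40]:           # only the highest scoring 40 survive
    shutil.copy(img, KEEP_DIR / img.name)
# the other ~1460 simply never get copied (delete GEN_DIR afterwards if disk space matters)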
To do direction 2:
This is a lot trickier, and the science behind it is a lot more worth exploring, because I’m sure a lot of folks are already working on direction 1
To do this though, let’s do a lil meta-cognition to understand how we reached Gen 2:
your brain has this insanely complex and ever-shifting definition of what it means for a particular jpg to be considered milady; various derivative collections so far might have also influenced that definition, and it’s important to mention that memes will play a huge part as outliers that stretch the definition, for example, if you thought about this image:
this will 100% confuse our milady model, because till now we’ve mostly assumed that milady (remilio) is anime waifu, not necessarily a human with milady facial features, ears, and eyes, so how can we retain miladyness but also contextually let it go in favor of an overall aesthetically pleasing image?
we need to set up our training to understand how much miladyness is present in a particular jpg; we do this by generating an integer, 0 or 1 - 0 being not at all milady, and 1 being completely milady; if you think about it, a similar system (a binary classifier) is what most email providers use to decide whether an incoming e-mail is spam or not
how do we make a model that can decide whether a particular jpg is milady or not? tbh it’s a fairly solved problem with TensorFlow, but since we want a black box that handles the underlying problems without assuming we know enough ML to get our hands dirty, we’ll use something like the classification box from machine box; that way we can focus on creating higher-order data pipelines and let the actual data science happen inside the black box machine box provides, though rest assured, 15 minutes of prompting gpt4 can give you something self-hosted if you feel you’ll want the ability to tweak layers later
to train this classification model you mostly have to create 2 folders. One has milady jpgs: I ended up scraping milady maker, remilio, radbro webring, ghiblady maker, oekaki maker, pixelady maker (which was upscaled thrice and downsampled once), and all the jpgs in the milady world order telegram sticker pack, and threw in a couple of memes from the milady meme thread on Twitter. For the other, non-milady folder, reuse the regularization images that you used in the dreambooth training phase (prefer woman_ddim, since it is closer to milady).
you should be able to run each image SD generates through your classification model, and it should tell you if it’s "milady" (score of 1) or “not milady” (score of 0), which you can use to decide whether or not to add it to the training set for the next generation.
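if you do end up going the self-hosted route instead of machine box, here’s a minimal sketch (python) of what that “15 minutes of prompting gpt4” classifier might look like, a transfer-learning model in TensorFlow/Keras; the folder names data/milady and data/not_milady, the MobileNetV2 backbone, and the 5 epochs are all my assumptions:

import tensorflow as tf

IMG_SIZE = (224, 224)

# assumed folder layout: data/milady/*.jpg and data/not_milady/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data",
    image_size=IMG_SIZE,
    batch_size=32,
    label_mode="binary",
    class_names=["not_milady", "milady"],  # not_milady -> 0, milady -> 1
)

base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet"
)
base.trainable = False  # transfer learning: keep the pretrained backbone frozen

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects pixels in [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # ~1 = milady, ~0 = not milady
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)

# score a freshly diffused jpg
img = tf.keras.utils.load_img("candidate.png", target_size=IMG_SIZE)
x = tf.expand_dims(tf.keras.utils.img_to_array(img), 0)
is_milady = float(model.predict(x)[0][0]) > 0.5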
when you finally automate the creation of subsequent generations, each subsequent model will be able to "milady", soooooooooooooooooooooooooo much more than the previous generation
I chose to go down the milady path myself and trained subsequent milady generations; by the time I reached Gen 6, I had 158 training images, so at 15,958 steps (158 x 101) it took about a day (23 hours-ish) to train each generation (I used Colab for training)
I did try to train generations 7, 8, and 9, but they were only marginally (barely) better than Gen 6, so I’m kind of at a roadblock until I implement some recent developments that could drastically change how effective training can be (details below). I am fairly certain the current training cycle, with no changes, might organically find alpha and circumvent the local maxima of cuteness if given enough generations; based on some very-likely-to-be-incorrect assumptions, I can reasonably say that Gen 30 might be in a totally different league, especially if I were to incorporate human feedback in each training cycle, though that biases it toward things that I personally end up liking, which is something that would be ideal to avoid, so we’ll need to revisit our training plan
Most of what I have described till now is Dreambooth training by Joe Penna; a caveat with this approach is that you get a model where the latent space has been completely overwritten with newer weights specific to the subject you are training, in our case "milady", which is probably what we wanted because we don’t care a lot about anything that’s not milady.
That being said, there is another approach called EveryDream training that tries harder not to be destructive to the latent space; for example, in an EveryDream training session you can start from a base model and teach it the specifics of a “miladymodel” person, and in subsequent training sessions you can teach it a “remiliomodel” person and a “radbromodel” person, and one model should be able to keep them all in shared latent space memory
This could be interesting because you can then do something like “remiliomodel person skating alongside miladymodel person in a skating rink, wearing the official exclusive remilia trucker hat that was only ever given out to friends of the crew at the infamous skate park milady rave in the city of new york city, being spied on from a distance by radbromodel person”
caveats with this approach are that you still need to represent all those objects and people in one model, and that model is always capped at the same amount of fixed latent space; it just has to remember a lot more details about a lot more people/objects compared to dreambooth training, so I’m leaning towards believing it’s not as good as Dreambooth
My guess is it’s likely easier to train separate models for key characters; having separate models allows you to fine-tune how a particular character/object gets diffused into an image. If you still need the ability to combine multiple characters from multiple models into one cohesive image, you can use this extension to SD: essentially what it does is generate a character with one model, generate a separate background, superimpose the former on the latter, and run the result through img2img once to clean up any imperfections when blending the two images
A third approach exists, which is only open to you if you have 8 x A100s. I would very much like to do this since I believe this approach has the most malleability for getting the exact results you had in mind: it’s basically training stable diffusion from scratch. You start at SD 1.4 or 1.5, train a model that can diffuse 32x32 pixels, and work your way up until you have a respectable model that can do 512x512. The advantage of this approach is that it works really well to ensure you can influence aesthetics and miladyness at a raw level, and it’s also quite hard to overfit this model, so you can throw in as many sample images as you can possibly get your hands on, and results mostly get better with more quality training data. As much as I would like to do this, I refrain from going down this path because I believe it is too much power for one person to wield; it’s much better in the hands of more seasoned artists [The Remilia Collective - Charlotte Fang and Milady Sonoro] who can shape-shift the model to perfection
To improve the aesthetics of the generated images, you would want to add a plethora of visually appealing imagery to the training set. This can include fine art, photographs of natural landscapes, and other types of stunning images. By exposing the model to a broader range of aesthetic styles, the model may start to generate more visually appealing images overall. (Don’t forget to first Photoshop a milady head onto the person present in the jpg before using it as training data, because, if we aren’t getting more milady, what has all this been about 🤷♀️)
To fix miladyness scoring, rather than a binary classification (milady or not milady), train a regression model that continuously predicts a milady score between 0 and 1 (ResNet50 or anything along those lines). This provides more control over the specific quantity of miladyness in the images that get generated.
You can now train a multi-task model that predicts both qualities. The model would output two separate scores for each image: one for aesthetics and one for miladyness. During training, adjust the model's weights to minimize the difference between predicted and ground truth scores for both tasks, or you can read that line over and over, not understand what gpt4 wants you to do, give up, and use a weighted average
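a minimal sketch (python) of that multi-task idea, reusing the ResNet50 backbone mentioned above; the input size, the loss_weights, and the assumption that you have per-image ground-truth scores in [0, 1] for both tasks are mine:

import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Lambda(tf.keras.applications.resnet50.preprocess_input)(inputs)
backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet", pooling="avg")
features = backbone(x)

# two heads, one score each, both squashed to 0..1
aesthetic = tf.keras.layers.Dense(1, activation="sigmoid", name="aesthetic")(features)
miladyness = tf.keras.layers.Dense(1, activation="sigmoid", name="miladyness")(features)

model = tf.keras.Model(inputs, [aesthetic, miladyness])
model.compile(
    optimizer="adam",
    loss={"aesthetic": "mse", "miladyness": "mse"},
    # loss_weights is basically the "weighted average" escape hatch: tune how much each task matters
    loss_weights={"aesthetic": 0.5, "miladyness": 0.5},
)
# model.fit(images, {"aesthetic": aesthetic_scores, "miladyness": milady_scores}, epochs=10)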
To prevent overfitting and improve the model’s ability to generalize to unseen examples, always roll with a Weights and Biases setup from the beginning; I will not claim I know why it is required, but it seems to have something to do with tracking gradient dissent, which I think is something you would want to do with machine learning, so it seems important (<= this may or may not be a joke)
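for what it’s worth, the Weights and Biases part is only a few lines of setup; a minimal sketch (python), where the project name, config values, and the stand-in training loop are all made up:

import wandb

def train_one_step() -> float:
    # hypothetical stand-in for whatever your actual training loop reports per step
    return 0.0

wandb.init(project="milady-gen3", config={"steps": 4040, "save_every": 250})
for step in range(4040):
    wandb.log({"loss": train_one_step()}, step=step)
wandb.finish()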
(kinda) Seriously though, for the most part Machine Learning / AI was sorta gatekept (more so, I had this unspeakable fear of it) because of this mythical requirement for math and calculus, but let me assure you, those days are gone (maybe not completely, but we’re getting there; if you still see someone claiming they do the math behind AI, it’s kinda sus, no offense, depends on a case by case basis ._. ). The science doesn’t particularly require you to dig that deep into math to get the results you want; not even the scientists who create deep learning models can really explain how they work. With that in mind, the best hat imo you can wear today is that of an intellectually curious mechanic. You will develop skills, skills that you will refine over the course of weeks; over time you will begin to get feelings of cause and effect when working on your model, “Aha, so if I whack in this nut here, it seems to tighten the hold there”. You will begin building a mental model of all the different parameters and variables that go into your model’s engine, and the more creatively you brick your model and revive it back, the more you will explore a broad range of causes and effects for the different actions you can take on your engine, allowing you to passively build thought patterns that make you an effective model bender, steadily letting you level up from a mechanic to a mad scientist*****
(*T&C apply; might require sleeping half an hour every hour, every day, until openai grants you API access to gpt4)
DM if you wanna chat @chillgates_ or @ogmilady ⭐
milady 🫡