Stable Diffusion Basics

from noise to images, through words

ai621 is based on Stable Diffusion, a neural network built with StabilityAI and, unlike DALL-E, released to the public in mid-August 2022.

It is mainly used to synthesize images from text prompts with a method called "denoising": it starts from random noise and removes it, step by step, until the result matches the prompt as closely as possible.

How does it work?

Have you ever been on a bed, looking at the ceiling, trying to sleep - and you start seeing shapes and images?

That's exactly how it works.

Random, seed 5367918
Random, from my ceiling

Let's get drawing

To do this, we need to give the bot some parameters:

Seed

The bot starts from a random image, which is actually pseudo-random.

To the network it looks truly random, but in reality we (humans) have designed a mathematical formula that produces noise on demand.

To do this, we use a 🌱 seed. The seed is the number fed into the formula that creates the noise.
Same seed, same noise. Even for computers on opposite sides of the world.

This makes it easy to use the same randomness over and over again, so that we only see the effect of the parameters on the image, and not the effect of the randomness.
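Here is a toy sketch of that idea in plain Python. The helper name `make_noise` is made up for illustration; the real model samples a seeded Gaussian latent tensor, but the principle is identical: the seed fully determines the "random" numbers.

```python
import random

def make_noise(seed, size=8):
    """Generate a reproducible block of 'noise' values from a seed.

    Toy illustration: the seed fully determines the output, so the
    same seed reproduces the same noise on any machine.
    """
    rng = random.Random(seed)  # a private generator, seeded explicitly
    return [rng.random() for _ in range(size)]

# The same seed always reproduces the same noise:
a = make_noise(5367918)
b = make_noise(5367918)
assert a == b

# A different seed gives different noise:
c = make_noise(5367919)
assert a != c
```

This is why re-running a generation with the same seed and the same parameters gives the same image: only a change in the parameters, not the randomness, changes the result.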

Prompt and negative prompt

These two define what the bot will try to draw, and what it will try to stay away from.

You can give more weight to tags either by enclosing them in (parentheses) (the more pairs, the stronger the tag) or by specifying a numeric weight (forest:1.76). Think of it as making the words bold.

Changing the weight of a tag/word makes the network pay more or less attention to that word.
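As a rough sketch of how the parentheses translate into numbers: many Stable Diffusion front ends (assumption: the widespread AUTOMATIC1111-style syntax, which ai621 may or may not follow exactly) multiply a tag's weight by a fixed factor, commonly 1.1, for every pair of parentheses around it.

```python
def tag_weight(nesting, base=1.1):
    """Weight of a tag wrapped in `nesting` pairs of parentheses.

    Assumption: the common convention where each pair of parentheses
    multiplies the weight by a fixed factor (often 1.1).
    """
    return base ** nesting

# forest        -> weight 1.0
# (forest)      -> weight 1.1
# ((forest))    -> weight ~1.21
# (forest:1.76) -> explicit weight 1.76, overriding the parentheses rule
```

An explicit numeric weight like (forest:1.76) skips the multiplication and sets the value directly, which is handy when stacking parentheses gets unreadable.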

Inference steps (aka Cycles)

Just like a person, the AI needs to form an idea, sketch it, and then work toward the end result.

This is done in Inference Steps. At every step, the AI looks at the current noise and tries to "fix it" by drawing over it.

This is done as many times as the number of steps, every time removing a little bit more noise. More steps means the AI will have more time to draw the prompt you asked for, but sometimes it also means it will have more time to fixate on useless details and ruin your image.
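The loop above can be sketched with a toy model. Here a single number stands in for the whole image (the real model works on large tensors and predicts which noise to subtract), but the structure is the same: repeat for the chosen number of steps, removing a little more noise each time.

```python
def run_steps(noise, target, steps, strength=0.3):
    """Toy denoising loop: every step nudges the image toward the target.

    `noise` and `target` are single numbers standing in for images;
    `strength` is how much of the gap one step closes (illustrative).
    """
    x = noise
    for _ in range(steps):
        x = x + strength * (target - x)  # "draw over" the current state
    return x

# With only 4 steps the result is still far from the target (a rough
# sketch); with 200 steps it has converged almost exactly.
rough = run_steps(1.0, 0.0, steps=4)
fine = run_steps(1.0, 0.0, steps=200)
assert abs(fine) < abs(rough)
```

Note that in this toy version more steps only ever helps; the real model can also "overwork" details, which is why very high step counts sometimes ruin an image.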

Steps = 4, there is just time to make a simple sketch. Maybe.
Steps = 200, there is so much time available that the AI tried to give us the best eyes it could draw.

Guidance scale (aka Quality)

The AI, however, cannot simply draw anything it sees. There is a limit, set by the "Guidance Scale", which defines how much the AI is allowed to change the image at every step.

A guidance of 0 basically means "don't touch anything" and will result in plain random noise, while guidance at the maximum means "change as much as you can", which usually leads to the bot fixating on a couple of random words and drawing them far too strongly.
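Under the hood, the guidance scale is usually implemented as "classifier-free guidance" (assumption: ai621's Quality setting maps onto this standard Stable Diffusion mechanism; the variable names here are illustrative). The model makes two noise predictions, one with your prompt and one without, and the scale amplifies the difference between them.

```python
def guided_noise(uncond, cond, guidance):
    """Classifier-free guidance, sketched on single numbers.

    `uncond` is the model's prediction with an empty prompt, `cond`
    the prediction with your prompt; `guidance` scales how hard the
    result is pushed toward the prompt.
    """
    return uncond + guidance * (cond - uncond)

# guidance = 0  -> the prompt is ignored entirely
# guidance = 1  -> the plain prompted prediction, no amplification
# guidance = 50 -> the prompt's influence is amplified 50x
```

This is why extreme values misbehave in both directions: at 0 the prompt never enters the picture, and at very high values small prompt-driven tendencies (like "fox eyes") get blown up out of proportion.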

Guidance = 1, the poor AI couldn't draw a concept fast enough, and we got random fox pieces
Guidance = 50, the AI noticed the fox eyes and liked them so much it made them very big