Computers have learned to generate images from text!
They can draw a pencil sketch, a watercolor painting, or generate realistic photos of objects and scenes which do not exist, like human faces, pottery, architecture and so on. The image in the banner was generated using the text prompt, “a painting of new york on new year’s eve”, and the following image with “beach palm trees sunset sand waves”, and well, they are quite good!
In this article, I will try to give a high-level, mildly technical and intuitive explanation of various aspects of image generation. I will only explain the generation process and completely skip the learning part, i.e. how exactly the models are trained to be able to do sensible generation.
To explain generation, we need to establish some basic background as well. So I hope this article, while a bit long, is helpful to the average non-AI reader for understanding the intuition behind image generation.
I will also keep on editing this for a while, adding links to papers and other resources referred to here, as well as the ones I believe would be interesting to the readers.
You might want to get a drink now …
The technology behind image generation belongs to a broad class of machine learning methods called Generative methods. The name generative comes from the fact that, as you would guess, these models aim to generate samples from a distribution.
Let me attempt to explain distributions, in an informal way. Think of a set of objects, say human eyes, and then describe their possible states, e.g. brown, blue, black etc., and finally specify the proportion of objects you would expect to have those states, e.g. 20% of people have brown, 15% blue, and rest black eyes. This will constitute a simple one dimensional (1D) distribution. 1D because you only have one thing to specify.
How in this world is this related to image generation?
Bear with me please.
Human eyes are not just defined by their color though! Let’s also add approximate shapes to our description
of eyes. Say we have almond and round eyes (I am just making these up here). Now when we want
to specify an eye, we have to specify both color and shape, i.e. the eye is brown and round, or black
and almond shaped. And when we have to specify the proportions we now have to provide 6 percentage
values, one each for the \(3 \times 2\) combinations of the two things, color and shape. Once you do
that, you will have a 2D distribution this time!
If we also add the skin tone of the person, and limit it to light and dark, then it becomes a 3D distribution. Now as the number of such things to specify increases, you would perhaps notice that some combinations seem unlikely, e.g. dark skin tone with blue eyes (again, making it up). This is an important bit which we will revisit when we talk about image generation soon.
Let me summarize a couple of key takeaways I would request you to process and remember.
If you have several things to specify an object, and each of them has options it can take, then the number of proportions you require equals the product of the number of options for each.
Some of the combinations would be unlikely, i.e. the percentage of such combinations occurring would be zero (or very close to it).
Now let’s come to generative models first, and we will get to images in a few seconds.
Say we wanted to generate eyes within the above scope, i.e. a description based on three things. If we knew the proportions of the combinations, we could randomly choose a combination with the respective odds. What would come out would be a legitimate description of an eye. And well, to perhaps an utterly disappointing climax, that would be a trivial generative model.
Please do not hate me yet, it will get interesting as we get to images next.
In summary, if you can store the odds/probabilities of all the possible combinations occurring, then that list is your generative model. You can sample/generate by randomly drawing objects according to these respective odds.
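To make this concrete, here is a minimal sketch of such a "list of odds" generative model for the eye example. All the proportions in the table are made up by me for illustration (just like the eye statistics above), and `sample_eye` is a hypothetical helper name.

```python
import random

# Made-up proportions for the 3 x 2 x 2 = 12 combinations of
# (eye color, eye shape, skin tone). Illustrative only; they sum to 1.
proportions = {
    ("brown", "almond", "light"): 0.05,
    ("brown", "almond", "dark"): 0.07,
    ("brown", "round", "light"): 0.04,
    ("brown", "round", "dark"): 0.04,
    ("blue", "almond", "light"): 0.08,
    ("blue", "almond", "dark"): 0.00,  # an "unlikely" combination
    ("blue", "round", "light"): 0.07,
    ("blue", "round", "dark"): 0.00,   # an "unlikely" combination
    ("black", "almond", "light"): 0.15,
    ("black", "almond", "dark"): 0.20,
    ("black", "round", "light"): 0.12,
    ("black", "round", "dark"): 0.18,
}

def sample_eye(rng):
    """Draw one (color, shape, skin tone) combination according to its proportion."""
    combos = list(proportions)
    weights = list(proportions.values())
    return rng.choices(combos, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_eye(rng) for _ in range(1000)]
print(samples[:3])
```

The whole "model" is the table itself; sampling is just a weighted random draw from it.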
Then why is it a big deal?
The problem with such an approach is that the number of proportions to store blows up rather quickly.
Above we had three things (color, shape and skin tone) with \(3\) , \(2\) and \(2\) options
respectively, which required \(3 \times 2 \times 2 = 12\) proportions to save. Say we have 10 things
with 20 options each, then you do the math: \(20 \times \ldots \times 20\) (10 times).
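If you would rather let the computer do the math, the product rule is a one-liner. This is only a small sketch; `num_proportions` is a name I made up for it.

```python
import math

def num_proportions(options_per_thing):
    """Number of combinations = product of the number of options of each thing."""
    return math.prod(options_per_thing)

print(num_proportions([3, 2, 2]))   # color, shape, skin tone -> 12
print(num_proportions([20] * 10))   # 10 things, 20 options each -> 10240000000000
```

So ten things with twenty options each already need about ten trillion proportions.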
How does all that tie to images?
(If you do not really know what a digital image is, or if pixels are still a mystery to you, read
the next section and then come back to the following paragraph)
Think of each pixel of an image as one of the things that you need to specify, and all the colors a
pixel can take as the possible options for that thing. If you have a \(100 \times 100 = 10,000\)
pixel image and each pixel can take say \(8\) colors, then the same math as above makes the
number of combinations \(= 8 \times \ldots \times 8\) (10,000 times)!!!
Hope you would agree that computing and keeping a list of probabilities that large would not be tasteful, to humans and computers alike.
Let’s now dig into images a bit more.
If you did not already know, digital images are basically made up of dots, similar to Pointillism paintings like Paul Signac’s Femmes au puits, currently displayed at Musée d’Orsay, Paris.
The only constraint is that the dots in digital images are on a strict grid. They are displayed so closely by screens that the human eye sees a continuous image instead of a dotted one (like the one above). As an example, modern monitor and phone screens have densities around 200 ppi, i.e. they are displaying that many dots per inch!
Each such dot, called a pixel (the p in ppi above), of an image can take many colors, about \(16.7\) million (see note 1 at the end)! And a budget phone camera would capture images of around \(5\) megapixels, which means it would have around \(5\) million pixels! I believe you are already getting a feel of the blowing up we talked about in the section above.
Explaining distribution of images would be easier with checkerboard images first. These are chessboard like images whose pixels only take two values, either black or white (binary images), and the colors of pixels, squares here, sharing a side are different.
Since the first square can be either black or white, there are two possible checkerboard images for any size.
Here are example binary images, top row shows the only two checkerboard images, and the bottom shows some non-checkerboard images, of \(3 \times 3\) size.
How many possible black and white images would there be for size \(3\times 3\)?
Well, each square can take \(2\) values, so \(9\) squares can take \(2 \times \ldots \times 2\) (\(9\) times), which is \(2^{9} = 512\). Out of these possible \(512\) images, only two are correct (checkerboard). Even if the images were of size \(100 \times 100\), there would still be only \(2\) valid images, out of a total of \(2^{10,000} \approx 10^{3010}\) possible images! I.e. the distribution is really sparse: only two combinations have \(50\%\) probability each, and the rest have none.
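For the small \(3 \times 3\) case we can simply list every binary image and count the checkerboards. The sketch below does exactly that by brute force; the function names are my own, and \(0\) and \(1\) stand in for black and white.

```python
from itertools import product

def is_checkerboard(img):
    """True if every pair of squares sharing a side has different colors."""
    n = len(img)
    for r in range(n):
        for c in range(n):
            if c + 1 < n and img[r][c] == img[r][c + 1]:
                return False
            if r + 1 < n and img[r][c] == img[r + 1][c]:
                return False
    return True

def count_checkerboards(n):
    """Enumerate all n x n binary images; return (total images, valid checkerboards)."""
    total = valid = 0
    for bits in product((0, 1), repeat=n * n):
        img = [bits[r * n:(r + 1) * n] for r in range(n)]
        total += 1
        valid += is_checkerboard(img)
    return total, valid

total, valid = count_checkerboards(3)
print(total, valid)  # 512 2
```

This brute force only works for tiny sizes, which is precisely the point: the list of all images grows far too quickly to enumerate.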
Similar to checkerboard images, the distribution of images of natural scenes and objects is also very sparse. Neighbouring pixels are usually highly correlated, as they either fall on the same object or on different but co-occurring ones. Consider now natural images with \(8\) bit grayscale pixels, i.e. where each pixel can have one of \(2^8 = 256\) possible shades of gray. Now imagine randomly sampling the gray value for each pixel independently. Below (on the left) are a couple of images generated by doing so. It should not be hard to imagine that it will take a really long time, if at all, to hit a plausible natural image, like the one on the right below, by doing such sampling.
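Here is a rough sketch of that independent sampling, together with a crude measure of how much neighbouring pixels differ. As an assumption on my part, a smooth left-to-right gradient stands in for a "natural" image, since I cannot ship a real photo inside a code snippet; all function names are hypothetical.

```python
import random

def random_gray_image(height, width, rng):
    """Sample every pixel independently and uniformly from the 256 gray levels."""
    return [[rng.randrange(256) for _ in range(width)] for _ in range(height)]

def mean_neighbor_difference(img):
    """Average absolute difference between horizontally adjacent pixels."""
    diffs = [abs(row[c] - row[c + 1]) for row in img for c in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
noise = random_gray_image(64, 64, rng)

# Stand-in for a natural image: a smooth gradient from black (left) to white (right).
gradient = [[(255 * c) // 63 for c in range(64)] for _ in range(64)]

print(mean_neighbor_difference(noise))     # large: neighbours are unrelated
print(mean_neighbor_difference(gradient))  # small: neighbours are strongly correlated
```

Independent sampling produces images whose neighbouring pixels have nothing to do with each other, which is why the result looks like static rather than like a scene.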
Perhaps we could, given a very very large computer, enumerate all \(256^{H \times W}\) images, \(H, W\) being the height and width respectively, and then find those that are naturally plausible, but only for tiny values of \(H, W\). This quickly becomes intractable. So noting all possible images and their probabilities, while being a legitimate (and trivial) generative model, is not a feasible option anymore.
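To get a feel for how quickly this blows up: each of the \(H \times W\) pixels can independently take one of \(256\) values, so the number of images is \(256\) multiplied by itself \(H \times W\) times. The small sketch below reports the count on a logarithmic scale, i.e. roughly the number of digits; `log10_num_images` is a name I made up.

```python
import math

def log10_num_images(height, width, levels=256):
    """log10 of levels ** (height * width), the number of all possible images."""
    return height * width * math.log10(levels)

print(256 ** (2 * 2))                     # exact count for a 2 x 2 image
print(round(log10_num_images(8, 8)))      # about this many digits for 8 x 8
print(round(log10_num_images(100, 100)))  # and for 100 x 100
```

Even an \(8 \times 8\) grayscale image has more possibilities than a number with 150 digits.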
Hence we need to summarize or codify the distribution such that representing it, and eventually doing useful things like generating new samples from it, becomes feasible.
Generative models of images thus specify some clever way of generating natural images from simple to specify distributions. The generation process followed by all such methods is some mathematical operation, or a sequence of mathematical operations, performed on elements drawn from the simple starting distributions.
A generative model for the checkerboard images above could be as simple as:
Sample white or black color randomly, e.g. flip a coin and pick white if heads, tails otherwise.
Color the first square (pixel) depending on the sample from step 1, and then just alternate the colors of the rest of the squares.
It should not be very surprising to see that generating an arbitrarily large checkerboard image takes just one draw of black or white with \(1:1\) odds. We already established earlier that there are only two valid checkerboard images of any size and both of them are equally likely.
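The two steps above can be sketched in a few lines. This is only an illustration under my own naming; the image is a plain list of rows holding the strings "white" and "black".

```python
import random

def generate_checkerboard(height, width, rng):
    """One coin flip decides the first square; every other square just alternates."""
    first = rng.choice(("white", "black"))            # step 1: the only random draw
    other = "black" if first == "white" else "white"
    # step 2: squares sharing a side always get different colors
    return [[first if (r + c) % 2 == 0 else other for c in range(width)]
            for r in range(height)]

board = generate_checkerboard(3, 3, random.Random(0))
for row in board:
    print(row)
```

Note that the size of the image does not matter at all: a \(1000 \times 1000\) checkerboard still needs only that single random draw.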
Now, let’s consider a simple image distribution of ocean and sky meeting in a perfectly level horizon. Assume further that the sky and the ocean parts are standard textures and the only thing to specify, i.e. the only variable of the distribution, is the height of the horizon from the bottom. Here are some images from such a distribution.
This would be a 1D distribution, where we would first sample the height of the horizon, and then paste sky above and ocean below it. We could have the height of the horizon distributed according to a Gaussian distribution, i.e. \(h \sim \mathcal{N}(\mu, \sigma^2)\) (Gaussian distributions are also called Normal distributions and are often mathematically represented with \(\mathcal{N}\)). And then the generation process would be simple: sample the height and paste sky and ocean parts. In this case the parameters we would need to know for the generative model would be the mean \(\mu\) and variance \(\sigma^2\) of the Gaussian.
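A minimal sketch of this generator could look as follows. The "textures" are just the labels "sky" and "ocean", the values of `mu` and `sigma` are arbitrary numbers I picked, and `sigma` here is the standard deviation (the spread) of the Gaussian, which is what Python's `random.gauss` expects.

```python
import random

def generate_horizon_image(height, width, mu, sigma, rng):
    """Sample a horizon height, then paste sky rows above and ocean rows below it."""
    h = round(rng.gauss(mu, sigma))  # horizon height in rows, measured from the bottom
    h = max(0, min(height, h))       # keep the horizon inside the image
    sky = [["sky"] * width for _ in range(height - h)]
    ocean = [["ocean"] * width for _ in range(h)]
    return sky + ocean               # row 0 is the top of the image

rng = random.Random(0)
image = generate_horizon_image(height=10, width=6, mu=4.0, sigma=1.5, rng=rng)
for row in image:
    print(row[0])
```

The clamping of `h` is my own addition, so that an unlucky draw far from the mean still gives a valid image.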
We can now increase the complexity and have images with standard sky and grass templates, variable height horizon line, and a standard house as well, which can be positioned anywhere in the grass and can be of different size. Some examples could be:
In this case we would need to specify \(4\) variables, i.e. the height of the horizon line, the \((x,y)\) position of the house, and the size (scale) \(s\) of the house. The generation process would similarly be:
Sample the height of the horizon
Sample a point \((x,y)\) on the grass region (possibly, making sure that the house is within the image)
Sample the size of the house
Note here that the generation process is sequential, i.e. you need to fix the horizon first so that you can place the house on the grass/ground. This is also the case with generative methods like the ones based on diffusion.
Once we have all this information, rendering the image is just a matter of pasting the elements appropriately.
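The sequential nature of these steps is easy to see in code. In the sketch below, all the ranges and the uniform sampling are my own assumptions, and the house is taken to be a square whose bottom-left corner sits at \((x, y)\), with \(y\) measured from the bottom of the image.

```python
import random

def sample_scene(height, width, rng):
    """Sample the 4 variables in order: horizon first, then house position, then size."""
    horizon = rng.randint(1, height - 1)       # rows of grass, counted from the bottom
    x = rng.randint(0, width - 1)              # house position, depends on the horizon:
    y = rng.randint(0, horizon - 1)            # it must land on the grass
    max_size = min(width - x, height - y)      # largest house that stays inside the image
    size = rng.randint(1, max_size)
    return {"horizon": horizon, "x": x, "y": y, "size": size}

rng = random.Random(0)
print(sample_scene(height=100, width=100, rng=rng))
```

Notice that the range for `y` cannot even be written down before `horizon` is known, which is why the order of the steps matters.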
I hope you can see where this is going. We saw how to generate simple toyish images, but extrapolating from this, natural image generation should make some intuitive sense. Generative models specify distributions of some underlying aspects of the components of the elements, which need to be put together in different (but precisely quantifiable) ways to make new samples from the distribution. The difference in different generative models is how such components are defined, parametrized, composed, and eventually learned from the available data.
Continuing on the need for codifying complex image distribution, let us now talk about Diffusion Models.
Diffusion Models are one way of representing image distributions in a feasible and useful way.
The basic concept is quite simple. Say we want to have a generative model of \(100 \times 100\) images. We define the generative process as follows:
Generate a random noise image of size \(100 \times 100\) pixels, and call it \(I_0\)
Initialize the step \(t = 0\) — we already have \(I_t\) for \(t = 0\) from step 1 (technical readers see note 2 at the bottom)
Add a small Gaussian perturbation with parameters \((\mathbf{\mu_t} , \mathbf{\Sigma_t})\) to \(I_t\)
While \(t < T\) (maximum number of steps), increment the step, i.e. \(t \leftarrow t + 1\), and repeat from step 3
Return the final \(I_t\)
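The loop above can be sketched as follows. Please read this as a toy and not as how a real diffusion model computes things: the hypothetical `toy_mean` function cheats by looking at a fixed target image, whereas a real model predicts the perturbation parameters with a trained neural network and has no target to look at. The image is a flat list of pixel values, and `noise_scale` is a number I made up.

```python
import random

def toy_mean(image, target, t, T):
    """Hypothetical stand-in for the learned mean of the perturbation at step t.

    It nudges every pixel a fraction of the remaining way toward a fixed target.
    """
    remaining = T - t
    return [(goal - value) / remaining for value, goal in zip(image, target)]

def generate(target, T, rng, noise_scale=0.1):
    image = [rng.gauss(0.0, 1.0) for _ in target]  # step 1: pure noise image I_0
    t = 0                                          # step 2
    while t < T:
        mean = toy_mean(image, target, t, T)
        spread = noise_scale * (T - 1 - t) / T     # random part fades out by the last step
        # step 3: add a small Gaussian perturbation to I_t
        image = [v + m + rng.gauss(0.0, spread) for v, m in zip(image, mean)]
        t += 1                                     # step 4: next step
    return image                                   # step 5: the final image

target = [0.0, 1.0] * 8  # a made-up 4 x 4 "image", flattened
result = generate(target, T=1000, rng=random.Random(0))
print([round(v, 3) for v in result])
```

What carries over to the real thing is the structure: start from noise, apply many small Gaussian perturbations, return whatever is left at the end.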
If we ignore the random Gaussian part of the steps, the basic construct is that we start with a pure noise image, and then add small parts successively to it. This is continued for a number of steps, which results in the pure noise image transforming into a natural image. If we think about it, the process is not very different from what we discussed earlier with the simple sky and sea, or the sky, grass and house image distributions. We started with an uninformative image (blank in those cases) and added parts to it to make the final image. While what we added there made semantic sense, what diffusion models learn is rather abstract or uninterpretable semantically. Below is an example which shows the process on an example face image. The starting image is noise, while the successive steps add corrective perturbations, finally leading to the intended image.
The typical number of steps for good image generation, as reported in the original paper (Sohl-Dickstein et al. 2015), is in the thousands, i.e. \(T\) is on the order of \(1000\), which means the addition of concepts or corrective perturbations to form the final image, along with being abstract, is really slow as well.
The job here is now to learn the parameters of the small perturbations, so that the generation process can be executed. Apologies, as I am completely skipping how such models are learned, and only discussing the generation part to give an intuitive understanding to a non/mildly-technical reader. I do hope to write a follow up article with the technical details.
Vanilla Diffusion Models, as applied to images, are inefficient, as the diffusion process, i.e. the successive small perturbations, is computed and applied at the size of the original image. I.e. even for a small \(256 \times 256\) color image we have vectors with about \(200,000\) entries (\(256 \times 256\) pixels with \(3\) color values each), and these need to be computed thousands of times.
To alleviate this inefficiency, the Latent Diffusion method works in a smaller space compared to the space of all pixels, i.e. a latent space for images. Think of it as a form of compression: you must have noticed that when you send an image via Gmail or WhatsApp etc. the quality of the image is reduced. The images are compressed to be sent efficiently over the network, and then the compressed data is decoded back into an image, which is of slightly lower quality. Working in a latent space reduces the size of the image data and makes performing diffusion more efficient. More critically, it also allows the model to be trained with a larger amount of data for a longer time, resulting in a better quality model.
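A quick back-of-the-envelope sketch of the saving is below. The latent size used here is an assumption on my part: I take a latent that is \(8\) times smaller along each side and has \(4\) channels, which is the configuration commonly reported for Stable Diffusion; other latent diffusion setups may differ.

```python
def num_values(height, width, channels):
    """How many numbers are needed to describe an image (or latent) of this shape."""
    return height * width * channels

small_pixels = num_values(256, 256, 3)     # a small color image in pixel space
pixels = num_values(512, 512, 3)           # a 512 x 512 color image in pixel space
latent = num_values(512 // 8, 512 // 8, 4) # assumed latent: 64 x 64 with 4 channels

print(small_pixels, pixels, latent, pixels / latent)
```

Under these assumptions every diffusion step works on roughly fifty times fewer numbers than it would in pixel space.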
Till now whatever we discussed pertained to learning a generative model and generating new images from it. The generated images, while natural, would be completely arbitrary. One sample might come out to be a beach image, while the next an indoor kitchen image.
So the final question is how do you make the system generate images which match a text prompt, provided by the user as input.
Mathematically such models are called conditional models. You are fixing a condition, and asking the system to proceed assuming the condition holds. Hence such models are called Conditional Latent Diffusion Models.
At a high level, one of the ways in which conditioning is achieved is by adding the text representation to the noise vector and then running the generation process starting from that joint vector.
Explaining what a text representation is would require a full article, but intuitively you can imagine giving each word a serial number depending on where it appears in a very comprehensive dictionary. A sentence/prompt representation then could be obtained by multiplying those numbers for each word it contains. This is a hypothetical example to demonstrate how text might be converted to a mathematical number/vector; the actual process employed to do so is more complex than this, and involves a learning algorithm in itself.
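Just to make the hypothetical scheme tangible, here it is in code with a tiny six-word "dictionary" that I made up. To repeat, this is not how real systems represent text; it only shows text being turned into a number.

```python
import math

# A tiny stand-in for "a very comprehensive dictionary", in alphabetical order.
dictionary = ["beach", "palm", "sand", "sunset", "trees", "waves"]
serial_number = {word: i + 1 for i, word in enumerate(dictionary)}

def toy_representation(prompt):
    """Multiply the serial numbers of the words in the prompt."""
    return math.prod(serial_number[word] for word in prompt.lower().split())

print(toy_representation("beach palm trees sunset sand waves"))  # 720
print(toy_representation("waves sand sunset trees palm beach"))  # 720 again
```

The second line hints at why real representations need to be more complex: this toy cannot even tell apart two prompts that use the same words in a different order.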
Coming back to text prompt based generation of image; once the text representation is provided in addition to the noise vector in image latent space, the generation process outputs images which best match the prompt, vs. arbitrary images in the case of unconditioned generation.
To make the system understand which kind of sentences match what kind of images, example pairs of matching (text, image) are given while training. Such pairs obviously do not cover all possible combination of prompts and images, but are sufficiently large to let the system learn general aspects of the relation between text prompts and images.
The last part of the story is Stable Diffusion.
To train a powerful image generation network, you need three things:
A powerful model and an efficient algorithm to learn it,
Lots of data to learn it with, and
Lots of compute power, i.e. GPU servers, to process all that data via the learning algorithm to obtain the final learned model.
With the development of Latent Diffusion Model, the first condition was met.
The second was met due to the efforts of a non-profit organization with members all over the world! The amazing Large-scale Artificial Intelligence Open Network (LAION) was formed specifically to democratize data availability for large scale machine learning models.
With the rapid rate of technical and algorithmic development, the large and highly performant models could only be trained with data which was proprietary to big corporations like Google, Facebook and Microsoft. This was severely gating and limiting access to such AI technologies for small businesses and individuals. To break their monopolies, a group of people came together and formed LAION, sustained by donations and public research grants. They collected large datasets of text, image, 3D assets, audio and video by scraping the internet. With one such large dataset collected by them, training of a powerful Latent Diffusion Model became possible.
The third and final piece was computational resources. For that a company, Stability AI, and its collaborators came forward and offered resources, compute and collaborators to train the largest open source Diffusion Model available to anyone.
The official announcement article from Stability.ai states:
“This [the Stable Diffusion release] has been led by Patrick Esser from Runway and Robin Rombach from the Machine Vision & Learning research group at LMU Munich (formerly CompVis lab at Heidelberg University) building on their prior work on Latent Diffusion Models at CVPR’22, combined with support from communities at Eleuther AI, LAION and our own generative AI team.”
This large text conditioned Latent Diffusion Model, trained on one of the large LAION datasets, is called Stable Diffusion, which we hear about all over the internet now!
Hopefully I have been able to give a high-level intuitive explanation of how text conditioned Latent Diffusion Models work, and what Stable Diffusion is. The explanation is really the tip of the iceberg, with a lot of details omitted or glossed over to keep it accessible to a wide audience.
If you have not noticed already, CloudStudio, a browser based advanced media editor, allows you to generate images using text with Stable Diffusion. Go try it out!
We are a team of researchers and engineers working in Computer Vision and Machine Learning. We are excited to be working on CloudStudio to make advanced audio visual editing easy and accessible to everyone.
If you have any comments, or suggestions for improving the article, or spot any factual or technical error, kindly let me know. My email is gaurav@tensortour.com — CloudStudio is made by TensorTour, Inc.
Happy image generation!
Each pixel is made up of Red, Green and Blue (RGB) values, with each represented as an integer between 0 and 255 representing their relative intensities. The final color is the blend of the three primary colors with the respective intensities. This makes the total number of colors possible as \(256 \times 256 \times 256\) which is \(16.7\) million colors.
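For the curious, the count can be checked in a couple of lines. The `pack_rgb` helper is a hypothetical illustration of how the three intensities identify exactly one of those colors.

```python
def num_colors(levels_per_channel=256, channels=3):
    """Every channel independently takes one of the available intensity levels."""
    return levels_per_channel ** channels

def pack_rgb(red, green, blue):
    """Map an (R, G, B) triple to a single color index between 0 and 256**3 - 1."""
    return (red * 256 + green) * 256 + blue

print(num_colors())             # 16777216
print(pack_rgb(255, 255, 255))  # 16777215, the last color index (white)
```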
To technical readers here, I have used steps going from \(0\) to \(T-1\), however in the academic papers the generation part is usually a reverse diffusion process, so the corresponding steps are in decreasing order from \(T-1\) to \(0\).