Text-to-image diffusion models are a class of deep generative models that have demonstrated an impressive capacity for high-quality image generation. However, these models are susceptible to implicit biases that arise from web-scale text-image training pairs and may inaccurately model aspects of images we care about. This can result in suboptimal samples, model bias, and images that do not align with human ethics and preferences. In this paper, we present an effective, scalable algorithm for improving diffusion models with Reinforcement Learning (RL) across a diverse set of reward functions, such as human preference, compositionality, and fairness, over millions of images. We show that our approach substantially outperforms existing methods for aligning diffusion models with human preferences. We further show how it improves pretrained Stable Diffusion (SD) models, generating samples that humans prefer 80.3% of the time over those from the base SD model, while simultaneously improving both the composition and diversity of generated samples.
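To make the RL objective concrete, the toy sketch below applies a REINFORCE-style policy-gradient update over a short "denoising" trajectory: sample a trajectory, score the final sample with a reward function, and increase the log-probability of trajectories with above-baseline reward. It is purely illustrative; the Gaussian policy and `toy_reward` are stand-ins, not the paper's distributed implementation or its actual reward models.

```python
# Toy REINFORCE-style sketch of reward fine-tuning over a denoising-like
# trajectory. Illustrative stand-in only: the "policy" and "reward" are dummies.
import torch

T = 10                                               # denoising steps in the toy trajectory
policy_mean = torch.zeros(T, requires_grad=True)     # learnable per-step means
optimizer = torch.optim.Adam([policy_mean], lr=1e-2)

def toy_reward(x_final: torch.Tensor) -> torch.Tensor:
    # Dummy reward: prefer final samples close to 1.0. In practice this would
    # be a learned preference, fairness, or compositionality reward.
    return -(x_final - 1.0).pow(2)

baseline = 0.0
for step in range(200):
    dist = torch.distributions.Normal(policy_mean, torch.ones(T))
    actions = dist.sample()                          # one "trajectory" of T steps
    log_prob = dist.log_prob(actions).sum()          # log-probability of the trajectory
    reward = toy_reward(actions[-1]).detach()

    # REINFORCE with a running baseline to reduce variance.
    advantage = reward - baseline
    loss = -advantage * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    baseline = 0.9 * baseline + 0.1 * reward.item()
```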
We fine-tune Stable Diffusion 2 with ImageReward, a reward model trained on human assessments of text-image pairs, to optimize for human preferences. We show results from our fine-tuned model on real-user prompts here and provide a qualitative comparison with other reward-optimization methods below. By evaluating on both the in-domain DiffusionDB dataset and the out-of-domain PartiPrompts test set, we demonstrate that our fine-tuned model generates more visually appealing images than the base SDv2 model and generalizes well to unseen text prompts.
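As a minimal sketch of the reward side of this setup (assuming the public `diffusers` and `image-reward` packages and their `StableDiffusionPipeline` / `RM.load` interfaces), one can generate an image with SDv2 and score it with ImageReward; the RL fine-tuning loop itself is omitted:

```python
# Minimal sketch: generate an image with Stable Diffusion 2 and score it with
# ImageReward. Shows only reward computation, not the RL fine-tuning loop.
import torch
from diffusers import StableDiffusionPipeline
import ImageReward as RM

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
reward_model = RM.load("ImageReward-v1.0")

prompt = "a cozy cabin in a snowy forest at dusk, highly detailed"
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("sample.png")

# ImageReward returns a scalar preference score; higher means the image is
# predicted to be preferred by humans for this prompt.
score = reward_model.score(prompt, "sample.png")
print("ImageReward score:", score)
```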
All images below are generated with the same random seeds. Our outputs are better aligned with human aesthetic preferences, favoring finer details, focused composition, vivid colors, and high contrast.
The training of diffusion models is highly data-driven, relying on billion-scale datasets scraped from the internet. As a result, trained models can absorb significant social biases and stereotypes. For example, text-to-image diffusion models have been observed to disproportionately generate humans with lighter skintones. We aim to mitigate this bias by explicitly guiding the model with a skintone-diversity reward: during training, we use a reward function that encourages a uniform distribution over skintones. We observe that our fine-tuned model greatly reduces the skintone bias embedded in the pretrained SDv2 model, especially for occupations that carry stronger social stereotypes or biases in the pretraining dataset.
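One way to express such a reward (a sketch only; the classifier below is a hypothetical placeholder for whatever skintone predictor is actually used) is to estimate the skintone distribution over a batch of generated images and penalize its KL divergence from the uniform distribution:

```python
# Sketch of a batch-level skintone-diversity reward: estimate the distribution
# of predicted skintone classes over a batch of generated images and reward
# closeness to the uniform distribution. The classifier is a placeholder.
import torch
import torch.nn.functional as F

NUM_SKINTONE_CLASSES = 4  # e.g. buckets of a skintone scale; illustrative only

class PlaceholderSkintoneClassifier(torch.nn.Module):
    """Stand-in for a real skintone classifier over generated images."""
    def __init__(self, num_classes: int = NUM_SKINTONE_CLASSES):
        super().__init__()
        self.head = torch.nn.Linear(512, num_classes)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.head(image_features).softmax(dim=-1)  # (B, num_classes)

def diversity_reward(class_probs: torch.Tensor) -> torch.Tensor:
    """Higher when the batch-level skintone distribution is close to uniform."""
    batch_dist = class_probs.mean(dim=0)                      # (num_classes,)
    uniform = torch.full_like(batch_dist, 1.0 / batch_dist.numel())
    # KL(batch_dist || uniform); zero exactly when the batch is uniform.
    kl = F.kl_div(uniform.log(), batch_dist, reduction="sum")
    return -kl

classifier = PlaceholderSkintoneClassifier()
features = torch.randn(32, 512)            # placeholder image features
reward = diversity_reward(classifier(features))
print("diversity reward:", reward.item())
```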
Because diffusion models often fail to accurately render different compositions of objects in a scene, we further explored using our RL framework to improve compositionality. We use relationship terms such as “and,” “next to,” “near,” “on side of,” and “beside” to produce captions that designate a spatial relationship between two objects, and we use the generated captions to train and evaluate the model’s compositional skill.
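A simple sketch of this caption construction is shown below; the relation terms are the ones listed above, while the object vocabulary is purely illustrative:

```python
# Sketch of compositional caption construction: pair objects with the spatial
# relation terms listed above. The object vocabulary here is illustrative.
import itertools
import random

RELATIONS = ["and", "next to", "near", "on side of", "beside"]
OBJECTS = ["a dog", "a bicycle", "a red vase", "a wooden chair"]  # example objects

def compositional_captions():
    """Yield captions of the form '<object A> <relation> <object B>'."""
    for obj_a, obj_b in itertools.permutations(OBJECTS, 2):
        for relation in RELATIONS:
            yield f"{obj_a} {relation} {obj_b}"

captions = list(compositional_captions())
print(len(captions), "captions, e.g.:", random.choice(captions))
```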
@misc{zhang2024largescale,
  title={Large-scale Reinforcement Learning for Diffusion Models},
  author={Yinan Zhang and Eric Tzeng and Yilun Du and Dmitry Kislyuk},
  year={2024},
  eprint={2401.12244},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}