Releases: huggingface/diffusers
v0.19.0: SD-XL 1.0 (permissive license), AutoPipelines, Improved Kandinsky & Asymmetric VQGAN, T2I Adapter
SDXL 1.0
Stable Diffusion XL (SDXL) 1.0, with the permissive CreativeML Open RAIL++-M License, was released today. We provide full compatibility with SDXL in diffusers.
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]
image
Many additional cool features are released:
- Pipelines for Img2Img and Inpainting
- Torch compile support
- Model offloading
- Ensemble of Expert Denoisers (eDiff-I approach; sketched below) - thanks to @bghira @SytanSD @Birch-san @AmericanPresidentJimmyCarter
Refer to the documentation to know more.
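The base and refiner models can also be combined as an ensemble of expert denoisers, where the base model handles the first part of the denoising trajectory and the refiner finishes it. The sketch below follows the documented pattern; the 0.8 split point is just an illustrative value:
from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
base.to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
refiner.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# the base model denoises the first 80% of the timesteps and hands over latents
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
# the refiner completes the remaining 20% of the denoising steps
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]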
New training scripts for SDXL
When there’s a new pipeline, there ought to be new training scripts. We added support for training scripts that build on top of SDXL, including ControlNet and InstructPix2Pix.
Shoutout to @harutatsuakiyama for contributing the training script for InstructPix2Pix in #4079.
New pipelines for SDXL
The ControlNet and InstructPix2Pix training scripts also needed their respective pipelines. So, we added support for the following pipelines as well:
StableDiffusionXLControlNetPipeline
StableDiffusionXLInstructPix2PixPipeline
The ControlNet and InstructPix2Pix pipelines don’t have interesting checkpoints yet. We hope that the community will be able to leverage the training scripts from this release to help produce some.
Shoutout to @harutatsuakiyama for contributing the StableDiffusionXLInstructPix2PixPipeline in #4079.
The AutoPipeline API
We now support AutoPipeline APIs for the following tasks: text-to-image, image-to-image, and inpainting.
Here is how to use one:
from diffusers import AutoPipelineForText2Image
import torch
pipe_t2i = AutoPipelineForText2Image.from_pretrained(
"runwayml/stable-diffusion-v1-5", requires_safety_checker=False, torch_dtype=torch.float16
).to("cuda")
prompt = "photo a majestic sunrise in the mountains, best quality, 4k"
image = pipe_t2i(prompt).images[0]
image.save("image.png")
Without using any extra memory, you can then switch to image-to-image:
from diffusers import AutoPipelineForImage2Image
pipe_i2i = AutoPipelineForImage2Image.from_pipe(pipe_t2i)
image = pipe_i2i("sunrise in snowy mountains", image=image, strength=0.75).images[0]
image.save("image.png")
Supported Pipelines: SDv1, SDv2, SDXL, Kandinsky, ControlNet, IF ... with more to come.
Refer to the documentation to know more.
A new “combined pipeline” for the Kandinsky series
We introduced a new “combined pipeline” for the Kandinsky series to make it easier to use the Kandinsky prior and decoder together. This eliminates the need to initialize and use multiple pipelines for Kandinsky to generate images. Here is an example:
from diffusers import AutoPipelineForText2Image
import torch
pipe = AutoPipelineForText2Image.from_pretrained(
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipe(prompt=prompt, num_inference_steps=25).images[0]
image.save("image.png")
The following pipelines, which can be accessed via the "Auto" pipelines, were added:
- KandinskyCombinedPipeline
- KandinskyImg2ImgCombinedPipeline
- KandinskyInpaintCombinedPipeline
- KandinskyV22CombinedPipeline
- KandinskyV22Img2ImgCombinedPipeline
- KandinskyV22InpaintCombinedPipeline
To know more, check out the documentation.
🚨🚨🚨 Breaking change for Kandinsky Mask Inpainting 🚨🚨🚨
NOW: mask_image repaints white pixels and preserves black pixels.
Kandinsky was previously using an incorrect mask format. Instead of using white pixels as the mask (as SD & IF do), the Kandinsky models were using black pixels. This has been corrected so that the diffusers API stays aligned: we cannot have different mask formats for different pipelines.
Important => This means that everyone who already uses Kandinsky inpainting in production / pipelines now needs to invert the mask:
# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)
# For PyTorch and Numpy input
mask = 1 - mask
Asymmetric VQGAN
Designing a Better Asymmetric VQGAN for Stable Diffusion introduced a VQGAN that is particularly well-suited for inpainting tasks. This release adds support for this new VQGAN. Here is how it can be used:
from io import BytesIO
from PIL import Image
import requests
from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline
def download_image(url: str) -> Image.Image:
response = requests.get(url)
return Image.open(BytesIO(response.content)).convert("RGB")
prompt = "a photo of a person"
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
image = download_image(img_url).resize((256, 256))
mask_image = download_image(mask_url).resize((256, 256))
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
pipe.to("cuda")
image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
image.save("image.jpeg")
Refer to the documentation to know more.
Thanks to @cross-attention for contributing this model in #3956.
Improved support for loading Kohya-style LoRA checkpoints
We are committed to providing seamless interoperability support of Kohya-trained checkpoints from diffusers
. To that end, we improved the existing support for loading Kohya-trained checkpoints in diffusers
. Users can expect further improvements in the upcoming releases.
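As a quick illustration, a Kohya-style LoRA file loads through the same load_lora_weights() API shown later in these notes. This is a minimal sketch; the LoRA filename below is a hypothetical local file:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# hypothetical Kohya-trained LoRA saved locally in the safetensors format
pipe.load_lora_weights(".", weight_name="kohya_style_lora.safetensors")
image = pipe("masterpiece, best quality, mountain landscape", num_inference_steps=25).images[0]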
Thanks to @takuma104 and @isidentical for contributing the improvements in #4147.
T2I Adapter
T2I-Adapters are lightweight, plug-and-play modules that add extra image conditioning (for example depth maps or sketches) to Stable Diffusion, similar in spirit to ControlNet but with far fewer parameters.
pip install matplotlib
from PIL import Image
import torch
import numpy as np
import matplotlib
from diffusers import T2IAdapter, StableDiffusionAdapterPipeline
def colorize(value, vmin=None, vmax=None, cmap='gray_r', invalid_val=-99, invalid_mask=None, background_color=(128, 128, 128, 255), gamma_corrected=False, value_transform=None):
"""Converts a depth map to a color image.
Args:
value (torch.Tensor, numpy.ndarry): Input depth map. Shape: (H, W) or (1, H, W) or (1, 1, H, W). All singular dimensions are squeezed
vmin (float, optional): vmin-valued entries are mapped to start color of cmap. If None, value.min() is used. Defaults to None.
vmax (float, optional): vmax-valued entries are mapped to end color of cmap. If None, value.max() is used. Defaults to None.
cmap (str, optional): matplotlib colormap to use. Defaults to 'magma_r'.
invalid_val (int, optional): Specifies value of invalid pixels that should be colored as 'background_color'. Defaults to -99.
invalid_mask (numpy.ndarray, optional): Boolean mask for invalid regions. Defaults to None.
background_color (tuple[int], optional): 4-tuple RGB color to give to invalid pixels. Defaults to (128, 128, 128, 255).
gamma_corrected (bool, optional): Apply gamma correction to colored image. Defaults to False.
value_transform (Callable, optional): Apply transform funct...
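The depth-based example above is cut off. As a simpler, self-contained sketch of the adapter API (assuming the TencentARC/t2iadapter_canny_sd15v2 checkpoint is available), a canny-conditioned generation looks roughly like this:
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import T2IAdapter, StableDiffusionAdapterPipeline
from diffusers.utils import load_image

# build a single-channel canny control image from an example photo
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
canny = cv2.Canny(np.array(image), 100, 200)
canny_image = Image.fromarray(canny)

adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe("a stylized portrait painting", image=canny_image, num_inference_steps=30).images[0]
image.save("t2i_adapter_canny.png")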
Patch Release: v0.18.2
Patch release to fix:
- torch.compile for SD-XL for certain GPUs
- from_single_file for all SD models
- Fix broken ONNX export
- Fix incorrect VAE FP16 casting
- Deprecate loading variants that don't exist
Note: Loading any Stable Diffusion safetensors or ckpt file with StableDiffusionPipeline.from_single_file, StableDiffusionImg2ImgPipeline.from_single_file, StableDiffusionInpaintPipeline.from_single_file, StableDiffusionXLPipeline.from_single_file, ... is now almost as fast as from_pretrained(...), and it's much better tested now.
All commits:
- Make sure torch compile doesn't access unet config by @patrickvonplaten in #4008
- [DiffusionPipeline] Deprecate not throwing error when loading non-existant variant by @patrickvonplaten in #4011
- Correctly keep vae in `float16` when using PyTorch 2 or xFormers by @pcuenca in #4019
- minor improvements to the SDXL doc. by @sayakpaul in #3985
- Remove remaining `not` in upscale pipeline by @pcuenca in #4020
- FIX `force_download` in download utility by @Wauplin in #4036
- Improve single loading file by @patrickvonplaten in #4041
- keep _use_default_values as a list type by @oOraph in #4040
Patch Release for Stable Diffusion XL 0.9
Patch release 0.18.1: Stable Diffusion XL 0.9 Research Release
Stable Diffusion XL 0.9 is now fully supported under the SDXL 0.9 Research License, available here.
Having received access to stabilityai/stable-diffusion-xl-base-0.9, you can easily use it with diffusers:
Text-to-Image
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]
Refining the image output
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
)
refiner.to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt, output_type="latent").images[0]  # return latents so the refiner can consume them
image = refiner(prompt=prompt, image=image[None, :]).images[0]
Loading single file checkpoints / original file format
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch
pipe = StableDiffusionXLPipeline.from_single_file(
    "./sd_xl_base_0.9.safetensors", torch_dtype=torch.float16  # local path to the original single-file checkpoint
)
pipe.to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
    "./sd_xl_refiner_0.9.safetensors", torch_dtype=torch.float16
)
refiner.to("cuda")
Memory optimization via model offloading
- pipe.to("cuda")
+ pipe.enable_model_cpu_offload()
and
- refiner.to("cuda")
+ refiner.enable_model_cpu_offload()
Speed-up inference with torch.compile
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
Note: If you're running the model with torch < 2.0, please make sure to run:
+ pipe.enable_xformers_memory_efficient_attention()
+ refiner.enable_xformers_memory_efficient_attention()
For more details have a look at the official docs.
All commits
- typo in safetensors (safetenstors) by @YoraiLevi in #3976
- Fix code snippet for Audio Diffusion by @osanseviero in #3987
- feat: add `Dropout` to Flax UNet by @SauravMaheshkar in #3894
- Add 'rank' parameter to Dreambooth LoRA training script by @isidentical in #3945
- Don't use bare prints in a library by @cmd410 in #3991
- [Tests] Fix some slow tests by @patrickvonplaten in #3989
- Add sdxl prompt embeddings by @patrickvonplaten in #3995
Shap-E, Consistency Models, Video2Video
Shap-E
Shap-E is a 3D image generation model from OpenAI introduced in Shap-E: Generating Conditional 3D Implicit Functions.
We provide support for text-to-3D generation and image-to-3D generation in Diffusers.
Text to 3D
import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_gif
ckpt_id = "openai/shap-e"
pipe = ShapEPipeline.from_pretrained(ckpt_id).to("cuda")
guidance_scale = 15.0
prompt = "A birthday cupcake"
images = pipe(
prompt,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
gif_path = export_to_gif(images[0], "cake_3d.gif")
Image to 3D
import torch
from diffusers import ShapEImg2ImgPipeline
from diffusers.utils import export_to_gif, load_image
ckpt_id = "openai/shap-e-img2img"
pipe = ShapEImg2ImgPipeline.from_pretrained(ckpt_id).to("cuda")
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png"
image = load_image(img_url)
generator = torch.Generator(device="cuda").manual_seed(0)
batch_size = 4
guidance_scale = 3.0
images = pipe(
image,
num_images_per_prompt=batch_size,
generator=generator,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
output_type="pil"
).images
gif_path = export_to_gif(images[0], "burger_sampled_3d.gif")
For more details, check out the official documentation.
The model was contributed by @yiyixuxu in #3742.
Consistency models
Consistency models are diffusion-related generative models that support fast one-step or few-step image generation. They were proposed by OpenAI in Consistency Models.
import torch
from diffusers import ConsistencyModelPipeline
device = "cuda"
# Load the cd_imagenet64_l2 checkpoint.
model_id_or_path = "openai/diffusers-cd_imagenet64_l2"
pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
# Onestep Sampling
image = pipe(num_inference_steps=1).images[0]
image.save("consistency_model_onestep_sample.png")
# Onestep sampling, class-conditional image generation
# ImageNet-64 class label 145 corresponds to king penguins
image = pipe(num_inference_steps=1, class_labels=145).images[0]
image.save("consistency_model_onestep_sample_penguin.png")
# Multistep sampling, class-conditional image generation
# Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo.
# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L77
image = pipe(timesteps=[22, 0], class_labels=145).images[0]
image.save("consistency_model_multistep_sample_penguin.png")
For more details, see the official docs.
The model was contributed by our community members @dg845 and @ayushtues in #3492.
Video-to-Video
Previous video generation pipelines tended to produce watermarks because those watermarks were present in their pretraining dataset. With the latest checkpoint additions, such as the Zeroscope checkpoints used below, we can now generate watermark-free videos:
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()
prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames
video_path = export_to_video(video_frames)
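The frames can then be upscaled in a second, video-to-video pass. The following sketch assumes the cerspense/zeroscope_v2_XL checkpoint and uses the VideoToVideoSDPipeline added in this release:
from PIL import Image
from diffusers import VideoToVideoSDPipeline, DPMSolverMultistepScheduler

pipe_xl = VideoToVideoSDPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
pipe_xl.scheduler = DPMSolverMultistepScheduler.from_config(pipe_xl.scheduler.config)
pipe_xl.enable_model_cpu_offload()
pipe_xl.enable_vae_slicing()

# upscale the frames generated above
video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]
video_frames = pipe_xl(prompt, video=video, strength=0.6).frames
video_path = export_to_video(video_frames)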
For more details, check out the official docs.
It was contributed by @patrickvonplaten in #3900.
All commits
- remove seed by @yiyixuxu in #3734
- Correct Token to upload docs by @patrickvonplaten in #3744
- Correct another push token by @patrickvonplaten in #3745
- [Stable Diffusion Inpaint & ControlNet inpaint] Correct timestep inpaint by @patrickvonplaten in #3749
- [Documentation] Replace dead link to Flax install guide by @JeLuF in #3739
- [documentation] grammatical fixes in installation.mdx by @LiamSwayne in #3735
- Text2video zero refinements by @19and99 in #3733
- [Tests] Relax tolerance of flaky failing test by @patrickvonplaten in #3755
- [MultiControlNet] Allow save and load by @patrickvonplaten in #3747
- Update pipeline_flax_stable_diffusion_controlnet.py by @jfozard in #3306
- update conversion script for Kandinsky unet by @yiyixuxu in #3766
- [docs] Fix Colab notebook cells by @stevhliu in #3777
- [Bug Report template] modify the issue template to include core maintainers. by @sayakpaul in #3785
- [Enhance] Update reference by @okotaku in #3723
- Fix broken cpu-offloading in legacy inpainting SD pipeline by @cmdr2 in #3773
- Fix some bad comment in training scripts by @patrickvonplaten in #3798
- Added LoRA loading to `StableDiffusionKDiffusionPipeline` by @tripathiarpan20 in #3751
- UnCLIP Image Interpolation -> Keep same initial noise across interpolation steps by @Abhinay1997 in #3782
- feat: add PR template. by @sayakpaul in #3786
- Ldm3d first PR by @estelleafl in #3668
- Complete set_attn_processor for prior and vae by @patrickvonplaten in #3796
- fix typo by @Isotr0py in #3800
- manual check for checkpoints_total_limit instead of using accelerate by @williamberman in #3681
- [train text to image] add note to loading from checkpoint by @williamberman in #3806
- device map legacy attention block weight conversion by @williamberman in #3804
- [docs] Zero SNR by @stevhliu in #3776
- [ldm3d] Fixed small typo by @estelleafl in #3820
- [Examples] Improve the model card pushed from the `train_text_to_image.py` script by @sayakpaul in #3810
- [Docs] add missing pipelines from the overview pages and minor fixes by @sayakpaul in #3795
- [Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models by @AndyShih12 in #3716
- Update control_brightness.mdx by @dqueue in #3825
- Support ControlNet models with different number of channels in control images by @JCBrouwer in #3815
- Add ddpm kandinsky by @yiyixuxu in #3783
- [docs] More API stuff by @stevhliu in #3835
- relax tol attention conversion test by @williamberman in #3842
- fix: random module seeding by @sayakpaul in #3846
- fix audio_diffusion tests by @teticio in #3850
- Correct bad attn naming by @patrickvonplaten in #3797
- [Conversion] Small fixes by @patrickvonplaten in #3848
- Fix some audio tests by @patrickvonplaten in #3841
- [Docs] add: contributor note in the paradigms docs. by @sayakpaul in #3852
- Update Habana Gaudi doc by @regisss in #3863
- Add guidance start/stop by @holwech in #3770
- feat: rename single-letter vars in `resnet.py` by @SauravMaheshkar in #3868
- Fixing the global_step key not found by @VincentNeemie in #3844
- Support for manual CLIP loading in StableDiffusionPipeline - txt2img. by @WadRex in #3832
- fix sde add noise typo by @UranusITS in #3839
- [Tests] add test for checking soft dependencies. by @sayakpaul in #3847
- [Enhance] Add LoRA rank args in train_text_to_image_lora by @okotaku in #3866
- [docs] Model API by @stevhliu in #3562
- fix/docs: Fix the broken doc links by @Aisuko in #3897
- Add video img2img by @patrickvonplaten in #3900
- fix/doc-code: Updating to the latest version parameters by @Aisuko in #3924
- fix/doc: no import torch issue by @Aisuko in #3923
- Correct controlnet out of list error by @patrickvonplaten in #3928
- Adding better way to define multiple concepts and also validation capabilities. by @mauricio-repetto in #3807
- [ldm3d] Update code to be functional with the new checkpoints by @estelleafl in #3875
- Improve memory text to video by @patrickvonplaten in #3930
- revert automatic chunking by @patrickvonplaten in #3934
- avoid upcasting by assigning dtype to noise tensor by @prathikr in #3713
- Fix failing np tests by @patrickvonplaten in #3942
- Add `timestep_spacing` and `steps_offset` to schedulers by @pcuenca in #3947
- Add Consistency Models Pipeline by @dg845 in #3492
- Update consistency_models.mdx by @sayakpaul in #3961
- Make `UNet2DConditionOutput` pickle-able by @prathikr in #3857
- [Consistency Models] correct checkpoint url in the doc by @sayakpaul in #3962
- [Text-to-video] Add `torch.compile()` compatibility by @sayakpaul in #3949
- [SD-XL] Add new pipelines by @patrickvonplaten in #3859
- Kandinsky 2.2 by @cene555 in #3903
- Add Shap-E by @yiyixuxu in #3742
- disable num attenion heads by @patrickvonplaten in #3969
- Improve SD XL by @patrickvonplaten in #3968
- fix/doc-code: import torch and fix the broken document address by @Aisuko in #3941
Significant community contributions
The follo...
Patch Release: v0.17.1
Patch release to fix timestep for inpainting
- Stable Diffusion Inpaint & ControlNet inpaint - Correct timestep inpaint in #3749 by @patrickvonplaten
v0.17.0 Improved LoRA, Kandinsky 2.1, Torch Compile Speed-up & More
Kandinsky 2.1
Kandinsky 2.1 inherits best practices from DALL-E 2 and Latent Diffusion while introducing some new ideas.
Installation
pip install diffusers transformers accelerate
Code example
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
generator = torch.Generator(device="cuda").manual_seed(12)
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt, guidance_scale=1.0, generator=generator).to_tuple()
image = t2i_pipe(prompt, negative_prompt=negative_prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds).images[0]
image.save("cheeseburger_monster.png")
To learn more about the Kandinsky pipelines, and more details about speed and memory optimizations, please have a look at the docs.
Thanks @ayushtues for helping with the integration of Kandinsky 2.1!
UniDiffuser
UniDiffuser introduces a multimodal diffusion process that is capable of handling different generation tasks using a single unified approach:
- Unconditional image and text generation
- Joint image-text generation
- Text-to-image generation
- Image-to-text generation
- Image variation
- Text variation
Below is an example of how to use UniDiffuser for text-to-image generation:
import torch
from diffusers import UniDiffuserPipeline
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to("cuda")
# This mode can be inferred from the input provided to the `pipe`.
pipe.set_text_to_image_mode()
prompt = "an elephant under the sea"
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0).images[0]
sample.save("elephant.png")
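The other modes follow the same pattern. For example, image-to-text generation (captioning) can be sketched as follows, reusing the pipe from above; the example image URL is an assumption and can be swapped for any reachable image:
from diffusers.utils import load_image

# switch the pipeline to image-to-text mode and caption an image
pipe.set_image_to_text_mode()
image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
)
sample = pipe(image=image, num_inference_steps=20, guidance_scale=8.0)
print(sample.text[0])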
Check out the UniDiffuser docs to know more.
UniDiffuser was added by @dg845 in this PR.
LoRA
We're happy to support the A1111 formatted CivitAI LoRA checkpoints in a limited capacity.
First, download a checkpoint. We’ll use this one for demonstration purposes.
wget https://civitai.com/api/download/models/15603 -O light_and_shadow.safetensors
Next, we initialize a DiffusionPipeline:
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
pipeline = StableDiffusionPipeline.from_pretrained(
"gsdf/Counterfeit-V2.5", torch_dtype=torch.float16, safety_checker=None
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
pipeline.scheduler.config, use_karras_sigmas=True
)
We then load the checkpoint downloaded from CivitAI:
pipeline.load_lora_weights(".", weight_name="light_and_shadow.safetensors")
(If you’re loading a checkpoint in the safetensors format, please ensure you have safetensors installed.)
And then it’s time for running inference:
prompt = "masterpiece, best quality, 1girl, at dusk"
negative_prompt = ("(low quality, worst quality:1.4), (bad anatomy), (inaccurate limb:1.2), "
"bad composition, inaccurate eyes, extra digit, fewer digits, (extra arms:1.2), large breasts")
images = pipeline(prompt=prompt,
negative_prompt=negative_prompt,
width=512,
height=768,
num_inference_steps=15,
num_images_per_prompt=4,
generator=torch.manual_seed(0)
).images
Below is a comparison between the LoRA and the non-LoRA results:
Check out the docs to learn more.
Thanks to @takuma104 for contributing this feature via this PR.
Torch 2.0 Compile Speed-up
We introduced Torch 2.0 support for computing attention efficiently in 0.13.0. Since then, we have made a number of improvements to reduce the number of "graph breaks" in our models so that they can be compiled with torch.compile(). As a result, we are happy to report massive improvements in the inference speed of our most popular pipelines. Check out this doc to know more.
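For reference, compiling a pipeline's UNet is a one-liner. This is a minimal sketch; it requires PyTorch 2.0, and the first call pays the compilation overhead:
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# compile the UNet; subsequent calls with the same input shapes run significantly faster
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
image = pipe("a photo of an astronaut riding a horse on mars").images[0]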
Thanks to @Chillee for helping us with this. Thanks to @patrickvonplaten for fixing the problems stemming from "graph breaks" in this PR.
VAE pre-processing
We added a VaeImageProcessor class that provides a unified API for pipelines to prepare their image inputs, as well as to post-process their outputs. It supports resizing, normalization, and conversion between PIL Images, PyTorch tensors, and NumPy arrays.
With that, all Stable Diffusion pipelines now accept image inputs as PyTorch tensors and NumPy arrays, in addition to PIL Images, and can produce outputs in these three formats. They also accept and return latents, so you can take generated latents from one pipeline and pass them to another as inputs without leaving the latent space. If you work with multiple pipelines, you can pass PyTorch tensors between them without converting to PIL Images.
To learn more about the API, check out our doc here
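As a minimal sketch of the processor itself (the random array below just stands in for a real image):
import numpy as np
from diffusers.image_processor import VaeImageProcessor

processor = VaeImageProcessor()

# a NumPy image in [0, 1] with shape (H, W, C); PIL Images and torch tensors work the same way
np_image = np.random.rand(512, 512, 3).astype(np.float32)
# preprocess() returns a normalized torch tensor of shape (1, C, H, W), ready for a pipeline or VAE
tensor = processor.preprocess(np_image)
# postprocess() converts decoded tensors back to PIL Images (or "np" / "pt")
pil_images = processor.postprocess(tensor, output_type="pil")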
ControlNet Img2Img & Inpainting
ControlNet is one of the most used diffusion models, and upon strong demand from the community we added ControlNet img2img and ControlNet inpaint pipelines.
This allows any ControlNet checkpoint to be used both in the image-to-image setting and for inpainting (see the sketch after the code example below).
👉 Inpaint: See the ControlNet inpaint model here
👉 Image-to-Image: Any ControlNet checkpoint can be used for image-to-image, e.g.:
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image
import numpy as np
import torch
import cv2
from PIL import Image
# download an image
image = load_image(
"https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
np_image = np.array(image)
# get canny image
np_image = cv2.Canny(np_image, 100, 200)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)
# load control net and stable diffusion v1-5
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
# speed up diffusion process with faster scheduler and memory optimization
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
# generate image
generator = torch.manual_seed(0)
image = pipe(
"futuristic-looking woman",
num_inference_steps=20,
generator=generator,
image=image,
control_image=canny_image,
).images[0]
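For inpainting, a rough sketch with the new StableDiffusionControlNetInpaintPipeline could look like the following; the base checkpoint, the reuse of a canny ControlNet as the conditioning, and the example image/mask URLs (taken from elsewhere in these notes) are assumptions, not a recommended recipe:
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image

# example image and mask (white pixels are repainted)
init_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
).resize((512, 512))
mask_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
).resize((512, 512))

# build a canny control image from the input
canny = cv2.Canny(np.array(init_image), 100, 200)
control_image = Image.fromarray(np.stack([canny] * 3, axis=2))

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

image = pipe(
    "a photo of a person",
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    num_inference_steps=20,
).images[0]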
DiffEdit Zero-Shot Inpainting Pipeline
This pipeline (introduced in DiffEdit: Diffusion-based semantic image editing with mask guidance) allows for image editing with natural language. Below is an end-to-end example.
First, let’s load our pipeline:
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
sd_model_ckpt = "stabilityai/stable-diffusion-2-1"
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
sd_model_ckpt,
torch_dtype=torch.float16,
safety_checker=None,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
generator = torch.manual_seed(0)
Then, we load an input image to edit using our method:
from diffusers.utils import load_image
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
Then, we employ the source and target prompts to generate the editing mask:
source_prompt = "a bowl of fruits"
target_prompt = "a basket of fruits"
mask_image = pipeline.generate_mask(
image=raw_image,
source_prompt=source_prompt,
target_prompt=target_prompt,
generator=generator,
)
Then, we employ the caption and the input image to get the inverted latents:
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image, generator=generator).latents
Now, generate the image with the inverted latents and semantically generated mask:
`...
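The snippet above is truncated; a minimal sketch of the final step, assuming the image_latents and mask_image arguments of StableDiffusionDiffEditPipeline, looks like this:
image = pipeline(
    prompt=target_prompt,
    mask_image=mask_image,
    image_latents=inv_latents,
    negative_prompt=source_prompt,
    generator=generator,
).images[0]
image.save("edited_image.png")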
Patch Release: v0.16.1
v0.16.1: Patch Release to fix IF naming, community pipeline versioning, and to allow disabling VAE PyTorch 2.0 attention
- merge conflict by @apolinario (direct commit on v0.16.1-patch)
- Fix community pipelines by @patrickvonplaten in #3266
- Allow disabling torch 2_0 attention by @patrickvonplaten in #3273
v0.16.0 DeepFloyd IF & ControlNet v1.1
DeepFloyd's IF: The open-sourced Imagen
IF
IF is a pixel-based text-to-image generation model and was released in late April 2023 by DeepFloyd.
The model architecture is strongly inspired by Google's closed-source Imagen, and IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding:
Installation
pip install torch --upgrade # diffusers' IF is optimized for torch 2.0
pip install diffusers --upgrade
Accept the License
Before you can use IF, you need to accept its usage conditions. To do so:
- Make sure to have a Hugging Face account and be logged in
- Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0
- Log-in locally
from huggingface_hub import login
login()
and enter your Hugging Face Hub access token.
Code example
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
# stage 1
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()
# stage 2
stage_2 = DiffusionPipeline.from_pretrained(
"DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()
# stage 3
safety_modules = {
"feature_extractor": stage_1.feature_extractor,
"safety_checker": stage_1.safety_checker,
"watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)
# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
# stage 1
image = stage_1(
prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_I.png")
# stage 2
image = stage_2(
image=image,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
generator=generator,
output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")
# stage 3
image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images
image[0].save("./if_stage_III.png")
For more details about speed and memory optimizations, please have a look at the blog or docs below.
Useful links
👉 The official codebase
👉 Blog post
👉 Space Demo
👉 In-detail docs
ControlNet v1.1
Lvmin Zhang has released improved ControlNet checkpoints as well as a couple of new ones.
You can find all 🧨 Diffusers checkpoints here
Please have a look directly at the model cards on how to use the checkpoints:
Improved checkpoints:
Model Name | Control Image Overview
---|---
lllyasviel/control_v11p_sd15_canny (trained with canny edge detection) | A monochrome image with white edges on a black background.
lllyasviel/control_v11p_sd15_mlsd (trained with multi-level line segment detection) | An image with annotated line segments.
lllyasviel/control_v11f1p_sd15_depth (trained with depth estimation) | An image with depth information, usually represented as a grayscale image.
lllyasviel/control_v11p_sd15_normalbae (trained with surface normal estimation) | An image with surface normal information, usually represented as a color-coded image.
lllyasviel/control_v11p_sd15_seg (trained with image segmentation) | An image with segmented regions, usually represented as a color-coded image.
lllyasviel/control_v11p_sd15_lineart (trained with line art generation) | An image with line art, usually black lines on a white background.
lllyasviel/control_v11p_sd15_openpose (trained with human pose estimation) | An image with human poses, usually represented as a set of keypoints or skeletons.
lllyasviel/control_v11p_sd15_scribble (trained with scribble-based image generation) | An image with scribbles, usually random or user-drawn strokes.
lllyasviel/control_v11p_sd15_softedge (trained with soft edge image generation) | An image with soft edges, usually to create a more painterly or artistic effect.
(Control image and generated image examples are omitted here.)
v0.15.1: Patch Release to fix safety checker, config access and uneven scheduler
Fixes bugs related to missing global pooling in ControlNet, an img2img processor issue with the safety checker, uneven timesteps, and config deprecation.
- [Bug fix] Add global pooling to controlnet by @patrickvonplaten in #3121
- [Bug fix] Fix img2img processor with safety checker by @patrickvonplaten in #3127
- [Bug fix] Make sure correct timesteps are chosen for img2img by @patrickvonplaten in #3128
- [Bug fix] Fix config deprecation by @patrickvonplaten in #3129
v0.15.0 Beyond Image Generation
Taking Diffusers Beyond Image Generation
We are very excited about this release! It brings new pipelines for video and audio to diffusers, showing that diffusion is a great choice for all sorts of generative tasks. The modular, pluggable approach of diffusers was crucial to integrate the new models intuitively and cohesively with the rest of the library. We hope you appreciate the consistency of the APIs and implementations, as our ultimate goal is to provide the best toolbox to help you solve the tasks you're interested in. Don't hesitate to get in touch if you use diffusers for other projects!
In addition to that, diffusers 0.15 includes a lot of new features and improvements: from performance and deployment improvements (faster pipeline loading) to increased flexibility for creative tasks (Karras sigmas, prompt weighting, support for Automatic1111 textual inversion embeddings) to additional customization options (Multi-ControlNet) to training utilities (ControlNet, Min-SNR weighting). Read on for the details!
🎬 Text-to-Video
Text-guided video generation is not a fantasy anymore - it's as simple as spinning up a Colab and running either of the two powerful open-sourced video generation models.
Text-to-Video
Alibaba's DAMO Vision Intelligence Lab has open-sourced a first research-only video generation model that can generate powerful video clips of up to a minute. To see Darth Vader riding a wave, simply copy-paste the following lines into your favorite Python interpreter:
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)
For more information you can have a look at "damo-vilab/text-to-video-ms-1.7b"
Text-to-Video Zero
Text2Video-Zero is a zero-shot text-to-video synthesis method that enables low-cost yet consistent video generation using only pre-trained text-to-image diffusion models, such as Stable Diffusion v1-5. Text2Video-Zero also naturally supports extensions of pre-trained text-to-image models such as Instruct Pix2Pix, ControlNet and DreamBooth, on top of which Video Instruct Pix2Pix, pose-conditional, edge-conditional, and DreamBooth-specialized applications are presented.
For more information please have a look at PAIR/Text2Video-Zero
🔉 Audio Generation
Text-guided audio generation has made great progress over the last months with many advances being based on diffusion models.
The 0.15.0 release includes two powerful audio diffusion models.
AudioLDM
Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
from diffusers import AudioLDMPipeline
import torch
repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
The resulting audio output can be saved as a .wav file:
import scipy
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
For more information see cvssp/audioldm
Spectrogram Diffusion
This model from the Magenta team is a MIDI-to-audio generator. The pipeline takes a MIDI file as input and autoregressively generates 5-second spectrograms, which are concatenated together at the end and decoded to audio via a spectrogram decoder.
from diffusers import SpectrogramDiffusionPipeline, MidiProcessor
pipe = SpectrogramDiffusionPipeline.from_pretrained("google/music-spectrogram-diffusion")
pipe = pipe.to("cuda")
processor = MidiProcessor()
# Download MIDI from: wget http://www.piano-midi.de/midis/beethoven/beethoven_hammerklavier_2.mid
output = pipe(processor("beethoven_hammerklavier_2.mid"))
audio = output.audios[0]
📗 New Docs
Documentation is crucially important for diffusers, as it's one of the first resources where people try to understand how everything works and fix any issues they are observing. We have spent a lot of time in this release reviewing all documents, adding new ones, reorganizing sections and bringing code examples up to date with the latest APIs. This effort has been led by @stevhliu (thanks a lot! 🙌) and @yiyixuxu, but many others have chimed in and contributed.
Check it out: https://huggingface.co/docs/diffusers/index
Don't hesitate to open PRs for fixes to the documentation, they are greatly appreciated as discussed in our (revised, of course) contribution guide.
🪄 Stable UnCLIP
Stable UnCLIP is the best open-sourced image variation model out there. Pass an initial image and optionally a prompt to generate variations of the image:
from diffusers import DiffusionPipeline
from diffusers.utils import load_image
import torch
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-unclip-small", torch_dtype=torch.float16)
pipe.to("cuda")
# get image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
image = load_image(url)
# run image variation
image = pipe(image).images[0]
For more information you can have a look at "stabilityai/stable-diffusion-2-1-unclip"
🚀 More ControlNet
ControlNet was released in diffusers in version 0.14.0, but we have some exciting developments: Multi-ControlNet, a training script, an upcoming event, and a community image-to-image pipeline contributed by @mikegarts!
Multi-ControlNet
Thanks to community member @takuma104, it's now possible to use several ControlNet conditioning models at once! It works with the same API as before, only supplying a list of ControlNets instead of just one:
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
controlnet_canny = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny",
torch_dtype=torch.float16).to("cuda")
controlnet_pose = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose",
torch_dtype=torch.float16).to("cuda")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"example/a-sd15-variant-model", torch_dtype=torch.float16,
controlnet=[controlnet_pose, controlnet_canny]
).to("cuda")
pose_image = ...
canny_image = ...
prompt = ...
image = pipe(prompt=prompt, image=[pose_image, canny_image]).images[0]
And this is an example of how this affects generation:
(Example table omitted: combinations of the two control images and the corresponding generated images.)
ControlNet Training
We have created a training script for ControlNet, and can't wait to see what new ideas the community may come up with! In fact, we are so pumped about it that we are organizing a JAX Diffusers sprint with a special focus on ControlNet, where participant teams will be assigned TPUs v4-8 to work on their projects 🤯. Those are some mean machines, so make sure you join our discord to follow the event: https://discord.com/channels/879548962464493619/897387888663232554/1092751149217615902.
🐈⬛ Textual Inversion, Revisited
Several great contributors have been working on textual inversion to get the most out of it. @isamu-isozaki made it possible to perform multitoken training, and @pies...