Learning Diffusers (1)

The VQGAN from ldm is still training, so in the meantime let's look at how huggingface's existing diffusion models are implemented.

Documentation link

diffusion model (ddpm)

install

First, install and import:

!pip install diffusers["torch"] transformers

from diffusers import DDPMPipeline

pipeline

The pipeline code is very concise:

ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
image = ddpm(num_inference_steps=30).images[0]
image

The result is honestly mediocre.
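Part of the reason is probably the low step count: DDPM's ancestral sampler is not meant for few-step sampling, and DDPMPipeline defaults to 1000 inference steps, so raising num_inference_steps (at the cost of much longer runtime) should give noticeably cleaner samples:

image = ddpm(num_inference_steps=1000).images[0]  # slow, but closer to the schedule the model was trained with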

Internal details

Same as the ddpm we looked at earlier: a UNet plus a scheduler. Plotting intermediate results during generation gives the following.

from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
import torch
from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler.from_pretrained('google/ddpm-cat-256')
model = UNet2DModel.from_pretrained('google/ddpm-cat-256', use_safetensors=True).to('cuda')
scheduler.set_timesteps(50)

# start from pure Gaussian noise in image space
noise = torch.randn((1, 3, model.config.sample_size, model.config.sample_size), device='cuda')
fig = plt.figure(figsize=(15, 15))
for i, t in enumerate(scheduler.timesteps):
    # predict the noise residual, then let the scheduler compute x_t -> x_{t-1}
    with torch.no_grad():
        noisy_residual = model(noise, t).sample
    noise = scheduler.step(noisy_residual, t, noise).prev_sample

    # plot an intermediate sample every 10 steps
    if i % 10 == 0:
        image = (noise / 2 + 0.5).clamp(0, 1).squeeze()
        image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
        image = Image.fromarray(image)
        plt.subplot(1, 10, i // 10 + 1)
        plt.imshow(image)
plt.show()

stable diffusion

pipeline

Text-to-image

There are stable diffusion 1.5, stable diffusion XL, and so on.

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("a bowl of cats, cute", generator=generator).images[0]
image

PNDM(Pseudo Numerical Methods for Diffusion Models on Manifolds)

There are quite a few schedulers, from ddpm to ddim to pndm; for details see Note on Variants of Diffusion Scheduler, DDPM DDIM PNDM.

Roughly speaking, ddpm is based on a Markov chain, ddim is deterministic and optimizes the reverse process, and pndm further improves on ddim at the level of numerical ODE solving.
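As a rough reference (my own note, not from the linked post): the deterministic DDIM update with η = 0 is

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\cdot\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t)$$

so once the initial noise is fixed the whole trajectory is fixed; pndm keeps this ODE view of sampling and plugs in higher-order pseudo numerical solvers.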

That said, according to the official docs stable diffusion uses pndm by default, so to show how easy it is to swap the scheduler, we use UniPCMultistepScheduler instead.
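If you are working with a ready-made pipeline rather than the individual components below, swapping the scheduler is a one-liner; a minimal sketch, reusing the pipeline object from the Text-to-image example above:

from diffusers import UniPCMultistepScheduler

# replace the pipeline's default scheduler, reusing its existing configuration
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config)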

load from pretrained

Unlike the UNet used earlier, the UNet here is conditional (UNet2DConditionModel), and the autoencoder used is AutoencoderKL.

from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from diffusers import UniPCMultistepScheduler

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True
)
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True
)
scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

text processing

On the text side, we first tokenize the prompt and embed it; then we embed an empty (padding-only) prompt as the unconditional input, and finally concat the two embeddings.

# text processing
prompt = ["a photograph of a bowl of cats"]
batch_size = len(prompt)

# tokenizing
text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)
# generate text embedding
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

# generate text embedding for the unconditional (empty, padding-only) prompt
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

# concat embeddings: [unconditional, conditional]
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

denoising loop

The generator in the official example breaks here; if you are on cuda, remember to use torch.cuda.manual_seed (or build the generator with torch.Generator("cuda").manual_seed(...) as in the SDXL example above).

The idea itself is simple: check how many times the vae downsamples, scale the height and width accordingly, and generate the random input directly in latent space.

The text embeddings are then passed into the unet as encoder hidden states. I did not fully understand the guidance part at first; what actually happens is that the latents are duplicated (torch.cat([latents] * 2)) and paired with the concatenated [unconditional, conditional] embeddings, so a single unet forward pass yields both predictions, which chunk(2) then splits so they can be combined manually with the classifier-free guidance formula using guidance_scale.

from tqdm.auto import tqdm

# image settings
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
generator = torch.cuda.manual_seed(0)  # Seed generator to create the initial latent noise

# latent noise
# the vae downsamples by 2 ** (len(vae.config.block_out_channels) - 1) == 8
latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
    device=torch_device,
)

# scale the input to the scheduler's initial noise distribution
latents = latents * scheduler.init_noise_sigma
scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform classifier-free guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample

show result

# undo the vae latent scaling, then decode back to image space
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample
image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image
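The hard-coded 0.18215 is the Stable Diffusion v1 VAE latent scaling factor; recent diffusers versions also expose it on the model config, so the magic number can be avoided (a small sketch, assuming the vae loaded above):

# equivalent to latents = 1 / 0.18215 * latents for SD v1, without the magic number
latents = latents / vae.config.scaling_factor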