Diffusers Learning (2) - LoRA

LoRA: Low-Rank Adaptation of Large Language Models

The most straightforward explanation: retraining a large model from scratch is far too time-consuming because of its huge number of parameters, so full retraining as a way of fine-tuning is impractical.

A more practical approach is to attach an adapter to the model as an auxiliary module. Instead of retraining the whole model, we only train the adapter on top of it, which is enough to steer the model toward the behavior we want.

It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share.
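To make the low-rank idea concrete, here is a minimal sketch of my own (an illustration, not code from diffusers or the LoRA paper): the original weight matrix stays frozen and only a small low-rank update B·A, scaled by alpha/r, is trained on top of it.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wrap a frozen nn.Linear and add a trainable low-rank update B @ A.
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768), r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(trainable, frozen)  # 6144 trainable vs 590592 frozen parameters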

Official docs: https://huggingface.co/docs/diffusers/training/lora

script level

install

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/text_to_image
pip install -r requirements.txt

script

Most of the script is not particularly interesting; the parts worth looking at are just a few customizable points.

scheduler, tokenizer

Location:

https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py#L567

# Load scheduler, tokenizer and models.
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision
)
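Right below this, the same script also loads the text encoder and the VAE; reproduced here from memory of the same file, so treat it as approximate:

from diffusers import AutoencoderKL
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision
)
vae = AutoencoderKL.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision
)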

UNet

Location:

https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py#L599

unet = UNet2DConditionModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="unet", revision=args.non_ema_revision
)

# Freeze vae and text_encoder and set unet to trainable
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()
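For comparison, the LoRA variant of this script (train_text_to_image_lora.py) freezes the UNet as well and only injects trainable low-rank adapters into the attention projections. A rough sketch of that setup, assuming a recent diffusers + PEFT install (the rank of 4 is just an illustrative choice):

from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)  # freeze every original UNet weight

unet_lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
unet.add_adapter(unet_lora_config)  # only the injected LoRA layers remain trainable

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # a tiny fraction of the full UNet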

text processing

Location:

https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L724

def tokenize_captions(examples, is_train=True):
    captions = []
    for caption in examples[caption_column]:
        if isinstance(caption, str):
            captions.append(caption)
        elif isinstance(caption, (list, np.ndarray)):
            # take a random caption if there are multiple
            captions.append(random.choice(caption) if is_train else caption[0])
        else:
            raise ValueError(
                f"Caption column `{caption_column}` should contain either strings or lists of strings."
            )
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    return inputs.input_ids
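A quick standalone check of what this returns (hypothetical usage: it assumes the function above is in scope together with the script's import random / import numpy as np and the caption_column variable):

from transformers import CLIPTokenizer

caption_column = "text"
tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")

examples = {"text": ["a photo of a pokemon", ["a yellow creature", "pikachu on grass"]]}
input_ids = tokenize_captions(examples, is_train=True)
print(input_ids.shape)  # torch.Size([2, 77]): every caption padded/truncated to the tokenizer's max length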

image processing

Location:

https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L742

# Preprocessing the datasets.
train_transforms = transforms.Compose(
    [
        transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
        transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)
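A standalone run of the same pipeline with fixed arguments (my own example, with resolution hard-coded to 512 and the optional crop/flip branches resolved) shows the value range the VAE expects:

from PIL import Image
from torchvision import transforms

resolution = 512
train_transforms = transforms.Compose(
    [
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(resolution),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),               # HWC uint8 in [0, 255] -> CHW float in [0, 1]
        transforms.Normalize([0.5], [0.5]),  # [0, 1] -> [-1, 1]
    ]
)

img = Image.new("RGB", (768, 640), color=(128, 64, 32))  # stand-in for a dataset image
pixel_values = train_transforms(img)
print(pixel_values.shape)  # torch.Size([3, 512, 512]), values in [-1, 1]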

run & result

The official example trains on the Pokémon dataset, but the school machines are currently occupied so I could not run it. Unfortunately there are no results to show.

Worth mentioning: the Hub also hosts many datasets uploaded by other people that can be used for your own training.

https://huggingface.co/datasets
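A quick way to preview one of these datasets before training (assuming the datasets library from the requirements file is installed):

from datasets import load_dataset

dataset = load_dataset("lambdalabs/pokemon-blip-captions", split="train")
print(dataset)             # columns: image, text
print(dataset[0]["text"])  # a short BLIP-generated caption for the first image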

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$dataset_name \
--use_ema \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--enable_xformers_memory_efficient_attention \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-pokemon-model" \
--push_to_hub

The result can then be loaded and inspected:

from diffusers import StableDiffusionPipeline
import torch

pipeline = StableDiffusionPipeline.from_pretrained("path/to/saved_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

image = pipeline(prompt="yoda").images[0]
image.save("yoda-pokemon.png")

Load adapter

At the script level, the point of understanding all this is to be able to tune a model toward what we want; if existing LoRA weights are available, they can also be loaded directly.

Available repositories

Existing models:

stable-diffusion-conceptualizer, LoraTheExplorer.

Repositories: diffusers-gallery, civitai

Several different methods

You can also just watch the YouTube video: LoRA vs Dreambooth vs Textual Inversion vs Hypernetworks.

DreamBooth

From just a few images of the same subject plus a unique identifier, it can generate new images containing that subject.

Training is whole-model: the text encoder / text embeddings and the UNet at the end are all fine-tuned.

In essence it forcibly retrains an identifier onto a new concept on top of the original model, and the result is a separate, entirely new model.

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("sd-dreambooth-library/herge-style", torch_dtype=torch.float16).to("cuda")
prompt = "A cute herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration"
image = pipeline(prompt).images[0]
image

Textual inversion

It likewise ties a specific subject or style to an identifier, but textual inversion works from the text side: the text-processing part of the model is trained so that the chosen identifier gets its own embedding vector, which then drives the generator to produce the corresponding images.

from diffusers import AutoPipelineForText2Image
import torch

# load stable diffusion
pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

# load textual inversion
pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork")
prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style"
image = pipeline(prompt).images[0]
image

LoRA

It attaches external adapters and fine-tunes only those small parts, so the model can be steered to work in a particular direction.

from diffusers import AutoPipelineForText2Image
import torch
# load stable diffusion
pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

# load lora
pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors")
prompt = "bears, pizza bites"
image = pipeline(prompt).images[0]
image
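As far as I know, the strength of a loaded LoRA can also be dialed down at inference time by passing a scale through cross_attention_kwargs, and the adapter can be removed again afterwards; a sketch continuing the pipeline above:

# weaken the LoRA (1.0 = full effect, 0.0 = behaves like the base model)
image = pipeline(prompt, cross_attention_kwargs={"scale": 0.5}).images[0]

# drop the adapter entirely to get the original base model back
pipeline.unload_lora_weights()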

That's about it for this time. Next, when I have a chance, I'll take a look at PEFT and then at how LoRA is applied on the GPT side.