LoRA: Low-Rank Adaptation of Large Language Models
The bluntest explanation: retraining a large model is prohibitively expensive because of its sheer parameter count, so full retraining is not a practical way to fine-tune.
A more palatable approach is to attach an adapter to the model as an auxiliary component: instead of retraining the whole model, we only train the adapter on top of it, which is enough to steer the model toward the behavior we want.
It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share.
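To make this concrete, here is a minimal sketch of a LoRA adapter around a frozen linear layer. The class name `LoRALinear` and the rank/alpha values are my own illustration, not the diffusers implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the original weights stay frozen
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_B.weight)  # B starts at zero so training begins exactly at the base model
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable} / total {total}")  # 6144 / 596736: only a tiny fraction is trained
```

Only the two small rank-`r` matrices receive gradients, which is where the speed and memory savings come from; the saved adapter is correspondingly tiny.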
Official docs: https://huggingface.co/docs/diffusers/training/lora
script level
install
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
cd examples/text_to_image
pip install -r requirements.txt
```
script
Most of the script is unremarkable; the parts worth looking at are a few customization points.
scheduler, tokenizer
Location: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py#L567
```python
noise_scheduler = DDPMScheduler.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="scheduler"
)
tokenizer = CLIPTokenizer.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision
)
```
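For context on how the scheduler is used later in the training loop: it is what corrupts the latents with noise at a random timestep. A rough sketch of that step, reusing the `noise_scheduler` from above (the `latents` tensor here is just a stand-in for real VAE-encoded images):

```python
import torch

latents = torch.randn(4, 4, 64, 64)  # stand-in for VAE-encoded image latents
noise = torch.randn_like(latents)
# one random diffusion timestep per image in the batch
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
# the UNet is then trained to predict `noise` from `noisy_latents` and the text embedding
```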
UNet
Location: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py#L599
```python
unet = UNet2DConditionModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="unet", revision=args.non_ema_revision
)

vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()
```
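Note that this is the full fine-tuning script: the VAE and text encoder are frozen while the whole UNet trains. In the LoRA variant in the same directory (train_text_to_image_lora.py), the UNet is frozen as well and small adapter weights are injected into its attention layers instead. In older diffusers versions that looked roughly like the sketch below; treat the names and signatures as belonging to that era of the API, not the current one:

```python
from diffusers.models.attention_processor import LoRAAttnProcessor

unet.requires_grad_(False)  # unlike above, the LoRA script freezes the UNet as well
lora_attn_procs = {}
for name in unet.attn_processors.keys():
    # attn1 is self-attention; attn2 cross-attends to the text-encoder output
    cross_attention_dim = None if name.endswith("attn1.processor") else unet.config.cross_attention_dim
    if name.startswith("mid_block"):
        hidden_size = unet.config.block_out_channels[-1]
    elif name.startswith("up_blocks"):
        block_id = int(name[len("up_blocks.")])
        hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
    else:  # down_blocks
        block_id = int(name[len("down_blocks.")])
        hidden_size = unet.config.block_out_channels[block_id]
    lora_attn_procs[name] = LoRAAttnProcessor(
        hidden_size=hidden_size, cross_attention_dim=cross_attention_dim, rank=4
    )
unet.set_attn_processor(lora_attn_procs)  # only these small adapter weights receive gradients
```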
text processing
Location: https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L724
```python
def tokenize_captions(examples, is_train=True):
    captions = []
    for caption in examples[caption_column]:
        if isinstance(caption, str):
            captions.append(caption)
        elif isinstance(caption, (list, np.ndarray)):
            captions.append(random.choice(caption) if is_train else caption[0])
        else:
            raise ValueError(
                f"Caption column `{caption_column}` should contain either strings or lists of strings."
            )
    inputs = tokenizer(
        captions,
        max_length=tokenizer.model_max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    return inputs.input_ids
```
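A small usage sketch of what this returns; it assumes the CLIP tokenizer loaded earlier and a hypothetical `caption_column = "text"`. When a row holds several captions, one is picked at random during training:

```python
caption_column = "text"  # hypothetical column name for this illustration
examples = {"text": ["a pokemon with blue eyes", ["a yoda pokemon", "a green pokemon"]]}

input_ids = tokenize_captions(examples)
print(input_ids.shape)  # torch.Size([2, 77]): every caption padded/truncated to CLIP's 77 tokens
```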
image processing
Location: https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L742
```python
train_transforms = transforms.Compose(
    [
        transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
        transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)
```
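Running one image through the pipeline makes the shapes explicit (assuming `args.resolution == 512`; the blank PIL image is just a stand-in for a dataset sample):

```python
from PIL import Image

img = Image.new("RGB", (768, 512))  # stand-in for a real training image
tensor = train_transforms(img)      # resize -> crop -> (optional flip) -> tensor -> normalize
print(tensor.shape)                 # torch.Size([3, 512, 512]), pixel values mapped to [-1, 1]
```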
run & result
The official example trains a Pokémon model. The school machines were all in use, so I couldn't run it myself; unfortunately that means no results to show here.
Worth mentioning: the Hub also hosts plenty of datasets uploaded by others that you can substitute in: https://huggingface.co/datasets
```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --enable_xformers_memory_efficient_attention \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" \
  --push_to_hub
```
The trained model can then be loaded to check the results:

```python
from diffusers import StableDiffusionPipeline
import torch

pipeline = StableDiffusionPipeline.from_pretrained(
    "path/to/saved_model", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
image = pipeline(prompt="yoda").images[0]
image.save("yoda-pokemon.png")
```
Load adapter
The point of understanding the script is to be able to tune a model toward what we want ourselves; if pre-trained LoRA weights already exist, they can simply be loaded directly instead.
Useful repositories
Existing models: stable-diffusion-conceptualizer, LoraTheExplorer.
Galleries: diffusers-gallery, civitai.
A few different approaches
You can also just watch the YouTube video comparing them: LoRA vs Dreambooth vs Textual Inversion vs Hypernetworks.
DreamBooth
Given a handful of images of the same subject plus a unique identifier, DreamBooth generates new images containing that subject.
It trains the entire model: the tokenizer, the text embedding, and finally the UNet are all updated.
In essence it forcibly re-trains an identifier onto a new concept on top of the original model, and the output is a separate, fully standalone model.
```python
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "sd-dreambooth-library/herge-style", torch_dtype=torch.float16
).to("cuda")
prompt = "A cute herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration"
image = pipeline(prompt).images[0]
image
```
Textual inversion
Textual inversion likewise ties a specific object or style to an identifier, but it works from the text side: the model's text-processing part is trained so that a particular identifier gets its own embedding vector, which in turn makes the generator produce the matching images.
```python
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork")
prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style"
image = pipeline(prompt).images[0]
image
```
LoRA
LoRA bolts an external adapter onto the model and fine-tunes only that small part, steering the model in a particular direction.
```python
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors")
prompt = "bears, pizza bites"
image = pipeline(prompt).images[0]
image
```
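A nice property of keeping the adapter external is that the same base pipeline can drop or swap adapters at runtime. A small sketch; I believe `unload_lora_weights` is the right call in recent diffusers versions, but treat this as an assumption:

```python
# drop the cereal-box adapter and return to the plain SDXL base model
pipeline.unload_lora_weights()
# a different LoRA can then be loaded into the same pipeline via load_lora_weights(...)
```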
That's about it for now; next time I'll take a look at PEFT, and then at how LoRA is applied on the GPT side.