Norm-related notes

I wasn't originally planning to write this, but lately I keep running into norm-related questions in all sorts of places, so I'm writing it down in a separate note.

BatchNorm

Normalizes over the batch dimension; the common variants are 1D and 2D.

1D is used in NLP. Input: (N, C, L); num_features is the size of C, and the statistics are computed per channel over the N and L dimensions.

2D is used in image processing. Input: (N, C, H, W); num_features is C, and the statistics are computed per channel over the N, H, and W dimensions.

example:

1D:

In [1]: import torch

In [2]: x = torch.arange(2*2*2, dtype=torch.float).reshape(2,2,2)

In [3]: x[0], x[1]
Out[3]:
(tensor([[0., 1.],
         [2., 3.]]),
 tensor([[4., 5.],
         [6., 7.]]))

In [4]: import numpy as np

In [5]: mean = np.mean([0,1,4,5])

In [6]: std = np.std([0,1,4,5])

In [7]: norm = lambda x:(x-mean)/std

In [8]: for i in [0,1,4,5]:
...: print(norm(i))
...:
-1.212678125181665
-0.7276068751089989
0.7276068751089989
1.212678125181665

In [9]: mean = np.mean([2,3,6,7])

In [10]: std = np.std([2,3,6,7])

In [11]: for i in [2,3,6,7]:
...: print(norm(i))
...:
-1.212678125181665
-0.7276068751089989
0.7276068751089989
1.212678125181665

In [15]: BNlayer = torch.nn.BatchNorm1d(2)

In [16]: BNlayer(x)
Out[16]:
tensor([[[-1.2127, -0.7276],
         [-1.2127, -0.7276]],

        [[ 0.7276,  1.2127],
         [ 0.7276,  1.2127]]], grad_fn=<NativeBatchNormBackward0>)

2D is similar, except that, as the name suggests, the normalized spatial dimensions become two-dimensional (H and W):

In [20]: x = torch.arange(2*2*2*2, dtype=torch.float).reshape(2,2,2,2)

In [21]: x
Out[21]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]]],


        [[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]])

In [22]: x[0], x[1]
Out[22]:
(tensor([[[0., 1.],
          [2., 3.]],

         [[4., 5.],
          [6., 7.]]]),
 tensor([[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]))

In [23]: mean = np.mean([0,1,2,3,8,9,10,11])

In [24]: std = np.std([0,1,2,3,8,9,10,11])

In [25]: for i in [0,1,2,3,8,9,10,11]:
...: print(norm(i))
...:
-1.3242443839434612
-1.0834726777719228
-0.8427009716003844
-0.601929265428846
0.601929265428846
0.8427009716003844
1.0834726777719228
1.3242443839434612

In [26]: bn2d = torch.nn.BatchNorm2d(2)

In [27]: bn2d(x)
Out[27]:
tensor([[[[-1.3242, -1.0835],
          [-0.8427, -0.6019]],

         [[-1.3242, -1.0835],
          [-0.8427, -0.6019]]],


        [[[ 0.6019,  0.8427],
          [ 1.0835,  1.3242]],

         [[ 0.6019,  0.8427],
          [ 1.0835,  1.3242]]]], grad_fn=<NativeBatchNormBackward0>)

BN and batch size

If a model uses BN, you must pay attention to the choice of batch size, since the normalization is batch-wise. You can technically set the batch size to 1, but that is then equivalent to instance norm. The point of BN is to normalize each batch so that samples within the same batch share the same distribution, which arguably makes it better suited to classification problems.
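
As a quick sanity check of the batch-size-1 point, here is a minimal sketch (mine, not from the original post; it assumes default module settings and training mode): with N = 1, the per-channel batch statistics of BatchNorm2d are computed from a single sample, so its output coincides with InstanceNorm2d.

import torch

# one sample, two channels, 2x2 spatial: with N == 1 the batch statistics
# of BatchNorm2d reduce to per-sample, per-channel statistics
x = torch.arange(1 * 2 * 2 * 2, dtype=torch.float).reshape(1, 2, 2, 2)

bn = torch.nn.BatchNorm2d(2)        # training mode: normalizes with batch statistics
inorm = torch.nn.InstanceNorm2d(2)  # normalizes each sample's channels independently

print(torch.allclose(bn(x), inorm(x)))  # should print True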

conv + BN

A conv layer can have a bias on its output, but if it is followed by BN, it makes no difference whether the bias is there or not.

To see why, let the conv output be x:

x -> (mean_x, var_x)

x+b -> (mean_x + b, var_x)

So after BN:

(x - mean_x) / std_x == (x+b - mean_x - b) / std_x

Therefore the bias has no effect either way.
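
The argument above can be checked numerically. Below is a minimal sketch (mine, not from the original post; the variable names are made up for illustration): two convs share the same weights, one with a bias and one without, and a training-mode BN produces the same output for both.

import torch

torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)

conv_nb = torch.nn.Conv2d(3, 6, kernel_size=3, bias=False)
conv_b = torch.nn.Conv2d(3, 6, kernel_size=3, bias=True)
with torch.no_grad():
    conv_b.weight.copy_(conv_nb.weight)   # same weights; conv_b keeps its random bias

bn = torch.nn.BatchNorm2d(6)              # training mode: normalizes with batch statistics

# the per-channel bias shifts mean_x by exactly b, so it cancels inside BN
print(torch.allclose(bn(conv_nb(x)), bn(conv_b(x)), atol=1e-5))  # should print True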

LDM/Stable Diffusion

I have also been working on LDM recently, and one issue I ran into is that my own implementation used BatchNorm, but because of GPU memory limits the batch size could only be set to 2.

So I went back and looked at the CompVis code and found that their batch size is also only 4; presumably they likewise felt that BN is a poor fit for such a small batch size, and they chose GroupNorm as their normalization instead.

LayerNorm

Unlike batchnorm, layernorm normalizes over the dimensions of the input that you specify.

More intuitively: take one sample at a time from the batch (unlike batchnorm, this is not batch-wise), then repeatedly take slices of a size you specify from that sample, normalize each slice, and move on to the next one.

For example, in the snippet below the batch size is 2. With LayerNorm((2,2,2)), each step takes a 2x2x2 tensor from sample 1 and normalizes it; likewise, with normalized_shape (2,2) it takes 2x2 slices to normalize. Once sample 1 is done, it moves on to sample 2 and does the same.

In [1]: import torch

In [2]: import numpy as np

In [3]: x = torch.arange(2*2*2*2, dtype=torch.float).reshape(2,2,2,2)

In [4]: x
Out[4]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]]],


        [[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]])

In [5]: layernorm = torch.nn.LayerNorm((2,2,2))

In [6]: layernorm(x)
Out[6]:
tensor([[[[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]]],


        [[[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]]]], grad_fn=<NativeLayerNormBackward0>)

In [7]: mean = np.mean([0,1,2,3,4,5,6,7])

In [8]: std = np.std([0,1,2,3,4,5,6,7])

In [9]: norm = lambda x:(x-mean)/std

In [10]: for i in [0,1,2,3,4,5,6,7]:
...: print(norm(i), end=',')
...:
-1.5275252316519468,-1.091089451179962,-0.6546536707079772,-0.2182178902359924,0.2182178902359924,0.6546536707079772,1.091089451179962,1.5275252316519468,
In [11]: layernorm = torch.nn.LayerNorm((2,2))

In [12]: layernorm(x)
Out[12]:
tensor([[[[-1.3416, -0.4472],
          [ 0.4472,  1.3416]],

         [[-1.3416, -0.4472],
          [ 0.4472,  1.3416]]],


        [[[-1.3416, -0.4472],
          [ 0.4472,  1.3416]],

         [[-1.3416, -0.4472],
          [ 0.4472,  1.3416]]]], grad_fn=<NativeLayerNormBackward0>)

In [13]: x
Out[13]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]]],


        [[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]])

In [14]: mean = np.mean([0,1,2,3])

In [15]: std = np.std([0,1,2,3])

In [16]: for i in [0,1,2,3]:
...: print(norm(i), end=', ')
...:
-1.3416407864998738, -0.4472135954999579, 0.4472135954999579, 1.3416407864998738,

InstanceNorm

As just described, this is similar to BN with a batch size of 1: take one instance from the batch at a time, fix a channel, and normalize over H and W. Because it normalizes each image individually, it is often used in style transfer.

In [1]: import torch

In [2]: import numpy as np

In [3]: x = torch.arange(2*2*2*2, dtype=torch.float).reshape(2,2,2,2)

In [4]: x
Out[4]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]]],


        [[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]])

In [5]: instanceNorm = torch.nn.InstanceNorm2d(2)

In [6]: instanceNorm(x)
Out[6]:
tensor([[[[-1.3416, -0.4472],
          [ 0.4472,  1.3416]],

         [[-1.3416, -0.4472],
          [ 0.4472,  1.3416]]],


        [[[-1.3416, -0.4472],
          [ 0.4472,  1.3416]],

         [[-1.3416, -0.4472],
          [ 0.4472,  1.3416]]]])

In [7]: batchnorm = torch.nn.BatchNorm2d(2)

In [8]: batchnorm(x)
Out[8]:
tensor([[[[-1.3242, -1.0835],
          [-0.8427, -0.6019]],

         [[-1.3242, -1.0835],
          [-0.8427, -0.6019]]],


        [[[ 0.6019,  0.8427],
          [ 1.0835,  1.3242]],

         [[ 0.6019,  0.8427],
          [ 1.0835,  1.3242]]]], grad_fn=<NativeBatchNormBackward0>)

GroupNorm

It sits between InstanceNorm and LayerNorm: InstanceNorm is the special case where the number of groups equals the number of channels, and LayerNorm is the special case where there is a single group.

The official PyTorch docs say as much:

>>> input = torch.randn(20, 6, 10, 10)
>>> # Separate 6 channels into 3 groups
>>> m = nn.GroupNorm(3, 6)
>>> # Separate 6 channels into 6 groups (equivalent with InstanceNorm)
>>> m = nn.GroupNorm(6, 6)
>>> # Put all 6 channels into a single group (equivalent with LayerNorm)
>>> m = nn.GroupNorm(1, 6)
>>> # Activating the module
>>> output = m(input)

Example:

In [1]: import torch

In [2]: import numpy as np

In [3]: x = torch.arange(2*4*2*2, dtype=torch.float).reshape(2,4,2,2)

In [4]: x
Out[4]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]],

         [[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]],


        [[[16., 17.],
          [18., 19.]],

         [[20., 21.],
          [22., 23.]],

         [[24., 25.],
          [26., 27.]],

         [[28., 29.],
          [30., 31.]]]])

In [5]: groupnorm = torch.nn.GroupNorm(2,4)

In [6]: groupnorm(x)
Out[6]:
tensor([[[[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]],

         [[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]]],


        [[[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]],

         [[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]]]], grad_fn=<NativeGroupNormBackward0>)

In [7]: mean = np.mean(range(8))

In [8]: std = np.std(range(8))

In [9]: norm = lambda x:(x-mean)/std

In [10]: for i in range(8):
...: print(norm(i), end=', ')
...:
-1.5275252316519468, -1.091089451179962, -0.6546536707079772, -0.2182178902359924, 0.2182178902359924, 0.6546536707079772, 1.091089451179962, 1.5275252316519468,
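
To close, here is a minimal numerical check of the two special cases mentioned at the start of this section (my sketch, not from the PyTorch docs; default module settings assumed):

import torch

x = torch.arange(2 * 4 * 2 * 2, dtype=torch.float).reshape(2, 4, 2, 2)

# groups == channels: GroupNorm behaves like InstanceNorm2d
print(torch.allclose(torch.nn.GroupNorm(4, 4)(x), torch.nn.InstanceNorm2d(4)(x)))

# a single group: GroupNorm behaves like LayerNorm over (C, H, W)
print(torch.allclose(torch.nn.GroupNorm(1, 4)(x), torch.nn.LayerNorm((4, 2, 2))(x)))

# both should print True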