Norm-related notes

I wasn't originally planning to write this, but lately I keep running into norm-related questions in all sorts of places, so I'm writing it down in a separate note.

BatchNorm

Normalizes over the batch dimension; the common variants are 1D and 2D.

1D is used in NLP. Input: (N, C, L); num_features is the size of C, and the statistics are computed per channel over the N and L dimensions.

2D is used in image processing. Input: (N, C, H, W); num_features is C, and the statistics are computed per channel over the N, H, and W dimensions.

example:

1D:

In [1]: import torch

In [2]: x = torch.arange(2*2*2, dtype=torch.float).reshape(2,2,2)

In [3]: x[0], x[1]
Out[3]:
(tensor([[0., 1.],
         [2., 3.]]),
 tensor([[4., 5.],
         [6., 7.]]))

In [4]: import numpy as np

In [5]: mean = np.mean([0,1,4,5])

In [6]: std = np.std([0,1,4,5])

In [7]: norm = lambda x:(x-mean)/std

In [8]: for i in [0,1,4,5]:
...: print(norm(i))
...:
-1.212678125181665
-0.7276068751089989
0.7276068751089989
1.212678125181665

In [9]: mean = np.mean([2,3,6,7])

In [10]: std = np.std([2,3,6,7])

In [11]: for i in [2,3,6,7]:
...: print(norm(i))
...:
-1.212678125181665
-0.7276068751089989
0.7276068751089989
1.212678125181665

In [15]: BNlayer = torch.nn.BatchNorm1d(2)

In [16]: BNlayer(x)
Out[16]:
tensor([[[-1.2127, -0.7276],
         [-1.2127, -0.7276]],

        [[ 0.7276,  1.2127],
         [ 0.7276,  1.2127]]], grad_fn=<NativeBatchNormBackward0>)

2D is similar, except that, as the name suggests, the normalized spatial dimensions become two-dimensional (H and W):

In [20]: x = torch.arange(2*2*2*2, dtype=torch.float).reshape(2,2,2,2)

In [21]: x
Out[21]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]]],


        [[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]])

In [22]: x[0], x[1]
Out[22]:
(tensor([[[0., 1.],
          [2., 3.]],

         [[4., 5.],
          [6., 7.]]]),
 tensor([[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]))

In [23]: mean = np.mean([0,1,2,3,8,9,10,11])

In [24]: std = np.std([0,1,2,3,8,9,10,11])

In [25]: for i in [0,1,2,3,8,9,10,11]:
...: print(norm(i))
...:
-1.3242443839434612
-1.0834726777719228
-0.8427009716003844
-0.601929265428846
0.601929265428846
0.8427009716003844
1.0834726777719228
1.3242443839434612

In [26]: bn2d = torch.nn.BatchNorm2d(2)

In [27]: bn2d(x)
Out[27]:
tensor([[[[-1.3242, -1.0835],
          [-0.8427, -0.6019]],

         [[-1.3242, -1.0835],
          [-0.8427, -0.6019]]],


        [[[ 0.6019,  0.8427],
          [ 1.0835,  1.3242]],

         [[ 0.6019,  0.8427],
          [ 1.0835,  1.3242]]]], grad_fn=<NativeBatchNormBackward0>)

BN and batch size

If a model uses BN, you must pay attention to the choice of batch size, since the normalization is batch-wise. You can technically set the batch size to 1, but that is then equivalent to instance norm. The point of BN is to normalize each batch so that samples within the same batch share the same distribution, which arguably makes it better suited to classification problems.
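
As a quick sanity check of the batch-size-1 point, here is a minimal sketch (mine, not from the original post; it assumes default module settings and training mode): with N = 1, the per-channel batch statistics of BatchNorm2d are computed from a single sample, so its output coincides with InstanceNorm2d.

import torch

# one sample, two channels, 2x2 spatial: with N == 1 the batch statistics
# of BatchNorm2d reduce to per-sample, per-channel statistics
x = torch.arange(1 * 2 * 2 * 2, dtype=torch.float).reshape(1, 2, 2, 2)

bn = torch.nn.BatchNorm2d(2)        # training mode: normalizes with batch statistics
inorm = torch.nn.InstanceNorm2d(2)  # normalizes each sample's channels independently

print(torch.allclose(bn(x), inorm(x)))  # should print True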

conv + BN

A conv layer can have a bias on its output, but if it is followed by BN, it makes no difference whether the bias is there or not.

To see why, let the conv output be x:

x -> (mean_x, var_x)

x+b -> (mean_x + b, var_x)

So after BN:

(x - mean_x) / std_x == (x+b - mean_x - b) / std_x

Therefore the bias has no effect either way.
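
The argument above can be checked numerically. Below is a minimal sketch (mine, not from the original post; the variable names are made up for illustration): two convs share the same weights, one with a bias and one without, and a training-mode BN produces the same output for both.

import torch

torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)

conv_nb = torch.nn.Conv2d(3, 6, kernel_size=3, bias=False)
conv_b = torch.nn.Conv2d(3, 6, kernel_size=3, bias=True)
with torch.no_grad():
    conv_b.weight.copy_(conv_nb.weight)   # same weights; conv_b keeps its random bias

bn = torch.nn.BatchNorm2d(6)              # training mode: normalizes with batch statistics

# the per-channel bias shifts mean_x by exactly b, so it cancels inside BN
print(torch.allclose(bn(conv_nb(x)), bn(conv_b(x)), atol=1e-5))  # should print True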

LDM/Stable Diffusion

I have also been working on LDM recently, and one issue I ran into is that my own implementation used BatchNorm, but because of GPU memory limits the batch size could only be set to 2.

So I went back and looked at the CompVis code and found that their batch size is also only 4; presumably they likewise felt that BN is a poor fit for such a small batch size, and they chose GroupNorm as their normalization instead.

LayerNorm

Unlike batchnorm, layernorm normalizes over the dimensions of the input that you specify.

More intuitively: take one sample at a time from the batch (unlike batchnorm, this is not batch-wise), then repeatedly take slices of a size you specify from that sample, normalize each slice, and move on to the next one.

For example, in the snippet below the batch size is 2. With LayerNorm((2,2,2)), each step takes a 2x2x2 tensor from sample 1 and normalizes it; likewise, with normalized_shape (2,2) it takes 2x2 slices to normalize. Once sample 1 is done, it moves on to sample 2 and does the same.

In [1]: import torch

In [2]: import numpy as np

In [3]: x = torch.arange(2*2*2*2, dtype=torch.float).reshape(2,2,2,2)

In [4]: x
Out[4]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]]],


        [[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]])

In [5]: layernorm = torch.nn.LayerNorm((2,2,2))

In [6]: layernorm(x)
Out[6]:
tensor([[[[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]]],


        [[[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]]]], grad_fn=<NativeLayerNormBackward0>)

In [7]: mean = np.mean([0,1,2,3,4,5,6,7])

In [8]: std = np.std([0,1,2,3,4,5,6,7])

In [9]: norm = lambda x:(x-mean)/std

In [10]: for i in [0,1,2,3,4,5,6,7]:
...: print(norm(i), end=',')
...:
-1.5275252316519468,-1.091089451179962,-0.6546536707079772,-0.2182178902359924,0.2182178902359924,0.6546536707079772,1.091089451179962,1.5275252316519468,
In [11]: layernorm = torch.nn.LayerNorm((2,2))

In [12]: layernorm(x)
Out[12]:
tensor([[[[-1.3416, -0.4472],
          [ 0.4472,  1.3416]],

         [[-1.3416, -0.4472],
          [ 0.4472,  1.3416]]],


        [[[-1.3416, -0.4472],
          [ 0.4472,  1.3416]],

         [[-1.3416, -0.4472],
          [ 0.4472,  1.3416]]]], grad_fn=<NativeLayerNormBackward0>)

In [13]: x
Out[13]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]]],


        [[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]])

In [14]: mean = np.mean([0,1,2,3])

In [15]: std = np.std([0,1,2,3])

In [16]: for i in [0,1,2,3]:
...: print(norm(i), end=', ')
...:
-1.3416407864998738, -0.4472135954999579, 0.4472135954999579, 1.3416407864998738,

InstanceNorm

As just described, this is similar to BN with a batch size of 1: take one instance from the batch at a time, fix a channel, and normalize over H and W. Because it normalizes each image individually, it is often used in style transfer.

In [1]: import torch

In [2]: import numpy as np

In [3]: x = torch.arange(2*2*2*2, dtype=torch.float).reshape(2,2,2,2)

In [4]: x
Out[4]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]]],


        [[[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]]])

In [5]: instanceNorm = torch.nn.InstanceNorm2d(2)

In [6]: instanceNorm(x)
Out[6]:
tensor([[[[-1.3416, -0.4472],
          [ 0.4472,  1.3416]],

         [[-1.3416, -0.4472],
          [ 0.4472,  1.3416]]],


        [[[-1.3416, -0.4472],
          [ 0.4472,  1.3416]],

         [[-1.3416, -0.4472],
          [ 0.4472,  1.3416]]]])

In [7]: batchnorm = torch.nn.BatchNorm2d(2)

In [8]: batchnorm(x)
Out[8]:
tensor([[[[-1.3242, -1.0835],
          [-0.8427, -0.6019]],

         [[-1.3242, -1.0835],
          [-0.8427, -0.6019]]],


        [[[ 0.6019,  0.8427],
          [ 1.0835,  1.3242]],

         [[ 0.6019,  0.8427],
          [ 1.0835,  1.3242]]]], grad_fn=<NativeBatchNormBackward0>)

GroupNorm

It sits between InstanceNorm and LayerNorm: InstanceNorm is the special case where the number of groups equals the number of channels, and LayerNorm is the special case where there is a single group.

The official PyTorch docs say as much:

>>> input = torch.randn(20, 6, 10, 10)
>>> # Separate 6 channels into 3 groups
>>> m = nn.GroupNorm(3, 6)
>>> # Separate 6 channels into 6 groups (equivalent with InstanceNorm)
>>> m = nn.GroupNorm(6, 6)
>>> # Put all 6 channels into a single group (equivalent with LayerNorm)
>>> m = nn.GroupNorm(1, 6)
>>> # Activating the module
>>> output = m(input)

Example:

In [1]: import torch

In [2]: import numpy as np

In [3]: x = torch.arange(2*4*2*2, dtype=torch.float).reshape(2,4,2,2)

In [4]: x
Out[4]:
tensor([[[[ 0.,  1.],
          [ 2.,  3.]],

         [[ 4.,  5.],
          [ 6.,  7.]],

         [[ 8.,  9.],
          [10., 11.]],

         [[12., 13.],
          [14., 15.]]],


        [[[16., 17.],
          [18., 19.]],

         [[20., 21.],
          [22., 23.]],

         [[24., 25.],
          [26., 27.]],

         [[28., 29.],
          [30., 31.]]]])

In [5]: groupnorm = torch.nn.GroupNorm(2,4)

In [6]: groupnorm(x)
Out[6]:
tensor([[[[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]],

         [[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]]],


        [[[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]],

         [[-1.5275, -1.0911],
          [-0.6547, -0.2182]],

         [[ 0.2182,  0.6547],
          [ 1.0911,  1.5275]]]], grad_fn=<NativeGroupNormBackward0>)

In [7]: mean = np.mean(range(8))

In [8]: std = np.std(range(8))

In [9]: norm = lambda x:(x-mean)/std

In [10]: for i in range(8):
...: print(norm(i), end=', ')
...:
-1.5275252316519468, -1.091089451179962, -0.6546536707079772, -0.2182178902359924, 0.2182178902359924, 0.6546536707079772, 1.091089451179962, 1.5275252316519468,
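
To close, here is a minimal numerical check of the two special cases mentioned at the start of this section (my sketch, not from the PyTorch docs; default module settings assumed):

import torch

x = torch.arange(2 * 4 * 2 * 2, dtype=torch.float).reshape(2, 4, 2, 2)

# groups == channels: GroupNorm behaves like InstanceNorm2d
print(torch.allclose(torch.nn.GroupNorm(4, 4)(x), torch.nn.InstanceNorm2d(4)(x)))

# a single group: GroupNorm behaves like LayerNorm over (C, H, W)
print(torch.allclose(torch.nn.GroupNorm(1, 4)(x), torch.nn.LayerNorm((4, 2, 2))(x)))

# both should print True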