Introduction to Convolutional Neural Networks

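I don't want to write much about the concept of convolution itself. In my third year of university I took a required course, "Complex Functions and Integral Transforms", which covered various transforms and introduced convolution. Unfortunately, I only learned to compute it by following the formula, without understanding why it is computed that way.
The official MATLAB tutorials also cover convolution, with a fairly clear video, though I no longer remember which tutorial it belonged to; I studied that in my third year as well. Koki Saitoh's 《深度学习入门》 ("Deep Learning from Scratch") explains it step by step with text and figures, very clearly. There is no need for me to copy all of that here.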

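Convolution is, roughly, an operation between two functions. In a neural network, a convolutional layer slides a window (the kernel) over the input and, at each position, multiplies element-wise and sums the products. I think of it as a form of feature extraction. A pooling layer is even simpler: it takes the maximum value in each block to shrink the image. That's about it.
To me this feels more like engineering practice than elegant mathematical theory.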
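To make the "slide, multiply, sum" description concrete, here is a minimal pure-Python sketch. The function name `conv2d_valid` and the sample numbers are my own choices for illustration, not taken from any library. Note that, like deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    At each position, multiply the overlapping values element-wise
    and sum the products (the kernel is not flipped).
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Illustrative 4 x 4 input and 3 x 3 kernel
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [2, 0, 1],
    [0, 1, 2],
    [1, 0, 2],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[15, 16], [6, 15]]
```

A 4 × 4 input and a 3 × 3 kernel leave only 2 × 2 valid window positions, which is why the output is smaller than the input.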

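Different textbooks and libraries do not fully agree on the order of the dimensions in a data shape. One common convention (channels first) writes 3-D data as a multidimensional array in the order (channel, height, width). For example, data with C channels, height H, and width W has the shape (C, H, W). TensorFlow, used below, puts the channels last by default.
Filters follow the same (channel, height, width) order. For example, a filter with C channels, height FH (Filter Height), and width FW (Filter Width) is written as (C, FH, FW).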

Applying FN filters produces FN output feature maps. As 4-D data, the filter weights are written in the order (output_channel, input_channel, height, width).

Now let's look at the TensorFlow implementation.

Yikes, it’s a 4D tensor; we haven’t seen this before! What do all these dimensions mean? Well, there are two sample images, which explains the first dimension. Then each image is 70 × 120, since that’s the size we specified when creating the CenterCrop layer (the original images were 427 × 640). This explains the second and third dimensions. And lastly, each pixel holds one value per color channel, and there are three of them—red, green, and blue—which explains the last dimension.

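When we talk about a 2D convolutional layer, "2D" refers to the number of spatial dimensions (height and width), but as you can see, the layer takes 4D inputs: the two additional dimensions are the batch size (first dimension) and the channels (last dimension).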

If instead we set padding="same", then the inputs are padded with enough zeros on all sides to ensure that the output feature maps end up with the same size as the inputs (hence the name of this option). The sample images used in these notes are loaded as follows:

python


from sklearn.datasets import load_sample_images
import tensorflow as tf

images = load_sample_images()["images"]
images = tf.keras.layers.CenterCrop(height=70, width=120)(images)
images = tf.keras.layers.Rescaling(scale=1 / 255)(images)

(Figure: program diagram)

Setting the stride:
For example, if you set strides=2 (or equivalently strides=(2, 2)), then the output feature maps will be 35 × 60: halved both vertically and horizontally.
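The 35 × 60 figure can be checked with the usual output-size rules. The sketch below is plain Python rather than a Keras call; the helper name `conv_output_size` is hypothetical, and the kernel size of 7 is an assumption taken from the weight shape shown further down in these notes.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension.

    "same":  zero padding is added so the output is ceil(input / stride).
    "valid": no padding, only full windows count.
    """
    if padding == "same":
        return math.ceil(input_size / stride)
    if padding == "valid":
        return (input_size - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


height, width, kernel = 70, 120, 7  # kernel size 7 is an assumption

same_stride_2 = (conv_output_size(height, kernel, 2, "same"),
                 conv_output_size(width, kernel, 2, "same"))
valid_stride_1 = (conv_output_size(height, kernel, 1, "valid"),
                  conv_output_size(width, kernel, 1, "valid"))

print(same_stride_2)   # (35, 60)
print(valid_stride_1)  # (64, 114)
```

So the halving described above is what "same" padding gives with a stride of 2, while "valid" padding loses kernel_size - 1 pixels per dimension even at stride 1.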

The kernels array is 4D, and its shape is [kernel_height, kernel_width, input_channels, output_channels]. The biases array is 1D, with shape [output_channels]. The number of output channels is equal to the number of output feature maps, which is also equal to the number of filters.

>>> kernels, biases = conv_layer.get_weights()
>>> kernels.shape
(7, 7, 3, 32)
>>> biases.shape
(32,)

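Convolutional layers can require a lot of memory, especially during training, because the backward pass needs the intermediate values computed during the forward pass.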
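A rough back-of-the-envelope estimate shows where the memory goes. This is only a sketch under my own assumptions: 32-bit floats, "same" padding with stride 1 (so the feature maps keep the 70 × 120 input size), and the 7 × 7 × 3 × 32 kernel shape shown above. The batch size of 100 is an arbitrary illustration, and the helper names are hypothetical.

```python
def activation_bytes(batch_size, height, width, n_filters, bytes_per_value=4):
    """Bytes needed to store one conv layer's output feature maps."""
    return batch_size * height * width * n_filters * bytes_per_value


def parameter_bytes(kernel_size, in_channels, n_filters, bytes_per_value=4):
    """Bytes needed to store the layer's kernels and biases."""
    n_params = kernel_size * kernel_size * in_channels * n_filters + n_filters
    return n_params * bytes_per_value


two_images = activation_bytes(batch_size=2, height=70, width=120, n_filters=32)
hundred_images = activation_bytes(batch_size=100, height=70, width=120, n_filters=32)
weights = parameter_bytes(kernel_size=7, in_channels=3, n_filters=32)

print(two_images)      # 2150400 bytes, about 2 MB
print(hundred_images)  # 107520000 bytes, about 100 MB
print(weights)         # 18944 bytes
```

The weights are tiny; it is the feature maps, multiplied by the batch size and by the number of layers kept for backpropagation, that dominate.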

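Pooling Layers

An average pooling layer works exactly like a max pooling layer, except that it computes the mean rather than the max. Average pooling layers used to be very popular, but people mostly use max pooling now, as it generally performs better. This may seem surprising, since computing the mean generally loses less information than computing the max. On the other hand, max pooling preserves only the strongest features and gets rid of the meaningless ones, so the next layers get a cleaner signal to work with. Moreover, max pooling offers stronger translation invariance than average pooling, and it requires slightly less compute.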

The following code creates a MaxPooling2D layer, alias MaxPool2D, using a 2 × 2 kernel. The strides default to the kernel size, so this layer uses a stride of 2 (horizontally and vertically). By default, it uses "valid" padding (i.e., no padding at all):

python


max_pool = tf.keras.layers.MaxPool2D(pool_size=2)

To create an average pooling layer, just use AveragePooling2D, alias AvgPool2D, instead of MaxPool2D.
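To see the difference between the two on actual numbers, here is a small pure-Python sketch of 2 × 2 pooling with a stride of 2 and "valid" padding. The helpers `pool2d` and `mean` and the sample feature map are hypothetical, written only for this illustration.

```python
def pool2d(feature_map, pool_size, reduce_fn):
    """Apply reduce_fn to each non-overlapping pool_size x pool_size block
    (stride equal to pool_size, no padding)."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, height - pool_size + 1, pool_size):
        row = []
        for j in range(0, width - pool_size + 1, pool_size):
            block = [feature_map[i + di][j + dj]
                     for di in range(pool_size)
                     for dj in range(pool_size)]
            row.append(reduce_fn(block))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 2],
    [0, 2, 9, 1],
    [4, 6, 3, 3],
]
max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, mean)

print(max_pooled)  # [[7, 4], [6, 9]]
print(avg_pooled)  # [[4.0, 2.0], [3.0, 4.0]]
```

In the bottom-right block, the single strong activation (9) survives max pooling but is diluted to 4.0 by averaging, which is the "cleaner signal" argument in miniature.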

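Note that max pooling and average pooling can be performed along the depth dimension instead of the spatial dimensions, although this is not as common.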

Keras does not include a depthwise max pooling layer, but it’s not too difficult to implement a custom layer for that:

python


class DepthPool(tf.keras.layers.Layer):
    def __init__(self, pool_size=2, **kwargs):
        super().__init__(**kwargs)
        self.pool_size = pool_size

    def call(self, inputs):
        shape = tf.shape(inputs)  # shape[-1] is the number of channels
        groups = shape[-1] // self.pool_size  # number of channel groups
        new_shape = tf.concat([shape[:-1], [groups, self.pool_size]], axis=0)
        return tf.reduce_max(tf.reshape(inputs, new_shape), axis=-1)
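To check my understanding of what the reshape and reduce_max do, here is the same idea for a single pixel in plain Python. The function `depth_max_pool` and the channel values are made up for illustration; unlike the reshape above, this version silently drops leftover channels when the channel count is not a multiple of pool_size.

```python
def depth_max_pool(channels, pool_size=2):
    """Group consecutive channels of one pixel and keep the max of each group."""
    groups = len(channels) // pool_size  # number of channel groups
    return [max(channels[g * pool_size:(g + 1) * pool_size])
            for g in range(groups)]


pixel = [1, 5, 2, 8, 3, 3]  # one pixel with 6 channels

print(depth_max_pool(pixel, pool_size=2))  # [5, 8, 3]
print(depth_max_pool(pixel, pool_size=3))  # [5, 8]
```

The spatial dimensions are untouched; only the number of channels shrinks, by a factor of pool_size.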

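The last type of pooling layer that you will often see in modern architectures is the global average pooling layer. It works very differently: all it does is compute the mean of each entire feature map (it is like an average pooling layer whose pooling kernel has the same spatial dimensions as the inputs). This means that it outputs just a single number per feature map and per instance. Although this is extremely destructive (most of the information in the feature map is lost), it can be useful just before the output layer, as we will see later in these notes. To create such a layer, simply use the tf.keras.layers.GlobalAvgPool2D class: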

python


global_avg_pool = tf.keras.layers.GlobalAvgPool2D()

# It is equivalent to this simple Lambda layer, which computes the mean
# over the spatial dimensions (height and width):
global_avg_pool = tf.keras.layers.Lambda(lambda X: tf.reduce_mean(X, axis=[1, 2]))


# For example:
>>> global_avg_pool(images)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.64338624, 0.5971759 , 0.5824972 ],
    [0.76306933, 0.26011038, 0.10849128]], dtype=float32)>

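CNN Architectures

An architecture is simply the way the layers described above are put together.

Typical architecture

(Figure: typical CNN architecture)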

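A common mistake is to use convolution kernels that are too large. For example, instead of using a convolutional layer with a 5 × 5 kernel, stack two layers with 3 × 3 kernels: this uses fewer parameters, requires fewer computations, and usually performs better. One exception is the first convolutional layer: it can typically have a large kernel (e.g., 5 × 5), usually with a stride of 2 or more. This reduces the spatial dimensions of the image without losing too much information, and since the input image generally has only three channels, it is not too costly.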
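The parameter claim is easy to verify by counting. In the sketch below, the choice of 64 input and 64 output channels is my own assumption for illustration, and `conv_params` is a hypothetical helper. Two stacked 3 × 3 layers with stride 1 see the same 5 × 5 region of the input as one 5 × 5 layer.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 2D convolutional layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


channels = 64  # assumed channel count, kept the same in and out

one_5x5 = conv_params(5, channels, channels)
two_3x3 = 2 * conv_params(3, channels, channels)

print(one_5x5)  # 102464
print(two_3x3)  # 73856
```

That is roughly 28% fewer parameters, with an extra nonlinearity between the two layers as a bonus.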

A simple CNN for tackling the Fashion MNIST dataset:

python


from functools import partial

DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, padding="same",
                        activation="relu", kernel_initializer="he_normal")
model = tf.keras.Sequential([
    # the images are 28 × 28 pixels, with a single color channel
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    tf.keras.layers.MaxPool2D(),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    tf.keras.layers.MaxPool2D(),

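    # Next is the fully connected network. Note that we must flatten its
    # inputs, since a dense network expects a 1D array of features for each
    # instance. We also add two dropout layers, with a dropout rate of 50%
    # each, to reduce overfitting.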
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=128, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=64, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=10, activation="softmax")
])

If you compile this model using the "sparse_categorical_crossentropy" loss and you fit the model to the Fashion MNIST training set, it should reach over 92% accuracy on the test set.
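It helps me to trace how the spatial size changes through the model above. This sketch assumes the Keras defaults as I understand them: every convolution uses "same" padding with stride 1 (so it keeps the size), and each MaxPool2D uses a 2 × 2 window with a stride of 2 and "valid" padding. The variable names are my own.

```python
size = 28           # Fashion MNIST images are 28 x 28
sizes_after_pool = []
for _ in range(3):  # the model has three max pooling layers
    size = (size - 2) // 2 + 1  # 2 x 2 window, stride 2, no padding
    sizes_after_pool.append(size)

last_conv_filters = 256
flattened_features = size * size * last_conv_filters

# First dense layer: one weight per (feature, unit) pair, plus one bias per unit
first_dense_params = flattened_features * 128 + 128

print(sizes_after_pool)    # [14, 7, 3]
print(flattened_features)  # 2304
print(first_dense_params)  # 295040
```

The third pooling layer turns 7 × 7 into 3 × 3 because the last row and column do not fill a complete window and are dropped.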

We will first look at the classical LeNet-5 architecture (1998), then several winners of the ILSVRC challenge: AlexNet (2012), GoogLeNet (2014), ResNet (2015), and SENet (2017). Along the way, we will also look at a few more architectures, including Xception, ResNeXt, DenseNet, MobileNet, CSPNet, and EfficientNet.

LeNet-5

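LeNet-5 was created by Yann LeCun in 1998 and has been widely used for handwritten digit recognition (MNIST). It is the ancestor of CNNs.
Yann LeCun's website has demos of LeNet-5 classifying digits.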

Layer	Type	Maps	Size	Kernel size	Stride	Activation
Out	Fully connected	–	10	–	–	RBF
F6	Fully connected	–	84	–	–	tanh
C5	Convolution	120	1 × 1	5 × 5	1	tanh
S4	Avg pooling	16	5 × 5	2 × 2	2	tanh
C3	Convolution	16	10 × 10	5 × 5	1	tanh
S2	Avg pooling	6	14 × 14	2 × 2	2	tanh
C1	Convolution	6	28 × 28	5 × 5	1	tanh
In	Input	1	32 × 32	–	–	–

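MNIST images are 28 × 28 pixels, but they are zero-padded to 32 × 32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.

The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per feature map) and adds a learnable bias term (again, one per feature map), and finally applies the activation function.

Most neurons in the C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). See table 1 (page 8) in the original paper for details.

The output layer is a bit special: instead of computing the matrix multiplication of the inputs and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross-entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and converging faster.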
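The two unusual pieces described above, the trainable average pooling and the distance-based output, can be sketched for a single neuron in plain Python. The function names and all the numbers are hypothetical; this only illustrates the description, not the original implementation.

```python
import math


def lenet_avg_pool_unit(inputs, coefficient, bias):
    """Mean of the inputs, times a learnable coefficient, plus a learnable
    bias, followed by the tanh activation."""
    mean = sum(inputs) / len(inputs)
    return math.tanh(coefficient * mean + bias)


def rbf_output(inputs, weights):
    """Squared Euclidean distance between the input and weight vectors."""
    return sum((x - w) ** 2 for x, w in zip(inputs, weights))


# One 2 x 2 pooling window, flattened (made-up values)
pooled = lenet_avg_pool_unit([1.0, 2.0, 3.0, 2.0], coefficient=0.5, bias=-1.0)

# One output neuron comparing a 3-value input with its weight vector
distance = rbf_output([1.0, 0.0, -1.0], [1.0, 2.0, 1.0])

print(pooled)    # 0.0, since tanh(0.5 * 2.0 - 1.0) = tanh(0)
print(distance)  # 8.0
```

With this kind of output, a smaller value means the input is closer to that class's weight vector.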

AlexNet

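AlexNet was released in 2012, more than a decade after LeNet-5. It sparked the deep learning boom, although its architecture is not fundamentally different from LeNet's. It was developed by Alex Krizhevsky (hence the name), Ilya Sutskever, and Geoffrey Hinton. It is similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of one another, instead of stacking a pooling layer on top of each convolutional layer.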

Layer	Type	Maps	Size	Kernel size	Stride	Padding	Activation
Out	Fully connected	–	1000	–	–	–	Softmax
F10	Fully connected	–	4096	–	–	–	ReLU
F9	Fully connected	–	4096	–	–	–	ReLU
S8	Max pooling	256	6 × 6	3 × 3	2	valid	–
C7	Convolution	256	13 × 13	3 × 3	1	same	ReLU
C6	Convolution	384	13 × 13	3 × 3	1	same	ReLU
C5	Convolution	384	13 × 13	3 × 3	1	same	ReLU
S4	Max pooling	256	13 × 13	3 × 3	2	valid	–
C3	Convolution	256	27 × 27	5 × 5	1	same	ReLU
S2	Max pooling	96	27 × 27	3 × 3	2	valid	–
C1	Convolution	96	55 × 55	11 × 11	4	valid	ReLU
In	Input	3 (RGB)	227 × 227	–	–	–	–
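The Size column of the table can be reproduced from the Kernel size, Stride, and Padding columns. The sketch below assumes the usual rules ("same" gives ceil(size / stride), "valid" gives floor((size - kernel) / stride) + 1); the helper `output_size` is hypothetical.

```python
import math


def output_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer along one dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid": no padding


# (layer, kernel size, stride, padding), read bottom-up from the table
layers = [
    ("C1", 11, 4, "valid"),
    ("S2", 3, 2, "valid"),
    ("C3", 5, 1, "same"),
    ("S4", 3, 2, "valid"),
    ("C5", 3, 1, "same"),
    ("C6", 3, 1, "same"),
    ("C7", 3, 1, "same"),
    ("S8", 3, 2, "valid"),
]

size = 227  # input images are 227 x 227
sizes = {}
for name, kernel, stride, padding in layers:
    size = output_size(size, kernel, stride, padding)
    sizes[name] = size

print(sizes)
# {'C1': 55, 'S2': 27, 'C3': 27, 'S4': 13, 'C5': 13, 'C6': 13, 'C7': 13, 'S8': 6}
print(sizes["S8"] * sizes["S8"] * 256)  # 9216 values enter the first dense layer
```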

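To reduce overfitting, the authors used two regularization techniques. First, they applied dropout with a 50% dropout rate during training to the outputs of layers F9 and F10. Second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.
Data augmentation artificially increases the size of the training set by generating many variants of each training instance. This reduces overfitting, making it a regularization technique. For example, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set. This forces the model to be more tolerant to variations in the position, orientation, and size of the objects in the pictures. To make the model more tolerant of different lighting conditions, you can similarly generate many images with various contrasts. In general, you can also flip the pictures horizontally (except for text and other asymmetrical objects). By combining these transformations, you can greatly increase the size of the training set.
AlexNet also uses a normalization step immediately after the ReLU step of layers C1 and C3, called local response normalization (LRN): the most strongly activated neurons inhibit other neurons located at the same position in neighboring feature maps (such competitive activation has been observed in biological neurons). This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, which ultimately improves generalization.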

GoogLeNet

The GoogLeNet architecture was developed by Christian Szegedy et al. from Google Research.

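Inception module

(Figure: Inception module)
The notation "3 × 3 + 1(S)" means that the layer uses a 3 × 3 kernel, a stride of 1, and "same" padding. Convolutional layers with 1 × 1 kernels cannot capture spatial patterns, but they can capture patterns along the depth dimension (across channels), and they can reduce dimensionality, which cuts the computational cost.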

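GoogLeNet architecture

The first two layers divide the image's height and width by 4 (so its area is divided by 16) to reduce the computational load. The first layer uses a large kernel size, so much of the information is preserved. Then the local response normalization layer ensures that the previous layers learn a wide variety of features. Two convolutional layers follow, where the first acts like a bottleneck layer. You can think of this pair as a single, smarter convolutional layer. Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns. Next, a max pooling layer reduces the image height and width by a factor of 2, again to speed up computations.

Then comes a tall stack of nine inception modules, interleaved with a couple of max pooling layers to reduce dimensionality and speed up the network. Next, the global average pooling layer outputs the mean of each feature map: this drops any remaining spatial information, which is fine because there is not much spatial information left at that point. Indeed, GoogLeNet input images are typically expected to be 224 × 224 pixels, so after 5 max pooling layers, each dividing the height and width by 2, the feature maps are down to 7 × 7. Moreover, this is a classification task, not localization, so it does not matter where the object is. Thanks to the dimensionality reduction brought by this layer, there is no need to have several fully connected layers at the top of the CNN (as in AlexNet), which considerably reduces the number of parameters in the network and limits the risk of overfitting. The last layers are self-explanatory: dropout for regularization, then a fully connected layer with 1,000 units (since there are 1,000 classes) and a softmax activation function to output the estimated class probabilities.

Google researchers later proposed several variants of the GoogLeNet architecture, including Inception-v3 and Inception-v4, which use slightly different inception modules and reach even better performance.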

ResNet

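Kaiming He et al. won the ILSVRC 2015 challenge using a Residual Network (ResNet), which delivered a top-five error rate under 3.6%. The winning variant used an extremely deep CNN composed of 152 layers (other variants had 34, 50, and 101 layers). It confirmed the general trend: models are getting deeper and deeper, with fewer and fewer parameters. The key to being able to train such a deep network is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located higher up the stack.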

Choosing the right architecture

See: https://keras.io/api/applications/

Implementing ResNet-34 with Keras

Most CNN architectures described so far can be implemented pretty naturally using Keras (although generally you would load a pretrained network instead, as you will see). To illustrate the process, let’s implement a ResNet-34 from scratch with Keras. First, we’ll create a ResidualUnit layer:

python


DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, strides=1,
                        padding="same", kernel_initializer="he_normal",
                        use_bias=False)

class ResidualUnit(tf.keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = tf.keras.activations.get(activation)
        self.main_layers = [
            DefaultConv2D(filters, strides=strides),
            tf.keras.layers.BatchNormalization(),
            self.activation,
            DefaultConv2D(filters),
            tf.keras.layers.BatchNormalization()
        ]
        self.skip_layers = []
        if strides > 1:
            self.skip_layers = [
                DefaultConv2D(filters, kernel_size=1, strides=strides),
                tf.keras.layers.BatchNormalization()
            ]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        return self.activation(Z + skip_Z)


# Next, we can build the ResNet-34 using a Sequential model,
# since it is really just a long sequence of layers
model = tf.keras.Sequential([
    DefaultConv2D(64, kernel_size=7, strides=2, input_shape=[224, 224, 3]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
])
prev_filters = 64
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters

model.add(tf.keras.layers.GlobalAvgPool2D())
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(10, activation="softmax"))

It is amazing that in about 40 lines of code, we can build the model that won the ILSVRC 2015 challenge! This demonstrates both the elegance of the ResNet model and the expressiveness of the Keras API. Implementing the other CNN architectures is a bit longer, but not much harder. However, Keras comes with several of these architectures built in, so why not use them instead?
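As a sanity check on the loop above, the stride schedule and the resulting feature map sizes can be worked out without TensorFlow. This sketch assumes "same" padding everywhere (output size = ceil(input / stride)) and counts layers the conventional way: the first convolution, the two main-path convolutions of each residual unit, and the dense output layer. The variable names are my own.

```python
import math

filter_schedule = [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3

# Same rule as the model-building loop: stride 2 whenever the filter count changes
prev_filters = 64
strides_per_unit = []
for filters in filter_schedule:
    strides_per_unit.append(1 if filters == prev_filters else 2)
    prev_filters = filters

# 1 initial conv + 2 convs per residual unit + 1 dense output layer
n_weight_layers = 1 + 2 * len(filter_schedule) + 1

size = 224
size = math.ceil(size / 2)  # 7 x 7 conv with stride 2
size = math.ceil(size / 2)  # 3 x 3 max pooling with stride 2
for stride in strides_per_unit:
    size = math.ceil(size / stride)

print(strides_per_unit)  # [1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1]
print(n_weight_layers)   # 34
print(size)              # 7
```

So the "34" in the name matches this count, and the global average pooling layer at the end receives 7 × 7 feature maps with 512 channels.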

Using pretrained models from Keras

Pretrained networks can be loaded directly.
For example, you can load the ResNet-50 model, pretrained on ImageNet, with the following line of code:

python


model = tf.keras.applications.ResNet50(weights="imagenet")

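That's all! This creates a ResNet-50 model and downloads weights pretrained on the ImageNet dataset. To use it, you first need to make sure the images have the right size. A ResNet-50 model expects 224 × 224-pixel images (other models may expect other sizes, such as 299 × 299):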

python


images = load_sample_images()["images"]
images_resized = tf.keras.layers.Resizing(height=224, width=224, crop_to_aspect_ratio=True)(images)


# Each model provides a preprocess_input() function that you can use to preprocess your images
inputs = tf.keras.applications.resnet50.preprocess_input(images_resized)

# Now we can use the pretrained model to make predictions:

>>> Y_proba = model.predict(inputs)
>>> Y_proba.shape
(2, 1000)

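# As usual, the output Y_proba is a matrix with one row per image and one
# column per class (in this case, there are 1,000 classes). To display the
# top K predictions, including the class name and the estimated probability
# of each predicted class, use the decode_predictions() function. For each
# image, it returns an array containing the top K predictions, where each
# prediction is represented as an array containing the class identifier,
# its name, and the corresponding confidence score: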
top_K = tf.keras.applications.resnet50.decode_predictions(Y_proba, top=3)
for image_index in range(len(images)):
    print(f"Image #{image_index}")
    for class_id, name, y_proba in top_K[image_index]:
        print(f"  {class_id} - {name:12s} {y_proba:.2%}")

# The output looks like this:

# Image #0
#   n03877845 - palace       54.69%
#   n03781244 - monastery    24.72%
#   n02825657 - bell_cote    18.55%
# Image #1
#   n04522168 - vase         32.66%
#   n11939491 - daisy        17.81%
#   n03530642 - honeycomb    12.06%

As you can see, it is very easy to create a pretty good image classifier using a pretrained model. As you saw in Table 14-3, many other vision models are available in tf.keras.applications, from lightweight and fast models to large and accurate ones.

Pretrained models for transfer learning

python


# Load the data first
import tensorflow_datasets as tfds
# Setting with_info=True also returns information about the dataset.
dataset, info = tfds.load("tf_flowers", as_supervised=True, with_info=True)
dataset_size = info.splits["train"].num_examples  # 3670
class_names = info.features["label"].names  # ["dandelion", "daisy", ...]
n_classes = info.features["label"].num_classes  # 5



# The dataset only has a "train" split, so we split it ourselves
test_set_raw, valid_set_raw, train_set_raw = tfds.load(
    "tf_flowers",
    split=["train[:10%]", "train[10%:25%]", "train[25%:]"],
    as_supervised=True)
# 10% for testing, the next 15% for validation, and the remaining 75% for training


# Next, make sure the inputs have the size the model expects
batch_size = 32
preprocess = tf.keras.Sequential([
    tf.keras.layers.Resizing(height=224, width=224, crop_to_aspect_ratio=True),
    tf.keras.layers.Lambda(tf.keras.applications.xception.preprocess_input)
])
train_set = train_set_raw.map(lambda X, y: (preprocess(X), y))
train_set = train_set.shuffle(1000, seed=42).batch(batch_size).prefetch(1)
valid_set = valid_set_raw.map(lambda X, y: (preprocess(X), y)).batch(batch_size)
test_set = test_set_raw.map(lambda X, y: (preprocess(X), y)).batch(batch_size)

# Now each batch contains 32 images, all of them 224 × 224 pixels, 
# with pixel values ranging from –1 to 1. Perfect!

# Since the dataset is not very large, a bit of data augmentation will certainly help.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip(mode="horizontal", seed=42),
    tf.keras.layers.RandomRotation(factor=0.05, seed=42),
    tf.keras.layers.RandomContrast(factor=0.2, seed=42)
])

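# Next, we load an Xception model pretrained on ImageNet. We exclude the top
# of the network by setting include_top=False: this excludes the global
# average pooling layer and the dense output layer. We then add our own global
# average pooling layer, fed with the output of the base model, followed by a
# dense output layer with one unit per class, using the softmax activation
# function. Finally, we create the Keras model: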
base_model = tf.keras.applications.xception.Xception(weights="imagenet", include_top=False)
avg = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
output = tf.keras.layers.Dense(n_classes, activation="softmax")(avg)
model = tf.keras.Model(inputs=base_model.input, outputs=output)

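# Freeze the weights of the pretrained layers at the beginning of training
for layer in base_model.layers:
    layer.trainable = False
# Since our model uses the base model's layers directly, rather than the
# base_model object itself, setting base_model.trainable = False would have no effect.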

# Finally, we can compile the model and start training:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=3)


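# After training the model for a few epochs, its validation accuracy should
# reach over 80% and then stop making much progress. This means that the top
# layers are now pretty well trained, and we are ready to unfreeze part of the
# base model (here, the layers from index 56 onward; you could also try
# unfreezing all of them) and continue training.
for layer in base_model.layers[56:]:
    layer.trainable = True
# Don't forget to compile the model whenever you freeze or unfreeze layers.
# Also make sure to use a much lower learning rate to avoid damaging the pretrained weights: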
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=10)

The tf.keras.preprocessing.image.ImageDataGenerator class makes it easy to load images from disk and augment them in various ways: you can shift each image, rotate it, rescale it, flip it horizontally or vertically, shear it, or apply any transformation function you want to it. This is very convenient for simple projects. However, a tf.data pipeline is not much more complicated, and it’s generally faster. Moreover, if you have a GPU and you include the preprocessing or data augmentation layers inside your model, they will benefit from GPU acceleration during training.
If you are running in Colab, make sure the runtime is using a GPU: select Runtime → “Change runtime type”, choose “GPU” in the “Hardware accelerator” drop-down menu, then click Save. It’s possible to train the model without a GPU, but it will be terribly slow (minutes per epoch, as opposed to seconds).

Worth following: https://paperswithcode.com/