Introduction to Artificial Neural Networks (ANN)
Artificial neural networks are at the very core of deep learning. They are versatile, powerful, and scalable, which makes them well suited to large and highly complex machine learning tasks, such as classifying billions of images (e.g., Google Images), powering speech recognition services (e.g., Apple's Siri), recommending the best videos to watch to vast numbers of users every day (e.g., YouTube), or learning to beat the world champion at the game of Go (DeepMind's AlphaGo).
For a brief history of neural networks, see the earlier blog post.
Logical computations with neurons (a small sketch follows this list):
- Identity function
- Logical AND
- Logical OR
- XOR
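As a quick illustration (a minimal NumPy sketch, not from the original post; the weights and thresholds are chosen by hand), a single threshold unit can compute AND and OR, but no single unit can compute XOR:
import numpy as np

def tlu(x, w, threshold):
    # a threshold logic unit: outputs 1 if the weighted sum of the inputs reaches the threshold
    return int(np.dot(w, x) >= threshold)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([tlu(x, [1, 1], 2) for x in inputs])  # logical AND -> [0, 0, 0, 1]
print([tlu(x, [1, 1], 1) for x in inputs])  # logical OR  -> [0, 1, 1, 1]
# XOR ([0, 1, 1, 0]) is not linearly separable, so it needs at least two layers of such units.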
感知机 Perceptron
The perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU).
Rosenblatt's perceptron
A perceptron is composed of a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer (i.e., its input neurons), the layer is called a fully connected layer, or dense layer. The inputs of the perceptron are fed to special pass-through neurons called input neurons: they simply output whatever input they are fed.
Computing the output of a fully connected layer:
$h_{W,b}(X) = \phi(XW + b)$
where $X$ is the matrix of input features, with one row per instance and one column per feature; the weight matrix $W$ contains all the connection weights, with one row per input neuron and one column per artificial neuron in the layer; the bias vector $b$ contains all the connection weights between the bias neuron and the artificial neurons, with one bias term per artificial neuron; and $\phi$ is the activation function: when the artificial neurons are TLUs it is a step function (other activation functions are discussed later).
Hebb's rule: when a biological neuron often triggers another neuron, the connection between the two neurons grows stronger; that is, when two neurons fire at the same time, the connection weight between them increases. Perceptron learning rule: the perceptron is fed one training instance at a time, and it makes a prediction for each instance. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction:
$w_{i,j}^{(\text{next step})} = w_{i,j} + \eta \, (y_j - \hat{y}_j) \, x_i$
where $w_{i,j}$ is the connection weight between the $i$-th input neuron and the $j$-th output neuron, $x_i$ is the $i$-th input value of the current training instance, $\hat{y}_j$ is the actual output of the $j$-th output neuron for the current training instance, $y_j$ is the target output of the $j$-th output neuron for the current training instance, and $\eta$ is the learning rate.
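A minimal NumPy sketch of these two formulas (the sizes, the learning rate, and the target value below are made up purely for illustration):
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 3))            # 4 instances, 3 input features
W = rng.normal(size=(3, 2))            # one row per input, one column per artificial neuron
b = np.zeros(2)                        # one bias term per neuron

step = lambda z: (z >= 0).astype(int)  # step function used by TLUs
outputs = step(X @ W + b)              # phi(XW + b), shape (4, 2)

# one perceptron learning step for instance i and output neuron j
eta, i, j = 0.1, 0, 1
y_target = 1                           # hypothetical target output for neuron j
y_hat = outputs[i, j]
W[:, j] += eta * (y_target - y_hat) * X[i]   # w_ij <- w_ij + eta * (y_j - y_hat_j) * x_i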
The decision boundary of each output neuron is linear, so perceptrons are incapable of learning complex patterns (just like logistic regression classifiers). However, if the training instances are linearly separable, Rosenblatt proved that the algorithm converges to a solution. This is called the perceptron convergence theorem.
I have already translated a large part of Rosenblatt's paper; once the translation is finished I will post it on this site.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 0) # Iris setosa
per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)
X_new = [[2, 0.5], [3, 1]]
y_pred = per_clf.predict(X_new) # predicts True and False for these 2 flowers
You may have noticed that the perceptron learning algorithm strongly resembles stochastic gradient descent. In fact, Scikit-Learn's Perceptron class is equivalent to using an SGDClassifier with the following hyperparameters: loss="perceptron", learning_rate="constant", eta0=1 (the learning rate), and penalty=None (no regularization). Note that, contrary to logistic regression classifiers, perceptrons do not output a class probability; they make predictions based on a hard threshold. This is one reason to prefer logistic regression over perceptrons.
The fatal flaw of Rosenblatt's single-layer perceptron: it cannot solve the XOR problem.
The multilayer perceptron
However, some of the limitations of a single-layer perceptron can be eliminated by stacking multiple perceptrons; the resulting ANN is called a multilayer perceptron (MLP).
An MLP is composed of one (pass-through) input layer, one or more layers of TLUs called hidden layers, and one final layer of TLUs called the output layer.
When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN).
The backpropagation algorithm
For many years researchers struggled to find a way to train MLPs, without success. But in 1986 David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper introducing the backpropagation training algorithm, which is still used today. In short, it is gradient descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter. In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it performs a regular gradient descent step, and the whole process is repeated until the network converges to a solution.
- It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an epoch.
- Each mini-batch enters the network through the input layer and is sent to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved, since they are needed for the backward pass.
- Next, the algorithm measures the network's output error (it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
- Then it computes how much each output connection contributed to the error. This is done analytically by applying the chain rule (perhaps the most fundamental rule in calculus), which makes this step fast and precise.
- The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, working backward until it reaches the input layer. As explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network.
- Finally, the algorithm performs a gradient descent step to tweak all the connection weights in the network, using the error gradients it just computed. (A minimal NumPy sketch of these steps follows below.)
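Here is a minimal NumPy sketch of one such forward/backward pass on a single mini-batch, for a tiny one-hidden-layer regression network (all sizes and values are made up for illustration; this is not the Keras implementation used later in the post):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                 # one mini-batch of 32 instances, 4 features
y = rng.normal(size=(32, 1))                 # regression targets

# random initialization (this breaks the symmetry, see the warning below)
W1, b1 = rng.normal(scale=0.1, size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)
lr = 0.01

# forward pass: keep the intermediate results, the backward pass needs them
Z1 = X @ W1 + b1
H = np.maximum(Z1, 0)                        # ReLU hidden layer
y_hat = H @ W2 + b2
loss = np.mean((y_hat - y) ** 2)             # measure the output error (MSE)
print("mini-batch loss:", loss)

# backward pass: chain rule, layer by layer, from the output back to the input
dy = 2 * (y_hat - y) / len(X)
dW2, db2 = H.T @ dy, dy.sum(axis=0)
dH = dy @ W2.T
dZ1 = dH * (Z1 > 0)                          # gradient through the ReLU
dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

# gradient descent step on all parameters
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2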
Warning: It is important to initialize all the hidden layers' connection weights randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical. In other words, despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won't be too smart. If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons.
Activation functions
In order for backprop to work properly, Rumelhart and his colleagues made a key change to the MLP’s architecture: they replaced the step function with the logistic function, σ(z) = 1 / (1 + exp(–z)), also called the sigmoid function. This was essential because the step function contains only flat segments, so there is no gradient to work with (gradient descent cannot move on a flat surface), while the sigmoid function has a well-defined nonzero derivative everywhere, allowing gradient descent to make some progress at every step. In fact, the backpropagation algorithm works well with many other activation functions, not just the sigmoid function.
Sigmoid (logistic) function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
ReLU, the rectified linear unit: $\mathrm{ReLU}(z) = \max(0, z)$
Hyperbolic tangent: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, which outputs values between –1 and 1.
Why do we need activation functions in the first place?
Because chaining linear transformations only gives you another linear transformation: without nonlinearities, even a deep stack of layers is equivalent to a single layer. In theory, a large enough DNN with nonlinear activation functions can approximate any continuous function.
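A tiny NumPy demonstration of this point (a sketch, not from the original post):
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))

# two "layers" without any activation function...
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
two_linear_layers = (X @ W1) @ W2

# ...are exactly equivalent to a single linear layer with weights W1 @ W2
single_layer = X @ (W1 @ W2)
print(np.allclose(two_linear_layers, single_layer))   # True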
MLP regression
Scikit-Learn provides the MLPRegressor class, which can be used to build an MLP regression model:
# The California housing dataset: all features are numerical and there are no missing values
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
housing.data, housing.target, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
X_train_full, y_train_full, random_state=42)
mlp_reg = MLPRegressor(hidden_layer_sizes=[50, 50, 50], random_state=42)
pipeline = make_pipeline(StandardScaler(), mlp_reg)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_valid)
rmse = mean_squared_error(y_valid, y_pred, squared=False) # about 0.505
print(rmse)
We get a validation RMSE of about 0.505, which is comparable to what you would get with a random forest regressor. Not too bad for a first try!
Note that this MLP does not use any activation function for the output layer, so it's free to output any value it wants. This is generally fine, but if you want to guarantee that the output will always be positive, then you should use the ReLU activation function in the output layer, or the softplus activation function, which is a smooth variant of ReLU: softplus(z) = log(1 + exp(z)). Softplus is close to 0 when z is negative, and close to z when z is positive. Finally, if you want to guarantee that the predictions will always fall within a given range of values, then you should use the sigmoid function or the hyperbolic tangent, and scale the targets to the appropriate range: 0 to 1 for sigmoid and –1 to 1 for tanh. Sadly, the MLPRegressor class does not support activation functions in the output layer.
The MLPRegressor class uses the mean squared error, which is usually what you want for regression, but if you have a lot of outliers in the training set, you may prefer to use the mean absolute error instead. Alternatively, you may want to use the Huber loss, which is a combination of both. It is quadratic when the error is smaller than a threshold δ (typically 1) but linear when the error is larger than δ. The linear part makes it less sensitive to outliers than the mean squared error, and the quadratic part allows it to converge faster and be more precise than the mean absolute error. However, MLPRegressor only supports the MSE.
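Keras (used later in this post) does support these options. As a hedged sketch only (the layer sizes and the 8-feature input shape are illustrative, not taken from the text above), a regression model with a softplus output and the Huber loss might look like this:
import tensorflow as tf

# a small regression model with a softplus output (guaranteed positive predictions)
# and the Huber loss (less sensitive to outliers than the MSE)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1, activation="softplus")
])
model.compile(loss=tf.keras.losses.Huber(delta=1.0), optimizer="adam",
              metrics=["RootMeanSquaredError"])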
Hyperparameter | Typical value |
---|---|
# hidden layers | Depends on the problem, but typically 1 to 5 |
# neurons per hidden layer | Depends on the problem, but typically 10 to 100 |
# output neurons | 1 per prediction dimension |
Hidden activation | ReLU |
Output activation | None, or ReLU/softplus (if positive outputs) or sigmoid/tanh (if bounded outputs) |
Loss function | MSE, or Huber if outliers |
MLP classification
MLPs can also be used for classification tasks. For a binary classification problem, you just need a single output neuron using the sigmoid activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. The estimated probability of the negative class is equal to one minus that number.
MLPs can also easily handle multilabel binary classification tasks. For example, you could have an email classification system that predicts whether each incoming email is ham or spam, and simultaneously predicts whether it is an urgent or nonurgent email. In this case, you would need two output neurons, both using the sigmoid activation function: the first would output the probability that the email is spam, and the second would output the probability that it is urgent. More generally, you would dedicate one output neuron for each positive class. Note that the output probabilities do not necessarily add up to 1. This lets the model output any combination of labels: you can have nonurgent ham, urgent ham, nonurgent spam, and perhaps even urgent spam (although that would probably be an error).
If each instance can belong only to a single class, out of three or more possible classes (e.g., classes 0 through 9 for digit image classification), then you need to have one output neuron per class, and you should use the softmax activation function for the whole output layer. The softmax function will ensure that all the estimated probabilities are between 0 and 1 and that they add up to 1, since the classes are exclusive. This is multiclass classification. Regarding the loss function, since we are predicting probability distributions, the cross-entropy loss (or x-entropy or log loss for short) is generally a good choice.
Hyperparameter | Binary classification | Multilabel binary classification | Multiclass classification |
---|---|---|---|
# hidden layers | Typically 1 to 5 layers, depending on the task | ||
# output neurons | 1 | 1 per binary label | 1 per class |
Output layer activation | Sigmoid | Sigmoid | Softmax |
Loss function | X-entropy | X-entropy | X-entropy |
A website worth exploring: TensorFlow Playground, https://playground.tensorflow.org/
Implementing MLPs with Keras
The Fashion MNIST dataset
This dataset is a drop-in replacement for MNIST (handwritten digit recognition). It contains 70,000 grayscale images of 28×28 pixels each, in 10 classes; the images represent fashion items, which makes the task a bit harder than handwritten digit recognition. When I tried to load the dataset through the API provided by TensorFlow, it failed. The code was:
import tensorflow as tf
# You may need a VPN/proxy to download the data; otherwise the download can fail
fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
# confirm the shapes of the data
assert X_train_full.shape == (60000, 28, 28)
assert X_test.shape == (10000, 28, 28)
assert y_train_full.shape == (60000,)
assert y_test.shape == (10000,)
The program raised an error and the data could not be downloaded, so I explored two workarounds.
Option 1:
Visit the dataset's official website, download the data, and import it the way the official instructions describe.
Download the data:
git clone git@github.com:zalandoresearch/fashion-mnist.git
After the download completes, the data can be found in the ./fashion-mnist/data/fashion/ directory.
# Note: tensorflow.examples.tutorials is only available in TensorFlow 1.x
from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets('data/fashion')
Option 2: download from this site. I have already packaged and compressed the files and uploaded them here; click here to download.
Unzip the files and put them in a suitable location, and you are done.
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
mypath = "E:\\fashion-mnist-demo.csv" #改为适当的路径,注意绝对路径和相对路径的写法
data = pd.read_csv(mypath)
#关于 DataFrame 数据结构,参阅前面的博客
x_train, x_test, y_train, y_test = train_test_split(data.loc[:,'pixel1':'pixel784'].to_numpy(),data['label'].to_numpy(), random_state=2)
# 查看数据,可以看到第3条数据是个鞋子
plt.imshow(x_train[2].reshape(28,28))
import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data
# point this to the appropriate directory
data = input_data.read_data_sets('data/fashion-mnist/data/fashion/')
# view the image: you can see a shoe
plt.imshow(data.train.images[2].reshape(28,28))
# extract the arrays
X_train_full = data.train.images
y_train_full = data.train.labels
X_test = data.test.images
y_test = data.test.labels
X_valid = data.validation.images
y_valid = data.validation.labels
# check the dataset summary
print(data.test.num_examples) # 10000
print(data.train.num_examples) # 55000
print(data.validation.num_examples) # 5000
print(type(X_train_full)) # <class 'numpy.ndarray'>
print(X_train_full.shape) # (55000, 784)
# class label names
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
"Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
# shows 'Sneaker'
class_names[y_train_full[2]]
Using the Sequential API
import tensorflow as tf
tf.random.set_seed(42)
# Create a Sequential model: the simplest kind of Keras model for neural networks, made of a single stack of layers connected sequentially.
model = tf.keras.Sequential()
# Keras needs to know the shape of the inputs
model.add(tf.keras.layers.Input(shape=[784,]))
# optional here, since the inputs are already flat
model.add(tf.keras.layers.Flatten())
# add a Dense hidden layer with 300 neurons and the ReLU activation function; each Dense layer manages its own weight matrix
model.add(tf.keras.layers.Dense(300, activation="relu"))
# add another hidden layer
model.add(tf.keras.layers.Dense(100, activation="relu"))
# output layer
model.add(tf.keras.layers.Dense(10, activation="softmax"))
The code above is equivalent to:
import tensorflow as tf
tf.random.set_seed(42)
model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape=[784,]),
tf.keras.layers.Dense(300, activation="relu"),
tf.keras.layers.Dense(100, activation="relu"),
tf.keras.layers.Dense(10, activation="softmax")
])
Once the model is created, you can display all of its layers with the summary() method:
>>> model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten (Flatten) (None, 784) 0
dense (Dense) (None, 300) 235500
dense_1 (Dense) (None, 100) 30100
dense_2 (Dense) (None, 10) 1010
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
You can inspect the model's layers, and you can give layers explicit names:
>>> model.layers
[<keras.layers.reshaping.flatten.Flatten at 0x249212bd748>,
<keras.layers.core.dense.Dense at 0x2492132b708>,
<keras.layers.core.dense.Dense at 0x2492116b748>,
<keras.layers.core.dense.Dense at 0x24922efd6c8>]
hidden1 = model.layers[1]
hidden1.name # 'dense'
model.get_layer('dense') is hidden1 # True
weights, biases = hidden1.get_weights()
# print(weights) # the Dense layer initialized the connection weights randomly
print(weights.shape) # (784, 300)
# print(biases) # the biases are initialized to zeros; you can change the initializers when creating the layer, see https://keras.io/api/layers/initializers/
print(biases.shape) # (300,)
Note: the shape of the weight matrix depends on the number of inputs, which is why we specified the input shape when creating the layer. Specifying it is optional: if you omit it, Keras will simply wait until it knows the input shape before it actually builds the model, and until then you cannot display the model's summary.
model.compile(loss="sparse_categorical_crossentropy",
optimizer="sgd",
metrics=["accuracy"])
Since the labels are sparse (i.e., for each instance there is just one target class index, from 0 to 9 in this case) and the classes are exclusive, we use the sparse_categorical_crossentropy loss. If instead we had one target probability per class for each instance (such as one-hot encoded labels), we would use the categorical_crossentropy loss. If we were doing binary classification (with one or more binary labels), the output layer would use the sigmoid activation function instead of softmax, and we would use the binary_crossentropy loss.
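For example (a small illustrative sketch, assuming the same 10 classes as above), the same labels in sparse and one-hot form:
import numpy as np
import tensorflow as tf

sparse_labels = np.array([9, 2, 1])                                   # class indices
onehot_labels = tf.keras.utils.to_categorical(sparse_labels, num_classes=10)
# with one-hot labels you would compile with the categorical cross-entropy instead:
# model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])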
When using the SGD optimizer, it is important to tune the learning rate. So you will generally want to use optimizer=tf.keras.optimizers.SGD(learning_rate=???) to set the learning rate, rather than optimizer="sgd", which defaults to a learning rate of 0.01.
Finally, since this is a classifier, it’s useful to measure its accuracy during training and evaluation, which is why we set metrics=["accuracy"].
Training and evaluating the model
history = model.fit(X_train_full, y_train_full, epochs=30, validation_data=(X_valid, y_valid))
You should see output like the following:
Epoch 1/30
1719/1719 [==============================] - 4s 2ms/step - loss: 0.7180 - accuracy: 0.7641 - val_loss: 0.5366 - val_accuracy: 0.8164
Epoch 2/30
1719/1719 [==============================] - 4s 2ms/step - loss: 0.4900 - accuracy: 0.8288 - val_loss: 0.4428 - val_accuracy: 0.8486
Epoch 3/30
1719/1719 [==============================] - 4s 2ms/step - loss: 0.4460 - accuracy: 0.8427 - val_loss: 0.5538 - val_accuracy: 0.7920
...
Epoch 29/30
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2333 - accuracy: 0.9163 - val_loss: 0.3234 - val_accuracy: 0.8812
Epoch 30/30
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2298 - accuracy: 0.9175 - val_loss: 0.3122 - val_accuracy: 0.8884
We pass it the input features (X_train) and the target classes (y_train), as well as the number of epochs to train (or else it would default to just 1, which would definitely not be enough to converge to a good solution). We also pass a validation set (this is optional). Keras will measure the loss and the extra metrics on this set at the end of each epoch, which is very useful to see how well the model really performs. If the performance on the training set is much better than on the validation set, your model is probably overfitting the training set, or there is a bug, such as a data mismatch between the training set and the validation set.
Instead of passing a validation set using the validation_data argument, you could set validation_split to the ratio of the training set that you want Keras to use for validation. For example, validation_split=0.1 tells Keras to use the last 10% of the data (before shuffling) for validation.
If the training set is very skewed, with some classes being overrepresented and others underrepresented, it is useful to set the class_weight argument when calling the fit() method, to give a larger weight to underrepresented classes and a lower weight to overrepresented classes. These weights would be used by Keras when computing the loss. If you need per-instance weights, set the sample_weight argument. If both class_weight and sample_weight are provided, then Keras multiplies them. Per-instance weights could be useful, for example, if some instances were labeled by experts while others were labeled using a crowdsourcing platform: you might want to give more weight to the former. You can also provide sample weights (but not class weights) for the validation set by adding them as a third item in the validation_data tuple.
The fit() method returns a History object containing the training parameters (history.params), the list of epochs it went through (history.epoch), and most importantly a dictionary (history.history) containing the loss and extra metrics it measured at the end of each epoch on the training set and on the validation set (if any). You can use this dictionary to plot the learning curves.
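For instance, here is a hedged sketch of computing inverse-frequency class weights for the training labels used above and passing them to fit() (Fashion MNIST is actually balanced, so this is purely illustrative):
import numpy as np

classes, counts = np.unique(y_train_full, return_counts=True)
class_weight = {int(c): len(y_train_full) / (len(classes) * n)   # rarer classes get larger weights
                for c, n in zip(classes, counts)}
history = model.fit(X_train_full, y_train_full, epochs=30,
                    validation_data=(X_valid, y_valid),
                    class_weight=class_weight)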
Plotting the learning curves:
import pandas as pd
pd.DataFrame(history.history).plot(
figsize=(8, 5), xlim=[0, 29], ylim=[0, 1], grid=True, xlabel="Epoch",
style=["r--", "r--.", "b-", "b-*"])
plt.show()
There is no sign of overfitting here.
In this particular case, the model seems to perform better on the validation set than on the training set at the beginning of training. But that is not actually the case: the validation error is computed at the end of each epoch, while the training error is computed using a running mean during each epoch. So the training curve should be shifted by half an epoch to the left. If you do that, you will see that the training and validation curves overlap almost perfectly at the beginning of training.
If you are not satisfied with the performance of your model, you should go back and tune the hyperparameters. The first one to check is the learning rate. If that doesn't help, try another optimizer (and retune the learning rate after changing any hyperparameter). If the performance is still not great, then try tuning model hyperparameters such as the number of layers, the number of neurons per layer, and the types of activation functions to use for each hidden layer. You can also try tuning other hyperparameters, such as the batch size (it can be set in the fit() method using the batch_size argument, which defaults to 32). We will get back to hyperparameter tuning at the end of this post. Once you are satisfied with your model's validation accuracy, you should evaluate it on the test set to estimate the generalization error before you deploy the model to production. You can easily do this using the evaluate() method (it also supports several other arguments, such as batch_size and sample_weight; please check the documentation for more details). In short, the main knobs are: learning rate, optimizer, number of layers, number of neurons per layer, activation function of each hidden layer, and batch size.
>>> model.evaluate(X_test, y_test)
313/313 [==============================] - 1s 2ms/step - loss: 0.3370 - accuracy: 0.8811
[0.33702415227890015, 0.8810999989509583]
Remember to resist the temptation to tweak the hyperparameters on the test set, or else your estimate of the generalization error will be too optimistic.
Next, we can use the predict() method to make predictions on new instances:
X_new = X_test[:3]
y_proba = model.predict(X_new)
print(y_proba.round(2)) # round to 2 decimal places
Output:
[[0. 0. 0. 0. 0. 0.01 0. 0.01 0. 0.98]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. ]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. ]]
For each instance, the model estimates one probability per class, from class 0 to class 9.
If you only care about the class with the highest estimated probability (even if that probability is quite low), you can use the argmax() method to get the highest-probability class index for each instance:
import numpy as np
y_pred = y_proba.argmax(axis=-1)
print(y_pred) # [9 2 1]
print(np.array(class_names)[y_pred]) # ['Ankle boot' 'Pullover' 'Trouser']
print(y_test[:3]) # [9 2 1]
ok!
Now you know how to use the sequential API to build, train, and evaluate a classification MLP. But what about regression?
MLP regression with Keras
The main differences for regression are that the output layer has a single neuron (since we only want to predict a single value) and uses no activation function, and that the loss function is the mean squared error. Since the dataset is quite noisy, we use hidden layers with fewer neurons than in the classification model, to help avoid overfitting.
The California housing dataset (no categorical features, no missing values):
import tensorflow as tf
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)
# the metric is the RMSE
# we’re using an Adam optimizer like Scikit-Learn’s MLPRegressor did.
# Moreover, in this example we don’t need a Flatten layer,
# and instead we’re using a Normalization layer as the first layer: it does the same thing as Scikit-Learn’s StandardScaler,
# but it must be fitted to the training data using its adapt() method before you call the model’s fit() method.
tf.random.set_seed(42)
norm_layer = tf.keras.layers.Normalization(input_shape=X_train_full.shape[1:])
model = tf.keras.Sequential([
norm_layer,
tf.keras.layers.Dense(50, activation="relu"),
tf.keras.layers.Dense(50, activation="relu"),
tf.keras.layers.Dense(50, activation="relu"),
tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
norm_layer.adapt(X_train)
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
mse_test, rmse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_pred = model.predict(X_new)
More complex models
Wide & Deep networks
The paper by Heng-Tze Cheng et al., "Wide & Deep Learning for Recommender Systems", Proceedings of the First Workshop on Deep Learning for Recommender Systems (2016): 7–10, introduced the Wide & Deep neural network architecture. It connects all or part of the inputs directly to the output layer, which makes it possible for the network to learn both deep patterns (using the deep path) and simple rules (through the short path).
A Wide & Deep network:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)
tf.random.set_seed(42)
# a Normalization layer to standardize the inputs
normalization_layer = tf.keras.layers.Normalization()
# two Dense layers with 30 neurons each, using the ReLU activation function
hidden_layer1 = tf.keras.layers.Dense(30, activation="relu")
hidden_layer2 = tf.keras.layers.Dense(30, activation="relu")
# a Concatenate layer
concat_layer = tf.keras.layers.Concatenate()
# one more Dense layer with a single neuron for the output layer, without any activation function.
output_layer = tf.keras.layers.Dense(1)
# we create an Input object (the variable name input_ is used to avoid overshadowing Python’s built-in input() function)
input_ = tf.keras.layers.Input(shape=X_train_full.shape[1:])
normalized = normalization_layer(input_)
hidden1 = hidden_layer1(normalized)
hidden2 = hidden_layer2(hidden1)
concat = concat_layer([normalized, hidden2])
output = output_layer(concat)
model = tf.keras.Model(inputs=[input_], outputs=[output])
Once you have built this Keras model, everything is exactly like earlier, so there’s no need to repeat it here: you compile the model, adapt the Normalization layer, fit the model, evaluate it, and use it to make predictions.
But what if you want to send a subset of the features through the wide path and a different subset (possibly overlapping) through the deep path?
Multiple inputs
A model with multiple inputs:
# since there are two inputs
input_wide = tf.keras.layers.Input(shape=[5]) # features 0 to 4
input_deep = tf.keras.layers.Input(shape=[6]) # features 2 to 7
norm_layer_wide = tf.keras.layers.Normalization()
norm_layer_deep = tf.keras.layers.Normalization()
norm_wide = norm_layer_wide(input_wide)
norm_deep = norm_layer_deep(input_deep)
hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
# creates a Concatenate layer and calls it with the given inputs
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
output = tf.keras.layers.Dense(1)(concat)
model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output])
# Now we can compile the model as usual,
# but when we call the fit() method, instead of passing a single input matrix X_train,
# we must pass a pair of matrices (X_train_wide, X_train_deep), one per input.
# The same is true for X_valid, and also for X_test and X_new when you call evaluate() or predict():
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
X_train_wide, X_train_deep = X_train[:, :5], X_train[:, 2:]
X_valid_wide, X_valid_deep = X_valid[:, :5], X_valid[:, 2:]
X_test_wide, X_test_deep = X_test[:, :5], X_test[:, 2:]
X_new_wide, X_new_deep = X_test_wide[:3], X_test_deep[:3]
norm_layer_wide.adapt(X_train_wide)
norm_layer_deep.adapt(X_train_deep)
history = model.fit((X_train_wide, X_train_deep), y_train, epochs=20,
validation_data=((X_valid_wide, X_valid_deep), y_valid))
mse_test = model.evaluate((X_test_wide, X_test_deep), y_test)
y_pred = model.predict((X_new_wide, X_new_deep))
Instead of passing a tuple (X_train_wide, X_train_deep), you can pass a dictionary {"input_wide": X_train_wide, "input_deep": X_train_deep}, if you set name="input_wide" and name="input_deep" when creating the inputs. This is highly recommended when there are many inputs, to clarify the code and avoid getting the order wrong.
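For instance, a minimal sketch of what that looks like (only the input definitions and the fit() call change):
input_wide = tf.keras.layers.Input(shape=[5], name="input_wide")
input_deep = tf.keras.layers.Input(shape=[6], name="input_deep")
# ... build the rest of the model exactly as before ...
history = model.fit({"input_wide": X_train_wide, "input_deep": X_train_deep}, y_train,
                    epochs=20,
                    validation_data=({"input_wide": X_valid_wide,
                                      "input_deep": X_valid_deep}, y_valid))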
Multiple outputs
Here are a few examples of when you may need multiple outputs:
- You may want to locate and classify the main object in a picture. This is both a regression task (finding the coordinates of the object's center, as well as its width and height) and a classification task.
- You may have multiple independent tasks based on the same data. Sure, you could train one neural network per task, but in many cases you will get better results on all tasks by training a single neural network with one output per task. This is because the neural network can learn features in the data that are useful across tasks. For example, you could perform multitask classification on pictures of faces, using one output to classify the person's facial expression (smiling, surprised, etc.) and another output to identify whether they are wearing glasses.
- Another use case is as a regularization technique (i.e., a training constraint whose objective is to reduce overfitting and thus improve the model's ability to generalize). For example, you may want to add an auxiliary output to the network architecture to ensure that the underlying part of the network learns something useful on its own, without relying on the rest of the network.
Handling multiple outputs: in this example we add an auxiliary output for regularization.
Adding extra outputs is quite easy: just connect them to the appropriate layers and add them to your model's list of outputs:
[...] # Same as above, up to the main output layer
output = tf.keras.layers.Dense(1)(concat)
aux_output = tf.keras.layers.Dense(1)(hidden2)
model = tf.keras.Model(inputs=[input_wide, input_deep],
outputs=[output, aux_output])
Each output needs its own loss function, so when we compile the model we should pass a list of losses (a dictionary can also be used to name them). If we pass a single loss, Keras will assume that the same loss must be used for all outputs. By default, Keras computes all these losses and simply adds them up to get the final loss used for training. Since we care much more about the main output than about the auxiliary output (it is only there for regularization), we want to give the main output's loss a much greater weight; we can set all the loss weights when compiling the model.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss=("mse", "mse"), loss_weights=(0.9, 0.1), optimizer=optimizer,metrics=["RootMeanSquaredError"])
# or use dictionaries: loss={"output": "mse", "aux_output": "mse"}, and likewise for loss_weights
When we train the model, we need to provide labels for each output. In this example, the main output and the auxiliary output should try to predict the same thing, so they should use the same labels.
So instead of passing y_train, we need to pass (y_train, y_train), or a dictionary {"output": y_train, "aux_output": y_train} if the outputs were named "output" and "aux_output". The same goes for y_valid and y_test:
norm_layer_wide.adapt(X_train_wide)
norm_layer_deep.adapt(X_train_deep)
history = model.fit((X_train_wide, X_train_deep), (y_train, y_train), epochs=20,
                    validation_data=((X_valid_wide, X_valid_deep), (y_valid, y_valid)))
When we evaluate the model, Keras returns the weighted sum of the losses, as well as all the individual losses and metrics:
eval_results = model.evaluate((X_test_wide, X_test_deep), (y_test, y_test))
weighted_sum_of_losses, main_loss, aux_loss, main_rmse, aux_rmse = eval_results
# If you set return_dict=True, then evaluate() will return a dictionary instead of a big tuple.
Similarly, the predict() method will return predictions for each output:
y_pred_main, y_pred_aux = model.predict((X_new_wide, X_new_deep))
The predict() method returns a tuple, and it does not have a return_dict argument to get a dictionary instead. However, you can create one using model.output_names:
y_pred_tuple = model.predict((X_new_wide, X_new_deep))
y_pred = dict(zip(model.output_names, y_pred_tuple))
Subclassing API
Both the sequential API and the functional API are declarative: you start by declaring which layers you want to use and how they should be connected, and only then can you start feeding the model some data for training or inference. This has many advantages: the model can easily be saved, cloned, and shared; its structure can be displayed and analyzed; the framework can infer shapes and check types, so errors can be caught early (i.e., before any data ever goes through the model). It's also fairly straightforward to debug, since the whole model is a static graph. But the flip side is just that: it's static. Some models involve loops, varying shapes, conditional branching, and other dynamic behaviors. For such cases, or simply if you prefer a more imperative programming style, the subclassing API is for you.
Simply subclass the Model class, create the layers you need in the constructor, and use them to perform the computations you want in the call() method. For example, creating an instance of the following WideAndDeepModel class gives us a model equivalent to the one we just built with the functional API. You can then compile it, evaluate it, and use it to make predictions, exactly as we just did:
class WideAndDeepModel(tf.keras.Model):
def __init__(self, units=30, activation="relu", **kwargs):
super().__init__(**kwargs) # needed to support naming the model
self.norm_layer_wide = tf.keras.layers.Normalization()
self.norm_layer_deep = tf.keras.layers.Normalization()
self.hidden1 = tf.keras.layers.Dense(units, activation=activation)
self.hidden2 = tf.keras.layers.Dense(units, activation=activation)
self.main_output = tf.keras.layers.Dense(1)
self.aux_output = tf.keras.layers.Dense(1)
def call(self, inputs):
input_wide, input_deep = inputs
norm_wide = self.norm_layer_wide(input_wide)
norm_deep = self.norm_layer_deep(input_deep)
hidden1 = self.hidden1(norm_deep)
hidden2 = self.hidden2(hidden1)
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
output = self.main_output(concat)
aux_output = self.aux_output(hidden2)
return output, aux_output
model = WideAndDeepModel(30, activation="relu", name="my_cool_model")
Now that we have a model instance, we can compile it, adapt its normalization layers (e.g., using model.norm_layer_wide.adapt(...) and model.norm_layer_deep.adapt(...)), fit it, evaluate it, and use it to make predictions, exactly like we did with the functional API.
This example looks very much like the functional API, except we do not need to create the inputs; we just use the inputs argument of the call() method, and we separate the creation of the layers in the constructor from their use in the call() method. The big difference is that you can do pretty much anything you want in the call() method: for loops, if statements, low-level TensorFlow operations, and more. This makes it a great API for researchers experimenting with new ideas.
However, this extra flexibility does come at a cost: your model's architecture is hidden within the call() method, so Keras cannot easily inspect it; the model cannot be cloned using tf.keras.models.clone_model(); and when you call the summary() method, you only get a list of layers, without any information on how they are connected to each other. Moreover, Keras cannot check types and shapes ahead of time, and it is easier to make mistakes. So unless you really need that extra flexibility, you should probably stick to the sequential API or the functional API.
Saving and loading a model
Saving a trained Keras model is as simple as it gets:
model.save("my_keras_model", save_format="tf")
When you set save_format="tf", Keras saves the model using TensorFlow’s SavedModel format: this is a directory (with the given name) containing several files and subdirectories. In particular, the saved_model.pb file contains the model’s architecture and logic in the form of a serialized computation graph, so you don’t need to deploy the model’s source code in order to use it in production; the SavedModel is sufficient. The keras_metadata.pb file contains extra information needed by Keras. The variables subdirectory contains all the parameter values (including the connection weights, the biases, the normalization statistics, and the optimizer’s parameters), possibly split across multiple files if the model is very large. Lastly, the assets directory may contain extra files, such as data samples, feature names, class names, and so on. By default, the assets directory is empty. Since the optimizer is also saved, including its hyperparameters and any state it may have, after loading the model you can continue training if you want.
save_format="tf" ,This is currently the default, but the Keras team is working on a new format that may become the default in upcoming versions, so I prefer to set the format explicitly to be future-proof.
If you set save_format="h5" or use a filename that ends with .h5, .hdf5, or .keras, then Keras will save the model to a single file using a Keras-specific format based on the HDF5 format. However, most TensorFlow deployment tools require the SavedModel format instead.
Loading the model:
model = tf.keras.models.load_model("my_keras_model")
y_pred_main, y_pred_aux = model.predict((X_new_wide, X_new_deep))
You can also use save_weights() and load_weights() to save and load only the parameter values. This includes the connection weights, biases, preprocessing stats, optimizer state, etc. The parameter values are saved in one or more files such as my_weights.data-00004-of-00052, plus an index file like my_weights.index.
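For example (using the model built above; the file prefix is arbitrary):
model.save_weights("my_weights")   # writes my_weights.index plus one or more data files
# ... later, on a freshly built model with the exact same architecture:
model.load_weights("my_weights")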
Saving just the weights is faster and uses less disk space than saving the whole model, so it's perfect for saving quick checkpoints during training. If you're training a big model and it takes hours or days, then you must save checkpoints regularly in case the computer crashes. But how can you tell the fit() method to save checkpoints? Use callbacks.
Callbacks
The fit() method accepts a callbacks argument that lets you specify a list of objects that Keras will call before and after training, before and after each epoch, and even before and after processing each batch.
For example, the ModelCheckpoint callback saves checkpoints of your model at regular intervals during training, by default at the end of each epoch:
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_checkpoints", save_weights_only=True)
history = model.fit([...], callbacks=[checkpoint_cb])
Moreover, if you use a validation set during training, you can set save_best_only=True when creating the ModelCheckpoint. In this case, it will only save your model when its performance on the validation set is the best so far. This way, you do not need to worry about training for too long and overfitting the training set: simply restore the last model saved after training, and it will be the best model on the validation set.
Another way to implement early stopping is to use the EarlyStopping callback. It will interrupt training when it measures no progress on the validation set for a number of epochs (defined by the patience argument), and it can optionally roll back to the best model. You can combine both callbacks to save checkpoints of your model (in case the computer crashes) and interrupt training early when there is no more progress (to avoid wasting time and resources):
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
history = model.fit([...], callbacks=[checkpoint_cb, early_stopping_cb])
The number of epochs can be set to a large value since training will stop automatically when there is no more progress (just make sure the learning rate is not too small, or else it might keep making slow progress until the end). The EarlyStopping callback will store the weights of the best model in RAM, and it will restore them for you at the end of training.
The keras.callbacks package contains many other callbacks.
You can also write custom callbacks as needed. For example, the following custom callback displays the ratio between the validation loss and the training loss during training (e.g., to detect overfitting):
class PrintValTrainRatioCallback(tf.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs):
ratio = logs["val_loss"] / logs["loss"]
print(f"Epoch={epoch}, val/train={ratio:.2f}")
As you might expect, you can implement on_train_begin(), on_train_end(), on_epoch_begin(), on_epoch_end(), on_batch_begin(), and on_batch_end(). Callbacks can also be used during evaluation and predictions, should you ever need them (e.g., for debugging). For evaluation, you should implement on_test_begin(), on_test_end(), on_test_batch_begin(), or on_test_batch_end(), which are called by evaluate(). For prediction, you should implement on_predict_begin(), on_predict_end(), on_predict_batch_begin(), or on_predict_batch_end(), which are called by predict().
Visualization with TensorBoard
TensorBoard is an excellent interactive visualization tool that you can use to view the learning curves during training, compare learning curves across multiple runs, visualize the computation graph, analyze training statistics, view images generated by your model, project complex multidimensional data down to 3D with automatic clustering, and more.
TensorBoard is installed automatically when you install TensorFlow.
If you are using Colab, you need to run the following command:
%pip install -q -U tensorboard-plugin-profile
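A minimal sketch of hooking TensorBoard into training (the log directory name is arbitrary):
import tensorflow as tf

tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="./my_logs/run_001")
history = model.fit(X_train_full, y_train_full, epochs=30,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb])
# in a notebook you can then run:
# %load_ext tensorboard
# %tensorboard --logdir ./my_logs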
Fine-tuning neural network hyperparameters
The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network architecture, but even in a basic MLP you can change the number of layers, the number of neurons and the type of activation function to use in each layer, the weight initialization logic, the type of optimizer to use, its learning rate, the batch size, and more. How do you know what combination of hyperparameters is the best for your task?
Simple search
You can explore the hyperparameter space with GridSearchCV or RandomizedSearchCV, as before; a minimal sketch follows.
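For example, a hedged sketch using RandomizedSearchCV with Scikit-Learn's MLPRegressor on the California housing split from earlier (the parameter ranges are just illustrative):
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), MLPRegressor(random_state=42))
param_distribs = {
    "mlpregressor__hidden_layer_sizes": [(50,), (50, 50), (50, 50, 50)],
    "mlpregressor__learning_rate_init": loguniform(1e-4, 1e-2),
}
rnd_search = RandomizedSearchCV(pipeline, param_distribs, n_iter=10, cv=3,
                                scoring="neg_root_mean_squared_error", random_state=42)
rnd_search.fit(X_train, y_train)
print(rnd_search.best_params_)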
Using the Keras Tuner library
However, there’s a better way: you can use the Keras Tuner library, which is a hyperparameter tuning library for Keras models. It offers several tuning strategies, it’s highly customizable, and it has excellent integration with TensorBoard. Let’s see how to use it.
If you followed the installation instructions at https://homl.info/install to run everything locally, then you already have Keras Tuner installed, but if you are using Colab, you'll need to run %pip install -q -U keras-tuner. Next, import keras_tuner, usually as kt, then write a function that builds, compiles, and returns a Keras model. The function must take a kt.HyperParameters object as an argument, which it can use to define hyperparameters (integers, floats, strings, etc.) along with their range of possible values, and these hyperparameters may be used to build and compile the model. For example, the following function builds and compiles an MLP to classify Fashion MNIST images, using hyperparameters such as the number of hidden layers (n_hidden), the number of neurons per layer (n_neurons), the learning rate (learning_rate), and the type of optimizer to use (optimizer):
import keras_tuner as kt
def build_model(hp):
n_hidden = hp.Int("n_hidden", min_value=0, max_value=8, default=2)
n_neurons = hp.Int("n_neurons", min_value=16, max_value=256)
learning_rate = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")
optimizer = hp.Choice("optimizer", values=["sgd", "adam"])
if optimizer == "sgd":
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
else:
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten())
for _ in range(n_hidden):
model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
metrics=["accuracy"])
return model
The first part of the function defines the hyperparameters. For example, hp.Int("n_hidden", min_value=0, max_value=8, default=2) checks whether a hyperparameter named "n_hidden" is already present in the HyperParameters object hp, and if so it returns its value. If not, then it registers a new integer hyperparameter named "n_hidden", whose possible values range from 0 to 8 (inclusive), and it returns the default value, which is 2 in this case (when default is not set, then min_value is returned). The "n_neurons" hyperparameter is registered in a similar way. The "learning_rate" hyperparameter is registered as a float ranging from 10⁻⁴ to 10⁻², and since sampling="log", learning rates of all scales will be sampled equally. Lastly, the optimizer hyperparameter is registered with two possible values: "sgd" or "adam" (the default value is the first one, which is "sgd" in this case). Depending on the value of optimizer, we create an SGD optimizer or an Adam optimizer with the given learning rate.
The second part of the function just builds the model using the hyperparameter values. It creates a Sequential model starting with a Flatten layer, followed by the requested number of hidden layers (as determined by the n_hidden hyperparameter) using the ReLU activation function, and an output layer with 10 neurons (one per class) using the softmax activation function. Lastly, the function compiles the model and returns it.
Now if you want to do a basic random search, you can create a kt.RandomSearch tuner, passing the build_model function to the constructor, and call the tuner’s search() method:
random_search_tuner = kt.RandomSearch(
build_model, objective="val_accuracy", max_trials=5, overwrite=True,
directory="my_fashion_mnist", project_name="my_rnd_search", seed=42)
random_search_tuner.search(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
The RandomSearch tuner first calls build_model() once with an empty HyperParameters object, just to gather all the hyperparameter specifications. Then, in this example, it runs 5 trials; for each trial it builds a model using hyperparameters sampled randomly within their respective ranges, then it trains that model for 10 epochs and saves it to a subdirectory of the my_fashion_mnist/my_rnd_search directory. Since overwrite=True, the my_rnd_search directory is deleted before training starts. If you run this code a second time but with overwrite=False and max_trials=10, the tuner will continue tuning where it left off, running 5 more trials: this means you don't have to run all the trials in one shot. Lastly, since objective is set to "val_accuracy", the tuner prefers models with a higher validation accuracy, so once the tuner has finished searching, you can get the best models like this:
top3_models = random_search_tuner.get_best_models(num_models=3)
best_model = top3_models[0]
You can also call get_best_hyperparameters() to get the kt.HyperParameters of the best models:
>>> top3_params = random_search_tuner.get_best_hyperparameters(num_trials=3)
>>> top3_params[0].values # best hyperparameter values
{'n_hidden': 5,
'n_neurons': 70,
'learning_rate': 0.00041268008323824807,
'optimizer': 'adam'}
Each tuner is guided by a so-called oracle: before each trial, the tuner asks the oracle to tell it what the next trial should be. The RandomSearch tuner uses a RandomSearchOracle, which is pretty basic: it just picks the next trial randomly, as we saw earlier. Since the oracle keeps track of all the trials, you can ask it to give you the best one, and you can display a summary of that trial:
>>> best_trial = random_search_tuner.oracle.get_best_trials(num_trials=1)[0]
>>> best_trial.summary()
Trial summary
Hyperparameters:
n_hidden: 5
n_neurons: 70
learning_rate: 0.00041268008323824807
optimizer: adam
Score: 0.8736000061035156
Equivalently, you can get the best validation accuracy directly:
>>> best_trial.metrics.get_last_value("val_accuracy")
0.8736000061035156
If the best model performs well enough, you can continue training it on the full training set for a few more epochs, then evaluate it on the test set:
best_model.fit(X_train_full, y_train_full, epochs=10)
test_loss, test_accuracy = best_model.evaluate(X_test, y_test)
If you want to fine-tune hyperparameters of the training process itself, such as the arguments of model.fit() (e.g., the batch size), you can subclass the kt.HyperModel class and define its build() and fit() methods.
The build() method does the exact same thing as the build_model() function. The fit() method takes a HyperParameters object and a compiled model as arguments, as well as all the model.fit() arguments; it fits the model and returns the History object.
class MyClassificationHyperModel(kt.HyperModel):
def build(self, hp):
return build_model(hp)
def fit(self, hp, model, X, y, **kwargs):
if hp.Boolean("normalize"):
norm_layer = tf.keras.layers.Normalization()
X = norm_layer(X)
return model.fit(X, y, **kwargs)
Here is an example using the class defined above, this time with a kt.Hyperband tuner:
hyperband_tuner = kt.Hyperband(
MyClassificationHyperModel(), objective="val_accuracy", seed=42,
max_epochs=10, factor=3, hyperband_iterations=2,
overwrite=True, directory="my_fashion_mnist", project_name="hyperband")
It starts by training many different models for few epochs, then it eliminates the worst models and keeps only the top 1/factor models (i.e., the top third in this case), repeating this selection process until a single model is left. The max_epochs argument controls the max number of epochs that the best model will be trained for. The whole process is repeated twice in this case (hyperband_iterations=2). The total number of training epochs across all models for each hyperband iteration is about max_epochs * (log(max_epochs) / log(factor)) ** 2, so it's about 44 epochs in this example. The other arguments are the same as for kt.RandomSearch.
Let’s run the Hyperband tuner now. We’ll use the TensorBoard callback, this time pointing to the root log directory (the tuner will take care of using a different subdirectory for each trial), as well as an EarlyStopping callback:
from pathlib import Path
root_logdir = Path(hyperband_tuner.project_dir) / "tensorboard"
tensorboard_cb = tf.keras.callbacks.TensorBoard(root_logdir)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=2)
hyperband_tuner.search(X_train, y_train, epochs=10,
validation_data=(X_valid, y_valid),
callbacks=[early_stopping_cb, tensorboard_cb])
Now if you open TensorBoard, pointing --logdir to the my_fashion_mnist/hyperband/tensorboard directory, you will see all the trial results as they unfold. Make sure to visit the HPARAMS tab: it contains a summary of all the hyperparameter combinations that were tried, along with the corresponding metrics. Notice that there are three tabs inside the HPARAMS tab: a table view, a parallel coordinates view, and a scatterplot matrix view. In the lower part of the left panel, uncheck all metrics except for validation.epoch_accuracy: this will make the graphs clearer. In the parallel coordinates view, try selecting a range of high values in the validation.epoch_accuracy column: this will filter only the hyperparameter combinations that reached a good performance. Click one of the hyperparameter combinations, and the corresponding learning curves will appear at the bottom of the page. Take some time to go through each tab; this will help you understand the effect of each hyperparameter on performance, as well as the interactions between the hyperparameters.
Hyperband is smarter than pure random search in the way it allocates resources, but at its core it still explores the hyperparameter space randomly: it is fast, but coarse.
Keras Tuner also includes a kt.BayesianOptimization tuner. This algorithm gradually learns which regions of the hyperparameter space are most promising by fitting a probabilistic model called a Gaussian process, which allows it to gradually zoom in on the best hyperparameters. The downside is that the algorithm has its own hyperparameters: alpha represents the level of noise you expect in the performance measures across trials (it defaults to 10⁻⁴), and beta specifies how much you want the algorithm to explore, instead of simply exploiting the known good regions of the hyperparameter space (it defaults to 2.6).
bayesian_opt_tuner = kt.BayesianOptimization(
MyClassificationHyperModel(), objective="val_accuracy", seed=42,
max_trials=10, alpha=1e-4, beta=2.6,
overwrite=True, directory="my_fashion_mnist", project_name="bayesian_opt")
bayesian_opt_tuner.search([...])
Hyperparameter tuning is still an active area of research, and many other approaches are being explored.
Number of hidden layers
Compared with shallow networks, deep networks can model complex functions using far fewer neurons, which allows them to reach better performance with the same amount of training data.
Deeper layers can build on the structures learned by the lower layers.
Transfer learning: this hierarchical architecture not only helps DNNs converge faster to a good solution, it also improves their ability to generalize to new datasets. For example, if you have already trained a model to recognize faces in pictures and you now want to train a new neural network to recognize hairstyles, you can kick-start training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new network, you can initialize them to the values of the weights and biases of the first network's lower layers. This way the network will not have to learn from scratch all the low-level structures that occur in most pictures; it will only have to learn the higher-level structures (e.g., hairstyles). This is called transfer learning.
For many problems you can start out with just one or two hidden layers and the neural network will work just fine. For instance, you can easily reach above 97% accuracy on the MNIST dataset using just one hidden layer with a few hundred neurons, and above 98% accuracy using two hidden layers with the same total number of neurons, in roughly the same amount of training time. For more complex problems, you can increase the number of hidden layers until you start overfitting the training set. Very complex tasks, such as image classification or speech recognition, typically require networks with dozens of layers (or even hundreds, though not fully connected ones), and they need a huge amount of training data. You will rarely have to train such networks from scratch: it is much more common to reuse parts of a pretrained state-of-the-art network that performs a similar task. Training will then be a lot faster and require much less data.
Number of neurons per hidden layer
The number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 × 28 = 784 input neurons and 10 output neurons.
As for the hidden layers, it used to be common to size them to form a pyramid, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features. A typical neural network for MNIST might have 3 hidden layers, the first with 300 neurons, the second with 200, and the third with 100. However, this practice has been largely abandoned because it seems that simply using the same number of neurons in all hidden layers performs just as well in most cases, or even better; plus, there is only one hyperparameter to tune instead of one per layer. That said, depending on the dataset, it can sometimes help to make the first hidden layer bigger than the others.
In general you will get more bang for your buck by increasing the number of layers instead of the number of neurons per layer.
Learning Rate, Batch Size, and Other Hyperparameters
Learning rate: this is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate (i.e., the learning rate above which training diverges). One way to find a good learning rate is to train the model for a few hundred iterations, starting with a very low learning rate (e.g., 10⁻⁵) and gradually increasing it up to a very large value (e.g., 10). This is done by multiplying the learning rate by a constant factor at each iteration (e.g., by exp(log(10⁶)/500) to go from 10⁻⁵ to 10 in 500 iterations). If you plot the loss as a function of the learning rate (using a log scale for the learning rate), you should see it dropping at first; but after a while the learning rate will become too large, so the loss will shoot back up: the optimal learning rate will be a bit lower than the point at which the loss starts to climb (typically about 10 times lower than the turning point). You can then reinitialize your model and train it normally using this good learning rate; a callback sketch of this procedure follows below.
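A hedged sketch of that learning-rate sweep as a custom Keras callback (following the procedure described above; the range 10⁻⁵ to 10 over 500 steps matches the example values in the text):
import math
import tensorflow as tf

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    # multiply the learning rate by a constant factor after each batch and record the loss
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.rates, self.losses = [], []
    def on_batch_end(self, batch, logs=None):
        lr = tf.keras.backend.get_value(self.model.optimizer.learning_rate)
        self.rates.append(lr)
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr * self.factor)

expon_lr = ExponentialLearningRate(factor=math.exp(math.log(1e6) / 500))
# model.compile(..., optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5))
# history = model.fit(X_train, y_train, epochs=1, callbacks=[expon_lr])
# then plot expon_lr.losses against expon_lr.rates on a log scale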
Optimizer: choosing a better optimizer than plain mini-batch gradient descent (and tuning its hyperparameters) is also quite important.
Batch size: there is no consensus on this, and it is still debated. One camp recommends small batches of no more than 32 (typically in the range 2 to 32). Another recommends using the largest batch size that fits in GPU RAM, combined with ramping up the learning rate gradually.
Activation functions: in general, the ReLU activation function is a good default for all hidden layers. For the output layer, it really depends on your task.
Number of iterations: in most cases you do not actually need to tune the number of training iterations; just use early stopping instead.
The optimal learning rate depends on the other hyperparameters, especially the batch size, so if you modify any hyperparameter, make sure to update the learning rate as well.
(End of this post)
See also
- Artificial Intelligence, Rob Callan; Chinese translation by Huang Houkuan, Tian Shengfeng, et al.; ISBN: 9787505399235
- Machine Learning (the "watermelon book"), Zhou Zhihua, Nanjing University
- Neural Networks and Deep Learning, Qiu Xipeng, Fudan University
- Stanford CS231, Andrew Ng