Natural Language Processing


Text Preprocessing Layer

Keras provides a TextVectorization layer for basic text preprocessing. Much like the StringLookup layer, you must either pass it a vocabulary upon creation, or let it learn the vocabulary from some training data using the adapt() method. Let’s look at an example:

python

>>> train_data = ["To be", "!(to be)", "That's the question", "Be, be, be."]
>>> text_vec_layer = tf.keras.layers.TextVectorization()
>>> text_vec_layer.adapt(train_data)
>>> text_vec_layer(["Be good!", "Question: be or be?"])
<tf.Tensor: shape=(2, 4), dtype=int64, numpy=
array([[2, 1, 0, 0],
       [6, 2, 1, 2]])>

The two sentences “Be good!” and “Question: be or be?” were encoded as [2, 1, 0, 0] and [6, 2, 1, 2], respectively. The vocabulary was learned from the four sentences in the training data: “be” = 2, “to” = 3, etc. To construct the vocabulary, the adapt() method first converted the training sentences to lowercase and removed punctuation, which is why “Be”, “be”, and “be?” are all encoded as “be” = 2. Next, the sentences were split on whitespace, and the resulting words were sorted by descending frequency, producing the final vocabulary. When encoding sentences, unknown words get encoded as 1s. Lastly, since the first sentence is shorter than the second, it was padded with 0s.
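
You can inspect the vocabulary that adapt() learned by calling get_vocabulary(): index 0 is reserved for padding and index 1 for unknown words, which is why “be” = 2 and “to” = 3. A quick check (the relative order of the equally frequent words may differ on your run):

python

>>> text_vec_layer.get_vocabulary()
['', '[UNK]', 'be', 'to', 'thats', 'the', 'question']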

The TextVectorization layer has many options. For example, you can preserve the case and punctuation if you want by setting standardize=None, or you can pass any standardization function you please as the standardize argument. You can prevent splitting by setting split=None, or you can pass your own splitting function instead. You can set the output_sequence_length argument to ensure that the output sequences all get cropped or padded to the desired length, or you can set ragged=True to get a ragged tensor instead of a regular tensor. Please check out the documentation for more options.
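
Here is a minimal sketch of a few of these options, reusing the same train_data (raw_layer and ragged_layer are just illustrative names; the exact IDs depend on the learned vocabulary):

python

# keep case and punctuation, and crop/pad every output sequence to exactly 4 tokens
raw_layer = tf.keras.layers.TextVectorization(standardize=None,
                                              output_sequence_length=4)
raw_layer.adapt(train_data)
print(raw_layer(["Be good!", "Question: be or be?"]))  # regular tensor of shape (2, 4)

# no padding at all: one ragged row per sentence
ragged_layer = tf.keras.layers.TextVectorization(ragged=True)
ragged_layer.adapt(train_data)
print(ragged_layer(["Be good!", "Question: be or be?"]))  # tf.RaggedTensor with rows of different lengths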

Weighting rare words with TF-IDF:

$$ \text{TF-IDF} = \text{TF} \times \text{IDF} $$
$$ \text{TF (term frequency)} = \frac{\text{number of occurrences of the term in the document}}{\text{total number of terms in the document}} $$
$$ \text{IDF (inverse document frequency)} = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the term} + 1}\right) $$

TF-IDF has many variants; we won't go into them here.
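
As a quick illustration of the formula above, here is a minimal pure-Python sketch (the corpus, the tf_idf() helper, and the example words are illustrative only; Keras's "tf_idf" output mode uses its own smoothed variant of the weighting, so these numbers will not match the layer's output below):

python

import math

corpus = ["to be", "to be", "thats the question", "be be be"]
n_docs = len(corpus)

def tf_idf(word, doc):
    words = doc.split()
    tf = words.count(word) / len(words)           # term frequency within this document
    df = sum(word in d.split() for d in corpus)   # number of documents containing the word
    idf = math.log(n_docs / (df + 1))             # inverse document frequency
    return tf * idf

print(tf_idf("be", "be be be"))                  # frequent word: idf = log(4 / 4) = 0
print(tf_idf("question", "thats the question"))  # rare word: 1/3 * log(4 / 2) ≈ 0.23
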
The word IDs must be encoded, typically using an Embedding layer: we will do this in Chapter 16. Alternatively, you can set the TextVectorization layer’s output_mode argument to "multi_hot" or "count" to get the corresponding encodings. However, simply counting words is usually not ideal: words like “to” and “the” are so frequent that they hardly matter at all, whereas rarer words such as “basketball” are much more informative. So, rather than setting output_mode to "multi_hot" or "count", it is usually preferable to set it to "tf_idf", which stands for term frequency × inverse document frequency (TF-IDF). This is similar to the count encoding, but words that occur frequently in the training data are downweighted, and conversely, rare words are upweighted. For example (here is the TensorFlow implementation):

python

>>> text_vec_layer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
>>> text_vec_layer.adapt(train_data)
>>> text_vec_layer(["Be good!", "Question: be or be?"])
<tf.Tensor: shape=(2, 6), dtype=float32, numpy=
array([[0.96725637, 0.6931472 , 0. , 0. , 0. , 0.        ],
       [0.96725637, 1.3862944 , 0. , 0. , 0. , 1.0986123 ]], dtype=float32)>

Training a Simple NLP Model Yourself

Train and save a Shakespeare-style text model

File download: shakespeare.txt

With a GPU, this model takes roughly 2 hours to train.

My trained model can be downloaded here.

python

import tensorflow as tf
filepath = "shakespeare.txt"
with open(filepath) as f:
    shakespeare_text = f.read()

print(shakespeare_text[:80])
text_vec_layer = tf.keras.layers.TextVectorization(split="character", standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    # build overlapping windows of length + 1 characters, shifted by one character each time
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(buffer_size=100_000, seed=seed)
    ds = ds.batch(batch_size)
    # inputs = the first `length` chars, targets = the same window shifted one char to the right
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)

# We set the window length to 100, but you can try tuning it: training an RNN on shorter input sequences
# is easier and faster, but the RNN cannot learn any pattern longer than the window, so don't make it too small.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10,
                    callbacks=[model_ckpt])
# This model does not handle text preprocessing itself, so let's wrap it in a final model that includes the
# tf.keras.layers.TextVectorization layer as its first layer, plus a tf.keras.layers.Lambda layer that subtracts 2
# from the character IDs, since we are not using the padding and unknown tokens here:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

shakespeare_model.save("my_shakespeare_model", save_format="tf")

Load and use the model

python

import tensorflow as tf
new_model = tf.keras.models.load_model("my_shakespeare_model")

new_model.summary()

# The first layer of the saved model is the TextVectorization layer; grab it so we can map IDs back to characters
text_vec_layer = new_model.layers[0]

# Single prediction:
y_proba = new_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)
text_vec_layer.get_vocabulary()[y_pred + 2]
# outputs 'e'

We can feed the model some text, have it predict the most likely next letter, append it to the end of the text, then feed the extended text back to the model to guess the next character, and so on. This is called greedy decoding. In practice, however, it tends to repeat the same words over and over. Instead, we can use TensorFlow's tf.random.categorical() function to randomly sample the next character, with a probability equal to its estimated probability; this produces much more diverse and interesting text. The categorical() function samples random class indices, given the class log probabilities (logits).

To get more control over the diversity of the generated text, we can divide the logits by a number called the temperature, which we can tweak as we wish: a temperature close to zero favors high-probability characters, while a high temperature gives all characters an equal probability. Lower temperatures are usually preferred when generating fairly rigid and precise text, such as mathematical equations, while higher temperatures are preferred when generating more diverse and creative text. The next_char() custom helper function below uses this approach to pick the next character to add to the input text:

python

def next_char(text, temperature=1):
    y_proba = new_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

# call next_char() repeatedly to extend the text
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text


# Ready to use:
tf.random.set_seed(42)

print(extend_text("To be or not to be", temperature=0.01))

print(extend_text("To be or not to be", temperature=1))

print(extend_text("To be or not to be", temperature=100))

We can observe that the higher the temperature, the more creative the output, and the lower the temperature, the more rigid it is.

Summary

Training is slow and the results are mediocre; this only serves as an example.
It also requires a GPU, and setting up the GPU environment is a real pain. After fiddling with it for ages, I found that starting with TensorFlow 2.11.0, GPU support is no longer available natively on Windows: you have to use Docker, or uninstall the newer TensorFlow and reinstall an older version. What a trap!!!

Calling the Domestic Zhipu AI Large Model

Preface

I don't have the resources to train a really good NLP model with the approach above, but I can call publicly available models instead.
I originally wanted to try the ChatGPT API, but my ChatGPT account seems to have a problem: the API key I created doesn't work, so I turned to domestic large models. Among those, I have only used Baidu's 文心一言 and iFLYTEK's 讯飞星火. I remember that when they were first released a few months ago, neither was great and they were roughly on the same level, but by now 文心一言 has become quite smart and has clearly improved. However, its API requires an application with manual review, which feels like too much hassle for practicing. I found that 智谱AI opens its API to anyone who registers an account, so I gave it a try and it worked. The steps are below.

A quick rant first: Zhipu AI's documentation is really poorly written...

How to use

Register an account at https://open.bigmodel.cn/, then go to the developer workbench → personal account → account management. You will see a system-default API key, along with 18 RMB of free credit. Copy the API key.

First, install the SDK:

shell

pip install zhipuai

I'm just playing around with it here, not digging any deeper:

python

import zhipuai

# copy and paste your api key here
zhipuai.api_key = "your api key"

def invoke_test():
    response = zhipuai.model_api.invoke(
        model="chatglm_turbo",
        prompt=[{"role": "user", "content": "请介绍一下诗人李白"}],
        top_p=0.7,
        temperature=0.9,
    )
    return response

res = invoke_test()
print(res['data']['choices'][0]['content'])

This prints the output:

plaintext

" 李白(701年2月8日—762年12月),字太白,号青莲居士,又号谪仙人”。是唐代伟大的浪漫主义诗人,被后人誉为“诗仙”。与杜甫并称为“李杜”,为了与另两位诗人李商隐与杜牧即“小李杜”区别,杜甫与李白又合称“大李杜”。\n\n李白出身于一个富有的、有文化教养的家庭。他的少年时代生长于蜀中,蜀中是道教气氛浓郁的地方,环境对他的神仙道教信仰影响甚大。大约18岁左右,他学习纵横术。\n\n李白的诗歌风格独具特色,他善于运用夸张、比喻、对比等修辞手法,表现他的豪放、奔放的个性。他的诗歌作品涵盖了山水田园、历史传说、神话故事等各种主题,表现出广阔的视野和丰富的想象力。\n\n李白的一生,绝大部分在漫游中度过。天宝元年(742年),因道士吴筠的推荐,被召至长安,供奉翰林。文章风采,名动一时,颇为玄宗所赏识。后因不能见容于权贵,在京仅三年,就弃官而去,仍然继续他那飘荡四方的流浪生活。安史之乱发生的第二年,李白被永王李璘招聘为幕僚,后来李璘叛败,李白受到牵连,被流放夜郎,途中遇赦,晚年漂泊于武昌、庐山等地,最后病死在途中。\n\n李白的诗歌作品广泛传颂,被誉为“诗仙”,对后世的文学影响深远。他的诗歌风格和艺术成就,在中国文学史上占有重要的地位。"

The output feels a bit odd, but never mind.
Looking at it more closely, some parts are made up.

Other information

The response is a dict with several more dicts nested inside it, which I don't find elegant at all. I'm noting its structure here so I don't have to look it up again later:

plaintext

<class 'dict'>
dict_keys(['code', 'msg', 'data', 'success'])
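
If you want to see the whole nested structure at a glance, a minimal sketch (assuming res is the response dict from the call above and that it contains only plain JSON-serializable data) is to pretty-print it:

python

import json

# dump the whole nested response so its structure is visible at a glance
print(json.dumps(res, ensure_ascii=False, indent=2))

# the generated text itself lives here:
print(res['data']['choices'][0]['content'])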

References