Text Generation: Automatic Text Generation with an LSTM

This content is licensed under CC 4.0 BY-SA.


Step 1: Train the network

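The model reproduces the style of its training data: whatever style of writing the training samples are cut from is the style of text the model will generate at prediction time.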
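1) Prepare the training example pairs (input_segment, target_character), where target_character is the character that immediately follows input_segment. For example, to cut pairs from a 3000-character Shakespeare text with input segment length = 30 and stride = 3, the window moves forward three characters after each pair before the next one is cut, so (3000 - 30) / 3 = 990, i.e. roughly 1000, pairs (input_segment, target_character) are extracted.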
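The windowing described above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `make_pairs` and the filler corpus are assumptions made for the example.

```python
def make_pairs(text, segment_length=30, stride=3):
    """Slide a window over `text` and collect (input_segment, target_character) pairs."""
    pairs = []
    # Stop early enough that a target character always exists right after the segment.
    for start in range(0, len(text) - segment_length, stride):
        segment = text[start:start + segment_length]
        target = text[start + segment_length]
        pairs.append((segment, target))
    return pairs


# Placeholder corpus (assumption): 3000 characters of repeated filler, not real Shakespeare.
corpus = ("to be or not to be that is the question " * 75)[:3000]
pairs = make_pairs(corpus)
print(len(pairs))  # 990
print(pairs[0])    # first 30 characters and the character that follows them
```

With a stride of 3, consecutive segments overlap by 27 characters, which is why a short text still yields many training pairs.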

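2) Encoding: one-hot encode each input_segment as a matrix of size segment_length x vocabulary_size, where vocabulary_size is the number of unique tokens (letters, digits, punctuation, spaces, ...), and encode each target character as a one-hot vector of length vocabulary_size.
In this model the prediction target is a character and the vocabulary holds only a few dozen entries, so there is no need for a word embedding to shrink each character vector. As a rule of thumb, character-level tokenization can do without an embedding, while word-level tokenization, with its much larger vocabulary, needs one.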
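A small sketch of the encoding step, using plain Python lists so the shapes are easy to inspect. The helper names (`build_vocab`, `one_hot_char`, `one_hot_segment`) are hypothetical; a real training pipeline would more likely build the same matrices as arrays.

```python
def build_vocab(text):
    """Map every unique character in `text` to an integer index."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot_char(ch, char_to_index):
    """Encode one character as a vector of length vocabulary_size."""
    vector = [0.0] * len(char_to_index)
    vector[char_to_index[ch]] = 1.0
    return vector


def one_hot_segment(segment, char_to_index):
    """Encode a segment as a matrix of size len(segment) x vocabulary_size."""
    return [one_hot_char(ch, char_to_index) for ch in segment]


char_to_index = build_vocab("hello")           # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
x = one_hot_segment("hel", char_to_index)      # 3 x 4 matrix
y = one_hot_char("l", char_to_index)           # target vector of length 4
print(x)
print(y)
```

Each row of the matrix contains exactly one 1, at the index of the character it encodes.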
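3) Train: input -> LSTM -> Dense (softmax) -> y^, the probability distribution of the predicted character. Note that a bidirectional LSTM cannot be used here: text generation reads the preceding text and produces the next character, so it is a one-directional problem.
Cost function = H(y, y^) = -log P(y). Because y is a one-hot vector that is 1 only in the dimension of the correct character, the cross-entropy sum reduces to a single term; P(y) is the entry of the predicted distribution y^ at the target's index.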
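To see why the cross-entropy collapses to -log P(y) for a one-hot target, the sketch below computes it both ways on a made-up three-character distribution. The numbers and function names are illustrative assumptions only.

```python
import math


def cross_entropy(y_true, y_pred):
    """Full cross-entropy H(y, y^) = -sum_i y_i * log(y^_i)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def neg_log_likelihood(target_index, y_pred):
    """Shortcut for a one-hot target: only the target's probability matters."""
    return -math.log(y_pred[target_index])


y_true = [0.0, 1.0, 0.0]   # one-hot target: the correct character has index 1
y_pred = [0.1, 0.7, 0.2]   # hypothetical softmax output
full = cross_entropy(y_true, y_pred)
short = neg_log_likelihood(1, y_pred)
print(full, short)         # both equal -log(0.7)
```

The loss is small when the network puts high probability on the correct character and grows without bound as that probability approaches 0.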
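Here the LSTM has 128 units and the vocabulary has 57 characters. Each of the LSTM's four gate blocks takes the concatenation of the previous h (128) and the current input (57):
LSTM: #weights + #biases = (128 + 57) x 4 x 128 + 4 x 128 = 95232
Dense: #weights + #biases = 128 x 57 + 57 = 7353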
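The two counts can be checked with a few lines of arithmetic. The function names are hypothetical, and the formula assumes a standard LSTM with four gate blocks and one bias vector per block.

```python
def lstm_param_count(input_dim, num_units):
    """Four gate blocks, each acting on the concatenation [h, x]."""
    weights = (num_units + input_dim) * 4 * num_units
    biases = 4 * num_units
    return weights + biases


def dense_param_count(input_dim, output_dim):
    """One weight per (input, output) pair plus one bias per output."""
    return input_dim * output_dim + output_dim


lstm = lstm_param_count(input_dim=57, num_units=128)
dense = dense_param_count(input_dim=128, output_dim=57)
print(lstm)          # 95232
print(dense)         # 7353
print(lstm + dense)  # 102585
```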

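LSTM:
Reminder:
The input size is [batch_size (number of sentences), sequence_length T (length of each sentence, i.e. number of tokens), input_dim (length of each token vector)]. LSTM_num_units is the length of the h and c vectors; it determines the size of the output, not of the input.

The output size is:
if return_sequences=True, the output is the whole output sequence y(1:T), with size [batch_size, sequence_length, num_units] in Keras's default batch-first layout.
if return_sequences=False, the output keeps only the last step y(T), with size [batch_size, num_units].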

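For the text generation task, the input and output sizes are:

input size: [batch_size (#segments), segment_length, vocabulary_size], since every character is a one-hot vector

output size: only the last output y(T), of size [batch_size, num_units], is needed to predict the next character, so return_sequences=False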
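The shapes can be made concrete with a bare-bones LSTM forward pass in NumPy. This is only a sketch under assumptions: the weights are random rather than trained, the gate ordering inside `W` is arbitrary, and `lstm_forward` is a hypothetical helper, not a library function.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(x, W, b, num_units, return_sequences=False):
    """x: [batch_size, sequence_length, input_dim]; W: [num_units + input_dim, 4 * num_units]."""
    batch_size, sequence_length, _ = x.shape
    h = np.zeros((batch_size, num_units))
    c = np.zeros((batch_size, num_units))
    outputs = []
    for t in range(sequence_length):
        # All four gate pre-activations at once: [batch_size, 4 * num_units].
        z = np.concatenate([h, x[:, t, :]], axis=1) @ W + b
        i, f, g, o = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)  # [batch_size, sequence_length, num_units]
    return h                              # [batch_size, num_units], i.e. y(T) only


rng = np.random.default_rng(0)
batch_size, segment_length, vocab_size, num_units = 4, 30, 57, 128
x = rng.random((batch_size, segment_length, vocab_size))
W = rng.normal(scale=0.1, size=(num_units + vocab_size, 4 * num_units))
b = np.zeros(4 * num_units)

full = lstm_forward(x, W, b, num_units, return_sequences=True)
last = lstm_forward(x, W, b, num_units, return_sequences=False)
print(full.shape)  # (4, 30, 128)
print(last.shape)  # (4, 128)
```

The last time step of the full sequence is the same vector that return_sequences=False returns, and it is this vector that the Dense softmax layer turns into a distribution over the next character.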
batch_size = 128: each training step feeds one batch of 128 segments into the network and updates the model parameters once
epochs = 1: one pass over all the training pairs in the dataset
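As a quick check of how these two settings interact, the sketch below counts parameter updates. It assumes the final, smaller batch is kept rather than dropped, and it uses 990 training pairs purely as an example figure.

```python
import math


def updates_per_epoch(num_pairs, batch_size):
    """One parameter update per batch; the last batch may be smaller."""
    return math.ceil(num_pairs / batch_size)


def total_updates(num_pairs, batch_size, epochs):
    return epochs * updates_per_epoch(num_pairs, batch_size)


print(updates_per_epoch(990, 128))        # 8
print(total_updates(990, 128, epochs=1))  # 8
```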

Step 2: Use the model to predict

Once the model is trained, it can be used to generate text.
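1) Write an opening "seed" of your own to use as the first input.
2) Repeat the following steps; each pass generates one character:

feed the input into the network
output the probability distribution of the predicted next character
sample from the distribution to determine the next character; there are three options: take the most probable character, sample directly from the distribution, or reshape the distribution with a temperature and then sample
update the input: drop its first character and append the next character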
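The generation loop and the three sampling options can be sketched as below. Everything here is an illustrative assumption: `predict_proba` stands in for the trained network (any callable that maps the current input window to a probability list), `toy_model` is a fixed dummy distribution, and temperature is applied as p ** (1 / temperature) followed by renormalization, which is one common formulation.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: p_i ** (1 / temperature), then renormalize."""
    scaled = [math.exp(math.log(max(p, 1e-12)) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def pick_next_index(probs, strategy="temperature", temperature=1.0, rng=random):
    """Choose the next character index using one of the three options."""
    if strategy == "greedy":            # option 1: most probable character
        return max(range(len(probs)), key=probs.__getitem__)
    if strategy == "temperature":       # option 3: reshape, then sample
        probs = apply_temperature(probs, temperature)
    elif strategy != "sample":          # option 2: sample as-is
        raise ValueError(f"unknown strategy: {strategy}")
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(predict_proba, seed, vocab, num_chars, **sampling):
    """Generate `num_chars` characters, one per pass through the loop."""
    window, text = seed, seed
    for _ in range(num_chars):
        probs = predict_proba(window)                      # feed input, get distribution
        next_char = vocab[pick_next_index(probs, **sampling)]
        text += next_char
        window = window[1:] + next_char                    # drop first char, append new one
    return text


vocab = ["a", "b", "c"]


def toy_model(window):
    """Hypothetical stand-in for the trained network: a fixed distribution."""
    return [0.2, 0.5, 0.3]


print(generate(toy_model, "abc", vocab, 5, strategy="greedy"))  # abcbbbbb
print(generate(toy_model, "abc", vocab, 5, strategy="temperature",
               temperature=0.5, rng=random.Random(0)))
```

A temperature below 1 sharpens the distribution, so the output is more conservative and closer to the greedy choice; a temperature above 1 flattens it, giving more varied but more error-prone text. Greedy decoding is deterministic and tends to repeat itself, as the toy output shows.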