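Introduction

LSTM (Long Short-Term Memory) networks are recurrent networks designed to remember information across long stretches of a sequence. The name refers to a short-term memory (the recurrent state) that is made to persist over many time steps.

In a plain RNN, the network is a chain of repeated modules that all share exactly the same structure.

An LSTM has the same chain-like structure. The only difference is inside the repeating module: instead of a single tanh layer, it uses four interacting layers.

The symbols used in the diagram are explained here.

Yellow box: a neural network layer.
Pink circle: a pointwise operation, such as element-wise multiplication or addition.
Line: carries a vector from the output of one node to the input of another.
Merging lines: the vectors carried by the two lines are concatenated (for example, if one carries \(h_{t-1}\) and the other \(x_t\), the merged output is \([h_{t-1}, x_t]\)).
Forking line: the vector on the line is copied and sent to two places.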

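The core idea of LSTM

The key to an LSTM is the cell state, the horizontal line running along the top of the diagram.
The cell state works like a conveyor belt: the vector runs straight through the whole cell with only a few linear operations applied to it, so information can easily pass through unchanged (which is what makes long-term memory possible).

An LSTM can also add information to the cell state or remove it, and this is carefully regulated by structures called gates. A gate lets information through selectively; it consists of a sigmoid neural network layer followed by a pointwise multiplication.

Each LSTM cell has three such gates to control the flow of information:
- forget gate
- input gate
- output gate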

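Forget gate

The first step in an LSTM is to decide what information to discard from the cell state. This decision is made by a sigmoid layer called the forget gate layer.

Its inputs are \(h_{t-1}\) and \(x_t\), and its output is a vector whose entries all lie between 0 and 1 (with the same length as \(C_{t-1}\)). Each entry is the fraction of the corresponding component of \(C_{t-1}\) to let through: 0 lets nothing through, and 1 lets everything through.

Consider a concrete example: a language model tries to predict the next word from all the previous words. In that case the cell state should hold the gender of the current subject (information to keep), so that the correct pronouns can be used later. When we start describing a new subject, the gender of the old subject should be forgotten (information to discard).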
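Written as a formula, the forget gate described above is usually expressed as follows. The symbols \(W_f\) (weight matrix), \(b_f\) (bias) and \(\sigma\) (the sigmoid function) are conventional notation assumed here for illustration; only \(f_t\), \(h_{t-1}\), \(x_t\) and \(C_{t-1}\) appear in the text.

```latex
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right), \qquad f_t \in (0, 1)^{\dim C_{t-1}}
```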

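Input gate

The second step is to decide how much new information to add to the cell state.
This takes two steps:
1. First, a sigmoid layer called the input gate layer decides which values to update, and a tanh layer creates a vector of new candidate values \(\tilde{C}_t\).
2. Then the two parts are combined to update the cell state.

In our language model example, we want to add the gender of the new subject to the cell state, replacing the old one.
With this structure in place we can update the cell state, that is, turn \(C_{t-1}\) into \(C_t\).
As the diagram makes clear, we first multiply the old state \(C_{t-1}\) by \(f_t\), forgetting what we no longer want to keep, and then add \(i_t * \tilde{C}_t\), which is the new content we want to add.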
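The two steps and the resulting update can be summarized in formulas. This is a sketch in conventional notation: the weights \(W_i, W_C\) and biases \(b_i, b_C\) are names assumed here, the candidate vector is written \(\tilde{C}_t\) to keep it distinct from the new cell state \(C_t\), and \(*\) denotes element-wise multiplication.

```latex
\begin{aligned}
i_t         &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t         &= f_t * C_{t-1} + i_t * \tilde{C}_t
\end{aligned}
```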

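Output gate

Finally, we need to decide what to output. The output is based on the cell state \(C_t\), but in a filtered form.

First, a sigmoid layer decides which parts of \(C_t\) will be output. Next, \(C_t\) is passed through tanh (which squashes the values into the range -1 to 1), and the result is multiplied by the weights computed by the sigmoid layer. This gives the final output.

In the language model example, suppose the model has just seen a pronoun and may need to output a verb next. That output is likely to depend on the pronoun: for instance, whether the verb should take a singular or plural form. So the information about the pronoun that has just been stored in the cell state needs to be exposed in the output for the prediction to be correct.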
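Putting the three gates together, one LSTM time step can be sketched in plain Python. This is a minimal illustration rather than how any library implements it: the function names, the parameter dictionary and its keys (`W_f`, `b_f`, and so on), and the toy all-zero weights are assumptions made for the example.

```python
import math

def sigmoid(v):
    """Element-wise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def affine(W, b, v):
    """Compute W @ v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds hypothetical weights and biases."""
    z = h_prev + x_t                                  # list concatenation = [h_{t-1}, x_t]
    f_t = sigmoid(affine(p["W_f"], p["b_f"], z))      # forget gate
    i_t = sigmoid(affine(p["W_i"], p["b_i"], z))      # input gate
    c_tilde = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z)]  # candidate values
    o_t = sigmoid(affine(p["W_o"], p["b_o"], z))      # output gate
    # New cell state: keep part of the old state, add part of the candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # New hidden state: filtered, squashed view of the cell state
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy sizes: hidden size 2, input size 3, all weights and biases zero
hidden, inputs = 2, 3
zero_matrix = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zero_matrix for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})

h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [4.0, -4.0], params)
print(h_t, c_t)
```

With all-zero weights every gate evaluates to 0.5 and the candidate to 0, so the new cell state is simply half of the old one, which makes the roles of the forget and output gates easy to see.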

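A worked example

Suppose the input fed into the cell is called \(z\), the signal controlling the input gate is \(z_i\), the signal controlling the forget gate is \(z_f\), the signal controlling the output gate is \(z_o\), the output obtained by combining all of these is \(a\), and the memory already holds a value \(c\).

Look at the concrete cell in the figure: the value on each line is a weight, and the green boxes together with their lines make up the biases. The activation function \(f\) of the input gate and forget gate is a sigmoid (the output gate in the example below is passed through a sigmoid as well). For convenience, \(z_i, z_o, z_f\) are computed directly from the input vector, \(g\) and \(h\) are both assumed to be linear (which keeps the arithmetic simple), and the initial value stored in memory is 0.

Before doing the actual computation, let us analyze what results to expect from the input and the other parameters.
For the cell input coming in from outside at the bottom, \(x_1\) is multiplied by 1 and every other \(x\) by 0, so \(x_1\) is used directly as the input.
In the input gate, \(x_2\) is multiplied by 100 and the bias is -10.
In the forget gate, \(x_2\) is also multiplied by 100 and the bias is 10.
In the output gate, \(x_3\) is multiplied by 100 and the bias is -10.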

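Example 1

Plug in an actual input, [3, 1, 0].
The cell input is 3*1=3.
The input gate gets 1*100-10=90, which after the sigmoid is ≈1, so \(g(z)*f(z_i)=3\).
The forget gate gets 1*100+10=110, which after the sigmoid is also ≈1, so the memory value \(c\) is updated to \(c'=g(z)*f(z_i)+c*f(z_f)=3+0*1=3\).
The output gate gets 0*100-10=-10, which after the sigmoid is ≈0.
The final output is \(a=h(c')*f(z_o)\approx 0\).

Other inputs work the same way (note that the updated \(c\) depends on the \(c\) before the update).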
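The arithmetic of Example 1 can be checked with a few lines of plain Python. The function name `cell_step` is made up for this sketch, and the second input [4, 1, 0] is a hypothetical follow-up chosen only to show that the new memory value depends on the previous one.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_step(x, c):
    """One step of the toy cell; x = [x1, x2, x3], c = current memory value."""
    x1, x2, x3 = x
    z = 1 * x1               # cell input: x1 * 1, all other weights are 0
    z_i = 100 * x2 - 10      # input gate signal
    z_f = 100 * x2 + 10      # forget gate signal
    z_o = 100 * x3 - 10      # output gate signal
    # g and h are linear (identity here), f is the sigmoid
    c_new = z * sigmoid(z_i) + c * sigmoid(z_f)
    a = c_new * sigmoid(z_o)
    return a, c_new

# Example 1: input [3, 1, 0], memory initially 0
a1, c1 = cell_step([3, 1, 0], c=0.0)
print(round(a1, 3), round(c1, 3))   # 0.0 3.0

# Hypothetical second input: the stored 3 is kept and 4 is added to it
a2, c2 = cell_step([4, 1, 0], c=c1)
print(round(a2, 3), round(c2, 3))
```

The output stays near 0 in both steps because \(x_3 = 0\) keeps the output gate closed, while the memory accumulates from 3 to 7.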

PyTorch implementation

For the API reference, see: LSTM — PyTorch 1.13 documentation

Input format (with the default batch_first=False):

input: [seq_len, batch, input_size]
\(h_0\): [num_layers * num_directions, batch, hidden_size]
\(c_0\): [num_layers * num_directions, batch, hidden_size]

Output format:

output: [seq_len, batch, hidden_size * num_directions]
\(h_n\): [num_layers * num_directions, batch, hidden_size]
\(c_n\): [num_layers * num_directions, batch, hidden_size]

Imports

import torch
import torch.nn as nn

Usage

lstm = nn.LSTM(input_size=100, hidden_size=20, num_layers=4)  

# Each sentence has 10 words, 3 sentences are fed in, and each word is a 100-dimensional vector
X = torch.randn(10, 3, 100)       

out, (h_n, c_n) = lstm(X)  
print(out.shape, h_n.shape, c_n.shape)

torch.Size([10, 3, 20]) torch.Size([4, 3, 20]) torch.Size([4, 3, 20])
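To see where num_directions enters the shapes listed above, the same call can be repeated with bidirectional=True. This is a small sketch that reuses the sizes from the example; nothing else is assumed beyond the documented nn.LSTM arguments.

```python
import torch
import torch.nn as nn

# Same sizes as before, but bidirectional, so num_directions == 2
lstm = nn.LSTM(input_size=100, hidden_size=20, num_layers=4, bidirectional=True)
X = torch.randn(10, 3, 100)    # [seq_len, batch, input_size]

out, (h_n, c_n) = lstm(X)      # h_0 and c_0 default to zeros when omitted
print(out.shape)   # torch.Size([10, 3, 40])  -> hidden_size * num_directions
print(h_n.shape)   # torch.Size([8, 3, 20])   -> num_layers * num_directions
print(c_n.shape)   # torch.Size([8, 3, 20])
```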

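LSTMCell

LSTM and LSTMCell

An LSTMCell is the physical building block of an LSTM and computes a single time step.

An LSTM, in turn, is an LSTMCell unrolled along the time axis.

The input to an LSTMCell has shape [batch, input_size].
The outputs \(h_t\) and \(c_t\) have shape [batch, hidden_size].

Single-layer LSTM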

cell = nn.LSTMCell(input_size=100, hidden_size=20)  
x = torch.randn(10, 3, 100)  
h = torch.zeros(3,20)  
c = torch.zeros(3,20)  
  
for xt in x:  
    h, c = cell(xt, (h, c))
  
print(h.shape, c.shape)

torch.Size([3, 20]) torch.Size([3, 20])
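The claim that an LSTM is an LSTMCell unrolled in time can be checked numerically. The sketch below copies the parameters of a single-layer nn.LSTM into an nn.LSTMCell and compares the results; the variable names and the tolerance are choices made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=100, hidden_size=20, num_layers=1)
cell = nn.LSTMCell(input_size=100, hidden_size=20)

# Copy the layer-0 parameters into the cell so both compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(10, 3, 100)    # [seq_len, batch, input_size]
h = torch.zeros(3, 20)
c = torch.zeros(3, 20)

steps = []
with torch.no_grad():
    for xt in x:               # iterate over time steps
        h, c = cell(xt, (h, c))
        steps.append(h)
    out, (h_n, c_n) = lstm(x)

manual_out = torch.stack(steps)    # [seq_len, batch, hidden_size]
same_out = torch.allclose(manual_out, out, atol=1e-5)
same_state = torch.allclose(h, h_n[0], atol=1e-5) and torch.allclose(c, c_n[0], atol=1e-5)
print(same_out, same_state)
```

The hidden state collected at each step of the loop corresponds to the output tensor of nn.LSTM, and the final h and c correspond to h_n and c_n.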

Two-layer LSTM

cell1 = nn.LSTMCell(input_size=100, hidden_size=30)  
cell2 = nn.LSTMCell(input_size=30, hidden_size=20)  
  
x = torch.randn(10, 3, 100)  
  
h1 = torch.zeros(3, 30)  
c1 = torch.zeros(3, 30)  
h2 = torch.zeros(3,20)  
c2 = torch.zeros(3,20)  
  
for xt in x:  
    h1, c1 = cell1(xt, (h1, c1))
    # The hidden state of the first layer is the input to the second layer
    h2, c2 = cell2(h1, (h2, c2))
  
print(h2.shape, c2.shape)

torch.Size([3, 20]) torch.Size([3, 20])

Text prediction with a BiLSTM

Imports

import torch  
import numpy as np  
import torch.nn as nn  
import torch.optim as optim  
import torch.utils.data as Data  
  
dtype = torch.FloatTensor

Preparing the data

Build the vocabulary index.

sentence = (  
    'GitHub Actions makes it easy to automate all your software workflows '  
    'from continuous integration and delivery to issue triage and more')    # actually a single sentence
  
word2idx = {w: i for i, w in enumerate(list(set(sentence.split())))}  
idx2word = {i: w for i, w in enumerate(list(set(sentence.split())))}  
  
n_class = len(word2idx)  
max_len = len(sentence.split())  
  
n_hidden = 5

sentence
> 'GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more'
word2idx
> {'issue': 0,'integration': 1,'from': 2,'workflows': 3,'and': 4,'continuous': 5,'your': 6,
> 'software': 7,'GitHub': 8,'automate': 9,'easy': 10,'to': 11,'triage': 12,'makes': 13,
> 'Actions': 14,'more': 15,'all': 16,'it': 17,'delivery': 18}
idx2word
> {0: 'issue', 1: 'integration', 2: 'from', 3: 'workflows', 4: 'and', 5: 'continuous', 6: 'your',
> 7: 'software', 8: 'GitHub', 9: 'automate', 10: 'easy', 11: 'to', 12: 'triage', 13: 'makes',
> 14: 'Actions', 15: 'more', 16: 'all', 17: 'it', 18: 'delivery'}
n_class
> 19
max_len
> 21

Data preprocessing

Define the dataset and build the dataloader.

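def make_data(sentence):
    input_batch = []
    target_batch = []

    words = sentence.split()

    for i in range(max_len - 1):
        # Indices of the first i+1 words (the slice end i+1 is exclusive)
        input = [word2idx[w] for w in words[:(i+1)]]
        # Pad with index 0 up to max_len
        input = input + [0] * (max_len - len(input))
        # One-hot encode the padded sequence: [max_len, n_class]
        input_batch.append(np.eye(n_class)[input])
        # The target is the word that follows the prefix
        target = word2idx[words[i+1]]
        target_batch.append(target)

    # Stack into a single ndarray first; building a tensor from a list of arrays is slow
    return torch.tensor(np.array(input_batch), dtype=torch.float32), torch.LongTensor(target_batch)

input_batch, target_batch = make_data(sentence)
dataset = Data.TensorDataset(input_batch, target_batch)
dataloader = Data.DataLoader(dataset, batch_size=16, shuffle=True)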

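Note that n_class == 19 here, which is the size of the vocabulary.

In the first iteration of the loop, the first assignment to input stores the index of the first word, "GitHub". The second assignment to input pads the remaining max_len - len(input) positions with 0.
In the second iteration, the first assignment stores the indices of the first two words, "GitHub" and "Actions", and the second assignment again pads the remaining max_len - len(input) positions with 0, and so on.

In this example, the entries of input_batch represent "GitHub", "GitHub Actions", "GitHub Actions makes", ... and the entries of target_batch represent "Actions", "makes", "it", ...

np.eye(n_class)[input] builds a len(input) × n_class matrix in which row k is the one-hot vector with a 1 in column input[k].

The final shape of input_batch is [max_len - 1, max_len, n_class].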
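The one-hot trick with np.eye is easier to see on a tiny case. In this sketch the vocabulary size of 4 and the index list are made-up values for illustration only.

```python
import numpy as np

n_class = 4                  # hypothetical vocabulary size
indices = [2, 0, 0]          # e.g. one real word (index 2) followed by two padding zeros

# Indexing the identity matrix with a list of indices picks one row per index
one_hot = np.eye(n_class)[indices]
print(one_hot.shape)         # (3, 4)
print(one_hot)
```

Each row has a single 1 in the column given by the corresponding index. Note that padding with 0 produces the same row as a real word whose index is 0, so the two cannot be told apart in this encoding.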

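Defining the network

The Bi-LSTM network structure is shown in the figure below. Note that "Backward Layer" does not mean backpropagation; it means feeding the sentence in reverse order.

Concretely, take a four-word sentence, "i like your friends":
- A regular unidirectional LSTM takes "i like your" as input and predicts "friends".
- A bidirectional LSTM takes both "i like your" and "your like i" as input, concatenates the outputs of the Forward Layer and the Backward Layer (which can be thought of as drawing on information from both directions at once), and then predicts "friends".

Because of the extra reversed input, some dimensions of the inputs and outputs of the hidden layers become twice as large, as the figure below shows. For a bidirectional LSTM, num_directions = 2.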
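The doubled dimension can be split back into its forward and backward halves. The sketch below uses toy sizes chosen for illustration and the view layout described in the PyTorch nn.LSTM documentation; the variable names are assumptions of this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, n_in, n_hidden = 5, 3, 19, 5    # toy sizes for illustration
lstm = nn.LSTM(input_size=n_in, hidden_size=n_hidden, bidirectional=True)
x = torch.randn(seq_len, batch, n_in)

with torch.no_grad():
    out, (h_n, c_n) = lstm(x)                   # out: [seq_len, batch, 2 * n_hidden]

# Split the last dimension into (direction, hidden): 0 = forward, 1 = backward
out_dirs = out.view(seq_len, batch, 2, n_hidden)
forward_out, backward_out = out_dirs[:, :, 0], out_dirs[:, :, 1]

# The forward pass finishes at the last time step, the backward pass at the first
fwd_match = torch.allclose(forward_out[-1], h_n[0])
bwd_match = torch.allclose(backward_out[0], h_n[1])
print(fwd_match, bwd_match)
```

One consequence worth keeping in mind: when the last time step `out[-1]` is used as a summary of the sequence, its forward half has seen the whole sequence, while its backward half has only processed the final position.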

class BiLSTM(nn.Module):  
    def __init__(self):  
        super(BiLSTM, self).__init__()  
        # input_size is the dimension of the vector that represents each word
        self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)  
        self.fc = nn.Linear(n_hidden * 2, n_class)  
  
    def forward(self, X):  
        # X: [batch_size, max_len, n_class]  
        batch_size = X.shape[0]  
        # input: [max_len, batch_size, n_class]  
        input = X.transpose(0,1)  
        # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]  
        hidden_state = torch.randn(1*2, batch_size, n_hidden)  
        # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]  
        cell_state = torch.randn(1*2, batch_size, n_hidden)  
  
        outputs, (_,_) = self.lstm(input, (hidden_state, cell_state))  
        # outputs: [batch_size, n_hidden * 2]  
        outputs = outputs[-1]  
        # model: [batch_size, n_class]  
        model = self.fc(outputs)  
  
        return model  
  
model = BiLSTM()  
criterion = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters(), lr=0.001)

Training

epoch = 10000  
  
for e in range(epoch):  
    for x, y in dataloader:  
        pred = model(x)  
        loss = criterion(pred, y)  
        # Use the loop variable e; (epoch + 1) % 1000 is never 0 when epoch == 10000
        if (e + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (e + 1), 'cost =', '{:.6f}'.format(loss.item()))
  
        optimizer.zero_grad()  
        loss.backward()  
        optimizer.step()

Testing

predict = model(input_batch).data.max(1, keepdim=True)[1]  
print(sentence)  
print([idx2word[n.item()] for n in predict.squeeze()])

GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['to', 'to', 'easy', 'to', 'to', 'to', 'to', 'your', 'software', 'workflows', 'from', 'continuous', 'integration', 'delivery', 'delivery', 'to', 'issue', 'triage', 'more', 'more']

PyTorch Basics Series (5)
