NLP 大杀器 BERT 源码分析 - 文章 - 开发者社区

picture.image

文/高开远图片来源于网络

写在前面

update@2020.03.10

最近在看paddle相关资料，于是就打算仔细过一遍百度ERNIE的源码。之前粗看的时候还没有ERNIE2.0、ERNIE-tiny,整体感觉跟BERT也挺类似的，不知道更新了之后会是啥样~看完也会整理跟下面类似的总结，刚好也在研究paddle或ERNIE的同学可以加我一起讨论哈哈哈

原内容@2019.05.16

BERT 模型也出来很久了，之前有看过论文和一些博客对其做了解读：NLP 大杀器 BERT 模型解读[1]，但是一直没有细致地去看源码具体实现。最近有用到就抽时间来仔细看看记录下来，和大家一起讨论。

注意，源码阅读系列需要提前对 NLP 相关知识有所了解，比如 attention 机制、transformer 框架以及 python 和 tensorflow 基础等，关于 BERT 的原理不是本文的重点。

附上关于 BERT 的资料汇总：BERT 相关论文、文章和代码资源汇总[2]

今天要介绍的是 BERT 最主要的模型实现部分-----BertModel，代码位于

modeling.py 模块[3]

除了代码块外部，在代码块内部也有注释噢

如有解读不正确，请务必指出~

1、配置类（BertConfig）

这部分代码主要定义了 BERT 模型的一些默认参数，另外包括了一些文件处理函数。


          
class BertConfig(object):  
  """BERT模型的配置类."""  
  
  def \_\_init\_\_(self,  
 vocab\_size,  
 hidden\_size=768,  
 num\_hidden\_layers=12,  
 num\_attention\_heads=12,  
 intermediate\_size=3072,  
 hidden\_act="gelu",  
 hidden\_dropout\_prob=0.1,  
 attention\_probs\_dropout\_prob=0.1,  
 max\_position\_embeddings=512,  
 type\_vocab\_size=16,  
 initializer\_range=0.02):  
  
    self.vocab_size = vocab_size  
    self.hidden_size = hidden_size  
    self.num_hidden_layers = num_hidden_layers  
    self.num_attention_heads = num_attention_heads  
    self.hidden_act = hidden_act  
    self.intermediate_size = intermediate_size  
    self.hidden_dropout_prob = hidden_dropout_prob  
    self.attention_probs_dropout_prob = attention_probs_dropout_prob  
    self.max_position_embeddings = max_position_embeddings  
    self.type_vocab_size = type_vocab_size  
    self.initializer_range = initializer_range  
  
 @classmethod  
  def from\_dict(cls, json\_object):  
    """Constructs a `BertConfig` from a Python dictionary of parameters."""  
    config = BertConfig(vocab_size=None)  
    for (key, value) in six.iteritems(json_object):  
      config.__dict__[key] = value  
    return config  
  
 @classmethod  
  def from\_json\_file(cls, json\_file):  
    """Constructs a `BertConfig` from a json file of parameters."""  
    with tf.gfile.GFile(json_file, "r") as reader:  
      text = reader.read()  
    return cls.from_dict(json.loads(text))  
  
  def to\_dict(self):  
    """Serializes this instance to a Python dictionary."""  
    output = copy.deepcopy(self.__dict__)  
    return output  
  
  def to\_json\_string(self):  
    """Serializes this instance to a JSON string."""  
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

「参数具体含义」

vocab_size：词表大小
hidden_size：隐藏层神经元数
num_hidden_layers：Transformer encoder 中的隐藏层数
*num_attention_heads：*multi-head attention 的 head 数
intermediate_size：encoder 的“中间”隐层神经元数（例如 feed-forward layer）
hidden_act：隐藏层激活函数
hidden_dropout_prob：隐层 dropout 率
attention_probs_dropout_prob：注意力部分的 dropout
max_position_embeddings：最大位置编码
type_vocab_size：token_type_ids 的词典大小
initializer_range：truncated_normal_initializer 初始化方法的 stdev

这里要注意一点，可能刚看的时候对type_vocab_size这个参数会有点不理解，其实就是在next sentence prediction任务里的Segment A和 Segment B。在下载的bert_config.json文件里也有说明，默认值应该为 2。参考这个 Issue[4]

2、获取词向量（Embedding_lookup）

对于输入 word_ids，返回 embedding table。可以选用 one-hot 或者 tf.gather()


          
def embedding\_lookup(input\_ids,# word\_id：【batch\_size, seq\_length】  
 vocab\_size,  
 embedding\_size=128,  
 initializer\_range=0.02,  
 word\_embedding\_name="word\_embeddings",  
 use\_one\_hot\_embeddings=False):  
  
  # 该函数默认输入的形状为【batch\_size, seq\_length, input\_num】  
  # 如果输入为2D的【batch\_size, seq\_length】，则扩展到【batch\_size, seq\_length, 1】  
  if input_ids.shape.ndims == 2:  
    input_ids = tf.expand_dims(input_ids, axis=[-1])  
  
  embedding_table = tf.get_variable(  
      name=word_embedding_name,  
      shape=[vocab_size, embedding_size],  
      initializer=create_initializer(initializer_range))  
  
  flat_input_ids = tf.reshape(input_ids, [-1])    #【batch\_size*seq\_length*input\_num】  
  if use_one_hot_embeddings:  
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)  
    output = tf.matmul(one_hot_input_ids, embedding_table)  
  else:# 按索引取值  
    output = tf.gather(embedding_table, flat_input_ids)  
  
  input_shape = get_shape_list(input_ids)  
  
  # output：[batch\_size, seq\_length, num\_inputs]  
  # 转成:[batch\_size, seq\_length, num\_inputs*embedding\_size]  
  output = tf.reshape(output,  
        input_shape[0:-1] + [input_shape[-1] * embedding_size])  
  return (output, embedding_table)

「参数具体含义」

input_ids：word id 【batch_size, seq_length】
vocab_size：embedding 词表
embedding_size：embedding 维度
initializer_range：embedding 初始化范围
word_embedding_name：embeddding table 命名
use_one_hot_embeddings：是否使用 one-hotembedding
Return：【batch_size, seq_length, embedding_size】

3、词向量的后续处理（embedding_postprocessor）

我们知道 BERT 模型的输入有三部分：token embedding ，segment embedding以及position embedding。上一节中我们只获得了 token embedding，这部分代码对其完善信息，正则化，dropout 之后输出最终 embedding。注意，在 Transformer 论文中的position embedding是由 sin/cos 函数生成的固定的值，而在这里代码实现中是跟普通 word embedding 一样随机生成的，可以训练的。作者这里这样选择的原因可能是 BERT 训练的数据比 Transformer 那篇大很多，完全可以让模型自己去学习。 picture.image


          
def embedding\_postprocessor(input\_tensor,# [batch\_size, seq\_length, embedding\_size]  
 use\_token\_type=False,  
 token\_type\_ids=None,  
 token\_type\_vocab\_size=16,# 一般是2  
 token\_type\_embedding\_name="token\_type\_embeddings",  
 use\_position\_embeddings=True,  
 position\_embedding\_name="position\_embeddings",  
 initializer\_range=0.02,  
 max\_position\_embeddings=512, #最大位置编码，必须大于等于max\_seq\_len  
 dropout\_prob=0.1):  
  
  input_shape = get_shape_list(input_tensor, expected_rank=3)   #【batch\_size,seq\_length,embedding\_size】  
  batch_size = input_shape[0]  
  seq_length = input_shape[1]  
  width = input_shape[2]  
  
  output = input_tensor  
  
  # Segment position信息  
  if use_token_type:  
    if token_type_ids is None:  
      raise ValueError("`token\_type\_ids` must be specified if"  
                       "`use\_token\_type` is True.")  
    token_type_table = tf.get_variable(  
        name=token_type_embedding_name,  
        shape=[token_type_vocab_size, width],  
        initializer=create_initializer(initializer_range))  
    # 由于token-type-table比较小，所以这里采用one-hot的embedding方式加速  
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])  
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)  
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)  
    token_type_embeddings = tf.reshape(token_type_embeddings,  
                                       [batch_size, seq_length, width])  
    output += token_type_embeddings  
  
  # Position embedding信息  
  if use_position_embeddings:  
    # 确保seq\_length小于等于max\_position\_embeddings  
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)  
    with tf.control_dependencies([assert_op]):  
      full_position_embeddings = tf.get_variable(  
          name=position_embedding_name,  
          shape=[max_position_embeddings, width],  
          initializer=create_initializer(initializer_range))  
  
      # 这里position embedding是可学习的参数，[max\_position\_embeddings, width]  
      # 但是通常实际输入序列没有达到max\_position\_embeddings  
      # 所以为了提高训练速度，使用tf.slice取出句子长度的embedding  
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],  
                                     [seq_length, -1])  
      num_dims = len(output.shape.as_list())  
  
      # word embedding之后的tensor是[batch\_size, seq\_length, width]  
  # 因为位置编码是与输入内容无关，它的shape总是[seq\_length, width]  
  # 我们无法把位置Embedding加到word embedding上  
  # 因此我们需要扩展位置编码为[1, seq\_length, width]  
  # 然后就能通过broadcasting加上去了。  
      position_broadcast_shape = []  
      for _ in range(num_dims - 2):  
        position_broadcast_shape.append(1)  
      position_broadcast_shape.extend([seq_length, width])  
      position_embeddings = tf.reshape(position_embeddings,  
                                       position_broadcast_shape)  
      output += position_embeddings  
  
  output = layer_norm_and_dropout(output, dropout_prob)  
  return output

4、构造 attention_mask

该部分代码的作用是构造 attention 可视域的 attention_mask，因为每个样本都经过 padding 过程，在做self-attention的是padding的部分不能attend到其他部分上。输入为形状为 [batch_size, from_seq_length,...] 的 padding 好的 input_ids 和形状为 [batch_size, to_seq_length] 的 mask 标记向量。


          
def create\_attention\_mask\_from\_input\_mask(from\_tensor, to\_mask):  
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])  
  batch_size = from_shape[0]  
  from_seq_length = from_shape[1]  
  
  to_shape = get_shape_list(to_mask, expected_rank=2)  
  to_seq_length = to_shape[1]  
  
  to_mask = tf.cast(  
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)  
  
  broadcast_ones = tf.ones(  
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)  
  
  mask = broadcast_ones * to_mask  
  
  return mask

5、注意力层（attention layer）

这部分代码是「multi-head attention」的实现，主要来自《Attention is all you need》这篇论文。考虑key-query-value形式的 attention，输入的from_tensor当做是 query， to_tensor当做是 key 和 value，当两者相同的时候即为 self-attention。关于 attention 更详细的介绍可以转到【理解 Attention 机制原理及模型[5]】。


          
def attention\_layer(from\_tensor, # 【batch\_size, from\_seq\_length, from\_width】  
 to\_tensor,#【batch\_size, to\_seq\_length, to\_width】  
 attention\_mask=None,#【batch\_size,from\_seq\_length, to\_seq\_length】  
 num\_attention\_heads=1,# attention head numbers  
 size\_per\_head=512,# 每个head的大小  
 query\_act=None,# query变换的激活函数  
 key\_act=None,# key变换的激活函数  
 value\_act=None,# value变换的激活函数  
 attention\_probs\_dropout\_prob=0.0,# attention层的dropout  
 initializer\_range=0.02,# 初始化取值范围  
 do\_return\_2d\_tensor=False,# 是否返回2d张量。  
#如果True，输出形状【batch\_size*from\_seq\_length,num\_attention\_heads*size\_per\_head】  
#如果False，输出形状【batch\_size, from\_seq\_length, num\_attention\_heads*size\_per\_head】  
 batch\_size=None,#如果输入是3D的，  
#那么batch就是第一维，但是可能3D的压缩成了2D的，所以需要告诉函数batch\_size  
 from\_seq\_length=None,# 同上  
 to\_seq\_length=None):# 同上  
  
  def transpose\_for\_scores(input\_tensor, batch\_size, num\_attention\_heads,  
 seq\_length, width):  
    output_tensor = tf.reshape(  
        input_tensor, [batch_size, seq_length, num_attention_heads, width])  
  
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])#[batch\_size, num\_attention\_heads, seq\_length, width]  
    return output_tensor  
  
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])  
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])  
  
  if len(from_shape) != len(to_shape):  
    raise ValueError(  
        "The rank of `from\_tensor` must match the rank of `to\_tensor`.")  
  
  if len(from_shape) == 3:  
    batch_size = from_shape[0]  
    from_seq_length = from_shape[1]  
    to_seq_length = to_shape[1]  
  elif len(from_shape) == 2:  
    if (batch_size is None or from_seq_length is None or to_seq_length is None):  
      raise ValueError(  
          "When passing in rank 2 tensors to attention\_layer, the values "  
          "for `batch\_size`, `from\_seq\_length`, and `to\_seq\_length` "  
          "must all be specified.")  
  
  # 为了方便备注shape，采用以下简写:  
  # B = batch size (number of sequences)  
  # F = `from\_tensor` sequence length  
  # T = `to\_tensor` sequence length  
  # N = `num\_attention\_heads`  
  # H = `size\_per\_head`  
  
  # 把from\_tensor和to\_tensor压缩成2D张量  
  from_tensor_2d = reshape_to_matrix(from_tensor)# 【B*F, hidden\_size】  
  to_tensor_2d = reshape_to_matrix(to_tensor)# 【B*T, hidden\_size】  
  
  # 将from\_tensor输入全连接层得到query\_layer  
  # `query\_layer` = [B*F, N*H]  
  query_layer = tf.layers.dense(  
      from_tensor_2d,  
      num_attention_heads * size_per_head,  
      activation=query_act,  
      name="query",  
      kernel_initializer=create_initializer(initializer_range))  
  
  # 将from\_tensor输入全连接层得到query\_layer  
  # `key\_layer` = [B*T, N*H]  
  key_layer = tf.layers.dense(  
      to_tensor_2d,  
      num_attention_heads * size_per_head,  
      activation=key_act,  
      name="key",  
      kernel_initializer=create_initializer(initializer_range))  
  
  # 同上  
  # `value\_layer` = [B*T, N*H]  
  value_layer = tf.layers.dense(  
      to_tensor_2d,  
      num_attention_heads * size_per_head,  
      activation=value_act,  
      name="value",  
      kernel_initializer=create_initializer(initializer_range))  
  
  # query\_layer转成多头：[B*F, N*H]==>[B, F, N, H]==>[B, N, F, H]  
  query_layer = transpose_for_scores(query_layer, batch_size,  
                                     num_attention_heads, from_seq_length,  
                                     size_per_head)  
  
  # key\_layer转成多头：[B*T, N*H] ==> [B, T, N, H] ==> [B, N, T, H]  
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,  
                                   to_seq_length, size_per_head)  
  
  # 将query与key做点积，然后做一个scale，公式可以参见原始论文  
  # `attention\_scores` = [B, N, F, T]  
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)  
  attention_scores = tf.multiply(attention_scores,  
                                 1.0 / math.sqrt(float(size_per_head)))  
  
  if attention_mask is not None:  
    # `attention\_mask` = [B, 1, F, T]  
    attention_mask = tf.expand_dims(attention_mask, axis=[1])  
  
    # 如果attention\_mask里的元素为1，则通过下面运算有（1-1）*-10000，adder就是0  
    # 如果attention\_mask里的元素为0，则通过下面运算有（1-0）*-10000，adder就是-10000  
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0  
  
    # 我们最终得到的attention\_score一般不会很大，  
    #所以上述操作对mask为0的地方得到的score可以认为是负无穷  
    attention_scores += adder  
  
  # 负无穷经过softmax之后为0，就相当于mask为0的位置不计算attention\_score  
  # `attention\_probs` = [B, N, F, T]  
  attention_probs = tf.nn.softmax(attention_scores)  
  
  # 对attention\_probs进行dropout，这虽然有点奇怪，但是Transforme原始论文就是这么做的  
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)  
  
  # `value\_layer` = [B, T, N, H]  
  value_layer = tf.reshape(  
      value_layer,  
      [batch_size, to_seq_length, num_attention_heads, size_per_head])  
  
  # `value\_layer` = [B, N, T, H]  
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])  
  
  # `context\_layer` = [B, N, F, H]  
  context_layer = tf.matmul(attention_probs, value_layer)  
  
  # `context\_layer` = [B, F, N, H]  
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])  
  
  if do_return_2d_tensor:  
    # `context\_layer` = [B*F, N*H]  
    context_layer = tf.reshape(  
        context_layer,  
        [batch_size * from_seq_length, num_attention_heads * size_per_head])  
  else:  
    # `context\_layer` = [B, F, N*H]  
    context_layer = tf.reshape(  
        context_layer,  
        [batch_size, from_seq_length, num_attention_heads * size_per_head])  
  
  return context_layer

总结一下，attention layer 的主要流程：

对输入的 tensor 进行形状校验，提取batch_size、from_seq_length 、to_seq_length；
输入如果是 3d 张量则转化成 2d 矩阵；
from_tensor 作为 query， to_tensor 作为 key 和 value，经过一层全连接层后得到 query_layer、key_layer 、value_layer；
将上述张量通过transpose_for_scores转化成 multi-head；
根据论文公式计算 attention_score 以及 attention_probs（注意 attention_mask 的 trick）：
将得到的 attention_probs 与 value 相乘，返回 2D 或 3D 张量

6、Transformer

接下来的代码就是大名鼎鼎的 Transformer 的核心代码了，可以认为是"Attention is All You Need"原始代码重现。可以参见【原始论文[6]】和【原始代码[7]】。


          
def transformer\_model(input\_tensor,# 【batch\_size, seq\_length, hidden\_size】  
 attention\_mask=None,# 【batch\_size, seq\_length, seq\_length】  
 hidden\_size=768,  
 num\_hidden\_layers=12,  
 num\_attention\_heads=12,  
 intermediate\_size=3072,  
 intermediate\_act\_fn=gelu,# feed-forward层的激活函数  
 hidden\_dropout\_prob=0.1,  
 attention\_probs\_dropout\_prob=0.1,  
 initializer\_range=0.02,  
 do\_return\_all\_layers=False):  
  
  # 这里注意，因为最终要输出hidden\_size， 我们有num\_attention\_head个区域，  
  # 每个head区域有size\_per\_head多的隐层  
  # 所以有 hidden\_size = num\_attention\_head * size\_per\_head  
  if hidden_size % num_attention_heads != 0:  
    raise ValueError(  
        "The hidden size (%d) is not a multiple of the number of attention "  
        "heads (%d)" % (hidden_size, num_attention_heads))  
  
  attention_head_size = int(hidden_size / num_attention_heads)  
  input_shape = get_shape_list(input_tensor, expected_rank=3)  
  batch_size = input_shape[0]  
  seq_length = input_shape[1]  
  input_width = input_shape[2]  
  
  # 因为encoder中有残差操作，所以需要shape相同  
  if input_width != hidden_size:  
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %  
                     (input_width, hidden_size))  
  
  # reshape操作在CPU/GPU上很快，但是在TPU上很不友好  
  # 所以为了避免2D和3D之间的频繁reshape，我们把所有的3D张量用2D矩阵表示  
  prev_output = reshape_to_matrix(input_tensor)  
  
  all_layer_outputs = []  
  for layer_idx in range(num_hidden_layers):  
    with tf.variable_scope("layer\_%d" % layer_idx):  
      layer_input = prev_output  
  
      with tf.variable_scope("attention"):  
      # multi-head attention  
        attention_heads = []  
        with tf.variable_scope("self"):  
        # self-attention  
          attention_head = attention_layer(  
              from_tensor=layer_input,  
              to_tensor=layer_input,  
              attention_mask=attention_mask,  
              num_attention_heads=num_attention_heads,  
              size_per_head=attention_head_size,  
              attention_probs_dropout_prob=attention_probs_dropout_prob,  
              initializer_range=initializer_range,  
              do_return_2d_tensor=True,  
              batch_size=batch_size,  
              from_seq_length=seq_length,  
              to_seq_length=seq_length)  
          attention_heads.append(attention_head)  
  
        attention_output = None  
        if len(attention_heads) == 1:  
          attention_output = attention_heads[0]  
        else:  
          # 如果有多个head，将他们拼接起来  
          attention_output = tf.concat(attention_heads, axis=-1)  
  
        # 对attention的输出进行线性映射, 目的是将shape变成与input一致  
        # 然后dropout+residual+norm  
        with tf.variable_scope("output"):  
          attention_output = tf.layers.dense(  
              attention_output,  
              hidden_size,  
              kernel_initializer=create_initializer(initializer_range))  
          attention_output = dropout(attention_output, hidden_dropout_prob)  
          attention_output = layer_norm(attention_output + layer_input)  
  
      # feed-forward  
      with tf.variable_scope("intermediate"):  
        intermediate_output = tf.layers.dense(  
            attention_output,  
            intermediate_size,  
            activation=intermediate_act_fn,  
            kernel_initializer=create_initializer(initializer_range))  
  
      # 对feed-forward层的输出使用线性变换变回‘hidden\_size’  
      # 然后dropout + residual + norm  
      with tf.variable_scope("output"):  
        layer_output = tf.layers.dense(  
            intermediate_output,  
            hidden_size,  
            kernel_initializer=create_initializer(initializer_range))  
        layer_output = dropout(layer_output, hidden_dropout_prob)  
        layer_output = layer_norm(layer_output + attention_output)  
        prev_output = layer_output  
        all_layer_outputs.append(layer_output)  
  
  if do_return_all_layers:  
    final_outputs = []  
    for layer_output in all_layer_outputs:  
      final_output = reshape_from_matrix(layer_output, input_shape)  
      final_outputs.append(final_output)  
    return final_outputs  
  else:  
    final_output = reshape_from_matrix(prev_output, input_shape)  
    return final_output

配上下图一同使用效果更佳，因为 BERT 里只有 encoder，所有 decoder 没有姓名 picture.image

7、函数入口（init）

BertModel 类的构造函数，有了上面几节的铺垫，我们就可以来实现 BERT 模型了。


          
def \_\_init\_\_(self,  
 config,# BertConfig对象  
 is\_training,  
 input\_ids,# 【batch\_size, seq\_length】  
 input\_mask=None,# 【batch\_size, seq\_length】  
 token\_type\_ids=None,# 【batch\_size, seq\_length】  
 use\_one\_hot\_embeddings=False,# 是否使用one-hot；否则tf.gather()  
 scope=None):  
  
    config = copy.deepcopy(config)  
    if not is_training:  
      config.hidden_dropout_prob = 0.0  
      config.attention_probs_dropout_prob = 0.0  
  
    input_shape = get_shape_list(input_ids, expected_rank=2)  
    batch_size = input_shape[0]  
    seq_length = input_shape[1]  
# 不做mask，即所有元素为1  
    if input_mask is None:  
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)  
  
    if token_type_ids is None:  
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)  
  
    with tf.variable_scope(scope, default_name="bert"):  
      with tf.variable_scope("embeddings"):  
        # word embedding  
        (self.embedding_output, self.embedding_table) = embedding_lookup(  
            input_ids=input_ids,  
            vocab_size=config.vocab_size,  
            embedding_size=config.hidden_size,  
            initializer_range=config.initializer_range,  
            word_embedding_name="word\_embeddings",  
            use_one_hot_embeddings=use_one_hot_embeddings)  
  
        # 添加position embedding和segment embedding  
        # layer norm + dropout  
        self.embedding_output = embedding_postprocessor(  
            input_tensor=self.embedding_output,  
            use_token_type=True,  
            token_type_ids=token_type_ids,  
            token_type_vocab_size=config.type_vocab_size,  
            token_type_embedding_name="token\_type\_embeddings",  
            use_position_embeddings=True,  
            position_embedding_name="position\_embeddings",  
            initializer_range=config.initializer_range,  
            max_position_embeddings=config.max_position_embeddings,  
            dropout_prob=config.hidden_dropout_prob)  
  
      with tf.variable_scope("encoder"):  
  
        # input\_ids是经过padding的word\_ids：[25, 120, 34, 0, 0]  
        # input\_mask是有效词标记：[1, 1, 1, 0, 0]  
        attention_mask = create_attention_mask_from_input_mask(  
            input_ids, input_mask)  
  
        # transformer模块叠加  
        # `sequence\_output` shape = [batch\_size, seq\_length, hidden\_size].  
        self.all_encoder_layers = transformer_model(  
            input_tensor=self.embedding_output,  
            attention_mask=attention_mask,  
            hidden_size=config.hidden_size,  
            num_hidden_layers=config.num_hidden_layers,  
            num_attention_heads=config.num_attention_heads,  
            intermediate_size=config.intermediate_size,  
            intermediate_act_fn=get_activation(config.hidden_act),  
            hidden_dropout_prob=config.hidden_dropout_prob,  
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,  
            initializer_range=config.initializer_range,  
            do_return_all_layers=True)  
  
  # `self.sequence\_output`是最后一层的输出，shape为【batch\_size, seq\_length, hidden\_size】  
      self.sequence_output = self.all_encoder_layers[-1]  
  
      # ‘pooler’部分将encoder输出【batch\_size, seq\_length, hidden\_size】  
      # 转成【batch\_size, hidden\_size】  
      with tf.variable_scope("pooler"):  
        # 取最后一层的第一个时刻[CLS]对应的tensor， 对于分类任务很重要  
# sequence\_output[:, 0:1, :]得到的是[batch\_size, 1, hidden\_size]  
# 我们需要用squeeze把第二维去掉  
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)  
        # 然后再加一个全连接层，输出仍然是[batch\_size, hidden\_size]  
        self.pooled_output = tf.layers.dense(  
            first_token_tensor,  
            config.hidden_size,  
            activation=tf.tanh,  
            kernel_initializer=create_initializer(config.initializer_range))

总结一哈

有了以上对源码的深入了解之后，我们在使用 BertModel 的时候就会更加得心应手。举个模型使用的简单栗子：


          
# 假设输入已经经过分词变成word\_ids. shape=[2, 3]  
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])  
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])  
# segment\_emebdding. 表示第一个样本前两个词属于句子1，后一个词属于句子2.  
# 第二个样本的第一个词属于句子1， 第二次词属于句子2，第三个元素0表示padding  
# 原始代码是下面这样的，但是感觉么必要用 2，不知道是不是我哪里没理解  
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])  
  
# 创建BertConfig实例  
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,  
         num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)  
  
# 创建BertModel实例  
model = modeling.BertModel(config=config, is_training=True,  
     input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)  
  
  
label_embeddings = tf.get_variable(...)  
#得到最后一层的第一个Token也就是[CLS]向量表示，可以看成是一个句子的embedding  
pooled_output = model.get_pooled_output()  
logits = tf.matmul(pooled_output, label_embeddings)

在 BERT 模型构建这一块的主要流程：

对输入序列进行 Embedding（三个），接下去就是‘Attention is all you need’的内容了
简单一点就是将 embedding 输入 transformer 得到输出结果；
详细一点就是 embedding --> N *【multi-head attention --> Add(Residual) &Norm--> Feed-Forward --> Add(Residual) &Norm】；
哈，是不是很简单~
源码中还有一些其他的辅助函数，不是很难理解，这里就不再啰嗦。

本文参考资料

[1] NLP 大杀器 BERT 模型解读: https://blog.csdn.net/Kaiyuan\_sjtu/article/details/83991186

[2] BERT 相关论文、文章和代码资源汇总: http://www.52nlp.cn/bert-paper-%E8%AE%BA%E6%96%87-%E6%96%87%E7%AB%A0-%E4%BB%A3%E7%A0%81%E8%B5%84%E6%BA%90%E6%B1%87%E6%80%BB

[3] modeling.py 模块: https://github.com/google-research/bert/blob/master/modeling.py

[4] 参考这个 Issue: https://github.com/google-research/bert/issues/16

[5] 理解 Attention 机制原理及模型: https://blog.csdn.net/Kaiyuan\_sjtu/article/details/81806123

[6] 原始论文: https://arxiv.org/abs/1706.03762

[7] 原始代码: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

投稿邮箱：pythonpost@163.com

▼ 投稿请 点击阅读原文

喜欢文章，点个 在看