Tiny Universe - Llama3 Architecture

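Llama3, Llama2, and Qwen2 share a similar overall architecture. This article focuses on the main differences between them.

For the Qwen2 architecture, see the Qwen2 Architecture Study Notes.

At the model level, the main difference between Llama3 and Llama2 is that Llama3 uses GQA (grouped-query attention) in every model size.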
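To make the GQA idea concrete, here is a minimal sketch in which several query heads share one key/value head. It is not Llama3's actual implementation: the head counts and tensor sizes are illustrative assumptions, and `repeat_interleave` is just one simple way to line each K/V head up with its group of query heads.

```python
import math
import torch

torch.manual_seed(0)
batch, seq_len, head_dim = 2, 4, 8
n_q_heads, n_kv_heads = 8, 2            # illustrative values: 4 query heads per K/V head
group_size = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each K/V head so that every query head has a matching K/V head.
k_exp = k.repeat_interleave(group_size, dim=1)   # (batch, n_q_heads, seq_len, head_dim)
v_exp = v.repeat_interleave(group_size, dim=1)

# Ordinary scaled dot-product attention on the expanded tensors.
scores = q @ k_exp.transpose(-2, -1) / math.sqrt(head_dim)
weights = torch.softmax(scores, dim=-1)
gqa_out = weights @ v_exp                        # (batch, n_q_heads, seq_len, head_dim)
```

Only `n_kv_heads` key/value heads need to be stored, so in this sketch the K/V tensors are `n_q_heads / n_kv_heads` times smaller than in standard multi-head attention.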

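Background

MLP

An MLP (Multi-Layer Perceptron) is a feedforward neural network made up of one or more fully connected layers. Each fully connected layer has a learnable weight matrix and bias vector that apply a linear transformation to the input, followed by a nonlinear activation. MLPs are used for many tasks, such as classification and regression.

In large models, the MLP usually serves as a basic building block for more complex structures. For example, the feed-forward network in a Transformer is an MLP. MLPs can also be combined with other architectures, such as convolutional neural networks, to build more powerful models.

A typical MLP has three layers: an input layer, a hidden layer, and an output layer. Adjacent layers are fully connected, meaning that every neuron in one layer is connected to every neuron in the next.

As shown in the figure.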
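The three-layer structure described above can be sketched in PyTorch as follows. The class name `SimpleMLP`, the layer sizes, and the choice of ReLU as the activation are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class SimpleMLP(nn.Module):
    """Input layer -> hidden layer -> output layer, all fully connected."""

    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # input -> hidden (weights + bias)
        self.act = nn.ReLU()                       # nonlinear activation
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # hidden -> output

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


mlp = SimpleMLP(in_dim=16, hidden_dim=64, out_dim=4)
mlp_out = mlp(torch.randn(3, 16))   # a batch of 3 input vectors
```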

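Attention

In self-attention, each element of the input sequence first passes through three different linear transformations, which produce the Query, Key, and Value matrices. Together, these three matrices are used to compute the attention weights between the elements of the input sequence.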

Let the input sequence be $X = [x_1, x_2, \dots, x_n]$. After the linear transformations, we obtain:

$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$

where:

$X$ is the input sequence, in which each element is a vector.
$W^Q, W^K, W^V$ are the learnable weight matrices that produce the Query, Key, and Value, respectively.
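As a small illustration of these projections, the sketch below builds $Q$, $K$, and $V$ from one input matrix with three bias-free `nn.Linear` layers. The sizes `n`, `d_model`, and `d_k` are arbitrary values chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d_model, d_k = 5, 16, 8           # illustrative sizes: 5 tokens, 16-dim inputs
X = torch.randn(n, d_model)          # each row is one input vector x_i

# Three independent learnable projections; bias is omitted to match Q = X W^Q.
W_Q = nn.Linear(d_model, d_k, bias=False)
W_K = nn.Linear(d_model, d_k, bias=False)
W_V = nn.Linear(d_model, d_k, bias=False)

Q, K, V = W_Q(X), W_K(X), W_V(X)     # each has shape (n, d_k)
```

Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, so `W_Q(X)` computes `X @ W_Q.weight.T`.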

1.1 Scaled Dot-Product Attention

The Q and K matrices are used to compute how relevant each input is to every other input. The formula is:

$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V$

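where:

Q is the query matrix.
K is the key matrix, and $K^\top$ is its transpose.
V is the value matrix.
$d_k$ is the dimension of the key vectors, and $\sqrt{d_k}$ is the scaling factor. It keeps the dot products from growing so large that softmax saturates and its gradients become very small.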
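The effect of the scaling factor can be checked numerically. The sketch below assumes query and key components drawn from a standard normal distribution, in which case the dot products have a standard deviation of roughly $\sqrt{d_k}$ before scaling and roughly 1 after. The sample sizes are arbitrary.

```python
import math
import torch

torch.manual_seed(0)
d_k = 256
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

raw = (q * k).sum(dim=-1)            # 10000 unscaled dot products
scaled = raw / math.sqrt(d_k)        # the same values after scaling
raw_std = raw.std().item()           # close to sqrt(d_k) = 16
scaled_std = scaled.std().item()     # close to 1

# One query against 8 keys: unscaled logits give a sharper (more saturated) softmax.
logits = torch.randn(8, d_k) @ torch.randn(d_k)
p_raw = torch.softmax(logits, dim=-1)
p_scaled = torch.softmax(logits / math.sqrt(d_k), dim=-1)
```

Comparing `p_raw.max()` with `p_scaled.max()` shows how the unscaled version concentrates more of the weight on a single key.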

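1.2 Step-by-Step Computation

Generate the Query, Key, and Value matrices: the input sequence passes through separate linear transformations to produce Q, K, and V.
Compute the dot products: multiply the Query matrix by the transposed Key matrix to get a relevance score between every pair of elements in the input sequence.
Scale and normalize: divide the scores by $\sqrt{d_k}$ and apply softmax to obtain the attention weights.
Weighted sum: multiply the attention weights by the Value matrix to obtain the final context vectors.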

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k  # dimension of each key vector

    def forward(self, Q, K, V, mask=None):
        # Q, K, V: (..., seq_len, d_k); leading dims may be (batch,) or (batch, heads)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Positions where mask == 0 get a large negative score,
            # so softmax assigns them (almost) zero weight.
            scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)  # normalize over the key dimension
        output = torch.matmul(attn_weights, V)  # weighted sum of values
        return output, attn_weights
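As a usage illustration of the optional mask, the sketch below applies the same computation with a causal (lower-triangular) mask, so each position attends only to itself and earlier positions. It uses a standalone helper function, here called `attention`, rather than the class above so that the example runs on its own. The helper name and the tensor sizes are assumptions.

```python
import math
import torch
import torch.nn.functional as F


def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; mask == 0 marks positions to hide."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights


torch.manual_seed(0)
seq_len, d_k = 4, 8
x = torch.randn(1, seq_len, d_k)                         # (batch, seq_len, d_k)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = visible, 0 = hidden
out, w = attention(x, x, x, causal_mask)                 # mask broadcasts over the batch
```

In `w`, everything above the diagonal is zero, and the first row puts all of its weight on the first position.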

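Multi-head attention runs several attention heads in parallel to capture features from different subspaces. Each head produces its own Q, K, and V matrices and computes self-attention independently. The outputs of all heads are then concatenated and projected through a linear layer to give the final output.

2.1 Multi-Head Attention Formula

Multi-head attention is defined as: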

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$

where each head is computed as:

$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

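$W_i^Q, W_i^K, W_i^V$ are the learnable weights of head $i$.
$W^O$ is the linear projection matrix applied after concatenation.

Multi-head attention lets the model attend to different parts of the input sequence in different subspaces, which increases its expressive power.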
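In practice, the per-head matrices $W_i^Q$ are often stored as one large projection that is reshaped into heads. The sketch below checks that the two views agree for the query projection. The sizes are illustrative assumptions, and bias is omitted for simplicity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model, num_heads = 1, 5, 8, 2
d_k = d_model // num_heads
x = torch.randn(batch, seq_len, d_model)
W_Q = nn.Linear(d_model, d_model, bias=False)

# Fused view: project once, then split the last dimension into heads.
q_fused = W_Q(x).view(batch, seq_len, num_heads, d_k).transpose(1, 2)

# Per-head view: head i uses rows [i*d_k, (i+1)*d_k) of the fused weight.
# nn.Linear stores weight as (out_features, in_features), hence the transpose.
q_per_head = torch.stack(
    [x @ W_Q.weight[i * d_k:(i + 1) * d_k].T for i in range(num_heads)],
    dim=1,
)  # (batch, num_heads, seq_len, d_k)
```

Both tensors have shape `(batch, num_heads, seq_len, d_k)` and hold the same values, which is why a single `nn.Linear(d_model, d_model)` per projection is enough.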

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads  # dimension of each head
        self.num_heads = num_heads

        # Projections for Q, K, V (all heads fused into one matrix) and the output projection W^O
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.fc_out = nn.Linear(d_model, d_model)

        # Attention module shared by all heads
        self.attention = ScaledDotProductAttention(self.d_k)

    def forward(self, Q, K, V, mask=None):
        # Q, K, V: (batch_size, seq_len, d_model)
        # mask (optional) must broadcast to (batch_size, num_heads, seq_len_q, seq_len_k)
        batch_size = Q.size(0)

        # Project, then split into heads: (batch_size, num_heads, seq_len, d_k)
        Q = self.W_Q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_K(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_V(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply scaled dot-product attention to every head in parallel
        attn_output, attn_weights = self.attention(Q, K, V, mask)

        # Concatenate heads back to (batch_size, seq_len, d_model) and apply the output projection
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        output = self.fc_out(attn_output)

        return output, attn_weights
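For comparison, PyTorch ships a built-in `nn.MultiheadAttention` module that covers the same computation. The sketch below runs it as self-attention on a random input. The sizes are illustrative, and `batch_first=True` is set so that the input layout matches the `(batch, seq_len, d_model)` convention used above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model, num_heads = 2, 5, 16, 4
x = torch.randn(batch, seq_len, d_model)

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
mha_out, mha_weights = mha(x, x, x)   # self-attention: query = key = value = x
```

One difference to keep in mind: by default the built-in module returns attention weights averaged over the heads, with shape `(batch, seq_len, seq_len)`, whereas the class above returns per-head weights with shape `(batch, num_heads, seq_len, seq_len)`.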