Complete Guide to LLM Fine-tuning - Qwopus3.5-27B LoRA Fine-tuning with Unsloth


This content is licensed under CC 4.0 BY-SA.

Original link: https://github.com/R6410418/Jackrong-llm-finetuning-guide


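Source: https://github.com/R6410418/Jackrong-llm-finetuning-guide. The original is an English PDF. I first converted the PDF to text, had DS format it as English Markdown, and then generated a Chinese version. The result may not match the original exactly; at the very least the formatting has been adjusted, because Markdown does not support the original's layout flourishes. The Chinese version comes first, and the English version is attached after it.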

LLM微调完全指南

使用Unsloth进行Qwopus3.5-27B LoRA微调

Jackrong
2026年4月5日

摘要

本文提出并实现了一个面向推理的监督微调pipeline，用于Qwen3.5-27B。该工作流基于Unsloth构建并在Google Colab上运行，结合了4-bit量化加载和rank-64 LoRA适配，使得在单GPU显存限制下能够高效训练27B规模的模型。与通用的chat微调不同，该pipeline混合了三个推理数据源，并将异构的assistant输出归一化为统一的监督格式，该格式结合了<think>...</think>推理轨迹和最终答案。结合Qwen thinking chat template和response-only监督，优化目标明确地聚焦于assistant侧推理续写，而不是记忆完整的对话轮次。

训练完成后，该pipeline支持LoRA adapter检查点保存、合并后的16-bit导出以及多格式GGUF发布，形成了从数据归一化到参数高效微调和部署交付的完整工程链路。重要的是，尽管Qwen3.5属于多模态模型家族，但本文实现的notebook执行的是纯文本推理SFT，而非视觉语言联合训练。

关键词: Qwen3.5-27B, Reasoning SFT, LoRA, Unsloth, 4-bit Quantization, Google Colab, GGUF


基座LLM (预训练) + 训练数据 (领域/任务) → 微调 (LoRA / PEFT) → 适配后模型 (基座 + adapter)

LLM微调：使用特定任务的数据将预训练LLM适配到您的领域（通常通过训练一个小的LoRA adapter）。

作者寄语 —— 给构建者的话

对于初学者、爱好者以及任何对AI感到好奇的人：这条道路是可以学会的。

本文档的目的不仅仅是描述一次训练运行，也是向初学者、爱好者以及任何对AI感到好奇的人传达一个更广泛的信息：微调、后训练，甚至中等规模的预训练，都不是难以企及的技术仪式。它们是可以通过学习、复现并逐步掌握的工程实践。借助开源模型、公开数据集、云计算平台以及日益成熟的训练工具链，您通常只需要一个Google账号、一台普通的笔记本电脑和持续的好奇心。

作为一个同样从零开始的学习者，我理解许多新手面临的不确定性：环境搭建的复杂性、不透明的超参数以及对计算资源的焦虑，往往成为入门的第一道障碍。这正是像Unsloth这样的优化工具链至关重要的原因：通过提高训练效率和资源利用率，它们大幅降低了大型模型微调的实际门槛，将过去需要昂贵硬件和专业经验才能做的事情，变成了普通开发者可以尝试和掌握的技能。从这个意义上说，我们都有机会站在巨人的肩膀上，理解模型、适配模型，并赋予它们新的能力。

—— 从一个学习者到另一个学习者

1. 先决条件与平台概述

1.1 什么是Unsloth？

Unsloth是一个高影响力的开源团队和工具包，专注于LLM后训练。除了高效的LoRA/QLoRA，它还支持全量微调、持续预训练和低精度设置（4-bit、8-bit、16-bit、FP8），同时保持与Hugging Face/TRL工作流的兼容性，包括SFT、GRPO、GSPO和DPO。

Unsloth: 快速总结

Unsloth的核心优势在于后训练的系统级优化——官方文档和基准测试页面反复强调大约2倍的训练加速和大幅的VRAM节省（通常报道接近70%，RL工作流常声称50-90%），使得在有限GPU上进行更长上下文和更大模型的微调成为可能；此外，它还提供了250+个notebook并广泛支持500+个模型。

1.2 什么是Google Colab？

Google Colab是Google提供的云托管Jupyter notebook环境。它允许您在浏览器中运行Python代码，连接到GPU运行时，并快速原型化机器学习工作流，而无需从头配置本地CUDA环境。

为什么这里常用Colab

快速的基于notebook的实验，用于训练和调试。
与Google Drive轻松集成，用于checkpoint持久化。
直接访问Colab Pro/Pro+套餐中的GPU后端运行时。

1.3 硬件要求：NVIDIA CUDA GPU

对于此项目设置，您应该使用支持CUDA的NVIDIA GPU。本指南中使用的PyTorch + 加速训练栈需要CUDA，包括4-bit加载和LoRA微调工作流。

GPU要求说明

推荐：Colab中可用的NVIDIA GPU（例如，可用的A100/H100类）。

在仅CPU的运行时上训练27B类模型不切实际。
如果CUDA不可用，与模型加载/训练相关的notebook单元格可能会失败。

2. 引言

概述：范围与目标

项目目标

目标。通过提炼高质量的分析性和逐步问题解决模式，将Qwen3.5-27B适配成一个更具结构性、推理效率更高的模型，重点在于在编程、数学和离线分析任务上表现更强，同时减少对简单查询的不必要冗长或重复推理。

设置

设置。通过Unsloth以适合Colab的内存高效格式加载Qwen3.5-27B，并在精选的高保真推理数据上执行仅针对response的监督微调，以生成一个更有条理、更透明且对下游推理和研究导向实验实际有用的模型。

2.1 背景与动机

要点：为什么需要参数高效微调(PEFT)？

随着大型语言模型(LLM)的快速发展，将通用模型高效适配到特定领域或任务已成为一个关键挑战。

全参数微调通常在计算和内存上代价高昂。
PEFT方法（特别是LoRA）在冻结基座模型的同时训练一小部分参数，使个人开发者和小团队能够快速迭代。

2.2 技术栈概述

仅关键栈（关注点）：

组件	关键点
Unsloth	为Colab提供更快、更省内存的微调工作流。
LoRA (PEFT)	仅训练低秩adapter以减少计算和成本。
4-bit quantization	降低VRAM使用，同时使27B规模的训练变得可行。
Colab Pro GPU (A100/H100)	为此pipeline提供所需的CUDA GPU运行时。

表1：基础技术栈（精简版）

3. 环境搭建与准备

3.1 实验追踪（可选）

# 3.1.1 它的作用

Weights & Biases (WandB) 是一个强大的实验追踪平台。它可以实时监控训练指标，包括：

损失曲线
学习率调度
VRAM利用率
训练速度和预计完成时间(ETA)

代码: WandB登录设置

import os
import wandb
from google.colab import drive
from google.colab import userdata

drive.mount('/content/drive')
wandb_api_key = userdata.get('WANDB_API_KEY')
wandb.login(key=wandb_api_key)

drive_output_path = "/content/drive/MyDrive/Qwen3.5-27B--checkpoints"
os.makedirs(drive_output_path, exist_ok=True)

警告: Colab使用说明

在Google Colab（而非Kaggle）中运行此单元格，并在提示时授予Drive权限 drive.mount()。
将您的WandB API密钥作为 WANDB_API_KEY 存储在Colab Secrets中；如果登录失败，请重新检查密钥名称和值。
确保您的运行时具有互联网访问权限和足够的Drive空间，否则保存checkpoint到 /content/drive/MyDrive/... 可能会失败。

3.2 安装依赖项

# 3.2.1 核心库

微调LLM通常需要这些核心库：

库	版本	目的
Unsloth	GitHub (unsloth[base])	从Unsloth GitHub安装以获取最新的基础运行时更新
PyTorch	2.8.0	此Colab设置使用的核心深度学习后端
Transformers	5.2.0	锁定到notebook环境的Hugging Face模型栈
TRL	0.22.2	与Unsloth工作流对齐的训练工具栈
xformers	0.0.32.post2	与固定torch版本匹配的注意力加速后端

表2：核心依赖项列表

代码: 安装命令

%%capture
import os, importlib.util
!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"
    except: _numpy = "numpy"; _pil = "pillow"
    !uv pip install -qqq \
        "torch==2.8.0" "triton>=3.3.0" {_numpy} {_pil} torchvision bitsandbytes xformers==0.0.32.post2 \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth"
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth
!uv pip install --upgrade --no-deps tokenizers trl==0.22.2 unsloth unsloth_zoo
!uv pip install transformers==5.2.0
# causal_conv1d is supported only on torch==2.8.0. If you have a newer torch version, please wait 10 minutes!
!uv pip install --no-build-isolation flash-linear-attention causal_conv1d==1.6.0

提示: 安装说明

使用 %%capture magic来隐藏安装日志，保持notebook整洁。
安装通常需要大约3-5分钟。
如果您看到版本警告但Unsloth导入正常，通常可以忽略它们。

4. 模型加载与LoRA配置

4.1 加载预训练模型

# 4.1.1 4-bit量化如何工作

4-bit量化将模型参数以4位精度存储，显著降低VRAM使用，同时保持推理和训练稳定（尤其与量化感知内核和混合精度结合时）。在实践中，4-bit通常是单GPU微调的最佳平衡。

# 4.1.2 模型加载代码

代码: 使用Unsloth加载Qwen3.5-27B

from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit",
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",
    # 4-bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3.5-27B",
    max_seq_length = 32768, # Choose any length for long-context support!
    load_in_4bit = True,    # 4-bit quantization to reduce memory
    load_in_8bit = False,   # [NEW!] More accurate, but uses 2x the memory
    full_finetuning = False, # [NEW!] Full fine-tuning is now supported!
)

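Warning: Common pitfalls

If you hit OOM, reduce max_seq_length (e.g., to 8192 or 16384) or use a smaller batch size.
Make sure the model identifier matches an existing HF repository name.
If the model is gated, you must provide a valid HF token.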

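# 4.1.3 Best practice: Thinking vs. Instruct sampling settings

Qwen models are typically used in two inference modes: Instruct (instruction-following chat) and Thinking (reasoning with explicit think tags). The recommended sampling settings differ: Thinking uses a higher top_p (0.95 vs. 0.80), and its recommended temperature takes one of several values (0.6, 0.7, or 1), as listed in Table 3.

Parameter	Instruct (Qwen3)	Thinking (Qwen3/Qwen3.5)
Temperature	0.7	0.6/0.7/1
Top P	0.80	0.95
Top K	20	20

Table 3: Recommended sampling settings for Qwen Instruct vs. Thinking (rule of thumb)

Output length: For most queries, set the maximum output length to 32768 tokens (if your runtime/VRAM allows); this is sufficient for many long-form responses.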
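
To show what the three sampling parameters do, the sketch below applies temperature, then top-k, then top-p to a toy list of logits. The helper name `apply_sampling_filters` and the toy logits are assumptions made for illustration; inference engines implement the same idea on tensors, and the order of the filters can differ between implementations.

```python
import math

def apply_sampling_filters(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before the softmax (lower values sharpen the distribution).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

toy_logits = [2.0, 1.0, 0.1, -3.0]  # hypothetical logits for a 4-token vocabulary
print(apply_sampling_filters(toy_logits, temperature=0.6, top_k=20, top_p=0.95))
```

With these toy values, the Thinking-style settings drop the low-probability tail and renormalize over the remaining tokens.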

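# 4.1.4 Template first: confirm the chat format before training

Why this matters before SFT

Once you have chosen the model to train, the next key step is to confirm its chat template. Different model families and vendors use different role markers, delimiters, special tokens, and system/user/assistant formatting rules.

If your training data format does not match the model's native chat template, you may see unstable loss, weaker instruction following, or degraded response quality. In short: choose the model first, then align the dataset and prompt format with that model's template. The example below shows a typical template structure.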
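
To make the role markers concrete, here is a minimal sketch of ChatML-style serialization. The `render_chatml` helper is hypothetical and deliberately simplified: it is not the actual Qwen template, which also handles system prompts, thinking tags, and tool calls. In a real pipeline, `tokenizer.apply_chat_template` should do this work.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified ChatML-style serializer (illustration only, not the real Qwen template)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|>.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

example = [
    {"role": "user", "content": "Hey there! What is 1+1?"},
    {"role": "assistant", "content": "<think>\n\n</think>\nThe answer is 2."},
]
print(render_chatml(example))
```

If training data were serialized with different markers than the ones the base model saw during its own training, the model would have to relearn the turn structure, which is the mismatch this section warns about.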

Code: Main parts of a Qwen3 chat template example (Instruct / Thinking)

<|im_start|>user
Hey there! What is 1+1? <|im_end|>
<|im_start|>assistant
<think>

# Instruct mode: <think></think>answer
# Thinking mode: may include <think>..reasoning..</think>answer

</think>
The answer is 2.
<|im_end|>

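4.2 Configuring the LoRA Adapter

# 4.2.1 How LoRA works

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. Instead of updating all of the original model weights, LoRA keeps the pretrained model frozen and adds two small trainable low-rank matrices (usually denoted A and B) to selected linear layers. The product of these matrices learns a lightweight task-specific update, so the model can adapt to a new domain with far fewer trainable parameters than full fine-tuning. In practice this greatly reduces VRAM usage and training cost while preserving most of the base model's capabilities, which is why LoRA is widely used for single-GPU or Colab-based LLM fine-tuning.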

Figure 1: The LoRA (Low-Rank Adaptation) principle: keep the base model frozen and learn a small, trainable low-rank update.
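
The parameter savings can be checked with simple arithmetic. The sketch below counts the trainable parameters of one adapted linear layer and computes the scaling factor applied to the low-rank update. The layer width of 4096 is a hypothetical value chosen for illustration, not the actual Qwen3.5-27B dimension, and both helper names are assumptions.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two LoRA matrices: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Standard LoRA scales the update by alpha / r; rank-stabilized LoRA uses alpha / sqrt(r)."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r

D = 4096  # hypothetical layer width, for illustration only
full_params = D * D
lora_params = lora_trainable_params(D, D, r=64)

print(f"full layer: {full_params:,} params")
print(f"LoRA r=64 : {lora_params:,} params ({lora_params / full_params:.2%} of the layer)")
print("scaling with lora_alpha=64, r=64:", lora_scaling(64, 64))
```

With `lora_alpha` equal to `r`, the update is applied at a scale of 1.0, which is one reason "alpha = r" is a common baseline.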

# 4.2.2 LoRA configuration code

Code: Attach LoRA adapters

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0! Suggested: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "out_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any value, but = 0 is optimized
    bias = "none",    # Supports any value, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM and fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # Rank-stabilized LoRA is supported
    loftq_config = None, # LoftQ is also supported
)

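Tip: Recommended starting values

r: 8/16/32/64/128 (a higher rank can improve quality but uses more VRAM).
lora_alpha: usually set to r or 2r as a practical baseline.
target_modules: start with the attention + MLP projections (as shown above).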

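5. Data Processing and Preprocessing

5.1 Part 1: Configuration and Multi-source Dataset Loading

# 5.1.1 Pipeline goals

This stage defines the data pipeline configuration and loads three heterogeneous reasoning datasets into one unified workflow. The key design decision is to draw a fixed number of samples from each source, so that both high-quality reasoning traces and conversational diversity are preserved.

Core principle for SFT beginners

Dataset choice largely determines what the model ultimately learns. Beginners often assume that model size is the most important factor, but in supervised fine-tuning, dataset design usually matters more than simply running more steps. In practice, these factors tend to be the strongest drivers of final quality:

Data quality (signal-to-noise ratio, correctness, and reasoning faithfulness)
Data style (answer tone, structure, verbosity, and formatting habits)
Task distribution (which capabilities are emphasized most often)
Format consistency (a stable template/role format across samples)

Bottom line

For SFT, these data decisions usually matter more than "just training a bit longer".

Set global controls (seed, context window, samples per dataset).
Initialize the Qwen thinking chat template early for downstream serialization.
Load the three sources, with fallback logic for dataset-specific schema issues.

Note

The dataset choices and hyperparameters shown below are not final production training settings. In a real project, you should adjust them to your training goals, dataset characteristics, hardware, and runtime environment. This section is provided only as an example to help you understand the pipeline and workflow. The code can be generated or assisted by AI, but you still need to understand what each step does and why.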

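Detailed reference: TRL SFTTrainer

For a complete reference on the supervised fine-tuning trainer, see: https://huggingface.co/docs/trl/sft_trainer

That page explains the following in detail:

How to use SFTTrainer for quick-start SFT training.
Supported dataset schemas (language modeling and prompt-completion; standard and conversational formats).
Core SFT mechanics: preprocessing, tokenization, next-token cross-entropy loss, label shifting, and padding-mask handling.
Practical options: packing, assistant_only_loss, completion_only_loss, and PEFT/LoRA integration.
Common training metrics and key customization options in SFTConfig.

Code: Configuration and dataset loading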

from datasets import load_dataset, concatenate_datasets, Dataset
from unsloth.chat_templates import get_chat_template
import re
import json
import multiprocessing as mp
import pandas as pd

RANDOM_SEED = 12181531
MAX_CONTEXT_WINDOW = 8192

num_samples_dict = {
    "ds1": 3900, # nohurry/Opus-4.6-Reasoning-3000x-filtered
    "ds2": 700,  # Jackrong/Qwen3.5-reasoning-700x
    "ds3": 9633, # Roman1111111/claude-opus-4.6-10000x
}

tokenizer = get_chat_template(
    tokenizer,
    chat_template="qwen3-thinking",
)

def load_ds3_via_pandas_parquet():
    parquet_path = (
        "hf://datasets/Roman1111111/claude-opus-4.6-10000x"
        "@refs/convert/parquet/default/train/0000.parquet"
    )
    df = pd.read_parquet(parquet_path)
    return Dataset.from_pandas(df, preserve_index=False)

def load_and_sample(dataset_name, sample_count=None, split="train", subset=None):
    try:
        if subset:
            ds = load_dataset(dataset_name, subset, split=split)
        else:
            ds = load_dataset(dataset_name, split=split)
    except ValueError as e:
        err = str(e)
        if dataset_name == "Roman1111111/claude-opus-4.6-10000x" and "Feature type 'Json' not found" in err:
            ds = load_ds3_via_pandas_parquet()
        else:
            raise

    if sample_count is not None:
        sample_count = min(sample_count, len(ds))
        ds = ds.shuffle(seed=RANDOM_SEED).select(range(sample_count))

    return ds

# ds1: problem / thinking / solution
# ds2: multi-turn conversation
# ds3: messages with possible reasoning fields
ds1 = load_and_sample("nohurry/Opus-4.6-Reasoning-3000x-filtered", num_samples_dict["ds1"], split="train")
ds2 = load_and_sample("Jackrong/Qwen3.5-reasoning-700x", num_samples_dict["ds2"], split="train")
ds3 = load_and_sample("Roman1111111/claude-opus-4.6-10000x", num_samples_dict["ds3"], split="train")

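5.2 Part 2: Conversation Normalization and Unified Reasoning Format

# 5.2.1 Normalization strategy

Because the three datasets use different schemas (problem/solution pairs, multi-turn conversations, and message objects with optional reasoning fields), we normalize them into a single conversation structure and enforce an assistant output format built around <think>...</think> followed by the final answer text.

Standardize assistant messages into a reasoning-visible format.
Safely parse mixed message types (dicts/JSON strings).
Convert each source with a dataset-specific formatting function.

Code: Format multiple sources into unified conversations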

def _strip(x):
    return (x or "").strip()

THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def normalize_assistant_to_think_solution(text: str) -> str:
    text = _strip(text)
    if not text:
        return "<think></think>\n"

    m = THINK_BLOCK_RE.search(text)
    if m:
        think_block = m.group(0).strip()
        rest = text[m.end():].lstrip()
        return f"{think_block}\n{rest}".rstrip() if rest else f"{think_block}\n"

    return f"<think></think>\n{text}".rstrip()

def build_assistant_with_reasoning(content: str, reasoning: str = "") -> str:
    content = _strip(content)
    reasoning = _strip(reasoning)

    if "<think>" in content and "</think>" in content:
        return normalize_assistant_to_think_solution(content)

    if reasoning:
        return f"<think>{reasoning}</think>\n{content}" if content else f"<think>{reasoning}</think>\n"

    return normalize_assistant_to_think_solution(content)

def parse_message_item(m):
    if isinstance(m, dict):
        return m
    if isinstance(m, str):
        s = m.strip()
        if not s:
            return None
        try:
            obj = json.loads(s)
            return obj if isinstance(obj, dict) else None
        except Exception:
            return None
    return None

def format_ds1(examples):
    out = []
    for p, t, s in zip(examples.get("problem", []), examples.get("thinking", []), examples.get("solution", [])):
        p, t, s = _strip(p), _strip(t), _strip(s)
        if not p or not s:
            continue
        assistant = f"<think>{t}</think>\n{s}" if t else f"<think></think>\n{s}"
        out.append([
            {"role": "user", "content": p},
            {"role": "assistant", "content": assistant},
        ])
    return {"conversations": out}

def format_ds2(examples):
    out = []
    for conv in examples.get("conversation", []):
        if not conv:
            continue
        cleaned = []
        for m in conv:
            frm = (m.get("from") or "").strip()
            val = m.get("value", "")
            if frm == "human":
                cleaned.append({"role": "user", "content": _strip(val)})
            elif frm == "gpt":
                cleaned.append({"role": "assistant", "content": normalize_assistant_to_think_solution(val)})
        if len(cleaned) >= 2 and cleaned[-1]["role"] == "assistant":
            out.append(cleaned)
    return {"conversations": out}

def format_ds3(examples):
    out = []
    for msgs in examples.get("messages", []):
        if not msgs:
            continue
        parsed = [pm for pm in (parse_message_item(m) for m in msgs) if pm is not None]
        if not parsed:
            continue

        convo = [m for m in parsed if m.get("role") != "system"]
        if len(convo) < 2 or convo[-1].get("role") != "assistant":
            continue

        cleaned = []
        for m in convo:
            role = m.get("role")
            content = m.get("content", "")
            reasoning = m.get("reasoning", "")
            if role == "assistant":
                content = build_assistant_with_reasoning(content, reasoning)
            else:
                content = _strip(content)
            if role in ("user", "assistant") and content is not None:
                cleaned.append({"role": role, "content": content})

        if len(cleaned) >= 2 and cleaned[-1]["role"] == "assistant":
            out.append(cleaned)

    return {"conversations": out}

ds1 = ds1.map(format_ds1, batched=True, remove_columns=ds1.column_names)
ds2 = ds2.map(format_ds2, batched=True, remove_columns=ds2.column_names)
ds3 = ds3.map(format_ds3, batched=True, remove_columns=ds3.column_names)
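
Before mapping the helpers over full datasets, it can help to check the normalization on a few hand-written strings. The snippet below reproduces `normalize_assistant_to_think_solution` from the listing above so that it runs on its own; the sample inputs are made up for illustration.

```python
import re

THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def normalize_assistant_to_think_solution(text):
    """Copy of the helper above: always emit '<think>...</think>' followed by a newline and the answer."""
    text = (text or "").strip()
    if not text:
        return "<think></think>\n"
    m = THINK_BLOCK_RE.search(text)
    if m:
        think_block = m.group(0).strip()
        rest = text[m.end():].lstrip()
        return f"{think_block}\n{rest}".rstrip() if rest else f"{think_block}\n"
    return f"<think></think>\n{text}".rstrip()

samples = [
    "The answer is 2.",                         # no reasoning block at all
    "<think>1+1=2</think>   The answer is 2.",  # reasoning plus answer, messy spacing
    "<think>only reasoning</think>",            # reasoning without a final answer
    None,                                       # missing content
]
for s in samples:
    print(repr(normalize_assistant_to_think_solution(s)))
```

Every output contains `</think>` followed by a newline, which is the property the format QA step checks for later.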

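5.3 Part 3: Template Serialization, Length Control, and Format QA

# 5.3.1 Final dataset construction

After normalization, we merge the three datasets, serialize each conversation with the Qwen thinking template, filter out over-long samples by tokenized length, and run a final assistant-format quality check before training.

Keep only non-empty normalized conversations.
Build the training text with apply_chat_template.
Enforce the context-window constraint and reasoning-tag integrity.

Code: Merge, template, filter, and validate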

ds1 = ds1.filter(lambda x: x["conversations"] is not None and len(x["conversations"]) > 0)
ds2 = ds2.filter(lambda x: x["conversations"] is not None and len(x["conversations"]) > 0)
ds3 = ds3.filter(lambda x: x["conversations"] is not None and len(x["conversations"]) > 0)

combined_dataset = concatenate_datasets([ds1, ds2, ds3]).shuffle(seed=RANDOM_SEED)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False,
        )
        for convo in convos
    ]
    return {"text": texts}

dataset = combined_dataset.map(formatting_prompts_func, batched=True)

num_proc = mp.cpu_count()
_text_tok = getattr(tokenizer, "tokenizer", tokenizer)

def filter_long_sequences_batched(examples):
    texts = examples["text"]
    tokenized = _text_tok(
        texts,
        truncation=False,
        padding=False,
        add_special_tokens=False,
    )["input_ids"]
    return [len(toks) <= MAX_CONTEXT_WINDOW for toks in tokenized]

dataset = dataset.filter(filter_long_sequences_batched, batched=True, num_proc=num_proc)

def check_assistant_format(examples):
    convos = examples["conversations"]
    ok = []
    for convo in convos:
        good = True
        for m in convo:
            if m["role"] == "assistant":
                c = m.get("content", "")
                if "<think>" not in c or "</think>" not in c:
                    good = False
                    break
                if not re.search(r"</think>\n", c):
                    good = False
                    break
        ok.append(good)
    return {"_ok": ok}

check = dataset.map(
    check_assistant_format,
    batched=True,
    remove_columns=dataset.column_names,
    num_proc=num_proc,
)

bad = len(check) - sum(check["_ok"])
if bad > 0:
    dataset = dataset.filter(lambda x: all(
        (m["role"] != "assistant") or (
            ("<think>" in m["content"]) and ("</think>\n" in m["content"])
        )
        for m in x["conversations"]
    ))

print(dataset[0]["text"][:8000])

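Tip: Practical checks before training

Print a few templated samples to verify the role delimiters and reasoning tags.
Track the fraction of samples retained after length filtering to avoid accidental over-pruning.
If many samples fail format QA, check the source-specific conversion functions first.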
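
One way to track the retention ratio mentioned in the checks above is to compute it next to the filter itself. This is a sketch under stated assumptions: `length_filter_with_report` is a hypothetical helper, and a whitespace split stands in for the real tokenizer so the example runs without any model files.

```python
def length_filter_with_report(texts, max_tokens, count_tokens):
    """Keep texts within max_tokens and report the fraction that survived."""
    kept = [t for t in texts if count_tokens(t) <= max_tokens]
    ratio = len(kept) / len(texts) if texts else 0.0
    return kept, ratio

def whitespace_len(text):
    # Stand-in for a real tokenizer; real token counts will differ.
    return len(text.split())

texts = ["short sample", "a " * 50, "medium length sample text here"]
kept, ratio = length_filter_with_report(texts, max_tokens=10, count_tokens=whitespace_len)
print(f"kept {len(kept)}/{len(texts)} samples ({ratio:.0%})")
```

If the ratio drops much lower than expected, the context window may be too small for the reasoning traces in the mix, or one source may be dominated by very long samples.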

6. Training Configuration and Execution

6.1 Trainer Configuration

# 6.1.1 Trainer code

Code: SFT training with TRL + Unsloth

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # An evaluation set can be supplied here
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 6,
        gradient_accumulation_steps = 6, # Use GA to mimic a larger batch size!
        warmup_ratio = 0.03,
        # warmup_steps = 60,
        num_train_epochs = 1, # Set to 1 for one full pass over the data
        # max_steps = 50,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        save_steps = 200,
        save_total_limit = 1,
        save_strategy = "steps",
        report_to = "wandb", # Log to Weights & Biases
        output_dir = drive_output_path,
    ),
)

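# 6.1.2 Key training parameters

Parameter	Typical values	Notes
`per_device_train_batch_size`	1–128	Depends on VRAM
`gradient_accumulation_steps`	1–128	Increases the effective batch size
`learning_rate`	1e-5–3e-4	Tune carefully
`warmup_steps`	1–5000	Stabilizes early training
`max_steps / num_train_epochs`	Task-dependent	Prefer steps for reproducibility
`logging_steps`	1–20	For monitoring
`save_steps`	10–500	Checkpoint save frequency

Table 4: Common training hyperparameters

# 6.1.3 Key parameters in detail

Learning rate: if the loss is unstable, lower the LR; if convergence is too slow, raise it slightly.
Effective batch size: per_device_train_batch_size × gradient_accumulation_steps (6 × 6 = 36 per GPU in the configuration above); larger is usually more stable.
Sequence length: longer contexts significantly increase VRAM and compute; start small.
Optimizer: adamw_8bit helps reduce memory usage.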
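
The relationship between batch size, gradient accumulation, and step counts can be worked out ahead of a run. The sketch below is an approximation: `training_schedule` is a hypothetical helper, the dataset size of 10,000 samples is made up, and the exact rounding a given trainer applies to steps and warmup may differ slightly.

```python
import math

def training_schedule(num_samples, per_device_batch, grad_accum,
                      num_gpus=1, epochs=1, warmup_ratio=0.03):
    """Approximate optimizer-step counts implied by the batch settings."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Batch settings from the trainer configuration above; the sample count is hypothetical.
plan = training_schedule(num_samples=10_000, per_device_batch=6, grad_accum=6)
print(plan)
```

An estimate like this also helps when choosing `save_steps`: a value larger than the total step count would mean no intermediate checkpoint is written during the run.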

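6.2 Train on Responses Only (Recommended)

In most chat fine-tuning setups, you want the model to learn only the assistant's responses, so that user instructions and template tokens contribute nothing to the loss. Unsloth provides train_on_responses_only to automatically mask labels outside the assistant spans.

Code: Mask labels to train only on assistant responses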

from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n<think>",
)

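Tip: Why this matters

If you train on the entire serialized prompt, the model may waste capacity learning to reproduce system/user content and special tokens, which can hurt instruction-following quality.

6.3 Label Sanity Check (Before/After Training)

To verify that masking works as intended, you can decode the labels of one sample. In many trainers, tokens you do not want to train on are set to -100 (the ignore index). The snippet below replaces those positions with the pad token id so that decoding runs, then replaces the visible pad tokens with spaces for readability.

Code: Decode labels to verify masking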

tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
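
The masking idea itself can be illustrated on a toy sequence without a tokenizer. The sketch below is not Unsloth's implementation: `mask_non_assistant` and the small integer ids are assumptions, and keeping the end-of-turn token as a training target is a choice made for this example.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_non_assistant(token_ids, response_marker, end_marker):
    """Return labels that keep only tokens inside assistant spans."""
    labels = [IGNORE_INDEX] * len(token_ids)
    in_response = False
    for i, tok in enumerate(token_ids):
        if tok == response_marker:
            in_response = True  # the marker itself stays masked
            continue
        if in_response:
            labels[i] = tok
            if tok == end_marker:
                in_response = False  # the span ends after the end-of-turn token
    return labels

# Hypothetical ids: 1 = user marker, 2 = assistant marker, 3 = end of turn.
USER, ASSISTANT, END = 1, 2, 3
token_ids = [USER, 10, 11, 12, END, ASSISTANT, 20, END]
print(mask_non_assistant(token_ids, response_marker=ASSISTANT, end_marker=END))
```

Only the assistant's tokens survive as labels; everything in the user turn is replaced by -100 and therefore contributes nothing to the loss.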

7. Start Training (Most Important)

After configuring the trainer (and optionally masking to responses only), start training:

Code: Run training

trainer_stats = trainer.train()

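Tip: What is `trainer_stats`?

trainer_stats typically contains training metrics (e.g., loss, steps, runtime). You can print or log it for quick verification.

8. Saving and Deployment

8.1 Saving the 16-bit Model

After training, you can push the merged 16-bit model directly to the Hugging Face Hub for standard Transformers inference and sharing.

Tip: What this code block does

Reads HF_TOKEN securely from Colab Secrets (rather than hard-coding the token in the notebook).
Validates the token via whoami() and automatically builds your target repo id.
Uploads the merged 16-bit checkpoint together with the tokenizer, so the model can be loaded directly from the Hugging Face Hub.

Code: Push the merged 16-bit model to the Hugging Face Hub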

from huggingface_hub import whoami
from google.colab import userdata

try:
    hf_token = userdata.get("HF_TOKEN")
    if not hf_token:
        raise ValueError("HF_TOKEN is not set")
except Exception as e:
    raise RuntimeError("HF_TOKEN was not found in Colab Secrets.") from e

try:
    username = whoami(token=hf_token)["name"]
    repo_id = f"{username}/Qwopus3.5-27B"
except Exception as e:
    raise RuntimeError("Failed to authenticate with Hugging Face.") from e

model.push_to_hub_merged(
    repo_id,
    tokenizer,
    save_method="merged_16bit",
    token=hf_token,
)

print(f"Uploaded to https://huggingface.co/{repo_id}")

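8.2 Exporting GGUF (Optional)

If you want to run the model in a llama.cpp-compatible stack, export and publish GGUF quantized models.

Tip: What this GGUF step does

Reads HF_TOKEN from Colab Secrets and verifies your Hugging Face identity.
Exports multiple GGUF variants (q4_k_m, q8_0, bf16) and uploads them to one Hub repository.
Makes the model directly usable in llama.cpp-style runtimes, with different quality/speed trade-offs.

Warning: Compatibility note

GGUF export depends on the model architecture and your environment. Follow the latest Unsloth / llama.cpp guidance on Qwen-family compatibility.

Code: Push GGUF models to the Hugging Face Hub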

from huggingface_hub import whoami
from google.colab import userdata

try:
    hf_token = userdata.get("HF_TOKEN")
    if not hf_token:
        raise ValueError("HF_TOKEN is not set")
except Exception as e:
    raise RuntimeError("HF_TOKEN was not found in Colab Secrets.") from e

try:
    username = whoami(token=hf_token)["name"]
    repo_id = f"{username}/Qwopus3.5-27B-GGUF"
except Exception as e:
    raise RuntimeError("Failed to authenticate with Hugging Face.") from e

model.push_to_hub_gguf(
    repo_id,
    tokenizer,
    quantization_method=["q4_k_m","q8_0","bf16"],
    token=hf_token,
)

print(f"Uploaded to https://huggingface.co/{repo_id}")

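9. Common Issues and Optimization Tips

9.1 Typical Failure Modes

Warning: Failure mode checklist

Out of memory (OOM): reduce the sequence length or the per-device batch size (raising gradient accumulation to keep the effective batch size unchanged), and enable gradient checkpointing.
Loss not decreasing: check the data format and chat template; consider lowering the learning rate.
Overfitting: reduce steps/epochs, increase data diversity, or enable dropout.

9.2 Hyperparameter Tuning Workflow

Tip: A practical, reproducible workflow

Stage 1: Validate data + template
- Run 50-200 steps and confirm that the loss decreases.
- Spot-check decoded prompts and targets (make sure roles/delimiters are correct).
Stage 2: Learning-rate sweep
- Try [1e-5, 2e-5, 3e-5] (extend the range as needed for your scale).
- Pick the value whose loss decreases fastest while staying stable (avoid divergence/oscillation).
Stage 3: LoRA rank tuning
- Try [16, 32, 64].
- Balance quality against VRAM and training time.
Stage 4: Batch size optimization
- Increase the batch size as far as VRAM allows.
- Use gradient accumulation steps to maintain a suitable effective batch size.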
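
For Stage 4, it can be useful to list which batch-size and accumulation pairs give the same effective batch size, so that a run that hits OOM can be retried with a smaller per-device batch. The helper below is a hypothetical sketch; the target of 36 matches the 6 × 6 configuration used earlier, and the cap of 8 is an arbitrary example.

```python
def batch_accum_pairs(target_effective_batch, max_per_device_batch):
    """List (per_device_batch, grad_accum) pairs whose product equals the target."""
    return [
        (b, target_effective_batch // b)
        for b in range(1, max_per_device_batch + 1)
        if target_effective_batch % b == 0
    ]

pairs = batch_accum_pairs(target_effective_batch=36, max_per_device_batch=8)
for per_device, accum in pairs:
    print(f"per_device_train_batch_size={per_device}, gradient_accumulation_steps={accum}")
```

Smaller per-device batches lower peak VRAM at the cost of more forward/backward passes per optimizer step.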

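10. Summary and Outlook

10.1 Key Takeaways

End-to-end workflow summary

This guide covers the complete end-to-end workflow for LLM fine-tuning:

Environment: Google Colab GPU runtime + Unsloth stack, with optional WandB logging and Google Drive checkpoint persistence.
Model loading: Qwen3.5-27B is loaded in 4-bit mode to fit single-GPU VRAM limits while keeping training feasible.
Parameter-efficient adaptation: rank-64 LoRA adapters are attached to the key projection modules, so only a small fraction of parameters is updated.
Data pipeline: mixed reasoning datasets are normalized into a single assistant target format, using the Qwen chat template and response-only supervision.
Training and release: SFT runs with memory-aware hyperparameters, and the result is exported to the Hugging Face Hub as merged 16-bit weights and optional GGUF variants.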

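10.2 Typical Metrics

Metric	Typical value / observed range
Base model scale	Qwen3.5-27B
Runtime environment	Google Colab (A100/H100 class GPU)
Model loading mode	4-bit quantized base + LoRA adapters
LoRA rank used in this guide	64
Export targets	merged 16-bit + GGUF (`q4_k_m` / `q8_0` / `bf16`)

Table 5: Configuration-level metrics aligned with the Qwen3.5-27B Colab workflow

Tip: How to interpret these metrics

For 27B-scale fine-tuning on Colab, absolute runtime and loss values are highly sensitive to GPU type, sequence length, effective batch size, and dataset composition. In practice, compare runs under the same setup rather than relying on a single universal number.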

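10.3 Next Learning Directions

Strengthen the current Qwen3.5-27B pipeline first
- Build a small, high-quality evaluation set for coding/math/reasoning tasks.
- Compare r=16/32/64 and sequence-length settings under the same data split.
- Track changes in reasoning behavior (reasoning length, answer accuracy, and format stability).
Improve data quality and supervision strategy
- Add harder long-context and multi-step reasoning samples.
- Remove noisy or contradictory outputs from the mixed-source datasets.
- Experiment with style constraints to obtain more concise final answers.
Move from SFT to preference alignment
- Try DPO with pairwise preference data collected from the target tasks.
- Evaluate whether alignment improves helpfulness without over-extending reasoning traces.
- Keep SFT checkpoints as a stable baseline for A/B comparison.
Deployment and reproducibility
- Validate the merged Hub model and the GGUF exports with the same prompt suite.
- Record exact package versions and notebook cells for repeatable training.
- Once offline quality is verified, add a lightweight serving endpoint (e.g., vLLM/TGI).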

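10.4 References

Selected references

Unsloth repository: https://github.com/unslothai/unsloth
Hugging Face Transformers: https://github.com/huggingface/transformers
Qwen documentation: https://huggingface.co/Qwen
LoRA paper: https://arxiv.org/abs/2106.09685
WandB documentation: https://docs.wandb.ai/

10.5 Acknowledgements

Acknowledgements

Thanks to the following open-source projects and communities:

The Unsloth team, for an extremely well-optimized fine-tuning framework.
The Hugging Face community, for a rich model ecosystem.
Google Colab, for an accessible cloud GPU environment for experimentation.
All dataset contributors and open-source model teams.

Congratulations on completing your LLM fine-tuning journey!

You now know how to adapt a general-purpose LLM into a customized AI assistant.
Go build and iterate. We look forward to the great models you will train!

This guide is written from practical experience and will continue to be updated. Questions and suggestions are welcome.

No one starts out as an expert. But every expert once bravely took the first step.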

Complete Guide to LLM Fine-tuning

Qwopus3.5-27B LoRA Fine-tuning with Unsloth

Jackrong
April 5, 2026

Abstract

This work proposes and implements a reasoning-oriented supervised fine-tuning pipeline for Qwen3.5-27B. Built with Unsloth on Google Colab, the workflow combines 4-bit quantized loading and rank-64 LoRA adaptation to enable efficient training of a 27B-scale model under single-GPU VRAM constraints. Unlike generic chat fine-tuning, the pipeline mixes three reasoning data sources and normalizes heterogeneous assistant outputs into a unified supervision format that combines <think>...</think> traces with final answers. Together with the Qwen thinking chat template and response-only supervision, the optimization target is explicitly focused on assistant-side reasoning continuation rather than memorizing full dialogue turns.

After training, the pipeline supports LoRA adapter checkpointing, merged 16-bit export, and multi-format GGUF release, forming a complete engineering chain from data normalization to parameter-efficient fine-tuning and deployment delivery. Importantly, although Qwen3.5 belongs to a multimodal model family, the notebook implemented in this study executes text-only reasoning SFT rather than vision-language joint training.

Keywords: Qwen3.5-27B, Reasoning SFT, LoRA, Unsloth, 4-bit Quantization, Google Colab, GGUF

Base LLM (pretrained) + Training Data (domain / task) → Fine-tuning (LoRA / PEFT) → Adapted Model (base + adapter)

LLM Fine-tuning: adapt a pretrained LLM to your domain with task-specific data (often by training a small LoRA adapter).

Author’s Note — A Message to Builders

For beginners, hobbyists, and anyone curious about AI: this path is learnable.

The purpose of this document is not only to describe one training run, but also to communicate a broader message to beginners, hobbyists, and anyone curious about AI: fine-tuning, post-training, and even moderate-scale pretraining are not inaccessible technical rituals. They are engineering practices that can be learned, reproduced, and gradually mastered. With open-source models, public datasets, cloud compute platforms, and an increasingly mature training toolchain, what you often need is simply a Google account, a regular laptop, and sustained curiosity.

As a learner who also started from zero, I understand the uncertainty many newcomers face: environment setup complexity, opaque hyperparameters, and anxiety about compute resources often become the first barrier to entry. This is exactly why optimization toolchains such as Unsloth matter: by improving training efficiency and resource utilization, they substantially lower the practical threshold for large-model fine-tuning, turning what once required expensive hardware and specialized experience into something ordinary developers can attempt and master. In that sense, we all have the opportunity to stand on the shoulders of giants, understand models, adapt models, and give them new capabilities.

— From one learner to another

1. Prerequisites and Platform Overview

1.1 What is Unsloth?

Unsloth is a high-impact open-source team and toolkit for LLM post-training. Beyond efficient LoRA/QLoRA, it also supports full fine-tuning, continued pretraining, and low-precision setups (4-bit, 8-bit, 16-bit, FP8), while remaining compatible with Hugging Face/TRL workflows including SFT, GRPO, GSPO, and DPO.

Unsloth: quick summary

Unsloth’s core advantage is systems-level optimization for post-training—official docs and benchmark pages repeatedly highlight around 2× training speedups and large VRAM savings (commonly reported near 70%, with RL workflows often claiming 50–90%), enabling longer context and larger-model fine-tuning on limited GPUs; on top of this, it still provides 250+ notebooks and broad support across 500+ models.

1.2 What is Google Colab?

Google Colab is a cloud-hosted Jupyter notebook environment provided by Google. It lets you run Python code in the browser, connect to GPU runtimes, and quickly prototype machine learning workflows without configuring a local CUDA environment from scratch.

Why Colab is commonly used here

Fast notebook-based experimentation for training and debugging.
Easy integration with Google Drive for checkpoint persistence.
Direct access to GPU-backed runtimes in Colab Pro/Pro+ tiers.

1.3 Hardware Requirement: NVIDIA CUDA GPU

For this project setup, you should use an NVIDIA GPU with CUDA support. CUDA is required by the PyTorch + accelerated training stack used in this guide, including 4-bit loading and LoRA fine-tuning workflows.

GPU requirement note

Recommended: NVIDIA GPUs available in Colab (e.g., A100/H100 class when available).

Training 27B-class models is not practical on CPU-only runtimes.
If CUDA is unavailable, notebook cells related to model loading/training may fail.

2. Introduction

Overview: Scope and goal

Project Goal

Goal. Adapt Qwen3.5-27B into a more structured, reasoning-efficient model by distilling high-quality analytical and step-by-step problem-solving patterns, with a focus on stronger performance in coding, mathematics, and offline analytical tasks while reducing unnecessarily long or repetitive reasoning on simpler queries.

Setting

Setting. Load Qwen3.5-27B through Unsloth in a memory-efficient format suitable for Colab, and perform response-only supervised fine-tuning on curated high-fidelity reasoning data to produce a more organized, transparent, and practically useful model for downstream inference and research-oriented experimentation.

2.1 Background and Motivation

Key Points: Why parameter-efficient fine-tuning (PEFT)?

With the rapid progress of Large Language Models (LLMs), efficiently adapting a general-purpose model to a specific domain or task has become a key challenge.

Full-parameter fine-tuning is often expensive in compute and memory.
PEFT methods (especially LoRA) train a small set of parameters while keeping the base model frozen, enabling fast iteration for individual developers and small teams.

2.2 Tech Stack Overview

Key stack only (focus points):

Component	Key point
Unsloth	Faster and more memory-efficient fine-tuning workflow for Colab.
LoRA (PEFT)	Train only low-rank adapters to reduce compute and cost.
4-bit quantization	Lower VRAM usage while keeping 27B-scale training practical.
Colab Pro GPU (A100/H100)	Provides the CUDA GPU runtime needed for this pipeline.

Table 1: Essential tech stack (condensed)

3. Environment Setup and Preparation

3.1 Experiment Tracking (Optional)

# 3.1.1 What it does

Weights & Biases (WandB) is a powerful experiment tracking platform. It can monitor training metrics in real time, including:

Loss curves
Learning-rate schedules
VRAM utilization
Training speed and ETA

Code: WandB login setup

import os
import wandb
from google.colab import drive
from google.colab import userdata

drive.mount('/content/drive')
wandb_api_key = userdata.get('WANDB_API_KEY')
wandb.login(key=wandb_api_key)

drive_output_path = "/content/drive/MyDrive/Qwen3.5-27B--checkpoints"
os.makedirs(drive_output_path, exist_ok=True)

Warning: Colab usage notes

Run this cell in Google Colab (not Kaggle), and grant Drive permission when prompted by drive.mount().
Store your WandB API key in Colab Secrets as WANDB_API_KEY; if login fails, re-check the secret name and value.
Ensure your runtime has internet access and enough Drive space, otherwise checkpoint saving to /content/drive/MyDrive/... may fail.

3.2 Installing Dependencies

# 3.2.1 Core libraries

Fine-tuning an LLM typically requires these core libraries:

Library	Version	Purpose
Unsloth	GitHub (unsloth[base])	Installed from Unsloth GitHub for latest base runtime updates
PyTorch	2.8.0	Core deep learning backend used by this Colab setup
Transformers	5.2.0	Hugging Face model stack pinned to the notebook environment
TRL	0.22.2	Training utility stack aligned with Unsloth workflow
xformers	0.0.32.post2	Attention acceleration backend matched to the pinned torch version

Table 2: Core dependency list

Code: Install commands

%%capture
import os, importlib.util
!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"
    except: _numpy = "numpy"; _pil = "pillow"
    !uv pip install -qqq \
        "torch==2.8.0" "triton>=3.3.0" {_numpy} {_pil} torchvision bitsandbytes xformers==0.0.32.post2 \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth"
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth
!uv pip install --upgrade --no-deps tokenizers trl==0.22.2 unsloth unsloth_zoo
!uv pip install transformers==5.2.0
# causal_conv1d is supported only on torch==2.8.0. With a newer torch version, expect this install step to take about 10 minutes.
!uv pip install --no-build-isolation flash-linear-attention causal_conv1d==1.6.0

Tip: Installation notes

Use the %%capture magic to hide install logs and keep the notebook clean.
Installation usually takes about 3–5 minutes.
If you see version warnings but Unsloth imports correctly, you can typically ignore them.

4. Model Loading and LoRA Configuration

4.1 Loading the Pretrained Model

# 4.1.1 How 4-bit quantization works

4-bit quantization stores the frozen base-model weights in 4-bit precision, which cuts weight memory to roughly a quarter of 16-bit storage while keeping inference and training stable (especially when combined with quantization-aware kernels and mixed precision). The LoRA adapters and activations stay in higher precision. In practice, 4-bit is often the best balance between memory and quality for single-GPU fine-tuning.
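
To make the saving concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only and assumes a nominal 27 billion parameters; quantization constants, layers kept in higher precision, activations, LoRA weights, and optimizer state are all ignored, so real VRAM usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 27e9  # nominal parameter count for a 27B model (approximation)

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: ~{bf16_gib:.1f} GiB")
print(f" 4-bit weights: ~{int4_gib:.1f} GiB")
```

Under these assumptions the weights shrink from about 50 GiB to about 12.6 GiB, which leaves room on a single A100/H100-class GPU for activations and adapter training.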

# 4.1.2 Model loading code

Code: Load Qwen3.5-27B with Unsloth

from unsloth import FastLanguageModel
import torch

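# Reference list of other checkpoints published by Unsloth; it is not used by the loading call below.
fourbit_models = [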
    "unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit",
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3.5-27B",
    max_seq_length = 32768, # Choose any for long context!
    load_in_4bit = True,    # 4 bit quantization to reduce memory
    load_in_8bit = False,   # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
)

Warning: Common pitfalls

If you hit OOM, reduce max_seq_length (e.g., 8192 or 16384) or use a smaller batch size.
Make sure the model identifier matches an existing HF repo name.
If the model is gated, you must provide a valid HF token.

# 4.1.3 Best Practices: Thinking vs. Instruct Sampling Settings

Qwen models are commonly used in two inference modes: Instruct (instruction-following chat) and Thinking (reasoning with explicit thinking tags). The recommended sampling settings differ: Thinking mode uses a higher top_p (0.95 vs. 0.80) to keep reasoning stable, while the recommended temperature varies between 0.6 and 1 (see Table 3).

Parameter	Instruct (Qwen3)	Thinking (Qwen3/Qwen3.5)
Temperature	0.7	0.6/0.7/1
Top P	0.80	0.95
Top K	20	20

Table 3: Recommended sampling settings for Qwen Instruct vs. Thinking (rule-of-thumb)

Output length: For most queries, set the maximum output length to 32768 tokens (if your runtime/VRAM allows); this is sufficient for many long-form responses.
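
As a small illustration, the values in Table 3 can be kept in one place and turned into generation arguments. This is only a sketch: the function name sampling_kwargs is hypothetical, the keys follow the usual Hugging Face generate() argument names, and the thinking temperature is fixed at 0.6 simply because it is one of the listed options.

```python
SAMPLING_PRESETS = {
    "instruct": {"temperature": 0.7, "top_p": 0.80, "top_k": 20},
    # Table 3 lists 0.6, 0.7, or 1 for thinking mode; 0.6 is an arbitrary pick here.
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
}

def sampling_kwargs(mode: str, max_new_tokens: int = 32768) -> dict:
    """Return sampling arguments for the given mode (hypothetical helper)."""
    if mode not in SAMPLING_PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **SAMPLING_PRESETS[mode]}

print(sampling_kwargs("thinking"))
```

Lower max_new_tokens if your runtime cannot hold a 32768-token generation.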

# 4.1.4 Template First: Confirm the Chat Format Before Training

Why this matters before SFT

If you have already chosen the model to train, the next critical step is to confirm its corresponding chat template. Different model families and companies use different role markers, separators, special tokens, and system/user/assistant formatting rules.

If your training data format does not match the model’s native chat template, you may see unstable loss, weaker instruction-following behavior, or degraded response quality. In short: pick the model first, then align the dataset and prompt formatting to that model’s template. The example below shows a typical template structure.

Code: Qwen3 chat template example, main part (Instruct / Thinking)

<|im_start|>user
Hey there! What is 1+1?<|im_end|>
<|im_start|>assistant
<think>

# Instruct mode: <think></think>answer
# Thinking mode: may include <think>..reasoning..</think>answer

</think>
The answer is 2.
<|im_end|>
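
To see how role markers and separators combine, here is a deliberately simplified renderer for this ChatML-style layout. It is an illustration only, not the real Qwen template: render_chatml is a hypothetical helper, and it ignores system prompts, tool calls, and any special handling of thinking tags. For actual training, rely on tokenizer.apply_chat_template.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Serialize messages as <|im_start|>role\\ncontent<|im_end|>\\n blocks (simplified)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Hey there! What is 1+1?"},
    {"role": "assistant", "content": "<think>\n1 + 1 = 2.\n</think>\nThe answer is 2."},
]
print(render_chatml(messages))
```

Comparing such a hand-rendered string with the output of the real template on a few samples is a quick way to spot mismatched separators before training.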

4.2 Configuring the LoRA Adapter

# 4.2.1 LoRA principles

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. Instead of updating all original model weights, LoRA keeps the pretrained model frozen and adds two small trainable low-rank matrices (often denoted as A and B) to selected linear layers. The product of these matrices learns a lightweight task-specific update, so the model can adapt to new domains with far fewer trainable parameters. In practice, this greatly reduces VRAM usage and training cost while preserving most of the base model’s capability, which is why LoRA is widely used for single-GPU or Colab-based LLM fine-tuning.

Figure 1: LoRA (Low-Rank Adaptation) principle: keep the base model frozen and learn a small, trainable low-rank update.
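
The idea can be sketched in a few lines of plain Python. This is a toy illustration rather than how Unsloth or PEFT implement it: the output is the frozen projection W x plus a scaled low-rank update (alpha / r) · B (A x), with B initialized to zero so training starts from the unmodified base model. The layer width of 5120 used for the parameter count is a hypothetical value, not taken from the Qwen3.5 configuration.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output W @ x plus the scaled low-rank update (alpha / r) * B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

rng = random.Random(0)
d_in, d_out, r, alpha = 8, 6, 2, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, small random init
B = [[0.0] * r for _ in range(d_out)]                               # trainable, zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(W, A, B, x, alpha, r)
base_out = matvec(W, x)  # with B = 0 the adapter contributes nothing yet

HIDDEN = 5120  # hypothetical square projection width
full_params = HIDDEN * HIDDEN
lora_params = lora_param_count(HIDDEN, HIDDEN, 64)
print(f"full: {full_params:,}  LoRA r=64: {lora_params:,}  ratio: {lora_params / full_params:.1%}")
```

For that hypothetical layer, a rank-64 adapter trains 2.5% as many parameters as the full weight matrix, which is where the VRAM and cost savings come from.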

# 4.2.2 LoRA configuration code

Code: Attach LoRA adapters

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "out_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Tip: Recommended starting values

r: 8/16/32/64/128 (higher rank can improve quality but costs more VRAM).
lora_alpha: often set to r or 2r as a practical baseline (this guide uses r = 64 and lora_alpha = 64).
target_modules: start with attention + MLP projections (as above).

5. Data Processing and Preprocessing

5.1 Part I: Configuration and Multi-source Dataset Loading

# 5.1.1 Pipeline goal

This stage defines the data pipeline configuration and loads three heterogeneous reasoning datasets into a unified workflow. The key design choice is to draw a fixed number of samples from each source, so that the mixing ratio is explicit and both high-quality reasoning traces and conversational diversity are preserved.

Core principle for SFT beginners

Dataset selection largely determines what the model finally learns. Beginners often assume that model size or a longer training run matters most, but in supervised fine-tuning, dataset design usually has the larger impact. In practice, these factors are often the strongest drivers of final quality:

Data quality (signal-to-noise, correctness, and reasoning faithfulness).
Data style (answer tone, structure, verbosity, and formatting habits).
Task distribution (which capabilities are emphasized most often).
Format consistency (stable template/role formatting across samples).

Bottom line

For SFT, these data decisions usually matter more than “just training a bit longer.”

Concretely, the loading code in this stage does three things:

Set global controls (seed, context window, per-dataset sample sizes).
Initialize the Qwen thinking chat template early for downstream serialization.
Load three sources with fallback logic for schema-specific dataset issues.

Note

The dataset choices and hyperparameters shown below are not the final production training settings. In real projects, you should tune them based on your training goals, dataset characteristics, hardware, and runtime environment. This section is provided as an example to help you understand the pipeline and workflow. Code can be generated or assisted by AI, but you still need to understand what each step is doing and why.

Detailed reference: TRL SFTTrainer

For a full reference of the supervised fine-tuning trainer, see: https://huggingface.co/docs/trl/sft_trainer

What this page explains in detail:

How to use SFTTrainer for quick-start SFT training.
Supported dataset schemas (language modeling and prompt-completion; standard and conversational formats).
Core SFT mechanics: preprocessing, tokenization, next-token cross-entropy loss, label shifting, and padding-mask handling.
Practical options: packing, assistant_only_loss, completion_only_loss, and PEFT/LoRA integration.
Common training metrics and key customization options in SFTConfig.

Code: Configuration and dataset loading

from datasets import load_dataset, concatenate_datasets, Dataset
from unsloth.chat_templates import get_chat_template
import re
import json
import multiprocessing as mp
import pandas as pd

RANDOM_SEED = 12181531
MAX_CONTEXT_WINDOW = 8192

num_samples_dict = {
    "ds1": 3900, # nohurry/Opus-4.6-Reasoning-3000x-filtered
    "ds2": 700,  # Jackrong/Qwen3.5-reasoning-700x
    "ds3": 9633, # Roman1111111/claude-opus-4.6-10000x
}

tokenizer = get_chat_template(
    tokenizer,
    chat_template="qwen3-thinking",
)

def load_ds3_via_pandas_parquet():
    parquet_path = (
        "hf://datasets/Roman1111111/claude-opus-4.6-10000x"
        "@refs/convert/parquet/default/train/0000.parquet"
    )
    df = pd.read_parquet(parquet_path)
    return Dataset.from_pandas(df, preserve_index=False)

def load_and_sample(dataset_name, sample_count=None, split="train", subset=None):
    try:
        if subset:
            ds = load_dataset(dataset_name, subset, split=split)
        else:
            ds = load_dataset(dataset_name, split=split)
    except ValueError as e:
        err = str(e)
        if dataset_name == "Roman1111111/claude-opus-4.6-10000x" and "Feature type 'Json' not found" in err:
            ds = load_ds3_via_pandas_parquet()
        else:
            raise

    if sample_count is not None:
        sample_count = min(sample_count, len(ds))
        ds = ds.shuffle(seed=RANDOM_SEED).select(range(sample_count))

    return ds

# ds1: problem / thinking / solution
# ds2: multi-turn conversation
# ds3: messages with possible reasoning fields
ds1 = load_and_sample("nohurry/Opus-4.6-Reasoning-3000x-filtered", num_samples_dict["ds1"], split="train")
ds2 = load_and_sample("Jackrong/Qwen3.5-reasoning-700x", num_samples_dict["ds2"], split="train")
ds3 = load_and_sample("Roman1111111/claude-opus-4.6-10000x", num_samples_dict["ds3"], split="train")
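
Because load_and_sample caps each request at the dataset's actual size, the final mixture can differ from the requested one. The sketch below computes the share of each source from the requested counts above; mixture_shares is a hypothetical helper, and the 3000-row figure for ds1 is an illustrative assumption, not a measured dataset size.

```python
num_samples_dict = {"ds1": 3900, "ds2": 700, "ds3": 9633}  # requested counts from the config above

def mixture_shares(requested, available=None):
    """Share of each source after capping every request at the rows available."""
    available = available or {}
    effective = {name: min(n, available.get(name, n)) for name, n in requested.items()}
    total = sum(effective.values())
    return {name: n / total for name, n in effective.items()}

requested_mix = mixture_shares(num_samples_dict)
# Hypothetical case: suppose ds1 only had 3000 rows.
capped_mix = mixture_shares(num_samples_dict, available={"ds1": 3000})

for name in num_samples_dict:
    print(f"{name}: requested {requested_mix[name]:.1%}, capped {capped_mix[name]:.1%}")
```

Printing len(ds1), len(ds2), and len(ds3) after loading gives the real counts to plug into this check.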

5.2 Part II: Conversation Normalization and Reasoning-format Unification

# 5.2.1 Normalization strategy

Because the three datasets use different schemas (problem/solution pairs, multi-turn chat, and message objects with optional reasoning fields), we normalize them into a single conversations structure and enforce an assistant output format centered on <think>...</think> plus final answer text.

Standardize assistant messages into a reasoning-visible format.
Parse mixed message types (dict / JSON string) safely.
Convert each source with dataset-specific formatting functions.
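
Before reading the full conversion code, it may help to see the three input shapes and the shared target on toy data. The records below are made up for illustration; only the field names (problem/thinking/solution, conversation with from/value, messages with role/content/reasoning) follow the schemas described above, and to_conversation is a compact hypothetical stand-in for the dataset-specific functions.

```python
def wrap(reasoning: str, answer: str) -> str:
    """Target assistant format: <think>reasoning</think>, newline, final answer."""
    return f"<think>{reasoning.strip()}</think>\n{answer.strip()}"

# Toy records, one per source schema (invented content).
ds1_row = {"problem": "What is 2 + 3?", "thinking": "2 + 3 = 5.", "solution": "5"}
ds2_row = {"conversation": [
    {"from": "human", "value": "Name a prime number."},
    {"from": "gpt", "value": "<think>2 is prime.</think>\n2"},
]}
ds3_row = {"messages": [
    {"role": "user", "content": "Capital of France?"},
    {"role": "assistant", "content": "Paris", "reasoning": "The capital of France is Paris."},
]}

def to_conversation(row):
    """Map any of the three toy schemas to a list of role/content messages."""
    if "problem" in row:
        return [
            {"role": "user", "content": row["problem"]},
            {"role": "assistant", "content": wrap(row.get("thinking", ""), row["solution"])},
        ]
    if "conversation" in row:
        roles = {"human": "user", "gpt": "assistant"}
        return [{"role": roles[m["from"]], "content": m["value"]} for m in row["conversation"]]
    return [
        {
            "role": m["role"],
            "content": wrap(m.get("reasoning", ""), m["content"])
            if m["role"] == "assistant" else m["content"],
        }
        for m in row["messages"]
    ]

unified = [to_conversation(r) for r in (ds1_row, ds2_row, ds3_row)]
for convo in unified:
    print(convo[-1]["content"])
```

All three sources end up as user/assistant message lists whose assistant turns share the same think-then-answer layout.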

Code: Multi-source formatting to unified conversations

def _strip(x):
    return (x or "").strip()

THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def normalize_assistant_to_think_solution(text: str) -> str:
    text = _strip(text)
    if not text:
        return "<think></think>\n"

    m = THINK_BLOCK_RE.search(text)
    if m:
        think_block = m.group(0).strip()
        rest = text[m.end():].lstrip()
        return f"{think_block}\n{rest}".rstrip() if rest else f"{think_block}\n"

    return f"<think></think>\n{text}".rstrip()

def build_assistant_with_reasoning(content: str, reasoning: str = "") -> str:
    content = _strip(content)
    reasoning = _strip(reasoning)

    if "<think>" in content and "</think>" in content:
        return normalize_assistant_to_think_solution(content)

    if reasoning:
        return f"<think>{reasoning}</think>\n{content}" if content else f"<think>{reasoning}</think>\n"

    return normalize_assistant_to_think_solution(content)

def parse_message_item(m):
    if isinstance(m, dict):
        return m
    if isinstance(m, str):
        s = m.strip()
        if not s:
            return None
        try:
            obj = json.loads(s)
            return obj if isinstance(obj, dict) else None
        except Exception:
            return None
    return None

def format_ds1(examples):
    out = []
    for p, t, s in zip(examples.get("problem", []), examples.get("thinking", []), examples.get("solution", [])):
        p, t, s = _strip(p), _strip(t), _strip(s)
        if not p or not s:
            continue
        assistant = f"<think>{t}</think>\n{s}" if t else f"<think></think>\n{s}"
        out.append([
            {"role": "user", "content": p},
            {"role": "assistant", "content": assistant},
        ])
    return {"conversations": out}

def format_ds2(examples):
    out = []
    for conv in examples.get("conversation", []):
        if not conv:
            continue
        cleaned = []
        for m in conv:
            frm = (m.get("from") or "").strip()
            val = m.get("value", "")
            if frm == "human":
                cleaned.append({"role": "user", "content": _strip(val)})
            elif frm == "gpt":
                cleaned.append({"role": "assistant", "content": normalize_assistant_to_think_solution(val)})
        if len(cleaned) >= 2 and cleaned[-1]["role"] == "assistant":
            out.append(cleaned)
    return {"conversations": out}

def format_ds3(examples):
    out = []
    for msgs in examples.get("messages", []):
        if not msgs:
            continue
        parsed = [pm for pm in (parse_message_item(m) for m in msgs) if pm is not None]
        if not parsed:
            continue

        convo = [m for m in parsed if m.get("role") != "system"]
        if len(convo) < 2 or convo[-1].get("role") != "assistant":
            continue

        cleaned = []
        for m in convo:
            role = m.get("role")
            content = m.get("content", "")
            reasoning = m.get("reasoning", "")
            if role == "assistant":
                content = build_assistant_with_reasoning(content, reasoning)
            else:
                content = _strip(content)
            if role in ("user", "assistant") and content is not None:
                cleaned.append({"role": role, "content": content})

        if len(cleaned) >= 2 and cleaned[-1]["role"] == "assistant":
            out.append(cleaned)

    return {"conversations": out}

ds1 = ds1.map(format_ds1, batched=True, remove_columns=ds1.column_names)
ds2 = ds2.map(format_ds2, batched=True, remove_columns=ds2.column_names)
ds3 = ds3.map(format_ds3, batched=True, remove_columns=ds3.column_names)

5.3 Part III: Template Serialization, Length Control, and Format QA

# 5.3.1 Final dataset construction

After normalization, we merge the three datasets, serialize each conversation through the Qwen thinking template, filter over-length samples by tokenized length, and run a final assistant-format quality check before training.

Keep only non-empty normalized conversations.
Build training text via apply_chat_template.
Enforce context-window constraints and reasoning tag integrity.

Code: Merge, template, filter, and validate

ds1 = ds1.filter(lambda x: x["conversations"] is not None and len(x["conversations"]) > 0)
ds2 = ds2.filter(lambda x: x["conversations"] is not None and len(x["conversations"]) > 0)
ds3 = ds3.filter(lambda x: x["conversations"] is not None and len(x["conversations"]) > 0)

combined_dataset = concatenate_datasets([ds1, ds2, ds3]).shuffle(seed=RANDOM_SEED)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False,
        )
        for convo in convos
    ]
    return {"text": texts}

dataset = combined_dataset.map(formatting_prompts_func, batched=True)

num_proc = mp.cpu_count()
_text_tok = getattr(tokenizer, "tokenizer", tokenizer)

def filter_long_sequences_batched(examples):
    texts = examples["text"]
    tokenized = _text_tok(
        texts,
        truncation=False,
        padding=False,
        add_special_tokens=False,
    )["input_ids"]
    return [len(toks) <= MAX_CONTEXT_WINDOW for toks in tokenized]

dataset = dataset.filter(filter_long_sequences_batched, batched=True, num_proc=num_proc)

def check_assistant_format(examples):
    convos = examples["conversations"]
    ok = []
    for convo in convos:
        good = True
        for m in convo:
            if m["role"] == "assistant":
                c = m.get("content", "")
                if "<think>" not in c or "</think>" not in c:
                    good = False
                    break
                if not re.search(r"</think>\n", c):
                    good = False
                    break
        ok.append(good)
    return {"_ok": ok}

check = dataset.map(
    check_assistant_format,
    batched=True,
    remove_columns=dataset.column_names,
    num_proc=num_proc,
)

bad = len(check) - sum(check["_ok"])
if bad > 0:
    dataset = dataset.filter(lambda x: all(
        (m["role"] != "assistant") or (
            ("<think>" in m["content"]) and ("</think>\n" in m["content"])
        )
        for m in x["conversations"]
    ))

print(dataset[0]["text"][:8000])

Tip: Practical checks before training

Print a few post-template samples to verify role separators and reasoning tags.
Track the retained-sample ratio after length filtering to avoid accidental over-pruning.
If many samples fail format QA, inspect the original source-specific conversion functions first.
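
The retained-sample check can be sketched as follows. This is a minimal illustration: filter_by_length is a hypothetical helper, a whitespace word count stands in for the real tokenizer, and the 90% alert threshold is an arbitrary example value.

```python
def filter_by_length(texts, max_tokens, count_tokens):
    """Return (kept_texts, retained_ratio) for a given token budget."""
    kept = [t for t in texts if count_tokens(t) <= max_tokens]
    ratio = len(kept) / len(texts) if texts else 0.0
    return kept, ratio

def whitespace_token_count(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())

samples = ["short answer", "a " * 50, "medium length sample text here"]
kept, ratio = filter_by_length(samples, max_tokens=10, count_tokens=whitespace_token_count)

print(f"kept {len(kept)}/{len(samples)} samples ({ratio:.0%})")
if ratio < 0.9:  # hypothetical alert threshold
    print("warning: length filter removed more than 10% of the data")
```

In the real pipeline, the same ratio is available by comparing len(dataset) before and after the length filter.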

6. Training Configuration and Execution

6.1 Trainer Configuration

# 6.1.1 Trainer code

Code: SFT training with TRL + Unsloth

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 6,
        gradient_accumulation_steps = 6, # Use GA to mimic batch size!
        warmup_ratio = 0.03,
        # warmup_steps = 60,
        num_train_epochs = 1, # Set this for 1 full training run.
        # max_steps = 50,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        save_steps = 200,
        save_total_limit = 1,
        save_strategy = "steps",
        report_to = "wandb", # Can use Weights & Biases
        output_dir = drive_output_path,
    ),
)

# 6.1.2 Key training parameters

Parameter	Typical value	Notes
`per_device_train_batch_size`	1–128	Depends on VRAM
`gradient_accumulation_steps`	1–128	Increase effective batch size
`learning_rate`	1e-5–3e-4	Tune carefully
`warmup_steps`	1–5000	Stabilizes early training
`max_steps` / `num_train_epochs`	task-dependent	Prefer steps for reproducibility
`logging_steps`	1–20	For monitoring
`save_steps`	10–500	Checkpoint cadence

Table 4: Common training hyperparameters

# 6.1.3 Detailed notes on key parameters

Learning rate: If loss is unstable, reduce LR; if convergence is too slow, increase slightly.
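Effective batch size: per_device_train_batch_size × gradient_accumulation_steps (6 × 6 = 36 sequences per optimizer step in the configuration above); larger values often give more stable updates.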
Sequence length: Longer context increases VRAM and compute dramatically; start small.
Optimizer: adamw_8bit helps reduce memory usage.
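
A quick arithmetic sketch shows how these settings translate into optimizer steps. The helper training_schedule is hypothetical, the dataset size of 10,000 is an assumed round number rather than the size of the filtered dataset, and the Trainer's exact rounding may differ by a step.

```python
import math

def training_schedule(num_samples, per_device_batch, grad_accum,
                      epochs=1, num_devices=1, warmup_ratio=0.03):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Batch settings from the trainer above; num_samples is an assumed value.
plan = training_schedule(num_samples=10_000, per_device_batch=6, grad_accum=6)
print(plan)
```

With save_steps = 200, a run of this assumed size would write only one intermediate checkpoint, so it can be worth lowering save_steps on short runs.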

6.2 Train on Responses Only (Recommended)

In most chat fine-tuning settings, you want the loss computed only on the assistant responses, so that user instructions and template tokens contribute no gradient. Unsloth provides train_on_responses_only, which automatically masks the labels outside the assistant spans.

Code: Mask labels to train only on assistant responses

from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n<think>",
)

Tip: Why this matters

If you train on the entire serialized prompt, the model may waste capacity learning to reproduce system/user content and special tokens, which can hurt instruction-following quality.
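
The masking idea can be shown on toy token ids. This is a simplified re-implementation for illustration, not Unsloth's actual code: mask_non_response is a hypothetical function, the marker ids are invented, and everything outside the text that follows a response marker is set to -100.

```python
IGNORE_INDEX = -100

def find_sub(seq, sub, start=0):
    """Index of the first occurrence of sub in seq at or after start, else -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        r = find_sub(input_ids, response_ids, pos)
        if r == -1:
            break
        start = r + len(response_ids)
        nxt = find_sub(input_ids, instruction_ids, start)
        end = nxt if nxt != -1 else len(input_ids)
        labels[start:end] = input_ids[start:end]
        pos = end
    return labels

USER, ASSISTANT = [1, 10], [1, 20]  # invented marker token ids
ids = USER + [5, 6] + ASSISTANT + [7, 8, 9] + USER + [4] + ASSISTANT + [3]
labels = mask_non_response(ids, USER, ASSISTANT)
print(labels)
```

Only the two assistant spans keep their token ids as labels, so only those positions contribute to the loss.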

6.3 Label Sanity Check (Before/After Training)

To verify that masking worked as expected, you can decode a sample’s labels. In Hugging Face trainers, positions excluded from the loss carry the label -100 (the ignore index). The snippet below replaces those positions with the pad token id so decode can run, and then replaces the visible pad token with spaces for readability. Only assistant text should remain visible in the output.

Code: Decode labels to verify masking

# Swap ignored labels (-100) for the pad token, decode, then show padding as spaces.
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

7. Start Training (Most Important)

After the trainer is configured (and optionally masked to responses-only), start training:

Code: Run training

trainer_stats = trainer.train()

Tip: What is `trainer_stats`?

trainer.train() returns a TrainOutput object with global_step, training_loss, and a metrics dictionary (for example train_runtime and train_samples_per_second). You can print or log it for quick validation.

8. Saving and Deployment

8.1 Saving a 16-bit model

After training, you can push a merged 16-bit model directly to Hugging Face Hub for standard Transformers inference and sharing.

Tip: What this block does

Reads HF_TOKEN securely from Colab Secrets (instead of hard-coding tokens in notebooks).
Verifies the token via whoami() and auto-builds your target repo id.
Uploads the merged 16-bit checkpoint with tokenizer, so the model can be loaded directly from Hugging Face Hub.

Code: Push merged 16-bit model to Hugging Face Hub

from huggingface_hub import whoami
from google.colab import userdata

try:
    hf_token = userdata.get("HF_TOKEN")
    if not hf_token:
        raise ValueError("HF_TOKEN is not set")
except Exception as e:
    raise RuntimeError("HF_TOKEN was not found in Colab Secrets.") from e

try:
    username = whoami(token=hf_token)["name"]
    repo_id = f"{username}/Qwopus3.5-27B"
except Exception as e:
    raise RuntimeError("Failed to authenticate with Hugging Face.") from e

model.push_to_hub_merged(
    repo_id,
    tokenizer,
    save_method="merged_16bit",
    token=hf_token,
)

print(f"Uploaded to https://huggingface.co/{repo_id}")

8.2 Exporting GGUF (optional)

If you want to run the model in llama.cpp-compatible stacks, export and publish a GGUF quantized model.

Tip: What this GGUF step does

Reads HF_TOKEN from Colab Secrets and validates Hugging Face identity.
Exports and uploads multiple GGUF variants (q4_k_m, q8_0, bf16) to one Hub repo.
Makes the model directly usable in llama.cpp-style runtimes with different quality/speed trade-offs.

Warning: Compatibility note

GGUF export depends on the model architecture and your environment. Follow the latest Unsloth / llama.cpp guidance for Qwen-family compatibility.

Code: Push GGUF model to Hugging Face Hub

from huggingface_hub import whoami
from google.colab import userdata

try:
    hf_token = userdata.get("HF_TOKEN")
    if not hf_token:
        raise ValueError("HF_TOKEN is not set")
except Exception as e:
    raise RuntimeError("HF_TOKEN was not found in Colab Secrets.") from e

try:
    username = whoami(token=hf_token)["name"]
    repo_id = f"{username}/Qwopus3.5-27B-GGUF"
except Exception as e:
    raise RuntimeError("Failed to authenticate with Hugging Face.") from e

model.push_to_hub_gguf(
    repo_id,
    tokenizer,
    quantization_method=["q4_k_m","q8_0","bf16"],
    token=hf_token,
)

print(f"Uploaded to https://huggingface.co/{repo_id}")

9. Common Issues and Optimization Tips

9.1 Typical failure modes

Warning: Failure modes checklist

Out-of-memory (OOM): reduce the sequence length or per-device batch size (raise gradient accumulation to keep the same effective batch size), and enable gradient checkpointing.
Loss does not decrease: verify data formatting and the chat template; consider lowering the learning rate.
Overfitting: reduce steps/epochs, increase data diversity, or set a non-zero lora_dropout.

9.2 Hyperparameter tuning workflow

Tip: A practical, reproducible workflow

Stage 1: Validate data + template
- Run 50–200 steps and confirm the loss decreases.
- Spot-check decoded prompts and targets (ensure roles/separators are correct).
Stage 2: Learning-rate sweep
- Try [1e-5, 2e-5, 3e-5] (expand if needed for your scale; the LoRA configuration in Section 6.1 uses 2e-4, so the sweep may need to reach that range).
- Choose the fastest stable loss decrease (avoid divergence/oscillation).
Stage 3: LoRA rank tuning
- Try [16, 32, 64].
- Balance quality vs. VRAM and training time.
Stage 4: Batch size optimization
- Increase batch size as VRAM allows.
- Use gradient accumulation steps to keep an appropriate effective batch size.
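
One way to keep this staged workflow reproducible is to generate each stage's run list from a shared base configuration, changing one setting at a time. The sketch below only builds configuration dictionaries; vary_one is a hypothetical helper, and best_lr is a placeholder you would pick from the Stage 2 loss curves.

```python
def vary_one(base: dict, key: str, values) -> list:
    """Create one run config per value, changing a single key of the base config."""
    return [{**base, key: v, "run_name": f"{key}={v}"} for v in values]

base = {"learning_rate": 2e-5, "r": 64, "max_steps": 100}  # short runs, within the 50-200 step range

# Stage 2: learning-rate sweep with everything else fixed.
lr_runs = vary_one(base, "learning_rate", [1e-5, 2e-5, 3e-5])

best_lr = 2e-5  # placeholder: choose this from the Stage 2 results

# Stage 3: rank sweep at the chosen learning rate.
rank_runs = vary_one({**base, "learning_rate": best_lr}, "r", [16, 32, 64])

for run in lr_runs + rank_runs:
    print(run)
```

Using run_name as the WandB run name makes the curves from each stage easy to compare side by side.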

10. Summary and Outlook

10.1 Key takeaways

Summary of the end-to-end workflow

This guide covered the full end-to-end workflow for LLM fine-tuning:

Environment: Google Colab GPU runtime + Unsloth stack, with optional WandB logging and Google Drive checkpoint persistence.
Model loading: Qwen3.5-27B is loaded in 4-bit mode to fit single-GPU VRAM constraints while keeping training practical.
Parameter-efficient adaptation: rank-64 LoRA adapters are attached to key projection modules, so only a small parameter subset is updated.
Data pipeline: mixed reasoning datasets are normalized into one assistant-target format with Qwen chat template and response-only supervision.
Training and release: SFT is run with memory-aware hyperparameters, then exported to Hugging Face Hub as merged 16-bit weights and optional GGUF variants.

10.2 Typical metrics

Setting	Value used in this guide
Base model scale	Qwen3.5-27B
Runtime environment	Google Colab (A100/H100 class GPU)
Model loading mode	4-bit quantized base + LoRA adapters
LoRA rank used in this guide	64
Export targets	merged 16-bit + GGUF (`q4_k_m` / `q8_0` / `bf16`)

Table 5: Configuration-level metrics aligned with the Qwen3.5-27B Colab workflow

Tip: How to interpret these metrics

For 27B-scale fine-tuning on Colab, absolute runtime and loss values are highly sensitive to GPU type, sequence length, effective batch size, and dataset composition. In practice, compare runs within the same setup rather than relying on a single universal number.

10.3 Next learning steps

Strengthen the current Qwen3.5-27B pipeline first
- Build a small, high-quality evaluation set for coding/math/reasoning tasks.
- Compare r=16/32/64 and sequence length settings under the same data split.
- Track inference behavior changes (reasoning length, answer accuracy, and format stability).
Improve data quality and supervision strategy
- Add harder long-context and multi-step reasoning samples.
- Remove noisy or contradictory outputs in mixed-source datasets.
- Experiment with style constraints for more concise final answers.
Move from SFT to preference alignment
- Try DPO with pairwise preference data collected from your target tasks.
- Evaluate whether alignment improves helpfulness without overextending reasoning traces.
- Keep SFT checkpoints as stable baselines for A/B comparison.
Deployment and reproducibility
- Validate both Hub merged models and GGUF exports with the same prompt suite.
- Document exact package versions and notebook cells for rerunnable training.
- Add lightweight serving endpoints (e.g., vLLM/TGI) once offline quality is validated.
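
As a starting point for the evaluation step, the sketch below scores a batch of generated outputs for format stability, exact-match accuracy, and reasoning length. It is a minimal illustration under simple assumptions: summarize is a hypothetical helper, the outputs and references are toy strings, reasoning length is counted in whitespace-separated words, and exact string match stands in for a task-specific grader.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)

def split_output(text):
    """Return (reasoning, answer) if the output has a think block, else None."""
    m = THINK_RE.search(text)
    if not m:
        return None
    return m.group(1).strip(), m.group(2).strip()

def summarize(outputs, references):
    """Compute format rate, exact-match accuracy, and average reasoning length."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    well_formed = correct = reasoning_words = 0
    for out, ref in zip(outputs, references):
        parsed = split_output(out)
        if parsed is None:
            continue  # malformed outputs count against both rates
        reasoning, answer = parsed
        well_formed += 1
        reasoning_words += len(reasoning.split())
        correct += int(answer == ref)
    n = len(outputs)
    return {
        "format_rate": well_formed / n,
        "exact_match": correct / n,
        "avg_reasoning_words": reasoning_words / well_formed if well_formed else 0.0,
    }

outputs = ["<think>1 + 1 = 2</think>\n2", "no tags here", "<think></think>\n5"]
references = ["2", "4", "6"]
report = summarize(outputs, references)
print(report)
```

Running the same prompt suite through the base model, the SFT checkpoint, and the GGUF export gives directly comparable reports.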