Reproducing "Microscaling Data Formats for Deep Learning"

Paper: Microscaling Data Formats for Deep Learning

1.  Reproducing ResNet-50 (reference 9 in the paper) in FP32, MXINT8, MXFP8, MXFP6, and MXFP4

Table 2. Direct-cast inference with MX data formats
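To make the format names above concrete, here is a minimal sketch of what a direct cast to MXINT8 does to a block of values: the block shares a single power-of-two scale, and each element is stored on an 8-bit grid. It follows a simplified reading of the MX format description (block size 32, INT8 elements with an implied 2^-6 scaling) and is not the microxcaling implementation. The function names are hypothetical.

```python
import math


def mx_int8_quantize_block(block):
    """Quantize one block to a shared power-of-two scale plus INT8-like elements.

    Returns (shared_exp, dequantized_values). Simplified sketch, not the
    microxcaling code path.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2(max_abs)) = e - 1.
    # For INT8 elements the element format's own max exponent is taken as 0.
    shared_exp = math.frexp(max_abs)[1] - 1
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in block:
        # Element grid: multiples of 2**-6, clamped to [-127/64, 127/64].
        q = round(v / scale * 64)
        q = max(-127, min(127, q))
        dequantized.append(q / 64 * scale)
    return shared_exp, dequantized


def mx_int8_quantize(values, block_size=32):
    """Apply the block quantizer to consecutive chunks of block_size values."""
    out = []
    for start in range(0, len(values), block_size):
        _, deq = mx_int8_quantize_block(values[start:start + block_size])
        out.extend(deq)
    return out
```

Comparing `mx_int8_quantize(x)` against `x` for a few weight rows gives a feel for the rounding error that the direct-cast accuracy numbers reflect.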

git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Classification/

# Download ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar into this directory first

mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..

mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
cd ..
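Before building the image, it can be worth checking that extraction produced the expected layout. The sketch below counts class subdirectories and files under a split directory; for ILSVRC2012 both train and val should contain 1000 class folders, with 1,281,167 training images and 50,000 validation images. The helper name `count_class_dirs` is hypothetical.

```python
import os


def count_class_dirs(split_dir):
    """Return (number of class subdirectories, number of files inside them)."""
    n_classes = 0
    n_files = 0
    for entry in sorted(os.listdir(split_dir)):
        class_path = os.path.join(split_dir, entry)
        if os.path.isdir(class_path):
            n_classes += 1
            n_files += sum(
                1
                for name in os.listdir(class_path)
                if os.path.isfile(os.path.join(class_path, name))
            )
    return n_classes, n_files
```

Calling `count_class_dirs("train")` and `count_class_dirs("val")` from the directory that holds both splits should report 1000 classes each if extraction and valprep.sh completed.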

docker build . -t nvidia_resnet50

Mount the local directory /data/workspaces/zhangxin/ILSVRC2012 at /imagenet inside the container:

nvidia-docker run --rm -it -v /data/workspaces/zhangxin/ILSVRC2012:/imagenet --ipc=host nvidia_resnet50

Once the container is running, download the pretrained model inside it and run inference (--amp is a boolean flag that enables mixed precision):

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnet50_pyt_amp/versions/20.06.0/zip -O resnet50_pyt_amp_20.06.0.zip
unzip resnet50_pyt_amp_20.06.0.zip

vim /workspace/rn50/image_classification/dataloaders.py
Comment out lines 135 and 189.


# Mixed precision
python ./main.py --arch=resnet50 --evaluate --amp --epochs=1 --pretrained -b=256 /imagenet/
# FP32
python ./main.py --arch=resnet50 --evaluate --epochs=1 --pretrained -b=256 /imagenet/
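The number compared against the paper's table is top-1 accuracy on the validation set. As a reminder of what that metric computes, here is a minimal pure-Python sketch of top-k accuracy over a batch of logits. It is illustrative only and is not the evaluation code used by main.py; `topk_accuracy` is a hypothetical name.

```python
def topk_accuracy(logits, labels, k=1):
    """Percentage of samples whose true label is among the k highest logits."""
    correct = 0
    for row, label in zip(logits, labels):
        # Indices of the k largest scores in this row.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in topk:
            correct += 1
    return 100.0 * correct / len(labels)
```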

Next, use the PyTorch emulation library for Microscaling (MX)-compatible data formats (microsoft/microxcaling) to replace the torch.nn operators with MX operators:

git clone https://github.com/microsoft/microxcaling.git
python -c "import torch; print(torch.__version__)"
pip install torch --upgrade
cd path/to/example/
bash run_mxfp6.sh
cd path/to/mx/test/
python -m pytest .
vim main.py

from mx import finalize_mx_specs
from mx import mx_mapping

if __name__ == "__main__":
    epilog = [
        "Based on the architecture picked by --arch flag, you may use the following options:\n"
    ]

    # Simple MX spec for MXFP6 weights+activations
    mx_specs = {
        'w_elem_format': 'fp6_e3m2',
        'a_elem_format': 'fp6_e3m2',
        'block_size': 32,
        'bfloat': 16,
        'custom_cuda': True,
        # For quantization-aware finetuning, do backward pass in FP32
        'quantize_backprop': False,
    }
    mx_specs = finalize_mx_specs(mx_specs)

    # Auto-inject MX modules and functions
    # This will replace certain torch.nn.* and torch.nn.functional.*
    # modules/functions in the global namespace!
    mx_mapping.inject_pyt_ops(mx_specs)

    for model, ep in available_models().items():
        model_help = "\n".join(ep.parser().format_help().split("\n")[2:])
        epilog.append(model_help)
    parser = argparse.ArgumentParser(
        description="PyTorch ImageNet Training",
        epilog="\n".join(epilog),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )

    add_parser_arguments(parser)

    args, rest = parser.parse_known_args()

    model_arch = available_models()[args.arch]
    model_args, rest = model_arch.parser().parse_known_args(rest)
    print(model_args)

    assert len(rest) == 0, f"Unknown args passed: {rest}"

    cudnn.benchmark = True

    main(args, model_args, model_arch)
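The excerpt above hard-codes MXFP6. To sweep the other formats from the section title, the spec dictionary can be generated per format, as in the sketch below. The element-format strings are assumed to match the names accepted by the mx library and should be checked against its README before use. `ELEM_FORMATS` and `make_mx_specs` are hypothetical names, and the remaining fields simply mirror the values used above.

```python
# Element-format strings are an assumption about what the mx library accepts;
# verify them against the microxcaling README before relying on them.
ELEM_FORMATS = {
    "MXINT8": "int8",
    "MXFP8": "fp8_e4m3",
    "MXFP6": "fp6_e3m2",
    "MXFP4": "fp4_e2m1",
}


def make_mx_specs(name, block_size=32):
    """Build an mx_specs dict using the same format for weights and activations."""
    fmt = ELEM_FORMATS[name]
    return {
        "w_elem_format": fmt,
        "a_elem_format": fmt,
        "block_size": block_size,
        "bfloat": 16,
        "custom_cuda": True,
        # Inference only: no quantization in the backward pass.
        "quantize_backprop": False,
    }
```

The returned dictionary would then be passed through `finalize_mx_specs` and `mx_mapping.inject_pyt_ops` exactly as in the excerpt, one format per evaluation run.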
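# Comment out multiple lines in vim
vim main.py
Ctrl+v (enter visual block mode)
Press the down arrow to select the lines
Press capital I
Type #
Press Esc
# Uncomment (here lines 659 to 675; without the g flag only the first # on each line is removed)
:659,675s/#//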
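The same toggle can be scripted, which is handy when switching between the FP32 and MX variants of main.py repeatedly. The sketch below works on a list of lines with 1-indexed, inclusive ranges, like the vim range above. Both function names are hypothetical.

```python
def comment_lines(lines, start, end, marker="# "):
    """Prefix lines start..end (1-indexed, inclusive) with the comment marker."""
    return [
        marker + line if start <= number <= end else line
        for number, line in enumerate(lines, 1)
    ]


def uncomment_lines(lines, start, end, marker="# "):
    """Strip a leading marker from lines start..end; other lines are untouched."""
    result = []
    for number, line in enumerate(lines, 1):
        if start <= number <= end and line.startswith(marker):
            line = line[len(marker):]
        result.append(line)
    return result
```

Because the marker is only stripped from the start of a line, any `#` characters later in the line are left alone.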

2.  Reproducing GNMT

git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Translation/GNMT
bash scripts/docker/build.sh
bash scripts/docker/interactive.sh
bash scripts/wmt16_en_de.sh
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024

python3 translate.py \
  --input data/wmt16_de_en/newstest2014.en \
  --reference data/wmt16_de_en/newstest2014.de \
  --output /tmp/output \
  --model gnmt/model_best.pth

python3 translate.py \
  --input-text "The quick brown fox jumps over the lazy dog" \
  --model gnmt/model_best.pth

3.  Transformer: copying files from the Docker container to the host

docker ps

(py311) zhangxin@SH-AI-GPU04:~/code$ docker ps
CONTAINER ID   IMAGE                            COMMAND                   CREATED       STATUS       PORTS                NAMES
b3966e6d051a   nvidia_transformer:transformer   "/opt/nvidia/nvidia_…"   2 hours ago   Up 2 hours   6006/tcp, 8888/tcp   focused_bartik

docker cp focused_bartik:/workspace/translation/transformer.txt ~/code/
docker cp focused_bartik:/workspace/translation/transformer-model_list.txt ~/code/
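When several files need to come out of the container, the `docker cp` calls can be generated rather than typed one by one. The sketch below only builds the argument lists; actually running them (for example with `subprocess.run(cmd, check=True)`) is left to the caller. The helper name `docker_cp_commands` is hypothetical.

```python
def docker_cp_commands(container, container_paths, host_dir):
    """Build one `docker cp` argument list per file to copy out of a container."""
    return [
        ["docker", "cp", f"{container}:{path}", host_dir]
        for path in container_paths
    ]
```

Using the container name reported by `docker ps` (here focused_bartik) avoids having to look up the container ID each time.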
