A Walkthrough of the Caffe Source Code
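As an algorithm engineer who calls libraries all day, I often wonder what is under the hood. That curiosity led me to read the Caffe source code.
Caffe is a lightweight deep learning library that is widely used in computer vision. There are two ways to get better at coding: write more and read more. Compared with writing your own bug-ridden code, reading excellent code is a real pleasure. Caffe is only about 50,000 lines of code and is fairly clean, yet those lines support the production workloads of many companies, so they are well worth a few days of reading. This article outlines the skeleton of the Caffe code base; I hope it helps readers who are interested in the Caffe source.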
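Main directories of the Caffe project
data/ // scripts for downloading datasets
examples/ // usage examples for Caffe; some are Jupyter notebooks that you can open with Jupyter
include/ // header files, kept separate from the sources
matlab/ // MATLAB-related code implementing the Caffe MATLAB wrapper
models/ // models shipped with the source
python/ // the Python module exported with Boost.Python, plus Python code that wraps the exported interface
tools/ // sources of the data-processing tools and of the caffe command-line tool
scripts/ // helper scripts for building
src/ // source directory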
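Getting oriented in the source
Reading code starts from the main function, so we begin with the two tools that Caffe provides.
The tools directory contains the caffe command-line tool, which compiles to caffe.bin. It supports training and testing models. Training can use multiple GPUs with data parallelism.
The python directory contains train.py, a training script built on the Python version of Caffe.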
caffe.bin
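caffe.bin provides four commands: train, test, time, and device_query. Each corresponds to a function that is statically registered in g_brew_map. The simplified main function is shown below: based on the first argument, it calls GetBrewFunction(string) to fetch the matching function from the global g_brew_map (a std::map from caffe::string to a function pointer of type int (*)()) and then calls it.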
int main(int argc, char** argv) {
return GetBrewFunction(caffe::string(argv[1]))();
}
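The registry-and-dispatch idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not Caffe code: register, brew_map, get_brew_function, brew_train, and brew_test are hypothetical names chosen to echo caffe.cpp.

```python
# Hypothetical sketch of a name -> function registry (not Caffe code).
brew_map = {}

def register(name):
    """Decorator that stores a command function under `name`."""
    def decorator(func):
        brew_map[name] = func
        return func
    return decorator

@register("train")
def brew_train():
    return "training"

@register("test")
def brew_test():
    return "testing"

def get_brew_function(name):
    """Look up a registered command and fail loudly on unknown names."""
    if name not in brew_map:
        raise KeyError("Unknown action: %s (available: %s)"
                       % (name, ", ".join(sorted(brew_map))))
    return brew_map[name]

def main(argv):
    # argv[1] selects the command, as in the simplified main() above.
    return get_brew_function(argv[1])()
```

Because every command registers itself, adding a new one does not require touching main.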
test
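test constructs a caffe::Net object from the given model file and repeatedly calls its Forward method, accumulating the loss; no backward pass is run. The simplified skeleton is shown below.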
int test(){
......
Net<float> caffe_net(modelpath,...);
......
for (int i = 0; i < FLAGS_iterations; ++i) {
float iter_loss;
const vector<Blob<float>*>& result =
caffe_net.Forward(&iter_loss);
loss += iter_loss;
......
}
}
train
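train constructs a solver from the given solver configuration file. For multi-GPU training, it builds an NCCL object from the solver and calls that object's Run method; for a single GPU or the CPU, it calls the solver's Solve method directly.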
int train(){
......
caffe::SolverParameter solver_param;
caffe::ReadSolverParamsFromTextFileOrDie(FLAGS_solver, &solver_param);
......
shared_ptr<caffe::Solver<float>>
solver(caffe::SolverRegistry<float>::CreateSolver(solver_param));
if (gpus.size() > 1) {
caffe::NCCL<float> nccl(solver);
// single process, multiple GPUs
nccl.Run(gpus, FLAGS_snapshot.size() > 0 ? FLAGS_snapshot.c_str() : NULL);
}else {
solver->Solve();
}
......
}
train.py
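This script uses Python's multiprocessing module to train with multiple processes. Its logic is similar to what nccl.Run(…) does in the train function of the command-line tool. The difference is that the command-line version is pure C++, so it can use multiple threads, where each thread drives one GPU and holds one NCCL communicator. In the Python version, users can define custom layers in Python, and even Caffe's predefined layers may have Python functions registered as callbacks. This brings great flexibility but also trouble: Python and C++ calls are interleaved with no clear boundary. With multiple threads, any training thread may touch a PyObject or run the Python interpreter; accessing a PyObject without holding the GIL is not thread-safe, and running the interpreter also requires the GIL. As a result, the Caffe Python library cannot speed up training across multiple GPUs with threads. train.py sidesteps the problem by using the multiprocessing module to run one process per GPU, each holding one NCCL communicator.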
#!/usr/bin/env python
"""
Trains a model using one or more GPUs.
"""
from multiprocessing import Process
import caffe

def train(
        solver,  # solver proto definition
        snapshot,  # solver snapshot to restore
        gpus,  # list of device ids
        timing=False,  # show timing info for compute and communications
):
    # NCCL uses a uid to identify a session
    uid = caffe.NCCL.new_uid()
    caffe.init_log()
    caffe.log('Using devices %s' % str(gpus))
    procs = []
    for rank in range(len(gpus)):
        p = Process(target=solve,
                    args=(solver, snapshot, gpus, timing, uid, rank))
        p.daemon = True
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

def time(solver, nccl):
    fprop = []
    bprop = []
    total = caffe.Timer()
    allrd = caffe.Timer()
    for _ in range(len(solver.net.layers)):
        fprop.append(caffe.Timer())
        bprop.append(caffe.Timer())
    display = solver.param.display
    def show_time():
        if solver.iter % display == 0:
            s = '\n'
            for i in range(len(solver.net.layers)):
                s += 'forw %3d %8s ' % (i, solver.net._layer_names[i])
                s += ': %.2f\n' % fprop[i].ms
            for i in range(len(solver.net.layers) - 1, -1, -1):
                s += 'back %3d %8s ' % (i, solver.net._layer_names[i])
                s += ': %.2f\n' % bprop[i].ms
            s += 'solver total: %.2f\n' % total.ms
            s += 'allreduce: %.2f\n' % allrd.ms
            caffe.log(s)
    solver.net.before_forward(lambda layer: fprop[layer].start())
    solver.net.after_forward(lambda layer: fprop[layer].stop())
    solver.net.before_backward(lambda layer: bprop[layer].start())
    solver.net.after_backward(lambda layer: bprop[layer].stop())
    solver.add_callback(lambda: total.start(), lambda: (total.stop(), allrd.start()))
    solver.add_callback(nccl)
    solver.add_callback(lambda: '', lambda: (allrd.stop(), show_time()))

def solve(proto, snapshot, gpus, timing, uid, rank):
    # set gpu flag
    caffe.set_mode_gpu()
    caffe.set_device(gpus[rank])
    caffe.set_solver_count(len(gpus))
    caffe.set_solver_rank(rank)
    # set multiprocess flag to true
    caffe.set_multiprocess(True)
    solver = caffe.SGDSolver(proto)
    if snapshot and len(snapshot) != 0:
        solver.restore(snapshot)
    nccl = caffe.NCCL(solver, uid)
    nccl.bcast()
    if timing and rank == 0:
        time(solver, nccl)
    else:
        solver.add_callback(nccl)
    if solver.param.layer_wise_reduce:
        solver.net.after_backward(nccl)
    solver.step(solver.param.max_iter)

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--solver", required=True, help="Solver proto definition.")
    parser.add_argument("--snapshot", help="Solver snapshot to restore.")
    parser.add_argument("--gpus", type=int, nargs='+', default=[0],
                        help="List of device ids.")
    parser.add_argument("--timing", action='store_true', help="Show timing info.")
    args = parser.parse_args()
    train(args.solver, args.snapshot, args.gpus, args.timing)
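Core classes
The overview above surfaced several core classes: Blob, Layer, Net, Solver, and NCCL. All of them are also exported to Python through Boost.Python as corresponding classes and interfaces. The following sections go through them one by one.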
blob
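A blob is Caffe's data container. Its data is a multi-dimensional array, typically laid out as batch*channel*height*width. A blob has two important member variables.
shared_ptr<SyncedMemory> data_ // values
shared_ptr<SyncedMemory> diff_ // gradients
These two members store the values and the gradients of those values, respectively.
The SyncedMemory class deserves a mention: it automatically synchronizes host memory and device memory. Internally it keeps an enum member named head_ that records where the up-to-date copy lives; its value is one of UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, or SYNCED. If the GPU copy is not the latest version, calling mutable_gpu_data first copies the data to the GPU and then returns the GPU pointer. mutable_cpu_data works analogously.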
enum SyncedHead { UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };
void* SyncedMemory::mutable_cpu_data() {
check_device();
to_cpu();
head_ = HEAD_AT_CPU;
return cpu_ptr_;
}
void* SyncedMemory::mutable_gpu_data() {
check_device();
#ifndef CPU_ONLY
to_gpu();
head_ = HEAD_AT_GPU;
return gpu_ptr_;
#else
NO_GPU;
return NULL;
#endif
}
// Make sure the CPU copy is up to date, allocating host memory if necessary.
inline void SyncedMemory::to_cpu() {
check_device();
switch (head_) {
case UNINITIALIZED:
CaffeMallocHost(&cpu_ptr_, size_, &cpu_malloc_use_cuda_);
caffe_memset(size_, 0, cpu_ptr_);
head_ = HEAD_AT_CPU;
own_cpu_data_ = true;
break;
case HEAD_AT_GPU:
#ifndef CPU_ONLY
if (cpu_ptr_ == NULL) {
CaffeMallocHost(&cpu_ptr_, size_, &cpu_malloc_use_cuda_);
own_cpu_data_ = true;
}
caffe_gpu_memcpy(size_, gpu_ptr_, cpu_ptr_);
head_ = SYNCED;
#else
NO_GPU;
#endif
break;
case HEAD_AT_CPU:
case SYNCED:
break;
}
}
// Make sure the GPU copy is up to date, allocating GPU memory if necessary.
inline void SyncedMemory::to_gpu() {
check_device();
#ifndef CPU_ONLY
switch (head_) {
case UNINITIALIZED:
CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
caffe_gpu_memset(size_, 0, gpu_ptr_);
head_ = HEAD_AT_GPU;
own_gpu_data_ = true;
break;
case HEAD_AT_CPU:
if (gpu_ptr_ == NULL) {
CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
own_gpu_data_ = true;
}
caffe_gpu_memcpy(size_, cpu_ptr_, gpu_ptr_);
head_ = SYNCED;
break;
case HEAD_AT_GPU:
case SYNCED:
break;
}
#else
NO_GPU;
#endif
}
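The head_ state machine is easier to follow in a toy model. The sketch below uses two Python lists as stand-ins for host and device memory and counts the simulated transfers; SyncedBuffer and its copies counter are hypothetical names, and the read-only accessors cpu_data and gpu_data are modeled on the SyncedMemory methods of the same name rather than copied from them.

```python
from enum import Enum

class Head(Enum):
    UNINITIALIZED = 0
    HEAD_AT_CPU = 1
    HEAD_AT_GPU = 2
    SYNCED = 3

class SyncedBuffer:
    """Toy model: two lists stand in for host and device memory."""
    def __init__(self, size):
        self.size = size
        self.cpu = None
        self.gpu = None
        self.head = Head.UNINITIALIZED
        self.copies = 0  # number of simulated host<->device transfers

    def _to_cpu(self):
        if self.head is Head.UNINITIALIZED:
            self.cpu = [0.0] * self.size
            self.head = Head.HEAD_AT_CPU
        elif self.head is Head.HEAD_AT_GPU:
            self.cpu = list(self.gpu)      # device -> host copy
            self.copies += 1
            self.head = Head.SYNCED

    def _to_gpu(self):
        if self.head is Head.UNINITIALIZED:
            self.gpu = [0.0] * self.size
            self.head = Head.HEAD_AT_GPU
        elif self.head is Head.HEAD_AT_CPU:
            self.gpu = list(self.cpu)      # host -> device copy
            self.copies += 1
            self.head = Head.SYNCED

    def cpu_data(self):                    # read-only: state may stay SYNCED
        self._to_cpu()
        return self.cpu

    def gpu_data(self):
        self._to_gpu()
        return self.gpu

    def mutable_cpu_data(self):            # caller will write: CPU becomes the head
        self._to_cpu()
        self.head = Head.HEAD_AT_CPU
        return self.cpu

    def mutable_gpu_data(self):            # caller will write: GPU becomes the head
        self._to_gpu()
        self.head = Head.HEAD_AT_GPU
        return self.gpu
```

The point of the design is that a copy happens only when the side being accessed is stale, so repeated reads on the same side cost nothing.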
layer
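Layers are the building blocks of a Caffe network. Caffe ships many predefined layers, and each layer type registers a factory function that creates instances of it. The two core methods of a layer are Forward and Backward, which implement the forward and backward computation of that layer. The forward pass reads the blob vector passed in as bottom and stores its results in top; it reads and writes the data_ member of the blobs. The backward pass takes the bottom and top blobs and computes the diff_ member of the corresponding bottom blobs. Layer is the base class of all layers and implements the shared skeleton code. Subclasses implement Forward_cpu, Forward_gpu, Backward_cpu, and Backward_gpu. Forward_cpu and Backward_cpu are pure virtual functions with no implementation in Layer, so concrete subclasses must provide them. Forward_gpu and Backward_gpu are virtual functions whose default implementations in Layer call Forward_cpu and Backward_cpu. In other words, a subclass must implement Forward_cpu and Backward_cpu, while GPU implementations of the forward and backward passes are optional. Readers can look at concrete layers such as the data layer and the convolution layer.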
virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) = 0;
/**
* @brief Using the GPU device, compute the layer output.
* Fall back to Forward_cpu() if unavailable.
*/
virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
// LOG(WARNING) << "Using CPU code as backup.";
return Forward_cpu(bottom, top);
}
virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom) = 0;
/**
* @brief Using the GPU device, compute the gradients for any parameters and
* for the bottom blobs if propagate_down is true.
* Fall back to Backward_cpu() if unavailable.
*/
virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom) {
// LOG(WARNING) << "Using CPU code as backup.";
Backward_cpu(top, propagate_down, bottom);
}
template <typename Dtype>
inline Dtype Layer<Dtype>::Forward(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
Dtype loss = 0;
Reshape(bottom, top);
switch (Caffe::mode()) {
case Caffe::CPU:
Forward_cpu(bottom, top);
for (int top_id = 0; top_id < top.size(); ++top_id) {
if (!this->loss(top_id)) { continue; }
const int count = top[top_id]->count();
const Dtype* data = top[top_id]->cpu_data();
const Dtype* loss_weights = top[top_id]->cpu_diff();
loss += caffe_cpu_dot(count, data, loss_weights);
}
break;
case Caffe::GPU:
Forward_gpu(bottom, top);
#ifndef CPU_ONLY
for (int top_id = 0; top_id < top.size(); ++top_id) {
if (!this->loss(top_id)) { continue; }
const int count = top[top_id]->count();
const Dtype* data = top[top_id]->gpu_data();
const Dtype* loss_weights = top[top_id]->gpu_diff();
Dtype blob_loss = 0;
caffe_gpu_dot(count, data, loss_weights, &blob_loss);
loss += blob_loss;
}
#endif
break;
default:
LOG(FATAL) << "Unknown caffe mode.";
}
return loss;
}
template <typename Dtype>
inline void Layer<Dtype>::Backward(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom) {
switch (Caffe::mode()) {
case Caffe::CPU:
Backward_cpu(top, propagate_down, bottom);
break;
case Caffe::GPU:
Backward_gpu(top, propagate_down, bottom);
break;
default:
LOG(FATAL) << "Unknown caffe mode.";
}
}
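The "CPU path required, GPU path optional" contract can be sketched with an abstract base class. This is an illustration of the pattern only: the class names, the string-valued mode, and the tagged GPU result are assumptions made for the example, not Caffe's API.

```python
from abc import ABC, abstractmethod

class Layer(ABC):
    mode = "CPU"  # stands in for Caffe::mode()

    @abstractmethod
    def forward_cpu(self, bottom):
        """Subclasses must provide the CPU path."""

    def forward_gpu(self, bottom):
        # Default: fall back to the CPU path, like Layer::Forward_gpu.
        return self.forward_cpu(bottom)

    def forward(self, bottom):
        # Dispatch on the global mode, like Layer::Forward.
        if Layer.mode == "CPU":
            return self.forward_cpu(bottom)
        if Layer.mode == "GPU":
            return self.forward_gpu(bottom)
        raise ValueError("Unknown mode: %r" % Layer.mode)

class ReLULayer(Layer):
    # CPU path only; in GPU mode the base class reuses it.
    def forward_cpu(self, bottom):
        return [max(0.0, x) for x in bottom]

class DoubleLayer(Layer):
    def forward_cpu(self, bottom):
        return [2.0 * x for x in bottom]

    def forward_gpu(self, bottom):
        # Pretend this launches a device kernel; the tag makes the path visible.
        return ("gpu", [2.0 * x for x in bottom])
```

Instantiating Layer directly raises TypeError, which plays the role of the compile-time error a C++ pure virtual function would give.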
net
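Net is the skeleton of the whole network: it assembles the layers and coordinates their forward and backward computation. Net's Init(const NetParameter& in_param) method creates every layer from the given NetParameter object, wires up their inputs and outputs, and assembles all layers into one network. Its most important methods are the Forward and Backward families, all of which end up calling ForwardFromTo and BackwardFromTo. As the code below shows, these call each layer's Forward in topological order and each layer's Backward in reverse topological order.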
template <typename Dtype>
Dtype Net<Dtype>::ForwardFromTo(int start, int end) {
CHECK_GE(start, 0);
CHECK_LT(end, layers_.size());
Dtype loss = 0;
for (int i = start; i <= end; ++i) {
// run the before_forward_ callbacks
for (int c = 0; c < before_forward_.size(); ++c) {
before_forward_[c]->run(i);
}
// forward pass of this layer
Dtype layer_loss = layers_[i]->Forward(bottom_vecs_[i], top_vecs_[i]);
loss += layer_loss;
if (debug_info_) { ForwardDebugInfo(i); }
// run the after_forward_ callbacks
for (int c = 0; c < after_forward_.size(); ++c) {
after_forward_[c]->run(i);
}
}
return loss;
}
template <typename Dtype>
void Net<Dtype>::BackwardFromTo(int start, int end) {
CHECK_GE(end, 0);
CHECK_LT(start, layers_.size());
for (int i = start; i >= end; --i) {
// run the before_backward_ callbacks
for (int c = 0; c < before_backward_.size(); ++c) {
before_backward_[c]->run(i);
}
// run this layer's backward pass if needed
if (layer_need_backward_[i]) {
layers_[i]->Backward(
top_vecs_[i], bottom_need_backward_[i], bottom_vecs_[i]);
if (debug_info_) { BackwardDebugInfo(i); }
}
// run the after_backward_ callbacks
for (int c = 0; c < after_backward_.size(); ++c) {
after_backward_[c]->run(i);
}
}
}
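As the code above shows, Net calls before_forward_[c]->run(i) and after_forward_[c]->run(i) around each layer's Forward, and before_backward_[c]->run(i) and after_backward_[c]->run(i) around each layer's Backward. The callback signature is void run(int layer), where the int identifies the layer index. Net provides four methods for registering callbacks, shown below. The NCCL class discussed later relies on this callback mechanism.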
class Callback {
protected:
virtual void run(int layer) = 0;
template <typename T>
friend class Net;
};
const vector<Callback*>& before_forward() const { return before_forward_; }
void add_before_forward(Callback* value) {
before_forward_.push_back(value);
}
const vector<Callback*>& after_forward() const { return after_forward_; }
void add_after_forward(Callback* value) {
after_forward_.push_back(value);
}
const vector<Callback*>& before_backward() const { return before_backward_; }
void add_before_backward(Callback* value) {
before_backward_.push_back(value);
}
const vector<Callback*>& after_backward() const { return after_backward_; }
void add_after_backward(Callback* value) {
after_backward_.push_back(value);
}
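A minimal sketch of the per-layer hook mechanism, assuming layers are plain callables and callbacks are functions that take the layer index. ToyNet and its attribute names are hypothetical and only mirror the forward half of the Net code.

```python
class ToyNet:
    """Runs layers in order and fires hooks around each one."""
    def __init__(self, layers):
        self.layers = layers          # list of callables: x -> x
        self.before_forward = []      # hooks called with the layer index
        self.after_forward = []

    def forward_from_to(self, x, start, end):
        assert 0 <= start and end < len(self.layers)
        for i in range(start, end + 1):
            for cb in self.before_forward:
                cb(i)
            x = self.layers[i](x)
            for cb in self.after_forward:
                cb(i)
        return x

# Usage: record the order in which hooks fire.
events = []
net = ToyNet([lambda x: x + 1, lambda x: x * 2])
net.before_forward.append(lambda i: events.append(("before", i)))
net.after_forward.append(lambda i: events.append(("after", i)))
result = net.forward_from_to(3, 0, 1)
```

Per-layer timers and per-layer gradient reduction both fit this shape, which is why the same four hook lists serve the timing code in train.py and the NCCL class.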
solver
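Solver has several subclasses, each implementing one optimization method. The logic of a solver is: compute the gradients, process them in some way, and use the processed gradients to update the model weights. Its key methods are Solve and Step.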
template <typename Dtype>
void Solver<Dtype>::Solve(const char* resume_file) {
CHECK(Caffe::root_solver());
LOG(INFO) << "Solving " << net_->name();
LOG(INFO) << "Learning Rate Policy: " << param_.lr_policy();
// Initialize to false every time we start solving.
requested_early_exit_ = false;
// If a previous solver state was saved, restore it.
if (resume_file) {
LOG(INFO) << "Restoring previous solver status from " << resume_file;
Restore(resume_file);
}
// For a network that is trained by the solver, no bottom or top vecs
// should be given, and we will just provide dummy vecs.
int start_iter = iter_;
// run the training iterations
Step(param_.max_iter() - iter_);
// If we haven't already, save a snapshot after optimization, unless
// overridden by setting snapshot_after_train := false
if (param_.snapshot_after_train()
&& (!param_.snapshot() || iter_ % param_.snapshot() != 0)) {
Snapshot();
}
if (requested_early_exit_) {
LOG(INFO) << "Optimization stopped early.";
return;
}
// After the optimization is done, run an additional train and test pass to
// display the train and test loss/outputs if appropriate (based on the
// display and test_interval settings, respectively). Unlike in the rest of
// training, for the train net we only run a forward pass as we've already
// updated the parameters "max_iter" times -- this final pass is only done to
// display the loss, which is computed in the forward pass.
if (param_.display() && iter_ % param_.display() == 0) {
int average_loss = this->param_.average_loss();
Dtype loss;
net_->Forward(&loss);
UpdateSmoothedLoss(loss, start_iter, average_loss);
LOG(INFO) << "Iteration " << iter_ << ", loss = " << smoothed_loss_;
}
if (param_.test_interval() && iter_ % param_.test_interval() == 0) {
TestAll();
}
LOG(INFO) << "Optimization Done.";
}
template <typename Dtype>
void Solver<Dtype>::Step(int iters) {
const int start_iter = iter_;
const int stop_iter = iter_ + iters;
int average_loss = this->param_.average_loss();
losses_.clear();
smoothed_loss_ = 0;
iteration_timer_.Start();
while (iter_ < stop_iter) {
// zero-init the params
net_->ClearParamDiffs();
if (param_.test_interval() && iter_ % param_.test_interval() == 0
&& (iter_ > 0 || param_.test_initialization())) {
// only the root solver runs the tests
if (Caffe::root_solver()) {
TestAll();
}
if (requested_early_exit_) {
// Break out of the while loop because stop was requested while testing.
break;
}
}
// run on_start of the solver callbacks
for (int i = 0; i < callbacks_.size(); ++i) {
callbacks_[i]->on_start();
}
const bool display = param_.display() && iter_ % param_.display() == 0;
net_->set_debug_info(display && param_.debug_info());
// accumulate the loss and gradient
Dtype loss = 0;
// run iter_size forward/backward passes, accumulating the gradients
for (int i = 0; i < param_.iter_size(); ++i) {
loss += net_->ForwardBackward();
}
loss /= param_.iter_size();
// average the loss across iterations for smoothed reporting
UpdateSmoothedLoss(loss, start_iter, average_loss);
if (display) {
float lapse = iteration_timer_.Seconds();
float per_s = (iter_ - iterations_last_) / (lapse ? lapse : 1);
LOG_IF(INFO, Caffe::root_solver()) << "Iteration " << iter_
<< " (" << per_s << " iter/s, " << lapse << "s/"
<< param_.display() << " iters), loss = " << smoothed_loss_;
iteration_timer_.Start();
iterations_last_ = iter_;
const vector<Blob<Dtype>*>& result = net_->output_blobs();
int score_index = 0;
for (int j = 0; j < result.size(); ++j) {
const Dtype* result_vec = result[j]->cpu_data();
const string& output_name =
net_->blob_names()[net_->output_blob_indices()[j]];
const Dtype loss_weight =
net_->blob_loss_weights()[net_->output_blob_indices()[j]];
for (int k = 0; k < result[j]->count(); ++k) {
ostringstream loss_msg_stream;
if (loss_weight) {
loss_msg_stream << " (* " << loss_weight
<< " = " << loss_weight * result_vec[k] << " loss)";
}
LOG_IF(INFO, Caffe::root_solver()) << " Train net output #"
<< score_index++ << ": " << output_name << " = "
<< result_vec[k] << loss_msg_stream.str();
}
}
}
for (int i = 0; i < callbacks_.size(); ++i) {
callbacks_[i]->on_gradients_ready();
}
// apply the update to the weights
ApplyUpdate();
// Increment the internal iter_ counter -- its value should always indicate
// the number of times the weights have been updated.
++iter_;
// If signal handling is enabled, pressing Ctrl+C makes GetRequestedAction() return SolverAction::STOP, which stops the iterations early
SolverAction::Enum request = GetRequestedAction();
// Save a snapshot if needed.
if ((param_.snapshot()
&& iter_ % param_.snapshot() == 0
&& Caffe::root_solver()) ||
(request == SolverAction::SNAPSHOT)) {
Snapshot();
}
if (SolverAction::STOP == request) {
requested_early_exit_ = true;
// Break out of training loop.
break;
}
}
}
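The code above shows that Solver also defines callbacks. The callback interface has two methods, on_start() and on_gradients_ready(), called at the start of each optimization step and once the gradients are ready, respectively.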
class Callback {
protected:
virtual void on_start() = 0;
virtual void on_gradients_ready() = 0;
template <typename T>
friend class Solver;
};
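To make the order of events in one solver step concrete, here is a toy solver that fits y = w * x with a single scalar weight. ToySolver and RecordingCallback are hypothetical; dividing the accumulated gradient by iter_size before the update is an assumption of this sketch.

```python
import itertools

class RecordingCallback:
    """Hypothetical callback that records when each hook fires."""
    def __init__(self, events):
        self.events = events

    def on_start(self):
        self.events.append("on_start")

    def on_gradients_ready(self):
        self.events.append("on_gradients_ready")

class ToySolver:
    """Plain SGD on one scalar weight with one scalar gradient."""
    def __init__(self, lr, iter_size):
        self.weight = 0.0
        self.diff = 0.0
        self.lr = lr
        self.iter_size = iter_size
        self.iter = 0
        self.callbacks = []

    def step(self, iters, batches):
        for _ in range(iters):
            self.diff = 0.0                      # like ClearParamDiffs()
            for cb in self.callbacks:
                cb.on_start()
            for _ in range(self.iter_size):      # accumulate gradients
                x, y = next(batches)
                err = self.weight * x - y
                self.diff += err * x             # d/dw of 0.5 * err**2
            for cb in self.callbacks:
                cb.on_gradients_ready()          # an allreduce would go here
            # Assumption: average the accumulated gradient, then take an SGD step.
            self.weight -= self.lr * self.diff / self.iter_size
            self.iter += 1

events = []
solver = ToySolver(lr=0.1, iter_size=2)
solver.callbacks.append(RecordingCallback(events))
solver.step(1, itertools.cycle([(1.0, 2.0), (2.0, 4.0)]))
```

The on_gradients_ready hook sits between gradient computation and the weight update, which is the only place where gradients from several workers can be combined before they are applied.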
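NCCL
The NCCL class implements multi-GPU parallel training on a single machine. It wraps a solver and uses the NCCL library for multi-GPU acceleration. It supports two modes: a single process driving multiple GPUs, and multiple processes on one machine. Accordingly, there are two ways to use the class: let NCCL organize single-process multi-GPU training, as the C++ tool does, or create one instance per process, as the Python script train.py does.
Let us first look at the definition of the NCCL class and its constructors. NCCL inherits both callback interfaces, Solver::Callback and Net::Callback, and mainly implements the two methods void run(int layer) and void on_gradients_ready().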
template<typename Dtype>
class NCCL : public GPUParams<Dtype>,
public Solver<Dtype>::Callback,
public Net<Dtype>::Callback {
public:
/**
* Single process version.
*/
explicit NCCL(shared_ptr<Solver<Dtype> > solver);
/**
* In multi-process settings, first create a NCCL id (new_uid), then
* pass it to each process to create connected instances.
*/
NCCL(shared_ptr<Solver<Dtype> > solver, const string& uid);
~NCCL();
boost::barrier* barrier();
void set_barrier(boost::barrier* value);
/**
* In single process settings, create instances without uids and
* call this to connect them.
*/
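// In single-process multi-GPU mode, initializes the NCCL ring and creates the NCCL communicators
static void InitSingleProcess(vector<NCCL<Dtype>*>* nccls);
// In multi-process mode on one machine, first create an NCCL session id; this uid is then
// passed to the NCCL constructor, which creates the NCCL communicator.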
static string new_uid();
/**
* Broadcast weights from rank 0 other solvers.
*/
// NCCL broadcast
void Broadcast();
/**
* Single process multi-GPU.
*/
// entry point for single-process multi-GPU training
void Run(const vector<int>& gpus, const char* restore);
protected:
void Init();
void on_start() {}//solver callback
void run(int layer); // Net callback
void on_gradients_ready();// solver callback
// NCCL communicator
ncclComm_t comm_;
// CUDA stream
cudaStream_t stream_;
shared_ptr<Solver<Dtype> > solver_;
// Should not be necessary, https://github.com/NVIDIA/nccl/issues/37
boost::barrier* barrier_;
using Params<Dtype>::size_;
using Params<Dtype>::data_;
using Params<Dtype>::diff_;
};
// Constructor 1: used for single-process multi-GPU training
template<typename Dtype>
NCCL<Dtype>::NCCL(shared_ptr<Solver<Dtype> > solver)
: GPUParams<Dtype>(solver, getDevice()),
comm_(), solver_(solver), barrier_() {
// apply buffers: move the parameters into contiguous GPU memory
this->Configure(solver.get());
Init();
}
// Constructor 2: used for multi-process multi-GPU training
template<typename Dtype>
NCCL<Dtype>::NCCL(shared_ptr<Solver<Dtype> > solver, const string& uid)
: GPUParams<Dtype>(solver, getDevice()),
solver_(solver), barrier_() {
// apply buffers: move the parameters into contiguous GPU memory
this->Configure(solver.get());
Caffe::set_multiprocess(true);
ncclUniqueId nccl_uid;
memcpy(&nccl_uid, &uid[0], NCCL_UNIQUE_ID_BYTES); // NOLINT(caffe/alt_fn)
// create the NCCL communicator
NCCL_CHECK(ncclCommInitRank(&comm_,
Caffe::solver_count(),
nccl_uid,
Caffe::solver_rank()));
Init();
}
// apply buffers
template<typename Dtype>
void GPUParams<Dtype>::Configure(Solver<Dtype>* solver) const {
const vector<Blob<Dtype>*>& net =
solver->net()->learnable_params();
apply_buffers(net, data_, size_, replace_gpu);
apply_buffers(net, diff_, size_, replace_gpu_diff);
}
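The idea behind moving parameters into one buffer can be shown without a GPU. In the sketch below, each parameter becomes a memoryview slice into one flat array, so a single pass over the flat buffer touches every parameter. The function only borrows the name apply_buffers; its signature and return value are assumptions made for this illustration.

```python
from array import array

def apply_buffers(params):
    """Copy each parameter (a list of floats) into one flat buffer.

    Returns (flat, views), where views[i] is a writable window onto the
    region of `flat` that now holds params[i].
    """
    total = sum(len(p) for p in params)
    flat = array("d", [0.0]) * total
    whole = memoryview(flat)
    views, offset = [], 0
    for p in params:
        window = whole[offset:offset + len(p)]
        window[:] = array("d", p)      # copy the values into the flat buffer
        views.append(window)
        offset += len(p)
    return flat, views

flat, views = apply_buffers([[1.0, 2.0], [3.0]])
views[1][0] = 9.0                      # writing through a view updates the flat buffer
```

With this layout, one collective call over the flat buffer replaces one call per parameter.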
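Now let us see how the callbacks are implemented; they perform allreduce, the collective commonly used in data-parallel training. As a Net callback, run(int layer) is invoked after each layer's backward pass finishes; it allreduces the gradients of that layer's learnable parameters across the GPUs and then averages them. As a Solver callback, on_gradients_ready() is invoked after the whole backward pass has finished; it allreduces the gradients of all learnable parameters across the GPUs and then averages them. (When layer_wise_reduce is enabled, on_gradients_ready() only waits for the per-layer reductions to finish.) An attentive reader may notice that this allreduce seems to require all parameters to live in contiguous GPU memory. Indeed: when the NCCL object is constructed it calls Configure, and one job of Configure is to call apply_buffers to place all learnable parameters in one contiguous block of GPU memory.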
// Net callback: called when the corresponding layer's backward pass has finished
template<typename Dtype>
void NCCL<Dtype>::run(int layer) {
CHECK(solver_->param().layer_wise_reduce());
vector<shared_ptr<Blob<Dtype> > >& blobs =
solver_->net()->layers()[layer]->blobs();
#ifdef DEBUG
// Assert blobs are contiguous to reduce in one step (e.g. bias often small)
for (int i = 1; i < blobs.size(); ++i) {
CHECK_EQ(blobs[i - 1]->gpu_diff() + blobs[i - 1]->count(),
blobs[i + 0]->gpu_diff());
}
#endif
if (blobs.size() > 0) {
// Make sure default stream is done computing gradients. Could be
// replaced by cudaEventRecord+cudaStreamWaitEvent to avoid
// blocking the default stream, but it's actually slower.
CUDA_CHECK(cudaStreamSynchronize(cudaStreamDefault));
// Reduce asynchronously
int size = 0;
for (int i = 0; i < blobs.size(); ++i) {
size += blobs[i]->count();
}
if (barrier_) { // NULL in multi process case
barrier_->wait();
}
// NCCL allreduce
NCCL_CHECK(ncclAllReduce(blobs[0]->mutable_gpu_diff(),
blobs[0]->mutable_gpu_diff(),
size,
nccl::dataType<Dtype>::type,
ncclSum, comm_, stream_));
// average
caffe_gpu_scal(size, (Dtype) 1.0 / Caffe::solver_count(),
blobs[0]->mutable_gpu_diff(), stream_);
}
}
// Solver callback: called after the solver has computed the gradients for one training iteration
template<typename Dtype>
void NCCL<Dtype>::on_gradients_ready() {
if (solver_->param().layer_wise_reduce()) {
CHECK_EQ(solver_->net()->params().size(),
solver_->net()->learnable_params().size())
<< "Layer-wise reduce is not supported for nets with shared weights.";
// Make sure reduction is done before applying gradients
CUDA_CHECK(cudaStreamSynchronize(stream_));
} else {
if (barrier_) { // NULL in multi process case
// synchronization point
barrier_->wait();
}
NCCL_CHECK(ncclAllReduce(diff_, diff_, static_cast<int>(size_),
nccl::dataType<Dtype>::type, ncclSum, comm_,
cudaStreamDefault));
caffe_gpu_scal(static_cast<int>(size_),
(Dtype) 1.0 / Caffe::solver_count(), diff_);
}
}
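Numerically, "allreduce with ncclSum, then scale by 1 / solver_count" is just an average across workers. The sketch below simulates it with one Python list per rank; all_reduce_sum and average_gradients are hypothetical helpers, and no real communication takes place.

```python
def all_reduce_sum(buffers):
    """Simulated in-place allreduce: every rank ends up with the elementwise sum."""
    total = [sum(values) for values in zip(*buffers)]
    for buf in buffers:
        buf[:] = total

def average_gradients(buffers):
    """Sum across ranks, then scale by 1 / number_of_ranks on every rank."""
    all_reduce_sum(buffers)
    scale = 1.0 / len(buffers)
    for buf in buffers:
        for i in range(len(buf)):
            buf[i] *= scale

# Two ranks, each holding the gradient it computed on its own mini-batch.
grads = [[1.0, 2.0], [3.0, 6.0]]
average_gradients(grads)
```

After the call every rank holds the same averaged gradient, so each solver applies an identical update and the weights stay in sync.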
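Single-process multi-GPU implementation
Back to the train function in caffe.bin. The entry point for single-process multi-GPU training is NCCL::Run(const vector<int>& gpus, const char* restore).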
int train(){
......
caffe::SolverParameter solver_param;
caffe::ReadSolverParamsFromTextFileOrDie(FLAGS_solver, &solver_param);
......
shared_ptr<caffe::Solver<float>>
solver(caffe::SolverRegistry<float>::CreateSolver(solver_param));
// single process, multiple GPUs
if (gpus.size() > 1) {
// construct an NCCL object named nccl
caffe::NCCL<float> nccl(solver);
// entry point for single-process multi-GPU training
nccl.Run(gpus, FLAGS_snapshot.size() > 0 ? FLAGS_snapshot.c_str() : NULL);
}else {
solver->Solve();
}
......
}
Let us see what NCCL::Run(const vector<int>& gpus, const char* restore) does. It creates multiple threads; each is bound to one GPU and owns one NCCL object.
template<typename Dtype>
void NCCL<Dtype>::Run(const vector<int>& gpus, const char* restore) {
// create a barrier for synchronization
boost::barrier barrier(static_cast<int>(gpus.size()));
// one NCCL object per GPU
vector<NCCL<Dtype>*> nccls(gpus.size());
// Create workers
// one thread per NCCL object
vector<shared_ptr<Worker<Dtype> > > workers(gpus.size());
// besides the current thread, gpus.size()-1 more threads are created
for (int i = 1; i < gpus.size(); ++i) {
CUDA_CHECK(cudaSetDevice(gpus[i]));
Caffe::set_solver_rank(i);
Worker<Dtype>* w = new Worker<Dtype>(solver_, gpus[i], &barrier,
&nccls, restore);
w->StartInternalThread();
workers[i].reset(w);
}
CUDA_CHECK(cudaSetDevice(gpus[0]));
// the solver on this thread is the root solver, with rank 0
Caffe::set_solver_rank(0);
// barrier_ is a pointer member of this NCCL object
barrier_ = &barrier;
// register this NCCL object as a solver callback
solver_->add_callback(this);
// register it as a net callback
if (solver_->param().layer_wise_reduce()) {
solver_->net()->add_after_backward(this);
}
// the first NCCL object is this object itself
nccls[0] = this;
// Wait for workers
// first synchronization point
barrier.wait();
// Init NCCL
InitSingleProcess(&nccls);
/* The function is defined as follows; it mainly initializes the NCCL communicators.
*template<typename Dtype>
* void NCCL<Dtype>::InitSingleProcess(vector<NCCL<Dtype>*>* nccls) {
* ncclComm_t* comms = new ncclComm_t[nccls->size()];
* int* gpu_list = new int[nccls->size()];
* for (int i = 0; i < nccls->size(); ++i) {
* gpu_list[i] = (*nccls)[i]->solver_->param().device_id();
* }
* // initialize the NCCL communicators
* NCCL_CHECK(ncclCommInitAll(comms, static_cast<int>(nccls->size()), gpu_list));
* for (int i = 0; i < nccls->size(); ++i) {
* // hand each initialized communicator to its NCCL object
* (*nccls)[i]->comm_ = comms[i];
*}
*}
*/
// second synchronization point
barrier.wait();
// Run first solver on current thread
Broadcast();
// run the optimization; it contains many synchronization points
solver_->Solve();
// last synchronization point
barrier.wait(); // Hangs without it when running tests
// Wait for shutdown
for (int i = 1; i < gpus.size(); ++i) {
workers[i]->StopInternalThread();
}
}
Now let us look at what each thread does. Each worker does much the same as the main thread, minus creating the threads and the NCCL communicators.
template<typename Dtype>
class Worker : public InternalThread {
public:
explicit Worker(shared_ptr<Solver<Dtype> > rank0, int device,
boost::barrier* barrier, vector<NCCL<Dtype>*>* nccls,
const char* restore)
: rank0_(rank0), device_(device), barrier_(barrier),
nccls_(nccls), restore_(restore) {
}
virtual ~Worker() {}
protected:
void InternalThreadEntry() {
// Create solver and install callbacks
SolverParameter param(rank0_->param());
param.set_device_id(device_);
#ifdef DEBUG
int device;
CUDA_CHECK(cudaGetDevice(&device));
CHECK_EQ(device, device_);
#endif
param.set_type(rank0_->type());
shared_ptr<Solver<Dtype> > s(SolverRegistry<Dtype>::CreateSolver(param));
CHECK_EQ(s->type(), rank0_->type());
if (restore_) {
// Could not make NCCL broadcast solver state, it seems to crash
// if called in a tight loop, regardless of barriers etc. so
// restore all solvers from file.
s->Restore(restore_);
}
// create the NCCL object
NCCL<Dtype> nccl(s);
nccl.set_barrier(barrier_);
// register the solver callback, which runs the NCCL allreduce
s->add_callback(&nccl);
// register the net callback, which runs the NCCL allreduce
if (s->param().layer_wise_reduce()) {
s->net()->add_after_backward(&nccl);
}
(*nccls_)[Caffe::solver_rank()] = &nccl;
// Wait for other threads
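// first synchronization point, shared with the main thread and the other workers; compare with the synchronization points in the main thread
barrier_->wait();
// Wait for NCCL init
// second synchronization point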
barrier_->wait();
// Broadcast rank 0 state
nccl.Broadcast();
// Solve
// contains many synchronization points
s->Step(param.max_iter() - s->iter());
// last synchronization point
barrier_->wait();
#ifdef DEBUG
// Check all solvers have same state
SGDSolver<Dtype>* sa = static_cast<SGDSolver<Dtype>*>(rank0_.get());
SGDSolver<Dtype>* sb = static_cast<SGDSolver<Dtype>*>(s.get());
for (int h = 0; h < sa->history().size(); ++h) {
CUDA_CHECK(cudaSetDevice(sa->param().device_id()));
const Dtype* a = sa->history()[h]->cpu_data();
CUDA_CHECK(cudaSetDevice(sb->param().device_id()));
const Dtype* b = sb->history()[h]->cpu_data();
for (int v = 0; v < sa->history()[h]->count(); ++v) {
CHECK_DOUBLE_EQ(a[v], b[v]);
}
}
#endif
}
shared_ptr<Solver<Dtype> > rank0_;
int device_;
boost::barrier* barrier_;
vector<NCCL<Dtype>*>* nccls_;
const char* restore_;
};
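The three matching barrier points in the main thread and the workers can be reproduced with threading.Barrier. This is a sketch of the handshake only: slots stands in for the nccls vector, the "comm-N" strings stand in for communicators, and run is a hypothetical helper.

```python
import threading

def run(num_ranks):
    barrier = threading.Barrier(num_ranks)
    slots = [None] * num_ranks          # like the shared nccls vector
    log = []
    lock = threading.Lock()

    def worker(rank):
        slots[rank] = {"rank": rank, "comm": None}
        barrier.wait()                  # 1: every rank has registered itself
        barrier.wait()                  # 2: rank 0 has initialized all comms
        with lock:
            log.append((rank, slots[rank]["comm"]))   # "training" happens here
        barrier.wait()                  # 3: every rank has finished

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(1, num_ranks)]
    for t in threads:
        t.start()

    slots[0] = {"rank": 0, "comm": None}
    barrier.wait()                      # 1
    for slot in slots:                  # stands in for the one-shot comm init
        slot["comm"] = "comm-%d" % slot["rank"]
    barrier.wait()                      # 2
    with lock:
        log.append((0, slots[0]["comm"]))
    barrier.wait()                      # 3
    for t in threads:
        t.join()
    return sorted(log)
```

The first barrier guarantees that every slot is filled before rank 0 reads them, and the second guarantees that no worker uses its communicator before it has been assigned.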
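Multi-process multi-GPU implementation
The multi-process multi-GPU implementation is similar to the single-process one, with threads replaced by processes. One difference is how the NCCL communicators are created, which the next section explains.
Because of the limitations of Python multithreading, multi-GPU training in the Python version of Caffe is built on multiprocessing. Below is an annotated version of the train.py script that ships with Caffe.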
#!/usr/bin/env python
"""
Trains a model using one or more GPUs.
"""
from multiprocessing import Process
import caffe

def train(
        solver,  # solver proto definition
        snapshot,  # solver snapshot to restore
        gpus,  # list of device ids
        timing=False,  # show timing info for compute and communications
):
    # NCCL uses a uid to identify a session
    uid = caffe.NCCL.new_uid()
    caffe.init_log()
    caffe.log('Using devices %s' % str(gpus))
    procs = []
    # manually assign rank
    for rank in range(len(gpus)):
        # start one process per GPU; each process runs solve
        p = Process(target=solve,
                    args=(solver, snapshot, gpus, timing, uid, rank))
        p.daemon = True
        p.start()
        procs.append(p)
    for p in procs:
        # the main process waits for the workers to finish
        p.join()

def time(solver, nccl):
    fprop = []
    bprop = []
    total = caffe.Timer()
    allrd = caffe.Timer()
    for _ in range(len(solver.net.layers)):
        fprop.append(caffe.Timer())
        bprop.append(caffe.Timer())
    display = solver.param.display
    def show_time():
        if solver.iter % display == 0:
            s = '\n'
            for i in range(len(solver.net.layers)):
                s += 'forw %3d %8s ' % (i, solver.net._layer_names[i])
                s += ': %.2f\n' % fprop[i].ms
            for i in range(len(solver.net.layers) - 1, -1, -1):
                s += 'back %3d %8s ' % (i, solver.net._layer_names[i])
                s += ': %.2f\n' % bprop[i].ms
            s += 'solver total: %.2f\n' % total.ms
            s += 'allreduce: %.2f\n' % allrd.ms
            caffe.log(s)
    # time the forward and backward pass of every layer
    solver.net.before_forward(lambda layer: fprop[layer].start())
    solver.net.after_forward(lambda layer: fprop[layer].stop())
    solver.net.before_backward(lambda layer: bprop[layer].start())
    solver.net.after_backward(lambda layer: bprop[layer].stop())
    # As an aside: Caffe exports the callback-registration interface through
    # Boost.Python, so Python functions and callable objects can be registered.
    # See _caffe.cpp under python/caffe for details:
    """
    void Net_before_forward(Net<Dtype>* net, bp::object run) {
      net->add_before_forward(new NetCallback<Dtype>(run));
    }
    void Net_after_forward(Net<Dtype>* net, bp::object run) {
      net->add_after_forward(new NetCallback<Dtype>(run));
    }
    void Net_before_backward(Net<Dtype>* net, bp::object run) {
      net->add_before_backward(new NetCallback<Dtype>(run));
    }
    void Net_after_backward(Net<Dtype>* net, bp::object run) {
      net->add_after_backward(new NetCallback<Dtype>(run));
    }
    void Net_add_nccl(Net<Dtype>* net
    #ifdef USE_NCCL
      , NCCL<Dtype>* nccl
    #endif
    ) {
    #ifdef USE_NCCL
      net->add_after_backward(nccl);
    #endif
    }
    """
    # Time the whole forward/backward pass and the allreduce. This solver
    # callback starts the total timer when an iteration starts; once the
    # gradients are ready it stops the total timer and starts the allrd timer.
    solver.add_callback(lambda: total.start(), lambda: (total.stop(), allrd.start()))
    # next, the nccl callback runs and performs the allreduce
    solver.add_callback(nccl)
    # finally, stop the allrd timer and print the timing results
    solver.add_callback(lambda: '', lambda: (allrd.stop(), show_time()))

def solve(proto, snapshot, gpus, timing, uid, rank):
    # set gpu flag
    caffe.set_mode_gpu()
    caffe.set_device(gpus[rank])
    caffe.set_solver_count(len(gpus))
    # manually assign the rank
    caffe.set_solver_rank(rank)
    # set multiprocess flag to true
    caffe.set_multiprocess(True)
    solver = caffe.SGDSolver(proto)
    if snapshot and len(snapshot) != 0:
        solver.restore(snapshot)
    # create the NCCL object
    nccl = caffe.NCCL(solver, uid)
    nccl.bcast()
    if timing and rank == 0:
        # on the root solver, register the timing callbacks
        time(solver, nccl)
    else:
        # register nccl as a solver callback
        solver.add_callback(nccl)
    if solver.param.layer_wise_reduce:
        # register nccl as a net callback
        solver.net.after_backward(nccl)
    # run the optimization iterations
    solver.step(solver.param.max_iter)

if __name__ == '__main__':
    # parse the command line
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--solver", required=True, help="Solver proto definition.")
    parser.add_argument("--snapshot", help="Solver snapshot to restore.")
    parser.add_argument("--gpus", type=int, nargs='+', default=[0],
                        help="List of device ids.")
    parser.add_argument("--timing", action='store_true', help="Show timing info.")
    args = parser.parse_args()
    # entry point for multi-process multi-GPU training
    train(args.solver, args.snapshot, args.gpus, args.timing)
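Differences between single-process and multi-process use of the NCCL library
NCCL is NVIDIA's multi-GPU communication library. It supports MPI-like collectives such as broadcast and allreduce, and TensorFlow 1.13 has also started to use NCCL. NCCL 2 supports single-process multi-GPU on one machine, multi-process multi-GPU on one machine, and multi-process multi-GPU across machines; for examples, see https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/examples.html. NCCL communication is typically arranged as a ring. In the single-process case, calling ncclCommInitAll builds the ring among the GPUs in the current process, which is what Caffe's single-process multi-GPU mode does. In the multi-process case, inter-process communication is involved and a coordinator is needed (roughly, a separate thread that listens on a TCP port). Each process contacts the coordinator at its address, and the ring is built under its coordination. That is why the train.py script first calls caffe.NCCL.new_uid() (which calls the NCCL API ncclGetUniqueId()) to create the coordinator's address and then passes this uid to each NCCL constructor, which calls ncclCommInitRank to initialize the NCCL communicator.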
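Summary
Caffe is a lightweight library of a little over 50,000 lines, and its code is fairly clean. This article is my notes from reading the Caffe source: it outlines the skeleton of the framework and looks in particular at how multi-GPU data parallelism is implemented. It does not walk through the source in great detail and only traces the overall structure; I hope it helps readers who want to read the Caffe source, but the details are best learned by reading the code yourself.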




