HuggingFace - Medium

Simple considerations for simple people building fancy neural networks

Victor Sanh — Tue, 22 Sep 2020 13:31:12 GMT

As machine learning continues penetrating all aspects of the industry, neural networks have never been so hyped. For instance, models like GPT-3 have been all over social media in the past few weeks and continue to make headlines outside of tech news outlets with fear-mongering titles.

An article from The Guardian

At the same time, deep learning frameworks, tools, and specialized libraries democratize machine learning research by making state-of-the-art research easier to use than ever. It is quite common to see these almost-magical/plug-and-play 5 lines of code that promise (near) state-of-the-art results. Working at Hugging Face 🤗, I admit that I am partially guilty of that. 😅 It can give an inexperienced user the misleading impression that neural networks are now a mature technology while in fact, the field is in constant development.

When I was doing research on language models a decade ago, I could never in my wildest dreams imagine that MC Hammer would be debating Gary Marcus about the latest work in the field today.

In reality, building and training neural networks can often be an extremely frustrating experience:

It is sometimes hard to understand if your performance comes from a bug in your model/code or is simply limited by your model’s expressiveness.
You can make tons of tiny mistakes at every step of the process without realizing at first, and your model will still train and give a decent performance.

In this post, I will try to highlight a few steps of my mental process when it comes to building and debugging neural networks. By “debugging”, I mean making sure you align what you have built and what you have in mind. I will also point out things you can look at when you are not sure what your next step should be by listing the typical questions I ask myself.

A lot of these thoughts stem from my experience doing research in natural language processing but most of these principles can be applied to other fields of machine learning.

1. 🙈 Start by putting machine learning aside

It might sound counter-intuitive but the very first step of building a neural network is to put aside machine learning and simply focus on your data. Look at the examples, their labels, the diversity of the vocabulary if you are working with text, their length distribution, etc. You should dive into the data to get a first sense of the raw product you are working with and focus on extracting general patterns that a model might be able to catch. Hopefully, by looking at a few hundred examples, you will be able to identify high-level patterns. A few standard questions you can ask yourself:

Are the labels balanced?
Are there gold-labels that you do not agree with?
How were the data obtained? What are the possible sources of noise in this process?
Are there any preprocessing steps that seem natural (tokenization, URL or hashtag removing, etc.)?
How diverse are the examples?
What rule-based algorithm would perform decently on this problem?

It is important to get a high-level (qualitative) feeling for your dataset along with a fine-grained (quantitative) analysis. If you are working with a public dataset, someone else might have already dived into the data and reported their analysis (this is quite common in Kaggle competitions, for instance), so you should absolutely have a look at these!
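As a rough illustration of the quantitative side, here is a minimal sketch that answers a few of the questions above (label balance, length distribution, vocabulary diversity). The toy `examples` list and the `dataset_summary` helper are made up for this example, and whitespace splitting is only a stand-in for a real tokenizer.

```python
from collections import Counter
from statistics import mean, median

# Toy (text, label) pairs standing in for a real dataset (hypothetical data).
examples = [
    ("great movie , loved it", "pos"),
    ("terrible plot", "neg"),
    ("loved the cast", "pos"),
    ("not great , not terrible", "neg"),
    ("great great great", "pos"),
]


def dataset_summary(pairs):
    """Return label counts, length statistics and vocabulary statistics."""
    label_counts = Counter(label for _, label in pairs)
    # Whitespace tokenization: an assumption, replace with your own tokenizer.
    lengths = [len(text.split()) for text, _ in pairs]
    vocab = Counter(token for text, _ in pairs for token in text.split())
    return {
        "label_counts": dict(label_counts),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "max_length": max(lengths),
        "vocab_size": len(vocab),
        "most_common_tokens": vocab.most_common(3),
    }


summary = dataset_summary(examples)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Numbers like these do not replace reading a few hundred examples yourself, but they quickly flag unbalanced labels or suspiciously long inputs.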

2. 📚 Continue as if you just started machine learning

Once you have a deep and broad understanding of your data, I always recommend putting yourself in the shoes of your old self, when you had just started machine learning and were watching Andrew Ng’s introductory classes on Coursera. Start as simple as possible to get a sense of the difficulty of your task and how well standard baselines would perform. For instance, if you work with text, standard baselines for binary text classification can include a logistic regression trained on top of word2vec or fastText embeddings. With the current tools, running these baselines is as easy as (if not easier than) running BERT, which can arguably be considered one of the standard tools for many natural language processing problems. If other baselines are available, run (or implement) some of them. It will help you get even more familiar with the data.

As developers, it is easy to feel good when building something fancy, but it is sometimes hard to rationally justify it if it beats easy baselines by only a few points, so it is central to make sure you have reasonable points of comparison:

How would a random predictor perform (especially in classification problems)? Datasets can be unbalanced…
What would the loss look like for a random predictor?
What is (are) the best metric(s) to measure progress on my task?
What are the limits of this metric? If it’s perfect, what can I conclude? What can’t I conclude?
What is missing in “simple approaches” to reach a perfect score?
Are there architectures in my neural network toolbox that would be good to model the inductive bias of the data?
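To make the first two questions concrete, here is a small sketch of the reference numbers I would compute before training anything. The 90/10 label split is hypothetical; the helper names are mine. It shows the accuracy of always predicting the majority class, the cross-entropy of a uniform random predictor (the log of the number of classes), and the cross-entropy of a predictor that only outputs the class frequencies.

```python
import math
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def uniform_cross_entropy(num_classes):
    """Loss of a predictor assigning probability 1/num_classes to every class."""
    return math.log(num_classes)


def prior_cross_entropy(labels):
    """Loss of a predictor that always outputs the empirical class frequencies."""
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in Counter(labels).values())


# Hypothetical unbalanced binary dataset: 90% negative, 10% positive.
labels = ["neg"] * 90 + ["pos"] * 10

print("majority accuracy:", majority_baseline_accuracy(labels))
print("uniform predictor loss:", uniform_cross_entropy(2))
print("class-prior predictor loss:", prior_cross_entropy(labels))
```

On this toy split, a model reporting 90% accuracy has learned nothing beyond the label distribution, which is exactly why accuracy alone can be a misleading metric on unbalanced data.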

3. 🦸‍♀️ Don’t be afraid to look under the hood of these 5-liner templates

Next, you can start building your model based on the insights and understanding you acquired previously. As mentioned earlier, implementing neural networks can quickly become quite tricky: there are many moving parts that work together (the optimizer, the model, the input processing pipeline, etc.), and many small things can go wrong when implementing these parts and connecting them to each other. The challenge lies in the fact that you can make these mistakes, train a model without it ever crashing, and still get a decent performance…

Yet, it is a good habit when you think you have finished implementing to overfit a small batch of examples (16 for instance). If your implementation is (nearly) correct, your model will be able to overfit and remember these examples by displaying a 0-loss (make sure you remove any form of regularization such as weight decay). If not, it is highly possible that you did something wrong in your implementation. In some rare cases, it means that your model is not expressive enough or lacks capacity. Again, start with a small-scale model (fewer layers for instance): you are looking to debug your model so you want a quick feedback loop, not a high performance.
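Here is a minimal, framework-free sketch of that sanity check. A plain logistic regression stands in for your network (an assumption made to keep the example self-contained), the 16 random examples and their random labels are synthetic, and no regularization is applied. With more features than examples the batch can be memorized, so the loss should go from log(2) to nearly zero.

```python
import math
import random

random.seed(0)
BATCH_SIZE, NUM_FEATURES = 16, 32

# Synthetic batch: random inputs with random binary labels (pure memorization task).
inputs = [[random.gauss(0.0, 1.0) for _ in range(NUM_FEATURES)] for _ in range(BATCH_SIZE)]
labels = [random.randint(0, 1) for _ in range(BATCH_SIZE)]


def logit(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_loss(weights, bias):
    """Numerically stable binary cross-entropy computed from logits."""
    total = 0.0
    for x, y in zip(inputs, labels):
        z = logit(weights, bias, x)
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / BATCH_SIZE


def overfit(steps=2000, lr=0.5):
    """Full-batch gradient descent without weight decay or any other regularizer."""
    weights, bias = [0.0] * NUM_FEATURES, 0.0
    initial = mean_loss(weights, bias)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * NUM_FEATURES, 0.0
        for x, y in zip(inputs, labels):
            error = sigmoid(logit(weights, bias, x)) - y
            grad_b += error / BATCH_SIZE
            for j, xj in enumerate(x):
                grad_w[j] += error * xj / BATCH_SIZE
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return initial, mean_loss(weights, bias)


initial_loss, final_loss = overfit()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

If the loss of your own model refuses to approach zero in this setting, suspect the implementation before suspecting the data.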

Pro-tip: in my experience working with pre-trained language models, freezing the embedding modules at their pre-trained values doesn’t affect the fine-tuning task performance much while considerably speeding up training.

Some common errors include:

Wrong indexing… (these are really the worst 😅). Make sure you are gathering tensors along the correct dimensions for instance…
You forgot to call `model.eval()` before evaluating (in PyTorch) or `model.zero_grad()` to clear the gradients
Something went wrong in the pre-processing of the inputs
The loss got wrong arguments (for instance passing probabilities when it expects logits)
Initialization doesn’t break the symmetry (usually happens when you initialize a whole matrix with a single constant value)
Some parameters are never called during the forward pass (and thus receive no gradients)
The learning rate is taking funky values like 0 all the time
Your inputs are being truncated in a suboptimal way
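To illustrate one item of this list, here is a small sketch of the "probabilities instead of logits" bug, written without any framework. The `cross_entropy_from_logits` helper is my own stand-in for a loss that applies the softmax internally. Feeding it probabilities applies the softmax twice: nothing crashes, but even for a perfectly confident and correct prediction the loss stays far from zero.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one example; expects raw scores, not probabilities."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum_exp - logits[target]


logits = [10.0, 0.0, 0.0]  # very confident, correct prediction for class 0
target = 0

correct_loss = cross_entropy_from_logits(logits, target)
# Bug: probabilities passed where logits are expected (softmax applied twice).
buggy_loss = cross_entropy_from_logits(softmax(logits), target)

print(f"correct usage: {correct_loss:.5f}")
print(f"buggy usage:   {buggy_loss:.5f}")
```

With three classes, the buggy loss cannot go below log(1 + 2/e), so a training curve that plateaus suspiciously early is worth checking against this kind of mistake.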

Pro-tip: when you work with language, have a serious look at the outputs of the tokenizers. I can’t count the number of lost hours I spent trying to reproduce results (and sometimes my own old results) because something went wrong with the tokenization.🤦‍♂️

Another useful tool is deep-diving into the training dynamics and plotting (in Tensorboard for instance) the evolution of multiple scalars through training. At the bare minimum, you should look at the dynamics of your loss(es), the parameters, and their gradients.

As the loss decreases, you also want to look at the model’s predictions: either by evaluating on your development set or, my personal favorite, by printing a couple of model outputs. For instance, if you are training a machine translation model, it is quite satisfying to see the generations become more and more convincing through the training. You want to be especially careful about overfitting: your training loss continues to decrease while your evaluation loss is aiming at the stars.💫
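A simple way to spot this pattern in logged curves is sketched below. The function name, the `patience` parameter and the loss values are all hypothetical; the check flags the first evaluation from which the evaluation loss rises several times in a row while the training loss keeps going down.

```python
def first_overfitting_step(train_losses, eval_losses, patience=2):
    """Return the index where eval loss starts rising for `patience`
    consecutive evaluations while train loss still decreases, else None."""
    consecutive = 0
    for i in range(1, len(eval_losses)):
        train_improving = train_losses[i] < train_losses[i - 1]
        eval_worsening = eval_losses[i] > eval_losses[i - 1]
        consecutive = consecutive + 1 if (train_improving and eval_worsening) else 0
        if consecutive >= patience:
            return i - patience + 1
    return None


# Hypothetical curves, one value per evaluation.
train_curve = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
overfit_eval_curve = [1.10, 0.80, 0.65, 0.70, 0.78, 0.90]
healthy_eval_curve = [1.10, 0.80, 0.65, 0.60, 0.55, 0.50]

print(first_overfitting_step(train_curve, overfit_eval_curve))
print(first_overfitting_step(train_curve, healthy_eval_curve))
```

This is only a heuristic on noisy curves, so it complements rather than replaces looking at the plots.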

4. 👀 Tune but don’t tune blindly

Once you have everything up and running, you might want to tune your hyperparameters to find the best configuration for your setup. I generally stick with a random grid search as it turns out to be fairly effective in practice.

Some people report successes using fancy hyperparameter tuning methods such as Bayesian optimization, but in my experience, random search over a reasonable, manually defined grid is still a tough-to-beat baseline.
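Random search over a hand-written grid fits in a few lines. In the sketch below, the grid values are arbitrary and `evaluate` is a hypothetical stand-in for "train a model with this configuration and return its development score"; in practice that call is where all the time goes.

```python
import math
import random

# Manually defined grid (illustrative values only).
SEARCH_SPACE = {
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4],
    "batch_size": [16, 32, 64],
    "dropout": [0.0, 0.1, 0.3],
}


def evaluate(config):
    """Hypothetical stand-in for training and returning a dev-set score."""
    lr_penalty = abs(math.log10(config["learning_rate"]) + 4)
    batch_penalty = 0.01 * abs(config["batch_size"] - 32)
    return -(lr_penalty + batch_penalty + config["dropout"])


def random_search(space, evaluate_fn, num_trials, seed=0):
    """Sample distinct configurations uniformly from the grid and score them."""
    rng = random.Random(seed)
    grid_size = math.prod(len(values) for values in space.values())
    num_trials = min(num_trials, grid_size)
    seen, trials = set(), []
    while len(trials) < num_trials:
        config = {name: rng.choice(values) for name, values in space.items()}
        key = tuple(config[name] for name in space)
        if key in seen:
            continue  # skip configurations that were already tried
        seen.add(key)
        trials.append((config, evaluate_fn(config)))
    best_config, best_score = max(trials, key=lambda trial: trial[1])
    return best_config, best_score, trials


best_config, best_score, trials = random_search(SEARCH_SPACE, evaluate, num_trials=10)
print(best_config, best_score)
```

Keeping the full list of trials around, and not only the best one, is what lets you see which hyperparameters actually matter.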

Most importantly, there is no point in launching 1000 runs with different hyperparameters (or architecture tweaks like activation functions): compare a couple of runs with different hyperparameters to get an idea of which hyperparameters have the highest impact, but in general, it is delusional to expect to get your biggest jumps in performance by simply tuning a few values. For instance, if your best performing model is trained with a learning rate of 4e2, there is probably something more fundamental happening inside your neural network, and you want to identify and understand this behavior so that you can re-use this knowledge outside of your current specific context.

The experience of the human setting the hyper-parameters of a deep learning method impacts the resulting performance significantly. This "black magic" has now been analyzed in a controlled study by Anand et al. in a #BMVC 2020 paper: https://t.co/MXVz0CIIqg

To conclude, a piece of general advice that has helped me become better at building neural networks is to favor (as much as possible) a deep understanding of each component of your neural network instead of blindly (not to say magically) tweaking the architecture. Keep it simple and avoid small tweaks that you can’t reasonably justify even after trying really hard. Obviously, there is a right balance to find between a “trial-and-error” approach and an “analysis” approach, but a lot of these intuitions feel more natural as you accumulate practical experience. You too are training your internal model. 🤯

A few related pointers to complete your reading:

Reproducibility (in ML) as a vehicle for engineering best practices from Joel Grus
Checklist for debugging neural networks from Cecelia Shao
How to unit test machine learning code from Chase Roberts
A recipe for Training Neural Networks from Andrej Karpathy


🚧 Simple considerations for simple people building fancy neural networks was originally published in HuggingFace on Medium, where people are continuing the conversation by highlighting and responding to this story.

Sparse Neural Networks (2/N): GPU Performance.

François Lagunas — Thu, 28 May 2020 21:23:50 GMT

Sparse Neural Networks (2/N): Understanding GPU Performance.

NVIDIA Ampere A100 introduces fine-grained structured sparsity

Welcome back for this series on Sparse Neural Networks. In case you have not read our first introductory episode, here it is.

I told you last time that sparsity would be a major topic in 2020, and it looks like it is indeed gaining some steam: Nvidia announced that, with the Ampere GPU generation, sparsity is directly baked into their GPU design.

It’s quite a bold move: if you consider the time it takes to design and produce a new GPU line, they made this decision at least 2 years ago, and you need some vision to understand that it would be an important trend 2 years later.

André Ampère, 1825 (from Wikipedia)

So that’s the perfect pretext for a large digression on GPU architectures and why knowing more about them may matter for your daily Machine Learning jobs.

To be honest, this will matter more to you if you are working on low-level code.

If you are using PyTorch or other libraries, and you are just using the extremely good tools they provide, you are probably fine.

But leaky abstractions come back at you faster than you’d think. Your model got a bit heavier? Want to train faster? OK, let’s use a DataParallel PyTorch node, and we’ll be fine on 8 GPUs. But wait, why is my GPU usage down in the gutter? And why is it only 3 times as fast on 8 GPUs as on a single one?

It especially matters to me, as I told you last time that the performance of sparse matrix operations was not satisfactory. Today we’ll see why it can be hard to get good performance on GPUs, how it depends on your data structures and algorithms, and how you can overcome, or at least mitigate, some of the issues.

And of course, all this is a good pretext to read about some mind-blowing GFlops numbers and killer optimizations, nothing to sneeze at…

Some physics

You may wonder why your PC/Mac is not significantly faster than a few years ago. That’s because most of the apps you are using are mostly sequential: they are doing only one thing at a time, or almost, and sequential performance has been stagnating for some years.

That’s because sequential performance is mostly limited by operating frequency, which is itself limited by:

the size of the finest details that are drawn on the silicon, something that is getting harder and harder to improve,
the amount of heat that is created by the chips, a function of voltage and frequency. First, a transistor emits heat when changing state, so proportionally to frequency. Second, the higher the frequency, the higher the voltage you need. So in the end emitted heat is more than linear in the frequency, not something ideal.
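To put a rough number on "more than linear", here is a back-of-the-envelope sketch based on the usual first-order approximation that dynamic power scales with voltage squared times frequency. The assumption that voltage has to rise in proportion to frequency is a simplification of mine (real chips follow more complicated curves), and it is exposed as the `voltage_exponent` parameter.

```python
def relative_dynamic_power(freq_ratio, voltage_exponent=1.0):
    """Dynamic power relative to a baseline when frequency is scaled by freq_ratio.

    Assumes P is proportional to V^2 * f, with V proportional to
    f ** voltage_exponent (a simplifying assumption, not a measured curve).
    """
    voltage_ratio = freq_ratio ** voltage_exponent
    return voltage_ratio ** 2 * freq_ratio


for ratio in (1.0, 1.2, 1.5, 2.0):
    print(f"{ratio:.1f}x frequency -> {relative_dynamic_power(ratio):.2f}x power")
```

Under these assumptions, a 20% higher frequency already costs about 73% more power, which is why the "heat wall" arrives so quickly.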

From https://youtu.be/Knd-U-avG0c

So if you could efficiently and cheaply remove heat from the chips, you could get higher frequencies, but only marginally, and it gets quickly impractical (water-cooling, you know, is cool, but not when it leaks…).

The recent ARM takeover is not an accident. When you work for years on low consumption and so low heat producing chips, when everybody hits the “heat wall”, you are in a good position to push performance higher, even if computers migrating to your pocket was the opportunity that made the difference.

Chip design

So people invented tricks to make use of the same amount of cycles to do more, to do almost any instruction in one single cycle, to forecast what’s the next instruction etc. Very different architectures to tackle the same issues were used (RISC, CISC). But the returns are diminishing, as always.

So what can you do to feed the hungry “Moore’s Law Beast”, and the marketing guys who keep asking why the numbers are flattening?

You look for problems that need to do the same kind of task a billion times, and each task does not need the result of another task, so all tasks can be computed at the same time. (the technical slang for this is “Embarrassingly Parallel” …).

Fortunately, there are a lot of them. Linear Algebra, for example, is highly parallel by nature, and machine learning is using it a lot, like lots of physics simulation, computer graphics, and so on.

So instead of increasingly complex single cores processors, we see much simpler (and smaller on silicon) cores but grouped by the hundreds or thousands. This way you are guaranteed that the ratio computation/silicon area is getting through the roof.

Great. That’s a simple idea. But of course, reality is more complex than that.

Bottlenecks

If you have a lot of computing power available, you have to feed it with data. Memory is getting faster with time, but it’s harder than just duplicating cores. Because memory buses are basically 1D, and compute cores are 2D.

From https://unsplash.com/photos/VEVfbQtyB8s

You can think about it as a city (the computing cores), and the suburban workers coming each morning in the city (the data). The city is 2D, the highways are 1D, and of course, you get some heavy traffic jams. So you add some new lanes on the highways (the width of the memory bus), but it’s always the bottleneck.

If you want to maximize the highway utility, you would have to use it all day long, encouraging people to come to and leave from work earlier or later.

That’s the same thing for the memory bus: you have to make sure that you are balancing computation and memory transfers so you don’t waste time waiting without using the memory bus or the compute cores. That’s why it’s hard to reach peak performance for every task.

Some tasks even prefer to compute twice the same thing instead of transferring some data: compute is plentiful and memory bandwidth is scarce (and the gap is growing each year). In graphics, procedural texturing is used more and more for this exact reason: textures need bandwidth, and so if you can generate the same result with few memory transfers but some additional compute, it’s a win.
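The balance between compute and memory transfers can be sketched with a simple "roofline" style estimate: the attainable throughput is capped either by the peak compute or by the bandwidth multiplied by the number of operations you perform per byte moved. The device numbers below are hypothetical round figures, not the specs of any real GPU, and the matrix product estimate assumes ideal caching.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Upper bound on throughput given compute and memory limits."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


# Hypothetical device: 10 TFLOP/s of compute, 1 TB/s of memory bandwidth.
PEAK_FLOPS = 10e12
BANDWIDTH = 1e12

# Elementwise FP32 add: 1 operation for 12 bytes moved (2 reads + 1 write).
add_intensity = 1 / 12

# FP32 matrix product of two n x n matrices: 2*n^3 operations for
# 3*n^2 floats moved, assuming each matrix is read or written only once.
n = 1024
matmul_intensity = (2 * n**3) / (3 * n**2 * 4)

add_bound = attainable_flops(PEAK_FLOPS, BANDWIDTH, add_intensity)
matmul_bound = attainable_flops(PEAK_FLOPS, BANDWIDTH, matmul_intensity)

print(f"elementwise add: {add_bound / PEAK_FLOPS:.1%} of peak")
print(f"matrix product:  {matmul_bound / PEAK_FLOPS:.1%} of peak")
```

On this imaginary device the elementwise add cannot use even 1% of the cores, whatever the kernel quality, while the matrix product has enough work per byte to be limited by compute instead.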

GPU Architecture principles

A lot of the complexities of GPU architectures exist to overcome those bottlenecks.

Hierarchy

You don’t get the 1000s of cores in a GPU in a single bag: they are grouped at multiple levels. We’ll take the example of the new Ampere A100. Numbers change according to the generation, but the general principles are slowly evolving. (Numbers below come mostly from the Nvidia blog)

The GA100 streaming multiprocessor (SM)

At the lowest level you have a Streaming Processor (SP). It is part of a group of 16 SPs which execute the same sequence of instructions at the same time.

(To be more precise, you have 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, 1 Tensor Core, and 1 texture unit per group. More on tensor cores later)

The first constraint is the following: the 16 SPs in the group cannot diverge from a single instruction sequence. This is called SIMD: Single Instruction, Multiple Data. That’s not exactly true: the instructions can contain “if” statements, but if different branches are taken, some compute will be lost because every processor will have to execute both branches and throw away the results that are not useful for its own work.
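The cost of divergence can be simulated in a few lines. The sketch below is a toy model of lockstep execution written for this post, not how a real GPU scheduler is implemented: the group walks through each branch that at least one lane needs, and lanes that did not take the branch are masked out but still occupy their slot.

```python
def run_lockstep(values, condition, then_op, else_op):
    """Simulate a group of lanes executing an if/else in lockstep.

    Returns the per-lane results and the number of lane slots consumed.
    """
    mask = [condition(v) for v in values]
    results = [None] * len(values)
    slots_used = 0
    if any(mask):
        # The whole group steps through the "then" branch, masked lanes idle.
        slots_used += len(values)
        for i, v in enumerate(values):
            if mask[i]:
                results[i] = then_op(v)
    if not all(mask):
        # The whole group steps through the "else" branch as well.
        slots_used += len(values)
        for i, v in enumerate(values):
            if not mask[i]:
                results[i] = else_op(v)
    return results, slots_used


values = list(range(16))
double, add_hundred = (lambda v: v * 2), (lambda v: v + 100)

diverged, diverged_slots = run_lockstep(values, lambda v: v % 2 == 0, double, add_hundred)
uniform, uniform_slots = run_lockstep(values, lambda v: v >= 0, double, add_hundred)

print("divergent branch utilization:", len(values) / diverged_slots)
print("uniform branch utilization:  ", len(values) / uniform_slots)
```

When all lanes agree on the branch, nothing is wasted; when they split, half of the slots in this example do no useful work.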

4 groups of 16 SPs form a Streaming Multiprocessor (SM). Each group executes the same kernel (=function), but not in a strictly synchronized way. Still, you’ll have at least 64 cores working on the same task, or you’ll lose some computing capacity.

Then, you group 2 SMs to form a “Texture Processing Cluster” (TPC), and you group 8 TPCs to form a GPC (GPU Processing Cluster). 8 GPCs and you have an A100 GPU. Phew!

To sum it up, there are 128 SMs in an A100, so 8192 FP32 cores, but as you can see, we are far from getting a flat set of “8192” cores! (those are maximum numbers, first processors won’t have the full set of cores).
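The arithmetic behind these totals, using only the group sizes given above (the constant names are mine):

```python
# Hierarchy of the full A100 design as described in the text.
FP32_CORES_PER_GROUP = 16
GROUPS_PER_SM = 4
SMS_PER_TPC = 2
TPCS_PER_GPC = 8
GPCS_PER_GPU = 8

cores_per_sm = FP32_CORES_PER_GROUP * GROUPS_PER_SM
sms_per_gpu = SMS_PER_TPC * TPCS_PER_GPC * GPCS_PER_GPU
fp32_cores_per_gpu = cores_per_sm * sms_per_gpu

print(f"{cores_per_sm} FP32 cores per SM")
print(f"{sms_per_gpu} SMs per GPU")
print(f"{fp32_cores_per_gpu} FP32 cores per GPU")
```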

If you compare the A100 structure with the Volta V100, these structural numbers are almost the same, except for the processing clusters (TPCs and GPCs), and so for the grand total of course. The innards of the cores of course have changed too, but it looks like the communication structure of the V100 was considered quite good for the kind of job it’s usually given. The Tensor Cores seem to be the area where the most innovation is taking place (more on this later).

You can see in the comparison below that all those numbers varied significantly with time, in search of the best performance :

Why so many levels? Performance

The main reason is of course to improve real-life performance. And in real life, you don’t have a single task to be done.

First, there may be several processes using your GPU at the same time on your machine. Not sure it’s a good idea if you want good performance, but it’s of course something very usual.

In the new Ampere GPU, you can even partition your GPU to serve multiple Virtual Machines with strong guarantees on your data security: the new feature is called “Multi-Instance GPU”.

In a single process, if your network contains several layers, some linear, some non-linear, some embedding, each one will use one or several kernels to do its job.

You may think that they are executed one after the other. It’s true to some extent, but in order to keep your GPU busy, your CPU is sending a stream of tasks to be done, not a task after the other, and the GPU will do them without the CPU waiting for each one to complete.

The CPU will basically wait after a full batch has been processed, after the forward and backward pass, because it has to update the full model before starting a new batch.

There are several reasons to have this “stream of task” model:

The first reason is that starting a task takes some time, so the GPU can prepare the next task before the previous one has finished: changing the active kernel on some part of the GPU takes some time, and pipelining saves it.
Second, in the task stream, some tasks are not dependent on each other, so both can be executed in parallel in the GPU, so more work to be done, so less chance some part of the GPU is idling.

Some networks are very, very parallel to compute, like Transformers, and so their efficiency is very good:

there are only a few different layers, so few kernel changes and a lot of work for each kernel
there are only loose dependencies between computations (eg for each token), so the GPU has a lot of degrees of freedom when scheduling the different parts of the computation: if a kernel is waiting for some data, maybe another one can compute its result because it already has its own data available.

Why so many levels? Economics

Another reason is that it’s hard to get zero-defect silicon at this level of detail.

Ampere GPUs contain 54 billion transistors. Any defective transistor, and you may have to throw the GPU to the bin. The fraction of chips that pass the test is called the yield. Those chips are huge, and silicon real estate costs a lot, so each failed chip is a big loss, just for a small defect on a single transistor somewhere in the silicon.

So instead of throwing the chip into the bin, you test some sub-parts of the chip, and you just disable the failing sub-parts. That means, for example, disabling a GPC (remember, there are 7 of them in an A100, instead of a theoretical 8). And you sell it in a lower-end card, with reduced specs. This process is called binning. If you are really good, and your chips are all perfect, you may even disable perfectly working parts of your chip, to segment your offer (and back in time, some users were able to re-enable those disabled parts of silicon to get the bang without the buck…)
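A toy calculation shows why binning pays off. It assumes each of the 8 GPCs is defect-free independently with the same probability; the 90% figure is invented for the example and has nothing to do with real foundry numbers.

```python
from math import comb


def prob_at_least_good(units, min_good, p_unit_good):
    """Probability that at least `min_good` of `units` independent units work."""
    return sum(
        comb(units, k) * p_unit_good**k * (1 - p_unit_good) ** (units - k)
        for k in range(min_good, units + 1)
    )


P_GPC_GOOD = 0.9  # hypothetical probability that a single GPC has no defect

perfect_chip_yield = prob_at_least_good(8, 8, P_GPC_GOOD)
binned_chip_yield = prob_at_least_good(8, 7, P_GPC_GOOD)

print(f"chips with 8 working GPCs:          {perfect_chip_yield:.1%}")
print(f"chips with at least 7 working GPCs: {binned_chip_yield:.1%}")
```

Under these made-up numbers, accepting one disabled GPC nearly doubles the fraction of sellable chips.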

Developing for GPUs

So what are the consequences of the GPU architecture choices on development?

Kernels

First, you have to write some kernels, using the primitives you get. It’s a quite specific exercise, as you have to manually manage caches, registers, the synchronization of the different cores, etc. For simple stuff like matrix products, or activation layers, it’s quite straightforward, as they are completely parallel by nature.

But for some algorithms, like sorting, it can be a lot trickier to have something efficient, because you will have some issues using all the cores all the time.

Grids and performance

That’s because the kernel is only a small part of the problem, the other is the way you distribute the work among cores. And the performance gains are often made more on the distribution than on an optimal kernel.

The way you distribute the work is usually done by partitioning your job into a 2D or 3D grid, then mapping each point of the grid to a thread, and finally mapping those threads to physical cores. Those dimensions will correspond for example to the dimensions of the output of a layer, plus the batch dimension.
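The mapping can be mimicked in plain Python to see what a 2D grid looks like. This is a sketch: the 16x16 block size is an arbitrary choice and the function names are mine. The output matrix is cut into blocks, each block into threads, and threads falling outside the matrix are filtered by a bounds check, as a kernel would do.

```python
def ceil_div(a, b):
    return -(-a // b)


def launch_grid(rows, cols, block_rows=16, block_cols=16):
    """Number of blocks needed along each dimension to cover the output."""
    return ceil_div(rows, block_rows), ceil_div(cols, block_cols)


def covered_cells(rows, cols, block_rows=16, block_cols=16):
    """Enumerate the output cells computed by the threads of the grid."""
    grid_rows, grid_cols = launch_grid(rows, cols, block_rows, block_cols)
    cells = []
    for block_r in range(grid_rows):
        for block_c in range(grid_cols):
            for thread_r in range(block_rows):
                for thread_c in range(block_cols):
                    r = block_r * block_rows + thread_r
                    c = block_c * block_cols + thread_c
                    if r < rows and c < cols:  # threads outside the output do nothing
                        cells.append((r, c))
    return cells


rows, cols = 100, 50
grid = launch_grid(rows, cols)
cells = covered_cells(rows, cols)
launched_threads = grid[0] * grid[1] * 16 * 16

print("grid:", grid)
print("useful threads:", len(cells), "of", launched_threads)
```

With a 100x50 output, about 30% of the launched threads are idle, which hints at why the choice of block shape is part of the tuning.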

As you have seen, in a GPU you get thousands of cores to work with, but with a really complex multi-layered structure. And this structure changes according to the generation and model of the GPU. So it’s hard to find the right way to choose those mappings. You often have to run some benchmarks to find the right way to do a computation with given dimensions on a specific GPU, and that information will be used in the future to choose the best strategy at runtime.

Memory

But the main and most difficult hurdle a developer faces when developing for GPU architectures is managing memory, and specifically memory transfers. The available memory bandwidth is huge, but the computing power is even larger. And just as you did not get a flat space of computing cores, you don’t get completely random access to the memory for free.

If you want to access a float number stored in the main memory from a GPU core, you will wait literally for ages compared to the time it takes to compute a sum or a multiply. So you need to be able to start hundreds of computations at once, and when the data is finally available, you resume your kernel, you execute a few local operations, until you need some more data from the main memory.

Some special ops like “prefetch” exist, to declare that you will need some data in a few instructions, and the role of the compiler is to reorder the instructions so you keep the memory controllers busy while keeping the core busy too. And at runtime, a large part of the GPU silicon is devoted to handling all those threads that are “in flight” and their current memory requests.

But there are some low-level constraints that may cost you a lot. Just like the base computation unit is 16 cores doing the same job, you really get peak memory performance if you load memory by quite large contiguous blocks, for example, 16 floats = 64 bytes, by a group of threads (called warp in CUDA lingo). This is called coalesced access. This is another reason, and often the main one, why choosing the right grid to dispatch your task on is important.
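To see the effect of the access pattern, here is a toy model that counts how many 64-byte segments a group of 16 threads touches when each thread reads one float. The segment size simply follows the "16 floats = 64 bytes" example above; the transaction sizes of real hardware depend on the GPU generation.

```python
def segments_touched(num_threads, stride_elems, elem_bytes=4, segment_bytes=64):
    """Count distinct memory segments hit when thread t reads element t * stride."""
    addresses = [t * stride_elems * elem_bytes for t in range(num_threads)]
    return len({address // segment_bytes for address in addresses})


for stride in (1, 2, 16):
    print(f"stride {stride:2d}: {segments_touched(16, stride)} segment(s) for 16 floats")
```

Contiguous reads (stride 1) are served by a single segment, while a stride of 16 floats needs 16 segments to deliver the same 16 values: most of the transferred bytes are thrown away.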

So now, let’s unroll back to our initial issue, if you still remember it (I would forgive you, I can barely): why are sparse matrix ops slow?

If you look at the memory access pattern you need to perform a sparse matrix/matrix multiplication, you’ll see that by definition it’s hard to get those blocks of 16 floats when reading the matrix weights. And reading 16 contiguous floats is just a minimum: you’ll need to read more data at once to reach full performance.

That explains why a naive implementation can be at least an order of magnitude slower than the dense version.

Unless you make some compromise and use a block sparse matrix: each block, if large enough, will produce large contiguous accesses. 8x8 blocks are a minimum in the OpenAI implementation, but you will get even better performance with 32x32 blocks.

But of course, you have to make sure that your model is working in a similar fashion with block sparse compared to pure sparse matrices. It can be the case if your matrices are large enough so block size is small in comparison, but you have to check.
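The block sparse idea can be sketched in pure Python as follows. It only shows the data layout: a real implementation runs as a GPU kernel, uses much larger blocks than the 2x2 used here for readability, and the helper names are mine. Only the non-zero blocks are stored, and inside each block the weights are read as contiguous runs.

```python
def to_block_sparse(dense, block):
    """Keep only the blocks of `dense` that contain at least one non-zero value."""
    blocks = {}
    for row_start in range(0, len(dense), block):
        for col_start in range(0, len(dense[0]), block):
            tile = [row[col_start:col_start + block] for row in dense[row_start:row_start + block]]
            if any(value != 0 for tile_row in tile for value in tile_row):
                blocks[(row_start // block, col_start // block)] = tile
    return blocks


def block_sparse_matvec(blocks, block, num_rows, vector):
    """Multiply the block sparse matrix by a dense vector."""
    out = [0.0] * num_rows
    for (block_row, block_col), tile in blocks.items():
        segment = vector[block_col * block:(block_col + 1) * block]
        for r, tile_row in enumerate(tile):
            # Each tile_row is a contiguous run of `block` weights.
            out[block_row * block + r] += sum(w * x for w, x in zip(tile_row, segment))
    return out


dense = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 6],
]
vector = [1, 2, 3, 4]

blocks = to_block_sparse(dense, block=2)
result = block_sparse_matvec(blocks, block=2, num_rows=4, vector=vector)
reference = [sum(w * x for w, x in zip(row, vector)) for row in dense]

print("stored blocks:", sorted(blocks))
print("result:", result)
```

Note that the last block keeps a row of zeros: this is the compromise of the block format, trading a few useless multiplications for contiguous memory accesses.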

The other way is to convince an executive at Nvidia to add some hardware sparse support into their next-gen GPU, and now it’s done. More on this below!

Inter-GPU memory transfer

Memory bottlenecks exist within the GPU, but if you work with multiple GPUs sharing a single model, the available bandwidth is way lower than between memory and cores.

The DataParallel node of PyTorch is convenient, but it is no magic: after each batch, the GPUs must send their gradients to a single GPU, and then this latter must broadcast the updated model to each GPU. If your model is big enough, this transfer can take very significant time, and the performance will suffer. Another point is that the transfers are synchronous, no GPU can work if the new model has not been received.
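A crude model is enough to see where the speedup goes. It assumes the compute part of a step is split perfectly across the GPUs and that the synchronization cost is fixed and not overlapped with compute; the timings are invented for the example.

```python
def data_parallel_speedup(num_gpus, compute_time, sync_time):
    """Speedup over one GPU when every step pays a fixed synchronization cost."""
    step_time = compute_time / num_gpus + sync_time
    return compute_time / step_time


# Hypothetical step: 1.0 s of compute on a single GPU, 0.2 s spent gathering
# gradients and broadcasting the updated model when several GPUs are used.
for gpus in (2, 4, 8):
    print(f"{gpus} GPUs: {data_parallel_speedup(gpus, 1.0, 0.2):.2f}x")
```

With these numbers, 8 GPUs give roughly a 3x speedup, and adding more GPUs barely helps because the synchronization time does not shrink.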

Another way to use multiple GPUs is to split a single model between the different GPUs, and then transfer only the “frontier” layers from a GPU to the next. Same thing for backpropagation. This may not be ideal either as the first layer will have to wait for the last to complete before the backpropagation can occur. The performance will depend heavily on the morphology of the network.

Ampere Highlights

Let’s finish where we started, with the latest Nvidia announcement.

Tensor Cores

With Volta, Nvidia introduced new “Tensor Core units”, and it looks like they are here to stay. Turing and now Ampere iterated on these new units.

You can see them as ultra-specialized units, with some significant dedicated silicon.

And this means a lot in terms of speed, especially for quantized network inference:

From https://youtu.be/yyR0ZoCeBO8?t=19

For training, it was a bit more difficult on Volta, as working with FP16 was possible but a bit tricky (the 8x gain in speed was indeed tempting).

But now with Ampere, Nvidia announces support for FP32 and even FP64 for Tensor Cores. And it looks like FP32 is now 20 times faster than on Volta with sparsity, and 10 times without sparsity. And this is for training and inference because it’s just big tensor ops, nothing special here.

It looks like we’ll be getting some nice toys to play with.

Sparsity

From the Nvidia Blog:

NVIDIA has developed a simple and universal recipe for sparsifying deep neural networks for inference using this 2:4 structured sparsity pattern.

If you have read the first part of this series, you should feel at home.

The idea is simple: maybe using a fully dense matrix is not useful. And what Nvidia is claiming is that it’s true, keeping only half the weights has a minimal impact on precision.

And so they propose a method to reduce the number of weights. But what is more interesting is that the A100 GPU has new instructions to efficiently process these sparse matrices, at twice the speed of dense ones (no magic here, only half the multiplications occur, of course).

So anyone can try their own method to sparsify the matrices and use the new instructions to speed things up. The only constraint is that the sparsity pattern is fixed: in every group of 4 consecutive cells, at most 2 can be non-zero.
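As an illustration of the 2:4 pattern, the sketch below keeps the two largest-magnitude weights in each group of four and zeroes the others. Magnitude-based selection is just one simple choice made for this example, not necessarily the recipe Nvidia recommends, and the function name is mine.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("the number of weights must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest absolute values in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(value if i in keep else 0.0 for i, value in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.4, 0.05, 2.0, 0.0, -0.3, 0.2]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Whatever the selection rule, the result respects the hardware constraint: every group of four contains at most two non-zero values, so exactly half of the multiplications can be skipped.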

You can compare this to the way textures are compressed to save memory but for floating computation and not just graphics.

I see it mostly for inference at first, but I am sure some clever people will come up with imaginative ways to use those new capabilities for training too, as it’s just some new compute ops.

What about “sparse block sparse matrices”, by combining soon to be released OpenAI “block sparse matrices” with this? We’ll see.

Conclusion

I hope you enjoyed this second part of our trip to sparse land, even if it may have been a bit harder to digest.

I also hope this will help you better understand the level of mastery that developers in the PyTorch or Keras teams show: they manage to hide all this complexity and make it easy for mere mortals to use these supercomputers-on-a-chip to their full power, in just a few lines of Python.

Next time we will get back to more usual depths: we’ll see some techniques we can use to train sparse networks, and how performance is impacted.

By the way, congrats to Victor Sanh, Thomas Wolf, and Alexander M. Rush for their latest paper “Movement Pruning: Adaptive Sparsity by Fine-Tuning”!

Sparse Neural Networks (2/N): GPU Performance. was originally published in HuggingFace on Medium, where people are continuing the conversation by highlighting and responding to this story.

A brief history of machine translation paradigms

Teven Le Scao — Thu, 14 May 2020 12:41:14 GMT

As a young European, having access to translation at any time just by pulling up my phone is an unprecedented luxury. There’s convenience in knowing I can order a kebab in a hundred languages, cu de tuate und scharf, bez příliš mnoho cibule mais avec frites, and there’s beauty in being able to do so even in countries where older generations remember being in conflict with mine. This made machine translation my first love of sorts among machine learning applications: it is why I originally went from applied mathematics to NLP. This week, Hugging Face is proud to release 1000+ translation models from University of Helsinki thanks to the hard work of Helsinki’s Jörg Tiedemann and our own Sam Shleifer. To accompany the release, here’s a short history of machine translation efforts over the last century. It’s written with a public familiar with modern NLP in mind, and I tried to draw connections with other fields throughout.

1. Genesis (1933–1945)

The first automated translation systems were independently created in 1933, by Georges Artsrouni in France¹ and Petr Troyanskii in the USSR². Unfortunately, neither really took hold in engineering or research circles, for different reasons. Artsrouni’s system, which was a mechanically automated retrieval system that could function as a dictionary, generated a lot of interest in the French administration but could not come to fruition before the start of the Second World War. Troyanskii’s system, which also started as an automated dictionary but grew to incorporate a memory as well as electronic components (those were at the time still mechanical computers!), was ignored by the Soviet scientific establishment.

Marian Rejewski’s statue with an Enigma machine in his hometown of Bydgoszcz³

Elsewhere in Europe, events that would prove (only slightly) more impactful were unfolding at the same time. From 1932 to 1933, the Polish Cipher Bureau — most notably Marian Rejewski, who the Marian NMT system is named after — broke the code of early German Enigma machines. During the Second World War itself, cryptography became a key topic and mobilized significant intellectual and financial resources. After the war ended, with the cold war rising, machine translation became a topic of interest to both superpowers’ intelligence communities. A key problem, for example, was automatically translating scientific articles from the other side, as scientific output out-scaled the number of competent translators.

In this context, the 1949 Weaver memorandum on translation⁴ was a landmark in the US, arguing that automated translation was becoming possible thanks to the newly created computer. It proposed several approaches, like storing the rules of language in the machine or learning statistical similarities between sentences, with even a mention of early work on artificial neural networks. Quite striking is the direct filiation it establishes between machine translation and wartime cryptography efforts: it opens with a war anecdote and frames the task of translating Russian as if it were breaking a code.

When I look at an article in Russian, I say ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’

Nothing like a good bout of great-power rivalry for research funding!

2. Rule-based MT (1949–1984)

Early rule-based MT (1949–1967)

After the publication of the Weaver memorandum, research in machine translation began in earnest in the United States, mostly focused on translating Russian scientific articles into English. Translation systems of the time can generally be placed on a scale between empirical and linguistically grounded approaches. For example, on the empirical end of the scale, research at the RAND Corporation proceeded in cycles of translation and editing. First, start from a few basic rules and observe the result on a predetermined corpus of Russian texts. Then, revise the glossary and grammatical rules of the system, and repeat the cycle, in a sort of expectation-maximization algorithm performed by humans (perhaps an ancestor of graduate student descent?). On the other hand, academic research, especially at MIT, focused on finding intermediate representations between the source and target sentences. Sufficiently expressive representations, it was hoped, could allow for general-purpose translation. Another goal was building an interlingua, i.e. a representation of semantic meaning independent of language; Noam Chomsky was introducing universal grammar at the same time⁵.

A hybrid system to translate Russian technical documents was demonstrated in 1954 at Georgetown University. Deemed very impressive at the time, it spurred investment in the United States and seeded interest elsewhere, mostly in the Soviet Union and in Europe, where research concentrated on the theoretical approach. Systems at the time relied on the work of extensive teams of linguists: they translated human instructions into code rather than learning word correspondences on their own.

Knowledge-based MT (1967–1984)

The 1966 ALPAC report⁶ is generally held to have ended that first phase of machine translation hype, after it made the case that American research funding should be directed to machine-aided human translation rather than fully automated machine translation. After its publication, research funding dried up in the United States, leaving machine translation research to Canada, Europe, and the Soviet Union.

We have already noted that, while we have machine-aided translation of general scientific text, we do not have useful machine translation. Further, there is no immediate or predictable prospect of useful machine translation.

Classic reviewer #2.

The Vauquois Pyramid

One influential concept to understand the evolution of rule-based MT during that time is the Vauquois pyramid, reproduced here. First, the system attempts to understand the source text (analysis) and to represent this understanding. Then, it produces text in the target language (generation) from this representation. This was christened knowledge-based MT: the goal was to have ever more complete and general representations, moving up the pyramid, as opposed to earlier rule-based systems’ direct translation or only syntactically-informed translation. However, transfer machine translation, operating at a lower level, remained more effective and powered the systems of the time. Those were mostly domain-limited technical use cases like Canada’s Météo system.

3. Data-driven MT (1984-present)

Example-based MT (1984–1993)

Looks like you’ve missed your daily French lesson today!

By the 1980s, computers had gotten a lot more powerful, especially in storage capacity. This allowed for larger databases of text, which had yet to be systematically exploited. One early idea to do so was example-based MT, which was first proposed in 1984 in Japan⁷. Example-based systems built on the observation that beginner-level foreign language speakers rely on sentences they already know to produce new ones: an example between French and English is shown in the figure⁸. Similarly, these systems relied on databases of known examples to produce new translations, querying the closest one. Although those ideas would eventually be subsumed in the broader framework of statistical MT, they were the first example of data-driven translation.

Statistical MT (1993–2013)

Underneath all this, a revolution was brewing, as a few outside developments came to fruition in the 90s. Statistical speech recognition started showing strong results on the back of advances in automata theory and hidden Markov models; computers became more powerful and accessible still; and high-quality, abundant datasets appeared, such as the Hansards accounts of the Canadian parliament. In 1988, IBM researchers had published the outline of modern statistical translation⁹, which proved controversial to say the least. As a famous anonymous review of the time states:

The validity of a statistical (information theoretic) approach to MT has indeed been recognized, as the authors mention, by Weaver as early as 1949. And was universally recognized as mistaken by 1950 (cf. Hutchins, MT — Past, Present, Future, Ellis Horwood, 1986, p. 30ff and references therein). The crude force of computers is not science. The paper is simply beyond the scope of COLING.

Reviewer #2 strikes again!

Nevertheless, the statistical approach quickly proved fruitful, as IBM’s models 1–5 became references in machine translation. Those were powered by the expectation maximization algorithm to learn both alignments between languages — which and how many words in the source and target correspond to each other — and a dictionary to translate after computing alignments. In a sense, they were direct descendants of the early RAND empirical approach: instead of being fed instructions by teams of linguists, the computer could learn all of the relationships from data on its own. By the 2010s, statistical methods had asserted their hegemony, as they powered virtually all of the internet-based translation services that comprise the bulk of translation use.
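To make the expectation-maximization idea concrete, here is a minimal sketch in the spirit of IBM Model 1, the simplest of the five models. It is an illustration under simplifying assumptions (no NULL word, no alignment or fertility model, a three-sentence toy corpus of my own), not a reproduction of IBM's system; the function name `train_ibm_model1` is hypothetical.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Learn t[(target_word, source_word)] = P(target_word | source_word) with EM."""
    # Uniform start: any constant works because the E-step renormalizes.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) alignment counts
        total = defaultdict(float)  # expected number of alignments per source word
        for source, target in pairs:
            for w in target:
                # E-step: share one unit of alignment mass among the source words.
                norm = sum(t[(w, s)] for s in source)
                for s in source:
                    share = t[(w, s)] / norm
                    count[(w, s)] += share
                    total[s] += share
        # M-step: turn expected counts back into a translation dictionary.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return t

# Toy parallel corpus (English source, German target), chosen for illustration.
pairs = [("the house".split(), "das Haus".split()),
         ("the book".split(), "das Buch".split()),
         ("a book".split(), "ein Buch".split())]
table = train_ibm_model1(pairs)
print(max(["das", "Haus", "Buch"], key=lambda w: table[(w, "the")]))
```

No linguist tells the program that "the" goes with "das": the co-occurrence statistics alone pull the probabilities apart over the iterations.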

Neural MT (2013-present)

In his 1949 memorandum, Warren Weaver briefly touches upon early neural network research as a promising avenue for machine translation. Sixty years later, neural networks had made significant progress in other tasks, but had yet to be convincingly applied to translation. The first functional neural language models appeared in 2011, powered by recurrent neural networks¹⁰. Translation could then be reformulated as a conditional language modeling task: instead of predicting the most likely next word, predicting the most likely next word conditioned on the source text. The first modern neural machine translation paper appeared within a few years, in 2013. It consisted of an encoder model that produced a representation of the input with a convolutional neural network and of a decoder model that generated text from that representation with a vanilla recurrent neural network (RNN)¹¹.

At the time, neural MT was still underperforming compared to statistical MT, and it required two main developments from 2014 to eventually come out on top. First, vanilla RNNs were replaced with long short-term memory RNNs (LSTMs)¹². Then, learnable attention mechanisms were re-purposed from their computer vision roots and added to LSTMs¹³. By 2016, Google Translate had switched to neural MT. Transformer-based models¹⁴, which do away with the recurrent network part and only use iterated attention modules, have become the norm in recent years as they scale better than LSTMs with compute time and available data. The Helsinki models we’re releasing today all rely on this architecture. The power of transformers was quickly noticed outside of machine translation and, combined with pre-training, they now form the backbone of most modern NLP applications.

Conclusion

MT performance can be divided into three goals: human-level translation, general-purpose translation, and translation without human input. Older knowledge-based systems could do human-level translation without human input but only on very narrowly defined data. Human-in-the-loop systems had faster human-level translation but required manual intervention. Finally, statistical translation could handle any text without a human being, but not always at the level of human translators. If you’re lucky enough to have to translate between data-rich, similar languages, neural MT offers the best of all worlds. However, if the language pair you’re interested in is not data-rich, there is still quite a bit of work to do before we get there. A project that gives a sense of the size of the task at hand is DARPA’s Lorelei program, which simulates a crisis in a region of the world whose language is underserved, and asks researchers to build a translation system in two weeks. Even for languages spoken by tens of millions of people, throwing teams of highly trained linguists at the problem is sometimes still the way to go!

References

[1] “La machine à traduire française aura bientôt trente ans”, Automatisme 5(3): 87–91, M. Corbé, 1960

[2] Machine translation: past, present, future. J. Hutchins, 1986

[3] Marian Rejewski statue photo from Peter Reed

[4] Reproduced in: Locke, W.N.; Booth, D.A., eds. (1955). “Translation” (PDF). Machine Translation of Languages. Cambridge, Massachusetts: MIT Press. pp. 15–23. ISBN 0–8371–8434–7.

[5] Aspects of the Theory of Syntax, Noam Chomsky, 1965

[6] LANGUAGE AND MACHINES: COMPUTERS IN TRANSLATION AND LINGUISTICS, ALPAC 1966

[7] A framework for a mechanical translation between Japanese and English by analogy principle, Nagao 1984

[8] EBMT figure from Purest ever example-based machine translation: detailed presentation and assessment, Y. Lepage, E. Denoual, Machine Translation, Springer Verlag, 2007, pp.251–282. hal-00260994

[9] A Statistical Approach to Language Translation, P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, P. Roosin, COLING, 1988.

[10] RNNLM — Recurrent Neural Network Language Modeling Toolkit, T. Mikolov, S. Kombrink, A. Deoras, L. Burget, J. Černocký, 2011

[11] Recurrent Continuous Translation Models, N. Kalchbrenner, P. Blunsom, 2013

[12] Sequence to Sequence Learning with Neural Networks, I. Sutskever, O. Vinyals, Q. Le, 2014

[13] Neural Machine Translation by Jointly Learning to Align and Translate, D. Bahdanau, K. Cho, Y. Bengio, 2014

[14] Attention Is All You Need, A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, 2017

A brief history of machine translation paradigms was originally published in HuggingFace on Medium, where people are continuing the conversation by highlighting and responding to this story.

Is the future of Neural Networks Sparse? An Introduction (1/N)

François Lagunas — Tue, 04 Feb 2020 17:08:58 GMT

From principles to real-world library support.

TLDR: Yes

Hi, I am François Lagunas.

I do machine learning research, and for the last few months I have been working on using sparse matrices, especially in Transformers. The recent announcement that OpenAI is porting its block sparse toolbox to PyTorch is really big news:

“We are in the process of writing PyTorch bindings for our highly-optimized blocksparse kernels, and will open-source those bindings in upcoming months”

I was talking about it with the outstanding Hugging Face team (I am one of their early investors), and I wanted to share my excitement with you!

What is a Sparse Matrix?

A sparse matrix is just a matrix with some zeros. Usually, a lot of them. So every place you are using a dense matrix, in a linear layer, for example, you could be using a sparse one.

Matrices with increasing sparsity

The sparsity of a matrix is the number of zeros divided by the total number of entries.
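As a quick illustration of that definition, here is a minimal sketch in plain Python; the helper name `sparsity` and the toy matrix are mine, not part of any library.

```python
def sparsity(matrix):
    """Return the fraction of entries of a 2D list that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total

m = [[0, 2, 0, 0],
     [0, 0, 0, 1],
     [3, 0, 0, 0]]
print(sparsity(m))  # 9 zeros out of 12 entries -> 0.75
```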

The pros? If you have a lot of zeros, you don’t have to compute some multiplications, and you don’t have to store them. So you may gain on size and speed, for training and inference (more on this today).

The cons? Of course, having all these zeros will probably have an impact on network accuracy/performance. But to what extent? You may be surprised.

Where are they from?

The first researchers/engineers to use sparse matrices were Finite Elements users.

A 2D mesh (roof of Omni Coliseum, Atlanta) and its finite element matrix (source).

When you have to deal with large physical simulations, you get a large graph of interconnected vertices.

Each vertex is a point of your system, and each edge connects two vertices. That means that these two points will have some influence on each other in the model. And so there is a non-zero value in the matrix that describes the graph.

This last sentence sums it up: you need non-zero values in the matrix when two dimensions are interacting in some way.

Now getting back to ML, you should ask yourself the same question: are all the dimensions of my input vector interacting with all the others? Usually not. So going sparse may be useful.

We actually have a very good, and famous, example of a successful trip to sparse-land: convolutional layers.

Learned convolutional filters. From http://cs231n.github.io/convolutional-networks/

Convolutional layers are a smart and efficient way to implement a sparse transformation on an input tensor.

When processing images, it comes down to two things:

Sparsity: the transformation is local → each output pixel should depend on a few neighboring input pixels.

Invariance: the transformation does not depend on the position in the image

Then you just add the constraint that the transformation is linear: if you were to represent this transformation, you would get a HUGE matrix with only a few non-zeros. But of course, the right way to do this is to do a multiplication of the input tensor with a small set of small matrices (each square in the image before).
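To see the "HUGE matrix with only a few non-zeros" for yourself, here is a small sketch on a 1D signal (a simplification of the 2D image case, with a toy kernel of my choosing). It builds the explicit matrix of a sliding-window filter and checks that multiplying by it gives the same result as sliding the small kernel directly.

```python
def conv_matrix(kernel, n):
    """Dense matrix equivalent to sliding `kernel` over a length-n input (no padding)."""
    k = len(kernel)
    rows = []
    for i in range(n - k + 1):
        row = [0.0] * n
        row[i:i + k] = kernel  # same weights on every row (invariance), shifted by one (locality)
        rows.append(row)
    return rows

def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def sliding_filter(kernel, x):
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

kernel = [1.0, 0.0, -1.0]
x = [float(v) for v in range(10)]
big = conv_matrix(kernel, len(x))

# Both routes compute the same linear transformation.
assert matvec(big, x) == sliding_filter(kernel, x)
# The explicit matrix stores 8 x 10 = 80 numbers to describe 3 weights.
print(len(big), len(big[0]), len(kernel))
```

The matrix grows with the input size while the kernel does not, which is why nobody materializes it in practice.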

The importance of convolutions in today’s ML success is obvious. But you can see that finding a clever way to make things sparse sounds like a good recipe to save time and space.

Where are they useful?

Convolutions are already an efficient form of sparsity, so you could try to make them even more sparse, but some other networks contain much larger matrices that may benefit from sparsity: Transformers.

And those are getting bigger and bigger. We greatly exceeded 1 billion parameters in 2019, and it’s not stopping here. The cost of training and using those networks is becoming impractical, so every method to reduce their size will be welcome.

From https://devblogs.nvidia.com/training-bert-with-gpus/

Why is the OpenAI announcement so important?

So, if everything is fine in sparse-land, we should all be trying sparse matrices, shouldn’t we?

Yes. But there is this stupid thing called implementation. It’s easy to see the theoretical improvements we could get with sparse compute. But the support in libraries is quite … sparse.

PyTorch developers, for example, have made a significant effort to support sparse compute. But there is still a big gap in performance between dense and sparse matrix operations, which defeats the whole purpose of using them. Even memory usage is quite large: sparsity has to be more than 80% to save some room with sparse matrices (more on that in my next post). Even basic serialization was broken before version 1.4. The reason is that the underlying libraries (for example cuSPARSE) are not doing a great job because the problem is ill-suited to the way GPUs work.
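Where could an 80% threshold come from? One plausible back-of-the-envelope accounting, which is my assumption rather than something stated here, is a coordinate (COO) layout where each non-zero stores a 32-bit value plus a row and a column index of 64 bits each. The helper names below are hypothetical.

```python
def dense_bytes(rows, cols, value_bytes=4):
    """Memory for a dense matrix of 32-bit floats."""
    return rows * cols * value_bytes

def coo_bytes(rows, cols, sparsity, value_bytes=4, index_bytes=8):
    """Memory for a COO matrix: assumed layout, one value plus two indices per non-zero."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + 2 * index_bytes)

for s in (0.5, 0.8, 0.9):
    ratio = coo_bytes(1000, 1000, s) / dense_bytes(1000, 1000)
    print(f"sparsity {s:.0%}: sparse/dense memory ratio = {ratio:.2f}")
```

Under these assumptions each non-zero costs 20 bytes instead of 4, so the sparse format only starts saving memory once fewer than one entry in five is non-zero.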

So the OpenAI announcement on their block sparse tools is very good news for those who want to use sparse ops without sacrificing training speed (and it looks like some people have been waiting for some time now). And we are not talking about a few percent.

“Our kernels typically performed one or two orders of magnitude faster in terms of GFLOPS.”

From OpenAI blocksparse paper

(The worst thing is that the paper concludes that cuBLAS is faster than cuSPARSE even with very sparse matrices. How sad.)

The magic keyword here is “block”. It’s hard to implement general sparse matrix computations on GPUs in an efficient way. But it gets much easier if you add a “reasonable” constraint on the form of the matrices: their non-zeros should be grouped in small fixed-size blocks, which makes GPU processing much easier to parallelize efficiently. Typical block sizes are 8x8, 16x16 or 32x32, with 16x16 already giving very good performance and 32x32 a slightly better one.

An 8-block-sparse matrix
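Here is a toy sketch of what the block constraint buys you, in plain Python with 2x2 blocks instead of 8x8 to keep it readable. The storage scheme (a dict keyed by block coordinates) and the function name are assumptions for illustration; the OpenAI kernels are CUDA code and work quite differently under the hood.

```python
def block_sparse_matvec(blocks, block_size, n_rows, x):
    """Multiply a block-sparse matrix by a vector.

    `blocks` maps (block_row, block_col) to a dense block_size x block_size
    list of lists. Missing keys are all-zero blocks: no storage, no compute.
    """
    y = [0.0] * n_rows
    for (bi, bj), block in blocks.items():
        r0, c0 = bi * block_size, bj * block_size
        # Inside a block the work is a small, regular, dense product.
        for i in range(block_size):
            y[r0 + i] += sum(block[i][j] * x[c0 + j] for j in range(block_size))
    return y

# A 4x4 matrix with two non-zero 2x2 blocks out of four.
blocks = {(0, 1): [[1.0, 2.0], [3.0, 4.0]],
          (1, 0): [[5.0, 6.0], [7.0, 8.0]]}
print(block_sparse_matvec(blocks, 2, 4, [1.0, 1.0, 1.0, 1.0]))
```

Only one index pair is stored per block rather than per non-zero, and every block is a tiny dense multiplication, which is the kind of regular work that parallel hardware likes.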

Of course, the “block” constraint may be crippling some sparsification algorithms, or at least it would require some changes to take it into account.

But at least we can play with large, high-sparsity matrices, and the block constraint may not be a big issue: if you think about it, it means that there is some locality in the dimensions, and that sounds like quite a reasonable constraint. That’s the same reason band matrices have been useful in the past (finite differences, finite elements), and that was a much stronger constraint.

Band matrix

Conclusion

I hope I have convinced you that 2020 will be the sparse network year (it already has two zeros, that’s a sign).

Next time, for those who are curious about what happens when they run some CUDA-based PyTorch code, we’ll dig a bit deeper into GPU internals (and we will understand why block sparse code outruns sparse code by a large margin).

This article series will continue with the different techniques that have been proposed to make networks sparse, and their potential long-term benefits.

Encoder-decoders in Transformers: a hybrid pre-trained architecture for seq2seq

Rémi Louf — Tue, 03 Dec 2019 13:10:39 GMT

How to use them with a sneak peek into upcoming features 🕵️‍♀️

Our Transformers library implements many (11 at the time of writing) state-of-the-art transformer models. It is used by researchers and practitioners alike to perform tasks such as text classification, named entity recognition, question answering or text generation. Its API is compatible with both PyTorch and TensorFlow.

While many recent models have focused on single-stack architectures, encoder-decoders have come under the spotlight again recently, notably with Facebook’s BART and Google’s T5.

This post briefly goes through the (modern) history of transformers and the comeback of the encoder-decoder architecture. I will walk you through the implementation of encoder-decoders in the transformers library, show how you can use them for your projects, and give you a taste of what is coming in the next releases.

Hello 👾 Transformers

The transformer storm began with “Attention is all you need”, and the architecture proposed in the paper featured both an encoder and a decoder; it was originally aimed at translation, a Seq2Seq task. Its principal innovation compared to RNNs was to stack layers of bidirectional attention so every token can attend to every other token.

The original transformer architecture — that you have probably seen everywhere — has an encoder and decoder stack.

🚀 The rise of single-stack architectures

Following this, two papers came and further disrupted model architectures:

GPT from OpenAI
BERT from Google AI Language

👋 GPT

The authors of GPT completely dropped the decoder of the original Transformer. They left us with this:

Our poor transformer cut in half.

The authors trained the model by teaching it a language model, the probability distribution of possible sequences, in an unsupervised way. They did so by factorizing the distribution in a particular way:
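In standard notation, and assuming a sequence of tokens x_1, …, x_n (symbols of my choosing), the left-to-right factorization can be written as:

```latex
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})
```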

Which is mathematically trivially true: the probability of a sequence is the product of the probabilities of the tokens conditioned on the previous tokens. Note that this is not the only possible factorization, just one that seems particularly useful.

However, encoders are stacks of self-attention layers; everyone can attend to everyone and at the top of the encoder, the probability of each token will depend on every other token. How can the model learn the language model above?

The authors used a trick: the attention mask. Given queries Q, keys K and values V, the output of a (single-headed) attention layer reads:
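One common way to write it, assuming the scaled dot-product attention of “Attention is all you need” with key dimension d_k and an additive mask M (0 where attention is allowed, minus infinity where it is forbidden):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V
```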

Attention mechanism with masking. The mask specifies which positions the output can attend to by forcing the output of the softmax to 0 if the position cannot be attended to.

The idea is to add a matrix that will “forbid” tokens (say words) from attending to one another. The following mask is used in GPT to prevent tokens from attending to tokens later in the sequence:

Left-to-right mask. For a given token in the sequence, we assign a mask value of 0 for this token and the preceding ones; a value of minus infinity for the later ones. As a result tokens can only attend to tokens preceding them in the sequence.
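The mask described above can be sketched in a few lines of plain Python. This is an illustration of the mechanism, not the code used in GPT or in the transformers library; the function names are mine.

```python
import math

def left_to_right_mask(n):
    """mask[i][j] is 0 if position i may attend to position j, minus infinity otherwise."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax of scores + mask; forbidden positions get weight exactly 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        shifted = [s + m for s, m in zip(score_row, mask_row)]
        top = max(shifted)  # finite, since a token can always attend to itself
        exps = [math.exp(v - top) for v in shifted]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With identical raw scores, each token spreads its attention evenly
# over itself and the tokens before it, and ignores the ones after.
scores = [[0.0] * 3 for _ in range(3)]
for row in masked_softmax(scores, left_to_right_mask(3)):
    print(row)
```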

Using this mask you can train the model by making each token in the sequence predict the next one, and you can generate new sequences in an auto-regressive way. While the generation abilities are nothing short of amazing, natural language understanding (NLU) is not GPT’s strong suit. That is where BERT entered the stage and took the NLP world by storm.

👋 BERT

BERT, unlike GPT, does not use any mask trick during pre-training. It is the pre-training task that pulls all the weight.

Instead of teaching the model to predict the next word in a sentence, it masks a fixed proportion of tokens at random in a sequence and trains the model to recover these masked words (this is a Cloze test used, among other things, to evaluate people’s abilities in a foreign language). This pre-trained model can then be fine-tuned on many language understanding tasks such as named entity recognition, question answering and text classification. BERT thus achieved a qualitative jump in many NLU benchmarks.
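Here is a stripped-down sketch of that corruption step. The 15% proportion and the `[MASK]` symbol follow the BERT paper as I recall it; the actual recipe is a bit more subtle (some selected tokens are swapped for random words or left unchanged), which this illustration skips. The function name `mask_tokens` is hypothetical.

```python
import random

def mask_tokens(tokens, proportion=0.15, mask_token="[MASK]", seed=0):
    """Hide a fixed proportion of tokens; return (corrupted input, targets to recover)."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * proportion))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    # The model is trained to predict the original token at each hidden position.
    targets = {i: tokens[i] for i in sorted(positions)}
    return corrupted, targets

sentence = "the cat sat on the mat because it was tired of chasing the mouse all day long".split()
corrupted, targets = mask_tokens(sentence)
print(" ".join(corrupted))
print(targets)
```

Because the hidden words can sit anywhere, the model has to use context on both sides to recover them, which is why no left-to-right mask is needed.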

As the figure below shows, many of the papers that followed are iterations on the foundations laid by BERT and GPT:

A transformer encoder;
Various pre-training tasks and associated attention masks.

Not all models implement the encoder-decoder architecture; encoder-decoders are actually only becoming popular now. Transformer-XL, GPT2, XLNet and CTRL approximate a decoder stack during generation by using the hidden states of the previous steps as the keys and values of the attention module. Side note: all these ☝️ models are implemented in the transformers library or will be soon.

Yet not every task can be reduced to solely a text generation task or an NLU task. Some tasks require both understanding and generation capabilities. For instance:

Me reaching the limits of my drawing skills.

In these situations, what we would like the model to learn is not only the probability of the generated sequence, but the probability of this sequence given another sequence:
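Writing the input sequence as x_1, …, x_n and the generated sequence as y_1, …, y_m (notation of my choosing), the conditional version of the left-to-right factorization reads:

```latex
P(y_1, \dots, y_m \mid x_1, \dots, x_n) = \prod_{i=1}^{m} P(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_n)
```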

Language model and Seq2Seq language models. Sometimes the distinction is pedantic, sometimes it’s not.

In a plot twist, the authors of XLM and UniLM managed to fit these two tasks in a single encoder. How? With a smart use of embeddings (XLM, for translation) or a clever mask trick (UniLM)!

The prefix mask as defined in the UniLM paper. Words in the first sequence can attend to any other word in this sequence; words in the second sequence can attend to every word in the first sequence and only the preceding words in their sequence.
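The caption above translates almost word for word into code. This is a minimal sketch following that description, with the same 0 / minus infinity convention as the left-to-right mask and each target word allowed to attend to itself; it is not taken from the UniLM implementation.

```python
import math

def prefix_mask(n_source, n_target):
    """mask[i][j] is 0 if position i may attend to position j, minus infinity otherwise."""
    n = n_source + n_target
    mask = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_source:
                allowed = True    # everyone can attend to the whole first sequence
            elif i < n_source:
                allowed = False   # the first sequence never looks at the second
            else:
                allowed = j <= i  # second sequence: left-to-right only
            if allowed:
                mask[i][j] = 0.0
    return mask

for row in prefix_mask(2, 2):
    print(row)
```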

👋 The comeback of Encoder-decoder architectures

So why should we care about Encoder-decoder architecture if one, smaller, architecture does the job very well? Can it even do what the smaller architecture does?

The authors of the T5 paper recently answered the last question in the affirmative; encoder-decoders even perform extremely well. Building on previous ideas, they proposed a scheme to map any natural language understanding task to a text-to-text task (read the paper if you have time, you won’t regret it).

To answer the first question, I would say that there is one thing that might be much easier to do with encoder-decoders: transfer learning on every task that can be mapped to a translation task.

(note: these are speculations)

Say you have a pre-trained model in language A, a pre-trained model in language B. You could theoretically use one as the encoder, the other as the decoder and fine-tune the model on a translation task.

This is not only true for natural language. Take the example of a data scientist bored of having to write simple SQL queries whenever asked, and a boss who couldn’t care less about using a frontend to answer their own questions. They could pre-train BERT on SQL, use pre-trained weights for the English language, and fine-tune on a year’s worth of requests. Et voilà!

Boss2SQL (patent pending). The encoder is a BERT model pre-trained on the English language (you can even use pre-trained weights!), the decoder a BERT model pre-trained on the SQL language. Fine-tune the model on a year’s worth of requests and you will never have to write a single line of SQL again.

Now imagine if we had a bank of BERTs pre-trained in many, many languages. Writing translators would become much easier, and thanks to transfer learning this would make the whole translation business easier to scale.

Encoder-decoder architectures could theoretically allow us to compound pre-training efforts to do transfer learning on a vast number of translation tasks.

HuggingFace 🤗❤️ Seq2Seq

When I joined HuggingFace, my colleagues had the intuition that the transformers literature would go full circle and that encoder-decoders would make a comeback. We thought that we should anticipate this move, and allow researchers to easily implement such models with our library.

Well, everything moves fast in NLP these days: within a few weeks BART and T5 were published; both are encoder-decoder architectures showcasing all sorts of new state-of-the-art results.

Allowing the integration was fairly straightforward. All we needed to do was to modify the library to allow the existing models (encoders) to also act as decoders. Which meant:

Adding a cross-attention layer, whose weights will be randomly initialized;
Transforming the attention mask on the decoder input as a left-to-right mask adapted for generation tasks.

What happens schematically in our encoder-decoder architectures. The encoder has bi-directional layers of self attention; the decoder is in fact the same model to which we add layers of cross-attention and causal masks when it is used as a decoder. It allows us to leverage the models already implemented by the community with very little code.

🔧 Use encoder-decoder architectures to build amazing things🔧

We defined a simple API that allows you to initialize encoder-decoders with pre-trained encoders and decoders. We call these hybrid pre-trained architectures the combiners:

They allow you to combine, for instance, the NLU superpowers of BERT with the generation superpowers of GPT-2.

Thanks to transformers being central in the ecosystem and making state-of-the-art models available, encoder-decoder models benefit from a substantial compounding effect: 11 models implemented in the library means 121 possible combinations for you to start building cool things. When you account for all the different languages the numbers become astronomical.

The combiners are where the open-source philosophy of Hugging Face and its amazing community start to really shine.

Only need the superpowers of one model? No worries! We created a simpler API for you:

Knowing how to pass the arguments of the two models can be (the only) tricky (step), so here is a reference you can use for your implementation:

To pass keyword arguments to the encoder and the decoder you need to respectively prefix them with `encoder_` and `decoder_`. Keyword arguments that are not prefixed will be passed to both models.
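To illustrate the prefix convention, here is a small stand-alone sketch of how such routing could work. `split_kwargs` is a hypothetical helper written for this example, not the library’s code, and the choice to let a prefixed argument override a shared one with the same name is my assumption.

```python
def split_kwargs(**kwargs):
    """Route keyword arguments to the encoder and decoder based on their prefix."""
    shared = {key: value for key, value in kwargs.items()
              if not key.startswith(("encoder_", "decoder_"))}
    # Unprefixed arguments go to both models.
    encoder_kwargs = dict(shared)
    decoder_kwargs = dict(shared)
    for key, value in kwargs.items():
        if key.startswith("encoder_"):
            encoder_kwargs[key[len("encoder_"):]] = value
        elif key.startswith("decoder_"):
            decoder_kwargs[key[len("decoder_"):]] = value
    return encoder_kwargs, decoder_kwargs

encoder_kwargs, decoder_kwargs = split_kwargs(
    output_attentions=True,       # sent to both
    encoder_input_ids=[101, 7592, 102],
    decoder_input_ids=[101],
)
print(encoder_kwargs)
print(decoder_kwargs)
```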

We recognize there are situations (notably for finetuning) in which you want to randomly initialize either the encoder or decoder. Easy:

Initialize an encoder-decoder model with a pre-trained BERT encoder and a randomly initialized GPT2 XL

Finally, if you want to share the weights between the encoder and the decoder, you have access to both architectures via model.encoder and model.decoder. This is very application-specific, so we do not provide an API for this. Don’t hesitate to open an issue if you need help.

All of this has been available since the 2.2.0 release of the transformers library. For the moment, only BERT has been adapted to work as a decoder, but we’re working our way through the other ones!

What combiner would you like most to play with? Let us know in the comments 👇 or ping us on Twitter @huggingface

⌨️ Generate text with Transformers ⌨️

When we started working on an illustrative example, we realized that the text generation capabilities of the library were limited (although we do have an awesome example script and an online demo of text generation). Since they are essential for Seq2Seq tasks, we started working on a simple module for you to generate sequences. The API is subject to change, but you should be able to generate text as in the following:

Sample sequences at various temperatures using k-filtering, nucleus sampling and applying repetition penalty.
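For readers who want to see what these knobs do, here is a library-free sketch of a single sampling step over a list of raw scores (logits). It is an illustration of the techniques named in the caption, not the module’s API; the function name and signature are hypothetical, and the repetition penalty follows one common form (divide positive scores, multiply negative ones).

```python
import math
import random

def sample_next(logits, generated=(), temperature=1.0, top_k=0, top_p=1.0,
                repetition_penalty=1.0, rng=random):
    """Pick the index of the next token from raw scores."""
    scores = list(logits)
    # Repetition penalty: make tokens that were already generated less attractive.
    for tok in set(generated):
        if scores[tok] > 0:
            scores[tok] /= repetition_penalty
        else:
            scores[tok] *= repetition_penalty
    # Temperature: below 1 sharpens the distribution, above 1 flattens it.
    scores = [s / temperature for s in scores]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # k-filtering: only the k most likely tokens stay in the race.
    if top_k > 0:
        ranked = ranked[:top_k]
    # Nucleus sampling: keep the smallest set whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

print(sample_next([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```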

It will include at the very least sampling for both single-stack (GPT, XLNet, CTRL, XLM, Transfo-XL, GPT2) and encoder-decoder stacks. The following example of transformers playing exquisite corpse was generated using an early version of this module. Look what 10 lines of code can do for you:

Transformers playing exquisite corpse, a game invented by surrealists in the 1930s. Each algorithm is given the sequence written by the previous one, leading to an unexpected result.

Your GPU prefers beam search? We’ve got you covered:
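As a reminder of what beam search does, here is a toy, library-free sketch. The "model" is a hand-written probability table standing in for a real network, and both function names are hypothetical; the point is only to show how keeping several candidates alive can beat the greedy choice.

```python
import math

def beam_search(next_log_probs, start, steps, beam_size=2):
    """Keep the `beam_size` best partial sequences at every step.

    `next_log_probs(sequence)` stands in for a model: it returns a dict
    mapping each candidate next token to its log-probability.
    """
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams

def toy_model(seq):
    """Toy next-token distribution that depends only on the last token."""
    table = {"<s>": {"a": 0.6, "b": 0.4},
             "a": {"x": 0.5, "y": 0.5},
             "b": {"x": 0.9, "y": 0.1}}
    return {tok: math.log(p) for tok, p in table[seq[-1]].items()}

print(beam_search(toy_model, "<s>", steps=2, beam_size=1)[0][1])  # greedy path
print(beam_search(toy_model, "<s>", steps=2, beam_size=2)[0][1])  # beam of size 2
```

Greedy decoding commits to "a" (probability 0.6) and ends up with a sequence of probability 0.3, while a beam of size 2 keeps "b" around long enough to find a sequence of probability 0.36.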

And this is only scratching the surface of what is possible in text generation.

If you would like to see more state-of-the-art methods to generate text in the library, let us know in the comments 👇 or ping us on Twitter @huggingface

📄 Abstractive summarization with Transformers 📄

Abstractive summarization has attracted a lot of attention lately in the research literature. We have also had a substantial amount of feedback from the community, both from users who are just curious about the current state-of-the-art and from practitioners who would be happy to use it in their jobs.

We listened, so keep an eye on Twitter for the release 😉

At 🤗 HuggingFace we care deeply about the needs and aspirations of our community. What are the applications of Seq2Seq models that you find most interesting? Let us know in the comments 👇 or ping us on Twitter @huggingface

🦄🤝🦄 Encoder-decoders in Transformers: a hybrid pre-trained architecture for seq2seq was originally published in HuggingFace on Medium, where people are continuing the conversation by highlighting and responding to this story.

How To Write With Transformer

Jamie Brew — Tue, 26 Nov 2019 16:56:51 GMT

How to Write With an Artificial Intelligence

Creative Writing 1010101

Text-generating neural networks like OpenAI’s GPT-2 often raise questions about the dangers of fake text: Can a machine write text that’s convincingly, deceptively human?

As a comedy writer, I’m more interested in the opposite question: Can a machine produce words that no human would ever write? Can it help me write things that I would never write?

Write With Transformer is a web app that lets you write in collaboration with a text-generating neural network. It’s a demo for Transformers, a state-of-the-art software library developed and maintained by Hugging Face.

This post covers the basics of the app, a few strategies for using it as a writer and some more advanced controls.

Basic controls

https://transformer.huggingface.co/doc/gpt2-large

Write With Transformer is a normal text editor with one twist: At any time, you can appeal to GPT-2 for suggestions.

To make them, the machine considers all possible next words, chooses one of those words, considers all possible next words after that, and repeats until it runs out of time. It does this in three different places at once, which is how it arrives at three different suggestions.

You can read more about what’s going on under the hood here.

For now, here are the main predictive text commands to know:

Press Tab to ask the neural network for 3 suggestions to continue what you have written so far.
Keep pressing Tab as many times as you like to repeatedly request three more suggestions.
Use the arrow keys and Enter, or click, to select one of the suggestions.

Writing methods

Here’s a list of just a few approaches you can take to using Write With Transformer.

1. Blind devotion

To remove yourself from the equation and see what the neural net might generate “on its own”, you can decide from the start that you’ll always take the first suggestion. If you start from a blank page, the first few words can be disorienting, like falling asleep and waking up in a random corner of the internet…

2. Branching path

Limit yourself to the three options supplied by the app, letting it tell you a choose-your-own-adventure tale about whatever the internet had on its mind when the training data was collected…

3. Tag team

Prompt the machine with a thought, then let its response prompt you. Go back and forth as cowriters, or warring Wikipedia editors…

4. Rewrites

Bring in familiar text from somewhere else, delete the end of it and see how Transformer would have completed it…

Note: I got curious about the second option, which seems to be the start of a full-scale FAQ about chickens. So I opened the app again and kept going. You can read the FAQ here.

5. Continuing lists

Transformers are great at picking up patterns in series of items. This makes it especially fun to prompt them with incomplete lists.

Try prompting with the start of a horizontal list, like this:

Or the start of a vertical list, like this:

6. Freeform

The repetitive structure of lists lends itself to transformer writing. The same applies to any kind of writing with a recognizable, consistent structure. Try interviews, step-by-step instructions, or invent your own new format and see what patterns the neural net picks up.

Advanced settings

You can adjust four settings in the bottom left corner of the app, controlling Model size, Top-p, Temperature and Max time.

Let’s look at each of these in turn.

Model size

Larger models have more parameters, which roughly means they can remember patterns from their training set in greater detail. This means larger models offer suggestions that are more specifically related to the prompt.

Suggestions from larger models are also shorter. This is because the models run slower, so within the time window set by Max time (see below), they can generate fewer words.

Temperature

The most poetically named parameter, temperature controls how adventurous the algorithm is with its word choices. Turning the temperature up makes suggestions wilder and less predictable.

Here’s a typical continuation at low temperature:

And here’s one at high temperature:

Top-p

This setting controls how broad a range of continuations are considered. Set it high to consider all continuations. Set it low to just consider likely continuations. The overall effect is similar to temperature, but more subtle.
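To make the Top-p setting concrete, here is a minimal sketch of the filtering step over a toy next-word distribution. The function name `top_p_filter` and the example probabilities are made up for illustration; this is not the app's actual code.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely words whose total probability
    reaches top_p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for word, p in ranked:
        kept[word] = p
        total += p
        if total >= top_p:
            break
    return {word: p / total for word, p in kept.items()}


# Toy distribution over possible next words (hypothetical values).
next_word_probs = {"road": 0.5, "street": 0.3, "river": 0.15, "moon": 0.05}

low = top_p_filter(next_word_probs, top_p=0.5)   # only the likeliest word survives
high = top_p_filter(next_word_probs, top_p=1.0)  # every word stays in play
```

With a low value the unlikely words are dropped before a word is chosen, which is why the setting tames the suggestions without reshaping the probabilities of the words that remain.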

Max time

This controls how long the suggestions are. The model will always generate as many words as it has time for. To ask for just a few words, set the maximum time to low. For longer suggestion blocks, choose a small model size and a high maximum time.

Sharing your writing

Write With Transformer has two built-in sharing mechanisms.

1. Screenshot

For short paragraphs, this button exports your document to an image, with Transformer-written text rendered in bold.

2. Save and publish

Ideal for longer documents. This option gives you links that let you return to editing a document later, or share with friends, who can read it or edit further themselves.

For example: Here’s the Chicken FAQ I created from the document that started “Why did the chicken cross the road?”

3. Duplicate and edit

Starting from a shared document (like the Chicken FAQ) click the Duplicate & Edit button to do just that: create a copy that you can edit via any human-machine balance you choose.

Please, duplicate the FAQ and help me learn more about the chicken.


Benchmarking Transformers: PyTorch and TensorFlow

Lysandre Debut — Fri, 18 Oct 2019 15:02:09 GMT

Our Transformers library implements several state-of-the-art transformer architectures used for NLP tasks like text classification, information extraction, question answering, and text generation. It is used by researchers and companies alike, offering PyTorch and TensorFlow front-ends.

Since the release of our TensorFlow implementation, we have been working on productionizing the models and making them available on TPU, slowly gearing ourselves towards performance.

This post compares the performance of our models in several environments. We compare them for inference, on CPU and GPU for PyTorch (1.3.0) as well as TensorFlow (2.0). As several factors affect benchmarks, this is the first of a series of blogposts concerning benchmarks and subsequent performance optimizations.

In addition to this post, we are creating a Benchmark section in our documentation, which will evolve as we further work on our models and benchmark them in different settings.

Results

The results are visible in this Google Spreadsheet. The average results are visible in the table below. The results are detailed in the discussion section.

Average inference time

The N/A entries in the spreadsheet indicate either an out-of-memory error or an inappropriate sequence length. Transformer-XL does not have TorchScript results as it is not currently serializable by TorchScript.

In most cases, the TensorFlow and PyTorch models obtain very similar results, both on GPU and CPU. Below is a short discussion of the results, covering both the comparison between PyTorch and TensorFlow and the comparison between models.

Measuring inference

Inference time is an important metric when putting a model in production. In order to evaluate the inference times of our models, we compare them with different batch sizes and different sequence lengths. We compare the reasonable batch sizes [1, 2, 4, 8] with the sequence lengths [8, 64, 128, 256, 512, 1024]. The batch sizes remain small as we are exclusively looking at an inference setup. BERT and other similar models have a maximum sequence length of 512 (256 for CTRL) and are therefore not measured on the sequence lengths beyond that limit.

We test the results in two different environments:

on CPU, using a GCP n1-standard-32 which has 32 vCPUs and 120GB of RAM. The CPU model is an Intel Xeon @ 2.3GHz.
on GPU, using a custom GCP machine that has 12 vCPUs, 40GB of RAM and a single V100 GPU (16GB VRAM).

Experiment details & best practices

In order to maximize performance, further optimizations are made:

The Intel Xeon CPU on which we measure the CPU inference comes with AVX and AVX2 extensions. As TensorFlow needs to be compiled from source to leverage those extensions, we do so.
We make sure we are not using TensorFlow’s eager mode by using tf.function and tracing the models beforehand.
We compare the inference with and without the library-dependent tools: TorchScript for PyTorch, and XLA (Auto-clustering) for TensorFlow with GPUs. These two tools are detailed below.
We use the native Python module timeit to measure the inference time. We run each of our experiments with repeat=30 and number=3. We then average over the 30 values to get the expected average inference time. Averaging over 30 values yields very stable results.
We do not make use of production environments such as TFX, and we measure the models’ callable method: nn.Module.forward for PyTorch and tf.keras.layers.Layer.call for TensorFlow.
We are careful to use the appropriate CUDA versions for both TensorFlow and PyTorch.
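
The timing procedure described above can be sketched with the standard library alone. The `fake_inference` function is a stand-in for a model's forward pass and the helper name `average_inference_time` is ours; only `repeat=30`, `number=3` and the two grids come from the setup described here. Dividing by `number` to get a per-call time is an assumption about how the averages are derived.

```python
import timeit

BATCH_SIZES = [1, 2, 4, 8]
SEQUENCE_LENGTHS = [8, 64, 128, 256, 512, 1024]


def fake_inference(batch_size, sequence_length):
    """Stand-in for a forward pass: does work proportional to the input size."""
    return sum(range(batch_size * sequence_length))


def average_inference_time(fn, repeat=30, number=3):
    """Time `number` calls of fn, `repeat` times over, and return the
    mean duration of a single call in seconds."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return sum(totals) / len(totals) / number


# One averaged measurement per (batch size, sequence length) pair.
results = {
    (bs, sl): average_inference_time(lambda: fake_inference(bs, sl))
    for bs in BATCH_SIZES
    for sl in SEQUENCE_LENGTHS
}
```

Swapping `fake_inference` for a real model call gives one averaged number per cell of the grid, which is the shape of the results discussed in this post.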

Discussion

PyTorch and TensorFlow

Both libraries obtain similar results in most cases, with TensorFlow generally being a bit slower on CPU compared to PyTorch, but a bit faster on GPU:

Across all models, on CPU, PyTorch has an average inference time of 0.748s while TensorFlow has an average of 0.823s.
Across all models, on GPU, PyTorch has an average inference time of 0.046s whereas TensorFlow has an average inference time of 0.043s.

These results compare the inference time across all models by averaging the results. As a consequence, the larger the input size, the larger the impact on the final result. PyTorch runs out of memory when the input sizes are too large; those results are removed from all measures when averaging as it would skew the results towards PyTorch.

The PyTorch models tend to run out of memory earlier than the TensorFlow models: apart from the Distilled models, PyTorch runs out of memory when the input size reaches a batch size of 8 and a sequence length of 1024.

TorchScript

TorchScript is PyTorch’s way of creating serializable models that can run on different runtimes, such as C++ environments, with no need for Python dependencies. Our tests were done by tracing the model in Python and re-using that traced model in the same environment. We make sure to trace the model before measuring its inference by executing a forward pass beforehand.
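
As a rough illustration of this tracing workflow, the sketch below traces a small stand-in network with `torch.jit.trace` and runs one warm-up forward pass before any timing would start. The tiny `nn.Sequential` model and its input shape are placeholders of ours, not one of the benchmarked Transformers.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for a Transformer (assumption for illustration).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
example = torch.randn(2, 16)

with torch.no_grad():
    # Tracing records the operations executed on the example input.
    traced = torch.jit.trace(model, example)
    traced(example)  # warm-up forward pass before measuring anything
    same_output = torch.allclose(traced(example), model(example))
```

The traced module is called exactly like the original one, so the same timing code can be pointed at either.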

Disclaimer: while TorchScript does not seem to be inherently created for speed-up in a Python environment, our results show that tracing the model with TorchScript can yield performance improvements.

TorchScript seems to be very dependent on the models and the input size (batch size * sequence length); as an example, using TorchScript yields a permanent performance boost on XLNet whereas its use may be questionable on XLM, where it increases performance in smaller input sizes but decreases performance in larger input sizes.

On average, an inference with a model traced with TorchScript is 20% faster than an inference with the same PyTorch non-traced model.

XLA

XLA is a linear algebra compiler that can accelerate TensorFlow models. We’re using it solely on GPU where it is based on TensorFlow’s Auto-clustering which compiles some of our models’ subgraphs.

The results are improvements in speed and memory usage: most internal benchmarks run ~1.15x faster after XLA is enabled.

We obtain an increase in performance with all of our models when XLA is enabled. In some extreme cases, we obtain a decrease of 70% in inference time, especially for smaller input sizes.

Models and their distilled version

Distilled models shine in this test as being very quick to benchmark. Both of the Hugging Face-engineered-models, DistilBERT and DistilGPT-2, see their inference times halved when compared to their teacher models.

Contributing

As benchmarking on all different setups, with every tool, isn’t achievable by a single organization, we welcome benchmarks from the community. The GitHub user @tlkh has already contributed by benchmarking the performance that can be achieved using AMP, XLA and distributed strategies on our TensorFlow models. It is currently being added to the benchmarking section of the documentation.

How to contribute

If you would like to contribute, we have set up issue templates on our GitHub to make it easier. Feel free to open an issue with your results, or to open a pull request with your additions to the benchmark section of the documentation.

Benchmarking script

Accompanying the release of this blog post and the Benchmark page on our documentation, we add a new script in our example section: benchmarks.py, which is the script used to obtain the results detailed above. It can run benchmarks on TensorFlow, on PyTorch, using XLA or TorchScript and save the results to a CSV file.

What’s next?

Benchmarking our models is but the first step on our road to speed performance. We believe this introductory article may be of help when looking to compare the current state of our models, especially when looking at the difference between PyTorch and TensorFlow. As we delve into the production aspects of Transformers, we are bound to work on performance-oriented improvements.

Automated scripts, new architectures and custom TPU training for PyTorch and TensorFlow: keep an eye out for future releases!


Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT

Victor Sanh — Wed, 28 Aug 2019 14:43:24 GMT

Photo by Shubham Sharan on Unsplash

🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT

Update (October 3rd, 2019): We are releasing our NeurIPS 2019 workshop paper describing our approach on DistilBERT with improved results: 97% of BERT’s performance on GLUE (the results in the paper supersede the results presented here). The approach is slightly different from the one explained in this present blog post so this blog post should be a good entry point to the paper! We applied the same method to GPT2 and are releasing DistilGPT2! Training code and pre-trained weights for DistilBERT and DistilGPT2 are available here. 🤗

In the last 18 months, transfer learning from large-scale language models has significantly improved upon the state-of-the-art on pretty much every Natural Language Processing task.

Usually based on the Transformer architecture of Vaswani et al., these pre-trained language models keep getting larger and larger and being trained on bigger datasets. The latest model from Nvidia has 8.3 billion parameters: 24 times larger than BERT-large, 5 times larger than GPT-2, while RoBERTa, the latest work from Facebook AI, was trained on 160GB of text 😵

Some people in the community question the relevance of training larger and larger Transformers, especially when you take into account the financial and environmental cost of training. Here are some of the latest large models and their size in millions of parameters.

At Hugging Face, we experienced first-hand the growing popularity of these models as our NLP library — which encapsulates most of them — got installed more than 400,000 times in just a few months.

However, as these models were reaching a larger NLP community, an important and challenging question started to emerge. How should we put these monsters in production? How can we use such large models under low latency constraints? Do we need (costly) GPU servers to serve at scale?

For many researchers and developers, these can be deal-breaking issues 💸

To build more privacy-respecting systems, we noticed an increasing need to have machine learning systems operate on the edge rather than calling a cloud API and sending possibly private data to servers. Running models on devices like your smartphone 📲 also requires light-weight, responsive and energy-efficient models!

Last but not least, we are more and more concerned about the environmental cost of exponentially scaling the computing requirements of these models.

So, how can we reduce the size of these monster models⁉️

There are many techniques available to tackle the previous questions. The most common tools include quantization (approximating the weights of a network with a smaller precision) and weights pruning (removing some connections in the network). For these techniques, you can have a look at the excellent blog post of Rasa on quantizing BERT.

We decided to focus on distillation: a technique you can use to compress a large model, called the teacher, into a smaller model, called the student.

⚗️ Knowledge Distillation — Transferring generalization capabilities

Knowledge distillation (sometimes also referred to as teacher-student learning) is a compression technique in which a small model is trained to reproduce the behavior of a larger model (or an ensemble of models). It was introduced by Bucila et al. and generalized by Hinton et al. a few years later. We will follow the latter method.

In supervised learning, a classification model is generally trained to predict a gold class by maximizing its probability (softmax of logits) using the log-likelihood signal. In many cases, a well-performing model will predict an output distribution with the correct class having a high probability, leaving other classes with probabilities near zero.

But, some of these “almost-zero” probabilities are larger than the others, and this reflects, in part, the generalization capabilities of the model.

For instance, a desk chair might be mistaken for an armchair but should usually not be mistaken for a mushroom. This uncertainty is sometimes referred to as the “dark knowledge” 🌚

Another way to understand distillation is that it prevents the model from being too sure about its prediction (similarly to label smoothing).

Here is an example to see this idea in practice. In language modeling, we can easily observe this uncertainty by looking at the distribution over the vocabulary. Here are the top 20 guesses by BERT for completing this famous quote from the Casablanca movie:

The top 20 guesses from BERT (base) for the masked token. The language model identified two highly probable tokens (day & life) followed by a long tail of valid tokens.

👯‍♂️ How can we copy this dark knowledge?

In the teacher-student training, we train a student network to mimic the full output distribution of the teacher network (its knowledge).

We are training the student to generalize the same way as the teacher by matching the output distribution.

Rather than training with a cross-entropy over the hard targets (one-hot encoding of the gold class), we transfer the knowledge from the teacher to the student with a cross-entropy over the soft targets (probabilities of the teacher). Our training loss thus becomes:

With t the logits from the teacher and s the logits of the student
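
Written out, a cross-entropy over soft targets takes the following standard form. Here we assume that t_i and s_i stand for the probabilities the teacher and the student assign to class i (the softmax of their respective logits), and that the sum runs over all classes:

```latex
L_{ce} = -\sum_{i} t_i \, \log(s_i)
```

With a one-hot target, every term but one vanishes; with the teacher's soft targets, every class contributes to the loss.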

This loss is a richer training signal since a single example enforces much more constraint than a single hard target.

To further expose the mass of the distribution over the classes, Hinton et al. introduce a softmax-temperature:

T is the temperature parameter.

When T → 0, the distribution becomes a Kronecker delta (and is equivalent to the one-hot target vector); when T → +∞, it becomes a uniform distribution. The same temperature parameter is applied both to the student and the teacher at training time, further revealing more signals for each training example. At inference, T is set to 1 to recover the standard softmax.
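
A small numeric sketch may help here. The softmax-temperature of Hinton et al. is p_i = exp(z_i / T) / sum_j exp(z_j / T), where z are the logits. The function name and the example logits below are ours, chosen only to show the two limits.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return exp(z_i / T) / sum_j exp(z_j / T) for each logit z_i."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracted for numerical stability; cancels out in the ratio
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]  # hypothetical logits for three classes

sharp = softmax_with_temperature(logits, temperature=0.05)    # close to one-hot
standard = softmax_with_temperature(logits, temperature=1.0)  # plain softmax
flat = softmax_with_temperature(logits, temperature=100.0)    # close to uniform
```

Raising T moves probability mass from the top class to the others, which is what exposes the relative ordering of the “almost-zero” classes to the student.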

🗜Hands-on coding in PyTorch — Compressing BERT

We want to compress a large language model using distillation. For distilling, we’ll use the Kullback-Leibler loss since the optimizations are equivalent:
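
Expanding the definition of the Kullback-Leibler divergence makes this explicit. We write p for the teacher distribution and q for the student distribution (p is our label for the teacher):

```latex
KL(p \,\|\, q) = \sum_i p_i \log\frac{p_i}{q_i}
             = \underbrace{\sum_i p_i \log p_i}_{\text{does not depend on } q}
               \;-\; \sum_i p_i \log q_i
```

The first term is constant with respect to the student, and the second term is the cross-entropy between teacher and student.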

When computing the gradients with respect to q (the student distribution) we obtain the same gradients. This allows us to leverage the PyTorch implementation for faster computation:

A Knowledge distillation training step in PyTorch. Copy the gist from here.
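
For readers without access to the gist, a training step along these lines can be sketched as follows. It is an illustrative reconstruction under stated assumptions, not necessarily the exact gist: the function name `kd_step`, the assumption that both models map an input tensor directly to logits, and the toy linear models in the usage lines are ours. It relies on `nn.KLDivLoss`, which expects log-probabilities as input and probabilities as target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

kd_loss = nn.KLDivLoss(reduction="batchmean")


def kd_step(teacher, student, temperature, inputs, optimizer):
    """Run one distillation update of the student and return the loss value."""
    teacher.eval()
    student.train()
    with torch.no_grad():  # the teacher is frozen, so no gradients are needed
        logits_t = teacher(inputs)
    logits_s = student(inputs)
    loss = kd_loss(
        F.log_softmax(logits_s / temperature, dim=-1),  # student log-probabilities
        F.softmax(logits_t / temperature, dim=-1),      # teacher probabilities
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: stand-in "teacher" and "student" classifiers over 10 classes.
torch.manual_seed(0)
teacher = nn.Linear(8, 10)
student = nn.Linear(8, 10)
optimizer = torch.optim.SGD(student.parameters(), lr=0.5)
batch = torch.randn(32, 8)
losses = [kd_step(teacher, student, 2.0, batch, optimizer) for _ in range(50)]
```

Note that the same temperature divides both sets of logits, matching the training-time setup described earlier.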

Using the teacher signal, we are able to train a smaller language model, which we call DistilBERT, from the supervision of BERT 👨‍👦 (we used the English bert-base-uncased version of BERT).

Following Hinton et al., the training loss is a linear combination of the distillation loss and the masked language modeling loss. Our student is a small version of BERT in which we removed the token-type embeddings and the pooler (used for the next sentence classification task) and kept the rest of the architecture identical while reducing the number of layers by a factor of two.

Overall, our distilled model, DistilBERT, has about half the total number of parameters of BERT base and retains 95% of BERT’s performances on the language understanding benchmark GLUE.

❓Note 1 — Why not reduce the hidden size as well?
Reducing it from 768 to 512 would reduce the total number of parameters by a factor of ~2. However, in modern frameworks, most of the operations are highly optimized and variations on the last dimension of the tensor (hidden dimension) have a small impact on most of the operations used in the Transformer architecture (linear layers and layer normalisation). In our experiments, the number of layers was the determining factor for the inference time, more than the hidden size.
Smaller does not necessarily imply faster…

❓Note 2 — Some works on distillation like Tang et al. use the L2 distance as a distillation loss directly on downstream tasks.
Our early experiments suggested that the cross-entropy loss leads to significantly better performance in our case. We hypothesize that in a language modeling setup, the output space (vocabulary) is significantly larger than the dimension of the downstream task output space. The logits may thus compensate for each other in the L2 loss.

Training a sub-network is not only about the architecture. It is also about finding the right initialization for the sub-network to converge (see The Lottery Ticket Hypothesis for instance). We thus initialize our student, DistilBERT, from its teacher, BERT, by taking one layer out of two, leveraging the common hidden size between student and teacher.
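
Taking one layer out of two can be sketched on a plain dictionary of weights. The key pattern `layer.{i}.` and the helper name `init_student_from_teacher` are hypothetical, chosen for illustration; real checkpoints use their own parameter names, and the released code may select layers differently.

```python
import re


def init_student_from_teacher(teacher_weights, step=2):
    """Copy every `step`-th teacher layer into consecutive student layers.

    Keys without a layer index (for example embeddings) are copied unchanged,
    which works because student and teacher share the same hidden size.
    """
    student_weights = {}
    for name, value in teacher_weights.items():
        match = re.match(r"layer\.(\d+)\.(.+)", name)
        if match is None:
            student_weights[name] = value
            continue
        index = int(match.group(1))
        if index % step == 0:
            student_weights[f"layer.{index // step}.{match.group(2)}"] = value
    return student_weights


# A 12-layer toy "teacher" with string placeholders instead of tensors.
teacher = {"embeddings.weight": "E"}
teacher.update({f"layer.{i}.attention.weight": f"W{i}" for i in range(12)})

student = init_student_from_teacher(teacher)
```

In this toy run the student ends up with six layers holding the teacher's layers 0, 2, 4, 6, 8 and 10, plus the shared embeddings.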

We also used a few training tricks from the recent RoBERTa paper which showed that the way BERT is trained is crucial for its final performance. Following RoBERTa, we trained DistilBERT on very large batches leveraging gradient accumulation (up to 4000 examples per batch), with dynamic masking and removed the next sentence prediction objective.

Our training setup is deliberately limited in terms of resources. We train DistilBERT on eight 16GB V100 GPUs for approximately three and a half days using the concatenation of Toronto Book Corpus and English Wikipedia (same data as original BERT).

The code for DistilBERT is adapted in part from Facebook XLM’s code and in part from our PyTorch version of Google AI Bert and is available in our pytorch-transformers library 👾 along with several trained and fine-tuned versions of DistilBert and the code to reproduce the training and fine-tuning.

🎢 Model performances — Testing DistilBERT

We compare the performance of DistilBERT on the development sets of the GLUE benchmark against two baselines: BERT base (DistilBERT’s teacher) and a strong non-transformer baseline from NYU: two BiLSTMs on top of ELMo. We use the jiant library from NYU for ELMo baselines and pytorch-transformers for the BERT baseline.

As shown in the following table, DistilBERT’s performances compare favorably with the baselines while having respectively about half and one third the number of parameters (more on this below). Among the 9 tasks, DistilBERT is always on par or improving over the ELMo baseline (up to 14 points of accuracy on QNLI). DistilBERT also compares surprisingly well to BERT: we are able to retain more than 95% of the performance while having 40% fewer parameters.

Comparison on the dev sets of the GLUE benchmark. ELMo results as reported by the authors. BERT and DistilBERT results are medians of 5 runs with different seeds.

In terms of inference time, DistilBERT is more than 60% faster and smaller than BERT and 120% faster and smaller than ELMo+BiLSTM 🐎

To further investigate the speed-up/size trade-off of DistilBERT, we compare, in the left table, the number of parameters of each model along with the inference time needed to do a full pass on the STS-B dev set on CPU (using a batch size of 1).

🔮 Downstream task: Distillation & transfer-learning

We further study the use of DistilBERT on downstream tasks under efficient inference constraints. We use our compact pre-trained language model by fine-tuning it on a classification task. A nice way to actually mix distillation pre-training and transfer-learning!

Extract from the IMDB Review dataset — Source: Kaggle

We selected the IMDB Review Sentiment Classification which is composed of 50'000 reviews in English labeled as positive or negative: 25'000 for training and 25'000 for test (and with balanced classes). We trained on a single 12GB K80.

First, we train bert-base-uncased on our dataset. Our dear BERT 💋 reaches an accuracy of 93.46% (average of 6 runs) without any hyper-parameters search.

We then train DistilBERT, using the same hyper-parameters. The compressed model reaches an accuracy of 93.07% (average of 6 runs). An absolute difference of 0.4% in performances for a 60% reduction in latency and 40% in size 🏎!

❓Note 3 — As noted by the community, you can reach comparable or better score on the IMDB benchmark with lighter methods (size-wise and inference-wise) like ULMFiT. We encourage you to compare on your own use-case! In particular, DistilBERT can give a sensible lower-bound on Bert’s performances with the advantage of faster training.

Another common application of NLP is Question Answering. We compared the results of the bert-base-uncased version of BERT with DistilBERT on the SQuAD 1.1 dataset. On the development set, BERT reaches an F1 score of 88.5 and an EM (Exact-match) score of 81.2. We train DistilBERT on the same set of hyper-parameters and reach scores of 85.1 F1 and 76.5 EM, within 3 to 5 points of the full BERT.

We also studied whether we could add another step of distillation during the adaptation phase by finetuning DistilBERT on SQuAD using the finetuned BERT model as a teacher with a knowledge distillation loss.

Here we are finetuning by distilling a question answering model into a language model previously pre-trained with knowledge distillation! That’s a lot of teachers and students🎓

In this case, we were able to reach interesting performances given the size of the network: 86.2 F1 and 78.1 EM, i.e. within 3 points of the full model!

Other works have also attempted to accelerate question answering models. Notably, Debajyoti Chatterjee uploaded an interesting work on arXiv which follows a similar method for the adaptation phase on SQuAD (initializing a student from its teacher, and training a question-answering model via distillation). His experiments present similar relative performances with regards to BERT (base uncased). The main difference with our present work is that we pre-train DistilBERT with a general objective (Masked Language Modeling) in order to obtain a model that can be used for transfer-learning on a large range of tasks via finetuning (GLUE, SQuAD, classification…).

🙌 Less is more: smaller models also spark joy 🌟

We are very excited about DistilBERT’s potential. The work we’ve presented is just the beginning of what can be done and raises many questions: How far can we compress these models with knowledge distillation? Can these techniques be used to get further insights into the knowledge stored in the large version? What aspects of linguistic/semantics do we lose in this type of compression?…

One essential aspect of our work at HuggingFace is open-source and knowledge sharing as you can see from our GitHub and Medium pages. We think it is both the easiest and fairest way for everyone to participate and reap the fruits of the remarkable progress of deep learning for NLP.

Thus, together with this blog post, we release the code of our experiments 🎮 (in particular the code to reproduce the training and fine-tuning of DistilBERT) along with a trained version of DistilBERT in our pytorch-transformers library🔥.

Many thanks to Sam Bowman, Alex Wang and Thibault Févry for feedback and discussions!


From TensorFlow to PyTorch

Thomas Wolf — Fri, 09 Aug 2019 13:05:31 GMT

By Omair Khan

Friends and users of our open-source tools are often surprised how fast 🚀 we reimplement the latest SOTA pre-trained TensorFlow models to make them accessible for everyone in our libraries like PyTorch-Transformers 👾 or PyTorch-pretrained-BigGAN 🦋

In this post, you’ll learn the main recipe to convert a pretrained TensorFlow model into a pretrained PyTorch model, in just a few hours.

We’ll take the example of a simple architecture like OpenAI GPT-2 🦄

Doing such a conversion assumes a good familiarity with both TensorFlow and PyTorch but it’s also one of the best ways to get to know both frameworks better!

Looking at the scope structure 🔎

The first step is to retrieve the TensorFlow code and a pretrained checkpoint. Let’s get them from OpenAI GPT-2 official repository:

https://medium.com/media/a115aa6c9dcff38ed5e37376922a0bed/href

TensorFlow checkpoints are usually composed of three files named XXX.ckpt.data-YYY, XXX.ckpt.index and XXX.ckpt.meta:

A trained NLP model should also be provided with a vocabulary to associate the tokens to the embeddings indices (here encoder.json and vocab.bpe). We won’t go into too much detail about the vocabulary and tokenizer here since you can usually directly reuse their original Python code with minor modifications.

First, we can have a look at the hyper-parameters file: hparams.json. It contains a few hyper-parameters like the number of layers/heads and so on:

We can reuse this JSON file in a configuration class for our model.
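
As a minimal sketch of that idea, a configuration class can simply read the JSON keys into attributes. The class name `GPT2HParams` and its `from_json_string` helper are hypothetical, and the key names and values in the example reflect what we understand the small GPT-2 checkpoint to ship with, so check them against your own hparams.json.

```python
import json
from dataclasses import dataclass


@dataclass
class GPT2HParams:
    n_vocab: int  # vocabulary size
    n_ctx: int    # maximum context length
    n_embd: int   # hidden size
    n_head: int   # attention heads per layer
    n_layer: int  # number of Transformer blocks

    @classmethod
    def from_json_string(cls, text):
        """Build a config from the contents of an hparams.json file."""
        return cls(**json.loads(text))


# Example contents (assumed values, see the note above).
hparams = '{"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_layer": 12}'
config = GPT2HParams.from_json_string(hparams)
```

Keeping the attribute names identical to the JSON keys means the PyTorch model can be built from the very file that ships with the TensorFlow checkpoint.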

Now, let’s have a look at the structure of the model. Starting from now, you’ll need to have TensorFlow installed on your computer (can be the CPU version). Once TensorFlow is set up, open a python interpreter to load the checkpoint to inspect the saved variables:

https://medium.com/media/ad038dfd38f57856a4408bac4c2346ce/href

The result is a (long) list of all the variables stored in the checkpoint with their name and shapes:

Variables are stored as NumPy arrays that you can load with tf.train.load_variable(ckpt_dir_or_file, name).

Now, what we are particularly interested in here are the path-like names of the variables, like model/h0/ln_1/b, which reflect the organization of TensorFlow variables in scopes.

Here is our first secret:

To build our PyTorch model as fast as possible, we will reuse exactly the same organization: for each sub-scope in the TensorFlow model, we’ll create a sub-module (a class attribute) under the same name in PyTorch.

This will let us load weights easily by jointly iterating on scopes & classes.

As you can see, GPT-2 has three modules at the root of the model (at the end of the list): model/wte, model/wpe and model/ln_f, and the rest of the model is composed of a series of identical modules hXX, each comprising a self-attention sub-module attn , a feed-forward module mlp and two layer-normalization modules ln_1 and ln_2 .
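To make this structure visible at a glance, you can group the variable names by their first scopes. The sketch below is only an illustration: the list of names is a hand-written sample that follows the pattern described above (it is not the full checkpoint listing), and `group_by_scope` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written sample of variable names (assumed, not the full checkpoint list).
variable_names = [
    "model/h0/attn/c_attn/w",
    "model/h0/ln_1/g",
    "model/h0/ln_1/b",
    "model/h0/mlp/c_fc/w",
    "model/h1/ln_2/g",
    "model/ln_f/g",
    "model/wpe",
    "model/wte",
]


def group_by_scope(names, depth=2):
    """Group path-like names by their first `depth` scopes."""
    groups = defaultdict(list)
    for name in names:
        parts = name.split("/")
        scope = "/".join(parts[:depth])
        # A name with nothing below the scope is the variable itself.
        groups[scope].append("/".join(parts[depth:]) or "<variable>")
    return dict(groups)


for scope, leaves in group_by_scope(variable_names).items():
    print(scope, leaves)
```

Each key of the result corresponds to one module we will have to write (or one entry of a list of modules, for the hXX blocks).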

Now that we know how the model is organized, let’s build our PyTorch model with a hierarchy that reproduces this organization of scopes.

Building the PyTorch model skeleton 👩‍🎨

It’s time to have a look at the TensorFlow code itself. We’ll start with the code for the main model and reproduce the general organization in our PyTorch main model class:

https://medium.com/media/6cdae8fad8b4cb5fb921355a05c65231/href

As you can see, we’ve given our main sub-modules names (wte, wpe, h, ln_f) that are identical to the first-level scopes of the variables we saw in the TensorFlow checkpoint.

We can also write the code for our forward pass by converting the code for the main model from TensorFlow operations to PyTorch operations:

https://medium.com/media/89844e3a9294b6a2289ae9b683f71056/href

Now we dive deeper in the hierarchy, continuing to build our PyTorch model by adapting the rest of the TensorFlow code. Here is another example comparing the TensorFlow code for a “Block” module:

https://medium.com/media/763e6e6e7901d9c0cd69a2e031dd1f5c/href

To the PyTorch equivalent nn.Module class:

https://medium.com/media/951393d1e27c986e99bd392b00ef4d81/href

Here again, the names of the class attributes containing the sub-modules (ln_1, ln_2, attn, mlp) are identical to the associated TensorFlow scope names that we saw in the checkpoint list above. Doing that ensures that the PyTorch attribute hierarchy will be identical to the TensorFlow scope structure.

Beware of the details — section I 🕵️

The computation flow

When you convert TensorFlow code to PyTorch code, you have to take care to reproduce the exact computation workflow of the TensorFlow model in PyTorch. For instance, you should reimplement all the operations, even the ones not associated with a Variable (i.e. not visible in the checkpoint), add the dropout modules at the same places as in the original, and carefully check how to convert each TensorFlow method into an equivalent PyTorch operation.

It’s a good opportunity to dive into the internals of both frameworks to see how each operation is implemented under the hood. One example: TensorFlow and PyTorch layer normalizations are slightly different from each other (go check them out!), so I usually reimplement layer normalization from scratch in PyTorch.
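To give an idea of what "from scratch" means here, below is a minimal sketch of layer normalization over the last axis. It is written with NumPy to stay framework-neutral (porting it to PyTorch tensors is mechanical), and it assumes a biased variance, an epsilon placed inside the square root and a default epsilon of 1e-5. All three are assumptions to verify against the TensorFlow code you are converting.

```python
import numpy as np


def layer_norm(x, g, b, epsilon=1e-5):
    """Normalize x over its last axis, then apply the gain g and the bias b."""
    u = x.mean(axis=-1, keepdims=True)                # mean of each vector
    s = ((x - u) ** 2).mean(axis=-1, keepdims=True)   # biased variance
    # Assumption: epsilon is added to the variance, inside the square root.
    return (x - u) / np.sqrt(s + epsilon) * g + b


x = np.random.default_rng(0).normal(size=(2, 5, 8))
y = layer_norm(x, g=np.ones(8), b=np.zeros(8))
print(y.shape)
```

Writing the few lines yourself makes choices such as where epsilon goes explicit instead of hidden behind a default.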

The initialization and defaults

It’s also important to check the default parameters of each module, like epsilons, and make sure you are using the same ones in PyTorch as in TensorFlow. Be especially careful about default values that may not be visible.

Loading the weights 🏋️

Once the code conversion step is finished and you can run a forward pass on dummy input without any errors with your newly defined PyTorch model, it’s time to load the TensorFlow weights in the newly created model 🐣

Having the same organization in both models makes the loading very easy:

We just jointly iterate on both the path-like names of TensorFlow variables & our PyTorch model attributes.

A commented loading function for GPT-2 looks like this:

https://medium.com/media/ffdcab638e39bbe583c5ec1ba5193856/href
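The core of the joint iteration can be sketched without any framework. In the toy example below, `resolve` is a hypothetical helper and the model is a stand-in built from plain `SimpleNamespace` objects. In a real PyTorch model you would additionally map leaf names such as w, g and b to the weight and bias attributes of your modules, then copy the loaded array into the parameter found.

```python
import re
from types import SimpleNamespace


def resolve(model, tf_name):
    """Follow a path-like TensorFlow variable name down the attributes of `model`."""
    pointer = model
    for part in tf_name.split("/")[1:]:          # skip the leading "model" scope
        if hasattr(pointer, part):               # e.g. "ln_1", "wte"
            pointer = getattr(pointer, part)
            continue
        match = re.fullmatch(r"([A-Za-z]+)(\d+)", part)   # e.g. "h0" -> ("h", "0")
        if match is None:
            raise AttributeError(f"cannot resolve {part!r} in {tf_name!r}")
        name, index = match.groups()
        pointer = getattr(pointer, name)[int(index)]      # pick one block of the list
    return pointer


def make_block(i):
    """Toy stand-in for one block, holding strings instead of parameters."""
    return SimpleNamespace(
        ln_1=SimpleNamespace(g=f"gain of block {i}", b=f"bias of block {i}")
    )


toy_model = SimpleNamespace(
    wte="token embeddings",
    h=[make_block(0), make_block(1)],
    ln_f=SimpleNamespace(g="final gain", b="final bias"),
)

print(resolve(toy_model, "model/h1/ln_1/g"))  # gain of block 1
```

Because the attribute names match the scope names, no hand-written mapping table is needed.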

Let’s talk about a few things to keep in mind at this stage 👇

Beware of the details — section II🕵️

Transposing tensors from TensorFlow to PyTorch

Some TensorFlow operations operate on weights that are transposed with respect to their PyTorch counterparts (or vice versa 😉). In this case, your weight-loading method should take care of transposing the weights when loading them.

The main cases where this happens in practice are dense layers like tf.layers.dense (or Keras’ tf.keras.layers.Dense), whose kernel is the transpose of PyTorch’s nn.Linear weight.
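The relationship is easy to check numerically. The sketch below uses NumPy arrays as stand-ins for the two layers (biases omitted): a dense kernel has shape (in, out) and is applied as x @ kernel, while nn.Linear stores a weight of shape (out, in) and computes x @ weight.T. The variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))          # batch of 2 inputs with 4 features
kernel = rng.normal(size=(4, 3))     # TensorFlow dense kernel: (in, out)

tf_output = x @ kernel               # what the dense layer computes
weight = kernel.T                    # transposed when loading: (out, in)
pt_output = x @ weight.T             # what nn.Linear computes with that weight
assert np.allclose(tf_output, pt_output)

# If the kernel is square, loading it without the transpose raises no shape error.
square_kernel = rng.normal(size=(4, 4))
wrong = x @ square_kernel.T          # runs fine, but the values are wrong
print(np.abs(wrong - x @ square_kernel).max())
```

For a non-square kernel a forgotten transpose fails loudly with a shape mismatch, which is the easy case.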

This transposition issue can be especially tricky to detect for square matrices, which brings us to our last section 👇

The final step —️ comparing the models 👭

Comparing hidden-states 🎼

Now that your model runs and all the weights are initialized with their TensorFlow counterparts, it is time for the most important operation:

a careful comparison of both models!

The way I usually do it is by starting from one script running the TensorFlow model provided by the authors of the original implementation and:

modify the TensorFlow model to output the hidden-states at regular locations along the depth of the model,
modify our PyTorch model to output the hidden-states at the same regular locations along the depth of the model,
load the PyTorch model in parallel with the TensorFlow model and run them on the same inputs,
compare their behaviors during a forward pass to detect where an error may have been made.

You should take care to deactivate the dropout modules and all other nondeterministic modules so that both models produce comparable outputs.

If your script is a fine-tuning script and your model contains weights which are newly initialized, you should take care to fully initialize the PyTorch model from the newly initialized TensorFlow model for a fair comparison. Here is an example of this process during the reimplementation of XLNet in pytorch-transformers where the new TensorFlow model is saved and loaded in PyTorch.

I usually compare the max absolute difference between the hidden-states after each layer of the models on a few real-life inputs:

https://medium.com/media/3d9b2a6d1f493f1d68eeefa085d185da/href
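A comparison of this kind can be as small as the following sketch. It assumes the hidden-states of both models have already been converted to NumPy arrays and collected in two lists, one entry per location. The function name `max_abs_differences` and the fake data are illustrative only.

```python
import numpy as np


def max_abs_differences(tf_states, pt_states):
    """Return the max absolute difference for each pair of hidden-states."""
    if len(tf_states) != len(pt_states):
        raise ValueError("both models must output the same number of hidden-states")
    return [
        float(np.abs(np.asarray(a) - np.asarray(b)).max())
        for a, b in zip(tf_states, pt_states)
    ]


# Fake hidden-states standing in for the outputs of the two models.
rng = np.random.default_rng(0)
tf_states = [rng.normal(size=(1, 6, 8)) for _ in range(3)]
pt_states = [h + 1e-6 for h in tf_states[:2]] + [tf_states[2] + 0.3]

for layer, diff in enumerate(max_abs_differences(tf_states, pt_states)):
    print(f"layer {layer}: max abs diff = {diff:.2e}")
```

The first location where the difference jumps points at the module to inspect. What counts as an acceptable difference depends on the model and its numerical precision.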

Comparing on a down-stream task 🚣

If your model is a pretrained model which can be fine-tuned on a down-stream task, you can further confirm the accuracy of the conversion by reproducing some results on a downstream task.

This task can be quite long as you will need to reproduce the pre-processing, optimization and post-processing of the original author’s work.

In our experience, a discrepancy at this stage, in pretty much every case, doesn’t come from a difference inside the models but from a discrepancy in the way the inputs are prepared, in the optimization parameters (one of the most often over-looked ones being the batch size) or in the post-processing and evaluation metrics.

That’s all folks👭

We’ve seen the main steps you can take to quickly and accurately reimplement a pretrained TensorFlow model in PyTorch.

This method has a few limits:

the model may end up having a deeper hierarchy than necessary. In this case, you can rewrite the model to reduce the number of classes and use a mapping between the TensorFlow variables and the PyTorch attributes 🗺
the model is sometimes implemented with operations that are fast in TensorFlow or on TPU (e.g. multiplication with one-hot matrices) but may be suboptimal in PyTorch. Here again, some rewriting and conversion afterward can help speed up the resulting model in some cases 🏎
You need access to the TensorFlow code for the conversion. It’s possible to convert a TensorFlow model without access to the code, e.g. a model only available on TensorFlow Hub but it’s a far more difficult process. In PyTorch-pretrained-BigGAN we did that by inspecting the raw computation graph and guessing the high-level operations involved 🙃

👾 For detailed code examples of this process, you can have a look at the various models implemented in PyTorch-Transformers.

… and if you feel like adding one of your own, we will probably be more than happy to welcome a Pull Request on the repository! Just ping us beforehand to be sure we are not already working on it 😉

https://medium.com/media/9028cd193efdc5a465b8ac91e4702628/href

🌓 From TensorFlow to PyTorch was originally published in HuggingFace on Medium, where people are continuing the conversation by highlighting and responding to this story.

Scaling a massive State-of-the-Art Deep Learning model in production

Lysandre Debut — Mon, 24 Jun 2019 15:09:18 GMT

Last week, at Hugging Face, we launched a new groundbreaking text editor app. It’s different from traditional text editors in that an NLP model can complete your sentences if you ask it to, bringing a new dimension to “writing with a machine”. It’s based on GPT-2, OpenAI’s language model that can generate syntactically accurate sentences and coherent paragraphs of text.

Telling a story with GPT-2’s help

The demo is live on https://transformer.huggingface.co and you’re welcome to try it out! 🦄 Write with transformer is to writing what calculators are to calculus.

This model is part of the latest trend in NLP, which revolves around creating very large language models that obtain excellent results on a variety of tasks when fine-tuned on those specific tasks. This results in models, “Transformers”, with very large numbers of parameters (up to 1.5 billion parameters for GPT-2 Large, or Grover), which are difficult to handle because of their size.

Our app allows the user to choose between two models: GPT-2 small, and GPT-2 medium. Loading them both in the computer’s RAM takes a total of 2.4GB of memory.

Here we offer to show the approach we took in order to scale these models and respond to the 10,000 unique users and the equivalent of more than a hundred books written we got in the first few days. We explain the thoughts that went into it, define the best fitting architecture for optimal processing and discuss what we could have improved on.

Issue at hand

Disclaimer: our approach here is specific to models that cannot perform batch inference. For models that can do batch inference, like the one we used, the shown workaround may not be necessary.

This app has several constraints in order to be enjoyable by users. It must have the lowest possible response time and generate long-enough sentences. The system must offer several possible completions at each trigger so that the user may choose one of them, tripling the amount of data to be generated. The goal is, therefore, to optimize as best as possible the computation, creating a workflow taking advantage of the highly parallelizable aspect of GPUs.

Setting up our workspace

We’ll be building a server-side API to which our front-end app will connect. This API will be responsible for handling the computation needed to generate sentences. We’ll be using Python for this task, as most NLP models are readily available in it. Lower-level languages such as C++ or Rust would be more appropriate for performance-oriented backends, and we discuss their usage in the last part of this post.

We used falcon for the web servers (any other HTTP framework would have worked too) in conjunction with gunicorn to run our instances and balance the load. Our own GPT-2 PyTorch implementation is the backbone of this project. We have a few examples in our examples directory if you’re interested in doing something similar.

Gunicorn sets up “workers” which will independently run the application, efficiently balancing the load across different workers. You can check exactly how they work on the official gunicorn documentation.
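For reference, a gunicorn configuration file is itself a small Python file. The values below are purely illustrative assumptions (they are not the settings we used); `bind`, `workers` and `timeout` are standard gunicorn settings.

```python
# gunicorn.conf.py -- hypothetical values, adjust them to your machine
bind = "127.0.0.1:8000"   # address the workers listen on
workers = 3               # independent processes, each running its own copy of the app
timeout = 120             # seconds before a silent worker is restarted
```

Such a file is passed with the `-c` option, for instance `gunicorn -c gunicorn.conf.py app:api`, where `app:api` stands for a hypothetical module and application object.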

3-way autocompletion

On each autocompletion request, we want our API to return three different possible sentences. These sentences will be displayed to the end-user, who will then choose one of the three. This is an essential part of our design, and our API must reflect that. The three sentences should appear at the same time, and ideally a single request should be sent to the server for each autocompletion.

The most naïve approach we could have is using a single worker with a model loaded behind:

Naïve API

Using this architecture, every request would be treated sequentially, and the model would be prompted to generate three different sentences before responding to the incoming request.

This infrastructure could be easily scaled up by adding more workers, while keeping in mind that each worker loads its own copy of the model into RAM or VRAM, depending on whether a GPU is used.

Multi-worker naïve API

Using this approach implies that we have several processes, each loading the model and operating on it to produce three different sentences. If our model is able to perform batch inference, it can generate the three sentences at once. However, if it cannot, it needs to generate each sentence individually, resulting in three model iterations. We will be considering the case where batch inference is not available, as it requires a slightly more engineered approach.

It would be better to parallelize the three iterations as we are looking for the lowest response time on autocompletion. Luckily for us, Python gives us access to several parallelization options that could be of use in our scenario:

Multithreading (threading)
Multiprocessing (subprocess or multiprocessing)
Distinct web servers as a form of multiprocessing (our approach)

Multithreading

Multithreading in Python is usually done using the threading module, which allows the program to create several threads that will each go about their respective operations. The problem with multithreading is the way the Global Interpreter Lock — or GIL — works in Python.

If a thread accesses our model object, then no other thread can access that object until the first thread has finished dealing with it. This approach is therefore similar in execution to not using any thread at all, as the three iterations will be treated sequentially. The only performance difference will be the additional time spent starting/joining each thread, which is detrimental to our objective.

If one really wanted to use threading, three different models could be loaded into the RAM, each being used by a separate thread. We did not choose to go this way as explained further below.

Multiprocessing

Multiprocessing can go two ways: either by booting up completely separate processes and connecting to their input/output (using the subprocess module), or by spawning Python processes that can inherit the current Python interpreter process’ resources (bypassing the GIL issue, using the multiprocessing module).

A tricky part here is making sure the model doesn’t have to be loaded into the RAM every time it has to compute an inference; big models take a long time to load in memory.

We chose to take yet another, different approach.

Our approach using gunicorn load balancing

Our approach is slightly different in that we choose to use the power of gunicorn workers to parallelize our work. In order to do so, we add another layer to our previous model. The previously defined architecture can receive several requests and process them all at once on several workers. We will use that to our advantage. The final model is detailed below.

Final model with two different Falcon/Gunicorn servers

When a request is sent from the front-end app to our API, it is handled by our first web server. This web server has a single worker that runs our API. This API is responsible for sending three identical requests to the second web server. The requests sent from this API contain the current context (the previous sentences in the document) as well as some information regarding the parameters (small or medium model, specific top_k values, …).

This second web server has several workers which handle the requests separately. Three workers each handle one of the three requests received from the API, which can, therefore, be processed simultaneously. We use separate threads in the API so that requests can be sent to the second web server in parallel rather than sequentially (the threads are only waiting on HTTP requests, so the GIL is not an issue).
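The fan-out performed by the first server can be sketched as follows. This is not our production code: `request_completion` is a stub that sleeps instead of sending a real HTTP request to the second web server, and its `index` argument only exists so that the stub can label its output.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def request_completion(context, index):
    """Stand-in for a blocking HTTP call to the second web server."""
    time.sleep(0.1)  # simulates network and generation latency
    return f"{context} ... (completion {index})"


def three_completions(context):
    """Send three requests in parallel threads and collect the three answers."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(request_completion, context, i) for i in range(3)]
        # Results are gathered in submission order once each request returns.
        return [future.result() for future in futures]


start = time.perf_counter()
completions = three_completions("Once upon a time")
print(completions)
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Since the threads spend their time waiting on a blocking call, the three waits overlap and the total time stays close to that of a single request.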

This architecture has several advantages that other, previously mentioned methods, do not have out-of-the-box:

We can generate as many workers as the number of models that can fit in our memory. We split the workers among the different GPUs if we have a distributed system.
Each worker loads a single model in memory. Therefore, there may be more models loaded (more computing power) than if three models were loaded each time, such as for the threading approach.
Launched as webserver workers, the models will always stay loaded in memory.
We’re making use of gunicorn’s load balancing at every step in our architecture. We are not simply spawning processes running in parallel, we have a way to make sure each process handles loads relative to its computing capabilities. If we were to use two different GPUs of different computing power, the bottleneck created by the lower computing GPU wouldn’t impact the other one as much as it would in a purely multi-process program.

Here is a GIF showing how the architecture behaves for memory management during initialization and when two concurrent requests are sent to the API.

Initialization and concurrency behavior

Results

Unsurprisingly, we obtain large improvements in response time when using a parallel system compared to the initial sequential system. Benchmarking on a single request which has to be broken up into three model iterations, we get a third of the initial response time, the actual local HTTP request only taking a few hundred microseconds.

This system is particularly adapted to vertical scaling as it adapts to the system’s memory and computing power. However, it does not compare to a model that can perform batch inference as this approach will store three models in the memory, versus a single one if using batch inference.

Further improvements

This system was designed to be run on a single machine, so we didn’t consider containerization or horizontal scaling. These are welcome and necessary in the case of a full-blown production system that needs to handle hundreds of thousands of users. This can be discussed in a future post.

An additional improvement could be the use of TorchScript. Since we used PyTorch for our model, we could export a TorchScript version of it that can run inference outside of Python, for instance from C++. We could, therefore, have written a more task-specific web server in a lower-level language if we wanted to optimize to the fullest.

This system has proven its worth as it held the load until now, handling more than 100,000 different requests in a week’s time while running on a single 4-GPU (K80) machine. If you would like to try out the app and see how our system responds to traffic, you’re welcome to try it out here 🦄

This concludes this quick post on the system architecture we had to optimize for parallel computing, using our big Transformer model in production. All thoughts and claps are welcome!

Scaling a massive State-of-the-Art Deep Learning model in production was originally published in HuggingFace on Medium, where people are continuing the conversation by highlighting and responding to this story.

HuggingFace - Medium

Simple considerations for simple people building fancy neural networks

1. 🙈 Start by putting machine learning aside

2. 📚 Continue as if you just started machine learning

3. 🦸‍♀️ Don’t be afraid to look under the hood of these 5-liners templates

4. 👀 Tune but don’t tune blindly

Sparse Neural Networks (2/N): GPU Performance.

Sparse Neural Networks (2/N): Understanding GPU Performance.

Some physics

Chip design

Bottlenecks

GPU Architecture principles

Hierarchy

Why so many levels? Performance

Why so many levels? Economics

Developing for GPUs

Kernels

Grids and performance

Memory

Inter-GPU memory transfer

Ampere Highlights

Tensor Cores

Sparsity

Conclusion

A brief history of machine translation paradigms

1. Genesis (1933–1945)

2. Rule-based MT (1949–1984)

Early rule-based MT (1949–1967)

Knowledge-based MT (1967–1984)

3. Data-driven MT (1984-present)

Example-based MT (1984–1993)

Statistical MT (1993–2013)

Neural MT (2013-present)

Conclusion

References

Is the future of Neural Networks Sparse? An Introduction (1/N)

From principles to real-world library support.

Hi, I am François Lagunas.

What is a Sparse Matrix?

Where are they from?

Where are they useful?

Why the OpenAI announcement is so important?

Conclusion

More reading

Encoder-decoders in Transformers: a hybrid pre-trained architecture for seq2seq

How to use them with a sneak peak into upcoming features 🕵️‍♀️

Hello 👾 Transformers

🚀 The rise of single-stack architectures

👋 GPT

👋 BERT

👋 The comeback of Encoder-decoder architectures

HuggingFace 🤗❤️ Seq2Seq

🔧 Use encoder-decoder architectures to build amazing things🔧

⌨️ Generate text with Transformers ⌨️

📄 Abstractive summarization with Transformers 📄

How To Write With Transformer

How to Write With an Artificial Intelligence

Creative Writing 1010101

Basic controls

Writing methods

1. Blind devotion

2. Branching path

3. Tag team

4. Rewrites

5. Continuing lists

6. Freeform

Advanced settings

Model size

Temperature

Top-p

Max time

Sharing your writing

Benchmarking Transformers: PyTorch and TensorFlow

Results

Measuring inference

Experiment details & best practices

Discussion

PyTorch and TensorFlow

TorchScript

XLA