Word Processing Approaches— Where Modern NLP Starts

8 min read · Nov 18, 2022

In this article, I give an overview of what word processing approaches are and of the current state of the art. Before diving into those approaches, I introduce Natural Language Processing (NLP) and some important tasks to start with.

Photo by Kelly Sikkema on Unsplash

“Natural” in “Natural Language Processing” allows us to differentiate human languages, such as English or French, from machine languages, such as Python or C++. Modern NLP is the field of Data Science and Artificial Intelligence that uses textual data to solve tasks like text classification, language modeling and translation.

Text is clearly one of the most difficult kinds of data to work with, because the meaning of a word, a sentence or a whole text is hard to pin down. Some words have a different meaning depending on the context of the sentence, and this ambiguity is multiplied many times over in some languages.

Before Modern NLP, engineers who started working on texts looked for a classical set of rules to understand the meaning of any sentence. But languages cannot be summarized with if/then/else rules, and this approach did not perform well.

Modern NLP is pattern recognition applied to words, sentences, and paragraphs (Deep Learning with Python, François Chollet).

This pattern recognition is achieved not by programming rules into machines but by using their computing power. Machines help find the meaning of language by learning from a training dataset in order to solve a specific task.

Before diving into the main text preprocessing approaches, let’s understand how machines can work with data and the main tasks needed in NLP.

From text to vectors — Text Vectorization

Text data cannot be used without modification: a machine processes numerical values and cannot work directly with raw text.

Text vectorization is the process of transforming text into vectors (numeric tensors). It consists of several steps, each as important as the next: text standardization, text tokenization and text indexing.

  1. Text Standardization aims to eliminate unwanted characters and convert the text into a form that is easier to process.
    For example: removing punctuation, special characters and accents, or converting uppercase letters to lowercase.
    Stemming is a more advanced form of standardization: it converts variations of a term into its word stem, which is the base or root of a given word (“fish” is the stem of “fishing”).
  2. Text Tokenization, or splitting, is, as its name suggests, the process of splitting the text into tokens, which can be characters, words or groups of words. N-grams are tokens made of groups of N consecutive words: bi-grams are groups of 2 words.
  3. Text Indexing consists of two steps:
    - First, building a vocabulary, which is an index of all terms found in the entire dataset (corpus).
    - Then, assigning a unique integer to each term in the vocabulary.
    Thus, each word can be replaced by its index in the vocabulary, which can be seen as a huge dictionary.
    Out-of-vocabulary (OOV) terms are terms that are not present in the vocabulary; they are usually assigned index 1.
    Mask tokens are tokens that are deliberately ignored and assigned index 0 (this is generally used to pad sequences that are too short).

Here is the text vectorization process performed on the following sentence: I work as a Data Scientist, do you?

[Figure: the example sentence after standardization, tokenization and indexing]
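The three steps can also be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation of any particular library: the function names, the whitespace tokenizer and the fixed sequence length of 10 are assumptions made for the example, while the reserved indices 0 (mask) and 1 (OOV) follow the convention described above.

```python
import string

MASK_INDEX = 0  # reserved for padding (mask token)
OOV_INDEX = 1   # reserved for out-of-vocabulary terms


def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def tokenize(text):
    """Standardize the text, then split it into word tokens on whitespace."""
    return standardize(text).split()


def build_vocabulary(corpus):
    """Give each distinct token an integer, starting after the reserved indices."""
    vocabulary = {}
    for document in corpus:
        for token in tokenize(document):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 2
    return vocabulary


def vectorize(text, vocabulary, length):
    """Replace tokens by their indices, then truncate or pad to a fixed length."""
    indices = [vocabulary.get(token, OOV_INDEX) for token in tokenize(text)]
    indices = indices[:length]
    return indices + [MASK_INDEX] * (length - len(indices))


vocabulary = build_vocabulary(["I work as a Data Scientist, do you?"])
print(vectorize("I work as a Data Scientist, do you?", vocabulary, 10))
# [2, 3, 4, 5, 6, 7, 8, 9, 0, 0]
print(vectorize("Do you work as an engineer?", vocabulary, 10))
# [8, 9, 3, 4, 1, 1, 0, 0, 0, 0]
```

In the second call, “an” and “engineer” were never seen when building the vocabulary, so both fall back to the OOV index, and the trailing zeros are padding.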

Now that we know how to represent text as vectors, let’s look at the different word preprocessing approaches.

The word order importance —Words Preprocessing Models

Word order differs from one language to another, and even within the same language you can sometimes change the order of the words in a sentence while preserving its meaning.

This is the main distinction between the best-known word preprocessing approaches: bag-of-words models vs. sequence models.

Bag-of-words models

This approach treats a text as an unordered set of words and discards any word order.

A bag refers to the fact that you deal with a set of tokens rather than a list or sequence (Deep Learning with Python).

Thus, you can concatenate all the text and shuffle it without changing the result.
Bag-of-words can also be called bag-of-N-grams, as it can encode N-grams as sets.
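A small sketch may help make the “bag” idea concrete. The `ngrams` helper and the example sentences are made up for the illustration; the point is only that a set forgets order, while a bag of bi-grams keeps a little local order inside each token.

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()
reordered = "on the mat the cat sat".split()

# As bags (sets) of words, both sentences are identical: order is lost.
print(set(tokens) == set(reordered))  # True

# A bag of bi-grams is still an unordered set, but each element
# remembers which two words were next to each other.
bag_of_bigrams = set(ngrams(tokens, 2))
print(sorted(bag_of_bigrams))
```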

The idea is to represent an entire text as a single vector composed of 0s and 1s, but mainly 0s, which creates sparsity. It is basically a huge multi-hot encoding.
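As a rough illustration, a multi-hot vector can be built as follows. The tiny vocabulary and the `multi_hot` function are assumptions made for the example; with a realistic vocabulary of tens of thousands of terms, almost every position would be 0.

```python
def multi_hot(tokens, vocabulary):
    """Return a vector with a 1 at the index of every term present in the text."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector


# Hypothetical vocabulary: each term is mapped to a position in the vector.
terms = ["a", "as", "data", "do", "engineer", "i", "scientist", "work", "you"]
vocabulary = {term: index for index, term in enumerate(terms)}

print(multi_hot("i work as a data scientist".split(), vocabulary))
# [1, 1, 1, 0, 0, 1, 1, 1, 0]
```

Shuffling the input tokens gives exactly the same vector, which is the bag-of-words property described above.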

These models have been augmented and improved with the information given by the frequency of each term or N-gram. But this was not enough, because some words carry little meaning and are present in almost any text: a, the, has… This is where TF-IDF normalization comes in, as a way to make the most of word frequency while limiting the weight of uninformative terms.

TF-IDF stands for Term Frequency — Inverse Document Frequency

Before looking at TF-IDF, let’s define what a document is and what a corpus of documents is. A document can be a single sentence of any length or several sentences (a text), while the corpus is the collection of all documents. For example, a corpus can be a collection of film descriptions.

Now that this is clear, here are the 2 terms of TF-IDF:

  • Term Frequency is the count of a word in a specific document.
  • Inverse Document Frequency is the inverse of the count of documents where a word appears at least once in the entire corpus.

$$\mathrm{IDF}(t) = \log\left(\frac{N}{1 + \mathrm{df}(t)}\right)$$

where N is the number of documents in the corpus and df(t), in the denominator, is the document frequency of the term t. We add 1 to prevent a divide-by-zero error and use the logarithm to dampen the effect of a huge corpus (if N is 10M and the document frequency is 10, the raw ratio would be about 10M/10 = 1M, while its base-10 logarithm is about 6).

Finally, the two terms are multiplied:

$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

TF-IDF gives more information and meaning to the words inside documents, but it doesn’t change the principle of bag-of-words models, which discard word order even though it carries a lot of meaning.
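To see the effect on a small case, here is a minimal sketch of TF-IDF in plain Python. It assumes the definitions given above (raw count for TF, logarithm of N over 1 plus the document frequency for IDF) and a base-10 logarithm, as in the 10M example; the corpus of film descriptions is made up, and real libraries often use slightly different smoothing.

```python
import math


def term_frequency(term, document):
    """Count how many times the term appears in one tokenized document."""
    return document.count(term)


def inverse_document_frequency(term, corpus):
    """log10(N / (1 + df)), where df is the number of documents containing the term."""
    document_frequency = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / (1 + document_frequency))


def tf_idf(term, document, corpus):
    """Weight of a term in a document: term frequency times inverse document frequency."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


# Hypothetical corpus of four short film descriptions, already tokenized.
corpus = [
    "the hero saves the city".split(),
    "the detective solves the case".split(),
    "a robot learns to love".split(),
    "the crew explores a distant planet".split(),
]

print(tf_idf("the", corpus[0], corpus))   # frequent everywhere, so its weight is 0.0
print(tf_idf("hero", corpus[0], corpus))  # rare term, so its weight is log10(2), about 0.30
```

Although “the” appears twice in the first description, it is present in three of the four documents, so its IDF cancels it out, while the rarer “hero” keeps a positive weight.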

Sequence models

While the bag-of-words approach represents a text input as a single fixed representation, sequence models keep the raw word sequences with the aim of finding patterns inside them.

These are the 3 steps of sequence models:

  1. Text indexing: each term of the sequence is represented by an integer index.
  2. Text vectorization: each integer is mapped to a vector, which leads to vector sequences.
  3. Feeding those vector sequences into a stack of layers (RNNs, Transformers) that aims to find patterns in those raw sequences.
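The first two steps can be sketched as a simple table lookup. The vocabulary, the vector size of 4 and the random values are assumptions made for the illustration; the third step (the stack of layers) is not shown here.

```python
import random

random.seed(0)

# Hypothetical vocabulary with the reserved mask (0) and OOV (1) indices.
vocabulary = {"[MASK]": 0, "[OOV]": 1, "i": 2, "work": 3,
              "as": 4, "a": 5, "data": 6, "scientist": 7}

# One vector per vocabulary entry (random placeholders here).
vector_size = 4
vector_table = [[random.uniform(-1, 1) for _ in range(vector_size)]
                for _ in vocabulary]

tokens = "i work as a data scientist".split()

# Step 1: text indexing, each token becomes an integer.
indices = [vocabulary.get(token, vocabulary["[OOV]"]) for token in tokens]

# Step 2: text vectorization, each integer becomes a vector.
vector_sequence = [vector_table[index] for index in indices]

print(indices)                                       # [2, 3, 4, 5, 6, 7]
print(len(vector_sequence), len(vector_sequence[0]))  # 6 4
```

Unlike the bag-of-words vector, the result keeps one vector per token, in the original order.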

However, using one-hot encoding to represent the created vectors is not the best way, because the input dimension would be enormous (as large as the vocabulary). This would result in slow computation and poor performance of the neural network. This is where word embeddings come in and drastically change how words are represented relative to one another.

Indeed, word embeddings are vector representations of words that map human language into a structured geometric space (Deep Learning with Python).

Unlike one-hot encoding, which considers each word independent of the others and therefore represents words as orthogonal vectors, word embeddings assume that the geometric relationship between two word vectors should reflect the semantic relationship between the words, measured by a geometric distance such as cosine similarity.

Word embeddings pack more information into far fewer dimensions than one-hot encoding, as the latter produces binary and sparse vectors.

One big advantage of word embeddings is the information they can incorporate, because they are learned from the similarity of words to one another (structured representation). Two words that mean the same thing should have close vectors.

For example, with vehicle words:

[Figure: word embeddings of vehicle words]
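The contrast between one-hot vectors and embeddings can be checked numerically with cosine similarity. The 2-dimensional embedding values below are made up for the illustration (they do not come from a trained model), and “banana” is added only as a non-vehicle word for comparison.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)


# One-hot vectors: every pair of distinct words is orthogonal.
one_hot = {"car": [1, 0, 0], "truck": [0, 1, 0], "banana": [0, 0, 1]}
print(cosine_similarity(one_hot["car"], one_hot["truck"]))   # 0.0
print(cosine_similarity(one_hot["car"], one_hot["banana"]))  # 0.0

# Made-up dense embeddings: related words point in similar directions.
embeddings = {"car": [0.9, 0.8], "truck": [0.8, 0.9], "banana": [-0.7, 0.2]}
print(cosine_similarity(embeddings["car"], embeddings["truck"]))   # close to 1
print(cosine_similarity(embeddings["car"], embeddings["banana"]))  # negative
```

With one-hot vectors, “car” is exactly as far from “truck” as from “banana”; with embeddings, the geometry can express that the two vehicles are related.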

There are 2 ways to use word embeddings in a neural network:

  • Either by building your own word embeddings based on the task you want to achieve (text classification, translation…).
  • Or by loading pretrained embeddings built from a different machine learning task.

Building your own

The idea is to train a model that includes an embedding layer. This layer starts from random word vectors, which can be seen as the weights of the layer, and adjusts them through backpropagation. At the end of training, the embeddings represent the semantic relationships that are most useful for the given task.
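To make this less abstract, here is a toy sketch in plain Python, not the embedding layer of any real framework. It assumes a made-up sentiment task with four two-word reviews, 2-dimensional embeddings, a single logistic unit on top of the averaged word vectors, and hand-written gradients in place of a framework's backpropagation.

```python
import math
import random

random.seed(0)

# Made-up labelled data: 1 = positive review, 0 = negative review.
data = [("great film".split(), 1), ("good story".split(), 1),
        ("bad film".split(), 0), ("awful story".split(), 0)]

terms = sorted({token for tokens, _ in data for token in tokens})
vocabulary = {term: index for index, term in enumerate(terms)}

dim = 2
# Embeddings start as small random vectors: they are the weights to learn.
embeddings = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocabulary]
weights = [random.uniform(-0.1, 0.1) for _ in range(dim)]


def forward(tokens):
    """Average the token embeddings, then apply a logistic unit."""
    ids = [vocabulary[token] for token in tokens]
    mean = [sum(embeddings[i][k] for i in ids) / len(ids) for k in range(dim)]
    score = sum(w * m for w, m in zip(weights, mean))
    return ids, mean, 1 / (1 + math.exp(-score))


def average_loss():
    """Mean binary cross-entropy over the toy dataset."""
    total = 0.0
    for tokens, label in data:
        p = forward(tokens)[2]
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(data)


initial_loss = average_loss()
learning_rate = 0.5
for _ in range(200):
    for tokens, label in data:
        ids, mean, p = forward(tokens)
        error = p - label  # gradient of the loss with respect to the score
        for k in range(dim):
            grad_weight = error * mean[k]
            grad_embedding = error * weights[k] / len(ids)
            for i in ids:
                embeddings[i][k] -= learning_rate * grad_embedding
            weights[k] -= learning_rate * grad_weight

print(initial_loss, average_loss())
print(forward("great film".split())[2], forward("bad film".split())[2])
```

After training, the vectors of “great” and “bad” have been pushed apart because that is what this particular task needs, which is the sense in which learned embeddings are task-specific.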

Loading pretrained embeddings

The idea behind pretrained embeddings is to transfer what was learned on another task to your own task.

You expect that the features you need are fairly generic — common semantic features and therefore useful even if the pretrained embedding space comes from a different problem (Deep Learning with Python).

The pretraining of those embeddings is not necessarily done with neural networks; it often relies on word-occurrence statistics.

Examples of pretrained embeddings :

Finally, the best sequence models today are Transformers, a neural network architecture that has revolutionized modern NLP.

Conclusion

Modern NLP represents the use of large textual datasets in order to obtain useful patterns to solve specific problems.

Text data needs particular preprocessing, as the features are the words themselves. This processing starts with standardization, which removes unwanted characters, continues with tokenization, which splits the text into tokens or N-grams, and ends with indexing, which associates each word of the vocabulary with a unique integer.

This preprocessing leads to the question of the importance of word order. Indeed, two approaches can be distinguished: bag-of-words and sequence models. The latter is currently the most used for complex NLP tasks, as it considers word order as the basis of learning.

Thank you for reading the article. I hope you enjoyed it and now better understand what text/word preprocessing is! If you are interested in data science and machine learning, check out my other articles here.


Written by Nicolas Pogeant

Data practitioner with a software mindset | Writing about data systems, analytics engineering, automation, and the tools behind modern data work | npogeant.com