SparkNLP: A Comprehensive Guide to NLP Library

SparkNLP is an open-source Natural Language Processing (NLP) library built on top of Apache Spark, with APIs for Python, Scala, and Java. It provides annotators such as StopWordsCleaner, Tokenizer, and Chunker, along with pre-trained models and pipelines. Because it runs on Spark's distributed computing engine, SparkNLP can scale from small projects to enterprise-level applications.

In this article, we will install SparkNLP and explore four of its functionalities: named entity recognition, stop words removal, tokenization, and chunking.

Installation of SparkNLP

Via PyPI:

To install the library with pip, you also need pyspark, which SparkNLP depends on, if it is not already installed.

Below is the full command to install both:

pip install spark-nlp pyspark
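
After installation, a quick check can confirm that both packages are importable. The sketch below uses only the standard library. The helper name check_installed is hypothetical, and it assumes the packages are imported under the module names sparknlp and pyspark.

```python
import importlib.util


def check_installed(packages=("sparknlp", "pyspark")):
    """Report whether each top-level module can be found (hypothetical helper)."""
    # find_spec() returns None when a top-level module is not installed
    return {name: importlib.util.find_spec(name) is not None for name in packages}


for name, found in check_installed().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If either package is reported as missing, rerun the pip command above in the same environment that runs your Python code.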

Via Google Colab Kernel:

If you're working in a Google Colab notebook, a setup script handles the installation for you, so no manual setup is needed.

Run the following command in a notebook cell, and you can start using SparkNLP right away:

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh -O - | bash

Functionalities of SparkNLP

1. Named Entity Recognition (NER)

Named Entity Recognition (NER) is a fundamental NLP task in which we identify and classify entities within a text. These entities can include names of people, organizations, and locations, as well as numerical data such as money, percentages, and time. SparkNLP offers a pre-trained NER pipeline called `recognize_entities_dl`, which can be loaded and applied to text in a few lines.

To implement NER using SparkNLP, we will perform the following steps:

Import Necessary Libraries: Import sparknlp and the PretrainedPipeline class from sparknlp.pretrained.
Start a Spark Session:
- Start a Spark session using the sparknlp.start() function.
- Note: The initial startup might take more than a minute.
Create a Pre-trained Pipeline: Create a pre-trained pipeline by passing the pipeline name recognize_entities_dl to the PretrainedPipeline() constructor.
Define Sample Text: Define a sample text that you want to analyze using the pipeline.
Annotate the Text:
- Pass the sample text into the pipeline.annotate() function.
- This function returns a dictionary of annotations keyed by output column. The exact keys depend on the stages in the pipeline; for recognize_entities_dl they are expected to include 'entities', 'document', 'token', 'ner', 'embeddings', and 'sentence'.
Access Recognized Entities: To print the recognized entities, access the entities key from the result dictionary.

Python

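import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session configured for Spark NLP
spark = sparknlp.start()

# Download (on first use) and load the pre-trained NER pipeline
pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')

text = """
GeeksforGeeks is a popular platform for learning and coding.
Adil Naib, is one of the authors of GeeksForGeeks, has published many articles on topics like Data Science and Machine Learning.
"""

# annotate() returns a dictionary mapping each output column to a list of strings
result = pipeline.annotate(text)

# Available annotation types (displayed only when evaluated in its own notebook cell)
list(result.keys())

# Recognized entities
print(result['entities'])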

Output:

recognize_entities_dl download started this may take some time.
Approx size to download 159 MB
[OK!]

['Adil Naib', 'GeeksForGeeks', 'Data Science and Machine Learning']
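
The 'ner' key of the result typically holds one IOB-style tag per token (for example B-PER, I-PER, O), while 'entities' holds the merged spans. The sketch below illustrates, in plain Python, how such tags can be merged into entities. It is a simplified illustration rather than SparkNLP's own implementation, and the function name iob_to_entities, the sample tokens, and the sample tags are all hypothetical.

```python
def iob_to_entities(tokens, tags):
    """Merge tokens tagged B-X / I-X into (text, label) entity spans."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity in progress, if there is one
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        label = tag[2:] if tag[:2] in ("B-", "I-") else None
        if tag.startswith("I-") and label == current_label:
            # Continuation of the current entity
            current_tokens.append(token)
        else:
            # "O", "B-X", or an "I-X" that does not continue the current entity
            flush()
            current_tokens = [token] if label else []
            current_label = label
    flush()
    return entities


# Hypothetical tokens and tags, for illustration only
tokens = ["Adil", "Naib", "writes", "for", "GeeksForGeeks", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
print(iob_to_entities(tokens, tags))
# [('Adil Naib', 'PER'), ('GeeksForGeeks', 'ORG')]
```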

2. Stop Words Removal

Recognizing and removing stop words is a common text preprocessing step when building language models, because words such as "the", "is", and "of" contribute little to the meaning of a sentence. Removing them reduces noise in the text data and can improve model performance. SparkNLP provides the `StopWordsCleaner` annotator to remove stop words from text.

To implement stop word removal, we will follow these steps:

Import Necessary Classes: Import the necessary classes from sparknlp.base and sparknlp.annotator.
Create a Sample DataFrame: Use the spark.createDataFrame() function to create a DataFrame containing the text data you want to process.
Set Up the DocumentAssembler:
- Initialize the DocumentAssembler, which will convert the text data into a format that Spark NLP can process.
- Set the input column to your text data column using setInputCol("text").
- Set the output column to "document" using setOutputCol("document").
Set Up the Tokenizer:
- Initialize the Tokenizer, which will split the document into individual tokens.
- Set the input columns to "document" using setInputCols(["document"]).
- Set the output column to "token" using setOutputCol("token").
Set Up the StopWordsCleaner:
- Initialize the StopWordsCleaner, which will remove common stopwords from the tokens.
- Set the input columns to "token" using setInputCols(["token"]).
- Set the output column to "cleanTokens" using setOutputCol("cleanTokens").
- Optionally, set setCaseSensitive(False) if you want to ignore case when removing stopwords.
Create a Pipeline: Create a Pipeline with the stages [document_assembler, tokenizer, stopwords_cleaner].
Fit the Model to the Data: Fit the pipeline model to the data using the fit() function.
Transform the Data: Use the transform() function to apply the pipeline to your data and get the cleaned tokens.
Display the Cleaned Tokens:
- Use the select() function to select the "cleanTokens.result" column.
- Use the show(truncate=False) function to display the cleaned tokens without truncating the output.

Python

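from pyspark.ml import Pipeline  # Pipeline is a Spark ML class, so import it explicitly
from sparknlp.base import *
from sparknlp.annotator import *

# Single-row DataFrame with a "text" column; `spark` is the session started earlier with sparknlp.start()
data = spark.createDataFrame([("Adil Naib, is one of the authors of GeeksForGeeks, has published many articles on topics like Data Science and Machine Learning.",)], ["text"])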

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

pipeline = Pipeline(stages=[document_assembler, tokenizer, stopwords_cleaner])

model = pipeline.fit(data)
result = model.transform(data)

result.select("cleanTokens.result").show(truncate=False)

Output:

[Adil, Naib, ,, one, authors, GeeksForGeeks, ,, published, many, articles, topics, like, Data, Science, Machine, Learning, .]
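
To make the effect of case sensitivity concrete, here is a minimal plain-Python sketch of stop word filtering. It is not how StopWordsCleaner is implemented. The function name remove_stop_words is hypothetical, and STOP_WORDS is a tiny illustrative list rather than the annotator's default list.

```python
# Tiny illustrative stop word list (an assumption, not SparkNLP's default list)
STOP_WORDS = {"is", "of", "the", "has", "on", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop tokens that appear in stop_words, optionally ignoring case."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]


sample = ["The", "authors", "of", "GeeksForGeeks"]
print(remove_stop_words(sample))                       # ['authors', 'GeeksForGeeks']
print(remove_stop_words(sample, case_sensitive=True))  # ['The', 'authors', 'GeeksForGeeks']
```

With case-insensitive matching, "The" is removed even though the list only contains "the". With case-sensitive matching it is kept.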

3. Tokenization

Tokenization breaks text down into smaller units called tokens, typically words and punctuation marks. It is a prerequisite for tasks such as text classification, sentiment analysis, and machine translation. SparkNLP provides the `Tokenizer` annotator to tokenize text.

We will implement tokenization using the following steps:

DocumentAssembler Setup:
- Converts raw text into a format that SparkNLP can process.
- Input: "text" column, Output: "document" column.
Tokenizer Setup:
- Splits the text into individual tokens (words).
- Input: "document" column, Output: "token" column.
Pipeline Creation: Combines DocumentAssembler and Tokenizer into a sequential process.
Fit Pipeline: Calls fit() on the input data to produce a fitted pipeline model that is ready for transformation.
Transform Data: Applies the fitted pipeline to the data, generating tokens.
Display Tokens: Selects and shows the generated tokens from the "token.result" column without truncation.

Python

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])

model = pipeline.fit(data)
result = model.transform(data)

result.select("token.result").show(truncate=False)

Output:

[Adil, Naib, ,, is, one, of, the, authors, of, GeeksForGeeks, ,, has, published, many, articles, on, topics, like, Data, Science, and, Machine, Learning, .]
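
The output shows that punctuation marks become separate tokens. A rough plain-Python approximation of this behavior is a regular expression that matches either a run of word characters or a single punctuation mark. This is only a sketch: the Tokenizer annotator applies its own configurable rules, and the function name simple_tokenize is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sentence = (
    "Adil Naib, is one of the authors of GeeksForGeeks, has published many "
    "articles on topics like Data Science and Machine Learning."
)
tokens = simple_tokenize(sentence)
print(tokens)
```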

4. Chunking

Chunking, also known as shallow parsing, groups words into phrases (chunks), such as noun phrases, based on their part-of-speech tags and grammatical structure. These chunks capture meaning that individual tokens miss, and chunking is commonly used as a step in tasks such as text summarization and sentiment analysis. SparkNLP offers a `Chunker` annotator that extracts chunks matching part-of-speech patterns.

To implement chunking, we will follow these steps:

Set Up DocumentAssembler: Convert the text into a document format and use setInputCol() and setOutputCol() to specify input and output columns.
Set Up SentenceDetector: Identify sentences within the document and use setInputCols() and setOutputCol() to define input and output columns.
Set Up Tokenizer: Split sentences into individual tokens (words) and specify the input and output columns using setInputCols() and setOutputCol().
Set Up PerceptronModel: Tag each token with part-of-speech labels using the pre-trained PerceptronModel and set the input and output columns.
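Set Up Chunker: Identify chunks whose part-of-speech tags match the regex patterns passed to setRegexParsers(). Here, "<DT>?<JJ>*<NN>+" matches an optional determiner, any number of adjectives, and one or more nouns, while "<NNP>+" matches one or more proper nouns. Define the input columns for sentences and POS tags, and set the output column.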
Create Pipeline: Combine the stages (DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, and Chunker) into a pipeline.
Fit Pipeline to Data: Train the pipeline on the provided data using the fit() function.
Transform Text: Apply the trained pipeline to the text data to generate the chunks using the transform() function.
Display Chunks: Select and show the generated chunks using select() and show().

Python

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos_tagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunks") \
    .setRegexParsers(["<DT>?<JJ>*<NN>+", "<NNP>+"])

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos_tagger, chunker])

model = pipeline.fit(data)
result = model.transform(data)

result.select("chunks.result").show(truncate=False)

Output:

[Adil Naib, GeeksForGeeks, Data Science, Machine Learning]
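
To show how a pattern such as "<DT>?<JJ>*<NN>+" selects chunks, the sketch below matches tag patterns against a tagged sentence in plain Python. It is a simplified illustration of regex matching over POS tags, not the Chunker's actual implementation. The function name chunk_by_pos is hypothetical, and the tokens and tags are made up for the example rather than produced by PerceptronModel.

```python
import re


def chunk_by_pos(tokens, tags, pattern):
    """Return the token groups whose POS-tag sequence matches `pattern`."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>"
    encoded = "".join(f"<{tag}>" for tag in tags)

    # Map the character offset where each tag starts to its token index
    offset_to_index = {}
    offset = 0
    for index, tag in enumerate(tags):
        offset_to_index[offset] = index
        offset += len(tag) + 2  # +2 for the surrounding "<" and ">"

    # Wrap every <TAG> in a non-capturing group so ?, * and + apply to whole tags
    regex = re.compile(
        re.sub(r"<[^>]+>", lambda m: "(?:" + re.escape(m.group(0)) + ")", pattern)
    )

    chunks = []
    for match in regex.finditer(encoded):
        if match.end() == match.start():
            continue  # skip empty matches
        first = offset_to_index[match.start()]
        count = match.group(0).count("<")  # number of tags in the match
        chunks.append(" ".join(tokens[first:first + count]))
    return chunks


# Hypothetical tagged sentence, for illustration only
tokens = ["the", "quick", "brown", "fox", "visited", "New", "York"]
tags = ["DT", "JJ", "JJ", "NN", "VBD", "NNP", "NNP"]

print(chunk_by_pos(tokens, tags, "<DT>?<JJ>*<NN>+"))  # ['the quick brown fox']
print(chunk_by_pos(tokens, tags, "<NNP>+"))           # ['New York']
```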
