SparkNLP is a powerful Python library for a wide range of Natural Language Processing (NLP) tasks, built on top of Apache Spark. It offers high-performance annotators such as StopWordsCleaner, Tokenizer, Chunker, and more. By combining the distributed computing power of Spark with state-of-the-art NLP algorithms, SparkNLP is suitable for both small projects and enterprise-level applications.
In this article, we will explore the functionalities of SparkNLP.
Installation of SparkNLP
Via PyPI:
If you want to install this library using pip, you'll need to install one of its dependencies, pyspark, if it's not already installed.
Below is the full command to install both:
pip install spark-nlp pyspark
Via Google Colab Kernel
If you're working in a Google Colab notebook, you can get started without any manual installation.
Simply run the following command in a notebook cell. It downloads and runs the project's setup script, after which you can start using Spark NLP:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh -O - | bash
Functionalities of SparkNLP
1. Named Entity Recognition (NER)
Named Entity Recognition (NER) is a fundamental NLP task in which we identify and classify entities within a text. These entities can include names of people, organizations, and locations, as well as numerical expressions such as monetary amounts, percentages, and times. SparkNLP offers a pre-trained pipeline for NER called `recognize_entities_dl`, which can be loaded and applied to text in a few lines.
To implement NER using SparkNLP, we will perform the following steps:
- Import Necessary Libraries: Import `sparknlp` and the `PretrainedPipeline` class from `sparknlp.pretrained`.
- Start a Spark Session: Start a Spark session using the `sparknlp.start()` function. Note: The initial startup might take more than a minute.
- Create a Pre-trained Pipeline: Create a pre-trained pipeline by passing the `recognize_entities_dl` name into `PretrainedPipeline()`.
- Define Sample Text: Define a sample text that you want to analyze using the pipeline.
- Annotate the Text: Pass the sample text into the `pipeline.annotate()` function. It returns a dictionary of annotations. The exact keys depend on the pipeline and can include 'entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', and 'sentence'.
- Access Recognized Entities: To print the recognized entities, access the `entities` key of the result dictionary.
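import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with Spark NLP available
spark = sparknlp.start()

# Download (on first use) and load the pre-trained NER pipeline
pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')
text = """
GeeksforGeeks is a popular platform for learning and coding.
Adil Naib, is one of the authors of GeeksForGeeks, has published many articles on topics like Data Science and Machine Learning.
"""

# annotate() returns a dictionary that maps annotation names to lists of strings
result = pipeline.annotate(text)
list(result.keys())  # annotation names (displayed only in an interactive session)
print(result['entities'])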
Output:
recognize_entities_dl download started this may take some time.
Approx size to download 159 MB
[OK!]
['Adil Naib', 'GeeksForGeeks', 'Data Science and Machine Learning']
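The `entities` list above contains only the entity text. To see how it relates to the `token` and `ner` keys, here is a plain-Python sketch that groups tokens into entities, assuming the tags follow the common IOB scheme (`B-` begins an entity, `I-` continues it, `O` marks everything else). The helper name `group_entities` is hypothetical, and the sample tokens and tags are hand-written for illustration; they are not actual pipeline output.

```python
def group_entities(tokens, tags):
    """Group parallel token and IOB-tag lists into (entity text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if any
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # an "O" tag ends any open entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities


# Hand-written sample in the shape of result['token'] and result['ner'] (assumed tags)
sample_tokens = ["Adil", "Naib", "writes", "for", "GeeksForGeeks", "."]
sample_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
print(group_entities(sample_tokens, sample_tags))
# [('Adil Naib', 'PER'), ('GeeksForGeeks', 'ORG')]
```

With a real result, the same helper could be called as `group_entities(result['token'], result['ner'])`, provided both keys are present and have the same length.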
2. Stop Words Removal
Recognizing and removing stop words is an important step in text preprocessing when building language models, because stop words (very common words such as "the", "is", and "of") contribute little to the meaning of a sentence. Removing them reduces noise in the text data and can improve the performance of language models. SparkNLP provides the `StopWordsCleaner` annotator to remove stop words from text.
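Before turning to the annotator, the idea itself can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how `StopWordsCleaner` is implemented; the tiny stop word set, the function name `remove_stop_words`, and its `case_sensitive` parameter are all assumptions made for the example.

```python
# A deliberately tiny stop word list; real lists contain many more words
STOP_WORDS = {"is", "one", "of", "the", "has", "on", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Return the tokens that are not in stop_words."""
    if case_sensitive:
        # "The" and "the" are treated as different words
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]


tokens = ["The", "authors", "of", "GeeksForGeeks", "published", "many", "articles"]
print(remove_stop_words(tokens))
# ['authors', 'GeeksForGeeks', 'published', 'many', 'articles']
print(remove_stop_words(tokens, case_sensitive=True))
# ['The', 'authors', 'GeeksForGeeks', 'published', 'many', 'articles']
```

The second call shows why case handling matters: with case-sensitive matching, the capitalized "The" is not recognized as a stop word.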
To implement stop word removal, we will follow these steps:
- Import Necessary Classes: Import the necessary classes from `sparknlp.base` and `sparknlp.annotator`.
- Create a Sample DataFrame: Use the `spark.createDataFrame()` function to create a DataFrame containing the text data you want to process.
- Set Up the DocumentAssembler:
  - Initialize the `DocumentAssembler`, which will convert the text data into a format that Spark NLP can process.
  - Set the input column to your text data column using `setInputCol("text")`.
  - Set the output column to `"document"` using `setOutputCol("document")`.
- Set Up the Tokenizer:
  - Initialize the `Tokenizer`, which will split the document into individual tokens.
  - Set the input columns to `"document"` using `setInputCols(["document"])`.
  - Set the output column to `"token"` using `setOutputCol("token")`.
- Set Up the StopWordsCleaner:
  - Initialize the `StopWordsCleaner`, which will remove common stop words from the tokens.
  - Set the input columns to `"token"` using `setInputCols(["token"])`.
  - Set the output column to `"cleanTokens"` using `setOutputCol("cleanTokens")`.
  - Optionally, call `setCaseSensitive(False)` if you want to ignore case when removing stop words.
- Create a Pipeline: Create a `Pipeline` with the stages `[document_assembler, tokenizer, stopwords_cleaner]`.
- Fit the Model to the Data: Fit the pipeline to the data using the `fit()` function.
- Transform the Data: Use the `transform()` function to apply the pipeline to your data and get the cleaned tokens.
- Display the Cleaned Tokens:
  - Use the `select()` function to select the `"cleanTokens.result"` column.
  - Use the `show(truncate=False)` function to display the cleaned tokens without truncating the output.
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *

# Sample DataFrame with a single "text" column (uses the Spark session started earlier)
data = spark.createDataFrame([("Adil Naib, is one of the authors of GeeksForGeeks, has published many articles on topics like Data Science and Machine Learning.",)], ["text"])

# Convert raw text into Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Drop stop words, ignoring case
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

pipeline = Pipeline(stages=[document_assembler, tokenizer, stopwords_cleaner])
model = pipeline.fit(data)
result = model.transform(data)
result.select("cleanTokens.result").show(truncate=False)
Output:
[Adil, Naib, ,, one, authors, GeeksForGeeks, ,, published, many, articles, topics, like, Data, Science, Machine, Learning, .]
3. Tokenization
Tokenization breaks text down into individual units called tokens, typically words and punctuation marks. Effective tokenization is important for tasks such as text classification, sentiment analysis, and machine translation. SparkNLP provides the `Tokenizer` annotator, which tokenizes the whole text.
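As a rough illustration of what a tokenizer produces, here is a minimal regex-based sketch in plain Python. It is not the rule set used by Spark NLP's `Tokenizer`; the function name `simple_tokenize` and the pattern are assumptions chosen to keep the example short.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+      matches a run of letters, digits, or underscores (a word)
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("GeeksForGeeks, has published many articles.")
print(tokens)
# ['GeeksForGeeks', ',', 'has', 'published', 'many', 'articles', '.']
```

Note that punctuation marks become tokens of their own, which is why commas and periods appear in token lists.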
We will implement tokenization using the following steps:
- DocumentAssembler Setup: Converts raw text into a format that Spark NLP can process. Input: `"text"` column, Output: `"document"` column.
- Tokenizer Setup:
  - Splits the text into individual tokens (words and punctuation).
  - Input: `"document"` column, Output: `"token"` column.
- Pipeline Creation: Combines `DocumentAssembler` and `Tokenizer` into a sequential process.
- Fit Pipeline: Fits the pipeline on the input data to prepare it for transformation.
- Transform Data: Applies the fitted pipeline to the data, generating tokens.
- Display Tokens: Selects and shows the generated tokens from the `"token.result"` column without truncation.
# Reuses the imports and the `data` DataFrame from the previous example
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])
model = pipeline.fit(data)
result = model.transform(data)
result.select("token.result").show(truncate=False)
Output:
[Adil, Naib, ,, is, one, of, the, authors, of, GeeksForGeeks, ,, has, published, many, articles, on, topics, like, Data, Science, and, Machine, Learning, .]
4. Chunking
Chunking, also known as shallow parsing, groups words into chunks (such as noun phrases) based on their part-of-speech tags and grammatical structure. It is commonly used as a preprocessing step in tasks such as text summarization and sentiment analysis. SparkNLP offers a `Chunker` annotator that builds chunks by matching regex patterns over part-of-speech tags.
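To make the idea of matching patterns over part-of-speech tags concrete, here is a small plain-Python sketch. It is an illustration of the technique under stated assumptions, not the `Chunker` implementation: the function name `chunk_by_pos` is hypothetical, and the part-of-speech tags in the sample are assigned by hand rather than produced by a tagger.

```python
import re


def chunk_by_pos(tagged_tokens, pattern):
    """Return token groups whose tag sequence matches a pattern such as '<DT>?<JJ>*<NN>+'."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>",
    # remembering which token starts at each character offset.
    tag_string = ""
    offset_to_index = {}
    for index, (_, tag) in enumerate(tagged_tokens):
        offset_to_index[len(tag_string)] = index
        tag_string += f"<{tag}>"
    offset_to_index[len(tag_string)] = len(tagged_tokens)

    # Wrap every <TAG> in a non-capturing group so ?, * and + apply to the whole tag
    regex = re.sub(r"<[^>]+>", lambda m: "(?:" + re.escape(m.group(0)) + ")", pattern)

    chunks = []
    for match in re.finditer(regex, tag_string):
        if match.end() > match.start():  # skip empty matches
            first = offset_to_index[match.start()]
            last = offset_to_index[match.end()]
            chunks.append(" ".join(token for token, _ in tagged_tokens[first:last]))
    return chunks


# Hand-tagged sample (the tags are assumptions for illustration)
tagged = [
    ("the", "DT"), ("popular", "JJ"), ("platform", "NN"),
    ("teaches", "VBZ"), ("Machine", "NNP"), ("Learning", "NNP"),
]
print(chunk_by_pos(tagged, "<DT>?<JJ>*<NN>+"))  # ['the popular platform']
print(chunk_by_pos(tagged, "<NNP>+"))           # ['Machine Learning']
```

The first pattern reads as "an optional determiner, any number of adjectives, then one or more nouns"; the second matches runs of proper nouns.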
To implement chunking, we will follow these steps:
- Set Up DocumentAssembler: Convert the text into a document format and use `setInputCol()` and `setOutputCol()` to specify input and output columns.
- Set Up SentenceDetector: Identify sentences within the document and use `setInputCols()` and `setOutputCol()` to define input and output columns.
- Set Up Tokenizer: Split sentences into individual tokens (words) and specify the input and output columns using `setInputCols()` and `setOutputCol()`.
- Set Up PerceptronModel: Tag each token with part-of-speech labels using the pre-trained `PerceptronModel` and set the input and output columns.
- Set Up Chunker: Identify chunks of text based on specified regex patterns, define the input columns for sentences and POS tags, and set the output column.
- Create Pipeline: Combine the stages (`DocumentAssembler`, `SentenceDetector`, `Tokenizer`, `PerceptronModel`, and `Chunker`) into a pipeline.
- Fit Pipeline to Data: Fit the pipeline on the provided data using the `fit()` function.
- Transform Text: Apply the fitted pipeline to the text data to generate the chunks using the `transform()` function.
- Display Chunks: Select and show the generated chunks using `select()` and `show()`.
# Reuses the imports and the `data` DataFrame from the earlier examples
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Pre-trained part-of-speech tagger (downloaded on first use)
pos_tagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

# Each regex describes a sequence of POS tags that forms a chunk
chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunks") \
    .setRegexParsers(["<DT>?<JJ>*<NN>+", "<NNP>+"])

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos_tagger, chunker])
model = pipeline.fit(data)
result = model.transform(data)
result.select("chunks.result").show(truncate=False)
Output:
[Adil Naib, GeeksForGeeks, Data Science, Machine Learning]