Simplified Data Pipelines with Pulsar Transformation Functions
They provide a low-code way to develop basic processing and routing of data using existing Pulsar features.
Apr 18th, 2023 8:53am
DataStax sponsored this post.
Processing data between Pulsar topics usually means choosing one of two options:
- Build your own service with one of the Pulsar clients that will consume from a topic, process the message and publish the result to another topic. A lot of boilerplate code needs to be written for this.
- Use a full-fledged stream processing engine such as Apache Flink or Apache Spark. These technologies are very advanced and support SQL so you don’t need to write a lot of code. But that’s another technology to deploy in your stack that has its own maintenance burden and cost of acquisition. Flink and Spark are useful for complex real-time analytics but they are overkill for simple cases such as removing or renaming a field in a structured message.
Transformation Functions were designed to:
- Provide a low-code solution to develop basic processing and routing of data.
- Use existing Pulsar features that don’t require anything more than standard Pulsar.
- Run in memory, in front of a sink, so you don’t have to use an intermediate topic. (This feature comes from PIP 193.)
About Transformation Functions
A Transformation Function is essentially a regular Pulsar Function written in Java that bundles a suite of commonly used operations. Similar to connectors and other Pulsar artifacts, Transformation Functions are packaged as a NAR and can be deployed in a Pulsar cluster using the pulsar-admin CLI or as a built-in function. Transformation Functions can be “connected” together to perform multiple-step processes and can include a “when” conditional to skip certain steps in the flow. Because it’s a Pulsar Function, no add-ons or extensions are needed to use it. A function can be deployed quickly to a Pulsar standalone instance or to a fully functioning production cluster.
When you create an instance of a function, you pass a JSON-formatted configuration. The configuration contains the list of operations to apply in series on the data. As a low-code solution, the only “language” you need to know is the basic DSL (domain-specific language) used by the configuration.
A transformation function that doubles the input value
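To illustrate the idea of steps applied in series with an optional “when” condition, here is a minimal Python sketch. It is not how the Pulsar function is implemented: the `run_steps` helper, the `apply` key and the use of Python callables in place of expression strings are all assumptions made for illustration.

```python
def run_steps(record, steps):
    """Apply steps in order; skip a step when its 'when' predicate is false.

    A step that returns None drops the record and stops processing.
    """
    for step in steps:
        condition = step.get("when")
        if condition is not None and not condition(record):
            continue  # condition not met: leave the record untouched
        record = step["apply"](record)
        if record is None:
            return None
    return record


# Hypothetical steps: double a value, drop large values, then tag the record.
steps = [
    {"apply": lambda r: {**r, "amount": r["amount"] * 2}},
    {"apply": lambda r: None, "when": lambda r: r["amount"] > 100},
    {"apply": lambda r: {**r, "checked": True}},
]

small = run_steps({"amount": 10}, steps)
large = run_steps({"amount": 80}, steps)
print(small)
print(large)
```

In this sketch the second step only runs when its condition holds, which mirrors how a “when” clause lets a configured step be skipped for some messages.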
Function Operations
Available Transformation Functions include:
- Cast: modifies the key or value schema to a target-compatible schema.
- Drop-fields: drops fields from structured data.
- Merge-key-value: merges the fields of key-value records where both the key and value are structured data with the same schema type.
- Unwrap-key-value: if the record is a key-value, extracts the key-value’s key or value and makes it the record value.
- Flatten: flattens structured data.
- Drop: drops a record from further processing.
- Compute: computes new field values on the fly or replaces existing ones.
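To make two of these operations concrete, the following Python sketch shows what dropping fields and flattening do to a record modeled as a plain dictionary. It is an illustration only, not the Pulsar implementation; the helper names, the sample record and the `_` delimiter used to join nested field names are assumptions.

```python
def drop_fields(record, fields):
    """Return a copy of the record without the named top-level fields."""
    return {k: v for k, v in record.items() if k not in fields}


def flatten(record, delimiter="_", prefix=""):
    """Flatten nested dictionaries into one level, joining names with the delimiter."""
    flat = {}
    for name, value in record.items():
        full_name = f"{prefix}{delimiter}{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, delimiter, full_name))
        else:
            flat[full_name] = value
    return flat


# Hypothetical structured record with one nested field.
user = {
    "name": "Ada",
    "password": "secret",
    "address": {"city": "Paris", "zip": "75001"},
}
cleaned = flatten(drop_fields(user, ["password"]))
print(cleaned)
```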
Example Configuration
Here is an example of connecting multiple functions together in series, to manipulate message data:
{
"steps": [
{"type": "drop-fields", "fields": "password", "part": "value"},
{"type": "merge-key-value"},
{"type": "unwrap-key-value"},
{"type": "cast", "schema-type": "STRING"}
]
}
Suppose the input is a key-value message whose key contains a userId field and whose value contains firstName, lastName and password fields. The function would automatically perform the following steps on the message data:
- Drop the “password” field from processing
- Merge the userId key-value with the rest of the fields
- Unwrap the value out of the key-value object
- Cast to a string type and return
The output is a string containing userId, firstName and lastName.
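The four steps can be traced with a small Python simulation. This is a sketch under stated assumptions: the record is modeled as a dictionary with `key` and `value` parts, the sample values are invented, and the cast to STRING is approximated with JSON serialization, which may differ from the exact string the real function produces.

```python
import json


def drop_fields(record, fields, part):
    """Remove the named fields from the 'key' or 'value' part of the record."""
    kept = {k: v for k, v in record[part].items() if k not in fields}
    return {**record, part: kept}


def merge_key_value(record):
    """Copy the key's fields into the value."""
    return {**record, "value": {**record["key"], **record["value"]}}


def unwrap_key_value(record):
    """Keep only the value part as the new record."""
    return record["value"]


def cast_to_string(value):
    """Approximate a cast to STRING with JSON serialization (an assumption)."""
    return json.dumps(value)


# Hypothetical input message.
message = {
    "key": {"userId": 42},
    "value": {"firstName": "Ada", "lastName": "Lovelace", "password": "secret"},
}

step1 = drop_fields(message, ["password"], part="value")
step2 = merge_key_value(step1)
step3 = unwrap_key_value(step2)
result = cast_to_string(step3)
print(result)
```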
Transformation Function Compute Operation
Among all the operation types, one that is particularly powerful is the “compute” operation. It is used to create or update message values, properties or metadata with an expression. The expression can take input from fixed values, message values, properties or metadata. The expression language features:
- Arithmetic operations: +, - (binary), *, / and div, % and mod, - (unary)
- Logical operations: and, &&, or, ||, not, !
- Relational operations: ==, eq, !=, ne, <, lt, >, gt, <=, le, >=, ge
- Utility functions: uppercase, contains, trim, concat, coalesce, now, dateadd
- Referencing values from: key (for key-value), value, messageKey, topicName, destinationTopic, eventTime, properties
- Referencing nested values of structured key and value (such as `value.my_value_field`)
{
"steps": [
{
"type": "compute",
"fields": [
{
"name": "destinationTopic",
"expression" : "fn:concat('routed-', messageKey)"
}
]
}
]
}
With this configuration, a message with the key bar will be published to the topic routed-bar.
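The routing behavior can be mimicked in a few lines of Python. This is only a sketch of the semantics: the `apply_compute` helper and the sample context are hypothetical, and a Python callable stands in for the expression string that the real function evaluates.

```python
def apply_compute(context, fields):
    """Apply each computed field to a copy of the message context.

    Each field pairs a target name with a callable standing in for
    the expression; the input context is left unchanged.
    """
    updated = dict(context)
    for field in fields:
        updated[field["name"]] = field["expression"](updated)
    return updated


# Hypothetical message context before the compute step runs.
context = {"messageKey": "bar", "value": "hello", "destinationTopic": "default-out"}

# Stand-in for the expression fn:concat('routed-', messageKey).
routing_step = [
    {"name": "destinationTopic", "expression": lambda ctx: "routed-" + ctx["messageKey"]}
]

routed = apply_compute(context, routing_step)
print(routed["destinationTopic"])
```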
Taking Transformation Functions Further
Let’s take a concrete example and see how Transformation Functions make things so much easier. For this example, we’ll refer to the use cases from a previous blog post “Developing and Running Serverless Apache Pulsar Functions.” In this post, three functions were written:
- enricher: takes a byte array input, converts it to string and adds an “EUR” suffix.
- filter: takes a String input, extracts the first word (up to a space), converts it to double and filters values that are below a configurable threshold.
- content-based router: takes a Double input and routes values below 1,000 to the topic `cbr-low` and values above 1,000 to the topic `cbr-high` after converting them to String.
With Transformation Functions, the enricher can be expressed as this configuration:
{
"steps": [
{
"type": "compute",
"fields": [{"name": "value", "expression": "fn:concat(value, ' EUR')"}]
}
]
}
The filter, with the threshold set to 123.45:
{
"steps": [
{
"type": "compute",
"fields": [{"name": "value", "expression": "fn:replace(value, ' .*', '')"}]
},
{
"type": "drop",
"when": "value < 123.45"
}
]
}
The content-based router, which redirects values below 1,000 to `cbr-low` and leaves other values on the function’s default output topic:
{
"steps": [
{
"type": "compute",
"fields": [{ "name": "destinationTopic", "expression": "'persistent://cbornet-examples/default/cbr-low'"}],
"when": "value < 1000"
}
]
}
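As a sanity check on the three configurations above, here is a Python sketch of the same logic chained together. It is an approximation of the intended behavior, not the functions themselves: the helper names are hypothetical, the default output topic is taken from the `--output` flag of the router’s deployment command in this post, and values are compared as floats.

```python
import re

THRESHOLD = 123.45
LOW_TOPIC = "persistent://cbornet-examples/default/cbr-low"
HIGH_TOPIC = "cbornet-examples/default/cbr-high"  # the router's default output


def enrich(value):
    """Mimic fn:concat(value, ' EUR')."""
    return f"{value} EUR"


def filter_step(value, threshold=THRESHOLD):
    """Keep the text before the first space; return None to drop the record."""
    first_word = re.sub(r" .*", "", value)
    return None if float(first_word) < threshold else first_word


def route(value):
    """Send values below 1,000 to the low topic, others to the default output."""
    return LOW_TOPIC if float(value) < 1000 else HIGH_TOPIC


def pipeline(raw):
    """Chain enricher, filter and router; None means the record was dropped."""
    kept = filter_step(enrich(raw))
    if kept is None:
        return None
    return route(kept), kept


for raw in ["50", "500", "5000"]:
    print(raw, "->", pipeline(raw))
```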
Deploying the Functions on Astra Streaming
Transformation Functions are built into DataStax’s managed Pulsar platform, Astra Streaming. You deploy them as a standard function, declaring the function-type as transforms. Continuing from the example blog functions, we can deploy those transformations with the following commands using the pulsar-admin CLI.
The “enricher” function:
bin/pulsar-admin functions create \
--function-type transforms \
--name enricher \
--inputs cbornet-examples/default/enricher-in \
--output cbornet-examples/default/enricher-out \
--user-config "{\"steps\": [{ \"type\": \"compute\", \"fields\": [{ \"name\": \"value\", \"expression\": \"fn:concat(value, ' EUR')\" }] }] }" \
--tenant cbornet-examples \
--namespace default \
--auto-ack true
The “filter” function:
bin/pulsar-admin functions create \
--function-type transforms \
--name filter \
--inputs cbornet-examples/default/enricher-out \
--output cbornet-examples/default/filter-out \
--user-config "{\"steps\": [{ \"type\": \"compute\", \"fields\": [{ \"name\": \"value\", \"expression\": \"fn:replace(value, ' .*', '')\"}] }, { \"type\": \"drop\", \"when\": \"value < 123.45\" } ]}" \
--tenant cbornet-examples \
--namespace default \
--auto-ack true
The “content-based router” function:
bin/pulsar-admin functions create \
--function-type transforms \
--name cbr \
--inputs cbornet-examples/default/filter-out \
--output cbornet-examples/default/cbr-high \
--user-config "{\"steps\": [{\"type\": \"compute\", \"fields\": [{\"name\": \"destinationTopic\", \"expression\": \"'persistent://cbornet-examples/default/cbr-low'\"}], \"when\": \"value < 1000\"} ]}" \
--tenant cbornet-examples \
--namespace default \
--auto-ack true
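Escaping the JSON by hand in `--user-config` is error-prone. One possible approach, sketched here in Python, is to build the configuration as a dictionary and let the standard library serialize and quote it. The script only prints the command and does not run it; the flags and topic names are the ones used for the enricher above, and the variable names are illustrative.

```python
import json
import shlex

# The enricher configuration, written as a regular Python dictionary.
enricher_config = {
    "steps": [
        {
            "type": "compute",
            "fields": [{"name": "value", "expression": "fn:concat(value, ' EUR')"}],
        }
    ]
}

args = [
    "bin/pulsar-admin", "functions", "create",
    "--function-type", "transforms",
    "--name", "enricher",
    "--inputs", "cbornet-examples/default/enricher-in",
    "--output", "cbornet-examples/default/enricher-out",
    "--user-config", json.dumps(enricher_config),
    "--tenant", "cbornet-examples",
    "--namespace", "default",
    "--auto-ack", "true",
]

# shlex.join quotes each argument so the result can be pasted into a shell.
command = shlex.join(args)
print(command)
```

Splitting the printed command with `shlex.split` yields the original argument list, so the JSON reaches the CLI unchanged.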
Getting Started with Transformation Functions
This first set of operations is based on the use cases we have seen in the field. We know there are many more operations that could be added. Please provide your feedback and make suggestions in the issue tracker of the project.
If you want to test this feature quickly, you can get a free Pulsar instance in under a minute with Astra Streaming. This instance will have the Transformation Functions built in, so you can immediately create transformation instances with the pulsar-admin CLI. The feature is already used in production by customers.
Astra Streaming and Luna Streaming 2.10 can also bundle the function with a sink and perform the transformation in memory (see PIP 193), which is a great way to reduce storage costs and improve latency by avoiding an intermediate topic. The ability to bundle a function with a sink will also be part of the Apache Pulsar project, starting in version 3.0.
Learn more about DataStax.