Streaming Transformations
Transformation Gems
Transform Gems that generate code but do not work within Streaming Applications include OrderBy
, Limit
, and Window
(partitionBy).
Although Window
would work with a watermarked column (explained below) as part of the partitioning, it is advised to use window()
or session_window()
from the pyspark.sql.functions package
(link).
Watermarking
Watermarking is a technique that enables aggregations on streaming data by limiting the state over which the aggregation is performed. In order to prevent out-of-memory errors, we have introduced support for watermarking. More information on watermarking is available in the Spark documentation here
We have added a Watermarking Gem in the Transform Section that allows a user to add a Watermark to a DataFrame.
In this example, we add Watermarking to the timestamp
column. A user may enter the column name or select one from the Schema Table on the left. The text box is editable. Finally, define the Watermark Duration.
It is recommended to use Watermarking on a Streaming DataFrame in case you're planning to use any of the following operations on it: