Pipelines
Pipelines are groups of data transformations that you can build from a visual or code interface. When using the visual interface, each component of a Pipeline is automatically compiled into code that you can reuse and customize.
Under the hood, Pipelines compile to Spark-native code. Pipelines are ideal for Spark environments like Databricks or EMR, particularly for tasks such as complex data ingestion (e.g., loading data from Salesforce or JDBC), advanced data transformations (e.g., working with complex data types), and machine learning workflows.
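For a sense of what Spark-native Pipeline code looks like, here is a minimal PySpark sketch of a Pipeline that ingests a table over JDBC and flattens a nested column. The connection options, table names, and columns are hypothetical placeholders, not Prophecy's actual generated output:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_pipeline").getOrCreate()

# Ingest: load a table over JDBC (URL, table, and credentials are placeholders).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "reporting")
    .option("password", "<password>")
    .load()
)

# Transform: flatten a nested struct column and aggregate -- the kind of
# complex-type work a visual Pipeline compiles down to.
daily_totals = (
    orders
    .withColumn("city", F.col("shipping_address.city"))
    .groupBy("city", F.to_date("ordered_at").alias("day"))
    .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").saveAsTable("analytics.daily_totals")
```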
Actions
- To view a list of Pipelines in Prophecy, navigate to the Metadata page from the left sidebar.
- To create a new Pipeline, navigate to the Create Entity page from the left sidebar.
Parameters
| Field | Description |
|---|---|
| Project | The Project to create the Pipeline in. This controls access to the Pipeline, groups Pipelines together, and lets you use the Datasets in the Project. |
| Branch | The Git branch to use for Pipeline development. |
| Name | The name of the Pipeline. |
| Mode | Whether the Pipeline runs in Batch or Streaming mode. |
| Description | A field to describe the purpose of the Pipeline. |
Edit a Pipeline
When editing a Pipeline, you'll be using the editor shown below.
- Pipeline Config: Buttons that bring up views related to the configuration of your Pipeline:
- Config
- Scheduling
- Unit Tests
- UDFs/UDAFs
- Dependencies
- Spark Version
- Visual Language
- View switch: Switch between the Visual view and the Code view. The generated code only updates when no errors are detected in the Visual view.
- Fabric switch: Select the Fabric to use during development; start a new cluster or connect to an existing one.
- Gem drawer: Click to select a category, then click the Gem you want to insert into the editor. See here for more information on Gems; a sketch of how Gems translate to code follows this list.
- Workspace: Everything in the middle of the view is the Workspace. This is where you'll create and manipulate the Gems that make up this Pipeline.
- Git status: In Prophecy, almost everything is stored in Git, and this is where you interact with version control for your Pipeline. See Projects & Git for more information.
- Run: If you just want to run your Pipeline from start to finish, click this button.
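To connect the visual pieces above to the code they produce: conceptually, each Gem in the Workspace maps to a function that takes and returns DataFrames, with values from the Config view and functions from the UDFs/UDAFs panel available inside them. The sketch below is illustrative only; the function names, UDF, tables, and threshold are assumptions, not Prophecy's exact generated code:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# A UDF like those managed in the UDFs/UDAFs panel (hypothetical example).
normalize_country = F.udf(lambda c: c.strip().upper() if c else None, StringType())

# Each Gem becomes a function that takes and returns DataFrames, so the
# Workspace graph maps directly onto a chain of function calls.
def clean_customers(spark: SparkSession, customers: DataFrame) -> DataFrame:
    return customers.withColumn("country", normalize_country(F.col("country")))

def filter_active(spark: SparkSession, customers: DataFrame) -> DataFrame:
    # A value set in the Config view could parameterize this threshold.
    return customers.filter(F.col("last_order_days") <= 90)

def pipeline(spark: SparkSession) -> None:
    customers = spark.read.table("crm.customers")                       # source Gem
    cleaned = clean_customers(spark, customers)                         # transform Gem
    active = filter_active(spark, cleaned)                              # filter Gem
    active.write.mode("overwrite").saveAsTable("crm.active_customers")  # target Gem
```

Because the graph is just chained function calls, clicking Run executes the same code you see in the Code view, end to end.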
What's next
Visit the Spark documentation to learn more, or view how to use Spark with Databricks.