Skip to main content

Prophecy Concepts

Advance your productivity by becoming familiar with these key Prophecy concepts.

Projects

The Project is the primary unit of development and deployment to production in Prophecy.

A project contains

  • Pipelines (Spark) or Models (SQL) which read, transform and write data.
  • Datasets point to the data that is read and written to by the Data Pipelines or Data Models.
  • Jobs run Data Pipelines and Data Models individually or together based on a schedule.

Project is Code on Git

A project is code on Git. This means that within a project, the business logic of all the assets including Pipelines/Models, Datasets, and Jobs is stored as code on Git. This might be a repository on GitHub or a folder in a repository.

Project is code

Metadata

Metadata is how you access and configure all entities in Prophecy. You can go to metadata page to view, edit, delete any of your entities like Pipelines, Projects, Jobs, Fabrics etc.

Metadata

Pipelines and Models

Pipelines and Models within Prophecy are used to represent the flow of data. They are similar to a map you might use on a road trip: You have a Start and Finish (Datasets) and the stops to make along the way (Gems).

Pipeline

Pipelines (for Spark)

Pipelines are based on Spark-native code. Therefore, Prophecy users can leverage all of the powers of Spark to build their transformations. Spark can handle virtually any complexity and scale transformation.

You will want to create pipelines when you’re running on a Spark environment, like Databricks or EMR. They’re particularly useful for more complex ingestion (e.g. loading data from Salesforce or JDBC), data transformations (e.g. working with complex data types) and machine learning use-cases.

Learn more here about Prophecy’s Data Transformation Copilot for Spark and checkout this guide.

Models (for SQL)

Models are based on SQL-native code and are backed by the dbt Core™️ build system. Models define a single Dataset (most commonly a table or a view) and they’re stored as a select statement in a SQL file. Models are used for SQL where Pipelines are used for Spark.

You will want to create models if you’d like to transform data directly on your data warehouse or you’re an existing dbt user. They’re best suited for data analytics and data transformation use-cases.

Learn more here about Prophecy’s Data Transformation Copilot for SQL and get started with SQL on Databricks with this guide or Snowflake with this guide.

Fabrics

A Fabric is a logical execution environment. Teams organize their data engineering into multiple environments such as development, staging, and production. As an example, if you have a Databricks Workspace, that would map to one Fabric in Prophecy.

Fabrics

Datasets

Datasets (not to be confused with a Spark Dataset) in Prophecy are the union of two concepts:

  • WHAT your data looks like
  • HOW to read it or write it.

To define a dataset, all you need is

  • Type: The type of data you are reading/writing like CSV, Parquet files or catalog tables.
  • Location: The location of your data. It could be a file path for CSV or a table name.
  • Properties: Properties consists of Schema which is the shape of the data and some other attributes specific to the file format. For example, in case of CSV, you can give Column delimiter in additional attributes. User can also define Metadata for each column here like description, tags and mappings.

Dataset

Gems

If a Pipeline is a roadmap of the journey that your data will take from Start to Finish, the Gems are the stops to make along the way. Each Gem instance can be seen as just another Spark Transformation you perform on your data like Reformat, Join, Limit or Filter. Gems are categorised as below:

Gem
CategoryDescription
Source and TargetSource and TargetThe set of Gems that help with loading and saving data.
TransformTransformThe set of Gems that help with transforming data.
Join and splitJoin and SplitThe set of Gems that help with the process of merging or splitting DataFrame(s) to create new DataFrame(s)
CustomCustomSet of Gems that do not fall into the other categories.
SubgraphSubgraphA Gem that can contain many other Gems within it.

Each Gem instance comes with their own configuration and each produces its own block of output code.

Jobs

Once you have developed a Spark data Pipeline using Prophecy, you will want to schedule it to run at some frequency. You can schedule just a single Pipeline or create a DAG of Pipelines. You can also add scripts to run in between these Pipelines.

To support Jobs, Prophecy provides you with an easy to use interface to develop, using two different schedulers:

  • Databricks Jobs - for simpler data-Pipeline use-cases, where you just orchestrate multiple data-Pipelines to run together. Databricks Jobs is a recommended scheduler, if you're Databricks Native.

  • Airflow - for more complex use-cases, where you have to use various operators, or need any additional data pre-and-post-processing, you can design your Jobs using Prophecy's interface for Airflow.

Alternatively, since Prophecy provides you native Spark code on Git, you can easily integrate with any other scheduler.

Job

Teams And Users

Teams are the primary mechanism of ownership. Each Team represents a group of users who work together.

  • Teams own Projects where Pipelines, Datasets, and Jobs live.
  • Teams own Fabrics that provide the execution and storage resources for execution including Spark clusters.

Users get access by being added to a team.

  • Personal Teams For every user, a default team in created that only contains one user. If only one user is to be given access to a project, it can be done via this team.
  • Personal Projects Users can put personal projects in their personal team and not share it with others.

TeamsUsers

What's next

To learn more about Prophecy concepts, see the following pages: