Datasets (not to be confused with a Spark Dataset) in Prophecy are the union of two concepts:
- Schema: The Schema, in this case, is the shape of the data: column names, their types, and how to read and write data that fits this schema.
- Fabric: The Fabric is the execution environment in which this data resides.
In short: WHAT your data looks like and HOW to read it or write it in a given Fabric. Datasets are grouped together in Projects.
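One way to picture this split is a small data model in which a Dataset holds one schema plus one read/write definition per Fabric. The sketch below is purely illustrative and is not Prophecy's API: the class names, the fields (`fmt`, `location`), the Fabric names `dev` and `prod`, and the `customers` Dataset are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Column:
    """One column of the schema: its name and type."""
    name: str
    dtype: str


@dataclass
class PhysicalDataset:
    """HOW the data is read or written in one Fabric (hypothetical fields)."""
    fabric: str
    fmt: str
    location: str


@dataclass
class Dataset:
    """WHAT the data looks like, plus one physical definition per Fabric."""
    name: str
    schema: List[Column]
    per_fabric: Dict[str, PhysicalDataset] = field(default_factory=dict)

    def define_for(self, physical: PhysicalDataset) -> None:
        # Register (or overwrite) the definition for that Fabric.
        self.per_fabric[physical.fabric] = physical

    def resolve(self, fabric: str) -> PhysicalDataset:
        # Look up how this Dataset is defined in the given Fabric.
        if fabric not in self.per_fabric:
            raise KeyError(
                f"Dataset {self.name!r} is not defined for Fabric {fabric!r}"
            )
        return self.per_fabric[fabric]


# Hypothetical example: the same schema, stored differently in two Fabrics.
customers = Dataset(
    "customers",
    [Column("id", "integer"), Column("email", "string")],
)
customers.define_for(PhysicalDataset("dev", "csv", "/tmp/dev/customers.csv"))
customers.define_for(PhysicalDataset("prod", "parquet", "/data/prod/customers"))
```

In this model the schema is shared, while `customers.resolve("dev")` and `customers.resolve("prod")` return different read/write details, which mirrors the idea of one Dataset having a different definition in each Fabric.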
Creating a Dataset
Datasets are created where they're first used: when creating Sources or Targets. Opening a Source or Target Gem shows a screen where you can either pick an existing Dataset or create a new one.
The screens that follow depend on the kind of Source or Target in question. Once you finish configuring the Dataset, give it a name and click Save. This Dataset can now be used by Pipelines within this Project, and in some cases by other Projects within the same Team. This makes defining common Sources and Targets easy.
Viewing project Datasets
There are two places to see Datasets: the Dataset list, which displays all Datasets across all of your Projects, and the Project Metadata section of a Project page, which is scoped to the Datasets of that Project.
If you navigate to the Metadata page for one of the Datasets you'll see the following:
| # | Name | Description |
|---|------|-------------|
| 1 | Metadata views | Switch between the different views of the Metadata page. |
| 2 | Dataset name | The name of this Dataset. Click the pencil icon to edit the name. |
| 3 | Dataset description | Description of this Dataset. |
| 4 | Fabric dropdown | Switch between the various Fabrics available to your account to see how this Dataset is defined for that Fabric. |
| 5 | Dataset properties | A small subset of the properties used for reading from or writing to this Dataset. |
| 6 | Dataset schema | This Dataset's columns and their types. |
| 7 | Delete Dataset | Delete this Dataset. Use with caution. |
On the Dataset Metadata page, you'll also find the following:

| # | Name | Description |
|---|------|-------------|
| 1 | Physical Datasets | Each Physical Dataset represents a variation of this Dataset in a particular Fabric. |
| 2 | Pipelines | A list of Pipelines that use this Dataset in some way. |
| 3 | Jobs | A list of Jobs that use this Dataset in some way. |
| 4 | Open Lineage Viewer | Opens this Dataset in the Lineage Viewer, showing column-level lineage for this Dataset. |
As part of the Project Release process, Datasets within that Project are published to other Projects within the same Team, and can be published to other Teams in read-only mode. This allows you to share your Dataset configurations with other Teams without allowing them to make changes to the original Dataset definitions. Let's see this in action:
- DI_Team is the central Data Infrastructure team. They have defined a common Project named DI_Common_Python.
- DI_Common_Python has a number of Datasets defined within it.
- DI_Team merges and releases the DI_Common_Python Project (following this development pattern) and tags the release.
- DI_Team has published the DI_Common_Python Project to DE_Team, the Data Engineering Team.
- Now, whenever DE_Team builds Pipelines, they can see the DI_Common_Python Project's Datasets, listed as Read-only.

This means that DE_Team can use the Datasets, but cannot edit them.
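The read-only rule can be sketched as a simple ownership check: any Team the Dataset is published to may use it, but only the owning Team may change its definition. This is a conceptual model only, not how Prophecy implements publishing; the `SharedDataset` class, its methods, and the `customers` Dataset name are hypothetical, while the Team names come from the example above.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class SharedDataset:
    """A Dataset owned by one Team and published to others (hypothetical)."""
    name: str
    owner_team: str
    description: str = ""
    published_to: Set[str] = field(default_factory=set)

    def publish(self, team: str) -> None:
        # Publishing grants read-only use, never edit rights.
        self.published_to.add(team)

    def can_use(self, team: str) -> bool:
        return team == self.owner_team or team in self.published_to

    def can_edit(self, team: str) -> bool:
        # Only the owning Team may change the definition.
        return team == self.owner_team

    def update_description(self, team: str, text: str) -> None:
        if not self.can_edit(team):
            raise PermissionError(
                f"{team} has read-only access to Dataset {self.name!r}"
            )
        self.description = text


# DI_Team owns the Dataset and publishes it to DE_Team.
shared = SharedDataset("customers", owner_team="DI_Team")
shared.publish("DE_Team")
shared.update_description("DI_Team", "Common customer records")
```

Here `shared.can_use("DE_Team")` is true, while calling `shared.update_description("DE_Team", ...)` raises `PermissionError`, matching the Read-only label DE_Team sees.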