Datasets

Datasets in Prophecy are the union of two concepts:

  1. Schema: The shape of the data, meaning its column names, their types, and how to read and write data that fits this schema.
  2. Fabric: The Fabric is the execution environment in which this data resides.

In short: WHAT your data looks like and HOW to read it or write it in a given Fabric. Datasets are grouped together in Projects.
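
As a rough mental model only (this is not Prophecy's internal representation or API), a Dataset can be pictured as one schema plus a set of read/write properties per Fabric. Every name in the sketch below (`Column`, `Dataset`, `definition_for`, the `dev` and `prod` Fabrics, the paths) is hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Column:
    """One column of the schema: a name and a type (hypothetical model)."""
    name: str
    dtype: str


@dataclass
class Dataset:
    """WHAT the data looks like, plus HOW to read or write it per Fabric."""
    name: str
    columns: List[Column]
    # Assumption: maps a Fabric name to the read/write properties used there.
    properties_by_fabric: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def definition_for(self, fabric: str) -> Dict[str, str]:
        """Return the read/write properties defined for the given Fabric."""
        if fabric not in self.properties_by_fabric:
            raise KeyError(f"Dataset {self.name!r} is not defined for Fabric {fabric!r}")
        return self.properties_by_fabric[fabric]


# Illustrative values only; none of these come from a real Project.
customers = Dataset(
    name="customers",
    columns=[Column("customer_id", "integer"), Column("email", "string")],
    properties_by_fabric={
        "dev": {"format": "csv", "path": "/tmp/dev/customers.csv"},
        "prod": {"format": "parquet", "path": "/data/prod/customers"},
    },
)

dev_definition = customers.definition_for("dev")
```

The point of the sketch is that the schema stays the same while the physical details (format, location) can differ from one Fabric to another.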

Creating a Dataset

Datasets are created where they're first used: when you create Sources or Targets. Opening a Source or Target Gem shows a screen where you can either pick an existing Dataset or create a new one.

Create a Dataset

The screens that follow depend on the kind of Source or Target in question. Once you finish configuring the Dataset, give it a name and click Save. This Dataset can now be used by Pipelines within this Project, and in some cases by other Projects within the same Team. This makes defining common Sources and Targets easy.

Viewing project Datasets

There are two places to see Datasets: the Dataset list, which displays all Datasets across all of your Projects, and the Project Metadata section of a Project page, which is scoped to the Datasets of that Project.

Dataset Metadata

If you navigate to the Metadata page for one of the Datasets, you'll see the following:

Dataset metadata page

  1. Metadata views: Switch between the different views of the Metadata page.
  2. Dataset name: The name of this Dataset. Click the pencil icon to edit the name.
  3. Dataset description: Description of this Dataset.
  4. Fabric dropdown: Switch between the various Fabrics available to your account to see how this Dataset is defined for that Fabric.
  5. Dataset properties: A small subset of the properties used for reading from or writing to this Dataset.
  6. Dataset schema: This Dataset's columns and their types.
  7. Delete Dataset: Delete this Dataset. Use with caution.

Dataset Relations

On the Dataset Metadata page, you'll find the Relations tab.

Dataset Relations

  1. Physical Datasets: Each Physical Dataset represents a variation of this Dataset in a particular Fabric.
  2. Pipelines: This is a list of Pipelines that use this Dataset in some way. The Relation column will indicate if this is for Read or Write purposes.
  3. Jobs: This is a list of Jobs that use this Dataset in some way. The Relation column will indicate if this is for Read or Write purposes.
  4. Open Lineage Viewer: Opens this Dataset in the Lineage Viewer, showing column-level lineage for this Dataset.
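
To make the Relation column concrete, here is a small illustrative sketch of how the Pipelines and Jobs entries could be modeled and filtered by Read or Write. It is an assumption for explanation only, not how Prophecy stores relations; `Relation`, `Usage`, `filter_usages`, and the Pipeline and Job names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Relation(Enum):
    """Whether a Pipeline or Job reads from or writes to the Dataset."""
    READ = "Read"
    WRITE = "Write"


@dataclass(frozen=True)
class Usage:
    """One row of the Relations tab (hypothetical model)."""
    kind: str          # "Pipeline" or "Job"
    name: str
    relation: Relation


def filter_usages(usages: List[Usage], kind: str, relation: Relation) -> List[str]:
    """Return the names of entries of the given kind with the given relation."""
    return [u.name for u in usages if u.kind == kind and u.relation == relation]


# Made-up entries for a single Dataset.
usages = [
    Usage("Pipeline", "ingest_customers", Relation.WRITE),
    Usage("Pipeline", "customer_report", Relation.READ),
    Usage("Job", "nightly_refresh", Relation.WRITE),
]

readers = filter_usages(usages, "Pipeline", Relation.READ)
```

Reading the tab this way answers questions such as "which Pipelines would be affected if this Dataset's schema changed?" (the readers) versus "which Pipelines or Jobs produce this data?" (the writers).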

Publishing/Sharing Datasets

As part of the Project Release process, Datasets within that Project are published to other Projects within the same Team, and can be published to other Teams in read-only mode. This allows you to share your Dataset configurations with other Teams without allowing them to make changes to the original Dataset definitions. Let's see this in action:

  1. DI_Team is the central Data Infrastructure team. They have defined a common Project named DI_Common_Python: DI Team Project
  2. DI_Common_Python has a number of Datasets defined within it: DI Common Datasets
  3. The DI_Team merges and releases (following this development pattern) the DI_Common_Python Project, tagging it 0.1: DI Common Release
  4. As you can see, the DI_Team has published the DI_Common_Python project to the DE_Team, the Data Engineering Team.
  5. Now, whenever the DE_Team builds Pipelines, they can see the following: Common Datasets

We can see the DI_Common_Python Project's Datasets, and the fact that they're listed as Read-only. This means that DE_Team can use the Datasets, but cannot edit them.
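
The read-only behavior in this walkthrough can be summarized with a short sketch. It only illustrates the access rule described above (the owning Team can edit, Teams the Project was published to can use but not edit) and is not Prophecy's actual permission model; `PublishedDataset`, `can_use`, `can_edit`, and the Dataset name `customers` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PublishedDataset:
    """A Dataset from a released Project (hypothetical model)."""
    name: str
    owner_team: str
    published_to: FrozenSet[str]  # Teams the Project was published to


def can_use(dataset: PublishedDataset, team: str) -> bool:
    """The owning Team and every Team it was published to can use the Dataset."""
    return team == dataset.owner_team or team in dataset.published_to


def can_edit(dataset: PublishedDataset, team: str) -> bool:
    """Only the owning Team can change the original Dataset definition."""
    return team == dataset.owner_team


# Team names come from the walkthrough; the Dataset name is made up.
shared = PublishedDataset(
    name="customers",
    owner_team="DI_Team",
    published_to=frozenset({"DE_Team"}),
)

de_team_can_use = can_use(shared, "DE_Team")
de_team_can_edit = can_edit(shared, "DE_Team")
```

In this sketch DE_Team passes the `can_use` check but fails `can_edit`, which mirrors the Read-only label shown for the DI_Common_Python Datasets.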