Datasets

In Prophecy, Datasets are grouped by Projects and rely on the following:

  • Schema: The structure or shape of the data, including column names, data types, and the method for reading and writing the data in this format.
  • Fabric: The execution environment in which the data resides.

Create Datasets

A Dataset is created when it is first used in a Source or Target Gem. A Dataset definition includes its:

  • Type: The kind of data you are reading or writing, such as CSV files, Parquet files, or catalog tables.
  • Location: Where the data lives, such as a file path for a CSV file or a table name for a catalog table.
  • Properties: The Schema plus other attributes specific to the file format. For example, for CSV you can set the column delimiter as an additional attribute. You can also define metadata for each column here, such as a description, tags, and mappings.
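
To make the three parts of a Dataset definition concrete, here is a minimal Python sketch. It is not Prophecy's internal representation or API: the class names `Column` and `DatasetDefinition`, their fields, and the example path are all hypothetical, chosen only to mirror the Type, Location, and Properties items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """One column of a Dataset schema, with optional per-column metadata."""
    name: str
    data_type: str
    description: str = ""
    tags: list[str] = field(default_factory=list)


@dataclass
class DatasetDefinition:
    """Hypothetical model of a Dataset definition (not a Prophecy class)."""
    name: str
    type: str                      # Type: e.g. "csv", "parquet", "catalog_table"
    location: str                  # Location: file path or table name
    schema: list[Column]           # Properties: the Schema
    format_options: dict[str, str] = field(default_factory=dict)  # Properties: format-specific attributes


# Example: a CSV Dataset with a column delimiter and column metadata.
customers = DatasetDefinition(
    name="customers",
    type="csv",
    location="/data/raw/customers.csv",  # illustrative path only
    schema=[
        Column("customer_id", "integer", description="Primary key"),
        Column("email", "string", tags=["pii"]),
    ],
    format_options={"delimiter": ","},
)

print(customers.type, customers.location, customers.format_options["delimiter"])
```

Keeping format-specific attributes in a separate mapping reflects the point that they vary by Type, while the Schema is present for every Dataset.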

Datasets can be used by any Pipeline within the same Project, and in some cases by other Projects within the same team.

View Datasets

There are two ways to view a list of Datasets:

  • To see all Datasets, navigate to Metadata > Datasets.
  • To see only one Project's Datasets, navigate to Metadata > Projects. Then, open a Project. Click on the Content tab, and then the Datasets subtab.

Dataset Metadata

If you open the metadata page for one of the Datasets, you'll find the following information:

| Name | Description |
| --- | --- |
| Dataset name | The name of this Dataset, which is editable. |
| Dataset description | The description of this Dataset, which is editable. |
| Dataset properties | A subset of properties used for reading or writing to this Dataset. |
| Dataset schema | The columns of this Dataset and their data types. |
| Delete Dataset | The option to delete this Dataset. Use with caution. |

In the Relations tab, there is additional information about where and how this Dataset is used.

| Name | Description |
| --- | --- |
| Physical Datasets | Location of the Dataset in relation to a Fabric. |
| Pipelines | A list of Pipelines that use this Dataset, with the Relation column indicating if it is for Read or Write purposes. |
| Jobs | A list of Jobs that use this Dataset, with the Relation column indicating if it is for Read or Write purposes. |
| Open Lineage Viewer | The option to open this Dataset in the Lineage viewer, showing column-level lineage for this Dataset. |

Publishing/Sharing Datasets

As part of the Project Release process, Datasets within that Project are published to other Projects within the same Team, and can be published to other Teams in read-only mode. This allows you to share your Dataset configurations with other Teams without allowing them to make changes to the original Dataset definitions. Let's see this in action:

  1. DI_Team is the central Data Infrastructure team. They have defined a common Project named DI_Common_Python.
  2. DI_Common_Python has a number of Datasets defined within it. [Image: DI Common Datasets]
  3. The DI_Team merges and releases the DI_Common_Python Project, tagging it 0.1. [Image: DI Common Release]
  4. The DI_Team has published the DI_Common_Python Project to the DE_Team, the Data Engineering team.
  5. Now, whenever the DE_Team builds Pipelines, they can see the shared Datasets. [Image: Common Datasets]

The DI_Common_Python Project's Datasets appear in the DE_Team's list, marked as Read-only. This means that the DE_Team can use the Datasets but cannot edit them.
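
The read-only behavior in this walk-through can be pictured with a small Python sketch. It is an illustration only, not how Prophecy enforces permissions: the `SharedDataset` class, the `can_edit` function, the Dataset name `customers`, and the Project name `DE_Pipelines` are hypothetical, and the rule that only the owning Project may edit is an assumption drawn from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedDataset:
    """Hypothetical record of a published Dataset and the Project that owns it."""
    name: str
    owner_project: str


def can_edit(dataset: SharedDataset, current_project: str) -> bool:
    """Assumed rule: only the owning Project may edit; others may only use it."""
    return dataset.owner_project == current_project


customers = SharedDataset(name="customers", owner_project="DI_Common_Python")

# The owning Project can change the definition.
print(can_edit(customers, "DI_Common_Python"))
# A consuming Project (hypothetical name) sees the Dataset as read-only.
print(can_edit(customers, "DE_Pipelines"))
```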

Info: For regular usage, we suggest having only one instance of a particular Dataset within a Pipeline, as the Dataset's properties and underlying data can change each time the Dataset is read or written.
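
The suggestion above can be checked with a short script. The sketch below is not a Prophecy feature: it assumes you already have a list of `(gem_kind, dataset_name)` pairs describing a Pipeline's Source and Target Gems, and the function name `repeated_datasets` and the sample data are hypothetical.

```python
from collections import Counter


def repeated_datasets(gems: list[tuple[str, str]]) -> list[str]:
    """Return the names of Datasets used by more than one Source or Target Gem.

    `gems` is a hypothetical list of (gem_kind, dataset_name) pairs.
    """
    counts = Counter(dataset for _kind, dataset in gems)
    return sorted(name for name, n in counts.items() if n > 1)


pipeline_gems = [
    ("source", "customers"),
    ("source", "orders"),
    ("target", "customers"),  # second instance of the same Dataset
]

print(repeated_datasets(pipeline_gems))  # prints ['customers']
```

An empty result means every Dataset appears only once in the Pipeline, which matches the suggested usage.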