Data sampling settings

You can choose different data sampling settings to control how Prophecy generates data samples during interactive pipeline execution.

Interactive run configuration

For each pipeline, you can enable, disable, or change interactive data sampling and job data sampling modes. To quickly update the interactive execution settings:

Hover over the large play button.
Click the ellipsis that appears on hover.
Change your data sampling settings. This updates your pipeline settings.

Data sampling modes

Prophecy provides the following data sampling modes.

All (default)

All gems (excluding Target gems) generate data samples.

Selective (recommended)

When you choose selective data sampling, you gain the ability to enable or disable data samples for individual gems. To do so, use the Data Preview checkbox in the gem action menu.

If the Data Preview option is not selected for a gem, the gem output appears in a pale color after you run the pipeline. This means that no data sample was generated. To generate one, open the pale interim and the data will load. After it loads, the data sample is displayed in the normal bold color.

note

Selectively-generated samples load up to 10,000 rows (or 2 MB payload) by default. Set the following environment variables in Databricks to modify this behavior:

EXECUTION_DATA_SAMPLE_LOADER_MAX_ROWS: Max number of rows (default is 10,000 rows).
EXECUTION_DATA_SAMPLE_LOADER_PAYLOAD_SIZE_LIMIT: Max payload size (default 2 MB).
EXECUTION_DATA_SAMPLE_LOADER_CHAR_LIMIT: Per-column character limit (default 200 KB). Values exceeding the limit are truncated.
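
To make the interaction of the three limits concrete, here is a minimal Python sketch. It is not Prophecy's implementation: `read_limit` and `load_sample` are hypothetical helpers, and the sketch assumes the variables hold plain integer counts (rows, bytes, characters), which this page does not confirm. Check the expected units before setting real values.

```python
import os

# Defaults taken from the note above. The units the variables expect are not
# stated there, so this sketch ASSUMES plain row, byte, and character counts.
DEFAULT_MAX_ROWS = 10_000
DEFAULT_PAYLOAD_BYTES = 2 * 1024 * 1024
DEFAULT_CHAR_LIMIT = 200 * 1024


def read_limit(name, default, env=os.environ):
    """Return the integer value of an environment variable, or the default."""
    raw = env.get(name)
    return int(raw) if raw else default


def load_sample(rows, env=os.environ):
    """Hypothetical loader: clip long values, then stop at the row or payload cap."""
    max_rows = read_limit("EXECUTION_DATA_SAMPLE_LOADER_MAX_ROWS", DEFAULT_MAX_ROWS, env)
    max_bytes = read_limit(
        "EXECUTION_DATA_SAMPLE_LOADER_PAYLOAD_SIZE_LIMIT", DEFAULT_PAYLOAD_BYTES, env
    )
    char_limit = read_limit("EXECUTION_DATA_SAMPLE_LOADER_CHAR_LIMIT", DEFAULT_CHAR_LIMIT, env)

    sample, used = [], 0
    for row in rows:
        if len(sample) >= max_rows:
            break  # row cap reached
        # Values longer than the per-column limit are truncated, not dropped.
        clipped = [str(value)[:char_limit] for value in row]
        size = sum(len(value.encode("utf-8")) for value in clipped)
        if used + size > max_bytes:
            break  # adding this row would exceed the payload cap
        sample.append(clipped)
        used += size
    return sample
```

In this model, whichever of the row cap or payload cap is hit first ends the sample, while the per-column limit only shortens individual values.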

tip

When working on Databricks UC standard clusters, use selective data sampling mode. If you select another data sampling mode, samples will only be generated on edge nodes.

Sources

Only Source gems generate data samples.

Targets

Only gems before Target gems generate data samples.

IO

Data samples are only generated after Source gems or before Target gems (not between intermediate gems).

Vanilla

This Databricks-specific setting generates data samples on the last node(s) of a pipeline. If a pipeline branch ends with a Target gem, the data sample appears before the Target gem (as in the Targets mode).
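
The modes described above can be compared with a small Python model of a pipeline graph. This is an illustrative sketch only: `sampling_gems`, the gem names, and the `kinds` mapping are hypothetical and not part of Prophecy. The Vanilla branch in particular is an assumption about what "last node(s)" means (gems with no downstream gem, plus gems feeding a Target gem).

```python
def sampling_gems(edges, kinds, mode, selected=()):
    """Return the set of gems whose output would carry a data sample.

    edges:    list of (upstream, downstream) gem names
    kinds:    gem name -> "source", "transform", or "target"
    selected: gems with Data Preview enabled (used by "selective" only)
    """
    gems = set(kinds)
    sources = {g for g in gems if kinds[g] == "source"}
    before_targets = {up for up, down in edges if kinds[down] == "target"}
    has_downstream = {up for up, _ in edges}

    if mode == "all":
        # Every gem except Target gems.
        return {g for g in gems if kinds[g] != "target"}
    if mode == "selective":
        # Only gems the user opted in.
        return set(selected) & gems
    if mode == "sources":
        return sources
    if mode == "targets":
        return before_targets
    if mode == "io":
        # After Source gems or before Target gems, nothing in between.
        return sources | before_targets
    if mode == "vanilla":
        # ASSUMPTION: terminal non-Target gems, plus gems feeding a Target gem.
        terminal = {g for g in gems if g not in has_downstream and kinds[g] != "target"}
        return terminal | before_targets
    raise ValueError(f"unknown sampling mode: {mode}")


# Hypothetical pipeline: customers -> clean -> report (target)
#                                      clean -> aggregate (no Target gem)
edges = [("customers", "clean"), ("clean", "report"), ("clean", "aggregate")]
kinds = {
    "customers": "source",
    "clean": "transform",
    "aggregate": "transform",
    "report": "target",
}

for mode in ("all", "sources", "targets", "io", "vanilla"):
    print(mode, sorted(sampling_gems(edges, kinds, mode)))
```

Running the loop shows how each mode narrows the set of sampled gems relative to All, which is the main trade-off when choosing a mode for a large pipeline.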

Cached interims

If you change your data sampling settings and re-run your pipeline, you might see some grayed-out data samples. These are cached data samples from previous runs, and they might be outdated.

Fabric settings

In a fabric, you can enable or disable data sampling and override pipeline-level settings when a pipeline runs on that fabric. This option is available in the Advanced tab of a fabric. A common use case is preventing sample data generation in production pipelines.

By default, only team admins can access the Advanced tab in a fabric. However, there are two flags you can set in your deployment to change this behavior:

ALLOW_FABRIC_ACCESS_CLUSTER_ADMIN: Grants cluster admins full access to fabrics, even if they are not team admins.
DISALLOW_FABRIC_CODEDEPS_UPDATE_TEAM_ADMIN: Prevents team admins from modifying the data sampling settings within a fabric.
