
XLSX (Excel)

If you've ever worked with numbers in your line of work, odds are you've used Excel at one point or another. Prophecy supports the format as both a data source and a data target, so whether you're migrating from a legacy system or need to produce an Excel-compatible file for a report, we've got you covered.

Prerequisites

caution

If you receive an error that the excel format is not available, you must add the spark-excel library as a dependency.

Follow the instructions on this page to add the Maven coordinate com.crealytics:spark-excel_2.12:3.5.1_0.20.4 to your Pipeline.
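
If you're building the Spark session yourself (for example, in a local test harness) rather than letting Prophecy manage dependencies, one way to pull in the library is through Spark's spark.jars.packages setting. This is a minimal sketch under that assumption, not a substitute for adding the coordinate in the Prophecy UI:

from pyspark.sql import SparkSession

# Ask Spark to resolve the spark-excel package from Maven at session startup.
spark = (
    SparkSession.builder
        .appName("xlsx-demo")
        .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:3.5.1_0.20.4")
        .getOrCreate()
)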

Parameters

Source Parameters

The following is a list of options that are available while using XLSX as a Source:

| Parameter | Description | Required | Default |
|-----------|-------------|----------|---------|
| Column Name of Corrupt Record | Name of the column to create for corrupt records | False | None |
| Column Name of Row Number | Name of the column to create using the original row number | False | None |
| Data Address | Data address to read (see here for more information) | True | A1 (Everything) |
| Date Format | Date format to use | False | Spark default |
| Excerpt Size | Number of rows used to infer the schema (when Infer Schema is enabled) | False | |
| File Extension | Input file extension | False | xlsx |
| Header | First line in input is a header | True | True |
| Ignore After Header | Number of rows to ignore after the header | False | 0 |
| Ignore Leading Whitespace | Whether to trim leading whitespace from values | False | False |
| Ignore Trailing Whitespace | Whether to trim trailing whitespace from values | False | False |
| Infer Schema | Infer the schema of the input. Note: this setting is provided directly by the spark-excel library and is distinct from the Infer Schema button in the Prophecy UI; both should produce the same results. | False | False |
| Keep Undefined Rows | If true, keeps undefined Excel rows | False | False |
| Locale | A language tag in the IETF BCP 47 format | False | "US" |
| NaN Value | Value to use in the case of NaN | False | "NaN" |
| Negative Infinity | Value to use in the case of negative infinity | False | "Inf" |
| Null Value | Value to use for null | False | (empty) |
| Parse Mode | Parsing mode. Supports Permissive, Drop Malformed, and Fail Fast. | False | Permissive |
| Positive Infinity | Value to use in the case of positive infinity | False | "Inf" |
| Sampling Ratio | Defines how much of the input to sample when inferring the schema | False | 1.0 |
| Timestamp Format | Format to parse timestamps from text cells | False | Spark default |
| Use Null for Error Cells | Use null values for error cells | False | True |
| Workbook Password | Password for a protected workbook | False | (empty) |
| Timezone ID | Timezone ID for dates/timestamps, taken from the IANA Time Zone Database (see here for valid values) | False | (empty) |
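
For orientation, here's roughly what a read using several of these options looks like in plain Spark code. This is a hedged sketch, not Prophecy-generated code: the sheet range, formats, and file path are invented, and the option keys (dataAddress, parseMode, dateFormat) follow spark-excel's naming of the parameters in the table above.

df = (
    spark.read.format("excel")
        .option("header", True)                     # first row is a header
        .option("dataAddress", "'Sheet1'!A1:F100")  # read a specific sheet and range
        .option("inferSchema", True)                # infer column types from the data
        .option("parseMode", "PERMISSIVE")          # keep malformed rows rather than failing
        .option("dateFormat", "yyyy-MM-dd")
        .load("dbfs:/FileStore/Users/example/sales.xlsx")
)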

Target Parameters

The following is a list of options that are available while using XLSX as a Target:

| Parameter | Description | Required | Default |
|-----------|-------------|----------|---------|
| Data Address | Data address to write output to | False | A1 |
| File Extension | File extension used when writing | False | "xlsx" |
| Header | Write a header to the file | False | True |
| Locale | A language tag in the IETF BCP 47 format | False | "US" |
| Date Format | Format to use for date columns | False | Spark default |
| Use Plain Number Format | If true, format cells without rounding or scientific notation | False | False |
| Workbook Password | Password to secure the workbook | False | (empty) |
| Write Mode | Write mode, same as the underlying Spark write mode | False | "append" |
| Partition Columns | Columns to partition output files by | False | (empty) |
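
The write side mirrors the read side. Again a hedged sketch with an invented path; usePlainNumberFormat and dataAddress correspond to the Use Plain Number Format and Data Address options above:

(
    df.write.format("excel")
        .option("header", True)                 # write column names as the first row
        .option("dataAddress", "'Report'!A1")   # where the output table starts
        .option("usePlainNumberFormat", True)   # no rounding or scientific notation
        .mode("append")                         # the default Write Mode above
        .save("dbfs:/FileStore/Users/example/report.xlsx")
)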

Writing a single output file

When writing files with Spark, your output isn't a single file but a directory containing multiple partitioned files, a consequence of Spark's distributed nature.

For example, if you write to a location like dbfs:/FileStore/Users/test/customers.xlsx, you'll see the following in the DBFS:

  • A customers.xlsx directory.
  • Partitions within the customers.xlsx directory.

Each partition is a separate valid XLSX file with a segment of the overall output data. If you want to output only a single file, you'll need to:

  1. Add a Repartition Gem in Coalesce mode with the Partition Count set to 1.

    Coalesce using Repartition

  2. Connect it between your second-to-last transformation and the Target Gem.

    Attach coalesce before desired target

After running, your output will still be a directory, but this time it will only contain a single output file.
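
For reference, the Coalesce-mode Repartition Gem corresponds to Spark's DataFrame.coalesce. A minimal hand-written equivalent, reusing the example path from above, would be:

# Collapse the data into a single partition so Spark writes one part file.
single = df.coalesce(1)

single.write.format("excel")\
    .option("header", True)\
    .save("dbfs:/FileStore/Users/test/customers.xlsx")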

Example code

Below is a snippet of the optimized code that is generated when using the XLSX source.

from pyspark.sql import DataFrame, SparkSession

def Demo_XLSX_Source(spark: SparkSession) -> DataFrame:
    if Config.fabricName == "dev":  # Config comes from the generated Pipeline package
        return spark.read\
            .format("excel")\
            .option("header", True)\
            .option("dataAddress", "A1")\
            .option("inferSchema", True)\
            .load("dbfs:/FileStore/Users/scott/plain_number.xlsx")
    else:
        raise Exception("No valid dataset present to read fabric")
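
Assuming a session attached to the dev fabric and the generated Pipeline package on the classpath, calling the function is a one-liner:

# Hypothetical usage of the generated Source function.
df = Demo_XLSX_Source(spark)
df.show(5)  # preview the first few rows of the workbook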