Skip to main content

XLSX (Excel)

The XLSX (Excel) file type:

  • Is in XML format, which is easier for data access, manipulation, and compatibility with various software applications.
  • Offers password protection options, which allow users to secure sensitive data.

Prerequisites

  • If you receive an error about the excel format not being available you must add spark-excel library as a dependency.

  • To add the Maven coordinate com.crealytics:spark-excel_2.12:3.5.1_0.20.4 to your pipeline, see Spark dependencies.

Parameters

ParameterTabDescription
LocationLocationFile path to read from or write to the XLSX file.
SchemaPropertiesSchema to apply on the loaded data.
In the Source gem, you can define or edit the schema visually or in JSON code.
In the Target gem, you can view the schema visually or as JSON code.

Source

The Source gem reads data from XLSX files and allows you to optionally specify the following additional properties.

Source properties

Property nameDescriptionDefault
Enforce SchemaWhether to use the schema you define.false
HeaderWhether to read the first line as a header.true
DescriptionDescription of your dataset.None
Data AddressLocation to read data addresses from. To learn more, see Data Addresses.A1 (Everything)
Column Name of Corrupt RecordRename the field the PERMISSIVE mode creates to store malformed data._corrupt_records
Column Name of Row NumberName of the column to create using the original row number.None
Date FormatString that indicates a date format.yyyy-MM-dd
Excerpt SizeExcerpt Size.None
File ExtensionExtension of the file to read in.xlsx
Ignore After HeaderNumber of rows to ignore after the header.None
Ignore leading white spaces from valuesWhether to skip the leading whitespaces from values the Source gem reads.false
Ignore trailing white spaces from valuesWhether to skip the trailing whitespaces from values the Source gem reads.false
Infer SchemaWhether to automatically infer the input schema from the data. This requires one extra pass over the data.

Note: The spark-excel library provides this setting and is different than the Infer Schema button in the Prophecy UI. Both should provide the same results.
false
LocaleSets a locale as language tag in IETF BCP 47 format.en-US
NaN ValueSets the string representation of a non-number value.NaN
Negative Infinite valueSets the string representation of a negative infinity value.-Inf
Null ValueSets the string representation of a null value.None
Parse ModeHow to handle corrupt data.
For a list of the possible values, see Supported parse modes.
Permissive
Positive Infinite valueSets the string representation of a positive infinity value.Inf
Sampling RatioDefines a fraction of rows to use for schema inferring1.0
Timestamp FormatSets the string that indicates a timestamp format.yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
Use Null for Error CellsWhether to use null for cells with errors.false
Workbook PasswordPassword to secure your workbook.None
Time Zone IDTimezone ID for the Date and Timestamp from the IANA Time Zone Database.
Note: For a list of valid values, see Class ZoneId.
None
Temporary file thresholdWhen the Source gem should start writing data to temporary files on disk instead of keeping it in memory.None
Maximum rows in memoryMaximum amount of rows to have in memory.None
Maximum byte array sizeMaximum size of your array.None

Supported parse modes

ModeDescription
PermissivePut the malformed string into the corrupt records column, and set the malformed fields to null.
Drop MalformedIgnore the entire corrupted record. This mode is not supported in the CSV built-in functions.
Fail FastThrow an exception when it meets a corrupted record.

Target

The Target gem writes data to XLSX files and allows you to optionally specify the following additional properties.

Target properties

Property nameDescriptionDefault
Data AddressLocation to write data addresses to. To learn more, see Data Addresses.A1
File ExtensionFile extension of the file to write to.xlsx
HeaderWhether to write a header to the file.true
LocaleSets a locale as language tag in IETF BCP 47 format.en-US
Date FormatString that indicates a date format.yyyy-MM-dd
Use Plain Number FormatWhether to format the cells without rounding and scientific notations.false
Workbook PasswordPassword to secure your workbook.None
Write ModeHow to handle existing data. For a list of the possible values, see Supported write modes.append
Partition ColumnsList of columns to partition the XLSX files by.None
Create single named XLSX fileWhether to create a single XLSX file.false

Supported write modes

Write modeDescription
errorIf the data already exists, throw an exception.
overwriteIf the data already exists, overwrite the data with the contents of the DataFrame.
appendIf the data already exists, append the contents of the DataFrame.
ignoreIf the data already exists, do nothing with the contents of the DataFrame.
This is similar to the CREATE TABLE IF NOT EXISTS clause in SQL.

Writing a single output file

Due to Spark's distributed nature, when you work with text-based files, your output is a directory containing multiple partitioned files.

For example, if you write to the following location: dbfs:/FileStore/Users/test/customers.xlsx, you see the following in the DBFS:

  • A customers.xlsx directory.
  • Partitions within the customers.xlsx directory.

Each partition is a separate valid XLSX file with a segment of the overall output data.

If you want the Target gem to output a single file:

  1. Add a Repartition gem in Coalesce mode with the Partition Count set to 1.

    Coalesce using Repartition

  2. Connect the Repartition gem between your second-to-last transformation and the Target gem.

    Attach coalesce before desired target

  3. Run your pipeline.

    After you run your pipeline, your output is still a directory, but this time it only contains a single output file.

Example code

tip

To see the generated source code of your project, switch to the Code view in the project header.

def Demo_XLSX_Source(spark: SparkSession) -> DataFrame:
if Config.fabricName == "dev":
return spark.read\
.format("excel")\
.option("header", True)\
.option("dataAddress", "A1")\
.option("inferSchema", True)\
.load("dbfs:/FileStore/Users/scott/plain_number.xlsx")
else:
raise Exception("No valid dataset present to read fabric")