Skip to main content

CSV

The CSV (Comma-separated Values) file type is:

  • Easy to read from, write to, and share.
  • Compatible with many programs, and easy to exchange data.

Requirements

The CSV file type requires different package and Spark versions depending on the location type you select.

Location TypeDependency versionsSpark compatibility
File Location
Sharepoint (Python only)
SFTP (Python only)

Parameters

ParameterTabDescription
Read FromLocationWhich platform to read the file from.
Possible values are: File Location, Sharepoint (Python only), or SFTP (Python only).
Write ToLocationWhich platform to write the file to.
You can only select File Location.
LocationLocationFile path to read from or write to the CSV file.
SchemaPropertiesSchema to apply on the loaded data.
In the Source gem, you can define or edit the schema visually or in JSON code.
In the Target gem, you can view the schema visually or as JSON code.

Source

The Source gem reads data from CSV files and allows you to optionally specify the following additional properties.

Source properties

Property nameDescriptionDefault
DescriptionDescription of your dataset.None
User-Defined SchemaWhether to use the schema you define.true
Column DelimeterCharacter to separate column names from your CSV file.,
First row is headerWhether to use the first line as names of columns.true
Infer schema from dataWhether to automatically infer the input schema from the data. This requires one extra pass over the data.
CSV built-in functions ignore this option.
false
Parse Multi-line recordsWhether to parse one record, which may span multiple lines, per file.
CSV built-in functions ignore this option.
false
Encoding TypeDecodes the CSV files by encoding type you give.
CSV built-in functions ignore this option.
UTF-8
Quote characterSets a single character to escape quoted values where the separator can be part of the value. If you want to turn off quotations, set this to an empty string."
Escape characterSets a single character to escape quotes inside a quoted value.\
Escape char for quote escaping charSets a single character to escape the escape for the quote character.\0
Skip line beginning with characterIf the line begins with the character you specify, skip the entire line.None
Enforce specified or inferred schemaWhether to forcibly apply the specified or inferred schema to data source files, and headers in CSV files.
CSV built-in functions ignore this option.
false
Sampling RatioDefines a fraction of rows to use for schema inferring.
CSV built-in functions ignore this option.
1.0
Ignore leading white spaces from valuesWhether to skip the leading whitespaces from values the Source gem reads.false
Ignore trailing white spaces from valuesWhether to skip the trailing whitespaces from values the Source gem reads.false
Null ValueSets the string representation of a null value.None
Empty ValueSets the string representation of an empty value.None
String representation for non-number valueSets the string representation of a non-number value.NaN
Positive infinity valueSets the string representation of a positive infinity value.Inf
Negative infinity valueSets the string representation of a negative infinity value.-Inf
Date format stringSets the string that indicates a date format.yyyy-MM-dd
Timestamp format stringSets the string that indicates a timestamp format.yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
Max number of columns per recordDefines a hard limit of how many columns a record can have.20480
Allowed maximum characters per columnDefines the maximum number of characters allowed for any given value being read.-1 (unlimited length)
Corrupt record handlingHow to handle corrupt data. For a list of the possible values, see Supported corrupt record modes.PERMISSIVE
Column name of a corrupt recordRename the field the PERMISSIVE mode creates to store malformed data.
This overrides spark.sql.columnNameOfCorruptRecord.
_corrupt_records
Line SepSets a separator for each field and value. The separator can be one or more characters.\r, \r\n and \n
LocaleSets a locale as language tag in IETF BCP 47 format.en-US
Unescaped Quote HandlingHow the CsvParser handles values with unescaped quotes.
For a list of the possible values, see Supported unescaped quote handling.
STOP_AT_DELIMITER
Recursive File LookupWhether to recursively load files and disable partition inferring. If the data source explicitly specifies the partitionSpec when therecursiveFileLookup is true, the Source gem throws an exception.false
Path Global FilterGlob pattern to only include files with paths matching the pattern. The syntax follows GlobFilter and does not change the behavior of partition discovery.None
Modified BeforeTimestamp to only include files with modification times occurring before the time you specify. The timestamp must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).None
Modified AfterTimestamp to only include files with modification times occurring after the time you specify. The timestamp must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).None
Skip header linesNumber of lines to skip at the beginning of the file.None
Skip footer linesNumber of lines to skip at the end of the file.None

Supported corrupt record modes

ModeDescription
PERMISSIVEPut the malformed string into the corrupt records column, and set the malformed fields to null.
DROPMALFORMEDIgnore the entire corrupted record. This mode is not supported in the CSV built-in functions.
FAILFASTThrow an exception when it meets a corrupted record.

Supported unescaped quote handling

ModeDescription
STOP_AT_CLOSING_QUOTEAccumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.
BACK_TO_DELIMITERConsider the value as an unquoted value. This makes the parser accumulate all characters of the current parsed value until it finds the delimiter. If it does not find the delimiter, the parser continues accumulating characters from the input until it finds a delimiter or line ending.
STOP_AT_DELIMITERConsider the value as an unquoted value. This makes the parser accumulate all characters from the input until it finds the delimiter or a line ending.
SKIP_VALUESkip the parsed content and set this to the value in the Null Value property.
RAISE_ERRORThrow a TextParsingException.

Target

The Target gem writes data to CSV files and allows you to optionally specify the following additional properties.

Target properties

Property nameDescriptionDefault
Dataset NameName of the dataset.True
DescriptionDescription of your dataset.None
Write ModeHow to handle existing data. For a list of the possible values, see Supported write modes.error
Column delimeterCharacter to separate column names from your CSV file.,
First row is headerWhether to use the first line as names of columns.true
Partition ColumnsList of columns to partition the CSV files by.None
Encoding TypeSpecifies encoding (charset) of saved CSV files.
CSV built-in functions ignore this option.
UTF-8
Quote characterSets a single character to escape quoted values where the separator can be part of the value. If you want to turn off quotations, you need to set this to an empty string."
Escape characterSets a single character to escape quotes inside a quoted value.\
Escape char for quote escaping charSets a single character to escape the escape for the quote character.\0
Null ValueSets the string representation of a null value.None
Empty ValueSets the string representation of an empty value."" (empty string)
CompressionCompression codec when writing to the CSV file.
The CSV file supports the following codecs: none, bzip2, gzip, lz4, snappy and deflate.
CSV built-in functions ignore this option.
None
Escape quotesWhether values containing quotes should always be enclosed in quotes.false
Quote AllWhether all values should always be enclosed in quotes.false
Date format stringSets the string that indicates a date format.yyyy-MM-dd
Timestamp format stringSets the string that indicates a timestamp format.yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
Ignore leading white spaces from valuesWhether to skip the leading whitespaces for values the Target gem writes.true
Ignore trailing white spaces from valuesWhether to skip the trailing whitespaces for values the Target gem writes.true
Line SepDefines the line separator to use when parsing.\n
Create single CSV fileWhether to create a single CSV file.
Due to Spark's distributed nature, Prophecy by default writes output files as multiple separate partition files.
false

Supported write modes

Write modeDescription
errorIf the data already exists, throw an exception.
overwriteIf the data already exists, overwrite the data with the contents of the DataFrame.
appendIf the data already exists, append the contents of the DataFrame.
ignoreIf the data already exists, do nothing with the contents of the DataFrame.
This is similar to the CREATE TABLE IF NOT EXISTS clause in SQL.