ORC

ORC (Optimized Row Columnar) is a columnar file format designed for Spark and Hadoop workloads. It is optimized for large streaming reads, with integrated support for finding the required rows quickly. Because ORC files are type-aware, the writer chooses the most appropriate encoding for each type and builds an internal index as the file is written.

This Gem allows you to read from or write to ORC files.

Source

Reads data from ORC files present at a path.

Source Parameters

| Parameter | Description | Required | Default |
|---|---|---|---|
| Location | File path where the ORC files are present | True | None |
| Schema | Schema to apply to the loaded data. It can be defined or edited as JSON, or inferred using the **Infer Schema** button. | True | None |
| Recursive File Lookup | Recursively loads files from the given location and disables partition inferring. If the data source explicitly specifies a `partitionSpec` while `recursiveFileLookup` is true, an exception is thrown. | False | false |
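
The Schema parameter is a standard Spark schema that can be edited as JSON. Below is a minimal sketch, assuming hypothetical columns `id` and `name`, of how such a schema can be built in PySpark and serialized to Spark's JSON representation (the exact JSON layout used by the Gem editor may differ):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for illustration; the actual columns depend on your data.
orc_schema = StructType([
    StructField("id", IntegerType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])

# Spark's JSON representation of the schema; it can be round-tripped with
# StructType.fromJson(json.loads(...)).
print(orc_schema.json())
```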

Example

ORC source example

Generated Code

```python
def read_orc(spark: SparkSession) -> DataFrame:
    return spark.read\
        .format("orc")\
        .load("dbfs:/FileStore/Users/orc/test.orc")
```


Target

Writes data as ORC files at the specified path.

Target Parameters

| Parameter | Description | Required | Default |
|---|---|---|---|
| Location | File path where the ORC files will be written | True | None |
| Compression | Compression codec to use when saving the file. It can be one of the known case-insensitive short names (none, uncompressed, snappy, zlib, lzo, zstd, and lz4). This overrides `orc.compress`. | False | snappy |
| Write Mode | How existing data at the location is handled: `error` (fail if data exists), `overwrite`, `append`, or `ignore` | True | error |
| Partition Columns | List of columns to partition the ORC files by | False | None |

Example

ORC target example

Generated Code

```python
def write_orc(spark: SparkSession, in0: DataFrame):
    in0.write\
        .format("orc")\
        .mode("overwrite")\
        .save("dbfs:/data/test_output.orc")
```
:::info
To learn more about tuning ORC-related properties in the Spark configuration, refer to the Apache Spark documentation.
:::
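
For example, two ORC-related Spark SQL settings can be adjusted from a session as shown below. This is a sketch assuming a `spark` session is in scope; these are Spark configuration properties, not Gem parameters:

```python
# Default compression codec for ORC writes when no per-write option is given.
spark.conf.set("spark.sql.orc.compression.codec", "zlib")

# Push filter predicates down into the ORC reader so stripes can be skipped.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
```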