
OOP concepts in Python:

class:

- it is a blueprint for creating objects
- a class can be defined as a collection of objects
- it is a logical entity

object:

- it is an instance of a class
- it is an entity that exists in the real world
- it is a physical entity
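
For example, a minimal sketch (the Car class is illustrative, not from the notes):

Python
# A class is the blueprint; each object is one concrete instance built from it.
class Car:
    def __init__(self, brand, color):
        self.brand = brand   # attributes (data)
        self.color = color

    def describe(self):      # behaviour (method)
        return f"{self.color} {self.brand}"

# Two distinct objects created from the same class.
car1 = Car("Toyota", "red")
car2 = Car("Honda", "blue")
print(car1.describe())   # red Toyota
print(car2.describe())   # blue Honda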

data abstraction:

- hiding the implementation details and showing only the essential details
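
As a small illustrative sketch (the Payment classes are made up for this example), Python expresses this with abstract base classes that expose the essential interface while hiding how it is implemented:

Python
from abc import ABC, abstractmethod

class Payment(ABC):                 # only the essential operation is exposed
    @abstractmethod
    def pay(self, amount): ...

class CardPayment(Payment):         # implementation details stay inside the subclass
    def pay(self, amount):
        print(f"charging {amount} to the card")

CardPayment().pay(100)              # callers rely only on pay(), not on the details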

encapsulation:

- wrapping up data and functions into a single unit
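
A minimal sketch (the Account class is illustrative): the data and the functions that act on it live in one class, and the attribute is kept "private" via name mangling:

Python
class Account:
    def __init__(self, balance):
        self.__balance = balance     # not accessed directly from outside

    def deposit(self, amount):       # data and behaviour wrapped in a single unit
        if amount > 0:
            self.__balance += amount

    def get_balance(self):
        return self.__balance

acc = Account(100)
acc.deposit(50)
print(acc.get_balance())   # 150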

inheritance:

- the process of creating a new class from an already existing class
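
A minimal sketch (Animal and Dog are illustrative names):

Python
class Animal:              # already existing (parent) class
    def eat(self):
        print("eating")

class Dog(Animal):         # new (child) class created from Animal
    def bark(self):
        print("barking")

d = Dog()
d.eat()    # behaviour inherited from the parent
d.bark()   # behaviour added by the child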

polymorphism:

- implementing the same method in different contexts
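
A minimal sketch (Cat and Duck are illustrative): the same method name gives different behaviour depending on the object it is called on:

Python
class Cat:
    def speak(self):
        return "meow"

class Duck:
    def speak(self):
        return "quack"

for animal in (Cat(), Duck()):   # same call, different result per class
    print(animal.speak())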

ORC Files

Contents:
- ORC Implementation
- Vectorized Reader
- Schema Merging
- Zstandard
- Bloom Filters
- Columnar Encryption
- Hive metastore ORC table conversion
- Configuration
- Data Source Option

Apache ORC is a columnar format that offers advanced features such as native Zstandard (zstd) compression, bloom filters, and columnar encryption.

ORC Implementation
Spark supports two ORC implementations (native and hive), which are controlled by spark.sql.orc.impl. The two implementations share most functionality but have different design goals.

- the native implementation is designed to follow Spark's data source behavior, like Parquet.
- the hive implementation is designed to follow Hive's behavior and uses Hive SerDe.

For example, historically the native implementation handled CHAR/VARCHAR with Spark's native String type, while the hive implementation handled them via Hive CHAR/VARCHAR, so the query results differed. Since Spark 3.1.0, SPARK-33480 removes this difference by supporting CHAR/VARCHAR on the Spark side.
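
A minimal PySpark sketch of selecting an implementation (the path below is a placeholder):

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Use Spark's native ORC support (the default), or set "hive" to use the ORC
# library that ships with Hive instead.
spark.conf.set("spark.sql.orc.impl", "native")

df = spark.read.orc("/path/to/orc")   # placeholder path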

Vectorized Reader
The native implementation supports a vectorized ORC reader and has been the default ORC implementation since Spark 2.3. The vectorized reader is used for native ORC tables (e.g., those created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true.

For Hive ORC serde tables (e.g., those created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true (it is turned on by default).
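
As a short sketch, these switches are ordinary SQL configurations and can be set on the session (the values shown are the defaults):

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

# Needed in addition for Hive ORC serde tables to use the vectorized reader.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")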

Schema Merging
Like Protocol Buffer, Avro, and Thrift, ORC also supports schema evolution. Users
can start with a simple schema, and gradually add more columns to the schema as
needed. In this way, users may end up with multiple ORC files with different but
mutually compatible schemas. The ORC data source is now able to automatically
detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default. You may enable it by

- setting the data source option mergeSchema to true when reading ORC files, or
- setting the global SQL option spark.sql.orc.mergeSchema to true.
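
A minimal PySpark sketch of both approaches (the path is a placeholder):

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) Per-read data source option.
df = spark.read.option("mergeSchema", "true").orc("/data/orc_table")

# 2) Global SQL option applied to all ORC reads in the session.
spark.conf.set("spark.sql.orc.mergeSchema", "true")
df = spark.read.orc("/data/orc_table")
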
Zstandard
Since Spark 3.2, you can take advantage of Zstandard compression in ORC files.
Please see Zstandard for the benefits.

SQL
CREATE TABLE compressed (
    key STRING,
    value STRING
)
USING ORC
OPTIONS (
    compression 'zstd'
)
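
The same codec can be chosen when writing through the DataFrame API; a small sketch with a throwaway DataFrame and a placeholder output path:

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k1", "v1")], ["key", "value"])

# Write ORC files compressed with Zstandard.
df.write.option("compression", "zstd").orc("/data/compressed_orc")
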
Bloom Filters
You can control bloom filters and dictionary encodings for ORC data sources. The following ORC example will create a bloom filter and use dictionary encoding only for favorite_color. To find more detailed information about the extra ORC options, visit the official Apache ORC website.

SQL
CREATE TABLE users_with_options (
    name STRING,
    favorite_color STRING,
    favorite_numbers array<integer>
)
USING ORC
OPTIONS (
    orc.bloom.filter.columns 'favorite_color',
    orc.dictionary.key.threshold '1.0',
    orc.column.encoding.direct 'name'
)
Columnar Encryption
Since Spark 3.2, columnar encryption is supported for ORC tables with Apache ORC 1.6. The following example uses Hadoop KMS as a key provider with the given location. Please visit Apache Hadoop KMS for the details.

SQL
CREATE TABLE encrypted (
    ssn STRING,
    email STRING,
    name STRING
)
USING ORC
OPTIONS (
    hadoop.security.key.provider.path "kms://http@localhost:9600/kms",
    orc.key.provider "hadoop",
    orc.encrypt "pii:ssn,email",
    orc.mask "nullify:ssn;sha256:email"
)
Hive metastore ORC table conversion
When reading from Hive metastore ORC tables and inserting to Hive metastore ORC tables, Spark SQL will try to use its own ORC support instead of Hive SerDe for better performance. For CTAS statements, only non-partitioned Hive metastore ORC tables are converted. This behavior is controlled by the spark.sql.hive.convertMetastoreOrc configuration, and it is turned on by default.
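
A short sketch of toggling this for a session (false forces Hive SerDe to be used):

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default: convert Hive metastore ORC tables to Spark's own ORC support.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

# Opt out and access those tables through Hive SerDe instead.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")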

Configuration
spark.sql.orc.impl
    Default: native
    Meaning: The name of ORC implementation. It can be one of native and hive. native means the native ORC support. hive means the ORC library in Hive.
    Since Version: 2.3.0

spark.sql.orc.enableVectorizedReader
    Default: true
    Meaning: Enables vectorized ORC decoding in the native implementation. If false, a non-vectorized ORC reader is used in the native implementation. For the hive implementation, this is ignored.
    Since Version: 2.3.0

spark.sql.orc.columnarReaderBatchSize
    Default: 4096
    Meaning: The number of rows to include in an ORC vectorized reader batch. The number should be carefully chosen to minimize overhead and avoid OOMs when reading data.
    Since Version: 2.4.0

spark.sql.orc.columnarWriterBatchSize
    Default: 1024
    Meaning: The number of rows to include in an ORC vectorized writer batch. The number should be carefully chosen to minimize overhead and avoid OOMs when writing data.
    Since Version: 3.4.0

spark.sql.orc.enableNestedColumnVectorizedReader
    Default: true
    Meaning: Enables vectorized ORC decoding in the native implementation for nested data types (array, map and struct). If spark.sql.orc.enableVectorizedReader is set to false, this is ignored.
    Since Version: 3.2.0

spark.sql.orc.filterPushdown
    Default: true
    Meaning: When true, enables filter pushdown for ORC files.
    Since Version: 1.4.0

spark.sql.orc.aggregatePushdown
    Default: false
    Meaning: If true, aggregates will be pushed down to ORC for optimization. MIN, MAX and COUNT are supported as aggregate expressions. For MIN/MAX, boolean, integer, float and date types are supported. For COUNT, all data types are supported. If statistics are missing from any ORC file footer, an exception is thrown.
    Since Version: 3.3.0

spark.sql.orc.mergeSchema
    Default: false
    Meaning: When true, the ORC data source merges schemas collected from all data files; otherwise the schema is picked from a random data file.
    Since Version: 3.0.0

spark.sql.hive.convertMetastoreOrc
    Default: true
    Meaning: When set to false, Spark SQL will use the Hive SerDe for ORC tables instead of the built-in support.
    Since Version: 2.0.0
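
These properties can be supplied when the session is built or changed at runtime; a hedged sketch (the values shown are the defaults, except aggregatePushdown):

Python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.orc.filterPushdown", "true")
    .config("spark.sql.orc.columnarReaderBatchSize", "4096")
    .getOrCreate()
)

# SQL configurations can also be changed on an existing session.
spark.conf.set("spark.sql.orc.aggregatePushdown", "true")
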
Data Source Option
Data source options of ORC can be set via:

- the .option/.options methods of
  - DataFrameReader
  - DataFrameWriter
  - DataStreamReader
  - DataStreamWriter
- the OPTIONS clause at CREATE TABLE USING DATA_SOURCE

mergeSchema
    Default: false
    Meaning: Sets whether we should merge schemas collected from all ORC part-files. This will override spark.sql.orc.mergeSchema. The default value is specified in spark.sql.orc.mergeSchema.
    Scope: read

compression
    Default: snappy
    Meaning: Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, zlib, lzo, zstd and lz4). This will override orc.compress and spark.sql.orc.compression.codec.
    Scope: write

Other generic options can be found in Generic File Source Options.
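
A minimal PySpark sketch of the .option route (paths are placeholders):

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read: per-query option that overrides spark.sql.orc.mergeSchema.
df = spark.read.option("mergeSchema", "true").orc("/data/orc_input")

# Write: per-query codec that overrides orc.compress and
# spark.sql.orc.compression.codec.
df.write.option("compression", "zstd").orc("/data/orc_output")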
