External Tables

- Not *Just* Loading a CSV File

Kim Berg Hansen


Senior Consultant

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF


HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
About me

• Danish geek
• SQL & PL/SQL developer since 2000
• Developer at Trivadis since 2016
http://www.trivadis.dk
• Oracle Certified Expert in SQL
• Oracle ACE Director
• Blogger at http://www.kibeha.dk
• SQL quizmaster at
http://devgym.oracle.com
• Likes to cook
• Reads sci-fi
• Member of Danish Beer Enthusiasts

2 9/21/2018 External Tables - Not *Just* Loading a CSV File


500+ Technical Experts Helping Peers Globally

3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate

Connect:
bit.ly/OracleACEProgram
oracle-ace_ww@oracle.com
Facebook.com/oracleaces
@oracleace

Nominate yourself or someone you know: acenomination.oracle.com

About Trivadis

Trivadis is a market leader in IT consulting, system integration, solution engineering
and the provision of IT services focusing on and
technologies in Switzerland, Germany, Austria and Denmark.
We offer our services in the following strategic business fields:

OPERATION

Trivadis Services takes over the interacting operation of your IT systems.



With over 600 specialists and IT experts in your region

14 Trivadis branches and more than 600 employees
260 Service Level Agreements
Over 4,000 training participants
Research and development budget: EUR 5.0 million
Financially self-supporting and sustainably profitable
Experience from more than 1,900 projects per year at over 800 customers

Branches: COPENHAGEN, HAMBURG, DÜSSELDORF, FRANKFURT, STUTTGART, FREIBURG, MUNICH, VIENNA, BASEL, BRUGG, ZURICH, BERN, GENEVA, LAUSANNE



External Tables - Not *Just* Loading a CSV File

1. Access Drivers, Parameters, Locations
2. Definition versus Runtime
3. Error Handling, Logging Files
4. Flat Files input
5. Preprocessor
6. Multiple Files, Parallelism, Partition Pruning
7. Trusted Relied Constraints
8. SQL*Loader as Generator
9. External Table with Datapump Dump Files
10. HDFS / HIVE



Access Drivers
Parameters
Locations



External Tables

A way to treat a file outside of the database as a rowsource


Enables SELECT from the file with all the power of SQL
– Without necessarily loading the data into a table in the database
Different filetypes supported with different Access Drivers

select t1.col1, t2.col2
from db_tab t1
join ext_tab t2
on t2.fk = t1.pk
where t1.grp = 'FOO';



Creation

Definition created in data dictionary* like normal table (only data is outside DB)
(* in 18c not necessarily - more on that later)
Specify type (access driver), directory and location (file)
Specify access parameters depending on access driver
create table ext_tab (fk number, col2 varchar2(10))
organization external (
type oracle_loader
access parameters (
records delimited by newline
fields terminated by ";" optionally enclosed by '"'
( fk integer external(6), col2 char(10) )
)
location (ext_dir:'file.txt')
);



Access Driver

Keyword TYPE specifies which access driver to use


ORACLE_LOADER
– Flat files - alternative to SQL*Loader
ORACLE_DATAPUMP
– Data Pump dump files - can also write files (once - at creation time)
ORACLE_HDFS (12.2) Oracle Big Data SQL
– Read datafiles from HDFS (by creating a HIVE table)
ORACLE_HIVE (12.2) Oracle Big Data SQL
– Read datafiles from HDFS by querying a HIVE catalog



Access Parameters

Specific for each Access Driver type


Tells DB the metadata of the file, how to get the values of each column

18c doc states opaque_format_spec in quotes used for INLINE EXTERNAL and
EXTERNAL_MODIFY, while without quotes is used for CREATE TABLE
– This appears to be a doc bug - without quotes seems always to work
Or a subquery can return the access parameters



Location

Keyword LOCATION contains one or more filenames


For ORACLE_LOADER and ORACLE_DATAPUMP files in filesystem
– DIRECTORY object must be created and privileges granted
– DIRECTORY object specified for file: DIRNAME:'file.txt'
– Or DEFAULT DIRECTORY specifies directory for files where dir. is omitted
– (12.1) Location supports wildcards * and ?
For ORACLE_HDFS location specifies hdfs:/... style URI
For ORACLE_HIVE location is unused - access parameters specify cluster/table
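
The directory setup described above could look roughly like the following. This is a minimal sketch only: the OS path /data/ext, the grantee app_user, the table name ext_tab2 and the file names are hypothetical.

```sql
-- Hypothetical path and grantee; requires the CREATE ANY DIRECTORY privilege
create directory ext_dir as '/data/ext';
-- WRITE is granted as well so that log and bad files can be created there
grant read, write on directory ext_dir to app_user;

-- Files without a directory prefix are looked up in the DEFAULT DIRECTORY
create table ext_tab2 (fk number, col2 varchar2(10))
organization external (
   type oracle_loader
   default directory ext_dir
   access parameters (
      records delimited by newline
      fields terminated by ";" optionally enclosed by '"'
      ( fk integer external(6), col2 char(10) )
   )
   location ('file1.txt', ext_dir:'file2.txt')
);
```

Both forms of file specification appear in the LOCATION clause: 'file1.txt' relies on the DEFAULT DIRECTORY, while ext_dir:'file2.txt' names its directory object explicitly.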



Definition versus Runtime



Definition in Data Dictionary

Define with CREATE TABLE


Change with ALTER TABLE
– Often useful to change LOCATION
– Some restrictions on what can be altered - see manual of each version
Change the projection with ALTER TABLE
– PROJECT COLUMN ALL / PROJECT COLUMN REFERENCED
- The latter may cause inconsistencies if errors in un-referenced columns
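
As a small sketch of the ALTER TABLE changes listed above, using the ext_tab table from the Creation slide and a hypothetical second file name:

```sql
-- Point the existing definition at a different file (file2.txt is hypothetical)
alter table ext_tab location (ext_dir:'file2.txt');

-- Only parse the fields of columns that the query actually references
alter table ext_tab project column referenced;

-- Parse all fields again, so the same rows are rejected regardless of the select list
alter table ext_tab project column all;
```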



Overrides at Runtime (12.2)

SELECT ... FROM EXT_TAB EXTERNAL MODIFY (...)

– Modify default directory and/or location


- Allows each session/query to read own (identically structured) file(s)
– Modify reject limit
– Modify badfile / logfile / discardfile
Careful with your security
– A user with SELECT privilege on the external table can potentially read all files in
the DIRECTORY objects he has READ privilege on
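
A runtime override might be written as in this sketch, which assumes the ext_tab table from the Creation slide and a hypothetical per-session file name:

```sql
-- The stored definition is unchanged; only this query reads the other file
select fk, col2
from ext_tab external modify (
   default directory ext_dir
   location ('file_session1.txt')   -- hypothetical file name
   reject limit 10
);
```

Another session can run the same statement with its own file name at the same time, which is what makes this useful for identically structured files.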



Everything at Runtime (18.1)

Inline definition of External Table

select fk, col2
from external (
(fk number, col2 varchar2(10))
type oracle_loader
access parameters (
records delimited by newline
fields terminated by ";" optionally enclosed by '"'
( fk integer external(6), col2 char(10) )
)
location (ext_dir:'file.txt')
);

Nothing in data dictionary (hence also less information for the optimizer)



Error Handling
Logging Files



Errors in the Data

Errors in the data may or may not return an error


– REJECT LIMIT 0 (default) = first occurrence of bad data throws error
– REJECT LIMIT {int} = up to {int} bad rows are tolerated, the next occurrence throws error
– REJECT LIMIT UNLIMITED = no errors thrown
Bad rows of data are copied to the BADFILE
Note: If you have ALTER TABLE ... PROJECT COLUMN REFERENCED
– When column with bad data is in SELECT list => row goes to BADFILE
– When column with bad data is not in SELECT list => row is selected



Logging Files

Three parameter pairs


– NOLOGFILE / LOGFILE dir_obj:'ext.log'
– NOBADFILE / BADFILE dir_obj:'ext.bad'
– NODISCARDFILE / DISCARDFILE dir_obj:'ext.dcs'

Can use symbol substitution for uniqueness


- %p = Process id of user process doing the SELECT
- %a = Agent number of slave process by parallel access
Each of them defaults to {table_name}_%p.{ext}
BADFILE contains those rows that could not be imported
DISCARDFILE contains those rows that were skipped by LOAD WHEN clause
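
To illustrate the logging parameters together with a reject limit, here is a sketch under stated assumptions: the table name ext_tab_logged and the file names are hypothetical, and the directory object ext_dir is assumed to be writable.

```sql
create table ext_tab_logged (fk number, col2 varchar2(10))
organization external (
   type oracle_loader
   default directory ext_dir
   access parameters (
      records delimited by newline
      -- %p makes the file names unique per process doing the SELECT
      badfile ext_dir:'ext_tab_%p.bad'
      logfile ext_dir:'ext_tab_%p.log'
      nodiscardfile
      fields terminated by ";" optionally enclosed by '"'
      ( fk integer external(6), col2 char(10) )
   )
   location ('file.txt')
)
reject limit 10;   -- tolerate up to 10 bad rows before raising an error
```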



Flat Files input



Overall file characteristics

CHARACTERSET
– Which character set the file uses (default is the DB character set, not the client's)
LANGUAGE
– Which language is used for month names, AM/PM, etc. in the file
TERRITORY
– Which decimal / thousands separators, week numbering, etc. are used in the file
DATA IS BIG ENDIAN / DATA IS LITTLE ENDIAN
– Endianness of the platform where the file originated



Records

FIXED
– Each record a fixed length (in bytes)
VARIABLE
– Start of each record contains its length: a count of bytes, written as a character string
DELIMITED BY
– Each record ends with a given string
XMLTAG
– Each record is the content within a given XML tag: <MYTAG>....</MYTAG>



Fields

The field list for the file does not necessarily match the column list of the table directly; it can map differently
ALL FIELDS OVERRIDE - states that the field list does match the table columns directly
– Then only list the fields that need extra info, such as a non-default date format
FIELD NAMES clause tells how to handle a first line that contains field names
– Can be ignored or can map fields automatically by field name
TERMINATED BY / [OPTIONALLY] ENCLOSED BY
FIELDS CSV
– WITH / WITHOUT EMBEDDED - whether the file contains record delimiters within string fields
– TERMINATED / ENCLOSED - override the defaults , and "
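
The field clauses above could be combined as in the following sketch. The table ext_emp_csv, its columns and the file emp.csv are hypothetical, and the exact keyword forms (in particular FIELD NAMES FIRST IGNORE) are written from memory and should be checked against the documentation of your version.

```sql
create table ext_emp_csv (
   emp_id   number,
   ename    varchar2(20),
   hiredate date
)
organization external (
   type oracle_loader
   default directory ext_dir
   access parameters (
      records delimited by newline
      field names first ignore      -- assumption: header line in the file is skipped
      fields csv without embedded   -- no record delimiters inside string fields
      all fields override           -- fields match the table columns one to one
      missing field values are null
      -- only the field needing extra info is listed
      ( hiredate char(10) date_format date mask "YYYY-MM-DD" )
   )
   location ('emp.csv')
);
```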



Specifying Field Positions (when not delimited)

Start position
– Digit is position directly
– * means the start is the char after the end of previous field
– *+{offset} or *-{offset} means plus or minus offset chars after end of previous field
End can be specified as position (Digit) or as length (+Digit)
STRING SIZES ARE IN
– Parameter says if positions are measured in bytes or chars (for multibyte charsets)
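
A positional layout might be declared as in this sketch. It assumes a hypothetical file fixed.txt in a single-byte character set where every record is 16 data bytes plus a one-byte newline; the table name ext_fixed is also hypothetical.

```sql
create table ext_fixed (fk number, col2 varchar2(10))
organization external (
   type oracle_loader
   default directory ext_dir
   access parameters (
      records fixed 17              -- 16 data bytes + 1 newline byte (assumed layout)
      string sizes are in bytes
      fields (
         fk   position(1:6)   integer external(6),  -- absolute start and end
         col2 position(*:+10) char(10)              -- starts after fk, length 10
      )
   )
   location ('fixed.txt')
);
```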



Datatypes

INTEGER, DECIMAL, FLOAT, DOUBLE


– Specifying EXTERNAL means the numbers are represented as strings in the file
– Without EXTERNAL means they are binary, in the format a C program would write them
- Access parameter DATA IS BIG / LITTLE ENDIAN used here
RAW, VARRAW, VARRAWC
– Binary data, fixed length or variable with first bytes indicating length
ORACLE_DATE, ORACLE_NUMBER
– Binary representations of Oracle DATE or NUMBER datatype



Datatypes (continued)

CHAR, VARCHAR, VARCHARC


– Character data, fixed length or variable with first bytes indicating length
– VARCHAR length indicator is bytes, VARCHARC length indicator is characters
– CHAR also used for DATE, TIMESTAMP, INTERVAL:
– DATE_FORMAT {type} MASK "{format mask}"



COLUMN TRANSFORMS

{column_name} FROM {transformation}


– NULL - sets column in all rows to NULL
– CONSTANT - sets column in all rows to specified literal
– CONCAT - sets column to concatenation of field(s) and/or literal(s)
– STARTOF - sets column to a substring from the start of a field
– LOBFILE - sets column to a LOB loaded from another file
directory object / filename can be a field or literal



Preprocessor



Preprocessor

PREPROCESSOR [{directory}:]{script_or_exe_file}
Must have EXECUTE privilege on directory object
Can be different directory than the datafile - this is recommended for security
Preprocessor script/exe will be called with filename from LOCATION as parameter
Standard output from script/exe will become the input for the EXTERNAL TABLE
Cannot specify arguments directly
– if executable requires arguments, must wrap it in a script
Windows script (batch file) must have suffix .bat or .cmd
Windows batch file must start with @echo off
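
A sketch of the gunzip case follows. It assumes a hypothetical directory /data/ext_bin that contains a copy of (or wrapper script for) zcat, a hypothetical grantee app_user, and a hypothetical compressed file file.txt.gz.

```sql
-- Separate directory for the executable, as recommended for security
create directory exec_dir as '/data/ext_bin';
grant execute on directory exec_dir to app_user;

create table ext_tab_gz (fk number, col2 varchar2(10))
organization external (
   type oracle_loader
   default directory ext_dir
   access parameters (
      records delimited by newline
      -- zcat receives the LOCATION file name as its parameter;
      -- its standard output becomes the input of the external table
      preprocessor exec_dir:'zcat'
      fields terminated by ";" optionally enclosed by '"'
      ( fk integer external(6), col2 char(10) )
   )
   location ('file.txt.gz')
);
```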



Uses

Uncompress (gunzip / zcat)


– Process compressed file and stream uncompressed data as external table input
Directory listing
– Preprocessor script does ls / dir
Changing file content
– Do transformations with sed before the data is used for external table input
curl calls
– get http resources and feed them to external table input
Your imagination is the limit 



Multiple Files
Parallelism
Partition Pruning



Multiple Files

LOCATION can contain multiple files, with or without directory specification


– If without, directory specified in DEFAULT DIRECTORY is used
Selecting from the external table reads all the files (except by partition pruning)
If field names are in first row, it can be in either just first file or all files
– Specify which with FIELD NAMES FIRST / ALL



Parallelism

Multiple files
– Each file specified in LOCATION is handled by its own slave process
- parallel degree not helpful to set larger than number of files
– This also means that PREPROCESSOR is called for each file by its slave process
Large files
– ORACLE_LOADER parallel select can attempt to assign file chunks to slaves
– Cannot always be done, for example not with:
- Named pipes as input
- Multibyte charactersets (unless fixed byte length records)
- Variable length records with length indicator bytes



Partition Pruning (12.2)

Can be partitioned with RANGE, INTERVAL, LIST or composites of them


Each partition has one or more files in LOCATION clause
When optimizer does partition pruning, for an external table that means it only scans
the file(s) of that partition
DB trusts that the files of each partition only contain the specified partition key value(s)
If key values are wrong in the files:
– you can get output that does not match WHERE clause
– you may have data you cannot query with WHERE clause
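
A partitioned external table could be sketched as follows. The table ext_sales, its columns and the file names are hypothetical; each file is assumed to contain only rows for its own region.

```sql
create table ext_sales (region varchar2(10), amount number)
organization external (
   type oracle_loader
   default directory ext_dir
   access parameters (
      records delimited by newline
      fields terminated by ";"
   )
)
reject limit unlimited
partition by list (region) (
   partition p_north values ('NORTH') location ('north.txt'),
   partition p_south values ('SOUTH') location ('south.txt')
);

-- With partition pruning only north.txt needs to be scanned
select sum(amount)
from ext_sales
where region = 'NORTH';
```

If south.txt contained a row with region NORTH, the query above would not see it, which is the kind of mismatch the last two bullets warn about.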



Trusted Relied Constraints



Purposes of Constraints

On regular tables integrity constraints can be enforced


– Not possible to enforce on external tables - data comes from elsewhere
- Except NOT NULL constraint can be enforced - nulls go to bad file
– But you can say "trust me" and use RELY DISABLE on constraints (12.2)
- can do that for primary key, foreign key, unique constraints
- but not check constraint
With knowledge of the constraints, the optimizer can make assumptions
that enable it to choose more optimal access plans
– This also works with the trusted constraints on external tables
- QUERY_REWRITE_INTEGRITY = trusted or stale_tolerated
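
Declaring such a trusted constraint might look like this sketch, which uses the ext_tab table from the Creation slide; the constraint name is hypothetical.

```sql
-- Not enforced: the database simply trusts that fk is unique in the file
alter table ext_tab
   add constraint ext_tab_pk primary key (fk) rely disable;

-- Let the optimizer make use of RELY constraints in this session
alter session set query_rewrite_integrity = trusted;
```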



SQL*Loader as Generator



SQL*Loader for Creating External Tables

You have a SQL*Loader control file?


You want to do the same load (or almost) with an external table?
Use SQL*Loader parameter EXTERNAL_TABLE=GENERATE_ONLY
SQL*Loader won't load but instead creates code in the log file
This code you can execute or edit as you wish



External Table with
Datapump Dump Files



Write (once) to Dump File

CTAS for ORACLE_DATAPUMP access driver

create table ext_emp_tab
organization external (
type oracle_datapump
default directory ext_dir
location ('ext_emp.dmp')
)
as select * from emp;

This created external table can be read, but not modified



Driver Parameters for Write

COMPRESSION
– ENABLED BASIC / LOW / MEDIUM / HIGH
- requires Advanced Compression option
ENCRYPTION
– ENABLED / DISABLED
VERSION
– COMPATIBLE / LATEST / version number
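
As a sketch of a write parameter in use, assuming the Advanced Compression option is licensed and using hypothetical table and file names:

```sql
create table ext_emp_comp
organization external (
   type oracle_datapump
   default directory ext_dir
   access parameters (compression enabled basic)
   location ('ext_emp_comp.dmp')
)
as select * from emp;
```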



Parallel Write to Multiple Files

CTAS for ORACLE_DATAPUMP access driver

create table ext_emp_tab
organization external (
type oracle_datapump
default directory ext_dir
location ('ext_emp1.dmp', 'ext_emp2.dmp', 'ext_emp3.dmp')
)
parallel 3
as select * from emp;

Parallel degree and number of files should match


– If number of files > parallel, extra files unused
– If parallel > number of files, parallel is reduced to number of files



External Table to Read Dump File

Create external table on an existing Dump File (for example from other DB)

create table ext_emp_tab (
emp_id number, ename varchar2(20)
) organization external (
type oracle_datapump
default directory ext_dir
location ('ext_emp1.dmp', 'ext_emp2.dmp', 'ext_emp3.dmp')
);

Dump file can be from other DB charset, other DB endianness


Reading from multiple files requires that all have been written with identical metadata
– Ext.table name, column names/types, charset, timezone must be identical



HDFS / HIVE



Oracle Big Data SQL

External HDFS / HIVE tables for Oracle Big Data SQL (licensed product)
– Hadoop Clusters on Oracle Big Data Appliance
– Database on Exadata
HIVE metadata exposed to database
– ORACLE_HIVE external tables can just specify columns and HIVE cluster/table
– Can override mappings if desired
With ORACLE_HDFS you specify HIVE style metadata directly, no table in HIVE catalog



Advantages

Big Data SQL Engine


– SmartScan on Hadoop
– Fast direct reads
– Oracle PQ => Hadoop parallelism
Advantages of Hadoop data directly in SQL
– Immediate use by anything that uses SELECT
– Fine-grained access control of Hadoop
– Data redaction, data masking



Questions & Answers
Kim Berg Hansen
Senior Consultant
email kim.berghansen@trivadis.com
twitter @kibeha
blog http://www.kibeha.dk
