CREATE EXTERNAL TABLE
The CREATE EXTERNAL TABLE
command defines a table whose data resides outside of ScramDB’s internal storage, typically in files on local disk or cloud storage. ScramDB reads this data directly when the table is queried.
This is useful for querying data generated by other processes (like Kerosene) or existing data lakes without needing to import it into ScramDB’s internal storage.
Syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] table_name ( column1_name data_type, column2_name data_type, ...)STORED AS formatLOCATION 'location_path'[TBLPROPERTIES (property_key = 'property_value', ...)];
IF NOT EXISTS
: (Optional) Prevents an error if a table with the same name already exists.table_name
: The name of the table to create.column_name
: The name of a column.data_type
: The data type for the column (e.g.,INT
,VARCHAR
,DOUBLE
). Must match the data in the external files.STORED AS format
: Specifies the format of the external files. Supported formats:PARQUET
CSV
(orTEXTFILE
)
LOCATION 'location_path'
: Specifies the location of the external data. This can be:- A single file path (e.g.,
/path/to/data.parquet
,file:///path/to/data.csv
). - A directory path (e.g.,
/path/to/data/
,s3://bucket/prefix/
). - A glob pattern (e.g.,
/path/to/data/*.parquet
,s3://bucket/data-??.csv
). - A comma-separated list of file paths (e.g.,
/path/file1.csv,/path/file2.csv
). (Note: Currently, only local paths are fully tested. Cloud storage paths like S3/GCS require future integration with theobject_store
crate).
- A single file path (e.g.,
TBLPROPERTIES (...)
: (Optional) Key-value pairs providing additional metadata and options:location_type
: (Required if ambiguous) Specifies how to interpret theLOCATION
.'FILE'
: Location is a single file.'DIRECTORY'
: Location is a directory.'GLOB'
: Location is a glob pattern.'LIST'
: Location is a comma-separated list of files. (Default: Auto-detected based onLOCATION
string - trailing/
implies DIRECTORY,*
implies GLOB, otherwise FILE).
recursive
: ('true'
|'false'
) ForDIRECTORY
orGLOB
. Whether to search subdirectories. Defaults totrue
for DIRECTORY/GLOB, must befalse
for FILE/LIST.has_header
: ('true'
|'false'
) ForCSV
/TEXTFILE
. Does the file have a header row? Defaults tofalse
.delimiter
: (Single character) ForCSV
/TEXTFILE
. Column delimiter. Defaults to,
.- (Future) Credential properties for cloud storage access.
Examples
Parquet File
CREATE EXTERNAL TABLE orders_parquet ( order_id BIGINT, customer_id BIGINT, order_date TIMESTAMP, total_amount DOUBLE)STORED AS PARQUETLOCATION '/mnt/data/orders.parquet'TBLPROPERTIES ('location_type' = 'FILE');
CSV Files in a Directory (Recursive)
CREATE EXTERNAL TABLE logs_csv ( ts TIMESTAMP, level VARCHAR, message TEXT)STORED AS CSVLOCATION '/var/log/applogs/'TBLPROPERTIES ( 'location_type' = 'DIRECTORY', 'recursive' = 'true', 'has_header' = 'true', 'delimiter' = '|');
Multiple Specific CSV Files
CREATE EXTERNAL TABLE sensor_readings ( sensor_id INT, reading_time TIMESTAMP, value DOUBLE)STORED AS TEXTFILE -- Alias for CSVLOCATION '/data/sensor_a.csv,/data/sensor_b.csv'TBLPROPERTIES ( 'location_type' = 'LIST', 'has_header' = 'false');
Parquet Files Matching a Glob Pattern
CREATE EXTERNAL TABLE sales_data ( product_id INT, region VARCHAR, sale_amount DECIMAL(10, 2), sale_date DATE)STORED AS PARQUETLOCATION '/data/sales/processed/*.parquet'TBLPROPERTIES ( 'location_type' = 'GLOB', 'recursive' = 'false' -- Only look in the top-level 'processed' directory);
Querying external tables works the same way as querying internal tables using SELECT
, but ScramDB reads the data directly from the specified external location during query execution.