CREATE EXTERNAL TABLE

The CREATE EXTERNAL TABLE command defines a table whose data resides outside of ScramDB’s internal storage, typically in files on local disk or cloud storage. ScramDB reads this data directly when the table is queried.

This is useful for querying data generated by other processes (like Kerosene) or existing data lakes without needing to import it into ScramDB’s internal storage.

Syntax

CREATE EXTERNAL TABLE [IF NOT EXISTS] table_name (
    column1_name data_type,
    column2_name data_type,
    ...
)
STORED AS format
LOCATION 'location_path'
[TBLPROPERTIES (property_key = 'property_value', ...)];

IF NOT EXISTS: (Optional) Prevents an error if a table with the same name already exists.
table_name: The name of the table to create.
column_name: The name of a column.
data_type: The data type for the column (e.g., INT, VARCHAR, DOUBLE). Must match the data in the external files.
STORED AS format: Specifies the format of the external files. Supported formats:
- PARQUET
- CSV (or TEXTFILE)
LOCATION 'location_path': Specifies the location of the external data. This can be:
- A single file path (e.g., /path/to/data.parquet, file:///path/to/data.csv).
- A directory path (e.g., /path/to/data/, s3://bucket/prefix/).
- A glob pattern (e.g., /path/to/data/*.parquet, s3://bucket/data-??.csv).
- A comma-separated list of file paths (e.g., /path/file1.csv,/path/file2.csv). (Note: Currently, only local paths are fully tested. Cloud storage paths like S3/GCS require future integration with the object_store crate).
TBLPROPERTIES (...): (Optional) Key-value pairs providing additional metadata and options:
- location_type: (Required if ambiguous) Specifies how to interpret the LOCATION.
  - 'FILE': Location is a single file.
  - 'DIRECTORY': Location is a directory.
  - 'GLOB': Location is a glob pattern.
  - 'LIST': Location is a comma-separated list of files. (Default: Auto-detected based on LOCATION string - trailing / implies DIRECTORY, * implies GLOB, otherwise FILE).
- recursive: ('true'|'false') For DIRECTORY or GLOB. Whether to search subdirectories. Defaults to true for DIRECTORY/GLOB, must be false for FILE/LIST.
- has_header: ('true'|'false') For CSV/TEXTFILE. Does the file have a header row? Defaults to false.
- delimiter: (Single character) For CSV/TEXTFILE. Column delimiter. Defaults to ,.
- (Future) Credential properties for cloud storage access.

Examples

Parquet File

CREATE EXTERNAL TABLE orders_parquet (
    order_id BIGINT,
    customer_id BIGINT,
    order_date TIMESTAMP,
    total_amount DOUBLE
)
STORED AS PARQUET
LOCATION '/mnt/data/orders.parquet'
TBLPROPERTIES ('location_type' = 'FILE');

CSV Files in a Directory (Recursive)

CREATE EXTERNAL TABLE logs_csv (
    ts TIMESTAMP,
    level VARCHAR,
    message TEXT
)
STORED AS CSV
LOCATION '/var/log/applogs/'
TBLPROPERTIES (
    'location_type' = 'DIRECTORY',
    'recursive' = 'true',
    'has_header' = 'true',
    'delimiter' = '|'
);

Multiple Specific CSV Files

CREATE EXTERNAL TABLE sensor_readings (
    sensor_id INT,
    reading_time TIMESTAMP,
    value DOUBLE
)
STORED AS TEXTFILE -- Alias for CSV
LOCATION '/data/sensor_a.csv,/data/sensor_b.csv'
TBLPROPERTIES (
    'location_type' = 'LIST',
    'has_header' = 'false'
);

Parquet Files Matching a Glob Pattern

CREATE EXTERNAL TABLE sales_data (
    product_id INT,
    region VARCHAR,
    sale_amount DECIMAL(10, 2),
    sale_date DATE
)
STORED AS PARQUET
LOCATION '/data/sales/processed/*.parquet'
TBLPROPERTIES (
    'location_type' = 'GLOB',
    'recursive' = 'false' -- Only look in the top-level 'processed' directory
);

Querying external tables works the same way as querying internal tables using SELECT, but ScramDB reads the data directly from the specified external location during query execution.