Local Platform Setup
This guide explains how to run all components of the Pulsejet.AI platform (ScramDB, PulsejetDB, NTU, Kerosene, PulseDash) on your local machine using Docker Compose.
Prerequisites
- Docker and Docker Compose: Ensure you have Docker Engine and Docker Compose installed and running on your system. Refer to the official Docker documentation for installation instructions.
- Git: You’ll need Git to clone the main Pulsejet.AI repository.
- (Optional) AI Models: If you plan to use the AI features involving NTU (such as embedding generation or the planned `ASK` command), you may need to create integrations for AI models via SQL. Check the project’s main `README.md` or `Makefile` for specific commands like `make downloadllm`.
- (Optional) Nvidia GPU & Drivers: For GPU-accelerated AI tasks within NTU, ensure you have a compatible Nvidia GPU, the appropriate drivers, and the NVIDIA Container Toolkit installed. The provided `docker-compose.yml` includes configuration for GPU access.
Running the Platform
Once the platform is running, the services can be accessed at the following addresses:
- PulseDash (Management UI): http://localhost:3000
- ScramDB (PostgreSQL port): connect using a SQL client to `localhost:5433` (see the Connecting guide).
- Kerosene (Airflow UI): http://localhost:8080 (default Airflow credentials are `admin` / `admin`).
- PulsejetDB API (HTTP): http://localhost:47044
- PulsejetDB API (gRPC): `localhost:47045`
- Kerosene API: http://localhost:32000
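As a quick sanity check, the HTTP endpoints above can be probed from the host. This is a minimal sketch, not part of the platform tooling: the ports come from the `docker-compose.yml`, the gRPC port 47045 is skipped because it does not speak plain HTTP, and each endpoint simply reports UP or DOWN, so the script is safe to run even while the stack is down.

```shell
#!/bin/sh
# Probe each locally exposed HTTP endpoint and report its status.
report=""
for url in \
  "http://localhost:3000" \
  "http://localhost:8080/health" \
  "http://localhost:47044/healthz" \
  "http://localhost:32000"
do
  if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
    state="UP"
  else
    state="DOWN"
  fi
  report="$report$state $url
"
done
printf '%s' "$report"
```

The `/healthz` and `/health` paths are the same ones the compose healthchecks use for PulsejetDB and the Airflow webserver.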
Docker Compose Configuration
The platform is orchestrated by the following `docker/docker-compose.yml` file, which defines all the services, their configuration, ports, volumes, and dependencies.
Note that Kerosene writes data to the `cognitivelake` volume at `/opt/pulsejetai/data`, and ScramDB mounts this same volume to read external table data from that location.
```yaml
version: '3.8'

services:
  pulsejetdb:
    image: psila/pulsejetdb:latest
    container_name: pulsejetdb
    ports:
      - "4000:4000"   # Dashboard port
      - "47044:47044" # HTTP API
      - "47045:47045" # gRPC API (main gRPC API used by NTU)
      - "10273:10273" # Cluster communication
    volumes:
      - cognitivelake:/opt/pulsejetai
    environment:
      - LD_LIBRARY_PATH=/pulsejet/lib
    networks:
      - pulsejet-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:47044/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  scramdb:
    image: psila/scramdb:latest
    container_name: scramdb
    ports:
      - "5433:5433" # PostgreSQL wire protocol port
    volumes:
      - cognitivelake:/opt/pulsejetai # Shared volume mount
    environment:
      - LD_LIBRARY_PATH=/pulsejet/lib
      - PULSEJET_LOG=info
      - NTU_ADDRESS=http://ntu:9786 # NTU address for heartbeat/ASK
    depends_on:
      - kerosene-api
    command: ["-p", "0.0.0.0:5433", "--pg-no-auth"]
    networks:
      - pulsejet-network
    restart: unless-stopped

  # Python-based Kerosene services
  kerosene-postgres:
    image: postgres:14
    container_name: kerosene-postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - kerosene-postgres-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    networks:
      - pulsejet-network
    restart: always

  kerosene-webserver:
    image: psila/kerosene:latest # Assuming local build tag is 'kerosene:latest'
    container_name: kerosene-webserver
    user: root # Start as root; gosu drops to the airflow user
    command: sh -c "exec gosu airflow airflow webserver" # Scheduler handles DB init
    environment:
      - AIRFLOW_HOME=/opt/pulsejetai/airflow
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@kerosene-postgres:5432/airflow
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@kerosene-postgres:5432/airflow
      - AIRFLOW__CORE__DAGS_FOLDER=/opt/pulsejetai/airflow/dags # Explicitly set DAGs folder
      - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False # Ensure DAGs start unpaused
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__CORE__DAG_DISCOVERY_SAFE_MODE=False
      - KEROSENE_CONFIG=/opt/pulsejetai/kerosene/kerosene.yaml
      - AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth # Enable Basic Auth for the API
    volumes:
      - ../kerosene/kerosene_py:/opt/kerosene/kerosene_py # Code mount for local development
      - ../kerosene/kerosene.yaml:/opt/pulsejetai/kerosene/kerosene.yaml # Mount config
      - cognitivelake:/opt/pulsejetai
    ports:
      - "8080:8080" # Airflow webserver port
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    depends_on:
      - kerosene-postgres
    networks:
      - pulsejet-network
    restart: always

  kerosene-scheduler:
    image: psila/kerosene:latest
    container_name: kerosene-scheduler
    user: root # Start as root to fix volume ownership before dropping privileges
    command: >
      sh -c "mkdir -p /opt/pulsejetai/data /opt/pulsejetai/airflow &&
             chown -R airflow:airflow /opt/pulsejetai /opt/pulsejetai/data /opt/pulsejetai/airflow &&
             /usr/local/bin/docker-entrypoint.sh &&
             exec gosu airflow kerosene scheduler start"
    environment:
      - AIRFLOW_HOME=/opt/pulsejetai/airflow
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@kerosene-postgres:5432/airflow
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@kerosene-postgres:5432/airflow
      - AIRFLOW__CORE__DAGS_FOLDER=/opt/pulsejetai/airflow/dags
      - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
      - AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=5 # Fast DAG scan interval
      - AIRFLOW__CORE__DAG_DISCOVERY_SAFE_MODE=False
      - KEROSENE_CONFIG=/opt/pulsejetai/kerosene/kerosene.yaml
    volumes:
      - ../kerosene/kerosene_py:/opt/kerosene/kerosene_py
      - ../kerosene/kerosene.yaml:/opt/pulsejetai/kerosene/kerosene.yaml
      - cognitivelake:/opt/pulsejetai
    depends_on:
      - kerosene-postgres
      - kerosene-webserver
    networks:
      - pulsejet-network
    restart: always

  kerosene-api:
    image: psila/kerosene:latest
    container_name: kerosene-api
    user: root
    command: sh -c "exec gosu airflow kerosene api start" # Scheduler handles init
    environment:
      - AIRFLOW_HOME=/opt/pulsejetai/airflow
      - KEROSENE_CONFIG=/opt/pulsejetai/kerosene/kerosene.yaml
    volumes:
      - ../kerosene/kerosene_py:/opt/kerosene/kerosene_py
      - ../kerosene/kerosene.yaml:/opt/pulsejetai/kerosene/kerosene.yaml
      - cognitivelake:/opt/pulsejetai
    ports:
      - "32000:32000" # Kerosene API port
    depends_on:
      - kerosene-postgres
      - kerosene-scheduler
    networks:
      - pulsejet-network
    restart: always

  ntu:
    image: psila/ntu:latest
    container_name: ntu
    ports:
      - "9786:9786"
    volumes:
      - cognitivelake:/opt/pulsejetai # Mount for lake data
    environment:
      - LAKE_DIR=/opt/pulsejetai
      - SCRAMDB_URL=postgresql://scramdb:5433/scramdb
      - PULSEJETDB_ADDRESS=pulsejetdb:47045
      - POLL_INTERVAL_MINUTES=5
      - MAX_SAMPLE_SIZE=20
      - MAX_GROUPING_SIZE=20
      - PULSEJET_LOG=info
    depends_on:
      - pulsejetdb
    networks:
      - pulsejet-network
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  pulsedash:
    image: psila/pulsedash:latest
    container_name: pulsedash
    volumes:
      - cognitivelake:/opt/pulsejetai # Mount for lake data
    ports:
      - "3000:3000" # Dashboard port
    environment:
      - PULSEJET_GRPC_URL=pulsejetdb:47045
      - SCRAMDB_PG_URL=postgresql://scramdb:5433/scramdb
      - NTU_URL=http://ntu:9786
      - KEROSENE_API_URL=http://kerosene-api:32000
      - AIRFLOW_API_URL=http://kerosene-webserver:8080
      - AIRFLOW_API_USER=admin
      - AIRFLOW_API_PASSWORD=admin
      - LAKE_DATA_DIR=/opt/pulsejetai/data
    depends_on:
      - pulsejetdb
      - scramdb
      - kerosene-api
      - ntu
    networks:
      - pulsejet-network
    restart: unless-stopped

  # Demo Postgres instance for the Kerosene connector
  demo-postgres:
    image: bradymholt/postgres-northwind:latest
    container_name: demo-postgres
    environment:
      - POSTGRES_USER=northwind
      - POSTGRES_PASSWORD=northwind
      - POSTGRES_DB=northwind
    networks:
      - pulsejet-network
    restart: unless-stopped

networks:
  pulsejet-network:
    driver: bridge

volumes:
  kerosene-postgres-volume:
    driver: local
  cognitivelake:
    driver: local
```
Example Use Case
The Compose file also runs an external Postgres instance (`demo-postgres`) for demonstration purposes; it contains the Northwind dataset. You can easily start ingesting from external sources with connectors. For this use case we will use the PostgreSQL connector to load several tables. Let’s start by ingesting the data:
```bash
#!/bin/bash

# List of tables to process
tables=(
  "products"
)

# Loop through each table and execute the curl command
for table in "${tables[@]}"; do
  echo "Processing table: $table"

  curl -X POST http://localhost:32000/workflows \
    -H "Content-Type: application/json" \
    -d '{
      "type": "extract",
      "source": "postgres",
      "table": "'$table'",
      "schedule": "@daily",
      "shards": 4,
      "connector_config": {
        "host": "demo-postgres",
        "port": 5432,
        "user": "northwind",
        "password": "northwind",
        "database": "northwind"
      }
    }'

  # Add a small delay between requests (optional)
  sleep 1
  echo -e "\n"
done

echo "All tables scheduled for ingestion successfully!"
```
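The quote-splicing in the `-d '{ ... "'$table'" ... }'` argument above is fragile if a table name ever contains quotes or spaces. A minimal alternative sketch builds the same request body with a heredoc instead (the field names and values are copied from the script above; nothing here is a new API):

```shell
#!/bin/sh
# Build the workflow request body for one table with a heredoc, so the table
# name is expanded inside an already well-formed JSON document.
table="products"

payload=$(cat <<EOF
{
  "type": "extract",
  "source": "postgres",
  "table": "${table}",
  "schedule": "@daily",
  "shards": 4,
  "connector_config": {
    "host": "demo-postgres",
    "port": 5432,
    "user": "northwind",
    "password": "northwind",
    "database": "northwind"
  }
}
EOF
)

printf '%s\n' "$payload"
# Send it with:
#   curl -X POST http://localhost:32000/workflows \
#     -H "Content-Type: application/json" -d "$payload"
```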
After this, open http://localhost:8080 (username `admin`, password `admin`) and you will see Airflow automatically start loading the data.
This Airflow instance is managed by Kerosene, which also tracks data lineage.
By the time data loading completes, embeddings have already been generated and inserted into PulsejetDB, our vector database.
You can register this table’s data with ScramDB (our main hybrid OLAP/OLTP database) with:

```sql
CREATE EXTERNAL TABLE order_details (
    order_id   INT,
    product_id INT,
    unit_price FLOAT,
    quantity   INT,
    discount   FLOAT
)
STORED AS PARQUET
LOCATION "file:///opt/pulsejetai/data/tables/order_details"
TBLPROPERTIES ('location_type' = 'DIRECTORY', 'recursive' = 'true');
```
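The `LOCATION` clause points into the shared `cognitivelake` volume, where Kerosene writes ingested data under `/opt/pulsejetai/data`. Purely as an illustration of the directory shape ScramDB scans recursively (the shard file name `part-0.parquet` is hypothetical, not a documented Kerosene convention):

```shell
#!/bin/sh
# Recreate the expected layout in a temp dir; inside the containers the root
# would be /opt/pulsejetai on the cognitivelake volume.
LAKE=$(mktemp -d)
mkdir -p "$LAKE/data/tables/order_details"
touch "$LAKE/data/tables/order_details/part-0.parquet"  # hypothetical shard name
find "$LAKE" -type f
```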
You can select data from it with:

```sql
SELECT * FROM order_details;
```
Let’s ask it business-related questions:

```sql
SELECT ASK("Do we have longbread product?");
```

and the answer will be:

```
Yes, we have Scottish Longbreads in stock.
```
If you ingest all tables and have enough GPU memory (32 GB+), you can answer questions like:

```sql
SELECT ASK("Which supplier's products have the most quantity in the warehouse?");
```
Stopping the Platform
To stop all running services:
```shell
cd docker   # ensure you are in the docker directory
docker compose down
```
To stop and remove the associated volumes (use with caution, as this deletes data, including the shared `cognitivelake` data):

```shell
cd docker
docker compose down -v
```
Developing the Platform
- Clone the Repository:

```shell
git clone <your-psila-ai-repository-url>
cd psila-ai
```
- Download AI Models (if needed):

```shell
# Example command; check project specifics
make downloadllm
```
- Start the Platform: Navigate to the `docker` directory within the cloned repository and use Docker Compose to start all services defined in the `docker-compose.yml` file (shown above):

```shell
cd docker
docker compose up -d
```
The `-d` flag runs the containers in detached mode (in the background).
- Verify Services: You can check the status of the running containers:

```shell
docker compose ps
```
You should see services like `pulsejetdb`, `scramdb`, `kerosene-webserver`, `ntu`, and `pulsedash` in a running state. It might take a few minutes for all services, especially Airflow (Kerosene), to fully initialize.
- Access Services:
  - PulseDash (Management UI): http://localhost:3000
  - ScramDB (PostgreSQL port): connect using a SQL client to `localhost:5433` (see the Connecting guide).
  - Kerosene (Airflow UI): http://localhost:8080 (default Airflow credentials are `admin` / `admin`).
  - PulsejetDB API (HTTP): http://localhost:47044
  - Kerosene API: http://localhost:32000