Local Platform Setup

This guide explains how to run all components of the Pulsejet.AI platform (ScramDB, PulsejetDB, NTU, Kerosene, PulseDash) on your local machine using Docker Compose.

Prerequisites

  1. Docker and Docker Compose: Ensure you have Docker Engine and Docker Compose installed and running on your system. Refer to the official Docker documentation for installation instructions.
  2. Git: You’ll need Git to clone the main Pulsejet.AI repository.
  3. (Optional) AI Models: If you plan to use the AI features involving NTU (like embedding generation or the planned ASK command), you may need to create integrations for AI models via SQL. Check the project’s main README.md or Makefile for specific commands like make downloadllm.
  4. (Optional) Nvidia GPU & Drivers: For GPU-accelerated AI tasks within NTU, ensure you have a compatible Nvidia GPU, the appropriate drivers, and the NVIDIA Container Toolkit installed. The provided docker-compose.yml includes configuration for GPU access.
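If you plan to use GPU acceleration, it can save debugging time to verify GPU passthrough before starting the stack. The sketch below is one way to do this; the CUDA image tag is only an illustrative choice, and any image that ships nvidia-smi works:

```shell
#!/bin/sh
# Sanity-check that Docker can see your GPU before starting the stack.
# The CUDA image tag is an example; any image containing nvidia-smi works.
CUDA_IMAGE="nvidia/cuda:12.4.1-base-ubuntu22.04"
if command -v docker >/dev/null 2>&1; then
  # Should print the nvidia-smi device table if the NVIDIA Container Toolkit is set up.
  docker run --rm --gpus all "$CUDA_IMAGE" nvidia-smi || echo "GPU passthrough not working"
else
  echo "docker not installed; skipping GPU check"
fi
```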

Running the Platform

Once the platform is up (see the startup steps under “Developing the Platform” below), the services are available at the following addresses:
    • PulseDash (Management UI): http://localhost:3000
    • ScramDB (PostgreSQL Port): Connect using a SQL client to localhost:5433 (See Connecting guide).
    • Kerosene (Airflow UI): http://localhost:8080 (Default Airflow credentials are admin/admin).
    • PulsejetDB API (HTTP): http://localhost:47044
    • PulsejetDB API (GRPC): http://localhost:47045
    • Kerosene API: http://localhost:32000
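A quick way to confirm the HTTP services are reachable is to probe them with curl. This is a sketch: the /healthz and /health paths come from the health checks in the compose file below, while the other endpoints are simply the documented ports.

```shell
#!/bin/sh
# Probe each HTTP endpoint with a short timeout; prints OK or DOWN per service.
checked=0
for url in \
  "http://localhost:3000" \
  "http://localhost:8080/health" \
  "http://localhost:47044/healthz" \
  "http://localhost:32000"; do
  if curl -fsS -o /dev/null --max-time 3 "$url" 2>/dev/null; then
    echo "OK   $url"
  else
    echo "DOWN $url"
  fi
  checked=$((checked + 1))
done
echo "probed $checked endpoints"
```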

Docker Compose Configuration

The platform is orchestrated using the following docker/docker-compose.yml file. This defines all the services, their configurations, ports, volumes, and dependencies. Note that Kerosene writes data to the cognitivelake volume at /opt/pulsejetai/data, and ScramDB mounts this same volume to read external table data from that location.
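Once the stack is running, you can see this shared-volume wiring in action by listing the lake directory from inside the scramdb container. This is an illustrative check; the container name and path are taken from the compose file:

```shell
#!/bin/sh
# List Kerosene's output directory from inside the scramdb container to
# confirm both services see the same cognitivelake volume.
LAKE_DIR="/opt/pulsejetai/data"
if command -v docker >/dev/null 2>&1; then
  docker exec scramdb ls -la "$LAKE_DIR" || echo "is the stack running?"
else
  echo "docker not installed; skipping check"
fi
```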

docker/docker-compose.yml
version: '3.8'

services:
  pulsejetdb:
    image: psila/pulsejetdb:latest
    container_name: pulsejetdb
    ports:
      - "4000:4000"   # Dashboard port
      - "47044:47044" # HTTP API
      - "47045:47045" # gRPC API (main gRPC API used by NTU)
      - "10273:10273" # Cluster communication
    volumes:
      - cognitivelake:/opt/pulsejetai
    environment:
      - LD_LIBRARY_PATH=/pulsejet/lib
    networks:
      - pulsejet-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:47044/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  scramdb:
    image: psila/scramdb:latest
    container_name: scramdb
    ports:
      - "5433:5433" # PostgreSQL wire protocol port
    volumes:
      - cognitivelake:/opt/pulsejetai # Shared lake volume
    environment:
      - LD_LIBRARY_PATH=/pulsejet/lib
      - PULSEJET_LOG=info
      - NTU_ADDRESS=http://ntu:9786 # NTU address for heartbeat/ASK
    depends_on:
      - kerosene-api
    command: ["-p", "0.0.0.0:5433", "--pg-no-auth"]
    networks:
      - pulsejet-network
    restart: unless-stopped

  # Python-based Kerosene services
  kerosene-postgres:
    image: postgres:14
    container_name: kerosene-postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - kerosene-postgres-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    networks:
      - pulsejet-network
    restart: always

  kerosene-webserver:
    image: psila/kerosene:latest
    container_name: kerosene-webserver
    user: root # Start as root; gosu drops privileges to the airflow user
    command: sh -c "exec gosu airflow airflow webserver" # Scheduler handles DB init
    environment:
      - AIRFLOW_HOME=/opt/pulsejetai/airflow
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@kerosene-postgres:5432/airflow
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@kerosene-postgres:5432/airflow
      - AIRFLOW__CORE__DAGS_FOLDER=/opt/pulsejetai/airflow/dags
      - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False # DAGs start unpaused
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__CORE__DAG_DISCOVERY_SAFE_MODE=False
      - KEROSENE_CONFIG=/opt/pulsejetai/kerosene/kerosene.yaml
      - AIRFLOW__API__AUTH_BACKENDS=airflow.api.auth.backend.basic_auth # Enable Basic Auth for the API
    volumes:
      - ../kerosene/kerosene_py:/opt/kerosene/kerosene_py # Code mount for local dev
      - ../kerosene/kerosene.yaml:/opt/pulsejetai/kerosene/kerosene.yaml
      - cognitivelake:/opt/pulsejetai
    ports:
      - "8080:8080" # Airflow webserver port
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    depends_on:
      - kerosene-postgres
    networks:
      - pulsejet-network
    restart: always

  kerosene-scheduler:
    image: psila/kerosene:latest
    container_name: kerosene-scheduler
    user: root # Start as root; gosu drops privileges to the airflow user
    # Create and chown the shared directories, run the entrypoint once,
    # then start the scheduler in the foreground.
    command: >
      sh -c "mkdir -p /opt/pulsejetai/data /opt/pulsejetai/airflow &&
      chown -R airflow:airflow /opt/pulsejetai /opt/pulsejetai/data /opt/pulsejetai/airflow &&
      /usr/local/bin/docker-entrypoint.sh &&
      exec gosu airflow kerosene scheduler start"
    environment:
      - AIRFLOW_HOME=/opt/pulsejetai/airflow
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@kerosene-postgres:5432/airflow
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@kerosene-postgres:5432/airflow
      - AIRFLOW__CORE__DAGS_FOLDER=/opt/pulsejetai/airflow/dags
      - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False # DAGs start unpaused
      - AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=5 # Fast DAG scan interval
      - AIRFLOW__CORE__DAG_DISCOVERY_SAFE_MODE=False
      - KEROSENE_CONFIG=/opt/pulsejetai/kerosene/kerosene.yaml
    volumes:
      - ../kerosene/kerosene_py:/opt/kerosene/kerosene_py # Code mount for local dev
      - ../kerosene/kerosene.yaml:/opt/pulsejetai/kerosene/kerosene.yaml
      - cognitivelake:/opt/pulsejetai
    depends_on:
      - kerosene-postgres
      - kerosene-webserver
    networks:
      - pulsejet-network
    restart: always

  kerosene-api:
    image: psila/kerosene:latest
    container_name: kerosene-api
    user: root # Start as root; gosu drops privileges to the airflow user
    command: sh -c "exec gosu airflow kerosene api start" # Scheduler handles init
    environment:
      - AIRFLOW_HOME=/opt/pulsejetai/airflow
      - KEROSENE_CONFIG=/opt/pulsejetai/kerosene/kerosene.yaml
    volumes:
      - ../kerosene/kerosene_py:/opt/kerosene/kerosene_py # Code mount for local dev
      - ../kerosene/kerosene.yaml:/opt/pulsejetai/kerosene/kerosene.yaml
      - cognitivelake:/opt/pulsejetai
    ports:
      - "32000:32000" # Kerosene API port
    depends_on:
      - kerosene-postgres
      - kerosene-scheduler
    networks:
      - pulsejet-network
    restart: always

  ntu:
    image: psila/ntu:latest
    container_name: ntu
    ports:
      - "9786:9786"
    volumes:
      - cognitivelake:/opt/pulsejetai # Lake data mount
    environment:
      - LAKE_DIR=/opt/pulsejetai
      - SCRAMDB_URL=postgresql://scramdb:5433/scramdb
      - PULSEJETDB_ADDRESS=pulsejetdb:47045
      - POLL_INTERVAL_MINUTES=5
      - MAX_SAMPLE_SIZE=20
      - MAX_GROUPING_SIZE=20
      - PULSEJET_LOG=info
    depends_on:
      - pulsejetdb
    networks:
      - pulsejet-network
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  pulsedash:
    image: psila/pulsedash:latest
    container_name: pulsedash
    volumes:
      - cognitivelake:/opt/pulsejetai # Lake data mount
    ports:
      - "3000:3000" # Dashboard port
    environment:
      - PULSEJET_GRPC_URL=pulsejetdb:47045
      - SCRAMDB_PG_URL=postgresql://scramdb:5433/scramdb
      - NTU_URL=http://ntu:9786
      - KEROSENE_API_URL=http://kerosene-api:32000
      - AIRFLOW_API_URL=http://kerosene-webserver:8080
      - AIRFLOW_API_USER=admin
      - AIRFLOW_API_PASSWORD=admin
      - LAKE_DATA_DIR=/opt/pulsejetai/data
    depends_on:
      - pulsejetdb
      - scramdb
      - kerosene-api
      - ntu
    networks:
      - pulsejet-network
    restart: unless-stopped

  # Demo Postgres instance for the Kerosene connector
  demo-postgres:
    image: bradymholt/postgres-northwind:latest
    container_name: demo-postgres
    environment:
      - POSTGRES_USER=northwind
      - POSTGRES_PASSWORD=northwind
      - POSTGRES_DB=northwind
    networks:
      - pulsejet-network
    restart: unless-stopped

networks:
  pulsejet-network:
    driver: bridge

volumes:
  kerosene-postgres-volume:
    driver: local
  cognitivelake:
    driver: local
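Before starting anything, you can validate the compose file; `docker compose config` parses and resolves the YAML without creating containers. A small sketch (the file path assumes you run it from the repository root):

```shell
#!/bin/sh
# Parse and validate the compose file without starting any containers.
COMPOSE_FILE="docker/docker-compose.yml"
if command -v docker >/dev/null 2>&1; then
  docker compose -f "$COMPOSE_FILE" config --quiet \
    && echo "compose file is valid" \
    || echo "validation failed (is the file present?)"
else
  echo "docker not installed; skipping validation"
fi
```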

Example Use Case

As you can see, the compose file also runs an external Postgres instance (demo-postgres) for demonstration purposes; it contains the Northwind sample dataset. You can easily start ingestion from external sources using connectors.

For this use case we will use the PostgreSQL connector and load various tables. Let’s start by ingesting the data!

#!/bin/bash

# List of tables to process
tables=(
  "products"
)

# Loop through each table and submit an extraction workflow
for table in "${tables[@]}"; do
  echo "Processing table: $table"
  curl -X POST http://localhost:32000/workflows \
    -H "Content-Type: application/json" \
    -d '{
      "type": "extract",
      "source": "postgres",
      "table": "'"$table"'",
      "schedule": "@daily",
      "shards": 4,
      "connector_config": {
        "host": "demo-postgres",
        "port": 5432,
        "user": "northwind",
        "password": "northwind",
        "database": "northwind"
      }
    }'

  # Small delay between requests (optional)
  sleep 1
  echo -e "\n"
done

echo "All tables scheduled for ingestion successfully!"

After this, open http://localhost:8080 (username admin, password admin) and you will see that Airflow automatically starts loading the data. This Airflow instance is managed by Kerosene, which also tracks data lineage.
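Since the compose file enables Airflow's basic-auth API backend, you can also confirm the ingestion DAGs from the command line through Airflow's stable REST API. The /api/v1/dags path is Airflow's documented endpoint; the DAG names themselves are created by Kerosene and may vary:

```shell
#!/bin/sh
# List registered DAGs via Airflow's stable REST API using the basic-auth
# backend enabled in the compose file.
AIRFLOW_URL="http://localhost:8080/api/v1/dags"
curl -fsS -u admin:admin "$AIRFLOW_URL" 2>/dev/null \
  || echo "Airflow API not reachable yet; give the webserver a minute"
```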

When data loading completes, embeddings have already been generated and inserted into our vector database, PulsejetDB.

You can register this table’s data with ScramDB (our main hybrid OLAP/OLTP database) with:

CREATE EXTERNAL TABLE order_details (order_id INT, product_id INT, unit_price FLOAT, quantity INT, discount FLOAT)
STORED AS PARQUET LOCATION "file:///opt/pulsejetai/data/tables/order_details"
TBLPROPERTIES ('location_type' = 'DIRECTORY', 'recursive' = 'true');

You can then select data from it:

SELECT * FROM order_details;
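Because ScramDB speaks the PostgreSQL wire protocol and is started with --pg-no-auth in the compose file, any Postgres client should work. A sketch with psql (the database name "scramdb" is an assumption taken from the connection URLs in the compose file):

```shell
#!/bin/sh
# Run a query against ScramDB over the PostgreSQL wire protocol on port 5433.
# --pg-no-auth in the compose file means no credentials should be required.
PGQUERY='SELECT * FROM order_details LIMIT 5;'
if command -v psql >/dev/null 2>&1; then
  psql -h localhost -p 5433 -d scramdb -c "$PGQUERY" || echo "is ScramDB running?"
else
  echo "psql not installed; skipping query"
fi
```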

Let’s ask it business-related questions:

SELECT ASK("Do we have longbread product?");

and the answer will be:

" Yes, we have Scottish Longbreads in stock."

If you ingest all the tables and have enough GPU memory (32 GB+), you can ask questions like:

SELECT ASK("Which supplier's products have the most quantity in the warehouse?");

Stopping the Platform

To stop all running services:

Terminal window
cd docker # Ensure you are in the docker directory
docker compose down

To stop and remove associated volumes (use with caution, as this deletes data, including the cognitivelake shared data):

Terminal window
cd docker
docker compose down -v
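If you want to preserve the lake data before running `down -v`, one option is to archive the volume first. This is a hypothetical sketch: Compose prefixes volume names with the project name (usually the directory name, here assumed to be "docker"), so the full volume name is assumed to be docker_cognitivelake; check `docker volume ls` to confirm.

```shell
#!/bin/sh
# Archive the shared cognitivelake volume to a tarball in the current directory.
# VOLUME assumes the default Compose project-name prefix; verify with `docker volume ls`.
VOLUME="docker_cognitivelake"
BACKUP="cognitivelake-backup.tar.gz"
if command -v docker >/dev/null 2>&1; then
  docker run --rm -v "$VOLUME":/data -v "$PWD":/backup alpine \
    tar czf "/backup/$BACKUP" -C /data . \
    || echo "backup failed (does the volume exist?)"
else
  echo "docker not installed; skipping backup"
fi
```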

Developing the Platform

  1. Clone the Repository:
Terminal window
git clone <your-psila-ai-repository-url>
cd psila-ai
  2. Download AI Models (if needed):
Terminal window
# Example command, check project specifics
make downloadllm
  3. Start the Platform: Navigate to the docker directory within the cloned repository and use Docker Compose to start all services defined in the docker-compose.yml file (shown above):
Terminal window
cd docker
docker compose up -d

The -d flag runs the containers in detached mode (in the background).

  4. Verify Services: You can check the status of the running containers:
Terminal window
docker compose ps

You should see services like pulsejetdb, scramdb, kerosene-webserver, ntu, pulsedash, etc., in a running state. It might take a few minutes for all services, especially Airflow (Kerosene), to fully initialize.
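If a service stays in a non-running or unhealthy state, inspect its recent logs. In this stack the scheduler performs the one-time Airflow database initialization, so it is a reasonable first place to look:

```shell
#!/bin/sh
# Show the last 100 log lines of a service; swap SERVICE for any other
# service name from the compose file (scramdb, ntu, pulsedash, ...).
SERVICE="kerosene-scheduler"
if command -v docker >/dev/null 2>&1; then
  docker compose logs --tail=100 "$SERVICE" || echo "is the stack running?"
else
  echo "docker not installed; skipping log check"
fi
```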

  5. Access Services:
  • PulseDash (Management UI): http://localhost:3000
  • ScramDB (PostgreSQL Port): Connect using a SQL client to localhost:5433 (See Connecting guide).
  • Kerosene (Airflow UI): http://localhost:8080 (Default Airflow credentials are admin/admin).
  • PulsejetDB API (HTTP): http://localhost:47044
  • Kerosene API: http://localhost:32000