
How Pulsejet.AI Works

Pulsejet.AI combines several specialized components into a cohesive CognitiveLake Platform, designed to handle both traditional data processing and advanced AI tasks efficiently. Here’s a simplified look at the main components and how they interact:

  1. Client Interaction (SQL): Users and applications connect to ScramDB using standard PostgreSQL tools and clients (such as psql, DBeaver, or application drivers) on port 5433. ScramDB acts as the main entry point for SQL queries (see the connection sketch after this list).

  2. Query Processing (ScramVM): Inside ScramDB, the ScramVM query engine takes over. It parses SQL, optimizes the query, and compiles it into efficient bytecode. ScramVM uses a parallel, morsel-based execution strategy, breaking large tasks into smaller chunks that are processed across multiple CPU cores simultaneously for high performance. It also caches compiled bytecode to speed up repeated queries.

  3. Data Storage (Tundra & External)

    • Internal Data: For tables created directly with CREATE TABLE, ScramDB uses the embedded Tundra storage engine, which provides fast, local data access.
    • External Data: You can define EXTERNAL TABLEs that point to data stored elsewhere (e.g., Parquet or CSV files on local disk or in cloud storage). ScramVM reads this data directly during query execution (see the table sketch after this list).
  4. Data Orchestration (Kerosene): Kerosene (powered by Apache Airflow) handles the ingestion and processing of data from various external sources (databases, S3, Kafka, etc.). It typically converts this data into formats such as Parquet and can then register the resulting files as external tables within ScramDB (this registration step is planned). You can manage Kerosene workflows via its UI (usually on port 8080).

  5. Vector Storage & Search (PulsejetDB): PulsejetDB is a specialized database optimized for storing and searching high-dimensional vector embeddings, which are crucial for AI tasks such as similarity search and retrieval-augmented generation (RAG). It provides efficient indexing (e.g., HNSW) and search capabilities.

  6. AI Processing (NTU): NTU acts as the AI brain of the platform. It handles tasks such as:

    • Generating vector embeddings from your data (reading from ScramDB, storing in PulsejetDB).
    • Processing natural language queries using the planned ASK command (interacting with PulsejetDB for context and external LLMs for generation); a hypothetical sketch of this appears after the list.
  7. Management UI (PulseDash): PulseDash provides a web-based interface (on port 3000) to monitor the platform, manage connections (such as Kerosene connectors), inspect data (in ScramDB and potentially PulsejetDB), and execute queries.
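
For illustration, a connection and a routine query might look like the sketch below. The host, user, database, table, and column names are placeholders, not part of the platform.

```sql
-- Connect with any standard PostgreSQL client, e.g.:
--   psql -h localhost -p 5433 -U <user> -d <database>
-- (host, user, and database names are placeholders)

-- A routine analytical query: ScramVM parses and optimizes it, compiles it
-- to bytecode (cached for reuse), and executes it in parallel morsels
-- across CPU cores.
SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY region
ORDER BY revenue DESC;
```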
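
The difference between internal (Tundra) and external tables can be sketched as follows. The CREATE TABLE statement is standard SQL; the EXTERNAL TABLE options (LOCATION, FORMAT) are assumptions about the DDL, shown only to illustrate the idea.

```sql
-- Internal table: stored in the embedded Tundra engine.
CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type TEXT,
    created_at TIMESTAMP
);

-- External table: points at data that already lives outside ScramDB,
-- e.g. Parquet files on local disk or in object storage. ScramVM reads
-- the files directly at query time. (LOCATION and FORMAT are assumed
-- keywords; the exact DDL may differ.)
CREATE EXTERNAL TABLE events_archive (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type TEXT,
    created_at TIMESTAMP
)
LOCATION 's3://my-bucket/events/'
FORMAT PARQUET;

-- Both kinds of tables are then queried with ordinary SQL.
SELECT event_type, COUNT(*) AS n
FROM events_archive
GROUP BY event_type;
```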
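
Finally, the planned ASK command is intended to combine vector retrieval from PulsejetDB with an external LLM. It is not yet available, and the syntax below is purely a hypothetical sketch of how such a query might look.

```sql
-- Hypothetical: ASK is a planned command and this syntax is a guess.
-- NTU would embed the question, fetch similar vectors from PulsejetDB
-- for context, and call an external LLM to generate the answer.
ASK 'Which products had the biggest drop in sales last quarter?';
```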

In essence, ScramDB provides the familiar SQL front-end and high-performance query engine, Tundra handles fast internal storage, Kerosene manages external data pipelines, PulsejetDB specializes in vector data, NTU orchestrates AI tasks, and PulseDash offers a central management view. This modular architecture allows the platform to handle diverse data types and workloads efficiently.