Modern AI applications are no longer single-shot inference calls. They are long-running agents that plan, act, observe, and retry across time. An AI agent loop that retrieves context from a vector store, calls an LLM, writes results to a database, waits for human approval, and then triggers downstream actions can run for minutes, hours, or even days. Without a durable orchestration layer, any transient infrastructure failure restarts the entire loop from scratch: re-billing expensive LLM calls, duplicating side effects, and losing all accumulated context.

Coordinating these agents reliably requires a layer that:

  • Survives crashes: agent state is persisted after every step, not held in-memory
  • Guarantees exactly-once results: once an LLM call or external API write completes and its result is recorded, it is never re-invoked
  • Scales horizontally: thousands of concurrent agent instances without bottlenecks
  • Enables long sleeps: an agent can wait days for a human-in-the-loop response without holding a process open
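
The long-sleep requirement can be sketched with a durable timer table (a toy model, not any real scheduler's API; all names here are illustrative): the agent persists a wake-up time and releases its process, and any worker that later scans the table resumes it.

```python
# Toy durable-timer model: instead of sleeping in-process, persist a wake-at
# time; any scheduler process can later fire due timers and resume the agent.
timers = []  # stands in for a durable table of (fire_at, workflow_id) rows

def sleep_until(fire_at: float, workflow_id: str) -> None:
    timers.append((fire_at, workflow_id))  # record the timer, free the process

def fire_due(now: float) -> list:
    due = [wf for at, wf in timers if at <= now]        # timers ready to fire
    timers[:] = [(at, wf) for at, wf in timers if at > now]  # keep the rest
    return due

sleep_until(100.0, "agent-1")  # waits "days" without an open process
sleep_until(999.0, "agent-2")
assert fire_due(now=500.0) == ["agent-1"]  # only due timers resume
```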

Temporal is the leading open-source platform for this class of problem. CockroachDB is its ideal distributed persistence backend, giving Temporal’s stateless execution cluster an indestructible, globally replicated storage foundation.

A workflow orchestration framework manages the lifecycle of long-running, multi-step programs. Instead of writing retry loops, checkpointing logic, and failure recovery by hand, you declare your business logic as a sequence of durable steps and let the framework handle the rest. The core promises are:

  • Durability: workflow state survives process crashes, restarts, and infrastructure failures
  • Exactly-once semantics: a step’s result is recorded exactly once, and completed steps are never re-executed on replay
  • Idempotency: re-invoking the same workflow with the same ID is a no-op
  • Observability: the full execution history is inspectable at any point
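
The idempotency promise can be sketched in plain Python (an illustrative in-memory registry, not any framework's API): starting a workflow under an ID that already exists returns the existing run instead of launching a duplicate.

```python
# Illustrative sketch of workflow-ID idempotency. A real orchestrator persists
# this mapping durably; the names here are hypothetical.
class WorkflowRegistry:
    def __init__(self):
        self._runs = {}  # workflow_id -> run record

    def start(self, workflow_id: str, payload: str) -> dict:
        # Re-invoking with the same ID is a no-op: return the existing run.
        if workflow_id in self._runs:
            return self._runs[workflow_id]
        run = {"id": workflow_id, "payload": payload, "status": "RUNNING"}
        self._runs[workflow_id] = run
        return run

registry = WorkflowRegistry()
first = registry.start("order-42", "charge card")
second = registry.start("order-42", "charge card")  # duplicate request
assert first is second  # no second execution was started
```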

What Is Temporal?

Temporal is an open-source, language-agnostic platform for building reliable distributed applications. It introduces the concept of Durable Execution, the guarantee that workflow logic runs to completion regardless of infrastructure failures.

Core Concepts

  • Workflow: a fault-tolerant function that orchestrates Activities; can run for years
  • Activity: a single, retriable unit of work (API call, DB write, ML inference)
  • Worker: a process that polls Temporal for tasks and executes Workflows and Activities
  • Event History: an append-only log of every Command and Event in a workflow’s lifetime; the source of truth for recovery
  • Namespace: a logical isolation boundary with separate event histories, task queues, and quotas
  • Task Queue: a durable channel connecting a Workflow or Activity to a set of Workers
  • Signal / Query: mechanisms for external code to send data to, or read state from, a running workflow

How Temporal Works

Understanding why CockroachDB is the right persistence backend for Temporal requires understanding how Temporal stores state. The design choices that make Temporal reliable are the same ones that demand a distributed, strongly-consistent database underneath.

Workflow as a State Machine

Every running workflow is modelled as a state machine. Each external interaction (an activity completing, a timer firing, a signal arriving) produces a new event appended to the workflow’s event history log. The current state of a workflow is fully determined by replaying that log from the beginning.

[Figure: Temporal workflow state machine. A workflow execution is a deterministic state machine driven by an append-only event history.]

When a Worker restarts after a crash, it re-fetches the event history and replays the workflow function. Steps that already completed are skipped instantly; execution continues from the last committed state.
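
Replay can be sketched as follows (a toy model, not Temporal's internals): completed step results live in an append-only history, and re-running the workflow function consumes recorded results instead of re-executing the steps.

```python
# Toy replay model: the event history memoizes completed steps, so a
# restarted worker skips them and resumes from the last committed state.
def run_step(history: list, step_name: str, fn):
    # If this step already has a recorded result, replay it without executing.
    for name, result in history:
        if name == step_name:
            return result
    result = fn()                        # first execution: do the real work
    history.append((step_name, result))  # append-only commit to the history
    return result

history = []
run_step(history, "retrieve", lambda: "ctx")
run_step(history, "llm_call", lambda: "answer")

# Simulated crash/restart: replay the same steps against the same history.
calls = []
out = run_step(history, "llm_call", lambda: calls.append(1) or "answer-again")
assert out == "answer" and calls == []  # recorded result reused, fn not re-run
```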

Consistency is Non-Negotiable

Every state transition must atomically update the workflow state and enqueue the next task. If those two writes were separate and one failed, the system would be left inconsistent: either a phantom task referencing state that never committed, or committed state whose next task is never delivered, leaving the workflow stuck.

[Figure: Temporal transactional updates. State transitions atomically update workflow state and task queue entries in a single transaction.]

This is why Temporal demands a strongly-consistent relational store with full ACID guarantees. CockroachDB, which provides serializable isolation at any scale, is a natural fit where a single PostgreSQL primary would become a bottleneck.

Visibility: Queryable Workflow Index

In addition to the main persistence store, Temporal maintains a Visibility store, a secondary database optimised for querying workflow executions by status, type, start time, and custom search attributes stored as JSONB.

[Figure: Temporal workflow visibility. The Visibility store indexes workflow executions for list and filter queries using JSONB search attributes.]

The standard PostgreSQL visibility schema indexes JSONB via CREATE EXTENSION IF NOT EXISTS btree_gin, a PostgreSQL-only extension that does not exist in CockroachDB. The fix is CockroachDB’s native CREATE INVERTED INDEX, which provides the same capability without any extension (see the schema section below).

Full Cluster Architecture with CockroachDB

A Temporal cluster consists of four independently scalable stateless services fronting two durable storage tiers. When CockroachDB backs both stores, the entire persistence tier gains distributed replication, automatic failover, and horizontal scalability, all transparent to Temporal’s services.

[Figure: Temporal cluster architecture with CockroachDB as the persistence and visibility backend.]

  • Frontend: gRPC gateway; routes client requests to the correct History shard
  • History: owns workflow state machines; processes commands and records events
  • Matching: manages task queues; dispatches tasks to available Workers
  • Worker: runs internal system workflows (replication, archival, cleanup)
  • Persistence store (CockroachDB): event histories, timers, transfer queues; strong consistency, distributed writes
  • Visibility store (CockroachDB): queryable execution index; JSONB inverted index replaces btree_gin

Why CockroachDB for Temporal?

Temporal officially supports PostgreSQL, MySQL, SQLite, and Cassandra. CockroachDB is a natural fit for the persistence store because it brings:

  • Distributed, strongly-consistent SQL: the same ACID guarantees PostgreSQL provides, at any scale
  • Active-active multi-region: Temporal history shards can be distributed across regions without manual replication
  • Automatic failover: node failures are transparent to Temporal’s services
  • Horizontal scalability: scale reads and writes without sharding logic in the application
  • PostgreSQL wire protocol: Temporal’s existing postgres12 plugin works directly

Deploying Temporal on CockroachDB

Step 1: Provision CockroachDB databases and user

CREATE DATABASE temporal;
CREATE DATABASE temporal_visibility;
CREATE USER temporal WITH PASSWORD 'temporal';
GRANT ALL ON DATABASE temporal TO temporal;
GRANT ALL ON DATABASE temporal_visibility TO temporal;

Step 2: Initialize the persistence schema

The main schema works with CockroachDB out of the box via Temporal’s SQL tool:

temporal-sql-tool \
  --plugin postgres12 \
  --ep "<crdb-host>" \
  --port 26257 \
  --user temporal \
  --password "${TEMPORAL_DB_PASSWORD}" \
  --db temporal \
  --tls \
  --tls-ca-file /certs/ca.crt \
  --tls-cert-file /certs/client.temporal.crt \
  --tls-key-file /certs/client.temporal.key \
  setup-schema -v 0.0

temporal-sql-tool \
  --plugin postgres12 \
  --ep "<crdb-host>" \
  --port 26257 \
  --user temporal \
  --password "${TEMPORAL_DB_PASSWORD}" \
  --db temporal \
  --tls ... \
  update-schema -d ./schema/postgresql/v12/temporal/versioned

Step 3: Fix the visibility schema for CockroachDB

This is the critical fix. Temporal’s advanced visibility schema contains CREATE EXTENSION IF NOT EXISTS btree_gin, a PostgreSQL-only extension that enables B-tree operators on GIN indexes. CockroachDB does not support this extension, and the schema migration fails at this line.

The root cause is this migration, referenced from schema/postgresql/v12/visibility/versioned/v1.1/manifest.json:

-- PostgreSQL only; FAILS on CockroachDB
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX custom_search_attributes_idx
  ON executions_visibility
  USING gin(search_attributes jsonb_path_ops);

The fix: bypass the standard migration tool for the visibility database and apply a CockroachDB-compatible schema directly. CockroachDB natively supports inverted indexes on JSONB columns without any extension:

-- CockroachDB-compatible visibility schema
CREATE TABLE executions_visibility (
  namespace_id           VARCHAR(64)   NOT NULL,
  run_id                 VARCHAR(64)   NOT NULL,
  start_time             TIMESTAMPTZ   NOT NULL,
  execution_time         TIMESTAMPTZ   NOT NULL,
  workflow_id            VARCHAR(255)  NOT NULL,
  workflow_type_name     VARCHAR(255)  NOT NULL,
  status                 INT4          NOT NULL,
  close_time             TIMESTAMPTZ,
  history_length         BIGINT,
  history_size_bytes     BIGINT,
  execution_duration     BIGINT,
  state_transition_count BIGINT,
  memo                   BYTEA,
  encoding               VARCHAR(64)   NOT NULL,
  task_queue             VARCHAR(255)  NOT NULL DEFAULT '',
  search_attributes      JSONB,
  parent_workflow_id     VARCHAR(255),
  parent_run_id          VARCHAR(255),
  root_workflow_id       VARCHAR(255)  NOT NULL DEFAULT '',
  root_run_id            VARCHAR(255)  NOT NULL DEFAULT '',
  PRIMARY KEY (namespace_id, run_id)
);

-- Standard B-tree indexes; work identically on CockroachDB
CREATE INDEX by_type_start_time
  ON executions_visibility (namespace_id, workflow_type_name, start_time DESC, run_id);
CREATE INDEX by_workflow_id_start_time
  ON executions_visibility (namespace_id, workflow_id, start_time DESC, run_id);
CREATE INDEX by_status_start_time
  ON executions_visibility (namespace_id, status, start_time DESC, run_id);
CREATE INDEX by_close_time
  ON executions_visibility (namespace_id, status, close_time DESC, run_id)
  WHERE close_time IS NOT NULL;

-- CockroachDB native inverted index replaces btree_gin GIN index
CREATE INVERTED INDEX by_search_attributes
  ON executions_visibility (search_attributes);

Initialize the visibility database using this schema file directly:

cockroach sql \
  --url "postgresql://temporal@<crdb-host>:26257/temporal_visibility" \
  --certs-dir=/certs \
  --file ./crdb_visibility_schema.sql
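
Once the schema is applied, list-and-filter queries on custom search attributes can use JSONB containment, which CockroachDB serves from the inverted index. A sketch (the CustomerTier attribute name is illustrative):

-- Containment query served by the by_search_attributes inverted index
SELECT workflow_id, status, start_time
  FROM executions_visibility
 WHERE namespace_id = '<namespace-id>'
   AND search_attributes @> '{"CustomerTier": "enterprise"}'
 ORDER BY start_time DESC;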

Step 4: Configure Temporal server

persistence:
  defaultStore: crdb-default
  visibilityStore: crdb-visibility
  numHistoryShards: 512
  datastores:
    crdb-default:
      sql:
        pluginName: "postgres12"
        databaseName: "temporal"
        connectAddr: "<crdb-host>:26257"
        connectProtocol: "tcp"
        user: "temporal"
        password: "${TEMPORAL_DB_PASSWORD}"
        maxConns: 20
        maxIdleConns: 20
        maxConnLifetime: "1h"
        tls:
          enabled: true
          caFile: "/certs/ca.crt"
          certFile: "/certs/client.temporal.crt"
          keyFile: "/certs/client.temporal.key"
          serverName: "<crdb-host>"
    crdb-visibility:
      sql:
        pluginName: "postgres12"
        databaseName: "temporal_visibility"
        connectAddr: "<crdb-host>:26257"
        connectProtocol: "tcp"
        user: "temporal"
        password: "${TEMPORAL_DB_PASSWORD}"
        maxConns: 10
        maxIdleConns: 10
        maxConnLifetime: "1h"
        tls:
          enabled: true
          caFile: "/certs/ca.crt"
          certFile: "/certs/client.temporal.crt"
          keyFile: "/certs/client.temporal.key"

Step 5: Write your first durable AI agent workflow

The following agent loop retrieves context, calls an LLM, records a pending approval, then sleeps until a human signals a decision. Each external side effect is an Activity: its result is recorded once in the event history, so an LLM call that costs money is never re-issued after it succeeds, even if the process crashes between steps. The human-in-the-loop step uses Temporal’s idiomatic Signal pattern, so the agent can wait for days without holding a process open. The vector_store, llm_client, approvals_db, and results_db clients are placeholders for your own integrations.

import asyncio
from datetime import timedelta
from typing import Optional

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.common import RetryPolicy
from temporalio.worker import Worker

@activity.defn
async def retrieve_context(task: str) -> str:
    """Query a vector store for relevant context."""
    return await vector_store.search(task)  # placeholder client

@activity.defn
async def call_llm(context: str) -> str:
    """Call the LLM: billed once; the recorded result is reused on replay."""
    return await llm_client.complete(f"Given this context: {context}, respond.")

@activity.defn
async def request_human_approval(response: str) -> None:
    """Write a pending-approval record for a human to review."""
    await approvals_db.create_pending(response)

@activity.defn
async def write_final_result(result: str) -> None:
    """Persist the approved result: exactly once."""
    await results_db.insert(result)

@workflow.defn
class AICockroachAgentWorkflow:
    def __init__(self) -> None:
        self._approved: Optional[bool] = None

    @workflow.signal
    def approval_decision(self, approved: bool) -> None:
        """Sent by external code once a human approves or rejects."""
        self._approved = approved

    @workflow.run
    async def run(self, task: str) -> str:
        # Step 1: retrieve context (retries safely; the query is idempotent)
        context = await workflow.execute_activity(
            retrieve_context, task,
            start_to_close_timeout=timedelta(minutes=2),
        )
        # Step 2: LLM call (result recorded once; no double billing)
        response = await workflow.execute_activity(
            call_llm, context,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        # Step 3: human-in-the-loop; the workflow sleeps durably (days if
        # needed) until the approval_decision signal arrives
        await workflow.execute_activity(
            request_human_approval, response,
            start_to_close_timeout=timedelta(seconds=30),
        )
        await workflow.wait_condition(
            lambda: self._approved is not None,
            timeout=timedelta(days=7),
        )
        # Step 4: persist the result (idempotent write, exactly once)
        if self._approved:
            await workflow.execute_activity(
                write_final_result, response,
                start_to_close_timeout=timedelta(seconds=30),
            )
        return response

async def main() -> None:
    client = await Client.connect("localhost:7233")
    async with Worker(
        client,
        task_queue="ai-agent-queue",
        workflows=[AICockroachAgentWorkflow],
        activities=[retrieve_context, call_llm,
                    request_human_approval, write_final_result],
    ):
        result = await client.execute_workflow(
            AICockroachAgentWorkflow.run, "summarise the Q3 filings",
            id="ai-agent-demo",  # re-starting with the same ID is a no-op
            task_queue="ai-agent-queue",
        )
        print(result)

if __name__ == "__main__":
    asyncio.run(main())

Key Benefits

  • Serializable isolation: no lost updates or phantom reads under concurrent workflow execution
  • Multi-region replication: history shards stay durable across data-center failures without manual failover
  • Horizontal scalability: add nodes to absorb more concurrent workflows without re-sharding
  • Automatic failover: node failures are transparent to all four Temporal services
  • PostgreSQL compatibility: zero application code changes; the postgres12 plugin works directly

CockroachDB acts as a drop-in replacement for PostgreSQL, giving Temporal’s stateless services an indestructible, globally distributed foundation, with one schema fix for the visibility store.


See Also