A Bitcoin indexer is a service that turns raw blockchain data into something your application can actually use: balances, histories, labels, analytics, and whatever domain view your product needs. Bitcoin Core already stores and validates the chain, but it is not designed as your backend database. An indexer sits next to a full node, consumes its view of the chain, and maintains its own query‑friendly model.

In this article I’ll walk through how to design a production‑ready Bitcoin indexer. We will stay at the architectural level: data sources, parsing pipeline, schema design, reorg handling, and performance. Code appears only as small illustrative sketches; the goal is that you can draw the whole system on a whiteboard and know what each component is responsible for.


Prerequisites

I’ll assume you:

  • Understand Bitcoin at the level of blocks, transactions, and UTXOs.
  • Think of the ledger as a graph of outputs and spends (see the UTXO article).
  • Are comfortable with services, SQL databases, and basic distributed systems ideas such as backpressure, replay, and idempotency.

You do not need to know every node RPC call or binary format detail yet; we’ll keep those abstract and focus on structure.


1. What a Bitcoin Indexer Actually Does

Bitcoin Core already has validated data, but its storage is optimised for consensus, not your queries. A dedicated indexer:

- Follows the canonical chain as seen by a trusted full node.
- Parses blocks and transactions into its own internal model.
- Maintains a queryable database (for example PostgreSQL) with rich indexes.
- Exposes APIs for:
    - transaction details
    - address / script histories
    - balances and live UTXO views
    - time-series metrics and analytics

The important separation:

[ Bitcoin Core ]
- P2P, consensus, validation
- Decides what is valid and canonical

[ Indexer ]
- Deterministic consumer of node state
- Decides how to store, index, and expose it

You do not re‑implement consensus. If you find yourself re‑checking signatures and scripts, stop and reconsider the design.


2. High‑Level Architecture

A practical architecture separates ingestion, storage, and query serving.

                         +----------------------+
                         |   Bitcoin Core Node  |
                         |  (full validation)   |
                         +----------+-----------+
                                    |
             ZMQ / RPC stream or raw block files
                                    |
                                    v
+---------------------+   +---------------------+
|  Ingestion Worker   |-->|  Parsing &          |
|  (Spring service)   |   |  Normalization      |
+----------+----------+   +----------+----------+
                                    |
                                    v
                         +-----------------------+
                         |  Relational DB        |
                         |  (PostgreSQL, etc.)   |
                         +----------+------------+
                                    |
                                    v
                       +---------------------------+
                       |  API / Query Service      |
                       |  (REST / GraphQL)         |
                       +---------------------------+

You can deploy ingestion and API as separate services or as modules in one application. The key distinction:

  • Ingestion is sequential, block‑ordered, reorg‑aware.
  • APIs are concurrent, read‑only, often time‑series heavy.

From the trenches: Every time I tried to “just bolt on some queries” directly to a node, I ended up building an indexer anyway. It’s easier to make that separation explicit from day one. Yes, a dedicated indexer adds components and some architectural complexity, but it buys you better query performance, lower operational costs, and fewer headaches than relying on external services with their own SLAs, outages, and idiosyncratic behaviour. It is also the better choice for security and data sovereignty, especially in controlled and regulated environments that cannot hand chain data to third‑party services.


3. Choosing a Data Source: RPC, ZMQ, or Raw Files

Bitcoin Core exposes the chain in several ways. You mostly care about three:

+-------------------+---------------------------+
| Source            | Characteristics           |
+-------------------+---------------------------+
| RPC               | Simple, synchronous,      |
|                   | JSON-encoded responses    |
|                   | Good for random access    |
+-------------------+---------------------------+
| ZMQ               | Push-based, streaming     |
|                   | Good for "follow the tip" |
+-------------------+---------------------------+
| Raw blk*.dat      | Binary blocks from disk   |
| (+ LevelDB index) | Max throughput for bulk   |
|                   | import, more parsing work |
+-------------------+---------------------------+

A common pattern is to split syncing into two phases:

  • Bootstrap phase (initial synchronization) – stream historical blocks from raw blk*.dat files on a dedicated node to build your full local view of the chain.
  • Follow phase (steady‑state synchronization) – stay up to date by following new blocks via ZMQ or lightweight RPC polling; this path is much lighter than the initial catch‑up.

To keep the implementation simple, I’ll assume a two‑stage setup:

  1. Start with RPC + ZMQ – use RPC to backfill historical blocks and ZMQ (or light polling) to follow the tip. This is easy to reason about and debug.
  2. Add a bulk importer later – once the model is stable, introduce a separate “bulk importer” that reads raw blk*.dat files to accelerate full re‑indexing or fresh deployments.

In all cases the invariant stays the same: regardless of whether the data comes from RPC, ZMQ, or raw files, you only process blocks that the node currently considers valid and canonical.
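
To make the follow phase concrete, here is a minimal sketch of a follower loop that subscribes to Bitcoin Core’s ZMQ rawblock feed. It assumes bitcoind is started with -zmqpubrawblock=tcp://127.0.0.1:28332 and that JeroMQ (org.zeromq) is on the classpath; handleRawBlock is a hypothetical hand‑off into the parsing pipeline described in section 5.

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public final class RawBlockFollower {

    public static void main(String[] args) {
        try (ZContext ctx = new ZContext()) {
            ZMQ.Socket sub = ctx.createSocket(SocketType.SUB);
            sub.connect("tcp://127.0.0.1:28332");
            sub.subscribe("rawblock".getBytes());

            while (!Thread.currentThread().isInterrupted()) {
                byte[] topic    = sub.recv();   // "rawblock"
                byte[] rawBlock = sub.recv();   // full serialized block
                byte[] sequence = sub.recv();   // 4-byte little-endian publisher counter
                handleRawBlock(rawBlock);       // hand off to parsing / normalization
            }
        }
    }

    private static void handleRawBlock(byte[] rawBlock) {
        // hypothetical: parse, normalize, and apply the block (sections 4-6)
    }
}

ZMQ is a fire‑and‑forget notification stream, so in practice you pair it with RPC (getblockcount / getblockhash / getblock) to backfill anything missed while the follower was down.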

4. Core Data Model: Blocks, Transactions, I/O

Most serious indexers converge towards a small set of relational tables. Conceptually:

+---------------------------+
|          blocks           |
+---------------------------+
| id (pk)                   |
| hash                      |
| height                    |
| prev_hash                 |
| timestamp                 |
| size_bytes                |
| weight_units              |
| is_main_chain             |
+---------------------------+

+---------------------------+
|       transactions        |
+---------------------------+
| id (pk)                   |
| txid                      |
| block_id (fk -> blocks)   |
| index_in_block            |
| version                   |
| locktime                  |
+---------------------------+

+----------------------------+
|           inputs           |
+----------------------------+
| id (pk)                    |
| tx_id (fk -> transactions) |
| index_in_tx                |
| prev_txid                  |
| prev_vout_index            |
| sequence                   |
| is_coinbase                |
+----------------------------+

+----------------------------+
|          outputs           |
+----------------------------+
| id (pk)                    |
| tx_id (fk -> transactions) |
| index_in_tx                |
| value_sats                 |
| script_type                |
| script_bytes / asm         |
| address (nullable)         |
| spent_by_input_id (fk)     |
+----------------------------+

Note on simplification: this schema is intentionally simplified to explain the core ideas. In real systems there are more idiomatic ways to represent “main chain vs side chain” and “spent vs unspent” that derive these views from the block/transaction graph rather than treating them as mutable state.

In this simplified model:

  • is_main_chain is shown as an explicit flag. In a more advanced design you usually derive “main vs side chain” from the block tree and cumulative work (following prev_hash links) instead of flipping a boolean on each block. I keep the flag here because it makes reorg examples and queries easier to read while still allowing you to keep side branches around for analysis without polluting your canonical view.

  • spent_by_input_id is modelled as a direct foreign key from each output to the input that spends it. In a more idiomatic relational model you’d often derive this relationship from the inputs table (or a dedicated join / materialized view), rather than maintaining a mutable “spent” marker on the output row. Here I keep it as a column purely for clarity: every output either has a single spending input (spent_by_input_id set) or NULL (still unspent), which matches the UTXO mental model we care about.

Later, when we talk about reorgs and UTXO queries, I’ll use this simplified schema; you can map the same concepts onto a more purely graph‑ or lineage‑based model in a production implementation.

From the trenches: What I’ve regretted in the past are schemas that tried to be even simpler than this (for example, just a couple of denormalised tables with opaque JSON). You end up rebuilding these relationships in ad‑hoc ways anyway. It’s usually cheaper to model them explicitly up front, even if you later move to a more graph‑ or lineage‑oriented representation internally.
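
As a rough illustration of how the outputs table might look in a Spring / JPA code base, here is a sketch of an entity that mirrors the diagram above. TransactionEntity and InputEntity stand in for the (not shown) mappings of the other tables; this is an illustration of the simplified model, not a recommended production mapping.

import jakarta.persistence.*;

@Entity
@Table(name = "outputs",
       indexes = @Index(name = "idx_outputs_address", columnList = "address"))
class OutputEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "tx_id", nullable = false)
    private TransactionEntity transaction;        // fk -> transactions

    @Column(name = "index_in_tx", nullable = false)
    private int indexInTx;

    @Column(name = "value_sats", nullable = false)
    private long valueSats;

    @Column(name = "script_type")
    private String scriptType;

    @Column(name = "address")
    private String address;                       // nullable: not every script maps to an address

    @OneToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "spent_by_input_id")
    private InputEntity spentByInput;             // NULL while the output is unspent

    // getters and setters omitted
}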


5. Parsing and Normalization Pipeline

Once you know how a block should look in your database, the ingestion path becomes a straightforward pipeline.

[ Bitcoin Core ]
       |
       v
[ Raw block (binary or JSON) ]
       |
       v
+---------------------------------+
| Block Parser                    |
| - Decode headers                |
| - Decode transactions           |
| - Extract inputs / outputs      |
+---------------------------------+
       |
       v
+---------------------------------+
| Normalizer                      |
| - Derive addresses (where       |
|   possible)                     |
| - Classify script types         |
| - Compute per-block metrics     |
+---------------------------------+
       |
       v
+---------------------------------+
| Batch Writer                    |
| - Insert blocks, tx, I/O        |
| - Maintain main-chain flags     |
| - Commit as one DB transaction  |
+---------------------------------+

Responsibilities are clean:

  • Parser deals with Bitcoin’s wire/binary formats and RPC responses.
  • Normalizer encodes your domain choices (how you classify scripts, what metrics you store).
  • Batch writer ensures that a block is either fully applied or not applied at all.

You typically write per block (or in small batches of blocks), not row by row. This gives you:

  • Natural transaction boundaries.
  • Simple reorg rollback (block is the unit of undo/redo).
  • Predictable I/O patterns.
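
Here is one way to express those responsibilities in a Spring service, as a sketch; ParsedBlock and NormalizedBlock are hypothetical domain types, and the @Transactional boundary is what gives you the all‑or‑nothing property per block.

import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

record ParsedBlock(byte[] header, List<byte[]> rawTransactions) {}
record NormalizedBlock(String hash, int height /* plus the rows to insert */) {}

interface BlockParser { ParsedBlock parse(byte[] rawBlock); }             // wire format -> model
interface Normalizer  { NormalizedBlock normalize(ParsedBlock block); }   // addresses, scripts, metrics
interface BatchWriter { void apply(NormalizedBlock block); }              // inserts + main-chain flags

@Service
class IngestionPipeline {

    private final BlockParser parser;
    private final Normalizer normalizer;
    private final BatchWriter writer;

    IngestionPipeline(BlockParser parser, Normalizer normalizer, BatchWriter writer) {
        this.parser = parser;
        this.normalizer = normalizer;
        this.writer = writer;
    }

    @Transactional  // the block is either fully applied or not applied at all
    public void ingest(byte[] rawBlock) {
        writer.apply(normalizer.normalize(parser.parse(rawBlock)));
    }
}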

6. Reorg Handling and Canonical Chain

Reorgs are where naive indexers fall apart. Handling them is always a two‑step problem:

  1. Detect that a reorg has occurred and identify the fork point.
  2. Apply the right rollback and re‑apply logic to all derived state.

You need an explicit model of chain state, not ad‑hoc flags sprinkled around the schema.

At minimum:

+------------------------+
|       chain_state      |
+------------------------+
| tip_hash               |
| tip_height             |
| work_sum               |
+------------------------+

On each new block:

  1. Insert the block, linked to its prev_hash.

  2. Compute the cumulative work for the chain ending at this block.

  3. Compare with the current main chain:

    • If the new chain has less work → keep it as a side branch.

    • If the new chain has more work → a reorg may be needed:

      • Find the common ancestor (fork point).
      • Roll back blocks on the old main chain back to that fork.
      • Apply blocks from the winning branch in order and mark them as canonical.

At the database level, rollback/apply means something like:

Rollback:
  - Mark affected blocks as non-canonical.
  - Undo their effects on derived data (balances, UTXO views, aggregates).

Apply:
  - Mark new-branch blocks as main-chain.
  - Re-apply their effects to derived data in height order.
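
A hedged sketch of that logic in Java: walk back from both tips until the hashes meet (the fork point), undo the losing branch newest‑first, then re‑apply the winning branch in height order. BlockRepo, ChainState, rollbackBlock, and applyBlock are hypothetical abstractions over the tables above, and the whole operation is assumed to run inside a single database transaction.

import java.util.ArrayDeque;
import java.util.Deque;
import org.springframework.transaction.annotation.Transactional;

// hypothetical lookups over the blocks table and the chain_state row
interface BlockRepo  { String prevHash(String hash); int height(String hash); }
interface ChainState { String tipHash(); void setTip(String hash); }

class ReorgHandler {

    private final BlockRepo blockRepo;
    private final ChainState chainState;

    ReorgHandler(BlockRepo blockRepo, ChainState chainState) {
        this.blockRepo = blockRepo;
        this.chainState = chainState;
    }

    @Transactional
    public void switchToHeavierBranch(String newTipHash) {
        String oldTip = chainState.tipHash();
        String forkPoint = findCommonAncestor(oldTip, newTipHash);

        // 1. Roll back the losing branch, newest block first.
        for (String h = oldTip; !h.equals(forkPoint); h = blockRepo.prevHash(h)) {
            rollbackBlock(h);                 // undo derived data, clear main-chain flag
        }

        // 2. Re-apply the winning branch in height order (push reverses the walk).
        Deque<String> winning = new ArrayDeque<>();
        for (String h = newTipHash; !h.equals(forkPoint); h = blockRepo.prevHash(h)) {
            winning.push(h);
        }
        winning.forEach(this::applyBlock);    // mark main-chain, re-apply derived data

        chainState.setTip(newTipHash);
    }

    private String findCommonAncestor(String a, String b) {
        // Step the higher branch back until both walks land on the same hash.
        while (!a.equals(b)) {
            if (blockRepo.height(a) >= blockRepo.height(b)) a = blockRepo.prevHash(a);
            else                                            b = blockRepo.prevHash(b);
        }
        return a;
    }

    private void rollbackBlock(String hash) { /* undo balances, UTXO views, aggregates */ }
    private void applyBlock(String hash)    { /* re-apply this block's effects */ }
}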

A robust way to make this tractable is to treat derived state as pure functions of block history: given “all blocks up to height H on the canonical chain”, your balances, UTXOs, and aggregates are just f(blocks[0..H]). In practice you don’t recompute from genesis every time, so you introduce:

- Checkpoints: saved “good” states at specific heights.
- Snapshots: materialised views of derived data at those checkpoints.
- Deltas: per-block changes that can be replayed or rolled back.

This keeps the model functional (“state = f(history)”) while still letting you rebuild the canonical view efficiently after a reorg or bug.

Crucially, the data model itself must be designed from first principles to support this: blocks and transactions with clear parent links, derived tables that can be updated or rebuilt from block deltas, and no hidden, irreversible side‑effects.

From the trenches: In production, reorg handling has always been a two‑step game: first detect them correctly, then apply rollbacks safely. A functional design that can replay from checkpoints makes the second step much easier, but it only works if the schema was built for it from day one. Every time I’ve seen reorg/rollback support bolted onto an existing indexer, it led to awkward patches, “special case” tables, and eventually a full data‑model redesign. It’s worth spending the design time up front so rollbacks and reorgs are natural operations, not emergency hacks.


7. Query Patterns and Indexing Strategy

Schema is only half the story; the other half is query patterns. The usual suspects:

- Given a txid, return the full transaction with inputs and outputs.
- Given an address/script, list all related transactions and current UTXOs.
- Given a time range, aggregate volumes, fees, or counts per day/block.
- Given a block height, list all transactions in canonical order.

You can map these directly to indexes:

+-------------------------------+------------------------------------------+
| Query                         | Helpful Index                            |
+-------------------------------+------------------------------------------+
| tx by txid                    | UNIQUE on transactions.txid              |   
| outputs by address            | INDEX on outputs.address                 |
| UTXOs by address              | INDEX on (address, spent_by_input_id)    |
| tx by block height + position | INDEX on (blocks.height, index_in_block) |
+-------------------------------+------------------------------------------+
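
To make one of these concrete, here is a sketch of the “UTXOs by address” query using Spring’s JdbcTemplate and the simplified schema from section 4; the Utxo record is a hypothetical result type.

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

record Utxo(String txid, int vout, long valueSats) {}

class UtxoQueries {

    private final JdbcTemplate jdbc;

    UtxoQueries(JdbcTemplate jdbc) { this.jdbc = jdbc; }

    List<Utxo> utxosForAddress(String address) {
        // Unspent = no spending input; canonical = block is on the main chain.
        return jdbc.query("""
                SELECT t.txid, o.index_in_tx, o.value_sats
                FROM outputs o
                JOIN transactions t ON t.id = o.tx_id
                JOIN blocks b       ON b.id = t.block_id
                WHERE o.address = ?
                  AND o.spent_by_input_id IS NULL
                  AND b.is_main_chain
                """,
                (rs, rowNum) -> new Utxo(rs.getString("txid"),
                                         rs.getInt("index_in_tx"),
                                         rs.getLong("value_sats")),
                address);
    }
}

The composite index on (address, spent_by_input_id) is exactly what keeps this query cheap.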

Because the chain is inherently time‑series data, you often:

  • Partition tables by time or height.
  • Precompute continuous aggregates (e.g. per‑day volume).
  • Keep only recent partitions in “hot” storage for low‑latency APIs; move older ones to cheaper tiers if your product allows it.

From the trenches: In practice you achieve this with a multi‑layer storage stack. Keep recent, “hot” data in a fast, low‑latency database (for example PostgreSQL) with well‑chosen indexes and plenty of memory‑backed caching (L1 in‑process, L2 shared cache), and push older, “cold” data into cheaper, slower storage such as S3 or an analytical warehouse. Implementing this kind of tiered caching and storage is not trivial and deserves its own article, but in principle it’s the right approach if you want sub‑second queries on recent activity while still being able to scan and analyse deep historical data when needed.


8. Dealing with Blockchain Growth

The Bitcoin chain is large and growing; by the mid‑2020s you’re firmly in “hundreds of gigabytes” territory on disk. That has two consequences:

  1. Initial sync is a serious job. Even with efficient raw‑file import, filling your database from genesis takes hours to days on typical hardware.

  2. Not all data is equally hot. Most user‑facing APIs care much more about recent activity than about a transaction from ten years ago.

A common pattern is a tiered layout:

+-------------------------------+
| HOT: recent blocks / days     |
| - Full indexes                |
| - Low-latency queries         |
+-------------------------------+
              |
              v
+-------------------------------+
| WARM: older partitions        |
| - Fewer indexes               |
| - Still directly queryable    |
+-------------------------------+
              |
              v
+-------------------------------+
| COLD: archival storage        |
| - Compressed partitions       |
| - Object storage / snapshots  |
+-------------------------------+

Analytics workloads may scan everything; user‑facing services often only touch “hot + warm” data plus precomputed summaries. Decide upfront which category your indexer belongs to.


9. Service Design

Everything above can be implemented as a single monolith, but it’s easier to reason about if you think in terms of three logical services:

+-----------------------------+
| BlockFollower               |
+-----------------------------+
| - Talks to Bitcoin Core     |
| - Tracks current height     |
| - Streams new blocks        |
+-----------------------------+

+-----------------------------+
| Indexer                     |
+-----------------------------+
| - Parses blocks             |
| - Normalizes data           |
| - Applies blocks to DB      |
| - Handles reorg signals     |
+-----------------------------+

+-----------------------------+
| API Server                  |
+-----------------------------+
| - Read-only queries         |
| - REST / GraphQL / gRPC     |
| - Horizontal scaling        |
+-----------------------------+

In a production environment, you might:

  • Run BlockFollower and Indexer in the same process, with clear module boundaries and strong transactional guarantees around block application.
  • Deploy the API layer separately so it can scale independently, remain strictly read‑only, and never mutate state.

The key properties:

  • Idempotence: applying the same block twice has the same effect as applying it once (or fails loudly).
  • Replayability: you can drop and rebuild derived tables from block history without special cases.
  • Backpressure: ingestion should slow down gracefully if the database or downstream systems are saturated.
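
A minimal sketch of the idempotence property, assuming the NormalizedBlock type from the section 5 sketch and a hypothetical existsByHash lookup; a UNIQUE constraint on blocks.hash gives you the same guarantee at the database level as a safety net.

import org.springframework.transaction.annotation.Transactional;

class IdempotentBlockApplier {

    interface BlockRepo { boolean existsByHash(String hash); }   // hypothetical lookup

    private final BlockRepo blockRepo;

    IdempotentBlockApplier(BlockRepo blockRepo) { this.blockRepo = blockRepo; }

    @Transactional
    public void applyBlock(NormalizedBlock block) {
        if (blockRepo.existsByHash(block.hash())) {
            return;                         // applying twice has the same effect as once
        }
        writeBlock(block);                  // insert block, txs, inputs, outputs, derived rows
    }

    private void writeBlock(NormalizedBlock block) { /* batch writer from section 5 */ }
}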

10. Testing and Validation

Before you trust an indexer, you want evidence that it:

  • Agrees with a reference node on core facts.
  • Behaves predictably under reorgs and restarts.

Think in three layers:

1. Pure parsing / normalization tests
   - Given a sample block (JSON or binary), assert that the internal model
     has the right fields and relationships.

2. End-to-end block application tests
   - Feed a short, known chain into an in-memory DB.
   - Assert the contents of blocks, transactions, inputs, outputs, and
     derived fields like spent_by_input_id.

3. Cross-checks against Bitcoin Core
   - For a range of heights: block hash and transaction counts match.
   - For selected txids: your decoded transaction matches RPC output.
   - For selected addresses: your balance and history match a trusted
     view (your own RPC logic or a well-vetted explorer).

For reorgs, set up explicit scenarios:

- Construct a private chain with a fork:
    A1 -> A2 -> A3
         \
          B2 -> B3 -> B4 (more work)

- Feed A1, A2, A3 to the indexer.
- Then feed B2, B3, B4 and signal the reorg.
- Assert that:
    - A3 and A2 are no longer marked main-chain.
    - B2, B3, B4 are main-chain.
    - Derived data (balances, UTXO views) matches the B-branch.

You don’t need full Bitcoin tooling to test the model; a small synthetic chain is enough to exercise the logic.
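
As one self‑contained example, here is a JUnit 5 sketch that exercises only the fork‑point part of the reorg logic (mirroring the findCommonAncestor walk from section 6) against the A/B chain above; the full rollback/apply assertions would sit on top of this, with your real indexer wired to an in‑memory database.

import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class ForkPointTest {

    // Tiny in-memory "blocks table": block hash -> (prev hash, height).
    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, Integer> height = new HashMap<>();

    private void addBlock(String hash, String prev, int h) {
        parent.put(hash, prev);
        height.put(hash, h);
    }

    @Test
    void findsForkPointOfTheSyntheticChain() {
        // A1 -> A2 -> A3 and A1 -> B2 -> B3 -> B4 (B branch has more work)
        addBlock("A1", "genesis", 1);
        addBlock("A2", "A1", 2);
        addBlock("A3", "A2", 3);
        addBlock("B2", "A1", 2);
        addBlock("B3", "B2", 3);
        addBlock("B4", "B3", 4);

        assertEquals("A1", findCommonAncestor("A3", "B4"));
    }

    // Same walk-back logic as the ReorgHandler sketch in section 6.
    private String findCommonAncestor(String a, String b) {
        while (!a.equals(b)) {
            if (height.get(a) >= height.get(b)) a = parent.get(a);
            else                                b = parent.get(b);
        }
        return a;
    }
}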


11. Observability and Operations

In production an indexer is a long‑running data pipeline. If you cannot see what it is doing, you will eventually be surprised.

At minimum, expose metrics and logs for:

Sync / chain view
-----------------
- current_indexed_height
- node_best_height
- blocks_lagging (node - indexer)
- last_block_hash

Ingestion performance
---------------------
- blocks_processed_total
- txs_processed_total
- ingestion_latency_per_block
- DB write latency / errors

API behaviour
-------------
- request_rate
- latency_percentiles
- error_rate (by endpoint)

Storage
-------
- DB size (per table / partition)
- index size and bloat indicators
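
As an illustration, the sync/lag gauges are straightforward to expose with Micrometer in a Spring Boot service; NodeClient and IndexerState are hypothetical components reporting the node’s best height and the indexer’s current height.

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

interface NodeClient   { int bestHeight(); }              // hypothetical RPC wrapper
interface IndexerState { int currentIndexedHeight(); }    // hypothetical chain_state view

class SyncMetrics {

    SyncMetrics(MeterRegistry registry, NodeClient node, IndexerState indexer) {
        Gauge.builder("current_indexed_height", indexer, IndexerState::currentIndexedHeight)
             .register(registry);
        Gauge.builder("node_best_height", node, NodeClient::bestHeight)
             .register(registry);
        Gauge.builder("blocks_lagging", () -> node.bestHeight() - indexer.currentIndexedHeight())
             .register(registry);
    }
}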

Combine that with structured logs around:

  • Reorg events (fork point, old tip, new tip).
  • Failed block applications and their reasons.
  • Any divergence detected against the node.

With that in place, you can answer the two questions that matter in operations:

  1. “Are we in sync with the chain?”
  2. “If not, where and why are we stuck?”

Conclusion and Suggested Next Steps

A Bitcoin indexer is not a magical black box. It is a deterministic consumer of a full node’s view of the world, plus storage and APIs shaped to your use case.

The core principles are simple:

  • Let Bitcoin Core own consensus and canonical chain selection.
  • Model blocks, transactions, inputs, outputs, and UTXOs explicitly.
  • Apply blocks in order, with clear rollback semantics for reorgs.
  • Optimise storage and indexes for your query patterns, not someone else’s.
  • Design for replay, resync, and observability from day one.

Once you are comfortable with this architecture, you can move in two directions:

  • Downwards: into binary parsing, RPC integration, and concrete implementations of the BlockFollower and Indexer modules.
  • Sideways: into multi‑chain indexers, cross‑chain address analytics, or specialised APIs (for example, for a DEX or launchpad).

This article assumes the UTXO mental model from the previous piece. If you haven’t read that yet, it is the best “prerequisite” for understanding how your indexer’s internal state should behave over time.


References and Further Reading