Database

pg_textsearch: BM25 Comes to Postgres in February

Industry-Standard Full-Text Search, Native in Your Database

Boyan Balev
Boyan Balev Software Engineer
14 min
pg_textsearch: BM25 Comes to Postgres in February

Full-text search in PostgreSQL just got a major upgrade. pg_textsearch is a new extension from Tiger Data (formerly Timescale) that brings BM25 ranking, the same algorithm powering modern search engines, directly into Postgres.

BM25 Industry-Standard Ranking
29+ Supported Languages
4x Faster Top-K Queries

For teams already running PostgreSQL, this opens up new possibilities: advanced full-text search without leaving the database ecosystem. Let’s explore what pg_textsearch brings to the table.

What is pg_textsearch?

pg_textsearch is a PostgreSQL extension that brings industry-standard BM25 ranking directly into Postgres. It’s built by Tiger Data (formerly Timescale, the team behind TimescaleDB) and designed for developers who want powerful search capabilities through familiar SQL syntax.

The API is straightforward:

SELECT * FROM products
WHERE description <@> 'wireless bluetooth headphones'
ORDER BY description <@> 'wireless bluetooth headphones'
LIMIT 10;

The <@> operator returns negative BM25 scores (more negative means a better match). This design allows PostgreSQL’s ascending index scans to return the most relevant results first.

Key Capabilities

CapabilityWhat It MeansWhy It Matters
BM25 Ranking Industry-standard relevance algorithm Same ranking quality as dedicated search engines
SQL Interface Query with familiar syntax No new query language to learn
ACID Transactions Search index updates are transactional Data consistency guaranteed
Native Joins Join search results with other tables No denormalization required
pgvector Integration Combine keyword + semantic search Hybrid search in one query
Partitioned Tables Works with Postgres partitioning Time-series and large table support

pg_textsearch Features

Ranking BM25 Industry-standard relevance
Languages 29+ Built-in stemming & stop words
Performance Block-Max WAND 4x faster top-k queries
Architecture Memtable Fast writes, async merge
Integration pgvector Hybrid semantic search
Tables Partitioned Time-series friendly

Understanding BM25 Ranking

BM25 (Best Matching 25) is the ranking algorithm that powers most modern search engines. It improves on simpler term-frequency approaches with three key innovations:

  1. Length normalization: A concise, focused document can outrank a lengthy document that happens to mention the term many times
  2. Term saturation: Diminishing returns on repeated terms prevent keyword stuffing from gaming results
  3. Inverse document frequency (IDF): Rare, specific terms contribute more to relevance than common words

Here’s how this plays out in practice:

-- BM25 search with pg_textsearch
SELECT title, body <@> 'postgresql' AS score
FROM articles
ORDER BY body <@> 'postgresql'
LIMIT 10;

-- Example results showing BM25's length normalization:
-- "PostgreSQL Tutorial: Getting Started"            → -4.87  (concise, focused)
-- "10,000 word article mentioning PostgreSQL often" → -1.24  (diluted relevance)

More negative scores indicate higher relevance. BM25 recognizes that a focused tutorial about PostgreSQL is likely more relevant than a lengthy document that merely mentions the term repeatedly.

The BM25 Formula (Simplified)

score = IDF × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × docLen/avgDocLen))
  • IDF: Rare terms score higher (inverse document frequency)
  • tf: Term frequency, but with saturation (diminishing returns)
  • k1: Controls saturation speed (default: 1.2)
  • b: Controls length normalization (default: 0.75)
  • docLen/avgDocLen: Penalizes unusually long documents

Installation & Setup

# Install pgxman if you haven't
curl -sfL https://pgxman.com/install.sh | sh

# Install pg_textsearch
pgxman install pg_textsearch

From Source

git clone https://github.com/timescale/pg_textsearch.git
cd pg_textsearch
cargo pgrx install --release

Enable the Extension

CREATE EXTENSION IF NOT EXISTS pg_textsearch;

Basic Queries with <@> Operator

Let’s build a searchable product catalog:

-- Create table
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT,
    category TEXT,
    price NUMERIC(10, 2)
);

-- Insert sample data
INSERT INTO products (name, description, category, price) VALUES
('Sony WH-1000XM5', 'Industry-leading noise canceling wireless headphones with exceptional sound quality', 'Electronics', 349.99),
('Apple AirPods Pro', 'Active noise cancellation, spatial audio, MagSafe charging case', 'Electronics', 249.99),
('Bose QuietComfort', 'Wireless bluetooth headphones with world-class noise cancellation', 'Electronics', 329.99),
('Audio-Technica ATH-M50x', 'Professional studio monitor headphones, wired, exceptional clarity', 'Electronics', 149.99),
('JBL Tune 510BT', 'Wireless on-ear headphones, 40 hour battery, lightweight design', 'Electronics', 49.99);

-- Create BM25 index on description
CREATE INDEX idx_products_bm25 ON products
USING bm25 (description)
WITH (language = 'english');

Now search:

SELECT name, price, description <@> 'wireless noise canceling' AS score
FROM products
ORDER BY description <@> 'wireless noise canceling'
LIMIT 5;

Expected output:

         name          |  price  |  score
-----------------------+---------+---------
 Sony WH-1000XM5       | 349.99  |  -5.42
 Bose QuietComfort     | 329.99  |  -4.87
 Apple AirPods Pro     | 249.99  |  -3.21
 JBL Tune 510BT        |  49.99  |  -1.58
(4 rows)

Search across multiple fields by creating a composite index:

-- Create index on multiple columns
CREATE INDEX idx_products_multi ON products
USING bm25 (name, description, category)
WITH (language = 'english');

-- Search across all indexed columns
SELECT name, category,
       (name, description, category) <@> 'professional studio' AS score
FROM products
ORDER BY (name, description, category) <@> 'professional studio'
LIMIT 10;

Index Tuning: k1 and b Parameters

The default BM25 parameters work well for general text. But you can tune them:

ParameterDefaultEffect
k1 1.2 Term saturation speed. Higher = more credit for repeated terms
b 0.75 Length normalization. 0 = ignore length, 1 = full normalization

Tuning for Different Document Types

-- Short documents (tweets, titles): Less length normalization
CREATE INDEX idx_short_docs ON tweets
USING bm25 (content)
WITH (k1 = 1.2, b = 0.3, language = 'english');

-- Long documents (articles, manuals): Standard settings
CREATE INDEX idx_long_docs ON articles
USING bm25 (body)
WITH (k1 = 1.2, b = 0.75, language = 'english');

-- Very long documents (books): More aggressive normalization
CREATE INDEX idx_books ON books
USING bm25 (content)
WITH (k1 = 1.5, b = 0.9, language = 'english');

Language Configuration

pg_textsearch supports 29+ languages with built-in stemming and stop words:

-- German product descriptions
CREATE INDEX idx_products_de ON products_de
USING bm25 (beschreibung)
WITH (language = 'german');

-- French articles
CREATE INDEX idx_articles_fr ON articles_fr
USING bm25 (contenu)
WITH (language = 'french');

-- Japanese (requires separate tokenizer)
CREATE INDEX idx_products_ja ON products_ja
USING bm25 (description)
WITH (language = 'japanese');

Hybrid Search with pgvector

One of pg_textsearch’s most powerful features is its integration with pgvector. This enables hybrid search: combining keyword matching (BM25) with semantic understanding (vector embeddings).

-- Enable both extensions
CREATE EXTENSION IF NOT EXISTS pg_textsearch;
CREATE EXTENSION IF NOT EXISTS vector;

-- Add embedding column
ALTER TABLE products ADD COLUMN embedding vector(1536);

-- Create both indexes
CREATE INDEX idx_products_bm25 ON products USING bm25 (description);
CREATE INDEX idx_products_vector ON products USING hnsw (embedding vector_cosine_ops);

-- Hybrid search with Reciprocal Rank Fusion (RRF)
WITH bm25_results AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY description <@> 'noise canceling headphones') AS bm25_rank
    FROM products
    ORDER BY description <@> 'noise canceling headphones'
    LIMIT 20
),
vector_results AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $query_embedding) AS vector_rank
    FROM products
    ORDER BY embedding <=> $query_embedding
    LIMIT 20
),
rrf_scores AS (
    SELECT
        COALESCE(b.id, v.id) AS id,
        COALESCE(1.0 / (60 + b.bm25_rank), 0) +
        COALESCE(1.0 / (60 + v.vector_rank), 0) AS rrf_score
    FROM bm25_results b
    FULL OUTER JOIN vector_results v ON b.id = v.id
)
SELECT p.name, p.price, r.rrf_score
FROM rrf_scores r
JOIN products p ON p.id = r.id
ORDER BY r.rrf_score DESC
LIMIT 10;

Why Reciprocal Rank Fusion?

RRF combines rankings without needing to normalize scores across different systems:

RRF_score = Σ 1 / (k + rank_i)

Where k is typically 60. This formula:

  • Rewards documents that rank highly in multiple systems
  • Doesn’t require score normalization
  • Is robust to outliers

Performance Considerations

Block-Max WAND: 4x Faster Top-K

pg_textsearch uses Block-Max WAND (Weak AND) algorithm for top-k queries. Instead of scoring every document, it skips blocks that can’t possibly make the top results.

-- This is FAST even on millions of rows
SELECT name, description <@> 'wireless headphones' AS score
FROM products
ORDER BY description <@> 'wireless headphones'
LIMIT 10;  -- Block-Max WAND kicks in
Dataset SizeFull ScanBlock-Max WAND
10K rows ~50ms ~15ms
100K rows ~400ms ~45ms
1M rows ~3.5s ~120ms
10M rows ~35s ~400ms

Partitioned Table Support

pg_textsearch works with PostgreSQL’s native partitioning:

-- Create partitioned table for time-series logs
CREATE TABLE logs (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    message TEXT,
    level TEXT
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE logs_2026_01 PARTITION OF logs
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE logs_2026_02 PARTITION OF logs
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

-- Create BM25 index (automatically applies to all partitions)
CREATE INDEX idx_logs_bm25 ON logs USING bm25 (message);

-- Search with partition pruning
SELECT * FROM logs
WHERE created_at >= '2026-01-15'
  AND created_at < '2026-01-20'
ORDER BY message <@> 'connection timeout error'
LIMIT 100;

Best Practices

Ideal Use Cases

pg_textsearch works well for:

  • Product catalogs, content management, and documentation search
  • Applications requiring ACID consistency between data and search indexes
  • Teams comfortable with SQL who want to avoid learning specialized query DSLs
  • Hybrid search combining keywords with pgvector embeddings

Consider Alternatives When

  • You’re working with 100M+ documents at scale
  • You need specialized aggregation or analytics features
  • Real-time log ingestion across distributed systems is the priority

The Search Landscape

Different tools serve different needs. Here’s where pg_textsearch fits in the broader ecosystem:

pg_textsearch

Strengths:

  • BM25 ranking in PostgreSQL
  • ACID transactions
  • SQL interface
  • pgvector integration
  • Single database stack

Considerations:

  • PostgreSQL 17+ required
  • Pre-GA (v0.5.0-dev)
  • Optimized for <10M docs

Elasticsearch / OpenSearch

Strengths:

  • Proven at massive scale
  • Rich aggregation framework
  • Kibana/dashboards
  • Log analytics ecosystem

Considerations:

  • Separate cluster to manage
  • Eventually consistent
  • Specialized query DSL

Typesense / Meilisearch

Strengths:

  • Typo tolerance built-in
  • Instant search focus
  • Simple to deploy
  • Great developer UX

Considerations:

  • Separate service
  • Less SQL integration
  • Different feature set

PostgreSQL ts_rank

Strengths:

  • Zero dependencies
  • Already in Postgres
  • Simple to start

Considerations:

  • Basic ranking
  • No BM25 features
  • Limited tuning options

Production Readiness

What to Do Now

  1. Test in staging: Evaluate on a copy of your production data
  2. Benchmark your queries: Measure against your current search solution
  3. Follow the repo: Watch for breaking changes before GA
  4. Don’t deploy to production yet: Wait for the stable release

Roadmap Items to Watch

Stability 1.0 Release Expected Feb 2026
Feature Phrase Search Exact phrase matching
Feature Fuzzy Search Typo tolerance
Performance Parallel Scan Multi-core indexing
Integration pg_analytics Combined OLAP + search
Cloud Timescale Cloud Managed offering

Summary

pg_textsearch

pg_textsearch brings a significant capability to PostgreSQL: industry-standard BM25 ranking without leaving your existing database. With 29+ languages, Block-Max WAND optimization, and native pgvector integration, it’s a compelling option for teams that want powerful full-text search through familiar SQL. The GA release is expected February 2026. Worth evaluating now if you’re planning search infrastructure for a PostgreSQL-based application.


pg_textsearch is developed by Tiger Data (formerly Timescale). For the latest updates, follow the GitHub repository.