Database • January 25, 2026

pg_textsearch: BM25 Comes to Postgres in February

Industry-Standard Full-Text Search, Native in Your Database

Boyan Balev Software Engineer

14 min

Full-text search in PostgreSQL just got a major upgrade. pg_textsearch is a new extension from Tiger Data (formerly Timescale) that brings BM25 ranking, the same algorithm powering modern search engines, directly into Postgres.

BM25 Industry-Standard Ranking

29+ Supported Languages

4x Faster Top-K Queries

For teams already running PostgreSQL, this opens up new possibilities: advanced full-text search without leaving the database ecosystem. Let’s explore what pg_textsearch brings to the table.

What is pg_textsearch?

pg_textsearch is a PostgreSQL extension that brings industry-standard BM25 ranking directly into Postgres. It’s built by Tiger Data (formerly Timescale, the team behind TimescaleDB) and designed for developers who want powerful search capabilities through familiar SQL syntax.

The API is straightforward:

SELECT * FROM products
WHERE description <@> 'wireless bluetooth headphones'
ORDER BY description <@> 'wireless bluetooth headphones'
LIMIT 10;

The <@> operator returns negative BM25 scores (more negative means a better match). This design allows PostgreSQL’s ascending index scans to return the most relevant results first.

Key Capabilities

Capability	What It Means	Why It Matters
BM25 Ranking	Industry-standard relevance algorithm	Same ranking quality as dedicated search engines
SQL Interface	Query with familiar syntax	No new query language to learn
ACID Transactions	Search index updates are transactional	Data consistency guaranteed
Native Joins	Join search results with other tables	No denormalization required
pgvector Integration	Combine keyword + semantic search	Hybrid search in one query
Partitioned Tables	Works with Postgres partitioning	Time-series and large table support

pg_textsearch Features

Ranking BM25 Industry-standard relevance

Languages 29+ Built-in stemming & stop words

Performance Block-Max WAND 4x faster top-k queries

Architecture Memtable Fast writes, async merge

Integration pgvector Hybrid semantic search

Tables Partitioned Time-series friendly

Understanding BM25 Ranking

BM25 (Best Matching 25) is the ranking algorithm that powers most modern search engines. It improves on simpler term-frequency approaches with three key innovations:

Length normalization: A concise, focused document can outrank a lengthy document that happens to mention the term many times
Term saturation: Diminishing returns on repeated terms prevent keyword stuffing from gaming results
Inverse document frequency (IDF): Rare, specific terms contribute more to relevance than common words

Here’s how this plays out in practice:

-- BM25 search with pg_textsearch
SELECT title, body <@> 'postgresql' AS score
FROM articles
ORDER BY body <@> 'postgresql'
LIMIT 10;

-- Example results showing BM25's length normalization:
-- "PostgreSQL Tutorial: Getting Started"            → -4.87  (concise, focused)
-- "10,000 word article mentioning PostgreSQL often" → -1.24  (diluted relevance)

More negative scores indicate higher relevance. BM25 recognizes that a focused tutorial about PostgreSQL is likely more relevant than a lengthy document that merely mentions the term repeatedly.

The BM25 Formula (Simplified)

score = IDF × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × docLen/avgDocLen))

IDF: Rare terms score higher (inverse document frequency)
tf: Term frequency, but with saturation (diminishing returns)
k1: Controls saturation speed (default: 1.2)
b: Controls length normalization (default: 0.75)
docLen/avgDocLen: Penalizes unusually long documents

Installation & Setup

Using pgxman (Recommended)

# Install pgxman if you haven't
curl -sfL https://pgxman.com/install.sh | sh

# Install pg_textsearch
pgxman install pg_textsearch

From Source

git clone https://github.com/timescale/pg_textsearch.git
cd pg_textsearch
cargo pgrx install --release

Enable the Extension

CREATE EXTENSION IF NOT EXISTS pg_textsearch;

Basic Queries with `<@>` Operator

Let’s build a searchable product catalog:

-- Create table
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT,
    category TEXT,
    price NUMERIC(10, 2)
);

-- Insert sample data
INSERT INTO products (name, description, category, price) VALUES
('Sony WH-1000XM5', 'Industry-leading noise canceling wireless headphones with exceptional sound quality', 'Electronics', 349.99),
('Apple AirPods Pro', 'Active noise cancellation, spatial audio, MagSafe charging case', 'Electronics', 249.99),
('Bose QuietComfort', 'Wireless bluetooth headphones with world-class noise cancellation', 'Electronics', 329.99),
('Audio-Technica ATH-M50x', 'Professional studio monitor headphones, wired, exceptional clarity', 'Electronics', 149.99),
('JBL Tune 510BT', 'Wireless on-ear headphones, 40 hour battery, lightweight design', 'Electronics', 49.99);

-- Create BM25 index on description
CREATE INDEX idx_products_bm25 ON products
USING bm25 (description)
WITH (language = 'english');

Now search:

SELECT name, price, description <@> 'wireless noise canceling' AS score
FROM products
ORDER BY description <@> 'wireless noise canceling'
LIMIT 5;

Expected output:

         name          |  price  |  score
-----------------------+---------+---------
 Sony WH-1000XM5       | 349.99  |  -5.42
 Bose QuietComfort     | 329.99  |  -4.87
 Apple AirPods Pro     | 249.99  |  -3.21
 JBL Tune 510BT        |  49.99  |  -1.58
(4 rows)

Multi-Column Search

Search across multiple fields by creating a composite index:

-- Create index on multiple columns
CREATE INDEX idx_products_multi ON products
USING bm25 (name, description, category)
WITH (language = 'english');

-- Search across all indexed columns
SELECT name, category,
       (name, description, category) <@> 'professional studio' AS score
FROM products
ORDER BY (name, description, category) <@> 'professional studio'
LIMIT 10;

Index Tuning: k1 and b Parameters

The default BM25 parameters work well for general text. But you can tune them:

Parameter	Default	Effect
k1	1.2	Term saturation speed. Higher = more credit for repeated terms
b	0.75	Length normalization. 0 = ignore length, 1 = full normalization

Tuning for Different Document Types

-- Short documents (tweets, titles): Less length normalization
CREATE INDEX idx_short_docs ON tweets
USING bm25 (content)
WITH (k1 = 1.2, b = 0.3, language = 'english');

-- Long documents (articles, manuals): Standard settings
CREATE INDEX idx_long_docs ON articles
USING bm25 (body)
WITH (k1 = 1.2, b = 0.75, language = 'english');

-- Very long documents (books): More aggressive normalization
CREATE INDEX idx_books ON books
USING bm25 (content)
WITH (k1 = 1.5, b = 0.9, language = 'english');

Language Configuration

pg_textsearch supports 29+ languages with built-in stemming and stop words:

-- German product descriptions
CREATE INDEX idx_products_de ON products_de
USING bm25 (beschreibung)
WITH (language = 'german');

-- French articles
CREATE INDEX idx_articles_fr ON articles_fr
USING bm25 (contenu)
WITH (language = 'french');

-- Japanese (requires separate tokenizer)
CREATE INDEX idx_products_ja ON products_ja
USING bm25 (description)
WITH (language = 'japanese');

Hybrid Search with pgvector

One of pg_textsearch’s most powerful features is its integration with pgvector. This enables hybrid search: combining keyword matching (BM25) with semantic understanding (vector embeddings).

-- Enable both extensions
CREATE EXTENSION IF NOT EXISTS pg_textsearch;
CREATE EXTENSION IF NOT EXISTS vector;

-- Add embedding column
ALTER TABLE products ADD COLUMN embedding vector(1536);

-- Create both indexes
CREATE INDEX idx_products_bm25 ON products USING bm25 (description);
CREATE INDEX idx_products_vector ON products USING hnsw (embedding vector_cosine_ops);

-- Hybrid search with Reciprocal Rank Fusion (RRF)
WITH bm25_results AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY description <@> 'noise canceling headphones') AS bm25_rank
    FROM products
    ORDER BY description <@> 'noise canceling headphones'
    LIMIT 20
),
vector_results AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $query_embedding) AS vector_rank
    FROM products
    ORDER BY embedding <=> $query_embedding
    LIMIT 20
),
rrf_scores AS (
    SELECT
        COALESCE(b.id, v.id) AS id,
        COALESCE(1.0 / (60 + b.bm25_rank), 0) +
        COALESCE(1.0 / (60 + v.vector_rank), 0) AS rrf_score
    FROM bm25_results b
    FULL OUTER JOIN vector_results v ON b.id = v.id
)
SELECT p.name, p.price, r.rrf_score
FROM rrf_scores r
JOIN products p ON p.id = r.id
ORDER BY r.rrf_score DESC
LIMIT 10;

Why Reciprocal Rank Fusion?

RRF combines rankings without needing to normalize scores across different systems:

RRF_score = Σ 1 / (k + rank_i)

Where k is typically 60. This formula:

Rewards documents that rank highly in multiple systems
Doesn’t require score normalization
Is robust to outliers

Performance Considerations

Block-Max WAND: 4x Faster Top-K

pg_textsearch uses Block-Max WAND (Weak AND) algorithm for top-k queries. Instead of scoring every document, it skips blocks that can’t possibly make the top results.

-- This is FAST even on millions of rows
SELECT name, description <@> 'wireless headphones' AS score
FROM products
ORDER BY description <@> 'wireless headphones'
LIMIT 10;  -- Block-Max WAND kicks in

Dataset Size	Full Scan	Block-Max WAND
10K rows	~50ms	~15ms
100K rows	~400ms	~45ms
1M rows	~3.5s	~120ms
10M rows	~35s	~400ms

Partitioned Table Support

pg_textsearch works with PostgreSQL’s native partitioning:

-- Create partitioned table for time-series logs
CREATE TABLE logs (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    message TEXT,
    level TEXT
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE logs_2026_01 PARTITION OF logs
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE logs_2026_02 PARTITION OF logs
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

-- Create BM25 index (automatically applies to all partitions)
CREATE INDEX idx_logs_bm25 ON logs USING bm25 (message);

-- Search with partition pruning
SELECT * FROM logs
WHERE created_at >= '2026-01-15'
  AND created_at < '2026-01-20'
ORDER BY message <@> 'connection timeout error'
LIMIT 100;

Best Practices

Ideal Use Cases

pg_textsearch works well for:

Product catalogs, content management, and documentation search
Applications requiring ACID consistency between data and search indexes
Teams comfortable with SQL who want to avoid learning specialized query DSLs
Hybrid search combining keywords with pgvector embeddings

Consider Alternatives When

You’re working with 100M+ documents at scale
You need specialized aggregation or analytics features
Real-time log ingestion across distributed systems is the priority

The Search Landscape

Different tools serve different needs. Here’s where pg_textsearch fits in the broader ecosystem:

pg_textsearch

Strengths:

BM25 ranking in PostgreSQL
ACID transactions
SQL interface
pgvector integration
Single database stack

Considerations:

PostgreSQL 17+ required
Pre-GA (v0.5.0-dev)
Optimized for <10M docs

Elasticsearch / OpenSearch

Strengths:

Proven at massive scale
Rich aggregation framework
Kibana/dashboards
Log analytics ecosystem

Considerations:

Separate cluster to manage
Eventually consistent
Specialized query DSL

Typesense / Meilisearch

Strengths:

Typo tolerance built-in
Instant search focus
Simple to deploy
Great developer UX

Considerations:

Separate service
Less SQL integration
Different feature set

PostgreSQL ts_rank

Strengths:

Zero dependencies
Already in Postgres
Simple to start

Considerations:

Basic ranking
No BM25 features
Limited tuning options

Production Readiness

What to Do Now

Test in staging: Evaluate on a copy of your production data
Benchmark your queries: Measure against your current search solution
Follow the repo: Watch for breaking changes before GA
Don’t deploy to production yet: Wait for the stable release

Roadmap Items to Watch

Stability 1.0 Release Expected Feb 2026

Feature Phrase Search Exact phrase matching

Feature Fuzzy Search Typo tolerance

Performance Parallel Scan Multi-core indexing

Integration pg_analytics Combined OLAP + search

Cloud Timescale Cloud Managed offering

Summary

pg_textsearch brings a significant capability to PostgreSQL: industry-standard BM25 ranking without leaving your existing database. With 29+ languages, Block-Max WAND optimization, and native pgvector integration, it’s a compelling option for teams that want powerful full-text search through familiar SQL. The GA release is expected February 2026. Worth evaluating now if you’re planning search infrastructure for a PostgreSQL-based application.

pg_textsearch is developed by Tiger Data (formerly Timescale). For the latest updates, follow the GitHub repository.

What is pg_textsearch?

Key Capabilities

pg_textsearch Features

Understanding BM25 Ranking

The BM25 Formula (Simplified)

Installation & Setup

Using pgxman (Recommended)

From Source

Enable the Extension

Basic Queries with <@> Operator

Multi-Column Search

Index Tuning: k1 and b Parameters

Tuning for Different Document Types

Language Configuration

Hybrid Search with pgvector

Why Reciprocal Rank Fusion?

Performance Considerations

Block-Max WAND: 4x Faster Top-K

Partitioned Table Support

Best Practices

Ideal Use Cases

Consider Alternatives When

The Search Landscape

pg_textsearch

Elasticsearch / OpenSearch

Typesense / Meilisearch

PostgreSQL ts_rank

Production Readiness

What to Do Now

Roadmap Items to Watch

Summary

Basic Queries with `<@>` Operator