Full-text search in PostgreSQL just got a major upgrade. pg_textsearch is a new extension from Tiger Data (formerly Timescale) that brings BM25 ranking, the same algorithm powering modern search engines, directly into Postgres.
For teams already running PostgreSQL, this opens up new possibilities: advanced full-text search without leaving the database ecosystem. Let’s explore what pg_textsearch brings to the table.
What is pg_textsearch?
pg_textsearch is a PostgreSQL extension that brings industry-standard BM25 ranking directly into Postgres. It’s built by Tiger Data (formerly Timescale, the team behind TimescaleDB) and designed for developers who want powerful search capabilities through familiar SQL syntax.
The API is straightforward:
SELECT * FROM products
WHERE description <@> 'wireless bluetooth headphones'
ORDER BY description <@> 'wireless bluetooth headphones'
LIMIT 10;
The <@> operator returns negative BM25 scores (more negative means a better match). This design allows PostgreSQL’s ascending index scans to return the most relevant results first.
Key Capabilities
| Capability | What It Means | Why It Matters |
|---|---|---|
| BM25 Ranking | Industry-standard relevance algorithm | Same ranking quality as dedicated search engines |
| SQL Interface | Query with familiar syntax | No new query language to learn |
| ACID Transactions | Search index updates are transactional | Data consistency guaranteed |
| Native Joins | Join search results with other tables | No denormalization required |
| pgvector Integration | Combine keyword + semantic search | Hybrid search in one query |
| Partitioned Tables | Works with Postgres partitioning | Time-series and large table support |
pg_textsearch Features
Understanding BM25 Ranking
BM25 (Best Matching 25) is the ranking algorithm that powers most modern search engines. It improves on simpler term-frequency approaches with three key innovations:
- Length normalization: A concise, focused document can outrank a lengthy document that happens to mention the term many times
- Term saturation: Diminishing returns on repeated terms prevent keyword stuffing from gaming results
- Inverse document frequency (IDF): Rare, specific terms contribute more to relevance than common words
Here’s how this plays out in practice:
-- BM25 search with pg_textsearch
SELECT title, body <@> 'postgresql' AS score
FROM articles
ORDER BY body <@> 'postgresql'
LIMIT 10;
-- Example results showing BM25's length normalization:
-- "PostgreSQL Tutorial: Getting Started" → -4.87 (concise, focused)
-- "10,000 word article mentioning PostgreSQL often" → -1.24 (diluted relevance)
More negative scores indicate higher relevance. BM25 recognizes that a focused tutorial about PostgreSQL is likely more relevant than a lengthy document that merely mentions the term repeatedly.
The BM25 Formula (Simplified)
score = IDF × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × docLen/avgDocLen))
- IDF: Rare terms score higher (inverse document frequency)
- tf: Term frequency, but with saturation (diminishing returns)
- k1: Controls saturation speed (default: 1.2)
- b: Controls length normalization (default: 0.75)
- docLen/avgDocLen: Penalizes unusually long documents
Installation & Setup
Using pgxman (Recommended)
# Install pgxman if you haven't
curl -sfL https://pgxman.com/install.sh | sh
# Install pg_textsearch
pgxman install pg_textsearch
From Source
git clone https://github.com/timescale/pg_textsearch.git
cd pg_textsearch
cargo pgrx install --release
Enable the Extension
CREATE EXTENSION IF NOT EXISTS pg_textsearch;
Basic Queries with <@> Operator
Let’s build a searchable product catalog:
-- Create table
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
description TEXT,
category TEXT,
price NUMERIC(10, 2)
);
-- Insert sample data
INSERT INTO products (name, description, category, price) VALUES
('Sony WH-1000XM5', 'Industry-leading noise canceling wireless headphones with exceptional sound quality', 'Electronics', 349.99),
('Apple AirPods Pro', 'Active noise cancellation, spatial audio, MagSafe charging case', 'Electronics', 249.99),
('Bose QuietComfort', 'Wireless bluetooth headphones with world-class noise cancellation', 'Electronics', 329.99),
('Audio-Technica ATH-M50x', 'Professional studio monitor headphones, wired, exceptional clarity', 'Electronics', 149.99),
('JBL Tune 510BT', 'Wireless on-ear headphones, 40 hour battery, lightweight design', 'Electronics', 49.99);
-- Create BM25 index on description
CREATE INDEX idx_products_bm25 ON products
USING bm25 (description)
WITH (language = 'english');
Now search:
SELECT name, price, description <@> 'wireless noise canceling' AS score
FROM products
ORDER BY description <@> 'wireless noise canceling'
LIMIT 5;
Expected output:
name | price | score
-----------------------+---------+---------
Sony WH-1000XM5 | 349.99 | -5.42
Bose QuietComfort | 329.99 | -4.87
Apple AirPods Pro | 249.99 | -3.21
JBL Tune 510BT | 49.99 | -1.58
(4 rows)
Multi-Column Search
Search across multiple fields by creating a composite index:
-- Create index on multiple columns
CREATE INDEX idx_products_multi ON products
USING bm25 (name, description, category)
WITH (language = 'english');
-- Search across all indexed columns
SELECT name, category,
(name, description, category) <@> 'professional studio' AS score
FROM products
ORDER BY (name, description, category) <@> 'professional studio'
LIMIT 10;
Index Tuning: k1 and b Parameters
The default BM25 parameters work well for general text. But you can tune them:
| Parameter | Default | Effect |
|---|---|---|
| k1 | 1.2 | Term saturation speed. Higher = more credit for repeated terms |
| b | 0.75 | Length normalization. 0 = ignore length, 1 = full normalization |
Tuning for Different Document Types
-- Short documents (tweets, titles): Less length normalization
CREATE INDEX idx_short_docs ON tweets
USING bm25 (content)
WITH (k1 = 1.2, b = 0.3, language = 'english');
-- Long documents (articles, manuals): Standard settings
CREATE INDEX idx_long_docs ON articles
USING bm25 (body)
WITH (k1 = 1.2, b = 0.75, language = 'english');
-- Very long documents (books): More aggressive normalization
CREATE INDEX idx_books ON books
USING bm25 (content)
WITH (k1 = 1.5, b = 0.9, language = 'english');
Language Configuration
pg_textsearch supports 29+ languages with built-in stemming and stop words:
-- German product descriptions
CREATE INDEX idx_products_de ON products_de
USING bm25 (beschreibung)
WITH (language = 'german');
-- French articles
CREATE INDEX idx_articles_fr ON articles_fr
USING bm25 (contenu)
WITH (language = 'french');
-- Japanese (requires separate tokenizer)
CREATE INDEX idx_products_ja ON products_ja
USING bm25 (description)
WITH (language = 'japanese');
Hybrid Search with pgvector
One of pg_textsearch’s most powerful features is its integration with pgvector. This enables hybrid search: combining keyword matching (BM25) with semantic understanding (vector embeddings).
-- Enable both extensions
CREATE EXTENSION IF NOT EXISTS pg_textsearch;
CREATE EXTENSION IF NOT EXISTS vector;
-- Add embedding column
ALTER TABLE products ADD COLUMN embedding vector(1536);
-- Create both indexes
CREATE INDEX idx_products_bm25 ON products USING bm25 (description);
CREATE INDEX idx_products_vector ON products USING hnsw (embedding vector_cosine_ops);
-- Hybrid search with Reciprocal Rank Fusion (RRF)
WITH bm25_results AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY description <@> 'noise canceling headphones') AS bm25_rank
FROM products
ORDER BY description <@> 'noise canceling headphones'
LIMIT 20
),
vector_results AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $query_embedding) AS vector_rank
FROM products
ORDER BY embedding <=> $query_embedding
LIMIT 20
),
rrf_scores AS (
SELECT
COALESCE(b.id, v.id) AS id,
COALESCE(1.0 / (60 + b.bm25_rank), 0) +
COALESCE(1.0 / (60 + v.vector_rank), 0) AS rrf_score
FROM bm25_results b
FULL OUTER JOIN vector_results v ON b.id = v.id
)
SELECT p.name, p.price, r.rrf_score
FROM rrf_scores r
JOIN products p ON p.id = r.id
ORDER BY r.rrf_score DESC
LIMIT 10;
Why Reciprocal Rank Fusion?
RRF combines rankings without needing to normalize scores across different systems:
RRF_score = Σ 1 / (k + rank_i)
Where k is typically 60. This formula:
- Rewards documents that rank highly in multiple systems
- Doesn’t require score normalization
- Is robust to outliers
Performance Considerations
Block-Max WAND: 4x Faster Top-K
pg_textsearch uses Block-Max WAND (Weak AND) algorithm for top-k queries. Instead of scoring every document, it skips blocks that can’t possibly make the top results.
-- This is FAST even on millions of rows
SELECT name, description <@> 'wireless headphones' AS score
FROM products
ORDER BY description <@> 'wireless headphones'
LIMIT 10; -- Block-Max WAND kicks in
| Dataset Size | Full Scan | Block-Max WAND |
|---|---|---|
| 10K rows | ~50ms | ~15ms |
| 100K rows | ~400ms | ~45ms |
| 1M rows | ~3.5s | ~120ms |
| 10M rows | ~35s | ~400ms |
Partitioned Table Support
pg_textsearch works with PostgreSQL’s native partitioning:
-- Create partitioned table for time-series logs
CREATE TABLE logs (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
message TEXT,
level TEXT
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE logs_2026_01 PARTITION OF logs
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE logs_2026_02 PARTITION OF logs
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
-- Create BM25 index (automatically applies to all partitions)
CREATE INDEX idx_logs_bm25 ON logs USING bm25 (message);
-- Search with partition pruning
SELECT * FROM logs
WHERE created_at >= '2026-01-15'
AND created_at < '2026-01-20'
ORDER BY message <@> 'connection timeout error'
LIMIT 100;
Best Practices
Ideal Use Cases
pg_textsearch works well for:
- Product catalogs, content management, and documentation search
- Applications requiring ACID consistency between data and search indexes
- Teams comfortable with SQL who want to avoid learning specialized query DSLs
- Hybrid search combining keywords with pgvector embeddings
Consider Alternatives When
- You’re working with 100M+ documents at scale
- You need specialized aggregation or analytics features
- Real-time log ingestion across distributed systems is the priority
The Search Landscape
Different tools serve different needs. Here’s where pg_textsearch fits in the broader ecosystem:
pg_textsearch
Strengths:
- BM25 ranking in PostgreSQL
- ACID transactions
- SQL interface
- pgvector integration
- Single database stack
Considerations:
- PostgreSQL 17+ required
- Pre-GA (v0.5.0-dev)
- Optimized for <10M docs
Elasticsearch / OpenSearch
Strengths:
- Proven at massive scale
- Rich aggregation framework
- Kibana/dashboards
- Log analytics ecosystem
Considerations:
- Separate cluster to manage
- Eventually consistent
- Specialized query DSL
Typesense / Meilisearch
Strengths:
- Typo tolerance built-in
- Instant search focus
- Simple to deploy
- Great developer UX
Considerations:
- Separate service
- Less SQL integration
- Different feature set
PostgreSQL ts_rank
Strengths:
- Zero dependencies
- Already in Postgres
- Simple to start
Considerations:
- Basic ranking
- No BM25 features
- Limited tuning options
Production Readiness
What to Do Now
- Test in staging: Evaluate on a copy of your production data
- Benchmark your queries: Measure against your current search solution
- Follow the repo: Watch for breaking changes before GA
- Don’t deploy to production yet: Wait for the stable release
Roadmap Items to Watch
Summary
pg_textsearch brings a significant capability to PostgreSQL: industry-standard BM25 ranking without leaving your existing database. With 29+ languages, Block-Max WAND optimization, and native pgvector integration, it’s a compelling option for teams that want powerful full-text search through familiar SQL. The GA release is expected February 2026. Worth evaluating now if you’re planning search infrastructure for a PostgreSQL-based application.
pg_textsearch is developed by Tiger Data (formerly Timescale). For the latest updates, follow the GitHub repository.