pg_bestmatch
Generate BM25 sparse vectors inside the database
Basic Information
- Extension ID: 2140
- Extension Name: pg_bestmatch
- Package Name: pg_bestmatch
- Category: FTS
- License: Apache-2.0
- Website: https://github.com/tensorchord/pg_bestmatch.rs
- Language: Rust
- Tags: pgrx
- Notes: none
Metadata
- Default Version: 0.0.1
- PostgreSQL Major Versions: 17, 16, 15, 14, 13
- Dynamic Loading: requires explicit loading
- DDL Required: yes, requires CREATE EXTENSION
- Relocatable: yes, can be relocated and installed into another schema
- Trusted: no, creating the extension requires superuser privileges
- Required Schemas: bm_catalog
- Required Extensions: none
Packages
- RPM Repository: PIGSTY
- RPM Package: pg_bestmatch_$v
- RPM Version: 0.0.1
- RPM Dependencies: none
- DEB Repository: PIGSTY
- DEB Package: postgresql-$v-pg-bestmatch
- DEB Version: 0.0.1
- DEB Dependencies: none
Installation
Install the pg_bestmatch extension with the pig CLI:
pig ext install pg_bestmatch
Install the pg_bestmatch extension with the Pigsty playbook:
./pgsql.yml -t pg_extension -e '{"pg_extensions": ["pg_bestmatch"]}' # -l <cluster name>
Install the pg_bestmatch RPM packages manually from the YUM repository:
dnf install pg_bestmatch_17;
dnf install pg_bestmatch_16;
dnf install pg_bestmatch_15;
dnf install pg_bestmatch_14;
dnf install pg_bestmatch_13;
Install the pg_bestmatch DEB packages manually from the APT repository:
apt install postgresql-17-pg-bestmatch;
apt install postgresql-16-pg-bestmatch;
apt install postgresql-15-pg-bestmatch;
apt install postgresql-14-pg-bestmatch;
apt install postgresql-13-pg-bestmatch;
The pg_bestmatch extension needs to be dynamically loaded via shared_preload_libraries:
shared_preload_libraries = 'pg_bestmatch'  # modify the PG cluster configuration
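As an alternative to editing the configuration file directly, a minimal sketch using SQL (assuming superuser access; note that this replaces any existing preload list, so include other libraries as needed):
ALTER SYSTEM SET shared_preload_libraries = 'pg_bestmatch';
-- restart PostgreSQL afterwards for the preload setting to take effect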
Enable the pg_bestmatch extension on a cluster where the package has been installed with the following SQL:
CREATE EXTENSION pg_bestmatch;
Usage
- repo: https://github.com/tensorchord/pg_bestmatch.rs
- benchmark: https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval
How does it work?
- Create BM25 statistics for your document set with bm25_create(table_name, column_name, statistic_name); this creates a materialized view that records the statistics.
- Generate a document sparse vector with bm25_document_to_svector(statistic_name, passage).
- For a query, generate the query sparse vector with bm25_query_to_svector(statistic_name, query).
- Compute the relevance score as the dot product of the query sparse vector and the document sparse vector (see the sketch after this list).
- Tokenization currently uses the Hugging Face tokenizer with the bert-base-uncased vocabulary; more tokenizer configuration may be supported in the future.
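A minimal end-to-end sketch of that flow, assuming a documents(passage TEXT) table already exists (table and statistic names here are illustrative):
-- Build BM25 statistics over documents.passage (backed by a materialized view).
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25');
-- Turn a document and a query into sparse vectors using the same statistics.
SELECT bm25_document_to_svector('documents_passage_bm25', 'i have an apple');
SELECT bm25_query_to_svector('documents_passage_bm25', 'who has an apple?');
-- Scoring is the dot product of the two sparse vectors; with stored sparsevec/svector
-- columns this is what the <#> (negative inner product) operator computes.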
Install
CREATE EXTENSION pg_bestmatch;
SET search_path TO public, bm_catalog;
Example
Here is an example workflow demonstrating the usage of this extension on the Stanford LoCo benchmark.
- Load the dataset. The following script lets you try pg_bestmatch with this dataset.
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Documents/parquet/default/test/0.parquet -O documents.parquet
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Queries/parquet/default/test/0.parquet -O queries.parquet
import numpy as np
import pandas as pd
from psycopg2.extensions import register_adapter, AsIs
from sqlalchemy import create_engine

# Teach psycopg2 how to adapt numpy scalar and array types when writing to Postgres.
def adapter_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

def adapter_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)

def adapter_numpy_float32(numpy_float32):
    return AsIs(numpy_float32)

def adapter_numpy_int32(numpy_int32):
    return AsIs(numpy_int32)

def adapter_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))

register_adapter(np.float64, adapter_numpy_float64)
register_adapter(np.int64, adapter_numpy_int64)
register_adapter(np.float32, adapter_numpy_float32)
register_adapter(np.int32, adapter_numpy_int32)
register_adapter(np.ndarray, adapter_numpy_array)

# Connection string for the target database.
db_url = "postgresql://localhost:5432/pg_bestmatch_test"
engine = create_engine(db_url)

def load_documents():
    # Write the LoCo documents parquet file into a "documents" table.
    df = pd.read_parquet("documents.parquet")
    df.to_sql("documents", engine, if_exists='replace', index=False)

def load_queries():
    # Write the LoCo queries parquet file into a "queries" table,
    # keeping only the first answer pid as a plain string.
    df = pd.read_parquet("queries.parquet")
    df['answer_pids'] = df['answer_pids'].apply(lambda x: str(x[0]))
    df.to_sql("queries", engine, if_exists='replace', index=False)

load_documents()
load_queries()
- Create BM25 statistics for the documents table.
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);
- Add an embedding column to the documents and queries tables and update the embeddings for documents and queries.
ALTER TABLE documents ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE documents ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage)::svector; -- for pgvecto.rs users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- for pgvector users
- (Optional) Create a vector index on the sparse vector column.
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops); -- for pgvecto.rs users
CREATE INDEX ON documents USING hnsw (embedding sparsevec_ip_ops); -- for pgvector users (sparsevec columns are indexed with HNSW)
- Perform a vector search to find the most relevant documents for each query.
ALTER TABLE queries ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE queries ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query)::svector; -- for pgvecto.rs users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query, 'pgvector')::sparsevec; -- for pgvector users
SELECT sum((array[answer_pids] = array(
    SELECT pid FROM documents
    WHERE queries.dataset = documents.dataset
    ORDER BY queries.embedding <#> documents.embedding
    LIMIT 1))::int)
FROM queries;
This workflow showcases how to leverage BM25 text queries and vector search in PostgreSQL using this extension. The top-1 recall of BM25 on this dataset is 0.77; if you reproduce this result, your setup is working correctly.
Comparison with pg_search
- pg_bestmatch.rs only provides methods for generating sparse vectors and does not support index-based search (which can be achieved with pgvecto.rs or pgvector).
- pg_search performs BM25 retrieval via the external tantivy engine, which may have limitations when combined with transactions, filters, or JOIN operations. Since pg_bestmatch.rs is entirely native to Postgres, it offers full compatibility with these operations inside Postgres.
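For example, because the sparse vectors live in ordinary columns, they compose with plain SQL filters and joins; a sketch reusing the documents and queries tables from the example above (pgvector syntax; the qid column is a hypothetical query identifier, adjust it to your schema):
SELECT d.pid
FROM documents d
JOIN queries q ON q.dataset = d.dataset   -- ordinary join/filter alongside the vector ranking
WHERE q.qid = 42                          -- hypothetical query identifier
ORDER BY q.embedding <#> d.embedding      -- negative inner product: smaller means more relevant
LIMIT 10;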
Reference
tokenize
- Description: Tokenizes an input string into individual tokens.
- Example:
SELECT tokenize('i have an apple'); -- result: {i,have,an,apple}
bm25_create
- Description: Creates BM25 statistics for a specified table and column.
- Usage:
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25');
- Parameters:
  - table_name: Name of the table.
  - column_name: Name of the column.
  - stat_name: Name of the BM25 statistics.
  - b: BM25 parameter (default 0.75).
  - k: BM25 parameter (default 1.2).
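The example workflow above passes both tuning parameters explicitly; in standard BM25 terms, b controls document-length normalization and k (k1) controls term-frequency saturation:
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);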
bm25_refresh
- Description: Updates the BM25 statistics to reflect any changes in the underlying data.
- Usage:
SELECT bm25_refresh('documents_passage_bm25');
- Parameters:
  - stat_name: Name of the BM25 statistics to update.
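A sketch of the typical maintenance step after the underlying table changes, reusing the statistics and embedding column from the example above; regenerating stored vectors afterwards keeps them consistent with the refreshed statistics:
-- Refresh the BM25 statistics after documents are inserted or updated...
SELECT bm25_refresh('documents_passage_bm25');
-- ...then regenerate the stored sparse vectors so they reflect the new statistics (pgvector style).
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec;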
bm25_drop
- Description: Deletes the BM25 statistics for a specified table and column.
- Usage:
SELECT bm25_drop('documents_passage_bm25');
- Parameters:
  - stat_name: Name of the BM25 statistics to delete.
bm25_document_to_svector
- Description: Converts document text into a sparse vector representation.
- Usage:
SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text');
- Parameters:
  - stat_name: Name of the BM25 statistics.
  - document_text: The text of the document.
  - style: Emits a pgvecto.rs-style or pgvector-style sparse vector.
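For pgvector users, the example workflow passes 'pgvector' as the style argument and casts the result:
SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text', 'pgvector')::sparsevec;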
bm25_query_to_svector
- Description: Converts query text into a sparse vector representation.
- Usage:
SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.');
- Parameters:
  - stat_name: Name of the BM25 statistics.
  - query_text: The text of the query.
  - style: Emits a pgvecto.rs-style or pgvector-style sparse vector.
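Likewise, pgvector users can request pgvector-style output and cast it, as in the example workflow:
SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.', 'pgvector')::sparsevec;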