pg_bestmatch
Generate BM25 sparse vectors inside the database
Extension Overview
PIGSTY Third-Party Extension: pg_bestmatch: generate BM25 sparse vectors inside the database
Basic Information
- Extension ID: 2140
- Extension Name: pg_bestmatch
- Package Name: pg_bestmatch
- Category: FTS
- License: Apache-2.0
- Website: https://github.com/tensorchord/pg_bestmatch.rs
- Language: Rust
- Tags: pgrx
- Notes: none
Metadata
- Default Version: 0.0.1
- PG Major Versions: 17, 16, 15, 14, 13
- Dynamic Loading: requires explicit loading
- DDL Required: yes, requires CREATE EXTENSION DDL
- Relocatable: yes, can be installed into another schema
- Trust Level: untrusted, creating the extension requires superuser privileges
- Required Schemas: bm_catalog
- Required Extensions: none
Packages
- RPM Repository: PIGSTY
- RPM Package: pg_bestmatch_$v
- RPM Version: 0.0.1
- RPM Dependencies: none
- DEB Repository: PIGSTY
- DEB Package: postgresql-$v-pg-bestmatch
- DEB Version: 0.0.1
- DEB Dependencies: none
Extension Installation
Install the pg_bestmatch extension with the pig CLI tool:
pig ext install pg_bestmatch
Install the pg_bestmatch extension with the Pigsty playbook:
./pgsql.yml -t pg_extension -e '{"pg_extensions": ["pg_bestmatch"]}' # -l <cluster name>
Manually install the pg_bestmatch RPM package from the YUM repository:
dnf install pg_bestmatch_17;
dnf install pg_bestmatch_16;
dnf install pg_bestmatch_15;
dnf install pg_bestmatch_14;
dnf install pg_bestmatch_13;
Manually install the pg_bestmatch DEB package from the APT repository:
apt install postgresql-17-pg-bestmatch;
apt install postgresql-16-pg-bestmatch;
apt install postgresql-15-pg-bestmatch;
apt install postgresql-14-pg-bestmatch;
apt install postgresql-13-pg-bestmatch;
The pg_bestmatch extension must be dynamically loaded via shared_preload_libraries:
shared_preload_libraries = 'pg_bestmatch'  # edit the PG cluster configuration
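If you manage the configuration by hand rather than through Pigsty, the same setting can also be applied with ALTER SYSTEM; a minimal sketch (pg_stat_statements below is only a placeholder for whatever libraries are already configured, and a PostgreSQL restart is required for the change to take effect):
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements, pg_bestmatch'; -- keep existing entries, then restart the server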
Enable the pg_bestmatch extension on a PG cluster where the package is already installed with the following SQL command:
CREATE EXTENSION pg_bestmatch;
Usage
- Repo: https://github.com/tensorchord/pg_bestmatch.rs
- Benchmark: https://hazyresearch.stanford.edu/blog/2024-05-20-m2-bert-retrieval
How does it work?
- Create BM25 statistics for your document set with bm25_create(table_name, column_name, statistic_name). This creates a materialized view that records the statistics.
- Generate document sparse vectors with bm25_document_to_svector(statistic_name, passage).
- For queries, generate the query sparse vector with bm25_query_to_svector(statistic_name, query).
- Compute the relevance score as the dot product between the query sparse vector and the document sparse vector (see the sketch after this list).
- The extension currently tokenizes words with the Hugging Face tokenizer and the bert-base-uncased vocabulary. More tokenizer configuration options may be supported in the future.
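A minimal end-to-end sketch of this flow, using the pgvector-style output (the docs table, body column, and statistics name are hypothetical; the functions and the <#> operator follow the reference and examples below):
CREATE TABLE docs (id bigserial PRIMARY KEY, body text);
INSERT INTO docs (body) VALUES ('i have an apple'), ('an apple a day keeps the doctor away');
SELECT bm25_create('docs', 'body', 'docs_body_bm25');  -- build BM25 statistics (backed by a materialized view)
ALTER TABLE docs ADD COLUMN embedding sparsevec;
UPDATE docs SET embedding = bm25_document_to_svector('docs_body_bm25', body, 'pgvector')::sparsevec;
-- rank by dot product: <#> is the negative inner product, so ascending order puts the best match first
SELECT id, body
FROM docs
ORDER BY embedding <#> bm25_query_to_svector('docs_body_bm25', 'apple doctor', 'pgvector')::sparsevec
LIMIT 3;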
Install
CREATE EXTENSION pg_bestmatch;
SET search_path TO public, bm_catalog;
Example
Here is an example workflow demonstrating the usage of this extension on the Stanford LoCo benchmark.
- Load the dataset. Here is a script you can use if you want to try pg_bestmatch with this dataset.
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Documents/parquet/default/test/0.parquet -O documents.parquet
wget https://huggingface.co/api/datasets/hazyresearch/LoCoV1-Queries/parquet/default/test/0.parquet -O queries.parquet
import pandas as pd
from sqlalchemy import create_engine
import numpy as np
from psycopg2.extensions import register_adapter, AsIs
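# psycopg2 cannot serialize numpy scalar/array types out of the box;
# the AsIs adapters below emit them as plain SQL literals when writing to Postgres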
def adapter_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)
def adapter_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)
def adapter_numpy_float32(numpy_float32):
    return AsIs(numpy_float32)
def adapter_numpy_int32(numpy_int32):
    return AsIs(numpy_int32)
def adapter_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))
register_adapter(np.float64, adapter_numpy_float64)
register_adapter(np.int64, adapter_numpy_int64)
register_adapter(np.float32, adapter_numpy_float32)
register_adapter(np.int32, adapter_numpy_int32)
register_adapter(np.ndarray, adapter_numpy_array)
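# connection string for the target Postgres database; adjust host, port, user, and database name as needed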
db_url = "postgresql://localhost:5432/pg_bestmatch_test"
engine = create_engine(db_url)
def load_documents():
    df = pd.read_parquet("documents.parquet")
    df.to_sql("documents", engine, if_exists='replace', index=False)
def load_queries():
    df = pd.read_parquet("queries.parquet")
    df['answer_pids'] = df['answer_pids'].apply(lambda x: str(x[0]))    
    df.to_sql("queries", engine, if_exists='replace', index=False)
load_documents()
load_queries()
- Create BM25 statistics for the documents table.
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 0.75, 1.2);
- Add an embedding column to the documents and queries tables and update the embeddings for documents and queries.
ALTER TABLE documents ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE documents ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage)::svector; -- for pgvecto.rs users
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- for pgvector users
- (Optional) Create a vector index on the sparse vector column.
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops); -- for pgvecto.rs users
CREATE INDEX ON documents USING hnsw (embedding sparsevec_ip_ops); -- for pgvector users (pgvector indexes sparsevec with HNSW)
- Perform a vector search to find the most relevant documents for each query.
ALTER TABLE queries ADD COLUMN embedding svector; -- for pgvecto.rs users
ALTER TABLE queries ADD COLUMN embedding sparsevec; -- for pgvector users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query)::svector; -- for pgvecto.rs users
UPDATE queries SET embedding = bm25_query_to_svector('documents_passage_bm25', query, 'pgvector')::sparsevec; -- for pgvector users
SELECT sum((array[answer_pids] = array(
    SELECT pid FROM documents
    WHERE queries.dataset = documents.dataset
    ORDER BY queries.embedding <#> documents.embedding
    LIMIT 1
))::int)
FROM queries;
This workflow shows how to combine BM25 text retrieval with vector search in PostgreSQL using this extension. The Top-1 recall of BM25 on this dataset is 0.77; if you reproduce that result, your setup is correct.
Comparison with pg_search
pg_bestmatch.rs only provides functions for generating sparse vectors and does not perform index-based search itself (that can be handled by pgvecto.rs or pgvector). pg_search performs BM25 retrieval via the external tantivy engine, which may have limitations when combined with transactions, filters, or JOIN operations. Since pg_bestmatch.rs is entirely native to Postgres, it is fully compatible with these operations inside Postgres.
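Because the sparse vectors live in ordinary Postgres columns, they compose with regular SQL. A sketch using the documents and queries tables from the example above (the query-text filter value is hypothetical; adjust to your schema):
BEGIN;
SELECT d.pid, d.passage
FROM queries q
JOIN documents d ON d.dataset = q.dataset  -- ordinary JOIN, handled natively by Postgres
WHERE q.query = 'some query text'          -- hypothetical filter picking one query row
ORDER BY q.embedding <#> d.embedding       -- BM25 relevance via dot product
LIMIT 5;
COMMIT;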
Reference
tokenize
- Description: Tokenizes an input string into individual tokens.
- Example: SELECT tokenize('i have an apple'); -- result: {i,have,an,apple}

bm25_create
- Description: Creates BM25 statistics for a specified table and column.
- Usage: SELECT bm25_create('documents', 'passage', 'documents_passage_bm25');
- Parameters:
  - table_name: Name of the table.
  - column_name: Name of the column.
  - stat_name: Name of the BM25 statistics.
  - b: BM25 parameter (default 0.75).
  - k: BM25 parameter (default 1.2).

bm25_refresh
- Description: Updates the BM25 statistics to reflect any changes in the underlying data.
- Usage: SELECT bm25_refresh('documents_passage_bm25');
- Parameters:
  - stat_name: Name of the BM25 statistics to update.
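A sketch of a typical refresh pattern after the underlying documents change (column names and the inserted values follow the example workflow above and are illustrative; since stored document vectors are derived from the statistics, you will usually want to regenerate them after a refresh):
INSERT INTO documents (dataset, pid, passage) VALUES ('my_dataset', 'new_pid', 'a newly added passage'); -- hypothetical new row
SELECT bm25_refresh('documents_passage_bm25');  -- refresh the statistics materialized view
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', passage, 'pgvector')::sparsevec; -- regenerate stored vectors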
 
bm25_drop
- Description: Deletes the BM25 statistics for a specified table and column.
- Usage: SELECT bm25_drop('documents_passage_bm25');
- Parameters:
  - stat_name: Name of the BM25 statistics to delete.
 
bm25_document_to_svector
- Description: Converts document text into a sparse vector representation.
- Usage: SELECT bm25_document_to_svector('documents_passage_bm25', 'document_text');
- Parameters:
  - stat_name: Name of the BM25 statistics.
  - document_text: The text of the document.
  - style: Emit a pgvecto.rs-style or pgvector-style sparse vector.

bm25_query_to_svector
- Description: Converts query text into a sparse vector representation.
- Usage: SELECT bm25_query_to_svector('documents_passage_bm25', 'We begin, as always, with the text.');
- Parameters:
  - stat_name: Name of the BM25 statistics.
  - query_text: The text of the query.
  - style: Emit a pgvecto.rs-style or pgvector-style sparse vector.
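The optional style argument selects the output format; as in the examples above, the default matches pgvecto.rs. A short sketch of both styles (casting assumes the corresponding vector extension is installed):
SELECT bm25_query_to_svector('documents_passage_bm25', 'an apple')::svector;               -- pgvecto.rs-style sparse vector (default)
SELECT bm25_query_to_svector('documents_passage_bm25', 'an apple', 'pgvector')::sparsevec; -- pgvector-style sparse vector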