pgpdf

PDF数据类型，管理函数与全文检索

Module:

PGEXT

扩展总览

PIGSTY 第三方扩展： pgpdf ： PDF数据类型，管理函数与全文检索

基本信息

扩展编号： 3530
扩展名称： pgpdf
标准包名： pgpdf
扩展类目： TYPE
开源协议： GPLv3
官方网站： https://github.com/Florents-Tselai/pgpdf
编程语言： C
其他标签：无
备注信息：无

元数据

默认版本： 0.1.0
PG大版本： 17,16,15,14,13
动态加载： 需要显式加载
需要DDL：需要执行 CREATE EXTENSION DDL
可重定位：无法安装至任意模式下
信任程度：受信任，无需超级用户，带 CREATE 权限的用户可以直接创建
所需模式：无
所需扩展：无

软件包

RPM仓库：PIGSTY
RPM包名：pgpdf_$v*
RPM版本：0.1.0
RPM依赖：无
DEB仓库：PIGSTY
DEB包名：postgresql-$v-pgpdf
DEB版本：0.1.0
DEB依赖：无

最新版本

系统	架构	PG17	PG16	PG15	PG14	PG13
`el8`	`x86_64`	`pgpdf_17` PIGSTY 0.1.0	`pgpdf_16` PIGSTY 0.1.0	`pgpdf_15` PIGSTY 0.1.0	`pgpdf_14` PIGSTY 0.1.0	`pgpdf_13` PIGSTY 0.1.0
`el8`	`aarch64`	`pgpdf_17` PIGSTY 0.1.0	`pgpdf_16` PIGSTY 0.1.0	`pgpdf_15` PIGSTY 0.1.0	`pgpdf_14` PIGSTY 0.1.0	`pgpdf_13` PIGSTY 0.1.0
`el9`	`x86_64`	`pgpdf_17` PIGSTY 0.1.0	`pgpdf_16` PIGSTY 0.1.0	`pgpdf_15` PIGSTY 0.1.0	`pgpdf_14` PIGSTY 0.1.0	`pgpdf_13` PIGSTY 0.1.0
`el9`	`aarch64`	`pgpdf_17` PIGSTY 0.1.0	`pgpdf_16` PIGSTY 0.1.0	`pgpdf_15` PIGSTY 0.1.0	`pgpdf_14` PIGSTY 0.1.0	`pgpdf_13` PIGSTY 0.1.0
`d12`	`x86_64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`d12`	`aarch64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`u22`	`x86_64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`u22`	`aarch64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`u24`	`x86_64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0
`u24`	`aarch64`	`postgresql-17-pgpdf` PIGSTY 0.1.0	`postgresql-16-pgpdf` PIGSTY 0.1.0	`postgresql-15-pgpdf` PIGSTY 0.1.0	`postgresql-14-pgpdf` PIGSTY 0.1.0	`postgresql-13-pgpdf` PIGSTY 0.1.0

扩展安装

使用 pig 命令行工具安装 pgpdf 扩展：

pig ext install pgpdf

使用 Pigsty剧本安装 pgpdf 扩展：

./pgsql.yml -t pg_extension -e '{"pg_extensions": ["pgpdf"]}' # -l <集群名>

从 YUM仓库 手工安装 pgpdf RPM 包：

dnf install pgpdf_17*;
dnf install pgpdf_16*;
dnf install pgpdf_15*;
dnf install pgpdf_14*;
dnf install pgpdf_13*;

从 APT仓库 手工安装 pgpdf DEB 包：

apt install postgresql-17-pgpdf;
apt install postgresql-16-pgpdf;
apt install postgresql-15-pgpdf;
apt install postgresql-14-pgpdf;
apt install postgresql-13-pgpdf;

扩展 pgpdf 需要通过 shared_preload_libraries 进行 动态加载：

shared_preload_libraries = 'pgpdf'; # 修改 PG 集群配置

使用以下 SQL 命令在已经安装此扩展插件的 PG 集群上启用 pgpdf 扩展：

CREATE EXTENSION pgpdf;

使用方法

The actual PDF parsing is done by poppler.

This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.

Download some PDFs.

wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf

You can create a pdf type, by casting either a text filepath or bytea column.

CREATE EXTENSION pgpdf;
SELECT '/tmp/pgintro.pdf'::pdf;

                                       pdf                                        
----------------------------------------------------------------------------------
 PostgreSQL Introduction                                                         +
 Digoal.Zhou                                                                     +
 7/20/2011Catalog                                                                +
  PostgreSQL Origin

If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea column, you can just cast it to pdf.

SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;

Examples

Create a table with a pdf column:

CREATE TABLE pdfs(name text primary key, doc pdf);

INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');

Parsing and validation should happen automatically. The files will be read from the disk only once!

[!NOTE] The filepath should be accessible by the postgres process / user! That’s different than the user running psql. If you don’t understand what this means, as your DBA!

String Functions and Operators

Standard Postgres String Functions and Operators should work as usual:

SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;

SELECT upper('/tmp/pgintro.pdf'::pdf::text);

SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';

Full-Text Search (FTS)

You can also perform full-text search (FTS), since you can work on a pdf file like normal text.

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');

 ?column? 
----------
 t
(1 row)

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');

 ?column? 
----------
 f
(1 row)

Document similarity with `pg_trgm`

You can use pg_trgm to get the similarity between two documents:

CREATE EXTENSION pg_trgm;

SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);

Metadata

The following functions are available:

pdf_title(pdf) → text
pdf_author(pdf) → text
pdf_num_pages(pdf) → integer

Total number of pages in the document
pdf_page(pdf, integer) → text

Get the i-th page as text
pdf_creator(pdf) → text
pdf_keywords(pdf) → text
pdf_metadata(pdf) → text
pdf_version(pdf) → text
pdf_subject(pdf) → text
pdf_creation(pdf) → timestamp
pdf_modification(pdf) → timestamp

SELECT pdf_title('/tmp/pgintro.pdf');

        pdf_title        
-------------------------
 PostgreSQL Introduction
(1 row)

SELECT pdf_author('/tmp/pgintro.pdf');

 pdf_author 
------------
 周正中
(1 row)

Getting a subset of pages

SELECT pdf_num_pages('/tmp/pgintro.pdf');

 pdf_num_pages 
---------------
            24
(1 row)

SELECT pdf_page('/tmp/pgintro.pdf', 1);

           pdf_page           
------------------------------
 Catalog                     +
  PostgreSQL Origin         +
  Layout                    +
  Features                  +
  Enterprise Class Attribute+
  Case
(1 row)

SELECT pdf_subject('/tmp/pgintro.pdf');

 pdf_subject 
-------------
 
(1 row)

SELECT pdf_creation('/tmp/pgintro.pdf');

       pdf_creation       
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)

SELECT pdf_modification('/tmp/pgintro.pdf');

     pdf_modification     
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)

SELECT pdf_creator('/tmp/pgintro.pdf');

            pdf_creator             
------------------------------------
 Microsoft® Office PowerPoint® 2007
(1 row)

SELECT pdf_metadata('/tmp/pgintro.pdf');

 pdf_metadata 
--------------
 
(1 row)

SELECT pdf_version('/tmp/pgintro.pdf');

 pdf_version 
-------------
 PDF-1.5
(1 row)

Installation

Install poppler dependencies

Linux

sudo apt install -y libpoppler-glib-dev pkg-config

Homebrew/MacOS

brew install poppler pkgconf

cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo

After the installation, in a session:

CREATE EXTENSION pgpdf;

Docker

Get the Docker image with:

docker pull florents/pgpdf:pg17

This adds pgpdf to the Postgres image (replace 17 with your Postgres server version, and run it the same way).

Run the image in a container.

docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17

Through another terminal, connect to the running server (container).

PGPASSWORD=pass psql -h localhost -p 5432 -U postgres

[!WARNING] Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.

反馈

这个页面有帮助吗？

很高兴收到您的反馈！欢迎告诉我们有什么可以改进的地方.

很遗憾听到这个消息，欢迎告诉我们哪里可以继续改进.

最后修改 2025-03-07: routine update (2106723)

pgpdf

Module:

Categories:

Tags:

扩展总览

基本信息

元数据

软件包

最新版本

扩展安装

使用方法

Examples

String Functions and Operators

Full-Text Search (FTS)

Document similarity with `pg_trgm`

Metadata

Installation

Docker

反馈

pgpdf

扩展总览

基本信息

元数据

软件包

最新版本

扩展安装

使用方法

Examples

String Functions and Operators

Full-Text Search (FTS)

Document similarity with pg_trgm

Metadata

Installation

Docker

反馈

Document similarity with `pg_trgm`