pgpdf
Module:
Categories:
扩展总览
PIGSTY 第三方扩展: pgpdf
: PDF数据类型,管理函数与全文检索
基本信息
- 扩展编号: 3530
- 扩展名称:
pgpdf
- 标准包名:
pgpdf
- 扩展类目:
TYPE
- 开源协议: GPLv3
- 官方网站: https://github.com/Florents-Tselai/pgpdf
- 编程语言: C
- 其他标签: 无
- 备注信息: 无
元数据
- 默认版本: 0.1.0
- PG大版本:
17
,16
,15
,14
,13
- 动态加载: 需要显式加载
- 需要DDL: 需要执行
CREATE EXTENSION
DDL - 可重定位: 无法安装至任意模式下
- 信任程度: 受信任,无需超级用户,带
CREATE
权限的用户可以直接创建 - 所需模式: 无
- 所需扩展: 无
软件包
- RPM仓库:PIGSTY
- RPM包名:
pgpdf_$v*
- RPM版本:
0.1.0
- RPM依赖:无
- DEB仓库:PIGSTY
- DEB包名:
postgresql-$v-pgpdf
- DEB版本:
0.1.0
- DEB依赖:无
最新版本
系统 | 架构 | PG17 | PG16 | PG15 | PG14 | PG13 |
---|---|---|---|---|---|---|
el8 |
x86_64 |
pgpdf_17 PIGSTY 0.1.0 |
pgpdf_16 PIGSTY 0.1.0 |
pgpdf_15 PIGSTY 0.1.0 |
pgpdf_14 PIGSTY 0.1.0 |
pgpdf_13 PIGSTY 0.1.0 |
el8 |
aarch64 |
pgpdf_17 PIGSTY 0.1.0 |
pgpdf_16 PIGSTY 0.1.0 |
pgpdf_15 PIGSTY 0.1.0 |
pgpdf_14 PIGSTY 0.1.0 |
pgpdf_13 PIGSTY 0.1.0 |
el9 |
x86_64 |
pgpdf_17 PIGSTY 0.1.0 |
pgpdf_16 PIGSTY 0.1.0 |
pgpdf_15 PIGSTY 0.1.0 |
pgpdf_14 PIGSTY 0.1.0 |
pgpdf_13 PIGSTY 0.1.0 |
el9 |
aarch64 |
pgpdf_17 PIGSTY 0.1.0 |
pgpdf_16 PIGSTY 0.1.0 |
pgpdf_15 PIGSTY 0.1.0 |
pgpdf_14 PIGSTY 0.1.0 |
pgpdf_13 PIGSTY 0.1.0 |
d12 |
x86_64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
d12 |
aarch64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
u22 |
x86_64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
u22 |
aarch64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
u24 |
x86_64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
u24 |
aarch64 |
postgresql-17-pgpdf PIGSTY 0.1.0 |
postgresql-16-pgpdf PIGSTY 0.1.0 |
postgresql-15-pgpdf PIGSTY 0.1.0 |
postgresql-14-pgpdf PIGSTY 0.1.0 |
postgresql-13-pgpdf PIGSTY 0.1.0 |
扩展安装
使用 pig
命令行工具安装 pgpdf
扩展:
pig ext install pgpdf
使用 Pigsty剧本 安装 pgpdf 扩展:
./pgsql.yml -t pg_extension -e '{"pg_extensions": ["pgpdf"]}' # -l <集群名>
dnf install pgpdf_17*;
dnf install pgpdf_16*;
dnf install pgpdf_15*;
dnf install pgpdf_14*;
dnf install pgpdf_13*;
apt install postgresql-17-pgpdf;
apt install postgresql-16-pgpdf;
apt install postgresql-15-pgpdf;
apt install postgresql-14-pgpdf;
apt install postgresql-13-pgpdf;
扩展 pgpdf
需要通过 shared_preload_libraries
进行 动态加载:
shared_preload_libraries = 'pgpdf'; # 修改 PG 集群配置
使用以下 SQL 命令在已经安装此扩展插件的 PG 集群上 启用 pgpdf
扩展:
CREATE EXTENSION pgpdf;
使用方法
The actual PDF parsing is done by poppler.
This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.
Download some PDFs.
wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf
You can create a pdf
type, by casting either a text
filepath or bytea
column.
CREATE EXTENSION pgpdf;
SELECT '/tmp/pgintro.pdf'::pdf;
pdf
----------------------------------------------------------------------------------
PostgreSQL Introduction +
Digoal.Zhou +
7/20/2011Catalog +
PostgreSQL Origin
If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea
column, you can just cast it to pdf
.
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;
Examples
Create a table with a pdf
column:
CREATE TABLE pdfs(name text primary key, doc pdf);
INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');
Parsing and validation should happen automatically. The files will be read from the disk only once!
[!NOTE] The filepath should be accessible by the
postgres
process / user! That’s different than the user running psql. If you don’t understand what this means, as your DBA!
String Functions and Operators
Standard Postgres String Functions and Operators should work as usual:
SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;
SELECT upper('/tmp/pgintro.pdf'::pdf::text);
SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';
Full-Text Search (FTS)
You can also perform full-text search (FTS), since you can work on a pdf
file like normal text.
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
?column?
----------
t
(1 row)
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
?column?
----------
f
(1 row)
Document similarity with pg_trgm
You can use pg_trgm to get the similarity between two documents:
CREATE EXTENSION pg_trgm;
SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);
Metadata
The following functions are available:
-
pdf_title(pdf) → text
-
pdf_author(pdf) → text
-
pdf_num_pages(pdf) → integer
Total number of pages in the document
-
pdf_page(pdf, integer) → text
Get the i-th page as text
-
pdf_creator(pdf) → text
-
pdf_keywords(pdf) → text
-
pdf_metadata(pdf) → text
-
pdf_version(pdf) → text
-
pdf_subject(pdf) → text
-
pdf_creation(pdf) → timestamp
-
pdf_modification(pdf) → timestamp
SELECT pdf_title('/tmp/pgintro.pdf');
pdf_title
-------------------------
PostgreSQL Introduction
(1 row)
SELECT pdf_author('/tmp/pgintro.pdf');
pdf_author
------------
周正中
(1 row)
Getting a subset of pages
SELECT pdf_num_pages('/tmp/pgintro.pdf');
pdf_num_pages
---------------
24
(1 row)
SELECT pdf_page('/tmp/pgintro.pdf', 1);
pdf_page
------------------------------
Catalog +
PostgreSQL Origin +
Layout +
Features +
Enterprise Class Attribute+
Case
(1 row)
SELECT pdf_subject('/tmp/pgintro.pdf');
pdf_subject
-------------
(1 row)
SELECT pdf_creation('/tmp/pgintro.pdf');
pdf_creation
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_modification('/tmp/pgintro.pdf');
pdf_modification
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_creator('/tmp/pgintro.pdf');
pdf_creator
------------------------------------
Microsoft® Office PowerPoint® 2007
(1 row)
SELECT pdf_metadata('/tmp/pgintro.pdf');
pdf_metadata
--------------
(1 row)
SELECT pdf_version('/tmp/pgintro.pdf');
pdf_version
-------------
PDF-1.5
(1 row)
Installation
Install poppler dependencies
Linux
sudo apt install -y libpoppler-glib-dev pkg-config
Homebrew/MacOS
brew install poppler pkgconf
cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo
After the installation, in a session:
CREATE EXTENSION pgpdf;
Docker
Get the Docker image with:
docker pull florents/pgpdf:pg17
This adds pgpdf to the Postgres image (replace 17
with your Postgres server version, and run it the same way).
Run the image in a container.
docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17
Through another terminal, connect to the running server (container).
PGPASSWORD=pass psql -h localhost -p 5432 -U postgres
[!WARNING] Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.