Database
- Self-Hosting Supabase on PostgreSQL
- Database in K8S: Pros & Cons
- NewSQL: Distributive Nonsense
- Is running postgres in docker a good idea?
Self-Hosting Supabase on PostgreSQL
Supabase is great; owning your own Supabase is even better. Here is a comprehensive tutorial for self-hosting production-grade Supabase on local or cloud VMs and bare-metal servers.
What is Supabase?
Supabase is an open-source Firebase alternative, a Backend as a Service (BaaS).
Supabase wraps the PostgreSQL kernel and vector extensions, along with authentication, realtime subscriptions, edge functions, object storage, and instant REST and GraphQL APIs generated from your Postgres schema. It lets you skip most backend work, requiring only database design and frontend skills to ship quickly.
Currently, Supabase may be the most popular open-source project in the PostgreSQL ecosystem, with over 74,000 stars on GitHub. It has become quite popular among developers and startups, thanks to a generous free plan, much like Cloudflare and Neon.
Why Self-Hosting?
Supabase’s slogan is “Build in a weekend, Scale to millions”. It is indeed very cost-effective at small scale (4c8g). But once you really grow to millions of users, you may choose to self-host your own Supabase for functionality, performance, cost, and other reasons.
That’s where Pigsty comes in. Pigsty provides a complete one-click self-hosting solution for Supabase. Self-hosted Supabase can enjoy full PostgreSQL monitoring, IaC, PITR, and high availability.
You can run the latest PostgreSQL 17 kernel (16, 15, and 14 are also supported; Supabase itself currently uses 15), along with 340 PostgreSQL extensions out of the box. It runs on mainstream Linux distros with production-grade HA PostgreSQL, MinIO, a Prometheus & Grafana stack for observability, and Nginx as the reverse proxy.
TIME: timescaledb
timescaledb_toolkit
timeseries
periods
temporal_tables
emaj
table_version
pg_cron
pg_later
pg_background
GIS: postgis
postgis_topology
postgis_raster
postgis_sfcgal
postgis_tiger_geocoder
address_standardizer
address_standardizer_data_us
pgrouting
pointcloud
pointcloud_postgis
h3
h3_postgis
q3c
ogr_fdw
geoip
pg_polyline
pg_geohash
mobilitydb
earthdistance
RAG: vector
vectorscale
vectorize
pg_similarity
smlar
pg_summarize
pg_tiktoken
pgml
pg4ml
FTS: pg_search
pg_bigm
zhparser
hunspell_cs_cz
hunspell_de_de
hunspell_en_us
hunspell_fr
hunspell_ne_np
hunspell_nl_nl
hunspell_nn_no
hunspell_pt_pt
hunspell_ru_ru
hunspell_ru_ru_aot
fuzzystrmatch
pg_trgm
OLAP: citus
citus_columnar
columnar
pg_analytics
pg_duckdb
pg_mooncake
duckdb_fdw
pg_parquet
pg_fkpart
pg_partman
plproxy
pg_strom
tablefunc
FEAT: age
hll
rum
pg_graphql
pg_jsonschema
jsquery
pg_hint_plan
hypopg
index_advisor
plan_filter
imgsmlr
pg_ivm
pgmq
pgq
pg_cardano
rdkit
bloom
LANG: pg_tle
plv8
pllua
hstore_pllua
plluau
hstore_plluau
plprql
pldbgapi
plpgsql_check
plprofiler
plsh
pljava
plr
pgtap
faker
dbt2
pltcl
pltclu
plperl
bool_plperl
hstore_plperl
jsonb_plperl
plperlu
bool_plperlu
jsonb_plperlu
hstore_plperlu
plpgsql
plpython3u
jsonb_plpython3u
ltree_plpython3u
hstore_plpython3u
TYPE: prefix
semver
unit
md5hash
asn1oid
roaringbitmap
pgfaceting
pg_sphere
country
currency
pgmp
numeral
pg_rational
uint
uint128
ip4r
uri
pgemailaddr
acl
debversion
pg_rrule
timestamp9
chkpass
isn
seg
cube
ltree
hstore
citext
xml2
FUNC: topn
gzip
zstd
http
pg_net
pg_smtp_client
pg_html5_email_address
pgsql_tweaks
pg_extra_time
timeit
count_distinct
extra_window_functions
first_last_agg
tdigest
aggs_for_vecs
aggs_for_arrays
arraymath
quantile
lower_quantile
pg_idkit
pg_uuidv7
permuteseq
pg_hashids
sequential_uuids
pg_math
random
base36
base62
pg_base58
floatvec
financial
pgjwt
pg_hashlib
shacrypt
cryptint
pguecc
pgpcre
icu_ext
pgqr
envvar
pg_protobuf
url_encode
refint
autoinc
insert_username
moddatetime
tsm_system_time
dict_xsyn
tsm_system_rows
tcn
uuid-ossp
btree_gist
btree_gin
intarray
intagg
dict_int
unaccent
ADMIN: pg_repack
pg_squeeze
pg_dirtyread
pgfincore
pgdd
ddlx
prioritize
pg_checksums
pg_readonly
safeupdate
pg_permissions
pgautofailover
pg_catcheck
pre_prepare
pgcozy
pg_orphaned
pg_crash
pg_cheat_funcs
pg_savior
table_log
pg_fio
pgpool_adm
pgpool_recovery
pgpool_regclass
pgagent
vacuumlo
pg_prewarm
oid2name
lo
basic_archive
basebackup_to_shell
old_snapshot
adminpack
amcheck
pg_surgery
STAT: pg_profile
pg_show_plans
pg_stat_kcache
pg_stat_monitor
pg_qualstats
pg_store_plans
pg_track_settings
pg_wait_sampling
system_stats
meta
pgnodemx
pg_proctab
pg_sqlog
bgw_replstatus
pgmeminfo
toastinfo
explain_ui
pg_relusage
pg_top
pagevis
powa
pageinspect
pgrowlocks
sslinfo
pg_buffercache
pg_walinspect
pg_freespacemap
pg_visibility
pgstattuple
auto_explain
pg_stat_statements
SEC: passwordcheck_cracklib
supautils
pgsodium
supabase_vault
pg_session_jwt
anon
pg_tde
pgsmcrypto
pgaudit
pgauditlogtofile
pg_auth_mon
credcheck
pgcryptokey
pg_jobmon
logerrors
login_hook
set_user
pg_snakeoil
pgextwlist
pg_auditor
sslutils
noset
sepgsql
auth_delay
pgcrypto
passwordcheck
FDW: wrappers
multicorn
odbc_fdw
jdbc_fdw
mysql_fdw
oracle_fdw
tds_fdw
db2_fdw
sqlite_fdw
pgbouncer_fdw
mongo_fdw
redis_fdw
redis
kafka_fdw
hdfs_fdw
firebird_fdw
aws_s3
log_fdw
dblink
file_fdw
postgres_fdw
SIM: orafce
pgtt
session_variable
pg_statement_rollback
pg_dbms_metadata
pg_dbms_lock
pg_dbms_job
babelfishpg_common
babelfishpg_tsql
babelfishpg_tds
babelfishpg_money
pgmemcache
ETL: pglogical
pglogical_origin
pglogical_ticker
pgl_ddl_deploy
pg_failover_slots
wal2json
wal2mongo
decoderbufs
decoder_raw
test_decoding
mimeo
repmgr
pg_fact_loader
pg_bulkload
Since most of the Supabase-maintained extensions are not available in the official PGDG repo, we have compiled RPM/DEB packages for them and put them in the Pigsty repo: pg_graphql, pg_jsonschema, wrappers, index_advisor, pg_net, vault, pgjwt, supautils, pg_plan_filter.
Everything is under your control: you have the ability and freedom to scale PGSQL, MinIO, and Supabase itself, and to take full advantage of the performance and cost benefits of modern hardware such as Gen5 NVMe SSDs.
All you need to do is prepare a VM, run a few commands, and wait about 10 minutes.
Get Started
First, download and install Pigsty as usual, with the supa config template:
curl -fsSL https://repo.pigsty.io/get | bash
./bootstrap # install deps (ansible)
./configure -c supa # use supa config template (IMPORTANT: CHANGE PASSWORDS!)
./install.yml # install pigsty, create ha postgres & minio clusters
Please change the pigsty.yml config file according to your needs (especially the credentials) before deploying Supabase. For dev/test/demo purposes, we will skip that for now and come back to it later.
Then, run the supabase.yml playbook to launch the stateless part of Supabase.
./supabase.yml # launch stateless supabase containers with docker compose
You can access the Supabase API / Web UI directly through port 8000 (HTTP) or 8443 (HTTPS).
With DNS configured, or a local /etc/hosts entry, you can also use the default supa.pigsty domain name via the 80/443 infra portal.
Credentials for Supabase Studio: supabase : pigsty
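For a quick local test, the /etc/hosts entry mentioned above can be added idempotently. This is a minimal sketch, assuming the node IP 10.10.10.10 used by the supa template; `ensure_hosts_entry` is a hypothetical helper name and not part of Pigsty.

```shell
#!/usr/bin/env bash
# ensure_hosts_entry FILE IP DOMAIN  (hypothetical helper, not a Pigsty command)
# Appends "IP DOMAIN" to FILE unless a non-comment line already lists DOMAIN.
ensure_hosts_entry() {
  local file="$1" ip="$2" domain="$3"
  # Compare DOMAIN against every hostname field (field 2 onward) of each active line.
  if [ -f "$file" ] && awk -v d="$domain" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == d) found = 1 }
        END { exit !found }' "$file"; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$domain" >> "$file"
}
```

Run as root on the client machine, it would be called as `ensure_hosts_entry /etc/hosts 10.10.10.10 supa.pigsty`; calling it twice leaves a single entry.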
Architecture
Pigsty’s Supabase is based on the Supabase Docker Compose Template, with some slight modifications to fit Pigsty’s default ACL model.
The stateful part of this template is replaced by Pigsty’s managed PostgreSQL cluster and MinIO cluster. The containers are stateless, so you can launch, destroy, or run multiple Supabase containers against the same stateful PGSQL / MinIO cluster simultaneously to scale out.
The built-in supa.yml config template will create a single-node Supabase, with a singleton PostgreSQL and an SNSD (single-node, single-drive) MinIO server.
In production, you can use multi-node PostgreSQL clusters and MNMD (multi-node, multi-drive) MinIO clusters or an external S3 service instead; we will cover that later.
Config Detail
Here is a checklist for self-hosting:
- Hardware: the necessary VM/BM resources, one node at least; 3-4 nodes are recommended for HA.
- Linux OS: a Linux x86_64 server with a freshly installed OS; check the list of compatible distros.
- Network: a static IPv4 address that can be used as the node identity.
- Admin User: passwordless ssh & sudo are recommended for the admin user.
- Conf Template: use the supa config template if you don’t know how to configure Pigsty manually.
The built-in supa.yml config template is shown below.
The supa Config Template
all:
  children:
    # infra cluster for proxy, monitor, alert, etc..
    infra: { hosts: { 10.10.10.10: { infra_seq: 1 } } }
    # etcd cluster for ha postgres
    etcd: { hosts: { 10.10.10.10: { etcd_seq: 1 } }, vars: { etcd_cluster: etcd } }
    # minio cluster, s3 compatible object storage
    minio: { hosts: { 10.10.10.10: { minio_seq: 1 } }, vars: { minio_cluster: minio } }
    # pg-meta, the underlying postgres database for supabase
    pg-meta:
      hosts: { 10.10.10.10: { pg_seq: 1, pg_role: primary } }
      vars:
        pg_cluster: pg-meta
        pg_users:
          # supabase roles: anon, authenticated, dashboard_user
          - { name: anon ,login: false }
          - { name: authenticated ,login: false }
          - { name: dashboard_user ,login: false ,replication: true ,createdb: true ,createrole: true }
          - { name: service_role ,login: false ,bypassrls: true }
          # supabase users: please use the same password
          - { name: supabase_admin ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: true ,roles: [ dbrole_admin ] ,superuser: true ,replication: true ,createdb: true ,createrole: true ,bypassrls: true }
          - { name: authenticator ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: false ,roles: [ dbrole_admin, authenticated ,anon ,service_role ] }
          - { name: supabase_auth_admin ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: false ,roles: [ dbrole_admin ] ,createrole: true }
          - { name: supabase_storage_admin ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: false ,roles: [ dbrole_admin, authenticated ,anon ,service_role ] ,createrole: true }
          - { name: supabase_functions_admin ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: false ,roles: [ dbrole_admin ] ,createrole: true }
          - { name: supabase_replication_admin ,password: 'DBUser.Supa' ,replication: true ,roles: [ dbrole_admin ]}
          - { name: supabase_read_only_user ,password: 'DBUser.Supa' ,bypassrls: true ,roles: [ dbrole_readonly, pg_read_all_data ] }
        pg_databases:
          - name: postgres
            baseline: supabase.sql
            owner: supabase_admin
            comment: supabase postgres database
            schemas: [ extensions ,auth ,realtime ,storage ,graphql_public ,supabase_functions ,_analytics ,_realtime ]
            extensions:
              - { name: pgcrypto ,schema: extensions } # 1.3 : cryptographic functions
              - { name: pg_net ,schema: extensions } # 0.9.2 : async HTTP
              - { name: pgjwt ,schema: extensions } # 0.2.0 : json web token API for postgres
              - { name: uuid-ossp ,schema: extensions } # 1.1 : generate universally unique identifiers (UUIDs)
              - { name: pgsodium } # 3.1.9 : pgsodium is a modern cryptography library for Postgres.
              - { name: supabase_vault } # 0.2.8 : Supabase Vault Extension
              - { name: pg_graphql } # 1.5.9 : pg_graphql: GraphQL support
              - { name: pg_jsonschema } # 0.3.3 : pg_jsonschema: Validate json schema
              - { name: wrappers } # 0.4.3 : wrappers: FDW collections
              - { name: http } # 1.6 : http: allows web page retrieval inside the database.
              - { name: pg_cron } # 1.6 : pg_cron: Job scheduler for PostgreSQL
              - { name: timescaledb } # 2.17 : timescaledb: Enables scalable inserts and complex queries for time-series data
              - { name: pg_tle } # 1.2 : pg_tle: Trusted Language Extensions for PostgreSQL
              - { name: vector } # 0.8.0 : pgvector: the vector similarity search
        # supabase required extensions
        pg_libs: 'pg_stat_statements, plpgsql, plpgsql_check, pg_cron, pg_net, timescaledb, auto_explain, pg_tle, plan_filter'
        pg_extensions: # extensions to be installed on this cluster
          - supabase # essential extensions for supabase
          - timescaledb postgis pg_graphql pg_jsonschema wrappers pg_search pg_analytics pg_parquet plv8 duckdb_fdw pg_cron pg_timetable pgqr
          - supautils pg_plan_filter passwordcheck plpgsql_check pgaudit pgsodium pg_vault pgjwt pg_ecdsa pg_session_jwt index_advisor
          - pgvector pgvectorscale pg_summarize pg_tiktoken pg_tle pg_stat_monitor hypopg pg_hint_plan pg_http pg_net pg_smtp_client pg_idkit
        pg_parameters:
          cron.database_name: postgres
          pgsodium.enable_event_trigger: off
        pg_hba_rules: # supabase hba rules, require access from docker network
          - { user: all ,db: postgres ,addr: intra ,auth: pwd ,title: 'allow supabase access from intranet' }
          - { user: all ,db: postgres ,addr: 172.17.0.0/16 ,auth: pwd ,title: 'allow access from local docker network' }
        node_crontab: [ '00 01 * * * postgres /pg/bin/pg-backup full' ] # make a full backup every 1am
    # launch supabase stateless part with docker compose: ./supabase.yml
    supabase:
      hosts:
        10.10.10.10: { supa_seq: 1 } # instance id
      vars:
        supa_cluster: supa # cluster name
        docker_enabled: true # enable docker
        # use these to pull docker images via proxy and mirror registries
        #docker_registry_mirrors: ['https://docker.xxxxx.io']
        #proxy_env: # add [OPTIONAL] proxy env to /etc/docker/daemon.json configuration file
        #  no_proxy: "localhost,127.0.0.1,10.0.0.0/8,192.168.0.0/16,*.pigsty,*.aliyun.com,mirrors.*,*.myqcloud.com,*.tsinghua.edu.cn"
        #  #all_proxy: http://user:pass@host:port
        # these configuration entries will OVERWRITE or APPEND to /opt/supabase/.env file (src template: app/supabase/.env)
        # check https://github.com/Vonng/pigsty/blob/main/app/supabase/.env for default values
        supa_config:
          # IMPORTANT: CHANGE JWT_SECRET AND REGENERATE CREDENTIAL ACCORDING!!!!!!!!!!!
          # https://supabase.com/docs/guides/self-hosting/docker#securing-your-services
          jwt_secret: your-super-secret-jwt-token-with-at-least-32-characters-long
          anon_key: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyAgCiAgICAicm9sZSI6ICJhbm9uIiwKICAgICJpc3MiOiAic3VwYWJhc2UtZGVtbyIsCiAgICAiaWF0IjogMTY0MTc2OTIwMCwKICAgICJleHAiOiAxNzk5NTM1NjAwCn0.dc_X5iR_VP_qT0zsiyj_I_OZ2T9FtRU2BBNWN8Bu4GE
          service_role_key: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyAgCiAgICAicm9sZSI6ICJzZXJ2aWNlX3JvbGUiLAogICAgImlzcyI6ICJzdXBhYmFzZS1kZW1vIiwKICAgICJpYXQiOiAxNjQxNzY5MjAwLAogICAgImV4cCI6IDE3OTk1MzU2MDAKfQ.DaYlNEoUrrEn2Ig7tqibS-PHK5vgusbcbo7X36XVt4Q
          dashboard_username: supabase
          dashboard_password: pigsty
          # postgres connection string (use the correct ip and port)
          postgres_host: 10.10.10.10
          postgres_port: 5436 # access via the 'default' service, which always route to the primary postgres
          postgres_db: postgres
          postgres_password: DBUser.Supa # password for supabase_admin and multiple supabase users
          # expose supabase via domain name
          site_url: http://supa.pigsty
          api_external_url: http://supa.pigsty
          supabase_public_url: http://supa.pigsty
          # if using s3/minio as file storage
          s3_bucket: supa
          s3_endpoint: https://sss.pigsty:9000
          s3_access_key: supabase
          s3_secret_key: S3User.Supabase
          s3_force_path_style: true
          s3_protocol: https
          s3_region: stub
          minio_domain_ip: 10.10.10.10 # sss.pigsty domain name will resolve to this ip statically
          # if using SMTP (optional)
          #smtp_admin_email: admin@example.com
          #smtp_host: supabase-mail
          #smtp_port: 2500
          #smtp_user: fake_mail_user
          #smtp_pass: fake_mail_password
          #smtp_sender_name: fake_sender
          #enable_anonymous_users: false
  vars:
    version: v3.1.0 # pigsty version string
    admin_ip: 10.10.10.10 # admin node ip address
    region: default # upstream mirror region: default|china|europe
    node_tune: oltp # node tuning specs: oltp,olap,tiny,crit
    pg_conf: oltp.yml # pgsql tuning specs: {oltp,olap,tiny,crit}.yml
    infra_portal: # domain names and upstream servers
      home : { domain: h.pigsty }
      grafana : { domain: g.pigsty ,endpoint: "${admin_ip}:3000" , websocket: true }
      prometheus : { domain: p.pigsty ,endpoint: "${admin_ip}:9090" }
      alertmanager : { domain: a.pigsty ,endpoint: "${admin_ip}:9093" }
      minio : { domain: m.pigsty ,endpoint: "10.10.10.10:9001", https: true, websocket: true }
      blackbox : { endpoint: "${admin_ip}:9115" }
      loki : { endpoint: "${admin_ip}:3100" }
      # expose supa studio UI and API via nginx
      supa : { domain: supa.pigsty ,endpoint: "10.10.10.10:8000", websocket: true }
    #----------------------------------#
    # Credential: CHANGE THESE PASSWORDS
    #----------------------------------#
    #grafana_admin_username: admin
    grafana_admin_password: pigsty
    #pg_admin_username: dbuser_dba
    pg_admin_password: DBUser.DBA
    #pg_monitor_username: dbuser_monitor
    pg_monitor_password: DBUser.Monitor
    #pg_replication_username: replicator
    pg_replication_password: DBUser.Replicator
    #patroni_username: postgres
    patroni_password: Patroni.API
    #haproxy_admin_username: admin
    haproxy_admin_password: pigsty
    # use minio as supabase file storage, single node single driver mode for demonstration purpose
    minio_access_key: minioadmin # root access key, `minioadmin` by default
    minio_secret_key: minioadmin # root secret key, `minioadmin` by default
    minio_buckets: [ { name: pgsql }, { name: supa } ]
    minio_users:
      - { access_key: dba , secret_key: S3User.DBA, policy: consoleAdmin }
      - { access_key: pgbackrest , secret_key: S3User.Backup, policy: readwrite }
      - { access_key: supabase , secret_key: S3User.Supabase, policy: readwrite }
    minio_endpoint: https://sss.pigsty:9000 # explicit overwrite minio endpoint with haproxy port
    node_etc_hosts: ["10.10.10.10 sss.pigsty"] # domain name to access minio from all nodes (required)
    # use minio as default backup repo for PostgreSQL
    pgbackrest_method: minio # pgbackrest repo method: local,minio,[user-defined...]
    pgbackrest_repo: # pgbackrest repo: https://pgbackrest.org/configuration.html#section-repository
      local: # default pgbackrest repo with local posix fs
        path: /pg/backup # local backup directory, `/pg/backup` by default
        retention_full_type: count # retention full backups by count
        retention_full: 2 # keep 2, at most 3 full backup when using local fs repo
      minio: # optional minio repo for pgbackrest
        type: s3 # minio is s3-compatible, so s3 is used
        s3_endpoint: sss.pigsty # minio endpoint domain name, `sss.pigsty` by default
        s3_region: us-east-1 # minio region, us-east-1 by default, useless for minio
        s3_bucket: pgsql # minio bucket name, `pgsql` by default
        s3_key: pgbackrest # minio user access key for pgbackrest
        s3_key_secret: S3User.Backup # minio user secret key for pgbackrest
        s3_uri_style: path # use path style uri for minio rather than host style
        path: /pgbackrest # minio backup path, default is `/pgbackrest`
        storage_port: 9000 # minio port, 9000 by default
        storage_ca_file: /pg/cert/ca.crt # minio ca file path, `/pg/cert/ca.crt` by default
        bundle: y # bundle small files into a single file
        cipher_type: aes-256-cbc # enable AES encryption for remote backup repo
        cipher_pass: pgBackRest # AES encryption password, default is 'pgBackRest'
        retention_full_type: time # retention full backup by time on minio repo
        retention_full: 14 # keep full backup for last 14 days
    # download docker and supabase related extensions
    pg_version: 17
    repo_modules: node,pgsql,infra,docker
    repo_packages: [node-bootstrap, infra-package, infra-addons, node-package1, node-package2, pgsql-utility, docker ]
    repo_extra_packages:
      - pgsql-main
      - supabase # essential extensions for supabase
      - timescaledb postgis pg_graphql pg_jsonschema wrappers pg_search pg_analytics pg_parquet plv8 duckdb_fdw pg_cron pg_timetable pgqr
      - supautils pg_plan_filter passwordcheck plpgsql_check pgaudit pgsodium pg_vault pgjwt pg_ecdsa pg_session_jwt index_advisor
      - pgvector pgvectorscale pg_summarize pg_tiktoken pg_tle pg_stat_monitor hypopg pg_hint_plan pg_http pg_net pg_smtp_client pg_idkit
For advanced topics, we may need to modify the configuration file to fit our needs.
- Security Enhancement
- Domain Name and HTTPS
- Sending Mail with SMTP
- MinIO or External S3
- True High Availability
Security Enhancement
For security reasons, you should change the default passwords in the pigsty.yml config file.
- grafana_admin_password: pigsty, Grafana admin password
- pg_admin_password: DBUser.DBA, PGSQL superuser password
- pg_monitor_password: DBUser.Monitor, PGSQL monitor user password
- pg_replication_password: DBUser.Replicator, PGSQL replication user password
- patroni_password: Patroni.API, Patroni HA agent password
- haproxy_admin_password: pigsty, load balancer admin password
- minio_access_key: minioadmin, MinIO root username
- minio_secret_key: minioadmin, MinIO root password
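One way to come up with replacement values is to let OpenSSL generate them. The sketch below is only an illustration: `gen_secret` is a hypothetical helper, and it prints candidate values for the password parameters listed above so you can paste them into pigsty.yml yourself; it does not edit any file.

```shell
#!/usr/bin/env bash
# gen_secret [BYTES]: print BYTES random bytes as hex (2 characters per byte).
# Hypothetical helper; defaults to 16 bytes, i.e. a 32-character string.
gen_secret() {
  openssl rand -hex "${1:-16}"
}

# Print one fresh candidate per password parameter (names taken from the list above).
for name in grafana_admin_password pg_admin_password pg_monitor_password \
            pg_replication_password patroni_password haproxy_admin_password \
            minio_secret_key; do
  printf '%s: %s\n' "$name" "$(gen_secret)"
done
```

Hex output avoids characters that would need quoting in YAML or in connection strings.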
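Supabase uses PostgreSQL & MinIO as its backend, so also change the following passwords for the Supabase business users:
- pg_users: passwords for the Supabase business users in PostgreSQL
- minio_users: secret keys for the MinIO business users (S3User.DBA, S3User.Backup, and S3User.Supabase in the template)
pgBackRest sends backups and WAL archives to MinIO, so also update the credentials referenced in the backup repo definition:
- pgbackrest_repo: the s3_key and s3_key_secret of the minio repo, which must match the pgbackrest user in minio_users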
Please check the Supabase Self-Hosting guide (Generate API Keys) to generate the Supabase credentials:
- jwt_secret: a secret key with at least 40 characters
- anon_key: a JWT token generated for anonymous users, based on jwt_secret
- service_role_key: a JWT token generated for elevated service roles, based on jwt_secret
- dashboard_username: Supabase Studio web portal username, supabase by default
- dashboard_password: Supabase Studio web portal password, pigsty by default
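The two keys are ordinary HS256-signed JWTs, so they can also be produced offline. The following is a rough sketch using only openssl, not the official generator. The claim layout (role, iss, iat, exp) mirrors the demo keys in the template; the `iss` value "supabase", the five-year lifetime, and the helper names `b64url` and `make_jwt` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Base64url-encode stdin without padding, as used for JWT segments.
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }

# make_jwt SECRET ROLE IAT EXP: print an HS256 JWT carrying role/iss/iat/exp claims.
# Note: the secret is passed on the openssl command line, so run this on a trusted host.
make_jwt() {
  local secret="$1" role="$2" iat="$3" exp="$4" header payload sig
  header=$(printf '%s' '{"alg":"HS256","typ":"JWT"}' | b64url)
  payload=$(printf '{"role":"%s","iss":"supabase","iat":%s,"exp":%s}' \
            "$role" "$iat" "$exp" | b64url)
  # The signature is HMAC-SHA256 over "<header>.<payload>", keyed with the secret.
  sig=$(printf '%s.%s' "$header" "$payload" \
        | openssl dgst -sha256 -hmac "$secret" -binary | b64url)
  printf '%s.%s.%s\n' "$header" "$payload" "$sig"
}

jwt_secret=$(openssl rand -hex 24)        # 48 hex characters
now=$(date +%s)
exp=$((now + 5 * 365 * 24 * 3600))        # assumed lifetime: roughly five years
echo "jwt_secret: $jwt_secret"
echo "anon_key: $(make_jwt "$jwt_secret" anon "$now" "$exp")"
echo "service_role_key: $(make_jwt "$jwt_secret" service_role "$now" "$exp")"
```

The three printed values are meant to be copied into supa_config together, since both keys are only valid for the jwt_secret they were signed with.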
If you have changed the default passwords for PostgreSQL and MinIO, you have to update the following parameters as well:
- postgres_password, according to pg_users
- s3_access_key and s3_secret_key, according to minio_users
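A quick grep can catch defaults that were left behind. This is a sketch under the assumption that your config still uses the literal default values shown in the supa template; `find_default_secrets` is a hypothetical helper, and the pattern list is not exhaustive.

```shell
#!/usr/bin/env bash
# find_default_secrets FILE: print (with line numbers) every line of FILE that still
# contains one of the default credentials from the supa template.
# Exit status follows grep: 0 if any default remains, 1 if none was found.
find_default_secrets() {
  grep -nE 'DBUser\.(Supa|DBA|Monitor|Replicator)|Patroni\.API|S3User\.(DBA|Backup|Supabase)|cipher_pass: +pgBackRest|your-super-secret-jwt-token|_password: +pigsty|_key: +minioadmin' "$1"
}

# Example (not executed here):
#   if find_default_secrets pigsty.yml; then echo "default credentials remain"; fi
```

It only matches strings, so it cannot tell whether postgres_password and the pg_users passwords agree with each other; that still needs a manual look.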
Domain Name and HTTPS
For local or intranet use, you can connect directly to the Kong port at http://<IP>:8000, or 8443 for HTTPS.
This works but isn’t ideal. Using a domain with HTTPS is strongly recommended when serving Supabase to the public.
Pigsty installs and configures an Nginx server on the admin node as a reverse proxy for all web-based services. It is configured via the infra_portal parameter.
all:
  vars: # global vars
    #.....
    infra_portal: # domain names and upstream servers
      home : { domain: h.pigsty }
      grafana : { domain: g.pigsty ,endpoint: "${admin_ip}:3000" , websocket: true }
      prometheus : { domain: p.pigsty ,endpoint: "${admin_ip}:9090" }
      alertmanager : { domain: a.pigsty ,endpoint: "${admin_ip}:9093" }
      minio : { domain: m.pigsty ,endpoint: "10.10.10.10:9001", https: true, websocket: true }
      blackbox : { endpoint: "${admin_ip}:9115" }
      loki : { endpoint: "${admin_ip}:3100" }
      # expose supa studio UI and API via nginx
      supa : { domain: supa.pigsty ,endpoint: "10.10.10.10:8000", websocket: true }
On the client side, you can use the domain supa.pigsty to access the Supabase Studio management interface.
You can add this domain to your local /etc/hosts file or use a local DNS server to resolve it to the server’s external IP address.
To use a real domain with HTTPS, you will need to modify all.vars.infra_portal.supa with the updated domain name (such as supa.pigsty.cc here).
You can obtain a free HTTPS certificate from Let’s Encrypt and put the cert/key files in the specified path.
#supa : { domain: supa.pigsty ,endpoint: "10.10.10.10:8000", websocket: true } # add your HTTPS certs/keys and specify the path
supa : { domain: supa.pigsty.cc ,endpoint: "10.10.10.10:8000", websocket: true ,cert: /etc/cert/supa.pigsty.cc.crt ,key: /etc/cert/supa.pigsty.cc.key }
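Before reloading Nginx, it can be worth confirming that the certificate and key files you placed there actually belong together, since a mismatched pair or a mistyped path is an easy mistake to make. A minimal sketch with openssl follows; `cert_matches_key` is a hypothetical helper, and the paths in the usage note are the example paths from the config line above.

```shell
#!/usr/bin/env bash
# cert_matches_key CERT KEY: succeed (exit 0) when the public key embedded in the
# PEM certificate equals the public key derived from the PEM private key.
# Returns 2 if either file cannot be read or parsed, 1 on a mismatch.
cert_matches_key() {
  local cert_pub key_pub
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 2
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 2
  [ "$cert_pub" = "$key_pub" ]
}
```

For example, `cert_matches_key /etc/cert/supa.pigsty.cc.crt /etc/cert/supa.pigsty.cc.key && echo "cert and key match"`.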
To reload the new configuration after installation, use the infra.yml playbook:
./infra.yml -t nginx_config,nginx_launch # reload nginx config
You also have to update all.children.supabase.vars.supa_config to tell Supabase to use the new domain name:
all:
  children: # clusters
    supabase: # supabase group
      vars: # supabase param
        supa_config: # supabase config
          # update supabase domain names here
          site_url: http://supa.pigsty.cc
          api_external_url: http://supa.pigsty.cc
          supabase_public_url: http://supa.pigsty.cc
And reload the supabase service to apply the new configuration:
./supabase.yml -t supa_config,supa_launch # reload supabase config
Sending Mail with SMTP
Some Supabase features require email. For production use, I’d recommend using an external SMTP service, since self-hosted SMTP servers often result in rejected or spam-flagged emails.
To do this, modify the Supabase configuration and add SMTP credentials:
all:
  children:
    supabase:
      vars:
        supa_config:
          smtp_host: smtpdm.aliyun.com # host name only; the port goes in smtp_port
          smtp_port: 80
          smtp_user: no_reply@mail.your.domain.com
          smtp_pass: your_email_user_password
          smtp_sender_name: MySupabase
          smtp_admin_email: adminxxx@mail.your.domain.com
          enable_anonymous_users: false
Don’t forget to reload the Supabase service with ./supabase.yml -t supa_config,supa_launch afterwards.
MinIO or External S3
Pigsty’s self-hosted Supabase uses a local SNSD MinIO server by default, which serves Supabase itself for object storage and PostgreSQL for backups. For production use, you should consider an HA MNMD MinIO cluster or an external S3-compatible service instead.
We recommend using an external S3 when:
- you have only a single server available: an external S3 then gives you a minimal disaster recovery guarantee, with RTO in hours and RPO in MBs.
- you are operating in the cloud: using S3 directly is recommended over wrapping expensive EBS volumes with MinIO.
The terraform/spec/aliyun-meta-s3.tf file provides an example of how to provision a single node along with an S3 bucket.
To use an external S3-compatible service, you’ll have to update two related references in the pigsty.yml config.
For example, to use Aliyun OSS as the object storage for Supabase, you can modify all.children.supabase.vars.supa_config to point to the Aliyun OSS bucket:
all:
  children:
    supabase:
      vars:
        supa_config:
          s3_bucket: pigsty-oss
          s3_endpoint: https://oss-cn-beijing-internal.aliyuncs.com
          s3_access_key: xxxxxxxxxxxxxxxx
          s3_secret_key: xxxxxxxxxxxxxxxx
          s3_force_path_style: false
          s3_protocol: https
          s3_region: oss-cn-beijing
Reload the Supabase service with ./supabase.yml -t supa_config,supa_launch again.
The next reference is in the PostgreSQL backup repo:
all:
  vars:
    # use minio as default backup repo for PostgreSQL
    pgbackrest_method: minio # pgbackrest repo method: local,minio,[user-defined...]
    pgbackrest_repo: # pgbackrest repo: https://pgbackrest.org/configuration.html#section-repository
      local: # default pgbackrest repo with local posix fs
        path: /pg/backup # local backup directory, `/pg/backup` by default
        retention_full_type: count # retention full backups by count
        retention_full: 2 # keep 2, at most 3 full backup when using local fs repo
      minio: # optional minio repo for pgbackrest
        type: s3 # minio is s3-compatible, so s3 is used
        # update your credentials here
        s3_endpoint: oss-cn-beijing-internal.aliyuncs.com
        s3_region: oss-cn-beijing
        s3_bucket: pigsty-oss
        s3_key: xxxxxxxxxxxxxx
        s3_key_secret: xxxxxxxx
        s3_uri_style: host
        path: /pgbackrest # minio backup path, default is `/pgbackrest`
        storage_port: 9000 # minio port, 9000 by default
        storage_ca_file: /pg/cert/ca.crt # minio ca file path, `/pg/cert/ca.crt` by default
        bundle: y # bundle small files into a single file
        cipher_type: aes-256-cbc # enable AES encryption for remote backup repo
        cipher_pass: pgBackRest # AES encryption password, default is 'pgBackRest'
        retention_full_type: time # retention full backup by time on minio repo
        retention_full: 14 # keep full backup for last 14 days
After updating pgbackrest_repo, you can reset the pgBackRest backup with ./pgsql.yml -t pgbackrest.
True High Availability
The default single-node deployment (with external S3) provides a minimal disaster recovery guarantee, with RTO in hours and RPO in MBs.
To achieve RTO < 30s and zero data loss, you need a multi-node high availability cluster with at least three nodes.
This involves high availability for the following components:
- ETCD: DCS requires at least three nodes to tolerate one node failure.
- PGSQL: PGSQL synchronous commit mode recommends at least three nodes.
- INFRA: It’s good to have two or three copies of observability stack.
- Supabase itself can also have multiple replicas to achieve high availability.
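The three-node minimum for etcd follows from majority quorum: etcd uses Raft, so a cluster of n members needs floor(n/2)+1 of them alive and therefore tolerates floor((n-1)/2) failures. A small worked example of that arithmetic (the function name is just for illustration):

```shell
#!/usr/bin/env bash
# etcd_fault_tolerance N: number of member failures an N-member majority-quorum
# cluster can survive, i.e. floor((N - 1) / 2).
etcd_fault_tolerance() {
  local n="$1"
  echo $(( (n - 1) / 2 ))
}

for n in 1 2 3 4 5; do
  printf '%s node(s): quorum %s, tolerates %s failure(s)\n' \
         "$n" "$(( n / 2 + 1 ))" "$(etcd_fault_tolerance "$n")"
done
```

Two nodes tolerate no more failures than one, and four no more than three, which is why odd cluster sizes are the usual choice.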
We recommend referring to the trio and safe config templates to upgrade your cluster to three or more nodes.
In this case, you also need to modify the access points for PostgreSQL and MinIO to use the DNS / L2 VIP / HAProxy HA access points.
all:
  children:
    supabase:
      hosts:
        10.10.10.10: { supa_seq: 1 }
        10.10.10.11: { supa_seq: 2 }
        10.10.10.12: { supa_seq: 3 }
      vars:
        supa_cluster: supa # cluster name
        supa_config:
          postgres_host: 10.10.10.2 # use the PG L2 VIP
          postgres_port: 5433 # use the 5433 port to access the primary instance through pgbouncer
          s3_endpoint: https://sss.pigsty:9002 # If you are using MinIO through the haproxy lb port 9002
          minio_domain_ip: 10.10.10.3 # use the L2 VIP binds to all proxy nodes
The 3-Node HA Supabase Config Template
all:
  #==============================================================#
  # Clusters, Nodes, and Modules
  #==============================================================#
  children:
    # infra cluster for proxy, monitor, alert, etc..
    infra:
      hosts:
        10.10.10.10: { infra_seq: 1 ,nodename: infra-1 }
        10.10.10.11: { infra_seq: 2 ,nodename: infra-2, repo_enabled: false, grafana_enabled: false }
        10.10.10.12: { infra_seq: 3 ,nodename: infra-3, repo_enabled: false, grafana_enabled: false }
      vars:
        vip_enabled: true
        vip_vrid: 128
        vip_address: 10.10.10.3
        vip_interface: eth1
        haproxy_services:
          - name: minio # [REQUIRED] service name, unique
            port: 9002 # [REQUIRED] service port, unique
            balance: leastconn # [OPTIONAL] load balancer algorithm
            options: # [OPTIONAL] minio health check
              - option httpchk
              - option http-keep-alive
              - http-check send meth OPTIONS uri /minio/health/live
              - http-check expect status 200
            servers:
              - { name: minio-1 ,ip: 10.10.10.10 ,port: 9000 ,options: 'check-ssl ca-file /etc/pki/ca.crt check port 9000' }
              - { name: minio-2 ,ip: 10.10.10.11 ,port: 9000 ,options: 'check-ssl ca-file /etc/pki/ca.crt check port 9000' }
              - { name: minio-3 ,ip: 10.10.10.12 ,port: 9000 ,options: 'check-ssl ca-file /etc/pki/ca.crt check port 9000' }
    etcd: # dcs service for postgres/patroni ha consensus
      hosts: # 1 node for testing, 3 or 5 for production
        10.10.10.10: { etcd_seq: 1 } # etcd_seq required
        10.10.10.11: { etcd_seq: 2 } # assign from 1 ~ n
        10.10.10.12: { etcd_seq: 3 } # odd number please
      vars: # cluster level parameter override roles/etcd
        etcd_cluster: etcd # mark etcd cluster name etcd
        etcd_safeguard: false # safeguard against purging
        etcd_clean: true # purge etcd during init process
    # minio cluster 4-node
    minio:
      hosts:
        10.10.10.10: { minio_seq: 1 , nodename: minio-1 }
        10.10.10.11: { minio_seq: 2 , nodename: minio-2 }
        10.10.10.12: { minio_seq: 3 , nodename: minio-3 }
      vars:
        minio_cluster: minio
        minio_data: '/data{1...4}'
        minio_buckets: [ { name: pgsql }, { name: supa } ]
        minio_users:
          - { access_key: dba , secret_key: S3User.DBA, policy: consoleAdmin }
          - { access_key: pgbackrest , secret_key: S3User.Backup, policy: readwrite }
          - { access_key: supabase , secret_key: S3User.Supabase, policy: readwrite }
    # pg-meta, the underlying postgres database for supabase
    pg-meta:
      hosts:
        10.10.10.10: { pg_seq: 1, pg_role: primary }
        10.10.10.11: { pg_seq: 2, pg_role: replica }
        10.10.10.12: { pg_seq: 3, pg_role: replica }
      vars:
        pg_cluster: pg-meta
        pg_users:
          # supabase roles: anon, authenticated, dashboard_user
          - { name: anon ,login: false }
          - { name: authenticated ,login: false }
          - { name: dashboard_user ,login: false ,replication: true ,createdb: true ,createrole: true }
          - { name: service_role ,login: false ,bypassrls: true }
          # supabase users: please use the same password
          - { name: supabase_admin ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: true ,roles: [ dbrole_admin ] ,superuser: true ,replication: true ,createdb: true ,createrole: true ,bypassrls: true }
          - { name: authenticator ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: false ,roles: [ dbrole_admin, authenticated ,anon ,service_role ] }
          - { name: supabase_auth_admin ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: false ,roles: [ dbrole_admin ] ,createrole: true }
          - { name: supabase_storage_admin ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: false ,roles: [ dbrole_admin, authenticated ,anon ,service_role ] ,createrole: true }
          - { name: supabase_functions_admin ,password: 'DBUser.Supa' ,pgbouncer: true ,inherit: false ,roles: [ dbrole_admin ] ,createrole: true }
          - { name: supabase_replication_admin ,password: 'DBUser.Supa' ,replication: true ,roles: [ dbrole_admin ]}
          - { name: supabase_read_only_user ,password: 'DBUser.Supa' ,bypassrls: true ,roles: [ dbrole_readonly, pg_read_all_data ] }
        pg_databases:
          - name: postgres
            baseline: supabase.sql
            owner: supabase_admin
            comment: supabase postgres database
            schemas: [ extensions ,auth ,realtime ,storage ,graphql_public ,supabase_functions ,_analytics ,_realtime ]
            extensions:
              - { name: pgcrypto ,schema: extensions } # 1.3 : cryptographic functions
              - { name: pg_net ,schema: extensions } # 0.9.2 : async HTTP
              - { name: pgjwt ,schema: extensions } # 0.2.0 : json web token API for postgres
              - { name: uuid-ossp ,schema: extensions } # 1.1 : generate universally unique identifiers (UUIDs)
              - { name: pgsodium } # 3.1.9 : pgsodium is a modern cryptography library for Postgres.
              - { name: supabase_vault } # 0.2.8 : Supabase Vault Extension
              - { name: pg_graphql } # 1.5.9 : pg_graphql: GraphQL support
              - { name: pg_jsonschema } # 0.3.3 : pg_jsonschema: Validate json schema
              - { name: wrappers } # 0.4.3 : wrappers: FDW collections
              - { name: http } # 1.6 : http: allows web page retrieval inside the database.
              - { name: pg_cron } # 1.6 : pg_cron: Job scheduler for PostgreSQL
              - { name: timescaledb } # 2.17 : timescaledb: Enables scalable inserts and complex queries for time-series data
              - { name: pg_tle } # 1.2 : pg_tle: Trusted Language Extensions for PostgreSQL
              - { name: vector } # 0.8.0 : pgvector: the vector similarity search
        # supabase required extensions
        pg_libs: 'pg_stat_statements, plpgsql, plpgsql_check, pg_cron, pg_net, timescaledb, auto_explain, pg_tle, plan_filter'
        pg_extensions: # extensions to be installed on this cluster
          - supabase # essential extensions for supabase
          - timescaledb postgis pg_graphql pg_jsonschema wrappers pg_search pg_analytics pg_parquet plv8 duckdb_fdw pg_cron pg_timetable pgqr
          - supautils pg_plan_filter passwordcheck plpgsql_check pgaudit pgsodium pg_vault pgjwt pg_ecdsa pg_session_jwt index_advisor
          - pgvector pgvectorscale pg_summarize pg_tiktoken pg_tle pg_stat_monitor hypopg pg_hint_plan pg_http pg_net pg_smtp_client pg_idkit
pg_parameters:
cron.database_name: postgres
pgsodium.enable_event_trigger: off
pg_hba_rules: # supabase hba rules, require access from docker network
- { user: all ,db: postgres ,addr: intra ,auth: pwd ,title: 'allow supabase access from intranet' }
- { user: all ,db: postgres ,addr: 172.17.0.0/16 ,auth: pwd ,title: 'allow access from local docker network' }
pg_vip_enabled: true
pg_vip_address: 10.10.10.2/24
pg_vip_interface: eth1
node_crontab: [ '00 01 * * * postgres /pg/bin/pg-backup full' ] # make a full backup every 1am
# launch supabase stateless part with docker compose: ./supabase.yml
supabase:
hosts:
10.10.10.10: { supa_seq: 1 } # instance 1
10.10.10.11: { supa_seq: 2 } # instance 2
10.10.10.12: { supa_seq: 3 } # instance 3
vars:
supa_cluster: supa # cluster name
docker_enabled: true # enable docker
# use these to pull docker images via proxy and mirror registries
#docker_registry_mirrors: ['https://docker.xxxxx.io']
#proxy_env: # add [OPTIONAL] proxy env to /etc/docker/daemon.json configuration file
# no_proxy: "localhost,127.0.0.1,10.0.0.0/8,192.168.0.0/16,*.pigsty,*.aliyun.com,mirrors.*,*.myqcloud.com,*.tsinghua.edu.cn"
# #all_proxy: http://user:pass@host:port
# these configuration entries will OVERWRITE or APPEND to /opt/supabase/.env file (src template: app/supabase/.env)
# check https://github.com/Vonng/pigsty/blob/main/app/supabase/.env for default values
supa_config:
# IMPORTANT: CHANGE JWT_SECRET AND REGENERATE CREDENTIALS ACCORDINGLY!
# https://supabase.com/docs/guides/self-hosting/docker#securing-your-services
jwt_secret: your-super-secret-jwt-token-with-at-least-32-characters-long
anon_key: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyAgCiAgICAicm9sZSI6ICJhbm9uIiwKICAgICJpc3MiOiAic3VwYWJhc2UtZGVtbyIsCiAgICAiaWF0IjogMTY0MTc2OTIwMCwKICAgICJleHAiOiAxNzk5NTM1NjAwCn0.dc_X5iR_VP_qT0zsiyj_I_OZ2T9FtRU2BBNWN8Bu4GE
service_role_key: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyAgCiAgICAicm9sZSI6ICJzZXJ2aWNlX3JvbGUiLAogICAgImlzcyI6ICJzdXBhYmFzZS1kZW1vIiwKICAgICJpYXQiOiAxNjQxNzY5MjAwLAogICAgImV4cCI6IDE3OTk1MzU2MDAKfQ.DaYlNEoUrrEn2Ig7tqibS-PHK5vgusbcbo7X36XVt4Q
dashboard_username: supabase
dashboard_password: pigsty
# postgres connection string (use the correct ip and port)
postgres_host: 10.10.10.3 # use the pg_vip_address rather than single node ip
postgres_port: 5433 # access via the 'default' service, which always route to the primary postgres
postgres_db: postgres
postgres_password: DBUser.Supa # password for supabase_admin and multiple supabase users
# expose supabase via domain name
site_url: http://supa.pigsty
api_external_url: http://supa.pigsty
supabase_public_url: http://supa.pigsty
# if using s3/minio as file storage
s3_bucket: supa
s3_endpoint: https://sss.pigsty:9002
s3_access_key: supabase
s3_secret_key: S3User.Supabase
s3_force_path_style: true
s3_protocol: https
s3_region: stub
minio_domain_ip: 10.10.10.3 # sss.pigsty domain name will resolve to this l2 vip that bind to all nodes
# if using SMTP (optional)
#smtp_admin_email: admin@example.com
#smtp_host: supabase-mail
#smtp_port: 2500
#smtp_user: fake_mail_user
#smtp_pass: fake_mail_password
#smtp_sender_name: fake_sender
#enable_anonymous_users: false
#==============================================================#
# Global Parameters
#==============================================================#
vars:
version: v3.1.0 # pigsty version string
admin_ip: 10.10.10.10 # admin node ip address
region: china # upstream mirror region: default|china|europe
node_tune: oltp # node tuning specs: oltp,olap,tiny,crit
pg_conf: oltp.yml # pgsql tuning specs: {oltp,olap,tiny,crit}.yml
infra_portal: # domain names and upstream servers
home : { domain: h.pigsty }
grafana : { domain: g.pigsty ,endpoint: "${admin_ip}:3000" , websocket: true }
prometheus : { domain: p.pigsty ,endpoint: "${admin_ip}:9090" }
alertmanager : { domain: a.pigsty ,endpoint: "${admin_ip}:9093" }
minio : { domain: m.pigsty ,endpoint: "10.10.10.10:9001", https: true, websocket: true }
blackbox : { endpoint: "${admin_ip}:9115" }
loki : { endpoint: "${admin_ip}:3100" }
# expose supa studio UI and API via nginx
supa : { domain: supa.pigsty ,endpoint: "10.10.10.10:8000", websocket: true }
#----------------------------------#
# Credential: CHANGE THESE PASSWORDS
#----------------------------------#
#grafana_admin_username: admin
grafana_admin_password: pigsty
#pg_admin_username: dbuser_dba
pg_admin_password: DBUser.DBA
#pg_monitor_username: dbuser_monitor
pg_monitor_password: DBUser.Monitor
#pg_replication_username: replicator
pg_replication_password: DBUser.Replicator
#patroni_username: postgres
patroni_password: Patroni.API
#haproxy_admin_username: admin
haproxy_admin_password: pigsty
# use minio as supabase file storage, single-node single-drive mode for demonstration purposes
minio_access_key: minioadmin # root access key, `minioadmin` by default
minio_secret_key: minioadmin # root secret key, `minioadmin` by default
minio_buckets: [ { name: pgsql }, { name: supa } ]
minio_users:
- { access_key: dba , secret_key: S3User.DBA, policy: consoleAdmin }
- { access_key: pgbackrest , secret_key: S3User.Backup, policy: readwrite }
- { access_key: supabase , secret_key: S3User.Supabase, policy: readwrite }
minio_endpoint: https://sss.pigsty:9000 # explicit overwrite minio endpoint with haproxy port
node_etc_hosts: ["10.10.10.3 sss.pigsty"] # domain name to access minio from all nodes (required)
# use minio as default backup repo for PostgreSQL
pgbackrest_method: minio # pgbackrest repo method: local,minio,[user-defined...]
pgbackrest_repo: # pgbackrest repo: https://pgbackrest.org/configuration.html#section-repository
local: # default pgbackrest repo with local posix fs
path: /pg/backup # local backup directory, `/pg/backup` by default
retention_full_type: count # retention full backups by count
retention_full: 2 # keep 2, at most 3 full backup when using local fs repo
minio: # optional minio repo for pgbackrest
type: s3 # minio is s3-compatible, so s3 is used
s3_endpoint: sss.pigsty # minio endpoint domain name, `sss.pigsty` by default
s3_region: us-east-1 # minio region, us-east-1 by default, useless for minio
s3_bucket: pgsql # minio bucket name, `pgsql` by default
s3_key: pgbackrest # minio user access key for pgbackrest
s3_key_secret: S3User.Backup # minio user secret key for pgbackrest
s3_uri_style: path # use path style uri for minio rather than host style
path: /pgbackrest # minio backup path, default is `/pgbackrest`
storage_port: 9002 # minio port, 9000 by default
storage_ca_file: /pg/cert/ca.crt # minio ca file path, `/pg/cert/ca.crt` by default
bundle: y # bundle small files into a single file
cipher_type: aes-256-cbc # enable AES encryption for remote backup repo
cipher_pass: pgBackRest # AES encryption password, default is 'pgBackRest'
retention_full_type: time # retention full backup by time on minio repo
retention_full: 14 # keep full backup for last 14 days
# download docker and supabase related extensions
pg_version: 17
repo_modules: node,pgsql,infra,docker
repo_packages: [node-bootstrap, infra-package, infra-addons, node-package1, node-package2, pgsql-utility, docker ]
repo_extra_packages:
- pgsql-main
- supabase # essential extensions for supabase
- timescaledb postgis pg_graphql pg_jsonschema wrappers pg_search pg_analytics pg_parquet plv8 duckdb_fdw pg_cron pg_timetable pgqr
- supautils pg_plan_filter passwordcheck plpgsql_check pgaudit pgsodium pg_vault pgjwt pg_ecdsa pg_session_jwt index_advisor
- pgvector pgvectorscale pg_summarize pg_tiktoken pg_tle pg_stat_monitor hypopg pg_hint_plan pg_http pg_net pg_smtp_client pg_idkit
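The `supa_config` comment above asks you to change `jwt_secret` and regenerate `anon_key` and `service_role_key` to match. As a rough sketch of what that regeneration involves, the following Python signs and verifies HS256 tokens with the standard library only. The claim layout (`role`, `iss`, `iat`, `exp`), the issuer string, the ten-year lifetime, and all helper names are assumptions made for illustration, not something the Pigsty template prescribes. Follow the Supabase guide linked in the config for the authoritative procedure.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: str) -> str:
    """Return a compact JWT (header.payload.signature) signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    digest = hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(digest)}"


def verify_hs256(token: str, secret: str) -> dict:
    """Check the signature and return the decoded payload, or raise ValueError."""
    signing_input, _, signature = token.rpartition(".")
    digest = hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, b64url(digest)):
        raise ValueError("signature mismatch")
    payload_b64 = signing_input.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def make_key(role: str, secret: str, issued_at: int, ttl_seconds: int) -> str:
    """Hypothetical helper: build a key for one role (assumed claim layout)."""
    if len(secret) < 32:
        raise ValueError("jwt_secret should be at least 32 characters long")
    payload = {"role": role, "iss": "supabase", "iat": issued_at, "exp": issued_at + ttl_seconds}
    return sign_hs256(payload, secret)


if __name__ == "__main__":
    secret = "replace-me-with-a-random-secret-of-32+-chars"  # placeholder, not a real secret
    now = int(time.time())
    ten_years = 10 * 365 * 24 * 3600  # assumed lifetime
    for role in ("anon", "service_role"):
        print(role, make_key(role, secret, now, ten_years))
```

The two printed tokens would go into `anon_key` and `service_role_key`, and the same secret into `jwt_secret`; a token signed with a different secret fails `verify_hs256`.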
Database in K8S: Pros & Cons
Whether databases should be housed in Kubernetes/Docker remains highly controversial. While Kubernetes (k8s) excels in managing stateless applications, it has fundamental drawbacks with stateful services, especially databases like PostgreSQL and MySQL.
In the previous article, “Databases in Docker: Good or Bad,” we discussed the pros and cons of containerizing databases. Today, let’s delve into the trade-offs in orchestrating databases in K8S and explore why it’s not a wise decision.
Summary
Kubernetes (k8s) is an exceptional container orchestration tool aimed at helping developers manage a vast array of complex stateless applications. Although it offers StatefulSet, PV, PVC, and local PV to support stateful services (e.g., databases), these features are still insufficient for running production-grade databases, which demand higher reliability.
Databases are more like “pets” than “cattle” and require careful nurturing. Treating databases as “cattle” in K8S essentially turns external disk/file system/storage services into new “database pets.” Running databases on EBS/network storage presents significant disadvantages in reliability and performance. However, using high-performance local NVMe disks will make the database bound to nodes and non-schedulable, negating the primary purpose of putting them in K8S.
Placing databases in K8S results in a “lose-lose” situation: K8S loses the simplicity of statelessness and can no longer relocate, schedule, destroy, and rebuild workloads as freely as it does for purely stateless use. Databases, in turn, sacrifice several crucial attributes (reliability, security, performance, and simplicity) in exchange for limited “elasticity” and utilization, which virtual machines can also provide. For users outside public cloud vendors, the disadvantages far outweigh the benefits.
The “cloud-native frenzy,” exemplified by K8S, has become a distorted phenomenon: adopting k8s for the sake of k8s. Engineers add extra complexity to increase their irreplaceability, while managers fear being left behind by the industry and getting caught up in deployment races. Using tanks for tasks that could be done with bicycles, to gain experience or prove oneself, without considering if the problem needs such “dragon-slaying” techniques - this kind of architectural juggling will eventually lead to adverse outcomes.
Until the reliability and performance of network storage surpass those of local storage, placing databases in K8S is an unwise choice. There are other ways to encapsulate the complexity of database management, such as RDS and open-source RDS alternatives like Pigsty, which run on bare metal or a bare OS. Users should make wise decisions based on their situations and needs, carefully weighing the pros and cons.
The Status Quo
K8S excels at orchestrating stateless application services, but its support for stateful services was initially limited. Although running stateful services was not the intended purpose of K8S and Docker, the community’s zeal for expansion has been unstoppable. Evangelists depict K8S as the next-generation cloud operating system, asserting that databases will inevitably become regular applications within Kubernetes. Various abstractions have emerged to support stateful services: StatefulSet, PV, PVC, and local PV.
Countless cloud-native enthusiasts have attempted to migrate existing databases into K8S, resulting in a proliferation of CRDs and Operators for databases. Taking PostgreSQL as an example, there are already more than ten different K8S deployment solutions available: PGO, StackGres, CloudNativePG, PostgresOperator, PerconaOperator, CYBERTEC-pg-operator, TemboOperator, Kubegres, KubeDB, KubeBlocks, and so on. The CNCF landscape rapidly expands, turning into a playground of complexity.
However, complexity is a cost. With “cost reduction” becoming mainstream, voices of reflection have begun to emerge. Cloud-exit pioneers like DHH, who used K8S heavily in public clouds, abandoned it for its excessive complexity when moving to self-hosted open-source solutions, relying only on Docker and a Ruby tool named Kamal instead. Many have begun to question whether stateful services like databases suit Kubernetes.
K8S itself, in its effort to support stateful applications, has become increasingly complex, straying from its original intention as a container orchestration platform. Tim Hockin, a co-founder of Kubernetes, also voiced his rare concerns at this year’s KubeCon in “K8s is Cannibalizing Itself!”: “Kubernetes has become too complex; it needs to learn restraint, or it will stop innovating and lose its base.”
Lose-Lose Situation
In the cloud-native realm, the analogy of “pets” versus “cattle” is often used for illustrating stateful services. “Pets,” like databases, need careful and individual care, while “cattle” represent disposable, stateless applications (Disposability).
Cloud Native Applications 12 Factors: Disposability
One of the leading architectural goals of K8S is to treat what can be treated as cattle as cattle. The attempt to “separate storage from computation” in databases follows this strategy: splitting stateful database services into state storage outside K8S and pure computation inside K8S. The state is stored on the EBS/cloud disk/distributed storage service, allowing the “stateless” database part to be freely created, destroyed, and scheduled in K8S.
Unfortunately, databases, especially OLTP databases, depend heavily on disk hardware, and the reliability and performance of network storage still lag behind local disks by orders of magnitude. Thus, K8S offers the local PV option, which lets containers use data volumes that reside directly on the host operating system, backed by high-performance, high-reliability local NVMe disks.
However, this presents a dilemma: should one use subpar cloud disks and tolerate poor database reliability/performance for K8S’s scheduling and orchestration capabilities? Or use high-performance local disks tied to host nodes, virtually losing all flexible scheduling abilities? The former is like stuffing an anchor into K8S’s small boat, slowing overall speed and agility; the latter is like anchoring and pinning the ship to a specific point.
Running a stateless K8S cluster is simple and reliable, as is running a stateful database on a physical machine’s bare operating system. Mixing the two, however, results in a lose-lose situation: K8S loses its stateless flexibility and casual scheduling abilities, while the database sacrifices core attributes like reliability, security, efficiency, and simplicity in exchange for elasticity, resource utilization, and Day1 delivery speed that are not fundamentally important to databases.
A vivid example of the former is the performance optimization of PostgreSQL@K8S, which KubeBlocks contributed. K8S experts employed various advanced methods to solve performance issues that did not exist on bare metal/bare OS at all. A fresh case of the latter is Didi’s K8S architecture juggling disaster; if it weren’t for putting the stateful MySQL in K8S, would rebuilding a stateless K8S cluster and redeploying applications take 12 hours to recover?
Pros and Cons
For serious technology decisions, the most crucial aspect is weighing the pros and cons. Here, in the order of “quality, security, performance, cost,” let’s discuss the technical trade-offs of placing databases in K8S versus classic bare metal/VM deployments. I don’t want to write a comprehensive paper that covers everything. Instead, I’ll raise some specific questions for consideration and discussion.
Quality
K8S, compared to physical deployments, introduces additional failure points and architectural complexity, increasing the blast radius and significantly prolonging the average recovery time of failures. In “Is it a Good Idea to Put Databases into Docker?”, we provided an argument about reliability, which can also apply to Kubernetes — K8S and Docker introduce additional and unnecessary dependencies and failure points to databases, lacking community failure knowledge accumulation and reliability track record (MTTR/MTBF).
In the cloud vendor classification system, K8S belongs to PaaS, while RDS belongs to a more fundamental layer, IaaS. Database services have higher reliability requirements than K8S; for instance, many companies’ cloud management platforms rely on an additional CMDB database. Where should this database be placed? You shouldn’t let K8S manage things it depends on, nor should you add unnecessary extra dependencies. The Alibaba Cloud global epic failure and Didi’s K8S architecture juggling disaster have taught us this lesson. Moreover, maintaining a separate database system inside K8S when there’s already one outside is even more unjustifiable.
Security
The database in a multi-tenant environment introduces additional attack surfaces, bringing higher risks and more complex audit compliance challenges. Does K8S make your database more secure? Maybe the complexity of K8S architecture juggling will deter script kiddies unfamiliar with K8S, but for real attackers, more components and dependencies often mean a broader attack surface.
In “BrokenSesame Alibaba Cloud PostgreSQL Vulnerability Technical Details”, security personnel escaped to the K8S host node using their own PostgreSQL container and accessed the K8S API and other tenants’ containers and data. This is clearly a K8S-specific issue — the risk is real, such attacks have occurred, and even Alibaba Cloud, a local cloud industry leader, has been compromised.
“The Attacker Perspective - Insights From Hacking Alibaba Cloud”
Performance
As stated in “Is it a Good Idea to Put Databases into Docker?”, whether it’s additional network overhead, Ingress bottlenecks, or underperforming cloud disks, all negatively impact database performance. For example, as revealed in “PostgreSQL@K8s Performance Optimization” — you need a considerable level of technical prowess to make database performance in K8S barely match that on bare metal.
Latency is measured in ms, not µs; I almost thought my eyes were deceiving me.
Another misconception about efficiency is resource utilization. Unlike offline analytical businesses, critical online OLTP databases should not aim to increase resource utilization but rather deliberately lower it to enhance system reliability and user experience. If there are many fragmented businesses, resource utilization can be improved through PDB/shared database clusters. K8S’s advocated elasticity efficiency is not unique to it — KVM/EC2 can also effectively address this issue.
Cost
In terms of cost, K8S and various Operators provide a decent abstraction that encapsulates some of the complexity of database management, which is attractive for teams without DBAs. However, the complexity they remove from database management pales in comparison to the complexity introduced by K8S itself. For instance, random IP address drifts and automatic Pod restarts may not be a big issue for stateless applications, but for databases they are intolerable. Many companies have had to modify kubelet to avoid this behavior, thereby introducing even more complexity and maintenance costs.
As stated in the “Reducing Complexity Costs” section of “From Reducing Costs and Smiles to Reducing Costs and Efficiency”: intellectual power is hard to accumulate spatially. When a database encounters problems, it needs database experts to solve them; when Kubernetes has problems, it needs K8S experts to look into them. When you put a database into Kubernetes, however, the complexities combine and the state space explodes, yet the intellectual bandwidth of individual database experts and K8S experts is hard to stack. You need a dual expert to solve the problem, and such experts are undoubtedly much rarer and more expensive than pure database experts. Such architectural juggling is enough to cause major setbacks for most teams, including top public clouds and big companies, in the event of a failure.
The Cloud-Native Frenzy
An interesting question arises: if K8S is unsuitable for stateful databases, why are so many companies, including big players, rushing to do this? The reasons are not technical.
Google open-sourced its K8S battleship, modeled after its internal Borg spaceship, and managers, fearing being left behind, rushed to adopt it, thinking that using K8S would put them on par with Google. Ironically, Google itself doesn’t use K8S; open-sourcing it was more likely meant to disrupt AWS and mislead the industry. Moreover, most companies don’t have Google’s manpower to operate such a battleship. More importantly, their problems may only need a simple vessel. Running MySQL + PHP or PostgreSQL + Go/Python on bare metal has already taken many companies to IPO.
Under modern hardware conditions, the complexity of most applications throughout their lifecycle doesn’t justify using K8S. Yet, the “cloud-native” frenzy, epitomized by K8S, has become a distorted phenomenon: adopting k8s just for the sake of k8s. Some engineers are looking for “advanced” and “cool” technologies used by big companies to fulfill their personal goals like job hopping or promotions or to increase their job security by adding complexity, not considering if these “dragon-slaying” techniques are necessary for solving their problems.
The cloud-native landscape is filled with fancy projects. Every new development team wants to introduce something new: Helm today, Kubevela tomorrow. They talk big about bright futures and peak efficiency, but in reality, they create a mountain of architectural complexities and a playground for “YAML Boys” - tinkering with the latest tech, inventing concepts, earning experience and reputation at the expense of users who bear the complexity and maintenance costs.
CNCF Landscape
The cloud-native movement’s philosophy is compelling: democratizing the elastic scheduling capabilities of public clouds for every user. K8S indeed excels at stateless applications. However, excessive enthusiasm has led K8S astray from its original intent, which was simply to orchestrate stateless applications well, and it is now burdened by ill-conceived support for stateful applications.
Making Wise Decisions
Years ago, when I first encountered K8S, I too was fervent. It was at TanTan: we had over twenty thousand cores and hundreds of database clusters, and I was eager to try putting databases in Kubernetes and to test all the available Operators. However, after two to three years of extensive research and architectural design, I calmed down and abandoned this madness. Instead, I architected our database service on bare metal and bare operating systems. For us, the benefits K8S brought to databases were negligible compared to the problems and hassles it introduced.
Should databases be put into K8S? It depends. For public cloud vendors who thrive on overselling resources, elasticity and utilization are crucial because they are directly linked to revenue and profit, while reliability and performance take a back seat. After all, an availability below three nines only means compensating a 25% monthly credit. But for most users, including ourselves, the trade-offs are different: one-time Day1 setup, elasticity, and resource utilization aren’t the primary concerns; reliability, performance, and Day2 operation costs are the core database attributes that matter most.
We open-sourced our database service architecture — an out-of-the-box PostgreSQL distribution and a local-first RDS alternative: Pigsty. We didn’t choose the so-called “build once, run anywhere” approach of K8S and Docker. Instead, we adapted to different OS distros & major versions, and used Ansible to achieve a K8S CRD IaC-like API to seal management complexity. This was arduous, but it was the right thing to do - the world does not need another clumsy attempt at putting PostgreSQL into K8S. Still, it does need a production database service architecture that maximizes hardware performance and reliability.
Pigsty vs StackGres
Perhaps one day, when the reliability and performance of distributed network storage surpass local storage and mainstream databases have some native support for storage-computation separation, things might change again — K8S might become suitable for databases. But for now, I believe putting serious production OLTP databases into K8S is immature and inappropriate. I hope readers will make wise choices on this matter.
Reference
Database in Docker: Is that a good idea?
“What can we learn from DiDi’s Epic k8s Failure”
NewSQL: Distributive Nonsense
As hardware technology advances, the capacity and performance of standalone databases have reached unprecedented heights. In this transformative era, distributed (TP) databases appear utterly powerless, much like the “data middle platform,” donning the emperor’s new clothes in a state of self-deception.
- TL; DR
- The Pull of the Internet
- The Trade-Offs of Distributive
- The Impact of New Hardware
- The Predicament of False Needs
- The Struggles in Confusion
- References
TL; DR
The core trade-off of distributed databases is: “quality for quantity,” sacrificing functionality, performance, complexity, and reliability for greater data capacity and throughput. However, “what divides must eventually converge,” and hardware innovations have propelled centralized databases to new heights in capacity and throughput, rendering distributed (TP) databases obsolete.
Hardware, exemplified by NVMe SSDs, follows Moore’s Law, evolving at an exponential pace. Over a decade, performance has increased by tens of times, and prices have dropped significantly, improving the cost-performance ratio by three orders of magnitude. A single card can now hold 32TB+, with 4K random read/write IOPS reaching 1600K/600K, latency at 70µs/10µs, and a cost of less than 200 ¥/TB·year. Running a centralized database on a single machine can achieve one to two million point write/point query QPS.
Scenarios truly requiring distributed databases are few and far between, with typical mid-sized internet companies/banks handling request volumes ranging from tens to hundreds of thousands of QPS, and non-repetitive TP data at the hundred TB level. In the real world, over 99% of scenarios do not need distributed databases, and the remaining 1% can likely be addressed through classic engineering solutions like horizontal/vertical partitioning.
Top-tier internet companies might have a few genuine use cases, yet these companies have no intention to pay. The market simply cannot sustain so many distributed database cores, and the few products that do survive don’t necessarily rely on distribution as their selling point. HTAP and the integration of distributed and standalone databases represent the struggles of confused distributed TP database vendors seeking transformation, but they are still far from achieving product-market fit.
The Pull of the Internet
“Distributed database” is not a term with a strict definition. In a narrow sense, it highly overlaps with NewSQL databases such as CockroachDB, YugabyteDB, TiDB, OceanBase, and TDSQL; broadly speaking, classic databases like Oracle, PostgreSQL, MySQL, SQL Server, PolarDB, and Aurora, which span multiple physical nodes and use master-slave replication or shared storage, can also be considered distributed databases. In the context of this article, a distributed database refers to the former, specifically focusing on transactional processing (OLTP) distributed relational databases.
The rise of distributed databases stemmed from the rapid development of internet applications and the explosive growth of data volumes. In that era, traditional relational databases often encountered performance bottlenecks and scalability issues when dealing with massive data and high concurrency. Even using Oracle with Exadata struggled in the face of voluminous CRUD operations, not to mention the prohibitively expensive annual hardware and software costs.
Internet companies embarked on a different path, building their infrastructure with free, open-source databases like MySQL. Veteran developers/DBAs might still recall the MySQL best practice: keep single-table records below 21 million to avoid rapid performance degradation. Correspondingly, database sharding became a widely recognized practice among large companies.
The basic idea here was “three cobblers with their wits combined equal Zhuge Liang,” using a bunch of inexpensive x86 servers + numerous sharded open-source database instances to create a massive CRUD simple data store. Thus, distributed databases often originated from internet company scenarios, evolving along the manual sharding → sharding middleware → distributed database path.
As an industry solution, distributed databases have successfully met the needs of internet companies. However, before abstracting and solidifying it into a product for external output, several questions need to be clarified:
Do the trade-offs from ten years ago still hold up today?
Are the scenarios of internet companies applicable to other industries?
Could distributed OLTP databases be a false need?
The Trade-Offs of Distributive
“Distributed,” along with buzzwords like “HTAP,” “compute-storage separation,” “Serverless,” and “lakehouse,” holds no inherent meaning for enterprise users. Practical clients focus on tangible attributes and capabilities: functionality, performance, security, reliability, return on investment, and cost-effectiveness. What truly matters is the trade-off: compared to classic centralized databases, what do distributed databases sacrifice, and what do they gain in return?
The database hierarchy-of-needs pyramid [1]
The core trade-off of distributed databases can be summarized as “quality for quantity”: sacrificing functionality, performance, complexity, and reliability to gain greater data capacity and request throughput.
NewSQL often markets itself on the concept of “distribution,” solving scalability issues through “distribution.” Architecturally, it typically features multiple peer data nodes and a coordinator, employing distributed consensus protocols like Paxos/Raft for replication, allowing for horizontal scaling by adding data nodes.
Firstly, due to their inherent limitations, distributed databases sacrifice many features, offering only basic and limited CRUD query support. Secondly, because distributed databases require multiple network RPCs to complete requests, their performance typically suffers a 70% or more degradation compared to centralized databases. Furthermore, distributed databases, consisting of DN/CN and TSO components among others, introduce significant complexity in operations and management. Lastly, in terms of high availability and disaster recovery, distributed databases do not offer a qualitative improvement over the classic centralized master-slave setup; instead, they introduce numerous additional failure points due to their complex components.
Sysbench throughput comparison[2]
In the past, the trade-offs of distributed databases were justified: the internet required larger data storage capacities and higher access throughputs—a must-solve problem, and these drawbacks were surmountable. But today, hardware advancements have rendered the “quantity” question obsolete, thus erasing the raison d’être of distributed databases along with the very problem they sought to solve.
Times have changed, My lord!
The Impact of New Hardware
Moore’s Law posits that every 18 to 24 months, processor performance doubles while costs halve. This principle largely applies to storage as well. From 2013 to 2023, spanning 5 to 6 cycles, we should see performance and cost differences of dozens of times compared to a decade ago. Is this the case?
Let’s examine the performance metrics of a typical SSD from 2013 and compare them with those of a typical PCI-e Gen4 NVMe SSD from 2022. It’s evident that the SSD’s 4K random read/write IOPS have jumped from 60K/40K to 1600K/600K, with prices plummeting from 2220$/TB to 40$/TB. Performance has improved by 15 to 26 times, while prices have dropped 56-fold[3,4,5], certainly validating the rule of thumb at a magnitude level.
HDD/SSD Performance in 2013
NVMe Gen4 SSD in 2022
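The improvement factors quoted above can be rechecked with a few lines of shell. This is only a sketch of the arithmetic: the input figures are the ones given in the text, and `ratio` is a hypothetical helper name.

```shell
# Improvement factor between two measurements, rounded to one decimal place.
# "ratio" is a hypothetical helper; the inputs are the figures quoted in the text.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 1600 60    # 4K random read IOPS (thousands), 2022 vs 2013
ratio 600 40     # 4K random write IOPS (thousands), 2022 vs 2013
ratio 2220 40    # price in $/TB, 2013 vs 2022
```

The three calls give roughly 26.7, 15.0, and 55.5, in line with the “15 to 26 times” and “56-fold” figures.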
A decade ago, mechanical hard drives dominated the market. A 1TB hard drive cost about seven or eight hundred yuan, and a 64GB SSD was even more expensive. Ten years later, a mainstream 3.2TB enterprise-grade NVMe SSD costs just three thousand yuan. Considering a five-year warranty, the monthly cost per TB is only 16 yuan, with an annual cost under 200 yuan. For reference, cloud providers’ reputedly cost-effective S3 object storage costs 1800¥/TB·year.
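The amortized cost can be worked out the same way. The sketch below assumes the figures from the paragraph above (a 3.2TB drive at 3000 yuan, written off over the five-year warranty); `amortized_cost` is a hypothetical helper name.

```shell
# Amortized cost per TB for one period: price / capacity / number of periods.
# "amortized_cost" is a hypothetical helper; inputs come from the text above.
amortized_cost() {
  awk -v price="$1" -v tb="$2" -v periods="$3" \
    'BEGIN { printf "%.1f\n", price / tb / periods }'
}

amortized_cost 3000 3.2 60   # yuan per TB per month over 5 years (60 months)
amortized_cost 3000 3.2 5    # yuan per TB per year over 5 years
```

This yields about 15.6 yuan per TB per month and 187.5 yuan per TB per year, roughly a tenth of the 1800¥/TB·year object storage price cited above.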
Price per unit of SSD/HDD from 2013 to 2030 with predictions
The typical fourth-generation local NVMe disk can reach a maximum capacity of 32TB to 64TB, offering 70µs/10µs 4K random read/write latencies, and 1600K/600K read/write IOPS, with the fifth generation boasting an astonishing bandwidth of several GB/s per card.
Equipping a classic Dell 64C / 512G server with such a card, factoring in five years of IDC depreciation, the total cost is under one hundred thousand yuan. Such a server running PostgreSQL sysbench can nearly reach one million QPS for single-point writes and two million QPS for point queries without issue.
What does this mean? For a typical mid-sized internet company/bank, the demand for database requests is usually in the tens of thousands to hundreds of thousands of QPS, with non-repeated TP data volumes fluctuating around hundreds of TBs. Considering hardware storage compression cards can achieve several times compression ratio, such scenarios might now be manageable by a centralized database on a single machine and card under modern hardware conditions[6].
Previously, users might have had to invest millions in high-end storage solutions like Exadata, then spend a fortune on Oracle commercial database licenses and vendor services. Now, achieving similar outcomes starts with just a few thousand yuan on an enterprise-grade SSD card; open-source Oracle alternatives like PostgreSQL, capable of smoothly running single tables up to the 32TB limit, no longer suffer from the limitations that once forced MySQL into partitioning. High-performance database services, once luxury items restricted to intelligence/banking sectors, have become affordable for all industries[7].
Cost-effectiveness is the primary product strength. The cost-effectiveness of high-performance, large-capacity storage has improved by three orders of magnitude over a decade, making the once-highlighted value of distributed databases appear weak in the face of such remarkable hardware evolution.
The Predicament of False Needs
Nowadays, sacrificing functionality, performance, and simplicity for scalability serves a false need in most scenarios.
With the support of modern hardware, over 99% of real-world scenarios do not exceed the capabilities of a centralized, single-machine database. The remaining scenarios can likely be addressed through classical engineering methods like horizontal or vertical splitting. This holds true even for internet companies: even among the global top firms, scenarios where a transactional (TP) single table exceeds several tens of TBs are still rare.
Google Spanner, the forefather of NewSQL, was designed to solve the problem of massive data scalability, but how many enterprises actually handle data volumes comparable to Google’s? In terms of data volume, the lifetime TP data volume for the vast majority of enterprises will not exceed the bottleneck of a centralized database, which continues to grow exponentially with Moore’s Law. Regarding request throughput, many enterprises have enough database performance headroom to implement all their business logic in stored procedures and run it smoothly within the database.
“Premature optimization is the root of all evil,” designing for unneeded scale is a waste of effort. If volume is no longer an issue, then sacrificing other attributes for unneeded volume becomes meaningless.
In many subfields of databases, distributed technology is not a pseudo-requirement: if you need a highly reliable, disaster-resilient, simple, low-frequency KV storage for metadata, then a distributed etcd is a suitable choice; if you require a globally distributed table for arbitrary reads and writes across different locations and are willing to endure significant performance degradation, then YugabyteDB might be a good choice. For ensuring transparency and preventing tampering and repudiation, blockchain is fundamentally a leaderless distributed ledger database.
For large-scale data analytics (OLAP), distributed technology is indispensable (though this is usually referred to as data warehousing, MPP); however, in the transaction processing (OLTP) domain, distributed technology is largely unnecessary: OLTP databases are like working memory, characterized by being small, fast, and feature-rich. Even in very large business systems, the active working set at any one moment is not particularly large. A basic rule of thumb for OLTP system design is: If your problem can be solved within a single machine, don’t bother with distributed databases.
OLTP databases have a history spanning several decades, with existing cores developing to a mature stage. Standards in the TP domain are gradually converging towards three Wire Protocols: PostgreSQL, MySQL, and Oracle. If the discussion is about tinkering with database auto-sharding and adding global transactions as a form of “distribution,” it’s definitely a dead end. If a “distributed” database manages to break through, it’s likely not because of the “pseudo-requirement” of “distribution,” but rather due to new features, open-source ecosystems, compatibility, ease of use, domestic innovation, and self-reliance.
The Struggles in Confusion
The greatest challenge for distributed databases stems from the market structure: Internet companies, the most likely candidates to utilize distributed TP databases, are paradoxically the least likely to pay for them. Internet companies can serve as high-quality users or even contributors, offering case studies, feedback, and PR, but they inherently resist the notion of paying for software, which runs against their deep-rooted instincts. Even leading distributed database vendors face the challenge of being applauded but not financially supported.
In a recent casual conversation, an engineer at a distributed database company revealed that during a POC with a client, a query that Oracle completed in 10 seconds took their distributed database an order of magnitude longer, even after throwing in extra resources and dirty hacks. Even openGauss, which forked from PostgreSQL 9.2 a decade ago, can outperform many distributed databases in certain scenarios, not to mention the advancements seen in PostgreSQL 15 and Oracle 23c ten years later. This gap is so significant that even the original manufacturers are left puzzled about the future direction of distributed databases.
Thus, some distributed databases have started pivoting towards self-rescue, with HTAP being a prime example: while transaction processing in a distributed setting is suboptimal, analytics can benefit greatly. So, why not combine the two? A single system capable of handling both transactions and analytics! However, engineers in the real world understand that AP systems and TP systems each have their own patterns, and forcibly merging two diametrically opposed systems will only result in both tasks failing to succeed. Whether it’s classic ETL/CDC pushing and pulling to specialized solutions like ClickHouse/Greenplum/Doris, or logical replication to a dedicated in-memory columnar store, any of these approaches is more reliable than using a chimera HTAP database.
Another idea is monolithic-distributed integration: if you can’t beat them, join them by adding a monolithic mode to avoid the high costs of network RPCs, ensuring that in 99% of scenarios where distributed capabilities are unnecessary, they aren’t completely outperformed by centralized databases — even if distributed isn’t needed, it’s essential to stay in the game and prevent others from taking the lead! But the fundamental issue here is the same as with HTAP: forcing heterogeneous data systems together is pointless. If there was value in doing so, why hasn’t anyone created a monolithic binary that integrates all heterogeneous databases into a do-it-all behemoth — the Database Jack-of-all-trades? Because it violates the KISS principle: Keep It Simple, Stupid!
The plight of distributed databases is similar to that of Middle Data Platforms: originating from internal scenarios at major internet companies and solving domain-specific problems. Once riding the wave of the internet industry, the discussion of databases was dominated by distributed technologies, enjoying a moment of pride. However, due to excessive hype and promises of unrealistic capabilities, they failed to meet user expectations, ending in disappointment and becoming akin to the emperor’s new clothes.
There are still many areas within the TP database field worthy of focus: Leveraging new hardware, actively embracing changes in underlying architectures like CXL, RDMA, NVMe; or providing simple and intuitive declarative interfaces to make database usage and management more convenient; offering more intelligent automatic monitoring and control systems to minimize operational tasks; developing compatibility plugins like Babelfish for MySQL/Oracle, aiming for a unified relational database WireProtocol. Even investing in better support services would be more meaningful than chasing the false need for “distributed” features.
Time changes, and a wise man adapts. It is hoped that distributed database vendors will find their Product-Market Fit and focus on what users truly need.
References
[1] The Database Hierarchy of Needs Pyramid: https://mp.weixin.qq.com/s/1xR92Z67kvvj2_NpUMie1Q
[2] How Powerful Is PostgreSQL, Really?: https://mp.weixin.qq.com/s/651zXDKGwFy8i0Owrmm-Xg
[3] SSD Performance in 2013: https://www.snia.org/sites/default/files/SNIASSSI.SSDPerformance-APrimer2013.pdf
[4] 2022 Micron NVMe SSD Spec: https://media-www.micron.com/-/media/client/global/documents/products/product-flyer/9400_nvme_ssd_product_brief.pdf
[5] 2013-2030 SSD Pricing: https://blocksandfiles.com/2021/01/25/wikibon-ssds-vs-hard-drives-wrights-law/
[6] Single Instance with 100TB: https://mp.weixin.qq.com/s/JSQPzep09rDYbM-x5ptsZA
[7] EBS: Scam: https://mp.weixin.qq.com/s/UxjiUBTpb1pRUfGtR9V3ag
[8] Middle Platform: Utter Self-Deception: https://mp.weixin.qq.com/s/VgTU7NcOwmrX-nbrBBeH_w
Is running postgres in docker a good idea?
For stateless app services, containers are an almost perfect devops solution. However, for stateful services like databases, it’s not so straightforward. Whether production databases should be containerized remains controversial.
From a developer’s perspective, I’m a big fan of Docker & Kubernetes and believe that they might be the future standard for software deployment and operations. But as a database administrator, I think hosting production databases in Docker/K8S is still a bad idea.
What problems does Docker solve?
Docker is described with terms like lightweight, standardized, portable, cost-effective, efficient, automated, integrated, and high-performance in operations. These claims are valid, as Docker indeed simplifies both development and operations. This explains why many companies are eager to containerize their software and services. However, this enthusiasm sometimes goes to the extreme of containerizing everything, including production databases.
Containers were originally designed for stateless apps, where temporary data produced by the app is logically part of the container. A service is created with a container and destroyed after use. These apps are stateless, with the state typically stored outside in a database, reflecting the classic architecture and philosophy of containerization.
But when it comes to containerizing the production database itself, the scenario changes: databases are stateful. To keep their state when the container stops, database containers need to “punch a hole” through to the underlying OS, known as a data volume.
Such containers are no longer ephemeral entities that can be freely created, destroyed, moved, or transferred; they become bound to the underlying environment. Thus, the many advantages of using containers for traditional apps are not applicable to database containers.
Reliability
Getting software up & running is one thing; ensuring its reliability is another. Databases, central to information systems, are often critical, with failure leading to catastrophic consequences. This reflects common experience: while office software crashes can be tolerated and resolved with restarts, document loss or corruption is unresolvable and disastrous. A database failure without replicas and backups can be terminal, particularly for internet/finance companies.
Reliability is the paramount attribute for databases. It is the system’s ability to function correctly in the face of adversity (hardware and software faults, human error), i.e. fault tolerance and resilience. Unlike liveness attributes such as performance, reliability is a safety attribute: it can only be proven over time or falsified by failures, and it is often overlooked until disaster strikes.
Docker’s description notably omits “reliability”, the crucial attribute for a database.
Reliability Proof
As mentioned, reliability lacks a definitive measure. Confidence in a system’s reliability builds over time through consistent, correct operation (MTTF). Deploying databases on bare metal has been a long-standing practice, proven reliable over decades. Docker, despite revolutionizing DevOps, has a mere ten-year track record, which is insufficient for establishing reliability, especially for mission-critical production databases. In essence, there haven’t been enough “guinea pigs” to clear the minefield.
Community Knowledge
Improving reliability hinges on learning from failures. Failures are invaluable, turning unknowns into knowns and forming the bedrock of operational knowledge. Community experience with failures is predominantly based on bare-metal deployments, with a plethora of issues well-trodden over decades. Encountering a problem often means finding a well-documented solution, thanks to previous experiences. However, add “Docker” to the mix, and the pool of useful information shrinks significantly. This implies a lower success rate in data recovery and longer times to resolve complex issues when they arise.
A subtle reality is that, without compelling reasons, businesses and individuals are generally reluctant to share experiences with failures. Failures can tarnish a company’s reputation, potentially exposing sensitive data or reflecting poorly on the organization and team. Moreover, insights from failures are often the result of costly lessons and financial losses, representing core value for operations personnel, thus public documentation on failures is scarce.
Extra Failure Point
Running databases in Docker doesn’t reduce the chances of hardware failures, software bugs, or human errors. Hardware issues persist with or without Docker. Software defects, mainly application bugs, aren’t lessened by containerization, and the same goes for human errors. In fact, Docker introduces extra components, complexity, and failure points, decreasing overall system reliability.
Consider this simple scenario: if the Docker daemon crashes, the database process dies. Such incidents, albeit rare, are non-existent on bare-metal.
Moreover, the failure points from an additional component like Docker aren’t limited to Docker itself. Issues could arise from interactions between Docker and the database, the OS, orchestration systems, VMs, networks, or disks. For evidence, see the issue tracker for the official PostgreSQL Docker image: https://github.com/docker-library/postgres/issues?q=.
Intellectual power doesn’t easily stack: a team’s intellect relies on its few seasoned members and is limited by their communication overhead. Database issues require database experts; container issues, container experts. However, when databases are deployed on Kubernetes and Docker, merging the expertise of database and K8S specialists is challenging. You need a dual expert to resolve issues, and such individuals are rarer than specialists in one domain.
Moreover, one man’s meat is another man’s poison. Certain Docker features might turn into bugs under specific conditions.
Unnecessary Isolation
Docker provides process-level isolation, which generally benefits applications by reducing interaction-related issues, thereby enhancing system reliability. However, this isolation isn’t always advantageous for databases.
A subtle real-world case involved starting two PostgreSQL servers on the same data directory, either both on the host or one on the host and another inside a container. On bare metal, the second instance would fail to start because PostgreSQL recognizes the existing instance and refuses to launch; however, Docker’s isolation allows the second instance to start obliviously, potentially corrupting the data files if proper fencing mechanisms (like host port or PID file exclusivity) aren’t in place.
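To see why isolation can defeat this kind of fencing, consider a deliberately simplified liveness check based on the PID recorded in the first line of `postmaster.pid`. This is a sketch, not PostgreSQL’s actual startup logic (which performs further checks); `instance_seems_alive` is a hypothetical helper name.

```shell
# Returns 0 if the PID on the first line of <datadir>/postmaster.pid is
# visible and signalable from the caller's PID namespace, 1 otherwise.
# Simplified sketch: the real server does more than this, and kill -0
# also fails for processes owned by another user.
instance_seems_alive() {
  pidfile="$1/postmaster.pid"
  [ -f "$pidfile" ] || return 1
  pid=$(head -n 1 "$pidfile")
  kill -0 "$pid" 2>/dev/null
}

# Demo with a throwaway directory standing in for a data directory.
demo_dir=$(mktemp -d)
echo "$$" > "$demo_dir/postmaster.pid"     # pretend this shell is the server
if instance_seems_alive "$demo_dir"; then
  echo "refusing to start: data directory appears to be in use"
fi
rm -rf "$demo_dir"
```

A process in a separate PID namespace cannot see the host’s PIDs, so a check of this shape run inside a container would conclude that the recorded process is gone, even though it is still running on the host.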
Do databases need isolation? Absolutely, but not this kind. Databases often demand dedicated physical machines for performance reasons, with only the database process and essential tools running. Even in containers, they’re typically bound exclusively to physical/virtual machines. Thus, the type of isolation Docker provides is somewhat irrelevant for such deployments, though it is a handy feature for cloud providers to efficiently oversell in a multi-tenant environment.
Maintainability
Docker simplifies day-one setup but brings far more trouble to day-two operations.
The bulk of software expenses isn’t in initial development but in ongoing maintenance, which includes fixing vulnerabilities, ensuring operational continuity, handling outages, upgrading versions, repaying technical debt, and adding new features. Maintainability is crucial for the quality of life in operations work. Docker shines in this respect with its infrastructure-as-code approach: operational knowledge is turned into reusable code and accumulated in one place rather than scattered across installation and setup documents. This works especially well for stateless applications with frequently changing logic. Docker and Kubernetes facilitate deployment, scaling, publishing, and rolling upgrades, allowing Devs to perform Ops tasks, and Ops to handle DBA duties (somewhat convincingly).
Day 1 Setup
Perhaps Docker’s greatest strength is the standardization of environment configuration. A standardized environment aids in delivering changes, discussing issues, and reproducing bugs. Using binary images (essentially materialized Dockerfile installation scripts) is quicker and easier to manage than running installation scripts. Not having to rebuild complex, dependency-heavy extensions each time is a notable advantage.
Unfortunately, databases don’t behave like typical business applications with frequent updates, and creating new instances or delivering environments is a rare operation. Additionally, DBAs often accumulate various installation and configuration scripts, making environment setup almost as fast as using Docker. Thus, Docker’s advantage in environment configuration isn’t as pronounced, falling into the “nice to have” category. Of course, in the absence of a dedicated DBA, using Docker images might still be preferable as they encapsulate some operational experience.
Typically, it’s not unusual for databases to run continuously for months or years after initialization. The primary aspect of database management isn’t creating new instances or delivering environments, but the day-to-day operations — Day2 Operation. Unfortunately, Docker doesn’t offer much benefit in this area and can introduce additional complications.
Day2 Operation
Docker can significantly streamline the maintenance of stateless apps, enabling easy create/destroy, version upgrades, and scaling. However, does this extend to databases?
Unlike app containers, database containers can’t be freely destroyed or created. Docker doesn’t enhance the operational experience for databases; tools like Ansible are more beneficial. Often, operations require executing scripts inside containers via docker exec, adding unnecessary complexity.
CLI tools often struggle with Docker integration. For instance, docker exec mixes stderr and stdout, breaking pipeline-dependent commands. In bare-metal deployments, certain ETL tasks for PostgreSQL can be easily done with a single Bash line.
psql <src-url> -c 'COPY tbl TO STDOUT' | psql <dst-url> -c 'COPY tbl FROM STDIN'
Yet, without proper client binaries on the host, one must awkwardly use Docker’s binaries like:
docker exec -it srcpg gosu postgres bash -c "psql -c \"COPY tbl TO STDOUT\" 2>/dev/null" |\
docker exec -i dstpg gosu postgres psql -c 'COPY tbl FROM STDIN;'
Even simple commands such as a physical backup end up requiring layers of command wrapping:
docker exec -i postgres_pg_1 gosu postgres bash -c 'pg_basebackup -Xf -Ft -c fast -D - 2>/dev/null' | tar -xC /tmp/backup/basebackup
That is four layers of wrapping: docker, gosu, bash, and pg_basebackup.
Client-side applications (psql, pg_basebackup, pg_dump) can bypass these issues with version-matched client tools on the host, but server-side solutions lack such workarounds. Upgrading containerized database software shouldn’t necessitate host server binary upgrades.
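The stream-mixing problem mentioned above can be reproduced without Docker at all. In this sketch, `emit_rows` is a hypothetical stand-in for a command run inside a container: it writes data rows to stdout and a diagnostic message to stderr.

```shell
# Hypothetical stand-in for a containerized command:
# data rows go to stdout, a diagnostic message goes to stderr.
emit_rows() {
  echo "1,alice"
  echo "NOTICE: checkpoint complete" >&2
  echo "2,bob"
}

# Streams kept separate: the downstream consumer sees only the 2 data rows.
emit_rows 2>/dev/null | grep -c .

# Streams merged into one (as happens when both share a single pseudo-TTY):
# the diagnostic line is now indistinguishable from data, so 3 lines arrive.
emit_rows 2>&1 | grep -c .
```

Once the two streams are merged, a consumer such as COPY ... FROM STDIN has no way to tell diagnostics from rows, which is why the wrapped commands above redirect stderr to /dev/null inside the container.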
Docker advocates for easy software versioning; updating a minor database version is straightforward by tweaking the Dockerfile and restarting the container. However, major version upgrades requiring state modification are more complex in Docker, often leading to convoluted processes like those in https://github.com/tianon/docker-postgres-upgrade.
If database containers can’t be scheduled, scaled, or maintained as easily as AppServers, why use them in production? While stateless apps benefit from Docker and Kubernetes’ scaling ease, stateful applications like databases don’t enjoy such flexibility. Replicating a large production database is time-consuming and manual, which calls into question the value of docker run for such operations.
Docker’s awkwardness in hosting production databases stems from the stateful nature of databases, which requires additional setup steps. Setting up a new PostgreSQL replica, for instance, involves cloning a data directory locally and then starting the postmaster process. Because a container’s lifecycle is tied to a single process, database scaling and replication become complicated, leading to inelegant and complex solutions. The container’s process-isolation abstraction is leaky here: it fails to neatly cover the multi-process, multitasking nature of a database, introducing unnecessary complexity and hurting maintainability.
In conclusion, while Docker can improve system maintainability in some aspects, like simplifying new instance creation, the introduced complexities often undermine these benefits.
Tooling
Databases require tools for maintenance, including a variety of operational scripts, deployment, backup, archiving, failover, version upgrades, plugin installation, connection pooling, performance analysis, monitoring, tuning, inspection, and repair. Most of these tools are designed for bare-metal deployments. Like databases, these tools need thorough and careful testing. Getting something to run versus ensuring its stable, long-term, and correct operation are distinct levels of reliability.
A simple example is plugin and package management. PostgreSQL offers many useful plugins, such as PostGIS. On bare metal, installing this plugin is as easy as executing yum install followed by create extension postgis. However, in Docker, following best practices requires making changes at the image level to persist the extension beyond container restarts. This necessitates modifying the Dockerfile, rebuilding the image, pushing it to the server, and restarting the database container, undeniably a more cumbersome process.
Package management is a core aspect of OS distributions. Docker complicates this, as many PostgreSQL binaries are distributed not as RPM/DEB packages but as Docker images with pre-installed extensions. This raises a significant issue: how to consolidate multiple disparate images if one needs to use two, three, or over a hundred extensions from the PostgreSQL ecosystem? Compared to reliable OS package management, building Docker images invariably requires more time and effort to function properly.
Take monitoring as another example. In traditional bare-metal deployment, machine metrics are crucial for database monitoring. Monitoring in containers differs subtly from that on bare metal, and oversight can lead to pitfalls. For instance, the sum of various CPU mode durations always equals 100% on bare metal, but this assumption doesn’t necessarily hold in containers. Moreover, monitoring tools relying on the /proc filesystem may yield metrics in containers that differ significantly from those on bare metal. While such issues are solvable (e.g., mounting the Proc filesystem inside the container), complex and ugly workarounds are generally unwelcome compared to straightforward solutions.
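As a rough illustration of what such tools compute, the sketch below derives per-mode CPU shares from the aggregate `cpu` line of a /proc/stat-style file. `cpu_mode_shares` is a hypothetical helper, and only the first eight counters are summed, since guest time is already included in user time.

```shell
# Print the share (%) of each CPU mode from the aggregate "cpu" line of a
# /proc/stat-style file (default: /proc/stat). Hypothetical helper.
cpu_mode_shares() {
  awk '$1 == "cpu" {
    n = split("user nice system idle iowait irq softirq steal", name, " ")
    total = 0
    for (i = 1; i <= n; i++) total += $(i + 1)
    for (i = 1; i <= n; i++) printf "%s %.1f\n", name[i], 100 * $(i + 1) / total
  }' "${1:-/proc/stat}"
}

# Cumulative shares since boot, if /proc/stat is available on this system.
if [ -r /proc/stat ]; then
  cpu_mode_shares
fi
```

By construction these shares add up to 100% of whatever the file reports. Inside a container, /proc/stat generally still describes the whole host unless a virtualized view is mounted over it, so the numbers need not correspond to the CPU time actually available to the container under its cgroup limits.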
Similar issues arise with some failure detection tools and common system commands. Theoretically, these could be executed directly on the host, but can we guarantee that the results in the container will be identical to those on bare metal? More frustrating is the emergency troubleshooting process, where necessary tools might be missing in the container, and with no external network access, the Dockerfile→Image→Restart path can be exasperating.
Treating Docker like a VM, many tools may still function, but this defeats much of Docker’s purpose, reducing it to just another package manager. Some argue that Docker enhances system reliability through standardized deployment, given the more controlled environment. While this is true, I believe that if the personnel managing the database understand how to configure the database environment, there’s no fundamental difference between scripting environment initialization in a Shell script or in a Dockerfile.
Scalability
Performance is another major concern. From a performance perspective, the basic principle of database deployment is: the closer to the hardware, the better. Additional isolation and abstraction layers hurt database performance: more isolation means more overhead, even if it is just an extra memcpy in the kernel.
For performance-critical scenarios, some databases bypass the operating system’s page management mechanism and operate on the disk directly, and some even use FPGAs or GPUs to accelerate query processing. Docker is a lightweight container runtime, so the performance penalty is small and may be negligible in performance-insensitive scenarios, but the extra abstraction layer can only make performance worse, never better.
Summary
Container and orchestration technologies are valuable for operations, bridging the gap between software and services by aiming to codify and modularize operational expertise and capabilities. Container technology is poised to become the future of package management, while orchestration evolves into a “data center distributed cluster operating system,” forming the underlying infrastructure runtime for all software. As more challenges are addressed, confidently running both stateful and stateless applications in containers will become feasible. However, for databases, this remains an ideal rather than a practical option, especially in production.
It’s crucial to reiterate that the above discussion applies specifically to production databases. For development and testing, despite the existence of Vagrant-based virtual machine sandboxes, I advocate for Docker use—many developers are unfamiliar with configuring local test database environments, and Docker provides a clearer, simpler solution. For stateless production applications or those with non-critical derivative state data (like Redis caches), Docker is a good choice. But for core relational databases in production, where data integrity is paramount, one should carefully consider the risks and benefits: What’s the value of using Docker here? Can it handle potential issues? Are you prepared to assume the responsibility if things go wrong?
Every technological decision involves balancing pros and cons, like the core trade-off here of sacrificing reliability for maintainability with Docker. Some scenarios may warrant this, such as cloud providers optimizing for containerization to oversell resources, where container isolation, high resource utilization, and management convenience align well. Here, the benefits might outweigh the drawbacks. However, in many cases, reliability is the top priority, and compromising it for maintainability is not advisable. Moreover, it’s debatable whether using Docker significantly eases database management; sacrificing long-term operational maintainability for short-term deployment ease is unwise.
In conclusion, containerizing production databases is likely not a prudent choice.