Applet: Data Analysis Applications
Use the Pigsty toolbox for data analysis and visualization
The Structure of an Applet
An applet is a small, self-contained data application that runs on top of the Pigsty infrastructure.
A Pigsty application typically includes at least one, and possibly all, of the following:
- A graphical interface (Grafana dashboard definitions), placed in the ui directory
- Data definitions (PostgreSQL DDL files), placed in the sql directory
- Data files (resources of various kinds, files to be downloaded), placed in the data directory
- Logic scripts (for executing various tasks), placed in the bin directory
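Putting the four pieces together, a typical applet directory might look like this (a sketch based on the covid app; the exact file names are illustrative, not prescribed by Pigsty):

```
covid/
├── install     # installation script (entry point)
├── Makefile    # shortcuts such as make ui / sql / download / load
├── ui/         # Grafana dashboard definitions (JSON)
├── sql/        # PostgreSQL DDL files
├── data/       # data files and downloaded resources
└── bin/        # logic scripts
```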
Pigsty ships with several sample applications by default:
- pglog: analyzes PostgreSQL CSV log samples.
- covid: visualizes WHO COVID-19 data, letting you browse per-country pandemic figures.
- isd: NOAA ISD; lets you query weather observation records since 1901 from 30,000 surface weather stations worldwide.
Installing an Application
A Pigsty application provides an installation script named install (or equivalent shortcuts) in its root directory. You need to run the installation as the admin user on the meta node. The install script detects the current environment (it reads METADB_URL, PIGSTY_HOME, GRAFANA_ENDPOINT, etc.) to perform the installation.
Dashboards tagged APP are listed in the App drop-down menu on the Pigsty Grafana home page, and dashboards tagged with both APP and Overview are also listed in the home page dashboard navigation.
You can download applications bundled with base data from https://github.com/Vonng/pigsty/releases/download/v1.5.1/app.tgz and install them.
COVID
COVID is a sample application that visualizes WHO COVID-19 data, letting you browse per-country pandemic figures.
Public demo: http://demo.pigsty.cc/d/covid-overview
Installation
cd covid
make all # full install (downloads the latest data from WHO)
make all2 # full install (uses locally downloaded data)
Finer-grained control:
make ui # install the covid dashboards into Grafana
make sql # create the covid table definitions in the metadb
make download # download the latest WHO data
make load # load the downloaded WHO data
make reload # download + load
If you have already downloaded the data (e.g., by fetching app.tgz), run make all2 instead to skip the download.
1 - PGLOG: Built-in PostgreSQL Log Analysis App
A sample applet that ships with Pigsty for analyzing PostgreSQL CSV log samples
PGLOG is a sample application bundled with Pigsty. It always uses the pglog.sample table in the MetaDB as its data source: simply pour logs into that table, then visit the related dashboards.
Pigsty provides some handy commands for pulling CSV logs and pouring them into the sample table. The following shortcuts are available on the meta node by default:
catlog [node=localhost] [date=today] # print CSV logs to stdout
pglog # pour CSVLOG from stdin
pglog12 # pour CSVLOG in PG12 format
pglog13 # pour CSVLOG in PG13 format
pglog14 # pour CSVLOG in PG14 format (=pglog)
catlog | pglog # analyze today's log on the current node
catlog node-1 '2021-07-15' | pglog # analyze node-1's csvlog of 2021-07-15
Then you can visit the following links to view the sample log analysis dashboards.


The catlog command pulls the CSV database log of a given node for a given date and writes it to stdout.
By default, catlog pulls today's log from the current node; the node and date can be specified via arguments.
Combining pglog with catlog lets you quickly pull database CSV logs for analysis.
catlog | pglog # analyze today's log on the current node
catlog node-1 '2021-07-15' | pglog # analyze node-1's csvlog of 2021-07-15
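The pglog12/13/14 variants exist because the csvlog column set changed across PostgreSQL versions (PG13 added backend_type; PG14 added leader_pid and query_id). Since csvlog is ordinary CSV, it is also easy to inspect outside the database; here is a minimal Python sketch parsing one hypothetical, abridged PG13-style log line:

```python
import csv
import io

# One hypothetical PG13-style csvlog line (24 columns: log_time, user_name,
# database_name, process_id, connection_from, session_id, session_line_num,
# command_tag, session_start_time, virtual_transaction_id, transaction_id,
# error_severity, sql_state_code, message, detail, hint, internal_query,
# internal_query_pos, context, query, query_pos, location, application_name,
# backend_type). All values below are invented.
sample = ('2021-07-15 12:00:00.000 UTC,"dbuser","meta",12345,'
          '"10.10.10.10:50332","60efec80.3039",1,"SELECT",'
          '"2021-07-15 11:59:59 UTC","3/42",0,"LOG","00000",'
          '"duration: 3.120 ms  statement: SELECT 1",'
          ',,,,,,,,"psql","client backend"\n')

# csvlog is plain CSV, so csv.reader copes with quoted commas and newlines
row = next(csv.reader(io.StringIO(sample)))
severity, sqlstate, message = row[11], row[12], row[13]
print(severity, message)
```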
2 - COVID: WHO COVID-19 Pandemic Dashboard
A sample applet bundled with Pigsty that visualizes the official WHO pandemic data
Online demo: http://demo.pigsty.cc/d/covid-overview



Installation
Enter the application directory on the admin node and run make to complete the installation.
make # if local data is available
make all # full install, downloading data from the WHO website
make reload # re-download and load the latest data
Some other subtasks:
make ui # install grafana dashboards
make sql # install database schemas
make download # download latest data
make load # load downloaded data into database
make reload # download latest data and pour it into database
Dashboards
3 - NOAA ISD: Querying Historical Weather Station Data
Using the ISD dataset as an example of how to import data into the database
Online demo: http://demo.pigsty.cc/d/isd-overview
If you have a database and don't know what to do with it, check out the author's other open-source project: Vonng/isd
You can reuse the monitoring system's Grafana to interactively browse sub-hourly weather data from nearly 30,000 surface weather stations over the past 120 years.
It is a fully functional data application that can query weather observation records since 1901 from 30,000 surface weather stations worldwide.
Project: https://github.com/Vonng/isd
Installation
cd isd
make all # full install (downloads the latest data from GitHub and NOAA)
make all2 # full install (uses locally downloaded data)
Finer-grained control:
make ui # install the isd dashboards into Grafana
make sql # create the isd table definitions in the metadb
make download # download the latest NOAA data, the ISD parser, and the dictionary tables
make baseline # initialize the basic global dashboards using the downloaded data
make reload # download the latest daily summary from NOAA, parse it, and load it
ISD: Integrated Surface Data
This project contains all the tools needed to download, parse, process, and visualize the NOAA ISD dataset.
It lets you browse sub-hourly weather data from nearly 30,000 surface weather stations over the past 120 years, and experience the full power of PostgreSQL for data analysis and processing!
SYNOPSIS
Download, parse, and visualize the Integrated Surface Dataset,
covering 30,000 meteorological stations and their sub-hourly observation records from 1900-2020.

Quick Start
- Clone the repo
git clone https://github.com/Vonng/isd && cd isd
- Prepare a Postgres database
(connect via something like isd or postgres://user:pass@host/dbname)
# skip this if you already have a viable database
PGURL=postgres
psql ${PGURL} -c 'CREATE DATABASE isd;'
# database connection string, something like `isd` or `postgres://user:pass@host/dbname`
PGURL='isd'
psql ${PGURL} -AXtwc 'CREATE EXTENSION postgis;'
# create tables, partitions, functions
psql ${PGURL} -AXtwf sql/schema.sql
- Download data
- ISD Station: station metadata: id, name, location, country, etc.
- ISD History: station observation records: observation count per month
- ISD Hourly: yearly archived station (sub-)hourly observation records
- ISD Daily: yearly archived station daily aggregated summaries
bin/get-isd-station.sh # download isd station metadata from noaa (a proxy makes it faster)
bin/get-isd-history.sh # download isd history observations from noaa
bin/get-isd-hourly.sh <year> # download isd hourly data (yearly tarballs, 1901-2020)
bin/get-isd-daily.sh <year> # download isd daily data (yearly tarballs, 1929-2020)
- Build the parsers
There are two ISD dataset parsers written in Go: isdh for the ISD hourly dataset and isdd for the ISD daily dataset. Running make isdh and make isdd will build them and copy the binaries to bin/. These parsers are required for loading data into the database. You can download pre-compiled binaries into the bin/ directory to skip this phase.
- Load data
Metadata includes world_fences, china_fences, isd_elements, isd_mwcode, isd_station, and isd_history. These are gzipped CSV files located in data/meta/. world_fences, china_fences, isd_elements, and isd_mwcode are constant dictionary tables, while isd_station and isd_history are updated frequently; you will have to download the latter two from NOAA before loading them.
# load metadata: fences, dicts, station, history, ...
bin/load-meta.sh
# load a year's daily data into the database
bin/load-isd-daily <year>
# load a year's hourly data into the database
bin/load-isd-hourly <year>
Note that the original isd_daily dataset contains some uncleansed data; refer to the caveats section for details.
Data
Dataset
Hourly data: original tarball size 105GB, table size 1TB (+600GB of indexes).
Daily data: original tarball size 3.2GB, table size 24GB.
It is recommended to have 2TB of storage for a full installation, and at least 40GB for a daily-data-only installation.
Schema
Data schema definition
Station
CREATE TABLE public.isd_station
(
station VARCHAR(12) PRIMARY KEY,
usaf VARCHAR(6) GENERATED ALWAYS AS (substring(station, 1, 6)) STORED,
wban VARCHAR(5) GENERATED ALWAYS AS (substring(station, 7, 5)) STORED,
name VARCHAR(32),
country VARCHAR(2),
province VARCHAR(2),
icao VARCHAR(4),
location GEOMETRY(POINT),
longitude NUMERIC GENERATED ALWAYS AS (Round(ST_X(location)::NUMERIC, 6)) STORED,
latitude NUMERIC GENERATED ALWAYS AS (Round(ST_Y(location)::NUMERIC, 6)) STORED,
elevation NUMERIC,
period daterange,
begin_date DATE GENERATED ALWAYS AS (lower(period)) STORED,
end_date DATE GENERATED ALWAYS AS (upper(period)) STORED
);
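As the generated columns above show, the usaf and wban identifiers are fixed-width slices of the station id. The same decomposition sketched in Python (the station id itself is made up):

```python
# A station id is the 6-digit USAF id concatenated with the 5-digit WBAN id.
station = "72295023174"  # hypothetical id: USAF 722950 + WBAN 23174

# Mirrors substring(station, 1, 6) and substring(station, 7, 5) from the DDL
usaf = station[0:6]
wban = station[6:11]
print(usaf, wban)
```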
Hourly Data
CREATE TABLE public.isd_hourly
(
station VARCHAR(11) NOT NULL,
ts TIMESTAMP NOT NULL,
temp NUMERIC(3, 1),
dewp NUMERIC(3, 1),
slp NUMERIC(5, 1),
stp NUMERIC(5, 1),
vis NUMERIC(6),
wd_angle NUMERIC(3),
wd_speed NUMERIC(4, 1),
wd_gust NUMERIC(4, 1),
wd_code VARCHAR(1),
cld_height NUMERIC(5),
cld_code VARCHAR(2),
sndp NUMERIC(5, 1),
prcp NUMERIC(5, 1),
prcp_hour NUMERIC(2),
prcp_code VARCHAR(1),
mw_code VARCHAR(2),
aw_code VARCHAR(2),
pw_code VARCHAR(1),
pw_hour NUMERIC(2),
data JSONB
) PARTITION BY RANGE (ts);
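Because isd_hourly is declared PARTITION BY RANGE (ts), loaders can simply COPY into the parent table and PostgreSQL routes each row to the matching partition. Assuming yearly partitions named isd_hourly_&lt;year&gt; (the naming is hypothetical; the real scheme is defined in sql/schema.sql), the target partition for a row is a pure function of its timestamp:

```python
from datetime import datetime

def hourly_partition(ts: datetime) -> str:
    """Name of the (assumed) yearly partition that a row with timestamp ts lands in."""
    return f"isd_hourly_{ts.year}"

print(hourly_partition(datetime(2020, 7, 15, 12, 30)))  # isd_hourly_2020
```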
Daily Data
CREATE TABLE public.isd_daily
(
station VARCHAR(12) NOT NULL,
ts DATE NOT NULL,
temp_mean NUMERIC(3, 1),
temp_min NUMERIC(3, 1),
temp_max NUMERIC(3, 1),
dewp_mean NUMERIC(3, 1),
slp_mean NUMERIC(5, 1),
stp_mean NUMERIC(5, 1),
vis_mean NUMERIC(6),
wdsp_mean NUMERIC(4, 1),
wdsp_max NUMERIC(4, 1),
gust NUMERIC(4, 1),
prcp_mean NUMERIC(5, 1),
prcp NUMERIC(5, 1),
sndp NUMERIC(5, 1),
is_foggy BOOLEAN,
is_rainy BOOLEAN,
is_snowy BOOLEAN,
is_hail BOOLEAN,
is_thunder BOOLEAN,
is_tornado BOOLEAN,
temp_count SMALLINT,
dewp_count SMALLINT,
slp_count SMALLINT,
stp_count SMALLINT,
wdsp_count SMALLINT,
visib_count SMALLINT,
temp_min_f BOOLEAN,
temp_max_f BOOLEAN,
prcp_flag CHAR,
PRIMARY KEY (ts, station)
) PARTITION BY RANGE (ts);
Update
The ISD daily and ISD hourly datasets are updated on a rolling basis each day. Run the following scripts to load the latest data into the database.
# download, clean, reload the latest hourly dataset
bin/get-isd-hourly.sh
bin/load-isd-hourly.sh
# download, clean, reload the latest daily dataset
bin/get-isd-daily.sh
bin/load-isd-daily.sh
# recalculate the latest monthly and yearly partitions
bin/refresh-latest.sh
Parser
There are two parsers, isdd and isdh, which take NOAA's original yearly tarballs as input and generate CSV as output (which can be consumed directly by the PostgreSQL COPY command).
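For illustration, CSV in the shape COPY accepts can be produced with nothing more than the standard csv module; the sketch below writes a few invented isd_hourly rows (only the first three columns) in that form:

```python
import csv
import io

# Invented sample rows for (station, ts, temp); not real observations
rows = [
    ("72295023174", "2020-07-15 12:00:00", "23.5"),
    ("72295023174", "2020-07-15 13:00:00", "24.1"),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # emits RFC-4180-style CSV (\r\n line endings)
print(buf.getvalue(), end="")

# Loading it would then look like (hypothetical psql invocation):
#   \copy isd_hourly(station, ts, temp) FROM 'hourly.csv' CSV
```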
NAME
isdh -- Integrated Surface Dataset Hourly Parser
SYNOPSIS
isdh [-i <input|stdin>] [-o <output|stdout>] -p -d -c -v
DESCRIPTION
The isdh program takes ISD hourly data (a yearly tarball file) as input
and generates CSV as output.
OPTIONS
-i <input> input file, stdin by default
-o <output> output file, stdout by default
-p <profpath> pprof file path (disabled by default)
-v verbose progress report
-d de-duplicate rows (raw, ts-first, hour-first)
-c add comma-separated extra columns
UI
ISD Station

ISD Monthly
