Pivotal Greenplum Database (GPDB)
In this document, we will take a look at gpsd execution and internals.
gpsd is a GPDB utility used to collect object metadata and statistics from a source database and import them into a target database. This essentially produces a replica of the source databases system catalog and statistics which are used in query planning.
gpsd is included in the GPDB distribution under the $GPHOME/bin directory.
Generally, the source database is a customers system and the target database is an internal lab system.
The target database can be used to reproduce issues when the optimizer/planner:
- Produces the wrong plan
- Crashes while producing a plan
- Encounters optimizer/planner related issues
Only metadata and statistics are collected. Data stored in the tables is ignored as this would consume too much space and customers generally do not want to share data as it can be confidential.
How To Execute?
The following command should be executed on the source system. This is where the statistics will be extracted from:
gpsd <dbname> > <gpsd_dump_file> Example:
gpsd customer_prod1 > customer_prod1_20150808.sql
The SQL file should be gzipped and transferred to the target database.
To load the data into the target database:
psql <dbname> -f <gpsd_dump_file>
psql repro_db -f customer_prod1_20150808.sql
Now the target database contains the same schema and statistics and should produce the same query plan or encounter the same crash as the source database when the specific query is executed.
gpsd also supports skipping the schema and only collecting statistics:
gpsd -s <dbname> > <gpsd_dump_file> Example:
gpsd -s customer_prod1 > customer_prod1_20150808_updated.sql
This can be useful when you have already collected the schema from the source database and you only need to get updated statistics.
What Happens Internally?
The following steps are performed internally when gpsd is executed:
1. Dump global objects
if orca: pg_dumpall -h -p -U -l -g --no-gp-syntax else: pg_dumpall -h -p -U -l -g --gp-syntax
2. Dump schema
pg_dump -h -p -U -s -x --gp-syntax -O
3. Dump tuple counts from pg_class
SELECT pgc.relname, pgn.nspname, pgc.relpages, pgc.reltuples FROM pg_class pgc, pg_namespace pgn WHERE pgc.relnamespace = pgn.oid AND pgn.nspname NOT IN ('pg_toast', 'pg_bitmapindex', 'pg_temp_1', 'pg_catalog', 'information_schema')
4. Dump the statistics from pg_statistic
SELECT pgc.relname, pgn.nspname, pga.attname, pgt.typname, pgs.* FROM pg_class pgc, pg_statistic pgs, pg_namespace pgn, pg_attribute pga, pg_type pgt WHERE pgc.relnamespace = pgn.oid and pgn.nspname NOT IN ('pg_toast', 'pg_bitmapindex', 'pg_temp_1', 'pg_catalog', 'information_schema') AND pgc.oid = pgs.starelid AND pga.attrelid = pgc.oid AND pga.attnum = pgs.staattnum AND pga.atttypid = pgt.oid ORDER BY pgc.relname, pgs.staattnum
All the output from the above 4 steps outputted to the screen which can be redirected into an SQL file.