Pivotal Knowledge Base


How to do Root Cause Analysis for the Segments Marked Down


  • Pivotal Greenplum Database (GPDB) 4.3.x
  • Operating System (OS): Red Hat Enterprise Linux (RHEL) 6.x


This article describes what to review in order to understand why segments were marked down. Segments can go down for many reasons, so it is important to understand the basic principles and act accordingly.


To perform a root cause analysis (RCA) of a "Segment Down" event, review the following information:

-- gp_segment_configuration table

select * from gp_segment_configuration gsc join pg_filespace_entry pfe on gsc.dbid=pfe.fsedbid where content= <contentID for segment which went down> ;

-- gp_configuration_history table

select * from gp_configuration_history order by time desc;

-- DB log files from the date and time segments went down

Master logs ($MASTER_DATA_DIRECTORY/pg_log/)
Primary logs (segment_data_directory/pg_log/)
Mirror logs (segment_data_directory/pg_log/)

-- Relevant configuration parameters (gpconfig -s)

gpconfig -s gp_fts_probe_interval 
gpconfig -s gp_fts_probe_threadcount
gpconfig -s gp_fts_probe_timeout
gpconfig -s gp_segment_connect_timeout

-- The time (exact or approximate) when the segment went down

-- Identification of the segment - either DBID or combination of ContentID and Role (Primary/Mirror)

Note: The corresponding primary or mirror for a specific segment can be found in the "gp_segment_configuration" table. A primary/mirror pair has the same value in the "content" column.

How to do the Root Cause Analysis

From gp_segment_configuration and the master logs, identify the following:

Which segments went down?
At what time did they go down?
Is it only primaries?
Is it only mirrors?
Is it only one server or rack?
Is it a mix of primaries and mirrors?
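
The questions above can be answered quickly by tallying the down segments per host and role. The following is a minimal shell sketch, not part of the original article. It assumes the input is unaligned, pipe-delimited psql output with the columns hostname, preferred_role and status, in that order. The function name summarize_down_segments and the database name in the usage comment are illustrative.

```shell
#!/bin/bash
# Tally segments marked down (status 'd') per host and preferred role.
# Assumed input, one segment per line: hostname|preferred_role|status
# preferred_role is used because a failed primary is switched to role 'm'
# after failover, while its preferred_role stays 'p'.
summarize_down_segments() {
    awk -F'|' '
        $3 == "d" {
            role = ($2 == "p") ? "primary" : "mirror"
            down[$1 " " role]++
        }
        END {
            for (key in down) print key, down[key]
        }
    ' | sort
}

# Example usage against a live system (assumes the database "postgres" exists):
#   psql -At -d postgres -c "select hostname, preferred_role, status
#       from gp_segment_configuration where content >= 0" | summarize_down_segments
```

Each output line has the form "hostname role count". Failures limited to one host point to that server. Failures spread across hosts, all on mirrors, point more towards workload or network.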

In most cases, it is the mirror segments that are marked down. The usual cause is that the primary and mirror cannot communicate in a timely manner: the primary segment does not receive confirmation from the mirror within the time limit set by gp_segment_connect_timeout.

Case 1:

The log entry below indicates that mirrors are going down because of high workload or network overload. In this case, "gp_segment_connect_timeout" can be increased to allow a longer response time from the mirror. This is not a permanent fix: if the workload keeps increasing, another failure can happen later.

2013-05-08 04:10:50.730638 EDT,,,p28480,th111540096,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=1500' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",860,
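
To see how often this warning occurred before the mirror was marked down, the primary segment logs can be searched for it. The following is a minimal sketch. It assumes the log files under pg_log have a .csv extension and that the timestamp is the first comma-separated field, as in the entry above. The function name find_timeout_warnings is illustrative.

```shell
#!/bin/bash
# Print the timestamp of every 'gp_segment_connect_timeout' threshold warning
# found in the CSV log files of one pg_log directory.
find_timeout_warnings() {
    local log_dir="$1"
    # -h suppresses file names so that field 1 is always the timestamp.
    grep -h "threshold .* percent of 'gp_segment_connect_timeout=" "$log_dir"/*.csv 2>/dev/null \
        | cut -d',' -f1
}

# Example usage (segment_data_directory is a placeholder):
#   find_timeout_warnings segment_data_directory/pg_log
```

A cluster of timestamps shortly before the failure time supports the workload or network explanation. The hint in the log message itself names the remedy: raise the value with gpconfig (for example, gpconfig -c gp_segment_connect_timeout -v <seconds>) and reload the configuration with gpstop -u.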

Case 2:

Mirrors can also go down due to missing files. In this case, search the segment log files for entries that refer to 'transition' and to missing files.

Case 3:

Primary segments are marked down. There can be multiple reasons. Start by reviewing the primary segment log files for the timeframe when the segment was marked down. Look for the word "transition"; the log messages around it usually help to explain why the segment went down. The reason could be one of the following:

Out of memory (OS or VMEM)
Network issues
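
The search for "transition" around the failure time can be sketched as a small shell function. This is an illustration only. It assumes that each log line starts with its timestamp, as in the log entry shown in Case 1. The function name transition_context and its arguments are hypothetical.

```shell
#!/bin/bash
# Print log lines mentioning "transition", with surrounding context, limited to
# lines whose timestamp starts with the given prefix (e.g. "2013-05-08 04:1").
transition_context() {
    local log_file="$1" time_prefix="$2" context="${3:-5}"
    # index(...) == 1 keeps only lines that begin with the literal prefix.
    awk -v p="$time_prefix" 'index($0, p) == 1' "$log_file" \
        | grep -i -C "$context" "transition"
}

# Example usage (the log file path is a placeholder):
#   transition_context segment_data_directory/pg_log/<logfile>.csv "2013-05-08 04:1" 10
```

The context lines before the first "transition" entry are the ones to read for out-of-memory or network errors.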

Case 4:

The postmaster process on the primary segment periodically verifies that I/O on the segment data directory works properly (a file can be written and read) by writing a file ("fts_probe_file.bak") under the data directory. If there is an I/O problem (for example, a stuck controller), the segment cannot respond to the FTS process on the master; FTS then promotes the mirror to primary and transitions the primary to mirror. Typical symptoms are segments being transitioned while the segment servers appear "stuck" and nobody can connect to them.
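
When this case is suspected, a similar write-and-read check can be run by hand against the segment data directory. The following is a rough sketch, not the check performed by the postmaster. The function name check_dir_io, the temporary file name and the 10-second default limit are assumptions made for illustration.

```shell
#!/bin/bash
# Write a small file under the given directory, flush it to disk, read it back
# and compare. Returns 0 if the round trip succeeds within the time limit.
check_dir_io() {
    local dir="$1" limit="${2:-10}"
    local probe="$dir/io_check.$$.tmp"   # hypothetical name, not fts_probe_file.bak
    timeout "$limit" bash -c '
        token="io-check-$$"
        printf "%s\n" "$token" > "$1" && sync && [ "$(cat "$1")" = "$token" ]
    ' _ "$probe"
    local rc=$?          # 124 means the time limit was reached
    rm -f "$probe"
    return "$rc"
}

# Example usage (segment_data_directory is a placeholder):
#   check_dir_io segment_data_directory 10 && echo "I/O ok" || echo "I/O problem"
```

Note that sync flushes all file systems on the host, not only the one under test. A process blocked in uninterruptible I/O may not respond to the timeout signal, so a hang of the check itself is also a strong hint of an I/O problem.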

Long-Term Trend Analysis

It is often necessary to analyze past segment failures to identify long-term trends, such as possible hardware issues. A psql script attached to this article (segment_failures.sql) can be used for this purpose.

This script analyzes the last three months of segment failures and produces three reports:

  1. Any primary segments that have failed more than once within the reporting window.
  2. Any mirrors that have failed more than once within the reporting window.
  3. Any server with segment failures, the date and time (to an hour granularity) and the number of segment failures (mirror and primary) within that hour window.

The output will look similar to the following:

[gpadmin@mdw ~]$ psql -p 54320 -f f.sql
Timing is on.
Primary segments with more than 1 failure
 hostname | content | number_failures
(0 rows)

Time: 15.570 ms
Mirror segments with more than 1 failure
 hostname | content | number_failures
 sdw1     |       2 |               2
 sdw1     |       3 |               2
(2 rows)

Time: 4.946 ms
Hosts and time with failures
 hostname |      failure time      | number_failures
 sdw1     | 2016-01-14 10:00:00-08 |               4
 sdw1     | 2016-02-29 07:00:00-08 |               2
 sdw1     |                        |               6
          |                        |               6
(4 rows)

Time: 2.476 ms

Based on the above, segments that fail repeatedly can be flagged for further investigation. The third report can also be used to roll up the total number of segment failures per node for the reporting period.

To change the reporting period, alter the first line of the script:

\set report_interval ('3 month')::INTERVAL

Additional Information

  1. Details about the down segments are located in the segment DB logs.
  2. Use gp_configuration_history to check whether there are any patterns.
  3. Use gpstate -e for a quick view of segment state.
  4. Catalog issues can also cause segments to go down.
  5. Always understand why a segment went down before suggesting a recovery.
  6. A full recovery (gprecoverseg -F) deletes all the data and files in the segment data directory.
  7. Running an incremental recovery after a full recovery will not work, for the reason given in item 6.

Check the document mentioned here for more information on segment failure analysis.

