When installing or uninstalling large clusters, icm_client may report errors on the console indicating that some hosts have failed, even though the puppet sync on those hosts reports finished.
[gpadmin@pccadmin ~]$ icm_client deploy -c ClusterConfig_1.1/
Please enter the root password for the cluster nodes:
PCC creates a gpadmin user on the newly added cluster nodes (if any). Please enter a non-empty password to be used for the gpadmin user:
Verifying input
Starting install
[====================================================================================================] 100%
Results:
hdw3.hadoop.local... [Success]
hdw2.hadoop.local... [Failed]
hdw1.hadoop.local... [Success]
hdm1.hadoop.local... [Failed]
Details at /var/log/gphd/gphdmgr/
Cluster ID: 18
During deployment, ICM fetches the cluster status every 10 seconds and expects the installation to start. If a cluster node has not started the puppet sync after 10 seconds * 10 tries = 100 seconds, ICM marks that node as failed.
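The timeout arithmetic above can be sketched as a quick shell calculation. The retry count of 10 is taken from the behavior described here; only the interval is configurable (via gphdmgr.statusfetch.interval.secs):

```shell
# Total time ICM waits for a node's puppet sync to start before
# marking it failed: poll interval x number of polls.
interval_secs=10   # gphdmgr.statusfetch.interval.secs (default)
tries=10           # fixed retry count per the behavior described above
timeout=$((interval_secs * tries))
echo "ICM waits ${timeout}s before marking a node failed"   # 100s
```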
This can happen if the hardware in the admin node is not equipped to handle the deployment of a large number of cluster nodes.
You do not need to uninstall the cluster to fix this. First, modify gphdmgr.batch.size or gphdmgr.statusfetch.interval.secs in gphdmgr.properties so the admin node can effectively deploy all the nodes. After the modification, restart commander with "service commander restart".
Lowering gphdmgr.batch.size from its default of 50 to, say, 25 in a 144-node cluster install will result in longer deployment times: gphdmgr will install batches of 25 nodes and wait gphdmgr.batch.interval.secs=60 seconds before deploying the next batch of 25 nodes.
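As a rough estimate of the extra wall-clock time the smaller batch size adds, the 144-node example with the values quoted above works out as follows:

```shell
# Batch count and inter-batch waiting for a 144-node deploy with
# gphdmgr.batch.size=25 and gphdmgr.batch.interval.secs=60.
nodes=144
batch_size=25
batch_interval=60
batches=$(( (nodes + batch_size - 1) / batch_size ))  # ceiling division -> 6 batches
extra_wait=$(( (batches - 1) * batch_interval ))      # 5 waits of 60s -> 300s
echo "${batches} batches, ${extra_wait}s of inter-batch waiting"
```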
Raising gphdmgr.statusfetch.interval.secs from its default of 10 to 20 increases the time gphdmgr waits for a cluster node to start the installation. With a value of 20, ICM will wait 20 * 10 = 200 seconds. The benefit of changing only this value is that ICM can still deploy in 50-node batches, resulting in a faster overall deployment.
Params from file /etc/gphd/gphdmgr/conf/gphdmgr.properties:

...secs=7200
gphdmgr.sequential.batching=true
gphdmgr.statusfetch.interval.secs=10
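A minimal sketch of the edit itself, exercised here against a scratch copy so it is safe to run anywhere. On the admin node you would apply the same sed expressions to /etc/gphd/gphdmgr/conf/gphdmgr.properties (back it up first) and then restart commander; the default values below are the ones discussed in this article:

```shell
# Create a scratch properties file with the defaults discussed above.
props=$(mktemp)
cat > "$props" <<'EOF'
gphdmgr.batch.size=50
gphdmgr.batch.interval.secs=60
gphdmgr.statusfetch.interval.secs=10
EOF

# Halve the batch size and double the status-fetch interval in place.
sed -i -e 's/^gphdmgr\.batch\.size=.*/gphdmgr.batch.size=25/' \
       -e 's/^gphdmgr\.statusfetch\.interval\.secs=.*/gphdmgr.statusfetch.interval.secs=20/' \
       "$props"

cat "$props"
```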
Once the configuration changes are in place, restart the commander services and then run an ICM reconfigure to clear the failed-install flag:
1. service commander restart
2. icm_client reconfigure -l <cluster name>