In a secured Hadoop cluster, Hadoop daemons (NameNode, DataNode, etc.) may fail to start because of Kerberos authentication issues.
The daemon logs can help you narrow down the problem.
"Unable to obtain password from user"
A snippet from the NameNode log is shown below:
2014-03-27 17:57:57,904 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: Login failure for hdfs/dev6ha@SATURN.LOCAL from keytab /etc/security/phd/keytab/hdfs.service.keytab
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:836)
..
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
        at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:789)
2014-03-27 18:15:33,186 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-03-27 18:15:33,188 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at hdm1.saturn.local/10.246.67.243
The pointers below can help you start the investigation. The examples use the values from the log snippet above.
Identify the principal name
The Kerberos principal that the daemon uses to authenticate is specified in hdfs-site.xml. Depending on the node role, verify that dfs.namenode.kerberos.*.principal or dfs.datanode.kerberos.*.principal contains the correct principal name. If hdfs-site.xml uses the _HOST variable, ensure that hostname -f returns the fully qualified hostname and that it matches the principal name shown in the log files. If name resolution is misconfigured (for example, DNS returns one IP address for the host but /etc/hosts lists a different one), _HOST is not replaced with the correct name and you may see this error.
Note: In this example, DNS returned the IP address 10.246.67.243, but /etc/hosts pointed to 10.246.67.218. Because this was a NameNode High Availability configuration, _HOST was replaced by the nameservice name (dev6ha) instead of the actual hostname.
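The _HOST check described above can be sketched as a small shell function. This is a minimal sketch, assuming a plain textual substitution of the hostname for _HOST; the function name expand_principal is hypothetical and is not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): substitute a hostname for the
# _HOST placeholder in a principal pattern taken from hdfs-site.xml.
expand_principal() {
    pattern=$1
    host=$2
    case $pattern in
        *_HOST*)
            # Text before _HOST, then the hostname, then the text after _HOST
            printf '%s%s%s\n' "${pattern%%_HOST*}" "$host" "${pattern#*_HOST}"
            ;;
        *)
            # No placeholder present: the principal is used as written
            printf '%s\n' "$pattern"
            ;;
    esac
}
```

For example, running expand_principal 'hdfs/_HOST@SATURN.LOCAL' "$(hostname -f)" on the affected node and comparing the result with the principal in the "Login failure for ..." log line shows whether the name the node reports for itself is the one you expect.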
Identify the keytab file in use
You will see this error if the keytab file defined in hdfs-site.xml is not present, so verify the path and the keytab filename.
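The keytab path check can be scripted along these lines. This is a minimal sketch; check_keytab is a hypothetical helper name, and the readability test is an extra assumption beyond the missing-file case described above. Run it as the user that starts the daemon, because root can read any file.

```shell
#!/bin/sh
# Hypothetical helper: report whether a keytab file exists and whether the
# current user can read it.
check_keytab() {
    keytab=$1
    if [ ! -e "$keytab" ]; then
        echo "missing: $keytab"
        return 1
    elif [ ! -r "$keytab" ]; then
        # Assumption: an unreadable keytab is treated as a separate failure case
        echo "not readable by $(id -un): $keytab"
        return 2
    fi
    echo "ok: $keytab"
}
```

With the values from the log snippet, the call would be check_keytab /etc/security/phd/keytab/hdfs.service.keytab.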
Verify that you can kinit using the principal name and keytab
[root@phd11-nn keytab] kinit -kt /etc/security/phd/keytab/hdfs.service.keytab hdfs/dev6ha@SATURN.LOCAL
If kinit fails, the hostnames in the principals stored in your keytab file may be inconsistent with DNS or /etc/hosts, and the daemon will keep failing with the same error.
How to verify the contents of the keytab file:
klist -ket /etc/security/phd/keytab/hdfs.service.keytab
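To check the listing for one specific principal, the klist output can be piped through a small filter. This is a minimal sketch; keytab_has_principal is a hypothetical helper name, and it assumes only that klist prints each principal as a whitespace-separated field on its own entry line.

```shell
#!/bin/sh
# Hypothetical helper: read klist output on stdin and succeed (exit 0) only
# if the given principal appears as a whole field on some line.
keytab_has_principal() {
    awk -v want="$1" '
        { for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
        END { exit !found }
    '
}
```

A typical call would be klist -kt /etc/security/phd/keytab/hdfs.service.keytab | keytab_has_principal hdfs/dev6ha@SATURN.LOCAL && echo present. Matching whole fields rather than substrings avoids a false positive when one principal name is a prefix of another.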
How to regenerate the keytab file:
[root@KDC server] kadmin.local -q "ktadd -norandkey -k /etc/security/keytab/hdfs-hostid.service.keytab hdfs/host_fqdn@REALM HTTP/host_fqdn@REALM"
Identify how the hostname or IP address is resolved
Names can be resolved through DNS or through /etc/hosts. Check /etc/nsswitch.conf to see which one is consulted first. There will be an entry like the one below. In this example the /etc/hosts file (files) is consulted before DNS (dns); the sources may also appear in the opposite order.
hosts: files dns
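The lookup order can also be extracted with a short script. This is a minimal sketch; hosts_lookup_order is a hypothetical helper name, and it assumes the hosts entry sits on a single line with no trailing comment.

```shell
#!/bin/sh
# Hypothetical helper: print the sources listed on the "hosts:" line of an
# nsswitch.conf file, in the order they are consulted.
hosts_lookup_order() {
    conf=${1:-/etc/nsswitch.conf}
    awk '$1 == "hosts:" {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
        exit
    }' "$conf"
}
```

Calling hosts_lookup_order with no argument reads /etc/nsswitch.conf. To see the address that this order actually produces for a host, getent hosts followed by the hostname resolves it through the same configuration.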
Note: We will keep updating this document as we find more causes of this issue.