Apache
hadoop
Security
ActiveDirectory
Kerberos

1. Hadoop Security Concern

Hadoop is one of the data lake solutions (https://en.wikipedia.org/wiki/Data_lake): you store all the data your organization has and analyze and manage it in a unified way. This is a very rough description, but it captures the point.

Apache Hadoop website: http://hadoop.apache.org/
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

All data and processes are distributed over hundreds of servers that are connected to each other. If even one of those servers is compromised, the result can be a critical security issue affecting the whole system. Therefore, you need to consider how to protect your Hadoop system.

According to a Hortonworks white paper, there are 5 pillars of security in Hadoop.
http://hortonworks.com/wp-content/uploads/2015/07/Security_White_Paper.pdf
image.png

The same document also shows what Hortonworks offers for each of the 5 security pillars.

image.png

In a secured Hadoop cluster, the security workflow looks like this.
https://community.hortonworks.com/articles/102957/hadoop-security-concepts.html
image.png

These figures come from Hortonworks documents, so Cloudera may not have the same service offerings. What matters here is that, whichever Hadoop distribution you use, there are several security aspects you need to consider.

In addition, this post covers only Kerberos authentication. As you may have noticed, Kerberos provides only a limited part of the security in a Hadoop cluster (highlighted by the red frames in the figures above). It is not a silver bullet for Hadoop security. After kerberizing your Hadoop cluster, you still need to design how to strengthen security further.
Even so, I think Kerberos authentication is a very good starting point for integrating a security system into your Hadoop cluster (at least it was for me).

2. How Kerberos works on Hadoop

In this part, I would like to show how user authentication differs between kerberized and non-kerberized Hadoop clusters.

Note that this section uses 2 Hadoop clusters based on different distributions with different configurations: HDP without Kerberos authentication, and CDH with Kerberos authentication using Isilon as HDFS. As a result, the commands and outputs in the following examples differ a little between the clusters. However, the difference does not affect the result of this Kerberos authentication test.

Without Kerberos Authentication

As the root user, you can't submit a YARN job, because root has no write permission on the HDFS directory owned by hdfs.

[root@hdp-master ~]# hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user/root/QuasiMonteCarlo_1517577291268_969459034/in":hdfs:hdfs:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:353)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:325)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:246)
(***omitted***)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/user/root/QuasiMonteCarlo_1517577291268_969459034/in":hdfs:hdfs:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:353)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:325)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:246)
(***omitted***)

However, as shown below, you can submit a YARN job as the hdfs user very easily. No authentication is necessary. This is highly insecure.

[root@hdp-master ~]# HADOOP_USER_NAME=hdfs   hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
18/02/02 08:21:39 INFO client.RMProxy: Connecting to ResourceManager at hdp-cluster-1.testdom.local/192.168.0.182:8050
18/02/02 08:21:39 INFO client.AHSProxy: Connecting to Application History server at hdp-cluster-1.testdom.local/192.168.0.182:10200
18/02/02 08:21:39 INFO input.FileInputFormat: Total input paths to process : 10
18/02/02 08:21:39 INFO mapreduce.JobSubmitter: number of splits:10
18/02/02 08:21:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1517298489855_0004
18/02/02 08:21:40 INFO impl.YarnClientImpl: Submitted application application_1517298489855_0004
18/02/02 08:21:40 INFO mapreduce.Job: The url to track the job: http://hdp-cluster-1.testdom.local:8088/proxy/application_1517298489855_0004/
18/02/02 08:21:40 INFO mapreduce.Job: Running job: job_1517298489855_0004
18/02/02 08:21:45 INFO mapreduce.Job: Job job_1517298489855_0004 running in uber mode : false
18/02/02 08:21:45 INFO mapreduce.Job:  map 0% reduce 0%
18/02/02 08:21:59 INFO mapreduce.Job:  map 20% reduce 0%
18/02/02 08:22:00 INFO mapreduce.Job:  map 100% reduce 0%
18/02/02 08:22:04 INFO mapreduce.Job:  map 100% reduce 100%
18/02/02 08:22:05 INFO mapreduce.Job: Job job_1517298489855_0004 completed successfully
18/02/02 08:22:05 INFO mapreduce.Job: Counters: 49
(***omitted***)
Job Finished in 26.048 seconds
Estimated value of Pi is 3.20000000000000000000

Whoever you are, you can access the Hadoop system just by claiming to be the hdfs user. That's why Kerberos authentication needs to be enabled in a Hadoop cluster.
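To make the weakness concrete, here is a minimal Python sketch of how a non-kerberized client ends up with its user name. It is a simplified model, not Hadoop's actual code: the function name resolve_simple_auth_user is hypothetical, and the only behavior taken from the example above is that HADOOP_USER_NAME overrides the OS login name.

```python
import getpass
import os


def resolve_simple_auth_user(env=None, os_user=None):
    """Return the user name a non-kerberized client would present.

    Simplified model (hypothetical helper): HADOOP_USER_NAME wins if it is
    set, otherwise the OS login name is used.
    """
    env = os.environ if env is None else env
    os_user = getpass.getuser() if os_user is None else os_user
    # Nothing here proves that the caller really is that user.
    return env.get("HADOOP_USER_NAME") or os_user


print(resolve_simple_auth_user({}, "root"))                            # root
print(resolve_simple_auth_user({"HADOOP_USER_NAME": "hdfs"}, "root"))  # hdfs
```

The name is simply asserted by the client, which is exactly what the hdfs example above exploits.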

The following shows how authentication works when Kerberos is enabled.

With Kerberos Authentication

In this environment, you are not allowed to submit a YARN job without authenticating via Kerberos, even if you claim to be the hdfs user.
The Hadoop job fails with "No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)". The Kerberos TGT is explained in the latter part of this post.

[root@kcdh-master-1 ~]#  HADOOP_USER_NAME=hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
18/02/02 08:34:18 WARN security.UserGroupInformation: PriviledgedActionException as:root (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
18/02/02 08:34:18 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
18/02/02 08:34:18 WARN security.UserGroupInformation: PriviledgedActionException as:root (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "kcdh-master-1.testdom.local/192.168.0.172"; destination host is: "cloudera-isi8101.testdom.local":8020;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
        at org.apache.hadoop.ipc.Client.call(Client.java:1508)
        at org.apache.hadoop.ipc.Client.call(Client.java:1441)
(***omitted***)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
        at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:718)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
(***omitted***)

You can authenticate yourself with the kinit command, as shown below.

[root@kcdh-master-1 ~]# kinit kerberos-user-0109@TESTDOM.LOCAL
Password for kerberos-user-0109@TESTDOM.LOCAL:

Then you can submit a YARN job. Also note that you don't need to be the hdfs user to submit a Hadoop job.

[root@kcdh-master-1 ~]# hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
18/02/02 08:46:52 INFO hdfs.DFSClient: Created token for kerberos-user-0109: HDFS_DELEGATION_TOKEN owner=kerberos-user-0109@TESTDOM.LOCAL, renewer=yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL, realUser=, issueDate=1517579267897, maxDate=1518184067897, sequenceNumber=0, masterKeyId=0 on cloudera-isi8101.testdom.local:8020
18/02/02 08:46:52 INFO security.TokenCache: Got dt for hdfs://cloudera-isi8101.testdom.local:8020; Kind: HDFS_DELEGATION_TOKEN, Service: cloudera-isi8101.testdom.local:8020, Ident: (token for kerberos-user-0109: HDFS_DELEGATION_TOKEN owner=kerberos-user-0109@TESTDOM.LOCAL, renewer=yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL, realUser=, issueDate=1517579267897, maxDate=1518184067897, sequenceNumber=0, masterKeyId=0)
18/02/02 08:46:52 WARN ipc.Client: Failed to connect to server: kcdh-master-1.testdom.local/192.168.0.172:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
        at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
        at org.apache.hadoop.ipc.Client.call(Client.java:1480)
        at org.apache.hadoop.ipc.Client.call(Client.java:1441)
(***omitted***)
18/02/02 08:46:52 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm39
18/02/02 08:46:52 INFO input.FileInputFormat: Total input paths to process : 10
18/02/02 08:46:52 INFO mapreduce.JobSubmitter: number of splits:10
18/02/02 08:46:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516176355578_0010
18/02/02 08:46:52 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: cloudera-isi8101.testdom.local:8020, Ident: (token for kerberos-user-0109: HDFS_DELEGATION_TOKEN owner=kerberos-user-0109@TESTDOM.LOCAL, renewer=yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL, realUser=, issueDate=1517579267897, maxDate=1518184067897, sequenceNumber=0, masterKeyId=0)
18/02/02 08:46:53 INFO impl.YarnClientImpl: Submitted application application_1516176355578_0010
18/02/02 08:46:53 INFO mapreduce.Job: The url to track the job: http://kcdh-master-2.testdom.local:8088/proxy/application_1516176355578_0010/
18/02/02 08:46:53 INFO mapreduce.Job: Running job: job_1516176355578_0010
18/02/02 08:50:18 INFO mapreduce.Job: Job job_1516176355578_0010 running in uber mode : false
18/02/02 08:50:18 INFO mapreduce.Job:  map 0% reduce 0%
18/02/02 08:50:24 INFO mapreduce.Job:  map 30% reduce 0%
18/02/02 08:50:25 INFO mapreduce.Job:  map 40% reduce 0%
18/02/02 08:50:26 INFO mapreduce.Job:  map 70% reduce 0%
18/02/02 08:50:28 INFO mapreduce.Job:  map 100% reduce 0%
18/02/02 08:50:33 INFO mapreduce.Job:  map 100% reduce 100%
18/02/02 08:50:34 INFO mapreduce.Job: Job job_1516176355578_0010 completed successfully
18/02/02 08:50:34 INFO mapreduce.Job: Counters: 49
(***omitted***)
Job Finished in 221.891 seconds
Estimated value of Pi is 3.20000000000000000000

Only after you authenticate yourself with kinit can you submit a Hadoop job.
In this way, Kerberos adds an authentication phase to a Hadoop cluster and increases its security. But don't forget that there are 5 pillars of security in Hadoop, and Kerberos authentication covers only one of them. Once Kerberos has authenticated a user, Kerberos itself places no further restriction on which services or data that user can reach on the cluster. If you need an authorization flow to control who uses which services and data, consider deploying Apache Ranger (https://ranger.apache.org/).

Note

You might notice the error "java.net.ConnectException: Connection refused" in the middle of the output above. It is not caused by Kerberos authentication; it appears simply because YARN Resource Manager HA is enabled in the cluster.
According to this link (https://community.hortonworks.com/articles/74768/yarn-client-always-connects-to-rm1-when-rm-is-in-h.html), when YARN RM HA is enabled, the YARN client always tries to connect to the primary RM first, whether it is active or standby, and then fails over to the secondary RM. This is how YARN is designed.
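The failover behavior described in this note can be sketched in a few lines of Python. This is only an illustration of the "try in configured order, then fail over" idea. The function names and the RM IDs rm1/rm2 are hypothetical, and the fake connection function stands in for a real RPC call.

```python
def connect_with_failover(rm_ids, try_connect):
    """Try each ResourceManager in configured order; return the first that accepts."""
    errors = []
    for rm_id in rm_ids:
        try:
            try_connect(rm_id)
            return rm_id, errors
        except ConnectionRefusedError as exc:
            # This RM refused the connection, as in the log above; try the next one.
            errors.append((rm_id, str(exc)))
    raise ConnectionError(f"no ResourceManager reachable: {errors}")


def fake_connect(rm_id):
    # Hypothetical cluster state: rm1 refuses connections, rm2 accepts them.
    if rm_id == "rm1":
        raise ConnectionRefusedError("Connection refused")


active, errors = connect_with_failover(["rm1", "rm2"], fake_connect)
print(active, errors)
```

The refused connection is recorded and the job continues on the next RM, which matches the warning followed by "Failing over" in the output.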

3. Kerberos Concept and Components

Before moving on to the internal process of Kerberos authentication, let's review the basic concepts and components of Kerberos.

Kerberos is a computer network authentication protocol that works on the basis of tickets to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
https://en.wikipedia.org/wiki/Kerberos_(protocol)

A Hadoop cluster can be integrated with MIT Kerberos or with Microsoft Active Directory. Please keep in mind while reading this part that Active Directory is used to deploy Kerberos in my environment, so there may be some slight differences from MIT Kerberos.

  • User Principal Name (UPN): A UPN is a registered user name in your Active Directory domain. To kerberize Hadoop, you need a dedicated Organizational Unit (OU), where all principals used by the Hadoop cluster reside, and a registered user who has delegated control to create, delete, and manage user accounts in the OU. UPNs are used when authenticating with the kinit command. In my case, the user kerberos-user-0109@TESTDOM.LOCAL is a UPN. Note that it is recommended to use separate users for managing principals in AD (with delegated control) and for using Hadoop services on the cluster. https://www.cloudera.com/documentation/enterprise/5-11-x/topics/cm_sg_s3_cm_principal.html

  • Service Principal Name (SPN): An SPN is an identifier for a service in the cluster. For example, yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL in the output log above is an SPN for YARN. An SPN is used to request a Service Ticket, so it must match one of the SPNs stored in Active Directory. After Kerberos is enabled in the Hadoop cluster, you will find dozens of newly registered principals inside the OU. Their names and passwords are set randomly, but each principal corresponds to one Hadoop service running on one host. http://blog.cloudera.com/blog/2014/07/new-in-cloudera-manager-5-1-direct-active-directory-integration-for-kerberos-authentication/
    image.png

  • Realm: A Kerberos realm is a set of managed nodes that share the same Kerberos database. When using Active Directory, the realm corresponds to the Active Directory domain. In the example above, the realm is TESTDOM.LOCAL.

  • Ticket Granting Ticket (TGT)/Service Ticket (ST): An ST is a ticket used to authenticate access to a specific service on a host; it includes the SPN of the service to be accessed. Before requesting an ST, however, a client must hold a TGT, which it presents when requesting the ST. To obtain a TGT, the client runs the kinit command to authenticate.

  • Authentication Server (AS)/Ticket Granting Server (TGS): The AS is the server that provides a TGT to a client. The TGS is the server that provides an ST to a client if the client has a valid TGT. When using Active Directory, both servers reside on the same host.
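The principal names seen in the logs above follow the pattern primary[/instance]@REALM. As a small illustration, the following Python sketch splits such a name into its parts. The Principal type and the parse_principal function are names I made up for this example, and the parser only handles the simple forms shown in this post.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str              # user name or service name, e.g. "yarn"
    instance: Optional[str]   # host part of an SPN, None for a plain user
    realm: str                # e.g. "TESTDOM.LOCAL"


def parse_principal(name: str) -> Principal:
    """Split 'primary[/instance]@REALM' into its parts."""
    local, sep, realm = name.rpartition("@")
    if not sep or not local or not realm:
        raise ValueError(f"not a full principal name: {name!r}")
    primary, slash, instance = local.partition("/")
    return Principal(primary, instance if slash else None, realm)


user = parse_principal("kerberos-user-0109@TESTDOM.LOCAL")
service = parse_principal("yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL")
print(user)
print(service)
```

The user principal has no instance part, while the YARN principal carries the host it runs on, which is why each service on each host gets its own principal in the OU.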

4. How Kerberos Authentication Works Internally

The following figure is an example of the Kerberos workflow. To explain how Kerberos authentication works internally, I would like to follow the steps in the figure. Note that these steps do not occur every time you access a Hadoop service. Once Client gets a ticket from the Kerberos server, the ticket is valid for a while. When Client's tickets expire, new tickets are generated by the following procedure.

image.png
https://www.safaribooksonline.com/library/view/hadoop-security/9781491900970/ch04.html

To understand the design in more detail, I also referred to the Kerberos section of the book linked above.
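The point that a ticket is reused until it expires can be sketched as a small cache. This is a toy model in Python, not how the real credential cache is implemented. The TicketCache class, the fake Kerberos server, the fake clock, and the 10-minute lifetime are all assumptions made for the example.

```python
import time


class TicketCache:
    """Reuse a ticket until it expires, then ask the Kerberos server for a new one."""

    def __init__(self, request_ticket, clock=time.time):
        self._request_ticket = request_ticket  # callable: spn -> (ticket, lifetime_seconds)
        self._clock = clock
        self._tickets = {}  # spn -> (ticket, expires_at)

    def get(self, spn):
        cached = self._tickets.get(spn)
        if cached and cached[1] > self._clock():
            return cached[0]  # still valid: no round trip to the Kerberos server
        ticket, lifetime = self._request_ticket(spn)
        self._tickets[spn] = (ticket, self._clock() + lifetime)
        return ticket


# Usage with a fake Kerberos server and a fake clock (both hypothetical).
now = [0.0]
issued = []


def fake_kerberos_server(spn):
    issued.append(spn)
    return f"ticket-{len(issued)}-for-{spn}", 600  # assumed 10-minute lifetime


cache = TicketCache(fake_kerberos_server, clock=lambda: now[0])
first = cache.get("yarn/host@REALM")
second = cache.get("yarn/host@REALM")  # served from the cache
now[0] = 601.0                         # the ticket has expired
third = cache.get("yarn/host@REALM")   # requested again
print(first, second, third, len(issued))
```

Only two requests reach the fake server for three accesses, which is the behavior the steps below rely on.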

Step 1: Client Sends AS_REQ to Authentication Server

First of all, Client sends AS_REQ to AS to authenticate itself. Steps 1 to 3 correspond to the kinit part of the example above.
The contents of AS_REQ are shown below. Note that "Client Name" is the UPN of Client and "Service name" is the SPN of the Ticket Granting Server.

image.png

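Step 2: Authentication Server Sends TGT Back to Client

After AS receives AS_REQ, it verifies the contents, for example whether Client's UPN is registered in Active Directory. Once the request is validated, AS returns the Ticket Granting Ticket to Client inside AS_REP. The part of AS_REP that carries the session key is encrypted with a key derived from the password of Client's UPN, while the TGT itself is encrypted with a key known only to the Kerberos server, so Client cannot read or modify it.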
The following figure shows the contents of AS_REP.
image.png

Step 3: Decode TGT

Then you are prompted to enter the password for Client's UPN. If you enter the password correctly, Client can decrypt AS_REP and now holds a TGT with which to request a Service Ticket. The TGT is stored in the credential cache.

At this point, Client also retrieves the "Session Key" from AS_REP. With this Session Key, Client does not need to enter a password again in the later Kerberos steps (until the session expires).
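Steps 1 to 3 can be sketched as a toy exchange in Python. Please read it as an illustration of who can decrypt what, not as real Kerberos: the seal/unseal helpers use a home-made XOR keystream with an HMAC tag instead of the real encryption types, the password "s3cret" is made up, and all function names are hypothetical. The krbtgt/REALM@REALM name is the usual form of the Ticket Granting Server's principal.

```python
import hashlib
import hmac
import json
import os


def derive_key(password: str, salt: str) -> bytes:
    """Derive a long-term key from a password (stand-in for Kerberos string-to-key)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), 10_000)


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def seal(key: bytes, payload: dict) -> bytes:
    """Toy authenticated encryption (XOR keystream + HMAC tag). NOT real Kerberos crypto."""
    nonce = os.urandom(16)
    plain = json.dumps(payload).encode()
    cipher = bytes(a ^ b for a, b in zip(plain, _keystream(key, nonce, len(plain))))
    tag = hmac.new(key, nonce + cipher, hashlib.sha256).digest()
    return nonce + tag + cipher


def unseal(key: bytes, blob: bytes) -> dict:
    nonce, tag, cipher = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(key, nonce + cipher, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or tampered message")
    plain = bytes(a ^ b for a, b in zip(cipher, _keystream(key, nonce, len(cipher))))
    return json.loads(plain)


REALM = "TESTDOM.LOCAL"
UPN = "kerberos-user-0109@" + REALM
# Hypothetical directory contents: the password is made up for this sketch.
USER_KEYS = {UPN: derive_key("s3cret", UPN)}
TGS_KEY = os.urandom(32)  # known only to the Kerberos server


def authentication_server(as_req: dict) -> dict:
    """Step 2: validate AS_REQ and build AS_REP."""
    client_key = USER_KEYS[as_req["client"]]  # KeyError if the UPN is not registered
    session_key = os.urandom(32).hex()
    tgt = seal(TGS_KEY, {"client": as_req["client"], "session_key": session_key})
    enc_part = seal(client_key, {"session_key": session_key, "service": as_req["service"]})
    return {"tgt": tgt, "enc_part": enc_part}


# Step 1: the client sends AS_REQ naming itself and the Ticket Granting Server.
as_rep = authentication_server({"client": UPN, "service": f"krbtgt/{REALM}@{REALM}"})

# Step 3: the client decrypts its part with the key derived from the typed password.
session = unseal(derive_key("s3cret", UPN), as_rep["enc_part"])
print("session key received:", len(session["session_key"]) == 64)

try:
    unseal(derive_key("wrong-password", UPN), as_rep["enc_part"])
except ValueError as exc:
    print("wrong password:", exc)
```

In this model the password never travels over the network; it is only used locally to open the part of AS_REP that holds the session key.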

Step 4: Client Sends TGS_REQ to Ticket Granting Server

Now Client is ready to request a Service Ticket. Steps 4 to 7 correspond to the hadoop jar hadoop-examples.jar part of the example above. In other words, this process starts when you interact with Hadoop services such as YARN, Spark and Hive. When Client accesses a service, TGS_REQ is sent to the Ticket Granting Server.

The following figure shows the contents of TGS_REQ. As you can see, the packet contains the TGT, which proves that the request comes from an authenticated user. The TGT is also the basis on which the ticket for the service is generated. In addition, it is important to know that "Service Principal" is the service to which Client is requesting to connect. In the example above, it is yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL.
image.png

Step 5: Ticket Granting Server Sends Service Ticket Back to Client

When TGS receives a TGS_REQ, it checks whether an SPN registered in Active Directory matches the "Service Principal" in the TGS_REQ. If a matching SPN is found, TGS issues a Service Ticket that allows Client to use the service. The ST is returned to Client inside the TGS_REP shown in the figure below.
As you can see, the "Service Ticket" contains a lot of information. However, it is encrypted with a hash of the service's password, so Client cannot read it.
image.png

Step 6: Submit a Job with a Valid Service Ticket

After receiving the ST, Client stores the ticket in the cache. Client is finally ready to access a Hadoop service with valid tickets and submit a job.

Step 7: Service Response

Because Client has been validated by Kerberos authentication, the job submitted by Client is executed and its output is returned.
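Steps 4 to 7 can be sketched in the same spirit. To keep the focus on the message flow, this Python sketch uses no cryptography at all: the hypothetical Sealed class stands in for "encrypted with this key" and only releases its contents to whoever presents the right key. The keys, the one-hour TGT lifetime, and the function names are assumptions, and the authenticator that real Kerberos sends along with each ticket is left out.

```python
import secrets
import time


class Sealed:
    """Stand-in for 'encrypted with key': contents are released only for the right key."""

    def __init__(self, key, payload):
        self._key, self._payload = key, payload

    def open(self, key):
        if key != self._key:
            raise PermissionError("wrong key")
        return dict(self._payload)


TGS_KEY = secrets.token_hex(16)
# Hypothetical SPN registry, standing in for the principals stored in Active Directory.
SERVICE_KEYS = {"yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL": secrets.token_hex(16)}


def ticket_granting_server(tgs_req):
    """Step 5: check the TGT and the requested SPN, then issue a Service Ticket."""
    tgt = tgs_req["tgt"].open(TGS_KEY)  # only the TGS can read the TGT
    if tgt["expires"] < time.time():
        raise PermissionError("TGT expired")
    spn = tgs_req["service_principal"]
    if spn not in SERVICE_KEYS:
        raise LookupError(f"unknown SPN: {spn}")
    service_session_key = secrets.token_hex(16)
    ticket = Sealed(SERVICE_KEYS[spn],
                    {"client": tgt["client"], "session_key": service_session_key})
    # The client's copy of the new key is protected by the TGT session key.
    enc_part = Sealed(tgt["session_key"], {"session_key": service_session_key})
    return {"service_ticket": ticket, "enc_part": enc_part}


def service_accepts(spn, service_ticket):
    """Steps 6-7: the service opens the ticket with its own key and learns who is calling."""
    return service_ticket.open(SERVICE_KEYS[spn])["client"]


# State the client holds after kinit (Steps 1-3); the values are made up.
tgt_session_key = secrets.token_hex(16)
tgt = Sealed(TGS_KEY, {"client": "kerberos-user-0109@TESTDOM.LOCAL",
                       "session_key": tgt_session_key,
                       "expires": time.time() + 3600})

spn = "yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL"
tgs_rep = ticket_granting_server({"tgt": tgt, "service_principal": spn})  # Step 4
client_view = tgs_rep["enc_part"].open(tgt_session_key)  # no password needed here
print(service_accepts(spn, tgs_rep["service_ticket"]))
```

Client can open only its own part of TGS_REP; the Service Ticket itself can be opened only by the service that owns the SPN.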

Note

You might notice the following message in the log above when accessing the YARN service.
18/02/02 08:46:52 INFO hdfs.DFSClient: Created token for kerberos-user-0109: HDFS_DELEGATION_TOKEN
According to this document, Hadoop services such as YARN and MapReduce need to access the Name Node several times during a single job execution. Instead of performing Kerberos authentication on every access, the Name Node issues "Delegation Tokens", which are used from the second access onward.
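The token line in the log also tells you how long the delegation token can live. As a small worked example, the Python sketch below pulls the key=value fields out of that log line and subtracts issueDate from maxDate, both of which are timestamps in milliseconds. The token_fields helper is a name I made up, and the regular expression only targets the format of this particular line.

```python
import re

LOG_LINE = (
    "18/02/02 08:46:52 INFO hdfs.DFSClient: Created token for kerberos-user-0109: "
    "HDFS_DELEGATION_TOKEN owner=kerberos-user-0109@TESTDOM.LOCAL, "
    "renewer=yarn/kcdh-master-1.testdom.local@TESTDOM.LOCAL, realUser=, "
    "issueDate=1517579267897, maxDate=1518184067897, sequenceNumber=0, masterKeyId=0 "
    "on cloudera-isi8101.testdom.local:8020"
)


def token_fields(line: str) -> dict:
    """Extract key=value pairs from a delegation-token log line."""
    return dict(re.findall(r"(\w+)=([^,\s]*)", line))


fields = token_fields(LOG_LINE)
lifetime_ms = int(fields["maxDate"]) - int(fields["issueDate"])
print(fields["owner"], fields["renewer"])
print(lifetime_ms / (24 * 60 * 60 * 1000), "days")  # 7.0 days
```

So in this cluster the token issued to kerberos-user-0109 can be kept alive by the YARN renewer for at most 7 days.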