Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR supports Kerberos for authentication; you can enable Kerberos on Amazon EMR and put the cluster in a private subnet to maximize security.
To access the cluster, the best practice is to use a Network Load Balancer (NLB) to expose only specific ports, which are access-controlled via security groups. By default, the NLB prevents Kerberos ticket authentication to any Amazon EMR service.
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as SparkContext management, all via a simple REST interface or an RPC client library.
In this post, we discuss how to provide Kerberos ticket access to Livy for external systems like Airflow and Notebooks using an NLB. You can apply this process to other Amazon EMR services beyond Livy, such as Trino and Hive.
The following are the high-level steps required:
- Create an EMR cluster with Kerberos security configuration.
- Create an NLB with required listeners and target groups.
- Update the Kerberos Key Distribution Center (KDC) with a new service principal and create the corresponding keytabs.
- Update the Livy configuration file.
- Verify Livy is accessible via the NLB.
- Run the Python Livy test case.
The advanced configuration presented in this post assumes familiarity with Amazon EMR, Kerberos, Livy, Python, and bash.
Create an EMR cluster
Create the Kerberos security configuration using the AWS Command Line Interface (AWS CLI) as follows (this creates the KDC on the EMR primary node):
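As a minimal sketch of what this step involves, the following Python builds a security configuration document for a cluster-dedicated KDC and the matching `aws emr create-security-configuration` command line. The configuration name and the 24-hour ticket lifetime are assumptions chosen for illustration, not values from this post.

```python
import json
import shlex

# Hypothetical name; use whatever fits your naming scheme.
SECURITY_CONFIG_NAME = "kerberos-cluster-dedicated-kdc"


def kerberos_security_configuration(ticket_lifetime_hours=24):
    """Return a security configuration document for a cluster-dedicated KDC."""
    return {
        "AuthenticationConfiguration": {
            "KerberosConfiguration": {
                "Provider": "ClusterDedicatedKdc",
                "ClusterDedicatedKdcConfiguration": {
                    "TicketLifetimeInHours": ticket_lifetime_hours
                },
            }
        }
    }


def create_security_configuration_command(name, config):
    """Build the argv for `aws emr create-security-configuration`."""
    return [
        "aws", "emr", "create-security-configuration",
        "--name", name,
        "--security-configuration", json.dumps(config),
    ]


# Print a shell-quoted command that can be reviewed before running it.
print(shlex.join(create_security_configuration_command(
    SECURITY_CONFIG_NAME, kerberos_security_configuration())))
```

Building the argv as a list and quoting it with `shlex.join` avoids hand-escaping the embedded JSON.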
It’s a security best practice to keep passwords in AWS Secrets Manager. You can use a bash function like the following as the argument to the --kerberos-attributes option so no passwords are stored in the launch script or command line. The function outputs the required JSON for the --kerberos-attributes option after retrieving the password from Secrets Manager.
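The same idea can be sketched in Python rather than bash. This version assumes the secret is stored as a plain string (not a JSON key/value secret) and that `boto3` and AWS credentials are available when the default fetcher is used; the function names are hypothetical.

```python
import json


def fetch_secret_string(secret_id):
    """Read a secret from AWS Secrets Manager (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so kerberos_attributes() can be tested offline

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def kerberos_attributes(realm, secret_id, fetch=fetch_secret_string):
    """Return the JSON string passed to the --kerberos-attributes option.

    Assumption: the secret value is the KDC admin password itself.
    """
    return json.dumps({"Realm": realm, "KdcAdminPassword": fetch(secret_id)})
```

Passing the fetcher as a parameter keeps the password lookup separate from the JSON formatting, so the formatting can be checked without touching Secrets Manager.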
Create the cluster using the AWS CLI as follows:
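A hedged sketch of the cluster launch follows: it assembles an `aws emr create-cluster` argv with Spark and Livy, the security configuration name, and the Kerberos attributes JSON. The instance type, instance count, and all argument values are placeholders, and your cluster will likely need further options.

```python
def create_cluster_command(name, release_label, subnet_id, key_name,
                           security_configuration, kerberos_attributes_json,
                           instance_type="m5.xlarge", instance_count=3):
    """Build the argv for `aws emr create-cluster` with Kerberos enabled.

    instance_type and instance_count defaults are illustrative assumptions.
    """
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Spark", "Name=Livy",
        "--ec2-attributes", f"SubnetId={subnet_id},KeyName={key_name}",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--security-configuration", security_configuration,
        "--kerberos-attributes", kerberos_attributes_json,
    ]
```

The resulting list can be handed to `subprocess.run(argv, check=True)`; avoid printing it, because it contains the KDC admin password.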
Create an NLB
Create an internet-facing NLB with TCP listeners in your VPC and subnet. An internet-facing load balancer routes requests from clients to targets over the internet. Conversely, an internal NLB routes requests to targets using private IP addresses. For instructions, refer to Create a Network Load Balancer.
The following screenshot shows the listener details.
Create target groups and register the EMR primary instance (Livy3) and KDC instance (KDC3). For this post, these instances are the same; use the respective instances if KDC is running on a different instance.
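The two target groups can be described as data, as in the sketch below. It produces keyword arguments in the shape accepted by the boto3 `elbv2` client's `create_target_group` and `register_targets` calls. The ports 8998 (Livy) and 88 (KDC) come from this post; the `TargetGroupSpec` type and helper names are hypothetical, and using the instance target type is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetGroupSpec:
    name: str
    port: int
    instance_id: str


def target_group_specs(livy_instance_id, kdc_instance_id=None):
    """Describe the Livy3 and KDC3 target groups.

    If no separate KDC instance is given, the EMR primary instance hosts both.
    """
    kdc_instance_id = kdc_instance_id or livy_instance_id
    return [
        TargetGroupSpec("Livy3", 8998, livy_instance_id),
        TargetGroupSpec("KDC3", 88, kdc_instance_id),
    ]


def create_target_group_kwargs(spec, vpc_id):
    """Keyword arguments for elbv2.create_target_group (TCP, instance targets)."""
    return {"Name": spec.name, "Protocol": "TCP", "Port": spec.port,
            "VpcId": vpc_id, "TargetType": "instance"}


def register_targets_kwargs(spec, target_group_arn):
    """Keyword arguments for elbv2.register_targets."""
    return {"TargetGroupArn": target_group_arn,
            "Targets": [{"Id": spec.instance_id, "Port": spec.port}]}
```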
The KDC and EMR security groups must allow the NLB’s private IP address to access ports 88 and 8998, respectively. You can find the NLB’s private IP address by searching the elastic network interfaces for the NLB’s name. For access control instructions, refer to this article on the knowledge center. Now that the security groups allow access, the NLB health check should pass, but Livy isn’t usable via the NLB until you make further changes (detailed in the following sections). The NLB is actually being used as a proxy to access Livy rather than doing any load balancing.
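The security group rules described above might be expressed as follows. This sketch builds the `IpPermissions` structure used by the boto3 EC2 client's `authorize_security_group_ingress` call, with one /32 range per NLB private IP address. The helper name and rule description text are hypothetical.

```python
import ipaddress


def nlb_ingress_permissions(nlb_private_ips, port):
    """Allow TCP traffic on `port` from each NLB private IPv4 address only."""
    ranges = [
        # IPv4Address() raises ValueError on malformed input.
        {"CidrIp": f"{ipaddress.IPv4Address(ip)}/32",
         "Description": "NLB private IP"}
        for ip in nlb_private_ips
    ]
    return [{"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
             "IpRanges": ranges}]


# Example use with boto3 (group IDs and IP address are placeholders):
#   ec2.authorize_security_group_ingress(
#       GroupId="sg-kdc", IpPermissions=nlb_ingress_permissions(["10.0.1.25"], 88))
#   ec2.authorize_security_group_ingress(
#       GroupId="sg-emr", IpPermissions=nlb_ingress_permissions(["10.0.1.25"], 8998))
```

An NLB has one private IP address per enabled subnet, so the helper accepts a list.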
Update the Kerberos KDC
The KDC used by the Livy service must contain a new HTTP Service Principal Name (SPN) using the public NLB host name.
- You can create the new principal from the EMR primary host using the full NLB public host name:
Replace the fully qualified domain name (FQDN) and Kerberos realm as needed. Ensure the NLB hostname is all lowercase.
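To make the naming rule concrete, the sketch below derives the HTTP SPN from an NLB host name (forcing it to lowercase) and builds a `kadmin.local` command that adds the principal with a random key. The helper names are hypothetical, and running `kadmin.local` through `sudo` on the KDC host is an assumption about your environment.

```python
def http_spn(nlb_hostname, realm):
    """Return the HTTP service principal name for the NLB host.

    The host part is lowercased; the realm is used exactly as given.
    """
    return f"HTTP/{nlb_hostname.strip().lower()}@{realm.strip()}"


def addprinc_command(spn):
    """Build the argv that adds the SPN to the KDC with a random key."""
    return ["sudo", "kadmin.local", "-q", f"addprinc -randkey {spn}"]
```

For example, `http_spn("My-NLB-123.elb.us-east-1.amazonaws.com", "EC2.INTERNAL")` yields `HTTP/my-nlb-123.elb.us-east-1.amazonaws.com@EC2.INTERNAL` (both inputs are placeholders).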
After the new SPN exists, you create two keytabs containing that SPN. The first keytab is for the Livy service. The second keytab, which must use the same KVNO as the first keytab, is for the Livy client.
- Create Livy service keytab as follows:
Note the key version number (KVNO) for the HTTP principal in the output of the preceding klist command. The KVNO for the HTTP principal must match the KVNO in the user keytab. Copy the livy2.keytab file to the EMR cluster Livy host if it’s not already there.
- Create a user or client keytab as follows:
Note the -norandkey option used when adding the SPN. That preserves the KVNO created in the preceding
- Copy the user1.keytab to the client machine running the Python test case as
Replace the FQDN, realm, and keytab path as needed.
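Because a KVNO mismatch between the two keytabs is easy to miss, a small check like the following may help. It parses the text printed by `klist -kt <keytab>` for each keytab and compares the KVNOs listed for the HTTP principal. It assumes the usual MIT Kerberos layout of one entry per line (KVNO, date, time, principal); the function names are hypothetical.

```python
import re

# One keytab entry: KVNO, date, time, principal (an optional enctype may follow).
_ENTRY = re.compile(r"^\s*(\d+)\s+\S+\s+\S+\s+(\S+)")


def keytab_kvnos(klist_output):
    """Map each principal in `klist -kt` output to the set of KVNOs listed."""
    kvnos = {}
    for line in klist_output.splitlines():
        match = _ENTRY.match(line)
        if match:
            kvnos.setdefault(match.group(2), set()).add(int(match.group(1)))
    return kvnos


def kvnos_match(service_klist, user_klist, spn):
    """True if the SPN appears in both keytabs with identical KVNOs."""
    service = keytab_kvnos(service_klist).get(spn)
    user = keytab_kvnos(user_klist).get(spn)
    return bool(service) and service == user
```

Header and separator lines do not start with a number, so they are skipped without special handling.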
Update the Livy configuration file
The primary change on the EMR cluster primary node running the Livy service is to the /etc/livy/conf/livy.conf file. You change the authentication principal that Livy uses, as well as the associated Kerberos keytab created earlier.
- Make the following changes to the livy.conf file with sudo:
Don’t change the
- Restart and verify the Livy service:
- Verify the Livy port is listening:
You can automate these steps (modifying the KDC and Livy config file) by adding a step to the EMR cluster. For more information, refer to Tutorial: Configure a cluster-dedicated KDC.
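As one way to script the livy.conf change, the sketch below rewrites only the two Kerberos authentication settings and leaves every other line untouched. The property names are Livy's standard `livy.server.auth.kerberos.*` keys; the function name, keytab path, and SPN shown are placeholders.

```python
import re

AUTH_PRINCIPAL = "livy.server.auth.kerberos.principal"
AUTH_KEYTAB = "livy.server.auth.kerberos.keytab"


def update_livy_conf(text, spn, keytab_path):
    """Return livy.conf text with the auth principal and keytab replaced.

    Lines for any other key, comments, and blank lines are kept as they are.
    A setting that is missing from the file is appended at the end.
    """
    wanted = {AUTH_PRINCIPAL: spn, AUTH_KEYTAB: keytab_path}
    seen = set()
    out = []
    for line in text.splitlines():
        # The key is everything before the first "=" or whitespace.
        key = re.split(r"[\s=]+", line.strip(), maxsplit=1)[0]
        if key in wanted:
            out.append(f"{key} = {wanted[key]}")
            seen.add(key)
        else:
            out.append(line)
    for key, value in wanted.items():
        if key not in seen:
            out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"
```

A wrapper would read /etc/livy/conf/livy.conf, write the result back with root privileges, and then restart the Livy service.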
Verify Livy is accessible via the NLB
You can now use user1.keytab to authenticate against the Livy REST endpoint. Copy the user1.keytab you created earlier to the host and user account that will run the Livy test case. The host running the test case must be configured to connect to the modified KDC.
- Create a Linux user (user1) on the client host and the EMR cluster.
If the client host has a terminal available that the user can run commands in, you can use the following commands to verify network connectivity to Livy before running the actual Livy Python test case.
- Verify the NLB host and port are reachable (no data will be returned by the nc command):
- Create a TLS connection, which returns the server’s TLS certificate and TCP packets:
If the openssl command doesn’t return a TLS server certificate, the rest of the verification won’t succeed. You may have a proxy or firewall interfering with the connection. Investigate your network environment, resolve the issue, and repeat the openssl command to ensure connectivity.
- Verify the Livy REST endpoint using curl. This verifies Livy REST but not Spark.
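If the client host has Python but not nc or openssl, the first two connectivity checks can be approximated with the standard library, as sketched below. The function names are hypothetical, and the TLS helper assumes the Livy endpoint behind the NLB speaks TLS. It deliberately skips certificate verification because it only confirms that a certificate is returned.

```python
import socket
import ssl


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_server_certificate(host, port, timeout=5.0):
    """Return the server's TLS certificate in PEM form, without verifying it."""
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE  # connectivity check only
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return ssl.DER_cert_to_PEM_cert(der)
```

A failed TLS handshake raises an exception from `fetch_server_certificate`, which plays the same role as openssl not returning a certificate.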
Run the Python Livy test case
The Livy test case is a simple Python3 script named livy-verify.py. You can run this script from a client machine to run Spark commands via Livy using the NLB. The script is as follows:
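The listing below is not the authors' livy-verify.py. It is a minimal sketch of how such a test could work against the Livy REST API: create a PySpark session, wait until it is idle, submit one statement, and return its text output. It assumes the third-party `requests` and `requests-kerberos` packages for the real HTTP session; the function names and the `X-Requested-By` header value are illustrative.

```python
import json
import time

HEADERS = {"Content-Type": "application/json", "X-Requested-By": "livy-verify"}


def wait_for(fetch, done, failed, timeout=300, interval=2):
    """Poll fetch() until its 'state' is in `done`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        body = fetch()
        state = body["state"]
        if state in done:
            return body
        if state in failed:
            raise RuntimeError(f"Livy reported state {state!r}")
        if time.monotonic() > deadline:
            raise TimeoutError(f"still {state!r} after {timeout}s")
        time.sleep(interval)


def run_pyspark(http, base_url, code, interval=2):
    """Run `code` in a new PySpark session and return its text output."""
    resp = http.post(f"{base_url}/sessions",
                     data=json.dumps({"kind": "pyspark"}), headers=HEADERS)
    resp.raise_for_status()
    session_url = f"{base_url}/sessions/{resp.json()['id']}"
    try:
        wait_for(lambda: http.get(session_url, headers=HEADERS).json(),
                 {"idle"}, {"error", "dead", "killed", "shutting_down"},
                 interval=interval)
        resp = http.post(f"{session_url}/statements",
                         data=json.dumps({"code": code}), headers=HEADERS)
        resp.raise_for_status()
        statement_url = f"{session_url}/statements/{resp.json()['id']}"
        result = wait_for(lambda: http.get(statement_url, headers=HEADERS).json(),
                          {"available"}, {"error", "cancelled"},
                          interval=interval)
        output = result["output"]
        if output["status"] != "ok":
            raise RuntimeError(output.get("evalue", "statement failed"))
        return output["data"]["text/plain"]
    finally:
        # Always release the Spark session, even if the statement failed.
        http.delete(session_url, headers=HEADERS)


def make_kerberos_session(verify=True):
    """Return a requests.Session that authenticates with the Kerberos ticket cache."""
    import requests  # third-party; imported lazily
    from requests_kerberos import HTTPKerberosAuth, REQUIRED

    session = requests.Session()
    session.auth = HTTPKerberosAuth(mutual_authentication=REQUIRED)
    session.verify = verify  # path to a CA bundle, or False for a lab setup
    return session
```

A call might look like `run_pyspark(make_kerberos_session(), "https://<nlb-host>:8998", "print(sc.parallelize(range(3)).collect())")`, where the NLB host and URL scheme are placeholders for your environment.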
The test case requires the new SPN to be in the user’s Kerberos ticket cache. To get the service principal into the Kerberos cache, use the
kinit command with the
Note the SPN and the User Principal Name (UPN) are both used in the
The Kerberos cache should look like the following code, as revealed by the
Note the HTTP service principal in the klist ticket cache output.
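The ticket acquisition and cache check can be scripted along these lines. The sketch builds a `kinit` command that uses a keytab and, optionally, the `-S` flag to request a ticket for a specific service principal, then looks for the SPN in `klist` output. The function names, keytab path, and principals are hypothetical.

```python
import subprocess


def kinit_command(keytab, upn, spn=None):
    """Build a kinit argv using a keytab; -S requests a ticket for `spn`."""
    argv = ["kinit", "-kt", keytab]
    if spn:
        argv += ["-S", spn]
    argv.append(upn)
    return argv


def cache_has_service_ticket(spn, klist_output=None):
    """True if `spn` is listed as a service principal in the ticket cache.

    If no text is supplied, the local `klist` command is run.
    """
    if klist_output is None:
        klist_output = subprocess.run(
            ["klist"], capture_output=True, text=True, check=True).stdout
    return any(spn in line.split() for line in klist_output.splitlines())
```

Omitting `spn` gives the plain keytab-based `kinit` form, which may be enough in environments where the client can obtain the service ticket on demand.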
After the SPN is in the cache as verified by klist, you can run the following command to verify that Livy accepts the Kerberos ticket and runs the simple PySpark script. It generates a simple array, [0,1,2], as the output. The preceding Python script has been copied to the /var/tmp/user1/ folder in this example.
It can take a minute or so to generate the result. Any authentication errors will happen in seconds. If the test in the new environment generates the preceding array, the Livy Kerberos configuration has been verified.
Any other client program that needs to have Livy access must be a Kerberos client of the KDC that generated the keytabs. It must also have a client keytab (such as user1.keytab or equivalent) and the service principal key in its Kerberos ticket cache.
In some environments, a simple kinit as follows may be sufficient:
If you have an existing EMR cluster running Livy and using Kerberos (even in a private subnet), you can add an NLB to connect to the Livy service and still authenticate with Kerberos. For simplicity, we used a cluster-dedicated KDC in this post, but you can use any KDC architecture option supported by Amazon EMR. This post documented all the KDC and Livy changes to make it work; the script and procedure have been run successfully in multiple environments. You can modify the Python script as needed and try running the verification script in your environment.
For more details about the systems and processes described in this post, refer to the following:
About the Authors
John Benninghoff is an AWS Professional Services Sr. Data Architect, focused on Data Lake architecture and implementation.
Bharat Gamini is a Data Architect focused on Big Data & Analytics at Amazon Web Services. He helps customers architect and build highly scalable, robust and secure cloud-based analytical solutions on AWS. Besides family time, he likes watching movies and sports.