Amazon EMR
This page outlines how to use Amazon EMR via Livy as your Spark execution engine in Prophecy.
These instructions work for both Amazon EMR and Amazon EMR Serverless.
Create Amazon EMR cluster with Apache Livy
In your Amazon EMR service, create a cluster. When doing so:
- Under Application bundle, select Custom.
- When selecting applications, make sure Livy and Spark are included in the install.
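If you create clusters from a script rather than the console, the same application selection can be sketched with boto3. This is a minimal sketch, not Prophecy's documented procedure: the instance types, instance counts, and the default role names `EMR_EC2_DefaultRole` and `EMR_DefaultRole` are assumptions, and the cluster name, release label, and subnet ID are placeholders you supply.

```python
"""Sketch: request parameters for an EMR cluster with Spark and Livy installed."""


def build_cluster_request(name, release_label, subnet_id):
    """Return keyword arguments for boto3's EMR `run_job_flow` call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Mirrors choosing the Custom bundle and including Spark and Livy.
        "Applications": [{"Name": "Spark"}, {"Name": "Livy"}],
        "Instances": {
            "InstanceGroups": [
                # Instance types and counts are illustrative assumptions.
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "Ec2SubnetId": subnet_id,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed default EMR roles; replace with the roles in your account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(region, name, release_label, subnet_id):
    """Submit the request and return the new cluster ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request(name, release_label, subnet_id))
    return response["JobFlowId"]
```

Keeping the request builder separate from the API call lets you review the parameters before anything is created in your account.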
Configure network settings
For EMR to communicate with Prophecy, you need to modify the security groups of your EMR cluster.
- Modify the Primary Node security group to allow incoming connections to port 8998 from the Prophecy IP. To do so, add an inbound rule to the Primary (Master) security group that permits incoming traffic on port 8998 from the Prophecy IP address.
- Modify the Core Node security group to allow outgoing connections to the Prophecy public IP 3.133.35.237 over HTTPS. To do so, add an outbound rule to the Core security group that allows outgoing traffic over the HTTPS protocol to the Prophecy public IP.
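The two rules above can also be expressed as boto3 EC2 permission entries. This is a hedged sketch: it assumes the inbound Prophecy IP is the same public IP given for the outbound rule, that HTTPS means TCP port 443, and that you pass in your own security group IDs. Confirm the correct source IP for your Prophecy deployment before applying anything.

```python
"""Sketch: security group rules for Prophecy <-> EMR traffic."""

PROPHECY_PUBLIC_IP = "3.133.35.237"  # public IP quoted in this page
LIVY_PORT = 8998
HTTPS_PORT = 443  # assumption: standard HTTPS port


def livy_ingress_rule(prophecy_ip=PROPHECY_PUBLIC_IP):
    """Inbound rule for the Primary node group: Prophecy -> Livy on 8998."""
    return {
        "IpProtocol": "tcp",
        "FromPort": LIVY_PORT,
        "ToPort": LIVY_PORT,
        "IpRanges": [{"CidrIp": f"{prophecy_ip}/32",
                      "Description": "Prophecy to Livy"}],
    }


def https_egress_rule(prophecy_ip=PROPHECY_PUBLIC_IP):
    """Outbound rule for the Core node group: EMR -> Prophecy over HTTPS."""
    return {
        "IpProtocol": "tcp",
        "FromPort": HTTPS_PORT,
        "ToPort": HTTPS_PORT,
        "IpRanges": [{"CidrIp": f"{prophecy_ip}/32",
                      "Description": "EMR core nodes to Prophecy"}],
    }


def apply_rules(region, primary_sg_id, core_sg_id):
    """Apply both rules (needs AWS credentials and EC2 permissions)."""
    import boto3  # imported lazily so the rule builders work without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.authorize_security_group_ingress(
        GroupId=primary_sg_id, IpPermissions=[livy_ingress_rule()])
    ec2.authorize_security_group_egress(
        GroupId=core_sg_id, IpPermissions=[https_egress_rule()])
```

Using a `/32` CIDR restricts each rule to the single Prophecy address instead of opening the port to a wider range.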
Create a Fabric
To connect EMR and Prophecy, you must create a Fabric. You can create either an EMR Fabric (recommended) or a Livy Fabric.
- EMR
- EMR Serverless
- Livy
To create an EMR Fabric:
- Open Prophecy and click Create Entity from the left navigation menu. Then, click on the Fabric tile.
- Name your Fabric and click Continue.
- Keep the Provider Type as Spark, and choose EMR as the Provider.
- Enter your AWS credentials under Access Key and Secret Key. Then, enter the region that your EMR cluster is running in.
- Click on Fetch environments.
- Under Spark Environment, select the EMR cluster that you would like to connect to.
- Enter the S3 path that points to the location where you would like your logs to persist.
- Add the Job size to your environment by clicking on Add Job Size. Configure your Job size and click on Add.
- Select File System under Scala Resolution mode and input s3://prophecy-public-bucket/prophecy-libs
- Select File System under Python Resolution mode and input s3://prophecy-public-bucket/python-prophecy-libs
Click Complete to save your new EMR Fabric.
To create an EMR Fabric for your serverless cluster:
- Open Prophecy and click Create Entity from the left navigation menu. Then, click on the Fabric tile.
- Name your Fabric and click Continue.
- Keep the Provider Type as Spark, and choose EMR as the Provider.
- Enter your AWS credentials under Access Key and Secret Key. Then, enter the region that your EMR cluster is running in.
- Click on Fetch environments.
- Under Spark Environment, select the EMR Serverless cluster that you would like to connect to. From the list of active clusters, you can identify your EMR Serverless cluster by finding emr-serverless-services within the Livy URL. This will automatically populate some of the following fields.
- For the Authentication type, select AWS Sig V4. This is the only option that will work for EMR Serverless.
- Select your Runtime Role. The role must have enough permissions to use the selected Serverless App.
- Enter the S3 path that points to the location where you would like your logs and Pipeline artifacts to persist.
- Add the Job size to your environment by clicking on Add Job Size. Configure your Job size and click on Add.
- Note that the Spark and Scala versions are fixed.
- Select File System under Scala Resolution mode and input s3://prophecy-public-bucket/prophecy-libs
- Select File System under Python Resolution mode and input s3://prophecy-public-bucket/python-prophecy-libs
Click Complete to save your new EMR Fabric.
Note: Livy does not work with Python version 3.8 and above.
To create a Livy Fabric:
- Open Prophecy and click Create Entity from the left navigation menu. Then, click on the Fabric tile.
- Name your Fabric and click Continue.
- Keep the Provider Type as Spark, and choose Livy as the Provider.
- Paste your Livy URL. You can find your Livy URL under the Applications tab on your EMR cluster page.
- Choose your Authentication type.
- Click Test Connection.
Once your connection is tested, you'll have to configure a few more settings:
- Fill in the Spark and Scala versions based on the values found in your EMR console. Refer to the Spark documentation to find the Scala version that corresponds to your Spark version.
- For the Prophecy Library configurations, choose the File System Resolution mode and provide the S3 folder names for dependencies.
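As a rough aid for the version step above, the following sketch maps a Spark version string to the Scala version that the default Apache Spark builds of that major line use. The table is an assumption based on upstream Spark defaults (Spark 2.x with Scala 2.11, Spark 3.x with Scala 2.12), and the helper name is hypothetical. Individual releases can differ, so treat the result as a starting point and confirm it against your EMR console.

```python
"""Sketch: look up the usual default Scala version for a Spark version."""

# Assumed defaults for upstream Spark builds; verify for your EMR release.
DEFAULT_SCALA_BY_SPARK_MAJOR = {2: "2.11", 3: "2.12"}


def default_scala_version(spark_version):
    """Return the assumed default Scala version for a string such as '3.4.1'."""
    major = int(spark_version.split(".")[0])
    if major not in DEFAULT_SCALA_BY_SPARK_MAJOR:
        raise ValueError(f"No default known for Spark {spark_version}")
    return DEFAULT_SCALA_BY_SPARK_MAJOR[major]


if __name__ == "__main__":
    print(default_scala_version("3.4.1"))
```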
Click Complete to save your new Livy Fabric.
At this point, you can test your Fabric. Open a project, connect to a cluster, and try to run a pipeline!
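If a Livy Fabric fails to connect, it can help to check the Livy endpoint independently of Prophecy. The sketch below calls Livy's `GET /sessions` REST endpoint with the Python standard library and returns the number of sessions reported. It assumes an unauthenticated Livy URL reachable from where you run it; it does not handle AWS Sig V4 signing, so it is not suitable for EMR Serverless as written. The function names are illustrative.

```python
"""Sketch: confirm a Livy endpoint responds by listing its sessions."""
import json
from urllib.request import urlopen


def sessions_endpoint(livy_url):
    """Build the /sessions URL from a Livy base URL."""
    return livy_url.rstrip("/") + "/sessions"


def count_livy_sessions(livy_url, opener=urlopen, timeout=10):
    """Return the 'total' field from Livy's GET /sessions response.

    `opener` is injectable so the function can be exercised without a
    network connection; it defaults to urllib's urlopen.
    """
    with opener(sessions_endpoint(livy_url), timeout=timeout) as response:
        payload = json.load(response)
    return payload["total"]


if __name__ == "__main__":
    # Placeholder host: replace with the Livy URL from your EMR cluster page.
    print(count_livy_sessions("http://your-primary-node:8998"))
```

A successful call, even one that returns zero sessions, indicates that port 8998 is open and Livy is answering. A timeout usually points back to the security group rules.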