AWS EMR Tutorial with hands-on session part 2

Type “functions” in the AWS Management Console search bar and select “Step Functions” from the dropdown.

  1. Create a State Machine:
    • Click on the “Create state machine” button.
    • For the “Name” field, enter a name for your state machine, like “EMROrchestration.”
    • In the “Definition” section, select “Edit code inline.”
    • Copy and paste the JSON definition for your state machine. This JSON definition will include the steps you want to orchestrate, such as starting a Spark step, a Hive step, and a Pig step on your EMR cluster. Make sure you have the correct paths to your scripts on S3 and any other parameters required for your EMR steps.
  2. Create Execution Role:
    • If you haven’t created an IAM role for Step Functions before, you can create one by clicking the “Create an IAM role” link. This role should have the permissions needed to perform the EMR and S3 actions your workflow uses.
  3. Review and Create the State Machine:
    • Review the settings, then click “Create state machine” to create your state machine.
  4. Start a State Machine Execution:
    • Once your state machine is created, you can start an execution by clicking the “Start execution” button.
    • In the “Input” section, you can provide any input data required for your state machine. This input data will be used as parameters in your EMR steps if needed.
  5. Monitor Execution:
    • You can monitor the progress of your state machine execution in the Step Functions console. It shows which step is currently running, which steps have completed, and whether any have failed.
  6. View Execution Logs:
    • If you encounter any issues or want to view detailed logs for each step, navigate to the Amazon EMR console and select your cluster. From there, you can view the logs for each step that ran on the cluster.
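
As a rough illustration of the JSON definition mentioned in step 1, the sketch below builds a three-step state machine (Spark, then Hive, then Pig) in Python and serializes it. It uses the Step Functions EMR integration resource `arn:aws:states:::elasticmapreduce:addStep.sync` and `command-runner.jar`. The bucket name, script paths, state names, and the `ClusterId` input key are all placeholders assumed for this example, not values from this tutorial.

```python
import json

# Hypothetical script locations on S3; replace with your own paths.
SCRIPTS = {
    "Spark": ["spark-submit", "s3://my-bucket/scripts/job.py"],
    "Hive": ["hive-script", "--run-hive-script", "--args",
             "-f", "s3://my-bucket/scripts/query.q"],
    "Pig": ["pig-script", "--run-pig-script", "--args",
            "-f", "s3://my-bucket/scripts/job.pig"],
}


def emr_step_state(name, args, next_state=None):
    """Build one Task state that submits an EMR step and waits for it."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
        "Parameters": {
            # Read the cluster ID from the execution input (assumed key).
            "ClusterId.$": "$.ClusterId",
            "Step": {
                "Name": f"{name} step",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
            },
        },
        # null ResultPath discards the task result so that the original
        # input (including ClusterId) is passed on to the next state.
        "ResultPath": None,
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_definition(scripts):
    """Chain the given steps one after another, in dictionary order."""
    names = list(scripts)
    states = {}
    for i, name in enumerate(names):
        nxt = f"Run{names[i + 1]}Step" if i + 1 < len(names) else None
        states[f"Run{name}Step"] = emr_step_state(name, scripts[name], nxt)
    return {
        "Comment": "Run Spark, Hive and Pig steps on an existing EMR cluster",
        "StartAt": f"Run{names[0]}Step",
        "States": states,
    }


definition = build_definition(SCRIPTS)
definition_json = json.dumps(definition, indent=2)

# Example execution input for step 4; the cluster ID is a placeholder.
execution_input = json.dumps({"ClusterId": "j-XXXXXXXXXXXXX"})

if __name__ == "__main__":
    print(definition_json)
```

The printed JSON is what you would paste into the “Definition” editor, and `execution_input` is the kind of value that goes in the “Input” section when you start an execution.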

By using AWS Step Functions, you can easily automate and orchestrate complex workflows involving EMR clusters and other AWS services. It simplifies the process of managing and monitoring these workflows, making it more efficient and less error-prone.

Please note that creating JSON definitions for state machines and configuring IAM roles may require a good understanding of AWS services and permissions. Be sure to follow AWS best practices and security guidelines when setting up these components.
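
To make the IAM part more concrete, here is a minimal sketch of the two policy documents such an execution role typically needs: a trust policy that lets Step Functions assume the role, and a permissions policy for submitting EMR steps. The cluster ARN is a placeholder, and the action list is an assumption based on what the `addStep.sync` integration generally requires; check it against the AWS documentation for your own workflow. Access to the scripts on S3 is usually granted to the cluster’s own instance role rather than to this one.

```python
import json

# Trust policy: allows the Step Functions service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "states.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def emr_step_permissions(cluster_arn):
    """Permissions for adding, polling and cancelling steps on one cluster."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:AddJobFlowSteps",
                    "elasticmapreduce:DescribeStep",
                    "elasticmapreduce:CancelSteps",
                ],
                # Scope the policy to a single cluster instead of "*".
                "Resource": cluster_arn,
            }
        ],
    }


# Placeholder ARN; substitute your region, account ID and cluster ID.
permissions_policy = emr_step_permissions(
    "arn:aws:elasticmapreduce:us-east-1:111122223333:cluster/j-XXXXXXXXXXXXX"
)

if __name__ == "__main__":
    print(json.dumps(trust_policy, indent=2))
    print(json.dumps(permissions_policy, indent=2))
```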

Next, we’ll explore EMR (Elastic MapReduce) Auto Scaling, a feature that allows you to dynamically adjust the number of instances in your EMR cluster based on conditions and rules you define. This can help optimize performance and costs by giving you the right amount of compute resources when you need them and scaling down when you don’t.

Setting Up EMR Auto Scaling

  1. Adjust Concurrency Level:
    • To get started, navigate to your EMR cluster in the AWS Management Console.
    • Click on the “Steps” tab.
    • Locate the “Concurrency” level, which is currently set to 1.
    • Change it to 5 to allow running up to five steps concurrently.
    • Make sure to save your changes.
  2. Create a Custom Scaling Policy:
    • Go back to the cluster summary page.
    • Click “Cluster Scaling Policy.”
    • Edit the existing scaling policy or create a new one as needed.
    • Set the minimum and maximum number of instances according to your requirements (e.g., 2 minimum and 5 maximum).
  3. Define Scaling Rules:
    • Create scaling rules that determine when instances should be added or removed based on conditions.
    • For example, you can create an “Add Node” rule that adds one instance when the number of running applications (apps running) is greater than or equal to 2 for one 5-minute period. Set a cooldown period for the rule (e.g., 60 seconds).
    • Create a corresponding “Remove Node” rule that removes one instance when the number of running applications (apps running) is less than 2 for one 5-minute period, with a similar cooldown period.
  4. Attach the Scaling Policy:
    • After you save the policy, wait for it to transition from “Pending” to “Attached.” This may take a moment.
  5. Create an EMR State Machine:
    • Open the Step Functions console.
    • Create a new state machine.
    • Define your workflow using the EMR state machine code, which specifies the steps and actions you want to perform.
    • You can find the sample workflow code in the provided JSON file.
  6. Configure State Machine Inputs:
    • For this step, you will need to specify inputs for the state machine execution.
    • Copy the EMR state machine arguments file from your GitHub repository.
    • Update the cluster ID and S3 bucket names in the arguments file to match your cluster and resources.
  7. Start the State Machine Execution:
    • Initiate the state machine execution with the updated arguments.
    • The state machine will start running the specified EMR steps, causing applications to run on the cluster.
  8. Monitor Auto Scaling:
    • Go to the EMR cluster monitoring section.
    • Monitor the number of running applications (apps running) and other relevant metrics.
    • As the workload increases, you should observe the cluster automatically scaling out by adding instances to handle the load.
  9. Observe Scaling Down:
    • Once the workload decreases, the cluster will scale down by removing instances, as specified in your scaling policies.
    • Monitor the cluster’s behavior during this scaling-down phase.
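
The scaling policy from steps 2 and 3 can also be written down as a JSON document. The sketch below builds one in Python in the shape used by EMR automatic scaling policies (2 to 5 instances, add or remove one node based on the `AppsRunning` metric over one 5-minute period, 60-second cooldown). The rule names and the `next_capacity` helper are made up for this example. The helper is only a simplified local model of how the rules behave; the real evaluation is performed by CloudWatch alarms, not by this code.

```python
import json
import operator


def scaling_rule(name, comparison, threshold, adjustment, cooldown=60):
    """One rule: change capacity by `adjustment` when the alarm condition holds."""
    return {
        "Name": name,
        "Description": f"{name} based on running applications",
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": adjustment,
                "CoolDown": cooldown,  # seconds
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": comparison,
                "EvaluationPeriods": 1,
                "MetricName": "AppsRunning",
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,  # one 5-minute period
                "Statistic": "AVERAGE",
                "Threshold": threshold,
                "Unit": "COUNT",
                "Dimensions": [
                    {"Key": "JobFlowId", "Value": "${emr.clusterId}"}
                ],
            }
        },
    }


policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 5},
    "Rules": [
        scaling_rule("AddNode", "GREATER_THAN_OR_EQUAL", 2, 1),
        scaling_rule("RemoveNode", "LESS_THAN", 2, -1),
    ],
}

_OPS = {
    "GREATER_THAN_OR_EQUAL": operator.ge,
    "GREATER_THAN": operator.gt,
    "LESS_THAN": operator.lt,
    "LESS_THAN_OR_EQUAL": operator.le,
}


def next_capacity(current, apps_running, policy):
    """Simplified model: apply the first matching rule, then clamp to the limits."""
    for rule in policy["Rules"]:
        alarm = rule["Trigger"]["CloudWatchAlarmDefinition"]
        if _OPS[alarm["ComparisonOperator"]](apps_running, alarm["Threshold"]):
            action = rule["Action"]["SimpleScalingPolicyConfiguration"]
            current += action["ScalingAdjustment"]
            break
    limits = policy["Constraints"]
    return max(limits["MinCapacity"], min(limits["MaxCapacity"], current))


if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
    capacity = 2
    for apps in [0, 3, 4, 5, 5, 1, 0, 0]:  # made-up AppsRunning readings
        capacity = next_capacity(capacity, apps, policy)
        print(f"apps running={apps} -> capacity={capacity}")
```

With the sample readings, the modeled capacity climbs from 2 to 5 while applications are running and falls back to 2 afterwards, which is the scale-out and scale-in pattern you should see in steps 8 and 9.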


Congratulations! You’ve successfully set up EMR Auto Scaling in your EMR cluster. This feature allows you to dynamically adjust cluster resources based on workload demands, optimizing both performance and cost-efficiency. EMR Auto Scaling is a valuable tool for managing your big data processing tasks in a flexible and efficient manner.

Thank you for following along with this tutorial series on Amazon EMR. We hope you found it informative and useful. If you have any questions or need further assistance, please feel free to reach out or explore other AWS services to expand your cloud computing skills.
