AWS EMR Tutorial with hands-on session

Welcome to our AWS Certified Solutions Architect Professional tutorial series. In this tutorial, we’re going to take a deep dive into AWS EMR (Elastic MapReduce). The course covers configuring and spinning up EMR clusters, running Spark ETL jobs, working with notebooks, Hive, and Pig, orchestrating jobs with Step Functions, and auto-scaling.

Before we dive into the content, I want to let you know that all the resources you’ll need for this course are available in the description below. This includes links to GitHub for the code and Buy Me a Coffee for the slides, and guess what? It’s all completely free!

Now, let’s get started with the course. But before we jump into the technical details, we have some housekeeping to take care of. So, let’s head over to the computer and begin the setup.

Setup: Alright, folks, we’re all set to start the setup process. I’ve logged into the AWS console, and the first thing we need to do is create a new Virtual Private Cloud (VPC). Let’s go ahead and do that:

  1. Type “VPC” in the search bar at the top, and select “VPC” under “Isolated Cloud Resources.”
  2. Click on “Launch VPC Wizard.”
  3. We’ll leave most settings as default, but let’s give our VPC a name. I’ll call it “emr-tutorial.”
  4. For the number of public subnets, you can keep it as the default (2). We only need one, but that’s fine.
  5. Set the number of private subnets to 0 since we won’t need any for this tutorial.
  6. Under “VPC endpoints,” make sure the “S3 Gateway” option is selected. No NAT gateway is needed, since we have no private subnets.
  7. Optionally, you can change the name to something more meaningful.
  8. Click “Create VPC.”
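
If you would rather script the VPC than click through the wizard, the steps above can be sketched roughly as follows with boto3. This is a minimal sketch, not what the wizard does internally: the function name `create_tutorial_vpc`, the CIDR ranges, and the choice of the “b” availability zone are assumptions made for illustration, and it creates a single public subnet instead of the wizard’s default of two.

```python
def create_tutorial_vpc(ec2, region="us-east-1", name="emr-tutorial"):
    """Create a VPC with one public subnet and an S3 gateway endpoint.

    `ec2` is a boto3 EC2 client, e.g. boto3.client("ec2", region_name=region).
    CIDR blocks and the availability zone below are assumptions.
    """
    vpc_id = ec2.create_vpc(
        CidrBlock="10.0.0.0/16",
        TagSpecifications=[
            {"ResourceType": "vpc", "Tags": [{"Key": "Name", "Value": name}]}
        ],
    )["Vpc"]["VpcId"]

    # One public subnet is enough for the tutorial.
    subnet_id = ec2.create_subnet(
        VpcId=vpc_id, CidrBlock="10.0.0.0/20", AvailabilityZone=f"{region}b"
    )["Subnet"]["SubnetId"]

    # An internet gateway plus a default route is what makes the subnet public.
    igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
    rtb_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
    ec2.create_route(
        RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id
    )
    ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet_id)

    # Gateway endpoint so traffic to S3 stays on the AWS network.
    ec2.create_vpc_endpoint(
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.{region}.s3",
        VpcEndpointType="Gateway",
        RouteTableIds=[rtb_id],
    )
    return {"vpc_id": vpc_id, "subnet_id": subnet_id, "route_table_id": rtb_id}
```

The client is passed in as an argument, so the function does nothing until you call it with a real boto3 client and valid credentials.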

The new VPC wizard can be a bit confusing, but our VPC is ready now. Next, we need to set up a Cloud9 development environment. Cloud9 is AWS’s browser-based IDE (Integrated Development Environment). The service itself carries no extra charge, and I’ll be using a free-tier-eligible EC2 instance type, so it shouldn’t cost us anything while we stay within the free tier.

  1. Type “Cloud9” in the search bar at the top and select “Cloud9.”
  2. Click “Create environment.”
  3. Give your environment a name; I’ll call it “emr-tutorial” as well.
  4. Choose the default settings, making sure it’s set to automatically stop after 30 minutes to avoid costs.
  5. Under “Network settings,” select the VPC you just created (in my case, “emr-tutorial VPC”).
  6. Note which subnet you pick (in my case, the one in “us-east-1b”); the EMR cluster will need to go in the same availability zone later.
  7. Click “Next step.”
  8. Choose “Create a new EC2 instance for environment (direct access).”
  9. Make sure the instance type is “t2.micro” to stay within the free tier.
  10. Leave other settings as default and click “Next step.”
  11. Review your settings and click “Create environment.”
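
For reference, the same choices can be expressed as the arguments to boto3’s Cloud9 `create_environment_ec2` call. This is a sketch under assumptions: the helper name `cloud9_environment_request` is hypothetical, the subnet ID is a placeholder, and the `imageId` value is my guess at a reasonable Amazon Linux image alias, not something the console steps above specify.

```python
def cloud9_environment_request(subnet_id, name="emr-tutorial"):
    """Build keyword arguments for cloud9.create_environment_ec2()."""
    return {
        "name": name,
        "instanceType": "t2.micro",            # free-tier-eligible instance type
        "imageId": "amazonlinux-2023-x86_64",  # assumption: image alias not given in the tutorial
        "subnetId": subnet_id,                 # a subnet in the emr-tutorial VPC
        "automaticStopTimeMinutes": 30,        # stop after 30 idle minutes to avoid costs
        "connectionType": "CONNECT_SSH",       # corresponds to "direct access"
    }

# Usage (requires AWS credentials; the subnet ID is a placeholder):
#   import boto3
#   request = cloud9_environment_request("subnet-0123456789abcdef0")
#   boto3.client("cloud9").create_environment_ec2(**request)
```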

The Cloud9 environment will take a moment to set up. Once it’s ready, you can access it to continue with the setup.

Now, let’s move on to creating an SSH key pair for secure access:

  1. Go to EC2 by typing “EC2” in the search bar and selecting “EC2.”
  2. In the left menu, navigate to “Key Pairs” under “Network & Security.”
  3. Click “Create Key Pair.”
  4. Give your key pair a name, like “emr-tutorial.”
  5. Choose the “RSA” key pair type and download the private key file (e.g., “emr-tutorial.pem”).
  6. Keep this file secure, as it will be used for SSH access to your instances.

With the SSH key pair in hand, we’ll upload it to your Cloud9 environment:

  1. Return to your Cloud9 environment.
  2. In the Cloud9 environment, click “File” and then select “Upload Local Files.”
  3. Select the private key file you just downloaded (e.g., “emr-tutorial.pem”).
  4. The key file will appear in your Cloud9 environment.

Now, let’s make sure no one else can access the key file by running the following command:

    chmod 400 emr-tutorial.pem

This command makes the key file readable only by its owner, which SSH requires before it will accept the key. Now you’re all set up and secure for the AWS EMR tutorial. In the next section, we’ll dive into the core concepts of AWS EMR. Let’s jump back to the console and get started.
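
If you are scripting the setup, the same permission change can be applied and checked from Python. This is a small sketch using only the standard library; the function name `restrict_key_file` is hypothetical.

```python
import os
import stat


def restrict_key_file(path):
    """Set the file to owner-read-only (mode 400) and return its mode string.

    For a regular file on Linux or macOS the result is "-r--------".
    """
    os.chmod(path, 0o400)
    return stat.filemode(os.stat(path).st_mode)

# Usage, from the directory that holds the uploaded key:
#   restrict_key_file("emr-tutorial.pem")
```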


Before we start configuring and running EMR clusters, let’s get familiar with some essential AWS EMR terminology:

  • EMR (Elastic MapReduce): EMR is a managed cluster platform that simplifies running big data frameworks such as Hadoop and Spark.
  • Master Node: The master node manages the cluster by coordinating the distribution of data and tasks among other nodes.
  • Core Nodes: Core nodes are responsible for running tasks and storing data in the Hadoop Distributed File System (HDFS) on the cluster. A multi-node EMR cluster needs one master node and at least one core node.
  • Task Nodes: Task nodes run tasks on data but do not store data themselves. They are optional in an EMR cluster.
  • Data Processing Frameworks: EMR supports various data processing frameworks, including Spark, Hive, Pig, and more. These frameworks help process and analyze data on the cluster.
  • YARN (Yet Another Resource Negotiator): YARN is a cluster resource management tool that manages resources and job scheduling in EMR clusters.
  • Storage: HDFS on the core nodes holds data only for the life of the cluster, so many users instead use the EMR File System (EMRFS) to read and write data on Amazon S3, which is cost-effective, scalable, and persists after the cluster is terminated.
  • Local File System: Each node in the EMR cluster has a local file system, but it’s typically used for the operating system and software, not for data storage.
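
To make the three node types concrete, here is roughly how they appear as instance groups in an EMR API request. Treat it as an illustrative sketch: the helper name `instance_groups` and the `m5.xlarge` instance type are assumptions, while the role names `MASTER`, `CORE`, and `TASK` are the values the EMR API uses.

```python
def instance_groups(core_count=1, task_count=0, instance_type="m5.xlarge"):
    """Describe a cluster's nodes as EMR instance groups."""
    groups = [
        # Master node: coordinates the cluster.
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes: run tasks and store HDFS data.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes are optional: they run tasks but hold no HDFS data.
        groups.append({"Name": "Task", "InstanceRole": "TASK",
                       "InstanceType": instance_type, "InstanceCount": task_count})
    return groups
```

Calling `instance_groups()` gives the minimal master-plus-core layout, and passing `task_count=2` adds a task group on top of it.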

With these foundational terms in mind, let’s proceed to set up our first EMR cluster. Back to the console!

Setting up EMR Cluster:

In this section, we’ll walk through the steps to set up an EMR cluster for our tutorial. Follow along with these steps:

  1. Type “EMR” in the search bar at the top and select “EMR” to go to the EMR service.
  2. Click “Create cluster.”
  3. Choose “Go to advanced options.”
  4. Under “Software configuration,” select the following frameworks:
    • Hadoop
    • JupyterHub
    • Hive
    • Jupyter Enterprise Gateway
    • Hue
    • Spark
    • Pig
    • Livy
  5. Click “Next.”
  6. For “Network,” choose the VPC you created earlier.
  7. For “Subnet,” select a subnet in the same availability zone as your Cloud9 instance (e.g., “us-east-1b”).
  8. Set the number of master nodes to 1
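
The console choices so far map onto the arguments of boto3’s EMR `run_job_flow` call, sketched below. Several values are assumptions the steps above do not specify: the helper name `emr_cluster_request`, the release label, the instance type, the core node count, and the use of the default EMR roles. The application list, the single master node, and the “emr-tutorial” key pair come from the tutorial.

```python
APPLICATIONS = ["Hadoop", "JupyterHub", "Hive", "JupyterEnterpriseGateway",
                "Hue", "Spark", "Pig", "Livy"]


def emr_cluster_request(subnet_id, key_name="emr-tutorial", core_count=1,
                        release_label="emr-6.15.0", instance_type="m5.xlarge"):
    """Build keyword arguments for emr.run_job_flow()."""
    return {
        "Name": "emr-tutorial",
        "ReleaseLabel": release_label,  # assumption: any release shipping these apps
        "Applications": [{"Name": app} for app in APPLICATIONS],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            "Ec2KeyName": key_name,      # the SSH key pair created earlier
            "Ec2SubnetId": subnet_id,    # same availability zone as Cloud9
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive use
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }

# Usage (requires AWS credentials; the subnet ID is a placeholder):
#   import boto3
#   request = emr_cluster_request("subnet-0123456789abcdef0")
#   boto3.client("emr").run_job_flow(**request)
```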
