Hands-On Guide to Processing Data with Amazon EMR

Amazon Web Services (AWS) offers a broad range of powerful tools and services for processing and analyzing large datasets. One such service is Amazon EMR (Elastic MapReduce), a managed cluster platform that simplifies running big data frameworks, including Apache Hadoop and Apache Spark, on AWS infrastructure. In this hands-on guide, we will explore the architecture of Amazon EMR and walk through the step-by-step process of creating an EMR cluster using the AWS Management Console. Additionally, we’ll run a sample Hive application to process and analyze data stored in Amazon S3.

Understanding Amazon EMR Architecture

Before we dive into the practical aspects, let’s briefly review the key components of Amazon EMR:

Types of Storage

  1. Hadoop Distributed File System (HDFS): A scalable, distributed file system that stores data across the instances in your cluster.
  2. EMR File System (EMRFS): An extension of Hadoop that lets the cluster read and write data directly to Amazon S3, so you can use either HDFS or S3 as your cluster’s file system.
  3. Local File System: The disks attached locally to each instance in the cluster.
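On a running cluster, these three storage layers are addressed through different URI schemes. A quick way to see the distinction (a sketch, assuming you are logged in to the master node of an EMR cluster and `your-bucket` is a placeholder for a bucket your cluster’s role can read):

```shell
# HDFS: distributed storage across the cluster's core nodes
hadoop fs -ls hdfs:///user/hadoop/

# EMRFS: the same Hadoop file system commands, backed by Amazon S3
hadoop fs -ls s3://your-bucket/input/

# Local file system: the instance's own attached disks
ls /mnt/
```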

Processing Frameworks

  • Cluster Resource Management: Manages cluster resources and schedules data processing jobs; by default, Amazon EMR uses YARN (Yet Another Resource Negotiator).
  • Data Processing Frameworks: Different frameworks such as Hadoop MapReduce, Tez, and Spark are available for various processing needs.
  • Applications and Programs: Supports applications like Hive, Pig, and Spark Streaming for data processing and analysis.


Prerequisites

Before you get started, ensure you have completed the following prerequisites:

  1. Sign Up for AWS: Create an AWS account if you don’t have one already.
  2. Create an S3 Bucket: Set up an S3 bucket to store output data.
  3. Create an EC2 Key Pair: Generate an EC2 key pair for secure access to your EMR cluster.
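If you prefer the command line, the last two prerequisites can also be completed with the AWS CLI. A sketch, in which `my-emr-output-bucket` and `emr-keypair` are placeholder names and the region is assumed to be Ireland (eu-west-1):

```shell
# Create an S3 bucket to hold the job output
aws s3 mb s3://my-emr-output-bucket --region eu-west-1

# Create an EC2 key pair and save the private key locally
aws ec2 create-key-pair --key-name emr-keypair \
    --query 'KeyMaterial' --output text > emr-keypair.pem
chmod 400 emr-keypair.pem
```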

Launching an EMR Cluster

Now, let’s walk through the process of creating an Amazon EMR cluster step by step:

  1. Sign in to the AWS Management Console.
  2. Select your desired region (e.g., Ireland).
  3. Navigate to the EMR service and click on “Create cluster.”
  4. Provide a meaningful name for your cluster, keep logging enabled, and choose “Cluster” as the launch mode. Use the latest release version.
  5. Under Applications, select “Core Hadoop” (which includes Hive).
  6. Hardware configuration can be left as default.
  7. Choose the EC2 key pair you created earlier under EC2 key pair.
  8. Permissions can be left as default with the default EMR role and EC2 instance profile.
  9. Click on “Create cluster,” and your cluster will begin provisioning. Once the cluster is running, click on the Hardware tab to view the underlying EC2 instances. Click on the SSH link next to the master node, and a popup will open showing you the SSH command for connecting to the cluster.
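The console steps above map to a single AWS CLI call. A hedged sketch (the release label, instance type, and all names are illustrative assumptions; substitute your own key pair and log bucket):

```shell
aws emr create-cluster \
    --name "my-hive-cluster" \
    --release-label emr-6.15.0 \
    --applications Name=Hadoop Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=emr-keypair \
    --use-default-roles \
    --log-uri s3://my-emr-output-bucket/logs/ \
    --region eu-west-1
```

The command prints a cluster ID (of the form `j-…`), which later CLI commands and the console use to refer to this cluster.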

Running a Hive Script

While your EMR cluster is spinning up, let’s prepare the sample data and script. In this example, we’ll calculate the number of requests per operating system over a specified timeframe using HiveQL, a SQL-like scripting language for data analysis. The sample data and script are stored in an S3 bucket.
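As an illustration of what such a HiveQL script might look like, here is a minimal sketch written to a local file from the shell. The table layout and column names are assumptions for illustration, not the actual sample script stored in S3; `${INPUT}` and `${OUTPUT}` are variables supplied at run time with Hive’s `-d` option.

```shell
# Write a small HiveQL script to a local file (illustrative schema)
cat <<'EOF' > os_requests.q
-- External table over the raw request logs in S3
-- (${INPUT} and ${OUTPUT} are passed in with: hive -d NAME=value)
CREATE EXTERNAL TABLE IF NOT EXISTS requests (
    request_time STRING,
    os           STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${INPUT}';

-- Count requests per operating system and write the result to S3
INSERT OVERWRITE DIRECTORY '${OUTPUT}'
SELECT os, COUNT(*) AS request_count
FROM requests
GROUP BY os;
EOF
```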

Submitting the Hive Script as a Step

  1. Select “Steps” from the EMR console.
  2. Click on “Add step.”
  3. Choose the step type as “Hive program.”
  4. Provide the script’s S3 location, input S3 location, and output S3 location. Ensure the output S3 location points to the bucket you created earlier.
  5. Click on “Add.”
  6. Wait for the step to complete. You can monitor its progress in the EMR console.
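The same step can be submitted from the AWS CLI. A sketch: the cluster ID, step ID, and S3 paths below are placeholders you would replace with your own values.

```shell
# Submit the Hive script as a step (paths and IDs are placeholders)
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=HIVE,Name="Hive program",ActionOnFailure=CONTINUE,Args=[-f,s3://my-script-bucket/os_requests.q,-d,INPUT=s3://my-script-bucket/input,-d,OUTPUT=s3://my-emr-output-bucket/output/]'

# Poll the step status until it reaches COMPLETED
aws emr describe-step --cluster-id j-XXXXXXXXXXXXX \
    --step-id s-XXXXXXXXXXXXX --query 'Step.Status.State'
```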

Running the Hive Script via SSH

To run the Hive script directly on the master node, follow these steps:

  1. Open port 22 for SSH access in the security group of the master node.
  2. Copy the SSH command from the EMR console.
  3. Paste the command into your terminal and press Enter to SSH into the master node.
  4. Navigate to the “Steps” section, select your Hive program, and copy the complete command.
  5. Paste the command in the terminal, changing the output folder name if needed.
  6. Press Enter to run the script.
  7. Wait for the Hive job to complete.
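Steps 2 through 6 above amount to the following commands (a sketch; the hostname, key file, and S3 paths are placeholders taken from the earlier examples):

```shell
# SSH into the master node (the EMR console shows the exact hostname)
ssh -i emr-keypair.pem hadoop@ec2-XX-XX-XX-XX.eu-west-1.compute.amazonaws.com

# On the master node: run the Hive script, pointing OUTPUT at a fresh folder
hive -f s3://my-script-bucket/os_requests.q \
    -d INPUT=s3://my-script-bucket/input \
    -d OUTPUT=s3://my-emr-output-bucket/output-ssh/
```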

Viewing the Results

To view the results, navigate to the Amazon S3 console:

  1. Search for “Amazon S3” in the AWS Management Console.
  2. Select your S3 bucket.
  3. Access the output folder.
  4. Download and open the output file to view the results of your Hive job.
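Alternatively, you can list and download the output with the AWS CLI (bucket name and prefix are placeholders):

```shell
aws s3 ls s3://my-emr-output-bucket/output/
aws s3 cp s3://my-emr-output-bucket/output/ ./results/ --recursive
cat results/000000_0   # Hive output files are typically named 000000_0, 000001_0, ...
```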


Congratulations! You’ve successfully created an Amazon EMR cluster and executed a sample Hive job to process and analyze data. Remember to clean up your resources by removing your S3 bucket and terminating your EMR cluster to avoid additional charges.
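The cleanup can also be scripted (IDs and names are placeholders; note that `--force` deletes every object in the bucket, so double-check the bucket name first):

```shell
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
aws s3 rb s3://my-emr-output-bucket --force
```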

For more information on analyzing big data with Amazon EMR, refer to the official AWS documentation.

Thank you for joining us on our AWS Certified Solutions Architect Professional tutorial series.
