Description:
In this hands-on lab session, your student group will take on the role of data engineers working for a company that provides big data services to global enterprise clients. Your task is to manage the migration of a massive 60 TB dataset from an on-premises Oracle data warehouse to Amazon Redshift. This migration project presents unique challenges due to limited internet bandwidth and the need to maintain data consistency during daily and monthly updates.
Lab Scenario:
- Your client has accumulated 60 TB of raw data in their on-premises Oracle data warehouse.
- The goal is to migrate this data to Amazon Redshift, a powerful data warehousing solution in the AWS Cloud.
- The Oracle data warehouse receives minor updates daily and major updates at the end of each month.
- The migration must be completed within approximately 30 days, i.e., before the next major monthly update needs to be applied to the Redshift database.
- You have a strict constraint: The company can only allocate a maximum of 50 Mbps of internet bandwidth for this activity to prevent disruptions to ongoing business operations.
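A quick back-of-the-envelope calculation shows why these constraints matter: at 50 Mbps, the 60 TB bulk load alone cannot fit inside the 30-day window, which is what pushes the design toward an offline transfer (AWS Snowball Edge) for the initial load, leaving the network link for ongoing changes. The figures below come straight from the scenario above.

```bash
# Transfer time for 60 TB over a 50 Mbps link (scenario figures; ignores protocol overhead).
# 60 TB = 60 * 8 * 10^12 bits; 50 Mbps = 50 * 10^6 bits per second; 86400 seconds per day.
echo "scale=1; (60 * 8 * 10^12) / (50 * 10^6) / 86400" | bc
# => ~111.1 days of continuous transfer, nearly four times the 30-day window.
```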
Lab Objectives:
Your team’s mission is to successfully manage this complex data migration project within the given constraints. Specifically, you will:
- Plan and design a migration strategy that considers the limited bandwidth and the need to maintain data integrity during updates.
- Choose appropriate tools and methods for extracting data from the Oracle database and loading it into Amazon Redshift.
- Create a timeline and schedule that ensures the migration is completed within 30 days.
- Implement data replication strategies to address daily and monthly updates without data loss.
- Monitor the migration process, analyze bottlenecks, and make adjustments to optimize the use of limited bandwidth.
- Ensure data consistency and accuracy throughout the migration.
Key Learning Outcomes:
- Gain practical experience in data migration from on-premises databases to cloud-based solutions.
- Learn to work under constraints, such as limited bandwidth, and devise efficient strategies.
- Develop project planning and management skills for large-scale data migration.
- Apply real-time monitoring and optimization techniques for data transfer.
This hands-on lab presents a real-world challenge encountered by data engineers in the field. It requires you to apply your knowledge of data migration, cloud services, and problem-solving skills to successfully execute the migration project while adhering to strict time and bandwidth constraints.
- Create an AWS Snowball Import Job: Use the `create-job` command to create a Snowball import job.

```bash
aws snowball create-job \
  --job-type IMPORT \
  --resources 'S3Resources=[{BucketArn=arn:aws:s3:::your-s3-bucket}],Ec2AmiResources=[{AmiId=your-ami-id}]' \
  --description "Your Job Description" \
  --address-id your-address-id \
  --shipping-option SECOND_DAY \
  --notification SnsTopicARN=arn:aws:sns:your-region:your-account:your-sns-topic
```

  - `--job-type`: Specify IMPORT for an import job.
  - `--resources`: Define the S3 bucket that will receive the imported data and, optionally, an EC2 AMI for the job.
  - `--description`: Provide a description for the job.
  - `--address-id`: Specify the Snowball shipping address ID.
  - `--shipping-option`: Choose a shipping option, e.g., SECOND_DAY.
  - `--notification`: Set up an SNS topic to receive job status updates.

  Depending on your account setup, you may also need flags such as `--role-arn`, `--kms-key-arn`, and `--snowball-type`; check the AWS CLI reference for `create-job`.
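Once the job is created, its status can also be tracked from the CLI; the job ID below is a placeholder for the value returned by `create-job`.

```bash
# List Snowball jobs in the account, then inspect the newly created one.
# The job ID is a placeholder for the value returned by create-job.
aws snowball list-jobs
aws snowball describe-job --job-id JID123e4567-e89b-12d3-a456-426655440000
```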
- Data Extraction and Transformation with AWS SCT: The AWS Schema Conversion Tool (SCT) is primarily a GUI application, and the AWS CLI doesn't drive SCT tasks directly. You'll need to use the GUI tool for schema conversion and data type mapping.
- Install and Register Extraction Agent: Installation and registration of the extraction agent are typically done through the SCT GUI tool. The CLI doesn’t directly manage this process.
- Data Extraction and Loading to Snowball Edge: Use standard CLI commands (e.g., `aws s3 cp`) to extract and load data onto the Snowball Edge device, as sketched after the next step. Ensure the device is unlocked and accessible as an S3-compatible endpoint.
- Transfer Snowball Edge Data to S3 Bucket: This step is handled by AWS Snowball personnel once you've shipped the Snowball Edge device to AWS. No CLI action is needed for this.
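As a minimal sketch of the loading step above, copying exported data to the device's S3 interface might look like the following; the device IP, local path, bucket name, and CLI profile are placeholders for your environment.

```bash
# Copy exported Oracle data to the bucket exposed by the Snowball Edge device.
# 192.0.2.10 is a placeholder device IP; "snowballEdge" is an assumed CLI profile
# configured with the device's local access keys.
aws s3 cp /data/oracle-export/ s3://your-s3-bucket/oracle-export/ \
  --recursive \
  --endpoint-url http://192.0.2.10:8080 \
  --profile snowballEdge
```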
- Migrate Data to Amazon Redshift: This step can be achieved using AWS Database Migration Service (DMS). Here’s a basic command to create a DMS replication instance:
aws dms create-replication-instance --replication-instance-identifier your-instance-id --replication-instance-class db.m4.large --allocated-storage 100 --vpc-security-group-ids your-sg-id --availability-zone your-availability-zone
--replication-instance-identifier
: Provide a unique identifier for the replication instance.--replication-instance-class
: Choose the instance class.--allocated-storage
: Specify storage.--vpc-security-group-ids
: Define the VPC security groups.--availability-zone
: Set the availability zone. Creating a migration task and the actual data migration are typically done using the AWS DMS console or API.
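Although the console is the usual route, the same task can also be defined from the CLI. The sketch below assumes source and target endpoints were already created with `aws dms create-endpoint`; all ARNs, identifiers, and the table-mappings file are placeholders.

```bash
# Create a full-load + CDC task so daily and monthly changes continue to replicate
# after the initial bulk load. All ARNs are placeholders; table-mappings.json is an
# assumed local file containing DMS selection rules.
aws dms create-replication-task \
  --replication-task-identifier oracle-to-redshift-task \
  --source-endpoint-arn arn:aws:dms:your-region:your-account:endpoint:SOURCE_ENDPOINT_ID \
  --target-endpoint-arn arn:aws:dms:your-region:your-account:endpoint:TARGET_ENDPOINT_ID \
  --replication-instance-arn arn:aws:dms:your-region:your-account:rep:REPLICATION_INSTANCE_ID \
  --migration-type full-load-and-cdc \
  --table-mappings file://table-mappings.json

# Start the task once it reports "ready".
aws dms start-replication-task \
  --replication-task-arn arn:aws:dms:your-region:your-account:task:TASK_ID \
  --start-replication-task-type start-replication
```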
- Configure Local Task and AWS DMS Task: The configuration and management of AWS DMS tasks are best handled through the AWS DMS console or API. The AWS CLI can be used for basic tasks but not the entire process.
- Monitor and Verify Data Migration: Monitoring and verification can be done via the AWS DMS console, AWS Snowball job status, and other relevant AWS service consoles.
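For a CLI-side view of the same information, a few read-only calls can supplement the consoles; the task identifier, task ARN, and job ID below are placeholders.

```bash
# Overall DMS task status and progress (placeholder task identifier).
aws dms describe-replication-tasks \
  --filters Name=replication-task-id,Values=oracle-to-redshift-task

# Per-table row counts and validation state for a running task (placeholder ARN).
aws dms describe-table-statistics \
  --replication-task-arn arn:aws:dms:your-region:your-account:task:TASK_ID

# Snowball job progress (placeholder job ID).
aws snowball describe-job --job-id JID123e4567-e89b-12d3-a456-426655440000
```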
- Post-Migration Cleanup: Cleanup steps depend on the resources and infrastructure you want to decommission. This may include using commands like `aws s3 rm` to remove staging data that is no longer needed, terminating temporary AWS resources such as the DMS replication instance, and returning the Snowball Edge device to AWS.
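As a hedged sketch of that cleanup (bucket name, prefix, and ARNs are placeholders; double-check anything before deleting it):

```bash
# Remove staged export data that is no longer needed (placeholder bucket and prefix).
aws s3 rm s3://your-s3-bucket/oracle-export/ --recursive

# Delete the DMS task, then the replication instance, once migration and validation
# are complete (ARNs are placeholders; the instance can't be deleted while tasks exist).
aws dms delete-replication-task \
  --replication-task-arn arn:aws:dms:your-region:your-account:task:TASK_ID
aws dms delete-replication-instance \
  --replication-instance-arn arn:aws:dms:your-region:your-account:rep:REPLICATION_INSTANCE_ID
```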
Please note that this process involves a combination of CLI and GUI tools, with some tasks best performed through the AWS Management Console or AWS SDKs/APIs. The specific commands you’ll need to run can vary based on your environment and requirements. Always refer to AWS documentation and adjust the commands as needed.