What is Amazon EMR? How to create Amazon EMR clusters
Amazon EMR overview
Amazon EMR, previously called Amazon Elastic MapReduce, is a managed cluster platform that makes it easier to run big data frameworks such as Apache Hadoop and Apache Spark on AWS to process and analyse very large amounts of data. With these frameworks and related open-source projects, you can process data for analytics and business intelligence workloads. Amazon EMR can also transform and move large volumes of data into and out of other AWS data stores and databases, such as Amazon S3 and Amazon DynamoDB.
Amazon EMR cluster setup and operation
This section gives an overview of Amazon EMR clusters, including how to submit work to a cluster, how data is processed, and the states a cluster passes through during processing.
Understanding clusters and nodes
The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon EC2 instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as its node type. Amazon EMR installs different software components on each node type, which gives each node its role in a distributed application such as Apache Hadoop.
Types of Amazon EMR nodes:
Primary node: A node that manages the cluster by running software that coordinates the distribution of data and tasks among the other nodes for processing. The primary node tracks the status of tasks and monitors the health of the cluster. Every cluster has a primary node, and a single-node cluster consists of the primary node alone.
Core node: A node with software that runs tasks and stores data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
Task node: A node with software that only runs tasks and does not store data in HDFS. Task nodes are optional.
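As a rough illustration of the three node types, the sketch below builds the kind of instance-group list that the boto3 `run_job_flow` call accepts under `Instances.InstanceGroups`. The helper name `build_instance_groups`, the instance type, and the node counts are assumptions made for this example. Note that the API uses the role name `MASTER` for the primary node.

```python
def build_instance_groups(core_count=2, task_count=0, instance_type="m5.xlarge"):
    """Return a list of instance groups: one primary node plus optional core and task nodes.

    The instance type and counts are illustrative defaults, not recommendations.
    """
    if task_count > 0 and core_count == 0:
        # A multi-node cluster needs at least one core node.
        raise ValueError("task nodes require at least one core node")

    # Every cluster has exactly one primary node group.
    groups = [{
        "Name": "Primary",
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if core_count > 0:
        # Core nodes run tasks and store data in HDFS.
        groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        })
    if task_count > 0:
        # Task nodes run tasks only and hold no HDFS data.
        groups.append({
            "Name": "Task",
            "InstanceRole": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups
```

Calling `build_instance_groups(core_count=0)` describes a single-node cluster, while `build_instance_groups(core_count=2, task_count=3)` describes a cluster with all three node types.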
Submitting work to a cluster
When you run a cluster on Amazon EMR, you have several ways to specify the work to be done.
Provide the entire definition of the work as steps when you create the cluster. This is typically done for clusters that process a set amount of data and then terminate when processing is complete.
Create a long-running cluster and use the Amazon EMR console, the Amazon EMR API, or the AWS CLI to submit steps, each of which may contain one or more jobs. For more information, see Submit work to an Amazon EMR cluster.
Create a cluster, connect to the primary node and other nodes as needed using SSH, and use the interfaces that the installed applications provide to perform tasks and submit queries, either interactively or through scripts. For more information, see the Amazon EMR Release Guide.
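To make the second option more concrete, here is a minimal sketch of submitting a step to a running cluster through the API with boto3. The helper names `make_step` and `submit_steps`, the default region, and the S3 path in the usage note are hypothetical. The step layout (`Name`, `ActionOnFailure`, `HadoopJarStep`) follows the shape that the `add_job_flow_steps` call expects, and `command-runner.jar` is used here to run a command on the cluster.

```python
ALLOWED_ACTIONS = {"TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}


def make_step(name, args, action_on_failure="CANCEL_AND_WAIT"):
    """Build one step definition that runs a command through command-runner.jar."""
    if action_on_failure not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported ActionOnFailure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": list(args),
        },
    }


def submit_steps(cluster_id, steps, region="us-east-1"):
    """Submit steps to an existing cluster and return the new step IDs.

    boto3 is imported here so that make_step works without it installed.
    The region default is an assumption for this example.
    """
    import boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]
```

For example, `make_step("Spark job", ["spark-submit", "s3://amzn-s3-demo-bucket/scripts/job.py"])` describes a Spark job, and passing the result in a list to `submit_steps` along with a cluster ID would queue it on that cluster. The bucket and script path are placeholders.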
Data processing
When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. To process data in your Amazon EMR cluster, you can run steps in the cluster, or you can submit jobs or queries directly to the installed applications.
Submitting jobs directly to applications
You can submit jobs and interact directly with the software installed on your Amazon EMR cluster. To do this, you typically connect to the primary node over a secure connection and use the tools and interfaces available for the software running on your cluster.
Running steps to process data
You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions for manipulating data with the software installed on the cluster.
The following example process has four steps:
Submit an input dataset for processing.
Process the output of the first step with a Pig program.
Process a second input dataset with a Hive program.
Write an output dataset.
Amazon EMR usually takes its input from files stored in your chosen file system, such as HDFS or Amazon S3. This data passes from one step to the next in the processing sequence. The final step writes the output data to a specified location, such as an Amazon S3 bucket.
Steps run in the following sequence:
A request is submitted to begin processing the steps.
The state of all steps is set to PENDING.
When the first step in the sequence starts, its state changes to RUNNING. The other steps remain PENDING.
After the first step completes, its state changes to COMPLETED.
The next step in the sequence starts, and its state changes to RUNNING. When it completes, its state changes to COMPLETED.
This pattern repeats for each step until all steps are complete and processing ends.
The following diagram shows processing steps and state changes.
If a step fails during processing, its state changes to FAILED. You can decide what happens next for each step. By default, if a step fails, the remaining steps in the sequence are set to CANCELLED and do not run. You can instead choose to ignore the failure and let the remaining steps proceed, or to terminate the cluster immediately.
The following diagram shows the step sequence and the default state changes when a step fails during processing.
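The state changes described above can be sketched as a small simulation. This is a toy model of the behaviour in the text, not Amazon EMR code: the function name `simulate_steps`, the step names, and the use of `CANCEL_AND_WAIT` and `CONTINUE` as labels for the two failure behaviours are assumptions for illustration.

```python
def simulate_steps(names, failing=None, action_on_failure="CANCEL_AND_WAIT"):
    """Return the final state of each step after running them in order.

    `failing` names a step that is assumed to fail; None means every step succeeds.
    With CONTINUE, later steps still run after a failure. With any other action,
    the remaining steps are cancelled.
    """
    # Every step starts out PENDING once processing is requested.
    states = {name: "PENDING" for name in names}

    for index, name in enumerate(names):
        states[name] = "RUNNING"
        if name == failing:
            states[name] = "FAILED"
            if action_on_failure != "CONTINUE":
                # Default behaviour: cancel everything that has not run yet.
                for later in names[index + 1:]:
                    states[later] = "CANCELLED"
                break
        else:
            states[name] = "COMPLETED"
    return states


if __name__ == "__main__":
    steps = ["load-input", "pig-transform", "hive-transform", "write-output"]
    print(simulate_steps(steps))
    print(simulate_steps(steps, failing="pig-transform"))
    print(simulate_steps(steps, failing="pig-transform", action_on_failure="CONTINUE"))
```

In the second call the two steps after the failed one end up CANCELLED, while in the third call they still run to COMPLETED because the failure is ignored.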
Understanding the cluster lifecycle
A successful Amazon EMR cluster follows this process:
Amazon EMR first provisions the EC2 instances in the cluster according to your specifications. For more information, see Amazon EMR cluster hardware and networking configuration. For all instances, Amazon EMR uses either its default AMI or a custom Amazon Linux AMI that you specify. For more information, see Using a custom AMI to increase Amazon EMR cluster configuration flexibility. During this phase, the cluster state is STARTING.
Amazon EMR runs the bootstrap actions that you specify on each instance. You can use bootstrap actions to install custom applications and apply any customisations you need. For more information, see Create bootstrap actions for Amazon EMR cluster software installation. During this phase, the cluster state is BOOTSTRAPPING.
Amazon EMR installs the native applications that you chose when you created the cluster, such as Hive, Hadoop, and Spark. After the bootstrap actions complete and the native applications are installed, the cluster state is RUNNING. At this point you can connect to cluster instances, and the cluster runs, in sequence, any steps that you specified when you created it. You can submit additional steps, which run after the previous steps complete. For more information, see Submit work to an Amazon EMR cluster.
After the steps run successfully, the cluster enters the WAITING state.
If the cluster is configured to auto-terminate after the last step completes, it enters the TERMINATING state and then the TERMINATED state. If the cluster is configured to wait, you must shut it down manually when you no longer need it. After a manual shutdown, the cluster enters TERMINATING and then TERMINATED.
A failure during the cluster lifecycle causes Amazon EMR to terminate the cluster and all of its instances unless termination protection is enabled. If a cluster terminates because of a failure, any data stored on the cluster is deleted and the cluster state is set to TERMINATED_WITH_ERRORS. If termination protection is enabled, you can retrieve data from the cluster, then remove termination protection and terminate the cluster. For more information, see how termination protection can prevent unintended shutdown of Amazon EMR clusters.
The following diagram shows the cluster lifecycle and how each stage corresponds to a cluster state.
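The lifecycle can also be written down as a simple transition table. The sketch below is a simplified model built from the states named in this section. The names `TRANSITIONS` and `is_valid_path` are made up for the example, and two details are assumptions rather than statements from the text: that a failure can move any active state to TERMINATED_WITH_ERRORS, and that a WAITING cluster returns to RUNNING when new steps arrive.

```python
# Allowed next states for each cluster state (simplified model).
TRANSITIONS = {
    "STARTING": {"BOOTSTRAPPING", "TERMINATED_WITH_ERRORS"},
    "BOOTSTRAPPING": {"RUNNING", "TERMINATED_WITH_ERRORS"},
    # RUNNING -> TERMINATING covers clusters set to auto-terminate.
    "RUNNING": {"WAITING", "TERMINATING", "TERMINATED_WITH_ERRORS"},
    # WAITING -> RUNNING (new steps submitted) is an assumption of this model.
    "WAITING": {"RUNNING", "TERMINATING", "TERMINATED_WITH_ERRORS"},
    "TERMINATING": {"TERMINATED", "TERMINATED_WITH_ERRORS"},
    # Terminal states have no outgoing transitions.
    "TERMINATED": set(),
    "TERMINATED_WITH_ERRORS": set(),
}


def is_valid_path(states):
    """Check that a sequence of states starts at STARTING and follows the table."""
    if not states or states[0] != "STARTING":
        return False
    for current, following in zip(states, states[1:]):
        if following not in TRANSITIONS.get(current, set()):
            return False
    return True


if __name__ == "__main__":
    manual_shutdown = ["STARTING", "BOOTSTRAPPING", "RUNNING",
                       "WAITING", "TERMINATING", "TERMINATED"]
    print(is_valid_path(manual_shutdown))
    print(is_valid_path(["STARTING", "RUNNING"]))
```

The first path is the successful lifecycle of a cluster that waits and is then shut down manually. The second is rejected because it skips the BOOTSTRAPPING phase.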












