What is EMR?
Amazon Elastic MapReduce (EMR) is a managed cluster platform for processing and analyzing large amounts of data. It runs open-source big data frameworks such as Apache Hive and Apache Spark on AWS, and can analyze petabyte-scale data in the cloud on a cluster-based architecture.
Amazon EMR uses the Hadoop framework, running on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). All processing steps of an EMR cluster execute on EC2 instances. EMR reduces the need for on-premises infrastructure by letting you scale the cluster automatically when needed. The open-source applications available in EMR include Apache Spark, Apache Hive, Apache HBase and Presto.
Overview of EMR
An overview of EMR includes the following concepts.
Cluster and Nodes – An EMR cluster is a collection of EC2 instances that run in parallel. Each instance in an EMR cluster is called a node. Amazon EMR lets you install software on each node, which runs with its own compute power. The node types in EMR are as follows:
- Master Node – Each cluster has a master node that monitors cluster health and tracks task status. Amazon EMR also lets you create a single-node cluster that consists of only a master node.
- Core Node – A core node runs tasks and stores data in HDFS.
- Task Node – A task node only runs tasks; it does not store any data.
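The three node roles can be written down as the kind of instance-group list that the boto3 EMR client's `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The following is a minimal sketch that only builds the data structure and makes no AWS calls; the function name, the default counts and the `m5.xlarge` instance type are illustrative assumptions.

```python
def build_instance_groups(core_count=2, task_count=0, instance_type="m5.xlarge"):
    """Return an instance-group list: one master, optional core and task groups.

    The helper name, defaults and instance type are hypothetical; the dict keys
    follow the shape boto3's EMR run_job_flow expects for InstanceGroups.
    """
    if core_count < 0 or task_count < 0:
        raise ValueError("node counts must be non-negative")

    # Every cluster has exactly one master group; alone it is a single-node cluster.
    groups = [{
        "Name": "Master node",
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if core_count:
        # Core nodes run tasks and hold HDFS data.
        groups.append({
            "Name": "Core nodes",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        })
    if task_count:
        # Task nodes only run tasks and store no data.
        groups.append({
            "Name": "Task nodes",
            "InstanceRole": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups
```

Calling `build_instance_groups(core_count=0)` yields the single-node case described above, with only a master group.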
Specify the Task – Amazon EMR gives you several options for specifying the work an EMR cluster should do:
- Define the work as steps that you specify when you launch the cluster.
- Submit tasks to a running cluster using the EMR console, the AWS CLI or the EMR API.
- Connect over SSH to the master node, and to core nodes if needed, and submit tasks there.
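When work is submitted through the API, each task is described as a step definition. The sketch below builds one such definition for a Spark script, in the shape accepted by boto3's `add_job_flow_steps` (and by the `Steps` argument of `run_job_flow`). It makes no AWS calls; the helper name and the S3 path in the usage note are hypothetical.

```python
VALID_ACTIONS = {"TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}


def spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Build a step definition that runs spark-submit through command-runner.jar.

    The function itself is an illustrative helper, not part of any AWS SDK.
    """
    if action_on_failure not in VALID_ACTIONS:
        raise ValueError(f"unsupported ActionOnFailure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar lets a step invoke a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }
```

For example, `spark_step("daily-report", "s3://example-bucket/jobs/report.py")` returns a dict that could be passed in a `Steps=[...]` list; the bucket and script names here are placeholders.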
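Processing Data – AWS lets you connect securely to the master node and submit jobs directly for data processing. You can also run an ordered list of steps on the cluster. Steps run in sequence: every step starts as PENDING, the first one changes to RUNNING while the rest wait, and when it finishes it changes to COMPLETED and the next step starts. If a step fails, its state changes to FAILED and, by default, the remaining steps are CANCELLED.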
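The step sequence can be illustrated with a small simulation. This is a simplified model, not EMR's implementation: each step is a hypothetical `(name, succeeds, action_on_failure)` tuple, steps run one at a time, and a failure cancels the remaining steps unless the failed step's action is `CONTINUE`.

```python
def run_steps(steps):
    """Simulate an ordered step sequence and return each step's final state.

    `steps` is a list of (name, succeeds, action_on_failure) tuples; the tuple
    layout is an assumption made for this sketch.
    """
    # Every step is PENDING until its turn comes.
    states = {name: "PENDING" for name, _, _ in steps}
    cancel_rest = False

    for name, succeeds, action_on_failure in steps:
        if cancel_rest:
            states[name] = "CANCELLED"
            continue
        states[name] = "RUNNING"
        states[name] = "COMPLETED" if succeeds else "FAILED"
        # Unless the failed step says CONTINUE, later steps do not run.
        if not succeeds and action_on_failure != "CONTINUE":
            cancel_rest = True
    return states
```

With the input `[("load", True, "CONTINUE"), ("transform", False, "CANCEL_AND_WAIT"), ("export", True, "CONTINUE")]`, the first step completes, the second fails, and the third is cancelled without running.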
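What are the different cluster states? – A cluster moves through the following states during its life cycle: STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATING, and finally TERMINATED or TERMINATED_WITH_ERRORS.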
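The cluster states reported by the EMR API can be captured in a small enumeration, which is handy when polling a cluster from a script. The state names below are the values the API reports; the class and the helper function are illustrative and not part of any SDK.

```python
from enum import Enum


class ClusterState(Enum):
    """Cluster states as reported by the EMR API."""
    STARTING = "STARTING"                              # instances are being provisioned
    BOOTSTRAPPING = "BOOTSTRAPPING"                    # bootstrap actions are running
    RUNNING = "RUNNING"                                # a step is being processed
    WAITING = "WAITING"                                # idle, ready for more steps
    TERMINATING = "TERMINATING"                        # shutdown in progress
    TERMINATED = "TERMINATED"                          # shut down normally
    TERMINATED_WITH_ERRORS = "TERMINATED_WITH_ERRORS"  # shut down after a failure


FINAL_STATES = {ClusterState.TERMINATED, ClusterState.TERMINATED_WITH_ERRORS}


def is_finished(state_name):
    """Return True once a cluster has reached a final state (hypothetical helper)."""
    return ClusterState(state_name) in FINAL_STATES
```

A polling loop could stop as soon as `is_finished(...)` returns `True` for the state string it reads back.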
Architecture of AWS EMR
Amazon EMR consists of the following layers:
- Storage
- Cluster Resource Management
- Data Processing Frameworks
- Applications and Programs
Storage
An Amazon EMR cluster can use different file systems to process your data in the cloud. The storage options are as follows:
Hadoop Distributed File System (HDFS) – HDFS is a distributed, scalable file system for Hadoop. It stores multiple copies of the data on different instances so that no data is lost if an individual instance fails.
EMR File System (EMRFS) – Amazon EMR accesses data stored in Amazon S3 directly through EMRFS. S3 typically holds the input and output data in an EMR Hadoop architecture.
Local File System – The local file system is the disk attached to each node, backed by the EC2 instance store. Data on it persists only during the life cycle of the EC2 instance.
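In practice, the URI scheme of a path usually tells you which of these storage options a job is touching. The sketch below is a hypothetical helper that maps a path to a storage layer by its scheme; the example paths and the mapping labels are assumptions for illustration.

```python
from urllib.parse import urlparse

# Scheme-to-storage mapping for paths as they are commonly written in EMR jobs.
STORAGE_BY_SCHEME = {
    "s3": "EMRFS (Amazon S3)",
    "hdfs": "HDFS",
    "file": "local file system",
}


def storage_layer(uri):
    """Return which storage option a URI refers to, based on its scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme not in STORAGE_BY_SCHEME:
        raise ValueError(f"unrecognized storage scheme in {uri!r}")
    return STORAGE_BY_SCHEME[scheme]
```

A job that reads `s3://example-bucket/input/` and writes intermediate data to `hdfs:///tmp/stage/` would therefore touch EMRFS for input and HDFS for staging.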
Cluster Resource Management
As the name suggests, this layer manages the EMR cluster resources and schedules jobs. By default it uses YARN (Yet Another Resource Negotiator), which was introduced in Apache Hadoop 2.0. Amazon EMR also runs an agent on each node that administers the YARN components and keeps the cluster healthy.
Data Processing Framework
This layer is the engine that processes and analyzes your data. EMR provides several frameworks, and the one you choose determines how the application layer interacts with the data. The main processing frameworks used in EMR are as follows:
Hadoop MapReduce – An open-source framework for writing applications that process large amounts of data in parallel on a cluster. A MapReduce job splits the input data set into chunks, which map tasks process in parallel; reduce tasks then combine the intermediate results into the final output. Hive is a common tool built on MapReduce; it generates the map and reduce programs for you.
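The map and reduce phases are easiest to see in the classic word-count example. The following is a single-process sketch of the programming model in plain Python, not Hadoop code: each input chunk is mapped to `(word, 1)` pairs, the pairs are grouped by key, and a reduce phase sums the counts. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped by a separate task in parallel.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))
```

Here each string in `chunks` stands in for one split of the input data set; the map calls are independent of one another, which is what allows them to run in parallel.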
Apache Spark – Apache Spark is a cluster framework and programming model for processing large amounts of data. It uses directed acyclic graphs for its execution plans and in-memory caching for datasets. A Spark job running on EMR can use EMRFS to access data in S3 directly.
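The two ideas named here, a graph of deferred transformations and in-memory caching, can be illustrated with a toy class in plain Python. This is not Spark's API and leaves out distribution entirely; `LazyDataset` and `from_list` are hypothetical names. Transformations only record how to compute a result, nothing runs until `collect()` is called, and a cached dataset is computed once and then reused by everything that depends on it.

```python
class LazyDataset:
    """Toy lazily evaluated dataset; each instance is one node in a DAG."""

    def __init__(self, compute, parents=()):
        self._compute = compute        # deferred computation for this node
        self.parents = tuple(parents)  # upstream nodes in the graph
        self._use_cache = False
        self._cached = None
        self.evaluations = 0           # how many times this node was computed

    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self.collect()], parents=(self,))

    def filter(self, predicate):
        return LazyDataset(lambda: [x for x in self.collect() if predicate(x)],
                           parents=(self,))

    def cache(self):
        """Mark this node so its result is kept in memory after first use."""
        self._use_cache = True
        return self

    def collect(self):
        """Trigger evaluation of this node and everything upstream of it."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def from_list(items):
    data = list(items)
    return LazyDataset(lambda: list(data))
```

If `squares = from_list(range(5)).map(lambda x: x * x).cache()` feeds two different filters, collecting both filters computes `squares` only once; without `cache()` it would be recomputed for each.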
Applications and Programs
Amazon EMR supports many applications, such as Pig, Hive and the Spark Streaming library, for creating processing workloads, applying machine learning algorithms and building data warehouses.
Amazon EMR also supports various libraries and languages for interacting with the applications that run in an EMR cluster.
Advantages of EMR
Some benefits of using EMR are as follows:
- Security – AWS EMR lets you launch the cluster infrastructure in a VPC and manage the inbound and outbound rules of the security groups attached to your EC2 instances. It also supports S3 server-side encryption and KMS encryption to protect your data in S3.
- Flexible – Amazon EMR gives you full control: you can launch a cluster with a custom AMI and install packages on the EC2 instances with root privileges.
- Minimize Data Loss – AWS EMR distributes multiple copies of your data across different EC2 nodes of the cluster, which minimizes data loss if an instance fails.
- Large Data Processing – It can process petabyte-scale data.
- Elastic – AWS lets you set up auto scaling to increase or decrease the number of EC2 instances based on utilization, so the infrastructure scales automatically when needed.
- Cost-Effective – AWS lets you set up infrastructure to process large data sets at low cost.
- Monitoring – AWS EMR continuously administers YARN on each node to keep your cluster healthy.
With AWS EMR you pay both the EMR price and the EC2 price for your cluster. Prices vary by region. The EC2 price depends on whether you use On-Demand, Reserved or Spot instances. Amazon EMR, EC2 and any additional EBS volumes are billed per second, with a one-minute minimum.
Reference – AWS EMR Pricing
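The per-second billing rule with a one-minute minimum can be worked through with a short calculation. This is a rough sketch only: the hourly rates used in the usage note are made-up placeholders rather than real AWS prices, EBS charges are left out, and the function names are illustrative.

```python
import math


def billable_seconds(runtime_seconds):
    """Apply per-second billing with a one-minute minimum."""
    if runtime_seconds <= 0:
        return 0
    return max(60, math.ceil(runtime_seconds))


def cluster_cost(runtime_seconds, instance_count, ec2_hourly, emr_hourly):
    """Estimate cluster cost: each instance pays the EC2 rate plus the EMR rate.

    The hourly rates are inputs supplied by the caller; none are built in.
    """
    hours = billable_seconds(runtime_seconds) / 3600
    return round(hours * instance_count * (ec2_hourly + emr_hourly), 4)
```

For example, with placeholder rates of 0.20 per hour for EC2 and 0.05 per hour for EMR, a three-node cluster that runs for 30 minutes costs 0.5 × 3 × 0.25 = 0.375, and a cluster that runs for only 30 seconds is still billed for 60 seconds.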