Discover Top Posts Tagged with #amazon emr

Getting Started with Apache Zeppelin on Amazon EMR, using AWS Glue, RDS, and S3: Part 2

Introduction

In Part 1 of this two-part post, we created and configured the AWS resources required to demonstrate the use of Apache Zeppelin on Amazon Elastic MapReduce (EMR). Further, we configured Zeppelin integrations with AWS Glue Data Catalog, Amazon Relational Database Service (RDS) for PostgreSQL, and an Amazon Simple Storage Service (S3) data lake. We also covered how to obtain the…

#Amazon EMR #Apache Zeppelin #AWS Glue #Big Data #Data Catalog #Data Science #Elastic MapReduce #EMR #RDS #Zeppelin

Getting Started with Apache Zeppelin on Amazon EMR, using AWS Glue, RDS, and S3: Part 1 - Setup

Introduction

There is little question big data analytics, data science, artificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last 3–5 years. Behind the hype cycles and marketing buzz, these technologies are having a significant influence on all aspects of our modern lives. Due to their popularity, commercial…

#Amazon EMR #Apache Zeppelin #AWS Glue #Big Data #Data Analytics #Data Catalog #Data Science #Elastic MapReduce #Notebook #Zeppelin

MapReduce Custom Input Formats - Reading Paragraphs as Input Records

If you are working with Hadoop MapReduce or AWS EMR, you may hit a use case where each input record is a whole paragraph rather than a single line (think of scenarios like analyzing the comments on news articles). So how do you get MapReduce to process a complete paragraph as a single record instead of one line at a time?

To do this, we need to customize the default behavior of TextInputFormat, which reads one line per record, so that it reads a complete paragraph as one input key-value pair for further processing in MapReduce jobs.

This requires a custom record reader, which we create by implementing the RecordReader interface. The next() method is where you tell the record reader to fetch a paragraph instead of a single line. The following implementation wraps a LineRecordReader and keeps appending lines until it reaches a blank line:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {

    private LineRecordReader lineRecord;
    private LongWritable lineKey;
    private Text lineValue;

    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
        lineRecord = new LineRecordReader(conf, split);
        lineKey = lineRecord.createKey();
        lineValue = lineRecord.createValue();
    }

    @Override
    public void close() throws IOException {
        lineRecord.close();
    }

    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text("");
    }

    @Override
    public float getProgress() throws IOException {
        // Report the fraction of the split consumed, not the byte offset.
        return lineRecord.getProgress();
    }

    @Override
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        boolean appended;
        boolean isNextLineAvailable = false;
        byte[] space = {' '};
        value.clear();
        // Append lines until a blank line or the end of the split closes the paragraph.
        // The caller's key is never updated, so every record keeps the default key 0.
        do {
            appended = false;
            if (lineRecord.next(lineKey, lineValue)) {
                if (lineValue.toString().length() > 0) {
                    value.append(lineValue.getBytes(), 0, lineValue.getLength());
                    value.append(space, 0, 1);
                    appended = true;
                }
                isNextLineAvailable = true;
            }
        } while (appended);
        return isNextLineAvailable;
    }

    @Override
    public long getPos() throws IOException {
        return lineRecord.getPos();
    }
}
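The grouping rule in next() can also be tried without a Hadoop cluster. The following is a minimal sketch in plain Java and is not part of the original post: the class ParagraphGroupingSketch and its methods are hypothetical names, and a BufferedReader stands in for LineRecordReader. It follows the same rule, appending non-empty lines (each followed by a space) until a blank line or the end of the input is reached.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; a BufferedReader plays the role of LineRecordReader.
class ParagraphGroupingSketch {

    // Returns the next paragraph, or null once the input is exhausted.
    static String nextParagraph(BufferedReader reader) throws IOException {
        StringBuilder paragraph = new StringBuilder();
        boolean sawLine = false;
        String line;
        while ((line = reader.readLine()) != null) {
            sawLine = true;
            if (line.isEmpty()) {
                break; // a blank line closes the current paragraph
            }
            paragraph.append(line).append(' '); // same trailing space as the record reader
        }
        return sawLine ? paragraph.toString() : null;
    }

    // Splits a whole text into paragraph records.
    static List<String> paragraphs(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String paragraph;
            while ((paragraph = nextParagraph(reader)) != null) {
                records.add(paragraph);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "first line\nsecond line\n\nnext paragraph\n";
        for (String record : paragraphs(input)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

As with the record reader, two consecutive blank lines produce an empty record, so a production version may want to skip those.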

With ParagraphRecordReader in place, we extend TextInputFormat to create a custom InputFormat. We only need to override the getRecordReader method so that it returns a ParagraphRecordReader instead of the default reader. Our new class, ParagraphInputFormat, looks like this:

// Uses the same imports as ParagraphRecordReader, plus InputSplit, Reporter and
// TextInputFormat from org.apache.hadoop.mapred.
public class ParagraphInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter)
            throws IOException {
        reporter.setStatus(split.toString());
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}

The last change is to make the job configuration use our custom input format when reading data into MapReduce jobs. This is as simple as setting the input format to ParagraphInputFormat, as shown below:

conf.setInputFormat(ParagraphInputFormat.class);

With the above changes, we have everything needed to read paragraphs as input records in MapReduce programs. To make clear what we achieved, let's assume the input file contains the following paragraphs:

This is a good article sharing an useful perspective on customizing default behavior.

You could use highlight blocks for showing code.

And a simple mapper would look like:

@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // Print each record to show that the value now holds a whole paragraph.
    System.out.println(key + " : " + value);
}

This mapper prints the following to the console:

0 : This is a good article sharing an useful perspective on customizing default behavior
0 : You could use highlight blocks for showing code.

Hope this helps you extend the default input format behavior in MapReduce. If you like what we do, connect with us at cloud [at] minjar [dot] com; we are always hiring smart people :)

Author: This blog post is contributed by Amarkant, TechLead - BigData team at Minjar.

#hadoop #Amazon EMR #mapreduce

Big Data and Hadoop in Cloud - Leveraging Amazon EMR

I gave a talk last week at Barcamp Bangalore on "Big Data and Hadoop in Cloud - Leveraging Amazon EMR". The focus was to help the audience understand Big Data and how to leverage frameworks like Hadoop to build context and derive insights. Big data is becoming a common use case, and we need distributed systems that can store growing data sets and take advantage of parallel processing to analyze them.

I spoke about Hadoop and MapReduce in general, and about how to run Hadoop MapReduce jobs using the Amazon EMR service. I also shared some insights from managing hyper-scale production Hadoop clusters and tuning them for performance: think 68400 GB RAM, 26000 CPUs and 1700000 GB of disk.

View more PowerPoint from Vijay Rayapati

Drop me a note if you have any specific comments. Would love to hear your feedback!

#JustMigrate #Amazon EMR #cloudcomputing #Hadoop #MapReduce #Performance Tuning

Amazon EMR Cost Optimization : Using Spot Instances Without Risk

All of us look for ways to save on the cost of machines running in the cloud, and one such option provided by Amazon is Spot instances. But is it practical to use them for our EMR jobs?

You can bid on Spot instances on EMR for your Hadoop jobs, but the common assumption is that there is always a risk of losing the machines and therefore of the job failing. This is not entirely correct, since EMR allows us to launch a job with a few spot nodes (task) and a few core nodes.

The EC2 instances used to run an Elastic MapReduce job flow fall into one of three categories or instance groups:

Master - The Master instance group contains a single EC2 instance. This instance schedules Hadoop tasks on the Core and Task nodes.

Core - The Core instance group contains one or more EC2 instances. These instances use HDFS to store the data for the job flow. They also run mapper and reducer tasks as specified in the job flow. This group can be expanded in order to accelerate a running job flow.

Task - The Task instance group contains zero or more EC2 instances and runs mapper and reducer tasks. Since they don’t store any data, this group can expand or contract during the course of a job flow.

You can choose to use either On-Demand or Spot Instances for each of your job flows, and this applies to all three instance groups above. However, from the definitions above, if you lose the master or a core machine then your job is bound to fail. Theoretically, you can have something like:

elastic-mapreduce --create --alive --plain-output … \
  --instance-group master --instance-type m1.small --instance-count 1 --bid-price 0.098 \
  --instance-group core --instance-type m1.small --instance-count 10 --bid-price 0.028 \
  --instance-group task --instance-type m1.small --instance-count 30 --bid-price 0.018

But realistically, if you request spot instances, keep in mind that when the current spot price exceeds your maximum bid, the instances will either not be provisioned or be removed from the running job flow. Thus, if at any time the spot price rises above your bid and you lose any CORE or MASTER node, the job will fail. Both CORE and TASK nodes run TaskTrackers, but only CORE nodes run DataNodes, so you need at least one CORE node.
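The rule above can be written down as a small model. This is only a sketch, not an AWS API: the class SpotBidSketch, its methods, and the sample spot prices are hypothetical, and it assumes a single spot price for all groups since they all use the same instance type. The bid prices are the ones from the command above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the behaviour described in the text; not an AWS API.
class SpotBidSketch {

    enum Role { MASTER, CORE, TASK }

    static final class InstanceGroup {
        final Role role;
        final int count;
        final Double bidPrice; // null means on-demand (no bid)

        InstanceGroup(Role role, int count, Double bidPrice) {
            this.role = role;
            this.count = count;
            this.bidPrice = bidPrice;
        }

        // A spot group is lost when the current spot price exceeds its bid.
        boolean isLost(double spotPrice) {
            return bidPrice != null && spotPrice > bidPrice;
        }
    }

    // The job flow fails as soon as the MASTER or CORE group is lost.
    static boolean jobSurvives(List<InstanceGroup> groups, double spotPrice) {
        for (InstanceGroup group : groups) {
            if (group.role != Role.TASK && group.isLost(spotPrice)) {
                return false;
            }
        }
        return true;
    }

    // Number of CORE and TASK instances still running at the given spot price.
    static int remainingWorkers(List<InstanceGroup> groups, double spotPrice) {
        int workers = 0;
        for (InstanceGroup group : groups) {
            if (group.role != Role.MASTER && !group.isLost(spotPrice)) {
                workers += group.count;
            }
        }
        return workers;
    }

    // All three groups on spot, with the bids from the command in the text.
    static List<InstanceGroup> allSpot() {
        return Arrays.asList(
                new InstanceGroup(Role.MASTER, 1, 0.098),
                new InstanceGroup(Role.CORE, 10, 0.028),
                new InstanceGroup(Role.TASK, 30, 0.018));
    }

    // MASTER and CORE on-demand, only the TASK group on spot.
    static List<InstanceGroup> onDemandCore() {
        return Arrays.asList(
                new InstanceGroup(Role.MASTER, 1, null),
                new InstanceGroup(Role.CORE, 10, null),
                new InstanceGroup(Role.TASK, 30, 0.018));
    }

    public static void main(String[] args) {
        double spotPrice = 0.030; // hypothetical market price
        System.out.println("all spot survives: " + jobSurvives(allSpot(), spotPrice));
        System.out.println("on-demand core survives: " + jobSurvives(onDemandCore(), spotPrice));
    }
}
```

In this model, a spot price that climbs above the CORE bid kills the all-spot job flow, while the variant with on-demand MASTER and CORE groups only loses its TASK nodes.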

To hedge against the complete loss of a job flow, multiple instance groups can be created where the `CORE` group is a smaller complement of traditional on-demand systems and the `TASK` group is the group of spot instances. In this configuration, the `TASK` group will only benefit the mapper phases of a job flow, as work from the `TASK` group is “handed back up” to the `CORE` group for reduction.

So, if you have to run a job that would ideally need 40 slave machines, you can run 10 machines (CORE group) as traditional on-demand instances and the other 30 as spot instances (TASK group). The syntax for creating the multiple instance groups is below:

elastic-mapreduce --create --alive --plain-output … \
  --instance-group master --instance-type m1.small --instance-count 1 \
  --instance-group core --instance-type m1.small --instance-count 10 \
  --instance-group task --instance-type m1.small --instance-count 30 --bid-price 0.018

(Source: http://www.understandbigdata.com)

This helps you save cost by running SPOT instances as task nodes while making sure that the job does not fail. However, keep in mind that, depending on your bid price and the time taken to complete the job, the SPOT instances may come and go, so in the worst case you might end up incurring the same cost and taking longer to complete the job. It all depends on your bid price, so choose it wisely :)
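To get a feel for this trade-off, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the class SpotCostSketch is a hypothetical name, the on-demand hourly price of 0.06 and the 10-hour run time are made up, and the spot nodes are assumed to cost exactly the bid price of 0.018 for the whole run, which is an upper bound. The master node is left out for simplicity.

```java
// Hypothetical cost model; all prices and durations are illustrative assumptions.
class SpotCostSketch {

    // Cost of a run: (on-demand nodes x price + spot nodes x price) x hours.
    static double jobCost(int onDemandNodes, double onDemandPrice,
                          int spotNodes, double spotPrice, double hours) {
        return (onDemandNodes * onDemandPrice + spotNodes * spotPrice) * hours;
    }

    // Hours after which the mixed cluster has spent as much as the baseline run.
    static double breakEvenHours(double baselineCost, int onDemandNodes, double onDemandPrice,
                                 int spotNodes, double spotPrice) {
        double mixedHourlyCost = onDemandNodes * onDemandPrice + spotNodes * spotPrice;
        return baselineCost / mixedHourlyCost;
    }

    public static void main(String[] args) {
        double onDemandPrice = 0.06; // assumed on-demand price per node-hour
        double spotPrice = 0.018;    // bid price from the text, used as an upper bound
        double hours = 10;           // assumed run time

        double allOnDemand = jobCost(40, onDemandPrice, 0, spotPrice, hours);
        double mixed = jobCost(10, onDemandPrice, 30, spotPrice, hours);
        double breakEven = breakEvenHours(allOnDemand, 10, onDemandPrice, 30, spotPrice);

        System.out.printf("40 on-demand nodes: %.2f%n", allOnDemand);
        System.out.printf("10 on-demand + 30 spot nodes: %.2f%n", mixed);
        System.out.printf("mixed cluster breaks even after %.1f hours%n", breakEven);
    }
}
```

With these made-up numbers the mixed cluster costs about 11.40 against 24.00 for 40 on-demand nodes, and it loses its advantage only if interruptions stretch the job past roughly 21 hours.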

Written by Varun Singhal, Big Data Architect

#Amazon EMR #Hadoop Optimization #EMR Spot Instances

Fun w/ Conditional & Aggregation Functions

One problem that came up pretty early on is how to run aggregation functions conditionally. The actual use case came up when I was trying to count the number of certain events that a particular user triggered when visiting our website.

The data can be pictured as a table of events (login, sign in, etc) that had a user id associated with them. The idea is to produce counts and averages (aggregations) of the events. One can think of it as turning the original table over sideways so that what were the rows are now the columns (sort of...)

So, imagine the data looks like:

TABLE 1

user id, event

01, login

01, sign_in

01, event_02

02, sign_in

02, event_03

...

And so on. The goal was to make it look like

TABLE 2

user id, login, sign_in, sign_out, event_01, event_02 ... etc

01,1,1,0,0,1, ...

In this example we simply want to count, for each user in TABLE 1, the number of events of each event type that appears as a column in TABLE 2.

How to do this? A quick look at the Hive UDF docs shows we've got aggregation functions as well as conditional functions. The relevant ones are:

count(expr) - Returns the number of rows for which the supplied expression is non-NULL.

and

CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END

Combining them would give us something like

SELECT
  user_id,
  count(event),
  count(CASE event WHEN "login" THEN 1 ELSE NULL END),
  count(CASE event WHEN "sign_in" THEN 1 ELSE NULL END),
  count(CASE event WHEN "sign_out" THEN 1 ELSE NULL END),
  count(CASE event WHEN "event_02" THEN 1 ELSE NULL END),
  ... etc
FROM
  TABLE 1
GROUP BY
  user_id
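To check what the query should return, the same conditional counting can be sketched in plain Java. This is only an illustration and not how Hive executes the query: the class ConditionalCountSketch and its pivot method are hypothetical names, and the rows are the sample rows from TABLE 1. The map key plays the role of GROUP BY user_id, and an event that matches no column is skipped, just as count() ignores the NULL produced by the ELSE branch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory version of the conditional count; not Hive code.
class ConditionalCountSketch {

    // rows: {userId, event}; columns: the event types to count, in output order.
    static Map<String, int[]> pivot(String[][] rows, List<String> columns) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            // One counter array per user, like GROUP BY user_id.
            int[] perUser = counts.computeIfAbsent(row[0], id -> new int[columns.size()]);
            int column = columns.indexOf(row[1]);
            if (column >= 0) {
                perUser[column]++; // like count(CASE event WHEN ... THEN 1 ELSE NULL END)
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"01", "login"}, {"01", "sign_in"}, {"01", "event_02"},
            {"02", "sign_in"}, {"02", "event_03"}
        };
        List<String> columns = Arrays.asList("login", "sign_in", "sign_out", "event_01", "event_02");
        for (Map.Entry<String, int[]> entry : pivot(rows, columns).entrySet()) {
            System.out.println(entry.getKey() + "," + Arrays.toString(entry.getValue()));
        }
    }
}
```

For user 01 this gives the counts 1, 1, 0, 0, 1, which matches the sample row shown in TABLE 2.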

#Amazon EMR