We have some big news to share: Mortar is joining Datadog, the leading SaaS-based monitoring service for cloud applications. We will be putting our team and our technology to work inside Datadog to expand their analytics capabilities and generate new insights for their users. Datadog processes hundreds of billions of data points every day from servers and applications, and we are thrilled to have the opportunity to apply what we have built at Mortar on that scale.
This is a big change for us, and although we are working hard to limit the impacts on our customers, it will be a change for them, too. As we transition into Datadog, we will be winding down the public Mortar service over the next few months. If you’re currently a Mortar user, you will have continued access to your account until May 15, giving you time to transition to new tools and to transfer or back up your Mortar code.
We have created guides on how to transfer your code, whether it is stored in Web projects or in GitHub-backed Mortar projects. And we’ve created a guide to running Mortar code outside the Mortar service using Luigi, the workflow engine developed and open-sourced by Spotify. Ensuring code portability for our customers is one of the main reasons why we built the Mortar platform using the best available open technologies and languages.
We're incredibly grateful to all of you who have been our customers, who have helped spread the word about Mortar, or who have offered invaluable advice as we built this business over the past 3.5 years. We hope you have enjoyed using Mortar as much as we enjoyed building it. We could not have done it without your support, feedback, and encouragement.
We also hope you’ll keep an eye on what we’re doing in our new digs at Datadog. It ought to be a very exciting 2015.
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
Technical
Hortonworks has published a new round of benchmarks for Apache Hive 0.14, which is the first version with a cost-based optimizer (CBO). On the TPC-DS benchmark, they see an average speedup of 3x across all queries. In addition to those numbers, they’ve also analyzed the effect of the CBO on query plans (e.g. if join order was modified or there was a predicate push down).
http://hortonworks.com/blog/cost-based-optimizer-makes-apache-hive-0-14-more-than-2-5x-faster/
The Parquet file format has gained a lot of traction since being announced as a joint project between Cloudera, Twitter, and others. Parquet is a column-oriented format, which means that data is stored differently than one is used to. This post gives an overview of the building-blocks of a Parquet file: row groups and column chunks. From there, the post gives three guidelines for working with Parquet files.
http://ingest.tips/2015/01/31/parquet-row-group-size/
“The morning paper” is a blog which summaries a new CS paper every weekday morning. This week, the blog highlighted five papers from the 2015 Conference on Innovative Data Systems Research (CIDR). Selections include Liquid, LinkedIn’s system for unifying nearline and offline big data systems (built using Kafka, Samza, and Hadoop), and Impala, Cloudera’s open-source SQL system for Hadoop.
http://blog.acolyer.org/2015/02/01/introducing-cidr15-week-on-the-morning-paper/
A lot of companies store analytics data as JSON, since it’s an easy to use and ubiquitous format. Spark SQL has embraced this, and offers built-in support for JSON. This post looks at programmatically loading a JSON file into Spark SQL (which can infer the schema by scanning the dataset), how JSON data types map to SQL, and more.
http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
The Hortonworks blog has a post describing new features in HDP 2.2 and YARN to support long-running applications. Areas of focus include fault-tolerence in the face of ApplicationMaster failure, security (since delegate tokens expire after 24 hours), log handling, service registry/discovery, and resource-isolation/scheduling. In addition, Apache Slider (incubating) is used to reduce the amount of effort required to deploy an existing distributed application in YARN. Apache HBase, Accumulor, and Storm are all supported via Slider on YARN in HDP 2.2.
http://hortonworks.com/blog/support-long-running-services-hadoop-yarn-clusters/
MapR has a blog post describing the differences between MapReduce v1 and MapReduce on YARN. The post walks through the various steps in submitting a job in both frameworks, describes the fair and capacity schedulers, compares the two frameworks, and more.
https://www.mapr.com/blog/how-job-execution-framework-mapreduce-v1-v2
This post describes how to use the haversine formula to calculate the great-circle distance between two points on Earth using Impala/Hive. The formula makes use of trigonometric and algebraic functions, which can be embedded in a SQL query.
http://blog.godatadriven.com/impala-haversine.html
KOYA is a project to support running Apache Kafka on Apache Hadoop YARN. Since the project was announced a few months ago, the team has decided to use Apache Slider (incubating) to develop the YARN application. The Hortonworks blog has more details on this decision and plans for the project (which is targeting an initial release in Q2 2015).
http://hortonworks.com/blog/koya-apache-slider/
The AWS big data blog has a post on using the Accumulo bootstrap action to install Apache Accumulo on a Amazon EMR cluster. The post has a walkthrough that starts a cluster, creates an Accumulo table, inserts and tags data in the table, and illustrates cell-based access controls at query time.
http://blogs.aws.amazon.com/bigdata/post/Tx15973X6QHUM43/Running-Apache-Accumulo-on-Amazon-EMR
This ingest.tips blog has a post with a quick tip for using Sqoop to dump data from a JDBC database to the local file system. This makes use of a local MapReduce job and the local filesystem implementation.
http://ingest.tips/2015/02/06/use-sqoop-transfer-csv-data-local-filesystem-relational-database/
HiveServer2 began using ZooKeeper for locking in order to support concurrency. This post describes the implementation, how it’s being improved to scale to more clients in upcoming releases of Hive, and several failure scenarios that the implementation addresses.
https://www.mapr.com/blog/how-refine-hive-zookeeper-lock-manager-implementation
The MapR blog has a post detailing counters in Hadoop MapReduce. It describes the four types of counters: file system, job, framework, and custom. For each, it describes some of the key counters and how they can be interpreted to debug or improve a MapReduce job.
https://www.mapr.com/blog/managing-monitoring-and-testing-mapreduce-jobs-how-work-counters
Parsely, makers of analytics software for publishers, have written about “Mage,” the system that powers their analytics engine. Mage is built on Apache Kafka and Apache Storm and implements the lambda architecture (Apache Spark is used for batch processing). The post describes how data flows through the system, the scale of their system, and more.
http://blog.parsely.com/post/1633/mage/
While Hue is predominantly a web-based interface for Hadoop, it also includes an API and a command-line interface. This post gives an introduction to the command-line tools, which can be used to update passwords, run tests, and shutdown Hive queries.
http://gethue.com/hue-api-execute-some-builtin-commands/
News
O’Reilly Radar has published a new book called “Women in Data” which profiles 15 industry leaders. The interviews share personal stories and also explore a number of topics related to gender-diversity in the big data industry. The eBook is free (behind an email-wall).
http://www.oreilly.com/data/free/women-in-data.csp
Datanami has two posts on Hadoop and high performance computing (HPC, which is often used in science applications). The first looks at some of the shortcomings of Hadoop that keep it from really taking off in HPC. These are things like the network layer (using TCP/REST/RPC), the immaturity of schedulers, and HDFS’ semantics and performance. The second post looks at some of the integrations that are driving Hadoop and HPC to converge, including Infiniband, GPU technology, and the cloud.
http://www.datanami.com/2015/01/26/rethinking-hadoop-for-hpc/
http://www.datanami.com/2015/01/27/three-ways-big-data-hpc-converging/
Cloudera has acquired Xplain.io, makers of tools for doing a meta-analysis of analytics database queries. Their tools can analyze and profile database queries, which are then used to generate optimized schemas for systems like Impala.
http://blog.cloudera.com/blog/2015/02/got-sql-xplain-io-joins-cloudera/
DataStax, makers of enterprise software for Cassandra, have acquired Aurelius, who is behind the TitanDB open-sorce project. TitanDB is a graph database, which supports multiple storage backends including Cassandra and HBase and has Hadoop integration for analytics of graph data.
http://www.datanami.com/2015/02/03/datastax-dips-graph-waters-pulls-titan/
There hasn’t been a whole lot of momentum behind Tachyon, the in-memory file system from the AMPlab group at UC Berkeley, from commercial vendors (aside from Pivotal). But BlueData, who makes a platform for Big Data tools like Hadoop, Impala, Hive, and Spark, is investing in supporting Tachyon. Datanami has more about BlueData, the use-cases for Tachyon, and how they plan to integrate it.
http://www.datanami.com/2015/02/03/tachyon-support-coming-big-data-hypervisor/
Informatica and Hortonworks announced the availability of end-to-end visual data lineage of all operations performed through Informatica. Informatica announced a similar integration with Cloudera last year.
http://hortonworks.com/blog/data-governance-transparency-lineage-informatica-hortonworks/
GigaOm has coverage of some news on the Hadoop distribution-front. In short, it sounds like Pivotal will be scaling back its Hadoop development and/or announcing a more formal partnership with Hortonworks or IBM. The news comes after some notable departures at Pivotal and a round of layoffs. An announcement is expected from Pivotal on February 17th.
https://gigaom.com/2015/02/06/exclusive-pivotal-ceo-says-open-source-hadoop-tech-is-coming/
Releases
Last week, MapR announced a number of updates to their distribution. Hadoop, Hue, Flume, Hive, HBase, Impala, and Storm were all updated to new versions.
https://www.mapr.com/blog/apache-open-source-project-updates-january
Apache Hive 1.0.0 was released this week. Previously known as version 0.14.1, the community decided to rebrand it as a 1.0.0 release to reflect the maturity of the project. The Hortonworks blog has a detailed look at history of Hive (the initial release was almost 6 years ago!) while Cloudera has a look at the future of the project.
http://mail-archives.apache.org/mod_mbox/hive-user/201502.mbox/%3CCALOyQxomDp8GW74-M6N11twjL5cL9a__tKDnuiHTBfVuZjhSjQ%40mail.gmail.com%3E
http://hortonworks.com/blog/announcing-hive-1-0-stable-moment-time/
http://blog.cloudera.com/blog/2015/02/apache-hive-1-0-0-has-been-released/
Cloudera released two new versions of their distribution, CDH, this week. CDH 5.2.3 includes a number of fixes, including important fixes for Avro, HDFS, HBase, Hive, and Impala. CDH 5.3.1 includes fixes to Impala, Hive, and YARN (including fixes for HA).
http://community.cloudera.com/t5/Release-Announcements/Announcing-CDH-5-2-3-and-CDH-5-3-1/m-p/24315#U24315
Version 0.8.2.0 of Apache Kafka was released this week. The new version contains a number of new features and improvements, including a new Java producer API, kafka-based offset management, delete topic support, improved configurability of consistency/availability, support for scala 2.11, and lz4 compression.
http://mail-archives.us.apache.org/mod_mbox/www-announce/201502.mbox/%[email protected]%3E
Yahoo! is a big user of Kafka—they have one cluster that does over 20Gbps at peak. To manage Kafka, they’ve built a web-based tool called Kafka Manager. The tool supports managing of multiple clusters, replica election, replica re-assignment, topic creation, and more. It’s built with Scala and the Play framework. The code is on github.
http://yahooeng.tumblr.com/post/109994930921/kafka-yahoo
Hivemall, the machine learning library for Hive, released version 0.3.0 this week. The new version includes an implementation of matrix factorization.
http://mail-archives.apache.org/mod_mbox/hive-user/201502.mbox/%3C54D4A60F.4080207%40gmail.com%3E
Events
Curated by Mortar Data
UNITED STATES
California
Spark After Dark, by Chris Fregly of Databricks (Santa Monica) - Tuesday, February 10
http://www.meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/219134760/
Couchbase as Operational and Light Analytics to Hadoop (San Diego) - Wednesday, February 11
http://www.meetup.com/sdbigdata/events/219925266/
Do NoSql Like SQL: Introduction to Apache Drill (Woodland Hills) - Wednesday, February 11
http://www.meetup.com/Westlake-Village-Data-Science-Meetup/events/220195982/
Building Real-world Machine Learning Apps with PredictionIO and Spark MLlib (San Francisco) - Thursday, February 12
http://www.meetup.com/sfmachinelearning/events/219672825/
Deeplearning4j on Spark and Data Science on the JVM with nd4j (San Francisco) - Thursday, February 12
http://www.meetup.com/spark-users/events/220117925/
Oregon
Moneyballing: How to Use Data to Win at Fantasy Football (Portland) - Tuesday, February 10
http://www.meetup.com/Hadoop-Portland/events/219792392/
Washington
Better Together: Dato and Spark (Bellevue) - Tuesday, February 10
http://www.meetup.com/Seattle-Spark-Meetup/events/208711882/
Utah
MapR at Big Data Utah (Salt Lake City) - Wednesday, February 11
http://www.meetup.com/BigDataUtah/events/218840772/
Texas
HBase/NoSQL Design Patterns (Houston) - Wednesday, February 11
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/219821997/
Process & Visualize Data with Hadoop/Hive & Tableau (Addison) - Wednesday, February 11
http://www.meetup.com/Divergence-Data-Science-Meetup/events/220018646/
Getting Started on Hadoop: A Hands-on Experience (Arlington) - Thursday, February 12
http://www.meetup.com/DFW-Analytics-Big-Data-and-Beyond/events/219723997/
Illinois
Transitioning Compute Models: Hadoop MapReduce to Spark (Chicago) - Thursday, February 12
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/
Tennessee
Netflix, Pig, and Hadoop: Are You Surus? (Chattanooga) - Thursday, February 12
http://www.meetup.com/CHadoop/events/219719434/
North Carolina
Hive on Spark (Durham) - Tuesday, February 10
http://www.meetup.com/TriHUG/events/219981640/
Maryland
From 0 to Streaming: Using Cassandra with Spark (Baltimore) - Wednesday, February 11
http://www.meetup.com/Data-Science-MD/events/219997608/
FRANCE
Discover the Mesosphere Datacenter Operating System (Paris) - Monday, February 9
http://www.meetup.com/Paris-Mesos-Users-Group/events/220101127/
BELGIUM
Lightning Fast Big Data Analytics with Apache Spark (Edegem) - Wednesday, February 11
http://www.meetup.com/BeScala/events/219958703/
NORWAY
Meet Hortonworks (Oslo) - Tuesday, February 10
http://www.meetup.com/oslohug/events/220198750/
ROMANIA
Apache Hive Workshop (Cluj-Napoca) - Thursday, February 12
http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/events/220256225/
ISRAEL
Big Data and Product Management at eBay (Tel Aviv-Yafo) - Tuesday, February 10
http://www.meetup.com/PMM-IL/events/219420804/
INDIA
A Use Case in Hadoop Executed in Apache Spark: Let Us See If It Is 100x Faster (Hyderabad) - Saturday, February 14
http://www.meetup.com/hyderabad-scalability/events/219097168/
Session on MapReduce with Python and Amazon EMR (Pune) - Saturday, February 14
http://www.meetup.com/Pune-Big-Data-Analytics-Meetup/events/219751224/
AUSTRALIA
Apache Spark 101 (Melbourne) - Monday, February 9
http://www.meetup.com/Melbourne-Apache-Spark-Meetup/events/219492503/
John Mallory, EMC CTO for Analytics, plus Hadoop 101 with MongoDB Integration (Sydney) - Thursday, February 12
http://www.meetup.com/Sydney-Big-Data-and-Analytics/events/219840233/
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
Technical
Mortar (disclosure: Mortar sponsors the event-section of this newsletter) CEO K Young has a post on data lakes, data pipelines, and data directories. Although data lakes are a hot topic right now, K argues that it's better to invest in data pipelines, and he discusses how Luigi is a good solution for building a pipeline.
https://www.linkedin.com/pulse/hows-data-lake-k-young
The Cloudera blog has a post about a new integration for CDH 5.3 between Sentry (the role-based access control layer for Hive) and HDFS ACLs. The post looks at how the integration allows Sqoop and Sentry to co-exist for the first time.
http://blog.cloudera.com/blog/2015/01/new-in-cdh-5-3-apache-sentry-integration-with-hdfs/
Hortonworks has the third post in a series on predicting airline delays with Hadoop. This post looks at using Scalding and R (previous posts covered Spark and Pig). Like the previous posts, there's an IPython notebook that walks through all the individual steps.
http://hortonworks.com/blog/data-science-hadoop-predicting-airline-delays-part-3/
The Hortonworks blog has a post summarizing some recent improvements to YARN that are part of HDP 2.2. Topics include: support for long running applications (Apache Slider), new types of resource management (CPU in addition to RAM slots, node labeling), and improvements to operational support (including rolling upgrades).
http://hortonworks.com/blog/apache-hadoop-yarn-hdp-2-2-substantial-step-forward-enterprise-hadoop/
"The Morning Paper" is a blog that recaps various computer science papers. This week, it looked at the ZooKeeper paper from 2010. It’s a good overview that serves as supplemental reading material or a refresher if it’s been a while since you read it.
http://blog.acolyer.org/2015/01/27/zookeeper-wait-free-coordination-for-internet-scale-systems/
Spark 1.2 introduced a streaming implementation of k-means with the ability to dynamically detect (and remove) clusters over time. The key to this feature is forgetfulness, which is implemented as a half-life parameter to decay old data. The Databricks blog has a post with more details on the algorithm, including several visualizations of it in action.
http://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
Cloudera had a post describing the enterprise security features that are part of CDH 5. Topics include Apache Sentry, integration with Active Directory and Kerberos, centralized audit logging, and encryption (plus key management). Not all of these features are available in the free version of CDH, but Cloudera claims many of the features aren't available in another distribution, either.
http://vision.cloudera.com/production-ready-hadoop-an-overview-of-security-in-cloudera-5/
The Confluent blog has a post from Martin Kleppmann, the author of the upcoming book “Designing Data-Intensive Applications.” The post is a edited transcript of a recent talk on stream processing. It covers a large number of topics, including streaming aggregation, relation to database systems, and several tools. The post is a great overview of important concepts in stream processing.
http://blog.confluent.io/2015/01/29/making-sense-of-stream-processing/
The Mortar blog has the transcript and video of a recent talk at the NYC Pig User Group. The talk describes the types of problems that Pig is really good for, its shortcomings, and the strengths and weaknesses of the user-facing APIs.
http://blog.mortardata.com/post/109495522361/pig-jonathan-coveney-talk
LinkedIn has written about their usage of Kafka and plans for the future. The post provides an insight into what they’re using Kafka for (including monitoring, messaging, analytics) and tools they’ve built around it (a REST API, schema registry, auditing service). Future plans include support for security, improved reliability/availability, and cost efficiency.
http://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future
The AWS Big Data Blog has a tutorial describing how to setup a Elastic MapReduce cluster with Elasticsearch and Kibana.
http://blogs.aws.amazon.com/bigdata/post/Tx1E8WC98K4TB7T/Getting-Started-with-Elasticsearch-and-Kibana-on-Amazon-EMR
News
The Apache Software Foundation has announced that Samza has graduated from the incubator and is now a top-level project. Samza, the distributed stream processing framework, uses YARN for fault tolerance and integrates with Kafka.
https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces71
MapR announced an initiative this week to provide free on-demand Hadoop training for developers, analysts, and administrators. Currently available courses are “Hadoop Essentials,” “Hadoop Operations: Cluster Administration,” and “Developing Hadoop Applications. Future courses will cover HBase, Drill, and Hive.
https://www.mapr.com/company/press-releases/mapr-unveils-free-demand-training-program-50m-kind-contribution-hadoop
Typesafe and Databricks announced results of a recent survey of Scala and Spark developers. Among the highlights—13% of respondents already have Spark in production and 20% plan to do a production deploy in the coming year. Readwrite has more coverage of the survey, and a follow-up interview with Typesafe’s architect for Big Data Products and Services, Dean Wampler.
http://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html
http://readwrite.com/2015/01/27/spark-scala-hadoop-typesafe-dean-wampler
TechRepublic has an interview with Ion Stoica, the co-founder of Databricks, about Spark. The post emphasizes Sparks’ versatility—it supports batch, streaming, SQL, and machine-learning. There are a few other interesting tidbits, including mention of Spark support for R in the future and the importance of libraries for Spark.
http://www.techrepublic.com/article/spark-promises-to-up-end-hadoop-but-in-a-good-way/
Hortonworks announced the Data Governance Initiative to develop software to meet enterprise requirements for data governance. Along with Hortonworks, Aetna, Merck, Target, and SAS will be working on the initiative, which will include further integrating Apache Falcon and Apache Ranger.
http://hortonworks.com/press-releases/hortonworks-establishes-data-governance-initiative/
The Splice Machine RDBMS, which is built atop of HDFS and HBase, is now certified for Hortonworks HDP.
http://hortonworks.com/blog/splice-machine-now-hdp-certified/
Releases
SequenceIQ has announced a new beta release of Cloudbreak, the cloud-agnostic Hadoop-as-a-Service framework. The new version includes user accounts, a usage explorer, support for heterogenous clusters, support for OpenStack, and more.
http://blog.sequenceiq.com/blog/2015/01/28/cloudbreak-last-beta/
HFactory is a platform for building HBase-based applications using Scala. This week, version 1.2 was released with a few enhancements and new features.
http://hfactory.io/blog/hfactory-1-2-release-notes/
VoltDB announced version 5.0, which includes expanded Hadoop ecosystem support. Specifically, VoltDB is now integrated with HDFS, MapReduce, and Kafka. It also supports exporting data as Avro.
http://www.prnewswire.com/news-releases/voltdb-announces-version-50-expands-hadoop-integration-extends-leading-fast-data-application-development-platform-300026817.html
Cloudera announced bug fix releases of Cloudera Manager (5.2.2 and 5.3.1) and Cloudera Navigator (2.1.2 and 2.2.1).
http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Manager-5-2-2-and-Cloudera-Manager-5-3-1/m-p/24101#U24101
Cloudera also announced a new release of the Impala ODBC and JDBC drivers. The new versions support HiveServer2 from CDH 4.1+ and Impala 1.0+.
http://community.cloudera.com/t5/Release-Announcements/Announcing-Impala-ODBC-v2-5-23-and-JDBC-v2-5-16-Drivers/m-p/24211#U24211
Events
Curated by Mortar Data
UNITED STATES
California
Interactive Session on Sparkling Water = Spark + H2O (Mountain View) - Tuesday, February 3
http://www.meetup.com/Silicon-Valley-Big-Data-Science/events/219076654/
Bayesian Networks with R and Hadoop (Palo Alto) - Wednesday, February 4
http://www.meetup.com/Hadoop-Talks/events/219644755/
Nonstop HBase: Making HBase Safe and Bulletproof by Ryan Rawson of WANDisco (Los Angeles) - Thursday, February 5
http://www.meetup.com/Los-Angeles-HBase-User-group/events/219045881/
Building Real-World Machine Learning Applications with PredictionIO and Spark ML (Mountain View) - Friday, February 6
http://www.meetup.com/Silicon-Valley-Machine-Learning/events/219609337/
Washington
Hadoop-Based Big Data Analytics with Datameer (Bellevue) - Thursday, February 5
http://www.meetup.com/CloudTalk/events/219236686/
Arizona
Oozie or Easy: Managing Hadoop Workflows the Easy Way (Tempe) - Wednesday, February 4
http://www.meetup.com/Phoenix-Hadoop-User-Group/events/219168066/
Colorado
Hands-on Spark Workshop for Beginners (Boulder) - Saturday, February 7
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/220044380/
Texas
Sean Busbey on Apache Accumulo (Austin) - Wednesday, February 4
http://www.meetup.com/austin-data-geeks/events/190580872/
Oklahoma
Machine Learning and Data Ingestion with Apache Storm, Kafka (Oklahoma City) - Thursday, February 5
http://www.meetup.com/Big-Data-in-Oklahoma-City/events/219965196/
Minnesota
Performance Tuning Cassandra at Target (Minneapolis) - Monday, February 2
http://www.meetup.com/Minneapolis-St-Paul-Cassandra-Meetup/events/218885496/
Tennessee
Intro to Hadoop Components and Distributions (Brentwood) - Monday, February 2
http://www.meetup.com/Data-Science-Nashville/events/219721972/
Maryland
Introduction to Big Data Techniques for Cybersecurity (Rockville) - Monday, February 2
http://www.meetup.com/Capital-Area-Cyber-Security/events/219333009/
Introduction to Apache Accumulo: Architecture and Use Cases (Jessup) - Tuesday, February 3
http://www.meetup.com/Accumulo-Users-DC/events/220167811/
Massachusetts
Get Started with Hadoop Experts: Big Data for Social Good Challenge (Cambridge) - Tuesday, February 3
http://www.meetup.com/Big-Data-Developers-in-Boston/events/219861979/
CANADA
Greenplum Deep Dive (Toronto) - Tuesday, February 3
http://www.meetup.com/Toronto-Pivotal-User-Group/events/219718869/
MEXICO
Primera Reunión de Apache Spark (Mexico City) - Friday, February 6
http://www.meetup.com/Mexico-City-Apache-Spark-Meetup/events/219029479/
IRELAND
First Galway Data Meetup, with Michael Hausenblas of MapR (Galway) - Tuesday, February 3
http://www.meetup.com/Galway-Data-Meetup/events/219176364/
FRANCE
Spark Meetup at Viadeo (Paris) - Wednesday, February 4
http://www.meetup.com/Paris-Spark-Meetup/events/220141774/
Batch on Hadoop with Cascading (Lyon) - Thursday, February 5
http://www.meetup.com/Lyon-Hadoop-Meetups/events/219273421/
GERMANY
Hadoop and Data Warehouse–Friends, Enemies or Profiteers? What about Real-Time? (Cologne) - Wednesday, February 4
http://www.meetup.com/NoSQL-Usergroup-Cologne/events/219874149/
AUSTRIA
Cassandra: How It Works and What It's Good For! (Vienna) - Wednesday, February 4
http://www.meetup.com/Vienna-Cassandra-Users/events/220027008/
ISRAEL
Lessons I Learned Building a Big Data Startup (Tel Aviv) - Monday, February 2
http://www.meetup.com/BI-on-the-Bar/events/218719652/
Tez vs Spark (Tel Aviv) - Sunday, February 8
http://www.meetup.com/HadoopIsrael/events/219382194/
CROATIA
Introduction to Spark (Zagreb) - Tuesday, February 3
http://www.meetup.com/Apache-Spark-Zagreb-Meetup/events/219811877/
AUSTRALIA
Big Data Integration Research (Canberra) - Tuesday, February 3
http://www.meetup.com/Data-Science-Canberra/events/220176788/
Machine Learning at the New York Times, with Daeil Kim [Video + Slides]
John Matson
At our most recent NYC Data Science meetup, we were joined by Daeil Kim of the New York Times. Daeil is a data scientist at the Times and is finishing up a PhD in machine learning at Brown.
For a look at how data science is done at a venerable but forward-looking institution, check out the video and slides below from Daeil's talk.
Why Pig? Meetup Talk by Jonathan Coveney [Video, Slides, Transcript]
John Matson
There are countless big data technologies out there, and it's often hard to choose the right tool for the job.
Pig is concise, easy to use, and "battle-tested," Jonathan notes, but some of the syntax can be confusing, and developing new features for the language is a challenge. In the end, he suggests several improvements that would help futureproof Pig and "better leverage the strengths it has." Check out the video and slides below for more.
(Because of an A/V problem during recording, the sound quality of the video suffers at times. So we've added closed captioning, which you can toggle on and off with the "CC" button in the player.)
Transcript: Jonathan Coveney at the NYC Pig User Group Meetup
October 15, 2014
I started working with Pig, at this point, four or five years ago.
So it was already fairly mature at that point. But a ton of the richer features that are in Pig right now hadn’t been added yet.
I worked a lot all across Pig, from the front end to the parser, to code generation, to plan optimization, a lot of things. Each of these companies -- less so Spotify, but ComScore and Twitter definitely -- had a lot of Pig in production. Especially Twitter.
The point of this talk, as he said -- which, you know, they just released a bunch of sweet new stuff, so you all should check that out. This is including Luigi, which came out of some work at Spotify. It’s a pretty reasonable scheduling alternative. So, in my time I have worked with Pig; I have worked with Scalding, which is a DSL which I’ll talk about some. I’ll compare a little bit, but I think my perspective is more, what is going to make Pig useful in five years.
Right now, I think Pig is very, very useful, but I also think that there’s some features that are going to be less competitive, let’s say, than potential alternatives. I think it’s great that the Spark and Pig talk is coming up next because I think if the Pig community can really nail Tez support and Spark support, that’s really going to help futureproof people’s pipelines because Spark is kind of huge. But I also think that there’s other things that can be done to make Pig a really robust and elegant choice.
If you guys have questions at any time, we’re a small intimate group. So please interrupt me. I don’t care. Tweet at me @jco. It’s not a perk anymore but it used to be that you could get a sweet username if you worked at Twitter.
Pig at Twitter -- hundreds of users, this guy Nathan is one of them, thousands of scripts, tens of thousands of daily jobs, and many hundreds if not thousands of internal user-defined functions, many of which were written by me. We use it for GEO, we use it for ads, we use it for ad hoc research, we use it for business intelligence reporting. We use it for all sorts of things, but we’re not going to use it anymore. So, why, right?
These guys, should we just cut them out, it’s done, go learn SQL? No. That’s not why. Why are we going to take hundreds of users, thousands of scripts, tens of thousands of daily jobs, and hundreds of internal UDFs, and we’re not going to throw away that work, but we are going to prioritize Scalding, which is a Scala-based DSL.
What I’m going to talk about is how we think about that decision, and the truth is that’s a decision very specific to Twitter that I think really only applies to a company kind of like Twitter.
This kind of weird chart is my view on the learning curve of a language like Pig compared to an alternative like Scalding. I’m not even going to talk about raw MapReduce. I think everyone in this room has seen the slide: where here’s wordcount in MapReduce, it’s gigantic, it fills up the whole wall, and Pig is three lines.
We know that Pig is concise and easy, but where does it shine, where does it not shine? To me, this area is pretty awesome, right? This is why people love Python, this is why people love Ruby, this is why people love Pig because, yeah, it’s a different syntax or whatever, but once you can get around the rougher edges, you’re going. You can build an ETL pipeline that’s going to last you, like with Twitter, for years. We put Pig into production like five, six years ago, and those scripts are still chugging.
Barring some upgrades or whatever, totally fine. And it’s great because you can have your users type Pig and bam, they do describe on relations, they can see the schema, they can do some illustrate to get a sample of the data. They can start building up the pipeline much as in a Python or a Ruby, they can build up the pipeline that they want. So, whether it’s for something that they’re going to run on an ad hoc basis or something they’ll run regularly, for kind of simple stuff, it’s really nice. And the truth is, for a lot of companies, for a lot of people, that area is as far as it goes. This is the moneymaker, and this is kind of true, I think, for the Spark world as well. A lot of the early Spark adopters are ones that are making a lot of money off of very specific things like better Netflix recommendations -- that’s how they make their money, and they’re doing it over here.
But then we start going up the learning curve. And what happens? You start getting weird errors. 6099—what the hell is that? Why didn’t they give me logs? OK, this abstraction is a little leaky, that’s fine. It’s MapReduce, we’re all getting paid a lot, we can figure it out.
The syntax to me is a little inconsistent. It’s like, if I’m inside of a GROUP, then a FOREACH, the syntax changes, the DISTINCT is different. My memory’s getting blown out, why’s that happening?
You start having to understand the way Pig is planning your job, which is fine. Note that this portion of the blue line is still under the red line, which is Scalding. It just means that you’re starting to ramp up, and then the part where Pig gets really gnarly for people -- and this is why they did that work to make native Python UDFs work -- is kind of the UDF part of it. People love the scripting language, and then all of a sudden they have to start writing Java? The syntax itself for doing basic analysis is fine enough, but kind of making sure that this schema is handled properly is pretty nasty. Many-to-1 UDFs, that’s very technical, understanding how the algebraic accumulators stuff works. The polling, the testing, the scheduling. This is why they just released Luigi; up until now, everyone had crappy home-rolled scheduling. People say, Why doesn’t Twitter open-source theirs? Because it’s not very good. It’s one of those pieces of the stack that gets into production, and then you never want to rewrite it because you have more important things to do. And then four years later it’s like, wow, we have a really crappy scheduler.
That’s where Twitter is. That’s what we’re worried about. We have a ton of different groups, a ton of different engineers of different levels of sophistication that need to write robust, maintainable pipelines usually that need to be maintained by other people.
So then for Scalding, which you guys may or may not know -- you should take a look, it’s interesting, even if you don’t use it. It’s in Scala. It kind of leverages functional programming idioms to make it feel like you’re manipulating a list, traversing a list, but actually you’re building a data flow to work on Hadoop.
It’s like, “Oh god, I gotta compile a job? That’s why I write Python! I hate compilers!” Then you start learning it, and you’re like oh, OK, I don’t have to change languages to write UDFs. You just write it inline, and it works. I want to use JUnit to test my stuff. I can do that. I want to use ScalaTest or whatever you want because it’s all on the JVM. That’s nice, right?
And then we have a ton of stuff that’s like, oh I want to use a profiler. It’s really easy to understand what’s going on. I want to use Scala macros to do crazy, mind-bending things. I can do that! It’s just all Scala. It’s a language, right? Instead of paying someone like me to be like, oh yeah, you happen to be running into some Pig bug that we haven’t have time to fix.
The Scala community is pulling it forward. This is a classic debate between standalone DSLs and embedded DSLs.
This isn’t a resounding win for Scala. There are many things that Pig can do that can better leverage the strengths that it has to make it really good.
What does Pig do well? It lets you get started quickly. It’s great for exploring datasets. It describes your flow really nicely. If you’ve ever tried to make anything like... I think “production SQL” is an oxymoron. It’s awful. It’s impossible to test, and once you get past 10 lines, it’s like, what the hell is going on? The nice thing about SQL is you’re just describing your output. When your description gets muddled, then your understanding of it is, too. Pig lets you break it up into pieces, describe the flow, get it out.
And we have people in the community like this guy working on getting Spark support, doing Tez support. So it has a community. The thing that shocked me when I was working on Pig, I went to Hadoop Summit, and Sears gave a keynote about how they used Pig to save a bunch of money. And I’ve never seen them on the user list, but this is a critical piece of their infrastructure. I thought that was pretty cool. And then I thought, I wonder how much Sears would pay me to go work there? And the answer wasn’t a lot.
For most companies, these benefits outweigh the negatives. I’ll talk about some things that I don’t love or that could be improved, but this is huge. In the same way that Twitter kind of started on Ruby on Rails and they still have a little of that but we’ve moved to a JVM/Scala/Java service-oriented architecture. Very similar. Twitter would not have existed without Ruby on Rails, but Twitter today wouldn’t exist without JVM, you know, modulo everything.
These are the things I’ve been ruminating on a lot. I think a lot of it comes down to software engineering “in the large,” which sounds like some horrible 90s object-oriented thing where you read a book by someone called Uncle Bob -- and by the way, how do you become software Uncle? How does everyone just decide that guy is just Uncle Bob. So, testing is really -- to be fair, a lot of people just don’t test. Python testing? Who does that? No one. That’s kind of a non-argument for some people.
IDE support I think is a big one. They have done a lot of good work on surfacing errors, and connecting that to the line of code and that sort of thing. But we could do a lot better with the information that Pig has. There just hasn’t great support for making that really world-class.
Consistency. This one drives me crazy. There’s just a lot of rough edges in the way that Pig semantics work. You group on something, and it kind of names it the relation that it’s grouped on, which is very confusing to people. The first column is called “group.” The second one is the name of the relation that you grouped. Why is that? It’s just the case because they made it in 2006; the world was different. We were still in Iraq. We’re still in Iraq. But, you know, it was a different world.
The type system is inconsistent as well, which some people may not care. “We don’t need types! Python!” But to me working in big data platform stuff, it’s really nice to have a really clear understanding of the data that’s flowing through your system. I don’t want to be just eating bytes and spitting out bytes, it’s great, we’re processing data. I don’t know if it’s right data but we’re processing data. I want to know that the data that I’m processing is good.
And the fact that Pig does have types is super useful, not just to make sure that you’re processing what you think you’re processing, but also for doing a lot of important optimizations, which are quite hard to do in Scalding, for example.
I say this with love. A lot of the things that I’ve contributed to Pig have made it an aging codebase. I spent many months working on a pretty nifty feature that got ripped out as soon as they started working on Tez. Whatever. It’s cool.
Honestly, if you can tell me how Accumulate works, I’ll buy you a bottle of Scotch. I’ll send it to you. I don’t care where you live. This is even true of the internet people -- tweet at me. It’s gross. It’s not good.
It’s not finger-pointing. The guys who have worked on this are really really smart and really really good.
As with anything, to make sure that piece of software continues to stay relevant, I think it’s important to continually shave off part of it, right? And that’s a classic Uncle Bob software thing.
Something I kind of talked about earlier was the fact that for a lot of people these trade-offs don’t work, they don’t matter. So, what are the different types of people? What’s going on? Obviously, the world is a diverse place and crazy things are going on with data, but roughly I’m going to categorically divide along three points.
One is the data guy. It’s common in startups, and it’s also common in big companies where just somehow you become the person who runs the SQL script. They love your results and they don’t ask you how you got them, and you kind of babysitting some important data. There isn’t a lot of infrastructure around it; it’s just that data is valuable, you’re the guy who provides the data. There you go.
Beyond that, you have a team whose job is to write and maintain pipelines for others to use. I think for many companies they never leave the second stage. A lot of people are on the analytics team, and what the team does is own all the production analytics jobs. If someone needs a piece of data, they don’t sit down and write Pig themselves, they ask, “Hey, Susan on the analytics team, can you please give me some data?” And she’s like, “Sure, yeah, I’ll do that if you give me a pizza.” They get her a pizza and she gives them some data.
Honestly prioritization often works that way. When I first joined Twitter, we were in that stage. Nathan and I were coworkers, we were on the same team. He was a data scientist, I was an engineer, living in harmony. If I liked people, I would write UDFs for them. I like most people, so I wrote a lot of UDFs. Over time, as a company grows -- usually what ends up happening is it usually grows out in the following way. Some guy or gal, VP, whatever, says “You’re never prioritizing my stuff. A bunch of analytics people don’t care about my sales stuff. My sales stuff is superinteresting.” So they go out and hire someone. They’re like, “Nathan, do you love sales stuff? Here’s a six-figure salary.” And Nathan’s like, “I love sales stuff now.” And then they become the analytics guy in that part of the organization. Instead of having an analytics team that might guide the way that data is used in a company, now you have analysts all over the company. You might have analysts embedded in sales, analysts working on growth, “growth hackers” is a common thing right now. You have people doing BI. All sorts of roles, all writing Pig, writing SQL, getting data, doing it regularly.
Now instead of “I’m the data gatekeeper, buy me a pizza”, now I’m the best practices person. I might be the person that says, “hey, maybe you should use Mortar. Maybe that would be a lot easier than AWS or something like that.”
A lot of companies aren’t there. Twitter is there. And that’s where Pig, I think, tends to be a little bit weaker. I don’t know if any of you have a 2000-line Pig script? But it’s awful. It’s just terrible. And it’s pretty hard to undo. And the undoing of it can often be hostile to, like, Spark.
The point is composition is very hard in Pig as it is today. Not impossible, tough. Whereas if you’re in a Java or Scala DSL, you have object orientation, you have functional programming. You have languages designed to solve -- the thing they solve is abstraction and composition.
This is my perspective. Big data teams, analytics, data scientists, data engineers, straight across the board of an organization -- most companies aren’t like that. But these companies are driving a lot of the investments.
The question you might be asking is, like, Jonathan worked at Twitter, he worked on Pig -- why didn’t they just decide to fix all this? Honestly, that’s more of a prioritization thing. I think we could have back in the day, we chose not to, or they chose not to. I kind of wish they had, but Scalding is cool, too. If we don’t think about this stuff, it’s going to sneak up on us. If we didn’t have people working on Pig on Spark it would be like, oh, crap, no one’s using Pig anymore because they’re all using Spark.
Having worked in Scala land a long time, there is something essential to Pig that’s useful. Scalding is really nice, but I do think that all those benefits I talked about earlier are legitimate benefits, and there are things about Pig that are just -- like, it would be really nice if you could write a small script, you know it’s going to be super fast and testable, and then you can kind of ride that learning curve.
Because when I’m writing Scalding stuff, I know it really well at this point, but I’ve got to set up my build tool, my SBT or my maven, where’s that tool...
There’s a lot more work, whereas it’s like Pig, bam, go!
A smattering of issues. Adding types is extremely difficult, which is terrible because types are really, really important, in my opinion, for accurate modeling of data. I would know: I think I’ve added three different types. It’s awful, it’s poorly abstracted. It’s very difficult to do, which is a shame, because it would be great to have pluggable types, for example. If someone’s working in geo, they can say, Hey, we built out this really nice geo thing. And a lot of the geo stuff is GPL, so we couldn’t include that here. So it would be really great if that was a standalone module, you could go to github, clone it, now you’ve got all the geo stuff. That’s not the case. It’s also very bug-ridden, because it’s not some central type dispatcher. It’s kind of baked into all different parts of the code.
Pig could also be a lot faster. One of the great benefits of forcing to people to think in columns and types is that you have a lot of really rich information. Pig already leverages that in a pretty good way right now. It does push-down filtering and column trimming and things like that, but it could use code generation to be more memory- and CPU-efficient. Something Spark does that’s really interesting is it’s using Parquet now for intermediate representation. Parquet is a column store to come out based on a collaboration of Cloudera and Twitter, Julian at Twitter, and a number of other people now. Either way, they’re using that internally and Pig could do the same.
And the UDF boilerplate, again, they’ve worked on it, but it’s bad. It’s worse than bad. It’s not just bad in the sense that you have to use Java, but right now you have to make a dot Java file for every single function from the Java standard library that you want. There is code in Pig to get around that I used code generation to auto-generate, but they’re like, Oh, Jonathan, go work on other things. So that never made it as mainline as I would like.
The point is, you could be a lot smarter about leveraging things automatically, and that kind of doubles back to types. Richer type system, better inference of functions, and then all of a sudden you import Guava and you get a really rich set of tools to just automatically do things.
Pig, internally, is very stateful and difficult to reason with, which is fine. It’s battle-tested, so you know it will do what you want to do, but what it means is it’s kind of at that point where adding new features, a lot of the time is spent not adding the feature but kind of wrangling these hidden contracts in the data. This field is null, and when I call this other method, as a side effect this field will be populated. And I’m a Haskell fan—our programs shouldn’t be doing anything! And instead they’re populating nulls. Testing is clunky, there’s PigUnit, it’s all right, composition is poor, macros -- it’s hard, because the abstraction is a bit leaky, it’s really hard to just trust a black box in this world. How is it implementing that? Maybe it’s going to OOM you because the number of values per key was different. That’s not great.
Development is tricky.
A constrained DSL with a type system, that is like, holy crap, the IDE could be amazing! You could be king of Hadoop, getting information about your columns, populating it with samples. This is work that Mortar has worked on, but, you know, it could be really, really great.
Rewrite it all in Haskell. That is my solution. Or Scala or Java. Not all of it, but you know, I think it needs cleanup. Current development is unsustainable, in my opinion. Like Tez support is coming, but it’s taken six people with really deep knowledges of Pig a really long time to do. And turn on Spark support, he’ll tell you how that’s going. Maybe he’s like, it’s so so easy! But it’s not. I remember. It was a Twitter hacking project, we worked with the Spark guys to begin the Spork stuff. And it was like, Oh, PO ForEach is a nightmare.
Adding new features should be modular, which is, in my opinion, what I think will really make Pig last into the future. He should be able to focus on, OK, I want Pig on Spark, I will work on that. And instead it’s, now I have to worry about this code, and oh, this lifecycle is bad, and why are they using reflection here, and why is this OOMing? What’s going on?
And once again, for most people it doesn’t matter. But for the people driving the language forward, it matters a lot.
tl;dr: Super useful tool, very mature, a lot of successful deployments. I dare anyone circa 2006 to do a better job. Pig came out, it’s super robust and reliable, but it’s long in the tooth. I’m a Scala fanboy, I want to write Haskell. I don’t want Java. I don’t want object-oriented. Are any of you doing J2EE? Pig’s entering J2EE territory.
When it comes to big company issues, these are kind of good problems to have. You’re like, damn, the 30 analysts in the sales org are complaining about Pig. That’s great, you’ve got 30 analysts dedicated the sales org. Let’s deal with it.
But it makes it an incumbent ripe for disruption. And maybe that’s fine. But I’m someone who believes that we can build something beautiful to tackle this problem.
I think Pig has taken a bold step, but there are things that we could do to kind of push it along.
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
It was a busy week of announcements and releases—Google and Hortonworks announced a new integration for HDP on Google Compute Engine, Google and Cloudera announced a joint project to bring a Spark backend to the Google Dataflow SDK, Netflix announced a new open-source project, and Apache Flink 0.8.0 was released. In addition, there are articles on machine learning from PayPal and Databricks as well as several other high quality posts on Kafka, HBase, and more.
Technical
PayPal has a post on how they’re training Restricted Boltzmann Machines to build Deep Belief Networks. While many folks are using GPUs to speed up these types of computations, PayPal was looking for a way to make use of existing Hadoop infrastructure. They’ve implemented an adaptation of IterativeReduce running on YARN (Hadoop 2.4.1). The post has a thorough overview of how they use the YARN APIs to build their system, and they show good results from an evaluation of the implementation.
https://www.paypal-engineering.com/2015/01/12/deep-learning-on-hadoop-2-0-2/
Apache Flink is a large-scale, in-memory processing and data streaming framework that’s compatible with Hadoop. This presentation gives an overview of the API ( which resembles Spark and Scalding and includes streaming and graph APIs), compatibility with the Hadoop ecosystem (Mappers and Reducers can be used unmodified), the included visualization tool, the runtime (and how it compares to Spark), and the project roadmap (improvements to fault tolerance and streaming fault tolerance, backend for Hive, and more more).
http://www.slideshare.net/KostasTzoumas/apache-flink-api-runtime-and-project-roadmap
Making the leap from running a Hadoop job in-memory to on a cluster (especially a pseudo distributed one) can be frustrating as you battle configuration, setup, etc. This post suggests using the Kiji Bento Box, which will build a local Hadoop cluster with Hadoop and setup all the proper environment variables to interact with that cluster.
http://www.andyamick.com/blog/easier-get-started-with-scalding
The ingest tips blog has some guidance for using Kafka to ship large messages. The post has several suggestions for avoiding large messages, as well as advice for how to configure Kafka to handle large messages if the other suggestions aren’t feasible.
http://ingest.tips/2015/01/21/handling-large-messages-kafka/
The Databricks blog has a post about the implementation of Random Forests and Gradient-Boosted Trees in Spark 1.2’s MLlib. The post gives a high-level overview of how decision trees work and how MLlib distributes the computation. It then provides some code snippets to provide an introduction to the API and shows several scalability results based on evaluating a dataset in AWS EC2.
http://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html
MapR has a new Whiteboard Walkthrough (both a video and a transcript) about HBase key design. OpenTSDB’s schema is used as an example, and the presentation discusses things like sequential vs. random keys and the importance of knowing the data access patterns.
https://www.mapr.com/blog/hbase-key-design-opentsdb
The Hue blog has a post on making Hue highly-available by running multiple instances of the Hue application behind a load-balancer. The tutorial walks through the requirements (a HA database backend and nginx/haproxy installed), and describes how to enable the load-balancer (which runs via supervisord).
http://gethue.com/automatic-high-availability-with-hue-and-cloudera-manager/
Sematext offers a monitoring solution for Spark as part of their Performance Monitoring (SPM) product. This post by a customer describes how to integrate the monitoring with Spark and gives an example of a production issue they solved with the help of SPM. Given that Spark is still relatively young, it’s good to see more solutions for monitoring and debugging helping folks become more productive.
http://blog.sematext.com/2015/01/21/spark-performance-monitoring-use-case/
Hadoop’s cost advantage is based on using commodity hardware, including commodity hard drives. Not all hard drives are the same, though, and failure can be expensive and time consuming. Cloud-backup provider Backblaze has posted a new analysis of hard drive failure rates based on their experience with many different kinds of disks. They look at drives from HGST, Seagate, Toshiba, and Western Digital.
https://www.backblaze.com/blog/best-hard-drive/
The Hortonworks blog has a post about the past, present, and future of HBase High Availability. The majority of the post looks at the recently added Timeline-Consistent Region Replicas, which provide a read-only version of the data in case of a region failure. Combined with best practices, this feature allows for 99.99% availability, although clients must decide if they need strict consistency (i.e. can only query the primary region) or if they can accept stale data. Looking ahead, the HBase team is working on write-availability during failures and cross-datacenter consistency.
http://hortonworks.com/blog/apache-hbase-high-availability-next-level/
A lot of presentations and posts on event-processing are either very high-level and theoretical or about the low-level details of a particular technology. This talk centers around a few technologies (kafka and avro), but it strikes a good balance of theory and practice. There are several important details and ideas related to building an event-based processing system.
http://www.slideshare.net/esammer/from-source-to-solution-building-a-system-for-eventoriented-data
News
The Register has a look at Hadoop-as-a-Service vendor Qubole. The post describes Qubole’s platform (and its differentiators), gives some stats about how it’s being used (processing around 86PB/month), and describes a bit about its customers/demand (planning to be on Azure marketplace soon, many folks are using log data stored on S3).
http://www.theregister.co.uk/2015/01/19/qubole_rains_down_analytic_insights_from_a_hadoop_cloud/
In May, Hortonworks acquired Hadoop security company XA Secure and shortly thereafter Cloudera acquired Gazzang. Gigaom Research has a post looking at what Hortonworks and Cloudera have done with their acquisitions—which parts are free, open-source, or remain proprietary. It also looks at the Apache projects related to Hadoop security—Sentry and Ranger, which have some overlapping goals.
http://research.gigaom.com/2015/01/hadoop-security-wars/
RCRWireless News has an interview with Xplenty CTO Saggi Neumann discussing several predictions for Hadoop in 2015. Among the predictions: Spark will take off, there will be increased competition among Hadoop vendors, Hadoop will transition to the cloud, and companies trying to deploy Hadoop will continue to see a shortage of qualified candidates.
http://www.rcrwireless.com/20150120/big-data-analytics/hadoop-competition-heating-up-in-2015-tag20
This post has advice for companies trying to build a team for a Hadoop deployment: which roles and “tiers” of employees to hire. A lot of folks approach Hadoop without a full understanding of all the roles that need to be filled, and this can lead to under-estimating the amount of resources (and work) needed to be successful.
https://www.linkedin.com/pulse/finding-right-people-your-hadoop-initiative-igor-izotov
Spark Summit East is in New York on March 18th and 19th. The agenda, which covers three tracks (Developer, Applications, and Data Science), has been posted.
http://databricks.com/blog/2015/01/20/spark-summit-east-2015-agenda-is-now-available.html
“Advanced Analytics with Apache Spark” is an upcoming book by several members of the Cloudera Data Science team. The book is currently in early release from O’Reilly Media. The Cloudera blog has an interview with the authors about the goals of the book, the intended audience, and more.
http://blog.cloudera.com/blog/2015/01/advanced-analytics-with-apache-spark-the-book/
Google and Hortonworks announced this week that Hortonworks HDP 2.2 is now available on the Google Cloud Platform. The integration uses Google’s bdutil to build a cluster that’s provisioned via Apache Ambari.
http://googlecloudplatform.blogspot.com/2015/01/hortonworks-hdp-22-on-google-cloud.html
Pachyderm is a new startup from the Y Combinator Winter 2015 class which is building an alternative implementation of Hadoop. Their distributed file system and MapReduce framework is open-source, and makes heavy use of HTTP. The company and software are still in early stages but are worth keeping an eye on.
http://techcrunch.com/2015/01/23/pachyderm/
Revolution Analytics, makers of the RHadoop packages, announced that they’re being acquired by Microsoft. The announcement recognizes Microsoft’s recent embrace of open-source, which includes Linux on Azure and support for Hadoop via Azure HDInsight.
http://blog.revolutionanalytics.com/2015/01/revolution-acquired.html
Releases
Pivotal has released version 1.4 of GemFire XD, its distributed in-memory database. The new version includes support for a JSON data type, persistence of data in Hadoop, and more.
http://blog.pivotal.io/big-data-pivotal/products/gemfire-xd-1-4-released-and-available
Google’s Cloud Dataflow is a system for distributed processing that combines batch and stream processing. While Google offers an implementation for their backend stack, there SDK is open-source and is amenable to additional backend implementations. Google and Cloudera have collaborated on a new backend for the Dataflow SDK which executes via Spark. The project is newly incubating in Cloudera Labs.
http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/
Netflix has started a new open-source project called Surus, which will provide a number of UDFs for Pig and Hive. The first of these UDFs (they plan to add more over the coming year) is for scoring predictive models in Pig using the Predictive Modeling Markup Language. The post has an example of building a model in R and evaluating the model on billions of rows using Pig.
http://techblog.netflix.com/2015/01/introducing-surus-and-scorepmml.html
JobTracker.app provides a menu bar helper for Mac OS X for viewing jobs in the JobTracker/Resource Manager (including notifications of started/completed/failed jobs). Version 1.3 was released this week with support for CDH5/YARN.
http://footle.org/JobTracker/
Apache Flink released version 0.8.0 this week. The new release includes a new Scala API, adds new windowing semantics to Flink Streaming, and includes many performance and usability improvements.
http://flink.apache.org/news/2015/01/21/release-0.8.html
Events (curated by Mortar Data)
UNITED STATES
California
Operating in a Multi–Execution Engine Hadoop Environment (Santa Monica) - Tuesday, January 27
http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/216593262/
Apache Flink: Fast and Reliable Large-Scale Data Processing (Palo Alto) - Wednesday, January 28
http://www.meetup.com/Apache-Tez-User-Group/events/219302692/
Next-Generation Access Control for Hadoop, HBase and other NoSQL Databases (Fremont) - Thursday, January 29
http://www.meetup.com/Big-Data-Security-and-Data-Governance-Meetup/events/219285891/
Spark + Cassandra (Santa Clara) - Thursday, January 29
http://www.meetup.com/DataStax-Cassandra-South-Bay-Users/events/219649049/
Colorado
Advanced Data Storage Technologies: AVRO and Parquet (Broomfield) - Wednesday, January 28
http://www.meetup.com/Boulder-Denver-Big-Data/events/219803770/
Oklahoma
January MySQL Meetup: Big Data (Oklahoma City) - Wednesday, January 28
http://www.meetup.com/Oklahoma-City-MySQL-Meetup/events/219314486/
Minnesota
Analytics with MapR and Hadoop (Saint Louis Park) - Thursday, January 29
http://www.meetup.com/Twin-Cities-Enterprise-NoSQL/events/219156040/
Florida
Design Patterns for Storm and Kafka (Saint Petersburg) - Wednesday, January 28
http://www.meetup.com/Tampa-Hadoop-Meetup-Group/events/219798631/
North Carolina
Creating a Next-Generation Big Data Architecture (Charlotte) - Wednesday, January 28
http://www.meetup.com/CharlotteHUG/events/216124872/
District of Columbia
"Sparkling Visualizations": Data Viz Solutions + Spark (Washington) - Wednesday, January 28
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/219501996/
Maryland
Introduction to Big Data Techniques for Cybersecurity (Rockville) - Tuesday, January 27
http://www.meetup.com/Capital-Area-Cyber-Security/events/219333009/
ISRAEL
Datameer 5.0: Hadoop Like Never Before (Tel Aviv-Yafo) - Monday, January 26
http://www.meetup.com/Israel-MidLinks-BigData-Meetup/events/219140109/
DENMARK
Hadoop: What Is It... Why Does It Matter? (Aarhus) - Tuesday, January 27
http://www.meetup.com/Big-Data-Denmark/events/219374564/
UNITED ARAB EMIRATES
Real-Time Insights from Big Data (Dubai) - Tuesday, January 27
http://www.meetup.com/Hadoop-User-Group-UAE/events/219512233/
POLAND
An Introduction to Scala and Hadoop (Zielona Góra) - Wednesday, January 28
http://www.meetup.com/Zielona-Gora-JUG/events/219374455/
GERMANY
Flink Community Updates & New Features: SQL-Style Queries, Akka, Hadoop Compatibility (Berlin) - Wednesday, January 28
http://www.meetup.com/Apache-Flink-Meetup/events/219639984/
FRANCE
Haven, Flink, Hadoop Use Case (Paris) - Thursday, January 29
http://www.meetup.com/Hadoop-User-Group-France/events/219778022/
SPAIN
Cassandra & Java + Cassandra & Spark for the Internet of Things (Barcelona) - Thursday, January 29
http://www.meetup.com/Barcelona-Cassandra-Users/events/220004623/
INDIA
Recommendation 2.0 (Bangalore) - Saturday, January 31
http://www.meetup.com/Big-Data-Beyond-Hadoop/events/218923722/
As of today, Apache Spark has been integrated into the Mortar platform and is available for use by customers on request.
We always strive to deliver the best data technologies in a platform that is rock-solid, scalable, and easy to use. Our customers have already done incredible things with our existing technologies—Pig, Luigi, Python, R, Java—and we are excited to see what they can achieve with the addition of Spark.
About Spark
Spark has often been called "the successor to MapReduce": a faster alternative that will someday supplant the venerable MapReduce data-processing framework. Indeed, Spark is an exciting, powerful technology that seems poised for widespread adoption. Its ability to cache datasets in memory for repeated access enables Spark to achieve speeds up to 100 times faster than traditional Hadoop MapReduce for certain tasks.
Spark on Mortar
With Mortar’s new Spark integration, users can now spin up a cluster with any number of nodes and run a Spark script with a single command. The Mortar application collects and displays logs and any output from the Spark job, while also providing access to Spark’s user interface for a deeper look while the job is running.
Spark is a relatively new technology, having graduated from the Apache Incubator in 2014, and as such it is not as fully developed as the technologies around MapReduce. So for the time being, Spark on Mortar is in beta, available by request only. If you're a Mortar customer interested in using Spark, please contact us for more information.
If you’d like to know more about Spark, check out our new tutorial, which is a great introduction to Spark and its machine learning library MLlib.
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
We have a lot of great technical posts this week covering security in Hadoop, recent improvements to Spark, HDFS’ new heterogenous disk support, and the relatively new Apache incubator project NiFi. In addition, Apache Flink graduated from the incubator, Etsy open-sourced a new tool for profiling Hadoop jobs, and much more.
Technical
An important feature for an enterprise storage system is data protection. This post looks at Hadoop’s data protection features including replication, checksums, HDFS snapshots, and distcp for cross-cluster replication. It also looks at some enterprise software from EMC that can integrate via Hadoop’s S3 API support, and the tools that come from major vendors. There’s also an overview of what protection there are for different types of events (data corruption, accidental deletion, etc).
http://www.beebotech.com.au/2015/01/data-protection-for-hadoop-environments/
Spotify has written another post about their personalization pipeline, which is powered by Kafka, Hadoop, and Cassandra. While the post from last week focussed on the recommendation-building layer, this post looks at the database layer, which is powered by Cassandra. The article describes things like schema design, cross-datacenter replication, and bulk transfer. If you’re using or thinking of using Cassandra in production, there are several highly-relevant details.
https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/
Apache NiFi (incubating) has support for a number of different data transfer and file formats. This post looks at its support for HTTP and Kafka—it walks through building a pipeline to download a zip-compressed file over HTTP, extract messages from files within the zip, and send those messages to Kafka.
https://blogs.apache.org/nifi/entry/integrating_apache_nifi_with_apache
Another post on Apache NiFi discusses using the system for ETL and gives a tour of the data provenance features. The post argues for combining data from multiple sources in order to simplify common operations against data, and it describes how you can track the source of data in such a setup using the builtin data provenance tools.
https://blogs.apache.org/nifi/entry/basic_dataflow_design
The eBay tech blog has a post on HDFS’ (relatively new) support for tiered storage. The post describes the feature, which allows a cluster to define different types of storage, and various storage policies that describe how to map replicas to tiers. eBay uses this feature to move much of their data to machines with a small amount of compute power but a large disk density (220TB each).
http://www.ebaytechblog.com/2015/01/12/hdfs-storage-efficiency-using-tiered-storage/
The Hortonworks blog also has a post on HDFS’ heterogeneous storage support. The post describes archival storage (similar to the eBay setup) as well as SSD storage and (available in technical preview) in-memory storage.
http://hortonworks.com/blog/heterogeneous-storage-policies-hdp-2-2/
Spark has been criticized for not being as scalable as MapReduce on large clusters and datasets. Some folks from Cloudera and Intel have been working on improving the scalability and performance of one of the bottlenecks—the shuffle phase of Spark jobs. This post looks at those changes in detail and provides some performance comparisons of the new shuffle implementation.
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
The Databricks blog has a post on journaling for fault-tolerance in Spark streaming. The post motivates the feature (which is necessary when consuming data from systems like Kafka and Flume), describes the implementation (which uses a write-ahead log on a distributed file system like HDFS), and describes how to use and configure journaling for Spark streaming.
http://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
The AWS blog has a post in which BloomReach describes several techniques that they use to manage costs of Elastic MapReduce. Tips include using spot instances, shared clusters for smaller jobs, tags for tracking cost, and a simple algorithm for computing the best instance type given fluctuations in spot pricing.
http://blogs.aws.amazon.com/bigdata/post/Tx3L1N4PH3MPPIF/Strategies-for-Reducing-Your-Amazon-EMR-Costs
The Cloudera blog has a post about setting up a Hadoop cluster. It covers key areas like networking, DNS, OS configuration, storage, and hardware specs. There are a few CDH-specific recommendations, but overall this is a great resources for anyone building a Hadoop cluster.
http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
This is a quick post describing how to interface with Avro data from Spark. The post shows how to convert cvs data to Avro and then write a small Spark program in Scala for reading the data.
http://matthieulieber.blogspot.com/2015/01/how-to-load-some-avro-data-into-spark.html
This post describes some of the tools that are popular when analyzing scientific data with Hadoop. A lot of popular tools use non-JVM languages (R, Python), which can make it difficult to interface with Hadoop. This post looks at the current state of support (and what can be improved) across three key areas - storing open data, web-based notebooks (popularized by IPython), and distributed data frames.
http://www.tom-e-white.com/2015/01/hadoop-for-science.html
News
Apache Flink, a distributed data processing framework, graduated from the Apache incubator this week. Originally called stratosphere, the project has some similarities to Apache Spark (e.g. it aims to be faster at MapReduce for iterative algorithms).
http://www.datanami.com/2015/01/12/apache-flink-takes-route-distributed-data-processing/
Qubole’s weekly roundup of Hadoop happenings has coverage of ten industry articles. These include Hadoop for enterprise, big data for banking, and Hadoop at AutoDesk. There are also several industry prediction/retrospective articles.
http://www.qubole.com/re-framing-hadoop/
Databricks and O’Reilly have announced a new online Spark Certified Developer exam. The exam takes about 90 minutes and costs $300.
http://databricks.com/blog/2015/01/16/spark-certified-developer-exams-available-online.html
Twitter has created an online course at Udacity for Apache Storm. It’s a free course at an intermediate level, and is expected to take about 2 weeks.
https://www.udacity.com/course/ud381
Fortune has an interview with MapR CEO John Schroeder, where he talks about the company’s milestones, IPO plans, and priorities for 2015.
http://fortune.com/2015/01/14/mapr-eyes-late-2015-ipo/
ComputerWorld UK has coverage of an event in London during which Expedia announced plans to expand their Hadoop cluster and spoke about their experiences with Apache Falcon. The article describes a few ways in which they’re using Falcon at Expedia.
http://www.computerworlduk.com/news/it-business/3593977/expedia-to-double-its-apache-hadoop-cluster-investment-this-year/
Releases
Wibidata, makers of the open-source Kiji Framework, have announced a new feature called Experiments by Wibi. The system allows retailers to setup experiments and send traffic to the experiment and control group to draw statistically significant decisions.
http://venturebeat.com/2015/01/12/this-new-wibidata-software-could-help-retailers-sell-far-more-stuff-online/
`rabbitmq-flume-plugin` is Flume plugin providing a source and sink for RabbitMQ. Version 1.0.3 was released this week.
https://github.com/aweber/rabbitmq-flume-plugin/releases/tag/1.0.3
Etsy has open-sourced their tool for profiling Hadoop clusters. The system, `statsd-jvm-profiler`, is a java agent which outputs data using the StatsD protocol. They include scripts for visualizing the data using flame graphs, and have an overview of the implementation and some potential pitfalls.
https://codeascraft.com/2015/01/14/introducing-statsd-jvm-profiler-a-jvm-profiler-for-hadoop/
Cloudian has announced a new version of its HyperStore software, which is fully compliant with the Amazon S3 API and offers integration with Apache Hadoop.
http://siliconangle.com/blog/2015/01/16/cloudian-integrates-apache-hadoop-to-turn-big-data-into-smart-data/
Events (curated by Mortar Data)
UNITED STATES
California
Apache Ambari for Hadoop Clusters, with Newton Alex and Alexander Denissov (Palo Alto) - Tuesday, January 20
http://www.meetup.com/Pivotal-Open-Source-Hub/events/219339139/
January SF Hadoop Users Meetup (San Francisco) - Wednesday, January 21
http://www.meetup.com/hadoopsf/events/219049168/
Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) - Wednesday, January 21
http://www.meetup.com/hadoop/events/167785202/
Ingesting HDFS Data into Solr Using Spark (Palo Alto) - Wednesday, January 21
http://www.meetup.com/SFBay-Lucene-Solr-Meetup/events/219786571/
Washington
Kubernetes and Mesos Power 90 (Seattle) - Wednesday, January 21
http://www.meetup.com/Seattle-Mesos-Meetup/events/219684171/
Colorado
Surfing Data with Apache Drill and SQL-on-Hadoop Technologies (Boulder) - Thursday, January 22
http://www.meetup.com/CU-Leeds-Business-Analytics/events/219688951/
Minnesota
Target's Journey with Hadoop (Minneapolis) - Wednesday, January 21
http://www.meetup.com/Skyway-Software-Symposium/events/219393925/
The Best of the Bunch for SQL-on-Hadoop (Saint Paul) - Thursday, January 22
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/219661224/
Missouri
Revolution Analytics (Saint Louis) - Tuesday, January 20
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/219101118/
Tennessee
Intro to Spark (Nashville) - Wednesday, January 21
http://www.meetup.com/Nashville-Hadoop-Meetup/events/218911783/
Alabama
Brandon Newell from MapR Presents on Apache Drill (Huntsville) - Tuesday, January 20
http://www.meetup.com/Huntsville-Big-Data-Meetup/events/219152975/
Georgia
Intro to Hadoop & Analytics (Atlanta) - Wednesday, January 21
http://www.meetup.com/Atlanta-Society-for-Business-Intelligence/events/219394022/
Pennsylvania
Hadoop & Hive/Pig (Harrisburg) - Tuesday, January 20
http://www.meetup.com/Central-PA-Hadoop-and-Big-Data-User-Group/events/217628962/
New Jersey
Streaming and CEP: DataTorrent vs Storm vs Spark (Hamilton Township) - Tuesday, January 20
http://www.meetup.com/nj-hadoop/events/219357759/
New York
Apache Mesos for Apache Kafka and Apache Accumulo (New York) - Wednesday, January 21
http://www.meetup.com/Apache-Mesos-NYC-Meetup/events/219073804/
Self-Service Data Exploration and Nested Data Analytics on Hadoop (New York) - Friday, January 23
http://www.meetup.com/New-York-Apache-Drill-Meetup/events/219730994/
Spark on/with Hadoop (Amherst) - Thursday, January 22
http://www.meetup.com/buffalolab/events/219349543/
Maryland
Introduction to Big Data Techniques for Cybersecurity (Rockville) - Wednesday, January 21
http://www.meetup.com/Capital-Area-Cyber-Security/events/219333009/
Massachusetts
Get the Most Out of Spark on YARN (Boston) - Tuesday, January 20
http://www.meetup.com/bostonhadoop/events/219194392/
CANADA
Deploying and Using Spark on AWS (Vancouver) - Monday, January 19
http://www.meetup.com/Vancouver-Spark/events/197407432/
UNITED KINGDOM
ElasticSearch and Apache Spark (Bath) - Thursday, January 22
http://www.meetup.com/South-West-ElasticSearch-Community/events/219311668/
SPAIN
Conoce Spark Streaming (Madrid) - Tuesday, January 20
http://www.meetup.com/Madrid-Apache-Spark-meetup/events/219689767/
Fifth Barcelona Spark Meeting (Barcelona) - Thursday, January 22
http://www.meetup.com/Spark-Barcelona/events/186862692/
INDIA
Big Data and Real-time Analytics with Spark (Bangalore) - Friday, January 23
http://www.meetup.com/Spark_big_data_analytics/events/208197842/
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
The first full week of 2015 has been fruitful for Hadoop-related content (especially technical). There are a number of great posts, including several from the Data Day Texas conference as well as coverage of Drill, Spark, Storm, and more.
Technical
Apache Lens is a relatively new incubator project (originally from InMobi) for providing a unified interface for analytics on data stored in different systems (HDFS, HBase, RDBMS, S3, etc). This post describes the evolution of the data warehouse at inMobi which led to Lens (formerly Grill) and gives an overview of the Lens architecture.
http://www.slideshare.net/amarsri/apache-lens-at-hadoop-meetup
Apache Drill aims to be an interactive query engine for a wide range of data sources. This post looks at all the different data types and ways that you can query data using Drill. For instance, data can be stored in json files, Hive, HBase, and more. Drill exposes interfaces for querying this data using BI tools, an interactive prompt, and via a HTTP REST API.
https://www.mapr.com/blog/how-use-sql-hadoop-drill-rest-json-nosql-and-hbase-simple-rest-client
The Databricks blog has a new post on the ML pipeline API, which was introduced in Spark 1.2 (and is considered experimental). The API aims to help automate tasks that are often done manually as part of building production machine learning pipelines. For example, the API supports feature transformations: appending new columns to an existing dataset.
http://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
The MapR blog has a post comparing and contrasting two resource managers for distributed systems: Apache Mesos and Apach Hadoop YARN. The post is available both in video and transcript form. One of the conclusions is that the two systems can be complimentary, to which it refers users to the Myriad framework for scaling YARN clusters on Mesos.
https://www.mapr.com/blog/apache-mesos-vs-hadoop-yarn-–-whiteboard-walkthrough
Spotify has written a highly technical and detailed article about their Apache Kafka and Storm setup, which is used for recommendations, ad targeting, and more. The post describes their software testing strategy, metrics, alerting, hardware (they process 3 billion events per day across 6 nodes), performance tuning, and more.
https://www.hakkalabs.co/articles/spotify-scales-apache-storm
This post describes the new `DockerContainerExecutor` for YARN that was introduced in Apache Hadoop 2.6. This feature allows Docker containers to run as YARN containers, which means one could package all system-level dependencies of a YARN job into a docker container for deployment. Not only does the post describe how to use this feature, but it takes it one step further by describing how to use the DockerContainerExecutor when Hadoop itself is running inside of docker containers.
http://blog.sequenceiq.com/blog/2015/01/07/yarn-containers-docker/
Cloudera has declared HDFS’ transparent encryption as production-ready as part of the CDH 5.3 release. This post discusses the design and features of HDFS encryption, provides some basic examples for using it, and talks about about performance impact.
http://blog.cloudera.com/blog/2015/01/new-in-cdh-5-3-transparent-encryption-in-hdfs/
This post gives an overview of Apache Samza at LinkedIn, where the project was originally built. The post describes the architecture of Samza, including its state storage system and fault tolerance properties. There’s also a case study describing how LinkedIn uses Samza to do call graph assembly for service monitoring.
http://thenewstack.io/apache-samza-linkedins-framework-for-stream-processing/
Hadoop Streaming is a system for processing data using MapReduce for non-JVM languages. This post describes how to use node.js for MapReduce on Amazon’s Elastic MapReduce. The tutorial details how to bootstrap an EMR cluster with node.js installed, write a simple MapReduce job, and deploy a job using the EMR command-line tools.
http://blogs.aws.amazon.com/bigdata/post/TxVX5RCSD785H6/Node-js-Streaming-MapReduce-with-Amazon-EMR
Spark SQL supports reading data from Hive tables like several SQL-on-Hadoop systems. In addition to that, it can read data stored in Parquet, JSON, and CSV even if the data isn’t part of a Hive table. This features is not something found in most systems—although Apache Drill can do it, too. This post gives a quick intro to the Spark SQL Data Sources API and how to use it via Spark SQL.
http://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
TPC Express Benchmark HS is a new TPC benchmark for big data systems. This week, Cisco published some benchmark results of the standard for a 16 node cluster running MapR’s distribution. These are the first results for the new benchmark, so we’ll have to wait for some additional vendors to publish results to see how they stack up.
http://blogs.cisco.com/datacenter/tpcx-hs-on-ucs
With the rise of new DSLs and computing frameworks, most folks aren't writing raw-MapReduce jobs using the Java API anymore. This post serves as a good reminder of all the progress that’s been made over the past few years. Specifically, it looks at Scalding and Spark, which offer a rich API for writing big data processing jobs.
http://www.slideshare.net/deanwampler/why-scala-is-taking-over-the-big-data-world
Apache Flume is a popular solution for delivering data to HDFS from an application server tier. This post provides a detailed overview of Flume’s macro architecture, the Flume agent architecture, and how to use Flume for the common use-case of log aggregation.
http://www.slideshare.net/aprabhakar/apache-flume-datadaytexas
This presentation describes the rise of Python for data science, some of the big data tools that exist today for Python, and provides some suggestions for improving Python’s big data support in the future.
http://www.slideshare.net/wesm/pydata-the-next-generation
The Kite SDK provides an API for describing, storing, and accessing data in Hadoop. This presentation details what exactly that means—describing the key abstractions, providing examples, and describing the architecture and tools.
http://www.slideshare.net/JoeyEcheverria/building-data-pipelines-with-kite
As a rather new system, the tools for debugging a Spark workflow are still fairly immature. In addition, the simple interface that Spark’s APIs provide hides a lot of complexity which it can be necessary to understand in order to debug a problem. This presentation looks at several common Spark job failures, and it explains the underlying system mechanics to help understand the root cause.
http://www.slideshare.net/SandyRyza/spark-job-failures-talk
News
The O’Reilly Data Show Podcast recently interviewed UC Berkeley Professor and Databricks CEO Ion Stoica about the origins of Mesos, Spark, and Tachyon. This post has a select transcript of the interview, in which you hear about early work on Mesos and Spark. Ion gives a lot of credit to the success of the projects to the students who worked on them.
http://radar.oreilly.com/2014/12/apache-sparks-journey-from-academia-to-industry.html
The call for speakers for HBaseCon 2015 is open until February 6th. Early bird registration is also open, and the conference takes place on May 7th in San Francisco.
http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/2015/01/07/cloudera-announces-hbasecon-2015.html
Releases
Apache Falcon, the data processing and lineage system, recently released version 0.6.0. The Hortonworks blog has an overview of new features of the release, including authorization with ACLs for entities, enhancements to lineage metadata, and archiving to a cloud system such as S3 or Azure.
http://hortonworks.com/blog/announcing-apache-falcon-0-6-0//
The folks at SequenceIQ have published a new Docker image supporting Spark 1.2.0. The post has a quick walkthrough of building the docker image, running a container from the image, and launching a Spark job.
http://blog.sequenceiq.com/blog/2015/01/09/spark-1-2-0-docker/
Pinpoint is a new open-source Application Performance Management tool based on the Google Dapper architecture. Pinpoint uses HBase for data storage.
https://github.com/naver/pinpoint/blob/master/README.md
ASAP is a new stream processing framework. Unlikely other stream processing systems, ASAP focusses on ad hoc querying. ASAP uses Apache Kafka for inter-connecting pipelines.
https://github.com/ottogroup/ASAP
RecordBreaker is a open-source tool from Cloudera that’s been around for a while, but I’ve just learned about. It’s purpose is to extract structured avro records from text-formatted files.
http://cloudera.github.io/RecordBreaker/
Events (curated by Mortar Data)
UNITED STATES
California
What's Coming for Spark in 2015 (San Francisco) - Tuesday, January 13
http://www.meetup.com/spark-users/events/219626218/
Machine Learning for Real-time Bidding on Spark (Santa Monica) - Wednesday, January 14
http://www.meetup.com/AdTechLA/events/218665682/
HBase Meetup @AppDynamics (San Francisco) - Thursday, January 15
http://www.meetup.com/hbaseusergroup/events/218744798/
HBase+Phoenix Developer Meetup (San Francisco) - Thursday, January 15
http://www.meetup.com/hbaseusergroup/events/219648544/
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale (Los Angeles) - Thursday, January 15
http://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/210424402/
Washington
Using Spark to Increase Efficiency in Mobile Marketing at Tune (Seattle) - Wednesday, January 14
http://www.meetup.com/Seattle-Spark-Meetup/events/203176562/
Arizona
Application of Hadoop On-Demand (Tempe) - Wednesday, January 14
http://www.meetup.com/Arizona-SQL-Server-User-Group/events/219264215/
Utah
Hadoop Lunch at Adobe (Lehi) - Thursday, January 15
http://www.meetup.com/BigDataUtah/events/219332301/
Ohio
Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, January 12
http://www.meetup.com/Cleveland-Hadoop/events/208992442/
Virginia
Network Design Considerations and Challenges for Hadoop Big Data Environments (Reston) - Wednesday, January 14
http://www.meetup.com/Arista-Networks-MidAtlantic-DC-MD-NoVA-area-Meetup/events/218798169/
Discuss Migrating Oracle Databases and Apps to Splice Machine Hadoop RDBMS (Chantilly) - Thursday, January 15
http://www.meetup.com/DC-SMUG/events/219374759/
CANADA
HBase Intro and Hands On… Session 1 (Vancouver) - Thursday, January 15
http://www.meetup.com/Big-Data-Developers-in-Vancouver/events/219719368/
MEXICO
Primera Reunión de Apache Spark (Mexico City) - Thursday, January 15
http://www.meetup.com/Mexico-City-Apache-Spark-Meetup/events/219029479/
UNITED KINGDOM
Hadoop at Hotels.com and Expedia.com (London) - Tuesday, January 13
http://www.meetup.com/hadoop-users-group-uk/events/219547920/
SPAIN
Conoce Spark Streaming (Madrid) - Thursday, January 15
http://www.meetup.com/Madrid-Apache-Spark-meetup/events/219689767/
INDIA
Big Data and Real-time Analytics with Spark (Bangalore) - Friday, January 16
http://www.meetup.com/Spark_big_data_analytics/events/208197842
Building Data Pipelines using Luigi, with Erik Bernhardsson of Spotify [Meetup Video, Slides, Transcript]
John Matson
At our most recent NYC Data Science Meetup, we heard a great talk by Erik Bernhardsson from Spotify. Erik gave a comprehensive overview of Luigi, the workflow engine that he, Elias Freider, and other Spotify engineers built and open-sourced. Spotify uses Luigi to run thousands of Hadoop jobs a day, and to automate all kinds of complex data pipelines involving multiple systems. It's also used by many other cutting-edge companies such as Asana, Buffer, and Stripe.
We at Mortar were especially keen to hear what Erik had to say, both because we are devoted Luigi users and because Luigi is one of the core technologies in our platform. With Luigi on Mortar, our customers can build multi-stage data pipelines with complex dependencies, execute them in the cloud, and schedule them to run as often as needed.
Check out the video and slides from Erik’s talk, embedded below, to see what Luigi is all about and why it’s so useful.
Luigi presentation NYC Data Science from Erik Bernhardsson
Transcript
Luigi - NYC Data Science Meetup 12/16/2014
0:00 - Introduction by John Matson from Mortar
What is Luigi?
2:30 Luigi is a workflow engine. What is a workflow engine? When I started writing Luigi, I didn’t even know what a workflow engine was; I don’t think the term existed, but Spotify runs a ton of Hadoop jobs, and we need to chain those jobs into complex dependency graphs. But even if you’re not running a thousand Hadoop jobs a day, it’s pretty nice to think about how you can take tasks and stitch them together in dependency graphs, and that’s what Luigi does for you. It doesn’t just do Hadoop jobs, although that’s what I started off with.
3:12 We use it to train machine learning models, to send reports, to upload things to servers, to adjust things in Cassandra, all kinds of fun stuff. And I will describe now how it works.
What do we use it for?
3:29 There’s a bunch of stuff we use it for. I’ve been focusing mostly on music recommendations, because I’m the engineering manager of our team in New York. I do a lot of machine learning, but in the past I’ve been involved also in Spotify’s data infrastructure. We use Luigi for pretty much anything at Spotify, which is a lot of stuff. We run a lot of Hadoop jobs. Again, it doesn’t have to be used with Hadoop, but it definitely helps with Hadoop jobs.
Some history
4:12 So, I thought it would be instructive to go back really far, a really long time ago to 2008 which is when I started writing my Master’s thesis, and I had this -- I worked at Spotify; I did a Master’s thesis at Spotify on music recommendations, and I started building hacky things. It got bigger and bigger, and I had, after a while -- this dependency graph, actually, came from my Master’s thesis. It was a complex graph of a lot of tasks, and it all got super messy.
5:05 This was 2008, and I started seeing this thing over and over again, these complicated task dependencies, and so I started thinking about what to do about it.
Toy example: classify skipped tracks
5:21 Here’s a little toy example that I put together which reminds me of a lot of bad code that I’ve been writing. I think this is where it starts. You have a bunch of scripts running on a command line. There’s a lot of issues about this. In this little toy example, we’re going to take a bunch of log data, we’re going to subsample some interesting features and train a little machine learning model.
Reproducibility matters
6:00 And the big issue is it’s very hard to reproduce these things because you ran it on the command line, so there’s this implicit dependency graph that nobody knows about because it’s not documented anywhere. And maybe you realize there was a bug and you go back and fix the first thing and then you have to re-run everything. So, that’s messed up and let’s try to fix that.
Let’s make a big workflow
6:27 Let’s build a little Python wrapper to chain these things. A little bit better. Now we have this reproducible chain of tasks. We can just go in and run it, and we’ll run these things step by step.
Reality: crashes happen
6:49 It’s still pretty horrible, actually, for many reasons. One big thing that sucks about this that happens all the time is you’ll run this thing and the first thing finishes, and then the second thing crashes and you realize you had a bug. So, you go back and you fix that bug and then you have to re-run everything, and it’s twice to run the first step, but you already ran it, so why would you want to re-run it again? So, you want this thing to resume from intermediate data.
Ability to resume matters
7:29 That’s really important about building these data flows is you want this thing to be able to resume.
7:55 There are a couple more subtle issues with that, like one thing I’ve struggled with a lot is this last point of atomic file operations, which doesn’t sound like a big deal but I’ve had a lot of issues with something like where something runs halfway and then crashes and there’s this garbage intermediate stuff, and the last thing you want is to do send a royalty report to a label saying this month you’re only going to get $5 because something crashed but we didn't realize this and so we’ve continued to run the next step. So, there are all these tricky issues. You really want a good framework.
So let’s make it possible to resume
8:36 Let’s continue this road of hacking in all these features. Now we have some sort of resumability, we’ve checkpointed things to disk and we try to -- we re-run it if they don’t exist. We have this cleanup thing because this Hadoop job might fail, so delete it from HDFS. It’s starting to solve the issues we’re seeing. It’s also starting to look really horrible. We have hardcoded stuff all over the code.
9:20 This is also something else. I started seeing this over and over again in my scripts, and I’m still guilty of this when I don’t use Luigi, which happens.
But still annoying parts
9:30 This is where there’s hardcoded crap. We have a lot of hardcoded stuff here. We also have these filenames and everything, so let’s think about how we can try to solve this and generalize things.
Generalization matters
9:49 So let’s say we ran this thing and it worked out, and now we actually want to try and change some parameter. We don’t want to hardcode all these things; we want it to be easy to re-run things. One aspect of that is we really want it to be easy to run things in the command line, so if you realize, “Oh, I ran this with the wrong parameter,” we want you to be able to go back and run with a new parameter, so let’s try to fix it.
… Now we’re getting something.
10:26 This is starting to be ridiculous, right? We’re adding some command line in here, so this is the way it looks -- you can run it in the command line and specify parameters. It’s starting to look -- we’re addressing all these issues, but all in all, it’s just so much stuff you have to implement just to get the plumbing working when that’s the least thing you want to do.
… But it’s hardly readable.
11:22 Like, you wanted to train this machine learning model but you ended up building the pipeline. That’s not fun, and it also means you’re introducing a bunch of errors because you screwed up somewhere here.
Boilerplate matters!
11:47 I think I also have this theory with boilerplate where I think that there’s this psychological thing where if it’s hard to write things you're not going to do it because it’s not fun and easy, by default. So that’s the other thing; turn that laziness into an asset. As a programmer, you’re always like, let’s make this fun. Let’s try to think about how we can make all this plumbing fun. I also have this other theory that if something’s boring it’s because you don’t use enough graph algorithms, so I’m going to get back to that later. We’re going to introduce graph theory into this.
A lot of real-world data pipelines are a lot more complex
12:19 Here’s this pipeline again. The toy example I showed you -- it’s super simple, right? Insanely simple. Most real-world productionized pipelines look like this times five. It’s complex, they have a lot of dependencies, those dependencies depend on dates or date intervals or functions. You need some sort of -- you need some way to build up to these complex dependency graphs.
So I started thinking
13:12 How many of you guys use Make? I think it’s kind of cool. Make is this toolset for compiling code. Make is pretty cool, but it’s also pretty horrible.
What is Make and why is it pretty cool?
13:38 It has this sort of nice philosophy I really like, which is you build up these rules, and these rules are basically -- here’s how you compile a C++ source code into an object file. And these rules are reusable templates, so you have these rules and you can reuse them, and you can build up this super-complicated dependency graph, and there’s this horrible syntax to build up these rules. And actually, Make files are so big sometimes that you need another tool to generate the Make files. But there’s this interesting idea about Make that I really like, and I started thinking about how can we actually use that bit? How can we apply that to data plumbing?
We want something that works for a wide range of systems
14:35 The tricky thing about data plumbing -- like, Make, it’s only for compiling C code. I want something that’s super general. I want to run a lot of Hadoop jobs, but there’s also a lot of other stuff I want to do.
15:17 People use Make for all kinds of stuff. You can build workflows in it, but like… Make is for compiling stuff, so it’s never going to be perfect for all domains. If you optimize one domain, it’s not going to be perfect for another domain.
15:29 There’s this quote, I don’t know who the author is, “80% of data science is data munging.” My hope is that this 80% is going to go down to 70% if you use Luigi.
Data processing needs to interact with lots of systems
15:41 Here’s a lot of stuff we do at Spotify. There’s Hadoop jobs. Those Hadoop jobs are Python, Scala, Java, whatever. Dumping databases, training machine learning models, ingesting things into Cassana, copying things around, formatting XML, whatnot.
My first attempt: Builder
16:05 So at some point in early 2009, I think it was, I built this horrible framework called Builder that led to building up a dependency graph. XML was a mistake. Don’t ever use XML for anything, and here’s why. Well, there’s many reasons XML sucks but in my case, the biggest issue I had was that it’s hard to use a markup language to encode logic in that. You have to use logic to find dependencies.
Don’t ever invent your own DSL
17:08 There’s another quote I like, “It’s better to write domain specific code in a general purpose language than the other way around,” and I think that applies to XML. Don’t use XML when you really want to have -- when like, Python is a great language for specifying logic. There’s an alternative to Luigi called Oozie that relies on XML. If you google “Oozie simple example,” you’ll find this enormous XML block that never ends. It’s horrible.
2009: builder2
17:52 I decided that Python is a great language to build up graphs, and I made up a very unimaginative name, builder2. It’s completely in Python so you can build dependency graphs. It actually checked off all the boxes that I wanted it to, almost. It had really cool visualization, sent error emails. We still had some code that runs in production, although by coincidence I think 80% of that had been decommissioned today after 5 years of no one daring to touch it.
18:34 It was kind of robust, but it was kind of horrible, though. I never showed the source code because I was always embarrassed about it.
Graphs!
18:43 Here’s some cool graphs. builder2 would render these graphs and send them to you in an email, which was pretty cool. These graphs are complex. A lot of the things you want to do is building up to these fairly complex graphs.
What were the good bits? What went wrong?
19:14 So, builder2 did a lot of good stuff, but the code looked horrible. I think if you want to do things right, you have to do them at least three times. After builder2, I decided to build Luigi. We sat down and decided, let’s actually try to build something right from scratch. The third time, it worked out pretty well. We started doing that in December of 2011.
Luigi is your friendly plumber
20:20 I think of it as a way of outsourcing all your plumbing to this Luigi guy. It’s very easy to build complex dependency graphs in Python.
Luigi Task
20:48 Here’s how it works. The fundamental unit of computation in Luigi is the task. The task is sort of similar to a rule in Make files. Most tasks depend on some other task. In this case, the name of some other task is “Some Other Task.” They implement some code using a RUN method. They also say where they are going to write output, and that can be in local disk, that can be on HDFS, it can be on S3, it can be on a database. The interesting thing is you have these parameters, and these params are ways of passing on templates. This actually creates a constructor in passing any parameters. These things are defined on a class scope, so there’s metaprogramming to create a constructor.
Easy command line integration
22:19 The cool thing is you get command line integration completely for free with it. We did nothing, and it generated a command line interface just by looking through this code, pulling out the parameters, and generating a command line interface. This is cool because I think the command line is a data scientist’s best friend.
Let’s go back to the example
22:47 What do we want to do? We have a bunch of log entries, we’re going to sample things, we’re going to train a classifier using scikit-learn, and we’re going to look at the output. Very simple thing but illustrates all the principles of Luigi.
Code in Luigi
23:15 Here’s the scaffolding. We’re going to build this dependency graph -- I left out the implementation as an exercise for the reader. Here’s how we build up dependency graphs. In this case, it’s simple. This requires method returns a list of other tasks. It uses some date algebra to convert a date interval into a list of dates. You can run this on the command line.
Extract the features
24:06 Cool thing is since it comes with Hadoop integration using Hadoop streaming, you can write -- if you want to write just a simple Python MapReduce job, it’s very simple. Instead of implementing a run method, you just implement a mapper and a reducer. It’s the same principle.
Run on the command line
24:36 You can run it on the command line.
Step 2: Train a machine learning model
24:42 I’m not going to go into details. We’re going to pick a little model and output it.
Let’s run everything on the command line from scratch
24:54 I didn't show you the source code of inspect model because it -- you can just go in and run this on the command line, and it’s going to schedule everything backwards. I figured out it has to run all these three things in sequence, and it’s going to output some stuff. So, that’s cool. Now we’re going to make this insanely more complicated.
Let’s make it more complicated - cross validation
25:23 Just because I wanted to demonstrate how easy it is to build up these complex dependency graphs -- we’re going to take one classifier and apply it to another data set to do cross validation. If we had to have this approach where we built up -- with the hard parts and we had to do it ourselves, it would have been a mess. But in Luigi, you can do these things really easily because you can just do another task that says, I need another classifier for other features.
Cross validation implementation
25:59 Here’s the code. We actually put this in a different module, and we can import the tasks from the other module, and then we just say, oh we need these three tasks, and we’re going to use one classifier to predict, and then we’re going to see how well the predictions match what we observed. Then we get a nice command line interface with this. And this is cool because it’s starting to look like something more interesting now, and you can see how stitching together these tasks -- you can build up pretty complex dependency graphs. So, happy plumbing.
The nice things about Luigi
26:52 So, what are the benefits of Luigi? Minimal boilerplate and easy command line integration.
Everything is a directed acyclic graph
27:08 You build up directly to this directed acyclic graph, which I like to think of any computation in terms of these DAGs. I personally think that’s the right way to think of any computation. Any pipeline is a dependency graph of tasks.
Luigi’s visualizer
27:29 It comes with a visualizer, which is pretty cool, so you don’t have to write any code to get to this thing. It comes with a little server that runs. You can run it on a separate host if you want to, and this server keeps track of anything that’s queued up, that’s done, that’s failed. You can click on the failed things and see the tracebacks. It just gets really useful if you’re scheduling thousands of jobs every day that you can go in and see why they failed and what’s going on. You can also click on these jobs and see a dependency graph of what’s going on.
Dive into any task
28:16 This is pretty useful in order to see what’s going on, and we didn’t have to write any code for this. This comes automatically because it goes in and looks at all your task definitions, and when you schedule something it’s just going to upload the dependency graph to the server, and the server lets you render it through the web interface.
Run with multiple workers
28:52 Another pretty cool thing, you can actually parallelize things as long as you’re running it on a single host by just using the workers argument. It will try to use multiple processes on the same machine for tasks with no dependencies on each other. It doesn’t do this for separate machines, which is a feature that would be really cool that I’ve been trying to implement. You can run it in parallel on one machine, which could be useful if you’re training a machine learning model.
Error Notifications
29:30 It also has a built-in notification system so you can hook it up to email, can set it up to have a lot of automated things. Whenever things crash, one of these features, you get to actually trace back from it.
Process synchronization
30:02 The server that runs -- you can run it on the same host or you can run it on a separate host. We have it on one big, central server at Spotify. It also makes sure you’re not running the same tasks on multiple hosts or you’re not running the same task at the same time. It blocks things and makes them wait until something is done, which is useful in some cases. If you have one thing running on one machine and then you have dependencies on multiple machines that need that thing, it will synchronize it.
Luigi is a way of coordinating lots of different tasks
30:47 So, just my little disclaimer. Luigi just gives you the plumbing; you still have to implement the business logic. In our case, we use a lot of Python MapReduce, but recently switching more to Scala and Java. It’s not like you can just take some just thing and add Luigi and it will magically scale it to Hadoop scale. Still have to figure out how to implement things, but it is a glue that you can stitch together all these cool stuff.
Do general purpose stuff
31:32 Comes with Hadoop support. There’s actually a lot of people in biotech that don’t use Hadoop with Luigi -- well, a lot of people that don’t use Hadoop at all with Luigi -- but a lot of people in biotech, they have a lot of crazy schedulers that they like for some reason.
Built-in support for HDFS & Hadoop
31:53 At Spotify, we originally started with having everything on Python, but at our scale -- I think we have 900 Hadoop nodes -- it’s really slow to do Python in MapReduce, so we’re starting to see Luigi more as a glue, and Foursquare did the same thing. Foursquare had all their MapReduce jobs in Scala, but they use Luigi to stitch it together. I think our team is a good example. We use Luigi to stitch together models. We do a lot of injection into databases, we dump a lot of stuff from databases, and this is a nice glue.
The one time we accidentally deleted 50TB of data
32:47 Here’s another testament of why I think this is kind of cool. We accidentally deleted all of the intermediary output that we need. We had to do absolutely nothing. Luigi went in and was like, “oh shit, all this data is missing. I’ve got to re-create the whole dependency graph and schedule everything,” and it scheduled like 10,000 tasks and it ran for three days and re-created everything because we didn’t delete any original data. So, Luigi could re-create the dependency graph all the way back to original data and schedule everything and run it. We didn't have to write a single line of code for that.
Some things are still not perfect
33:28 I’m going to mention some drawbacks. I have to, and also because I think it’s fun to think about what I want to do next with Luigi.
The missing parts
33:43 One thing I had mentioned was I want to build this thing where you can schedule a thing on your laptop, and you have this big machine farm and it would magically send your task somewhere else. It doesn’t do that, so you have to schedule things where you run them, which is not a massive issue, but it would be nice to fix that.
34:16 Visualization isn’t awesome. There’s also no built-in scheduling with Luigi, which is also like not a super big deal because Luigi deals with the dependency graphs so well that when you need to schedule stuff, it’s really just a matter of adding one line, but it would be nice to avoid that. My dream would be able to do like data push, and then it would just automatically schedule it, and it would run it somewhere else. I really want to build this scalability cloud thing.
Luigi in Scala?
35:16 I really want to link Luigi in Scala.
Luigi implements some core beliefs
35:26 I really want to end up with some core beliefs with Luigi, because I think a lot of design philosophy is reflected in how it works. What did we optimize for? There’s many things you can optimize for. We focus on removing all boilerplate, because I think once you do that -- if you make it easy, take away all the boring plumbing, then writing Python is fun.
36:08 Number two, I never wanted to make this specific to Hadoop, although it’s really useful with Hadoop. You can run it on practically anything.
36:20 The other thing, it’s super important when you’re working at Spotify from a point of -- like, you’ve built your whole machine learning pipeline, but if you can’t go from test to production, that sucks. You want to take your machine learning pipeline and say, okay, now I’m running this everyday, and the result of it I’m going to upload to Cassandra. And that’s something that I think Luigi does well.
36:49 Last slide: Here’s a bunch of companies that use Luigi. Foursquare was the first company outside of Spotify to use it. And that’s it!
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
The first issue of 2015 starts off the year relatively quiet, although there are a number good technical and industry news articles. Technical articles cover the Kite SDK, Kafka, HBase, Hive, and Sqoop; news articles include a number of year-end synopses. Based on these, it looks like 2015 will be an interesting year for the Hadoop ecosystem.
Technical
This post describes how to use the Kite SDK’s command line tools to ingest data into HDFS. Using the `kite-dataset` command, the author sets up the schema for datasets, creates tables in Hive, and loads the datasets. It also describes how to use the tools to build parquet files from csv.
http://captechconsulting.com/blog/ben-harden/ingesting-data-quickly-using-kite-the-command-line
Kafka has a built-in tool called MirrorMaker for replicating data between Kafka clusters. This post looks at an alternative implementation built with golang and supporting similar functionality. The tool, Go Kafka Mirror Maker, also supports a few additional features like adding a topic prefix to avoid collisions (and doesn’t run on the JVM, so it has less overhead).
http://allthingshadoop.com/2014/12/29/multi-datacenter-replication-with-apache-kafka/
This post provides a thorough overview of HBase Coprocessors, which (as the article describes) are analogous to triggers and stored procedures, MapReduce, and aspect oriented programming. The post describes the main interfaces and used to implement two categories of coprocessors, complete with a walkthrough of a coprocessor implementation. It also details the steps needed to deploy a co-processor.
http://www.3pillarglobal.com/insights/hbase-coprocessors
Recent releases of Apache Hive have a number of new optimizations, which must be enabled (possibly by reprocessing data). This post describes several of those features (Tez, cost-based optimization, vectorization) and provides instructions for enabling them.
https://www.altiscale.com/hadoop-blog/optimizing-apachehive-by-tuning-configuration-parameters/
This two-part blog series looks at several SQL-on-Hadoop engines. The first post looks at various storage backends (HBase, HDFS, etc) and file formats, while the second describes several major systems (Hive, Impala, Presto, Drill, and Spark SQL). There’s also a discussion of if these systems are appropriate for for OLTP and OLAP.
https://phdata.io/the-truth-about-sql-on-hadoop-part-1/
https://phdata.io/the-truth-about-sql-on-hadoop-part-2/
Security has been a hot topic for the Hadoop ecosystem recently, and most systems are adopting or improving their enterprise security features. This post describes a new security feature in Sqoop2: support for Kerberos. It walks through the steps necessary to enable security as part of the Sqoop2 server.
http://ingest.tips/2015/01/01/kerberos-support-in-sqoop2/
News
This post presents a bullish view on the future of big data. The author argues that big data provides a mechanism to measure and analyze systems that never existed before. Referencing success stories from the energy and agriculture sectors, the post describes some of the new capabilities that big data offers.
http://www.odbms.org/2014/12/underhyped-big-data-scientific-method/
This year-end retrospective piece looks at some of the advances in machine learning tools for big data in 2014 (Spark 1.0, H20, GraphLab) and makes some predictions for the industry in 2015.
https://www.linkedin.com/pulse/new-year-practical-data-kyle-napierkowski
Based on data from the recent Hortonworks IPO and reports from Wikibon and Forrester, this post speculates on the future of the Hadoop industry. The article points out some bearish observations, such as the sparsity of successful companies based on open-source and the fact that Google has moved away from MapReduce.
http://whatsthebigdata.com/2014/12/31/the-day-the-hadoop-bubble-started-to-quiver/
We’ve seen quite a few end-of-year posts, and Qubole’s list of Hadoop Happenings this week includes several articles that touch on this theme.
http://www.qubole.com/hadoop-lives-hype/
Releases
MemSQL has open-sourced a new tool for loading data from S3 and HDFS to MemSQL and MySQL. The system is somewhat similar to Sqoop, but provides additional features like deduplication and failure handling.
http://blog.memsql.com/memsql-loader/
Events (curated by Mortar Data)
UNITED STATES
California
Docker: New Approaches to Software Development (San Ramon) - Tuesday, January 6
http://www.meetup.com/Docker-Containers-Cloud-Foundry-PaaS-East-Bay-Meetup/events/219129851/
A Real-Time Streaming Implementation of Markov Chain–Based Fraud Detection (Newport Beach) - Thursday, January 8
http://www.meetup.com/Orange-County-Java-Users-Group-OCJUG/events/219321555/
Hadoop Framework and Tools, Spark Intro, plus Data Science with R and Python (Fremont) - Thursday, January 8
http://www.meetup.com/East-Bay-Big-Data-and-Data-Science-Meetup/events/219416430/
How We Used Storm Wrong, Then Right, at AdRoll (Emeryville) - Thursday, January 8
http://www.meetup.com/Bay-Area-Storm-Users/events/218816484/
Nevada
Hadoop Past, Present and Future, plus NoSQL Data Modeling and Couchbase Mobile (Las Vegas) - Wednesday, January 7
http://www.meetup.com/lvbigdata/events/218825012/
New York
BDW Meetup: Spark SQL (New York) - Wednesday, January 7
http://www.meetup.com/Big-Data-Warehousing/events/219272125/
BELGIUM
RBelgium #9, Including “RHadoop, Using R in Hadoop” (Brussels) - Friday, January 9
http://www.meetup.com/RBelgium/events/219295488/
CHINA
Spark Meetup (Shanghai) - Saturday, January 10
http://www.meetup.com/Shanghai-Apache-Spark-Meetup/events/219419191/
INDIA
Streaming and Apache Spark (Bangalore) - Saturday, January 10
http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/events/219490940/
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
For the last issue of Hadoop Weekly in 2014, we have a short but sweet edition. The theme for this week is the future of Hadoop—from the first technical post about Apache NiFi to posts on younger security projects (Apache Ranger and Apache Sentry) to several posts about the Hadoop industry in 2015.
Technical
The Apache Drill blog has a post on upcoming features that the project will be focussing on for 2015. These include improved JSON support, improved access control, an integration with Apache Spark, and operational enhancements.
http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
Camus is the open-source project from LinkedIn for loading data from Kafka to HDFS. This post gives an introduction to Camus, a walkthrough to setting it up, and details how to customize Camus by writing a custom Decoder and RecordWriter.
http://etl.svbtle.com/setting-up-camus-linkedins-kafka-to-hdfs-pipeline
This post gives a technical introduction to a new Apache incubator project, NiFi. NiFi is a system for integrating data sources using a web-interface to build data flows. The intro shows how to build a local dropbox folder that uploads data to HDFS whenever a file is added. The post also describes how to integrate NiFi with the KiteSDK.
http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/
Apache Ranger (incubating) is a security project for Hadoop (based on code from XA Secure, which Hortonworks acquired earlier this year). This post looks at the audit features of Ranger, which are integrated for storage in a RDBMS, in HDFS, or via log4j. The post details the various configuration settings of each setup.
http://hortonworks.com/blog/apache-ranger-audit-framework/
Apache Sentry (incubating) is a security project originally backed by Cloudera for fine-grained authorization in Hive, Impala, and search. This post describes a new integration between Sentry and HDFS, and how that integration can be used to import data via Sqoop.
http://ingest.tips/2014/12/25/sqoop-import-in-a-world-governed-by-sentry-2/
News
Spark Packages is a new community index of packages for Spark. The initial set of packages includes integrations with Avro, Kafka, Pig, and more.
http://databricks.com/blog/2014/12/22/announcing-spark-packages.html
With 2015 starting later this week, Silicon Angle ventures a few predictions for Hadoop in the new year. These include the rise of “fast Big Data,” the need for real-time ingestion, support for streaming analytics in Hadoop-as-a-Service platforms, and the ubiquity of YARN.
http://siliconangle.com/blog/2014/12/22/2015-technology-predictions-datatorrent-on-big-data/
ZDNet has an interview with MapR CEO John Schroeder on what’s in store for MapR and the industry in 2015. Predictions include an emphasis on real-time over batch, data agility, fading hype of Hadoop, and vendor consolidation.
http://www.zdnet.com/article/mapr-ceo-talks-hadoop-ipo-possibilities-for-2015/
The Gartner blog has a short post that points out that while enterprises are ready to adopt Hadoop, the amount of expertise with the system is still lagging. And with new systems being added to the Hadoop ecosystem very frequently, this problem doesn’t seem to be going away any time soon.
http://blogs.gartner.com/nick-heudecker/hadoops-achilles-heel-in-2015/
Releases
The Cloudera Labs Kafka integration has been updated to support Kafka 0.8.2-beta. That release includes a number of useful features and improvements, which are summarized in the announcement.
https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/7-QaOzhJqlE/McO0hug8w_wJ
Cloudera Manager 5.3 was released this week. It includes a number of fixes and improvements, including stronger support for encryption (folder-level HDFS encryption), a new implementation of the S3-native file system, Kafka-Flume integration, and significant improvements to HBase. The release also includes the latest version of Apache Spark, Hue, and Impala.
http://blog.cloudera.com/blog/2014/12/cloudera-enterprise-5-3-is-released/
Cloudbreak, the system for auto-scaling Hadoop clusters in the cloud, has released a new version based on Apache Ambari 1.7.0 and Hortonworks HDP 2.2. The post has an overview of the new features and explains some upcoming improvements planned for future versions.
http://blog.sequenceiq.com/blog/2014/12/23/cloudbreak-on-hdp-2-dot-2/
Apache Drill, the SQL engine for Hadoop and NoSQL, released version 0.7 this week. Improvements in this release include: no longer depending on UDP multicast (Drill now works in EC2), automatic partition pruning, Hive 0.13 compatibility, and improved performance on JSON data.
http://drill.apache.org/blog/2014/12/23/drill-0.7-released/
Events (curated by Mortar Data)
UNITED STATES
California
SFML Office Hours (San Francisco) - Monday, December 29
http://www.meetup.com/sfmachinelearning/events/219423133/
Colorado
Big Data for Business (Centennial) - Thursday, January 1
http://www.meetup.com/Big-Data-for-Business/events/218775347/
New York
Data Workshop #7 (New York) - Sunday, January 4
http://www.meetup.com/NYC-Data-Wranglers/events/219352351/
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
This week’s newsletter marks the milestone of issue 100, which goes out to over 4,000 recipients. The ecosystem has changed a lot in the last 100 issues, which you can see from this week’s coverage of Storm, Spark, Slider, and Kafka. Thanks to everyone that’s contributed great content to this newsletter in the past—please keep it coming!
Technical
This post describes how to send metrics from Storm to StatsD & Graphite. The tutorial describes how to integrate the necessary libraries into a build, use the Storm metrics API, and how to access the graphs once data has made it to Graphite.
http://etl.svbtle.com/visualizing-metrics-in-storm-using-statsdgraphite
The AWS big data blog has a post on building a recommendation system using Mortar (disclosure: Mortar syndicates Hadoop Weekly and curates the events section). The tutorial walks through using Mortar’s system locally as well as in a distributed setting on Amazon EMR.
http://blogs.aws.amazon.com/bigdata/post/TxEUOWQRG7D9IZ/Building-and-Running-a-Recommendation-Engine-at-Any-Scale
The Hortonworks blog has a recap of a recent webinar discussing new HDFS features such as heterogeneous storage, encryption, and security enhancements. They’ve posted a Q&A from the webinar, which has a lot of good information about how these features work (or will work) in practice.
http://hortonworks.com/blog/discover-hdp-2-2-discover-hdp-2-2-data-storage-innovations-hadoop-distributed-filesystem-hdfs/
The Hue blog has a post on setting up Hue for HDP 2.2. It describes the custom configuration and gotchas related to the integration.
http://gethue.com/how-to-deploy-hue-on-hdp/
Hive-on-Spark has recently made a lot of progress, and there’s a new way to try out the integration. The team from MapR, Intel, IBM, and Cloudera have created a pre-configured Amazon AMI for trying it out.
http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/
The High Scalability blog has a post on how Bloomberg is using HBase to serve time-series data. The post describes how they’re replacing a legacy system with HBase, which provides 1000x faster writes and 3x faster reads. With that said, they’re eagerly anticipating the upcoming timeline-consistent standby region servers will help improve read latency in the face of failure in order to meet SLAs.
http://highscalability.com/blog/2014/12/17/the-big-problem-is-medium-data.html
Cloudera Enterprise is available on the Microsoft Azure Marketplace for deployment in the cloud. This tutorial describes how to launch a CDH cluster in Azure, which brings up cluster with Cloudera Manager for configuration.
http://azure.microsoft.com/blog/2014/12/17/how-to-deploy-the-cloudera-evaluation-cluster-in-azure/
This presentation provides an overview of using Spark together with Cassandra. The first part gives an introduction to Spark, the second looks at Spark and Cassandra (including how CQL integrates), and the last part looks at integrating Spark streaming with Cassandra.
http://www.slideshare.net/RussellSpitzer/zero-to-streaming-spark-and-cassandra
News
MapR and SAS announced an alliance this week. The companies will collaborate on Hadoop integration, customer support, and more.
http://www.cbronline.com/news/tech/software/analytics/mapr-and-sas-in-big-data-pact-4472975
The Pivotal blog has a post that helps shed some more light on the term ‘data lake.’ They walk through 10 ways that people use a data lake, which uses Hadoop for storage and integrates a MPP database and open-source tools.
http://blog.pivotal.io/big-data-pivotal/features/10-amazing-things-to-do-with-a-hadoop-based-data-lake
HBaseCon 2015 will take place on May 7th in San Francisco. The Call for Papers is now open and early bird registration is through Feb 1st (for $375).
http://blog.cloudera.com/blog/2014/12/hbasecon-2015-call-for-papers-and-early-bird-registration/
Hortonworks has created new certifications called HDP Operations Ready, HDP Security Ready, and HDP Governance Ready. As part of the announcement, Teradata, Cisco, VMware, and Syncsort have all been certified as HDP Ops Ready.
http://hortonworks.com/blog/accelerating-adoption-enterprise-hadoop/
The MapR blog has a post about the growing community support for Spark. It highlights how many vendors are all collaborating on Spark, and that there has already been significant adoption in community projects. Examples include Apache Crunch, the Kite SDK, Pig on Spark, and preview of Hive on Spark.
https://www.mapr.com/blog/look-back-spark-open-standard
Computer Business Review has 10 Hadoop predictions for 2015. They cover both the business-side of Hadoop (growth rate, consolidation) as well as technical considerations like Spark vs. Hadoop and SQL for Hadoop.
http://www.cbronline.com/news/tech/software/analytics/10-hadoop-predictions-for-2015-4471886
Releases
Apache Storm 0.9.3 was recently released. Highlights of the release include an improved Kafka integration (for writing data to Kafka), new support for writing data to HDFS, HBase integration, reduced dependency conflicts (via shading), and node.js support.
https://storm.apache.org/2014/11/25/storm093-released.html
Scaldual is a new framework for working with Lingual using Scalding. Lingual is a SQL interface extension to cascading, and the project provides APIs for translating between Lingual and Scalding.
https://github.com/richwhitjr/scaldual
`spark-kafka` is a new integration between Spark and Kafka. While Spark has built-in support for Spark streaming on Kafka, this one uses Kafka as a source for RDDs.
https://github.com/tresata/spark-kafka
Apache Slider is a framework for deploying distributed applications in YARN—with built-in support for Apache HBase and Apache Accumulo. Version 0.60.0 was recently with support for Kerberos clusters, support for Hadoop on Windows, improved failure handling, integration with Apache Ambari, and more. The Hortonworks blog has more info the release and upcoming plans for Slider.
http://hortonworks.com/blog/announcing-apache-slider-0-60-0/
Cloudera has announced a new Cloudera Labs project SparkOnHBase. The integration, for Scala and Java, supports creating RDD/DStream from a Scan and performing delete/put operations based on the contents of an RDD/DStream. The introductory post provides code examples and describes optimizations built into SparkOnHBase.
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
Minotaur is a new open-source framework for managing Apache Kafka, Apache Mesos, and CDH clusters in AWS using Cloud Formation. An introductory post describes how the system works and provides examples of usage.
http://allthingshadoop.com/2014/12/17/open-source-cloud-formation/
Apache Spark 1.2.0 was released this week. The release notes have a good overview of the major changes and improvements. Among the new features are Scala 2.11 support, a new Python streaming API, a write ahead log for streaming high-availability, and several improvements to MLLib, Spark SQL, and GraphX. Also, to improve performance, bulk transfers have been switched to use netty and a new shuffle mechanism.
http://spark.apache.org/releases/spark-release-1-2-0.html
Events (curated by Mortar Data)
UNITED STATES
California
ICT Applications, Cancer Genomics and Data Sciences (Fremont) - Sunday, December 28
http://www.meetup.com/Big-Data-Science/events/208479772/
FRANCE
Big Data & eCommerce (Paris) - Tuesday, December 23
http://www.meetup.com/Paris-Digital/events/218753173/
CHINA
Spark Meetup (Hangzhou) - Sunday, December 28
http://www.meetup.com/Hangzhou-Apache-Spark-Meetup/events/219249582/
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
It was quite a busy week for funding news with the Hortonworks IPO and VC funding for Altiscale and Qubole. There were also quite a few new releases (Oozie, Phoenix, Samza) as well as several posts describing recent releases in more detail. And given that we’re halfway into December, the first of 2014-retrospective and 2015-predictions for Hadoop are in this week’s issue.
Technical
Kafka is rapidly becoming a key component of the Hadoop ecosystem and integrating with more-and-more systems. As more use-cases arise, folks are looking to use Kafka in new ways. This post looks at how to model a few different types of data processing jobs with Kafka to ensure exactly-once delivery. It's not a comprehensive solution, but it introduces several key concepts to help think about avoiding duplication.
http://ben.kirw.in/2014/11/28/kafka-patterns/
This post provides several resources if you're looking to get started with Kafka. In addition to the official docs, it suggests several blog posts.
http://ingest.tips/2014/12/07/getting-started-with-kafka-resources/
This article about Kafka (last one, I promise!) has a recap of the O'Reilly Data Show Podcast. The conversation with Jay Kreps covers the origins of Kafka at LinkedIn and Kafka's surprising popularity.
http://radar.oreilly.com/2014/12/building-apache-kafka-from-scratch.html
The Hortonworks blog has the second post in a series of articles on data science with Hadoop. The first post looked at predicting airline delays using Pig and Python, and this time they do the same exercise with Spark and Spark’s matching learning library, MLlib. There’s an iPython notebook walking through the code, which does some data preparation and munging before evaluating Logistic Regression, Support Vector Machines, and Decision Trees.
http://hortonworks.com/blog/data-science-hadoop-spark-scala-part-2/
The Cloudera blog has a post describing several new Apache HBase features related to multi-tenancy that have shipped as part of CDH 5.2. These include throttling via quotas and execution queues (e.g. separate queues for read/write and de-prioritizing long-running scans). There’s also a discussion of future work to further improve multi-tenancy.
http://blog.cloudera.com/blog/2014/12/new-in-cdh-5-2-improvements-for-running-multiple-workloads-on-a-single-hbase-cluster/
Apache Knox Gateway is a perimeter security system for Hadoop. The latest release, 0.5.0 (from early November) includes support for HDFS HA, installation via Ambari, authorization via Apache Ranger, and YARN REST API access. The Hortonworks blog has more details on these features and future plans for Knox.
http://hortonworks.com/blog/announcing-apache-knox-gateway-0-5-0/
The Hortonworks blog also has a post about the recently released Apache Ranger 0.4.0. Ranger is a system for enterprise security in Hadoop, which is based on the technology that Hortonworks acquired as part of XA Secure. Highlights of the new release include support for Apache Storm and Knox, REST APIs for the policy manager, and storage of audit events in HDFS.
http://hortonworks.com/blog/announcing-apache-ranger-0-4-0/
This post looks at the HDFS protocol by way of monitoring system calls made when fetching a file. By running the python snakebite library under strace, one can see the contents of protobuf RPCs to the namenode and datanode to locate and fetch the data.
http://jvns.ca/blog/2014/12/10/spying-on-hadoop-with-strace/
Although Spark is written in Scala, a lot of folks are using it via Java (which is a far more widely used language). This post looks at the benefits in syntax and improvements in usability Java users will see as part of Java 8’s lambdas. There’s an analysis of implementing collaborative filtering in Spark with Java 7 and Java 8 (the latter is about half the lines of code). You can find the actual code on the companion github repo.
http://www.datanami.com/2014/12/11/apache-spark-java-8-big-data-team-2015/
Sqoop 1.99.4 was recently released with a number of new features. The ingest.tips blog has a tour of the architectural changes, the new intermediate data format, and command line tools. It also covers improvements to batch execution of shell commands, the rest API, and filesystem URI configuration.
http://ingest.tips/2014/12/11/sqoop-1-99-4-release/
The Cloudera Impala team has written a guide with a number of tips and best practices for building an Impala cluster. The tips are shared as a presentation, and they cover things like schema design, hardware recommendations, and multi-tenancy best practices.
http://blog.cloudera.com/blog/2014/12/the-impala-cookbook/
Hortonworks HDP 2.2 shipped earlier this month with features form the Apache Hive 0.14 release. This post on the Hortonworks blog covers a number of common questions about the ACID transaction support added in 0.14. Specifically, it discusses the expected number of concurrent transactions, transaction throughput, and overhead. It also mentions example use cases and some in-progress features that’ll be added soon.
http://hortonworks.com/blog/apache-hive-acid-transactions-hdp-2-2/
The Amazon Web Services blog has a post describing how to load data from Amazon Kinesis into HBase. The post describes how to use the AWS Java SDK to start an Amazon EMR HBase cluster, create a HBase table that’ll hold the data from Kinesis, implement a Kinesis connector pipeline (full source code on github), and run the Kinesis connector to load data.
http://blogs.aws.amazon.com/bigdata/post/Tx3CFT0COINZ4N8/Getting-HBase-Running-on-Amazon-EMR-and-Connecting-it-to-Amazon-Kinesis
News
As 2014 comes to an end, I expect to see lots of Hadoop year-in-review posts. This one looks at five big data industry trends from 2014. They are: vendors integrating SQL with Hadoop, maturing platforms (security and other enterprise features), new educational options including MOOCs, lots of new cloud options, and Spark (which seems to be integrating with just about every platform).
http://www.informationweek.com/big-data/software-platforms/top-5-big-data-trends-of-2014/a/d-id/1317939
The AWS Activate blog has a post on Hadoop startup Mortar Data, which is built on the AWS cloud (disclosure: Mortar syndicates Hadoop Weekly and curates the events section). The post talks about their platform, which is built with Amazon EMR, Apache Pig, and the Luigi workflow engine.
https://medium.com/aws-activate-startup-blog/modern-data-integration-with-mortar-and-redshift-fed7aff67519
Folks from Cloudera, MapR, and Twitter have written about the state of the community of the Apache Parquet (incubation) project. Parquet is a columnar storage format for Hadoop which is integrated with a number of different systems. They credit the community and the ubiquity of Parquet (thanks in part to the community for building many integrations) with making Parquet a successful project.
https://www.mapr.com/blog/bringing-community-together-parquet
Hadoop as a Service (HaaS) vendor Altiscale has announced $30 million in Series B funding (bringing total funding to $42 million). Unlike many other HaaS vendors, Altiscale has their own hardware running in datacenters on the east and west coasts of the US. WSJ has more details on the recent funding, the company, and how the company plans to use the new money.
http://blogs.wsj.com/venturecapital/2014/12/09/altiscale-raises-30-million-to-make-hadoop-easier-to-use/
Another HaaS vendor, Qubole, also raised a series B round this week. Qubole, who was founded by two creators of Apache Hive, raised $13 million (total funding is now $20 million). Qubole's offering includes an optimized version of Hive for the cloud, supports the Presto SQL-on-Hadoop engine, and more. GigaOm has more details on the round and the company.
https://gigaom.com/2014/12/10/apache-hive-creators-raise-13m-for-their-hadoop-service-qubole/
The Qubole blog, in a post unrelated to their funding, has curated a list of several industry articles covering predictions and expectations for Hadoop in 2015.
http://www.qubole.com/looking-2015/
It’s always interesting to hear how Hadoop is being used outside of the industries that made it popular (such as ads and search). This post describes the problems that were solved by introducing Hadoop at a hospital to analyze historic patient monitoring data.
http://www.cio.com.au/article/562370/tackling-big-data-challenges-hadoop/
The DMBS2 blog has three posts related to Hadoop. The first covers the future of Hadoop (especially highlighting some architectural issues with HDFS which the Tachyon distributed file system could solve) as “the new OS.” The second offers analysis of some news out of MapR this week on their customer growth and numbers. The last article includes notes covering a wide range of topics from Wibidata to Scaling Data to Hortonworks.
http://www.dbms2.com/2014/12/07/hadoops-next-refactoring/
http://www.dbms2.com/2014/12/10/a-few-numbers-from-mapr/
http://www.dbms2.com/2014/12/12/notes-and-links-december-12-2014/
Hortonworks became a publicly traded company on Friday on NASDAQ under the ticker HDP. The IPO raised $100 million, and the stock jumped as high as $24.35 on the first day of trading (from an initial price of $16). CNBC has more coverage of the IPO.
http://www.cnbc.com/id/102261971
Releases
Apache Oozie 4.1.0 was released this week. Highlights include a new Sharelib service, improvements to Oozie HA with secure clusters, support for sqoop via the oozie command, and many bug fixes.
http://mail-archives.apache.org/mod_mbox/oozie-user/201412.mbox/%3C358205767.4248377.1418077750359.JavaMail.yahoo@jws100124.mail.ne1.yahoo.com%3E
Apache Samza (incubating) version 0.8.0 was released this week. Samza is a stream processing framework built on Apache Kafka and Apache YARN. Highlights of the release include: a new YARN AM UI, support for RocksDB for state management, and major performance improvements.
https://blogs.apache.org/samza/entry/announcing_the_release_of_apache1
Apache Phoenix, the relational db layer for Apache HBase, released version 4.2.2 (for HBase 0.98.x) and 3.2.2 (for HBase 0.94.x) this week. The new releases include >100 bug fixes, statistics collection, correlated subqueries, and more.
https://blogs.apache.org/phoenix/entry/announcing_phoenix_4_2_2
Version 0.17.1 of the Kite SDK was released. This version has bug fixes and enhancements for kite data and morphlines.
http://kitesdk.org/docs/0.17.1/release_notes.html
MapR has added formal support for Apache Storm to their distribution. MapR is supporting version 0.9.3 of Storm, which is the most current release.
https://www.mapr.com/blog/apache-storm-now-part-mapr-distribution-including-hadoop
SequenceIQ has released a new version of Cloudbreak, their cloud-agnostic, docker-based Hadoop as a Service solution. This post describes one of the major new features of the release—autoscaling of Hadoop clusters via Periscope. Cloud break current supports AWS, Google Cloud, Azure, and OpenStack (in private beta). See the post for more details on Persicope’s Alarms, scaling policies, and scaling configurations.
http://blog.sequenceiq.com/blog/2014/12/12/cloudbreak-got-periscope/
Events (curated by Mortar Data)
UNITED STATES
California
Elastic Scaling in Spark 1.2 and Beyond (San Francisco) - Monday, December 15
http://www.meetup.com/spark-users/events/219015324/
From 0 to Streaming: Spark and Cassandra (San Francisco) - Wednesday, December 17
http://www.meetup.com/CassandraSF/events/199432252/
Interactive Session on Sparkling Water = Spark + H2O (Mountain View) - Wednesday, December 17
http://www.meetup.com/Silicon-Valley-Big-Data-Science/events/219067296/
Meet Apache Spark: A Faster and More Flexible Compute Engine (Santa Barbara) - Thursday, December 18
http://www.meetup.com/Scala-SB/events/218812151/
Oregon
Making Big Data Analytics Simple for Everyone: Datameer (Portland) - Thursday, December 18
http://www.meetup.com/Hadoop-Portland/events/219119919/
Washington State
Getting Started with Pig and Hive on Hadoop in Azure HDInsight (Redmond) - Wednesday, December 17
http://www.meetup.com/data-science-dojo/events/218748428/
Texas
Spark and Cassandra (Austin) - Wednesday, December 17
http://www.meetup.com/Austin-ACM-SIGKDD/events/219175468/
Missouri
Introducing the R Language / About Hue (Saint Louis) - Tuesday, December 16
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/218819986/
Kentucky
An Introduction to Hadoop for Big Data Analysis (Louisville) - Thursday, December 18
http://www.meetup.com/Louisville-BI-Big-Data-Analytics-Meetup/events/218916789/
North Carolina
Hortonworks: How YARN Enables Multiple Data Processing Engines (Charlotte) - Wednesday, December 17
http://www.meetup.com/CharlotteHUG/events/167353552/
New York
What's New in Neo4j 2.2 (New York) - Monday, December 15
http://www.meetup.com/nycneo4j/events/218988862/
Data Driven NYC: Yann LeCun & Mike Olson (New York) - Tuesday, December 16
http://www.meetup.com/NYC-Data-Business-Meetup/events/218791841/
Building Data Pipelines Using Luigi, with Erik Bernhardsson of Spotify (New York) - Tuesday, December 16
http://www.meetup.com/NYC-Data-Science/events/218604422/
BDW Meetup: Spark SQL (New York) - Wednesday, December 17
http://www.meetup.com/Big-Data-Warehousing/events/218915626/
Scaling Apache Storm Pipeline and Building Real-Time Market Data Platform (New York) - Wednesday, December 17
http://www.meetup.com/New-York-City-Storm-User-Group/events/218922691/
Etsy on Migrating to Kafka in Three Short Years (New York) - Thursday, December 18
http://www.meetup.com/NYC-Data-Engineering/events/219175595/
New Hampshire
Big Data Hadoop MapReduce Hands-On with Google Cloud (Nashua) - Thursday, December 18
http://www.meetup.com/FREE-Big-Data-Hands-On-Workshops/events/219195573/
CANADA
Holiday Guest Speakers: Mark Grover & Reza Zadeh (Toronto) - Thursday, December 18
http://www.meetup.com/TorontoHUG/events/216507712/
SWEDEN
SQL on Hadoop: SQL+NoSQL in One Place (Stockholm) - Tuesday, December 16
http://www.meetup.com/stockholm-hug/events/218928743/
ISRAEL
NoSQL in Real-time Architectures (Herzliya) - Tuesday, December 16
http://www.meetup.com/Big-Data-Israel/events/219111468/
INDIA
BigData and Analytics: Why to Learn Hadoop (Hyderabad) - Wednesday, December 17
BigData and Analytics: Why to Learn Hadoop (Hyderabad) - Wednesday, December 17
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
There’s a lot to cover from the past two weeks, including several new releases: Apache Hadoop 2.6.0, Apache Ambari 1.7.0, and updates from Cloudera, MapR, and Hortonworks. Technical posts from LinkedIn, Pinterest, and Spotify all give a glimpse into the data infrastructure of those companies. There’s also coverage of the Hortonworks IPO and Apache Drill, which graduated from the Apache incubator this week.
Technical
LinkedIn has published a technical overview of Gobblin, their data ingestion framework. LinkedIn uses Gobblin to ingest data from Kafka, Databus (database change logs), and external partners in a unified fashion. It’s responsible for things like compaction and privacy compliance. The post promises that in the future Gobblin will be open-sourced.
http://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease
The Cloudera blog has a post about BigBench, which is a specification-based benchmark for big data systems. Parts of BigBench are derived from TPC-DS, including the data model and ~1/3 of the queries. The post also details the upcoming BigBench 2.0 and includes some experimental results from running BigBench against 3 different server configurations.
http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/
The Hortonworks blog has a deeper look at features of the recently-released Hive 0.14. There are a number of notable new features, including ACID transactions, a cost-based optimizer, and temporary tables.
http://hortonworks.com/blog/announcing-apache-hive-0-14/
Apache Pig 0.14.0 was recently released with some exciting new features. This post looks at several those features and improvements: Pig on Tez, ORCStorage, Predicate pushdown for Load Functions, auto-shipping of UDF dependencies, and refactoring of jar artifacts.
http://hortonworks.com/blog/announcing-apache-pig-0-14-0/
The Amazon Web Services blog has a post covering how to build a data ingestion pipeline using Data Pipeline and Elastic MapReduce. The tutorial assumes that web servers publish data to an S3 bucket, after which it describes how to run Pig jobs to clean and filter the data and to generate reports using Hive.
http://blogs.aws.amazon.com/bigdata/post/Tx1PU7JM7I34L81/ETL-Processing-Using-AWS-Data-Pipeline-and-Amazon-Elastic-MapReduce
This post covers how to use the Oozie workflow system and the Sqoop data transfer framework to import data from MySQL to Hive. The post has a 3-part walkthrough and a troubleshooting section that describes several common errors.
http://ingest.tips/2014/11/27/how-to-oozie-sqoop-hive/
The Cloudera Impala team has written a paper describing Impala. It looks at the use-cases Impala aims to solve, gives an overview of the system architecture and main components, includes benchmarks with different file formats/compression, describes Impalas integration with YARN, and provides a comparison to other SQL-on-Hadoop systems as well as a commercial analytics db engine.
http://pandis.net/resources/cidr15impala.pdf
This in-depth tutorial looks at using Microsoft Azure’s HDInsight to run a Storm and HBase cluster to perform sensor analysis. The system consumes data from Event Hub and uses ASP.NET SignalR and D3.js to build a dashboard.
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-storm-sensor-data-analysis/
Apache Kafka 0.8.2 is currently in beta release with a final release planned for late this year. This post looks at several new features and improvements of the release, which include a new producer API, topic deletion, offset management in Kafka rather than Zookeeper, automated leader rebalancing, controlled shutdown, stronger durability guarantees, and connection quotas.
http://blog.confluent.io/2014/12/02/whats-coming-in-apache-kafka-0-8-2/
Spotify has written a post about how they’ve been migrating many of their MapReduce jobs from Hadoop Streaming with Python to Apache Crunch. Apache Crunch is based on Google’s FlumeJava, which has a number of nice properties (type-safety, Avro-support, simple testing, etc.). The post looks at how Spotify uses Crunch and introduces a new-open source project called `crunch-lib`, which contains a number of high-level operations.
https://labs.spotify.com/2014/11/27/crunch/
One of the goals of YARN is to be a general-purpose compute framework for many applications. There have been a few examples so far, but they’re still mostly compute frameworks (e.g. Spark, Storm). This post looks at a new project from LucidWorks for running SolrCloud on YARN. It describes the YARN application lifecycle, how to launch a Solr cluster, and how to shut one down. It’s a great introduction to building a long-running application on YARN.
http://lucidworks.com/blog/solr-yarn/
Among the many SQL-on-Hadoop frameworks is IBM’s Big SQL 3.0. This post compares Big SQL to Cloudera Impala 1.4 and Apache Hive 0.13 in two areas - how much effort is required in porting the SQL queries to each engine and a performance comparison. With the caveat that every vendor will try to show their system in the best light, the post claims that Big SQL is 3.6x faster than Impala (5.4x Hive) for single user at 10TB scale and 2.1x faster than Impala (8.5x Hive) with 4 concurrent streams.
https://developer.ibm.com/hadoop/blog/2014/12/02/big-sql-3-0-hadoop-ds-benchmark-performance-isnt-everything/
Amazon Web Services has posted some benchmarks for Hadoop workloads using Elastic MapReduce (it looks at 6 node clusters, which is good start but not a comprehensive overview). They compare the current-generation instances to previous-generation instances as well as running against data stored in S3 and on an instance-backed HDFS. Current-generation instances are faster and cheaper, which makes the more cost-effective in most cases. The new instance types have a fraction of the instance storage, though.
http://blogs.aws.amazon.com/bigdata/post/Tx3RD6EISZGHQ1C/The-Impact-of-Using-Latest-Generation-Instances-for-Your-Amazon-EMR-Job
As Hadoop deployments in the cloud become more economical and performant, we can expect to see more folks deploying Hadoop in the cloud in some shape or form. The Hortonworks blog has two posts about a hybrid cloud—the first details common use cases such as backup (to a blobstore like S3 or Azure Blob storage), development, and burst/overflow. The second describes how you can use the Microsoft Azure cloud towards these ends.
http://hortonworks.com/blog/hybrid-deployment-options-hadoop-hdp/
http://hortonworks.com/blog/deploying-hadoop-hybrid-cloud-microsoft/
Pinterest describes their analytics system which is built on MySQL and HBase with the Flask web framework. For scalability, the system uses HBase with co-processors and secondary indexes. The post gives a high-level overview of how the HBase storage/compute works as well as how they build rolling-window data sets with Cascading jobs.
http://engineering.pinterest.com/post/104418761649/building-pinalytics-pinterests-data-analytics
News
ZDNet has coverage of Hortonworks’ IPO. The company will trade on Nasdaq under “HDP,” and 6 million shares of stock are expected to be offered at between $12 and $14. The article has some more details on the IPO and the company based on the S-1 filing.
http://www.zdnet.com/hortonworks-prices-ipo-up-to-14-per-share-will-debut-on-nasdaq-7000036291/
InfoWorld has an analysis of the Hortonworks IPO and the Hadoop industry at a whole. The reporting is based on the data in Hortonworks’ S-1 filing and previous statements about the company’s finances. The article discusses the role that open-source software has on the revenue and on the success of any Hadoop vendor.
http://www.infoworld.com/article/2854359/hadoop/why-hortonworks-cant-sustain-a-billion-dollar-unicorn-valuation.html
Apache Drill, the schema-free SQL system which can query data on lots of different systems, has graduated from the Apache incubator. The Apache blog has more details on the project, and GigaOm research has more background on the SQL-on-Hadoop ecosystem and how Drill fits into it.
https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces66
http://research.gigaom.com/2014/12/apache-drill-drills-up-to-top-level-project/
If you’re looking for some more information on Apache Drill, the MapR blog describes several use-cases that customers are targeting for Drill. These include self-service data (e.g. integration with Tableau or MicroStrategy), data agility (plugging into many data sources), interactive query response time, and ubiquity (integration with several systems such as Spark and MongoDB).
https://www.mapr.com/blog/4-key-reasons-why-customers-are-so-excited-about-apache-drill
The edX massive online course organization is offering two courses for Apache Spark. “Introduction to Big Data with Apache Spark” will be taught by UC Berkeley professor Anthony Joseph and will start on February 23rd. “Scalable Machine Learning” will be taught by UCLA assistant professor Ameet Talwalkar and will start on April 14th.
http://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html
Releases
Apache Hadoop 2.6.0 was released this week. The new release includes a number of improvements, bug fixes, and new features. These include a (beta) key management server, improved support for heterogeneous storage tiers, (beta) transparent encryption at rest, support for long-running services in YARN, support for rolling upgrades, (beta) support for running applications in docker containers, and much more. The Hortonwork’s blog has more details on several of the new features as well as a look forward to the Hadoop 2.7 release.
http://mail-archives.apache.org/mod_mbox/hadoop-general/201412.mbox/%3C561D9F9D-22B1-4766-BD56-FFB6F3DFA24B%40hortonworks.com%3E
http://hortonworks.com/blog/announcing-apache-hadoop-2-6-0/
On the heels of Apache Hadoop 2.6.0 release, Hortonworks has announced HDP2.2 based on Hadoop 2.6.0 and with new versions of all core components. This release also adds several new systems which were previously in technical preview—Apache Spark, Apache Slider, Apache Kafka, and Apache Ranger. Other major improvements features of the release are support for Rolling Upgrades, automated cloud backup to Azure and S3, and phase 1 of stinger.next (Hive 0.14.0).
http://hortonworks.com/blog/available-now-hdp-2-2/
If you’re looking to try out Hadoop 2.6.0, there’s a new docker image for the release. This post from SequenceIQ shows how to run the container and run an example Hadoop job.
http://blog.sequenceiq.com/blog/2014/12/02/hadoop-2-6-0-docker/
MapR has announced upgrades to several systems included in its distribution. The new versions include Impala 1.4.1, Spark 1.1.0, Pig 0.13, Hive 0.13, and Sqoop 1.4.5.
https://www.mapr.com/blog/latest-open-source-updates-include-impala-spark-pig-and-hive
DataStax has announced DataStax Enterprise 4.6, which provides an enterprise-ready Apache Cassandra and management software. The new version includes Apache Spark streaming analytics and security integration.
http://www.datastax.com/2014/12/datastax-announces-dse-4-6-the-leading-database-platform
Apache Ambari 1.7.0 was released this week. It’s a pretty epic release resolving over 1,600 tickets. The Hortonworks blog has details on improvements and new features, which include improvements across operations, extensibility, and the core platform.
http://hortonworks.com/blog/announcing-apache-ambari-1-7-0/
Cloudera announced several patch-level releases this week. Cloudera Enterprise 5.2.1 includes fixes to Oozie, YARN, Impala, Cloudera Manager, and Cloudera Navigator. Several previous releases (Cloudera Enterprise 5.1.4/5.0.5 and some CDH4 releases) were all patched to fix the POODLE vulnerability in SSL. Finally, new ODBC and JDBC drivers for CDH 5.2 were released with support for Hive 0.13, Impala 2.0, and more.
http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-2-1-CDH-5-2-1-Cloudera-Manager/m-p/22239#U22239
http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-1-4-and-5-0-5-Cloudera/m-p/22337#U22337
http://community.cloudera.com/t5/Release-Announcements/Announcing-New-ODBC-and-JDBC-Drivers/m-p/22384#U22384
The presto-marathon-docker project provides tools for running Presto (the open-source SQL-on-Hadoop/S3/etc from Facebook) inside of docker and using Mesos to build an on-demand cluster.
https://github.com/sheepkiller/presto-marathon-docker
Events (curated by Mortar Data)
UNITED STATES
California
Databricks Spark Meetup (Playa Vista) - Thursday, December 11
http://www.meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/218748643/
December SF Hadoop Users Meetup (San Francisco) - Wednesday, December 10
http://www.meetup.com/hadoopsf/events/219004016/
Texas
Apache Drill Intro: Data Exploration and Analytics on Hadoop (Houston) - Wednesday, December 10
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/218626903/
Minnesota
Let's Talk about Hadoop! (Minnetonka) - Wednesday, December 10
http://www.meetup.com/Minnetonka-Big-Data-Analytics-Meetup/events/218644407/
Tennessee
Join Us @Chadoopers (Chattanooga) - Thursday, December 11
http://www.meetup.com/CHadoop/events/218681314/
New York
Go Lightning Talks: Apache Kafka; Solving Problems with Bosun (New York) - Tuesday, December 9
http://www.meetup.com/nycgolang/events/218699394/
Massachusetts
Show & Tell: Winter Is Coming Edition (Boston) - Wednesday, December 10
http://www.meetup.com/bostonhadoop/events/218874917/
Spark + Cassandra (Waltham) - Wednesday, December 10
http://www.meetup.com/The-Boston-Cassandra-Users/events/218834781/
CANADA
What's the Scoop on Hadoop? (Ottawa) - Wednesday, December 10
http://www.meetup.com/Big-Data-Developers-in-Ottawa/events/218942004/
UNITED KINGDOM
Building Data Pipelines (Bristol) - Tuesday, December 9
http://www.meetup.com/Bristol-Java-Meetup/events/211167202/
AUSTRALIA
Mobile Beacons and #IoT with SQL on Hadoop Featuring RSA Analytics (Melbourne) - Tuesday, December 9
http://www.meetup.com/BigDataAnalyticsMelbourne/events/218788958/
POLAND
Intro to Apache Spark: Paweł Szulc (Warsaw) - Tuesday, December 9
http://www.meetup.com/Warszawa-Java-User-Group-Warszawa-JUG/events/219054039/
Lightning Talks (Warsaw) - Wednesday, December 10
http://www.meetup.com/warsaw-hug/events/218579675/
SWEDEN
A Machine Learning Pipeline for Event Detection on Spark (Goteborg) - Wednesday, December 10
http://www.meetup.com/Goteborg-Machine-Learning-Meetup/events/218799452/
ROMANIA
How to Think in MapReduce (Cluj-Napoca) - Wednesday, December 10
http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/events/218894090/
NETHERLANDS
Apache Flink and Generating Query Suggestions on Hadoop (Hoofddorp) - Thursday, December 11
http://www.meetup.com/Netherlands-Hadoop-User-Group/events/218635152/
ARGENTINA
Apache Spark (Buenos Aires) - Thursday, December 11
http://www.meetup.com/buenos-aires-high-scalability-group/events/218779640/
IRELAND
Apache Spark and the Current Big Data Landscape (Dublin) - Thursday, December 11
http://www.meetup.com/DublinR/events/218808539/
INDIA
Big Data and Real-time Analytics with Spark (Bangalore) - Friday, December 12
http://www.meetup.com/Spark_big_data_analytics/events/208197842/
Practical MapReduce Programming Mini-Course (Chennai) - Saturday, December 13
http://www.meetup.com/Practical-BigData-Science/events/218777795/
Hadoop Hackathon (Gurgaon) - Saturday, December 13
http://www.meetup.com/Big-Data-Meetup-NCR-Chapter/events/218712366/
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
Lots of news out of Europe this week as both StrataConf Barcelona and ApacheCon EU took place. In addition to several interesting presentations from those conferences, this week’s issue contains several articles on Spark and YARN. Also, Stripe has open-sourced four Hadoop tools, and there were releases from open-source projects Apache Pig and Apache Chukwa as well as vendor releases by Splice Machine and HP’s Vertica. One quick editorial note—with the short week in the US for Thanksgiving, I’m going to skip next week’s issue and resume with issue #98 on December 7th.
Technical
This tutorial describes the steps needed to run a HBase cluster with Microsoft Azure’s HDInsight, and how to build a simple application to interact with it. It details how to setup a new maven application, configure hbase-site.xml, and building an executable jar for running a test against HBase.
http://blogs.msdn.com/b/bigdatasupport/archive/2014/11/04/how-to-use-hbase-java-api-with-hdinsight-hbase-cluster-part-1.aspx
This post looks at a number of key configuration parameters relating to YARN memory usage (which is common area to tweak). It covers configuration for map reduce, the yarn scheduler, and yarn application managers.
http://blogs.msdn.com/b/bigdatasupport/archive/2014/11/11/some-commonly-used-yarn-memory-settings.aspx
This post is a good overview of Spark and how it compares to MapReduce. It describes RDD, describes the computation model (and how it compares to MapReduce), and compares Spark to Cascading/Scalding. The post also explores Spark’s reputation as being in-memory focussed and what it means in terms of performance.
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
Hadoop’s default deployment doesn’t include any sort of authentication - the system trusts that the client is who it says it is. To enforce authentication, Hadoop has support for Kerberos. This gives a quick intro to the key concepts of Kerberos and how they integrate with Hadoop.
http://www.securityweek.com/big-data-smaller-problems-configuring-kerberos-authentication-hadoop
This presentation describes HBase’s upcoming support for richer encodings and data types. It looks at the new OrderedBytes API, the DataType API (for structs, unions, etc), and what’s upcoming. There are also examples of using the new APIs.
http://www.slideshare.net/xefyr/hbase-data-types
Twitter has posted about their system for building an index of the entire corpus of tweets. The system relies heavily on Hadoop, both for storage and for aggregating and preparing (via Pig jobs).
https://blog.twitter.com/2014/building-a-complete-tweet-index
Apache Samza is a stream processing framework built on YARN. Samza was originally open-sourced by LinkedIn, and they have written about operating Samza (and Apache Kafka) at scale. The post describes their deployment system as well as their metric collection and alerting setup.
http://engineering.linkedin.com/samza/operating-apache-samza-scale
This presentation explores the state of Hadoop and RDF. Two projects providing the must complete support are Apache Jena project, which has experimental modules for RDF in Hadoop and the Intel Graph Builder, which supports generic graph data via Pig UDFs.
http://www.slideshare.net/RobVesse/quadrupling-your-elephants-rdf-and-the-hadoop-ecosystem
SAMOA: Scalable Advanced Massive Online Analysis is a new system aiming to be the “Mahout of Streaming.” Open-sourced by Yahoo, the project includes support for several backends including Apache Storm, Apache S4, and Apache Samza. The system implements a number of machine learning algorithms, including Distributed Stream Clustering and Vertical Hoeffding Tree Classifier.
https://speakerdeck.com/gdfm/samoa-at-strata-barcelona-2014
A new (alpha) feature of the upcoming Hadoop 2.6 release will be the ability to run Docker containers as part of YARN applications. SequenceIQ has an overview of this feature, and how it can also be used when YARN is already running inside of Docker.
http://blog.sequenceiq.com/blog/2014/11/20/yarn-containers-and-docker/
The Altiscale blog has an article about the ubiquity of the Hive metastore. It notes how several new SQL-on-Hadoop systems support it (Impala, Presto, Spark SQL, Drill) and that they do this in order to be compatible with Hive. Both of these turn out to be good things for users—if your data is in HDFS and the Hive metastore, it’s easy to use in many other systems.
https://www.altiscale.com/hadoop-blog/rise-hive-metastore-semantic-repository/
The SequenceIQ blog has a post on building a hybrid Hadoop deploy in the cloud by using both a permanent cluster and adding ephemeral clusters as needed. They show how to build clusters with their Cloudbreak (cloud-agnostic provisioning built on docker and Ambari) tool and to use their other tool Periscope to autoscale the ephemeral clusters.
http://blog.sequenceiq.com/blog/2014/11/17/datalake-cloudbreak-2/
News
The Call For Abstracts for Hadoop Summit Europe 2015 closes on December 5th. The conference takes place in April in Brussels, Belgium.
http://2015.hadoopsummit.org/brussels/call-for-abstracts/
MapR, whose distribution is available as an option for Amazon Elastic MapReduce (EMR), has a post with some news on the integration. First, MapR is now supported via EMR on the latest instance families. Second, hourly pricing of the EMR+MapR will be lower for both the M5 and M7 enterprise versions (M3 continues to be free).
https://www.mapr.com/blog/amazon-web-services-just-made-running-mapr-even-easier
MapR and Teradata announced an expanded partnership in which Teradata will provide MapR’s distribution as part of the Teradata Unified Data Architecture.
https://www.mapr.com/blog/mapr-and-teradata-why-architecture-matters-customer-success
GigaOm has an article on eHarmony’s infrastructure ambitions around Hadoop and OpenStack. As part of the transition, they’re moving from several instances of a Hadoop appliance to a single YARN cluster as well as bringing up new technologies like Spark and Storm.
https://gigaom.com/2014/11/19/why-eharmony-is-rebuilding-itself-atop-openstack-and-hadoop/
Datanami has an article on the rise of Spark. They note that for the first time, Apache Spark has bypassed Apache Hadoop on Google Trends. The post has a look back at the highlights of Spark this year including the growing number of contributors and the recent sort workload results.
http://www.datanami.com/2014/11/21/spark-just-passed-hadoop-popularity-web-heres/
Qubole and Microsoft Azure have announced a new strategic relationship in which Qubole’s Big Data-as-a-Service platform is available on Azure.
http://www.qubole.com/qubole-partner-azure/
Releases
Splice Machine has announced general availability of their RDBMS built on Hadoop. Version 1.0 includes supports for SQL:2003 including analytics functions, native backup and recovery, and integrates with HCatalog. Splice Machine is positioned as a cheaper alternative than scaling out a traditional RDBMS.
http://www.zdnet.com/splice-machines-sql-on-hadoop-database-goes-on-general-release-7000035958/
Apache Pig 0.14.0 was released this week. Major highlights of the release include support for a Apache Tez as a backend, OrcStorage, and loader predicate push down. The release supports Hadoop 0.23.x, 1.x, and 2.x.
http://pig.apache.org/releases.html#20+November%2C+2014%3A+release+0.14.0+available
Stripe has open-sourced four new projects for Hadoop. The projects are: Timberlake, a dashboard for YARN and MRv2, Brushfire, a system for distributing learning of ensemble tree models inspired by Google’s PLANET, Sequins, a static database backed by SequenceFiles, and Herringbone, a tool for working with Parquet files and Impala/Hive.
https://stripe.com/blog/four-new-hadoop-projects
HP has announced a new version of their Vertica columnar MPP analytics engine that runs on Hadoop. It supports many major distributions, including MapR, Cloudera, Hortonworks, and Apache Hadoop.
http://www.datanami.com/2014/11/18/vertica-gets-hadoop-upgrade/
Apache Chukwa version 0.6.0 was released. Chukwa is a system for distributed monitoring and analysis from data in log files. The new release adds support for HBase, deprecates the Chukwa collector, and resolves a number of bugs.
http://chukwa.apache.org/docs/r0.6.0/releasenotes.html
Events (curated by Mortar Data)
UNITED STATES
Minnesota
Stream Processing on Hadoop (Saint Paul) - Monday, November 24
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/218083032/
Texas
All Models Are Wrong, Some Are Useful (Plano) - Monday, November 24
http://www.meetup.com/Dallas-Big-Data-Science/events/215952012/
UNITED KINGDOM
Options for Streaming Analytics on Azure and Azure Batch (London) - Tuesday, November 25
http://www.meetup.com/UKAzureUserGroup/events/216580082/
FRANCE
Hadoop Meetup on Cascading/Tez with Concurrent and Hortonworks (Paris) - Tuesday, November 25
http://www.meetup.com/Hadoop-User-Group-France/events/218753457/
GERMANY
November Meetup (Mannheim) - Monday, November 24
http://www.meetup.com/Big-Data-Mannheim-Rhein-Neckar/events/204416842/
Apache Spark (Hamburg) - Wednesday, November 26
http://www.meetup.com/Scala-Hamburg/events/216734832/
NETHERLANDS
What's New with Apache Spark? An Evening with Paco Nathan (Amsterdam) - Monday, November 24
http://www.meetup.com/Amsterdam-Spark/events/207220772/
SWEDEN
Offline and Real-time Click Stream Processing (Stockholm) - Wednesday, November 26
http://www.meetup.com/stockholm-hug/events/217569002/
AUSTRALIA
Apache Spark: Easier and Faster Big Data (Sydney) - Thursday, November 27
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/218140722/
SINGAPORE
Spark Singapore First Meetup! (Singapore) - Wednesday, November 26
http://www.meetup.com/Spark-Singapore/events/218794905/
Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
Big news this week out of Palo Alto as Hortonworks has filed paperwork for an initial public offering. There were also a number of notable releases this week, including Apache Hive 0.14.0. Technical posts cover a large number of ecosystem topics, including Apache Sqoop, Apache Drill, and Apache Pig. There’s a lot of breadth in this issue, so there should be something for everyone!
Technical
The Cloudera blog has a guest post from Cerner about integrating Apache Kafka with HBase and Storm for real-time processing. The post describes how adopting Kafka helped reduce load on HBase (which was previously used for queuing) and improve performance. This style of Kafka-based architecture seems to be more and more common, but it’s always interesting to hear how folks are putting together the pieces of the Hadoop ecosystem.
http://blog.cloudera.com/blog/2014/11/how-cerner-uses-cdh-with-apache-kafka/
The MapR blog has a post on using the recently-released Apache Drill 0.6.0-incubating to analyze Yelp’s public data set. The data, which is a JSON file, can be queried directly via SQL in Drill without first declaring the data’s schema (drill auto-detects it). The post has a number of sample queries which you can use to get started analyzing this or any other data set.
https://www.mapr.com/blog/how-turn-raw-data-yelp-insights-minutes-apache-drill
The Cloudera blog has a second guest post, this time from Dell, on the new Oracle direct-mode in Sqoop 1.4.5. The post describes several of the implemented optimizations in the Oracle direct mode and includes an analysis of performance improvements the connector provides.
http://blog.cloudera.com/blog/2014/11/how-apache-sqoop-1-4-5-improves-oracle-databaseapache-hadoop-integration/
The Hortonworks blog has a post on using Apache Pig with the Python Scikit-learn package in order predict flight delays using logistic regression and random forests. The post is a bit light in details, but there is a linked IPython notebook which has a very detailed overview and description of the entire process. Given that Python is often a data scientist’s top choice for machine learning on small data sets, it’s useful to see how to extend it to larger data sets with Pig.
http://hortonworks.com/blog/data-science-apacheh-hadoop-predicting-airline-delays/
The ingest.tips blog has a post on Sqoop1 support for Parquet, which leverages the Kite SDK to generate Parquet files during import. The post serves as a good introduction to Sqoop1, which can both import data to HDFS and update the Hive metastore with information about the data. There are examples demonstrating how to use Parquet support.
http://ingest.tips/2014/11/10/parquet-support-arriving-in-sqoop/
Tephra is a open-source system that provides globally-consistent transactions for Apache HBase. Cask, the makers of Tephra, have written a blog post describing the requirements and design of Tephra. Tephra is designed in such a way that it can be used with systems other than HBase, and it is even designed to support transactions spanning multiple data stores.
http://blog.cask.co/2014/11/how-we-built-it-designing-a-globally-consistent-transaction-engine/
This presentation focusses on Spark streaming, the micro-batch component of Apache Spark. The slides give an introduction to both Spark and Spark streaming, describe several use cases (claiming there are 40+ known production use cases), give an overview of several integrations (Cassandra, Kafka, Elastic Search, and more), and look ahead to some upcoming features and improvements in the development pipeline.
http://www.slideshare.net/pacoid/tiny-batches-in-the-wine-shiny-new-bits-in-spark-streaming
News
Hortonworks has filed paperwork for their initial public offering this week. The filing includes a number of details on the company, including financial numbers ($33.4M in revenue so far in 2014), an overview of key company milestones, and number of employees (524 at the end of September). GigaOm has an analysis of some of these numbers and an overview of what the IPO means for the rest of the industry.
https://gigaom.com/2014/11/10/hadoop-startup-hortonworks-has-filed-for-an-ipo/
https://gigaom.com/2014/11/10/why-the-hortonworks-could-be-a-bellwether-for-hadoop/
IBM’s Big Data for Social Good Challenge opened this week. The challenge includes $40k in prizes, which will be awarded by a panel composed of IBM and industry experts. IBM has a curated list of datasets which can be used as part of a challenge entry.
https://developer.ibm.com/hadoop/2014/11/10/participate-big-data-social-good-challenge/
Releases
Apache Drill 0.6.0-incubating was recently released. 0.6.0 is the second beta release, primarily containing bug fixes. Notable new features include ANSI SQL support for MongoDB, partition pruning, and (alpha) window function support.
http://mail-archives.apache.org/mod_mbox/incubator-drill-user/201411.mbox/%3CCAA_-67d996Ec22tSgUKQGE-_Ck1FqhLdqbp1dNGZPRD6OGsxuQ%40mail.gmail.com%3E
Cubert is a new open-source tool from LinkedIn for writing high-performance MapReduce jobs. It’s a new language on the same level of Pig or Hive (sharing some resemblance to Pig) as well as a novel storage format/layer called blocks. For statistical calculations, graph computations, and OLAP cubes, Cubert offers impressive performance improvements. There’s a lot more information in the introductory blog post.
https://engineering.linkedin.com/big-data/open-sourcing-cubert-high-performance-computation-engine-complex-big-data-analytics
Apache Hive 0.14.0 was released this week. The release resolves over 1,000 (!) Jira issues. I’m sure we’ll soon hear more details about the release in blog post form but some quick highlights include: support for insert/update/delete with ACID support, a cost-based optimizer, support for data stored in Accumulo, support for HBase snapshots, and many improvements to ORCFile and HiveServer 2.
http://mail-archives.apache.org/mod_mbox/hive-user/201411.mbox/%3CCAH93c2ZaxVGtKp72QMiVrQ4d0XKRpEJr9d2t9orT3=z0bQVnOQ@mail.gmail.com%3E
Pivotal Cloud Foundry (CF) has added support for deploying Cassandra via DataStax Enterprise. The blog post introducing the feature has many more details as well as an example of setting up a cluster.
http://blog.pivotal.io/cloud-foundry-pivotal/features/an-easier-way-to-deploy-cassandra-clusters
Version 0.4.1 of the Spark Job Server has been released. The new version supports Spark 1.1.0 and has improvements for deployment/configuration.
https://github.com/spark-jobserver/spark-jobserver/releases/tag/v0.4.1
https://github.com/spark-jobserver/spark-jobserver/blob/v0.4.1/notes/0.4.1.markdown
Microsoft released version 2.5 of the Azure SDK and a preview of Visual Studio 2015. The releases contain support for HDInsight (the Hadoop as a Service component of Azure) including a Hive query editor and job viewer.
http://azure.microsoft.com/blog/2014/11/12/announcing-azure-sdk-2-5-for-net-and-visual-studio-2015-preview/
Events (curated by Mortar Data)
UNITED STATES
California
Data Exploration in Spark (San Francisco) - Tuesday, November 18
http://www.meetup.com/San-Francisco-PyData/events/215142332/
Getting Started with Spark and Scala, by Paul Snively of Verizon OnCue (El Segundo) - Tuesday, November 18
http://www.meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/207973922/
OCBigData Monthly Meetup #7 (Irvine) - Wednesday, November 19
http://www.meetup.com/OCBigData/events/179381262/
49th Bay Area Hadoop User Group Monthly Meetup (Sunnyvale) - Wednesday, November 19
http://www.meetup.com/hadoop/events/152042012/
HBase Meetup @ WANdisco (San Ramon) - Thursday, November 20
http://www.meetup.com/hbaseusergroup/events/205219992/
Washington
Unlocking Your Hadoop Data with Apache Spark and CDH5 (Seattle) - Wednesday, November 19
http://www.meetup.com/Seattle-Spark-Meetup/events/169932382/
Oregon
MapR Presents Apache Drill: Self-Service Data Exploration (Portland) - Wednesday, November 19
http://www.meetup.com/Hadoop-Portland/events/216654112/
Apache Spark: Setup, Overview, and Comparison (Portland) - Wednesday, November 19
http://www.meetup.com/Portland-Data-Science-Workshops/events/215207692/
Missouri
Securing the Hadoop Cluster (Saint Louis) - Tuesday, November 18
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/215019942/
Texas
Hadoop Like a Champion! (Austin) - Tuesday, November 18
http://www.meetup.com/CloudAustin/events/212247982/
Spark and Cassandra: Building and Deploying an Application (Austin) - Thursday, November 20
http://www.meetup.com/Austin-Cassandra-Users/events/211707542/
Utah
Hadoop Lunch at Adobe (Lehi) - Thursday, November 20
http://www.meetup.com/BigDataUtah/events/217120332/
Virginia
Hadoop Tutorial: Map-Reduce on YARN, Part 1 (Sterling) - Saturday, November 22
http://www.meetup.com/The-Sterling-dbuser-Meetup-Group/events/210492652/
Pennsylvania
Understanding the Foundations of Hadoop (Philadelphia) - Tuesday, November 18
http://www.meetup.com/Big-Data-Developers-in-Philadelphia/events/217612702/
North Carolina
Triangle SQL Server UG Meeting (Raleigh) - Tuesday, November 18
http://www.meetup.com/tripass/events/218643575/
Automating Customer Intelligence Management in Hadoop (Charlotte) - Wednesday, November 19
http://www.meetup.com/CharlotteHUG/events/167353212/
When to Use Pig instead of Hive (Winston Salem) - Thursday, November 20
http://www.meetup.com/Triad-Hadoop-Users-Group/events/208153612/
New Jersey
YARN + Docker Containers: Integration and Privilege Isolation (Hamilton Township) - Wednesday, November 19
http://www.meetup.com/nj-hadoop/events/206636262/
New York
Privilege Isolation in Docker Containers (New York) - Thursday, November 20
http://www.meetup.com/Hadoop-NYC/events/207004472/
Massachusetts
SQL on Hadoop: Hands-on (Boston) - Wednesday, November 19
http://www.meetup.com/Big-Data-Developers-in-Boston/events/215125502/
UNITED KINGDOM
November 2014 Hadoop Meetup (London) - Monday, November 17
http://www.meetup.com/hadoop-users-group-uk/events/217791892/
SINGAPORE
Analyzing Real-World Data with Drill, Hadoop & MongoDB | Tomer Shiran, MapR (Singapore) - Monday, November 17
http://www.meetup.com/BigData-Hadoop-SG/events/216571852/
GERMANY
Apache Cassandra, Apache Spark, and Hadoop Meetup (Munich) - Tuesday, November 18
http://www.meetup.com/Big-Data-Developers-in-Munich/events/217571312/
Patrick McFadin Talks C* & Spark for Time Series, plus A Leap Forward for SQL on Hadoop (Berlin) - Wednesday, November 19
http://www.meetup.com/Berlin-Cassandra-Users/events/217584792/
NETHERLANDS
Patrick McFadin Talks Cassandra, Spark, Tips and Tricks (Amsterdam) - Friday, November 21
http://www.meetup.com/Netherlands-Cassandra-Users/events/218615363/
HUNGARY
Big Data Meetup, ApacheCon Edition (Budapest) - Tuesday, November 18
http://www.meetup.com/Big-Data-Meetup-Budapest/events/208253412/
AUSTRALIA
Drilling in on SQL and Hadoop (Melbourne) - Wednesday, November 19
http://www.meetup.com/Big-Data-Analytics-Meetup-Group/events/218590301/
SPAIN
Databricks Comes to Barcelona (Barcelona) - Thursday, November 20
http://www.meetup.com/Spark-Barcelona/events/212164712/
INDIA
Big Data Meetup (Bangalore) - Friday, November 21
http://www.meetup.com/Bangalore-Hadoop-Meetups/events/216724732/
Hadoop Workshop (Hyderabad) - Saturday, November 22
http://www.meetup.com/hyderabad-scalability/events/217755662/