Latent Class @unbareablelightness - Tumblr Blog

Posts

Whether the setting is one of L.A.’s more rarefied neighborhoods or one of its most destitute, the rules are the same: Anybody can play, but it takes something else to stay

Triple double around the corner!

West Virginia has the highest overdose death rate in the country. Locals are fighting to save their neighbors—and their towns—from destruction.

These pieces are always hard for me. Addiction sucks, but the energy being put into the opioid epidemic while the crack epidemic was villainized will always bug me.

#westvirginia

Internet History Podcast: 143. Neil Hunt on the History of Netflix and Netflix Streaming

Completed in 2017 in Brasilia, Brazil. Images by Haruo Mikami. The house is located in a residential area far away from the city center in Brasília. The weather in this part of Brazil is characterized by a...

Song Exploder: Michael Kiwanuka - Black Man in a White World

Internet History Podcast: 142. Andy Rachleff @arachleff co-founder of Benchmark and Wealthfront

Software Engineering Daily: Microservices Transition with Cassandra Shum

Data Pipelines with Apache Spark and Couchbase

#Apache Spark

Beginning Apache Flink

I’ve been committed to the Apache Spark ecosystem for as long as I’ve been doing data engineering. I’ve seen its adoption grow, been a fan of its APIs, and even decided to write a book about Spark. Through all of this, there has been a nagging sense that I need to try a “true” streaming framework like Twitter Heron, Apache Flink or Apache Apex. I’ve read articles and watched videos about the difference between true streaming and micro-batch frameworks and was mostly unconvinced that it would make a difference for me. Then I started a project where Flink ended up being the go-to streaming framework, and I learned a great deal about it. This post combines some notes and things I noticed while using Flink; hopefully anyone starting out with Flink will find it helpful.

Getting Started

Like Spark, Flink is fairly overwhelming to get started with, mostly because of installation and run-time configuration. After a bunch of searching around, I was able to put together a decent starter SBT config for Flink. I used IntelliJ to work with Flink; because its API is complex, the type hinting and other niceties come in pretty handy.

For other ways of running Flink (Docker, cluster, Mesos), I found the docs pretty helpful. This post is fairly elementary, so I won’t cover those here.

Ergonomics / API

What I noticed early on using Flink is that it really was designed with streaming in mind first. This is evident in its separate APIs for a keyed stream, a DataStream and a windowed stream. Once you work with streams for a little while, you’ll find yourself wanting an API that is robust and can handle streams in a sensible way. In Spark Streaming, I found myself writing a bunch of boilerplate code aimed at accomplishing a lot of what Flink is designed to do out of the box. One of the places this is most evident is the DataStream and its derivatives. I’ll walk through each.

DataStream

A DataStream is a typed stream of data processed one element at a time. It’s what all of Flink’s streaming architecture is built on. You can use this stream to do whatever you want. For example, I could read in a stream and add two to each value:
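To make the one-at-a-time idea concrete, here is a minimal sketch in plain Python. It is not Flink's API; the generator simply stands in for a stream, and the name `add_two` and the sample values are made up for illustration.

```python
from typing import Iterable, Iterator


def add_two(stream: Iterable[int]) -> Iterator[int]:
    """Handle elements one at a time, emitting each result as soon as it is ready."""
    for value in stream:
        yield value + 2


# A finite iterator stands in for what would be an unbounded source.
source = iter([1, 2, 3])
for result in add_two(source):
    print(result)
```

Because the function is a generator, nothing is buffered: each value flows through as it arrives, which is the property that separates this model from a micro-batch.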

Or, I could read in a stream and write it back out to the file system.

KeyedDataStream

A KeyedDataStream is a stream grouped and evaluated by a key. Typically, this is the first step in a longer process of creating a window or other type of stream. The API is pretty straightforward: you just add a .keyBy("Name your key") to a DataStream. In and of itself it's not all that useful; it comes in handy with a window.

WindowedDataStream

A WindowedDataStream is what you’d expect: a DataStream over some window of time. This is how Spark Streaming works out of the box and how most streaming tutorials introduce streams. Flink’s API is very robust, allowing you to configure windows based on number of occurrences as well as spans of time. Having this kind of control out of the box beats writing custom code to do it.
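The two kinds of window mentioned above, by time and by number of occurrences, can be sketched in plain Python on top of a keyed stream. This is only an illustration of the concept and not Flink's windowing API; the event layout `(timestamp, key, value)`, the function names and the sample events are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str, int]  # (timestamp in seconds, key, value)


def tumbling_time_windows(events: Iterable[Event], size: float) -> Dict[Tuple[str, int], int]:
    """Sum values per key inside fixed, non-overlapping windows of `size` seconds."""
    sums: Dict[Tuple[str, int], int] = defaultdict(int)
    for ts, key, value in events:
        window_index = int(ts // size)  # which window this timestamp falls into
        sums[(key, window_index)] += value
    return dict(sums)


def count_windows(events: Iterable[Event], n: int) -> List[Tuple[str, int]]:
    """Emit a per-key sum every time `n` elements have arrived for that key."""
    buffers: Dict[str, List[int]] = defaultdict(list)
    emitted: List[Tuple[str, int]] = []
    for _, key, value in events:
        buffers[key].append(value)
        if len(buffers[key]) == n:
            emitted.append((key, sum(buffers[key])))
            buffers[key].clear()
    return emitted


events = [(0.5, "a", 1), (1.2, "b", 10), (4.9, "a", 2), (5.1, "a", 3), (9.0, "b", 20)]
print(tumbling_time_windows(events, size=5))
print(count_windows(events, n=2))
```

Grouping by key first and windowing second mirrors the order described above: the key decides which elements belong together, and the window decides when a result is produced.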

You can also define what are known as TriggerPolicies and EvictionPolicies, in which you can define a specific trigger and end state for that trigger. These require a bit more orchestration but can be effective for a number of streaming situations.

Technically, Spark supports all 3 of these but not with the same ease of use. With the new Structured Streaming API most of these mechanics are there now, but it’s still not as intuitive as Flink in my opinion.

Another thing that is extremely nice about Flink is its dashboard. Spark has a notoriously bad dashboard, and it’s not nearly as nice as Flink’s.

It’s quite easy to follow the flow of data in Flink and consume and create streaming sources without a ton of effort.

Batch Model

The real strength of Spark is in its batch model, and in comparison Flink is actually pretty nice to use. Doing most of your batch-related transformations is just as nice as it is in Spark. Here are a few examples:

Ecosystem

When your ecosystem bar is Spark, you’ll be hard pressed to meet it. Flink supports all of the major streaming technologies like Apache Kafka, AWS Kinesis and Debezium. Being the newer kid on the block, it’s just not as rich as what Spark has to offer. That being said, with full support of the Scala and Java ecosystem, I have yet to find a situation Flink couldn’t handle.

Summary

It’s clear that a good deal of smart people and companies are investing a lot in Apache Flink, but the question remains, should you? A lot is written and said about streaming, but in my experience, the majority of workloads are still batch. Spark is a much more mature ecosystem for batch workflows and a pretty mature one for streaming as well.

I’m a pretty big believer in streams and their potential. I’m not sure that streams are the wave of the future for Dashboards, reporting and applications that are frozen in time, but I think it’s already clear that some of the most impressive and useful apps are built off of real time data. These are the kinds of applications I want to build my career on, so I’ll be using Flink more often. I hope to write some useful beginner friendly tutorials on Flink in the coming months as well!

Notes

#Apache Flink #Streaming

One Week with Logitech Pop

Earlier this week I ordered a Logitech Pop. I had a small use case where the Pop would work great. We have a small lamp with a Philips Hue bulb that we use when our daughter wakes up in the night. Typically, we use HomeKit to turn the light on and off, but HomeKit is so spotty as to be impossible to use, and using the Hue app is cumbersome in a frantic run to get a crying baby. I learned about the Pop and figured I could have an extra button next to the light switch that we could just tap on our way in and out. Easy enough.

The Pop is a bit expensive for what it is, but that’s true of pretty much every device in this space. I got my starter kit for a bit cheaper than retail by getting it on eBay. Setup is a real breeze. You just plug the hub into an outlet and open the app. From there, adding a button is as simple as clicking it and then naming it.

Once I actually had buttons set up, I was able to scan my wifi network and add devices on the network. I had all my Hue lights pop up and some other out-of-the-box connections work. I set up my daughter’s light to turn on and off with a click, tried it and it worked great. There’s nothing magical about this, but the first time I needed it there was a tangible difference between using it and using HomeKit.

From here, I wanted to have some more fun. Since the starter pack I bought came with two buttons, I decided to do something more interesting with the second button. Each Pop button (in basic mode) supports three gestures: a tap, a double tap, and a long press. Each of these can trigger a different action. I set up the tap to play All Blues by Miles Davis on the Sonos. I set up the double tap to turn my office lamp on and off and the long press to turn my security system on and off through IFTTT.

These are all a little gimmicky, but I find them fun and actually found myself using the buttons for all sorts of stuff and thinking of creative ways I could use the buttons. I’m pretty impressed with Logitech in this connected home space. I own a Harmony Elite, another impressive connected home device I should write about in the future.

Overall, it’s pretty hard to say this is a “must have” product, but it’s a fantastic way to extend technology you already own. I’ll get a few more buttons around the house for the Hue bulbs in each of my lamps over the coming months.

#iot #connected home

Amazon Athena Presentation

#AWS #Athena

Creating A Spark Server For Every Job With Livy

One of the frustrations for most people who are new to Spark is figuring out how exactly to run Spark. Before running your first Spark job you’re likely to hear about YARN or Mesos, and it might seem like running a Spark job is a world unto itself. This barrier to entry makes it harder for beginners to imagine what’s possible with Spark. Livy provides a RESTful interface to Apache Spark. It hides some of the details of Spark’s execution mechanics and lets developers submit programs to a Spark cluster and get results back. This post is a summary of my notes on using Livy to send jobs queued from web hooks to a Spark cluster. My aim is to talk about the benefits and drawbacks of this setup and to provide a small tutorial on Livy. If you’re interested in using Livy, the documentation is excellent.

The Basics

When running a Spark Job, you typically submit jobs via a Spark Shell. This can be in Python or Scala, but running a Spark Job looks something like this:
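As a rough illustration, submitting a packaged job usually comes down to a `spark-submit` invocation. The sketch below only assembles that command line in Python and does not run it; the class name, jar path and master value are placeholders, not anything from my setup.

```python
from typing import List


def spark_submit_command(main_class: str, jar: str, master: str = "local[*]") -> List[str]:
    """Assemble the argument list for a typical spark-submit call."""
    return [
        "spark-submit",
        "--class", main_class,  # entry point inside the jar
        "--master", master,     # where the job runs: local[*], yarn, a mesos:// URL, ...
        jar,
    ]


# Placeholder values for illustration only.
cmd = spark_submit_command("com.example.MyJob", "target/my-job.jar", master="yarn")
print(" ".join(cmd))
```

Passing the list to something like `subprocess.run` would launch the job on a machine that has Spark installed.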

There are some exceptions, notably if you’re working in a Notebook context like Jupyter Notebook, Zeppelin or Beaker Notebooks. In these cases, the notebooks are bound to a Spark Shell so you can run jobs dynamically instead of submitting Jar files or Python files.

In either context, you need to have a Spark Context (either create one in the notebook or within the file submitted to the shell) and code is isolated to your environment. This is fine for most workloads and for development, but it limits the kinds of programs you can write in Spark and the number of services that can communicate with Spark.

For example, if we built a regression model in Spark and wanted to run live data through it, it’s not immediately obvious how we’d do that, or over what protocol. It all seems too boxed in and tightly coupled with the machine it’s running on. That’s where Livy is helpful. Livy exposes a REST endpoint to Spark, allowing you to submit jobs from anywhere*. How it accomplishes this is a bit tricky and I’ll walk through the mechanics of it.

Mechanics

Spark doesn’t have a RESTful protocol to its engine; however, with a little work you can create a REST API server that translates Python, Scala or R code to Spark Job lingo and returns the results. This is essentially what Livy does (forgive the oversimplification). It allows us, for example, to write a DSL that submits Spark Jobs over REST and gets data back. (There are other ways to go about this, like MLeap, that I’ll cover in a future post.)

The power of doing this should be immediately obvious, but the drawbacks might be as well. I worked through two examples to explore the API behind Livy and then to try and actually use REST to do something interesting.

A RESTful Endpoint Example

My first example is just an endpoint that squares the integer it receives on a POST request. For example, POST /2 would reply with 2^2 = 4. I chose this strategically because of the complexity in putting one of these endpoints together. My example is in Scala, but you could do the same thing in PySpark or SparkR. Here is the endpoint. I commented in the code about each part and what it’s doing. I find that much easier than posting the code and explaining it after the fact:
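For a sense of what travels over the wire, here is a small Python sketch of the two Livy calls such an endpoint relies on: creating an interactive session and posting a statement to it. It is not the Scala endpoint described above. The host and port are an assumption (8998 is Livy's default), the helper names are made up, and the requests are only built, never sent.

```python
import json
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # assumption: a local Livy server on its default port
HEADERS = {"Content-Type": "application/json"}


def create_session_request(kind: str = "spark") -> Request:
    """POST /sessions starts an interactive session ("spark" means a Scala interpreter)."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(f"{LIVY_URL}/sessions", data=body, headers=HEADERS, method="POST")


def square_statement_request(session_id: int, n: int) -> Request:
    """POST /sessions/{id}/statements ships a string of Scala code to run in that session."""
    code = f"val n = {n}; println(n * n)"
    body = json.dumps({"code": code}).encode("utf-8")
    return Request(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=body,
        headers=HEADERS,
        method="POST",
    )


req = square_statement_request(session_id=0, n=2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Handing either request to `urllib.request.urlopen` would actually send it; the result of a statement is then fetched with a GET on the statement's own URL once Livy reports it as finished.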

Predict My Weight

In order to do something a bit more applicable to an actual workload, I created a silly model. The model predicts what my weight will be one week from today, based on how many calories I ate plus how many calories I burned today. It’s wildly inaccurate, but good for the purposes of this blog. I enter my weight and calories burned in a Google Sheet, and I use Microsoft Flow to trigger an HTTP event that fires to my Livy server and calculates my weight.

Here is a rough sketch of what will happen.

This will work a little differently from the example I shared above. Instead of writing a Scala HTTP client, I can just make a POST request from the Microsoft Flow HTTP client. I won’t walk through how to do that, as the above example already illustrates it and the UI is intuitive. Essentially, I’ll add my weight and the calories I burned today into a spreadsheet, and that’ll trigger an event to predict my weight and add it as a new column in a separate spreadsheet. Here is the function:

I entered 199 as current weight and 1500 as calories burned today (both fake numbers) and it predicted my weight would be 188.99 a week from now.

Summary

Livy provides an interesting way to use Spark as a RESTful service. In my opinion, this is not an ideal way to interact with Spark, however. There is just a tad too much overhead of language interoperability to make it worth it. For starters, sending strings of Scala code over the wire doesn’t inspire a lot of confidence. It’s also not immediately clear what advantage executing pre-defined JAR files over REST has. On the positive side, I expected something much slower than what I got out of Livy. For a use case as contrived as the one I made up for this blog it’s pretty solid, but the model in general might be hard to scale and reason about.

Notes

Microsoft Flow is very cool. I know it seems like an IFTTT clone, but with the ability to send HTTP requests and web hooks it’s much more customizable. Also the free tier is much more generous than something like Zapier.

This stuff takes forever to configure and use the first time.

There are some alternative projects that aim to accomplish the same task as Livy, most notably spark-jobserver, which I think is a little bit easier to use but I didn’t find out about until long after I started experimenting with Livy. If anyone would be interested in a tutorial about that feel free to let me know.

#Scala #Apache Spark

Announcement: I’m Writing A Book on Apache Spark

For the last year or so I’ve been blogging regularly about the Apache Spark platform. During that time, Spark has grown from something that people in data science and engineering have used to something that is almost ubiquitous. I’ve enjoyed working with the platform professionally, and even on a number of personal projects. Over this year, I’ve spent a lot of time trying to get SBT configurations to work correctly, converting JSON to DataSets, and painstakingly trying to get missing data imputations to work sensibly. This time has taught me that for as popular as Spark is, there is a pretty big gap in resources for it. It’s not that the docs are bad (they are actually excellent), it’s not that it’s a super hard platform to learn, it’s just because it’s programming. Programming is tough, digging through a huge Scaladoc is tough, but it’s what it takes to get decently proficient at Spark. This isn’t necessarily unique to Spark, but the pain is pain all the same.

Spark has enabled me to think about computing and data in an entirely different way. It has taught me to be much more ambitious about data, and I think lots of people can benefit from that. I’ve spent so much time writing and debugging Spark, that I feel like I have a lot to share. My blog is evidence of this, as many people have reached out to thank me for stuff I’ve written on Spark. What I wrote helped them think about a problem in a different way, or helped them appreciate an overlooked aspect of Spark more. I feel like I could continue to write blogs and have a good impact or I can write a more lasting resource, in the form of a book.

My motivation is not to “make a killing” off the book or to become a “thought leader.” I hope for it to be published and to provide value, but I care more about the experience of writing it than making tons of money. This is partly why I’m not interested in doing a course on Apache Spark. I don’t want the responsibility for on-going membership fees or keeping content up to date. I want to pass on principles and focus on platform level tips and not get bogged down in API details like a course would force me to. I also want to spend more time working on this project with my wife, Bethany. She provides all the illustrations for my blog and does a tremendous job and I believe together we can put together a lasting resource for those new to Spark.

As for timing, we’re working on a writing calendar this week and I’ll post updates on my newsletter for anyone interested in following the progress. The book’s working title is The Apache Spark Field Guide; I feel like “field guide” perfectly describes what I’m trying to do. It will not be your typical technical book with a lot of code samples. I’ll spend a lot more time walking through the nuances of Spark execution and helpful tips for using Spark. There are already great books out on taking Spark from nowhere to somewhere, but there isn’t a good place to quickly explain concepts in a way that’s not fact-based recitation.

Until next time.

#Apache Spark #Announcements

Lutron Caseta Review

I’ve been into this home automation thing for some time now. Any device on the market, I’ve most likely tried it already and there is an equally good chance that there is one functioning in my house. Most of the home automation products available for mass market are still pretty user-hostile and ever so expensive. The one area of home automation tech that is easy to use and pretty pleasing is light switches. They work great without the connected technology, but are enhanced by it. I’ve tried a number of other switches that connect over Z-wave, but the Caseta line always has the best reviews and is, in my opinion, the most aesthetically pleasing and the easiest to use. After getting a few for a good price on eBay, I decided to give this whole Lutron thing a try.

Installation is super simple. As I mentioned before, I have installed other dimmer switches and none of them had instructions as clearly laid out as the Casetas. Once all wired up, I plugged it in and it worked. There are additional settings you can configure, for instance if you wanted the light to turn on but not be full-brightness. I was excited to connect the switch to my SmartThings hub, but in the process I realized I couldn’t connect Caseta over regular Z-wave. I needed to use the Lutron Hub, which would run me some more money (annoying). I was in a spending mood so I ran to Best Buy, got the hub and an additional switch and was off to the automation races.

I’ve connected the switches to my security system, to HomeKit and to Amazon Echo. It is much faster than the other Z-wave switches I have and just more pleasant to set up across the devices. I plan on outfitting all of my switches with these as time goes on. They aren’t exactly cheap at $45 a pop, but any of the z-wave switches run about the same.

#Home Automation #Lutron #Siri #Amazon Echo

A Gentle Intro To Graph Analytics With GraphFrames

Anyone steeped in the doctrine of relational databases will find that trying to use a graph database like Neo4j is painful and not at all intuitive. This is not your fault, or Neo4j’s fault; it’s just that graph traversal is nothing like SQL. When I say nothing, I literally mean nothing. You think about them in two completely different ways, and the ergonomics of graph traversals are inherently harder to get used to. This issue is compounded when considering doing a tutorial on a graph database, and compounded further when using a graph analytics library like GraphX. When you are already forced to work with RDDs (not exactly beginner friendly), adding the paradigm of graphs on top is too much for the uninitiated. It would be much easier to comprehend if we could go from a table-like structure to a graph and do the same queries for comparison.

GraphFrames allow us to do exactly this. It’s an API for doing Graph Analytics on Spark DataFrames. This way, we can try to recreate SQL queries in Graphs and have a better grasp of the graph concepts. Not having to load the data and create the relationships makes a lot of difference in a pedagogical context (At least I’ve found).

A Simple Primer

To set this all up, we’re going to use the default example data found in the GraphFrames package with a few edits. It’s two tables that look like this:

In the second DataFrame, we have “src” and “dst” and “relationship” columns. This is just syntactic, and allows us to establish a vertex-edge relationship. You could make a pretty complex web of DataFrames that are connected to one another, but in order to maintain simplicity, I’ll just keep it as this simpler “friend/follow” relationship. It gives us enough data to go through the rest of this exercise without confusing us.
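To show the shape of the two tables, here is a tiny stand-in written as plain Python lists of records rather than Spark DataFrames. The rows are made up for illustration and are not the exact data used in this post; only the column names `id`, `src`, `dst` and `relationship` follow the GraphFrames convention.

```python
from typing import Dict, List

# Vertex table: one row per person, keyed by "id".
vertices = [
    {"id": "a", "name": "Alice", "age": 34},
    {"id": "b", "name": "Bob", "age": 36},
    {"id": "c", "name": "Charlie", "age": 30},
]

# Edge table: "src" and "dst" refer to vertex ids, "relationship" labels the edge.
edges = [
    {"src": "a", "dst": "b", "relationship": "friend"},
    {"src": "b", "dst": "c", "relationship": "follow"},
    {"src": "c", "dst": "b", "relationship": "follow"},
]


def neighbors_by_relationship(edge_rows: List[dict], relationship: str) -> Dict[str, List[str]]:
    """Turn the edge table into an adjacency map for one relationship type."""
    adjacency: Dict[str, List[str]] = {}
    for edge in edge_rows:
        if edge["relationship"] == relationship:
            adjacency.setdefault(edge["src"], []).append(edge["dst"])
    return adjacency


print(neighbors_by_relationship(edges, "follow"))
```

The point of the layout is that the graph is nothing more than two ordinary tables joined on the id columns.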

A Few Algorithms

We can start with PageRank, an algorithm developed by Larry Page, the CEO of Alphabet Inc. The basic idea is to score each vertex in a graph by how the other vertices reference it. In the ancient web context, it would help us identify the authority on a topic. If every web page about Jay-Z linked to Spotify.com then we’d know Spotify is an authority on Jay-Z. For the data we have, we’ll look at the edges, and the result is more a measure of connectedness:

You can look through the mathematical specification for a better understanding of what’s exactly going on, but essentially we built a DataFrame that described how each person was related to another. In a relational context, we would calculate the number of connections with a handful of queries, but as relationships get more numerous and complicated it becomes harder to do.
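For readers who want to see the mechanics, here is a minimal power-iteration sketch of PageRank in plain Python. It is not the GraphFrames implementation, and GraphFrames normalizes its scores differently; the damping factor of 0.85, the iteration count and the three-vertex graph are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def pagerank(nodes: List[str], edges: List[Tuple[str, str]],
             damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Repeatedly let every vertex pass its rank along its outgoing edges."""
    n = len(nodes)
    out_links: Dict[str, List[str]] = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every vertex keeps a small baseline share regardless of its links.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A vertex with no outgoing edges spreads its rank over all vertices.
            targets = out_links[v] if out_links[v] else nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


ranks = pagerank(["a", "b", "c"], [("a", "b"), ("c", "b"), ("b", "a")])
for vertex in sorted(ranks, key=ranks.get, reverse=True):
    print(vertex, round(ranks[vertex], 3))
```

In this toy graph the vertex that two others point at ends up with the highest score, which is the "authority" intuition described above.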

In a graph, there is a layer of abstraction that makes it easier to figure out this kind of information. Consider the following. If you were tasked with figuring out which of your friends knew each other, it would be a gargantuan task to call each and go through the list. It would be much easier if you could have each friend send their friends a message and for you to sort through the connections after. In a very oversimplified way, many of the algorithms in GraphFrames can be implemented with this message-passing primitive.

For a more complicated example let’s try the Strongly Connected Components algorithm. You can read through the math if you like, but in layman’s terms it groups vertices into sets in which every vertex can reach every other vertex by following edges. By definition the connection doesn’t have to be direct; any number of hops counts, as long as there is a path in both directions. With that, we can use the GraphFrames implementation:
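As a point of reference for what the algorithm computes, here is a small plain-Python sketch using Kosaraju's two-pass approach. It is not what GraphFrames runs internally, the recursion makes it suitable only for small graphs, and the four-vertex example is made up.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def strongly_connected_components(nodes: List[str],
                                  edges: List[Tuple[str, str]]) -> List[Set[str]]:
    """Group vertices so that every member of a group can reach every other member."""
    forward: Dict[str, List[str]] = defaultdict(list)
    backward: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)

    # Pass 1: depth-first search on the original graph, recording finish order.
    visited: Set[str] = set()
    order: List[str] = []

    def visit(v: str) -> None:
        visited.add(v)
        for w in forward[v]:
            if w not in visited:
                visit(w)
        order.append(v)

    for v in nodes:
        if v not in visited:
            visit(v)

    # Pass 2: walk the reversed graph, starting from the vertices that finished last.
    assigned: Set[str] = set()
    components: List[Set[str]] = []
    for v in reversed(order):
        if v in assigned:
            continue
        component: Set[str] = set()
        stack = [v]
        assigned.add(v)
        while stack:
            u = stack.pop()
            component.add(u)
            for w in backward[u]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        components.append(component)
    return components


# a -> b -> c -> a form a cycle; d is reachable from c but cannot get back.
print(strongly_connected_components(
    ["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]))
```

The cycle comes back as one component and the dead-end vertex as its own, which is the "path in both directions" requirement at work.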

Again, figuring out this kind of information via SQL would be very hard, largely because SQL has no semantics for connectedness; it’s great for collecting and summarizing information instead. Most of us don’t have an immediate need for graphs and what they have to offer. However, a lot can be uncovered if you can store your data in this way.

Nice Thing(s)

One of the kindest aspects of a library like GraphFrames is that edges and vertices are DataFrames. This is valuable because we already have a whole set of APIs for how to deal with these things.

A second thing I like about GraphFrames is the algorithm implementations. There aren’t as many as in GraphX, but I feel like they are easier to use because they deal with DataFrames instead of RDDs. Many long-time Spark users are very familiar with RDDs and comfortable using them. I have been using Spark for a long time too, but have always found DataFrames / DataSets to be more manageable.

Finally, querying GraphFrames is pretty nice! You have facilities to do regular search, breadth-first search or structured queries. Breadth-first search is probably my favorite of the bunch:
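For anyone who hasn't met breadth-first search before, here is a minimal plain-Python sketch that returns a shortest path between two vertices. It is not the GraphFrames query API; the function name and the sample edges are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def bfs_path(edges: List[Tuple[str, str]], start: str, goal: str) -> Optional[List[str]]:
    """Explore the graph level by level and return the first path that reaches `goal`."""
    adjacency: Dict[str, List[str]] = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    queue = deque([[start]])  # each queue entry is the path taken so far
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal is not reachable from start


sample_edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]
print(bfs_path(sample_edges, "a", "e"))
print(bfs_path(sample_edges, "e", "a"))
```

Because vertices are visited in order of distance from the start, the first path found is also one with the fewest hops.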

Summary

I can’t say enough about how GraphFrames have enabled me to better understand graphs and graph analytics. It’s the first time I was able to successfully go from a column/row format to a graph and to compare the two. That being said, GraphFrames is very immature, as evidenced by its release version and its lack of support for a number of features in GraphX or Apache Giraph. Its immaturity is a blessing and no reflection of the quality and thought put into the API.

The two major hurdles to doing graph analytics are (1) the query language and (2) the paradigm. By using GraphFrames you practically eliminate (1), and mostly eliminate (2). Since first using GraphFrames, I went back and tried Neo4j, and both of these hurdles were non-factors. Doing some more complex things was still a little weird, but I didn’t get stuck on “Hello, World.” If you’re struggling with Graph Analytics, give GraphFrames a try. It’s well worth the few hours you’ll spend learning it.

#Scala #Apache Spark #Graph Analytics #GraphX

Which Hadoop File Format Should I Use?

The past few weeks I’ve been testing Amazon Athena as an alternative to standing up Hadoop and Spark for ad hoc analytical queries. During that research, I’ve been looking closely at file formats for the style of data stored in S3 for Athena. I have typically been happy with Apache Parquet as my go-to, because of its popularity and guarantees, but some research pointed me to Apache ORC and its advantages in this context. In researching ORC, I ran into Apache Carbondata and then I was reminded of my early usage of Apache Avro. All of this helped me realize how complex this world can be when you’re managing your own data. When you have to be opinionated, it requires a new set of knowledge. After researching and experimenting with these four file formats I put this post together as a set of heuristics to use when choosing one. I hope you find it helpful.

What The Hell Is Columnar Storage?

I’m assuming most people already know what columnar storage is and what its advantages are. If so you can skip this part.

On its surface, you can think of columnar storage as the simple idea that data stored on disk is organized by column rather than by row. Doing this has some interesting implications: namely, it reduces disk I/O and allows for much better compression schemes. These are both great features for large data. Terabytes of storage are expensive, though much cheaper than compute and memory; if we can reduce both, we will be much better off. Data will continue to grow and we can’t do anything about that, but we can get smarter about how we store data.

The file formats listed above (with the exception of Avro) are all columnar, so this is a necessary primer. A side effect of storing data this way is much quicker analytical queries. An analytical query is something like taking all users and calculating the average age. The best way to take advantage of this setup is to run queries that require only a few columns.
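The average-age example can be made concrete with a toy sketch in plain Python. It only models the two layouts in memory and says nothing about how any real file format lays out bytes on disk; the three sample users are made up.

```python
from typing import Dict, List

# Row layout: each record keeps all of its fields together.
rows = [
    {"user_id": 1, "name": "Ann", "age": 31},
    {"user_id": 2, "name": "Bo", "age": 45},
    {"user_id": 3, "name": "Cy", "age": 26},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Column layout: one contiguous list per field."""
    columns: Dict[str, list] = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def average_age_row_store(records: List[dict]) -> float:
    # Every full record has to be visited just to pull out one field.
    return sum(record["age"] for record in records) / len(records)


def average_age_column_store(columns: Dict[str, list]) -> float:
    # Only the "age" column is read; "user_id" and "name" are never touched.
    ages = columns["age"]
    return sum(ages) / len(ages)


columns = to_columns(rows)
print(average_age_row_store(rows), average_age_column_store(columns))
```

Both functions return the same answer, but the columnar version reads one list instead of every record. A column of similar values is also what makes the better compression possible.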

With these two advantages come some disadvantages. Relative to a row store, a columnar format handles a filter query like this one poorly: Select * from USERS where user_id in (Some_arbitrary_list_of_users);

It’s just not designed to do these kinds of queries efficiently, since the IDs aren’t stored with the rest of the data and you’re returning all columns. Also, appending data comes at a cost, and most formats are append only or write once. This means that doing an operation like an Update or Delete is expensive or in some cases impossible. It makes it difficult for these kinds of stores to handle operations like stream ingestion as well.

Each of the file formats I worked with for this post has trade offs and I try to cover them in depth below.

Apache Avro

Avro is perhaps the simplest of the four formats because it is not columnar and it’s pretty similar to what most should be accustomed to when dealing with databases like MySQL. Avro’s main goal is to compress data and to do it without losing schema flexibility. For example, you might want to use Hadoop as a document store and keep all of your data as JSON in Avro files for compression, you can do that in Avro. You might have some complex schema that you like to work with and all of it can work with Avro as well. The flexibility of Avro allows you to dream up any number of schemas and still manage to get decent compression.

Another positive (read as might be positive if you like this) is Avro files are dynamically typed. This allows for the schema flexibility and for the RPC support. All together, Avro is a great format for data compression and most compression techniques in Spark will default to this one.

With the positives aside, Avro does come at the expense of some other things. For example, you aren’t going to get the best possible compression when compared to a columnar format. Further, queries will need to traverse the data row by row. These trade-offs might not be a big deal, because the schema flexibility can be worth taking more space on disk, and you may not have enough data for it to matter very much.

Apache Parquet

Parquet has a different set of aims than Avro. Instead of allowing for maximum schema flexibility, it seeks to optimize the types of schemas you can use to increase query speeds and reduce disk I/O. Parquet attempts to overcome some of the weaknesses of traditional column stores by allowing nested types in columns. So you could technically have a column that is an array, or a column that’s actually several columns. There is a great talk from Spark Summit about doing just this and I’ve found it helpful in my work.

Parquet has a lot of low level optimizations and a number of details about how it is stored on disk that you can find in the documentation. Parquet is perhaps the most common file format you’ll see in a lot of Spark related projects and it’s what I tend to use as well.

ORC

ORC shares the columnar format of Parquet but has some differences. One of the main differences is that it’s strongly typed. ORC also supports complex types like lists and maps allowing for nested data types. As far as compression goes, ORC is said to compress data even more efficiently than Parquet, however this is contingent on how your data is structured.

On top of the features supported in Parquet, ORC also supports Indexes, and ACID transaction guarantees. The last point is very important when considering the number of applications that can benefit from ACID. It’s a bit complex how the ACID transactions work in an append only data format, but you can find out all the details in the documentation if interested.

If you’re using Presto or Athena, ORC is the preferred format. With recent changes to the Presto engine, many advantages come from using ORC. Additionally, ORC is one of the few columnar formats that can handle streaming data. ORC also supports caching on the client side, which can be extremely valuable.

I’ve really grown to love ORC for any kind of OLAP workload.

Carbondata

Carbondata is the new kid on the block. It is an incubating apache project and based on the Spark Summit talk on it, it promises the efficiency of querying data from a columnar format with ability to also handle random access queries. Carbondata does not have ACID support but it has a host of other features. I wont list them all, but the most important (to me) are Update and Delete support, bucketing and index based optimizations (several).

Update and Delete support are important for many workflows. Append only or write once requires re-writing or overwriting an entire file and the cognitive load can be a bit much coming from a RMBDS world.

Bucketing is an optimization that allows commonly joined columns to be stored in buckets. The underlying implementation details were kind of hard to follow, but I can say that joining two Carbondata files was much more performant than with any of the other formats I tried.
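To make the general idea concrete, here is a minimal Python sketch of hash bucketing for joins. It is not Carbondata's implementation (which I couldn't fully follow); the bucket count and the helper names bucket_of, bucketize and bucketed_join are assumptions for illustration. The point is that rows with the same join key always hash to the same bucket, so a join only has to compare bucket N on one side with bucket N on the other.

```python
from collections import defaultdict
from zlib import crc32

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Deterministically map a join key to a bucket id."""
    return crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, key_field, num_buckets=NUM_BUCKETS):
    """Group rows into buckets by the hash of their join key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_field], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, key_field):
    """Inner-join two bucketed datasets, pairing only matching bucket ids."""
    joined = []
    for bucket_id, left_rows in left.items():
        # Equal keys always share a bucket, so every other bucket is skipped.
        right_by_key = defaultdict(list)
        for right_row in right.get(bucket_id, []):
            right_by_key[right_row[key_field]].append(right_row)
        for left_row in left_rows:
            for right_row in right_by_key.get(left_row[key_field], []):
                joined.append({**left_row, **right_row})
    return joined


users = [{"user_id": "u1", "name": "Ann"}, {"user_id": "u2", "name": "Bo"}]
orders = [
    {"user_id": "u1", "total": 30},
    {"user_id": "u1", "total": 12},
    {"user_id": "u3", "total": 7},
]

result = bucketed_join(
    bucketize(users, "user_id"), bucketize(orders, "user_id"), "user_id"
)
```

The join returns Ann's two orders and nothing else. In a distributed engine the payoff is that matching buckets can be joined locally without shuffling the whole dataset, provided both sides were bucketed the same way when they were written.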

Carbondata has multi-layer indexing, meaning the file is indexed, the partitions are indexed, and even the min-max values of columns are indexed. It makes for pretty speedy queries for analytical workloads. Queries requesting averages, and even some simple lookups, were much faster using Carbondata than other formats.
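The min-max part is the easiest layer to picture. Below is a small Python sketch of min-max block skipping in general, assuming a column split into fixed-size blocks with their statistics stored next to them; Block, build_blocks and scan_equal are hypothetical names, and this is not Carbondata's actual index layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A chunk of one column plus the min/max statistics kept alongside it."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(column, block_size):
    """Split a column into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_equal(blocks, target):
    """Return the matching values and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block.min_value or target > block.max_value:
            continue  # ruled out from the statistics alone, never read
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read


prices = list(range(100))  # sorted data keeps each block's range tight
blocks = build_blocks(prices, block_size=10)
matches, blocks_read = scan_equal(blocks, 42)
```

With ten blocks of ten values, the lookup for 42 reads a single block and skips the other nine. The same trick works far less well on unsorted data, where every block's range tends to cover the value you're looking for.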

I found two downsides with Carbondata. The first is that the files don’t compress as small as they do with ORC or Parquet. This is probably because of all the fancy additions and indexes kept in the files. They weren’t as big as Avro files, but depending on the schema they were pretty large.

The second is that Carbondata is new, and its API feels a little new relative to the other formats. Further, Presto and Athena do not support it yet, making it a non-starter for a lot of projects.

Other

Apache Arrow is still a bit too new to really fit into this world, but it’s coming on strong and I imagine it’ll have a lot to contribute to the space.

So Which One Should I Use?

I made the above table to highlight the differences and similarities between the four formats. I linked to performance studies, but didn’t quote any directly in this post. In the world of distributed computing, it’s far too easy to design tests that are advantageous to a particular platform. I recommend trying each as a precursor to any analytical workload you have.

After working with each of them, I’ve come up with some heuristics to help me. If you’re using Presto or Athena, ORC is likely the best format for you. If you’re using Spark, Parquet or Avro makes the most sense. If you’re using Hive or writing your own MapReduce jobs, then ORC is probably the best option for you as well. If you’re using JSON, your only real option is Avro; if you’re willing to build a pipeline to flatten your JSON, you could use any of the other formats.
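Written down as code, those rules of thumb look something like the Python sketch below. The function name suggest_formats and its parameters are made up for illustration, and the mapping only encodes my heuristics above, not any benchmark.

```python
def suggest_formats(engine, nested_json=False):
    """Return the formats the heuristics above point to, best guess first."""
    if nested_json:
        # Unflattened JSON: Avro is the only real option in these heuristics.
        return ["Avro"]
    engine = engine.strip().lower()
    if engine in {"presto", "athena", "hive", "mapreduce"}:
        return ["ORC"]
    if engine == "spark":
        return ["Parquet", "Avro"]
    return []  # no heuristic for this engine; try each format on your workload


athena_choice = suggest_formats("Athena")
spark_choice = suggest_formats("spark")
json_choice = suggest_formats("spark", nested_json=True)
```

An empty list for an unknown engine is deliberate: outside these cases, testing each format against your own workload is still the only honest answer.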

Overall, each format provides some great optimizations over storing a text file or a CSV, but they put all the maintenance on the shoulders of data operations. Until next time!

#Apache Spark #Scala #Amazon Athena #Amazon S3

Notes On Halt and Catch Fire Season 3

Minor spoilers

This past week I finally caught up on Halt and Catch Fire Season 3. I’ve been a fan of the show’s first two seasons, but I was a bit worried about the direction of the show after season 2. Surprisingly, one of the things I really like about the show is that I don’t know anyone else who watches it. It doesn’t seem too popular, as evidenced by the numbers, and it’s not talked about in any of the circles I run in (both TV and technical), but it seems to consistently get quality reviews. After watching Season 3, what was mere curiosity has turned into real love for the show. The writers and director have managed to elevate the show and I’m thoroughly excited for season 4!

Things I Loved

I loved the emotion displayed in the show this season. Both Donna and Cameron had a great deal of emotion on display, and not in a cliche “girls like to cry” kind of way. With the growth of and investor interest in their company, the test of their partnership was on full display, and Donna failed in the worst way possible. With Donna sacrificing her friendship with Cameron for financial gain, there are a lot of tears and screaming, and I liked it.

Joe was not insufferable for once. Joe has been a horribly insufferable character with very few redeeming qualities until this season. He finally endured some tragedy that made him a better person rather than a worse one.

Joe is a product visionary? Joe has always played this kind of role, but with technological advances into the modern computing age he is able to play it much more. I watched Steve Jobs, and Joe seemed kind of like a poor man’s Steve Jobs with his beard and pontification.

All the technology. One of the best features of Halt and Catch Fire is the technology it references. From the machines people are using to the internet protocols, it’s just a fascinating listen. The last episode, where they talk about HTML and HTTP, is a treat.

Things I Didn’t Like

I felt like the last 2-3 episodes of the show were a bit rushed. The writers needed to push the show forward for the last season and they did it in a way that really felt out of place. The pace of the season was very slow and methodical, and then at the end of episode 8 we flash forward 4 years in time, and major changes have happened in the characters’ lives. Fortunately, the next 2 episodes were incredible and cleaned things up a bit.

Gordon’s role. Gordon really took a back seat this season in favor of more screen time for Joe, which is a bummer because he’s my favorite character on the show. It looks like he will be much more involved in season 4 which should be a great change of pace.

Season 4

I can’t wait for season 4 and how this fun story will finally come to an end. With the way things are set up, it’ll be a ton of drama and few laughs to come in the final act. As long as I get more references to 90s computer culture, the beginnings of the World Wide Web and video games, I’ll be a happy camper.

#TV #Reviews #halt and catch fire
