Whether the setting is one of L.A.'s more rarefied neighborhoods or one of its most destitute, the rules are the same: Anybody can play, but it takes something else to stay
Triple double around the corner!
@unbareablelightness
West Virginia has the highest overdose death rate in the country. Locals are fighting to save their neighbors, and their towns, from destruction.
These pieces are always hard for me. Addiction sucks, but the energy being put into the opioid epidemic, while the crack epidemic was villainized, will always bug me.
Internet History Podcast: 143. Neil Hunt on the History of Netflix and Netflix Streaming
Completed in 2017 in Brasilia, Brazil. Images by Haruo Mikami. The house is located in a residential area far away from the city center in Brasília. The weather in this part of Brazil is characterized by a...
Song Exploder: Michael Kiwanuka - Black Man in a White World
Internet History Podcast: 142. Andy Rachleff @arachleff co-founder of Benchmark and Wealthfront
Software Engineering Daily: Microservices Transition with Cassandra Shum
Data Pipelines with Apache Spark and Couchbase
Beginning Apache Flink
I've been committed to the Apache Spark ecosystem for as long as I've been doing data engineering. I've seen its adoption, been a fan of using the APIs and have even decided to write a book about Spark. In all of this experience, there has been a nagging sense that I need to try a "true" streaming framework like Twitter Heron, Apache Flink or Apache Apex. I've read articles and watched videos about the difference between a true streaming framework and a micro-batch framework and been mostly unconvinced that it would make a difference for me. Then I started a project where Flink ended up being the go-to streaming framework, and I learned a great deal about Flink. This post is a combination of notes and things I noticed while using Flink; hopefully anyone starting with Flink will find it helpful.
Getting Started
Like Spark, Flink is fairly overwhelming to get started with. This is mostly because of installation and run-time configuration. After a bunch of searching around, I was able to put together a decent starter SBT config for Flink. I used IntelliJ to work with Flink; because of its complex API, the type hinting and other niceties come in pretty handy.
For other ways of using Flink (Docker, cluster, Mesos), I found the docs pretty helpful. This blog is fairly elementary, so I won't cover those in this post.
Ergonomics / API
What I noticed early on using Flink is that it really was designed with streaming in mind first. This is evident in the different APIs for a KeyedDataStream versus a DataStream versus a WindowedDataStream. Once you work with streams for a little while, you'll find yourself wanting an API that is robust and can handle streams in a sensible way. In Spark Streaming, I found myself writing a bunch of boilerplate code aimed at accomplishing a lot of what Flink is designed to do out of the box. One of the places this is most evident is the DataStream and its derivatives. I'll walk through each.
DataStream
A DataStream is a typed stream of data processed one element at a time. It's what all of Flink's streaming architecture is built on. You can use this stream to do whatever you want. For example, I could read in a stream and add two to each value:
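As a rough, framework-free illustration of that one-at-a-time idea, here is a minimal sketch in plain Python. It is not Flink's API: the generator names `source` and `add_two` are hypothetical stand-ins for a stream source and a map step.

```python
from typing import Iterable, Iterator


def source(values: Iterable[int]) -> Iterator[int]:
    """Hypothetical stand-in for a stream source: yields one element at a time."""
    for value in values:
        yield value


def add_two(stream: Iterable[int]) -> Iterator[int]:
    """Map step: add two to each element as it arrives."""
    for value in stream:
        yield value + 2


result = list(add_two(source([1, 2, 3])))
print(result)
```

Because both functions are generators, no element is touched until something downstream asks for it, which is roughly the mental model of a record-at-a-time stream.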
Or, I could read in a stream and write it back out to the file system.
KeyedDataStream
A KeyedDataStream is a stream grouped by and evaluated by a key. Typically, this is the first step in a longer process of creating a window or other type of stream. The API is pretty straightforward: you just add a .keyBy("Name your key") to a DataStream. In and of itself it's not all that useful. It comes in handy with a Window.
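To make "grouped by a key" concrete, here is a small plain-Python sketch of the idea behind keying a stream. The function `key_by` and the sample events are hypothetical; Flink does this partitioning lazily and in a distributed way, which this sketch does not attempt.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def key_by(events: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group (key, value) events by key, preserving arrival order per key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in events:
        groups[key].append(value)
    return dict(groups)


events = [("alice", 1), ("bob", 5), ("alice", 3)]
print(key_by(events))
```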
WindowedDataStream
A WindowedDataStream is what you'd expect: it's a DataStream over some window of time. This is how Spark Streaming works out of the box and how most introductory streaming examples are presented. Flink's API is very robust, allowing you to configure windows based on number of occurrences or on windows of time. This kind of control makes it easy to work with out of the box, versus writing custom code to do it.
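The two kinds of window mentioned above (by number of occurrences and by time) can be sketched in plain Python. This is only an illustration of tumbling windows under simplifying assumptions: `count_windows` and `time_windows` are hypothetical names, timestamps are plain integers, and the final partial count window is emitted, which a real framework may not do by default.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def count_windows(stream: Iterable[int], size: int) -> Iterator[List[int]]:
    """Tumbling count window: emit a window every `size` elements."""
    window: List[int] = []
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield window
            window = []
    if window:  # simplification: also emit the final, partially filled window
        yield window


def time_windows(events: Iterable[Tuple[int, int]], width: int) -> Dict[int, List[int]]:
    """Tumbling time window: bucket (timestamp, value) pairs by window start."""
    windows: Dict[int, List[int]] = {}
    for timestamp, value in events:
        start = timestamp - (timestamp % width)
        windows.setdefault(start, []).append(value)
    return windows


print(list(count_windows([1, 2, 3, 4, 5], size=2)))
print(time_windows([(0, 10), (4, 20), (5, 30), (11, 40)], width=5))
```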
You can also define what are known as TriggerPolicies and EvictionPolicies, in which you can define a specific trigger and end state for that trigger. These require a bit more orchestration but can be effective for a number of streaming situations.
Technically, Spark supports all three of these, but not with the same ease of use. With the new Structured Streaming API most of these mechanics are there now, but it's still not as intuitive as Flink in my opinion.
Another thing that is extremely nice about Flink is its dashboard. Spark has a notoriously bad dashboard, and it's not nearly as nice as this:
It's quite easy to follow the flow of data in Flink and consume and create streaming sources without a ton of effort.
Batch Model
The real strength of Spark is in its batch model, and in comparison Flink is actually pretty nice to use. Doing most of your batch-related transformations is just as nice as it is in Spark. Here are a few examples:
Ecosystem
When your ecosystem bar is Spark, you'll be hard pressed to meet it. Flink supports all of the major streaming technologies like Apache Kafka, AWS Kinesis and Debezium. Being the newer kid on the block, its ecosystem is just not as rich as what Spark has to offer. That being said, with full support of the Scala and Java ecosystem, I have yet to find a situation Flink couldn't handle.
Summary
It's clear that a good number of smart people and companies are investing a lot in Apache Flink, but the question remains: should you? A lot is written and said about streaming, but in my experience, the majority of workloads are still batch. Spark is a much more mature ecosystem for batch workflows and a pretty mature one for streaming as well.
I'm a pretty big believer in streams and their potential. I'm not sure that streams are the wave of the future for dashboards, reporting and applications that are frozen in time, but I think it's already clear that some of the most impressive and useful apps are built on real-time data. These are the kinds of applications I want to build my career on, so I'll be using Flink more often. I hope to write some useful beginner-friendly tutorials on Flink in the coming months as well!
Notes
One Week with Logitech Pop
Earlier this week I ordered a Logitech Pop. I had a small use case where the Pop would work great. We have a small lamp with a Philips Hue bulb that we use when our daughter wakes up in the night. Typically, we use HomeKit to turn the light on and off, but HomeKit is so spotty as to be impossible to use, and using the Hue app is cumbersome in a frantic run to get a crying baby. I learned about the Pop and figured I could have an extra button next to the light switch that we could just tap on our way in and out. Easy enough.
The Pop is a bit expensive for what it is, but that's true of pretty much every device in this space. I got my starter kit for a bit cheaper than retail by getting it on eBay. Setup is a real breeze. You just plug the hub into an outlet and open the app. From there, adding a button is as simple as clicking it and then naming it.
Once I actually had buttons set up, I was able to scan my wifi network and add devices on the network. All my Hue lights popped up, and some other out-of-the-box connections worked. I set up my daughter's light to turn on and off with a click, tried it and it worked great. There's nothing magical about this, but the first time I needed it there was a tangible difference between using it and using HomeKit.
From here, I wanted to have some more fun. Since the starter pack I bought came with two buttons, I decided to do something more interesting with the second button. Each Pop button (in basic mode) has three triggers: a tap, a double tap, and a long press. Each of these can do a different activity. I set up the tap to play All Blues by Miles Davis on the Sonos. I set up the double tap to turn my office lamp on and off and the long press to turn my security system on and off through IFTTT.
These are all a little gimmicky, but I find them fun and actually found myself using the buttons for all sorts of stuff and thinking of creative ways I could use the buttons. I'm pretty impressed with Logitech in this connected home space. I own a Harmony Elite, another impressive connected home device I should write about in the future.
Overall, it's pretty hard to say this is a "must have" product, but it's a fantastic way to extend technology you already own. I'll get a few more buttons around the house for the Hue bulbs in each of my lamps over the coming months.
Amazon Athena Presentation
Creating A Spark Server For Every Job With Livy
One of the frustrations that most people who are new to Spark have is how exactly to run Spark. Before running your first Spark job you're likely to hear about YARN or Mesos, and it might seem like running a Spark job is a world unto itself. This barrier to entry makes it harder for beginners to imagine what's possible with Spark. Livy provides a RESTful interface to Apache Spark. It hides some of the details of Spark's execution mechanics and lets developers submit programs to a Spark cluster and get results back. This post is a summary of my notes from using Livy to send jobs queued from webhooks to a Spark cluster. My aim is to talk about the benefits and drawbacks of this setup, and to give a small tutorial on Livy. If you're interested in using Livy, the documentation is excellent.
The Basics
When running a Spark Job, you typically submit jobs via a Spark Shell. This can be in Python or Scala, but running a Spark Job looks something like this:
There are some exceptions, notably if you're working in a notebook context like Jupyter Notebook, Zeppelin or Beaker Notebooks. In these cases, the notebooks are bound to a Spark Shell so you can run jobs dynamically instead of submitting JAR files or Python files.
In either context, you need to have a Spark Context (either create one in the notebook or within the file submitted to the shell), and code is isolated to your environment. This is fine for most workloads and for development, but it limits the kinds of programs you can write in Spark and the number of services that can communicate with Spark.
For example, if we built a regression model in Spark and wanted to run live data through it, it's not immediately obvious how we'd do that, or over what protocol. It all seems too boxed in and tightly coupled with the machine it's running on. That's where Livy is helpful. Livy exposes a REST endpoint to Spark, allowing you to submit jobs from anywhere*. How it accomplishes this is a bit tricky, and I'll walk through the mechanics of it.
Mechanics
Spark doesn't have a RESTful protocol to its engine; however, with a little work you can create a REST API server that translates Python, Scala or R code to Spark Job lingo and returns the results. This is essentially what Livy does (forgive the oversimplification). This allows us (for example) to write a DSL that submits Spark Jobs over REST and gets data back. (There are other ways to go about this, like MLeap, that I'll cover in a future post.)
The power of doing this should be immediately obvious, but the drawbacks might be as well. I worked through two examples to explore the API behind Livy and then to try and actually use REST to do something interesting.
A RESTful Endpoint Example
My first example is just an endpoint that squares the integer it receives on a POST request. For example, POST /2 would reply with 2^2 = 4. I chose this strategically because of the complexity in putting one of these endpoints together. My example is in Scala, but you could do the same thing in PySpark or SparkR. Here is the endpoint. I commented in the code on each part and what it's doing. I find that much easier than posting the code and explaining it after the fact:
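For a sense of what talking to Livy over REST involves, here is a hedged sketch in Python rather than Scala. It only builds the HTTP requests for Livy's interactive-session endpoints (`POST /sessions` and `POST /sessions/{id}/statements`) without sending them. The host and port, the session id, the helper names, and the Scala snippet used to square the number are all assumptions made for illustration, not the endpoint described in this post.

```python
import json
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # assumption: Livy on its default port, running locally


def create_session_request() -> Request:
    """Build (but do not send) the request that starts an interactive Scala session."""
    body = json.dumps({"kind": "spark"}).encode("utf-8")
    return Request(
        f"{LIVY_URL}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def square_statement_request(session_id: int, n: int) -> Request:
    """Build the request that asks the session to square `n` with a tiny Spark job."""
    # Hypothetical Scala snippet; `sc` is the SparkContext Livy provides in the session.
    code = f"sc.parallelize(Seq({n})).map(x => x * x).collect().head"
    body = json.dumps({"code": code}).encode("utf-8")
    return Request(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = square_statement_request(session_id=0, n=2)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing either request to `urllib.request.urlopen` would actually submit it to a running Livy server; the statement's result then has to be polled for separately, since Livy runs statements asynchronously.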
Predict My Weight
In order to do something a bit more applicable to an actual workload, I created a silly model. The model predicts what my weight will be one week from today, based on how many calories I ate plus how many calories I burned today. It's wildly inaccurate, but good for the purposes of this blog. I enter my weight and calories burned in a Google Sheet, and I use Microsoft Flow to trigger an HTTP event that fires to my Livy server and calculates my weight.
Here is a rough sketch of what will happen.
This will work a little differently from the example I shared above. Instead of writing a Scala HTTP client, I can just make a POST request from the Microsoft Flow HTTP client. I won't walk through how to do that, as the above example already illustrates it and the UI is intuitive. Essentially, I'll add my weight and the calories I burned today into a spreadsheet; that'll trigger an event to predict my weight and add it as a new column in a separate spreadsheet. Here is the function:
I entered 199 as my current weight and 1500 as calories burned today (both fake numbers) and it predicted my weight would be 188.99 a week from now.
Summary
Livy provides an interesting way to use Spark as a RESTful service. In my opinion, this is not an ideal way to interact with Spark, however. There is just a tad too much overhead of language interoperability to make it worth it. For starters, sending strings of Scala code over the wire doesn't inspire a lot of confidence. It's also not immediately clear what advantage executing pre-defined JAR files over REST has. On the positive side, I expected something much slower than what I got out of Livy. For a use case as contrived as the one I made up for this blog it's pretty solid, but the model in general might be hard to scale and reason about.
Notes
Microsoft Flow is very cool. I know it seems like an IFTTT clone, but with the ability to send HTTP requests and webhooks it's much more customizable. Also the free tier is much more generous than something like Zapier.
This stuff takes forever to configure and use the first time.
There are some alternative projects that aim to accomplish the same task as Livy, most notably spark-jobserver, which I think is a little bit easier to use but which I didn't find out about until long after I started experimenting with Livy. If anyone would be interested in a tutorial about that, feel free to let me know.
Announcement: I'm Writing A Book on Apache Spark
For the last year or so I've been blogging regularly about the Apache Spark platform. During that time, Spark has grown from something that people in data science and engineering have used to something that is almost ubiquitous. I've enjoyed working with the platform professionally, and even on a number of personal projects. Over this year, I've spent a lot of time trying to get SBT configurations to work correctly, converting JSON to DataSets, and painstakingly trying to get missing data imputations to work sensibly. This time has taught me that for as popular as Spark is, there is a pretty big gap in resources for it. It's not that the docs are bad (they are actually excellent), and it's not that it's a super hard platform to learn; it's just that it's programming. Programming is tough, digging through a huge Scaladoc is tough, but it's what it takes to get decently proficient at Spark. This isn't necessarily unique to Spark, but pain is pain all the same.
Spark has enabled me to think about computing and data in an entirely different way. It has taught me to be much more ambitious about data, and I think lots of people can benefit from that. I've spent so much time writing and debugging Spark that I feel like I have a lot to share. My blog is evidence of this, as many people have reached out to thank me for stuff I've written on Spark. What I wrote helped them think about a problem in a different way, or helped them appreciate an overlooked aspect of Spark more. I feel like I could continue to write blogs and have a good impact, or I could write a more lasting resource, in the form of a book.
My motivation is not to "make a killing" off the book or to become a "thought leader." I hope for it to be published and to provide value, but I care more about the experience of writing it than making tons of money. This is partly why I'm not interested in doing a course on Apache Spark. I don't want the responsibility for ongoing membership fees or keeping content up to date. I want to pass on principles and focus on platform-level tips and not get bogged down in API details like a course would force me to. I also want to spend more time working on this project with my wife, Bethany. She provides all the illustrations for my blog and does a tremendous job, and I believe together we can put together a lasting resource for those new to Spark.
As for timing, we're working on a writing calendar this week and I'll post updates on my newsletter for anyone interested in following the progress. The book's working title is The Apache Spark Field Guide. I feel like "field guide" perfectly describes what I'm trying to do. It will not be your typical technical book with a lot of code samples; I'll spend a lot more time walking through the nuances of Spark execution and helpful tips for using Spark. There are already great books out on taking Spark from nowhere to somewhere, but there isn't a good place to quickly explain concepts in a way that's not fact-based recitation.
Until next time.
Lutron Caseta Review
I've been into this home automation thing for some time now. Any device on the market, I've most likely tried it already, and there is an equally good chance that there is one functioning in my house. Most of the home automation products available for the mass market are still pretty user-hostile and ever so expensive. The one area of home automation tech that is easy to use and pretty pleasing is light switches. They work great without the connected technology, but are enhanced by it. I've tried a number of other switches that connect over Z-Wave, but the Caseta line always has the best reviews and, in my opinion, is the most aesthetically pleasing and the easiest to use. After getting a few for a good price on eBay, I decided to give this whole Lutron thing a try.
Installation is super simple. As I mentioned before, I have installed other dimmer switches and none of them had instructions as clearly laid out as the Casetas. Once it was all wired up, I plugged it in and it worked. There are additional settings you can configure, for instance if you want the light to turn on but not at full brightness. I was excited to connect the switch to my SmartThings hub, but in the process I realized I couldn't connect Caseta over regular Z-Wave. I needed to use the Lutron hub, which would run me some more money (annoying). I was in a spending mood, so I ran to Best Buy, got the hub and an additional switch, and was off to the automation races.
I've connected the switches to my security system, to HomeKit and to Amazon Echo. It is much faster than the other Z-Wave switches I have and just more pleasant to set up across the devices. I plan on outfitting all of my switches with these as time goes on. They aren't exactly cheap at $45 a pop, but any of the Z-Wave switches run about the same.
A Gentle Intro To Graph Analytics With GraphFrames
Anyone steeped in the doctrine of relational databases will find that trying to use a graph database like Neo4j is painful and not at all intuitive. This is not your fault, or Neo4j's fault; it's just that graph traversal is nothing like SQL. When I say nothing, I literally mean nothing. You think about them in two completely different ways, and the ergonomics of graph traversals are inherently harder to get used to. This issue is compounded when considering doing a tutorial on a graph database. It is compounded further when using a graph analytics library like GraphX. When you are already forced to work with RDDs (not exactly beginner friendly), adding the paradigm of graphs on top is too much for the uninitiated. What would be much easier to comprehend is if we could go from a table-like structure to a graph and do the same queries for comparison.
GraphFrames allows us to do exactly this. It's an API for doing graph analytics on Spark DataFrames. This way, we can try to recreate SQL queries in graphs and get a better grasp of the graph concepts. Not having to load the data and create the relationships makes a lot of difference in a pedagogical context (at least I've found).
A Simple Primer
To set this all up, we're going to use the default example data found in the GraphFrames package with a few edits. It's two tables that look like this:
In the second DataFrame, we have "src", "dst" and "relationship" columns. This is just syntactic, and allows us to establish a vertex-edge relationship. You could make a pretty complex web of DataFrames that are connected to one another, but in order to maintain simplicity, I'll just keep it as this simpler "friend/follow" relationship. It gives us enough data to go through the rest of this exercise without confusing us.
A Few Algorithms
We can start with PageRank, an algorithm developed by Larry Page and Sergey Brin, the co-founders of Google. The basic idea is to score each vertex in a graph by how many other vertices reference it, and by how highly scored those referencing vertices are. In the ancient web context, it would help us identify the authority on a topic. If every web page about Jay-Z linked to Spotify.com, then we'd know Spotify is an authority on Jay-Z. For the data we have, we'll look at the edges, and it's more a measure of connectedness:
You can look through the mathematical specification for a better understanding of what exactly is going on, but essentially we built a DataFrame that described how each person was related to another. In a relational context, we would calculate the number of connections with a handful of queries, but as relationships get more numerous and complicated it becomes harder to do.
In a graph, there is a layer of abstraction that makes it easier to figure out this kind of information. Consider the following. If you were tasked with figuring out which of your friends knew each other, it would be a gargantuan task to call each and go through the list. It would be much easier if you could have each friend send their friends a message and for you to sort through the connections after. In a very oversimplified way, many of the algorithms in GraphFrames can be implemented with this message-passing primitive.
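To show roughly what the ranking computes, here is a simplified PageRank in plain Python using power iteration. It is a sketch and not the GraphFrames implementation: the edge list is made up, the damping factor of 0.85 and the fixed iteration count are conventional assumptions, and vertices without outgoing edges have their rank spread evenly.

```python
from typing import Dict, List, Tuple


def pagerank(edges: List[Tuple[str, str]], damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Simplified PageRank by power iteration over a directed edge list."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links: Dict[str, List[str]] = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        new_ranks = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for src in vertices:
            for dst in out_links[src]:
                # Each vertex shares its rank equally among the vertices it points to.
                new_ranks[dst] += damping * ranks[src] / len(out_links[src])
        ranks = new_ranks
    return ranks


# Hypothetical "follow" edges, written as (src, dst).
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
```

In this toy graph three vertices point at `b`, so it ends up with the highest score, and `a` comes second because the highly ranked `b` points back at it.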
For a more complicated example, let's try the Strongly Connected Components algorithm. You can read through the math if you like, but in layman's terms it groups the vertices of a graph so that every vertex in a group can reach every other vertex in that group by following edges. The connection doesn't have to be direct; any number of hops counts, as long as there is a path in both directions. With that, we can use the GraphFrames implementation:
Again, figuring out this kind of information via SQL would be very hard, largely because SQL doesn't have semantics for figuring out connectedness; rather, it's great for collecting and summarizing information. Most of us don't have an immediate need for graphs and what they have to offer. However, a lot can be uncovered if you can store your data in this way.
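For intuition about what a strongly connected component is, here is a small plain-Python sketch using Kosaraju's two-pass algorithm. This is a swapped-in textbook technique chosen for brevity, not how GraphFrames computes it, and the edge list is hypothetical. It uses recursion, so it is only suitable for small graphs.

```python
from typing import Dict, List, Set, Tuple


def strongly_connected_components(edges: List[Tuple[str, str]]) -> List[Set[str]]:
    """Kosaraju's algorithm: two depth-first passes over a directed graph."""
    graph: Dict[str, List[str]] = {}
    reverse: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
        reverse.setdefault(dst, []).append(src)
        reverse.setdefault(src, [])

    # Pass 1: record vertices in order of depth-first completion.
    visited: Set[str] = set()
    order: List[str] = []

    def visit(vertex: str) -> None:
        visited.add(vertex)
        for neighbour in graph[vertex]:
            if neighbour not in visited:
                visit(neighbour)
        order.append(vertex)

    for vertex in graph:
        if vertex not in visited:
            visit(vertex)

    # Pass 2: walk the reversed graph in reverse completion order.
    assigned: Set[str] = set()
    components: List[Set[str]] = []

    def collect(vertex: str, component: Set[str]) -> None:
        assigned.add(vertex)
        component.add(vertex)
        for neighbour in reverse[vertex]:
            if neighbour not in assigned:
                collect(neighbour, component)

    for vertex in reversed(order):
        if vertex not in assigned:
            component: Set[str] = set()
            collect(vertex, component)
            components.append(component)
    return components


# a -> b -> c -> a forms a cycle; d is reachable from c but cannot get back.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(strongly_connected_components(edges))
```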
Nice Thing(s)
One of the nicest aspects of a library like GraphFrames is that edges and vertices are DataFrames. This is valuable because we already have a whole set of APIs for how to deal with these things.
A second thing I like about GraphFrames is the algorithm implementations. There aren't as many as in GraphX, but I feel like they are easier to use because they deal with DataFrames instead of RDDs. Many long-time Spark users are very familiar with RDDs and comfortable using them. I have been using Spark for a long time too, but have always found DataFrames / DataSets to be more manageable.
Finally, querying GraphFrames is pretty nice! You have facilities to do regular search, breadth-first search or structured queries. Breadth-first search is probably my favorite of the bunch:
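As a reminder of what breadth-first search does, here is a minimal plain-Python sketch. It is not the GraphFrames query API; the function name `bfs_path` and the adjacency map are hypothetical, and it returns one shortest path by number of hops.

```python
from collections import deque
from typing import Dict, List, Optional


def bfs_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Return a shortest path (fewest hops) from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        vertex = path[-1]
        if vertex == goal:
            return path
        for neighbour in graph.get(vertex, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(bfs_path(graph, "a", "e"))
```

Because the queue is processed level by level, the first time the goal is dequeued its path is guaranteed to use the fewest edges.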
Summary
I can't say enough about how GraphFrames has enabled me to better understand graphs and graph analytics. It's the first time I was able to successfully go from a column/row format to a graph and to compare the two. That being said, GraphFrames is very immature, as evidenced by its release version and its lack of support for a number of features in GraphX or Apache Giraph. Its immaturity is a blessing and no reflection of the quality and thought put into the API.
The two major hurdles to doing graph analytics are (1) the query language and (2) the paradigm. By using GraphFrames you practically eliminate (1), and mostly eliminate (2). Since first using GraphFrames, I went back and tried Neo4j and both of these hurdles were a non-factor. Doing some more complex things was still a little weird, but I didn't get stuck on "Hello, World." If you're struggling with graph analytics, give GraphFrames a try. It's well worth the few hours you'll spend learning it.
Which Hadoop File Format Should I Use?
The past few weeks I've been testing Amazon Athena as an alternative to standing up Hadoop and Spark for ad hoc analytical queries. During that research, I've been looking closely at file formats for the style of data stored in S3 for Athena. I have typically been happy with Apache Parquet as my go-to because of its popularity and guarantees, but some research pointed me to Apache ORC and its advantages in this context. In researching ORC, I ran into Apache Carbondata, and then I was reminded of my early usage of Apache Avro. All of this helped me realize how complex this world can be when you're managing your own data. When you have to be opinionated, it requires a new set of knowledge. After researching and experimenting with these four file formats, I put this post together as a set of heuristics to use when choosing one. I hope you find it helpful.
What The Hell Is Columnar Storage?
I'm assuming most people already know what columnar storage is and what its advantages are. If so, you can skip this part.
On its surface, you can think of columnar storage as the simple idea that data stored on disk is organized by column rather than by row. Doing this has some interesting implications: namely, it reduces disk I/O and allows for much better compression schemes. These are both great features for large data. Terabytes of data storage are expensive, though much cheaper than compute and memory; if we can reduce both, we will be much better off. Data will continue to grow and we can't do anything about that, but we can get smarter about how we store data.
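One reason columns compress well is that values of the same kind sit next to each other, so repeats are easy to collapse. As a hedged illustration, here is run-length encoding of a single column in plain Python. The column data is made up, and real formats use a mix of encodings; this sketch shows only the simplest one.

```python
from itertools import groupby
from typing import List, Tuple


def run_length_encode(column: List[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive repeated values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


# Hypothetical `country` column, as it would sit contiguously in a column store.
country_column = ["US", "US", "US", "DE", "DE", "US"]
print(run_length_encode(country_column))
```

In a row store the same country values would be interleaved with names, ages and IDs, so runs like this would rarely appear.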
The file formats listed above (with the exception of Avro) are all columnar, so this is a necessary primer. Anyway, the side effects of storing data this way allow for much quicker analytical queries. An analytical query is something like taking all users and calculating the average age. The best way to take advantage of this setup is to perform queries that require only a few columns.
With these two advantages come some disadvantages. Relative to a row store, a column store does poorly on a filter query like: Select * from USERS where user_id in (Some_arbitrary_list_of_users);
It's just not designed to do these kinds of queries efficiently, since the IDs aren't stored with the rest of the data and you're returning all columns. Also, appending data comes at a cost, and most formats are append only or write once. This means that doing an operation like an Update or Delete is expensive or in some cases impossible. It makes it difficult for these kinds of stores to handle operations like stream ingestion as well.
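The average-age example can be made concrete with a toy comparison of the two layouts in plain Python. The records and function names are hypothetical, and in-memory lists stand in for what would be bytes on disk.

```python
from typing import Dict, List

# Row-oriented layout: each record keeps all of its fields together.
rows: List[Dict[str, object]] = [
    {"user_id": 1, "name": "Ann", "age": 30},
    {"user_id": 2, "name": "Bo", "age": 40},
    {"user_id": 3, "name": "Cy", "age": 50},
]

# Column-oriented layout: each field is stored as its own contiguous list.
columns: Dict[str, List[object]] = {
    "user_id": [1, 2, 3],
    "name": ["Ann", "Bo", "Cy"],
    "age": [30, 40, 50],
}


def average_age_from_rows(records):
    """Walks every record (and so loads every field) to read one value from each."""
    return sum(record["age"] for record in records) / len(records)


def average_age_from_columns(table):
    """Reads only the `age` column; the other columns are never touched."""
    ages = table["age"]
    return sum(ages) / len(ages)


print(average_age_from_rows(rows), average_age_from_columns(columns))
```

Both functions return the same answer; the difference is how much data has to be read to get it, which is where the disk I/O savings come from.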
Each of the file formats I worked with for this post has trade offs and I try to cover them in depth below.
Apache Avro
Avro is perhaps the simplest of the four formats because it is not columnar, and it's pretty similar to what most people are accustomed to when dealing with databases like MySQL. Avro's main goal is to compress data without losing schema flexibility. For example, you might want to use Hadoop as a document store and keep all of your data as JSON in Avro files for compression; you can do that in Avro. You might have some complex schema that you like to work with, and all of it can work with Avro as well. The flexibility of Avro allows you to dream up any number of schemas and still manage to get decent compression.
Another positive (read as: might be positive if you like this) is that Avro files are dynamically typed. This allows for the schema flexibility and for the RPC support. Altogether, Avro is a great format for data compression and most compression techniques in Spark will default to this one.
With the positives aside, Avro does come at the expense of some other things. For example, you aren't going to get the best possible compression when compared to a columnar format. Further, all of the data needs row-like traversal for queries. These trade-offs might not be a big deal, because the schema flexibility may be worth taking more space on disk, and you may not have enough data for it to matter very much.
Apache Parquet
Parquet has a different set of aims than Avro. Instead of allowing for maximum schema flexibility, it seeks to optimize the types of schemas you can use to increase query speeds and reduce disk I/O. Parquet attempts to overcome some of the weaknesses of traditional column stores by allowing nested types in columns. So you could technically have a column that is an array, or a column that's actually several columns. There is a great talk from Spark Summit about doing just this and I've found it helpful in my work.
Parquet has a lot of low-level optimizations and a number of details about how it is stored on disk that you can find in the documentation. Parquet is perhaps the most common file format you'll see in a lot of Spark-related projects and it's what I tend to use as well.
ORC
ORC shares the columnar format of Parquet but has some differences. One of the main differences is that it's strongly typed. ORC also supports complex types like lists and maps, allowing for nested data types. As far as compression goes, ORC is said to compress data even more efficiently than Parquet; however, this is contingent on how your data is structured.
On top of the features supported in Parquet, ORC also supports indexes and ACID transaction guarantees. The last point is very important when considering the number of applications that can benefit from ACID. It's a bit complex how the ACID transactions work in an append-only data format, but you can find all the details in the documentation if interested.
If you're using Presto or Athena, ORC is the preferred format. With recent changes to the Presto engine, many advantages come from using ORC. Additionally, ORC is one of the few columnar formats that can handle streaming data. ORC also supports caching on the client side, which can be extremely valuable.
I've really grown to love ORC for any kind of OLAP workload.
Carbondata
Carbondata is the new kid on the block. It is an incubating Apache project and, based on the Spark Summit talk on it, it promises the efficiency of querying data from a columnar format with the ability to also handle random access queries. Carbondata does not have ACID support but it has a host of other features. I won't list them all, but the most important (to me) are update and delete support, bucketing, and several index-based optimizations.
Update and delete support are important for many workflows. Append-only or write-once formats require rewriting or overwriting an entire file to change a record, and the cognitive load can be a bit much coming from an RDBMS world.
Bucketing is an optimization that allows commonly joined columns to be stored in buckets. The underlying implementation details were kind of hard to follow, but I can say that joining two Carbondata files was much more performant than any of the other formats I tried.
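The general idea behind bucketing can be sketched without any Carbondata code. What follows is a generic hash-bucketing illustration in plain Python, not Carbondata's actual implementation: the bucket count, the `bucket_of` function and the sample tables are assumptions made for the example. If both tables are bucketed on the join key the same way, a join only has to compare rows that landed in the same bucket.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # arbitrary choice for this example

def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign an integer join key to a bucket."""
    return key % num_buckets

def write_bucketed(rows, key_field):
    """Group rows by the bucket of their join key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_field])].append(row)
    return buckets

def bucketed_join(left, right, key_field):
    """Join two bucketed tables, comparing only rows in the same bucket."""
    joined, comparisons = [], 0
    for bucket_id, left_rows in left.items():
        for l_row in left_rows:
            for r_row in right.get(bucket_id, []):
                comparisons += 1
                if l_row[key_field] == r_row[key_field]:
                    joined.append({**l_row, **r_row})
    return joined, comparisons

# Hypothetical sample data: 8 users and 16 orders.
users = [{"user_id": i, "name": f"user{i}"} for i in range(8)]
orders = [{"user_id": i % 8, "order_id": 100 + i} for i in range(16)]

joined, comparisons = bucketed_join(
    write_bucketed(users, "user_id"),
    write_bucketed(orders, "user_id"),
    "user_id",
)
print(len(joined), comparisons, len(users) * len(orders))  # 16 32 128
```

With four buckets the join makes 32 comparisons instead of the 128 a naive nested loop would need, and the gap widens as the tables grow.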
Carbondata has multi-layer indexing, meaning the file is indexed, the partitions are indexed, and even the min-max values of columns are indexed. It makes for pretty speedy queries for analytical workloads. Queries requesting averages, and even some simple lookups, were much faster using Carbondata than other formats.
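A min-max index is the easiest of these layers to illustrate. The sketch below is plain Python rather than anything from Carbondata itself, and the block size and price values are invented: each block of a column keeps its minimum and maximum, so a lookup can skip any block whose range cannot contain the target.

```python
def build_blocks(values, block_size):
    """Split a column into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return blocks

def scan_equal(blocks, target):
    """Return matching values and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block["min"] or target > block["max"]:
            continue  # the index proves no match can be in this block
        blocks_read += 1
        matches.extend(v for v in block["values"] if v == target)
    return matches, blocks_read

# Hypothetical column of prices, already sorted.
prices = [3, 5, 8, 12, 15, 19, 21, 24, 30, 33, 37, 41]
blocks = build_blocks(prices, block_size=4)
matches, blocks_read = scan_equal(blocks, 21)
print(matches, blocks_read, len(blocks))  # [21] 1 3
```

Only one of the three blocks is read. Skipping like this works best when the data is sorted or clustered on the indexed column; on randomly ordered values most block ranges overlap and little can be skipped.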
I found two downsides with Carbondata. First, the files don't compress as small as they do with ORC or Parquet. This is probably because of all the fancy additions and indexes kept in the files. They weren't as big as Avro files, but depending on the schema they were pretty large.
Second, Carbondata is new and its API feels a little new relative to the other formats. Further, Presto and Athena do not support it yet, making it a non-starter for a lot of projects.
Other
Apache Arrow is still a bit too new to really fit into this world, but it's coming on strong and I imagine it'll have a lot to contribute to the space.
So Which One Should I Use?
I made the above table to highlight the differences and similarities between the four formats. I linked to performance studies, but didn't quote any directly in this post. In the world of distributed computing, it's far too easy to design tests that are advantageous to a particular platform. I recommend trying each as a precursor to any analytical workload you have.
After working with each of them, I've come up with some heuristics to help me. If you're using Presto or Athena, ORC is likely the best format for you. If you're using Spark, Parquet or Avro makes the most sense. If you're using Hive or writing your own MapReduce jobs, then ORC is probably the best option for you as well. If you're using JSON, your only real option is Avro, or, if you want to build a pipeline to flatten your JSON, you could use any of the other formats.
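Those heuristics can be written down as a small lookup, sketched here in Python. The function name, its parameters and the order in which the rules are checked are my own assumptions, not part of any library, and the result is a starting point for testing rather than a verdict.

```python
# Rules of thumb from the paragraph above, encoded as a lookup.
# Checking the JSON rule before the engine rule is an assumption.

ENGINE_HINTS = {
    "presto": ["ORC"],
    "athena": ["ORC"],
    "spark": ["Parquet", "Avro"],
    "hive": ["ORC"],
    "mapreduce": ["ORC"],
}
ALL_FORMATS = ["Avro", "Parquet", "ORC", "Carbondata"]

def suggest_formats(engine=None, raw_json=False, will_flatten=False):
    """Return candidate formats to try first for a given setup."""
    if raw_json and not will_flatten:
        return ["Avro"]  # nested JSON kept as-is
    hints = ENGINE_HINTS.get((engine or "").lower())
    if hints:
        return list(hints)
    return list(ALL_FORMATS)  # no specific advice: try each one

print(suggest_formats("Athena"))                # ['ORC']
print(suggest_formats("spark"))                 # ['Parquet', 'Avro']
print(suggest_formats("spark", raw_json=True))  # ['Avro']
```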
Overall, each format provides some great optimizations over storing a text file or a CSV, but they put all maintenance on the shoulders of data operations. Until next time!
Notes On Halt and Catch Fire Season 3
Minor spoilers
This past week I finally caught up on Halt and Catch Fire Season 3. I've been a fan of the show's first two seasons, but I was a bit worried about the direction of the show after season 2. Surprisingly, one of the things I really like about the show is that I don't know anyone else who watches it. It doesn't seem too popular, as evidenced by the numbers, and it's not talked about in any of the circles I run in (both TV and technical), but it seems to consistently get quality reviews. After watching Season 3, what was mere curiosity has turned into real love for the show. The writers and director have managed to elevate the show and I'm thoroughly excited for season 4!
Things I Loved
I loved the emotion displayed in the show this season. Both Donna and Cameron had a great deal of emotion on display, and not in a cliche "girls like to cry" kind of way. With the growth and investor interest in their company, the test of their partnership was on full display and Donna failed in the worst way possible. With Donna sacrificing her friendship with Cameron for financial gain, there are a lot of tears and screaming and I liked it.
Joe was not insufferable for once. Joe is a horribly insufferable character and had very few redeeming qualities until this season. He finally endured some tragedy that made him a better person rather than a worse one.
Joe is a product visionary? Joe has always played this kind of role, but with technological advances into the modern computing age he is able to play the role much more. I watched Steve Jobs and Joe seemed kind of like a poor man's Steve Jobs with his beard and pontification.
All the technology. One of the best features of Halt and Catch Fire is the technology it references. From the machines people are using to the internet protocols, it's just a fascinating listen. The last episode, where they talk about HTML and HTTP, is a treat.
Things I Didn't Like
I felt like the last 2-3 episodes of the show were a bit rushed. The writers needed to push the show forward for the last season and they did it in a way that really felt out of place. The pace of the season was very slow and methodical, and then at the end of episode 8 we flash forward 4 years in time, and major changes have happened in the characters' lives. Fortunately, the next 2 episodes were incredible and cleaned things up a bit.
Gordon's role. Gordon really took a back seat this season in favor of more screen time for Joe, which is a bummer because he's my favorite character on the show. It looks like he will be much more involved in season 4, which should be a great change of pace.
Season 4
I can't wait for season 4 and to see how this fun story will finally come to an end. With the way things are set up, there will be a ton of drama and a few laughs to come in the final act. As long as I get more references to 90s computer culture, the beginnings of the World Wide Web and video games, I'll be a happy camper.