Shsksmdjsl nice tony
The Unbeatable Squirrel Girl #38

Ride the bus on your computer!
Well not really. But kind of. Right? Through the magic of Google, you can explore a DATA bus before boarding. Check out the full Ride Guide for all the details about using public transit.
Prasanna Padmanabhan and Shashi Madappa posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:
There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems.
The steps involved are what you’d expect for a large data set migration:
forklift
incremental replication
consistency checking
shadow writes
shadow writes and shadow reads for validation
end of life of the original data store (SimpleDB)
If you think about it, this is how distributed, eventually consistent storage works (at least in broad strokes) when replicating data across a cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while a data migration has to deal with heterogeneous systems, most often characterized by different limitations and behavior.
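To make the migration steps listed above more concrete, here is a minimal sketch in Python. It is not Netflix's implementation: plain dictionaries stand in for SimpleDB and Cassandra, the names `forklift`, `check_consistency` and `DualStore` are hypothetical, and incremental replication is left out for brevity.

```python
# Toy sketch of forklift, consistency checking, shadow writes and shadow reads.
# Assumption: both stores are plain dicts; no real SimpleDB or Cassandra client is used.

def forklift(old, new):
    """Bulk-copy a snapshot of the old store into the new one."""
    new.update(dict(old))


def check_consistency(old, new):
    """Return the sorted keys whose values differ between the two stores."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


class DualStore:
    """Serves traffic from the old store while exercising the new one."""

    def __init__(self, old, new):
        self.old = old            # source of truth until cutover
        self.new = new            # migration target
        self.mismatches = []      # keys whose shadow read disagreed

    def write(self, key, value):
        self.old[key] = value
        try:
            self.new[key] = value     # shadow write: a failure must not affect the caller
        except Exception:
            pass                      # the consistency checker will catch the gap later

    def read(self, key):
        value = self.old.get(key)
        if self.new.get(key) != value:    # shadow read, used only for validation
            self.mismatches.append(key)
        return value                      # callers still get the old store's answer


if __name__ == "__main__":
    old = {"user:1": ["Movie A"], "user:2": ["Movie B"]}
    new = {}
    forklift(old, new)
    old["user:3"] = ["Movie C"]           # a write that lands after the forklift snapshot
    print(check_consistency(old, new))    # reports the key the forklift missed
    DualStore(old, new).write("user:3", ["Movie C"])
    print(check_consistency(old, new))    # empty once the shadow write has landed
```

Once shadow reads stop reporting mismatches, reads can move to the new store and the old one can be retired, which corresponds to the last step in the list.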
In 2009, Netflix performed a similar massive data migration. At that time it involved moving data from its self-hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in the paper Netflix’s Transition to High-Availability Storage Systems, authored by Sid Anand.
Sid Anand is now working at LinkedIn, where they use Databus for low latency data transfer. Databus’s approach is very similar.
Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix (NoSQL database©myNoSQL)
A lot of apps need to ship logs, and while there are probably numerous tools to help with this, Apache Flume[1] is the one I’d look at first (even if only for inspiration on how to do things):
An important decision to make when designing your Flume flow is what type of channel you want to use. At the time of this writing, the two recommended channels are the file channel and the memory channel. The file channel is a durable channel, as it persists all events that are stored in it to disk. So, even if the Java virtual machine is killed, or the operating system crashes or reboots, events that were not successfully transferred to the next agent in the pipeline will still be there when the Flume agent is restarted. The memory channel is a volatile channel, as it buffers events in memory only: if the Java process dies, any events stored in the memory channel are lost. Naturally, the memory channel also exhibits very low put/take latencies compared to the file channel, even for a batch size of 1. Since the number of events that can be stored is limited by available RAM, its ability to buffer events in the case of temporary downstream failure is quite limited. The file channel, on the other hand, has far superior buffering capability due to utilizing cheap, abundant hard disk space.
Just a couple of extra thoughts:
Flume NG seems to offer 3 types of channels: file, jdbc, memory.
For the memory channel, I’d add an option to start dropping events if memory consumption goes above a configurable threshold (this might already be implemented, but I couldn’t find it).
Would it be worth investigating a channel based on LinkedIn’s low latency transfer Databus tool?
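As an illustration of the drop-above-threshold idea for the memory channel, here is a minimal Python sketch. It is not Flume code (Flume is written in Java) and does not use any Flume API: `DroppingMemoryChannel` and `max_events` are hypothetical names, and the threshold counts buffered events as a simple stand-in for actual memory consumption.

```python
from collections import deque


class DroppingMemoryChannel:
    """Toy volatile channel that drops new events once a threshold is reached.

    Assumption: the threshold is a number of buffered events, used here as a
    proxy for memory consumption.
    """

    def __init__(self, max_events):
        self.max_events = max_events
        self.events = deque()
        self.dropped = 0          # how many events were discarded

    def put(self, event):
        """Buffer an event; return False if it was dropped instead."""
        if len(self.events) >= self.max_events:
            self.dropped += 1     # drop the incoming event rather than block or fail
            return False
        self.events.append(event)
        return True

    def take(self):
        """Return the oldest buffered event, or None if the channel is empty."""
        return self.events.popleft() if self.events else None
```

Dropping the incoming event keeps the oldest data; dropping the oldest buffered event instead would be an equally valid policy, depending on which log lines matter more during a downstream outage.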
[1] Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Original title and link: Apache Flume Performance Tuning (NoSQL database©myNoSQL)
What Is Unique About LinkedIn's Databus
After learning about LinkedIn’s Databus low latency data transfer system, I had a short chat with Sid Anand focused on understanding what makes Databus unique.
As I mentioned in my post about Databus, it looks at first like a data-oriented ESB. What is innovative about Databus is that it decouples the data source from the consumers/clients. It can therefore serve a large number of up-to-date subscribers quickly, and also help clients that have fallen behind or are just bootstrapping, without adding load on the source database.
Databus clients are smart enough to:
ask for Consolidated Deltas since time T if they fall behind
ask for a Consistent Snapshot and then for a Consolidated Delta if they bootstrap
and Databus is built so that it can serve both Consolidated Deltas and Consistent Snapshots without any impact on the original data source.
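The two client behaviors described above can be sketched in a few lines of Python. This is a toy model, not the Databus API: `ToyDatabus`, `Client` and the sequence numbers are hypothetical, and "consolidated" is assumed here to mean that several changes to the same key collapse into the latest value. Deletes are not modeled.

```python
class ToyDatabus:
    """Keeps its own commit-ordered change log, so serving it never touches the source."""

    def __init__(self):
        self.log = []   # (seq, key, value) tuples in commit order

    def publish(self, key, value):
        self.log.append((len(self.log) + 1, key, value))

    def consolidated_delta(self, since):
        """Return (latest value per key changed after `since`, newest sequence number)."""
        delta = {}
        for seq, key, value in self.log:
            if seq > since:
                delta[key] = value      # later changes overwrite earlier ones
        return delta, len(self.log)

    def consistent_snapshot(self):
        """Return (full state, sequence number the state corresponds to)."""
        return self.consolidated_delta(0)


class Client:
    def __init__(self):
        self.state = {}
        self.last_seq = None    # None means the client has never synced

    def sync(self, bus):
        if self.last_seq is None:
            # Bootstrapping: take a consistent snapshot first ...
            self.state, self.last_seq = bus.consistent_snapshot()
        # ... then (or when merely behind) apply a consolidated delta since last_seq.
        delta, self.last_seq = bus.consolidated_delta(self.last_seq)
        self.state.update(delta)
```

A client that was offline while one key changed many times receives a single entry for that key, which is what keeps catching up cheap in this model.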
Diagram from Highscalability.com
The “catching-up” and bootstrapping processes are described in much more detail in Sid Anand’s article.
Databus is the single and only way that data is replicated from LinkedIn’s databases to search indexes, the graph, Memcached, Voldemort, etc.
Original title and link: What Is Unique About LinkedIn's Databus (NoSQL database©myNoSQL)
Great article by Siddharth Anand[1] introducing LinkedIn’s Databus, a low latency system used for transferring data between data stores (a change data capture system):
Databus offers the following features:
Pub-sub semantics
In-commit-order delivery guarantees
Commits at the source are grouped by transaction
ACID semantics are preserved through the entire pipeline
Supports partitioning of streams
Ordering guarantees are then per partition
Like other messaging systems, offers very low latency consumption for recently-published messages
Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source
High Availability and Reliability
The ESB model is well-known, but like NoSQL databases, Databus is specialized in handling specific requirements related to distributed systems and high volume data processing architectures.
[1] Siddharth Anand: senior member of LinkedIn’s Distributed Data Systems team
Original title and link: Introducing Databus: LinkedIn's Low Latency Change Data Capture Tool (NoSQL database©myNoSQL)