Discover Top Posts Tagged with #databus

Shsksmdjsl nice tony

The Unbeatable Squirrel Girl #38

#tony stark #nancy whitehead #databus #the Unbeatable Squirrel girl #ben reads comics

Ride the bus on your computer!

Well not really. But kind of. Right? Through the magic of Google, you can explore a DATA bus before boarding. Check out the full Ride Guide for all the details about using public transit.

#durhamnc #google #databus #publictransit #durm #bullcity

Prasanna Padmanabhan and Shashi Madappa posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:

There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems.

The steps involved are what you’d expect for a large data set migration:

forklift

incremental replication

consistency checking

shadow writes

shadow writes and shadow reads for validation

end of life of the original data store (SimpleDB)
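As a rough illustration of the steps listed above, here is a minimal Python sketch of forklifting, shadow writes, shadow reads, and consistency checking. It is not the Netflix implementation: `DictStore` and `MigratingStore` are hypothetical in-memory stand-ins for the two stores, and incremental replication is left out for brevity.

```python
class DictStore:
    """Hypothetical in-memory stand-in for a key-value store."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


class MigratingStore:
    """Serves traffic from the old store while shadowing it to the new one."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.mismatches = []  # keys where a shadow read disagreed

    def forklift(self):
        # Bulk copy of everything currently in the old store.
        for key, value in list(self.old.rows.items()):
            self.new.put(key, value)

    def put(self, key, value):
        # Shadow write: the old store remains the source of truth.
        self.old.put(key, value)
        self.new.put(key, value)

    def get(self, key):
        # Shadow read: answer from the old store, compare with the new one.
        expected = self.old.get(key)
        if self.new.get(key) != expected:
            self.mismatches.append(key)
        return expected

    def check_consistency(self):
        # Full comparison; returns the keys whose values differ.
        keys = set(self.old.rows) | set(self.new.rows)
        return sorted(k for k in keys if self.old.get(k) != self.new.get(k))


old, new = DictStore(), DictStore()
old.put("user:1", ["movie-a"])           # data that predates the migration
store = MigratingStore(old, new)
store.forklift()
store.put("user:2", ["movie-b"])         # live write during the migration
remaining_diffs = store.check_consistency()
```

Once `check_consistency()` stays empty and shadow reads stop recording mismatches, the old store can be retired.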

If you think about it, this is how a distributed, eventually consistent storage system works (at least in broad strokes) when replicating data across the cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while a data migration has to deal with heterogeneous systems, most often characterized by different limitations and behavior.

In 2009, Netflix performed a similar massive data migration. At that time it involved moving data from its self-hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in the paper Netflix’s Transition to High-Availability Storage Systems, authored by Sid Anand.

Sid Anand is now working at LinkedIn, where they use Databus for low-latency data transfer. Databus’s approach is very similar.

Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix (NoSQL database©myNoSQL)

#SimpleDB #Cassandra #Databus #Netflix #LinkedIn #key-value store #column store #BigTable #Dynamo

A lot of apps need to ship logs, and while there are probably numerous tools to help with this, Apache Flume [1] is the one I’d look at first (even if only to take inspiration on how to do things):

An important decision to make when designing your Flume flow is what type of channel you want to use. At the time of this writing, the two recommended channels are the file channel and the memory channel. The file channel is a durable channel, as it persists all events that are stored in it to disk. So, even if the Java virtual machine is killed, or the operating system crashes or reboots, events that were not successfully transferred to the next agent in the pipeline will still be there when the Flume agent is restarted. The memory channel is a volatile channel, as it buffers events in memory only: if the Java process dies, any events stored in the memory channel are lost. Naturally, the memory channel also exhibits very low put/take latencies compared to the file channel, even for a batch size of 1. Since the number of events that can be stored is limited by available RAM, its ability to buffer events in the case of temporary downstream failure is quite limited. The file channel, on the other hand, has far superior buffering capability due to utilizing cheap, abundant hard disk space.

Just a couple of extra thoughts:

Flume NG seems to offer 3 types of channels: file, jdbc, memory.

For the memory channel, I’d add an option to start dropping events if memory consumption goes above a configurable threshold (this might already be implemented, but I couldn’t find it).

Would it be worth investigating a channel based on LinkedIn’s low latency transfer Databus tool?
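To make the second thought above concrete, here is a small Python sketch of a memory buffer that drops events once a configurable limit is reached. This is not Flume code or the Flume API: `DroppingMemoryChannel` is a hypothetical class, and it caps the number of buffered events as a simple proxy for memory consumption.

```python
from collections import deque


class DroppingMemoryChannel:
    """Hypothetical volatile buffer that drops new events when it is full."""

    def __init__(self, max_events):
        self.max_events = max_events  # configurable threshold (assumption)
        self.events = deque()
        self.dropped = 0              # how many events were discarded

    def put(self, event):
        # Refuse the event instead of growing without bound.
        if len(self.events) >= self.max_events:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def take(self):
        # Oldest event first; None when the buffer is empty.
        return self.events.popleft() if self.events else None


channel = DroppingMemoryChannel(max_events=2)
accepted = [channel.put(line) for line in ("log-1", "log-2", "log-3")]
```

Dropping the newest event is only one possible policy; evicting the oldest event instead would favor fresh data over completeness.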

[1] Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Original title and link: Apache Flume Performance Tuning (NoSQL database©myNoSQL)

#Flume #Databus

What Is Unique About LinkedIn's Databus

After learning about LinkedIn’s Databus low-latency data transfer system, I had a short chat with Sid Anand focused on understanding what makes Databus unique.

As I mentioned in my post about Databus, it looks at first like a data-oriented ESB. What is innovative about Databus is that it decouples the data source from the consumers/clients. This lets it serve a large number of up-to-date subscribers quickly, while also helping clients that fall behind or are just bootstrapping, without adding load on the source database.

Databus clients are smart enough to:

ask for Consolidated Deltas since time T if they fall behind

ask for a Consistent Snapshot and then for a Consolidated Delta if they bootstrap

and Databus is built so that it can serve both Consolidated Deltas and Consistent Snapshots without any impact on the original data source.
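The two client requests above can be sketched in a few lines of Python. This is a toy model, not the Databus API: `ChangeLog`, its sequence numbers, and its method names are assumptions made for illustration. The point it shows is that both answers come from a separate log, so the source database is never queried.

```python
class ChangeLog:
    """Hypothetical change stream kept outside the source database."""

    def __init__(self):
        self.changes = []  # (seq, key, value) tuples, seq strictly increasing

    def append(self, key, value):
        seq = len(self.changes) + 1
        self.changes.append((seq, key, value))
        return seq

    def consolidated_delta(self, since):
        # Only the last change per key after `since`; intermediate
        # updates to the same key are collapsed.
        latest = {}
        for seq, key, value in self.changes:
            if seq > since:
                latest[key] = (seq, value)
        return latest

    def consistent_snapshot(self):
        # Value of every key as of the newest sequence number, plus that
        # number so the client knows where to resume with deltas.
        state = {key: value for _, key, value in self.changes}
        high_watermark = self.changes[-1][0] if self.changes else 0
        return state, high_watermark


log = ChangeLog()
log.append("member:1", "v1")
log.append("member:2", "v1")
log.append("member:1", "v2")

# A client that last saw sequence 1 and fell behind:
delta = log.consolidated_delta(since=1)

# A brand-new client bootstrapping from scratch:
snapshot, resume_from = log.consistent_snapshot()
```

After bootstrapping, the new client would call `consolidated_delta(since=resume_from)` to pick up whatever changed while the snapshot was being applied.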

Diagram from Highscalability.com

The “catching-up” and bootstrapping processes are described in much more detail in Sid Anand’s article.

Databus is the only way that data is replicated from LinkedIn’s databases to search indexes, the graph, Memcached, Voldemort, etc.

Original title and link: What Is Unique About LinkedIn's Databus (NoSQL database©myNoSQL)

#Databus #LinkedIn #polyglot persistence

Great article by Siddharth Anand [1] introducing LinkedIn’s Databus, a low-latency system used for transferring data between data stores (a change data capture system):

Databus offers the following features:

Pub-sub semantics

In-commit-order delivery guarantees

Commits at the source are grouped by transaction

ACID semantics are preserved through the entire pipeline

Supports partitioning of streams

Ordering guarantees are then per partition

Like other messaging systems, offers very low latency consumption for recently-published messages

Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source

High Availability and Reliability
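To illustrate what "ordering guarantees are then per partition" means in the list above, here is a minimal Python sketch. It is an assumption-laden toy, not how Databus partitions streams: `partition_for` and `route` are hypothetical helpers, and hashing the key with CRC32 is just one possible partitioning scheme.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Split (commit_seq, key, payload) events, given in commit order,
    into partitions while preserving that order inside each partition."""
    partitions = defaultdict(list)
    for commit_seq, key, payload in events:
        index = partition_for(key, num_partitions)
        partitions[index].append((commit_seq, key, payload))
    return partitions


events = [
    (1, "member:1", "created"),
    (2, "member:2", "created"),
    (3, "member:1", "updated"),
    (4, "member:3", "created"),
]
partitions = route(events, num_partitions=2)
```

Within any one partition the commit sequence numbers stay increasing, and all changes to a given key land in the same partition, but nothing is promised about the relative order of events in different partitions.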

The ESB model is well-known, but like NoSQL databases, Databus is specialized in handling specific requirements related to distributed systems and high volume data processing architectures.

[1] Siddharth Anand: senior member of LinkedIn’s Distributed Data Systems team

#ETL #Databus #LinkedIn