myNoSQL @nosql - Tumblr Blog

Autoscaling allows customers to build more cost effective and resilient applications. Using Compute Engine Autoscaling, you can ensure that exactly the right number of Compute Engine instances are available at any given time to handle your application’s workload. This saves you money when your application’s usage is low, and ensures your application is responsive when utilization is high.

Autoscaling is the Holy grail of a distributed system. The promise is that the system will be able to adapt, both up and down, to its needs, requirements, and SLAs. Basically, the system will deliver the performance demanded of it and maximum availability, both at optimal cost.

The first step in finding this Holy grail is to be able to describe the needs and requirements and SLAs of the system.
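To make that first step concrete, here is a minimal sketch, assuming the requirement is expressed as a target average utilization plus minimum and maximum instance counts. The names `ScalingPolicy` and `desired_instances` and the proportional rule are illustrative assumptions, not Compute Engine's actual algorithm or API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical description of what the system needs (not a real GCE type)."""
    target_utilization: float  # e.g. 0.5 means "keep average CPU near 50%"
    min_instances: int
    max_instances: int


def desired_instances(policy: ScalingPolicy, current_instances: int,
                      observed_utilization: float) -> int:
    """Proportional rule: size the group so the average load lands on the target."""
    if current_instances <= 0:
        return policy.min_instances
    raw = current_instances * observed_utilization / policy.target_utilization
    wanted = math.ceil(raw)
    # Clamp to the bounds stated in the policy (cost ceiling, availability floor).
    return max(policy.min_instances, min(policy.max_instances, wanted))
```

The bounds are what tie the decision back to costs (the maximum) and availability (the minimum); without a stated target, there is nothing for the scaler to adapt to.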

Original title and link: Autoscaling, welcome to Google Compute Engine (NoSQL database©myNoSQL)

#distributed systems #scalability

Mark Callaghan takes a look at:

Amazon’s participation in the MySQL community: none

some of the things said during the presentations: performance seems to be inflated

compatibility with existing MySQL features and especially the InnoDB engine

features: very similar to my Amazon Aurora in bullet points

What is Aurora? I don’t know and we might never find out. I assume it is a completely new storage engine rather than a new IO layer under InnoDB.

Original title and link: Aurora for MySQL is coming (NoSQL database©myNoSQL)

#Aurora #Amazon

Medium’s social graph stored in Neo4j and exposed through a Go service:

It makes a lot of sense to store social data in a graph database. Medium users, posts and collections are represented by graph nodes, and the edges between them describe relationships — users following users, users recommending posts, or users editing collections, to name a few common examples. Using a graph database also makes our queries simpler: we don’t have to do any complicated joins or other query wizardry.

It’s hard to deny that when looking at highly connected data the first answer is almost always a graph database. Once the amount of data stored grows, you start thinking about how you access that data. In many cases, the predominant answer is not traversals.
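A small sketch of the model the quote describes, assuming an in-memory adjacency map as a stand-in for Neo4j. The `SocialGraph` class and the relationship names are hypothetical. It also shows the two access patterns side by side: a one-hop lookup of a node's neighbors and a multi-hop traversal.

```python
from collections import defaultdict, deque


class SocialGraph:
    """Toy in-memory graph: typed, directed edges between node ids."""

    def __init__(self):
        # edges[relationship][source] -> set of target node ids
        self._edges = defaultdict(lambda: defaultdict(set))

    def relate(self, source, relationship, target):
        self._edges[relationship][source].add(target)

    def neighbors(self, source, relationship):
        """One-hop lookup, e.g. "who does this user follow?"."""
        return set(self._edges[relationship].get(source, ()))

    def within_hops(self, source, relationship, max_hops):
        """Breadth-first traversal: nodes reachable in at most max_hops edges."""
        seen = {source}
        frontier = deque([(source, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for nxt in self.neighbors(node, relationship):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        seen.discard(source)
        return seen
```

If most queries look like `neighbors(...)`, a key-value or relational layout can serve them just as well; the graph database earns its keep when queries look like `within_hops(...)`.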

Original title and link: Medium uses Neo4j and Go for GoSocial service (NoSQL database©myNoSQL)

#Neo4j #Go #graphdb #graph database

Stripe has put on GitHub 4 Hadoop related projects they’ve developed internally:

a dashboard for Hadoop jobs

a Scala framework for distributed learning

a database for serving data in SequenceFile format

a collection of command-line utilities.

As a side note, Stripe is using Cloudera Impala with Parquet.

Original title and link: Stripe's Hadoop tools open sourced (NoSQL database©myNoSQL)

#Hadoop #Impala #Parquet #MapReduce #BigData

NoSQL databases, Hadoop, Big Data: Pinned tabs Nov.19th

01: Teradata QueryGrid is the technology used to allow querying both Teradata/AsterData and external data stored in Hadoop or Oracle. ★

02: MarkLogic 8 will bring a JavaScript server-side engine, an RDF triple store engine with support for SPARQL 1.1, and bitemporal data management. ★

I still believe that MarkLogic should position itself as a real-time search solution.

03: For Cassandra 3.0, there’s a completely revamped and optimized solution for handling hinted handoff that uses a sort of commit log instead of a Cassandra system table (thus avoiding the associated overhead). ★

04: YASH. Yet another SQL-on-Hadoop. This one from HP Vertica. ★

05: Teradata and MapR are signing a partnership to collaborate on the integration and co-development of joint products. Some might say this could impact Hortonworks’s IPO. ★

Original title and link: NoSQL databases, Hadoop, Big Data: Pinned tabs Nov.19th (NoSQL database©myNoSQL)

#Teradata #MarkLogic #document database

The states and transitions of a Couchbase node

The different states and the transitions of a Couchbase node in a diagram:

This post describes the states and actions that can trigger the transitions. One interesting aspect is that state changes are not applied immediately and you can commit multiple such changes at once when satisfied with the new topology.
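The "stage now, commit later" behavior can be sketched as follows. This is a minimal illustration, assuming only two kinds of pending change (add and remove); the `ClusterTopology` class and its method names are hypothetical and do not correspond to Couchbase's actual node states or API.

```python
class ClusterTopology:
    """Toy model: topology changes are staged, then applied in one commit."""

    def __init__(self, active):
        self.active = set(active)
        self._pending_add = set()
        self._pending_remove = set()

    def stage_add(self, node):
        self._pending_add.add(node)
        self._pending_remove.discard(node)

    def stage_remove(self, node):
        if node in self.active:
            self._pending_remove.add(node)
        self._pending_add.discard(node)

    def preview(self):
        """The topology that a commit would produce; nothing is changed yet."""
        return (self.active | self._pending_add) - self._pending_remove

    def commit(self):
        """Apply every staged change at once and clear the staging area."""
        self.active = self.preview()
        self._pending_add.clear()
        self._pending_remove.clear()
        return self.active
```

Batching the changes means data only has to be moved once, for the final topology, rather than once per individual change.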

Original title and link: The states and transitions of a Couchbase node (NoSQL database©myNoSQL)

#Couchbase #key-value store #document database

Betteridge’s law of headlines.

Original title and link: Can MapReduce Solve Planning Problems? (NoSQL database©myNoSQL)

#MapReduce

Dave Kellogg’s in-depth look at Hortonworks’s filing for IPO, a comparison with Red Hat’s model, and a definitely interesting hypothesis and conclusion:

While Hadoop and big data are unarguably huge trends driving the industry and while the future of Hadoop looks very bright indeed, on reading the Hortonworks S-1, the reader is drawn to the inexorable conclusion that it’s hard to make money in open source, or more crassly, it’s hard to make money when you give the shit away.

Others:

Gartner’s Merv Adrian

InfoWorld’s Yves de Montcheuil

myself

Original title and link: It Ain’t Easy Making Money in Open Source: Thoughts on the Hortonworks's IPO Filling (NoSQL database©myNoSQL)

#Hortonworks #Hadoop market

Keyword is partially:

CouchDB’s long road to clustering can be partially traced to conscious design decisions and philosophical choices made by CouchDB’s creators. As Lehnardt explained, “CouchDB has always said no to features that we know couldn’t be scalable in a cluster or even doable in a cluster. This puts us in a position to migrate upward seamlessly.”

Two years ago, this would have actually put CouchDB somewhere.

Original title and link: CouchDB's long road to clustering (NoSQL database©myNoSQL)

#CouchDB #Cloudant #document database

Apache CouchDB 2.0 gets clustering support

At ApacheCon Europe 2014, the Apache CouchDB™ project today announced a Developer Preview release of its CouchDB 2.0 document database. The Developer Preview release brings all-new clustering technology to the Open Source NoSQL database, enabling a range of big data capabilities that include being able to store, replicate, sync, and process large amounts of data distributed across individual servers, data centers, and geographical regions in any deployment configuration, including private, hybrid, and multi-cloud.

I’m not sure who wrote the ASF PR announcement, but if it were me, I would have simply posted “Apache CouchDB 2.0 features clustering support. Finally. </eom>”

Original title and link: Apache CouchDB 2.0 gets clustering support (NoSQL database©myNoSQL)

#CouchDB #document database

We rarely have the opportunity to learn about the almost complete architecture and data flow for a massive data indexing solution. Twitter’s blog post covers many details of their indexing solution starting with design goals and getting down to technical

But our long-standing goal has been to let people search through every Tweet ever published.

My notes:

half a trillion documents

average latency under 100ms

(super tuned) SSD used as storage

4 components: batch data aggregation and preprocess pipeline, inverted index builder, Earlybird shards and roots; what are the Earlybird roots?

ingestion processes tweets in one-day batches and is run every day; in this process tweets are scored and partitioned

Hadoop for ETL: ingestion process is run on Hadoop, with the output being stored in HDFS

Mesos is used to parallelize the inverted index creation; results are stored in HDFS

after praising the high parallelism and statelessness of the index builders, some coordination using ZooKeeper is mentioned:

These inverted index builders can coordinate with each other by placing locks on ZooKeeper, which ensures that two builders don’t build the same segment. Using this approach, we rebuilt inverted indices for nearly half a trillion Tweets in only about two days (fun fact: our bottleneck is actually the Hadoop namenode).

the Earlybird shards are the storage of the inverted index partitioned by time and then hash; partitioning by time tiers will allow growing the storage without affecting the current time tiers

the Earlybird roots are the endpoint for the client API; they forward requests to the corresponding Earlybird shards, merge results, etc;

not very sure how Earlybird roots decide what time tiers should not receive a query

no words about the actual Earlybird storage; can it be Manhattan?

no details about the query processor

this project started in 2012; the full index was completely built in 2014
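The partitioning and the role of the roots described in the notes above can be sketched roughly as follows, assuming yearly time tiers and hashing on the tweet id. The tier boundaries, partition count, class names, and scoring are all hypothetical stand-ins, not details from Twitter's post.

```python
import hashlib
from dataclasses import dataclass, field


def hash_partition(tweet_id: int, partitions: int) -> int:
    """Stable hash so the same tweet always maps to the same partition."""
    digest = hashlib.md5(str(tweet_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % partitions


@dataclass
class Shard:
    docs: dict = field(default_factory=dict)  # tweet_id -> (text, score)

    def search(self, term: str):
        return [(score, tid) for tid, (text, score) in self.docs.items()
                if term in text.split()]


class Root:
    """Fans a query out to the shards of the relevant tiers and merges results."""

    def __init__(self, tier_years, partitions_per_tier):
        self.partitions = partitions_per_tier
        # Each tier owns its own shards, so adding a tier leaves the others alone.
        self.tiers = {year: [Shard() for _ in range(partitions_per_tier)]
                      for year in tier_years}

    def index(self, tweet_id, year, text, score):
        # Partition by time tier first, then by hash within the tier.
        shard = self.tiers[year][hash_partition(tweet_id, self.partitions)]
        shard.docs[tweet_id] = (text, score)

    def search(self, term, years=None, limit=10):
        hits = []
        for year in (years or self.tiers):
            for shard in self.tiers[year]:
                hits.extend(shard.search(term))
        # Merge: highest score first, tweet id as tie-breaker.
        hits.sort(key=lambda h: (-h[0], h[1]))
        return [tid for _, tid in hits[:limit]]
```

In this sketch the caller picks the tiers through `years`; how the real roots decide which tiers to skip is exactly the open question in the notes.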

Original title and link: The data flow and the massive historical Tweet index (NoSQL database©myNoSQL)

#full text indexing

For the technical part the list goes like this:

SAS and/or R

Python

Hadoop

SQL

unstructured data

#data science

In an interview with Bob Wiederhold [1], Roberto V. Zicari asks: “why Couchbase Lite is so strategically important?”

Bob Wiederhold: First, because the world is going mobile. That is indisputable. Mobile initiatives top the list of every IT department. As I said above, if you don’t have a mobile data management offering, you are not looking at the complete needs of the developer or the enterprise.

Second, let’s level set on Couchbase Lite. Couchbase Lite is our offering for an embedded mobile JSON database.

Our complete mobile offering, Couchbase Mobile, includes Couchbase Server – for data management in the cloud, and Sync Gateway for synchronization of data stored on the device with other devices, or the database in the cloud. Today, because connectivity is unknown, data synchronization challenges force developers to either choose a total online (data stored in the cloud), or total offline (data stored on the device) data management strategy.

Maybe I’m seeing things from the wrong perspective:

the data syncing between the disconnected device and the central databases needs to see very low contention; resolving conflicts on the device would be much more difficult than having a server component resolve them;

as far as I can tell, the king of storage on mobile phones is SQLite; I somehow doubt that JSON + map/reduce can beat it;

while not an expert in iOS services, I think CloudKit already covers the local-to-remote storage sync problem.

What am I missing?

[1] Bob Wiederhold is CEO of Couchbase.

#Couchbase #key-value store #document database

Pretty much the same perspective about Hortonworks’s filing for IPO from Yves de Montcheuil (InfoWorld):

By filing first among Hadoop distribution vendors, Hortonworks is guaranteed to get the lion’s share of publicity for the foreseeable future. Any competitor who follows suit will be perceived as a copycat. And since it’s unlikely that said competitors can produce a more attractive balance sheet anyway, they would pretty much be in the same type of criticism.

#Hortonworks #Hadoop market

Merv Adrian is looking at 3 possible reasons for Hortonworks’s filing for IPO by switching the why question to who will benefit from this IPO. As for the why now part, the main question I’ve also asked myself, this seems to be the general answer:

Ultimately, it’s unlikely that Hortonworks will be alone as a public company for long. MapR told the Wall Street Journal they want to IPO next year, and they claim to have more customers, high margins and “efficient cash management.” Cloudera says they “are not ready yet” though they have lower rate of losses, and also claim more customers. At the end of the day, the answer may be rather simple. And again, answering a question with a question: if not now, when? There may not be a better time.

#Hortonworks

More details about Damien Katz’s new message queue project: it has a name, Kayos, and some goals:

Build a fast, low cost, fault tolerant messaging and queueing system that offers predictable performance and can take advantage of high end dedicated hardware as well as unreliable, commodity infrastructure like EC2. We want to support message de-duplication (newer versions of messages eliminate older versions) while also maintaining strict consistency (ordered synchronous delivery), causal consistency (ordered asynchronous delivery) and eventual consistency (unordered asynchronous delivery).

At the end of the long road ahead, “Shit be awesome yo“.
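A minimal sketch of the de-duplication goal, assuming each message carries a key and that a newer message with the same key replaces the older, still undelivered one. `CompactingQueue` is a hypothetical name; this is not Kayos's design, and it ignores persistence, distribution, and the consistency modes entirely.

```python
from collections import OrderedDict


class CompactingQueue:
    """Single-process queue where a newer version of a key replaces the old one."""

    def __init__(self):
        self._pending = OrderedDict()  # key -> latest payload, oldest first

    def publish(self, key, payload):
        # Drop the superseded version, then append the new one at the tail.
        self._pending.pop(key, None)
        self._pending[key] = payload

    def poll(self):
        """Return the oldest pending (key, payload), or None if empty."""
        if not self._pending:
            return None
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)
```

Here a republished key moves to the tail of the queue; keeping its original position instead is an equally plausible choice and would only change the `publish` method.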

#Kayos

Kafka and Samza: Distributed stream processing in practice

Fantastic slide deck from Martin Kleppmann. These 2 screenshots below are a good summary of the talk, but I strongly encourage you to go through the 42 slides. Totally worth the time.

The parallel between the Unix philosophy and the new (big) data solutions shows up quite frequently. There’s an inherent extra complexity in big data platforms due to their distributed nature. But for some of these tools the rule of “doing one thing and doing it well” was relaxed; maybe too relaxed. And in some cases there’s less than optimal openness towards integration.

#Kafka #Samza
