We rarely have the opportunity to learn about the almost complete architecture and data flow for a massive data indexing solution. Twitterâs blog post covers many details of their indexing solution starting with design goals and getting down to technical
But our long-standing goal has been to let people search through every Tweet ever published.
half a trillion documents
average latency under 100ms
(super tuned) SSD used as storage
4 components: batch data aggregation and preprocess pipeline, inverted index builder, Earlybird shards and roots; what are the Earlybird roots?
ingestion processes one day of tweets batches. it is run every day; in this process tweets are scored and partitioned
Hadoop for ETL: ingestion process is run on Hadoop, with the output being stored in HDFS
Mesos is used to parallelize the inverted index creation; results are stored in HDFS
after praising the high parallelism and statelessness of the index builders, some coordination using ZooKeeper is mentioned:
These inverted index builders can coordinate with each other by placing locks on ZooKeeper, which ensures that two builders donât build the same segment. Using this approach, we rebuilt inverted indices for nearly half a trillion Tweets in only about two days (fun fact: our bottleneck is actually the Hadoop namenode).
the Earlybird shards are the storage of the inverted index partitioned by time and then hash; partitioning by time tiers will allow growing the storage without affecting the current time tiers
the Earlybird roots are the endpoint for the client API; they forward requests to the corresponding Earlybird shards, merge results, etc;
not very sure how Earlybird roots decide what time tiers should not receive a query
no words about the actual Earlybird storage; can it be Manhattan?
no details about the query processor
this project started in 2012; the full index was completely built in 2014
Original title and link: The data flow and the massive historical Tweet index (NoSQL database©myNoSQL)