Over the past several months, I’ve been leading an effort to replace our aging Scribe/MongoDB-based stats infrastructure with a more scalable, cost-effective solution based on Suro, Kafka, Storm, and KairosDB.
Let’ see what each of these pieces gives us:
- Suro effectively replaces Scribe as the store-and-forward component, enabling us to survive the frequent network partitions in AWS without losing data.
- We’ve introduced Kafka to serve as a queue between our stats producers and consumers, enhancing the reliability and robustness of our system while enabling easier development of new features with alternative stats consumers.
- Storm is used to pre-aggregate the data before insertion into KairosDB. This drastically decreases the required write capacity at the database level.
- We’re replacing MongoDB with KairosDB, which is a time-series database built upon Cassandra. This provides us with high linear scalability, tunable replication, and impressive write-throughput.
Last week, I discussed the last two components in this pipeline at Gluecon 2014 in Denver.
Title: Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Abstract: Many startups collect and display stats and other time-series data for their users. A supposedly-simple NoSQL option such as MongoDB is often chosen to get started… which soon becomes 50 distributed replica sets as volume increases. This session is about designing a scalable distributed stats infrastructure from the ground up. KairosDB, a rewrite of OpenTSDB built on top of Cassandra, provides a solid foundation for storing time-series data. Unfortunately, though, it has some limitations: millisecond time granularity and lack of atomic upsert operations which make counting (critical to any stats infrastructure) a challenge. Additionally, running KairosDB atop Cassandra inside AWS brings its own set of challenges, such as managing Cassandra seeds and AWS security groups as you grow or shrink your Cassandra ring. Join a deep-dive session where we’ll explore how we’ve used a mix of open-source and in-house tools to tackle these challenges and build a robust, scalable, distributed stats infrastructure.
If you want a peek into how these pieces fit together, peep the slides.
This is still a work-in-progress, so if you have any ideas or suggestions for more effectively tuning or scaling this system, hit me up!
How does your company handle high-volume time-series data?