

Protected: Simplify Deployment with Infrastructure Manifest (Part 1)


Posted in Tutorials.


Mongo Multi-Key Index Performance

tl;dr – Mongo multi-key indexes don’t scale well with high-cardinality arrays. This is the same bad behavior that KairosDB is experiencing with its own indexes.

This post explores a technique for using MongoDB as a general-purpose time-series database. It was investigated as a possible temporary workaround until KairosDB adds support for high-cardinality tags. In particular, this describes using a MongoDB multi-key index for associating arbitrary key-value metadata with time-series metrics in MongoDB and the associated performance problems.

Motivation

We’ve been using KairosDB in production for close to a year now. KairosDB is a general-purpose time-series database built on Cassandra. Each “metric” consists of a name, a value, and a set of associated “tags” (key-value metadata). These tags are extremely useful, providing structured metadata for slicing, filtering, and grouping the stats.

The main issue restricting us from adopting it more widely is its poor support for high-cardinality tags; that is, tag keys with a large number of distinct values, such as IP addresses or other unique identifiers. Unfortunately, these types of values are also a prime use case for tags in the first place. You can read all about this issue on the KairosDB user group, as it’s one of the most well-known issues currently. A few months ago I gave a presentation, Building a Scalable Distributed Stats System, which describes a work-around for this issue when there’s a small number of high-cardinality tag keys.

However, the new use case requires a set of high-cardinality keys which is dynamic and unknown a priori. Since the KairosDB team is looking into fixing this issue but hasn’t actually resolved it, I wanted to investigate whether we could use MongoDB temporarily as a backing store behind the Kairos API. Why MongoDB? Because it’s easy to use, we know how to scale it (even if it’s painful), and atomic increments are a powerful bonus.

MongoDB Schema

The first task in evaluating MongoDB for this general-purpose use case is to propose a schema of sorts; we need something flexible enough to use the same underlying model and update operations as KairosDB while allowing efficient querying using MongoDB indexes. The initial schema looked something like:

{
  "timeStamp": <timestamp>,
  "name": <metric name>,
  "value": <metric value>,
  "tags": [
    "key1=val1",
    ...
  ]
}
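For concreteness, a stored document for the metrics used in the examples below might look like the following (the value shown is illustrative):

{
  "timeStamp": 1234,
  "name": "metric1",
  "value": 42,
  "tags": [
    "tag1=val1",
    "tag2=val2"
  ]
}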

You might be wondering why “tags” is an array of strings rather than a true subdocument. The answer is indexing. Ideally, we could use a hashed index on a proper “tags” subdocument; however, as the documentation states, “you cannot create compound indexes that have hashed index fields.” Instead, we try to use a multi-key index on an array of values. We can combine this multi-key index on tags with the timestamp and name to create a compound index by which we can query for specific metrics. If we call our collection metrics, then we create the index like so:

db.metrics.ensureIndex({timeStamp:1,name:1,tags:1})
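For reference, the hashed-index route we ruled out would have looked something like the sketch below (with “tags” as a proper subdocument); MongoDB rejects the index spec because, as quoted above, hashed fields can’t participate in compound indexes:

// Hypothetical alternative, not used: a compound index with a hashed "tags" field.
// MongoDB refuses to build this, which is why we fall back to the multi-key index above.
db.metrics.ensureIndex({timeStamp:1, name:1, tags:"hashed"})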

Query Plan Explanation

Before we went any further with this proof-of-concept, I wanted to understand whether these indexes were likely to be performant for our query and update operations. If you don’t know about MongoDB’s explain() operator, I want you to stop what you’re doing right now and go read: cursor.explain()

Finished? Good. Hopefully you can see where we’re going with this now. We’ll execute an example query with various documents in the database and let MongoDB walk us through the query operations.

Let’s get a baseline with an empty collection.

db.metrics.find({'timeStamp':1234,'name':'metric1','tags':['tag1=val1','tag2=val2']}).explain()

Amongst the output, you should see

"cursor" : "BtreeCursor timeStamp_1_name_1_tags_1 multi",
"isMultiKey" : true,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 0,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 0,
"scanAndOrder" : false,
"indexOnly" : false,

This confirms that we’re using our new multi-key compound index. Although the indexOnly=false line may look scary, it means that there are fields to be returned that aren’t in the index; namely, the value itself is stored in the document and must be consulted. This StackOverflow article helped me understand this output field better.

Let’s review the most important fields for our use case. From the documentation:

  • n is the number of documents that match the query
  • nscanned is the total number of index entries scanned
  • nscannedObjects is the total number of documents scanned

Since there are no index entries or documents yet, all three values are 0 initially.

Okay, now let’s add the first metric.

db.metrics.update({'timeStamp':1234,'name':'metric1','tags':['tag1=val1','tag2=val2']}, {'$inc':{'value':1}}, {upsert:true})

Here we’re just atomically incrementing the value field by one. Let’s run the same explain request to see what the query plan looks like now.

"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 2,

We see that the query now scans two index entries and one document. That seems reasonable: a multi-key index creates one entry per element of the tags array, so our single document with two tags accounts for two index entries.

What if we insert a record with a new name but the same tags?

db.metrics.update({'timeStamp':1234,'name':'metric2','tags':['tag1=val1','tag2=val2']}, {'$inc':{'value':1}}, {upsert:true})

The query plan for the original document would still look like

"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 2,

Great, so it seems like the query criteria are good at selecting only the single correct document.

Now let’s insert a record with the same name but a different value for “tag2”.

db.metrics.update({'timeStamp':1234,'name':'metric1','tags':['tag1=val1','tag2=other']}, {'$inc':{'value':1}}, {upsert:true})

Let’s look at the query plan now.

"n" : 1,
"nscannedObjects" : 2,
"nscanned" : 3,

Uh-oh. This doesn’t look too good. Adding one new value for the second tag increased the number of scanned index entries and documents by one.

What happens if we add a new value for “tag1” instead?

db.metrics.update({'timeStamp':1234,'name':'metric1','tags':['tag1=other','tag2=val2']}, {'$inc':{'value':1}}, {upsert:true})

Let’s look at the query plan now.

"n" : 1,
"nscannedObjects" : 2,
"nscanned" : 3,

Well, that’s not so bad. It’s the same as in the previous case. So, in the worst case, the number of scans increases linearly with the number of tag permutations.

If you continue with this exercise, you’ll start to understand the pattern. Essentially, each new tag value in the tags array adds a new entry to the index. Since the query is doing a range scan on the tags, the cost depends on where the new tag entry falls in the index. If it’s the last tag, it’s going to fall near or at the end, depending on the new and previous values of the final tag.
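To see the effect at scale, here’s a rough sketch (assuming the same metrics collection and index as above; the loop bound and tag values are made up) that upserts many documents differing only in the value of the second tag, then re-runs explain() on a query for a single document:

// Sketch: upsert documents that differ only in the tag2 value.
for (var i = 0; i < 1000; i++) {
  db.metrics.update(
    {'timeStamp':1234,'name':'metric1','tags':['tag1=val1','tag2=val' + i]},
    {'$inc':{'value':1}},
    {upsert:true}
  );
}

// n stays at 1, but nscanned grows with the number of distinct tag values.
db.metrics.find({'timeStamp':1234,'name':'metric1','tags':['tag1=val1','tag2=val500']}).explain()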

What we’ve learned is that MongoDB multi-key indexes don’t scale well with high-cardinality arrays. Since we’re in the same position as with Cassandra-backed KairosDB, it’s back to the drawing board for me.

It feels like others must have solved these stats problems before. From real-time pre-write aggregation to attaching high-cardinality metadata, we must be reinventing the wheel. Is everything that is good at these tasks proprietary?

What systems do you use for real-time stats?

Posted in Tutorials.


Libertarians? Greens? Lock ‘em Out!

The 2014 midterm elections seem to be bigger than in prior years, with more ads, robo-calls, and social media posts. During this turmoil, I learned a number of new things about the leading political parties that disgust me. At the top of the list were the Republicans’ and Democrats’ efforts to control access to the ballot.

The idea of “ballot access” control is that third parties will undercut votes from the “Big 2” parties. Specifically, the belief is that a vote for the Libertarian party is likely one less vote for the Republicans; likewise, the Democrats could lose votes to Green candidates. So, the story goes, it’s in the best interest of the two predominant parties to restrict other parties from being present on the ballot at all.

Continued…

Posted in Commentary.


Keep Out The Vote

In the chaos leading up to Election Day on Tuesday, we’ve all been inundated with Get Out The Vote messages from both parties.

This is supposed to be the parties’ way of encouraging citizens’ active participation in our great democratic society. So when Pretty Nerd and I got a call from one of Rauner’s people, we were initially pleasant and politely informed them that we were already voting, though not for Rauner. Imagine our surprise when, lo and behold, Rauner’s campaign caller responded with “just don’t go to the polls then.” She repeated this statement Continued…

Posted in Commentary.


Custom JMeter Samplers and Config Elements

tl;dr – Writing custom JMeter plugins doesn’t have to be complicated. This tutorial describes the process of developing a custom Sampler and Config Element. We develop a Kafka Producer Sampler and example Synthetic Load Generator Config Element. If you just want to send messages from JMeter to Kafka or see an example of generating synthetic traffic, you can go straight to the source.

So you want to load test a non-HTTP system. At first, you don’t think your favorite load testing tool, JMeter, will be of any help. But you remember that it’s open source and supposedly extensible. Let’s see if we can do this.

For my use case, I wanted a simple way to load test a system which reads its requests from Kafka. This has two requirements:

  1. read or generate synthetic requests (messages)
  2. publish the messages to a Kafka topic

For step 1, if I wanted to pre-generate all the requests, I could use the CSV Data Set Config to read them into JMeter. However, this would require generating a sufficiently-large request set for each test scenario. I preferred to let JMeter generate the actual request from a simple configuration describing the traffic distribution. This configuration could also be generated from real data to effectively simulate the shape of the data coming into the system. Thus, step 1 required development of a new “Config Element” in JMeter.

For step 2, there was no existing option for sending data to Kafka. But now you have one, so just use the Kafka Producer Sampler from kafkameter.

Let’s dig in.

Continued…

Posted in Tutorials.


Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB

Over the past several months, I’ve been leading an effort to replace our aging Scribe/MongoDB-based stats infrastructure with a more scalable, cost-effective solution based on Suro, Kafka, Storm, and KairosDB.

Let’s see what each of these pieces gives us:

  • Suro effectively replaces Scribe as the store-and-forward component, enabling us to survive the frequent network partitions in AWS without losing data.
  • We’ve introduced Kafka to serve as a queue between our stats producers and consumers, enhancing the reliability and robustness of our system while enabling easier development of new features with alternative stats consumers.
  • Storm is used to pre-aggregate the data before insertion into KairosDB. This drastically decreases the required write capacity at the database level.
  • We’re replacing MongoDB with KairosDB, which is a time-series database built upon Cassandra. This provides us with high linear scalability, tunable replication, and impressive write-throughput.

Last week, I discussed the last two components in this pipeline at Gluecon 2014 in Denver.

Title: Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB

Abstract: Many startups collect and display stats and other time-series data for their users. A supposedly-simple NoSQL option such as MongoDB is often chosen to get started… which soon becomes 50 distributed replica sets as volume increases. This session is about designing a scalable distributed stats infrastructure from the ground up. KairosDB, a rewrite of OpenTSDB built on top of Cassandra, provides a solid foundation for storing time-series data. Unfortunately, though, it has some limitations: millisecond time granularity and lack of atomic upsert operations which make counting (critical to any stats infrastructure) a challenge. Additionally, running KairosDB atop Cassandra inside AWS brings its own set of challenges, such as managing Cassandra seeds and AWS security groups as you grow or shrink your Cassandra ring. Join a deep-dive session where we’ll explore how we’ve used a mix of open-source and in-house tools to tackle these challenges and build a robust, scalable, distributed stats infrastructure.

If you want a peek into how these pieces fit together, peep the slides.

Continued…

Posted in Tutorials.


Should I max out my 401(k) or pay down student loans?

This is a common question new graduates ask. Although I graduated two years ago, I didn’t really run the numbers until recently… and boy am I disappointed in past-Cody for not doing this sooner.

The spreadsheet I used to answer this question for myself is below (but with fake numbers :-). Punch in your own numbers to see how much money you can save by increasing your 401(k) contributions. Now that I know better, I’m saving an extra $3,000 each year by maxing out my 401(k). How much can you save?

Continued…

Posted in Tutorials.


Hacking Twitter Competitions: Automatically Tracking Followers Count

Just before Christmas, a Chicago Food Truck decided to give away free sandwiches for a year to their 1000th Twitter follower.

Cheesies_Truck Competition Tweet

You know I had to try.

After checking it a few times over a 15-minute period or so, I noticed that the follower count was increasing very slowly. I knew I wouldn’t have the diligence to keep checking, so I decided to write a script that would do the check and notify me every ten minutes or so. Since I’m on a Mac, I decided to use Growl for these notifications.

In this post, I’ll walk you through how to automatically check a Twitter user’s follower count and get a Growl notification periodically.

What You Need Continued…

Posted in Tutorials.


Add to Goodreads from Amazon

Tired of splitting your reading wish-list between Amazon and GoodReads? Me too. Here’s an “Add to GoodReads” bookmarklet. Just highlight the code and drag it to your bookmark bar. You might have to right-click->Edit to give it a title like “Add to GoodReads”. This should work from Amazon product detail pages where you would otherwise click “Add to Wish List”.

View the code on Gist.

Instead of adding books to your Amazon Wish List, you can now add them to Goodreads instead. Yay!

Happy reading!

Posted in Tutorials.


Dependency Injection in Sinatra

Dependency injection (DI) is a very common development practice in many languages, but it’s never been huge in Ruby. Part of that is because Ruby is dynamic enough that it doesn’t really need dependency injection like, say, Java. But I argue that Ruby can greatly benefit from DI. Do you use a singleton configuration object? Or worse, other singleton objects, especially those with mutable state?

def some_method(*args)
  # Reaches into a global singleton for configuration
  foo = MyApp.configuration.foo
end

Mutable singletons have ripple effects across the app and make it very difficult (and scary) to evolve. Even mostly-read configuration objects introduce tight and often invisible/forgotten coupling between objects. Continued…

Posted in Tutorials.



