Thoughts on InfluxDB for Analytics

Analytics means different things to different people...

So I was turned on to InfluxDB by a friend I think. Or Google. Either way I loved it at first glance. I still love it. However, I’m not sure it’s fair for them to put “analytics” on their home page under “what” InfluxDB is used for.

InfluxDB is a time series database and a damn good one at that. Built with the ability to use LevelDB or RocksDB (or HyperlevelDB I believe?), it is fast. Primarily it’s a key/value store, but there are a lot of nice aggregates and queries baked into it. The idea of Lua scripts running in there (for map/reduce type stuff, etc.) also comes up in discussions. When that will be implemented is anyone’s guess.

The progress of the project is fairly steady. I came back a few months after first discovering it and they had moved on to RocksDB as their primary storage engine over LevelDB. They add new features all the time.

Continuous queries are my favorite feature, and then there are a few aggregation queries available. It groups by date, so that’s nice for graphing. The other thing I really love about it is the fact that it automatically expires old data. On the same note, you can also define how data is broken up by time, essentially controlling the size of files on disk, which is really great for tuning things (locking and open files).

So that’s great and graphing my system resource usage on my servers in real-time is pretty awesome. It’s refreshing to have something more flexible than RRD.

Only that’s not my use case. My use is for the “analytics” part described on the home page. Specifically, social media analytics. I want to know what the most mentioned hashtag or the most shared link on Twitter was. You can’t quite get that with InfluxDB. It would require multiple queries and, aside from a loss in performance, you’re also now doing more work in application code. So there’s zero convenience and your technical debt goes up.

InfluxDB has a TOP() aggregate function to return the top n values for a field. Great! But that’s not occurrences, that’s highest value. So if your CPU spiked to 99.98% and 99.95%, you’d get those values before the lower values if you took 5. Plus, it only shows that value. You can’t exactly comma separate that with other fields like in a normal SQL query. Kinda useless if you get a list of the highest recorded values without the rest of the document providing context, right? Well, the context is the fact that you’re supposed to know what you queried for.

So you do this in your application code, making a few queries. The problem is, you can’t sort. Everything is always returned by time (which is great for many needs, but not all). So you end up with a huge response. A response that comes back over HTTP. This is wildly inefficient if what you’re grouping by has many results. In my case, if I were to group by shared URLs, I’ve seen 30,000+ for just a few days of data. Can you imagine returning 30,000 rows in a response just to weed out 29,900 of them?
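
To give a rough idea of the work that lands in application code, here is a minimal Go sketch of counting occurrences and keeping only the top few. The TopN function, the Count type, and the sample URLs are all hypothetical names made up for illustration. It assumes you have already pulled every value out of the (huge) query response into a slice.

```go
package main

import (
	"fmt"
	"sort"
)

// Count pairs a value (for example a shared URL) with how often it appeared.
// Hypothetical type, not part of any InfluxDB client.
type Count struct {
	Value string
	N     int
}

// TopN tallies occurrences of each value and returns the n most frequent,
// ordered by count descending. Ties are broken alphabetically so the
// result is deterministic.
func TopN(values []string, n int) []Count {
	tally := make(map[string]int)
	for _, v := range values {
		tally[v]++
	}

	counts := make([]Count, 0, len(tally))
	for v, c := range tally {
		counts = append(counts, Count{Value: v, N: c})
	}

	sort.Slice(counts, func(i, j int) bool {
		if counts[i].N != counts[j].N {
			return counts[i].N > counts[j].N
		}
		return counts[i].Value < counts[j].Value
	})

	if n < 0 {
		n = 0
	}
	if n > len(counts) {
		n = len(counts)
	}
	return counts[:n]
}

func main() {
	// Made-up sample data standing in for URLs taken from a query result.
	urls := []string{
		"a.example/1", "b.example/2", "a.example/1",
		"c.example/3", "a.example/1", "b.example/2",
	}
	for _, c := range TopN(urls, 2) {
		fmt.Printf("%s %d\n", c.Value, c.N)
	}
}
```

The code itself is not hard. The point is that every row still has to cross the wire before the tally can even start, which is exactly the part a database-side sort and limit would save you.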

Without some of these basic features, I just don’t think InfluxDB is cut out to be an “analytics” database. If you have anything other than basic aggregate needs, Postgres with partitions is looking like a better bet today.

Another thing that annoys me, specifically with InfluxDB and Go, is that you get results returned as an interface{} type. So you basically need to use reflection to get to usable data. The handy mapstructure package helps a lot in this case. It makes it trivial to map the query result to a struct. Though I wish, like many other database packages, it would do this mapping for you.
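
For a sense of what the manual route looks like without a helper package, here is a rough sketch using only the standard library and type assertions. Treat everything in it as an assumption for illustration: the response shape (a series with column names and rows of points), the Series and Mention types, and the column names are mine, not the client’s actual API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Series mirrors an ASSUMED response shape: column names plus rows of
// untyped values. Check the real response before relying on this layout.
type Series struct {
	Name    string          `json:"name"`
	Columns []string        `json:"columns"`
	Points  [][]interface{} `json:"points"`
}

// Mention is a hypothetical struct we would like to end up with.
type Mention struct {
	Time    int64
	Hashtag string
	Count   int64
}

// ToMentions converts untyped rows into Mention values by looking up each
// column's position and asserting the expected type for every cell.
func ToMentions(s Series) ([]Mention, error) {
	idx := make(map[string]int, len(s.Columns))
	for i, c := range s.Columns {
		idx[c] = i
	}
	for _, name := range []string{"time", "hashtag", "count"} {
		if _, ok := idx[name]; !ok {
			return nil, fmt.Errorf("missing column %q", name)
		}
	}

	out := make([]Mention, 0, len(s.Points))
	for i, p := range s.Points {
		if len(p) != len(s.Columns) {
			return nil, fmt.Errorf("row %d has %d values, want %d", i, len(p), len(s.Columns))
		}
		// encoding/json decodes JSON numbers into float64 when the
		// target is interface{}.
		t, okT := p[idx["time"]].(float64)
		h, okH := p[idx["hashtag"]].(string)
		c, okC := p[idx["count"]].(float64)
		if !okT || !okH || !okC {
			return nil, fmt.Errorf("unexpected value types in row %d", i)
		}
		out = append(out, Mention{Time: int64(t), Hashtag: h, Count: int64(c)})
	}
	return out, nil
}

func main() {
	// Made-up payload in the assumed shape.
	raw := []byte(`[{"name":"mentions","columns":["time","hashtag","count"],
		"points":[[1400000000,"golang",12],[1400000060,"influxdb",7]]}]`)

	var series []Series
	if err := json.Unmarshal(raw, &series); err != nil {
		panic(err)
	}
	mentions, err := ToMentions(series[0])
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", mentions)
}
```

That is a lot of ceremony for three fields, which is why a package that maps the result onto a struct for you is so welcome.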

If InfluxDB adds sorting on other fields and the Lua support for map/reduce type stuff, then bingo, we’ve got enough for an analytics database.

Of course, I don’t want to see it just stop with Lua support because that’s annoying to work with too. I’d rather go use Aerospike or something if I have to bury my head in Lua. Convenience is not to be overlooked when building a database. This is why MongoDB has done so well.

The good news is InfluxDB does add new features all the time. Despite what looks like an explosion in their GitHub Issues section, things get done. It’s fast and stable. I do like it. I’d just say use it for “timeseries” and “metrics” for now though. We’re not quite at the “analytics” level. Yet.
