occasionally useful ruby, ubuntu, etc

24 Aug 08

CouchDB and logs…

I've taken it upon myself to do some log analysis at work. This involves taking a large volume of logs (multiple gigabytes) and somehow organizing them so that the data stored within them is readily available. This is no small task. At present, there are thousands of individually gzipped files, and the best way to find what you're looking for is essentially to run 'zcat *.gz | grep -in "session=123"', which is, frankly, absurd. In some types of log files, you have 15 lines that all pertain to the same log record, with each line in the format KEY=monkeyfacevalue, and in others you have everything on the same line, with key/value pairs comma-delimited. It doesn't really matter, but there are key-value pairs in both cases, and no two log records necessarily have the same set of keys. This sounded like an opportunity to try out a new type of database... CouchDB.
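
To make the two log shapes concrete, here is a minimal JavaScript sketch of turning either one into a plain key/value object. The function names and the sample keys (session, user, status, action) are made up for illustration, and it assumes values never contain commas or newlines, which may not hold for real logs.

```javascript
// Parse a multi-line record where every line has the form KEY=value.
function parseMultiLineRecord(text) {
  const record = {};
  for (const line of text.split("\n")) {
    const idx = line.indexOf("=");
    if (idx === -1) continue; // skip lines that carry no key/value pair
    record[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return record;
}

// Parse a single-line record whose key=value pairs are comma-delimited.
// Assumption: values never contain commas.
function parseSingleLineRecord(line) {
  return parseMultiLineRecord(line.split(",").join("\n"));
}

const multiLine = parseMultiLineRecord("session=123\nuser=alice\nstatus=ok");
const singleLine = parseSingleLineRecord("session=123, action=login");
console.log(multiLine, singleLine);
```

Either way you end up with a bag of keys that differs from record to record, which is exactly the shape a schema-less store is meant to hold.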

But, in the end, I think CouchDB is not the best way to solve this problem...

CouchDB is what's called a document-oriented database. This is as opposed to relational databases, like MySQL, PostgreSQL, Oracle, even SQLite. For convenience I'm going to abbreviate each as DOD and RD, respectively.

How they differ:

  1. DODs are schema-less. This means there are no columns/fields to speak of with size constraints. A "document" is a record in the database, which is essentially a hashmap (or associative array if you prefer) with a couple other little features (e.g. revisioning).
  2. There are no "queries" in DODs (CouchDB, at least). If you were curious about how something without rigid structure could be fast, this is how. Instead of queries, you have "views", which are only slightly related to views in the RD sense. There are two types of views -- permanent views and temporary views. A view is the result of calling a user-defined function on every record in the database -- if the function evaluates to true, the record is included in the view. Permanent views are stored in the form of design documents within the database itself. Every time you add or modify a record, every view function is called on that record to (re)evaluate whether that record should be included in a particular view. Temporary views are basically the same, but not stored permanently in a design document -- they disappear after they're used.
  3. Some other things, see Wikipedia or their official page.
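
As a rough sketch of the points above: documents are just hashmaps with differing keys, and a permanent view lives in a design document as function source text. Note that in CouchDB itself a view function calls emit(key, value) for each record it wants included, rather than returning true. The document ids, the design document name, the view name and the runView helper below are all hypothetical; runView is a toy stand-in for the view engine, not CouchDB code.

```javascript
// Three "documents" with no shared schema.
const docs = [
  { _id: "rec-1", session_id: "123", user: "alice", status: "ok" },
  { _id: "rec-2", session_id: "456", action: "login", duration_ms: 42 },
  { _id: "rec-3", note: "no session on this one" },
];

// A design document holding one permanent view (names are made up).
const designDoc = {
  _id: "_design/logs",
  language: "javascript",
  views: {
    with_session: {
      map: "function(doc) { if (doc.session_id) emit(doc.session_id, null); }",
    },
  },
};

// Toy view engine: compile the stored source and call it on every document.
function runView(mapSource, documents) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  const mapFn = new Function("emit", "return (" + mapSource + ");")(emit);
  documents.forEach((doc) => mapFn(doc));
  return rows;
}

const rows = runView(designDoc.views.with_session.map, docs);
console.log(rows);
```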

So here's the problem. You can't just say "Show all blog posts by user X", because that would require you to pass a variable. Not a big deal; you create a view that returns true if the blog post belongs to that user. But still, you have to create a view for every single user if you want to be able to select all the blog posts by any particular user -- I'd imagine that could take up a lot of space, and CPU cycles. Alternatively, you could have a view that just returned all blog posts, and then filter through those in your application. Neither would be that good though, I feel, but I might be missing something.

Similarly, in a log file, you can't say "Give me all records with session ID Y" -- you'd need a view for that, and a temporary view would have to be generated on the fly, and the permanent view...well, you'd want a permanent view for every session ID, and there could be tens of thousands of those. Plus every time you add a new record, you'd have to run the view functions for all those on it, and that could take a while.

On the bright side, CouchDB doesn't have any locks in it (helps that it was written in Erlang). This means that no matter who or how many are reading or writing, nothing will ever get blocked. Additionally, I guess you can seamlessly distribute the contents of the database across multiple servers and get redundancy pretty easily, but I haven't looked into that too much. Also, CouchDB is technically Alpha, so not all of the functionality has actually been implemented yet, like interfacing with Apache Lucene (a full-text indexer search thing).

So CouchDB looks like it has a lot of potential, but I'm not exactly sure what for, unless you never have that many documents in the database or don't have that many different parameters you want to search on.

Comments (2) Trackbacks (0)
  1. Heya,

    you got it all wrong :) Let’s see:

    “But still, you have to create a view for every single user if you want to be able to select all the blog posts by any particular user — I’d imagine that could take up a lot of space, and CPU cycles. Alternatively, you could have a view that just returned all blog posts, and then filter through those in your application. Neither would be that good though, I feel, but I might be missing something.”

    You’re close on this one: Create a view that returns all blogs, with the user as a key:

    map: function(doc) { emit(doc.blog_user_id, null); }

    A view with this function can be queried with the key=$username parameter, which gives you your filtering. This is the preferred way of doing it and this use-case is lightning fast.
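
A minimal in-memory sketch of the idea, in plain JavaScript rather than against a real CouchDB: one view keyed by user, then a lookup by key. The sample posts and the buildIndex and queryByKey helpers are made up for illustration, and emit is passed in as an argument here, whereas in CouchDB it is provided to the map function by the view server.

```javascript
const posts = [
  { _id: "post-1", blog_user_id: "alice", title: "First" },
  { _id: "post-2", blog_user_id: "bob", title: "Hello" },
  { _id: "post-3", blog_user_id: "alice", title: "Second" },
];

// Same body as the map function above, with emit as a parameter.
function mapByUser(doc, emit) {
  emit(doc.blog_user_id, null);
}

// Run the map function once per document and keep the rows sorted by key.
function buildIndex(documents, mapFn) {
  const rows = [];
  for (const doc of documents) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// Stand-in for querying the view with key=$username.
function queryByKey(index, key) {
  return index.filter((row) => row.key === key);
}

const userIndex = buildIndex(posts, mapByUser);
const alicePosts = queryByKey(userIndex, "alice").map((row) => row.id);
console.log(alicePosts);
```

The point is that a single view covers every user: the username is a lookup key into one sorted index, not a parameter baked into the view function.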

    The session id problem is the same:

    map: function(doc) { emit(doc.session_id, null); }

    You can also do range queries with startkey=$start&endkey=$end in case of sequential keys (say a date).
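
A rough sketch of a range query over date keys, again simulated in plain JavaScript. The records, the date field and the helper names are hypothetical. It assumes both ends of the range are inclusive and that dates are stored as ISO strings so they sort correctly as text. Keys are JSON-encoded when sent as query parameters, which is what toQueryString illustrates.

```javascript
const logDocs = [
  { _id: "rec-1", date: "2008-08-20", session_id: "123" },
  { _id: "rec-2", date: "2008-08-22", session_id: "456" },
  { _id: "rec-3", date: "2008-08-24", session_id: "123" },
  { _id: "rec-4", date: "2008-08-27", session_id: "789" },
];

// Rows a map function like function(doc) { emit(doc.date, null); } would produce.
const dateIndex = logDocs
  .map((doc) => ({ id: doc._id, key: doc.date, value: null }))
  .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

// Stand-in for startkey=$start&endkey=$end (both ends inclusive here).
function rangeQuery(index, startkey, endkey) {
  return index.filter((row) => row.key >= startkey && row.key <= endkey);
}

// Build the query string part of a view request; keys are sent as JSON.
function toQueryString(startkey, endkey) {
  return (
    "startkey=" + encodeURIComponent(JSON.stringify(startkey)) +
    "&endkey=" + encodeURIComponent(JSON.stringify(endkey))
  );
}

const inRange = rangeQuery(dateIndex, "2008-08-21", "2008-08-24").map((r) => r.id);
console.log(inRange, toQueryString("2008-08-21", "2008-08-24"));
```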

    Notice `null` as the second parameter of the emit() function. This means the view will stay lean and mean, but it also means that you need to fetch the actual document data with another request. If you can afford your view indexes to be big, you can put in the `doc`, instead of `null`. This is simply a trade-off.

    In the `null` case where you’d need to fetch the doc data subsequently, you might wonder “what if I have a range with 10k docs”. Yes, fetching them individually would be a pain. There is a patch in review at the moment that will let you fetch all the associated doc data with a view result without actually storing it in the view. Watch out for that. For N < 100(0) docs, individual fetches are okay though.
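
The trade-off can be sketched like this, with a Map standing in for the database. Everything here (the store, the helper names, the fetch counter) is hypothetical; each store.get call stands in for one HTTP request per document in the lean case.

```javascript
const store = new Map([
  ["rec-1", { _id: "rec-1", session_id: "123", status: "ok" }],
  ["rec-2", { _id: "rec-2", session_id: "456", status: "error" }],
  ["rec-3", { _id: "rec-3", session_id: "123", status: "timeout" }],
]);

// Build view rows; includeDoc=false mimics emit(doc.session_id, null),
// includeDoc=true mimics emit(doc.session_id, doc).
function buildSessionView(db, includeDoc) {
  const rows = [];
  for (const doc of db.values()) {
    rows.push({ id: doc._id, key: doc.session_id, value: includeDoc ? doc : null });
  }
  return rows;
}

let fetchCount = 0;

// Lean view: small index, but one extra lookup per matching row.
function queryLean(db, sessionId) {
  return buildSessionView(db, false)
    .filter((row) => row.key === sessionId)
    .map((row) => {
      fetchCount += 1; // one follow-up request per document
      return db.get(row.id);
    });
}

// Fat view: bigger index, but the documents come back with the rows.
function queryFat(db, sessionId) {
  return buildSessionView(db, true)
    .filter((row) => row.key === sessionId)
    .map((row) => row.value);
}

const leanDocs = queryLean(store, "123");
const fatDocs = queryFat(store, "123");
console.log(leanDocs.length, fatDocs.length, fetchCount);
```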

    Yes, documentation of all that could be improved, but there are only so many days in an hour and hey, this is open source (lame excuse of the day :) ).

    Feel free to contact me about further questions, but you’re best off subscribing to the CouchDB mailing lists (http://incubator.apache.org/couchdb/community/lists.html) or hopping onto #couchdb on irc.freenode.org.

    Cheers
    Jan

  2. Sweet, I’m glad to learn I’m wrong if it means that things are better than I thought!

    I’m a fan of IRC, so I think I’ll jump on there…

