Pulling data from CouchDB to a Relational Database made easy with _changes
(This is the first post in Dimagi’s new tech blog series, where we shift the focus from the impact of our projects to the technology side of our work.)
Much has been written and debated about the various advantages and disadvantages of different document databases, and by far the most common choice in evaluating a document DB is a choice between CouchDB and MongoDB. This post isn’t going to get into that debate, but at Dimagi we’ve been using CouchDB succcessfully in multiple production projects for over a year now and overall we’re really happy with it. CouchDB has some features that made it an obvious choice for our products, and in particular its crash recovery and built-in replication have been life-savers. But for each time we’ve sung CouchDB’s praises and sworn our alleigance, we’ve cursed it for making certain remedial tasks truly painful.
Perhaps at the top of the list of “things that are annoying in CouchDB” is general reporting. Here we’ll look at how to work around this problem by hooking into CouchDB’s _changes feed and storing your data redundantly in something that’s much easier to work with (a relational database).
First, the problem
Couch’s map/reduce is great for extracting aggregated data from a very specific set of information in your database, but is terrible for things like “show me all data between January and June 2011, broken down by user and location”. One thing CouchDB does to help with this is let you use a complex key, to provide different levels of aggregation of your data. You can get pretty far with complex keys if you know the order of things you want to query.
But what if you want to aggregate on different things? Sometimes by place, then by year, sometimes by year, then by place. Unfortunately, because aggregation only happens in the order defined by the keys, this isn’t really possible natively. Instead, you’re left with a couple unattractive options:
Option 1: Writing a new view that provides access to the data with a different key hierarchy for each possible order:
view_by_place.js
emit([doc.place, doc.year], doc.value);
view_by_year.js
emit([doc.year, doc.place], doc.value);
Now you can choose which view to query based on the ordering, but this adds a lot of overhead both in terms of duplication of code, needless extra views, and remembering which one to query.
Option 2: Using the first element of the key as an index into the rest of the key.
This is a trick where you encode the logic directly into the view using the first element of the key.
view_both.js
emit(["place_first", doc.place, doc.year], doc.value);
emit(["year_first", doc.year, doc.place], doc.value);
Now you can query with something like key=[“place_first”, “boston”] or key=[“year_first”, 2011]. But, again, this is a pretty cluttered approach and as the number of different orders increases can become quite arduous to manage. It also deeply couples your queries to your views, and makes it challenging to add new queries later.
What we really want is a Plain Old Relational Database (PORD) to write arbitrary queries against. It turns out that this is actually quite easy.
_ch _ch _ch _ch _changes!
One of the the main engines inside CouchDB is the _changes feed. This thing powers all of replication and makes it really easy to build ancillary applications on your database. Essentially it is a line-by-line feed of all activity that has occurred in your database (pruned for redundancy). With this you can easily write a standalone tool that listens for changes to the database and does something with them – in this case storing data in a Plain Old Relational Database.
The pattern is very simple:
1. Connect to _changes feed
2. For every line in the feed:
a. Determine if you want to act on it
b. (If yes) do something about it
The details
We use couchdbkit and django to interact with CouchDB and our Relational Database. Couchdbkit makes interacting with the _changes feed trivial through the Consumer object. First we define a callback function that we want executed for each line (which actually comes in as a dictionary in couchdbkit). Our callback will get the document associated with the line and, do an export to a django model. Some details are emitted for clarity:
def export(line):
# line is like:
# {"seq":3,"id":"test3","changes":[{"rev":"1-02c6b758b08360abefc383d74ed5973d"}]}
doc_id = line.get("id")
doc = db.get(doc_id) # assumes db is defined
if doc.exportable: # assumes doc has some property "exportable"
# and some other properties to query on
MyModel.objects.create(id=doc.get_id, place=doc.place, date=doc.date)
Once we have this function we initiate a new Consumer object and reference the callback:
c = Consumer(db)
c.wait(cb=export)
That’s it! This process just needs to be started and will happily sit for all of eternity waiting for incoming changes and immediately act on them by creating models in the database.
There are some additional optimizations and details to this (see below), but that’s the entire gist of it.
Pruning the data with filters
If you know you only care about data matching a certain pattern, then you can use CouchDB filters [LINK: http://guide.couchdb.org/draft/notifications.html#filters] to pre-restrict the data so that only documents matching a certain filter show up in it. For example, instead of doing the “doc.exportable” check in python we can only include documents with a property of exportable set to true in the feed using the following:
foofilter.js
function (doc, req) {
return doc.exportable == true;
}
Anything that doesn’t return “true” will not be included in the feed. This can negate the need for the check in the export function, since we can use the filter to only include relevant documents.
Django integration
Since we’re going to save our objects in the relational database we want to be able to run this in a bootstrapped django environment. The easiest way to do this is with a management command.
mymodel_export.py
from django.core.management.base import LabelCommand
from couchdbkit.client import Database
from couchdbkit.consumer import Consumer
from myapp.models import MyModel
db = Database("couchdbname")
def export(line):
# line is like:
# {"seq":3,"id":"test3","changes":[{"rev":"1-02c6b758b08360abefc383d74ed5973d"}]}
doc_id = line.get("id")
doc = db.get(doc_id) # assumes db is defined
# and some other properties to query on
MyModel.objects.create(id=doc.get_id, place=doc.place, date=doc.date)
class Command(LabelCommand):
def handle(self, *args, **options):
c = Consumer(db)
c.wait(filter="myapp/foofilter", cb=export)
There are a few more details to pay attention to – in particular you don’t want to act on documents that have been deleted (which can still show up in the _changes feed) or old revisions of documents, but for the most part the code above is fully functional. Couch to SQL in 20 lines of code!
p.s. Our use case was storing the structured parts of unstructured xform submissions in django so that we can report on them. The full code that provides the skeleton for this example is available on github, and in particular the management command.
Share
Tags
Similar Articles
Another day, another Zero Day: What all Digital Development organizations should take away from recent IT security news
Even if you don’t work as a software developer, you probably heard about recent, high profile security issues that had IT Admins and developers frantically patching servers over the holidays and again more recently. Dimagi's CTO shares what these recent issues mean for Digital Development organizations.
Technology
January 28, 2022
Join the fight to support critical open source infrastructure
Open Source tools are a critical piece of global infrastructure, and need champions for long term investment
Technology
March 17, 2020
Two big lessons that Iowa and Geneva can teach us about technology in digital development
Last week brought two high profile technology failures into the global spotlight. Although these two mishaps may seem quite different at first glance, they both highlight challenges that are inherent in providing software in the public sector (regardless of locale) and illustrate cautionary lessons worth discussing for practitioners in Digital Development. The Iowa Caucus Debacle
Technology
February 7, 2020