Marian Steinbach: Blog

What I learned about CouchDB

2011/11/11

With many people talking about CouchDB, I got curious and took a closer look. Map/Reduce in particular made me want to find out whether CouchDB would be a good solution for creating aggregated statistics from a large number of records. A presentation on SlideShare caught my eye, especially the structured keys and group levels (see slide 43). Since I am now maintaining three growing time series databases (radiation data from Japan and Germany plus air quality data), I wondered whether CouchDB would be an option for storing and processing that data.

Short disclaimer: I am by no means a CouchDB, NoSQL or DBMS expert. I usually make MySQL do the things I want, without ever having crossed the boundaries of a single server. All I did was some testing; I didn’t go into production. The reason I post my experience anyway is to foster discussion, learn from it and help others save time.

Data import

I imported data from CSV via couchdbkit (Python). A CSV file contained about 850,000 rows. Example row:

"1150000004";"2011-08-15 16:30:00";"38";"0"

The script I used is https://gist.github.com/1357839. Note that it uses bulk_save() to write 10,000 rows at once. A dry run without writing objects to CouchDB took 50 seconds; the actual run took 3 minutes 56 seconds. In comparison, MySQL needs 45 seconds to import the same data, including creating several indexes.
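For reference, the import logic boils down to something like the following sketch. It is not the exact script from the gist (that one uses a couchdbkit schema class, see below); the file and database names here are just examples.

import csv
from datetime import datetime
from couchdbkit import Server

# Assumptions: local CouchDB on the default port, target database "measures",
# input file "measures.csv" with semicolon-separated, quoted fields.
db = Server('http://127.0.0.1:5984/').get_or_create_db('measures')

docs = []
with open('measures.csv') as f:
    for station_id, timestamp, sa, ra in csv.reader(f, delimiter=';', quotechar='"'):
        d = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
        docs.append({
            'doc_type': 'Measure',
            'station_id': station_id,
            'datetime': [d.year, d.month, d.day, d.hour, d.minute, d.second],
            'sa': int(sa),
            'ra': int(ra),
        })
        if len(docs) == 10000:
            db.bulk_save(docs)   # one HTTP request per 10,000 documents
            docs = []
if docs:
    db.bulk_save(docs)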

This is what the row above looks like as a CouchDB document:

{
   "_id": "003fd8eca823e1b46995ef4ca3000ec5",
   "_rev": "1-6770700eb84c2df9f5eef3bd24348723",
   "doc_type": "Measure",
   "sa": 38,
   "ra": 0,
   "station_id": "1150000004",
   "datetime": [ 2011, 8, 15, 16, 30, 0 ]
}

The doc_type value “Measure” is added by couchdbkit according to the class I used for the objects in Python. This is of course useful to distinguish different types of documents within the same database.

The “datetime” field is represented as a list because I wanted to mimic the behaviour I saw in the presentation mentioned above. It wouldn’t be necessary, though: one could also model the field as a DateTimeProperty and later use the map function to create a structured key as needed.
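For illustration, such a couchdbkit schema could look roughly like this (a sketch, not the class from my actual script; DateTimeProperty would store the timestamp as an ISO 8601 string):

from couchdbkit import Document, StringProperty, IntegerProperty, DateTimeProperty

class Measure(Document):
    # couchdbkit adds doc_type = "Measure" (the class name) automatically
    station_id = StringProperty()
    datetime = DateTimeProperty()   # stored as an ISO 8601 string, not a list
    sa = IntegerProperty()
    ra = IntegerProperty()

The map function could then split that string into a structured key like [station_id, year, month, day, hour].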

Storage consumption

After importing my first chunk of 850,000 rows, the CouchDB database had a size of 283 MB. I tried compacting it, which caused CouchDB to increase its use of disk space up to 569 MB, only to reduce it to 288 MB. (If you do the math, you can see that the disk space required for compaction equals the space used by the database before compaction plus the compacted one. This is why it is advisable to run multiple smaller instances of CouchDB on the same server; otherwise CouchDB can become a big, unmanageable behemoth.) As for the amount of disk space used, I’m impressed in a bad way. Disk space might be cheap, but it’s not so cheap that it doesn’t matter.
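For completeness: compaction is just an HTTP POST against the database. A minimal sketch, assuming the requests library and a local CouchDB:

import requests

# Trigger compaction of the "measures" database on a local CouchDB.
# CouchDB copies the live data into a new file, hence the temporary extra disk usage.
requests.post('http://127.0.0.1:5984/measures/_compact',
              headers={'Content-Type': 'application/json'})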

The graph above illustrates the relation of disk space required by CSV, MySQL and CouchDB for roughly the same data.

The interesting thing here is that this is only the beginning. I haven’t created any views yet. Views can easily become multiple times as big as the underlying document store.

Querying the data

In CouchDB, views are what queries are in SQL. To start with something simple, I opened Futon, the CouchDB Admin interface, and created a temporary view based on this simple map function:

function(doc) {
   emit(doc.station_id, doc.sa);
}

and this reduce function:

_count

Creating this temporary view took about 4 minutes, which is kind of stunning. If I compare this to SQL, what this view does is similar to this:

SELECT station_id, COUNT(*) FROM measures
GROUP BY station_id

On the same machine (my MacBook) this takes 640ms, without any indexes.
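For the record, the same map/reduce can be stored permanently in a design document and queried grouped by key. A sketch using couchdbkit; the design document and view names are arbitrary:

from couchdbkit import Server

db = Server('http://127.0.0.1:5984/').get_or_create_db('measures')

# Persist the map/reduce from above in a design document
# (re-saving an existing design doc requires its current _rev).
db.save_doc({
    '_id': '_design/stats',
    'language': 'javascript',
    'views': {
        'by_station': {
            'map': "function(doc) { emit(doc.station_id, doc.sa); }",
            'reduce': '_count',
        }
    }
})

# Equivalent of SELECT station_id, COUNT(*) ... GROUP BY station_id:
for row in db.view('stats/by_station', group=True):
    print(row['key'], row['value'])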

It seems that my database containing 850,000 documents is much too big for development purposes. The actual, untruncated database that I took the test data from currently contains 7.8 million rows. How would I test and get an idea of response times at that scale? Do I need a cluster to find out?

After several coffee breaks, I had finally created one meaningful view with a reduce function that gave me daily, weekly or hourly mean values. Once the view was built, querying it seemed quite fast. But, honestly, I didn’t bother to measure. After creating two simple views, the disk consumption was at 1.6 GB.
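The idea behind that view was roughly the following: emit a structured key like [station, year, month, day, hour], reduce over it, and pick the aggregation level at query time via group_level. What follows is a sketch of the approach, not the exact functions I wrote back then; it uses the built-in _stats reduce, so the mean is simply sum divided by count:

from couchdbkit import Server

db = Server('http://127.0.0.1:5984/').get_or_create_db('measures')

db.save_doc({
    '_id': '_design/means',
    'language': 'javascript',
    'views': {
        'sa_by_time': {
            # Key: [station, year, month, day, hour]; group_level picks the granularity.
            'map': """function(doc) {
                if (doc.doc_type == 'Measure') {
                    var d = doc.datetime;  // [year, month, day, hour, minute, second]
                    emit([doc.station_id, d[0], d[1], d[2], d[3]], doc.sa);
                }
            }""",
            'reduce': '_stats',   # built-in reduce: sum, count, min, max, sumsqr
        }
    }
})

# group_level=4 -> per station and day, group_level=5 -> per station and hour.
for row in db.view('means/sa_by_time', group_level=4):
    station, year, month, day = row['key']
    stats = row['value']
    print(station, year, month, day, float(stats['sum']) / stats['count'])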

Here is where my journey ended. Maybe I quit just where the fun would have started, but right now CouchDB doesn’t seem to be an option for storing my rather small but growing databases. The alternative will most likely be to stick with MySQL, calculating aggregates using background workers and writing the results to MySQL tables as well. With time series data, there is always the option to archive the original fine-grained values and only keep aggregated data for past periods. And given the space efficiency of MySQL, it seems as if I could make much better use of disks and RAM than I could with CouchDB.

What puzzles me though is how others cope with CouchDB. How do people test their views? Is anyone working locally during development? Did I do a foolish thing?

12 Comments

Bradley Holt on 2011/11/11 at 18:18h GMT:

“How do people test their views?”

There are a couple of options here. You could load only a small subset of your data into your development environment. This would make your temporary views much faster to work with in development. Alternatively, I believe Couchbase Single Server has a feature that lets you run temporary views against only a subset of data from within Futon.

Ewan Makepeace on 2011/11/12 at 05:21h GMT:

Some quick math:

1) The example document you show above is 222 chars in length.
2) Assuming it is typical, 850,000 similar documents should take about 188MB (222 x 850,000).
3) However due to the append only nature of Couch, every time you write a new batch of documents, many of the B-Tree index nodes must be rewritten (and they also append, not overwrite).
4) So after loading the data you could easily end up at 288MB as you did – B-Tree index nodes are quite large as it is a very flat tree.
5) When you compact it copies the last revision of every node and document to a new file and only then removes the old file, so yes you need the sum of the two file sizes in space to do this.
6) However after compaction I would expect you to have fallen back to around 190MB and I am surprised by that number. Are you sure compaction succeeded?

So – the database size is largely driven by your document structure. Fields like _id, _rev and doc_type make up more than half the characters in the document – obviously the ratio gets better if you put more information into each document. Nevertheless, naming every field explicitly (rather than having a single common schema as in a relational database) will always be expensive.

Temporary views are not recommended for anything – I suspect they are written to disk for a one-time access and then deleted, which explains the slow speed and makes the comparison to a SQL SELECT query less fair. Views in CouchDB are expected to be created incrementally and cached to disk, making subsequent access very fast (just like indexes in relational databases, but without the option to run queries that don’t have indexes defined).

Note also that when you edit your view, the old view index on disk is not automatically deleted; you need to execute a view cleanup command for that. Most of that 1.6GB of disk use was probably your false starts in creating your view.
(This sounds silly, but views are saved using an MD5 hash of the view code as the file name, such as a344011eb1c07df5a78af55337034bd2.view, which guarantees that the filename changes as you edit the function. This is an elegant way to guarantee that wrong data is never used, but it easily creates orphan index files if you don’t clean up.)
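For reference, view cleanup and per-design-document view compaction are plain HTTP POSTs. A quick sketch in Python using the requests library; the database and design document names are placeholders:

import requests

headers = {'Content-Type': 'application/json'}
base = 'http://127.0.0.1:5984/measures'   # database name is just an example

# Delete index files whose view code is no longer referenced by any design document:
requests.post(base + '/_view_cleanup', headers=headers)

# Compact the view index of a single design document (here _design/stats):
requests.post(base + '/_compact/stats', headers=headers)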

So the bottom line is:

1) CouchDB is a disk hog.
2) If you let your database size exceed free disk space you cannot even compact your database any longer.
3) Frequent small writes will also cause file size to balloon.
4) Creating new Views is very slow.
5) However accessing existing views can be very fast.
6) When developing your view code you probably want to work on a subset of the database (easily created with a filtered replication, for example – see the sketch after this list).
7) You also need to manually clean up your old view files.
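Here is a rough sketch of such a filtered replication (Python with the requests library; the filter criterion, database and design document names are only placeholders):

import requests

headers = {'Content-Type': 'application/json'}
couch = 'http://127.0.0.1:5984'

# A filter function that only lets a small sample of documents through
# (here: a single station; the criterion is just a placeholder).
requests.put(couch + '/measures/_design/devfilter', headers=headers, json={
    'filters': {
        'sample': "function(doc, req) { return doc.station_id == '1150000004'; }"
    }
})

# Replicate only the filtered documents into a small development database.
requests.post(couch + '/_replicate', headers=headers, json={
    'source': 'measures',
    'target': 'measures_dev',
    'create_target': True,
    'filter': 'devfilter/sample',
})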

I am still learning myself and hope others will chime in with tips to avoid potential converts being demotivated on their first day…

Christian Trabold on 2011/11/13 at 14:58h GMT:

Hi Marian,

nice write-up! I couldn’t warm to CouchDB either, for several reasons.

On the other hand: Your data structure looks quite simple – not like “documents”. Maybe CouchDB is the wrong tool for your needs?!

If you just want to query the data, I’d go for Redis: http://redis.io

Although I’m not a Redis PRO, I’m quite confident that you could make great use of the RANGE commands for your daily, weekly etc. reports.

Give it a spin here (no installation required): http://try.redis-db.com/

Would be cool to have a comparison between MySQL and Redis ;)

Cheers

Christian

Marian Steinbach on 2011/11/16 at 12:26h GMT:

Thanks for the very helpful comments so far! They kept me from prematurely abandoning CouchDB :) I guess the main problem is that my mental model (coming from an SQL world) doesn’t apply to Couch.

@Christian Most likely I won’t look at Redis right now, but maybe some time? We’ll see.

qyl on 2011/11/30 at 21:01h GMT:

Thanks for your analysis. I opened a post on the Couchbase forum on a similar topic; maybe it will add something to the discussion:
http://www.couchbase.org/forums/thread/couchbase-performance-real-tests-its-horrible-or-am-i-doing-somethin-wrong

Henner on 2012/01/04 at 13:13h GMT:

Hi,
I’d just like to point to another tool that works nicely with CouchDB: http://www.elasticsearch.org/
ElasticSearch allows proper index-based search on CouchDB documents, and the setup is done quite quickly.

Marian Steinbach on 2012/01/04 at 23:12h GMT:

Henner, although your comment looks semi-spammy, I approved it, since I wish those guys (you one of them?) luck with their service.

Update Nov 2013: Obviously I misread the URL you posted. I am aware of ElasticSearch and I’m using it in various projects. Thanks for the comment!

Tom on 2012/01/27 at 11:07h GMT:

I’ve just finished compacting a CouchDB database with approximately 18 million rows of data.

The database was built up over approximately 1 month, with inserts ONLY (no updates or deletes).

The original size of the database before compaction was 130GB.

Compaction took approximately 24 hours on a ProLiant DL585 server (with a slow SCSI array).

Final database size after compaction is 7.3GB. **** 93% REDUCTION *****

I have heard rumours that CouchDB might be working on optional auto-compaction, does anybody know if this is true?

kimberlad on 2012/04/04 at 18:16h GMT:

We are seeing the same issue with disk space usage. However, outside of this, CouchDB fits our use case very well (analytics of a JSON document feed) and is very easy to use. We’re looking to implement BigCouch to help speed up compaction, i.e. a number of smaller servers rather than one big one with loads of disk space. We are also looking at pre-reduction to reduce the number of documents, i.e. aggregate every 10 or 20 documents. However, we’re also hoping that this is on the CouchDB wish list for improvements :-)

arete74 on 2013/11/28 at 15:29h GMT:

Hi, can you share your view for the daily, weekly or hourly mean values?

regards
Gerardo Di Iorio

marian on 2013/11/28 at 15:39h GMT:

@arete74 Sorry, but I can’t. The post is two years old, I haven’t worked with CouchDB since and I haven’t kept the files.

arete74 on 2013/11/29 at 17:53h GMT:

#marian
ok, thanks
