I’m a strong believer in data portability for software as a service applications. As a consumer, I never want to use a service that won’t let me get my data out. And as a provider, I’d much prefer to compete on features, quality, customer service, etc than on locking in customers. Additionally, it takes a lot of hubris to assume that your application can provide all the analysis a customer may want to do with her data. Letting customers import data into Excel or a similar tool is incredibly valuable.
At Mobile Commons, we have a ton of data that customers may want to export. User profiles, text messages, donations, phone call records, etc. And customers want to use different query parameters for their export. For example, pull every constituent in New York that has opted in to a particular advocacy campaign and donated money.
This basic idiom repeats itself over and over again in our application:
- A customer pulls up a page with a customizeable set of search parameters
- An HTML view displays the data with pagination
- A CSV export button enables them to download the data, but obviously without pagination this time
In Rails, this is pretty easy to accomplish with respond_to blocks. We setup the ActiveRecord finders, then return the appropriate content type:
For example, consider the following URL: http://foo.example.com/articles/123/comments?published=true&after=2010-01-01&order=updated_at
class CommentsController
def index
finder = current_user.articles.find(params[:article_id]).comments
finder = finder.published if params[:published] = true
finder = finder.scoped(:conditions=>'created_at > ?', params[:after])
finder = finder.scoped(:order=>params[:order_by])
respond_to do |format|
format.html do
finder.paginate( params[:page])
end
format.csv do
send_csv finder.all.to_csv
end
end
end
end
This works quite well. The code is DRY and easy to read. The problem comes when the number of comments grows large. Let’s say we have 1,000 comments in the database (for example, if we become the first google result for “facebook login”). The pagination version still works fine. We show 10 on a page and let the user click through lots of pages. But the CSV version falls flat as we try to pull 1,000 records from the database at once on every request.
No problem! This is a perfect opportunity for asychronous processing. We push a message onto a queue, run the report in the background, and provide a callback URL for the user to pickup the CSV when it’s complete.
This approach works great, but makes for an interesting architectural decision: where should the finder logic live? It seemed like the controller was a good place for it, but now we want to access it from another process that knows nothing about MVC. You probably don’t realize how much business logic is implicitly encoded in your routes until you try something like this. For example, how did the current_user variable get set? Maybe by parsing the subdomain? And how do we know what article_id is? We extracted it from the URL path.
Suddenly you realize that repackaging that logic to be used in both places and remain DRY may not be trivial. For one controller it’s not so bad, but remember that you may have this idiom repeated over and over in 20 different controllers, each with their own routes and query params. The question really becomes: what message should you put on the queue so that your workers don’t all need custom duplicate logic?
We ended up taking an rather clever approach to this problem (thanks, Justin). We already have code that works when it’s a web request. What if we just serialize the URL and HTTP params and put them on the queue. Our worker can pick them up and just call the controller method directly.
Rails already has this packaged nicely for us with the ActionController::Integration::Session class, which is typically used for integration testing.
#first deserialize the message to a hash object
path = message[:path]
params = message[:params]
app = ActionController::Integration::Session.new
app.host! "foo.example.com"
app.get(path, params)
csv = app.response.body
Works like a charm! We go through the same code path as a regular web request, but we do it all in-process. We never go over the wire. Suddenly you don’t care if the process takes 2 seconds or 2 minutes. It’s running happily in the background and you’re good to go.
Then days go by, then the weeks. Suddenly your comment thread has gone from 1,000 records to 10,000 records. The background job gets slower and slower. Which is okay, but you notice it’s using more and more machine resources.
What’s happening? Well, a good guess is that finder.all is putting all 10,000 ActiveRecord objects into memory at once, which consumes a ton of resources. If you haven’t read the Engine Yard post on memory bloat, I highly recommend it.
No problem! Rails 2.3 ships with easy utility functions find_in_batches & find_each which pulls 1,000 records at a time and iterates through them all.
format.csv do
csv = ""
finder.all.find_each do |records|
csv << records.to_csv
end
send_csv csv
end
Great! Memory usage plummets and everything is honky dory.
Then days go by, then the weeks. Suddenly your comment thread has gone from 10,000 records to 100,000 records. The background job gets slower and slower. Which is okay, but you notice it’s using more and more machine resources.
Wait, what? Why did you repeat that? What happened? I thought we fixed that.
Yeah, me too. But suddenly our export started taking forever and using 1GB+ of memory. What the deuce? find_each should have solved this problem. We tested the export code over and over. We couldn’t reproduce the problem. We had unit tests and integration tests and load tests. We tried in IRB and rake tasks and just couldn’t reproduce.
After much investigation and a great tip from Aaron Quint, we discovered a nasty little gotcha: in Rails 2, the ActiveRecord cache is enabled automatically for all controller actions (http://dev.rubyonrails.org/changeset/6189). Because we were calling app.get in our worker task, it was automatically loading the query cache. So every instance of a Comment was being cached, all 100,000 of them, even if we were only finding 1,000 at a time. The problem only showed up when called via the controller because the cache is disabled by default everywhere else.
Turns out the problem is trivially fixable by wrapping find_each in an uncached{} block:
format.csv do
csv = ""
Comment.uncached do
finder.all.find_each do |records|
csv << records.to_csv
end
end
send_csv csv
end
And voila! We now have a background task that calls a single bit of DRY controller code, processes in batches, and doesn’t use a lot of memory. Wawaweewa!


