Huge CSV Data Exports in Rails

I’m a strong believer in data portability for software as a service applications.  As a consumer, I never want to use a service that won’t let me get my data out.  And as a provider, I’d much prefer to compete on features, quality, customer service, etc than on locking in customers.   Additionally,  it takes a lot of hubris to assume that your application can provide all the analysis a customer may want to do with her data.  Letting customers import data into Excel or a similar tool is incredibly valuable.

At Mobile Commons, we have a ton of data that customers may want to export.  User profiles, text messages, donations, phone call records, etc.  And customers want to use different query parameters for their export.  For example, pull every constituent in New York that has opted in to a particular advocacy campaign and donated money.

This basic idiom repeats itself over and over again in our application:

  • A customer pulls up a page with a customizeable set of search parameters
  • An HTML view displays the data with pagination
  • A CSV export button enables them to download the data, but obviously without pagination this time

In Rails, this is pretty easy to accomplish with respond_to blocks.  We setup the ActiveRecord finders, then return the appropriate content type:

For example, consider the following URL: http://foo.example.com/articles/123/comments?published=true&after=2010-01-01&order=updated_at


class CommentsController
  def index
    finder = current_user.articles.find(params[:article_id]).comments
    finder = finder.published if params[:published] = true
    finder = finder.scoped(:conditions=>'created_at > ?', params[:after])
    finder = finder.scoped(:order=>params[:order_by])
    respond_to do |format|
      format.html do
        finder.paginate( params[:page])
      end
      format.csv do
        send_csv finder.all.to_csv
      end
    end
  end
end

This works quite well.  The code is DRY and easy to read.  The problem comes when the number of comments grows large.  Let’s say we have 1,000 comments in the database (for example, if we become the first google result for “facebook login”).  The pagination version still works fine. We show 10 on a page and let the user click through lots of pages. But the CSV version falls flat as we try to pull 1,000 records from the database at once on every request.

No problem!  This is a perfect opportunity for asychronous processing.  We push a message onto a queue, run the report in the background, and provide a callback URL for the user to pickup the CSV when it’s complete.

This approach works great, but makes for an interesting architectural decision: where should the finder logic live?  It seemed like the controller was a good place for it, but now we want to access it from another process that knows nothing about MVC.  You probably don’t realize how much business logic is implicitly encoded in your routes until you try something like this.  For example, how did the current_user variable get set?  Maybe by parsing the subdomain?  And how do we know what article_id is?  We extracted it from the URL path.

Suddenly you realize that repackaging that logic to be used in both places and remain DRY may not be trivial.  For one controller it’s not so bad, but remember that  you may have this idiom repeated over and over in 20 different controllers, each with their own routes and query params.  The question really becomes: what message should you put on the queue so that your workers don’t all need custom duplicate logic?

We ended up taking an rather clever approach to this problem (thanks, Justin).  We already have code that works when it’s a web request.  What if we just serialize the URL and HTTP params and put them on the queue.  Our worker can pick them up and just call the controller method directly.

Rails already has this packaged nicely for us with the ActionController::Integration::Session class, which is typically used for integration testing.


#first deserialize the message to a hash object
path   = message[:path]
params = message[:params]
app = ActionController::Integration::Session.new
app.host! "foo.example.com"
app.get(path, params)
csv = app.response.body

Works like a charm!  We go through the same code path as a regular web request, but we do it all in-process.  We never go over the wire.  Suddenly you don’t care if the process takes 2 seconds or 2 minutes.  It’s running happily in the background and you’re good to go.

Then days go by, then the weeks.  Suddenly your comment thread has gone from 1,000 records to 10,000 records.  The background job gets slower and slower.  Which is okay, but you notice it’s using more and more machine resources.

What’s happening?  Well, a good guess is that finder.all is putting all 10,000 ActiveRecord objects into memory at once, which consumes a ton of resources.  If you haven’t read the Engine Yard post on memory bloat, I highly recommend it.

No problem!  Rails 2.3 ships with easy utility functions find_in_batches & find_each which pulls 1,000 records at a time and iterates through them all.


format.csv do
  csv = ""
  finder.all.find_each do |records|
    csv << records.to_csv
  end
  send_csv csv
end

Great!  Memory usage plummets and everything is honky dory.

Then days go by, then the weeks.  Suddenly your comment thread has gone from 10,000 records to 100,000 records.  The background job gets slower and slower.  Which is okay, but you notice it’s using more and more machine resources.

Wait, what?  Why did you repeat that?  What happened?  I thought we fixed that.

Yeah, me too.  But suddenly our export started taking forever and using 1GB+ of memory.  What the deuce?  find_each should have solved this problem.  We tested the export code over and over.  We couldn’t reproduce the problem.  We had unit tests and integration tests and load tests. We tried in IRB and rake tasks and just couldn’t reproduce.

After much investigation and a great tip from Aaron Quint, we discovered a nasty little gotcha:  in Rails 2, the ActiveRecord cache is enabled automatically for all controller actions (http://dev.rubyonrails.org/changeset/6189).  Because we were calling app.get in our worker task, it was automatically loading the query cache.  So every instance of a Comment was being cached, all 100,000 of them, even if we were only finding 1,000 at a time.  The problem only showed up when called via the controller because the cache is disabled by default everywhere else.

Turns out the problem is trivially fixable by wrapping find_each in an uncached{} block:


format.csv do
  csv = ""
  Comment.uncached do
    finder.all.find_each do |records|
      csv << records.to_csv
    end
  end
  send_csv csv
end

And voila!  We now have a background task that calls a single bit of DRY controller code, processes in batches, and doesn’t use a lot of memory.  Wawaweewa!

Tuesday, March 9, 2010

Admit It, Microsoft: You Suck at the Web – GigaOM

Let’s see… 60% of the browser market, one of the top 3 most visited web sites, one of the top 3 search engines, and $40B in cash…

I would hate to see what things would look like if they were just “pretty good at the web.”

Pratik has a very thorough and well written post about the upcoming changes to the ActiveRecord finder methods in Rails 3.1 and 3.2.

Specifically, the current options parameter is going to be removed in favor of method chaining.

For example,

Car.all(:conditions=>{:color=>'black'}, :order=>'cars.price DESC', :limit=>10)

will be replaced by:

Car.where(:color => 'black').order('cars.price DESC').limit(10)

While I do think the new syntax looks nice, my first thought was “how much time is this going to take to upgrade our app”

$ grep ":conditions" app lib vendor/gems vendor/plugins | wc -l
350

Okay, so that pretty much means we’re not moving to the new syntax any time soon. We obviously rely heavily on the exisiting :conditions syntax. There will be an officially supported plugin to keep the old-style, and I applaud the core team for doing that.

We have become quite fond of building up an array of options to pass to finder methods. This idiom works incredibly well when deal with user-defined queries. Consider an example where a user can query for their social graph.


Class User
  has_many :relationships
end

Now consider a web interface to query for friends. The user can sort by name (the default) or age, can choose to see 10, 20, or all of them, and can choose to only see men, women or both. Here’s how I would code that up (ignoring sanitization for now)*:

options={}
options[:order]=params[:order] if params[:order]
options[:limit]=params[:limit] if params[:limit]
options[:conditions]={:gender=>params[:gender]} if params[:gender]
user.relationships.all options

How would you do that with the new syntax? Well, instead of building up an array of options, you would need to conditionally pass messages:

relationships = user.relationships
relationships = relationships.limit(params[:limit]) if params[:limit]
relationships = relationships.order(params[:order]) if params[:order]
relationships = relationships.where(:gender=>params[:gender] ) if params[:gender]
relationships.all

They are both about the same amount of code, and will generate the SQL under the hood. But the latter requires you understand the way ActiveRecord lazy loads associations and handles message. In this case, I much prefer the former; building up options and passing them to a finder is a much simpler concept.

* NB You can’t simply do this:

user.relationships.limit(params[:limit]).order(params[:order]).where(:gender=>params[:where])

because all the params are optional. As designed, these queries will use the default values from the has_many association. Passing in :gender=>nil is very different than the default.

Monday, January 25, 2010
kevin:guillee:


Star Wars opening crawl, using only HTML &amp; CSS.
Caveats: It only works in Snow Leopard in Safari 4.0.4 and the WebKit nightly. Nothing else supports the CSS 3D transforms and animations I used, but I just wanted to see if it could be done.


Lovin&#8217; HTML5. Pretty much everywhere, it&#8217;s gonna be hot.

kevin:guillee:

Star Wars opening crawl, using only HTML & CSS.

Caveats: It only works in Snow Leopard in Safari 4.0.4 and the WebKit nightly. Nothing else supports the CSS 3D transforms and animations I used, but I just wanted to see if it could be done.

Lovin’ HTML5. Pretty much everywhere, it’s gonna be hot.

From Charles Nutter: &#8220;Lately my interest in JRuby on Android has increased. I realized very recently that JRuby is just about the only mainstream JVM languge that can create *new* code while running on the device, which opens up a whole host of possibilities. It is not possible to implement an interactive shell in Groovy or Scala or Clojure, for example, since they all must first compile code to JVM bytecode, and JVM bytecode can&#8217;t run on the Dalvik VM directly.&#8221;
That&#8217;s pretty dope. Seems like a fun project to play with to satisfy your weekend hacking cravings.

From Charles Nutter: “Lately my interest in JRuby on Android has increased. I realized very recently that JRuby is just about the only mainstream JVM languge that can create *new* code while running on the device, which opens up a whole host of possibilities. It is not possible to implement an interactive shell in Groovy or Scala or Clojure, for example, since they all must first compile code to JVM bytecode, and JVM bytecode can’t run on the Dalvik VM directly.”

That’s pretty dope. Seems like a fun project to play with to satisfy your weekend hacking cravings.

Speaking Schedule: Jan 19th @ Pivotal Labs in NYC

I’ll be speaking this Tuesday at Pivotal’s NYC office about architecting successful software platforms that are heavily reliant on (inherently unreliable) 3rd party web services and services in the cloud.

Here’s the abstract:

Beyond the Hype: What it Really Takes to Build a Technology Business on the Cloud

After the marketing and sales people leave the room, what does it really take to run a successful business relying on the cloud? Using real world use-cases, Mobile Commons CTO Benjamin Stein discusses the unique challenges of building a company reliant on 3rd party APIs, including vendor selection, system architecture, production support, and the factors that determine customer loyalty.
Sunday, January 17, 2010

100% of all Browser Certificate Warnings are from Legitimate Sites

This is just another interesting tidbit from the podcast this morning:

Setting up SSL properly is really hard and letting an SSL certificate accidentally expire is really easy.  I think every web developer ever has had a problem with SSL at some point in their career.

On the other hand, malicious web sites work very hard to make sure they look legit.  Either they don’t use SSL at all, or they make sure their certificate appears kosher.  No malicious sites use mismatched certificates.

In other words, we can assume that almost 100% of all browser certificate warnings are actually from legitimate sites.  Weird, right?

Tuesday, January 12, 2010 — 2 notes

Embarrassing Server Log of the Day

2010-01-12 15:44:50 Completed in 500545ms (View: 367097, DB: 133140) | 200 OK

Tuesday, January 12, 2010

Explaining URLs is Surprisingly Hard

I listened to a moderately interesting Security Now episode from a couple weeks ago.  The topic was explaining security best practices to non-techno people.  Specifically, can you tell if a URL is safe to click on.  Turns out parsing URLs is a suprisingly hard problem that nerds completely take for granted.

Try explaining the following rules about clicking links to your grandma:

It’s so intuitive for techies to see the good and bad URLs but there’s just no simple set of rules for explaining it.  I guess you could forward them the RFC

Tuesday, January 12, 2010 — 63 notes

Dr Nic’s Syntax Highlighting in Tumblr

A hosted style sheet for adding syntax highlighting for a number of different languages to your Tumblelog.  Thanks, Dr. Nic!

Skip ActiveRecord Callbacks

Just a simple little Rails tip to end the week.  Nothing new here, but a nice snippet of code to remember and keep in your toolbox.

Let’s say you have an Address model.  Every time you update an address, you want it geocoded.  So you add an after_save :geocode callback.  Now let’s say you want to standardize all state names to 2 letter uppercase abbreviations.  We want to go back and upcase all the states that are already in the database but not re-geocode them (since the address didn’t actually change).

Here’s a handy little snippet code you can stick onto ActiveRecord::Base that will easily let you skip callback:

ActiveRecord::Base.class_eval do
  def self.skip_callback(callback, &block)
    method = instance_method(callback)
    remove_method(callback) if respond_to?(callback)
    define_method(callback){ true }
    begin
      result = yield
    ensure
      remove_method(callback)
      define_method(callback, method)
    end
    result
  end
end

Now in script/console you can do the following:

Address.skip_callback(:geocode) do 
  Address.find_each do |address|
    address.update_attributes! :state=>address.state.upcase
  end
end
Friday, January 8, 2010

Dive Into HTML5

HTML5 is coming. Do you know what’s in store?  This is the online version of Mark Pilgrim’s upcoming book.  It will be published by O’Reilly but remain online under a Creative Commons license.

So far his chapters on canvas, video, and forms are available online.

Increase the number of file descriptors on Centos and Fedora Linux

Raising the number of file descriptors for a regular user on CentOS/Fedora/Redhat is surprisingly difficult to learn how to do. There are lots of incomplete walk throughs on the web, some with typos and other problems.

Here are the steps that worked for me to raise the open file descriptor limit from 1024 (the default) to 65535:

1. As root, edit /etc/sysctl.conf and add the line:

fs.file-max = 512000

2. At the bash prompt, run:

$ sysctl -p

That will cause the settings to take effect. You can also `cat 512000 > /proc/sys/fs/file-max` but that may reset on reboot.

3. Edit /etc/security/limits.conf and add the following:

* - nofile 65535

See the inline comments for more details on what that does and how to make it more restrictive if you prefer.

4. As root, run

$ ulimit -n 65535

and make sure you have no errors. To double check, run `ulimit -n` and make sure the response is 65535.

4. Ensure that PAM authentication is turned on for SSH, or else when you try to connect as a regular user, you won’t see the new limits. Edit /etc/ssh/sshd_config and make sure you have:

UsePAM yes

Restart SSH `/sbin/service sshd restart` if you made any changes.

6. Login as a regular user with a new SSH session & shell and run:

$ ulimit -n 65535

Run `ulimit -n` again to check and good luck!

Tuesday, January 5, 2010 — 1 note

ActiveRecord without_timestamps

ActiveRecord’s automatically updating created_at and updated_at timestamps are certainly a great feature.  Every once in a while though they can bite you in the butt.

For example, let’s say you have a query that finds all the recently updated blog posts:

Post.all :order=>'updated_at DESC', :limit=>5

That code works great and has been in production for a long time.  Then one day you decide to change the title of a blog post your wrote last year.

Post.find(123).update_attributes! :title=>"My New Title"

What’s the problem?  All of a sudden this old post shows up as one of your recently updated blog posts on your site’s homepage!  Uh oh.  That’s probably not what you meant to happen.  (At this point you realize that a “published_at” timestamp would have been useful, but hey, you didn’t write the original code.)

So how can you change the contents without changing the updated_at timestamp?  One way is to skip ActiveRecord and run the query in the database; that certainly works but isn’t an ideal practice.

ActiveRecord comes with a flag that lets you turn timestamp setting on and off:

Post.record_timestamps = false

You can now safely change the title and the updated_at timestamp won’t change.  You can even wrap it up inside a little function if you want:

def without_timestamps(*models, &block)
models.each{|m| m.record_timestamps = false }
begin
yield
ensure
models.each{|m| m.record_timestamps = true }
end
end

Then you can use it like so:

without_timestamps(Post,Comment) do
Post.find(123).update_attributes! :title=>"My New Title"
end
Sunday, January 3, 2010

Skip DNS Lookups for a More Reliable MySQL

By default, the MySQL grants table can authenticate users based on IP address or hostname.  When a client connects from a particular host, MySQL does a reverse DNS lookup and compares the hostname and IP address.

This is normally pretty fast and makes setting up your grants table easier: you can allow anyone within your domain to connect, for example.  And it adds a layer of abstraction between your database and app servers.

However for most applications out there, allowing a single host or few hosts to connect is all you really need.  If that’s the case, why add the extra layer of complexity doing DNS lookups?  More importantly, if you have a DNS problem on your local LAN (service down, stale cache, etc), it could bring down your entire application because no one can connect to the database.

It’s very easy to turn off reverse DNS.  First setup your mysql.user grant table to allow connections from IP addresses and remove the host names.  Then edit your my.conf file and add the line

skip-name-resolve

This solution won’t scale to 100s or 1,000s of servers because it becomes to annoying to manage all the IP addresses, but for small clusters, say, 10 servers or so, this adds just a tiny bit of robustness to your infrastructure at no extra cost.

See the MySQL reference manual for more details.

Saturday, January 2, 2010