A few months ago we faced a memory issue with some of our background jobs. Heroku was killing our dyno because it exceeded its allowed memory. Thanks to our instrumentation of Sidekiq, it was easy to spot the culprit: a job running a fairly complex SQL query and outputting the result into a CSV file before archiving that file.
In this article, I'll explain what happened and detail the method we used to solve the problem. I had never seen or used this technique before, so I thought it would be nice to share.
We run a tiny framework, something more like a convention, to run SQL queries and archive the results. If I remove the noise of the framework, the code looked like this:
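Or rather, a sketch of it; `archive` and the file name are made-up placeholders, while `with_replica_database`, `query`, and `header` are described right below:

```ruby
require "csv"

with_replica_database do
  # select_rows loads every row of the result set into memory at once.
  rows = ActiveRecord::Base.connection.select_rows(query)

  CSV.open("report.csv", "w") do |csv|
    csv << header
    rows.each { |row| csv << row }
  end
end

# Hypothetical helper that uploads the generated file somewhere safe.
archive("report.csv")
```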
In this simplified example, there are:

- `with_replica_database`: a helper that runs a piece of code against a replica database,
- `query`: our SQL query, as a String, and
- `header`: a placeholder for the Array of our column names.

We used `select_rows` because the results of the query didn't really match any of our models. It is a reporting query that does too many `join`s, `group by`s, and subqueries. The query takes dozens of minutes to run. We could, and probably should, integrate that into our ETL, but that's not the point…
The resulting CSV file wasn’t that big, maybe a hundred megabytes.
The memory consumption came from the many rows returned by the `select_rows` method. Each row is an array with many entries, since our CSV has many columns. Each entry could be a complex datatype converted by ActiveRecord into even more complex Ruby objects: we had many instances of `Time` with their `TimeZone`, `BigDecimal`, …
Since the query returns millions of rows, even with a linear complexity, the memory consumption was too high.
At first I thought about paginating the results, much in the same way that `find_each` works. The problem was that for 10,000 rows paginated by 1,000, it would take roughly ten times as long as the same request without pagination.
Our query looked like this:
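Well, only in shape; the real one is much longer, and the column names here are made up:

```sql
SELECT t.some_column, u.another_column, SUM(v.amount) AS total
FROM t
  INNER JOIN u ON u.t_id = t.id
  INNER JOIN v ON v.t_id = t.id
GROUP BY t.some_column, u.another_column
```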
Just imagine t, u, and v being subqueries with unions, OR conditions, other GROUP BY clauses, and more poorly performing stuff. The sad part is the GROUP BY, which requires the engine to go through all the results in order to group rows correctly. Using pagination on this would be something like:
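For instance, for the third page of 1,000 rows, still with made-up names:

```sql
SELECT t.some_column, u.another_column, SUM(v.amount) AS total
FROM t
  INNER JOIN u ON u.t_id = t.id
  INNER JOIN v ON v.t_id = t.id
GROUP BY t.some_column, u.another_column
LIMIT 1000 OFFSET 2000
```

The LIMIT and OFFSET only trim what is sent back to the client; the engine still has to compute the whole GROUP BY for every single page.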
So the fewer entries per page, the less memory used on the client side but the more time spent in the database, because more requests are made. The more entries per page, the more memory used on the client side but the less time spent in the database, because fewer requests are made.
In the end, this approach wouldn’t have been future-proof.
It was easy to look for solutions to the "results do not fit in memory" problem because it is a well-known one. It is common with Rails for long lists and association preloading to cause memory issues. The quick fix is to use the `find_each` or `in_batches` methods.
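For a regular model-backed list, that quick fix looks something like this, with `User` standing in for whatever model is being processed:

```ruby
# Load and yield records in batches of 1,000 instead of all at once.
User.find_each(batch_size: 1_000) do |user|
  # work with one record at a time
end

# Or get each batch as a relation, for example to update it in bulk.
User.in_batches(of: 1_000) do |batch|
  batch.update_all(exported: true)
end
```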
I realized that I didn't actually need to load everything into memory. I'm only interested in getting one row at a time, writing it into the CSV, and then forgetting about it, thanks to the garbage collector.
After acknowledging what the true issue was, it was possible to find something more efficient: streaming APIs.
The idea was to bypass ActiveRecord and use the underlying MySQL client, which provides a stream option.
I'm sure there are similar options for other databases.
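Here is a sketch of what the streaming version looked like, assuming the mysql2 adapter; as before, `archive` and the file name are placeholders:

```ruby
require "csv"

with_replica_database do
  # Get the underlying Mysql2::Client behind ActiveRecord's connection.
  mysql = ActiveRecord::Base.connection.raw_connection

  # stream: true fetches rows from the server one by one instead of
  # materializing the whole result set; cache_rows: false avoids keeping
  # rows around once they have been yielded.
  result = mysql.query(query, stream: true, cache_rows: false, as: :array)

  CSV.open("report.csv", "w") do |csv|
    csv << header
    result.each { |row| csv << row }
  end
end

archive("report.csv")
```

One caveat with mysql2 streaming: the result has to be fully iterated before the same connection can run another query.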
With that implementation, we only do one request, so no pagination, but we won’t have all the results in memory. We never needed to have all those results in memory in the first place anyway.
I would be very interested to use this feature with ActiveRecord's ability to return models rather than rows. Maybe it is already possible, but I didn't find it. If you have any further information on the subject, please let me know!
I hope you won't have to use these lower-level APIs. But if you do encounter the same kind of memory issues, don't throw money at it right away. Try this first ^^
And obviously, most of this could be avoided by tweaking the layout of our data and their relations. In our case, denormalization could make this easier, but we're not ready to pay that cost yet.
Edit: As nyekks mentioned on Reddit, Sequel seems to handle this better out of the box.
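I haven't tried it, but with Sequel's mysql2 adapter streaming appears to be available directly on datasets, something along these lines (the connection settings are made up):

```ruby
require "sequel"

DB = Sequel.connect(adapter: "mysql2", host: "replica-host",
                    database: "reporting", user: "reporter", password: "secret")

# Dataset#stream iterates the result set row by row instead of
# loading everything into memory.
DB[query].stream.each do |row|
  # row is a Hash keyed by column name
end
```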