November 20, 2017 – Adrien Siami – 7-minute read
Recently, we had to send an e-mail to all our active users. For cost reasons, we decided to invest a bit of tech time and to go with transactional e-mails instead of using an e-mail marketing platform.
While it would certainly be quite straightforward for, say, hundreds or even thousands of users, it starts to get a bit more complicated for larger user bases.
In our case, we had to send the e-mail to ~1.5 million e-mail addresses.
In this blog post, I’ll quickly explain why a standard approach is not acceptable and go through the solution we chose.
Let’s implement a very naive way to send an e-mail to all our users. We’re going to create a job that loops through all the users and enqueues an e-mail.
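As a rough sketch of that naive job, here it is in plain Ruby. The class name and both collaborators are hypothetical stand-ins so the example runs without Rails: `users` would typically be something like a batched ActiveRecord query, and `mailer_queue` whatever your job backend exposes.

```ruby
# Naive approach: a single long-running job walks every user and enqueues
# one e-mail per user. All names here are illustrative assumptions.
class NaiveMassEmailJob
  # users:        any enumerable of objects responding to #id
  # mailer_queue: any object responding to #push (stand-in for the job backend)
  def initialize(users:, mailer_queue:)
    @users = users
    @mailer_queue = mailer_queue
  end

  # Enqueues one e-mail per user and returns how many were enqueued.
  # If the process dies halfway through, nothing records where it stopped.
  def perform
    count = 0
    @users.each do |user|
      @mailer_queue.push(user.id)
      count += 1
    end
    count
  end
end
```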
Now let’s see what could go wrong.
Looping through millions of users is not free and it will most likely take a fair amount of time.
During this time, you may well deploy, which restarts your job manager and kills your job. You then have no way of knowing which users have already received the e-mail and which have not.
One easy fix would be to run the code outside of your job workers, maybe in a rake task, but you have to make sure it won’t get killed, or that if it’s killed, you can resume it without any issue.
E-mail providers don’t like spam. If you send thousands of e-mails from the same IP in a short time, you’re very likely to get throttled or even blacklisted.
Therefore, it is necessary to space out the e-mails a bit, for example, adding a 30s delay every 100 e-mails.
Every e-mail to be sent equals a job run in your job queue: if you enqueue millions of jobs in the same queue you use for other operations, you’re going to create a lot of congestion.
Therefore, you’d probably want to have a special queue only for your sending with a dedicated worker.
First, let’s list the requirements we had in mind:
Redis is an amazing multi-purpose tool: it can store short-lived data such as a cache, act as a session store, and more. It offers a multitude of useful data structures; the one we’re going to use today is the sorted set.
A sorted set is a bit like a hash / dictionary / associative array. It contains a collection of unique values, and each of these values has a score.
Redis offers very useful commands to deal with sorted sets; let’s have a look at one in particular: ZRANGEBYSCORE.
This command returns a range of elements from the sorted set, with a score between min and max (inclusive). Can you see where this is going? :)
We’re going to store all our user ids in a sorted set, with a score of 0, and change that score to 1 when we enqueue an e-mail for them.
Then, it’s really easy to ask for any number of users for whom we haven’t enqueued the e-mail yet.
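To make the idea concrete, here is a toy in-memory model of the three sorted-set operations involved. It is not Redis and not the code we ran; `ToySortedSet` is a hypothetical class that only mimics the semantics of ZADD, ZRANGEBYSCORE with a limit, and ZCOUNT.

```ruby
# Toy model of a sorted set: each unique member maps to a score.
# Real Redis stores members as strings and breaks score ties
# lexicographically; integers are used here only for brevity.
class ToySortedSet
  def initialize
    @scores = {} # member => score
  end

  # Adds the member, or updates its score if it is already present.
  def zadd(member, score)
    @scores[member] = score
  end

  # Members with min <= score <= max, ordered by score then member,
  # optionally capped at `limit` results.
  def zrangebyscore(min, max, limit: nil)
    matches = @scores.select { |_, score| score >= min && score <= max }
                     .sort_by { |member, score| [score, member] }
                     .map(&:first)
    limit ? matches.first(limit) : matches
  end

  # Number of members with min <= score <= max.
  def zcount(min, max)
    @scores.count { |_, score| score >= min && score <= max }
  end
end

set = ToySortedSet.new
(1..5).each { |id| set.zadd(id, 0) }       # nobody has been e-mailed yet
batch = set.zrangebyscore(0, 0, limit: 2)  # pick two pending users: [1, 2]
batch.each { |id| set.zadd(id, 1) }        # mark them as enqueued
set.zcount(0, 0)                           # 3 users still pending
```

Because a picked id moves from score 0 to score 1, asking again for score 0 never returns it, which is what makes the process resumable.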
Let’s create a rake task to populate a sorted set with our user ids.
Here I’m using MULTI to add the user ids to the set 100 at a time in transactions, to go easy on the Redis CPU.
While this task may take quite some time, it is safe to re-launch if killed.
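A minimal sketch of what such a population step could look like, written as a plain method so it can be called from a rake task. It assumes `redis` is a redis-rb client (for example `Redis.new`) and that `user_ids` is any enumerable of ids; the method name is hypothetical, and the key name is the one used in the ZCOUNT commands of this post.

```ruby
# Adds every user id to the sorted set with a score of 0 ("not enqueued yet").
# `redis` is assumed to be a redis-rb client; `user_ids` any enumerable of ids.
def populate_mass_email_set(redis, user_ids, key: "mass_email_user_ids", batch_size: 100)
  user_ids.each_slice(batch_size) do |ids|
    # One MULTI/EXEC transaction per slice of ids.
    redis.multi do |transaction|
      ids.each { |id| transaction.zadd(key, 0, id) }
    end
  end
end
```

Re-running this before any e-mail has been scheduled simply rewrites the same scores. Once sending has started, a plain ZADD would reset already-scheduled users back to 0; ZADD’s NX option (only add new members) is one way to avoid that.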
Now that we have our sorted set, let’s write another task. This one will pick a given number of user ids from the set and enqueue an e-mail for them, while spacing out the sends in time a bit.
Here I get as many user ids as requested thanks to ZRANGEBYSCORE and its LIMIT option.
I then iterate over the ids and enqueue the jobs 100 by 100, while delaying the sending by a further 30 seconds each time.
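The scheduling step could be sketched roughly as follows. This is an illustration under assumptions, not the original task: `redis` is assumed to be a redis-rb client, the method name is hypothetical, and `enqueue` is a hypothetical callable receiving a user id and a delay in seconds (in a Rails app it might wrap a mailer’s `deliver_later(wait: ...)`). Sending the first group with no delay is my choice.

```ruby
# Picks up to `count` user ids that are still at score 0, enqueues an e-mail
# for each, and marks them with score 1. Returns the number of ids processed.
def schedule_mass_email(redis, count, enqueue:, key: "mass_email_user_ids",
                        batch_size: 100, delay_step: 30)
  # Equivalent to: ZRANGEBYSCORE key 0 0 LIMIT 0 count
  ids = redis.zrangebyscore(key, 0, 0, limit: [0, count])

  ids.each_slice(batch_size).with_index do |slice, index|
    delay = index * delay_step # 0s for the first group, then 30s, 60s, ...
    slice.each do |id|
      enqueue.call(id, delay)
      # Marked after enqueuing: a crash in between can cause one duplicate
      # e-mail on the next run rather than a skipped user.
      redis.zadd(key, 1, id)
    end
  end

  ids.size
end
```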
And that’s it! Thanks to this system you can gradually increase your e-mail batches while keeping an eye on deliverability.
Send 100 mails to test it out:
Everything looks good? Send 1,000, then 10,000, and so on.
Then it’s easy to know how many e-mails are left to be scheduled: just open a Redis console and ask away using ZCOUNT!
Remaining e-mails to schedule:
ZCOUNT mass_email_user_ids 0 0
E-mails already scheduled or sent:
ZCOUNT mass_email_user_ids 1 1
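The same two counts can also be read from Ruby. This small helper is only a sketch: it assumes `redis` is a redis-rb client, and the method name and the shape of the returned hash are my own.

```ruby
# Summarises progress from the two ZCOUNT queries shown above.
def mass_email_progress(redis, key: "mass_email_user_ids")
  remaining = redis.zcount(key, 0, 0) # not scheduled yet
  scheduled = redis.zcount(key, 1, 1) # already scheduled or sent
  total = remaining + scheduled
  percent = total.zero? ? 0.0 : (scheduled * 100.0 / total).round(1)
  { remaining: remaining, scheduled: scheduled, percent_scheduled: percent }
end
```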
Obviously there is no perfect solution; here are a few downsides:
This is clearly not a fire-and-forget solution. It needs the attention of a dev for a little while: enqueuing the sends, monitoring, waiting for a batch to finish before sending another one, etc.
However, this kind of sending is usually rare but important, so having it done right is worth the effort.
If you enqueue a lot of small batches, you’re going to be fine, but at some point you are going to enqueue batches of 100k e-mails or even more.
What if something goes wrong (deliverability dropping, etc.) and you want to stop everything to have a look? You would need to stop the dedicated worker, but the jobs are already enqueued. If you don’t resume for a long time, the delayed jobs will all be overdue when you start again, so they will run without any spacing and you may experience congestion or throttling from your e-mail provider.
This is a risk we were willing to take and that we mitigated with strong monitoring and cautious batching.
This solution worked well for our needs, but as always, your mileage may vary!
Sending millions of e-mails is tricky, but is an interesting problem to solve. Thanks to a bit of custom dev and redis, we were able to send our e-mail in a reasonable amount of time with excellent deliverability.