Processing Pipeline - SQS vs RabbitMQ

Apr 2017 - code

Our data pipeline needs a message broker for a Celery job queue; RabbitMQ and Amazon SQS are two good choices. They are designed to scale, guarantee message delivery, and seem to have good support.

RabbitMQ has a solid reputation as a broker, with diverse and robust features. It works very well with Celery. The only real disadvantage is that it relies on custom configuration and deployment. SQS is a message queue as a service, self-scaling and configurable from a web UI. It also plays well with other AWS offerings. However, it can be pricier and slower at scale, and less flexible to use.

Long term, RabbitMQ seems like a clear winner, but there are more factors to consider:

We have 2 developers, one of whom is leaving soon.
We lack devops experience.
While this pipeline is important to the business, it’s not operationally crucial.
There is only one use case where the speed of dropping off jobs will be important.

Maintenance, reliability, security, and scaling would all be taken care of on SQS - making it a very appealing option, provided that 3 assumptions were reasonable:

Given that our clients are in Africa, any performance differences will be trivial compared to the network latency.
Given jobs of sufficient processing time, any performance differences will be trivial compared to the job computations.
Even if we scale at a rapid rate, our current business size means that it won’t be “big enough” any time soon.

To confirm, I ran some tests on AWS, swapping out the broker. I had a Flask gateway and a Celery worker on separate EC2 instances, and used Celery to manage jobs. I ran 5 trials for each broker, at 2 different times during the day. Each trial had 300 requests for 1 second jobs.
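Swapping the broker in a setup like this can come down to a single setting. Below is a minimal sketch, not the code used in the tests. The PIPELINE_BROKER environment variable, the host, the credentials, and the task name are all hypothetical; the amqp:// and sqs:// URL schemes are the ones Celery documents for RabbitMQ and SQS.

```python
import os
import time

# Placeholder broker URLs; host, user, and password are hypothetical.
BROKER_URLS = {
    "rabbitmq": "amqp://user:password@rabbit-host:5672//",
    # With a bare "sqs://" URL, AWS credentials come from the environment or an IAM role.
    "sqs": "sqs://",
}


def broker_url(name=None):
    """Return the broker URL for `name`, defaulting to the PIPELINE_BROKER env var."""
    name = (name or os.environ.get("PIPELINE_BROKER", "sqs")).lower()
    if name not in BROKER_URLS:
        raise ValueError(f"unknown broker: {name!r}")
    return BROKER_URLS[name]


try:
    from celery import Celery
except ImportError:  # lets the sketch run without Celery installed
    Celery = None

if Celery is not None:
    # The app is created lazily; no connection is opened until a task is sent.
    app = Celery("pipeline", broker=broker_url())

    @app.task
    def process(seconds=1):
        """Stand-in for a 1 second job."""
        time.sleep(seconds)
        return seconds
```

With this layout, switching brokers between trials means changing one environment variable on the gateway and the worker, then restarting both.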

I had total request time as one metric, using the requests package’s built-in profiling method. I also had a metric for total time taken for processing, using worker logs. In line with the general consensus, RabbitMQ was faster on average in both reading and writing.
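For the request-time metric, a sketch of how one trial could be measured. This assumes the "built-in profiling method" is the elapsed attribute on a requests response, which is my reading rather than something stated above; the gateway URL and the JSON payload are hypothetical.

```python
from datetime import timedelta


def total_seconds(elapsed_times):
    """Sum per-request timedeltas into a total number of seconds."""
    return sum(elapsed_times, timedelta()).total_seconds()


def run_trial(url, n_requests=300):
    """POST n_requests jobs to a (hypothetical) gateway URL and return total request time."""
    import requests  # imported here so total_seconds() works without requests installed

    elapsed = []
    for _ in range(n_requests):
        response = requests.post(url, json={"seconds": 1})
        response.raise_for_status()
        # elapsed covers sending the request until the response headers are parsed
        elapsed.append(response.elapsed)
    return total_seconds(elapsed)
```

Note that elapsed stops at the response headers, so it measures how quickly the gateway accepts a job, not how long the job takes to run.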

Writing/requesting time (s):

RabbitMQ: 209 ± 28
AWS SQS: 248 ± 57

Reading/processing time (s):

RabbitMQ: 361 ± 3
AWS SQS: 368 ± 6

However, in both cases the uncertainty ranges overlap, and other components of the pipeline could easily become more dominant bottlenecks than the brokers themselves.
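The overlap can be checked directly from the numbers above. This sketch assumes each uncertainty is a symmetric spread around the mean (for example a standard deviation across trials). An overlap check like this is a rough sanity test, not a formal significance test.

```python
def interval(mean, spread):
    """Return the (low, high) range for mean ± spread."""
    return (mean - spread, mean + spread)


def overlaps(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (mean, uncertainty) in seconds, copied from the results above
RESULTS = {
    "writing": {"RabbitMQ": (209, 28), "AWS SQS": (248, 57)},
    "reading": {"RabbitMQ": (361, 3), "AWS SQS": (368, 6)},
}

for metric, brokers in RESULTS.items():
    rabbit = interval(*brokers["RabbitMQ"])
    sqs = interval(*brokers["AWS SQS"])
    gap = brokers["AWS SQS"][0] - brokers["RabbitMQ"][0]
    print(f"{metric}: RabbitMQ {rabbit}, SQS {sqs}, "
          f"overlap={overlaps(rabbit, sqs)}, mean gap={gap} s")
```

Both pairs of ranges overlap: 181 to 237 against 191 to 305 for writing, and 358 to 364 against 362 to 374 for reading.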

As a cash-strapped startup, cost was another important factor for us, so I estimated costs from the test results.

Scaling:

Assume client use rate of 100% (currently < 5%)
Doubling client #s every month
=> clients will push ~60 million requests in about 12 months as a very optimistic estimate

Costs:

RabbitMQ:
A base cost of ~$9 for a fully used t2.micro instance
~2% CPU usage for ~1 request/second
Taking into account baseline CPU and CPU credits, we get about 23 million requests/month
=> 23 million requests / $9

AWS SQS:
=> 20 million requests / $10
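Put on a common scale, the two estimates work out as follows. This is a small sketch using only the figures above; the 30-day month is an assumption, and the implied CPU figure is a back-calculation from the 23 million estimate, not a measurement.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def cost_per_million(dollars, requests):
    """Dollars per million requests."""
    return dollars / (requests / 1e6)


rabbitmq = cost_per_million(9, 23e6)   # ~$9 for ~23 million requests
sqs = cost_per_million(10, 20e6)       # $10 for 20 million requests

# What the 23 million requests/month estimate implies for the instance
implied_rate = 23e6 / SECONDS_PER_MONTH   # average requests per second
implied_cpu = implied_rate * 2            # percent, at ~2% CPU per request/second

print(f"RabbitMQ: ${rabbitmq:.2f} per million requests")
print(f"AWS SQS:  ${sqs:.2f} per million requests")
print(f"Implied load: {implied_rate:.1f} requests/s, ~{implied_cpu:.0f}% average CPU")
```

That is roughly $0.39 against $0.50 per million requests, with the RabbitMQ figure only holding once the instance is close to fully used.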

So we start to see a slight cost advantage for RabbitMQ at around 20 million requests per month. We assume that at smaller scales, SQS will be advantageous (lacking overhead costs), and at larger scales, RabbitMQ will become more advantageous (using more cost-effective, large instances). Since the scaling estimates indicate that this switch will occur ~11 months from now at the earliest, SQS appears to be a cost-effective choice for a reasonable amount of time in the future.
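The timing of the switch can be back-calculated from the scaling estimate. This sketch rests on one reading of the numbers: that ~60 million is the monthly volume reached at month 12 under monthly doubling. If that figure is cumulative instead, the result shifts.

```python
import math


def months_until(target, volume_at_horizon=60e6, horizon_months=12, growth=2.0):
    """Months from now until monthly volume reaches `target`,
    given the volume at the horizon and a constant monthly growth factor."""
    return horizon_months - math.log(volume_at_horizon / target, growth)


crossover = months_until(20e6)
print(f"20 million requests/month reached after ~{crossover:.1f} months")
print(f"First full month at or above 20 million: month {math.ceil(crossover)}")
```

Under these assumptions the volume is about 15 million at month 10 and 30 million at month 11, so month 11 is the first one past the 20 million mark.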

Since RabbitMQ’s 2 main advantages, better performance and lower costs, will likely not be important factors for at least another year, we have decided to use AWS SQS for now. Conveniently, one of the main advantages of Celery is that it shouldn’t be too hard to change the broker that we’re using later on.

Rollout results are discussed in part 3.

JACK
