Processing Pipeline - Requirements

Apr 2017 - code

At my work, we’re building a pipeline to provide data-driven feedback to users.

The de facto king for data work is Python, thanks to its community. A common way to deploy data models is behind a Flask application, but this isn’t a good fit for every case. What if the service doesn’t need a front end? What if clients don’t want to wait for a potentially long operation? What if our business requirements are still uncertain, and we anticipate a need to scale?
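To make the problem concrete, here is a minimal sketch of that kind of Flask deployment (the `run_model` function and the `/predict` route are hypothetical): the request blocks until the model finishes, so a slow model means a slow response.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_model(payload):
    # Placeholder for a potentially slow, CPU-heavy model call.
    return {"score": 0.42}

@app.route("/predict", methods=["POST"])
def predict():
    # The client waits for the whole computation before getting a response.
    result = run_model(request.get_json())
    return jsonify(result)

if __name__ == "__main__":
    app.run()
```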

We could use Twisted or Tornado to imitate some of that Node.js non-blocking goodness, but other libraries might not play well with their event loops, and it’s a new paradigm for the devs to learn. Scaling and tight coupling remain concerns, as with any monolithic web app.

A strong solution is to use job queues: we drop off work and have workers share the tasks. Dropping off work takes very little time, and persisting jobs in a store improves reliability. There can be any number of workers, of any type, which allows great flexibility and scalability. The disadvantage is having to set up and maintain the queue infrastructure; given the importance of this pipeline to our business, that’s a worthwhile investment.

Nameko is a framework focused on microservices that communicate over message queues. However, it’s quite new, and research doesn’t turn up many resources on deploying and maintaining it in production. A risky choice for an early startup with little devops experience.
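For reference, a Nameko service is essentially a class whose methods are exposed as RPC entry points over the broker. This is only a sketch; the service and method names are made up for illustration.

```python
from nameko.rpc import rpc

class FeedbackService:
    # Service name used by Nameko to route messages over the broker.
    name = "feedback_service"

    @rpc
    def analyse(self, payload):
        # Hypothetical stand-in for the real analysis work.
        return {"status": "ok", "input": payload}
```

The service is started with `nameko run <module>` against a running RabbitMQ instance.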

Iron.io is a nicely packaged solution for managing job queues and workers. However, the price can be relatively hefty, and we would be locked into their services. A risky choice for a startup on a budget, with rapidly changing requirements.

PythonRQ, a simple self-hosted framework for job queues and workers, may similarly be too restrictive long term: it only works with Redis as its broker and Python for its workers, and it cannot guarantee job completion.
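To show both the simplicity and the Redis coupling, enqueueing work with RQ looks roughly like this (the `generate_feedback` task and its module are hypothetical):

```python
from redis import Redis
from rq import Queue

from tasks import generate_feedback  # hypothetical module holding the slow function

# RQ only speaks Redis: the queue itself lives in a Redis instance.
q = Queue(connection=Redis())

# Enqueueing returns immediately; a separate `rq worker` process runs the job later.
job = q.enqueue(generate_feedback, user_id=123)
print(job.id)
```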

That leaves the natural choice for job management: the industry standard, Celery, another self-hosted framework with great configurability.
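A minimal Celery setup looks something like the sketch below; the broker URL and the `generate_feedback` task are assumptions for illustration, not our actual pipeline code.

```python
from celery import Celery

# The broker URL is whatever we end up choosing; a local RabbitMQ is shown here.
app = Celery("pipeline", broker="amqp://guest:guest@localhost//")

@app.task
def generate_feedback(user_id):
    # Hypothetical stand-in for the real data-crunching work.
    return {"user_id": user_id, "feedback": "..."}
```

Calling `generate_feedback.delay(123)` returns immediately with an async result handle, while a worker started with `celery -A pipeline worker` picks the job up from the broker.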

However, we also have to choose the message store that Celery will use (a broker). There are many possibilities here as well, but we narrowed it down to two big players: RabbitMQ and Amazon SQS. Both were designed for reliable messaging at scale and have strong user communities.
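Conveniently, switching Celery between the two candidates is mostly a matter of the broker URL. A rough sketch, assuming credentials come from the environment rather than being embedded in the URL:

```python
from celery import Celery

# Option 1: RabbitMQ as the broker (AMQP URL).
app = Celery("pipeline", broker="amqp://guest:guest@localhost//")

# Option 2: Amazon SQS as the broker (requires the `celery[sqs]` extra;
# AWS credentials are typically picked up from the environment or an IAM role).
# app = Celery("pipeline", broker="sqs://")
```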

The final choice is described in detail in part 2.
