Configuration, security, and deployment with SQS were all as easy as expected, and IAM has been a great convenience. It was also pretty straightforward to write tasks for Celery - but successes are not interesting.
Celery does not play perfectly with SQS.
One source of struggle was scheduled tasks, thanks to SQS limitations (mostly relating to timeouts). We were forced to use a local scheduling solution for timed output delivery - in this case, Advanced Python Scheduler. Unfortunately, out of the box, jobs can be dropped if a worker instance fails before an ETA is met, because pending jobs are stored in memory. To make this a more robust long-term solution, we’ll have to use disk storage.
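To make the disk storage idea concrete, here is a minimal sketch using only the standard library's sqlite3 module. It is not our production code and it does not use Advanced Python Scheduler (which ships its own persistent job stores); the `DiskJobStore` class, the table layout, and the payload shape are all assumptions made for illustration.

```python
import json
import sqlite3
import time


class DiskJobStore:
    """Hypothetical store that keeps pending jobs in SQLite, not in memory."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "id INTEGER PRIMARY KEY, eta REAL NOT NULL, payload TEXT NOT NULL)"
        )
        self.conn.commit()

    def schedule(self, eta, payload):
        """Write the job to disk right away so a crash cannot drop it."""
        self.conn.execute(
            "INSERT INTO jobs (eta, payload) VALUES (?, ?)",
            (eta, json.dumps(payload)),
        )
        self.conn.commit()

    def pop_due(self, now=None):
        """Return and delete every job whose ETA has passed, oldest first."""
        now = time.time() if now is None else now
        rows = self.conn.execute(
            "SELECT id, payload FROM jobs WHERE eta <= ? ORDER BY eta", (now,)
        ).fetchall()
        self.conn.executemany(
            "DELETE FROM jobs WHERE id = ?", [(row[0],) for row in rows]
        )
        self.conn.commit()
        return [json.loads(row[1]) for row in rows]
```

A worker would call `pop_due()` on a timer and run whatever comes back. Because the jobs live in a file, a restarted worker that opens the same path picks up where the old one left off. Note that this sketch deletes a job before it runs, so a crash in the middle of a job can still lose it; deleting only after the job succeeds would trade that risk for possible duplicates.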
SQS also seemed to cause Celery workers to hang when running more than one worker per instance. This is a fairly significant problem at scale and will hopefully be resolved soon.
Presumably, PyQs should work better with SQS, and may deserve another look if support is improved.
Our design also allowed a little extra complexity to sneak in. The information required for the jobs comes from a Node.js lambda service that was already used for shuttling data around. There was no clean way of dropping these jobs directly into the Python worker queues, so we needed another Python lambda. However, it’s not a huge issue as the data could be reformatted at this gateway anyway, giving a nice separation of concerns between the two systems.
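As a rough illustration of that gateway, the sketch below shows a Lambda-style handler that reshapes an incoming payload and hands it to an injected enqueue function. Every name here is hypothetical: the field names (`userId`, `deliverAt`), the `reformat_job` and `make_handler` helpers, and the response shape are assumptions, not our actual interface.

```python
def reformat_job(event):
    """Map the upstream payload onto the shape the Python workers expect.

    Field names on both sides are hypothetical.
    """
    return {
        "user_id": int(event["userId"]),
        "deliver_at": event["deliverAt"],
    }


def make_handler(enqueue):
    """Build a Lambda-style handler around an injected enqueue callable."""

    def handler(event, context=None):
        # Reformat at the boundary so neither side needs to know the
        # other's conventions, then hand the job to the worker queue.
        enqueue(reformat_job(event))
        return {"statusCode": 200}

    return handler
```

Injecting `enqueue` keeps the reformatting logic testable without a broker. In a real deployment it could be a thin wrapper that passes the job to a Celery task, which then lands in the worker queue.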
Overall, there were no disasters during the initial rollout. There were some hiccups, as you’d expect from any plan. Pending maturation and scaling, maybe there will be a part 4.