Memory Leak in Celery
It turns out Celery has some memory leaks, which we didn’t know beforehand. After deploying some Celery servers on AWS ECS, we noticed that the Celery tasks would consume most of the server’s memory and then become idle.
My first attempt was to set a hard limit of 1GiB on the container’s memory. And guess what? Celery consumed 99.9% of that limit and then became idle after some time. That’s good for the server, but it doesn’t solve our problem.
My second attempt was to set CELERYD_TASK_TIME_LIMIT to 300, so Celery tasks are killed after 5 minutes no matter what. This time Celery still took as much memory as it could and then became inactive, but after 5 minutes it killed all the tasks to release the memory and then went back to work normally.
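For reference, here is a minimal sketch of that time limit setting. The module name celeryconfig.py and the TASK_TIME_LIMIT_MINUTES constant are my own illustrative assumptions; only CELERYD_TASK_TIME_LIMIT and the value 300 come from the setup described above.

```python
# celeryconfig.py: a hypothetical settings module for the Celery app.
# It could be loaded with something like app.config_from_object("celeryconfig").

# Illustrative helper constant (an assumption, not part of Celery).
TASK_TIME_LIMIT_MINUTES = 5

# Hard time limit in seconds. When a task runs longer than this, the worker
# process handling it is killed and replaced with a new one.
CELERYD_TASK_TIME_LIMIT = TASK_TIME_LIMIT_MINUTES * 60
```

Note that CELERYD_TASK_TIME_LIMIT is the old-style setting name. Celery 4 and later call the same setting task_time_limit.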
I thought it worked, but it didn’t.
After running for a while, Celery still hung, so the hang wasn’t due to the leak anymore. Digging further, I found that the main reason Celery hangs is thread locks caused by the neo4j Python driver. That can only be solved completely by changing the way the neo4j driver saves and fetches data to async, which is still an open issue on GitHub. Although people have suggested some temporary workarounds, they only apply to Python 3, and our project is still on Python 2. Hence, a transition from Python 2 to Python 3 is needed.
In the meantime, I set up a cron job to restart Celery periodically to clear the lock.
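As a rough sketch of what such a cron job could call: the script below assumes the Celery workers run as an ECS service and restarts them by forcing a new deployment through the AWS CLI. The cluster name, service name, function names, and the --run flag are all hypothetical, and the restart mechanism itself is an assumption, so adapt it to however your workers are actually managed.

```python
import subprocess
import sys

# Hypothetical ECS names; replace with your own cluster and service.
ECS_CLUSTER = "my-cluster"
ECS_SERVICE = "celery-worker"


def build_restart_command(cluster, service):
    """Build the AWS CLI command that forces ECS to replace the running tasks."""
    return [
        "aws", "ecs", "update-service",
        "--cluster", cluster,
        "--service", service,
        "--force-new-deployment",
    ]


def restart_celery(cluster=ECS_CLUSTER, service=ECS_SERVICE, dry_run=True):
    """Restart the Celery service; with dry_run=True only print the command."""
    cmd = build_restart_command(cluster, service)
    if dry_run:
        print(" ".join(cmd))
        return cmd
    # Raises CalledProcessError if the AWS CLI exits with a non-zero status.
    subprocess.check_call(cmd)
    return cmd


if __name__ == "__main__":
    # Only perform the real restart when --run is passed explicitly.
    restart_celery(dry_run="--run" not in sys.argv[1:])
```

Cron would then run the script with --run at whatever interval keeps the workers from hanging. Without the flag it only prints the command, which is handy for checking the setup first.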