Here at ticketea we depend on many external services, some of which can be down at times or, even worse, become slow.
Why is it worse for an external service to be slow rather than to be down?
At ticketea, we have applications running with processes and thread pools. This means that each request is handled by one worker process, and that for the duration of the request that worker handles only that single request. This is well explained by Little's law, which relates the latency of your controllers and the number of workers you have to the number of user requests you can serve. So, if you make a synchronous call to an external web service, the time that third-party service takes to answer is added to the time your worker is unavailable to serve other requests.
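As a back-of-the-envelope illustration of Little's law (average concurrency equals arrival rate times average latency), with hypothetical numbers:

```python
# Little's law: L = lambda * W, where
#   L      = average number of requests in flight (busy workers),
#   lambda = request arrival rate,
#   W      = average time spent per request.

def workers_needed(arrival_rate_per_s, avg_latency_s):
    """Average number of busy workers required to keep up with traffic."""
    return arrival_rate_per_s * avg_latency_s

# Hypothetical: 50 req/s at 100 ms per request -> 5 busy workers on average.
print(workers_needed(50, 0.100))

# Same traffic, but a slow external call adds 2 s to each request:
# now you need about 105 busy workers just to keep up.
print(workers_needed(50, 2.100))
```

The numbers are made up, but the point stands: a slow dependency multiplies the concurrency you need, and a fixed-size worker pool runs out long before that.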
Now, suppose this third-party service returns a 5xx HTTP response: you can handle this appropriately in your code and return a proper response indicating the situation to your end user.
But if instead the third-party service keeps the connection open for a long time, your worker will get stuck waiting for a response. If this happens to enough workers, you may end up with no workers left to serve your own application, leaving you no option but to return an HTTP 504 (Gateway Timeout) from your load balancer.
One solution is to add a timeout to your requests (something you should always do anyway), but this only alleviates the problem: in the situation above, your workers still have to wait for the timeout to expire. Setting a low timeout also has its own trade-offs, as it makes you less tolerant of the third-party service's latency.
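For illustration, a minimal sketch of bounding a call with a timeout using Python's standard library (the URL and the timeout value are hypothetical; with the popular requests library you would pass `timeout=` to `requests.get` in the same spirit):

```python
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url, timeout=2.0):
    """Fetch a URL, giving up after `timeout` seconds instead of
    blocking a worker indefinitely. Returns the response body,
    or None on timeout or connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout):
        # Covers DNS failures, refused connections and timeouts;
        # the caller can degrade gracefully instead of hanging.
        return None
```

Even so, every failed call still costs you up to `timeout` seconds of a worker's time, which is exactly the gap the circuit breaker discussed below fills.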
A common solution to this problem is the Circuit Breaker pattern, which can be very helpful in this situation. TL;DR: you wrap your calls so that, after a number of failed requests to an external service, subsequent calls fail immediately for a certain period of time. This keeps your service healthy and able to handle the external outage appropriately, gives the backend service some breathing room to recover, and prevents a cascading failure.
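The pattern itself fits in a few lines. Here is a minimal single-process sketch (the class and parameter names are mine for illustration, not failfast's API), with an injectable clock to make the state transitions easy to follow and test:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls fail fast."""

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: don't even touch the backend.
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout expired: allow one trial call ("half-open").
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

A real multi-host setup needs the failure count and open/closed state in shared storage (e.g. a cache or database) rather than in process memory, which is precisely the part many libraries get wrong.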
We looked for a Python implementation of the circuit breaker pattern. After trying out several libraries that didn't work for us (some didn't work well across multiple hosts/processes, others were too complicated to use and provided many more features than we really needed), we eventually implemented our own thing.
We've been running our circuit breaker library in production for a few months, and since we're happy with the results, we decided to follow the same road we took with Pynesis: we published it as an open source library named failfast, uploaded it to PyPI, and wrote this blog post.