Understanding and debugging blocked event loops in Python async applications and Hatchet workers.
Matt Kaye, Engineer · Hatchet
Blocked event loops are, by far, the most common problem we see when providing support to Hatchet users. If you use Hatchet, and Hatchet's Python SDK in particular, you might've seen a warning like this:
Warning: THE TIME TO START THE STEP RUN IS TOO LONG, THE MAIN THREAD MAY BE BLOCKED
Scary! Let's talk through what's going on under the hood, some possible causes of this warning in Hatchet, and how to debug it effectively.
Note: New to async/await and event loops in Python? I'd recommend checking out FastAPI's async documentation quickly before getting started here. Hatchet handles synchronous and asynchronous work very similarly to FastAPI.
First and foremost, in the vast majority of cases, this scary warning from Hatchet is caused by a blocked event loop. And if the event loop is blocked, there's a very good chance that some code it is trying to run (read: a Hatchet task) is doing some blocking work. The asyncio documentation states its recommendation for handling blocking functions very eloquently, in one sentence that gets right to the crux of the issue:
Blocking (CPU-bound) code should not be called directly.
Note: CPU-bound work, in the simplest terms, is work that spends most of its time doing actual computation as opposed to e.g. waiting for some external process (like an API call) to complete. Importantly, using e.g. requests.get to make an API call also (confusingly) falls under this definition of "CPU-bound" even though it's also just waiting, since while it waits the program cannot context switch to some other work running in the event loop (because requests is not async).
Let's give a simple example, which we'll come back to later as a helpful debugging strategy. We'll first write two functions:
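The original listing isn't reproduced here, so what follows is a minimal sketch of what such a pair of functions might look like. The names blocking and non_blocking, the name and delay parameters, the shared log list, and the three-iteration loop are all assumptions made for illustration.

```python
import asyncio
import time

# Shared record of what was printed, in order (an illustrative helper).
log: list[str] = []


async def blocking(name: str, delay: float = 0.5) -> None:
    """Async in name only: time.sleep never yields to the event loop."""
    for i in range(3):
        log.append(f"{name} {i}")
        print(log[-1])
        time.sleep(delay)  # holds the event loop thread for the whole delay


async def non_blocking(name: str, delay: float = 0.5) -> None:
    """Yields to the event loop on every await, so other tasks can run."""
    for i in range(3):
        log.append(f"{name} {i}")
        print(log[-1])
        await asyncio.sleep(delay)  # suspends only this coroutine
```

The only difference between the two is the sleep call, which is what makes the pair useful as a debugging baseline.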
And let's run these concurrently with asyncio.gather and asyncio.create_task:
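As a sketch of that step, assuming a blocking function like the one described above (repeated here so the snippet runs on its own; the task names A and B and the main helper are illustrative):

```python
import asyncio
import time

log: list[str] = []  # illustrative record of the output order


async def blocking(name: str, delay: float = 0.5) -> None:
    for i in range(3):
        log.append(f"{name} {i}")
        print(log[-1])
        time.sleep(delay)  # never yields, so no other task can run


async def main(delay: float = 0.5) -> None:
    # Both tasks are scheduled up front, but the loop can only switch
    # between them at an await, and blocking() never awaits anything.
    task_a = asyncio.create_task(blocking("A", delay))
    task_b = asyncio.create_task(blocking("B", delay))
    await asyncio.gather(task_a, task_b)


if __name__ == "__main__":
    asyncio.run(main())
```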
If you run this code, you'll see logs like this, with one task running to completion before the other logs anything:
On the other hand, you can run two tasks running the non-blocking function concurrently as you'd expect:
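A sketch of the non-blocking counterpart, under the same assumptions as before (the function, parameter, and task names are illustrative, and the function is repeated so the snippet runs on its own):

```python
import asyncio

log: list[str] = []  # illustrative record of the output order


async def non_blocking(name: str, delay: float = 0.5) -> None:
    for i in range(3):
        log.append(f"{name} {i}")
        print(log[-1])
        await asyncio.sleep(delay)  # hands control back to the event loop


async def main(delay: float = 0.5) -> None:
    task_a = asyncio.create_task(non_blocking("A", delay))
    task_b = asyncio.create_task(non_blocking("B", delay))
    await asyncio.gather(task_a, task_b)


if __name__ == "__main__":
    asyncio.run(main())
```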
Which results in the logs below. Note that the output from the two tasks, A and B, is interleaved, indicating that they're correctly running concurrently.
If you were to run code like the blocking example in a Hatchet task, you'd see the scary warning from above.
The long and short of the problem here, as so nicely put by the asyncio documentation, is that if some async code is doing anything blocking, then everything else will need to wait for that blocking operation to complete. This means that if you have a Hatchet worker running 1,000 tasks concurrently and one of them does something blocking, none of your other tasks will run while that blocking operation is happening.
Some common (and some less common) examples of blocking operations might include:
- Making API calls with requests.get
- Running queries with psycopg, such as an expensive SELECT statement that takes a long time to complete
In each of these cases, while this work is happening, no other async work on your Hatchet workers will be able to progress. We see some interesting and scary behavior if we run some blocking code in Hatchet.
Here we define a few tasks, one which is async and does blocking work (time.sleep), one which is sync and does blocking work (time.sleep), and one that is async and does non-blocking work (asyncio.sleep).
As an experiment, we can run them as follows to simulate what might happen in a production environment:
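The Hatchet listings aren't reproduced here. As a rough stand-in, the sketch below approximates the same experiment with plain asyncio rather than the Hatchet SDK. It assumes that asyncio.to_thread is a fair proxy for the thread pool Hatchet uses for sync tasks. All function names, labels, and timings are hypothetical, and its output will not match the Hatchet worker logs discussed next.

```python
import asyncio
import time

events: list[tuple[float, str]] = []  # (timestamp, message) pairs


def record(message: str) -> None:
    """Store and print an event; list.append is safe across threads."""
    events.append((time.monotonic(), message))
    print(message)


async def blocking_async(seconds: float) -> None:
    record("blocking: start")
    time.sleep(seconds)  # stalls the event loop for the whole duration
    record("blocking: done")


def non_blocking_sync(label: str, ticks: int, interval: float) -> None:
    record(f"{label}: start")
    for i in range(ticks):
        time.sleep(interval)  # harmless here: this runs in a worker thread
        record(f"{label}: {i}")


async def non_blocking_async(ticks: int, interval: float) -> None:
    for i in range(ticks):
        record(f"async: {i}")
        await asyncio.sleep(interval)


async def main(interval: float = 0.2) -> None:
    # 1. Start the well-behaved async and sync work and let it log a little.
    async_task = asyncio.create_task(non_blocking_async(6, interval))
    sync_task = asyncio.create_task(
        asyncio.to_thread(non_blocking_sync, "sync-1", 8, interval)
    )
    await asyncio.sleep(2.5 * interval)
    # 2. Start the blocking task; the loop stalls as soon as it begins to run.
    blocking_task = asyncio.create_task(blocking_async(5 * interval))
    await asyncio.sleep(0)  # yields to blocking_async; resumes only once it ends
    # 3. Only now can a second sync job be handed to the thread pool.
    await asyncio.to_thread(non_blocking_sync, "sync-2", 1, interval)
    await asyncio.gather(async_task, sync_task, blocking_task)


if __name__ == "__main__":
    asyncio.run(main())
```

In this approximation, the already-running sync job keeps logging while the loop is stalled, the async job goes quiet, and the second sync job cannot even start until the blocking coroutine returns.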
The intention of this example is to first kick off the non-blocking sync and async tasks, let them start to process, then kick off the blocking task, let it start to process, and finally kick off the non-blocking sync task again, and then let all of them complete. The worker logs are illustrative:
Here's a play-by-play of what happened:
1. The non-blocking sync and async tasks start running and begin logging.
2. Next we see run: start step: blocking:blocking, indicating that the worker has now started running the blocking task.
3. We stop seeing Non blocking async logs, as the event loop is blocked. Notice that at this point, we continue to see Non blocking sync logs. This is an important design decision in Hatchet. Hatchet runs synchronous tasks in a thread pool so they can be executed in a non-blocking way, which means that once a sync task has started, it can continue executing even if the main event loop is blocked.
4. We see a start step run event, which indicates the last run has been triggered: rx: start step run: 7742df98-169f-4afa-9075-e43c8b3ea8df/non_blocking_sync:non_blocking_sync. Importantly, you might expect that since this task is sync, it will be executed correctly without being blocked, similarly to how the previous one was in step 3. This is not the case! Since the event loop is blocked, Hatchet cannot begin to execute this task run, which is why we immediately start seeing the scary warning log.
5. The warning appears: THE TIME TO START THE STEP RUN IS TOO LONG, THE MAIN THREAD MAY BE BLOCKED: Waiting Steps 1
6. The blocking task finishes: finished step run: blocking:blocking
7. We see the Non blocking async 2 log. Importantly, this index (2) was where it left off before, but it "slept" for about six seconds (the duration of the blocking task), as opposed to the one second that we intended, between log lines.
8. Finally, we see the last Non blocking sync task starting and finishing.
So you're seeing the scary warning: Now what?
asyncio's DEBUG mode
asyncio has a debug mode, which will give you more observability into the async operations that your worker is doing.
This will log warnings about slow callbacks and provide additional information about tasks that are taking too long.
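As a minimal sketch of what enabling it can look like in a standalone script: passing debug=True to asyncio.run turns debug mode on, as does setting the PYTHONASYNCIODEBUG environment variable to 1. The blocks_the_loop and main functions below are hypothetical stand-ins for your own code.

```python
import asyncio
import logging
import time


async def blocks_the_loop(seconds: float) -> None:
    time.sleep(seconds)  # deliberately blocking, to trigger the warning


async def main(block_seconds: float = 0.5) -> None:
    await blocks_the_loop(block_seconds)


if __name__ == "__main__":
    # Slow-callback warnings are emitted on the "asyncio" logger.
    logging.basicConfig(level=logging.WARNING)
    # In debug mode, any single step that holds the loop longer than
    # loop.slow_callback_duration (0.1 seconds by default) is reported.
    asyncio.run(main(), debug=True)
```

The warning names the task and coroutine that held the loop, which is usually enough to point at the offending function.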
First line of defense: look for things that are obviously blocking. API calls, database operations, for loops doing something involved or running many iterations, and so on. Depending on what the problem is, there are different ways to handle different situations:
- If you're using requests or similar, try using aiohttp instead to make the calls async.
- If you're using psycopg2 or similar synchronous database libraries for database I/O, try using asyncpg or psycopg[binary] with asyncio support instead, to make database operations async.
- If you have blocking functions with no async equivalent, use asyncio.to_thread to run them in a separate thread so they don't block the main event loop. For example: await asyncio.to_thread(some_blocking_function, arg1, arg2).
- If you have long-running computations, such as involved for loops, you can use asyncio.to_thread there too to offload the work to a separate thread.
As a last resort, you can also change your tasks from being async to sync, although we don't recommend this in the majority of cases.
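The asyncio.to_thread suggestion can be sketched as follows. Here some_blocking_function is a hypothetical stand-in for a synchronous call, and the ticker coroutine exists only to show that the event loop keeps running while the blocking call is in flight.

```python
import asyncio
import time


def some_blocking_function(seconds: float) -> str:
    time.sleep(seconds)  # stands in for a sync HTTP call or database query
    return "done"


async def ticker(interval: float, stop: asyncio.Event) -> int:
    """Count how many times the loop got a chance to run this coroutine."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(interval)
        ticks += 1
    return ticks


async def main(block_seconds: float = 1.0) -> tuple[str, int]:
    stop = asyncio.Event()
    ticker_task = asyncio.create_task(ticker(block_seconds / 10, stop))
    # The blocking call runs in a worker thread; the event loop stays free.
    result = await asyncio.to_thread(some_blocking_function, block_seconds)
    stop.set()
    return result, await ticker_task


if __name__ == "__main__":
    print(asyncio.run(main()))
```

If the call were made directly inside main instead of through asyncio.to_thread, the ticker would not advance at all until the call returned.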
Ruff (for example) implements the flake8-async rules under the ASYNC prefix, which can help you catch potential issues in async code, such as blocking calls inside async functions.
If you've resolved all of the obvious issues but the Scary Warning ™️ is still popping up, instrumenting your code can help find the bottleneck. Hatchet's Python SDK provides an OpenTelemetry Instrumentor, which allows you to easily export traces and spans from your Hatchet workers. If you have some long-running tasks (or long start times), you can use the traces to get a better sense for what might be blocking. In particular, if there are some async operations that appear to just be hanging for significantly longer durations than they should take, this is a good indication they're being blocked by something.
Similarly, you can also instrument your code with the AsyncioInstrumentor and other, similar instrumentors depending on other tools in your stack.
As a last resort, another thing to try is running your code in a fashion similar to how we did above, outside of Hatchet, by creating async tasks and using gather to run them concurrently. If there's blocking behavior, it'll be apparent when one of the tasks is blocked.
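One way to make the blocking "apparent" is to run a small heartbeat coroutine alongside the code under suspicion and measure how late its wake-ups are. This is a sketch under assumptions: heartbeat, suspect, and check are hypothetical names, and suspect should be replaced with the body of the task you're investigating.

```python
import asyncio
import time


async def heartbeat(interval: float, beats: int) -> float:
    """Return the longest gap observed between consecutive wake-ups."""
    worst = 0.0
    last = time.monotonic()
    for _ in range(beats):
        await asyncio.sleep(interval)
        now = time.monotonic()
        worst = max(worst, now - last)
        last = now
    return worst


async def suspect(seconds: float) -> None:
    """Hypothetical stand-in for the task body under investigation."""
    time.sleep(seconds)  # replace with the real code you want to check


async def check(block_seconds: float = 0.5, interval: float = 0.05) -> float:
    worst_gap, _ = await asyncio.gather(
        heartbeat(interval, beats=5),
        suspect(block_seconds),
    )
    return worst_gap


if __name__ == "__main__":
    gap = asyncio.run(check())
    print(f"longest heartbeat gap: {gap:.2f}s (requested interval: 0.05s)")
```

A longest gap close to the requested interval suggests the suspect code is yielding properly. A gap close to the suspect's total runtime suggests it is holding the event loop.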
Blocked event loops can significantly impact the performance of your Hatchet workers, causing tasks to wait unnecessarily and triggering those scary warning messages. We added the scary warning to the SDK to help flag that something might be blocking the loop. Note that it's not always an indication that the event loop is blocked, but it's a hint that something might be wrong.
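By following the debugging steps outlined in this post, you should be able to find the blocking code in your tasks and either replace it with an async alternative or offload it with asyncio.to_thread.
To reiterate the main point from the start of the post, taken directly from the asyncio documentation: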
Blocking (CPU-bound) code should not be called directly.