The performance implications of misunderstanding Node.js promises
There is nothing quite like the existential dread of realizing you may have a performance issue. Months, possibly years, worth of codebase are suddenly now failing you in a live environment. Reproduction is difficult, and even if you can reproduce it on your dev machine your usual methods of debugging are likely useless. Sure, you can litter your code with entirely too much analytics, or leverage Node’s own profiling, but this results in a data hose that takes a deep understanding of Node, V8, and data science itself to make sense of it. This is a rather dire situation.
A common scenario in any service is you may have some asynchronous operation that you need to perform on a data set of unknown length. Your first instinct may look something like this:
You quickly discover that this is not very fast. Simple interrogation of CPU utilization will show Node is spending a lot of time waiting. Depending on your lint rules, you may not even be able to get this code past a CI build. The solution for many is to simply fire all the promises then await all of them. That, in and of itself, is not a bad thing; for example, say you know that every time you execute a given route, you need to make 3 unrelated calls to other services. Since you know it will always create 3 promises, this is bounded and safe. The problems arise when the promise creation is potentially unbounded.
You test your code and discover it is much faster, Node is not spending time waiting around, and the lint error is gone. You merge and deploy your code, all the integration tests pass, and you go home to watch The Expanse. Suddenly, right when the Rocinante (a legitimate salvage) is entering a tense battle, you start getting PagerDuty alerts. Your services containers are getting restarted constantly due to Out of Memory exceptions, the container host is experiencing socket exhaustion, and your response times are through the roof. You have a major performance and scalability issue.
Benchmarking unbounded promise scenarios
This is a nice clean scenario, so the suspect code is easily identified, but what if it isn’t? Why does this code fail to scale? This is where Clinic.js shines. For the sake of exercising Clinic.js, I created 6 scenarios that all perform the same asynchronous work some n times using different strategies. They are using for…of, the new ES2018 for await…of, the Promise.all() scenario described above, using Bluebird.map() with a concurrency limit set, the promise-limit module, and the p-limit module.
For the test, I start a clustered web server that accepts a get and post request. Both verbs will perform some arbitrary work before completing the request. In the test itself, I get the data, do arbitrary work, and post the results back. In both cases the work is primarily parsing JSON and manipulating the object in an attempt to simulate a common Node workload. Each test run does this work 500 times, for each element in a mock collection. For the limiting modules, they are all set to allow 10 concurrent promises. All tests were performed using Node v12.14.0. The average execution times over 10 test runs can be seen below.
╔═══════════════╦══════════════════════════════════╗ ║ Test ║ Average Execution Time (Seconds) ║ ╠═══════════════╬══════════════════════════════════╣ ║ await-for-of ║ 6.943 ║ ║ bluebird-map ║ 4.550 ║ ║ for-of ║ 6.745 ║ ║ p-limit ║ 4.523 ║ ║ promise-all ║ 4.524 ║ ║ promise-limit ║ 4.457 ║ ╚═══════════════╩══════════════════════════════════╝
The results here certainly reflect the performance improvement when going from for…of to issuing concurrent promises, but the limiting libraries have similar execution times. Let’s dive deeper into what is going on.
for…of and for await…of
In the for await…of and for..of scenarios (respectively), the CPU utilization is bouncing around and is averaging well below 100%. This means the Node process is spending a lot of time waiting around. The execution time is also non-ideal at roughly 7 seconds averaged over 10 runs. Clinic detects this and reports a potential I/O wait issue bottlenecking the process. It recommends running a ‘bubbleprof’ to better identify the issue.
Note the light blue segment in the graph, this represents network I/O waits. It is spending quite a bit of time waiting for network operations. Perhaps we can speed this up by issuing them concurrently.
In the Promise.all() scenario, execution time has improved dramatically to 4.5 seconds on average. The CPU is now being efficiently utilized. However, notice the memory usage spiking up much higher with incredibly high event loop delay. Clinic detects this issue and reports a potential event loop issue. Why is this?
As you can see, the promise callback is invoked first, then it executes to the end of the block. Only after that do the `then` blocks fire. This, however, is only part of the story.
In the case of the Promise.all() code above, it will create a request for every item in the collection before any `then` block is resolved. This is true even if the early requests complete before it reaches the end of the collection. It will not process the responses until the event loop ticks over. This is due to I/O callbacks not being invoked until the event loop ticks into its I/O phase. It is only at this point `.then()`, or the code block proceeding `await`, is executed.
Ultimately the creation of an unbounded number of promises will result in all of the response objects becoming queued in memory and waiting for the event loop to tick over. This can swamp the Node process with excessive memory allocations, or worse, cause out of memory events. When these objects go out of scope, it will freeze up when the garbage collector executes as it is primarily a blocking, O(N) operation. It also means when the event loop finally does tick over, all of the `.then()` callbacks of completed requests fire synchronously. If this contains heavy work, the event loop will be stuck in the callback phase until this work is completed or another truly asynchronous operation is encountered. This means any events, inbound requests, timers, etc. will not be handled until the ‘.then()’ callbacks are all completed. The biggest concern here is that response times can spike to several seconds. To solve this, we need to limit the number of concurrent promises created.
Limiting simultaneous promises
As indicated in the graphs, Bluebird.map() with a concurrency limit pretty much eliminates the event loop blocking and the high memory usage while maintaining efficient CPU utilization. It matches the performance of Promise.all(), completing the test in 4.5s. Bluebird however has lost its performance edge in recent versions of V8 when compared to native promises and is a rather heavy complete Promise replacement. Let’s consider some other native promise options.
The promise-limit module also resolves our issue, and against matches the performance of Promise.all() at 4.5s. This is due to the aforementioned pressure on Node’s memory system creating unnecessary memory allocations and complex garbage collection events. The overhead of this module does get flagged by Clinic.js as creating some event loop blocking. It notably also is nowhere near as popular as the last module, p-limit.
The p-limit module has the least amount of event loop blocking of all the options, while providing the same performance as Promise.all() at 4.5s. Given the popularity compared to promise-limit and the performance of this module, this is the clear winner.
Clinic.js enables introspection into how your Node application is performing beyond simple top line execution times. It introspects not only your resource utilization, but what the code itself is doing. By no means have I explored all the capabilities of this tool here, however it grants us insight into why a specific codepath may appear to perform without issue on its own, but wreak havoc inside of a live service. NearForm presented this to us during Node Day 2019 and it has quickly become an indispensable debugging tool for us. Its ease of use and clear presentation helps not only quickly identify problematic code, but also convey the “why” to others who are not deep into the debugging process with you. It was the obvious choice to illustrate the issues with unbounded promise creation.