Dev Time Stories Episode 3 — Five steps to quickly track down and handle uncaught exceptions in Node.js

If you’re a Node.js developer and your application hasn’t crashed yet because of some uncaught exception, you’re an amateur. I mean it: you’re probably just getting past “Hello world!”. But keep reading; it’s good to be well prepared for the future.

One thing developers face when working with Node.js is uncaught exceptions that kill the process. In this short article I want to give you a quick and easy strategy for handling uncaught exceptions.

If you’re not aware, Node.js emits an uncaughtException event on the process object whenever a thrown error bubbles all the way up to the event loop without being caught. By default, Node.js then prints the stack trace and exits. To listen for the event, you use the same syntax you use for binding to other events:

process.on('uncaughtException', /* callback */)

Now, there are a couple of things you need to do when an error bubbles up through your system, especially if your application is already running in production. This is exactly what this article will teach you: how to react in a crisis situation.

1. Don’t waste time

If the app is already running in production, you don’t have time to waste debating who is at fault. You also don’t have time to do library research right now; you need to get the job done. Find out the cause of that error.

2. Catch at topmost level

This ties into the first point as well. Instead of trying to figure out where the error comes from, just add an error handler at the topmost level (app.js or server.js). This is better because you might not know for sure which exact piece of code triggers the exception. And what if you have two exceptions, but you’re currently unaware of the second one? This way, you can simply add a global handler and catch everything.

process.on('uncaughtException', (error) => {
  // handle error
})

This also comes with a warning. As the Node.js documentation states, this should be used as a last resort. Here’s a quote from the official docs:

Note that uncaughtException is a crude mechanism for exception handling intended to be used only as a last resort. The event should not be used as an equivalent to On Error Resume Next. Unhandled exceptions inherently mean that an application is in an undefined state. Attempting to resume application code without properly recovering from the exception can cause additional unforeseen and unpredictable issues.

In layman’s terms, you shouldn’t use this handler to resume the execution of your app. The app is probably in a weird state that you need to clean up, as you will learn while going through the next steps of this strategy. It’s also worth mentioning that if the code inside the handler you pass to process.on throws, the exception is not caught again and the process exits directly.
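To make this concrete, here is a minimal sketch of such a last-resort handler. It assumes you are happy to write the report to stderr; the helper name formatFatalError and the "[fatal]" prefix are made up for illustration. The important part is that the handler reports the error and then exits instead of resuming.

```javascript
const fs = require('fs');

// Build the report text; kept separate from the handler so it can be tested.
// (formatFatalError is a hypothetical helper, not part of Node.js.)
function formatFatalError(error, origin) {
  const details = error instanceof Error ? error.stack : String(error);
  return `[fatal] origin=${origin} ${details}\n`;
}

process.on('uncaughtException', (error, origin) => {
  // Synchronous write, so the report is out before the process goes away.
  fs.writeSync(process.stderr.fd, formatFatalError(error, origin));
  // Do not resume: exit with a non-zero code so a supervisor can restart us.
  process.exit(1);
});
```

Because the handler itself must not throw, it is worth keeping it this small and pushing anything more elaborate into helpers you can test on their own.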

3. Clean up

Do everything in your power to make sure upstream and downstream systems are not left in weird states. This means rolling back unfinished database transactions, cleaning up caching systems, and signalling downstream systems that an error has occurred, giving them as much detail as they need.

Again, the documentation is very explicit about this. Pay close attention to the “synchronous” part. Cleanup here should not be asynchronous: the process is about to exit, so asynchronous work may never complete, and if it throws, there is nothing left to catch the error.

The correct use of ‘uncaughtException’ is to perform synchronous cleanup of allocated resources (e.g. file descriptors, handles, etc) before shutting down the process. It is not safe to resume normal operation after ‘uncaughtException’.
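As a rough sketch of what synchronous cleanup can look like, the snippet below tracks open file descriptors and closes them with the blocking fs calls before exiting. The registry (openDescriptors) and the helpers openTracked and cleanupSync are hypothetical names; your application will have its own resources to release.

```javascript
const fs = require('fs');

// Hypothetical registry of file descriptors this process has opened.
const openDescriptors = new Set();

function openTracked(filePath, flags) {
  const fd = fs.openSync(filePath, flags);
  openDescriptors.add(fd);
  return fd;
}

// Synchronous cleanup: each call returns only after the work is done.
function cleanupSync() {
  for (const fd of openDescriptors) {
    try {
      fs.closeSync(fd);
    } catch (closeError) {
      // A failed close must not stop the rest of the cleanup.
    }
    openDescriptors.delete(fd);
  }
}

process.on('uncaughtException', (error) => {
  cleanupSync();
  const details = error instanceof Error ? error.stack : String(error);
  fs.writeSync(process.stderr.fd, `${details}\n`);
  process.exit(1);
});
```

Every step inside the handler blocks until it has finished, so nothing is left pending when process.exit runs.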

4. Do the logging

Whatever you do, log the exception. You need the stack trace, the accounts causing the issue, correlation IDs if the request passes through multiple systems, and anything else you can get your hands on to figure out the state of the application and what is causing the issue. Be sure you anonymise customer data: create hashes instead of sending usernames and passwords in plain text, stuff like that.
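Here is one way the log entry could be put together, as a sketch rather than a recommendation for a specific logging setup. The field names (correlationId, accountId, accountHash) and the helpers anonymise and buildLogEntry are assumptions made for this example, and a plain SHA-256 hash is used only for brevity.

```javascript
const crypto = require('crypto');

// One-way hash, so entries can be correlated without exposing the raw value.
function anonymise(value) {
  return crypto.createHash('sha256').update(String(value)).digest('hex');
}

// `context` and its fields (correlationId, accountId) are hypothetical names.
function buildLogEntry(error, context = {}) {
  return JSON.stringify({
    level: 'fatal',
    time: new Date().toISOString(),
    message: error.message,
    stack: error.stack,
    correlationId: context.correlationId || null,
    accountHash: context.accountId ? anonymise(context.accountId) : null,
  });
}

// Example usage with made-up values.
const entry = buildLogEntry(new Error('payment failed'), {
  correlationId: 'req-123',
  accountId: 'alice@example.com',
});
console.log(entry);
```

The raw account identifier never reaches the log; only its hash does, which is still enough to tell whether two failures belong to the same account.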

5. Crash and restart it

Once you have performed all of the above, you need to let the process die. Let the application crash and restart. This ensures that you start up with clean memory and that no weird state is left in your specific instance’s memory. It will get rid of any unexpected behaviour.

The docs say you need an external “monitor” to detect failures and recover. You can find such a monitor at the end of the article.

To restart a crashed application in a more reliable way, whether uncaughtException is emitted or not, an external monitor should be employed in a separate process to detect application failures and recover or restart as needed.

Following the steps above will help you react faster to nasty production issues. It will at least help you get a clue about what is causing the problem. That is, unless you’re reading stuff from the URL and your product owner has the URL open with some bad input, so that whenever the application restarts and the WebSockets reconnect, the application crashes again.

Bonus #1 — Promises

Next up: handling promise errors. If your errors bubble up from promises, and you’re on an older version of Node.js, you have a similar handler for promise-generated errors. As you know, previous implementations of promises in Node.js would swallow the exception and fail silently. So unless you had a .catch handler attached to the promise, your code might fail and you would have no clue why. If you’re the type of person who hates to write .then() and .catch() every time, you can use process.on again. You should still handle your errors at the level where they appear or the next level up, but this can be a strategy too, as long as you don’t resume execution of whatever the application is doing. Here’s an example piece of code you could use to handle your promise errors:

process.on('unhandledRejection', (reason, promise) => {
  // follow steps 1-5
})
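A slightly fuller sketch of that handler is shown below. One thing it accounts for is that a rejection reason can be any value, not only an Error, so the report is built defensively. The helper name describeRejection and the "[fatal]" prefix are made up for this example.

```javascript
const fs = require('fs');
const util = require('util');

// A rejection reason can be any value, not only an Error.
function describeRejection(reason) {
  if (reason instanceof Error) {
    return reason.stack || reason.message;
  }
  return `non-Error rejection: ${util.inspect(reason)}`;
}

process.on('unhandledRejection', (reason, promise) => {
  // Same approach as for uncaughtException: report synchronously, then exit.
  fs.writeSync(
    process.stderr.fd,
    `[fatal] unhandledRejection ${describeRejection(reason)}\n`
  );
  process.exit(1);
});
```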

Bonus #2 — The monitor

To avoid having to crash and restart the application yourself, I would highly recommend PM2. It’s been a great help to me and to the teams I’ve worked with over the years. I stopped using it in 2016, mainly because I started working with Docker containers exclusively, and you can set up the container to restart when the process it is running fails. This way, I can use my regular exception handling mechanism, log and do everything else, then let the process fail and crash the container directly.

Bonus #3 — The post-mortem

If you really want your team to gain a lot from the experience, have a post-mortem after you fix the issue. Try to keep it as impersonal as possible. Don’t try to identify who is at fault; instead, try to identify behaviours and gaps in communication or documentation that could have led to the issue. Create checklists for all the things you do internally. From the way you write your commit messages and create branches, through the code review process, all the way to production deployment, you should have checklists. That way, people don’t have to think a lot about what they need to do. You wouldn’t want the junior dev deploying to production today becoming “creative” with the deployment, right? Create checklists, period.

The bottom line

I am aware that this is a crude and hacky way of handling exceptions, but it gets the job done when push comes to shove. I agree you need proper error handling and especially proper monitoring and logging in place.

I’m curious what your thoughts are on this. What is your strategy for tracking down and fixing errors like this?