I broke our main API last week. I merged a PR of mine that I had been working on for a while. It had two approvals from co-workers and a green test suite, and it was working perfectly fine locally and on our staging environment, the one where we test things before we send them to production. Somehow, the Node.js server failed to boot on production and we had to roll back to the last deployment, resulting in 2-3 minutes of downtime.
Why did the Node.js server fail to boot on production while it was working perfectly fine locally and on our pre-production staging environment? Keep reading.
Wanna see the PR that brought our production down? Here it is:
I literally moved a dependency from the dependencies section to devDependencies. I reinstalled my dependencies locally, reran the server, and it was working perfectly. I then deployed to our staging environment and, again, it worked perfectly fine. But it failed to boot on production.
Specifically, the production server failed to boot because it wasn't able to find the
Yes! That's obscure! Well not so much. Let's take it from the beginning.
The Node.js server was mostly running in three different environments. It was using the NODE_ENV environment variable to denote the current environment. It could take three different values based on where it was running:
development for when it was running locally,
staging for when it was running on our staging environment and
production for when it was running on our production environment.
Now here's an interesting piece from
With the --production flag (or when the NODE_ENV environment variable is set to production), npm will not install modules listed in devDependencies.
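To make the quoted rule concrete, here is a minimal sketch of which packages end up installed for a given NODE_ENV value. It is a simplified model, not npm's actual implementation. The function packagesToInstall and the sample manifest, including the package name some-runtime-lib, are hypothetical and only there for illustration.

```javascript
// Simplified model of the rule quoted above: a production install skips devDependencies.
// This is an illustration only, not npm's real resolution logic.
function packagesToInstall(manifest, nodeEnv) {
  const deps = Object.keys(manifest.dependencies || {});
  const devDeps = Object.keys(manifest.devDependencies || {});
  // Only the exact value "production" triggers the pruning; "staging" does not.
  return nodeEnv === 'production' ? deps : [...deps, ...devDeps];
}

// Hypothetical manifest: "some-runtime-lib" is needed at runtime
// but has been moved to devDependencies.
const manifest = {
  dependencies: { express: '^4.0.0' },
  devDependencies: { 'some-runtime-lib': '^1.0.0' },
};

for (const env of ['development', 'staging', 'production']) {
  console.log(env, packagesToInstall(manifest, env));
}
```

Under these assumptions, development and staging both get the moved package and only production misses it, which is the same shape as the failure described above.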
NODE_ENV will actually impact your production environment and will widen the gap between production and your other environments. Let's revisit a famous quote from The Twelve-Factor App:
Keep development, staging, and production as similar as possible.
Unfortunately, by setting our NODE_ENV to a value based on the environment the server was running on, we were actually making the gap between our environments bigger.
The immediate action we took after that incident was simple: decouple our application environment from NODE_ENV. We introduced a separate variable for the application environment, moved all NODE_ENV occurrences to that, and then restricted NODE_ENV to only two values:
development when running locally or while running unit tests,
production for all other environments.
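The split described above could look roughly like the sketch below. The post does not say what the new variable was called, so APP_ENV is a made-up name, and resolveEnvironment is a hypothetical helper rather than our actual code.

```javascript
// APP_ENV is a hypothetical name; the real variable name is not given in the post.
function resolveEnvironment(env = process.env) {
  // NODE_ENV stays binary: "production" for every online environment,
  // "development" for local runs and unit tests.
  const nodeEnv = env.NODE_ENV === 'production' ? 'production' : 'development';
  // The application's own notion of "where am I running" lives in a separate variable.
  const appEnv = env.APP_ENV || 'development';
  return { nodeEnv, appEnv };
}

// Staging now installs and boots exactly like production,
// while the application can still tell the two apart.
console.log(resolveEnvironment({ NODE_ENV: 'production', APP_ENV: 'staging' }));
console.log(resolveEnvironment({ NODE_ENV: 'production', APP_ENV: 'production' }));
```

With this arrangement, anything that keys off NODE_ENV behaves identically on staging and production, and only application code that reads the separate variable sees a difference.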
Another aspect of this is that some other library may be using this variable as well, without you even knowing about it. And it won't be looking for multiple values; it will only be looking for production vs anything else.
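As a rough illustration of that kind of check, here is a hypothetical snippet of library code. The function name isDebugOutputEnabled is made up and does not come from any specific package.

```javascript
// Hypothetical library code: it only distinguishes "production" from everything else.
function isDebugOutputEnabled(env = process.env) {
  return env.NODE_ENV !== 'production';
}

console.log(isDebugOutputEnabled({ NODE_ENV: 'staging' }));    // true: staging counts as "not production"
console.log(isDebugOutputEnabled({ NODE_ENV: 'production' })); // false
```

A staging server running with NODE_ENV set to staging would take the development code path here, so it would not be exercising the same behavior as production.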
Now where did NODE_ENV come from and why does npm actually use it? The Node.js documentation mentions nothing about such a variable. Well, the NODE_ENV variable became famous through the Express.js framework, which used it to decide whether it should enable some development features. After people started to use it, other projects adopted it as well, and that's how we got to where we are today.
Using NODE_ENV to denote your application's environment is not wise, since so many utilities depend on it. Keep it to distinguish online, production-like environments from local development environments.
Like I mentioned above, we learned quite a few things from that incident, but our immediate action was to rename NODE_ENV in an attempt to keep our online environments as similar as possible. 🤓
Were there any actions or decisions you took recently to address this issue?