Shuhei Kagawa

Check Your server.keepAliveTimeout

Apr 25, 2019 - Node.js

One of my Node.js server applications at work had constant 502 errors at the AWS ELB (Application Load Balancer) in front of it (HTTPCode_ELB_502_Count). The number was very small, around 0.001% of all requests. It was not happening on other applications that had the same configuration but shorter response times and higher throughput. Because of the low frequency, I hadn't bothered investigating it for a while.

clients -> AWS ELB -> Node.js server

I recently came across a post, A tale of unexpected ELB behavior. It says that ELB pre-connects to backend servers, which can cause a race condition where ELB thinks a connection is still open but the backend has already closed it. It jogged my memory about the ELB 502 issue. After some googling, I found Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer. It describes an issue with GCP Load Balancer and NGINX, but its takeaway was to make the server's keep-alive idle timeout longer than the load balancer's timeout. This advice seemed applicable to AWS ELB and a Node.js server as well.

According to the AWS documentation, Application Load Balancer has a connection idle timeout of 60 seconds by default. It also suggests:

We also recommend that you configure the idle timeout of your application to be larger than the idle timeout configured for the load balancer.

The Node.js http/https server has a keep-alive timeout of 5 seconds by default. I wanted to make it longer than the load balancer's 60-second idle timeout. With Express, we can do it as follows:

const express = require("express");

const app = express();
// Set up the app...
const server = app.listen(8080);

// 61 seconds: slightly longer than the load balancer's 60-second idle timeout
server.keepAliveTimeout = 61 * 1000;

And the ELB 502 errors disappeared!
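The reasoning behind the 61-second value can be sketched as a small check. This is only an illustration: `isExposedToIdleRace` is a hypothetical helper of mine, not part of Node.js, Express, or AWS, and it encodes nothing more than the rule of thumb that the server's keep-alive timeout should be longer than the load balancer's idle timeout.

```javascript
// Hypothetical helper (not a Node.js or AWS API).
// If the server closes idle connections no later than the load balancer does,
// the load balancer may send a request on a connection the server is closing.
function isExposedToIdleRace(lbIdleTimeoutMs, serverKeepAliveTimeoutMs) {
  return serverKeepAliveTimeoutMs <= lbIdleTimeoutMs;
}

// Defaults mentioned in this post.
const ALB_DEFAULT_IDLE_MS = 60 * 1000; // Application Load Balancer idle timeout
const NODE_DEFAULT_KEEP_ALIVE_MS = 5 * 1000; // Node.js server.keepAliveTimeout

// Node.js default: the server gives up on idle connections long before the ALB.
const exposedWithDefault = isExposedToIdleRace(
  ALB_DEFAULT_IDLE_MS,
  NODE_DEFAULT_KEEP_ALIVE_MS
);

// After the fix: the server keeps idle connections one second longer than the ALB.
const exposedAfterFix = isExposedToIdleRace(ALB_DEFAULT_IDLE_MS, 61 * 1000);

console.log(exposedWithDefault); // true
console.log(exposedAfterFix); // false
```

If you have changed the idle timeout on your load balancer, the first argument needs to be that configured value instead of the default.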

In hindsight, there was already Dealing with Intermittent 502's between an AWS ALB and Express Web Server on the internet, which describes exactly the same issue in more detail. (I found it while writing this post...) Also, the same issue seems to happen with different load balancers/proxies and different servers. The 5-second default timeout of Node.js in particular is quite short and prone to this issue. I found that it had also happened at work between a reverse proxy (Skipper as k8s ingress) and another Node.js server. I hope this issue becomes more widely known.

Update on April 29, 2019

Oleksii told me in a comment that setting only server.keepAliveTimeout was not enough on Node.js 10.15.2. It turned out that we also need to set server.headersTimeout to a value larger than server.keepAliveTimeout on Node.js 10.15.2 and newer. See his issue on GitHub for more details. Thanks, Oleksii!

server.keepAliveTimeout = 61 * 1000;
server.headersTimeout = 65 * 1000; // This should be bigger than `keepAliveTimeout + your server's expected response time`
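To keep the two values consistent, they can be derived from the load balancer's idle timeout in one place. The following is a minimal sketch, not an official recipe: `applyKeepAliveTimeouts`, the one-second margin, and the 3-second expected response time are my own assumptions for illustration. Only `server.keepAliveTimeout` and `server.headersTimeout` are real Node.js properties. It uses the plain `http` module, but the server returned by Express's `app.listen()` can be passed in the same way.

```javascript
const http = require("http");

// Hypothetical helper (not part of Node.js or Express).
// keepAliveTimeout: a bit longer than the load balancer's idle timeout.
// headersTimeout: longer than keepAliveTimeout + the expected response time.
function applyKeepAliveTimeouts(server, options) {
  const { lbIdleTimeoutMs, expectedResponseTimeMs, marginMs = 1000 } = options;

  const keepAliveTimeout = lbIdleTimeoutMs + marginMs;
  const headersTimeout = keepAliveTimeout + expectedResponseTimeMs + marginMs;

  server.keepAliveTimeout = keepAliveTimeout;
  server.headersTimeout = headersTimeout;

  return { keepAliveTimeout, headersTimeout };
}

// The server is created but not started here; call server.listen(8080) as usual.
const server = http.createServer((req, res) => {
  res.end("ok");
});

const applied = applyKeepAliveTimeouts(server, {
  lbIdleTimeoutMs: 60 * 1000, // ALB default idle timeout
  expectedResponseTimeMs: 3 * 1000, // assumed value; measure your own application
});

console.log(applied); // { keepAliveTimeout: 61000, headersTimeout: 65000 }
```

With these assumed inputs, the helper reproduces the 61-second and 65-second values above. A slower application would need a larger expected response time and therefore a larger headersTimeout.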