Shuhei Kagawa

DNS polling for reliability

Apr 30, 2019 - Node.js

In December 2018, I wrote a package to poll and cache DNS records, pollen, as a mitigation for incidents at work.

My team at work migrated our Node.js servers from AWS EC2 C4 instances to C5 instances. Then mysterious timeout errors on outbound HTTP(S) calls started happening, and only in one availability zone at a time. We tried different things to investigate the issue, like profiling and tcpdump, but couldn't find the cause. Eventually, AWS Support suggested that the incidents correlated with DNS timeouts in their metrics. According to them, C5 instances don't retry DNS lookups while C4 instances do.

Node.js is vulnerable to DNS failures

In the microservice world, we work hard to make remote procedure calls (over HTTPS) reliable. We use timeouts, retries, fallbacks, etc. to make them as reliable as possible. However, we hadn't paid enough attention to DNS lookups, which we use for service discovery. DNS can easily become a single point of failure because we can't call servers without knowing their IP addresses.

Node.js is especially vulnerable to DNS lookup failures because:

  • The Node.js standard library doesn't cache DNS lookups, while other languages/runtimes like Java and Go cache them by default.
  • Node.js makes DNS lookups on a small thread pool. When there are slow DNS queries or packet loss, subsequent DNS lookups have to wait for them to finish or time out (see the sketch below).
    • Before Node 10.12.0, it was even worse because slow DNS queries affected other tasks in the thread pool like file IO and gzip encoding/decoding.
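
To illustrate the thread-pool point, here is a minimal sketch (not from the incident itself): dns.lookup resolves names via getaddrinfo on the libuv thread pool, which has only 4 threads by default, so a few slow lookups can delay everything queued behind them.

const dns = require("dns");

// dns.lookup() (used by http.request/https.request by default) runs
// getaddrinfo on the libuv thread pool. With the default pool size of 4,
// these 8 lookups run at most 4 at a time; if the first 4 hang,
// the rest wait in the queue behind them.
for (let i = 0; i < 8; i++) {
  dns.lookup("example.com", (err, address, family) => {
    console.log(err ? err.code : `${address} (IPv${family})`);
  });
}

// The pool size can only be changed before the pool is created, e.g. by
// starting the process with UV_THREADPOOL_SIZE=16 in the environment.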

Caching at the OS level

We can make DNS lookups fast and reliable by caching them. An issue on the nodejs/node repo recommends caching at the OS level. We can run a daemon like dnsmasq, unbound, CoreDNS, etc.

However, that's not always easy depending on the platform that you are using. My team was using a platform where we just deployed our application's Docker container, and it was hard to set up another daemon on the OS. The majority of the platform's users ran application runtimes such as Java and Go, which have basic DNS caching by default and rarely have the same issues as Node.js applications. It was hard to convince the platform team to introduce per-node DNS caching only for Node.js applications without concrete evidence while they were focusing on a new Kubernetes-based platform. (They eventually added per-node DNS caching to the new platform, but the application in question won't move to it for reasons...)

Because the incidents didn't happen on C4 instances and we had other priorities to work on, we rolled back and kept using C4 instances for a while. However, I wanted to resolve the issue before celebrating 2019. So, I decided to implement DNS caching on the application layer with Node.js.

DNS caching and prefetching with Node.js

There were already some DNS caching packages, such as dnscache and lookup-dns-cache.

The packages looked great, but there was an edge case that they didn't cover. Both packages throw away their caches after some time (dnscache uses a ttl option and lookup-dns-cache uses the TTL that DNS servers return) and make DNS lookups again. This poses a risk that HTTP requests fail if the DNS servers happen to be down at that moment.

To avoid making DNS lookups on demand, we can prefetch DNS records and always serve cached DNS records. This means that we may get outdated IP addresses. However, DNS records didn't change often in my case, and I thought it would be better to use expired DNS records than to just give up. As long as we use HTTPS, the worst case is an SSL certificate error if the expired IP addresses point to the wrong servers.
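
As a rough illustration of the prefetching idea (a minimal sketch, not pollen's actual implementation), the poller below resolves a hostname on an interval and keeps serving the last successful result, even if a refresh fails:

const dns = require("dns");

class SimpleDnsCache {
  constructor(hostname, intervalMs = 30 * 1000) {
    this.hostname = hostname;
    this.addresses = [];
    this.refresh();
    const timer = setInterval(() => this.refresh(), intervalMs);
    // Don't keep the process alive just because of the polling timer.
    timer.unref();
  }

  refresh() {
    dns.resolve4(this.hostname, (err, addresses) => {
      // Keep the previous addresses if this lookup fails or returns nothing.
      if (!err && addresses.length > 0) {
        this.addresses = addresses;
      }
    });
  }

  getAddress() {
    // A real implementation would rotate addresses; the first one is enough here.
    return this.addresses[0];
  }
}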

HTTP Keep-Alive (persistent connection)

There was another issue that I wanted to solve with this package: keeping HTTP Keep-Alive connections open as long as possible.

We had been using HTTP Keep-Alive for good performance. However, we couldn't keep the Keep-Alive connections forever because our backend servers may change their IP addresses (DNS-based traffic switching in our case). To avoid keeping stale connections, we were re-creating TCP/TLS connections every minute, first by rotating HTTP agents and later with the socketActiveTTL option of agentkeepalive. However, this is not optimal because IP addresses don't change most of the time.

DNS caching and prefetching tell us when IP addresses change. So we can keep using existing connections as long as the IP addresses stay the same and re-connect only when they change. In this way, we can avoid unnecessary TCP/TLS handshakes.
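
The core of the idea looks roughly like this sketch (not pollen's implementation; onIpAddressesChange is a hypothetical hook called wherever the DNS cache detects a change): keep one keep-alive agent and tear down its pooled sockets only when the addresses actually change.

const https = require("https");

const agent = new https.Agent({ keepAlive: true });

// Hypothetical hook: called by the DNS cache when the resolved IP addresses change.
function onIpAddressesChange() {
  // destroy() closes the agent's sockets, including idle keep-alive ones,
  // so the next request makes a fresh TCP/TLS handshake to the new addresses.
  agent.destroy();
}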

Result

I wrote pollen, tested it with C4 instances and migrated our servers to C5 again. No issues have happened in five months. So, it seems that DNS failure was the cause and the package can mitigate it.

I had expected a performance improvement because of fewer TCP/TLS handshakes, but I didn't find much difference in latency.

How to use it

npm i -S @shuhei/pollen
# or
yarn add @shuhei/pollen

const https = require("https");
const { DnsPolling, HttpsAgent } = require("@shuhei/pollen");

const dnsPolling = new DnsPolling({
  interval: 30 * 1000 // 30 seconds by default
});
// Just a thin wrapper of https://github.com/node-modules/agentkeepalive
// It accepts all the options of `agentkeepalive`.
const agent = new HttpsAgent();

const hostname = "shuheikagawa.com";
const req = https.request({
  hostname,
  path: "/",
  // Make sure to call `getLookup()` for each request!
  lookup: dnsPolling.getLookup(hostname),
  agent
});
req.end();

Bonus: DNS lookup metrics

Because DNS lookup is a critical operation, it is a good idea to monitor its rate, errors and latency. pollen emits events for this purpose.

dnsPolling.on("resolve:success", ({ hostname, duration, update }) => {
  // Hypothetical functions to update metrics...
  recordDnsLookup();
  recordDnsLatency(duration);

  if (update) {
    logger.info({ hostname, duration }, "IP addresses updated");
  }
});
dnsPolling.on("resolve:error", ({ hostname, duration, error }) => {
  // Hypothetical functions to update metrics...
  recordDnsLookup();
  recordDnsLatency(duration);
  recordDnsError();

  logger.warn({ hostname, err: error, duration }, "DNS lookup error");
});

I was surprised that DNS lookups occasionally took 1.5 seconds. It might be because of c-ares retries, but I'm not sure yet (its default timeout seems to be 5 seconds...).

Because pollen makes fewer DNS lookups, these events don't happen frequently. I came across an issue with a histogram implementation that greatly skewed percentiles for infrequent events, and started using HDR histograms. Check out Histogram for Time-Series Metrics on Node.js for more details.

Even if you don't use pollen, it is a good idea to monitor DNS lookups.

const dns = require("dns");

const lookupWithMetrics = (hostname, options, callback) => {
  // Support both lookup(hostname, callback) and lookup(hostname, options, callback).
  if (typeof options === "function") {
    callback = options;
    options = {};
  }
  const startTime = Date.now();

  function onLookup(err, address, family) {
    const duration = Date.now() - startTime;
    callback(err, address, family);

    // Hypothetical functions to update metrics...
    recordDnsLookup();
    recordDnsLatency(duration);
    if (err) {
      recordDnsError();
      logger.warn({ hostname, err, duration }, "DNS lookup error");
    }
  }

  return dns.lookup(hostname, options, onLookup);
};

const req = https.request({
  // ...
  lookup: lookupWithMetrics
});

Conclusion

Give pollen a try if you are:

  • seeing DNS timeouts on outbound API calls
  • using DNS for service discovery
  • running your Node.js servers without DNS caching

Also, don't forget to monitor DNS lookups!

Check your server.keepAliveTimeout

Apr 25, 2019 - Node.js

One of my Node.js server applications at work had constant 502 errors at the AWS ELB (Application Load Balancer) in front of it (HTTPCode_ELB_502_Count). The number was very small, around 0.001% of all requests. It was not happening on other applications with the same configuration but with shorter response times and higher throughput. Because of the low frequency, I hadn't bothered investigating it for a while.

clients -> AWS ELB -> Node.js server

I recently came across a post, A tale of unexpected ELB behavior. It says ELB pre-connects to backend servers, which can cause a race condition where ELB thinks a connection is open but the backend has closed it. It jogged my memory about the ELB 502 issue. After some googling, I found Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer. It describes an issue with GCP Load Balancer and NGINX, but its takeaway was to make the server's keep-alive idle timeout longer than the load balancer's timeout. This advice seemed applicable to AWS ELB and a Node.js server as well.

According to AWS documentation, Application Load Balancer has a connection idle timeout of 60 seconds by default. The documentation also suggests:

We also recommend that you configure the idle timeout of your application to be larger than the idle timeout configured for the load balancer.

Node.js http/https servers have a 5-second keep-alive timeout by default. I wanted to make it longer. With Express, we can do it like the following:

const express = require("express");

const app = express();
// Set up the app...
const server = app.listen(8080);

// Longer than the ALB idle timeout of 60 seconds
server.keepAliveTimeout = 61 * 1000;

And the ELB 502 errors disappeared!

In hindsight, there was already Dealing with Intermittent 502's between an AWS ALB and Express Web Server on the internet, which describes exactly the same issue in more detail. (I found it while writing this post...) Also, the same issue seems to happen with different load balancers/proxies and different servers. The 5-second default timeout of Node.js in particular is quite short and prone to this issue. I found that it had happened with a reverse proxy (Skipper as a k8s ingress) and another Node.js server at work. I hope this issue becomes more widely known.

Update on April 29, 2019

Oleksii told me in a comment that setting only server.keepAliveTimeout was not enough on Node.js 10.15.2. It turned out that we also need to configure server.headersTimeout to be longer than server.keepAliveTimeout on Node.js 10.15.2 and newer. See his issue on GitHub for more details. Thanks, Oleksii!

server.keepAliveTimeout = 61 * 1000;
server.headersTimeout = 65 * 1000; // This should be bigger than `keepAliveTimeout + your server's expected response time`

2018 in review

Feb 18, 2019 - Review

Looking back on 2018, it flew like an arrow. It went so fast that it's already February 2019!

Sunset at Tempelhof in April

Move

We had lived in an apartment on the border of Schöneberg and Wilmersdorf for 2 years, and decided to move out at the end of October without extending the contract. We spent two or three months on flat hunting, and a month and a half on moving, buying furniture and setting up the new apartment. In the end, we like the new area and are looking forward to spending time on the balcony in the summer.

In the meantime, I injured my left arm, and it took a few months to recover.

Travel

I visited two new countries and seven new cities. I wanted to visit a few more countries, but couldn't manage it, mainly because of the move.

  • Tokyo, Japan in February
  • Spreewald, Germany in March
  • Amsterdam, Netherlands for React Amsterdam in April
  • Leipzig, Germany in May
  • Vienna, Austria for a wedding party in July
  • München, Germany for Oktoberfest in September
  • Köln and Düsseldorf, Germany in November

German language

After finishing an A2 course at the office, I started a B1 course at Speakeasy. I felt that I should have taken A2 again... In the end, I got distracted by something else and stopped going to the course.

Work

It has been 2 years since I started working at Zalando. 2017 was about the architecture migration from a monolith to microservices. 2018 was about optimization (and the next migration has already started...).

In addition to front-end tasks, I focused more on non-feature stuff.

In the first half of the year, I focused on web (frontend) performance optimization. My team's work was featured in a blog post, Loading Time Matters, on the company blog.

In June, my team had a series of incidents on one of our applications, but we didn't know why. It opened a door to learning for me. I dug into Node.js internals and the Linux network stack. I was lucky enough to find Systems Performance by Brendan Gregg, which is one of my all-time favorite technical books. As a by-product of the research/learning, I profiled Node.js servers in production and made some performance improvements. I wrote about it in Node.js under a Microscope: CPU FlameGraph and FlameScope.

Side projects

I didn't work on many side projects in 2018. Instead, I learned a lot of low-level stuff: network, Linux, Node.js. I put some of what I learned into the knowledge repo inspired by yoshuawuyts/knowledge. Also, as a permanent solution for the issue at work, I wrote a library to keep Node.js apps resilient against DNS timeouts, pollen. It's been working without issues for 1.5 months!

Some other unfinished pieces:

2019

In 2018, I focused on tiny things such as shaving hundreds of milliseconds. In 2019, I would like to be more open. Try new things. Travel more.