Shuhei Kagawa

Node.js under a Microscope: CPU FlameGraph and FlameScope

Sep 16, 2018 - Node.js, Linux

Last week, I had an opportunity to talk about profiling Node.js applications on production at an internal guild meeting at work. Here is a written version of it plus some additional information.


At work, I have been working on Node.js microservices that fetch data from API servers and render HTML with React. We monitor response times at load balancers, in application metrics and with distributed tracing based on OpenTracing. One of the microservices had a weird gap between its own 99th-percentile response time and those of its dependencies. It was spending an extra 500 milliseconds somewhere, and I didn't know why.

My first suspect was the network. It is a place full of uncertainty. After learning and trying various commands and metrics, I captured packets with tcpdump and checked them one by one, with my eyes and a script. There were none of the significant delays that I had expected. So I had to stop blaming the network, or anyone else.

CPU Profiling with Linux perf Command

Because the weird latency was happening in the application itself, I wanted to know what was going on inside it. There are mainly two ways to achieve this: profiling and tracing. Profiling records samples; tracing records everything. I wanted to do it in production, so profiling was naturally a good fit because of its smaller overhead.

For Node.js, there are mainly two different tools. One is the V8 profiler, and the other is Linux perf. The V8 profiler uses the profiler provided by V8. It covers all JavaScript execution and V8 native functions, and it works on non-Linux operating systems, so it can be pretty handy if you are not on Linux. On the other hand, Linux perf can profile almost anything, including the Linux kernel, libuv, and all processes on your OS, with minimal overhead. However, as the name suggests, it works only on Linux. According to the Node CPU Profiling Roadmap, the V8 profiler is the one officially supported by the V8 team, but Linux perf will keep working for a while. In the end, I picked Linux perf because of its low performance overhead and minimal intervention in the application.

Linux perf record saves stack traces into a binary file, called perf.data by default. The binary file contains only addresses and the file names of functions. perf script converts the stack traces into a human-readable text file, resolving function names from program binaries and symbol map files.

# Install dependencies for `perf` command
sudo apt-get install linux-tools-common linux-tools-$(uname -r)
# Test `perf` command
sudo perf top

# Record stack traces 99 times per second for 30 seconds
sudo perf record -F 99 -p ${pid} -g -- sleep 30s
# Generate human readable stack traces
sudo perf script > stacks.${pid}.out

Now we have human-readable stack traces, but it's still hard to browse thousands of stack traces and get insights from them. How can we efficiently analyze them?

CPU Flame Graph

CPU Flame Graph by Brendan Gregg is a great way of visualizing stack traces. It aggregates stack traces into one chart. Frequently executed functions are shown wider and rarely executed functions are narrower in the chart.

A CPU Flame Graph from a sample application
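Under the hood, the aggregation is simple: each sampled stack is folded into a count per unique root-to-leaf path (the "collapsed" format that flamegraph.pl consumes), and a frame's width is proportional to its count. A minimal sketch in JavaScript, with hypothetical sample data:

```javascript
// Fold sampled stacks (each an array of frames, root first) into
// counts per unique stack path, the input format for flame graphs.
function collapseStacks(samples) {
  const counts = new Map();
  for (const stack of samples) {
    const key = stack.join(';');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Hypothetical samples: JSON.parse appears twice, so its frame
// would be rendered twice as wide as gzip's.
const samples = [
  ['main', 'render', 'JSON.parse'],
  ['main', 'render', 'JSON.parse'],
  ['main', 'gzip'],
];
const counts = collapseStacks(samples);
```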

I found some insights about the application on production with CPU Flame Graph:

  • React server-side rendering is considered a very CPU-intensive task that blocks the Node.js event loop. However, JSON.parse() was using 3x more CPU than React, though that might be because we had already optimized React server-side rendering.
  • Gzip decompression was using almost the same amount of CPU as React server-side rendering.

There are tools like FlameGraph and 0x that generate a CPU Flame Graph from Linux perf stack traces. In the end, though, I didn't need them because FlameScope, which I'll explain next, can generate a CPU Flame Graph too.


FlameScope

FlameScope by Netflix is another great tool for visualizing stack traces, in a time series. It shows a heatmap built from stack traces. Each cell represents a short slice of time, 20 ms with 50 cells per second, and its color represents how often the process was on-CPU during that slice. It visualizes patterns in your application's activity.

Image from Netflix/flamescope

If you select a time range on the heatmap, FlameScope shows you a CPU Flame Graph of that range. It lets you examine what happened, and when, in detail.
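As a rough sketch of the heatmap idea (not FlameScope's actual implementation): bucket each sample's timestamp into a one-second row and a 20 ms cell, and count samples per cell.

```javascript
// Count on-CPU samples per heatmap cell. Timestamps are in seconds;
// with 50 cells per second, each cell covers 20 ms.
function heatmapCells(timestamps, cellsPerSecond = 50) {
  const cells = new Map();
  for (const t of timestamps) {
    const second = Math.floor(t);
    const cell = Math.floor((t - second) * cellsPerSecond);
    const key = `${second}:${cell}`;
    cells.set(key, (cells.get(key) || 0) + 1);
  }
  return cells;
}
```

A cell with many samples means the process was busy during those 20 ms, which is exactly the kind of short burst that averages over 30 seconds would hide.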

To use FlameScope, check out the repository and run its Python server. Then put stack trace files generated by perf script into the examples directory, and open http://localhost:5000.

I found a couple of exciting insights about the application on production using this tool.

Example 1: Heavy Tasks in the Master Process

The application used the cluster module to utilize multiple CPU cores. FlameScope showed that the master process was idle most of the time but occasionally kept using CPU for 1.5 seconds straight! The flame graph of those ranges pointed at metrics aggregation.

The master process was aggregating application metrics from the worker processes and responding to metrics collectors a few times a minute. When the metrics collectors asked for data, the master process calculated percentiles of response times and prepared a JSON response. The percentile calculation was taking a long time because the application had a lot of metrics buckets and the library that we used deep-copied objects with JSON.stringify() and JSON.parse()!
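The pattern looked something like this (a hypothetical sketch, not the actual metrics library):

```javascript
// Hypothetical sketch of the costly pattern: deep-copying all metrics
// buckets with a JSON round-trip on every scrape by a collector.
function deepCopyViaJson(obj) {
  // Serializes the whole object graph and parses it back: CPU cost
  // proportional to the data size, plus extra garbage for the GC.
  return JSON.parse(JSON.stringify(obj));
}

// With many buckets, each scrape burns a noticeable amount of CPU.
const buckets = Array.from({ length: 10000 }, (_, i) => ({
  le: i,
  count: i % 7,
}));
const snapshot = deepCopyViaJson(buckets);
```

A structure-aware copy, or no copy at all, would avoid both the serialization cost and the garbage.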

Example 2: Frequent Garbage Collections

FlameScope showed that the worker processes were not overloaded most of the time, but every 10 seconds or so they had a few hundred milliseconds of CPU-busy time. It was caused by mark-sweep and mark-compact garbage collections.

The application had an in-memory fallback cache for API calls that was used only when API calls and their retries failed. Even when the API had problems, the cache hit rate was very low because of the number of permutations. In other words, the cache was almost never used. It stored large API responses for a while and threw them away after they expired. That looked innocent at first glance, but it was a problem for V8's generational garbage collector.

The API responses were always promoted to the old generation space, causing frequent slow GCs; GC of the old generation is much slower than GC of the young generation. After removing the fallback cache, the application's 99th-percentile response time improved by hundreds of milliseconds!

Node.js Gotchas

perf script collects symbols for function addresses from program binaries. For Node.js, we need something special because functions are compiled just in time. As far as I know, there are two ways to record symbols:

  1. Run your Node.js process with the --perf-basic-prof-only-functions option. It generates a log file at /tmp/perf-${pid}.map. The file keeps growing; the growth rate depends on your application, but it was a few megabytes per day for an application at work. Another problem is that functions in V8 keep moving, so addresses in /tmp/perf-${pid}.map get outdated. I wrote a script to fix the issue.
  2. Use mmarchini/node-linux-perf. It generates the same /tmp/perf-${pid}.map file as --perf-basic-prof-only-functions does, but on demand. Because it regenerates the file from scratch each time, it doesn't contain outdated symbols. It seems to be the way to go, but I haven't tried it in production yet.
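For option 1, the growing map file can be compacted offline. One plausible approach (a hypothetical sketch, not my actual script) is to keep only the last entry per address, since a later line for an address supersedes earlier ones:

```javascript
// Each line of /tmp/perf-${pid}.map looks like "<addr> <size> <name>".
// When V8 moves or recompiles a function, newer lines appear later in
// the file, so keep only the last line seen for each address.
function dedupeMapLines(text) {
  const byAddr = new Map();
  for (const line of text.trim().split('\n')) {
    if (!line) continue;
    const addr = line.split(' ')[0];
    byAddr.set(addr, line); // later lines overwrite earlier ones
  }
  return Array.from(byAddr.values()).join('\n');
}
```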

In addition to the above, there are a few more Node.js options that you can use to improve your stack traces—though I haven't tried them on production because the stack traces were already good enough for me:

  • --no-turbo-inlining turns off function inlining, which is an optimization done by V8. Because function inlining fuses multiple functions into one, it can make it harder to understand stack traces. Turning it off generates more named frames.
  • --interpreted-frames-native-stack fixes Builtin:InterpreterEntryTrampoline frames in stack traces. It is available from Node.js 10.4.0. Check out "Interpreted Frames" in Updates from the Diagnostics Summit for more details.

Docker Gotchas

It gets a bit tricky when you are using containers to run your application. There are two ways to use Linux perf with Docker:

  1. Run perf record and perf script in the same Docker container as your application is running
  2. Run perf record and perf script in the host OS

I eventually chose option 2. I tried option 1 first but gave up because I was using Alpine Linux as the base image and it was hard to make Linux perf available on it.

To run perf record in the host OS, we need to figure out the pid of the application process in the host.

$ ps ax | grep 'node --perf'
21574 pts/0    Sl+    2:53 node --perf-basic-prof-only-functions src/index.js
30481 pts/3    S+     0:00 grep --color=auto node --perf
# or
$ pgrep -f 'node --perf'

$ sudo perf record -F 99 -p 21574 -g -- sleep 30s

perf script collects symbols from binaries and symbol files to get human-readable function names. It needs to be able to read the binaries whose functions were recorded by perf record, as well as the /tmp/perf-${pid}.map files that applications generate. However, perf script in the host OS cannot read them under the same file names as the container can. (This seems to no longer be the case with the latest Linux kernel, whose perf command is container-aware. But it was the case for me because I was not on the latest kernel.)

I learned how to overcome the issue from Making FlameGraphs with Containerized Java. I just copied necessary files from the container to the host.

# Horrible hack! Which binaries you need depends on your setup. `perf script` tells you what it wants if anything is missing.
sudo docker cp mycontainer:/usr/bin/node /usr/bin/node
sudo docker cp mycontainer:/lib/ /lib/
sudo docker cp mycontainer:/usr/lib/ /usr/lib/

To copy the symbol map files, we need to find the pid inside the container. We can do that by checking /proc/${host_pid}/status.

$ cat /proc/21574/status | grep NSpid
NSpid:  21574   6
$ sudo docker cp mycontainer:/tmp/ /tmp/

Now everything is ready! Then we can use perf script as usual.

sudo perf script > stacks.${pid}.out

I set up a sample project for profiling a Node.js application on Docker. It was nice to practice profiling a bit before doing it on production!


Linux perf provides great observability into Node.js applications in production. Tools like CPU Flame Graph and FlameScope helped me identify performance bottlenecks.

There are some gotchas when profiling Node.js applications on Docker with Linux perf. It took me some time to figure out how to do it because Node.js and Linux evolve day by day and I couldn't find many up-to-date resources online. I hope this post is helpful!


2017 in Review

Dec 25, 2017


I moved to Berlin from Tokyo at the end of September 2016, so 2017 was almost my first full year in Berlin.

I like the city so far. It is more relaxed than Tokyo and other big cities in Europe. Summer is especially nice. BBQ makes it even better. After my office moved to a building in front of Spree River, I enjoy my commute crossing Oberbaum Bridge and walking along the river.


I traveled more than ever. The destinations were Germany (Dresden, Heidelberg, Frankfurt, Köln), Italy (Venice, Florence, Bologna), France (Paris), the UK (London), Portugal (Lisbon) and Japan (Tokyo). I had fun in each of them, but if I have to choose one, I will name Lisbon. The city is full of what I miss in Berlin: fresh and inexpensive seafood, views from hills, cute ceramic tiles, and beautiful weather. The sky was clear every single day while I was there, and the highest temperature was 18 degrees in December!


I am glad to have found Fuerst Wiacek. Their German Movies is my No.1 beer so far. Biererei is a gem in Berlin, where I can buy fresh craft beers from Europe with growlers.

British ale was a discovery to me. I liked pubs in London a lot. I also attended the first craft beer hackathon in the world and won 12 crates of craft beer...!


I bought an ergonomic keyboard and a neck support pillow. Both lifted my quality of life. My body is getting older.

Language Learning

I learned a bit of the German language. I finished A1 in May and started A2 after a five-month pause. While the learning process is slow, German feels less cryptic to me now.


I was lucky to join an awesome team. We work together and hang out together. Research at Google shows that psychological safety is a key to team effectiveness, and I feel it on my team.

On the technical side, my team joined a relatively large project and completed it on time. I worked mostly in architecture, performance optimization, type checking with Flow, SRE, etc. for apps with React and Node.js. I also helped my colleagues to start building an internal tool with Elm.

Side Projects

I enjoyed working with Elm. I wrote a mobile weather app, flew to Paris for Elm Europe 2017, built a mobile-friendly pixel editor and talked about it at the Elm Berlin Meetup. I also helped with an experiment on the compiler side in Haskell, although it is still pending.

I didn't do much with JavaScript for side projects but wrote a tiny library for server-side rendering with tagged template literals while hanging out with friends at a cafe. It's used in the choo ecosystem now.

Aside from building things, I learned monad transformers, etc. from Haskell Book and machine learning with neural networks from Deep Learning Specialization on Coursera.


After all, I lived a year in a new country and enjoyed it. I have settled down, and now I feel prepared for new challenges next year. Let's see what is going to happen!


Getting Memory Usage in Linux and Docker

May 28, 2017 - Linux, Docker

Recently I started monitoring a Node.js app that we have been developing at work. After a while, I found that its memory usage % was growing slowly, by something like 20% in 3 days. The memory usage was measured with the following Node.js code.

const os = require('os');

const total = os.totalmem();
const free = os.freemem();
const usage = (total - free) / total * 100;

So these numbers come from the OS, which was Alpine Linux on Docker in this case. Luckily, I also had memory usage of the application processes recorded, but it was not increasing. Then why was the OS memory usage increasing?

Buffers and Cached Memory

I used the top command with Shift+m (sort by memory usage) and compared processes on a long-running server with ones on a newly deployed server. The processes on each side were almost the same. The only difference was that buffers and cached Mem were high on the long-running one.

After some research, or googling, I concluded that it was not a problem. Most of the buffers and cached Mem is given up when application processes claim more memory.

Actually, the free -m command provides a row of used and free that takes buffers and cached into consideration.

$ free -m
             total  used  free  shared  buffers cached
Mem:          3950   285  3665     183       12    188
-/+ buffers/cache:    84  3866
Swap:         1896     0  1896

So, what are they actually? According to the manual of /proc/meminfo, which is a pseudo file and the data source of free, top and friends:

Buffers %lu
       Relatively temporary storage for raw disk blocks that
       shouldn't get tremendously large (20MB or so).

Cached %lu
       In-memory cache for files read from the disk (the page
       cache).  Doesn't include SwapCached.

I am still not sure what exactly Buffers contains, but it seems to hold file metadata, etc., and it's relatively trivial in size. Cached contains cached file contents, which are called the page cache. The OS keeps the page cache as long as RAM has enough free space. That was why the memory usage was increasing even though the processes were not leaking memory.

If you are interested, What is the difference between Buffers and Cached columns in /proc/meminfo output? on Quora has more details about Buffers and Cached.


So, should we use free + buffers + cached? /proc/meminfo has an even better metric called MemAvailable.

MemAvailable %lu (since Linux 3.14)
       An estimate of how much memory is available for
       starting new applications, without swapping.
$ cat /proc/meminfo
MemTotal:        4045572 kB
MemFree:         3753648 kB
MemAvailable:    3684028 kB
Buffers:           13048 kB
Cached:           193336 kB

Its background is explained well in the commit in the Linux kernel, but essentially it excludes non-freeable page cache and includes reclaimable slab memory. The current implementation in Linux v4.12-rc2 still looks almost the same.

Some implementations of free -m have an available column. For example, on Boot2Docker:

$ free -m
       total  used  free  shared  buff/cache  available
Mem:    3950    59  3665     183         226       3597
Swap:   1896     0  1896

It is also available in AWS CloudWatch metrics via the --mem-avail flag.

Some background about Docker

My next question was, "Are those metrics the same in Docker?" Before diving into this question, let's check how Docker works.

According to Docker Overview: The Underlying Technology, processes in a Docker container directly run in their host OS without any virtualization, but they are isolated from the host OS and other containers in effect thanks to these Linux kernel features:

  • namespaces: Isolate PIDs, hostnames, user IDs, network accesses, IPC, etc.
  • cgroups: Limit resource usage
  • UnionFS: Isolate file system

Because of namespaces, the ps command in the host OS lists processes of Docker containers in addition to other host processes, while ps inside a Docker container cannot list processes of the host OS or of other containers.

By default, Docker containers have no resource constraints. So if you run one container on a host and don't limit its resource usage, which was my case, the container's "free memory" is the same as the host OS's "free memory".

Memory Metrics on Docker Container

If you want to monitor a Docker container's memory usage from outside of the container, it's easy. You can use docker stats.

$ docker stats
fc015f31d9d1  0.00%  220KiB / 3.858GiB  0.01%  1.3kB / 0B  0B / 0B    2

But if you want to get the memory usage in the container or get more detailed metrics, it gets complicated. Memory inside Linux containers describes the difficulties in details.

/proc/meminfo and sysinfo, which is what os.totalmem() and os.freemem() of Node.js use, are not isolated, so you get metrics of the host OS if you use normal utilities like top and free in a Docker container.

To get metrics specific to your Docker container, you can check the pseudo files in /sys/fs/cgroup/memory/. They are not standardized, according to Memory inside Linux containers, though.

$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes

memory.limit_in_bytes returns a very big number if there is no limit. In that case, you can find the host OS's total memory in /proc/meminfo or with commands that use it.


It was a longer journey than I initially thought. My takeaways are:

  • Available Memory > Free Memory
  • Use MemAvailable if available (pun intended)
  • Processes in a Docker container run directly in host OS
  • Understand what you are measuring exactly, especially in a Docker container