step-ca stopped!

I have been using step-ca for over 2 years now on a small NanoPi R2S (1 GB RAM). I monitor it too: for the last 6 months memory usage has been very stable and, more importantly, not increasing over time. Memory leaks are real, but not on this baby:

At the very end, though, it changes, and it changes very suddenly. Here are the last 7 days:

What did not work:

  • Reset the server
  • Reboot the server

Actually, each worked for less than a minute. By then memory was exhausted and the server was busy swapping. Connecting via ssh became a gamble at this point.

After a reboot there is about 1 minute to stop step-ca. A simple kill won’t do because systemd would restart it, so a

systemctl stop step-ca

did the job. Just have to be fast enough to execute it.

What happened?

It seems that the internal DB (BadgerV2), which by default lives in ~/.step/db/, grew so much over time that managing it rather suddenly consumed enough memory to cause swapping. There is one parameter I had not used (I left it empty):

badgerFileLoadingMode [optional]: can be set to FileIO (instead of the default MemoryMap) to avoid memory-mapping log files. This can be useful in environments with low RAM. Make sure to use badgerV2 as the database type if using this option.

    MemoryMap: default.
    FileIO: This can be useful in environments with low RAM
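
For reference, here is a sketch of how the db section of ca.json might look with that option set. I have not tested this on the R2S. The dataSource path is only a placeholder for wherever your DB lives, and the exact spelling of the type value should be checked against the step-ca documentation for your version. JSON has no comments, so all of these caveats live here rather than in the fragment:

```json
{
  "db": {
    "type": "badgerV2",
    "dataSource": "/home/step/.step/db",
    "badgerFileLoadingMode": "FileIO"
  }
}
```

Only the db block matters here. The rest of ca.json stays as it is.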

Needless to say, the default works fine as long as the DB does not get too big. In my case it was 4.7GB in size:

I find it impressive that the default of MemoryMap worked that well on a 1 GB RAM machine, but I guess in the last few days it stopped working.
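
Since the DB size turned out to be the thing to watch, a small check like the following could be run from cron or a monitoring agent. This is only a sketch: the function name check_db_size and the 512 MB limit in the usage comment are my own assumptions, not anything step-ca provides.

```shell
#!/bin/sh
# Sketch: warn when a directory (e.g. the step-ca DB) grows past a limit.
# check_db_size is a hypothetical helper, not part of step-ca.

# check_db_size DIR LIMIT_MB
#   returns 0 if DIR uses less than LIMIT_MB megabytes on disk,
#   returns 1 (and prints a warning to stderr) if it uses that much or more,
#   returns 2 if DIR does not exist.
check_db_size() {
    [ -d "$1" ] || { echo "no such directory: $1" >&2; return 2; }
    # du -sk prints the disk usage in kilobytes, followed by a tab and the path
    size_kb=$(du -sk "$1" | cut -f1)
    limit_kb=$(( $2 * 1024 ))
    if [ "$size_kb" -ge "$limit_kb" ]; then
        echo "WARNING: $1 uses ${size_kb} KB, limit is ${limit_kb} KB" >&2
        return 1
    fi
    return 0
}

# Example use (the 512 MB limit is an arbitrary assumption):
#   check_db_size "$HOME/.step/db" 512 || echo "time to look at the DB"
```

Pick a limit well below the RAM of the machine. In my case the trouble started somewhere before 4.7GB on a 1 GB box, but I do not know the exact point.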

The Fix

Starting step-ca manually worked with no error messages, but it used more and more memory. Since I had upgraded not long before, I downgraded from v0.23.0 to v0.17.2, but it made zero difference. Finding the 4.7 GB DB solved the problem: since the config and all secret keys/certificates are not in the DB, I tried simply wiping it out, and that worked as expected: step-ca created a new DB. When started it used about 10% of memory, and then even less. Back to normal.
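
If you would rather not delete the DB outright, moving it aside has the same effect on step-ca and keeps the old data around in case something in it is needed later. A minimal sketch, assuming the default DB location and the service name used above. The helper set_aside_db is hypothetical:

```shell
#!/bin/sh
# Sketch: move the step-ca DB aside instead of deleting it.
# set_aside_db is a hypothetical helper, not part of step-ca.

# set_aside_db DIR
#   renames DIR to DIR.bak-<timestamp> and prints the new name;
#   returns 1 if DIR does not exist.
set_aside_db() {
    dir=$1
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    backup="$dir.bak-$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" && echo "$backup"
}

# Intended use (assumes the default DB path and the step-ca unit name):
#   systemctl stop step-ca
#   set_aside_db "$HOME/.step/db"
#   systemctl start step-ca     # a new, empty DB is created on start
```

Remember that the old DB is still 4.7GB on disk, so remove the backup once you are sure you do not need it.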

Lesson learned: watch your monitoring. This behavior started on the 15th and it took me 3 days to notice.
