Deno and FFI – How much Overhead?

Deno can use shared libraries which follow the normal C convention, thus allowing Deno programs to call C functions. Since there’s no way to create raw Ethernet packets within Deno and I found no library doing this, I think I’ll have to create it myself similar to what I did with Dart and its FFI.

But first I wanted to know how much overhead a call to a C function in a shared library has, since this GitHub issue suggests it is very slow.

Time to measure! So I created a simple C function to add up bytes in an array:

#include <stdint.h>

int sum(const uint8_t *p, int count) {
        int res = 0;
        for (int i = 0; i < count; ++i) {
                res += p[i];
        }
        return res;
}
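
To make this loadable from Deno it has to be compiled into a shared library first. Something like this should do; the name libsum.so is simply what I use in the snippets below:

cc -O2 -fPIC -shared -o libsum.so sum.c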

I call this with varying values of count, from 1 to 4096. From Deno the call looks like this:

for (let size = 1 ; size < buf.length; size *= 2) {
  start = performance.now();
  for (let i = 0; i < 1_000_000; ++i) {
    result = dylib.symbols.sum(buf, size);
  }
  end = performance.now();
  console.log(`Time used for ${size} bytes: ${end-start}ns`);
}
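
For completeness, here is a sketch of the FFI setup the loop above assumes. The symbol declaration mirrors the C signature; the file name libsum.so, the buffer size and the run flags are my assumptions, not taken from the measurement code:

// open the shared library and declare the signature of sum()
// run with: deno run --allow-ffi --unstable bench.ts
const dylib = Deno.dlopen("./libsum.so", {
  sum: { parameters: ["buffer", "i32"], result: "i32" },
});

const buf = new Uint8Array(8192); // payload; the loop above then measures sizes 1..4096
let start = 0, end = 0, result = 0;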

performance.now() returns a time stamp in ms; since the test call is done 1M times, the printed number is effectively the time per call in ns:

Bytes    | AMD GX-420CA | AMD Ryzen 5 3400GE | RK3328@1.3GHz | Khadas VIM3L S905D3@1.9GHz | AWS C6G
4        |           38 |                 12 |          1274 |                        864 |     228
8        |           56 |                 18 |          1342 |                        902 |     232
16       |           90 |                 34 |          1464 |                        982 |     252
32       |          174 |                 64 |          1698 |                       1134 |     292
64       |          310 |                140 |          2170 |                       1436 |     386
128      |          584 |                268 |          3120 |                       2042 |     556
256      |         1130 |                544 |          4998 |                       3252 |     904
512      |         2224 |               1022 |          8764 |                       5678 |    1620
1024     |         4410 |               2048 |         16304 |                      10520 |    3014
2048     |         8800 |               4072 |         31378 |                      20200 |    5808
4096     |        17554 |               8076 |         61690 |                      39598 |   11374
Overhead |           23 |                 14 |          1213 |                        830 |     212
Calling FFI from Deno with increasing amount of work done inside the FFI, Deno version 1.29.3 (ARM: 1.29.4)

Linear regression (extrapolating to size=0) shows 23ns and 14ns of overhead for the x86_64 CPUs. Note how nicely the time increases with larger payloads. The ARM CPUs start to show a linear increase only at about 128 bytes, and their overhead is quite a lot higher: 1213ns, 830ns and 212ns.

Given that one 1500 byte Ethernet frame takes 12μs at 1 GBit/s (1500 × 8 bit / 10⁹ bit/s), the overhead for the slower AMD CPU is only 0.2%, which is very acceptable. Even for more typical frames of 500 bytes (128 pixels, 3 colors, plus a bit of overhead), the overhead is only 0.6%.

The ARM CPUs have significantly more overhead (7% for a 1500 byte frame for the S905D3, and 20% for a 500 byte frame). Even using a server type ARM CPU does not improve it by much.

Appendix

Deno version: 1.29.3 for x86_64, and 1.29.4 for ARMv8 compiled via

cargo install deno --locked

step-ca stopped!

I have been using step-ca for over 2 years now on a little NanoPi R2S (1GB RAM). And I am monitoring it too: for the last 6 months memory usage has been very stable and, more importantly, not increasing over time. Memory leaks are real, but not on this baby:

At the very end it changed, and it changed very suddenly too. Here are the last 7 days:

What did not work:

  • Reset the server
  • Reboot the server

Actually it worked for less than a minute. By then memory was exhausted again and the server was busy swapping; connecting via ssh became a gamble at this point.

After a reboot there is about a 1 minute window to stop step-ca. A simple kill won't do because systemd would restart it, so a

systemctl stop step-ca

did the job. Just have to be fast enough to execute it.

What happened?

It seems that the internal DB (BadgerV2), which by default lives in ~/.step/db/, grew over time so much that managing it rather suddenly consumed enough memory to cause swapping. One parameter I did not use (left it empty):

badgerFileLoadingMode [optional]: can be set to FileIO (instead of the default MemoryMap) to avoid memory-mapping log files. This can be useful in environments with low RAM. Make sure to use badgerV2 as the database type if using this option.

    MemoryMap: default.
    FileIO: This can be useful in environments with low RAM

Needless to say, the default works fine as long as the DB does not get too big. In my case it was 4.7GB in size:

I find it impressive that the default of MemoryMap worked that well on a 1 GB RAM machine, but I guess in the last days it stopped working.

The Fix

Starting step-ca manually worked with no error messages, but it used more and more memory. I downgraded from v0.23.0 to 0.17.2, since I had upgraded a few weeks ago, but it made zero difference. Finding the 4.7 GB DB solved the problem: since the config and all secret keys/certificates are not stored in the DB, I simply wiped it out, and that worked as expected: step-ca created a new DB. When started it used about 10% of memory, and then even less. Back to normal.
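
For the record, the fix boils down to something like this (assuming the default DB location ~/.step/db and the systemd unit name step-ca; moving the directory instead of deleting it keeps a way back):

systemctl stop step-ca
mv ~/.step/db ~/.step/db.old     # keep the old DB around, just in case
systemctl start step-ca          # step-ca creates a fresh, empty DB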

Lesson learned: watch your monitoring. This behavior started on the 15th and it took me 3 days to notice.

Life and Death

On Friday, two days ago, a long-time friend of mine died. He was about my age, and he seemed perfectly healthy a few years ago when we last met. He had found a new partner and they were planning to get married next year. On our last phone call he sounded very happy, and I would have loved to meet them both on my next visit to Germany.

Sadly, nothing will come of that now.

You cannot change the past, but in the future I will postpone fewer things to next year. Sometimes there simply is no “next year”.

Replace a Mouse with a Tablet

2 years ago I bought a tablet (Huion HS611) for playing Osu! but I quickly found out that I suck at it. Drawing is not my thing either, so it was mostly unused.

But then recently my Microsoft Sculpt Mouse broke (a single click registered as a double click). I have a spare mouse, but it is very “un-ergonomic”, so I needed another replacement and tested the tablet again, this time as a mouse replacement.

And it’s better than I expected!

  • Scrolling is very easy (like on a phone).
  • Single-click is super-easy although click-and-drag does not work at all as it’s used for scrolling.
  • Double-click and right-mouse-button-clicks are hard: the pen pointer moves a lot more than when using a mouse.
  • Having the pen in your hand means that when you type, you either drop the pen or you type with a pen between your fingers.

The one problem I had was that the Huion driver puts Windows into “Tablet mode”, which is generally correct. However, it also means Windows likes to pop up a soft keyboard. It should not, as I have a USB keyboard connected, and it's annoying as heck when it happens. Luckily the fix is simple.

No more soft keyboard pop-up

Run services.msc, find the Touch Keyboard and Handwriting Panel Service and simply disable it.
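
If you prefer the command line, the same can be done from an elevated prompt. The service behind “Touch Keyboard and Handwriting Panel Service” is typically called TabletInputService, but this may differ between Windows versions, so verify the name in services.msc first:

sc stop TabletInputService
sc config TabletInputService start= disabled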

BASIC Benchmarks

Found this Wikipedia article about BASIC benchmarks, and it had run times for some old computers I used back then. E.g. benchmark 7 took 21.1s on a BBC Micro, which was particularly fast. A C64 took 47.5s.

How long does a current computer take for this kind of work?

I have no BASIC, but JavaScript is kind of similar: it's often the first language people use to learn programming. So let's see how long that takes (after translating the BASIC program into JavaScript):

function doNothing() {
    return;
}

function bench7() {
    let k = 0;
    let m = [];
    do {
        ++k;
        let a = k / 2 * 3 + 4 - 5;
        doNothing();
        for (let l = 0; l < 5; ++l) {
            m[l] = a;
        }
    } while (k < 1000);
}

function manyBench(n) {
    console.log("S");
    for (let i=0; i<n; ++i) {
        bench7();
    }
    console.log("E");
}

manyBench(500000);

Running this did not take that long:

❯ time node benchmark7.js
S
E
node benchmark7.js  2.82s user 0.02s system 99% cpu 2.845 total

That's for 500,000 runs though, so each benchmark run takes about 0.0056ms (5.6µs) on my low-end PC (Ryzen 5 Pro 3400GE). That's over 3.7 million times faster than the BBC Micro.

And before anyone mentions it: yes, any modern compiler will optimize the whole benchmark away since no useful output or calculation is done. I am not sure how much Node.js (resp. the V8 engine) removes. Making the code less do-nothing-like and taking the number of loops from the command line did not increase the run time significantly beyond what I would have expected from the additional code, so I concluded that the code is executed as-is and has not been optimized away.
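
For reference, taking the loop count from the command line is a one-line change; a sketch of what I mean (not necessarily the exact code I used):

// read the number of benchmark iterations from the command line,
// falling back to 500,000 if none is given
const n = parseInt(process.argv[2] ?? "500000", 10);
manyBench(n);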

Winter Project: SDR

It’s still September, but winter will come soon. I learned about Software Defined Radio years ago, but the equipment was very expensive or limited. The software ecosystem wasn’t that large either.

However that has changed in the meantime: an Airspy is US$169 and the SpyVerter $49 and that combo can scan anything from DC to 1.8GHz.

The first order of business was watching the spectrum at various wavelengths via SDR#:

Scanning 433MHz band

As you can see, it detects nicely when I push a button on the 433.92MHz remote control. Interestingly there's a slight 3kHz mis-tuning of the small remote compared to the RF bridge:

Red line: 433.92MHz.
Lower signal: RF Bridge, upper signal: remote control

And I can listen to FM radio (very poor quality, since the antenna I have is for the 433MHz band):

FM Radio

To use any other bands besides 433MHz I'd need different or adjustable antennas. Something to worry about later.

Decoding the RF Remote Control

Recognizing and decoding digital signals is surprisingly easy with the right tools. rtl_433 can decode a lot of standard devices, so it was my first attempt. Compiling was a bit more complex than I thought due to dependencies, but it was well enough documented. After that was done, time to decode some signals!

rtl_433 -d airspy -Y autolevel -M level -R 0 -v -A

This shows output like the following for every button press, usually repeated several times:

Analyzing pulses...
Total count:   25,  width: 290297.52 ms         (72574379 S)
Pulse width distribution:
 [ 0] count:    1,  width: 289230564 us [289230564;289230564]   (72307641 S)
 [ 1] count:   16,  width: 11380 us [11372;11456]       (2845 S)
 [ 2] count:    8,  width: 33640 us [33636;33644]       (8410 S)
Gap width distribution:
 [ 0] count:    9,  width: 11708 us [11684;11728]       (2927 S)
 [ 1] count:   15,  width: 34020 us [33880;34284]       (8505 S)
Pulse period distribution:
 [ 0] count:    1,  width: 289242292 us [289242292;289242292]   (72310573 S)
 [ 1] count:   23,  width: 45380 us [45256;45660]       (11345 S)
Pulse timing distribution:
 [ 0] count:    1,  width: 289230564 us [289230564;289230564]   (72307641 S)
 [ 1] count:   25,  width: 11496 us [11372;11728]       (2874 S)
 [ 2] count:   23,  width: 33888 us [33636;34284]       (8472 S)
 [ 3] count:    1,  width: 100004 us [100004;100004]    (25001 S)
Level estimates [high, low]:    126,     -1
RSSI: -42.3 dB SNR: 42.0 dB Noise: -84.3 dB
Frequency offsets [F1, F2]:     121,      0     (+0.5 kHz, +0.0 kHz)
Guessing modulation: Pulse Width Modulation with sync/delimiter
view at https://triq.org/pdv/#AAB104FFFF2CE88460FFFF819292929292A19292A1A1A1A1929292A1A192929292A1929355
Attempting demodulation... short_width: 11380, long_width: 33640, reset_limit: 34288, sync_width: 289230560
Use a flex decoder with -X 'n=name,m=OOK_PWM,s=11380,l=33640,r=34288,g=0,t=0,y=289230560'
pulse_slicer_pwm(): Analyzer Device
bitbuffer:: Number of rows: 1
[00] {24} fb 0e 7b  : 11111011 00001110 01111011

Note the line with the flex decoder: those numbers vary slightly from press to press. Take the average for s, l and r, use the most common values for g and y, and use t to add some margin for timing errors. I have 2 RF devices: a small remote control and an RF-Wifi bridge. Both can send the same signals, but their timing is slightly off:

  • Remote: s=13378±5, l=33638±10, r=34929±10
  • Bridge: s=11635±40, l=34198±10, r=34768±10

So I took the average, added some slack and that worked for both devices:

$ rtl_433 -d airspy -Y autolevel -M level -R 0 -v -X 'n=remote1,m=OOK_PWM,s=11500,l=34000,r=34900,g=0,t=800,bits=24'
[...]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
time      : 2022-09-23 11:46:45
model     : name         count     : 1             num_rows  : 2             rows      :
len       : 0            data      : ,
len       : 24           data      : fb0e7b
codes     : {0}, {24}fb0e7b
Modulation: ASK          Freq      : 433.9 MHz
RSSI      : -48.3 dB     SNR       : 36.0 dB       Noise     : -84.3 dB
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
time      : 2022-09-23 11:46:45
model     : name         count     : 1             num_rows  : 2             rows      :
len       : 0            data      : ,
len       : 24           data      : fb0e7b
codes     : {0}, {24}fb0e7b
Modulation: ASK          Freq      : 433.9 MHz
RSSI      : -48.3 dB     SNR       : 36.0 dB       Noise     : -84.3 dB
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[...]

Note that the shown code fb0e7b is unique to a single button. The other 3 buttons send different 24 bit codes.

Other 433.92MHz Devices

I was hoping to find more devices on that band, but I guess I need a really good antenna on a roof to detect them.

Frequency Usage in Japan

Here is the most useful general frequency usage plan.

OverTheWire – Security Wargames

I wish I had seen this a long time ago: https://overthewire.org/wargames/ is a set of challenges in the area of security: buffer overflows, command injection, web server security, plus fun Linux command line skills.

I learned a lot about how to detect and exploit buffer overflows and shellcode in the Narnia wargame. Good fun and educational too.

Natas is nice since it uses a web server while the other ones use ssh. The harder ones are…hard, but there are solutions all over the web in case you get stuck.

All in all, highly recommended to see how easy it is to exploit badly written code and how to write code which does not repeat those known security problems.

Dart & Pool

Tip of the day: if you use Dart and want to use the Pool library, expect not much help from Google when searching for those keywords: you get exactly the results you'd expect for “dart” and “pool”, just nothing about the library. Adding “future” or “async” helps.

Anyway, the point of this post is a small example how to use a Pool to run commands in parallel, but not too many concurrently.

import 'dart:io';
import 'package:pool/pool.dart';
import 'package:test/test.dart';

Future<ProcessResult> runCommand(String command, List<String> args) {
  return Process.run(command, args);
}

void main() {
  test('Run 20 slow date commands', () async {
    var pool = Pool(5);
    List<Future<ProcessResult>> results = [];
    for (var i = 0; i < 20; ++i) {
      results.add(pool.withResource(() => runCommand('./date_slow', ['+%N'])));
    }
    pool.close();
    await pool.done;
    for (var process in results) {
      var res = await process;
      print(res.stdout);
    }
  });
}

./date_slow is a simple script which returns something on STDOUT and finishes in 2s:

#!/bin/bash
date $*
sleep 2

What happens then is that Pool(5) creates a pool with 5 slots. The first for loop queues all 20 commands immediately, but at most 5 will run at a time; the rest simply wait until it's their turn.

pool.close() stops any new entries and the await pool.done simply waits until the pool is closed and all jobs are executed.

The 2nd for loop (with the print() statement) uses await to get the ProcessResult out of the Future<ProcessResult> which Process.run() returns and which is stored in the results list.

The outcome here is that if the pool has 5 slots and each command runs for 2 seconds, the complete set of 20 jobs runs in 20/5*2=8 seconds. If I make the pool 10 slots large, it'll run in 20/10*2=4 seconds, and with 20 or more slots it takes 2 seconds. And it never runs more processes than there are slots.

Why do I need this? I have a list of URLs, from a few to hundreds, which I need to query. While I can query many concurrently, each instance takes up a non-trivial amount of memory since it's an external program; at 100 concurrent calls it uses up all memory on a 16GB RAM notebook. While there are many ways to work around this (one at a time being the safest and slowest), a pool is perfect: it runs many commands in parallel, but I can limit how many run at the same time.
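
Applied to that use case, the same pattern looks roughly like this; the pool size of 5 and curl as the external program are just placeholders for whatever fits:

import 'dart:io';
import 'package:pool/pool.dart';

Future<void> fetchAll(List<String> urls) async {
  final pool = Pool(5); // at most 5 external processes at any time
  final results = [
    for (final url in urls)
      pool.withResource(() => Process.run('curl', ['-s', url]))
  ];
  pool.close();
  await pool.done;
  for (final r in results) {
    final res = await r;
    print('${res.exitCode}: ${res.stdout.toString().length} bytes');
  }
}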