Deno and FFI – How much Overhead?

Deno can use shared libraries which follow the normal C convention, thus allowing Deno programs to call C functions. Since there’s no way to create raw Ethernet packets within Deno and I found no library doing this, I think I’ll have to create it myself similar to what I did with Dart and its FFI.

But first I wanted to know what the overhead of calling a C function in a shared library is as this GitHub issue seems to make it very slow.

Time to measure! So I created a simple C function to add up bytes in an array:

int sum(const uint8_t *p, int count) {
        int res=0;
        for (int i=0; i<count; ++i) {
                res += p[i];
        }
        return res;
}

and call this with variable amounts of count, from 1 to 4096. From Deno call it:

for (let size = 1 ; size < buf.length; size *= 2) {
  start = performance.now();
  for (let i = 0; i < 1_000_000; ++i) {
    result = dylib.symbols.sum(buf, size);
  }
  end = performance.now();
  console.log(`Time used for ${size} bytes: ${end-start}ns`);
}

performance.now() returns a time stamp in ms, thus since the test call is done 1M times, the result shows the time in ns:

BytesAMD GX-420CA AMD Ryzen 5 3400GERK3328@1.3GHzKhadas VIM3L S905D3@1.9GHzAWS C6G
438121274864228
856181342902232
1690341464982252
321746416981134292
6431014021701436386
12858426831202042556
256113054449983252904
51222241022876456781620
10244410204816304105203014
20488800407231378202005808
4096175548076616903959811374
Overhead23141213830212
Calling FFI from Deno with increasing amount of work done inside the FFI, Deno version 1.29.3 (ARM: 1.29.4)

Linear regression shows 23ns and 14ns overhead (extrapolate for size=0) for the x86_64 CPUs. Note how nicely the time increases with larger payloads. The ARM CPUs start to show linear increases only at about 128 bytes, and their overhead is quite a lot higher at 830ns and 212ns.

Given that one 1500 byte Ethernet frame at 1 GBit/s takes 12μs, the overhead for the slower AMD CPU is only 0.2%, this is very acceptable. Even for more typical frames of 500 byte (128 pixel, 3 colors, plus a bit of overhead), the overhead is only 0.6%.

The ARM CPUs have significantly more overhead (7% for a 1500 byte frame for the S905D3, and 20% for a 500 byte frame). Even using a server type ARM CPU does not improve it by much.

Appendix

Deno version: 1.29.3 for x86_64, and 1.29.4 for ARMv8 compiled via

cargo install deno --locked
Advertisement

TensorFlow on arm64

The VIM3L (Cortex A55 cores) I have has a NPU accelerator built-in. An interesting article about it is here, however before doing fancy NPU stuff, let’s get TensorFlow working first. Should be easy.

Famous last words. Turns out that “pip install tensorflow” does not work: on arm64 (AKA aarch64 AKA ARMv8) TensorFlow is not officially supported. So I had to compile it first.

Compiling TensorFlow

https://www.tensorflow.org/install/source described the compile process reasonable well. It is missing a lot of details though, so here is a more detailed walk-through. Start with a Ubuntu 20.xx image with an extra 70 GB disk for TensorFlow source code:

# One-time action: for the data disk, create a volume and a filesystem
# to mount under /data

sudo bash
pvcreate /dev/nvme1n1
vgcreate vg_data /dev/nvme1n1
lvcreate -L69G -n data vg_data
mke2fs -j /dev/vg_data/data
mkdir /data
echo -e '/dev/mapper/vg_data-data\t/data\text4\tdefaults\t0 1' >>/etc/fstab
mount /data
chown ubuntu:users /data
umount /data
exit

sudo apt update
sudo apt -y upgrade
sudo reboot

After a reboot, you now have a /data of about 70GB.

sudo apt -y install build-essential python3 python3-dev python3-venv pkg-config zip zlib1g-dev unzip curl tmux wget vim git htop liblapack3 libblas3 libhdf5-dev openjdk-11-jdk

# Get bazel

wget https://github.com/bazelbuild/bazel/releases/download/4.2.2/bazel-4.2.2-linux-arm64
chmod a+x bazel-4.2.2-linux-arm64
sudo cp bazel-4.2.2-linux-arm64 /usr/local/bin/bazel

# bazel uses ~/.cache/bazel

mkdir -p /data/.cache/bazel
ln -s /data/.cache/bazel ~/.cache/bazel

# Build a Python 3 virtual environment

python3 -m venv ~/venv
source ~/venv/bin/activate
pip install wheel packaging
pip install six mock numpy grpcio h5py
pip install keras_applications --no-deps
pip install keras_preprocessing --no-deps

# Get TensorFlow source

cd /data
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow/
git checkout r2.8
cd /data/tensorflow

./configure

# Build Python package:

bazel build -c opt \
--copt=-O3 \
--copt=-std=c++11 \
--copt=-funsafe-math-optimizations \
--copt=-ftree-vectorize \
--copt=-fomit-frame-pointer \
--copt=-DRASPBERRY_PI \
--host_copt=-DRASPBERRY_PI \
--verbose_failures \
--config=noaws \
--config=nogcp \
//tensorflow/tools/pip_package:build_pip_package

# Build Python whl:

BDIST_OPTS="--universal" bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg

# And for tfjs:
# (see https://github.com/tensorflow/tfjs/tree/master/tfjs-node)

bazel build --config=opt --config=monolithic //tensorflow/tools/lib_package:libtensorflow
# The result is at bazel-bin/tensorflow/tools/lib_package/libtensorflow.tar.gz

It does take a lot of time (about 2-3h for each the Python package and the tfjs-node library). When I tried 2 CPU and 8 GB RAM, some compiler runs were killed as they were running out of memory. 4 CPU and 16 GB RAM worked fine.Thus AWS m6g.xlarge recommended. m6g.large failed to build.

Using spot instances for the m6g.xlarge (regular $0.154/h, spot price $0.04/h) helped a bit to limit the financial impact.

Python and TensorFlow

It took me several tries:

  • When using Ubuntu 22.04 to compile TF, the resulting binary wanted GLIBC 2.35 which my VIM3L did not have. It had 2.31. It also used Python 3.10 to compile.
  • When using Ubuntu 20.04, it used Python 3.8 to compile. My VIM3L had Python 3.9. While GLIBC was fine, Python was not.
  • The created whl file could be loaded and used on the machine I compiled it on. No Python or GLIBC version problems here.

That covered all my Python needs. Now moving to the main target:

Node.js and TensorFlow

tfjs-node uses the libtensorflow.so library, so that should remove some of the CPython version problems I have seen. Compiling was easy now: https://github.com/tensorflow/tfjs/tree/master/tfjs-node#optional-build-optimal-tensorflow-from-source is spot on.

The biggest problem was to make Node.js understand to not use the non-existing arm64 pre-compiled library, but instead use the one I created. The instructions in the above link did not explain in enough details how to make this work. In hindsight it’s easy, but it took some tries to make me understand it. In short:

  • Do an “npm install –ignore script”
  • Add a file scripts/custom-binary.json into the modules directory for @tensorflow/tfjs-node (this gave me the hint)
  • Run “npm install” in the tfjs-node directory
  • That will download the tensorflow library archive
  • Now do the “npm install” where your application is (which is the only “npm install” you’d usually do)
❯ npm install --ignore-script
❯ pushd .
❯ cd node_modules/@tensorflow/tfjs-node/scripts
❯ cat >>custom-binary.json <<_EOF_
{
  "tf-lib": "https://MYSERVER.com/libtensorflow-2.8-arm64.tar.gz"
}
_EOF_
❯ cd ..
❯ npm install
[...]
> @tensorflow/tfjs-node@3.16.0 install
> node scripts/install.js

CPU-linux-3.16.0.tar.gz
* Downloading libtensorflow
https://MYSERVER/libtensorflow-2.8-arm64.tar.gz
[==============================] 3685756/bps 100% 0.0s
* Building TensorFlow Node.js bindings
[...]
❯ popd
❯ npm install

Benchmarks

As a benchmark I modified slightly server.js from the tfjs-examples/baseball-node to not listen to the port which means after the training it’ll exit. Then run this on the VIM3L (S905D3), my ThinkCentre m75q (Ryzen 5), and my HP T620 (GX-420CA) once with CPU backend (tfjs) and once with the C++ TF library (tfjs-node):

CPUBackendTime/s
Amlogic S905D3-N0N @ 1.9GHzcpu803
Amlogic S905D3-N0N @ 1.9GHztensorflow189
AMD Ryzen 5 PRO 3400GE @ 3.3GHzcpu122
AMD Ryzen 5 PRO 3400GE @ 3.3GHztensorflow36
AMD GX-420CA @ 2GHzcpu530
AMD GX-420CA @ 2 GHztensorflow119
All running Node.js 16.x

I did not expect Node.js to be just 4 times slower than C++. Really impressive. Still, using tfjs-node makes a lot of sense. While on x86_64 this was not an issue, with above instructions it’s doable on arm64 too.

XKCD Userscript

Learning is easiest with a goal. The goal here was to recreate this and expand it a bit. Nothing earth-shaking. Still learning how to use querySelector().

To install, have Tampermonkey installed and click here.

Here the script:

// ==UserScript==
// @name         Display XKCD IMG title
// @namespace    https://hkubota.wordpress.com/
// @downloadURL  https://raw.githubusercontent.com/haraldkubota/xkcd-userscript/main/display_title.user.js
// @supportURL   https://hkubota.wordpress.com/2021/09/04/xkcd-userscript/
// @version      0.1
// @description  Show the IMG title and Explain XKCD Link
// @author       Harald Kubota
// @match        http*://*xkcd.com/*
// @icon         https://www.google.com/s2/favicons?domain=xkcd.com
// @grant        none
// ==/UserScript==

(function() {
    window.addEventListener('load', () => {
        const titleElement = document.querySelector('#comic [title]');
        if (titleElement) {
            const title = titleElement.title;
            let p = document.createElement('p');
            p.innerText = title;
            document.querySelector('#comic').append(p);
            const mystyle = {
                "font-variant": "none",
                "background": "lightgray",
                "padding": "10px"
            };
            Object.assign(p.style, mystyle);
        }
        const currentComic = document.querySelector('div#middleContainer.box > a');
        if (currentComic) {
            let uriPath = currentComic.text.split('/');
            let currentComicNumber=1;
            for (let i=uriPath.length-1; i>=0; --i) {
                let n = parseInt(uriPath[i]);
                if (!isNaN(n)) {
                    currentComicNumber = n;
                    break;
                }
            }
            let div = document.createElement('div');
            let a = document.createElement('a');
            var link = document.createTextNode('https://www.explainxkcd.com/wiki/index.php/'+currentComicNumber);
            a.appendChild(link);
            a.setAttribute('href', 'https://www.explainxkcd.com/wiki/index.php/'+currentComicNumber);
            div.appendChild(a);
            document.querySelector('#comic').append(div);
        }
    }, false);
})();

Now it looks like this:

The elements with the red arrows are added via above userscript

Deno and ES6 Modules

Continuing my expedition into the JavaScript front-end land…I like Deno. A lot. It does lots of things right (from my point of view, e.g. permissions), and it fixes many problem Node.js has. Using Deno for back-end work is straightforward, but since I focus currently on the front-end side, how can I create front-end JavaScript with Deno and still use Deno’s testing framework?

It’s actually not hard. Here a small TypeScript file:

import capitalize from "https://unpkg.com/lodash-es@4.17.15/capitalize.js";

function main() {
  console.log(capitalize("hello from the web browser"));
}

function sum(a: number, b: number): number {
    return a+b;
}

window.onload = () => {
  console.info(capitalize("module loaded!"));
};

export { main, sum }                                                                                                   

embedded in a simple index.html file:

<!DOCTYPE html>
<html>
<head>
    <title>Testing Deno</title>
</head>
<body>
    <div>Some Text</div>
<script type="module">
    import {main, sum} from './dist/browser.js';
    main();
    console.log(sum(5,6));
    console.log("Done.");
</script>
</body>
</html>

So far that’s straightforward: loading index.html runs main() and prints out the sum of 5+6 on the console.

Here the test script:

import { assertEquals } from "https://deno.land/std@0.106.0/testing/asserts.ts";

import { sum } from "./example.ts";

Deno.test("Sum 5 and 7", () => {
  assertEquals(sum(5, 7), 12);
});

Again, standard Deno. And deno test works as expected. And to deploy as plain-JavaScript, a simple deno bundle src/example.ts dist/browser.js does the trick.

And there you go: plain TypeScript and yet the logic can be tested with the normal deno test command. No extra tools needed. No Babel, WebPack, tsc. No require vs import either.

deno_dom is not yet working enough though (see the stub definition of addEventListener() here), so until then only basic DOM operations work.

Update: This commit added addEventListener()