NVIDIA Jetson Xavier NX Benchmarks

I recently received an NVIDIA Jetson Xavier NX board to review and write some posts about. The first one is an unofficial guide to upgrading Ubuntu 18.04 to the latest Ubuntu Focal (20.04).

Here I will run some benchmarks and compare the performance of the Jetson NX against other SBCs. A while back, I benchmarked some ARM boards, comparing their performance on Java and other workloads. Here I take a similar approach and add some GPU and power consumption tests and comparisons.

Of course the price range varies a lot, from $79 for the Odroid N2 and the RockPro64 to $399 for the Xavier NX, so we cannot expect similar performance or features. All boards were set up with Debian or Ubuntu and no graphical interface enabled.

To measure CPU and GPU usage, I installed jtop from the jetson-stats repo. It's a utility similar to htop but it also fetches the NVIDIA stats. It can be installed with sudo -H pip install -U jetson-stats and run with sudo jtop.

Disclaimer: These tests and benchmarks are not scientific or validated under tight laboratory conditions. They were made in my home lab using open source tools and equipment available on the market. Also, the benchmarks I used are in no way comprehensive across all available workloads.

NVIDIA Jetson Xavier NX

  • CPU: 6-core Carmel Arm 64-bit CPU, 6MB L2 + 4MB L3.
  • GPU: NVIDIA Volta with 384 NVIDIA CUDA cores and 48 Tensor Cores, plus 2x NVDLA.
  • 8GB 128-bit LPDDR4x RAM
  • Gigabit Ethernet

HardKernel Odroid N2

  • Amlogic S922X (4x Cortex-A73 @ 1.8GHz + 2x Cortex-A53 @ 1.9GHz), 12nm process; Mali-G52 GPU with 6 execution engines @ 846MHz.
  • 4GB DDR4 RAM.
  • Gigabit Ethernet

Pine64 RockPro64

  • Rockchip RK3399 hexa-core 64-bit SoC (2x ARM Cortex-A72 + 4x Cortex-A53) and Mali-T860 quad-core GPU.
  • 4GB LPDDR4 RAM.
  • PCIe x4.
  • Gigabit Ethernet

Benchmarks used:

I use the 7zip benchmark as a baseline for most boards since it shows bare CPU performance and makes a good comparison point between boards.

NVIDIA Jetson Xavier NX

❯ 7z b

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)
LE
CPU Freq: - 64000000 64000000 - 128000000 256000000 512000000 - 2048000000
RAM size:    7763 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       9309   466   1945   9057  |     113067   568   1699   9642
23:       9036   454   2026   9207  |     110032   564   1688   9521
24:       9186   465   2122   9877  |     108373   567   1678   9512
25:       9337   469   2275  10661  |     106344   564   1679   9464
----------------------------------  | ------------------------------
Avr:             464   2092   9701  |             566   1686   9535
Tot:             515   1889   9618

Odroid N2

❯ 7z b

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)
LE
CPU Freq: 1597 1763 1797 1796 1795 1786 1792 1792 1791
RAM size:    3713 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       5848   525   1083   5690  |     118603   535   1892  10115
23:       5615   530   1080   5722  |     114888   532   1868   9941
24:       5511   539   1099   5926  |     113151   536   1853   9932
25:       5223   540   1105   5964  |     107952   530   1812   9607
----------------------------------  | ------------------------------
Avr:             533   1092   5826  |             533   1856   9899
Tot:             533   1474   7862

RockPro64

❯ 7z b

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)
LE
CPU Freq: 64000000 - - - - - 512000000 - -
RAM size:    3793 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       4468   476    913   4347  |      94209   522   1540   8034
23:       4114   473    886   4192  |      92694   525   1528   8021
24:       3965   472    902   4264  |      90617   525   1516   7954
25:       3840   480    913   4385  |      88555   525   1501   7881
----------------------------------  | ------------------------------
Avr:             475    904   4297  |             524   1521   7972
Tot:             500   1213   6135

Here we see that the NVIDIA Jetson NX is 22% faster than the Odroid-N2 and 56% faster than the RK3399 SoC on the RockPro64.
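These percentages come straight from the total ratings (Tot MIPS) in the listings above; a quick sanity check:

```python
# Total 7-Zip ratings (Tot MIPS) from the benchmark runs above.
ratings = {
    "Xavier NX": 9618,
    "Odroid N2": 7862,
    "RockPro64": 6135,
}

baseline = ratings["Xavier NX"]
for board, mips in ratings.items():
    if board == "Xavier NX":
        continue
    # Relative speedup of the Xavier NX over this board.
    speedup = (baseline / mips - 1) * 100
    print(f"Xavier NX vs {board}: {speedup:.1f}% faster")
```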

Java SPECjvm2008

In the past I've compared some Java versions on ARM64, and also compared some of my boards on the same benchmark; it offers a good balance of workload types to demonstrate CPU performance.

Here I compare the three boards, broken down by individual benchmark:

In this test, we see that on the average of the multiple SPECjvm2008 benchmarks, the Amlogic SoC is 27% faster than the RK3399. The NVIDIA Xavier NX is 150% faster than the Amlogic and 187% faster than the RK3399.

Next, I tested memory performance using sysbench. In these tests, I used 1K and 1M block sizes with read and write tests.

Tests used 50%, 100% and 200% of each board's total memory.
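For reference, the full matrix of sysbench invocations can be generated with a small helper. This is a sketch of how I'd script it, using the Xavier NX's 8GB as the example RAM size; the other boards just swap in their own amount:

```python
from itertools import product

# Parameters described above: block sizes, operations, and memory multipliers.
block_sizes = ["1K", "1M"]
operations = ["read", "write"]
ram_gb = 8  # example: Xavier NX; use 4 for the Odroid N2 and RockPro64
total_sizes = [f"{int(ram_gb * m)}G" for m in (0.5, 1, 2)]  # 50%, 100%, 200%

for block, op, total in product(block_sizes, operations, total_sizes):
    print(
        f"sysbench memory --memory-block-size={block} "
        f"--memory-total-size={total} --memory-oper={op} run"
    )
```

This prints the 12 command lines for one board (2 block sizes x 2 operations x 3 total sizes).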

Xavier NX

❯ sysbench --test=memory --memory-block-size=1M --memory-total-size=8G run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1024KiB
total size: 8192MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 8192 (15042.72 per second)
8192.00 MiB transferred (15042.72 MiB/sec)
General statistics:
total time: 0.5421s
total number of events: 8192
Latency (ms):
min: 0.06
avg: 0.07
max: 0.50
95th percentile: 0.08
sum: 532.89
Threads fairness:
events (avg/stddev): 8192.0000/0.00
execution time (avg/stddev): 0.5329/0.00

Odroid N2

❯ sysbench --test=memory --memory-block-size=1M --memory-total-size=4G run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1024K
Memory transfer size: 4096M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 4096 (1236.33 ops/sec)
4096.00 MB transferred (1236.33 MB/sec)
Test execution summary:
total time: 3.3130s
total number of events: 4096
total time taken by event execution: 3.3096
per-request statistics:
min: 0.29ms
avg: 0.81ms
max: 2.31ms
approx. 95 percentile: 1.57ms
Threads fairness:
events (avg/stddev): 4096.0000/0.00
execution time (avg/stddev): 3.3096/0.00

RockPro64

rock64@rockpro64:~$ sysbench --test=memory --memory-block-size=1M --memory-total-size=8G run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1024KiB
total size: 8192MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 8192 (7796.96 per second)
8192.00 MiB transferred (7796.96 MiB/sec)
General statistics:
total time: 1.0473s
total number of events: 8192
Latency (ms):
min: 0.12
avg: 0.13
max: 0.68
95th percentile: 0.13
sum: 1044.03
Threads fairness:
events (avg/stddev): 8192.0000/0.00
execution time (avg/stddev): 1.0440/0.00

Below is a table with all tested block sizes and memory amounts.

The Xavier averages 200% faster memory access compared to the RockPro64 and 500% faster than the Odroid-N2. This is the opposite of what we see in the CPU comparison, where the Odroid-N2 is usually faster than the RockPro64.
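For the specific 1M-block write runs shown above, the ratios work out as follows (note that the Odroid-N2 log comes from an older sysbench 0.4.12, so its numbers aren't directly comparable to the 1.0.18 runs; the averages quoted come from the full table):

```python
# 1M-block write throughput from the sysbench runs above (MiB/s or MB/s).
throughput = {
    "Xavier NX": 15042.72,
    "RockPro64": 7796.96,
    "Odroid N2": 1236.33,  # sysbench 0.4.12, not directly comparable
}

xavier = throughput["Xavier NX"]
for board in ("RockPro64", "Odroid N2"):
    print(f"Xavier NX vs {board}: {xavier / throughput[board]:.1f}x")
```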

To test the CUDA GPU acceleration, I took a sample app from the PyTorch framework: MNIST image classification. In this case I didn't compare against the other ARM boards since neither has a CUDA-capable GPU. The tests were done on the Xavier board itself, with and without GPU acceleration for the ML workload. This is where this board shines.

The sample app can be found here:

The execution is done on the Docker container provided by NVIDIA with all required PyTorch requirements and drivers installed. Run Docker with:

docker run -it --runtime=nvidia --rm -v $(pwd):/work -w /work nvcr.io/nvidia/l4t-pytorch:r32.4.2-pth1.5-py3

Paste the Gist contents into a mnist.py file and execute it with time python3 mnist.py. Below are the test results:

With GPU

First, the sample was executed with full CUDA support. jtop showed CPU usage oscillating around 30–40%, while the GPU jumped to 1.1GHz and oscillated around 40–75% utilization.

root@40cb50bd2bc3:/work/pytorch# time python3 mnist.py --epochs=1
Train Epoch: 1 [0/60000 (0%)] Loss: 2.333409
..
Test set: Average loss: 0.0564, Accuracy: 9802/10000 (98%)
real 0m52.126s
user 1m2.100s
sys 0m8.316s

I ran 1 epoch (1 iteration) with the command time python3 mnist.py --epochs=1 and the execution took 52.1 seconds.

Without GPU

Without GPU acceleration, all 6 cores ran at 1.4GHz and 100% utilization the whole time. The GPU stayed at 0%, as expected.

root@40cb50bd2bc3:/work/pytorch# time python3 mnist.py --epochs=1 --no-cuda
Train Epoch: 1 [0/60000 (0%)] Loss: 2.311259
..
Test set: Average loss: 0.0520, Accuracy: 9825/10000 (98%)
real 12m59.672s
user 65m59.436s
sys 0m44.076s

The execution of 1 epoch (1 iteration) with the command time python3 mnist.py --epochs=1 --no-cuda took 12m59.7s. Yes, you read that right, almost 13 minutes! Around 15x longer than the GPU run.
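The speedup falls out of the wall-clock ("real") times in the two logs above:

```python
# Wall-clock ("real") times from the two PyTorch MNIST runs above.
gpu_seconds = 52.126            # real 0m52.126s, with CUDA
cpu_seconds = 12 * 60 + 59.672  # real 12m59.672s, with --no-cuda

speedup = cpu_seconds / gpu_seconds
print(f"GPU run is {speedup:.1f}x faster than CPU-only")
```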

In comparison, running the same workload on my MacBook Pro 15" 2018 (2.6GHz 6-core Intel Core i7), without a CUDA GPU, in a Docker container inside a VM with 10 cores and 6GB RAM, took:

> docker run -it --rm -v $(pwd):/work -w /work --rm pytorch/pytorch bash
> time python3 mnist.py --epochs=1
....
real 1m39.388s
user 5m32.886s
sys 1m6.857s

It's amazing that a tiny GPU on a small single-board computer, consuming way less power, can run the same workload in almost half the time of a top-of-the-line laptop.

TensorFlow 2

For the TensorFlow 2 tests, I used another MNIST sample. The app is based on this file but needed some adjustments; I've uploaded it to this Gist.

Since NVIDIA builds container images only for TensorFlow 1, I created a Dockerfile for TF2 for this test. The image has been pushed to DockerHub as carlosedp/l4t-tensorflow:r32.4.2-tf2-py3.

The tests were executed in Docker, with the nvidia runtime for GPU acceleration or without it for CPU only.

Curiously, I've seen very similar times between the CPU and GPU executions. I even opened a thread on the NVIDIA forums, but trying those suggestions still got me similar numbers.

GPU

The container is executed with docker run -it --runtime=nvidia --rm -v $(pwd):/work -w /work carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 bash

> python3 mnist.py
...
2020-06-12 19:59:30.666323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3560 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2020-06-12 19:59:32.241779: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
step: 100, loss: 0.373517, accuracy: 0.898438
step: 200, loss: 0.250729, accuracy: 0.933594
...
Test Accuracy: 0.966400
real 0m52.721s
user 0m51.516s
sys 0m5.936s

CPU

The container is executed with docker run -it --rm -v $(pwd):/work -w /work carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 bash

> python3 mnist.py
...
2020-06-12 19:57:30.661145: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-12 19:57:30.661284: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
step: 100, loss: 0.347883, accuracy: 0.914062
step: 200, loss: 0.248891, accuracy: 0.933594
...
Test Accuracy: 0.964100
real 0m51.953s
user 1m14.548s
sys 0m4.824s

As seen, pretty similar times. Also, I haven't seen the GPU on jtop go above 25–30% utilization, at 114MHz (the minimum clock), in GPU mode. Maybe someone with more knowledge about TensorFlow and Keras can send me feedback.

The Jetson Xavier NX consumes around 7.3W while idle, and topped out at 11.7W in 6-core 1.4GHz mode and 9.6W in 2-core 1.9GHz mode.

This is at least 1.8x higher power consumption at full load, and 2.5x higher at idle, compared to the Odroid-N2 I reviewed previously.

During the PyTorch MNIST test, I saw peaks of 13.5W with the GPU and 10.2W on CPU only. Considering the time each run took, the GPU acceleration is well worth it, since the execution finished about 15x faster.
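A rough back-of-the-envelope energy estimate from the peak power figures and run times above (using peaks as a stand-in for average draw, so both numbers are upper bounds) shows the GPU also wins on total energy, not just time:

```python
# Peak power draw (W) and wall-clock time (s) for each PyTorch MNIST run above.
runs = {
    "GPU": (13.5, 52.126),            # real 0m52.126s
    "CPU": (10.2, 12 * 60 + 59.672),  # real 12m59.672s
}

energy = {name: watts * secs for name, (watts, secs) in runs.items()}
for name, joules in energy.items():
    print(f"{name}: ~{joules:.0f} J ({joules / 3600:.3f} Wh)")
print(f"CPU-only run used ~{energy['CPU'] / energy['GPU']:.1f}x more energy")
```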

All in all, the NVIDIA Jetson Xavier NX board is a powerhouse with more processing power than most ARM SBCs I've seen. With its GPU, a new frontier opens, with the capability of processing Machine Learning workloads at the edge, topping the CPU performance of a fairly recent computer (a 2018 MacBook Pro).

The Xavier NX is also a great desktop option when paired with an M.2 SSD (for more storage performance and space), and can be used as one's daily driver for most development tasks.

The downside is its price: at $399, it is placed well above most SBCs, but with its CPU and RAM performance, GPU, M.2 slot and the NVIDIA ecosystem, it's in a different category, close to what an ARM workstation would be.

If you have suggestions or ideas of tests to be done in these boards, send me your feedback on Twitter @carlosedp.

I write about everything cloud and all the tech behind it. If you like my projects and would like to support me, check my Patreon at https://www.patreon.com/carlosedp