NVIDIA Jetson Xavier NX Benchmarks

Carlos Eduardo
Jul 3, 2020 · 10 min read

I recently received an NVIDIA Jetson Xavier NX board to review and write some posts about. The first one is an unofficial guide to upgrading Ubuntu 18.04 to the latest Ubuntu Focal (20.04).

Here I will run some benchmarks and compare the performance of the Jetson Xavier NX against other SBCs. A while back, I benchmarked some ARM boards, comparing their performance on Java and other workloads. This time I will take a similar approach and add GPU and power consumption tests and comparisons.

Of course, the price range varies a lot, from $79 for the Odroid N2 and the RockPro64 to $399 for the Xavier NX, so we cannot expect similar performance or features. All boards were set up with Debian or Ubuntu and no graphical interface enabled.

To measure CPU and GPU usage, I installed jtop from the jetson-stats repo. It's a utility similar to htop, but it also fetches NVIDIA-specific stats. The app can be installed with sudo -H pip install -U jetson-stats and run with sudo jtop.
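For convenience, the install and launch commands are:

# Install jetson-stats (provides jtop) and run it with root privileges
sudo -H pip install -U jetson-stats
sudo jtop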

Disclaimer: These tests and benchmarks are not scientific and were not validated under strict laboratory conditions. They were made in my home lab using open source tools and equipment available on the market. Also, the benchmarks I used are in no way comprehensive of all available workloads.

Board Quick Specs

NVIDIA Jetson Xavier NX

  • CPU: 6-core Carmel Arm 64-bit CPU, 6MB L2 + 4MB L3.
  • GPU: NVIDIA Volta with 384 NVIDIA CUDA cores and 48 Tensor Cores, plus 2x NVDLA.
  • 8GB 128-bit LPDDR4x RAM
  • Gigabit Ethernet

HardKernel Odroid N2

  • Amlogic S922X (4x Cortex-A73 @ 1.8GHz, 2x Cortex-A53 @ 1.9GHz); 12nm fab; Mali-G52 GPU with 6x 846MHz EEs.
  • 4GB DDR4 RAM.
  • Gigabit Ethernet

Pine64 RockPro64

  • Rockchip RK3399 Hexa-Core (dual ARM Cortex A72 and quad ARM Cortex A53) 64-Bit Processor and MALI T-860 Quad-Core GPU.
  • 4GB LPDDR4 RAM.
  • PCIe x4.
  • Gigabit Ethernet

Benchmarks Used

CPU Benchmarks

7zip

I use the 7zip benchmark as a baseline for most boards since it reflects raw CPU performance and provides a good comparison point between boards.
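For reference, the numbers below come from 7-Zip's built-in benchmark mode, 7z b; the apt package name below is an assumption on my part, but it is the usual way to get the 7z binary on Debian/Ubuntu:

# Assumed install step on Debian/Ubuntu; the benchmark itself is just "7z b"
sudo apt install -y p7zip-full
7z b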

NVIDIA Jetson Xavier NX

❯ 7z b

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq: - 64000000 64000000 - 128000000 256000000 512000000 - 2048000000

RAM size:    7763 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       9309   466   1945   9057  |     113067   568   1699   9642
23:       9036   454   2026   9207  |     110032   564   1688   9521
24:       9186   465   2122   9877  |     108373   567   1678   9512
25:       9337   469   2275  10661  |     106344   564   1679   9464
----------------------------------  | ------------------------------
Avr:             464   2092   9701  |              566   1686   9535
Tot:             515   1889   9618

Odroid N2

❯ 7z b

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq: 1597 1763 1797 1796 1795 1786 1792 1792 1791

RAM size:    3713 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       5848   525   1083   5690  |     118603   535   1892  10115
23:       5615   530   1080   5722  |     114888   532   1868   9941
24:       5511   539   1099   5926  |     113151   536   1853   9932
25:       5223   540   1105   5964  |     107952   530   1812   9607
----------------------------------  | ------------------------------
Avr:             533   1092   5826  |              533   1856   9899
Tot:             533   1474   7862

RockPro64

❯ 7z b

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)

LE
CPU Freq: 64000000 - - - - - 512000000 - -

RAM size:    3793 MB,  # CPU hardware threads:   6
RAM usage:   1323 MB,  # Benchmark threads:      6

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       4468   476    913   4347  |      94209   522   1540   8034
23:       4114   473    886   4192  |      92694   525   1528   8021
24:       3965   472    902   4264  |      90617   525   1516   7954
25:       3840   480    913   4385  |      88555   525   1501   7881
----------------------------------  | ------------------------------
Avr:             475    904   4297  |              524   1521   7972
Tot:             500   1213   6135

Here we see that, comparing the total ratings, the NVIDIA Jetson Xavier NX is 22% faster than the Odroid-N2 and 56% faster than the RK3399 SoC in the RockPro64.

Java SPECjvm2008

In the past, I've compared Java versions on ARM64 and also compared some of my boards using this same benchmark; it offers a good balance of workloads to demonstrate CPU performance.
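Roughly, SPECjvm2008 ships as a runnable jar that takes sub-benchmark names as arguments; something like the sketch below (the -ikv flag skips kit validation, and the exact flags and benchmark selection may differ from what I actually ran):

# Hypothetical SPECjvm2008 invocation; sub-benchmark names come from the suite itself
java -jar SPECjvm2008.jar -ikv compress crypto serial sunflow xml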

Here I compare the three boards, broken down by individual benchmark:

In this test, we see that, on average across the multiple SPECjvm2008 benchmarks, the Amlogic SoC is 27% faster than the RK3399. The NVIDIA Xavier NX is 150% faster than the Amlogic and 187% faster than the RK3399.

Memory Benchmarks

Next, I tested memory performance using sysbench. In these tests, I used 1K and 1M block sizes with both read and write operations.

The tests used total transfer sizes of 50%, 100%, and 200% of each board's memory.
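A minimal sketch of that parameter matrix as a shell loop, assuming sysbench 1.0's memory test options (--memory-oper selects read or write); the 4G/8G/16G sizes correspond to the 50/100/200% points for the 8GB Xavier NX and would be halved for the 4GB boards:

# Sweep block size, total transfer size and operation; sizes shown assume an 8GB board
for size in 4G 8G 16G; do
  for block in 1K 1M; do
    for oper in read write; do
      sysbench memory --memory-block-size=$block --memory-total-size=$size --memory-oper=$oper run
    done
  done
done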

Xavier NX

❯ sysbench --test=memory --memory-block-size=1M --memory-total-size=8G run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1024KiB
total size: 8192MiB
operation: write
scope: global
Initializing worker threads...

Threads started!

Total operations: 8192 (15042.72 per second)

8192.00 MiB transferred (15042.72 MiB/sec)

General statistics:
total time: 0.5421s
total number of events: 8192
Latency (ms):
min: 0.06
avg: 0.07
max: 0.50
95th percentile: 0.08
sum: 532.89
Threads fairness:
events (avg/stddev): 8192.0000/0.00
execution time (avg/stddev): 0.5329/0.00

Odroid N2

❯ sysbench --test=memory --memory-block-size=1M --memory-total-size=4G run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1024K
Memory transfer size: 4096M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 4096 (1236.33 ops/sec)

4096.00 MB transferred (1236.33 MB/sec)

Test execution summary:
total time: 3.3130s
total number of events: 4096
total time taken by event execution: 3.3096
per-request statistics:
min: 0.29ms
avg: 0.81ms
max: 2.31ms
approx. 95 percentile: 1.57ms
Threads fairness:
events (avg/stddev): 4096.0000/0.00
execution time (avg/stddev): 3.3096/0.00

RockPro64

rock64@rockpro64:~$ sysbench --test=memory --memory-block-size=1M --memory-total-size=8G run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1024KiB
total size: 8192MiB
operation: write
scope: global
Initializing worker threads...

Threads started!

Total operations: 8192 (7796.96 per second)

8192.00 MiB transferred (7796.96 MiB/sec)

General statistics:
total time: 1.0473s
total number of events: 8192
Latency (ms):
min: 0.12
avg: 0.13
max: 0.68
95th percentile: 0.13
sum: 1044.03
Threads fairness:
events (avg/stddev): 8192.0000/0.00
execution time (avg/stddev): 1.0440/0.00

Below is a table with all tested block sizes and memory amounts.

The Xavier averages 200% faster memory access compared to the RockPro64 and 500% faster than the Odroid-N2 (note that the Odroid-N2 ran an older sysbench version, 0.4.12, so its numbers may not be directly comparable). This is the opposite of what we see in the CPU comparison, where the Odroid-N2 is usually faster than the RockPro64.

Machine Learning Tests

To test CUDA GPU acceleration, I took a sample app from the PyTorch framework, the MNIST image classification example. In this case, I didn't compare against the other ARM boards since neither has a CUDA-capable GPU. The tests were run on the Xavier board itself, with and without GPU acceleration for the ML workload. This is where this board shines.

The sample app can be found here:

Execution is done inside the Docker container provided by NVIDIA, which has all the required PyTorch dependencies and drivers installed. Run the container with:

docker run -it --runtime=nvidia --rm -v $(pwd):/work -w /work nvcr.io/nvidia/l4t-pytorch:r32.4.2-pth1.5-py3

Inside the container, paste the code from the Gist into a mnist.py file and execute it with time python3 mnist.py.
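The two runs compared below boil down to these commands (all of them appear in this post; fetching mnist.py from the Gist into the current directory beforehand is assumed):

# Start NVIDIA's L4T PyTorch container with the current directory mounted at /work
docker run -it --runtime=nvidia --rm -v $(pwd):/work -w /work nvcr.io/nvidia/l4t-pytorch:r32.4.2-pth1.5-py3
# Inside the container: one training epoch with CUDA, then the same epoch CPU-only
time python3 mnist.py --epochs=1
time python3 mnist.py --epochs=1 --no-cuda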

With GPU

First, the sample was executed with full CUDA support. The jtop utility showed CPU usage oscillating around 30–40%, while the GPU jumped to 1.1GHz and oscillated around 40–75% utilization.

root@40cb50bd2bc3:/work/pytorch# time python3 mnist.py --epochs=1
Train Epoch: 1 [0/60000 (0%)] Loss: 2.333409
..
Test set: Average loss: 0.0564, Accuracy: 9802/10000 (98%)
real 0m52.126s
user 1m2.100s
sys 0m8.316s

I ran 1 epoch (1 iteration) with the command time python3 mnist.py --epochs=1 and the execution took about 52 seconds.

Without GPU

Without GPU acceleration, all 6 cores ran at 1.4GHz and 100% utilization the whole time. The GPU stayed at 0% as expected.

root@40cb50bd2bc3:/work/pytorch# time python3 mnist.py --epochs=1 --no-cuda
Train Epoch: 1 [0/60000 (0%)] Loss: 2.311259
..
Test set: Average loss: 0.0520, Accuracy: 9825/10000 (98%)
real 12m59.672s
user 65m59.436s
sys 0m44.076s

The execution of 1 epoch (1 iteration) with the command time python3 mnist.py --epochs=1 --no-cuda took 12 minutes and 59.7 seconds. Yes, you read it right, almost 13 minutes: roughly 15x longer than the GPU run.

In comparison, running the same workload on my MacBook Pro 15" 2018 (2.6GHz 6-core Intel Core i7) without a CUDA GPU, in a Docker container running inside a VM with 10 cores and 6GB RAM, took:

> docker run -it --rm -v $(pwd):/work -w /work --rm pytorch/pytorch bash
> time python3 mnist.py --epochs=1
....
real 1m39.388s
user 5m32.886s
sys 1m6.857s

It's amazing that a tiny GPU in a small single-board computer, consuming far less power, can run the same workload in almost half the time of a top-of-the-line laptop.

TensorFlow 2

For the TensorFlow 2 tests, I used another MNIST sample. The app is based on this file but needed some adjustments, so I've uploaded my version to this Gist.

Since NVIDIA builds container images only for TensorFlow 1.15, I created a Dockerfile for TF2 for this test. The image has been pushed to Docker Hub with the tag carlosedp/l4t-tensorflow:r32.4.2-tf2-py3.
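The build and push steps themselves are the standard Docker ones; a quick sketch, assuming the image is built on the Xavier itself so the aarch64 wheels match (only the tag below comes from this post, the Dockerfile contents are not shown here):

# Build and push the TF2 image (Dockerfile contents not reproduced in this post)
docker build -t carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 .
docker push carlosedp/l4t-tensorflow:r32.4.2-tf2-py3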

The tests were executed in Docker, using the nvidia runtime for GPU acceleration or without it for CPU-only execution.

Curiously I’ve seen very similar times between the CPU and GPU executions. I even opened a thread on NVIDIA Forums and trying those suggestions got me similar numbers.

GPU

The container is executed with docker run -it --runtime=nvidia --rm -v $(pwd):/work -w /work carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 bash

> time python3 mnist.py
...
2020-06-12 19:59:30.666323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3560 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2020-06-12 19:59:32.241779: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
step: 100, loss: 0.373517, accuracy: 0.898438
step: 200, loss: 0.250729, accuracy: 0.933594
...
Test Accuracy: 0.966400
real 0m52.721s
user 0m51.516s
sys 0m5.936s

CPU

The container is executed with docker run -it --rm -v $(pwd):/work -w /work carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 bash

> time python3 mnist.py
...
2020-06-12 19:57:30.661145: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-12 19:57:30.661284: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
step: 100, loss: 0.347883, accuracy: 0.914062
step: 200, loss: 0.248891, accuracy: 0.933594
...
Test Accuracy: 0.964100
real 0m51.953s
user 1m14.548s
sys 0m4.824s

As seen, the times are pretty similar. Also, in jtop I didn't see the GPU go above 25–30% utilization at 114MHz (the minimum clock) in GPU mode. Maybe someone with more knowledge about TensorFlow and Keras can send me feedback.
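A quick sanity check is to confirm that TensorFlow actually sees the GPU inside the container; tf.config.list_physical_devices is available in TF 2.x:

# Inside the GPU-enabled container: list the devices TensorFlow can see
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# This should print one GPU entry when --runtime=nvidia is used; an empty list means CPU-only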

Power Consumption

The Jetson Xavier NX consumes around 7.3W while idle and topped out at 11.7W in the 6-core 1.4GHz mode and 9.6W in the 2-core 1.9GHz mode.
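For reference, these power modes are switched with NVIDIA's nvpmodel tool; the numeric mode IDs vary between JetPack releases, so the ID below is only an example:

# Query the current power mode (modes are defined in /etc/nvpmodel.conf)
sudo nvpmodel -q
# Switch to another predefined mode by ID (example ID only), then check the resulting clocks
sudo nvpmodel -m 2
sudo jetson_clocks --show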

This is at least 1.8x higher power consumption at full load and 2.5x higher at idle compared to the Odroid-N2 I reviewed earlier.

During the PyTorch MNIST test, I saw peaks of 13.5W on the GPU run and 10.2W on the CPU-only run. Considering how long each run took, the GPU acceleration is well worth it since it finished in roughly 1/15th of the time.

Conclusion

All in all, the NVIDIA Jetson Xavier NX board is a powerhouse with more processing power than I've seen on most ARM SBCs. With its GPU, a new frontier opens up for processing machine learning workloads at the edge, beating the CPU performance of a fairly recent computer (a 2018 MacBook Pro).

As a desktop, the Xavier NX is also a great option when paired with an M.2 SSD (for more storage performance and capacity) and can be used as a daily driver for most development tasks.

The downside is its price: at $399 it sits well above most SBCs, but with its CPU and RAM performance, GPU, M.2 slot, and the NVIDIA ecosystem, it's in a different category, closer to what an ARM workstation would be.

If you have suggestions or ideas of tests to be done in these boards, send me your feedback on Twitter @carlosedp.

