NVIDIA Jetson Xavier NX Benchmarks

Board Quick Specs

NVIDIA Jetson Xavier NX

  • CPU: 6-core Carmel Arm 64-bit CPU, 6MB L2 + 4MB L3.
  • GPU: NVIDIA Volta with 384 NVIDIA CUDA cores and 48 Tensor Cores, plus 2x NVDLA.
  • 8GB 128-bit LPDDR4x RAM
  • Gigabit Ethernet

Amlogic S922X board

  • Amlogic S922X (4x Cortex-A73 @ 1.8GHz, 2x Cortex-A53 @ 1.9GHz); 12nm fab; Mali-G52 GPU with 6x 846MHz EEs.
  • 4GB DDR4 RAM.
  • Gigabit Ethernet

Rockchip RK3399 board (RockPro64)

  • Rockchip RK3399 hexa-core (dual Arm Cortex-A72 and quad Arm Cortex-A53) 64-bit processor and Mali-T860 quad-core GPU.
  • 4GB LPDDR4 RAM.
  • PCIe x4.
  • Gigabit Ethernet

CPU Benchmarks

7zip

I use the 7-Zip benchmark as a baseline for most boards, since it measures raw CPU performance and makes for an easy comparison between boards.

❯ 7z b
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)
LE
CPU Freq: - 64000000 64000000 - 128000000 256000000 512000000 - 2048000000
RAM size: 7763 MB, # CPU hardware threads: 6
RAM usage: 1323 MB, # Benchmark threads: 6
                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS
22:       9309   466   1945   9057  |     113067   568   1699   9642
23:       9036   454   2026   9207  |     110032   564   1688   9521
24:       9186   465   2122   9877  |     108373   567   1678   9512
25:       9337   469   2275  10661  |     106344   564   1679   9464
----------------------------------  | ------------------------------
Avr:             464   2092   9701  |              566   1686   9535
Tot:             515   1889   9618
❯ 7z b
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)
LE
CPU Freq: 1597 1763 1797 1796 1795 1786 1792 1792 1791
RAM size: 3713 MB, # CPU hardware threads: 6
RAM usage: 1323 MB, # Benchmark threads: 6
                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS
22:       5848   525   1083   5690  |     118603   535   1892  10115
23:       5615   530   1080   5722  |     114888   532   1868   9941
24:       5511   539   1099   5926  |     113151   536   1853   9932
25:       5223   540   1105   5964  |     107952   530   1812   9607
----------------------------------  | ------------------------------
Avr:             533   1092   5826  |              533   1856   9899
Tot:             533   1474   7862
❯ 7z b
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,6 CPUs LE)
LE
CPU Freq: 64000000 - - - - - 512000000 - -
RAM size: 3793 MB, # CPU hardware threads: 6
RAM usage: 1323 MB, # Benchmark threads: 6
                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS
22:       4468   476    913   4347  |      94209   522   1540   8034
23:       4114   473    886   4192  |      92694   525   1528   8021
24:       3965   472    902   4264  |      90617   525   1516   7954
25:       3840   480    913   4385  |      88555   525   1501   7881
----------------------------------  | ------------------------------
Avr:             475    904   4297  |              524   1521   7972
Tot:             500   1213   6135
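To put the three runs side by side, the overall rating on each "Tot:" line can be pulled out and normalized. The following is a minimal sketch, not part of the original test setup: the helper name parse_7z_total and the labels "run 1" to "run 3" are my own, and the runs are simply taken in the order they appear above (the first one reports 7763 MB of RAM, which matches the 8GB Xavier NX).

```python
import re


def parse_7z_total(output: str) -> int:
    """Return the overall MIPS rating from the 'Tot:' line of `7z b` output.

    Hypothetical helper for illustration; it assumes the rating is the
    last integer on the 'Tot:' line, as in the logs above.
    """
    for line in output.splitlines():
        if line.strip().startswith("Tot:"):
            return int(re.findall(r"\d+", line)[-1])
    raise ValueError("no 'Tot:' line found in 7z benchmark output")


# 'Tot:' lines copied from the three runs above, in the order they appear.
runs = {
    "run 1": "Tot: 515 1889 9618",
    "run 2": "Tot: 533 1474 7862",
    "run 3": "Tot: 500 1213 6135",
}

totals = {name: parse_7z_total(text) for name, text in runs.items()}
baseline = totals["run 1"]

for name, rating in totals.items():
    # Ratio below 1.0 means the run scored lower than run 1.
    print(f"{name}: {rating} MIPS ({rating / baseline:.2f}x of run 1)")
```

The same function can be fed the full captured output of 7z b, since it only looks for the line that starts with "Tot:".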

Memory Benchmarks

Next, I tested memory performance using sysbench. In these tests, I used 1K and 1M blocks with read and write tests.

❯ sysbench --test=memory --memory-block-size=1M --memory-total-size=8G run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1024KiB
total size: 8192MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 8192 (15042.72 per second)
8192.00 MiB transferred (15042.72 MiB/sec)
General statistics:
total time: 0.5421s
total number of events: 8192
Latency (ms):
min: 0.06
avg: 0.07
max: 0.50
95th percentile: 0.08
sum: 532.89
Threads fairness:
events (avg/stddev): 8192.0000/0.00
execution time (avg/stddev): 0.5329/0.00
❯ sysbench --test=memory --memory-block-size=1M --memory-total-size=4G run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1024K
Memory transfer size: 4096M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 4096 ( 1236.33 ops/sec)
4096.00 MB transferred (1236.33 MB/sec)
Test execution summary:
total time: 3.3130s
total number of events: 4096
total time taken by event execution: 3.3096
per-request statistics:
min: 0.29ms
avg: 0.81ms
max: 2.31ms
approx. 95 percentile: 1.57ms
Threads fairness:
events (avg/stddev): 4096.0000/0.00
execution time (avg/stddev): 3.3096/0.00
rock64@rockpro64:~$ sysbench --test=memory --memory-block-size=1M --memory-total-size=8G run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1024KiB
total size: 8192MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 8192 ( 7796.96 per second)
8192.00 MiB transferred (7796.96 MiB/sec)
General statistics:
total time: 1.0473s
total number of events: 8192
Latency (ms):
min: 0.12
avg: 0.13
max: 0.68
95th percentile: 0.13
sum: 1044.03
Threads fairness:
events (avg/stddev): 8192.0000/0.00
execution time (avg/stddev): 1.0440/0.00
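The throughput figures are buried in the sysbench output, so a small script can extract them for comparison. This is a rough sketch under my own assumptions: the helper parse_throughput and the "run" labels are hypothetical, and the regular expression only covers the two formats seen in the logs above ("MiB/sec" from sysbench 1.0.18 and "MB/sec" from sysbench 0.4.12). Since the runs used different sysbench versions and total sizes, the ratios are only indicative.

```python
import re

# Matches "(15042.72 MiB/sec)" as well as "(1236.33 MB/sec)".
THROUGHPUT_RE = re.compile(r"\(\s*([\d.]+)\s*Mi?B/sec\)")


def parse_throughput(output: str) -> float:
    """Return the transfer rate reported by a sysbench memory run.

    Hypothetical helper for illustration; the unit is whatever sysbench
    printed (MiB/sec or MB/sec) and is not converted.
    """
    match = THROUGHPUT_RE.search(output)
    if match is None:
        raise ValueError("no throughput figure found in sysbench output")
    return float(match.group(1))


# Result lines copied from the three runs above, in the order they appear.
runs = {
    "run 1": "8192.00 MiB transferred (15042.72 MiB/sec)",
    "run 2": "4096.00 MB transferred (1236.33 MB/sec)",
    "run 3": "8192.00 MiB transferred (7796.96 MiB/sec)",
}

rates = {name: parse_throughput(text) for name, text in runs.items()}
fastest = max(rates.values())

for name, rate in rates.items():
    # Show each run as a fraction of the fastest one.
    print(f"{name}: {rate:.2f} per second ({rate / fastest:.2f}x of fastest)")
```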

Machine Learning Tests

To test CUDA GPU acceleration, I took a sample app from the PyTorch framework: the MNIST image classification example. I didn’t compare against the other ARM boards here, since neither has a CUDA-capable GPU. The tests were run on the Xavier board itself, with and without GPU acceleration for ML workloads. This is where this board shines.

docker run -it --runtime=nvidia --rm -v $(pwd):/work -w /work nvcr.io/nvidia/l4t-pytorch:r32.4.2-pth1.5-py3
root@40cb50bd2bc3:/work/pytorch# time python3 mnist.py --epochs=1
Train Epoch: 1 [0/60000 (0%)] Loss: 2.333409
..
Test set: Average loss: 0.0564, Accuracy: 9802/10000 (98%)
real 0m52.126s
user 1m2.100s
sys 0m8.316s
root@40cb50bd2bc3:/work/pytorch# time python3 mnist.py --epochs=1 --no-cuda
Train Epoch: 1 [0/60000 (0%)] Loss: 2.311259
..
Test set: Average loss: 0.0520, Accuracy: 9825/10000 (98%)
real 12m59.672s
user 65m59.436s
sys 0m44.076s
> docker run -it --rm -v $(pwd):/work -w /work pytorch/pytorch bash
> time python3 mnist.py --epochs=1
....
real 1m39.388s
user 5m32.886s
sys 1m6.857s
> python3 mnist.py
...
2020-06-12 19:59:30.666323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3560 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2020-06-12 19:59:32.241779: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
step: 100, loss: 0.373517, accuracy: 0.898438
step: 200, loss: 0.250729, accuracy: 0.933594
...
Test Accuracy: 0.966400
real 0m52.721s
user 0m51.516s
sys 0m5.936s
> python3 mnist.py
...
2020-06-12 19:57:30.661145: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-12 19:57:30.661284: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
step: 100, loss: 0.347883, accuracy: 0.914062
step: 200, loss: 0.248891, accuracy: 0.933594
...
Test Accuracy: 0.964100
real 0m51.953s
user 1m14.548s
sys 0m4.824s
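The GPU speedup for the PyTorch runs on the Xavier can be worked out from the "real" times that time reported with and without --no-cuda. The snippet below is a small sketch of that arithmetic; the helper name real_seconds is my own and it only handles the "XmY.YYYs" format shown in the logs above.

```python
import re


def real_seconds(stamp: str) -> float:
    """Convert a `time` duration such as '12m59.672s' to seconds.

    Hypothetical helper for illustration; it only accepts the
    minutes-and-seconds format that appears in the logs.
    """
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# 'real' values from the two PyTorch MNIST runs on the Xavier NX.
gpu = real_seconds("0m52.126s")    # default run, CUDA enabled
cpu = real_seconds("12m59.672s")   # run with --no-cuda

speedup = cpu / gpu
print(f"GPU: {gpu:.1f}s, CPU: {cpu:.1f}s, speedup: {speedup:.1f}x")
```

For a single training epoch, that puts the CUDA run at roughly 15 times faster than the CPU-only run on the same board.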

Power Consumption

Conclusion

All in all, the NVIDIA Jetson Xavier NX board is a powerhouse, with more processing power than I’ve seen on most ARM SBCs. With its GPU, a new frontier opens up for processing machine learning workloads at the edge, topping the CPU processing of a fairly recent computer (MacBook Pro 2018).

Carlos Eduardo


Writing everything cloud and all the tech behind it. If you like my projects and would like to support me, check my Patreon on https://www.patreon.com/carlosedp