Congratulations!

This is a valid Atom 1.0 feed.

Recommendations

This feed is valid, but interoperability with the widest range of feed readers could be improved by implementing the following recommendations.

Source: http://blog.tomeuvizoso.net/feeds/posts/default

<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-664175667937540078</id><updated>2024-04-27T01:57:34.561+02:00</updated><category term="gnome"/><category term="sugar"/><category term="python"/><category term="collabora"/><category term="mesa"/><category term="introspection"/><category term="npu"/><category term="tensorflow"/><category term="etnaviv"/><category term="vipnano-qi"/><category term="vivante"/><category term="machine-learning"/><category term="librecomputer"/><category term="clutter"/><category term="telepathy"/><category term="webkit"/><category term="kernel"/><category term="olpc"/><category term="ubuntu"/><category term="chromeos"/><category term="multitouch"/><category term="ceibal"/><category term="pygobject"/><category term="bosch"/><category term="gnome3"/><category term="google"/><category term="graphics"/><category term="hackfest"/><category term="mutter"/><category term="rk3588"/><category term="rockchip"/><category term="verisilicon"/><category term="X11"/><category term="canonical"/><category term="debian"/><category term="desktopsummit"/><category term="fedora"/><category term="fsf"/><category term="gesture"/><category term="gnash"/><category term="gtk-doc"/><category term="igalia"/><category term="panfrost"/><category term="upstream"/><category term="webgl"/><category term="CI"/><category term="EGL"/><category term="bof"/><category term="brno"/><category term="chamelium"/><category term="crosvm"/><category term="devconf"/><category term="docs"/><category term="documentation"/><category term="git"/><category term="greece"/><category term="gstreamer"/><category term="intel"/><category term="kernelci.org"/><category term="kosovo"/><category term="kvm"/><category term="lava"/><category term="lucid sleep"/><category term="mali"/><category term="markdown"/><category term="memory"/><category term="minijail"/><category term="mobile"/><category term="opengl"/><category term="opensuse"/><category term="redhat"/><category term="scaling"/><category term="tegra"/><category term="testing"/><category term="trisquel"/><category term="virgl"/><category term="virtualization"/><category term="wayland"/><title type='text'>Tomeu Vizoso</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default?start-index=26&amp;max-results=25'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' 
src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>133</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-7716416125935972669</id><published>2024-04-19T10:17:00.003+02:00</published><updated>2024-04-19T10:18:30.411+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="rk3588"/><category scheme="http://www.blogger.com/atom/ns#" term="rockchip"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><title type='text'> Rockchip NPU update 3: Real-time object detection on RK3588</title><content type='html'>&lt;h3 style=&quot;text-align: left;&quot;&gt;Progress&lt;/h3&gt;&lt;p&gt;Yesterday I managed to implement in my open-source driver all the remaining operations so the &lt;a href=&quot;https://arxiv.org/abs/2004.14525&quot;&gt;SSDLite MobileDet&lt;/a&gt; model can run on Rockchip&#39;s NPU in the RK3588 SoC.&lt;/p&gt;&lt;p&gt;Performance is pretty good at 30 frames per second when using just one of the 3 cores that the NPU contains.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHr3zmWGrs1FZyd3mb_sSaxdKN4i35Wao0D8dOJvSP0dDO7EhfWw88PFEIQF-FOqYzk0yy6c1joeKIqVEG9PtArtQWl2z-DedrBcMD7pZiXjlELeGaPYfU04o7dSBN7Tgg2-7d5maikXo2qQyViFeQoVxwqwzyLEKpzSCY1k3218QiQOEFInvLedMwAi4/s1920/object_detection_rk3588.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;1080&quot; data-original-width=&quot;1920&quot; height=&quot;225&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHr3zmWGrs1FZyd3mb_sSaxdKN4i35Wao0D8dOJvSP0dDO7EhfWw88PFEIQF-FOqYzk0yy6c1joeKIqVEG9PtArtQWl2z-DedrBcMD7pZiXjlELeGaPYfU04o7dSBN7Tgg2-7d5maikXo2qQyViFeQoVxwqwzyLEKpzSCY1k3218QiQOEFInvLedMwAi4/w400-h225/object_detection_rk3588.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&amp;nbsp;I uploaded the generated video to YouTube at:&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;iframe allowfullscreen=&quot;&quot; class=&quot;BLOG_video_class&quot; height=&quot;266&quot; src=&quot;https://www.youtube.com/embed/DDccYn4wpnY&quot; width=&quot;320&quot; youtube-src-id=&quot;DDccYn4wpnY&quot;&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div style=&quot;text-align: left;&quot;&gt;You can get the source code at my branch &lt;a href=&quot;https://gitlab.freedesktop.org/tomeu/mesa/-/commits/rocket/?ref_type=heads&quot;&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;/div&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;&amp;nbsp;&lt;/h3&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h3&gt;&lt;p&gt;Now that we got to this level of usefulness, I&#39;m going to switch to writing a kernel driver suited for inclusion into the Linux 
kernel, in the drivers/accel subsystem.&lt;/p&gt;&lt;p&gt;There is still lots of work to do, but progress is going pretty fast, though as I write more drivers for different NPUs I will have to split my time among them. At least, until we get more contributors! :)&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/7716416125935972669/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=7716416125935972669' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/7716416125935972669'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/7716416125935972669'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/04/rockchip-npu-update-3-real-time-object.html' title=' Rockchip NPU update 3: Real-time object detection on RK3588'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHr3zmWGrs1FZyd3mb_sSaxdKN4i35Wao0D8dOJvSP0dDO7EhfWw88PFEIQF-FOqYzk0yy6c1joeKIqVEG9PtArtQWl2z-DedrBcMD7pZiXjlELeGaPYfU04o7dSBN7Tgg2-7d5maikXo2qQyViFeQoVxwqwzyLEKpzSCY1k3218QiQOEFInvLedMwAi4/s72-w400-h225-c/object_detection_rk3588.png" height="72" width="72"/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-2268059939300496747</id><published>2024-03-28T08:47:00.000+01:00</published><updated>2024-03-28T08:47:00.757+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="rk3588"/><category scheme="http://www.blogger.com/atom/ns#" term="rockchip"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><title type='text'>Rockchip NPU update 2: MobileNetV1 is done</title><content type='html'>&lt;h3 style=&quot;text-align: left;&quot;&gt;Progress&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;For the last couple of weeks I have kept chipping away at a new userspace driver for the NPU in the Rockchip RK3588 SoC.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;I am very happy to report that the work has gone really smoothly and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;And it not only runs flawlessly, but at the same performance level as the blob.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. 
This has allowed the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.&lt;br /&gt;&lt;/p&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://commons.wikimedia.org/w/index.php?curid=285598&quot; target=&quot;_blank&quot;&gt;&lt;span style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;480&quot; data-original-width=&quot;640&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQiQSHVRGw-EMpuIKA6jxXH-ss_HgutqwgUYXvCg4tPMRq9Js2q7l0NGILTcRlBqDfUOMhKNdzAALj1E8dPN2zxd6aOK59OeO9f5ac0vaWuaEvDEl_EQLu6rd-887qRrMH_7tgG4_oSubzgI2_GCvVD5ck6ukwErppZc1AQ5RawYqzrcB-mec905-jYpI/s320/hen.jpg&quot; width=&quot;320&quot; /&gt;&lt;/span&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;by Julien Langlois CC BY-SA 3.0&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;&amp;nbsp;&lt;span style=&quot;font-family: courier;&quot;&gt;tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 classification.py -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so&lt;br /&gt;Loading external delegate from libteflon.so with args: {}&lt;br /&gt;Teflon delegate: loaded rknpu driver&lt;br /&gt;&lt;br /&gt;teflon: compiling graph: 89 tensors 27 operations&lt;br /&gt;...&lt;br /&gt;teflon: compiled graph, took 413 ms&lt;br /&gt;teflon: invoked graph, took 11 ms&lt;br /&gt;teflon: invoked graph, took 11 ms&lt;br /&gt;teflon: invoked graph, took 11 ms&lt;br /&gt;teflon: invoked graph, took 10 ms&lt;br /&gt;teflon: invoked graph, took 10 ms&lt;br /&gt;&lt;b&gt;0.984314: hen&lt;/b&gt;&lt;br /&gt;0.019608: cock&lt;br /&gt;0.000000: toilet tissue&lt;br /&gt;0.000000: sea cucumber&lt;br /&gt;0.000000: wood rabbit&lt;br /&gt;time: 10.776ms&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using; that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inferences going fast irrespective of the hardware.&lt;/span&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Thanks to TL Lim of &lt;a href=&quot;https://pine64.org/&quot;&gt;PINE64&lt;/a&gt; for sending me a &lt;a href=&quot;https://wiki.pine64.org/wiki/QuartzPro64_Development&quot;&gt;QuartzPro64&lt;/a&gt; board to hack on. 
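The transcript above goes through TensorFlow Lite's external-delegate mechanism: the delegate is loaded from a shared object (libteflon.so) at runtime, so nothing in the application needs to know which NPU driver is underneath. A rough equivalent of that flow in C, using the TensorFlow Lite C API, might look like the following sketch. The model path is taken from the invocation above, error handling is mostly omitted, and tflite_plugin_create_delegate is the entry point that external delegates conventionally export; check the symbols actually present in your build before relying on this.

```c
/* Sketch: load an external TFLite delegate (e.g. libteflon.so) at
 * runtime and hand a model to it via the TensorFlow Lite C API. */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include "tensorflow/lite/c/c_api.h"

/* Entry point conventionally exported by TFLite external delegates. */
typedef TfLiteDelegate *(*create_delegate_fn)(char **keys, char **values,
                                              size_t n,
                                              void (*report)(const char *));

int main(void) {
  void *lib = dlopen("libteflon.so", RTLD_NOW);
  if (!lib) { fprintf(stderr, "%s\n", dlerror()); return 1; }

  create_delegate_fn create =
      (create_delegate_fn)dlsym(lib, "tflite_plugin_create_delegate");
  TfLiteDelegate *delegate = create(NULL, NULL, 0, NULL);

  TfLiteModel *model =
      TfLiteModelCreateFromFile("mobilenet_v1_1.0_224_quant.tflite");
  TfLiteInterpreterOptions *options = TfLiteInterpreterOptionsCreate();
  TfLiteInterpreterOptionsAddDelegate(options, delegate);

  TfLiteInterpreter *interp = TfLiteInterpreterCreate(model, options);
  TfLiteInterpreterAllocateTensors(interp);
  /* ... fill the input tensor with a preprocessed image here ... */
  TfLiteInterpreterInvoke(interp); /* delegated ops run on the NPU */

  TfLiteInterpreterDelete(interp);
  TfLiteInterpreterOptionsDelete(options);
  TfLiteModelDelete(model);
  return 0;
}
```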
&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Next steps&lt;/span&gt;&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;I want to go back and get my last work on performance for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later.&lt;/span&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;After that, I&#39;m a bit torn between working further on the userspace driver and implementing more operations and control flow, or starting to write a kernel driver for mainline.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/2268059939300496747/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=2268059939300496747' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/2268059939300496747'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/2268059939300496747'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/03/rockchip-npu-update-2-mobilenetv1-is.html' title='Rockchip NPU update 2: MobileNetV1 is done'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQiQSHVRGw-EMpuIKA6jxXH-ss_HgutqwgUYXvCg4tPMRq9Js2q7l0NGILTcRlBqDfUOMhKNdzAALj1E8dPN2zxd6aOK59OeO9f5ac0vaWuaEvDEl_EQLu6rd-887qRrMH_7tgG4_oSubzgI2_GCvVD5ck6ukwErppZc1AQ5RawYqzrcB-mec905-jYpI/s72-c/hen.jpg" height="72" width="72"/><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-1181718218983280915</id><published>2024-03-16T12:46:00.003+01:00</published><updated>2024-03-16T18:49:59.475+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="rk3588"/><category scheme="http://www.blogger.com/atom/ns#" term="rockchip"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><title type='text'>Rockchip NPU update 1: A walk in the park?</title><content type='html'>&lt;p&gt;During the past weeks I have paused work on the driver for the Vivante NPU and have started work on a new driver, for Rockchip&#39;s own NPU IP, as used in SoCs such as RK3588(S) and RK3568.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;The version of the NPU in the RK3588 claims a performance of 6 TOPS across its 3 cores, though from what I have read, people are having trouble making use of more than one core in parallel, with the closed source driver.&lt;br /&gt;&lt;/p&gt;&lt;table 
align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGU-XeRJwraDc8PCTHTVdlrt4rM0QeZUuKNFA8WuB4Ogr51PgpWAhll2esCPZatq5SoYxIcyCAbQvahRiSiOCVSysu-dXyJu5gT0C-8hvt3mDe4Wuj_qg98pR_utgzeoyw3C042IDW3ZLgoZux7i877z-D684agsk1_QpYzE2pAO609Mnw1RIFVFE7UMM/s640/pexels-mart-production-8121657.jpg&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;427&quot; data-original-width=&quot;640&quot; height=&quot;214&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGU-XeRJwraDc8PCTHTVdlrt4rM0QeZUuKNFA8WuB4Ogr51PgpWAhll2esCPZatq5SoYxIcyCAbQvahRiSiOCVSysu-dXyJu5gT0C-8hvt3mDe4Wuj_qg98pR_utgzeoyw3C042IDW3ZLgoZux7i877z-D684agsk1_QpYzE2pAO609Mnw1RIFVFE7UMM/s320/pexels-mart-production-8121657.jpg&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;&lt;i&gt;A nice walk in the park&lt;/i&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Rockchip, like most other vendors of NPU IP, provides a GPLed kernel driver and pushes out their userspace driver in binary form. The kernel driver is pleasantly simple and relatively up to date with regard to its use of internal kernel APIs. The userspace stack, though, is notoriously buggy and difficult to use, with basic features still unimplemented and performance being quite below what the hardware should be able to achieve.&lt;/p&gt;&lt;p&gt;To be clear, this is on top of the usual problems related to closed-source drivers. I get the impression that Rockchip&#39;s NPU team is really understaffed.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Other people had already looked at reverse-engineering the HW so they could address the limitations and bugs in the closed source driver, and use it in situations not supported by Rockchip. 
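The following lines describe dumping the proprietary stack's buffer submissions by overriding ioctl and other syscalls with an LD_PRELOAD library. A minimal sketch of that interception technique might look like the code below; a real tool would decode the driver's UAPI structs and write out the referenced buffers, which is omitted here because those layouts are exactly what is being reverse-engineered.

```c
/* Sketch of an LD_PRELOAD interposer that logs every ioctl a process
 * makes, so the command streams a proprietary driver submits can be
 * located and inspected. Build and use roughly like:
 *   gcc -shared -fPIC -o libdump.so dump.c -ldl
 *   LD_PRELOAD=./libdump.so <proprietary test binary>
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>

static int (*real_ioctl)(int, unsigned long, ...);

int ioctl(int fd, unsigned long request, ...) {
  va_list ap;
  va_start(ap, request);
  void *arg = va_arg(ap, void *); /* most driver ioctls take a pointer */
  va_end(ap);

  if (!real_ioctl)
    real_ioctl = (int (*)(int, unsigned long, ...))
        dlsym(RTLD_NEXT, "ioctl");

  /* Log the call; decoding/dumping *arg needs the driver's UAPI. */
  fprintf(stderr, "ioctl(fd=%d, req=0x%lx, arg=%p)\n", fd, request, arg);
  return real_ioctl(fd, request, arg);
}
```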
I used information acquired by &lt;a href=&quot;https://github.com/phhusson/rknpu-reverse-engineering&quot;&gt;Pierre-Hugues Husson&lt;/a&gt; and &lt;a href=&quot;https://github.com/mtx512/rk3588-npu/&quot;&gt;Jasbir Matharu&lt;/a&gt; to get started, a big thanks to them!&lt;br /&gt;&lt;/p&gt;&lt;p&gt;After the initial environment was set up (I had to forward-port their kernel driver to v6.8), I wrote a simple library that can be loaded into the process with LD_PRELOAD and that, by overriding ioctl and other syscalls, dumps the buffers that the proprietary userspace driver sends to the hardware.&lt;/p&gt;&lt;p&gt;I started looking at a buffer that, according to the debug logs of the proprietary driver, contained register writes, and when looking at the register descriptions in the TRM, I saw that it had to be closely based on NVIDIA&#39;s NVDLA open-source NPU IP.&lt;/p&gt;&lt;p&gt;With Rockchip&#39;s (terse) description of the registers, NVDLA&#39;s documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than I was able to when working on VeriSilicon&#39;s driver (for which I had zero documentation).&lt;/p&gt;&lt;p&gt;Right now I am at the stage at which I am able to correctly execute TensorFlow Lite&#39;s Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding. Next is to support multiple output channels.&lt;/p&gt;&lt;p&gt;I&#39;m currently using Rockchip&#39;s kernel, but as soon as I&#39;m able to run object detection models with decent hardware utilization, I plan to start writing a new kernel driver for mainlining.&lt;/p&gt;&lt;p&gt;Rockchip&#39;s kernel driver has gems such as passing addresses in the kernel address space across the UAPI...&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Tests run fast and reliably, even with high concurrency:&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&lt;span style=&quot;font-size: x-small;&quot;&gt;tomeu@arm-64:~/mesa$ TEFLON_TEST_DELEGATE=~/mesa/build/src/gallium/targets/teflon/libteflon.so TEFLON_TEST_DATA=src/gallium/targets/teflon/tests LD_LIBRARY_PATH=/home/tomeu/tflite-vx-delegate/build/_deps/tensorflow-build/ ~/.cargo/bin/gtest-runner run --gtest /home/tomeu/mesa/build/src/gallium/targets/teflon/test_teflon --output /tmp -j8 --tests-per-group 1 --baseline ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-fails.txt --flakes ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-flakes.txt&amp;nbsp; --skips ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-skips.txt &lt;br /&gt;Running gtest on 8 threads in 1-test groups&lt;br /&gt;Pass: 0, Duration: 0&lt;br /&gt;Pass: 139, Skip: 14, Duration: 2, Remaining: 2&lt;br /&gt;Pass: 277, Skip: 22, Duration: 4, Remaining: 0&lt;br /&gt;Pass: 316, Skip: 24, Duration: 4, Remaining: 0&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;You can find the source code in &lt;a href=&quot;https://gitlab.freedesktop.org/tomeu/mesa/-/tree/rocket?ref_type=heads&quot;&gt;this branch&lt;/a&gt;.&lt;br /&gt;&lt;p&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/1181718218983280915/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=1181718218983280915' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1181718218983280915'/><link 
rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1181718218983280915'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/03/rockchip-npu-update-1-walk-in-park.html' title='Rockchip NPU update 1: A walk in the park?'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGU-XeRJwraDc8PCTHTVdlrt4rM0QeZUuKNFA8WuB4Ogr51PgpWAhll2esCPZatq5SoYxIcyCAbQvahRiSiOCVSysu-dXyJu5gT0C-8hvt3mDe4Wuj_qg98pR_utgzeoyw3C042IDW3ZLgoZux7i877z-D684agsk1_QpYzE2pAO609Mnw1RIFVFE7UMM/s72-c/pexels-mart-production-8121657.jpg" height="72" width="72"/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-6608771821592155584</id><published>2024-02-23T13:10:00.001+01:00</published><updated>2024-02-23T13:10:25.672+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="verisilicon"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 17: Faster!</title><content type='html'>&lt;p&gt;In the last update I explained how compression of zero weights gave our driver such a big performance improvement.&lt;/p&gt;&lt;p&gt;Since then, I have explored further what could take us closer to the performance of the proprietary driver and saw the opportunity to gather some of the proverbial low-hanging fruit.&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;TL;DR&lt;/h4&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Our driver&#39;s performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver&#39;s 19.5 ms.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver&#39;s 5.5 ms. 
Pretty close!&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO-x0rNGjtToQ2tdZD06wLekaZfisubI0jCp4BSJunHgf9yspA3b86Sz_XvtZh8IT565W2NXBPCnWHCbiimwFhyphenhyphenArSVPwTT0Q1mqMl2pxxjBh6JVEjh9ikXFEEVLxgNbUxGvjaBMCB0uUeB9BszKvyvwxzWZ5Itiq24PKvNUsWr2m-xGbDlwqmvaP68_4/s848/perf_evol_2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;431&quot; data-original-width=&quot;848&quot; height=&quot;326&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO-x0rNGjtToQ2tdZD06wLekaZfisubI0jCp4BSJunHgf9yspA3b86Sz_XvtZh8IT565W2NXBPCnWHCbiimwFhyphenhyphenArSVPwTT0Q1mqMl2pxxjBh6JVEjh9ikXFEEVLxgNbUxGvjaBMCB0uUeB9BszKvyvwxzWZ5Itiq24PKvNUsWr2m-xGbDlwqmvaP68_4/w640-h326/perf_evol_2.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;Enable more convolutions&lt;/h4&gt;&lt;p&gt;Our driver
was rejecting convolutions with a number of output channels that is not divisible by the number of convolution cores in the NPU, because at the start of the development the code that lays the weights out in memory didn&#39;t support that. That caused TensorFlow Lite to run the convolutions on the CPU, and some of them were big enough to take a few milliseconds, several times more than on the NPU.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;When implementing support for bigger kernels I had to add improvements to the tiling of the convolutions, and that included adding support for these other convolutions. So by just removing the rejection of these, we got a nice speedup on SSD MobileDet: from 32.7 ms to 27 ms!&lt;/p&gt;&lt;p&gt;That didn&#39;t help on MobileNetV1 because that one has all its convolutions with neat numbers of output channels.&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;Caching of the input tensor&lt;/h4&gt;&lt;p&gt;So far we were only caching the kernels on the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way of getting us to cache a portion of the input tensor on the remaining internal SRAM.&lt;/p&gt;&lt;p&gt;That got us the rest of the performance improvement mentioned above, but I am having trouble with some combinations of parameters when the input tensor caching is enabled, so I need to get to the bottom of it before I submit it for review.&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h4&gt;&lt;p&gt;At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know that I still need to give a pass at tuning some of the previous performance work.&lt;/p&gt;&lt;p&gt;But after getting the input tensor caching finished and before I move to any other improvements, I think I will invest some time in adding some profiling facilities so I can better direct the efforts and get the best returns.&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/6608771821592155584/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=6608771821592155584' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6608771821592155584'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6608771821592155584'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/02/etnaviv-npu-update-17-faster.html' title=' Etnaviv NPU update 17: Faster!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO-x0rNGjtToQ2tdZD06wLekaZfisubI0jCp4BSJunHgf9yspA3b86Sz_XvtZh8IT565W2NXBPCnWHCbiimwFhyphenhyphenArSVPwTT0Q1mqMl2pxxjBh6JVEjh9ikXFEEVLxgNbUxGvjaBMCB0uUeB9BszKvyvwxzWZ5Itiq24PKvNUsWr2m-xGbDlwqmvaP68_4/s72-w640-h326-c/perf_evol_2.png" height="72" width="72"/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-2527511145589800731</id><published>2024-02-08T10:36:00.000+01:00</published><updated>2024-02-08T10:36:04.340+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category 
scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="verisilicon"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 16: A nice performance jump</title><content type='html'>&lt;p&gt;After the open-source driver for &lt;a href=&quot;https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP&quot;&gt;VeriSilicon&#39;s Vivante NPU&lt;/a&gt; was &lt;a href=&quot;https://blog.tomeuvizoso.net/2024/01/etnaviv-npu-update-15-we-are-upstream.html&quot;&gt;merged into Mesa&lt;/a&gt; two weeks ago, I have been taking some rest and thinking about what will come next.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Automated testing &lt;br /&gt;&lt;/h3&gt;&lt;p&gt;I have a &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27214&quot;&gt;merge request&lt;/a&gt; to Mesa almost ready that will enable continuous integration testing on real hardware, but it depends on solving what seem to be problems with the power supplies of the boards in the HW testing lab. &lt;a href=&quot;https://www.collabora.com/&quot;&gt;Collabora&lt;/a&gt; is graciously looking at it. Thanks!&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Performance&lt;br /&gt;&lt;/h3&gt;&lt;p&gt;I have been talking with quite a few people about the whole effort of bringing open-source to NPU hardware and something that came up more than once is the question of reaching or surpassing the performance level of the proprietary drivers.&lt;/p&gt;&lt;p&gt;It is a fair concern, because the systolic arrays will be underutilized if they are starved of data. And given how fast they are in performing the arithmetic operations, and how slow memory buses and chips on embedded devices are (relative to high-end GPUs, at least), this starving and the consequent underutilization are very likely to happen.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;IP vendors go to great lengths to prevent that from happening, inventing ways of getting the data faster to the processing elements, reducing the memory bandwidth used, and balancing the use of the different cores/arrays. There is plenty of published research in this area, which helps when figuring out how to make the most of a particular piece of hardware.&lt;br /&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Weight compression &lt;br /&gt;&lt;/h3&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Something I started working on last week is compression of zero values in the weight buffers. 
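The post goes on to describe compressing consecutive runs of zeroes using 5 bits of run length. As a purely illustrative sketch of the general idea (the format below is invented; the actual hardware encoding is not spelled out beyond the 5-bit detail mentioned next), run-length encoding zeroes in a weight buffer could look like this:

```c
/* Illustrative sketch: collapse runs of zero bytes in a weight buffer.
 * Invented format: a 0x00 marker byte followed by a byte whose low 5
 * bits hold the run length (1..31), so up to 31 zeroes become 2 bytes.
 * Zeroes only ever appear as markers, so decoding is unambiguous.
 * This only shows why sparse weights shrink so much; it is not the
 * hardware's real encoding. */
#include <stddef.h>
#include <stdint.h>

size_t compress_zeroes(const uint8_t *in, size_t n, uint8_t *out) {
  size_t o = 0;
  for (size_t i = 0; i < n;) {
    if (in[i] == 0) {
      size_t run = 0;
      while (i + run < n && in[i + run] == 0 && run < 31)
        run++;
      out[o++] = 0x00;         /* marker: a zero run follows */
      out[o++] = (uint8_t)run; /* 5-bit run length */
      i += run;
    } else {
      out[o++] = in[i++];      /* literal non-zero weight */
    }
  }
  return o; /* compressed size in bytes */
}
```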
&lt;a href=&quot;https://arxiv.org/abs/2102.00554&quot;&gt;Sparsity&lt;/a&gt; is very common in the neural models that this hardware is targeted to run, and common convolutions such as strided and depthwise can easily have zero ratios of 90% and more.&lt;/p&gt;&lt;p&gt;By compressing consecutive zeroes in a buffer we can greatly reduce pressure on the memory bus, keeping the processing units better fed (though I&#39;m sure we are still far from getting good utilization).&lt;/p&gt;&lt;p&gt;By opportunistically using the 5 available bits to compress consecutive runs of zeroes, I was able to improve the performance of the MobileNetV1 model from 15.7 ms to 9.9 ms, and that of the SSDLite MobileDet model from 56.1 ms to 32.7 ms.&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilf8m0CkxyFeQ7N-8XfsKx6dQjCdBxW1uJaOn2JrsAxAnNSZSLoiAlh-6Jw05edEoykz6U2PsuROOMOMi3-kGqpv-gqBiasERfcUnHOtGiWfQBQtDzhApd7lSU4gL83WkTW5Qzts32f8wPvg6DbZYeZNflL8HdDi9313PQJMR34D2r7Ku7fif2q9TpmLQ/s848/perf_evol.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;431&quot; data-original-width=&quot;848&quot; height=&quot;326&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilf8m0CkxyFeQ7N-8XfsKx6dQjCdBxW1uJaOn2JrsAxAnNSZSLoiAlh-6Jw05edEoykz6U2PsuROOMOMi3-kGqpv-gqBiasERfcUnHOtGiWfQBQtDzhApd7lSU4gL83WkTW5Qzts32f8wPvg6DbZYeZNflL8HdDi9313PQJMR34D2r7Ku7fif2q9TpmLQ/w640-h326/perf_evol.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;As shown in the graph above, we still have quite some room for improvement before we reach the performance of the proprietary driver, but we are getting close pretty fast. I also believe that we can tailor the driver to users&#39; needs to surpass the performance of the proprietary driver for specific models, as this is open-source and everybody can chip in, see how things are made and improve them.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;IRC channel&lt;/h3&gt;&lt;p&gt;I mentioned this in passing some time ago, but now that we have a driver at this level of usefulness, I think it is a good moment to point out again that we have an IRC channel in the OFTC network to discuss anything about doing accelerated machine learning on the edge with upstream open-source software: #ml-mainline. You can click &lt;a href=&quot;https://webchat.oftc.net/?channels=ml-mainline&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt; to join via a web interface, though I recommend setting up an account at &lt;a href=&quot;https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/&quot;&gt;matrix.org&lt;/a&gt;.&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;What next&lt;/h3&gt;&lt;p&gt;Should I continue working on performance? Enable more models for new use cases? Enable this driver on more SoCs (i.MX8MP and S905D3 look interesting)? 
Start writing a driver for a completely different IP, such as Rockchip&#39;s or Amlogic&#39;s?&lt;/p&gt;&lt;p&gt;I still haven&#39;t decided, so if you have an opinion please drop a comment in this blog, or at any of the social networks linked from this blog.&lt;/p&gt;&lt;p&gt;I&#39;m currently available for contracting, so I should be able to get on your project full-time on short notice.&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/2527511145589800731/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=2527511145589800731' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/2527511145589800731'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/2527511145589800731'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/02/etnaviv-npu-update-16-nice-performance.html' title=' Etnaviv NPU update 16: A nice performance jump'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilf8m0CkxyFeQ7N-8XfsKx6dQjCdBxW1uJaOn2JrsAxAnNSZSLoiAlh-6Jw05edEoykz6U2PsuROOMOMi3-kGqpv-gqBiasERfcUnHOtGiWfQBQtDzhApd7lSU4gL83WkTW5Qzts32f8wPvg6DbZYeZNflL8HdDi9313PQJMR34D2r7Ku7fif2q9TpmLQ/s72-w640-h326-c/perf_evol.png" height="72" width="72"/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-8436159336587732290</id><published>2024-01-24T11:52:00.000+01:00</published><updated>2024-01-24T11:52:46.494+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="verisilicon"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 15: We are upstream!</title><content type='html'>&lt;p&gt;Today the &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714&quot;&gt;initial merge request for Teflon&lt;/a&gt; was merged into Mesa, along with the first hardware driver, for &lt;a href=&quot;https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP&quot;&gt;VeriSilicon&#39;s Vivante NPU&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;For those who don&#39;t know, &lt;a href=&quot;https://docs.mesa3d.org/teflon.html&quot;&gt;Teflon&lt;/a&gt; is a &lt;a href=&quot;https://www.tensorflow.org/lite/performance/delegates&quot;&gt;TensorFlow Lite delegate&lt;/a&gt; that aims 
to support several &lt;a href=&quot;https://en.wikipedia.org/wiki/AI_accelerator&quot;&gt;AI accelerators&lt;/a&gt; (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the &lt;a href=&quot;https://en.wikipedia.org/wiki/MIT_License&quot;&gt;MIT license&lt;/a&gt;.&lt;br /&gt;&lt;/p&gt;&lt;p style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://gitlab.freedesktop.org/uploads/-/system/group/avatar/1155/gears.png&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;773&quot; data-original-width=&quot;773&quot; height=&quot;200&quot; src=&quot;https://gitlab.freedesktop.org/uploads/-/system/group/avatar/1155/gears.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt; &lt;br /&gt;&lt;/p&gt;&lt;p&gt;This will have the following advantages for the project:&lt;/p&gt;&lt;ol style=&quot;text-align: left;&quot;&gt;&lt;li&gt;The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto, when they update to the next stable version: 24.1.0, which should be out around May 2024. See the &lt;a href=&quot;https://docs.mesa3d.org/release-calendar.html&quot;&gt;release calendar&lt;/a&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Contribution to the project will happen within the &lt;a href=&quot;https://docs.mesa3d.org/submittingpatches.html&quot;&gt;development process of Mesa&lt;/a&gt;. This is a well-established process in which employees from companies such as Google, Valve, &lt;a href=&quot;https://docs.mesa3d.org/drivers/powervr.html&quot;&gt;Imagination&lt;/a&gt;, Intel, &lt;a href=&quot;https://docs.mesa3d.org/drivers/d3d12.html&quot;&gt;Microsoft&lt;/a&gt; and &lt;a href=&quot;https://docs.mesa3d.org/drivers/radv.html&quot;&gt;AMD&lt;/a&gt; work together on their GPU drivers.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The project has great technical infrastructure, maintained by awesome sysadmins:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;A well-maintained &lt;a href=&quot;https://gitlab.freedesktop.org/&quot;&gt;Gitlab instance&lt;/a&gt;,&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://docs.mesa3d.org/ci/index.html&quot;&gt;extensive CI&lt;/a&gt;, for both build and runtime testing, on real hardware,&lt;/li&gt;&lt;li&gt;mailing list, web server, etc.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;More importantly, the Mesa codebase has also infrastructure that will be very useful to NPU drivers:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;The &lt;a href=&quot;https://docs.mesa3d.org/nir/index.html&quot;&gt;NIR intermediate representation&lt;/a&gt; with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowerings.&lt;/li&gt;&lt;li&gt;The &lt;a href=&quot;https://docs.mesa3d.org/gallium/index.html&quot;&gt;Gallium internal API&lt;/a&gt; that decouples HW-specific frontends from HW-specific drivers. 
This will be critical as we add support for more NPUs, and also when we expose to other frameworks such as &lt;a href=&quot;https://developer.android.com/ndk/guides/neuralnetworks&quot;&gt;Android NNAPI&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: &lt;a href=&quot;https://www.x.org/wiki/Events/&quot;&gt;XDC&lt;/a&gt;.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;The story so far&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;In 2022, while still at &lt;a href=&quot;http://collabora.com/&quot;&gt;Collabora&lt;/a&gt;, I started adding OpenCL support to the &lt;a href=&quot;https://github.com/etnaviv/etna_viv#introduction&quot;&gt;Etnaviv&lt;/a&gt; driver in Mesa. Etnaviv is a userspace and kernel driver for &lt;a href=&quot;https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP&quot;&gt;VeriSilicon&#39;s Vivante NPUs&lt;/a&gt;.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was going to ever get close to the performance of the proprietary driver by using the programmable part of the NPU.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;I dug a bit deeper into how the proprietary driver was doing its thing and realized that almost all operations weren&#39;t running as shaders, but on &quot;fixed-function&quot; hardware units (&lt;a href=&quot;https://en.wikipedia.org/wiki/Systolic_array&quot;&gt;systolic arrays&lt;/a&gt;, as I realized later).&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since &lt;a href=&quot;https://arxiv.org/abs/1704.04760&quot;&gt;Google released their first TPU&lt;/a&gt;.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;With all this wealth of information and with the help of VeriSilicon&#39;s own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv project and others that I wrote, and by staring for long hours at all the produced data in spreadsheets.&lt;br /&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;During the summer and with &lt;a href=&quot;https://libre.computer/&quot;&gt;Libre Computer&lt;/a&gt;&#39;s sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;By &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/10/etnaviv-npu-update-9-we-got-there.html&quot;&gt;autumn&lt;/a&gt; I was able to run that same object classification model (&lt;a href=&quot;https://arxiv.org/abs/1704.04861&quot;&gt;MobileNet V1&lt;/a&gt;) 3 times faster than the CPU was able to. 
A &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/11/etnaviv-npu-update-11-now-twice-as-fast.html&quot;&gt;month later&lt;/a&gt; I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Afterwards I got to work on object detection models, and by the &lt;a href=&quot;https://blog.tomeuvizoso.net/2024/01/etnaviv-npu-update-14-object-detection.html&quot;&gt;start of 2024&lt;/a&gt; I managed to run &lt;a href=&quot;https://arxiv.org/abs/2004.14525&quot;&gt;SSDLite MobileDet&lt;/a&gt; at 56 milliseconds per inference, which is around 3 times slower than what the proprietary driver achieves, but still pretty darn useful in many situations!&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community.&lt;br /&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h3&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect myself to be spending time reviewing other people&#39;s contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP and I am certainly not going to be able to do it all!&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;There is quite some low-hanging fruit for improving performance, so I expect myself to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;And at some point I should start looking at other NPU IP to add support to. The ones I&#39;m currently leaning the most towards are Rockchip&#39;s own IP, Mediatek&#39;s, Cadence&#39;s and Amlogic&#39;s.&lt;br /&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Thanks&lt;/h3&gt;&lt;p&gt;One doesn&#39;t just write an NPU driver by oneself, much less without any documentation, so I need to thank the following people who have helped me greatly in this effort:&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://collabora.com/&quot;&gt;Collabora&lt;/a&gt; for allowing me to start playing with this while I still worked with them.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://libre.computer/&quot;&gt;Libre Computer&lt;/a&gt; and specifically Da Xue for supporting me financially for most of 2023. 
They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://www.igalia.com/&quot;&gt;Igalia&lt;/a&gt; for letting &lt;a href=&quot;https://christian-gmeiner.info/&quot;&gt;Christian Gmeiner&lt;/a&gt; spend time reviewing all my code and answering my questions about Etnaviv. &lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://embedded-recipes.org/&quot;&gt;Embedded Recipes&lt;/a&gt; for giving me the opportunity to present my work last autumn in Paris.&lt;/p&gt;&lt;/div&gt;&lt;div&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Lucas Stach from &lt;a href=&quot;https://www.pengutronix.de/en/index.html&quot;&gt;Pengutronix&lt;/a&gt; for answering my questions and listening to my problems when I suspected something in the Etnaviv kernel driver.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Neil Armstrong from &lt;a href=&quot;https://www.linaro.org/&quot;&gt;Linaro&lt;/a&gt; for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;And a collective thanks to the DRI/Mesa community for being so awesome!&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/8436159336587732290/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=8436159336587732290' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8436159336587732290'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8436159336587732290'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/01/etnaviv-npu-update-15-we-are-upstream.html' title=' Etnaviv NPU update 15: We are upstream!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-6524113164238186020</id><published>2024-01-10T12:14:00.004+01:00</published><updated>2024-01-10T12:14:56.646+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 14: Object detection with decent performance</title><content type='html'>&lt;p&gt;When almost two months ago I &lt;a 
href=&quot;https://blog.tomeuvizoso.net/2023/11/etnaviv-npu-update-11-now-twice-as-fast.html&quot;&gt;got MobileNetV1 running with useful performance&lt;/a&gt; on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach.&lt;/p&gt;&lt;p&gt;Partial because MobileNetV1 is quite an old model by now and since then several iterations have passed with better accuracy and better performance. Would I be able to, without any documentation, add enough support to run newer models with useful performance?&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Since then, I have been spending some time looking at the state of the art for object detection models, getting a sense of the gap between the features supported by my driver and the operations that the newer models use.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2004.14525&quot;&gt;SSDLite MobileDet&lt;/a&gt; is already 3 years old but can still be considered state-of-the-art on most hardware, with good accuracy while having a low latency.&lt;/p&gt;&lt;p&gt;The graph structure was more complex than that of MobileNet, and it used tensor addition operations which I didn&#39;t support at the moment. There are other operations that I didn&#39;t support, but those were at the end and could be performed in the CPU without much penalty.&lt;/p&gt;&lt;p&gt;So after implementing additions along with a few medium-sized refactorings, I got the model running correctly:&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsKGZYGx2ISm4TZobIq5OCov58aMRXLldRjrjM2dn0uUxuhChV1-gxt4wzLvEq1WZHe8pbdz4MtXML9oN2UCGvq2K_ncYuKkVnK4AG-_xrRGfARWv3kxBBvG20y5eWzFTWeZGazHFMIqaswvk1hl5kN-xArwD2TqjPj-iZxOPVMKzfx8PPbOagoSldJh0/s1536/test1.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;1024&quot; data-original-width=&quot;1536&quot; height=&quot;366&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsKGZYGx2ISm4TZobIq5OCov58aMRXLldRjrjM2dn0uUxuhChV1-gxt4wzLvEq1WZHe8pbdz4MtXML9oN2UCGvq2K_ncYuKkVnK4AG-_xrRGfARWv3kxBBvG20y5eWzFTWeZGazHFMIqaswvk1hl5kN-xArwD2TqjPj-iZxOPVMKzfx8PPbOagoSldJh0/w548-h366/test1.jpg&quot; width=&quot;548&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Performance wasn&#39;t that bad at that moment: at 129 ms it was twice as fast as the CPU and &quot;only&quot; 5 times slower than the proprietary driver.&lt;/p&gt;&lt;p&gt;I knew that I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters that the proprietary driver used to program the hardware.&lt;/p&gt;&lt;p&gt;After a lot of time spent staring at a spreadsheet I came up with a reasonable guess at what are the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149 ms, so almost 18 inferences can be performed per second.&lt;/p&gt;&lt;p&gt;If we look at a practical use case such as that supported by &lt;a href=&quot;https://frigate.video/&quot;&gt;Frigate NVR&lt;/a&gt;, a typical frame rate for the video inputs is 5 FPS. 
&lt;p&gt;With our current performance level, we could run 3-4 inferences on each frame when several objects are being tracked at the same time, or serve 3-4 cameras simultaneously when they are not.&lt;/p&gt;&lt;p&gt;Given the price level of the &lt;a href=&quot;https://libre.computer/products/aml-a311d-cc/&quot;&gt;single board computers that contain the VIPNano&lt;/a&gt;, this is quite a lot of bang for your buck. And all open source and heading to mainline!&lt;/p&gt;&lt;p&gt;&lt;b&gt;Next steps&lt;/b&gt;&lt;/p&gt;&lt;p&gt;I have started cleaning up the latest changes so they can be reviewed upstream. I also need to make sure that the in-flight kernel patches are merged now that the merge window for 6.8 has opened.&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/6524113164238186020/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=6524113164238186020' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6524113164238186020'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6524113164238186020'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/01/etnaviv-npu-update-14-object-detection.html' title=' Etnaviv NPU update 14: Object detection with decent performance'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsKGZYGx2ISm4TZobIq5OCov58aMRXLldRjrjM2dn0uUxuhChV1-gxt4wzLvEq1WZHe8pbdz4MtXML9oN2UCGvq2K_ncYuKkVnK4AG-_xrRGfARWv3kxBBvG20y5eWzFTWeZGazHFMIqaswvk1hl5kN-xArwD2TqjPj-iZxOPVMKzfx8PPbOagoSldJh0/s72-w548-h366-c/test1.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-239856422670257258</id><published>2023-12-21T09:16:00.002+01:00</published><updated>2023-12-21T09:16:49.906+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 13: Don&#39;t cross the tensors</title><content type='html'>&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a 
href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq1wKENtMzx01kGsnLXjmoCFGpyA67hSvWs1nAWXBftImNiTWD2dnfWaRWqhROBRcygMum9WfqZFp01ijApbVuwPWbXte4ds5pv2M_GyIcya_Ma0ZJJjoZIwrBk07X60PB7mB2Dp2r0NVtURa81yOHaOMNfS9Sr9avrF92NUfegfcqg5DiU7XAfAHUixQ/s389/1520238648692.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;209&quot; data-original-width=&quot;389&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq1wKENtMzx01kGsnLXjmoCFGpyA67hSvWs1nAWXBftImNiTWD2dnfWaRWqhROBRcygMum9WfqZFp01ijApbVuwPWbXte4ds5pv2M_GyIcya_Ma0ZJJjoZIwrBk07X60PB7mB2Dp2r0NVtURa81yOHaOMNfS9Sr9avrF92NUfegfcqg5DiU7XAfAHUixQ/s16000/1520238648692.jpg&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;&lt;span class=&quot;ILfuVd&quot; lang=&quot;en&quot;&gt;&lt;span class=&quot;hgKElc&quot;&gt;&lt;i&gt;&quot;Don&#39;t cross the streams. It would be bad.&quot;&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;IR refactorings &lt;br /&gt;&lt;/h4&gt;&lt;p&gt;A big part of what I have been up to in the past two weeks has been a
serious refactoring of the data structures that hold the model data in
the different phases until the HW configuration is generated.&lt;/p&gt;&lt;p&gt;What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations, and those are linked to each other non-sequentially.&lt;/p&gt;&lt;p&gt;The image below shows six of the more than a hundred operations in the SSDLite MobileDet model:&lt;br /&gt;&lt;/p&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8uT4oTOPviR6_aqbR0KFWycEcCxHBFoptasiS8nfb_2aiJ0XKNBE7BIVjFNBA46LPV204yMIBjrzPkJT_WyWc5k3HUcLLzzAMD9-NWei85UbmKHTgxHTHje8vEIdxQTfAEP9nk7HCWJEtgxpXU3CsrY1xykjiSa9QI35In5amVjFu7OGl8BmUA_j_oQQ/s888/mobiledet_add.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;888&quot; data-original-width=&quot;290&quot; height=&quot;640&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8uT4oTOPviR6_aqbR0KFWycEcCxHBFoptasiS8nfb_2aiJ0XKNBE7BIVjFNBA46LPV204yMIBjrzPkJT_WyWc5k3HUcLLzzAMD9-NWei85UbmKHTgxHTHje8vEIdxQTfAEP9nk7HCWJEtgxpXU3CsrY1xykjiSa9QI35In5amVjFu7OGl8BmUA_j_oQQ/w210-h640/mobiledet_add.png&quot; width=&quot;210&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;A small subsection of SSDLite MobileDet&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;The adds will be &quot;lowered&quot;, or converted to a special case of convolution in which the two input tensors are concatenated together as two channels of a single tensor, and the last convolution in the fragment will need to have its input tensor processed to remove the stride, as the HW doesn&#39;t support strides natively. The processing of this tensor will be performed in an additional job that will run in the TP (tensor processing) cores in the NPU.&lt;/p&gt;&lt;p&gt;As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations.&lt;/p&gt;&lt;p&gt;For now I have settled on having a separate data structure for the tensors, and having the operations refer to their input and output tensors by their indices in that list. In the future, I think we should move to intermediate representations more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model.&lt;/p&gt;
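&lt;p&gt;To make that arrangement a bit more concrete, here is a minimal sketch of the idea in Python (hypothetical names; the actual code is C inside Mesa): tensors live in one list, and operations refer to them only by index, which makes rewrites of the graph much cheaper.&lt;/p&gt;&lt;pre&gt;
from dataclasses import dataclass, field

@dataclass
class Tensor:
    dims: tuple                    # e.g. (1, 112, 112, 32)
    # quantization parameters, backing buffer, etc. would live here

@dataclass
class Operation:
    kind: str                                    # &#39;convolution&#39;, &#39;add&#39;, ...
    inputs: list = field(default_factory=list)   # indices into the tensor list
    outputs: list = field(default_factory=list)  # indices into the tensor list

tensors = [Tensor((1, 8, 8, 2)), Tensor((1, 8, 8, 2)), Tensor((1, 8, 8, 2))]
operations = [Operation(&#39;add&#39;, inputs=[0, 1], outputs=[2])]
&lt;/pre&gt;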
&lt;p&gt;I will be thinking about this later next year, once I get object detection with SSDLite MobileDet running at a useful performance level. Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of an IR, but if it turns out that operations on tensors aren&#39;t a good fit for NIR, then I will be thinking of doing something similar just for it.&lt;/p&gt;&lt;p&gt;For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high-level operations to GPGPU instructions, probably starting from a standard such as MLIR.&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;Tensor addition&lt;/h4&gt;&lt;p&gt;I also put some time into bringing together all the information I had gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of which parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration.&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;Lowering the strides&lt;/h4&gt;&lt;p&gt;Similarly to the above, there was a lot of staring at a spreadsheet to work out the parameters of the TP jobs that transform the input tensor of a convolution with a stride different from one.&lt;/p&gt;&lt;h4 style=&quot;text-align: left;&quot;&gt;Status and next steps &lt;br /&gt;&lt;/h4&gt;&lt;p&gt;Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection.&lt;/p&gt;&lt;p&gt;The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxIxl-0oWNOqrRirUSUkf7k5b_pYiudHW1aOxIdF5K2MULi1zPldgxEfr2lNi5aZQqfUJ7KpmHFLl6KpWpCC0wbfxDi47I4hswY-p-gfDLsoA68OZfD_9YjxyHqa1maSHXHL9WRKrVik_5haHpLUeRrPwJyeiBwkqAt7iyQxdd7nVrjQYhb-4Z0esauK0/s21360/ssdlite_mobiledet_coco_qat_postprocess.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;21360&quot; data-original-width=&quot;3392&quot; height=&quot;640&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxIxl-0oWNOqrRirUSUkf7k5b_pYiudHW1aOxIdF5K2MULi1zPldgxEfr2lNi5aZQqfUJ7KpmHFLl6KpWpCC0wbfxDi47I4hswY-p-gfDLsoA68OZfD_9YjxyHqa1maSHXHL9WRKrVik_5haHpLUeRrPwJyeiBwkqAt7iyQxdd7nVrjQYhb-4Z0esauK0/w102-h640/ssdlite_mobiledet_coco_qat_postprocess.png&quot; width=&quot;102&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;The whole of SSDLite MobileDet&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&amp;nbsp;&lt;p&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/239856422670257258/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=239856422670257258' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/239856422670257258'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/239856422670257258'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/12/etnaviv-npu-update-13-dont-cross-tensors.html' title=' Etnaviv NPU update 13: Don&#39;t cross the tensors'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq1wKENtMzx01kGsnLXjmoCFGpyA67hSvWs1nAWXBftImNiTWD2dnfWaRWqhROBRcygMum9WfqZFp01ijApbVuwPWbXte4ds5pv2M_GyIcya_Ma0ZJJjoZIwrBk07X60PB7mB2Dp2r0NVtURa81yOHaOMNfS9Sr9avrF92NUfegfcqg5DiU7XAfAHUixQ/s72-c/1520238648692.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-8772896522615830396</id><published>2023-12-06T11:21:00.001+01:00</published><updated>2023-12-06T11:22:57.008+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 12: Towards SSDLite MobileDet</title><content type='html'>&lt;p&gt;During the last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I&#39;m aiming at &lt;a href=&quot;https://arxiv.org/abs/2004.14525&quot;&gt;MobileDet&lt;/a&gt;, which, though a bit old by now (it was introduced in 2020), is still the state of the art in object detection on mobile, used for example in &lt;a href=&quot;https://frigate.video/&quot;&gt;Frigate NVR&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;I haven&#39;t mentioned it in a few updates, but all this work keeps being sponsored by &lt;a href=&quot;https://libre.computer/&quot;&gt;Libre Computer&lt;/a&gt;, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. 
Check out &lt;a href=&quot;https://libre.computer/products/aml-a311d-cc/&quot;&gt;Alta&lt;/a&gt; and &lt;a href=&quot;https://libre.computer/products/aml-s905d3-cc/&quot;&gt;Solitude&lt;/a&gt; for the first such boards on the market.&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://libre.computer/api/products/aml-a311d-cc/gallery/1.webp&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;704&quot; data-original-width=&quot;800&quot; height=&quot;282&quot; src=&quot;https://libre.computer/api/products/aml-a311d-cc/gallery/1.webp&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Upstreaming&lt;/h3&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;p&gt;Igalia&#39;s Christian Gmeiner has been giving me great feedback at the &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714&quot;&gt;merge request&lt;/a&gt;, and as part of that I &lt;a href=&quot;https://lore.kernel.org/lkml/20231116140910.1613508-1-tomeu@tomeuvizoso.net/T/#m3047ef1f33ee2ccdfeeaaa38bb8dfd0cfca95bab&quot;&gt;submitted a patch&lt;/a&gt; to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded.&amp;nbsp;&lt;/p&gt;&lt;p&gt;This means that upstreaming to Mesa loses some urgency, as we are in any case going to have to wait until the merge window for 6.8 opens, after 6.7 final is out.&lt;br /&gt;&lt;/p&gt;&lt;/div&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Convolutions with 5x5 weights&lt;/h3&gt;&lt;p&gt;Until now I had implemented support only for weights with dimensions 1x1 (aka &lt;a href=&quot;https://arxiv.org/abs/1712.05245&quot;&gt;pointwise convolutions&lt;/a&gt;) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors into the format that the hardware expects.&lt;/p&gt;&lt;p&gt;I implemented this for all kinds of supported convolutions: depthwise, strided, with padding, etc.&lt;br /&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Tensor addition&lt;/h3&gt;&lt;p&gt;I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.&lt;/p&gt;&lt;p&gt;This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.&lt;br /&gt;&lt;/p&gt;
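&lt;p&gt;The trick is easy to check in floating point with NumPy (a sketch of the idea only; the real thing has to craft the quantization scales into the weights and biases):&lt;/p&gt;&lt;pre&gt;
import numpy as np

c = 4                                      # channels in each input tensor
a = np.random.rand(8, 8, c)                # HWC layout
b = np.random.rand(8, 8, c)

stacked = np.concatenate([a, b], axis=2)   # one tensor with 2*c channels
weights = np.zeros((2 * c, c))             # a 1x1 convolution kernel
for ch in range(c):
    weights[ch, ch] = 1.0                  # take the channel from a...
    weights[c + ch, ch] = 1.0              # ...and add the one from b

out = np.tensordot(stacked, weights, axes=([2], [0]))
assert np.allclose(out, a + b)
&lt;/pre&gt;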
&lt;h3 style=&quot;text-align: left;&quot;&gt;Softmax pooling&lt;/h3&gt;&lt;p&gt;One more missing operation commonly used in models for mobile is pooling, in its different kinds: average, max, etc.&lt;/p&gt;&lt;p&gt;The blob implements these operations on the programmable core, with CL-like kernels.&lt;/p&gt;&lt;p&gt;So I dusted off the work that I did in the &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/04/a-long-overdue-update.html&quot;&gt;first half of 2023&lt;/a&gt; and added code to Teflon for passing these operations to the Gallium drivers. Then I added a new kind of operation to the ML backend in Etnaviv to make use of the programmable core.&lt;/p&gt;&lt;p&gt;Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.&lt;/p&gt;&lt;p&gt;With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.&lt;br /&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;A new test suite&lt;/h3&gt;&lt;p&gt;With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.&lt;/p&gt;&lt;p&gt;To get around that, I reimplemented the tests in C++ with &lt;a href=&quot;https://en.wikipedia.org/wiki/Google_Test&quot;&gt;GoogleTest&lt;/a&gt;, which is supported by Emma Anholt&#39;s &lt;a href=&quot;https://gitlab.freedesktop.org/anholt/deqp-runner&quot;&gt;deqp-runner&lt;/a&gt; and will allow me to run the tests in parallel, making full use of the CPU cores on the board.&lt;/p&gt;&lt;p&gt;That made a big difference, but with so many testing combinations being added (+3000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.&lt;/p&gt;&lt;p&gt;With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.&lt;/p&gt;
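&lt;p&gt;The caching scheme itself is nothing fancy; in spirit it is just the following (illustrative Python with hypothetical paths, the actual tests being C++/GoogleTest):&lt;/p&gt;&lt;pre&gt;
import hashlib, os
import numpy as np

CACHE_DIR = &#39;/nfs/teflon-golden&#39;      # hypothetical NFS mount

def golden_output(params, compute_reference):
    # Key the cache on the exact operation configuration under test.
    key = hashlib.sha256(repr(sorted(params.items())).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + &#39;.npy&#39;)
    if os.path.exists(path):
        return np.load(path)          # hit: skip the slow reference run
    golden = compute_reference(**params)
    np.save(path, golden)
    return golden
&lt;/pre&gt;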
&lt;p&gt;Except that I then started finding some tests that were failing randomly, especially when the cache of test results had already been brought into the filesystem cache on the board. After a lot of scratching my head, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.&lt;/p&gt;&lt;p&gt;There is a &lt;a href=&quot;https://elixir.bootlin.com/linux/v6.6.4/source/drivers/gpu/drm/etnaviv/etnaviv_sched.c#L16&quot;&gt;kernel module parameter&lt;/a&gt; to set the number of jobs that are submitted to the hardware at any given point, and setting that to 1 took me back to rock-solid test results, which is an absolute necessity for keeping the driver author&#39;s sanity.&lt;br /&gt;&lt;/p&gt;&lt;h3 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h3&gt;&lt;p&gt;I have quickly added support for a lot of new operations and parameter combinations, and the code is not as clean as I would like, in part due to the need for some refactoring.&lt;/p&gt;&lt;p&gt;So in the next days I will be investing some time in cleaning things up, and afterwards will move on to more operations in MobileDet.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/8772896522615830396/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=8772896522615830396' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8772896522615830396'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8772896522615830396'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/12/etnaviv-npu-update-12-towards-ssdlite.html' title=' Etnaviv NPU update 12: Towards SSDLite MobileDet'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-1719986941793663440</id><published>2023-11-17T08:46:00.001+01:00</published><updated>2023-12-06T09:03:01.270+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 11: Now twice as fast!</title><content type='html'>&lt;h1 style=&quot;text-align: left;&quot;&gt;Progress&lt;/h1&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&amp;nbsp;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;This update&#39;s highlight is that last week I finally got the TP jobs working, which allows us to perform the tensor manipulation in the HW, removing 18 ms from the tensor preprocessing. We can currently use them for transposing tensors from the format that TensorFlow prefers to the one the HW expects and the other way around, and for lowering strided convolutions to regular ones.&lt;br /&gt;&lt;/div&gt;
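&lt;p&gt;Conceptually the transposition is just an axis permutation; in NumPy terms it amounts to something like the following (assuming, for illustration, a plain NHWC-to-NCHW change; the tiled layout the Vivante cores actually want is more involved):&lt;/p&gt;&lt;pre&gt;
import numpy as np

nhwc = np.random.rand(1, 224, 224, 3)    # the layout TensorFlow Lite hands us

nchw = np.transpose(nhwc, (0, 3, 1, 2))  # towards what the HW expects
back = np.transpose(nchw, (0, 2, 3, 1))  # and back again for the outputs

assert np.array_equal(nhwc, back)
&lt;/pre&gt;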
&lt;div style=&quot;text-align: left;&quot;&gt;&amp;nbsp;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;This makes our image classification benchmark twice as fast, as expected:&lt;br /&gt;&lt;/div&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;tomeu@arm-64:~/mesa$ ETNA_MESA_DEBUG=ml_msgs python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so&lt;br /&gt;Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}&lt;br /&gt;&lt;b&gt;Running the NN job took 13 ms.&lt;/b&gt;&lt;br /&gt;0.866667: military uniform&lt;br /&gt;0.031373: Windsor tie&lt;br /&gt;0.015686: mortarboard&lt;br /&gt;0.007843: bow tie&lt;br /&gt;0.007843: academic gown&lt;br /&gt;&lt;b&gt;time: 15.650ms&lt;/b&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;div style=&quot;text-align: left;&quot;&gt;60 FPS is already quite interesting for many use cases, but the proprietary driver is able to do the same at around 8 ms, so there is still plenty of room for improvement.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&amp;nbsp;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Some preliminary testing indicates that enabling zero-run length compression in the weight buffers will make the biggest difference, so that is what I will be working on when I get back to performance work.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Additionally, I also got some experimental jobs running on the programmable core in this NPU, which will allow us to run more advanced models, which tend to use operations that didn&#39;t exist yet when the hardware was designed.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Upstreaming is going well; those interested can follow it here:&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&amp;nbsp;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714&quot;&gt;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714&lt;/a&gt;.&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&amp;nbsp;&lt;/div&gt;&lt;h1 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h1&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&amp;nbsp;&lt;/div&gt;&lt;p&gt;These will be my priorities during the next couple of weeks, in order:&lt;/p&gt;&lt;ol style=&quot;text-align: left;&quot;&gt;&lt;li&gt;Upstreaming&lt;/li&gt;&lt;li&gt;Get the Mobilenet SSD V1 model running on the HW, for object detection&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Performance&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/1719986941793663440/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=1719986941793663440' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1719986941793663440'/><link 
rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1719986941793663440'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/11/etnaviv-npu-update-11-now-twice-as-fast.html' title=' Etnaviv NPU update 11: Now twice as fast!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-5685163152487206629</id><published>2023-11-06T10:30:00.003+01:00</published><updated>2023-12-06T09:02:52.226+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 10: Upstreaming and TP jobs update</title><content type='html'>&lt;p&gt;If you remember the &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/10/etnaviv-npu-update-9-we-got-there.html&quot;&gt;last update&lt;/a&gt; two weeks ago, I got MobileNetV1 working with good performance, and I was planning to move to upstreaming my changes to the Linux kernel and &lt;a href=&quot;https://www.mesa3d.org/&quot;&gt;Mesa&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;One of the kernel patches is now queued for the 6.7 release of the Linux kernel, and the other one has just been resent for review.&lt;/p&gt;&lt;p&gt;Regarding Mesa, I have made several cleanups and have started getting great &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714&quot;&gt;review comments&lt;/a&gt; from &lt;a href=&quot;https://github.com/austriancoder&quot;&gt;Christian Gmeiner&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;While waiting for feedback, I have started work on using the TP cores for tensor manipulation, which should be many times faster than the naive code I was running on the CPU for this.&lt;/p&gt;&lt;p&gt;Got some jobs producing the correct results, but I&#39;m facing a problem with the GPU hanging right afterwards. Have already made a pass over the whole set of data that is sent to the HW (unit configuration, command stream and registers), but haven&#39;t found the problem yet. 
I will next improve the tooling around this and get a better view of the differences.&lt;/p&gt;&lt;p&gt;I hacked Mesa to use the out-of-tree driver and my code works that way, so it has to be something in the kernel driver.&lt;/p&gt;&lt;p&gt;During the next weeks I will keep incorporating feedback and see how I can fix the GPU hang on TP jobs.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/5685163152487206629/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=5685163152487206629' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5685163152487206629'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5685163152487206629'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/11/etnaviv-npu-update-10-upstreaming-and.html' title=' Etnaviv NPU update 10: Upstreaming and TP jobs update'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-5705381674930396395</id><published>2023-10-23T09:16:00.005+02:00</published><updated>2024-01-24T10:16:06.403+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 9: We got there!</title><content type='html'>&lt;h1 style=&quot;text-align: left;&quot;&gt;Progress&lt;/h1&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Since the last update I finally got the whole of MobileNetv1 running at full accuracy on the NPU with Mesa:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://coral.ai/static/docs/images/grace_hopper.bmp&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;606&quot; data-original-width=&quot;517&quot; height=&quot;200&quot; src=&quot;https://coral.ai/static/docs/images/grace_hopper.bmp&quot; width=&quot;171&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&lt;blockquote&gt;tomeu@arm-64:~/mesa$ python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so&lt;br /&gt;Loading external delegate 
from libteflon.so with args: {}&lt;br /&gt;Processing the input took &lt;b&gt;18 ms.&lt;/b&gt;&lt;br /&gt;Running the NN job took &lt;b&gt;13 ms.&lt;/b&gt;&lt;br /&gt;Processing the output took 1 ms.&lt;br /&gt;0.866667: military uniform&lt;br /&gt;0.031373: Windsor tie&lt;br /&gt;0.015686: mortarboard&lt;br /&gt;0.007843: bow tie&lt;br /&gt;0.007843: academic gown&lt;br /&gt;time: 33.094ms&lt;br /&gt;&lt;/blockquote&gt;&lt;/span&gt;That takes us to a performance level around 3 times faster than running the same inference on the CPUs on the A311D SoC.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Most of the time (18 ms.) is spent in my naive manipulation of the input tensor, transposing and reshuffling it to match what the HW expects. Once we learn to do these operations on the 4 tensor manipulation cores, this time should be brought close to zero.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;The 13 ms. that the convolutions take in the NPU is still noticeably higher than the 8 ms. that the blob achieves, but the optimizations mentioned in previous updates in this blog should bring us pretty close.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&amp;nbsp;&lt;/div&gt;&lt;h1 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h1&gt;&lt;p&gt;Now that we have something that people can use in their products, I will switch to upstreaming mode.&lt;/p&gt;&lt;p&gt;I want to do a few cleanups to the Mesa code and then I will ask for people to review and ack so it can be merged. In the meantime, the draft merge request can be found &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
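&lt;p&gt;For those who want to try it, loading the delegate from Python with the tflite_runtime API looks roughly like this (same file names as in the transcript above):&lt;/p&gt;&lt;pre&gt;
import tflite_runtime.interpreter as tflite

delegate = tflite.load_delegate(&#39;libteflon.so&#39;)
interpreter = tflite.Interpreter(
    model_path=&#39;mobilenet_v1_1.0_224_quant.tflite&#39;,
    experimental_delegates=[delegate])
interpreter.allocate_tensors()
# From here on, set_tensor/invoke/get_tensor run the delegated
# partitions on the NPU.
&lt;/pre&gt;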
&lt;p&gt;I would also like to have a CI job running to make sure it doesn&#39;t regress. But given that we don&#39;t use NIR as of yet and the dependencies on the rest of Mesa are minimal, there is probably little need as long as I&#39;m the only person contributing to the code.&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/5705381674930396395/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=5705381674930396395' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5705381674930396395'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5705381674930396395'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/10/etnaviv-npu-update-9-we-got-there.html' title=' Etnaviv NPU update 9: We got there!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-4260046790772956889</id><published>2023-10-06T17:16:00.001+02:00</published><updated>2023-12-06T09:02:37.001+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 8: Finally some inference</title><content type='html'>&lt;h1 style=&quot;text-align: left;&quot;&gt;Progress&lt;/h1&gt;&lt;p&gt;Last week I was a bit distracted with the trip to Paris for the Embedded Recipes conference, but afterwards I found some time for hacking and got some interesting results out of it.&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Refactored the Gallium front-end&lt;/h2&gt;&lt;p&gt;As mentioned in the &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-7-summer-is-over.html&quot;&gt;previous update&lt;/a&gt;, I had found some limits in my testing due to the naive way that the front-end was scheduling jobs to the Gallium hardware-dependent driver.&lt;/p&gt;&lt;p&gt;I got to basically rewrite it (removing any C++ remnants on the way) and moved to a model in which the drivers would compile the operation blocks that they support to a format that can be quickly sent to the hardware.&lt;/p&gt;&lt;p&gt;As a side effect, I got proper memory management of the workload, which allowed me to expand the testing I can do in a reasonable amount of time.&lt;/p&gt;&lt;p&gt;Also took the chance to rewrite the higher-level scheduling data structure so all jobs in the same model partition are sent to the hardware in a single 
batch, for decreased latency.&lt;/p&gt;&lt;p&gt;Unfortunately I didn&#39;t get to remove copies of input and output tensors because the TensorFlow Lite API for this (TfLiteAsyncKernel) is undocumented and far from trivial. They seem to just be adding stuff on top to abstract whatever the Android folks may end up wanting to do.&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Got MobileNet V1 to run&lt;/h2&gt;&lt;div style=&quot;text-align: left;&quot;&gt;As part of the refactoring from above, I got multiple operations in the same model to work, which got us to the point of correctly running some inferences, even if at low accuracy rates:&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://commons.wikimedia.org/w/index.php?curid=285598&quot; target=&quot;_blank&quot;&gt;&lt;span style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;480&quot; data-original-width=&quot;640&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQiQSHVRGw-EMpuIKA6jxXH-ss_HgutqwgUYXvCg4tPMRq9Js2q7l0NGILTcRlBqDfUOMhKNdzAALj1E8dPN2zxd6aOK59OeO9f5ac0vaWuaEvDEl_EQLu6rd-887qRrMH_7tgG4_oSubzgI2_GCvVD5ck6ukwErppZc1AQ5RawYqzrcB-mec905-jYpI/s320/hen.jpg&quot; width=&quot;320&quot; /&gt;&lt;/span&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;by Julien Langlois CC BY-SA 3.0&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;tomeu@arm-64:~/mesa$ LD_PRELOAD=libtensorflow_lite.so python3.10 class_device.py -i hen.bmp -m mobilenet_v1_0.25_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so&lt;/span&gt; &lt;br /&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}&lt;br /&gt;tflite_plugin_create_delegate&lt;br /&gt;Teflon delegate: loaded etnaviv driver&lt;br /&gt;INFO: Initialized TensorFlow Lite runtime.&lt;br /&gt;PrepareDelegate&lt;br /&gt;VERBOSE: Replacing 27 out of 31 node(s) with delegate (Teflon Delegate) node, yielding 2 partitions for the whole graph.&lt;br /&gt;&lt;b&gt;0.960784: hen&lt;/b&gt;&lt;br /&gt;0.015686: cock&lt;br /&gt;0.007843: goose&lt;br /&gt;0.003922: Pembroke&lt;br /&gt;0.003922: Ibizan hound&lt;br /&gt;time: 22.802ms&lt;br /&gt;tflite_plugin_destroy_delegate&lt;/span&gt;&lt;/blockquote&gt;&lt;p&gt;This matched the output from the blob bit by bit, even if I was doing some tensor operations by hand, on the CPU. That also caused it to run far too slowly. 
We should be able to get that down to around 5 ms once we learn how to drive the TP units for tensor manipulation.&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Presented this work at Embedded Recipes 2023&lt;/h2&gt;&lt;p&gt;Tired of only writing about all this in this blog, I took the chance given to me by Kevin Hilman to present it in front of a captive audience.&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5slClDUZ5VBIFkJpgfL4Ng92TjVjFAvulkPXBCw8kT_iEDdZN3ph8uTma65Cd7d6-5z4YxYmQZc2NqStG3RGhllCuL30lJVp1XukKiS2qZQUpcOYY-m5A3RXQ4KiUYeDfVZ122lWfUg1_yZMpyZf2bbaNERyfzC6W7U3oGhXcQgxwXV6DWUOy9t20ajo/s4000/F7LiLViWUAA4PaR.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;3000&quot; data-original-width=&quot;4000&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5slClDUZ5VBIFkJpgfL4Ng92TjVjFAvulkPXBCw8kT_iEDdZN3ph8uTma65Cd7d6-5z4YxYmQZc2NqStG3RGhllCuL30lJVp1XukKiS2qZQUpcOYY-m5A3RXQ4KiUYeDfVZ122lWfUg1_yZMpyZf2bbaNERyfzC6W7U3oGhXcQgxwXV6DWUOy9t20ajo/s320/F7LiLViWUAA4PaR.jpg&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;You can find the &lt;a href=&quot;https://embedded-recipes.org/2023/schedule/accelerated-ml-at-the-edge-with-mainline/&quot;&gt;slides here&lt;/a&gt;, and listen to the talk at:&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;iframe allowfullscreen=&quot;&quot; class=&quot;BLOG_video_class&quot; height=&quot;334&quot; src=&quot;https://www.youtube.com/live/s5_BZdljpqc?feature=shared&amp;t=2340&quot; width=&quot;560&quot; youtube-src-id=&quot;s5_BZdljpqc&quot;&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;h1 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h1&gt;&lt;p&gt;The &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-7-summer-is-over.html&quot;&gt;previous update&lt;/a&gt; went more in depth into what is left to do in the medium term, so I will just mention what I plan to do in the immediate future:&lt;/p&gt;&lt;ol style=&quot;text-align: left;&quot;&gt;&lt;li&gt;Get input and output channels working at the 512 level, so we can run a higher accuracy version of the MobileNet V1 network&lt;/li&gt;&lt;li&gt;Learn to use the TP units to remove those costly transpositions and reshuffles in the CPU (at this point, we would have something useful to people in the field)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Upstream changes to the Linux kernel&lt;/li&gt;&lt;li&gt;Propose Teflon to the Mesa folks&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/4260046790772956889/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=4260046790772956889' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/4260046790772956889'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/4260046790772956889'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/10/etnaviv-npu-update-8-finally-some.html' title=' Etnaviv NPU update 
8: Finally some inference'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQiQSHVRGw-EMpuIKA6jxXH-ss_HgutqwgUYXvCg4tPMRq9Js2q7l0NGILTcRlBqDfUOMhKNdzAALj1E8dPN2zxd6aOK59OeO9f5ac0vaWuaEvDEl_EQLu6rd-887qRrMH_7tgG4_oSubzgI2_GCvVD5ck6ukwErppZc1AQ5RawYqzrcB-mec905-jYpI/s72-c/hen.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-4728184777760523572</id><published>2023-09-26T13:37:00.007+02:00</published><updated>2023-12-06T09:03:10.483+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 7: Summer is over</title><content type='html'>&lt;h1 style=&quot;text-align: left;&quot;&gt;Progress&lt;/h1&gt;&lt;p style=&quot;text-align: left;&quot;&gt;With the kids back in school I have been able to work on the &lt;a href=&quot;https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP&quot;&gt;Vivante VIP NPU&lt;/a&gt; driver full-time during the two weeks after the &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-6-almost-there.html&quot;&gt;last update&lt;/a&gt;, with quite a bit of work coming out of the pipeline:&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Found the problem with enabling the 8th NN core&lt;/h2&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Though I don&#39;t know exactly what the problem is yet, I found that by going back to a &lt;a href=&quot;https://gitlab.freedesktop.org/tomeu/linux/-/commit/af365186ab305d2fa3e91145ac79d2569b9df2a5&quot;&gt;previous brute-force approach&lt;/a&gt; to powering up the NPU, the 8th core works just fine.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;For now this unblocks the work and gets me closer to the initial goal of running a MobileNetv1 inference and seeing what the performance is like, so I&#39;m leaving a proper fix for this for later.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;I bet there&#39;s either a register that is being written in the wrong order, or a delay between register writes that is too short. 
Will have to delve into the power domain subsystem and/or the common clock framework in the Linux kernel to fix this one.&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;/div&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Added support for depthwise convolutions&lt;/h2&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/1704.04861&quot;&gt;MobileNetV1&lt;/a&gt; introduced Separable Depthwise Convolutions (see the linked paper for an in-depth description), which are layers that contain a &lt;a href=&quot;https://paperswithcode.com/method/depthwise-convolution&quot;&gt;depthwise convolution&lt;/a&gt; to process each depth level separately, plus a &lt;a href=&quot;https://paperswithcode.com/method/pointwise-convolution&quot;&gt;pointwise convolution&lt;/a&gt; to rejoin them again. This offers the same result with 23x fewer multiplications, so it&#39;s very attractive for mobile use cases.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;This hardware doesn&#39;t support depthwise convolutions directly, but we can lower them to regular convolutions after modifying the weight tensor to cover each IFM/depth separately.&lt;br /&gt;&lt;/div&gt;
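&lt;p&gt;The lowering can be illustrated with a small NumPy toy (just the idea, ignoring quantization and the tiled memory layout): the depthwise kernel becomes a regular convolution weight tensor that is zero everywhere except where input and output channel match.&lt;/p&gt;&lt;pre&gt;
import numpy as np

c, k = 3, 3                            # channels and kernel size
dw = np.random.rand(k, k, c)           # one k x k filter per channel

full = np.zeros((k, k, c, c))          # regular conv weights: (k, k, in, out)
for ch in range(c):
    full[:, :, ch, ch] = dw[:, :, ch]  # each OFM only reads its own IFM
&lt;/pre&gt;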
&lt;h2 style=&quot;text-align: left;&quot;&gt;Added support for pointwise convolutions&lt;/h2&gt;&lt;div style=&quot;text-align: left;&quot;&gt;For the second half of a Separable Depthwise Convolution, I just had to take into account that 1x1 kernels are packed in a different format in memory, as otherwise it would be very inefficient for each NN core to pull each 1-byte kernel separately from the memory bus.&lt;br /&gt;&lt;/div&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Added support for unsigned weights&lt;/h2&gt;&lt;div style=&quot;text-align: left;&quot;&gt;TensorFlow Lite has moved towards implementing a new &lt;a href=&quot;https://www.tensorflow.org/lite/performance/quantization_spec#signed_integer_vs_unsigned_integer&quot;&gt;quantization specification&lt;/a&gt; which gives preference to signed weights because of convenience, as symmetric quantization is simpler to implement. Unfortunately for us, our hardware works natively with unsigned weights, so we would need to convert them if we were to use TFLite&#39;s new quantization.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;But the models that Google themselves publish make use of the ancient tooling that still supports the old, unsigned quantization scheme, so I had to find a way of producing models with unsigned quantization for our test suite, to match what MobileNetV1 does.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;That also implied moving to per-tensor quantization, instead of per-axis.&lt;br /&gt;&lt;/div&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Added support for higher IFMs and OFMs (up to 256 each)&lt;/h2&gt;&lt;div style=&quot;text-align: left;&quot;&gt;In the previous update I explained how support for multiple input and output channels (or feature maps) was added, but I wasn&#39;t able to test with more than 7 output channels because the 8th NN core was MIA.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;With that solved, I was able to see what would be needed for convolutions with higher channel counts, such as those that MobileNetV1 uses (32, 64, 128, 256, 512 and 1024).&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Each level implied revisiting the tiled format in which weights and biases are laid out in memory, making it more and more complex.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;I got to 256, with 512 and 1024 bringing more changes in the tiled format that I still need to reverse engineer.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;h1 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h1&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Model partition compilation and resource management&lt;br /&gt;&lt;/h2&gt;&lt;div style=&quot;text-align: left;&quot;&gt;I&#39;m facing problems with testing coverage, as we support so many different parameters that need to be tested in combination, with an explosion in the number of individual tests. Because of the hacky current state of the TFLite delegate (and Gallium state tracker) I&#39;m not able to run all the tests, because I don&#39;t have proper resource management implemented and so we reach OOM before the end.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;So my next task after I get back from &lt;a href=&quot;https://embedded-recipes.org/2023/&quot;&gt;Embedded Recipes&lt;/a&gt; will be to refactor the delegate implementation so we have a proper compilation of the model partitions. These will own the weight+bias buffers as well as the intermediate tensors, with each inference just feeding an input tensor to the partition and retrieving an output tensor at the end.&lt;/div&gt;
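&lt;p&gt;In pseudo-Python, the interface I am after is roughly this (illustrative only; the real code lives in the C delegate):&lt;/p&gt;&lt;pre&gt;
class CompiledPartition:
    def __init__(self, operations, weights, biases, compile_fn):
        # Compiled once: the partition owns the weight and bias buffers
        # as well as all the intermediate tensors it needs while running.
        self.jobs = compile_fn(operations, weights, biases)

    def run(self, input_tensor, submit_fn):
        # Per inference: feed one input tensor in, get one output
        # tensor back, with no allocations along the way.
        return submit_fn(self.jobs, input_tensor)
&lt;/pre&gt;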
&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;This will allow me to scale up the automated testing further, so I can keep adding new features with confidence, knowing that I&#39;m not adding regressions.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Move development to Cottonwood A311D board&lt;/h2&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Da Xue of &lt;a href=&quot;https://libre.computer/&quot;&gt;LibreComputer&lt;/a&gt; has got Etnaviv and Teflon working on the &lt;a href=&quot;https://hub.libre.computer/t/2023-09-25-libre-computer-aml-a311d-cc-alta-ai-sbc-announcement-pre/2905&quot;&gt;new boards&lt;/a&gt; that his company is releasing soon. One of them contains an A311D SoC, the same as the VIM3 I&#39;m currently using for development. I will initially target that one, and later make sure that it also works on the Cottonwood boards that will have the S905D3 SoC, which has a VIP Pico instead of a VIP Nano.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Besides being in general a great FOSS champion and specifically being supportive of ML inference with open source, Da is directly sponsoring this work, so I look forward to meeting him in Paris this week and exchanging notes.&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Bigger coefficient tensors&lt;/h2&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;The last known features missing before being able to run MobileNetV1 are IFMs and OFMs of 512 and 1024, each.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;Hopefully it will only require some further tweaking of the tiled memory representation of the coefficient buffer.&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Medium term goals&lt;/h2&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;I don&#39;t expect performance to be that great yet, so I plan on switching the focus to it after the above has been accomplished. 
I expect the features below to make the biggest impact on performance:&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;ol style=&quot;text-align: left;&quot;&gt;&lt;li&gt;Avoid copies in and out of the model partition, by mapping user buffers to the NPU&lt;/li&gt;&lt;li&gt;Use the TP units for tensor manipulation (transposing, mostly)&lt;/li&gt;&lt;li&gt;Properly configure the automatic caching of kernels and images in the internal on-chip SRAM&lt;/li&gt;&lt;li&gt;Use the external SRAM for intermediate tensor data&lt;/li&gt;&lt;li&gt;Chain all TP and NN jobs in a model partition in the same command stream&lt;/li&gt;&lt;li&gt;Enable zero-run-length compression in the coefficient buffer&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Tune the tiling parameters for reduced memory bandwidth usage&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/4728184777760523572/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=4728184777760523572' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/4728184777760523572'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/4728184777760523572'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-7-summer-is-over.html' title=' Etnaviv NPU update 7: Summer is over'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-1682105245552353397</id><published>2023-09-07T18:19:00.007+02:00</published><updated>2023-12-06T09:03:19.411+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 6: Almost there!</title><content type='html'>&lt;h2 style=&quot;text-align: left;&quot;&gt;Progress&lt;/h2&gt;&lt;p&gt;This week started quite fruitfully; these features were added:&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;Convolutions with multiple input and output channels (input and output feature maps)&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://keras.io/api/layers/convolution_layers/convolution2d/&quot;&gt;&quot;Same&quot;&lt;/a&gt; padding in convolutions (sketched below)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;And with this we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.&lt;/p&gt;
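&lt;p&gt;As a reminder, this is roughly how the &quot;same&quot; padding amounts are computed in TFLite-style frameworks (a generic sketch, not the driver code):&lt;/p&gt;&lt;pre&gt;
import math

def same_padding(in_size, kernel, stride):
    # SAME padding: the output covers ceil(in_size / stride) positions,
    # and whatever padding that requires is split between both sides,
    # with the odd pixel going to the end.
    out_size = math.ceil(in_size / stride)
    pad = max((out_size - 1) * stride + kernel - in_size, 0)
    return pad // 2, pad - pad // 2

# e.g. same_padding(8, 3, 2) == (0, 1)
&lt;/pre&gt;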
&lt;h2 style=&quot;text-align: left;&quot;&gt;One more roadblock &lt;br /&gt;&lt;/h2&gt;&lt;p&gt;Unfortunately, the NPU hangs when I try to use the 8th core... and this is required to run most detection models, as they start by convolving the input to 32 feature maps. &lt;br /&gt;&lt;/p&gt;&lt;p&gt;I have checked that we are sending bit-identical command streams and input buffers to the kernel, so I suspect the problem lies somewhere in the kernel itself.&lt;/p&gt;&lt;p&gt;So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Want to try it out?&lt;/h2&gt;&lt;p&gt;I&#39;m not really looking forward to such work, so I decided to first invest some time cleaning things up a bit, to make it easier for other people to play with this if they wish.&lt;/p&gt;&lt;p&gt;I have removed from my branch everything from my previous attempt at using OpenCL, and have written some documentation about how to run the TensorFlow Lite delegate:&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst&quot;&gt;https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst&lt;/a&gt;&lt;/p&gt;&lt;p&gt;You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.&lt;/p&gt;
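&lt;p&gt;Once built, loading the delegate from Python should look roughly like this (the library name is an assumption of mine here; the docs above have the exact instructions):&lt;/p&gt;&lt;pre&gt;
import tflite_runtime.interpreter as tflite

delegate = tflite.load_delegate(&#39;libteflon.so&#39;)
interpreter = tflite.Interpreter(
    model_path=&#39;mobilenet_v1_1.0_224_quant.tflite&#39;,
    experimental_delegates=[delegate])
interpreter.allocate_tensors()
&lt;/pre&gt;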
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/1682105245552353397/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=1682105245552353397' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1682105245552353397'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1682105245552353397'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-6-almost-there.html' title=' Etnaviv NPU update 6: Almost there!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-3023393172299765340</id><published>2023-08-24T12:45:00.000+02:00</published><updated>2023-08-24T12:45:36.117+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>Etnaviv NPU update 5: Harder convolutions!</title><content type='html'>&lt;h2 style=&quot;text-align: left;&quot;&gt;Progress &lt;br /&gt;&lt;/h2&gt;&lt;p&gt;Managed to squeeze in some time between holidays to hack on the NPU driver, and got something out of it.&lt;/p&gt;&lt;p&gt;Since the &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/08/etnaviv-npu-update-4-its-convoluting.html&quot;&gt;last update&lt;/a&gt; I have:&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;implemented support for strided convolutions with more than one input channel, and&lt;/li&gt;&lt;li&gt;implemented support for more than one output channel, though for now only with a single input channel.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Next steps are to support convolutions with multiple input and output channels, and padding. Then see what is still missing so we can run MobileNet v1 and check the performance when using the NN units and doing the rest on the CPU.&lt;/p&gt;&lt;p&gt;As a reminder, I&#39;m pushing all the code to this branch: &lt;a href=&quot;https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/&quot;&gt;https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/&lt;/a&gt;.&lt;br /&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;IRC channel&lt;/h2&gt;&lt;p&gt;A bunch of us have started to gather in the #ml-mainline IRC channel on OFTC to discuss doing accelerated ML with mainline, on embedded devices.&lt;/p&gt;&lt;p&gt;For those of you that may not have an IRC bouncer set up yet, you can easily join with the &lt;a href=&quot;https://webchat.oftc.net/&quot;&gt;web chat UI&lt;/a&gt;, but in case others aren&#39;t in front of the keyboard when you type your question, I recommend using element.io with the Matrix IRC bridge:&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/&quot;&gt;https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/&lt;/a&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Embedded recipes&lt;/h2&gt;&lt;p&gt;I have been invited to give a talk about all this ML-with-mainline effort at &lt;a href=&quot;https://embedded-recipes.org/2023/&quot;&gt;Embedded Recipes 2023&lt;/a&gt;, Paris, 28-29 September. Slides and a recording will be published after the conference ends.&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Sponsor&lt;/h2&gt;&lt;p&gt;Last but not least, if I am able to invest so much effort in this, it is because the folks at &lt;a href=&quot;https://libre.computer/&quot;&gt;LibreComputer&lt;/a&gt; have been supporting me financially these last couple of months.&lt;/p&gt;&lt;p&gt;Thanks to &lt;a href=&quot;https://twitter.com/librecomputer&quot;&gt;Da Xue&lt;/a&gt; for his support, it is greatly appreciated! 
It is awesome to see SBC vendors investing in the Linux upstream ecosystem.&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/3023393172299765340/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=3023393172299765340' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/3023393172299765340'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/3023393172299765340'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/08/etnaviv-npu-update-5-harder-convolutions.html' title='Etnaviv NPU update 5: Harder convolutions!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-745522949949199487</id><published>2023-08-07T18:52:00.002+02:00</published><updated>2023-12-06T09:02:13.799+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 4: It&#39;s convoluting! 
</title><content type='html'>&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/06/etnaviv-npu-update-3-deeper-into.html&quot;&gt;last update&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;TL;DR&lt;/span&gt;&lt;/h2&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;The issue with producing the output at the right scale is solved now, and simple convolution operations are working just fine.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;The test workloads are running fast and stably now, so I now feel I have pretty solid ground beneath my feet.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;There are three features left before I can run a real, full-fledged, commercially interesting model:&lt;/span&gt;&lt;/p&gt;&lt;ol style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;3D inputs for strided convolutions&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Multiple output channels&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Padded convolutions&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Re-quantization&lt;/span&gt;&lt;/h2&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;The last update in this blog left off at my attempt at figuring out how the convolution raw outputs had to be processed with fields called post_shift and post_multiplier so I could get the right values in the final output.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;After spending more time than I probably should have in a spreadsheet trying to find correlations, some desperate googling brought me to some research papers about optimizing quantization operations on integer-only hardware:&lt;/span&gt;&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/2106.00127.pdf&quot;&gt;Integer-Only Neural Network Quantization Scheme Based on Shift-Batch-Normalization&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;&lt;a href=&quot;https://arxiv.org/pdf/1712.05877.pdf&quot;&gt;Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;That explains the meaning of the shift and multiplier, as these are the operations we can use to approximate the floating point division on integer hardware.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.&lt;/span&gt;&lt;/p&gt;
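&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;In a nutshell, the scheme in those papers boils down to the sketch below. The bit widths are made up for illustration (the HW splits post_multiplier and post_shift across several fields), so take this as the textbook algorithm rather than exactly what the blob computes:&lt;/span&gt;&lt;/p&gt;&lt;pre&gt;
import math

def quantize_scale(scale):
    # Decompose a float rescale factor into an integer multiplier plus a
    # right shift: scale = significand * 2**exponent, 0.5 &amp;lt;= significand &amp;lt; 1.
    significand, exponent = math.frexp(scale)
    multiplier = round(significand * (1 &amp;lt;&amp;lt; 15))
    return multiplier, 15 - exponent

def requantize(acc, multiplier, shift, out_zero_point):
    # acc * scale is approximated as (acc * multiplier) &amp;gt;&amp;gt; shift, rounded
    # to nearest, then offset and clamped to the uint8 output range.
    rounded = (acc * multiplier + (1 &amp;lt;&amp;lt; (shift - 1))) &amp;gt;&amp;gt; shift
    return min(255, max(0, rounded + out_zero_point))
&lt;/pre&gt;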
&lt;h2 style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;3D input tensor&lt;/span&gt;&lt;/h2&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;This was pretty much straightforward, as it was basically a matter of updating the code to take the added dimension into account, and also reordering the tensor elements, as the hardware expects depth-first order.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel&#39;s GPL driver.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:&lt;/span&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt&lt;br /&gt;--- /home/tomeu/mesa.txt&amp;nbsp;&amp;nbsp;&amp;nbsp; 2023-08-07 18:28:29.939750225 +0200&lt;br /&gt;+++ /home/tomeu/galcore.txt&amp;nbsp;&amp;nbsp;&amp;nbsp; 2023-08-07 18:28:42.116625362 +0200&lt;br /&gt;@@ -1,176 +1,273 @@&lt;br /&gt;&amp;nbsp;{&lt;br /&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */&lt;br /&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000011, /*&amp;nbsp;&amp;nbsp; PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */&lt;br /&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */&lt;br /&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000002, /*&amp;nbsp;&amp;nbsp; GL.API_MODE := OPENCL */&lt;br /&gt;+&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /* UNKNOWN (0) */&lt;br /&gt;+&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /*&amp;nbsp; */&lt;br /&gt;+&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /* UNKNOWN (0) */&lt;br /&gt;+&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /*&amp;nbsp; */&lt;br /&gt;+&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /* UNKNOWN (0) */&lt;br /&gt;+&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /*&amp;nbsp; */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /* UNKNOWN (0) 
*/&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /*&amp;nbsp; */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /*&amp;nbsp;&amp;nbsp; GL.OCB_REMAP_START := 0x0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /*&amp;nbsp;&amp;nbsp; GL.OCB_REMAP_END := 0x0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000010, /*&amp;nbsp;&amp;nbsp; GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */&lt;br /&gt;-&amp;nbsp;&amp;nbsp;&amp;nbsp; 0xffff3000, /*&amp;nbsp;&amp;nbsp; PS.NN_INST_ADDR := *0xffff3000 */&lt;br /&gt;+&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x3348e780, /*&amp;nbsp;&amp;nbsp; PS.NN_INST_ADDR := *0x3348e780 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /*&amp;nbsp;&amp;nbsp; 0x010A4 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000c23, /*&amp;nbsp;&amp;nbsp; GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000c23, /*&amp;nbsp;&amp;nbsp; GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /* UNKNOWN (0) */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0x00000000, /*&amp;nbsp; */&lt;br /&gt;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;map-&amp;gt;layer_type = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;no_z_offset = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_xy_size = 0x2;&amp;nbsp; /* (2) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_z_size = 0x4;&amp;nbsp; /* (4) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernels_per_core = 0x1;&amp;nbsp; /* (1) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;pooling = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;pooling_xy_size = 0x1;&amp;nbsp; /* (1) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;prelu = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;nn_layer_flush = 0x1;&amp;nbsp; /* (1) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_data_type = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_data_type = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_data_type = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_x_size = 0x4;&amp;nbsp; /* (4) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_y_size = 0x4;&amp;nbsp; /* (4) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_x_offset = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_y_offset = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused0 = 
0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;brick_mode = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;brick_distance = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;relu = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused1 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;post_multiplier = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;post_shift = 0x17;&amp;nbsp; /* (23) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused2 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;no_flush = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused3 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_x_size = 0x3;&amp;nbsp; /* (3) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_y_size = 0x3;&amp;nbsp; /* (3) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_z_size = 0x1;&amp;nbsp; /* (1) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;rounding_mode = 0x1;&amp;nbsp; /* (1) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_x_offset_bit_3 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_y_offset_bit_3 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_tile_x_size = 0x3;&amp;nbsp; /* (3) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_tile_y_size = 0x3;&amp;nbsp; /* (3) */&lt;br /&gt;-map-&amp;gt;kernel_address = 0x3fffd00;&amp;nbsp; /* (67108096) */&lt;br /&gt;+map-&amp;gt;kernel_address = 0xcd237f;&amp;nbsp; /* (13443967) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_z_size2 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;-map-&amp;gt;in_image_address = 0xffff6000;&amp;nbsp; /* (4294926336) */&lt;br /&gt;-map-&amp;gt;out_image_address = 0xffff7000;&amp;nbsp; /* (4294930432) */&lt;br /&gt;+map-&amp;gt;in_image_address = 0x3348e240;&amp;nbsp; /* (860414528) */&lt;br /&gt;+map-&amp;gt;out_image_address = 0x89ffc500;&amp;nbsp; /* (2315240704) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;image_caching_mode = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_caching_mode = 0x1;&amp;nbsp; /* (1) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;partial_cache_data_unit = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_pattern_msb = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_y_size = 0x2;&amp;nbsp; /* (2) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_y_stride = 0x3;&amp;nbsp; /* (3) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_pattern_low = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_pattern_high = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_cache_start_address = 0x800;&amp;nbsp; /* (2048) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_cache_end_address = 0xa00;&amp;nbsp; /* (2560) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;image_start_address = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;image_end_address = 0x800;&amp;nbsp; /* (2048) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_border_mode = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_border_const = 0x7d;&amp;nbsp; /* (125) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused4 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_data_type_bit_2 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_data_type_bit_2 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_data_type_bit_2 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;post_multiplier_1_to_6 = 0x1f;&amp;nbsp; /* (31) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;post_shift_bit_5_6 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused5 = 0x0;&amp;nbsp; 
/* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_x_stride = 0x4;&amp;nbsp; /* (4) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_y_stride = 0x4;&amp;nbsp; /* (4) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_x_stride = 0x3;&amp;nbsp; /* (3) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused6 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;post_multiplier_7_to_14 = 0x61;&amp;nbsp; /* (97) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_circular_buf_size = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused7 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;per_channel_post_mul = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_image_circular_buf_end_addr_plus_1 = 0x3ffffff;&amp;nbsp; /* (67108863) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused8 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_circular_buf_size = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused9 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;in_image_circular_buf_end_addr_plus_1 = 0x3ffffff;&amp;nbsp; /* (67108863) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused10 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;coef_zero_point = 0x80;&amp;nbsp; /* (128) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;out_zero_point = 0x77;&amp;nbsp; /* (119) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;kernel_direct_stream_from_VIP_sram = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;depthwise = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused11 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused12 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused13 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused14 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused15 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;unused16 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;further1 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;further2 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;further3 = 0x3ffffff;&amp;nbsp; /* (67108863) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;further4 = 0x7f800000;&amp;nbsp; /* (2139095040) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;further5 = 0xff800000;&amp;nbsp; /* (4286578688) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;further6 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;further7 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;map-&amp;gt;further8 = 0x0;&amp;nbsp; /* (0) */&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br 
/&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br 
/&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x00, 0x00, 0x00, 0x00&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0xf7, 0x64, 0xc3, 0xca&lt;br /&gt;&amp;nbsp;&amp;nbsp; 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77&lt;/span&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;This corresponds to a convolution with the following parameters:&lt;/span&gt;&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;8x8x1 input tensor&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;3x3x1 weight tensor&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;stride == 2&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;The differences are due to different addresses being allocated between runs, and to how Mesa&#39;s code is structured, but that shouldn&#39;t affect the end result.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves, and then the buffers for the weights, input and output.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;When running a convolution configuration that isn&#39;t yet supported, we will spot more differences and hopefully will be able to figure out the logic behind them.&lt;/span&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Strided convolutions&lt;/span&gt;&lt;/h2&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;The hardware doesn&#39;t really support strided convolutions, so these are &quot;lowered&quot; to 1-stride convolutions with added channels (sketched below), as per this research paper:&lt;/span&gt;&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;&lt;a href=&quot;https://www.arxiv-vanity.com/papers/1712.02502/&quot; style=&quot;font-family: inherit;&quot;&gt;Take it in your stride: Do we need striding in CNNs?&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
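&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;The input-side half of that lowering can be sketched with NumPy as below (assuming sizes divisible by the stride, and omitting the matching rearrangement of the weights). Note how an 8x8x1 input with stride 2 becomes exactly the 4x4x4 input that appears in the dump above:&lt;/span&gt;&lt;/p&gt;&lt;pre&gt;
import numpy as np

def lower_strided_input(image, stride):
    # Fold each stride x stride neighborhood into the channel axis, so a
    # stride-s convolution becomes a 1-stride convolution on a smaller,
    # deeper image (the space-to-channels trick from the paper).
    h, w, c = image.shape
    folded = image.reshape(h // stride, stride, w // stride, stride, c)
    folded = folded.transpose(0, 2, 1, 3, 4)
    return folded.reshape(h // stride, w // stride, stride * stride * c)
&lt;/pre&gt;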
&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;By implementing the algorithm in the paper, we match the behavior of the blob, as with requantization. The paper only covers 2D input tensors, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind it.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Test suite&lt;/span&gt;&lt;/h2&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;I wrote a simple pytest module that generates a TFLite model with a single convolution operation, varying the parameters and payloads across the different combinations that we support.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
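&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;Stripped down, it looks like the following (the comparison helper at the end is hypothetical, standing in for running the model through the delegate and checking against TFLite&#39;s CPU reference):&lt;/span&gt;&lt;/p&gt;&lt;pre&gt;
import pytest
import tensorflow as tf

@pytest.mark.parametrize(&#39;channels&#39;, [1, 2, 4])
@pytest.mark.parametrize(&#39;stride&#39;, [1, 2])
def test_conv2d(channels, stride):
    # Generate a model containing a single convolution operation.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(channels, 3, strides=stride,
                               input_shape=(8, 8, 1)),
    ])
    tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
    run_on_npu_and_compare(tflite_model)  # hypothetical helper
&lt;/pre&gt;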
&lt;p&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;At some point I will add a CI job, probably before sending the initial merge request.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/745522949949199487/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=745522949949199487' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/745522949949199487'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/745522949949199487'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/08/etnaviv-npu-update-4-its-convoluting.html' title=' Etnaviv NPU update 4: It&#39;s convoluting! '/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-8257480099972567574</id><published>2023-06-26T08:46:00.003+02:00</published><updated>2023-12-06T09:01:38.692+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>Etnaviv NPU update 3: Deeper into the convolution units</title><content type='html'>&lt;p&gt;What two weeks!&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Programming of the convolution units&lt;/h2&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Picking up from where I left off at the &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/06/etnaviv-npu-update-2-diving-into.html&quot;&gt;last update&lt;/a&gt;, I made progress in understanding the format of the buffer that contains the weights and biases.&lt;br /&gt;&lt;/p&gt;&lt;/div&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;p style=&quot;text-align: left;&quot;&gt;The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. Knowing that also helped in guessing what some fields in the parameter structure are for.&lt;br /&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;With that, I was able to correctly run a convolution on a small matrix with arbitrary weights and biases.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is at the desired scale. There are a series of fields that influence how the output values are processed before being placed in the output buffer, and I don&#39;t really know how they work yet. They are called post_shift and post_mult, and the first correlates moderately (r=0.78) with the quantization scale of the output. I know that the post_shift field shifts to the right, as its name says, but to understand what value I need in each situation I feel I need to understand better how the hardware works and what the values could be at the end of the convolution, before the output unit. I will be reading a bunch of research papers about NN-accelerating silicon in the summer.&lt;br /&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;That said, replacing the OpenCL kernels in TensorFlow Lite&#39;s GPU delegate that do convolutions with the fixed units turned out to be a worse idea than I initially thought. 
This is because that delegate is completely oriented towards float-first hardware such as GPUs, while this accelerator is integer-only.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize at the end, to match the desired input and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so we would have to quantize right after TFLite&#39;s dequantization and vice versa. And the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance in a bandwidth-constrained device, as all embedded boards are, so I had to go back to the drawing board.&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;A new Gallium frontend: Teflon&lt;/h2&gt;&lt;p style=&quot;text-align: left;&quot;&gt;If TF Lite&#39;s GPU delegate is such a bad match for this HW, what can we do to run inferences with reasonable speeds? The same as VeriSilicon did: write our own delegate:&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;&lt;a href=&quot;https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/&quot;&gt;https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;TF Lite&#39;s operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but &quot;just&quot; translate the description of the operation to the HW configuration.&lt;/p&gt;
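&lt;p style=&quot;text-align: left;&quot;&gt;Purely as an illustration of what that translation means, here is a made-up mapping for a convolution; the field names are from the parameter structure in the last update, everything else is hypothetical:&lt;/p&gt;&lt;pre&gt;
def conv2d_to_nn_config(op):
    # &#39;op&#39; stands in for the parsed TFLite convolution description.
    return {
        &#39;op_type&#39;: 0,                        # convolution
        &#39;kernel_x_size&#39;: op.weights.shape[2],
        &#39;kernel_y_size&#39;: op.weights.shape[1],
        &#39;kernel_z_size&#39;: op.weights.shape[3],
        &#39;in_image_x_size&#39;: op.input.shape[2],
        &#39;in_image_y_size&#39;: op.input.shape[1],
        &#39;out_image_x_size&#39;: op.output.shape[2],
        &#39;out_image_y_size&#39;: op.output.shape[1],
        &#39;coef_zero_point&#39;: op.weights.zero_point,
        &#39;out_zero_point&#39;: op.output.zero_point,
        &#39;enable_relu&#39;: 1 if op.fused_relu else 0,
    }
&lt;/pre&gt;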
&lt;p style=&quot;text-align: left;&quot;&gt;Of course, there is no HW with fixed-function units that accelerate all the operations built into TF Lite, or even all those that the most commonly used models contain. VeriSilicon&#39;s delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;But we want to avoid getting into the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW, so we can have a simple delegate that implements the operations supported by the HW, while the rest execute in other delegates based on their capabilities.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing, but that is something we can probably improve in the TF Lite code base, as the kernel already has all the mechanisms for efficient synchronization and data sharing.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Another possibility the TF Lite delegate mechanism gives us is offloading the operations we don&#39;t support to a different delegate that can accelerate them. For example, in the case of a board with an Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on its Mali GPU, via the OpenCL driver that Alyssa is writing in Mesa.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;And if that is still slower than with the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software: we can address the needs we have ourselves and, importantly, do it by pooling work with others!&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Because this frontend is implemented in terms of Gallium, we leverage the infrastructure there for memory management, synchronization and execution. I think this will work well for adding support for other NN engines, such as those from Rockchip, Cadence, Mediatek, etc.&lt;br /&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h2&gt;&lt;p style=&quot;text-align: left;&quot;&gt;I need to crack the nut of the post-processing of the raw output so it is at the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z &amp;gt; 1).&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;After that I don&#39;t see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to take a look at using the SRAM to reduce bandwidth and memory access latency. That is still some way off though, and the summer is just starting!&lt;br /&gt;&lt;/p&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/8257480099972567574/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=8257480099972567574' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8257480099972567574'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8257480099972567574'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/06/etnaviv-npu-update-3-deeper-into.html' title='Etnaviv NPU update 3: Deeper into the convolution units'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-956184837449237805</id><published>2023-06-10T14:14:00.001+02:00</published><updated>2023-12-06T09:01:28.578+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>Etnaviv NPU update 2: Diving into the convolution units</title><content type='html'>&lt;p&gt;In the &lt;a href=&quot;https://blog.tomeuvizoso.net/2023/05/etnaviv-npu-update-1-planning-for.html&quot;&gt;previous update&lt;/a&gt; I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the 
CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.&lt;/p&gt;&lt;p&gt;Most of the work is done by the convolution units that VeriSilicon calls NN cores, so this is what I have been focusing on at this stage. I think that even if we still do all tensor transformation on the programmable core, by using the NN units we could already achieve usable performance.&lt;/p&gt;&lt;p&gt;By looking around in the ioctls that VeriSilicon&#39;s userspace stack sends to the kernel, it was clear that in the NN jobs there was little more than a pointer to a structure that configures the NN fixed-function units. Luckily I didn&#39;t need to reverse engineer it from scratch, as VeriSilicon&#39;s out-of-tree kernel driver is GPL and contains two instances of &lt;a href=&quot;https://github.com/TierMobility/linux/blob/242f3e8c8502ff8e818028f8b9fd9894e0feef2e/drivers/mxc/gpu-viv/hal/kernel/arch/gc_hal_kernel_hardware_func_flop_reset.c#L4751&quot;&gt;programming this HW&lt;/a&gt; with a trivial job (a 2x2x1 kernel with a single bias value).&lt;br /&gt;&lt;/p&gt;&lt;p&gt;It took some boring work to translate what the code does into a C struct, but this was the initial one:&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;struct etna_nn_params {&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t no_z_offset : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_x_size : 4;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_z_size : 14; /* &amp;amp; 0x3FFF */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernels_per_core : 7;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero1 : 2;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero2 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero3 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t nn_layer_flush : 1;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t in_image_x_size : 13;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t in_image_y_size : 13;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero4 : 3;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero5 : 3;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused0 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero6 : 16;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero7 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t enable_relu : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero9 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t post_shift : 6;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused1 : 2;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero10 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero11 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused2 : 2;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_x_size : 13;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_y_size : 13;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_z_size : 14;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero12 : 2; /* 0x0 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero13 : 1; /* (0 &amp;gt;&amp;gt; 3) &amp;amp; 0x1 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero14 : 1; /* (0 &amp;gt;&amp;gt; 3) &amp;amp; 0x1 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unk0 : 
7;&amp;nbsp; /* 1 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unk1 : 7;&amp;nbsp; /* 1 */&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_address : 26; /* &amp;gt;&amp;gt; 6 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_z_size2 : 6; /* &amp;gt;&amp;gt; 14 */&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t in_image_address;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_address;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused3 : 12;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_y_size : 4;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_y_size2 : 16;&amp;nbsp; /* maybe stride? */&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero15;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero16;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero17;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_cache_end_address;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero19;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t image_end_address;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero20 : 2;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero21 : 16;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_data_type_bit_2 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t in_image_data_type_bit_2 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_data_type_bit_2 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero22 : 6;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t post_shift_bit_5_6 : 2;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused4 : 3;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t in_image_stride : 16;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t in_image_y_size2 : 16; /* again? */&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_stride : 16;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused5 : 8;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero23 : 8;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero24 : 26; /* 0 &amp;gt;&amp;gt; 6 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero25 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero26 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero27 : 1; /* 0 &amp;gt;&amp;gt; 4 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero28 : 1; /* 0 &amp;gt;&amp;gt; 4 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero29 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t kernel_data_type_bit_3 : 1;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unk2 : 26; /* 0xFFFFFFFF &amp;gt;&amp;gt; 6 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused6 : 4;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero30 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t in_image_data_type_bit_3 : 1;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero31 : 26; /* 0 &amp;gt;&amp;gt; 6 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_image_data_type_bit_3 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused7 : 6;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unk3 : 26; /* 0xFFFFFFFF &amp;gt;&amp;gt; 6 */&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused8 : 6;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t coef_zero_point : 8;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t out_zero_point : 8;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero32 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero33 : 1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero34 : 8;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t unused9 : 6;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero35;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero36 : 4;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero37 : 28;&amp;nbsp; /* 0 
&amp;gt;&amp;gt; 4 */&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero38 : 4;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t zero39 : 28;&amp;nbsp; /* 0 &amp;gt;&amp;gt; 4 */&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t further1;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t further2;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t further3;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t further4;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t further5;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t further6;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t further7;&lt;br /&gt;&amp;nbsp;&amp;nbsp; uint32_t further8;&lt;br /&gt;&lt;/span&gt;};&lt;br /&gt;&lt;/p&gt;&lt;p&gt;As you can see, there are a lot of &quot;zero&quot; and &quot;unused&quot; fields; most of them I think will actually be used for something, as HW engineers don&#39;t tend to like wasting bits. By adding instrumentation for dumping these structs to the reverse engineering tooling, I will be getting a better idea of what each field means and does.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;I got GPU hangs the first time that I submitted a job with the same configuration as the kernel&#39;s trivial reset job, and looking further showed that the buffer that contains the convolution filters must follow a specific format.&lt;/p&gt;&lt;p&gt;By looking again at the kernel driver sources, I used the same kernel/filter buffer and the GPU didn&#39;t hang anymore. That kernel had all zeroes as the weights, and indeed my output buffer was now full of zeroes.&lt;/p&gt;&lt;p&gt;Then I tried to put my weights into the format that I inferred from the kernel driver source code, but I wasn&#39;t able to get any job to run to completion without hangs, and the output buffer was unchanged.&lt;/p&gt;&lt;p&gt;To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. With that buffer, and after playing around with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.&lt;/p&gt;&lt;p&gt;What I am doing right now is slowly zeroing out the weights buffer to figure out which bits are data, which are control, and what effect the changes have on the output.&lt;/p&gt;
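&lt;p&gt;The probing loop is essentially this, where run_job is a hypothetical stand-in for resubmitting the job with a patched buffer through the RE tooling:&lt;/p&gt;&lt;pre&gt;
buf = bytearray(open(&#39;weights.bin&#39;, &#39;rb&#39;).read())  # dumped by the tooling
baseline = run_job(buf)

for i in range(len(buf)):
    patched = bytearray(buf)
    patched[i] = 0
    out = run_job(patched)
    # Record which output elements a given byte influences.
    changed = [j for j, (a, b) in enumerate(zip(baseline, out)) if a != b]
    print(i, changed)
&lt;/pre&gt;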
With that dumped buffer, and after playing some with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.

What I am doing right now is slowly zeroing out the weights buffer to figure out which bits are data, which are control, and what effect the changes have on the output.

Hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!


Etnaviv NPU update 1: Planning for performance (2023-05-29)

As I wrote in the last update (https://blog.tomeuvizoso.net/2023/04/a-long-overdue-update.html), my OpenCL branch (https://gitlab.freedesktop.org/tomeu/mesa/-/tree/etnaviv-opencl) is able to correctly run MobileNet v1 (https://arxiv.org/abs/1704.04861) with the GPU delegate in TensorFlow Lite, albeit much slower than with VeriSilicon's proprietary stack.

In the weeks that have passed I have been investigating the performance difference, understanding better how the hardware works and what the explanation could be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!).

When trying to understand the big performance difference, I discovered that the existing reverse engineering tools that I had been using to understand how to run OpenCL workloads weren't working.
They detected a single OpenCL kernel at the end of the execution, and there was no way that a single kernel could be executing the whole network.

After a lot of fumbling around on the internet, I stumbled upon a commit (https://github.com/phytec/android-phytec-devices/commit/530d1d3102c93b00ae0a6a87a50db2648f874277) that included an interestingly-named environment variable: VIV_VX_DISABLE_TP_NN_EVIS. With it, VeriSilicon's OpenVX implementation will execute the network without using either the TP or NN fixed-function units, nor the EVIS instruction set (which helps reduce memory bandwidth use by allowing operations on packed int8 and int16 types).

With that environment variable, OpenVX was using regular OpenCL to run the inference, and the performance difference was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason for this is that there is only one core in the NPU that is able to run programmable kernels; the rest are fixed-function units, as I'm going to explain next.

Digging further into VeriSilicon's kernel driver and marketing documents, I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). What these units cannot do has to be done in the single, slow programmable core.

The next step was to understand how the proprietary stack made use of the fixed-function units in the NPU.

The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:

  Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -> [T#5]
  Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -> [T#31]
  ...
&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;[12 more pairs of CONV_2D and &lt;/span&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;DEPTHWISE_CONV_2D&lt;/span&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;] &lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;...&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; Op#27 AVERAGE_POOL_2D(T#29) -&amp;gt; [T#0]&lt;br /&gt;&amp;nbsp; Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -&amp;gt; [T#1]&lt;br /&gt;&amp;nbsp; Op#29 RESHAPE(T#1, T#86[-1, 1001]) -&amp;gt; [T#85]&lt;br /&gt;&amp;nbsp; Op#30 SOFTMAX(T#85) -&amp;gt; [T#87]&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end.&amp;nbsp;&lt;/p&gt;&lt;p&gt;By using some of the environment variables that are mentioned in &lt;a href=&quot;https://github.com/VeriSilicon/tflite-vx-delegate/issues/20#issuecomment-952472901&quot;&gt;this issue&lt;/a&gt; in GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP&lt;br /&gt;&amp;nbsp; operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP&lt;br /&gt;&amp;nbsp; operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN&lt;br /&gt;... &lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN]&amp;nbsp;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;...&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;font-family: courier;&quot;&gt;&amp;nbsp; operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH&lt;br /&gt;&amp;nbsp; operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP&lt;br /&gt;&amp;nbsp; operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;From that we can see that the TP units are used to prepare the input tensor, then all convolution operations are going to the NN cores, and then the output of the convolutions is passed through a pooling operation in the programmable core, passing its input to the TP cores for further processing and then finishing with SOFTMAX on the programmable cores.&lt;br /&gt;&lt;br /&gt;So in this case, only a small part of the network is actually ran on the programmable cores, via OpenCL...&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Next steps&amp;nbsp;&lt;/h2&gt;&lt;p style=&quot;text-align: left;&quot;&gt;What I will be working on next:&lt;br /&gt;&lt;/p&gt;&lt;ol style=&quot;text-align: left;&quot;&gt;&lt;li&gt;Adapt the existing RE tooling to dump information regarding NN and TP workflows&lt;/li&gt;&lt;li&gt;Start to fill the data structures by reading the code of VeriSilicon&#39;s kernel driver, which executes some trivial workloads to, presumably, reset the HW between context switches to prevent information leaks.&lt;/li&gt;&lt;li&gt;Write some simple OpenVX graphs that exercise each of the operations that the documentation claims to be supported by the 
NPU.&lt;/li&gt;&lt;li&gt;Observe the data that VeriSilicon&#39;s userspace stack passes to the kernel, and infer from there the exact layout of the configuration buffers that program the fixed-function units.&lt;/li&gt;&lt;li&gt;Hack Mesa to send a NN job if the name of the CL kernel contains &quot;convolution&quot;.&lt;/li&gt;&lt;li&gt;Get things working for this specific network and measure performance.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;If performance is at least 3x faster than running the inference on the CPU, I would call this good enough to be useful and I will switch to upstreaming. The Mesa side of it doesn&#39;t look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.&lt;/p&gt;&lt;p&gt;The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the &lt;a href=&quot;https://registry.khronos.org/OpenVX/extensions/vx_khr_nn/1.1/html/d6/d9a/group__group__cnn.html#ga870c106e8ceb4c118692c6f754f75f43&quot;&gt;vxConvolutionLayer node&lt;/a&gt;).&lt;br /&gt;&lt;/p&gt;&lt;p&gt;The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with a NN job, and with what configuration.&lt;/p&gt;&lt;p&gt;That&#39;s a lot of work, but I would say at this point that afterwards I will start looking at making fuller use of the NPU&#39;s capabilities by doing something similar with the operations that the TP cores can accelerate.&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/6608550388257645646/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&amp;postID=6608550388257645646' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6608550388257645646'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6608550388257645646'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/05/etnaviv-npu-update-1-planning-for.html' title='Etnaviv NPU update 1: Planning for performance'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-8730763367013492345</id><published>2023-04-26T11:54:00.003+02:00</published><updated>2023-12-06T09:00:54.278+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>A long 
The GPU delegate in TensorFlow Lite would use it on the Vivante NPU, and Mesa would have a robust way of knowing that this kernel should be run as a NN job, and with what configuration.

That's a lot of work, but I would say at this point that, afterwards, I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.


A long overdue update (2023-04-26)

Cannot believe it has been years since my last update here!

There are two things that I would like to tell people about.

The first is that I no longer work at Collabora (https://www.collabora.com/). It has been almost 13 years full of excitement, and recently I came to believe that I wanted a proper change. They are great folks to work with, so if you are thinking of a career change and want to do open-source stuff upstream, I recommend you consider them.

And the other topic is what I have been working on lately: a free software driver for the NPUs that VeriSilicon sells to SoC vendors.

TL;DR

  tomeu@arm-64:~/tensorflow/build/examples/label_image$ SMALLER_SOFTMAX=1 RUSTICL_ENABLE=etnaviv LD_LIBRARY_PATH=/home/tomeu/opencl/lib LIBGL_DRIVERS_PATH=/home/tomeu/opencl/lib/dri/ ./label_image --gpu_backend=cl --use_gpu=true --verbose 1 --tflite_model ../../../assets/mobilenet_quant_v1_224.tflite --labels ../../../assets/labels.txt --image ../../../assets/grace_hopper.bmp --warmup_runs 1 -c 1

  [snip]
  INFO: invoked
  INFO: average time: 1261.99 ms
  INFO: 0.666667: 458 bow tie
  INFO: 0.294118: 653 military uniform
  INFO: 0.0117647: 835 suit
  INFO: 0.00784314: 611 jersey
  INFO: 0.00392157: 922 book jacket

That is TensorFlow Lite's OpenCL delegate detecting objects with Etnaviv in Grace Hopper's portrait in military uniform.

The story behind this work

Many years ago, when I was working on the operating system for the One Laptop per Child project, I became painfully aware of the problems caused by IP vendors not providing the source code for their drivers.

This and other instances of the same problem motivated me to help out on the Panfrost project, writing a free software driver for the Mali GPUs by Arm. That gave me a great opportunity to learn about reverse engineering from Alyssa Rosenzweig (https://rosenzweig.io/).

Nowadays the Mesa project contains drivers for most GPUs out there, some maintained by the same companies that develop the IP, some by their customers and hobbyists alike.
So the problem of the availability of source code for GPU drivers is pretty much solved.

Except that, with the advent of machine learning at the edge, we are reliving the same problem with the drivers that accelerate those workloads on NPUs, TPUs, etc.

Vivante's NPU IP is very closely based on their GPUs. And it is pretty popular, being included in SoCs by Amlogic, Rockchip, NXP, Broadcom and more.

We already have a reasonably complete driver (Etnaviv) for their GPU IP, so I started by looking at what the differences were and how much of the existing userspace and kernel drivers we could reuse.

The kernel driver works with almost no changes; it just took me some time to implement the hardware initialization properly in upstream. As of Linux 6.3 the driver loads correctly on Khadas' VIM3, but for a chance at decent performance this patch is needed:

[PATCH] arm64: dts: VIM3: Set the rates of the clocks for the NPU (https://lkml.org/lkml/2023/4/26/19)

Due to its experimental status, it is disabled by default in the device tree. To enable it, add the below to arch/arm64/boot/dts/amlogic/meson-g12b-a311d-khadas-vim3.dts:
  14. &lt;span style=&quot;font-family: courier;&quot;&gt;&amp;amp;npu {&lt;br /&gt;
  15. &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp; status = &quot;okay&quot;;&lt;br /&gt;};&lt;/span&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Enabling Etnaviv for other boards with this IP should be relatively straightforward, by describing how the HW is initialized by inspecting the downstream kernel sources for the board in question.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Mesa has seen most of the work, as this IP is compute-only and the userspace driver only targeted OpenGL ES.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;First step was wiring up the existing driver to Mesa&#39;s OpenCL implementation, and then I focused on getting the simplest kernel to correctly run. For this and all the subsequent work, the reverse-engineering tools used by the Etnaviv community have been of great use.&lt;br /&gt;&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;At that point I had to pause the work to focus on other unrelated stuff, but Collabora&#39;s &lt;a href=&quot;https://gitlab.freedesktop.org/italove&quot;&gt;Italo Nicola&lt;/a&gt; and &lt;a href=&quot;https://gitlab.freedesktop.org/gfxstrand/&quot;&gt;Faith Ekstrand &lt;/a&gt;did great work to extend the existing compiler to generate OpenCL kernels.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Once I didn&#39;t have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;And eventually we got to this point. 1.2 seconds to run that inferrence is a lot of time, so the next steps for me will be to figure out what are the biggest causes for the low performance.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;With the goal in mind of providing a free software driver that companies can use to run inferrence on their products containing Vivante&#39;s NPU IP, I need for those tasks to be performanced at at least the same order of magnitude as the closed source solution provided by Vivante.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante&#39;s driver, but the solution that they provide uses a special delegate that is able to better use their hardware is several times faster.&lt;/p&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Current performance situation (label_image):&lt;/p&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;OpenCL delegate with Etnaviv: 1261.99 ms&lt;/li&gt;&lt;li&gt;OpenCL delegate with Galcore: 787.733 ms&lt;/li&gt;&lt;li&gt;CPU: 149.19 ms&lt;/li&gt;&lt;li&gt;TIM-VX delegate: 2.567 ms (!)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The plan is to first see why we are slower with the OpenCL delegate and fix it, and afterwards the real fun stuff will start: seeing how we can use more of the HW capabilities through the OpenCL API and with upstream TensorFlow Lite.&lt;br /&gt;&lt;/p&gt;&lt;h2 style=&quot;text-align: left;&quot;&gt;Next steps&lt;/h2&gt;&lt;p style=&quot;text-align: left;&quot;&gt;Italo is cleaning up an &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18986&quot;&gt;initial submission&lt;/a&gt; for inclusion in Mesa upstream. 
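By "simplest kernel" I mean something of this order; an illustrative example of the kind of trivial kernel that is useful during bring-up, not the actual test case:

  /* A trivial OpenCL kernel: it only copies its input to its output,
   * so any mismatch points at problems in command submission,
   * compilation or memory mapping rather than in the kernel logic. */
  __kernel void copy(__global const uchar *in, __global uchar *out)
  {
      size_t i = get_global_id(0);

      out[i] = in[i];
  }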
At that point I had to pause the work to focus on other unrelated stuff, but Collabora's Italo Nicola (https://gitlab.freedesktop.org/italove) and Faith Ekstrand (https://gitlab.freedesktop.org/gfxstrand/) did great work to extend the existing compiler to generate OpenCL kernels.

Once I didn't have a day job getting in the way anymore, I started adding the features needed to run the label_image example in TensorFlow Lite.

And eventually we got to this point. 1.2 seconds to run that inference is a lot of time, so the next step for me will be to figure out the biggest causes of the low performance.

With the goal in mind of providing a free software driver that companies can use to run inference on their products containing Vivante's NPU IP, I need those tasks to perform at at least the same order of magnitude as the closed-source solution provided by Vivante.

Right now Etnaviv is about twice as slow as running label_image with the OpenCL delegate on Vivante's driver, but the solution they provide uses a special delegate that makes better use of their hardware and is several times faster.

Current performance situation (label_image):

- OpenCL delegate with Etnaviv: 1261.99 ms
- OpenCL delegate with Galcore: 787.733 ms
- CPU: 149.19 ms
- TIM-VX delegate: 2.567 ms (!)

The plan is to first see why we are slower with the OpenCL delegate and fix it; afterwards the real fun stuff will start: seeing how we can use more of the hardware's capabilities through the OpenCL API and with upstream TensorFlow Lite.

Next steps

Italo is cleaning up an initial submission for inclusion in Mesa upstream (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18986). Once that is done I will rebase my branch (https://gitlab.freedesktop.org/tomeu/mesa/-/tree/etnaviv-opencl) and start submitting features.

In parallel to upstreaming, I will be looking at what is needed to get closer to the performance of the closed-source driver for ML acceleration.

Thanks

There are a lot of people besides the ones mentioned above that have made this possible. Some of them are:

- The Mesa community, for having put together such a great framework for GPU drivers. Their CI system has been great to track progress and avoid regressions.
- The Etnaviv community, for all the previous reverse engineering work that documented most of the OpenCL specificities, for a great pair of drivers to base the work on, and for the very useful tooling around them.
- And the Linux kernel community, who made it so easy to get the hardware recognized and the Etnaviv driver probed on it.

Last but not least, there are some individuals to whom I was able to turn when I needed help:

- Christian Gmeiner (austriancoder)
- Lucas Stach (lynxeye)
- Neil Armstrong (narmstrong)
- Faith Ekstrand (gfxstrand)
- Karol Herbst (karolherbst)

A big thanks, it has been a lot of fun!


Panfrost update: a new kernel driver (2019-03-04)
The video

Below you can see the same scene that I recorded in January (https://blog.tomeuvizoso.net/2019/01/a-panfrost-milestone.html), which was rendered by Panfrost in Mesa but using Arm's kernel driver. This time, Panfrost is using a new kernel driver that is in a form close to being acceptable in the mainline kernel:

[Video: the scene rendered by Panfrost on top of the new kernel driver]
  26. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid6&quot;&gt;
  27. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;&lt;/span&gt;&lt;/div&gt;
  28. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid7&quot;&gt;
  29. &lt;/div&gt;
  30. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid8&quot; style=&quot;text-align: left;&quot;&gt;
  31. &lt;h2&gt;
  32. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;The history behind it&lt;/span&gt;&lt;/h2&gt;
  33. &lt;/div&gt;
  34. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid9&quot;&gt;
  35. &lt;/div&gt;
  36. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid10&quot;&gt;
  37. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;During the past two months Rob Herring and I have been working on a new driver for Midgard and Bifrost GPUs that could be accepted mainline.&lt;/span&gt;&lt;/div&gt;
  38. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid11&quot;&gt;
  39. &lt;br /&gt;&lt;/div&gt;
  40. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid12&quot;&gt;
  41. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;Arm already maintains a driver out of tree with an acceptable open source license, but it doesn&#39;t implement the DRM ABI and several design considerations make it unsuitable for inclusion in mainline Linux.&lt;/span&gt;&lt;/div&gt;
  42. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid13&quot;&gt;
  43. &lt;br /&gt;&lt;/div&gt;
  44. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid14&quot;&gt;
  45. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;The absence of a driver in mainline prevents users from keeping their kernels up-to-date and hurts integration with other parts of the free software stack. It also discourages SoC and BSP vendors from submitting their code to mainline, and hurts their ability to track mainline closely.&lt;/span&gt;&lt;/div&gt;
  46. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid15&quot;&gt;
  47. &lt;br /&gt;&lt;/div&gt;
  48. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid16&quot;&gt;
  49. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;Besides the code of the driver itself, there&#39;s one more condition for mainline inclusion: an open source implementation of the userspace library needs to exist, so other kernel contributors can help verifying, debugging and maintaining the kernel driver. It&#39;s an enormous pile of difficult work to reverse engineer the&lt;/span&gt;&lt;span class=&quot;author-a-oz84zz90zz83zz79zz68zn5z69zz84znz87zz65zz73z0x&quot;&gt; inner&lt;/span&gt;&lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt; working&lt;/span&gt;&lt;span class=&quot;author-a-oz84zz90zz83zz79zz68zn5z69zz84znz87zz65zz73z0x&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt; of a GPU and &lt;/span&gt;&lt;span class=&quot;author-a-oz84zz90zz83zz79zz68zn5z69zz84znz87zz65zz73z0x&quot;&gt;then &lt;/span&gt;&lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;implement a compiler and command submission infrastructure, so big thanks to &lt;a href=&quot;https://rosenzweig.io/blog/&quot;&gt;Alyssa Rosenzweig&lt;/a&gt; for leading that effort.&lt;/span&gt;&lt;/div&gt;
  50. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid17&quot;&gt;
  51. &lt;/div&gt;
  52. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid18&quot;&gt;
  53. &lt;h2 style=&quot;text-align: left;&quot;&gt;
  54. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;Upstream status&lt;/span&gt;&lt;/h2&gt;
  55. &lt;/div&gt;
  56. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid19&quot;&gt;
  57. &lt;/div&gt;
  58. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid20&quot;&gt;
  59. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;Most of the Panfrost code is already part of mainline Mesa, with the code that directly interacts with the new DRM driver being in the review stage. Currently targeted GPUs are T760 and T860, with the RK3399 being the SoC more often used for testing.&lt;/span&gt;&lt;/div&gt;
  60. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid21&quot;&gt;
  61. &lt;br /&gt;&lt;/div&gt;
  62. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid22&quot;&gt;
  63. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;The kernel driver is being developed in the open and though we are trying to follow the best practices as displayed by other DRM drivers, there&#39;s a number of tasks that need to be done before we consider it ready for submission.&lt;/span&gt;&lt;/div&gt;
  64. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid23&quot;&gt;
  65. &lt;/div&gt;
  66. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid24&quot;&gt;
  67. &lt;h2 style=&quot;text-align: left;&quot;&gt;
  68. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;The work ahead&lt;/span&gt;&lt;/h2&gt;
  69. &lt;/div&gt;
  70. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid25&quot;&gt;
  71. &lt;/div&gt;
  72. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid26&quot;&gt;
  73. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;In the kernel:&lt;/span&gt;&lt;/div&gt;
  74. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid26&quot;&gt;
  75. &lt;/div&gt;
  76. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid27&quot;&gt;
  77. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Make MMU code more complete for correctness and better performance&lt;/span&gt;&lt;/div&gt;
  78. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid28&quot;&gt;
  79. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Handle errors and hangs and correctly reset the GPU&lt;/span&gt;&lt;/div&gt;
  80. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid29&quot;&gt;
  81. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Improve fence handling&lt;/span&gt;&lt;/div&gt;
  82. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid30&quot;&gt;
  83. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Test with compute shaders (to check completeness of the ABI)&lt;/span&gt;&lt;/div&gt;
  84. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid31&quot;&gt;
  85. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Lots of cleanups and bug fixing!&lt;/span&gt;&lt;/div&gt;
  86. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid32&quot;&gt;
  87. &lt;br /&gt;&lt;/div&gt;
  88. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid33&quot;&gt;
  89. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;In Mesa:&lt;/span&gt;&lt;/div&gt;
  90. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid34&quot;&gt;
  91. &lt;/div&gt;
  92. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid34&quot;&gt;
  93. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Get GNOME Shell working&lt;/span&gt;&lt;/div&gt;
  94. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid35&quot;&gt;
  95. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Get Chromium working with accelerated WebGL&lt;/span&gt;&lt;/div&gt;
  96. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid36&quot;&gt;
  97. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Get all of glmark2 working&lt;/span&gt;&lt;/div&gt;
  98. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid37&quot;&gt;
  99. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Get a decent subset of dEQP passing and use it in CI&lt;/span&gt;&lt;/div&gt;
  100. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid38&quot;&gt;
  101. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Keep refactoring the code&lt;/span&gt;&lt;/div&gt;
  102. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid39&quot;&gt;
  103. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- Support more hardware&lt;/span&gt;&lt;/div&gt;
  104. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid40&quot;&gt;
  105. &lt;/div&gt;
  106. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid41&quot;&gt;
  107. &lt;h2 style=&quot;text-align: left;&quot;&gt;
  108. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;Get the code&lt;/span&gt;&lt;/h2&gt;
  109. &lt;/div&gt;
  110. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid42&quot;&gt;
  111. &lt;/div&gt;
  112. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid43&quot;&gt;
  113. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;The exact bits used for the demo recorded above are in various stages of getting upstreamed to the various upstreams, but here are in branches for easier reproduction:&lt;/span&gt;&lt;/div&gt;
  114. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid44&quot;&gt;
  115. &lt;br /&gt;&lt;/div&gt;
  116. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid45&quot;&gt;
  117. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- &lt;/span&gt;&lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14 url&quot;&gt;&lt;a href=&quot;https://gitlab.freedesktop.org/tomeu/linux/tree/panfrost-5.0-rc4&quot;&gt;https://gitlab.freedesktop.org/panfrost/linux/tree/panfrost-5.0-rc4&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;
  118. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid46&quot;&gt;
  119. &lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14&quot;&gt;- &lt;/span&gt;&lt;span class=&quot;author-a-z75zz67zz75zbkgz76zz68zz69zmz77zivz85z14 url&quot;&gt;&lt;a href=&quot;https://gitlab.freedesktop.org/tomeu/mesa/tree/mainline-driver&quot;&gt;https://gitlab.freedesktop.org/tomeu/mesa/tree/mainline-driver&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;
  120. &lt;div class=&quot;ace-line&quot; id=&quot;magicdomid47&quot;&gt;
  121. &lt;br /&gt;&lt;/div&gt;
  122. &lt;/div&gt;

A Panfrost milestone (2019-01-07)
  124. &lt;h3 style=&quot;text-align: left;&quot;&gt;
  125. The video&lt;/h3&gt;
  126. &lt;div style=&quot;text-align: left;&quot;&gt;
  127. Below you can see glmark2 running as a Wayland client in Weston, on a &lt;a href=&quot;http://wiki.friendlyarm.com/wiki/index.php/NanoPC-T4&quot;&gt;NanoPC -T4&lt;/a&gt; (so a RK3399 SoC with a Mali T-864 GPU)). It&#39;s much smoother than on the video, which is limited to 5FPS by the webcam.&lt;/div&gt;
  128. &lt;div style=&quot;text-align: left;&quot;&gt;
  129. &lt;br /&gt;&lt;/div&gt;
  130. &lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
  131. &lt;iframe allowfullscreen=&#39;allowfullscreen&#39; webkitallowfullscreen=&#39;webkitallowfullscreen&#39; mozallowfullscreen=&#39;mozallowfullscreen&#39; width=&#39;600&#39; height=&#39;500&#39; src=&#39;https://www.blogger.com/video.g?token=AD6v5dwflTwwj84eQUbGsL-lrS2lUsTZw93yOqIX1PV4KR22Df2ox3_DZZq4OijrN23s5Be0x3jbyjq0Ti8sTq5uiw&#39; class=&#39;b-hbp-video b-uploaded&#39; frameborder=&#39;0&#39;&gt;&lt;/iframe&gt;&lt;/div&gt;
  132. &lt;br /&gt;
  133. Weston is running with the DRM backend and the GL renderer.&lt;br /&gt;
  134. &lt;br /&gt;
  135. &lt;h3 style=&quot;text-align: left;&quot;&gt;
  136. The history behind it &lt;/h3&gt;
  137. &lt;br /&gt;
  138. For more than 10 years, at &lt;a href=&quot;https://www.collabora.com/&quot;&gt;Collabora&lt;/a&gt; we have been happily helping our customers to make the most of their hardware by running free software.&lt;br /&gt;
  139. &lt;br /&gt;
  140. One area some of us have specially enjoyed working on has been open drivers for GPUs, which for a long time have been considered the next frontier in the quest to have a full software platform that companies and individuals can understand, improve and fix without having to ask for permission first.&lt;br /&gt;
  141. &lt;br /&gt;
  142. Something that has saddened me a bit has been our reduced ability to help those customers that for one reason or another had chosen a hardware platform with ARM Mali GPUs, as no open driver was available for those.&lt;br /&gt;
  143. &lt;br /&gt;
  144. While our biggest customers were able to get a high level of support from the vendors in order to have the Mali graphics stack well integrated with the rest of their product, the smaller ones had a much harder time in achieving that level of integration, which manifested in reduced performance, increased power consumption and slipped milestones.&lt;br /&gt;
  145. &lt;br /&gt;
  146. That&#39;s why we have been following with great interest the several efforts that aimed to come up with an open driver for GPUs in the Mali family, one similar to those already existing for Qualcomm, NVIDIA and Vivante.&lt;br /&gt;
  147. &lt;br /&gt;
  148. At XDC last year we had the chance of meeting the people involved in the latest effort to develop such a driver: Panfrost. And in the months that followed I made some room in my backlog to come up with a plan to give the effort a boost.&lt;br /&gt;
  149. &lt;br /&gt;
  150. At that point, Panfrost was only able to get its bits in the screen by an elaborate hack that involved copying each frame into a X11 SHM buffer, which besides making the setup of the development environment much more cumbersome, invalidated any performance analysis. It also limited testing to demos such as glmark2.&lt;br /&gt;
  151. &lt;br /&gt;
  152. Due to my previous work on Etnaviv I was already familiar with the abstractions in Mesa for setups in which the display of buffers is performed by a device different from the GPU so it was just a matter of seeing how we could get the kernel driver for the Mali GPU to play well with the rest of the stack.&lt;br /&gt;
  153. &lt;br /&gt;
  154. So during the past month or so I have come up with a proper implementation of the winsys abstraction that makes use of ARM&#39;s kernel driver. The result is that now developers have a better base on which to work on the rendering side of things.&lt;br /&gt;
  155. &lt;br /&gt;
  156. By properly creating, exporting and importing buffers, we can now run applications on GBM, from demos such as kmscube and glmark2 to compositors such as Weston, but also big applications such as Kodi. We are also supporting zero-copy display of GPU-rendered clients in Weston.&lt;br /&gt;
  157. &lt;br /&gt;
This should make it much easier to work on the rendering side of things, and work on a proper DRM driver in the mainline kernel can proceed in parallel.

For those interested in joining the effort, Alyssa has graciously taken the time to update the instructions to build and test Panfrost (https://panfrost.freedesktop.org/building-panfrost-mesa.html). You can join us at #panfrost in Freenode and start sending merge requests to Gitlab (https://gitlab.freedesktop.org/panfrost/).

Thanks to Collabora for sponsoring this work, and to Alyssa Rosenzweig and Lyude Paul for their previous work and for answering my questions.

Experiments with crosvm (2017-11-06)
Last week I played a bit with crosvm, a KVM monitor used within Chromium OS for application isolation. My goal is to learn more about the current limits of virtualization for isolating applications in mainline. Two of crosvm's defining characteristics are that it's written in Rust for increased security, and that it uses namespaces extensively to reduce the attack surface of the monitor itself.

It was quite easy to get it running outside Chromium OS (I have been testing with Fedora 26), the only complication being that minijail isn't widely packaged in distros. In the instructions below we hack around that with linker environment variables, so we don't have to install it properly. The instructions are in the form of shell commands, for illustrative purposes only.
Build the kernel:

  $ cd ~/src
  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  $ cd linux
  $ git checkout v4.12
  $ make x86_64_defconfig
  $ make bzImage
  $ cd ..

Build minijail:

  $ git clone https://android.googlesource.com/platform/external/minijail
  $ cd minijail
  $ make
  $ cd ..

Build crosvm:

  $ git clone https://chromium.googlesource.com/a/chromiumos/platform/crosvm
  $ cd crosvm
  $ LIBRARY_PATH=~/src/minijail cargo build
Generate a rootfs (debootstrap needs root, since the mount point is owned by root):

  $ cd ~/src/crosvm
  $ dd if=/dev/zero of=rootfs.ext4 bs=1K count=1M
  $ mkfs.ext4 rootfs.ext4
  $ mkdir rootfs/
  $ sudo mount rootfs.ext4 rootfs/
  $ sudo debootstrap testing rootfs/
  $ sudo umount rootfs/
Run crosvm:

  $ LD_LIBRARY_PATH=~/src/minijail ./target/debug/crosvm run -r rootfs.ext4 --seccomp-policy-dir=./seccomp/x86_64/ ~/src/linux/arch/x86/boot/compressed/vmlinux.bin
The work ahead includes figuring out the best way for Wayland clients in the guest to interact with the compositor in the host, and also for guests to make efficient use of the GPU.

Slides on the Chamelium board (2016-12-22)
Yesterday I gave a short talk about the Chamelium board (https://www.chromium.org/chromium-os/testing/chamelium) from the ChromeOS team, and thought that the slides could be useful for others as this board gets used more and more outside of Google.

Slides: https://people.collabora.com/~tomeu/Chamelium_Overview.odp

If you are interested in how this board can help you automate the testing of your display (and not only!) code and hardware, a new mailing list has been created to discuss its uses (https://groups.google.com/a/chromium.org/forum/#!forum/chamelium-users). We at Collabora (https://www.collabora.com/) will be happy to help you integrate this board into your CI lab as well.

Thanks go to Intel (https://01.org/) for sponsoring the preparation of these slides and for allowing me to share them under an open license.

And of course, thanks to Google's ChromeOS team for releasing the hardware design under an open hardware license, along with the code they are running on it and with it.
