This is a valid Atom 1.0 feed.
This feed is valid, but interoperability with the widest range of feed readers could still be improved.
Feed: Tomeu Vizoso (https://blog.tomeuvizoso.net/), Atom 1.0, last updated 2024-11-17. The feed lists 138 posts in total; the most recent entries follow.

Etnaviv NPU update 21: Support for the NPU in the NXP i.MX 8M Plus SoC is upstream! (2024-11-16)

Several months have passed since the last update. This has been in part due to the summer holidays and a gig doing some non-upstream work, but I have also had the opportunity to continue my work on the NPU driver for the VeriSilicon NPU in the NXP i.MX 8M Plus SoC, thanks to my friends at Ideas On Board (https://ideasonboard.com/).

[Photo: a Porto tram. CC BY-NC 4.0, Henrik Boye]

I'm very happy with what has been accomplished so far, with the first concrete result being the merge in Mesa of the support for NXP's SoC. Thanks to Philipp Zabel and Christian Gmeiner for helping with their ideas and code reviews.

With this, as of yesterday, one can accelerate models such as SSDLite MobileDet on that SoC with only open source software, with the support being provided directly by projects that are already ubiquitous in today's products, such as the Linux kernel and Mesa3D. We can expect this functionality to reach distributions such as Debian in due time, for seamless installation and integration in products.

With this milestone reached, I will be working on expanding support for more models, with a first goal of enabling YOLO-like models, starting with YOLOX. I will also be working on performance, as we are currently not fully using the capabilities of this hardware.
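As a concrete illustration of what "only open source software" means in practice here: once the kernel and Mesa bits are installed, TensorFlow Lite just loads Mesa's delegate as an external library. The sketch below is a minimal example of that flow, assuming tflite_runtime is available; the model file name is a placeholder, and the same libteflon.so delegate appears in the classification run quoted in the "Rockchip NPU update 2" post further down.

# Minimal sketch: run an INT8-quantized TFLite model through an external
# delegate (Mesa's libteflon.so). Model and delegate paths are placeholders.
import numpy as np
import tflite_runtime.interpreter as tflite

delegate = tflite.load_delegate("libteflon.so")  # assumes the library is in the loader path
interpreter = tflite.Interpreter(
    model_path="ssdlite_mobiledet.tflite",  # placeholder model file
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy tensor of the right shape and dtype; a real application would
# resize a camera frame to inp["shape"] instead.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)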
Etnaviv NPU update 20: Fast object detection on the NXP i.MX 8M Plus SoC (2024-07-31)

I'm happy to announce that my first project regarding support for the NPU in NXP's i.MX 8M Plus SoC has reached the feature-complete stage.

[Photo: a Porto tram. CC BY-NC 4.0, Henrik Boye]

For the last several weeks I have been working full-time on adding support for the NPU to the existing Etnaviv driver. Most of the existing code that supports the NPU in the Amlogic A311D was reused, but NXP uses a much more recent version of the NPU IP, so some advancements required new code, and that in turn required reverse engineering.

This work has been kindly sponsored by the Open Source consultancy Ideas On Board (https://ideasonboard.com/), for which I am very grateful. I hope this will be useful to those companies that need full mainline support in their products, even if it is just the start. This company is unique in working on both NPU and camera drivers in Linux mainline, so they have the best experience for products that require long-term support and vision processing.

Since the last update I have fixed the last bugs in the compression of the weights tensor and implemented support for a new hardware-assisted way of executing depthwise convolutions. Some improvements to how the tensor addition operation is lowered to convolutions were needed as well.
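To make the "addition lowered to convolutions" step concrete, here is a small NumPy sketch of the general idea (not the driver's actual code, which also has to handle quantization and hardware tiling): concatenating the two inputs along the channel axis and applying a 1x1 convolution whose weights are two stacked identity matrices reproduces the elementwise sum.

# Sketch of lowering an elementwise tensor addition to a 1x1 convolution.
# Illustrative only; the real lowering also deals with quantization and tiling.
import numpy as np

def add_as_pointwise_conv(a, b):
    """a, b: float tensors of shape (H, W, C). Returns a + b via a 1x1 conv."""
    h, w, c = a.shape
    x = np.concatenate([a, b], axis=-1)          # (H, W, 2C)
    weights = np.vstack([np.eye(c), np.eye(c)])  # (2C, C): picks a[c] + b[c]
    y = x.reshape(h * w, 2 * c) @ weights        # a 1x1 conv is a per-pixel matmul
    return y.reshape(h, w, c)

a = np.random.rand(4, 4, 8).astype(np.float32)
b = np.random.rand(4, 4, 8).astype(np.float32)
assert np.allclose(add_as_pointwise_conv(a, b), a + b, atol=1e-6)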
Performance is already pretty good, allowing objects to be detected in video streams at 30 frames per second, so at a similar performance level to the NPU in the Amlogic A311D. Some performance features are still left to be implemented, so I think there is substantial room for improvement.

The code is currently very much at a proof-of-concept stage. The next step is cleaning it all up and submitting it for review to Mesa3D. In the meantime, you can find the draft code at https://gitlab.freedesktop.org/tomeu/mesa/-/tree/etnaviv-imx8mp.

A big thanks to Philipp Zabel, who reverse engineered the bitstream format of the weight encoding and added some patches to the kernel that were required for the NPU to work reliably.

Etnaviv NPU update 19: Ideas On Board sponsors support for the NXP i.MX 8M Plus SoC (2024-06-28)

Last week I started work on adding support to the Etnaviv driver for the NPU inside the NXP i.MX 8M Plus SoC (VeriSilicon's VIPNano-SI+).

This work is sponsored by the open source consultancy Ideas On Board (https://ideasonboard.com/), and will include the same level of support as for the Amlogic A311D SoC, which means full acceleration for the SSDLite MobileDet object detection model.

Right now all kinds of basic convolutions are supported, and work is well on its way for strided convolutions.

For basic convolutions, most of the work was switching to a totally different way of encoding weights. At the low level, the weights are encoded with Huffman coding, with zero run-length encoding on top. This low-level encoding had already been reverse engineered and implemented by Philipp Zabel of Pengutronix, as mentioned in my previous update on the variant of this NPU shipped inside the Amlogic S905D3.
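As a rough picture of why this encoding helps, here is a toy zero run-length coder in Python. It is only an illustration: the real bitstream that Philipp reverse engineered combines this idea with Huffman coding and a specific bit layout, and the five-bit run-length field mentioned in the "Etnaviv NPU update 16" post further down would cap runs at 31.

# Toy zero run-length encoder, just to illustrate why sparse weight buffers
# compress so well. The real Vivante bitstream is considerably more involved.
def zero_rle_encode(values, max_run=31):
    """Encode a flat list of ints as (zero_run_length, literal) pairs."""
    out, run = [], 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            out.append((run, v))
            run = 0
    if run:
        out.append((run, None))  # trailing zeros, no literal follows
    return out

def zero_rle_decode(pairs):
    out = []
    for run, literal in pairs:
        out.extend([0] * run)
        if literal is not None:
            out.append(literal)
    return out

weights = [0, 0, 0, 5, 0, -3, 0, 0, 0, 0, 0, 0, 7]
encoded = zero_rle_encode(weights)        # [(3, 5), (1, -3), (6, 7)]
assert zero_rle_decode(encoded) == weights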
data-original-width="212" height="161" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi32Q5e3dsQg9Vsrz8e_U78qPxg_oM6_WXLagf_1CPJcjnJd4GkFutxBDkg6xjabu0QUkZX9i96MmKxG-kJ4yQ4IxBoP_VSsTJP5NRxNZofouDEM8Y5Ecqnak0hfbrgbVTFss_t1zi0J8uziRIhyphenhyphenPv2YVrKS8b2Ok9AngneW7V5flac5TYBLtCOddN_Oak/s1600/iob.png" width="212" /></a></div><br />Right now all kinds of basic convolutions are supported, and work is well on its way for strided convolutions.<br /><br />For basic convolutions, most of the work was switching to a totally different way of encoding weights. At the low-level, the weights are encoded with Huffman, and zero run length encoding on top. This low level encoding has been already reverse engineered and implemented by Philipp Zabel of <a href="https://www.pengutronix.de/en/index.html">Pengutronix</a>, as mentioned in <a href="https://blog.tomeuvizoso.net/2024/05/etnaviv-npu-update-18-getting-driver-to.html">my previous update</a> on the variant of this NPU shipped inside the Amlogic S905D3.<br /><br />How weights are laid on top of the encoding is also different, so I had to reverse engineer that and implement it in the Mesa driver. That plus some changes on how tiling is computed got basic convolutions working, then I moved to strided convolutions. Pointwise convolutions got supported at the same time as basic convolutions, as they are not any different on this particular hardware.<br /><br />Strided convolutions are still not natively supported by the hardware, so I reused the code that lowers them to basic convolutions. But the existing jobs that use the tensor manipulation cores to transform the input tensor for strides contained many assumptions that don't hold valid in this hardware.<br /><br />So I have been reverse engineering these differences and now I have all kinds of strided convolutions supported up to 32 output channels. 
I feel that these will be done after addressing a couple of details about how the tensor reshuffle jobs are distributed among the available TP cores.<br /><br />Afterwards I will look at depthwise convolutions, which may be supported natively by the hardware, while on the A311D these were lowered to basic convolutions.<br /><br />Then on to tensor addition operations, and that should be all that is needed to get SSDLite MobileDet running, hopefully close to the performance of the closed source driver.<br /><br />I'm very grateful to <a href="https://ideasonboard.com/">Ideas On Board</a> for sponsoring this work, for their trust on me to get it done, and for their vision of a fully featured mainline platform that all companies can base their products on without being held captive by any single vendor.<br /><br />I'm testing all this on a Verdin iMX8M Plus board that was kindly offered by Daniel Lang at <a href="https://www.toradex.com/">Toradex</a>, thanks!<p></p><br /></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/9116337611903920940/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=9116337611903920940' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/9116337611903920940'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/9116337611903920940'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/06/etnaviv-npu-update-19-ideas-on-board.html' title='Etnaviv NPU update 19: Ideas On Board sponsors support for the NXP i.MX 8M Plus SoC'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi32Q5e3dsQg9Vsrz8e_U78qPxg_oM6_WXLagf_1CPJcjnJd4GkFutxBDkg6xjabu0QUkZX9i96MmKxG-kJ4yQ4IxBoP_VSsTJP5NRxNZofouDEM8Y5Ecqnak0hfbrgbVTFss_t1zi0J8uziRIhyphenhyphenPv2YVrKS8b2Ok9AngneW7V5flac5TYBLtCOddN_Oak/s72-c/iob.png" height="72" width="72"/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-5180937629682036273</id><published>2024-06-13T09:49:00.006+02:00</published><updated>2024-06-13T09:49:29.396+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="kernel"/><category scheme="http://www.blogger.com/atom/ns#" term="linux"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mainline"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="rk3588"/><category scheme="http://www.blogger.com/atom/ns#" term="rockchip"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><title type='text'> Rockchip NPU update 4: Kernel driver for the RK3588 NPU submitted to 
mainline</title><content type='html'><p>In the past few weeks I have been working on <a href="https://blog.tomeuvizoso.net/2024/05/etnaviv-npu-update-18-getting-driver-to.html">among other things</a> a kernel driver for the NPU in the Rockchip RK3588 SoC, new from the ground up.</p><p>It is now fully working and after a good amount of polishing I sent it yesterday to the kernel mailing lists, for review. Those interested can see the code and follow the review process at this <a href="https://lore.kernel.org/all/20240612-6-10-rocket-v1-0-060e48eea250@tomeuvizoso.net/#r">link</a>.<br /></p><p>The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences such as the one below on a stream, at almost 30 frames per second.</p><p style="text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/DDccYn4wpnY" width="320" youtube-src-id="DDccYn4wpnY"></iframe>&nbsp;</p><p style="text-align: left;">The <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29698">userspace&nbsp; driver</a> is in a less polished state, but fully featured at this state. I will be working on this in the next few days so it can be properly submitted for review.</p><p style="text-align: left;">This is the first accelerator-only driver for an edge NPU submitted to the mainline kernel, and hopefully it can serve as a template for the next ones to come, as the differences among NPUs of different vendors are relatively superficial.<br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/5180937629682036273/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=5180937629682036273' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5180937629682036273'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5180937629682036273'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/06/rockchip-npu-update-4-kernel-driver-for.html' title=' Rockchip NPU update 4: Kernel driver for the RK3588 NPU submitted to mainline'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img.youtube.com/vi/DDccYn4wpnY/default.jpg" height="72" width="72"/><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-1163579060436918075</id><published>2024-05-07T14:46:00.008+02:00</published><updated>2024-05-14T07:17:01.238+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category 
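From the TensorFlow Lite side, keeping the three cores busy simply means having several inferences in flight at once. The following is a rough sketch with one interpreter per worker thread; the model and delegate paths are placeholders, and a real pipeline would feed decoded video frames instead of dummy tensors.

# Rough sketch: four object-detection inferences in flight at once, one
# TFLite interpreter per worker thread. Paths are placeholders.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import tflite_runtime.interpreter as tflite

MODEL = "ssdlite_mobiledet.tflite"  # placeholder

def make_interpreter():
    interp = tflite.Interpreter(
        model_path=MODEL,
        experimental_delegates=[tflite.load_delegate("libteflon.so")],
    )
    interp.allocate_tensors()
    return interp

def detect(interp, frame):
    inp = interp.get_input_details()[0]
    interp.set_tensor(inp["index"], frame)
    interp.invoke()
    return interp.get_tensor(interp.get_output_details()[0]["index"])

interpreters = [make_interpreter() for _ in range(4)]
shape = interpreters[0].get_input_details()[0]["shape"]
frames = [np.zeros(shape, dtype=np.uint8) for _ in range(4)]  # stand-ins for video frames

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(detect, interpreters, frames))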
scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="s905d3"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="verisilicon"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>Etnaviv NPU update 18: Getting the driver to work on the Amlogic S905D3 SoC</title><content type='html'><p>With new releases of the Linux kernel and Mesa drivers poised to be packaged by Linux distributions, the <a href="https://docs.mesa3d.org/teflon.html">TensorFlow Lite driver</a> for the NPU in the Amlogic A311D SoC will be available to users with minimal effort.</p><p>With that work bearing its fruits, I have been looking at how this driver could be of use with other hardware.</p><p>Philipp Zabel of <a href="https://www.pengutronix.de/en/index.html">Pengutronix</a> has been looking at adding support for the NPU in the NXP i.MX 8M Plus SoC, and he has made great progress on reverse engineering the in-memory format of the weights tensor, which is different from that used in the A311D.</p><p></p>I started by probing what would entail supporting the NPU in the S905D3 SoC from Amlogic, and I found it not that different from what is currently supported, besides it also using a new format for the <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network#Weights">weights tensor</a>.<p></p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4vzDAW-b5B2wJqVNXILuvox_OG35uHBRyFhA60PZbtVKRtIbL5uBVtonZLGIEbPZopy8PCwYNqm0ZMmZumgWXiY9VeX__jdpS_0yC-5DyV7hR5cwkblHv9AstuBVZiPKFjdloanx8cRqdYiybEARVkwc3z2w9-W3Ddvpj8bVs7V9LfDRwIQacO8bZF3Y/s320/Gym_Dumbbells_For_Working_Out_(193383405).jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Weights, the other kind of." border="0" data-original-height="240" data-original-width="320" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4vzDAW-b5B2wJqVNXILuvox_OG35uHBRyFhA60PZbtVKRtIbL5uBVtonZLGIEbPZopy8PCwYNqm0ZMmZumgWXiY9VeX__jdpS_0yC-5DyV7hR5cwkblHv9AstuBVZiPKFjdloanx8cRqdYiybEARVkwc3z2w9-W3Ddvpj8bVs7V9LfDRwIQacO8bZF3Y/w320-h240/Gym_Dumbbells_For_Working_Out_(193383405).jpeg" title="Weights, the other kind of." 
width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Weights, the other kind of them.</td></tr></tbody></table>Looked a bit further, and found that this format is very similar to what Philip had been reverse engineering and implementing support for.</p><p>After a couple of weeks staring at memory dumps and writing a python tool to decode them, I realized that the <a href="https://en.wikipedia.org/wiki/Run-length_encoding">run-length</a> and <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman encodings</a> were the same, with only a few differences such as where and how the bias values were stored.</p><p>With a few changes to Philip's work-in-progress branch I got my first tests passing on the <a href="https://libre.computer/products/aml-s905d3-cc/">Libre Computer Solitude</a> SBC board.</p><p>Next I will look at supporting more weights tensor dimensions and fixing bugs in how the weights and other values are encoded.</p><p>The command stream programming seems to be very similar to that of the A311D, so I don't expect much work to be needed there.</p><p>Once everything is working at the same level as with the A311D, I will move to determine the optimal values for the zero run-length and Huffman symbol maps, for maximum compression and thus performance (as NPUs are so fast at arithmetic that they tend to be memory starved).</p><p>Big thanks to Pengutronix for supporting Philip's work, and to Libre Computer for having supported the development of the driver so far.<br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/1163579060436918075/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=1163579060436918075' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1163579060436918075'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1163579060436918075'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/05/etnaviv-npu-update-18-getting-driver-to.html' title='Etnaviv NPU update 18: Getting the driver to work on the Amlogic S905D3 SoC'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4vzDAW-b5B2wJqVNXILuvox_OG35uHBRyFhA60PZbtVKRtIbL5uBVtonZLGIEbPZopy8PCwYNqm0ZMmZumgWXiY9VeX__jdpS_0yC-5DyV7hR5cwkblHv9AstuBVZiPKFjdloanx8cRqdYiybEARVkwc3z2w9-W3Ddvpj8bVs7V9LfDRwIQacO8bZF3Y/s72-w320-h240-c/Gym_Dumbbells_For_Working_Out_(193383405).jpeg" height="72" width="72"/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-7716416125935972669</id><published>2024-04-19T10:17:00.003+02:00</published><updated>2024-04-19T10:18:30.411+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category 
scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="rk3588"/><category scheme="http://www.blogger.com/atom/ns#" term="rockchip"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><title type='text'> Rockchip NPU update 3: Real-time object detection on RK3588</title><content type='html'><h3 style="text-align: left;">Progress</h3><p>Yesterday I managed to implement in my open-source driver all the remaining operations so the <a href="https://arxiv.org/abs/2004.14525">SSDLite MobileDet</a> model can run on Rockchip's NPU in the RK3588 SoC.</p><p>Performance is pretty good at 30 frames per second when using just one of the 3 cores that the NPU contains.<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHr3zmWGrs1FZyd3mb_sSaxdKN4i35Wao0D8dOJvSP0dDO7EhfWw88PFEIQF-FOqYzk0yy6c1joeKIqVEG9PtArtQWl2z-DedrBcMD7pZiXjlELeGaPYfU04o7dSBN7Tgg2-7d5maikXo2qQyViFeQoVxwqwzyLEKpzSCY1k3218QiQOEFInvLedMwAi4/s1920/object_detection_rk3588.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1080" data-original-width="1920" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHr3zmWGrs1FZyd3mb_sSaxdKN4i35Wao0D8dOJvSP0dDO7EhfWw88PFEIQF-FOqYzk0yy6c1joeKIqVEG9PtArtQWl2z-DedrBcMD7pZiXjlELeGaPYfU04o7dSBN7Tgg2-7d5maikXo2qQyViFeQoVxwqwzyLEKpzSCY1k3218QiQOEFInvLedMwAi4/w400-h225/object_detection_rk3588.png" width="400" /></a></div><br />&nbsp;I uploaded the generated video to YouTube at:<p></p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/DDccYn4wpnY" width="320" youtube-src-id="DDccYn4wpnY"></iframe></div><p></p><div style="text-align: left;">You can get the source code at my branch <a href="https://gitlab.freedesktop.org/tomeu/mesa/-/commits/rocket/?ref_type=heads">here</a>.<br /></div><h3 style="text-align: left;">&nbsp;</h3><h3 style="text-align: left;">Next steps</h3><p>Now that we got to this level of usefulness, I'm going to switch to writing a kernel driver suited for inclusion into the Linux kernel, to the drivers/accel subsystem.</p><p>There is still lots of work to do, but progress is going pretty fast, though as I write more drivers for different NPUs I will have to split my time among them. At least, until we get more contributors! 
:)<br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/7716416125935972669/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=7716416125935972669' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/7716416125935972669'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/7716416125935972669'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/04/rockchip-npu-update-3-real-time-object.html' title=' Rockchip NPU update 3: Real-time object detection on RK3588'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHr3zmWGrs1FZyd3mb_sSaxdKN4i35Wao0D8dOJvSP0dDO7EhfWw88PFEIQF-FOqYzk0yy6c1joeKIqVEG9PtArtQWl2z-DedrBcMD7pZiXjlELeGaPYfU04o7dSBN7Tgg2-7d5maikXo2qQyViFeQoVxwqwzyLEKpzSCY1k3218QiQOEFInvLedMwAi4/s72-w400-h225-c/object_detection_rk3588.png" height="72" width="72"/><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-2268059939300496747</id><published>2024-03-28T08:47:00.000+01:00</published><updated>2024-03-28T08:47:00.757+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="rk3588"/><category scheme="http://www.blogger.com/atom/ns#" term="rockchip"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><title type='text'>Rockchip NPU update 2: MobileNetV1 is done</title><content type='html'><h3 style="text-align: left;">Progress</h3><p style="text-align: left;">For&nbsp; the last couple of weeks I have kept chipping at a new userspace driver for the NPU in the Rockchip RK3588 SoC.</p><p style="text-align: left;">I am very happy to report that the work has gone really smooth and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.</p><p style="text-align: left;">And it not only runs flawlessly, but at the same performance level as the blob.</p><p style="text-align: left;">It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. 
Rockchip NPU update 2: MobileNetV1 is done (2024-03-28)

Progress

For the last couple of weeks I have kept chipping away at a new userspace driver for the NPU in the Rockchip RK3588 SoC.

I am very happy to report that the work has gone really smoothly and I have reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU.

And it not only runs flawlessly, but at the same performance level as the blob.

It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.

[Photo: a hen, by Julien Langlois, CC BY-SA 3.0]

tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 classification.py -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Teflon delegate: loaded rknpu driver

teflon: compiling graph: 89 tensors 27 operations
...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms

Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using; that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inferences going quickly, irrespective of the hardware.

Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board to hack on.

Next steps

I want to go back and get my latest performance work for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later.

After that, I'm a bit torn between working further on the userspace driver, implementing more operations and control flow, or starting to write a kernel driver for mainline.

Rockchip NPU update 1: A walk in the park? (2024-03-16)

During the past weeks I have paused work on the driver for the Vivante NPU and have started work on a new driver, for Rockchip's own NPU IP, as used in SoCs such as the RK3588(S) and RK3568.

The version of the NPU in the RK3588 claims a performance of 6 TOPS across its 3 cores, though from what I have read, people are having trouble making use of more than one core in parallel with the closed source driver.
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGU-XeRJwraDc8PCTHTVdlrt4rM0QeZUuKNFA8WuB4Ogr51PgpWAhll2esCPZatq5SoYxIcyCAbQvahRiSiOCVSysu-dXyJu5gT0C-8hvt3mDe4Wuj_qg98pR_utgzeoyw3C042IDW3ZLgoZux7i877z-D684agsk1_QpYzE2pAO609Mnw1RIFVFE7UMM/s640/pexels-mart-production-8121657.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="427" data-original-width="640" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGU-XeRJwraDc8PCTHTVdlrt4rM0QeZUuKNFA8WuB4Ogr51PgpWAhll2esCPZatq5SoYxIcyCAbQvahRiSiOCVSysu-dXyJu5gT0C-8hvt3mDe4Wuj_qg98pR_utgzeoyw3C042IDW3ZLgoZux7i877z-D684agsk1_QpYzE2pAO609Mnw1RIFVFE7UMM/s320/pexels-mart-production-8121657.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>A nice walk in the park</i><br /></td></tr></tbody></table><p></p><p>Rockchip, as most other vendors of NPU IP, provides a GPLed kernel driver and pushes out their userspace driver in binary form. The kernel driver is pleasantly simple and relatively up-to-date in regards of its use of internal kernel APIs. The userspace stack though is notoriously buggy and difficult to use, with basic features still unimplemented and performance being quite below what the hardware should be able to achieve.</p><p>To be clear, this is on top of the usual problems related to closed-source drivers. I get the impression that Rockchip's NPU team is really understaffed.<br /></p><p>Other people had already looked at reverse-engineering the HW so they could address the limitations and bugs in the closed source driver, and use it in situations not supported by Rockchip. I used information acquired by <a href="https://github.com/phhusson/rknpu-reverse-engineering">Pierre-Hugues Husson</a> and <a href="https://github.com/mtx512/rk3588-npu/">Jasbir Matharu</a> to get started, a big thanks to them!<br /></p><p>After the initial environment was setup (had to forward-port their kernel driver to v6.8), I wrote a simple library that can be loaded in the process with LD_PRELOAD and that, by overriding the ioctl and other syscalls, I was able to dump the buffers that the proprietary userspace driver sends to the hardware.</p><p>I started looking at a buffer that from the debug logs of the proprietary driver contained register writes, and when looking at the register descriptions in the TRM, I saw that it had to be closely based on NVIDIA's NVDLA open-source NPU IP.</p><p>With Rockchip's (terse) description of the registers, NVDLA's documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than I was able to when working on VeriSilicon's driver (for which I had zero documentation).</p><p>Right now I am at the stage at which I am able to correctly execute TensorFLow Lite's Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding. 
With Rockchip's (terse) description of the registers, and NVDLA's documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than I could when working on VeriSilicon's driver (for which I had zero documentation).

Right now I am at the stage where I can correctly execute TensorFlow Lite's Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding. Next is to support multiple output channels.

I'm currently using Rockchip's kernel, but as soon as I'm able to run object detection models with decent hardware utilization, I plan to start writing a new kernel driver for mainlining.

Rockchip's kernel driver has gems such as passing addresses in the kernel address space across the UAPI...

Tests run fast and reliably, even with high concurrency:

tomeu@arm-64:~/mesa$ TEFLON_TEST_DELEGATE=~/mesa/build/src/gallium/targets/teflon/libteflon.so TEFLON_TEST_DATA=src/gallium/targets/teflon/tests LD_LIBRARY_PATH=/home/tomeu/tflite-vx-delegate/build/_deps/tensorflow-build/ ~/.cargo/bin/gtest-runner run --gtest /home/tomeu/mesa/build/src/gallium/targets/teflon/test_teflon --output /tmp -j8 --tests-per-group 1 --baseline ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-fails.txt --flakes ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-flakes.txt --skips ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-skips.txt
Running gtest on 8 threads in 1-test groups
Pass: 0, Duration: 0
Pass: 139, Skip: 14, Duration: 2, Remaining: 2
Pass: 277, Skip: 22, Duration: 4, Remaining: 0
Pass: 316, Skip: 24, Duration: 4, Remaining: 0

You can find the source code in this branch: https://gitlab.freedesktop.org/tomeu/mesa/-/tree/rocket

Etnaviv NPU update 17: Faster! (2024-02-23)

In the last update I explained how compression of zero weights gave our driver such a big performance improvement.

Since then, I have explored further what could take us closer to the performance of the proprietary driver and saw the opportunity to gather some of the proverbial low-hanging fruit.

TL;DR

Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms.

On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!

[Graph: inference time evolution compared with the proprietary driver.]

Enable more convolutions
Our driver was rejecting convolutions with a number of output channels that is not divisible by the number of convolution cores in the NPU, because at the start of development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run those convolutions on the CPU, and some of them were big enough to take a few milliseconds, several times longer than on the NPU.

When implementing support for bigger kernels I had to improve the tiling of the convolutions, and that included adding support for these other convolutions. So by just removing the rejection of these, we got a nice speed-up on SSD MobileDet: from 32.7 ms to 27 ms!

That didn't help on MobileNetV1, because that one has all its convolutions with neat numbers of output channels.
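To see why divisibility by the core count mattered, here is a deliberately simplified sketch of splitting a convolution's output channels across the convolution cores; the driver's real weight-layout code is of course far more involved.

# Simplified illustration of distributing output channels across the NPU's
# convolution cores. A channel count that is not a multiple of the core count
# produces uneven per-core slices, which is the case the early weight-layout
# code did not handle (so those convolutions fell back to the CPU).
def split_output_channels(num_channels, num_cores):
    base, extra = divmod(num_channels, num_cores)
    sizes = [base + (1 if core < extra else 0) for core in range(num_cores)]
    slices, start = [], 0
    for size in sizes:
        slices.append(range(start, start + size))
        start += size
    return slices

print([len(s) for s in split_output_channels(96, 4)])  # [24, 24, 24, 24] - even
print([len(s) for s in split_output_channels(91, 4)])  # [23, 23, 23, 22] - uneven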
Caching of the input tensor

So far we were only caching the kernels in the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way to cache a portion of the input tensor in the remaining internal SRAM.

That got us the rest of the performance improvement mentioned above, but I am having trouble with some combinations of parameters when input tensor caching is enabled, so I need to get to the bottom of that before I submit it for review.

Next steps

At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know that I still need to take a pass at tuning some of the previous performance work.

But after getting the input tensor caching finished, and before I move on to any other improvements, I think I will invest some time in adding some profiling facilities so I can better direct the efforts and get the best returns.
scheme="http://www.blogger.com/atom/ns#" term="verisilicon"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 16: A nice performance jump</title><content type='html'><p>After the open-source driver for <a href="https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP">VeriSilicon's Vivante NPU</a> was <a href="https://blog.tomeuvizoso.net/2024/01/etnaviv-npu-update-15-we-are-upstream.html">merged into Mesa</a> two weeks ago, I have been taking some rest and thinking about what will come next.</p><h3 style="text-align: left;">Automated testing <br /></h3><p>I have a <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27214">merge request</a> to Mesa almost ready that will enable continuous integration testing on real hardware, but it depends on solving what seem to be problems with the power supplies of the boards in the HW testing lab. <a href="https://www.collabora.com/">Collabora</a> is graciously looking at it. Thanks!</p><h3 style="text-align: left;">Performance<br /></h3><p>I have been talking with quite a few people about the whole effort of bringing open-source to NPU hardware and something that came up more than once is the question of reaching or surpassing the performance level of the proprietary drivers.</p><p>It is a fair concern, because the systolic arrays will be underutilized if they starve of data. And given how fast they are in performing the arithmetic operations, and how slow memory buses and chips on embedded are (related to high-end GPUs, at least), this starving and the consequent underutilization are very likely to happen.<br /></p><p>IP vendors go to great lengths to prevent that from happening, inventing ways of getting the data faster to the processing elements, reducing the memory bandwidth used, and balancing the use of the different cores/arrays. There is plenty of published research on this area, which helps when figuring out how to make the most of a particular piece of hardware.<br /></p><h3 style="text-align: left;">Weight compression <br /></h3><p></p><p>Something I started working on last week is compression of zero values in the weight buffers. 
<a href="https://arxiv.org/abs/2102.00554">Sparsity</a> is very common in the neural models that this hardware is targeted to run, and common convolutions such as strided and depthwise can easily have zero ratios of 90% and more.</p><p>By compressing consecutive zeroes in a buffer we can greatly reduce pressure on the memory bus, keeping the processing units better fed (though I'm sure we are still far from getting good utilization).</p><p>By opportunistically using the 5 available bits to compress consecutive runs of zeroes, I was able to improve the performance of the MobileNetV1 model from 15.7 ms to 9.9 ms, and that of the SSDLite MobileDet model from 56.1 ms to 32.7 ms.</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilf8m0CkxyFeQ7N-8XfsKx6dQjCdBxW1uJaOn2JrsAxAnNSZSLoiAlh-6Jw05edEoykz6U2PsuROOMOMi3-kGqpv-gqBiasERfcUnHOtGiWfQBQtDzhApd7lSU4gL83WkTW5Qzts32f8wPvg6DbZYeZNflL8HdDi9313PQJMR34D2r7Ku7fif2q9TpmLQ/s848/perf_evol.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="431" data-original-width="848" height="326" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilf8m0CkxyFeQ7N-8XfsKx6dQjCdBxW1uJaOn2JrsAxAnNSZSLoiAlh-6Jw05edEoykz6U2PsuROOMOMi3-kGqpv-gqBiasERfcUnHOtGiWfQBQtDzhApd7lSU4gL83WkTW5Qzts32f8wPvg6DbZYeZNflL8HdDi9313PQJMR34D2r7Ku7fif2q9TpmLQ/w640-h326/perf_evol.png" width="640" /></a></div><br /><br /></div><p></p><p>As shown in the graph above, we still have quite some room for improvement before we reach the performance of the proprietary driver, but we are getting close pretty fast. I also believe that we can tailor the driver to user's needs to surpass the performance of the proprietary driver for specific models, as this is open-source and everybody can chip in, see how things are made and improve them.</p><h3 style="text-align: left;">IRC channel</h3><p>I mentioned this in passing some time ago, but now that we have a driver at this level of usefulness, I think it is a good moment to remind that we have an IRC channel in the OFTC network to discuss anything about doing accelerated machine learning on the edge with upstream open-source software: #ml-mainline. You can click <a href="https://webchat.oftc.net/?channels=ml-mainline" target="_blank">here</a> to join via a web interface, though I recommend setting up an account at <a href="https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/">matrix.org</a>.</p><h3 style="text-align: left;">What next</h3><p>Should I continue working on performance? Enable more models for new use cases? Enable this driver on more SoCs (i.MX8MP and S905D3 look interesting)? 
Start writing a driver for a completely different IP, such as Rockchip's or Amlogic's?</p><p>I still haven't decided, so if you have an opinion please drop a comment in this blog, or at any of the social networks linked from this blog.</p><p>I'm currently available for contracting, so I should be able to get on your project full-time on short notice.<br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/2527511145589800731/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=2527511145589800731' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/2527511145589800731'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/2527511145589800731'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/02/etnaviv-npu-update-16-nice-performance.html' title=' Etnaviv NPU update 16: A nice performance jump'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilf8m0CkxyFeQ7N-8XfsKx6dQjCdBxW1uJaOn2JrsAxAnNSZSLoiAlh-6Jw05edEoykz6U2PsuROOMOMi3-kGqpv-gqBiasERfcUnHOtGiWfQBQtDzhApd7lSU4gL83WkTW5Qzts32f8wPvg6DbZYeZNflL8HdDi9313PQJMR34D2r7Ku7fif2q9TpmLQ/s72-w640-h326-c/perf_evol.png" height="72" width="72"/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-8436159336587732290</id><published>2024-01-24T11:52:00.000+01:00</published><updated>2024-01-24T11:52:46.494+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="verisilicon"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 15: We are upstream!</title><content type='html'><p>Today the <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714">initial merge request for Teflon</a> was merged into Mesa, along with the first hardware driver, for <a href="https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP">VeriSilicon's Vivante NPU</a>.</p><p>For those who don't know, <a href="https://docs.mesa3d.org/teflon.html">Teflon</a> is a <a href="https://www.tensorflow.org/lite/performance/delegates">TensorFlow Lite delegate</a> that aims to support several <a href="https://en.wikipedia.org/wiki/AI_accelerator">AI accelerators</a> (also called NPUs, TPUs, APUs, NNAs, etc). 
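<p>From an application's point of view, Teflon plugs in like any other TensorFlow Lite external delegate. A minimal usage sketch in Python (the model file name is just an example, and paths will differ per installation):</p>
<pre>
from tflite_runtime.interpreter import Interpreter, load_delegate

# Load the Teflon delegate so that the operations it supports run on the NPU.
delegate = load_delegate("libteflon.so")
interpreter = Interpreter(model_path="mobilenet_v1_1.0_224_quant.tflite",
                          experimental_delegates=[delegate])
interpreter.allocate_tensors()
</pre>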
Teflon is and will always be open-source, and is released under the <a href="https://en.wikipedia.org/wiki/MIT_License">MIT license</a>.<br /></p><p style="text-align: center;"><a href="https://gitlab.freedesktop.org/uploads/-/system/group/avatar/1155/gears.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="773" data-original-width="773" height="200" src="https://gitlab.freedesktop.org/uploads/-/system/group/avatar/1155/gears.png" width="200" /></a> <br /></p><p>This will have the following advantages for the project:</p><ol style="text-align: left;"><li>The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto, when they update to the next stable version: 24.1.0, which should be out around May 2024. See the <a href="https://docs.mesa3d.org/release-calendar.html">release calendar</a>.<br /></li><li>Contribution to the project will happen within the <a href="https://docs.mesa3d.org/submittingpatches.html">development process of Mesa</a>. This is a well-established process in which employees from companies such as Google, Valve, <a href="https://docs.mesa3d.org/drivers/powervr.html">Imagination</a>, Intel, <a href="https://docs.mesa3d.org/drivers/d3d12.html">Microsoft</a> and <a href="https://docs.mesa3d.org/drivers/radv.html">AMD</a> work together on their GPU drivers.<br /></li><li>The project has great technical infrastructure, maintained by awesome sysadmins:</li><ul><li>A well-maintained <a href="https://gitlab.freedesktop.org/">Gitlab instance</a>,</li><li><a href="https://docs.mesa3d.org/ci/index.html">extensive CI</a>, for both build and runtime testing, on real hardware,</li><li>mailing list, web server, etc.<br /></li></ul><li>More importantly, the Mesa codebase has also infrastructure that will be very useful to NPU drivers:</li><ul><li>The <a href="https://docs.mesa3d.org/nir/index.html">NIR intermediate representation</a> with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowerings.</li><li>The <a href="https://docs.mesa3d.org/gallium/index.html">Gallium internal API</a> that decouples HW-specific frontends from HW-specific drivers. This will be critical as we add support for more NPUs, and also when we expose to other frameworks such as <a href="https://developer.android.com/ndk/guides/neuralnetworks">Android NNAPI</a>.</li></ul><li>And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: <a href="https://www.x.org/wiki/Events/">XDC</a>.<br /></li></ol><div><h3 style="text-align: left;">The story so far</h3><p style="text-align: left;">In 2022, while still at <a href="http://collabora.com/">Collabora</a>, I started adding OpenCL support to the <a href="https://github.com/etnaviv/etna_viv#introduction">Etnaviv</a> driver in Mesa. 
Etnaviv is a userspace and kernel driver for <a href="https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP">VeriSilicon's Vivante NPUs</a>.</p><p style="text-align: left;">The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was ever going to get close to the performance of the proprietary driver by using the programmable part of the NPU.</p><p style="text-align: left;">I dug a bit deeper into how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (<a href="https://en.wikipedia.org/wiki/Systolic_array">systolic arrays</a>, as I realized later).</p><p style="text-align: left;">Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since <a href="https://arxiv.org/abs/1704.04760">Google released their first TPU</a>.</p><p style="text-align: left;">With all this wealth of information and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv project and others that I wrote, and by staring for long hours at all the produced data in spreadsheets.<br /></p><p style="text-align: left;">During the summer and with <a href="https://libre.computer/">Libre Computer</a>'s sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch.</p><p style="text-align: left;">By <a href="https://blog.tomeuvizoso.net/2023/10/etnaviv-npu-update-9-we-got-there.html">autumn</a> I was able to run that same object classification model (<a href="https://arxiv.org/abs/1704.04861">MobileNet V1</a>) 3 times faster than the CPU was able to. A <a href="https://blog.tomeuvizoso.net/2023/11/etnaviv-npu-update-11-now-twice-as-fast.html">month later</a> I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver.</p><p style="text-align: left;">Afterwards I got to work on object detection models, and by the <a href="https://blog.tomeuvizoso.net/2024/01/etnaviv-npu-update-14-object-detection.html">start of 2024</a> I managed to run <a href="https://arxiv.org/abs/2004.14525">SSDLite MobileDet</a> at 56 milliseconds per inference, which is around 3 times slower than what the proprietary driver achieves, but still pretty darn useful in many situations!</p><p style="text-align: left;">The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community.<br /></p><h3 style="text-align: left;">Next steps</h3><p style="text-align: left;">Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect to be spending time reviewing other people's contributions and steering the project. 
People want to get this running on other variants of the VeriSilicon NPU IP and I am certainly not going to be able to do it all!</p><p style="text-align: left;">I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this.</p><p style="text-align: left;">There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them.</p><p style="text-align: left;">There is quite a lot of low-hanging fruit for improving performance, so I expect to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others.</p><p style="text-align: left;">And at some point I should start looking at other NPU IP to add support for. The ones I'm currently leaning the most towards are Rockchip's own IP, MediaTek's, Cadence's and Amlogic's.<br /></p><h3 style="text-align: left;">Thanks</h3><p>One doesn't just start writing an NPU driver on one's own, much less without any documentation, so I need to thank the following people who have helped me greatly in this effort:</p><p><a href="http://collabora.com/">Collabora</a> for allowing me to start playing with this while I still worked with them.</p><p><a href="https://libre.computer/">Libre Computer</a> and specifically Da Xue for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it.</p><p><a href="https://www.igalia.com/">Igalia</a> for letting <a href="https://christian-gmeiner.info/">Christian Gmeiner</a> spend time reviewing all my code and answering my questions about Etnaviv. 
<br /></p><p></p><p style="text-align: left;"><a href="https://embedded-recipes.org/">Embedded Recipes</a> for giving me the opportunity to present my work last autumn in Paris.</p></div><div><p style="text-align: left;">Lucas Stach from <a href="https://www.pengutronix.de/en/index.html">Pengutronix</a> for answering my questions and listening to my problems when I suspected of something in the Etnaviv kernel driver.</p><p style="text-align: left;">Neil Armstrong from <a href="https://www.linaro.org/">Linaro</a> for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs.</p><p style="text-align: left;">And a collective thanks to the DRI/Mesa community for being so awesome!<br /></p><p></p></div></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/8436159336587732290/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=8436159336587732290' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8436159336587732290'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8436159336587732290'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/01/etnaviv-npu-update-15-we-are-upstream.html' title=' Etnaviv NPU update 15: We are upstream!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-6524113164238186020</id><published>2024-01-10T12:14:00.004+01:00</published><updated>2024-01-10T12:14:56.646+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 14: Object detection with decent performance</title><content type='html'><p>When almost two months ago I <a href="https://blog.tomeuvizoso.net/2023/11/etnaviv-npu-update-11-now-twice-as-fast.html">got MobileNetV1 running with useful performance</a> on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach.</p><p>Partial because MobileNetV1 is a quite old model by now and since then several iterations have passed with better accuracy and better performance. Would I be able to, without any documentation, add enough support to run newer models with useful performance?<br /></p><p>Since then, I have been spending some time looking at the state of the art for object detection models. 
I wanted to get a sense of the gap between the features supported by my driver and the operations that the newer models use.</p><p><a href="https://arxiv.org/abs/2004.14525">SSDLite MobileDet</a> is already 3 years old but can still be considered state-of-the-art on most hardware, with good accuracy and low latency.</p><p>The graph structure was more complex than that of MobileNet, and it used tensor addition operations which I didn't support at the time. There were other operations that I didn't support, but those were at the end and could be performed on the CPU without much penalty.</p><p>So after implementing additions along with a few medium-sized refactorings, I got the model running correctly:<br /></p><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsKGZYGx2ISm4TZobIq5OCov58aMRXLldRjrjM2dn0uUxuhChV1-gxt4wzLvEq1WZHe8pbdz4MtXML9oN2UCGvq2K_ncYuKkVnK4AG-_xrRGfARWv3kxBBvG20y5eWzFTWeZGazHFMIqaswvk1hl5kN-xArwD2TqjPj-iZxOPVMKzfx8PPbOagoSldJh0/s1536/test1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1024" data-original-width="1536" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsKGZYGx2ISm4TZobIq5OCov58aMRXLldRjrjM2dn0uUxuhChV1-gxt4wzLvEq1WZHe8pbdz4MtXML9oN2UCGvq2K_ncYuKkVnK4AG-_xrRGfARWv3kxBBvG20y5eWzFTWeZGazHFMIqaswvk1hl5kN-xArwD2TqjPj-iZxOPVMKzfx8PPbOagoSldJh0/w548-h366/test1.jpg" width="548" /></a></div><p></p><p>Performance wasn't that bad at that point: at 129 ms it was twice as fast as the CPU and "only" 5 times slower than the proprietary driver.</p><p>I knew that I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters that the proprietary driver used to program the hardware.</p><p>After a lot of time spent staring at a spreadsheet I came up with a reasonable guess at the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149 ms, so almost 18 inferences can be performed per second.</p><p>If we look at a practical use case such as that supported by <a href="https://frigate.video/">Frigate NVR</a>, a typical frame rate for the video inputs is 5 FPS. With our current performance level, we could run 3-4 inferences on each frame when several objects are being tracked at the same time, or handle 3-4 cameras simultaneously if not.</p><p>Given the price level of the <a href="https://libre.computer/products/aml-a311d-cc/">single board computers that contain the VIPNano</a>, this is quite a good bang for your buck. And all open source and heading to mainline!</p><p><b>Next steps</b></p><p>I have started cleaning up the latest changes so they can be reviewed upstream. 
And need to make sure that the in-flight patches to the kernel are merged now that the window for 6.8 has opened.</p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/6524113164238186020/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=6524113164238186020' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6524113164238186020'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6524113164238186020'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2024/01/etnaviv-npu-update-14-object-detection.html' title=' Etnaviv NPU update 14: Object detection with decent performance'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsKGZYGx2ISm4TZobIq5OCov58aMRXLldRjrjM2dn0uUxuhChV1-gxt4wzLvEq1WZHe8pbdz4MtXML9oN2UCGvq2K_ncYuKkVnK4AG-_xrRGfARWv3kxBBvG20y5eWzFTWeZGazHFMIqaswvk1hl5kN-xArwD2TqjPj-iZxOPVMKzfx8PPbOagoSldJh0/s72-w548-h366-c/test1.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-239856422670257258</id><published>2023-12-21T09:16:00.002+01:00</published><updated>2023-12-21T09:16:49.906+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 13: Don't cross the tensors</title><content type='html'><p></p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq1wKENtMzx01kGsnLXjmoCFGpyA67hSvWs1nAWXBftImNiTWD2dnfWaRWqhROBRcygMum9WfqZFp01ijApbVuwPWbXte4ds5pv2M_GyIcya_Ma0ZJJjoZIwrBk07X60PB7mB2Dp2r0NVtURa81yOHaOMNfS9Sr9avrF92NUfegfcqg5DiU7XAfAHUixQ/s389/1520238648692.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="209" data-original-width="389" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq1wKENtMzx01kGsnLXjmoCFGpyA67hSvWs1nAWXBftImNiTWD2dnfWaRWqhROBRcygMum9WfqZFp01ijApbVuwPWbXte4ds5pv2M_GyIcya_Ma0ZJJjoZIwrBk07X60PB7mB2Dp2r0NVtURa81yOHaOMNfS9Sr9avrF92NUfegfcqg5DiU7XAfAHUixQ/s16000/1520238648692.jpg" /></a></td></tr><tr><td class="tr-caption" style="text-align: 
center;"><span class="ILfuVd" lang="en"><span class="hgKElc"><i>"Don't cross the streams. It would be bad."</i></span></span></td></tr></tbody></table><h4 style="text-align: left;">IR refactorings <br /></h4><p>A big part of what I have been up to in the past two weeks has been a
serious refactoring of the data structures that hold the model data in
the different phases until the HW configuration is generated.</p><p>What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations, and those are linked to each other non-sequentially.</p><p>The image below shows six of the more than a hundred operations in the SSDLite MobileDet model:<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8uT4oTOPviR6_aqbR0KFWycEcCxHBFoptasiS8nfb_2aiJ0XKNBE7BIVjFNBA46LPV204yMIBjrzPkJT_WyWc5k3HUcLLzzAMD9-NWei85UbmKHTgxHTHje8vEIdxQTfAEP9nk7HCWJEtgxpXU3CsrY1xykjiSa9QI35In5amVjFu7OGl8BmUA_j_oQQ/s888/mobiledet_add.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="888" data-original-width="290" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8uT4oTOPviR6_aqbR0KFWycEcCxHBFoptasiS8nfb_2aiJ0XKNBE7BIVjFNBA46LPV204yMIBjrzPkJT_WyWc5k3HUcLLzzAMD9-NWei85UbmKHTgxHTHje8vEIdxQTfAEP9nk7HCWJEtgxpXU3CsrY1xykjiSa9QI35In5amVjFu7OGl8BmUA_j_oQQ/w210-h640/mobiledet_add.png" width="210" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A small subsection of SSDLite MobileDet</td></tr></tbody></table><p>The adds will be "lowered", or converted to a special case of convolution, in which the two input tensors are concatenated together as two channels of a single tensor (a small sketch of this trick follows below), and the last convolution in the fragment will need to have its input tensor processed to remove the stride, as the HW doesn't support strided inputs natively. The processing of this tensor will be performed in an additional job that will run on the TP (tensor processing) cores in the NPU.</p><p>As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations.</p><p>For now I have settled on having a separate data structure for the tensors, and having the operations refer to their input and output tensors by their indices in that list. In the future, I think we should move to intermediate representations more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model.</p><p>I will be thinking about this later next year, once I get object detection with SSDLite MobileDet running at a useful performance level. 
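<p>Coming back to the addition lowering mentioned above, here is a tiny NumPy sketch (purely illustrative; the layout and the weight construction are made up and are not what the hardware or the driver actually use) of how an elementwise add can be expressed as a 1x1 convolution over the concatenated channels of the two inputs:</p>
<pre>
import numpy as np

def add_as_pointwise_conv(a, b):
    # Elementwise add of two (C, H, W) tensors expressed as a 1x1 convolution.
    c = a.shape[0]
    x = np.concatenate([a, b], axis=0)              # both inputs stacked along channels: (2C, H, W)
    weights = np.zeros((c, 2 * c))                  # 1x1 convolution weights: (C_out, C_in)
    weights[np.arange(c), np.arange(c)] = 1.0       # take channel i of the first input...
    weights[np.arange(c), c + np.arange(c)] = 1.0   # ...plus channel i of the second input
    # A 1x1 convolution is just a per-pixel matrix multiply over the channel axis.
    return np.einsum('oc,chw->ohw', weights, x)

a = np.random.rand(3, 8, 8)
b = np.random.rand(3, 8, 8)
assert np.allclose(add_as_pointwise_conv(a, b), a + b)
</pre>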
Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of an IR, but if it turns out that operations on tensors aren't a good fit for NIR, then I will think about doing something similar just for them.</p><p>For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high-level operations to GPGPU instructions, probably starting from a standard such as MLIR.</p><h4 style="text-align: left;">Tensor addition</h4><p>I also put some time into putting together all the information I gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of what parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration.</p><h4 style="text-align: left;">Lowering the strides</h4><p>Similarly to the above, there was a lot of staring at a spreadsheet for the parameters of the TP jobs that transform the input tensor of a convolution with a stride different from one.</p><h4 style="text-align: left;">Status and next steps <br /></h4><p>Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection.</p><p>The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.<br /></p><p></p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxIxl-0oWNOqrRirUSUkf7k5b_pYiudHW1aOxIdF5K2MULi1zPldgxEfr2lNi5aZQqfUJ7KpmHFLl6KpWpCC0wbfxDi47I4hswY-p-gfDLsoA68OZfD_9YjxyHqa1maSHXHL9WRKrVik_5haHpLUeRrPwJyeiBwkqAt7iyQxdd7nVrjQYhb-4Z0esauK0/s21360/ssdlite_mobiledet_coco_qat_postprocess.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="21360" data-original-width="3392" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxIxl-0oWNOqrRirUSUkf7k5b_pYiudHW1aOxIdF5K2MULi1zPldgxEfr2lNi5aZQqfUJ7KpmHFLl6KpWpCC0wbfxDi47I4hswY-p-gfDLsoA68OZfD_9YjxyHqa1maSHXHL9WRKrVik_5haHpLUeRrPwJyeiBwkqAt7iyQxdd7nVrjQYhb-4Z0esauK0/w102-h640/ssdlite_mobiledet_coco_qat_postprocess.png" width="102" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The whole of SSDLite MobileDet<br /></td></tr></tbody></table><br />&nbsp;<p></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/239856422670257258/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=239856422670257258' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/239856422670257258'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/239856422670257258'/><link rel='alternate' type='text/html' 
href='https://blog.tomeuvizoso.net/2023/12/etnaviv-npu-update-13-dont-cross-tensors.html' title=' Etnaviv NPU update 13: Don't cross the tensors'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq1wKENtMzx01kGsnLXjmoCFGpyA67hSvWs1nAWXBftImNiTWD2dnfWaRWqhROBRcygMum9WfqZFp01ijApbVuwPWbXte4ds5pv2M_GyIcya_Ma0ZJJjoZIwrBk07X60PB7mB2Dp2r0NVtURa81yOHaOMNfS9Sr9avrF92NUfegfcqg5DiU7XAfAHUixQ/s72-c/1520238648692.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-8772896522615830396</id><published>2023-12-06T11:21:00.001+01:00</published><updated>2023-12-06T11:22:57.008+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 12: Towards SSDLite MobileDet</title><content type='html'><p>During these last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I'm aiming to <a href="https://arxiv.org/abs/2004.14525">MobileDet</a>, which though a bit old by now (it was introduced in 2020) is still the state of the art in object detection in mobile, used in for example <a href="https://frigate.video/">Frigate NVR</a>.</p><p>I haven't mentioned it in a few updates, but all this work keeps being sponsored by <a href="https://libre.computer/">Libre Computer</a>, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. 
Check out <a href="https://libre.computer/products/aml-a311d-cc/">Alta</a> and <a href="https://libre.computer/products/aml-s905d3-cc/">Solitude</a> for the first such boards in the market.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://libre.computer/api/products/aml-a311d-cc/gallery/1.webp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="704" data-original-width="800" height="282" src="https://libre.computer/api/products/aml-a311d-cc/gallery/1.webp" width="320" /></a></div><p></p><h3 style="text-align: left;">Upstreaming</h3><div style="text-align: left;"><p>Igalia's Christian Gmeiner has been giving me great feedback at the <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714">merge request</a>, and as part of that I <a href="https://lore.kernel.org/lkml/20231116140910.1613508-1-tomeu@tomeuvizoso.net/T/#m3047ef1f33ee2ccdfeeaaa38bb8dfd0cfca95bab">submitted a patch</a> to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded.&nbsp;</p><p>This means that upstreaming to Mesa loses some urgency as we are anyway going to have to wait for the merge window for 6.8 opens, after 6.7 final is out.<br /></p></div><h3 style="text-align: left;">Convolutions with 5x5 weights</h3><p>Until now I had implemented support only for weights with dimensions 1x1 (aka <a href="https://arxiv.org/abs/1712.05245">pointwise convolutions</a>) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors in the format that the hardware expects.</p><p>I implemented this for all kind of supported convolutions: depthwise, strided, with padding, etc.<br /></p><h3 style="text-align: left;">Tensor addition</h3><p>I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.</p><p>This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.<br /></p><h3 style="text-align: left;">Softmax pooling</h3><p>One more missing operation commonly used in models for mobile is pooling, in its different kinds: average, max, etc.</p><p>The blob implements these operations on the programmable core, with CL-like kernels.</p><p>So I undusted the work that I did in the <a href="https://blog.tomeuvizoso.net/2023/04/a-long-overdue-update.html">first half of 2023</a> and added code to Teflon for passing these operations to the Gallium drivers. Then added a new kind of operation to the ML backend in&nbsp;Etnaviv to make use of the programmable core.</p><p>Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. 
The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.</p><p>With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.<br /></p><h3 style="text-align: left;">A new test suite</h3><p>With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.</p><p>To get around that, I reimplemented the tests in C++ with <a href="https://en.wikipedia.org/wiki/Google_Test">GoogleTest</a>, which is supported by Emma Anholt's <a href="https://gitlab.freedesktop.org/anholt/deqp-runner">deqp-runner</a> and will allow me to run the tests in parallel, making full use of the CPU cores on the board.</p><p>That made a big difference, but with so many testing combinations being added (+3000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.</p><p>With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.</p><p>The only problem was that I started finding some tests that were randomly failing, especially when the cache of test results had already been brought into the filesystem cache on the board. After a lot of scratching my head, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.</p><p>There is a <a href="https://elixir.bootlin.com/linux/v6.6.4/source/drivers/gpu/drm/etnaviv/etnaviv_sched.c#L16">kernel module parameter</a> to set the number of jobs that are submitted to the hardware at any given point, and setting that to 1 took me back to rock-solid test results, which is an absolute necessity for keeping the driver author's sanity.<br /></p><h3 style="text-align: left;">Next steps</h3><p>I have quickly added support for a lot of new operations and parameter combinations and the code is not as clean as I would like, in part due to the need for some refactoring.</p><p>So in the next few days I will be investing some time in cleaning things up, and afterwards I will move on to more operations in MobileDet.</p><p style="text-align: left;"><br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/8772896522615830396/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=8772896522615830396' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8772896522615830396'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8772896522615830396'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/12/etnaviv-npu-update-12-towards-ssdlite.html' title=' Etnaviv NPU update 12: Towards SSDLite MobileDet'/><author><name>Tomeu 
Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-1719986941793663440</id><published>2023-11-17T08:46:00.001+01:00</published><updated>2023-12-06T09:03:01.270+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 11: Now twice as fast!</title><content type='html'><h1 style="text-align: left;">Progress</h1><div style="text-align: left;">&nbsp;</div><div style="text-align: left;">This update's highlight is that last week I finally got the TP jobs working, which allows us to make the tensor manipulation in the HW, removing 18ms from the tensor preprocessing. We can currently use them for transposing tensors from the format that TensorFlow prefers to that which the HW expects and the other way around, and for lowering strided convolutions to regular ones.<br /></div><div style="text-align: left;">&nbsp;</div><div style="text-align: left;">This makes our image classification benchmark twice as fast, as expected:<br /></div><p><span style="font-family: courier;">tomeu@arm-64:~/mesa$ ETNA_MESA_DEBUG=ml_msgs python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so<br />Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}<br /><b>Running the NN job took 13 ms.</b><br />0.866667: military uniform<br />0.031373: Windsor tie<br />0.015686: mortarboard<br />0.007843: bow tie<br />0.007843: academic gown<br /><b>time: 15.650ms</b><br /></span></p><div style="text-align: left;">60 FPS is already quite interesting for many use cases, but the proprietary driver is able to do the same at around 8 ms, so there is still plenty of room for improvements.</div><div style="text-align: left;">&nbsp;</div><div style="text-align: left;">Some preliminary testing indicates that enabling zero-run length compression in the weight buffers will make the biggest difference, so that is what I will be working on when I get back to performance work.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Additionally, I also got some experimental jobs running on the programmable core in this NPU, which will allow us to run more advanced models, which tend to use operations that the hardware couldn't be designed for back then.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Upstreaming is going well, those interested can follow it here:</div><div style="text-align: left;">&nbsp;</div><div style="text-align: left;"><a 
href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714">https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714</a>.<br /></div><div style="text-align: left;">&nbsp;</div><h1 style="text-align: left;">Next steps</h1><div style="text-align: left;">&nbsp;</div><p>These will be my priorities during the next couple of weeks, in order:</p><ol style="text-align: left;"><li>Upstreaming</li><li>Get the Mobilenet SSD V1 model running on the HW, for object detection<br /></li><li>Performance<br /></li></ol></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/1719986941793663440/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=1719986941793663440' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1719986941793663440'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1719986941793663440'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/11/etnaviv-npu-update-11-now-twice-as-fast.html' title=' Etnaviv NPU update 11: Now twice as fast!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-5685163152487206629</id><published>2023-11-06T10:30:00.003+01:00</published><updated>2023-12-06T09:02:52.226+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 10: Upstreaming and TP jobs update</title><content type='html'><p>&nbsp;If you remember the <a href="https://blog.tomeuvizoso.net/2023/10/etnaviv-npu-update-9-we-got-there.html">last update</a> two weeks ago, I got MobileNetV1 working with good performance, and I was planning to move to upstreaming my changes to the Linux kernel and <a href="https://www.mesa3d.org/">Mesa</a>.</p><p>One of the kernel patches is now queued for the 6.7 release of the Linux kernel, and the other one has just been resent for reviews.</p><p>Regarding Mesa, I have made several cleanups and have started getting great <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714">review comments</a> from <a href="https://github.com/austriancoder">Christian Gmeiner</a>.</p><p>While waiting for feedback, I have started work on using the TP cores for tensor manipulation, which should be many times faster&nbsp; than the naive code I was running on the CPU for this.</p><p>Got some jobs 
producing the correct results, but I'm facing a problem with the GPU hanging right afterwards. Have already made a pass at the whole set of data that is sent to the HW (unit configuration, command stream and registers), but haven't found yet the problem. I will next improve the tooling around this and get a better view of the differences.</p><p>I hacked Mesa to use the out-of-tree driver and my code works that way, so it has to be something at the kernel driver.</p><p>During the next weeks I will keep incorporating feedback and see how I can fix the GPU hang on TP jobs.<br /></p><p><br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/5685163152487206629/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=5685163152487206629' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5685163152487206629'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5685163152487206629'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/11/etnaviv-npu-update-10-upstreaming-and.html' title=' Etnaviv NPU update 10: Upstreaming and TP jobs update'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-5705381674930396395</id><published>2023-10-23T09:16:00.005+02:00</published><updated>2024-01-24T10:16:06.403+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 9: We got there!</title><content type='html'><h1 style="text-align: left;">Progress</h1><div style="text-align: left;">Since the last update I finally got the whole of MobileNetv1 running at full-accuracy on the NPU with Mesa:&nbsp;</div><div class="separator" style="clear: both; text-align: center;"><a href="https://coral.ai/static/docs/images/grace_hopper.bmp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="606" data-original-width="517" height="200" src="https://coral.ai/static/docs/images/grace_hopper.bmp" width="171" /></a></div><div style="text-align: left;"><span style="font-family: courier;"><blockquote>tomeu@arm-64:~/mesa$ python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so<br />Loading external delegate from libteflon.so with args: {}<br 
/>Processing the input took <b>18 ms.</b><br />Running the NN job took <b>13 ms.</b><br />Processing the output took 1 ms.<br />0.866667: military uniform<br />0.031373: Windsor tie<br />0.015686: mortarboard<br />0.007843: bow tie<br />0.007843: academic gown<br />time: 33.094ms<br /></blockquote></span>That takes us to a performance level around 3 times faster than running the same inference on the CPUs on the A311D SoC.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Most of the time (18 ms.) is spent in my naive manipulation of the input tensor, transposing and reshuffling it to match what the HW expects. Once we learn to do these operations on the 4 tensor manipulation cores, this time should be brought close to zero.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The 13 ms. that the convolutions take in the NPU is still sensibly higher than the 8 ms. that the blob achieves, but the optimizations mentioned in previous updates in this blog should bring us pretty close.</div><div style="text-align: left;">&nbsp;</div><h1 style="text-align: left;">Next steps</h1><p>Now that we have something that people can use in their products, I will switch to upstreaming mode.</p><p>I want to do a few cleanups to the Mesa code and then I will ask for people to review and ack so it can be merged. In the meantime, the draft merge request can be found <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714">here</a>.</p><p>I would also like to have a CI job running to make sure it doesn't regress. But given that we don't use NIR as of yet and the dependencies with the rest of Mesa are minimal, there is probably little need as long as I'm the only person contributing to the code.</p><p><br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/5705381674930396395/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=5705381674930396395' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5705381674930396395'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/5705381674930396395'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/10/etnaviv-npu-update-9-we-got-there.html' title=' Etnaviv NPU update 9: We got there!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-4260046790772956889</id><published>2023-10-06T17:16:00.001+02:00</published><updated>2023-12-06T09:02:37.001+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category 
scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 8: Finally some inference</title><content type='html'><h1 style="text-align: left;">Progress</h1><p>Last week I was a bit distracted with the trip to Paris for the Embedded Recipes conference, but later I have found some time for hacking and got some interesting results out of it.</p><h2 style="text-align: left;">Refactored the Gallium front-end</h2><p>As commented in the <a href="https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-7-summer-is-over.html">previous update</a>, I had found some limits in my testing due to the naive way that the front-end was scheduling jobs to the Gallium hardware-dependent driver.</p><p>I got to basically rewrite it (and removed any C++ remnants, on the way) and moved to a model in which the drivers would compile the operation blocks that they support to a format that can be quickly sent to the hardware.</p><p>As a side effect, I got proper memory management of the workload which allowed me to expand the testing I can do in a reasonable amount of time.</p><p>Also took the chance to rewrite the higher level scheduling data structure so all jobs in the same model partition are sent to the hardware in a single batch, for decreased latency.</p><p>Unfortunately I didn't get to remove copies of input and output tensors because the TensorFlow Lite API for this (TfLiteAsyncKernel) is undocumented and far from trivial. They seem to just be adding stuff on top to abstract whatever the Android folks may end up wanting to do.</p><h2 style="text-align: left;">Got MobileNet V1 to run</h2><div style="text-align: left;">As part of the refactoring&nbsp; from above, I got multiple operations in the same model to work, which got us to correctly running some inferences, even if at low accuracy rates:</div><div style="text-align: left;"><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://commons.wikimedia.org/w/index.php?curid=285598" target="_blank"><span style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQiQSHVRGw-EMpuIKA6jxXH-ss_HgutqwgUYXvCg4tPMRq9Js2q7l0NGILTcRlBqDfUOMhKNdzAALj1E8dPN2zxd6aOK59OeO9f5ac0vaWuaEvDEl_EQLu6rd-887qRrMH_7tgG4_oSubzgI2_GCvVD5ck6ukwErppZc1AQ5RawYqzrcB-mec905-jYpI/s320/hen.jpg" width="320" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;">by Julien Langlois CC BY-SA 3.0<br /></td></tr></tbody></table><br /><p></p><blockquote><span style="font-family: courier;">tomeu@arm-64:~/mesa$ LD_PRELOAD=libtensorflow_lite.so python3.10 class_device.py -i hen.bmp -m mobilenet_v1_0.25_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so</span> <br /></blockquote><blockquote><span style="font-family: courier;">Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}<br />tflite_plugin_create_delegate<br />Teflon delegate: loaded etnaviv driver<br />INFO: Initialized TensorFlow Lite runtime.<br />PrepareDelegate<br />VERBOSE: Replacing 27 out of 31 node(s) with delegate (Teflon Delegate) node, yielding 2 
partitions for the whole graph.<br /><b>0.960784: hen</b><br />0.015686: cock<br />0.007843: goose<br />0.003922: Pembroke<br />0.003922: Ibizan hound<br />time: 22.802ms<br />tflite_plugin_destroy_delegate</span></blockquote><p>This matched bit by bit the output from the blob, even if I was doing some tensor operations by hand, on the CPU. That also causes it to run far too slowly. We should be able to get that down to around 5ms once we learn how to drive the TP units for tensor manipulation.</p><h2 style="text-align: left;">Presented this work at Embedded Recipes 2023</h2><p>Tired of only writing about all this in this blog, I took the chance given to me by Kevin Hilman to present it in front of a captive audience.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5slClDUZ5VBIFkJpgfL4Ng92TjVjFAvulkPXBCw8kT_iEDdZN3ph8uTma65Cd7d6-5z4YxYmQZc2NqStG3RGhllCuL30lJVp1XukKiS2qZQUpcOYY-m5A3RXQ4KiUYeDfVZ122lWfUg1_yZMpyZf2bbaNERyfzC6W7U3oGhXcQgxwXV6DWUOy9t20ajo/s4000/F7LiLViWUAA4PaR.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="3000" data-original-width="4000" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5slClDUZ5VBIFkJpgfL4Ng92TjVjFAvulkPXBCw8kT_iEDdZN3ph8uTma65Cd7d6-5z4YxYmQZc2NqStG3RGhllCuL30lJVp1XukKiS2qZQUpcOYY-m5A3RXQ4KiUYeDfVZ122lWfUg1_yZMpyZf2bbaNERyfzC6W7U3oGhXcQgxwXV6DWUOy9t20ajo/s320/F7LiLViWUAA4PaR.jpg" width="320" /></a></div><br /><p>You can find the <a href="https://embedded-recipes.org/2023/schedule/accelerated-ml-at-the-edge-with-mainline/">slides here</a>, and listen to the talk at:</p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="334" src="https://www.youtube.com/live/s5_BZdljpqc?feature=shared&t=2340" width="560" youtube-src-id="s5_BZdljpqc"></iframe></div><br /><p><br /></p><h1 style="text-align: left;">Next steps</h1><p>The <a href="https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-7-summer-is-over.html">previous update</a> got more in deep into what is left to do in the medium term, so I will just mention what I plan to do in the immediate future:</p><ol style="text-align: left;"><li>Get input and output channels working at the 512 level, so we can run a higher accuracy version of the MobileNet V1 network</li><li>Learn to use the TP units to remove those costly transpositions and reshuffles in the CPU (at this point, we would have something useful to people on the field)<br /></li><li>Upstream changes to the Linux kernel</li><li>Propose Teflon to the Mesa folks<br /></li></ol><p></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/4260046790772956889/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=4260046790772956889' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/4260046790772956889'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/4260046790772956889'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/10/etnaviv-npu-update-8-finally-some.html' title=' Etnaviv NPU update 8: Finally some inference'/><author><name>Tomeu 
Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQiQSHVRGw-EMpuIKA6jxXH-ss_HgutqwgUYXvCg4tPMRq9Js2q7l0NGILTcRlBqDfUOMhKNdzAALj1E8dPN2zxd6aOK59OeO9f5ac0vaWuaEvDEl_EQLu6rd-887qRrMH_7tgG4_oSubzgI2_GCvVD5ck6ukwErppZc1AQ5RawYqzrcB-mec905-jYpI/s72-c/hen.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-4728184777760523572</id><published>2023-09-26T13:37:00.007+02:00</published><updated>2023-12-06T09:03:10.483+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 7: Summer is over</title><content type='html'><h1 style="text-align: left;">Progress</h1><p style="text-align: left;">With the kids back in school I have been able to work on the <a href="https://www.verisilicon.com/en/IPPortfolio/VivanteNPUIP">Vivante VIP NPU</a> driver full-time during the two weeks after the <a href="https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-6-almost-there.html">last update</a>, with quite some work coming out of the pipeline:</p><h2 style="text-align: left;">Found the problem with enabling the 8th NN core</h2><div style="text-align: left;">Though I don't know exactly yet what the problem is, I found that by going back to a <a href="https://gitlab.freedesktop.org/tomeu/linux/-/commit/af365186ab305d2fa3e91145ac79d2569b9df2a5">previous brute-force approach</a> to powering up the NPU, the 8th core works just fine.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">For now this unblocks the work and gets me closer to the initial goal of running a MobileNetv1 inference and seeing what the performance is like, so I'm leaving a proper fix for this for later.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">I bet there's either a register that is being written in the wrong order, or a delay between register writes that is too short. 
Will have to delve into the power domain subsystem and/or the common clock framework in the Linux kernel to fix this one.<br /></div><div style="text-align: left;"></div><h2 style="text-align: left;">Added support for depthwise convolutions</h2><div style="text-align: left;"><a href="https://arxiv.org/abs/1704.04861">MobileNetV1</a> introduced Separable Depthwise Convolutions (see the linked paper for an in-depth description), which are layers that contain a <a href="https://paperswithcode.com/method/depthwise-convolution">depthwise convolution</a> to process each depth level separately, plus a <a href="https://paperswithcode.com/method/pointwise-convolution">pointwise convolution</a> to rejoin them again. This offers the same result with 23x less multiplications, so it's very attractive for mobile use-cases.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">This hardware doesn't support depthwise convolutions directly, but we can lower them to regular convolutions after modifying the weight tensor to cover each IFM/depth separately.<br /></div><h2 style="text-align: left;">Added support for pointwise convolutions</h2><div style="text-align: left;">For the second half of a Separable Depthwise Convolution, I just had to take into account that 1x1 kernels are packed in a different format in memory, as otherwise it would be very inefficient for each NN core to pull each 1-byte kernel separately from the memory bus.<br /></div><h2 style="text-align: left;">Added support for unsigned weights</h2><div style="text-align: left;">TensorFlow Lite has moved towards implementing a new <a href="https://www.tensorflow.org/lite/performance/quantization_spec#signed_integer_vs_unsigned_integer">quantization specification</a> which gives preference to signed weights because of convenience, as symmetric quantization is simpler to implement. 
Unfortunately for us, our hardware works natively with unsigned weights so we would need to convert them if we were to use TFLite's new quantization.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">But the models that Google themselves publish make use of the ancient tooling that still support the old, unsigned quantization scheme, so I had to find a way of producing models with unsigned quantization for our test suite, to match what MobileNetV1 does.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">That also implied moving to per-tensor quantization, instead of per-axis.<br /></div><h2 style="text-align: left;">Added support for higher IFMs and OFMs (up to 256 each)</h2><div style="text-align: left;">In the previous update I explained how support for multiple input and output channels (or feature maps) was added, but I wasn't able to test with more than 7 output channels because the 8th NN core was MIA.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">With that solved, I was able to see what would be needed for convolutions with higher channel counts, such as those that MobileNetV1 use (32, 64, 128, 256, 512 and 1024).</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Each level implied revisiting the tiled format in which weights and biases are laid out in memory, making it more and more complex.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">I got to 256, with 512 and 1024 bringing more changes in the tiled format that I still need to reverse engineer.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><h1 style="text-align: left;">Next steps</h1><h2 style="text-align: left;">Model partition compilation and resource management<br /></h2><div style="text-align: left;">I'm facing problems with testing coverage as we support so many different parameters that need to be tested in combination, with a explosion in the number of individual tests. Because of the hacky current state of the TFLite delegate (and Gallium state tracker) I'm not able to run all the tests because I don't have proper resource management implemented and so we reach OOM before the end.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">So my next task after I get back from <a href="https://embedded-recipes.org/2023/">Embedded Recipes</a> will be to refactor the delegate implementation so we have a proper compilation of the model partitions. These will own the weight+bias buffers as well as the intermediate tensors, with each inference just feeding an input tensor to the partition and retrieving an output tensor at the end.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">This will allow me to scale up the automated testing further, so I can keep adding new features with confidence, knowing that I'm not adding regressions.</div><div style="text-align: left;"><h2 style="text-align: left;">Move development to Cottonwood A311D board</h2></div><div style="text-align: left;">Da Xue of <a href="https://libre.computer/">LibreComputer</a> has got Etnaviv and Teflon working on the <a href="https://hub.libre.computer/t/2023-09-25-libre-computer-aml-a311d-cc-alta-ai-sbc-announcement-pre/2905">new boards</a> that his company is releasing soon. One of them contain a A311D SoC, the same as the VIM3 I'm currently using for development. 
I will be initially targeting that one, and later make sure that it also works on the Cottonwood boards that will have the S905D3 SoC, which has a VIP Pico instead of a VIP Nano.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Besides being in general a great FOSS champion and specifically being supportive of ML inference with open source, Da is directly sponsoring this work, so I look forward to meet him in Paris this week and exchange notes.<br /></div><div style="text-align: left;"><h2 style="text-align: left;">Bigger coefficient tensors</h2></div><div style="text-align: left;">The last known features missing before being able to run MobileNetV1 are IFMs and OFMs of 512 and 1024, each.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Hopefully it will only require some further tweaking of the tiled memory representation of the coefficient buffer.</div><div style="text-align: left;"></div><div style="text-align: left;"><h2 style="text-align: left;">Medium term goals</h2></div><div style="text-align: left;">I don't expect performance to be that great yet, so I plan on switching the focus to it after the above has been accomplished. I expect for the features below making the most impact in improving performance:</div><div style="text-align: left;"><ol style="text-align: left;"><li>Avoid copies in and out of the model partition, by mapping user buffers to the NPU</li><li>Use the TP units for tensor manipulation (transposing, mostly)</li><li>Properly configuring the automatic caching of kernels and images in the internal on-chip SRAM</li><li>Use the external SRAM for intermediate tensor data</li><li>Chain all TP and NN jobs in a model partition in the same command stream</li><li>Enable zero-run-length compression in the coefficient buffer<br /></li><li>Tune the tiling parameters for reduced memory bandwidth usage</li></ol></div></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/4728184777760523572/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=4728184777760523572' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/4728184777760523572'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/4728184777760523572'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-7-summer-is-over.html' title=' Etnaviv NPU update 7: Summer is over'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-1682105245552353397</id><published>2023-09-07T18:19:00.007+02:00</published><updated>2023-12-06T09:03:19.411+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category 
scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 6: Almost there!</title><content type='html'><h2 style="text-align: left;">Progress</h2><p>&nbsp;This week started quite fruitfully, these features were added:</p><ul style="text-align: left;"><li>Convolutions with multiple input and output channels (input and output feature maps)</li><li><a href="https://keras.io/api/layers/convolution_layers/convolution2d/">"Same"</a> padding in convolutions</li></ul><p>And with this we should have all the features we need to run a model such as MobileNet v1 and get some performance numbers to guide the next steps.</p><h2 style="text-align: left;">One more roadblock <br /></h2><p>Only that the NPU hangs when I try to use the 8th core... and this is required to run most detection models, as they start by convoluting the input to 32 feature maps. <br /></p><p>Have checked and we are sending to the kernel bit-identical command streams and input buffers, so I suspect the problem will be somewhere in the kernel.</p><p>So I plan to instrument the out-of-tree kernel driver and get some register and command stream dumps, in the hope that there is some bit in a magic register somewhere that I need to flip.</p><h2 style="text-align: left;">Want to try it out?</h2><p>I'm not really looking forward to such work, so I decided to first invest some time cleaning things up a bit to make it easier for other people to play with this if they wish.</p><p>I have removed from my branch everything from my previous attempt at using OpenCL and have written some documentation about how to run the TensorFlow Lite delegate:</p><p><a href="https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst">https://gitlab.freedesktop.org/tomeu/mesa/-/blob/teflon/docs/teflon.rst</a></p><p>You will need a VIM3 board, a recent mainline kernel and a Debian testing rootfs.</p><p><br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/1682105245552353397/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=1682105245552353397' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1682105245552353397'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/1682105245552353397'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/09/etnaviv-npu-update-6-almost-there.html' title=' Etnaviv NPU update 6: Almost there!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' 
src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-3023393172299765340</id><published>2023-08-24T12:45:00.000+02:00</published><updated>2023-08-24T12:45:36.117+02:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="librecomputer"/><category scheme="http://www.blogger.com/atom/ns#" term="machine-learning"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>Etnaviv NPU update 5: Harder convolutions!</title><content type='html'><h2 style="text-align: left;">Progress <br /></h2><p>Managed to squeeze some time between holidaying to hack on the NPU driver and got something out of it.</p><p>Since the <a href="https://blog.tomeuvizoso.net/2023/08/etnaviv-npu-update-4-its-convoluting.html">last update</a> I have:</p><ul style="text-align: left;"><li> implemented support for strided convolutions with more than one input channel, and</li><li>Implemented support for more than one output channel, but for now only for a single input channel.</li></ul><p>Next steps are&nbsp; to support convolutions with multiple input and output channels, and padding. Then see what is still missing so we can run MobileNet v1 and check the performance when using the NN units and doing the rest on the CPU.</p><p>As a reminder, I'm pushing all the code to this branch: <a href="https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/">https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/</a>.<br /></p><h2 style="text-align: left;">IRC channel</h2><p>A bunch of us have started to gather in the #ml-mainline IRC channel in OFTC to disucss matters about doing accelerated ML with mainline, on embedded.</p><p>For those of you that may not have a IRC bouncer setup yet, you can easily join with the <a href="https://webchat.oftc.net/">web chat UI</a>, but in case others aren't in front of the keyboard when you type your question, I recommend using element.io with the Matrix IRC bridge:<br /><br /><a href="https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/">https://blog.christophersmart.com/2022/03/21/joining-a-bridged-irc-network-on-element-matrix/</a></p><h2 style="text-align: left;">Embedded recipes</h2><p>I have been invited to give a talk about all this ML with mainline effort at <a href="https://embedded-recipes.org/2023/">Embedded Recipes 2023</a>, Paris 28-29 September. Slides and a recording will be published after the conference ends.</p><h2 style="text-align: left;">Sponsor</h2><p>Last but not least, if I am able to invest so much effort on this is because the folks at <a href="https://libre.computer/">LibreComputer</a> have been supporting me financially this last couple of months.</p><p>Thanks to <a href="https://twitter.com/librecomputer">Da Xue</a> for his support, it is greatly appreciated! 
It is awesome to see SBC vendors investing in the Linux upstream ecosystem.<br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/3023393172299765340/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=3023393172299765340' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/3023393172299765340'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/3023393172299765340'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/08/etnaviv-npu-update-5-harder-convolutions.html' title='Etnaviv NPU update 5: Harder convolutions!'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-745522949949199487</id><published>2023-08-07T18:52:00.002+02:00</published><updated>2023-12-06T09:02:13.799+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'> Etnaviv NPU update 4: It's convoluting! 
</title><content type='html'><p><span style="font-family: inherit;">Summer has kept me busy with holidays, but I have managed to find a bit of time to keep hacking on the driver for the VeriSilicon NPU since the <a href="https://blog.tomeuvizoso.net/2023/06/etnaviv-npu-update-3-deeper-into.html">last update</a>.</span></p><h2 style="text-align: left;"><span style="font-family: inherit;">TL;DR</span></h2><p><span style="font-family: inherit;">The issue with placing the output to the right scale is solved now, and simple convolution operations are working just fine.</span></p><p><span style="font-family: inherit;">3D tensors are now supported as inputs, and we support strided convolutions as well, but only on 2D inputs for now.</span></p><p><span style="font-family: inherit;">The test workloads are running fast and stably now, so I now feel I have pretty solid ground beneath my feet.</span></p><p><span style="font-family: inherit;">There are three features left before I can run a real, full-fledged commercially interesting model:</span></p><ol style="text-align: left;"><li><span style="font-family: inherit;">3D inputs for strided convolutions</span></li><li><span style="font-family: inherit;">Multiple output channels</span></li><li><span style="font-family: inherit;">Padded convolutions</span></li></ol><h2 style="text-align: left;"><span style="font-family: inherit;">Re-quantization</span></h2><p><span style="font-family: inherit;">The last update in this blog was left at my attempt at figuring out how the convolution raw outputs had to be processed with fields called post_shift and post_multiplier so I could get the right values in the final output.</span></p><p><span style="font-family: inherit;">After spending more time than I should probably have in a spreadsheet trying to find correlations, some desperate googling brought me to some research papers about optimizing quantization operations on integer-only hardware:</span></p><ul style="text-align: left;"><li><span style="font-family: inherit;"><a href="https://arxiv.org/pdf/2106.00127.pdf">Integer-Only Neural Network Quantization Scheme<br />Based on Shift-Batch-Normalization</a></span></li><li><span style="font-family: inherit;"><a href="https://arxiv.org/pdf/1712.05877.pdf"><span dir="ltr" role="presentation" style="font-size: calc(var(--scale-factor)*14.35px); left: 18.84%; top: 13.41%; transform: scaleX(0.902854);">Quantization and Training of Neural Networks for Efficient</span><br role="presentation" /><span dir="ltr" role="presentation" style="font-size: calc(var(--scale-factor)*14.35px); left: 31.27%; top: 15.69%; transform: scaleX(0.907723);">Integer-Arithmetic-Only Inference</span></a></span></li></ul><p><span dir="ltr" role="presentation" style="font-family: inherit; font-size: calc(var(--scale-factor)*14.35px); left: 31.27%; top: 15.69%; transform: scaleX(0.907723);">That explains the meaning of the shift and multiplier, as these are the operations we can use to approximate the floating point division on integer hardware.</span></p><p><span dir="ltr" role="presentation" style="font-family: inherit; font-size: calc(var(--scale-factor)*14.35px); left: 31.27%; top: 15.69%; transform: scaleX(0.907723);">But to actually understand what the hardware was trying to do with them, it was useful to look at the QNNPACK implementation of requantization.</span></p><h2 style="text-align: left;"><span dir="ltr" role="presentation" style="font-family: inherit; font-size: calc(var(--scale-factor)*14.35px); left: 31.27%; top: 15.69%; transform: 
scaleX(0.907723);">3D input tensor</span></h2><p><span dir="ltr" role="presentation" style="font-family: inherit; font-size: calc(var(--scale-factor)*14.35px); left: 31.27%; top: 15.69%; transform: scaleX(0.907723);">This was pretty much straightforward, as was basically a matter of updating the code to take into account the added dimension, and also reorder the tensor elements as the hardware expects depth first order.</span></p><p><span dir="ltr" role="presentation" style="font-family: inherit; font-size: calc(var(--scale-factor)*14.35px); left: 31.27%; top: 15.69%; transform: scaleX(0.907723);">This was made much easier by some improvements to the scripts I use to observe the behavior of the closed source stack, by intercepting the communication with the kernel's GPL driver.</span></p><p><span dir="ltr" role="presentation" style="font-family: inherit; font-size: calc(var(--scale-factor)*14.35px); left: 31.27%; top: 15.69%; transform: scaleX(0.907723);">For example, this is the output when Mesa has generated a cmd stream that is functionally equivalent to what the blob sends to the kernel:</span></p><blockquote><span style="font-family: inherit;">+ diff -u -U 100 /home/tomeu/mesa.txt /home/tomeu/galcore.txt<br />--- /home/tomeu/mesa.txt&nbsp;&nbsp;&nbsp; 2023-08-07 18:28:29.939750225 +0200<br />+++ /home/tomeu/galcore.txt&nbsp;&nbsp;&nbsp; 2023-08-07 18:28:42.116625362 +0200<br />@@ -1,176 +1,273 @@<br />&nbsp;{<br />-&nbsp;&nbsp;&nbsp; 0x0801028a, /* LOAD_STATE (1) Base: 0x00A28 Size: 1 Fixp: 0 */<br />-&nbsp;&nbsp;&nbsp; 0x00000011, /*&nbsp;&nbsp; PA.SYSTEM_MODE := PROVOKING_VERTEX_LAST=1,HALF_PIXEL_CENTER=1 */<br />-&nbsp;&nbsp;&nbsp; 0x08010e13, /* LOAD_STATE (1) Base: 0x0384C Size: 1 Fixp: 0 */<br />-&nbsp;&nbsp;&nbsp; 0x00000002, /*&nbsp;&nbsp; GL.API_MODE := OPENCL */<br />+&nbsp;&nbsp;&nbsp; 0x00000000, /* UNKNOWN (0) */<br />+&nbsp;&nbsp;&nbsp; 0x00000000, /*&nbsp; */<br />+&nbsp;&nbsp;&nbsp; 0x00000000, /* UNKNOWN (0) */<br />+&nbsp;&nbsp;&nbsp; 0x00000000, /*&nbsp; */<br />+&nbsp;&nbsp;&nbsp; 0x00000000, /* UNKNOWN (0) */<br />+&nbsp;&nbsp;&nbsp; 0x00000000, /*&nbsp; */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000000, /* UNKNOWN (0) */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000000, /*&nbsp; */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x08010e4f, /* LOAD_STATE (1) Base: 0x0393C Size: 1 Fixp: 0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000000, /*&nbsp;&nbsp; GL.OCB_REMAP_START := 0x0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x08010e50, /* LOAD_STATE (1) Base: 0x03940 Size: 1 Fixp: 0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000000, /*&nbsp;&nbsp; GL.OCB_REMAP_END := 0x0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x08010e4c, /* LOAD_STATE (1) Base: 0x03930 Size: 1 Fixp: 0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000010, /*&nbsp;&nbsp; GL.NN_CONFIG := UNK0=0x0,DISABLE_ZDPN=0,DISABLE_SWTILING=0,SMALL_BATCH=1,DDR_BURST_SIZE=0x0,UNK7=0,NN_CORE_COUNT=0x0,UNK12=0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x08010428, /* LOAD_STATE (1) Base: 0x010A0 Size: 1 Fixp: 0 */<br />-&nbsp;&nbsp;&nbsp; 0xffff3000, /*&nbsp;&nbsp; PS.NN_INST_ADDR := *0xffff3000 */<br />+&nbsp;&nbsp;&nbsp; 0x3348e780, /*&nbsp;&nbsp; PS.NN_INST_ADDR := *0x3348e780 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x08010429, /* LOAD_STATE (1) Base: 0x010A4 Size: 1 Fixp: 0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000000, /*&nbsp;&nbsp; 0x010A4 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000c23, /*&nbsp;&nbsp; GL.FLUSH_CACHE := 
DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x08010e03, /* LOAD_STATE (1) Base: 0x0380C Size: 1 Fixp: 0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000c23, /*&nbsp;&nbsp; GL.FLUSH_CACHE := DEPTH=1,COLOR=1,TEXTURE=0,PE2D=0,TEXTUREVS=0,SHADER_L1=1,SHADER_L2=0,UNK10=1,UNK11=1,DESCRIPTOR_UNK12=0,DESCRIPTOR_UNK13=0 */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000000, /* UNKNOWN (0) */<br />&nbsp;&nbsp;&nbsp;&nbsp; 0x00000000, /*&nbsp; */<br />&nbsp;}<br />&nbsp;map-&gt;layer_type = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;no_z_offset = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;kernel_xy_size = 0x2;&nbsp; /* (2) */<br />&nbsp;map-&gt;kernel_z_size = 0x4;&nbsp; /* (4) */<br />&nbsp;map-&gt;kernels_per_core = 0x1;&nbsp; /* (1) */<br />&nbsp;map-&gt;pooling = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;pooling_xy_size = 0x1;&nbsp; /* (1) */<br />&nbsp;map-&gt;prelu = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;nn_layer_flush = 0x1;&nbsp; /* (1) */<br />&nbsp;map-&gt;kernel_data_type = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_data_type = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;out_image_data_type = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_x_size = 0x4;&nbsp; /* (4) */<br />&nbsp;map-&gt;in_image_y_size = 0x4;&nbsp; /* (4) */<br />&nbsp;map-&gt;in_image_x_offset = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_y_offset = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused0 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;brick_mode = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;brick_distance = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;relu = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused1 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;post_multiplier = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;post_shift = 0x17;&nbsp; /* (23) */<br />&nbsp;map-&gt;unused2 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;no_flush = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused3 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;out_image_x_size = 0x3;&nbsp; /* (3) */<br />&nbsp;map-&gt;out_image_y_size = 0x3;&nbsp; /* (3) */<br />&nbsp;map-&gt;out_image_z_size = 0x1;&nbsp; /* (1) */<br />&nbsp;map-&gt;rounding_mode = 0x1;&nbsp; /* (1) */<br />&nbsp;map-&gt;in_image_x_offset_bit_3 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_y_offset_bit_3 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;out_image_tile_x_size = 0x3;&nbsp; /* (3) */<br />&nbsp;map-&gt;out_image_tile_y_size = 0x3;&nbsp; /* (3) */<br />-map-&gt;kernel_address = 0x3fffd00;&nbsp; /* (67108096) */<br />+map-&gt;kernel_address = 0xcd237f;&nbsp; /* (13443967) */<br />&nbsp;map-&gt;kernel_z_size2 = 0x0;&nbsp; /* (0) */<br />-map-&gt;in_image_address = 0xffff6000;&nbsp; /* (4294926336) */<br />-map-&gt;out_image_address = 0xffff7000;&nbsp; /* (4294930432) */<br />+map-&gt;in_image_address = 0x3348e240;&nbsp; /* (860414528) */<br />+map-&gt;out_image_address = 0x89ffc500;&nbsp; /* (2315240704) */<br />&nbsp;map-&gt;image_caching_mode = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;kernel_caching_mode = 0x1;&nbsp; /* (1) */<br />&nbsp;map-&gt;partial_cache_data_unit = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;kernel_pattern_msb = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;kernel_y_size = 0x2;&nbsp; /* (2) */<br />&nbsp;map-&gt;out_image_y_stride = 0x3;&nbsp; /* (3) */<br />&nbsp;map-&gt;kernel_pattern_low = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;kernel_pattern_high = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;kernel_cache_start_address = 0x800;&nbsp; /* (2048) */<br />&nbsp;map-&gt;kernel_cache_end_address = 
0xa00;&nbsp; /* (2560) */<br />&nbsp;map-&gt;image_start_address = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;image_end_address = 0x800;&nbsp; /* (2048) */<br />&nbsp;map-&gt;in_image_border_mode = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_border_const = 0x7d;&nbsp; /* (125) */<br />&nbsp;map-&gt;unused4 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;kernel_data_type_bit_2 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_data_type_bit_2 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;out_image_data_type_bit_2 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;post_multiplier_1_to_6 = 0x1f;&nbsp; /* (31) */<br />&nbsp;map-&gt;post_shift_bit_5_6 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused5 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_x_stride = 0x4;&nbsp; /* (4) */<br />&nbsp;map-&gt;in_image_y_stride = 0x4;&nbsp; /* (4) */<br />&nbsp;map-&gt;out_image_x_stride = 0x3;&nbsp; /* (3) */<br />&nbsp;map-&gt;unused6 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;post_multiplier_7_to_14 = 0x61;&nbsp; /* (97) */<br />&nbsp;map-&gt;out_image_circular_buf_size = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused7 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;per_channel_post_mul = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;out_image_circular_buf_end_addr_plus_1 = 0x3ffffff;&nbsp; /* (67108863) */<br />&nbsp;map-&gt;unused8 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_circular_buf_size = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused9 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;in_image_circular_buf_end_addr_plus_1 = 0x3ffffff;&nbsp; /* (67108863) */<br />&nbsp;map-&gt;unused10 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;coef_zero_point = 0x80;&nbsp; /* (128) */<br />&nbsp;map-&gt;out_zero_point = 0x77;&nbsp; /* (119) */<br />&nbsp;map-&gt;kernel_direct_stream_from_VIP_sram = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;depthwise = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused11 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused12 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused13 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused14 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused15 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;unused16 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;further1 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;further2 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;further3 = 0x3ffffff;&nbsp; /* (67108863) */<br />&nbsp;map-&gt;further4 = 0x7f800000;&nbsp; /* (2139095040) */<br />&nbsp;map-&gt;further5 = 0xff800000;&nbsp; /* (4286578688) */<br />&nbsp;map-&gt;further6 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;further7 = 0x0;&nbsp; /* (0) */<br />&nbsp;map-&gt;further8 = 0x0;&nbsp; /* (0) */<br />&nbsp;&nbsp; 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x2c, 0x99, 0x0e, 0x00, 0x00,<br />&nbsp;&nbsp; 0x40, 0xea, 0x2c, 0xeb, 0x80, 0xaf, 0x80, 0x9b, 0x99, 0x80, 0x80, 0x13,<br />&nbsp;&nbsp; 0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,<br />&nbsp;&nbsp; 0x00, 0x00, 0x00, 0x00<br />&nbsp;&nbsp; 0x69, 0xd3, 0x2d, 0x92, 0x07, 0x00, 0x64, 0x00, 0x0c, 0x22, 0x90, 0xd6,<br />&nbsp;&nbsp; 0x53, 0xc9, 0xe2, 0x48, 0xe6, 0x4c, 0xa8, 0xeb, 0xd2, 0xf3, 0xb0, 0xf4,<br />&nbsp;&nbsp; 0x2d, 0xa4, 0x3e, 0xf4, 0x0f, 0x7b, 0x98, 0x01, 0x41, 0x84, 0x92, 0x7e,<br />&nbsp;&nbsp; 0xfa, 0x19, 0xf5, 0xda, 0xb3, 0x5a, 0xb7, 0xf3, 0x97, 0x95, 0x12, 0xe7,<br />&nbsp;&nbsp; 0x51, 0x94, 0xcb, 0x5a, 0x1f, 0xa9, 0xc6, 0xc4, 0x1c, 0xa9, 0x92, 0x1f,<br />&nbsp;&nbsp; 0xf7, 0x64, 0xc3, 0xca<br />&nbsp;&nbsp; 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77</span></blockquote><p><span style="font-family: inherit;">This corresponds to a convolution with the following parameters:</span></p><ul style="text-align: left;"><li><span style="font-family: inherit;">8x8x1 input tensor</span></li><li><span style="font-family: inherit;">3x3x1 weight tensor</span></li><li><span style="font-family: inherit;">stride == 2</span></li></ul><p><span style="font-family: inherit;">The differences are due to different addresses being allocated between runs, and some differences due to how Mesa's code is structured but that shouldn't affect the end result.&nbsp;</span></p><p><span style="font-family: inherit;">At the top we have the payload of the submit IOCTL, followed by a struct with the configuration for the NN units themselves and then the buffers for the weights, input and output.<br /></span></p><p><span style="font-family: inherit;">When running a convolution configuration that isn't yet supported, we will spot more differences and hopefully will be able to figure out the logic behind them.</span></p><h2 style="text-align: left;"><span style="font-family: inherit;">Strided convolutions</span></h2><p><span style="font-family: inherit;">The hardware doesn't really support strided convolutions, so these are "lowered" to 1-stride convolutions with added channels, as per this research paper:</span></p><ul style="text-align: left;"><li><a href="https://www.arxiv-vanity.com/papers/1712.02502/" style="font-family: inherit;">Take it in your stride: Do we need striding in CNNs?</a><span style="font-family: inherit;"><br /></span></li></ul><p><span style="font-family: inherit;">By implementing the algorithm in the paper, we match the behavior of the blob, as with requantization. It refers only to 2D input tensors, so I will need to check how the blob behaves with 3D inputs and figure out the logic behind it.</span></p><p><span style="font-family: inherit;">For now I have chosen to do the tensor manipulation on the CPU, but later on we will be able to use the TP units in the HW for this, reducing latency. 
<br /></span></p><h2 style="text-align: left;"><span style="font-family: inherit;">Test suite</span></h2><p><span style="font-family: inherit;">With so many different convolution parameters supported, I felt the need for a comfortable way of keeping regressions in check.</span></p><p><span style="font-family: inherit;">I wrote a simple pytest module that will generate a TFLite model with a single convolution operation, and the parameters and payloads will be changed according to the different parameters that we support.<br /></span></p><p><span style="font-family: inherit;">At some point I will add a CI job, probably before sending the initial merge request.<br /></span></p><div><p><span dir="ltr" role="presentation" style="font-family: inherit; font-size: calc(var(--scale-factor)*14.35px); left: 31.27%; top: 15.69%; transform: scaleX(0.907723);"></span></p></div></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/745522949949199487/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=745522949949199487' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/745522949949199487'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/745522949949199487'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/08/etnaviv-npu-update-4-its-convoluting.html' title=' Etnaviv NPU update 4: It's convoluting! '/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-8257480099972567574</id><published>2023-06-26T08:46:00.003+02:00</published><updated>2023-12-06T09:01:38.692+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>Etnaviv NPU update 3: Deeper into the convolution units</title><content type='html'><p>What two weeks!</p><h2 style="text-align: left;">Programming of the convolution units</h2><div style="text-align: left;"><p style="text-align: left;">Taking from where I left at the <a href="https://blog.tomeuvizoso.net/2023/06/etnaviv-npu-update-2-diving-into.html">last update</a>, I made progress in understanding the format of the buffer that contains the weights and biases. <br /></p></div><div style="text-align: left;"><p style="text-align: left;">The bit of knowledge that made a difference was realising that the format is optimized so that each NN core can efficiently access the portion of it that it needs, without having to do any parsing or decoding. 
Knowing that also helped in guessing what some fields in the parameter structure are for.<br /></p><p style="text-align: left;">With that, I&nbsp; was able to correctly run a convolution on a small matrix with arbitrary weights and biases.</p><p style="text-align: left;">The biggest roadblock in this area currently is understanding how I need to program the output unit in the NN so the output data is in the desired scale. There are a series of fields that influence how the output values are processed before being placed in the output buffer, and I don't really know how they work yet. They are called post_shift and post_mult and the first correlates moderately (r=0.78) to the quantization scale of the output. I know that the post_shift field does what it says, to the right, but to understand what value I need in each situation I feel I need to understand better how the hardware works and what could be the initial values at the end of the convolution and before the output unit. I will be reading a bunch of research papers about NN-accelerating silicon in the summer.<br /></p><p style="text-align: left;">That said, replacing the OpenCL kernels in TensorFlow Lite's GPU delegate that do convolutions with the fixed units turned out to be a worse idea than I initially thought. This is because that delegate is completely oriented towards float-first hardware such as GPUs and this accelerator is integer only.</p><p style="text-align: left;">A consequence of this is that TFLite inserts a dequantize operation at the start of the graph and a quantize at the end, to match the desired intput and output formats of a fully quantized model while feeding floats to the GPU. We need integers, so would be having to quantize after TFLite's dequantization and vice versa. Also, the other operations in the graph expect floats as well... This is certainly the wrong path to take for performance in a bandwidth-constrained device as all embedded boards are, so I had to go back to the drawing board.</p><h2 style="text-align: left;">A new Gallium frontend: Teflon</h2><p style="text-align: left;">If TF Lite's GPU delegate is such a bad match for this HW, what can we do to run inferences with reasonable speeds? The same that VeriSilicon did: write our own delegate:</p><p style="text-align: left;"><a href="https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/">https://gitlab.freedesktop.org/tomeu/mesa/-/commits/teflon/</a><br /></p><p style="text-align: left;">TF Lite's operation description matches relatively well what we currently know of the configuration of the NN units. So we will not need to write complex shaders to implement the operations, but "just" translate the description of the operation to the HW configuration.</p><p style="text-align: left;">Of course, there is no HW that has fixed function units that accelerate all operations that are built into TF Lite or even that the most commonly used models contain. 
VeriSilicon's delegate deals with that by having a library of optimized OpenCL kernels that run on their programmable shader core(s).</p><p style="text-align: left;">But we want to avoid getting in the business of writing dozens of kernels that will need to be tweaked and made more complex so they run efficiently on other NPUs out there.</p><p style="text-align: left;">Fortunately, the delegate infrastructure in TF Lite is designed for this very scenario of imperfect HW and we can have a simple delegate that will implement the operations supported by the HW and the rest will execute in other delegates based on their capabilities.</p><p style="text-align: left;">How fast that will be is a big unknown right now, as switching between delegates will have a cost in terms of synchronization and data sharing, but that is something that we probably can improve in the TF Lite code base as the kernel has already all mechanisms for efficient synchronization and data sharing.</p><p style="text-align: left;">Other possibilities that we have with the TF Lite delegate mechanism is offloading the operations we don't need to a different delegate that supports accelerating them. For example, in the case of a board with Amlogic A311D or S905D3, we could use the GPU delegate to run those operations on the Mali GPU on it, via the OpenCL driver that Alyssa is writing in Mesa.</p><p style="text-align: left;">And if that is still slower than with the proprietary stack, one could always write an optimized kernel in NIR to run on the programmable core in the Vivante NPU. That is the beauty of free software, we can address the needs we have ourselves, and importantly so, do it by pooling work with others!</p><p style="text-align: left;">Because this frontend is implemented in terms of Gallium, we leverage the infrastructure in there for memory management, synchronization and execution. I think this will work well for adding support to other NN engines such as those from Rockchip, Cadence, Mediatek, etc.<br /></p><h2 style="text-align: left;">Next steps</h2><p style="text-align: left;">I need to crack the nut of the post-processing of the raw output so it is in the expected scale, and afterwards I will be looking at handling multiple feature maps (kernel z &gt; 1).</p><p style="text-align: left;">After that I don't see much else in the way of running convolutions as expected by TF Lite, so hopefully I will be running some models and measuring the performance. I expect that we will want to do the same for accelerating tensor operations with the TP units. And we will probably want to give a look at using the SRAM to reduce bandwidth and memory access latency. 
That is still some way off though, and the summer is just starting!<br /></p></div></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/8257480099972567574/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=8257480099972567574' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8257480099972567574'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/8257480099972567574'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/06/etnaviv-npu-update-3-deeper-into.html' title='Etnaviv NPU update 3: Deeper into the convolution units'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-956184837449237805</id><published>2023-06-10T14:14:00.001+02:00</published><updated>2023-12-06T09:01:28.578+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>Etnaviv NPU update 2: Diving into the convolution units</title><content type='html'><p>In the <a href="https://blog.tomeuvizoso.net/2023/05/etnaviv-npu-update-1-planning-for.html">previous update</a> I explained that the programmable core in this NPU (VIPNano-QI) is too slow to run inference workloads substantially faster than the CPUs. The vendor stack achieves acceptable inference rates by running most of the work on fixed-function units that can perform different kinds of convolutions and transformations of tensors.</p><p>Most of the work is done by the convolution units that VeriSilicon calls NN cores, so this is what I have been focusing on at this stage. I think that even if we still do all tensor transformations on the programmable core, by using the NN units we could already achieve usable performance.</p><p>By looking at the ioctls that VeriSilicon's userspace stack sends to the kernel, it became clear that an NN job contains little more than a pointer to a structure that configures the NN fixed-function units. 
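</p><p>Looking at those ioctls basically means intercepting what the userspace stack sends down to the kernel driver and logging it. The tooling I use does quite a bit more than this, but the basic idea can be illustrated with a minimal LD_PRELOAD interposer along these lines (an illustrative sketch, not the actual tool):</p><p><span style="font-family: courier;">#define _GNU_SOURCE<br />#include &lt;dlfcn.h&gt;<br />#include &lt;stdarg.h&gt;<br />#include &lt;stdio.h&gt;<br /><br />/* Log every ioctl issued by the process; build as a shared object and<br />&nbsp;&nbsp; load it with LD_PRELOAD in front of the proprietary stack. */<br />int ioctl(int fd, unsigned long request, ...)<br />{<br />&nbsp;&nbsp; static int (*real_ioctl)(int, unsigned long, ...);<br />&nbsp;&nbsp; va_list ap;<br />&nbsp;&nbsp; void *arg;<br /><br />&nbsp;&nbsp; if (!real_ioctl)<br />&nbsp;&nbsp; &nbsp;&nbsp; real_ioctl = (int (*)(int, unsigned long, ...))<br />&nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; dlsym(RTLD_NEXT, "ioctl");<br /><br />&nbsp;&nbsp; va_start(ap, request);<br />&nbsp;&nbsp; arg = va_arg(ap, void *);<br />&nbsp;&nbsp; va_end(ap);<br /><br />&nbsp;&nbsp; fprintf(stderr, "ioctl(fd=%d, request=0x%lx, arg=%p)\n", fd, request, arg);<br />&nbsp;&nbsp; return real_ioctl(fd, request, arg);<br />}<br /></span></p><p>Built with something like "gcc -shared -fPIC -o ioctl-log.so ioctl-log.c -ldl" and preloaded, this prints one line per ioctl; decoding the request numbers and dumping the argument structures they point to is where the real work is.</p><p>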
Luckily I didn't need to reverse engineer it from zero, as VeriSilicon's out-of-tree kernel driver is GPL and contains two instances of <a href="https://github.com/TierMobility/linux/blob/242f3e8c8502ff8e818028f8b9fd9894e0feef2e/drivers/mxc/gpu-viv/hal/kernel/arch/gc_hal_kernel_hardware_func_flop_reset.c#L4751">programming this HW</a> with a trivial job (a 2x2x1 kernel with a single bias value).<br /></p><p>Took some boring work to translate what the code does to a C struct, but this was the initial one:</p><p><span style="font-family: courier;">struct etna_nn_params {<br />&nbsp;&nbsp; uint32_t op_type : 1; /* conv: 0 fully_connected: 1 */<br />&nbsp;&nbsp; uint32_t no_z_offset : 1;<br />&nbsp;&nbsp; uint32_t kernel_x_size : 4;<br />&nbsp;&nbsp; uint32_t kernel_z_size : 14; /* &amp; 0x3FFF */<br />&nbsp;&nbsp; uint32_t kernels_per_core : 7;<br />&nbsp;&nbsp; uint32_t zero1 : 2;<br />&nbsp;&nbsp; uint32_t zero2 : 1;<br />&nbsp;&nbsp; uint32_t zero3 : 1;<br />&nbsp;&nbsp; uint32_t nn_layer_flush : 1;<br /><br />&nbsp;&nbsp; uint32_t kernel_data_type : 2; /* UINT8 0x2 INT8 0x0 */<br />&nbsp;&nbsp; uint32_t in_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */<br />&nbsp;&nbsp; uint32_t out_image_data_type : 2; /* UINT8 0x2 INT8 0x0 */<br />&nbsp;&nbsp; uint32_t in_image_x_size : 13;<br />&nbsp;&nbsp; uint32_t in_image_y_size : 13;<br /><br />&nbsp;&nbsp; uint32_t zero4 : 3;<br />&nbsp;&nbsp; uint32_t zero5 : 3;<br />&nbsp;&nbsp; uint32_t unused0 : 1;<br />&nbsp;&nbsp; uint32_t zero6 : 16;<br />&nbsp;&nbsp; uint32_t zero7 : 1;<br />&nbsp;&nbsp; uint32_t enable_relu : 1;<br />&nbsp;&nbsp; uint32_t zero9 : 1;<br />&nbsp;&nbsp; uint32_t post_shift : 6;<br /><br />&nbsp;&nbsp; uint32_t unused1 : 2;<br />&nbsp;&nbsp; uint32_t zero10 : 1;<br />&nbsp;&nbsp; uint32_t zero11 : 1;<br />&nbsp;&nbsp; uint32_t unused2 : 2;<br />&nbsp;&nbsp; uint32_t out_image_x_size : 13;<br />&nbsp;&nbsp; uint32_t out_image_y_size : 13;<br /><br />&nbsp;&nbsp; uint32_t out_image_z_size : 14;<br />&nbsp;&nbsp; uint32_t zero12 : 2; /* 0x0 */<br />&nbsp;&nbsp; uint32_t zero13 : 1; /* (0 &gt;&gt; 3) &amp; 0x1 */<br />&nbsp;&nbsp; uint32_t zero14 : 1; /* (0 &gt;&gt; 3) &amp; 0x1 */<br />&nbsp;&nbsp; uint32_t unk0 : 7;&nbsp; /* 1 */<br />&nbsp;&nbsp; uint32_t unk1 : 7;&nbsp; /* 1 */<br /><br />&nbsp;&nbsp; uint32_t kernel_address : 26; /* &gt;&gt; 6 */<br />&nbsp;&nbsp; uint32_t kernel_z_size2 : 6; /* &gt;&gt; 14 */<br /><br />&nbsp;&nbsp; uint32_t in_image_address;<br /><br />&nbsp;&nbsp; uint32_t out_image_address;<br /><br />&nbsp;&nbsp; uint32_t unused3 : 12;<br />&nbsp;&nbsp; uint32_t kernel_y_size : 4;<br />&nbsp;&nbsp; uint32_t out_image_y_size2 : 16;&nbsp; /* maybe stride? */<br /><br />&nbsp;&nbsp; uint32_t zero15;<br /><br />&nbsp;&nbsp; uint32_t zero16;<br /><br />&nbsp;&nbsp; uint32_t zero17;<br /><br />&nbsp;&nbsp; uint32_t kernel_cache_end_address;<br /><br />&nbsp;&nbsp; uint32_t zero19;<br /><br />&nbsp;&nbsp; uint32_t image_end_address;<br /><br />&nbsp;&nbsp; uint32_t zero20 : 2;<br />&nbsp;&nbsp; uint32_t zero21 : 16;<br />&nbsp;&nbsp; uint32_t kernel_data_type_bit_2 : 1;<br />&nbsp;&nbsp; uint32_t in_image_data_type_bit_2 : 1;<br />&nbsp;&nbsp; uint32_t out_image_data_type_bit_2 : 1;<br />&nbsp;&nbsp; uint32_t zero22 : 6;<br />&nbsp;&nbsp; uint32_t post_shift_bit_5_6 : 2;<br />&nbsp;&nbsp; uint32_t unused4 : 3;<br /><br />&nbsp;&nbsp; uint32_t in_image_stride : 16;<br />&nbsp;&nbsp; uint32_t in_image_y_size2 : 16; /* again? 
*/<br /><br />&nbsp;&nbsp; uint32_t out_image_stride : 16;<br />&nbsp;&nbsp; uint32_t unused5 : 8;<br />&nbsp;&nbsp; uint32_t zero23 : 8;<br /><br />&nbsp;&nbsp; uint32_t zero24 : 26; /* 0 &gt;&gt; 6 */<br />&nbsp;&nbsp; uint32_t zero25 : 1;<br />&nbsp;&nbsp; uint32_t zero26 : 1;<br />&nbsp;&nbsp; uint32_t zero27 : 1; /* 0 &gt;&gt; 4 */<br />&nbsp;&nbsp; uint32_t zero28 : 1; /* 0 &gt;&gt; 4 */<br />&nbsp;&nbsp; uint32_t zero29 : 1;<br />&nbsp;&nbsp; uint32_t kernel_data_type_bit_3 : 1;<br /><br />&nbsp;&nbsp; uint32_t unk2 : 26; /* 0xFFFFFFFF &gt;&gt; 6 */<br />&nbsp;&nbsp; uint32_t unused6 : 4;<br />&nbsp;&nbsp; uint32_t zero30 : 1;<br />&nbsp;&nbsp; uint32_t in_image_data_type_bit_3 : 1;<br /><br />&nbsp;&nbsp; uint32_t zero31 : 26; /* 0 &gt;&gt; 6 */<br />&nbsp;&nbsp; uint32_t out_image_data_type_bit_3 : 1;<br />&nbsp;&nbsp; uint32_t unused7 : 6;<br /><br />&nbsp;&nbsp; uint32_t unk3 : 26; /* 0xFFFFFFFF &gt;&gt; 6 */<br />&nbsp;&nbsp; uint32_t unused8 : 6;<br /><br />&nbsp;&nbsp; uint32_t coef_zero_point : 8;<br />&nbsp;&nbsp; uint32_t out_zero_point : 8;<br />&nbsp;&nbsp; uint32_t zero32 : 1;<br />&nbsp;&nbsp; uint32_t zero33 : 1;<br />&nbsp;&nbsp; uint32_t zero34 : 8;<br />&nbsp;&nbsp; uint32_t unused9 : 6;<br /><br />&nbsp;&nbsp; uint32_t zero35;<br /><br />&nbsp;&nbsp; uint32_t zero36 : 4;<br />&nbsp;&nbsp; uint32_t zero37 : 28;&nbsp; /* 0 &gt;&gt; 4 */<br /><br />&nbsp;&nbsp; uint32_t zero38 : 4;<br />&nbsp;&nbsp; uint32_t zero39 : 28;&nbsp; /* 0 &gt;&gt; 4 */<br /><br />&nbsp;&nbsp; uint32_t further1;<br />&nbsp;&nbsp; uint32_t further2;<br />&nbsp;&nbsp; uint32_t further3;<br />&nbsp;&nbsp; uint32_t further4;<br />&nbsp;&nbsp; uint32_t further5;<br />&nbsp;&nbsp; uint32_t further6;<br />&nbsp;&nbsp; uint32_t further7;<br />&nbsp;&nbsp; uint32_t further8;<br /></span>};<br /></p><p>As you can see there are a lot of "zero" and "unused" fields, most of which I think are actually used for something, as HW engineers don't tend to like wasting bits. By adding instrumentation to the reverse engineering tooling to dump these structs, I will get a better idea of what each field means and does.<br /></p><p>I got GPU hangs the first time I submitted a job with the same configuration as the kernel's trivial reset job, and looking further showed that the buffer that contains the convolution filters must follow a specific format.</p><p>By looking again at the kernel driver sources, I used the same kernel/filter buffer and the GPU didn't hang anymore. That kernel had all zeroes as its weights, and indeed my output buffer was now full of zeroes.</p><p>Then I tried to put my own weights into the format that I inferred from the kernel driver source code, but I wasn't able to get any job to run to completion without hangs, and the output buffer remained unchanged.</p><p>To figure out what I was missing about how the weights (and the biases) need to be placed in the buffer, I added code to the reverse engineering tooling to dump the weights buffer. 
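</p><p>The dumping code itself is nothing sophisticated; conceptually it is little more than a hex dump of the mapped buffer, so that dumps from runs with different weights and biases can be diffed against each other. Something along these lines (illustrative only, not the actual tooling):</p><p><span style="font-family: courier;">#include &lt;stddef.h&gt;<br />#include &lt;stdint.h&gt;<br />#include &lt;stdio.h&gt;<br /><br />/* Print a buffer as rows of 16 bytes, with offsets, so that two dumps<br />&nbsp;&nbsp; can be compared with diff. */<br />static void dump_buffer(const char *tag, const uint8_t *buf, size_t size)<br />{<br />&nbsp;&nbsp; for (size_t i = 0; i &lt; size; i++) {<br />&nbsp;&nbsp; &nbsp;&nbsp; if (i % 16 == 0)<br />&nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; printf("%s%s %06zx:", i ? "\n" : "", tag, i);<br />&nbsp;&nbsp; &nbsp;&nbsp; printf(" %02x", buf[i]);<br />&nbsp;&nbsp; }<br />&nbsp;&nbsp; printf("\n");<br />}<br /></span></p><p>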
With that buffer and after playing around a bit with the sizes of the output, input and kernel buffers, I finally got a job to run with non-zero weights.</p><p>What I am doing right now is slowly zeroing out the weights buffer to figure out which bits are data, which are control, and what effect the changes have on the output.</p><p>I hope that by the next update I will have documented the format of the weights buffer and will be able to run at least one kind of convolution!<br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/956184837449237805/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=956184837449237805' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/956184837449237805'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/956184837449237805'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/06/etnaviv-npu-update-2-diving-into.html' title='Etnaviv NPU update 2: Diving into the convolution units'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-664175667937540078.post-6608550388257645646</id><published>2023-05-29T11:31:00.001+02:00</published><updated>2023-12-06T09:01:09.123+01:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="etnaviv"/><category scheme="http://www.blogger.com/atom/ns#" term="mesa"/><category scheme="http://www.blogger.com/atom/ns#" term="npu"/><category scheme="http://www.blogger.com/atom/ns#" term="tensorflow"/><category scheme="http://www.blogger.com/atom/ns#" term="vipnano-qi"/><category scheme="http://www.blogger.com/atom/ns#" term="vivante"/><title type='text'>Etnaviv NPU update 1: Planning for performance</title><content type='html'><p>As I wrote in the <a href="https://blog.tomeuvizoso.net/2023/04/a-long-overdue-update.html">last update</a>, my <a href="https://gitlab.freedesktop.org/tomeu/mesa/-/tree/etnaviv-opencl">OpenCL branch</a> is able to correctly run <a href="https://arxiv.org/abs/1704.04861">MobileNet v1</a> with the GPU delegate in TensorFlow-Lite, albeit much slower than with VeriSilicon's proprietary stack.</p><p>In the weeks since, I have been investigating the performance difference, getting a better understanding of how the HW works and what the explanation could be. Inference with Etnaviv took 1200 ms, while the proprietary stack did the same in less than 10 ms (120x faster!). </p><p>While trying to understand the big performance difference, I discovered that the existing reverse engineering tools I had been using to understand how to run OpenCL workloads weren't working. 
They detected a single OpenCL kernel at the end of the execution, and there was no way that single kernel could be executing the whole network.</p><p>After a lot of fumbling around on the internet I stumbled upon <a href="https://github.com/phytec/android-phytec-devices/commit/530d1d3102c93b00ae0a6a87a50db2648f874277">a commit</a> that included an interestingly-named environment variable: <span style="font-family: courier;">VIV_VX_DISABLE_TP_NN_EVIS</span>. With it, VeriSilicon's OpenVX implementation will execute the network without using either the TP or NN fixed-function units, nor the EVIS instruction set (which helps reduce memory bandwidth use by allowing operations on packed int8 and int16 types).</p><p>With that environment variable set, OpenVX used regular OpenCL to run the inference, and the result was interesting: 398.428 ms. Still much better than our time, but also more than 50 times slower than when fully using the capabilities of the hardware. The reason for this is that there is only one core in the NPU that is able to run programmable kernels. The rest are fixed-function units, as I'm going to explain next.<br /></p><p>Digging further into VeriSilicon's kernel driver and into marketing documents, I gathered that this particular NPU has 8 convolution cores (they call them NN cores) and 4 cores for accelerating some tensor operations (TP cores). What these units cannot do has to be done on the single, slow programmable core.</p><p>The next step was to understand how the proprietary stack made use of the fixed-function units in the NPU.<br /></p><p>The MobileNet v1 model I used contains these operations, as output by TFLite's model analyzer:</p><p><span style="font-family: courier;">&nbsp; Op#0 CONV_2D(T#88, T#6, T#4[28379, 17476, 18052, -2331, 17431, ...]) -&gt; [T#5]<br />&nbsp; Op#1 DEPTHWISE_CONV_2D(T#5, T#33, T#32[-249, 165, 173, -2, 158, ...]) -&gt; [T#31]<br />... <br /></span></p><p><span style="font-family: courier;">[12 more pairs of CONV_2D and DEPTHWISE_CONV_2D] </span></p><p><span style="font-family: courier;">...<br /></span></p><p><span style="font-family: courier;">&nbsp; Op#27 AVERAGE_POOL_2D(T#29) -&gt; [T#0]<br />&nbsp; Op#28 CONV_2D(T#0, T#3, T#2[-5788, -4159, 2282, -6706, -9783, ...]) -&gt; [T#1]<br />&nbsp; Op#29 RESHAPE(T#1, T#86[-1, 1001]) -&gt; [T#85]<br />&nbsp; Op#30 SOFTMAX(T#85) -&gt; [T#87]</span><br /></p><p>As can be seen, it is basically a bunch of convolutions with a final reshaping and a SOFTMAX operation at the end.&nbsp;</p><p>By using some of the environment variables that are mentioned in <a href="https://github.com/VeriSilicon/tflite-vx-delegate/issues/20#issuecomment-952472901">this issue</a> on GitHub, we can get some information on how the proprietary stack plans the execution on the hardware:</p><p><span style="font-family: courier;">&nbsp; operation_name:VXNNE_OPERATOR_TENSOR_TRANS operation_target:VXNNE_OPERATION_TARGET_TP<br />&nbsp; operation_name:VXNNE_OPERATOR_RESHUFFLE operation_target:VXNNE_OPERATION_TARGET_TP<br />&nbsp; operation_name:VXNNE_OPERATOR_CONVOLUTION operation_target:VXNNE_OPERATION_TARGET_NN<br />... 
<br /></span></p><p><span style="font-family: courier;">[34 more VXNNE_OPERATOR_CONVOLUTION on VXNNE_OPERATION_TARGET_NN]&nbsp;</span></p><p><span style="font-family: courier;">...<br /></span></p><p><span style="font-family: courier;">&nbsp; operation_name:VXNNE_OPERATOR_POOLING operation_target:VXNNE_OPERATION_TARGET_SH<br />&nbsp; operation_name:VXNNE_OPERATOR_FULLYCONNECTED operation_target:VXNNE_OPERATION_TARGET_TP<br />&nbsp; operation_name:VXNNE_OPERATOR_SOFTMAX operation_target:VXNNE_OPERATION_TARGET_SH<br /></span><br />From that we can see that the TP units are used to prepare the input tensor, all the convolution operations go to the NN cores, the output of the convolutions is passed through a pooling operation on the programmable core, its output goes to the TP cores for further processing, and everything finishes with a SOFTMAX back on the programmable core.<br /><br />So in this case, only a small part of the network is actually run on the programmable cores, via OpenCL...</p><p></p><h2 style="text-align: left;">Next steps</h2><p style="text-align: left;">What I will be working on next:<br /></p><ol style="text-align: left;"><li>Adapt the existing RE tooling to dump information regarding NN and TP workflows.</li><li>Start to fill the data structures by reading the code of VeriSilicon's kernel driver, which executes some trivial workloads to, presumably, reset the HW between context switches to prevent information leaks.</li><li>Write some simple OpenVX graphs that exercise each of the operations that the documentation claims are supported by the NPU (see the sketch below).</li><li>Observe the data that VeriSilicon's userspace stack passes to the kernel, and infer from there the exact layout of the configuration buffers that program the fixed-function units.</li><li>Hack Mesa to send an NN job if the name of the CL kernel contains "convolution".</li><li>Get things working for this specific network and measure performance.</li></ol><p>If performance is at least 3x faster than running the inference on the CPU, I would call this good enough to be useful and I will switch to upstreaming. 
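</p><p>Regarding point 3 above, the kind of OpenVX graph I have in mind is as simple as possible: a single convolution node between an input and an output tensor, so that whatever the blob sends to the kernel can be attributed to that one operation. A rough sketch using the vx_khr_nn extension follows; tensor sizes and data types are placeholders and all error checking is omitted, so take it as an illustration rather than a tested program:</p><p><span style="font-family: courier;">#include &lt;VX/vx.h&gt;<br />#include &lt;VX/vx_khr_nn.h&gt;<br /><br />int main(void)<br />{<br />&nbsp;&nbsp; vx_context ctx = vxCreateContext();<br />&nbsp;&nbsp; vx_graph graph = vxCreateGraph(ctx);<br /><br />&nbsp;&nbsp; /* Arbitrary sizes: 8x8x1 input, 2x2x1 kernel, 7x7x1 output. */<br />&nbsp;&nbsp; vx_size in_dims[4]&nbsp; = { 8, 8, 1, 1 };<br />&nbsp;&nbsp; vx_size w_dims[4]&nbsp;&nbsp; = { 2, 2, 1, 1 };<br />&nbsp;&nbsp; vx_size b_dims[1]&nbsp;&nbsp; = { 1 };<br />&nbsp;&nbsp; vx_size out_dims[4] = { 7, 7, 1, 1 };<br /><br />&nbsp;&nbsp; vx_tensor input&nbsp;&nbsp; = vxCreateTensor(ctx, 4, in_dims,&nbsp; VX_TYPE_UINT8, 0);<br />&nbsp;&nbsp; vx_tensor weights = vxCreateTensor(ctx, 4, w_dims,&nbsp;&nbsp; VX_TYPE_UINT8, 0);<br />&nbsp;&nbsp; vx_tensor biases&nbsp; = vxCreateTensor(ctx, 1, b_dims,&nbsp;&nbsp; VX_TYPE_INT32, 0);<br />&nbsp;&nbsp; vx_tensor output&nbsp; = vxCreateTensor(ctx, 4, out_dims, VX_TYPE_UINT8, 0);<br /><br />&nbsp;&nbsp; /* Single convolution node, no padding, no dilation. */<br />&nbsp;&nbsp; vx_nn_convolution_params_t params = { 0 };<br />&nbsp;&nbsp; params.overflow_policy = VX_CONVERT_POLICY_SATURATE;<br />&nbsp;&nbsp; params.rounding_policy = VX_ROUND_POLICY_TO_ZERO;<br /><br />&nbsp;&nbsp; vxConvolutionLayer(graph, input, weights, biases,<br />&nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &amp;params, sizeof(params), output);<br /><br />&nbsp;&nbsp; vxVerifyGraph(graph);<br />&nbsp;&nbsp; vxProcessGraph(graph);<br /><br />&nbsp;&nbsp; vxReleaseContext(&amp;ctx);<br />&nbsp;&nbsp; return 0;<br />}<br /></span></p><p>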
The Mesa side of it doesn't look that bad, but I think the bigger challenge will be getting something merged in TensorFlow that can run fast on this hardware.</p><p>The most reasonable approach I have been able to think of would be adding new CL C and SPIR-V vendor extensions that add a new intrinsic for the whole convolution operation (with parameters similar to those of the <a href="https://registry.khronos.org/OpenVX/extensions/vx_khr_nn/1.1/html/d6/d9a/group__group__cnn.html#ga870c106e8ceb4c118692c6f754f75f43">vxConvolutionLayer node</a>).<br /></p><p>The GPU delegate in TensorFlow Lite would use it on the Vivante NPU and Mesa would have a robust way of knowing that this kernel should be run with a NN job, and with what configuration.</p><p>That's a lot of work, but I would say at this point that afterwards I will start looking at making fuller use of the NPU's capabilities by doing something similar with the operations that the TP cores can accelerate.<br /></p></content><link rel='replies' type='application/atom+xml' href='https://blog.tomeuvizoso.net/feeds/6608550388257645646/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=664175667937540078&postID=6608550388257645646' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6608550388257645646'/><link rel='self' type='application/atom+xml' href='https://www.blogger.com/feeds/664175667937540078/posts/default/6608550388257645646'/><link rel='alternate' type='text/html' href='https://blog.tomeuvizoso.net/2023/05/etnaviv-npu-update-1-planning-for.html' title='Etnaviv NPU update 1: Planning for performance'/><author><name>Tomeu Vizoso</name><uri>http://www.blogger.com/profile/16626407169435386757</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='//blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3VSYTeUdOd9zbuD4GAz7nzjkpyrnEsbggXSjF423RKaw7fTLijGdON9rvy_T41rpgYbGZcvlJ9V5_3KxhA0AOZytvTAyibl9kwJogOB51HJtqyjMp4UDuuIbDxObYu2W0cY0-sJEqbyemTh6TWkvpxFfyvlF7Pe6FyR7VXGuS8ehDuro/s220/tomeu-2019.jpg'/></author><thr:total>0</thr:total></entry></feed>