<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://cuda-chen.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://cuda-chen.github.io/" rel="alternate" type="text/html" /><updated>2026-02-27T07:41:02+00:00</updated><id>https://cuda-chen.github.io/feed.xml</id><title type="html">Cuda Chen’s Blog</title><subtitle>Image Processing, Machine Learning, Parallel Computing, video games, and living.</subtitle><entry><title type="html">libmseed Optimization Attempt – A Counter Example of Replacing switch-case</title><link href="https://cuda-chen.github.io/programming/seismology/seismology%20data%20format/optimziation/benchmarking/2026/02/21/libmseed-counter-example-of-jump-table.html" rel="alternate" type="text/html" title="libmseed Optimization Attempt – A Counter Example of Replacing switch-case" /><published>2026-02-21T00:00:00+00:00</published><updated>2026-02-21T00:00:00+00:00</updated><id>https://cuda-chen.github.io/programming/seismology/seismology%20data%20format/optimziation/benchmarking/2026/02/21/libmseed-counter-example-of-jump-table</id><content type="html" xml:base="https://cuda-chen.github.io/programming/seismology/seismology%20data%20format/optimziation/benchmarking/2026/02/21/libmseed-counter-example-of-jump-table.html"><![CDATA[<h2 id="outline">Outline</h2>

<p>As systems software programmers, we always strive
for every possible performance optimization.
When it comes to branching, we tend to believe
that a plain jump table, which computes a table index
with a little arithmetic and then dispatches to the
matching operation, beats many alternatives,
including long if-else chains and complex switch-case statements.</p>

<p>In this post, I give a counterexample
in which a seemingly dumb switch-case statement beats
the jump table technique. I show not only the benchmark
results, but also my own findings on
why the switch-case statement performs better
than the jump table implementation.</p>

<h2 id="the-target">The Target</h2>

<p>I am going to optimize <code class="language-plaintext highlighter-rouge">msr_decode_steim2()</code>
in libmseed, a widely used seismic data library.</p>

<p>If you are interested in the miniSEED format with Steim2 encoding,
see <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup>.</p>

<p>You can view the full source of <code class="language-plaintext highlighter-rouge">msr_decode_steim2()</code> in <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup>.
For clarity, here is a minified version of the code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int64_t</span> <span class="nf">msr_decode_steim2</span><span class="p">(...)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">nibble</span> <span class="o">=</span> <span class="p">...;</span> <span class="cm">/* check the first nibble */</span>
    <span class="kt">int</span> <span class="n">dnib</span><span class="p">;</span> <span class="cm">/* second nibble placeholder */</span>    

    <span class="k">switch</span><span class="p">(</span><span class="n">nibble</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
            <span class="n">dnib</span> <span class="o">=</span> <span class="p">...;</span> <span class="cm">/* check the second nibble */</span>
            <span class="k">switch</span><span class="p">(</span><span class="n">dnib</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">case</span> <span class="mi">0</span><span class="p">:</span>
                    <span class="k">break</span><span class="p">;</span>
                <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
                    <span class="k">break</span><span class="p">;</span>
                <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
                    <span class="k">break</span><span class="p">;</span>
                <span class="k">case</span> <span class="mi">3</span><span class="p">:</span>
                    <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="mi">3</span><span class="p">:</span>
            <span class="n">dnib</span> <span class="o">=</span> <span class="p">...;</span>
            <span class="k">switch</span><span class="p">(</span><span class="n">dnib</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">case</span> <span class="mi">0</span><span class="p">:</span>
                    <span class="k">break</span><span class="p">;</span>
                <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
                    <span class="k">break</span><span class="p">;</span>
                <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
                    <span class="k">break</span><span class="p">;</span>
                <span class="k">case</span> <span class="mi">3</span><span class="p">:</span>
                    <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">break</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="what-i-am-going-to-do">What I am Going to Do</h3>

<p>Eliminate the non-trivial switch-case statement, and
look for any further opportunities for micro-optimization.</p>

<h2 id="attempts">Attempts</h2>

<p>To reach this goal, I came up with two methods: a jump table
and a for loop.</p>

<h3 id="jump-table">jump table</h3>

<p>When there are dozens of conditions, we can use a jump table
to let the program execute a certain action based on the
value of each condition. This is usually expected to
perform better than a switch-case statement.</p>

<p>We can move the body of each case
into its own function, then create a table for
indexing these functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* a series of callbacks */</span>
<span class="kt">int</span> <span class="nf">fnoop</span><span class="p">(...)</span> <span class="p">{...}</span> <span class="cm">/* nibble=0x00 */</span>
<span class="kt">int</span> <span class="nf">f01</span><span class="p">(...)</span> <span class="p">{...}</span> <span class="cm">/* nibble=0x01 */</span>
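<span class="kt">int</span> <span class="nf">f1010</span><span class="p">(...)</span> <span class="p">{...}</span> <span class="cm">/* nibble=0x02 and dnib=0x02 */</span>
<span class="p">...</span>

<span class="k">typedef</span> <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">steim2_decode_func_cb</span><span class="p">)</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">,</span>  <span class="cm">/* input frame */</span>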
                                      <span class="kt">int32_t</span> <span class="o">*</span><span class="p">,</span>  <span class="cm">/* output difference array */</span>
                                      <span class="kt">int</span> <span class="o">*</span>      <span class="cm">/* output difference array index */</span>
<span class="p">);</span> 
<span class="k">static</span> <span class="n">steim2_decode_func_cb</span> <span class="n">__steim2_decode_func_tbl</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">fnoop</span><span class="p">,</span> <span class="n">fnoop</span><span class="p">,</span> <span class="n">fnoop</span><span class="p">,</span> <span class="n">fnoop</span><span class="p">,</span> <span class="n">f01</span><span class="p">,</span>   <span class="n">f01</span><span class="p">,</span>   <span class="n">f01</span><span class="p">,</span>   <span class="n">f01</span><span class="p">,</span> 
    <span class="n">f1000</span><span class="p">,</span> <span class="n">f1001</span><span class="p">,</span> <span class="n">f1010</span><span class="p">,</span> <span class="n">f1011</span><span class="p">,</span> <span class="n">f1100</span><span class="p">,</span> <span class="n">f1101</span><span class="p">,</span> <span class="n">f1110</span><span class="p">,</span> <span class="n">f1111</span><span class="p">,</span>
<span class="p">};</span>

<span class="kt">int64_t</span> <span class="nf">msr_decode_steim2</span><span class="p">(...)</span> <span class="p">{</span>
    <span class="cm">/* Replace the switch-case with array indexing */</span>
    <span class="kt">int</span> <span class="n">nibble</span> <span class="o">=</span> <span class="p">...;</span>
    <span class="kt">int</span> <span class="n">dnib</span> <span class="o">=</span> <span class="p">...;</span> <span class="cm">/* Always check the second nibble */</span>
    <span class="kt">uint32_t</span> <span class="n">ii</span> <span class="o">=</span> <span class="p">((</span><span class="n">nibble</span> <span class="o">&amp;</span> <span class="mh">0x03</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">2</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">dnib</span> <span class="o">&amp;</span> <span class="mh">0x03</span><span class="p">);</span>

    <span class="n">steim2_decode_func_cb</span> <span class="n">handler</span> <span class="o">=</span> <span class="n">__steim2_decode_func_tbl</span><span class="p">[</span><span class="n">ii</span> <span class="o">&amp;</span> <span class="mh">0x0f</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">handler</span> <span class="p">(</span><span class="n">frame</span><span class="p">[</span><span class="n">widx</span><span class="p">],</span> <span class="n">diff</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">diffidx</span><span class="p">);</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
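<p>To make the dispatch concrete, here is a minimal, self-contained sketch
of the same idea. It is not the libmseed code: the handler type
<code class="language-plaintext highlighter-rouge">count_cb</code>, the generated
<code class="language-plaintext highlighter-rouge">count_N</code> handlers, and
<code class="language-plaintext highlighter-rouge">dispatch_count()</code> are hypothetical names,
and each handler only reports how many differences a word would hold
(the same counts that the precomputed tables later in this post encode).</p>

```c
#include <stdint.h>

/* Hypothetical handler type: returns how many differences a word holds. */
typedef int (*count_cb)(uint32_t word);

/* Generate one trivial handler per possible count (illustration only). */
#define DEFINE_COUNT_HANDLER(n) \
    static int count_##n(uint32_t word) { (void)word; return n; }

DEFINE_COUNT_HANDLER(0)
DEFINE_COUNT_HANDLER(1)
DEFINE_COUNT_HANDLER(2)
DEFINE_COUNT_HANDLER(3)
DEFINE_COUNT_HANDLER(4)
DEFINE_COUNT_HANDLER(5)
DEFINE_COUNT_HANDLER(6)
DEFINE_COUNT_HANDLER(7)

/* Table indexed by (nibble << 2) | dnib. */
static const count_cb count_tbl[16] = {
    count_0, count_0, count_0, count_0, /* nibble=0: no data */
    count_4, count_4, count_4, count_4, /* nibble=1: four 8-bit values */
    count_0, count_1, count_2, count_3, /* nibble=2: 1x30, 2x15, 3x10 bits */
    count_5, count_6, count_7, count_0, /* nibble=3: 5x6, 6x5, 7x4 bits */
};

/* Combine the two 2-bit codes into a 4-bit index, then make an indirect call. */
int dispatch_count(int nibble, int dnib, uint32_t word)
{
    uint32_t ii = ((uint32_t)(nibble & 0x03) << 2) | (uint32_t)(dnib & 0x03);
    return count_tbl[ii](word);
}
```

<p>Note that every lookup ends in an indirect function call, even for the
entries that do nothing.</p>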

<h3 id="for-loop">for loop</h3>

<p>As specified in <sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup>, we can use a for loop with precomputed
values to determine how many bits to scan each time
and how many values are stored in a frame.</p>

<p>The code then looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int64_t</span> <span class="nf">msr_decode_steim2</span><span class="p">(...)</span> <span class="p">{</span>
      <span class="cm">/* Replace the switch-case with array indexing */</span>
      <span class="kt">int</span> <span class="n">nibble</span> <span class="o">=</span> <span class="p">...;</span>
      <span class="kt">int</span> <span class="n">dnib</span> <span class="o">=</span> <span class="p">...;</span> <span class="cm">/* Always check the second nibble */</span>
      <span class="kt">uint32_t</span> <span class="n">ii</span> <span class="o">=</span> <span class="p">((</span><span class="n">nibble</span> <span class="o">&amp;</span> <span class="mh">0x03</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">2</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">dnib</span> <span class="o">&amp;</span> <span class="mh">0x03</span><span class="p">);</span>

      <span class="kt">int</span> <span class="n">base</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,};</span>
      <span class="kt">uint32_t</span> <span class="n">increment_mask</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mh">0x0</span><span class="p">,</span> <span class="mh">0x0</span><span class="p">,</span> <span class="mh">0x07</span><span class="p">,</span> <span class="mh">0x07</span><span class="p">,};</span>
      <span class="kt">int</span> <span class="n">cnt</span> <span class="o">=</span> <span class="n">base</span><span class="p">[</span><span class="n">nibble</span> <span class="o">&amp;</span> <span class="mh">0x03</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="n">ii</span> <span class="o">&amp;</span> <span class="n">increment_mask</span><span class="p">[</span><span class="n">nibble</span> <span class="o">&amp;</span> <span class="mh">0x03</span><span class="p">]);</span>

      <span class="cm">/* start bit of each combination of nibble and dnib */</span>
      <span class="kt">int</span> <span class="n">start_bit_pos</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
          <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
          <span class="mi">24</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span>
          <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span>
          <span class="mi">24</span><span class="p">,</span> <span class="mi">25</span><span class="p">,</span> <span class="mi">24</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> 
      <span class="p">};</span>

      <span class="cm">/* bit length of each combination of nibble and dnib */</span>
      <span class="kt">int</span> <span class="n">bb</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> 
          <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
          <span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span>
          <span class="mi">0</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span>
          <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
      <span class="p">};</span>

      <span class="k">for</span> <span class="p">(</span><span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">idx</span> <span class="o">&lt;</span> <span class="n">cnt</span><span class="p">;</span> <span class="n">idx</span><span class="o">++</span><span class="p">)</span>
      <span class="p">{</span>        
        <span class="kt">int</span> <span class="n">bit_count</span> <span class="o">=</span> <span class="n">bb</span><span class="p">[</span><span class="n">ii</span> <span class="o">&amp;</span> <span class="mh">0x0f</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="mi">32</span> <span class="o">-</span> <span class="n">bit_count</span><span class="p">;</span>
        <span class="cm">/* nibble=0x01 needs extra treatment because there is no
        * definition of little-endian Steim2 SEED.
        * See https://github.com/EarthScope/libmseed/issues/36#issuecomment-470370790
        * for more details.
        */</span>
        <span class="kt">int</span> <span class="n">start</span> <span class="o">=</span> <span class="n">start_bit_pos</span><span class="p">[</span><span class="n">ii</span> <span class="o">&amp;</span> <span class="mh">0x0f</span><span class="p">]</span> <span class="o">-</span> <span class="p">(</span>
                <span class="n">nibble</span> <span class="o">==</span> <span class="mh">0x01</span>
                <span class="o">?</span> <span class="n">cnt</span> <span class="o">-</span> <span class="n">idx</span> <span class="o">-</span> <span class="mi">1</span> 
                <span class="o">:</span> <span class="n">idx</span><span class="p">)</span> <span class="o">*</span> <span class="n">bit_count</span><span class="p">;</span>

        <span class="cm">/* adapted from "Sign extending from a variable bit-width" section in https://graphics.stanford.edu/~seander/bithacks.html */</span>
        <span class="kt">int32_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">1U</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">bit_count</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span> 
        <span class="kt">int32_t</span> <span class="n">t</span> <span class="o">=</span> <span class="p">(((</span><span class="n">frame</span><span class="p">[</span><span class="n">widx</span><span class="p">])</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="n">start</span><span class="p">))</span> <span class="o">&amp;</span> <span class="p">((</span><span class="mi">1U</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">bit_count</span><span class="p">))</span> <span class="o">-</span> <span class="mi">1</span><span class="p">));</span>
        <span class="kt">int32_t</span> <span class="n">tmp</span> <span class="o">=</span> <span class="p">(</span><span class="n">t</span> <span class="o">^</span> <span class="n">m</span><span class="p">)</span> <span class="o">-</span> <span class="n">m</span><span class="p">;</span>

        <span class="n">diff</span><span class="p">[</span><span class="n">diffidx</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span> 
      <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
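<p>The core of the loop body is extracting a bit field and sign-extending it.
The sketch below isolates that step in a standalone helper so it can be
checked on its own. The function name
<code class="language-plaintext highlighter-rouge">extract_signed()</code> and its parameters are
hypothetical, and the assertion encodes my assumption that a field is
between 1 and 31 bits wide.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Extract `bit_count` bits of `word`, starting at bit `start` (counted from
 * the least significant bit), and sign-extend the field to 32 bits. */
int32_t extract_signed(uint32_t word, int start, int bit_count)
{
    assert(bit_count > 0 && bit_count < 32);
    assert(start >= 0 && start + bit_count <= 32);

    int32_t m = (int32_t)(1U << (bit_count - 1));                      /* sign bit of the field */
    int32_t t = (int32_t)((word >> start) & ((1U << bit_count) - 1U)); /* raw field value */
    return (t ^ m) - m; /* flips the sign bit, then subtracts it back */
}
```

<p>For example, a 6-bit field holding
<code class="language-plaintext highlighter-rouge">0x3F</code> decodes to -1, and an 8-bit field holding
<code class="language-plaintext highlighter-rouge">0x80</code> decodes to -128.</p>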

<h2 id="benchmark">Benchmark</h2>

<p>For benchmarking, our goal is to reduce the execution time.</p>

<h3 id="environment">environment</h3>

<ul>
  <li>OS: Linux v6.8.0</li>
  <li>CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz</li>
  <li>compiler flags: <code class="language-plaintext highlighter-rouge">-O2</code></li>
  <li>compiler: GCC 13.3.0</li>
</ul>

<h3 id="benchmark-program">benchmark program</h3>

<ul>
  <li>decode_rodeo
    <ul>
      <li>download from <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></li>
      <li>run 1000000 iterations on Steim2 encoded input data</li>
    </ul>
  </li>
</ul>

<p>Sample output of the benchmark program:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./decode_rodeo 
Testing Steim-2
reclen: 1595
dataoffset: 59
datasize: 1536
Steim2 decoded 1000000 iterations in 2.045135 seconds
</code></pre></div></div>

<h3 id="comparison">comparison</h3>

<table>
  <thead>
    <tr>
      <th>type</th>
      <th>execution time (measured in seconds)</th>
      <th>performance gain (compared to baseline)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>baseline</td>
      <td>2.045135</td>
      <td>0.0 %</td>
    </tr>
    <tr>
      <td>jump table</td>
      <td>2.451932</td>
      <td>-19.8909607 %</td>
    </tr>
    <tr>
      <td>for loop</td>
      <td>3.792999</td>
      <td>-85.4644803 %</td>
    </tr>
  </tbody>
</table>
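<p>The performance gain column is the relative change in execution time
against the baseline, so a negative value means a slowdown. A small sketch of
that arithmetic, with a hypothetical helper name
<code class="language-plaintext highlighter-rouge">gain_percent()</code>:</p>

```c
/* Relative gain in percent; negative means slower than the baseline. */
double gain_percent(double baseline_s, double candidate_s)
{
    return (baseline_s - candidate_s) / baseline_s * 100.0;
}
```

<p>Plugging in the measured times, 2.045135 and 2.451932 seconds give roughly
-19.89 %, and 2.045135 and 3.792999 seconds give roughly -85.46 %.</p>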

<h2 id="findings">Findings</h2>

<p>We can see that all of the attempts are significantly slower than
the baseline.</p>

<p>Even though each attempt disassembles into fewer lines than the baseline,
the results are worse.</p>

<p>After some investigation, I concluded why each attempt
performs worse:</p>
<ul>
  <li>The jump table requires a function call for every lookup, which
has a big impact on the decoding procedure.</li>
  <li>The for loop method calculates the needed values after
getting the nibbles for each frame. In fact, this
part becomes the bottleneck of the decoding process.</li>
</ul>

<h2 id="recap">Recap</h2>

<ul>
  <li>If you are certain about the action for each condition, consider
using code templating (e.g., macros) rather than function calls.</li>
  <li>Benchmark, benchmark, and benchmark.</li>
</ul>
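<p>As an illustration of the first point, here is a minimal sketch of macro-based
code templating. It is not taken from libmseed: the macro
<code class="language-plaintext highlighter-rouge">UNPACK_FIELDS</code> and the function
<code class="language-plaintext highlighter-rouge">decode_nibble3()</code> are hypothetical, and the
field layout (5 x 6, 6 x 5, and 7 x 4 bits, most significant field first)
follows the tables in the for loop attempt.</p>

```c
#include <stdint.h>

/* Hypothetical macro: unpack `count` signed fields of `bits` bits each from
 * `word`, most significant field first, appending them to out[] at index n. */
#define UNPACK_FIELDS(word, count, bits, out, n)                              \
    do {                                                                      \
        for (int i_ = 0; i_ < (count); i_++) {                                \
            int shift_ = ((count) - 1 - i_) * (bits);                         \
            int32_t m_ = (int32_t)(1U << ((bits) - 1));                       \
            int32_t t_ =                                                      \
                (int32_t)(((word) >> shift_) & ((1U << (bits)) - 1U));        \
            (out)[(n)++] = (t_ ^ m_) - m_; /* sign-extend the field */        \
        }                                                                     \
    } while (0)

/* Decode the nibble=3 sub-cases. The macro body is expanded in each case,
 * so count and bits are compile-time constants at every expansion. */
int decode_nibble3(uint32_t word, int dnib, int32_t *out)
{
    int n = 0;
    switch (dnib & 0x03) {
    case 0: UNPACK_FIELDS(word, 5, 6, out, n); break;
    case 1: UNPACK_FIELDS(word, 6, 5, out, n); break;
    case 2: UNPACK_FIELDS(word, 7, 4, out, n); break;
    default: break; /* dnib=3 carries no values in the tables above */
    }
    return n;
}
```

<p>Whether this actually beats a given alternative still has to be measured,
which is the second point.</p>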

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:2" role="doc-endnote">
      <p>https://www.fdsn.org/pdf/SEEDManual_V2.4.pdf <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>https://github.com/EarthScope/libmseed/blob/07a6e2d8b4611e9f59155d5632ab24dc2598cf9f/unpackdata.c#L356 <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>https://github.com/EarthScope/libmseed/pull/102#issuecomment-1614927016 <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="[&quot;programming&quot;, &quot;seismology&quot;, &quot;seismology data format&quot;, &quot;optimziation&quot;, &quot;benchmarking&quot;]" /><category term="programmers" /><category term="C" /><category term="miniSEED" /><category term="libmseed" /><category term="benchmarking" /><category term="optimization" /><summary type="html"><![CDATA[Outline]]></summary></entry><entry><title type="html">My Remark of Using LLM with Programming in 2025</title><link href="https://cuda-chen.github.io/living/2025/12/02/my-remark-of-using-llm-with-programming-in-2025.html" rel="alternate" type="text/html" title="My Remark of Using LLM with Programming in 2025" /><published>2025-12-02T00:00:00+00:00</published><updated>2025-12-02T00:00:00+00:00</updated><id>https://cuda-chen.github.io/living/2025/12/02/my-remark-of-using-llm-with-programming-in-2025</id><content type="html" xml:base="https://cuda-chen.github.io/living/2025/12/02/my-remark-of-using-llm-with-programming-in-2025.html"><![CDATA[<blockquote>
  <p>If you use an LLM as a developer, act as a tester
and try to disprove that the solutions provided by the LLM meet your needs.
If you use an LLM as a tester, act as a developer
and prove that the test methods provided by the LLM always show that
your solution meets your needs.</p>
</blockquote>]]></content><author><name></name></author><category term="[&quot;living&quot;]" /><category term="living" /><summary type="html"><![CDATA[If you use LLM as a developer, you shall take yourself as a tester to anti-prove that the solutions provided by LLM meets your needs. If you use LLM as a tester, you shall take yourself as a developer to prove that the test methods provided by LLM always prove your solution meets your needs.]]></summary></entry><entry><title type="html">semu Contribution: Create VirtIO Sound Device Playback</title><link href="https://cuda-chen.github.io/programming/open%20source%20contribution/virtualization/2025/11/22/semu-contribution-create-virtio-sound-device-playback.html" rel="alternate" type="text/html" title="semu Contribution: Create VirtIO Sound Device Playback" /><published>2025-11-22T00:00:00+00:00</published><updated>2025-11-22T00:00:00+00:00</updated><id>https://cuda-chen.github.io/programming/open%20source%20contribution/virtualization/2025/11/22/semu-contribution-create-virtio-sound-device-playback</id><content type="html" xml:base="https://cuda-chen.github.io/programming/open%20source%20contribution/virtualization/2025/11/22/semu-contribution-create-virtio-sound-device-playback.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>semu <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> is a minimalist RISC-V system emulator that
runs a guest Linux kernel and the corresponding userland.
It uses VirtIO to access the I/O resources that reside
on the host (known as para-virtualization).</p>

<p>VirtIO specifies how the guest interacts with the I/O resources
that reside on the host, and sound is no
exception. Created by OpenSynergy <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup>,
the sound specification lets a guest OS use sound resources
in the automotive field, where applications usually reside
in isolated environments and full virtualization
becomes the bottleneck of I/O transmission.</p>

<p>In this post, I describe my contribution:
the very first VirtIO sound device playback on the planet
for a RISC-V system emulator that uses MMIO as
its interrupt basis.</p>

<h2 id="goal">Goal</h2>

<p>This contribution aims for these goals:</p>

<ul>
  <li>Create a VirtIO sound device playback.</li>
  <li>Support Linux and macOS hosts.</li>
</ul>

<p>The whole contribution
can be viewed here: <a href="https://github.com/sysprog21/semu/pull/53">https://github.com/sysprog21/semu/pull/53</a></p>

<h2 id="implementation">Implementation</h2>

<h3 id="prepareing-environment">Preparing the Environment</h3>

<p>As the guest OS is Linux, you have to enable ALSA <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">3</a></sup>
and the VirtIO sound driver in the Linux kernel build configuration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ALSA requires System V IPC
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y

CONFIG_SOUND=y
CONFIG_SND=y
CONFIG_SND_VIRTIO=y
</code></pre></div></div>

<p>Furthermore, you have to install some ALSA utilities for testing
the playback. Taking the <code class="language-plaintext highlighter-rouge">buildroot</code> settings as an example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BR2_PACKAGE_ALSA_UTILS=y
BR2_PACKAGE_ALSA_UTILS_APLAY=y
BR2_PACKAGE_ALSA_UTILS_SPEAKER_TEST=y
</code></pre></div></div>

<h3 id="initialization">Initialization</h3>

<p>The initialization setup is straightforward: follow what the
specification tells you to do.</p>

<p>If the initialization is set up correctly, you will see
messages like these during boot:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[    4.011962] ALSA device list:
[    4.015962]   #0: VirtIO SoundCard at platform/f4400000.virtio/virtio2
</code></pre></div></div>

<h3 id="playing-sounds">Playing Sounds</h3>

<h4 id="how-the-driver-sends-pcm-frames-to-device">How the Driver Sends PCM Frames to Device</h4>

<p>Before we let the device play sound, we need to understand
how the driver sends PCM frames. From my observation of Linux kernel v6.7,
its sound driver does the following:</p>

<ol>
  <li>Send PREPARE command.
    <ul>
      <li>In the meantime, the sound driver sends PCM frames for pre-buffering.</li>
    </ul>
  </li>
  <li>Send START command to start playing.</li>
  <li>Send STOP command to stop playing.
    <ul>
      <li>Meanwhile, the sound driver stops sending PCM frames.</li>
    </ul>
  </li>
  <li>Send RELEASE command to release the stream.</li>
</ol>
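<p>The observed command order can be modeled as a tiny state machine. The sketch
below is only an illustration of the four steps listed above: the enum and
function names are hypothetical, and the VirtIO specification defines its own
request codes and allows more transitions than this strict sequence.</p>

```c
#include <stdbool.h>

/* Hypothetical names for this sketch, not the identifiers from the
 * VirtIO specification or from semu. */
typedef enum { CMD_PREPARE, CMD_START, CMD_STOP, CMD_RELEASE } pcm_cmd_t;
typedef enum { ST_IDLE, ST_PREPARED, ST_RUNNING, ST_STOPPED } pcm_state_t;

/* Apply one command. Returns false and leaves *st unchanged when the
 * command does not follow the observed order. */
bool pcm_step(pcm_state_t *st, pcm_cmd_t cmd)
{
    switch (cmd) {
    case CMD_PREPARE:
        if (*st != ST_IDLE) return false;
        *st = ST_PREPARED;
        return true;
    case CMD_START:
        if (*st != ST_PREPARED) return false;
        *st = ST_RUNNING;
        return true;
    case CMD_STOP:
        if (*st != ST_RUNNING) return false;
        *st = ST_STOPPED;
        return true;
    case CMD_RELEASE:
        if (*st != ST_STOPPED) return false;
        *st = ST_IDLE;
        return true;
    }
    return false;
}

/* PCM frames arrive from PREPARE (pre-buffering) until STOP. */
bool pcm_accepts_frames(pcm_state_t st)
{
    return st == ST_PREPARED || st == ST_RUNNING;
}
```

<p>The point of <code class="language-plaintext highlighter-rouge">pcm_accepts_frames()</code> is that
frames already arrive before START, which is why the device has to buffer them.</p>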

<p>As such, the implementation needs the following:</p>

<ol>
  <li>A multi-threaded model that serves control and TX events at the same time.</li>
  <li>A queue to store PCM frames.</li>
</ol>

<h4 id="threading-model">Threading Model</h4>

<p>My threading model is shown in the ASCII art below:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># originally generated by Google Nano Banana Pro with Gemini 3
# then edited by me

+-----------------------------------------------------------------------------+
|                             THREADING MODEL                                 |
+-----------------------------------------------------------------------------+
|                                                                             |
| [PRODUCER: TX THREAD]                        [CONSUMER: CALLBACK THREAD OF  |
|                                                         SOUND BACKEND]      |
|                                                                             |
|    +==========+                                                             |
|    | TX virtq |                                                             |
|    +==========+                                                             |
|        |                                                                    |
|        v (1) Fetch PCM Frames                                               |
|  +-------------+                                                            |
|  |   TX-THRD   |                                          +---------------+ |
|  |             |                                          | CALLBACK-THRD | |
|  | [Accumulate]|                                          |               | |
|  |      |      |                                          |   [WAITING]   | |
|  |   &lt;Check&gt;   |                                          |   (Blocked    | |
|  | Period Size |                                          |    on CV)     | |
|  |   Reached?  |                                          |               | |
|  +------+------+                                          +------+--------+ |
|         |                                                        ^          |
|         | (Yes: Batch Ready)                                     |          |
|         |                                                        |          |
|         | (2) Enqueue Batch                                      |          |
|         v                    +===============+                   |          |
|         +-------------------&gt;|     QUEUE     |                   |          |
|                              +===============+                   |          |
|                                      |                           |          |
|         | (3) SEND NOTIFICATION      | (4) Data Available        |          |
|         |     (CV Signal)            +--------------------------&gt;|          |
|         v                                                        |          |
|       ( ! ) - - - - - - - - - - - - - - - - - - - - - - - - - &gt; ( ! )       |
|                                                                  |          |
|                                                      (5) Wake Up &amp; Read     |
|                                                                  v          |
|                                                           +-------------+   |
|                                                           |SOUND BACKEND|   |
|                                                           +-------------+   |
|                                                                             |
+-----------------------------------------------------------------------------+
</code></pre></div></div>

<p>I would like to make a few remarks on this threading model:</p>

<ol>
  <li>A condition variable (CV) suffices for lightweight synchronization.</li>
  <li>As the driver always sends PCM frames in whole periods (<em>period_bytes</em>, specifically;
the only exception is the end of the stream), the producer
notifies the consumer once it has received a whole period.</li>
  <li>As the PCM frames are sent at the same time in the PREPARE and START states,
using multi-threading instead of some kind of lock-free design such as the one in DPDK <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">4</a></sup>
reduces the complexity.</li>
  <li>For the thread implementation, I chose Pthreads because we need
cross-platform compatibility.</li>
</ol>
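<p>The notification path in the diagram (enqueue, CV signal, wake up and read) can be sketched as below.
This is only a minimal illustration under my own assumptions, not the code in semu: the names
<code class="language-plaintext highlighter-rouge">period_queue_t</code>, <code class="language-plaintext highlighter-rouge">queue_push</code>,
<code class="language-plaintext highlighter-rouge">queue_pop</code>, <code class="language-plaintext highlighter-rouge">queue_close</code>
and the queue depth are all hypothetical, and each queue entry stands for one full period.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 8 /* hypothetical queue depth, counted in periods */

/* Hypothetical queue shared by the TX thread and the callback thread. */
typedef struct {
    int periods[QUEUE_CAP]; /* one entry per completed period */
    size_t head, tail, count;
    bool done;               /* set once the stream has ended */
    pthread_mutex_t lock;
    pthread_cond_t readable; /* signalled whenever a period is enqueued */
} period_queue_t;

static void queue_init(period_queue_t *q)
{
    q->head = q->tail = q->count = 0;
    q->done = false;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->readable, NULL);
}

/* TX thread: enqueue one full period, then notify the consumer.
 * Returns false if the queue is full. */
static bool queue_push(period_queue_t *q, int period_id)
{
    bool ok = false;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {
        q->periods[q->tail] = period_id;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        ok = true;
        pthread_cond_signal(&q->readable);
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* TX thread: mark the end of the stream and wake any waiting consumer. */
static void queue_close(period_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    q->done = true;
    pthread_cond_broadcast(&q->readable);
    pthread_mutex_unlock(&q->lock);
}

/* Callback thread: block on the CV until a period is available.
 * Returns false once the stream has ended and the queue is drained. */
static bool queue_pop(period_queue_t *q, int *period_id)
{
    assert(period_id != NULL);
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->done) /* loop guards against spurious wakeups */
        pthread_cond_wait(&q->readable, &q->lock);
    bool ok = q->count > 0;
    if (ok) {
        *period_id = q->periods[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}
```

<p>In this sketch the consumer would call <code class="language-plaintext highlighter-rouge">queue_pop</code> in a loop
and hand each period to the sound backend until the call returns false. Waiting in a
<code class="language-plaintext highlighter-rouge">while</code> loop rather than a single
<code class="language-plaintext highlighter-rouge">if</code> is the usual way to stay correct under spurious wakeups.</p>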

<h3 id="other-necessary-works">Other Necessary Works</h3>

<p>As the interrupt foundation of semu is MMIO, I add the following node
to the dts of semu:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>snd0: virtio@4700000 {
    compatible = "virtio,mmio";
    reg = &lt;0x4700000 0x200&gt;;
    interrupts = &lt;5&gt;;
};
</code></pre></div></div>

<h3 id="limitation">Limitation</h3>

<p>ALSA relies on the system timer. However, semu currently has some timer
issues, which cause ALSA in the guest OS to stop sending PCM frames
after a period of time.</p>

<p>Yet, there is still a chance to play the entire sound by adjusting the buffer size.
For instance, I tried setting the buffer size to eight times
the period size, and the sound played to the end (with some repeating artifacts,
though).</p>

<h2 id="wrap-up">Wrap Up</h2>

<p>This post describes the implementation of VirtIO sound device
playback. It is not only the very first implementation
on RISC-V, but also leaves a mark
given how scarce resources on the VirtIO sound device are.</p>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>https://github.com/sysprog21/semu <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>https://www.opensynergy.com/ <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>https://wiki.archlinux.org/title/Advanced_Linux_Sound_Architecture <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>https://doc.dpdk.org/guides/prog_guide/ring_lib.html <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="[&quot;programming&quot;, &quot;open source contribution&quot;, &quot;virtualization&quot;]" /><category term="programming" /><category term="C" /><category term="VirtIO" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Optimize _mm_crc32_u8 conversion in sse2neon</title><link href="https://cuda-chen.github.io/programming/open%20source%20contribution/2024/02/27/optimize-mm-crc32-u8-conversion-in-sse2neon.html" rel="alternate" type="text/html" title="Optimize _mm_crc32_u8 conversion in sse2neon" /><published>2024-02-27T00:00:00+00:00</published><updated>2024-02-27T00:00:00+00:00</updated><id>https://cuda-chen.github.io/programming/open%20source%20contribution/2024/02/27/optimize-mm-crc32-u8-conversion-in-sse2neon</id><content type="html" xml:base="https://cuda-chen.github.io/programming/open%20source%20contribution/2024/02/27/optimize-mm-crc32-u8-conversion-in-sse2neon.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In this post, I am going to walk through how I improved the <code class="language-plaintext highlighter-rouge">_mm_crc32_u8</code>
conversion as a contribution to sse2neon.</p>

<p>First, I will briefly introduce CRC32C,
which is the CRC algorithm that <code class="language-plaintext highlighter-rouge">_mm_crc32_u8</code> applies.
Then, I will show how I optimized the conversion with various methods.</p>

<h2 id="whats-crc32c">What’s CRC32C?</h2>

<p>Before explaining CRC32C, I would like to answer a question: what
is CRC (Cyclic Redundancy Check)? It is an algorithm used for error detection in networks and storage devices <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.
The sender picks a number as the divisor, then divides the message by it
to get the remainder. Next, the sender appends the remainder to the end
of the message. To verify whether the message has any errors,
the receiver divides the received message (with the remainder appended) by the same divisor. If the remainder
is not zero, the message is erroneous. The name CRC
reflects these behaviors: the check value is appended without modifying
the content of the message (redundancy), and the division is just shifting
the dividend and then subtracting (cyclic code).</p>
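<p>To make the "remainder is zero" check concrete, here is a small sketch of my own. The helper names
(<code class="language-plaintext highlighter-rouge">crc32c_update</code>, <code class="language-plaintext highlighter-rouge">crc_append</code>,
<code class="language-plaintext highlighter-rouge">crc_frame_ok</code>) are hypothetical, and the constant is the bit-reflected
CRC32C polynomial that appears later in this post. I assume an initial value of zero and no final inversion so that
the receiver sees exactly zero; real protocols usually add both.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise division by the bit-reflected CRC32C polynomial. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? ((crc >> 1) ^ UINT32_C(0x82F63B78)) : (crc >> 1);
    }
    return crc;
}

/* Sender: append the 4-byte remainder, least significant byte first.
 * The frame buffer must have room for len + 4 bytes. */
static void crc_append(uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    uint32_t rem = crc32c_update(0, frame, len);
    for (int i = 0; i < 4; i++)
        frame[len + i] = (uint8_t) (rem >> (8 * i));
}

/* Receiver: divide the whole frame (message plus remainder);
 * a zero result means no error was detected. */
static int crc_frame_ok(const uint8_t *frame, size_t total_len)
{
    return crc32c_update(0, frame, total_len) == 0;
}
```

<p>With this sketch, a 5-byte message passed through <code class="language-plaintext highlighter-rouge">crc_append</code>
becomes a 9-byte frame for which <code class="language-plaintext highlighter-rouge">crc_frame_ok</code> returns 1;
flipping any single bit of the frame makes it return 0.</p>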

<p>A CRC algorithm is called an n-bit CRC when its check value (the remainder)
is n bits long, which corresponds to a divisor of n+1 bits. Thus, CRC32C, a variant of CRC32, produces
a 32-bit binary number as the check value.</p>

<p>As a reminder, CRC32C uses the following polynomial (denoted
as P for the rest of this post), shown in two representations:</p>

<ul>
  <li>normal: 0x1EDC6F41 (0x11EDC6F41 when the implicit leading bit is written out)</li>
  <li>bit-reflected: 0x82F63B78</li>
</ul>

<p>What’s more, the implementation uses the bit-reflected form.
For the reasons behind this choice,
you can refer to <em>Fastest CRC32 for x86, Intel and AMD, + comprehensive derivation and discussion of various approaches</em> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
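<p>The two representations are the same 32 bits in opposite order. A quick sketch to confirm this
(<code class="language-plaintext highlighter-rouge">reflect32</code> and
<code class="language-plaintext highlighter-rouge">check_crc32c_polynomial</code> are hypothetical helper names of mine, not part of sse2neon):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of a 32-bit word (bit 0 <-> bit 31, and so on). */
static uint32_t reflect32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1); /* move the lowest bit of x into r */
        x >>= 1;
    }
    return r;
}

/* Reflecting the normal form should yield the bit-reflected constant. */
static void check_crc32c_polynomial(void)
{
    assert(reflect32(UINT32_C(0x1EDC6F41)) == UINT32_C(0x82F63B78));
}
```

<p>Because the reflected form processes the least significant bit first, the per-bit step becomes a right shift,
which is what the code in the next section does.</p>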

<h2 id="road-of-optimization">Road of Optimization</h2>

<p>Let’s start with the original implementation in sse2neon <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">FORCE_INLINE</span> <span class="kt">uint32_t</span> <span class="nf">_mm_crc32_u8</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">crc</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">crc</span> <span class="o">^=</span> <span class="n">v</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">bit</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">bit</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">bit</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span>
            <span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0x82f63b78</span><span class="p">);</span>
        <span class="k">else</span>
            <span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">crc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="apply-ternany-operator">apply ternary operator</h3>

<p>Modern compilers can often turn the ternary operator into
a conditional move to avoid branching. As a consequence, we can
rewrite the <code class="language-plaintext highlighter-rouge">if...else</code> statement with the ternary operator:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">bit</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">bit</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">bit</span><span class="o">++</span><span class="p">)</span>
    <span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">?</span> <span class="p">((</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0x82f63b78</span><span class="p">))</span> <span class="o">:</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>

<p>However, as mentioned by the reviewer <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>, we should come up
with another way to utilize the power of NEON.</p>

<h3 id="tabular-method">tabular method</h3>

<p>Observe the following implementation for calculating CRC32C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">bit</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">bit</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">bit</span><span class="o">++</span><span class="p">)</span>
    <span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">?</span> <span class="p">((</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0x82f63b78</span><span class="p">))</span> <span class="o">:</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>

<p>You can see that which shifted copies of P get XOR’d in is uniquely
determined by the rightmost 8 bits of <code class="language-plaintext highlighter-rouge">crc</code>. Thus, we can rewrite the calculation
procedure as follows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// A, B, C, D, ... each stand for</span>
<span class="c1">// either 0 or the polynomial P.</span>

<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">A</span>
<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">B</span>
<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">C</span>
<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">D</span>
<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">E</span>
<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">F</span>
<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">G</span>
<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">H</span>
</code></pre></div></div>

<p>We then rewrite the above procedure to a single expression:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(((((((((((((((</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">A</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">B</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">C</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">D</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">E</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">F</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">G</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">H</span>
</code></pre></div></div>

<p>Distribute the shifts to simplify the expression:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">A</span> <span class="o">&gt;&gt;</span> <span class="mi">7</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">B</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">C</span> <span class="o">&gt;&gt;</span> <span class="mi">5</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">D</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">E</span> <span class="o">&gt;&gt;</span> <span class="mi">3</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">F</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">G</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="n">H</span>
</code></pre></div></div>

<p>Then, combine all the terms from A to H into a single value T:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">^</span> <span class="n">T</span>
</code></pre></div></div>

<p>We can precompute the value of T because it can take only
256 distinct values (recall that it depends solely on the rightmost
8 bits):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Adopted from qemu: https://github.com/qemu/qemu/blob/907209e3111dd62a553a19319b422ff8aba5b9c0/util/crc32c.c#L40</span>

<span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">_sse2neon_crc32_tbl</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mh">0x00000000</span><span class="p">,</span> <span class="mh">0xF26B8303</span><span class="p">,</span> <span class="mh">0xE13B70F7</span><span class="p">,</span> <span class="mh">0x1350F3F4</span><span class="p">,</span>
    <span class="mh">0xC79A971F</span><span class="p">,</span> <span class="mh">0x35F1141C</span><span class="p">,</span> <span class="mh">0x26A1E7E8</span><span class="p">,</span> <span class="mh">0xD4CA64EB</span><span class="p">,</span>
    <span class="mh">0x8AD958CF</span><span class="p">,</span> <span class="mh">0x78B2DBCC</span><span class="p">,</span> <span class="mh">0x6BE22838</span><span class="p">,</span> <span class="mh">0x9989AB3B</span><span class="p">,</span>
    <span class="mh">0x4D43CFD0</span><span class="p">,</span> <span class="mh">0xBF284CD3</span><span class="p">,</span> <span class="mh">0xAC78BF27</span><span class="p">,</span> <span class="mh">0x5E133C24</span><span class="p">,</span>
    <span class="mh">0x105EC76F</span><span class="p">,</span> <span class="mh">0xE235446C</span><span class="p">,</span> <span class="mh">0xF165B798</span><span class="p">,</span> <span class="mh">0x030E349B</span><span class="p">,</span>
    <span class="mh">0xD7C45070</span><span class="p">,</span> <span class="mh">0x25AFD373</span><span class="p">,</span> <span class="mh">0x36FF2087</span><span class="p">,</span> <span class="mh">0xC494A384</span><span class="p">,</span>
    <span class="mh">0x9A879FA0</span><span class="p">,</span> <span class="mh">0x68EC1CA3</span><span class="p">,</span> <span class="mh">0x7BBCEF57</span><span class="p">,</span> <span class="mh">0x89D76C54</span><span class="p">,</span>
    <span class="mh">0x5D1D08BF</span><span class="p">,</span> <span class="mh">0xAF768BBC</span><span class="p">,</span> <span class="mh">0xBC267848</span><span class="p">,</span> <span class="mh">0x4E4DFB4B</span><span class="p">,</span>
    <span class="mh">0x20BD8EDE</span><span class="p">,</span> <span class="mh">0xD2D60DDD</span><span class="p">,</span> <span class="mh">0xC186FE29</span><span class="p">,</span> <span class="mh">0x33ED7D2A</span><span class="p">,</span>
    <span class="mh">0xE72719C1</span><span class="p">,</span> <span class="mh">0x154C9AC2</span><span class="p">,</span> <span class="mh">0x061C6936</span><span class="p">,</span> <span class="mh">0xF477EA35</span><span class="p">,</span>
    <span class="mh">0xAA64D611</span><span class="p">,</span> <span class="mh">0x580F5512</span><span class="p">,</span> <span class="mh">0x4B5FA6E6</span><span class="p">,</span> <span class="mh">0xB93425E5</span><span class="p">,</span>
    <span class="mh">0x6DFE410E</span><span class="p">,</span> <span class="mh">0x9F95C20D</span><span class="p">,</span> <span class="mh">0x8CC531F9</span><span class="p">,</span> <span class="mh">0x7EAEB2FA</span><span class="p">,</span>
    <span class="mh">0x30E349B1</span><span class="p">,</span> <span class="mh">0xC288CAB2</span><span class="p">,</span> <span class="mh">0xD1D83946</span><span class="p">,</span> <span class="mh">0x23B3BA45</span><span class="p">,</span>
    <span class="mh">0xF779DEAE</span><span class="p">,</span> <span class="mh">0x05125DAD</span><span class="p">,</span> <span class="mh">0x1642AE59</span><span class="p">,</span> <span class="mh">0xE4292D5A</span><span class="p">,</span>
    <span class="mh">0xBA3A117E</span><span class="p">,</span> <span class="mh">0x4851927D</span><span class="p">,</span> <span class="mh">0x5B016189</span><span class="p">,</span> <span class="mh">0xA96AE28A</span><span class="p">,</span>
    <span class="mh">0x7DA08661</span><span class="p">,</span> <span class="mh">0x8FCB0562</span><span class="p">,</span> <span class="mh">0x9C9BF696</span><span class="p">,</span> <span class="mh">0x6EF07595</span><span class="p">,</span>
    <span class="mh">0x417B1DBC</span><span class="p">,</span> <span class="mh">0xB3109EBF</span><span class="p">,</span> <span class="mh">0xA0406D4B</span><span class="p">,</span> <span class="mh">0x522BEE48</span><span class="p">,</span>
    <span class="mh">0x86E18AA3</span><span class="p">,</span> <span class="mh">0x748A09A0</span><span class="p">,</span> <span class="mh">0x67DAFA54</span><span class="p">,</span> <span class="mh">0x95B17957</span><span class="p">,</span>
    <span class="mh">0xCBA24573</span><span class="p">,</span> <span class="mh">0x39C9C670</span><span class="p">,</span> <span class="mh">0x2A993584</span><span class="p">,</span> <span class="mh">0xD8F2B687</span><span class="p">,</span>
    <span class="mh">0x0C38D26C</span><span class="p">,</span> <span class="mh">0xFE53516F</span><span class="p">,</span> <span class="mh">0xED03A29B</span><span class="p">,</span> <span class="mh">0x1F682198</span><span class="p">,</span>
    <span class="mh">0x5125DAD3</span><span class="p">,</span> <span class="mh">0xA34E59D0</span><span class="p">,</span> <span class="mh">0xB01EAA24</span><span class="p">,</span> <span class="mh">0x42752927</span><span class="p">,</span>
    <span class="mh">0x96BF4DCC</span><span class="p">,</span> <span class="mh">0x64D4CECF</span><span class="p">,</span> <span class="mh">0x77843D3B</span><span class="p">,</span> <span class="mh">0x85EFBE38</span><span class="p">,</span>
    <span class="mh">0xDBFC821C</span><span class="p">,</span> <span class="mh">0x2997011F</span><span class="p">,</span> <span class="mh">0x3AC7F2EB</span><span class="p">,</span> <span class="mh">0xC8AC71E8</span><span class="p">,</span>
    <span class="mh">0x1C661503</span><span class="p">,</span> <span class="mh">0xEE0D9600</span><span class="p">,</span> <span class="mh">0xFD5D65F4</span><span class="p">,</span> <span class="mh">0x0F36E6F7</span><span class="p">,</span>
    <span class="mh">0x61C69362</span><span class="p">,</span> <span class="mh">0x93AD1061</span><span class="p">,</span> <span class="mh">0x80FDE395</span><span class="p">,</span> <span class="mh">0x72966096</span><span class="p">,</span>
    <span class="mh">0xA65C047D</span><span class="p">,</span> <span class="mh">0x5437877E</span><span class="p">,</span> <span class="mh">0x4767748A</span><span class="p">,</span> <span class="mh">0xB50CF789</span><span class="p">,</span>
    <span class="mh">0xEB1FCBAD</span><span class="p">,</span> <span class="mh">0x197448AE</span><span class="p">,</span> <span class="mh">0x0A24BB5A</span><span class="p">,</span> <span class="mh">0xF84F3859</span><span class="p">,</span>
    <span class="mh">0x2C855CB2</span><span class="p">,</span> <span class="mh">0xDEEEDFB1</span><span class="p">,</span> <span class="mh">0xCDBE2C45</span><span class="p">,</span> <span class="mh">0x3FD5AF46</span><span class="p">,</span>
    <span class="mh">0x7198540D</span><span class="p">,</span> <span class="mh">0x83F3D70E</span><span class="p">,</span> <span class="mh">0x90A324FA</span><span class="p">,</span> <span class="mh">0x62C8A7F9</span><span class="p">,</span>
    <span class="mh">0xB602C312</span><span class="p">,</span> <span class="mh">0x44694011</span><span class="p">,</span> <span class="mh">0x5739B3E5</span><span class="p">,</span> <span class="mh">0xA55230E6</span><span class="p">,</span>
    <span class="mh">0xFB410CC2</span><span class="p">,</span> <span class="mh">0x092A8FC1</span><span class="p">,</span> <span class="mh">0x1A7A7C35</span><span class="p">,</span> <span class="mh">0xE811FF36</span><span class="p">,</span>
    <span class="mh">0x3CDB9BDD</span><span class="p">,</span> <span class="mh">0xCEB018DE</span><span class="p">,</span> <span class="mh">0xDDE0EB2A</span><span class="p">,</span> <span class="mh">0x2F8B6829</span><span class="p">,</span>
    <span class="mh">0x82F63B78</span><span class="p">,</span> <span class="mh">0x709DB87B</span><span class="p">,</span> <span class="mh">0x63CD4B8F</span><span class="p">,</span> <span class="mh">0x91A6C88C</span><span class="p">,</span>
    <span class="mh">0x456CAC67</span><span class="p">,</span> <span class="mh">0xB7072F64</span><span class="p">,</span> <span class="mh">0xA457DC90</span><span class="p">,</span> <span class="mh">0x563C5F93</span><span class="p">,</span>
    <span class="mh">0x082F63B7</span><span class="p">,</span> <span class="mh">0xFA44E0B4</span><span class="p">,</span> <span class="mh">0xE9141340</span><span class="p">,</span> <span class="mh">0x1B7F9043</span><span class="p">,</span>
    <span class="mh">0xCFB5F4A8</span><span class="p">,</span> <span class="mh">0x3DDE77AB</span><span class="p">,</span> <span class="mh">0x2E8E845F</span><span class="p">,</span> <span class="mh">0xDCE5075C</span><span class="p">,</span>
    <span class="mh">0x92A8FC17</span><span class="p">,</span> <span class="mh">0x60C37F14</span><span class="p">,</span> <span class="mh">0x73938CE0</span><span class="p">,</span> <span class="mh">0x81F80FE3</span><span class="p">,</span>
    <span class="mh">0x55326B08</span><span class="p">,</span> <span class="mh">0xA759E80B</span><span class="p">,</span> <span class="mh">0xB4091BFF</span><span class="p">,</span> <span class="mh">0x466298FC</span><span class="p">,</span>
    <span class="mh">0x1871A4D8</span><span class="p">,</span> <span class="mh">0xEA1A27DB</span><span class="p">,</span> <span class="mh">0xF94AD42F</span><span class="p">,</span> <span class="mh">0x0B21572C</span><span class="p">,</span>
    <span class="mh">0xDFEB33C7</span><span class="p">,</span> <span class="mh">0x2D80B0C4</span><span class="p">,</span> <span class="mh">0x3ED04330</span><span class="p">,</span> <span class="mh">0xCCBBC033</span><span class="p">,</span>
    <span class="mh">0xA24BB5A6</span><span class="p">,</span> <span class="mh">0x502036A5</span><span class="p">,</span> <span class="mh">0x4370C551</span><span class="p">,</span> <span class="mh">0xB11B4652</span><span class="p">,</span>
    <span class="mh">0x65D122B9</span><span class="p">,</span> <span class="mh">0x97BAA1BA</span><span class="p">,</span> <span class="mh">0x84EA524E</span><span class="p">,</span> <span class="mh">0x7681D14D</span><span class="p">,</span>
    <span class="mh">0x2892ED69</span><span class="p">,</span> <span class="mh">0xDAF96E6A</span><span class="p">,</span> <span class="mh">0xC9A99D9E</span><span class="p">,</span> <span class="mh">0x3BC21E9D</span><span class="p">,</span>
    <span class="mh">0xEF087A76</span><span class="p">,</span> <span class="mh">0x1D63F975</span><span class="p">,</span> <span class="mh">0x0E330A81</span><span class="p">,</span> <span class="mh">0xFC588982</span><span class="p">,</span>
    <span class="mh">0xB21572C9</span><span class="p">,</span> <span class="mh">0x407EF1CA</span><span class="p">,</span> <span class="mh">0x532E023E</span><span class="p">,</span> <span class="mh">0xA145813D</span><span class="p">,</span>
    <span class="mh">0x758FE5D6</span><span class="p">,</span> <span class="mh">0x87E466D5</span><span class="p">,</span> <span class="mh">0x94B49521</span><span class="p">,</span> <span class="mh">0x66DF1622</span><span class="p">,</span>
    <span class="mh">0x38CC2A06</span><span class="p">,</span> <span class="mh">0xCAA7A905</span><span class="p">,</span> <span class="mh">0xD9F75AF1</span><span class="p">,</span> <span class="mh">0x2B9CD9F2</span><span class="p">,</span>
    <span class="mh">0xFF56BD19</span><span class="p">,</span> <span class="mh">0x0D3D3E1A</span><span class="p">,</span> <span class="mh">0x1E6DCDEE</span><span class="p">,</span> <span class="mh">0xEC064EED</span><span class="p">,</span>
    <span class="mh">0xC38D26C4</span><span class="p">,</span> <span class="mh">0x31E6A5C7</span><span class="p">,</span> <span class="mh">0x22B65633</span><span class="p">,</span> <span class="mh">0xD0DDD530</span><span class="p">,</span>
    <span class="mh">0x0417B1DB</span><span class="p">,</span> <span class="mh">0xF67C32D8</span><span class="p">,</span> <span class="mh">0xE52CC12C</span><span class="p">,</span> <span class="mh">0x1747422F</span><span class="p">,</span>
    <span class="mh">0x49547E0B</span><span class="p">,</span> <span class="mh">0xBB3FFD08</span><span class="p">,</span> <span class="mh">0xA86F0EFC</span><span class="p">,</span> <span class="mh">0x5A048DFF</span><span class="p">,</span>
    <span class="mh">0x8ECEE914</span><span class="p">,</span> <span class="mh">0x7CA56A17</span><span class="p">,</span> <span class="mh">0x6FF599E3</span><span class="p">,</span> <span class="mh">0x9D9E1AE0</span><span class="p">,</span>
    <span class="mh">0xD3D3E1AB</span><span class="p">,</span> <span class="mh">0x21B862A8</span><span class="p">,</span> <span class="mh">0x32E8915C</span><span class="p">,</span> <span class="mh">0xC083125F</span><span class="p">,</span>
    <span class="mh">0x144976B4</span><span class="p">,</span> <span class="mh">0xE622F5B7</span><span class="p">,</span> <span class="mh">0xF5720643</span><span class="p">,</span> <span class="mh">0x07198540</span><span class="p">,</span>
    <span class="mh">0x590AB964</span><span class="p">,</span> <span class="mh">0xAB613A67</span><span class="p">,</span> <span class="mh">0xB831C993</span><span class="p">,</span> <span class="mh">0x4A5A4A90</span><span class="p">,</span>
    <span class="mh">0x9E902E7B</span><span class="p">,</span> <span class="mh">0x6CFBAD78</span><span class="p">,</span> <span class="mh">0x7FAB5E8C</span><span class="p">,</span> <span class="mh">0x8DC0DD8F</span><span class="p">,</span>
    <span class="mh">0xE330A81A</span><span class="p">,</span> <span class="mh">0x115B2B19</span><span class="p">,</span> <span class="mh">0x020BD8ED</span><span class="p">,</span> <span class="mh">0xF0605BEE</span><span class="p">,</span>
    <span class="mh">0x24AA3F05</span><span class="p">,</span> <span class="mh">0xD6C1BC06</span><span class="p">,</span> <span class="mh">0xC5914FF2</span><span class="p">,</span> <span class="mh">0x37FACCF1</span><span class="p">,</span>
    <span class="mh">0x69E9F0D5</span><span class="p">,</span> <span class="mh">0x9B8273D6</span><span class="p">,</span> <span class="mh">0x88D28022</span><span class="p">,</span> <span class="mh">0x7AB90321</span><span class="p">,</span>
    <span class="mh">0xAE7367CA</span><span class="p">,</span> <span class="mh">0x5C18E4C9</span><span class="p">,</span> <span class="mh">0x4F48173D</span><span class="p">,</span> <span class="mh">0xBD23943E</span><span class="p">,</span>
    <span class="mh">0xF36E6F75</span><span class="p">,</span> <span class="mh">0x0105EC76</span><span class="p">,</span> <span class="mh">0x12551F82</span><span class="p">,</span> <span class="mh">0xE03E9C81</span><span class="p">,</span>
    <span class="mh">0x34F4F86A</span><span class="p">,</span> <span class="mh">0xC69F7B69</span><span class="p">,</span> <span class="mh">0xD5CF889D</span><span class="p">,</span> <span class="mh">0x27A40B9E</span><span class="p">,</span>
    <span class="mh">0x79B737BA</span><span class="p">,</span> <span class="mh">0x8BDCB4B9</span><span class="p">,</span> <span class="mh">0x988C474D</span><span class="p">,</span> <span class="mh">0x6AE7C44E</span><span class="p">,</span>
    <span class="mh">0xBE2DA0A5</span><span class="p">,</span> <span class="mh">0x4C4623A6</span><span class="p">,</span> <span class="mh">0x5F16D052</span><span class="p">,</span> <span class="mh">0xAD7D5351</span><span class="p">,</span> 
<span class="p">};</span>

<span class="n">FORCE_INLINE</span> <span class="kt">uint32_t</span> <span class="nf">_mm_crc32_u8</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">crc</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">crc</span> <span class="o">^=</span> <span class="n">v</span><span class="p">;</span>
    <span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">^</span> <span class="n">_sse2neon_crc32_tbl</span><span class="p">[</span><span class="n">crc</span> <span class="o">&amp;</span> <span class="mh">0xFF</span><span class="p">];</span>
    <span class="k">return</span> <span class="n">crc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, the reviewer asked me not to use this approach as the table costs 1 KiB of space <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>,
<del>which, from my point of view, is costly on an embedded system such
as a Raspberry Pi.
Therefore, we have to come up with another tabular solution
that balances performance and space.</del></p>

<blockquote>
  <p>Edit: as pointed out in the reply in <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>, the table used
in this implementation (256 entries of 4 B each, 1 KiB in total) spans 16 cache lines,
as the cache line size of most CPU architectures is 64 B.
Hence, we ought to find a solution whose pre-computed
values all fit into a single cache line.</p>
</blockquote>

<h3 id="tabular-method-half-byte">tabular method (half-byte)</h3>

<p>As mentioned in <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>, we can break the whole 8-bit table look-up into two consecutive 4-bit table look-ups:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">FORCE_INLINE</span> <span class="kt">uint32_t</span> <span class="nf">_mm_crc32_u8</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">crc</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">crc</span> <span class="o">^=</span> <span class="n">v</span><span class="p">;</span>
	<span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">crc32_half_byte_tbl</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
	    <span class="mh">0x00000000</span><span class="p">,</span> <span class="mh">0x105ec76f</span><span class="p">,</span> <span class="mh">0x20bd8ede</span><span class="p">,</span> <span class="mh">0x30e349b1</span><span class="p">,</span> <span class="mh">0x417b1dbc</span><span class="p">,</span> <span class="mh">0x5125dad3</span><span class="p">,</span>
	    <span class="mh">0x61c69362</span><span class="p">,</span> <span class="mh">0x7198540d</span><span class="p">,</span> <span class="mh">0x82f63b78</span><span class="p">,</span> <span class="mh">0x92a8fc17</span><span class="p">,</span> <span class="mh">0xa24bb5a6</span><span class="p">,</span> <span class="mh">0xb21572c9</span><span class="p">,</span>
	    <span class="mh">0xc38d26c4</span><span class="p">,</span> <span class="mh">0xd3d3e1ab</span><span class="p">,</span> <span class="mh">0xe330a81a</span><span class="p">,</span> <span class="mh">0xf36e6f75</span><span class="p">,</span>
	<span class="p">};</span>
	
	<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">^</span> <span class="n">crc32_half_byte_tbl</span><span class="p">[</span><span class="n">crc</span> <span class="o">&amp;</span> <span class="mh">0x0F</span><span class="p">];</span>
	<span class="n">crc</span> <span class="o">=</span> <span class="p">(</span><span class="n">crc</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">^</span> <span class="n">crc32_half_byte_tbl</span><span class="p">[</span><span class="n">crc</span> <span class="o">&amp;</span> <span class="mh">0x0F</span><span class="p">];</span>
	<span class="k">return</span> <span class="n">crc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The look-up table just needs to hold every 16th entry of the one-byte table,
thus 16 entries taking only 64 B of space!
Though this introduces an additional look-up that depends on the result of the first one,
and thus cannot fully benefit from out-of-order execution in modern CPUs, I think it is
an acceptable compromise as all the pre-computed values
fit into one cache line.</p>
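
<p>To see where the 16 values come from, here is a minimal sketch (not the code merged into sse2neon) which rebuilds the half-byte table from the polynomial and runs the two 4-bit look-ups over a whole buffer. The constant <code class="language-plaintext highlighter-rouge">0x82F63B78</code> (the bit-reflected form of the CRC32C polynomial <code class="language-plaintext highlighter-rouge">0x1EDC6F41</code>) and all helper names are assumptions made for illustration only.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Assumption: bit-reflected form of the CRC32C polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Hypothetical helper: entry i is the result of pushing the 4-bit value i
 * through four single-bit CRC steps. */
static void crc32c_make_half_byte_table(uint32_t tbl[16])
{
    for (uint32_t i = 0; i &lt; 16; i++) {
        uint32_t crc = i;
        for (int bit = 0; bit &lt; 4; bit++)
            crc = (crc &gt;&gt; 1) ^ ((crc &amp; 1u) ? CRC32C_POLY_REFLECTED : 0u);
        tbl[i] = crc;
    }
}

/* Same update as the half-byte method above: two dependent 4-bit look-ups. */
static uint32_t crc32c_u8_half_byte(const uint32_t tbl[16], uint32_t crc, uint8_t v)
{
    crc ^= v;
    crc = (crc &gt;&gt; 4) ^ tbl[crc &amp; 0x0F];
    crc = (crc &gt;&gt; 4) ^ tbl[crc &amp; 0x0F];
    return crc;
}

/* Hypothetical wrapper: CRC32C of a buffer, with the usual initial value
 * and final XOR of 0xFFFFFFFF. */
static uint32_t crc32c_buffer(const void *data, size_t len)
{
    uint32_t tbl[16];
    crc32c_make_half_byte_table(tbl);

    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i &lt; len; i++)
        crc = crc32c_u8_half_byte(tbl, crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}
</code></pre></div></div>

<p>The generated entries should match the 16 constants listed above, and <code class="language-plaintext highlighter-rouge">crc32c_buffer("123456789", 9)</code> should give <code class="language-plaintext highlighter-rouge">0xE3069283</code>, the commonly published CRC32C check value.</p>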

<h3 id="using-arm-cryptography-extension">using Arm Cryptography Extension</h3>

<p>Though the tabular method performs well, we always have to make a trade-off between performance
and space: for better performance, such as avoiding the loop dependency, we ought to
use more space to store the look-up table values; whilst if we reduce the space for better
memory usage, we cannot avoid the loop dependency, as shown in the <em>tabular method (half-byte)</em> section.</p>

<p>The Arm Cryptography Extension provides certain operations which we can utilize so that
we don’t need to store a look-up table. Before using the Arm Cryptography Extension,
I would like to introduce Barrett reduction as it is the bedrock of further
optimizing the CRC calculation.</p>

<h4 id="barrett-reduction-">Barrett reduction <sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></h4>

<p>Recall that the foundation of CRC is to do polynomial division on a message with
a certain polynomial in order to get the remainder <sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. As division is an expensive operation on a computer,
we can replace the division with a multiplication by the multiplicative inverse of the polynomial,
and this is the core concept of Barrett reduction.</p>

<p>So, to get the CRC of message \(a\) with polynomial \(p\):</p>

\[a \mod p = a - \lfloor sa \rfloor p\]

<p>where \(s = 1/p\)</p>

<p>In practice, we can approximate \(1/p\) with a value \(m/2^k\), as division by \(2^k\)
is merely a right shift by \(k\) bits.</p>
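
<p>To make the idea concrete with ordinary integers before moving on to polynomials, the following sketch reduces a 32-bit value with \(k=32\) and \(m = \lfloor 2^{32}/p \rfloor\). The choice of \(k=32\) (so that the product fits into 64 bits) and the helper names are assumptions for illustration. Because \(m/2^k\) slightly underestimates \(1/p\), the estimated quotient can be one too small, so this integer version ends with one conditional subtraction.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Hypothetical helper: m = floor(2^32 / p), computed once per modulus.
 * Requires p &gt;= 2 so that m fits into 32 bits. */
static uint32_t barrett_precompute(uint32_t p)
{
    return (uint32_t) ((UINT64_C(1) &lt;&lt; 32) / p);
}

/* Hypothetical helper: a mod p without a division in the hot path. */
static uint32_t barrett_reduce(uint32_t a, uint32_t p, uint32_t m)
{
    /* q approximates floor(a / p); it is either exact or one too small. */
    uint32_t q = (uint32_t) (((uint64_t) a * m) &gt;&gt; 32);
    uint32_t r = a - q * p;
    /* Correction step for the case where q was one too small. */
    if (r &gt;= p)
        r -= p;
    return r;
}
</code></pre></div></div>

<p>For example, with <code class="language-plaintext highlighter-rouge">m = barrett_precompute(97)</code>, calling <code class="language-plaintext highlighter-rouge">barrett_reduce(a, 97, m)</code> should agree with <code class="language-plaintext highlighter-rouge">a % 97</code> for any 32-bit <code class="language-plaintext highlighter-rouge">a</code>.</p>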

<p>I set \(k=64\) in my implementation as this is usually enough, and we can
pre-calculate \(m = \lfloor 2^{64}/p \rfloor\). Thanks to the post in <sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>, we can use
the uint256_t <sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup> project to compute it with the following code snippet:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;cstdio&gt;</span><span class="cp">
#include</span> <span class="cpf">"uint256_t.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">find_mu</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">uint256_t</span> <span class="n">dividend</span> <span class="o">=</span> <span class="n">uint256_t</span><span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="o">&lt;&lt;</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">const</span> <span class="n">uint256_t</span> <span class="n">divisor</span> <span class="o">=</span> <span class="mh">0x11EDC6F41</span><span class="p">;</span> <span class="c1">// polynomial used by CRC32C</span>
    <span class="k">const</span> <span class="kt">int</span> <span class="n">bits_in_divisor</span> <span class="o">=</span> <span class="mi">33</span><span class="p">;</span> 

    <span class="n">uint256_t</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">bit</span> <span class="o">=</span> <span class="mi">255</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">bit</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">((</span><span class="n">dividend</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">uint256_t</span><span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="o">&lt;&lt;</span> <span class="n">bit</span><span class="p">))</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="n">bit</span> <span class="o">-</span> <span class="n">bits_in_divisor</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">shift</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">dividend</span> <span class="o">^=</span> <span class="n">divisor</span> <span class="o">&lt;&lt;</span> <span class="n">shift</span><span class="p">;</span>
                <span class="n">result</span> <span class="o">^=</span> <span class="n">uint256_t</span><span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="o">&lt;&lt;</span> <span class="n">shift</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                <span class="n">dividend</span> <span class="o">^=</span> <span class="n">divisor</span> <span class="o">&gt;&gt;</span> <span class="o">-</span><span class="n">shift</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">bit</span><span class="o">--</span><span class="p">;</span>
    <span class="p">}</span>   

    <span class="n">printf</span><span class="p">(</span><span class="s">"%s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">result</span><span class="p">.</span><span class="n">str</span><span class="p">(</span><span class="mi">16</span><span class="p">).</span><span class="n">c_str</span><span class="p">());</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="n">main</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">find_mu</span><span class="p">(</span><span class="mi">64</span><span class="p">);</span> <span class="c1">// 2^64 / p</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="carry-less-multiplication">carry-less multiplication</h4>

<p>Recall that the main concept of CRC is polynomial division.
Polynomial arithmetic never carries from one coefficient to the next; yet
allowing each coefficient to take an arbitrary value is impractical.
We can instead do the following: still don’t carry, but keep each coefficient
in a sensible range. As we are using a computer to perform the polynomial division,
we limit each coefficient to \(\{0, 1\}\).
That is, we work with coefficients modulo 2.</p>

<p>There is an interesting property of polynomial operations modulo 2:
every polynomial operation is equivalent to binary arithmetic
with no carries, or “carry-less” binary arithmetic. Consequently,
we can replace the multiplication in Barrett reduction with
carry-less multiplication.</p>

<p>Though using Barrett reduction with carry-less multiplication
does not need to store a look-up table, it needs the target
to support hardware-accelerated carry-less multiplication, as an ordinary
software carry-less multiplication requires \(O(b^2)\) time (where \(b\) is
the number of bits in each operand), which usually
performs worse than the look-up table method. Thankfully,
the Arm Cryptography Extension provides hardware-accelerated carry-less multiplication.</p>
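
<p>For comparison, a software carry-less multiplication can be sketched as a plain shift-and-XOR loop. This is only an illustration of the idea, not the wrapper used in sse2neon, and the name <code class="language-plaintext highlighter-rouge">clmul32</code> is hypothetical. It is an ordinary long multiplication in which every addition is replaced by XOR, so no carry ever propagates.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Hypothetical helper: 32 x 32 -&gt; 64-bit carry-less multiplication.
 * Each set bit i of b contributes a shifted copy of a, combined with XOR
 * instead of addition. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int i = 0; i &lt; 32; i++) {
        if ((b &gt;&gt; i) &amp; 1u)
            result ^= (uint64_t) a &lt;&lt; i;
    }
    return result;
}
</code></pre></div></div>

<p>As \((x+1)(x+1) = x^2+1\) modulo 2, <code class="language-plaintext highlighter-rouge">clmul32(3, 3)</code> yields 5 rather than 9. The loop needs one shift and one XOR per bit of the operand, which is the cost the hardware instruction removes.</p>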

<p>Summing up, we can come up with the following implementation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">FORCE_INLINE</span> <span class="kt">uint32_t</span> <span class="nf">_mm_crc32_u8</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">crc</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="c1">// Adapted from: https://mary.rs/lab/crc32/</span>
    <span class="c1">// If target supports Arm Cryptography Extension:</span>

    <span class="c1">// Barrett reduction</span>
    <span class="n">uint64x2_t</span> <span class="n">orig</span> <span class="o">=</span>
        <span class="n">vcombine_u64</span><span class="p">(</span><span class="n">vcreate_u64</span><span class="p">((</span><span class="kt">uint64_t</span><span class="p">)</span> <span class="p">(</span><span class="n">crc</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span><span class="p">),</span> <span class="n">vcreate_u64</span><span class="p">(</span><span class="mh">0x0</span><span class="p">));</span>
    <span class="n">uint64x2_t</span> <span class="n">tmp</span> <span class="o">=</span> <span class="n">orig</span><span class="p">;</span>

    <span class="c1">// Polynomial P(x) of CRC32C</span>
    <span class="kt">uint64_t</span> <span class="n">p</span> <span class="o">=</span> <span class="mh">0x105EC76F1</span><span class="p">;</span>
    <span class="c1">// Barrett Reduction (in bit-reflected form) constant mu_{64} = \lfloor</span>
    <span class="c1">// 2^{64} / P(x) \rfloor = 0x11f91caf6</span>
    <span class="kt">uint64_t</span> <span class="n">mu</span> <span class="o">=</span> <span class="mh">0x1dea713f1</span><span class="p">;</span>

    <span class="c1">// Note: the _sse2neon_vmull_p64 is a wrapper of carry-less multiplication</span>
    <span class="c1">// Multiply by mu_{64}</span>
    <span class="n">tmp</span> <span class="o">=</span> <span class="n">_sse2neon_vmull_p64</span><span class="p">(</span><span class="n">vget_low_u64</span><span class="p">(</span><span class="n">tmp</span><span class="p">),</span> <span class="n">vcreate_u64</span><span class="p">(</span><span class="n">mu</span><span class="p">));</span>
    <span class="c1">// Divide by 2^{64} (mask away the unnecessary bits)</span>
    <span class="n">tmp</span> <span class="o">=</span>
        <span class="n">vandq_u64</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span> <span class="n">vcombine_u64</span><span class="p">(</span><span class="n">vcreate_u64</span><span class="p">(</span><span class="mh">0xFFFFFFFF</span><span class="p">),</span> <span class="n">vcreate_u64</span><span class="p">(</span><span class="mh">0x0</span><span class="p">)));</span>
    <span class="c1">// Multiply by P(x) (shifted left by 1 for alignment reasons)</span>
    <span class="n">tmp</span> <span class="o">=</span> <span class="n">_sse2neon_vmull_p64</span><span class="p">(</span><span class="n">vget_low_u64</span><span class="p">(</span><span class="n">tmp</span><span class="p">),</span> <span class="n">vcreate_u64</span><span class="p">(</span><span class="n">p</span><span class="p">));</span>
    <span class="c1">// Subtract original from result</span>
    <span class="n">tmp</span> <span class="o">=</span> <span class="n">veorq_u64</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span> <span class="n">orig</span><span class="p">);</span>

    <span class="c1">// Extract the 'lower' (in bit-reflected sense) 32 bits</span>
    <span class="n">crc</span> <span class="o">=</span> <span class="n">vgetq_lane_u32</span><span class="p">(</span><span class="n">vreinterpretq_u32_u64</span><span class="p">(</span><span class="n">tmp</span><span class="p">),</span> <span class="mi">1</span><span class="p">);</span>

    <span class="k">return</span> <span class="n">crc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>In this post, I have shown two methods of optimizing the CRC32C calculation,
and these implementations have been merged into sse2neon.
I have also given brief descriptions of CRC and carry-less multiplication, which
are commonly seen topics in cryptography.</p>

<h2 id="trivia">Trivia</h2>

<p>While I was measuring the running time of each implementation, I found that
the order in which the test functions run affects the measured running time of each
implementation in QEMU.</p>

<h2 id="reference">Reference</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>https://en.wikipedia.org/wiki/Cyclic_redundancy_check <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>https://github.com/komrad36/CRC <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>https://github.com/DLTcollab/sse2neon/blob/cfaa59fc04fecb117c0a0f3fe9c82dece6f359ad/sse2neon.h#L8502 <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>https://github.com/DLTcollab/sse2neon/pull/627#discussion_r1453360563 <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>https://github.com/DLTcollab/sse2neon/pull/627#issuecomment-1895992394 <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>https://www.facebook.com/groups/system.software2024/posts/1556960111748548/?comment_id=1556971665080726 (the comments are written in Mandarin) <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p>https://create.stephan-brumme.com/crc32/#half-byte <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p>https://en.wikipedia.org/wiki/Barrett_reduction <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p>https://mary.rs/lab/crc32/ <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p>https://github.com/calccrypto/uint256_t <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="[&quot;programming&quot;, &quot;open source contribution&quot;]" /><category term="programming" /><category term="C" /><category term="C++" /><category term="sse2neon" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">My sse2neon Contribution of _rdtsc</title><link href="https://cuda-chen.github.io/programming/open%20source%20contributions/2023/04/02/my-sse2neon-contribution-of-rdtsc.html" rel="alternate" type="text/html" title="My sse2neon Contribution of _rdtsc" /><published>2023-04-02T00:00:00+00:00</published><updated>2023-04-02T00:00:00+00:00</updated><id>https://cuda-chen.github.io/programming/open%20source%20contributions/2023/04/02/my-sse2neon-contribution-of-rdtsc</id><content type="html" xml:base="https://cuda-chen.github.io/programming/open%20source%20contributions/2023/04/02/my-sse2neon-contribution-of-rdtsc.html"><![CDATA[<p>In this post, I am going to illustrate the path of <code class="language-plaintext highlighter-rouge">_rdtsc</code> [^1] conversion contribution
on sse2neon. First, I will introduce the usage of <code class="language-plaintext highlighter-rouge">_rdtsc</code>, then talk about
the implementation and the test case.
The full implementation can be seen here: https://github.com/DLTcollab/sse2neon/pull/532</p>

<h2 id="whats-_rdtsc">What’s <code class="language-plaintext highlighter-rouge">_rdtsc</code></h2>

<p><code class="language-plaintext highlighter-rouge">_rdtsc</code> is an x86 intrinsic which reads the current timestamp from the processor.
What makes it special is that it gets the timestamp directly from
hardware, which makes it suitable for measuring execution time precisely.</p>

<h2 id="implementations">Implementations</h2>

<p>As this post is about the conversion, let’s look at how I implemented
it on each target.</p>

<h3 id="armv8-a">ARMv8-A</h3>

<p>Pretty straightforward. You just read the value from CNTVCT_EL0 (counter-timer virtual count register).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">val</span><span class="p">;</span>
<span class="n">__asm__</span> <span class="nf">__volatile__</span><span class="p">(</span><span class="s">"mrs %0, cntvct_el0"</span> <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">val</span><span class="p">));</span>
<span class="k">return</span> <span class="n">val</span><span class="p">;</span>
</code></pre></div></div>

<h3 id="armv7-a">ARMv7-A</h3>

<p>The ARMv7-A counterpart is trickier as it has no CNTVCT_EL0. Instead, you can get
the value from PMCCNTR (performance monitors cycle count register).</p>

<p>Nevertheless, PMCCNTR can be accessed only under one of the following conditions:</p>
<ol>
  <li>All modes executing at PL1 or higher.</li>
  <li>User mode when PMCUSERENR.EN == 1 (PMCUSERENR stands for performance monitors user enable register).</li>
</ol>

<p>What’s more, PMCCNTR starts to count only if the cycle counter enable bit (bit 31) of PMCNTENSET (performance monitors count enable set register)
is set.</p>

<p>If these conditions are not met, you will not be able to read a meaningful value from PMCCNTR.
In that case, you can still get the current timestamp through the Linux system call <code class="language-plaintext highlighter-rouge">gettimeofday</code>,
as the kernel services it on behalf of the user-mode program.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="n">pmccntr</span><span class="p">,</span> <span class="n">pmuseren</span><span class="p">,</span> <span class="n">pmcntenset</span><span class="p">;</span>
<span class="n">__asm__</span> <span class="nf">__volatile__</span><span class="p">(</span><span class="s">"mrc p15, 0, %0, c9, c14, 0"</span> <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">pmuseren</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pmuseren</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>  
    <span class="n">__asm__</span> <span class="n">__volatile__</span><span class="p">(</span><span class="s">"mrc p15, 0, %0, c9, c12, 1"</span> <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">pmcntenset</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">pmcntenset</span> <span class="o">&amp;</span> <span class="mh">0x80000000UL</span><span class="p">)</span> <span class="p">{</span> 
        <span class="n">__asm__</span> <span class="n">__volatile__</span><span class="p">(</span><span class="s">"mrc p15, 0, %0, c9, c13, 0"</span> <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">pmccntr</span><span class="p">));</span>
        <span class="k">return</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span> <span class="p">(</span><span class="n">pmccntr</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv</span><span class="p">;</span>
<span class="n">gettimeofday</span><span class="p">(</span><span class="o">&amp;</span><span class="n">tv</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span> <span class="p">(</span><span class="n">tv</span><span class="p">.</span><span class="n">tv_sec</span><span class="p">)</span> <span class="o">*</span> <span class="mi">1000000</span> <span class="o">+</span> <span class="n">tv</span><span class="p">.</span><span class="n">tv_usec</span><span class="p">;</span>
</code></pre></div></div>

<h2 id="test-cases">Test Cases</h2>

<p>In order to show that the implementation works, I added a dedicated test case for unit testing.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// test case implementation</span>
<span class="n">result_t</span> <span class="nf">test_rdtsc</span><span class="p">(</span><span class="k">const</span> <span class="n">SSE2NEONTestImpl</span> <span class="o">&amp;</span><span class="n">impl</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">iter</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">_rdtsc</span><span class="p">();</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">__asm__</span> <span class="n">__volatile__</span><span class="p">(</span><span class="s">""</span> <span class="o">:::</span> <span class="s">"memory"</span><span class="p">);</span>
    <span class="kt">uint64_t</span> <span class="n">end</span> <span class="o">=</span> <span class="n">_rdtsc</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">end</span> <span class="o">&gt;</span> <span class="n">start</span> <span class="o">?</span> <span class="n">TEST_SUCCESS</span> <span class="o">:</span> <span class="n">TEST_FAIL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The test procedure is as follows:</p>
<ol>
  <li>get the current timestamp</li>
  <li>run a long-running for loop</li>
  <li>get the current timestamp again</li>
  <li>check whether the timestamp from step 3 is larger than the one from step 1</li>
</ol>

<h3 id="why-the-for-loop-looks-so-strange">Why does the for loop look so strange?</h3>
<p>You may ask why not create the long-running for loop as follows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">100000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
    <span class="p">;</span> <span class="c1">// no-op</span>
</code></pre></div></div>

<p>The reason is that modern compilers may eliminate loops
that perform no operations. Therefore, we need a trick which creates
a long-running for loop without it being removed by the compiler.
Fortunately, we can use <code class="language-plaintext highlighter-rouge">__asm__ __volatile__("" ::: "memory");</code> to do
the trick.</p>

<p>So you may ask another question: why can <code class="language-plaintext highlighter-rouge">__asm__ __volatile__("" ::: "memory");</code>
fulfill the task?</p>

<p>According to this post [^2], <code class="language-plaintext highlighter-rouge">__asm__ __volatile__("" ::: "memory");</code>
creates a compiler barrier. What’s more, with the help of the <code class="language-plaintext highlighter-rouge">volatile</code>
keyword, the compiler won’t optimize this assembly statement away.
Therefore, we create a statement which does nothing yet cannot be removed. Thus, the long-running
for loop serves its purpose.</p>
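
<p>The trick can be wrapped into a small helper, as in the sketch below. It assumes a compiler which supports GCC-style extended inline assembly (GCC or Clang), and the name <code class="language-plaintext highlighter-rouge">spin_with_barrier</code> is hypothetical.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Hypothetical helper: spins for `iterations` rounds and returns the number
 * of rounds executed. Assumes GCC/Clang extended inline assembly. */
static uint32_t spin_with_barrier(uint32_t iterations)
{
    uint32_t rounds = 0;
    for (uint32_t i = 0; i &lt; iterations; i++) {
        /* Empty asm statement: it emits no instruction, but `volatile`
         * forbids removing it and the "memory" clobber forbids reordering
         * memory accesses across it. */
        __asm__ __volatile__("" ::: "memory");
        rounds++;
    }
    return rounds;
}
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">spin_with_barrier(100000)</code> between two timestamp reads gives the same kind of delay as the loop in the test case above.</p>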

<h2 id="references">References</h2>

<p>[^1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=rdtsc</p>

<p>[^2] https://preshing.com/20120625/memory-ordering-at-compile-time/</p>]]></content><author><name></name></author><category term="[&quot;programming&quot;, &quot;open source contributions&quot;]" /><category term="programming" /><category term="C" /><category term="C++" /><category term="sse2neon" /><summary type="html"><![CDATA[In this post, I am going to illustrate the path of _rdtsc [^1] conversion contribution on sse2neon. At first, I will introduce the usage of _rdtsc, then talk about the implementation and test case . The full implementation can be seen in here: https://github.com/DLTcollab/sse2neon/pull/532]]></summary></entry><entry><title type="html">My Moderna COVID-19 First Booster Vaccination Report</title><link href="https://cuda-chen.github.io/life/2022/05/29/moderna-booster-1st.html" rel="alternate" type="text/html" title="My Moderna COVID-19 First Booster Vaccination Report" /><published>2022-05-29T00:00:00+00:00</published><updated>2022-05-29T00:00:00+00:00</updated><id>https://cuda-chen.github.io/life/2022/05/29/moderna-booster-1st</id><content type="html" xml:base="https://cuda-chen.github.io/life/2022/05/29/moderna-booster-1st.html"><![CDATA[<p>In this post, I will record the situation after I
got vaccinated with the <a href="https://modernacovid19global.com/">Moderna</a> COVID-19 vaccine.</p>

<h2 id="day-1-the-day-after-got-vaccinated">Day 1 (the day after getting vaccinated)</h2>
<ul>
  <li>Sore muscle on vaccinated arm.</li>
  <li>Moderate fatigue.</li>
</ul>

<h2 id="day-2">Day 2</h2>
<ul>
  <li>Strong fatigue.</li>
  <li>Sore muscle on vaccinated arm.</li>
  <li>Headache.</li>
  <li>Fever.</li>
  <li>Sweating a lot; I bet I have never sweated this much before, even
when exercising.</li>
  <li>Palpitation.</li>
</ul>

<h2 id="day-3">Day 3</h2>
<ul>
  <li>A little fatigue.</li>
  <li>A little sore muscle on vaccinated arm.</li>
  <li>Coughing.</li>
  <li>Stuffy nose.</li>
  <li>Palpitation.</li>
</ul>

<h2 id="day-4">Day 4</h2>
<ul>
  <li>A little fatigue.</li>
  <li>Coughing sometimes.</li>
  <li>Stuffy nose.</li>
  <li>Palpitation.</li>
</ul>

<h2 id="day-5">Day 5</h2>
<ul>
  <li>A little fatigue.</li>
  <li>Coughing sometimes.</li>
  <li>Stuffy nose.</li>
</ul>

<h2 id="day-6">Day 6</h2>
<ul>
  <li>A little fatigue.</li>
  <li>Coughing rarely.</li>
</ul>

<h2 id="day-7">Day 7</h2>
<ul>
  <li>Feel a little fatigue sometimes.</li>
  <li>Coughing rarely.</li>
  <li><strong>Got a positive PCR report today.</strong></li>
</ul>

<h2 id="day-8">Day 8</h2>
<ul>
  <li>Awake as hell.</li>
  <li>Coughing rarely.</li>
</ul>

<h2 id="day-9">Day 9</h2>
<ul>
  <li>Awake as hell.</li>
  <li>Coughing rarely.</li>
</ul>

<h2 id="day-10">Day 10</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-11">Day 11</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-12">Day 12</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-13">Day 13</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-14">Day 14</h2>
<ul>
  <li>Awake as hell.</li>
  <li><strong>Got the release-from-quarantine notification today.</strong></li>
</ul>]]></content><author><name></name></author><category term="[&quot;life&quot;]" /><category term="life" /><category term="COVID-19" /><category term="vaccine" /><summary type="html"><![CDATA[In this post, I will record the situation after I got vaccinated with the Moderna COVID-19 vaccine.]]></summary></entry><entry><title type="html">Switch Your Jekyll Blog to Google Analytics 4 Simplified</title><link href="https://cuda-chen.github.io/blogging/2022/04/30/switch-your-jekyll-blog-to-google-analytics-4-simplified.html" rel="alternate" type="text/html" title="Switch Your Jekyll Blog to Google Analytics 4 Simplified" /><published>2022-04-30T00:00:00+00:00</published><updated>2022-04-30T00:00:00+00:00</updated><id>https://cuda-chen.github.io/blogging/2022/04/30/switch-your-jekyll-blog-to-google-analytics-4-simplified</id><content type="html" xml:base="https://cuda-chen.github.io/blogging/2022/04/30/switch-your-jekyll-blog-to-google-analytics-4-simplified.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>As Google has announced [^1], Google Analytics will be replaced
by Google Analytics 4 after July 1st, 2023. As a Google Analytics
user, I am writing down the switching procedure so that other
users have a guide for switching a Jekyll blog
from Google Analytics to Google Analytics 4.</p>

<h2 id="switching-procedures">Switching Procedures</h2>
<ol>
  <li>
    <p>Create a Google Analytics 4 Property.
For more information you can visit <a href="https://support.google.com/analytics/answer/9304153?hl=en">the help center of Google</a>.</p>
  </li>
  <li>
    <p>Record your measurement ID.
This <a href="https://support.google.com/analytics/answer/9539598">answer</a>
explains how to find your measurement ID.</p>
  </li>
  <li>Put your measurement ID in <code class="language-plaintext highlighter-rouge">_config.yml</code>.
Usually you put your measurement ID like this:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>google_analytics: &lt;your-measurement-id&gt;
</code></pre></div>    </div>
  </li>
  <li>(Minima and GitHub Pages only) Use a remote theme.
For <a href="https://github.com/jekyll/minima/issues/561">certain reasons</a>,
you have to use a remote theme if your Jekyll blog uses the Minima
theme and is hosted on GitHub Pages.
Usually you set your blog to use a remote theme like this:</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># _config.yml

- theme: minima
+ remote_theme: jekyll/minima

  plugins:
+ - jekyll-remote-theme
</code></pre></div></div>

<h2 id="references">References</h2>
<p>[^1] https://support.google.com/analytics/answer/10089681?hl=en</p>]]></content><author><name></name></author><category term="[&quot;blogging&quot;]" /><category term="Jekyll" /><category term="Minima(Jekyll)" /><summary type="html"><![CDATA[Introduction As Google has announced [^1], Google Analytics will be replaced by Google Analytics 4 after July 1st, 2023. As a Google Analytics user, I am writing down the switching procedure so that other users have a guide for switching a Jekyll blog from Google Analytics to Google Analytics 4.]]></summary></entry><entry><title type="html">My Medigenvac COVID-19 Second Vaccination Report</title><link href="https://cuda-chen.github.io/life/2021/10/30/medigenvec2nd.html" rel="alternate" type="text/html" title="My Medigenvac COVID-19 Second Vaccination Report" /><published>2021-10-30T00:00:00+00:00</published><updated>2021-10-30T00:00:00+00:00</updated><id>https://cuda-chen.github.io/life/2021/10/30/medigenvec2nd</id><content type="html" xml:base="https://cuda-chen.github.io/life/2021/10/30/medigenvec2nd.html"><![CDATA[<p>In this post, I will record the situation after I
got my second shot of the <a href="https://www.medigenvac.com/public/en">Medigenvac</a> COVID-19 vaccine.</p>

<h2 id="day-1-the-day-get-vaccinated">Day 1 (the day I got vaccinated)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-2">Day 2</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-3">Day 3</h2>
<ul>
  <li>Awake as hell.</li>
  <li>Sore shoulder on the vaccinated side.</li>
</ul>

<h2 id="day-4">Day 4</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-5">Day 5</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-6">Day 6</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-7">Day 7</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-8">Day 8</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-9">Day 9</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-10">Day 10</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-11">Day 11</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-12">Day 12</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-13">Day 13</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-14">Day 14</h2>
<ul>
  <li>Awake as hell.</li>
</ul>]]></content><author><name></name></author><category term="[&quot;life&quot;]" /><category term="life" /><category term="COVID-19" /><category term="vaccine" /><summary type="html"><![CDATA[In this post, I will record the situation after I got my second shot of the Medigenvac COVID-19 vaccine.]]></summary></entry><entry><title type="html">How My LeNet Achieves 99% Accuracy</title><link href="https://cuda-chen.github.io/deep%20learning/2021/09/23/lenet-99.html" rel="alternate" type="text/html" title="How My LeNet Achieves 99% Accuracy" /><published>2021-09-23T00:00:00+00:00</published><updated>2021-09-23T00:00:00+00:00</updated><id>https://cuda-chen.github.io/deep%20learning/2021/09/23/lenet-99</id><content type="html" xml:base="https://cuda-chen.github.io/deep%20learning/2021/09/23/lenet-99.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Fine-tuning plays a major role in model training, and understanding
what each hyperparameter means helps you succeed.</p>

<p>In this post, I am going to show you how I achieved 99% top-1
accuracy on MNIST handwritten digit recognition by
fine-tuning just three hyperparameters. I also try
to implement a classic CNN model, LeNet-5, from scratch to
familiarize myself with the structure and the basic
components of a CNN model. What’s more, I build my LeNet-5
model in <a href="https://fluxml.ai/">Flux.jl</a> to show an example
of a Julia neural network framework.</p>

<h2 id="base-model">Base Model</h2>
<blockquote>
  <p>You can get the code from <a href="https://github.com/Cuda-Chen/flux-lenet">my repo</a>.</p>

</blockquote>

<p>The base model is the well-known 5-layer LeNet [^1], and the
implementation is adapted from the Flux.jl model zoo [^2]. Note that
there are some differences between the original LeNet and
the implementation in the Flux.jl model zoo [^3]:</p>
<ol>
  <li>The convolution layers in LeNet use
<strong>sigmoid</strong> as the activation function, whilst those in the Flux.jl model zoo use <strong>ReLU</strong>.</li>
  <li>The pooling layers in LeNet use <strong>average</strong> pooling, whereas
those in the Flux.jl model zoo use <strong>max</strong> pooling.</li>
  <li>The pooling layers in LeNet use a
<strong>scaled hyperbolic tangent</strong> activation function, while those in the Flux.jl model zoo use
<strong>identity</strong> (linear).</li>
  <li>For multi-class classification, the original LeNet paper
uses a <strong>Euclidean radial basis function (RBF)</strong> [^3]. However,
<strong>softmax</strong> is used in Flux.jl’s implementation.</li>
</ol>

<p>For convenience, here is the structure of my implementation:
<img src="/assets/images/2021/09/23/nn.svg" alt="model structure" /></p>
<center>The structure of the base model. You can right-click to open the image
in a new tab. Sorry for the inconvenience: NN-SVG (http://alexlenail.me/NN-SVG/LeNet.html)
does not have any option to resize the image.</center>

<h2 id="lets-fine-tuning">Let’s fine-tune!</h2>
<p>Hyperparameter tuning plays a crucial role in machine learning
model development. Though the LeNet implementation of Flux.jl can achieve
98% top-1 accuracy, I still want to see whether I can push past that limit.
What’s more, by experimenting with fine-tuning, I can learn
which parameters play the major role in a given task.</p>

<h3 id="baseline-and-its-performance">baseline and its performance</h3>
<blockquote>
  <p>Baseline model can be found here:
https://github.com/FluxML/model-zoo/blob/master/vision/conv_mnist/conv_mnist.jl</p>
</blockquote>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:41
Epoch: 1   Train: (loss = 0.1586f0, acc = 95.3417)   Test: (loss = 0.145f0, acc = 95.53)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 2   Train: (loss = 0.1079f0, acc = 96.7733)   Test: (loss = 0.0958f0, acc = 97.03)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 3   Train: (loss = 0.0829f0, acc = 97.515)   Test: (loss = 0.0717f0, acc = 97.75)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 4   Train: (loss = 0.0639f0, acc = 98.0883)   Test: (loss = 0.0573f0, acc = 98.21)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 5   Train: (loss = 0.0614f0, acc = 98.12)   Test: (loss = 0.0539f0, acc = 98.25)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 6   Train: (loss = 0.0593f0, acc = 98.2017)   Test: (loss = 0.058f0, acc = 98.13)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 7   Train: (loss = 0.0464f0, acc = 98.6083)   Test: (loss = 0.0464f0, acc = 98.52)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 8   Train: (loss = 0.04f0, acc = 98.7867)   Test: (loss = 0.039f0, acc = 98.77)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 9   Train: (loss = 0.0393f0, acc = 98.7833)   Test: (loss = 0.0416f0, acc = 98.63)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 10   Train: (loss = 0.0348f0, acc = 98.9667)   Test: (loss = 0.0388f0, acc = 98.67)

</code></pre></div></div>

<h3 id="batch-size">batch size</h3>
<p>Batch size is the number of training samples used in one iteration.
In other words, you update the parameters of the model (formally, calculate the loss and then
back-propagate) after ingesting a certain number
of training samples. Consider the following two extremes:</p>
<ol>
  <li>If you update the parameters only after ingesting <strong>the whole dataset</strong>, you
may spend little time on parameter updates, but the model may perform poorly
on real data because it can get trapped in a local minimum.
Besides, it needs a huge amount of memory to load the data.</li>
  <li>If you update the parameters after ingesting <strong>every single sample</strong> (only
one sample in each iteration), you may get a model with fantastic results,
but it takes an extraordinarily long time to train as it updates the parameters
in each iteration.</li>
</ol>
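
<p>To make the two extremes concrete, here is a minimal Python sketch (not part of the
Flux.jl code of this post) that counts how many parameter updates happen in one epoch
for a given batch size. It assumes the standard MNIST training set of 60,000 images;
the function name <code class="language-plaintext highlighter-rouge">updates_per_epoch</code> is only for illustration.</p>

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of parameter updates in one pass over the training set."""
    # The last batch may be smaller, so round up.
    return math.ceil(num_samples / batch_size)


# Assumption: the standard MNIST training set size.
MNIST_TRAIN_SIZE = 60_000

# Batch size 1 updates after every sample; a batch as large as the
# dataset updates only once per epoch.
for batch_size in (1, 32, 64, 256, 512, MNIST_TRAIN_SIZE):
    print(batch_size, updates_per_epoch(MNIST_TRAIN_SIZE, batch_size))
```

<p>A smaller batch size means more updates per epoch, each computed from fewer samples;
a larger one means fewer updates, each needing more memory.</p>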

<p>As such, choosing the right batch size can:</p>
<ul>
  <li>reduce the training time and memory</li>
  <li>converge to better performance</li>
</ul>

<p>In this post, I tried several batch sizes, and the best batch
size on my training platform is <strong>32</strong>.</p>

<table>
  <thead>
    <tr>
      <th>Batch Size</th>
      <th>Testing Accuracy (after training for 10 epochs)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>32</td>
      <td>98.94%</td>
    </tr>
    <tr>
      <td>64</td>
      <td>98.9%</td>
    </tr>
    <tr>
      <td>256</td>
      <td>98.54%</td>
    </tr>
    <tr>
      <td>512</td>
      <td>98.21%</td>
    </tr>
  </tbody>
</table>

<p>The following are the training logs for the different batch sizes:</p>

<h4 id="32">32</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:40
Epoch: 1   Train: (loss = 0.1069f0, acc = 96.725)   Test: (loss = 0.092f0, acc = 97.28)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 2   Train: (loss = 0.0645f0, acc = 98.0217)   Test: (loss = 0.0578f0, acc = 98.16)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 3   Train: (loss = 0.0467f0, acc = 98.6183)   Test: (loss = 0.0439f0, acc = 98.64)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 4   Train: (loss = 0.0407f0, acc = 98.7817)   Test: (loss = 0.0415f0, acc = 98.67)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 5   Train: (loss = 0.0392f0, acc = 98.8017)   Test: (loss = 0.0428f0, acc = 98.68)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 6   Train: (loss = 0.0329f0, acc = 98.915)   Test: (loss = 0.0408f0, acc = 98.71)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0207f0, acc = 99.395)   Test: (loss = 0.0322f0, acc = 99.01)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 8   Train: (loss = 0.0196f0, acc = 99.3833)   Test: (loss = 0.0294f0, acc = 99.02)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 9   Train: (loss = 0.0179f0, acc = 99.45)   Test: (loss = 0.0345f0, acc = 98.92)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 10   Train: (loss = 0.0166f0, acc = 99.4283)   Test: (loss = 0.0328f0, acc = 98.94)
</code></pre></div></div>

<h4 id="64">64</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:40
Epoch: 1   Train: (loss = 0.1307f0, acc = 96.045)   Test: (loss = 0.1139f0, acc = 96.49)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 2   Train: (loss = 0.0852f0, acc = 97.33)   Test: (loss = 0.0752f0, acc = 97.62)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 3   Train: (loss = 0.0617f0, acc = 98.1583)   Test: (loss = 0.0555f0, acc = 98.39)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 4   Train: (loss = 0.0485f0, acc = 98.5767)   Test: (loss = 0.0454f0, acc = 98.5)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 5   Train: (loss = 0.0515f0, acc = 98.3933)   Test: (loss = 0.0481f0, acc = 98.51)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 6   Train: (loss = 0.0464f0, acc = 98.545)   Test: (loss = 0.0469f0, acc = 98.56)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0323f0, acc = 99.0033)   Test: (loss = 0.0365f0, acc = 98.77)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 8   Train: (loss = 0.0298f0, acc = 99.0417)   Test: (loss = 0.0337f0, acc = 98.96)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 9   Train: (loss = 0.0327f0, acc = 98.945)   Test: (loss = 0.0393f0, acc = 98.77)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 10   Train: (loss = 0.0273f0, acc = 99.1333)   Test: (loss = 0.0351f0, acc = 98.9)
</code></pre></div></div>

<h4 id="256">256</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:38
Epoch: 1   Train: (loss = 0.2218f0, acc = 93.6817)   Test: (loss = 0.2066f0, acc = 94.14)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 2   Train: (loss = 0.137f0, acc = 95.965)   Test: (loss = 0.1233f0, acc = 96.37)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 3   Train: (loss = 0.1088f0, acc = 96.7117)   Test: (loss = 0.0953f0, acc = 97.17)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 4   Train: (loss = 0.0858f0, acc = 97.4033)   Test: (loss = 0.0755f0, acc = 97.7)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 5   Train: (loss = 0.0746f0, acc = 97.775)   Test: (loss = 0.0657f0, acc = 98.03)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 6   Train: (loss = 0.0665f0, acc = 98.0417)   Test: (loss = 0.0597f0, acc = 98.1)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 7   Train: (loss = 0.0603f0, acc = 98.2617)   Test: (loss = 0.0554f0, acc = 98.32)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 8   Train: (loss = 0.0535f0, acc = 98.4033)   Test: (loss = 0.0481f0, acc = 98.45)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 9   Train: (loss = 0.052f0, acc = 98.4883)   Test: (loss = 0.0496f0, acc = 98.47)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 10   Train: (loss = 0.047f0, acc = 98.5767)   Test: (loss = 0.0445f0, acc = 98.54)
</code></pre></div></div>

<h4 id="512">512</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:38
Epoch: 1   Train: (loss = 0.3686f0, acc = 89.5733)   Test: (loss = 0.3486f0, acc = 90.57)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 2   Train: (loss = 0.2046f0, acc = 94.0917)   Test: (loss = 0.1919f0, acc = 94.34)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 3   Train: (loss = 0.1542f0, acc = 95.425)   Test: (loss = 0.1387f0, acc = 95.9)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 4   Train: (loss = 0.1233f0, acc = 96.3467)   Test: (loss = 0.1119f0, acc = 96.6)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 5   Train: (loss = 0.1032f0, acc = 96.9167)   Test: (loss = 0.0912f0, acc = 97.32)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 6   Train: (loss = 0.0923f0, acc = 97.2533)   Test: (loss = 0.0831f0, acc = 97.56)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 7   Train: (loss = 0.0831f0, acc = 97.5483)   Test: (loss = 0.074f0, acc = 97.82)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 8   Train: (loss = 0.0778f0, acc = 97.6967)   Test: (loss = 0.0709f0, acc = 97.84)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 9   Train: (loss = 0.0732f0, acc = 97.8883)   Test: (loss = 0.0674f0, acc = 97.94)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 10   Train: (loss = 0.0661f0, acc = 98.0383)   Test: (loss = 0.0594f0, acc = 98.21)
</code></pre></div></div>

<h3 id="regularizer-parameter">regularizer parameter</h3>
<p>A regularizer adds a penalty to the loss so that the model is less likely to overfit.
Usually, we can use an L1 or L2 regularizer, and I chose the L2 regularizer for my LeNet-5 model.</p>
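
<p>As a rough illustration, the textbook form of an L2 penalty can be sketched in plain
Python as follows. This is not the Flux.jl code used for the experiments; the weights
and the data loss below are hypothetical values, and a framework may apply the penalty
differently (for example, as weight decay inside the optimizer).</p>

```python
def l2_penalty(weights, lam):
    """Textbook L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Total loss = loss on the data + L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, lam)


# Hypothetical values, for illustration only.
weights = [0.5, -1.0, 2.0]
data_loss = 0.04

# A larger regularizer parameter penalizes large weights more strongly.
for lam in (1e-2, 1e-4, 1e-6):
    print(lam, regularized_loss(data_loss, weights, lam))
```

<p>The regularizer parameter controls how much the penalty weighs against the data loss,
which is why it is worth tuning.</p>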

<p>In this experiment, the best L2 regularizer parameter is <strong>1e-6</strong>.</p>

<table>
  <thead>
    <tr>
      <th>L2 Regularizer Parameter</th>
      <th>Testing Accuracy (after training for 10 epochs)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1e-2</td>
      <td>97.68%</td>
    </tr>
    <tr>
      <td>1e-4</td>
      <td>98.87%</td>
    </tr>
    <tr>
      <td>1e-6</td>
      <td>99.05%</td>
    </tr>
  </tbody>
</table>

<p>As before, here are the training logs for the different regularizer parameters:</p>

<h4 id="1e-2">1e-2</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:41
Epoch: 1   Train: (loss = 0.1379f0, acc = 96.1117)   Test: (loss = 0.123f0, acc = 96.52)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 2   Train: (loss = 0.1076f0, acc = 96.9583)   Test: (loss = 0.0933f0, acc = 97.28)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 3   Train: (loss = 0.1239f0, acc = 96.2667)   Test: (loss = 0.1089f0, acc = 96.64)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 4   Train: (loss = 0.1041f0, acc = 97.16)   Test: (loss = 0.0915f0, acc = 97.57)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 5   Train: (loss = 0.1092f0, acc = 96.965)   Test: (loss = 0.1014f0, acc = 97.17)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 6   Train: (loss = 0.0911f0, acc = 97.4883)   Test: (loss = 0.0808f0, acc = 97.74)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0894f0, acc = 97.5717)   Test: (loss = 0.0816f0, acc = 97.79)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 8   Train: (loss = 0.0891f0, acc = 97.5483)   Test: (loss = 0.0796f0, acc = 97.79)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 9   Train: (loss = 0.0941f0, acc = 97.36)   Test: (loss = 0.0849f0, acc = 97.44)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 10   Train: (loss = 0.0955f0, acc = 97.3467)   Test: (loss = 0.0844f0, acc = 97.68)
</code></pre></div></div>

<h4 id="1e-4">1e-4</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:40
Epoch: 1   Train: (loss = 0.1079f0, acc = 96.7133)   Test: (loss = 0.0922f0, acc = 97.23)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 2   Train: (loss = 0.0633f0, acc = 98.055)   Test: (loss = 0.0565f0, acc = 98.19)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 3   Train: (loss = 0.0478f0, acc = 98.5733)   Test: (loss = 0.0448f0, acc = 98.52)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 4   Train: (loss = 0.041f0, acc = 98.7333)   Test: (loss = 0.0418f0, acc = 98.62)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 5   Train: (loss = 0.0394f0, acc = 98.7783)   Test: (loss = 0.0425f0, acc = 98.7)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 6   Train: (loss = 0.0351f0, acc = 98.88)   Test: (loss = 0.0424f0, acc = 98.55)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0218f0, acc = 99.335)   Test: (loss = 0.0317f0, acc = 99.05)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 8   Train: (loss = 0.0214f0, acc = 99.35)   Test: (loss = 0.0304f0, acc = 98.9)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 9   Train: (loss = 0.0207f0, acc = 99.36)   Test: (loss = 0.0335f0, acc = 98.91)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 10   Train: (loss = 0.0206f0, acc = 99.3133)   Test: (loss = 0.0345f0, acc = 98.87)
</code></pre></div></div>

<h4 id="1e-6">1e-6</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:49
Epoch: 1   Train: (loss = 0.1077f0, acc = 96.72)   Test: (loss = 0.092f0, acc = 97.23)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 2   Train: (loss = 0.0647f0, acc = 98.005)   Test: (loss = 0.058f0, acc = 98.17)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 3   Train: (loss = 0.0449f0, acc = 98.67)   Test: (loss = 0.0419f0, acc = 98.62)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 4   Train: (loss = 0.0443f0, acc = 98.6667)   Test: (loss = 0.0451f0, acc = 98.53)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 5   Train: (loss = 0.0419f0, acc = 98.645)   Test: (loss = 0.043f0, acc = 98.76)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 6   Train: (loss = 0.0337f0, acc = 98.925)   Test: (loss = 0.0406f0, acc = 98.7)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0214f0, acc = 99.3417)   Test: (loss = 0.0325f0, acc = 98.93)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 8   Train: (loss = 0.0211f0, acc = 99.345)   Test: (loss = 0.0303f0, acc = 99.06)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 9   Train: (loss = 0.0217f0, acc = 99.31)   Test: (loss = 0.0363f0, acc = 98.83)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 10   Train: (loss = 0.0154f0, acc = 99.51)   Test: (loss = 0.0317f0, acc = 99.05)
</code></pre></div></div>

<h3 id="optimizer">optimizer</h3>
<p>An optimizer decides how the model parameters are updated from the gradients at each training step.
Adaptive optimizers such as ADAM keep running statistics of the gradients and use them to scale
the step size of each parameter, which can help the model converge faster and generalize well.
In this post, I compare three variants of the commonly-seen ADAM: ADAMW, NADAM, and AdaBelief.
For the description of these optimizers, you can visit <a href="https://fluxml.ai/Flux.jl/stable/training/optimisers/#Optimisers">the optimizer
documentation of Flux.jl</a>.</p>
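
<p>To make the difference between plain ADAM and ADAMW concrete, here is a minimal pure-Julia sketch of a single update step.
It does not use Flux.jl, and the names AdamState, adam_step!, and decay are made up for this illustration.
It also assumes the weight-decay term is scaled by the learning rate, which is one common convention
and may not match the implementation in Flux.jl exactly.</p>

<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch, not Flux.jl code: AdamState and adam_step! are hypothetical names.
mutable struct AdamState
    m::Vector{Float64}  # running mean of the gradients
    v::Vector{Float64}  # running mean of the squared gradients
    t::Int              # number of steps taken so far
end

AdamState(n::Int) = AdamState(zeros(n), zeros(n), 0)

# One in-place update of θ given the gradient g.
# decay == 0 gives plain ADAM; a non-zero decay adds decoupled weight decay (ADAMW).
# Assumption: the decay term is scaled by η, which may differ from Flux.jl.
function adam_step!(θ, g, s::AdamState; η=1e-3, β1=0.9, β2=0.999, ϵ=1e-8, decay=0.0)
    s.t += 1
    s.m .= β1 .* s.m .+ (1 - β1) .* g
    s.v .= β2 .* s.v .+ (1 - β2) .* g .^ 2
    mhat = s.m ./ (1 - β1^s.t)   # bias-corrected first moment
    vhat = s.v ./ (1 - β2^s.t)   # bias-corrected second moment
    θ .-= η .* (mhat ./ (sqrt.(vhat) .+ ϵ) .+ decay .* θ)
    return θ
end

g = [0.5, -4.0]                  # gradients with very different magnitudes

θ_adam = adam_step!([1.0, -2.0], g, AdamState(2); η=0.1)
θ_adamw = adam_step!([1.0, -2.0], g, AdamState(2); η=0.1, decay=0.1)

println(θ_adam)    # ≈ [0.9, -1.9]
println(θ_adamw)   # ≈ [0.89, -1.88]
</code></pre></div></div>

<p>On the very first step the bias-corrected ratio is close to ±1, so each parameter moves by about η
no matter how large its gradient is. The ADAMW variant then shrinks each weight a little further toward zero.</p>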

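<p>In this experiment, <strong>ADAMW</strong> gives the best testing accuracy after 10 epochs, although the three optimizers differ by less than 0.15 percentage points.</p>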

<table>
  <thead>
    <tr>
      <th>Optimizer Type</th>
      <th>Testing Accuracy (after training for 10 epochs)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ADAMW</td>
      <td>99.05%</td>
    </tr>
    <tr>
      <td>NADAM</td>
      <td>98.92%</td>
    </tr>
    <tr>
      <td>AdaBelief</td>
      <td>99.01%</td>
    </tr>
  </tbody>
</table>

<p>And here are the training logs for each optimizer:</p>

<h4 id="adamw">ADAMW</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:44
Epoch: 1   Train: (loss = 0.1077f0, acc = 96.72)   Test: (loss = 0.092f0, acc = 97.23)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 2   Train: (loss = 0.0647f0, acc = 98.005)   Test: (loss = 0.058f0, acc = 98.17)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 3   Train: (loss = 0.0449f0, acc = 98.67)   Test: (loss = 0.0419f0, acc = 98.62)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 4   Train: (loss = 0.0443f0, acc = 98.6667)   Test: (loss = 0.0451f0, acc = 98.53)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 5   Train: (loss = 0.0419f0, acc = 98.645)   Test: (loss = 0.043f0, acc = 98.76)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 6   Train: (loss = 0.0337f0, acc = 98.925)   Test: (loss = 0.0406f0, acc = 98.7)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 7   Train: (loss = 0.0214f0, acc = 99.3417)   Test: (loss = 0.0325f0, acc = 98.93)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 8   Train: (loss = 0.0211f0, acc = 99.345)   Test: (loss = 0.0303f0, acc = 99.06)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 9   Train: (loss = 0.0217f0, acc = 99.31)   Test: (loss = 0.0363f0, acc = 98.83)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 10   Train: (loss = 0.0154f0, acc = 99.51)   Test: (loss = 0.0317f0, acc = 99.05)
</code></pre></div></div>

<h4 id="nadam">NADAM</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:43
Epoch: 1   Train: (loss = 0.108f0, acc = 96.6633)   Test: (loss = 0.0922f0, acc = 97.22)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 2   Train: (loss = 0.0616f0, acc = 98.145)   Test: (loss = 0.0547f0, acc = 98.3)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 3   Train: (loss = 0.0479f0, acc = 98.5433)   Test: (loss = 0.0454f0, acc = 98.5)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 4   Train: (loss = 0.0399f0, acc = 98.8417)   Test: (loss = 0.0407f0, acc = 98.61)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 5   Train: (loss = 0.0411f0, acc = 98.6967)   Test: (loss = 0.0435f0, acc = 98.67)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 6   Train: (loss = 0.0334f0, acc = 98.915)   Test: (loss = 0.0411f0, acc = 98.66)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 7   Train: (loss = 0.0211f0, acc = 99.3683)   Test: (loss = 0.0328f0, acc = 98.91)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 8   Train: (loss = 0.0205f0, acc = 99.355)   Test: (loss = 0.0307f0, acc = 98.98)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 9   Train: (loss = 0.0173f0, acc = 99.4767)   Test: (loss = 0.0316f0, acc = 99.01)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 10   Train: (loss = 0.0194f0, acc = 99.3567)   Test: (loss = 0.035f0, acc = 98.92)
</code></pre></div></div>

<h4 id="adabelief">AdaBelief</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:47
Epoch: 1   Train: (loss = 0.0743f0, acc = 97.7433)   Test: (loss = 0.0636f0, acc = 98.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 2   Train: (loss = 0.0485f0, acc = 98.5567)   Test: (loss = 0.0448f0, acc = 98.56)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 3   Train: (loss = 0.0377f0, acc = 98.8583)   Test: (loss = 0.0399f0, acc = 98.78)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 4   Train: (loss = 0.0306f0, acc = 99.0483)   Test: (loss = 0.0333f0, acc = 98.97)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:17
Epoch: 5   Train: (loss = 0.0322f0, acc = 99.0167)   Test: (loss = 0.0403f0, acc = 98.84)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 6   Train: (loss = 0.0254f0, acc = 99.2183)   Test: (loss = 0.0373f0, acc = 98.73)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 7   Train: (loss = 0.0159f0, acc = 99.53)   Test: (loss = 0.0299f0, acc = 99.08)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 8   Train: (loss = 0.0174f0, acc = 99.4417)   Test: (loss = 0.0314f0, acc = 99.03)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 9   Train: (loss = 0.0133f0, acc = 99.6033)   Test: (loss = 0.029f0, acc = 99.13)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 10   Train: (loss = 0.015f0, acc = 99.49)   Test: (loss = 0.0333f0, acc = 99.01)
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>
<p>In this post, I built the classic LeNet-5 model, both to practice my machine learning skills
and to get familiar with the emerging Flux.jl framework.
I also showed three knobs for hyper-parameter tuning, or fine-tuning: batch size, regularizer, and optimizer.
In the end, my LeNet-5 model achieves 99% top-1 accuracy on the MNIST dataset.</p>

<h2 id="list-to-show-the-training-environment">Training Environment</h2>
<ul>
  <li>CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz</li>
  <li>RAM: 16 GiB</li>
  <li>OS: Fedora 33 (Linux Kernel 5.13.12)</li>
  <li>Julia version: 1.6.2</li>
  <li>Flux.jl version: v0.12.4</li>
</ul>

<h2 id="references">References</h2>
<p>[^1] http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf</p>

<p>[^2] https://github.com/FluxML/model-zoo/blob/33f5c472c321a50fc2105358a00eb7b3ec0ffa5e/vision/conv_mnist/conv_mnist.jl#L21</p>

<p>[^3] https://pabloinsente.github.io/the-convolutional-network</p>]]></content><author><name></name></author><category term="[&quot;Deep learning&quot;]" /><category term="Deep learning" /><category term="CNN" /><summary type="html"><![CDATA[Introduction Fine-tuning plays a great role in model training, and realizing the meaning of each hyperparameter lets you succeed.]]></summary></entry><entry><title type="html">My Medigenvac COVID-19 First Vaccination Report</title><link href="https://cuda-chen.github.io/life/2021/09/08/medigenvac-report.html" rel="alternate" type="text/html" title="My Medigenvac COVID-19 First Vaccination Report" /><published>2021-09-08T00:00:00+00:00</published><updated>2021-09-08T00:00:00+00:00</updated><id>https://cuda-chen.github.io/life/2021/09/08/medigenvac-report</id><content type="html" xml:base="https://cuda-chen.github.io/life/2021/09/08/medigenvac-report.html"><![CDATA[<p>In this post, I will record how I felt after I
got vaccinated with the <a href="https://www.medigenvac.com/public/en">Medigenvac</a> COVID-19 vaccine.</p>

<h2 id="day-1-20210824">Day 1 (2021/08/24)</h2>
<ul>
  <li>Fatigue.</li>
</ul>

<h2 id="day-2-20210825">Day 2 (2021/08/25)</h2>
<ul>
  <li>Mild fatigue.</li>
  <li>Mild soreness in the vaccinated shoulder.</li>
</ul>

<h2 id="day-3-20210826">Day 3 (2021/08/26)</h2>
<ul>
  <li>Moderate fatigue.</li>
</ul>

<h2 id="day-4-20210827">Day 4 (2021/08/27)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-5-20210828">Day 5 (2021/08/28)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-6-20210829">Day 6 (2021/08/29)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-7-20210830">Day 7 (2021/08/30)</h2>
<ul>
  <li>Awake as hell in the morning, but after a Subway lunch I felt fatigued and took a nap in the evening.</li>
</ul>

<h2 id="day-8-20210831">Day 8 (2021/08/31)</h2>
<ul>
  <li>Awake as hell in the morning.</li>
  <li>Got a headache after finishing lunch; maybe I have been too tired these days.</li>
</ul>

<h2 id="day-9-20210901">Day 9 (2021/09/01)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-10-20210902">Day 10 (2021/09/02)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-11-20210903">Day 11 (2021/09/03)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-12-20210904">Day 12 (2021/09/04)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-13-20210905">Day 13 (2021/09/05)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>

<h2 id="day-14-20210906">Day 14 (2021/09/06)</h2>
<ul>
  <li>Awake as hell.</li>
</ul>]]></content><author><name></name></author><category term="[&quot;life&quot;]" /><category term="life" /><category term="COVID-19" /><category term="vaccine" /><summary type="html"><![CDATA[In this post, I will record how I felt after I got vaccinated with the Medigenvac COVID-19 vaccine.]]></summary></entry></feed>