Cuda Chen’s Blog

Optimize _mm_crc32_u8 conversion in sse2neon

2024-02-27T00:00:00+00:00

Introduction

In this post, I walk through how I improved the _mm_crc32_u8 conversion as part of my contribution to sse2neon.

I start with a brief introduction to CRC32C, the CRC algorithm that _mm_crc32_u8 applies. Then I show how I optimized the conversion with several methods.

What’s CRC32C?

Before explaining CRC32C, I would like to answer a question: what is CRC (Cyclic Redundancy Check)? It is an algorithm used for error detection in networks and storage devices ¹. The sender picks a number as the divisor, then divides the message by it to get the remainder. Next, the sender appends the remainder to the end of the message. To verify whether the message has any errors, the receiver divides the received message by the same divisor. If the remainder is not zero, the message is erroneous. The name reflects these behaviors: the check value adds information without modifying the content of the message (redundancy), and the division is just shifting the dividend and then subtracting (cyclic code).

A CRC algorithm is called an n-bit CRC when its check value (the remainder) is n bits long, which means the divisor is n+1 bits long. Thus CRC32C, a variant of CRC32, produces a 32-bit check value.

As a reminder, CRC32C uses the following polynomial (I will refer to it as P for the rest of the post):

normal: 0x1EDC6F41 (0x11EDC6F41 when the leading bit is written explicitly)
bit-reflected: 0x82F63B78

What’s more, we use the bit-reflected form for the implementation. For the reasons behind using the bit-reflected form, you can refer to Fastest CRC32 for x86, Intel and AMD, + comprehensive derivation and discussion of various approaches ².

Road of Optimization

Let’s start with the original implementation in sse2neon ³:

FORCE_INLINE uint32_t _mm_crc32_u8(uint32_t crc, uint8_t v)
{
    crc ^= v;
    for (int bit = 0; bit < 8; bit++) {
        if (crc & 1)
            crc = (crc >> 1) ^ UINT32_C(0x82f63b78);
        else
            crc = (crc >> 1);
    }
    return crc;
}

apply ternary operator

Modern compilers can optimize the ternary operator into a conditional move to avoid branching. As a consequence, we can rewrite the if...else statement with the ternary operator:

for (int bit = 0; bit < 8; bit++)
    crc = (crc & 1) ? ((crc >> 1) ^ UINT32_C(0x82f63b78)) : (crc >> 1);

However, as mentioned by the reviewer ⁴, we should come up with another way to utilize the power of NEON.

tabular method

Observe the following implementation of the CRC32C calculation:

for (int bit = 0; bit < 8; bit++)
    crc = (crc & 1) ? ((crc >> 1) ^ UINT32_C(0x82f63b78)) : (crc >> 1);

You can see that which shifted copies of P get XOR’d in is uniquely determined by the rightmost 8 bits of crc. Thus, we can rewrite the calculation procedure as follows:

// I use A, B, C, D, ...
// to stand for either 0 or the polynomial.

crc = (crc >> 1) ^ A
crc = (crc >> 1) ^ B
crc = (crc >> 1) ^ C
crc = (crc >> 1) ^ D
crc = (crc >> 1) ^ E
crc = (crc >> 1) ^ F
crc = (crc >> 1) ^ G
crc = (crc >> 1) ^ H

We then rewrite the above procedure to a single expression:

(((((((((((((((crc >> 1) ^ A) >> 1) ^ B) >> 1) ^ C) >> 1) ^ D) >> 1) ^ E) >> 1) ^ F) >> 1) ^ G) >> 1) ^ H

Re-distribute the shifts for simplifying the expression:

(crc >> 8) ^ (A >> 7) ^ (B >> 6) ^ (C >> 5) ^ (D >> 4) ^ (E >> 3) ^ (F >> 2) ^ (G >> 1) ^ H

Then, combine all the terms from A to H into a single value T:

(crc >> 8) ^ T

We can precompute the value of T because it takes only 256 possible values (recall that it depends only on the rightmost 8 bits):

// Adapted from qemu: https://github.com/qemu/qemu/blob/907209e3111dd62a553a19319b422ff8aba5b9c0/util/crc32c.c#L40

static const uint32_t _sse2neon_crc32_tbl[] = {
    0x00000000, 0xF26B8303, 0xE13B70F7, 0x1350F3F4,
    0xC79A971F, 0x35F1141C, 0x26A1E7E8, 0xD4CA64EB,
    0x8AD958CF, 0x78B2DBCC, 0x6BE22838, 0x9989AB3B,
    0x4D43CFD0, 0xBF284CD3, 0xAC78BF27, 0x5E133C24,
    0x105EC76F, 0xE235446C, 0xF165B798, 0x030E349B,
    0xD7C45070, 0x25AFD373, 0x36FF2087, 0xC494A384,
    0x9A879FA0, 0x68EC1CA3, 0x7BBCEF57, 0x89D76C54,
    0x5D1D08BF, 0xAF768BBC, 0xBC267848, 0x4E4DFB4B,
    0x20BD8EDE, 0xD2D60DDD, 0xC186FE29, 0x33ED7D2A,
    0xE72719C1, 0x154C9AC2, 0x061C6936, 0xF477EA35,
    0xAA64D611, 0x580F5512, 0x4B5FA6E6, 0xB93425E5,
    0x6DFE410E, 0x9F95C20D, 0x8CC531F9, 0x7EAEB2FA,
    0x30E349B1, 0xC288CAB2, 0xD1D83946, 0x23B3BA45,
    0xF779DEAE, 0x05125DAD, 0x1642AE59, 0xE4292D5A,
    0xBA3A117E, 0x4851927D, 0x5B016189, 0xA96AE28A,
    0x7DA08661, 0x8FCB0562, 0x9C9BF696, 0x6EF07595,
    0x417B1DBC, 0xB3109EBF, 0xA0406D4B, 0x522BEE48,
    0x86E18AA3, 0x748A09A0, 0x67DAFA54, 0x95B17957,
    0xCBA24573, 0x39C9C670, 0x2A993584, 0xD8F2B687,
    0x0C38D26C, 0xFE53516F, 0xED03A29B, 0x1F682198,
    0x5125DAD3, 0xA34E59D0, 0xB01EAA24, 0x42752927,
    0x96BF4DCC, 0x64D4CECF, 0x77843D3B, 0x85EFBE38,
    0xDBFC821C, 0x2997011F, 0x3AC7F2EB, 0xC8AC71E8,
    0x1C661503, 0xEE0D9600, 0xFD5D65F4, 0x0F36E6F7,
    0x61C69362, 0x93AD1061, 0x80FDE395, 0x72966096,
    0xA65C047D, 0x5437877E, 0x4767748A, 0xB50CF789,
    0xEB1FCBAD, 0x197448AE, 0x0A24BB5A, 0xF84F3859,
    0x2C855CB2, 0xDEEEDFB1, 0xCDBE2C45, 0x3FD5AF46,
    0x7198540D, 0x83F3D70E, 0x90A324FA, 0x62C8A7F9,
    0xB602C312, 0x44694011, 0x5739B3E5, 0xA55230E6,
    0xFB410CC2, 0x092A8FC1, 0x1A7A7C35, 0xE811FF36,
    0x3CDB9BDD, 0xCEB018DE, 0xDDE0EB2A, 0x2F8B6829,
    0x82F63B78, 0x709DB87B, 0x63CD4B8F, 0x91A6C88C,
    0x456CAC67, 0xB7072F64, 0xA457DC90, 0x563C5F93,
    0x082F63B7, 0xFA44E0B4, 0xE9141340, 0x1B7F9043,
    0xCFB5F4A8, 0x3DDE77AB, 0x2E8E845F, 0xDCE5075C,
    0x92A8FC17, 0x60C37F14, 0x73938CE0, 0x81F80FE3,
    0x55326B08, 0xA759E80B, 0xB4091BFF, 0x466298FC,
    0x1871A4D8, 0xEA1A27DB, 0xF94AD42F, 0x0B21572C,
    0xDFEB33C7, 0x2D80B0C4, 0x3ED04330, 0xCCBBC033,
    0xA24BB5A6, 0x502036A5, 0x4370C551, 0xB11B4652,
    0x65D122B9, 0x97BAA1BA, 0x84EA524E, 0x7681D14D,
    0x2892ED69, 0xDAF96E6A, 0xC9A99D9E, 0x3BC21E9D,
    0xEF087A76, 0x1D63F975, 0x0E330A81, 0xFC588982,
    0xB21572C9, 0x407EF1CA, 0x532E023E, 0xA145813D,
    0x758FE5D6, 0x87E466D5, 0x94B49521, 0x66DF1622,
    0x38CC2A06, 0xCAA7A905, 0xD9F75AF1, 0x2B9CD9F2,
    0xFF56BD19, 0x0D3D3E1A, 0x1E6DCDEE, 0xEC064EED,
    0xC38D26C4, 0x31E6A5C7, 0x22B65633, 0xD0DDD530,
    0x0417B1DB, 0xF67C32D8, 0xE52CC12C, 0x1747422F,
    0x49547E0B, 0xBB3FFD08, 0xA86F0EFC, 0x5A048DFF,
    0x8ECEE914, 0x7CA56A17, 0x6FF599E3, 0x9D9E1AE0,
    0xD3D3E1AB, 0x21B862A8, 0x32E8915C, 0xC083125F,
    0x144976B4, 0xE622F5B7, 0xF5720643, 0x07198540,
    0x590AB964, 0xAB613A67, 0xB831C993, 0x4A5A4A90,
    0x9E902E7B, 0x6CFBAD78, 0x7FAB5E8C, 0x8DC0DD8F,
    0xE330A81A, 0x115B2B19, 0x020BD8ED, 0xF0605BEE,
    0x24AA3F05, 0xD6C1BC06, 0xC5914FF2, 0x37FACCF1,
    0x69E9F0D5, 0x9B8273D6, 0x88D28022, 0x7AB90321,
    0xAE7367CA, 0x5C18E4C9, 0x4F48173D, 0xBD23943E,
    0xF36E6F75, 0x0105EC76, 0x12551F82, 0xE03E9C81,
    0x34F4F86A, 0xC69F7B69, 0xD5CF889D, 0x27A40B9E,
    0x79B737BA, 0x8BDCB4B9, 0x988C474D, 0x6AE7C44E,
    0xBE2DA0A5, 0x4C4623A6, 0x5F16D052, 0xAD7D5351, 
};

FORCE_INLINE uint32_t _mm_crc32_u8(uint32_t crc, uint8_t v)
{
    crc ^= v;
    crc = (crc >> 8) ^ _sse2neon_crc32_tbl[crc & 0xFF];
    return crc;
}

However, the reviewer requested not to use this because it costs 1 KiB of space ⁵, ~~which, from my point of view, is costly on embedded systems such as the Raspberry Pi. Therefore, we have to come up with another tabular solution that balances performance and space.~~

Edit: thanks to the reply in ⁶, I learned that the table in this implementation occupies 16 cache lines, as the cache line size of most CPU architectures is 64 B. Hence, we ought to find a solution whose pre-computed values all fit in a single cache line.

tabular method (half-byte)

As mentioned in ⁷, we can break the 8-bit table look-up into two consecutive 4-bit table look-ups:

FORCE_INLINE uint32_t _mm_crc32_u8(uint32_t crc, uint8_t v)
{
    crc ^= v;
    static const uint32_t crc32_half_byte_tbl[] = {
        0x00000000, 0x105ec76f, 0x20bd8ede, 0x30e349b1, 0x417b1dbc, 0x5125dad3,
        0x61c69362, 0x7198540d, 0x82f63b78, 0x92a8fc17, 0xa24bb5a6, 0xb21572c9,
        0xc38d26c4, 0xd3d3e1ab, 0xe330a81a, 0xf36e6f75,
    };

    crc = (crc >> 4) ^ crc32_half_byte_tbl[crc & 0x0F];
    crc = (crc >> 4) ^ crc32_half_byte_tbl[crc & 0x0F];
    return crc;
}

The look-up table only needs to hold every 16th entry of the one-byte table, so it has 16 entries and takes only 64 B! The second look-up depends on the result of the first, so it cannot fully benefit from out-of-order execution in modern CPUs, but I think this is an acceptable compromise as all the pre-computed values fit into one cache line.

using Arm Cryptography Extension

Though the tabular method performs well, we always have to make a trade-off between performance and space: for better performance, such as avoiding the dependency between look-ups, we need more space to store the look-up table; whereas if we reduce the space, we cannot avoid that dependency, as shown in the tabular method (half-byte) section.

The Arm Cryptography Extension provides certain operations which we can utilize so that we don’t need to store a look-up table. Before using the Arm Cryptography Extension, I would like to introduce Barrett reduction, as it is the bedrock of further optimizing the CRC calculation.

Barrett reduction ⁸

Recall that the foundation of CRC is polynomial division of a message by a certain polynomial in order to get the remainder ². As division is an expensive operation on a computer, we can replace the division with a multiplication by the multiplicative inverse of the polynomial, and this is the core concept of Barrett reduction.

So, to get the CRC of message \(a\) with polynomial \(p\):

\[a \mod p = a - \lfloor sa \rfloor p\]

where \(s = 1/p\)

In practice, we can approximate \(1/p\) with a value \(m/2^k\), as division by \(2^k\) is merely a right shift by \(k\) bits.

I set \(k=64\) in my implementation as this is usually enough, and we can pre-calculate \(s\). Thanks to the post in ⁹, we can use the uint256_t ¹⁰ project to get \(s\) with the following code snippet:

#include <cstdio>
#include "uint256_t.h"

int find_mu(int i)
{
    uint256_t dividend = uint256_t{1} << i;
    const uint256_t divisor = 0x11EDC6F41; // polynomial used by CRC32C
    const int bits_in_divisor = 33; 

    uint256_t result = 0;
    int bit = 255;
    while (bit >= 0) {
        if ((dividend & (uint256_t{1} << bit)) != 0) {
            int shift = bit - bits_in_divisor + 1;
            if (shift >= 0) {
                dividend ^= divisor << shift;
                result ^= uint256_t{1} << shift;
            } else {
                dividend ^= divisor >> -shift;
            }
        }
        bit--;
    }   

    printf("%s\n", result.str(16).c_str());
    return 0;
}

int main()
{
    find_mu(64); // 2^64 / p
    return 0;
}

carry-less multiplication

Recall that the main concept of CRC is polynomial division. Polynomial arithmetic never carries between terms; yet allowing each coefficient to take an arbitrary value is impractical. We can instead do the following: still don’t carry, but keep each coefficient in a sensible range. Because we are using a computer to perform the polynomial division, we limit the range to \([0, 1]\). That is, we reduce each coefficient MODULO 2.

There is an interesting property of polynomial operations MODULO 2: every polynomial operation is equivalent to binary arithmetic with no carries, or “carry-less” binary arithmetic. Consequently, we can replace the multiplication in Barrett reduction with carry-less multiplication.

Though Barrett reduction with carry-less multiplication does not need to store a look-up table, it needs the target to support hardware-accelerated carry-less multiplication, as an ordinary software carry-less multiplication requires \(O(b^2)\) time (\(b\) is the number of bits), which usually performs worse than the look-up table method. Thankfully, the Arm Cryptography Extension provides hardware-accelerated carry-less multiplication.

Summing up, we can come up with the following implementation:

FORCE_INLINE uint32_t _mm_crc32_u8(uint32_t crc, uint8_t v)
{
    ...
    // Adapted from: https://mary.rs/lab/crc32/
    // If target supports Arm Cryptography Extension:

    // Barrett reduction
    uint64x2_t orig =
        vcombine_u64(vcreate_u64((uint64_t) (crc) << 24), vcreate_u64(0x0));
    uint64x2_t tmp = orig;

    // Polynomial P(x) of CRC32C
    uint64_t p = 0x105EC76F1;
    // Barrett Reduction (in bit-reflected form) constant mu_{64} = \lfloor
    // 2^{64} / P(x) \rfloor = 0x11f91caf6
    uint64_t mu = 0x1dea713f1;

    // Note: the _sse2neon_vmull_p64 is a wrapper of carry-less multiplication
    // Multiply by mu_{64}
    tmp = _sse2neon_vmull_p64(vget_low_u64(tmp), vcreate_u64(mu));
    // Divide by 2^{64} (mask away the unnecessary bits)
    tmp =
        vandq_u64(tmp, vcombine_u64(vcreate_u64(0xFFFFFFFF), vcreate_u64(0x0)));
    // Multiply by P(x) (shifted left by 1 for alignment reasons)
    tmp = _sse2neon_vmull_p64(vget_low_u64(tmp), vcreate_u64(p));
    // Subtract original from result
    tmp = veorq_u64(tmp, orig);

    // Extract the 'lower' (in bit-reflected sense) 32 bits
    crc = vgetq_lane_u32(vreinterpretq_u32_u64(tmp), 1);

    return crc;
}

Conclusion

In this post, I have shown two methods of optimizing the CRC32C calculation, and these implementations have been merged into sse2neon. I also gave brief descriptions of CRC and carry-less multiplication, which are commonly seen topics in cryptography.

Trivia

While I was measuring the running time of each implementation, I found that the order in which the test functions run affects the running time of each implementation in qemu.

Reference

¹ https://en.wikipedia.org/wiki/Cyclic_redundancy_check
² https://github.com/komrad36/CRC
³ https://github.com/DLTcollab/sse2neon/blob/cfaa59fc04fecb117c0a0f3fe9c82dece6f359ad/sse2neon.h#L8502
⁴ https://github.com/DLTcollab/sse2neon/pull/627#discussion_r1453360563
⁵ https://github.com/DLTcollab/sse2neon/pull/627#issuecomment-1895992394
⁶ https://www.facebook.com/groups/system.software2024/posts/1556960111748548/?comment_id=1556971665080726 (the comments are written in Mandarin)
⁷ https://create.stephan-brumme.com/crc32/#half-byte
⁸ https://en.wikipedia.org/wiki/Barrett_reduction
⁹ https://mary.rs/lab/crc32/
¹⁰ https://github.com/calccrypto/uint256_t

My sse2neon Contribution of _rdtsc

2023-04-02T00:00:00+00:00

In this post, I am going to walk through my contribution of the _rdtsc [^1] conversion to sse2neon. First I will introduce the usage of _rdtsc, then talk about the implementation and the test case. The full implementation can be seen here: https://github.com/DLTcollab/sse2neon/pull/532

What’s `_rdtsc`

_rdtsc is an x86 intrinsic that reads the current timestamp counter from the processor. What makes it special is that it gets the timestamp directly from hardware, which makes it suitable for measuring precise execution time.

Implementations

As this post is about the conversion, let’s talk about how I implemented it on each target.

ARMv8-A

Pretty straightforward. You just read the value from CNTVCT_EL0 (counter-timer virtual count register).

uint64_t val;
__asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(val));
return val;

ARMv7-A

The ARMv7-A counterpart is trickier as it has no CNTVCT_EL0. Instead, you can get the value from PMCCNTR (performance monitors cycle count register).

Nevertheless, PMCCNTR can be accessed only in one of the following conditions:

All modes executing at PL1 or higher.
User mode when PMCUSERENR.EN == 1 (PMCUSERENR stands for performance monitors user enable register).

What’s more, PMCCNTR starts to count only if PMCNTENSET (performance monitors count enable set register) is set.

If these conditions are not met, you will not be able to access PMCCNTR or get a meaningful value from it. In that case, you can fall back to gettimeofday, which obtains the current time through the kernel rather than by reading the register directly.

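uint32_t pmccntr, pmuseren, pmcntenset;
// Read PMUSERENR to check whether user mode may access the PMU.
__asm__ __volatile__("mrc p15, 0, %0, c9, c14, 0" : "=r"(pmuseren));
if (pmuseren & 1) {  // PMUSERENR.EN is set
    // Read PMCNTENSET to check whether the cycle counter is enabled.
    __asm__ __volatile__("mrc p15, 0, %0, c9, c12, 1" : "=r"(pmcntenset));
    if (pmcntenset & 0x80000000UL) {  // bit 31 enables the cycle counter
        __asm__ __volatile__("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr));
        // Scale by 64, assuming the counter is set up to count every 64th cycle.
        return (uint64_t) (pmccntr) << 6;
    }
}

// Fallback when PMCCNTR is not usable: wall-clock time in microseconds.
struct timeval tv;
gettimeofday(&tv, NULL);
return (uint64_t) (tv.tv_sec) * 1000000 + tv.tv_usec;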

Test Cases

In order to check that the implementation works, I added a dedicated unit test.

// test case implementation
result_t test_rdtsc(const SSE2NEONTestImpl &impl, uint32_t iter)
{
    uint64_t start = _rdtsc();
    for (int i = 0; i < 100000; i++)
        __asm__ __volatile__("" ::: "memory");
    uint64_t end = _rdtsc();
    return end > start ? TEST_SUCCESS : TEST_FAIL;
}

The test procedure is as follows:

1. Get the current timestamp.
2. Run a long-running for loop.
3. Get the current timestamp again.
4. Check whether the timestamp from step 3 is larger than the one from step 1.

Why does the for loop look so strange?

You may ask why not create the long-running for loop as follows:

for(int i = 0; i < 100000; i++)
    ; // no-op

The reason is that modern compilers sometimes eliminate loops that contain no operations. Therefore, we need a trick that creates a long-running for loop without it being removed by the compiler. Fortunately, we can use __asm__ __volatile__("" ::: "memory"); to do the trick.

So you may ask another question: why can __asm__ __volatile__("" ::: "memory"); fulfill the task?

According to this post [^2], __asm__ __volatile__("" ::: "memory"); creates a compiler barrier. What’s more, with the help of the volatile keyword, the compiler won’t optimize this assembly statement away. Therefore, we get a statement that does nothing yet cannot be removed, and the long-running for loop serves its purpose.

References

[^1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=rdtsc

[^2] https://preshing.com/20120625/memory-ordering-at-compile-time/

My Moderna COVID-19 First Booster Vaccination Report

2022-05-29T00:00:00+00:00

In this post, I record how I felt after getting the Moderna COVID-19 vaccine.

Day 1 (the day after getting vaccinated)

Sore muscle on vaccinated arm.
Moderate fatigue.

Day 2

Strong fatigue.
Sore muscle on vaccinated arm.
Headache.
Fever.
Sweating a lot; I bet I have never sweated this much before, even when exercising.
Palpitation.

Day 3

A little fatigue.
A little sore muscle on vaccinated arm.
Coughing.
Stuffy nose.
Palpitation.

Day 4

A little fatigue.
Coughing sometimes.
Stuffy nose.
Palpitation.

Day 5

A little fatigue.
Coughing sometimes.
Stuffy nose.

Day 6

A little fatigue.
Coughing seldomly.

Day 7

Feel a little fatigue sometimes.
Coughing seldomly.
Got a positive PCR report today.

Day 8

Awaken as hell.
Coughing seldomly.

Day 9

Awaken as hell.
Coughing seldomly.

Day 10

Awaken as hell.

Day 11

Awaken as hell.

Day 12

Awaken as hell.

Day 13

Awaken as hell.

Day 14

Awaken as hell.
Got the end-of-quarantine notification today.

Switch Your Jekyll Blog to Google Analytics 4 Simplified

2022-04-30T00:00:00+00:00

Introduction

As Google reminds us [^1], Google Analytics will be replaced by Google Analytics 4 after July 1st, 2023. As a Google Analytics user, I wrote down the switching procedure so that other users have a post showing how to switch a Jekyll blog from Google Analytics to Google Analytics 4.

Switching Procedures

1. Create a Google Analytics 4 Property. For more information you can visit the help center of Google.
2. Record your measurement ID. This answer will let you find your measurement ID.
3. Put your measurement ID in _config.yml. Usually you put your measurement ID like this:
```
google_analytics: 
```
4. (Minima and GitHub Pages only) Use a remote theme. For some reason, you have to use a remote theme if your Jekyll blog uses the minima theme and is hosted on GitHub Pages. Usually you set your blog to use a remote theme like this:

# _config.yml

- theme: minima
+ remote_theme: jekyll/minima

  plugins:
+ - jekyll-remote-theme

References

[^1] https://support.google.com/analytics/answer/10089681?hl=en

My Medigenvac COVID-19 Second Vaccination Report

2021-10-30T00:00:00+00:00

In this post, I record how I felt after getting the second shot of the Medigenvac COVID-19 vaccine.

Day 1 (the day I got vaccinated)

Awaken as hell.

Day 2

Awaken as hell.

Day 3

Awaken as hell.
Sore shoulder on the vaccinated side.

Day 4

Awaken as hell.

Day 5

Awaken as hell.

Day 6

Awaken as hell.

Day 7

Awaken as hell.

Day 8

Awaken as hell.

Day 9

Awaken as hell.

Day 10

Awaken as hell.

Day 11

Awaken as hell.

Day 12

Awaken as hell.

Day 13

Awaken as hell.

Day 14

Awaken as hell.

How My LeNet Achieves 99% Accuracy

2021-09-23T00:00:00+00:00

Introduction

Fine-tuning plays a great role in model training, and understanding the meaning of each hyperparameter helps you succeed.

In this post, I am going to show you how I achieved 99% top-1 accuracy on MNIST handwritten digit recognition by tuning just three hyperparameters. I also implement a classic CNN model, LeNet-5, from scratch to familiarize myself with the structure and the basic components of a CNN model. What’s more, I build my LeNet-5 model in Flux.jl to show an example of a Julia neural network framework.

Base Model

You can get the code from my repo.

The base model is the well-known 5-layer LeNet [^1], and the implementation is adapted from the Flux.jl model zoo [^2]. There are some differences between the original LeNet and the implementation in the Flux.jl model zoo [^3]:

The convolution layers in LeNet use sigmoid as the activation function, whilst the Flux.jl model zoo uses ReLU.
The pooling layers in LeNet use average pooling, whereas the Flux.jl model zoo uses max pooling.
The pooling layers in LeNet use a scaled hyperbolic tangent as the activation function, while the Flux.jl model zoo uses identity (linear).
The multi-class classification in the original LeNet paper uses Euclidean radial basis functions (RBF) [^3]. However, softmax is used in Flux.jl’s implementation.

For your ease, I list the structure of my implementation:

The structure of the base model. You can right-click to show the image in a new tab. Sorry for the inconvenience; NN-SVG (http://alexlenail.me/NN-SVG/LeNet.html) does not have any option to resize the image.

Let’s fine-tune!

Hyperparameter tuning plays a crucial role in machine learning model development. Though the LeNet implementation in the Flux.jl model zoo can achieve 98% top-1 accuracy, I still want to see whether I can push past that. What’s more, by experimenting with tuning, I can learn which parameters play the major role in a certain task.

baseline and its performance

Baseline model can be found here: https://github.com/FluxML/model-zoo/blob/master/vision/conv_mnist/conv_mnist.jl

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:41
Epoch: 1   Train: (loss = 0.1586f0, acc = 95.3417)   Test: (loss = 0.145f0, acc = 95.53)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 2   Train: (loss = 0.1079f0, acc = 96.7733)   Test: (loss = 0.0958f0, acc = 97.03)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 3   Train: (loss = 0.0829f0, acc = 97.515)   Test: (loss = 0.0717f0, acc = 97.75)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 4   Train: (loss = 0.0639f0, acc = 98.0883)   Test: (loss = 0.0573f0, acc = 98.21)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 5   Train: (loss = 0.0614f0, acc = 98.12)   Test: (loss = 0.0539f0, acc = 98.25)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 6   Train: (loss = 0.0593f0, acc = 98.2017)   Test: (loss = 0.058f0, acc = 98.13)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 7   Train: (loss = 0.0464f0, acc = 98.6083)   Test: (loss = 0.0464f0, acc = 98.52)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 8   Train: (loss = 0.04f0, acc = 98.7867)   Test: (loss = 0.039f0, acc = 98.77)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 9   Train: (loss = 0.0393f0, acc = 98.7833)   Test: (loss = 0.0416f0, acc = 98.63)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 10   Train: (loss = 0.0348f0, acc = 98.9667)   Test: (loss = 0.0388f0, acc = 98.67)

batch size

Batch size is the number of training samples used in one iteration. In other words, the parameters of the model are updated (formally, the loss is calculated and then back-propagated) after ingesting that many training samples. Consider the two extremes:

If you update the parameters only after ingesting the whole dataset, each epoch needs a single update, but the model may perform poorly on real cases because it can fall into the trap of local minima. Besides, it needs a huge amount of memory to load the data.
If you update the parameters after every single sample (only one sample per iteration), you may get a model with a fantastic outcome, but training takes an extraordinarily long time because the parameters are updated in every iteration.

As such, choosing the right batch size can:

reduce the training time and memory
converge to better performance

In this post, I tried different batch sizes, and the best batch size on my training platform is 32.

Batch Size	Testing Accuracy (after training for 10 epochs)
32	98.94%
64	98.9%
256	98.54%
512	98.21%

The following are the training logs for the different batch sizes:

32

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:40
Epoch: 1   Train: (loss = 0.1069f0, acc = 96.725)   Test: (loss = 0.092f0, acc = 97.28)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 2   Train: (loss = 0.0645f0, acc = 98.0217)   Test: (loss = 0.0578f0, acc = 98.16)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 3   Train: (loss = 0.0467f0, acc = 98.6183)   Test: (loss = 0.0439f0, acc = 98.64)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 4   Train: (loss = 0.0407f0, acc = 98.7817)   Test: (loss = 0.0415f0, acc = 98.67)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 5   Train: (loss = 0.0392f0, acc = 98.8017)   Test: (loss = 0.0428f0, acc = 98.68)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 6   Train: (loss = 0.0329f0, acc = 98.915)   Test: (loss = 0.0408f0, acc = 98.71)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0207f0, acc = 99.395)   Test: (loss = 0.0322f0, acc = 99.01)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 8   Train: (loss = 0.0196f0, acc = 99.3833)   Test: (loss = 0.0294f0, acc = 99.02)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 9   Train: (loss = 0.0179f0, acc = 99.45)   Test: (loss = 0.0345f0, acc = 98.92)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 10   Train: (loss = 0.0166f0, acc = 99.4283)   Test: (loss = 0.0328f0, acc = 98.94)

64

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:40
Epoch: 1   Train: (loss = 0.1307f0, acc = 96.045)   Test: (loss = 0.1139f0, acc = 96.49)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 2   Train: (loss = 0.0852f0, acc = 97.33)   Test: (loss = 0.0752f0, acc = 97.62)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 3   Train: (loss = 0.0617f0, acc = 98.1583)   Test: (loss = 0.0555f0, acc = 98.39)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 4   Train: (loss = 0.0485f0, acc = 98.5767)   Test: (loss = 0.0454f0, acc = 98.5)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 5   Train: (loss = 0.0515f0, acc = 98.3933)   Test: (loss = 0.0481f0, acc = 98.51)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 6   Train: (loss = 0.0464f0, acc = 98.545)   Test: (loss = 0.0469f0, acc = 98.56)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0323f0, acc = 99.0033)   Test: (loss = 0.0365f0, acc = 98.77)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 8   Train: (loss = 0.0298f0, acc = 99.0417)   Test: (loss = 0.0337f0, acc = 98.96)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 9   Train: (loss = 0.0327f0, acc = 98.945)   Test: (loss = 0.0393f0, acc = 98.77)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 10   Train: (loss = 0.0273f0, acc = 99.1333)   Test: (loss = 0.0351f0, acc = 98.9)

256

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:38
Epoch: 1   Train: (loss = 0.2218f0, acc = 93.6817)   Test: (loss = 0.2066f0, acc = 94.14)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 2   Train: (loss = 0.137f0, acc = 95.965)   Test: (loss = 0.1233f0, acc = 96.37)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 3   Train: (loss = 0.1088f0, acc = 96.7117)   Test: (loss = 0.0953f0, acc = 97.17)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 4   Train: (loss = 0.0858f0, acc = 97.4033)   Test: (loss = 0.0755f0, acc = 97.7)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 5   Train: (loss = 0.0746f0, acc = 97.775)   Test: (loss = 0.0657f0, acc = 98.03)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 6   Train: (loss = 0.0665f0, acc = 98.0417)   Test: (loss = 0.0597f0, acc = 98.1)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 7   Train: (loss = 0.0603f0, acc = 98.2617)   Test: (loss = 0.0554f0, acc = 98.32)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 8   Train: (loss = 0.0535f0, acc = 98.4033)   Test: (loss = 0.0481f0, acc = 98.45)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 9   Train: (loss = 0.052f0, acc = 98.4883)   Test: (loss = 0.0496f0, acc = 98.47)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 10   Train: (loss = 0.047f0, acc = 98.5767)   Test: (loss = 0.0445f0, acc = 98.54)

512

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:38
Epoch: 1   Train: (loss = 0.3686f0, acc = 89.5733)   Test: (loss = 0.3486f0, acc = 90.57)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 2   Train: (loss = 0.2046f0, acc = 94.0917)   Test: (loss = 0.1919f0, acc = 94.34)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:11
Epoch: 3   Train: (loss = 0.1542f0, acc = 95.425)   Test: (loss = 0.1387f0, acc = 95.9)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 4   Train: (loss = 0.1233f0, acc = 96.3467)   Test: (loss = 0.1119f0, acc = 96.6)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 5   Train: (loss = 0.1032f0, acc = 96.9167)   Test: (loss = 0.0912f0, acc = 97.32)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 6   Train: (loss = 0.0923f0, acc = 97.2533)   Test: (loss = 0.0831f0, acc = 97.56)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 7   Train: (loss = 0.0831f0, acc = 97.5483)   Test: (loss = 0.074f0, acc = 97.82)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 8   Train: (loss = 0.0778f0, acc = 97.6967)   Test: (loss = 0.0709f0, acc = 97.84)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 9   Train: (loss = 0.0732f0, acc = 97.8883)   Test: (loss = 0.0674f0, acc = 97.94)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:10
Epoch: 10   Train: (loss = 0.0661f0, acc = 98.0383)   Test: (loss = 0.0594f0, acc = 98.21)

regularizer parameter

A regularizer adds a penalty term to the loss so that the model is less likely to overfit. L1 and L2 regularizers are the usual choices, and I chose an L2 regularizer for my LeNet-5 model.
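As a rough illustration of the penalty idea, here is a minimal Python sketch of one common L2 formulation, where the penalty is the regularizer parameter times the sum of squared weights. The function names `l2_penalty` and `regularized_loss` and the sample weights are made up for this example; this is not the Flux.jl code used for the experiments.

```python
def l2_penalty(weights, lam):
    """Return lam * (sum of squared weights), one common form of the L2 penalty."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Add the L2 penalty to a data loss that was computed elsewhere."""
    return data_loss + l2_penalty(weights, lam)


if __name__ == "__main__":
    weights = [0.5, -1.0, 2.0]  # made-up weights, for illustration only
    for lam in (1e-2, 1e-4, 1e-6):
        # A larger lam makes large weights more expensive.
        print(lam, regularized_loss(0.05, weights, lam))
```

With a large parameter such as 1e-2 the penalty can dominate a small data loss, while 1e-6 only nudges the weights gently.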

In this experiment, the best L2 regularizer parameter is 1e-6.

L2 Regularizer Parameter	Testing Accuracy (after training for 10 epochs)
1e-2	97.68%
1e-4	98.87%
1e-6	99.05%

As usual, here are the training logs for the different regularizer parameters:

1e-2

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:41
Epoch: 1   Train: (loss = 0.1379f0, acc = 96.1117)   Test: (loss = 0.123f0, acc = 96.52)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 2   Train: (loss = 0.1076f0, acc = 96.9583)   Test: (loss = 0.0933f0, acc = 97.28)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 3   Train: (loss = 0.1239f0, acc = 96.2667)   Test: (loss = 0.1089f0, acc = 96.64)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 4   Train: (loss = 0.1041f0, acc = 97.16)   Test: (loss = 0.0915f0, acc = 97.57)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 5   Train: (loss = 0.1092f0, acc = 96.965)   Test: (loss = 0.1014f0, acc = 97.17)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 6   Train: (loss = 0.0911f0, acc = 97.4883)   Test: (loss = 0.0808f0, acc = 97.74)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0894f0, acc = 97.5717)   Test: (loss = 0.0816f0, acc = 97.79)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 8   Train: (loss = 0.0891f0, acc = 97.5483)   Test: (loss = 0.0796f0, acc = 97.79)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 9   Train: (loss = 0.0941f0, acc = 97.36)   Test: (loss = 0.0849f0, acc = 97.44)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 10   Train: (loss = 0.0955f0, acc = 97.3467)   Test: (loss = 0.0844f0, acc = 97.68)

1e-4

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:40
Epoch: 1   Train: (loss = 0.1079f0, acc = 96.7133)   Test: (loss = 0.0922f0, acc = 97.23)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 2   Train: (loss = 0.0633f0, acc = 98.055)   Test: (loss = 0.0565f0, acc = 98.19)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:12
Epoch: 3   Train: (loss = 0.0478f0, acc = 98.5733)   Test: (loss = 0.0448f0, acc = 98.52)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 4   Train: (loss = 0.041f0, acc = 98.7333)   Test: (loss = 0.0418f0, acc = 98.62)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 5   Train: (loss = 0.0394f0, acc = 98.7783)   Test: (loss = 0.0425f0, acc = 98.7)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 6   Train: (loss = 0.0351f0, acc = 98.88)   Test: (loss = 0.0424f0, acc = 98.55)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0218f0, acc = 99.335)   Test: (loss = 0.0317f0, acc = 99.05)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 8   Train: (loss = 0.0214f0, acc = 99.35)   Test: (loss = 0.0304f0, acc = 98.9)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 9   Train: (loss = 0.0207f0, acc = 99.36)   Test: (loss = 0.0335f0, acc = 98.91)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 10   Train: (loss = 0.0206f0, acc = 99.3133)   Test: (loss = 0.0345f0, acc = 98.87)

1e-6

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:49
Epoch: 1   Train: (loss = 0.1077f0, acc = 96.72)   Test: (loss = 0.092f0, acc = 97.23)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 2   Train: (loss = 0.0647f0, acc = 98.005)   Test: (loss = 0.058f0, acc = 98.17)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 3   Train: (loss = 0.0449f0, acc = 98.67)   Test: (loss = 0.0419f0, acc = 98.62)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 4   Train: (loss = 0.0443f0, acc = 98.6667)   Test: (loss = 0.0451f0, acc = 98.53)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 5   Train: (loss = 0.0419f0, acc = 98.645)   Test: (loss = 0.043f0, acc = 98.76)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 6   Train: (loss = 0.0337f0, acc = 98.925)   Test: (loss = 0.0406f0, acc = 98.7)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 7   Train: (loss = 0.0214f0, acc = 99.3417)   Test: (loss = 0.0325f0, acc = 98.93)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 8   Train: (loss = 0.0211f0, acc = 99.345)   Test: (loss = 0.0303f0, acc = 99.06)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 9   Train: (loss = 0.0217f0, acc = 99.31)   Test: (loss = 0.0363f0, acc = 98.83)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 10   Train: (loss = 0.0154f0, acc = 99.51)   Test: (loss = 0.0317f0, acc = 99.05)

optimizer

An optimizer in machine learning updates the model's parameters from the gradients of the loss. Adaptive optimizers also adjust the effective step size of each parameter during training, which affects how quickly the model converges and how well it generalizes. In this post, I compare three variants of the commonly seen ADAM optimizer: ADAMW, NADAM, and AdaBelief. For descriptions of these optimizers, see the optimizer documentation of Flux.jl.
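To make the idea of an adaptive optimizer more concrete, here is a minimal Python sketch of a single AdamW-style update for one scalar weight, following the commonly described rule (moment estimates, bias correction, and decoupled weight decay). The function name `adamw_step` and the default values are assumptions for illustration, and the Flux.jl implementation may differ in its details.

```python
import math


def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """Apply one AdamW-style update to a single scalar weight.

    Returns the new weight and the updated moment estimates (m, v).
    t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v


if __name__ == "__main__":
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, 4):
        # Pretend the gradient is always 1.0, for illustration only.
        w, m, v = adamw_step(w, 1.0, m, v, t, lr=0.1)
        print(t, w)
```

The step is scaled by the running estimate of the squared gradient, so the effective step size differs per parameter even though `lr` itself stays fixed.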

In this post, the best optimizer is ADAMW.

Optimizer Type	Testing Accuracy (after training for 10 epochs)
ADAMW	99.05%
NADAM	98.92%
AdaBelief	99.01%

And here are the training logs for the different optimizers:

ADAMW

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:44
Epoch: 1   Train: (loss = 0.1077f0, acc = 96.72)   Test: (loss = 0.092f0, acc = 97.23)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 2   Train: (loss = 0.0647f0, acc = 98.005)   Test: (loss = 0.058f0, acc = 98.17)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 3   Train: (loss = 0.0449f0, acc = 98.67)   Test: (loss = 0.0419f0, acc = 98.62)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 4   Train: (loss = 0.0443f0, acc = 98.6667)   Test: (loss = 0.0451f0, acc = 98.53)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 5   Train: (loss = 0.0419f0, acc = 98.645)   Test: (loss = 0.043f0, acc = 98.76)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 6   Train: (loss = 0.0337f0, acc = 98.925)   Test: (loss = 0.0406f0, acc = 98.7)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 7   Train: (loss = 0.0214f0, acc = 99.3417)   Test: (loss = 0.0325f0, acc = 98.93)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 8   Train: (loss = 0.0211f0, acc = 99.345)   Test: (loss = 0.0303f0, acc = 99.06)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 9   Train: (loss = 0.0217f0, acc = 99.31)   Test: (loss = 0.0363f0, acc = 98.83)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 10   Train: (loss = 0.0154f0, acc = 99.51)   Test: (loss = 0.0317f0, acc = 99.05)

NADAM

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:43
Epoch: 1   Train: (loss = 0.108f0, acc = 96.6633)   Test: (loss = 0.0922f0, acc = 97.22)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 2   Train: (loss = 0.0616f0, acc = 98.145)   Test: (loss = 0.0547f0, acc = 98.3)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 3   Train: (loss = 0.0479f0, acc = 98.5433)   Test: (loss = 0.0454f0, acc = 98.5)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 4   Train: (loss = 0.0399f0, acc = 98.8417)   Test: (loss = 0.0407f0, acc = 98.61)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 5   Train: (loss = 0.0411f0, acc = 98.6967)   Test: (loss = 0.0435f0, acc = 98.67)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 6   Train: (loss = 0.0334f0, acc = 98.915)   Test: (loss = 0.0411f0, acc = 98.66)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 7   Train: (loss = 0.0211f0, acc = 99.3683)   Test: (loss = 0.0328f0, acc = 98.91)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 8   Train: (loss = 0.0205f0, acc = 99.355)   Test: (loss = 0.0307f0, acc = 98.98)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 9   Train: (loss = 0.0173f0, acc = 99.4767)   Test: (loss = 0.0316f0, acc = 99.01)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:16
Epoch: 10   Train: (loss = 0.0194f0, acc = 99.3567)   Test: (loss = 0.035f0, acc = 98.92)

AdaBelief

Epoch: 0   Train: (loss = 2.3162f0, acc = 12.1333)   Test: (loss = 2.316f0, acc = 12.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:47
Epoch: 1   Train: (loss = 0.0743f0, acc = 97.7433)   Test: (loss = 0.0636f0, acc = 98.11)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 2   Train: (loss = 0.0485f0, acc = 98.5567)   Test: (loss = 0.0448f0, acc = 98.56)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 3   Train: (loss = 0.0377f0, acc = 98.8583)   Test: (loss = 0.0399f0, acc = 98.78)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 4   Train: (loss = 0.0306f0, acc = 99.0483)   Test: (loss = 0.0333f0, acc = 98.97)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:17
Epoch: 5   Train: (loss = 0.0322f0, acc = 99.0167)   Test: (loss = 0.0403f0, acc = 98.84)
[ Info: Model saved in "runs/model.bson"
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 6   Train: (loss = 0.0254f0, acc = 99.2183)   Test: (loss = 0.0373f0, acc = 98.73)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:14
Epoch: 7   Train: (loss = 0.0159f0, acc = 99.53)   Test: (loss = 0.0299f0, acc = 99.08)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 8   Train: (loss = 0.0174f0, acc = 99.4417)   Test: (loss = 0.0314f0, acc = 99.03)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:15
Epoch: 9   Train: (loss = 0.0133f0, acc = 99.6033)   Test: (loss = 0.029f0, acc = 99.13)
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:13
Epoch: 10   Train: (loss = 0.015f0, acc = 99.49)   Test: (loss = 0.0333f0, acc = 99.01)

Conclusion

In this post, I built the classic LeNet-5 model, both to practice my machine learning skills and to get familiar with the emerging Flux.jl framework. I also showed three hyper-parameters to tune (batch size, regularizer, and optimizer). In the end, my LeNet-5 model achieves 99% top-1 accuracy on the MNIST dataset.

Training Environment

CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
RAM: 16 GiB
OS: Fedora 33 (Linux Kernel 5.13.12)
Julia version: 1.6.2
Flux.jl version: v0.12.4

References

[^1] http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

[^2] https://github.com/FluxML/model-zoo/blob/33f5c472c321a50fc2105358a00eb7b3ec0ffa5e/vision/conv_mnist/conv_mnist.jl#L21

[^3] https://pabloinsente.github.io/the-convolutional-network

My Medigenvac COVID-19 First Vaccination Report

2021-09-08T00:00:00+00:00

In this post, I will record how I felt after I got vaccinated with the Medigenvac COVID-19 vaccine.

Day 1 (2021/08/24)

Fatigue.

Day 2 (2021/08/25)

Mild fatigue.
Mild soreness in the vaccinated shoulder.

Day 3 (2021/08/26)

Moderate fatigue.

Day 4 (2021/08/27)

Awaken as hell.

Day 5 (2021/08/28)

Awaken as hell.

Day 6 (2021/08/29)

Awaken as hell.

Day 7 (2021/08/30)

Awaken as hell in the morning, but after lunch at Subway, I felt fatigued and took a nap in the evening.

Day 8 (2021/08/31)

Awaken as hell in the morning.
Got a headache after finishing lunch; maybe I have been too tired these days.

Day 9 (2021/09/01)

Awaken as hell.

Day 10 (2021/09/02)

Awaken as hell.

Day 11 (2021/09/03)

Awaken as hell.

Day 12 (2021/09/04)

Awaken as hell.

Day 13 (2021/09/05)

Awaken as hell.

Day 14 (2021/09/06)

Awaken as hell.

How to Show Git Branch Graph in Terminal

2021-08-14T00:00:00+00:00

Sometimes I would like to view the history of my Git commits along with the branches. What’s more, I would like to see them in the terminal so that I don’t need to install and run other programs.

As a Git user, I can view the commits by typing git log. However, sometimes I also want to see the branch graph so that I know which branch was merged into which.

As such, you can type this command:

$ git log --all --decorate --oneline --graph

And this Stack Overflow answer [^1] provides a handy mnemonic, A DOG (--all --decorate --oneline --graph), and a meme to memorize it:

This delighted dog with a rainbow background will help you memorize the A DOG command :)

Thank you, Stack Overflow, you saved my day!

Reference

[^1] https://stackoverflow.com/a/35075021

Django+MySQL CI/CD with GitHub Actions

2021-05-31T00:00:00+00:00

Introduction

Nowadays, when we develop a web application, we usually apply CI (Continuous Integration) and CD (Continuous Deployment) to automate testing and deployment.

In the old days, we used Jenkins [^1] or Travis [^2] to run such automation. In the past decade, however, many new tools have popped up, and one of them is GitHub Actions. As a web developer and GitHub lover, I like GitHub Actions not only because it integrates easily with GitHub projects but also because it lets me show off my ability to customize GitHub Actions [^3] for my needs. Therefore, I would like to share my notes on applying GitHub Actions (especially CI) to an example Django project that uses MySQL as its database.

Introduction of GitHub Actions

GitHub Actions is yet another CI/CD platform. However, it has the following features:

Highly integrated with GitHub services. GitHub Actions can be triggered by many kinds of GitHub events. Therefore, if you host your repos on GitHub, you can consider using GitHub Actions for a better development experience.
Community-powered workflows. GitHub Actions lets you create your own workflow and publish it to the marketplace to share your work with others. For example, I once switched a Crystal project from Travis to GitHub Actions. The project needed CD of a built website with a custom domain and a different repo name. The GitHub Pages action [^4] helped me a lot in making the workflow switch painless.
Self-hosted machines are permitted. You can run GitHub Actions not only on GitHub but also on your self-hosted machines, with much more flexible configurations and at a better bargain! The official document [^5] explains well how to achieve this.

As such, many communities, such as the Julia Language [^6], have switched their CI/CD workflows to GitHub Actions. What’s more, these communities have created plenty of custom workflows for their own convenience and for the developers who will use their projects in the future.

Apply GitHub Actions on Django with MySQL

In this section, I list the steps to apply GitHub Actions to your Django project.

You can see the full example project here.

1. Create Django project

First, create your Django project, named example, by typing:

$ django-admin startproject example

Then add MySQL settings into DATABASES variable in example/settings.py:

import os  # needed for os.environ; put this at the top of settings.py

DATABASES = {
    'default': {
        'ENGINE': os.environ.get('DBENGINE', ''),
        'NAME': os.environ.get('DBNAME', ''),
        'USER': os.environ.get('DBUSER', ''),
        'PASSWORD': os.environ.get('DBPASSWORD', ''),
        'HOST': os.environ.get('DBHOST', ''),
        'PORT': os.environ.get('DBPORT', ''),
    }
}

For demo purposes, I read the database settings from environment variables.
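To see how the environment variables end up in the settings, here is a small standalone Python sketch that builds the same dictionary from a mapping. The helper name `database_settings` is hypothetical and not part of Django; it only mirrors the DATABASES block above so that the lookup can be tried without a Django project.

```python
import os


def database_settings(environ=os.environ):
    """Build a Django-style DATABASES dict from the DB* environment variables."""
    # Maps each Django settings key to the environment variable that feeds it.
    keys = {
        "ENGINE": "DBENGINE",
        "NAME": "DBNAME",
        "USER": "DBUSER",
        "PASSWORD": "DBPASSWORD",
        "HOST": "DBHOST",
        "PORT": "DBPORT",
    }
    # Missing variables fall back to an empty string, as in settings.py.
    return {"default": {key: environ.get(var, "") for key, var in keys.items()}}


if __name__ == "__main__":
    fake_env = {"DBENGINE": "django.db.backends.mysql", "DBPORT": "3306"}
    print(database_settings(fake_env))
```

Any variable that is not set becomes an empty string, so a forgotten variable tends to show up as a connection or configuration error when Django first touches the database.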

2. Create Django App: `users`

Create a Django app called users by typing:

$ django-admin startapp users

3. Add Unit Tests in `users`

Because we want to run CI, we should add some unit tests.

Replace the existing code in users/tests.py with the following:

from django.test import TestCase
from django.contrib.auth.models import User

# Create your tests here.

class UserTestCase(TestCase):
    def test_user(self):
        username = 'cudachen'
        password = 'carbotzergling'
        u = User(username=username)
        u.set_password(password)
        u.save()
        self.assertEqual(u.username, username)
        self.assertTrue(u.check_password(password))

4. Add GitHub Actions CI

And here comes the main dish! But before adding the GitHub Actions configuration, here are some common mistakes I have encountered, as a kind reminder for you:

branch: always make sure which branches should trigger GitHub Actions, or you will be confused about why GitHub Actions isn’t working. In this case, I would like to trigger GitHub Actions on a pull request or a push to the main branch.
env: as I use environment variables for database settings, make sure to set them in each step that uses the database, e.g. database migration and running unit tests.
DB port: as indicated in [^7], GitHub Actions may assign a random host port to each service you define. To access the port of a service (e.g. the database) reliably, you have to use the jobs.<job_id>.services.<service_id>.ports syntax.

Then, here are the steps to add the GitHub Actions configuration:

Create a directory called .github/workflows.
In the .github/workflows directory, add django-ci.yml (our GitHub Actions configuration) as below:

name: Django CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:

    runs-on: ubuntu-latest
    strategy:
      max-parallel: 4
      matrix:
        python-version: [3.7]

    services:
      mysql:
        image: mysql:5.7
        env:
          MYSQL_ROOT_PASSWORD: zergling
          MYSQL_DATABASE: mysql
        ports: ['3306:3306']

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install Dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Run Migrations
      run: python manage.py migrate
      env: 
        DBENGINE: django.db.backends.mysql
        DBNAME: mysql
        DBUSER: root
        DBPASSWORD: zergling
        DBHOST: 127.0.0.1
        DBPORT: ${{ job.services.mysql.ports[3306] }}
    - name: Run Tests
      run: |
        python manage.py test
      env: 
        DBENGINE: django.db.backends.mysql
        DBNAME: mysql
        DBUSER: root
        DBPASSWORD: zergling
        DBHOST: 127.0.0.1
        DBPORT: ${{ job.services.mysql.ports[3306] }}

5. Push to GitHub and Enjoy!

After the above steps, push our project to GitHub, and GitHub Actions will start to work!

Trivia: Add GitHub Actions Badge for Showing CI Status

GitHub Actions provides a status badge for showing the CI status with ease. Usually, you can add the badge to README.md like this:

[![](https://github.com/<OWNER>/<REPOSITORY>/actions/workflows/django-ci.yml/badge.svg)](https://github.com/<OWNER>/<REPOSITORY>/actions/workflows/django-ci.yml)

Conclusion

In this post, I introduced how to apply a CI/CD pipeline, what GitHub Actions is, and its characteristics. I then made a Django demo project using MySQL to show the process of running GitHub Actions, and left some notes for avoiding common gotchas.

References

[^1] https://www.jenkins.io/

[^2] https://travis-ci.org/

[^3] https://github.com/features/actions

[^4] https://github.com/marketplace/actions/github-pages-action

[^5] https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners

[^6] https://julialang.org/

[^7] https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#example-using-localhost

A Virtual YouTuber System without Deep Learning

2021-05-16T00:00:00+00:00

In this article, I am going to present a Virtual YouTuber (VTuber) system written in C++ that needs only an ordinary CPU and a webcam. I will also talk about the VTuber phenomenon and some basics of gaze tracking. At last, I will leave some personal thoughts after working on this system.

Introduction of VTuber

A VTuber is a person who streams or uploads videos to YouTube with an anime avatar following his or her movements in real time. [^1] Though VTubers originated in the mid-2010s in Japan, they started to boom at the beginning of 2020 with the help of COVID-19 and the push of some commercial companies such as hololive [^2] and nijisanji. [^3]

Motivation

One day, my friend sent me a project in which he created a VTuber system written in Python. After glancing at his code, I came up with an idea: why not implement this VTuber system in C++ to boost performance as well as show off my C++ ability? Furthermore, I wanted to strengthen my image processing knowledge by implementing this VTuber project.

Also, many VTuber systems use deep learning techniques or specialized hardware for fairly precise motion capture [^4], but not everyone can afford specialized hardware or a GPU. What’s more, they require more setup steps or a special environment, which may not be feasible in some scenes. In comparison, this VTuber system only needs a normal CPU and a webcam, for ease of setup and cost efficiency.

An example of open-source VTuber system using deep learning. Adopted from [^5].

Architecture

The architecture of this system is described as follows:

Server

The server is responsible for detecting the user’s face and its landmarks. After detecting the landmarks, it calculates the movement of the eyes and mouth. Next, the movement data is transmitted to the client via WebSocket.

Client

The client is responsible for displaying the VTuber as well as its movements. For real-time display, the client uses WebSocket to receive the movement data. After the client receives the data sent from the server, it displays the VTuber and its movements in the browser.
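To give a feeling for how the movement data can travel from server to client, here is a rough Python sketch. It assumes the server program runs under websocketd (as in the set up steps of this post), which forwards each line the program writes to standard output as one WebSocket message. The JSON field names and the functions `movement_message` and `send` are hypothetical; the real server is written in C++ and its message format may differ.

```python
import json
import sys


def movement_message(gaze_x, gaze_y, mouth_open):
    """Serialize one frame of movement data as a single-line JSON string."""
    # The field names here are made up for illustration.
    payload = {"gaze": {"x": gaze_x, "y": gaze_y}, "mouth": mouth_open}
    return json.dumps(payload)


def send(message, stream=sys.stdout):
    """Write one message per line and flush so it is forwarded immediately."""
    stream.write(message + "\n")
    stream.flush()


if __name__ == "__main__":
    # One made-up frame: gaze slightly right and down, mouth half open.
    send(movement_message(0.1, -0.2, 0.5))
```

Flushing after every line matters in this kind of setup, because buffered output would reach the client in bursts instead of frame by frame.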

Processing Procedures of Gaze Tracking and Mouth Movement

You may wonder how the server detects the face and calculates the movement under the hood. In this section, I walk through the details, from the webcam input to the face movement data stream.

Capture the video stream from the webcam frame by frame.
Resize the input picture.
It is common knowledge that resizing an image speeds up image processing because there are fewer pixels. In my tests, halving the width and height of the input image gave about a 2x speedup without greatly affecting face detection or gaze tracking.
Grayscale the input picture.
Again, you should grayscale your image if you do not need the color channels for further processing. What’s more, the Dlib face detector runs faster on grayscale images than on RGB ones.
Run face landmark detection via Dlib.
Detect the regions of the eyes.
In this step, we locate the eyes. To speed up processing and get more accurate results, I crop the image to the regions containing only the eyes.
Retrieve the pupil of each eye region.
As indicated in [^6], using a threshold between 5 and 100 and then choosing the second-to-last contour after sorting the contours by area works for most cases. Therefore, I adapted this Python package and rewrote it in C++ to suit my project.
Calculate the face movement (gaze and mouth).
After getting the face landmarks and the pupil locations, it’s time to calculate the face movement. The calculation is adapted from my friend’s VTuber project in [^7].
Stream the data through WebSocket to the client.
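To make a few of the steps above more concrete, here is a minimal sketch of the resize, grayscale, eye-region crop, and contour selection steps. It uses only the C++ standard library so it stays self-contained; the real project relies on OpenCV and Dlib for these operations. All type and function names here (Rgb, GrayImage, Point, halveAndGrayscale, cropToLandmarks, secondLargestContour) are hypothetical and not taken from vface-server-cpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for an image and a landmark (a real build would use OpenCV types).
struct Rgb { std::uint8_t r, g, b; };
struct Point { int x, y; };

struct GrayImage {
    int width = 0;
    int height = 0;
    std::vector<std::uint8_t> pixels;  // row-major
    std::uint8_t at(int x, int y) const { return pixels[y * width + x]; }
};

// Resize + grayscale: average each 2x2 block and convert it to gray,
// so the result has a quarter of the original pixels.
GrayImage halveAndGrayscale(const std::vector<Rgb>& rgb, int width, int height) {
    GrayImage out;
    out.width = width / 2;
    out.height = height / 2;
    out.pixels.resize(static_cast<std::size_t>(out.width) * out.height);
    for (int y = 0; y < out.height; ++y) {
        for (int x = 0; x < out.width; ++x) {
            int sum = 0;
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    const Rgb& p = rgb[(2 * y + dy) * width + (2 * x + dx)];
                    // Integer approximation of the usual luma weights (0.299, 0.587, 0.114).
                    sum += (299 * p.r + 587 * p.g + 114 * p.b) / 1000;
                }
            }
            out.pixels[y * out.width + x] = static_cast<std::uint8_t>(sum / 4);
        }
    }
    return out;
}

// Eye region: crop the bounding box of the eye landmarks so later steps touch fewer pixels.
GrayImage cropToLandmarks(const GrayImage& src, const std::vector<Point>& landmarks) {
    GrayImage out;
    if (landmarks.empty() || src.width <= 0 || src.height <= 0) return out;
    int minX = src.width - 1, maxX = 0, minY = src.height - 1, maxY = 0;
    for (const Point& p : landmarks) {
        const int x = std::max(0, std::min(p.x, src.width - 1));   // keep inside the image
        const int y = std::max(0, std::min(p.y, src.height - 1));
        minX = std::min(minX, x);
        maxX = std::max(maxX, x);
        minY = std::min(minY, y);
        maxY = std::max(maxY, y);
    }
    out.width = maxX - minX + 1;
    out.height = maxY - minY + 1;
    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x)
            out.pixels.push_back(src.at(x, y));
    return out;
}

// Pupil selection: sort contours by area and take the second-to-last one.
// Returns the index into `areas`, or -1 when there are fewer than two contours.
int secondLargestContour(const std::vector<double>& areas) {
    if (areas.size() < 2) return -1;
    std::vector<int> order(areas.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&areas](int a, int b) { return areas[a] < areas[b]; });
    return order[order.size() - 2];
}
```

With OpenCV, the first function would typically be replaced by cv::resize and cv::cvtColor, and the crop by taking a cv::Rect sub-image; the hand-written loops are only here to show what happens to the pixels.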

System Setup

In this part, I will summarize how to set up the vface-server-cpp project.

Server

Download the vface-server-cpp from here: https://github.com/Cuda-Chen/vface-server-cpp
Install the dependencies, namely:
- OpenCV
- websocketd

Compile the project by typing:

$ mkdir build && cd build && cmake .. && make

Execute the program by typing:

$ websocketd --port=5566 ./vface_server_cpp

Run the client (namely, vface-web [^8]).
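Because the server is launched through websocketd, the C++ program itself does not need any WebSocket code: websocketd forwards each line the program writes to standard output as one WebSocket message. Below is a minimal sketch of that idea. The Movement struct, its field names, and the JSON-like layout are assumptions for illustration, not the actual format used by vface-server-cpp.

```cpp
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical movement record; the real project may send different fields.
struct Movement {
    double gazeX;      // horizontal gaze ratio
    double gazeY;      // vertical gaze ratio
    double mouthOpen;  // mouth openness ratio
};

// Serialize one record as a single line of JSON-like text.
std::string toJsonLine(const Movement& m) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(3)
        << "{\"gazeX\":" << m.gazeX
        << ",\"gazeY\":" << m.gazeY
        << ",\"mouthOpen\":" << m.mouthOpen << "}";
    return out.str();
}

// Write the record followed by a newline and flush right away.
// Flushing matters because stdout is a pipe here; without it, the data
// could sit in a buffer instead of reaching the client in real time.
void sendMovement(const Movement& m, std::ostream& os = std::cout) {
    os << toJsonLine(m) << '\n' << std::flush;
}
```

Calling sendMovement once per processed frame would then produce one message per frame on the client side.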

Client

The client repo is maintained by my friend, so please visit [^8] for its setup instructions.

Result

Thanks to common image processing techniques such as resizing, thresholding, and region of interest (ROI) cropping, my VTuber system can detect and calculate the face keypoints in about 10 ms. After the calculation, the data is transmitted to the client, which then draws the animated character you choose.
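If you want to check the per-frame time on your own machine, one simple way is to time many frames and take the average. The sketch below assumes a callable that processes a single frame; averageMillis and processFrame are hypothetical names, not functions from vface-server-cpp.

```cpp
#include <chrono>

// Run `processFrame` the given number of times and return the average
// wall-clock time per call in milliseconds.
template <typename Fn>
double averageMillis(Fn&& processFrame, int frames) {
    if (frames <= 0) return 0.0;
    using Clock = std::chrono::steady_clock;  // monotonic, suited to measuring intervals
    const auto start = Clock::now();
    for (int i = 0; i < frames; ++i) {
        processFrame();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / frames;
}
```

Averaging over a few hundred frames smooths out the noise of individual measurements, which can be large when each frame takes only a few milliseconds.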

The following image uses my face to create this adorable character:

As you can see, the character cannot close her eyes entirely. The reasons are that I sat too far away from the webcam and that the eye-closing movement needs further adjustment for each individual.
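One possible form of such per-individual adjustment, which I have not implemented in this project, is to record each user’s eye-openness ratio with the eyes closed and wide open, then rescale the raw ratio into the range 0 to 1. The sketch below is only an illustration of that idea; EyeCalibration and normalizeEyeOpenness are hypothetical names.

```cpp
#include <algorithm>

// Hypothetical per-user calibration values, measured once before streaming.
struct EyeCalibration {
    double closedRatio;  // raw ratio when the eyes are fully closed
    double openRatio;    // raw ratio when the eyes are wide open
};

// Map a raw eye-openness ratio to [0, 1] for this user.
// 0 means fully closed and 1 means fully open.
double normalizeEyeOpenness(double rawRatio, const EyeCalibration& c) {
    const double range = c.openRatio - c.closedRatio;
    if (range <= 0.0) return 0.0;  // invalid calibration, treat as closed
    const double t = (rawRatio - c.closedRatio) / range;
    return std::max(0.0, std::min(1.0, t));  // clamp values outside the calibrated range
}
```

With this kind of rescaling, a user whose raw ratio never drops to zero can still drive the avatar’s eyes fully shut.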

Summary

In this post, I presented my VTuber system and introduced the background of VTubers. I also listed the processing steps, from capturing your face to detecting and calculating the face keypoints. Finally, I demonstrated the result and left some notes for further improvement.

References

[^1] https://www.urbandictionary.com/define.php?term=VTuber

[^2] https://en.hololive.tv/

[^3] https://www.nijisanji.jp/

[^4] https://gist.github.com/emilianavt/cbf4d6de6f7fb01a42d4cce922795794

[^5] https://github.com/DeepVTuber/DeepVTB#head-pose-estimation

[^6] https://github.com/antoinelame/GazeTracking

[^7] https://github.com/c910335/vface-server/blob/master/calculator.py

[^8] https://github.com/c910335/vface-web
