Speed is the product
Exploring SIMD and auto-vectorization in WebAssembly to accelerate the C++ Eigen demucs.cpp code in freemusicdemixer.com. Post 2 in the freemusicdemixer.com seriesTaking freemusicdemixer.com seriously
Post 2 in a series describing the journey of my website and main project, https://freemusicdemixer.com
Free: an ambiguous term in software
Free has a well-known (but potentially confusing) dual meaning in the software world. This excellent StackExchange answer states it well:
"Free" is ambiguous in English: A free beer is a beer that you don’t have to pay for (you get it for free). A free speech is a speech where you have the right to express your opinion (you are free). To disambiguate the meaning of "free", you can refer to these concepts: free as in beer, if it’s about the price of zero (aka. gratis) free as in speech, if it’s about the freedom (aka. libre)
Free can mean that it cost you $0, or that the idea (and source code) is availably freely. Freemusicdemixer.com is both.
Speed is the product
The most important work I do on the site is speed optimization.
Speed matters
I've been fascinated by the task of optimizing C/C++/Rust audio and music code for years. Freemusicdemixer.com started off by implementing and offering the Open-Unmix model through my umx.cpp project with a leaner runtime memory usage to consume under the 4 GB limit of WebAssembly. This was a fast algorithm to begin with, but the quality suffers compared to better models like Demucs. The first time I implemented Demucs in freemusicdemixer.com (on a local test), it took around 2+ hours to demix a 4-minute song, which is totally unacceptable for end users.
In a previous time, I would have left it at that, leaving freemusicdemixer as a "good enough" proof-of-concept for mild curiosity, but not for any serious application.
Fortunately, this time around I was committed to making freemusicdemixer.com awesome, and I had a lot of naive code that I was able to fix or use more idiomatic Eigen APIs to bring the running time of Demucs down to 17 minutes for a 4-minute song. 17 minutes was the first release I went with when unveiling Demucs on the site. After that, I implemented Javascript workers to parallelize the operation of Demucs by adding a memory usage selector. With 32 GB of memory, I can demix a 4-minute track in 5 minutes these days.
I suspect that users (with access to at least 16 GB of RAM) are happy to have 5-7 minute demixing times on a free website - much, much happier than having to wait for hours. In other words, the speed of freemusicdemixer matters so much to the website. The faster it gets, the more it outstrips the heavyweight solutions and competitors based on backend GPU task runners.
Beating 5 minutes
5 minutes (@ 32 GB memory/8 workers) is the starting point for the rest of the optimizations and measurements in this blog post.
CPU architectures, SIMD, vectorization, and auto-vectorization
The Central Processing Unit (CPU) is defined by ARM:
The Central Processing Unit (CPU) is the primary component of a computer that acts as its “control center.” The CPU, also referred to as the “central” or “main” processor, is a complex set of electronic circuitry that runs the machine’s operating system and apps. The CPU interprets, processes and executes instructions, most often from the hardware and software programs running on the device.
CPUs have a ton of special instructions called "intrinsics":
If one takes advantage of these instructions themselves (either by writing intrinsics in C, or even lower-level assembly like OpenBLAS), it's referred to as vectorizing one's code.
In the case of auto-vectorization, the programmer writes code as they normally do (i.e. no special CPU instructions from the above), and rely on the compiler to auto-vectorize the code. By supplying CPU-specific architecture flags to gcc/g++
like -msse4.2
or -mavx
, in conjunction with -O3
to enable the auto-vectorizer, the generated binary should use efficient vectorized instructions.
A scalar instruction is an instruction that works on one value: a scalar. Add 2 numbers to generate 1 sum, multiply 2 numbers to generate 1 product. A vector instruction is an instruction that works on multiple scalars, or a vector of scalars. Add 4 pairs of numbers to generate 4 sums, multiply 4 pairs of numbers to generate 4 products. The CPU executes the same instruction whether you populate it with 1 pair or 4 pairs. This is why vectorization is crucial! You want to saturate your CPU to help it process your workload efficiently.
Another term for vectorized instructions is SIMD, or Single Instruction Multiple Data. Same idea: if you're summing pairs of numbers in a single CPU cycle, why not opportunistically shove 4 pairs of numbers in the same operation? Same power consumption, same CPU clock cycle, 4x throughput in (the specific vectorized part of) your application.
Typically the register width is used to define how many scalars can be processed at once by the vector. 128-bit-wide instructions support 4 x 32-bit floats in each vector argument (Intel's SSE 4.2 and older); 256-bit-wide instructions support 8x 32-bit floats (Intel's AVX), and 512-bit-wide instructions support 16x 32-bit floats (AVX-512, the newest generation of SIMD).
SIMD in WebAssembly
WebAssembly is a sort of virtual instruction set that runs on browsers (while your browser runs directly on your operating system). The WASM spec, like the other CPU manufacturers, provides intrinsics. These presumably (and verified in the results I will show later) speed up the way your compiled WASM code is run by the browser by leveraging WASM intrinsics/vectorized operations. The browser in turn is leveraging CPU-specific intrinsics/vectorized operations.
SIMD is enabled in WASM via the emscripten/emmc C/C++ compiler and SDK with the -msimd128
flag. WASM's SIMD spec contains 128-bit instructions.
Emulating SSE4.2 and other instruction sets
The WASM SIMD spec also supports emulating intrinsics for other architectures. This way, code that has been written or designed for the AMD64 SSE4.2 set of instructions, supported by both Intel and AMD CPUs, can be translated or emulated to WASM SIMD instructions. That means that a project like Eigen, whose code was optimized with specific CPU vectorized instructions, but not WebAssembly's vectorized instructions, can have it's SSE4.2 optimizations create benefits in WASM targets as well.
Given that WASM SIMD instructions are limited to 128 bits in their width, it means that 256- or 512-bit-wide instruction sets like AVX cannot be emulated. That's why for this article (and the freemusicdemixer project), I stick to SSE4.2 (which is the AMD64 spec for 128-bit-wide SIMD/vectorized instructions).
Testing the impact of SIMD on freemusicdemixer
For the site, I tested 3 variants of the compiled demucs.wasm module (all with -O3
, the max optimization flag that enables auto-vectorization):
- No SIMD or CPU-related flags: no
-msimd128
, etc. - Using WASM SIMD with
-msimd128
- Using WASM SIMD and the SSE 4.2 emulation/translation feature with
-msimd128 -msse4.2
Only auto-vectorization, no other code changes
I did not use any manual vectorized instructions (i.e. handwritten intrinsics), nor did I modify any code to make it easier for the compiler to auto-vectorize. That's a future project. For now I used the same code as I first wrote it.
Verifying that the binary contains SIMD instructions
My method of checking whether the 128-bit WASM SIMD auto-vectorization was effective was to use wasm-objump
on the compiled module and counting the appearance of the v128
keyword, which is the prefix used to represent WASM's 128-bit vectors.
Example:
$ wasm-objdump -d \
./build-wasm/demucs.wasm \
> ./demucs_wasm_objdump_msimd128_msse4_2.txt
$ rg -ciuu 'v128' \
demucs_wasm_objump_msimd128_msse4_2.txt
3389
Results
For all 3 cases, I ran a demix job for a 4-minute song with 32 GB memory (8 workers). This table summarizes the results:
Variant | Demix time | # of v128 |
---|---|---|
No SIMD | 6 min | 0 |
msimd128 | 5 min | 2145 |
msse4.2 | 4 min | 3389 |
I bolded the starting point of the current release of the site (5 min demix time).
Conclusion: it got 25% faster with SSE4.2
The great thing is confirming that using -msse4.2
allows the emcc compiler to generate more WASM SIMD instructions, and reduce the running time of the module significantly for a 25% speed boost!
More to come
This isn't the last time I improve the speed of freemusicdemixer.com, but it's a nice stopping point for now to make a new release and prepare for the future where I get it running even faster and slimmer.