Assembly language vs Bytecode vs WebAssembly vs Asm.js

Assembly

In the beginning there was a CPU. The CPU could be loaded with numbers, and these magical numbers would tell it how to process other numbers and produce new ones. The numbers that give the CPU its instructions are machine code. All programming still done today ultimately ends up as machine code for the CPU, because binary machine code is the only thing a computer can execute directly.

But a long series of numbers, drawn from 256 or more distinct values whose meaning can change depending on context (such as which number came before), is too difficult for humans to remember and reason about efficiently. It is too far removed from the way humans think. So we give these numbers names, called mnemonics, which are far easier to remember. We then just need a tool that assembles these mnemonics, along with other names and values, into machine code. Thus assembly language was born.
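
As a rough illustration of what "assembling" means, here is a minimal sketch of an assembler and a CPU for a made-up machine. Everything in it is an assumption for the sake of the example: the mnemonics LOAD, ADD and HALT, their opcode numbers, and the single accumulator do not correspond to any real processor.

```javascript
// Toy assembler for an invented 8-bit machine.
// The mnemonics and opcode numbers are hypothetical and match no real CPU.
const LOAD = 0x01; // put the next byte into the accumulator
const ADD = 0x02;  // add the next byte to the accumulator
const HALT = 0xff; // stop running

const OPCODES = new Map([["LOAD", LOAD], ["ADD", ADD], ["HALT", HALT]]);

// Turn lines such as "ADD 3" into a flat array of numbers (the machine code).
function assemble(source) {
  const bytes = [];
  for (const line of source.split("\n")) {
    const text = line.trim();
    if (text === "") continue; // skip blank lines
    const [mnemonic, operand] = text.split(/\s+/);
    if (!OPCODES.has(mnemonic)) {
      throw new Error("unknown mnemonic: " + mnemonic);
    }
    bytes.push(OPCODES.get(mnemonic));
    if (operand !== undefined) bytes.push(Number(operand) & 0xff);
  }
  return bytes;
}

// The "CPU": it is fed numbers and produces a new number.
function run(bytes) {
  let acc = 0; // accumulator
  let pc = 0;  // index of the next number to read
  while (pc < bytes.length) {
    const op = bytes[pc++];
    if (op === LOAD) acc = bytes[pc++];
    else if (op === ADD) acc = (acc + bytes[pc++]) & 0xff;
    else if (op === HALT) break;
    else throw new Error("bad opcode: " + op);
  }
  return acc;
}

const machineCode = assemble("LOAD 2\nADD 3\nHALT");
console.log(machineCode.join(" ")); // 1 2 2 3 255
console.log(run(machineCode));      // 5
```

The program a human writes is the three-line text; the five numbers are all the toy CPU ever sees.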

Each processor family has its own machine code format, so each type of CPU has its own assembly language to go with it. There is no global standard for machine code, because that would prevent CPU manufacturers from adding new instructions that allow CPUs to do new things. But what if we could have a virtual CPU with its own instruction set (machine code language), which could be turned into the machine code of any CPU family and so finally run on a real CPU? And so the Java Virtual Machine (JVM) was born.

Bytecode

The JVM is possible in part because it is a stack-based architecture: its instructions do not name registers. Real CPUs generally have a number of storage slots called registers, where they hold a few numbers for immediate use, but each CPU architecture has a different number of them. JVM bytecode assumes no particular set of registers and instead keeps its working values on a stack. Thus the JVM can be implemented on any architecture that can provide a stack, which is all of them. Using the stack instead of registers also means individual instructions can be shorter, because there is no need for a separate instruction for each combination of registers as arguments, or for extra bytes naming the registers each instruction uses. The price is extra code to push and pop values on the stack. Android's Dalvik VM makes the opposite trade. It is not a JVM implementation but a separate virtual machine that uses registers rather than a stack, with the aims of having less code to execute, not requiring JIT recompilation and optimization to map values onto registers, and executing faster. This means that the universal bytecode of the standard JVM is different from Android VM bytecode, and compiled Java classes are converted to Dalvik's own format before they run on Android.
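
To make the stack versus register trade-off concrete, here is a small sketch that computes d = a * b + c on two toy interpreters. Both instruction sets are invented for this example; they are not real JVM or Dalvik opcodes, and the function names runStack and runRegisters are hypothetical.

```javascript
// Toy stack machine: arithmetic instructions take no operands, because
// their inputs and outputs are implicitly the top of the stack.
function runStack(program, locals) {
  const vars = locals.slice(); // variable slots (copied, input is not modified)
  const stack = [];
  for (const [op, arg] of program) {
    switch (op) {
      case "load": stack.push(vars[arg]); break;  // variable -> stack
      case "store": vars[arg] = stack.pop(); break; // stack -> variable
      case "add": { const b = stack.pop(); const a = stack.pop(); stack.push(a + b); break; }
      case "mul": { const b = stack.pop(); const a = stack.pop(); stack.push(a * b); break; }
      default: throw new Error("unknown instruction: " + op);
    }
  }
  return vars;
}

// Toy register machine: every instruction names a destination register
// and two source registers.
function runRegisters(program, registers) {
  const r = registers.slice();
  for (const [op, dst, x, y] of program) {
    switch (op) {
      case "add": r[dst] = r[x] + r[y]; break;
      case "mul": r[dst] = r[x] * r[y]; break;
      default: throw new Error("unknown instruction: " + op);
    }
  }
  return r;
}

// Slots 0..3 hold a, b, c, d. Compute d = a * b + c with a=2, b=3, c=4.
const stackProgram = [
  ["load", 0], ["load", 1], ["mul"], // push a, push b, multiply
  ["load", 2], ["add"],              // push c, add
  ["store", 3],                      // pop the result into d
];
const registerProgram = [
  ["mul", 3, 0, 1], // d = a * b
  ["add", 3, 3, 2], // d = d + c
];

const stackResult = runStack(stackProgram, [2, 3, 4, 0]);
const registerResult = runRegisters(registerProgram, [2, 3, 4, 0]);
console.log(stackResult[3], registerResult[3]); // 10 10
```

The stack version needs six instructions, but "mul" and "add" carry no operands at all. The register version needs only two instructions, each of which has to spell out three register numbers.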

Hardly anyone programs the JVM by writing bytecode directly (bytecode is what we call JVM machine code). Nor do many people write JVM assembly language and assemble it into JVM bytecode, although you can do this with a tool called Jasmin. The JVM was created for its own programming language, Java. Nowadays, a number of other languages can compile to JVM bytecode, and Java itself can be compiled directly to machine code (via gcj, the GNU Java compiler).

The creators of Java and the JVM also made a JVM that could run in browsers. As the web was starting to take off, there was a demand for the kind of interactivity, features, and dynamic processing that couldn't be achieved with the HTML of those days. In general, Java browser apps, or "applets", were a bad experience for users because of their loading time, their awkward and nonstandard UI elements, and their complete separation from the web page, which left them unable to interact with the DOM. Plugins like Flash offered a far better experience. But the browser needed its own native virtual machine and scripting language. So JavaScript was born.

Asm.js

Again, the language and its interpreter came in one package: JavaScript shipped inside Netscape's browser, the ancestor of what we now know as Mozilla Firefox. JavaScript was meant to control interaction in the browser, within a web page, directly in the DOM. It had to be a sandboxed environment because any website could run any JavaScript it wanted (although at the time it was common for browsers not to support JavaScript, or for users to turn it off). The security of the sandbox was therefore very important, and the language wasn't allowed to do things like access the user's computer hardware. This is starting to change.

Unlike Java, JavaScript isn't compiled ahead of time to bytecode or machine code. It is delivered as source text, and the browser's engine has to parse it and then interpret it or compile it on the fly. Because it is a dynamic language, compiling it to efficient code is difficult compared to a language like C. JavaScript wasn't designed with performance in mind when it was released, and websites then used only small amounts of JavaScript code anyway. But over time the web became more dynamic, AJAX came along, and soon websites became web apps.

JavaScript has slower parts and faster parts. Some things you can say in a programming language map directly to CPU instructions, and these are fast. Other things that look just as simple in the source, on the other hand, lead to a lot of complex instructions under the hood. And some features of a programming language happen automatically yet cost performance, like garbage collection. What if we could use just the fast parts and never trigger the slow parts? Thus Asm.js was born.
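
A small sketch of the difference, using nothing but ordinary JavaScript. The function names dynamicAdd and intAdd are made up for this example. The first function uses "+" the dynamic way, so the engine has to work out at run time what "+" means for the values it was given. The second pins both operands and the result to 32-bit integers with "| 0", which is the style asm.js relies on.

```javascript
// Dynamic "+": depending on the operands this is numeric addition,
// string concatenation, or a call into user code such as valueOf().
function dynamicAdd(a, b) {
  return a + b;
}

// The same operator restricted to 32-bit integers. "x | 0" converts
// x to a signed 32-bit integer, so the addition is always integer addition.
function intAdd(a, b) {
  a = a | 0;
  b = b | 0;
  return (a + b) | 0;
}

console.log(dynamicAdd(1, 2));   // 3
console.log(dynamicAdd("1", 2)); // 12 (the string "12", not a number)
console.log(dynamicAdd({ valueOf() { return 40; } }, 2)); // 42, after running valueOf()

console.log(intAdd(1, 2));          // 3
console.log(intAdd("1", 2));        // 3
console.log(intAdd(2147483647, 1)); // -2147483648 (wraps around like a 32-bit int)
```

An integer addition maps onto a single CPU instruction; the dynamic version has to be prepared for every case shown above.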

Asm.js is not a library or a framework that you include. It is just a special format for writing otherwise normal and legal JavaScript code. Firefox is pushing asm.js and can run asm.js code now. 'Asm' is short for assembly or assembler, but asm.js doesn't look much like any assembly language. It is a subset of the JavaScript language that uses a few tricks of JavaScript syntax to explicitly "coerce" types (integers, floats, etc.), which tells an asm.js-aware engine the type of every value. Such an engine can check the whole module up front and compile it to optimized machine code ahead of time, instead of interpreting it and optimizing it piecemeal while it runs. The resulting code reportedly runs at up to half the speed of native machine code (some reports say up to 70% of native speed), which means that even CPU-intensive software like games can run at acceptable speeds. Asm.js uses another trick to emulate existing C code: pointers and memory addresses become integer offsets into a virtual heap, which is one big array of bytes.
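
Here is a minimal hand-written sketch of what such a module can look like. The "use asm" directive and the (stdlib, foreign, heap) parameters follow the asm.js convention; the names ToyModule, store and sum are made up for this example. Code produced by a real compiler is far larger and less readable. An engine with no special asm.js support simply runs this as ordinary JavaScript and gets the same results.

```javascript
function ToyModule(stdlib, foreign, heap) {
  "use asm";

  // A view of the heap as 32-bit integers. A "pointer" is just a byte
  // offset into the heap, so it is shifted right by 2 to index this view.
  var HEAP32 = new stdlib.Int32Array(heap);

  // Roughly what C's "*(int*)ptr = value;" becomes.
  function store(ptr, value) {
    ptr = ptr | 0;     // parameter type annotation: int
    value = value | 0; // parameter type annotation: int
    HEAP32[ptr >> 2] = value;
  }

  // Sum `count` consecutive 32-bit integers starting at byte offset `ptr`.
  function sum(ptr, count) {
    ptr = ptr | 0;
    count = count | 0;
    var total = 0;
    var i = 0;
    for (i = 0; (i | 0) < (count | 0); i = (i + 1) | 0) {
      total = (total + (HEAP32[(ptr + (i << 2)) >> 2] | 0)) | 0;
    }
    return total | 0; // return type annotation: int
  }

  return { store: store, sum: sum };
}

// The virtual heap: one big array of bytes shared by the whole module.
const heap = new ArrayBuffer(0x10000); // 64 KiB
const toy = ToyModule(globalThis, {}, heap);

toy.store(0, 10); // byte offsets 0, 4 and 8 hold three ints
toy.store(4, 20);
toy.store(8, 12);
console.log(toy.sum(0, 3));            // 42
console.log(new Int32Array(heap)[2]);  // 12, the same bytes seen from outside
```

There are no objects or strings inside the module and nothing for the garbage collector to do, only integers and one array of bytes. That is what lets an engine treat it much like compiled C.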

Large code bases like games or DOS emulators weren't written from scratch in asm.js, although it is possible to write asm.js by hand. Instead, asm.js takes advantage of the tons of existing software written in C/C++: the Emscripten compiler takes C or C++ source and compiles it to asm.js as its target. This already works. Because asm.js is ordinary JavaScript, the output runs wherever JavaScript runs, although for now only Firefox gives it the dedicated asm.js treatment.

WebAssembly

But as we try to run larger and larger asm.js software, we hit more performance problems. One problem is loading time, the same thing that held back Java applets back in the day. So, to get quicker downloads and loading and to overcome the overhead of parsing, browser vendors like Google, Microsoft, and even Mozilla want to go beyond asm.js. This next idea is called WebAssembly.

WebAssembly, or wasm for short, is no longer JavaScript code, unlike asm.js. Wasm is a binary format. As currently proposed, the data it encodes is an abstract syntax tree (AST), which plays the role that bytecode plays for the JVM (strictly speaking, the wasm AST is not bytecode). There will also be a human-readable text format for wasm, but what computers consume is the binary AST. The text format will be the equivalent of assembly language, with the AST being the machine code of the WebAssembly virtual machine. That virtual machine will be part of, or limited by, the JavaScript virtual machine, so that it has access to the same DOM and to any hardware the browser is allowed to manage. It won't be separate like Java applets were. The goal of WebAssembly is to improve on the performance of asm.js so that most software can run in the browser without being limited by slow performance. This would make more software possible and accessible, save time for everyone, and make the web an even better platform for smartphone app development, which is the direction things seem to be headed.

Machine code has been real since the beginning. Bytecode has been tested, rolled out, and matured in the hands of much of the world in the form of Java. Asm.js is new, but you can download and run it today using Firefox. WebAssembly hopes to be the future binary code format, but it's still early days, and playing with it is for now confined to developers.