Go slowly   About  Contact  Archives

Handle Unicode in Golang and Javascript

Currently I’m working on some basic security stuffs between server side written in Golang and client side written in, well, Javascript.

Code point

A character is not the same as a code point.

The Unicode standard uses the term “code point” to refer to the item represented by a single value. A character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.

In Go, a code point is called rune, when using range on a string, it will result in a rune at a time. Read more about strings in Go here.

In Javascript, when using String.prototype.split(), it will result in a UTF-16 code unit. To separate the text into code points, use for..of or Array.from instead. Read more about string split in Javascript here. To get UTF-16 code unit, use String.prototype.charCodeAt(), to get code point, use String.prototype.codePointAt().

Escape Unicode characters

When working with Unicode strings in cryptography, we wouldn’t want to deal with vary size code point in Unicode characters. To make our life easier, escape them first, and then working on ASCII.

In Golang, we have url.QueryEscape() and in Javascript, we got encodeURIComponent().

url.QueryEscape() escapes the string so it can be safely placed inside a URL query.

The encodeURIComponent() function encodes a URI by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.

So in theory, encodeURIComponent() and QueryEscape() should be the same, but in practice, they are not. Both of them try to follow RFC 3986 until they don’t. Depend on what you are working on, they might behave the same on the same kind of string. So you can use QueryEscape() on server side and decodeURIComponent() on client side with a high level of confidence.

Written on May 17, 2022.