How does the UTF-8 encoding algorithm work on 8-bit chunks (in JavaScript)?


I am looking at this:

function encodeCodePoint(codePoint) {
  if ((codePoint & 0xFFFFFF80) == 0) { // 1-byte sequence
    return stringFromCharCode(codePoint);
  }
  var symbol = '';
  if ((codePoint & 0xFFFFF800) == 0) { // 2-byte sequence
    symbol = stringFromCharCode(((codePoint >> 6) & 0x1F) | 0xC0);
  }
  else if ((codePoint & 0xFFFF0000) == 0) { // 3-byte sequence
    checkScalarValue(codePoint);
    symbol = stringFromCharCode(((codePoint >> 12) & 0x0F) | 0xE0);
    symbol += createByte(codePoint, 6);
  }
  else if ((codePoint & 0xFFE00000) == 0) { // 4-byte sequence
    symbol = stringFromCharCode(((codePoint >> 18) & 0x07) | 0xF0);
    symbol += createByte(codePoint, 12);
    symbol += createByte(codePoint, 6);
  }
  symbol += stringFromCharCode((codePoint & 0x3F) | 0x80);
  return symbol;
}

which, in JavaScript, seems to take advantage of the fact that bitwise operations in JavaScript treat numbers as (I think) 32-bit integers. So it does some bit manipulation I am unfamiliar with and gets the encoded value. Same with the decode function:

function decodeSymbol() {
  var byte1;
  var byte2;
  var byte3;
  var byte4;
  var codePoint;

  if (byteIndex > byteCount) {
    throw Error('Invalid byte index');
  }

  if (byteIndex == byteCount) {
    return false;
  }

  // Read first byte
  byte1 = byteArray[byteIndex] & 0xFF;
  byteIndex++;

  // 1-byte sequence (no continuation bytes)
  if ((byte1 & 0x80) == 0) {
    return byte1;
  }

  // 2-byte sequence
  if ((byte1 & 0xE0) == 0xC0) {
    byte2 = readContinuationByte();
    codePoint = ((byte1 & 0x1F) << 6) | byte2;
    if (codePoint >= 0x80) {
      return codePoint;
    } else {
      throw Error('Invalid continuation byte');
    }
  }

  // 3-byte sequence (may include unpaired surrogates)
  if ((byte1 & 0xF0) == 0xE0) {
    byte2 = readContinuationByte();
    byte3 = readContinuationByte();
    codePoint = ((byte1 & 0x0F) << 12) | (byte2 << 6) | byte3;
    if (codePoint >= 0x0800) {
      checkScalarValue(codePoint);
      return codePoint;
    } else {
      throw Error('Invalid continuation byte');
    }
  }

  // 4-byte sequence
  if ((byte1 & 0xF8) == 0xF0) {
    byte2 = readContinuationByte();
    byte3 = readContinuationByte();
    byte4 = readContinuationByte();
    codePoint = ((byte1 & 0x07) << 0x12) | (byte2 << 0x0C) |
      (byte3 << 0x06) | byte4;
    if (codePoint >= 0x010000 && codePoint <= 0x10FFFF) {
      return codePoint;
    }
  }

  throw Error('Invalid UTF-8 detected');
}

Basically, I can't quite read this code and can't really tell what's going on. I'm wondering if someone with better bit-manipulation chops or UTF-8 encoding knowledge could describe at a high level what the input and output are for encoding and decoding, and very roughly how each goes from input to output. I am trying to build a UTF-8 encoder/decoder and don't see exactly how an 8-bit stream is chunked into 1-to-4-byte chunks, partly because the JavaScript 32-bit integer thing is getting in the way, I think. But to me it seems like this is what happens:

Decoding:

  • We have an 8-bit (1-byte) stream of data.
  • We get a byte.
  • We check if that byte is within a certain range of some sort (which I don't know).
  • If it's in some range, then we know an extra byte follows, or something like that.
  • We then collect all the bytes for the character...
  • And in the case of JavaScript, convert it to an integer and then String.fromCharCode(integer) sort of thing.

What I'm missing is how exactly it goes from a 1-byte sequence up to 4 bytes. How does it do that part?

Encoding:

  • This is language/architecture dependent, since some architectures will have integers be 16, 32, or 64 bits (...I'm guessing...).
  • In the case of JavaScript, take the 32-ish-bit integer and do some bit-manipulation magic to extract the 1 to 4 bytes for this character. How does it know how many bytes to produce?
  • Repeat until you have an array of bytes.

Wondering if one could fill in the gaps in my understanding. I'm not looking for exactly each bit-manipulation step, as there are a lot. I am just looking for the questions which I highlighted in my analysis just above.

javascript
encoding
utf-8
bit-manipulation
asked on Stack Overflow Dec 22, 2019 by Lance Pollard

1 Answer


JS bitwise operators work on 32-bit integers, so you can safely work with 4 x 8 bits (4 bytes) in one single number. That's what your encoder receives as its codePoint parameter.
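For example (just to illustrate the 32-bit behaviour of the bitwise operators; this snippet is mine, not part of the library):

  // Bitwise operators coerce their operands to 32-bit integers,
  // so even the largest code point (21 bits) fits in one number.
  var codePoint = 0x10FFFF;
  ((codePoint >> 18) & 0x07).toString(2); // "100", the top 3 payload bits
  (codePoint & 0xFFFFFF80) !== 0;         // true, more than 7 bits are in use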


UTF-8 is a variable-length encoding. If the code point only needs 7 bits (= ASCII), it fits into a single byte, which has a leading 0 bit to indicate that the sequence is only one byte long:

  0XXXXXXX

Now to check whether the code point fits into one byte, one can check whether any bit is set above the lowest 7 bits. That can be done by ANDing the code point with 0xFFFFFF80, which has all bits set except the lowest 7. Thus, if the bitwise AND results in something unequal to 0, there is a bit set somewhere in the upper bits and more than one byte is needed:

  1111111111111111111111111 0000000 &   (0xFFFFFF80)
  0000000000000000000000000 XXXXXXX     (a code point that fits in 7 bits)
  = 0
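In code, that check is just a mask-and-compare (a minimal sketch mirroring the encoder's first if; isOneByte is my own name, not part of the library):

  // True if the code point fits in 7 bits, i.e. a single UTF-8 byte.
  function isOneByte(codePoint) {
    // 0xFFFFFF80 has every bit set except the lowest 7
    return (codePoint & 0xFFFFFF80) === 0;
  }

  isOneByte(0x41);   // true:  'A' (U+0041) is ASCII
  isOneByte(0x20AC); // false: '€' (U+20AC) needs more than 7 bits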

Now if the code point needs more than 7 bits, the leading bits of the first byte encode how many bytes the sequence has, and every following continuation byte starts with the bit sequence 10. For 4 bytes that would be:

  11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
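That layout is also how the decoder knows how many bytes to read: the high bits of the first byte alone tell it how many continuation bytes follow. A rough sketch of that dispatch (not the library's exact code, just the same checks decodeSymbol performs):

  function sequenceLength(byte1) {
    if ((byte1 & 0x80) === 0x00) return 1; // 0XXXXXXX: no continuation bytes
    if ((byte1 & 0xE0) === 0xC0) return 2; // 110XXXXX: 1 continuation byte
    if ((byte1 & 0xF0) === 0xE0) return 3; // 1110XXXX: 2 continuation bytes
    if ((byte1 & 0xF8) === 0xF0) return 4; // 11110XXX: 3 continuation bytes
    throw Error('Invalid UTF-8 lead byte');
  }

That is exactly the chain of if statements in decodeSymbol, which then reads that many continuation bytes and shifts their 6-bit payloads back together.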

Now to get the 3 payload bits that go into the lead byte of a 4-byte sequence, one right-shifts the code point by 18, which drops the 18 bits (6 + 6 + 6) that belong to the three continuation bytes; masking the result with 0x07 and ORing it with 0xF0 then produces the 11110XXX lead byte. The continuation bytes are built the same way with shifts of 12, 6 and 0, each masked with 0x3F and ORed with 0x80 (that is what createByte does in the code above).
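Putting it all together for a 4-byte code point (a standalone sketch under my own naming; encodeFourBytes is hypothetical and returns an array of byte values rather than a string like the library does):

  function encodeFourBytes(codePoint) {
    return [
      ((codePoint >> 18) & 0x07) | 0xF0, // 11110XXX: top 3 payload bits
      ((codePoint >> 12) & 0x3F) | 0x80, // 10XXXXXX: next 6 bits
      ((codePoint >> 6) & 0x3F) | 0x80,  // 10XXXXXX: next 6 bits
      (codePoint & 0x3F) | 0x80          // 10XXXXXX: lowest 6 bits
    ];
  }

  encodeFourBytes(0x1F600); // [0xF0, 0x9F, 0x98, 0x80], the UTF-8 bytes for U+1F600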
answered on Stack Overflow Dec 22, 2019 by Jonas Wilms • edited Dec 22, 2019 by Jonas Wilms

User contributions licensed under CC BY-SA 3.0