How does the UTF-8 encoding algorithm work on 8-bit chunks (in JavaScript)?


I am looking at this:

function encodeCodePoint(codePoint) {
  if ((codePoint & 0xFFFFFF80) == 0) { // 1-byte sequence
    return stringFromCharCode(codePoint);
  }
  var symbol = '';
  if ((codePoint & 0xFFFFF800) == 0) { // 2-byte sequence
    symbol = stringFromCharCode(((codePoint >> 6) & 0x1F) | 0xC0);
  }
  else if ((codePoint & 0xFFFF0000) == 0) { // 3-byte sequence
    checkScalarValue(codePoint);
    symbol = stringFromCharCode(((codePoint >> 12) & 0x0F) | 0xE0);
    symbol += createByte(codePoint, 6);
  }
  else if ((codePoint & 0xFFE00000) == 0) { // 4-byte sequence
    symbol = stringFromCharCode(((codePoint >> 18) & 0x07) | 0xF0);
    symbol += createByte(codePoint, 12);
    symbol += createByte(codePoint, 6);
  }
  symbol += stringFromCharCode((codePoint & 0x3F) | 0x80);
  return symbol;
}

which, in JavaScript, seems to take advantage of the fact that bitwise operations in JavaScript treat numbers as (I think) 32-bit integers. So it does some bit manipulation I am unfamiliar with and gets the encoded value. Same with the decode function:

function decodeSymbol() {
  var byte1;
  var byte2;
  var byte3;
  var byte4;
  var codePoint;

  if (byteIndex > byteCount) {
    throw Error('Invalid byte index');
  }

  if (byteIndex == byteCount) {
    return false;
  }

  // Read first byte
  byte1 = byteArray[byteIndex] & 0xFF;
  byteIndex++;

  // 1-byte sequence (no continuation bytes)
  if ((byte1 & 0x80) == 0) {
    return byte1;
  }

  // 2-byte sequence
  if ((byte1 & 0xE0) == 0xC0) {
    byte2 = readContinuationByte();
    codePoint = ((byte1 & 0x1F) << 6) | byte2;
    if (codePoint >= 0x80) {
      return codePoint;
    } else {
      throw Error('Invalid continuation byte');
    }
  }

  // 3-byte sequence (may include unpaired surrogates)
  if ((byte1 & 0xF0) == 0xE0) {
    byte2 = readContinuationByte();
    byte3 = readContinuationByte();
    codePoint = ((byte1 & 0x0F) << 12) | (byte2 << 6) | byte3;
    if (codePoint >= 0x0800) {
      checkScalarValue(codePoint);
      return codePoint;
    } else {
      throw Error('Invalid continuation byte');
    }
  }

  // 4-byte sequence
  if ((byte1 & 0xF8) == 0xF0) {
    byte2 = readContinuationByte();
    byte3 = readContinuationByte();
    byte4 = readContinuationByte();
    codePoint = ((byte1 & 0x07) << 0x12) | (byte2 << 0x0C) |
      (byte3 << 0x06) | byte4;
    if (codePoint >= 0x010000 && codePoint <= 0x10FFFF) {
      return codePoint;
    }
  }

  throw Error('Invalid UTF-8 detected');
}

Basically, I can't quite read this code and can't really tell what's going on. I'm wondering if someone with better bit-manipulation chops or UTF-8 encoding knowledge could describe at a high level what the input and output are for encoding and decoding, and very roughly how each goes from input to output. I am trying to build a UTF-8 encoder/decoder and don't see exactly how an 8-bit stream is chunked into 1-to-4-byte chunks, partly because the JavaScript 32-bit integer thing is getting in the way, I think. But to me it seems like this is what happens:

Decoding:

  • We have an 8-bit (1-byte) stream of data.
  • We get a byte.
  • We check if that byte is within a certain range of some sort (which I don't know).
  • If it's in some range, then we know an extra byte follows, or something like that.
  • We then collect all the bytes for the character...
  • And in the case of JavaScript, convert it to an integer and then String.fromCharCode(integer) sort of thing.

What I'm missing is how exactly it goes from a 1-byte sequence up to 4 bytes. How does it do that part?

Encoding:

  • This is language/architecture dependent, since some architectures will have integers be 16, 32, or 64 bits (...I'm guessing...).
  • In the case of JavaScript, take the 32-ish-bit integer and do some bit-manipulation magic to extract the 1 to 4 bytes for this character. How does it know how many bytes to produce?
  • Repeat until you have an array of bytes.

Wondering if one could fill in the gaps in my understanding. I'm not looking for exactly each bit-manipulation step, as there are a lot. I am just looking for the questions which I highlighted in my analysis just above.

javascript
encoding
utf-8
bit-manipulation
asked on Stack Overflow Dec 22, 2019 by Lance Pollard

1 Answer


JS bitwise operators work on 32-bit integers, so you can safely work with 4 x 8 bits (4 bytes) in one single number. That's what your encoder receives as its codePoint parameter.
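For example (just to illustrate the 32-bit behaviour of the bitwise operators; this snippet is mine, not part of the library):

  // Bitwise operators coerce their operands to 32-bit integers,
  // so even the largest code point (21 bits) fits in one number.
  var codePoint = 0x10FFFF;
  ((codePoint >> 18) & 0x07).toString(2); // "100", the top 3 payload bits
  (codePoint & 0xFFFFFF80) !== 0;         // true, more than 7 bits are in use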


UTF-8 is a variable-length encoding. If the code point only needs 7 bits (= ASCII), it fits into a single byte, which has a leading 0 bit to indicate that the sequence is only one byte long:

  0XXXXXXX

Now to check whether the code point fits into one byte, one can check whether any bit is set above the lowest 7 bits. That can be done by ANDing the code point with 0xFFFFFF80, which has all bits set except the lowest 7. Thus, if the bitwise AND results in something unequal to 0, there is a bit set somewhere in the upper bits and more than one byte is needed:

  1111111111111111111111111 0000000 &   (0xFFFFFF80)
  0000000000000000000000000 XXXXXXX     (a code point that fits in 7 bits)
  = 0
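In code, that check is just a mask-and-compare (a minimal sketch mirroring the encoder's first if; isOneByte is my own name, not part of the library):

  // True if the code point fits in 7 bits, i.e. a single UTF-8 byte.
  function isOneByte(codePoint) {
    // 0xFFFFFF80 has every bit set except the lowest 7
    return (codePoint & 0xFFFFFF80) === 0;
  }

  isOneByte(0x41);   // true:  'A' (U+0041) is ASCII
  isOneByte(0x20AC); // false: '€' (U+20AC) needs more than 7 bits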

Now if the code point needs more than 7 bits, the leading bits of the first byte encode how many bytes the sequence has, and every following continuation byte starts with the bit sequence 10. For 4 bytes that would be:

  11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
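That layout is also how the decoder knows how many bytes to read: the high bits of the first byte alone tell it how many continuation bytes follow. A rough sketch of that dispatch (not the library's exact code, just the same checks decodeSymbol performs):

  function sequenceLength(byte1) {
    if ((byte1 & 0x80) === 0x00) return 1; // 0XXXXXXX: no continuation bytes
    if ((byte1 & 0xE0) === 0xC0) return 2; // 110XXXXX: 1 continuation byte
    if ((byte1 & 0xF0) === 0xE0) return 3; // 1110XXXX: 2 continuation bytes
    if ((byte1 & 0xF8) === 0xF0) return 4; // 11110XXX: 3 continuation bytes
    throw Error('Invalid UTF-8 lead byte');
  }

That is exactly the chain of if statements in decodeSymbol, which then reads that many continuation bytes and shifts their 6-bit payloads back together.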

Now to get the 3 payload bits that go into the lead byte of a 4-byte sequence, one right-shifts the code point by 18, which drops the 18 bits (6 + 6 + 6) that belong to the three continuation bytes; masking the result with 0x07 and ORing it with 0xF0 then produces the 11110XXX lead byte. The continuation bytes are built the same way with shifts of 12, 6 and 0, each masked with 0x3F and ORed with 0x80 (that is what createByte does in the code above).
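Putting it all together for a 4-byte code point (a standalone sketch under my own naming; encodeFourBytes is hypothetical and returns an array of byte values rather than a string like the library does):

  function encodeFourBytes(codePoint) {
    return [
      ((codePoint >> 18) & 0x07) | 0xF0, // 11110XXX: top 3 payload bits
      ((codePoint >> 12) & 0x3F) | 0x80, // 10XXXXXX: next 6 bits
      ((codePoint >> 6) & 0x3F) | 0x80,  // 10XXXXXX: next 6 bits
      (codePoint & 0x3F) | 0x80          // 10XXXXXX: lowest 6 bits
    ];
  }

  encodeFourBytes(0x1F600); // [0xF0, 0x9F, 0x98, 0x80], the UTF-8 bytes for U+1F600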
answered on Stack Overflow Dec 22, 2019 by Jonas Wilms • edited Dec 22, 2019 by Jonas Wilms

User contributions licensed under CC BY-SA 3.0