How to tell where letters begin and end in hex
Thread poster: Samuel Murray
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 01:46
Member (2006)
English to Afrikaans
Oct 19, 2021

Hello everyone

The hex for " ÿ " is "20 C3 BF 20". The spaces are "20" and the "ÿ" is "C3 BF". How can I tell that "20" is the first letter, and not "20 C3"? And how can I tell that "C3 BF" is the second letter, and not just "C3"?

Thanks
Samuel


 
Mikhail Zavidin
Mikhail Zavidin
Local time: 02:46
English to Russian
It is all about UTF-8 and BOM (Byte order mark) Oct 19, 2021

As I can understand the string in question is UTF-8 encoded. So the order depends on BOM of the file if any. If there is none, then read the string from left to right, since UTF-8 is backward compatible with ASCII. UTF-8 uses one to four one-byte (8-bit) code units per character. The first 128 code points are encoded exactly like the 128 ASCII characters.
In general, I suggest you get acquainted with UTF-8 encoding, for example on Wikipedia.

Byte order mark
If the UTF-16 Unicode byte order mark (BOM, U+FEFF) character is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.
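
As a quick check of those three bytes, here is a minimal Python sketch (assuming Python 3.8 or later for `bytes.hex(" ")`; the variable names are only illustrative). It encodes U+FEFF as UTF-8 and shows that the standard `utf-8-sig` codec drops a leading BOM when decoding, while plain `utf-8` keeps it as a character.

```python
import codecs

# U+FEFF encoded as UTF-8 gives the three-byte signature quoted above.
bom = "\ufeff".encode("utf-8")
print(bom.hex(" ").upper())      # EF BB BF
print(bom == codecs.BOM_UTF8)    # True

# Hypothetical file content: the signature followed by " ÿ ".
data = codecs.BOM_UTF8 + " ÿ ".encode("utf-8")

# "utf-8-sig" strips a leading BOM if present; plain "utf-8" keeps it as U+FEFF.
with_sig = data.decode("utf-8-sig")
plain = data.decode("utf-8")
print(repr(with_sig))            # ' ÿ '
print(repr(plain))               # '\ufeff ÿ '
```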


https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
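
To make the "one to four bytes" point concrete, here is a small Python sketch (Python 3.8 or later assumed; `utf8_hex` is a helper name made up for this example). It prints the UTF-8 bytes of four sample characters, one for each possible length.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of one character as spaced uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()

for ch in ["A", "ÿ", "€", "😀"]:
    print(f"U+{ord(ch):04X}: {utf8_hex(ch)}")
# U+0041: 41
# U+00FF: C3 BF
# U+20AC: E2 82 AC
# U+1F600: F0 9F 98 80

# ASCII text is byte-for-byte identical in ASCII and in UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True
```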


https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

Hope this helps


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 01:46
Member (2006)
English to Afrikaans
TOPIC STARTER
@Mikhail Oct 19, 2021

Mikhail Zavidin wrote:
As I can understand the string in question is UTF-8 encoded.


Correct. If it was UTF-16LE, it would be "2000 FF00 2000" instead of "20 C3BF 20".
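
For anyone who wants to reproduce the two byte sequences, a minimal Python sketch (Python 3.8 or later assumed; `hex_dump` is a hypothetical helper, not a standard function):

```python
def hex_dump(text, encoding):
    """Encode text and return its bytes as spaced uppercase hex."""
    return text.encode(encoding).hex(" ").upper()

text = " ÿ "
print(hex_dump(text, "utf-8"))      # 20 C3 BF 20
print(hex_dump(text, "utf-16-le"))  # 20 00 FF 00 20 00
```

In UTF-16LE every character here takes exactly two bytes, so the boundaries are fixed; in UTF-8 the length varies per character, which is what the question is about.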

So the order depends on BOM of the file if any.


UTF-8 doesn't have a byte order (or, depending on how you explain it, has only one). The "byte order mark" that is sometimes added to UTF-8 files is called that for historical reasons, not because it indicates a byte order. Anyway, the byte order (even if there was one) isn't really relevant to the question.

I'm trying to figure out how I can tell, just by looking at "20 C3 BF 20", that "20 C3" and "BF 20" are not characters, but that "20" is a character, "C3 BF" is a character, and "20" is a character. The problem is that sometimes a character is encoded as two hex digits and sometimes as four, and I want to know how I can tell which is which.
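
One way to test a proposed grouping is to hand it to a UTF-8 decoder and see whether it is accepted. A rough Python sketch (the helper name `is_valid_utf8` is invented for this example):

```python
def is_valid_utf8(hex_string):
    """Return True if the hex bytes decode as complete, valid UTF-8."""
    try:
        bytes.fromhex(hex_string).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for grouping in ["20", "20 C3", "C3", "C3 BF", "BF 20"]:
    print(f"{grouping!r}: {is_valid_utf8(grouping)}")
# '20': True
# '20 C3': False
# 'C3': False
# 'C3 BF': True
# 'BF 20': False
```

This only shows which groupings a decoder accepts; it does not yet explain the rule the decoder follows.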


 
Mikhail Zavidin
Mikhail Zavidin
Local time: 02:46
English to Russian
Why not review the conversion table Oct 20, 2021

Code point to UTF-8 conversion
First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
U+0000             U+007F            0xxxxxxx
U+0080             U+07FF            110xxxxx   10xxxxxx
U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
U+10000            U+10FFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

https://en.wikipedia.org/wiki/UTF-8#Encoding

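As I can understand, if the high bits of a byte are 110 (110xxxxx), that byte starts a 2-byte character.
If the high bits are 1110 (1110xxxx), it starts a 3-byte character.
And so on, according to the table above. A byte whose high bits are 10 (10xxxxxx) never starts a character; it is a continuation byte that belongs to the character begun by an earlier byte.

In your example, the first byte, 20 (0010 0000), has 0 in its highest bit, so it is a single-byte character from the 128 characters of the ASCII table. The second byte is C3, meaning 1100 0011, so it starts a two-byte character. The third byte, BF (1011 1111), begins with 10, so it is the continuation byte that completes "C3 BF". The last byte, 20, again begins with 0 and is a character on its own.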
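
The rule from the table can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `split_utf8` is a made-up name, and the function checks lead and continuation bit patterns but not the finer validity rules (overlong forms, surrogates, values above U+10FFFF) that a real decoder also enforces.

```python
def split_utf8(data):
    """Group UTF-8 bytes into one bytes object per character,
    using only the high bits of each lead byte."""
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if (lead & 0b1000_0000) == 0b0000_0000:      # 0xxxxxxx
            length = 1
        elif (lead & 0b1110_0000) == 0b1100_0000:    # 110xxxxx
            length = 2
        elif (lead & 0b1111_0000) == 0b1110_0000:    # 1110xxxx
            length = 3
        elif (lead & 0b1111_1000) == 0b1111_0000:    # 11110xxx
            length = 4
        else:
            # 10xxxxxx (a continuation byte) or 11111xxx cannot start a character.
            raise ValueError(f"byte {lead:02X} at offset {i} cannot start a character")
        chunk = data[i:i + length]
        # Every byte after the lead byte must look like 10xxxxxx.
        if len(chunk) != length or any((b & 0b1100_0000) != 0b1000_0000 for b in chunk[1:]):
            raise ValueError(f"incomplete or malformed sequence at offset {i}")
        chunks.append(chunk)
        i += length
    return chunks

groups = split_utf8(bytes.fromhex("20 C3 BF 20"))
print([g.hex(" ").upper() for g in groups])   # ['20', 'C3 BF', '20']
```

Because the lead byte alone announces the length, the bytes can be read left to right without any lookahead or guessing.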

Hope this helps

[Edited at 2021-10-20 10:14 GMT]

[Edited at 2021-10-20 10:17 GMT]

[Edited at 2021-10-20 10:18 GMT]


 

