Summary Of: UTF-8

UTF-8 encodes each character... UTF-8 was first officially presented at the... The UTF-8 encoding is variable... UTF-8 was restricted by... bytes in a UTF-8 sequence have the following meanings... A UTF-8 decoder should be prepared for... Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL... Invalid UTF-8 has been used to bypass security validations in high profile products including Microsoft... Many UTF-8 decoders throw an exception if a string has an error in it... more than one UTF-8 string converts to the same Unicode result... Therefore the original UTF-8 should be stored... UTF-8 may only legally be used to encode valid Unicode scalar values... and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above... data and did not alter their UTF-8 conversion when UCS... Modified UTF-8 strings will never contain any null bytes... All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU... supports standard UTF-8 when reading and writing strings through... This is the UTF-8 encoding of the Unicode... and is commonly referred to as a UTF-8 BOM even though it is not relevant to byte order... another encoding with a BOM is translated to UTF-8 without stripping it... The presence of the UTF-8 BOM may cause interoperability problems with existing software that could otherwise handle UTF... even if the UTF-8 file contains only ASCII and would otherwise display correctly... Programming language parsers can often handle UTF-8 in string constants and comments... be used to identify if a file is UTF-8 versus a legacy encoding... Checking if the text is valid UTF-8 is more reliable than using BOM... which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a... it much easier to convert existing systems to UTF-8 than any other Unicode encoding... UTF-8 is the only encoding for XML entities that does not require a BOM or an... 
with UTF-8 as the preferred and most used encoding... UTF-8 strings can be fairly reliably recognized as such by a simple... of a random string of bytes being valid UTF-8 and not pure ASCII is 3... of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based... One UTF-8 advantage is that other byte... reliable way to implement this is to assume UTF-8 and switch to a legacy encoding only if several invalid UTF... UTF-8 representations and convert them to the same Unicode output... The introduction of UTF-8 gave one new active encoding on top of the locally established encoding... and UTF-8 was blamed for that in countries where there had not been any encoding troubles for... UTF-8 can encode any... so a valid UTF-8 stream never matches the UTF... UTF-8 encoded text is larger than the appropriate single... letters in UTF-8 will be double the size... directed at UTF-8 specifically and not Unicode in general... UTF-8 uses the codes 0... UTF-8 can encode any... UTF-8 does not require slower mathematical operations such as multiplication or division... UTF-8 often takes more space than an encoding made for one or a few languages... Byte streams containing invalid UTF-8 cannot be losslessly converted to UTF... ASCII characters take 1 byte in UTF-8 and 2 in UTF... will be smaller in UTF-8 due to the presence of ASCII spaces... For example both the Japanese and the Korean UTF-8 article on Wikipedia take more space if saved as UTF... the overlong UTF-8 sequence C0 80... Java virtual machine UTF-8 strings never have embedded nulls... The JNI uses modified UTF-8 strings to represent various string types... UTF-8 bit by bit... There are several current definitions of UTF-8 in various standards documents... which establishes UTF-8 as a standard Internet protocol element... UTF-8 test pages by... displays UTF-8 in a variety of formats... UTF-8 in modern browsers... details specific problems with UTF-8 in older browsers...
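
Several of the excerpts above touch on detection: checking whether text is valid UTF-8 is described as more reliable than looking for a BOM, and a decoder can assume UTF-8 and switch to a legacy encoding when invalid sequences appear. A minimal Python sketch of that idea follows. The function names `is_valid_utf8` and `decode_text` and the default fallback `cp1252` are assumptions chosen for illustration, and the fallback here triggers on the first invalid sequence, which is a simplification of the "several invalid" heuristic the excerpt alludes to.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8.

    Python's strict decoder rejects overlong forms such as C0 80
    and encoded surrogates, so this doubles as a validity check.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_text(data: bytes, fallback: str = "cp1252") -> tuple[str, str]:
    """Decode bytes, preferring UTF-8 (hypothetical helper).

    Returns (text, encoding_used). The fallback encoding is an
    assumption; bytes it cannot map become U+FFFD.
    """
    # A BOM is optional in UTF-8; "utf-8-sig" strips it if present.
    if is_valid_utf8(data):
        return data.decode("utf-8-sig"), "utf-8"
    # Simplification: fall back on the first invalid sequence.
    return data.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    print(decode_text("héllo".encode("utf-8")))
    print(decode_text(b"caf\xe9"))          # lone E9 is not valid UTF-8
    print(is_valid_utf8(b"\xc0\x80"))       # overlong encoding of NUL
```

Note that pure ASCII input is also valid UTF-8, so it is reported as `utf-8` here; a caller that needs to tell the two apart would have to check for bytes above 0x7F separately.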

Encyclodia Page On: UTF-8

This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "UTF-8".