 
Further characters continue to be added to the already-encoded scripts, as do symbols, in particular for [[mathematics]] and [[music]] (in the form of notes and rhythmic symbols). The [http://www.unicode.org/roadmaps/ Unicode Roadmap] lists scripts not yet in Unicode, with tentative assignments to code blocks. Invented scripts, most of which do not qualify for inclusion in Unicode because they lack real-world usage, are listed in the [[ConScript Unicode Registry]], along with unofficial but widely used [[Private Use Area]] code assignments. Similarly, many medieval letter variants and ligatures not in Unicode are encoded in the [[Medieval Unicode Font Initiative]].
 
== Mapping and encodings ==
 
{{see also|Mapping of Unicode characters}}
 
=== Standard ===
 
The [[Unicode Consortium]], based in [[California]], develops the Unicode standard. Any company or individual willing to pay the membership dues may join this organization. Members include virtually all of the main computer software and hardware companies with any interest in text-processing standards, such as [[Apple Computer]], [[Microsoft]], [[International Business Machines|IBM]], [[Xerox]], [[Hewlett-Packard|HP]], [[Adobe Systems]] and many others.
 
The Consortium first published ''The Unicode Standard'' (ISBN 0-201-56788-1) in [[1991]], and continues to develop standards based on that original work. Unicode is developed in conjunction with the [[International Organization for Standardization]], and it shares its character repertoire with [[ISO/IEC 10646]]: the Universal Character Set. Unicode and ISO/IEC 10646 function equivalently as character encodings, but ''The Unicode Standard'' contains much more information for implementers, covering in depth topics such as bitwise encoding, [[Unicode collation algorithm|collation]], and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting [[Bi-directional text|bidirectional text]]. The two standards do, however, use slightly different terminology.
 
<!-- Template:U links to this paragraph --><p id="Upluslink">When writing about a Unicode character, it is normal to write "U+" followed by a [[hexadecimal]] number indicating the character's code point. For code points in the [[Basic Multilingual Plane|BMP]], four digits are used; for code points outside the BMP, five or six digits are used, as required. Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits in order to indicate a code unit, not a code point.</p>
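
The convention above can be illustrated with a short helper (a hypothetical sketch, not part of the standard): formatting with a minimum of four hexadecimal digits automatically yields five or six digits for code points beyond the BMP.

```python
def u_plus(code_point: int) -> str:
    """Format a code point in "U+" notation: at least four hex
    digits, with five or six used naturally outside the BMP."""
    return "U+{:04X}".format(code_point)

print(u_plus(0x0041))    # Latin capital A, inside the BMP
print(u_plus(0x1D11E))   # musical symbol G clef, outside the BMP
```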
 
==== Unicode revision history ====
 
{| style="vertical-align: top;"
| {{CT-16}} width="0*" | [[1991#October|October,&nbsp;1991]] || {{CT-15}} width="0*" | Unicode&nbsp;1.0 || width="80%" | ISBN 0-201-56788-1.
|-
| {{CT-16}} | [[1992#June|June,&nbsp;1992]] || {{CT-15}} | Unicode&nbsp;1.0.1 || ISBN 0-201-60845-6.
|-
| {{CT-16}} | [[1993#June|June,&nbsp;1993]] || {{CT-15}} | Unicode&nbsp;1.1 || The previous two publications, plus Unicode Technical Report #4: ''The Unicode Standard, Version 1.1'' by Mark Davis.
|-
| {{CT-16}} | [[1996#July|July,&nbsp;1996]] || {{CT-15}} | Unicode&nbsp;2.0 || ISBN 0-201-48345-9.
|-
| {{CT-16}} | [[1998#May|May,&nbsp;1998]] || {{CT-15}} | Unicode&nbsp;2.1 || &nbsp;
|-
| {{CT-16}} | [[1998#May|May,&nbsp;1998]] || {{CT-15}} | Unicode&nbsp;2.1.2&nbsp; || The previous three publications, plus Unicode Technical Report #8: ''The Unicode Standard, Version 2.1'' by Lisa Moore.
|-
| {{CT-16}} | [[1999#September|September,&nbsp;1999]] || {{CT-15}} | Unicode&nbsp;3.0 || Covered the 16-bit [[Universal character set|UCS]] Basic Multilingual Plane ([[Basic Multilingual Plane|BMP]]) from ISO 10646-1:2000. ISBN 0-201-61633-5.
|-
| {{CT-16}} | [[2001#March|March,&nbsp;2001]] || {{CT-15}} | Unicode&nbsp;3.1 || Added [[Mapping of Unicode characters|Supplementary Planes]] from ISO 10646-2, providing supplementary characters.
|-
| {{CT-16}} | [[2002#March|March,&nbsp;2002]] || {{CT-15}} | Unicode&nbsp;3.2 || &nbsp;
|-
| {{CT-16}} | [[2003#April|April,&nbsp;2003]] || {{CT-15}} | Unicode&nbsp;4.0 || ISBN 0-321-18578-1.
|-
| {{CT-16}} | [[2004#March|March,&nbsp;2004]] || {{CT-15}} | Unicode&nbsp;4.0.1 || &nbsp;
|-
| {{CT-16}} | [[2005#March|March,&nbsp;2005]] || {{CT-15}} | Unicode&nbsp;4.1 || &nbsp;
|-
| {{CT-16}} | [[2006#July|July,&nbsp;2006]] || {{CT-15}} | Unicode&nbsp;5.0 || {{CT-12}}|(The character database, also known as the ''UCD'', was published on [[July 18]]; the book, ''The Unicode Standard, Version 5.0'', is expected in the fourth quarter of 2006. ISBN 0-321-48091-0.)
|}
 
=== Storage, transfer, and processing ===
 
So far, Unicode has appeared simply as a means to assign a unique number to each character used in the written languages of the world. How these numbers are stored and processed is another matter; problems arise from the fact that much [[software]] written in the [[Western world]] deals only with 8-bit or smaller character encodings, with Unicode support added only slowly in recent years. Similarly, in representing the scripts of [[Asia]], the [[ASCII]]-based double-byte character encodings cannot, even in principle, encode more than 32,768 characters, and in practice the architectures chosen impose lower limits. Such limits do not suffice for the needs of scholars of the [[Chinese language]] alone.
 
The internal logic of much 8-bit legacy software typically permits only 8 bits for each character, making it impossible to use more than 256 code points without special processing. Sixteen-bit software can support only some tens of thousands of characters. Unicode, on the other hand, has already defined more than 100,000 encoded characters. Systems designers have therefore suggested several mechanisms for implementing Unicode; which one implementers choose depends on available storage space, [[source code]] compatibility, and interoperability with other systems.
 
Unicode defines two mapping methods:
 
* the '''UTF''' ('''[[UTF-8|Unicode Transformation Format]]''') encodings
* the '''UCS''' ('''[[Universal Character Set]]''') encodings
 
The encodings include:
 
* [[UTF-7]] — a relatively unpopular 7-bit encoding, suited for transmission and storage only; it is often considered obsolete
* [[UTF-8]] — an 8-bit, variable-width encoding, compatible with [[ASCII]].
* [[UCS-2]] — a 16-bit, fixed-width encoding that only supports the [[Mapping of Unicode characters#Basic Multilingual Plane|BMP]]
* [[UTF-16]] — a 16-bit, variable-width encoding that supports the full [[Mapping of Unicode characters|Unicode character mapping]]
* [[UCS-4]] and [[UTF-32]] — functionally identical 32-bit fixed-width encodings
* [[UTF-EBCDIC]] — an encoding intended for [[EBCDIC]] based mainframe systems
 
The numbers in the names of the encodings indicate the number of bits in one code value (for the UTF encodings) or the number of bytes per code value (for the UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings.
 
UTF-8 uses one to four bytes per code point and, being relatively compact (for Latin script) and ASCII-compatible, provides the ''de facto'' standard encoding for interchange of Unicode text. It is also used by most recent Linux distributions as a direct replacement for legacy encodings in general text handling.
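
The variable-width behaviour can be observed directly in Python (an illustrative sketch; the characters chosen are arbitrary examples): an ASCII letter occupies one byte, while characters further up the code space take two, three, or four.

```python
# One character each from the 1-, 2-, 3-, and 4-byte UTF-8 ranges.
for ch in "A\u00E9\u20AC\U0001D11E":   # A, é, €, 𝄞
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```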
 
UTF-16 is similar to UCS-2 but can include one or two 16-bit words in order to cover the supplementary characters (introduced from Unicode 3.1 onwards). UTF-16 is used by many APIs, often for upward compatibility with APIs that were developed when Unicode was UCS-2 based, or for compatibility with other APIs that use UTF-16. UTF-16 is the standard format for the [[Microsoft Windows|Windows]] API (though surrogate support is not enabled by default) and for the [[Java (programming language)|Java]] (J2SE 1.5 or higher) and .NET bytecode environments.
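
The two-word mechanism is the surrogate-pair scheme: 0x10000 is subtracted from a supplementary code point, and the remaining 20 bits are split between a high surrogate (D800–DBFF) and a low surrogate (DC00–DFFF). A sketch in Python (the code point chosen is an arbitrary example):

```python
cp = 0x1D11E                       # musical symbol G clef
v = cp - 0x10000                   # 20 bits remain
high = 0xD800 + (v >> 10)          # top 10 bits -> high surrogate
low  = 0xDC00 + (v & 0x3FF)        # bottom 10 bits -> low surrogate
print(f"U+{cp:04X} -> {high:04X} {low:04X}")

# Cross-check against Python's own big-endian UTF-16 encoder.
assert chr(cp).encode("utf-16-be") == b"\xd8\x34\xdd\x1e"
```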
 
UCS-2 is an obsolete, 16-bit fixed-width encoding covering the [[Basic Multilingual Plane]] only. For characters in the Basic Multilingual Plane (the 16-bit range), UCS-2 and UTF-16 are identical, so they can be considered different implementation levels of the same encoding. The UCS-2 and UTF-16 encodings specify the Unicode [[Byte Order Mark]] (BOM) for use at the beginnings of text files. Some software developers have adopted it for other encodings, including UTF-8, which does not need an indication of byte order; in that case it serves to mark the file as containing Unicode text. The BOM, code point U+FEFF, has the important property of unambiguity, regardless of the Unicode encoding used: the bytes <code>FE</code> and <code>FF</code> never appear in [[UTF-8]]; U+FFFE (the result of byte-swapping U+FEFF) is not a legal character; and U+FEFF conveys the zero-width no-break space (a character with no appearance and no effect other than preventing the formation of [[ligature (typography)|ligatures]]). The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>.
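
The byte sequences involved can be checked against the constants in Python's standard <code>codecs</code> module (illustrative only):

```python
import codecs

# U+FEFF encoded three ways; a BOM is just this character at the start.
bom = "\uFEFF"
print(bom.encode("utf-8").hex(" "))      # ef bb bf
print(bom.encode("utf-16-be").hex(" "))  # fe ff
print(bom.encode("utf-16-le").hex(" "))  # ff fe

# The standard library exposes the same sequences as named constants.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert codecs.BOM_UTF16_BE == b"\xfe\xff"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
```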
 
In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point (although the [[endianness]], which varies across different platforms, affects how the code value actually manifests as an octet (byte) sequence). In the other cases, each code point may be represented by a variable number of code values. UCS-4 and UTF-32 are not commonly used, since no more than 21 of the 32 bits allocated per code point would ever be used, but it is becoming increasingly common for programming language implementations to use UCS-4 for their internal storage of encoded text<!--- which ones? -->.
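
The directness of the 32-bit form, and the effect of endianness, can both be seen by encoding a single supplementary character (an arbitrary example): the code point value is simply spelled out, in one byte order or the other.

```python
ch = "\U0001D11E"                        # code point 0x1D11E
print(ch.encode("utf-32-be").hex(" "))   # 00 01 d1 1e  (code point spelled out)
print(ch.encode("utf-32-le").hex(" "))   # 1e d1 01 00  (same value, bytes reversed)
```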
 
[[Punycode]], another encoding form, enables the encoding of Unicode strings into the limited character set supported by the [[ASCII]]-based [[Domain Name System]]. The encoding is used as part of [[IDNA]], which is a system enabling the use of [[Internationalized Domain Names]] in all languages that are supported by Unicode.
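
Python exposes both the raw Punycode transformation and the full IDNA wrapper as codecs, which makes the relationship between the two easy to see (the label below is the sample word used in RFC 3492):

```python
label = "b\u00FCcher"                 # "bücher"
print(label.encode("punycode"))       # bare Punycode: basic letters, then the encoded ü
print(label.encode("idna"))           # IDNA adds the "xn--" ACE prefix for the DNS
```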
 
[[GB18030]] is another encoding form for Unicode, from the [[Standardization Administration of China]]. It is the official [[character set]] of the [[People's Republic of China]] (PRC).
 
=== Ready-made versus composite characters ===
 
Unicode includes a mechanism for modifying character shape that greatly extends the supported glyph repertoire: the use of [[combining diacritical mark]]s. These are inserted after the base character, and several combining diacritics may be stacked over the same character. However, for reasons of compatibility, Unicode also includes a large quantity of [[precomposed character|pre-composed characters]], so in many cases there are several ways of encoding the same character. To deal with this, Unicode provides the mechanism of [[canonical equivalence]].
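
The effect of canonical equivalence can be demonstrated with Python's <code>unicodedata</code> module: the precomposed character and its decomposed sequence compare unequal as code point sequences, but normalisation maps each form onto the other.

```python
import unicodedata

precomposed = "\u00E9"        # é as a single code point
decomposed  = "e\u0301"       # e followed by combining acute accent

assert precomposed != decomposed                               # different sequences
assert unicodedata.normalize("NFC", decomposed) == precomposed # compose
assert unicodedata.normalize("NFD", precomposed) == decomposed # decompose
```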
 
An example of this arises with [[Hangul]], the Korean alphabet. Unicode provides the mechanism for composing Hangul syllables with their individual subcomponents, known as [[Hangul Jamo]]. However, it also provides all 11,172 combinations of precomposed Hangul syllables.
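
Because the precomposed syllables are laid out algorithmically, composition can be computed arithmetically: a syllable is 0xAC00 plus (lead index × 21 + vowel index) × 28 + trail index. A sketch for the syllable 한 (the jamo indices in the comments are taken as given):

```python
import unicodedata

# Jamo indices for 한: lead ㅎ (18), vowel ㅏ (0), trail ㄴ (4).
L_IDX, V_IDX, T_IDX = 18, 0, 4
syllable = chr(0xAC00 + (L_IDX * 21 + V_IDX) * 28 + T_IDX)
print(syllable)                        # 한 (U+D55C)

# The same composition, via NFC on the conjoining jamo characters.
jamo = "\u1112\u1161\u11AB"            # ㅎ + ㅏ + ㄴ
assert unicodedata.normalize("NFC", jamo) == syllable
```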
 
The [[CJK]] ideographs currently have codes only for their precomposed form. Still, most of those ideographs evidently comprise simpler elements (radicals), so in principle Unicode could decompose them, just as it does with [[Hangul]]. This would greatly reduce the number of required code points while allowing the display of virtually every conceivable ideograph (which might do away with some of the problems caused by [[Han unification]]). A similar idea underlies some [[input method]]s, such as [[Cangjie method|Cangjie]] and [[Wubi method|Wubi]]. However, attempts to do this for character encoding have stumbled over the fact that ideographs do not actually decompose as simply or as regularly as it seems they should.
 
A set of [[Radical (Chinese character)|radicals]] was provided in Unicode 3.0 (CJK radicals between U+2E80 and U+2EFF, KangXi radicals in U+2F00 to U+2FDF, and ideographic description characters from U+2FF0 to U+2FFB), but the Unicode standard (ch. 11.1 of Unicode 4.1) warns against using ideographic description sequences as an alternate representation for previously encoded characters:
 
:This process is different from a formal encoding of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideograph descriptions are more akin to the English phrase, “an ‘e’ with an acute accent on it,” than to the character sequence &lt;U+0065, U+0301&gt;.
 
==== Ligatures ====
 
Many languages, including [[Arabic language|Arabic]] and [[Devanāgarī|Hindi]], have special orthographic rules that require certain combinations of letterforms to be combined into special [[ligature (typography)|ligature forms]]. The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as [[OpenType]] (by Adobe and Microsoft), [[Graphite (SIL)|Graphite]] (by [[SIL International]]), or [[Apple Advanced Typography|AAT]] (by Apple). [[Font language|Instructions]] are also embedded in fonts to tell the [[operating system]] how to properly output different character sequences. In simpler cases, such as the placement of combining marks or diacritics, fixed-width fonts sometimes employ a method known as "[[sidebearing]]", in which the special marks precede the main letterform in the data stream and the font-rendering software knows to combine the marks into a final form.{{citationneeded}} This method works only for some diacritics and may fail to properly handle stacked marks.
 
[[As of 2004]], most software still cannot reliably handle many features not supported by older font formats, so combining characters generally will not render correctly. For example, {{unicode|ḗ}} (precomposed e with macron and acute above) and {{unicode|ḗ}} (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an [[e]] with a [[macron]] and [[acute accent]], but in practice their appearance can vary greatly across software applications. Similarly, [[dot (diacritic)|underdots]], as needed in the [[romanization]] of [[Indo-Aryan languages|Indic languages]], will often be placed incorrectly. As a workaround, Unicode characters that map to precomposed glyphs can be used for many such characters. The need for such alternatives stems from the limitations of fonts and rendering technology, not from weaknesses in Unicode itself.
 
== Unicode in use ==