युनिकोड: Difference between revisions

m robot Adding: mr:युनिकोड
m robot Modifying: ckb:یوونیکۆد; cosmetic changes
Line 7:
Unicode has the explicit aim of transcending the limitations of traditional [[character encoding]]s, such as those defined by the [[ISO 8859]] standard, which are widely used in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem: they allow bilingual computer processing (usually using [[Roman character]]s and the local language), but not multilingual computer processing (computer processing of arbitrary languages mixed with each other).
 
Unicode, in intent, encodes the underlying [[character (computing)|character]]s ([[grapheme]]s and grapheme-like units) rather than the variant [[glyph]]s (renderings) for such characters. In the case of [[Chinese character]]s, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see [[Han unification]]).
 
In text processing, Unicode takes the role of providing a unique ''code point'' — a number, not a glyph — for each character. In other words, Unicode represents a character in an abstract way, and leaves the visual rendering (size, shape, [[font]] or style) to other software, such as a [[web browser]] or [[word processor]]. This simple aim becomes complicated, however, by concessions made by Unicode's designers, in the hope of encouraging a more rapid adoption of Unicode.
Line 106:
| {{CT-16}} | [[1999#September|September, 1999]] || {{CT-15}} | Unicode 3.0 || Covered 16-bit [[Universal character set|UCS]] Basic Multilingual Plane ([[Basic Multilingual Plane|BMP]]) from ISO 10646-1:2000. ISBN 0-201-61633-5.
|-
| {{CT-16}} | [[2001#March|March, 2001]] || {{CT-15}} | Unicode 3.1 || Added [[Mapping of Unicode characters|Supplementary Planes]] from ISO 10646-2, providing supplementary characters
|-
| {{CT-16}} | [[2002#March|March, 2002]] || {{CT-15}} | Unicode 3.2 ||  
Line 143:
UTF-8 uses one to four bytes per code point and, being relatively compact (for Latin script) and ASCII-compatible, provides the ''de facto'' standard encoding for interchange of Unicode text. It is also used by most recent Linux distributions as a direct replacement for legacy encodings in general text handling.
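
The one-to-four-byte range can be checked directly. The following is a minimal sketch in Python; the sample code points and the helper name <code>utf8_bytes</code> are illustrative choices, not part of the standard.

```python
# Sample code points chosen for illustration (not taken from the article):
# U+0041 (Latin A), U+00E9 (e with acute), U+0905 (Devanagari A),
# U+1D11E (musical symbol G clef, a supplementary character).
SAMPLES = ["\u0041", "\u00e9", "\u0905", "\U0001d11e"]


def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single code point."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    return ch.encode("utf-8")


for ch in SAMPLES:
    encoded = utf8_bytes(ch)
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {hex_bytes}")
```

ASCII characters keep their single-byte values, which is the ASCII compatibility mentioned above; only code points outside the Basic Multilingual Plane need four bytes.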
 
UTF-16 is similar to UCS-2 but uses one or two 16-bit code units per code point in order to cover the supplementary characters (introduced from Unicode 3.1 onwards). UTF-16 is used by many APIs, often for upward compatibility with APIs that were developed when Unicode was UCS-2 based, or for compatibility with other APIs that use UTF-16. UTF-16 is the standard format for the [[Microsoft Windows|Windows]] API (though surrogate support is not enabled by default) and for the [[Java (programming language)|Java]] (J2SE 1.5 or higher) and .NET bytecode environments.
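
How a supplementary code point is split into two 16-bit units (a surrogate pair) can be sketched as follows. This is an illustrative Python sketch; the function name <code>to_utf16_units</code> is hypothetical, and the arithmetic follows the standard UTF-16 definition.

```python
from typing import List


def to_utf16_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units (one or two) for a code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one unit, identical to UCS-2.
        return [code_point]
    # Supplementary planes: spread the 20-bit offset over two surrogates.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


# U+1D11E is used here only as a sample supplementary character.
print([f"{u:04X}" for u in to_utf16_units(0x1D11E)])
```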
 
UCS-2 is an obsolete, 16-bit fixed-width encoding covering the [[Basic Multilingual Plane]] only. For characters in the Basic Multilingual Plane (16-bit range), UCS-2 and UTF-16 are identical, so they can be considered different implementation levels of the same encoding. The UCS-2 and UTF-16 encodings specify the Unicode [[Byte Order Mark]] (BOM) for use at the beginnings of text files. Some software developers have adopted it for other encodings, including UTF-8, which does not need an indication of byte order. In that case the BOM serves only to mark the file as containing Unicode text. The BOM, code point U+FEFF, has the important property of unambiguity, regardless of the Unicode encoding used. The bytes <code>FE</code> and <code>FF</code> never appear in [[UTF-8]]; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character; and U+FEFF conveys the zero-width no-break space (a character with no appearance and no effect other than preventing the formation of [[ligature (typography)|ligature]]s). The same character converted to UTF-8 becomes the byte sequence <code>EF BB BF</code>.
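
Because the BOM encodes to a different byte sequence in each encoding, a reader can use it to guess the encoding of a file. The sketch below uses the BOM constants from Python's standard <code>codecs</code> module; the function name <code>sniff_bom</code> is hypothetical, and the UTF-32 signatures are included only because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Longer signatures come first so that the UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# U+FEFF encoded three ways, as described in the paragraph above.
for enc in ("utf-8", "utf-16-be", "utf-16-le"):
    print(enc, "\ufeff".encode(enc).hex())
```

A missing BOM proves nothing about the encoding, so a real reader would still need a fallback.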
 
In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point (although the [[endianness]], which varies across different platforms, affects how the code value actually manifests as an octet (byte) sequence). In the other cases, each code point may be represented by a variable number of code values. UCS-4 and UTF-32 are not commonly used, since no more than 21 of the 32 bits allocated per code point would ever be used, but it is becoming increasingly common for programming language implementations to use UCS-4 for their internal storage of encoded text<!--- which ones? -->.
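
The effect of endianness on a 32-bit code value, and the 21-bit upper bound, can be illustrated with a short Python sketch. The helper name <code>utf32_bytes</code> and the sample code point are illustrative assumptions.

```python
def utf32_bytes(code_point: int, byteorder: str) -> bytes:
    """Serialise one code point as a 4-byte UTF-32 code value.

    byteorder is "big" or "little", as accepted by int.to_bytes.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point.to_bytes(4, byteorder)


cp = 0x1D11E  # sample supplementary code point
print(utf32_bytes(cp, "big").hex())     # same value, big-endian octets
print(utf32_bytes(cp, "little").hex())  # same value, little-endian octets

# The highest code point, U+10FFFF, fits in 21 bits.
print((0x10FFFF).bit_length())
```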
Line 155:
=== Ready-made versus composite characters ===
 
Unicode includes a mechanism for modifying character shape, which greatly extends the supported glyph repertoire: the [[combining diacritical mark]]s. These are inserted after the main character, and several combining diacritics can be stacked over the same character. However, for reasons of compatibility, Unicode also includes a large number of [[precomposed character|pre-composed character]]s. In many cases, therefore, users have several ways of encoding the same character. To deal with this, Unicode provides the mechanism of [[canonical equivalence]].
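
Canonical equivalence can be tested by normalizing both strings to the same form before comparing them. The following sketch uses Python's standard <code>unicodedata</code> module; the function name <code>canonically_equivalent</code> and the choice of é as the sample character are illustrative.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00e9"   # single code point: e with acute
combining = "e\u0301"    # letter e followed by combining acute accent

print(precomposed == combining)                        # raw comparison
print(canonically_equivalent(precomposed, combining))  # after normalization

# NFC goes the other way and composes the sequence into one code point.
print(unicodedata.normalize("NFC", combining) == precomposed)
```

Software that compares or searches text usually normalizes first, because the raw code point sequences differ even though the characters are equivalent.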
 
An example of this arises with [[Hangul]], the Korean alphabet. Unicode provides the mechanism for composing Hangul syllables with their individual subcomponents, known as [[Hangul Jamo]]. However, it also provides all 11,172 combinations of precomposed Hangul syllables.
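
The figure of 11,172 follows from the structure of a Hangul syllable: 19 leading consonants, 21 vowels, and 28 trailing options (27 consonants plus "none"). The sketch below composes a syllable arithmetically, following the Hangul composition formula in the Unicode Standard; the function and constant names are illustrative.

```python
import unicodedata

S_BASE = 0xAC00                       # first precomposed syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7


def compose_hangul(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Return the precomposed syllable for the given Jamo indices.

    t_index 0 means "no trailing consonant".
    """
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("Jamo index out of range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


print(L_COUNT * V_COUNT * T_COUNT)    # number of precomposed syllables

# Sample: leading index 18, vowel index 0, trailing index 4.
syllable = compose_hangul(18, 0, 4)
jamo = chr(L_BASE + 18) + chr(V_BASE + 0) + chr(T_BASE + 4)
print(f"U+{ord(syllable):04X}", unicodedata.normalize("NFC", jamo) == syllable)
```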
Line 169:
Many languages, including [[Arabic language|Arabic]] and [[Devanāgarī|Hindi]], have special orthographic rules which require that certain combinations of letterforms be combined into special [[ligature (typography)|ligature forms]]. The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as [[OpenType]] (by Adobe and Microsoft), [[Graphite (SIL)|Graphite]] (by [[SIL International]]), or [[Apple Advanced Typography|AAT]] (by Apple). [[Font language|Instructions]] are also embedded in fonts to tell the [[operating system]] how to properly output different character sequences. In simpler cases, such as the placement of combining marks or diacritics, fixed-width fonts sometimes employ a method known as "[[sidebearing]]", in which the special marks precede the main letterform in the data stream and the font rendering software knows to combine the marks into a final form.{{citationneeded}} This method works only for some diacritics, and may fail to properly handle stacked marks.
 
[[As of 2004]], most software still cannot reliably handle many features not supported by older font formats, so combining characters generally will not work correctly. For example, {{unicode|ḗ}} (precomposed e with macron and acute above) and {{unicode|ḗ}} (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an [[e]] with a [[macron]] and [[acute accent]], but in practice their appearance can vary greatly across software applications. Similarly, [[dot (diacritic)|underdot]]s, as needed in the [[romanization]] of [[Indo-Aryan languages|Indic]], will often be placed incorrectly. As a workaround, Unicode characters that map to precomposed glyphs can be used for many such characters. The need for such alternatives stems from the limitations of fonts and rendering technology, not from weaknesses of Unicode itself.
 
== Unicode in use ==
Line 314:
 
=== Multilingual text-rendering engines ===
* [[Uniscribe]] — [[Microsoft Windows|Windows]]
* [[Apple Type Services for Unicode Imaging]] — new engine for [[Apple Macintosh|Macintosh]]
* [[WorldScript]] — old engine for [[Apple Macintosh|Macintosh]]
* [[Pango]] — [[Open Source]], used by [[GTK+]] (and hence [[GNOME]])
* [[International Components for Unicode|ICU Layout Engine]] — Open Source
* [[Graphite (SIL)|Graphite]] — (Open Source renderer from [[SIL International|SIL]])
* [[Qt (toolkit)|Scribe]] — Open Source renderer from [[Trolltech]]
 
=== Input methods ===
Line 342:
To input a Unicode character in a text box in [[Mozilla Firefox]] on Linux, type the hexadecimal character code while holding down the control and shift keys.
 
In the [[Vim (text editor)|Vim]] text editor, Unicode characters can be entered by pressing CTRL-V and then entering a key combination. For more information, type "<code>:help i_CTRL-V_digit</code>" in Vim. (Note that the entered text will be Unicode only if the current encoding is set to UTF-8 or another Unicode encoding; type "<code>:help encoding</code>" in Vim for details.) Many Unicode characters can also be entered using [[Digraph (computing)|digraph]]s; a table of such characters and their corresponding digraphs can be obtained using the "<code>:digraphs</code>" command (again provided the current encoding is set to Unicode).
 
WordPad and Word 2002/2003 for Windows additionally allow Unicode characters to be entered by typing the hexadecimal code point, for example 014B for ''ŋ'', and then pressing <code>Alt + x</code> to replace the string to the left of the cursor with its Unicode character. Usefully, the reverse also applies: if a user positions the cursor to the right of a non-ASCII character and presses <code>Alt + x</code>, the Microsoft software will replace the character with its hexadecimal Unicode code point.
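
The conversion behind this shortcut is simply a mapping between a hexadecimal string and a code point. A minimal Python sketch of both directions follows; the function names are hypothetical and the code does not reproduce the behaviour of the Microsoft software beyond that mapping.

```python
def hex_to_char(hex_code: str) -> str:
    """Convert a hexadecimal code point such as "014B" to its character."""
    return chr(int(hex_code, 16))


def char_to_hex(ch: str) -> str:
    """Convert a single character back to a 4-digit (or longer) hex string."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    return f"{ord(ch):04X}"


print(hex_to_char("014B"))               # the example from the text above
print(char_to_hex(hex_to_char("014B")))  # round trip back to the hex string
```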
 
Several visual keyboards are available that make entering Unicode characters and symbols very easy.
* [http://quickkeydotnet.sourceforge.net/ Quick Key] (Open Source)
* [http://cjb.ie/u.php Lightweight Unicode Map/Picker] (In-browser character map; operating-system independent. Open Source)
* [http://www.ergonis.com/products/popchar/ PopChar Demo Version]
 
== Issues ==
Line 401:
* ''The Unicode Standard, Version 5.0, Fifth Edition'', The [[Unicode Consortium]], Addison-Wesley Professional, Oct. 27, 2006. ISBN 0-321-48091-0
* ''The Unicode Standard, Version 4.0'', The Unicode Consortium, Addison-Wesley Professional, Aug. 27, 2003. ISBN 0-321-18578-1
[[Image:Unicodeconsortium_book4.jpg|thumb|right|The Unicode Standard, Version 4.0]]
 
== External links ==
Line 415:
* tools
** [http://people.w3.org/rishida/scripts/uniview/conversion Unicode Code Converter v3]
** Insert characters instantly with [http://quickkeydotnet.sourceforge.net/ Quick Key Character Grid].
** [http://billposer.org/Software/unidesc.html A suite of programs for finding out what is in a Unicode file]
** [http://billposer.org/Software/uni2ascii.html Programs for converting between Unicode and various ASCII representations]
Line 455:
[[ca:Unicode]]
[[chr:Unicode/Cherokee]]
[[ckb:یوونیکۆد]]
[[cs:Unicode]]
[[da:Unicode]]