Unicode: Difference between revisions

 
Further additions of characters to the already-encoded scripts, as well as symbols, in particular for [[mathematics]] and [[music]] (in the form of notes and rhythmic symbols), also occur. The [http://www.unicode.org/roadmaps/ Unicode Roadmap] lists scripts not yet in Unicode with tentative assignments to code blocks. Invented scripts, most of which do not qualify for inclusion in Unicode due to lack of real-world usage, are listed in the [[ConScript Unicode Registry]], along with unofficial but widely-used [[Private Use Area]] code assignments. Similarly, many medieval letter variants and ligatures not in Unicode are encoded in the [[Medieval Unicode Font Initiative]].
 
== Unicode in use ==
=== Operating systems ===
 
Unicode has become the dominant scheme for the internal processing of text, and sometimes for its storage, although a great deal of text is still stored in legacy encodings. Early adopters tended to use UCS-2 and later moved to UTF-16, as this was the least disruptive way to add support for non-BMP characters. The best-known such system is [[Windows NT]] (and its descendants, [[Windows 2000]] and [[Windows XP]]). The [[Java virtual machine|Java]] and [[.NET Framework|.NET]] bytecode environments, Mac OS X, and Unix desktops such as KDE and GNOME also use it.
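The difference between a BMP character and a non-BMP character in UTF-16 can be illustrated with a minimal Python sketch. The helper name <code>utf16_code_units</code> and the two sample characters are chosen here for illustration only; the point is that a character above U+FFFF occupies two 16-bit code units (a surrogate pair), which a pure UCS-2 system cannot represent.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP (above U+FFFF).
clef = "\U0001D11E"
bmp_char = "\u00F1"  # ñ, inside the BMP


def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_units = utf16_code_units(bmp_char)   # a single code unit
clef_units = utf16_code_units(clef)      # two code units: a surrogate pair

print([f"{u:04X}" for u in bmp_units])
print([f"{u:04X}" for u in clef_units])
```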
 
[[UTF-8]] (originally developed for [[Plan 9 from Bell Labs|Plan 9]]) has become the main encoding on most [[Unix-like]] operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional [[extended ASCII]] character sets.
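Why UTF-8 is an easy replacement for ASCII-based character sets can be shown with a short Python sketch (the sample strings are arbitrary): pure ASCII text produces exactly the same bytes in both encodings, while every byte of a non-ASCII character has its high bit set and so cannot be confused with an ASCII character.

```python
ascii_text = "Plan 9"
mixed_text = "Δ = 葉"

# Pure ASCII text encodes to identical bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII characters become multi-byte sequences whose bytes are all >= 0x80.
encoded = mixed_text.encode("utf-8")
non_ascii_bytes = [b for b in encoded if b >= 0x80]

print(same_bytes)
print(encoded.hex(" "))
```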
 
=== E-mail ===
 
{{main|Unicode and e-mail}}
 
[[MIME]] defines two different mechanisms for encoding non-ASCII characters in [[e-mail]], depending on whether the characters are in e-mail headers such as the "Subject:" or in the text body of the message. In both cases, the original character set is identified as well as a transfer encoding. For e-mail transmission of Unicode the UTF-8 character set and the [[Base64]] transfer encoding are recommended. The details of the two different mechanisms are specified in the MIME standards and are generally hidden from users of e-mail software.
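The two mechanisms can be sketched with Python's standard <code>email</code> package. This is only an illustration under stated assumptions: the subject and body strings are invented, and the header "encoded word" is assembled by hand to make its parts visible (charset UTF-8, transfer encoding <code>B</code> for Base64).

```python
import base64
from email.header import decode_header
from email.mime.text import MIMEText

subject = "Grüße"  # sample text, chosen for illustration

# Header mechanism: an "encoded word" that names the charset (UTF-8)
# and the transfer encoding (B = Base64).
payload = base64.b64encode(subject.encode("utf-8")).decode("ascii")
encoded_word = "=?UTF-8?B?" + payload + "?="

# Decoding recovers the raw bytes and the declared charset.
raw, charset = decode_header(encoded_word)[0]
decoded_subject = raw.decode(charset)

# Body mechanism: Content-Type carries the charset, and
# Content-Transfer-Encoding names the transfer encoding.
body = MIMEText("Δ Й 葉", "plain", "utf-8")

print(encoded_word)
print(body["Content-Type"])
print(body["Content-Transfer-Encoding"])
```

In practice a mail library builds these headers itself, which is why the details stay hidden from users.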
 
The adoption of Unicode in [[e-mail]] has been very slow. Most East-Asian text is still encoded in a local encoding such as [[Shift-JIS]], and many commonly used e-mail programs still cannot handle Unicode data correctly, if they have any support at all. This situation is not expected to change in the foreseeable future.
 
=== Web ===
 
{{main|Unicode and HTML}}
Web browsers have supported several UTFs, especially UTF-8, for many years. Display problems result primarily from [[typeface|font]]-related issues. In particular, [[Internet Explorer]] does not render many code points unless it is explicitly told to use a font that contains them.
 
All [[W3C]] recommendations since HTML 4.0 have used Unicode as their ''document character set'', while allowing the encoding to vary. Unicode replaces [[ISO-8859-1]], an 8-bit superset of ASCII, which had previously been the standard character set and encoding.
 
Although syntax rules may affect the order in which characters are allowed to appear, both [[HTML|HTML 4]] and [[XML]] (including [[XHTML]]) documents, by definition, may contain characters from nearly all Unicode code points, with the exception of:
* most of the [[C0 and C1 control codes]]
* the permanently-unassigned code points D800–DFFF
* any code point ending in FFFE or FFFF
* any code point above 10FFFF.
These characters appear either directly as [[byte]]s in the document's encoding, if the encoding supports them, or as numeric character references based on the character's Unicode code point.
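The three numeric exclusions in the list above can be written as a small check. This is a sketch that follows the list as worded, not the exact grammar of any particular HTML or XML version; the function name is hypothetical, and the control-code rule is deliberately omitted because the list does not say which control codes remain permitted.

```python
def excluded_by_listed_rules(cp: int) -> bool:
    """True if cp falls under one of the three numeric exclusions listed above.

    The "most C0 and C1 control codes" rule is intentionally not modelled.
    """
    if 0xD800 <= cp <= 0xDFFF:              # surrogate range D800-DFFF
        return True
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):   # code point ends in FFFE or FFFF
        return True
    return cp > 0x10FFFF                    # beyond the Unicode code space


for cp in (0x0394, 0xD800, 0xFFFE, 0x1FFFF, 0x10FFFD, 0x110000):
    print(f"U+{cp:04X}", excluded_by_listed_rules(cp))
```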
 
For example, the references <code>&amp;#916;</code>, <code>&amp;#1049;</code>, <code>&amp;#1511;</code>, <code>&amp;#1605;</code>, <code>&amp;#3671;</code>, <code>&amp;#12354;</code>, <code>&amp;#21494;</code>, <code>&amp;#33865;</code>, and <code>&amp;#45307;</code> (or the same numeric values expressed in hexadecimal, with <code>&amp;#x</code> as the prefix) display on browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉 and 냻. If the proper fonts exist, these symbols look like the [[Greek alphabet|Greek]] capital letter "[[Delta (letter)|Delta]]", [[Cyrillic alphabet|Cyrillic]] capital letter "[[Short I]]", [[Hebrew alphabet|Hebrew]] letter "Qof", [[Arabic alphabet|Arabic]] letter "Meem", [[Thai language|Thai]] [[numeral]] [[7 (number)|7]], [[Japanese language|Japanese]] [[Hiragana]] "A", [[simplified Chinese]] "[[Leaf]]", [[traditional Chinese]] "Leaf", and [[Korean language|Korean]] [[Hangul]] syllable "Nyaelh", respectively.
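The relationship between a character and its decimal or hexadecimal numeric character reference can be sketched in Python. The two helper names are invented for this example; <code>html.unescape</code> from the standard library performs the reverse step that a browser applies.

```python
import html


def to_decimal_ncr(ch: str) -> str:
    """Decimal numeric character reference for a single character."""
    return f"&#{ord(ch)};"


def to_hex_ncr(ch: str) -> str:
    """Hexadecimal numeric character reference for a single character."""
    return f"&#x{ord(ch):X};"


delta_dec = to_decimal_ncr("Δ")
delta_hex = to_hex_ncr("Δ")

# Both forms resolve back to the same character.
roundtrip = html.unescape(delta_dec + delta_hex)

print(delta_dec, delta_hex, roundtrip)
```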
 
In [[HTTP]] requests, [[Uniform Resource Locator|URLs]] must be [[percent-encoding|percent-encoded]], usually using the [[UTF-8]] encoding to represent Unicode.
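As a brief illustration, Python's <code>urllib.parse.quote</code> percent-encodes the UTF-8 bytes of each non-ASCII character; the path used here is an arbitrary example.

```python
from urllib.parse import quote, unquote

path = "/wiki/葉"

# quote() uses UTF-8 by default and leaves "/" unescaped.
encoded_path = quote(path)
decoded_path = unquote(encoded_path)

print(encoded_path)
print(decoded_path)
```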
 
=== Fonts ===
 
Free and retail [[Unicode fonts|fonts]] based on Unicode are widely available, since first [[TrueType]] and now [[OpenType]] support Unicode. These font formats map Unicode code points to glyphs.
 
Thousands of [[List of typefaces|fonts]] exist on the market, but fewer than a dozen fonts — sometimes described as "pan-Unicode" fonts — attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based [[List of typefaces#Unicode fonts|fonts]] typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of [[diminishing returns]] for most typefaces.
 
Several subsets of Unicode are standardized: Microsoft Windows since Windows NT 4.0 supports [[WGL-4]] with 652 characters, which is considered to support all contemporary European languages using the Latin, Greek or Cyrillic script. Other standardized subsets of Unicode include MES-1 (335 characters) and MES-2 (1062 characters) (CWA 13873:2000, Multilingual European Subsets in ISO/IEC 10646-1).
{| class="wikitable" style="font-family: monospace;"
|+style="font-family:sans-serif; font-style: normal;"| '''WGL-4''', ''MES-1''<ref>[http://www.kostis.net/charsets/iso10646.mes-1.htm MES-1]</ref> and MES-2<ref>[http://www.cl.cam.ac.uk/~mgk25/ucs/mes-2-rationale.html MES-2]</ref>
|-style="font-family:serif;"|
! Row !! Cells !! Range(s)
|-
!rowspan="2"| 00
| '''''20–7E'''''
| Basic Latin (00–7F)
|-
| '''''A0–FF'''''
| Latin-1 Supplement (80–FF)
|-
!rowspan="2"| 01
| '''''00–13,'' 14–15, ''16–2B,'' 2C–2D, ''2E–4D,'' 4E–4F, ''50–7E,'' 7F'''
| Latin Extended-A (00–7F)
|-
| 8F, '''92,''' B7, DE–EF, '''FA–FF'''
| Latin Extended-B (80–FF <span title="U+024F">…</span>)
|-
!rowspan="3"| 02
| 18–1B, 1E–1F
| Latin Extended-B (<span title="U+0180">…</span> 00–4F)
|-
| 59, 7C, 92
| IPA Extensions (50–AF)
|-
| BB–BD, '''C6, ''C7,'' C9,''' D6, '''''D8–DB,'' DC, ''DD,''''' DF, EE
| Spacing Modifier Letters (B0–FF)
|-
! 03
| 74–75, 7A, 7E, '''84–8A, 8C, 8E–A1, A3–CE,''' D7, DA–E1
| Greek (70–FF)
|-
! 04
| 00, '''01–0C,''' 0D, '''0E–4F,''' 50, '''51–5C,''' 5D, '''5E–5F, 90–91,''' 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9
| Cyrillic (00–FF)
|-
! 1E
| 02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, '''80–85,''' 9B, '''F2–F3'''
| Latin Extended Additional (00–FF)
|-
! 1F
| 00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE
| Greek Extended (00–FF)
|-
!rowspan="3"| 20
| '''13–14, ''15,'' 17, ''18–19,'' 1A–1B, ''1C–1D,'' 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E'''
| General Punctuation (00–6F)
|-
| '''44,''' 4A, '''7F''', 82
| Superscripts and Subscripts (70–9F)
|-
| '''A3–A4, A7, ''AC,''''' AF
| Currency Symbols (A0–CF)
|-
!rowspan="3"| 21
| '''05, 13, 16, ''22, 26,'' 2E'''
| Letterlike Symbols (00–4F)
|-
| '''''5B–5E'''''
| Number Forms (50–8F)
|-
| '''''90–93,'' 94–95, A8'''
| Arrows (90–FF)
|-
! 22
| 00, '''02,''' 03, '''06,''' 08–09, '''0F, 11–12, 15, 19–1A, 1E–1F,''' 27–28, '''29,''' 2A, '''2B, 48,''' 59, '''60–61, 64–65,''' 82–83, 95, 97
| Mathematical Operators (00–FF)
|-
! 23
| '''02, 0A, 20–21,''' 29–2A
| Miscellaneous Technical (00–FF)
|-
!rowspan="3"| 25
| '''00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C'''
| Box Drawing (00–7F)
|-
| '''80, 84, 88, 8C, 90–93'''
| Block Elements (80–9F)
|-
| '''A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6'''
| Geometric Shapes (A0–FF)
|-
! 26
| '''3A–3C, 40, 42, 60, 63, 65–66, ''6A,'' 6B'''
| Miscellaneous Symbols (00–FF)
|-
! F0
| (01–02)<!--in WGL-4, but not in MES-2-->
| Private Use Area (00–FF …)
|-
! FB
| '''01–02'''
| Alphabetic Presentation Forms (00–4F)
|-
! FF
| FD
| Specials
|}
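To read the table, combine the Row value (the high byte) with each value in Cells (the low byte). The following Python sketch expands one row into code points; the function name is hypothetical, and it assumes the bold and italic wiki markup has already been stripped from the cell text.

```python
def expand_cells(row: str, cells: str) -> list[int]:
    """Expand a table row (high byte) and a cells string into code points."""
    high = int(row, 16) << 8
    code_points = []
    # Accept both the en dash used in the table and a plain hyphen.
    for part in cells.replace("-", "–").split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("–")
        first = int(start, 16)
        last = int(end, 16) if end else first
        code_points.extend(high | low for low in range(first, last + 1))
    return code_points


basic_latin = expand_cells("00", "20–7E")   # U+0020 to U+007E
ligatures = expand_cells("FB", "01–02")     # U+FB01 and U+FB02

print(len(basic_latin))
print([f"U+{cp:04X}" for cp in ligatures])
```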
 
Rendering software that cannot process a Unicode character appropriately most often displays it as an open rectangle, or as the Unicode "Replacement Character" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. The Apple ''[[LastResort]]'' font will display a substitute glyph indicating the Unicode range of the character and the [[SIL International|SIL]] [[Unicode fallback font]] will display a box showing the hexadecimal scalar value of the character.
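The two fallback behaviours described above can be imitated in a toy Python sketch. Everything here is hypothetical: the function, the <code>supported</code> set standing in for a font's coverage, and the bracketed hexadecimal label standing in for the box that a fallback font would draw.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def render_with_fallback(text: str, supported: set[str], show_hex: bool = False) -> str:
    """Replace characters outside `supported` with U+FFFD or a hex label."""
    out = []
    for ch in text:
        if ch in supported:
            out.append(ch)
        elif show_hex:
            out.append(f"[{ord(ch):04X}]")   # stand-in for a hex box glyph
        else:
            out.append(REPLACEMENT_CHARACTER)
    return "".join(out)


latin_only = set("abcdefghijklmnopqrstuvwxyz ")  # pretend font coverage
plain = render_with_fallback("leaf 葉", latin_only)
labelled = render_with_fallback("leaf 葉", latin_only, show_hex=True)

print(plain)
print(labelled)
```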
 
=== Multilingual text-rendering engines ===
* [[Uniscribe]] — [[Microsoft Windows|Windows]]
* [[Apple Type Services for Unicode Imaging]] — new engine for [[Apple Macintosh|Macintosh]]
* [[WorldScript]] — old engine for [[Apple Macintosh|Macintosh]]
* [[Pango]] — [[Open Source]], used by [[GTK+]] (and hence [[GNOME]])
* [[International Components for Unicode|ICU Layout Engine]] — Open Source
* [[Graphite (SIL)|Graphite]] — (Open Source renderer from [[SIL International|SIL]])
* [[Qt (toolkit)|Scribe]] — Open Source renderer from [[Trolltech]]
 
=== Input methods ===
 
Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire.
 
In [[Microsoft Windows]] (since Windows 2000), the "Character Map" program (Start/Programs/Accessories/System Tools/Character Map) provides rich-text editing controls for all Table I characters up to U+FFFF, by selection from a drop-down table, assuming that a Unicode [[font]] is selected. Programs such as [[Microsoft Office Word|Microsoft Word]] have a similar control embedded (Insert/Symbol). Where the code point of the desired character is known, it is also possible, if more laborious, to create Unicode characters by pressing <code>Alt + #</code>, where # represents 0 followed by the decimal code point; for example, <code>Alt + 0241</code> will produce the Unicode character ''ñ''. (The # must start with 0 to be considered a Unicode code point, and the keys on the numeric pad of the keyboard must be used.) This also works in many other Windows applications, but not in applications that use the standard Windows edit control and make no special provision for this type of input. See [[Alt codes]]. To add Unicode characters to chart titles in [[Microsoft Excel]], first type the title text into a worksheet cell, where the Insert/Symbol control can be used. The resulting text can then be cut and pasted into the chart title.
 
[[Apple Macintosh]] users have a similar feature with an input method called 'Unicode Hex Input', in [[Mac OS X]] and in [[Mac OS]] 8.5 and later: hold down the Option key, and type the four-hex-digit Unicode code point. Inputting code points above U+FFFF is done by entering [[UTF-16|surrogate pairs]]; the software will convert each pair into a single character automatically. Mac OS X (version 10.2 and newer) also has a 'Character Palette', which allows users to visually select any Unicode character from a table organized numerically, by Unicode block, or by a selected font's available characters. The 'Unicode Hex Input' method must be activated in the International System Preferences in [[Mac OS X]] or the 'Keyboard' Control Panel in [[Mac OS]] 8.5 and later. Once activated, 'Unicode Hex Input' must also be selected in the Keyboard menu (designated by the flag icon) before a Unicode code point can be entered.
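The surrogate pair to type for a code point above U+FFFF follows from simple arithmetic, sketched here in Python. The function name and the sample code point U+10400 are chosen for illustration only.

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use a surrogate pair")
    offset = cp - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x10400)
typed = f"{high:04X} {low:04X}"      # the two four-digit groups to enter

print(typed)
```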
 
[[GNOME]] provides a 'Character Map' utility (Applications/Accessories/Character Map) which displays characters ordered by Unicode block or by writing system, and allows searching by character name or extended description. Where the character's code point is known, it can be entered in accordance with [[ISO 14755]]: hold down Ctrl and Shift and enter the hexadecimal Unicode value, preceded by the letter U if using GNOME 2.15 or later. Because GNOME uses [[UTF-8]] internally, this method works in all applications and surrogate pairs are not needed.
 
At the [[X11|X Input Method]] or GTK+ Input Module level, the input method editor [[SCIM]] provides a “raw code” input method to allow the user to enter the 4-digit hexadecimal Unicode value.
 
All [[X11|X Window]] applications, including but not limited to [[GNOME]] and [[KDE]] applications, support the [[Compose Key]]. On keyboards that do not have a designated Compose key, another key (e.g., CapsLock) can be redefined as a Compose key.
 
The [[Linux]] [[console]] allows Unicode characters to be entered by holding down Alt and typing the decimal code on the [[numeric keypad]]. (In order for this to work, the console should be placed in Unicode mode with <code>unicode_start(1)</code> and a suitable font selected with <code>setfont(8)</code>.) The AltGr key allows the hexadecimal code to be entered instead, using NumLock-Enter as A-F (clockwise). ISO 14755 compliant input (Ctrl+Shift+hexadecimal code on normal keys) is also available in the <code>unicode</code> keymap.
 
The [[Opera (web browser)|Opera web browser]], version 7.5 and later, allows users to enter any Unicode character directly into a text field by typing its hexadecimal code, selecting it, and pressing <code>Alt + x</code>.
 
To input a Unicode character in a text box in [[Mozilla Firefox]] on Linux, type the hexadecimal character code while holding down the control and shift keys.
 
In the [[Vim (text editor)|Vim]] text editor, Unicode characters can be entered by pressing CTRL-V and then entering a key combination. For more information, type "<code>:help i_CTRL-V_digit</code>" in Vim. (Note that the entered text will be Unicode only if the current encoding is set to UTF-8 or another Unicode encoding; type "<code>:help encoding</code>" in Vim for details.) Many Unicode characters can also be entered using [[Digraph (computing)|digraphs]]; a table of such characters and their corresponding digraphs can be obtained using the "<code>:digraphs</code>" command (again provided the current encoding is set to Unicode).
 
WordPad and Word 2002/2003 for Windows additionally allow Unicode characters to be entered by typing the hexadecimal code point, for example 014B for ''ŋ'', and then pressing <code>Alt + x</code> to replace the string to the left with its Unicode character. Usefully, the reverse also applies: if a user positions the cursor to the right of a non-ASCII character and presses <code>Alt + x</code>, the Microsoft software will replace the character with its hexadecimal Unicode code point.
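The conversion behind this two-way substitution is just a change of representation, as this Python sketch suggests. The function names are invented for the example, and the sketch does not model how any particular application decides which digits to the left of the cursor to consume.

```python
def hex_to_char(hex_digits: str) -> str:
    """Character whose code point is given in hexadecimal, e.g. '014B'."""
    return chr(int(hex_digits, 16))


def char_to_hex(ch: str) -> str:
    """Hexadecimal code point of a character, padded to at least four digits."""
    return f"{ord(ch):04X}"


eng = hex_to_char("014B")
code = char_to_hex(eng)

print(eng, code)
```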
 
Several visual keyboards are available that make entering Unicode characters and symbols very easy.
* [http://quickkeydotnet.sourceforge.net/ Quick Key] (Open Source)
* [http://cjb.ie/u.php Lightweight Unicode Map/Picker] (In-browser character map; operating-system independent. Open Source)
* [http://www.ergonis.com/products/popchar/ PopChar Demo Version]
 
== Issues ==