Unicode is an international character encoding standard. With Unicode, each character receives a unique number that remains the same regardless of computer platform, software, or language. It is widely supported by the computer industry and is synchronized with ISO/IEC 10646, the parallel international standard published by ISO, the international standards body made up of representatives from various countries.
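As a small illustration of the "unique number" idea, the following Python sketch prints the number (code point) assigned to a few characters. The sample characters and the helper name code_point_label are chosen here for illustration only; they are not part of the standard or of this guide.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"

# Sample characters chosen for illustration: Latin a, Greek alpha, Hebrew alef.
samples = ["a", "\u03b1", "\u05d0"]
labels = [code_point_label(ch) for ch in samples]

for ch, label in zip(samples, labels):
    print(label, ch)

# The mapping works in both directions: the number alone identifies the character.
round_trip = [chr(ord(ch)) for ch in samples]
```

The same numbers come back on any platform or in any software that implements Unicode, which is what makes the encoded text portable.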
Unicode will permit your documents to be accessible to others. Besides its support in the computer industry, it is backed by national bodies throughout the world, and it is now the default standard for XML. Unicode will also help to preserve your data and offer stability.
One goal of Unicode is universal coverage; it has enough space to cover the scripts of the world, both historic and modern. Most modern scripts used by a large population are covered, but many historic scripts are currently left out.
Check the Code Charts on the Unicode website, www.unicode.org. (Note: The Code Charts on the Web are more up-to-date than those published in the book version; the website has Unicode 3.2, whereas the book represents Unicode 3.0.)
The Code Charts comprise the charts themselves (with representative glyphs or images and their Unicode values in hex), and a Names List. The Names List includes the Unicode value, a glyph, the name (in all caps), and occasionally additional information for the user.
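The same three pieces of information (hex value, glyph, name) can also be looked up programmatically. The sketch below uses Python's standard unicodedata module; the helper name names_list_entry and its tab-separated layout are assumptions made for illustration and only loosely imitate a Names List line. Note that unicodedata reflects whichever Unicode version ships with your Python, which will differ from the version shown in the printed book.

```python
import unicodedata

def names_list_entry(ch: str) -> str:
    """Format one character as: hex value, glyph, name (all caps)."""
    return f"{ord(ch):04X}\t{ch}\t{unicodedata.name(ch)}"

# HEBREW LETTER ALEF, chosen as an example.
entry = names_list_entry("\u05d0")
print(entry)

# The lookup also works from the name back to the character.
alef = unicodedata.lookup("HEBREW LETTER ALEF")
```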
Notes:
a. Scripts are arranged in blocks, but some characters may be in
different blocks, so you may need to look around for the appropriate
character. Punctuation, which is used across a variety of scripts, can
be in a separate block.
b. If a needed character can be covered by more than one Unicode
character (such as the middle dot, for which Unicode has several), pick
one and document your choice, checking first with others in the
scholarly community who have electronic texts. Be sure to select a
character with similar character properties; otherwise there may be
problems with line-breaking, etc. General guideline for letters/signs
used in text: avoid selecting characters from the Mathematical Operators
(for non-math uses), Superscripts and Subscripts (unless for
phonetic/phonemic use), Letterlike Symbols, or Number Forms blocks.
(Character properties information is available at:
http://unicode.org/Public/UNIDATA/UnicodeData.html.)
c. Some letters you need may be represented by a sequence of more than
one character. For example, the Hittite 'h rocker' is covered by two
characters: 'LATIN SMALL LETTER H' (0068) followed by the combining
diacritic 'COMBINING BREVE BELOW' (032E). New precomposed characters
(e.g., h with rocker) will not likely be accepted by Unicode, since they
can already be handled by a base character plus a combining mark (here,
h + combining breve below). Abbreviations, ligatures, and idiosyncratic
scribal marks will also not likely be accepted; these can be included
in the font.
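To make the 'h rocker' example in note c concrete, here is a minimal Python sketch that builds the two-character sequence and inspects it with the standard unicodedata module. The variable names are illustrative only.

```python
import unicodedata

# LATIN SMALL LETTER H (0068) followed by COMBINING BREVE BELOW (032E).
h_rocker = "\u0068\u032e"

# List each code point with its name and canonical combining class.
parts = [(f"{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
         for ch in h_rocker]
for code, name, combining_class in parts:
    print(code, name, combining_class)

# One letter on the page, but two code points in storage.
length = len(h_rocker)

# The sequence is already fully decomposed, so NFD leaves it unchanged.
decomposed = unicodedata.normalize("NFD", h_rocker)
```

A nonzero combining class marks the second code point as a mark that attaches to the preceding base letter. Software that counts, searches, or sorts text needs to treat the pair as one unit.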
For further details on finding a character, see
http://www.unicode.org/unicode/standard/where/.
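The advice in note b (compare character properties, and avoid certain blocks for ordinary text) can be checked mechanically. The following is a sketch under stated assumptions: the block ranges are written out by hand from the code charts, the helper name discouraged_block is hypothetical, and only two of the several dot-like characters are compared.

```python
import unicodedata

# Ranges of the blocks that note b advises against for letters/signs in text.
DISCOURAGED_BLOCKS = {
    "Superscripts and Subscripts": (0x2070, 0x209F),
    "Letterlike Symbols": (0x2100, 0x214F),
    "Number Forms": (0x2150, 0x218F),
    "Mathematical Operators": (0x2200, 0x22FF),
}

def discouraged_block(ch: str):
    """Return the name of the discouraged block containing ch, or None."""
    cp = ord(ch)
    for name, (start, end) in DISCOURAGED_BLOCKS.items():
        if start <= cp <= end:
            return name
    return None

# Two dot-like candidates: MIDDLE DOT and DOT OPERATOR.
candidates = ["\u00b7", "\u22c5"]
report = [(unicodedata.name(ch), unicodedata.category(ch), discouraged_block(ch))
          for ch in candidates]
for name, category, block in report:
    print(name, category, block)
```

Here the general category differs (punctuation versus math symbol), which is the kind of property difference that can affect line-breaking and other text processing.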
Note: Scripts outside of Plane 0, the 'Basic Multilingual Plane', may need some adjustment in order to work. Most historic scripts will be located in Plane 1, and support for it is only now arriving in the most recent software. For information on Windows NT, 2000, and XP, see www.i18nguy.com/surrogates.html. To determine which plane a script is in, check the Roadmaps charts (http://www.unicode.org/roadmaps/).
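The plane of a character follows directly from its number, and software that uses UTF-16 reaches characters beyond Plane 0 through pairs of 'surrogate' code units. The sketch below shows the arithmetic; the function names plane_of and utf16_surrogates are illustrative, and Gothic (starting at 10330) is used as the Plane 1 example.

```python
def plane_of(code_point: int) -> int:
    """Each plane holds 0x10000 code points, so the plane is the part above the low 16 bits."""
    return code_point >> 16

def utf16_surrogates(code_point: int):
    """Split a code point beyond Plane 0 into its UTF-16 high and low surrogates."""
    if code_point < 0x10000:
        raise ValueError("Plane 0 code points need no surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

runic_start = 0x16A0    # Runic block, Plane 0
gothic_start = 0x10330  # Gothic block, Plane 1

planes = (plane_of(runic_start), plane_of(gothic_start))
high, low = utf16_surrogates(gothic_start)
print(planes, f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(gothic_start).encode("utf-16-be")
```

Software that assumes every character fits in a single 16-bit unit will mishandle such pairs, which is why Plane 1 scripts may need the adjustments mentioned above.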
Resources for information on this and directions how to set up the browser (etc.):
Difficulties/Unhappy with the Unicode font or software?
Let the software/font vendor know. If your favorite product does not support Unicode (e.g., Macromedia Dreamweaver), write to the person in charge of the product and ask that it be Unicode-enabled. In some companies, the 'Internationalization' group handles Unicode.
Check the following pages on the Unicode website for the latest information on script proposals being developed or in line for approval:
Also, ask on the Unicode email list if a Unicode proposal is being prepared for the missing script.
Options:
a. Use transliteration/transcription
(i.e., use the Latin blocks, with appropriate combining diacritics; avoid Mathematical Operators [for non-math uses], Superscripts and Subscripts [unless for phonetic/phonemic use], Letterlike Symbols, or Number Forms blocks.)
b. Work to get the script into Unicode. The steps toward this include:
BRIEF EXCURSUS on Characters vs. Glyphs
Unicode covers characters, not glyphs. Characters are abstract and reflect 'the smallest components of written language that have semantic value' (The Unicode Standard 3.0, p. 13), whereas glyphs are the surface representations of characters. It is glyphs that appear on the printed page or on your monitor. For the abstract character 'a' (LATIN SMALL LETTER A), one can find the following glyphs: a, a, a, a (and many more, each drawn in a different typeface or style). Unicode concerns only the abstract characters; the font provides the glyphs. Determining whether something is a character or a glyph can be difficult when working with historic texts. Key questions to ask are:
c. But how can one work with the script in the meantime?
One approach:
Other approaches:
Note: The Text Encoding Initiative is preparing guidelines, but nothing definitive has been agreed upon yet.
Note 1: Plane 0 designates the location of the script in the Basic Multilingual Plane, which is currently well supported in modern operating systems. Plane 1, or the Supplementary Multilingual Plane, will contain many of the historic scripts. Software support for accessing Plane 1 is currently being developed, as are fonts. For directions on using characters from this area in Windows 2000, NT, or XP, see www.i18nguy.com/surrogates.html.
Note 2: For a discussion of the Early Semitic scripts and their proposed grouping into scripts, see the following document by Michael Everson from January 2001: http://std.dkuug.dk/JTC1/SC2/WG2/docs/N2311.pdf
Scripts used for Indo-European languages are italicized in the list below.
Arabic (http://www.unicode.org/charts/PDF/U0600.pdf) Plane 0
Armenian (http://www.unicode.org/charts/PDF/U0530.pdf) Plane 0
Coptic (Some characters are listed in the 'Greek and Coptic' block at http://www.unicode.org/charts/PDF/U0370.pdf. Note: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2444.pdf is a document that requests additional characters for Coptic be added. Scholarly input on the repertoire in the chart verifying that the list of characters is complete and the glyphs are representative is requested, along with a statement on why Coptic should be handled as a difference of script, and not style.) Plane 0
Ethiopic (http://www.unicode.org/charts/PDF/U1200.pdf) Plane 0
Greek (N.B.: check the Greek and Greek Extended blocks; additional characters are being proposed, including editorial signs, see:
http://www.tlg.uci.edu/~tlg/Uni.prop.html) Plane 0
Hebrew (http://www.unicode.org/charts/PDF/U0590.pdf) (N.B.: Unicode includes only the Tiberian vocalization system; non-Tiberian systems were not included) Plane 0
Syriac (http://www.unicode.org/charts/PDF/U0700.pdf) (includes Estrangelo, Serto, Nestorian, Jacobite, and will include Manichaean and Christian Sogdian) Plane 0
Cyrillic (with some historic letters; http://www.unicode.org/charts/PDF/U0400.pdf) and Cyrillic Supplement (http://www.unicode.org/charts/PDF/U0500.pdf) (Note: Additional missing Cyrillic letters are being prepared for submission.) Plane 0
Devanagari (http://www.unicode.org/charts/PDF/U0900.pdf) Plane 0
Gothic (http://www.unicode.org/charts/PDF/U10330.pdf) Plane 1
Latin (5 blocks: Basic Latin [http://www.unicode.org/charts/PDF/U0000.pdf], Latin-1 Supplement [http://www.unicode.org/charts/PDF/U0080.pdf], Latin Extended-A [http://www.unicode.org/charts/PDF/U0100.pdf], Latin Extended-B [http://www.unicode.org/charts/PDF/U0180.pdf], Latin Extended Additional [http://www.unicode.org/charts/PDF/U1E00.pdf]) Plane 0
Ogham (http://www.unicode.org/charts/PDF/U1680.pdf) Plane 0
Old Italic (http://www.unicode.org/charts/PDF/U10300.pdf) Plane 1
Runic (http://www.unicode.org/charts/PDF/U16A0.pdf) Plane 0
Aegean Scripts (Linear B and Cypriot; http://www.evertype.com/standards/iso10646/pdf/n2378-aegean.pdf) Plane 1
Ugaritic cuneiform (proposals: http://www.evertype.com/standards/iso10646/pdf/ugaritic.pdf and http://www.unicode.org/pending/ugaritic/01141-n2338-ugaritic.pdf) Plane 1
Aramaic (proposal: http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2042.pdf) (Aramaic is expected to include Aramaic proper, Middle Persian, Parthian, and Sogdian) Plane 0
Armazi (see Hatran)
Avestan (proposal: http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1684/n1684.htm; further work currently undertaken by Jost Gippert, TITUS) Plane 0
Byblos (no proposal) Plane 1
Carian (no proposal) Plane 1
Coptic (Some characters are included in the 'Greek and Coptic' block on http://www.unicode.org/charts/PDF/U0370.pdf. Note: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2444.pdf is a document that requests additional characters for Coptic be added. Scholarly input on the repertoire in the chart verifying that the list of characters is complete and the glyphs are representative is requested, along with a statement on why Coptic should be handled as a difference of script, and not style.) Plane 0
Cypro-Minoan (no proposal) Plane 1
Egyptian hieroglyphics (current plans are to put forward a proposal that encompasses the Gardiner list and its Supplements [the 1957 list from the Grammar along with the 1928 and 1953 Supplements]; Wolfgang Schenkel plans a more extensive set of Egyptological characters, but funding is needed) Plane 1
Note: A proposal to add 6 Egyptological characters used in transliteration (http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2241.pdf) has been put forward. Comments from scholars on the need for these characters are encouraged.
Elymaic (no proposal) Plane 1
Hatran (no proposal) (Hatran/Armazi) Plane 1
Hieroglyphic Luwian (proposal: http://www.evertype.com/standards/iso10646/pdf/luvian.pdf) Plane 1
Linear A (no proposal) Plane 1
Lycian (no proposal) Plane 1
Lydian (no proposal) Plane 1
Mandaic (proposal: http://www.evertype.com/standards/iso10646/pdf/mandaic.pdf) Plane 1
Meroitic (proposal: http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1638/n1638.htm) Plane 1
Nabataean (no proposal) Plane 1
North Arabic (no proposal) (Includes Dedanite, Lihyanite, Thamudic, and Safaitic) Plane 1
Old Persian cuneiform (http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1639/n1639.htm; further work currently undertaken by Jost Gippert, TITUS) Plane 1
Palmyrene (no proposal) Plane 1
Phoenician (proposal: http://www.evertype.com/standards/plane-1/ph.html) (To include Punic, Neo-Punic, Phoenician proper, Late Phoenician cursive, Phoenician papyrus, Siloam Hebrew, Hebrew seals, Ammonite, Moabite, and Palaeo-Hebrew) Plane 0
Proto-Elamite (no proposal) Plane 1
Samaritan (proposal: http://www.evertype.com/standards/iso10646/pdf/samaritan.pdf) Plane 0
South Arabian (http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1689/n1689.htm) (Includes Epigraphic South Arabian, Later South Arabian, Thamudic Ethiopic, Consonantal Ethiopic) Plane 1
Sumero-Akkadian cuneiform (work being undertaken separately by Karl-Juergen Feuerherm, Lloyd Anderson, and Dean Snyder of Johns Hopkins; Hittite is to be included in this block) Plane 1
Proto-Sinaitic; Phaistos Disk; Sidetic
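The chart links for encoded scripts in the lists above share a visible pattern: the file name is 'U' plus the hex value of the first code point in the block. The helper below is a sketch that assumes this pattern holds for other blocks as well; the name chart_url is hypothetical, and the resulting link should be checked on the Unicode website before being cited.

```python
CHART_BASE = "http://www.unicode.org/charts/PDF/"

def chart_url(block_start: int) -> str:
    """Build a code chart link from a block's first code point (assumed pattern)."""
    # At least four hex digits, upper case, e.g. 0x0600 -> "U0600.pdf".
    return f"{CHART_BASE}U{block_start:04X}.pdf"

arabic_chart = chart_url(0x0600)
gothic_chart = chart_url(0x10330)
print(arabic_chart)
print(gothic_chart)
```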
To Help on Unicode Proposals
If you are interested in commenting on a proposal or creating one, please contact:
Deborah Anderson, University of California, Berkeley, Department of
Linguistics, 1203 Dwinelle Hall #2650, Berkeley, CA 94720-2650;
dwanders@socrates.berkeley.edu.
For those scripts with proposals, specific comments are needed on the following:
a. Is the list of characters complete? (Note: Precomposed characters, ligatures, and variants, while of use for the creation of a font, should not be included in the basic proposal. This information would be helpful to include in ancillary documentation.)
b. Are the glyphs in the chart representative? (Note: The glyphs are not intended to be prescriptive.)
c. Can you provide specific details about characters' properties (e.g., upper/lower case information)? Any special information needed for font designers or implementers (including a list of ligatures for the font, placement on a line, etc.) can also be included.
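As a model for the kind of property details asked about in item c, the sketch below prints a few properties of characters that are already encoded, using Python's standard unicodedata module. The helper name property_summary and the choice of fields are assumptions made for illustration; a real proposal would follow whatever format the Unicode Consortium requests.

```python
import unicodedata

def property_summary(ch: str) -> dict:
    """Collect a few basic properties of an already-encoded character."""
    return {
        "code": f"{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),       # e.g. Ll = lowercase letter
        "bidi": unicodedata.bidirectional(ch),      # e.g. L = left-to-right
        "combining": unicodedata.combining(ch),     # 0 for base characters
        # Case mapping can yield more than one character, so keep a list.
        "uppercase": [f"{ord(c):04X}" for c in ch.upper()],
    }

alpha = property_summary("\u03b1")  # GREEK SMALL LETTER ALPHA
alef = property_summary("\u05d0")   # HEBREW LETTER ALEF
for summary in (alpha, alef):
    print(summary)
```

The Greek letter has a case pair and runs left to right, while the Hebrew letter has no case and runs right to left. Differences of this kind are what a proposal needs to state for each character of a new script.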