Ancient Near Eastern Scripts, Indo-European, and Unicode

Draft BIFoCAL document

Deborah Anderson
Unicode/IE Working Group
Dept. of Linguistics
UC Berkeley

What is Unicode?

Unicode is an international character encoding standard. With Unicode, each character receives a unique number that remains the same regardless of the computer platform, the software, or the language. It is widely supported by the computer industry and is synchronized with ISO/IEC 10646, the corresponding international standard, which is maintained by an international standards body made up of representatives from various countries.
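As a small illustration of the "unique number" idea, the following Python sketch prints the number (code point) assigned to a few characters. The helper name code_point_label is hypothetical and chosen only for this example; ord and chr are Python built-ins.

```python
def code_point_label(ch):
    """Return the Unicode number of a single character in the usual U+XXXX form."""
    return f"U+{ord(ch):04X}"

# The number belongs to the character itself, not to a font or platform.
for ch in ["a", "\u03B1", "\u05D0"]:  # Latin a, Greek alpha, Hebrew alef
    print(code_point_label(ch), ch)

# Converting the number back yields the same character.
assert chr(0x05D0) == "\u05D0"
```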

Why use Unicode?

Unicode will permit your documents to be accessible to others. It is widely supported by the computer industry and by national bodies throughout the world; it is now the default standard for XML. Unicode will also help to preserve your data and offer stability.

One goal of Unicode is universal coverage; it has enough space to cover the scripts of the world, both historic and modern. Most modern scripts used by a large population are covered, but many historic scripts are currently left out.

Steps to using Unicode:

1. Is the script in Unicode?

Check the Code Charts on the Unicode website, www.unicode.org. (Note: The Code Charts on the Web are more up-to-date than those published in the book version; the website has Unicode 3.2, whereas the book represents Unicode 3.0.)

The Code Charts comprise the charts themselves (with representative glyphs or images and their Unicode values in hex), and a Names List. The Names List includes the Unicode value, a glyph, the name (in all caps), and occasionally additional information for the user.
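The same kind of information can be looked up programmatically. The sketch below uses Python's standard unicodedata module to print a value, glyph, and name for a few characters, roughly in the spirit of a Names List line. The function name names_list_entry is hypothetical, and the data comes from whatever Unicode version ships with your Python, which will be later than Unicode 3.2.

```python
import unicodedata

def names_list_entry(ch):
    """Format one character as: hex value, glyph, name (tab-separated)."""
    # The second argument to unicodedata.name is a fallback for unnamed code points.
    return f"{ord(ch):04X}\t{ch}\t{unicodedata.name(ch, '<no name>')}"

for ch in ["h", "\u00B7", "\u16A0"]:  # Latin h, middle dot, a Runic letter
    print(names_list_entry(ch))
```

Whether the glyph column displays correctly depends on the fonts installed on your system, not on the lookup itself.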

Notes:
a. Scripts are arranged in blocks, but some characters may be in different blocks, so you may need to look around for the appropriate character. Punctuation, which is used across a variety of scripts, can be in a separate block.
b. If a needed character can be covered by more than one Unicode character (such as the middle dot, for which Unicode has several), pick one and document your choice, checking first with others in the scholarly community who have electronic texts. Be sure to select a character which has similar character properties; otherwise there may be problems with line-breaking, etc. General guidelines for letters/signs used in text: avoid selecting characters from the following blocks: Mathematical Operators (for non-math uses), Superscripts and Subscripts (unless for phonetic/phonemic use), Letterlike Symbols, or Number Forms. (Character properties information is available at: http://unicode.org/Public/UNIDATA/UnicodeData.html.)
c. Some letters you need may be represented by a sequence of more than one character. For example, the Hittite 'h rocker' is covered by two characters: 'LATIN SMALL LETTER H' (0068) and the combining diacritic 'COMBINING BREVE BELOW' (032E). New precomposed characters will not likely be accepted by Unicode when the letter can already be handled by a base character plus a combining mark, as with h + combining breve below. Abbreviations, ligatures, and idiosyncratic scribal marks will also not likely be accepted; these can be included in the font.
For further details on finding a character, see http://www.unicode.org/unicode/standard/where/.
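Notes b and c can be explored with a short script. The sketch below, again using Python's standard unicodedata module, prints the general category and combining class for the two characters of the 'h rocker' sequence and for a few dot-like characters. The helper name describe and the choice of candidate dots are assumptions made for illustration; the property values come from the Unicode data bundled with your Python.

```python
import unicodedata

def describe(ch):
    """Return (hex value, name, general category, combining class) for one character."""
    return (
        f"{ord(ch):04X}",
        unicodedata.name(ch, "<no name>"),
        unicodedata.category(ch),   # e.g. 'Ll' letter, 'Mn' nonspacing mark
        unicodedata.combining(ch),  # 0 for characters that do not combine
    )

# Note c: the 'h rocker' as a base letter followed by a combining mark.
h_rocker = "\u0068\u032E"
for ch in h_rocker:
    print(describe(ch))

# Note b: several dot-like characters with different properties.
# 'Po' is punctuation; 'Sm' is a math symbol, which may behave
# differently in line-breaking and other text processes.
for ch in ["\u00B7", "\u2219", "\u22C5"]:
    print(describe(ch))
```

A candidate whose category is 'Sm' belongs to the mathematical characters that the guidelines above suggest avoiding for ordinary text.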

2. If the script IS in Unicode, you need the following in order to create, read, and share documents:

Note: Scripts outside of Plane 0, the 'Basic Multilingual Plane', may need some adjustment in order to work. Most historic scripts will be located in Plane 1. The most recent software is making it possible to work with Plane 1. For information on Windows NT, 2000, and XP, see www.i18nguy.com/surrogates.html. To determine which plane a script is in, check the Roadmaps charts (http://www.unicode.org/roadmaps/).
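To make the plane distinction concrete, here is a minimal Python sketch. It computes the plane of a code point and shows why Plane 1 characters need extra care in software built around 16-bit units: in UTF-16 they are stored as a pair of "surrogate" code units. The function names plane_of and utf16_code_units are hypothetical; the example characters are the first code points of the Ogham (1680) and Gothic (10330) blocks.

```python
def plane_of(code_point):
    """Each plane holds 65,536 code points, so the plane is the value above the low 16 bits."""
    return code_point >> 16

def utf16_code_units(ch):
    """Return the 16-bit code units used to store one character in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

for ch in ["\u1680", "\U00010330"]:  # Ogham (Plane 0), Gothic (Plane 1)
    units = " ".join(f"{u:04X}" for u in utf16_code_units(ch))
    print(f"U+{ord(ch):04X}  plane {plane_of(ord(ch))}  UTF-16: {units}")
```

The Plane 0 character occupies a single 16-bit unit, while the Plane 1 character occupies two, which is the adjustment older software has to make.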

Resources for information on this and directions how to set up the browser (etc.):

Difficulties/Unhappy with the Unicode font or software?

Let the software/font vendor know. If your favorite product does not support Unicode (e.g., Macromedia Dreamweaver), write to the person in charge of the product and ask that it be Unicode-enabled. In some companies, the 'Internationalization' group handles Unicode.

3. If the script is NOT in Unicode:

Check the following pages on the Unicode website for the latest information on script proposals being developed or in line for approval:

Also, ask on the Unicode email list if a Unicode proposal is being prepared for the missing script.

Options:

a. Use transliteration/transcription

(i.e., use the Latin blocks, with appropriate combining diacritics; avoid Mathematical Operators [for non-math uses], Superscripts and Subscripts [unless for phonetic/phonemic use], Letterlike Symbols, or Number Forms blocks.)
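One way to check a transliterated text against this advice is to scan it for characters from the discouraged blocks. The following Python sketch does so; the block ranges are those of the four named blocks in the Unicode code charts, while the names DISCOURAGED_BLOCKS and flag_discouraged are hypothetical. It only reports candidates for review, since some uses (for example, phonetic superscripts) are legitimate.

```python
# Code point ranges of the blocks named above.
DISCOURAGED_BLOCKS = {
    "Superscripts and Subscripts": (0x2070, 0x209F),
    "Letterlike Symbols": (0x2100, 0x214F),
    "Number Forms": (0x2150, 0x218F),
    "Mathematical Operators": (0x2200, 0x22FF),
}

def flag_discouraged(text):
    """Return (position, code point, block name) for each character in a discouraged block."""
    hits = []
    for index, ch in enumerate(text):
        for block, (start, end) in DISCOURAGED_BLOCKS.items():
            if start <= ord(ch) <= end:
                hits.append((index, f"U+{ord(ch):04X}", block))
    return hits

# A Latin letter with a combining caron is fine and yields no hits.
print(flag_discouraged("s\u030C"))

# DOT OPERATOR used as a word divider is reported for review.
print(flag_discouraged("a\u22C5b"))
```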

b. Work to get the script into Unicode. The steps toward this include:

           

BRIEF EXCURSUS on Characters vs. Glyphs

Unicode covers characters, not glyphs. Characters are abstract and reflect 'the smallest components of written language that have semantic value' (The Unicode Standard 3.0, p. 13), whereas glyphs are the surface representations of characters. It is glyphs that appear on the printed page or on your monitor.

For the abstract character 'a' (LATIN SMALL LETTER A), one can find the following glyphs: a, a, a, a (and many more). Unicode concerns only the abstract characters; the font provides the glyphs.

Determining a character vs. a glyph can be difficult when working with historic texts. Key questions to ask are:

  • Does the particular letter/sign contrast with another in the same document, with a different meaning? If so, it is a character.
  • Is its appearance predictable? If so, it may be a contextual variant, and not eligible for encoding.
  • Can the letter/sign be interchanged with another and still have the same meaning? If so, it is a glyph.

c. But how can one work with the script in the meantime?

One approach:

Other approaches:

Note: The Text Encoding Initiative is preparing guidelines, but nothing definitive has been agreed upon yet.

4. What to do if a script is missing a few characters?

5. What to do in order to show variants or characters that will never be accepted into Unicode?

6. What about my data that is in a non-Unicode font?

Future Directions and Needs

Other questions about Unicode?


Appendix:

Ancient Scripts Currently Included in Unicode 3.2

Note 1: Plane 0 designates the location of a script in the Basic Multilingual Plane, which is currently well supported in modern operating systems. Plane 1, or the Supplementary Multilingual Plane, will contain many of the historic scripts. Software support for accessing Plane 1 is currently being developed, as are fonts. For directions on using characters from this area in Windows 2000, NT, or XP, see www.i18nguy.com/surrogates.html.

Note 2: For a discussion of the Early Semitic scripts and their proposed grouping into scripts, see the following document from Michael Everson from January 2001: http://std.dkuug.dk/JTC1/SC2/WG2/docs/N2311.pdf

Scripts used for Indo-European languages are italicized in the list below.

Ancient Near Eastern Scripts

Arabic (http://www.unicode.org/charts/PDF/U0600.pdf) Plane 0

Armenian (http://www.unicode.org/charts/PDF/U0530.pdf) Plane 0

Coptic (Some characters are listed in the 'Greek and Coptic' block at http://www.unicode.org/charts/PDF/U0370.pdf. Note: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2444.pdf is a document requesting that additional characters for Coptic be added. Scholarly input is requested on the repertoire in the chart, verifying that the list of characters is complete and the glyphs are representative, along with a statement on why Coptic should be handled as a difference of script, and not style.) Plane 0

Ethiopic (http://www.unicode.org/charts/PDF/U1200.pdf) Plane 0

Greek (N.B.: check the Greek and Greek Extended blocks; additional characters are being proposed, including editorial signs, see: http://www.tlg.uci.edu/~tlg/Uni.prop.html) Plane 0

Hebrew (http://www.unicode.org/charts/PDF/U0590.pdf) (N.B.: Unicode only includes the Tiberian vocalization system; non-Tiberian systems were not included) Plane 0

Syriac (http://www.unicode.org/charts/PDF/U0700.pdf) (includes Estrangelo, Serto, Nestorian, Jacobite, and will include Manichaean and Christian Sogdian) Plane 0

Other Scripts used by Indo-European Languages

Cyrillic (with some historic letters; http://www.unicode.org/charts/PDF/U0400.pdf) and Cyrillic Supplement (http://www.unicode.org/charts/PDF/U0500.pdf) (Note: Additional missing Cyrillic letters are being prepared for submission.) Plane 0

Devanagari (http://www.unicode.org/charts/PDF/U0900.pdf) Plane 0

Gothic (http://www.unicode.org/charts/PDF/U10330.pdf) Plane 1

Latin (5 blocks: Basic Latin [http://www.unicode.org/charts/PDF/U0000.pdf], Latin-1 Supplement [http://www.unicode.org/charts/PDF/U0080.pdf], Latin Extended-A [http://www.unicode.org/charts/PDF/U0100.pdf], Latin Extended-B [http://www.unicode.org/charts/PDF/U0180.pdf], Latin Extended Additional [http://www.unicode.org/charts/PDF/U1E00.pdf]) Plane 0

Ogham (http://www.unicode.org/charts/PDF/U1680.pdf) Plane 0

Old Italic (http://www.unicode.org/charts/PDF/U10300.pdf) Plane 1

Runic (http://www.unicode.org/charts/PDF/U16A0.pdf) Plane 0

Scripts Awaiting Approval by ISO/IEC 10646

Aegean Scripts (Linear B and Cypriot; http://www.evertype.com/standards/iso10646/pdf/n2378-aegean.pdf) Plane 1

Ugaritic cuneiform (proposals: http://www.evertype.com/standards/iso10646/pdf/ugaritic.pdf and http://www.unicode.org/pending/ugaritic/01141-n2338-ugaritic.pdf) Plane 1

Missing from Unicode (but included on Roadmaps, www.unicode.org/roadmaps)

Aramaic (proposal: http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2042.pdf) (Aramaic is expected to include Aramaic proper, Middle Persian, Parthian, and Sogdian) Plane 0

Armazi (see Hatran)

Avestan (proposal: http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1684/n1684.htm; further work currently undertaken by Jost Gippert, TITUS) Plane 0

Byblos (no proposal) Plane 1

Carian (no proposal) Plane 1

Coptic (Some characters are included in the 'Greek and Coptic' block at http://www.unicode.org/charts/PDF/U0370.pdf. Note: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2444.pdf is a document requesting that additional characters for Coptic be added. Scholarly input is requested on the repertoire in the chart, verifying that the list of characters is complete and the glyphs are representative, along with a statement on why Coptic should be handled as a difference of script, and not style.) Plane 0

Cypro-Minoan (no proposal) Plane 1

Egyptian hieroglyphics (current plans are to put forward a proposal that encompasses the Gardiner list and its Supplements [1957 list from the Grammar along with the 1928 and 1953 Supplements]; a more extensive set of Egyptological characters is planned by Wolfgang Schenkel, but funding is needed) Plane 1

Note: A proposal to add 6 Egyptological characters used in transliteration (http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2241.pdf) has been put forward. Comments from scholars on the need for these characters are encouraged.

Elymaic (no proposal) Plane 1

Hatran (no proposal) (Hatran/Armazi) Plane 1

Hieroglyphic Luwian (proposal: http://www.evertype.com/standards/iso10646/pdf/luvian.pdf) Plane 1

Linear A (no proposal) Plane 1

Lycian (no proposal) Plane 1

Lydian (no proposal) Plane 1

Mandaic (proposal: http://www.evertype.com/standards/iso10646/pdf/mandaic.pdf) Plane 1

Meroitic (proposal: http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1638/n1638.htm) Plane 1

Nabataean (no proposal) Plane 1

North Arabic (no proposal) (Includes Dedanite, Lihyanite, Thamudic, and Safaitic) Plane 1

Old Persian cuneiform (http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1639/n1639.htm; further work currently undertaken by Jost Gippert, TITUS) Plane 1

Palmyrene (no proposal) Plane 1

Phoenician (proposal: http://www.evertype.com/standards/plane-1/ph.html) (To include Punic, Neo-Punic, Phoenician proper, Late Phoenician cursive, Phoenician papyrus, Siloam Hebrew, Hebrew seals, Ammonite, Moabite, and Palaeo-Hebrew) Plane 0

Proto-Elamite (no proposal) Plane 1

Samaritan (proposal: http://www.evertype.com/standards/iso10646/pdf/samaritan.pdf) Plane 0

South Arabian (http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1689/n1689.htm) (Includes Epigraphic South Arabian, Later South Arabian, Thamudic Ethiopic, Consonantal Ethiopic) Plane 1

Sumero-Akkadian cuneiform (work being undertaken separately by Karl-Juergen Feuerherm, Lloyd Anderson, and Dean Snyder of Johns Hopkins; Hittite is to be included in this block) Plane 1

Not on the Roadmap

Proto-Sinaitic; Phaistos Disk; Sidetic

To Help on Unicode Proposals

If you are interested in commenting on a proposal or creating one, please contact:

Deborah Anderson, University of California, Berkeley, Department of Linguistics,
1203 Dwinelle Hall #2650, Berkeley, CA 94720-2650; dwanders@socrates.berkeley.edu.

For those scripts with proposals, specific comments are needed on the following:

a. Is the list of characters complete? (Note: Precomposed characters, ligatures, and variants, while of use for the creation of a font, should not be included in the basic proposal. This information would be helpful to include in ancillary documentation.)

b. Are the glyphs in the chart representative? (Note: The glyphs are not intended to be prescriptive.)

c. Can you provide specific details about characters' properties (e.g., upper/lower case information)? Any special information needed for font designers or implementers (including a list of ligatures for the font, placement on a line, etc.) can also be included.