The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A> to give the
hexadecimal code point and the normative name of the character.
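
The notation can be reproduced in a few lines of Perl. This is only a minimal sketch: C<unicode_notation> is a hypothetical helper name, and it assumes Perl 5.10 or later (for the C<//> operator) together with the core C<charnames> module.

```perl
use strict;
use warnings;
use charnames ();   # load the name tables; nothing is imported

# Format a code point the way the Unicode standard writes it:
# "U+", at least four hexadecimal digits, then the character name.
# unicode_notation() is a hypothetical helper, not part of any module.
sub unicode_notation {
    my ($code_point) = @_;
    my $name = charnames::viacode($code_point) // '<no name>';
    return sprintf 'U+%04X %s', $code_point, $name;
}

print unicode_notation(0x0041), "\n";   # U+0041 LATIN CAPITAL LETTER A
print unicode_notation(0x03B1), "\n";   # U+03B1 GREEK SMALL LETTER ALPHA
```
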

Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.
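
As a rough illustration, properties can be tested with C<\p{...}> in a regular expression, and the case operations are available through C<uc> and C<lc>. The sample code points are arbitrary choices made for this sketch.

```perl
use strict;
use warnings;

my $upper = "\x{0410}";   # U+0410 CYRILLIC CAPITAL LETTER A
my $lower = lc $upper;    # U+0430 CYRILLIC SMALL LETTER A

# Return "yes" or "no" depending on whether the string matches the pattern.
sub yes_no { my ($string, $pattern) = @_; return $string =~ $pattern ? 'yes' : 'no' }

# Properties are independent of the character names.
printf "U+0410 uppercase:     %s\n", yes_no($upper,     qr/\A\p{Uppercase}\z/);
printf "U+0430 lowercase:     %s\n", yes_no($lower,     qr/\A\p{Lowercase}\z/);
printf "U+0663 decimal digit: %s\n", yes_no("\x{0663}", qr/\A\p{Nd}\z/);  # ARABIC-INDIC DIGIT THREE
printf "U+3002 punctuation:   %s\n", yes_no("\x{3002}", qr/\A\p{P}\z/);   # IDEOGRAPHIC FULL STOP

# Case mapping round trip.
printf "lc gives U+%04X, uc gives U+%04X\n", ord($lower), ord(uc $lower);
```
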

A Unicode I<logical> "character" can actually consist of more than one internal
I<actual> "character" or code point. For Western languages, this is adequately
modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
base character and modifiers is called a I<combining character
sequence>. Some non-Western languages require more complicated
models, so Unicode created the I<grapheme cluster> concept, which was
later further refined into the I<extended grapheme cluster>. For
example, a Korean Hangul syllable is considered a single logical
character, but most often consists of three actual
Unicode characters: a leading consonant followed by an interior vowel followed
by a trailing consonant.
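
The Hangul case can be sketched as follows. The example assumes Perl 5.12 or later, where C<\X> in a regular expression matches one extended grapheme cluster; the particular syllable is an arbitrary choice.

```perl
use strict;
use warnings;

# One Hangul syllable spelled with three conjoining jamo:
#   U+1112 HANGUL CHOSEONG HIEUH    (leading consonant)
#   U+1161 HANGUL JUNGSEONG A       (interior vowel)
#   U+11AB HANGUL JONGSEONG NIEUN   (trailing consonant)
my $syllable = "\x{1112}\x{1161}\x{11AB}";

my $code_points = length $syllable;       # length() counts code points
my @clusters    = $syllable =~ /(\X)/g;   # each \X is one extended grapheme cluster

printf "%d code points, %d extended grapheme cluster(s)\n",
    $code_points, scalar @clusters;
```
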

Whether to call these extended grapheme clusters "characters" depends on your
point of view. If you are a programmer, you probably would tend towards seeing
each element in the sequences as one unit, or "character". However, from
the user's point of view, the whole sequence could be seen as one
"character" since that's probably what it looks like in the context of the
user's language. In this document, we take the programmer's point of
view: one "character" is one Unicode code point.

For some combinations of base character and modifiers, there are
I<precomposed> characters. There is a single character equivalent, for
example, to the sequence C<LATIN CAPITAL LETTER A> followed by
C<COMBINING ACUTE ACCENT>. It is called C<LATIN CAPITAL LETTER A WITH
ACUTE>. These precomposed characters are, however, only available for
some combinations, and are mainly meant to support round-trip
conversions between Unicode and legacy standards (like ISO 8859). Using
sequences, as Unicode does, allows for needing fewer basic building blocks
(code points) to express many more potential grapheme clusters. To
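support conversion between equivalent forms, various I<normalization
forms> are also defined. Thus, C<LATIN CAPITAL LETTER A WITH ACUTE> is
in I<Normalization Form Composed> (abbreviated NFC), and the sequence
C<LATIN CAPITAL LETTER A> followed by C<COMBINING ACUTE ACCENT>
represents the same character in I<Normalization Form Decomposed> (NFD).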
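
The equivalence of the two spellings can be checked with the core C<Unicode::Normalize> module. This is a minimal sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{00C1}";           # LATIN CAPITAL LETTER A WITH ACUTE
my $sequence    = "\x{0041}\x{0301}";   # LATIN CAPITAL LETTER A, COMBINING ACUTE ACCENT

# As plain strings of code points the two spellings differ...
print "raw strings equal: ", ($precomposed eq $sequence ? "yes" : "no"), "\n";

# ...but they compare equal once both are in the same normalization form.
print "equal under NFC:   ", (NFC($sequence) eq NFC($precomposed) ? "yes" : "no"), "\n";
print "equal under NFD:   ", (NFD($sequence) eq NFD($precomposed) ? "yes" : "no"), "\n";

# List the code points of the decomposed form.
my @decomposed = map { sprintf 'U+%04X', ord } split //, NFD($precomposed);
print "NFD of U+00C1:     @decomposed\n";
```
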

Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could
be represented differently in several legacy encodings. The
converse does not hold either: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.
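
One way to see this from Perl is to look at the General_Category property of a code point. The sketch below makes two assumptions: C<describe> is a hypothetical helper, and U+0378 is used as an unallocated code point, which holds for the Unicode versions shipped with Perl so far but could change in a later Unicode release.

```perl
use strict;
use warnings;

# Classify a code point by its General_Category property.
# describe() is a hypothetical helper written for this example.
sub describe {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return 'unassigned' if $char =~ /\A\p{Cn}\z/;   # no character allocated here
    return 'control'    if $char =~ /\A\p{Cc}\z/;   # e.g. CHARACTER TABULATION
    return 'format'     if $char =~ /\A\p{Cf}\z/;   # e.g. LEFT-TO-RIGHT MARK
    return 'assigned character';
}

# U+0378 lies inside the Greek and Coptic block but is assumed unallocated.
printf "U+%04X: %s\n", $_, describe($_) for 0x0041, 0x0378, 0x0009, 0x200E;
```
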

When Unicode was first conceived, it was thought that all the world's
characters could be represented using a 16-bit word; that is, a maximum of
C<0x10000> (or 65536) characters from C<0x0000> to C<0xFFFF> would be
needed. This soon proved to be false, and since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
The first C<0x10000> characters are called I<Plane 0>, or the
I<Basic Multilingual Plane> (BMP). With Unicode 3.1, 17 (yes,
seventeen) planes in all were defined--but they are nowhere near full of
defined characters, yet.
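
Since each plane holds C<0x10000> code points, the plane of a code point is simply the value that remains after dropping its low 16 bits. A minimal sketch, with C<plane_of> as a hypothetical helper name:

```perl
use strict;
use warnings;

# Return the plane number (0..16) that a code point belongs to.
# plane_of() is a hypothetical helper written for this example.
sub plane_of {
    my ($code_point) = @_;
    die sprintf("U+%X is beyond U+10FFFF\n", $code_point)
        if $code_point > 0x10FFFF;
    return $code_point >> 16;   # each plane spans 0x10000 code points
}

printf "U+%04X is in plane %d\n", $_, plane_of($_)
    for 0x0041, 0xFFFF, 0x10000, 0x10FFFF;

printf "number of planes: %d\n", plane_of(0x10FFFF) + 1;   # 17
```
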

When a new language is being encoded, Unicode generally will choose a
C<block> of consecutive unallocated code points for its characters. So
far, the number of code points in these blocks has always been evenly
divisible by 16. Extras in a block, not currently needed, are left
unallocated, for future growth. But there have been occasions when
a later release needed more code points than the available extras, and a
new block had to be allocated somewhere else, not contiguous to the initial
one, to handle the overflow. Thus, it became apparent early on that
"block" wasn't an adequate organizing principle, and so the C