=encoding utf8
=head1 NAME
perllocale - Perl locale handling (internationalization and localization)
=head1 DESCRIPTION
In the beginning there was ASCII, the "American Standard Code for
Information Interchange", which works quite well for Americans with
their English alphabet and dollar-denominated currency. But it doesn't
work so well even for other English speakers, who may use different
currencies, such as the pound sterling (as the symbol for that currency
is not in ASCII); and it's hopelessly inadequate for many of the
thousands of the world's other languages.
To address these deficiencies, the concept of locales was invented
(formally the ISO C, XPG4, POSIX 1.c "locale system"). And applications
were and are being written that use the locale mechanism. The process of
making such an application take account of its users' preferences in
these kinds of matters is called B<internationalization> (often
abbreviated as B<i18n>); telling such an application about a particular
set of preferences is known as B<localization> (B<l10n>).
Perl was extended, starting in 5.004, to support the locale system. This
is controlled per application by using one pragma, one function call,
and several environment variables.
Unfortunately, there are quite a few deficiencies with the design (and
often, the implementations) of locales, and their use for character sets
has mostly been supplanted by Unicode (see L for an
introduction to that, and keep on reading here for how Unicode interacts
with locales in Perl).
Perl continues to support the old locale system, and starting in v5.16,
provides a hybrid way to use the Unicode character set, along with the
other portions of locales that may not be so problematic.
(Unicode is also creating C<CLDR>, the "Common Locale Data Repository",
L which includes more types of information than
are available in the POSIX locale system. At the time of this writing,
there was no CPAN module that provides access to this XML-encoded data.
However, many of its locales have the POSIX-only data extracted, and are
available at L<http://unicode.org/Public/cldr/latest/>.)
=head1 WHAT IS A LOCALE
A locale is a set of data that describes various aspects of how various
communities in the world categorize their world. These categories are
broken down into the following types (some of which include a brief
note here):
=over
=item Category LC_NUMERIC: Numeric formatting
This indicates how numbers should be formatted for human readability,
for example the character used as the decimal point.
=item Category LC_MONETARY: Formatting of monetary amounts
=for comment
The nbsp below makes this look better
E<160>
=item Category LC_TIME: Date/Time formatting
=for comment
The nbsp below makes this look better
E<160>
=item Category LC_MESSAGES: Error and other messages
This is, for the most part, beyond the scope of Perl.
=item Category LC_COLLATE: Collation
This indicates the ordering of letters for comparison and sorting.
In Latin alphabets, for example, "b" generally follows "a".
=item Category LC_CTYPE: Character Types
This indicates, for example, whether a character is an uppercase letter.
=back
More details on the categories are given below in L.
Together, these categories go a long way towards being able to customize
a single program to run in many different locations. But there are
deficiencies, so keep reading.
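
As a small illustration of what one category holds, the following sketch reads the C<LC_NUMERIC> data through the core POSIX module. It pins the process to the portable "C" locale first, which is an assumption made only so that the result is predictable; with another locale installed and selected, the decimal point may differ.

```perl
use strict;
use warnings;
use POSIX qw(setlocale localeconv LC_ALL);

# Force the portable "C" locale so the output is predictable.
setlocale(LC_ALL, "C");

# localeconv() returns a hash reference describing the numeric and
# monetary formatting conventions of the current locale.
my $conv          = localeconv();
my $decimal_point = $conv->{decimal_point};

print "decimal point: '$decimal_point'\n";
```

In the "C" locale the decimal point is a period.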
=head1 PREPARING TO USE LOCALES
Perl will not use locales unless specifically requested to (see
L</write() and LC_NUMERIC> below for the partial exception of C<write()>).
But even if there is such a request, B<all> of the following must be true
for it to work properly:
=over 4
=item *
B<Your operating system must support the locale system>. If it does,
you should find that the setlocale() function is a documented part of
its C library.
=item *
B<Definitions for the locales that you use must be installed>. You, or
your system administrator, must make sure that this is the case. The
available locales, the location in which they are kept, and the manner
in which they are installed all vary from system to system. Some systems
provide only a few, hard-wired locales and do not allow more to be
added. Others allow you to add "canned" locales provided by the system
supplier. Still others allow you or the system administrator to define
and add arbitrary locales. (You may have to ask your supplier to
provide canned locales that are not delivered with your operating
system.) Read your system documentation for further illumination.
=item *
B<Perl must believe that the locale system is supported>. If it does,
C<perl -V:d_setlocale> will say that the value for C<d_setlocale> is
C<define>.
=back
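
The last requirement can also be checked from inside a program. This is a minimal sketch using the core C<Config> module, which exposes the same build-time values that C<perl -V> prints; the variable name C<$has_setlocale> is only illustrative.

```perl
use strict;
use warnings;
use Config;

# %Config holds the values perl was built with; d_setlocale is
# 'define' when the build found a usable setlocale().
my $has_setlocale = defined $Config{d_setlocale}
                 && $Config{d_setlocale} eq 'define';

print $has_setlocale
    ? "This perl was built with setlocale() support\n"
    : "This perl was built without setlocale() support\n";
```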
If you want a Perl application to process and present your data
according to a particular locale, the application code should include
the S<C<use locale>> pragma (see L</The use locale pragma>) where
appropriate, and B<at least one> of the following must be true:
=over 4
=item 1
B<The locale-determining environment variables (see L</"ENVIRONMENT">)
must be correctly set up> at the time the application is started, either
by yourself or by whoever set up your system account; or
=item 2
B<The application must set its own locale> using the method described in
L</The setlocale function>.
=back
=head1 USING LOCALES
=head2 The use locale pragma
By default, Perl ignores the current locale. The S<C<use locale>>
pragma tells Perl to use the current locale for some operations.
Starting in v5.16, there is an optional parameter to this pragma:

    use locale ':not_characters';

This parameter allows better mixing of locales and Unicode, and is
described fully in L</Unicode and UTF-8>, but briefly, it tells Perl to
not use the character portions of the locale definition, that is
the C<LC_CTYPE> and C<LC_COLLATE> categories. Instead it will use the
native (extended by Unicode) character set. When using this parameter,
you are responsible for getting the external character set translated
into the native/Unicode one (which it already will be if it is one of
the increasingly popular UTF-8 locales). There are convenient ways of
doing this, as described in L</Unicode and UTF-8>.
The current locale is set at execution time by
L<setlocale()|/The setlocale function> described below. If that function
hasn't yet been called in the course of the program's execution, the
current locale is that which was determined by the L</"ENVIRONMENT"> in
effect at the start of the program, except that
C<L<LC_NUMERIC|/Category LC_NUMERIC: Numeric formatting>> is always
initialized to the C locale (mentioned under L</Finding locales>).
If there is no valid environment, the current locale is undefined. It
is likely, but not necessarily, the "C" locale.
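
A minimal sketch of querying and then changing the current locale with the C<setlocale()> function exported by the core POSIX module. Switching to the "C" locale is an assumption chosen here because that locale exists everywhere; a real application would name the locale it needs.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE LC_ALL);

# With one argument, setlocale() only queries the category.
my $at_startup = setlocale(LC_CTYPE);

# With two arguments it changes the locale and returns the new name,
# or undef if the request could not be honored.
my $now = setlocale(LC_ALL, "C");

printf "LC_CTYPE at startup: %s\n",
    defined $at_startup ? $at_startup : '(unknown)';
printf "after setlocale(LC_ALL, \"C\"): %s\n",
    defined $now ? $now : '(failed)';
```

Checking the return value matters: an undefined result means the requested locale is not available and the previous one is still in effect.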
The operations that are affected by locale are:
=over 4
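=item B<Under C<use locale ':not_characters';>>
=over 4
=item *
B<Format declarations> (format()) use C<LC_NUMERIC>
=item *
B<The POSIX date formatting function> (strftime()) uses C<LC_TIME>.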
=back
=for comment
The nbsp below makes this look better
E<160>
=item B<Under just plain C<use locale;>>
The above operations are affected, as well as the following:
=over 4
=item *
B<The comparison operators> (C<lt>, C<le>, C<cmp>, C<ge>, and C<gt>) and
the POSIX string collation functions strcoll() and strxfrm() use
C<LC_COLLATE>. sort() is also affected if used without an
explicit comparison function, because it uses C<cmp> by default.
B<Note:> C<eq> and C<ne> are unaffected by locale: they always
perform a char-by-char comparison of their scalar operands. What's
more, if C<cmp> finds that its operands are equal according to the
collation sequence specified by the current locale, it goes on to
perform a char-by-char comparison, and only returns I<0> (equal) if the
operands are char-for-char identical. If you really want to know whether
two strings--which C<eq> and C<cmp> may consider different--are equal
as far as collation in the locale is concerned, see the discussion in
L</Category LC_COLLATE: Collation>.
=item *
B<Regular expressions and case-modification functions> (uc(), lc(),
ucfirst(), and lcfirst()) use C<LC_CTYPE>
=back
=back
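
The scoping of the pragma and the collation functions can be sketched as follows. The collation category is pinned to the "C" locale, an assumption made only so that the ordering is reproducible; under another locale the sorted order may differ.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE strcoll);

# Pin the collation category to "C" so the result is reproducible.
setlocale(LC_COLLATE, "C");

my @words = qw(banana Apple cherry);
my @sorted;
{
    use locale;              # cmp, and therefore sort, now consult LC_COLLATE
    @sorted = sort @words;
}                            # locale handling ends with the enclosing block

# strcoll() returns 0 when the locale considers two strings to collate equal.
my $collate_equal = strcoll("abc", "abc") == 0 ? 1 : 0;

print "@sorted\n";
print "collate equal: $collate_equal\n";
```

Calling strcoll() directly is one way to ask whether two strings are equal as far as the locale's collation is concerned, without the char-by-char tie-break.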
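The default behavior is restored with the S<C<no locale>> pragma, or
upon reaching the end of the block enclosing C<use locale>.
If you want Perl to pay attention to locale information, you must use the
S<C<use locale>> pragma or, for pattern matching alone, the C</l>
regular expression modifier (see L<perlre/Character set modifiers>) to
instruct it to do so.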
Versions of Perl from 5.002 to 5.003 did use the C<LC_CTYPE>
information if available; that is, C<\w> did understand what
were the letters according to the locale environment variables.
The problem was that the user had no control over the feature:
if the C library supported locales, Perl used them.
=head2 I18N:Collate obsolete
In versions of Perl prior to 5.004, per-locale collation was possible
using the C<I18N::Collate> library module. This module is now mildly
obsolete and should be avoided in new applications. The C<LC_COLLATE>
functionality is now integrated into the Perl core language: one can
use locale-specific scalar data completely normally with C<use locale>,
so there is no longer any need to juggle with the scalar references of
C<I18N::Collate>.
=head2 Sort speed and memory use impacts
Comparing and sorting by locale is usually slower than the default
sorting; slow-downs of two to four times have been observed. It will
also consume more memory: once a Perl scalar variable has participated
in any string comparison or sorting operation obeying the locale
collation rules, it will take 3-15 times more memory than before. (The
exact multiplier depends on the string's contents, the operating system
and the locale.) These downsides are dictated more by the operating
system's implementation of the locale system than by Perl.
=head2 write() and LC_NUMERIC
If a program's environment specifies an LC_NUMERIC locale and C<use
locale> is in effect when the format is declared, the locale is used
to specify the decimal point character in formatted output. Formatted
output cannot be controlled by C<use locale> at the time when write()
is called.
=head2 Freely available locale definitions
The Unicode CLDR project extracts the POSIX portion of many of its
locales, available at
http://unicode.org/Public/cldr/latest/
There is a large collection of locale definitions at:
http://std.dkuug.dk/i18n/WG15-collection/locales/
You should be aware that it is
unsupported, and is not claimed to be fit for any purpose. If your
system allows installation of arbitrary locales, you may find the
definitions useful as they are, or as a basis for the development of
your own locales.
=head2 I18n and l10n
"Internationalization" is often abbreviated as B<i18n> because its first
and last letters are separated by eighteen others. (You may guess why
the internalin ... internaliti ... i18n tends to get abbreviated.) In
the same way, "localization" is often abbreviated to B<l10n>.
=head2 An imperfect standard
Internationalization, as defined in the C and POSIX standards, can be
criticized as incomplete, ungainly, and having too large a granularity.
(Locales apply to a whole process, when it would arguably be more useful
to have them apply to a single thread, window group, or whatever.) They
also have a tendency, like standards groups, to divide the world into
nations, when we all know that the world can equally well be divided
into bankers, bikers, gamers, and so on.
=head1 Unicode and UTF-8
Unicode support first appeared in Perl v5.6, and was more fully
implemented in version v5.8 and later. See L. It is
strongly recommended that when combining Unicode and locale (starting in
v5.16), you use

    use locale ':not_characters';

When this form of the pragma is used, only the non-character portions of
locales are used by Perl, for example C<LC_NUMERIC>. Perl assumes that
you have translated all the characters it is to operate on into Unicode
(actually the platform's native character set (ASCII or EBCDIC) plus
Unicode). For data in files, this can conveniently be done by also
specifying

    use open ':locale';

This pragma arranges for all inputs from files to be translated into
Unicode from the current locale as specified in the environment (see
L</ENVIRONMENT>), and all outputs to files to be translated back
into the locale. (See L<open>). On a per-filehandle basis, you can
instead use the L<PerlIO::locale> module, or the L<Encode::Locale>
module, both available from CPAN. The latter module also has methods to
ease the handling of C<ARGV> and environment variables, and can be used
on individual strings. Also, if you know that all your locales will be
UTF-8, as many are these days, you can use the L<B<-C>|perlrun/-C>
command line switch.
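
When the data does not come through a filehandle layer, the translation into Unicode can also be done by hand. The following is a minimal sketch that assumes the external bytes are UTF-8 and decodes them with the core C<Encode> module; the sample word and variable names are only illustrative.

```perl
use strict;
use warnings;
use locale ':not_characters';   # needs Perl v5.16 or later
use Encode qw(decode);

# Bytes as they might arrive from a UTF-8 source: the German word
# for "street", whose fifth character occupies two bytes.
my $bytes = "stra\xC3\x9Fe";

# Translate the external encoding into Perl's native/Unicode characters.
my $text  = decode('UTF-8', $bytes);

# Case mapping follows Unicode rules, independent of LC_CTYPE.
my $upper = uc $text;

printf "%d bytes in, %d characters decoded, uppercased to %s\n",
    length $bytes, length $text, $upper;
```

Seven bytes decode to six characters, and the Unicode uppercase of the sharp s is the two-letter sequence "SS".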
This form of the pragma allows essentially seamless handling of locales
with Unicode. The collation order will be Unicode's. It is strongly
recommended that when you need to order and sort strings that you use
the standard module L<Unicode::Collate> which gives much better results
in many instances than you can get with the old-style locale handling.
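
A short sketch of the difference between a plain code point sort and the Unicode Collation Algorithm, using a default C<Unicode::Collate> object with no tailoring. The word list is only illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = qw(banana apple Cherry);

# Plain sort compares code points, so uppercase "C" comes first.
my @by_codepoint = sort @words;

# Unicode::Collate applies the Unicode Collation Algorithm instead,
# which compares the letters before it considers their case.
my @by_uca = Unicode::Collate->new->sort(@words);

print "code point order: @by_codepoint\n";
print "UCA order:        @by_uca\n";
```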
For pre-v5.16 Perls, or if you use the locale pragma without the
C<:not_characters> parameter, Perl tries to work with both Unicode and
locales--but there are problems.
In this case Perl does not handle multi-byte locales, such as Big5 or
Shift JIS, which have been used for various Asian languages.
However, the increasingly
common multi-byte UTF-8 locales, if properly implemented, may work
reasonably well (depending on your C library implementation) in this
form of the locale pragma, simply because both
they and Perl store characters that take up multiple bytes the same way.
However, some, if not most, C library implementations may not process
the characters in the upper half of the Latin-1 range (128 - 255)
properly under LC_CTYPE. To see if a character is a particular type
under a locale, Perl uses functions like C<isalnum()>. Your C
library may not work for UTF-8 locales with those functions, instead
only working under the newer wide library functions like C<iswalnum()>.
Perl generally takes the tack of using locale rules on code points that can fit
in a single byte, and Unicode rules for those that can't (though this
isn't uniformly applied, see the note at the end of this section). This
prevents many problems in locales that aren't UTF-8. Suppose the locale
is ISO8859-7, Greek. The character at 0xD7 there is a capital Chi. But
in the ISO8859-1 locale, Latin1, it is a multiplication sign. The POSIX
regular expression character class C<[[:alpha:]]> will magically match
0xD7 in the Greek locale but not in the Latin one.
However, there are places where this breaks down. Certain constructs are
for Unicode only, such as C<\p{Alpha}>. They assume that 0xD7 always has its
Unicode meaning (or the equivalent on EBCDIC platforms). Since Latin1 is a
subset of Unicode and 0xD7 is the multiplication sign in both Latin1 and
Unicode, C<\p{Alpha}> will never match it, regardless of locale. A similar
issue occurs with C<\N{...}>. It is therefore a bad idea to use C<\p{}> or
C<\N{}> under plain C<use locale>--I<unless> you can guarantee that the
locale will be ISO8859-1. Use POSIX character classes instead.
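
The fixed Unicode meaning of C<\p{Alpha}> can be seen without involving any locale at all. This sketch tests the two code points directly; 0xC0 is included only as a contrasting example of a code point that is a letter in Unicode.

```perl
use strict;
use warnings;

# No "use locale" here: \p{} classes always use Unicode meanings.
my $times_sign = "\xD7";   # U+00D7 MULTIPLICATION SIGN
my $a_grave    = "\xC0";   # U+00C0 LATIN CAPITAL LETTER A WITH GRAVE

my $d7_is_alpha = $times_sign =~ /\p{Alpha}/ ? 1 : 0;
my $c0_is_alpha = $a_grave    =~ /\p{Alpha}/ ? 1 : 0;

print "0xD7 matches \\p{Alpha}: $d7_is_alpha\n";
print "0xC0 matches \\p{Alpha}: $c0_is_alpha\n";
```

The multiplication sign never matches, which is why text in a Greek locale that uses 0xD7 for capital Chi cannot be classified correctly this way.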
Another problem with this approach is that operations that cross the
single byte/multiple byte boundary are not well-defined, and so are
disallowed. (This boundary is between the code points at 255/256.)
For example, lower casing LATIN CAPITAL LETTER Y WITH DIAERESIS (U+0178)
should return LATIN SMALL LETTER Y WITH DIAERESIS (U+00FF). But in the
Greek locale, for example, there is no character at 0xFF, and Perl
has no way of knowing what the character at 0xFF is really supposed to
represent. Thus it disallows the operation. In this mode, the
lowercase of U+0178 is itself.
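
For comparison, this sketch shows the Unicode result when no locale is in effect. It deliberately omits C<use locale>, so it demonstrates only the mapping that the locale mode refuses to perform.

```perl
use strict;
use warnings;

my $capital_y_diaeresis = "\x{178}";   # LATIN CAPITAL LETTER Y WITH DIAERESIS

# Without "use locale", Unicode rules apply and the result crosses the
# 255/256 boundary down to U+00FF.
my $lower = lc $capital_y_diaeresis;

printf "lc(U+%04X) = U+%04X\n", ord $capital_y_diaeresis, ord $lower;
```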
The same problems ensue if you enable automatic UTF-8-ification of your
standard file handles, default C<open()> layer, and C<@ARGV> on non-ISO8859-1,
non-UTF-8 locales (by using either the B<-C> command line switch or the
C<PERL_UNICODE> environment variable; see L<perlrun>).
Things are read in as UTF-8, which would normally imply a Unicode
interpretation, but the presence of a locale causes them to be interpreted
in that locale instead. For example, a 0xD7 code point in the Unicode
input, which should mean the multiplication sign, won't be interpreted by
Perl that way under the Greek locale. This is not a problem
I<provided> you make certain that all locales will always and only be either
an ISO8859-1, or, if you don't have a deficient C library, a UTF-8 locale.
Vendor locales are notoriously buggy, and it is difficult for Perl to test
its locale-handling code because this interacts with code that Perl has no
control over; therefore the locale-handling code in Perl may be buggy as
well. (However, the Unicode-supplied locales should be better, and
there is a feedback mechanism to correct any problems. See
L</Freely available locale definitions>.)
If you have Perl v5.16, the problems mentioned above go away if you use
the C<:not_characters> parameter to the locale pragma (except for vendor
bugs in the non-character portions). If you don't have v5.16, and you
I<do> have locales that work, using them may be worthwhile for certain
specific purposes, as long as you keep in mind the gotchas already
mentioned. For example, if the collation for your locales works, it
runs faster under locales than under L<Unicode::Collate>; and you gain
access to such things as the local currency symbol and the names of the
months and days of the week. (But to hammer home the point, in v5.16,
you get this access without the downsides of locales by using the
C<:not_characters> form of the pragma.)
Note: The policy of using locale rules for code points that can fit in a
byte, and Unicode rules for those that can't is not uniformly applied.
Pre-v5.12, it was somewhat haphazard; in v5.12 it was applied fairly
consistently to regular expression matching except for bracketed
character classes; in v5.14 it was extended to all regex matches; and in
v5.16 to the casing operations such as C<"\L"> and C<uc()>. For
collation, in all releases, the system's C<strxfrm()> function is called,
and whatever it does is what you get.
=head1 BUGS
=head2 Broken systems
In certain systems, the operating system's locale support
is broken and cannot be fixed or used by Perl. Such deficiencies can
and will result in mysterious hangs and/or Perl core dumps when
C<use locale> is in effect. When confronted with such a system,
please report in excruciating detail to <F<perlbug@perl.org>>, and
also contact your vendor: bug fixes may exist for these problems
in your operating system. Sometimes such bug fixes are called an
operating system upgrade.
=head1 HISTORY
Jarkko Hietaniemi's original F heavily hacked by Dominic
Dunlop, assisted by the perl5-porters. Prose worked over a bit by
Tom Christiansen, and updated by Perl 5 porters.