=head1 NAME
perlunifaq - Perl Unicode FAQ
=head1 Q and A
This is a list of questions and answers about Unicode in Perl, intended to be
read after L<perlunitut>.
=head2 perlunitut isn't really a Unicode tutorial, is it?
No, and this isn't really a Unicode FAQ.
Perl has an abstracted interface for all supported character encodings, so this
is actually a generic C<Encode> tutorial and C<Encode> FAQ. But many people
think that Unicode is special and magical, and I didn't want to disappoint
them, so I decided to call the document a Unicode tutorial.
=head2 What character encodings does Perl support?
To find out which character encodings your Perl supports, run:
    perl -MEncode -le "print for Encode->encodings(':all')"
=head2 Which version of perl should I use?
Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
The tutorial and FAQ assume the latest release.
You should also check your modules, and upgrade them if necessary. For example,
HTML::Entities requires version >= 1.32 to function correctly, even though the
changelog is silent about this.
=head2 What about binary data, like images?
Well, apart from a bare C<binmode $fh>, you shouldn't treat them specially.
(The binmode is needed because otherwise Perl may convert line endings on Win32
systems.)
Be careful, though, to never combine text strings with binary strings. If you
need text in a binary stream, encode your text strings first using the
appropriate encoding, then join them with binary strings. See also: "What if I
don't encode?".
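As a minimal sketch of the advice above: the text is encoded before it is joined with binary data. The C<$header> bytes and the length-prefixed record layout are made up for illustration and do not come from any real format; UTF-8 is assumed as the appropriate encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical binary header: four magic bytes, for illustration only.
my $header = "\x89BIN";

# A text string containing one non-ASCII character (U+00E9).
my $text = "caf\x{e9}";

# Encode the text first, then join the resulting bytes with the binary data.
my $payload = encode('UTF-8', $text);
my $record  = $header . pack('n', length $payload) . $payload;

printf "%d bytes\n", length $record;    # 11 bytes
```

The length prefix counts bytes, not characters, which is only meaningful because C<$payload> is already encoded.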
=head2 When should I decode or encode?
Whenever you're communicating text with anything that is external to your perl
process, like a database, a text file, a socket, or another program. Even if
the thing you're communicating with is also written in Perl.
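The usual pattern is to decode at the point of input, work with text strings inside the program, and encode at the point of output. A minimal sketch, assuming the external data is UTF-8 (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket: UTF-8 for "naive"
# spelled with U+00EF.
my $bytes_in = "na\xc3\xafve";

my $text      = decode('UTF-8', $bytes_in);   # bytes -> text string
my $upper     = uc $text;                     # operate on characters
my $bytes_out = encode('UTF-8', $upper);      # text string -> bytes

printf "%d characters, %d bytes\n", length $text, length $bytes_out;
# 5 characters, 6 bytes
```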
=head2 What if I don't decode?
Whenever your encoded, binary string is used together with a text string, Perl
will assume that your binary string was encoded with ISO-8859-1, also known as
latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For
example, if it was UTF-8, the individual bytes of multibyte characters are seen
as separate characters, and then again converted to UTF-8. Such double encoding
can be compared to double HTML encoding (C<&amp;gt;>), or double URI encoding
(C<%253E>).
This silent implicit decoding is known as "upgrading". That may sound
positive, but it's best to avoid it.
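The effect can be reproduced in a few lines. In this sketch, C<$bytes> holds UTF-8 that was never decoded, and joining it with a real text string makes Perl treat each of its bytes as a latin-1 character:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xc3\xa9";                          # UTF-8 for U+00E9, not decoded
my $text  = decode('UTF-8', "\xe2\x82\xac");     # a text string: U+20AC

# The two bytes in $bytes become the characters U+00C3 and U+00A9.
my $joined = $text . $bytes;

my $out = encode('UTF-8', $joined);
print unpack('H*', $out), "\n";                  # e282acc383c2a9
```

The trailing C<c383c2a9> is the double-encoded form of what should have been C<c3a9>. Decoding C<$bytes> before the concatenation avoids it.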
=head2 What if I don't encode?
Your text string will be sent using the bytes in Perl's internal format. In
some cases, Perl will warn you that you're doing something wrong, with a
friendly warning:
    Wide character in print at example.pl line 2.
The internal format is often UTF-8, which makes these bugs hard to spot,
because UTF-8 is usually the encoding you wanted! But don't be lazy, and don't
use the fact that Perl's internal format is UTF-8 to your advantage. Encode
explicitly to avoid weird bugs, and to show to maintenance programmers that you
thought this through.
=head2 Is there a way to automatically decode or encode?
If all data that comes from a certain handle is encoded in exactly the same
way, you can tell the PerlIO system to automatically decode everything, with
the C<:encoding> layer. If you do this, you can't accidentally forget to decode
or encode anymore, on things that use the layered handle.
You can provide this layer when C<open>ing the file:
    open my $fh, '>:encoding(UTF-8)', $filename;  # auto encoding on write
    open my $fh, '<:encoding(UTF-8)', $filename;  # auto decoding on read
Or if you already have an open filehandle:
    binmode $fh, ':encoding(UTF-8)';
Some database drivers for DBI can also automatically encode and decode, but
that is sometimes limited to the UTF-8 encoding.
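A small round trip shows the layer doing the work in both directions. This is a sketch that writes to a temporary file created with C<File::Temp>; the temporary file and the variable names are only there for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file used only for this illustration; removed at exit.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "\x{263A} smile";    # 7 characters, the first one is wide

open my $out, '>:encoding(UTF-8)', $filename or die "open: $!";
print {$out} $text;             # encoded automatically on write
close $out or die "close: $!";

open my $in, '<:encoding(UTF-8)', $filename or die "open: $!";
my $read = <$in>;               # decoded automatically on read
close $in;

printf "%d bytes on disk, %d characters read\n",
    (-s $filename), length $read;
# 9 bytes on disk, 7 characters read
```

No "Wide character" warning is issued, because the handle knows how to encode the string.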
=head2 What if I don't know which encoding was used?
Do whatever you can to find out, and if you have to: guess. (Don't forget to
document your guess with a comment.)
You could open the document in a web browser, and change the character set or
character encoding until you can visually confirm that all characters look the
way they should.
There is no way to reliably detect the encoding automatically, so if people
keep sending you data without charset indication, you may have to educate them.
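If you do have to guess, one common heuristic is to try a strict UTF-8 decode first and fall back to your documented guess when that fails. The sketch below is not a reliable detector: C<decode_guess> is a hypothetical helper, and the ISO-8859-1 fallback accepts any byte sequence, so a wrong guess goes unnoticed.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded text and the encoding that was used.
sub decode_guess {
    my ($bytes, $fallback) = @_;

    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $bytes unmodified.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($text, 'UTF-8') if defined $text;

    # Guess: the data is in the fallback encoding.
    return (decode($fallback, $bytes), $fallback);
}

my (undef, $enc1) = decode_guess("caf\xc3\xa9", 'ISO-8859-1');
my (undef, $enc2) = decode_guess("caf\xe9",     'ISO-8859-1');
print "$enc1 $enc2\n";    # UTF-8 ISO-8859-1
```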
=head2 Can I use Unicode in my Perl sources?
Yes, you can! If your sources are UTF-8 encoded, you can indicate that with the
C<use utf8>