Character Sets
A character set determines which characters are allowed within a document.
A restrictive character set allows only certain types of characters. For example,
a restrictive character set may allow only uppercase characters. A broad
character set, as its name suggests, allows many characters. For example, a
broad character set may include Arabic characters.
ASCII
ASCII is a widely used character set. Each character in the ASCII character
set is represented by a numeric character code. The ASCII character code
for an uppercase "A" is the value 65, and the ASCII character code
for a lowercase "a" is the value 97. Pure ASCII is a 7-bit encoding
scheme, allowing 128 different values. The Windows "ANSI" code pages, such as
windows-1252, extend the ASCII character set to 8 bits to use the full range
of 256 values available in a byte.
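The character codes described above can be checked with a short Python
sketch. The helper name is_ascii is hypothetical and used only for
illustration; ord and str.encode are standard Python built-ins.

```python
# Look up the ASCII character codes mentioned in the text.
upper_a = ord("A")  # code for uppercase "A"
lower_a = ord("a")  # code for lowercase "a"

# A 7-bit scheme allows 2**7 distinct values (codes 0 through 127).
ascii_size = 2 ** 7

# Encoding ASCII text yields one byte per character, equal to its code.
ascii_bytes = "Aa".encode("ascii")


def is_ascii(text: str) -> bool:
    """Hypothetical helper: True if every character fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(upper_a, lower_a, ascii_size, list(ascii_bytes))
print(is_ascii("Hello"), is_ascii("caf\u00e9"))
```

Text containing an accented letter such as "é" falls outside the 128 ASCII
values, which is why the 8-bit extensions were introduced.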
Unicode
The designated character set for XML documents is Unicode, which includes
characters from around the world. The Universal Character Set (UCS) is an
ISO standard (ISO/IEC 10646) that encompasses most of the world's writing
systems. UCS uses multi-octet characters, which are not compatible with many
current applications and protocols. The UCS Transformation Formats (UTF)
standards were developed to overcome the compatibility issue. The two most
widely used encoding schemes for Unicode are UTF-8 and UTF-16. UTF-8 uses
8-bit units and is compatible with 7-bit ASCII: each ASCII character is stored
as a single byte with the same value. UTF-8 represents other characters using
sequences of two to four bytes. UTF-16 uses 16-bit units. A single unit can
represent 65,536 possible values, and characters outside that range are
encoded as a pair of units (a surrogate pair).
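As a rough illustration of how the two encoding schemes differ, the following
Python sketch counts the bytes each one needs for a few sample characters. The
function name encoded_lengths and the choice of sample characters are
assumptions made for this example.

```python
def encoded_lengths(ch: str) -> tuple[int, int]:
    """Hypothetical helper: byte counts of ch in UTF-8 and UTF-16."""
    # "utf-16-be" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


samples = {
    "A": "A",                    # U+0041, plain ASCII
    "e-acute": "\u00e9",         # U+00E9
    "euro": "\u20ac",            # U+20AC
    "emoji": "\U0001F600",       # U+1F600, beyond the 16-bit range
}

for name, ch in samples.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"{name}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")

# An ASCII character has the same single byte in ASCII and in UTF-8.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")

# A single 16-bit unit can hold 2**16 distinct values.
utf16_unit_values = 2 ** 16
```

The last sample needs four bytes in both schemes, since a character beyond
the 16-bit range is stored in UTF-16 as two 16-bit units.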
Specifying a Character Set
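The markup and the character data for the actual text of the document are
both written in Unicode by default. Unless told otherwise, a parser assumes
UTF-8 (or UTF-16 when the document begins with a byte order mark). Because
plain ASCII text is also valid UTF-8, XML documents can be created with plain
text editors.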
The XML declaration may optionally include the character encoding to be used.
This allows you to specify an encoding other than UTF-8. Notepad
for Windows in the UK uses windows-1252 encoding by default. As not all XML
parsers understand windows-1252 encoding, it is better to use the standard
encoding ISO-8859-1, which is similar to the encoding used by Notepad. The two
differ only in the range 128 to 159, where windows-1252 places characters such
as the euro sign and curly quotes. Notepad for Windows 2000 and XP has the
ability to save documents in Unicode, allowing the encoding attribute to be
omitted from the declaration. The following example specifies an encoding of
ISO-8859-1.
<?xml version="1.0" encoding="ISO-8859-1"?>
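To see what the declaration changes in practice, the following Python sketch
parses the same ISO-8859-1 bytes with and without it, using the standard
xml.etree.ElementTree module. The element name "name", its content, and the
helper parses_ok are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# A small document saved as ISO-8859-1; "\u00e9" (e-acute) becomes byte 0xE9.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<name>caf\u00e9</name>"
).encode("iso-8859-1")

# The same bytes without a declaration; the parser will assume UTF-8.
undeclared = "<name>caf\u00e9</name>".encode("iso-8859-1")


def parses_ok(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes parse as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# With the declaration, the parser decodes byte 0xE9 correctly.
root = ET.fromstring(declared)
print(root.text)

# Without it, 0xE9 is not a valid UTF-8 sequence here, so parsing fails.
print(parses_ok(declared), parses_ok(undeclared))
```

The encoding named in the declaration must match the encoding the file was
actually saved in; the declaration describes the bytes and does not convert
them.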