Extensible Markup Language (XML) Tutorial - Character Sets

gez

Character Sets

A character set determines which characters are allowed within a document. A restrictive character set allows only certain types of characters; for example, it may allow only uppercase letters. A broad character set, as its name suggests, allows many characters; for example, it may include Arabic characters.

ASCII

ASCII is a widely used character set. Each character in the ASCII character set is represented by a numeric character code. The ASCII code for an uppercase "A" is 65, and the code for a lowercase "a" is 97. Pure ASCII is a 7-bit encoding scheme, allowing 128 different values. Extended character sets, such as the Windows code pages commonly referred to as "ANSI", extend ASCII to 8 bits to use the full range of 256 values available in a byte.
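The character codes mentioned above can be checked with a short Python sketch. The helper name `is_pure_ascii` is an assumption made for illustration and is not part of the original tutorial.

```python
# Look up character codes with the built-in ord() function.
upper_a = ord("A")
lower_a = ord("a")
print(upper_a, lower_a)  # 65 97

# 7 bits allow 2**7 = 128 values; a full byte allows 2**8 = 256.
ascii_size = 2 ** 7
byte_size = 2 ** 8
print(ascii_size, byte_size)  # 128 256


def is_pure_ascii(text):
    """Return True when every character has a code below 128.

    Hypothetical helper, named here for illustration only.
    """
    return all(ord(ch) < ascii_size for ch in text)


print(is_pure_ascii("XML"))        # True
print(is_pure_ascii("caf\u00e9"))  # False: U+00E9 lies outside 7-bit ASCII
```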

Unicode

The designated character set for XML documents is Unicode, which includes characters from writing systems around the world. The Universal Character Set (UCS) is an ISO standard (ISO/IEC 10646) that encompasses most of the world's writing systems. UCS uses multi-octet characters, which are not compatible with many current applications and protocols. The UCS Transformation Formats (UTF) were developed to overcome this compatibility issue. The two most widely used encoding schemes for Unicode are UTF-8 and UTF-16. UTF-8 uses 8-bit code units and is compatible with 7-bit ASCII: each ASCII character is encoded as a single byte with the same value, and every other character is encoded as a sequence of two to four bytes. UTF-16 uses 16-bit code units. A single unit can represent 65,536 possible values, and characters beyond that range are encoded as a pair of units known as a surrogate pair.
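As a rough illustration of how the two encodings differ in size, the following Python sketch encodes a few sample characters both ways. The function name `encoded_lengths` and the choice of sample characters are assumptions made for this example.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character.

    Hypothetical helper, named here for illustration only.
    """
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, without a byte order mark
    return len(utf8), len(utf16)


# Sample characters, written as escapes so this file itself stays pure ASCII.
samples = {
    "A": "A",                    # U+0041, in the ASCII range
    "e-acute": "\u00e9",         # U+00E9
    "euro sign": "\u20ac",       # U+20AC
    "emoji": "\U0001F600",       # U+1F600, beyond the 16-bit range
}

for name, ch in samples.items():
    print(name, encoded_lengths(ch))

# ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = "XML".encode("ascii") == "XML".encode("utf-8")
print(same_bytes)  # True
```

Under these assumptions, the ASCII letter takes one byte in UTF-8 and two in UTF-16, while the character beyond the 16-bit range takes four bytes in both, as a surrogate pair in UTF-16.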

Specifying a Character Set

The markup and the character data for the actual text of the document are both written in Unicode by default. When a document carries no encoding information, an XML parser assumes UTF-8. This enables XML documents to be created in plain text editors.

The XML declaration may optionally include the character encoding used by the document. This allows you to specify an encoding other than UTF-8 or UTF-16, the encodings a parser can detect without being told. Notepad for Windows in the UK uses windows-1252 encoding by default. As not all XML parsers understand windows-1252, it is better to declare the standard ISO-8859-1 encoding, which is similar to the encoding used by Notepad. The two differ only in the byte range 0x80 to 0x9F, where windows-1252 places printable characters such as the euro sign and curly quotes, so avoid those characters if you declare ISO-8859-1. Notepad for Windows 2000 and XP can also save documents in Unicode, allowing the encoding to be omitted from the declaration. The following example specifies an encoding of ISO-8859-1.

<?xml version="1.0" encoding="ISO-8859-1"?>
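To see the declaration at work, here is a minimal Python sketch that parses an ISO-8859-1 document with the standard library's `xml.etree.ElementTree` module. The `note` element and its text are made up for illustration and do not come from the tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <note> element and its content are invented.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><note>caf\u00e9</note>'

# Encode to bytes the way an editor saving as ISO-8859-1 would.
# The accented letter becomes the single byte 0xE9.
raw = text.encode("iso-8859-1")

# Given bytes, the parser reads the encoding from the declaration
# and decodes the rest of the document accordingly.
root = ET.fromstring(raw)
print(root.tag)                   # note
print(root.text == "caf\u00e9")   # True
```

Passing bytes rather than an already decoded string matters here, since it leaves the decoding to the parser and so exercises the declared encoding.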


Comments

  • Excellent!

    Posted by Jet Blazer on 19 Jan 2006

    Beautiful article. Takes a user from a beginner level to an intermediate level very nicely. Certainly a treat for beginners and experts alike.

  • Test

    Posted by prathiba on 04 Apr 2004

    Good article for starters