
Character Encoding and Whitespace

Character data in XML is all text that is contained within elements and in attribute values. The text of an article and all metadata is made up of character data. A wide variety of characters appear in Taylor & Francis journal content, owing to the international nature of our journals. The Unicode character set is to be used for all TF JATS XML. Unicode contains the characters of most of the world’s scripts, and is also the standard character set for XML. While the Unicode standard and XML have several different encoding options available to represent character data, only certain encodings should be used in Taylor & Francis JATS XML.

Many different systems and software packages are involved in the creation and processing of Taylor & Francis content. Simply using UTF-8 encoding in each XML file would not guarantee accurate preservation of character data across all systems and software. A very specific set of rules for encoding character data, described here, should be adhered to in the creation of all TF JATS XML files. These rules aim for the highest degree of compatibility, and least opportunity for data corruption, when files are transferred and processed on different systems.

Encoding in XML Declaration

Each TF JATS XML file must declare the character encoding in an attribute of the XML declaration that appears at the very beginning of the file. For example:

<?xml version="1.0" encoding="ISO646-US"?>

An XML file must be saved using the same encoding that is indicated in the encoding attribute, and must not contain more than one encoding. For example, if characters in another encoding are pasted into an XML file without being converted to the declared encoding, those characters will be corrupted because their bytes do not match the encoding that is specified in the encoding attribute.
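As an illustration only, the following Python sketch checks whether the bytes of a file can be decoded with the encoding named in its XML declaration. The function names and the regular expression are assumptions made for this example and are not part of any Taylor & Francis tooling. A clean decode does not prove that the file was saved correctly, but a failed decode shows a definite mismatch.

```python
import re

# Hypothetical helper pattern: captures the encoding pseudo-attribute
# of an XML declaration at the very start of the file.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _DECLARATION.match(data)
    return match.group(1).decode("ascii") if match else None


def matches_declaration(data: bytes) -> bool:
    """Return True if the bytes decode cleanly under the declared encoding."""
    name = declared_encoding(data)
    if name is None:
        return False
    try:
        data.decode(name)
    except (UnicodeDecodeError, LookupError):
        # Bytes are invalid for this encoding, or the name is unknown to Python.
        return False
    return True


good = b'<?xml version="1.0" encoding="ISO646-US"?>\n<p>3 &#xD7; 4</p>'
# The same text with a raw UTF-8 multiplication sign pasted in.
bad = b'<?xml version="1.0" encoding="ISO646-US"?>\n<p>3 \xc3\x97 4</p>'

print(matches_declaration(good))
print(matches_declaration(bad))
```

The second sample fails because the two bytes of the UTF-8 multiplication sign are not valid under the declared ISO646-US encoding.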

The character encoding must be one of the following:

Encoding	Description
ISO646-US	This encoding permits a very limited subset of Unicode characters to be encoded directly as sequences of bytes. This subset contains characters that are widely compatible, primarily the Latin alphabet, numerals, and punctuation. All Unicode characters that are not included in this subset are automatically encoded as NCRs when this encoding is used to create an XML file.
US-ASCII	This subset of Unicode is nearly equivalent to ISO646-US.
UTF-8	UTF-8 encoding can be used but it is preferable to use ISO646-US instead. UTF-8 encoding allows any Unicode character to be encoded directly as sequences of bytes. However, this is a disadvantage because some Unicode characters can be invisible or might appear to be broken characters. If this encoding is used, there are certain character classes that should be encoded as NCRs or should not appear in files.

Encoding Characters

XML provides several different methods for encoding character data in files. Each method is capable of representing the same Unicode characters but each method has its own pros and cons. These different methods only affect how character data is presented in an XML file; software that parses (reads) XML understands each encoding method to be equivalent. The three methods of encoding character data are:

Directly as a sequence of bytes: Characters are visible in the file as they would normally be displayed in an average document. While in general this makes Unicode characters easily visible in the XML file, there are characters within Unicode that are not distinctly visible when this method is used (e.g. whitespace characters, combining characters) and these characters should be encoded as NCRs.
Numerical character reference (NCR): Characters are represented by their Unicode number. For example, the character × (multiplication sign) is assigned to the hexadecimal number D7 in Unicode and can be encoded as &#xD7;. All characters coded as NCRs are clearly visible in an XML file, and this is a safe way to encode all Unicode characters. A possible downside is that most Unicode numbers are not easily recognizable for humans as the characters that they represent.
Character entity: Character entities are short names that typically represent a specific Unicode character. Certain characters that are part of the syntax of XML must be coded as character entities. These character entities that are predefined as part of the syntax of XML are shown in the list below.

Character	Character Entity	Use
&	`&amp;`	required
<	`&lt;`	required
>	`&gt;`	optional
"	`&quot;`	contextual
'	`&apos;`	contextual
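The equivalence of the encoding methods for parsing software can be illustrated with a short Python sketch that uses the standard library's xml.etree.ElementTree parser. The sample strings are invented for this illustration. Named entities other than the five predefined ones require a DTD and are not shown here.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same content: direct characters with a predefined
# entity, hexadecimal NCRs, and decimal NCRs.
FORMS = [
    "<p>3 \u00d7 4 &amp; more</p>",
    "<p>3 &#xD7; 4 &#x26; more</p>",
    "<p>3 &#215; 4 &#38; more</p>",
]


def parsed_text(xml: str) -> str:
    """Parse a small XML fragment and return the text of its root element."""
    return ET.fromstring(xml).text


for form in FORMS:
    # Every form yields the same character data after parsing.
    print(repr(parsed_text(form)))
```

Each form parses to the same string, so the choice of method affects only how the file looks to a person or to software that handles the file as raw text.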

If an XML file is encoded using ISO646-US or US-ASCII then only characters from this subset of Unicode should be present in the XML file as direct sequences of bytes. This subset includes the characters that are listed below. All other Unicode characters must be encoded as numerical character references (NCRs).

If an XML file is encoded using UTF-8, most valid Unicode characters may be directly encoded as UTF-8 byte sequences. Certain characters must be encoded as numerical character references (NCRs) according to Taylor & Francis specifications.

Only characters from the subset of Unicode that is ISO646-US, also known as US-ASCII or “keyboard characters”, should be present in the XML file. These characters are considered by Unicode to be “safe” characters that should display in any software application. The characters that comprise ISO646-US are (in character code order):

!"#$%&'()*+,-./
0123456789
:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ
[\]^_`
abcdefghijklmnopqrstuvwxyz
{|}~

In addition to this set, legal XML whitespace characters are allowed. Legal XML whitespace characters are:

\x20 (space)
\x9 (tab)
\xD (carriage return)
\xA (line feed)
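A check for characters outside this permitted set could be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the file content has already been read into a string.

```python
# Legal XML whitespace: space, tab, carriage return, line feed.
ALLOWED_WHITESPACE = {"\x20", "\x09", "\x0D", "\x0A"}


def is_allowed(ch: str) -> bool:
    """True for printable ISO646-US characters and legal XML whitespace."""
    return ch in ALLOWED_WHITESPACE or 0x21 <= ord(ch) <= 0x7E


def find_disallowed(text: str):
    """Return (position, code point) pairs for characters needing an NCR."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if not is_allowed(ch)
    ]


print(find_disallowed("3 \u00d7 4\n"))  # [(2, 'U+00D7')]
print(find_disallowed("plain text\t\r\n"))  # []
```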

Unicode characters outside of the characters listed above must be coded as XML character entities. Either numeric character references in hex form (&#x26;) or decimal form (&#38;), or named character entities (&amp;) may be used. Character codes used in numeric character references must refer to a specific character in the Unicode character set.

For example, in the state name Hawaiʻi the okina character can be tagged as &#x2BB; or &okina;. Both are equivalent.

Hawai&#x2BB;i

Hawai&okina;i
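A conversion of this kind can be sketched in Python. The function name to_hex_ncrs is an assumption made for this example. The sketch rewrites only characters above the ISO646-US range and does not escape the ampersand or less-than sign.

```python
def to_hex_ncrs(text: str) -> str:
    """Replace every character above U+007F with a hexadecimal NCR."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};"
        for ch in text
    )


print(to_hex_ncrs("Hawai\u02bbi"))  # Hawai&#x2BB;i

# Python's built-in error handler produces the decimal form instead.
print("Hawai\u02bbi".encode("ascii", "xmlcharrefreplace"))  # b'Hawai&#699;i'
```

Both outputs refer to the same Unicode character. The hex form matches the code charts more directly, since Unicode code points are conventionally written in hexadecimal.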

The TF JATS DTD includes all named character entities defined by the W3C’s ISO and MathML entity sets, and a large number of additional character entities for common characters. A table showing all of the named character entities defined in the TF JATS DTD is available here.

To help locate the numeric character code for any particular character, the code charts provided at unicode.org and unicodelookup.com are useful references. In addition, many editing programs provide special character choosers, and some XML software will automatically convert characters to XML character entities.

Note that two characters from the list of ISO646-US characters, the ampersand (&) and the less-than sign (<), when they appear in text must be coded as XML character entities to meet the well-formedness requirements of XML. The entity forms of the ampersand have been given above, and the entity forms of the less-than sign are &#x3C;, &#60;, and &lt;.
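When text is generated programmatically, this escaping can be delegated to a library routine. The sketch below uses escape from Python's standard xml.sax.saxutils module. The wrapper names escape_text and escape_attr are hypothetical.

```python
from xml.sax.saxutils import escape


def escape_text(text: str) -> str:
    """Escape &, < and > for use in element content."""
    return escape(text)


def escape_attr(value: str) -> str:
    """Escape &, <, > and the double quote for a double-quoted attribute."""
    return escape(value, {'"': "&quot;"})


print(escape_text("AT&T: x < y"))  # AT&amp;T: x &lt; y
print(escape_attr('say "hi" & go'))  # say &quot;hi&quot; &amp; go
```

The routine escapes the ampersand first, so the ampersands it introduces in later replacements are not escaped a second time.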

Whitespace concerns

Whitespace in XML is defined as spaces, tabs, and line breaks. Software that processes XML can remove whitespace that is not identified as significant, and this happens automatically without warning. The DTD defines which elements may contain whitespace that is significant and should not be removed. In order to avoid the problem of significant whitespace being removed incorrectly, all JATS XML for Taylor & Francis must contain a DOCTYPE declaration referencing a JATS 1.2 DTD using a PUBLIC identifier with a full URL to the DTD.

Characters not in Unicode

Although Unicode contains the characters of most of the world’s scripts and is an expanding standard, there are some characters or glyphs that are not available in Unicode.

If a single character cannot be represented using any available Unicode character, the character should be encoded using the <private-char> element with an <inline-graphic> element and an image file. The attributes @name and @description on <private-char> may be used to describe the character, which can be important to include for accessibility.

Example

<p>This text includes an unusual character <private-char name="smiley glasses" description="smiling face with glasses"><inline-graphic xlink:href="ABCD_A_123456_ILG0001.png"/></private-char>.</p>

See the image guidelines for more information on this.

Bidirectional text

Text that flows right-to-left can be coded using Unicode’s mechanism for bidirectional text. The control characters for bidirectional text must be encoded as NCRs.
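As a sketch of how this rule might be applied, the Python function below rewrites bidirectional control characters as hexadecimal NCRs and leaves all other text untouched. The set of characters is taken from the marks, embeddings, overrides, and isolates of the Unicode bidirectional algorithm. Whether Taylor & Francis expects exactly this set is an assumption, and the function name is hypothetical.

```python
import re

# Bidirectional controls: ALM (U+061C), LRM/RLM (U+200E, U+200F),
# embeddings and overrides (U+202A to U+202E), isolates (U+2066 to U+2069).
_BIDI_CONTROLS = re.compile("[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]")


def encode_bidi_controls(text: str) -> str:
    """Replace each bidirectional control character with a hex NCR."""
    return _BIDI_CONTROLS.sub(lambda m: f"&#x{ord(m.group()):X};", text)


# A right-to-left mark between two Latin words becomes visible as an NCR.
print(encode_bidi_controls("abc\u200fdef"))  # abc&#x200F;def
```

Encoding these characters as NCRs makes them visible in the file, which matters because they have no glyph of their own when stored directly.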

See the following pages for further information

Common Fractions

This section describes XML coding for common fractions that are typeset using a slanted bar between the numerator and denominator and are roughly the same height as other characters. Example: ½. Fractions such as these should be captured using Unicode characters, not inline images.

Common fractions can usually be represented using Unicode characters instead of inline images. Common fractions captured using Unicode characters will typically display better online. The Unicode characters also work better than images for copying and pasting, searching, and screen readers.

Almost any character or symbol, including the letters, numbers and punctuation of many languages, can be coded in XML using Unicode. The basic format is &code; (ampersand code semicolon). Many commonly used characters have mnemonic names supported by TF JATS, and the name can be used as the code. For example, the multiplication symbol can be coded as &times;. Not all characters have mnemonic names, but every Unicode character has a hex number that we can use. The format to use the hex number is &#xcode; (ampersand hash x code semicolon). For example, the multiplication symbol can be coded using its hex number as &#xD7;.

Unicode has several characters to represent common (or vulgar) fractions that are typeset at roughly the same height as other characters and use a slanted bar between the numerator and denominator. Beyond these defined fraction characters, any fraction can be coded using combinations of characters as described below.

Here is an example. The width by height dimensions for a picture, 11 7/16 x 9 7/16, should look like this: 11⁷⁄₁₆ × 9⁷⁄₁₆

You can use superscript number characters to make the top part of the fraction, a fraction slash, and subscript number characters to make the bottom part of the fraction. The fraction slash character is &#x2044; or &frasl;. This character has special dimensions that allow characters on the left and right to overlap the same space, which allows the fraction to display correctly.

The way to code this example in XML using Unicode hex numbers is:

11&#x2077;&#x2044;&#x2081;&#x2086; &#xD7; 9&#x2077;&#x2044;&#x2081;&#x2086;

Or, you can use mnemonic names where possible:

11&#x2077;&frasl;&#x2081;&#x2086; &times; 9&#x2077;&frasl;&#x2081;&#x2086;

This XML when rendered will display:

11⁷⁄₁₆ × 9⁷⁄₁₆

Note that whole numbers should use normal digits, and spaces should be placed where you expect there to be space.
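The construction described above can be sketched in Python. The function name fraction_ncrs and the lookup table are assumptions made for this illustration. The sketch accepts non-negative integers only and emits hexadecimal NCRs.

```python
# Superscript digits: most sit at U+2070 to U+2079, but the superscripts
# for 1, 2 and 3 are older characters at U+00B9, U+00B2 and U+00B3.
SUPERSCRIPT = {str(d): 0x2070 + d for d in range(10)}
SUPERSCRIPT.update({"1": 0xB9, "2": 0xB2, "3": 0xB3})

FRACTION_SLASH = "&#x2044;"


def fraction_ncrs(numerator: int, denominator: int) -> str:
    """Build a fraction from superscript digits, a fraction slash, and subscript digits."""
    top = "".join(f"&#x{SUPERSCRIPT[d]:X};" for d in str(numerator))
    # Subscript digits are contiguous, starting at U+2080.
    bottom = "".join(f"&#x{0x2080 + int(d):X};" for d in str(denominator))
    return f"{top}{FRACTION_SLASH}{bottom}"


# Whole numbers use normal digits; only the fraction parts are converted.
frac = fraction_ncrs(7, 16)
print(f"11{frac} &#xD7; 9{frac}")
```

For the picture dimensions used in this section, the printed line is the same hex-coded string given earlier.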

The table below lists the codes for characters frequently used in fractions.

Description	Display	Hex Numeric Character Reference	Mnemonic Character Entity Reference
multiplication sign	×	`&#xD7;`	`&times;`
fraction slash	⁄	`&#x2044;`	`&frasl;`
superscript zero	⁰	`&#x2070;`
superscript one	¹	`&#xB9;`
superscript two	²	`&#xB2;`
superscript three	³	`&#xB3;`
superscript four	⁴	`&#x2074;`
superscript five	⁵	`&#x2075;`
superscript six	⁶	`&#x2076;`
superscript seven	⁷	`&#x2077;`
superscript eight	⁸	`&#x2078;`
superscript nine	⁹	`&#x2079;`
subscript zero	₀	`&#x2080;`
subscript one	₁	`&#x2081;`
subscript two	₂	`&#x2082;`
subscript three	₃	`&#x2083;`
subscript four	₄	`&#x2084;`
subscript five	₅	`&#x2085;`
subscript six	₆	`&#x2086;`
subscript seven	₇	`&#x2087;`
subscript eight	₈	`&#x2088;`
subscript nine	₉	`&#x2089;`
vulgar fraction one quarter	¼	`&#xBC;`
vulgar fraction one half	½	`&#xBD;`
vulgar fraction three quarters	¾	`&#xBE;`