DTDs cannot define and enforce a list of allowed characters, but publishers need to specify which characters are allowed so that publishing systems do not encounter characters they are unprepared for and cannot render. This user guide is therefore the definitive source of the rules on this matter.
An XML document conforming to the TFJA DTD must specify the "UTF-8" encoding, which encodes characters from the Unicode character set. However, Unicode is a very large character set, and not all of its characters can be supported by the suppliers Taylor & Francis uses to publish articles. The permitted subset of the Unicode character set is specified by the rules below:
- All non-control characters from the "ISO 646" character set (essentially, US ASCII except for the currency symbol) can be entered directly (see list below).
- Characters outside of the ISO 646 set must not be entered directly (it is therefore not possible to arbitrarily insert any Unicode character).
- Characters outside of the ISO 646 set must be entered as references to entities that are defined in the TFJA DTD if the TFJA DTD has an entity defined for the character.
- Characters outside of the ISO 646 set and not defined in the TFJA DTD must be represented as hexadecimal or decimal character entities using the correct Unicode value for the character, for example "&#x062A;" (Arabic letter teh).
- Entities must not be defined within the internal DTD subset to represent characters not available in the TFJA DTD or Unicode.
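The order of preference in these rules (direct entry, then a DTD entity, then a numeric character entity) can be sketched in Python as below. This is only an illustration under stated assumptions: the function names are invented for this example, KNOWN_ENTITIES is a tiny hypothetical stand-in for the full set of entities declared in the DTD, and the sketch does not escape XML markup characters such as "&" and "<".

```python
# Characters that may be entered directly: printable ISO 646 characters
# (0x21-0x7E) except the currency symbol "$", plus the four XML whitespace
# characters (space, tab, carriage return, line feed).
ALLOWED_DIRECT = ({chr(c) for c in range(0x21, 0x7F)} - {"$"}) | set(" \t\r\n")

# Hypothetical, illustrative subset of the entity names declared in the DTD.
# A real tool would load the complete list from the DTD's entity files.
KNOWN_ENTITIES = {
    "\u00e9": "eacute",  # e with acute accent
    "\u03b1": "alpha",   # Greek alpha used as a symbol
    "\u00b5": "micro",   # micro sign
}


def encode_char(ch: str) -> str:
    """Return the form in which a single character would be entered."""
    if ch in ALLOWED_DIRECT:
        return ch  # rule 1: enter directly
    if ord(ch) < 0x20:
        # Remaining control characters cannot appear in XML 1.0 at all.
        raise ValueError(f"control character U+{ord(ch):04X} is not allowed")
    name = KNOWN_ENTITIES.get(ch)
    if name is not None:
        return f"&{name};"  # rule 3: use the DTD entity when one exists
    return f"&#x{ord(ch):04X};"  # rule 4: hexadecimal character entity


def encode_text(text: str) -> str:
    """Apply encode_char to every character of a plain-text string."""
    return "".join(encode_char(ch) for ch in text)
```

With these assumptions, encode_char("\u062a") returns "&#x062A;" and encode_text("caf\u00e9") returns "caf&eacute;".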
Keyboard Characters
The allowed characters from ISO 646 are (in character code order):
!"#%&'()*+,-./
0123456789
:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ
[\]^_`
abcdefghijklmnopqrstuvwxyz
{|}~
In addition to this set, legal XML whitespace characters are allowed. Legal XML whitespace characters are:
#x20 (space)
#x9 (tab)
#xD (carriage return)
#xA (line feed)
All XML documents that adhere to the TFJA DTD must follow the rules above. It is legal for processes that output a new XML document from the original tagged document to convert entities to other legal XML forms, such as Unicode characters or character entity references, but any manual edits made to such a derived document should obey the rules above.
The direct use of Unicode characters outside those listed above is prohibited in TFJA XML documents to ensure that content is correctly preserved. The characters specified above are considered by Unicode to be "safe" characters that should display in any software application. Other Unicode characters, when captured as entities, can easily be identified using any text editor and the Unicode code charts. If entered directly, however, they are difficult to identify on systems that lack the appropriate fonts and can be inadvertently corrupted during manual editing. Although only a subset of Unicode may be used directly, TFJA XML documents should specify the "UTF-8" character encoding to indicate that the Unicode character set is used.
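A check for these two requirements (only the listed characters entered directly, and a declared "UTF-8" encoding) might look like the following Python sketch. The function names are invented for this example, and the regular expression only recognises a conventional XML declaration at the very start of the file, so treat it as a rough pre-flight check rather than a full XML parser.

```python
import re

# Printable ISO 646 characters (0x21-0x7E) except "$", plus XML whitespace.
ALLOWED_DIRECT = ({chr(c) for c in range(0x21, 0x7F)} - {"$"}) | set(" \t\r\n")

# Matches: <?xml version="..." encoding="..."   (single or double quotes)
XML_DECL = re.compile(
    r"""<\?xml\s+version\s*=\s*["'][^"']*["']\s+encoding\s*=\s*["']([^"']+)["']"""
)


def find_disallowed_characters(text: str):
    """Return (line, column, character) for every character entered
    directly that is not in the allowed set. Positions are 1-based."""
    problems = []
    # Split on line feed only, so that unusual separator characters
    # stay inside the lines and are still inspected.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch not in ALLOWED_DIRECT:
                problems.append((line_no, col_no, ch))
    return problems


def declares_utf8(text: str) -> bool:
    """Return True if the document starts with an XML declaration
    that names the UTF-8 encoding."""
    match = XML_DECL.match(text)
    return match is not None and match.group(1).upper() == "UTF-8"
```

For example, find_disallowed_characters("<p>caf\u00e9 $5</p>") reports the accented letter at column 7 and the "$" at column 9 of line 1.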
Entity Sets
The TFJA DTD uses the ISO and MathML entity sets described in "XML Entity Definitions for Characters, W3C Recommendation 01 April 2010" (http://www.w3.org/TR/xml-entity-names/), and may periodically be updated with later additions to these sets. Entity sets maintained by the W3C for HTML are specifically not included, although any characters appearing in these sets not already duplicated in the ISO and MathML sets have been added to TandFobj.ent. The unarticle Preview Article contains a list of all character entities in the DTD.
Please contact [email protected] and [email protected] for advice should you discover the need for the creation of an entity that does not exist in any of the character sets that are linked into the DTD (the ISO sets and TandFchar.ent, TandFobj.ent, or TandFmath.ent).
TandFchar.ent
Some alphabetic characters (lowercase and uppercase letters) have ISO entities for their accented versions (e.g. "&eacute;"). However, a great many combinations do not exist in the ISO sets, and Taylor & Francis has created this entity file to fill in the gaps in case they are needed.
There are no entries in this file that duplicate any ISO entity. The letters used are "A..Z" and "a..z", and the accents catered for are acute, breve, caron, cedilla, circumflex, double acute, dot, grave, macron, ogonek, ring, tilde and umlaut. Generally the entity names are quite self-explanatory.
The intention is to add further entities to this file as needed.
TandFobj.ent
The purpose of this file is to address non-text entities that appear in the content of Taylor & Francis articles. The initial focus of this file is to provide definitions for non-textual math and chemistry objects that have not been defined by one of the ISO entity files. The content of this entity file has been "seeded" with the contents of the ISOCHEM.ENT and ISOCH.ENT entity sets, and entities from xhtml1-special.ent and xhtml1-symbol.ent not found in the ISO sets.
As new non-text entities are identified, it is our intention to build on this file and make it universally available to our partners.
TandFmath.ent
All non-ISO math operators used in Taylor & Francis content must be identified, defined and declared in this file.
Greek Characters
The ISOGRK3 and ISOGRK4 entity sets exist in multiple forms. The files referenced in the TFJA DTD originate from the W3C website, which also supports the Mathematical Markup Language (MathML) Version 3. The following comments apply to the version provided in this DTD.
ISOGRK1 is a set of Greek letters to be used for Greek language text (such as "&agr;").
ISOGRK2 contains the accented forms of some of the letters that may be needed in Greek text.
ISOGRK3 is a set of Greek letters for use as symbols in technical contexts (such as "&alpha;"). Unlike ISOGRK1, this does not contain the full alphabet, but only those letters where the symbols are different from letters in the Latin alphabet (thus capital alpha is missing from this set; as a symbol it would not be distinguishable from the letter "A").
ISOGRK4 is a set of emboldened versions of the symbols in ISOGRK3.
To fit with normal conventions in mathematical typesetting, the ISOGRK3 symbols should be rendered in a sloping font to match the italic characters from the Latin alphabet used for mathematical variables. The bold Greek symbols of ISOGRK4 would more normally be rendered upright.
Thus, in Greek text, use "&agr;" to represent alpha, use italic "&agr;" to represent italic alpha, use bold "&agr;" to represent bold alpha, and use bolditalic "&agr;" to represent bold italic alpha.
In mathematics, use "&alpha;" to represent alpha (italicization implied; do not use italic "&alpha;"), use "&b.alpha;" to represent bold alpha, and use italic "&b.alpha;" to represent bold italic alpha. In the unlikely event that an upright non-bold alpha is needed in mathematical work, use "&agr;".
The one exception to this rule is that the "mu" used to represent "micro" in contexts such as unit abbreviations (which ought to be printed as an upright "mu", but rarely is) has its own entity in ISONUM ("&micro;"), which should be used in preference to "&mu;" or "&mgr;". For example, "µmol" should be coded as "&micro;mol" (not "&mu;mol").
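The choices described above for alpha and for the "micro" prefix can be summarised as a small lookup, sketched here in Python. The function names and the returned style labels ("italic", "bold", "bolditalic") are assumptions made for this illustration; how the extra styling is actually marked up depends on the elements the DTD provides.

```python
def alpha_markup(context: str, bold: bool, italic: bool):
    """Return (entity, extra_style) for an alpha.

    context is "text" (Greek language text) or "math".
    extra_style names the styling that still has to be applied around
    the entity; an empty string means none is needed.
    """
    if context == "text":
        # Greek text always uses the ISOGRK1 letter; styling is added on top.
        style = ("bold" if bold else "") + ("italic" if italic else "")
        return "&agr;", style
    if context == "math":
        if bold:
            # ISOGRK4 bold symbol is upright; add italic for bold italic.
            return "&b.alpha;", ("italic" if italic else "")
        if italic:
            # ISOGRK3 symbol already implies italic, so no extra styling.
            return "&alpha;", ""
        # Rare case: upright, non-bold alpha in mathematical work.
        return "&agr;", ""
    raise ValueError(f"unknown context: {context!r}")


def micro_unit(unit: str) -> str:
    """Prefix a unit abbreviation with the ISONUM micro entity."""
    return "&micro;" + unit
```

For instance, alpha_markup("math", bold=True, italic=True) returns ("&b.alpha;", "italic"), and micro_unit("mol") returns "&micro;mol".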