Syntax rules in XML
An XML document is text, usually a particular encoding of Unicode such as UTF-8 or UTF-16, although other encodings may be used.
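As a small illustration of the encoding point, the sketch below serialises the same one-element document as UTF-8 and as UTF-16 and parses both with Python's standard `xml.etree.ElementTree` module. The `greeting` element is made up for the example, and the sketch assumes the parser detects UTF-16 from the byte order mark, as the Expat-based parser bundled with CPython does.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing a non-ASCII character.
text = "<greeting>caf\u00e9</greeting>"

# The same document, serialised in two different Unicode encodings.
# Python's "utf-16" codec prepends a byte order mark.
utf8_bytes = ('<?xml version="1.0" encoding="UTF-8"?>' + text).encode("utf-8")
utf16_bytes = ('<?xml version="1.0" encoding="UTF-16"?>' + text).encode("utf-16")

# The byte sequences differ, but both parse to the same character data.
from_utf8 = ET.fromstring(utf8_bytes)
from_utf16 = ET.fromstring(utf16_bytes)
print(from_utf8.text == from_utf16.text)
```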
Unlike HTML, for example, XML depends heavily on structure, content and integrity. For a document to be considered "well-formed" [1], it must satisfy at least the following rules:
- It must have one (and only one) root element.
- Non-empty elements must be delimited by a start-tag and an end-tag. Empty elements may be marked with an empty-element tag.
- All attribute values must be quoted, with either single (') or double (") quotes. The closing quote must be the same character as the opening quote, so the other kind of quote can be used inside the value.
- Tags may be nested but must not overlap; that is, each non-root element must be completely contained in another element.
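Element names in XML are case-sensitive: a start-tag and an end-tag form a well-formed matching pair only if their names are spelled identically, so two tags whose names differ only in letter case do not match.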
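The rules above can be tried out with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module, which rejects documents that are not well-formed by raising `ParseError`. The sample documents and the helper name `is_well_formed` are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the (non-validating) parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule discussed in the text.
samples = {
    "single root": "<note><to>Ann</to></note>",
    "two roots": "<note/><note/>",
    "empty-element tag": "<note><br/></note>",
    "missing end-tag": "<note><to>Ann</note>",
    "unquoted attribute": "<note lang=en/>",
    "quotes nested in value": "<note title='the \"best\" one'/>",
    "overlapping tags": "<a><b></a></b>",
    "case mismatch": "<note></Note>",
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

Only the "single root", "empty-element tag" and "quotes nested in value" samples are accepted; each of the others breaks one of the rules listed above.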
Also, again unlike HTML, clever choice of XML element names allows the meaning of the data to be retained as part of the markup. This makes it more easily interpreted by software programs.
As a concrete example, a simple recipe expressed in an XML representation might be:
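  <recipe>
    <title>Basic bread</title>
    <ingredient>Flour</ingredient>
    <ingredient>Yeast</ingredient>
    <ingredient>Warm Water</ingredient>
    <ingredient>Salt</ingredient>
    <instructions>
      <step>Mix all ingredients together, and knead thoroughly.</step>
      <step>Cover with a cloth, and leave for one hour in warm room.</step>
      <step>Knead again, place in a tin, and then bake in the oven.</step>
    </instructions>
  </recipe>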
Identifying information accurately enables programs to manipulate it easily: in this example, it is now easy to convert the quantities to other measuring systems, or to print the ingredients as icons for those with low reading skills (or different native language), or to refer to the individual ingredients or steps from elsewhere (another recipe, for example).
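To make the point about converting quantities concrete, here is a rough sketch in Python. It uses a cut-down version of the recipe in which the element names, the `amount` and `unit` attributes, the amounts themselves and the conversion factors are all assumptions made for this example; none of them is prescribed by XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: names, attributes and amounts are illustrative only.
RECIPE = """\
<recipe>
  <title>Basic bread</title>
  <ingredient amount="3" unit="cups">Flour</ingredient>
  <ingredient amount="1.5" unit="cups">Warm Water</ingredient>
  <ingredient amount="1" unit="teaspoon">Salt</ingredient>
</recipe>
"""

# Approximate volume conversion factors, rounded for illustration.
ML_PER_UNIT = {"cups": 240.0, "teaspoon": 5.0}

def to_millilitres(recipe_xml: str) -> list[tuple[str, float]]:
    """Return (ingredient name, quantity in millilitres) pairs."""
    root = ET.fromstring(recipe_xml)
    converted = []
    for ingredient in root.iter("ingredient"):
        amount = float(ingredient.get("amount"))
        factor = ML_PER_UNIT[ingredient.get("unit")]
        converted.append((ingredient.text, amount * factor))
    return converted

converted = to_millilitres(RECIPE)
for name, millilitres in converted:
    print(f"{name}: {millilitres:g} ml")
```

Because each quantity is labelled in the markup, the program never has to guess which numbers in the text are amounts and which are, say, step numbers.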
An XML document that meets certain other criteria in addition to being well-formed (such as complying with an associated DTD) is said to be "valid".
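The difference between the two notions can be seen with a non-validating parser. In the sketch below, a made-up document carries a small DTD requiring a `title` inside `recipe`, but the body omits it. Python's standard `xml.etree.ElementTree` checks well-formedness only, so it accepts the document; a validating parser (available in some third-party libraries) would be expected to reject it.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset says <recipe> must contain exactly one <title>,
# but the document body has none: well-formed, yet not valid.
DOCUMENT = """\
<!DOCTYPE recipe [
  <!ELEMENT recipe (title)>
  <!ELEMENT title (#PCDATA)>
]>
<recipe></recipe>
"""

# ElementTree wraps a non-validating parser: it reads the DTD declarations
# but does not enforce the content model they describe.
root = ET.fromstring(DOCUMENT)
print(root.tag, len(root))
```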
XML schema languages
Before the advent of generalised data description languages such as SGML and XML, software designers had to define special file formats or small languages to share data between programs. This required writing detailed specifications and special-purpose parsers and writers.
XML schema languages allow software designers to describe the structure of particular XML-based markup languages in a formal way. Such a description is called a schema. Well-tested tools exist that validate XML files against a schema, automatically verifying whether a document conforms to the described structure. Schemas have other uses as well: XML editors, for instance, can use them to support the editing process.
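What a validator does can be sketched in a few lines. The example below is not a DTD or XSD implementation: it is a deliberately tiny, hand-written stand-in in which a hypothetical schema is a Python dictionary that lists, for each element name, the child elements it may contain and the attributes it must carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema: element name -> (allowed children, required attributes).
SCHEMA = {
    "recipe": ({"title", "ingredient"}, set()),
    "title": (set(), set()),
    "ingredient": (set(), {"amount"}),
}

def validate(element: ET.Element, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the element conforms."""
    if element.tag not in schema:
        return [f"unknown element <{element.tag}>"]
    allowed_children, required_attrs = schema[element.tag]
    problems = [
        f"<{element.tag}> is missing attribute '{name}'"
        for name in sorted(required_attrs - set(element.attrib))
    ]
    for child in element:
        if child.tag not in allowed_children:
            problems.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            problems.extend(validate(child, schema))
    return problems

good = ET.fromstring(
    '<recipe><title>Bread</title><ingredient amount="3">Flour</ingredient></recipe>'
)
bad = ET.fromstring("<recipe><ingredient>Flour</ingredient><oven/></recipe>")

good_problems = validate(good, SCHEMA)
bad_problems = validate(bad, SCHEMA)
print(good_problems)
print(bad_problems)
```

Both documents are well-formed, but only the first conforms to the toy schema. Real schema languages express far richer constraints (ordering, cardinality, data types) than this sketch does.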
The oldest XML schema format is the DTD, which is inherited from SGML. While DTD support is ubiquitous due to its inclusion in the XML 1.0 standard, it is seen as limited for the following reasons:
- No support for newer features of XML, most importantly namespaces.
- Lack of expressivity. Certain formal aspects of an XML document cannot be captured in a DTD.
- Custom non-XML syntax to describe the schema, inherited from SGML.
A newer XML schema language, described by the W3C as the successor of DTDs, is simply called XML Schema, also referred to as XML Schema Definition (XSD). XSDs are far more powerful than DTDs in describing XML languages. Additionally, XSD uses an XML-based format, which makes it possible to use the XML toolset to help process XML schemas. It also becomes possible to write a schema for the schema language itself. Criticisms of XSD are:
- The standard is very large, which makes it difficult to understand and implement.
- The XML-based syntax leads to verbosity in schema descriptions, which makes XSDs harder to read and write.
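Both the benefit and the verbosity can be seen in a small example. The sketch below embeds a hypothetical XSD for a `recipe` element (the element names are assumptions made for this illustration) and reads it with Python's general-purpose `xml.etree.ElementTree` parser, with no schema-specific tooling involved.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the W3C XML Schema language.
XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: a <recipe> holds one <title> and one or more <ingredient>s.
SCHEMA = f"""\
<xs:schema xmlns:xs="{XS}">
  <xs:element name="recipe">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="ingredient" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Because the schema is itself XML, an ordinary XML parser can read it.
root = ET.fromstring(SCHEMA)

# Collect the name of every declared element, in document order.
declared = [el.get("name") for el in root.iter(f"{{{XS}}}element")]
print(declared)
```

The same content model takes only a couple of declarations in DTD syntax, which gives a sense of the verbosity criticism.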
An alternative XML schema language recently gaining in popularity is