XML Cheatsheet

XML is a markup language designed to store and transport data. It emphasizes structure and is both human-readable and machine-readable.

Basic Syntax Rules

  1. Root Element: Every XML document MUST have exactly one root element that encloses all other elements.
  2. Closing Tags: Every opening tag (<tag>) MUST have a corresponding closing tag (</tag>).
    • Empty elements can use self-closing tags: <tag/> (equivalent to <tag></tag>).
  3. Proper Nesting: Tags must be nested correctly. If <tagB> opens inside <tagA>, it must close before <tagA> closes.
    • Correct: <tagA><tagB>Content</tagB></tagA>
    • Incorrect: <tagA><tagB>Content</tagA></tagB>
  4. Case Sensitivity: XML tags and attribute names are case-sensitive. <tag> is different from <Tag>.
  5. Attribute Values Quoted: All attribute values MUST be enclosed in single (') or double (") quotes.
    • Correct: <element attr="value"> or <element attr='value'>
    • Incorrect: <element attr=value>
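
The rules above can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the parser; the helper name is_well_formed and the sample strings are illustrative and not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One illustrative sample per rule; the keys are just labels.
samples = {
    "single root": "<a><b/></a>",
    "two roots": "<a/><b/>",
    "missing close": "<a><b></a>",
    "bad nesting": "<tagA><tagB>Content</tagA></tagB>",
    "case mismatch": "<tag>text</Tag>",
    "unquoted attribute": "<element attr=value/>",
    "quoted attribute": "<element attr='value'/>",
}

results = {name: is_well_formed(xml) for name, xml in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

Only "single root" and "quoted attribute" should be reported as well-formed; every other sample breaks exactly one rule.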

Core Components

  • XML Declaration (Optional, but Recommended):
    • Specifies XML version and character encoding. If present, it MUST be the very first thing in the document: nothing, not even whitespace or a comment, may precede it.
    <?xml version="1.0" encoding="UTF-8"?>
    
  • Elements: Building blocks defined by tags. Can contain text, other elements, or be empty.
    <book>                                <!-- Element 'book' -->
      <title>The Hitchhiker's Guide</title> <!-- Element 'title' with text content -->
      <author>Douglas Adams</author>      <!-- Element 'author' with text content -->
      <page-break/>                       <!-- Empty element 'page-break' (self-closing) -->
    </book>
    
  • Attributes: Provide additional information (metadata) about elements. Defined within the start tag.
    <book isbn="978-0345391803"> <!-- 'isbn' is an attribute of 'book' -->
      <title language="en">The Hitchhiker's Guide</title> <!-- 'language' is an attribute of 'title' -->
    </book>
    
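  • Content (Text / PCDATA): Parsed Character Data. The text between the start and end tags of an element. In content, < and & must always be escaped with entities (or placed within a CDATA section); > must be escaped when it follows ]], and ' and " only need escaping inside attribute values delimited by the same quote character.
    <description>This book is &lt;GREAT&gt; &amp; fun!</description>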
    
  • Comments: Ignored by parsers. Useful for documentation. Cannot appear inside tags, and cannot contain a double hyphen (--).
    <!-- This is a comment -->
    <data>Content</data> <!-- Comment after element -->
    
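  • CDATA Sections: Character Data. Blocks of text that the parser passes through literally instead of interpreting as markup. Special characters do not need escaping within CDATA, but the section cannot contain the sequence ]]>, which ends it. Useful for including code snippets or XML fragments literally.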
    <script>
    <![CDATA[
      function check() {
        if (a < b && c > d) { // No need to escape < or & here
          alert("Condition met!");
        }
      }
    ]]>
    </script>
    
  • Processing Instructions (PIs): Pass information to applications processing the XML document.
    <?xml-stylesheet type="text/css" href="style.css"?> <!-- Example: Link a stylesheet -->
    
  • Entities: Predefined shortcuts for special characters. Can also be custom defined in DTDs.
    • &lt; -> < (Less Than)
    • &gt; -> > (Greater Than)
    • &amp; -> & (Ampersand)
    • &apos; -> ' (Apostrophe / Single Quote)
    • &quot; -> " (Quotation Mark / Double Quote)
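
Escaping, entities, and CDATA can be seen working together in a short round trip. This is a sketch under the assumption that the standard-library helpers xml.sax.saxutils.escape, quoteattr, and unescape are acceptable for the job; the sample text and the size attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr, unescape

raw = "This book is <GREAT> & fun!"

# escape() replaces &, < and > with entity references for element content.
escaped = escape(raw)

# quoteattr() escapes a value and wraps it in quotes suitable for an attribute.
# The attribute name "size" is hypothetical.
attr = quoteattr('5" x 8"')

doc = f"<description size={attr}>{escaped}</description>"
elem = ET.fromstring(doc)

# The parser turns entity references back into the original characters.
assert elem.text == raw
assert elem.get("size") == '5" x 8"'

# A CDATA section yields the same text with no escaping in the source.
cdata_elem = ET.fromstring(f"<description><![CDATA[{raw}]]></description>")
assert cdata_elem.text == raw

# unescape() reverses escape() without a full parse.
assert unescape(escaped) == raw

print(escaped)  # This book is &lt;GREAT&gt; &amp; fun!
```

Building the document through an XML library's own serializer avoids manual escaping altogether; the helpers above are mainly useful when XML is assembled as plain strings.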

Well-Formed vs. Valid XML

  • Well-Formed: An XML document that adheres to all the basic syntax rules listed above. ALL XML documents must be well-formed to be processed correctly.
  • Valid: An XML document that is both well-formed AND conforms to the rules defined in a Document Type Definition (DTD) or an XML Schema (XSD). Validation ensures the document has the correct structure, element types, attribute types, etc., as defined by its grammar.
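
The distinction can be illustrated with a small sketch. Python's standard library checks well-formedness but does not validate against a DTD or XSD, so the "grammar" below is a hand-written rule standing in for a real schema. The function name check and the rule itself (a library root whose book children each carry an isbn attribute and a title child) are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def check(text: str) -> str:
    """Classify text as 'not well-formed', 'well-formed only', or 'valid'.

    'valid' here means passing the hypothetical rule below, which merely
    stands in for a real DTD or XSD.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"

    # Hypothetical grammar: the root is <library>, and every <book> child
    # has an isbn attribute and a <title> child element.
    books = root.findall("book")
    ok = root.tag == "library" and all(
        b.get("isbn") is not None and b.find("title") is not None
        for b in books
    )
    return "valid" if ok else "well-formed only"


print(check("<library><book isbn='1'><title>T</title></book></library>"))
print(check("<library><book><title>T</title></book></library>"))
print(check("<library><book></library>"))
```

The second document parses without error yet fails the structural rule, which is exactly the gap between well-formed and valid. Real validation would hand the document and a DTD or XSD to a validating parser.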

Character Encoding

  • Specifies the character set used (e.g., UTF-8, ISO-8859-1).
  • Declared in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>
  • UTF-8 is widely used and recommended as it supports a vast range of characters.
  • If the declaration is omitted, a parser assumes UTF-8 unless a byte order mark signals UTF-16. Explicit declaration is best practice.
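
A short sketch of how the declared encoding matters in practice, assuming xml.etree.ElementTree from the Python standard library (the xml_declaration argument of tostring requires Python 3.8 or later). The name element and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

# Bytes as they might arrive from a file or the network, encoded as
# ISO-8859-1. \u00e9 is "e" with an acute accent.
data = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<name>Jos\u00e9</name>"
).encode("iso-8859-1")

# Passing bytes (not str) lets the parser honour the declared encoding.
root = ET.fromstring(data)
name = root.text

# Re-serialise as UTF-8 with an explicit XML declaration.
utf8_bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

print(name)
print(utf8_bytes)
```

Reading files in binary mode and handing the bytes to the parser is usually safer than decoding them first, because the parser can then apply the encoding the document itself declares.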

Namespaces

  • Avoid naming conflicts when combining elements from different XML vocabularies.
  • Declared using the xmlns attribute.
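  • Default Namespace: Applies to the element where it's declared and all unprefixed descendant elements, unless a descendant redeclares it. It does not apply to attributes; unprefixed attributes are in no namespace.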
    <book xmlns="http://example.com/books">
      <title>...</title> <!-- Belongs to http://example.com/books -->
    </book>
    
  • Prefixed Namespace: Associates a prefix with a namespace URI. Used for elements and attributes.
    <root xmlns:bk="http://example.com/books" xmlns:auth="http://example.com/authors">
      <bk:book>
        <bk:title>...</bk:title>
        <auth:author>...</auth:author> <!-- Belongs to http://example.com/authors -->
      </bk:book>
    </root>
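
When reading namespaced XML, the URI identifies an element, not the prefix. The sketch below assumes xml.etree.ElementTree; the text values "Example Title" and "Example Author" are placeholders, and the short prefixes b and a in the lookup mapping are deliberately different from those in the document.

```python
import xml.etree.ElementTree as ET

doc = """<root xmlns:bk="http://example.com/books" xmlns:auth="http://example.com/authors">
  <bk:book>
    <bk:title>Example Title</bk:title>
    <auth:author>Example Author</auth:author>
  </bk:book>
</root>"""

root = ET.fromstring(doc)

# Prefixes used in queries are local to this mapping; only the URIs
# have to match those declared in the document.
ns = {"b": "http://example.com/books", "a": "http://example.com/authors"}

title = root.find("b:book/b:title", ns)
author = root.find("b:book/a:author", ns)

# ElementTree stores each tag in {uri}local form.
print(title.tag)    # {http://example.com/books}title
print(author.text)  # Example Author

# A lookup without the namespace finds nothing.
print(root.find("book"))  # None
```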
    

Simple XML Example

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="library.css"?>

<!-- Represents a collection of books -->
<library xmlns="http://example.com/library/v1">
  <book isbn="978-0345391803" available="true">
    <title language="en">The Hitchhiker's Guide to the Galaxy</title>
    <author>Douglas Adams</author>
    <genre>Science Fiction Comedy</genre>
    <summary>
    <![CDATA[
      Follows Arthur Dent after Earth's destruction. <Amazing & Funny!>
    ]]>
    </summary>
  </book>

  <book isbn="978-0743273565">
    <title language="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <genre>Fiction</genre>
    <!-- Needs availability check -->
  </book>

  <empty-element/> <!-- Example of a self-closing empty element -->

</library>
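
A document like the one above could be read as follows. This is a sketch assuming xml.etree.ElementTree and an abbreviated copy of the example (declaration, stylesheet instruction, genre, and summary omitted for brevity); the lib prefix and the dictionary layout are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

LIBRARY_XML = """\
<library xmlns="http://example.com/library/v1">
  <book isbn="978-0345391803" available="true">
    <title language="en">The Hitchhiker's Guide to the Galaxy</title>
    <author>Douglas Adams</author>
  </book>
  <book isbn="978-0743273565">
    <title language="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
  </book>
</library>
"""

# The default namespace applies to every element, so queries need it too.
ns = {"lib": "http://example.com/library/v1"}
root = ET.fromstring(LIBRARY_XML)

books = []
for book in root.findall("lib:book", ns):
    books.append({
        # Unprefixed attributes are in no namespace, so plain names work.
        "isbn": book.get("isbn"),
        # get() returns None when the attribute is absent in the source.
        "available": book.get("available"),
        "title": book.findtext("lib:title", namespaces=ns),
        "author": book.findtext("lib:author", namespaces=ns),
    })

for entry in books:
    print(entry)
```

The second book has no available attribute, so its entry holds None for that key rather than a default value; a DTD or schema would be needed to supply defaults.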

By Language

Python

Convert XML to JSON

Install xmltodict

$ pip install xmltodict

Python Code:

import json
import xmltodict

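# Read the XML as bytes so the parser can honour the declared encoding
with open("foo.xml", "rb") as src:
    d = xmltodict.parse(src.read())

# Write the parsed structure as JSON; the with-blocks close both files
with open("bar.json", "w") as dst:
    json.dump(d, dst)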