XML Cheatsheet

XML is a markup language designed to store and transport data. It emphasizes structure and is both human-readable and machine-readable.

Basic Syntax Rules

Root Element: Every XML document MUST have exactly one root element that encloses all other elements.
Closing Tags: Every opening tag (<tag>) MUST have a corresponding closing tag (</tag>).
- Empty elements can use self-closing tags: <tag/> (equivalent to <tag></tag>).
Proper Nesting: Tags must be nested correctly. If <tagB> opens inside <tagA>, it must close before <tagA> closes.
- Correct: <tagA><tagB>Content</tagB></tagA>
- Incorrect: <tagA><tagB>Content</tagA></tagB>
Case Sensitivity: XML tags and attribute names are case-sensitive. <tag> is different from <Tag>.
Attribute Values Quoted: All attribute values MUST be enclosed in single (') or double (") quotes.
- Correct: <element attr="value"> or <element attr='value'>
- Incorrect: <element attr=value>
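
The rules above can be checked mechanically by handing a document to any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is an illustrative assumption, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # The parser stops at the first syntax violation it finds
        return False


print(is_well_formed("<tagA><tagB>Content</tagB></tagA>"))  # correct nesting
print(is_well_formed("<tagA><tagB>Content</tagA></tagB>"))  # crossed tags
print(is_well_formed("<element attr=value/>"))              # unquoted attribute
print(is_well_formed("<tag>text</Tag>"))                    # case mismatch
print(is_well_formed("<a/><b/>"))                           # two root elements
```

Only the first call should print True; each of the others breaks one of the rules listed above.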

Core Components

XML Declaration (Optional, but Recommended):
- Specifies the XML version and character encoding. If present, it MUST be the very first thing in the document: no whitespace or comment may precede it.
```
<?xml version="1.0" encoding="UTF-8"?>
```

Elements: Building blocks defined by tags. Can contain text, other elements, or be empty.

<book>                                <!-- Element 'book' -->
  <title>The Hitchhiker's Guide</title> <!-- Element 'title' with text content -->
  <author>Douglas Adams</author>      <!-- Element 'author' with text content -->
  <page-break/>                       <!-- Empty element 'page-break' (self-closing) -->
</book>

Attributes: Provide additional information (metadata) about elements. Defined within the start tag.

<book isbn="978-0345391803"> <!-- 'isbn' is an attribute of 'book' -->
  <title language="en">The Hitchhiker's Guide</title> <!-- 'language' is an attribute of 'title' -->
</book>
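
As a rough illustration of how elements and attributes look from code, the sketch below reads the book example with Python's standard `xml.etree.ElementTree`. The variable names are arbitrary, and `"missing"` is a made-up attribute name used only to show the default value.

```python
import xml.etree.ElementTree as ET

xml_text = """
<book isbn="978-0345391803">
  <title language="en">The Hitchhiker's Guide</title>
</book>
"""

book = ET.fromstring(xml_text)     # root element
title = book.find("title")         # first child element named 'title'

print(book.tag)                    # element name
print(book.get("isbn"))            # attribute value by name
print(title.text)                  # text content of the element
print(title.attrib)                # all attributes as a dict
print(book.get("missing", "n/a"))  # default for an absent attribute
```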

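Content (Text / PCDATA): Parsed Character Data. The text between the start and end tags of an element. In content, < and & must always be escaped using entities or placed within a CDATA section. > needs escaping only in the sequence ]]>, and ' and " only inside attribute values delimited by the same quote character. Escaping all five is always safe.
```
<description>This book is &lt;GREAT&gt; &amp; fun!</description>
```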
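
Escaping by hand is easy to get wrong, so it is usually left to a library. A minimal sketch with Python's standard `xml.sax.saxutils` follows; the `show` element and its `name` attribute are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr, unescape

raw = "This book is <GREAT> & fun!"

# escape() replaces &, < and > with their entities for use in element content
escaped = escape(raw)
print(f"<description>{escaped}</description>")

# quoteattr() escapes an attribute value and adds suitable surrounding quotes
attr = quoteattr('Tom & "Jerry"')
print(f"<show name={attr}/>")

# unescape() reverses escape()
assert unescape(escaped) == raw
```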

Comments: Ignored by parsers. Useful for documentation. Cannot appear inside tags, and cannot contain a double hyphen (--).

<!-- This is a comment -->
<data>Content</data> <!-- Comment after element -->

CDATA Sections: Character Data. Blocks of text that the parser does NOT interpret as markup. Special characters do not need escaping within CDATA. Useful for including code snippets or XML fragments literally. The one sequence a CDATA section cannot contain is ]]>, which ends it.
```
<script>
<![CDATA[
  function check() {
    if (a < b && c > d) { // No need to escape < or & here
      alert("Condition met!");
    }
  }
]]>
</script>
```
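
From the application's side, a CDATA section is just text. The sketch below, assuming Python's standard `xml.etree.ElementTree` and a shortened version of the script above (the `run()` call is a placeholder), shows that the parser returns the literal characters and that re-serializing emits entities rather than a CDATA section.

```python
import xml.etree.ElementTree as ET

xml_text = "<script><![CDATA[if (a < b && c > d) { run(); }]]></script>"
script = ET.fromstring(xml_text)

# The CDATA wrapper is gone; only the literal text remains
print(script.text)

# ElementTree does not preserve CDATA: on output the text is escaped instead
serialized = ET.tostring(script, encoding="unicode")
print(serialized)
```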
Processing Instructions (PIs): Pass information to applications processing the XML document.
```
<?xml-stylesheet type="text/css" href="style.css"?> 
```
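
One way to see what an application receives from a processing instruction is to inspect it with Python's standard `xml.dom.minidom`. This is a sketch only; the empty `root` element is a placeholder so the document is complete.

```python
from xml.dom import Node, minidom

xml_text = '<?xml-stylesheet type="text/css" href="style.css"?><root/>'
doc = minidom.parseString(xml_text)

# Processing instructions before the root element are children of the document
pis = [n for n in doc.childNodes if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]

for pi in pis:
    # target is the name after '<?'; data is the rest, as one uninterpreted string
    print(pi.target, "->", pi.data)
```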
Entities: Predefined shortcuts for special characters. Can also be custom defined in DTDs.
- &lt; -> < (Less Than)
- &gt; -> > (Greater Than)
- &amp; -> & (Ampersand)
- &apos; -> ' (Apostrophe / Single Quote)
- &quot; -> " (Quotation Mark / Double Quote)
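
A parser expands the five predefined entities (and numeric character references) automatically, while any other entity name needs a DTD definition. The sketch below uses Python's standard `xml.etree.ElementTree`; the element `t` and attribute `a` are placeholders.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<t a='&quot;q&quot;'>&lt;b&gt; &amp; &apos;c&apos; &#169;</t>")

print(elem.text)      # entities and the numeric reference are expanded
print(elem.get("a"))  # the same happens inside attribute values

# &nbsp; is an HTML entity, not a predefined XML one, so parsing fails without a DTD
try:
    ET.fromstring("<t>&nbsp;</t>")
    nbsp_ok = True
except ET.ParseError:
    nbsp_ok = False
print(nbsp_ok)
```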

Well-Formed vs. Valid XML

Well-Formed: An XML document that adheres to all the basic syntax rules listed above. ALL XML documents must be well-formed: a conforming parser stops with an error at the first violation instead of guessing at the intended structure.
Valid: An XML document that is both well-formed AND conforms to the rules defined in a Document Type Definition (DTD) or an XML Schema (XSD). Validation ensures the document has the correct structure, element types, attribute types, etc., as defined by its grammar.

Character Encoding

Specifies the character set used (e.g., UTF-8, ISO-8859-1).
Declared in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>
UTF-8 is widely used and recommended as it supports a vast range of characters.
If the declaration is omitted, the XML specification has parsers assume UTF-8, or UTF-16 when the document starts with a byte order mark. Explicit declaration is best practice.
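
To illustrate how the declared encoding drives decoding, the sketch below parses ISO-8859-1 bytes and writes them back as UTF-8 using Python's standard `xml.etree.ElementTree`. The `city` element and its content are made up for the example, and the `xml_declaration` argument of `tostring` assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Bytes in ISO-8859-1: the declaration tells the parser how to decode them
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><city>Zürich</city>'
).encode("iso-8859-1")

city = ET.fromstring(latin1_doc)
print(city.text)

# Re-serialize as UTF-8 bytes with an explicit XML declaration
utf8_doc = ET.tostring(city, encoding="UTF-8", xml_declaration=True)
print(utf8_doc.decode("utf-8"))
```

Parsing from bytes rather than from an already decoded string lets the parser apply the declared encoding itself.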

Namespaces

Avoid naming conflicts when combining elements from different XML vocabularies.
Declared using the xmlns attribute.

Default Namespace: Applies to the element where it's declared and all unprefixed descendant elements, unless a nested element redeclares it. It does not apply to attributes.

<book xmlns="http://example.com/books">
  <title>...</title> <!-- Belongs to http://example.com/books -->
</book>

Prefixed Namespace: Associates a prefix with a namespace URI. Used for elements and attributes.

<root xmlns:bk="http://example.com/books" xmlns:auth="http://example.com/authors">
  <bk:book>
    <bk:title>...</bk:title>
    <auth:author>...</auth:author> <!-- Belongs to http://example.com/authors -->
  </bk:book>
</root>
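
When querying namespaced documents, what matters is the namespace URI, not the prefix. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; the text values "Example Title" and "Example Author" are placeholders for the elided content, and the `ns` mapping is defined by the calling code rather than read from the document.

```python
import xml.etree.ElementTree as ET

xml_text = """
<root xmlns:bk="http://example.com/books" xmlns:auth="http://example.com/authors">
  <bk:book>
    <bk:title>Example Title</bk:title>
    <auth:author>Example Author</auth:author>
  </bk:book>
</root>
"""

# Prefix-to-URI mapping used only for the queries below
ns = {"bk": "http://example.com/books", "auth": "http://example.com/authors"}

root = ET.fromstring(xml_text)
title = root.find("bk:book/bk:title", ns)
author = root.find("bk:book/auth:author", ns)

print(title.tag)   # ElementTree stores names as {namespace-uri}localname
print(author.tag)
print(root.tag)    # unprefixed and no default namespace: no namespace at all
```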

Simple XML Example

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="library.css"?>

<!-- Represents a collection of books -->
<library xmlns="http://example.com/library/v1">
  <book isbn="978-0345391803" available="true">
    <title language="en">The Hitchhiker's Guide to the Galaxy</title>
    <author>Douglas Adams</author>
    <genre>Science Fiction Comedy</genre>
    <summary>
    <![CDATA[
      Follows Arthur Dent after Earth's destruction. <Amazing & Funny!>
    ]]>
    </summary>
  </book>

  <book isbn="978-0743273565">
    <title language="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <genre>Fiction</genre>
    <!-- Needs availability check -->
  </book>

  <empty-element/> <!-- Example of a self-closing empty element -->

</library>
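
As a sketch of how such a document might be consumed, the code below loads a shortened copy of the library example into a list of dictionaries with Python's standard `xml.etree.ElementTree`. The prefix `lib` is an arbitrary choice made in the code, needed because the document uses a default namespace.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<library xmlns="http://example.com/library/v1">
  <book isbn="978-0345391803" available="true">
    <title language="en">The Hitchhiker's Guide to the Galaxy</title>
    <author>Douglas Adams</author>
  </book>
  <book isbn="978-0743273565">
    <title language="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
  </book>
</library>
"""

# Elements in a default namespace still need a qualified name in queries
ns = {"lib": "http://example.com/library/v1"}
root = ET.fromstring(xml_text.encode("utf-8"))

books = [
    {
        "isbn": book.get("isbn"),
        "title": book.findtext("lib:title", namespaces=ns),
        "author": book.findtext("lib:author", namespaces=ns),
        # 'available' is missing on the second book; None marks "not stated"
        "available": book.get("available"),
    }
    for book in root.findall("lib:book", ns)
]

for entry in books:
    print(entry)
```

Note that the attributes `isbn` and `available` are looked up without a prefix, since a default namespace does not apply to attributes.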

By Language

Python

Convert XML to JSON

Install xmltodict

$ pip install xmltodict

Python Code:

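import json
import xmltodict

# Parse the XML file into nested dicts (attributes become "@name" keys)
with open("foo.xml", encoding="utf-8") as f:
    d = xmltodict.parse(f.read())

# Write the result as JSON; the with-blocks close both files
with open("bar.json", "w", encoding="utf-8") as f:
    json.dump(d, f)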