XML Cheatsheet
XML is a markup language designed to store and transport data. It emphasizes structure and is both human-readable and machine-readable.
Basic Syntax Rules
- Root Element: Every XML document MUST have exactly one root element that encloses all other elements.
- Closing Tags: Every opening tag (<tag>) MUST have a corresponding closing tag (</tag>).
  - Empty elements can use self-closing tags: <tag/> (equivalent to <tag></tag>).
- Proper Nesting: Tags must be nested correctly. If <tagB> opens inside <tagA>, it must close before <tagA> closes.
  - Correct: <tagA><tagB>Content</tagB></tagA>
  - Incorrect: <tagA><tagB>Content</tagA></tagB>
- Case Sensitivity: XML tags and attribute names are case-sensitive. <tag> is different from <Tag>.
- Attribute Values Quoted: All attribute values MUST be enclosed in single (') or double (") quotes.
  - Correct: <element attr="value"> or <element attr='value'>
  - Incorrect: <element attr=value>
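The rules above can be checked mechanically. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree parser; the helper name is_well_formed is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Each sample maps to the result the syntax rules predict.
samples = {
    "<tagA><tagB>Content</tagB></tagA>": True,   # proper nesting
    "<tagA><tagB>Content</tagA></tagB>": False,  # tags closed in the wrong order
    "<tag>text</Tag>": False,                    # case mismatch between open and close
    '<element attr="value"/>': True,             # quoted attribute value
    "<element attr=value/>": False,              # unquoted attribute value
    "<a/><b/>": False,                           # more than one root element
    "<tag/>": True,                              # self-closing empty element
}

for text, expected in samples.items():
    assert is_well_formed(text) == expected, text
```

This only tests well-formedness; it says nothing about whether a document matches a DTD or schema.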
Core Components
- XML Declaration (Optional, but Recommended):
- Specifies XML version and character encoding. If present, it MUST be the very first thing in the document: nothing may precede it, not even whitespace or a comment.
<?xml version="1.0" encoding="UTF-8"?>
- Elements: Building blocks defined by tags. Can contain text, other elements, or be empty.
<book>                                    <!-- Element 'book' -->
  <title>The Hitchhiker's Guide</title>   <!-- Element 'title' with text content -->
  <author>Douglas Adams</author>          <!-- Element 'author' with text content -->
  <page-break/>                           <!-- Empty element 'page-break' (self-closing) -->
</book>
- Attributes: Provide additional information (metadata) about elements. Defined within the start tag.
<book isbn="978-0345391803">                            <!-- 'isbn' is an attribute of 'book' -->
  <title language="en">The Hitchhiker's Guide</title>   <!-- 'language' is an attribute of 'title' -->
</book>
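- Content (Text / PCDATA): Parsed Character Data. The text between the start and end tags of an element. In content, < and & MUST always be escaped using entities or placed within a CDATA section. The characters >, ' and " are usually escaped as well, and a quote character must be escaped inside an attribute value delimited by that same quote.
<description>This book is &lt;GREAT&gt; &amp; fun!</description>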
- Comments: Not treated as document content by parsers. Useful for documentation. Cannot appear inside tags and must not contain a double hyphen (--).
<!-- This is a comment --> <data>Content</data> <!-- Comment after element -->
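- CDATA Sections: Character Data. Blocks of text that the parser treats as literal characters rather than markup. Special characters do not need escaping within CDATA; the only sequence that cannot appear inside is ]]>, which ends the section. Useful for including code snippets or XML fragments literally.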
<script>
<![CDATA[
function check() {
  if (a < b && c > d) { // No need to escape < or & here
    alert("Condition met!");
  }
}
]]>
</script>
- Processing Instructions (PIs): Pass information to applications processing the XML document.
<?xml-stylesheet type="text/css" href="style.css"?> <!-- Example: Link a stylesheet -->
- Entities: Predefined shortcuts for special characters. Can also be custom defined in DTDs.
  - &lt; -> < (Less Than)
  - &gt; -> > (Greater Than)
  - &amp; -> & (Ampersand)
  - &apos; -> ' (Apostrophe / Single Quote)
  - &quot; -> " (Quotation Mark / Double Quote)
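Escaping by hand is error-prone, so it is usually delegated to a library. The sketch below assumes Python's standard-library xml.sax.saxutils and xml.etree.ElementTree; it shows that entity escaping and a CDATA section are two routes to the same parsed text. The sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr, unescape

raw = "This book is <GREAT> & fun!"

# escape() replaces &, < and > with the predefined entities.
escaped = escape(raw)
assert escaped == "This book is &lt;GREAT&gt; &amp; fun!"

# unescape() reverses the substitution.
assert unescape(escaped) == raw

# quoteattr() escapes a value and wraps it in quotes for use as an attribute.
assert quoteattr("Tom & Jerry") == '"Tom &amp; Jerry"'

# Both forms parse back to the same text content.
via_entities = ET.fromstring(f"<description>{escaped}</description>")
via_cdata = ET.fromstring(f"<description><![CDATA[{raw}]]></description>")
assert via_entities.text == raw
assert via_cdata.text == raw
```

Note that escape() leaves quote characters alone unless extra entities are passed in, which is why quoteattr() is the safer choice for attribute values.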
Well-Formed vs. Valid XML
- Well-Formed: An XML document that adheres to all the basic syntax rules listed above. ALL XML documents must be well-formed: a conforming parser reports an error and stops rather than trying to recover from a violation.
- Valid: An XML document that is both well-formed AND conforms to the rules defined in a Document Type Definition (DTD) or an XML Schema (XSD). Validation ensures the document has the correct structure, element types, attribute types, etc., as defined by its grammar.
Character Encoding
- Specifies the character set used (e.g., UTF-8, ISO-8859-1).
- Declared in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
- UTF-8 is widely used and recommended as it supports a vast range of characters.
- If the encoding is omitted, parsers assume UTF-8 unless a byte order mark (BOM) indicates UTF-16. Explicit declaration is best practice.
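To see how the declared encoding relates to the bytes on disk, here is a small sketch assuming Python 3.8 or later (for the xml_declaration argument) and the standard-library xml.etree.ElementTree. The element name note and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

note = ET.Element("note")
note.text = "café"

# Serialize the same element twice, each time with a matching declaration.
utf8_bytes = ET.tostring(note, encoding="UTF-8", xml_declaration=True)
latin1_bytes = ET.tostring(note, encoding="ISO-8859-1", xml_declaration=True)

# The byte sequences differ: "é" is two bytes in UTF-8, one in ISO-8859-1.
assert b"\xc3\xa9" in utf8_bytes
assert b"\xe9" in latin1_bytes and b"\xc3\xa9" not in latin1_bytes

# The parser reads the declaration and recovers the same text either way.
assert ET.fromstring(utf8_bytes).text == "café"
assert ET.fromstring(latin1_bytes).text == "café"
```

Passing bytes rather than an already-decoded string to the parser is what lets it honor the encoding named in the declaration.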
Namespaces
- Avoid naming conflicts when combining elements from different XML vocabularies.
- Declared using the xmlns attribute.
- Default Namespace: Applies to the element where it's declared and all unprefixed descendant elements, unless a descendant redeclares it. It does not apply to unprefixed attributes.
<book xmlns="http://example.com/books">
  <title>...</title>   <!-- Belongs to http://example.com/books -->
</book>
- Prefixed Namespace: Associates a prefix with a namespace URI. Used for elements and attributes.
<root xmlns:bk="http://example.com/books" xmlns:auth="http://example.com/authors">
  <bk:book>
    <bk:title>...</bk:title>
    <auth:author>...</auth:author>   <!-- Belongs to http://example.com/authors -->
  </bk:book>
</root>
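When reading namespaced XML, what matters is the namespace URI, not the prefix. The following sketch assumes Python's standard-library xml.etree.ElementTree, which exposes each tag as {uri}localname; the title and author text are placeholders, and the prefixes in the lookup dictionary are deliberately different from those in the document.

```python
import xml.etree.ElementTree as ET

doc = """<root xmlns:bk="http://example.com/books" xmlns:auth="http://example.com/authors">
  <bk:book>
    <bk:title>Example Title</bk:title>
    <auth:author>Example Author</auth:author>
  </bk:book>
</root>"""

root = ET.fromstring(doc)

# Prefixes here are local to this lookup table; only the URIs must match.
ns = {"b": "http://example.com/books", "a": "http://example.com/authors"}

book = root.find("b:book", ns)
title = book.find("b:title", ns)
author = book.find("a:author", ns)

# ElementTree stores the resolved name as {namespace-uri}localname.
assert title.tag == "{http://example.com/books}title"
assert author.tag == "{http://example.com/authors}author"

# A lookup without the namespace finds nothing.
assert book.find("title") is None
```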
Simple XML Example
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="library.css"?>
<!-- Represents a collection of books -->
<library xmlns="http://example.com/library/v1">
  <book isbn="978-0345391803" available="true">
    <title language="en">The Hitchhiker's Guide to the Galaxy</title>
    <author>Douglas Adams</author>
    <genre>Science Fiction Comedy</genre>
    <summary>
      <![CDATA[
      Follows Arthur Dent after Earth's destruction. <Amazing & Funny!>
      ]]>
    </summary>
  </book>
  <book isbn="978-0743273565">
    <title language="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <genre>Fiction</genre>
    <!-- Needs availability check -->
  </book>
  <empty-element/> <!-- Example of a self-closing empty element -->
</library>
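As a rough illustration of how an application might read a document like this one, the sketch below uses Python's standard-library xml.etree.ElementTree on a trimmed copy of the example. The function name list_books and the lib prefix are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above, as bytes so the declaration is honored.
LIBRARY_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<library xmlns="http://example.com/library/v1">
  <book isbn="978-0345391803" available="true">
    <title language="en">The Hitchhiker's Guide to the Galaxy</title>
    <author>Douglas Adams</author>
  </book>
  <book isbn="978-0743273565">
    <title language="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
  </book>
</library>"""

# The document uses a default namespace, so element lookups need its URI.
NS = {"lib": "http://example.com/library/v1"}


def list_books(xml_bytes):
    """Return one dict per <book> with its isbn, title and availability."""
    root = ET.fromstring(xml_bytes)
    books = []
    for book in root.findall("lib:book", NS):
        books.append({
            # Unprefixed attributes are not in the default namespace.
            "isbn": book.get("isbn"),
            "title": book.findtext("lib:title", namespaces=NS),
            # None when the attribute is absent, as for the second book.
            "available": book.get("available"),
        })
    return books


books = list_books(LIBRARY_XML)
```

Elements need the namespace mapping while attributes such as isbn are read by their plain name, which reflects the rule that a default namespace covers elements only.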
By Language
Python
Convert XML to JSON
Install xmltodict
$ pip install xmltodict
Python Code:
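import json
import xmltodict

# Read the XML as bytes so the parser can honor the declared encoding
with open("foo.xml", "rb") as xml_file:
    data = xmltodict.parse(xml_file.read())

# Write the resulting nested dict out as JSON
with open("bar.json", "w") as json_file:
    json.dump(data, json_file)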