Markdown Syntax

Introduction

This document is intended to move Markdown toward a standard syntax. The Markdown format is now widely implemented and deployed on the Web. While the original document by John Gruber (Ref. 1) defines much of what Markdown implementations currently support, some aspects of that document are ambiguous. This document resolves these ambiguities, mostly by choosing the behavior followed by the majority of implementations.

Please note that this document is incomplete and is a work in progress. Comments are appreciated.

Definitions

The whitespace characters are U+000C, U+000D, U+0009, U+000A, and U+0020.

Space is the character U+0020.

Tab is the character U+0009.

A line break is the character sequence U+000D U+000A (carriage return followed by line feed), or either U+000D or U+000A standing alone.

Lines are separated by line breaks.
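The line break definition above can be sketched in Python as follows. This is a minimal illustration, not part of the specification; the names LINE_BREAK and split_lines are hypothetical.

```python
import re

# A line break is CR LF as a pair, or a lone CR, or a lone LF.
# CR LF is listed first so the pair is consumed as one break, not two.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list[str]:
    """Split text into lines, treating CR LF, CR, and LF as line breaks."""
    return LINE_BREAK.split(text)
```

Python's built-in str.splitlines is not used here because it also breaks on characters such as U+000B, U+000C, and U+0085, which this document does not define as line breaks.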

A blank line is a line that contains no characters, or that contains only whitespace characters.

A digit character is a character in the range U+0030 to U+0039.
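As a rough sketch, the character classes and the blank-line test defined above might be written in Python like this. The function names are hypothetical and chosen only for illustration.

```python
# The five whitespace characters listed in the Definitions section.
WHITESPACE = frozenset("\u000C\u000D\u0009\u000A\u0020")


def is_whitespace(ch: str) -> bool:
    """True if ch is one of the five whitespace characters."""
    return ch in WHITESPACE


def is_blank_line(line: str) -> bool:
    """True if the line is empty or holds only whitespace characters."""
    return all(ch in WHITESPACE for ch in line)


def is_digit(ch: str) -> bool:
    """True if ch is a single character in the range U+0030 to U+0039."""
    return len(ch) == 1 and "\u0030" <= ch <= "\u0039"
```

The built-in str.isspace and str.isdigit are avoided on purpose: both accept additional Unicode characters (for example, U+00A0 and superscript digits) that fall outside the definitions given here.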

A non-ordinary line is a line that starts with:

An ordinary line is a line other than a non-ordinary line.

The Base Document is the original Markdown specification by John Gruber (Ref. 1).

Markup Escaping

Characters are markup-escaped using the following method:

Parser

A Markdown stream is parsed in two passes: first line by line, then character by character.

Read each line in the file:

The result is then parsed in another pass using the following process:

Read each character in the stream:

Headers

Setext-style

Each "setext-style" header consists of an ordinary line (the text line) followed by a line starting with "=" or "-" (the header line) and consisting only of the same character.

To get the content of a "setext-style" header:

NOTE: Several implementations do not remove trailing whitespace. However, this is rarely semantically significant.

If the header line starts with "=", the result is a first-level header whose content is the resulting text.

If the header line starts with "-", the result is a second-level header whose content is the resulting text.

After a "setext-style" header, the following lines are ignored until a line containing a non-whitespace character is reached.
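The following Python sketch illustrates one possible reading of the setext-style rules above. It rests on several assumptions: the function names are hypothetical, trailing whitespace is stripped from the text line (as the NOTE suggests most implementations do), and the check that the text line is an ordinary line is omitted because that test is not reproduced here.

```python
WHITESPACE = "\u000C\u000D\u0009\u000A\u0020"


def setext_level(header_line: str):
    """Return 1 for a line of only '=', 2 for a line of only '-', else None."""
    if not header_line:
        return None
    first = header_line[0]
    if first not in "=-":
        return None
    if any(ch != first for ch in header_line):
        return None
    return 1 if first == "=" else 2


def parse_setext(lines, i):
    """Try to read a setext-style header at lines[i] and lines[i + 1].

    Returns (level, content, next_index) or None. The ordinary-line
    check on lines[i] is left out of this sketch.
    """
    if i + 1 >= len(lines):
        return None
    level = setext_level(lines[i + 1])
    if level is None:
        return None
    # Assumption: trailing whitespace is removed from the text line.
    content = lines[i].rstrip(WHITESPACE)
    # Skip following lines until one contains a non-whitespace character.
    j = i + 2
    while j < len(lines) and all(ch in WHITESPACE for ch in lines[j]):
        j += 1
    return level, content, j
```

For example, calling parse_setext on the lines "Title", "=====", "", "text" at index 0 yields a first-level header with content "Title" and resumes at the line "text".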

Atx-style

Each "atx-style" header consists of a line of text that begins with one or more "#" characters followed by one or more whitespace characters.

To get the content of an "atx-style" header:

NOTE: In HTML, only header levels 1 to 6 are supported. An implementation may wish to set N to 6 if it would be greater than 6.

NOTE: Some implementations add an ID to the N-level header based on its content. Doing so is optional.

After an "atx-style" header, the following lines are ignored until a line containing a non-whitespace character is reached.
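A minimal Python sketch of atx-style header recognition follows. It assumes that N is the number of leading "#" characters, that N is capped at 6 as the first NOTE suggests, and that the content is the rest of the line with surrounding whitespace removed. Optional closing "#" characters are not handled, and the name parse_atx is hypothetical.

```python
WHITESPACE = "\u000C\u000D\u0009\u000A\u0020"


def parse_atx(line: str):
    """Return (level, content) for an atx-style header line, else None."""
    # Count the leading "#" characters (assumed here to be N).
    n = len(line) - len(line.lstrip("#"))
    if n == 0:
        return None
    # The "#" run must be followed by at least one whitespace character.
    if n >= len(line) or line[n] not in WHITESPACE:
        return None
    level = min(n, 6)  # cap at 6, the deepest header level in HTML
    # Assumption: surrounding whitespace is not part of the content.
    content = line[n:].strip(WHITESPACE)
    return level, content
```

Under these assumptions, "## Title" gives a second-level header with content "Title", while "#NoSpace" is not treated as a header because no whitespace follows the "#".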

Blockquotes

A blockquote consists of one or more consecutive blockquote lines. A blockquote line sequence:

ISSUE: Implementations are split on whether an "atx-style" header line ends a blockquote line sequence. This is an issue for discussion.

NOTE: Many implementations end the blockquote upon reaching an empty line. The above definition matches the behavior of the majority of implementations.

To get the content of a blockquote from a blockquote line sequence:

Lists

To be done.

Links

To be done.

References

Ref. 1. Gruber, J. "Markdown Syntax Documentation". Daring Fireball, http://daringfireball.net/projects/markdown/syntax.

Ref. 2. "HTML". WHATWG, http://www.whatwg.org/html.

Ref. 3. Berners-Lee, T. et al. "Uniform Resource Identifier (URI): Generic Syntax". STD 66, January 2005.

Author

Peter Occil (poccil14 at gmail dot com)

Any copyright to this specification is released to the Public Domain. http://creativecommons.org/publicdomain/zero/1.0/