Markdown Syntax

Introduction

This document is intended to move Markdown towards a standard syntax. Markdown is now widely implemented and deployed on the Web. While the original document (Ref. 1) by John Gruber defines much of what Markdown implementations currently support, some aspects of that document are ambiguous. This document resolves these ambiguities, mostly by choosing the behavior followed by the majority of implementations.

Please note that this document is incomplete, and is a work in progress. Comments would be appreciated.

Definitions

The whitespace characters are U+000C, U+000D, U+0009, U+000A, and U+0020.

Space is the character U+0020.

Tab is the character U+0009.

A line break is the character sequence U+000D U+000A (carriage return followed by line feed), or U+000D or U+000A standing alone.

Lines are separated by line breaks.

A blank line is a line that contains no characters or only whitespace characters.
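
The line-break and blank-line definitions above can be sketched in Python as follows. This is a minimal illustration, not a normative algorithm; the names WHITESPACE, split_lines, and is_blank are hypothetical and do not appear in the Base Document.

```python
import re

# The whitespace characters defined above: U+000C, U+000D, U+0009, U+000A, U+0020.
WHITESPACE = "\x0c\r\t\n "

def split_lines(text):
    """Split text on CR LF, or on CR or LF standing alone."""
    # "\r\n" is listed first so a CR LF pair counts as a single line break.
    return re.split(r"\r\n|\r|\n", text)

def is_blank(line):
    """True if the line has no characters or only whitespace characters."""
    return all(ch in WHITESPACE for ch in line)
```

For example, split_lines("a\r\nb\rc\nd") yields the four lines "a", "b", "c", and "d".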

A digit character is a character in the range U+0030 to U+0039.

A non-ordinary line is a line that starts with:

">", "-", "*", or "+" characters followed by space or tab;
one or more "#" characters followed by space or tab; or
one or more digit characters followed by "." and either space or tab.

An ordinary line is a line other than a non-ordinary line.
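
The distinction between ordinary and non-ordinary lines can be sketched as a single pattern match. This sketch assumes that exactly one ">", "-", "*", or "+" character precedes the space or tab, which the definition above leaves open; the names NON_ORDINARY, is_non_ordinary, and is_ordinary are hypothetical.

```python
import re

# One marker character, or one or more "#", or digits followed by ".",
# and then a space or tab. [0-9] is used instead of \d so that only
# U+0030 to U+0039 count as digit characters.
NON_ORDINARY = re.compile(r"(?:[>\-*+]|#+|[0-9]+\.)[ \t]")

def is_non_ordinary(line):
    """True if the line starts with one of the three non-ordinary prefixes."""
    return NON_ORDINARY.match(line) is not None

def is_ordinary(line):
    """An ordinary line is any line other than a non-ordinary line."""
    return not is_non_ordinary(line)
```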

The Base Document is the original Markdown specification by John Gruber (Ref. 1).

Markup Escaping

Characters are markup-escaped using the following method:

Replace the "&" character with "&amp;".
Replace the "<" character with "&lt;".
Replace the ">" character with "&gt;".
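
A minimal sketch of markup escaping, assuming the replacements are the HTML character references "&amp;", "&lt;", and "&gt;". The function name markup_escape is hypothetical.

```python
def markup_escape(text):
    """Markup-escape the characters "&", "<", and ">" in text."""
    # "&" is replaced first; otherwise the "&" introduced by "&lt;" and
    # "&gt;" would itself be escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
```

In Python, the standard library call html.escape(text, quote=False) performs the same three replacements.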

Parser

A Markdown stream is parsed in two passes: first line by line, then character by character.

Read each line in the file:

If the line doesn't start with four or more whitespace characters:
- If the line consists of only space, tab, and "_" characters, replace the line with the HTML (Ref. 2) hr element. (See "Horizontal Rules" in the Base Document.)
- If the line consists of only space, tab, and "*" characters, and this is the first line or the line contains at least one space or tab character, replace the line with the HTML hr element. (See "Horizontal Rules" in the Base Document.)
- If the line consists of only space, tab, and "-" characters, and this is the first line or the line contains at least one space or tab character, replace the line with the HTML hr element. (See "Horizontal Rules" in the Base Document.)
If the line consists of only "-" characters or only "*" characters, and the preceding line is an ordinary line, process the preceding line as Markdown and replace both lines with the HTML h2 element with the content equal to the Markdown result (except that the paragraph is not implicitly wrapped in an HTML p element). (See "Setext-style" in "Headers".)
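
The three horizontal-rule checks above can be sketched as one function. This sketch assumes that a line must contain at least one "_", "*", or "-" character to qualify (otherwise a blank line would match), which the rules above do not state explicitly. The names WHITESPACE and is_horizontal_rule are hypothetical.

```python
WHITESPACE = "\x0c\r\t\n "

def is_horizontal_rule(line, is_first_line=False):
    """True if the line would be replaced by an hr element in the line pass."""
    # Lines that start with four or more whitespace characters are skipped.
    if len(line) >= 4 and all(ch in WHITESPACE for ch in line[:4]):
        return False
    for marker in "_*-":
        # Assumption: at least one marker character must be present.
        if marker in line and all(ch in " \t" + marker for ch in line):
            if marker == "_":
                return True
            # "*" and "-" lines also need to be the first line or to
            # contain at least one space or tab.
            return is_first_line or " " in line or "\t" in line
    return False
```

Under these rules "* * *" is a horizontal rule anywhere, while "***" is one only on the first line.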

The result is then parsed in another pass using the following process:

Read each character in the stream:

If the character is "\", read the next character and output that character as a markup-escaped character.
If the character is ">", and this is the start of the line, and the following character is space or tab, read to the end of the line. This is the first line of a blockquote sequence. Read the rest of the blockquote using the rules in the section "Blockquotes".
If the character is "*", "+", or "-", and this is the start of the line, and the following character is space or tab, then this is a list item.
Otherwise, if the character is "*" followed by two more "*" characters, read those extra 2 characters and output a strong-emphasis start tag.
ISSUE: This is not yet complete.
Otherwise, if the character is "*" followed by another "*" character, read that extra character and output an emphasis start tag.
ISSUE: This is not yet complete.
Otherwise, if the character is "!" followed by a "[" character, ...
ISSUE: To be defined. See "Images" in the base document.
Otherwise, if the character is "`" followed by another "`" character, read the rest of the stream up to and including the next "``" character sequence, if any. If a second "``" character sequence is found, take the string of characters read after the first "``" sequence, but not including the second "``" sequence, remove leading and trailing spaces and tabs from that string, and output an HTML code element whose content consists of that string as markup-escaped characters. If no further "``" sequence was found, output two "`" characters and reset the character pointer to just after the "``" sequence. (See "Code Spans" in the Base Document.)
Otherwise, if the character is "`", read the rest of the stream up to and including the next "`" character, if any. If a second "`" character is found, output an HTML code element whose content consists of markup-escaped characters read after the first "`", but not including the second "`". If no further "`" character was found, output a "`" character and reset the character pointer to just after the "`". (See "Code Spans" in the Base Document.)
If the character is "&", it indicates a possible character escape. Read the rest of the character escape using the HTML5 rules. If the escape is a recognized character escape, output that escape as is. Otherwise, output "&" and reset the character pointer to just after the "&".
NOTE: While implementations are split on handling unrecognized character escapes, the result is practically indistinguishable when the resulting HTML is rendered by modern browsers.
If the character is a digit character, and this is the start of the line, and the following characters are 0 or more digit characters followed by "." and space or tab, read to the end of the line. See "???" under "???".
ISSUE: This is not yet complete.
If the character is "#", and this is the start of the line, and the following characters are 0 or more "#" characters followed by space or tab, read to the end of the line. See "Atx-style" under "Headers".
If the character is "<", it indicates a possible markup tag. Read the rest of the tag using the HTML5 rules.
- If the tag is a DOCTYPE declaration, output that tag as markup-escaped characters.
- If the tag is a comment tag, an end tag, or the start tag of a void element, output that tag.
- If the tag is the start tag of a non-void element, read the rest of the stream as HTML until the matching end tag of the non-void element is read.
- If the tag is a "valid e-mail address" according to HTML5 (see section 4.10.5.1.5 of Ref. 2), enclosed in angle brackets, rather than a markup tag, output that tag as a hyperlink whose URI is "mailto:" followed by that e-mail address and whose content is that e-mail address in markup-escaped characters. (See "Automatic Links" in the Base Document.)
  NOTE: The Base Document suggests randomly converting each character to a character escape, a hex escape, or leaving each character alone. Although this behavior is followed by a majority of known implementations, it is considered optional. Moreover, such a random conversion makes the HTML output less deterministic.
- If the tag is a well-formed URI (RFC3986, Ref. 3) enclosed in angle brackets, rather than a markup tag, output that tag as a hyperlink to that URI, whose content is the markup-escaped form of that URI. (See "Automatic Links" in the Base Document.)
- If the tag is not a valid tag, output that tag as markup-escaped characters.
Output the character as a markup-escaped character.
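
As an illustration of the character pass, the two code-span rules above (for "``" and for a single "`") can be sketched as a function that returns the output and the new character pointer. This is a sketch of those two rules only, not of the whole pass; the name read_code_span is hypothetical, and html.escape with quote=False stands in for markup escaping.

```python
from html import escape

def read_code_span(text, pos):
    """Apply the code-span rules at text[pos], which must be a "`" character.

    Returns a pair (output, new_pos), where new_pos is the character
    pointer after the rule has been applied.
    """
    # The "``" rule is tried before the single "`" rule.
    delim = "``" if text.startswith("``", pos) else "`"
    start = pos + len(delim)
    end = text.find(delim, start)
    if end == -1:
        # No closing delimiter: output the delimiter itself and resume
        # just after it.
        return delim, start
    content = text[start:end]
    if delim == "``":
        # Only the "``" form removes leading and trailing spaces and tabs.
        content = content.strip(" \t")
    return "<code>" + escape(content, quote=False) + "</code>", end + len(delim)
```

The "``" form allows a single "`" inside the span, so "`` x ` y ``" produces a code element whose content is "x ` y".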

Headers

Setext-style

Each "setext-style" header consists of an ordinary line (the text line) followed by a line (the header line) that starts with "=" or "-" and consists only of that same character.

To get the content of a "setext-style" header:

Remove leading and trailing whitespace from the text line.
The text is processed as a Markdown paragraph.

NOTE: Several implementations do not remove trailing whitespace. However, this is rarely semantically significant.

If the header line starts with "=", the result is a first-level header whose content is the resulting text.

If the header line starts with "-", the result is a second-level header whose content is the resulting text.

After a "setext-style" header, the following lines are ignored until a line containing a non-whitespace character is reached.
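
The setext-style rules above can be sketched as follows. The sketch assumes the caller has already checked that the text line is an ordinary line, and it returns the trimmed text without processing it as a Markdown paragraph. The names WHITESPACE and parse_setext_header are hypothetical.

```python
WHITESPACE = "\x0c\r\t\n "

def parse_setext_header(text_line, header_line):
    """Return (level, text) for a setext-style header, or None if the
    header line is not made up only of "=" or only of "-" characters."""
    if not header_line or header_line[0] not in "=-":
        return None
    if any(ch != header_line[0] for ch in header_line):
        return None
    # "=" gives a first-level header, "-" a second-level header.
    level = 1 if header_line[0] == "=" else 2
    return level, text_line.strip(WHITESPACE)
```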

Atx-style

Each "atx-style" header consists of a line of text that begins with one or more "#" characters followed by one or more whitespace characters.

To get the content of an "atx-style" header:

Let N be the number of "#" characters that begin the text. Remove those characters.
Remove leading and trailing whitespace from the text.
Remove trailing "#" characters from the text.
Remove trailing whitespace from the text.
The result is an N-level header whose content is the text.

NOTE: In HTML, only header levels 1 to 6 are supported. An implementation may wish to set N to 6 if it would be greater than 6.

NOTE: Some implementations add an ID to the N-level header based on its content. Doing so is optional.
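
The atx-style steps above can be sketched as follows. Limiting N to 6 is the optional behavior mentioned in the first NOTE and is controlled here by a hypothetical clamp parameter; the names WHITESPACE and parse_atx_header are likewise hypothetical. Adding an ID to the header is not shown.

```python
import re

WHITESPACE = "\x0c\r\t\n "

def parse_atx_header(line, clamp=True):
    """Return (level, text) for an atx-style header line, or None."""
    # One or more "#" characters followed by a whitespace character.
    match = re.match(r"(#+)[\x0c\t ]", line)
    if match is None:
        return None
    n = len(match.group(1))
    # Remove the leading "#" run, then trim whitespace on both sides.
    text = line[n:].strip(WHITESPACE)
    # Remove trailing "#" characters, then any whitespace left before them.
    text = text.rstrip("#").rstrip(WHITESPACE)
    if clamp:
        # Optional: HTML supports only header levels 1 to 6.
        n = min(n, 6)
    return n, text
```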

After an "atx-style" header, the following lines are ignored until a line containing a non-whitespace character is reached.

Blockquotes

A blockquote consists of one or more consecutive blockquote lines. A blockquote line sequence:

Begins with a line that starts with a ">" character.
Ends when a line that doesn't start with a ">" character and isn't a blank line or an ordinary line is reached.

ISSUE: Implementations are split on whether an "atx-style" header line ends a blockquote line sequence. This is an issue for discussion.

NOTE: Many implementations end the blockquote upon reaching an empty line. The above definition matches the behavior of the majority of implementations.

To get the content of a blockquote from a blockquote line sequence:

Remove the leading ">" character from all lines in the text that begin with that character.
Remove one leading whitespace character, if any, from each line.
Remove trailing whitespace from each line.
The text is recursively processed as Markdown; as a result, among other things, nested blockquotes are allowed.
The result is a blockquote whose content is the processed text.
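
The first three steps for extracting blockquote content can be sketched as follows. The recursive Markdown processing of the result is left out, and the names WHITESPACE and blockquote_text are hypothetical.

```python
WHITESPACE = "\x0c\r\t\n "

def blockquote_text(lines):
    """Strip blockquote markup from the lines of a blockquote line sequence."""
    result = []
    for line in lines:
        # Remove the leading ">" from lines that begin with it.
        if line.startswith(">"):
            line = line[1:]
        # Remove one leading whitespace character, if any.
        if line and line[0] in WHITESPACE:
            line = line[1:]
        # Remove trailing whitespace.
        result.append(line.rstrip(WHITESPACE))
    return result
```

Because only one ">" is removed per line, a line such as ">> b" becomes "> b", which the recursive processing step would then treat as a nested blockquote.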

Lists

To be done.

Links

To be done.

References

Ref. 1. Gruber, J. "Markdown Syntax Documentation". Daring Fireball, http://daringfireball.net/projects/markdown/syntax.

Ref. 2. "HTML". WHATWG, http://www.whatwg.org/html.

Ref. 3. Berners-Lee, T. et al. "Uniform Resource Identifier (URI): Generic Syntax". STD 66, RFC 3986, January 2005.

Author

Peter Occil (poccil14 at gmail dot com)

Any copyright to this specification is released to the Public Domain. http://creativecommons.org/publicdomain/zero/1.0/