Markdown Syntax
Introduction
This document is intended to move Markdown towards a standard syntax.
The Markdown format is now widely implemented and deployed on the Web. While the
original document (Ref. 1) by John Gruber defines much of what is currently supported
by Markdown implementations, some aspects of that document are ambiguous.
This document resolves these ambiguities,
mostly by choosing the behavior that is followed by the majority of implementations.
Please note that this document is incomplete, and is a work in progress. Comments
would be appreciated.
Definitions
The whitespace characters are U+000C, U+000D, U+0009, U+000A,
and U+0020.
Space is the character U+0020.
Tab is the character U+0009.
A line break is the character sequence U+000D U+000A
(carriage return followed by line feed), or either U+000D or U+000A standing alone.
Lines are separated by line breaks.
A blank line is a line with no characters or a line consisting only of
whitespace characters.
A digit character is a character in the range U+0030 to U+0039.
A non-ordinary line is a line that starts with:
- ">", "-", "*", or "+" characters followed by space or tab;
- one or more "#" characters followed by space or tab; or
- one or more digit characters followed by "." and either space or tab.
An ordinary line is a line other than a non-ordinary line.
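The definitions above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names WHITESPACE, split_lines, is_blank, is_non_ordinary, and is_ordinary are hypothetical, and the sketch assumes that a single ">", "-", "*", or "+" character (rather than a run of them) starts a non-ordinary line.

```python
import re

# The five whitespace characters listed in the definitions.
WHITESPACE = "\u000c\u000d\u0009\u000a\u0020"

def split_lines(text):
    """Split text on CR LF, a lone CR, or a lone LF."""
    return re.split(r"\r\n|\r|\n", text)

def is_blank(line):
    """True for an empty line or a line of only whitespace characters."""
    return all(ch in WHITESPACE for ch in line)

# Assumption: exactly one marker character, or a run of "#",
# or digits plus ".", followed by space or tab.
_NON_ORDINARY = re.compile(r"(?:[>\-*+]|#+|[0-9]+\.)[ \t]")

def is_non_ordinary(line):
    return _NON_ORDINARY.match(line) is not None

def is_ordinary(line):
    return not is_non_ordinary(line)
```

Under this reading, a blank line counts as an ordinary line, since it matches none of the non-ordinary patterns.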
The Base Document is the original Markdown specification by John Gruber (Ref. 1).
Markup Escaping
Characters are markup-escaped using the following method:
- Replace the "&" character with "&amp;".
- Replace the "<" character with "&lt;".
- Replace the ">" character with "&gt;".
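As a rough sketch, markup escaping can be written as three chained replacements. The function name markup_escape is hypothetical, and the sketch assumes the replacement strings are the HTML character references "&amp;", "&lt;", and "&gt;". The "&" replacement has to run first so that the ampersands introduced by the other two replacements are not escaped again.

```python
def markup_escape(text):
    """Escape "&", "<", and ">" for inclusion in HTML output."""
    # "&" must be handled first; otherwise "&lt;" would become "&amp;lt;".
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )
```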
Parser
A Markdown stream is parsed in two passes - line by line, then character by
character.
Read each line in the file:
- If the line doesn't start with four or more whitespace characters:
  - If the line consists of only space, tab, and "_" characters, replace the line
    with the HTML (Ref. 2) hr element.
    (See "Horizontal Rules" in the Base Document.)
  - If the line consists of only space, tab, and "*" characters,
    and this is the first line or the line contains at least one space or tab character,
    replace the line with the HTML hr element.
    (See "Horizontal Rules" in the Base Document.)
  - If the line consists of only space, tab, and "-" characters,
    and this is the first line or the line contains at least one space or tab character,
    replace the line with the HTML hr element.
    (See "Horizontal Rules" in the Base Document.)
  - If the line consists of only "-" characters or only "*" characters,
    and the preceding line is an ordinary line, process the preceding line
    as Markdown and replace both lines with the HTML h2 element whose content is
    the Markdown result (except that the paragraph is not implicitly wrapped in an HTML p
    element).
    (See "Setext-style" in "Headers".)
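The horizontal-rule checks in the first pass might look like the following sketch. The names is_horizontal_rule and starts_with_four_whitespace are hypothetical. The sketch assumes that a line must contain at least one "_", "*", or "-" character to qualify, so that a blank line is never treated as a rule; the text above does not state this explicitly. The underline-style header rule is not covered here.

```python
WHITESPACE = "\u000c\u000d\u0009\u000a\u0020"

def starts_with_four_whitespace(line):
    return len(line) >= 4 and all(ch in WHITESPACE for ch in line[:4])

def is_horizontal_rule(line, is_first_line):
    """Apply the three hr rules of the line-by-line pass to one line."""
    if starts_with_four_whitespace(line):
        return False
    chars = set(line)
    has_gap = " " in chars or "\t" in chars
    # Rule 1: only space, tab, and "_" (assumption: at least one "_").
    if "_" in chars and chars <= {" ", "\t", "_"}:
        return True
    # Rules 2 and 3: only space, tab, and "*" (or "-"); the line must be
    # the first line or contain at least one space or tab.
    for mark in "*-":
        if mark in chars and chars <= {" ", "\t", mark}:
            return is_first_line or has_gap
    return False
```

The extra condition on "*" and "-" lines keeps an unbroken run such as "---" available for the header rule when it follows an ordinary line.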
The result is then parsed in another pass using the following process:
Read each character in the stream:
- If the character is "\", read the next character and output that character
as a markup-escaped character.
- If the character is ">", and this is the start of the line,
and the following character is space or tab, read to the end of the line.
This is the first line of a blockquote sequence. Read the rest of the blockquote
using the rules in the section "Blockquotes".
- If the character is "*", "+", or "-", and this is the start of the line,
and the following character is space or tab, then this is a list item.
- Otherwise, if the character is "*" followed by another "*" character, read that
extra character and output a strong-emphasis start tag.
ISSUE: This is not yet complete.
- Otherwise, if the character is "*", output an emphasis start tag.
ISSUE: This is not yet complete.
- Otherwise, if the character is "!" followed by a "[" character, ...
ISSUE: To be defined. See "Images" in the Base Document.
- Otherwise, if the character is "`" followed by another "`" character,
read the rest of the stream up to and including the next "``" character sequence, if any. If a second "``" character
sequence is found, take the string of characters read after the first "``" sequence,
but not including the second "``" sequence, remove leading and trailing spaces and tabs from that string, and
output an HTML code element whose content consists of that string as markup-escaped characters.
If no further "``" sequence was found, output two "`" characters and reset the character pointer
to just after the "``" sequence. (See "Code Spans" in the Base Document.)
- Otherwise, if the character is "`",
read the rest of the stream up to and including the next "`" character, if any. If a second "`" character
is found, output an HTML code element whose
content consists of the markup-escaped characters read after the first "`",
but not including the second "`".
If no further "`" character was found, output a "`" character and reset the character pointer
to just after the "`". (See "Code Spans" in the Base Document.)
- If the character is "&", it indicates a possible character
escape. Read the rest of the character escape using the HTML5 rules.
If the escape is a recognized character escape, output that escape as is. Otherwise,
output "&" and reset the character pointer to just after the "&".
NOTE: While implementations are split on handling unrecognized character
escapes, the result is practically indistinguishable when the resulting HTML is
rendered by modern browsers.
- If the character is a digit character, if this is the start of the line,
and the following characters are 0 or more digit
characters followed by "." and either space or tab, read to the end of the line. See "???" under "???".
ISSUE: This is not yet complete.
- If the character is "#", if this is the start of the line, and the following characters are 0 or more "#"
characters followed by space or tab, read to the end of the line. See "Atx-style" under "Headers".
- If the character is "<", it indicates a possible markup tag. Read the rest
of the tag using the HTML5 rules.
  - If the tag is a DOCTYPE declaration, output that tag as markup-escaped characters.
  - If the tag is a comment tag, an end tag, or the start tag of a void element, output
    that tag.
  - If the tag is the start tag of a non-void element, read the rest of the stream
    as HTML until the matching end tag of the non-void element is read.
  - If the tag is a "valid e-mail address" according to HTML5 (see section 4.10.5.1.5 of Ref. 2),
    enclosed in angle brackets, rather than a markup
    tag, output that tag as a hyperlink whose URI is "mailto:" followed by that e-mail address
    and whose content is that e-mail address in markup-escaped characters.
    (See "Automatic Links" in the Base Document.)
    NOTE: The Base Document suggests randomly converting each character to a
    character escape, a hex escape, or leaving each character alone. Although this behavior
    is followed by a majority of known implementations, it is considered optional. Moreover,
    such a random conversion makes the HTML output less deterministic.
  - If the tag is a well-formed URI (RFC 3986, Ref. 3) enclosed in angle brackets, rather than a markup
    tag, output that tag as a hyperlink to that URI, whose content is the markup-escaped
    form of that URI. (See "Automatic Links" in the Base Document.)
  - If the tag is not a valid tag, output that tag as markup-escaped characters.
- Output the character as a markup-escaped character.
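To illustrate the character-by-character pass, here is a partial sketch covering only three of the rules above: the backslash rule, the two code-span rules, and the fallback rule. The names scan_inline and markup_escape are hypothetical. Emphasis, lists, headers, character escapes, and tags are deliberately left out, so this sketch always escapes "&" and "<", which a full parser would not do. It also assumes that a backslash at the very end of the stream is output as is.

```python
def markup_escape(text):
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def scan_inline(text):
    """Handle backslash escapes and code spans; escape everything else."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Output the character after the backslash, markup-escaped.
            out.append(markup_escape(text[i + 1]))
            i += 2
        elif text.startswith("``", i):
            end = text.find("``", i + 2)
            if end == -1:
                # No closing sequence: output "``" and continue after it.
                out.append("``")
                i += 2
            else:
                body = text[i + 2:end].strip(" \t")
                out.append("<code>" + markup_escape(body) + "</code>")
                i = end + 2
        elif ch == "`":
            end = text.find("`", i + 1)
            if end == -1:
                out.append("`")
                i += 1
            else:
                out.append("<code>" + markup_escape(text[i + 1:end]) + "</code>")
                i = end + 1
        else:
            out.append(markup_escape(ch))
            i += 1
    return "".join(out)
```

Checking for the two-backtick form before the single-backtick form mirrors the order of the rules, which lets a double-backtick span contain a literal "`" character.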
Headers
Setext-style
Each "setext-style" header consists of an ordinary line (the text line)
followed by a line starting with "=" or "-" (the header line) and consisting
only of the same character.
To get the content of a "setext-style" header:
- Remove leading and trailing whitespace from the text line.
- The text is processed as a Markdown paragraph.
NOTE: Several implementations do not remove trailing whitespace.
However, this is rarely semantically significant.
If the header line starts with "=", the result is a first-level header whose content is
the resulting text.
If the header line starts with "-", the result is a second-level header whose content is
the resulting text.
After a "setext-style" header, the following lines are ignored until a line containing
a non-whitespace character is reached.
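A minimal sketch of the setext-style rules follows. The function name parse_setext_header is hypothetical. The sketch returns the trimmed text without processing it as a Markdown paragraph, and it leaves it to the caller to check that the text line is an ordinary line.

```python
WHITESPACE = "\u000c\u000d\u0009\u000a\u0020"

def parse_setext_header(text_line, header_line):
    """Return (level, content), or None if header_line is not a header line."""
    if not header_line or header_line[0] not in "=-":
        return None
    mark = header_line[0]
    # The header line must consist only of the character it starts with.
    if any(ch != mark for ch in header_line):
        return None
    level = 1 if mark == "=" else 2
    # Leading and trailing whitespace is removed from the text line.
    content = text_line.strip(WHITESPACE)
    return level, content
```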
Atx-style
Each "atx-style" header consists of a line of text that begins with any number of "#"
characters followed by one or more whitespace characters.
To get the content of an "atx-style" header:
- Let N be the number of "#" characters that begin the text.
Remove those characters.
- Remove leading and trailing whitespace from the text.
- Remove trailing "#" characters from the text.
- Remove trailing whitespace from the text.
- The result is an N-level header whose content is the text.
NOTE: In HTML, only header levels 1 to 6 are supported. An implementation
may wish to set N to 6 if it would be greater than 6.
NOTE: Some implementations add an ID to the N-level header based on its
content. Doing so is optional.
After an "atx-style" header, the following lines are ignored until a line containing
a non-whitespace character is reached.
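The atx-style steps can be sketched as below. The function name parse_atx_header and the clamp parameter are hypothetical; clamping the level to 6 follows the optional behavior mentioned in the first NOTE. Generating an ID for the header is not attempted.

```python
WHITESPACE = "\u000c\u000d\u0009\u000a\u0020"

def parse_atx_header(line, clamp=True):
    """Return (level, text), or None if the line is not an atx-style header."""
    # N is the number of "#" characters that begin the line.
    n = len(line) - len(line.lstrip("#"))
    if n == 0 or len(line) == n or line[n] not in WHITESPACE:
        return None
    text = line[n:].strip(WHITESPACE)   # remove leading and trailing whitespace
    text = text.rstrip("#")             # remove trailing "#" characters
    text = text.rstrip(WHITESPACE)      # remove trailing whitespace again
    if clamp:
        n = min(n, 6)                   # HTML only defines levels 1 to 6
    return n, text
```

Note that, read literally, the trailing "#" step also strips a "#" that belongs to the header text itself, so a header ending in "C#" would lose its final character under this sketch.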
Blockquotes
A blockquote consists of one or more consecutive blockquote lines.
A blockquote line sequence:
- Begins with a line that starts with a ">" character.
- Ends when a line that doesn't start with a ">" character
and isn't a blank line or an ordinary line is reached.
ISSUE: Implementations are split on whether an "atx-style" header line ends
a blockquote line sequence. This is an issue for discussion.
NOTE: Many implementations end the blockquote upon reaching an empty line.
The above definition matches the behavior of the majority of implementations.
To get the content of a blockquote from a blockquote line sequence:
- Remove the leading ">" character from all lines in the text
that begin with that character.
- Remove one leading whitespace character, if any, from each line.
- Remove trailing whitespace from each line.
- The text is recursively processed as Markdown; as a result, among
other things, nested blockquotes are allowed.
- The result is a blockquote whose content is the processed text.
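The first three content-extraction steps can be sketched as follows. The function name blockquote_content is hypothetical, and the recursive Markdown processing of the resulting text is omitted. The sketch takes the lines of one blockquote line sequence and returns the stripped lines.

```python
WHITESPACE = "\u000c\u000d\u0009\u000a\u0020"

def blockquote_content(lines):
    """Strip blockquote markers from the lines of a blockquote line sequence."""
    result = []
    for line in lines:
        # Remove the leading ">" from lines that begin with it.
        if line.startswith(">"):
            line = line[1:]
        # Remove one leading whitespace character, if any.
        if line and line[0] in WHITESPACE:
            line = line[1:]
        # Remove trailing whitespace.
        result.append(line.rstrip(WHITESPACE))
    return result
```

Because only one ">" is removed per line, a line such as ">> nested" still starts with ">" afterwards, which is what allows the recursive step to produce a nested blockquote.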
Lists
To be done.
Links
To be done.
References
Ref. 1. Gruber, J. "Markdown Syntax Documentation". Daring Fireball,
http://daringfireball.net/projects/markdown/syntax.
Ref. 2. "HTML". WHATWG,
http://www.whatwg.org/html.
Ref. 3. Berners-Lee, T. et al. "Uniform Resource Identifier (URI): Generic Syntax". STD 66, RFC 3986, January 2005.
Author
Peter Occil (poccil14 at gmail dot com)
Any copyright to this specification is released to the Public Domain. http://creativecommons.org/publicdomain/zero/1.0/