Tabs in lines are not expanded to spaces. However, in contexts where spaces help to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.
Thus, for example, a tab can be used instead of four spaces in an indented code block. (Note, however, that internal tabs are passed through as literal tabs, not expanded to spaces.)
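The tab-stop rule can be illustrated with a small sketch. The helper name `indent_columns` is hypothetical and not part of this specification; it only measures leading indentation and does not rewrite the line, which matches the statement that tabs are not actually expanded.

```python
def indent_columns(line: str, tab_stop: int = 4) -> int:
    """Count columns of leading indentation.

    A space advances one column; a tab advances to the next multiple
    of tab_stop. Counting stops at the first other character.
    """
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            column += tab_stop - (column % tab_stop)
        else:
            break
    return column
```

Under this sketch a leading tab and two spaces followed by a tab both count as four columns, so either can introduce an indented code block.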
Any ASCII punctuation character may be backslash-escaped:
Backslashes before other characters are treated as literal backslashes:
Escaped characters are treated as regular characters and do not have their usual Markdown meanings:
If a backslash is itself escaped, the following character is not:
A backslash at the end of the line is a hard line break:
Backslash escapes do not work in code blocks, code spans, autolinks, or raw HTML:
But they work in all other contexts, including URLs and link titles, link references, and info strings in fenced code blocks:
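As a rough illustration of the escaping rules above, the following sketch removes the backslash from escaped ASCII punctuation and leaves every other backslash in place. The function name `unescape_backslashes` is hypothetical, and the sketch assumes its input is ordinary text: it does not skip code blocks, code spans, autolinks, or raw HTML, and it does not handle the hard line break case.

```python
import re
import string

# string.punctuation is exactly the 32 ASCII punctuation characters.
_ESCAPE = re.compile(r"\\([" + re.escape(string.punctuation) + r"])")


def unescape_backslashes(text: str) -> str:
    """Drop the backslash in front of each escaped ASCII punctuation character."""
    return _ESCAPE.sub(r"\1", text)
```

Because the pattern is applied left to right, an escaped backslash consumes both characters, so the character after it is not treated as escaped.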
Valid HTML entity references and numeric character references can be used in place of the corresponding Unicode character, with the following exceptions:
Entity and character references are not recognized in code blocks and code spans.
Entity and character references cannot stand in place of special characters that define structural elements in CommonMark. For example, although &#42; can be used in place of a literal * character, &#42; cannot replace * in emphasis delimiters, bullet list markers, or thematic breaks.
Conforming CommonMark parsers need not store information about whether a particular character was represented in the source using a Unicode character or an entity reference.
Entity references consist of & + any of the valid HTML5 entity names + ;.
The document https://html.spec.whatwg.org/entities.json is used as an
authoritative source for the valid entity references and their corresponding
code points.
Decimal numeric character references consist of &# + a string of 1–7 arabic digits + ;. A numeric character reference is parsed as the corresponding Unicode character. Invalid Unicode code points will be replaced by the REPLACEMENT CHARACTER (U+FFFD). For security reasons, the code point U+0000 will also be replaced by U+FFFD.
Hexadecimal numeric character references consist of &# + either X or x + a string of 1–6 hexadecimal digits + ;. They too are parsed as the corresponding Unicode character (this time specified with a hexadecimal numeral instead of decimal).
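The two numeric forms can be sketched as a single substitution. The name `decode_numeric_refs` is hypothetical. The text does not say which code points count as invalid, so this sketch assumes that means surrogates (U+D800 to U+DFFF) and values above U+10FFFF.

```python
import re

# Decimal: 1-7 digits. Hexadecimal: x or X followed by 1-6 hex digits.
_NUMERIC_REF = re.compile(r"&#(?:([0-9]{1,7})|[xX]([0-9a-fA-F]{1,6}));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            code_point = int(match.group(1))
        else:
            code_point = int(match.group(2), 16)
        # Assumption: "invalid" covers surrogates and out-of-range values.
        invalid = code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF
        if code_point == 0 or invalid:
            return "\ufffd"
        return chr(code_point)

    return _NUMERIC_REF.sub(replace, text)
```

References with too many digits do not match the pattern and are left as literal text.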
Here are some nonentities:
Although HTML5 does accept some entity references without a trailing semicolon (such as &copy), these are not recognized here, because it makes the grammar too ambiguous:
Strings that are not on the list of HTML5 named entities are not recognized as entity references either:
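A minimal sketch of named entity lookup, assuming Python's standard `html.entities.html5` table is an acceptable stand-in for the entities.json list (it mirrors the HTML5 names, but it is not the authoritative source named above). The function name `decode_named_refs` is hypothetical.

```python
import re
from html.entities import html5

_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_named_refs(text: str) -> str:
    """Replace &name; with its character(s) when name is a known HTML5 entity."""

    def replace(match: re.Match) -> str:
        # Keys in html5 include the trailing semicolon, e.g. "copy;".
        return html5.get(match.group(1) + ";", match.group(0))

    return _NAMED_REF.sub(replace, text)
```

Since the pattern requires the semicolon, forms without it stay literal, and unknown names fall through unchanged.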
Entity and numeric character references are recognized in any context besides code spans or code blocks, including URLs, link titles, and fenced code block info strings:
Entity and numeric character references are treated as literal text in code spans and code blocks:
Entity and numeric character references cannot be used in place of symbols indicating structure in CommonMark documents.
We can think of a document as a sequence of blocks—structural elements like paragraphs, block quotations, lists, headings, rules, and code blocks. Some blocks (like block quotes and list items) contain other blocks; others (like headings and paragraphs) contain inline content—text, links, emphasized text, images, code spans, and so on.
Indicators of block structure always take precedence over indicators of inline structure. So, for example, the following is a list with two items, not a list with one item containing a code span:
This means that parsing can proceed in two steps: first, the block structure of the document can be discerned; second, text lines inside paragraphs, headings, and other block constructs can be parsed for inline structure. The second step requires information about link reference definitions that will be available only at the end of the first step. Note that the first step requires processing lines in sequence, but the second can be parallelized, since the inline parsing of one block element does not affect the inline parsing of any other.
We can divide blocks into two types: container blocks, which can contain other blocks, and leaf blocks, which cannot.
This section describes the different kinds of leaf block that make up a Markdown document.
A line consisting of optionally up to three spaces of indentation, followed by a sequence of three or more matching -, _, or * characters, each followed optionally by any number of spaces or tabs, forms a thematic break.
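The definition above translates fairly directly into a regular expression. This is a sketch for a single line taken in isolation; the name `is_thematic_break` is hypothetical, leading tabs are not considered, and the precedence rules involving setext headings and list items are not modeled.

```python
import re

# Up to three spaces, then the same marker three or more times,
# each occurrence optionally followed by spaces or tabs.
_THEMATIC_BREAK = re.compile(r"^ {0,3}([-_*])[ \t]*(?:\1[ \t]*){2,}$")


def is_thematic_break(line: str) -> bool:
    """Return True if the line, on its own, has the form of a thematic break."""
    return _THEMATIC_BREAK.match(line) is not None
```

The backreference `\1` enforces that all marker characters match, which is why a line mixing * and - is rejected.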
Wrong characters:
Not enough characters:
Up to three spaces of indentation are allowed:
Four spaces of indentation is too many:
More than three characters may be used:
Spaces and tabs are allowed between the characters:
Spaces and tabs are allowed at the end:
However, no other characters may occur in the line:
It is required that all of the characters other than spaces or tabs be the same. So, this is not a thematic break:
Thematic breaks do not need blank lines before or after:
Thematic breaks can interrupt a paragraph:
If a line of dashes that meets the above conditions for being a thematic break could also be interpreted as the underline of a setext heading, the interpretation as a setext heading takes precedence. Thus, for example, this is a setext heading, not a paragraph followed by a thematic break:
If you want a thematic break in a list item, use a different bullet:
An ATX heading consists of a string of characters, parsed as inline content, between an opening sequence of 1–6 unescaped # characters and an optional closing sequence of any number of unescaped # characters. The opening sequence of # characters must be followed by spaces or tabs, or by the end of line. The optional closing sequence of #s must be preceded by spaces or tabs and may be followed by spaces or tabs only. The opening # character may be preceded by up to three spaces of indentation. The raw contents of the heading are stripped of leading and trailing spaces or tabs before being parsed as inline content. The heading level is equal to the number of # characters in the opening sequence.
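One way to read this definition is as a two-stage match: first the opening sequence, then an optional closing sequence. The sketch below returns the heading level and the raw content, before inline parsing. The name `parse_atx_heading` is hypothetical and leading tabs are not handled.

```python
import re
from typing import Optional, Tuple

# Opening sequence of 1-6 # characters, followed by spaces/tabs or end of line.
_ATX = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*?))?[ \t]*$")


def parse_atx_heading(line: str) -> Optional[Tuple[int, str]]:
    """Return (level, raw_content) for an ATX heading line, else None."""
    match = _ATX.match(line)
    if match is None:
        return None
    level = len(match.group(1))
    content = match.group(2) or ""
    if re.fullmatch(r"#+", content):
        # The content is nothing but a closing sequence: empty heading.
        content = ""
    else:
        # A closing sequence must be preceded by spaces or tabs.
        content = re.sub(r"[ \t]+#+$", "", content)
    return level, content
```

An escaped closing sequence survives because the backslash, not a space or tab, sits directly before the # characters.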
Simple headings:
More than six # characters is not a heading:
At least one space or tab is required between the # characters and the heading’s contents, unless the heading is empty. Note that many implementations currently do not require the space. However, the space was required by the original ATX implementation, and it helps prevent things like the following from being parsed as headings:
This is not a heading, because the first # is escaped:
Contents are parsed as inlines:
Leading and trailing spaces or tabs are ignored in parsing inline content:
Up to three spaces of indentation are allowed:
Four spaces of indentation is too many:
ATX headings need not be separated from surrounding content by blank lines, and they can interrupt paragraphs:
ATX headings can be empty:
A setext heading consists of one or more lines of text, not interrupted by a blank line, of which the first line does not have more than 3 spaces of indentation, followed by a setext heading underline. The lines of text must be such that, were they not followed by the setext heading underline, they would be interpreted as a paragraph: they cannot be interpretable as a code fence, ATX heading, block quote, thematic break, list item, or HTML block.
A setext heading underline is a sequence of = characters or a sequence of - characters, with no more than 3 spaces of indentation and any number of trailing spaces or tabs.
The heading is a level 1 heading if = characters are used in the setext heading underline, and a level 2 heading if - characters are used. The contents of the heading are the result of parsing the preceding lines of text as CommonMark inline content.
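Recognizing the underline and deriving the heading level can be sketched as follows. The name `setext_underline_level` is hypothetical, and the check covers only the underline line itself, not whether the preceding lines qualify as paragraph text.

```python
import re
from typing import Optional

_UNDERLINE = re.compile(r"^ {0,3}(=+|-+)[ \t]*$")


def setext_underline_level(line: str) -> Optional[int]:
    """Return 1 for an = underline, 2 for a - underline, None otherwise."""
    match = _UNDERLINE.match(line)
    if match is None:
        return None
    return 1 if match.group(1)[0] == "=" else 2
```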
In general, a setext heading need not be preceded or followed by a blank line. However, it cannot interrupt a paragraph, so when a setext heading comes after a paragraph, a blank line is needed between them.
Simple examples:
The content of the heading may span more than one line:
The contents are the result of parsing the heading’s raw content as inlines. The heading’s raw content is formed by concatenating the lines and removing initial and final spaces or tabs.
The underlining can be any length:
The heading content can be preceded by up to three spaces of indentation, and need not line up with the underlining:
Four spaces of indentation is too many:
The setext heading underline can be preceded by up to three spaces of indentation, and may have trailing spaces or tabs:
Four spaces of indentation is too many:
The setext heading underline cannot contain internal spaces or tabs:
Trailing spaces or tabs in the content line do not cause a hard line break:
Nor does a backslash at the end:
Since indicators of block structure take precedence over indicators of inline structure, the following are setext headings:
A blank line is needed between a paragraph and a following setext heading, since otherwise the paragraph becomes part of the heading’s content:
But in general a blank line is not required before or after setext headings:
Setext headings cannot be empty:
Setext heading text lines must not be interpretable as block constructs other than paragraphs. So, the line of dashes in these examples gets interpreted as a thematic break:
If you want a heading with > foo as its literal text, you can use backslash escapes:
Compatibility note: Most existing Markdown implementations do not allow the text of setext headings to span multiple lines. But there is no consensus about how to interpret
Foo
bar
---
baz
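One can find four different interpretations:
1. paragraph "Foo", heading "bar", paragraph "baz"
2. paragraph "Foo bar", thematic break, paragraph "baz"
3. paragraph "Foo bar --- baz"
4. heading "Foo bar", paragraph "baz"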
We find interpretation 4 most natural, and interpretation 4 increases the expressive power of CommonMark, by allowing multiline headings. Authors who want interpretation 1 can put a blank line after the first paragraph:
Authors who want interpretation 2 can put blank lines around the thematic break,
or use a thematic break that cannot count as a setext heading underline, such as
Authors who want interpretation 3 can use backslash escapes:
An indented code block is composed of one or more indented chunks separated by blank lines. An indented chunk is a sequence of non-blank lines, each preceded by four or more spaces of indentation. The contents of the code block are the literal contents of the lines, including trailing line endings, minus four spaces of indentation. An indented code block has no info string.
An indented code block cannot interrupt a paragraph, so there must be a blank line between a paragraph and a following indented code block. (A blank line is not needed, however, between a code block and a following paragraph.)
If there is any ambiguity between an interpretation of indentation as a code block and as indicating that material belongs to a list item, the list item interpretation takes precedence:
The contents of a code block are literal text, and do not get parsed as Markdown:
Here we have three chunks separated by blank lines:
Any initial spaces or tabs beyond four spaces of indentation will be included in the content, even in interior blank lines:
An indented code block cannot interrupt a paragraph. (This allows hanging indents and the like.)
However, any non-blank line with fewer than four spaces of indentation ends the code block immediately. So a paragraph may occur immediately after indented code:
And indented code can occur immediately before and after other kinds of blocks:
The first line can be preceded by more than four spaces of indentation:
Blank lines preceding or following an indented code block are not included in it:
Trailing spaces or tabs are included in the code block’s content:
A code fence is a sequence of at least three consecutive backtick characters (`) or tildes (~). (Tildes and backticks cannot be mixed.) A fenced code block begins with a code fence, preceded by up to three spaces of indentation.
The line with the opening code fence may optionally contain some text following the code fence; this is trimmed of leading and trailing spaces or tabs and called the info string. If the info string comes after a backtick fence, it may not contain any backtick characters. (The reason for this restriction is that otherwise some inline code would be incorrectly interpreted as the beginning of a fenced code block.)
The content of the code block consists of all subsequent lines, until a closing code fence of the same type as the code block began with (backticks or tildes), and with at least as many backticks or tildes as the opening code fence. If the leading code fence is preceded by N spaces of indentation, then up to N spaces of indentation are removed from each line of the content (if present). (If a content line is not indented, it is preserved unchanged. If it is indented N spaces or less, all of the indentation is removed.)
The closing code fence may be preceded by up to three spaces of indentation, and may be followed only by spaces or tabs, which are ignored. If the end of the containing block (or document) is reached and no closing code fence has been found, the code block contains all of the lines after the opening code fence until the end of the containing block (or document). (An alternative spec would require backtracking in the event that a closing code fence is not found. But this makes parsing much less efficient, and there seems to be no real downside to the behavior described here.)
A fenced code block may interrupt a paragraph, and does not require a blank line either before or after.
The content of a code fence is treated as literal text, not parsed as inlines. The first word of the info string is typically used to specify the language of the code sample, and rendered in the class attribute of the code tag. However, this spec does not mandate any particular treatment of the info string.
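The fence rules above can be sketched as a small line-based reader. The name `read_fenced_block` and its return shape are hypothetical. The sketch assumes the caller has already split the containing block into lines without line endings and that the first line is the candidate opening fence.

```python
import re
from typing import List, Optional, Tuple

_OPENING = re.compile(r"^( {0,3})(`{3,}|~{3,})(.*)$")


def read_fenced_block(lines: List[str]) -> Optional[Tuple[str, List[str], int]]:
    """Read a fenced code block starting at lines[0].

    Returns (info_string, content_lines, lines_consumed), or None if
    lines[0] is not a valid opening fence.
    """
    match = _OPENING.match(lines[0])
    if match is None:
        return None
    indent = len(match.group(1))
    fence = match.group(2)
    info = match.group(3).strip(" \t")
    if fence[0] == "`" and "`" in info:
        return None  # backtick fences may not have backticks in the info string
    # Closing fence: same character, at least as long, nothing else but spaces/tabs.
    closing = re.compile(
        r"^ {0,3}" + re.escape(fence[0]) + "{" + str(len(fence)) + r",}[ \t]*$"
    )
    content: List[str] = []
    for index, line in enumerate(lines[1:], start=1):
        if closing.match(line):
            return info, content, index + 1
        # Remove up to `indent` leading spaces from each content line.
        removed = 0
        while removed < indent and removed < len(line) and line[removed] == " ":
            removed += 1
        content.append(line[removed:])
    # No closing fence: the block runs to the end of the input.
    return info, content, len(lines)
```

No backtracking is needed: once the opening fence is accepted, every following line is either the closing fence or content.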
Here is a simple example with backticks:
With tildes:
Fewer than three backticks is not enough:
The closing code fence must use the same character as the opening fence:
The closing code fence must be at least as long as the opening fence:
Unclosed code blocks are closed by the end of the document (or the enclosing block quote or list item):
A code block can have all empty lines as its content:
A code block can be empty:
Four spaces of indentation is too many:
Closing fences may be preceded by up to three spaces of indentation, and their indentation need not match that of the opening fence:
This is not a closing fence, because it is indented 4 spaces:
Code fences (opening and closing) cannot contain internal spaces or tabs:
Fenced code blocks can interrupt paragraphs, and can be followed directly by paragraphs, without a blank line between:
Other blocks can also occur before and after fenced code blocks without an intervening blank line:
An info string can be provided after the opening code fence. Although this spec doesn’t mandate any particular treatment of the info string, the first word is typically used to specify the language of the code block. In HTML output, the language is normally indicated by adding a class to the code element consisting of language- followed by the language name.
Info strings for backtick code blocks cannot contain backticks:
Closing code fences cannot have info strings:
An HTML block is a group of lines that is treated as raw HTML (and will not be escaped in HTML output).
There are two kinds of HTML block, which can be defined by their start and end conditions. The block begins with a line that meets a start condition (after up to three optional spaces of indentation). It ends with the first subsequent line that meets a matching end condition, or the last line of the document, or the last line of the container block containing the current HTML block, if no line is encountered that meets the end condition. If the first line meets both the start condition and the end condition, the block will contain just that line.
Start condition: line begins with the string <pre, <script, <style, or <textarea (case-insensitive), followed by a space, a tab, the string >, or the end of the line.
End condition: line contains an end tag </pre>, </script>, </style>, or </textarea> (case-insensitive; it need not match the start tag).
Start condition: line begins with a complete open tag (with any tag name other than pre, script, style, or textarea) or a complete closing tag, followed by zero or more spaces and tabs, followed by the end of the line.
End condition: line is followed by a blank line.
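The first pair of conditions (the pre, script, style, and textarea case) can be sketched with two regular expressions. The name `type1_block_length` is hypothetical, and the second kind of block is not covered because recognizing a complete open or closing tag needs the full tag grammar.

```python
import re
from typing import List

_TYPE1_START = re.compile(
    r"^ {0,3}<(?:pre|script|style|textarea)(?:[ \t>]|$)", re.IGNORECASE
)
_TYPE1_END = re.compile(r"</(?:pre|script|style|textarea)>", re.IGNORECASE)


def type1_block_length(lines: List[str]) -> int:
    """Return how many lines the HTML block starting at lines[0] spans.

    Returns 0 if lines[0] does not meet the start condition.
    """
    if _TYPE1_START.match(lines[0]) is None:
        return 0
    for index, line in enumerate(lines):
        # The first line is checked too: it may meet both conditions.
        if _TYPE1_END.search(line):
            return index + 1
    return len(lines)  # no end condition met: runs to the end of the input
```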
HTML blocks continue until they are closed by their appropriate end condition, or the last line of the document or other container block. This means any HTML within an HTML block that might otherwise be recognised as a start condition will be ignored by the parser and passed through as-is, without changing the parser’s state.
All types of HTML blocks except type 2 may interrupt a paragraph. (This restriction is intended to prevent unwanted interpretation of long tags inside a wrapped paragraph as starting HTML blocks.)
In type 2 blocks, the tag name can be anything:
These rules are designed to allow us to work with tags that can function as either block-level or inline-level tags. The <del> tag is a nice example. We can surround content with <del> tags in three different ways. In this case, we get a raw HTML block, because the <del> tag is on a line by itself:
Finally, in this case, the <del> tags are interpreted as raw HTML inside the CommonMark paragraph. (Because the tag is not on a line by itself, we get inline HTML rather than an HTML block.)
HTML tags designed to contain literal content (pre, script, style, textarea), comments, processing instructions, and declarations are treated somewhat differently. Instead of ending at the first blank line, these blocks end at the first line containing a corresponding end tag. As a result, these blocks can contain blank lines:
A pre tag (type 1):
A script tag (type 1):
A textarea tag (type 1):
A style tag (type 1):
If there is no matching end tag, the block will end at the end of the document (or the enclosing block quote or list item):
The end tag can occur on the same line as the start tag:
Note that anything on the last line after the end tag will be included in the HTML block:
HTML blocks of type 2 cannot interrupt a paragraph:
A link reference definition consists of a link label, optionally preceded by up to three spaces of indentation, followed by a colon (:), optional spaces or tabs (including up to one line ending), a link destination, optional spaces or tabs (including up to one line ending), and an optional link title, which if it is present must be separated from the link destination by spaces or tabs. No further character may occur.
A link reference definition does not correspond to a structural element of a document. Instead, it defines a label which can be used in reference links and reference-style images elsewhere in the document. Link reference definitions can come either before or after the links that use them.
The title may extend over multiple lines:
However, it may not contain a blank line:
The title may be omitted:
The link destination may not be omitted:
However, an empty link destination may be specified using angle brackets:
The title must be separated from the link destination by spaces or tabs:
Both title and destination can contain backslash escapes and literal backslashes:
A link can come before its corresponding definition:
If there are several matching definitions, the first one takes precedence:
As noted in the section on Links, matching of labels is case-insensitive (see matches).
Whether something is a link reference definition is independent of whether the link reference it defines is used in the document. Thus, for example, the following document contains just a link reference definition, and no visible content:
Here is another one:
This is not a link reference definition, because there are characters other than spaces or tabs after the title:
This is a link reference definition, but it has no title:
This is not a link reference definition, because it is indented four spaces:
This is not a link reference definition, because it occurs inside a code block:
A link reference definition cannot interrupt a paragraph.
However, it can directly follow other block elements, such as headings and thematic breaks, and it need not be followed by a blank line.
Several link reference definitions can occur one after another, without intervening blank lines.
Link reference definitions can occur inside block containers, like lists and block quotations. They affect the entire document, not just the container in which they are defined:
A sequence of non-blank lines that cannot be interpreted as other kinds of blocks forms a paragraph. The contents of the paragraph are the result of parsing the paragraph’s raw content as inlines. The paragraph’s raw content is formed by concatenating the lines and removing initial and final spaces or tabs.
A simple example with two paragraphs:
Paragraphs can contain multiple lines, but no blank lines:
Multiple blank lines between paragraphs have no effect:
Leading spaces or tabs are skipped:
Lines after the first may be indented any amount, since indented code blocks cannot interrupt paragraphs.
However, the first line may be preceded by up to three spaces of indentation. Four spaces of indentation is too many:
Blank lines between block-level elements are ignored, except for the role they play in determining whether a list is tight or loose.
Blank lines at the beginning and end of the document are also ignored.
A container block is a block that has other blocks as its contents. There are two basic kinds of container blocks: block quotes and list items. Lists are meta-containers for list items.
We define the syntax for container blocks recursively. The general form of the definition is:
If X is a sequence of blocks, then the result of transforming X in such-and-such a way is a container of type Y with these blocks as its content.
So, we explain what counts as a block quote or list item by explaining how these can be generated from their contents. This should suffice to define the syntax, although it does not give a recipe for parsing these constructions. (A recipe is provided below in the section entitled A parsing strategy.)
A block quote marker, optionally preceded by up to three spaces of indentation, consists of (a) the character > together with a following space of indentation, or (b) a single character > not followed by a space of indentation.
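Stripping a block quote marker from a line can be sketched as below. The name `strip_block_quote_marker` is hypothetical, and the sketch treats only a literal space as the optional space of indentation; a tab after > would need the tab-stop handling described earlier.

```python
import re
from typing import Optional

# Up to three spaces, the > character, and at most one following space.
_QUOTE_MARKER = re.compile(r"^ {0,3}> ?")


def strip_block_quote_marker(line: str) -> Optional[str]:
    """Return the line with its block quote marker removed, or None if it has none."""
    match = _QUOTE_MARKER.match(line)
    if match is None:
        return None
    return line[match.end():]
```

Because the marker absorbs one space, a line with five spaces after > leaves four spaces of indentation in the quoted content.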
The following rules define block quotes:
Basic case. If a string of lines Ls constitute a sequence of blocks Bs, then the result of prepending a block quote marker to the beginning of each line in Ls is a block quote containing Bs.
Consecutiveness. A document cannot contain two block quotes in a row unless there is a blank line between them.
Nothing else counts as a block quote.
Here is a simple example:
The space or tab after the > characters can be omitted:
The > characters can be preceded by up to three spaces of indentation:
Four spaces of indentation is too many:
Laziness is not supported, so all the exceptions work:
A block quote can be empty:
A block quote can have initial or final blank lines:
A blank line always separates block quotes:
Consecutiveness means that if we put these block quotes together, we get a single block quote:
To get a block quote with two paragraphs, use:
Block quotes can interrupt paragraphs:
In general, blank lines are not needed before or after block quotes:
More laziness exceptions:
When including an indented code block in a block quote, remember that the block quote marker includes both the > and a following space of indentation. So five spaces are needed after the >:
A list marker is a bullet list marker or an ordered list marker.
A bullet list marker is a -, +, or * character.
An ordered list marker is a sequence of 1–9 arabic digits (0-9), followed by either a . character or a ) character. (The reason for the length limit is that with 10 digits we start seeing integer overflows in some browsers.)
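A sketch of list marker recognition at the start of a line. The name `parse_list_marker` and the tuple it returns are hypothetical. The lookahead encodes the requirement that a marker be followed by a space, a tab, or the end of the line. The sketch does not check whether the line is a thematic break instead.

```python
import re
from typing import Optional, Tuple

_LIST_MARKER = re.compile(
    r"^ {0,3}(?:([-+*])|([0-9]{1,9})([.)]))(?=[ \t]|$)"
)


def parse_list_marker(line: str) -> Optional[Tuple[str, str, Optional[int]]]:
    """Return (kind, bullet_or_delimiter, start_number) or None.

    kind is "bullet" or "ordered"; start_number is None for bullets.
    """
    match = _LIST_MARKER.match(line)
    if match is None:
        return None
    if match.group(1) is not None:
        return ("bullet", match.group(1), None)
    return ("ordered", match.group(3), int(match.group(2)))
```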
The following rules define list items:
Basic case. If a sequence of lines Ls constitute a sequence of blocks Bs starting with a character other than a space or tab, and M is a list marker of width W followed by 1 ≤ N ≤ 4 spaces of indentation, then the result of prepending M and the following spaces to the first line of Ls, and indenting subsequent lines of Ls by W + N spaces, is a list item with Bs as its contents. The type of the list item (bullet or ordered) is determined by the type of its list marker. If the list item is ordered, then it is also assigned a start number, based on the ordered list marker.
Exceptions:
For example, let Ls be the lines
And let M be the marker 1., and N = 2. Then rule #1 says that the following is an ordered list item with start number 1, and the same contents as Ls:
The most important thing to notice is that the position of the text after the list marker determines how much indentation is needed in subsequent blocks in the list item. If the list marker takes up two spaces of indentation, and there are three spaces between the list marker and the next character other than a space or tab, then blocks must be indented five spaces in order to fall under the list item.
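The arithmetic in the basic case (marker width W plus N following spaces) can be sketched as below. The name `content_indent` is hypothetical. The sketch assumes the marker starts at the beginning of the line and covers only the basic case, so it returns None when the marker is followed by five or more spaces or by nothing at all.

```python
import re
from typing import Optional

# A list marker, then 1-4 spaces, then a character that is not a space or tab.
_BASIC_ITEM = re.compile(r"^(?:[-+*]|[0-9]{1,9}[.)]) {1,4}(?=[^ \t])")


def content_indent(first_line: str) -> Optional[int]:
    """Return W + N, the indentation later blocks need to stay in the item."""
    match = _BASIC_ITEM.match(first_line)
    if match is None:
        return None
    return match.end()
```

For a two-character marker followed by three spaces this gives five, matching the example in the paragraph above.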
Here are some examples showing how far content must be indented to be put under the list item:
It is tempting to think of this in terms of columns: the continuation blocks must be indented at least to the column of the first character other than a space or tab after the list marker. However, that is not quite right. The spaces of indentation after the list marker determine how much relative indentation is needed. Which column this indentation reaches will depend on how the list item is embedded in other constructions, as shown by this example:
Here two occurs in the same column as the list marker 1., but is actually contained in the list item, because there is sufficient indentation after the last containing blockquote marker.
The converse is also possible. In the following example, the word two occurs far to the right of the initial text of the list item, one, but it is not considered part of the list item, because it is not indented far enough past the blockquote marker:
Note that at least one space or tab is needed between the list marker and any following content, so these are not list items:
A list item may contain blocks that are separated by more than one blank line.
A list item may contain any kind of block:
A list item that contains an indented code block will preserve empty lines within the code block verbatim.
Note that ordered list start numbers must be nine digits or less:
A start number may begin with 0s:
A start number may not be negative:
An indented code block will have to be preceded by four spaces of indentation beyond the edge of the region where text will be included in the list item. In the following case that is 6 spaces:
And in this case it is 11 spaces:
If the first block in the list item is an indented code block, then by rule #2, the contents must be preceded by one space of indentation after the list marker:
Note that an additional space of indentation is interpreted as space inside the code block:
Note that rules #1 and #2 only apply to two cases: (a) cases in which the lines to be included in a list item begin with a character other than a space or tab, and (b) cases in which they begin with an indented code block. In a case like the following, where the first block begins with three spaces of indentation, the rules do not allow us to form a list item by indenting the whole thing and prepending a list marker:
This is not a significant restriction, because when a block is preceded by up to three spaces of indentation, the indentation can always be removed without a change in interpretation, allowing rule #1 to be applied. So, in the above case:
Here are some list items that start with a blank line but are not empty:
When the list item starts with a blank line, the number of spaces following the list marker doesn’t change the required indentation:
A list item can begin with at most one blank line. In the following example, foo is not part of the list item:
Here is an empty bullet list item:
It does not matter whether there are spaces or tabs following the list marker:
Here is an empty ordered list item:
A list may start or end with an empty list item:
However, an empty list item cannot interrupt a paragraph:
Indented one space:
Indented two spaces:
Indented three spaces:
Four spaces indent gives a code block:
The rules for sublists follow from the general rules above. A sublist must be indented the same number of spaces of indentation a paragraph would need to be in order to be included in the list item.
So, in this case we need two spaces indent:
Here we need four, because the list marker is wider:
Three is not enough:
A list may be the first block in a list item:
A list is a sequence of one or more list items of the same type. The list items may be separated by any number of blank lines.
Two list items are of the same type if they begin with a list marker of the same type. Two list markers are of the same type if (a) they are bullet list markers using the same character (-, +, or *) or (b) they are ordered list numbers with the same delimiter (either . or )).
A list is an ordered list if its constituent list items begin with ordered list markers, and a bullet list if its constituent list items begin with bullet list markers.
The start number of an ordered list is determined by the list number of its initial list item. The numbers of subsequent list items are disregarded.
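The grouping of items into lists by marker type can be sketched with `itertools.groupby`. The function names here are hypothetical, and the input is assumed to be the bare markers of consecutive list items, already extracted from the text.

```python
import re
from itertools import groupby
from typing import List, Optional, Tuple

_MARKER = re.compile(r"^(?:([-+*])|([0-9]{1,9})([.)]))$")


def marker_type(marker: str) -> Tuple[str, str]:
    """Bullets are typed by their character, ordered markers by their delimiter."""
    match = _MARKER.match(marker)
    if match is None:
        raise ValueError(f"not a list marker: {marker!r}")
    if match.group(1) is not None:
        return ("bullet", match.group(1))
    return ("ordered", match.group(3))


def split_into_lists(markers: List[str]) -> List[List[str]]:
    """Group consecutive markers of the same type; each group is one list."""
    return [list(group) for _, group in groupby(markers, key=marker_type)]


def start_number(markers: List[str]) -> Optional[int]:
    """Start number of an ordered list: taken from the first item only."""
    match = _MARKER.match(markers[0])
    if match is None or match.group(2) is None:
        return None
    return int(match.group(2))
```

Note that the number itself is not part of the type, so items numbered 1., 2., 7. all belong to one list whose start number is 1.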
A list is loose if any of its constituent list items are separated by blank lines, or if any of its constituent list items directly contain two block-level elements with a blank line between them. Otherwise a list is tight. (The difference in HTML output is that paragraphs in a loose list are wrapped in <p> tags, while paragraphs in a tight list are not.)
Changing the bullet or ordered list delimiter starts a new list:
In CommonMark, a list can interrupt a paragraph. That is, no blank line is needed to separate a paragraph from a following list:
Since it is well established Markdown practice to allow lists to interrupt paragraphs inside list items, the principle of uniformity requires us to allow this outside list items as well. (reStructuredText takes a different approach, requiring blank lines before lists even inside other list items.)
In order to solve the problem of unwanted lists in paragraphs with hard-wrapped numerals, we allow only lists starting with 1 to interrupt paragraphs. Thus,
We may still get an unintended result in cases like
but this rule should prevent most spurious list captures.
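The interruption rule, together with the earlier statement that an empty list item cannot interrupt a paragraph, can be sketched as a predicate on the candidate line. The name `can_interrupt_paragraph` is hypothetical, and the sketch looks at the line in isolation without checking for thematic breaks.

```python
import re

# Marker, then optionally whitespace and some non-blank content.
_ITEM_LINE = re.compile(
    r"^ {0,3}(?:[-+*]|([0-9]{1,9})[.)])(?:[ \t]+(\S.*))?[ \t]*$"
)


def can_interrupt_paragraph(line: str) -> bool:
    """Return True if this line could start a list in the middle of a paragraph."""
    match = _ITEM_LINE.match(line)
    if match is None:
        return False  # not a list item at all
    if match.group(2) is None:
        return False  # empty list items cannot interrupt a paragraph
    if match.group(1) is not None and int(match.group(1)) != 1:
        return False  # ordered lists must start with 1
    return True
```

A hard-wrapped line that happens to begin with 1. still passes the check, which is the remaining unintended case mentioned above.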
There can be any number of blank lines between items:
This is a tight list, because the blank lines are in a code block:
A single-paragraph list is tight:
This list is loose, because of the blank line between the two block elements in the list item:
Inlines are parsed sequentially from the beginning of the character stream to the end (left to right, in left-to-right languages). Thus, for example, in
`hi`lo`
hi is parsed as code, leaving the backtick at the end as a literal backtick.
A backtick string is a string of one or more backtick characters (`) that is neither preceded nor followed by a backtick.
A code span begins with a backtick string and ends with a backtick string of equal length. The contents of the code span are the characters between these two backtick strings, normalized in the following ways:
This is a simple code span:
Here two backticks are used, because the code contains a backtick. This example also illustrates stripping of a single leading and trailing space:
This example shows the motivation for stripping leading and trailing spaces:
Note that only one space is stripped:
The stripping only happens if the space is on both sides of the string:
Only spaces, and not Unicode whitespace in general, are stripped in this way:
Line endings are treated like spaces:
Interior spaces are not collapsed:
Note that backslash escapes do not work in code spans. All backslashes are treated literally:
Backslash escapes are never needed, because one can always choose a string of n backtick characters as delimiters, where the code does not contain any strings of exactly n backtick characters.
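The delimiter choice described above can be sketched as follows. The helper name `wrap_code_span` is hypothetical, and the padding logic is an assumption based on the space-stripping behavior described earlier: a space is added on both sides when the content begins or ends with a backtick, or when it already begins and ends with a space and is not made up of spaces only.

```python
import re


def wrap_code_span(code: str) -> str:
    """Wrap `code` in a backtick string that cannot be closed early."""
    # Lengths of every maximal run of backticks inside the content.
    run_lengths = {len(run) for run in re.findall(r"`+", code)}
    # Pick the smallest n such that no run has exactly n backticks.
    n = 1
    while n in run_lengths:
        n += 1
    fence = "`" * n
    # Pad so the delimiters stay separate from the content and so that the
    # one-space stripping does not eat spaces that belong to the content.
    needs_pad = code.startswith("`") or code.endswith("`") or (
        code.startswith(" ") and code.endswith(" ") and code.strip(" ") != ""
    )
    if needs_pad:
        code = f" {code} "
    return f"{fence}{code}{fence}"
```

For example, content containing runs of one and two backticks ends up delimited by three backticks.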
Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks. Thus, for example, this is not parsed as emphasized text, since the second * is part of a code span:
Code spans, HTML tags, and autolinks have the same precedence. Thus, this is code:
But this is an HTML tag:
And this is code:
But this is an autolink:
When a backtick string is not closed by a matching backtick string, we just have literal backticks:
John Gruber’s original Markdown syntax description says:
Markdown treats asterisks (*) and underscores (_) as indicators of emphasis.
Text wrapped with one * or _ will be wrapped with an HTML <em> tag; double
*’s or _’s will be wrapped with an HTML <strong> tag.
This is enough for most users, but these rules leave much undecided, especially
when it comes to nested emphasis. The original Markdown.pl test suite makes it
clear that triple *** and ___ delimiters can be used for strong emphasis,
and most implementations have also allowed the following patterns:
***strong emph***
***strong** in emph*
***emph* in strong**
**in strong *emph***
*in emph **strong***
The following patterns are less widely supported, but the intent is clear and they are useful (especially in contexts like bibliography entries):
*emph *with emph* in it*
**strong **with strong** in it**
Many implementations have also restricted intraword emphasis to the * forms,
to avoid unwanted emphasis in words containing internal underscores. (It is best
practice to put these in code spans, but users often do not.)
internal emphasis: foo*bar*baz
no emphasis: foo_bar_baz
The rules given below capture all of these patterns, while allowing for efficient parsing strategies that do not backtrack.
First, some definitions. A delimiter run is either a sequence of one or more * characters that is not preceded or followed by a non-backslash-escaped * character, or a sequence of one or more _ characters that is not preceded or followed by a non-backslash-escaped _ character.
A left-flanking delimiter run is a delimiter run that is (1) not followed by Unicode whitespace, and either (2a) not followed by a Unicode punctuation character, or (2b) followed by a Unicode punctuation character and preceded by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
Here are some examples of delimiter runs.
left-flanking but not right-flanking:
***abc
_abc
**"abc"
_"abc"
right-flanking but not left-flanking:
abc***
abc_
"abc"**
"abc"_
Both left- and right-flanking:
abc***def
"abc"_"def"
Neither left- nor right-flanking:
abc *** def
a _ b
(The idea of distinguishing left-flanking and right-flanking delimiter runs based on the character before and the character after comes from Roopesh Chander’s vfmd. vfmd uses the terminology “emphasis indicator string” instead of “delimiter run,” and its rules for distinguishing left- and right-flanking runs are a bit more complex than the ones given here.)
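The two flanking definitions can be expressed as a small classifier. This is a minimal sketch, not a reference implementation: the function names are hypothetical, Unicode whitespace is assumed to mean tab, line feed, form feed, carriage return, or any character in general category Zs, and Unicode punctuation is assumed to mean any character in the P or S general categories. The beginning and end of the line are represented by an empty string.

```python
import unicodedata


def is_unicode_whitespace(ch: str) -> bool:
    # "" stands for the beginning or end of the line, which counts as whitespace.
    return ch == "" or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_unicode_punctuation(ch: str) -> bool:
    # Assumption: general categories P* (punctuation) and S* (symbols).
    return ch != "" and unicodedata.category(ch)[0] in "PS"


def flanking(text: str, start: int, end: int) -> tuple[bool, bool]:
    """Classify the delimiter run text[start:end] as (left, right) flanking."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    left = not is_unicode_whitespace(after) and (
        not is_unicode_punctuation(after)
        or is_unicode_whitespace(before)
        or is_unicode_punctuation(before)
    )
    right = not is_unicode_whitespace(before) and (
        not is_unicode_punctuation(before)
        or is_unicode_whitespace(after)
        or is_unicode_punctuation(after)
    )
    return left, right
```

Applied to the examples above, `flanking("***abc", 0, 3)` yields left-flanking only, and `flanking("abc *** def", 4, 7)` yields neither.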
The following rules define emphasis and strong emphasis:
1. A single * character can open emphasis iff (if and only if) it is part of a left-flanking delimiter run.
2. A single _ character can open emphasis iff it is part of a left-flanking delimiter run and either (a) not part of a right-flanking delimiter run or (b) part of a right-flanking delimiter run preceded by a Unicode punctuation character.
3. A single * character can close emphasis iff it is part of a right-flanking delimiter run.
4. A single _ character can close emphasis iff it is part of a right-flanking delimiter run and either (a) not part of a left-flanking delimiter run or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.
5. A double ** can open strong emphasis iff it is part of a left-flanking delimiter run.
6. A double __ can open strong emphasis iff it is part of a left-flanking delimiter run and either (a) not part of a right-flanking delimiter run or (b) part of a right-flanking delimiter run preceded by a Unicode punctuation character.
7. A double ** can close strong emphasis iff it is part of a right-flanking delimiter run.
8. A double __ can close strong emphasis iff it is part of a right-flanking delimiter run and either (a) not part of a left-flanking delimiter run or (b) part of a left-flanking delimiter run followed by a Unicode punctuation character.
9. Emphasis begins with a delimiter that can open emphasis and ends with a delimiter that can close emphasis, and that uses the same character (_ or *) as the opening delimiter. The opening and closing delimiters must belong to separate delimiter runs. If one of the delimiters can both open and close emphasis, then the sum of the lengths of the delimiter runs containing the opening and closing delimiters must not be a multiple of 3 unless both lengths are multiples of 3.
10. Strong emphasis begins with a delimiter that can open strong emphasis and ends with a delimiter that can close strong emphasis, and that uses the same character (_ or *) as the opening delimiter. The opening and closing delimiters must belong to separate delimiter runs. If one of the delimiters can both open and close strong emphasis, then the sum of the lengths of the delimiter runs containing the opening and closing delimiters must not be a multiple of 3 unless both lengths are multiples of 3.
11. A literal * character cannot occur at the beginning or end of *-delimited emphasis or **-delimited strong emphasis, unless it is backslash-escaped.
12. A literal _ character cannot occur at the beginning or end of _-delimited emphasis or __-delimited strong emphasis, unless it is backslash-escaped.
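The opening and closing conditions, together with the multiple-of-3 restriction, can be sketched as three small predicates. All names here are hypothetical, and the flanking flags are assumed to have been computed already for the delimiter run in question.

```python
def can_open(char: str, left: bool, right: bool, preceded_by_punct: bool) -> bool:
    """Whether a delimiter in a run with these properties can open (strong) emphasis."""
    if char == "*":
        return left
    # "_": left-flanking, and either not right-flanking or preceded by punctuation.
    return left and (not right or preceded_by_punct)


def can_close(char: str, left: bool, right: bool, followed_by_punct: bool) -> bool:
    """Whether a delimiter in a run with these properties can close (strong) emphasis."""
    if char == "*":
        return right
    # "_": right-flanking, and either not left-flanking or followed by punctuation.
    return right and (not left or followed_by_punct)


def lengths_compatible(open_len: int, close_len: int, either_can_do_both: bool) -> bool:
    """Apply the multiple-of-3 restriction to a candidate opener/closer pair."""
    if not either_can_do_both:
        return True
    if (open_len + close_len) % 3 != 0:
        return True
    # A sum that is a multiple of 3 is allowed only if both lengths are too.
    return open_len % 3 == 0 and close_len % 3 == 0
```

For instance, an intraword `_` run is both left- and right-flanking and is not adjacent to punctuation, so `can_open("_", True, True, False)` is false, which is what keeps `foo_bar_baz` free of emphasis.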
Where rules 1–12 above are compatible with multiple parsings, the following principles resolve ambiguity:
13. The number of nestings should be minimized. Thus, for example, an interpretation <strong>...</strong> is always preferred to <em><em>...</em></em>.
14. An interpretation <em><strong>...</strong></em> is always preferred to <strong><em>...</em></strong>.
15. When two potential emphasis or strong emphasis spans overlap, so that the second begins before the first ends and ends after the first ends, the first takes precedence. Thus, for example, *foo _bar* baz_ is parsed as <em>foo _bar</em> baz_ rather than *foo <em>bar* baz</em>.
16. When there are two potential emphasis or strong emphasis spans with the same closing delimiter, the shorter one (the one that opens later) takes precedence. Thus, for example, **foo **bar baz** is parsed as **foo <strong>bar baz</strong> rather than <strong>foo **bar baz</strong>.
17. Inline code spans, links, images, and HTML tags group more tightly than emphasis. So, when there is a choice between an interpretation that contains one of these elements and one that does not, the former always wins. Thus, for example, *[foo*](bar) is parsed as *<a href="bar">foo*</a> rather than as <em>[foo</em>](bar).
These rules can be illustrated through a series of examples.
Rule 1:
This is not emphasis, because the opening * is followed by whitespace, and hence not part of a left-flanking delimiter run:
This is not emphasis, because the opening * is preceded by an alphanumeric and followed by punctuation, and hence not part of a left-flanking delimiter run:
Unicode nonbreaking spaces count as whitespace, too:
Unicode symbols count as punctuation, too:
Intraword emphasis with * is permitted:
Rule 2:
This is not emphasis, because the opening _ is followed by whitespace:
This is not emphasis, because the opening _ is preceded by an alphanumeric and followed by punctuation:
Here _ does not generate emphasis, because the first delimiter run is right-flanking and the second left-flanking:
This is emphasis, even though the opening delimiter is both left- and right-flanking, because it is preceded by punctuation:
Rule 3:
This is not emphasis, because the closing delimiter does not match the opening delimiter:
This is not emphasis, because the closing * is preceded by whitespace:
A line ending also counts as whitespace:
This is not emphasis, because the second * is preceded by punctuation and followed by an alphanumeric (hence it is not part of a right-flanking delimiter run):
Intraword emphasis with * is allowed:
Rule 4:
This is not emphasis, because the closing _ is preceded by whitespace:
This is not emphasis, because the second _ is preceded by punctuation and followed by an alphanumeric:
This is emphasis, even though the closing delimiter is both left- and right-flanking, because it is followed by punctuation:
Rule 5:
This is not strong emphasis, because the opening delimiter is followed by whitespace:
Intraword strong emphasis with ** is permitted:
Rule 6:
This is not strong emphasis, because the opening delimiter is followed by whitespace:
A line ending counts as whitespace:
This is strong emphasis, even though the opening delimiter is both left- and right-flanking, because it is preceded by punctuation:
Rule 7:
This is not strong emphasis, because the closing delimiter is preceded by whitespace:
Intraword emphasis:
Rule 8:
This is not strong emphasis, because the closing delimiter is preceded by whitespace:
Rule 9:
Any nonempty sequence of inline elements can be the contents of an emphasized span.
In particular, emphasis and strong emphasis can be nested inside emphasis:
Note that in the preceding case, the interpretation
<p><em>foo</em><em>bar<em></em>baz</em></p>
is precluded by the condition that a delimiter that can both open and close (like the * after foo) cannot form emphasis if the sum of the lengths of the delimiter runs containing the opening and closing delimiters is a multiple of 3 unless both lengths are multiples of 3.
For the same reason, we don’t get two consecutive emphasis sections in this example:
There can be no empty emphasis or strong emphasis:
Rule 10:
Any nonempty sequence of inline elements can be the contents of a strongly emphasized span.
In particular, emphasis and strong emphasis can be nested inside strong emphasis:
Indefinite levels of nesting are possible:
There can be no empty emphasis or strong emphasis:
Rule 11:
Note that when delimiters do not match evenly, Rule 11 determines that the excess literal * characters will appear outside of the emphasis, rather than inside it:
Rule 12:
Note that when delimiters do not match evenly, Rule 12 determines that the excess literal _ characters will appear outside of the emphasis, rather than inside it:
Rule 13 implies that if you want emphasis nested directly inside emphasis, you must use different delimiters:
Rule 15:
Rule 17:
A link contains link text (the visible text), a link destination (the URI that is the link destination), and optionally a link title. There are two basic kinds of links in Markdown. In inline links the destination and title are given immediately after the link text. In reference links the destination and title are defined elsewhere in the document.
A link text consists of a sequence of zero or more inline elements enclosed by square brackets ([ and ]). The following rules apply:
Links may not contain other links, at any level of nesting. If multiple otherwise valid link definitions appear nested inside each other, the inner-most definition is used.
Brackets are allowed in the link text only if (a) they are backslash-escaped or (b) they appear as a matched pair of brackets, with an open bracket [, a sequence of zero or more inlines, and a close bracket ].
Backtick code spans, autolinks, and raw HTML tags bind more tightly than the brackets in link text. Thus, for example, [foo`]` could not be a link text, since the second ] is part of a code span.
The brackets in link text bind more tightly than markers for emphasis and strong emphasis. Thus, for example, *[foo*](url) is a link.
A link destination consists of either
a sequence of zero or more characters between an opening < and a closing > that contains no line endings or unescaped < or > characters, or
a nonempty sequence of characters that does not start with <, does not include ASCII control characters or space characters, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses. (Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.)
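A check for the second form of link destination (the one not enclosed in pointy brackets) might look like the sketch below. The function name `is_bare_destination` is hypothetical, no nesting limit is imposed, and ASCII control characters are taken to be U+0000 through U+001F plus U+007F.

```python
import string

# Any ASCII punctuation character may be backslash-escaped.
ASCII_PUNCTUATION = set(string.punctuation)


def is_bare_destination(s: str) -> bool:
    """Check the non-bracketed form of a link destination."""
    if not s or s.startswith("<"):
        return False
    depth = 0
    i = 0
    while i < len(s):
        ch = s[i]
        # A backslash-escaped punctuation character is skipped as a unit,
        # so escaped parentheses do not affect the balance count.
        if ch == "\\" and i + 1 < len(s) and s[i + 1] in ASCII_PUNCTUATION:
            i += 2
            continue
        # Spaces and ASCII control characters are not allowed.
        if ch == " " or ord(ch) < 0x20 or ord(ch) == 0x7F:
            return False
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        i += 1
    return depth == 0
```

A destination with unbalanced parentheses, such as `foo(and(bar)`, is rejected unless the parentheses are escaped.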
A link title consists of either
a sequence of zero or more characters between straight double-quote characters ("), including a " character only if it is backslash-escaped, or
a sequence of zero or more characters between straight single-quote characters ('), including a ' character only if it is backslash-escaped, or
a sequence of zero or more characters between matching parentheses ((...)), including a ( or ) character only if it is backslash-escaped.
Although link titles may span multiple lines, they may not contain a blank line.
An inline link consists of a link text followed immediately by a left parenthesis (, an optional link destination, an optional link title, and a right parenthesis ). These four components may be separated by spaces, tabs, and up to one line ending. If both link destination and link title are present, they must be separated by spaces, tabs, and up to one line ending.
The link’s text consists of the inlines contained in the link text (excluding
the enclosing square brackets). The link’s URI consists of the link destination,
excluding enclosing <...> if present, with backslash-escapes in effect as
described above. The link’s title consists of the link title, excluding its
enclosing delimiters, with backslash-escapes in effect as described above.
Here is a simple inline link:
The title, the link text and even the destination may be omitted:
The destination can only contain spaces if it is enclosed in pointy brackets:
The destination cannot contain line endings, even if enclosed in pointy brackets:
The destination can contain ) if it is enclosed in pointy brackets:
Pointy brackets that enclose links must be unescaped:
These are not links, because the opening pointy bracket is not matched properly:
Parentheses inside the link destination may be escaped:
However, if you have unbalanced parentheses, you need to escape or use the <...> form:
Parentheses and other symbols can also be escaped, as usual in Markdown:
A link can contain fragment identifiers and queries:
Note that a backslash before a non-escapable character is just a backslash:
URL-escaping should be left alone inside the destination, as all URL-escaped characters are also valid URL characters. Entity and numerical character references in the destination will be parsed into the corresponding Unicode code points, as usual. These may be optionally URL-escaped when written as HTML, but this spec does not enforce any particular policy for rendering URLs in HTML or other formats. Renderers may make different decisions about how to escape or normalize URLs in the output.
Note that, because titles can often be parsed as destinations, if you try to omit the destination and keep the title, you’ll get unexpected results:
Titles may be in single quotes, double quotes, or parentheses:
Titles must be separated from the link using spaces, tabs, and up to one line ending. Other Unicode whitespace like non-breaking space doesn’t work.
Nested balanced quotes are not allowed without escaping:
But it is easy to work around this by using a different quote type:
(Note: Markdown.pl did allow double quotes inside a double-quoted title, and
its test suite included a test demonstrating this. But it is hard to see a good
rationale for the extra complexity this brings, since there are already many
ways—backslash escaping, entity and numeric character references, or using a
different quote type for the enclosing title—to write titles containing double
quotes. Markdown.pl’s handling of titles has a number of other strange features.
For example, it allows single-quoted titles in inline links, but not reference
links. And, in reference links but not inline links, it allows a title to begin
with " and end with ). Markdown.pl 1.0.1 even allows titles with no closing
quotation mark, though 1.0.2b8 does not. It seems preferable to adopt a simple,
rational rule that works the same way in inline links and link reference
definitions.)
Spaces, tabs, and up to one line ending are allowed around the destination and title:
But it is not allowed between the link text and the following parenthesis:
The link text may contain balanced brackets, but not unbalanced ones, unless they are escaped:
The link text may contain inline content:
However, links may not contain other links, at any level of nesting.
These cases illustrate the precedence of link text grouping over emphasis grouping:
Note that brackets that aren’t part of links do not take precedence:
There are three kinds of reference links: full, collapsed, and shortcut.
A full reference link consists of a link text immediately followed by a link label that matches a link reference definition elsewhere in the document.
A link label begins with a left bracket ([) and ends with the first right
bracket (]) that is not backslash-escaped. Between these brackets there must
be at least one character that is not a space, tab, or line ending. Unescaped
square bracket characters are not allowed inside the opening and closing square
brackets of link labels. A link label can have at most 999 characters inside the
square brackets.
One label matches another just in case their normalized forms are equal. To normalize a label, strip off the opening and closing brackets, perform the Unicode case fold, strip leading and trailing spaces, tabs, and line endings, and collapse consecutive internal spaces, tabs, and line endings to a single space. If there are multiple matching reference link definitions, the one that comes first in the document is used. (It is desirable in such cases to emit a warning.)
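The normalization steps can be sketched in a few lines. The function names and the dictionary used to hold definitions are assumptions for this example. Python's `str.casefold` is used for the Unicode case fold, and the first matching definition wins because later ones never overwrite an existing key.

```python
import re


def normalize_label(label: str) -> str:
    """Normalize a link label such as "[Foo  Bar]" for matching."""
    inner = label[1:-1]                  # strip the opening and closing brackets
    folded = inner.casefold()            # Unicode case fold
    trimmed = folded.strip(" \t\r\n")    # strip leading and trailing whitespace
    # Collapse internal runs of spaces, tabs, and line endings to one space.
    return re.sub(r"[ \t\r\n]+", " ", trimmed)


def add_definition(definitions: dict, label: str, destination: str) -> None:
    """Record a link reference definition; the first one for a label is kept."""
    definitions.setdefault(normalize_label(label), destination)
```

Because the comparison uses a case fold rather than simple lowercasing, labels such as `[ẞ]` and `[SS]` normalize to the same string.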
The link’s URI and title are provided by the matching link reference definition.
Here is a simple example:
The rules for the link text are the same as with inline links. Thus:
The link text may contain balanced brackets, but not unbalanced ones, unless they are escaped:
The link text may contain inline content:
However, links may not contain other links, at any level of nesting.
(In the examples above, we have two shortcut reference links instead of one full reference link.)
The following cases illustrate the precedence of link text grouping over emphasis grouping:
Matching is case-insensitive:
No spaces, tabs, or line endings are allowed between the link text and the link label:
When there are multiple matching link reference definitions, the first is used:
Link labels cannot contain brackets, unless they are backslash-escaped:
Note that in this example ] is not backslash-escaped:
A link label must contain at least one character that is not a space, tab, or line ending:
A collapsed reference link consists of a link label that matches a link reference definition elsewhere in the document, followed by the string []. The contents of the link label are parsed as inlines, which are used as the link’s text. The link’s URI and title are provided by the matching reference link definition. Thus, [foo][] is equivalent to [foo][foo].
The link labels are case-insensitive:
As with full reference links, spaces, tabs, or line endings are not allowed between the two sets of brackets:
A shortcut reference link consists of a link label that matches a link reference definition elsewhere in the document and is not followed by [] or a link label. The contents of the link label are parsed as inlines, which are used as the link’s text. The link’s URI and title are provided by the matching link reference definition. Thus, [foo] is equivalent to [foo][].
The link labels are case-insensitive:
A space after the link text should be preserved:
If you just want bracketed text, you can backslash-escape the opening bracket to avoid links:
Full and collapsed references take precedence over shortcut references:
Inline links also take precedence:
Here, though, [foo][bar] is parsed as a reference, since [bar] is defined:
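Syntax for images is like the syntax for links, with one difference. Instead of link text, we have an image description. The rules for this are the same as for link text, except that (a) an image description starts with ![ rather than [, and (b) an image description may contain links. When an image is rendered to HTML, the description is standardly used as the image’s alt attribute, and it is recommended that the alt value be the plain string content of the description (for example foo bar), not markup such as foo <a href="/url">bar</a>. Only the plain string content is rendered, without formatting.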
Reference-style:
Collapsed:
The labels are case-insensitive:
As with reference links, spaces, tabs, and line endings are not allowed between the two sets of brackets:
Shortcut:
The link labels are case-insensitive:
If you just want a literal ! followed by bracketed text, you can backslash-escape the opening [:
If you want a link after a literal !, backslash-escape the !:
Autolinks are absolute URIs and email addresses inside < and >. They are parsed as links, with the URL or email address as the link label.
A URI autolink consists of <, followed by an absolute URI followed by >. It is parsed as a link to the URI, with the URI as the link’s label.
An absolute URI, for these purposes, consists of a scheme followed by a colon (:) followed by zero or more characters other than ASCII control characters, space, <, and >. If the URI includes these characters, they must be percent-encoded (e.g. %20 for a space).
For purposes of this spec, a scheme is any sequence of 2–32 characters beginning with an ASCII letter and followed by any combination of ASCII letters, digits, or the symbols plus (“+”), period (“.”), or hyphen (“-”).
Here are some valid autolinks:
Uppercase is also fine:
Note that many strings that count as absolute URIs for purposes of this spec are not valid URIs, because their schemes are not registered or because of other problems with their syntax:
Spaces are not allowed in autolinks:
Backslash-escapes do not work inside autolinks:
An email autolink consists of <, followed by an email address, followed by >. The link’s label is the email address, and the URL is mailto: followed by the email address.
An email address, for these purposes, is anything that matches the non-normative regex from the HTML5 spec:
/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?
(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
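As an illustration, the regex above can be used from Python roughly as follows. The pattern is the same one, written without the `/^...$/` wrapper because `fullmatch` already anchors both ends. The function name `parse_email_autolink` is hypothetical.

```python
import re

EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def parse_email_autolink(text: str):
    """Return (label, url) if `text` is exactly an email autolink, else None."""
    if len(text) < 2 or text[0] != "<" or text[-1] != ">":
        return None
    address = text[1:-1]
    if EMAIL.fullmatch(address):
        # The label is the address itself; the URL prepends "mailto:".
        return address, "mailto:" + address
    return None
```

Since the backslash is not in the allowed character set, an input like `<foo\+@bar.example.com>` is not recognized, which matches the statement that backslash-escapes do not work inside email autolinks.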
Examples of email autolinks:
Backslash-escapes do not work inside email autolinks:
These are not autolinks:
Text between < and > that looks like an HTML tag is parsed as a raw HTML tag
and will be rendered in HTML without escaping. Tag and attribute names are not
limited to current HTML tags, so custom tags (and even, say, DocBook tags) may
be used.
Here is the grammar for tags:
A tag name consists of an ASCII letter followed by zero or more ASCII letters, digits, or hyphens (-).
An attribute consists of spaces, tabs, and up to one line ending, an attribute name, and an optional attribute value specification.
An attribute name consists of an ASCII letter, _, or :, followed by zero or more ASCII letters, digits, _, ., :, or -. (Note: This is the XML specification restricted to ASCII. HTML5 is laxer.)
An attribute value specification consists of optional spaces, tabs, and up to one line ending, a = character, optional spaces, tabs, and up to one line ending, and an attribute value.
An attribute value consists of an unquoted attribute value, a single-quoted attribute value, or a double-quoted attribute value.
An unquoted attribute value is a nonempty string of characters not including spaces, tabs, line endings, ", ', =, <, >, or `.
A single-quoted attribute value consists of ', zero or more characters not including ', and a final '.
A double-quoted attribute value consists of ", zero or more characters not including ", and a final ".
An open tag consists of a < character, a tag name, zero or more attributes, optional spaces, tabs, and up to one line ending, an optional / character, and a > character.
A closing tag consists of the string </, a tag name, optional spaces, tabs, and up to one line ending, and the character >.
An HTML tag consists of either an open tag or a closing tag.
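The grammar for open and closing tags can be assembled into regular expressions piece by piece. This is a sketch under stated assumptions: all pattern names and the helper `is_html_tag` are hypothetical, and "spaces, tabs, and up to one line ending" is modeled as optional horizontal whitespace around at most one line ending.

```python
import re

TAG_NAME = r"[A-Za-z][A-Za-z0-9-]*"
# Optional spaces and tabs with at most one line ending in between.
OPT_WS = r"[ \t]*(?:\r?\n|\r)?[ \t]*"
# Same, but at least one whitespace character must be present.
REQ_WS = r"(?=[ \t\r\n])" + OPT_WS
ATTR_NAME = r"[A-Za-z_:][A-Za-z0-9_.:\-]*"
UNQUOTED = r"[^ \t\r\n\"'=<>`]+"
SINGLE_QUOTED = r"'[^']*'"
DOUBLE_QUOTED = r'"[^"]*"'
ATTR_VALUE = rf"(?:{UNQUOTED}|{SINGLE_QUOTED}|{DOUBLE_QUOTED})"
ATTR_VALUE_SPEC = rf"(?:{OPT_WS}={OPT_WS}{ATTR_VALUE})"
ATTRIBUTE = rf"(?:{REQ_WS}{ATTR_NAME}{ATTR_VALUE_SPEC}?)"
OPEN_TAG = rf"<{TAG_NAME}{ATTRIBUTE}*{OPT_WS}/?>"
CLOSING_TAG = rf"</{TAG_NAME}{OPT_WS}>"
HTML_TAG = re.compile(rf"(?:{OPEN_TAG}|{CLOSING_TAG})")


def is_html_tag(text: str) -> bool:
    """Return True if `text` is exactly one open tag or closing tag."""
    return HTML_TAG.fullmatch(text) is not None
```

Building the pattern from named pieces keeps each grammar production visible, at the cost of a longer expression than a hand-tuned single regex.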
Here are some simple open tags:
Empty elements:
Whitespace is allowed:
With attributes:
Custom tag names can be used:
Illegal tag names, not parsed as HTML:
Illegal attribute names:
Illegal attribute values:
Illegal whitespace:
Missing whitespace:
Closing tags:
Illegal attributes in closing tag:
Entity and numeric character references are preserved in HTML attributes:
Backslash escapes do not work in HTML attributes:
A line ending (not in a code span or HTML tag) that is preceded by two or more
spaces and does not occur at the end of a block is parsed as a hard line break
(rendered in HTML as a <br> tag):
For a more visible alternative, a backslash before the line ending may be used instead of two or more spaces:
Leading spaces at the beginning of the next line are ignored:
Hard line breaks can occur inside emphasis, links, and other constructs that allow inline content:
Hard line breaks do not occur inside code spans or HTML tags:
Hard line breaks are for separating inline content within a block. Neither syntax for hard line breaks works at the end of a paragraph or other block element:
A regular line ending (not in a code span or HTML tag) that is not preceded by two or more spaces or a backslash is parsed as a softbreak. (A soft line break may be rendered in HTML either as a line ending or as a space. The result will be the same in browsers. In the examples here, a line ending will be used.)
Spaces at the end of the line and beginning of the next line are removed:
A conforming parser may render a soft line break in HTML either as a line ending or as a space.
A renderer may also provide an option to render soft line breaks as hard line breaks.
Any characters not given an interpretation by the above rules will be parsed as plain textual content.
Internal spaces are preserved verbatim: