15: Marking up textual content in HTML

By Mark Norman Francis

11th October 2012: Material moved to webplatform.org

The Opera web standards curriculum has now been moved to the docs section of the W3C webplatform.org site. Go there to find updated versions of these docs, and much more besides!

12th April 2012: This article is obsolete

The web standards curriculum has been donated to the W3C web education community group, to become part of a much bigger educational resource. It is constantly being updated so that it remains current with modern web design practices and technologies. To find the most up-to-date web standards curriculum, visit the web education community group Wiki. Please make changes to this Wiki yourself, or suggest changes to Chris Mills, who is also the chair of the web education community group.

Introduction

In this article I will take you through the basics of using HTML to describe the meaning of the content within the body of your document.

We will look at general structural elements such as headings and paragraphs, and at embedding quotes and code. After that we will look at inline content, such as short quotes and emphasis, and finish with a quick examination of old-fashioned presentational content.

Note: After each code example there is a “View live examples” link. Clicking it takes you to a separate file containing that source code, so that you can see how it is actually rendered in the browser as well as looking at the code.

Space—the final frontier

An important point to cover before I start discussing text, though, is that of space, specifically the space between words. When writing HTML, the source document will contain what is termed “white space”: the characters in the file that serve to separate text. An actual space character, as you would get when you hit the Spacebar on the keyboard, is the most common, but there are others, such as the Tab character and the marker between two separate lines in a document (called a carriage return or new line).

In HTML, multiple occurrences of these characters are (almost) always treated as a single space character. For example:

<h3>In   the
                beginning</h3>

View live examples

would be interpreted by a web browser to be equivalent to:

<h3>In the beginning</h3>

The only place where this is not the case is in the pre element, which is discussed in detail later in this article.

This can be a source of confusion for first-time authors of HTML documents, who try to pad out text with extra spaces to achieve a desired indentation, to get more spacing after the period between sentences, or to introduce more vertical space between paragraphs. Influencing the visual layout of your documents is not something to be done in the HTML source; it is instead achieved through style sheets, discussed later in this series of articles.
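
As a minimal sketch of the point above, the indentation and vertical spacing that authors often try to create with extra spaces can be requested in a style sheet instead. The values 2em and 1.5em and the page content are assumptions chosen purely for illustration; CSS itself is covered properly later in the series.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Spacing with a style sheet</title>
  <style type="text/css">
    /* Indent the first line of each paragraph and add space below it,
       instead of typing extra spaces or blank lines into the source */
    p { text-indent: 2em; margin-bottom: 1.5em; }
  </style>
</head>
<body>
  <p>In the beginning there was a paragraph.</p>
  <p>It was followed by another one.</p>
</body>
</html>
```

However many spaces or new lines are typed between the words in the source, the rendered spacing is controlled by the style rule alone.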

Block level elements

In this section I’ll go through the syntax and usage of the common block level elements used to format text.

Page section headings

Once the page has been broken down into logical sections, each section should be introduced by an appropriate heading. This is discussed further in the article What does a good web page need.

HTML defines six levels of heading: h1, h2, h3, h4, h5, and h6 (from the highest importance to the lowest). Generally speaking, the h1 is the main heading of the entire page and introduces everything. h2 is then used to break the page up into sections, h3 for the sub-sections, and so on.

It is important to use the heading levels to describe the document in terms of section, sub-section, and sub-sub-section, as this makes the document more understandable to screen readers and to automated processes (like Google’s indexing bots).

A good example of a heading structure, using this document as a template, would look like this:

<h1>Marking up textual content in HTML</h1>
<h2>Introduction</h2>
<h2>Space—the final frontier</h2>
<h2>Block level elements</h2>
<h3>Page section headings</h3>
<h3>Generic paragraphs</h3>
<h3>Quoting other sources</h3>
<h3>Preformatted text</h3>
<h2>Inline elements</h2>

[…and so on…]

View live examples

Generic paragraphs

The paragraph is the building block of most documents. In HTML a paragraph is represented by the p element, which takes no special attributes. For example:

<p>This is a very short paragraph. It only has two sentences.</p>

View live examples

In many articles and books a paragraph can contain just one sentence. Whilst the meaning of “paragraph” (in terms of written prose) is fairly clear, on the web much shorter pieces of text are often wrapped in paragraph elements because the author believes this is more “semantic” than using a div element (we will cover these in a future article called “Generic containers”).

A paragraph is a collection of one or more sentences, just as in newspapers and books. On the web, it is good form to reserve the p element for this and not to use it for any random piece of text in the page. If the text is just a few words and not even a proper sentence, it should probably not be marked up as a paragraph.
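
The distinction can be sketched as follows. The text and the class name byline are made up for illustration, and the div element is only introduced properly in the later “Generic containers” article.

```html
<!-- A complete sentence of prose: a paragraph is appropriate -->
<p>The kettle was replaced last week after the old one stopped working.</p>

<!-- A few words that do not form a sentence: a generic container
     is arguably a better fit than a paragraph -->
<div class="byline">Posted on 1 May</div>
```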

Quoting other sources

Very often articles, blog posts, and reference documents will quote in whole or in part another document. In HTML, this is marked up using the blockquote element for lengthy quotations, such as entire sentences, paragraphs, lists, or the like.

In HTML 4.01 Strict, a blockquote element cannot contain text directly; the text must instead sit inside another block level element. You should use the same block level element as was used in the original document. If you are quoting a paragraph of text, use a paragraph; when quoting a list of items, use the elements for lists; and so on.

If the quote comes from another web page, you can indicate this using the cite attribute, like so:

<p>HTML 4.01 is the only version of HTML that you should use
when creating a new web page, as, according to the 
specification:</p>
<blockquote cite="http://www.w3.org/TR/html401/">
<p>This document obsoletes previous versions of HTML 4.0,
although W3C will continue to make those specifications and
their DTDs available at the W3C Web site.</p>
</blockquote>

View live examples

The cite attribute is optional. Leave it out when the quote is taken from a novel, magazine, or other form of offline content that has no web address to point to.
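
The advice to reuse the block level element of the original applies to lists as well as paragraphs. Here is a small sketch, with an invented offline source and invented list items, in which a quoted list keeps its list markup and no cite attribute is given:

```html
<p>The leaflet that came with the tent lists what to pack:</p>
<blockquote>
  <ul>
    <li>Tent pegs</li>
    <li>Sleeping bag</li>
    <li>Torch</li>
  </ul>
</blockquote>
```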

Preformatted text

Any text in which the formatting and white space (see earlier) is significant should be marked up using the pre element.

In most web browsers, text marked as preformatted is displayed to the user as it appears in the source, usually in a fixed-width (monospaced) font, giving the text a feeling of having come from a typewriter. This is an artifact of programmers using fixed-width fonts for early uses of preformatted text.

In this example, you can see a snippet of code written in the Perl programming language:

<pre><code class="language-perl">
use FileHandle;

# read in the named file in its entirety
sub slurp {
  my $filename = shift;
  my $file     = new FileHandle $filename;

  if ( defined $file ) {
    local $/;  # no record separator, so one read returns the whole file
    return &lt;$file&gt;;
  }
  return undef;
};
</code></pre>

View live examples

The code element used above is explained in the lesser-known semantic elements article later on in the course.
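
To contrast with the heading example in the white space section, here is a small sketch (the verse is made up for illustration) in which the line breaks and indentation typed into the source are preserved because they sit inside a pre element:

```html
<pre>
Roses are red,
    violets are blue,
white space in pre
    is shown as typed.
</pre>
```

Placed inside a p element instead, the same text would be collapsed onto a single run of words separated by single spaces.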

Inline elements

In this section I’ll go through the syntax and usage of the common inline elements used to format text.

Short quotations

Short quotes which are used within a normal sentence or paragraph are contained within the q element. Like the blockquote element, it can take a cite attribute, which indicates the page on the internet where the quote can be found.

A short quote should normally be rendered with quotation marks around it. According to the HTML specification, these should be inserted by the user agent, so that nested quotes get the correct marks and the marks suit the language being used in the document. CSS can be used to control the quotation marks used; this is covered in a later article on “styling text”.

An example of q in action:

<p>This did not end well for me. Oh well,
<q lang="fr">c'est la vie</q> as the French say.</p>

View live examples
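
The cite attribute works on q in the same way as on blockquote. As a sketch, this reuses the specification address and part of the sentence quoted in the blockquote example earlier in this article; the surrounding wording is invented for illustration:

```html
<p>The HTML 4.01 specification says that it
<q cite="http://www.w3.org/TR/html401/">obsoletes previous versions
of HTML 4.0</q>.</p>
```

No quotation marks are typed around the quoted words, since adding them is left to the user agent.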

Emphasis

HTML contains two elements for indicating that the text within them needs to be emphasised to the user, as with error messages, warnings, or notes. For visual browsers this normally means applying a different colour or font, or making the text bolder or italicised. For users of screen readers this can result in a different voice or other auditory effect.

For text that needs to be emphasised, you use the em element, like so:

<p><em>Please note:</em> the kettle is to be unplugged at
night.</p>

View live examples

If an entire sentence is to be emphasised, but a point within that sentence needs to be emphasised further, use the strong element to indicate stronger emphasis than normal, like so:

<p><em>Please note: the kettle <strong>must</strong> be unplugged every evening, otherwise it will explode -
<strong>killing us all</strong></em>.</p>

View live examples

Italicised text

It is commonly thought that “italicised” does not describe the meaning, and thus the i element should not be used (much like some other presentational elements described in the next section).

There are a couple of instances when describing the content as being italicised is arguably correct. It has been noted that some concepts are best described as “italicised”, which avoids having to create some very specific and otherwise unused elements. These include things such as the names of ships, the titles of television series, movies and books, taxonomic designations, and some technical terms.

The argument is that the italicisation indicates that the text within is special, and the context indicates how it is special. Indeed, this is reflected in the current draft of the HTML 5 specification:

The i element represents a span of text in an alternate voice or mood, or otherwise offset from the normal prose […] The i element should be used as a last resort when no other element is more appropriate.

Since the i element can be restyled by CSS to not be italic, the meaning of “italic” in this context is essentially “something a little bit different”. I don’t find this acceptable, personally speaking, but there is enough precedent out there for it to be used this way.
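
For readers who do accept that precedent, the uses listed above might be marked up along these lines. This is only a sketch of the convention, with sentences written for illustration; it is not a recommendation.

```html
<!-- The name of a ship -->
<p>The <i>Cutty Sark</i> is preserved in Greenwich.</p>

<!-- A taxonomic designation, with its language marked -->
<p>The domestic cat, <i lang="la">Felis catus</i>, sleeps for most of the day.</p>
```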

Presentational elements—never use these

The HTML specification includes several elements that are widely described as “presentational” because they only specify what the content within them should look like, and not what it means.

Some of these have been labeled as deprecated in the specification. This means that they have been superseded by a newer method of achieving the same result.

I will describe them briefly here, but note that this is mostly of historic interest; these elements should never be used in any modern web page. The effect of all of these elements should be achieved in another way, as described in two forthcoming articles: “styling text with CSS” and “lesser-known semantic elements”.

font face="…" size="…": The text within should be rendered by the browser using a font different from the default. Instead, fonts should be set using CSS.
b: The text within is bold. This almost always means the text has been emphasised, so you should use em or strong as shown earlier.
s and strike: The text within has been struck through with a line. If this is merely a presentational effect, it should be achieved with CSS. If the text is actually being marked as deleted or unwanted, it should be marked up with the del element, described in a later article.
u: The text within has been underlined. This is almost always a visual effect, and so should be achieved with CSS.
tt: The text within is presented in a “teletype” or monospaced font. This should be achieved with CSS or with a more appropriate semantic element, such as pre (shown above).
big and small: The size of the text within has been adjusted. This should be achieved with CSS.
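
As a rough sketch of what the replacements described above can look like, the page below shows a presentational version inside a comment for comparison, followed by a semantic version whose appearance is moved into a style sheet. The class name warning, the font choices, and the sentence itself are assumptions made for illustration.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Replacing presentational markup</title>
  <style type="text/css">
    /* The look that font, b and u used to provide is set here instead */
    p.warning { font-family: Arial, sans-serif; font-size: 120%; }
    p.warning em { font-style: normal; text-decoration: underline; }
  </style>
</head>
<body>
  <!-- Presentational version, for comparison only:
       <p><font face="Arial" size="4"><b>Warning:</b>
       <u>do not</u> open the lid.</font></p> -->
  <p class="warning"><strong>Warning:</strong> <em>do not</em> open the lid.</p>
</body>
</html>
```

The markup now says what the text means (a strongly emphasised label and an emphasised phrase), and the style sheet can be changed later without touching the content.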

Summary

In this article, I have talked about some of the most common elements used when marking up textual content. In the next article, you will progress to another type of content: lists of items.

About the author


Mark Norman Francis has been working with the internet since before the web was invented. In his last job he worked at Yahoo! as a Front End Architect for the world’s biggest website, defining best practices, coding standards and quality in web development internationally.

Previous to Yahoo! he worked at Formula One Management, Purple Interactive and City University in various roles including web development, backend CGI programming and systems architecture. He pretends to blog at http://marknormanfrancis.com/.

This article is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license.
