HTML entities: when and how to encode them
An HTML entity is a stand-in for a character that would otherwise confuse the browser's parser or cannot be typed safely. Instead of writing the raw character, you write a short code that the browser turns back into that character when it renders the page. The classic example is &amp; for the ampersand. Entities exist so that text and markup can share the same document without the browser mistaking one for the other.
The characters that must be encoded
A small set of characters carries structural meaning in HTML. When they appear as literal text rather than markup, escape them:
- < becomes &lt; so the parser does not start a tag
- > becomes &gt; to close the pair cleanly
- & becomes &amp; so it is not read as the start of another entity
- " becomes &quot; inside double-quoted attribute values
- ' becomes &#39; (or &apos;) inside single-quoted attribute values
The ampersand is the one people forget, and it is the most important. Because & begins every entity, an unescaped ampersand in text can swallow the characters after it. Writing Tom & Jerry or R&D may render fine, but other strings break in subtle ways: browsers still accept a few legacy entity names without the closing semicolon, so a=1&copy=2 in text displays as a=1©=2. Escaping every literal ampersand, as in R&amp;D, removes all ambiguity.
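As a rough illustration of the rules above, here is a minimal sketch using Python's standard html module. The helper names escape_text and escape_attribute are made up for this example. Note that html.escape writes the apostrophe as &#x27;, which is the hexadecimal equivalent of &#39;.

```python
import html


def escape_text(value: str) -> str:
    """Escape for HTML text content: only &, < and > are replaced."""
    return html.escape(value, quote=False)


def escape_attribute(value: str) -> str:
    """Escape for a quoted attribute value: also replaces " and '."""
    return html.escape(value, quote=True)


print(escape_text("R&D: 1 < 2"))
# R&amp;D: 1 &lt; 2

print(escape_attribute('say "hi" & it\'s done'))
# say &quot;hi&quot; &amp; it&#x27;s done
```

The ampersand is replaced first, so the entities produced for the other characters are not escaped a second time.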
Named versus numeric entities
There are two ways to write the same character.
- Named entities use a memorable label: &copy;, &nbsp;, &euro;. They are readable but limited to a fixed list.
- Numeric entities use the character's code point: decimal as &#169; or hexadecimal as &#xA9;. Both produce the copyright symbol. Numeric form can represent any character, including ones with no name.
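| character | named | decimal | hex |
| --- | --- | --- | --- |
| & | &amp; | &#38; | &#x26; |
| © | &copy; | &#169; | &#xA9; |
| < | &lt; | &#60; | &#x3C; |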
Use whichever is clearer. Named entities read better for common symbols; numeric entities are universal.
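To make the equivalence concrete, the sketch below builds all three forms for a character and checks that they decode to the same thing. It assumes Python's standard html and html.entities modules. The function name entity_forms is hypothetical, and codepoint2name only covers the older HTML 4 names, so a missing name here does not prove that no name exists.

```python
import html
from html.entities import codepoint2name


def entity_forms(char: str) -> dict:
    """Return the named, decimal and hex entity for a single character."""
    code_point = ord(char)
    name = codepoint2name.get(code_point)  # None when no HTML 4 name exists
    return {
        "named": f"&{name};" if name else None,
        "decimal": f"&#{code_point};",
        "hex": f"&#x{code_point:X};",
    }


forms = entity_forms("©")
print(forms)
# {'named': '&copy;', 'decimal': '&#169;', 'hex': '&#xA9;'}

# Every form decodes back to the same character.
print(all(html.unescape(form) == "©" for form in forms.values()))
# True

# A character with no name in this table still has numeric forms.
print(entity_forms("\U0001F600")["hex"])
# &#x1F600;
```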
Why encoding matters
Two problems disappear when you escape correctly.
The first is broken markup. A stray < in a code snippet or a math expression can make the browser swallow the rest of your content as a malformed tag. Escaping keeps your text as text.
The second is a class of cross-site scripting, or XSS. If a page takes user-supplied input and drops it into HTML without escaping, an attacker can submit something like <script>steal()</script> and have the browser execute it. Escaping the angle brackets turns that payload into harmless visible text. This is why frameworks escape interpolated values by default. The danger appears when you bypass that default and insert raw HTML yourself. Note that escaping is context-dependent: encoding for the HTML body is not the same as encoding for a URL, a JavaScript string, or a CSS value, so escape for the context the data lands in.
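The following sketch shows the idea for two of those contexts, HTML text and a URL inside an attribute. It is an illustration under assumptions, not a complete defense: render_comment, search_link and the /search?q= path are invented for the example, and real applications would normally rely on their template engine's auto-escaping.

```python
import html
from urllib.parse import quote


def render_comment(user_input: str) -> str:
    """HTML body context: escape the input before placing it in markup."""
    return f"<p>{html.escape(user_input)}</p>"


def search_link(user_query: str) -> str:
    """URL context first (percent-encoding), then HTML attribute context."""
    href = "/search?q=" + quote(user_query, safe="")
    return f'<a href="{html.escape(href)}">results</a>'


print(render_comment("<script>steal()</script>"))
# <p>&lt;script&gt;steal()&lt;/script&gt;</p>

print(search_link('a&b "c"'))
# <a href="/search?q=a%26b%20%22c%22">results</a>
```

The query value is percent-encoded because it lands in a URL, and the finished URL is then HTML-escaped because it lands in an attribute. Using only one of the two steps would leave the other context unprotected.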
When you do not need to encode
Encoding is not always required, and over-encoding creates its own bugs.
- Plain text that contains none of the special characters needs nothing.
- A character inside the correct attribute quoting style is fine. A " inside a single-quoted attribute does not need escaping, and vice versa.
- Content already handled by a templating engine that auto-escapes should not be escaped again, or you get visible &amp; doubling.
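The doubling problem is easy to reproduce. The sketch below shows it with Python's html.escape, followed by one possible guard: a marker type for strings that are already escaped. SafeHtml and escape_once are hypothetical names for this example, loosely modeled on how auto-escaping engines avoid escaping the same value twice.

```python
import html


class SafeHtml(str):
    """Hypothetical marker type: a string that has already been escaped."""


def escape_once(value: str) -> SafeHtml:
    """Escape plain strings, but pass already-escaped values through."""
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(html.escape(value))


# The bug: escaping twice makes the entity itself visible to the reader.
print(html.escape(html.escape("R&D")))
# R&amp;amp;D

# The guard: a second call is a no-op.
print(escape_once(escape_once("R&D")))
# R&amp;D
```

A browser renders R&amp;amp;D as the literal text R&amp;D, which is the visible doubling described above.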
Decoding entities you find
The reverse is just as common. Scraped HTML, exported data, and log files often arrive full of entities like Caf&eacute; &amp; bar. Decoding turns that back into Café & bar so the text is readable and ready to process. Watch for double encoding, where &amp;amp; should decode to &amp; and then to &.
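For scripted decoding, a minimal sketch with Python's html.unescape might look like the following. The decode_entities helper and its max_rounds parameter are assumptions made for this example. Extra rounds are opt-in because decoding repeatedly can also mangle text that is meant to show an entity literally.

```python
import html


def decode_entities(text: str, max_rounds: int = 1) -> str:
    """Decode entities, repeating up to max_rounds times for double encoding."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:  # nothing left to decode
            break
        text = decoded
    return text


print(decode_entities("Caf&eacute; &amp; bar"))
# Café & bar

# Double-encoded input needs a second round.
print(decode_entities("Caf&eacute; &amp;amp; bar"))
# Café &amp; bar
print(decode_entities("Caf&eacute; &amp;amp; bar", max_rounds=2))
# Café & bar
```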
To escape text for safe insertion, or to decode entities back to plain characters, use the HTML Entity Encoder / Decoder. It runs in your browser, so nothing you paste is transmitted. If you are also formatting the surrounding document, the JSON Formatter & Validator and Base64 Encode / Decode tools cover the neighboring jobs.