charsets.
by Jamie Zawinski
<jwz@jwz.org>
Written 18-Dec-1997.
Updated 6-Sep-2002.
A lot of web pages use characters that cannot be viewed by users of Macintosh or Unix computers. This is because Windows, Macintosh, and Unix use different character sets (CP-1252, MacRoman, and Latin1, respectively). Specifically, all three use character sets that are different supersets of ASCII.
This is further complicated by the fact that many documents that are actually written in CP-1252 claim (incorrectly) that they are Latin1.
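To see what that mislabelling does to the bytes, here is a minimal Python sketch. It assumes Python's built-in `cp1252` and `latin-1` codecs are fair stand-ins for the two character sets; the sample text and variable names are made up for illustration.

```python
# Text containing CP-1252 "smart" punctuation, as a Windows editor might save it.
text = "\u201cquoted\u201d \u2014 done"
raw = text.encode("cp1252")

# Decoded with the charset the document really uses: round-trips cleanly.
as_cp1252 = raw.decode("cp1252")

# Decoded with the charset the document claims: bytes 0x80-0x9F are not
# printable characters in Latin1, so the quotes and dash turn into C1 controls.
as_latin1 = raw.decode("latin-1")

for label, s in (("cp1252", as_cp1252), ("latin-1", as_latin1)):
    print(label, [hex(ord(c)) for c in s if ord(c) > 0x7F])
```

The second line of output shows code points in the 0x80-0x9F range, which is exactly the region where a Latin1 font has nothing to draw.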
Portable web pages written in Western European ("Romance") languages should be written using only characters from Latin1.
Mozilla could, and should, do a more intelligent translation of CP-1252 characters to their best possible screen representation.
For example, it would be really really nice if, when looking at a web page that is actually in CP-1252 (with long-hyphen, open and close double-quotes, etc.), the layout engine would convert those characters to ones that I can actually see on my screen. Most Unix fonts don't include those characters at all, so they show up either as blank space, or as ``?.''
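A rough sketch of that kind of conversion, in Python. The fallback table and the function name `downgrade_cp1252` are hypothetical, and the replacement strings are only one plausible choice; this is not how Mozilla or any other browser actually does it.

```python
# Hypothetical fallback table: CP-1252-only characters mapped to ASCII
# approximations. The choices are illustrative, not any browser's behavior.
FALLBACKS = {
    "\u2018": "`",     # left single quote    (CP-1252 0x91)
    "\u2019": "'",     # right single quote   (CP-1252 0x92)
    "\u201c": "``",    # left double quote    (CP-1252 0x93)
    "\u201d": "''",    # right double quote   (CP-1252 0x94)
    "\u2022": "*",     # bullet               (CP-1252 0x95)
    "\u2013": "-",     # en dash              (CP-1252 0x96)
    "\u2014": "--",    # em dash              (CP-1252 0x97)
    "\u2026": "...",   # horizontal ellipsis  (CP-1252 0x85)
    "\u2122": "(tm)",  # trade                (CP-1252 0x99)
}

def downgrade_cp1252(raw: bytes) -> str:
    """Decode CP-1252 bytes, then swap in ASCII stand-ins from FALLBACKS."""
    text = raw.decode("cp1252", errors="replace")
    return "".join(FALLBACKS.get(ch, ch) for ch in text)

print(downgrade_cp1252(b"\x93Wait\x85\x94 \x97 she said"))
```

Characters that are not in the table pass through untouched, so plain Latin1 text is unaffected.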
Likewise, entities like ``&copy;'' should work in all contexts, regardless of what the prevailing charset or font is. If I am in a charset/font that doesn't include a copyright symbol, but the document says ``&copy;'' then Mozilla should do whatever it takes to cause that character to show up on the screen.
Here is a pair of tables that attempt to explain this mess.
This table lists the non-ASCII characters which exist in any of the three character sets, and which either: do not exist in all of them; or, exist but have different codes.
Those characters which have entries in the Latin1 column but not in the CP-1252 or MacRoman columns are characters which probably cannot be displayed at all on a Macintosh: the standard fonts probably don't contain those characters.
Likewise, those characters which have entries in the CP-1252 column but not in the Latin1 column cannot be displayed on Unix.
However, in all cases, entries in this table are trouble, because they can't be displayed as-is: if the font expects Latin1, then any MacRoman characters to be displayed must first be converted to their Latin1 equivalents.
(This is not a problem for Latin1 characters using CP-1252 fonts, since all Latin1 characters are also CP-1252 characters. However, *not* all Latin1 characters are MacRoman characters.)
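As a minimal sketch of that conversion step, assuming Python's built-in `mac_roman` and `latin-1` codecs match the tables here: the helper name `mac_to_latin1` is made up, and substituting ``?'' for characters Latin1 lacks is just one possible policy.

```python
def mac_to_latin1(raw: bytes) -> bytes:
    """Re-encode MacRoman bytes as Latin1, using '?' where Latin1 has no equivalent."""
    return raw.decode("mac_roman").encode("latin-1", errors="replace")

# 0x8E is eacute in MacRoman; Latin1 puts the same character at 0xE9.
print(mac_to_latin1(b"caf\x8e").hex())

# 0xD2 and 0xD3 are curly double quotes in MacRoman; Latin1 has no such
# characters, so with this policy they come out as '?'.
print(mac_to_latin1(b"\xd2hi\xd3"))
```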
The first column lists the official name of the character; the second column contains the raw byte of the character's code, in each of the three character sets (if applicable.) At least one of these should look like the appropriate character to you; the others will not.
+-------------------------------------------------------------------+
| Name                        Latin1       CP-1252      MacRoman    |
|                             Dec/Hex/Oct  Dec/Hex/Oct  Dec/Hex/Oct |
+-------------------------------------------------------------------+
| euro                        --- -- ---   128 80 200   219 DB 333  |
| Aring above                 197 C5 305   197 C5 305   129 81 201  |
| single low-9 quote          --- -- ---   130 82 202   226 E2 342  |
| fhook                       --- -- ---   131 83 203   196 C4 304  |
| double low-9 quote          --- -- ---   132 84 204   227 E3 343  |
| horizontal ellipsis         --- -- ---   133 85 205   201 C9 311  |
| dagger                      --- -- ---   134 86 206   160 A0 240  |
| double dagger               --- -- ---   135 87 207   224 E0 340  |
| per mille                   --- -- ---   137 89 211   228 E4 344  |
| Scaron                      --- -- ---   138 8A 212   --- -- ---  |
| single left angle quote     --- -- ---   139 8B 213   220 DC 334  |
| OE                          --- -- ---   140 8C 214   206 CE 316  |
| ccedilla                    231 E7 347   231 E7 347   141 8D 215  |
| Zcaron                      --- -- ---   142 8E 216   --- -- ---  |
| egrave                      232 E8 350   232 E8 350   143 8F 217  |
| ecircumflex                 234 EA 352   234 EA 352   144 90 220  |
| left single quote           --- -- ---   145 91 221   212 D4 324  |
| right single quote          --- -- ---   146 92 222   213 D5 325  |
| left double quote           --- -- ---   147 93 223   210 D2 322  |
| right double quote          --- -- ---   148 94 224   211 D3 323  |
| bullet                      --- -- ---   149 95 225   165 A5 245  |
| en dash                     --- -- ---   150 96 226   208 D0 320  |
| em dash                     --- -- ---   151 97 227   209 D1 321  |
| small tilde                 --- -- ---   152 98 230   247 F7 367  |
| trade                       --- -- ---   153 99 231   170 AA 252  |
| scaron                      --- -- ---   154 9A 232   --- -- ---  |
| single right angle quote    --- -- ---   155 9B 233   221 DD 335  |
| oe                          --- -- ---   156 9C 234   207 CF 317  |
| ugrave                      249 F9 371   249 F9 371   157 9D 235  |
| zcaron                      --- -- ---   158 9E 236   --- -- ---  |
| Ydiaeresis                  --- -- ---   159 9F 237   217 D9 331  |
| no-break space              160 A0 240   160 A0 240   202 CA 312  |
| inverted exclamation        161 A1 241   161 A1 241   193 C1 301  |
| currency                    164 A4 244   164 A4 244   --- -- ---  |
| yen                         165 A5 245   165 A5 245   180 B4 264  |
| broken bar                  166 A6 246   166 A6 246   --- -- ---  |
| section                     167 A7 247   167 A7 247   164 A4 244  |
| diaeresis                   168 A8 250   168 A8 250   172 AC 254  |
| feminine ordinal            170 AA 252   170 AA 252   187 BB 273  |
| left double angle quote     171 AB 253   171 AB 253   199 C7 307  |
| not                         172 AC 254   172 AC 254   194 C2 302  |
| soft hyphen                 173 AD 255   173 AD 255   --- -- ---  |
| registered                  174 AE 256   174 AE 256   168 A8 250  |
| macron                      175 AF 257   175 AF 257   248 F8 370  |
| degree                      176 B0 260   176 B0 260   161 A1 241  |
| superscript two             178 B2 262   178 B2 262   --- -- ---  |
| superscript three           179 B3 263   179 B3 263   --- -- ---  |
| acute                       180 B4 264   180 B4 264   171 AB 253  |
| pilcrow                     182 B6 266   182 B6 266   166 A6 246  |
| middle dot                  183 B7 267   183 B7 267   225 E1 341  |
| cedilla                     184 B8 270   184 B8 270   252 FC 374  |
| superscript one             185 B9 271   185 B9 271   --- -- ---  |
| masculine ordinal           186 BA 272   186 BA 272   188 BC 274  |
| right double angle quote    187 BB 273   187 BB 273   200 C8 310  |
| one quarter                 188 BC 274   188 BC 274   --- -- ---  |
| one half                    189 BD 275   189 BD 275   --- -- ---  |
| three quarters              190 BE 276   190 BE 276   --- -- ---  |
| inverted question           191 BF 277   191 BF 277   192 C0 300  |
| Agrave                      192 C0 300   192 C0 300   203 CB 313  |
| Aacute                      193 C1 301   193 C1 301   231 E7 347  |
| Acircumflex                 194 C2 302   194 C2 302   229 E5 345  |
| Atilde                      195 C3 303   195 C3 303   204 CC 314  |
| Adiaeresis                  196 C4 304   196 C4 304   128 80 200  |
| Aring above                 197 C5 305   197 C5 305   129 81 201  |
| AE                          198 C6 306   198 C6 306   174 AE 256  |
| Ccedilla                    199 C7 307   199 C7 307   130 82 202  |
| Egrave                      200 C8 310   200 C8 310   233 E9 351  |
| Eacute                      201 C9 311   201 C9 311   131 83 203  |
| Ecircumflex                 202 CA 312   202 CA 312   230 E6 346  |
| Ediaeresis                  203 CB 313   203 CB 313   232 E8 350  |
| Igrave                      204 CC 314   204 CC 314   237 ED 355  |
| Iacute                      205 CD 315   205 CD 315   234 EA 352  |
| Icircumflex                 206 CE 316   206 CE 316   235 EB 353  |
| Idiaeresis                  207 CF 317   207 CF 317   236 EC 354  |
| ETH                         208 D0 320   208 D0 320   --- -- ---  |
| Ntilde                      209 D1 321   209 D1 321   132 84 204  |
| Ograve                      210 D2 322   210 D2 322   241 F1 361  |
| Oacute                      211 D3 323   211 D3 323   238 EE 356  |
| Ocircumflex                 212 D4 324   212 D4 324   239 EF 357  |
| Otilde                      213 D5 325   213 D5 325   205 CD 315  |
| Odiaeresis                  214 D6 326   214 D6 326   133 85 205  |
| multiplication              215 D7 327   215 D7 327   --- -- ---  |
| Ostroke                     216 D8 330   216 D8 330   175 AF 257  |
| Ugrave                      217 D9 331   217 D9 331   244 F4 364  |
| Uacute                      218 DA 332   218 DA 332   242 F2 362  |
| Ucircumflex                 219 DB 333   219 DB 333   243 F3 363  |
| Udiaeresis                  220 DC 334   220 DC 334   134 86 206  |
| Yacute                      221 DD 335   221 DD 335   --- -- ---  |
| THORN                       222 DE 336   222 DE 336   --- -- ---  |
| sharp s                     223 DF 337   223 DF 337   167 A7 247  |
| agrave                      224 E0 340   224 E0 340   136 88 210  |
| aacute                      225 E1 341   225 E1 341   135 87 207  |
| acircumflex                 226 E2 342   226 E2 342   137 89 211  |
| atilde                      227 E3 343   227 E3 343   139 8B 213  |
| adiaeresis                  228 E4 344   228 E4 344   138 8A 212  |
| aring above                 229 E5 345   229 E5 345   140 8C 214  |
| ae                          230 E6 346   230 E6 346   190 BE 276  |
| ccedilla                    231 E7 347   231 E7 347   141 8D 215  |
| egrave                      232 E8 350   232 E8 350   143 8F 217  |
| eacute                      233 E9 351   233 E9 351   142 8E 216  |
| ecircumflex                 234 EA 352   234 EA 352   144 90 220  |
| ediaeresis                  235 EB 353   235 EB 353   145 91 221  |
| igrave                      236 EC 354   236 EC 354   147 93 223  |
| iacute                      237 ED 355   237 ED 355   146 92 222  |
| icircumflex                 238 EE 356   238 EE 356   148 94 224  |
| idiaeresis                  239 EF 357   239 EF 357   149 95 225  |
| eth                         240 F0 360   240 F0 360   --- -- ---  |
| ntilde                      241 F1 361   241 F1 361   150 96 226  |
| ograve                      242 F2 362   242 F2 362   152 98 230  |
| oacute                      243 F3 363   243 F3 363   151 97 227  |
| ocircumflex                 244 F4 364   244 F4 364   153 99 231  |
| otilde                      245 F5 365   245 F5 365   155 9B 233  |
| odiaeresis                  246 F6 366   246 F6 366   154 9A 232  |
| division                    247 F7 367   247 F7 367   214 D6 326  |
| ostroke                     248 F8 370   248 F8 370   191 BF 277  |
| ugrave                      249 F9 371   249 F9 371   157 9D 235  |
| uacute                      250 FA 372   250 FA 372   156 9C 234  |
| ucircumflex                 251 FB 373   251 FB 373   158 9E 236  |
| udiaeresis                  252 FC 374   252 FC 374   159 9F 237  |
| yacute                      253 FD 375   253 FD 375   --- -- ---  |
| thorn                       254 FE 376   254 FE 376   --- -- ---  |
| ydiaeresis                  255 FF 377   255 FF 377   216 D8 330  |
+-------------------------------------------------------------------+
This table lists the non-ASCII characters which exist in all three character sets, and which have the same codes in each. If a document's character set is mislabelled (it is really CP-1252 or MacRoman but is labelled as Latin1, or vice versa) these characters will still all display properly (assuming the display device supports any of the three charsets.)
You'll notice that there are not a whole lot of characters in this table.
+-----------------------------+
| Name            Dec/Hex/Oct |
+-----------------------------+
| cent            162 A2 242  |
| pound           163 A3 243  |
| copyright       169 A9 251  |
| plus-minus      177 B1 261  |
| micro           181 B5 265  |
+-----------------------------+
If you want to generate tables like this comparing other character sets, you might find this perl script that I wrote to generate these tables useful.
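For readers without Perl handy, here is a rough Python sketch of the same idea. It is not the script mentioned above; it assumes Python's built-in `latin-1`, `cp1252`, and `mac_roman` codecs agree with the tables here, and the helper names `decode_byte` and `shared_codes` are made up. It finds the byte values that mean the same printable character in all three character sets.

```python
import unicodedata

CHARSETS = ("latin-1", "cp1252", "mac_roman")

def decode_byte(code: int, charset: str):
    """Return the character a byte maps to, or None if the slot is unusable."""
    try:
        ch = bytes([code]).decode(charset)
    except UnicodeDecodeError:
        return None  # undefined slot, e.g. 0x81 in CP-1252
    # Treat control characters (e.g. Latin1 0x80-0x9F) as "no character".
    return None if unicodedata.category(ch) == "Cc" else ch

def shared_codes():
    """Non-ASCII byte values that mean the same character in every charset."""
    shared = []
    for code in range(0x80, 0x100):
        chars = {decode_byte(code, cs) for cs in CHARSETS}
        if len(chars) == 1 and None not in chars:
            shared.append(code)
    return shared

for code in shared_codes():
    # Latin1 byte values equal Unicode code points, so chr(code) is the character.
    name = unicodedata.name(chr(code)).lower()
    print(f"{name:<20} {code:3d} {code:02X} {code:03o}")
```

Swapping other codec names into `CHARSETS` should work the same way, as long as Python ships a codec for the character set in question.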
The canonical source of information about these character sets can be found on ftp.unicode.org here:
The names of the HTML character entities that correspond to the official Unicode character names can be found on w3.org here: