charsets.
by Jamie Zawinski <jwz@jwz.org>
Written 18-Dec-1997.
Updated 6-Sep-2002.

A lot of web pages use characters that cannot be viewed by users of Macintosh or Unix computers. This is because Windows, Macintosh, and Unix all use different character sets. Specifically, all three use character sets that are different supersets of ASCII: Unix uses Latin1 (ISO-8859-1), Windows uses CP-1252, and the Macintosh uses MacRoman.

This is further complicated by the fact that many documents that are actually written in CP-1252 claim (incorrectly) to be Latin1.
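To make the mislabeling concrete, here is a minimal Python sketch. It is not part of the original tables, and the sample bytes are invented for illustration. It decodes the same bytes once as CP-1252 and once as Latin1, using Python's standard cp1252 and latin-1 codecs. Read as Latin1, the bytes in the range 0x80 through 0x9F turn into invisible C1 control codes instead of quotes and dashes.

```python
# Sample bytes, invented for illustration: in CP-1252, 0x93 and 0x94 are
# curly double quotes and 0x97 is an em dash; 0xE9 is e-acute in both sets.
raw = b"\x93quoted\x94 \x97 caf\xe9"

as_cp1252 = raw.decode("cp1252")    # what the author of the page meant
as_latin1 = raw.decode("latin-1")   # what the page claims to be


def c1_controls(text):
    """Return the characters in text that fall in the C1 control range."""
    return [ch for ch in text if 0x80 <= ord(ch) <= 0x9F]


print(repr(as_cp1252))
print(len(c1_controls(as_cp1252)), len(c1_controls(as_latin1)))
```

Any character that lands in the C1 range after a Latin1 decode is a strong hint that the document was really written in CP-1252.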

Portable web pages written in Western European ("Romance") languages should be written using only characters from Latin1.

Mozilla could, and should, do a more intelligent translation of CP-1252 characters to their best possible screen representation.

For example, when looking at a web page that is actually in CP-1252 (with long hyphens, open and close double quotes, etc.), it would be really nice if the layout engine would convert those characters to ones that I can actually see on my screen. Most Unix fonts don't include those characters at all, so they show up either as blank space or as ``?.''
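One way such a conversion could look is sketched below in Python. This is only an illustration under my own assumptions, not what Mozilla does: the FALLBACKS table and the to_latin1_safe function are hypothetical names, and the choice of substitutes is a matter of taste. The idea is to decode the bytes as CP-1252 and replace anything that has no Latin1 equivalent with a rough approximation that does.

```python
# Hypothetical fallback table: characters that exist in CP-1252 but not in
# Latin1, mapped to approximations built from characters Latin1 does have.
FALLBACKS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201a": ",",    # single low-9 quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u201e": ",,",   # double low-9 quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u2022": "*",    # bullet
    "\u2122": "(tm)", # trade mark
    "\u20ac": "EUR",  # euro
}


def to_latin1_safe(data: bytes) -> str:
    """Decode CP-1252 bytes and substitute characters missing from Latin1."""
    text = data.decode("cp1252", errors="replace")
    out = []
    for ch in text:
        if ch in FALLBACKS:
            out.append(FALLBACKS[ch])
            continue
        try:
            ch.encode("latin-1")   # representable in Latin1: keep as-is
            out.append(ch)
        except UnicodeEncodeError:
            out.append("?")        # no known approximation
    return "".join(out)


print(to_latin1_safe(b"\x93hi\x94 \x97 \x85 caf\xe9"))
```

Characters with no entry in the table still come out as a question mark, so the table would need to grow to cover everything listed under Problematic Characters.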

Likewise, entities like ``&copy;'' should work in all contexts, regardless of what the prevailing charset or font is. If I am in a charset/font that doesn't include a copyright symbol, but the document says ``&copy;'' then Mozilla should do whatever it takes to cause that character to show up on the screen -- which probably means drawing that character with a different font (one that does include it.)
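The check described above can be sketched in a few lines of Python. This is an illustration only: resolve_entity is a hypothetical helper, and it uses Python's own html.entities table and codec names rather than anything from Mozilla. The point is that an entity names a character independently of the prevailing charset, so the renderer can tell when it needs to reach for a different font.

```python
from html.entities import name2codepoint


def resolve_entity(name, charset):
    """Return (character, encodable) for an HTML entity name.

    The second value says whether the given charset can represent the
    character; if it cannot, a renderer would need some other font.
    """
    ch = chr(name2codepoint[name])
    try:
        ch.encode(charset)
        return ch, True
    except UnicodeEncodeError:
        return ch, False


print(resolve_entity("copy", "latin-1"))
print(resolve_entity("copy", "ascii"))
```

In the second call the character is still known to be a copyright symbol even though the charset cannot hold it, which is exactly the case where drawing it with another font makes sense.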

Here is a pair of tables that attempt to explain this mess.


Problematic Characters

+----------------------------------------------------------------------+
| Name                    Char     Latin1      CP-1252      MacRoman   |
|                                Dec/Hex/Oct  Dec/Hex/Oct  Dec/Hex/Oct |
+----------------------------------------------------------------------+
| euro                       €   --- -- ---   128 80 200   219 DB 333  |
| Aring above                Å   197 C5 305   197 C5 305   129 81 201  |
| single low-9 quote         ‚   --- -- ---   130 82 202   226 E2 342  |
| fhook                      ƒ   --- -- ---   131 83 203   196 C4 304  |
| double low-9 quote         „   --- -- ---   132 84 204   227 E3 343  |
| horizontal ellipsis        …   --- -- ---   133 85 205   201 C9 311  |
| dagger                     †   --- -- ---   134 86 206   160 A0 240  |
| double dagger              ‡   --- -- ---   135 87 207   224 E0 340  |
| per mille                  ‰   --- -- ---   137 89 211   228 E4 344  |
| Scaron                     Š   --- -- ---   138 8A 212   --- -- ---  |
| single left angle quote    ‹   --- -- ---   139 8B 213   220 DC 334  |
| OE                         Œ   --- -- ---   140 8C 214   206 CE 316  |
| ccedilla                   ç   231 E7 347   231 E7 347   141 8D 215  |
| Zcaron                     Ž   --- -- ---   142 8E 216   --- -- ---  |
| egrave                     è   232 E8 350   232 E8 350   143 8F 217  |
| ecircumflex                ê   234 EA 352   234 EA 352   144 90 220  |
| left single quote          ‘   --- -- ---   145 91 221   212 D4 324  |
| right single quote         ’   --- -- ---   146 92 222   213 D5 325  |
| left double quote          “   --- -- ---   147 93 223   210 D2 322  |
| right double quote         ”   --- -- ---   148 94 224   211 D3 323  |
| bullet                     •   --- -- ---   149 95 225   165 A5 245  |
| en dash                    –   --- -- ---   150 96 226   208 D0 320  |
| em dash                    —   --- -- ---   151 97 227   209 D1 321  |
| small tilde                ˜   --- -- ---   152 98 230   247 F7 367  |
| trade                      ™   --- -- ---   153 99 231   170 AA 252  |
| scaron                     š   --- -- ---   154 9A 232   --- -- ---  |
| single right angle quote   ›   --- -- ---   155 9B 233   221 DD 335  |
| oe                         œ   --- -- ---   156 9C 234   207 CF 317  |
| ugrave                     ù   249 F9 371   249 F9 371   157 9D 235  |
| zcaron                     ž   --- -- ---   158 9E 236   --- -- ---  |
| Ydiaeresis                 Ÿ   --- -- ---   159 9F 237   217 D9 331  |
| no-break space                 160 A0 240   160 A0 240   202 CA 312  |
| inverted exclamation       ¡   161 A1 241   161 A1 241   193 C1 301  |
| currency                   ¤   164 A4 244   164 A4 244   --- -- ---  |
| yen                        ¥   165 A5 245   165 A5 245   180 B4 264  |
| broken bar                 ¦   166 A6 246   166 A6 246   --- -- ---  |
| section                    §   167 A7 247   167 A7 247   164 A4 244  |
| diaeresis                  ¨   168 A8 250   168 A8 250   172 AC 254  |
| feminine ordinal           ª   170 AA 252   170 AA 252   187 BB 273  |
| left double angle quote    «   171 AB 253   171 AB 253   199 C7 307  |
| not                        ¬   172 AC 254   172 AC 254   194 C2 302  |
| soft hyphen                    173 AD 255   173 AD 255   --- -- ---  |
| registered                 ®   174 AE 256   174 AE 256   168 A8 250  |
| macron                     ¯   175 AF 257   175 AF 257   248 F8 370  |
| degree                     °   176 B0 260   176 B0 260   161 A1 241  |
| superscript two            ²   178 B2 262   178 B2 262   --- -- ---  |
| superscript three          ³   179 B3 263   179 B3 263   --- -- ---  |
| acute                      ´   180 B4 264   180 B4 264   171 AB 253  |
| pilcrow                    ¶   182 B6 266   182 B6 266   166 A6 246  |
| middle dot                 ·   183 B7 267   183 B7 267   225 E1 341  |
| cedilla                    ¸   184 B8 270   184 B8 270   252 FC 374  |
| superscript one            ¹   185 B9 271   185 B9 271   --- -- ---  |
| masculine ordinal          º   186 BA 272   186 BA 272   188 BC 274  |
| right double angle quote   »   187 BB 273   187 BB 273   200 C8 310  |
| one quarter                ¼   188 BC 274   188 BC 274   --- -- ---  |
| one half                   ½   189 BD 275   189 BD 275   --- -- ---  |
| three quarters             ¾   190 BE 276   190 BE 276   --- -- ---  |
| inverted question          ¿   191 BF 277   191 BF 277   192 C0 300  |
| Agrave                     À   192 C0 300   192 C0 300   203 CB 313  |
| Aacute                     Á   193 C1 301   193 C1 301   231 E7 347  |
| Acircumflex                Â   194 C2 302   194 C2 302   229 E5 345  |
| Atilde                     Ã   195 C3 303   195 C3 303   204 CC 314  |
| Adiaeresis                 Ä   196 C4 304   196 C4 304   128 80 200  |
| Aring above                Å   197 C5 305   197 C5 305   129 81 201  |
| AE                         Æ   198 C6 306   198 C6 306   174 AE 256  |
| Ccedilla                   Ç   199 C7 307   199 C7 307   130 82 202  |
| Egrave                     È   200 C8 310   200 C8 310   233 E9 351  |
| Eacute                     É   201 C9 311   201 C9 311   131 83 203  |
| Ecircumflex                Ê   202 CA 312   202 CA 312   230 E6 346  |
| Ediaeresis                 Ë   203 CB 313   203 CB 313   232 E8 350  |
| Igrave                     Ì   204 CC 314   204 CC 314   237 ED 355  |
| Iacute                     Í   205 CD 315   205 CD 315   234 EA 352  |
| Icircumflex                Î   206 CE 316   206 CE 316   235 EB 353  |
| Idiaeresis                 Ï   207 CF 317   207 CF 317   236 EC 354  |
| ETH                        Ð   208 D0 320   208 D0 320   --- -- ---  |
| Ntilde                     Ñ   209 D1 321   209 D1 321   132 84 204  |
| Ograve                     Ò   210 D2 322   210 D2 322   241 F1 361  |
| Oacute                     Ó   211 D3 323   211 D3 323   238 EE 356  |
| Ocircumflex                Ô   212 D4 324   212 D4 324   239 EF 357  |
| Otilde                     Õ   213 D5 325   213 D5 325   205 CD 315  |
| Odiaeresis                 Ö   214 D6 326   214 D6 326   133 85 205  |
| multiplication             ×   215 D7 327   215 D7 327   --- -- ---  |
| Ostroke                    Ø   216 D8 330   216 D8 330   175 AF 257  |
| Ugrave                     Ù   217 D9 331   217 D9 331   244 F4 364  |
| Uacute                     Ú   218 DA 332   218 DA 332   242 F2 362  |
| Ucircumflex                Û   219 DB 333   219 DB 333   243 F3 363  |
| Udiaeresis                 Ü   220 DC 334   220 DC 334   134 86 206  |
| Yacute                     Ý   221 DD 335   221 DD 335   --- -- ---  |
| THORN                      Þ   222 DE 336   222 DE 336   --- -- ---  |
| sharp s                    ß   223 DF 337   223 DF 337   167 A7 247  |
| agrave                     à   224 E0 340   224 E0 340   136 88 210  |
| aacute                     á   225 E1 341   225 E1 341   135 87 207  |
| acircumflex                â   226 E2 342   226 E2 342   137 89 211  |
| atilde                     ã   227 E3 343   227 E3 343   139 8B 213  |
| adiaeresis                 ä   228 E4 344   228 E4 344   138 8A 212  |
| aring above                å   229 E5 345   229 E5 345   140 8C 214  |
| ae                         æ   230 E6 346   230 E6 346   190 BE 276  |
| ccedilla                   ç   231 E7 347   231 E7 347   141 8D 215  |
| egrave                     è   232 E8 350   232 E8 350   143 8F 217  |
| eacute                     é   233 E9 351   233 E9 351   142 8E 216  |
| ecircumflex                ê   234 EA 352   234 EA 352   144 90 220  |
| ediaeresis                 ë   235 EB 353   235 EB 353   145 91 221  |
| igrave                     ì   236 EC 354   236 EC 354   147 93 223  |
| iacute                     í   237 ED 355   237 ED 355   146 92 222  |
| icircumflex                î   238 EE 356   238 EE 356   148 94 224  |
| idiaeresis                 ï   239 EF 357   239 EF 357   149 95 225  |
| eth                        ð   240 F0 360   240 F0 360   --- -- ---  |
| ntilde                     ñ   241 F1 361   241 F1 361   150 96 226  |
| ograve                     ò   242 F2 362   242 F2 362   152 98 230  |
| oacute                     ó   243 F3 363   243 F3 363   151 97 227  |
| ocircumflex                ô   244 F4 364   244 F4 364   153 99 231  |
| otilde                     õ   245 F5 365   245 F5 365   155 9B 233  |
| odiaeresis                 ö   246 F6 366   246 F6 366   154 9A 232  |
| division                   ÷   247 F7 367   247 F7 367   214 D6 326  |
| ostroke                    ø   248 F8 370   248 F8 370   191 BF 277  |
| ugrave                     ù   249 F9 371   249 F9 371   157 9D 235  |
| uacute                     ú   250 FA 372   250 FA 372   156 9C 234  |
| ucircumflex                û   251 FB 373   251 FB 373   158 9E 236  |
| udiaeresis                 ü   252 FC 374   252 FC 374   159 9F 237  |
| yacute                     ý   253 FD 375   253 FD 375   --- -- ---  |
| thorn                      þ   254 FE 376   254 FE 376   --- -- ---  |
| ydiaeresis                 ÿ   255 FF 377   255 FF 377   216 D8 330  |
+----------------------------------------------------------------------+

Identical Characters

	+--------------------------------------------+
	| Name                    Char   Dec/Hex/Oct |
	+--------------------------------------------+
	| cent                       ¢   162 A2 242  |
	| pound                      £   163 A3 243  |
	| copyright                  ©   169 A9 251  |
	| plus-minus                 ±   177 B1 261  |
	| micro                      µ   181 B5 265  |
	+--------------------------------------------+


If you want to generate tables like this comparing other character sets, you might find the Perl script that I wrote to generate these tables useful.
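For readers without that script at hand, here is a rough Python sketch of the same idea. It is not the Perl script mentioned above, and it makes its own assumptions: it relies on Python's built-in latin-1, cp1252, and mac_roman codecs as the mapping source, takes row names from the Unicode character database, and uses hypothetical helper names (code_cell, table_row).

```python
import unicodedata

# Assumption: Python's built-in codecs stand in for the three character sets.
CHARSETS = ("latin-1", "cp1252", "mac_roman")


def code_cell(ch, charset):
    """Format one Dec/Hex/Oct cell, or dashes if the charset lacks ch."""
    try:
        code = ch.encode(charset)[0]
    except UnicodeEncodeError:
        return "--- -- ---"
    return f"{code:3d} {code:02X} {code:03o}"


def table_row(ch):
    """Build one table row for a single Unicode character."""
    name = unicodedata.name(ch, "unknown").lower()
    cells = "   ".join(code_cell(ch, cs) for cs in CHARSETS)
    return f"| {name:<45}{cells}  |"


# Print rows for a few of the characters that differ between the sets.
for ch in "\u20ac\u00c5\u2014\u00a9":
    print(table_row(ch))
```

Swapping other codec names into CHARSETS is all it takes to compare different character sets, as long as Python ships a codec for them.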

The canonical source of information about these character sets is ftp.unicode.org.

The names of the HTML character entities that correspond to the official Unicode character names can be found on w3.org.

