My blog at www.shimmy1996.com

    1 +++
    2 title = "Fun with Fonts on the Web"
    3 date = 2019-12-01
    4 slug = "fun-with-fonts-on-the-web"
    5 draft = false
    6 +++
    7 
    8 A more accurate version of the title probably should be "Fun with Fonts in Web Browsers", but oh well, it sounds cooler that way. Text rendering is [hard](https://gankra.github.io/blah/text-hates-you/), and it certainly doesn't help that we have a plethora of different writing systems (blame the Tower of Babel for that, I guess) which cannot be elegantly fitted into a uniform system. Running a bilingual blog doubles the trouble in font picking, and here's a compilation of the various problems I encountered.
    9 
   10 
   11 ## Space Invaders {#space-invaders}
   12 
   13 Most browsers join consecutive lines of text in HTML into a single line, with a space added in between, so
   14 
   15 ```html
   16 <html>Line one and
   17 line two.</html>
   18 ```
   19 
   20 renders to
   21 
   22 ```text
   23 Line one and line two.
   24 ```
   25 
   26 Such a simplistic rule doesn't work for CJK languages, where no separators are used between words. The solution is to specify the `lang` attribute for the page (or any specific element on the page) like so:
   27 
   28 ```html
   29 <html lang="zh">第一行和
   30 第二行。</html>
   31 ```
   32 
   33 If your browser is smart enough (like Firefox), it will join the lines correctly. All the Blink-based browsers, however, still stubbornly shove in the extra space, so it looks like I will be stuck with unwrapped source files like a barbarian for a bit longer. While not a cure-all solution, specifying the `lang` attribute still has the added benefit of enabling language-specific CSS rules, which come in handy later.
   34 
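To make the joining rule concrete, here is a minimal Python sketch of a language-aware line join. It is only a heuristic of my own for illustration, not what any browser actually implements: the helper names `is_wide` and `join_lines` are made up, and "wide" is assumed to mean an East Asian Width of `W` or `F`.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """Return True for wide or full-width characters (most CJK text and punctuation)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_lines(text: str) -> str:
    """Join soft-wrapped lines, dropping the break when both neighbors are wide."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ""
    joined = lines[0]
    for line in lines[1:]:
        if is_wide(joined[-1]) and is_wide(line[0]):
            # Break between two wide characters: join without a space.
            joined += line
        else:
            # Otherwise behave like the default HTML rule: one space.
            joined += " " + line
    return joined


print(join_lines("Line one and\nline two."))
print(join_lines("第一行和\n第二行。"))
```

A step like this could in principle run over wrapped source files before they reach a Blink-based browser, though mixed Latin and CJK text would need more careful rules than this sketch has.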
   35 
   36 ## Return of the Quotation Marks {#return-of-the-quotation-marks}
   37 
   38 As mentioned in a [previous post](https://www.shimmy1996.com/en/posts/2018-06-24-fun-with-fonts-in-emacs/), CJK fonts render quotation marks as full-width characters, unlike Latin fonts. This won't be a problem as long as a web page doesn't try to mix and match fonts: just use language-specific font stacks.
   39 
   40 ```css
   41 body:lang(en) {
   42     font-family: "Oxygen Sans", sans-serif;
   43 }
   44 
   45 body:lang(zh) {
   46     font-family: "Noto Sans SC", sans-serif;
   47 }
   48 ```
   49 
   50 Coupled with matching `lang` attributes, the story would have ended here. Firefox even allows you to specify default fonts on a per-language basis, so you can actually get away with just the fallback values, like `sans-serif` or `serif`, and not even bother writing language-specific CSS.
   51 
   52 However, what if I want to use Oxygen Sans for Latin characters and Noto Sans SC for CJK characters? While seemingly a sensible solution, specifying the font stack like so,
   53 
   54 ```css
   55 body:lang(zh) {
   56     font-family: "Oxygen Sans", "Noto Sans SC", sans-serif;
   57 }
   58 ```
   59 
   60 would cause the quotation marks to be rendered using Oxygen Sans, which displays them as half-width characters. The solution I found is to declare an override font with a specified `unicode-range` that covers the quotation marks,
   61 
   62 ```css
   63 @font-face {
   64     font-family: "Noto Sans SC Override";
   65     unicode-range: U+2018-2019, U+201C-201D;
   66     src: local("NotoSansCJKsc-Regular");
   67 }
   68 ```
   69 
   70 and revise the font stack as
   71 
   72 ```css
   73 body:lang(zh) {
   74     font-family: "Noto Sans SC Override", "Oxygen Sans", "Noto Sans SC", sans-serif;
   75 }
   76 ```
   77 
   78 Now we can enjoy the quotation marks in their full-width glory!
   79 
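The effect of the override can be modeled in a few lines of Python. This is a toy model, not how a browser's font matching is actually implemented: each font is reduced to a name plus the code point ranges it may render, and the coverage lists for Oxygen Sans and Noto Sans SC are rough assumptions made for illustration.

```python
# Each entry: (font name, inclusive code point ranges it is allowed to render).
QUOTES = [(0x2018, 0x2019), (0x201C, 0x201D)]  # same as the unicode-range above
LATIN = [(0x0000, 0x024F), (0x2000, 0x206F)]   # assumed coverage, includes the quotes
CJK = [(0x2000, 0x206F), (0x3000, 0x303F), (0x4E00, 0x9FFF)]  # assumed coverage

FONT_STACK = [
    ("Noto Sans SC Override", QUOTES),
    ("Oxygen Sans", LATIN),
    ("Noto Sans SC", CJK),
]


def pick_font(ch, stack=FONT_STACK):
    """Return the first font in the stack whose ranges contain the character."""
    cp = ord(ch)
    for name, ranges in stack:
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return "sans-serif"  # generic fallback at the end of the stack


for ch in "“a第":
    print(f"U+{ord(ch):04X} -> {pick_font(ch)}")
```

Calling `pick_font("“", FONT_STACK[1:])`, that is, the stack without the override, returns "Oxygen Sans", which is the half-width behavior described earlier.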
   80 
   81 ## Font Ninja {#font-ninja}
   82 
   83 Font files are quite significant in size, and even more so for CJK ones: the Noto Sans SC font just mentioned is [over 8MB](https://github.com/googlefonts/noto-cjk/blob/master/NotoSansSC-Regular.otf) in size. No matter how determined I am to serve everything from my own server, this seems like utter overkill considering the average HTML file size on my site is probably closer to 8KB. How do all the web font services handle this, then?
   84 
   85 Most web font services work by adding a bunch of [`@font-face`](https://developer.mozilla.org/en-US/docs/Web/CSS/@font-face) definitions into a website's style sheet, which pull font files from dedicated servers. To reduce the size of the files being served, Google Fonts slices the font file into smaller chunks and declares the corresponding `unicode-range` for each chunk in its `@font-face` block (this is exactly how they handle [CJK fonts](https://fonts.googleapis.com/css?family=Noto+Sans+SC)). They also compress the font files into WOFF2, further reducing file size. On the other hand, [Adobe Fonts](https://fonts.adobe.com/) (previously known as Typekit) seems to have some JavaScript wizardry that dynamically determines which glyphs to load from a font file.
   86 
   87 Combining the best of both worlds, and thanks to the fact that this is a static site, it is easy to gather all the used characters and serve a font file containing just those. The tools of choice here would be pyftsubset (available as a component of [fonttools](https://pypi.org/project/fonttools/)) and GNU AWK. Compressing font files into WOFF2 also requires Brotli, a compression library. Under Arch Linux, the required packages are [python-fonttools](https://www.archlinux.org/packages/community/any/python-fonttools/), [gawk](https://www.archlinux.org/packages/core/x86%5F64/gawk/), [brotli](https://www.archlinux.org/packages/community/x86%5F64/brotli/), and [python-brotli](https://www.archlinux.org/packages/community/x86%5F64/python-brotli/).
   88 
   89 Here's a shell one-liner to collect all the used glyphs from generated HTML files:
   90 
   91 ```sh
   92 find . -type f -name "*.html" -print0 | xargs -0 awk 'BEGIN{FS="";ORS=""} {for(i=1;i<=NF;i++){chars[$(i)]=$(i);}} END{for(c in chars){print c;} }' > glyphs.txt
   93 ```
   94 
   95 You may need to `export LANG=en_US.UTF-8` (or any other UTF-8 locale) for certain glyphs to be handled correctly. With the list of glyphs, we can extract the useful part of font files and compress them:
   96 
   97 ```sh
   98 pyftsubset NotoSansSC-Regular.otf --text-file=glyphs.txt --flavor=woff2 --output-file=NotoSansSC-Regular.woff2
   99 ```
  100 
  101 Specifying `--no-hinting` and `--desubroutinize` can further reduce the size of the generated file at the cost of some aesthetic fine-tuning. A similar technique can be used to shrink Latin fonts down to include only ASCII characters (or keep the extended ASCII range with `U+0000-00FF`):
  102 
  103 ```sh
  104 pyftsubset Oxygen-Sans.ttf --unicodes="U+0000-007F" --flavor=woff2 --output-file=Oxygen-Sans.woff2
  105 ```
  106 
  107 Once this is done, the available glyphs can be checked using most font manager software, or this [online checker](http://torinak.com/font/lsfont.html) (it does not support WOFF2, but you can convert the file into another format first, such as WOFF).
  108 
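For anyone who would rather not depend on GNU AWK, the glyph collection can be sketched in Python as well. This is a rough equivalent rather than a drop-in replacement for the one-liner: the function names are my own, every file is assumed to be UTF-8, and newline characters end up in the set (harmless for subsetting). The second helper formats a set of characters in the same `U+XXXX-YYYY` style used for `--unicodes`.

```python
from pathlib import Path


def collect_chars(root):
    """Gather every distinct character found in *.html files under root."""
    chars = set()
    for path in Path(root).rglob("*.html"):
        # Assumes UTF-8 files; markup characters are collected too.
        chars.update(path.read_text(encoding="utf-8"))
    return chars


def to_unicode_ranges(chars):
    """Format characters as a comma-separated list of U+XXXX[-YYYY] ranges."""
    ranges = []
    for cp in sorted({ord(c) for c in chars}):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp       # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return ",".join(
        f"U+{lo:04X}" if lo == hi else f"U+{lo:04X}-{hi:04X}"
        for lo, hi in ranges
    )


print(to_unicode_ranges("abcA"))  # U+0041,U+0061-0063
```

Writing `"".join(sorted(collect_chars(".")))` to `glyphs.txt` should give a file usable with `--text-file`, in the same way as the output of the shell version.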
  109 I also played around with the idea of dividing the glyphs into further chunks by popularity, so here's another one-liner to get a list of glyphs sorted by number of appearances:
  110 
  111 ```sh
  112 find . -type f -name "*.html" -print0 | xargs -0 awk 'BEGIN{FS=""} {for(i=1;i<=NF;i++){chars[$(i)]++;}} END{for(c in chars){printf "%06d %s\n", chars[c], c;}}' | sort -r > glyph-by-freq.txt
  113 ```
  114 
  115 It turns out my blog has around 1000 different Chinese characters, with roughly 400 of them appearing more than 10 times. Since the file size I get from a single round of subsetting is already good enough, I didn't bother proceeding with another split.
  116 
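For reference, a split by popularity could look something like the Python sketch below. It is not something I ended up using: the function name, the default threshold of 10, and the restriction to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are all assumptions made for illustration.

```python
from collections import Counter


def split_by_frequency(text, threshold=10):
    """Split the distinct CJK ideographs in text into (common, rare) lists."""
    # Only the basic CJK Unified Ideographs block is counted here.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    ordered = counts.most_common()  # most frequent first
    common = [ch for ch, n in ordered if n > threshold]
    rare = [ch for ch, n in ordered if n <= threshold]
    return common, rare


common, rare = split_by_frequency("的" * 12 + "是" * 3 + "abc")
print(common, rare)  # ['的'] ['是']
```

Each list could then be written to its own text file and passed to `pyftsubset` via `--text-file`, with a matching `unicode-range` on each resulting `@font-face` block.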
  117 
  118 ## For Your Browsers Only {#for-your-browsers-only}
  119 
  120 With all the tricks in my bag, I was able to cut the combined font file size down to around 250KB, which is still orders of magnitude above that of an HTML file. While it is nice to see my site appearing the same across all devices and screens, I feel the benefit is out of proportion to the 100-fold increase in page size.
  121 
  122 Maybe it is just not worth it to force the choice of fonts. In case you want to see my site as I would like to see it, here are my go-to fonts:
  123 
  124 -   Proportional Latin font: [Oxygen Sans](https://github.com/KDE/oxygen-fonts). Note that the KDE version has nuanced differences from the [Google Fonts version](https://fonts.google.com/specimen/Oxygen), and I like the KDE version much more.
  125 -   Proportional CJK font: [Noto Sans CJK](https://www.google.com/get/noto/help/cjk/).
  126 -   Monospace font: [Iosevka](https://typeof.net/Iosevka/), the ss09 variant, to be more exact.