# UniHan tutorial ```elixir Mix.install([ {:unicode_unihan, "~> 0.1.0"} ]) ``` ## Setup The `Unihan` module lets you work with the Unihan database at three levels of granularity: * individual characters, * population of characters, and * attributes of fields within characters This Livebook walks you through these three levels. ```elixir import Unicode.Unihan ``` ## Single Unicode Lookup The `Unihan` library provides, first and foremost, fast lookups of the range of data within the Unihan database. The function `unihan/1` accepts a variety of input, and returns the information contained within the Unihan database as a parsed map. The character "萬", standing for ten-thousand in Zh-T, will be used as an example. ```elixir # usage as codepoint unihan(33836) ``` ```elixir # use as string grapheme unihan(33836) == unihan("萬") ``` ```elixir # use as hex string unihan(33836) == unihan("U+842C") ``` The map can be accessed through the unicode keys **as atoms**. These keys are specified in [Annex #38](https://www.unicode.org/reports/tr38/). The values have been further parsed, often into maps of their own; key naming of these smaller maps are not specified by Unihan, and have taken (in general) to be consistent to the implementation in Python's [unihan-etl](https://github.com/cihai/unihan-etl/blob/master/src/unihan_etl/expansion.py) library. See documentation on Hexdocs for details. Note that these maps can be accessed using `Access` (square brackets `[]`), or the method dot notation (`.`). The former returns `nil` when the key does not exist, whereas the latter throws an exception. ```elixir # parses to an int IO.inspect(unihan("萬")[:kGradeLevel]) # parses to a list IO.inspect(unihan("萬")[:kCantonese]) # parses to a map IO.inspect(unihan("萬")[:kTotalStrokes]) ``` Often we would like to return from a map or a codepoint to its string grapheme representation. The `to_string/1` function lets you do that. ```elixir IO.inspect(Unicode.Unihan.to_string(33836)) map = unihan("萬") IO.inspect(Unicode.Unihan.to_string(map)) ``` Given that you'd often have a list of maps returned from population-level queries (in the next section), `to_string/1` also accepts a list of maps. ## Population-level Information `Unihan` provides 2 functions, `filter/1` and `reject/1`, which lets you isolate subset of codepoints from the `@unihan` map. Both of these accepts a 1-arity function. The following example selects, from the full `@unihan`, the characters that Grade 1 & 2 students are expected to learn. Since we have parsed `kGradeLevel` into an integer for you, you can use the comparison operator `<=` directly: ```elixir filter(fn char -> char[:kGradeLevel] <= 2 end) |> Enum.count() ``` In practice, the 1-arity function can be written more conveniently using the capture `&` syntax, especially when they are chained together. In the following usage, we isolate the characters that Grade 1 and 2 students are expected to learn, *but* only if they have tone 1 in Cantonese: ```elixir filter( &(&1[:kGradeLevel] <= 2 and &1[:kCantonese][:tone] == "1") ) |> Enum.sort_by(fn {_codepoint, map} -> map[:kTotalStrokes][:Hant] end) |> Enum.map(fn {_codepoint, map} -> Unicode.Unihan.to_string(map) end) ``` Here we also see the usage of `to_string/1` acting on a list of maps to return their human-friendly string representation. `reject/1` works similarly: ```elixir reject(&(&1[:kTotalStrokes][:"zh-Hant"] < 60)) |> Enum.map(fn {_codepoint, map} -> Unicode.Unihan.to_string(map) end) ``` (That blob? It's a character containing 6 distinct characters: cloud, cloud, cloud, dragon, dragon, dragon. You probably guessed correctly that it means *dragon flying*.) ## Unihan field parsing Unihan fields were given as strings in the UniHan database, where each string encapsulates complex meaning. For example, for "萬", its `kHanyuPinyin` is given as `53247.080:wàn`. This can be parsed according to the specifications: > The 漢語拼音 Hànyǔ Pīnyīn reading(s) appearing in the edition of 《漢語大字典》 Hànyǔ Dà Zìdiǎn (HDZ) specified in the “kHanYu” property description (q.v.). Each location has the form “ABCDE.XYZ” (as in “kHanYu”); multiple locations for a given pīnyīn reading are separated by commas. The list of locations is followed by a colon, followed by a comma-separated list of one or more pīnyīn readings. Where multiple pīnyīn readings are associated with a given mapping, these are ordered as in HDZ (for the most part reflecting relative commonality). The multi-clause private function `Unicode.Unihan.Utils.decode_value/3` was used to parse the values. This gives the user access to the internals, as you have seen in the `filter` example above: the tone, which was simply encoded as part of the binary, is present as a `:tone` key. For more information about the fields and how they were parsed, see the HexDoc page on *Fields*. ## Support modules Codepoint 33863 is useful for *machines*, but "萬" is usually what the end-users want to know. We thus provided the `to_string/1` function for easy access. Several UniHan fields are similar in this regard: they use alphanumeric strings to encode data with specific semantic meaning, and we have similar convenient functions for you to access what you may want to use. These requires domain-knowledge, and CJK glyphs encapsulates long history of a broad geography. If you have requests, please consider reaching out. ### Radicals CJK characters are often categorized by their radical (部首). These radicals are codified in the KangXi Dictionary 康熙字典 published in 1715, and there are 214 radicals. In Unicode, these radicals were given a numerical index, and maps to *two* unicode codepoints whose glyphs are visually indistinguishable: 1. codepoint for the radical, and 2. codepoint for the radical as a CJK Unified Ideograph An additional complication is that some radicals are represented differently in the Simplified script. The following functions are provided for easy access to the various glyph representations. ```elixir # defaults to traditional, unified ideograph Unicode.Unihan.Radical.radical(187) ``` ```elixir # :script accepts :Hans and :Hant keys Unicode.Unihan.Radical.radical(187, script: :Hans) ``` ```elixir # to access the radical character, provide the following :glyph keyword Unicode.Unihan.Radical.radical( 187, script: :Hant, glyph: :radical_character ) ``` ```elixir # the special :all instruction gives the buddha hotdog Unicode.Unihan.Radical.radical(187, :all) ``` ### Cangjie Cangjie is a commonly used keyboard-input method. Each alphabet is mapped to a "construction part", and the glyphs can be accessed as follows: ```elixir Unicode.Unihan.Cangjie.cangjie("A") ``` A bang equivalent `cangjie!/1` exists, which additionally accepts a list as input: ```elixir Unicode.Unihan.Cangjie.cangjie!(["A", "B", "C", "D", "E", "F", "G"]) ``` This facilitates preparing end-user friendly representations: ```elixir "萬" |> unihan() |> Map.get(:kCangjie) |> Unicode.Unihan.Cangjie.cangjie!() ``` ### Jyutping Jyutping is a modern Cantonese romanization scheme. Each reading comprises four components: 1. onset 2. nucleus 3. coda 4. tone The nucleus and coda, concatenated together, is known as the *final*. The `Unicode.Unihan.Cantonese` module provides functions for working with jyutping. ```elixir # checking validity of a given input Unicode.Unihan.Cantonese.is_valid?("faan1") ``` ```elixir Unicode.Unihan.Cantonese.to_jyutping("faan1") ``` A bang equivalent `to_jyutping!/1` is available. ```elixir "faan1" |> Unicode.Unihan.Cantonese.to_jyutping!() |> Map.get(:final) ```