

Sorting Japanese Words in Indexes (1)


The Japanese writing system uses many kinds of characters: hiragana, katakana, kanji, punctuation marks, mathematical signs, numerals, ruled lines for tables, general signs, and the Latin, Greek, and Cyrillic alphabets. This is well known among Japanese speakers because the widely used JIS X 0208 (JIS stands for Japanese Industrial Standards) defines its character set this way. For sorting Japanese words in a back-of-the-book index, however, what mainly matters is sorting hiragana, katakana, and kanji.

For back-of-the-book indexes, Japanese words are usually sorted by pronunciation. Japanese pronunciation can be written in hiragana or katakana, both of which represent Japanese syllables. The two sets are almost equivalent as representations of syllables: there is a one-to-one mapping between hiragana and katakana, except for a few characters, such as 'ヴ' (vu), that JIS X 0208 defines only in katakana. In this series, we normalize syllables to hiragana.
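The normalization to hiragana can be sketched in Python as follows. This is only a minimal illustration, not part of any standard: the function name `to_hiragana` is hypothetical, and it relies on the Unicode layout, in which each katakana from 'ァ' (U+30A1) to 'ヶ' (U+30F6) sits exactly 0x60 above its hiragana counterpart. Unlike JIS X 0208, Unicode also defines a hiragana 'ゔ' (U+3094), so this sketch maps 'ヴ' to it; whether that is desirable depends on the fonts and encodings you target.

```python
# Assumption: Unicode code points, where katakana 'ァ'..'ヶ' are offset
# from hiragana 'ぁ'..'ゖ' by a constant 0x60.
KATAKANA_START = 0x30A1  # 'ァ'
KATAKANA_END = 0x30F6    # 'ヶ'
KANA_OFFSET = 0x60


def to_hiragana(text: str) -> str:
    """Return text with katakana replaced by the corresponding hiragana.

    Characters outside the 'ァ'..'ヶ' range (kanji, Latin letters, the
    prolonged sound mark 'ー', and so on) are left unchanged.
    """
    converted = []
    for ch in text:
        code_point = ord(ch)
        if KATAKANA_START <= code_point <= KATAKANA_END:
            converted.append(chr(code_point - KANA_OFFSET))
        else:
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    print(to_hiragana("カタカナ"))
    print(to_hiragana("インデックス"))
```

Note that the prolonged sound mark 'ー' is passed through as-is here; how to collate it is a separate question that this sketch does not address.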

Knowing how to pronounce kanji is a hard problem for people who do not speak Japanese. It requires a good knowledge of Japanese grammar and memorizing a great many pairings between kanji (or combinations of kanji) and their pronunciations. 'Kakasi' (Kanji Kana Simple Inverter) is one tool that assists with this matching, but it fails in some cases. Ultimately, proofreaders should check the result before publication. In this series, we assume that the users (i.e., authors, editors, or indexers) supply the correct pronunciation for every kanji.

Even with this assumption, there is still a lot to explain, so we will follow the rules in JIS X 4061, titled "Collation of Japanese Character String". A rough translation of JIS X 4061, limited to its basic rules for collating by pronunciation and by written form, should be enough to implement the first version of an indexing program.
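To make the idea of collating by pronunciation first and by written form second more concrete, here is a rough Python sketch. It is not an implementation of JIS X 4061: the names `IndexEntry` and `sort_key` are hypothetical, readings are compared by plain Unicode code point after normalizing katakana to hiragana (which only approximates the syllabary order and ignores the standard's finer rules), and the tie-break on the written form by code point is merely a placeholder assumption.

```python
from typing import NamedTuple

# Assumption: Unicode code points; katakana 'ァ'..'ヶ' map to hiragana
# by subtracting 0x60.
_KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


class IndexEntry(NamedTuple):
    surface: str  # the word as it appears in the index
    reading: str  # pronunciation supplied by the author, editor, or indexer


def sort_key(entry: IndexEntry) -> tuple:
    """Primary key: reading normalized to hiragana. Secondary key: surface."""
    normalized_reading = entry.reading.translate(_KATAKANA_TO_HIRAGANA)
    return (normalized_reading, entry.surface)


entries = [
    IndexEntry("索引", "さくいん"),
    IndexEntry("漢字", "かんじ"),
    IndexEntry("カタカナ", "カタカナ"),
    IndexEntry("科学", "かがく"),
    IndexEntry("仮名", "かな"),
    IndexEntry("化学", "かがく"),
]

sorted_entries = sorted(entries, key=sort_key)

if __name__ == "__main__":
    for entry in sorted_entries:
        print(entry.reading, entry.surface)
```

Keeping the reading and the written form as separate fields reflects the assumption that readings are supplied by a person rather than derived automatically, and it lets entries with the same reading (such as 化学 and 科学 above) still be ordered deterministically.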

Because JIS X 4061 is related to JIS X 0208, it also includes rules for sorting many kinds of characters and letters other than hiragana, katakana, and kanji. However, I think each language's script should be ordered by that language's own rules. For example, in traditional Spanish ordering, 'ch' is separated from the 'c' entries, and 'll' from the 'l' entries. I will therefore skip the details of those rules.
