Commons:Chinese characters decomposition
The purpose of this work is to provide a purely graphical decomposition of Chinese characters. The decomposition is not meant to be etymological. If a character is said to be composed of two simpler characters, it can theoretically be drawn by superposing those two simpler characters. For instance, a computer program that generates stroke orders or glyphs for Chinese characters would work properly for 雖 using only the graphical decomposition into 虽 and 隹.
The graphical decomposition of a character reflects the etymological (or historical) composition most of the time, but not always. For example, 雖 means "as big as a lizard", and character dictionaries tell us that its etymological decomposition is 虫 and 唯, where 虫 is the radical and 唯 gives the pronunciation. However, the graphical decomposition recorded in this database is 虽 and 隹. To derive 雖 from its etymology, a computer program would need to know how to move 口 to the correct position. So, useful as the etymological decomposition may be, please remember that this database is intended only for recording graphical decompositions. As this example shows, it is not always possible to record the etymology as well (at least, not without either extending the database format to support etymologies or creating a new, separate database for etymologies only).
Variants are graphically different, so the decomposition should reflect that; there is not necessarily a graphical derivation between variants.
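The 雖 example above can be sketched as a small recursive lookup. This is only an illustration: the in-memory table, the function name `leaves`, and the three-field tuple layout are hypothetical and do not describe the actual file format. Only the split of 雖 into 虽 and 隹 comes from the text; the kind codes and the entry for 虽 are assumptions made for the example.

```python
# Hypothetical in-memory table: character -> (kind, first part, second part).
# Only the split of 雖 into 虽 and 隹 is taken from the text above; the kind
# codes and the entry for 虽 are illustrative assumptions.
DECOMPOSITIONS = {
    "雖": ("吅", "虽", "隹"),  # assumed horizontal composition
    "虽": ("吕", "口", "虫"),  # assumed vertical composition
}


def leaves(char, table=DECOMPOSITIONS):
    """Return the components of `char` that the table does not split further."""
    entry = table.get(char)
    if entry is None:
        # Not in the table: treat the character as a graphical primitive.
        return [char]
    _kind, first, second = entry
    return leaves(first, table) + leaves(second, table)


print(leaves("雖"))  # ['口', '虫', '隹']
```

Note that the lookup never needs the etymological parts 虫 and 唯: the graphical parts alone are enough to reach the components to draw.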
Caution
- Don't break the tab separators - Please leave this file in machine-readable form. A TSV version is available here.
- If you intend to work on this file for some time (a couple of days?) by downloading and uploading it, please leave a note in order to avoid edit conflicts. Thank you.
Sources and References
: The most complete online etymology source
- Tool for retrieving CCD statistics and exploiting the decomposition: Download
- The ISO10646 decomposition legend: User:Artsakenos/CCD-ISO10646
- The table of the new Unicode character set: User:Artsakenos/CCD-Table2
File format
Composition kind
- 一 = Graphical primitive, no composition (second character is always a deformed version of another character)
- 吅 = Horizontal composition (in case of repetition, the second character is deformed)
- 吕 = Vertical composition (in case of repetition, the second character is deformed)
- 回 = Inclusion of the second character inside the first (e.g. 门, 囗, 匚)
- 咒 = Vertical composition, the top part being a repetition.
- 弼 = Horizontal composition of three, the third being the repetition of the first.
- 品 = Repetition of three.
- 叕 = Repetition of four.
- 冖 = Vertical composition, separated by "冖".
- + = Graphical superposition or addition.
Note: There is a standard for describing decomposition rules (reported in User:Artsakenos/CCD-ISO10646), but it is not used here for several reasons, e.g.: (i) there is no "three-character composition" like 罒 or 目; the composition is (nearly) always a 2+1 one; (ii) the "surround" kind is determined by the surrounding character, so there is no need to state it with a separate code; (iii) the 冖 composition is not identified (actually it is the only one of a true 目 kind).
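As a rough illustration of the composition kinds listed above, the sketch below expands a kind code and its parts into the sequence of parts to be drawn. The function name `expand` and its argument layout are hypothetical, and the handling of each code is one reading of the list (in particular, a primitive is assumed to yield no sub-parts); it is not the behaviour of any existing tool.

```python
# Kinds that simply combine the first and second part, per the list above.
TWO_PART_KINDS = {"吅", "吕", "回", "+"}


def expand(kind, first="", second=""):
    """Return the parts to draw for one table entry (illustrative reading)."""
    if kind == "一":
        return []                      # primitive: assumed to have no sub-parts
    if kind in TWO_PART_KINDS:
        return [first, second]         # horizontal, vertical, inclusion, superposition
    if kind == "咒":
        return [first, first, second]  # repeated top part, then the bottom part
    if kind == "弼":
        return [first, second, first]  # the third part repeats the first
    if kind == "品":
        return [first] * 3             # repetition of three
    if kind == "叕":
        return [first] * 4             # repetition of four
    if kind == "冖":
        return [first, "冖", second]   # vertical, separated by 冖
    raise ValueError(f"unknown composition kind: {kind!r}")


# The kind codes are themselves examples of their own kind.
print(expand("咒", "口", "几"))  # ['口', '口', '几']
print(expand("弼", "弓", "百"))  # ['弓', '百', '弓']
print(expand("品", "口"))        # ['口', '口', '口']
```

Keeping the repetition inside the kind code is what lets every entry be stored with at most two parts, in line with the "2+1" remark in the note.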
Statistics
The table contains 21170 decompositions: all characters from 一 (U+4E00) to 龥 (U+9FA5), and 263 additional characters that appear as components of the former.
Composition kind counts: 吅=14966 (70.7%), 吕=4848 (22.9%), 回=474 (2.23%), 一=360, +=170, 冖=147, 咒=80, 弼=61, 品=50, *=10, 叕=4
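The figures above can be cross-checked with a few lines of Python. The counts are copied from the list; the names `KIND_COUNTS`, `TOTAL` and `share` are only illustrative.

```python
# Per-kind counts as given in the statistics above.
KIND_COUNTS = {
    "吅": 14966, "吕": 4848, "回": 474, "一": 360, "+": 170, "冖": 147,
    "咒": 80, "弼": 61, "品": 50, "*": 10, "叕": 4,
}

TOTAL = sum(KIND_COUNTS.values())


def share(kind):
    """Percentage of all decompositions that use the given kind."""
    return 100 * KIND_COUNTS[kind] / TOTAL


print(TOTAL)  # 21170
for kind in ("吅", "吕", "回"):
    print(kind, f"{share(kind):.1f}%")
```

The per-kind counts add up to the 21170 decompositions stated for the table, and horizontal and vertical compositions together account for more than nine entries out of ten.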