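Some suggestions for the FCITX input method

sevk · posted 2007-5-30 12:29:36

I'm a Wubi user.

1. While typing in Pinyin, provide a Wubi reverse lookup.
2. Allow a hotkey to switch quickly to the Pinyin input method.
3. If possible, provide mixed Wubi and Pinyin input.
(Anyone who has used www.freewb.org knows it offers mixed input; it can also be set to "only" for Wubi-only input, where pressing the ` key types Pinyin and looks up the Wubi code.)
4. When ;set is typed, offer the related settings, such as turning GBK on or off (;gbk), since ';' can serve as a command key.
5. These are just my personal opinions, so please go easy on me.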

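Yuking · posted 2007-5-30 16:16:23

Thanks first of all~
1. Wubi mode already provides Pinyin-to-Wubi reverse lookup.
2. This feature exists too; the default key is z.
3. With fcitx's current features this no longer looks hard to implement, and I'm considering it.
4. fcitx's common functions all have hotkeys already; please check the documentation.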

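bruceasu · posted 2007-5-30 16:58:49

fcitx currently uses GB encoding as its internal code. Could it switch to Unicode instead? GB encoding caps the character count at around 20,000, so to support a large character set, could the internal code be changed to Unicode? The options are UTF-8, UTF-16, and UCS-4 (wchar_t). UTF-8 follows the direction Linux system encodings are taking, so input and output would not need repeated transcoding; UTF-16 is more space-efficient; UCS-4 makes counting characters and allocating space easy, and is also consistent with the underlying Linux system.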

ailantian · posted 2007-5-30 18:15:54

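UTF-8 probably has only 60,000-odd code points too. Also, Linux locales currently seem to come only in UTF-8 variants, with no UTF-16 or UTF-32 ones, so the largest supported character set is probably GB18030. If you want a large character set, the only choice for now seems to be GB18030, but it is not an international standard, which will cause problems down the road. Everything is moving toward Unicode these days.

On Debian, dpkg-reconfigure locales shows no wider UTF encodings, so I guess they aren't supported. The largest option is most likely GB18030.
Java's internal encoding seems to be UTF-16.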

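dgod · posted 2007-5-30 18:30:54

A correction to the post above: UTF-8's code space is practically unlimited.
Also, GB18030 is a mandatory national standard, and it is sufficient for Chinese text processing.

Even setting the standards question aside, using GB as the encoding has its advantages for handling Chinese characters: at the very least it uses much less memory, and there are other benefits as well.

My suggestion: GB18030 characters can currently be output and applications receive them, but fcitx's candidate bar displays them as a box. I hope that can be fixed, and then large character sets will pose no problem.
(I just tried again and the 0528 build works fine; I'm not sure whether that is because I installed wqy.)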

ailantian · posted 2007-5-30 18:38:47

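Heh, sorry, I was reading up on it just now and came back to find someone had replied.
My impression was that UTF-8 can only represent 60,000-odd characters, but an article I just read says it can represent 2^31.
I remember reading the other figure somewhere before; I'll keep looking.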

bruceasu · posted 2007-5-30 18:52:55

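As things stand, UTF-8 (originally up to 6 bytes; to align with UTF-16 it now uses at most 4 bytes) and UTF-16 support a code space of 2^20 (1048576), and UCS-4 supports 2^31 (over 2.1 billion, of which 2^20 is currently used). The national standard GB18030 has a code space of over 1.5 million. In terms of the Chinese characters actually included, though, the Unicode family contains CJK (about 20,000) [Unicode 2.0], CJK-EXTA (six or seven thousand) [Unicode 3.1], CJK-EXTB (about 40,000) [Unicode 4.0], and CJK-EXTC (I haven't checked, but probably tens of thousands; quite a few input methods already use several thousand of them) [Unicode 5.0]. GB18030 promised seven years ago to keep its Chinese character repertoire in sync with Unicode and set aside 2^20 code positions, but nothing has happened so far; it has about 27,000 Chinese characters, equivalent to the Unicode 3.1 level (CJK and CJK-EXTA).

I wanted to add Cantonese input to fcitx, but because the internal code is GBK (about 20,000 characters), more than 4,000 commonly used Cantonese characters that fall in CJK-EXTA and CJK-EXTB cannot be handled. In theory, even with GB18030 there would still be over 3,000 characters that cannot be handled. In practice GB18030 is pointless: its CJK-EXTA part already takes 4 bytes per character, which is worse than UTF-16, since UTF-16 only goes to 4 bytes at CJK-EXTB. At that point it makes more sense to just switch to a Unicode implementation.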
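The per-character byte counts mentioned in this post can be checked with a quick sketch. It assumes Python 3 and its built-in `gb18030`, `utf-8` and `utf-16-le` codecs; the helper name `byte_lengths` and the three sample characters (U+4E00 from the basic CJK block, U+3400 from Extension A, U+20000 from Extension B) are picked purely for illustration and are not from fcitx.

```python
# Compare how many bytes a single character needs in each encoding.
# Sample characters are illustrative picks, one per CJK block.
SAMPLES = {
    "CJK (basic)": "\u4e00",
    "CJK Ext-A": "\u3400",
    "CJK Ext-B": "\U00020000",
}
ENCODINGS = ("gb18030", "utf-8", "utf-16-le")


def byte_lengths(ch):
    """Return a dict mapping each encoding name to the encoded length of ch."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        print(name, byte_lengths(ch))
```

For these samples, the Extension A character takes 4 bytes in GB18030 but only 2 in UTF-16, while the Extension B character takes 4 bytes in all three encodings.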

ailantian · posted 2007-5-30 19:25:35

http://www.nits.gov.cn/sc2/jishufile1-3.asp
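The Unicode and ISO standards are still different, although the two are now converging, and the national preference is to use ISO 10646.
As for UTF-8, I still haven't found where I read the 60,000-odd figure, but I clearly remember seeing it before, and I will track it down. Actually I don't think it can really represent 2^31 characters, because UTF-8 reserves bits at the front of each byte when encoding.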
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.unicode.org/unicode/faq/utf_bom.html

bruceasu · posted 2007-5-30 19:49:37

The UTF-8 encoding defined in ISO 10646-1:2000 Annex D and also described in RFC 3629 as well as section 3.9 of the Unicode 4.0 standard does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.

UTF-8 has the following properties:

* UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
* All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
* The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
* All possible 2^31 UCS codes can be encoded.
* UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.
* The sorting order of Bigendian UCS-4 byte strings is preserved.
* The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:

U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx <---- max (2^31)

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
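The table above can be turned into a small encoder. This is only a sketch of the original six-row scheme as quoted here, assuming Python 3; the names `ROWS` and `encode_utf8_legacy` are made up for illustration. It does not apply the later restrictions of RFC 3629, which stops at U+0010FFFF and excludes the surrogate range.

```python
# Each row: (largest code point in the row, sequence length, lead-byte prefix).
ROWS = [
    (0x7F, 1, 0x00),
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def encode_utf8_legacy(code_point):
    """Encode one code point with the original (up to 6-byte) UTF-8 table."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS range")
    # Pick the first row that can hold the value: the shortest sequence.
    for upper, length, prefix in ROWS:
        if code_point <= upper:
            break
    if length == 1:
        return bytes([code_point])
    out = []
    # Fill the continuation bytes from the right, 6 payload bits at a time.
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # Whatever bits remain go into the lead byte after its length prefix.
    out.append(prefix | code_point)
    return bytes(reversed(out))


if __name__ == "__main__":
    print(encode_utf8_legacy(0x4E2D).hex())      # a BMP Chinese character
    print(encode_utf8_legacy(0x7FFFFFFF).hex())  # the largest 31-bit value
```

The six-byte row carries 1 payload bit in the lead byte plus 5 × 6 in the continuation bytes, 31 bits in total, which is where the 2^31 figure comes from despite the marker bits at the front of every byte.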

“Unicode” originally implied that the encoding was UCS-2 and it initially didn’t make any provisions for characters outside the BMP (U+0000 to U+FFFF). When it became clear that more than 64k characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bit surrogate characters. This way UTF-16 was born, which represents the extended “21-bit” Unicode in a way backwards compatible with UCS-2. The term UTF-32 was introduced in Unicode to describe a 4-byte encoding of the extended “21-bit” Unicode. UTF-32 is the exact same thing as UCS-4, except that by definition UTF-32 is never used to represent characters above U-0010FFFF, while UCS-4 can cover all 2^31 code positions up to U-7FFFFFFF. The ISO 10646 working group has agreed to modify their standard to exclude code positions beyond U-0010FFFF, in order to turn the new UCS-4 and UTF-32 into practically the same thing.
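The surrogate mechanism described in this paragraph comes down to a little arithmetic. A minimal sketch, assuming Python 3; the function name `to_surrogate_pair` is hypothetical and chosen only for this illustration:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point (U+10000..U+10FFFF) into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points need a surrogate pair")
    offset = code_point - 0x10000       # a 20-bit value: 1024 x 1024 choices
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


if __name__ == "__main__":
    # U+20000 is the first code point of CJK Extension B.
    high, low = to_surrogate_pair(0x20000)
    print(f"{high:04X} {low:04X}")
```

Each surrogate half carries 10 bits, so the 1024 high and 1024 low surrogates together address the 1024×1024 non-BMP code points mentioned in the text.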

In addition to all that, UTF-8 was introduced to provide an ASCII backwards compatible multi-byte encoding. The definitions of UTF-8 in UCS and Unicode differed originally slightly, because in UCS, up to 6-byte long UTF-8 sequences were possible to represent characters up to U-7FFFFFFF, while in Unicode only up to 4-byte long UTF-8 sequences are defined to represent characters up to U-0010FFFF. (The difference was in essence the same as between UCS-4 and UTF-32.)

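dgod · posted 2007-5-30 21:41:02

GBK: 20,951 characters
CJK EXT-A: 6,582 characters
Input methods generally handle only the common characters, including GB2312: seven or eight thousand.

On Linux probably nobody uses UTF-16; if you use Unicode you use UTF-8, and any Chinese character takes at least 3 bytes in UTF-8.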
