Notes on Reference Citations - Version 2.71, by Dr. Herong Yang
Converting GB2312 to UTF-8
The tutorial book 'Herong's Tutorial Notes on GB2312 Character Set' was cited in a Sun Java forum article in 2005. Note that my Geocities site has since moved to herongyang.com.
Subject: Re: Java Forums - Converting GB2312 to UTF-8
Date: Aug 11, 2005
Source: http://forum.java.sun.com/thread.jspa?threadID=639403
Author: horinius

This problem of yours is very interesting. I'm no conversion expert, but after some investigations, I think the problem comes from the word's GB2312 code itself (or Windows or font?).

Let me explain things first. I've added an instruction to print out the number of characters in your strLine:

    ...
    System.out.print(strLine.length() + "\n");
    bw.write(strLine);
    ...

Now, if I use your word, F25B, it gives 2 characters! If I use another word, E0A2, it gives 1 character *as expected*. <==> That's why I think the problem comes from your word.

Then I turned to Unicode's database and conversion:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9A15

As you could see, it seems that your word can't be mapped to GB2312 (but it is mapped to Big5). So, there're 2 possibilities:

1. Unicode consortium forgot to add this word to its database
2. This word doesn't exist in GB2312.

After some searches, I came across this website on GB2312 <-> Unicode (which is very well done. Good job, Herong!):

http://www.geocities.com/herong_yang/gb2312/

When I looked up characters in the F2XX range:

http://www.geocities.com/herong_yang/gb2312/bihua_4.html

it is clear that your word, F25B, isn't defined! Moreover, as I could see from the pattern, I think GB2312 encoding scheme is as follows: a GB2312 code being a XX YY pair, possible values for YY are A1 to FE. (Am I correct?)

Now, your word's code is F2 5B, but 5B isn't within A1-FE range, so the code isn't a valid GB2312 code and so I think that word is simply not encoded in GB2312.

But the problem might be Window's because Microsoft never does things according to standards and M$ might "invent" a GB2312 code for non-GB2312-defined characters. This is very annoying but I don't know what to do beside writing complaint letter to Microsoft (but it's mostly ignored).
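
The range rule that the poster guesses at can be checked mechanically. Below is a minimal Java sketch, not part of the forum post, assuming the EUC-CN byte form of GB2312, in which the characters sit on a 94 x 94 grid and both bytes of a code fall between A1 and FE. The class name Gb2312RangeCheck and the method isInGb2312Range are hypothetical names chosen for this illustration.

```java
import java.util.Locale;

final class Gb2312RangeCheck {
    // Assumption: GB2312 in its EUC-CN byte form, a 94 x 94 grid
    // where both bytes of a character lie in 0xA1..0xFE.
    static final int LOW = 0xA1;
    static final int HIGH = 0xFE;

    // Returns true when the byte pair falls inside the GB2312 grid.
    // This is a structural check only; it does not tell whether a
    // character is actually assigned to that position.
    static boolean isInGb2312Range(int lead, int trail) {
        return lead >= LOW && lead <= HIGH
            && trail >= LOW && trail <= HIGH;
    }

    public static void main(String[] args) {
        // The two codes discussed in the forum post.
        int[][] codes = { {0xF2, 0x5B}, {0xE0, 0xA2} };
        for (int[] c : codes) {
            System.out.println(String.format(Locale.ROOT,
                "%02X%02X in range: %b",
                c[0], c[1], isInGb2312Range(c[0], c[1])));
        }
    }
}
```

Under this assumption, F2 5B is rejected because its second byte is below A1, while E0 A2 passes. Passing the check is necessary but not sufficient: some positions inside the grid have no character assigned, so a lookup table is still needed to confirm that a code is defined.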