Converting GB2312 to UTF-8

The tutorial book 'Herong's Tutorial Notes on GB2312 Character Set' was cited in a Sun Java forum article in 2005.

Note that my Geocities site has since moved to herongyang.com.

Subject: Re: Java Forums - Converting GB2312 to UTF-8
Date: Aug 11, 2005
Source: http://forum.java.sun.com/thread.jspa?threadID=639403
Author: horinius

This problem of yours is very interesting.

I'm no conversion expert, but after some investigation, I think the 
problem comes from the word's GB2312 code itself (or Windows, or the 
font?). Let me explain things first.

I've added an instruction to print out the number of characters in 
your strLine:
...
System.out.print(strLine.length() + "\n");
bw.write( strLine);
...

Now, if I use your word, F25B, it gives 2 characters! If I use another
word, E0A2, it gives 1 character *as expected*. That's why I think 
the problem comes from your word.
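
The character counts above can be reproduced without the original 
file. The following is a minimal sketch, assuming the JDK in use ships 
the "GB2312" charset (most standard builds do). The class name 
DecodeLength and the helper decode are made up for illustration and 
are not taken from the original poster's program.

```java
import java.nio.charset.Charset;

public class DecodeLength {
    // Hypothetical helper: decode a two-byte sequence with the named
    // charset and return the resulting String.
    static String decode(int b1, int b2, String charsetName) {
        byte[] bytes = { (byte) b1, (byte) b2 };
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The two codes discussed in the post.
        int[][] codes = { { 0xF2, 0x5B }, { 0xE0, 0xA2 } };
        for (int[] code : codes) {
            String s = decode(code[0], code[1], "GB2312");
            System.out.printf("%02X%02X -> %d character(s)%n",
                    code[0], code[1], s.length());
        }
    }
}
```

On a typical JDK the decoder appears to replace the stray lead byte F2 
with U+FFFD and then read 5B as the ASCII character '[', which would 
account for the count of 2.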

Then I turned to Unicode's database and conversion:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9A15

As you can see, it seems that your word can't be mapped to GB2312 
(but it is mapped to Big5). So, there are two possibilities:
1. The Unicode Consortium forgot to add this mapping to its database.
2. This word doesn't exist in GB2312.

After some searches, I came across this website on GB2312 <-> Unicode
(which is very well done. Good job, Herong!):

http://www.geocities.com/herong_yang/gb2312/

When I looked up characters in the F2XX range:
http://www.geocities.com/herong_yang/gb2312/bihua_4.html

it is clear that your word, F25B, isn't defined!

Moreover, judging from the pattern, I think the GB2312 encoding 
scheme is as follows: a GB2312 code is an XX YY byte pair, and the 
possible values for YY are A1 to FE. (Am I correct?)

Now, your word's code is F2 5B, but 5B isn't within A1-FE range, so 
the code isn't a valid GB2312 code and so I think that word is simply
not encoded in GB2312.
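
The range argument above can be written as a small check. This is 
only a sketch: the class and method names are made up, and the lead 
byte range A1-F7 (GB2312 rows 1 to 87) is added here as an assumption 
that the post itself does not state. The check is purely structural. 
It says whether a byte pair has the shape of a GB2312 code, not 
whether a character is actually assigned at that position.

```java
public class Gb2312Check {
    static boolean inRange(int b, int lo, int hi) {
        return b >= lo && b <= hi;
    }

    // Structural test for a GB2312 (EUC-CN) two-byte code:
    // lead byte A1-F7 (assumption, see text), trail byte A1-FE.
    static boolean looksLikeGb2312(int lead, int trail) {
        return inRange(lead, 0xA1, 0xF7) && inRange(trail, 0xA1, 0xFE);
    }

    public static void main(String[] args) {
        System.out.println("F25B: " + looksLikeGb2312(0xF2, 0x5B));
        System.out.println("E0A2: " + looksLikeGb2312(0xE0, 0xA2));
    }
}
```

F25B fails because its second byte, 5B, lies below A1, while E0A2 
passes on both bytes.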

But the problem might be Windows' because Microsoft never does things 
according to standards and M$ might "invent" a GB2312 code for 
characters that GB2312 does not define.
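
One possible explanation, offered here as an assumption rather than 
something the thread confirms, is that the bytes are really GBK 
(Windows code page 936), a superset of GB2312 that also allows second 
bytes in the ranges 40-7E and 80-FE. Under that assumption, the 
conversion to UTF-8 could be sketched as below. The class name 
GbkToUtf8 and the helper toUtf8 are hypothetical, and the sketch 
assumes the JDK provides the "GBK" charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkToUtf8 {
    // Hypothetical helper: decode the bytes with the given source
    // charset, then re-encode the text as UTF-8.
    static byte[] toUtf8(byte[] input, String sourceCharset) {
        String text = new String(input, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] word = { (byte) 0xF2, (byte) 0x5B };
        // Assumption: the input is GBK rather than strict GB2312.
        String asGbk = new String(word, Charset.forName("GBK"));
        System.out.println("characters when read as GBK: " + asGbk.length());
        System.out.println("UTF-8 byte count: " + toUtf8(word, "GBK").length);
    }
}
```

If the word decodes to a single character under "GBK" but to two 
under "GB2312", that would suggest the source data uses the extended 
code page and should be read with that charset before writing UTF-8.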

This is very annoying, but I don't know what to do besides writing a 
complaint letter to Microsoft (which would most likely be ignored).
