Unicode Tutorials - Herong's Tutorial Examples - v5.32, by Herong Yang
"unicodedata" Module for Unicode Properties
This section provides a tutorial example on how to use the "unicodedata" module to retrieve properties of code points defined by the Unicode standard.
Python also offers a standard library module called "unicodedata" that provides a number of functions to access various properties of a given code point defined by the Unicode standard. Some commonly used "unicodedata" members are given below:
unicodedata.unidata_version - Identifies the version number of the Unicode standard supported by the "unicodedata" module.
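unicodedata.lookup(name) - Returns the character, as a one-character "str", for a given Unicode character name. Raises KeyError if the name is not found.
unicodedata.name(char) - Returns the character name associated with a given Unicode code point. Raises ValueError if the code point has no name, unless a default value is given as the second argument.
unicodedata.category(char) - Returns the general category code, such as "No" or "Mn", associated with a given Unicode code point.
unicodedata.combining(char) - Returns the canonical combining class, as an integer, associated with a given Unicode code point. Returns 0 if no combining class is defined.
unicodedata.decomposition(char) - Returns the decomposition mapping string associated with a given Unicode code point. Returns an empty string if no mapping is defined.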
unicodedata.normalize(form, str) - Converts a given string to the normalization form named by the given form code: NFC (canonical decomposition followed by canonical composition), NFKC (compatibility decomposition followed by canonical composition), NFD (canonical decomposition), or NFKD (compatibility decomposition).
unicodedata.is_normalized(form, str) - Returns True if the given string is already normalized according to a given form code, NFC, NFKC, NFD, or NFKD. This function is available in Python 3.8 and later.
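unicodedata.decimal(char) - Returns the decimal value, as an integer, associated with a given Unicode code point.
unicodedata.digit(char) - Returns the digit value, as an integer, associated with a given Unicode code point.
unicodedata.numeric(char) - Returns the numeric value, as a float, associated with a given Unicode code point.
The decimal(), digit() and numeric() functions raise ValueError if the requested value is not defined for the code point, unless a default value is given as the second argument.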
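The difference between decimal(), digit() and numeric() is easier to see on a few sample characters. The following is only a minimal sketch: the helper names numeric_properties() and safe_lookup() are hypothetical and not part of the "unicodedata" module. It assumes that passing None as the default value is an acceptable way to mark an undefined property.

```python
import unicodedata


def numeric_properties(char):
    """Return (decimal, digit, numeric) for char, with None where undefined."""
    # The second argument is the default returned instead of raising ValueError.
    return (
        unicodedata.decimal(char, None),
        unicodedata.digit(char, None),
        unicodedata.numeric(char, None),
    )


def safe_lookup(name):
    """Return the character for a Unicode name, or None if the name is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() has no default argument, so the exception must be caught.
        return None


# DIGIT FIVE, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF, LATIN CAPITAL LETTER A
for ch in ("5", "\u00b2", "\u00bd", "A"):
    print(repr(ch), unicodedata.name(ch, "<unnamed>"), numeric_properties(ch))
```

In this sketch, "5" has all three values, SUPERSCRIPT TWO has a digit value and a numeric value but no decimal value, VULGAR FRACTION ONE HALF has only a numeric value, and "A" has none of them.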
Here is a Python script that shows you how to use the "unicodedata" module.
# unicodedata-Module-Test.py
# Copyright 2019 (c) HerongYang.com. All Rights Reserved.
#
import unicodedata

print("Unicode version: {0}".format(unicodedata.unidata_version))

char = unicodedata.lookup("Parenthesized Number Ten")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print(" category(): {0}".format(unicodedata.category(char)))
print(" combining(): {0}".format(unicodedata.combining(char)))
print(" decomposition(): {0}".format(unicodedata.decomposition(char)))
print(" decimal(): {0}".format(unicodedata.decimal(char, "N/A")))
print(" digit(): {0}".format(unicodedata.digit(char, "N/A")))
print(" numeric(): {0}".format(unicodedata.numeric(char, "N/A")))

char = unicodedata.lookup("Combining Cedilla")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print(" category(): {0}".format(unicodedata.category(char)))
print(" combining(): {0}".format(unicodedata.combining(char)))
print(" decomposition(): {0}".format(unicodedata.decomposition(char)))
print(" decimal(): {0}".format(unicodedata.decimal(char, "N/A")))
print(" digit(): {0}".format(unicodedata.digit(char, "N/A")))
print(" numeric(): {0}".format(unicodedata.numeric(char, "N/A")))

# Returns the normalized string, its UTF-8 bytes and its length in code points
def normalized_info(form, text):
    norm = unicodedata.normalize(form, text)
    info = "normalize({0}, {1}): {2}, {3}, {4}".format(
        form, text, norm, norm.encode(), len(norm))
    return info

char = unicodedata.lookup("Parenthesized Number Ten")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print(" {0}".format(normalized_info('NFC', char)))
print(" {0}".format(normalized_info('NFKC', char)))
print(" {0}".format(normalized_info('NFD', char)))
print(" {0}".format(normalized_info('NFKD', char)))

char = unicodedata.lookup("LATIN SMALL LETTER C WITH CEDILLA")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print(" {0}".format(normalized_info('NFC', char)))
print(" {0}".format(normalized_info('NFKC', char)))
print(" {0}".format(normalized_info('NFD', char)))
print(" {0}".format(normalized_info('NFKD', char)))
Running the above script prints the following output:
herong$ python3 unicodedata-Module-Test.py
Unicode version: 12.1.0
⑽ - PARENTHESIZED NUMBER TEN
 category(): No
 combining(): 0
 decomposition(): <compat> 0028 0031 0030 0029
 decimal(): N/A
 digit(): N/A
 numeric(): 10.0
 ̧ - COMBINING CEDILLA
 category(): Mn
 combining(): 202
 decomposition():
 decimal(): N/A
 digit(): N/A
 numeric(): N/A
⑽ - PARENTHESIZED NUMBER TEN
 normalize(NFC, ⑽): ⑽, b'\xe2\x91\xbd', 1
 normalize(NFKC, ⑽): (10), b'(10)', 4
 normalize(NFD, ⑽): ⑽, b'\xe2\x91\xbd', 1
 normalize(NFKD, ⑽): (10), b'(10)', 4
ç - LATIN SMALL LETTER C WITH CEDILLA
 normalize(NFC, ç): ç, b'\xc3\xa7', 1
 normalize(NFKC, ç): ç, b'\xc3\xa7', 1
 normalize(NFD, ç): ç, b'c\xcc\xa7', 2
 normalize(NFKD, ç): ç, b'c\xcc\xa7', 2
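The output shows that "ç" can be stored either as 1 code point (NFC) or as 2 code points (NFD), so two strings that look identical may compare as unequal. One possible way to handle this, sketched below, is to normalize both strings to the same form before comparing them. The helper names same_text() and strip_marks() are hypothetical, and strip_marks() is only an illustration of the combining() function, not a complete accent-removal routine. The is_normalized() calls assume Python 3.8 or later.

```python
import unicodedata

composed = "\u00e7"      # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def strip_marks(text):
    """Decompose text (NFD) and drop code points with a nonzero combining class."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.combining(c) == 0)


print(composed == decomposed)                        # False
print(same_text(composed, decomposed))               # True
print(unicodedata.is_normalized("NFC", composed))    # True
print(unicodedata.is_normalized("NFC", decomposed))  # False
print(strip_marks("fa\u00e7ade"))                    # facade
```

Which form to normalize to depends on the application: NFC and NFD keep compatibility characters such as "⑽" intact, while NFKC and NFKD replace them with their compatibility equivalents, as the "(10)" result in the output shows.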