"unicodedata" Module for Unicode Properties

This section provides a tutorial example on how to use the "unicodedata" module to retrieve properties of code points defined by the Unicode standard.

Python also offers a standard library module called "unicodedata" that provides a number of functions to access various properties of a given code point defined by the Unicode standard. Some commonly used "unicodedata" attributes and functions are given below:

unicodedata.unidata_version - A string attribute that identifies the version number of the Unicode standard supported by the "unicodedata" module.

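unicodedata.lookup(name) - Returns the character as a "str" for a given Unicode character name. Raises KeyError if the name is not found.

unicodedata.name(char[, default]) - Returns the character name associated with a given Unicode code point. If no name is defined, returns the default value, or raises ValueError if no default is given.

unicodedata.category(char) - Returns the general category code associated with a given Unicode code point.

unicodedata.combining(char) - Returns the canonical combining class associated with a given Unicode code point, or 0 if no combining class is defined.

unicodedata.decomposition(char) - Returns the decomposition mapping string associated with a given Unicode code point, or an empty string if no mapping is defined.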

unicodedata.normalize(form, unistr) - Converts a given string to the normalization form named by the form code: NFC (Normalization Form C, canonical composition), NFKC (Normalization Form KC, compatibility composition), NFD (Normalization Form D, canonical decomposition), or NFKD (Normalization Form KD, compatibility decomposition).

unicodedata.is_normalized(form, unistr) - Returns True if the given string is already normalized according to a given form code, NFC, NFKC, NFD, or NFKD. This function is available in Python 3.8 and later.

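unicodedata.decimal(char[, default]) - Returns the decimal value, as an integer, associated with a given Unicode code point. If no such value is defined, returns the default value, or raises ValueError if no default is given.

unicodedata.digit(char[, default]) - Returns the digit value, as an integer, associated with a given Unicode code point. If no such value is defined, returns the default value, or raises ValueError if no default is given.

unicodedata.numeric(char[, default]) - Returns the numeric value, as a float, associated with a given Unicode code point. If no such value is defined, returns the default value, or raises ValueError if no default is given.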
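The functions listed above behave differently when a property is missing: lookup() raises KeyError for an unknown name, while name(), decimal(), digit() and numeric() raise ValueError unless a default value is passed as the second argument. The following is a minimal sketch of how those cases could be handled. The helper names describe() and safe_lookup() are hypothetical and are not part of the "unicodedata" module.

```python
import unicodedata


def describe(char):
    """Collect a few properties of one character into a dict (hypothetical helper)."""
    return {
        # A default as the second argument avoids ValueError.
        "name": unicodedata.name(char, "<unnamed>"),
        "category": unicodedata.category(char),
        "decimal": unicodedata.decimal(char, None),
        "digit": unicodedata.digit(char, None),
        "numeric": unicodedata.numeric(char, None),
    }


def safe_lookup(name):
    """Return the character for a name, or None if the name is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


# DIGIT FIVE, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF, and a control character
for ch in ["5", "\u00b2", "\u00bd", "\n"]:
    print(repr(ch), describe(ch))

print(safe_lookup("digit five"))
print(safe_lookup("NO SUCH NAME"))
```

The sample characters are chosen to show that the three numeric properties are not interchangeable: DIGIT FIVE has all three values, SUPERSCRIPT TWO has a digit value but no decimal value, and VULGAR FRACTION ONE HALF has only a numeric value.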

Here is a Python script that shows you how to use the "unicodedata" module.

# unicodedata-Module-Test.py
# Copyright 2019 (c) HerongYang.com. All Rights Reserved.
#
import unicodedata

print("Unicode version: {0}".format(unicodedata.unidata_version))

char = unicodedata.lookup("Parenthesized Number Ten")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print("  category(): {0}".format(unicodedata.category(char)))
print("  combining(): {0}".format(unicodedata.combining(char)))
print("  decomposition(): {0}".format(unicodedata.decomposition(char)))
print("  decimal(): {0}".format(unicodedata.decimal(char, "N/A"))) 
print("  digit(): {0}".format(unicodedata.digit(char, "N/A"))) 
print("  numeric(): {0}".format(unicodedata.numeric(char, "N/A"))) 

char = unicodedata.lookup("Combining Cedilla")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print("  category(): {0}".format(unicodedata.category(char)))
print("  combining(): {0}".format(unicodedata.combining(char)))
print("  decomposition(): {0}".format(unicodedata.decomposition(char)))
print("  decimal(): {0}".format(unicodedata.decimal(char, "N/A"))) 
print("  digit(): {0}".format(unicodedata.digit(char, "N/A"))) 
print("  numeric(): {0}".format(unicodedata.numeric(char, "N/A"))) 

def normalized_info(form, text):
  # Normalize the text, then report the result, its UTF-8 bytes and its length
  norm = unicodedata.normalize(form, text)
  info = "normalize({0}, {1}): {2}, {3}, {4}".format(form, text,
    norm, norm.encode(), len(norm))
  return info

char = unicodedata.lookup("Parenthesized Number Ten")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print("  {0}".format(normalized_info('NFC', char)))
print("  {0}".format(normalized_info('NFKC', char)))
print("  {0}".format(normalized_info('NFD', char)))
print("  {0}".format(normalized_info('NFKD', char)))

char = unicodedata.lookup("LATIN SMALL LETTER C WITH CEDILLA")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print("  {0}".format(normalized_info('NFC', char)))
print("  {0}".format(normalized_info('NFKC', char)))
print("  {0}".format(normalized_info('NFD', char)))
print("  {0}".format(normalized_info('NFKD', char)))

If you run the above script, it will print the following output. The Unicode version shown depends on the Python release you are using:

herong$ python3 unicodedata-Module-Test.py 

Unicode version: 12.1.0

⑽ - PARENTHESIZED NUMBER TEN
  category(): No
  combining(): 0
  decomposition(): <compat> 0028 0031 0030 0029
  decimal(): N/A
  digit(): N/A
  numeric(): 10.0

 ̧ - COMBINING CEDILLA
  category(): Mn
  combining(): 202
  decomposition(): 
  decimal(): N/A
  digit(): N/A
  numeric(): N/A

⑽ - PARENTHESIZED NUMBER TEN
  normalize(NFC, ⑽): ⑽, b'\xe2\x91\xbd', 1
  normalize(NFKC, ⑽): (10), b'(10)', 4
  normalize(NFD, ⑽): ⑽, b'\xe2\x91\xbd', 1
  normalize(NFKD, ⑽): (10), b'(10)', 4

ç - LATIN SMALL LETTER C WITH CEDILLA
  normalize(NFC, ç): ç, b'\xc3\xa7', 1
  normalize(NFKC, ç): ç, b'\xc3\xa7', 1
  normalize(NFD, ç): ç, b'c\xcc\xa7', 2
  normalize(NFKD, ç): ç, b'c\xcc\xa7', 2
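The output shows that the same visible text, "ç", can be stored either as 1 code point or as 2 code points, so a plain "==" comparison can fail on strings that look identical. The following is a minimal sketch of two common uses of normalize(): comparing strings after normalizing both to the same form, and removing combining marks by decomposing first. The helper names same_text() and strip_marks() are hypothetical, and is_normalized() is assumed to be available (Python 3.8 or later).

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def strip_marks(text):
    """Decompose with NFD, then keep only characters whose combining class is 0."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


composed = "\u00e7"      # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA

print(composed == decomposed)                         # raw comparison
print(same_text(composed, decomposed))                # comparison after NFC
print(unicodedata.is_normalized("NFC", decomposed))   # requires Python 3.8+
print(strip_marks("fa\u00e7ade"))
```

Note that strip_marks() only removes characters with a non-zero combining class. Letters that have no decomposition mapping are left unchanged.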
