"unicodedata" Module for Unicode Properties

Unicode Tutorials - Herong's Tutorial Examples

This section provides a tutorial example on how to use the "unicodedata" module to retrieve properties of code points defined by the Unicode standard.

Python also offers a standard library module called "unicodedata" that provides a number of functions to access various properties of a given code point defined by the Unicode standard. Some commonly used "unicodedata" members are given below:

unicodedata.unidata_version - Identifies the version number of the Unicode standard supported by the "unicodedata" module.

unicodedata.lookup(name) - Returns the character, as a "str", for a given Unicode character name. Raises KeyError if no character has that name.

unicodedata.name(char) - Returns the character name associated with a given Unicode code point.

unicodedata.category(char) - Returns the general category code associated with a given Unicode code point.

unicodedata.combining(char) - Returns the canonical combining class associated with a given Unicode code point, or 0 if none is defined.

unicodedata.decomposition(char) - Returns the decomposition mapping string associated with a given Unicode code point, or an empty string if none is defined.

unicodedata.normalize(form, str) - Converts a given string to the normal form named by a given form code: NFC (Normalization Form Canonical Composition), NFKC (Normalization Form Compatibility Composition), NFD (Normalization Form Canonical Decomposition), or NFKD (Normalization Form Compatibility Decomposition).

unicodedata.is_normalized(form, str) - Returns True if the given string is already normalized according to a given form code: NFC, NFKC, NFD, or NFKD. Available in Python 3.8 and later.

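unicodedata.decimal(char[, default]) - Returns the decimal value associated with a given Unicode code point. If no such value is defined, returns the default if one is given, or raises ValueError.

unicodedata.digit(char[, default]) - Returns the digit value associated with a given Unicode code point. If no such value is defined, returns the default if one is given, or raises ValueError.

unicodedata.numeric(char[, default]) - Returns the numeric value, as a float, associated with a given Unicode code point. If no such value is defined, returns the default if one is given, or raises ValueError.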
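
Several of the functions listed above fail with an exception when a property is not defined, unless a default value is supplied. The following is a minimal sketch of both styles, not a definitive pattern. The helper name describe() and the "<unnamed>" placeholder are hypothetical and are not part of the "unicodedata" module.

```python
import unicodedata

def describe(char):
    """Collect a few properties of one character (hypothetical helper)."""
    return {
        # name() accepts a default, returned when the code point has no name
        "name": unicodedata.name(char, "<unnamed>"),
        "category": unicodedata.category(char),
        "combining": unicodedata.combining(char),
        # decimal(), digit() and numeric() also accept a default
        "decimal": unicodedata.decimal(char, None),
        "digit": unicodedata.digit(char, None),
        "numeric": unicodedata.numeric(char, None),
    }

info = describe("5")
print(info)

# lookup() raises KeyError when no character has the given name
try:
    unicodedata.lookup("NO SUCH CHARACTER NAME")
except KeyError as err:
    print("lookup failed: {0}".format(err))

# Without a default, decimal() raises ValueError for a non-decimal character
try:
    unicodedata.decimal("A")
except ValueError as err:
    print("decimal failed: {0}".format(err))
```

Passing a default keeps the calling code free of try/except blocks, which is the style used in the script in this section.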

Here is a Python script that shows you how to use the "unicodedata" module.

# unicodedata-Module-Test.py
# Copyright 2019 (c) HerongYang.com. All Rights Reserved.
#
import unicodedata

print("Unicode version: {0}".format(unicodedata.unidata_version))

char = unicodedata.lookup("Parenthesized Number Ten")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print("  category(): {0}".format(unicodedata.category(char)))
print("  combining(): {0}".format(unicodedata.combining(char)))
print("  decomposition(): {0}".format(unicodedata.decomposition(char)))
# The second argument is the default returned when no value is defined
print("  decimal(): {0}".format(unicodedata.decimal(char, "N/A")))
print("  digit(): {0}".format(unicodedata.digit(char, "N/A")))
print("  numeric(): {0}".format(unicodedata.numeric(char, "N/A")))

char = unicodedata.lookup("Combining Cedilla")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print("  category(): {0}".format(unicodedata.category(char)))
print("  combining(): {0}".format(unicodedata.combining(char)))
print("  decomposition(): {0}".format(unicodedata.decomposition(char)))
print("  decimal(): {0}".format(unicodedata.decimal(char, "N/A"))) 
print("  digit(): {0}".format(unicodedata.digit(char, "N/A"))) 
print("  numeric(): {0}".format(unicodedata.numeric(char, "N/A"))) 

def normalized_info(form, text):
  # Normalize text, then report the result, its UTF-8 bytes, and its length
  norm = unicodedata.normalize(form, text)
  info = "normalize({0}, {1}): {2}, {3}, {4}".format(form, text,
    norm, norm.encode(), len(norm))
  return info

char = unicodedata.lookup("Parenthesized Number Ten")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print("  {0}".format(normalized_info('NFC', char)))
print("  {0}".format(normalized_info('NFKC', char)))
print("  {0}".format(normalized_info('NFD', char)))
print("  {0}".format(normalized_info('NFKD', char)))

char = unicodedata.lookup("LATIN SMALL LETTER C WITH CEDILLA")
name = unicodedata.name(char)
print("{0} - {1}".format(char, name))
print("  {0}".format(normalized_info('NFC', char)))
print("  {0}".format(normalized_info('NFKC', char)))
print("  {0}".format(normalized_info('NFD', char)))
print("  {0}".format(normalized_info('NFKD', char)))

When you run the above script, it prints output like the following. The Unicode version reported depends on your Python release:

herong$ python3 unicodedata-Module-Test.py 

Unicode version: 12.1.0

⑽ - PARENTHESIZED NUMBER TEN
  category(): No
  combining(): 0
  decomposition(): <compat> 0028 0031 0030 0029
  decimal(): N/A
  digit(): N/A
  numeric(): 10.0

 ̧ - COMBINING CEDILLA
  category(): Mn
  combining(): 202
  decomposition(): 
  decimal(): N/A
  digit(): N/A
  numeric(): N/A

⑽ - PARENTHESIZED NUMBER TEN
  normalize(NFC, ⑽): ⑽, b'\xe2\x91\xbd', 1
  normalize(NFKC, ⑽): (10), b'(10)', 4
  normalize(NFD, ⑽): ⑽, b'\xe2\x91\xbd', 1
  normalize(NFKD, ⑽): (10), b'(10)', 4

ç - LATIN SMALL LETTER C WITH CEDILLA
  normalize(NFC, ç): ç, b'\xc3\xa7', 1
  normalize(NFKC, ç): ç, b'\xc3\xa7', 1
  normalize(NFD, ç): ç, b'c\xcc\xa7', 2
  normalize(NFKD, ç): ç, b'c\xcc\xa7', 2
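
The last group of output shows why normalization matters when comparing text: the precomposed "ç" and the sequence "c" followed by COMBINING CEDILLA look identical, but they are different strings of different lengths. Below is a minimal sketch of comparing them after normalizing both sides, assuming Python 3.8 or later for is_normalized(). The helper name same_text() is hypothetical.

```python
import unicodedata

precomposed = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form
    (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Raw comparison looks at code points, so the two strings differ
print(precomposed == decomposed)
# After normalization to a common form, they compare equal
print(same_text(precomposed, decomposed))

# is_normalized() reports whether a string is already in a given form
print(unicodedata.is_normalized("NFC", precomposed))
print(unicodedata.is_normalized("NFC", decomposed))
print(unicodedata.is_normalized("NFD", decomposed))
```

Which form to normalize to is a design choice. NFC and NFD preserve the distinction between characters such as "⑽" and "(10)", while NFKC and NFKD fold it away, as the output above shows.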