Unicode encoding range of Chinese characters
Unicode encoding range
u4e00 ~ u9fff
U + 4e00 ~ U + 9fa5 is the most commonly used range, that is, the block named CJK unified ideographs. The characters between U + 9fa6 ~ U + 9fff are still empty codes, which have not been defined yet, but there is no guarantee that they will not be defined in the future
def is_zh(char):
"""
:param char: Single character
:return:
"""
if u'\u4e00' <= char <= u'\u9fff':
return True
return False
Statistics of Chinese and English words
In word document, Review – > count the number of words
can calculate the number of words, Chinese words, non Chinese words and so on, now use Python to achieve
example: Hello, world 4
# -*- coding: utf-8 -*-
import re
def strQ2B(ustring):
# Full to half angle string
rstring = ""
for uchar in ustring:
inside_code = ord(uchar)
if inside_code == 12288: # convert full-corner spaces directly
inside_code = 32
elif (inside_code >= 65281 and inside_code <= 65374): # Full-angle characters (except spaces) are converted according to the relationship
inside_code -= 65248
rstring += chr(inside_code)
return rstring
def querySimpleProcess(ss):
# query preprocessing, excluding characters other than Chinese and English numbers, all converted to lowercase
s1=strQ2B(ss)
s2=re.sub(r"(?![\u4e00-\u9fff]|[0-9a-zA-Z])."," ",s1)
s3=re.sub(r"\s+"," ",s2)
return s3.strip().lower()
# Determine if it contains Chinese
def check_contain_chinese(check_str):
for ch in check_str:
if u'\u4e00' <= ch <= u'\u9fff':
return True
return False
# Determine if it contains English
def check_contain_english(check_str):
for ch in check_str:
if u'a' <= ch <= u'z' or u'A' <= ch <= u'Z':
return True
return False
# Delete letters from a string for character counting purposes
def delete_letters(ss):
rs = re.sub(r"[a-zA-Z]+","",ss)
return rs
# First space split, get the list, and then line processing each element in the list
###Example: Smart School Uniform Commercial=6, Disrespectful Breakup=2
### Exception: C Mile C Mile=3 ### Can't handle
# If the element does not contain Chinese, then the length of the element is recorded as: 1 + the number of digits
# If the element does not contain English, the length of the element is recorded as: the number of Chinese characters + the number of digits, you can directly use the len () method
# If the element contains both English and Chinese, the length of the element is recorded as: number of Chinese characters + number of digits + 1
def countCharacters(inputStr):
tmpStr = querySimpleProcess(inputStr)
str2list = tmpStr.strip().split(" ")
if len(str2list) > 0:
charsNum = 0 # Initialize character count
for elem in str2list:
chineseFlag = check_contain_chinese(elem)
englishFlag = check_contain_english(elem)
if englishFlag == False: # no English
charsNum = charsNum + len(elem)
continue
else: # contain English
elem = delete_letters(elem)
charsNum = charsNum + 1 + len(elem)
return charsNum
return 0
Read More:
- [Exception]’ascii’ codec can’t decode byte 0xe8 in position 2: ordinal not in range(128)
- Python Error: SyntaxError: (unicode error) ‘unicodeescape‘ codec can‘t decode bytes in position 2-3:
- [Solved] SyntaxError: (unicode error) ‘unicodeescape‘ codec can‘t decode bytes in position 6-7: malformed
- SyntaxError: (unicode error) ‘unicodeescape‘ codec can‘t decode bytes in position 2-3: truncated \UX
- [Solved] SyntaxError: (unicode error) ‘unicodeescape‘ codec can‘t decode bytes in position 10-11: malformed
- SyntaxError: (unicode error) ‘unicodeescape‘ codec can‘t decode bytes in position 2-3: truncated \UX
- [Solved] SyntaxError: (unicode error) ‘utf-8’ codec can’t decode byte 0xbe in position 0
- IndexError: list index out of range [How to Solve]
- Tesseract OCR text recognition using tess4j encapsulation
- Linux Fatal Python error: Py_Initialize Unable to get the locale encoding
- [Solved] Fatal Python error: initfsencoding: Unable to get the locale encoding
- Xpath Error: IndexError: list index out of range [How to Solve]
- Invalid python sd, Fatal Python error: init_fs_encoding: failed to get the Python cod [How to Solve]
- Extracting TF-IDF keywords from text using Jieba
- Python TypeError: coercing to Unicode: need string or buffer, NoneType found
- How to Solve paddleOCR recognition of curved text Error
- Set the maximum number of Postgres connections Error [How to Solve]
- Python Kmeans Error: ConvergenceWarning: Number of distinct clusters (99) found smaller than n_clusters (100).
- [Solved] Error: [email protected]: wrong number of arguments (given 1, expected 0)