Chinese corpus:
A list of short texts, for example:
I don’t have much inside glory
If Huawei users find that the battery life is less than one day, please go to Mr. Yu’s Weibo to defend your rights reasonably
it’s cheaper than 500 g
Using CountVectorizer() on this corpus, sklearn reports the error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
Cause:
CountVectorizer's defaults are:
def __init__(self, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=r"(?u)\b\w\w+\b", ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=np.int64):
With analyzer='word', tokens are extracted using token_pattern=r"(?u)\b\w\w+\b", which only keeps runs of at least two word characters. Single-character Chinese words are therefore dropped, and depending on how the text is segmented this can leave no tokens at all, so the vocabulary ends up empty.
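A minimal sketch that reproduces the error under these defaults; the corpus below is a hypothetical stand-in (single-character words), not the original data:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in corpus: every document reduces to single-character tokens,
# which the default token_pattern (two or more word characters) discards entirely.
corpus = ["我", "贵", "用"]

vectorizer = CountVectorizer()        # analyzer='word' by default
X = vectorizer.fit_transform(corpus)  # raises ValueError: empty vocabulary; ...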
Solution
CountVectorizer() defaults to analyzer="word"; change it to CountVectorizer(analyzer="char", lowercase=False).
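A minimal sketch of the fix, again with a hypothetical stand-in corpus; analyzer="char" makes every character a token, so Chinese text produces a non-empty vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in corpus for the original Chinese sentences.
corpus = ["我不用", "太贵了"]

vectorizer = CountVectorizer(analyzer="char", lowercase=False)
X = vectorizer.fit_transform(corpus)

# Each distinct character becomes a vocabulary entry.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())

Note that with analyzer="char" spaces and punctuation also count as characters, and lowercase=False only matters for Latin letters, which would otherwise be lowercased before analysis.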