Sklearn ValueError: empty vocabulary; perhaps the documents only contain stop words

Chinese corpus, given as a list of words (the sample entries below are shown in translation):

I don’t have much inside glory
If Huawei users find that the battery life is less than one day, please use Mr. Yu’s microblog to protect their rights reasonably
it’s cheaper than 500 g

Fitting the default vectorizer

CountVectorizer()

on this corpus raises the error:

Sklearn ValueError: empty vocabulary; perhaps the documents only contain stop words


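Problem:

The default constructor of CountVectorizer uses analyzer='word' together with token_pattern=r"(?u)\b\w\w+\b". That pattern only matches tokens of two or more word characters, so every single-character token is discarded. If no token survives, the vocabulary is empty and fitting raises the ValueError. The default signature: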

def __init__(self, input='content', encoding='utf-8',
             decode_error='strict', strip_accents=None,
             lowercase=True, preprocessor=None, tokenizer=None,
             stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
             ngram_range=(1, 1), analyzer='word',
             max_df=1.0, min_df=1, max_features=None,
             vocabulary=None, binary=False, dtype=np.int64):
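The effect of the default token_pattern in the signature above can be checked with the standard re module alone. This is only a sketch: the sample strings are hypothetical and not taken from the original corpus, and it reproduces just the regex step, not the full CountVectorizer pipeline.

```python
import re

# Default token_pattern copied from the CountVectorizer signature.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical documents that each consist of a single character.
single_char_docs = ["好", "贵", "a"]

# The pattern needs at least two word characters, so nothing matches.
single_char_tokens = [TOKEN_PATTERN.findall(doc) for doc in single_char_docs]

# Hypothetical document with two-character words separated by a space.
multi_char_tokens = TOKEN_PATTERN.findall("手机 电池")

print(single_char_tokens)
print(multi_char_tokens)
```

When every document tokenizes to an empty list like this, there is nothing left to build a vocabulary from.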

Solution

CountVectorizer() defaults to analyzer='word'. Change the call to CountVectorizer(analyzer='char', lowercase=False) so that each character is counted as a feature.
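A minimal sketch of the failure and the fix, assuming scikit-learn is installed. The three documents are hypothetical single-character strings chosen to trigger the error, not the original corpus. The last part shows a commonly used alternative (relaxing token_pattern) that is not part of the original solution.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: every document is a single character.
docs = ["好", "贵", "快"]

# Default settings: the word analyzer drops one-character tokens,
# so fitting fails with the "empty vocabulary" ValueError.
try:
    CountVectorizer().fit(docs)
    default_failed = False
except ValueError as err:
    default_failed = True
    print("default settings failed:", err)

# Fix from the post: analyze characters instead of words.
char_vectorizer = CountVectorizer(analyzer="char", lowercase=False)
char_matrix = char_vectorizer.fit_transform(docs)
print(sorted(char_vectorizer.vocabulary_))

# Alternative (an assumption, not from the post): keep the word analyzer
# but allow one-character tokens by relaxing token_pattern.
word_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
word_matrix = word_vectorizer.fit_transform(docs)
```

The character analyzer suits text that has not been segmented into words. If the corpus is already segmented and joined with spaces, the relaxed token_pattern keeps whole words as features instead of splitting them into characters.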
