The root cause lies in the incorrect use of fit, transform and fit_transform.
First, note that the arguments passed in are a possible cause of this series of errors.
For example, when using TfidfVectorizer(), calling transform on the result of fit_transform:
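from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf(X_train):
    tfidf_vec = TfidfVectorizer()
    X_train_tfidf = tfidf_vec.fit_transform(X_train)
    # Wrong: transform receives the already-vectorized matrix, not raw text
    X_train_tfidf = tfidf_vec.transform(X_train_tfidf)
    return X_train_tfidf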
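Here the result of fit_transform is passed to transform again, so the transformation is applied a second time. transform expects the raw text documents, not the matrix that fit_transform has already produced.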
The correct way to use it should be:
def tfidf(X_train, X_test):
    tfidf_vec = TfidfVectorizer()
    X_train_tfidf = tfidf_vec.fit_transform(X_train)
    X_test_tfidf = tfidf_vec.transform(X_test)
    return X_train_tfidf, X_test_tfidf
That is, call fit_transform on the training set and transform on the test set.
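A minimal sketch of why this works, using made-up toy documents (the corpora and variable names are assumptions for illustration): the vocabulary learned from the training text is reused for the test text, so both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, for illustration only.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["a bird sat on the dog"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_test = vec.transform(test_docs)        # reuses them; unseen words are ignored

n_features = len(vec.vocabulary_)
print(X_train.shape, X_test.shape, n_features)
```

Words that appear only in the test documents (here "bird") are simply dropped, because they are not part of the fitted vocabulary.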
Or you can fit first and then transform
def tfidf(data):
    tfidf_vec = TfidfVectorizer()
    tfidf_model = tfidf_vec.fit(data)  # fit returns the fitted vectorizer itself
    print(tfidf_model.dtype)  # dtype the vectorizer is configured to output
    X_tfidf = tfidf_model.transform(data)
    return X_tfidf
You can also use CountVectorizer() and TfidfTransformer()
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def tfidf(data):
    # Converts the text into a word-count matrix; element a[i][j] is the
    # count of word j in document i
    vectorizer = CountVectorizer()
    # Computes the tf-idf weight of each word from the count matrix
    transformer = TfidfTransformer()
    counts = vectorizer.fit_transform(data)
    X_tfidf = transformer.fit_transform(counts)
    return X_tfidf
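As a quick check that the two-step route gives the same features as TfidfVectorizer, the sketch below compares both on a made-up toy corpus. It assumes default settings for all three classes; with non-default parameters the results can differ.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat", "a bird sat on the dog"]

# Two steps: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both internally.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```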
Note that the first approach takes the training set and test set as separate arguments, so the data must be split before calling the function. The latter two take the whole data set, so the tf-idf features need to be extracted first and the data split into training and test sets afterwards. Otherwise the training set and test set may end up with different feature dimensions, because they were not transformed against the same fitted vocabulary.
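The dimension mismatch can be reproduced on a toy corpus. The sketch below uses made-up documents and labels, and assumes scikit-learn's train_test_split for the split, which the text above does not name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sat on the dog",
    "the fish swam away",
]
labels = [0, 1, 0, 1]

# Problem: a separate vectorizer fitted on each part learns its own vocabulary,
# so the two matrices have different numbers of columns.
width_a = TfidfVectorizer().fit_transform(docs[:2]).shape[1]
width_b = TfidfVectorizer().fit_transform(docs[2:]).shape[1]
print(width_a, width_b)

# Extract the features once, then split the resulting matrix.
X = TfidfVectorizer().fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)
print(X_train.shape[1] == X_test.shape[1])
```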