Tag Archives: NLP

NLTK Library Download Error: [Errno 11004] getaddrinfo failed

When downloading data for the natural language processing library NLTK under Windows 10, the following error appears:

The IP address of raw.githubusercontent.com could not be found

Since the hosts file cannot be modified and saved in normal mode, the editor needs to be run as administrator.

Solution:

  1. Run PowerShell as administrator
  2. Go to the hosts directory: C:\Windows\system32\drivers\etc
  3. Run notepad hosts to open and modify the hosts file.
  4. Add the website's IP address and hostname on the last line of the hosts file and save it.


Because the IP address of raw.githubusercontent.com changes frequently, query it first: open https://www.ipaddress.com/ and enter raw.githubusercontent.com to get its current IP address.
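To confirm the hosts entry took effect, a quick resolution check can be run first; this is a minimal sketch using only the Python standard library, not part of the original post.

import socket

# Minimal check (assumption: run after saving the hosts file): if the hostname
# now resolves, nltk.download() should no longer fail with getaddrinfo errors.
try:
    print(socket.getaddrinfo("raw.githubusercontent.com", 443)[0][4][0])
except socket.gaierror as e:
    print("Still cannot resolve:", e)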

Test succeeded

[pl.LightningModule] spaCy & pytorch-lightning Error

Inside a pl.LightningModule, spaCy cannot be used for tokenization; otherwise an error is reported.

1. Using spaCy inside forward

...
File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
...

It seems that all objects held by the model are automatically converted into trainable objects inside the Lightning framework; the original spaCy pipe is likewise turned into a TrainablePipe, and errors such as the one above are raised.

2. Working around problem 1 by using nlp.pipe

This runs into the same problem as in forward: the pipe is again converted into a trainable pipe.

3. Moving the spaCy processing outside the model and calling it as a plain function

An error is still reported, although it differs from the one above and is rather inexplicable.

Solution:

I did not find a good solution, so I had to reimplement the required functionality by hand, such as stop-word removal.
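As an illustration of that workaround, the sketch below hand-rolls tokenization and stop-word filtering so that no spaCy pipeline object lives inside the LightningModule; the regex and the stop-word list are assumptions for the example, not from the original post.

import re

STOP_WORDS = {"a", "an", "the", "is", "of", "to"}  # assumed demo list

def tokenize_and_filter(text):
    # Plain-Python replacement for the spaCy tokenizer + stop-word step.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize_and_filter("The model is an example of tokenization."))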

How to Solve Error in Importing Scala Word2VecModel

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

Model save:
Link: http://spark.apache.org/docs/2.3.4/api/scala/index.html#org.apache.spark.mllib.feature.Word2VecModel

model.save(spark.sparkContext, config.model_path)

Model load:
Link: http://spark.apache.org/docs/2.3.4/api/scala/index.html#org.apache.spark.mllib.feature.Word2VecModel$

var model = Word2VecModel.load(spark.sparkContext, config.model_path)

Read Error:

Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:312)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1337)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
	at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1372)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1371)
	at org.apache.spark.mllib.util.Loader$.loadMetadata(modelSaveLoad.scala:129)
	at org.apache.spark.mllib.feature.Word2VecModel$.load(Word2Vec.scala:699)
	at job.ml.embeddingModel.graphEmbedding$.run(graphEmbedding.scala:40)
	at job.ml.embeddingModel.graphEmbedding$.main(graphEmbedding.scala:24)
	at job.ml.embeddingModel.graphEmbedding.main(graphEmbedding.scala)
	

This IllegalAccessError is typically a Guava version conflict: Hadoop's FileInputFormat was compiled against an older Guava whose Stopwatch constructor is still accessible. Pinning an older Guava in the POM file fixes it; add:

    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>15.0</version>
    </dependency>

Run it again and it works!

[Solved] HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/saved_model

Cause:

Originally, each of the two folders sort_change_nlp and sort_nlp contains a saved model directory, but the names differ: one is saved_model and the other is saved_model_copy. When the bare name saved_model was used in the http_server code, the following error occurred:

model = YesOrNoModel.from_pretrained(model_name)

def is_model_answer(query):
    for rule in base_data.q_v_model_list:
        result1 = re.compile(rule).findall(query)
        if len(result1):
            return "saved_model"
    return ""

model_name = is_model_answer(query)
print(f"model_name={model_name},answer={answer}")
if len(model_name):
    model = YesOrNoModel.from_pretrained(model_name)

Error:

ssh://[email protected]:22/usr/bin/python -u /opt/program_files/python/local_map_python_work/judge/sort_proj/test.py
404 Client Error: Not Found for url: https://huggingface.co/saved_model/resolve/main/config.json
Traceback (most recent call last):
  File "/root/.local/lib/python3.6/site-packages/transformers/configuration_utils.py", line 520, in get_config_dict
    user_agent=user_agent,
  File "/root/.local/lib/python3.6/site-packages/transformers/file_utils.py", line 1371, in cached_path
    local_files_only=local_files_only,
  File "/root/.local/lib/python3.6/site-packages/transformers/file_utils.py", line 1534, in get_from_cache
    r.raise_for_status()
  File "/usr/local/python3/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/saved_model/resolve/main/config.json
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/program_files/python/local_map_python_work/judge/sort_proj/test.py", line 22, in <module>
    model = YesOrNoModel.from_pretrained("saved_model")
  File "/root/.local/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1196, in from_pretrained
    **kwargs,
  File "/root/.local/lib/python3.6/site-packages/transformers/configuration_utils.py", line 455, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/transformers/configuration_utils.py", line 532, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for 'saved_model'. Make sure that:
- 'saved_model' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'saved_model' is the correct path to a directory containing a config.json file

 

Solution:
The reason is that http_server sits at the same directory level as sort_change_nlp and sort_nlp, so the path passed in when loading the model should be sort_change_nlp/saved_model or sort_nlp/saved_model_copy.
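A minimal sketch of the corrected calls, assuming the directory layout described above (the paths are illustrative):

# Passing a path that exists on disk keeps transformers from treating the
# name as a model ID to fetch from huggingface.co.
model = YesOrNoModel.from_pretrained("sort_change_nlp/saved_model")
# or, for the other project:
model = YesOrNoModel.from_pretrained("sort_nlp/saved_model_copy")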

[Solved] AttributeError: module 'sacrebleu' has no attribute 'compute_bleu'

Recently, while installing fairseq and running experiments on machine translation tasks, I encountered the following error:

AttributeError: module 'sacrebleu' has no attribute 'compute_bleu'

After some investigation, this is a problem with sacrebleu version 2.0; only the following steps are needed:

pip uninstall sacrebleu
pip install sacrebleu==1.5.1

Problem solved!
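As a quick sanity check after the downgrade (a sketch, not from the original post), you can confirm the installed version and that the attribute fairseq expects is present:

import sacrebleu

print(sacrebleu.__version__)               # expect 1.5.1 after the downgrade
print(hasattr(sacrebleu, "compute_bleu"))  # the 1.x series still exposes this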

python gensim AttributeError: 'Doc2Vec' object has no attribute 'dv'

Environment: Python 3, gensim 4.0.1

My code: an error is reported when loading the Doc2Vec model file

from gensim.models import Doc2Vec

doc2vec_model = Doc2Vec.load('data/doc2vec.model')

“AttributeError: ‘Doc2Vec’ object has no attribute ‘dv’”

Reason 1: there may be a compatibility problem with the latest version; change the version.

Reason 2: alternatively, change model.dv to model.docvecs in the code.

Solution:

I uninstalled gensim: pip uninstall gensim

and reinstalled an older version: pip install gensim==3.8.3
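A minimal sketch of reloading the model after the downgrade; in gensim 3.8.3 the document vectors are exposed as model.docvecs rather than model.dv ('data/doc2vec.model' is the path used earlier in this post):

from gensim.models import Doc2Vec

doc2vec_model = Doc2Vec.load('data/doc2vec.model')
print(doc2vec_model.docvecs)  # in gensim 4.x this attribute is model.dv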

Solved!!! I hope it will be useful to you.

NLTK Error: [Error:11004] getaddrinfo failed [How to Solve]

When I run NLTK's tokenization:

from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))

The Punkt package is missing, so the following code is used to download it:

import nltk
nltk.download()

It reports the error: [Errno 11004] getaddrinfo failed

Solution:

1. Open the IP lookup website https://www.ipaddress.com/ and enter raw.githubusercontent.com

2. Copy the four IP addresses that are returned

3. Open C:\Windows\System32\drivers\etc\hosts and append those IP addresses, each followed by the hostname, at the end of the file
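After updating the hosts file, the download can be retried; this is a minimal sketch that fetches only the punkt tokenizer needed by word_tokenize and re-runs the example above:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # should now succeed instead of failing with error 11004
print(word_tokenize("God is Great! I won a lottery."))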

NLTK Error [nltk_data] Error loading stopwords: hostname

NLTK error [nltk_data] Error loading stopwords: hostname. Use the following code to download the stopwords:

import nltk
import ssl

# Fall back to an unverified SSL context so the download is not blocked by
# certificate verification (older Python builds may lack this attribute).
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')
nltk.download('punkt')

Sklearn ValueError: empty vocabulary; perhaps the documents only contain stop words

Chinese corpus: a list of short texts, for example:

I don’t have much inside glory
If Huawei users find that the battery life is less than one day, please use Mr. Yu’s microblog to protect their rights reasonably
it’s cheaper than 500 g

Using

CountVectorizer()

the following error is reported:

Sklearn ValueError: empty vocabulary; perhaps the documents only contain stop words

 

The problem lies in CountVectorizer's default parameters:

def __init__(self, input='content', encoding='utf-8',
             decode_error='strict', strip_accents=None,
             lowercase=True, preprocessor=None, tokenizer=None,
             stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
             ngram_range=(1, 1), analyzer='word',
             max_df=1.0, min_df=1, max_features=None,
             vocabulary=None, binary=False, dtype=np.int64):

Solution

CountVectorizer() defaults to analyzer="word"; change it to CountVectorizer(analyzer="char", lowercase=False).
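A minimal sketch of the fix, using a couple of placeholder Chinese strings in place of the corpus above: the default word analyzer finds no tokens it considers valid, while character n-grams do build a vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["比500克便宜", "荣耀里面不多"]  # assumed stand-ins for the corpus above
vectorizer = CountVectorizer(analyzer="char", lowercase=False)
X = vectorizer.fit_transform(corpus)
print(X.shape)
print(vectorizer.vocabulary_)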

 

 

nltk.download('punkt') Returns False [How to Solve]

The following code uses NLTK to tokenize a sentence and then remove stop words, but when it runs it prompts that punkt needs to be downloaded.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Our output here:



['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']

After several attempts, nltk.download('punkt') still returned False.

It worked fine on someone else's machine, so I copied the downloaded data directory from that machine into the directory where the download had failed on mine:
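To find where that directory should go on the failing machine, NLTK's data search path can be printed; a minimal sketch, not part of the original post:

import nltk

# Copy the nltk_data folder from the working machine into one of these paths.
print(nltk.data.path)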