Category Archives: Python

How to Solve Pytorch eval Stuck Error

Question

Single-GPU training runs quickly, but during eval the process hangs after one batch, with no error message.

Things I tried that did not help:

1. Changing pin_memory of valid_loader to False. (When it is True, data is automatically loaded into pinned memory, which speeds up transfer to the GPU.)
2. Changing num_workers to 1. Some people say too many workers can deadlock the worker processes, so reducing the count is worth a try.

 

Final Solution:

valid_loader:
pin_memory = True  # this is important: advice online says switching to False may fix the hang, but in my experiment it only ran normally after switching back to True
num_workers=4
batch_size=8

train_loader:
pin_memory=True
num_workers=4
batch_size = 8
These parameters are kept the same as for valid_loader.

In short: keep pin_memory=True for valid_loader, which is easy to understand, since data is loaded into pinned memory automatically, transfers to the GPU faster, and inference speeds up accordingly. Then reduce num_workers and batch_size for both valid_loader and train_loader, and keep pin_memory=True for train_loader as well. A sketch of the final settings follows.
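For reference, here is a minimal sketch of the final loader settings. The datasets below are dummy placeholders so the snippet runs on its own; substitute your real Dataset objects.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy datasets just to make the sketch self-contained; replace with your own.
train_dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
valid_dataset = TensorDataset(torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,)))

# Keep pin_memory=True and identical worker/batch settings for both loaders.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
                          num_workers=4, pin_memory=True)
valid_loader = DataLoader(valid_dataset, batch_size=8, shuffle=False,
                          num_workers=4, pin_memory=True)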

[Solved] LeNet Script Train Error: AttributeError: ‘DictIterator’ object has no attribute ‘get_next’

My training environment:

Windows10 64bit;

MindSpore1.5.0-beta;

CPU;

python3.9;

When training on the MNIST dataset with LeNet, the following error occurs:

How to solve this problem??

The MindSpore version is indeed a bit old, and the use case above needs modification to run on a newer release: judging from the current implementation of the Iterator code, it no longer has a get_next method: https://gitee.com/mindspore/mindspore/blob/master/mindspore/python/mindspore/dataset/engine/iterators.py#L59

It does, however, implement __next__, so the offending line can be changed as follows:

original: data = ds.get_next()
modified: data = next(ds)
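Here is a minimal, self-contained sketch of the change, using a tiny NumpySlicesDataset as a stand-in for the MNIST pipeline from the LeNet script:

import numpy as np
import mindspore.dataset as ds_module

# Tiny stand-in dataset so the sketch runs on its own; replace with the real MNIST dataset.
dataset = ds_module.NumpySlicesDataset({"image": np.arange(6).reshape(3, 2)}, shuffle=False)
iterator = dataset.create_dict_iterator()

# old (no longer available): data = iterator.get_next()
data = next(iterator)  # DictIterator implements __next__
print(data["image"])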

[Solved] Jupyter Notebook Error: SparkException: Python worker failed to connect back

Error message:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-24-bafca16b0526> in <module>
      8     return jobitem, ratingsRDD
      9 jobitem, jobRDD = preparJobdata(sc)
---> 10 jobRDD.collect() 

G:\Projects\python-3.6.4-amd64\lib\site-packages\pyspark\rdd.py in collect(self)
    947         """
    948         with SCCallSiteSync(self.context) as css:
--> 949             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    950         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    951 

G:\Projects\python-3.6.4-amd64\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

G:\Projects\python-3.6.4-amd64\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.101.68 executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
	at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
	at java.net.ServerSocket.implAccept(ServerSocket.java:545)
	at java.net.ServerSocket.accept(ServerSocket.java:513)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:174)
	... 14 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
	at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
	at java.net.ServerSocket.implAccept(ServerSocket.java:545)
	at java.net.ServerSocket.accept(ServerSocket.java:513)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:174)
	... 14 more

Solution:

Configure the following environment variables:

# Hadoop environment variable on Windows
HADOOP_HOME = F:\hadoop-common-2.2.0-bin-master\hadoop-common-2.2.0-bin-master

# JDK environment variable on Windows
JAVA_HOME = F:\jdk-8u121-windows-x64_8.0.1210.13

# PySpark environment variables on Windows
PYSPARK_DRIVER_PYTHON = jupyter
PYSPARK_DRIVER_PYTHON_OPTS = notebook
PYSPARK_PYTHON = python

Remember to restart the computer after the configuration is completed!
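If you prefer not to rely on system-wide settings, the same variables can also be set from Python before the SparkContext is created. A minimal sketch (the paths below are only illustrations; point them at your own JDK and Hadoop winutils installations):

import os
import sys

# Illustrative paths; adjust to your own installation.
os.environ["JAVA_HOME"] = r"F:\jdk-8u121-windows-x64_8.0.1210.13"
os.environ["HADOOP_HOME"] = r"F:\hadoop-common-2.2.0-bin-master\hadoop-common-2.2.0-bin-master"
# Make the driver and the Python workers use the same interpreter.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print(sc.parallelize(range(5)).collect())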

How to Solve wikiextractor Extract Wikipedia Corpus Error

When extracting the Wikipedia corpus I first tried wikiextractor, but it kept failing, so I dropped it. Since many people have asked me how to extract the corpus, I am publishing the code here.

I did not write this code myself; I found it on a website. Too much time has passed and I have forgotten the site's address, so I cannot post the original URL. If the author sees this, please send me a private message and I will add the original link.

The author’s email address is: [email protected]

How to use: enter the command at the command line:

python data_pre_process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
(the first argument is the Wikipedia dump, the second is the output file)

Source code:

# -*- coding: utf-8 -*-
# Author: Pan Yang ([email protected])
# Copyright 2017
from __future__ import print_function

import logging
import os.path
import six
import sys

from IPython.core.page import page
from gensim.corpora import WikiCorpus

page.encoding = 'utf-8'

# Wrapping the Wikipedia xml corpus into txt format
# python data_pre_process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w', encoding='utf-8')
    wiki = WikiCorpus(inp, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
            #   ###another method
            #   output.write(space.join(map(lambda x: x.decode("utf-8"), str(text))) + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

ModuleNotFoundError: No module named ‘requests‘ [How to Solve]

Problem description: After installing the requests module with pip in cmd, import requests works there, but inside the PyCharm IDE it fails with the following error: ModuleNotFoundError: No module named 'requests'

It turned out that my Python is installed on the E drive, but I had run pip install requests against the Python on the C drive without switching to the E-drive installation, so I reinstalled the package into the E-drive Python.

Installation steps

① cmd window: switch to the Scripts folder inside the Python installation directory on the E drive

② Install command: pip install requests; it printed a warning in yellow text

③ Follow the yellow prompt and enter: python -m pip install --upgrade pip, then the installation completes

④ Verify the installation: at the Python prompt, import requests now runs without an error (see the quick check below)
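A quick check, run inside PyCharm, confirms that the IDE and pip now point at the same interpreter (a small sketch, nothing specific to this project):

import sys
print(sys.executable)   # the interpreter PyCharm is actually using
print(sys.path)         # where it looks for installed packages

import requests         # should now import without ModuleNotFoundError
print(requests.__version__)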


[Solved] Python Requests Error: simplejson.errors.JSONDecodeError: Expecting value

Problem: when running the interface automation script, the request data is correct, but it keeps raising simplejson.errors.JSONDecodeError: Expecting value


Reason:
The interface returns a response that is not in JSON format, while the base interface code returns res.json().

Solution:
Return res.text instead, or add exception handling around the res.json() call.

Exception capture:
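Here is a minimal sketch of such an exception capture, assuming the requests library; the helper name and URL are only illustrations, not from the original code:

import requests

def get_response_body(url, **kwargs):
    """Return parsed JSON when possible, otherwise the raw text."""
    res = requests.get(url, **kwargs)
    try:
        return res.json()
    except ValueError:  # JSONDecodeError subclasses ValueError in json/simplejson
        # The interface returned a non-JSON body, so fall back to plain text.
        return res.text

body = get_response_body("https://example.com/api/health")  # placeholder URL
print(body)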

NLTK Library Download Error: [Errno 11004] getaddrinfo failed

Downloading data for the natural language processing library NLTK on Windows 10 fails with an error like this:

The IP address of raw.githubusercontent.com could not be found

Since the hosts file cannot be modified and saved with normal user rights, the editor has to be run as administrator.

Solution:

  1. Run PowerShell as administrator
  2. Go to the hosts directory: C:\Windows\system32\drivers\etc
  3. Type notepad hosts and press Enter to edit the hosts file.
  4. Add the site's IP address and hostname on the last line of the hosts file and save it.


Because the IP address of raw.githubusercontent.com changes from time to time, look it up first: open https://www.ipaddress.com/ and search for raw.githubusercontent.com to get the current IP address.
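A quick way to confirm that the hosts entry took effect is to resolve the name from Python before retrying the download (a small sketch):

import socket

# Should print the IP address you added to the hosts file.
print(socket.gethostbyname("raw.githubusercontent.com"))

# Then retry the download that previously failed, e.g.:
# import nltk
# nltk.download("punkt")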

Test succeeded

VScode: How to Solve Pylance Error (pip Library Files Installed)

View Python installation location

Enter in Terminal

where python

View where pip installed the library files

pip show <packagename>

Open vscode and set the library path

Press Ctrl+Shift+P and choose `Preferences: Open Settings (JSON)`

Add path

"python.analysis.extraPaths":[
	    "/root/miniconda3/lib/python3.9/site-packages",
        "/root/.local/lib/python3.9/site-packages",
        "...."
        ]

[Solved] D455 Depth Camera Error: KeyError: 'frame_device_t'

color_t = sens_frame.color_t[self.computespeed_t_type]
KeyError: ‘frame_device_t’
This is probably because two values have not been added to the Windows registry by hand. Metadata has to be enabled manually, since it carries the important timestamp information; please refer to:

https://dev.intelrealsense.com/docs/compiling-librealsense-for-windows-guide

Modifying the Windows Registry:
For each interface found (Steps 2 and 3 of the Intel guide), perform the following:

  1. Using a registry editing tool such as regedit, navigate to the HKLM\SYSTEM\CurrentControlSet\Control\DeviceClasses\{e5323777-f976-4f5b-9b55-b94699c46e44} branch.
  2. Browse into the subdirectory whose name is identical to the device instance path obtained in the previous step.
  3. Expand the entry into #GLOBAL -> Device Parameters.
  4. Add a DWORD 32-bit value named MetadataBufferSizeInKB0 with value 5.
  5. Add an additional DWORD 32-bit value named MetadataBufferSizeInKB1 with value 5 for the RS400 device zero interface ##?##USB#VID_8086&PID… MI_00…
  6. Repeat the previous steps for the HKLM\SYSTEM\CurrentControlSet\Control\DeviceClasses\{65E8773D-8F56-11D0-A3B9-00A0C9223196} branch.

[Solved] RuntimeError: “unfolded2d_copy“ not implemented for ‘Half‘

Error message:

RuntimeError: "unfolded2d_copy" not implemented for 'Half'

Cause

The model was passed the parameter use_half=True, i.e. inference on the CPU uses fp16 (half precision) to speed things up, but PyTorch's CPU backend does not support fp16 for this operation.

Solution:

  1. Pass use_half=False, or
  2. Change .half() to .float()

so that the model can run its computation (a sketch of the second option follows).
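A minimal sketch of the second option; the model below is just a placeholder, not the network from the original report:

import torch
import torch.nn as nn

# Placeholder model; substitute the real network from your script.
model = nn.Conv2d(3, 8, kernel_size=3)

# If the model was previously converted with .half(), bring it back to fp32,
# because the CPU backend does not implement this op for fp16.
model = model.float()

x = torch.randn(1, 3, 32, 32).float()  # inputs must match the model's dtype
with torch.no_grad():
    out = model(x)
print(out.shape)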

I hope this article is useful to you!

Thank you for your comments!

[Solved] RuntimeError: DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.

Question

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.

Today, when running YOLOv7 on my own computer, I used the CPU to run the test model because I don't have a GPU. Predicting a single image on the CPU works without any problem, very nice!!! However, when predicting a video (many images), it reported that memory allocation was insufficient:

DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.,

Moreover, the error does not appear after the second image; it shows up around the 17th image, as if memory is never released between frames.

Analysis

In PyTorch, a tensor has a requires_grad attribute; when it is True, gradients are computed for that tensor during backpropagation. requires_grad defaults to False, but if a leaf tensor (one you created yourself) has requires_grad=True, then every tensor computed from it also ends up with requires_grad=True, even if its other inputs have requires_grad=False.


Note:

requires_grad is an attribute of PyTorch's Tensor data structure that indicates whether gradient information should be kept for that quantity during computation. Take linear regression as an example: the weights w and bias b are the quantities to be trained, and to find suitable parameter values we define a loss function and train by backpropagating its gradient.

When requires_grad is False, no gradient is computed for that tensor during backpropagation, which saves memory (or GPU memory).
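A small standalone illustration of how requires_grad propagates and what no_grad changes (not taken from the original script):

import torch

w = torch.randn(3, requires_grad=True)  # leaf tensor we want to train
x = torch.randn(3)                      # requires_grad defaults to False

y = (w * x).sum()
print(y.requires_grad)  # True: y depends on w, so the graph is recorded

with torch.no_grad():
    z = (w * x).sum()
print(z.requires_grad)  # False: no graph is built, so no extra memory is kept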

The solution follows directly: simply stop the model from recording gradients during testing, since they are not actually used there.

 

Solution:

Wrap the forward pass in torch.no_grad() so the model does not store gradients during testing:

with torch.no_grad():
    output, _ = model(image) # Add before the image calculation

This way, no gradients are computed or stored while the model processes each image!

Perfect solution!

Pycharm WebSocket Error: Error: Connection error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed

Problem description

PyCharm hit an SSL error while running a websocket connection:

Error: Connection error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed

Solution

Since I use an Anaconda environment in PyCharm, the fix has to be applied in the corresponding conda environment.

python -m certifi

This prints the path of the certificate bundle.

conda config --set ssl_verify <your-path>

This stores the certificate path in the conda configuration, after which the websocket connection opens normally.
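Alternatively, assuming the websocket-client package is being used, the certifi bundle can be passed directly in code; a minimal sketch (the endpoint below is only a placeholder):

import certifi
import websocket  # websocket-client package

# Point the TLS handshake at certifi's CA bundle instead of the system store.
ws = websocket.create_connection(
    "wss://echo.websocket.events",
    sslopt={"ca_certs": certifi.where()},
)
ws.send("hello")
print(ws.recv())
ws.close()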