Author Archives: Robins

[Solved] TensorFlow Error: UnknownError (see above for traceback): Failed to get convolution algorithm.


Question

Problem: TensorFlow reports the error: UnknownError (see above for traceback): Failed to get convolution algorithm.

 

Analysis

Analysis: the GPU does not have enough free memory. Running the job on a GPU with sufficient free memory solves the problem.

 

Solution:
1. Use gpustat to check the usage of each graphics card;
2. Select a graphics card with enough free memory.
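
If switching cards is not an option, limiting how TensorFlow grabs GPU memory often avoids this error as well. A minimal sketch, assuming TensorFlow 2.x (the GPU index "1" is only an example):

import os
import tensorflow as tf

# Expose only the card that gpustat shows as free (the index is an example)
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Allocate GPU memory on demand instead of reserving it all at start-up
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)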

[Solved] runtime error: reference binding to null pointer of type ‘std::vector<int, std::allocator<int>>‘

How to Solve Error: runtime error: reference binding to null pointer of type ‘std::vector<int, std::allocator<int>>‘

For a vector<vector<int>>, you must allocate the appropriate storage space for it at initialization, before indexing into it.

Solution:
Allocate the required storage space when initializing:

//Suppose the size of the two-dimensional array space is m x n
vector<vector<int> > count(m);
for(int i=0;i<m;i++)
	count[i].resize(n);

[Solved] CUDA failure 999: unknown error ; GPU=-351697408 ; hostname=4f5e6dff58e6 ; expr=cudaSetDevice(info_.device_id);

How to Solve error: CUDA failure 999: unknown error

1. Error Message:

CUDA failure 999: unknown error ; GPU=-351697408 ; hostname=4f5e6dff58e6 ; expr=cudaSetDevice(info_.device_id);

 

2. Solution:

To reload the NVIDIA UVM kernel module, run the following commands:

sudo rmmod nvidia_uvm

sudo modprobe nvidia_uvm
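
After reloading the module, you can quickly confirm that the GPU is reachable again. A minimal check, assuming PyTorch is installed:

import torch

# Both calls should succeed once cudaSetDevice works again
print(torch.cuda.is_available())   # expect True
print(torch.cuda.device_count())   # expect the real number of GPUs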

[Solved] IDEA java compile error: Error:java: Compilation failed: internal java compiler error

There are two common causes of the error “Error:java: Compilation failed: internal java compiler error”.

1. The JDK version may be inconsistent

Check whether the following three settings are consistent:

Preferences -> Compiler -> the SDK version set under Java Compiler

File -> Project Structure -> Project -> the project SDK version

File -> Project Structure -> Modules -> the Module SDK version in each module's Dependencies tab

2. The memory setting of the compilation environment may be insufficient.

Preferences -> Compiler -> Shared build process heap size

Increase it from the default of 700 MB.

ERROR Error: [copy-webpack-plugin] patterns must be an array

ERROR Error: [copy-webpack-plugin] patterns must be an array

When installing mars3d, you need to install the copy-webpack-plugin plugin and then configure it in vue.config.js. Doing so reports the error “Error: [copy-webpack-plugin] patterns must be an array”. The current Node version is 12.2.0.

const plugins = [
            // Identifies the home directory where cesium resources are located, which is needed for resource loading, multithreading, etc. inside cesium
            new webpack.DefinePlugin({
                CESIUM_BASE_URL: JSON.stringify(path.join(config.output.publicPath, cesiumRunPath))
            }),
            // Cesium-related resource directories need to be copied to the system directory (some CopyWebpackPlugin versions may not have patterns in their syntax)
           new CopyWebpackPlugin({
                patterns: [
                { from: path.join(cesiumSourcePath, 'Workers'), to: path.join(config.output.path, cesiumRunPath, 'Workers') },
                { from: path.join(cesiumSourcePath, 'Assets'), to: path.join(config.output.path, cesiumRunPath, 'Assets') },
                { from: path.join(cesiumSourcePath, 'ThirdParty'), to: path.join(config.output.path, cesiumRunPath, 'ThirdParty') },
                { from: path.join(cesiumSourcePath, 'Widgets'), to: path.join(config.output.path, cesiumRunPath, 'Widgets') }
                ]
            })
        ];

Change new CopyWebpackPlugin as follows (older copy-webpack-plugin releases expect the array of patterns directly rather than an options object with a patterns key):

const plugins = [
            // Identifies the home directory where cesium resources are located, which is needed for resource loading, multithreading, etc. within cesium
         new webpack.DefinePlugin({
                CESIUM_BASE_URL: JSON.stringify(path.join(config.output.publicPath, cesiumRunPath))
         }),
         new CopyWebpackPlugin([
	        { from: path.join(cesiumSourcePath, 'Workers'), to: path.join(config.output.path, cesiumRunPath, 'Workers') },
	        { from: path.join(cesiumSourcePath, 'Assets'), to: path.join(config.output.path, cesiumRunPath, 'Assets') },
	        { from: path.join(cesiumSourcePath, 'ThirdParty'), to: path.join(config.output.path, cesiumRunPath, 'ThirdParty') },
	        { from: path.join(cesiumSourcePath, 'Widgets'), to: path.join(config.output.path, cesiumRunPath, 'Widgets') }
        ])
 ];

It now runs successfully.

[Solved] Fatal error in PMPI_Init: Other MPI error, error stack:MPIR_Init_thread(138)

1. Problem

The program is compiled with the oneAPI 2021.3 compilers. When the job is run through the Slurm scheduling system, it fails with the PMPI_Init error above.

2. Solution

In the Slurm sbatch script, point the I_MPI_PMI_LIBRARY environment variable at libpmi2.so:

export I_MPI_PMI_LIBRARY=../slurm/lib/libpmi2.so

If the application is packaged as a container image (e.g. Docker):

1. Copy the libpmi2.so file into the container;

2. Export the corresponding environment variable inside the container as well.

 

[Solved] Pytorch Error: RuntimeError: Error(s) in loading state_dict for Network: size mismatch

Problem background

GitHub open source project: https://github.com/zhang-tao-whu/e2ec

python train_net.py coco_finetune --bs 12 \
--type finetune --checkpoint data/model/model_coco.pth

The error is reported as follows:

loading annotations into memory...
Done (t=0.09s)
creating index...
index created!
load model: data/model/model_coco.pth
Traceback (most recent call last):
  File "train_net.py", line 67, in <module>
    main()
  File "train_net.py", line 64, in main
    train(network, cfg)
  File "train_net.py", line 40, in train
    begin_epoch = load_network(network, model_dir=args.checkpoint, strict=False)
  File "/root/autodl-tmp/e2ec/train/model_utils/utils.py", line 66, in load_network
    net.load_state_dict(net_weight, strict=strict)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Network:
        size mismatch for dla.ct_hm.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
        size mismatch for dla.ct_hm.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).

Since my own dataset has only 1 category while the COCO dataset has 80, the size of the dla.ct_hm.2 parameters in the pretrained model does not match my model, so those weights in the pretrained checkpoint need to be discarded.

Solution:

Modify load_network in e2ec/train/model_utils/utils.py:

def load_network(net, model_dir, strict=True, map_location=None):

    if not os.path.exists(model_dir):
        print(colored('WARNING: NO MODEL LOADED !!!', 'red'))
        return 0

    print('load model: {}'.format(model_dir))
    if map_location is None:
        pretrained_model = torch.load(model_dir, map_location={'cuda:0': 'cpu', 'cuda:1': 'cpu',
                                                               'cuda:2': 'cpu', 'cuda:3': 'cpu'})
    else:
        pretrained_model = torch.load(model_dir, map_location=map_location)
    if 'epoch' in pretrained_model.keys():
        epoch = pretrained_model['epoch'] + 1
    else:
        epoch = 0
    pretrained_model = pretrained_model['net']

    net_weight = net.state_dict()
    for key in net_weight.keys():
        net_weight.update({key: pretrained_model[key]})
    # Discard the parameters whose shapes do not match (the category heatmap head)
    net_weight.pop("dla.ct_hm.2.weight")
    net_weight.pop("dla.ct_hm.2.bias")
    
    net.load_state_dict(net_weight, strict=strict)
    return epoch

Note: setting strict=False in load_state_dict only helps when layers are added or removed entirely; it does not help when a parameter keeps its name but changes shape.
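
A more general variant is to drop every checkpoint entry whose shape disagrees with the current model instead of hard-coding the key names. A minimal sketch (the helper name is illustrative, not part of the e2ec code):

def filter_mismatched_keys(net, pretrained_weights):
    """Keep only pretrained parameters whose names and shapes match the model."""
    model_state = net.state_dict()
    return {k: v for k, v in pretrained_weights.items()
            if k in model_state and v.shape == model_state[k].shape}

# Load with strict=False so the dropped (mismatched or missing) keys are skipped
# net.load_state_dict(filter_mismatched_keys(net, pretrained_model['net']), strict=False)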

MindSpore Error: [ERROR] MD:unexpected error.Not a valid index

ERROR:[MD]:unexpected error. Not a valid index


Problem: on a single card no error is reported and training runs correctly, but after switching to distributed training the error above appears. Troubleshooting shows that the cause is incorrect use of distributed sampling: the faulty code applies distributed sampling before random sampling.

The order of the two samplers needs to be swapped; the correct way is to perform random sampling first and then distributed sampling (see the sketch below).

After this modification, distributed training runs correctly.
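
The code screenshots from the original post are not preserved here, but the usual fix is to let the dataset loader handle shuffling and sharding together instead of chaining the samplers by hand. A minimal sketch, assuming a MindSpore ImageFolderDataset (the dataset path and the rank variables are placeholders):

import mindspore.dataset as ds
from mindspore.communication import init, get_rank, get_group_size

init()                        # initialize the distributed communication backend
rank_id = get_rank()          # id of the current device
rank_size = get_group_size()  # total number of devices

dataset = ds.ImageFolderDataset("/path/to/data",      # placeholder dataset path
                                shuffle=True,          # random sampling
                                num_shards=rank_size,  # distributed sampling
                                shard_id=rank_id)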

[Solved] IDEA jdbc Connect Database Error: java.sql.SQLException: Undefined Error

The following error is reported when using jdbc to connect to a database in IDEA:

Solution:

1. In IDEA, select Run -> Edit Configurations;

2. In the dialog that opens, enter -Duser.name=user in the VM options field, then restart.

If the VM options field is not shown, it can be added through the run configuration's Modify options menu.

To restart:

File -> Invalidate Caches -> Just restart

[Solved] PyTorch Load Model Error: Missing key(s) RuntimeError: Error(s) in loading state_dict for

PyTorch reports Missing key(s) when loading saved weights with torch.load()/load_state_dict().

Error condition: the following error occurs when loading the pretrained model:

RuntimeError: Error(s) in loading state_dict for :

Missing key(s) in state_dict: “features.0.weight” …

Unexpected key(s) in state_dict: “module.features.0.weight” …

 

Error reason:

The parameter keys of a model wrapped with nn.DataParallel carry an extra “module.” prefix compared with the keys of the same model without the wrapper.

 

Solution:

1. Loading a model trained with nn.DataParallel(net) into a plain net

Delete the “module.” prefix:

# original saved file with DataParallel
state_dict = torch.load('model_path')
# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:] # remove `module.`
    new_state_dict[name] = v
# load params
net.load_state_dict(new_state_dict)

An alternative is to modify the checkpoint in place:

checkpoint = torch.load('model_path')
for key in list(checkpoint.keys()):
    if 'module.' in key:
        checkpoint[key.replace('module.', '')] = checkpoint[key]
        del checkpoint[key]

net.load_state_dict(checkpoint)

Or wrap the network with nn.DataParallel when loading, so the keys match as-is:

checkpoint = torch.load('model_path')
net = torch.nn.DataParallel(net)
net.load_state_dict(checkpoint)

2. Loading a model trained with a plain net into nn.DataParallel(net)

Add .module when saving the weights

If you use torch.save() to save the weights, call net.module.state_dict() to get the model weights:

torch.save(net.module.state_dict(), 'model_path')

Then load the weights into the plain net first and wrap it with nn.DataParallel afterwards:

net.load_state_dict(torch.load('model_path'))
net = nn.DataParallel(net, device_ids=[0, 1]) 

Or add the “module.” prefix to the checkpoint keys manually:

net = nn.DataParallel(net) 
from collections import OrderedDict
new_state_dict = OrderedDict()
state_dict = torch.load(savepath)  # savepath: path to the pre-trained weights
for k, v in state_dict.items():
	# add “module.” manually
    if 'module' not in k:
        k = 'module.'+k
    else:
    # Swap the location of modules and features
        k = k.replace('features.module.', 'module.features.')
    new_state_dict[k]=v

net.load_state_dict(new_state_dict)
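
Both directions can also be handled by one small helper that adapts the checkpoint keys to whatever the target model expects. A minimal sketch (the helper name is illustrative):

def adapt_dataparallel_keys(net, state_dict):
    """Add or strip the 'module.' prefix so the checkpoint keys match the model."""
    wants_prefix = next(iter(net.state_dict())).startswith('module.')
    has_prefix = next(iter(state_dict)).startswith('module.')
    if wants_prefix and not has_prefix:
        return {'module.' + k: v for k, v in state_dict.items()}
    if has_prefix and not wants_prefix:
        return {k[len('module.'):]: v for k, v in state_dict.items()}
    return state_dict

# net.load_state_dict(adapt_dataparallel_keys(net, torch.load('model_path')))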

[Solved] pyinstaller: error: unrecognized arguments: sklearn

How to Solve Error: pyinstaller: error: unrecognized arguments: sklearn

 

Solution:

In cmd, run:

pyinstaller main.py --hidden-import PySide2.QtXml --hidden-import sklearn --hidden-import sklearn.ensemble._forest --icon="logo.ico"

That is, add the unrecognized sklearn as a hidden import with --hidden-import sklearn, and the issue is fixed.
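
The hidden imports can also be kept in the project's PyInstaller .spec file, so the flags do not have to be repeated on every build. A minimal sketch of the relevant excerpt, assuming the spec file generated for main.py is named main.spec:

# main.spec (excerpt) - a PyInstaller spec file is plain Python
a = Analysis(
    ['main.py'],
    hiddenimports=['PySide2.QtXml', 'sklearn', 'sklearn.ensemble._forest'],
)

Afterwards, build with pyinstaller main.spec instead of passing the flags on the command line.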

 

[Solved] Compose error “HTTP request took too long to complete“

Recently, I noticed that the following error messages often appear in docker-compose:

ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

Adding COMPOSE_HTTP_TIMEOUT seems only to delay the error. Is this a known issue, or is there a workaround?
I tried increasing the COMPOSE_HTTP_TIMEOUT environment variable, and it did not help:

# Increase timeout period to 120 seconds.
export COMPOSE_HTTP_TIMEOUT=120;
# Rebuild all containers using the new images.
docker-compose up -d;

# or use docker-compose --verbose up -d to check out the errors

My Solution:

sudo service docker restart
docker-compose up

Maybe my inode usage is too high:

df -ih    # -i, --inodes: list inode usage instead of block usage
Filesystem     Inodes IUsed IFree IUse% Mounted on
udev             493K   390  492K    1% /dev
tmpfs            494K   537  494K    1% /run
/dev/xvda1       1.3M  1.2M   70K   95% /
tmpfs            494K     1  494K    1% /dev/shm
tmpfs            494K     3  494K    1% /run/lock
tmpfs            494K    16  494K    1% /sys/fs/cgroup
tmpfs            494K     4  494K    1% /run/user/1000
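
With /dev/xvda1 at 95% inode usage, the next step is to find which directories hold the most files. A small sketch for that (the default starting directory "/var" is only an example; run it with appropriate permissions):

import os
import sys

# Count directory entries (a rough proxy for inodes) under each immediate child
base = sys.argv[1] if len(sys.argv) > 1 else "/var"   # example starting point
for entry in sorted(os.listdir(base)):
    path = os.path.join(base, entry)
    if not os.path.isdir(path):
        continue
    total = sum(len(dirs) + len(files)
                for _, dirs, files in os.walk(path, onerror=lambda e: None))
    print(f"{total:>10}  {path}")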