Tag Archives: Data mining

Error in plot.new() : figure margins too large

#Question

Fit a regression model, calculate the DFBETAS value of each observation and the DFBETAS threshold, and finally visualize the influence of each observation on each predictor variable.

#fit a regression model
model <- lm(mpg~disp+hp, data=mtcars)

#view model summary
summary(model)

#calculate DFBETAS for each observation in the model
dfbetas <- as.data.frame(dfbetas(model))

#display DFBETAS for each observation
dfbetas

#find number of observations
n <- nrow(mtcars)

#calculate DFBETAS threshold value
thresh <- 2/sqrt(n)

thresh

#specify 2 rows and 1 column in plotting region

#dev.off()
#par(mar = c(1, 1, 1, 1))

par(mfrow=c(2,1))

#plot DFBETAS for disp with threshold lines
plot(dfbetas$disp, type='h')
abline(h = thresh, lty = 2)
abline(h = -thresh, lty = 2)

#plot DFBETAS for hp with threshold lines 
plot(dfbetas$hp, type='h')
abline(h = thresh, lty = 2)
abline(h = -thresh, lty = 2)

#Solution

#The error means the requested figure margins are larger than the plotting device itself (common when the plot pane is very small); shrinking the margins, enlarging the plot window, or resetting the device with dev.off() fixes it
par(mar = c(1, 1, 1, 1))

#fit a regression model
model <- lm(mpg~disp+hp, data=mtcars)

#view model summary
summary(model)

#calculate DFBETAS for each observation in the model
dfbetas <- as.data.frame(dfbetas(model))

#display DFBETAS for each observation
dfbetas

#find number of observations
n <- nrow(mtcars)

#calculate DFBETAS threshold value
thresh <- 2/sqrt(n)

thresh

#specify 2 rows and 1 column in plotting region

#dev.off()
par(mar = c(1, 1, 1, 1))

par(mfrow=c(2,1))

#plot DFBETAS for disp with threshold lines
plot(dfbetas$disp, type='h')
abline(h = thresh, lty = 2)
abline(h = -thresh, lty = 2)

#plot DFBETAS for hp with threshold lines 
plot(dfbetas$hp, type='h')
abline(h = thresh, lty = 2)
abline(h = -thresh, lty = 2)


Full Error Message:
> par(mfrow=c(2,1))
>
> #plot DFBETAS for disp with threshold lines
> plot(dfbetas$disp, type='h')
Error in plot.new() : figure margins too large
> abline(h = thresh, lty = 2)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet
> abline(h = -thresh, lty = 2)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet
>
> #plot DFBETAS for hp with threshold lines
> plot(dfbetas$hp, type='h')
Error in plot.new() : figure margins too large
> abline(h = thresh, lty = 2)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet
> abline(h = -thresh, lty = 2)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet

Error in *** : subscript out of bounds [How to Solve]

Question:

An index is out of bounds. How do we know the valid range of indices? Ask the matrix itself: check its dimensions.

#make this example reproducible
set.seed(0)

#create matrix with 10 rows and 3 columns
x = matrix(data = sample.int(100, 30), nrow = 10, ncol = 3)

#print matrix
print(x)

#attempt to display 11th row of matrix
x[11, ]

#attempt to display 4th column of matrix
x[, 4]

#attempt to display value in 11th row and 4th column
x[11, 4]

Solution:


#display number of rows and columns in matrix
dim(x)


#display 10th row of matrix
x[10, ]

#display number of columns in matrix
ncol(x)

#display 3rd column of matrix
x[, 3]

#display value in 10th row and 3rd column of matrix
x[10, 3]

Full error messages:
>
> #attempt to display 11th row of matrix
> x[11, ]
Error in x[11, ] : subscript out of bounds
>
> #attempt to display 4th column of matrix
> x[, 4]
Error in x[, 4] : subscript out of bounds
>
> #attempt to display value in 11th row and 4th column
> x[11, 4]
Error in x[11, 4] : subscript out of bounds
>

AttributeError: the object has no attribute.

Error: 'NoneType' object has no attribute 'seconds'

Timing an operation with datetime:

import datetime

starttime = datetime.datetime.now()
#... the operation being timed ...
endtime = datetime.datetime.now()
print (endtime - starttime).seconds

In Python 3 the last line parses as print(endtime - starttime).seconds: print() returns None, and reading .seconds from None raises the AttributeError above.

After the change:

import datetime

starttime = datetime.datetime.now()
#... the operation being timed ...
endtime = datetime.datetime.now()
print((endtime - starttime).seconds)

win32ctypes.pywin32.pywintypes.error: (2, 'LoadLibraryEx', 'The system cannot find the file specified.')



Questions

When using pyinstaller to package a .py file as a Windows exe program, the following problem was encountered:

win32ctypes.pywin32.pywintypes.error: (2, 'LoadLibraryEx', 'The system cannot find the file specified.')

On Stack Overflow and CSDN you can find all kinds of explanations:

Some say a virtual environment must be used, but I don't believe it;

Some say you must use a native Python environment (not Anaconda or another bundled distribution), but I don't believe it;

Some say it may be caused by a Python version that is too low or too high;

Some say your pyinstaller version may be too high. Let's give that a try; that one I believe.

#The following error occurred

win32ctypes.pywin32.pywintypes.error: (2, 'LoadLibraryEx', 'The system cannot find the file specified.')

#It was suspected that temporary files from the previous build were interfering with the new build; after deleting them and rebuilding, the same error remained

Solution

#Checking Stack Overflow and CSDN turns up all kinds of suggestions; in the end, downgrading the pyinstaller version succeeded

#   pip install pyinstaller==3.5

#   pyinstaller -F prediction.py

#The file structure after packaging: the prediction.exe file generated by the build is stored in the dist directory

Reference: pyinstaller

Reference: Packaging with pyinstaller failed. Error: win32ctypes.pywin32.pywintypes.error: (1920, 'LoadLibraryExW', 'The system cannot access this file.')

Reference: Pitfalls encountered with the Python pyinstaller packaging tool

Reference: pyinstaller win32ctypes.pywin32.pywintypes.error: (2, 'LoadLibraryExW', 'The system cannot find the file specified.')

Reference: pyinstaller win32ctypes.pywin32.pywintypes.error: (1920, 'LoadLibraryExW', 'The system cannot access the file.')

TypeError: ufunc 'isnan' not supported for the input types

It took me a long time to track down this problem, so I hope this writeup can save you some.

Let's walk through the code.

da1
Out[1]: 
          a   b  c        aa
0  0.200000  a1  1  0.200000
1  0.500000  a2  2  0.500000
2  0.428571  a3  3  0.428571
3       NaN  a2  4       NaN
4  0.833333  a1  5  0.833333
5  0.750000  a1  6  0.750000
6  0.777778  a3  7  0.777778
7       NaN  a1  8       NaN
8      test  a3  9       NaN

In [2]: ddn1 = da1['a'].values

In [3]: ddn1
Out[3]: 
array([0.2, 0.5, 0.42857142857142855, nan, 0.8333333333333334, 0.75,
       0.7777777777777778, nan, 'test'], dtype=object)

The dtype of the numpy array is object.

In [4]: np.isnan(ddn1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-414-406cd3e92434> in <module>
----> 1 np.isnan(ddn1)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The error occurs because the array's dtype is object, not a numeric type.

In [5]: type(ddn1[:8])
Out[5]: numpy.ndarray
In [6]: type(ddn1[8])
Out[6]: str

Although the preceding values are all numbers, the last value is a string, so the elements of the array do not all share one type.

In [7]: ddn1 = ddn1[:8]

In [8]: ddn1
Out[8]: 
array([0.2, 0.5, 0.42857142857142855, nan, 0.8333333333333334, 0.75,
       0.7777777777777778, nan], dtype=object)

Even after the trailing string is removed by slicing, the dtype of the array does not change.

ddn1 = ddn1.astype('float')

ddn1
Out[9]: 
array([0.2       , 0.5       , 0.42857143,        nan, 0.83333333,
       0.75      , 0.77777778,        nan])

np.isnan(ddn1)
Out[10]: array([False, False, False,  True, False, False, False,  True])

The array must be explicitly cast to a numeric type (here it is converted to float).

In [11]: ddn1 = np.append(ddn1,'test')

In [12]: ddn1
Out[12]: 
array(['0.2', '0.5', '0.42857142857142855', 'nan', '0.8333333333333334',
       '0.75', '0.7777777777777778', 'nan', 'test'], dtype='<U32')
In [13]: np.isnan(np.append(ddn1,'test'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-440-26598f53c9e6> in <module>
----> 1 np.isnan(np.append(ddn1,'test'))

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

When a text value is appended, the array's dtype changes again to a non-numeric type, and calling np.isnan on it fails once more.

The conclusion: to avoid this error, the values in the array must be numeric (float or int).
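If the data may contain stray strings, one safe conversion is a minimal sketch like the following, assuming pandas is available: pd.to_numeric with errors='coerce' turns entries that cannot be parsed as numbers into NaN, after which np.isnan works.

import numpy as np
import pandas as pd

ddn = np.array([0.2, 0.5, np.nan, 'test'], dtype=object)

#'test' cannot be parsed as a number, so it becomes NaN instead of raising
clean = pd.to_numeric(pd.Series(ddn), errors='coerce').to_numpy()

print(np.isnan(clean))
#[False False  True  True]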

The cause, discrimination, test and solution of Multicollinearity

Recently, in a regression analysis, the sign of a correlation coefficient came out opposite to the sign of the corresponding regression coefficient. After some research this was confirmed to be a multicollinearity problem, and solutions were explored.

The relevant knowledge about multicollinearity is organized below.

Two explanatory variables may be highly correlated in theory while their observed values are not, and vice versa; so multicollinearity is essentially a data problem.

There are several causes of multicollinearity:

1. All explanatory variables share the same time trend;

2. One explanatory variable is the lag of the other, and they tend to follow the same trend;

3. Because the data were collected over too narrow a base, some explanatory variables may vary together;

4. There is a linear relationship between some explanatory variables;

Recognizing it:

1. The estimated coefficients have signs that are not right;

2. Some important explanatory variables have low t values while R-squared is not low;

3. When an unimportant explanatory variable is deleted, the regression results change significantly;

Tests:

1. In correlation analysis, a correlation coefficient above 0.8 indicates possible multicollinearity, but a low correlation coefficient does not prove its absence;

2. The VIF test (a sketch follows this list);

3. The condition number test;
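For the VIF test, here is a minimal sketch in Python, assuming the explanatory variables sit in a pandas DataFrame X; variance_inflation_factor is statsmodels' implementation, and a VIF above roughly 10 is commonly read as serious multicollinearity.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    #prepend an intercept column, since the VIF is defined for a fitted model
    exog = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    return pd.Series(
        [variance_inflation_factor(exog, i + 1) for i in range(X.shape[1])],
        index=X.columns,
    )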

Solutions:

1. Collect more data;

2. Impose some constraints on the model;

3. Delete one or more collinear variables;

4. Transform the model appropriately;

5. Principal component regression (a sketch follows this list)
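For solution 5, a minimal sketch of principal component regression, assuming scikit-learn (an assumption, since the text names no library): PCA replaces the collinear explanatory variables with uncorrelated components before the regression is fitted.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  #nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=100)

#standardize, keep one principal component, then regress on it
pcr = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))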

The principles for dealing with multicollinearity are as follows:

1. Multicollinearity is ubiquitous; no measures need to be taken for minor multicollinearity problems;

2. Serious multicollinearity can be detected by experience or from the regression results, e.g. coefficient signs that are wrong, or very low t values for important explanatory variables; necessary measures should be taken according to the situation;

3. If the model is used only for prediction, then as long as it fits well, the multicollinearity problem need not be dealt with; a model with multicollinearity usually does not suffer in its predictions.

The above is excerpted from Intermediate Econometrics, edited in chief by Pan Shengchu.

Random forest algorithm learning

While doing Kaggle competitions recently, I found that the random forest algorithm performs very well on classification problems, in most cases far better than SVM, logistic regression, KNN and other algorithms. So I wanted to understand how the algorithm works.
To learn random forests, we first briefly introduce ensemble learning methods and decision tree algorithms. What follows is only a short introduction to these two methods (for a proper treatment, see Chapters 5 and 8 of Statistical Learning Methods).


Bagging and Boosting concepts and differences
This part mainly follows: http://www.cnblogs.com/liuwu265/p/4690486.html
Random forest belongs to the Bagging family of Ensemble Learning algorithms. Ensemble learning methods are mainly divided into Bagging and Boosting. Let's first look at the characteristics of, and differences between, the two approaches.
Bagging
The Bagging algorithm proceeds as follows:

    Randomly draw n training samples from the original sample set using the bootstrapping method; conduct k rounds of drawing in total, obtaining k training sets. (The k training sets are independent of one another, and elements can be repeated.) Train one model on each of the k training sets (the model type can be chosen for the specific problem: decision tree, KNN, etc.). For classification problems, the final result is generated by voting; for regression problems, the mean of the k models' predictions is taken as the final prediction. (All models are equally important.)

Boosting
The Boosting algorithm proceeds as follows:

    Establish a weight wi for each sample in the training set, indicating how much attention is paid to it; when the probability of a sample being misclassified is high, its weight is increased. Each iteration produces a weak classifier, and some strategy is needed to combine the weak classifiers into the final model. (For example, AdaBoost gives each weak classifier a weight and combines them linearly into the final classifier; weak classifiers with smaller error receive larger weights.)

The main differences between Bagging and Boosting

    Sample selection: Bagging uses Bootstrap sampling with replacement; in Boosting the training set of each round is unchanged, and only the weight of each sample changes.
    Sample weights: Bagging uses uniform sampling with equal weight for each sample; Boosting adjusts sample weights according to the error rate: the greater the error rate, the greater the weight.
    Prediction functions: in Bagging all prediction functions have equal weight; in Boosting prediction functions with lower error get greater weight.
    Parallel computation: in Bagging each prediction function can be generated in parallel; in Boosting each prediction function must be generated iteratively, in sequence.

Combining the decision tree with these algorithm frameworks yields the following new algorithms (a minimal sketch follows this list):
1) Bagging + decision tree = random forest
2) AdaBoost + decision tree = boosting tree
3) Gradient Boosting + decision tree = GBDT
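A minimal sketch of these three combinations, assuming scikit-learn (an assumption; the post itself names no library):

from sklearn.ensemble import (AdaBoostClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)

rf = RandomForestClassifier(n_estimators=100)        #1) Bagging + decision trees
boost = AdaBoostClassifier(n_estimators=100)         #2) AdaBoost + decision trees (stumps by default)
gbdt = GradientBoostingClassifier(n_estimators=100)  #3) Gradient Boosting + decision trees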


The decision tree
Common decision tree algorithms include ID3, C4.5 and CART. The three algorithms build their models in very similar ways but use different split criteria. The process of building a decision tree model is roughly as follows:
Generation of ID3 and C4.5 decision trees
Input: training set D, feature set A, threshold eps. Output: decision tree T.
    1. If all samples in D belong to the same class Ck, T is a single-node tree with Ck as the class label of that node; return T.
    2. If A is the empty set, i.e. there are no features left to split on, T is a single-node tree labeled with the class Ck that has the most instances in D; return T.
    3. Otherwise, compute the information gain (ID3) / information gain ratio (C4.5) of each feature in A with respect to D, and choose the feature Ag with the largest gain.
    4. If the information gain (ratio) of Ag is less than the threshold eps, T is a single-node tree labeled with the class Ck that has the most instances in D; return T.
    5. Otherwise, split D into non-empty subsets Di according to the values of Ag; for each Di, create a child node labeled with the class that has the most instances in Di; the node and its child nodes form the tree T; return T.
    6. For the i-th child node, with Di as the training set and A - {Ag} as the feature set, recursively apply steps 1-5 to obtain the subtree Ti; return Ti.
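As an illustration of step 3, a small sketch (not from the original post) of the information gain criterion, assuming discrete feature values and labels given as Python lists:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    #H(D) - H(D|A): how much splitting on the feature reduces entropy
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

#a feature that perfectly separates the two classes has a gain of 1 bit
print(information_gain(['x', 'x', 'y', 'y'], [0, 0, 1, 1]))  #1.0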
Generation of CART decision tree
Here is a brief introduction to the differences between CART and ID3 and C4.5.

    The CART tree is a binary tree, while ID3 and C4.5 trees can have multi-way splits. When generating subtrees, CART selects one feature and one value as the split point; the criterion for generating the two subtrees is the Gini index, and the feature and split point with the minimum Gini index are chosen.

Pruning of a decision tree
The pruning of a decision tree mainly serves to prevent overfitting; the process is not described in detail here.
The main idea is to backtrack upward from the leaf nodes, trying to prune each node and comparing the loss function of the decision tree before and after pruning. A globally optimal pruning scheme can be obtained via dynamic programming (tree DP; ACM competitors will know it).


Random Forests
Random forest is an important ensemble learning method based on Bagging; it can be used for classification, regression and other problems.
Random forests have many advantages:

    very high accuracy;
    the introduced randomness makes random forests hard to overfit;
    the same randomness gives them good tolerance to noise;
    they handle high-dimensional data without requiring feature selection;
    they handle both discrete and continuous data, and the data set does not need to be normalized;
    training is fast, and variable importance can be obtained;
    they are easy to parallelize.

Disadvantages of random forest:

    when the number of decision trees in the forest is large, training requires a lot of space and time;
    the model is hard to interpret: it is something of a black-box model.

Similar to the Bagging process described above, the construction of a random forest is roughly as follows:

    1. Use the bootstrapping method to draw m samples with replacement from the original training set; repeat n_tree times, generating n_tree training sets.
    2. Train one decision tree model on each of the n_tree training sets.
    3. For a single decision tree, assuming the training samples have n features, at each split choose the best feature according to the information gain / information gain ratio / Gini index, and keep splitting until all training samples at a node belong to the same class. Each tree is grown this way without pruning.
    4. The random forest consists of the decision trees so generated. For classification problems, the final result is decided by the votes of the tree classifiers; for regression problems, the mean of the trees' predictions determines the final result.
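A minimal usage sketch, assuming scikit-learn and its bundled iris data set; RandomForestClassifier implements the bagging-of-trees procedure described above:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#n_estimators plays the role of n_tree; each tree sees a bootstrap sample
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  #classification accuracy, by voting
print(model.feature_importances_)   #the variable importance noted above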

The reindex method of pandas

Convention:

import pandas as pd
import numpy as np

reindex: re-indexing
reindex() is an important method of pandas objects; it creates a new object conformed to a new index.
I. Reindexing Series objects

se1=pd.Series([1,7,3,9],index=['d','c','a','f'])
se1

Code results:

d    1
c    7
a    3
f    9
dtype: int64

Calling reindex rearranges the data according to the new index; positions whose index value had no existing entry are filled with NaN.

se2=se1.reindex(['a','b','c','d','e','f'])
se2

Code results:

a    3.0
b    NaN
c    7.0
d    1.0
e    NaN
f    9.0
dtype: float64

Passing method= selects the interpolation (fill) mode when reindexing:
method='ffill' or 'pad': fill forward
method='bfill' or 'backfill': fill backward

se3=pd.Series(['blue','red','black'],index=[0,2,4])
se4=se3.reindex(range(6),method='ffill')
se4

Code results:

0     blue
1     blue
2      red
3      red
4    black
5    black
dtype: object
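For comparison, a minimal sketch of backward filling (only forward filling is demonstrated above); index 5 has no later value to borrow from, so it remains NaN:

se5=se3.reindex(range(6),method='bfill')
se5

Code results:

0     blue
1      red
2      red
3    black
4    black
5      NaN
dtype: object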

II. Reindexing DataFrame objects
For a DataFrame object, reindex can modify both the row index and the column index.

df1=pd.DataFrame(np.arange(9).reshape(3,3),index=['a','c','d'],columns=['one','two','four'])
df1

Code results:

one two four
a 0 1 2
c 3 4 5
d 6 7 8

By default, a single sequence passed to reindex is applied to the row index; the columns are left unchanged.

df1.reindex(['a','b','c','d'])

Code results:

   one  two  four
a  0.0  1.0   2.0
b  NaN  NaN   NaN
c  3.0  4.0   5.0
d  6.0  7.0   8.0

df1.reindex(index=['a','b','c','d'],columns=['one','two','three','four'])

Code results:

   one  two  three  four
a  0.0  1.0    NaN   2.0
b  NaN  NaN    NaN   NaN
c  3.0  4.0    NaN   5.0
d  6.0  7.0    NaN   8.0

Passing fill_value=n replaces the missing values with n:

df1.reindex(index=['a','b','c','d'],columns=['one','two','three','four'],fill_value=100)

Code results:

one two three four
a 0 1 100 2
b 100 100 100 100
c 3 4 100 5
d 6 7 100 8

Thanks for reading;
I hope this helps!

No such file or directory

 
"No such file or directory" when reading a file in R

Recently, while reading a file, the following problem occurred:
> passenger = read.csv('international-airline-passengers.csv', sep=',')
Error in file(file, "rt") : cannot open the connection
Warning message:
In file(file, "rt") :
  cannot open file 'international-airline-passengers.csv': No such file or directory

R cannot find the file.

File paths are divided into absolute paths and relative paths. Absolute paths are cumbersome, so relative paths are usually used; a relative path is relative to the current working directory.

Using the getwd() function, you can get the current working directory.

Solution 1
Set the directory containing the international-airline-passengers.csv file as the working directory (with setwd()).

Solution 2
Copy the file international-airline-passengers.csv into the current working directory.

AttributeError: 'DataFrame' object has no attribute 'ix' error

“AttributeError: ‘DataFrame’ object has no attribute ‘ix'”

This error was reported recently when using the ix method of a DataFrame.

After searching online: the Series.ix and DataFrame.ix methods were removed as of pandas version 1.0.0.

My solution: use the DataFrame loc method or the iloc method instead; a small sketch follows.

Check the pandas documentation for details.
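A minimal sketch of the replacement, using a small hypothetical frame; .loc selects by label and .iloc by integer position:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])

#old, removed in pandas 1.0.0: df.ix['y', 'a'] or df.ix[1, 0]
print(df.loc['y', 'a'])   #label-based: 2
print(df.iloc[1, 0])      #position-based: 2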


reference: https://hacpai.com/article/1581255121678