Tag Archives: Data mining

[Solved] Python Pandas Read Error: OSError: initializing from file failed

Problem Description:

An error occurs when loading a CSV file with pandas:

B = pd.read_csv("C:/Users/hp/Desktop/Hands-On Data Analysis/Unit 1 Project Collection/train.csv")
B.head(3)

The error reported:

OSError: Initializing from file failed

Cause analysis:

When read_csv() is called, pandas uses the C engine as the parser by default. If the file path contains Chinese characters, the C engine can fail to open the file in some cases.


Solution:

Specify the Python engine when calling read_csv():

B = pd.read_csv("C:/Users/hp/Desktop/Hands-On-Data-Analysis/Unit-1-Project-Collection/train.csv",engine='python')
B.head(3)
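
Another workaround often suggested for paths with non-ASCII characters is to open the file with Python's built-in open() and hand the open file object to read_csv(), so that pandas does not have to resolve the path itself. A minimal sketch, assuming a UTF-8 encoded file; the helper name read_csv_via_handle and the path in the usage comment are hypothetical:

```python
import pandas as pd


def read_csv_via_handle(path, encoding="utf-8", **kwargs):
    """Open the file in Python, then let pandas parse the open handle."""
    # newline="" leaves line-ending handling to the parser.
    with open(path, "r", encoding=encoding, newline="") as handle:
        return pd.read_csv(handle, **kwargs)


# Hypothetical usage with a path that contains non-ASCII characters:
# B = read_csv_via_handle("C:/Users/hp/Desktop/数据/train.csv")
# B.head(3)
```

Extra keyword arguments are passed straight through to read_csv(), so the encoding is the only thing this wrapper decides on its own.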

Error: Discrete value supplied to continuous scale [How to Solve]

 

#Simulation data

df <- structure(list(`10` = c(0, 0, 0, 0, 0, 0), `33.95` = c(0, 0, 
0, 0, 0, 0), `58.66` = c(0, 0, 0, 0, 0, 0), `84.42` = c(0, 0, 
0, 0, 0, 0), `110.21` = c(0, 0, 0, 0, 0, 0), `134.16` = c(0, 
0, 0, 0, 0, 0), `164.69` = c(0, 0, 0, 0, 0, 0), `199.1` = c(0, 
0, 0, 0, 0, 0), `234.35` = c(0, 0, 0, 0, 0, 0), `257.19` = c(0, 
0, 0, 0, 0, 0), `361.84` = c(0, 0, 0, 0, 0, 0), `432.74` = c(0, 
0, 0, 0, 0, 0), `506.34` = c(1, 0, 0, 0, 0, 0), `581.46` = c(0, 
0, 0, 0, 0, 0), `651.71` = c(0, 0, 0, 0, 0, 0), `732.59` = c(0, 
0, 0, 0, 0, 1), `817.56` = c(0, 0, 0, 1, 0, 0), `896.24` = c(0, 
0, 0, 0, 0, 0), `971.77` = c(0, 1, 1, 1, 0, 1), `1038.91` = c(0, 
0, 0, 0, 0, 0), MW = c(3.9, 6.4, 7.4, 8.1, 9, 9.4)), .Names = c("10", 
"33.95", "58.66", "84.42", "110.21", "134.16", "164.69", "199.1", 
"234.35", "257.19", "361.84", "432.74", "506.34", "581.46", "651.71", 
"732.59", "817.56", "896.24", "971.77", "1038.91", "MW"), row.names = c("Merc", 
"Peug", "Fera", "Fiat", "Opel", "Volv"
), class = "data.frame")


df

Question:

library(reshape)
library(ggplot2)

## Plotting
meltDF = melt(df, id.vars = 'MW')
ggplot(meltDF[meltDF$value == 1,]) + geom_point(aes(x = MW, y = variable)) +
  scale_x_continuous(limits=c(0, 1200), breaks=c(0, 400, 800, 1200)) +
  scale_y_continuous(limits=c(0, 1200), breaks=c(0, 400, 800, 1200))

Solution:

After meltDF is defined, convert its factor column variable to numeric.

If the axis variable is numeric, use scale_x_continuous() (or scale_y_continuous()); if it is a character or factor, use scale_x_discrete() (or scale_y_discrete()).

meltDF$variable=as.numeric(levels(meltDF$variable))[meltDF$variable]


ggplot(meltDF[meltDF$value == 1,]) + geom_point(aes(x = MW, y =   variable)) +
     scale_x_continuous(limits=c(0, 1200), breaks=c(0, 400, 800, 1200)) +
     scale_y_continuous(limits=c(0, 1200), breaks=c(0, 400, 800, 1200))

Full Error Messages:
> library(reshape)
>
> ## Plotting
> meltDF = melt(df, id.vars = 'MW')
> ggplot(meltDF[meltDF$value == 1,]) + geom_point(aes(x = MW, y = variable)) +
+     scale_x_continuous(limits=c(0, 1200), breaks=c(0, 400, 800, 1200)) +
+     scale_y_continuous(limits=c(0, 1200), breaks=c(0, 400, 800, 1200))
Error: Discrete value supplied to continuous scale
>

[Solved] Scala error: type mismatch; found : java.util.List[?0] required: java.util.List[B]

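Problem:
Scala's type inference does not line up with the generic signatures of the Java methods: without explicit type arguments, the result of collect(Collectors.toList()) is inferred as java.util.List[?0] instead of java.util.List[B].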

import java.util
import java.util.stream.Collectors
class Animal
class Dog extends Animal
class Cat extends Animal

object ObjectConversions extends App {

  import java.util.{List => JList}
  implicit  def convertLowerBound[ B <: Animal] (a: JList[Animal]): JList[B] = a.stream().map(a => a.asInstanceOf[B]).collect(Collectors.toList())
  val a= new util.ArrayList[Animal]()
  a.add(new Cat)
  convertLowerBound[Cat](a)
}

Solution:

When calling Java generic methods from Scala, pass the type arguments explicitly instead of relying on inference.

Also check that the implicit conversions have been imported:
import scala.collection.JavaConversions._

def convertLowerBound[ B <: Animal] (a: JList[Animal]): JList[B] = a.stream().map[B](a => a.asInstanceOf[B]).collect(Collectors.toList[B]())


Or:

import scala.reflect.runtime.universe.TypeTag
def convertLowerBound[B <: Animal : TypeTag] (a: JList[Animal]) = a.asInstanceOf[JList[B]]

scala> def convertLowerBound[ B <: Animal] (a: JList[Animal]): JList[B] = a.stream().map[B](a => a.asInstanceOf[B]).collect(Collectors.toList[B]())
convertLowerBound: [B <: Animal](a: java.util.List[Animal])java.util.List[B]
scala> convertLowerBound[Cat](a)
res30: java.util.List[Cat] = [[email protected], [email protected]]
scala> a.add(new Cat())
res16: Boolean = true
scala> convertLowerBound[Cat](a)
res17: java.util.List[Cat] = [[email protected]]
scala> a.add(new Dog())
res19: Boolean = true
scala> convertLowerBound[Cat](a)
res20: java.util.List[Cat] = [[email protected], [email protected]]

Full error:
<console>:15: error: type mismatch;
found   : java.util.List[?0]
required: java.util.List[B]
Note: ?0 >: B, but Java-defined trait List is invariant in type E.
You may wish to investigate a wildcard type such as `_ >: B`. (SLS 3.2.10)
implicit  def convertLowerBound[ B <: Animal] (a: JList[Animal]): JList[B] = a.stream().map(a => a.asInstanceOf[B]).collect(Collectors.toList())

[Solved] ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library

ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead



Error:

file_name = os.listdir(base_dir)[0]

col_list = [feature list]
col = col_list
#encoding
#data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding="GBK",usecols=range(len(col)))
    
data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding = 'unicode_escape', engine ='python')


#data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding = 'utf-8', engine ='python')

path = "D:\\test\\repo\\data.csv"

Solution:

Pass engine='c' to read_csv():

file_name = os.listdir(base_dir)[0]

#encoding
#data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding="GBK",usecols=range(len(col)))
    
data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding = 'unicode_escape', engine ='c')


#data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding = 'utf-8', engine ='python')

path = "D:\\test\\repo\\data.csv"
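
If switching engines is not an option, the NUL bytes can also be removed before parsing. A minimal sketch using only the standard library; the function name read_rows_without_nul is hypothetical, and the utf-8 default is an assumption to adjust for the actual file:

```python
import csv
import io


def read_rows_without_nul(path, encoding="utf-8"):
    """Read a CSV file as a list of rows after dropping NUL (0x00) bytes."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # Drop the bytes that the pure-Python csv parser complains about.
    cleaned = raw.replace(b"\x00", b"")
    text = cleaned.decode(encoding)
    # newline="" keeps the original line endings for csv.reader to handle.
    return list(csv.reader(io.StringIO(text, newline="")))
```

The cleaned text could equally be handed to pandas through pd.read_csv(io.StringIO(text)). NUL bytes are often a sign that the file is really UTF-16 encoded; in that case reading it with encoding='utf-16' is likely the better fix than stripping bytes.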

Full Error Messages:
—————————————————————————

Error                                     Traceback (most recent call last)
D:\anaconda\lib\site-packages\pandas\io\parsers.py in _next_iter_line(self, row_num)
2967             assert self.data is not None
-> 2968             return next(self.data)
2969         except csv.Error as e:
Error: line contains NULL byte
During handling of the above exception, another exception occurred:
ParserError                               Traceback (most recent call last)
<ipython-input-12-c5d0c651c50e> in <module>
85                    ]
86
---> 87     data = inference_process(data_dir)
88     #print(data.head())
89     f=open("break_model1.pkl",'rb')
<ipython-input-12-c5d0c651c50e> in inference_process(base_dir)
18     #encoding
19 #     data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding="GBK",usecols=range(len(col)))
---> 20     data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding = 'unicode_escape', engine ='python')
21 #     data = pd.read_csv("D:\\test\\repo\\data.csv",sep = ',',encoding = 'utf-8', engine ='python')
22
D:\anaconda\lib\site-packages\pandas\io\parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
608     kwds.update(kwds_defaults)
609
--> 610     return _read(filepath_or_buffer, kwds)
611
612
D:\anaconda\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
460
461     # Create the parser.
--> 462     parser = TextFileReader(filepath_or_buffer, **kwds)
463
464     if chunksize or iterator:
D:\anaconda\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
817             self.options["has_index_names"] = kwds["has_index_names"]
818
--> 819         self._engine = self._make_engine(self.engine)
820
821     def close(self):
D:\anaconda\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
1048             )
1049         # error: Too many arguments for "ParserBase"
-> 1050         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
1051
1052     def _failover_to_python(self):
D:\anaconda\lib\site-packages\pandas\io\parsers.py in __init__(self, f, **kwds)
2308                 self.num_original_columns,
2309                 self.unnamed_cols,
-> 2310             ) = self._infer_columns()
2311         except (TypeError, ValueError):
2312             self.close()
D:\anaconda\lib\site-packages\pandas\io\parsers.py in _infer_columns(self)
2615             for level, hr in enumerate(header):
2616                 try:
-> 2617                     line = self._buffered_line()
2618
2619                     while self.line_pos <= hr:
D:\anaconda\lib\site-packages\pandas\io\parsers.py in _buffered_line(self)
2809             return self.buf[0]
2810         else:
-> 2811             return self._next_line()
2812
2813     def _check_for_bom(self, first_row):
D:\anaconda\lib\site-packages\pandas\io\parsers.py in _next_line(self)
2906
2907             while True:
-> 2908                 orig_line = self._next_iter_line(row_num=self.pos + 1)
2909                 self.pos += 1
2910
D:\anaconda\lib\site-packages\pandas\io\parsers.py in _next_iter_line(self, row_num)
2989                     msg += ". " + reason
2990
-> 2991                 self._alert_malformed(msg, row_num)
2992             return None
2993
D:\anaconda\lib\site-packages\pandas\io\parsers.py in _alert_malformed(self, msg, row_num)
2946         """
2947         if self.error_bad_lines:
-> 2948             raise ParserError(msg)
2949         elif self.warn_bad_lines:
2950             base = f"Skipping line {row_num}: "
ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead

Error: could not find function … in R [How to Solve]

Question:

> mytest.ax(lable,prediction)
Error in mytest.ax(lable, prediction) :
could not find function "mytest.ax"

Solution:

First, check that the function name is written correctly. R function names are case sensitive.

Second, check that the package containing the function is installed: install.packages("package_name")

Third, check that the package is loaded, using either of:

require(package_name)

library(package_name)

Call require(package_name) (and check its return value) or library(package_name). This has to be done every time you start a new R session.

Fourth, check whether you are using an older version of R, or of the package, in which the function does not exist yet, or a newer version from which the function has been removed.

Fifth, functions are added and removed over time, so the referenced code may expect a newer or older version of the package than the one you installed. It may also be too new: CRAN may not carry the latest version yet.

Full error:

> mytest.ax(lable,prediction)
Error in mytest.ax(lable, prediction) :
could not find function "mytest.ax"

Error in plot.new() : figure margins too large

#Question

Fit a regression model, compute the DFBETAS values for each observation and the DFBETAS threshold, then plot the influence of each observation on each predictor variable.

#fit a regression model
model <- lm(mpg~disp+hp, data=mtcars)

#view model summary
summary(model)

#calculate DFBETAS for each observation in the model
dfbetas <- as.data.frame(dfbetas(model))

#display DFBETAS for each observation
dfbetas

#find number of observations
n <- nrow(mtcars)

#calculate DFBETAS threshold value
thresh <- 2/sqrt(n)

thresh

#specify 2 rows and 1 column in plotting region

#dev.off()
#par(mar = c(1, 1, 1, 1))

par(mfrow=c(2,1))

#plot DFBETAS for disp with threshold lines
plot(dfbetas$disp, type='h')
abline(h = thresh, lty = 2)
abline(h = -thresh, lty = 2)

#plot DFBETAS for hp with threshold lines 
plot(dfbetas$hp, type='h')
abline(h = thresh, lty = 2)
abline(h = -thresh, lty = 2)

#Solution

#shrink the figure margins so that they fit in the plotting device
par(mar = c(1, 1, 1, 1))

#fit a regression model
model <- lm(mpg~disp+hp, data=mtcars)

#view model summary
summary(model)

#calculate DFBETAS for each observation in the model
dfbetas <- as.data.frame(dfbetas(model))

#display DFBETAS for each observation
dfbetas

#find number of observations
n <- nrow(mtcars)

#calculate DFBETAS threshold value
thresh <- 2/sqrt(n)

thresh

#specify 2 rows and 1 column in plotting region

#dev.off()
par(mar = c(1, 1, 1, 1))

par(mfrow=c(2,1))

#plot DFBETAS for disp with threshold lines
plot(dfbetas$disp, type='h')
abline(h = thresh, lty = 2)
abline(h = -thresh, lty = 2)

#plot DFBETAS for hp with threshold lines 
plot(dfbetas$hp, type='h')
abline(h = thresh, lty = 2)
abline(h = -thresh, lty = 2)


Full Error Message:
> par(mfrow=c(2,1))
>
> #plot DFBETAS for disp with threshold lines
> plot(dfbetas$disp, type='h')
Error in plot.new() : figure margins too large
> abline(h = thresh, lty = 2)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet
> abline(h = -thresh, lty = 2)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet
>
> #plot DFBETAS for hp with threshold lines
> plot(dfbetas$hp, type='h')
Error in plot.new() : figure margins too large
> abline(h = thresh, lty = 2)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet
> abline(h = -thresh, lty = 2)
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
plot.new has not been called yet

Error in *** : subscript out of bounds [How to Solve]

Question:

The index is out of bounds: the code asks for a row or column that does not exist in the object.

#make this example reproducible
set.seed(0)

#create matrix with 10 rows and 3 columns
x = matrix(data = sample.int(100, 30), nrow = 10, ncol = 3)

#print matrix
print(x)

#attempt to display 11th row of matrix
x[11, ]

#attempt to display 4th column of matrix
x[, 4]

#attempt to display value in 11th row and 4th column
x[11, 4]

Solution:

#check the dimensions first, then use only indices that fall within them

#display number of rows and columns in matrix
dim(x)

#display 10th row of matrix
x[10, ]

#display number of columns in matrix
ncol(x)

#display 3rd column of matrix
x[, 3]

#display value in 10th row and 3rd column of matrix
x[10, 3]

Full error Messages:
>
> #attempt to display 11th row of matrix
> x[11, ]
Error in x[11, ] : subscript out of bounds
>
> #attempt to display 4th column of matrix
> x[, 4]
Error in x[, 4] : subscript out of bounds
>
> #attempt to display value in 11th row and 4th column
> x[11, 4]
Error in x[11, 4] : subscript out of bounds
>

AttributeError: 'NoneType' object has no attribute 'seconds'

Error: 'NoneType' object has no attribute 'seconds'

Timing an operation with datetime:
import datetime
starttime = datetime.datetime.now()
endtime = datetime.datetime.now()
print (endtime-starttime).seconds

In Python 3, print(endtime-starttime) returns None, so .seconds is looked up on None. After the change:
import datetime
starttime = datetime.datetime.now()
endtime = datetime.datetime.now()
a = (endtime-starttime).seconds
print(a)
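
A minimal sketch of the timing pattern in Python 3. It also shows total_seconds(), because .seconds holds only the seconds component of the difference and ignores whole days; the helper name elapsed_seconds and the fixed timestamps are illustrative assumptions:

```python
import datetime


def elapsed_seconds(start, end):
    """Return the full elapsed time between two datetimes, in seconds."""
    # The subtraction sits in its own parentheses, so the method is called
    # on the timedelta and not on the return value of print().
    return (end - start).total_seconds()


starttime = datetime.datetime(2021, 1, 1, 0, 0, 0)
endtime = datetime.datetime(2021, 1, 2, 0, 0, 5)  # one day and five seconds later

print((endtime - starttime).seconds)        # 5 (seconds component only)
print(elapsed_seconds(starttime, endtime))  # 86405.0 (whole duration)
```

For intervals shorter than a day both values agree, apart from total_seconds() returning a float.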

win32ctypes.pywin32.pywintypes.error: (2, 'LoadLibraryEx', 'The system cannot find the file specified.')

Questions

When packaging a .py file as a Windows .exe with PyInstaller, the following error occurs:

win32ctypes.pywin32.pywintypes.error: (2, 'LoadLibraryEx', 'The system cannot find the file specified.')

Searching Stack Overflow and CSDN turns up all kinds of suggestions:

Some say a virtual environment must be used, but I don't believe it;

Some say you must use a native Python installation (not Anaconda or another bundled distribution), but I don't believe it;

Some say it may be caused by a Python version that is too old or too new;

Some say the PyInstaller version may be too new. That one I believed, so I gave it a try.

I also suspected that temporary files left by an earlier build were interfering with the new build, but after deleting them and building again the same error remained.

Solution

#After going through the many suggestions on Stack Overflow and CSDN, downgrading PyInstaller finally worked:

#   pip install pyinstaller==3.5

#   pyinstaller -F prediction.py

#The prediction.exe file generated by the packaging step is stored in the dist directory


TypeError: ufunc 'isnan' not supported for the input types

It took me a long time to track down this error, so I hope this write-up helps.

The code below walks through the explanation.

da1
Out[1]: 
          a   b  c        aa
0  0.200000  a1  1  0.200000
1  0.500000  a2  2  0.500000
2  0.428571  a3  3  0.428571
3       NaN  a2  4       NaN
4  0.833333  a1  5  0.833333
5  0.750000  a1  6  0.750000
6  0.777778  a3  7  0.777778
7       NaN  a1  8       NaN
8      test  a3  9       NaN

In [2]: ddn1 = da1['a'].values

In [3]: ddn1
Out[3]: 
array([0.2, 0.5, 0.42857142857142855, nan, 0.8333333333333334, 0.75,
       0.7777777777777778, nan, 'test'], dtype=object)

The dtype of the NumPy array is object.

In [4]: np.isnan(ddn1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-414-406cd3e92434> in <module>
----> 1 np.isnan(ddn1)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The error occurs because the dtype of the array is object, not a numeric type.

In [5]: type(ddn1[:8])
Out[5]: numpy.ndarray
In [6]: type(ddn1[8])
Out[6]: str

Although the earlier values are all numbers, the last value is a string, so the values in the array are not all of the same type.

In [7]: ddn1 = ddn1[:8]

In [8]: ddn1
Out[8]: 
array([0.2, 0.5, 0.42857142857142855, nan, 0.8333333333333334, 0.75,
       0.7777777777777778, nan], dtype=object)

Even after the trailing string is sliced off, the dtype of the array is still object.

ddn1 = ddn1.astype('float')

ddn1
Out[9]: 
array([0.2       , 0.5       , 0.42857143,        nan, 0.83333333,
       0.75      , 0.77777778,        nan])

np.isnan(ddn1)
Out[10]: array([False, False, False,  True, False, False, False,  True])

The array has to be converted explicitly to a numeric type (float here).

In [11]: ddn1 = np.append(ddn1,'test')

In [12]: ddn1
Out[12]: 
array(['0.2', '0.5', '0.42857142857142855', 'nan', '0.8333333333333334',
       '0.75', '0.7777777777777778', 'nan', 'test'], dtype='<U32')
In [13]: np.isnan(np.append(ddn1,'test'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-440-26598f53c9e6> in <module>
----> 1 np.isnan(np.append(ddn1,'test'))

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

When a text value is appended, the dtype of the array changes again to a non-numeric type, and calling np.isnan on it raises the same error.

The conclusion is that, to avoid the error, the dtype of the array must be numeric (float or int).
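
When the column cannot be guaranteed to be numeric, pandas offers two ways around the problem. A minimal sketch, assuming the goal is a boolean mask of missing values; the small array below is made up to mirror the mixed float/string column in the example:

```python
import numpy as np
import pandas as pd

# Object array that mixes floats, NaN and a string.
values = np.array([0.2, 0.5, np.nan, "test"], dtype=object)

# pd.isna() accepts object arrays and flags only the real missing values.
missing_mask = pd.isna(values)

# pd.to_numeric(..., errors="coerce") turns non-numeric entries into NaN,
# which gives a float array on which np.isnan() is defined.
numeric = pd.to_numeric(pd.Series(values), errors="coerce").to_numpy()
nan_mask = np.isnan(numeric)

print(missing_mask)
print(nan_mask)
```

The two masks differ on purpose: pd.isna() leaves the string alone, while the coerced version treats anything non-numeric as missing. Which one is appropriate depends on whether stray text should count as a missing value.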

Multicollinearity: causes, symptoms, tests and remedies

Recently, in a regression analysis, the sign of a correlation coefficient turned out to be opposite to that of the corresponding regression coefficient. After some investigation this was confirmed to be a multicollinearity problem, so I looked into the remedies.

The relevant background on multicollinearity is summarized below.

Two explanatory variables may be highly correlated in theory while their observed values are not, and vice versa. Multicollinearity is therefore essentially a data problem.

There are several causes of multicollinearity:

1. All explanatory variables share the same time trend;

2. One explanatory variable is the lag of the other, and they tend to follow the same trend;

3. Because the basis of data collection is not wide enough, some explanatory variables may change together;

4. There is a linear relationship between some explanatory variables;

Symptoms:

1. The sign of an estimated coefficient is wrong;

2. Some important explanatory variables have low t values, although R-squared is not low;

3. When an unimportant explanatory variable is deleted, the regression results change significantly;

Tests:

1. Correlation analysis: a correlation coefficient above 0.8 indicates multicollinearity, but a low correlation coefficient does not rule it out;

2. VIF (variance inflation factor) test;

3. Condition number test;
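
The VIF and condition number checks listed above can be sketched with NumPy. This is an illustration under stated assumptions, not the textbook's procedure: the function names and the synthetic data are hypothetical, and the VIF is computed from the standard identity that it equals the diagonal of the inverse correlation matrix of the explanatory variables.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # The j-th diagonal entry of the inverse correlation matrix equals
    # 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the rest.
    return np.diag(np.linalg.inv(corr))


def condition_number(X):
    """Condition number of the standardized matrix of explanatory variables."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.cond(Z)


# Synthetic data: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = variance_inflation_factors(X)
kappa = condition_number(X)
print(vifs, kappa)
```

In this setup the first two columns get large VIFs while the third stays close to 1. A common rule of thumb reads a VIF above 10 as a sign of serious multicollinearity, though the cutoff is a convention rather than a formal test.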

Remedies:

1. Collect more data;

2. Impose constraints on the model;

3. Delete one or more of the collinear variables;

4. Transform the model appropriately;

5. Principal component regression

Principles for dealing with multicollinearity:

1. Multicollinearity is ubiquitous, and minor multicollinearity requires no action;

2. Serious multicollinearity can usually be detected from experience or from the regression results, for example a coefficient with the wrong sign or an important explanatory variable with a very low t value. Appropriate measures should be taken depending on the situation.

3. If the model is used only for prediction, multicollinearity can be left untreated as long as the fit is good. A model with multicollinearity often still predicts well;

The above is excerpted from "Econometrics Intermediate Course", chief editor Pan Shengchu.