Tag Archives: Pandas

[Solved] Python Pandas Read Error: OSError: initializing from file failed

Problem Description:

Loading a CSV file with pandas raises an error:

B = pd.read_csv("C:/Users/hp/Desktop/Hands-On Data Analysis/Unit 1 Project Collection/train.csv")
B.head(3)

The error reported:

OSError: Initializing from file failed

Cause analysis:

By default, the read_csv() method of pandas uses the C engine as its parser. When the file path contains Chinese characters, the C engine can fail to open the file in some cases.


Solution:

Specify the engine as Python when calling the read_csv() method

B = pd.read_csv("C:/Users/hp/Desktop/Hands-On Data Analysis/Unit 1 Project Collection/train.csv", engine='python')
B.head(3)
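
If switching engines is not an option, another commonly used workaround is to open the file yourself and pass the open file object to read_csv(), so that pandas does not have to resolve the path on its own. The sketch below assumes the file is UTF-8 encoded; read_csv_via_handle is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def read_csv_via_handle(path, encoding="utf-8", **kwargs):
    """Open the file with Python's own open() and let pandas parse the handle.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    # newline="" leaves line-ending handling to the CSV parser.
    with open(path, "r", encoding=encoding, newline="") as handle:
        return pd.read_csv(handle, **kwargs)


# Usage (path taken from the example above):
# B = read_csv_via_handle("C:/Users/hp/Desktop/Hands-On Data Analysis/Unit 1 Project Collection/train.csv")
# B.head(3)
```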

[Solved] SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 10-11: malformed

import pandas as pd

# Read a *.txt file using the read_table() function in the pandas library
data = pd.read_table(r'D:\New\test.txt',delimiter=',',encoding = 'UTF-8')
print(data)

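Solution for the error in the title: add r before the path string, as in the code above. The r prefix makes it a raw string, so backslash sequences such as \N in D:\New are no longer treated as escape sequences.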

python D:\New\MyTest.py
        name  date   id
0   jianghu  20210201  00001
1  jianghu1  20210202  00002
2  jianghu2  20210203  00003
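
The effect of the r prefix can be checked without pandas. The following is a small sketch, not part of the original post: it shows three equivalent spellings of the path and reproduces the SyntaxError by compiling the unprefixed literal from a string.

```python
# Three equivalent ways to write the same Windows path.
raw_path = r'D:\New\test.txt'          # raw string: backslashes kept as-is
escaped_path = 'D:\\New\\test.txt'     # each backslash escaped by hand
forward_path = 'D:/New/test.txt'       # forward slashes also work on Windows

assert raw_path == escaped_path

# Compiling the unprefixed literal reproduces the error, because \N starts
# a named-character escape such as \N{BULLET}.
bad_source = r"path = 'D:\New\test.txt'"
try:
    compile(bad_source, "<example>", "exec")
    error_message = None
except SyntaxError as exc:
    error_message = str(exc)

print(error_message)
```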

 

[Solved] AttributeError: module 'pandas' has no attribute 'rolling_count'

Problem Description:

I ran into this problem today while doing automated modeling. The code uses the iris data set to initialize the AutoML framework and passes in the training data. The last line, the call to fit, reports an error: AttributeError: module 'pandas' has no attribute 'rolling_count'. Posts on the Internet attributed this to the wrong pandas version, so I reinstalled pandas, but the error was still there.

The code uses Microsoft's FLAML automated modeling framework, installed directly with pip install flaml. The code is attached:

from flaml import AutoML
from sklearn.datasets import load_iris
import pandas as pd



iris = load_iris()
iris_data = pd.concat([pd.DataFrame(iris.data),pd.Series(iris.target)],axis=1)
iris_data.columns = ["_".join(feature.split(" ")[:2]) for feature in iris.feature_names]+["target"]
iris_data = iris_data[(iris_data.target==0) |(iris_data.target==1)]


flaml_automl = AutoML()
flaml_automl.fit(pd.DataFrame(iris_data.iloc[:,:-1]),iris_data.iloc[:,-1],time_budget=10,estimator_list=['lgbm','xgboost'])

After finally upgrading dask (pip install --upgrade dask), the code ran normally. Strangely, the error message does not mention dask at all. Some bloggers on the Internet say that dask provides interfaces to pandas and NumPy, so the error may come from an outdated dask calling a pandas function that no longer exists.

Finally, after upgrading dask, the problem is solved!
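
For context, pd.rolling_count belongs to the old module-level rolling functions, which current pandas releases no longer provide; the equivalent today is the .rolling() method. This is offered as background under the assumption that the outdated dask release was still referring to the old function, which the post itself does not confirm.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0])

# Old, removed style: pd.rolling_count(s, 2)
# Current style: count the non-NaN values in each 2-element window.
# min_periods=0 is passed explicitly so the short leading window is counted too.
counts = s.rolling(window=2, min_periods=0).count()
print(counts.tolist())
```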

ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied

Error content: ImportError: C extension: DLL load failed: access denied. not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.

Cause: a file that pandas depends on is missing from the environment, either deleted by mistake or deleted by antivirus software. The latter is usually the culprit.

Solution 1: uninstall pandas with pip uninstall pandas, then reinstall it with pip install pandas.

—————————————————————————
If you use the method above, you may run into another error.

Error content: ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: 'your project path\venv\lib\site-packages\pandas\_libs\tslibs\period.cp36-win_amd64.pyd'
Check the permissions.

Cause: files are missing from the folder named in the message above, and they cannot be written during installation.

Solution 2: delete the folder and reinstall pandas. In this example, delete the pandas folder under site-packages. Be careful to delete only the pandas folder inside site-packages, not site-packages itself. Deleting site-packages wipes out the whole environment.

—————————————————————————

If you still get the same error, congratulations: this is the one that confused me for half an hour.

Why? The cause is the antivirus software. I do not need to spell out which one; everyone knows it (think of the sum of the interior angles of a circle). Because it was running, it deleted the .pyd file while pandas was being installed.

Solution 3: turn off the antivirus software, then try Solution 2 again. It should succeed.
If it still fails, you are simply unlucky. Just kidding: send me a private message and we can work through it together.
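
Solution 2 depends on deleting the right folder. Because import pandas itself fails in this situation, one way to locate the folder without importing the package is importlib.util.find_spec from the standard library. This is a minimal sketch; package_dir is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the install folder of a top-level package without importing it.

    Hypothetical helper. Returns None if the package cannot be found.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


# The folder printed here is the one to delete in Solution 2
# (the pandas folder itself, not its parent site-packages).
print(package_dir("pandas"))
```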

Pandas read_csv pandas.errors.ParserError: Error tokenizing data

What you will learn
How to make pandas read_csv handle fields that contain commas and escaped double quotes
Prepare the data

# test.csv or test.txt
"1","123","4","\"data\""
"test","123","4","if(\"data\" = \"<test>\", (10*24))"

Wrong way

import pandas as pd

datas = pd.read_csv('test.txt', header=None, skip_blank_lines=True)

You get

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 5

Right way

import pandas as pd

datas = pd.read_csv('test.txt', header=None, skip_blank_lines=True, escapechar='\\')
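
To try this without creating a file, the same two lines can be fed to read_csv through io.StringIO. The sketch below uses the sample data from above and shows that the last field of the second row survives intact once escapechar is set.

```python
import io

import pandas as pd

# The sample data from above, kept verbatim in a raw string.
RAW = r'''"1","123","4","\"data\""
"test","123","4","if(\"data\" = \"<test>\", (10*24))"
'''

# Without escapechar, the backslash-escaped quotes split the last field.
try:
    pd.read_csv(io.StringIO(RAW), header=None)
    failed = False
except pd.errors.ParserError:
    failed = True

# With escapechar, \" is read as a literal double quote inside the field.
datas = pd.read_csv(io.StringIO(RAW), header=None, escapechar='\\')
print(failed)
print(datas.shape)
print(datas.iloc[1, 3])
```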

Digression

Many people on the Internet who hit this problem add the parameter error_bad_lines=False (tested: with the data above, the second row is lost). Newer pandas versions replace this parameter with on_bad_lines='skip'. If the amount of data is not large, you can inspect specific lines with: cat -n filename | head -n end_line_no | tail -n +start_line_no
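
On systems without cat, head and tail, the same line inspection can be sketched in Python. The helper name show_lines and the UTF-8 default are assumptions for illustration; line numbers are 1-based and inclusive, like the shell pipeline.

```python
from itertools import islice


def show_lines(filename, start_line_no, end_line_no, encoding="utf-8"):
    """Return (line_number, text) pairs for lines start..end, 1-based inclusive."""
    with open(filename, "r", encoding=encoding) as handle:
        # islice skips the first start_line_no - 1 lines and stops after end_line_no.
        selected = islice(handle, start_line_no - 1, end_line_no)
        return [(number, line.rstrip("\n"))
                for number, line in enumerate(selected, start=start_line_no)]


# Usage: inspect line 2, the one named in the ParserError message.
# for number, text in show_lines("test.txt", 2, 2):
#     print(number, text)
```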

Pandas Error: ValueError: setting an array element with a sequence.

Pandas apply returns multiple columns

Originally, I wanted to process the DataFrame row by row with np.vectorize() and return several new columns. This raised the error ValueError: setting an array element with a sequence.

import numpy as np
import pandas as pd

def test():
    arr = np.random.randn(4,4)
    cols = ['a', 'b', 'c']
    df = pd.DataFrame(data=arr,columns=['e','f','g','h'])
    def func(a,b,c):
        output1 = a+1
        output2 = b*2
        output3 = c-4
        return pd.Series([output1,output2,output3])
    vfunc = np.vectorize(func)
    df[cols] = vfunc(df['e'],df['f'],df['g'])
    print(df)
test()

The error occurs because what is assigned to df[cols] does not match what vfunc returns: np.vectorize expects func to return scalars, but func returns a three-element Series, so the shape of the result does not match the target columns. Use apply to solve it. The parameter result_type="expand" means that list-like results are converted into columns, so each returned value becomes the value of one column in the resulting DataFrame. In apply(func), the number of values returned by func must equal the number of columns in df[cols].

import numpy as np
import pandas as pd

def test():
    arr = np.random.randn(4,4)
    cols = ['a', 'b', 'c']
    df = pd.DataFrame(data=arr,columns=['e','f','g','h'])
    def func(row):
        a,b,c = row['e'],row['f'],row['g']
        output1 = a+1
        output2 = b*2
        output3 = c-4
        return output1,output2,output3
    df[cols] = df.apply(func,axis=1, result_type="expand")
    print(df)
test()

output

          e         f         g         h         a         b         c
0  0.493280 -0.092513 -3.014135 -0.361842  1.493280 -0.185027 -7.014135
1  0.300695 -0.745392  0.591653 -1.752471  1.300695 -1.490785 -3.408347
2 -0.033944 -1.556307 -0.359979  1.808213  0.966056 -3.112615 -4.359979
3  0.701741 -0.272337  0.041114  0.150049  1.701741 -0.544674 -3.958886
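
If you prefer to stay with np.vectorize, one alternative (not from the original post) is to return a plain tuple of scalars instead of a Series. np.vectorize then returns one array per output, and each array can be assigned to its own column. A minimal sketch with the same column names:

```python
import numpy as np
import pandas as pd


def func(a, b, c):
    # Return a tuple of scalars, not a Series, so that np.vectorize
    # produces one output array per returned value.
    return a + 1, b * 2, c - 4


df = pd.DataFrame(np.random.randn(4, 4), columns=['e', 'f', 'g', 'h'])
vfunc = np.vectorize(func)

# Three outputs -> a tuple of three arrays, each with one value per row.
out_a, out_b, out_c = vfunc(df['e'], df['f'], df['g'])
df['a'], df['b'], df['c'] = out_a, out_b, out_c
print(df)
```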

For a single column

df['id'] 

And

ID = ['id']
df[ID]

The results obtained are different. The former is a Series with values [1, 2, 3, 4], and the latter is a single-column DataFrame with values [[1], [2], [3], [4]].
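
The difference is easy to see by printing the type and shape of each result. A short sketch, assuming a DataFrame with a single id column holding 1 to 4:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4]})

as_series = df['id']      # string key -> pandas Series, one-dimensional
as_frame = df[['id']]     # list key   -> pandas DataFrame, two-dimensional

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
print(as_series.to_numpy().tolist())
print(as_frame.to_numpy().tolist())
```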

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

AttributeError: DatetimeProperties object has no attribute

1. Question

AttributeError: 'DatetimeProperties' object has no attribute 'weekday_name'

Simple test, run the following code:

import pandas as pd

# Create dates
dates = pd.Series(pd.date_range("7/26/2021", periods=3, freq="D"))
# Check the day of the week
print(dates.dt.weekday_name)
# Show only values
print(dates.dt.weekday)

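2. Solution

Change weekday_name to day_name(). The weekday_name attribute was removed in pandas 1.0; day_name() is a method, so it needs the parentheses.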

import pandas as pd

# Create dates
dates = pd.Series(pd.date_range("7/26/2021", periods=3, freq="D"))
# Check the day of the week
print(dates.dt.day_name())
# Show only values
print(dates.dt.weekday)

pandas parse_dates exception: automatically skip invalid dates

An error occurs when processing the following raw data:

id,name,date
0,a,2020/01/01
0,b,2020/01/01
0,c,2020/01/01
0,d,2020/01/01
0,e,2020/01/01
0,f,9999/01/01

It was read with pandas:

data = pandas.read_csv(file, sep=",", encoding="ISO-8859-1", parse_dates=["date"], date_parser=lambda x: pandas.to_datetime(x, format="%Y/%m/%d"))

But it fails at run time with an out of bounds timestamp error: with nanosecond resolution, the year 9999 is beyond the range a pandas timestamp can represent.

Our current approach is to skip the invalid value instead of failing.

Replace the date_parser argument with the following:

date_parser=lambda x: pd.to_datetime(x, errors="coerce")

The errors parameter accepts three values. The default is 'raise': an error is reported if a value cannot be parsed.

Set errors to 'coerce' to have values that cannot be parsed set to NaT. If you do not want invalid values to be touched at all, set errors to 'ignore', which returns the input unchanged. Note that recent pandas versions deprecate 'ignore'.

errors: {'ignore', 'raise', 'coerce'}, default 'raise'

If 'raise', then invalid parsing will raise an exception. If 'coerce', then invalid parsing will be set as NaT. If 'ignore', then invalid parsing will return the input.
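
The behaviour of errors can be checked directly on pd.to_datetime, without reading a file. In the sketch below, 2020/13/45 is a made-up invalid value added for illustration. Only that value is checked, because whether 9999/01/01 is out of bounds depends on the timestamp resolution in use.

```python
import pandas as pd

# "2020/13/45" is a hypothetical bad value (month 13 does not exist).
raw = pd.Series(["2020/01/01", "2020/13/45", "9999/01/01"])

# errors="coerce": values that cannot be parsed become NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")

# Keep only the rows whose date parsed successfully.
valid_rows = raw[parsed.notna()]

# errors="raise" (the default) stops at the first bad value.
try:
    pd.to_datetime(raw, format="%Y/%m/%d", errors="raise")
    raised = False
except ValueError:
    raised = True

print(parsed)
print(valid_rows.tolist())
print(raised)
```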