Tag Archives: Pandas

Pandas parse_dates exception: automatically skip invalid dates

When processing raw data like the following, an error occurs:

id,name,date
0,a,2020/01/01
0,b,2020/01/01
0,c,2020/01/01
0,d,2020/01/01
0,e,2020/01/01
0,f,9999/01/01

It was processed with pandas:

data = pandas.read_csv(file, sep=",", encoding="ISO-8859-1", parse_dates=["date"], date_parser=lambda x: pandas.to_datetime(x, format="%Y/%m/%d"))

But at runtime it fails with an out-of-bounds timestamp error (OutOfBoundsDatetime): 9999/01/01 is later than the largest timestamp pandas can represent.

Our current approach is to skip the offending rows by having the parser coerce them to NaT. The date_parser needs to be changed as follows:

date_parser=lambda x: pandas.to_datetime(x, format="%Y/%m/%d", errors="coerce")

The errors parameter accepts three values. The default is 'raise': an exception is raised if a value does not parse.

Setting errors='coerce' turns values that cannot be parsed into NaT. If you do not want invalid values touched at all, set errors='ignore' and the input is returned unchanged.

From the pandas documentation:

errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    If 'raise', then invalid parsing will raise an exception.
    If 'coerce', then invalid parsing will be set as NaT.
    If 'ignore', then invalid parsing will return the input.
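
Putting this together, a minimal sketch of the corrected call (the file name is a placeholder; the separator and date format follow the sample data above):

import pandas

data = pandas.read_csv(
    "data.csv",  # placeholder path to the raw CSV shown above
    sep=",",
    encoding="ISO-8859-1",
    parse_dates=["date"],
    # errors="coerce" turns unparseable or out-of-bounds dates such as 9999/01/01 into NaT
    date_parser=lambda x: pandas.to_datetime(x, format="%Y/%m/%d", errors="coerce"),
)
data = data.dropna(subset=["date"])  # optionally drop the rows whose date became NaT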

How to Solve Pandas Error: nested renamer is not supported

Problem Description
After running df.groupby(['id'])['click'].agg({'click_std': 'std'}).reset_index(), I get the error "SpecificationError: nested renamer is not supported".

Solution
In newer pandas versions, passing a dict such as {'click_std': 'std'} to agg is no longer supported; use named aggregation instead: df.groupby(['id'])['click'].agg(click_std='std').reset_index() runs successfully.
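
For reference, a minimal sketch of the new spelling on made-up data (the column names follow the snippet above):

import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "click": [0.5, 1.5, 2.0, 4.0]})

# Old, now unsupported: df.groupby(["id"])["click"].agg({"click_std": "std"})
# New named-aggregation syntax:
out = df.groupby(["id"])["click"].agg(click_std="std").reset_index()
print(out)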

Reference:

    https://stackoverflow.com/questions/60229375/solution-for-specificationerror-nested-renamer-is-not-supported-while-agg-alo
    https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.20.0.html#whatsnew-0200-api-breaking-deprecate-group-agg-dict
    https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html

[Solved] “terminated worker error” when using Jupyter Notebook

Error:
When calling pandas_profiling.ProfileReport(df) in a Jupyter Notebook, the following error is reported: “A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.”

Solution:
In a terminal, run pip list and check the joblib version (here it was joblib 0.13.2), then reinstall it:

pip uninstall joblib
pip install -U joblib

Restart the kernel, then execute the code again and it runs successfully!
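
A quick way to confirm the fix after restarting the kernel; this is only a sketch with placeholder data, not part of the original post:

import joblib
import pandas as pd
import pandas_profiling

print(joblib.__version__)  # confirm the upgraded joblib is the one being imported

df = pd.DataFrame({"a": range(1000), "b": range(1000)})  # placeholder data
report = pandas_profiling.ProfileReport(df)              # should no longer kill the worker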

Python TypeError: Unrecognized value type: <class 'str'> / dateutil.parser._parser.ParserError: Unknown string format

When I tried to convert a column of dates in the data frame into pandas' datetime format, I ran into the following error.

reader = pd.read_csv(f'new_files/2020-12-22-5-10.csv', usecols=['passCarTime'],dtype={'passCarTime':'string'})
pd.to_datetime(reader.passCarTime.head())
Out[98]: 
0   2020-12-22 10:00:00
1   2020-12-22 10:00:00
2   2020-12-22 10:00:00
3   2020-12-22 10:00:00
4   2020-12-22 10:00:00
Name: passCarTime, dtype: datetime64[ns]
pd.to_datetime(reader.passCarTime)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\arrays\datetimes.py", line 2085, in objects_to_datetime64ns
    values, tz_parsed = conversion.datetime_to_datetime64(data)
  File "pandas\_libs\tslibs\conversion.pyx", line 350, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-99-e1b00dc18517>", line 1, in <module>
    pd.to_datetime(reader.passCarTime)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\tools\datetimes.py", line 801, in to_datetime
    cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\tools\datetimes.py", line 178, in _maybe_cache
    cache_dates = convert_listlike(unique_dates, format)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\tools\datetimes.py", line 465, in _convert_listlike_datetimes
    result, tz_parsed = objects_to_datetime64ns(
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\arrays\datetimes.py", line 2090, in objects_to_datetime64ns
    raise e
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\arrays\datetimes.py", line 2075, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas\_libs\tslib.pyx", line 364, in pandas._libs.tslib.array_to_datetime
  File "pandas\_libs\tslib.pyx", line 591, in pandas._libs.tslib.array_to_datetime
  File "pandas\_libs\tslib.pyx", line 726, in pandas._libs.tslib.array_to_datetime_object
  File "pandas\_libs\tslib.pyx", line 717, in pandas._libs.tslib.array_to_datetime_object
  File "pandas\_libs\tslibs\parsing.pyx", line 243, in pandas._libs.tslibs.parsing.parse_datetime_string
  File "D:\PyCharm2020\python2020\lib\site-packages\dateutil\parser\_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "D:\PyCharm2020\python2020\lib\site-packages\dateutil\parser\_parser.py", line 649, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: passCarTime

I’m not a professional and my English is not good, so I couldn’t work out what was wrong. I checked that the date column of the file has no missing values and no dates that violate the format… This is very strange; feel free to leave a message, thanks in advance!

Solution

When converting, add the parameter errors='coerce'.

reader = pd.read_csv(f'new_files/2020-12-22-5-10.csv', usecols=['passCarTime'],dtype={'passCarTime':'string'})
reader.passCarTime = pd.to_datetime(reader.passCarTime,errors='coerce') 
reader.passCarTime.head()
Out[120]: 
0   2020-12-22 10:00:00
1   2020-12-22 10:00:00
2   2020-12-22 10:00:00
3   2020-12-22 10:00:00
4   2020-12-22 10:00:00
Name: passCarTime, dtype: datetime64[ns]
reader.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307707 entries, 0 to 307706
Data columns (total 1 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   passCarTime  307703 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 2.3 MB
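
To see which rows were coerced to NaT (and what the original strings looked like), one possible check against the same CSV; this snippet is not from the original post:

import pandas as pd

raw = pd.read_csv('new_files/2020-12-22-5-10.csv', usecols=['passCarTime'],
                  dtype={'passCarTime': 'string'})
parsed = pd.to_datetime(raw.passCarTime, errors='coerce')
print(raw[parsed.isna()])  # the original strings that could not be parsed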

ImportError: No module named indexes.base

Reproducing the problem

When I used pickle to reload data, the error was as follows:

Traceback (most recent call last):
  File "segment.py", line 17, in <module>
    word2id = pickle.load(pk)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named indexes.base

The reason

The same code and data were run on two different machines. At first I thought the failing machine was missing some Python packages, but there were too many packages to try one by one. Fortunately, I used virtualenv to copy the environment from the other machine, and after that everything ran without a problem. To find out which package was the culprit, I went back to the original environment, regenerated the pickle file there, and reloaded it; this time there was no error.

Summary

To sum up, the pandas version used when the pickle file was generated differs from the version used to load it. Whether you write code in Python or any other language, the runtime environment matters: a mismatched package version can be enough to break a program.
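
A quick way to spot this kind of mismatch is to print pandas.__version__ in both environments before loading; pandas also ships its own pickle helper, which handles some cross-version pickles. A sketch (the file name is made up):

import pandas as pd

print(pd.__version__)  # compare between the machine that wrote the pickle and the one that reads it

# pandas' read_pickle can load any pickled object and copes better with version differences
obj = pd.read_pickle("word2id.pkl")  # hypothetical file name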

Three methods of converting a dict into a DataFrame with pandas

Input: my_dict = {'i': 1, 'love': 2, 'you': 3}

Expected output: my_df

      0
i     1
love  2
you   3

If the keys and values in the dictionary are one-to-one scalars, calling my_df = pd.DataFrame(my_dict) directly raises "ValueError: If using all scalar values, you must pass an index".

 

The solution is as follows:

1. Specify an index when calling the DataFrame constructor

import pandas as pd

my_dict = {'i': 1, 'love': 2, 'you': 3}
my_df = pd.DataFrame(my_dict,index=[0]).T

print(my_df)

 

2. Wrap the dict in a list and pass it to DataFrame

import pandas as pd

my_dict = {'i': 1, 'love': 2, 'you': 3}
my_list = [my_dict]
my_df = pd.DataFrame(my_list).T

print(my_df)

 

3. Use the DataFrame.from_dict function

For specific parameters, please refer to the official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html

import pandas as pd

my_dict = {'i': 1, 'love': 2, 'you': 3}
my_df = pd.DataFrame.from_dict(my_dict, orient='index')

print(my_df)

Output results

      0
i     1
love  2
you   3
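
As a small extra (not in the original post), orient='index' also accepts a columns argument, which lets you name the value column; a sketch:

import pandas as pd

my_dict = {'i': 1, 'love': 2, 'you': 3}
my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['count'])
print(my_df)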

Pandas: generating new columns with loc

A very convenient way to work with pandas is through indexers such as loc, iloc and ix; here is a quick note:

df.loc[condition, 'new_column'] = value

If the new column name is an existing column name, the existing column is modified in place.

import pandas as pd
import numpy as np
 
data = pd.DataFrame(np.random.randint(0, 100, 40).reshape(10, 4), columns=list('abcd'))
print(data)
data.loc[data.d >= 50, 'greater_than_50'] = 'Yes'  # matching rows get 'Yes'; the rest are NaN
print(data)

Use loc to index by a condition, then assign to the new column based on the result of that condition. It is a very convenient, basic operation; I keep forgetting the details, so I am recording it here.
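
If you want the non-matching rows to get a default value instead of NaN, one way (a sketch following the same example) is to create the column first and then overwrite with loc:

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randint(0, 100, 40).reshape(10, 4), columns=list('abcd'))
data['greater_than_50'] = 'No'                     # default for every row
data.loc[data.d >= 50, 'greater_than_50'] = 'Yes'  # overwrite where the condition holds
print(data)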

 

Pandas unable to open an .xlsx file: xlrd.biffh.XLRDError: Excel xlsx file; not supported

The reason is that xlrd was recently updated to version 2.0.1, which only supports .xls files, so pandas.read_excel('xxx.xlsx') raises an error.
You can install an older version of xlrd; run in CMD:
pip uninstall xlrd
pip install xlrd==1.2.0
You can also open the .xlsx file with openpyxl instead of xlrd:
df = pandas.read_excel('data.xlsx', engine='openpyxl')
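
Note that the openpyxl engine has to be installed separately; a minimal sketch (the file name is a placeholder):

# install the alternative engine first (run in a terminal): pip install openpyxl
import pandas as pd

df = pd.read_excel('data.xlsx', engine='openpyxl')  # 'data.xlsx' is a placeholder path
print(df.head())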

Error when importing pandas in Windows 10, and the solution

If you install pandas with pip install pandas on Windows 10, you may find that pandas cannot be imported and an error is raised.

In that case, switch to a different pandas version. First uninstall the current pandas:

pip uninstall pandas

Then tell pip to install pandas 1.0.1:

pip install pandas==1.0.1
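
After the downgrade you can verify the installation like this (a quick check, not from the original post):

import pandas as pd
print(pd.__version__)  # should print 1.0.1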

 

Reindexing in pandas

Convention:

import pandas as pd
import numpy as np

Reindexing
reindex() is an important method of pandas objects; it creates a new object conformed to a new index.
I. Reindexing a Series object

se1=pd.Series([1,7,3,9],index=['d','c','a','f'])
se1

Code results:

d    1
c    7
a    3
f    9
dtype: int64

Calling reindex rearranges the values according to the new index; labels that are missing from the original index are filled with NaN.

se2=se1.reindex(['a','b','c','d','e','f'])
se2

Code results:

a    3.0
b    NaN
c    7.0
d    1.0
e    NaN
f    9.0
dtype: float64

Pass the method= argument to choose how gaps are filled when reindexing:
method='ffill' or 'pad': fill forward
method='bfill' or 'backfill': fill backward

se3=pd.Series(['blue','red','black'],index=[0,2,4])
se4=se3.reindex(range(6),method='ffill')
se4

Code results:

0     blue
1     blue
2      red
3      red
4    black
5    black
dtype: object

II. Reindexing a DataFrame object
For a DataFrame, reindex can modify the row index, the column index, or both.

df1=pd.DataFrame(np.arange(9).reshape(3,3),index=['a','c','d'],columns=['one','two','four'])
df1

Code results:

   one  two  four
a    0    1     2
c    3    4     5
d    6    7     8

By default, a single sequence passed to reindex is applied to the row index; the columns are left unchanged.

df1.reindex(['a','b','c','d'])

Code results:

   one  two  four
a  0.0  1.0   2.0
b  NaN  NaN   NaN
c  3.0  4.0   5.0
d  6.0  7.0   8.0

df1.reindex(index=['a','b','c','d'],columns=['one','two','three','four'])

Code results:

   one  two  three  four
a  0.0  1.0    NaN   2.0
b  NaN  NaN    NaN   NaN
c  3.0  4.0    NaN   5.0
d  6.0  7.0    NaN   8.0

Pass fill_value=n to fill the missing values with n:

df1.reindex(index=['a','b','c','d'],columns=['one','two','three','four'],fill_value=100)

Code results:

   one  two  three  four
a    0    1    100     2
b  100  100    100   100
c    3    4    100     5
d    6    7    100     8

Thanks for reading; I hope this helps. Your encouragement is appreciated!

Converting string object into datetime type in pandas

import pandas as pd


from pandas import DataFrame
from dateutil.parser import parse

Data

data = DataFrame(columns=['date'], data=['2020-11-01','2020-11-05','2020-11-08','2020-11-11'])
data

data.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    4 non-null      object
dtypes: object(1)
memory usage: 160.0+ bytes"""

Conversion

data['date'] = data['date'].apply(parse)

data.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    4 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 160.0 bytes
"""