Tag Archives: Pandas

Solution to UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte


When reading a CSV file with Pandas, the read fails because the file contains Chinese characters: the 'utf-8' codec cannot decode the byte 0xc4 at position 0.

Solution:
When reading the file, add encoding='gbk',
e.g.: pddata = pd.read_csv('felipe.csv', encoding='gbk')
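As a quick, self-contained sketch (the file name and contents here are made up for illustration), writing a GBK-encoded CSV and reading it back shows the fix in action:

```python
import pandas as pd

# Write a small CSV containing Chinese text in GBK (hypothetical demo file).
with open("demo_gbk.csv", "w", encoding="gbk") as f:
    f.write("name,city\n小明,北京\n")

# Reading with the default utf-8 would fail; encoding='gbk' succeeds.
pddata = pd.read_csv("demo_gbk.csv", encoding="gbk")
print(pddata)
```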


If you're interested, read on for the reason!
As you know, the default encoding in Python is UTF-8. For an introduction to encodings, I recommend the chapter "Strings and Encodings" in Liao Xuefeng's Python tutorial. Since UTF-8 cannot correctly decode a CSV file with Chinese characters, we should select an encoding that can.
So which encodings can handle Chinese characters? Open the Python 3 official documentation and find the section on standard encodings, shown in the diagram below:

So which encoding should you switch to? The third column of the table, Languages, lists which languages each codec supports, so let's look there.
I won't reproduce the table here; if you're interested, go to the website. After a careful search, the candidates that may support Chinese are big5, big5hkscs, gb2312, gbk, gb18030, hz, and iso2022_jp_2. In my tests, gb2312, gbk, and gb18030 all read CSV files containing Chinese smoothly. (Since all three work, let's just go with gbk.)
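If you'd rather not test the codecs by hand, a small loop can try each candidate in turn. This is only a sketch with an inline demo file; in practice you would point it at your own CSV:

```python
import pandas as pd

# Hypothetical demo file encoded with gb18030 (for common characters its
# bytes coincide with gb2312/gbk, so the earlier codecs can decode it too).
with open("demo_cjk.csv", "wb") as f:
    f.write("列,值\n你好,1\n".encode("gb18030"))

df = None
for enc in ("gb2312", "gbk", "gb18030", "big5"):
    try:
        df = pd.read_csv("demo_cjk.csv", encoding=enc)
        print("decoded with", enc)
        break
    except UnicodeDecodeError:
        continue
```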

It works!

Pandas MemoryError

Problem description
I use pandas to process data, as is familiar to anyone who works with data. You read a CSV file with pd.read_csv, and then do whatever you like with it. Recently, while doing data preprocessing, I ran into a headache: every time the program reached pd.read_csv, it raised MemoryError. After much searching on Stack Overflow, I finally found the crux:
Switch to 64-bit (x64) Python
Yes, it’s as simple as that!
Appendix
As anyone who has installed Pandas on Windows knows, it's hard to do without some trouble. Given all the dependencies, and the unreliable network access in China, switching to 64-bit Python may not be so easy. Don't be impatient; here comes the magic tool:
Use Anaconda: pick the x64 installer, download it, and install it through its graphical installer. You get a 64-bit Python with a full set of data analysis packages, plus IPython and other tools. This is not an advertisement; I would even strongly recommend using it the very first time you install Python.
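To check whether your current interpreter is already 64-bit, the pointer size is a reliable indicator (a minimal sketch):

```python
import platform
import struct

# A 64-bit Python has 8-byte pointers; a 32-bit build has 4-byte pointers.
bits = struct.calcsize("P") * 8
print(bits, "bit,", platform.architecture()[0])
```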
The end

Pandas read_json() error: ValueError: Trailing data

I have a JSON file as follows:

{
  "cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140",
  "title": "2018上半年最热新歌TOP50",
  "author": "网易云音乐",
  "times": "1264万",
  "url": "https://music.163.com/playlist?id=2303649893",
  "id": "2303649893"
}
{
  "cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140",
  "title": "你的青春里有没有属于你的一首歌?",
  "author": "mayuko然",
  "times": "4576万",
  "url": "https://music.163.com/playlist?id=2201879658",
  "id": "2201879658"
}

When I try to read the file with pd.read_json('data.json'), I get an error. The relevant part of the traceback is as follows:

File "D:\python\lib\site-packages\pandas\io\json\json.py", line 853, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
ValueError: Trailing data

After a search on Baidu, I found that it was a JSON format error; it was the first time I learned there is such a thing as JSON Viewer. You need to make the dictionaries in the file elements of a list, modified as follows.

[{
  "cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140",
  "title": "2018上半年最热新歌TOP50",
  "author": "网易云音乐",
  "times": "1264万",
  "url": "https://music.163.com/playlist?id=2303649893",
  "id": "2303649893"
},
{
  "cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140",
  "title": "你的青春里有没有属于你的一首歌?",
  "author": "mayuko然",
  "times": "4576万",
  "url": "https://music.163.com/playlist?id=2201879658",
  "id": "2201879658"
}]

Another method is to make each line of the file a complete dictionary, and then adjust the parameters in the call: pd.read_json('data.json', lines=True). lines defaults to False; when set to True, read_json reads one JSON object per line. The pandas.read_json documentation explains it as follows:

lines : boolean, default False. Read the file as a json object per line. New in version 0.19.0.

The modified JSON file is as follows:

{"cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140","title": "2018上半年最热新歌TOP50","author": "网易云音乐","times": "1264万","url": "https://music.163.com/playlist?id=2303649893","id": "2303649893"}
{"cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140","title": "你的青春里有没有属于你的一首歌?","author": "mayuko然","times": "4576万","url": "https://music.163.com/playlist?id=2201879658","id": "2201879658"}
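The lines=True fix can be exercised without touching the disk by feeding read_json an in-memory buffer; this sketch uses a trimmed-down version of the records above:

```python
import io

import pandas as pd

# One complete JSON object per line (NDJSON), as in the modified file.
ndjson = (
    '{"title": "2018上半年最热新歌TOP50", "id": "2303649893"}\n'
    '{"title": "你的青春里有没有属于你的一首歌?", "id": "2201879658"}\n'
)

# lines=True tells read_json to parse one object per line.
df = pd.read_json(io.StringIO(ndjson), lines=True)
print(df.shape)
```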

Common Python error: "If using all scalar values, you must pass an index" (four solutions)

(author: Chen’s freebies)

1. Error scenario:

import pandas as pd
dict = {'a':1,'b':2,'c':3}
data = pd.DataFrame(dict)

2. Error reason:

The dictionary is passed in directly with scalar values, so an index is required; that is, index must be set when the DataFrame object is created.

3. Solution:

Creating a DataFrame object from a dictionary is a common requirement, but the code differs depending on the form of the dictionary. Look at the code: each of the following four methods avoids this error; choose whichever suits your needs.

import pandas as pd

# Method 1: set index directly when creating the DataFrame
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame(d, index=[0])
print(data)

# Method 2: use from_dict to convert a dict of scalar values, then transpose
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame.from_dict(d, orient='index').T
print(data)

# Method 3: don't pass scalar values; wrap each value in a list first
d = {'a': [1], 'b': [2], 'c': [3]}
data = pd.DataFrame(d)
print(data)

# Method 4: take the keys and values out as a list of (key, value) pairs
d = {'a': 1, 'b': 2, 'c': 3}
data = pd.DataFrame(list(d.items()))
print(data)

Summary of three methods for pandas to convert a dict into a DataFrame

Input: my_dict = {'i': 1, 'love': 2, 'you': 3}

Expected output: my_df

      0
i     1
love  2
you   3

If each key in the dictionary maps to a single scalar value, then directly calling my_df = pd.DataFrame(my_dict) raises "ValueError: If using all scalar values, you must pass an index".

Solutions:

1. Use DataFrame and specify an index

import pandas as pd

my_dict = {'i': 1, 'love': 2, 'you': 3}
my_df = pd.DataFrame(my_dict,index=[0]).T

print(my_df)

2. Wrap the dict in a list and pass it to DataFrame

import pandas as pd

my_dict = {'i': 1, 'love': 2, 'you': 3}
my_list = [my_dict]
my_df = pd.DataFrame(my_list).T

print(my_df)

3. Use DataFrame.from_dict

For the specific parameters, refer to the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html

import pandas as pd

my_dict = {'i': 1, 'love': 2, 'you': 3}
my_df = pd.DataFrame.from_dict(my_dict, orient='index')

print(my_df)

Output:

      0
i     1
love  2
you   3

Pandas: sorting by a column (sort_values)

There are many ways to sort in pandas; sort_values sorts by a given column.

df.sort_values("xxx", inplace=True)

sorts df by the column xxx. inplace defaults to False; when it is False, the order of the original df does not change, and the sorted copy is returned instead.
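A minimal sketch of both behaviors, with a made-up column name "xxx":

```python
import pandas as pd

df = pd.DataFrame({"xxx": [3, 1, 2], "y": ["c", "a", "b"]})

# inplace=False (the default): df keeps its order, a sorted copy is returned.
returned = df.sort_values("xxx")

# inplace=True: df itself is reordered in place.
df.sort_values("xxx", inplace=True)
print(df["xxx"].tolist())
```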

[Python] pandas DataFrame.to_excel: parameter overview and examples of writing to an Excel file

DataFrame.to_excel() writes a DataFrame to an Excel sheet. The signature is:

to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='', float_format=None,columns=None, 
header=True, index=True, index_label=None,startrow=0, startcol=0, engine=None, 
merge_cells=True, encoding=None,inf_rep='inf', verbose=True, freeze_panes=None)

Common parameters explained

  • excel_writer: target file path or an ExcelWriter object
In [16]: df = pd.read_csv('test.csv')

In [17]: df
Out[17]:
   index  a_name  b_name
0      0       1       3
1      1       2       3
2      2       3       4
# excel_writer: 'excel_output.xls' is the output path
In [18]: df.to_excel('excel_output.xls')

  • sheet_name: the name of the Excel sheet

# the resulting sheet name is 'biubiu'
In [20]: df.to_excel('excel_output.xls',sheet_name='biubiu')
  • na_rep: representation for missing values; can be set to a string
In [25]: df = pd.read_excel('excel_output.xls')

In [26]: df
Out[26]:
   index  a_name  b_name
0      0       1     3.0
1      1       2     3.0
2      2       3     NaN
# if na_rep is set to a bool, it is written to Excel as 0/1; a string or number can also be written
In [27]: df.to_excel('excel_output.xls',na_rep=True)

In [28]: pd.read_excel('excel_output.xls')
Out[28]:
   index  a_name  b_name
0      0       1       3
1      1       2       3
2      2       3       1

In [29]: df.to_excel('excel_output.xls',na_rep=False)

In [30]: pd.read_excel('excel_output.xls')
Out[30]:
   index  a_name  b_name
0      0       1       3
1      1       2       3
2      2       3       0

In [31]: df.to_excel('excel_output.xls',na_rep=11)

In [32]: pd.read_excel('excel_output.xls')
Out[32]:
   index  a_name  b_name
0      0       1       3
1      1       2       3
2      2       3      11
  • columns: select which columns to write out.
In [44]: df.to_excel('excel_output.xls',na_rep=11,columns=['index'])

In [45]: pd.read_excel('excel_output.xls')
Out[45]:
   index
0      0
1      1
2      2
  • header: write out the column names; defaults to True. If a list of strings is given, it is used as aliases for the column names; set header=None (or False) to omit the header row.
In [48]: df.to_excel('excel_output.xls',na_rep=11,index=False)

In [49]: pd.read_excel('excel_output.xls')
Out[49]:
   index  a_name  b_name
0      0       1       3
1      1       2       3
2      2       3      11

In [50]: df.to_excel('excel_output.xls',na_rep=11,index=False,header=None)

In [51]: pd.read_excel('excel_output.xls')
Out[51]:
   0  1   3
0  1  2   3
1  2  3  11
  • index: defaults to True, which writes the row index. When index=False, the row index (labels) is not written
  • index_label: sets the column name for the index column

Python: How to Reshape the data in Pandas DataFrame

Contents

Pivoting a Pandas DataFrame

Grouping data in a Pandas DataFrame

Summary


Using our dataset, we'll take a quick look at the reshaping operations that can easily be applied to it using the popular Pandas library, and then walk through an example.

  • download CSV and database files – 127.8 KB
  • download the source code – 122.4 KB

    This article is part of the Python and Pandas data cleaning series. It aims to leverage data science tools and techniques to get developers up and running quickly.

    If you would like to see the other articles in this series, they can be found here:

  • Part 1 – introduction to Pandas
  • Part 2 – loading CSV and SQL data into Pandas
  • Part 3 – correcting missing data
  • Part 4 – merging multiple data sets in Pandas
  • Part 5 – removing duplicate data in Pandas
  • Part 7 – using Seaborn and Pandas to visualize data

Even after the data set has been cleaned up, the Pandas DataFrame may need to be reshaped to make the most of the data. Reshaping is the term used for manipulating the table structure to form different data sets, such as turning a "wide" data table into a "long" one.

You will be familiar with this if you have used pivot tables in a spreadsheet, or the pivot and cross-tab support built into many relational databases.

For example, the table above (from the Pandas documentation) can be adjusted by pivoting, stacking, or unstacking it.

  • The stack method pivots a level of the column labels into the row index, making the table longer.
  • The unstack method pivots a level of the row index into the column labels, making the table wider.

In this section, we will look at a variety of methods for reshaping the data. We'll see how to use pivoting and stacking of data frames to get different views of the data.

    Please note that we have used the source data files from this series to create a full data set, which you can download at the top of this article.

    Pivoting a Pandas DataFrame

    We can use the pivot function to create a new DataFrame from an existing one. So far, our table has been indexed by purchase ID, but let's convert the previously created combinedData table into something more interesting.

    First, let's try the pivot method by starting a new code block and adding:

    productsByState = combinedData.pivot(index='product_id', columns='company', values='paid')

    The result looks something like this:

    Running this command results in a duplicate index error; pivot only works when each index/column pair occurs once in the DataFrame.

    But there's another way to solve this problem: pivot_table works similarly to pivot, but it aggregates duplicate values instead of raising an error.

Let's use this method with its default values:

productsByState = combinedData.pivot_table(index=['product_id', 'product'], columns='state', values='paid')

You can view the results here:

This produces a DataFrame containing the list of products, with each column holding the average amount paid per state. That isn't very useful yet, so let's change the aggregation method:

import numpy as np

reshapedData = combinedData.pivot_table(index=['product_id', 'product'], columns='state', values='paid', aggfunc=np.sum)
reshapedData = reshapedData.fillna(0)
print(reshapedData.head(10))

Now this generates a product table containing the total sales of each product in each state. The second line also replaces the NaN values with 0, on the assumption that the product simply is not sold in that state.
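Since combinedData itself isn't shown in this article, here is a self-contained sketch of the same pivot_table call on toy data whose column names mirror the article's:

```python
import pandas as pd

# Toy stand-in for combinedData; the column names match the article's calls.
combined = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "product":    ["tea", "tea", "mug", "mug"],
    "state":      ["NY", "NY", "CA", "NY"],
    "paid":       [5.0, 7.0, 3.0, 4.0],
})

# Aggregate duplicates by summing, then replace NaN (no sales) with 0.
reshaped = combined.pivot_table(
    index=["product_id", "product"], columns="state",
    values="paid", aggfunc="sum",
).fillna(0)
print(reshaped)
```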

Grouping data in a Pandas DataFrame

Another reshaping activity we'll look at is grouping data elements together. Let's go back to the original large DataFrame and create a new DataFrame.

  • The groupby method takes the large data set and groups the rows by column values.

Start a new code block and add:

volumesData = combinedData.groupby(by='customer_id')
print(volumesData.head(10))

The results are as follows:

This doesn't seem to do anything yet, because our DataFrame was indexed on purchase_id.

Let's add an aggregation function to summarize the data so that our grouping works as expected:

volumesData = combinedData.groupby(by='customer_id').sum()
print(volumesData.head(10))

Again, this is the result:

This groups the data set the way we expected, but we seem to be missing some columns, and summing purchase_id doesn't make any sense, so let's extend the groupby method and drop the purchase_id column:

volumesData = combinedData.groupby(by=['customer_id','first_name','last_name','product_id','product']).sum()
volumesData.drop(columns='purchase_id', inplace=True)
print(volumesData.head(10))

this is our new result:

The end result looks good and gives us a clear picture of what each customer is buying, in what quantity, and how much they are paying.

Finally, we will make one more change to the grouped data set. Add the following to create a DataFrame of totals for each state:

totalsData = combinedData.groupby(by='state').sum().reset_index()
totalsData.drop(columns=['purchase_id','customer_id','product_id'], inplace=True)

The key change here is that we added reset_index after the sum. This ensures that the generated DataFrame has a usable index for our visualization work.
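The same groupby-sum-reset_index pattern on toy data (the column names are assumed to match the article's combinedData):

```python
import pandas as pd

combined = pd.DataFrame({
    "state": ["NY", "CA", "NY"],
    "paid":  [5.0, 3.0, 4.0],
})

# sum() aggregates per state; reset_index turns the group keys back
# into an ordinary column, giving the frame a plain integer index.
totals = combined.groupby(by="state").sum().reset_index()
print(totals)
```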

Summary

We have taken a complete, clean data set and reshaped it in several different ways, giving us more insight into the data.

Next, we’ll look at visualizations and see how they can be an important tool for presenting our data and ensuring that the results are clean.