Tag Archives: data analysis

[Solved] Python Pandas Read Error: OSError: initializing from file failed

Problem Description:

An error occurs when loading CSV data with pandas:

B = pd.read_csv("C:/Users/hp/Desktop/Hands-On Data Analysis/Unit 1 Project Collection/train.csv")
B.head(3)

The error reported:

OSError: Initializing from file failed

Cause analysis:

When read_csv() is called, pandas uses the C engine as the parser by default. When the file path contains Chinese characters, the C engine can fail to open the file in some cases.


Solution:

Specify the Python engine when calling read_csv():

B = pd.read_csv("C:/Users/hp/Desktop/Hands-On Data Analysis/Unit 1 Project Collection/train.csv", engine='python')
B.head(3)
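
If switching engines is not enough, another commonly used workaround is to open the file yourself and pass the file object to read_csv(), so that pandas does not have to resolve the path. This is only a minimal sketch, assuming a UTF-8 encoded file; the helper name read_csv_via_handle and the sample file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def read_csv_via_handle(path, **kwargs):
    """Open the file with Python's own open() and let pandas parse the handle."""
    with open(path, "r", encoding="utf-8", newline="") as fh:
        return pd.read_csv(fh, **kwargs)


# Build a small CSV whose file name contains Chinese characters (hypothetical data).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "数据.csv")
with open(csv_path, "w", encoding="utf-8") as fh:
    fh.write("a,b\n1,2\n3,4\n")

df = read_csv_via_handle(csv_path)
print(df.head(3))
```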

[Solved] python Error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.

Python Error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Code that raises the error:

if code in list(changed_code['Old material code']):
    temp_index = changed_code.loc[changed_code['Old material code'] == code].index

Here code is of type float.

Cause analysis:

In this check, the value being tested cannot be of float type. More generally, pandas raises this ValueError whenever a Series is used where a single True/False value is expected.

Solution:

Convert the float to an int:

if int(code) in list(changed_code['Old material code']):
    temp_index = changed_code.loc[changed_code['Old material code'] == code].index

Problem solved. Keep at it, everyone!
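
To show where the message itself comes from, here is a small sketch with made-up data. It shows that using a whole Series as an if condition raises this ValueError, and that reducing it with .any() gives a single boolean. The sample values are assumptions, not the original data.

```python
import pandas as pd

# Hypothetical stand-in for the changed_code table.
changed_code = pd.DataFrame({"Old material code": [1001, 1002, 1003]})

# Comparing a column with a value yields a boolean Series, not a single bool.
mask = changed_code["Old material code"] == 1002

try:
    if mask:  # ambiguous: the Series holds several True/False values
        pass
    error_text = ""
except ValueError as exc:
    error_text = str(exc)

# Reduce the Series to one boolean explicitly.
found = bool(mask.any())

# The same mask can be reused to fetch the matching row labels.
temp_index = changed_code.loc[mask].index
print(error_text)
print(found, list(temp_index))
```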

Python Pandas Error: KeyError: 0 [How to Solve]

The error reported is KeyError: 0.

The code that raises the error:

I call a function from my own library, which uses apply to implement Excel's VLOOKUP. The code is as follows:

data2 = super_function.vlook_up(data1, ['material group', 'material description'], data, ['material group', 'material group description'])

Error message:

KeyError: 0

Cause of the error:

This error is caused by the index. Some rows were deleted while processing the original data, so the index starts from 6 instead of 0, and looking up label 0 fails.

Solution:

Reassign the index:

data1.index = list(range(len(data1)))

Result:

Runs successfully.
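
The situation can be reproduced with a few lines. This is only a sketch with made-up data; it uses reset_index(drop=True), which has the same effect as assigning list(range(len(data1))) to the index.

```python
import pandas as pd

# Hypothetical data standing in for data1.
data1 = pd.DataFrame({"material group": ["A", "B", "C", "D"]})

# Deleting rows leaves gaps: the index now starts at 2 instead of 0.
data1 = data1.drop(index=[0, 1])

try:
    data1.loc[0, "material group"]  # label 0 no longer exists
    raised = False
except KeyError:
    raised = True

# Rebuild a clean 0..n-1 index and discard the old labels.
data1 = data1.reset_index(drop=True)
first = data1.loc[0, "material group"]
print(raised, first)
```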

[Solved] Excel plug in installation failed: unable to resolve the value of property ‘type’

[Description of the problem]
A third party supplied an Excel plug-in installation package (Figure 1). I have never built an Excel plug-in installer myself, but it presumably calls VSTOInstaller.exe.

The installation failed with the following message:

Error message: "The value of the property 'type' cannot be parsed. The error is: Could not load file or assembly 'Microsoft.Office.BusinessApplications.Fba,Version=14.0.0.0,Culture=nutral, PublicKeyToken=71e9ce111e9429c' or one of its dependencies. The system cannot find the file specified. (C:\Program Files\Common Files\Microsoft Shared\VSTO\10.0\VSTOInstaller.exe.Config Line 10)"

[Solution]
Locate the VSTO installer folder:

C:\Program Files (x86)\Common Files\Microsoft shared\VSTO\10.0 or C:\Program Files\Common Files\Microsoft shared\VSTO\10.0.

Rename VSTOInstaller.exe.config, for example to VSTOInstaller.exe.config.old, and reinstall. The installation then succeeds.

[Run Result]
The plug-in directory after installation.

The interface running normally.

RuntimeError: implement_array_function method already has a docstring(Pycharm install package error)

Recently I have been writing a course on financial analysis and forecasting in Python. Out of laziness, I did not set up the required libraries under conda in advance. Installing packages directly from PyCharm led to version incompatibilities and mismatches caused by the installation order, which resulted in the error
RuntimeError: implement_array_function method already has a docstring
The fix is to uninstall the packages first:
pip uninstall pandas
pip uninstall matplotlib
pip uninstall scipy
pip uninstall numpy
pip uninstall scikit-learn
and then reinstall them in the following order:
pip install numpy
pip install scipy
pip install pandas
pip install matplotlib
pip install scikit-learn
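
After reinstalling, it can help to check which versions actually ended up in the environment. The sketch below uses only the standard library; the helper name installed_versions is made up for illustration, and it only reports versions without judging whether they are compatible.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["numpy", "scipy", "pandas", "matplotlib", "scikit-learn"])
for name, version in report.items():
    print(f"{name}: {version}")
```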

numpy.random.rand()

numpy.random.randn(d0, d1, …, dn) returns one or more samples from the standard normal distribution.
numpy.random.rand(d0, d1, …, dn) returns random samples in [0, 1).

numpy.random.rand(d0, d1, …, dn)

The rand function generates data in [0, 1), including 0 and excluding 1, with the given dimensions. Each dn is the size of one dimension, and the return value is an array of the specified shape.

np.random.rand(4,2)
array([[0.64959905, 0.14584702],
       [0.56862369, 0.5992007 ],
       [0.42512475, 0.83075541],
       [0.75685279, 0.00910825]])

np.random.rand(4,3,2) # shape: 4*3*2
array([[[0.07304796, 0.48810928],
        [0.59523586, 0.83281804],
        [0.47530734, 0.50402275]],

       [[0.63153869, 0.19636159],
        [0.93727986, 0.13564719],
        [0.11122609, 0.59646316]],

       [[0.17276155, 0.66621767],
        [0.81926792, 0.28781293],
        [0.20228714, 0.72412133]],

       [[0.29365696, 0.53956076],
        [0.19105394, 0.47044441],
        [0.85930046, 0.3867359 ]]])
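
The shape and value range can be checked directly. A minimal sketch; the seed value 0 is an arbitrary choice, used only to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability

uniform = np.random.rand(4, 2)     # uniform samples in [0, 1), shape (4, 2)
normal = np.random.randn(4, 3, 2)  # standard normal samples, shape (4, 3, 2)
single = np.random.rand()          # no arguments: a single Python float

print(uniform.shape, normal.shape)
print(uniform.min() >= 0.0, uniform.max() < 1.0)
```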

 

Python data analysis dataframe converts dates to weeks

When doing data analysis with Python, how do you add a new column "week" that converts a date such as "2021-03-02" into the week number within 2021?

Two methods are introduced here. My DataFrame is called sample and has a column "date". Now add a new column "week".
1. Use dt.week

sample["week"]=sample["date"].dt.week

2. Use DatetimeIndex

sample['week'] = pd.DatetimeIndex(sample['date']).week

Note: both .dt.week and DatetimeIndex.week were deprecated in pandas 1.1 and removed in pandas 2.0; on recent versions use .dt.isocalendar().week instead.
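
On recent pandas versions the same column can be built with isocalendar(). A minimal sketch with two made-up dates, assuming ISO week numbering is what is wanted:

```python
import pandas as pd

# Hypothetical sample data; the dates are parsed into datetime values first.
sample = pd.DataFrame({"date": ["2021-03-02", "2021-01-04"]})
sample["date"] = pd.to_datetime(sample["date"])

# isocalendar() returns year / week / day columns; keep only the ISO week number.
sample["week"] = sample["date"].dt.isocalendar().week

print(sample)
```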

Pandas read_json() Error: ValueError: Trailing data

I have a JSON file as follows:

{
  "cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140",
  "title": "2018上半年最热新歌TOP50",
  "author": "网易云音乐",
  "times": "1264万",
  "url": "https://music.163.com/playlist?id=2303649893",
  "id": "2303649893"
}
{
  "cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140",
  "title": "你的青春里有没有属于你的一首歌?",
  "author": "mayuko然",
  "times": "4576万",
  "url": "https://music.163.com/playlist?id=2201879658",
  "id": "2201879658"
}

When I try to read the file with pd.read_json('data.json'), I get an error. The relevant part of the traceback is as follows:

File "D:\python\lib\site-packages\pandas\io\json\json.py", line 853, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
ValueError: Trailing data

After searching on Baidu I found that it was a JSON format error (this was also the first time I learned that tools such as a JSON viewer exist). The dictionaries in the file need to be stored as elements of a list, modified as follows:

[{
  "cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140",
  "title": "2018上半年最热新歌TOP50",
  "author": "网易云音乐",
  "times": "1264万",
  "url": "https://music.163.com/playlist?id=2303649893",
  "id": "2303649893"
},
{
  "cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140",
  "title": "你的青春里有没有属于你的一首歌?",
  "author": "mayuko然",
  "times": "4576万",
  "url": "https://music.163.com/playlist?id=2201879658",
  "id": "2201879658"
}]

Another method is to put one complete dictionary on each line of the file and then call pd.read_json('data.json', lines=True). lines defaults to False; setting it to True reads the file as one JSON object per line. The pandas.read_json documentation explains it as follows:

lines: boolean, default False. Read the file as a json object per line. New in version 0.19.0.

The modified JSON file is as follows:

{"cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140","title": "2018上半年最热新歌TOP50","author": "网易云音乐","times": "1264万","url": "https://music.163.com/playlist?id=2303649893","id": "2303649893"}
{"cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140","title": "你的青春里有没有属于你的一首歌?","author": "mayuko然","times": "4576万","url": "https://music.163.com/playlist?id=2201879658","id": "2201879658"}
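
Both layouts can be tried on a small scale. The sketch below writes two made-up records to temporary files, once as a JSON list and once as one object per line, and reads each back. The record contents and file names are assumptions for illustration only.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records standing in for the playlist data.
records = [
    {"title": "first playlist", "author": "alice"},
    {"title": "second playlist", "author": "bob"},
]
tmp_dir = tempfile.mkdtemp()

# Layout 1: all dictionaries wrapped in a single JSON list.
list_path = os.path.join(tmp_dir, "data_list.json")
with open(list_path, "w", encoding="utf-8") as fh:
    json.dump(records, fh)
df_list = pd.read_json(list_path)

# Layout 2: one complete JSON object per line, read with lines=True.
lines_path = os.path.join(tmp_dir, "data_lines.json")
with open(lines_path, "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")
df_lines = pd.read_json(lines_path, lines=True)

print(df_list)
print(df_lines)
```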

pd.to_csv Error: need to escape, but no escapechar set

The to_csv call:

df3.to_csv('E:\\data\\xxxx.csv',index=False,header= 0,sep='|', encoding="utf-8", quoting=csv.QUOTE_NONE)

Error: need to escape, but no escapechar set

Reason: the description column probably contains '|', which is also the delimiter. The csv writer tries to escape it but cannot, because no escapechar is set.

Solution:
Provide an escapechar. When quoting is QUOTE_NONE, escapechar specifies the character used to escape the delimiter wherever it appears inside a field.

df3.to_csv('E:\\data\\xxxx.csv',index=False,header= 0,sep='|', encoding="utf-8", quoting=csv.QUOTE_NONE,escapechar='|')
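
The behaviour can be reproduced in memory. The sketch below uses made-up data and, as an assumption that differs from the call above, a backslash as the escape character, so that escaped delimiters stay distinguishable from real ones in the output.

```python
import csv
import io

import pandas as pd

# Hypothetical frame whose text column contains the delimiter.
df3 = pd.DataFrame({"description": ["a|b"], "qty": [1]})

# Without an escapechar, QUOTE_NONE leaves the writer no way to protect the '|'.
try:
    df3.to_csv(io.StringIO(), index=False, header=False, sep="|",
               quoting=csv.QUOTE_NONE)
    raised = False
except csv.Error:
    raised = True

# With an escapechar, the embedded delimiter is written as '\|'.
buffer = io.StringIO()
df3.to_csv(buffer, index=False, header=False, sep="|",
           quoting=csv.QUOTE_NONE, escapechar="\\")
written = buffer.getvalue().splitlines()
print(raised, written)
```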

Reference:

https://stackoverflow.com/questions/32107790/writing-to-csv-getting-error-need-to-escape-for-a-blank-string

Python+ Pandas + Evaluation of Music Equipment over the years (Notes)

Judging from the title, this lesson explores how something changes over time, so the core content is exploring trends in time series data. Details are as follows:

1. Fetch data

data source: http://jmcauley.ucsd.edu/data/amazon/links.html

Amazon, the e-commerce website, provides some data resources. The data on the page above covers product reviews from May 1996 to July 2014, about 18 years. The "ratings only" files have the columns "user, item, rating, timestamp".
We download the ratings file for "Musical Instruments". (The download is very slow and can take almost a whole day; you can instead use the file already downloaded by the teacher: time analysis/https://www.njcie.com/python/2)
2. Processing data

1. Read data

[script]

import pandas as pd

rnames = ['uid', 'pid', 'rating', 'timestamp']
ratings = pd.read_csv('D:\\ratings_Musical_Instruments.csv', header=None, names=rnames)

2. Processing time stamp

[script]

from datetime import datetime

ratings['date'] = ratings['timestamp'].apply(datetime.fromtimestamp)
ratings['year'] = ratings['date'].dt.year
ratings['month'] = ratings['date'].dt.month
# Use the date column as the index, then convert it to monthly periods
ratings = ratings.set_index('date').to_period(freq='M')
print(ratings)

[result]

date     uid             pid         rating  timestamp   year  month
2014-03  A1YS9MDZP93857  00028320    3.0     1394496000  2014  3
2013-06  A3TS466QBAWB9D  0014072149  5.0     1370476800  2013  6

[description]

  1. A timestamp is an integer: the number of seconds elapsed since 1970-01-01 00:00:00.
  2. datetime.fromtimestamp() converts timestamp data to datetime data.
  3. Note that to convert the datetime data to period data, the date column must be set as the index, and the converted DataFrame must be assigned back to a variable, namely: ratings = ratings.set_index('date').to_period(freq='M')
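
The conversion can be tried on the two rows shown in the result. This sketch swaps in pd.to_datetime(..., unit='s') for datetime.fromtimestamp, an assumption made so that the outcome does not depend on the local time zone; the pid column is left out for brevity.

```python
import pandas as pd

# The two sample rows from the result above (pid omitted).
ratings = pd.DataFrame({
    "uid": ["A1YS9MDZP93857", "A3TS466QBAWB9D"],
    "rating": [3.0, 5.0],
    "timestamp": [1394496000, 1370476800],
})

# Seconds since 1970-01-01 00:00:00 (UTC) -> datetime values.
ratings["date"] = pd.to_datetime(ratings["timestamp"], unit="s")
ratings["year"] = ratings["date"].dt.year
ratings["month"] = ratings["date"].dt.month

# to_period() converts the index, so the date column must become the index first.
ratings = ratings.set_index("date").to_period(freq="M")
print(ratings)
```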
3. Analysis data

    1. Mean score of each month

    [script 1]

    pingFen = ratings['rating'].groupby('date').mean()
    plt.plot(pingFen)
    plt.show()
    

    [script 2]

    pingFen = ratings['rating'].groupby('date').mean()
    plt.plot(pingFen.to_timestamp())
    plt.show()
    

    Script 1 fails with TypeError: float() argument must be a string or a number, not 'Period', because the index still holds Period values. Script 2 converts the index with to_timestamp() before plotting and runs without the error.

    The result of script 2 is as follows:

    [description]
    1) This is not the score of any particular item, only the overall score of the "Musical Instruments" category.
    2) The score of Musical Instruments shows a downward trend from 1998 to 2004 and tends to be stable after 2004.
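
The grouping and the index conversion can be checked without plotting. A small sketch with made-up ratings; it groups with level="date", which is one explicit way to group by the named index.

```python
import pandas as pd

# Hypothetical ratings with a monthly PeriodIndex named "date".
ratings = pd.DataFrame(
    {"rating": [5.0, 3.0, 4.0, 2.0]},
    index=pd.PeriodIndex(["2014-01", "2014-01", "2014-02", "2014-02"],
                         freq="M", name="date"),
)

# Mean score per month; the result is still indexed by Period values.
monthly_mean = ratings["rating"].groupby(level="date").mean()

# to_timestamp() replaces each period with a timestamp at the start of the period.
as_timestamps = monthly_mean.to_timestamp()
print(monthly_mean)
print(as_timestamps)
```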

    2. The number of participants per month

    The results above ignore one factor: the number of participants in the scoring. In the extreme case, the fewer the participants, the more unstable the score and the less representative it is. Let's take a look at the number of ratings:
    [script]

    pingFenR = ratings['rating'].groupby('date').count()
    plt.plot(pingFenR.to_timestamp())
    plt.show()
    

    [result]

    The figure shows that the number of participants was very small before 2010 and increased rapidly after 2010. This is why the mean score in analysis 1 is more stable after 2010, and it also means that the data from this period is more meaningful.

    3. Observing the score together with the number of ratings in each period

    How can three dimensions (time, number of participants and mean score) be shown in the same figure?
    Use a scatter plot with variable point size!
    [script]

    pingFen = ratings[['rating']].groupby('date').agg(['count', 'mean'])
    plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count'])
    plt.show()
    

    [result]

    [description]
    Compare plt.plot() and plt.scatter(): why is the period index converted to timestamps differently?
    plt.plot(): pingFen.to_timestamp()
    plt.scatter(): pingFen.index.to_timestamp()
    I don't understand this point; anyone who knows is welcome to leave a message. A likely explanation is that plt.plot() receives the whole Series and takes the x values from its index, so the Series itself is converted, whereas plt.scatter() needs x and y passed separately, so only the index is converted.
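
What agg(['count', 'mean']) returns can be inspected on a few made-up rows. The sketch shows the two-level columns and the three sequences that end up as x, y and point size in the scatter call; the data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical ratings with a monthly PeriodIndex named "date".
ratings = pd.DataFrame(
    {"rating": [5.0, 3.0, 2.0]},
    index=pd.PeriodIndex(["2014-01", "2014-01", "2014-02"], freq="M", name="date"),
)

# A list of functions produces two-level columns: ('rating', 'count'), ('rating', 'mean').
pingFen = ratings[["rating"]].groupby(level="date").agg(["count", "mean"])

x = pingFen.index.to_timestamp()     # x axis: month as timestamps
means = pingFen["rating"]["mean"]    # y axis: mean score
counts = pingFen["rating"]["count"]  # point size: number of ratings

print(pingFen.columns.tolist())
print(list(x), list(means), list(counts))
```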

    4. Adjust the image display effect

    1) resize (enlarge or shrink)

    Sometimes the point sizes in the scatter plot differ too little, or are too large overall; this can be adjusted by multiplying or dividing the size parameter by a constant.
    For example, in the figure above some points are too big. They can be made smaller as follows:

    plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count']/100)
    

    results are as follows:

    2) resize (normalized)

    Map all values to between 0 and 1 using (n-min)/(max-min).
    I first tried the following expression and got an error:
    (pingFen['count'] - pingFen['count'].min()) / (pingFen['count'].max() - pingFen['count'].min())
    KeyError: 'count'
    The reason is that pingFen has no top-level column called 'count': after agg(['count', 'mean']) the columns sit under 'rating', and 'count' is only the name of the aggregation.
    The solution is as follows:

    pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
    pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max() - pingFen['cnt'].min())
    

    The agg() function deserves a separate write-up; it took me a lot of effort to understand.
    [Notes on agg() in pandas]
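
Named aggregation and the (n-min)/(max-min) mapping can be verified on a few made-up rows. A minimal sketch; the data and the helper variables cnt_min and cnt_max are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ratings: three in January, one in February, two in March.
ratings = pd.DataFrame(
    {"rating": [5.0, 3.0, 4.0, 2.0, 4.0, 5.0]},
    index=pd.PeriodIndex(
        ["2014-01", "2014-01", "2014-01", "2014-02", "2014-03", "2014-03"],
        freq="M", name="date",
    ),
)

# Named aggregation gives flat column names: cnt and avg.
pingFen = ratings.groupby(level="date").agg(cnt=("rating", "count"),
                                            avg=("rating", "mean"))

# Min-max normalization: the smallest count maps to 0, the largest to 1.
cnt_min = pingFen["cnt"].min()
cnt_max = pingFen["cnt"].max()
pingFen["sl"] = (pingFen["cnt"] - cnt_min) / (cnt_max - cnt_min)
print(pingFen)
```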

    3) color

    For parameter details, see https://www.jb51.net/article/127806.htm
    Here I use the following script:

    ratings = ratings.to_period(freq='M')
    pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg =('rating', 'mean'))
    pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max() - pingFen['cnt'].min())
    plt.scatter(pingFen.index.to_timestamp(), pingFen['avg'], s=pingFen['sl']*1500, c=pingFen['cnt']/pingFen['cnt'].mean(), alpha=0.3)
    

    The scatter() parameters are explained as follows:

    The result after adjustment is shown below. This is the figure I had in mind, although it is not perfect.

    [description]
    The x axis is time.
    The y axis is the mean score. The figure shows that after 2006 the mean score is concentrated between 4.1 and 4.5.
    The point size is the number of ratings. The figure shows that the number of raters after 2012 is far larger than before 2004, so the scores from that period are more valuable.
    The color can be configured to carry additional information. Since this case does not introduce more dimensions to observe, I again use the number of ratings for the color: purple is small, yellow is large, and green is in between.
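
The roles of s, c and alpha can be seen in a self-contained sketch. The data is made up, the Agg backend is chosen only so the code runs without a display, and the +20 size offset and the explicit cmap="viridis" are assumptions added here, not part of the script above.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly summary: number of ratings (cnt) and mean score (avg).
pingFen = pd.DataFrame(
    {"cnt": [10, 40, 100], "avg": [4.5, 4.2, 4.3]},
    index=pd.PeriodIndex(["2012-01", "2012-02", "2012-03"], freq="M", name="date"),
)
span = pingFen["cnt"].max() - pingFen["cnt"].min()
pingFen["sl"] = (pingFen["cnt"] - pingFen["cnt"].min()) / span

fig, ax = plt.subplots()
points = ax.scatter(
    pingFen.index.to_timestamp(),  # x: time
    pingFen["avg"],                # y: mean score
    s=pingFen["sl"] * 1500 + 20,   # point size; the offset keeps the smallest point visible
    c=pingFen["cnt"],              # color follows the number of ratings
    cmap="viridis",                # purple = small, green = middle, yellow = large
    alpha=0.3,                     # transparency, so overlapping points stay readable
)
fig.colorbar(points, ax=ax, label="number of ratings")
```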

    [end]