Python + Pandas + Musical Instrument Ratings over the Years (Notes)

Judging from the title, this lesson explores how something changes over time, so the core content is analyzing trends in time series data. Details are as follows:

1. Fetch data

Data source:

The Amazon e-commerce website provides some data resources. The data on that page covers product reviews from May 1996 to July 2014, about 18 years. The "ratings only" data has the columns "user, item, rating, timestamp".
We download the "Musical Instruments" ratings file. (The download is very slow and took almost a day, so the already downloaded file can be used instead: time analysis/
2. Processing data

1. Read data


import pandas as pd
import matplotlib.pyplot as plt

rnames = ['uid', 'pid', 'rating', 'timestamp']
ratings = pd.read_csv('D:\\ratings_Musical_Instruments.csv', header=None, names=rnames)
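As a small illustration of the read step, the sketch below parses two rows (copied from the sample output in these notes) from an in-memory string instead of the real file. The `dtype` argument is an addition that the original script does not use; it is assumed here so that item IDs made only of digits keep their leading zeros.

```python
import io

import pandas as pd

# Two rows in the "user,item,rating,timestamp" layout, with no header row.
sample_csv = io.StringIO(
    "A1YS9MDZP93857,00028320,3.0,1394496000\n"
    "A3TS466QBAWB9D,0014072149,5.0,1370476800\n"
)

rnames = ['uid', 'pid', 'rating', 'timestamp']
# header=None: the first line is data, not column names.
# dtype (assumption, not in the original script): read the IDs as text.
ratings = pd.read_csv(sample_csv, header=None, names=rnames,
                      dtype={'uid': str, 'pid': str})
print(ratings.dtypes)
```

Without the `dtype` hint, a column that happens to contain only digits would be parsed as integers and `00028320` would lose its leading zeros.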

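2. Process the timestamp


from datetime import datetime

# timestamp holds seconds since 1970-01-01; convert it to datetime (local time)
ratings['date'] = ratings['timestamp'].apply(datetime.fromtimestamp)
ratings['year'] = ratings['date'].dt.year
ratings['month'] = ratings['date'].dt.month
# make 'date' the index, then convert the index to monthly periods
ratings = ratings.set_index('date').to_period(freq='M')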

date     uid             pid         rating  timestamp   year  month
2014-03  A1YS9MDZP93857  00028320    3.0     1394496000  2014  3
2013-06  A3TS466QBAWB9D  0014072149  5.0     1370476800  2013  6

  1. A timestamp is an integer: the number of seconds elapsed since 1970-01-01 00:00:00 (UTC).
  2. datetime.fromtimestamp() converts a timestamp to a datetime value.
  3. To convert datetime data to period data, first set the date column as the index, then assign the converted DataFrame back to a variable, namely: ratings = ratings.set_index('date').to_period(freq='M')
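The three notes above can be sketched on the two sample timestamps from these notes. This version swaps in `pd.to_datetime(..., unit='s')` for `datetime.fromtimestamp`; that is an alternative, not the method used in the lesson. It converts the whole column at once and interprets the seconds as UTC rather than local time.

```python
import pandas as pd

ratings = pd.DataFrame({
    'rating': [3.0, 5.0],
    'timestamp': [1394496000, 1370476800],
})

# unit='s': the integers are seconds since 1970-01-01 00:00:00 UTC.
ratings['date'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['year'] = ratings['date'].dt.year
ratings['month'] = ratings['date'].dt.month

# to_period() works on the index, so 'date' must become the index first.
monthly = ratings.set_index('date').to_period(freq='M')
print(monthly)
```

The resulting index is a `PeriodIndex` with one monthly period per row, which is what the later `groupby('date')` calls rely on.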
3. Analyzing data

    1. Mean score of each month

    [script 1]

    pingFen = ratings['rating'].groupby('date').mean()
    plt.plot(pingFen)

    [script 2]

    pingFen = ratings['rating'].groupby('date').mean()
    plt.plot(pingFen.to_timestamp())

    Script 1 failed with:

    TypeError: float() argument must be a string or a number, not 'Period'

    The index of pingFen holds Period values, which plt.plot() could not convert to numbers. Script 2 converts the index to timestamps first.

    The results of script 2 are as follows:

    1) The result does not show the score of any individual item, only the overall score for the "Musical Instruments" category.
    2) The mean score shows a downward trend from 1998 to 2004 and tends to be stable after 2004.

    2. The number of participants per month

    The results above ignore one factor: the number of participants in the scoring. The fewer the participants, the less stable the mean score and the less representative it is. Let's take a look at the number of ratings per month:

    pingFenR = ratings['rating'].groupby('date').count()

    [results]

    The figure shows that the number of participants was very small before 2010 and increased rapidly afterwards. This is why, in analysis 1, the mean score after 2010 is more stable; the data from this period is also more meaningful.
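One way to make the "few participants, unstable mean" point concrete is to compute the count and the mean side by side and mark the months with too few ratings. The sketch below uses made-up ratings, and the threshold `min_count` is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Made-up data: one rating in 2003-01, three ratings in 2013-06.
ratings = pd.DataFrame(
    {'rating': [1.0, 5.0, 4.0, 3.0]},
    index=pd.PeriodIndex(['2003-01', '2013-06', '2013-06', '2013-06'],
                         freq='M', name='date'),
)

# Count and mean per month in a single pass.
stats = ratings.groupby(level='date')['rating'].agg(['count', 'mean'])

min_count = 3  # hypothetical threshold, not taken from the lesson
stats['reliable'] = stats['count'] >= min_count
print(stats)
```

Here the 2003-01 mean rests on a single rating, so one more vote could move it a lot, while the 2013-06 mean is backed by three.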

    3. Observing the score together with the number of ratings in each period

    How can three dimensions (time, number of participants and mean score) be presented in the same graph?
    With a scatter plot whose point size varies!

    pingFen = ratings[['rating']].groupby('date').agg(['count', 'mean'])
    plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count'])

    [results]

    [description]
    Compare how plt.scatter() and plt.plot() turn the period index into timestamps. Why are they different?
    plt.plot() : pingFen.to_timestamp()
    plt.scatter() : pingFen.index.to_timestamp()
    I don't fully understand this point; if you know, please leave a comment.
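A likely reason for the difference between the two calls, sketched on made-up monthly means: `plot()` can take a single Series and read x from its index and y from its values, so converting the whole Series is enough, while `scatter()` takes x and y as separate arguments, so only the index has to be converted. The `Agg` backend line is an assumption added so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend (assumption: no display needed)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly means with a Period index, like the result of groupby().mean().
monthly_mean = pd.Series(
    [4.0, 4.5, 4.2],
    index=pd.period_range('2013-01', periods=3, freq='M', name='date'),
    name='rating',
)

fig, (ax_line, ax_points) = plt.subplots(1, 2)

# plot(): one Series supplies both axes (index -> x, values -> y).
ax_line.plot(monthly_mean.to_timestamp())

# scatter(): x and y are passed separately, so convert just the index.
x = monthly_mean.index.to_timestamp()
ax_points.scatter(x, monthly_mean.values)
```

Both calls end up with the same x values; `Series.to_timestamp()` simply returns the same Series with its index converted.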

    4. Adjust the image display effect

    1) resize (enlarge or shrink)

    Sometimes the point sizes are hard to compare, so we can adjust them by multiplying or dividing the size parameter by a constant.
    For example, some points in the graph above are too big. They can be shrunk as follows:

    plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count']/100)

    The results are as follows:

    2) resize (normalized)

    Normalization maps all values to the range 0 to 1, using (n - min) / (max - min).
    I first tried the following script and got an error:
    (pingFen['count'] - pingFen['count'].min()) / (pingFen['count'].max() - pingFen['count'].min())
    KeyError: 'count'
    The reason is that the rating table (pingFen) has no column called 'count'. After agg(['count', 'mean']) the columns have two levels, so the counts are under pingFen['rating']['count']; 'count' on its own is only the name of the aggregation function.
    The solution is to name the result columns explicitly:

    pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
    pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max() - pingFen['cnt'].min())

    The agg() function took me a lot of work to understand, so I discuss it separately:
    [Notes on agg() in pandas]
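The KeyError and its fix can be reproduced on a few made-up ratings. The sketch below shows the two-level columns produced by passing a list of functions to `agg()`, then the flat columns produced by named aggregation, followed by the min-max formula (n - min) / (max - min). The variable names `nested` and `flat` are illustrative only.

```python
import pandas as pd

# Made-up data: 1 rating in 2003-01, 3 in 2013-06, 2 in 2014-01.
ratings = pd.DataFrame(
    {'rating': [1.0, 5.0, 4.0, 3.0, 2.0, 4.0]},
    index=pd.PeriodIndex(
        ['2003-01', '2013-06', '2013-06', '2013-06', '2014-01', '2014-01'],
        freq='M', name='date'),
)

# A list of functions yields two-level columns: ('rating', 'count'), ...
nested = ratings[['rating']].groupby(level='date').agg(['count', 'mean'])
print(nested.columns.tolist())
print('count' in nested.columns)  # no top-level 'count', hence the KeyError

# Named aggregation yields flat columns with the names we choose.
flat = ratings.groupby(level='date').agg(cnt=('rating', 'count'),
                                         avg=('rating', 'mean'))

# Min-max normalization: smallest count -> 0, largest count -> 1.
cnt_min, cnt_max = flat['cnt'].min(), flat['cnt'].max()
flat['sl'] = (flat['cnt'] - cnt_min) / (cnt_max - cnt_min)
print(flat)
```

With counts of 1, 3 and 2, the normalized sizes come out as 0, 1 and 0.5. Note that the month with the fewest ratings gets size 0, which matters once `sl` is used as a marker size.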

    3) color

    parameter details, see
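    Here I use the following script:

    # to_period() needs a DatetimeIndex, e.g. after ratings.set_index('date')
    ratings = ratings.to_period(freq='M')
    pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
    pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max() - pingFen['cnt'].min())
    plt.scatter(pingFen.index.to_timestamp(), pingFen['avg'], s=pingFen['sl']*1500, c=pingFen['cnt']/pingFen['cnt'].mean(), alpha=0.3)

    The scatter() arguments are used as follows: x is the month (as a timestamp), y is the mean score, s (point size) is the normalized count times 1500, c (color) is the count relative to its mean, and alpha=0.3 makes the points semi-transparent.

    The adjusted result is below. This is the chart I was hoping for, although it is not perfect.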

    [description]
     The x axis is time.
     The y axis is the mean score. The graph shows that after 2006 the mean score is concentrated between 4.1 and 4.5.
     The point size is the number of ratings. The graph shows that after 2012 there are far more raters than before 2004, so the scores from that period are more valuable.
     The color can encode further information. Since this case does not introduce more dimensions to observe, I reuse the number of ratings for the color: purple is small, yellow is large, and green is in between.
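To make the color reading less dependent on remembering which end is purple and which is yellow, a colorbar can be attached to the scatter plot. The sketch below is one possible variant on made-up monthly values, not the script used for the figure above: the `+ 10` size offset, the explicit `viridis` colormap, the axis labels and the `Agg` backend are all assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend (assumption: no display needed)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly counts and means, indexed by monthly periods.
pingFen = pd.DataFrame(
    {'cnt': [10, 200, 1500], 'avg': [3.8, 4.3, 4.4]},
    index=pd.PeriodIndex(['2003-01', '2010-01', '2013-06'], freq='M', name='date'),
)
cnt_range = pingFen['cnt'].max() - pingFen['cnt'].min()
pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / cnt_range

fig, ax = plt.subplots()
points = ax.scatter(
    pingFen.index.to_timestamp(), pingFen['avg'],
    s=pingFen['sl'] * 1500 + 10,  # +10 keeps the smallest month visible (assumption)
    c=pingFen['cnt'], cmap='viridis', alpha=0.3,
)
# The colorbar maps colors back to the number of ratings per month.
fig.colorbar(points, ax=ax, label='ratings per month')
ax.set_xlabel('month')
ax.set_ylabel('mean rating')
```

The offset matters because min-max normalization gives the month with the fewest ratings a size of exactly 0, which would otherwise make that point disappear.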

