Judging from its title, this lesson is about how something changes over time, so its core content is exploring trends in time series data. The details are as follows:
1. Fetch data
data source: http://jmcauley.ucsd.edu/data/amazon/links.html
Amazon, the e-commerce website, provides some data resources. The data on the page above covers product reviews from May 1996 to July 2014, about 18 years. The "ratings only" files have no header row; their columns are "user, item, rating, timestamp".
We download the "Musical Instruments" ratings file. (The download is very slow and can take most of a day; you can use the copy the teacher has already downloaded: time analysis/https://www.njcie.com/python/2)
2. Processing data
1. Read data
[script]
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

rnames = ['uid', 'pid', 'rating', 'timestamp']
ratings = pd.read_csv('D:\\ratings_Musical_Instruments.csv', header=None, names=rnames)
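As a minimal sketch of the read step, the snippet below parses two rows in the same headerless layout from an in-memory string instead of the real file. The two rows are copied from the result printed in the next step; `sample_csv` and the `dtype` argument are additions made here for illustration, not part of the original script.

```python
import io

import pandas as pd

# Two rows in the "user, item, rating, timestamp" layout, with no header line.
sample_csv = io.StringIO(
    "A1YS9MDZP93857,00028320,3.0,1394496000\n"
    "A3TS466QBAWB9D,0014072149,5.0,1370476800\n"
)

rnames = ['uid', 'pid', 'rating', 'timestamp']
# dtype={'pid': str} keeps leading zeros in product ids that look numeric.
ratings = pd.read_csv(sample_csv, header=None, names=rnames, dtype={'pid': str})
print(ratings.dtypes)
```

Without the `dtype` hint, a sample made only of numeric-looking ids would be parsed as integers and lose its leading zeros.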
2. Processing time stamp
[script]
ratings['date'] = ratings['timestamp'].apply(datetime.fromtimestamp)
ratings['year'] = ratings['date'].dt.year
ratings['month'] = ratings['date'].dt.month
# Make 'date' the index, then convert the datetime index to monthly periods
ratings = ratings.set_index('date').to_period(freq='M')
print(ratings)
[result]
date     uid             pid         rating  timestamp   year  month
2014-03  A1YS9MDZP93857  00028320    3.0     1394496000  2014  3
2013-06  A3TS466QBAWB9D  0014072149  5.0     1370476800  2013  6
[description]
- A timestamp is an integer: the number of seconds elapsed since 1970-01-01 00:00:00 (UTC).
- datetime.fromtimestamp() converts a timestamp into a datetime value (in the local time zone).
- Note that converting datetime data to period data requires the date column to be the index, and the converted DataFrame must be assigned back to a variable. Namely: ratings = ratings.set_index('date').to_period(freq='M')
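The conversion chain from integer timestamp to monthly period can be sketched on the two sample timestamps alone. This sketch swaps in `pd.to_datetime(..., unit='s')` for `datetime.fromtimestamp`; that is a substitution made here, chosen because it interprets the seconds as UTC regardless of the machine's time zone.

```python
import pandas as pd

ratings = pd.DataFrame({
    'rating': [3.0, 5.0],
    'timestamp': [1394496000, 1370476800],
})

# unit='s' reads the integers as seconds since 1970-01-01 00:00:00 UTC.
ratings['date'] = pd.to_datetime(ratings['timestamp'], unit='s')

# to_period() works on the index, so 'date' has to become the index first.
monthly = ratings.set_index('date').to_period(freq='M')
print(monthly.index)
```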
3. Analyzing data
1. Mean score of each month
[script 1]
pingFen = ratings['rating'].groupby('date').mean()
plt.plot(pingFen)
plt.show()
[script 2]
pingFen = ratings['rating'].groupby('date').mean()
plt.plot(pingFen.to_timestamp())
plt.show()
In my environment, script 1 fails with "TypeError: float() argument must be a string or a number, not 'Period'", because the index of pingFen holds Period values that plt.plot() cannot convert to numbers. Script 2 avoids the error by converting the index to timestamps with to_timestamp() first.
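What to_timestamp() changes can be checked without drawing anything. The sketch below uses three made-up ratings (an assumption for illustration) and compares the index type before and after the conversion.

```python
import pandas as pd

# Three hypothetical ratings, already indexed by monthly period.
idx = pd.PeriodIndex(['2014-01', '2014-01', '2014-02'], freq='M', name='date')
ratings = pd.DataFrame({'rating': [4.0, 5.0, 3.0]}, index=idx)

pingFen = ratings['rating'].groupby(level='date').mean()
as_ts = pingFen.to_timestamp()  # each period becomes the timestamp of its start

print(type(pingFen.index).__name__)
print(type(as_ts.index).__name__)
```

The values are unchanged; only the index goes from periods to ordinary timestamps, which matplotlib knows how to place on an axis.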
The result of script 2 is as follows:
[description]
1) The plot does not show the score of any particular item, only the overall score of the "Musical Instruments" category.
2) The score of Musical Instruments shows a downward trend from 1998 to 2004 and tends to be stable after 2004.
2. The number of participants per month
The result above ignores one factor: the number of participants in the scoring. The fewer the participants, the more unstable the mean score and the less representative it is. Let's take a look at the number of ratings per month:
[script]
pingFenR = ratings['rating'].groupby('date').count()
plt.plot(pingFenR.to_timestamp())
plt.show()
[result]
The figure shows that the number of participants was very small before 2010 and increased rapidly afterwards. The mean score in analysis 1 is therefore more stable after 2010, and the data from that period is more meaningful.
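One way to act on this observation is to compute the count next to the mean and set aside months with too few ratings. The sketch below uses made-up data, and `min_count` is a hypothetical threshold chosen only for illustration; the lesson itself does not filter anything.

```python
import pandas as pd

# Hypothetical data: one rating in 2003-05, three ratings in 2013-05.
idx = pd.PeriodIndex(['2003-05', '2013-05', '2013-05', '2013-05'],
                     freq='M', name='date')
ratings = pd.DataFrame({'rating': [1.0, 4.0, 5.0, 4.5]}, index=idx)

stats = ratings.groupby(level='date')['rating'].agg(['count', 'mean'])

min_count = 3  # hypothetical cut-off for a "representative" month
reliable = stats[stats['count'] >= min_count]
print(stats)
print(reliable)
```

The single low rating in 2003-05 produces a monthly mean of 1.0, which shows how strongly one participant can move the mean of a sparsely rated month.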
3. Observe the score together with the number of ratings in each period
How can three dimensions (time, number of participants, and mean score) be presented in the same graph?
With a scatter plot whose point size varies!
[script]
pingFen = ratings[['rating']].groupby('date').agg(['count', 'mean'])
plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count'])
plt.show()
[result]
[description]
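Compare plt.plot() and plt.scatter(): the period index is converted to timestamps in different ways. Why?
plt.plot(): pingFen.to_timestamp()
plt.scatter(): pingFen.index.to_timestamp()
I am not certain about this point. A likely explanation is that plt.plot() receives a whole Series and takes the x values from its index, so the Series itself is converted, while plt.scatter() needs x and y passed separately, so only the index is converted. Anyone who knows more can leave a comment.
4. Adjust the image display effect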
1) Resize (enlarge or shrink)
Sometimes the points in the scatter plot differ too little or too much in size. We can adjust this by multiplying or dividing the size parameter by a constant.
For example, some points in the graph above are too big. They can be shrunk as follows:
plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count']/100)
The result is as follows:
2) Resize (normalize)
Normalization maps all values to the range 0 to 1, using (n-min)/(max-min).
I first tried the following script and got an error:
(pingFen['count'] - pingFen['count'].min()) / (pingFen['count'].max() - pingFen['count'].min())
KeyError: 'count'
The reason is that pingFen has no top-level column called 'count'. 'count' is only the name of the aggregation function, and agg(['count', 'mean']) stores its result under the two-level column ('rating', 'count').
The solution is to name the result columns explicitly:
pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max() - pingFen['cnt'].min())
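The (n-min)/(max-min) formula can be tried on a few numbers in isolation. `min_max` below is a hypothetical helper written for this sketch, not a pandas function, and the guard for a constant series is an added assumption about what a sensible fallback is.

```python
import pandas as pd


def min_max(values: pd.Series) -> pd.Series:
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    span = values.max() - values.min()
    if span == 0:
        # All values are equal: return zeros instead of dividing by zero.
        return values * 0.0
    return (values - values.min()) / span


cnt = pd.Series([10, 20, 60], name='cnt')
print(min_max(cnt).tolist())  # [0.0, 0.2, 1.0]
```

Dividing by max alone instead of (max-min) would leave the largest value at (max-min)/max, which is slightly below 1 whenever the minimum is not zero.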
The agg() function deserves a separate discussion; it took me a lot of work to understand it.
See: [Notes on agg() in pandas]
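The KeyError comes down to how agg() names its output columns. The sketch below, on three made-up ratings, compares the list form of agg() with the named form and prints the column labels each one produces.

```python
import pandas as pd

# Three hypothetical ratings indexed by monthly period.
idx = pd.PeriodIndex(['2014-01', '2014-01', '2014-02'], freq='M', name='date')
ratings = pd.DataFrame({'rating': [4.0, 5.0, 3.0]}, index=idx)

# List form: columns become two-level labels (source column, function name).
nested = ratings[['rating']].groupby(level='date').agg(['count', 'mean'])
print(nested.columns.tolist())  # [('rating', 'count'), ('rating', 'mean')]

# Named form: each keyword becomes a flat column name.
named = ratings.groupby(level='date').agg(cnt=('rating', 'count'),
                                          avg=('rating', 'mean'))
print(named.columns.tolist())  # ['cnt', 'avg']
```

With the list form the count lives at nested['rating']['count'], so nested['count'] has nothing to find; with the named form named['cnt'] works directly.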
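3) Color
For details on the parameters, see https://www.jb51.net/article/127806.htm
Here I use the following script:
# ratings is already indexed by monthly period (see "2. Processing data")
pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max() - pingFen['cnt'].min())
plt.scatter(pingFen.index.to_timestamp(), pingFen['avg'], s=pingFen['sl']*1500, c=pingFen['cnt']/pingFen['cnt'].mean(), alpha=0.3)
plt.show()
The scatter() arguments are interpreted as follows:
- s: point size, here the normalized number of ratings multiplied by 1500
- c: values mapped to colors, here the number of ratings relative to its monthly mean
- alpha: transparency of the points, so that overlapping points remain visible
The result after the adjustments is shown below. This is the picture I was hoping for, although it is not the best possible:
[description]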
- The x axis is time.
- The y axis is the mean score. The graph shows that after 2006 the mean score is concentrated between 4.1 and 4.5.
- The point size represents the number of ratings. The graph shows that after 2012 there are far more raters than before 2004, so the scores from that period are more valuable.
- The color can carry further information. Because this case does not introduce additional dimensions, I again use the number of ratings for the color: purple is small, yellow is large, and green is in between.
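For readers without the downloaded file, the whole bubble chart can be sketched on synthetic data. Everything about the data below is made up (steadily growing counts, random means), the Agg backend is an assumption so the sketch runs without a display, and the colorbar and axis labels are additions that the original script does not have.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic monthly statistics: counts grow over time, means are random.
rng = np.random.default_rng(0)
months = pd.period_range('2000-01', '2014-07', freq='M', name='date')
pingFen = pd.DataFrame({
    'cnt': np.linspace(1, 500, len(months)).round().astype(int),
    'avg': rng.uniform(3.5, 5.0, len(months)),
}, index=months)

# Min-max normalize the counts so they can be scaled into point sizes.
span = pingFen['cnt'].max() - pingFen['cnt'].min()
pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / span

fig, ax = plt.subplots()
points = ax.scatter(pingFen.index.to_timestamp(), pingFen['avg'],
                    s=pingFen['sl'] * 1500, c=pingFen['cnt'], alpha=0.3)
fig.colorbar(points, ax=ax, label='number of ratings')
ax.set_xlabel('month')
ax.set_ylabel('mean rating')

# Render into memory; pass a file name instead to keep the image.
buf = io.BytesIO()
fig.savefig(buf, format='png')
```

The purple-to-yellow scale described above is matplotlib's default colormap (viridis), which is why no cmap argument is needed.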