Judging by its title, this lesson is about how something changes over time, so its core content is exploring trends in time series data. The details are as follows:
1. Fetch data
Data source: http://jmcauley.ucsd.edu/data/amazon/links.html
The page provides product review data from the Amazon e-commerce website, covering May 1996 to July 2014, roughly 18 years of reviews. The "ratings only" files have no header row; their columns are user, item, rating, timestamp.
We download the ratings file for the "Musical Instruments" category. (The download is very slow and can take most of a day, so a previously downloaded copy can be used instead: time analysis/https://www.njcie.com/python/2)
2. Processing data
1. Read data
import pandas as pd
# The file has no header row, so the column names are supplied explicitly
rnames = ['uid', 'pid', 'rating', 'timestamp']
ratings = pd.read_csv('D:\\ratings_Musical_Instruments.csv', header=None, names=rnames)
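2. Process the timestamp
from datetime import datetime
# Convert the integer timestamps to datetime values
ratings['date'] = ratings['timestamp'].apply(datetime.fromtimestamp)
ratings['year'] = ratings['date'].dt.year
ratings['month'] = ratings['date'].dt.month
# Make 'date' the index, then convert the index to monthly periods
ratings = ratings.set_index('date').to_period(freq='M')
print(ratings)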
date     uid             pid         rating  timestamp   year  month
2014-03  A1YS9MDZP93857  00028320    3.0     1394496000  2014  3
2013-06  A3TS466QBAWB9D  0014072149  5.0     1370476800  2013  6
- A timestamp is an integer: the number of seconds elapsed since 1970-01-01 00:00:00 (UTC).
- datetime.fromtimestamp() converts a timestamp to a datetime value (in the local time zone).
- Note that datetime data is converted to period data through the index: the date column must first be made the index, and the converted DataFrame must be assigned back to a variable. Namely: ratings = ratings.set_index('date').to_period(freq='M')
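The conversion chain described in the notes above can be sketched on two made-up rows. This is a minimal illustration, not the author's script: it uses pd.to_datetime(..., unit='s') instead of datetime.fromtimestamp, an assumption made here so that the result is in UTC and does not depend on the local time zone. The uid and pid values are placeholders.

```python
import pandas as pd

# Two illustrative rows in the same shape as the ratings file
# (uid and pid are placeholders, not real identifiers).
ratings = pd.DataFrame({
    "uid": ["U1", "U2"],
    "pid": ["P1", "P2"],
    "rating": [3.0, 5.0],
    "timestamp": [1394496000, 1370476800],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC.
ratings["date"] = pd.to_datetime(ratings["timestamp"], unit="s")

# Make 'date' the index, then collapse each datetime to its calendar month.
monthly = ratings.set_index("date").to_period(freq="M")

print(monthly.index)
```

The index is now a PeriodIndex with monthly frequency, which is what the later groupby steps group on.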
1. Mean score of each month
Script 1:
import matplotlib.pyplot as plt
pingFen = ratings['rating'].groupby('date').mean()
plt.plot(pingFen)
plt.show()
Script 2:
pingFen = ratings['rating'].groupby('date').mean()
plt.plot(pingFen.to_timestamp())
plt.show()
Script 1 fails with "TypeError: float() argument must be a string or a number, not 'Period'", because the index holds Period values that plt.plot() cannot draw directly. Script 2 avoids the error by converting the period index to timestamps with to_timestamp() before plotting.
The result of script 2 is as follows:
1) The plot does not show the score of any particular item, only the overall score for the "Musical Instruments" category.
2) The score for Musical Instruments shows a downward trend from 1998 to 2004 and tends to be stable after 2004.
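To make the monthly mean step concrete, here is a small sketch on three invented ratings. It assumes a DataFrame already indexed by monthly periods under the index name 'date', and it groups with level='date', which is one explicit way to group on the index. It stops before plotting so that it runs without Matplotlib.

```python
import pandas as pd

# Three invented ratings: two in March 2014, one in April 2014.
dates = pd.to_datetime(["2014-03-01", "2014-03-15", "2014-04-02"])
ratings = pd.DataFrame({"rating": [3.0, 5.0, 4.0]}, index=dates)
ratings.index.name = "date"
ratings = ratings.to_period(freq="M")

# Mean rating per month; the result is a Series indexed by Period values.
monthly_mean = ratings["rating"].groupby(level="date").mean()

# Convert the period index to timestamps (the start of each month) for plotting.
plottable = monthly_mean.to_timestamp()

print(plottable)
```

The plottable Series is the kind of object that script 2 passes to plt.plot().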
2. The number of participants per month
The result above ignores one factor: the number of people who took part in the scoring. The fewer the participants, the less stable the mean score and the less representative it is. Let's look at the number of ratings per month:
pingFenR = ratings['rating'].groupby('date').count()
plt.plot(pingFenR.to_timestamp())
plt.show()
【 results 】
The figure shows that the number of participants was very small before 2010 and increased rapidly after 2010. This explains why the mean score in analysis 1 is more stable after 2010, and it also means that the data from this later period is more meaningful.
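One way to act on this observation is to compute the count next to the mean and set aside months with too few ratings. The sketch below uses invented data, and the threshold MIN_COUNT is a hypothetical value chosen only for illustration; the original analysis does not filter anything.

```python
import pandas as pd

# Invented data: a sparse month (2 ratings) and a busier month (4 ratings).
idx = pd.PeriodIndex(
    ["2003-01", "2003-01", "2012-05", "2012-05", "2012-05", "2012-05"],
    freq="M",
    name="date",
)
ratings = pd.DataFrame({"rating": [1.0, 5.0, 4.0, 5.0, 4.0, 5.0]}, index=idx)

# Count and mean per month, side by side.
stats = ratings.groupby(level="date")["rating"].agg(["count", "mean"])

# Hypothetical threshold: keep only months with at least this many ratings.
MIN_COUNT = 3
reliable = stats[stats["count"] >= MIN_COUNT]

print(stats)
print(reliable)
```

In this toy data the sparse month has a mean of 3.0 from only two very different ratings, which is the kind of unstable value the text warns about.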
3. Observing the score together with the number of ratings in each period
How can three dimensions (time, number of participants and mean score) be presented in the same graph?
Use a scatter plot with variable point size!
pingFen = ratings[['rating']].groupby('date').agg(['count', 'mean'])
plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count'])
plt.show()
【 results 】
【 description 】
Compare how plt.plot() and plt.scatter() turn the period index into timestamps. What is the difference?
plt.plot(): pingFen.to_timestamp()
plt.scatter(): pingFen.index.to_timestamp()
== I don't understand this point yet; anyone who knows is welcome to leave a message ~ ==
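A likely explanation, offered here as a sketch rather than the author's own answer: plt.plot() called with a single pandas Series takes the x values from the Series index, so converting the whole Series is enough, whereas plt.scatter() needs x and y as separate arguments, so only the index is converted and passed as x. The two to_timestamp() calls return different kinds of objects, which the snippet below shows on invented values.

```python
import pandas as pd

# Invented monthly means indexed by Period values.
idx = pd.period_range("2014-01", periods=3, freq="M", name="date")
monthly_mean = pd.Series([4.1, 4.3, 4.2], index=idx, name="rating")

# Series.to_timestamp(): the whole Series comes back, now with a DatetimeIndex.
# This is what plt.plot(...) receives in the plot() version.
as_series = monthly_mean.to_timestamp()

# PeriodIndex.to_timestamp(): only the index comes back, as a DatetimeIndex.
# This is what plt.scatter(...) receives as its x argument.
as_index = monthly_mean.index.to_timestamp()

print(type(as_series).__name__, type(as_index).__name__)
```

Both routes produce the same timestamps; the difference is only whether the y values travel along with them.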
4. Adjust the image display effect
1) Resize (enlarge or shrink)
Sometimes the points in the scatter plot are all too large, too small, or too similar in size. The sizes can be adjusted by multiplying or dividing the point-size argument by a constant.
For example, some points in the graph above are too big. They can be shrunk as follows:
plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count']/100)
The result is as follows:
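2) Resize (normalized)
Normalization maps all values to the range 0 to 1, using (n - min) / (max - min).
I tried the following script and got an error:
(pingFen['count'] - pingFen['count'].min()) / (pingFen['count'].max() - pingFen['count'].min())
The reason is that pingFen has no column called 'count'; count is only the name of the aggregation. agg(['count', 'mean']) produces two-level column labels, ('rating', 'count') and ('rating', 'mean'), so the counts are reachable as pingFen['rating']['count'] but not as pingFen['count'].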
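The two-level column labels can be inspected directly. The following sketch rebuilds a tiny pingFen from invented ratings and shows which lookups succeed; the variable top_level_lookup_failed exists only for this illustration.

```python
import pandas as pd

# Invented ratings indexed by monthly periods.
idx = pd.PeriodIndex(["2014-03", "2014-03", "2014-04"], freq="M", name="date")
ratings = pd.DataFrame({"rating": [3.0, 5.0, 4.0]}, index=idx)

# Passing a list of functions to agg() on a DataFrame groupby
# yields columns labelled (source column, function name).
pingFen = ratings[["rating"]].groupby(level="date").agg(["count", "mean"])
print(pingFen.columns.tolist())

# Both of these reach the counts.
counts = pingFen["rating"]["count"]
same_counts = pingFen[("rating", "count")]

# 'count' is not a top-level column label, so this lookup raises KeyError.
try:
    pingFen["count"]
    top_level_lookup_failed = False
except KeyError:
    top_level_lookup_failed = True

print(top_level_lookup_failed)
```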
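The solution is to give the aggregated columns their own names, and then normalize the count column:
pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
# Min-max normalization: (n - min) / (max - min)
pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max() - pingFen['cnt'].min())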
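The normalization formula (n - min) / (max - min) can be checked on a few numbers. The helper min_max below is a hypothetical name introduced for this sketch, and returning zeros when all values are equal is an assumption made here to avoid dividing by zero.

```python
import pandas as pd

def min_max(values: pd.Series) -> pd.Series:
    """Rescale values to the range [0, 1] using (n - min) / (max - min)."""
    span = values.max() - values.min()
    if span == 0:
        # All values are equal; return zeros instead of dividing by zero.
        return pd.Series(0.0, index=values.index)
    return (values - values.min()) / span

counts = pd.Series([10, 60, 110])
scaled = min_max(counts)
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```

The smallest count maps to 0 and the largest to 1, so multiplying the result by a constant (as in the scatter call with s=...) sets the size of the largest point.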
The agg() function deserves a separate discussion; it took me a lot of work to understand it.
【 About agg() in pandas 】
For parameter details, see https://www.jb51.net/article/127806.htm
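The form of agg() used in this lesson is pandas named aggregation: each keyword argument is the name of an output column, and its value is a (source column, function) pair. A minimal sketch on invented ratings:

```python
import pandas as pd

# Invented ratings indexed by monthly periods.
idx = pd.PeriodIndex(["2014-03", "2014-03", "2014-04"], freq="M", name="date")
ratings = pd.DataFrame({"rating": [3.0, 5.0, 4.0]}, index=idx)

# output_name=(source_column, aggregation_function)
summary = ratings.groupby(level="date").agg(
    cnt=("rating", "count"),
    avg=("rating", "mean"),
)

print(summary.columns.tolist())  # ['cnt', 'avg']
```

Because the output columns are flat and named, summary['cnt'] and summary['avg'] can be used directly, with no nested lookup.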
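The script I use here is as follows:
# ratings is assumed to be indexed by the datetime 'date' column at this point;
# skip this line if the index was already converted to monthly periods in step 2.
ratings = ratings.to_period(freq='M')
pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
# Min-max normalization of the monthly count: (n - min) / (max - min)
pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max() - pingFen['cnt'].min())
plt.scatter(pingFen.index.to_timestamp(), pingFen['avg'], s=pingFen['sl']*1500, c=pingFen['cnt']/pingFen['cnt'].mean(), alpha=0.3)
plt.show()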
The scatter() arguments are interpreted as follows: the first two are the x and y values, s sets the point size, c sets the point color, and alpha sets the transparency.
The adjusted result is shown below. This is the chart I was hoping for, although it is not perfect.
【 description 】
The x axis is time.
The y axis is the mean score. The graph shows that after 2006 the monthly mean score is concentrated between 4.1 and 4.5.
The point size represents the number of ratings. The graph shows that after 2012 the number of raters is far larger than before 2004, so the scores from that period are more meaningful.
The color can be configured to represent any further information. Since this case does not introduce more dimensions, I use the number of ratings for the color as well: purple is small, yellow is large, and green is in the middle.