Recently, I have been writing a course on financial analysis and prediction in Python. Because I was lazy, I didn't set up the required libraries under conda in advance. Installing the packages directly from PyCharm led to version incompatibilities and mismatches caused by the installation order, which produced the error:
RuntimeError: implement_array_function method already has a docstring
The fix: first uninstall the packages:
pip uninstall pandas
pip uninstall matplotlib
pip uninstall scipy
pip uninstall numpy
pip uninstall scikit-learn
then install them again in the following order:
pip install numpy
pip install scipy
pip install pandas
pip install matplotlib
pip install scikit-learn
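After reinstalling, a quick sanity check (my own addition, not part of the original fix) is to import everything and print the versions; if the versions still clash, the imports themselves should reproduce the RuntimeError before anything is printed:
import numpy, scipy, pandas, matplotlib, sklearn
for m in (numpy, scipy, pandas, matplotlib, sklearn):
    print(m.__name__, m.__version__)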
numpy.random.rand()
numpy.random.randn(d0, d1, …, dn) returns one or more sample values from the standard normal distribution.
numpy.random.rand(d0, d1, …, dn) returns samples uniformly distributed over [0, 1).
The rand function generates data in [0, 1) according to the given dimensions, including 0 and excluding 1; each argument d0…dn gives the size of one axis, and the return value is an array of the specified shape.
import numpy as np

np.random.rand(4,2)
array([[0.64959905, 0.14584702],
[0.56862369, 0.5992007 ],
[0.42512475, 0.83075541],
[0.75685279, 0.00910825]])
np.random.rand(4,3,2) # shape: 4*3*2
array([[[0.07304796, 0.48810928],
[0.59523586, 0.83281804],
[0.47530734, 0.50402275]],
[[0.63153869, 0.19636159],
[0.93727986, 0.13564719],
[0.11122609, 0.59646316]],
[[0.17276155, 0.66621767],
[0.81926792, 0.28781293],
[0.20228714, 0.72412133]],
[[0.29365696, 0.53956076],
[0.19105394, 0.47044441],
[0.85930046, 0.3867359 ]]])
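For contrast, a small randn() sketch (my addition; the seed is fixed so the numbers are reproducible): unlike rand(), the samples come from the standard normal distribution, so they are unbounded and can be negative.
np.random.seed(42)       # fix the seed so the output below is reproducible
np.random.randn(2, 3)    # 2x3 array of standard-normal samples
array([[ 0.49671415, -0.1382643 ,  0.64768854],
       [ 1.52302986, -0.23415337, -0.23413696]])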
Python data analysis dataframe converts dates to weeks
When doing data analysis with Python, how do you add a new column "week" that converts a date such as "2021-03-02" into the nth week of 2021?
Here are two methods.
The DataFrame holding my data is called sample, and it has a column "date"; now we add a new column "week".
1. dt.week
sample["week"]=sample["date"].dt.week
2. Use DatetimeIndex
sample['week'] = pd.DatetimeIndex(sample['date']).week
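A minimal self-contained sketch of method 1 (my addition, with toy data): note that the date column must be datetime dtype first, and that on recent pandas versions both dt.week and DatetimeIndex(...).week are deprecated in favor of dt.isocalendar().week.
import pandas as pd

sample = pd.DataFrame({"date": ["2021-03-02", "2021-03-09"]})
sample["date"] = pd.to_datetime(sample["date"])         # .dt only works on datetime columns
sample["week"] = sample["date"].dt.isocalendar().week   # -> weeks 9 and 10 of 2021
print(sample)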
Pandas read_json() Error: ValueError: Trailing data
I have a JSON file as follows:
{
"cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140",
"title": "2018上半年最热新歌TOP50",
"author": "网易云音乐",
"times": "1264万",
"url": "https://music.163.com/playlist?id=2303649893",
"id": "2303649893"
}
{
"cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140",
"title": "你的青春里有没有属于你的一首歌?",
"author": "mayuko然",
"times": "4576万",
"url": "https://music.163.com/playlist?id=2201879658",
"id": "2201879658"
}
when I try to read the file with pd.read_json('data.json'), I get an error. The relevant part of the traceback is:
File "D:\python\lib\site-packages\pandas\io\json\json.py", line 853, in _parse_no_numpy
loads(json, precise_float=self.precise_float), dtype=None)
ValueError: Trailing data
After searching on Baidu, I found it was a JSON format error (this was the first time I learned there is such a thing as JSONViewer). You need to make each dictionary in the file an element of a list, modified as follows:
[{
"cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140",
"title": "2018上半年最热新歌TOP50",
"author": "网易云音乐",
"times": "1264万",
"url": "https://music.163.com/playlist?id=2303649893",
"id": "2303649893"
},
{
"cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140",
"title": "你的青春里有没有属于你的一首歌?",
"author": "mayuko然",
"times": "4576万",
"url": "https://music.163.com/playlist?id=2201879658",
"id": "2201879658"
}]
Another method is to make each line of the file a complete dictionary, and then change the parameter in the function: pd.read_json('data.json', lines=True). lines defaults to False; set to True, it reads the file as one JSON object per line. The pandas.read_json documentation explains it as follows:
lines : boolean, default False. Read the file as a json object per line. New in version 0.19.0.
the modified json file is as follows:
{"cover": "http://p2.music.126.net/wsPS7l8JZ3EAOvlaJPWW-w==/109951163393967421.jpg?param=140y140","title": "2018上半年最热新歌TOP50","author": "网易云音乐","times": "1264万","url": "https://music.163.com/playlist?id=2303649893","id": "2303649893"}
{"cover": "http://p2.music.126.net/wpahk9cQCDtdzJPE52EzJQ==/109951163271025942.jpg?param=140y140","title": "你的青春里有没有属于你的一首歌?","author": "mayuko然","times": "4576万","url": "https://music.163.com/playlist?id=2201879658","id": "2201879658"}
pd.to_csv, Error: need to escape, but no escapechar set
import csv

df3.to_csv('E:\\data\\xxxx.csv', index=False, header=0, sep='|', encoding="utf-8", quoting=csv.QUOTE_NONE)
Error: need to escape, but no escapechar set
Reason: the data probably contains '|', and '|' is also the delimiter. The csv writer attempts to escape it but cannot, because no escapechar is set.
Solution:
Provide an escapechar: when quoting is csv.QUOTE_NONE, specify a character used to escape the separator so that it is not treated as a delimiter.
df3.to_csv('E:\\data\\xxxx.csv', index=False, header=0, sep='|', encoding="utf-8", quoting=csv.QUOTE_NONE, escapechar='|')
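Note that here the escape character is the delimiter itself, so a literal '|' inside a field is written out doubled as '||'. A toy sketch of the behavior (my addition, hypothetical output path):
import csv
import pandas as pd

df3 = pd.DataFrame({'desc': ['contains|pipe', 'plain']})
# without escapechar, this raises: need to escape, but no escapechar set
df3.to_csv('out.csv', index=False, header=0, sep='|',
           encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='|')
# the field 'contains|pipe' is written as 'contains||pipe'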
reference:
https://stackoverflow.com/questions/32107790/writing-to-csv-getting-error-need-to-escape-for-a-blank-string
Python + Pandas + Evaluation of Musical Instruments over the Years (Notes)
Judging from the title, this lesson is supposed to explore how something changes over time. Therefore, the core content is exploring the trend of time-series data. Details are as follows:
1. Fetch data
data source: http://jmcauley.ucsd.edu/data/amazon/links.html
The Amazon e-commerce site provides some data resources; the data on the page above consists of product reviews from May 1996 to July 2014, nearly 20 years. The "ratings only" data has the header "user, item, rating, timestamp".
We download the reviews in the "Musical Instruments" file. (The download is very slow, taking almost a whole day.) You can use the file the teacher already downloaded (time analysis): https://www.njcie.com/python/2
2. Processing data
1. Read data
[script]
import pandas as pd

rnames = ['uid', 'pid', 'rating', 'timestamp']
ratings = pd.read_csv('D:\\ratings_Musical_Instruments.csv', header=None, names=rnames)
2. Process the timestamp
[script]
from datetime import datetime

ratings['date'] = ratings['timestamp'].apply(datetime.fromtimestamp)
ratings['year'] = ratings['date'].dt.year
ratings['month'] = ratings['date'].dt.month
ratings = ratings.set_index('date').to_period(freq='M')
print(ratings)
[result]
date uid pid rating timestamp year month
2014-03 A1YS9MDZP93857 00028320 3.0 1394496000 2014 3
2013-06 A3TS466QBAWB9D 0014072149 5.0 1370476800 2013 6
[description]
- a timestamp is an integer: the number of seconds from 1970-01-01 00:00:00 to the recorded time.
- datetime.fromtimestamp() converts timestamp data to datetime data.
- note that the datetime data is converted to period data by setting the date column as the index and assigning the converted DataFrame back to the variable, i.e.: ratings = ratings.set_index('date').to_period(freq='M')
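A single-value sketch of the whole chain (my addition; the timestamp is taken from the first result row above):
from datetime import datetime
import pandas as pd

ts = 1394496000                        # seconds since 1970-01-01 00:00:00
dt = datetime.fromtimestamp(ts)        # -> datetime in local time, e.g. 2014-03-11
per = pd.Timestamp(dt).to_period('M')  # -> Period('2014-03', 'M')
print(ts, dt, per)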
3. Analyze data
1. Mean score of each month
[script 1]
import matplotlib.pyplot as plt

pingFen = ratings['rating'].groupby('date').mean()
plt.plot(pingFen)
plt.show()
[script 2]
pingFen = ratings['rating'].groupby('date').mean()
plt.plot(pingFen.to_timestamp())
plt.show()
Script 1 raises: TypeError: float() argument must be a string or a number, not 'Period'. Matplotlib cannot plot a PeriodIndex directly, so script 2 first converts it with to_timestamp().
script 2 results as follows:
[description]
1) This does not indicate the score of any particular item, only the overall score of the "Musical Instruments" category.
2) It can be seen that the score of Musical Instruments tends to be stable after 2004, and shows a downward trend from 1998 to 2004.
2. The number of raters per month
The above results seem to ignore one factor: the number of people rating. In extreme cases, the fewer the raters, the more unstable and less representative the score. Let's look at the number of ratings:
[script]
pingFenR = ratings['rating'].groupby('date').count()
plt.plot(pingFenR.to_timestamp())
plt.show()
[result]
From the figure, we can see that before 2010 the number of raters was very small; after 2010 it increased rapidly, so in analysis 1 the mean score after 2010 is more stable, which is to say the data from this period is more meaningful.
3. Observe the scores together with the number of valid ratings in each period
So how do we present three dimensions of data (time, number of raters, and mean score) in the same graph?
A scatter plot with point sizes!
[script]
pingFen = ratings[['rating']].groupby('date').agg(['count', 'mean'])
plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count'])
plt.show()
[result]
[description]
Comparing plt.scatter() and plt.plot(), why is the Period index converted to timestamps differently?
plt.plot(): pingFen.to_timestamp()
plt.scatter(): pingFen.index.to_timestamp()
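A side-by-side sketch of the two call patterns (my addition; pingFen here stands for the monthly mean Series from script 2, and the comments are my reading, not the author's):
# plt.plot() accepts a whole Series and uses its index as x,
# so converting the Series (which converts its PeriodIndex) is enough:
plt.plot(pingFen.to_timestamp())
# plt.scatter() takes x and y separately, so the PeriodIndex itself
# has to be converted before being passed as x:
plt.scatter(pingFen.index.to_timestamp(), pingFen.values)
plt.show()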
== I don't know about this point; anyone who knows can leave a message ~ ==
4. Adjust the image display effect
1) resize (enlarge or shrink)
Sometimes the difference in the sizes of the scatter points is hard to see, so we can adjust it by multiplying or dividing the size parameter by a constant.
For example, in the graph formed above, some points are too big. The way to shrink them is as follows:
plt.scatter(pingFen.index.to_timestamp(), pingFen['rating']['mean'], pingFen['rating']['count']/100)
results are as follows:
2) resize (normalized)
Map all data to [0, 1] using (n - min)/(max - min).
I tried the following script and got an error:
(pingFen['count'] - pingFen['count'].min()) / (pingFen['count'].max())
KeyError: 'count'
The reason is that there is no column called 'count' in this pingFen table; 'count' is just the name of an aggregation (the actual column is the MultiIndex pair ('rating', 'count')).
The solution is as follows:
pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max())
Let me talk specifically about the agg() function here; it took me a lot of work to understand it.
[Notes on agg() in pandas]
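A minimal illustration of the two agg() styles used above (my addition, toy data; the keyword form, "named aggregation", needs pandas >= 0.25):
import pandas as pd

df = pd.DataFrame({'date': ['2014-03', '2014-03', '2013-06'],
                   'rating': [3.0, 4.0, 5.0]})

# list style: produces MultiIndex columns ('rating', 'count') and ('rating', 'mean')
a = df.groupby('date').agg(['count', 'mean'])

# named style: produces flat columns 'cnt' and 'avg', so pingFen['cnt'] works
b = df.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
print(a)
print(b)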
3) color
parameter details, see https://www.jb51.net/article/127806.htm
Here I use the following script:
ratings = ratings.to_period(freq='M')
pingFen = ratings.groupby('date').agg(cnt=('rating', 'count'), avg=('rating', 'mean'))
pingFen['sl'] = (pingFen['cnt'] - pingFen['cnt'].min()) / (pingFen['cnt'].max())
plt.scatter(pingFen.index.to_timestamp(), pingFen['avg'], s=pingFen['sl']*1500, c=pingFen['cnt']/pingFen['cnt'].mean(), alpha=0.3)
The scatter() call is interpreted as follows: s sets the point sizes (the normalized count scaled by 1500), c sets the colors (each month's count relative to the mean count), and alpha=0.3 makes the points translucent.
After the adjustment, the effect is shown below. This is the picture I was hoping for, although it is not the best.
[description]
- the x axis in the figure is time
- the y axis is the mean score; as can be seen from the graph, after 2006 the mean score is concentrated between 4.1 and 4.5
- the size of each point represents the number of raters; as can be seen from the graph, after 2012 the number of raters far exceeds the number before 2004, so those scores are more valuable
- the color can represent any information you configure; since this case does not introduce more data dimensions, I also use the number of ratings for the color here: purple means small, yellow means large, and green is in between. [end]