

Graphing COVID-19 trends for a limited date range with pandas and matplotlib

Posted at 2021-03-24


I drew a graph from the COVID-19 open data published by Japan's Ministry of Health, Labour and Welfare (MHLW; data from February 14, 2020) and overlaid it on the data published by NHK (data from January 16, 2020).

NHK open data on COVID-19 in Japan

Ministry of Health, Labour and Welfare of Japan open data on COVID-19 in Japan

MHLW provides a separate file for each item (deaths, positive cases, tests, and so on), so each file is read separately. The dates should be common across the files, though I haven't checked. The deaths file (death_total.csv) contains only the cumulative total for each day.

death_total
#mhlw1 = pd.read_csv("https://www.mhlw.go.jp/content/pcr_positive_daily.csv", parse_dates=['日付'])
#death_total.csv
mhlw2 = pd.read_csv("https://www.mhlw.go.jp/content/death_total.csv", parse_dates=['日付'])
#mhlw3 = pd.read_csv("https://www.mhlw.go.jp/content/pcr_tested_daily.csv", parse_dates=['日付'])
#mhlw4 = pd.read_csv("https://www.mhlw.go.jp/content/severe_daily.csv", parse_dates=['日付'])
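To see what parse_dates does without touching the network, here is a minimal sketch that reads a few made-up rows from a string. Only the column names 日付 and 死亡者数 come from the real file; the values and the ISO date format are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up rows laid out like death_total.csv (date, cumulative deaths).
csv_text = "日付,死亡者数\n2020-02-14,1\n2020-02-15,1\n2020-02-16,2\n"

# parse_dates converts the named column to datetime64 while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['日付'])

print(df.dtypes)            # the 日付 column is a datetime type, not a string
print(df['日付'].iloc[-1])   # last date in the data
```

If the date column comes back as object (string) instead, pd.to_datetime can convert it afterwards.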

I had no idea how to draw a graph, but I managed to do it.
395日間死亡数 (1).png

I referred to the page published by this person:
https://oku.edu.mie-u.ac.jp/~okumura/python/COVID-19.html
Since I didn't know how to plot a graph, I copied code that was able to draw one, ran it as it was, and then turned it into a template.
Judging from that program, the author seems to have been working on visualizing the coronavirus situation since the early days, before any open data was available.

What does this graph show? The daily number of deaths from coronavirus infection.
Oddly, the totals in the MHLW data and in the NHK data differ. If the numbers were the same, the bars would match exactly when overlaid.

Then I looked at another open data project that uses the NHK figures, and found this:

Data Discrepancies
Our data can sometimes disagree with MHLW or prefectural governments because of different policies we are using to input the data. Here are our known discrepancies and why:

  • National death counts. We are counting more deaths than MHLW. Our death counts are aligned with NHK's reporting. We are unclear why MHLW is reporting less deaths.

I had thought it was noise or my own mistake, but apparently it wasn't.

So I want to look at the difference between the two.

The difference shows up from April to early May 2020.
So I would like a close-up of a few weeks that includes that period. That is what the title refers to.

From the 425 days of data, I want to draw only the 30 days I need (the range) on the graph.

First of all, the program code for drawing the first graph was like this.

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd

#mhlw1 = pd.read_csv("https://www.mhlw.go.jp/content/pcr_positive_daily.csv", parse_dates=['日付'])
mhlw2 = pd.read_csv("https://www.mhlw.go.jp/content/death_total.csv", parse_dates=['日付'])
#mhlw3 = pd.read_csv("https://www.mhlw.go.jp/content/pcr_tested_daily.csv", parse_dates=['日付'])
#mhlw4 = pd.read_csv("https://www.mhlw.go.jp/content/severe_daily.csv", parse_dates=['日付'])


#nhk = pd.read_csv('https://www3.nhk.or.jp/n-data/opendata/coronavirus/nhk_news_covid19_domestic_daily_data.csv', parse_dates=['日付'])
# I use downloaded csv file
nhk2 = pd.read_csv(r'/content/nhk_news_covid19_domestic_daily_data.csv', parse_dates=['日付'])

print(len(nhk2))# days: 425 days as of 2021-03-16
print(len(mhlw2))# days: 396 days as of 2021-03-16

locator = mdates.AutoDateLocator()
formatter = mdates.ConciseDateFormatter(locator)
fig, ax = plt.subplots(figsize=(15, 3))

ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)

ax.bar(mhlw2['日付'], mhlw2['死亡者数'] - mhlw2['死亡者数'].shift(),width=1,color='red')
ax.bar(nhk2['日付'], nhk2['国内の死者数_1日ごとの発表数'] ,width =0.5, color='blue')

The NHK data (https://www3.nhk.or.jp/n-data/opendata/coronavirus/nhk_news_covid19_domestic_daily_data.csv) was like this.

            日付  国内の感染者数_1日ごとの発表数  国内の感染者数_累計  国内の死者数_1日ごとの発表数  国内の死者数_累計
0   2020-01-16                 1           1                0          0
1   2020-01-17                 0           1                0          0
2   2020-01-18                 0           1                0          0
3   2020-01-19                 0           1                0          0
4   2020-01-20                 0           1                0          0
..         ...               ...         ...              ...        ...
420 2021-03-11              1317      444332               45       8464
421 2021-03-12              1271      445603               58       8522
422 2021-03-13              1320      446923               51       8573
423 2021-03-14               988      447911               21       8594
424 2021-03-15               695      448606               38       8632

As a caveat, when I read the CSV directly from the URL in Google Colaboratory or Jupyter Notebook, an error occurred partway through, so I first downloaded it with wget and then used the local file.

!wget 'https://www3.nhk.or.jp/n-data/opendata/coronavirus/nhk_news_covid19_domestic_daily_data.csv'

Read from downloaded file

pd.read_csv(r'/content/nhk_news_covid19_domestic_daily_data.csv', parse_dates=['日付'])

That is the path to the file. I thought the directory separators might not be recognized unless I loaded some special module, but with the r prefix there was no problem.
About r
https://stackoverflow.com/questions/19034822/unknown-python-expression-filename-r-path-to-file

A string literal prefixed with r is a raw string, so a backslash (\) in it is kept as a literal character instead of starting an escape sequence. A forward slash (/) is never an escape character, so for a path like the one above the r prefix is harmless but not strictly required.
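A small sketch of the difference, for anyone who wants to check it in a notebook. The Windows-style path is hypothetical and only there to show a case where the r prefix matters.

```python
# Forward slashes are never escape characters, so both spellings are identical.
posix_plain = '/content/nhk_news_covid19_domestic_daily_data.csv'
posix_raw = r'/content/nhk_news_covid19_domestic_daily_data.csv'
print(posix_plain == posix_raw)

# Backslashes differ: '\n' is a single newline character, r'\n' is two characters.
print(len('\n'), len(r'\n'))

# Hypothetical Windows-style path: without r, the '\n' in '\new_file' would become a newline.
windows_raw = r'C:\data\new_file.csv'
print(windows_raw)
```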

The MHLW deaths data (death_total.csv) holds the cumulative total for each day.

            日付  死亡者数
0   2020-02-14     1
1   2020-02-15     1
2   2020-02-16     1
3   2020-02-17     1
4   2020-02-18     1
..         ...   ...
391 2021-03-11  8449
392 2021-03-12  8507
393 2021-03-13  8558
394 2021-03-14  8588
395 2021-03-15  8620

With a pandas DataFrame, shift() gives access to the previous row's value, so the following should turn the cumulative totals into the number of deaths reported on each day.

mhlw2['死亡者数'] - mhlw2['死亡者数'].shift()

I didn't know the pandas way at first, so to check the daily numbers derived from the cumulative totals I did it like this.

for i, total in enumerate(mhlw2['死亡者数']):
    d = 0  # the first row has no previous day
    if i > 0:
        d = total - mhlw2['死亡者数'][i-1]  # today's cumulative total minus yesterday's
    print(d)

The cumulative total for the previous day is subtracted from the cumulative total for the day in question. For example, a value of 10 does not mean that 10 people died on that day; it means that 10 people had died up to and including that day. If the previous day's cumulative value is 1, then 9 people died on that day.

As a caveat, shift() moves the values of the column down by one row, so the first row of the shifted column has no value and becomes NaN.

Adding fillna(0) fills it with zero. However, I want the value from just before the range, so I fill it with the cumulative total for the day before the start of the range (here, float(mhlw00['死亡者数'])).

mhlw['死亡者数'] - mhlw['死亡者数'].shift().fillna(float(mhlw00['死亡者数']))
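The effect of that fill value is easier to see on a tiny example. This is only a sketch with made-up cumulative totals; the names in_range, day_before and previous_total are hypothetical, and it uses .iloc[0] to pull the scalar out of the one-row selection, which newer pandas versions prefer over calling float() on the Series itself.

```python
import pandas as pd

# Made-up cumulative totals, one row per day.
df = pd.DataFrame({
    '日付': pd.date_range('2020-04-13', periods=5, freq='D'),
    '死亡者数': [100, 104, 110, 113, 120],
})

start = pd.Timestamp('2020-04-15')
in_range = df[df['日付'] >= start]                            # rows inside the range
day_before = df[df['日付'] == start - pd.Timedelta(days=1)]   # the single row just before it

# Cumulative total on the day before the range starts.
previous_total = float(day_before['死亡者数'].iloc[0])

# Without a fill value the first difference is NaN.
naive = in_range['死亡者数'] - in_range['死亡者数'].shift()
# With the previous day's total filled in, the first day gets its real count.
filled = in_range['死亡者数'] - in_range['死亡者数'].shift().fillna(previous_total)

print(naive.tolist())
print(filled.tolist())
```

With fillna(0) instead, the first bar would show the whole cumulative total (110 here) rather than the count for that one day.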

The line gets long, but the idea is simple: cumulative total to date minus cumulative total to yesterday. That is all I want to do.

Also, shift() puts NaN in the first row, and because a plain int column cannot hold NaN, pandas converts the shifted values from int to float. If the NaN is left in place, the first difference is also NaN and nothing is calculated for that day, which is why the fill value is passed as a float that matches the shifted column.
The difference in behavior can be confirmed like this.

print('-'*55)
print(mhlw2[mask0]['死亡者数'] - mhlw2[mask0]['死亡者数'].shift())
print('-'*55)
print(mhlw2[mask0]['死亡者数'] - mhlw2[mask0]['死亡者数'].shift().fillna(float(mhlw2[mask00]['死亡者数'])))

mhlw2['死亡者数'].diff() should give the same result without using shift().
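A quick way to convince yourself, using a made-up cumulative series. The second part is one possible alternative, not what the code in this article does: take the difference over the whole series first and slice afterwards, so the first row of the slice already has its value.

```python
import pandas as pd

cumulative = pd.Series([1, 1, 3, 6, 10])  # made-up cumulative totals

by_shift = cumulative - cumulative.shift()
by_diff = cumulative.diff()
print(by_shift.equals(by_diff))  # both have NaN in the first row and equal values after it

# Alternative: difference first, slice second. No fill value is needed.
daily = cumulative.diff()
window = daily.iloc[2:]
print(window.tolist())
```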

Also, check in advance where the data ends. The last row matters.
The -1 index that means "last" for a list also works with iloc on a pandas DataFrame, so let's look at it in the form mhlw2.iloc[-1].

print('NHK:',nhk2.iloc[-1])
print('-'*53)
print('MHLW:',mhlw2.iloc[-1])

Draw a graph for a limited range

Next, I want a graph for a limited period. In other words, I thought about how to ask matplotlib, which graphs the data for the entire period, to graph only one part of it, and came up with the following. It works, but there may well be other ways that I don't know of. This method uses to_datetime so that the dates can be compared as times, and builds the range from them.

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd

#mhlw1 = pd.read_csv("https://www.mhlw.go.jp/content/pcr_positive_daily.csv", parse_dates=['日付'])
mhlw2 = pd.read_csv("https://www.mhlw.go.jp/content/death_total.csv", parse_dates=['日付'])
#mhlw3 = pd.read_csv("https://www.mhlw.go.jp/content/pcr_tested_daily.csv", parse_dates=['日付'])
#mhlw4 = pd.read_csv("https://www.mhlw.go.jp/content/severe_daily.csv", parse_dates=['日付'])


#nhk2 = pd.read_csv('https://www3.nhk.or.jp/n-data/opendata/coronavirus/nhk_news_covid19_domestic_daily_data.csv', parse_dates=['日付'])
#I use downloaded file
nhk2 = pd.read_csv(r'/content/nhk_news_covid19_domestic_daily_data.csv', parse_dates=['日付'])

#print(len(nhk2)) # total days
#print(len(mhlw2)) # total days
#print('-'*53)
#print('NHK:',nhk2.iloc[-1]) # show last data
#print('-'*53)
#print('MHLW:',mhlw2.iloc[-1]) # show last data


locator = mdates.AutoDateLocator()
formatter = mdates.ConciseDateFormatter(locator)
fig, ax = plt.subplots(figsize=(10, 3))

ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)

mhlw2['日付'] = pd.to_datetime(mhlw2['日付'])
nhk2['日付'] = pd.to_datetime(nhk2['日付'])


start = '2020-04-15'
end   = '2020-05-15'

mask0 = (mhlw2['日付'] >= pd.Timestamp(start)) & \
       (mhlw2['日付'] <= pd.Timestamp(end))
mask1 = (nhk2['日付'] >= pd.Timestamp(start)) & \
       (nhk2['日付'] <= pd.Timestamp(end))

import pandas.tseries.offsets as offsets
mask00 = (mhlw2['日付'] >= pd.Timestamp(start)+offsets.Day(-1)) &\
         (mhlw2['日付'] < pd.Timestamp(start))

mhlw00 = mhlw2[mask00]# 1day before the range
#print(mhlw2[mask00]['死亡者数'])
#print('-'*55)
#print(mhlw2[mask0]['死亡者数'] - mhlw2[mask0]['死亡者数'].shift())
#print('-'*55) ## compare
#print(mhlw2[mask0]['死亡者数'] - mhlw2[mask0]['死亡者数'].shift().fillna(float(mhlw2[mask00]['死亡者数'])))
mhlw = mhlw2[mask0]
nhk = nhk2[mask1]

#ax.plot(mhlw3['日付'], mhlw3['PCR 検査実施件数(単日)'],color='pink')
#ax.bar(mhlw1['日付'], mhlw1['PCR 検査陽性者数(単日)'],color='blue')
#ax.bar(mhlw4['日付'], mhlw4['重症者数'],color='green')
#ax.bar(mhlw2['日付'], mhlw2['死亡者数'] - mhlw2['死亡者数'].shift(),color='red')
ax.bar(mhlw['日付'], mhlw['死亡者数'] - mhlw['死亡者数'].shift().fillna(float(mhlw00['死亡者数'])),color='red')
ax.bar(nhk['日付'], nhk['国内の死者数_1日ごとの発表数'] ,width =0.3, color='blue')

When I read the MHLW open data and the NHK CSV into a pandas DataFrame, the '日付' column seemed to be a string type, so I convert it. (If parse_dates already produced datetime values, this call leaves them unchanged.)
mhlw2['日付'] = pd.to_datetime(mhlw2['日付'])

Next, (mhlw2['日付'] >= pd.Timestamp(start)) & (mhlw2['日付'] <= pd.Timestamp(end)) is a Boolean operation.
mask0 holds True for every row that satisfies start <= 日付 <= end. The technique is usually called Boolean masking (or Boolean indexing). For more details, searching for "pandas", "period", and "masking" should turn something up.

(mhlw2['日付'] >= pd.Timestamp(start)) , (mhlw2['日付'] <= pd.Timestamp(end))
The logical AND that requires both of these conditions is written with the & operator. Note that it has to be &, not and: and tries to reduce each whole Series to a single True or False, which pandas rejects with an error, whereas & compares row by row. Each comparison also needs its own parentheses because & binds more tightly than >= and <=. This is an easy place to stumble, so it is worth looking into properly.
The mask0 built this way is then applied as mhlw = mhlw2[mask0], and the result is passed to the subsequent ax.bar() call.
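Here is a minimal sketch of the masking step on made-up data. The column 値 is hypothetical, and Series.between is shown only as another way to build the same inclusive range, not as what the code above does.

```python
import pandas as pd

df = pd.DataFrame({
    '日付': pd.date_range('2020-04-13', periods=6, freq='D'),
    '値': range(6),  # hypothetical column
})

start = pd.Timestamp('2020-04-14')
end = pd.Timestamp('2020-04-16')

# Row-by-row AND of two comparisons; each comparison needs its own parentheses.
mask = (df['日付'] >= start) & (df['日付'] <= end)
print(mask.tolist())
print(df[mask])

# Series.between includes both ends by default and builds the same mask.
same = df['日付'].between(start, end)
print(mask.equals(same))

# `and` asks for the truth value of a whole Series, which pandas refuses to give.
try:
    (df['日付'] >= start) and (df['日付'] <= end)
except ValueError as exc:
    print('and fails:', exc)
```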

1月間死亡数.png

You can see the graph of the period from start = '2020-04-15' to end = '2020-05-15'.

If you change this part, the range changes. Here the range runs from January 1 to March 15, 2021.

start = '2021-01-01'
end   = '2021-03-15'

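When align='edge' is specified, each bar is drawn from the edge of its date position instead of being centered on it. A negative width extends the bar to the left and a positive width to the right, so the two series appear side by side.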

ax.bar(mhlw['日付'], mhlw['死亡者数'] - mhlw['死亡者数'].shift().fillna(float(mhlw00['死亡者数'])),width =-0.3,align='edge',color='red')
ax.bar(nhk['日付'], nhk['国内の死者数_1日ごとの発表数'] ,width =0.3,align='edge', color='blue')

graph (2).png

If you add a label to each bar series and call plt.legend(), it looks more like a proper graph. (MHLW stands for Ministry of Health, Labour and Welfare.)

ax.bar(mhlw['日付'], mhlw['死亡者数'] - mhlw['死亡者数'].shift().fillna(float(mhlw00['死亡者数'])),label='MHLW death',width =-0.3,align='edge',color='red')
ax.bar(nhk['日付'], nhk['国内の死者数_1日ごとの発表数'],label='NHK death',width =0.3,align='edge', color='blue')

plt.legend()
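The side-by-side bars and the legend can be tried on their own with a few made-up values. This is only a sketch: the series names and counts are hypothetical, and the Agg backend is selected so that it also runs where no display is available (in a notebook that line can be dropped).

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range('2020-04-15', periods=3, freq='D')
series_a = [3, 5, 2]  # made-up daily counts
series_b = [4, 4, 1]

fig, ax = plt.subplots(figsize=(6, 3))
# Negative width with align='edge': the bar extends to the left of the date.
bars_a = ax.bar(dates, series_a, width=-0.3, align='edge', color='red', label='series A')
# Positive width with align='edge': the bar extends to the right of the date.
bars_b = ax.bar(dates, series_b, width=0.3, align='edge', color='blue', label='series B')
# The legend picks up the label given to each bar() call.
legend = ax.legend()
```

ax.legend() and plt.legend() do the same thing here, because plt.legend() acts on the current axes.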

graph (3).png

If you compare the NHK data with the MHLW data, you will notice a big difference in the number of deaths around 2020-04-22 and 2020-05-08. I wondered whether it was just a difference in announcement timing, so I looked closely: for 2020-04-22, NHK has 16 deaths but the MHLW data has 91; for 2020-05-08, NHK has 16 deaths but the MHLW data shows 49.
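Instead of reading the differences off the graph, the two sources can also be lined up by date. A minimal sketch, assuming the column names from the two CSV files; the values are made up and the names mhlw_daily, gap and disagree are hypothetical.

```python
import pandas as pd

# Made-up values; only the column names follow the two CSV files.
mhlw = pd.DataFrame({
    '日付': pd.to_datetime(['2020-04-21', '2020-04-22', '2020-04-23']),
    '死亡者数': [10, 15, 18],                     # cumulative totals
})
nhk = pd.DataFrame({
    '日付': pd.to_datetime(['2020-04-21', '2020-04-22', '2020-04-23']),
    '国内の死者数_1日ごとの発表数': [2, 3, 3],      # daily counts
})

# Turn the cumulative column into daily counts, then join the two sources on the date.
mhlw_daily = mhlw.assign(mhlw_daily=mhlw['死亡者数'].diff())
merged = mhlw_daily.merge(nhk, on='日付', how='inner')
merged['gap'] = merged['mhlw_daily'] - merged['国内の死者数_1日ごとの発表数']

# Keep only the dates where the two daily counts disagree (NaN rows drop out).
disagree = merged.loc[merged['gap'].abs() > 0, ['日付', 'mhlw_daily', 'gap']]
print(disagree)
```

An inner join keeps only dates present in both files, which matters here because the NHK data starts about a month earlier than the MHLW data.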

Rather than only plotting it myself, I looked for another graph that uses the MHLW data to see what was going on.
The one published by the Toyo Keizai Online editorial department explains in its Frequently Asked Questions: "Because it includes cases that the Ministry of Health, Labour and Welfare is confirming from April 22nd, and because the data source has changed from May 8th." In that graph, the death toll for 2020-04-22 is filled with zero.

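Toyo Keizai Online "Coronavirus Disease (COVID-19) Situation Report in Japan"
Status of novel coronavirus infections in Japan
A visualization, based on MHLW press releases, of the currently confirmed status of novel coronavirus disease (COVID-19) in Japan.
Production and operation: Toyo Keizai Online editorial department
Last updated: March 16, 2021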
Data source: MHLW Open Data. On and after 22 April, cases which MHLW is still confirming reports from prefectures are included. On and after 8 May, the data source was changed. The number of deaths on June 19 includes the 13 patients who were PCR tested positive but died from diseases other than COVID-19 before that day in Saitama. See this link or notes below for further information.

