目標
こういう感じに、年、地域、サブ地域、国、国番号、人口が書いてある表が欲しい時どうするかという話。
LocID,Location,Time,PopTotal,SubRegName,GeoRegName
392,Japan,2019,126860299,Eastern Asia,Asia
結論
https://population.un.org/wpp/ から必要なデータをダウンロード出来ます。
- All variants (CSV, 21.35 MB) WPP2019_TotalPopulationBySex.csv より、人口データをダウンロード出来ます。
- Locations (XLSX, 131 KB) WPP2019_F01_LOCATIONS.XLSX より、国コードのデータをダウンロード出来ます。
WPP2019_TotalPopulationBySex.csv の中身はこんな感じです。
LocID,Location,VarID,Variant,Time,MidPeriod,PopMale,PopFemale,PopTotal,PopDensity
...
392,Japan,2,Medium,1950,1950.5,40602.499,42199.585,82802.084,227.132
392,Japan,2,Medium,1951,1951.5,41380.556,42935.709,84316.265,231.285
- LocID: ISO 3166-1 の数字による国コード
- Location: 国や地域の名前
- VarID, Variant: 未来の値について出生率や死亡率や移民率を色々変えて予測している。標準は Medium。
- Time,MidPeriod: 年
- PopMale,PopFemale,PopTotal: 男女別及び総人口 (1000人)
- PopDensity: 1km^2 あたり人口密度 (1000人)
この LocID や Location の項目には国名の他に Asia や East Asia のような地域名も含まれているため、このままでは使えないです。地域名を除き国だけに揃えるために WPP2019_F01_LOCATIONS.XLSX を参照します。このファイルには
- Location
- DB
- NOTES
の3つのシートがあり、そのうち DB がちょうど機械読み取りしやすい形になっています。例えば Japan の項目は
Index, Location, Notes, LocID, ISO3_Code, LocType, LocTypeName, ParentID, WorldID, SubRegID, SubRegName, SDGSubRegID, SDGSubRegName, SDGRegID, SDGRegName, GeoRegID, GeoRegName
133, Japan, 392, JPN, 4, Country/Area, 906, 900, 906, Eastern Asia, 1832, Eastern and South-Eastern Asia, 935, Asia
です。という事で、次の事が分かります。
- LocType = 4 または LocTypeName = Country/Area の項目は国を表す。
- SubRegName はその国のサブ地域を表す。(例: Eastern Asia)
- GeoRegName はその国の地域を表す。(例: Asia)
Jupyter でデータを作る
(ノートブックのリンク: https://colab.research.google.com/drive/160xZ5tAGKb1enC0LU2JYEOA6m3l3w1cn?usp=sharing)
ここまで調べてようやく作業開始です。まず WPP2019_TotalPopulationBySex.csv の読み込み。
import pandas as pd
population_src = pd.read_csv("WPP2019_TotalPopulationBySex.csv")
population_src.head()
LocID | Location | VarID | Variant | Time | MidPeriod | PopMale | PopFemale | PopTotal | PopDensity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 4099.243 | 3652.874 | 7752.117 | 11.874 |
1 | 4 | Afghanistan | 2 | Medium | 1951 | 1951.5 | 4134.756 | 3705.395 | 7840.151 | 12.009 |
2 | 4 | Afghanistan | 2 | Medium | 1952 | 1952.5 | 4174.450 | 3761.546 | 7935.996 | 12.156 |
3 | 4 | Afghanistan | 2 | Medium | 1953 | 1953.5 | 4218.336 | 3821.348 | 8039.684 | 12.315 |
4 | 4 | Afghanistan | 2 | Medium | 1954 | 1954.5 | 4266.484 | 3884.832 | 8151.316 | 12.486 |
必要な情報だけ抜き出す。
population = population_src[population_src.Variant == "Medium"][["LocID", "Location", "Time", "PopTotal"]]
population["PopTotal"] = (population["PopTotal"] * 1000).astype(int)
population
LocID | Location | Time | PopTotal | |
---|---|---|---|---|
0 | 4 | Afghanistan | 1950 | 7752116 |
1 | 4 | Afghanistan | 1951 | 7840151 |
2 | 4 | Afghanistan | 1952 | 7935996 |
3 | 4 | Afghanistan | 1953 | 8039684 |
4 | 4 | Afghanistan | 1954 | 8151316 |
WPP2019_F01_LOCATIONS.XLSX の読み込み。
locations_src = pd.read_excel('WPP2019_F01_LOCATIONS.XLSX', sheet_name="DB")
locations_src.head()
Index | Location | Notes | LocID | ISO3_Code | LocType | LocTypeName | ParentID | WorldID | SubRegID | SubRegName | SDGSubRegID | SDGSubRegName | SDGRegID | SDGRegName | GeoRegID | GeoRegName | MoreDev | LessDev | LeastDev | oLessDev | LessDev_ExcludingChina | LLDC | SIDS | WB_HIC | WB_MIC | WB_UMIC | WB_LMIC | WB_LIC | WB_NoIncomeGroup | MaxHIV_Male | MaxHIV_Female | MaxHIV_BothSexes | YearMaxHIV_BothSexes | HIVAIDSMortalityImpact_AgePattern | HIVAIDSMortalityImpact_e0 | TotPop2019LessThan90k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | WORLD | NaN | 900 | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2 | UN development groups | a | 1803 | NaN | 25.0 | Label/Separator | 900 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | More developed regions | b | 901 | NaN | 5.0 | Development group | 1803 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Less developed regions | c | 902 | NaN | 5.0 | Development group | 1803 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Least developed countries | d | 941 | NaN | 5.0 | Development group | 902 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
必要な情報だけ抜き出す。
location = locations_src[locations_src.LocType == 4][["LocID", "SubRegName", "GeoRegName"]]
location.head()
LocID | SubRegName | GeoRegName | |
---|---|---|---|
26 | 108 | Eastern Africa | Africa |
27 | 174 | Eastern Africa | Africa |
28 | 262 | Eastern Africa | Africa |
29 | 232 | Eastern Africa | Africa |
30 | 231 | Eastern Africa | Africa |
人口データと国データを結合します。
population_by_countries = population.merge(location)
population_by_countries.head()
LocID | Location | Time | PopTotal | SubRegName | GeoRegName | |
---|---|---|---|---|---|---|
0 | 4 | Afghanistan | 1950 | 7752116 | Southern Asia | Asia |
1 | 4 | Afghanistan | 1951 | 7840151 | Southern Asia | Asia |
2 | 4 | Afghanistan | 1952 | 7935996 | Southern Asia | Asia |
3 | 4 | Afghanistan | 1953 | 8039684 | Southern Asia | Asia |
4 | 4 | Afghanistan | 1954 | 8151316 | Southern Asia | Asia |
日本のデータを探してみます。
population_by_countries[(population_by_countries.Location == "Japan") & (population_by_countries.Time == 2019)]
LocID | Location | Time | PopTotal | SubRegName | GeoRegName | |
---|---|---|---|---|---|---|
16377 | 392 | Japan | 2019 | 126860299 | Eastern Asia | Asia |
良さそうなので保存します。
population_by_countries.to_csv("population_by_countries.csv", index=False)
!head population_by_countries.csv
LocID,Location,Time,PopTotal,SubRegName,GeoRegName
4,Afghanistan,1950,7752116,Southern Asia,Asia
4,Afghanistan,1951,7840151,Southern Asia,Asia
4,Afghanistan,1952,7935996,Southern Asia,Asia
4,Afghanistan,1953,8039684,Southern Asia,Asia
4,Afghanistan,1954,8151316,Southern Asia,Asia
4,Afghanistan,1955,8270992,Southern Asia,Asia
4,Afghanistan,1956,8398873,Southern Asia,Asia
4,Afghanistan,1957,8535157,Southern Asia,Asia
4,Afghanistan,1958,8680097,Southern Asia,Asia
おまけの苦労話
- 最初 日本語版 Wikipedia や WolframAlpha で国の人口順リストを調べると http://data.un.org/ というサイトが出てくるのでここから引っ張ってこようと思った。
- Population, surface area and density のリンクにある CSV は機械読み込み出来る感じがしない。
- どうやらこれは Statistical Yearbook という本の抜粋という事が分かった。
- Yearbook の Annex I によると、このデータの国番号や国名は United Nations Standard Country Codes M 49 と呼ぶ事が分かった。M49 の解説は https://unstats.un.org/unsd/methodology/m49/ に見つかった。これによると、次のような階層構造になっている。
- Global Name: World (1)
- Region Name: Asia (142)
- Sub-region Name: Eastern Asia (30)
- Intermediate Region Name: なし
- Country or Area: Japan (392)
- M49 とは ISO 3166-1 の元になった規格らしい。
- M49 の機械読み込み出来る形式のファイルが見つからない。一応 CSV のリンクはあるが、国名にコンマが入っていてカラムがずれている。
- なんか嫌な気分になった。
- 他の例を探すと datahub.io では data.worldbank.org のデータを使っていた。
- world population という語で検索すると https://www.worldometers.info/world-population/ が出てくる。
- 資料として https://population.un.org/wpp/ を使っていた!
- どうもダウンロードして使うデータとしてはこちらのほうが治安が良さそうだ。
ということで、たかが世界人口というありふれたデータを取得するだけのためにめちゃくちゃ時間を使ってしまったという事をここに記録する。