3
3

More than 3 years have passed since last update.

世界の国別の人口を知りたい。

Posted at

目標

Screen Shot 2020-05-09 at 0.11.02.png

こういう感じに、年、地域、サブ地域、国、国番号、人口が書いてある表が欲しい時どうするかという話。

LocID,Location,Time,PopTotal,SubRegName,GeoRegName
392,Japan,2019,126860299,Eastern Asia,Asia

結論

https://population.un.org/wpp/ から必要なデータをダウンロード出来ます。

WPP2019_TotalPopulationBySex.csv の中身はこんな感じです。

LocID,Location,VarID,Variant,Time,MidPeriod,PopMale,PopFemale,PopTotal,PopDensity
...
392,Japan,2,Medium,1950,1950.5,40602.499,42199.585,82802.084,227.132
392,Japan,2,Medium,1951,1951.5,41380.556,42935.709,84316.265,231.285
  • LocID: ISO 3166-1 の数字による国コード
  • Location: 国や地域の名前
  • VarID, Variant: 未来の値について出生率や死亡率や移民率を色々変えて予測している。標準は Medium。
  • Time,MidPeriod: 年
  • PopMale,PopFemale,PopTotal: 男女別及び総人口 (1000人)
  • PopDensity: 1km^2 あたり人口密度 (1000人)

この LocID や Location の項目には国名の他に Asia や East Asia のような地域名も含まれているため、このままでは使えないです。地域名を除き国だけに揃えるために WPP2019_F01_LOCATIONS.XLSX を参照します。このファイルには

  • Location
  • DB
  • NOTES

の3つのシートがあり、そのうち DB がちょうど機械読み取りしやすい形になっています。例えば Japan の項目は

Index, Location, Notes, LocID, ISO3_Code, LocType, LocTypeName, ParentID, WorldID, SubRegID, SubRegName, SDGSubRegID, SDGSubRegName, SDGRegID, SDGRegName, GeoRegID, GeoRegName
133, Japan, 392, JPN, 4, Country/Area, 906, 900, 906, Eastern Asia, 1832, Eastern and South-Eastern Asia, 935, Asia

です。という事で、次の事が分かります。

  • LocType = 4 または LocTypeName = Country/Area の項目は国を表す。
  • SubRegName はその国のサブ地域を表す。(例: Eastern Asia)
  • GeoRegName はその国の地域を表す。(例: Asia)

Jupyter でデータを作る

(ノートブックのリンク: https://colab.research.google.com/drive/160xZ5tAGKb1enC0LU2JYEOA6m3l3w1cn?usp=sharing)

ここまで調べてようやく作業開始です。まず WPP2019_TotalPopulationBySex.csv の読み込み。

import pandas as pd

population_src = pd.read_csv("WPP2019_TotalPopulationBySex.csv")
population_src.head()
LocID Location VarID Variant Time MidPeriod PopMale PopFemale PopTotal PopDensity
0 4 Afghanistan 2 Medium 1950 1950.5 4099.243 3652.874 7752.117 11.874
1 4 Afghanistan 2 Medium 1951 1951.5 4134.756 3705.395 7840.151 12.009
2 4 Afghanistan 2 Medium 1952 1952.5 4174.450 3761.546 7935.996 12.156
3 4 Afghanistan 2 Medium 1953 1953.5 4218.336 3821.348 8039.684 12.315
4 4 Afghanistan 2 Medium 1954 1954.5 4266.484 3884.832 8151.316 12.486

必要な情報だけ抜き出す。

population = population_src[population_src.Variant == "Medium"][["LocID", "Location", "Time", "PopTotal"]]
population["PopTotal"] = (population["PopTotal"] * 1000).astype(int)
population
LocID Location Time PopTotal
0 4 Afghanistan 1950 7752116
1 4 Afghanistan 1951 7840151
2 4 Afghanistan 1952 7935996
3 4 Afghanistan 1953 8039684
4 4 Afghanistan 1954 8151316

WPP2019_F01_LOCATIONS.XLSX の読み込み。

locations_src = pd.read_excel('WPP2019_F01_LOCATIONS.XLSX', sheet_name="DB")
locations_src.head()
Index Location Notes LocID ISO3_Code LocType LocTypeName ParentID WorldID SubRegID SubRegName SDGSubRegID SDGSubRegName SDGRegID SDGRegName GeoRegID GeoRegName MoreDev LessDev LeastDev oLessDev LessDev_ExcludingChina LLDC SIDS WB_HIC WB_MIC WB_UMIC WB_LMIC WB_LIC WB_NoIncomeGroup MaxHIV_Male MaxHIV_Female MaxHIV_BothSexes YearMaxHIV_BothSexes HIVAIDSMortalityImpact_AgePattern HIVAIDSMortalityImpact_e0 TotPop2019LessThan90k
0 1 WORLD NaN 900 NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 UN development groups a 1803 NaN 25.0 Label/Separator 900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 More developed regions b 901 NaN 5.0 Development group 1803 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 Less developed regions c 902 NaN 5.0 Development group 1803 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 Least developed countries d 941 NaN 5.0 Development group 902 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

必要な情報だけ抜き出す。

location = locations_src[locations_src.LocType == 4][["LocID", "SubRegName", "GeoRegName"]]
location.head()
LocID SubRegName GeoRegName
26 108 Eastern Africa Africa
27 174 Eastern Africa Africa
28 262 Eastern Africa Africa
29 232 Eastern Africa Africa
30 231 Eastern Africa Africa

人口データと国データを結合します。

population_by_countries = population.merge(location)
population_by_countries.head()
LocID Location Time PopTotal SubRegName GeoRegName
0 4 Afghanistan 1950 7752116 Southern Asia Asia
1 4 Afghanistan 1951 7840151 Southern Asia Asia
2 4 Afghanistan 1952 7935996 Southern Asia Asia
3 4 Afghanistan 1953 8039684 Southern Asia Asia
4 4 Afghanistan 1954 8151316 Southern Asia Asia

日本のデータを探してみます。

population_by_countries[(population_by_countries.Location == "Japan") & (population_by_countries.Time == 2019)]
LocID Location Time PopTotal SubRegName GeoRegName
16377 392 Japan 2019 126860299 Eastern Asia Asia

良さそうなので保存します。

population_by_countries.to_csv("population_by_countries.csv", index=False)
!head population_by_countries.csv
LocID,Location,Time,PopTotal,SubRegName,GeoRegName
4,Afghanistan,1950,7752116,Southern Asia,Asia
4,Afghanistan,1951,7840151,Southern Asia,Asia
4,Afghanistan,1952,7935996,Southern Asia,Asia
4,Afghanistan,1953,8039684,Southern Asia,Asia
4,Afghanistan,1954,8151316,Southern Asia,Asia
4,Afghanistan,1955,8270992,Southern Asia,Asia
4,Afghanistan,1956,8398873,Southern Asia,Asia
4,Afghanistan,1957,8535157,Southern Asia,Asia
4,Afghanistan,1958,8680097,Southern Asia,Asia

おまけの苦労話

  • 最初 日本語版 Wikipedia や WolframAlpha で国の人口順リストを調べると http://data.un.org/ というサイトが出てくるのでここから引っ張ってこようと思った。
  • Population, surface area and density のリンクにある CSV は機械読み込み出来る感じがしない。
  • どうやらこれは Statistical Yearbook という本の抜粋という事が分かった。
  • Yearbook の Annex I によると、このデータの国番号や国名は United Nations Standard Country Codes M 49 と呼ぶ事が分かった。M49 の解説は https://unstats.un.org/unsd/methodology/m49/ に見つかった。これによると、次のような階層構造になっている。
    • Global Name: World (1)
    • Region Name: Asia (142)
    • Sub-region Name: Eastern Asia (30)
    • Intermediate Region Name: なし
    • Country or Area: Japan (392)
  • M49 とは ISO 3166-1 の元になった規格らしい。
  • M49 の機械読み込み出来る形式のファイルが見つからない。一応 CSV のリンクはあるが、国名にコンマが入っていてカラムがずれている。
  • なんか嫌な気分になった。
  • 他の例を探すと datahub.io では data.worldbank.org のデータを使っていた。
  • world population という語で検索すると https://www.worldometers.info/world-population/ が出てくる。
    • 資料として https://population.un.org/wpp/ を使っていた!
    • どうもダウンロードして使うデータとしてはこちらのほうが治安が良さそうだ。

ということで、たかが世界人口というありふれたデータを取得するだけのためにめちゃくちゃ時間を使ってしまったという事をここに記録する。

3
3
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
3
3