
Visualizing COVID-19-related papers (BERT → UMAP → plotly)

Posted at 2020-03-31

Introduction

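While browsing Twitter, an article about a dataset of COVID-19-related papers came across my feed.
Generously, it provides the bibliographic information of the papers, the text of just under 3,000 papers for non-commercial use, and the complete set in the Metadata file.
On top of that there is a pretrained BERT model (SciBERT), so everything you could ask for is there. Couldn't this be used for more than just coronavirus papers?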

SciBERT is a BERT model trained on scientific text.

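With this much available, I figured I could pull embeddings out of the abstract text right away, so I built it.
The visualization uses plotly (express).
As for the purpose of releasing the dataset, the article above says: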

This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. The corpus will be updated weekly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.

I very much doubt this post can produce any new insight, but I gave it a try anyway.

Getting the data and preprocessing

import pandas as pd
import torch
import plotly.express as px
df=pd.read_csv("https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-27/metadata.csv")

df.shape
# (45774, 17)

df.columns
# ['cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract', 'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_full_text', 'full_text_file', 'url']

Author ranking

image.png
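
The post shows only the resulting chart for the author ranking. A minimal sketch of how such a ranking could be computed is below. It assumes the `authors` field holds names separated by `;` (an assumption about this metadata version, not something the post states), and `rank_authors` and `sample_authors` are hypothetical names used only for illustration.

```python
from collections import Counter


def rank_authors(author_fields, sep=";", top_n=10):
    """Count how many papers each author appears on and return the top entries.

    `sep` is an assumed separator; adjust it to match the actual metadata.
    """
    counts = Counter()
    for field in author_fields:
        if not isinstance(field, str):
            continue  # skip missing values such as None or NaN
        # Use a set so an author listed twice on one paper is counted once.
        names = {name.strip() for name in field.split(sep) if name.strip()}
        counts.update(names)
    return counts.most_common(top_n)


# Toy data standing in for df['authors']
sample_authors = [
    "Li, Wei; Chen, Yu",
    "Chen, Yu; Smith, Jane",
    None,
    "Chen, Yu",
]
ranking = rank_authors(sample_authors, top_n=2)
print(ranking)
```

With the real data, something like `rank_authors(df['authors'])` should work, since missing values in a pandas Series are floats and get skipped.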

Time-series aggregation

newplot (21).png
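
The code behind the time-series chart is not shown either. One way to count papers per year is sketched below. It assumes `publish_time` values are strings containing a four-digit year somewhere (for example `2020-03-27` or `2014 Jul 25`), which is my assumption about the data. `extract_year` and `papers_per_year` are hypothetical helper names.

```python
import re
from collections import Counter

# First four-digit year starting with 19 or 20 (assumed to be the publication year)
YEAR_RE = re.compile(r"(19|20)\d{2}")


def extract_year(publish_time):
    """Return the year as an int, or None when no year can be found."""
    if not isinstance(publish_time, str):
        return None  # missing values such as None or NaN
    match = YEAR_RE.search(publish_time)
    return int(match.group(0)) if match else None


def papers_per_year(publish_times):
    """Count papers per publication year, sorted by year."""
    years = (extract_year(value) for value in publish_times)
    counts = Counter(year for year in years if year is not None)
    return dict(sorted(counts.items()))


# Toy data standing in for df['publish_time']
sample_times = ["2020-03-27", "2014 Jul 25", "2003", None, "2020-01-15"]
counts = papers_per_year(sample_times)
print(counts)
```

The same `extract_year` helper could also be used to derive a per-paper year column, for example with `df['publish_time'].apply(extract_year)`.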

Embedding with BERT

This took about 5 hours on a Colaboratory GPU.
I gratefully use huggingface's transformers. SciBERT is also packaged so that it can be loaded through huggingface, which is much appreciated.

!pip install transformers
!git clone https://github.com/huggingface/transformers.git
!mkdir output

Load the SciBERT tokenizer and model. To get at the hidden layers, set output_hidden_states to True in the config.

from transformers import AutoModel, AutoTokenizer, BertConfig

config = BertConfig.from_pretrained('allenai/scibert_scivocab_uncased')
config.output_hidden_states = True

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased',config=config)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased',config=config)

# For the cased model, use these instead.
# tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
# model = AutoModel.from_pretrained('allenai/scibert_scivocab_cased')

def sentence_embed(sentence):
    """Return the token-averaged hidden state of one text as a 768-element list."""
    try:
        input_ids = torch.tensor(tokenizer.encode(sentence, add_special_tokens=True)).unsqueeze(0)
        # Inference only, so skip gradient tracking
        with torch.no_grad():
            outputs = model(input_ids)
        # outputs[-1] is the tuple of hidden states (index 0 is the embedding layer).
        # Take index 11, drop the batch dimension, and average over tokens.
        embeddings = torch.mean(outputs[-1][11][0], dim=0)
        return embeddings.tolist()
    except Exception:
        # Missing abstracts (NaN) or inputs the model rejects fall back to a zero vector
        return [0] * 768


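# BERT accepts at most 512 tokens (not bytes); cutting each abstract to 500 characters keeps it under that limit.
# Note: hidden-state index 11 is the output of the 11th encoder layer, because index 0 is the embedding layer.
# The final (12th) encoder layer would be index 12.
embed_data=df['abstract'].str[0:500].apply(sentence_embed)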
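
To make the indexing in `sentence_embed` concrete, here is a small NumPy-only sketch of the same pooling step. Random arrays stand in for the model's hidden states, so the numbers are meaningless; only the shapes and the layer indexing are the point. The sizes (12 encoder layers, 768 hidden units) are the usual BERT-base values, which I am assuming apply to SciBERT as well.

```python
import numpy as np

NUM_LAYERS = 12  # encoder layers in a BERT-base-sized model (assumed)
HIDDEN = 768     # hidden size
seq_len = 6      # toy token count, including [CLS] and [SEP]

rng = np.random.default_rng(0)

# Stand-in for the hidden-states tuple: the embedding output plus one entry
# per encoder layer, each with shape (batch, seq_len, hidden).
hidden_states = tuple(
    rng.normal(size=(1, seq_len, HIDDEN)) for _ in range(NUM_LAYERS + 1)
)

layer = hidden_states[11]                     # output of the 11th encoder layer
token_vectors = layer[0]                      # drop the batch dimension -> (seq_len, hidden)
sentence_vector = token_vectors.mean(axis=0)  # average over tokens -> (hidden,)

print(len(hidden_states), token_vectors.shape, sentence_vector.shape)
```

Because the tuple has 13 entries, `hidden_states[11]` is the second-to-last encoder layer and `hidden_states[-1]` is the last one. Either can be pooled the same way.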

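Visualization

Dimensionality reduction with UMAP (or similar)
# Reduce the embeddings to 2-D with UMAP
import numpy as np
import umap
reducer = umap.UMAP()
# embed_data is a Series of 768-element lists; stack it into an (n_papers, 768) array
embedding = reducer.fit_transform(np.array(embed_data.tolist()))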


# To use t-SNE instead:
# from sklearn.manifold import TSNE
# X = np.array(embed_data.tolist())
# X_embedded = TSNE(n_components=2).fit_transform(X)
# X_embedded.shape

Plotting function
import plotly.graph_objects as go
import numpy as np
def drawfig(dataframe):
    x = dataframe['xpos'].values
    y = dataframe['ypos'].values

   
    fig = go.Figure(go.Histogram2dContour(
        x = x,
        y = y,
        #ncontours=50,
        colorscale = 'Jet',
        contours = dict(
            showlabels = True,
            labelfont = dict(
                family = 'Raleway',
                color = 'white'
            )
        ),
        hoverlabel = dict(
            bgcolor = 'white',
            bordercolor = 'black',
            font = dict(
                family = 'Raleway',
                color = 'black'
            )
        ),
        

    ))
 
    # Flag papers whose abstract or title mentions COVID-19 (case-insensitive)
    abstract = dataframe['abstract'].fillna("").str.lower()
    title = dataframe['title'].fillna("").str.lower()
    is_covid19 = (abstract.str.contains("covid-19")
                  | title.str.contains("covid-19")
                  | title.str.contains("coronavirus.*19"))
    df_covid19 = dataframe[is_covid19]
    df_not_covid19 = dataframe[~is_covid19]
    fig.add_scattergl(x=df_not_covid19["xpos"],
                y=df_not_covid19["ypos"],
                mode="markers",
                marker=dict(size=1, color="yellow"),
                #customdata ="<a href=" + df_not_covid19['url']+ "' style='color: rgb(0,0,0)'>" + df_not_covid19["title"].str[0:50] + "…</a>", 
                #hoverinfo="<a href=" + df_not_covid19['url']+ "' style='color: rgb(0,0,0)'>" + df_not_covid19["title"].str[0:50] + "…</a>",
                textsrc = "http://google.com",
                text= "<a href='" + df_not_covid19['url']+ "' style='color: rgb(0,0,0)'>DOI:" +df_not_covid19['doi'] +"<br>"+df_not_covid19["title"]+ "</a>",
                textposition='middle center',
                name="Other"
                )
    
    
    fig.add_scattergl(x=df_covid19["xpos"],
                y=df_covid19["ypos"],
                mode="markers",
                marker=dict(size=2, color="red"),
                #hoverinfo="<a href='" + dataframe['url']+ "' style='color: rgb(0,0,0)'>" + dataframe["title"].str[0:50] + "…</a>",
                text= "<a href='" +df_covid19['url']+ "' style='color: rgb(0,0,0)'>DOI:" +df_covid19['doi'] +"<br>"+ df_covid19["title"]+ "…</a>",
                textposition='middle center',
                name="COVID-19-related"
                )

    
    
    x_disp_range =[-7,13]
    y_disp_range =[-10,11]

    fig.update_xaxes(range=x_disp_range)
    fig.update_yaxes(range=y_disp_range)

    fig.update_layout(
        title=dataframe['year_range'].astype(str).iloc[0]+" red: COVID-19-related, yellow: other",
        height = 800,
        width = 800,
        bargap = 0,
        hovermode = 'closest',
        showlegend = False
    )


    title=dataframe['year_range'].astype(str).iloc[0]
    fig.write_html(title+"heatmap.html")
    #f2 = go.FigureWidget(fig)
    fig.show()
    #return fig

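Visualization results

  • Splitting the papers by publication year into 8 equal-sized groups with qcut (roughly 5,000 papers each) gives the following time ranges.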
df_xypos['year_range']=pd.qcut(df_xypos['publish_year'].astype(int),8)
Categories (8, interval[float64]): [(1950.999, 2003.0] < (2003.0, 2007.0] < (2007.0, 2010.0] <
                                    (2010.0, 2013.0] < (2013.0, 2015.0] < (2015.0, 2017.0] <
                                    (2017.0, 2019.0] < (2019.0, 2020.0]]

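I plotted each period separately. I really wanted an animation, but my skills weren't up to it.
It may be quite hard to hit with a mouse, but each point also links to its paper. It was easier to browse on a smartphone!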
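
The `df_xypos` frame used above is never constructed in this post. The sketch below shows one way it might be assembled from the metadata, the 2-D coordinates, and a per-paper year, and then split by period. `build_xypos` is a hypothetical helper, and synthetic data stand in for the real metadata and UMAP output. It also assumes the rows of the coordinates are in the same order as the metadata rows.

```python
import numpy as np
import pandas as pd


def build_xypos(meta, coords, years, n_bins=8):
    """Attach 2-D coordinates and a quantile-based year range to the metadata.

    meta   : DataFrame with one row per paper
    coords : array of shape (n_papers, 2), e.g. the UMAP output
    years  : publication year per paper, same order as meta
    """
    out = meta.reset_index(drop=True).copy()
    out["xpos"] = coords[:, 0]
    out["ypos"] = coords[:, 1]
    out["publish_year"] = years
    # qcut picks bin edges so that each bin holds roughly the same number of papers
    out["year_range"] = pd.qcut(out["publish_year"].astype(int), n_bins)
    return out


# Synthetic stand-ins: 80 papers, 4 per year from 2001 to 2020
n = 80
rng = np.random.default_rng(0)
meta = pd.DataFrame({"title": [f"paper {i}" for i in range(n)]})
coords = rng.normal(size=(n, 2))
years = np.repeat(np.arange(2001, 2021), 4)

df_xypos = build_xypos(meta, coords, years)

# One sub-frame per period; each could be passed to the plotting function
groups = {str(period): group
          for period, group in df_xypos.groupby("year_range", observed=True)}
for period, group in groups.items():
    print(period, len(group))
```

With the real data, each `group` would also need the `abstract`, `title`, `url`, and `doi` columns that `drawfig` reads, and the loop body would call `drawfig(group)` instead of printing.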

newplot (12).png
https://storage.googleapis.com/yoshino/corona/(1950.999%2C%202003.0%5Dheatmap.html

newplot (13).png
https://storage.googleapis.com/yoshino/corona/(2003.0%2C%202007.0%5Dheatmap.html

newplot (14).png
https://storage.googleapis.com/yoshino/corona/(2007.0%2C%202010.0%5Dheatmap.html

newplot (17).png
https://storage.googleapis.com/yoshino/corona/(2010.0%2C%202013.0%5Dheatmap.html

newplot (15).png
https://storage.googleapis.com/yoshino/corona/(2013.0%2C%202015.0%5Dheatmap.html

image.png
https://storage.googleapis.com/yoshino/corona/(2015.0%2C%202017.0%5Dheatmap.html

newplot (19).png
https://storage.googleapis.com/yoshino/corona/(2017.0%2C%202019.0%5Dheatmap.html

newplot (18).png
https://storage.googleapis.com/yoshino/corona/(2019.0%2C%202020.0%5Dheatmap.html

References

Citation:
COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-20. Retrieved from https://pages.semanticscholar.org/coronavirus-research. Accessed 2020-03-30. doi:10.5281/zenodo.3715505
(CORD-19, 2020)