Parsing and aggregating PubMed XML

Posted at 2019-02-19

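#Background
PubMed is the search site for MEDLINE (a database of medical literature) published by NCBI.
An ordinary search lets you query article fields such as those listed at the link below.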
https://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Search_Field_Descriptions_and

#Contents of the XML downloadable from PubMed
For example, if you want to know where the authors of the following paper are affiliated, you can open Author Information on the article's page.
Rejuvenating exhausted T cells during chronic viral infection.
https://www.ncbi.nlm.nih.gov/pubmed/16469690

The underlying data can be downloaded in XML format via Send to > File > Format XML > Create File at the top right of the page.
The record is actually written as follows.

pubmed_result.xml
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>

<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">16469690</PMID>
        <DateCompleted>
            <Year>2006</Year>
            <Month>03</Month>
            <Day>20</Day>
        </DateCompleted>
        <DateRevised>
            <Year>2017</Year>
            <Month>11</Month>
            <Day>16</Day>
        </DateRevised>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Print">0092-8674</ISSN>
                <JournalIssue CitedMedium="Print">
                    <Volume>124</Volume>
                    <Issue>3</Issue>
                    <PubDate>
                        <Year>2006</Year>
                        <Month>Feb</Month>
                        <Day>10</Day>
                    </PubDate>
                </JournalIssue>
                <Title>Cell</Title>
                <ISOAbbreviation>Cell</ISOAbbreviation>
            </Journal>
            <ArticleTitle>Rejuvenating exhausted T cells during chronic viral infection.</ArticleTitle>
            <Pagination>
                <MedlinePgn>459-61</MedlinePgn>
            </Pagination>
            <Abstract>
                <AbstractText>In a recent paper in Nature, show that the immunoreceptor PD-1 is upregulated by &quot;exhausted&quot; T cells during the chronic phase of viral infection in mice. Remarkably, blocking the interaction between PD-1 and its ligand, PD-L1, reactivates these T cells and reduces viral load.</AbstractText>
            </Abstract>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Okazaki</LastName>
                    <ForeName>Taku</ForeName>
                    <Initials>T</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Immunology and Genomic Medicine, Graduate School of Medicine, Kyoto University, Yoshida-Konoe, Sakyo-ku, Kyoto, 606-8501, Japan.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Honjo</LastName>
                    <ForeName>Tasuku</ForeName>
                    <Initials>T</Initials>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <PublicationTypeList>
                <PublicationType UI="D016428">Journal Article</PublicationType>
                <PublicationType UI="D016420">Comment</PublicationType>
            </PublicationTypeList>
        </Article>
        <MedlineJournalInfo>
            <Country>United States</Country>
            <MedlineTA>Cell</MedlineTA>
            <NlmUniqueID>0413066</NlmUniqueID>
            <ISSNLinking>0092-8674</ISSNLinking>
        </MedlineJournalInfo>
        <ChemicalList>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="D000954">Antigens, Surface</NameOfSubstance>
            </Chemical>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="D051017">Apoptosis Regulatory Proteins</NameOfSubstance>
            </Chemical>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="D018122">B7-1 Antigen</NameOfSubstance>
            </Chemical>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="D060890">B7-H1 Antigen</NameOfSubstance>
            </Chemical>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="C498919">Cd274 protein, mouse</NameOfSubstance>
            </Chemical>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="D008562">Membrane Glycoproteins</NameOfSubstance>
            </Chemical>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="C491383">Pdcd1 protein, mouse</NameOfSubstance>
            </Chemical>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="D010455">Peptides</NameOfSubstance>
            </Chemical>
            <Chemical>
                <RegistryNumber>0</RegistryNumber>
                <NameOfSubstance UI="D061026">Programmed Cell Death 1 Receptor</NameOfSubstance>
            </Chemical>
        </ChemicalList>
        <CitationSubset>IM</CitationSubset>
        <CommentsCorrectionsList>
            <CommentsCorrections RefType="CommentOn">
                <RefSource>Nature. 2006 Feb 9;439(7077):682-7</RefSource>
                <PMID Version="1">16382236</PMID>
            </CommentsCorrections>
        </CommentsCorrectionsList>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName UI="D000818" MajorTopicYN="N">Animals</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D000954" MajorTopicYN="N">Antigens, Surface</DescriptorName>
                <QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
                <QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D051017" MajorTopicYN="N">Apoptosis Regulatory Proteins</DescriptorName>
                <QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
                <QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D018122" MajorTopicYN="N">B7-1 Antigen</DescriptorName>
                <QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
                <QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D060890" MajorTopicYN="N">B7-H1 Antigen</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D018414" MajorTopicYN="N">CD8-Positive T-Lymphocytes</DescriptorName>
                <QualifierName UI="Q000276" MajorTopicYN="Y">immunology</QualifierName>
                <QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D002908" MajorTopicYN="N">Chronic Disease</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D008562" MajorTopicYN="N">Membrane Glycoproteins</DescriptorName>
                <QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
                <QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D051379" MajorTopicYN="N">Mice</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D018448" MajorTopicYN="N">Models, Immunological</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D010455" MajorTopicYN="N">Peptides</DescriptorName>
                <QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
                <QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D061026" MajorTopicYN="N">Programmed Cell Death 1 Receptor</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D015398" MajorTopicYN="N">Signal Transduction</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D014777" MajorTopicYN="N">Virus Diseases</DescriptorName>
                <QualifierName UI="Q000276" MajorTopicYN="Y">immunology</QualifierName>
                <QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
            </MeshHeading>
        </MeshHeadingList>
    </MedlineCitation>
    <PubmedData>
        <History>
            <PubMedPubDate PubStatus="pubmed">
                <Year>2006</Year>
                <Month>2</Month>
                <Day>14</Day>
                <Hour>9</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="medline">
                <Year>2006</Year>
                <Month>3</Month>
                <Day>21</Day>
                <Hour>9</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="entrez">
                <Year>2006</Year>
                <Month>2</Month>
                <Day>14</Day>
                <Hour>9</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
        </History>
        <PublicationStatus>ppublish</PublicationStatus>
        <ArticleIdList>
            <ArticleId IdType="pubmed">16469690</ArticleId>
            <ArticleId IdType="pii">S0092-8674(06)00115-2</ArticleId>
            <ArticleId IdType="doi">10.1016/j.cell.2006.01.022</ArticleId>
        </ArticleIdList>
    </PubmedData>
</PubmedArticle>

</PubmedArticleSet>

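As a rough illustration of how this structure can be read, the sketch below uses `xml.etree.ElementTree` to list each author together with their affiliation. `SAMPLE` is a trimmed copy of the record above, and `list_authors` is a name made up for this example. Note that the second author has no `AffiliationInfo` element, so the affiliation comes back as `None`.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the AuthorList from the record above
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">16469690</PMID>
    <Article>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y">
          <LastName>Okazaki</LastName>
          <ForeName>Taku</ForeName>
          <AffiliationInfo>
            <Affiliation>Department of Immunology and Genomic Medicine, Graduate School of Medicine, Kyoto University, Yoshida-Konoe, Sakyo-ku, Kyoto, 606-8501, Japan.</Affiliation>
          </AffiliationInfo>
        </Author>
        <Author ValidYN="Y">
          <LastName>Honjo</LastName>
          <ForeName>Tasuku</ForeName>
        </Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def list_authors(article):
    """Return a (name, affiliation-or-None) tuple for every Author element."""
    authors = []
    for author in article.findall(".//AuthorList/Author"):
        fore = author.findtext("ForeName", default="")
        last = author.findtext("LastName", default="")
        name = "{} {}".format(fore, last).strip()
        # findtext returns None when the author has no AffiliationInfo
        affiliation = author.findtext("AffiliationInfo/Affiliation")
        authors.append((name, affiliation))
    return authors


article = ET.fromstring(SAMPLE)
for name, affiliation in list_authors(article):
    print(name, "->", affiliation)
```

Counting `Author` elements this way gives the number of listed authors, which can differ from the number of `Affiliation` elements in a record.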

For a given set of search results, I tried to tally the authors' affiliations.

#The actual code

#python 3.6
#Collect the pmid, publication year, author count, and the affiliation countries of the 1st, 2nd, and last authors into a DataFrame

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("pubmed_result.xml") #adjust the path as needed
root = tree.getroot()

rows = []
for article in root.findall("PubmedArticle"):
    #the article's own PMID; matching on Version="1" alone would also pick up
    #the PMIDs of related articles inside CommentsCorrectionsList
    pmid = article.findtext("./MedlineCitation/PMID")

    #publication year of the journal issue (None if the record has no PubDate/Year)
    year = article.findtext(".//JournalIssue/PubDate/Year")

    #take the text after the last comma of each affiliation as the country
    countries = []
    for aff in article.findall(".//Affiliation"):
        affiliation = (aff.text or "").strip().rstrip(".")
        countries.append(affiliation.split(",")[-1].strip())
    #number of Affiliation elements, used here as the author count;
    #authors listed without an affiliation are not counted
    authornumber = len(countries)

    #the columns need separate handling depending on the count
    if authornumber >= 2:
        first, second, last = countries[0], countries[1], countries[-1]
    elif authornumber == 1:
        first = second = last = countries[0]
    else:
        first = second = last = None

    rows.append({"pmid": pmid, "year": year, "authornumber": authornumber,
                 "1st": first, "2nd": second, "last": last})
    print(len(rows)) #optional; just for reassurance that it is progressing

#build the DataFrame once, with the index starting at 1
df = pd.DataFrame(rows, columns=["1st","authornumber","last","pmid","year","2nd"],
                  index=range(1, len(rows) + 1))
df.to_csv("affiliations.csv",index=True,encoding="utf8")

Some data cleaning is still needed after this, but for now this post goes only as far as creating the data sheet.
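What the cleaning looks like depends on the data, so the following is only a minimal sketch under some assumptions: that the raw country strings may carry stray whitespace, a trailing period, or a trailing e-mail address, and that a simple count per country is the goal. `clean_country` is a hypothetical helper and the values in `raw` are made up for illustration.

```python
from collections import Counter


def clean_country(value):
    """Normalize one raw country string; return None if nothing is left."""
    # NaN is the only value that is not equal to itself
    if value is None or value != value:
        return None
    tokens = str(value).split()
    # Drop a trailing e-mail address such as "Japan. someone@example.com"
    if tokens and "@" in tokens[-1]:
        tokens = tokens[:-1]
    text = " ".join(tokens).strip(" .;")
    return text or None


# Toy values shaped like the "1st" column; they are made up for illustration
raw = [" Japan", "Japan. someone@example.com", " USA", None, "", float("nan")]

cleaned = [clean_country(value) for value in raw]
counts = Counter(country for country in cleaned if country is not None)

for country, n in counts.most_common():
    print(country, n)
```

On the data sheet itself, the same helper could be applied with `df["1st"].map(clean_country).value_counts()`. Real affiliation strings are messier than this and will likely need more rules.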
The same approach can be applied in various ways, such as computing the interval between acceptance and publication.
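As one possible sketch of that interval calculation, the code below reads two `PubMedPubDate` entries from `History` and returns the number of days between them. It assumes the record carries a `PubStatus="accepted"` entry (the sample record above does not) and uses the `pubmed` entry as a stand-in for the publication date. `history_date` and `days_between` are hypothetical names, and the dates in `SAMPLE` are made up.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical History block; the dates are made up for illustration
SAMPLE = """<PubmedArticle>
  <PubmedData>
    <History>
      <PubMedPubDate PubStatus="accepted">
        <Year>2018</Year><Month>12</Month><Day>20</Day>
      </PubMedPubDate>
      <PubMedPubDate PubStatus="pubmed">
        <Year>2019</Year><Month>1</Month><Day>15</Day>
        <Hour>6</Hour><Minute>0</Minute>
      </PubMedPubDate>
    </History>
  </PubmedData>
</PubmedArticle>"""


def history_date(article, status):
    """Return the History date with the given PubStatus, or None."""
    path = './/History/PubMedPubDate[@PubStatus="{}"]'.format(status)
    node = article.find(path)
    if node is None:
        return None
    try:
        return datetime.date(int(node.findtext("Year")),
                             int(node.findtext("Month")),
                             int(node.findtext("Day")))
    except (TypeError, ValueError):
        # a missing or non-numeric Year/Month/Day
        return None


def days_between(article, start="accepted", end="pubmed"):
    """Days from the start status to the end status, or None if either is absent."""
    start_date = history_date(article, start)
    end_date = history_date(article, end)
    if start_date is None or end_date is None:
        return None
    return (end_date - start_date).days


print(days_between(ET.fromstring(SAMPLE)))
```

Records without an `accepted` entry yield `None`, so they can be dropped or counted separately when aggregating.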
