#Continued
PubMed is the search site that NCBI provides for MEDLINE, a database of medical literature.
An ordinary search looks through the article fields listed on the following page.
https://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Search_Field_Descriptions_and
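Those fields can also be targeted explicitly by appending a field tag in square brackets to a search term. The snippet below is a minimal sketch of composing such a query string; the helper names `fielded_term` and `build_query` are made up for illustration, and `[Author]` and `[Journal]` are two of the tags described on the page above.

```python
def fielded_term(value, field):
    """Attach a PubMed field tag to one term, quoting multi-word values."""
    if " " in value:
        value = '"{}"'.format(value)
    return "{}[{}]".format(value, field)


def build_query(*terms, operator="AND"):
    """Join already-tagged terms with a boolean operator."""
    return " {} ".format(operator).join(terms)


query = build_query(
    fielded_term("Honjo T", "Author"),
    fielded_term("Cell", "Journal"),
)
print(query)
```

The resulting string can be pasted into the PubMed search box as is.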
#Contents of the XML downloadable from PubMed
Suppose you want to know where the authors of the following paper are affiliated. Opening Author Information on the article page shows it.
Rejuvenating exhausted T cells during chronic viral infection.
https://www.ncbi.nlm.nih.gov/pubmed/16469690
The underlying data can be downloaded as XML via Send to > File > Format XML > Create File at the top right of the page.
The record actually looks like this.
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">16469690</PMID>
<DateCompleted>
<Year>2006</Year>
<Month>03</Month>
<Day>20</Day>
</DateCompleted>
<DateRevised>
<Year>2017</Year>
<Month>11</Month>
<Day>16</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0092-8674</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>124</Volume>
<Issue>3</Issue>
<PubDate>
<Year>2006</Year>
<Month>Feb</Month>
<Day>10</Day>
</PubDate>
</JournalIssue>
<Title>Cell</Title>
<ISOAbbreviation>Cell</ISOAbbreviation>
</Journal>
<ArticleTitle>Rejuvenating exhausted T cells during chronic viral infection.</ArticleTitle>
<Pagination>
<MedlinePgn>459-61</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>In a recent paper in Nature, show that the immunoreceptor PD-1 is upregulated by "exhausted" T cells during the chronic phase of viral infection in mice. Remarkably, blocking the interaction between PD-1 and its ligand, PD-L1, reactivates these T cells and reduces viral load.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Okazaki</LastName>
<ForeName>Taku</ForeName>
<Initials>T</Initials>
<AffiliationInfo>
<Affiliation>Department of Immunology and Genomic Medicine, Graduate School of Medicine, Kyoto University, Yoshida-Konoe, Sakyo-ku, Kyoto, 606-8501, Japan.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Honjo</LastName>
<ForeName>Tasuku</ForeName>
<Initials>T</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D016420">Comment</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>Cell</MedlineTA>
<NlmUniqueID>0413066</NlmUniqueID>
<ISSNLinking>0092-8674</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D000954">Antigens, Surface</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D051017">Apoptosis Regulatory Proteins</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D018122">B7-1 Antigen</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D060890">B7-H1 Antigen</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="C498919">Cd274 protein, mouse</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D008562">Membrane Glycoproteins</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="C491383">Pdcd1 protein, mouse</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D010455">Peptides</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D061026">Programmed Cell Death 1 Receptor</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<CommentsCorrectionsList>
<CommentsCorrections RefType="CommentOn">
<RefSource>Nature. 2006 Feb 9;439(7077):682-7</RefSource>
<PMID Version="1">16382236</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000818" MajorTopicYN="N">Animals</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000954" MajorTopicYN="N">Antigens, Surface</DescriptorName>
<QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
<QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D051017" MajorTopicYN="N">Apoptosis Regulatory Proteins</DescriptorName>
<QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
<QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D018122" MajorTopicYN="N">B7-1 Antigen</DescriptorName>
<QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
<QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D060890" MajorTopicYN="N">B7-H1 Antigen</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D018414" MajorTopicYN="N">CD8-Positive T-Lymphocytes</DescriptorName>
<QualifierName UI="Q000276" MajorTopicYN="Y">immunology</QualifierName>
<QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D002908" MajorTopicYN="N">Chronic Disease</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D008562" MajorTopicYN="N">Membrane Glycoproteins</DescriptorName>
<QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
<QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D051379" MajorTopicYN="N">Mice</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D018448" MajorTopicYN="N">Models, Immunological</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D010455" MajorTopicYN="N">Peptides</DescriptorName>
<QualifierName UI="Q000276" MajorTopicYN="N">immunology</QualifierName>
<QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D061026" MajorTopicYN="N">Programmed Cell Death 1 Receptor</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015398" MajorTopicYN="N">Signal Transduction</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D014777" MajorTopicYN="N">Virus Diseases</DescriptorName>
<QualifierName UI="Q000276" MajorTopicYN="Y">immunology</QualifierName>
<QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>2006</Year>
<Month>2</Month>
<Day>14</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2006</Year>
<Month>3</Month>
<Day>21</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2006</Year>
<Month>2</Month>
<Day>14</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">16469690</ArticleId>
<ArticleId IdType="pii">S0092-8674(06)00115-2</ArticleId>
<ArticleId IdType="doi">10.1016/j.cell.2006.01.022</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
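Note that each `Affiliation` sits inside an `Author` element, and that an author may have none at all (the second author here). As a minimal sketch of reading that structure with `xml.etree.ElementTree`, the following parses a trimmed copy of the `AuthorList` above; the function name `list_affiliations` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the AuthorList from the record above
SAMPLE = """
<AuthorList CompleteYN="Y">
  <Author ValidYN="Y">
    <LastName>Okazaki</LastName>
    <ForeName>Taku</ForeName>
    <Initials>T</Initials>
    <AffiliationInfo>
      <Affiliation>Department of Immunology and Genomic Medicine, Graduate School of Medicine, Kyoto University, Yoshida-Konoe, Sakyo-ku, Kyoto, 606-8501, Japan.</Affiliation>
    </AffiliationInfo>
  </Author>
  <Author ValidYN="Y">
    <LastName>Honjo</LastName>
    <ForeName>Tasuku</ForeName>
    <Initials>T</Initials>
  </Author>
</AuthorList>
"""


def list_affiliations(author_list):
    """Return (name, affiliation or None) for each Author element."""
    result = []
    for author in author_list.findall("./Author"):
        fore = author.findtext("ForeName", default="")
        last = author.findtext("LastName", default="")
        name = "{} {}".format(fore, last).strip()
        # findtext returns None when the author has no AffiliationInfo
        affiliation = author.findtext("./AffiliationInfo/Affiliation")
        result.append((name, affiliation))
    return result


authors = list_affiliations(ET.fromstring(SAMPLE))
for name, affiliation in authors:
    print(name, "->", affiliation)
```

Going author by author, rather than collecting every `Affiliation` in the record, keeps authors without an affiliation from silently dropping out of the count.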
I tried to tabulate author affiliations for a particular set of search results.
#The actual code
#python 3.6
#Build a dataframe of pmid, publication year, author count, and the affiliation country of the 1st, 2nd, and last authors
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("pubmed_result.xml") #adjust the path as needed
root = tree.getroot()

def country_of(author):
    #last comma-separated part of the author's first affiliation, or None if there is none
    text = author.findtext("./AffiliationInfo/Affiliation")
    if not text:
        return None
    return text.strip().rstrip(".").split(",")[-1].strip()

rows = []
for child in root:
    #the record's own PMID; searching for Version="1" would also match PMIDs under CommentsCorrections
    pmid = child.findtext("./MedlineCitation/PMID")
    #journal publication year (PubDate/Year); None when the record only has MedlineDate
    year = child.findtext(".//PubDate/Year")
    #one entry per author, so authors without an affiliation are still counted
    countries = [country_of(a) for a in child.findall(".//AuthorList/Author")]
    authornumber = len(countries)
    #the columns depend on how many authors there are
    if authornumber == 0:
        first = second = last = None
    elif authornumber == 1:
        #a single author fills all three columns
        first = second = last = countries[0]
    else:
        first, second, last = countries[0], countries[1], countries[-1]
    rows.append({"pmid": pmid, "year": year, "authornumber": authornumber,
                 "1st": first, "2nd": second, "last": last})
    print(len(rows)) #not required, just reassurance that the loop is progressing

#build the dataframe once instead of concatenating inside the loop
df = pd.DataFrame(rows, columns=["pmid", "year", "authornumber", "1st", "2nd", "last"])
df.index = range(1, len(df) + 1)
df.to_csv("affiliations.csv", index=True, encoding="utf8")
Data cleaning still has to follow, but for now this covers everything up to producing the data sheet.
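As a rough idea of what that cleaning might involve, the sketch below normalizes raw country strings and tallies them. It is not the cleaning actually performed: the alias table, the function name `clean_country`, and the sample values are all invented for illustration, on the assumption that affiliation strings sometimes end with an e-mail address and spell the same country in several ways.

```python
import re
from collections import Counter

# Hypothetical alias table; extend it after inspecting the real data
ALIASES = {
    "usa": "United States",
    "u.s.a": "United States",
    "uk": "United Kingdom",
}

EMAIL = re.compile(r"\S+@\S+")


def clean_country(raw):
    """Normalize one raw country string; return None if nothing usable remains."""
    if not raw:
        return None
    text = EMAIL.sub("", raw)                       # drop any e-mail address
    text = text.replace("Electronic address:", "")  # label that sometimes precedes it
    text = text.strip(" .;")
    if not text:
        return None
    return ALIASES.get(text.lower(), text)


# Invented sample values standing in for one column of affiliations.csv
raw_values = ["Japan", " USA.", "U.S.A.",
              "UK. Electronic address: someone@example.org", "", None]
counts = Counter(c for c in map(clean_country, raw_values) if c)
for country, n in counts.most_common():
    print(country, n)
```

Applied to the real file, the same function could be mapped over the `1st`, `2nd`, and `last` columns before counting.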
Parsing the XML this way has many other uses, such as computing the time between acceptance and publication.
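A minimal sketch of that interval calculation follows. It assumes the record's `History` carries a `PubMedPubDate` with `PubStatus="accepted"`, which the sample record above does not have, and it uses the `pubmed` date as a stand-in for the publication date. The `History` fragment and its dates are invented, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical History element; the dates are invented for illustration
SAMPLE = """
<History>
  <PubMedPubDate PubStatus="received">
    <Year>2005</Year><Month>11</Month><Day>1</Day>
  </PubMedPubDate>
  <PubMedPubDate PubStatus="accepted">
    <Year>2006</Year><Month>1</Month><Day>10</Day>
  </PubMedPubDate>
  <PubMedPubDate PubStatus="pubmed">
    <Year>2006</Year><Month>2</Month><Day>14</Day><Hour>9</Hour><Minute>0</Minute>
  </PubMedPubDate>
</History>
"""


def history_date(history, status):
    """Return the date recorded for one PubStatus, or None if it is missing."""
    node = history.find("./PubMedPubDate[@PubStatus='{}']".format(status))
    if node is None:
        return None
    return date(int(node.findtext("Year")),
                int(node.findtext("Month")),
                int(node.findtext("Day")))


def days_between(history, start="accepted", end="pubmed"):
    """Days from the start status to the end status, or None if either is absent."""
    a = history_date(history, start)
    b = history_date(history, end)
    if a is None or b is None:
        return None
    return (b - a).days


history = ET.fromstring(SAMPLE)
print(days_between(history))
```

Returning `None` for records that lack either date keeps them visible in the output instead of raising partway through a large file.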