Retrieving the data

References of published research articles matching the search keyword 'bacteria' (authors, abstracts, publication journal, ...) were downloaded from PubMed using a dedicated Python library (Entrez) and stored in a MongoDB database: approximately 1.9 million articles in total. An online resource listed the impact factors of all existing journals in plain HTML tables spread over several pages; parsing them with the BeautifulSoup library allowed this data to be added to the same database. For more details see the following scripts:

  • Get_ImpactFactors.py
  • Get_PubMed.py
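The table-parsing step can be sketched without the real page. The snippet below is a minimal stdlib illustration (the actual script uses BeautifulSoup), assuming each table row holds two cells, journal name and impact factor; the `IFTableParser` class and the sample HTML are hypothetical:

```python
from html.parser import HTMLParser


class IFTableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = None      # cells of the row currently being parsed
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td':
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())


def parse_impact_factors(html):
    """Return {journal_name: impact_factor} from a two-column HTML table."""
    parser = IFTableParser()
    parser.feed(html)
    return {name: float(factor) for name, factor in parser.rows}


html_doc = ('<table><tr><td>Nature</td><td>41.5</td></tr>'
            '<tr><td>PLoS ONE</td><td>3.2</td></tr></table>')
print(parse_impact_factors(html_doc))
```

With BeautifulSoup the same loop reduces to iterating over `soup.find_all('tr')` and reading the two `td` children of each row; the resulting dict can then be inserted into the MongoDB database alongside the article records.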

Selection of the data

From the database, articles published between 1985 and 2014 (the last year with a computed impact factor) were selected. From there, articles were either considered as a whole or subdivided according to their journal's impact factor. When available, all the keywords of the papers were saved into dictionaries, and the 500 most frequent were dumped to JSON format for further use. Details in the script:

  • Data_selection.py
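The counting-and-dumping step amounts to a `collections.Counter` over the keyword lists. A minimal sketch, assuming each article record carries a `keywords` list (the field name and the `top_keywords` helper are illustrative, not taken from `Data_selection.py`):

```python
import json
from collections import Counter


def top_keywords(articles, n=500):
    """Count keywords across article records and keep the n most common."""
    counts = Counter()
    for article in articles:
        counts.update(article.get('keywords', []))
    # most_common(n) returns (keyword, count) pairs sorted by count
    return dict(counts.most_common(n))


# Toy records standing in for one time period's articles
articles = [{'keywords': ['Rotavirus', 'Amino Acids']},
            {'keywords': ['Rotavirus']}]
print(json.dumps(top_keywords(articles, n=2)))
```

Writing the resulting dict with `json.dump` produces exactly the `mostcommon_<start>_<end>.txt` files that are read back in below.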

Data Analysis

Time evolution of keywords

First let's read in the data:

In [1]:
import json
import os
import matplotlib.pyplot as plt
import pandas as pd
import pygal
import numpy as np
from IPython.display import SVG

ROOT_PATH = os.path.join(os.getcwd(), 'data')
if not os.path.isdir(ROOT_PATH):
    os.mkdir(ROOT_PATH)

# Read in data top keyword counts
data = dict()
data_names = []
for i in range(1985, 2015, 5):
    with open(ROOT_PATH + "/mostcommon_" + str(i) + "_" + str(i + 4) +
              ".txt", "r") as data_file:
        data[str(i) + "_" + str(i + 4)] = json.load(data_file)
        data_names.append(str(i) + "_" + str(i + 4))

Next, the data are reformatted into a pandas DataFrame for convenience:

In [2]:
df = pd.DataFrame()
for i in data_names:
    if df.empty:
        df = pd.DataFrame(data[i].values(), index=data[i].keys(), columns=[str(i)])
    else:
        df = pd.merge(df, pd.DataFrame(data[i].values(),
                                      index=data[i].keys(), columns=[str(i)]),
                      left_index=True, right_index=True, how='outer').fillna(0)
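As a side note, the chain of outer merges above can be collapsed into a single constructor call, since pandas builds a DataFrame directly from a dict of dicts (columns from the outer keys, the union of keywords as index). A sketch with toy counts:

```python
import pandas as pd

# Each inner dict is one period's keyword counts; missing keywords become NaN,
# which fillna(0) turns into zero, matching the merge-based construction.
counts = {'1985_1989': {'Rotavirus': 10, 'Mycoplasma': 4},
          '1990_1994': {'Rotavirus': 7, 'Microbiota': 2}}
df = pd.DataFrame(counts).fillna(0)
print(df)
```

The loop version in the cell above gives the same result but also works when the periods arrive one file at a time.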

Ratios of normalized keyword counts between consecutive time periods were computed, giving a score that indicates how each keyword evolves over time.

In [3]:
for i in range(0, len(data_names) - 1):
    ratio = df[data_names[i + 1]] / df[data_names[i]]
    df[str(data_names[i + 1] + '/' + data_names[i])] = ratio

sc_df = df.fillna(1).replace('%','',regex=True).astype('float')

We now generate graphs for a few chosen 'typical' profiles; a global overview of all data points proved unreadable.

In [4]:
# Change default font
custom_style = pygal.style.Style(
    background='white',
    plot_background='white',
    font_family='Lato')
In [5]:
# Keywords that dropped early and never came back
sort_data = sc_df.sort_values(by="1990_1994/1985_1989",ascending=False)
data = sort_data.loc[(sc_df["1990_1994/1985_1989"] <= 0.8) &
                     (sc_df["1995_1999/1990_1994"] <= 0.8) &
                     (sc_df["2000_2004/1995_1999"] <= 1.2) &
                     (sc_df["2000_2004/1995_1999"] >= 0.8) &
                     (sc_df["2005_2009/2000_2004"] <= 1.2) &
                     (sc_df["2005_2009/2000_2004"] >= 0.8) &
                     (sc_df["2010_2014/2005_2009"] <= 1.2) &
                     (sc_df["2010_2014/2005_2009"] >= 0.8), data_names][:12]

line_chart = pygal.Line(logarithmic=False, legend_at_bottom=True,
                        style=custom_style, y_title='# Normalized published articles')
line_chart.title = 'Old loss in published papers keywords'
line_chart.x_labels = list(data.keys())
for row in data.iterrows():
    line_chart.add(str(row[0]), list(row[1]))
line_chart.render_to_file(ROOT_PATH + '/old_loss.svg')
SVG(filename=ROOT_PATH + '/old_loss.svg')
Out[5]:
[Line chart "Old loss in published papers keywords"; x-axis: 1985_1989 to 2010_2014, y-axis: # Normalized published articles. Keywords plotted: Yersinia enterocolitica; Simplexvirus; Drug Synergism; Mycoplasma; Immune Tolerance; Amino Acids; Cell Transformation, Viral; Rotavirus; Haemophilus Infections; Chromosome Deletion; Immune Sera; Indicators and Reagents.]
In [6]:
sort_data = sc_df.sort_values(by="2010_2014/2005_2009",ascending=False)
data = sort_data.loc[(sc_df["1990_1994/1985_1989"] >= 0.8) &
                     (sc_df["1990_1994/1985_1989"] <= 1.2) &
                     (sc_df["1995_1999/1990_1994"] >= 0.8) &
                     (sc_df["1995_1999/1990_1994"] <= 1.2) &
                     (sc_df["2000_2004/1995_1999"] <= 1.2) &
                     (sc_df["2000_2004/1995_1999"] >= 0.8) &
                     (sc_df["2005_2009/2000_2004"] <= 1.2) &
                     (sc_df["2005_2009/2000_2004"] >= 0.8) &
                     (sc_df["2010_2014/2005_2009"] >= 1.4), data_names][:12]


line_chart = pygal.Line(logarithmic=False, legend_at_bottom=True,
                        style=custom_style, y_title='# Normalized published articles')
line_chart.title = 'Recent increase in published papers keywords'
line_chart.x_labels = list(data.keys())
for row in data.iterrows():
    line_chart.add(str(row[0]), list(row[1]))
line_chart.render_to_file(ROOT_PATH + '/recent_increase.svg')
SVG(filename=ROOT_PATH + '/recent_increase.svg')
Out[6]:
[Line chart "Recent increase in published papers keywords"; x-axis: 1985_1989 to 2010_2014, y-axis: # Normalized published articles. Keywords plotted: Hydrophobic and Hydrophilic Interactions; Nanoparticles; Antioxidants; Antimicrobial Cationic Peptides; Orthomyxoviridae Infections; Diet; Disease Resistance; Transcriptome; Mutation, Missense; Biocatalysis; Antibodies, Neutralizing; Multilocus Sequence Typing.]
In [7]:
sort_data = sc_df.sort_values(by="2005_2009/2000_2004",ascending=False)
data = sort_data.loc[(sc_df["1990_1994/1985_1989"] >= 0.8) &
                     (sc_df["1990_1994/1985_1989"] <= 1.2) &
                     (sc_df["1995_1999/1990_1994"] >= 0.8) &
                     (sc_df["1995_1999/1990_1994"] <= 1.2) &
                     (sc_df["2000_2004/1995_1999"] <= 1.2) &
                     (sc_df["2000_2004/1995_1999"] >= 0.8) &
                     (sc_df["2005_2009/2000_2004"] >= 1.2) &
                     (sc_df["2010_2014/2005_2009"] >= 0.8), data_names][:12]


line_chart = pygal.Line(logarithmic=False, legend_at_bottom=True,
                        style=custom_style, y_title='# Normalized published articles')
line_chart.title = 'Steady interest in published papers keywords'
line_chart.x_labels = list(data.keys())
for row in data.iterrows():
    line_chart.add(str(row[0]), list(row[1]))
line_chart.render_to_file(ROOT_PATH + '/steady_interest.svg')
SVG(filename=ROOT_PATH + '/steady_interest.svg')
Out[7]:
[Line chart "Steady interest in published papers keywords"; x-axis: 1985_1989 to 2010_2014, y-axis: # Normalized published articles. Keywords plotted: Drug Resistance, Viral; Biodiversity; France; Mass Screening; Fish Diseases; Proteomics; Methicillin-Resistant Staphylococcus aureus; Proteome; Microbial Viability; Microscopy, Electron, Transmission; Protein Transport; Molecular Epidemiology.]
In [8]:
sort_data = sc_df.sort_values(by="1995_1999/1990_1994", ascending=False)
data = sort_data.loc[(sc_df["1990_1994/1985_1989"] >= 0.8) &
                     (sc_df["1990_1994/1985_1989"] <= 1.2) &
                     (sc_df["1995_1999/1990_1994"] >= 1.2) &
                     (sc_df["2000_2004/1995_1999"] <= 0.8) &
                     (sc_df["2005_2009/2000_2004"] <= 1.2) &
                     (sc_df["2005_2009/2000_2004"] >= 0.8) &
                     (sc_df["2010_2014/2005_2009"] >= 0.8) &
                     (sc_df["2010_2014/2005_2009"] <= 1.2), data_names][:12]


line_chart = pygal.Line(logarithmic=False, legend_at_bottom=True,
                        style=custom_style, y_title='# Normalized published articles')
line_chart.title = 'Temporary interest in published papers keywords'
line_chart.x_labels = list(data.keys())
for row in data.iterrows():
    line_chart.add(str(row[0]), list(row[1]))
line_chart.render_to_file(ROOT_PATH + '/moment_glory.svg')
SVG(filename=ROOT_PATH + '/moment_glory.svg')
Out[8]:
[Line chart "Temporary interest in published papers keywords"; x-axis: 1985_1989 to 2010_2014, y-axis: # Normalized published articles. Keywords plotted: Infectious Disease Transmission, Vertical; Lac Operon; Porphyromonas gingivalis; Sequence Analysis; RNA-Binding Proteins; Omeprazole; 3T3 Cells; Anti-Ulcer Agents; Electron Transport; Duodenal Ulcer; Cell Line, Transformed; HIV Envelope Protein gp120.]

What happens if you look for a few keywords related to your PhD?

In [9]:
#Focused targets
data = sc_df[[0, 1, 2, 3, 4, 5]].ix[['Host-Pathogen Interactions',
                                     'Listeria monocytogenes',
                                     'Staphylococcus aureus',
                                     'RNA, Bacterial',
                                     'Microbiota',
                                     'Transcriptome']]
line_chart = pygal.StackedLine(fill=True, legend_at_bottom=True,
                               style=custom_style, y_title='# Normalized published articles',
                               truncate_legend=-1)
line_chart.title = 'Evolution in time of selected topics'
line_chart.x_labels = list(data.keys())
for row in data.iterrows():
    line_chart.add(str(row[0]), list(row[1]), show_dots=False)
line_chart.render_to_file(ROOT_PATH + '/Topics_Line_2010_2014.svg')
SVG(filename=ROOT_PATH + '/Topics_Line_2010_2014.svg')
Out[9]:
[Stacked line chart "Evolution in time of selected topics"; x-axis: 1985_1989 to 2010_2014, y-axis: # Normalized published articles. Series: Host-Pathogen Interactions; Listeria monocytogenes; Staphylococcus aureus; RNA, Bacterial; Microbiota; Transcriptome.]

Keywords broken down with respect to Impact Factors

Reading the data

In [10]:
import json
import os
import matplotlib.pyplot as plt
import pandas as pd
import pygal
import numpy as np

ROOT_PATH = os.path.join(os.getcwd(), 'data')
if not os.path.isdir(ROOT_PATH):
    os.mkdir(ROOT_PATH)


# Read in data top keyword counts
data = dict()
data_names = []
for i in range(1985, 2015, 5):
    for low, high in [(0, 5), (5, 15), (15, 100)]:
        with open(ROOT_PATH + "/mostcommon_" + str(i) + "_" + str(i + 4) + '_IF_' +
                  str(low) + '_' + str(high) + ".txt", "r") as data_file:
            data[str(i) + "_" + str(i + 4) + '_IF_' + str(low) + '_' + str(high)] = json.load(data_file)
            data_names.append(str(i) + "_" + str(i + 4) + '_IF_' + str(low) + '_' + str(high))

# Merge dicts into dataframe
df = pd.DataFrame()
for i in data_names:
    if df.empty:
        df = pd.DataFrame(data[i].values(), index=data[i].keys(), columns=[str(i)])
    else:
        df = pd.merge(df, pd.DataFrame(data[i].values(), index=data[i].keys(), columns=[str(i)]),
                      left_index=True, right_index=True, how='outer').fillna(0)

sc_df = df.replace('%','',regex=True).astype('float')

What does the most recent data look like?

In [11]:
data = sc_df.iloc[:, [15, 16, 17]]  # the three 2010_2014 IF brackets
line_chart = pygal.Line(show_legend=False, legend_at_bottom=True,
                        style=custom_style, y_title='# Normalized published articles')
line_chart.title = 'Influence of IF on 2010-2014'
line_chart.x_labels = [x.replace('2010_2014_', '') for x in data.keys()]
for row in data.iterrows():
    line_chart.add(str(row[0]), list(row[1]))
line_chart.render_to_file(ROOT_PATH + '/IF_2010_2014.svg')
SVG(filename=ROOT_PATH + '/IF_2010_2014.svg')
Out[11]:
[Line chart "Influence of IF on 2010-2014"; x-axis: IF_0_5, IF_5_15, IF_15_100; y-axis: # Normalized published articles; one line per keyword, legend hidden.]