
Citation Analysis

The ability to perform in-depth citation analysis using OpenAlex’s user interface is limited. By developing scripts that can gather results that cite, or are cited by, a set of publications, we enable a variety of approaches to research impact assessment and bibliometric analysis.

In this notebook, we will query the OpenAlex API to answer the following questions:

  1. How many works has a researcher (or a group of researchers) published?
  2. How many citations has a researcher (or a group of researchers) received in a given year?
  3. What are the most influential publications in a given research area? (backward citation analysis)
  4. Which countries/journals/authors cite my research most often? (forward citation analysis)
  5. How do citation rates differ between my organization’s open access vs. non-open access publications?

Assessing Citation/Work Counts of a Researcher

This can help an organization, department, or research institute track the citation rates of their researchers over time (an indicator of research engagement/impact).

Steps

Let’s start by dividing the process into smaller, more manageable steps:

  1. We need to get all the works published by the researcher
  2. We group the publications by year
  3. We count the total number of publications and citations each year

Input

The only input we need is an identifier for the researcher in question; we opt for the researcher’s OpenAlex ID. We can look up the OpenAlex ID by searching for the researcher’s name on the OpenAlex landing page.
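If you would rather find the ID programmatically, the authors endpoint can be searched by name. A minimal sketch (the name below is purely illustrative; substitute the researcher you are looking for):

import requests

# search the authors endpoint by name and inspect the candidates
# (the example name is illustrative)
response = requests.get("https://api.openalex.org/authors?search=Jane%20Smith")
for author in response.json()["results"][:5]:
    print(author["id"], author["display_name"], author["works_count"])

With the ID in hand, we set the notebook’s flags and input: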

SAVE_CSV = False  # flag to determine whether to save the output as a CSV file 
FIRST_NUM_ROWS = 25  # number of top items to display

# input
author_id = "https://openalex.org/A5075441180"

Get all works published by the researcher

The first step is to build the query URL to get the data we need. In this case, we will use the works entity type.

While the authors entity type offers direct access to citation and work metrics, we discovered significant discrepancies between this aggregated author data and the data available on a work-by-work basis. After consulting with OpenAlex, we decided the most accurate approach (at the time of writing this notebook) would be to gather the data available at the works level and aggregate it ourselves.
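To see the aggregated numbers the author record reports, you can fetch it directly and compare against the work-level totals we compute below. A minimal sketch:

import requests

# fetch the pre-aggregated author record; its counts_by_year can be
# compared against the work-level totals computed later in this notebook
author = requests.get("https://api.openalex.org/authors/A5075441180").json()
for row in author["counts_by_year"]:
    print(row["year"], row["works_count"], row["cited_by_count"])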

Our search criteria are as follows:

  • author.id: Author of a work (OpenAlex ID), author.id:https://openalex.org/A5075441180

Now we need to put the URL together from the following parameters:

  • Starting point is the base URL of the OpenAlex API: https://api.openalex.org/
  • We append the entity type to it: https://api.openalex.org/works
  • All criteria need to go into the query parameter filter that is added after a question mark: https://api.openalex.org/works?filter=
  • To construct the filter value we take the criteria we specified and concatenate them using commas as separators: https://api.openalex.org/works?filter=author.id:https://openalex.org/A5075441180&page=1&per-page=50
    • page and per-page are parameters for pagination. When a query result is too big to return at once, pagination breaks the results into smaller chunks that can be retrieved with separate queries (an alternative cursor-based approach is sketched below).
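Note that basic page/per-page paging in OpenAlex is limited to the first 10,000 results; for larger result sets the API offers cursor paging. A minimal sketch of the cursor approach, which a single author’s output does not require but which generalizes better:

import requests

# cursor paging: start with cursor=* and follow meta.next_cursor
# until it comes back as null (None in Python)
url = "https://api.openalex.org/works?filter=author.id:A5075441180&per-page=50"
cursor = "*"
results = []
while cursor:
    page = requests.get(f"{url}&cursor={cursor}").json()
    results.extend(page["results"])
    cursor = page["meta"]["next_cursor"]
print(len(results))

This notebook sticks with page/per-page, which is sufficient here: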
import requests
import pandas as pd


def get_works_by_author(author_id, page=1, items_per_page=50):
    # construct the api url with the given author id, page number, and items per page
    url = f"https://api.openalex.org/works?filter=author.id:{author_id}&page={page}&per-page={items_per_page}"

    # send a GET request to the api and parse the json response
    response = requests.get(url)
    json_data = response.json()

    # convert the json response to a dataframe
    df_json = pd.DataFrame.from_dict(json_data["results"])

    next_page = True
    if df_json.empty: # check if the dataframe is empty (i.e., no more pages available)
        next_page = False

    # if there are more pages, recursively fetch the next page
    if next_page:
        df_json_next_page = get_works_by_author(author_id, page=page+1, items_per_page=items_per_page)
        df_json = pd.concat([df_json, df_json_next_page])

    return df_json
def get_work_counts_by_year(df, author_id):
    # check if the dataframe is empty or if 'publication_year' column is missing
    if df.empty or "publication_year" not in df.columns:
        return pd.DataFrame()

    # count the occurrences of each publication year and convert to a dictionary
    results = df["publication_year"].value_counts().to_dict()

    # create a dictionary with counts by year
    records = {"counts_by_year": [{'year': year, 'works_by_count': count} for year, count in results.items()]}

    # normalize the json data to create a dataframe
    df_normalized = pd.json_normalize(records, "counts_by_year", [])

    # group by year and sum the work counts
    df_works = df_normalized.groupby(["year"])["works_by_count"].sum()

    # reset the index to convert the series to a dataframe
    df_works = df_works.reset_index()

    # add the 'author_id' to the dataframe
    df_works["id"] = author_id

    # pivot the dataframe to have years as columns and works count as values
    df_works = df_works.pivot(index=["id"], columns="year", values=["works_by_count"])

    return df_works
def get_cited_by_counts_by_year(df, author_id):
    # check if the dataframe is empty or if required columns are missing
    if df.empty or "id" not in df.columns or "counts_by_year" not in df.columns:
        return pd.DataFrame()

    # convert the relevant columns to a list of dictionaries
    records = df[["id", "counts_by_year"]].to_dict("records")

    # normalize the json data to create a dataframe
    df_normalized = pd.json_normalize(records, "counts_by_year", [])

    # group by year and sum the citation counts
    df_citations = df_normalized.groupby(["year"])["cited_by_count"].sum()

    # reset the index to convert the series to a dataframe
    df_citations = df_citations.reset_index()

    # add the 'author_id' to the dataframe
    df_citations["id"] = author_id

    # pivot the dataframe to have years as columns and citation count as values
    df_citations = df_citations.pivot(index=["id"], columns="year", values=["cited_by_count"])

    return df_citations
df_works = get_works_by_author(author_id)
if SAVE_CSV:
    df_works.to_csv(f"author_works.csv", index=True)
df_works_by_year = get_work_counts_by_year(df_works, author_id)

# reindex the dataframe columns in sorted order
df_works_by_year = df_works_by_year.reindex(sorted(df_works_by_year.columns), axis=1)

# group the dataframe by 'id' and sum the values
df_works_by_year = df_works_by_year.groupby(["id"]).sum()

# flatten the multiple index columns to a single level
df_works_by_year.columns = df_works_by_year.columns.get_level_values(1)

# reset the index to convert the 'id' index back to a column
df_works_by_year = df_works_by_year.reset_index()

Work counts by year

Here, we create a dataframe that displays the researcher’s publications (by year).

if SAVE_CSV:
    df_works_by_year.to_csv(f"work_counts_by_year.csv", index=True)

df_works_by_year
year id 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
0 https://openalex.org/A5075441180 1 4 5 3 2 2 6 4 4 16 10 8 6 12 6 1 1
df_citations_by_year = get_cited_by_counts_by_year(df_works, author_id)

# reindex the dataframe columns in sorted order
df_citations_by_year = df_citations_by_year.reindex(sorted(df_citations_by_year.columns), axis=1)

# group the dataframe by 'id' and sum the values
df_citations_by_year = df_citations_by_year.groupby(["id"]).sum()

# flatten the multiindex columns to a single level
df_citations_by_year.columns = df_citations_by_year.columns.get_level_values(1)

# reset the index to convert the 'id' index back to a column
df_citations_by_year = df_citations_by_year.reset_index()

Citation counts by year

Here, we create a dataframe that displays citations to the researcher’s publications (by year).

if SAVE_CSV:
    df_citations_by_year.to_csv(f"citation_counts_by_year.csv", index=True)

df_citations_by_year
year id 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025
0 https://openalex.org/A5075441180 24 24 40 29 50 45 52 38 92 332 454 441 451 64
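Both dataframes lend themselves to a quick visual check. A minimal sketch that plots the citation counts per year as a bar chart (the same pattern works for df_works_by_year):

import matplotlib.pyplot as plt

# plot the per-year citation counts for the author
# (assumes df_citations_by_year as built above: one row per author id)
df_citations_by_year.set_index("id").T.plot(kind="bar", legend=False)
plt.title(f"Citations per year for author ({author_id})")
plt.xlabel("Year")
plt.ylabel("Citations")
plt.show()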

Backward Citation Analysis

This approach can be used to identify the body of literature that has influenced a given work or set of works. In this example, we explore the body of literature influencing the individual researcher from above (but this analysis could also be performed at the topic-, journal-, or organizational-level).

Steps

  1. We need to get all the works published by the researcher
  2. We get all the outgoing referenced works
  3. We examine the occurrence of references in all works published by the researcher

Get all outgoing referenced works

Our search criteria are as follows:

  • cited_by: works that the given work references (its outgoing references), cited_by:https://openalex.org/W2766808518

Now we need to build a URL for the query from the following parameters:

  • Starting point is the base URL of the OpenAlex API: https://api.openalex.org/
  • We append the entity type to it: https://api.openalex.org/works
  • All criteria need to go into the query parameter filter that is added after a question mark: https://api.openalex.org/works?filter=
  • To construct the filter value we take the criteria we specified and concatenate them using commas as separators: https://api.openalex.org/works?filter=cited_by:https://openalex.org/W2766808518&page=1&per-page=50
def get_all_outgoing_referenced_works(work_ids):

    def get_outgoing_referenced_work(work_id, page=1, items_per_page=50):
        # construct the api url with the given work id, page number, and items per page
        url = f"https://api.openalex.org/works?filter=cited_by:{work_id}&page={page}&per-page={items_per_page}"

        # send a GET request to the api and parse the json response
        response = requests.get(url)
        json_data = response.json()

        # convert the json response to a dataframe
        df_json = pd.DataFrame.from_dict(json_data["results"])

        next_page = True
        if df_json.empty: # check if the dataframe is empty (i.e., no more pages available)
            next_page = False

        # if there are more pages, recursively fetch the next page
        if next_page:
            df_json_next_page = get_outgoing_referenced_work(work_id, page=page+1, items_per_page=items_per_page)
            df_json = pd.concat([df_json, df_json_next_page])

        # add the 'work_id' to the dataframe
        df_json["original_work"] = work_id
        return df_json

    df_reference = pd.concat(map(get_outgoing_referenced_work, work_ids))
    return df_reference
df_outgoing_reference = get_all_outgoing_referenced_works(df_works["id"])
if SAVE_CSV:
    df_outgoing_reference.to_csv(f"outgoing_referenced.csv", index=True)

Examine occurrence of outgoing referenced publications

Here, we create a dataframe displaying the number of times the researcher has cited each work.

# group the dataframe by 'id' and 'title', and aggregate the 'id' column by count
df_unique_outgoing_reference = df_outgoing_reference.groupby(["id", "title"]).agg({"id": "count"})

# rename the 'id' column to 'count'
df_unique_outgoing_reference.rename(columns={"id": "count"}, inplace=True)

# sort the dataframe by the 'count' column in descending order
df_unique_outgoing_reference.sort_values("count", ascending=False)
count
id title
https://openalex.org/W2038270470 Age effects on carbon fluxes in temperate pine forests 13
https://openalex.org/W2036147904 Above- and belowground ecosystem biomass and carbon pools in an age-sequence of temperate pine plantation forests 12
https://openalex.org/W2161094103 Net ecosystem production in a temperate pine plantation in southeastern Canada 12
https://openalex.org/W2047753483 Allometry and partitioning of above- and belowground tree biomass in an age-sequence of white pine forests 10
https://openalex.org/W2076711862 Water flux components and soil water‐atmospheric controls in a temperate pine forest growing in a well‐drained sandy soil 10
... ... ...
https://openalex.org/W2092100648 The influence of stand development on nutrient demand, growth and allocation 1
https://openalex.org/W2092191291 Storage and internal cycling of nitrogen in relation to seasonal growth of Sitka spruce 1
https://openalex.org/W2093065735 Class—A Canadian land surface scheme for GCMS. I. Soil model 1
https://openalex.org/W2093602709 Water relations of evergreen and drought-deciduous trees along a seasonally dry tropical forest chronosequence 1
https://openalex.org/W961912840 Clinical Research: A Multimethod Typology and Qualitative Roadmap 1

1093 rows × 1 columns
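As a shortcut, each work record already carries a referenced_works field (a list of OpenAlex work IDs), so the same occurrence counts can be derived without any extra API calls, albeit without titles. A minimal sketch:

from collections import Counter

# count outgoing references straight from the works we already fetched;
# each row of 'referenced_works' is assumed to be a (possibly empty) list of IDs
reference_counts = Counter(ref for refs in df_works["referenced_works"] for ref in refs)
print(reference_counts.most_common(5))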

Forward Citation Analysis

If a researcher is trying to assert the impact of their own work (for promotion/tenure, grants, or awards), there is value in understanding how their research is being utilized/cited by others.

Steps

  1. We need to get all the works published by the researcher
  2. We get all the incoming referenced works
  3. We examine the incoming citation counts by country and by publisher

Get all incoming referenced works

Our search criteria are as follows:

  • cites: works that cite the given work, cites:https://openalex.org/W2766808518

Now we need to build a URL for the query from the following parameters:

  • Starting point is the base URL of the OpenAlex API: https://api.openalex.org/
  • We append the entity type to it: https://api.openalex.org/works
  • All criteria need to go into the query parameter filter that is added after a question mark: https://api.openalex.org/works?filter=
  • To construct the filter value we take the criteria we specified and concatenate them using commas as separators: https://api.openalex.org/works?filter=cites:https://openalex.org/W2766808518&page=1&per-page=50
def get_all_incoming_referenced_works(work_ids):

    def get_incoming_referenced_works(work_id, page=1, items_per_page=50):
        # construct the api url with the given work id, page number, and items per page
        url = f"https://api.openalex.org/works?filter=cites:{work_id}&page={page}&per-page={items_per_page}"

        # send a GET request to the api and parse the json response
        response = requests.get(url)
        json_data = response.json()

        # convert the json response to a dataframe
        df_json = pd.DataFrame.from_dict(json_data["results"])

        next_page = True
        if df_json.empty: # check if the dataframe is empty (i.e., no more pages available)
            next_page = False

        # if there are more pages, recursively fetch the next page
        if next_page:
            df_json_next_page = get_incoming_referenced_works(work_id, page=page+1, items_per_page=items_per_page)
            df_json = pd.concat([df_json, df_json_next_page])

        # add the 'work_id' to the dataframe
        df_json["original_work"] = work_id
        return df_json

    df_reference = pd.concat(map(get_incoming_referenced_works, work_ids))
    return df_reference
df_incoming_referenced = get_all_incoming_referenced_works(df_works["id"])
if SAVE_CSV:
    df_incoming_referenced.to_csv(f"incoming_referenced.csv", index=True)

Examine incoming citation counts by country

Here, we create a dataframe and simple visualization of citations to the researcher’s work by country.

# create a new column 'countries' by applying a lambda function to the 'authorships' column
df_incoming_referenced["countries"] = df_incoming_referenced["authorships"].apply(lambda authorships: set([country for authorship in authorships for country in authorship["countries"]]))
from collections import Counter

# flatten the list of sets and count occurrences of each country
country_counts = Counter(country for countries_set in df_incoming_referenced["countries"] for country in countries_set)

# convert the Counter object to a DataFrame for better readability
df_country_counts = pd.DataFrame.from_dict(country_counts, orient="index", columns=["count"])
# top countries citing the author's publications
df_country_counts.sort_values(by="count", ascending=False).head()
count
US 703
CN 673
CA 374
DE 326
GB 249
import matplotlib.pyplot as plt

# sort the dataframe by 'count' in descending order, select the top FIRST_NUM_ROWS rows, and plot as a bar chart
df_country_counts.sort_values(by="count", ascending=False).head(FIRST_NUM_ROWS).plot(kind="bar", legend=False)

# set the title of the plot
plt.title(f"Top {FIRST_NUM_ROWS} countries to cite author ({author_id})'s publications")
# set the x-axis label
plt.xlabel("Country")
# set the y-axis label
plt.ylabel("Count")

plt.show()

[Figure: bar chart of the top 25 countries citing the author’s publications]

Examine incoming citation counts by publisher

In a work entity object, there is information about the publisher (primary_location) and the publication’s APC listed by the publisher.

primary_location is a Location object describing the primary location of this work. We are interested in the source of the location, which contains information about the publishing venue, such as its OpenAlex id and display_name.

Here, we create a dataframe displaying the number of times the researcher’s work has been cited in different journals.

import numpy as np

# extract 'id' from 'source' within 'primary_location' if both exist; otherwise, set to null
df_incoming_referenced["source_id"] = df_incoming_referenced["primary_location"].apply(lambda location: location["source"]["id"] if location and location["source"] else np.nan)

# extract 'display_name' from 'source' within 'primary_location' if both exist; otherwise, set to null
df_incoming_referenced["source_name"] = df_incoming_referenced["primary_location"].apply(lambda location: location["source"]["display_name"] if location and location["source"] else np.nan)

# fill null values in 'source_id' and 'source_name'
df_incoming_referenced["source_id"] = df_incoming_referenced["source_id"].fillna("unknown source")
df_incoming_referenced["source_name"] = df_incoming_referenced["source_name"].fillna("unknown source")
# group the dataframe by 'source_id' and 'source_name' and aggregate the 'id' column by count
df_publisher_counts = df_incoming_referenced.groupby(["source_id", "source_name"]).agg({"id": "count"})

# rename the 'id' column to 'count'
df_publisher_counts.rename(columns={"id": "count"}, inplace=True)
# top publishers citing the author's publications
df_publisher_counts[df_publisher_counts.index != ("unknown source", "unknown source")].sort_values(by='count', ascending=False).head()
count
source_id source_name
https://openalex.org/S17729819 Agricultural and Forest Meteorology 206
https://openalex.org/S55737203 Journal of Hydrology 68
https://openalex.org/S4210238840 Journal of Geophysical Research Biogeosciences 68
https://openalex.org/S13442111 Biogeosciences 68
https://openalex.org/S43295729 Remote Sensing 57
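As an aside, the API can perform this kind of aggregation server-side via the group_by parameter, which avoids downloading every citing work. A minimal sketch for a single cited work (the work ID is illustrative):

import requests

# let the API group the citing works by source for one cited work
url = "https://api.openalex.org/works?filter=cites:W2766808518&group_by=primary_location.source.id"
for group in requests.get(url).json()["group_by"][:5]:
    print(group["key"], group["key_display_name"], group["count"])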

Open Access vs. Non-Open Access

Researchers and research institutions want to maximize the impact of their research by selecting the most influential journals for their publications.

Open access publications are more broadly accessible than their subscription-based counterparts. This leads us to wonder whether open access publications at our institution are more highly cited than closed publications.

This kind of analysis can help organizations promote open access publishing by exploring the impact it has on citation rates.

Steps

  1. We need to get all the works published by researchers at the institution
  2. We get the publisher and open access status for each publication
  3. We analyze the open access citation rate

Input

For inputs, we first need to identify the Research Organization Registry (ROR) ID for our institution. In this example we will use the ROR ID for McMaster University (https://ror.org/02fa3aq29). You can search and substitute your own institution’s ROR here: https://ror.org/search.

We also need to identify the publication year we are interested in analyzing.
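If you do not know the ROR ID offhand, the OpenAlex institutions endpoint can also be searched by name and returns each institution’s ROR. A minimal sketch:

import requests

# search OpenAlex institutions by name and read off the ROR ID
response = requests.get("https://api.openalex.org/institutions?search=McMaster%20University")
for institution in response.json()["results"][:3]:
    print(institution["display_name"], institution["ror"])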

# input
ror_id = "https://ror.org/02fa3aq29"
publication_year = 2024

Get all works published by researchers at the institution

Our search criteria are as follows:

  • institutions.ror: ROR ID of the institution affiliated with the authors of a work, institutions.ror:https://ror.org/02fa3aq29
  • publication_year: the year the work was published, publication_year:2024

Now we need to build a URL for the query from the following parameters:

  • Starting point is the base URL of the OpenAlex API: https://api.openalex.org/
  • We append the entity type to it: https://api.openalex.org/works
  • All criteria need to go into the query parameter filter that is added after a question mark: https://api.openalex.org/works?filter=
  • To construct the filter value we take the criteria we specified and concatenate them using commas as separators: https://api.openalex.org/works?filter=institutions.ror:https://ror.org/02fa3aq29,publication_year:2024&page=1&per-page=50
def get_works_by_institution(ror_id, publication_year, page=1, items_per_page=50):
    # construct the api url with the given ror id, publication year, page number, and items per page
    url = f"https://api.openalex.org/works?filter=institutions.ror:{ror_id},publication_year:{publication_year}&page={page}&per-page={items_per_page}"

    # send a GET request to the api and parse the json response
    response = requests.get(url)
    json_data = response.json()

    # convert the json response to a dataframe
    df_json = pd.DataFrame.from_dict(json_data["results"])

    next_page = True
    if df_json.empty: # check if the dataframe is empty (i.e., no more pages available)
        next_page = False

    # if there are more pages, recursively fetch the next page
    if next_page:
        df_json_next_page = get_works_by_institution(ror_id, publication_year, page=page+1, items_per_page=items_per_page)
        df_json = pd.concat([df_json, df_json_next_page])

    return df_json
df_works = get_works_by_institution(ror_id, publication_year)
if SAVE_CSV:
    df_works.to_csv(f"institution_works_{publication_year}.csv", index=True)

Get Publishers and Open Access Status

In a work entity object, there is information about the publication’s access status (open_access) and the publisher (primary_location).

open_access is an OpenAccess object describing the access status of this work. The object contains the Open Access status of the work (oa_status), with the following possible values (a quick way to query these statuses directly is sketched after the list):

  • diamond: Published in a fully OA journal—one that is indexed by the Directory of Open Access Journals (DOAJ) or that OpenAlex has determined to be OA—with no article processing charges (i.e., free for both readers and authors).
  • gold: Published in a fully OA journal.
  • green: Toll-access on the publisher landing page, but there is a free copy in an OA repository.
  • hybrid: Free under an open license in a toll-access journal.
  • bronze: Free to read on the publisher landing page, but without any identifiable license.
  • closed: All other articles.
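As an aside, oa_status doubles as an API filter, so the size of any of these subsets can be checked without downloading the works. A minimal sketch counting the institution’s gold OA works for the year (using the ror_id and publication_year defined above):

import requests

# count the institution's gold OA works for the year via meta.count,
# without retrieving the works themselves
url = (f"https://api.openalex.org/works?filter=institutions.ror:{ror_id},"
       f"publication_year:{publication_year},open_access.oa_status:gold&per-page=1")
print(requests.get(url).json()["meta"]["count"])

Back in the dataframe, we extract the fields we need from each work: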
import numpy as np

# extract 'oa_status' from the 'open_access' dictionary for each row
df_works["oa_status"] = df_works["open_access"].apply(lambda open_access: open_access["oa_status"])

# extract 'id' from 'source' within 'primary_location' if both exist; otherwise, set to null
df_works["source_id"] = df_works["primary_location"].apply(lambda location: location["source"]["id"] if location and location["source"] else np.nan)

# extract 'display_name' from 'source' within 'primary_location' if both exist; otherwise, set to null
df_works["source_name"] = df_works["primary_location"].apply(lambda location: location["source"]["display_name"] if location and location["source"] else np.nan)

# extract total citation counts for each row
df_works["total_citation"] = df_works["counts_by_year"].apply(lambda citation_count_by_year: sum([count["cited_by_count"] for count in citation_count_by_year]))

# fill null values in 'source_id' and 'source_name'
df_works["source_id"] = df_works["source_id"].fillna("unknown source")
df_works["source_name"] = df_works["source_name"].fillna("unknown source")

Aggregate Data Analysis

Here, we create a dataframe that summarizes our institution’s citation rates by Open Access type (e.g. Gold, Bronze, Green, etc.).

In this case, both Gold and Hybrid OA publications are cited at a higher average rate than Closed access publications (Hybrid markedly so). This suggests that publishing in these types of open access journals could increase the influence of your work.

# group the dataframe by 'oa_status' and aggregate the 'id' column by count and 'total_citation' column by sum
df_oa = df_works.groupby(["oa_status"]).agg({"id": "count", "total_citation": "sum"})

# rename the 'id' column to 'count'
df_oa.rename(columns={"id": "count"}, inplace=True)

# calculate the citation rate by dividing 'total_citation' by 'count' and add it as a new column
df_oa["citation_rate"] = df_oa["total_citation"] / df_oa["count"]

df_oa
count total_citation citation_rate
oa_status
bronze 245 376 1.534694
closed 2960 3741 1.263851
diamond 247 270 1.093117
gold 1751 2501 1.428327
green 664 591 0.890060
hybrid 1412 3021 2.139518
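The citation-rate comparison is easier to read as a chart. A minimal sketch (assuming df_oa as built above):

import matplotlib.pyplot as plt

# visualize the average citations per work for each OA status
df_oa["citation_rate"].sort_values(ascending=False).plot(kind="bar")
plt.title(f"Citation rate by OA status ({publication_year})")
plt.xlabel("OA status")
plt.ylabel("Citations per work")
plt.show()

Next, we break the counts down by source: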
# group the dataframe by 'source_id', 'source_name', and 'oa_status', and count the occurrences of each combination
df_oa = df_works.groupby(["source_id", "source_name", "oa_status"]).agg({"id": "count"})

# reset the index to have 'source_name' and 'oa_status' as columns
df_oa = df_oa.reset_index()

# pivot the dataframe to have 'source_name' as index, 'oa_status' as columns and count as values
df_oa = df_oa.pivot(index="source_name", columns="oa_status", values="id")
if SAVE_CSV:
    df_oa.to_csv(f"institution_works_oa_{publication_year}.csv", index=True)

df_oa
oa_status bronze closed diamond gold green hybrid
source_name
2021 IEEE 94th Vehicular Technology Conference (VTC2021-Fall) NaN 1.0 NaN NaN NaN NaN
2021 IEEE Asia-Pacific Microwave Conference (APMC) NaN 1.0 NaN NaN NaN NaN
2021 IEEE International Conference on Big Data (Big Data) NaN NaN NaN NaN 1.0 NaN
2022 11th International Conference on Control, Automation and Information Sciences (ICCAIS) NaN 1.0 NaN NaN NaN NaN
2022 16th European Conference on Antennas and Propagation (EuCAP) NaN 1.0 NaN NaN NaN NaN
... ... ... ... ... ... ...
unknown source 53.0 216.0 NaN 4.0 245.0 22.0
˜The œjournal of academic science. NaN NaN NaN NaN NaN 1.0
˜The œminerals, metals & materials series NaN 1.0 NaN NaN NaN NaN
Биологические мембраны Журнал мембранной и клеточной биологии NaN 1.0 NaN NaN NaN NaN
КАРДИОЛОГИЯ УЗБЕКИСТАНА NaN 1.0 NaN NaN NaN 2.0

2547 rows × 6 columns

Gold Open Access

# top 5 sources with the most 'gold' oa_status works
df_oa[df_oa.index != "unknown source"].sort_values("gold", ascending=False).head()
oa_status bronze closed diamond gold green hybrid
source_name
PLoS ONE NaN NaN NaN 111.0 NaN NaN
Journal of the Canadian Association of Gastroenterology NaN NaN NaN 49.0 NaN NaN
BMJ Open NaN NaN NaN 49.0 NaN NaN
Scientific Reports NaN NaN NaN 40.0 NaN NaN
JAMA Network Open NaN NaN NaN 29.0 NaN NaN

Diamond Open Access

# top 5 sources with the most 'diamond' oa_status works
df_oa[df_oa.index != "unknown source"].sort_values("diamond", ascending=False).head()
oa_status bronze closed diamond gold green hybrid
source_name
Genetics in Medicine Open NaN NaN 9.0 NaN NaN NaN
Canadian Medical Education Journal NaN NaN 8.0 NaN NaN NaN
Campbell Systematic Reviews NaN NaN 6.0 NaN NaN NaN
Studies in Social Justice NaN NaN 6.0 NaN NaN NaN
Frontiers in Health Services NaN NaN 5.0 NaN NaN NaN

Closed Access

# top 5 sources with the most 'closed' oa_status works
df_oa[df_oa.index != "unknown source"].sort_values("closed", ascending=False).head()
oa_status bronze closed diamond gold green hybrid
source_name
Blood 1.0 60.0 NaN NaN NaN 1.0
Journal of Clinical Oncology NaN 37.0 NaN NaN 1.0 1.0
Journal of Obstetrics and Gynaecology Canada 2.0 30.0 NaN NaN NaN 6.0
Springer eBooks NaN 28.0 NaN NaN NaN NaN
Elsevier eBooks NaN 25.0 NaN NaN NaN NaN