A Comparative Analysis of Reference and Metadata Coverage in OpenAlex, Web of Science, and Scopus
Core Concepts
OpenAlex is a promising open-source alternative to commercial citation indexes like Web of Science and Scopus, demonstrating comparable reference coverage for recent publications but revealing discrepancies in metadata completeness and potential issues with author disambiguation.
Abstract
Bibliographic Information: Culbert, J.H., Hobert, A., Jahn, N., Haupka, N., Schmidt, M., Donner, P., & Mayr, P. (2024). Reference Coverage Analysis of OpenAlex compared to Web of Science and Scopus. arXiv preprint arXiv:2401.16359v3.
Research Objective: This paper investigates the suitability of OpenAlex as a reliable open-source alternative to established commercial citation indexes, Web of Science (WoS) and Scopus, by comparing their reference coverage and metadata completeness.
Methodology: The researchers created a "Shared Corpus" of publications from 2015 to 2022, present in all three databases, based on exact DOI matching. They analyzed and compared the average total reference count, source reference count, and internal reference coverage. Additionally, they examined the coverage of abstracts, ORCID identifiers, Open Access status, and funding information at the journal level.
Key Findings:
- OpenAlex demonstrates comparable source reference coverage to WoS and Scopus for recent publications in the Shared Corpus.
- Scopus exhibits slightly better internal reference coverage than OpenAlex and WoS.
- OpenAlex contains a much larger corpus of publications than WoS and Scopus, but these additional publications are not extensively cited within the Shared Corpus.
- Discrepancies were found between reported reference counts and actual reference data in both Scopus and OpenAlex.
- OpenAlex lags behind WoS and Scopus in abstract coverage but shows higher ORCID coverage, although potential issues with author disambiguation were observed.
- Open Access status information is similar across all three databases.
- WoS and Scopus provide more comprehensive funding information than OpenAlex.
Main Conclusions: OpenAlex presents a viable open-source alternative to WoS and Scopus for studying contemporary scientific output, particularly when focusing on a core corpus of publications. However, limitations exist in its completeness of reference data and metadata, particularly regarding abstracts and potential inaccuracies in author disambiguation.
Significance: This study provides valuable insights for researchers and bibliometricians considering OpenAlex for their analyses, highlighting its strengths and limitations compared to established commercial databases.
Limitations and Future Research: The study acknowledges limitations regarding the lack of ground truth for reference counts and the accuracy of reference matching algorithms. Future research could investigate the accuracy of reference matching, analyze internal coverage across disciplines, and delve deeper into the discrepancies observed in reference counts and metadata completeness. Further investigation into the accuracy of ORCID attribution within OpenAlex is also warranted.
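The methodology summarized above (a Shared Corpus built by DOI matching, then per-database reference averages) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The record layout, the function names, and the case-insensitive DOI normalization are assumptions made for illustration, as is the reading of "source references" as references matched to a record in the same database and of "internal coverage" as source references divided by total references.

```python
from statistics import mean


def normalize_doi(doi):
    """Trim and lowercase a DOI (assumption: matching is case-insensitive)."""
    return doi.strip().lower()


def shared_dois(*corpora):
    """Return the DOIs that occur in every corpus after normalization."""
    doi_sets = [{normalize_doi(r["doi"]) for r in corpus} for corpus in corpora]
    return set.intersection(*doi_sets)


def reference_stats(corpus, shared):
    """Average reference counts for the records of one database in the shared set.

    Hypothetical layout: each record has a "references" list in which a
    matched reference is an identifier and an unmatched one is None.
    """
    rows = [r for r in corpus if normalize_doi(r["doi"]) in shared]
    total = [len(r["references"]) for r in rows]
    source = [sum(ref is not None for ref in r["references"]) for r in rows]
    return {
        "mean_total_refs": mean(total),
        "mean_source_refs": mean(source),
        "internal_coverage": sum(source) / sum(total) if sum(total) else 0.0,
    }


# Toy data, invented for illustration only.
wos = [
    {"doi": "10.1/A", "references": ["w1", "w2", None]},
    {"doi": "10.1/b", "references": ["w3"]},
    {"doi": "10.1/x", "references": []},
]
scopus = [
    {"doi": "10.1/a", "references": ["s1", "s2", "s3"]},
    {"doi": "10.1/b", "references": [None]},
]
openalex = [
    {"doi": "10.1/a", "references": ["o1", None, None, "o2"]},
    {"doi": "10.1/B", "references": ["o3", "o4"]},
    {"doi": "10.1/c", "references": ["o5"]},
]

shared = shared_dois(wos, scopus, openalex)
wos_stats = reference_stats(wos, shared)
openalex_stats = reference_stats(openalex, shared)
```

Restricting every database to the same set of DOIs is what makes the averages comparable: differences then reflect how each database records and matches references, not which publications it indexes.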
Reference Coverage Analysis of OpenAlex compared to Web of Science and Scopus
Stats
OpenAlex has average source reference numbers and internal coverage rates comparable to both Web of Science and Scopus when restricted to a cleaned dataset of 16.8 million recent publications shared by all three databases.
OpenAlex captures more ORCID identifiers, fewer abstracts and a similar number of Open Access status indicators per article when compared to both the Web of Science and Scopus.
The Shared Corpus contains 23.6% and 25.6% of all records in WoS and Scopus, respectively, and 6.9% of those in OpenAlex.
The Shared Corpus contains 41.1%, 35.8% and 31.7% of the references in the whole corpora of WoS, Scopus and OpenAlex respectively.
For publications from 2015 to 2022, the Shared Corpus contains 74.3% of the records in WoS, 60.8% of those in Scopus, and 21.8% of those in OpenAlex.
Over 92% of the articles in WoS and Scopus have abstract information, compared with 87% abstract coverage in OpenAlex.
The proportion of articles with at least one ORCID present is 92% in OpenAlex, 16% in WoS, and 32% in Scopus.
The proportion of open access information in all three databases is around 49%.
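Coverage figures like those above are shares of records that carry a given metadata field. The Python sketch below shows one way such shares could be computed. The field names (`abstract`, `authors`, `orcid`, `oa_status`) and the sample records are hypothetical and do not reflect any database's actual schema. The sketch also assumes that any non-missing Open Access status counts as "information present", including a status such as closed.

```python
def coverage_rates(records):
    """Share of records with an abstract, at least one ORCID, and an OA status."""
    n = len(records)
    if n == 0:
        raise ValueError("coverage is undefined for an empty record set")
    with_abstract = sum(bool(r.get("abstract")) for r in records)
    with_orcid = sum(
        any(author.get("orcid") for author in r.get("authors", []))
        for r in records
    )
    with_oa_status = sum(r.get("oa_status") is not None for r in records)
    return {
        "abstract": with_abstract / n,
        "orcid": with_orcid / n,
        "oa_status": with_oa_status / n,
    }


# Invented records; the ORCID strings are placeholders, not real identifiers.
sample = [
    {"abstract": "text", "oa_status": "gold",
     "authors": [{"name": "A", "orcid": "0000-0000-0000-0001"}]},
    {"abstract": "", "oa_status": "closed",
     "authors": [{"name": "B", "orcid": None}]},
    {"abstract": "text", "oa_status": "green",
     "authors": [{"name": "C"}, {"name": "D", "orcid": "0000-0000-0000-0002"}]},
    {"abstract": "text", "oa_status": None, "authors": []},
]

rates = coverage_rates(sample)
```

An empty abstract string is counted as missing here, which matters in practice because a field can be present in a record without carrying any content.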
Quotes
"OpenAlex is a promising alternative to proprietary bibliometric data sources as its permissible licensing creates the potential to support a transformation of research practice towards reproducible bibliometrics."
"As OpenAlex is actually much larger than Scopus and WoS... it could be expected that its internal reference coverage is at least not lower than those of the latter databases."
"In this respect, the Scopus coverage policy seems to be a bit more effective. However, one possible factor could also be that a comparatively poorer reference-matching algorithm misses a noticeable amount of actual source references."
"The vastly greater corpus of document records in OpenAlex, compared to WoS and Scopus, raises the question of what this additional content is, which is covered by OpenAlex but by neither established commercial provider. Our findings demonstrate what this content is not: it is not that part of the scientific literature which is referenced by items within WoS or Scopus."
Deeper Inquiries
How might the evolving landscape of open access publishing and metadata sharing impact the future development and adoption of OpenAlex compared to commercial citation indexes?
The evolving landscape of open access publishing and metadata sharing is poised to significantly impact the future development and adoption of OpenAlex compared to commercial citation indexes like Web of Science and Scopus. Here's how:
Positive Impacts:
Increased Data Availability: The open access movement is pushing for greater transparency and accessibility of research outputs, including metadata. This trend benefits OpenAlex, as it relies on open data sources. As more publishers embrace open metadata sharing, OpenAlex's coverage and data completeness will likely improve, potentially surpassing that of commercial indexes, whose data sit behind paywalls and restrictive data-sharing agreements.
Enhanced Community-Driven Development: OpenAlex's open-source nature fosters community involvement in its development. This collaborative approach can lead to faster innovation, more comprehensive feature sets, and potentially more accurate data through community validation and error correction.
Cost-Effectiveness for Researchers: As open access publishing gains traction, researchers may find themselves less reliant on commercial indexes for accessing basic bibliographic information. OpenAlex, being free to use, could become an attractive alternative, especially for researchers in institutions or regions with limited access to expensive subscription-based databases.
Challenges:
Sustainability and Funding: While OpenAlex benefits from open data, its long-term sustainability depends on securing funding for infrastructure, maintenance, and development. Commercial indexes have established revenue streams, while OpenAlex needs to explore alternative models like institutional partnerships or philanthropic support.
Data Quality Control: Relying on diverse, voluntarily shared data sources presents challenges in maintaining consistent data quality. OpenAlex needs robust mechanisms for data validation, cleaning, and deduplication to ensure accuracy and reliability, especially as the volume of data grows.
Competition from Commercial Indexes: Commercial indexes are likely to adapt to the changing landscape by exploring open data integration and developing new features. OpenAlex needs to continuously innovate and demonstrate its value proposition to remain competitive.
Overall, the shift towards open access publishing and metadata sharing presents a significant opportunity for OpenAlex to establish itself as a leading citation index. However, addressing the challenges related to sustainability, data quality, and competition will be crucial for its long-term success.
Could the limitations in OpenAlex's metadata coverage be attributed to a reliance on voluntary data contributions from publishers, and if so, how might this be addressed to improve data completeness?
Yes, the limitations in OpenAlex's metadata coverage, as highlighted in the study, can be partly attributed to its reliance on voluntary data contributions from publishers. Here's a breakdown:
Voluntary Nature Leads to Inconsistent Coverage: Publishers have varying policies and practices regarding open metadata sharing. Some may not deposit complete metadata with open metadata registries like Crossref, leading to gaps in OpenAlex's data, particularly for abstracts, funding information, and even accurate publication dates.
Lack of Incentives for Comprehensive Sharing: Currently, there's limited incentive for publishers to prioritize comprehensive open metadata sharing. They might prioritize depositing basic information required for DOI registration but withhold richer metadata.
Addressing the Issue:
Advocacy and Raising Awareness: Promoting the benefits of open metadata for the entire scholarly ecosystem is crucial. This includes demonstrating how complete metadata facilitates discoverability, research assessment, and the development of new research tools and services.
Developing Shared Standards and Best Practices: Establishing clear, widely adopted standards for metadata schemas and deposition processes can encourage consistency and completeness in data sharing.
Incentivizing Open Metadata Sharing: Exploring mechanisms to incentivize publishers to share complete metadata is key. This could involve:
Funder Mandates: Research funders could require grantees to publish in venues that deposit full metadata to open repositories as a condition of funding.
Journal Ranking Metrics: Incorporating open metadata completeness as a factor in journal ranking metrics could motivate publishers to improve their data sharing practices.
Recognition and Rewards: Recognizing and rewarding publishers who demonstrate exemplary open metadata practices can encourage wider adoption.
By addressing the voluntary nature of metadata contributions and creating a more robust ecosystem for open metadata sharing, OpenAlex can overcome its limitations and provide a more comprehensive and reliable data source for the research community.
What are the ethical implications of relying on automated disambiguation techniques in large-scale citation indexes, and how can we ensure accuracy and fairness in attributing authorship?
Relying solely on automated disambiguation techniques in large-scale citation indexes like OpenAlex raises significant ethical concerns, particularly regarding accurate authorship attribution. Here's a closer look:
Ethical Implications:
Inaccurate Attribution and Career Impact: Errors in author disambiguation can lead to publications being incorrectly assigned or missed altogether. This can negatively impact researchers' career progression, funding opportunities, and overall recognition for their work, especially for those with common names or from non-Western countries where name conventions might be different.
Bias Amplification: Automated systems trained on existing data can inherit and even amplify biases present in that data. For instance, if a system is trained on data predominantly from Western countries, it might struggle to accurately disambiguate authors from other regions, leading to systematic under-representation.
Lack of Transparency and Recourse: The inner workings of some disambiguation algorithms might be opaque, making it difficult for researchers to understand how their work is being attributed and to challenge incorrect assignments. This lack of transparency can erode trust in the system.
Ensuring Accuracy and Fairness:
Combining Automated and Human Oversight: A hybrid approach that combines the efficiency of automated disambiguation with human validation is crucial. This could involve manual verification of high-stakes attributions (e.g., for funding decisions) or implementing crowdsourcing mechanisms for data quality control.
Developing More Robust Algorithms: Investing in research and development of disambiguation algorithms that are less prone to bias and can handle diverse name conventions is essential. This includes incorporating cultural sensitivity and linguistic nuances into the algorithms.
Empowering Researchers with Control: Providing researchers with tools to manage their own author profiles, link their publications accurately, and flag potential errors is crucial. Integrating ORCID identifiers more effectively can significantly improve disambiguation accuracy.
Transparency and Explainability: Making the disambiguation process more transparent by providing researchers with insights into how their work is being attributed and offering clear channels for reporting errors is essential for building trust and accountability.
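The hybrid approach described above needs a way to decide which attributions go to a human reviewer. One simple heuristic, sketched below in Python, flags two patterns: an ORCID attached to more than one author profile (a possible split identity) and an author profile attached to more than one ORCID (a possible merge of different people). The function name, the input layout of `(author_id, orcid)` pairs, and the sample values are all hypothetical. This is not how OpenAlex or any other index performs disambiguation.

```python
from collections import defaultdict


def orcid_conflicts(authorships):
    """Flag ORCID/profile mismatches in an iterable of (author_id, orcid) pairs.

    Returns two dicts:
      split  - ORCID -> author profiles, where one ORCID maps to several profiles
      merged - author profile -> ORCIDs, where one profile carries several ORCIDs
    """
    ids_per_orcid = defaultdict(set)
    orcids_per_id = defaultdict(set)
    for author_id, orcid in authorships:
        if not orcid:
            continue  # authorships without an ORCID cannot be checked this way
        ids_per_orcid[orcid].add(author_id)
        orcids_per_id[author_id].add(orcid)
    split = {o: sorted(ids) for o, ids in ids_per_orcid.items() if len(ids) > 1}
    merged = {a: sorted(os) for a, os in orcids_per_id.items() if len(os) > 1}
    return split, merged


# Invented pairs; the ORCID strings are placeholders, not real identifiers.
pairs = [
    ("A1", "0000-0000-0000-0001"),
    ("A2", "0000-0000-0000-0001"),  # same ORCID, second profile
    ("A3", "0000-0000-0000-0002"),
    ("A3", "0000-0000-0000-0003"),  # same profile, second ORCID
    ("A4", None),
    ("A5", "0000-0000-0000-0004"),
]

split, merged = orcid_conflicts(pairs)
```

Flagged cases are candidates for manual review, not confirmed errors: an ORCID recorded on the wrong publication produces the same pattern as a genuine disambiguation mistake.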
Addressing the ethical implications of automated disambiguation requires a multi-faceted approach that prioritizes accuracy, fairness, transparency, and researcher control. By combining technological advancements with ethical considerations, we can create more equitable and reliable citation indexes that accurately reflect the contributions of all researchers.