Analysis of Data Usage and Citation Practices in Medical Imaging Conferences
Core Concepts
Medical imaging papers often lack proper citation and acknowledgment of the datasets used, making it difficult to track dataset usage and impact.
Abstract
The authors present two open-source tools to detect dataset usage in scientific papers:
A pipeline using the OpenAlex citation index and full-text analysis to automatically identify dataset citations and mentions (a minimal sketch of this step appears below this list).
A PDF annotation tool to manually label dataset presence in papers.
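The citation-index step can be approximated with the public OpenAlex API. The sketch below is a minimal illustration, not the authors' released pipeline: it assumes the Python `requests` library, a placeholder OpenAlex work ID for a dataset's paper, and a hand-written list of dataset name aliases.

```python
# Minimal sketch, not the authors' released tool: fetch works that cite a dataset's
# paper via the OpenAlex API, and check a text for dataset-name mentions.
# The work ID and dataset aliases below are placeholders.
import requests

OPENALEX_WORKS = "https://api.openalex.org/works"
DATASET_PAPER_ID = "W0000000000"  # hypothetical OpenAlex ID of the dataset's paper
DATASET_ALIASES = ["ACDC", "Automated Cardiac Diagnosis Challenge"]

def fetch_citing_works(work_id: str, per_page: int = 200):
    """Yield every OpenAlex work that cites `work_id`, following cursor pagination."""
    cursor = "*"
    while cursor:
        resp = requests.get(
            OPENALEX_WORKS,
            params={"filter": f"cites:{work_id}", "per-page": per_page, "cursor": cursor},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        yield from payload["results"]
        cursor = payload["meta"].get("next_cursor")

def mentions_dataset(full_text: str) -> bool:
    """Crude mention check over a paper's full text (obtained separately, e.g. from the PDF)."""
    lowered = full_text.lower()
    return any(alias.lower() in lowered for alias in DATASET_ALIASES)

if __name__ == "__main__":
    citing = list(fetch_citing_works(DATASET_PAPER_ID))
    print(f"{len(citing)} papers cite {DATASET_PAPER_ID} according to OpenAlex")
```

A real pipeline would also restrict the citing works to the venues of interest (MICCAI and MIDL) and rely on full-text analysis of the papers themselves rather than metadata alone, as the abstract describes.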
The authors apply these tools to study the usage of 20 popular medical imaging datasets in papers from the MICCAI and MIDL conferences between 2013 and 2023. Key findings:
There is a concentration of research on a limited set of datasets, especially for tasks like cardiac segmentation and chest classification.
Many papers cite a dataset's paper without actually mentioning the dataset in the main text, suggesting the dataset was cited but not actually used.
Conversely, some papers mention a dataset without citing the associated paper, leaving that usage invisible to citation indexes (both cases are illustrated in the sketch after these findings).
The lack of standardized practices for indicating dataset usage is a major challenge. The authors recommend adopting a "Data Availability" section in papers to improve transparency and enable better tracking of dataset usage in the field.
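Once each paper carries two signals, a citation flag from the index and a mention flag from the full text, the cited-only / mentioned-only breakdown the authors report can be tabulated directly. The records and field names below are illustrative, not the paper's actual schema.

```python
# Illustrative tabulation of citation-vs-mention combinations per paper.
# The input records and field names are hypothetical, not the authors' data schema.
from collections import Counter
from dataclasses import dataclass

@dataclass
class PaperRecord:
    paper_id: str
    cites_dataset: bool     # dataset paper appears in the reference list (e.g. via OpenAlex)
    mentions_dataset: bool  # dataset name appears in the paper's full text

def classify(record: PaperRecord) -> str:
    if record.cites_dataset and record.mentions_dataset:
        return "cited and mentioned (likely used)"
    if record.cites_dataset:
        return "cited only (citation without apparent use)"
    if record.mentions_dataset:
        return "mentioned only (use invisible to citation indexes)"
    return "neither"

records = [
    PaperRecord("paper-001", cites_dataset=True, mentions_dataset=True),
    PaperRecord("paper-002", cites_dataset=True, mentions_dataset=False),
    PaperRecord("paper-003", cites_dataset=False, mentions_dataset=True),
]

print(Counter(classify(r) for r in records))
```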
[Citation needed] Data usage and citation practices in medical imaging conferences
Stats
"While the availability of a public dataset is a positive step towards getting a problem addressed by the community, the choice of a single dataset for evaluation also results in an overestimation of performances leading to a gap when applied on a different one (Wu et al., 2021)."
"We find few studies on understanding dataset use beyond their initial release in the field. We believe this is in part due to identifying dataset usage, as datasets may be used without corresponding citations, and vice versa."
"We find that almost every subset has more than 25% of datasets being only cited and around 10% being only mentioned."
Quotes
"Papers in major medical conferences tend to use a limited set of datasets, especially for papers addressing the same task."
"The lack of citation standards for dataset usage makes tracking such usage difficult, in particular due to (i) papers citing a dataset's paper without mentioning it in particular sections, indicating a non-usage, and (ii) papers mentioning a dataset without citing its paper, which classical bibliometric tools like OpenAlex can not detect."
How can the medical imaging community establish standardized practices for citing and acknowledging dataset usage in research papers?
To establish standardized practices for citing and acknowledging dataset usage in medical imaging research papers, the community can take several strategic steps. First, the implementation of a "Data Availability" section in research papers can be mandated, similar to practices in journals like NeuroImage. This section should detail the datasets used, including their origins, licenses, and any relevant ethical considerations. Second, the development of a standardized citation format for datasets, akin to the formats used for academic papers, can facilitate consistent referencing. This could involve creating a dedicated registry for datasets, where each dataset is assigned a unique identifier (similar to DOIs for publications) that researchers can use in their citations.
Furthermore, conferences and journals can enforce guidelines that require authors to explicitly mention datasets in the methodology or results sections, rather than only in the introduction or related work sections. This would help ensure that dataset usage is clearly documented and can be tracked effectively. Lastly, fostering collaboration between dataset creators, publishers, and researchers can lead to the establishment of best practices and guidelines that are widely accepted and adopted across the community.
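One hedged way such a guideline could be checked automatically, assuming a submission's sections have already been extracted to plain text, is to flag papers whose only dataset mention falls outside sections that imply actual use. The section list and dataset aliases below are illustrative choices, not an established standard.

```python
# Hedged sketch: flag papers that mention a dataset only outside usage-implying sections.
# Assumes sections were already extracted to plain text; names below are illustrative.
import re

USAGE_SECTIONS = {"methods", "materials and methods", "experiments", "results",
                  "data availability"}
DATASET_ALIASES = [r"\bACDC\b", r"Automated Cardiac Diagnosis Challenge"]

def sections_mentioning(paper_sections: dict[str, str]) -> set[str]:
    """Return the set of section titles whose text matches any dataset alias."""
    hits = set()
    for title, text in paper_sections.items():
        if any(re.search(alias, text, flags=re.IGNORECASE) for alias in DATASET_ALIASES):
            hits.add(title.lower())
    return hits

def usage_clearly_documented(paper_sections: dict[str, str]) -> bool:
    """True if at least one mention appears in a section that implies actual use."""
    return bool(sections_mentioning(paper_sections) & USAGE_SECTIONS)

paper = {
    "Introduction": "Prior work evaluates cardiac segmentation on the ACDC challenge.",
    "Methods": "We train a U-Net on our private cohort only.",
}
print(usage_clearly_documented(paper))  # False: the mention appears only in the Introduction
```

A check like this would let venues prompt authors at submission time, rather than leaving readers to reconstruct dataset usage after publication.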
What are the potential biases and limitations that may arise from the overuse of a small set of popular datasets in the field?
The overuse of a small set of popular datasets in medical imaging can lead to several biases and limitations that may compromise the validity and generalizability of research findings. One significant concern is the risk of overfitting, where algorithms trained on a limited variety of data may perform well on those specific datasets but fail to generalize to real-world scenarios or diverse patient populations. This can result in models that are not robust or reliable when applied to different datasets or clinical settings.
Additionally, reliance on a few datasets can introduce selection bias, as these datasets may not represent the full spectrum of medical conditions, demographics, or imaging modalities encountered in practice. For instance, if a dataset predominantly features images from a specific demographic group, the resulting models may not perform adequately for underrepresented populations, leading to disparities in healthcare outcomes.
Moreover, the concentration of research on popular datasets can stifle innovation and exploration of new methodologies or applications, as researchers may gravitate towards familiar datasets rather than seeking out novel or less-studied datasets that could provide valuable insights. This can create a feedback loop where certain datasets become increasingly popular, further marginalizing other potentially valuable datasets.
How can dataset creators and curators work with the research community to incentivize more diverse and representative dataset usage in medical imaging studies?
Dataset creators and curators can play a pivotal role in incentivizing more diverse and representative dataset usage in medical imaging studies through several collaborative strategies. First, they can actively engage with the research community to understand the specific needs and challenges faced by researchers. By soliciting feedback and conducting surveys, dataset creators can tailor their datasets to address gaps in representation and diversity.
Second, offering incentives such as grants, awards, or recognition for studies that utilize underrepresented datasets can motivate researchers to explore a broader range of data sources. This could include funding opportunities specifically aimed at projects that focus on diversity in dataset usage or that aim to validate models across multiple datasets.
Additionally, dataset creators can enhance the accessibility and usability of their datasets by providing comprehensive documentation, clear licensing information, and user-friendly interfaces. This can lower the barriers to entry for researchers who may be hesitant to use less popular datasets due to concerns about data quality or usability.
Finally, fostering partnerships between dataset creators, academic institutions, and industry stakeholders can facilitate collaborative research initiatives that prioritize diverse dataset usage. By creating platforms for sharing best practices and success stories, the community can collectively promote the importance of diversity in dataset selection, ultimately leading to more robust and generalizable research outcomes in medical imaging.