ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search
Core Concepts
ProCQA is a large-scale dataset extracted from StackOverflow that offers mixed-modal QA pairs for programming question answering. Pre-training on it yields significant performance improvements on code retrieval benchmarks.
Abstract
The paper introduces ProCQA, a large-scale dataset for programming question answering mined from StackOverflow.
The dataset offers mixed-modal QA pairs and covers 11 different programming languages.
Modality-agnostic contrastive pre-training on ProCQA leads to improved alignment of text and code representations.
The dataset serves as an evaluation benchmark and pre-training corpus for code language models.
Experiments show substantial performance gains over previous models across various code retrieval tasks.
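The modality-agnostic contrastive pre-training mentioned above can be illustrated with a small sketch. The summary does not give the exact objective, so the code below assumes a common choice: an InfoNCE loss with in-batch negatives, cosine similarity, and a temperature hyperparameter. The function names (`cosine`, `info_nce_loss`) and the default temperature of 0.05 are hypothetical, and the embeddings are plain lists of floats standing in for the output of a single shared encoder that treats text and code the same way.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(
    questions: List[Sequence[float]],
    answers: List[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """Mean InfoNCE loss over a batch of (question, answer) embedding pairs.

    answers[i] is the positive for questions[i]; every other answer in the
    batch acts as a negative. Nothing here depends on whether an embedding
    came from text or code, which is the sense in which the objective is
    modality-agnostic in this sketch.
    """
    if not questions or len(questions) != len(answers):
        raise ValueError("need a non-empty batch of matched pairs")

    total = 0.0
    for i, q in enumerate(questions):
        logits = [cosine(q, a) / temperature for a in answers]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true answer.
        total += log_denominator - logits[i]
    return total / len(questions)


# Toy batch: two questions whose answers point in the same directions.
aligned_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the answers swapped, so each positive is the worst match.
swapped_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each question embedding matches its own answer and large when the pairs are swapped, which is the signal that pulls matching question and answer representations together during pre-training.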