Core Concepts
Existing datasets are insufficient for training neural networks to generate English descriptions of binary code functionality. A novel dataset evaluation method, Embedding Distance Correlation (EDC), is proposed to assess dataset quality independently of any specific model.
Abstract
The authors investigate the feasibility of training a neural network to generate English descriptions of binary code functionality. They survey existing datasets and find none that are suitable for this task, as the descriptions do not match the level of detail required for reverse engineering.
To evaluate dataset quality, the authors propose a novel method called Embedding Distance Correlation (EDC). EDC measures the correlation between pairwise distances in the input embedding space (binary code) and the corresponding pairwise distances in the output embedding space (English descriptions). The intuition is that if two binary code samples are close in the input embedding space, their English descriptions should also be close in the output embedding space.
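The idea behind EDC can be illustrated with a minimal sketch. The summary does not state which distance metric or correlation coefficient the authors use, so the choices here (Euclidean distance, Pearson correlation) are assumptions, and the function names `pairwise_distances` and `embedding_distance_correlation` are hypothetical. The embeddings are taken as given arrays; how they are produced is outside the scope of this sketch.

```python
import numpy as np


def pairwise_distances(embeddings: np.ndarray) -> np.ndarray:
    """Euclidean distance for every unordered pair of rows (i < j)."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dist_matrix = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(embeddings), k=1)
    return dist_matrix[upper]


def embedding_distance_correlation(input_emb, output_emb) -> float:
    """Correlate input-space pair distances with output-space pair distances.

    Assumption: Pearson correlation over Euclidean distances. Row k of
    input_emb and row k of output_emb must belong to the same sample.
    """
    input_emb = np.asarray(input_emb, dtype=float)
    output_emb = np.asarray(output_emb, dtype=float)
    if len(input_emb) != len(output_emb):
        raise ValueError("input and output must have the same number of samples")
    if len(input_emb) < 3:
        raise ValueError("need at least 3 samples to correlate pair distances")
    d_in = pairwise_distances(input_emb)
    d_out = pairwise_distances(output_emb)
    return float(np.corrcoef(d_in, d_out)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    code_emb = rng.normal(size=(50, 4))   # stand-in for binary code embeddings
    text_emb = rng.normal(size=(50, 6))   # stand-in for unrelated descriptions
    print("scaled copy:", embedding_distance_correlation(code_emb, 2.0 * code_emb))
    print("unrelated:  ", embedding_distance_correlation(code_emb, text_emb))
```

Under these assumptions, a score near 1 means that samples that are neighbors in the input space are also neighbors in the output space, while a score near 0 means the two spaces are organized independently of each other.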
The authors apply EDC to several datasets, including their own Stack Overflow-derived dataset. None of the datasets exhibits a strong correlation, which the authors take to indicate that they are not suitable for training a model that generates binary code explanations. They also validate EDC by applying it to a known high-quality dataset (BillSum) and to synthetically degraded versions of it.
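The validation step can be sketched on synthetic data. The summary does not say how the degraded versions of BillSum were produced, so the degradation used here (re-pairing a fraction of the outputs with the wrong inputs) is an assumption, as are the Euclidean and Pearson choices and the helper names `distance_correlation`, `degrade_pairs`, and `run_degradation_sweep`. Random vectors stand in for real embeddings.

```python
import numpy as np


def distance_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between pair distances in a and in b (assumed metric)."""
    i, j = np.triu_indices(len(a), k=1)
    d_a = np.linalg.norm(a[i] - a[j], axis=1)
    d_b = np.linalg.norm(b[i] - b[j], axis=1)
    return float(np.corrcoef(d_a, d_b)[0, 1])


def degrade_pairs(outputs: np.ndarray, fraction: float, rng) -> np.ndarray:
    """Return a copy where a fraction of the rows is shuffled among themselves."""
    degraded = outputs.copy()
    n = len(degraded)
    chosen = rng.choice(n, size=int(round(fraction * n)), replace=False)
    degraded[chosen] = degraded[rng.permutation(chosen)]
    return degraded


def run_degradation_sweep(seed: int = 0) -> dict:
    """Score a clean synthetic dataset and increasingly mismatched copies."""
    rng = np.random.default_rng(seed)
    inputs = rng.normal(size=(60, 8))
    # "High-quality" stand-in: outputs are a scaled, lightly noised copy of inputs.
    outputs = 3.0 * inputs + 0.01 * rng.normal(size=inputs.shape)
    return {
        fraction: distance_correlation(inputs, degrade_pairs(outputs, fraction, rng))
        for fraction in (0.0, 0.5, 1.0)
    }


if __name__ == "__main__":
    for fraction, score in run_degradation_sweep().items():
        print(f"fraction re-paired = {fraction:.1f}, correlation = {score:.3f}")
```

If the score behaves as a quality measure should, it stays high for the clean pairing and falls as more input-output pairs are mismatched.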
Additionally, the authors experiment with using the GPT-3 language model to generate binary code explanations, but find it performs poorly, often hallucinating irrelevant descriptions or providing overly generic summaries.
The authors conclude that existing datasets are insufficient for this task and recommend future work on assembling larger, higher-quality datasets, potentially through data augmentation techniques. They plan to make their Stack Overflow dataset available to the research community.
Stats
This function calculates the n-th Fibonacci number.
This assembly code snippet is performing a calculation.
The calculation breaks down into 5 steps.
Quotes
"This is a work in progress. We believe that the Embedding Distance Correlation (EDC) method for evaluating the quality of a dataset is valuable and novel and are excited to present it."
"GPT-3 does a very poor job of summarizing the code in this dataset."