Key Concepts
The authors explore the relationship between classification accuracy and generalizability across different network architectures, highlighting that generalization varies non-monotonically with layer depth.
Summary
The study investigates how well deep networks generalize to unseen classes in visual classification tasks. Architectures differ markedly in generalizability, and classification accuracy does not reliably predict it. The research introduces a method to quantify generalization in a minimalist domain using a zero-shot paradigm. The results indicate that higher accuracy does not guarantee higher generalizability, underscoring the role of architecture in determining performance.
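The zero-shot paradigm mentioned above rests on a class-level split: some classes are used for fine-tuning ("seen") while others are withheld entirely ("unseen") and encountered only at evaluation time. A minimal sketch of such a split, assuming a hypothetical `zero_shot_split` helper and an illustrative hold-out fraction (the paper's exact protocol is not given here):

```python
import random

def zero_shot_split(labels, unseen_fraction=0.3, seed=0):
    """Partition class labels into 'seen' (available for fine-tuning)
    and 'unseen' (held out entirely), as in a zero-shot paradigm.
    The fraction and seed here are illustrative assumptions."""
    classes = sorted(set(labels))
    rng = random.Random(seed)
    rng.shuffle(classes)
    n_unseen = max(1, int(len(classes) * unseen_fraction))
    return set(classes[n_unseen:]), set(classes[:n_unseen])  # (seen, unseen)

# Example: 10 character classes, 3 withheld as unseen
seen, unseen = zero_shot_split(list(range(10)))
```

Because the unseen classes never influence the fine-tuning loss, any cluster structure they exhibit in the latent space reflects genuine generalization rather than memorization.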
The study fine-tuned pretrained networks on a calligraphy dataset and developed a metric, g, that measures generalization via cluster separability. Architectures including ResNet, ViT, Swin Transformer, PViT, CvT, PoolFormer, and ConvNeXt V2 showed varying levels of generalizability across layers and training epochs. The findings suggest that no consistent encoding strategy emerges after fine-tuning, challenging traditional notions of classification performance.
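A cluster-separability score of this kind typically rewards embeddings whose classes form compact, well-separated clusters. The sketch below is a stand-in under that assumption, not the paper's actual g formula: it compares mean within-class spread against mean between-centroid distance and maps the result into [0, 1].

```python
import math
from collections import defaultdict

def g_score(embeddings, labels):
    """Toy cluster-separability score in [0, 1]: compares within-class
    compactness to between-class centroid separation. A stand-in for
    the paper's g metric, whose exact definition is not given here."""
    groups = defaultdict(list)
    for vec, lab in zip(embeddings, labels):
        groups[lab].append(vec)

    def centroid(vs):
        return [sum(c) / len(vs) for c in zip(*vs)]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    cents = {lab: centroid(vs) for lab, vs in groups.items()}
    # Compactness: mean distance of each point to its own class centroid
    within = sum(dist(v, cents[lab])
                 for lab, vs in groups.items() for v in vs) / len(embeddings)
    # Separation: mean pairwise distance between distinct class centroids
    labs = list(cents)
    pairs = [(a, b) for i, a in enumerate(labs) for b in labs[i + 1:]]
    between = sum(dist(cents[a], cents[b]) for a, b in pairs) / len(pairs)
    return between / (between + within)
```

Applied to the latent representations of seen versus unseen classes, a score like this yields the paired g values reported per architecture below: tight, well-separated clusters push the score toward 1, while overlapping clusters pull it toward 0.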
Furthermore, the research raises the question of what it would mean to base deep-learning classification on generalization rather than on typical loss functions. The study offers insight into how networks learn to represent stroke patterns in calligraphy and highlights potential applications of such representations beyond memorizing the characters of specific artists.
Statistics
ResNet: g = 0.62 for unseen classes, 0.88 for seen classes.
ViT: g = 0.70 for unseen classes, 0.95 for seen classes.
Swin Transformer: g = 0.62 for unseen classes, 0.80 for seen classes.
PViT: g = 0.77 for unseen classes, 0.93 for seen classes.
CvT: g = 0.67 for unseen classes, 0.94 for seen classes.
PoolFormer-S12: g = 0.79 for unseen classes, 0.91 for seen classes.
ConvNeXt V2: g = 0.63 for unseen classes, 0.92 for seen classes.
Quotes
"Accuracy is not a good predictor of generalizability."
"Our approach leads to quantifying generalization in this minimalist domain."
"Different architectures yield surprisingly different latent representations."