The paper presents two key improvements to the Self-Structuring AutoEncoder (Self-StrAE) model: adding cross-entropy reconstruction of the vocabulary as an auxiliary objective alongside the contrastive loss, and increasing the number of independent channels in the embeddings while decreasing the size of each channel.
The authors demonstrate that these changes allow Self-StrAE to be pre-trained from scratch on as little as 10M tokens of input data, and prove effective across multiple languages, including English, Spanish, and Afrikaans.
The core of the Self-StrAE model is its ability to learn embeddings that define their own hierarchical structure, extending from the subword to the sentence level. This inductive bias towards hierarchy is a key strength of the model, allowing it to be parameter and data efficient.
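To make the structure-inducing behaviour concrete, the sketch below shows one way embeddings can define their own hierarchy: adjacent frontier nodes are greedily merged by cosine similarity until a single root remains. This is a minimal illustration under stated assumptions; the mean-of-children composition and the function names are placeholders, not the paper's actual architecture.

```python
# Minimal sketch: embeddings define their own tree by greedy merging of the
# most similar adjacent pair. The compose step (mean of the two children) is
# a placeholder assumption, not Self-StrAE's actual composition function.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def induce_tree(embeddings, tokens):
    """Greedily merge the most similar adjacent pair until one node remains."""
    frontier = [(tok, emb) for tok, emb in zip(tokens, embeddings)]
    while len(frontier) > 1:
        # Pick the adjacent pair with the highest cosine similarity.
        best = max(range(len(frontier) - 1),
                   key=lambda i: cosine(frontier[i][1], frontier[i + 1][1]))
        (l_node, l_emb), (r_node, r_emb) = frontier[best], frontier[best + 1]
        merged = ((l_node, r_node), (l_emb + r_emb) / 2)  # placeholder compose
        frontier[best:best + 2] = [merged]
    return frontier[0]  # (nested tuple structure, root embedding)

# Toy usage with random "subword" embeddings.
rng = np.random.default_rng(0)
tree, root = induce_tree(rng.normal(size=(4, 8)), ["un", "help", "ful", "advice"])
print(tree)
```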
The authors compare the performance of different pre-training objectives, finding that combining cross-entropy reconstruction with a contrastive loss (CECO) leads to the best results. They then explore the impact of the number of independent channels in the embeddings, finding, surprisingly, that increasing the number of channels while decreasing their size yields significant improvements, even when the total number of non-embedding parameters is reduced to just seven.
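As a rough illustration of how such a combined objective and multi-channel embeddings might fit together, the PyTorch sketch below sums a contrastive term over node embeddings, with cosine similarities computed independently per channel and averaged, and a cross-entropy reconstruction over leaf tokens. The function names, dimensions, temperature, and equal loss weighting are assumptions for illustration; the paper's exact CECO formulation may differ.

```python
# Hedged sketch of a CECO-style objective: contrastive loss over node
# embeddings (per-channel similarities) plus cross-entropy reconstruction
# of leaf tokens. Details are illustrative, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def multichannel_similarity(x, y, num_channels):
    """Mean cosine similarity computed independently per channel."""
    b, d = x.shape
    xc = F.normalize(x.view(b, num_channels, d // num_channels), dim=-1)
    yc = F.normalize(y.view(b, num_channels, d // num_channels), dim=-1)
    # (b, b, channels) similarity tensor, averaged over channels.
    return torch.einsum("icd,jcd->ijc", xc, yc).mean(dim=-1)

def ceco_loss(node_embs, pos_embs, leaf_logits, leaf_targets,
              num_channels=8, temperature=0.1):
    # Contrastive term: matching node/positive pairs lie on the diagonal.
    sims = multichannel_similarity(node_embs, pos_embs, num_channels) / temperature
    labels = torch.arange(sims.size(0))
    contrastive = F.cross_entropy(sims, labels)
    # Reconstruction term: predict each leaf token from its decoded embedding.
    reconstruction = F.cross_entropy(leaf_logits, leaf_targets)
    return contrastive + reconstruction  # assumed equal weighting

# Toy usage with random tensors (dimensions are illustrative only).
embs = torch.randn(16, 64)
loss = ceco_loss(embs, embs + 0.01 * torch.randn_like(embs),
                 torch.randn(16, 1000), torch.randint(0, 1000, (16,)))
print(loss.item())
```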
The authors also demonstrate that the improvements hold across multiple languages, with the model performing comparably to, or better than, its English counterpart on Spanish and Afrikaans. The Afrikaans model in particular shows strong performance, even generalizing well to the closely related Dutch.
Overall, the paper presents a simple yet effective approach to improving the Self-StrAE model, making it a promising alternative for semantic textual relatedness tasks, especially in low-resource language settings.