Core Concepts
Neural networks can exploit hidden factorizations in discrete data to learn more efficiently, and their scaling laws depend on the complexity of these underlying structures.
Stats
The study's default input and output spaces each contain 4096 tokens, i.e., a cardinality of 2^12.
The default input factorization splits the input space into 12 factors of cardinality 2 each, so the factor cardinalities multiply back to the full 2^12 = 4096 tokens.
The default output factorization uses 4 factors, each of cardinality 8 (8^4 = 4096).
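Both factorizations can be read as mixed-radix decompositions of the flat token ids. Below is a minimal sketch assuming the 12 input factors are binary (consistent with 2^12 = 4096); the helper names are illustrative, not from the paper.

```python
def token_to_factors(token, cardinalities):
    """Mixed-radix decomposition of a flat token id into per-factor values."""
    factors = []
    for c in reversed(cardinalities):
        factors.append(token % c)
        token //= c
    return tuple(reversed(factors))

def factors_to_token(factors, cardinalities):
    """Inverse: recombine per-factor values into the flat token id."""
    token = 0
    for f, c in zip(factors, cardinalities):
        token = token * c + f
    return token

INPUT_CARDS = [2] * 12   # 12 binary input factors (assumption): 2**12 = 4096
OUTPUT_CARDS = [8] * 4   # 4 output factors of cardinality 8: 8**4 = 4096

assert factors_to_token(token_to_factors(4095, INPUT_CARDS), INPUT_CARDS) == 4095
assert token_to_factors(9, OUTPUT_CARDS) == (0, 0, 1, 1)
```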
The default number of parent nodes influencing each output factor is set to 2.
The concentration parameter alpha is set to 0.1; small alpha yields near-deterministic conditional distributions, while large alpha pushes them toward uniform.
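One way to realize these two defaults is to draw, for every joint configuration of an output factor's 2 parents, a conditional distribution over its 8 values from a symmetric Dirichlet(alpha). A hedged sketch; the parents' cardinalities and this exact sampling mechanism are assumptions, not confirmed details:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, out_card = 0.1, 8    # concentration and output-factor cardinality
parent_cards = [2, 2]       # two binary parent factors (assumption)
n_parent_configs = int(np.prod(parent_cards))

# One row per joint parent configuration; each row is a distribution over the
# 8 output values, drawn from a symmetric Dirichlet(alpha).
cond_table = rng.dirichlet([alpha] * out_card, size=n_parent_configs)

# alpha -> 0 concentrates each row near one-hot (near-deterministic mapping);
# alpha -> infinity pushes each row toward the uniform distribution.
print(cond_table.round(3))
```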
The default MLP architecture uses an embedding dimension of 32, a hidden dimension of 64, and a single hidden layer.
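A minimal PyTorch sketch of these defaults; the activation choice and the 4096-way readout are assumptions rather than confirmed details:

```python
import torch
import torch.nn as nn

class FactorMLP(nn.Module):
    """Embedding width 32, one hidden layer of width 64, 4096-way logits."""

    def __init__(self, vocab=4096, d_embed=32, d_hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_embed)
        self.body = nn.Sequential(
            nn.Linear(d_embed, d_hidden),
            nn.ReLU(),                     # activation is an assumption
            nn.Linear(d_hidden, vocab),    # logits over the 4096 output tokens
        )

    def forward(self, tokens):
        return self.body(self.embed(tokens))

model = FactorMLP()
logits = model(torch.randint(0, 4096, (8,)))   # shape (8, 4096)
```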
The batch size for single-pass experiments is 8096.
The data split for generalization studies treats 90% of the data as observed, holding out the remaining 10% for evaluation.
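A short sketch combining the two settings above: hold out 10% of the 4096 inputs, then draw fresh batches from the observed 90% so no example is revisited in a single pass. All names here are illustrative:

```python
import torch

g = torch.Generator().manual_seed(0)
perm = torch.randperm(4096, generator=g)   # random order over all input tokens
n_obs = int(0.9 * 4096)                    # 90% observed split
observed, held_out = perm[:n_obs], perm[n_obs:]

batch_size = 8096   # single-pass batch size quoted above

def sample_batch():
    # Fresh i.i.d. draws from the observed inputs at every step;
    # generalization is then measured on the held-out 10%.
    idx = torch.randint(0, len(observed), (batch_size,), generator=g)
    return observed[idx]
```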