Sample Efficiency of Sigmoid Gating Compared to Softmax Gating in Mixture of Experts Models for Expert Estimation
Sigmoid gating in mixture of experts (MoE) models offers superior sample efficiency for estimating expert parameters compared to the commonly used softmax gating, especially when the experts are formulated as feed-forward networks with popular activation functions such as ReLU and GELU.
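To make the contrast concrete, the following is a minimal sketch (not taken from the paper) of the two gating mechanisms in a simple MoE with one-layer ReLU feed-forward experts. All function names, the single-layer expert form, and the random parameters are illustrative assumptions; the key structural difference shown is that softmax couples the gate weights across experts (they compete and sum to 1), while sigmoid scores each expert independently.

```python
import numpy as np

def softmax_gate(logits):
    # Softmax gating: weights are normalized jointly across experts
    # and sum to 1, so experts compete for routing mass.
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def sigmoid_gate(logits):
    # Sigmoid gating: each expert's weight is computed independently,
    # with no normalization constraint tying the experts together.
    return 1.0 / (1.0 + np.exp(-logits))

def moe_output(x, W_gate, experts, gate_fn):
    # x: (d,) input; W_gate: (num_experts, d); experts: list of callables.
    logits = W_gate @ x                              # (num_experts,)
    weights = gate_fn(logits)                        # (num_experts,)
    outputs = np.stack([e(x) for e in experts])      # (num_experts, out_dim)
    return weights @ outputs                         # (out_dim,)

rng = np.random.default_rng(0)
d, num_experts, out_dim = 4, 3, 2
x = rng.normal(size=d)
W_gate = rng.normal(size=(num_experts, d))

# Illustrative experts: one-layer feed-forward nets with ReLU activation.
Ws = [rng.normal(size=(out_dim, d)) for _ in range(num_experts)]
experts = [lambda x, W=W: np.maximum(W @ x, 0.0) for W in Ws]

y_soft = moe_output(x, W_gate, experts, softmax_gate)
y_sig = moe_output(x, W_gate, experts, sigmoid_gate)
```

Because the sigmoid gate scores experts independently, it avoids the interaction among gate parameters that the softmax normalization induces, which is the structural property the paper's sample-efficiency analysis turns on.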