Core Concepts
Learning OR functions and parities from aggregated label proportions is NP-hard, in contrast to the efficient PAC learnability of these functions from individually labeled examples.
Abstract
The paper studies the computational learning aspects of the learning from label proportions (LLP) framework, where the training examples are aggregated into subsets or bags and only the average label per bag is available for learning an example-level predictor.
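To make the bag aggregation concrete, the following minimal Python sketch (our own illustration, not code from the paper; make_llp_bags is a hypothetical helper) shows how example-level labels disappear behind per-bag label proportions:

```python
# Minimal sketch of LLP training data: individual labels are hidden
# and only each bag's label proportion survives. Illustrative only;
# make_llp_bags is not from the paper.
import random

def make_llp_bags(examples, labels, bag_size):
    """Partition (example, label) pairs into bags of size `bag_size`
    and keep, per bag, only the feature vectors and the average label
    (the label proportion)."""
    idx = list(range(len(examples)))
    random.shuffle(idx)
    bags = []
    for start in range(0, len(idx), bag_size):
        chunk = idx[start:start + bag_size]
        feats = [examples[i] for i in chunk]
        proportion = sum(labels[i] for i in chunk) / len(chunk)
        bags.append((feats, proportion))  # individual labels discarded
    return bags

# Example: 2-bit inputs labeled by the OR of their coordinates.
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [x[0] | x[1] for x in xs]  # ground-truth OR labels
for feats, p in make_llp_bags(xs, ys, bag_size=2):
    print(feats, "-> label proportion:", p)
```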
The key findings are:
For bags of size at most 2 that are consistent with an OR function, it is NP-hard to find a CNF formula with constantly many clauses that satisfies any constant fraction of the bags. This separates the learnability of ORs in the LLP setting using constant-clause CNFs, for which no constant-fraction guarantee is efficiently achievable, from halfspaces, which can satisfy a constant fraction of such bags in polynomial time.
It is NP-hard to satisfy more than a (1/2 + o(1)) fraction of such bags using a t-DNF formula (a DNF in which each term has at most t literals) for any constant t. In the standard PAC setting, such hardness was previously known only for learning noisy ORs.
For parities, it is NP-hard to satisfy more than a (q/2^(q-1) + o(1)) fraction of q-sized bags that are consistent with a parity using a parity, whereas a simple algorithm based on choosing a random parity achieves a (1/2^(q-2))-approximation (a sketch of this baseline appears below).
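For intuition, here is a hedged Python sketch of what it means for a hypothesis to satisfy a bag (its average prediction over the bag equals the bag's label proportion), together with a naive random-parity baseline in the spirit of the (1/2^(q-2))-approximation above. The names parity, satisfies, and random_parity_fraction are our own; this is an illustration under those assumptions, not the paper's algorithm.

```python
# Hedged sketch: bag satisfaction and a random-parity baseline.
# Illustrative names/structure only; not the paper's algorithm.
import random
from functools import reduce

def parity(subset):
    """The parity function x -> XOR of the coordinates listed in `subset`."""
    return lambda x: reduce(lambda acc, i: acc ^ x[i], subset, 0)

def satisfies(h, bag):
    """h satisfies a bag iff its average prediction on the bag's
    examples equals the bag's given label proportion."""
    feats, proportion = bag
    return sum(h(x) for x in feats) / len(feats) == proportion

def random_parity_fraction(bags, n, trials=1000):
    """Empirically estimate the best fraction of bags satisfied by
    uniformly random parities over n Boolean coordinates."""
    best = 0.0
    for _ in range(trials):
        subset = [i for i in range(n) if random.random() < 0.5]
        h = parity(subset)
        best = max(best, sum(satisfies(h, b) for b in bags) / len(bags))
    return best

# Demo: size-2 bags consistent with the parity x0 XOR x1 on 3-bit inputs.
xs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
ys = [x[0] ^ x[1] for x in xs]
bags = [((xs[i], xs[i + 1]), (ys[i] + ys[i + 1]) / 2) for i in range(0, 8, 2)]
print("best fraction over random parities:", random_parity_fraction(bags, n=3))
```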
These hardness results demonstrate a qualitative gap between the learnability of simple Boolean functions such as ORs and parities in the LLP setting and in the standard PAC framework, where they are efficiently learnable.