Title: New Problems and Perspectives on Learning, Sampling, and Memory, in the Small Data Regime
Abstract: I will discuss several new problems related to the general challenge of understanding what conclusions can be drawn, given a dataset that is relatively small in comparison to the complexity or dimensionality of the underlying distribution from which it was sampled. In the first setting we consider the problem of learning a population of Bernoulli (or multinomial) parameters. This is motivated by the "federated learning" setting, where we have data from a large number of heterogeneous individuals, each of whom supplies a very modest amount of data, and we ask the extent to which the number of data sources can compensate for the lack of data from each source. Second, I will introduce the problem of data "amplification": given n independent draws from a distribution D, to what extent is it possible to output a set of m > n datapoints that are indistinguishable from m i.i.d. draws from D? Curiously, we show that nontrivial amplification is often possible in the regime where n is too small to learn D to any nontrivial accuracy. We also discuss connections between this setting and the challenge of interpreting the behavior of GANs and other ML/AI systems. Finally (if there is time), I will also discuss memory/data tradeoffs for regression, with the punchline that any algorithm that uses a subquadratic amount of memory will require asymptotically more data than second-order methods to achieve comparable accuracy. This talk is based on four joint papers with various subsets of Weihao Kong, Brian Axelrod, Shivam Garg, Vatsal Sharan, Aaron Sidford, Sham Kakade, and Ramya Vinayak.
Bio: Greg Valiant is an Assistant Professor in Stanford's Computer Science Department, working at the intersection of Algorithms, Machine Learning, Statistics, and Information Theory. One of the main themes of his work is the design of efficient algorithms for accurately inferring information about complex distributions, given limited amounts of data or limits on other resources such as computation time, available memory or communication, or the quality of the available data. Prior to joining Stanford, he was a postdoc at Microsoft Research, New England, and received his PhD in Computer Science from Berkeley and his BA in Math from Harvard.