Stay Connected:

## The 2008 Workshop on Sparsity in High Dimensional Statistics and Learning Theory

Date: March 22nd - March 24th, 2008
Location: Executive Conference Room (#228) of the ISyE complex, Georgia Institute of Technology

### Topics Covered:

• Asymptotic Geometric Analysis and High Dimensional Probability in Sparse Recovery Problems
• Model Selection and Oracle Inequalities for Statistical Models of Sparsity
• Penalization techniques (such as $\ell_1$-penalization)
• Dantzig Selector and other methods of sparse recovery
• Greedy Approximation Methods
• Concepts of Sparsity and Complexity in Learning Theory
• Sparsity and Generalization Bounds for Learning Machines

### Speakers:

Andrew Barron, Cong Huang, and Xi Luo (Yale University)
Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms
A presentation in three parts: 1. OVERVIEW OF RISK ANALYSIS (Andrew Barron). Variable-distortion, variable-complexity covers are shown to give an appropriate information-theoretic tool for demonstrating acceptable penalties for which the risk of the estimator is bounded by a corresponding population tradeoff between accuracy of approximation and penalty relative to the sample size. 2. L_1 PENALIZED SQUARED ERROR (Cong Huang). Accuracy is proven for a variant of greedy term selection in solving the ell_1 penalized squared error optimization. Its risk is shown to be superior to greedy term selection with subset size penalty in the case of libraries with certain covering properties. The bounds show smaller risk than the typical root{(log p)/n} where p is the library size and n is the sample size. 3. L_1 PENALIZED LOG-LIKELIHOOD (Xi Luo). Analogous conclusions are discussed for ell_1 penalized log likelihood in log-linear models and inverse covariance estimation. Sampling ideas for probabilistic demonstration of approximation, computation, and covering properties are an underlying tool for all three parts of the talk.

Flori Bunea (Florida State)
On $\ell_1$ regularized estimators in functional data and binary response regression models
The topic of $\ell_1$ regularized estimation has received considerable attention over the past decade. Recent theoretical advances have been mainly concerned with the risk of the estimators and corresponding sparsity oracle inequalities. In this talk we will investigate the quality of the $\ell_1$ penalized estimators from a slightly different perspective, shifting the emphasis to "in probability" statements. Our focus will be on the construction of confidence intervals and consistent variable selection.

We show that the $\ell_1$ type estimates can be used to construct honest confidence sets for the mean of a stochastic process. The inference is based on a sample of trajectories of the process, each observed at discrete times and corrupted by noise.

We will also discuss the usage of $\ell_1$ penalized estimators in binary response regression models, with emphasis on the logistic and square loss. We show that properly calibrated $\ell_1$ estimators can be used to identify, at any pre-specified confidence level, a small set of variables associated with the response, when the initial number of variables is larger than the sample size. As a consequence, we obtain conditions under which the false discovery rate, the specificity and sensitivity can be controlled. An application to SNIPS identification in genetics studies will be used to illustrate the method.

Albert Cohen (Paris VI)
Matching vs. basis pursuit for approximation and learning : a comparison
Basis pursuit and matching pursuit are two differentapproaches that are of common use in approximation theory, statistical learning, compressed sensing.Both exploit an underlying sparsity property of the target function to be estimated, in the sense that this function is well described as a combination of a few terms in a given dictionary. In this talk, their similarities and differences will be explored in the case of a general dictionary, in particular from the perspective of approximation theory.

Jianqing Fan (Princeton)
Correlation Learning and Sure Screening.
High dimensionality is a growing feature in many areas of contemporary statistics. Variable selection is fundamental to high-dimensional statistical modeling. For problems of large or huge scale, computational cost and estimation accuracy are always two top concerns. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method based on a correlation learning, called the Sure Independence Screening (SIS), to reduce dimensionality from high to a moderate scale that is below sample size. In a fairly general asymptotic framework, the correlation learning is shown to have the sure screening property for even exponentially growing dimensionality. As a methodological extension, an iterative SIS (ISIS) is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved on both speed and accuracy, and can then be accomplished by a well-developed method such as the SCAD, Dantzig selector, Lasso, or adaptive Lasso. The connections of these penalized least-squares methods are also elucidated. Joint work with Jinchi Lv

Sara van de Geer (ETH)
Joint work with P. Buehlmann and L. Meier (ETH Zuerich) and E. Mammen (University of Mannheim)
We consider an additive regression model where the number of components is large. The regression function is estimated using least squares with an additive penalty. We show that with a proper choice of the smoothness parameter, a sparse rate of convergence can be obtained. Thus, the estimator mimics an oracle which knows which components are relevant. The result can be shown under a coherence condition, similar to the one used in sparsity inequalities for the Lasso.

Gitta Kutyniok (Stanford)
Geometric Separation from a Microlocal Analysis Viewpoint using l1-Minimization and Cluster Coherence
Consider an image mixing two (or more) geometrically distinct but spatially overlapping phenomena - for instance, pointlike and curvelike structures in astronomical imaging of galaxies. This raises the Problem of Geometric Separation, taking a single image and extracting two images one containing just the pointlike phenomena and one containing just the curvelike phenomena. Although this seems impossible - as there are two unknowns for every datum - suggestive empirical results have been obtained by Jean-Luc Starck and collaborators.

Consider an image mixing two (or more) geometrically distinct but spatially overlapping phenomena - for instance, pointlike and curvelike structures in astronomical imaging of galaxies. This raises the Problem of Geometric Separation, taking a single image and extracting two images one containing just the pointlike phenomena and one containing just the curvelike phenomena. Although this seems impossible - as there are two unknowns for every datum - suggestive empirical results have been obtained by Jean-Luc Starck and collaborators.

We develop a theoretical approach to the Geometric Separation Problem in which a deliberately overcomplete representation is chosen made of two frames. One is suited to pointlike structures (wavelets) and the other suited to curvelike structures (curvelets or shearlets). The decomposition principle is to minimize the l1-norm of the analysis (rather than synthesis) frame coefficients. This forces the pointlike objects into the wavelet components of the expansion and the curvelike objects into the curvelet or shearlet part of the expansion. Our theoretical results show that at all sufficiently fine scales, nearly-perfect separation is achieved.

Our analysis has two interesting features. Firstly, we use a viewpoint deriving from microlocal analysis to understand heuristically why separation might be possible and to organize a rigorous analysis. Secondly, we introduce some novel technical tools: cluster coherence, rather than the now-traditional singleton coherence and l1-minimization in frame settings, including those where singleton coherence within one frame may be high.

Our general approach applies in particular to two variants of geometric separation algorithms. One is based on frames of radial wavelets and curvelets and the other uses orthonormal wavelets and shearlets. This is joint work with David Donoho (Stanford University).

Active and Semi-Supervised Learning Theory
Science is arguably the pinnacle of human intellectual achievement, yet the scientific discovery process itself remains an art. Human intuition and experience is still the driving force of the high-level discovery process: we determine which hypotheses and theories to entertain, which experiments to conduct, how data should be interpreted, when hypotheses should be abandoned, and so on. Meanwhile machines are limited to low-level tasks such as gathering and processing data. A grand challenge for scientific discovery in the 21st century is to devise machines that directly participate in the high-level discovery process. Towards this grand challenge, we must formally characterize the limits of machine learning. Statistical learning theory is usually based on supervised training, wherein a learning algorithm is presented with a finite set of i.i.d. labeled training examples. However, modern experimental methods often generate incredibly large numbers of unlabeled data for very little expense, while the task of labeling data is often painstaking and costly. Machine learning methods must leverage the abundance of unlabeled data in scientific problem domains. Active learning (AL) and semi-supervised learing (SSL) are two well known approaches to exploit unlabeled data. In both paradigms one has access to a large pool of unlabeled examples, and only a few labeled examples are provided or selected. AL is a sequential feedback process. Unlabeled examples that are predicted to have very informative labels, based on previously gathered labeled and unlabeled data, are selected for labeling. In SSL, labeled examples are randomly provided, without regard to potential informativeness. Today, little is known about theoretical limits of AL and SSL performance. Sparsity and complexity of the underlying data-generating distributions appear to play a central role in the performance of AL and SSL, and this talk will discuss some of the known theoretical results. This work is joint with Rui Castro, Aarti Singh and Jerry Zhu.

Vladimir Temlyakov (South Carolina)
On big jump estimators.
We consider a new method of construction of universal estimators. This method is based on a combination of two powerful ideas in building universal estimators. The first one is the use of penalized least squares estimators. This idea works well in the case of general setting with rather abstract methods of approximation. The second one is the idea of thresholding that works very well when we use wavelets expansions as an approximation tool. A new estimator that we call big jump estimator uses the least squares estimators and chooses a right model by a thresholding criteria instead of the penalization.

Joel Tropp (Caltech)
Iterative signal recovery from incomplete and inaccurate measurements
Compressive Sampling (CoSa) offers a new paradigm for acquiring signals that are compressible with respect to an orthobasis. The major algorithmic challenge in CoSa is to approximate a compressible signal from noisy samples. Until recently, all provably correct reconstruction techniques have relied on large-scale optimization, which tends to be computationally burdensome.

This talk describes a new iterative, greedy recovery algorithm, called CoSaMP that delivers the same guarantees as the best optimization-based approaches. Moreover, this algorithm offers rigorous bounds on computational cost and storage. It is likely to be extremely efficient for practical problems because it requires only matrix--vector multiplies with the sampling matrix. For many cases of interest, the running time is just O(N*log^p(N)), where N is the length of the signal and p is a small integer. This research is joint with D. Needell.

Alexandre Tsybakov (CREST and Paris 6)
Efficient sparse recovery with no assumption on the dictionary
Methods of sparse statistical estimation are mainly of the two types. Some of them, like the BIC, enjoy nice theoretical properties without any assumption on the dictionary but are computationally infeasible starting from relatively modest dimensions p. Others, like the Lasso or Dantzig selector, are easily realizable for very large p but their theoretical performance is conditioned by severe restrictions on the dictionary. The aim of this talk is to propose a new method of sparse recovery in regression, density and classification models realizing a compromise between theoretical properties and computational efficiency. The theoretical performance of the method is comparable with that of the BIC in terms of sparsity oracle inequalities for the prediction risk. No assumption on the dictionary is required, except for the standard normalization. At the same time, the method is computationally feasible for relatively large dimensions p. It is constructed using the exponential weighting with suitably chosen priors, and its analysis is based on the PAC-Bayesian ideas in statistical learning. In particular, we obtain some new PAC-Bayesian bounds with leading constant 1 and we develop a general technique to derive sparsity oracle inequalities from the PAC-Bayesian bounds. This is a joint work with Arnak Dalalyan

Ron DeVore (South Carolina)
A Taste of Compressed Sensing
We shall discuss the emerging field of compressed sensing. We shall stress open questions about effective decoding.

Marten Wegkamp (Florida State University)
Lasso type classifiers with a reject option
We consider the problem of binary classification where one can, for a particular cost, choose not to classify an observation. We present a simple oracle inequality for the excess risk of structural risk minimizers using a generalized lasso penalty.

Jon Wellner (University of Washington)
Testing for sparse normal mixtures: a new family of test statistics based on phi-divergences
Donoho and Jin (2004), following work of Ingster (1999), studied the problem of testing for a signal in a sparse normal means model and showed that there is a "detection boundary" above which the signal can be detected and below which no test has any power. They showed that Tukey's higher criticism'' statistic achieves the detection boundary. I will introduce a new family of test statistics based on phi-divergences indexed by $s \in [-1,2]$ which all achieve the Donoho-Jin-Ingster detection boundary. I will also review recent work on estimating the proportion of non-zero means.

Bin Yu (Berkeley)
Prediction using Side Information
Extracting useful information from high-dimensional data is the focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the vitues of both regularization and sparsity, the L1-penalized L2 minimization method Lasso has been popular. However, Lasso is often seen as not having enough regularization in the large p case.

In this talk, we propose two methods that take into account side information in the penalized L2 framework, in order to bring the needed extra regularization in the large p case. First, we combine different norms including L1 to to introduce the Composite Absolute Penalties (CAP) family. CAP allows the grouping and hierarchical relationships between the predictors to be expressed. It covers and goes beyond existing works including grouped lasso and elastic nets. Path following algorithms and simulation results will be presented to compare with Lasso in terms of prediction and sparsity. Second, motivated by the problem of predicting fMRI signals from input natural images, we investigate a method that uses side information in the unlabeled data for prediction. We present a theoretical result in the case of p/n -> constant and apply the method to the fMRI data problem. (It is noted that the second part is a report on on-going research.)

Tong Zhang (Rutgers)
The Effectiveness of Greedy Algorithms for Learning Sparse Representations
Consider linear least squares regression where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations.

Two methods that are widely used to solve this problem are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that simultaneously incorporates forward and backward steps into each greedy iteration, and show that the resulting procedure can effectively solve this problem under reasonable assumptions.

### Schedule

Saturday, March 22
9:00-9:50 Tropp
10:00-10:50 Nowak
11:30-12:20 Wellner
12:20-2:30 Lunch
2:30-3:20 Fan
3:30-4:20 Yu
4:50-5:25 Barron/Cong Huang/Xi Luo (I)
5:25-6:00 Barron/Cong Huang/Xi Luo (II)

Sunday, March 23
9:00-9:50 DeVore
10:00-10:50 Cohen
11:30-12:20 Temlyakov
12:20-2:30 Lunch
2:30-3:20 Tsybakov
3:30-4:20 van de Geer
5:00-5:50 Bunea

Monday, March 24
9:00-9:50 Wegkamp
10:00-10:50 Zhang
11:00-11:50 Kutyniok

### Organizers

Xiaoming Huo, Vladimir Koltchinskii, Justin Romberg, Ming Yuan