Dimensionality Reduction (10/12/2016)

We have so many details that we can “perceive”, yet we seem to process them very efficiently. This phenomenon is pervasive across all sensory systems.

Question: How does the brain learn concepts efficiently from a small number of examples even though each example contains a huge amount of information?

Easy Concept vs. Hard Concept

Some concepts seem to be easier to pick up than other concepts.

For example, we can learn a concept of ELEPHANT easily with only a small number of examples. It is easy to distinguish elephants from others because it has many features that stand out from other animals.

However, learning a concept of AFRICAN ELEPHANT seems to be more difficult. It will take a lot more examples until you figure out one of its distinguishing features (i.e. big ear).

Why is one concept more difficult to learn than others?

The number of (relevant) features do not seem to determine the difficulty of a concept.

Instead, the similarity of examples with the same label and dissimilarity of examples with different labels seem to play a key role -> robustness of a concept.

Recall that $\forall D, X \sim D$, a PAC-learning algorithm, after seeing examples $(x_1, l(x_1)), (x_2, l(x_2))...$, comes up with a hypothesis h(x) with probability 1-$\delta$ such that P(h(x) = l(x)) >= 1- $\epsilon$.

Ideally, you don’t want to have to see many examples to learn.

Learning Parity

Suppose $X \in \{0,1\}^n$ and we have a “parity” concept: $S \subseteq \{1, ..., n\}, l_S(x) = \sum\limits_{i \in S} x_i$.

How many # of examples do we need to learn this concept (i.e. learning all $i \in S$)? (There are $2^n$ possible such concepts)

Assuming a membership query model, we only need to query n+1 times by flipping each query’s bit position and monitoring its effect on the parity:

Query 1: (1,0,1,1,1, … 0) -> 1
Query 2: (0,0,1,1,1, … 0) -> 0 (parity changed, first bit is relevant)
Query 3: (0,1,1,1,1, … 0) -> 0 (parity unchanged, second bit is irrelevant)

Query N+1: (0,1,0,0,0, … 1) -> … (DONE! all $i \in S$ are learned)

Can we do better than n examples? NO!

Learning this concept can be represented as solving a system of linear equations:

$X_1 \cdot Y = 0 \mod 2$
$X_2 \cdot Y = 1 \mod 2$
$X_3 \cdot Y = 1 \mod 2$

$X_k \cdot Y = 0 \mod 2$

We need n linearly independent equations to get a unique solution.

But what if we use less than n linearly independent examples?

For example, using n-2 linearly independent examples will find out all correct bits in Y except 2 positions, and will leave us with 4 “candidate” concepts.

Can we simply use any of 4 candidate concepts left? Does all 4 candidate concepts have similar accuracy since they only differ by at most 2 bits?

Suppose we have S, T, $f_S, f_T,$ and $S \neq T$ (S, T are different in at least 1 bit).

Then, $P(f_S(x) = f_T(x)) = 1/2$ which means that you will be wrong half of the time (only as good as picking a random concept)!

Learning Halfspaces

$X \in \{0, 1\}^n$, $W \cdot X \geq 0$ iff f(x) = 1

Question: How many distinct W vectors (halfspaces) are there?

Clearly, $|W| \le 2^{2^n}$, but it has a tighter bound:

A halfspace can pivot until it touches n points without changing its classification (n degrees of freedom):

$|W| \le {2^n \choose n} \le 2^{n^2}$

Theorem: $O(\frac{n}{ \epsilon } log \frac{1}{ \epsilon })$ examples suffice for learning halfspaces from any distribution.

But these are still too many examples because it requires as many examples as  dimensions (features). Can we do better?

Attribute-Efficient Learning

A concept is defined by k relevant features out of n total features (k << n).

In this model, sample complexity goes down to $O(klogn)$.

But even if it reduces the number of examples required, the computational complexity is very high (there is no simple learning algorithm).

In fact, if a concept is a disjunction of k features, it is NP-hard to learn which k features are relevant from a set of examples.

l-Robust Concepts

A concept C is $\ell$-robust if $P_{X \sim D}(x \mid \forall y : \lVert y - x \rVert \leq \ell , label(x) \neq label(y) ) = 0$

Sample complexity for learning a $\ell$-robust concept is $\frac{1}{\ell^2}$, which does not depend on the number of features!

But how does having a $\ell$-sized gap help to reduce sample complexity? Also, does it come with a huge computational cost like attribute efficient learning?

A simple algorithm, random projection, can be used to reduce the dimensionality of examples while “preserving” a $l$-robust concept.

Random Projection

To project a given point x $\in \mathbb{R}^n$ to a k-dimensional space (k < n), we  first choose a n x k matrix R whose columns are random vectors $R_1 ... R_k$.

The projection can be succinctly written as: $y = R^Tx$.

Random projection is very simple to implement and “neuron-friendly” because it can be represented as a simple 1-layer neural network:

How do we choose each entry $R_{i,j}$ of matrix R \$?

• $R_{i,j} \sim N(0,1)$ [standard normal distribution with mean 0 and variance 1]
• $R_{i,j} \sim U(-1,1)$ [discrete distribution defined by  $r_i = 1$ with prob. 1/2 and $r_i = -1$ with prob. 1/2]

Does random projection “preserve” a $\ell$-robust concept?

Theorem: If $k = \frac{1}{\ell^2} log \frac{1}{\epsilon \ell \delta}$, then with probability $1-\delta$$\exists \widetilde{W} \in \Bbb{R}^n$ which is $\frac{\ell}{2}$-robust that can be learned with  $\tilde{O}(\frac{1}{\ell^2})$ examples.

Proof Idea: Followings occur with high probability after random projection:

1. For any pair of points $x, y$ and projections $\widetilde{x}, \widetilde{y}, \left| \widetilde{x} \cdot \widetilde{y} - x \cdot y \right| \leq \frac{\ell}{2}$
2. For each example $x$:
• if $W \cdot x \geq \ell$, then $\widetilde{W} \cdot \widetilde{x} \geq \frac{l}{2}$
• if $W \cdot x \leq - \ell$, then $\widetilde{W} \cdot \widetilde{x} \leq - \frac{l}{2}$

Lemma: Given $\widetilde{u} = \frac{1}{\sqrt{k}}R^Tu, R_{i,j} \sim N(0,1),$

$E(\lVert \widetilde{u} \rVert^2) = \lVert u \rVert^2$
$P(\lVert \widetilde{u} \rVert^2 > (1 + \epsilon) \lVert u \rVert^2) < e^{ \frac{-\epsilon^2 - \epsilon^3 k}{4} }$
$P(\lVert \widetilde{u} \rVert^2 < (1 + \epsilon) \lVert u \rVert^2) < e^{ \frac{-\epsilon^2 - \epsilon^3 k}{4} }$

As you increase target k, the probability it deviates from the expected length gets smaller! The lemma also works for many other distributions on the projection matrix.

Algorithm for learning robust concepts

Generic algorithm:

1. Since the target concept is robust, random projection of the examples to a much lower-dimensional subspace will “preserve” the concept
2. In the lower-dimensional space, the number of examples and the time required to learn concepts are relatively small.

$\ell$-robust halfspaces:

1. Choose an n x k random matrix R by picking each entry indepedently from N(0,1) or U(-1,1).
2. Obtain m examples from D and project them to $\mathbb{R}^k$ using R.
3. Run the Perceptron algorithm, which is guaranteed to terminate after at most $\frac{1}{l^2}$ iterations
4. Output W.