Sparse Coding (8/31/2016)

Visual Pathway: Retina (~1 million sensors) -> V1 (~140 million) -> V2, V3, V4, V5, V6 -> IT -> Hippocampus

There are certain visual patterns that neurons in V1 respond to.

For example, from Hubel and Wiesel’s paper:

[Figure: orientation-selective responses of a V1 neuron, from Hubel and Wiesel]

Question 1: Why these particular patterns?

Question 2: How do these patterns arise?


Image representation using basis vectors

We can think of an image with n pixels as a vector of pixel values:

(x1, x2, …, xn) ∈ \mathbb{R}^n

We can also represent an image using a set of basis vectors.

For example, if we have n vectors which have a single 1 in different “coordinates”:

p1 = (1,0,0, …, 0)
p2 = (0,1,0, …, 0)
p3 = (0,0,1, …, 0)

pn = (0,0,0, …, 1)

then we can represent ANY image as a linear combination of these vectors, since they form a basis (there are n of them, and they are linearly independent; in fact, pairwise orthogonal):

ANY Image = ∑αipi

For what kind of images would this pixel basis be “good”?

One possible criterion is the number of non-zero coefficients (αi’s) used to represent an image: using a small number of basis vectors is better!

Then, these vectors are good for representing an image with few dots:

[Figure: a sparse image with a few dots]

But not good for representing a dense image:

[Figure: a dense image]
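In the pixel basis the coefficient αi is simply the value of pixel i, so the cost of an image is just its number of lit pixels. A tiny NumPy illustration (the example images here are made up):

```python
import numpy as np

# In the pixel basis, alpha_i is simply the value of pixel i, so the number
# of basis vectors needed equals the number of non-zero pixels.
sparse_img = np.zeros(64)
sparse_img[[3, 17, 42]] = 1.0        # an image with a few dots
dense_img = np.ones(64)              # every pixel is lit

assert np.count_nonzero(sparse_img) == 3     # cheap in the pixel basis
assert np.count_nonzero(dense_img) == 64     # needs all n basis vectors
```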

What is another possible set of basis vectors (“features”)?

Hadamard Basis:

Φ1 = (1,1,1,1,1,1, …, 1,1,1,1,1,1)
Φ2 = (1,1,1,1,1,1, …,-1,-1,-1,-1,-1,-1)
Φ3 = (1,1,1,-1,-1,-1, …, 1,1,1,-1,-1,-1)

Φn = (…)

Because these n vectors are pairwise orthogonal (Φi•Φj = 0 for all i ≠ j), they are linearly independent and can represent any image:

Rank(Φ1,Φ2,Φ3, …, Φn) = n
So, ∀X ∈ \mathbb{R}^n, ∃αi s.t. X = ∑αiΦi
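A quick sketch of these properties in NumPy (my own construction via the standard Sylvester recursion, whose row ordering may differ from the listing above, but the orthogonality and rank properties are the same):

```python
import numpy as np

def hadamard(n):
    """n x n Hadamard matrix (n a power of 2); rows are the basis vectors Phi_i.
    Built by the Sylvester recursion, so the row order may differ from the
    listing above, but the orthogonality properties are identical."""
    if n == 1:
        return np.array([[1]])
    h = hadamard(n // 2)
    return np.block([[h, h], [h, -h]])

H = hadamard(8)
assert np.array_equal(H @ H.T, 8 * np.eye(8))   # pairwise orthogonal rows
assert np.linalg.matrix_rank(H) == 8            # so they span R^8

# Any image x is a linear combination of the rows:
x = np.arange(8.0)
alpha = H @ x / 8            # coefficients: alpha_i = <x, Phi_i> / n
assert np.allclose(H.T @ alpha, x)
```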


So, there are many, many possible bases. Which are better and which should we pick?

It depends on shapes in images: circles, rings, or lines…

Many “natural” classes of shapes form bases, but some are not expressive enough.

For example, horizontal and vertical half-spaces in 2-d cannot represent a single pixel in an image (try it!), and therefore cannot represent all images. (There are only 2k such half-spaces, but we need k^2 basis vectors to cover the whole k by k image space.)

Exercise. Can we represent any single pixel using all half-spaces (at any angle) in 2-d (and hence all images)?


Sparse Coding

Olshausen-Field Paper: let’s use a basis that can represent images as sparsely as possible!

Problem: Set of images: Y1, Y2, …, Ym ∈ \mathbb{R}^n

Find a basis Φ = (Φ1, Φ2, …, Φn) and vectors X1, X2, X3, …, Xm such that Y^l ≈ ΦX^l AND every X^l is sparse.

We need to minimize:

\sum_{l=1}^m \lVert Y^{l}-\Phi X^{l} \rVert^2 + \lambda \sum_{l=1}^m S(X^l)

S(·) is a cost function that penalizes non-sparseness of a vector.
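The objective can be written as a short function. A sketch, where the matrix shapes and the default L1 cost are my own illustrative choices:

```python
import numpy as np

def sparse_coding_cost(Y, Phi, X, lam, S=np.abs):
    """E = sum_l ||Y^l - Phi X^l||^2 + lambda * sum_l S(X^l).

    Y:   (n, m) matrix, one image per column
    Phi: (n, k) matrix, one basis vector per column
    X:   (k, m) matrix, one code per column
    S is applied entrywise and summed (default: the L1 cost).
    """
    residual = Y - Phi @ X
    return np.sum(residual ** 2) + lam * np.sum(S(X))

# With perfect reconstruction, only the sparseness term remains:
cost = sparse_coding_cost(np.eye(3), np.eye(3), np.eye(3), lam=0.5)
assert cost == 1.5   # 0 + 0.5 * 3
```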


How do we measure “sparseness” of a vector?

The sparseness cost should be higher for a vector whose “activity” is more spread out.

One option is the L_0 norm: \lVert X^l \rVert_0 = the number of non-zero entries in the vector.

However, minimizing the L_0 norm is an NP-hard problem. It also doesn’t capture sparseness well, because it ignores the magnitude of each entry.

What about the L_1 norm? (sum of absolute values of all entries)

Example. Suppose we have three normalized vectors:
a: (1,0, …, 0)
b: (1/\sqrt{n}, 1/\sqrt{n}, ..., 1/\sqrt{n})
c: (1/\sqrt{2}, 1/\sqrt{4}, 1/\sqrt{8} ..., 1/2^{n/2})

Here are the sparseness costs of each vector under the L0 and L1 norms:

        L0 norm      L1 norm
  a     1            1
  b     n            \sqrt{n}
  c     n            O(1)

The L_0 norm gives the same sparseness cost for vectors b and c, even though c is sparser because the magnitudes of its entries approach 0.

Using L_1 norm correctly captures this fact (c is more sparse than b).
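The table can be checked numerically (a sketch; n = 16 is an arbitrary choice):

```python
import numpy as np

n = 16
a = np.zeros(n); a[0] = 1.0
b = np.full(n, 1 / np.sqrt(n))
c = 2.0 ** -(np.arange(1, n + 1) / 2)    # 1/sqrt(2), 1/2, 1/(2*sqrt(2)), ...

l0 = lambda v: np.count_nonzero(v)
l1 = lambda v: np.abs(v).sum()

assert l0(a) == 1 and l1(a) == 1
assert l0(b) == l0(c) == n          # L0 cannot tell b and c apart
assert np.isclose(l1(b), np.sqrt(n))
assert l1(c) < 2.5                  # O(1): bounded by 1/(sqrt(2)-1) for any n
```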

Other functions can also be used to measure sparsity; the paper presents two such alternatives: -e^{-x^2} and log(1+x^2).
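Both alternatives behave like the L1 norm in the sense that, for vectors of equal energy, concentrated activity costs less than spread-out activity. A quick check (the example vectors are my own):

```python
import numpy as np

# Two alternative sparseness costs from the paper, applied entrywise and summed.
s_exp = lambda x: np.sum(-np.exp(-x ** 2))
s_log = lambda x: np.sum(np.log1p(x ** 2))

concentrated = np.array([1.0, 0.0, 0.0, 0.0])   # same L2 energy...
spread = np.array([0.5, 0.5, 0.5, 0.5])         # ...but spread out

assert s_exp(concentrated) < s_exp(spread)      # lower cost for the sparse vector
assert s_log(concentrated) < s_log(spread)
```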


Algorithm (Gradient Descent): 

Minimize E = \sum_{l=1}^m \lVert Y^{l}-\Phi X^{l} \rVert^2 + \lambda \sum_{l=1}^m S(X^l)

  1. Initialize \Phi with random vectors.
  2. Compute X by finding local optimal to:
    \nabla_{X_j^l} E = -2 \sum\limits_{i} (Y_i^l - (\Phi X^l)_i) \Phi_{ij} + \lambda S^{'} (X_j^l) = -2 (Y^l \cdot \Phi^j - \sum\limits_{k}\sum\limits_{i} X_k^l \Phi_{ik} \Phi_{ij}) + \lambda S^{'} (X_j^l) = -2 (Y^l \cdot \Phi^j - \sum\limits_{k} X_k^l \, \Phi^k \cdot \Phi^j) + \lambda S^{'} (X_j^l)
    [formula (5) in the paper, where \Phi^j denotes the j-th basis vector]
  3. Compute gradient in \Phi and update \Phi :
    \nabla_{ \Phi^j } E = \sum\limits_{l} 2 (Y^l - \Phi X^l)(-X_j^l) = -2 \sum\limits_{l} X_j^l(Y^l - \Phi X^l)
    [formula (6) in the paper]
  4. Repeat until change in E is less than a chosen threshold.
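The steps above can be sketched as a minimal alternating gradient descent in NumPy. All hyperparameters, the toy data sizes, and the choice of the L1 sparseness cost are my own illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_code(Y, k, lam=0.1, steps=200, lr_x=0.01, lr_phi=0.01):
    """Minimal alternating gradient descent on
    E = sum_l ||Y^l - Phi X^l||^2 + lam * ||X^l||_1.
    Hyperparameters are illustrative, not from the paper."""
    n, m = Y.shape
    Phi = rng.standard_normal((n, k))
    Phi /= np.linalg.norm(Phi, axis=0)             # unit-norm basis vectors
    X = np.zeros((k, m))
    for _ in range(steps):
        R = Y - Phi @ X
        X += lr_x * (2 * Phi.T @ R - lam * np.sign(X))   # descend formula (5)
        R = Y - Phi @ X
        Phi += lr_phi * 2 * R @ X.T                      # descend formula (6)
        Phi /= np.linalg.norm(Phi, axis=0)               # re-normalize columns
    return Phi, X

Y = rng.standard_normal((16, 50))                  # toy "images"
Phi, X = sparse_code(Y, k=16)
assert np.sum((Y - Phi @ X) ** 2) < 0.5 * np.sum(Y ** 2)   # error decreased
```

The column re-normalization keeps the basis vectors at unit norm, which prevents the trivial solution of growing Φ while shrinking X to drive the sparseness term to zero.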

We note this algorithm is a heuristic: neither its running time nor the quality of the solution it finds has any rigorous guarantee in general. However, as we saw with the implementation, it produces very interesting results, and the paper is considered seminal in the field.


Author: Suk Hwan Hong

Georgia Tech
