Stability

01 Theory - Bottleneck distance

The bottleneck distance is a metric on barcode diagrams. The bars in one barcode are matched against the bars in the other. The distance between barcodes is the worst discrepancy between bars in the optimal matching.


Suppose we are given two barcode diagrams. These are two finite collections of intervals:

$$\mathbb{I}=\{I_a\}_{a\in A},\qquad \mathbb{J}=\{J_b\}_{b\in B}\qquad (A,B\ \text{finite}),\qquad I_a=[x_a,y_a),\quad J_b=[x_b,y_b)$$

Now define the $d_\infty$-distance between two intervals by:

$$d_\infty(I_a,J_b)=\max\bigl(|x_b-x_a|,\ |y_b-y_a|\bigr)$$

Also, define:

$$\lambda(I_a)=d_\infty(I_a,\varnothing)=\frac{|y_a-x_a|}{2}$$

This is the $d_\infty$-distance to the nearest point-interval $(x,x)$.

Now, suppose we are given subsets $A'\subseteq A$ and $B'\subseteq B$ along with a choice of bijection $\theta\colon A'\to B'$. Define the penalty of $\theta$ to be:

$$P_{\infty;A',B',\theta}=\max\Bigl(\ \max_{a\in A'} d_\infty(I_a,J_{\theta(a)}),\ \ \max_{a\in A\setminus A'}\lambda(I_a),\ \ \max_{b\in B\setminus B'}\lambda(J_b)\ \Bigr)$$

Bottleneck distance

The bottleneck distance from $\mathbb{I}$ to $\mathbb{J}$ is:

$$d_\infty(\mathbb{I},\mathbb{J})=\min_{\substack{A'\subseteq A,\ B'\subseteq B\\ \theta\colon A'\to B'}} P_{\infty;A',B',\theta}$$


Bars far from the diagonal are matched to the closest bars (using the $\ell^\infty$-norm, whose unit balls are squares). Bars near the diagonal can be matched to the closest diagonal point. The distance to the closest point on the diagonal is $\lambda(I_a)$.
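A brute-force sketch of this definition (my own illustration; it is exponential in the number of bars, so only suitable for tiny barcodes). Each barcode is augmented with "diagonal" slots, so that leaving a bar unmatched costs $\lambda$, and we minimize the worst cost over all bijections of the augmented sets:

```python
from itertools import permutations

def interval_dist(I, J):
    """d_inf between bars I = [x1, y1) and J = [x2, y2)."""
    return max(abs(J[0] - I[0]), abs(J[1] - I[1]))

def lam(I):
    """Distance from the bar I to the nearest point-interval (x, x)."""
    return (I[1] - I[0]) / 2

def bottleneck(bars1, bars2):
    """Brute-force bottleneck distance between two small barcodes."""
    A = list(bars1) + ['diag'] * len(bars2)   # 'diag' = unmatched slot
    B = list(bars2) + ['diag'] * len(bars1)
    def cost(a, b):
        if a == 'diag' and b == 'diag':
            return 0.0
        if a == 'diag':
            return lam(b)
        if b == 'diag':
            return lam(a)
        return interval_dist(a, b)
    return min(
        max(cost(A[i], B[pi[i]]) for i in range(len(A)))
        for pi in permutations(range(len(B)))
    )
```

For example, matching $[0,10)$ to itself and sending a short bar $[4,5)$ to the diagonal gives distance $0.5$.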


Stability Theorem: Bottleneck Distance

Let $\mathcal{P}$ and $\mathcal{Q}$ be two finite metric spaces. Fix $k\ge 0$. Let $B_{\mathcal{P}}$ be the barcode diagram for $H_k(\mathrm{VR}(\mathcal{P},r))$ and let $B_{\mathcal{Q}}$ be the barcode diagram for $H_k(\mathrm{VR}(\mathcal{Q},r))$. Then:

$$d_\infty(B_{\mathcal{P}},B_{\mathcal{Q}})\le d_{GH}(\mathcal{P},\mathcal{Q})$$

This theorem is a kind of “stability” result regarding barcode diagrams, since it says that when two data sets 𝒫 and 𝒬 are close together (measured using the ideal, if not computable, metric dGH), then the barcode diagrams are also close together. Barcode diagrams cannot jump apart without the data sets jumping apart.

For example, if 𝒫 is a “noisy” version of 𝒬, then we expect dGH(𝒫,𝒬) to be small, and therefore the barcode diagrams should be close. Conversely, if barcode diagrams are far, the data sets differ by “more than noise.”

02 Theory - Wasserstein metrics

The Wasserstein metrics generalize the bottleneck metric. Instead of taking the worst discrepancy between two bars in the best matching, they use a $p$-sum of the discrepancies of all bars in the best matching.

First define the $p$-penalty for a partial matching $\theta\colon A'\to B'$ as follows:

$$P_{p;A',B',\theta}=\Bigl(\ \sum_{a\in A'} d_\infty(I_a,J_{\theta(a)})^p+\sum_{a\in A\setminus A'}\lambda(I_a)^p+\sum_{b\in B\setminus B'}\lambda(J_b)^p\ \Bigr)^{1/p}$$

Then we minimize the penalty (find the best matching):

$$d_p(\mathbb{I},\mathbb{J})=\min_{\substack{A'\subseteq A,\ B'\subseteq B\\ \theta\colon A'\to B'}} P_{p;A',B',\theta}$$

Remark

We can recast the $p$-penalty in a more uniform symbolism by supplementing $A$ and $B$ with empty bars so that each has the same number of “bars,” and recalling the convention that $d_\infty(I_a,\varnothing)=\lambda(I_a)$. Then, with $\theta$ ranging over bijections of the augmented sets, we have:

$$d_p(\mathbb{I},\mathbb{J})=\min_{\theta\colon A\to B}\Bigl(\sum_{a\in A} d_\infty(I_a,J_{\theta(a)})^p\Bigr)^{1/p}$$
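Under this augmentation convention, a brute-force sketch (my own; exponential, for tiny barcodes only) differs from the bottleneck computation only by replacing the max with a $p$-sum:

```python
from itertools import permutations

def lam(I):
    # distance from the bar I = [x, y) to the nearest point-interval
    return (I[1] - I[0]) / 2

def wasserstein(bars1, bars2, p=2):
    """Brute-force d_p between two small barcodes."""
    A = list(bars1) + ['empty'] * len(bars2)   # 'empty' = unmatched slot
    B = list(bars2) + ['empty'] * len(bars1)
    def cost(a, b):
        if a == 'empty' and b == 'empty':
            return 0.0
        if a == 'empty':
            return lam(b)
        if b == 'empty':
            return lam(a)
        return max(abs(b[0] - a[0]), abs(b[1] - a[1]))
    return min(
        sum(cost(A[i], B[pi[i]]) ** p for i in range(len(A))) ** (1 / p)
        for pi in permutations(range(len(B)))
    )
```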

Multi-dimensional persistent homology

03 Theory - Several parameters

Thus far the persistent homology we have considered is derived from filtered spaces/complexes with a single filtration parameter r:

$$\mathrm{VR}(\mathcal{P},r),\qquad f^{-1}((-\infty,r))$$

It is possible to combine two types of filtration in a single structure having inclusion maps of each type.

Unfortunately there is no “structure theorem” for persistent homology with more than one parameter. This is basically because the structure theorem in one parameter rests on homology modules over the polynomial ring $\Bbbk[x]$, which is a PID and so admits a clean classification of finitely generated modules, while with $n$ parameters one obtains modules over $\Bbbk[x_1,\dots,x_n]$, which does not.

04 Illustration

Filtering also by density

Let us add a density filtration to $\mathrm{VR}(\mathcal{P},r)$. Fix $k\ge 0$ and define $\rho\colon\mathcal{P}\to\mathbb{R}$ by:

$$\rho(x_i)=\frac{1}{d_k(x_i)},\qquad d_k(x_i)=\text{distance to the }k\text{th nearest neighbor}$$

Now the points in 𝒫, and the VR-complex itself, can be filtered by a density minimum:

$$\mathrm{VR}(\mathcal{P},r,s)=\mathrm{VR}\bigl(\{x\in\mathcal{P}\mid \rho(x)\ge s\},\ r\bigr)$$

This bi-filtered complex determines a 2-parameter persistence module:

$$H_k(\mathrm{VR}(\mathcal{P},r,s))$$

For large $s$, only points very close to their $k$th nearest neighbor are included. As $s$ decreases, more points are brought in.
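A minimal sketch of $\rho$ and of the vertex set of $\mathrm{VR}(\mathcal{P},r,s)$ (the function names are my own):

```python
import math

def knn_density(points, k):
    """rho(x) = 1 / (distance from x to its k-th nearest neighbor)."""
    rho = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        rho.append(1.0 / dists[k - 1])
    return rho

def dense_points(points, k, s):
    """The vertex set {x in P : rho(x) >= s} of VR(P, r, s)."""
    return [p for p, r in zip(points, knn_density(points, k)) if r >= s]
```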


With data like this, the s filtration (for middling s values) removes points inside the disk because they are too sparse.

Filtering also by distance to a subspace

Suppose $\iota\colon\mathbb{R}^k\to\mathbb{R}^n$ is an embedded $k$-dimensional subspace of the background space $\mathbb{R}^n$ that contains the data $\mathcal{P}\subseteq\mathbb{R}^n$.

Let $f_\iota\colon\mathcal{P}\to\mathbb{R}$ measure the distance from data points to the image of $\iota$.

Define $\mathcal{P}_R=f_\iota^{-1}((-\infty,R))$. This gives a filtration in terms of the parameter $R$. This filtration could be useful if the complete data set $\mathcal{P}$ is too big to process all at once. The filtration allows one to process the data in layers.
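For a linear subspace, the layer $\mathcal{P}_R$ can be sketched as follows (names are my own; `basis` holds columns spanning the subspace):

```python
import numpy as np

def layer(points, basis, R):
    """Points of P within distance R of the subspace spanned by basis columns."""
    P = np.asarray(points, dtype=float)
    Q, _ = np.linalg.qr(np.asarray(basis, dtype=float))  # orthonormalize the basis
    resid = P - P @ Q @ Q.T              # component of each point normal to the subspace
    dist = np.linalg.norm(resid, axis=1) # f_iota evaluated on each data point
    return P[dist < R]
```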

These filtrations determine inclusions of VR complexes. For fixed $R$ and $r<s$, we have:

$$\mathrm{VR}(\mathcal{P}_R,r)\subseteq \mathrm{VR}(\mathcal{P}_R,s)$$

Upon applying homology, for $k\ge 0$, we have:

$$H_k(\mathrm{VR}(\mathcal{P}_R,r))\xrightarrow{\ L^R_{r,s}\ } H_k(\mathrm{VR}(\mathcal{P}_R,s))$$

And for fixed $r$ and $R<S$, we have:

$$\mathrm{VR}(\mathcal{P}_R,r)\subseteq \mathrm{VR}(\mathcal{P}_S,r)$$

Upon applying homology, we have:

$$H_k(\mathrm{VR}(\mathcal{P}_R,r))\xrightarrow{\ L^{R,S}_r\ } H_k(\mathrm{VR}(\mathcal{P}_S,r))$$

Finally, these induced maps should commute with each other:

$$L^{R,S}_s\circ L^R_{r,s}=L^S_{r,s}\circ L^{R,S}_r$$

05 Theory - Zig-zag persistent homology

Filtrations can be modified in another way by adding intermediate linking sets. Instead of the usual filtered space like this:

$$\mathcal{K}_{r_1}\subseteq \mathcal{K}_{r_2}\subseteq \mathcal{K}_{r_3}\subseteq\cdots$$

We have a “zig-zag” filtered space like this:

$$\mathcal{K}_1\hookrightarrow \mathcal{K}_{1,2}\hookleftarrow \mathcal{K}_2\hookrightarrow \mathcal{K}_{2,3}\hookleftarrow \mathcal{K}_3\hookrightarrow\cdots$$

In general, each arrow is allowed to go either direction. The induced map on homology will follow the same pattern.

06 Illustration

Morse layers

An application of zig-zag filtration arises by dividing the entire space into small layers for which the homology is easier to compute. For example, with a Morse function $f\colon\mathcal{K}\to\mathbb{R}_{\ge 0}$, we have this zig-zag filtration:

$$f^{-1}(0)\hookrightarrow f^{-1}([0,1])\hookleftarrow f^{-1}(1)\hookrightarrow f^{-1}([1,2])\hookleftarrow f^{-1}(2)\hookrightarrow\cdots$$


Repeated samples

Suppose some data is sampled repeatedly and data sets are generated in a sequence:

$$X_1,\ X_2,\ X_3,\ X_4,\ \dots$$

Then we can form a zig-zag filtration by:

$$X_1\hookrightarrow X_1\cup X_2\hookleftarrow X_2\hookrightarrow X_2\cup X_3\hookleftarrow X_3\hookrightarrow X_3\cup X_4\hookleftarrow X_4\hookrightarrow\cdots$$

Mapper algorithm

07 Theory - Clustering and Reeb graph

The mapper algorithm was introduced by Singh-Mémoli-Carlsson in 2007.

Recall the dendrogram (clustering) algorithm:


Given a fixed $\varepsilon>0$, two points $x,y$ are considered equivalent if there is a sequence of points

$$x=x_0,\ x_1,\ x_2,\ \dots,\ x_k=y$$

in which the distance between successive points is less than $\varepsilon$.

Now consider connected components (“clusters”) for each fixed $\varepsilon$. The clustering algorithm computes these, and the calculation is essentially the same as persistent homology $H_k$ for $k=0$.
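The $\varepsilon$-component count is easy to sketch with a union-find (my own helper; sweeping $\varepsilon$ recovers the dendrogram, i.e. the rank of $H_0$ at each scale):

```python
import math
from itertools import combinations

def num_clusters(points, eps):
    """Number of epsilon-connected components (rank of H_0 at scale eps)."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) < eps:
            parent[find(i)] = find(j)      # merge the two components
    return len({find(i) for i in range(len(points))})
```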

Reeb graph

Given a Morse function $f\colon M\to\mathbb{R}$ on a manifold $M$, the Reeb graph is formed by the equivalence relation:

Two points $x,y\in M$ are equivalent if:

  • $f(x)=f(y)$ AND
  • $x,y$ are in the same connected component of $f^{-1}(f(x))$


08 - Mapper algorithm

Let us be given:

  • A finite metric space $(X,d)$
    • I.e. a finite collection of points $\{x_1,x_2,\dots,x_k\}$ with distances between them; usually $X\subseteq\mathbb{R}^n$
  • A “filter function” $f\colon X\to\mathbb{R}^n$
  • A finite cover $\mathcal{U}=\{U_\alpha\}_{\alpha\in A}$ of $f(X)$ in $\mathbb{R}^n$
    • Typically a collection of boxes in $\mathbb{R}^n$

Proceed as follows:

(1) Consider the clustering of each $f^{-1}(U_\alpha)$:

  • Choose $\varepsilon>0$ and consider clusters at this scale $\varepsilon$.
  • Write $C_{\alpha,i}$ for the clusters, so $f^{-1}(U_\alpha)=\bigsqcup_i C_{\alpha,i}$

(2) Form a discrete version of the Reeb graph:

  • Vertices: the clusters $C_{\alpha,i}$
  • Edges: join $C_{\alpha,i}$ and $C_{\beta,j}$ when they overlap
  • Color the vertices according to the average value of $f$ on each cluster.
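A minimal end-to-end sketch for a 1-dimensional filter (all function names are my own; `cover` is a list of intervals `(lo, hi)`, and clustering is single-linkage at scale $\varepsilon$ via union-find):

```python
import math
from itertools import combinations

def clusters(points, eps):
    """Single-linkage clusters at scale eps (union-find)."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) < eps:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

def mapper(points, f, cover, eps):
    """Vertices: clusters of f^{-1}(U) for each cover interval U; edges on overlap."""
    vertices = []                      # pairs (cover index, frozenset of points)
    for a, (lo, hi) in enumerate(cover):
        preimage = [p for p in points if lo <= f(p) <= hi]
        for cl in clusters(preimage, eps):
            vertices.append((a, frozenset(cl)))
    edges = [(u, v) for u, v in combinations(range(len(vertices)), 2)
             if vertices[u][1] & vertices[v][1]]    # shared points = overlap
    colors = [sum(f(p) for p in pts) / len(pts) for _, pts in vertices]
    return vertices, edges, colors
```

Overlapping cover intervals are what create edges: a point lying in two intervals belongs to a cluster on each side.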


09 Illustration

Simple cover

Suppose $f\colon X\to\mathbb{R}$ and suppose the cover $\mathcal{U}$ is just the single interval $[-L,L]$ containing the entire image $f(X)$.

In this case Mapper produces a graph with no edges. Its vertices correspond to the clusters of $X=f^{-1}([-L,L])$.

2-Component cover

Suppose $f\colon X\to\mathbb{R}$ and suppose the cover is the pair $U_\alpha=[0,5]$ and $U_\beta=[6,10]$. Mind the gap $(5,6)$.

Vertices in the Mapper graph are $\varepsilon$-clusters of $f^{-1}(U_\alpha)$ and of $f^{-1}(U_\beta)$. These clusters must all be disjoint, so there are no edges between the vertices.

If, however, the cover had been chosen to be $U_\alpha=[0,5]$ and $U_\beta=[4,10]$, there could be edges between clusters in $f^{-1}(U_\alpha)$ and clusters in $f^{-1}(U_\beta)$.

Consider data points sampled around the unit circle. Let $f\colon\mathbb{R}^2\to\mathbb{R}$ by $f(x,y)=x$. Suppose the image $f(X)$ is covered by small overlapping intervals of equal length. Mapper recovers the topology of the circle.


If we had used a simple cover (single interval), the Mapper graph would have no edges and thus would not express the circle’s topology.


(Rabadan-Blumberg 2020)


(Nikolau-Levine 2011)

Dimensionality reduction and manifold learning

These techniques work well if the given data happens to lie near a lower-dimensional vector subspace or submanifold of the original space.

Foundational algorithms are

  • PCA “principal component analysis” — vector subspaces
    • starts with $X\subseteq\mathbb{R}^n$, a finite metric space (point-cloud data)
  • MDS “multidimensional scaling” — manifold subspaces
    • starts with an arbitrary finite metric space $(X,d)$

Goal:

Given (small) $k\ge 0$, find an optimal map $\theta\colon X\to\mathbb{R}^k$ which in some sense represents the structure of the data by minimizing the error:

$$\mathcal{E}=\sum_{x_i,x_j\in X}\bigl(d_X(x_i,x_j)-d_{\mathbb{R}^k}(\theta(x_i),\theta(x_j))\bigr)^2$$

10 Theory - PCA

Let $M$ be an $n\times n$ matrix with eigenvalues $\lambda_i$ and eigenvectors $\mathbf{v}_i$, so $M\mathbf{v}_i=\lambda_i\mathbf{v}_i$.

Recall that if $M$ is symmetric, then $\lambda_i\in\mathbb{R}$, and

$$\lambda_i\ne\lambda_j\ \Longrightarrow\ \mathbf{v}_i\perp\mathbf{v}_j$$

For PCA, we start with $X=\{\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_m\}\subseteq\mathbb{R}^n$. Fix $k\ge 0$ small.

The goal is to find an “optimal” linear map θ:nk which distorts distances between data points as little as possible. (Minimal mean-square error.)

PCA Algorithm: (1) Normalization

Consider the mean $\mu=\frac{1}{m}\sum_{i=1}^{m}\mathbf{x}_i$ and let $\tilde{\mathbf{x}}_i=\mathbf{x}_i-\mu$.

(2) Covariance matrix

Define the covariance matrix C:

$$C=\frac{1}{m}\sum_{i=1}^{m}\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^{\mathsf T}$$

Note that $C_{jk}$ is the covariance of the $j$th and $k$th coordinates of the data. Since each summand $\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^{\mathsf T}$ is symmetric, $C$ must be symmetric.

(3) Eigenvalues

Compute the top (largest) $k$ eigenvalues of $C$.

If these are pairwise distinct, the corresponding eigenvectors are orthogonal, hence linearly independent, so they span a $k$-dimensional linear subspace $V$.

Consider the orthogonal projection $\theta\colon\mathbb{R}^n\to V$.

(Note: one must identify $V$ with $\mathbb{R}^k$ first, say via an orthonormal basis of eigenvectors.)

If some eigenvalue $\lambda_j$ has multiplicity, then consider the corresponding eigenspace $V_j$.
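Steps (1)–(3) in a NumPy sketch (my own; `numpy.linalg.eigh` is used since $C$ is symmetric, and the data is returned in eigenvector coordinates, which identifies $V$ with $\mathbb{R}^k$):

```python
import numpy as np

def pca(X, k):
    """Coordinates of the data in the span of the top-k eigenvectors of C."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # (1) normalization: subtract the mean
    C = Xc.T @ Xc / len(X)                    # (2) covariance matrix
    vals, vecs = np.linalg.eigh(C)            # (3) eigenpairs (ascending eigenvalues)
    V = vecs[:, np.argsort(vals)[::-1][:k]]   # top-k eigenvectors as columns
    return Xc @ V                             # orthogonal projection, in V-coordinates
```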

Theorem:

The outcome of this algorithm is an optimal $\theta$ which minimizes the error $\mathcal{E}$.

(This holds in the context of both PCA and MDS.)

Combining techniques

It is often reasonable to combine techniques discussed thus far.

For example, the outcome of PCA/MDS may be fed into Mapper; a coordinate projection of the PCA output may be used as the filter function.

11 Theory - MDS

Multi-dimensional scaling starts with an arbitrary finite metric space $(X,d)$, and attempts to embed it in $\mathbb{R}^k$ in a way that is optimal in the sense of minimizing $\mathcal{E}$. This is done as follows:

(1) Metric matrix:

$$D_{ij}=d_X(\mathbf{x}_i,\mathbf{x}_j)$$

(2) $H$ and $Z$

Let $H$ be the matrix $H=I-\frac{1}{m}\mathbf{1}$, where $\mathbf{1}$ is the $m\times m$ matrix with all entries $1$.

Let $Z$ be the matrix $Z=-\frac{1}{2}HD^{(2)}H$, where $D^{(2)}_{ij}=D_{ij}^2$ is the matrix of squared distances.

(This step centers the result at the origin.)

(3) Define $\theta\colon X\to\mathbb{R}^k$:

Compute the top $k$ eigenvalues $\lambda_j$ and eigenvectors $\mathbf{v}_j$ of $Z$. Create a matrix $A$ whose columns are these eigenvectors, normalized so that $\|\mathbf{v}_j\|^2=\lambda_j$. Then:

$$\theta(x_i)=\text{the $i$th row of }A$$
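A NumPy sketch of the procedure (my own; tiny negative eigenvalues, which occur when the metric is not exactly Euclidean, are clipped to zero before taking square roots):

```python
import numpy as np

def classical_mds(D, k):
    """Embed a finite metric space, given its distance matrix D, into R^k."""
    D = np.asarray(D, dtype=float)
    m = len(D)
    H = np.eye(m) - np.ones((m, m)) / m         # centering matrix
    Z = -0.5 * H @ (D ** 2) @ H                 # double-centered squared distances
    vals, vecs = np.linalg.eigh(Z)
    idx = np.argsort(vals)[::-1][:k]            # top-k eigenpairs
    scale = np.sqrt(np.maximum(vals[idx], 0))   # normalize ||v_j||^2 = lambda_j
    return vecs[:, idx] * scale                 # row i is theta(x_i)
```

When the input distances come from points in Euclidean space, classical MDS recovers them exactly (up to a rigid motion).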

12 Theory - Isomap manifold learning

The given data is aligned along a non-linear subspace of $\mathbb{R}^n$.

This method is used in facial recognition, for example.

We are assuming that the data lies along (very near) a submanifold $M\subseteq\mathbb{R}^n$. This $M$ is assumed to have an “intrinsic” metric: for $x,y\in M$,

$$d_M(x,y)=\min_{\substack{\text{curves }\alpha\text{ in }M\\ \alpha(0)=x,\ \alpha(1)=y}}\int_0^1\|\dot\alpha(t)\|\,dt$$


The idea of isomap is that, assuming $M\subseteq\mathbb{R}^n$ is smoothly embedded, there is a scale at which

$$d_M(x,y)\approx d_{\mathbb{R}^n}(x,y)$$


Isomap description in detail:

Let us be given $X=\{\mathbf{x}_1,\dots,\mathbf{x}_m\}\subseteq\mathbb{R}^n$, as well as a scale $\varepsilon>0$ and a target dimension $k$.

Isomap procedure

(1) Build a weighted graph $G$:

  • Vertices: the points $\mathbf{x}_1,\dots,\mathbf{x}_m$
  • Edges: join $\mathbf{x}_i$ to $\mathbf{x}_j$ when $d(\mathbf{x}_i,\mathbf{x}_j)<\varepsilon$
  • Weights: the distances $w_{ij}=d(\mathbf{x}_i,\mathbf{x}_j)$


(2) New metric space $(X,d_X)$:

Same points, new distances:

$$d_X(\mathbf{x}_i,\mathbf{x}_j)=\min_{\text{paths from }\mathbf{x}_i\text{ to }\mathbf{x}_j\text{ in }G}(\text{total path weight})$$

(Use Dijkstra’s Algorithm to compute the minimal paths.)

(3) Apply PCA/MDS:

Find an optimal map $\theta\colon X\to\mathbb{R}^k$.
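The three steps in a NumPy sketch (my own; for brevity the shortest-path metric uses Floyd–Warshall rather than Dijkstra, and step (3) inlines classical MDS):

```python
import math
import numpy as np

def isomap(points, eps, k):
    """Graph geodesic distances at scale eps, then classical MDS into R^k."""
    m = len(points)
    # (1) weighted eps-neighborhood graph (inf = no edge)
    D = np.full((m, m), np.inf)
    np.fill_diagonal(D, 0.0)
    for i in range(m):
        for j in range(i + 1, m):
            d = math.dist(points[i], points[j])
            if d < eps:
                D[i, j] = D[j, i] = d
    # (2) shortest-path metric (Floyd-Warshall; assumes the graph is connected)
    for via in range(m):
        D = np.minimum(D, D[:, [via]] + D[[via], :])
    # (3) classical MDS on the geodesic distance matrix
    H = np.eye(m) - np.ones((m, m)) / m
    Z = -0.5 * H @ (D ** 2) @ H
    vals, vecs = np.linalg.eigh(Z)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))
```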

Isomap - single chart

Isomap is most effective when the manifold can be covered by a single chart and the algorithm effectively “flattens” and “unrolls” the space.

It is less effective for manifolds with topology. For example, when applied to $\mathbb{S}^2\subseteq\mathbb{R}^3$, the output of isomap is a flat disk.


(Rabadan-Blumberg 2020)


13 Theory - Local Linear Embedding (LLE)

LLE is a variation of isomap manifold learning that is “locally linear.” It does not attempt to preserve the metric globally. The algorithm is a little more efficient than isomap as well.

LLE Procedure

Suppose we are given the data set $\{\mathbf{x}_1,\dots,\mathbf{x}_m\}\subseteq\mathbb{R}^n$, the target dimension $k$, and a “neighborhood size” $K$.

(1) Find minimizer weights:

For each $\mathbf{x}_i$, find weights $w_{ij}$ that minimize the error:

$$\mathcal{E}=\sum_i\Bigl\|\mathbf{x}_i-\sum_j w_{ij}\mathbf{x}_j\Bigr\|^2$$

subject to the constraints:

$$w_{ij}=0\ \text{ if }\mathbf{x}_i,\mathbf{x}_j\text{ are NOT }K\text{-neighbors},\qquad \sum_j w_{ij}=1$$


The weights express each point approximately as an affine combination of its neighbors, using the plane that best fits those neighbors. If the manifold is actually linear around a point, the error at that point will be zero.
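For each point, step (1) reduces to a small linear system on the local Gram matrix. A NumPy sketch (my own; `reg` is a standard regularization for when the Gram matrix is singular, e.g. when $K$ exceeds the local dimension, and the neighbor indices are passed in explicitly):

```python
import numpy as np

def lle_weights(X, i, neighbors, reg=1e-3):
    """Weights w reconstructing X[i] from X[neighbors], with sum(w) = 1."""
    Z = X[neighbors] - X[i]              # neighbors shifted so X[i] is the origin
    G = Z @ Z.T                          # local Gram matrix
    G = G + reg * np.trace(G) * np.eye(len(neighbors))  # regularize
    w = np.linalg.solve(G, np.ones(len(neighbors)))
    return w / w.sum()                   # enforce the sum-to-one constraint
```

For a point that is exactly the midpoint of its two neighbors, the weights come out as $(0.5, 0.5)$.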

(2) Use minimizer weights to recover points:

Find points $\{y_1,\dots,y_m\}\subseteq\mathbb{R}^k$ that minimize the error:

$$\mathcal{E}=\sum_i\Bigl\|y_i-\sum_j w_{ij}y_j\Bigr\|^2$$

The new embedding is $\theta\colon x_i\mapsto y_i$.


Locality

The Isomap and LLE constructions are local, in the sense that the graph edges (Isomap) and the reconstruction weights (LLE) at a point $x$ are determined only by the data near $x$; data points far from $x$ do not enter these local computations. (The final embedding step is still a global computation in each case.)