Property Testing of Data Dimensionality: ICSI and UC Berkeley

Property Testing of Data Dimensionality
Robert Krauthgamer, ICSI and UC Berkeley
Joint work with Ori Sasson (Hebrew U.)
Data dimensionality
The analysis of large volumes of complex data is required in many disciplines.
Such data is frequently represented by vectors in a high-dimensional vector space.
A common method of representing data is feature extraction (vector representation in feature space).
E.g.:
  Sequential biological data (genome, proteins)
  Image databases
  Text corpora (via latent semantic indexing)

The issue of dimension
High-dimensional data is difficult to work with.
  The complexity of many operations depends heavily (e.g. exponentially) on the dimension.
Real-life data often adheres to a low-dimensional structure,
  which allows one to effectively reduce the dimension. E.g. in R^2: [figure omitted]
Dimensionality reduction: mapping into a low-dimensional space (while preserving most of the data structure).
  Trades accuracy for computational efficiency.

Dimensionality reduction methods
Linear structure
  Singular Value Decomposition (SVD), i.e., low-rank matrix approximation.
  Practical variants: Multidimensional Scaling (MDS), Principal Component Analysis (PCA).
Metric structure
  Low-distortion embedding in low-dimensional l_p:
    Of any Euclidean metric [Johnson-Lindenstrauss86].
    Of any metric [Bourgain86, Linial-London-Rabinovich93].
Other methods, e.g. combinatorial feature selection [Charikar-Guruswami-Kumar-Rajagopalan-Sahai00].

Property testing framework

Relaxed decision problems: determine whether
  the input has a property P, or
  the input is far from having the property P, i.e. it needs to be modified significantly in order to have the property.
Goal: obtain randomized algorithms (correct with probability at least 2/3) whose complexity is low (does not depend on the input size).
Trivial example: testing whether an input list contains only 0s, or at least an ε-fraction of its entries are not 0, takes O(1/ε) queries.
Testing data dimensionality

Given a data set S, determine whether
  S has dimension at most a (fixed) d, or
  S is ε-far from having this property, i.e. at least an ε-fraction of the entries of (a representation of) S need to be modified for S to have the property.
Technicalities:
  Interpretation of dimension (i.e. type of structure).
  Representation of S.
    Assume it affects both the query mechanism and the farness measure.

Our results: testing for linear structure
Algorithm for testing whether vectors v_1,...,v_n lie in a linear (or affine) subspace of dimension at most d.
  The algorithm queries O(d/ε) vectors.
  Holds for every vector space V.
Algorithm for testing whether a matrix A_{m×n} has rank at most d.
  The algorithm queries the entries of an O(d/ε) × O(d/ε) submatrix.
  Holds for matrices over any field F.
(Both algorithms have one-sided error.)

Our results: testing for metric structure
Testing whether v_1,...,v_n ∈ l_2^m can be embedded into l_2^d:
  Isometrically: achieved by querying O(d/ε) vectors (corollary).
  With distortion D < 1/ε: requires querying Ω((n/D)^{1/2}) vectors.
  With perturbation δ > 0: requires Ω(min{n^{1/2}, m/log m}) queries.
Testing whether vectors v_1,...,v_n ∈ l_1^m can be embedded isometrically into l_1^d requires querying Ω(n^{1/4}) vectors.
(Lower bounds are for algorithms with two-sided error.)
Our results: testing metrics and norm
Algorithm for testing whether a matrix M_{n×n} is the distance matrix of a d-dimensional Euclidean metric.
  The algorithm queries the entries of an O(d/ε) × O(d/ε) submatrix.
  Slight improvement over O((d log d)/ε) × O((d log d)/ε) of [Parnas-Ron01].
Algorithm for testing whether a vector has l_p-norm at most D.
  The algorithm queries O(ε^{-3} log(1/ε)) entries (with two-sided error).
  Holds for any p and D.
  Allows testing the Frobenius norm of a matrix (such as the difference between a matrix and its low-rank approximation).

Property testing origins

Introduced by [Rubinfeld-Sudan96]
  Testing algebraic properties of functions, e.g. low-degree polynomials, Hadamard code, long code.
  Many PCPs involve testing of encodings.
Testing of combinatorial properties initiated by [Goldreich-Goldwasser-Ron98]
  They focused on graph properties (e.g. coloring).
  Later works considered testing monotonicity of functions, satisfiability of formulas, regularity of languages, equality of distributions, clustering of Euclidean vectors, metric spaces, etc.
Related work
Property testing
  Testing whether a distance matrix represents a tree metric, an ultrametric, or a low-dimensional Euclidean metric [Parnas-Ron01].
  Testing properties of Euclidean vectors, e.g. clustering [Alon-Dar-Parnas-Ron00] and convexity [Czumaj-Sohler-Ziegler00].
  Testing various matrix properties, e.g. monotonicity [Newman-Fischer01].
Fast low-rank approximation (by sampling) [Frieze-Kannan-Vempala98, Achlioptas-McSherry01]
  Farness measure considers the magnitude of the changes.
  Sampling depends on input size (unless input is uniform).

Other related work
Finite point criterion for l_p^d embeddability.
  Namely, the minimum f_p(d) such that (any) metric space embeds in l_p^d iff every f_p(d) of its points do.
  For p = 2, [Menger28] showed f_2(d) = d+3.
  For p = 1 and any d > 2, [Bandelt-Chepoi-Laurent98] showed f_1(d) ≥ d^2 - 1, but it is not known whether f_1(d) is finite.
Our results for l_1 and l_2 spaces establish somewhat similar bounds for a relaxed version of this question.
Algorithm for testing linear structure

Thm 1. Testing whether a set of vectors S lies in a subspace of dimension at most d can be achieved with O(d/ε) queries.
The algorithm.
  1. Query O(d/ε) vectors of S uniformly at random.
  2. Accept if (and only if) the queried vectors lie in a linear (or affine) subspace of dimension at most d.
Proof of testing linear structure

Proof (correctness).
  The algorithm always accepts a data set S of dimension at most d.
  Let S be ε-far from having dimension at most d. Consider sampling the O(d/ε) vectors one by one.
  Let X_t be the dimension of the subspace spanned by the first t sampled vectors.
Lemma 1. Pr[X_{t+1} = X_t + 1 | X_t ≤ d] ≥ ε.
Proof. Since S is ε-far from having dimension at most d, the subspace spanned by the first t sampled vectors (when its dimension is at most d) contains less than a (1-ε)-fraction of the vectors of S. Hence the next sampled vector falls outside this subspace, and raises the dimension by one, with probability at least ε.
A technical lemma
Lemma 2. Let 0 ≤ X_0 ≤ X_1 ≤ X_2 ≤ ... be random variables. If Pr[X_{t+1} = X_t + 1 | X_t ≤ d] ≥ ε for all t ≥ 0, then for t* = 8d/ε we have Pr[X_{t*} ≤ d] < 1/3.
Proof sketch. X_t behaves like (dominates) a binomial variable as long as X_t ≤ d. Then E[X_{t*}] ≥ 8d, and using Chernoff, Pr[X_{t*} ≤ d] < 1/3.
So with probability at least 2/3 we have X_{t*} > d and the algorithm rejects (for S that is ε-far from dimension d). This completes the proof of Thm 1.
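The Chernoff step can be spelled out under a simplifying assumption: compare X_{t*} with a binomial variable B ~ Bin(t*, ε). The symbol B and the choice of the multiplicative lower-tail bound Pr[B ≤ (1-γ)μ] ≤ exp(-γ²μ/2) are introduced here for illustration and are not taken from the slides.

```latex
\mu = \mathbb{E}[B] = t^{*}\varepsilon = 8d, \qquad
\Pr[B \le d]
  = \Pr\!\left[B \le \left(1 - \tfrac{7}{8}\right)\mu\right]
  \le \exp\!\left(-\frac{(7/8)^{2}\,\mu}{2}\right)
  = \exp\!\left(-\frac{49\,d}{16}\right)
  \le e^{-49/16} < \frac{1}{3}
  \qquad (d \ge 1).
```

With these constants the bound is far below 1/3, so the constant 8 in t* leaves considerable slack.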
A similar approach allows testing whether a matrix has low rank, and whether a distance matrix is that of a low-dimensional Euclidean metric (a slight improvement over [Parnas-Ron01]).
Lower bound for l1

Thm 2. Testing whether n vectors in l_1^m can be embedded isometrically into l_1^d requires querying Ω(n^{1/4}) vectors.
Consider first algorithms with one-sided error. Suppose d=1, m=2. Consider the following point set S: [figure omitted]
S is 1/24-far from l_1^d-embeddability because every [configuration shown in the omitted figure] cannot be embedded in the line.

Lower bound for l1 with one-sided error

Assume there is an algorithm that queries t << n^{1/2} points. WLOG it sees a random sample of S.
With high probability 1 - O(t^2/n) = 1 - o(1):
  The sample contains no two points at distance O(1) from each other.
  Then the sample is l_1^d-embeddable (since there is a geodesic line going through all its points).
  And so the algorithm must accept S.
Contradiction (since S is 1/24-far).

Lower bound for l1 with two-sided error
We (randomly) create from S another data set S' such that
  S' embeds in the line (WHP 1-o(1)).
  The algorithm's view of S' differs from its view of S with probability o(1),
  so the probabilities of accepting S and S' differ by o(1) << 1/3. Contradiction.
[figure omitted: these inputs look the same]
Here (to prove Thm 2):
  Create S' by choosing r << n^{1/2} random points from S and duplicating each one n/r times.
  Then a sample of << r^{1/2} points from S and from S' is almost the same.
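The duplication step and the "views look the same" claim can be explored with a small simulation. This is a sketch under assumptions: the points of S are stood in for by distinct integers, all function names are hypothetical, and "the view differs" is modelled simply as the sample containing the same point twice.

```python
import random

def duplicate_instance(S, r, rng):
    """Build S': choose r random points of S, repeat each one len(S)//r times."""
    chosen = rng.sample(S, r)
    return [p for p in chosen for _ in range(len(S) // r)]

def has_repeat(points, t, rng):
    """Draw t uniform positions and report whether some point appears twice."""
    sample = [points[rng.randrange(len(points))] for _ in range(t)]
    return len(set(sample)) < len(sample)

def repeat_rate(points, t, trials, rng):
    """Empirical probability that a t-point sample contains a repeated point."""
    return sum(has_repeat(points, t, rng) for _ in range(trials)) / trials
```

For example, with `S = list(range(10000))` and `r = 100`, comparing `repeat_rate(S, t, ...)` and `repeat_rate(duplicate_instance(S, r, rng), t, ...)` for small t shows how rarely a short sample can tell the two inputs apart; a birthday-style estimate suggests the rate on S' grows roughly like t^2/(2r).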
Lower bound for l2 with perturbation

Thm 3. Testing whether n vectors in l_2^m can be perturbed by δ to be l_2^d-embeddable requires Ω(min{n^{1/2}, m/log m}) queries.
Let d=0 (i.e. testing if the vectors are in a ball of radius δ).
  Consider a sphere of radius δ' = δ(1 + 1/(2n)) in l_2^m.
  Let S consist of n random vectors from this sphere.
  Let S' consist of n/2 random vectors from the sphere and their n/2 antipodal vectors (-v).
YES: WHP, the vectors of S are in a ball of radius δ.
  By concentration of measure, WHP they are nearly orthogonal, e.g. the distance between every two is roughly δ'√2.
  In fact, WHP they are all at distance < δ from their center of mass, as claimed.
NO: S' is 1/2-far from being in a ball of radius δ,
  because the distance between antipodal vectors in S' is 2δ' > 2δ.
Assume the algorithm queries << n^{1/2} vectors.
  WHP its view of S and of S' is the same.
  So the probabilities of accepting S and S' should differ by o(1). Contradiction.
This proves Thm 3.
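The two inputs of this construction can be generated as below. This is a rough sketch with hypothetical function names; it samples the sphere by normalizing Gaussian vectors, and it only illustrates near-orthogonality and the antipodal distance. It does not attempt to verify the center-of-mass claim, which needs m to be large relative to n.

```python
import math
import random

def random_sphere_vector(m, radius, rng):
    """Uniform random point on the sphere of the given radius in R^m."""
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in g))
    return [radius * x / norm for x in g]

def distance(u, v):
    """Euclidean (l_2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_instances(n, m, delta, rng):
    """Return (delta', S, S') as described in the construction."""
    radius = delta * (1 + 1 / (2 * n))                  # delta' = delta(1 + 1/(2n))
    S = [random_sphere_vector(m, radius, rng) for _ in range(n)]
    half = [random_sphere_vector(m, radius, rng) for _ in range(n // 2)]
    S_prime = half + [[-x for x in v] for v in half]    # add antipodal vectors
    return radius, S, S_prime
```

In S' the vector at index i (for i < n/2) and the one at index i + n/2 are antipodal, so their distance is exactly 2δ', while pairwise distances within S should come out close to δ'√2 when m is large.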
Lower bound for l2 with distortion

Thm 4. Testing whether n vectors in l_2^m can be embedded in l_2^d with distortion D < 1/ε requires Ω((n/D)^{1/2}) queries.
Let d=1 (embedding into a line with distortion D).
  Consider a unit circle with 10D equally spaced points.
  Let S consist of the points from n/(10D) (far apart) parallel copies of this circle in R^3.
NO: S is 1/(10D)-far from having an embedding with distortion D,
  since embedding each circle into the line requires distortion > D.
Lower bound for one-sided error
Assume the algorithm queries << (n/D)^{1/2} points of S.
  WLOG it sees a random sample of S.
  WHP, this sample contains at most one point from each circle,
  and then it can be embedded with distortion < D into the line (by mapping each point to its circle's center).
  So WHP the algorithm must accept S. Contradiction.
Lower bound for two-sided error
We create S' by choosing one point from each circle of S and duplicating it 10D times.
  Then S' can be embedded with distortion < D into the line.
  WHP the view of << (n/D)^{1/2} points from S' is the same as from S.
  So the probabilities of accepting S and S' should differ by o(1).
This proves Thm 4.
Future research
Testing whether
  A matrix's spectral norm ||A||_2 is small.
  A distance matrix represents a metric (triangle inequality).
  A distance matrix represents an l_1^d metric.
  A distance matrix represents an approximate l_2^d metric.
Testing with a farness measure that depends on magnitude, à la [Frieze-Kannan-Vempala98, Achlioptas-McSherry01].
