Property Testing of Data Dimensionality: ICSI and UC Berkeley
Robert Krauthgamer (ICSI and UC Berkeley). Joint work with Ori Sasson (Hebrew U.)
Data dimensionality
The analysis of large volumes of complex data is required in many disciplines. Such data is frequently represented by vectors in a high-dimensional vector space.
E.g., sequential biological data (genome, proteins). A common method of representing data is feature extraction (vector representation in feature space).
The complexity of many operations depends heavily (e.g. exponentially) on the dimension. Data often has additional structure, which allows one to effectively reduce the dimension. E.g. in R²:
Dimensionality Reduction: Mapping into low-dimensional space (while preserving most of the data structure)
Linear Structure
I.e., low-rank matrix approximation. Practical variants: Multidimensional Scaling (MDS), Principal Component Analysis (PCA)
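As a concrete illustration of the low-rank view of dimensionality reduction, here is a minimal sketch (assuming NumPy; the function name `low_rank_approx` is ours, not from the talk). By the Eckart-Young theorem, truncating the SVD gives the best rank-k approximation in Frobenius norm, which is the linear-algebra core of PCA/MDS-style reduction.

```python
import numpy as np

def low_rank_approx(X, k):
    """Best rank-k approximation of X in Frobenius norm, via SVD
    (the linear-algebra core of PCA/MDS-style dimensionality reduction)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```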
Trivial example: testing whether an input list contains only 0s or an ε-fraction of its entries are nonzero, using O(1/ε) queries.
Testing Data Dimensionality 5
i.e. at least an ε-fraction of the entries of (a representation of) S needs to be modified for S to have the property.
Algorithm for testing whether vectors v1, …, vn lie in a linear (or affine) subspace of dimension d.
The algorithm queries the entries of an O(d/ε) × O(d/ε) submatrix. Holds for matrices over any field F.
Isometrically: achieved by querying O(d/ε) vectors (corollary). With distortion D < 1/ε: requires querying Ω((n/D)^{1/2}) vectors. With perturbation δ > 0: requires Ω(min{n^{1/2}, m/log m}) queries.
Testing whether vectors v1, …, vn ∈ l1^m can be embedded isometrically into l1^d requires querying Ω(n^{1/4}) vectors.
(Lower bounds are for algorithms with two-sided error.)
Algorithm for testing whether an n×n matrix M is the distances matrix of a d-dimensional Euclidean metric.
The algorithm queries the entries of an O(d/ε) × O(d/ε) submatrix, a slight improvement over the O((d log d)/ε) × O((d log d)/ε) of [Parnas-Ron01]. Another algorithm queries O(ε⁻³ log(1/ε)) entries (with two-sided error); it holds for any p and D, and allows testing the Frobenius norm of a matrix (such as the difference between a matrix and its low-rank approximation).
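The exact check underlying this property is classical: M is the distance matrix of n points in R^d iff double-centering the squared distances yields a positive semidefinite Gram matrix of rank ≤ d (Schoenberg's criterion / classical MDS). A sketch, assuming NumPy; a property tester would apply this check to small random submatrices rather than the full matrix.

```python
import numpy as np

def is_euclidean_dim(M, d, tol=1e-8):
    """Exact check (on the full matrix) that M is the distance matrix
    of n points in R^d, via classical MDS: double-center the squared
    distances and test that the Gram matrix is PSD of rank <= d."""
    n = M.shape[0]
    sq = M ** 2
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = -0.5 * J @ sq @ J                 # Gram matrix of centered points
    w = np.linalg.eigvalsh(G)
    psd = w[0] >= -tol                    # no significant negative eigenvalue
    rank = int(np.sum(w > tol))
    return bool(psd) and rank <= d
```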
Introduced by [Rubinfeld-Sudan96]
Testing algebraic properties of functions, e.g. low-degree polynomials, Hadamard code, long code.
Subsequent work focused on graph properties (e.g. coloring). Later works considered testing monotonicity of functions, satisfiability of formulas, regularity of languages, equality of distributions, clustering of Euclidean vectors, metric spaces, etc.
Related work
Property testing
Testing whether a distances matrix represents a tree metric, an ultrametric, or a low-dimensional Euclidean metric [Parnas-Ron01]. Testing properties of Euclidean vectors, e.g. clustering [Alon-Dar-Parnas-Ron00] and convexity [Czumaj-Sohler-Ziegler00]. Testing various matrix properties, e.g. monotonicity [Newman-Fischer01]. Low-rank matrix approximation by sampling [Frieze-Kannan-Vempala98, Achlioptas-McSherry01]; there, the farness measure considers the magnitude of the changes, and the sampling depends on the input size (unless the input is uniform).
Namely, the minimum fp(d) such that (any) metric space embeds in lp^d iff every fp(d) of its points do. For p = 2, [Menger28] showed f2(d) = d+3. For p = 1 and any d > 2, [Bandelt-Chepoi-Laurent98] showed f1(d) ≥ d²−1, but it is not known whether f1(d) is finite.
Our results for l1 and l2 spaces establish somewhat similar bounds for a relaxed version of this question.
A technical lemma
Lemma 2. Let 0 ≤ X0 ≤ X1 ≤ X2 ≤ ... be random variables. If Pr[Xt+1 = Xt + 1 | Xt ≤ d] ≥ ε for all t ≥ 0, then for t* = 8d/ε we have Pr[Xt* ≤ d] < 1/3. Proof sketch. Xt stochastically dominates a Binomial(t, ε) variable as long as Xt ≤ d. Then E[Xt*] ≥ 8d, and using a Chernoff bound, Pr[Xt* ≤ d] < 1/3. So with probability ≥ 2/3 we have Xt* > d and the algorithm rejects (for S that is ε-far from dimension d). This completes the proof of Thm 1.
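The Chernoff step can be sanity-checked numerically: while Xt ≤ d the process dominates a Binomial(t, ε) variable, so Pr[Xt* ≤ d] is at most the binomial lower tail at t* = 8d/ε, which is far below 1/3. A small stdlib-only computation (the particular values of ε and d are ours):

```python
from math import comb

def binom_cdf(t, p, d):
    """Pr[Binomial(t, p) <= d], computed exactly."""
    q = 1 - p
    return sum(comb(t, k) * p**k * q**(t - k) for k in range(d + 1))

# Lemma 2's setting: X_t stochastically dominates Binomial(t, eps)
# while X_t <= d, so Pr[X_{t*} <= d] <= Pr[Binomial(t*, eps) <= d].
eps, d = 0.1, 5
t_star = int(8 * d / eps)          # t* = 8d/eps = 400 here
tail = binom_cdf(t_star, eps, d)   # the mean is 8d = 40, so the tail is tiny
```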
A similar approach allows testing whether a matrix is low-rank, and testing distances matrices (a slight improvement over [Parnas-Ron01]).
Assume there is an algorithm that queries t ≪ n^{1/2} points. WLOG it sees a random sample of S. With high probability 1 − O(t²/n) = 1 − o(1):
The sample contains no two points at distance O(1) from each other. Then the sample is l1^d-embeddable (since there is a geodesic line going through all its points). And so the algorithm must accept S.
Create S′ by choosing r ≪ n^{1/2} random points from S and duplicating each one n/r times. Then S′ embeds in the line (WHP 1 − o(1)), and a sample of ≪ r^{1/2} points from S or from S′ looks almost the same. The algorithm's view of S differs from its view of S′ with probability o(1), so the probabilities of accepting S and S′ differ by o(1) ≪ 1/3. Contradiction.
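The indistinguishability step is a birthday-paradox calculation: a sample from S′ reveals the duplication only if it hits the same original point twice. An exact computation of that collision probability (stdlib only; the parameter values are illustrative):

```python
from math import prod

def collision_prob(t, r):
    """Probability that t uniform samples over r distinct duplicated
    points (each duplicated n/r times) contain a repeated original
    point -- the only way to tell S' apart from S."""
    return 1 - prod(1 - i / r for i in range(t))

r = 10_000                  # r ~ n^{1/2} duplicated points
t = 10                      # sample size << r^{1/2}
p = collision_prob(t, r)    # ~ t^2 / (2r), i.e. o(1)
```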
Let δ = 0 (i.e., testing whether the vectors lie in a ball of radius d). Consider a sphere of radius d′ = d(1 + 1/2n) in l2^m. Let S consist of n random vectors from this sphere. Let S′ consist of n/2 random vectors from the sphere together with their n/2 antipodal vectors (−v).
By concentration of measure, WHP they are nearly orthogonal, e.g. the distance between every two is roughly d√2. In fact, WHP they are all at distance < d from their center of mass, as claimed.
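These concentration claims can be illustrated numerically (assuming NumPy; the particular n and m are ours): random vectors on a sphere of radius d′ = d(1 + 1/2n) in very high dimension are nearly orthogonal, and lie within distance < d of their center of mass (so S is a YES instance), while an antipodal pair in S′ sits at distance 2d′ > 2d (so S′ is a NO instance).

```python
import numpy as np

# n random points on a sphere of radius d' = d(1 + 1/2n) in l2^m;
# m is taken very large so that concentration of measure kicks in.
rng = np.random.default_rng(0)
n, m, d = 10, 400_000, 1.0
r = d * (1 + 1 / (2 * n))

V = rng.standard_normal((n, m))
V = r * V / np.linalg.norm(V, axis=1, keepdims=True)  # S: random points on the sphere

# Pairwise distances via the Gram matrix (cheaper than forming all differences).
G = V @ V.T
sq = np.diag(G)
D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0.0))
pairwise = D[np.triu_indices(n, 1)]   # concentrates around d * sqrt(2)

# YES instance: all of S lies within distance < d of its center of mass.
com = V.mean(axis=0)
dist_to_com = np.linalg.norm(V - com, axis=1)

# NO instance: S' pairs each point with its antipode -v, at distance
# 2d' > 2d, so S' fits in no ball of radius d.
antipodal_gap = 2 * r
```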
(Figure: by concentration of measure, S is a YES instance, while S′, which contains antipodal pairs, is a NO instance.)
WHP the view of S and of S′ is the same. So the probabilities of accepting S and S′ differ by o(1). Contradiction. This proves Thm 3.
Consider a unit circle with 10D equally spaced points. Let S consist of the points on n/(10D) far-apart parallel copies of this circle in R³.
S is far from embeddable into the line with distortion D, since embedding each circle into the line requires distortion > D.
WLOG it sees a random sample of S. WHP, this sample contains at most one point from each circle, and then it can be embedded with distortion < D into the line (by mapping each point to its circle's center). So WHP the algorithm must accept S. Contradiction.
We create S′ by choosing one point from each circle of S and duplicating it 10D times.
Then S′ can be embedded with distortion < D into the line. WHP the view of ≪ (n/D)^{1/2} points from S is the same as from S′. So the probabilities of accepting S and S′ differ by o(1). This proves Thm 4.
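The construction and the distortion claim for S′ can be checked directly (a sketch assuming NumPy; the helper names and the spacing parameter are ours): mapping each duplicated point to its circle's z-coordinate embeds S′ into the line with distortion far below D, because the circles are far apart along the z-axis.

```python
import numpy as np

def circles_instance(D, copies, spacing=100.0):
    """The hard instance S: `copies` far-apart parallel copies in R^3
    of a unit circle carrying 10*D equally spaced points."""
    k = 10 * D
    theta = 2 * np.pi * np.arange(k) / k
    circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    layers = []
    for c in range(copies):
        z = np.full((k, 1), c * spacing)   # circles far apart along z
        layers.append(np.hstack([circle, z]))
    return np.vstack(layers)

def line_distortion(points, image):
    """Worst pairwise ratio distortion of mapping `points` (in R^3)
    to the real line via `image` (pairs at distance 0 are skipped)."""
    worst = 1.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d3 = np.linalg.norm(points[i] - points[j])
            d1 = abs(image[i] - image[j])
            if d3 > 0 and d1 > 0:
                worst = max(worst, d3 / d1, d1 / d3)
    return worst

D, copies = 3, 5
k = 10 * D
S = circles_instance(D, copies)                    # copies * 10D points

# S': one (arbitrary) point per circle, duplicated 10D times.
reps = np.array([S[c * k + (c % k)] for c in range(copies)])
S2 = np.repeat(reps, k, axis=0)

# Embed S' into the line: send each point to its circle's z-coordinate.
image = S2[:, 2]
dist = line_distortion(S2, image)                  # far below D
```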
Future research
Testing whether
A matrix's spectral norm ||A||₂ is small. A distances matrix represents a metric (triangle inequality). A distances matrix represents an l1^d metric. A distances matrix represents an approximate l2^d metric, à la [Frieze-Kannan-Vempala98, Achlioptas-McSherry01].