Sublinear Time Algorithms


A Seminar Report On

SUB-LINEAR TIME ALGORITHMS


Submitted in partial fulfillment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
In

COMPUTER SCIENCE AND ENGINEERING


Submitted by

AJAY YADAV Roll No. - 1012210009 B.TECH (CS-61)


Under the guidance of

Ms. ANITA PAL Mr. KAMAL KUMAR SRIVASTAVA

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SHRI RAMSWAROOP MEMORIAL COLLEGE OF ENGINEERING & MANAGEMENT, LUCKNOW FEBRUARY 2013

ABSTRACT

The area of sublinear-time algorithms is a new, rapidly emerging area of computer science. It has its roots in the study of massive data sets that occur more and more frequently in various applications. Financial transactions with billions of input data and Internet traffic analyses (Internet traffic logs, clickstreams, web data) are examples of modern data sets that show unprecedented scale. Managing and analyzing such data sets forces us to reconsider the traditional notions of efficient algorithms: processing such massive data sets in more than linear time is by far too expensive, and often even linear-time algorithms may be too slow. Hence, there is a desire to develop algorithms whose running times are not only polynomial, but in fact sublinear in n. Constructing a sublinear-time algorithm may seem to be an impossible task, since it allows one to read only a small fraction of the input. However, in recent years we have seen the development of sublinear-time algorithms for optimization problems arising in such diverse areas as graph theory, geometry, algebraic computations, and computer graphics. Initially, the main research focus was on designing efficient algorithms in the framework of property testing, which is an alternative notion of approximation for decision problems. More recently, we have seen major progress on sublinear-time algorithms in the classical model of randomized and approximation algorithms. In this report, we survey some of the recent advances in this area. Our main focus is on sublinear-time algorithms for combinatorial problems, especially for graph problems and optimization problems in metric spaces.

ACKNOWLEDGEMENT

An acknowledgement provides the opportunity to express gratitude to the people who help make such work a success. I take this opportunity to thank Dr. (Mrs.) Vinodini Katiyar (Head of Department, Computer Science) for her immense support and help. I would also like to thank my seminar guides, Ms. Anita Pal and Mr. Kamal Kumar Srivastava, for their constant cooperation and encouragement, which helped me overcome hurdles at various stages. Finally, I would like to thank my family for providing resources, support and encouragement; this report would not have been possible without you.

AJAY YADAV

List of Symbols and Abbreviations


DFA: deterministic finite automaton
R.E.: regular expression
MST: minimum spanning tree
NP: nondeterministic polynomial time
PTAS: polynomial-time approximation scheme
O: asymptotic upper bound
Ω: asymptotic lower bound
Θ: asymptotically tight bound
<: less than
∑: summation

TABLE OF CONTENTS
Abstract
Acknowledgement
List of Symbols and Abbreviations
1. Introduction
1.1. Sublinear time algorithms
1.2. Property testing
1.2.1. Definition and variants
1.2.2. Features and limitations
1.3. Data streaming algorithms
2. Related work
2.1. Randomized algorithms
2.2. Approximation algorithms
2.3. Time complexity comparisons
3. Design
3.1. Algorithm to determine sortedness of a list
3.1.1. Algorithm
3.2. Successor search in a list
3.2.1. Search algorithm
3.3. Testing membership of a string in a R.E.
3.3.1. Algorithm
3.4. Searching in a sorted list
3.5. Geometry: intersection of two polygons
3.6. Sublinear time algorithms for graph problems
3.6.1. Average degree of a graph
3.6.2. Minimum spanning trees
3.7. Sublinear time approximation algorithms for problems in metric spaces
3.7.1. Clustering via random sampling
4. Conclusion
References

1. INTRODUCTION

The concept of sublinear-time algorithms has been known for a very long time, but initially it was used to denote pseudo-sublinear-time algorithms, where, after an appropriate preprocessing, an algorithm solves the problem in sublinear time. For example, if we have a set of n numbers, then after an O(n log n) preprocessing (sorting) we can trivially solve a number of problems involving the input elements. And so, if after the preprocessing the elements are put in a sorted array, then in O(1) time we can find the kth smallest element, in O(log n) time we can test if the input contains a given element x, and also in O(log n) time we can return the number of elements equal to a given element x. Even though all these results are folklore, this is not what we nowadays call a sublinear-time algorithm. In this report, my goal is to study algorithms for which the input is taken to be in any standard representation and with no extra assumptions. Then, an algorithm does not have to read the entire input, but may determine the output by checking only a subset of the input elements. It is easy to see that for many natural problems it is impossible to give any reasonable answer if not all or almost all input elements are checked. But still, for a number of problems we can obtain good algorithms that do not have to look at the entire input. Typically, these algorithms are randomized (because most of the problems have a trivial linear-time deterministic lower bound) and they return only an approximate solution rather than an exact one (because usually, without looking at the whole input, we cannot determine the exact solution). In this survey, we present recently developed sublinear-time algorithms for some combinatorial optimization problems.

1.1 Sublinear time algorithms: A sublinear-time algorithm is an algorithm whose running time grows strictly more slowly than the input size; that is, if f(n) denotes its time complexity, then f(n) = o(n). Here o (little-oh) is defined as follows: for two functions g1(n) and g2(n), we write g1 = o(g2) if and only if for every constant c > 0 there exists an n0 such that g1(n) < c·g2(n) for all n > n0. An algorithm is said to run in sub-linear time (often spelled sublinear time) if T(n) = o(n). In particular, this includes algorithms such as the O(√n)-time Grover's search algorithm. Typical algorithms that are exact and yet run in sub-linear time use parallel processing (as NC algorithms for the matrix determinant do), non-classical processing (as Grover's search does), or alternatively have guaranteed assumptions on the input structure (as logarithmic-time binary search and many tree maintenance algorithms do).

Figure 1.1 Comparison between sublinear and linear algorithm

However, languages such as the set of all strings that have a 1-bit at the position indexed by the first log(n) bits may depend on every bit of the input and yet be computable in sub-linear time. The specific term sublinear-time algorithm is usually reserved for algorithms that, unlike the above, run on classical serial machine models and are not allowed prior assumptions on the input. They are, however, allowed to be randomized, and indeed must be randomized for all but the most trivial of tasks. As such an algorithm must provide an answer without reading the entire input, its particulars heavily depend on the access allowed to the input. Usually, for an input represented as a binary string b1,...,bk, it is assumed that the algorithm can, in O(1) time, request and obtain the value of bi for any i. Sub-linear time algorithms are typically randomized and provide only approximate solutions. In fact, the property of a binary string having only zeros (and no ones) can easily be proved not to be decidable by a (non-approximate) sub-linear time algorithm. Sub-linear time algorithms arise naturally in the investigation of property testing.

1.2. Property testing: In broad terms, property testing is the study of the following class of problems: given the ability to perform (local) queries concerning a particular object (e.g., a function, or a graph), the task is to determine whether the object has a predetermined (global) property (e.g.,

linearity or bipartiteness), or is far from having the property. The task should be performed by inspecting only a small (possibly randomly selected) part of the whole object, where a small probability of failure is allowed. In order to define a property testing problem, we need to specify the types of queries that can be performed by the algorithm and a distance measure between objects. The latter is required in order to define what it means that an object is far from having the property. We assume that the algorithm is given a distance parameter ε. In computer science, a property testing algorithm for a decision problem is an algorithm whose query complexity is much smaller than the instance size of the problem. Typically, property testing algorithms are used to decide if some mathematical object (such as a graph or a boolean function) has a "global" property, or is "far" from having this property, using only a small number of "local" queries to the object. For example, the following promise problem admits an algorithm whose query complexity is independent of the instance size (for an arbitrary constant ε > 0): "Given a graph G on n vertices, decide if G is bipartite, or G cannot be made bipartite even after removing an arbitrary subset of at most εn² edges of G." Property testing algorithms are important in the theory of probabilistically checkable proofs.

1.2.1. Definition and variants: Formally, a property testing algorithm with query complexity q(n) and proximity parameter ε for a decision problem L is a randomized algorithm that, on input x (an instance of L), makes at most q(|x|) queries to x and behaves as follows: if x is in L, the algorithm accepts x with probability at least 2/3; if x is ε-far from L, the algorithm rejects x with probability at least 2/3. Here, "x is ε-far from L" means that the Hamming distance between x and any string in L is at least ε|x|. A property testing algorithm is said to have one-sided error if it satisfies the stronger condition that the accepting probability for instances x ∈ L is 1 instead of 2/3. A property testing algorithm is said to be non-adaptive if it performs all its queries before it "observes" any answers to previous queries. Such an algorithm can be viewed as operating in the following manner. First the algorithm receives its input. Before looking at the input, using its internal randomness, the algorithm decides which symbols of the input are to be queried. Next, the algorithm observes these symbols. Finally, without making any additional queries (but possibly using its randomness), the algorithm decides whether to accept or reject the input.
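To make the definition concrete, here is a minimal sketch (in Python) of a one-sided-error, non-adaptive tester for the all-zeros property mentioned in Section 1.1; the oracle interface query and the constant 2/ε are our own illustrative choices, not a fixed algorithm from the literature.

    import random

    def test_all_zeros(query, n, eps):
        # One-sided-error, non-adaptive tester for "the string is all zeros".
        # query(i) returns bit i of the input, 0 <= i < n.
        # All-zeros strings are always accepted; a string with >= eps*n ones
        # survives all queries with probability <= (1 - eps)^(2/eps) < 1/3.
        num_queries = int(2 / eps) + 1
        # All positions are chosen before any bit is read: non-adaptive.
        positions = [random.randrange(n) for _ in range(num_queries)]
        # Reject only on concrete evidence (a 1-bit): one-sided error.
        return all(query(i) == 0 for i in positions)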

1.2.2.Features and limitations The main efficiency parameter of a property testing algorithm is its query complexity, which is the maximum number of input symbols inspected over all inputs of a given length (and all random choices made by the algorithm). One is interested in designing algorithms whose query complexity is as small as possible. In many cases the running time of property testing algorithms is sublinear in the instance length. Typically, the goal is first to make the query complexity as small as possible as a function of the instance size n, and then study the dependency on the proximity parameter . Unlike other complexity-theoretic settings, the asymptotic query complexity of property testing algorithms is affected dramatically by the representation of instances. For example, when = 0.01, the problem of testing bipartiteness of dense graphs (which are represented by their adjacency matrix) admits an algorithm of constant query complexity. In contrast, sparse graphs on n vertices (which are represented by their adjacency list) require property testing algorithms of query complexity ( . The query complexity of property testing algorithms grows as the proximity parameter becomes smaller for all non-trivial properties. This dependence on is necessary as a change of fewer than symbols in the input cannot be detected with constant probability using fewer than O(1/) queries. Many interesting properties of dense graphs can be tested using query complexity that depends only on and not on the graph size n. However, the query complexity can grow enormously fast as a function of . For example, for a long time the best known algorithm for testing if a graph does not contain any triangle had a query complexity which is a tower function of poly(1/), and only in 2010 this has been improved to a tower function of log(1/). O ne of the reasons for this enormous growth in bounds is that many of the positive results for property testing of graphs are established using the Szemerdi regularity lemma, which also has towertype bounds in its conclusions.

1.3 Data Streaming algorithms: In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (much less than the input size) and also limited processing time per item. These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream in memory.
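As a small illustration of these constraints, the sketch below maintains a uniform random sample of k items in a single pass and O(k) memory (reservoir sampling), one standard building block for such summaries; the function name is ours.

    import random

    def reservoir_sample(stream, k):
        # One-pass uniform sample of k items from a stream of unknown length.
        reservoir = []
        for t, item in enumerate(stream):      # t = number of items seen so far
            if t < k:
                reservoir.append(item)
            else:
                j = random.randrange(t + 1)    # item t is kept with prob k/(t+1)
                if j < k:
                    reservoir[j] = item
        return reservoir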

2. Related work

2.1. Randomized Algorithms: A randomized algorithm is an algorithm which employs a degree of randomness as part of its logic. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behaviour, in the hope of achieving good performance in the "average case" over all possible choices of random bits. Formally, the algorithm's performance will be a random variable determined by the random bits; thus either the running time, or the output (or both) are random variables. One has to distinguish between algorithms that use the random input to reduce the expected running time or memory usage, but always terminate with a correct result in a bounded amount of time, and probabilistic algorithms, which, depending on the random input, have a chance of producing an incorrect result (Monte Carlo algorithms) or fail to produce a result (Las Vegas algorithms) either by signalling a failure or failing to terminate. In the second case, random performance and random output, the term "algorithm" for a procedure is somewhat questionable. In the case of random output, it is no longer formally effective. However, in some cases, probabilistic algorithms are the only practical means of solving a problem. In common practice, randomized algorithms are approximated using a pseudorandom number generator in place of a true source of random bits; such an implementation may deviate from the expected theoretical behaviour.
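A classic example of a Monte Carlo algorithm in this sense is Freivalds' verification of matrix products: it always accepts a correct product and wrongly accepts an incorrect one with probability at most 2^-rounds, while each round costs only O(n²) instead of the O(n³) of naive recomputation. A minimal sketch for integer matrices given as lists of lists:

    import random

    def freivalds(A, B, C, rounds=20):
        # Monte Carlo check of whether A*B == C for n x n integer matrices.
        n = len(A)

        def matvec(M, v):
            return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

        for _ in range(rounds):
            r = [random.randint(0, 1) for _ in range(n)]
            # Compare A(Br) with Cr: three O(n^2) matrix-vector products.
            if matvec(A, matvec(B, r)) != matvec(C, r):
                return False          # definite witness that A*B != C
        return True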

2.2 Approximation algorithms: In computer science and operations research, approximation algorithms are algorithms used to find approximate solutions to optimization problems. Approximation algorithms are often associated with NP-hard problems; since it is unlikely that there can ever be efficient polynomial-time exact algorithms solving NP-hard problems, one settles for polynomial-time sub-optimal solutions. Unlike heuristics, which usually only find reasonably good solutions reasonably fast, one wants provable solution quality and provable run-time bounds. Ideally, the approximation is optimal up to a small constant factor (for instance within 5% of the optimal solution). Approximation algorithms are increasingly being used for problems where exact polynomial-time algorithms are known but are too expensive due to the input size. A typical example of an approximation algorithm is the one for vertex cover in graphs: find an uncovered edge and add both endpoints to the vertex cover, until none remain. It is clear that the resulting cover is at most twice as large as the optimal one. This is a constant-factor approximation algorithm with a factor of 2; a sketch follows.
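A minimal sketch of this 2-approximation, assuming the graph is given as a list of edges:

    def vertex_cover_2approx(edges):
        # Take both endpoints of every edge that is still uncovered. The chosen
        # edges form a maximal matching, and any optimal cover must contain at
        # least one endpoint of each of them, so |cover| <= 2 * OPT.
        cover = set()
        for u, v in edges:
            if u not in cover and v not in cover:
                cover.add(u)
                cover.add(v)
        return cover

    # Example: on the path 1-2-3-4 this returns {1, 2, 3, 4}; the optimum is {2, 3}.
    print(vertex_cover_2approx([(1, 2), (2, 3), (3, 4)]))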

NP-hard problems vary greatly in their approximability; some, such as the bin packing problem, can be approximated within any factor greater than 1 (such a family of approximation algorithms is often called a polynomial-time approximation scheme, or PTAS). Others are impossible to approximate within any constant, or even polynomial, factor unless P = NP, such as the maximum clique problem. NP-hard problems can often be expressed as integer programs (IP) and solved exactly in exponential time. Many approximation algorithms emerge from the linear programming relaxation of the integer program. Not all approximation algorithms are suitable for all practical applications. They often use IP/LP/semidefinite solvers, complex data structures or sophisticated algorithmic techniques, which lead to difficult implementation problems. Also, some approximation algorithms have impractical running times even though they are polynomial time, for example O(n^2000). Yet the study of even very expensive algorithms is not a completely theoretical pursuit, as they can yield valuable insights. A classic example is the initial PTAS for Euclidean TSP due to Sanjeev Arora, which had prohibitive running time; yet within a year, Arora refined the ideas into a linear-time algorithm. Such algorithms are also worthwhile in some applications where the running times and cost can be justified, e.g., computational biology, financial engineering, transportation planning, and inventory management. In such scenarios, they must compete with the corresponding direct IP formulations. Another limitation of the approach is that it applies only to optimization problems and not to "pure" decision problems like satisfiability, although it is often possible to conceive optimization versions of such problems, such as the maximum satisfiability problem (Max SAT). Inapproximability has been a fruitful area of research in computational complexity theory since the 1990 result of Feige, Goldwasser, Lovász, Safra and Szegedy on the inapproximability of Independent Set. After Arora et al. proved the PCP theorem a year later, it was shown that Johnson's 1974 approximation algorithms for Max SAT, Set Cover, Independent Set and Coloring all achieve the optimal approximation ratio, assuming P != NP.

2.3. Time complexity comparisons: In computer science, the time complexity of an algorithm quantifies the amount of time taken by an algorithm to run as a function of the length of the string representing the input. The time complexity of an algorithm is commonly expressed using big O notation, which excludes coefficients and lower-order terms. When expressed this way, the time complexity is said to be described asymptotically, i.e., as the input size goes to infinity. For example, if the time required by an algorithm on all inputs of size n is at most 5n³ + 3n, the asymptotic time complexity is O(n³).


Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, where an elementary operation takes a fixed amount of time to perform. Thus the amount of time taken and the number of elementary operations performed by the algorithm differ by at most a constant factor. Since an algorithm's performance time may vary with different inputs of the same size, one commonly uses the worst-case time complexity of an algorithm, denoted T(n), which is defined as the maximum amount of time taken on any input of size n. Time complexities are classified by the nature of the function T(n). For instance, an algorithm with T(n) = O(n) is called a linear-time algorithm, and an algorithm with T(n) = O(2^n) is said to be an exponential-time algorithm.

Table 1 Time complexity comparison


3. Design

3.1. Algorithm for checking sortedness of a list in sublinear time: Input: a list of n numbers x1, x2, ..., xn. Question: is the list sorted, or ε-far from sorted? Two different O((log n)/ε)-time testers are known for this problem. Here we argue that Ω(log n) queries are required, for every constant ε ≤ 1/2, for every 1-sided error non-adaptive test.

A test has 1-sided error if it always accepts all YES instances. A test is non-adaptive if its queries do not depend on answers to previous queries.

A pair (xi, xj) with i < j is violated if xi > xj. Claim: a 1-sided error test can reject only if it finds a violated pair. Proof: every sorted partial list can be extended to a sorted list. Lola's distribution is uniform over the following log n lists:


Claim 1: all lists above are 1/2-far from sorted. Claim 2: every pair (xi, xj) with i < j is violated in exactly one list above. Let Q be the set of positions that the test queries.

Our test must be correct, i.e., it must find a violated pair with probability at least 2/3 when the input is picked according to Lola's distribution. Writing a1 < a2 < ... for the queried positions, Q contains a violated pair if and only if (ai, ai+1) is violated for some i; by Claim 2, under Lola's distribution this happens with probability at most |Q|/log n. If |Q| < (2/3)·log n, then this probability is < 2/3. So |Q| = Ω(log n). By Yao's minimax principle, every randomized 1-sided error non-adaptive test for sortedness must make Ω(log n) queries.

3.1.1. Algorithm for sortedness checking: repeat the following Θ(1/ε) times:


1. Pick an entry uniformly at random; let x be the value in that entry.
2. Perform a binary search for x.
3. If the binary search finds x (at the entry picked in step 1), continue; otherwise, reject. Accept if all repetitions succeed.

Runtime: O((log n)/ε). A sketch of this tester follows.
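A sketch of this spot-checker, assuming random access to a list with distinct entries (the repetition count 2/ε is one concrete choice of constants):

    import random

    def test_sorted(a, eps):
        # Accepts every sorted list; rejects lists that are eps-far from sorted
        # with probability >= 2/3. Runtime: O(log(n)/eps).
        n = len(a)
        for _ in range(int(2 / eps) + 1):
            i = random.randrange(n)
            x = a[i]
            lo, hi = 0, n - 1
            while lo <= hi:                    # binary search for x
                mid = (lo + hi) // 2
                if a[mid] == x:
                    break
                elif a[mid] < x:
                    lo = mid + 1
                else:
                    hi = mid - 1
            else:
                return False                   # x not found on its search path
            if mid != i:
                return False                   # x found at the wrong position
        return True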

3.2. Successor search in a list in sublinear time: Input: a sorted doubly linked list of numbers and a search value k. Output: the first element greater than k. Data structure:

Figure 3.1. Data structure for the doubly linked list

3.2.1. Search algorithm:
1. Pick √n random list elements.
2. Determine the closest predecessor and successor of k within the random sample.
3. Perform a plain linear search towards k, starting from these two elements.

Figure 3.2. Algorithm running status


Good case:

Figure 3.3. Good case for algorithm

Worst case:

Figure 3.4. Worst case for the algorithm

Runtime: O(√n) expected.

3.3. Testing membership of a string in a R.E.: For a fixed regular language L ⊆ {0,1}*, a testing algorithm should accept w.h.p. every word w ∈ L, and should reject w.h.p. every word w that differs in more than εn bits (n = |w|) from every word of L of length n. The algorithm can query any bit wi of w.

Figure 3.5. DFA for testing

3.3.1. Algorithm (simplified version): Uniformly and independently select Θ(r/ε) indices 1 ≤ i ≤ n. For each selected i, check that the substring wi ... wi+r/ε is feasible, i.e., can occur inside some word of L (here r is a constant depending on the language L). If any substring is infeasible, then reject; otherwise accept. Runtime: independent of n for fixed L, roughly Õ(1/ε).
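The heart of the algorithm is the feasibility check. The sketch below tests a simplified, position-independent notion of feasibility: whether the substring can occur at all inside some word of L (the actual algorithm of Alon, Krivelevich, Newman and Szegedy also takes the position of the substring into account); the DFA encoding used here is our own assumption.

    import random

    def run_dfa(delta, state, word):
        # Follow the DFA transition table delta from `state` along `word`.
        for ch in word:
            state = delta[state][ch]
        return state

    def feasible(states, delta, start, accepting, substring):
        # Simplified feasibility: some state reachable from `start` reads the
        # substring and lands in a state that can still reach an accepting state.
        reach, stack = {start}, [start]
        while stack:                               # states reachable from start
            q = stack.pop()
            for ch in delta[q]:
                if delta[q][ch] not in reach:
                    reach.add(delta[q][ch])
                    stack.append(delta[q][ch])
        coreach, changed = set(accepting), True
        while changed:                             # states that can reach accepting
            changed = False
            for q in states:
                if q not in coreach and any(delta[q][ch] in coreach for ch in delta[q]):
                    coreach.add(q)
                    changed = True
        return any(run_dfa(delta, q, substring) in coreach for q in reach)

    def test_membership(w, states, delta, start, accepting, r, eps):
        # Theta(r/eps) random substrings of length ~ r/eps, as described above.
        n, block = len(w), int(r / eps) + 1
        for _ in range(block):
            i = random.randrange(n)
            if not feasible(states, delta, start, accepting, w[i:i + block]):
                return False
        return True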


3.4. Searching in a sorted list: It is well known that if we can store the input in a sorted array, then we can solve various problems on the input very efficiently. However, the assumption that the input array is sorted is not natural in typical applications. Let us now consider a variant of this problem, where our goal is to search for an element x in a sorted linked list containing n distinct elements. Here, we assume that the n elements are stored in a doubly-linked list: each list element has access to the next and preceding element in the list, and the list is sorted (that is, if x follows y in the list, then y < x). We also assume that we have access to all elements in the list, which, for example, can correspond to the situation where all n list elements are stored in an array (but the array is not sorted and we do not impose any order on the array elements). How can we find whether a given number x is in our input or not? At first glance, it seems that since we do not have direct access to the rank of any element in the list, this problem requires Ω(n) time. And indeed, if our goal is to design a deterministic algorithm, then it is impossible to do the search in o(n) time. However, if we allow randomization, then we can complete the search in O(√n) expected time (and this bound is asymptotically tight). Let us first sample uniformly at random a set S of Θ(√n) elements from the input. Since we have access to all elements in the list, we can select the set S in O(√n) time. Next, we scan all the elements in S, and in O(√n) time we can find two elements in S, p and q, such that p ≤ x < q and there is no element in S between p and q. Observe that since the input consists of n distinct numbers, p and q are uniquely defined. Next, we traverse the input list starting at p until we find either the sought key x or the element q. A sketch follows.
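A sketch of this search, with the list stored in parallel arrays so that nodes can be sampled uniformly; the same skeleton, stopped at the first value greater than k, yields the successor search of Section 3.2.

    import math
    import random

    def search_sorted_list(vals, nxt, prv, x):
        # vals[i]: value at node i; nxt[i]/prv[i]: index of the next/previous
        # node in sorted order (None at the ends). The arrays are in arbitrary
        # order, which is what makes uniform sampling of nodes possible.
        # Expected running time: O(sqrt(n)).
        n = len(vals)
        sample = random.sample(range(n), min(n, int(math.sqrt(n)) + 1))
        preds = [i for i in sample if vals[i] <= x]
        if preds:
            i = max(preds, key=lambda j: vals[j])   # closest sampled predecessor
            while i is not None and vals[i] < x:
                i = nxt[i]                          # walk forward towards x
        else:
            i = min(sample, key=lambda j: vals[j])  # closest sampled successor
            while prv[i] is not None and vals[prv[i]] >= x:
                i = prv[i]                          # walk backward towards x
        return i if i is not None and vals[i] == x else None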

Lemma 1: The algorithm above completes the search in expected O(√n) time. Moreover, no algorithm can solve this problem in o(√n) expected time. Proof: The running time of the algorithm is equal to O(√n) plus the number of input elements between p and q. Since S contains Θ(√n) elements, the expected number of input elements between p and q is O(n/|S|) = O(√n). This implies that the expected running time of the algorithm is O(√n). For a proof of the matching Ω(√n) lower bound on the expected time, see [1].


3.5. Geometry: Intersection of Two Polygons:

Let us consider a related problem, but this time in a geometric setting. Given two convex polygons A and B in R², each with n vertices, determine if they intersect, and if so, find a point in their intersection. It is well known that this problem can be solved in O(n) time, for example by observing that it can be described as a linear programming instance in two dimensions, a problem which is known to have a linear-time algorithm. In fact, within the same time one can either find a point that is in the intersection of A and B, or find a line L that separates A from B (actually, one can even find a bitangent separating line L, i.e., a line separating A and B which intersects each of A and B in exactly one point). The question is whether we can obtain a better running time. The complexity of this problem depends on the input representation. In the most powerful model, if the vertices of both polygons are stored in arrays in cyclic order, Chazelle and Dobkin showed that the intersection of the polygons can be determined in logarithmic time. However, a standard geometric representation assumes that the input is not stored in an array, but rather that A and B are given by doubly-linked lists of vertices such that each vertex has as its successor the next vertex of the polygon in clockwise order. Can we then test if A and B intersect?

Figure 1: (a) Bitangent line L separating CA and CB, and (b) the polygon PA.

Chazelle et al. [2] gave an O(√n)-time algorithm that reuses the approach discussed above for searching in a sorted list. Let us first sample uniformly at random Θ(√n) vertices from each of A and B, and let CA and CB be the convex hulls of the sampled point sets for the polygons A and B, respectively. Using the linear-time algorithm mentioned above, in O(√n) time we can check if CA and CB intersect. If they do, then the algorithm returns a point that lies in the intersection of CA and CB, and hence this point lies also in the intersection of A and B. Otherwise, let L be the bitangent separating line returned by the algorithm. Let a and b be the points in L that belong to A and B, respectively. Let a1 and a2 be the two vertices adjacent to a in A. We now define a new polygon PA. If neither a1 nor a2 is on the CA side of L, then we define PA to be empty.


Otherwise, exactly one of a1 and a2 is on the CA side of L; let it be a1. We define the polygon PA by walking from a to a1 and then continuing along the boundary of A until we cross L again (see Figure 1 (b)). In a similar way we define the polygon PB. Observe that the expected size of each of PA and PB is at most O(√n). It is easy to see that A and B intersect if and only if either A intersects PB or B intersects PA. We only consider the case of checking if A intersects PB. We first determine if CA intersects PB. If yes, then we are done. Otherwise, let LA be a bitangent separating line that separates CA from PB. We use the same construction as above to determine a subpolygon QA of A that lies on the PB side of LA. Then, A intersects PB if and only if QA intersects PB. Since QA has expected size O(√n), and so does PB, testing the intersection of these two polygons can be done in O(√n) expected time. Therefore, by our construction above, we have reduced the problem of determining if two polygons of size n intersect to a constant number of instances of determining if two polygons of expected size O(√n) intersect. This leads to the following lemma.

Lemma 2 [2]: The problem of determining whether two convex n-gons intersect can be solved in O(√n) expected time, which is asymptotically optimal.

Chazelle et al. [2] gave not only this result; they also showed how to apply a similar approach to design a number of sublinear-time algorithms for some basic geometric problems. For example, one can extend the result discussed above to test the intersection of two convex polyhedra in R³ with n vertices in O(√n) expected time. One can also approximate the volume of an n-vertex convex polytope to within a relative error ε > 0 in expected time O(√n/ε). Or even, for a pair of points on the boundary of a convex polytope P with n vertices, one can estimate the length of an optimal shortest path outside P between the given points in O(√n) expected time. In all the results mentioned above, the input objects have been represented by a linked structure: either every point has access to its adjacent vertices in the polygon in R², or the polytope is defined by a doubly-connected edge list, or similar. These input representations are standard in computational geometry, but a natural question is whether they are necessary to achieve sublinear-time algorithms: what can we do if the input polygon/polytope is represented by a set of points and no additional structure is provided to the algorithm? In such a scenario, it is easy to see that no o(n)-time algorithm can solve exactly any of the problems discussed above. That is, for example, to determine if two polygons with n vertices intersect, one needs Ω(n) time. However, we can still obtain some approximation to this problem, one which is described in the framework of property testing. Suppose that we relax our task: instead of determining if two (convex) polytopes A and B in R^d intersect, we just want to distinguish between two cases: either A and B are intersection-free, or one has to significantly modify A and B to make them intersection-free. The definition of the notion "significantly modify" may depend on the application at hand, but the most natural characterization would be to


remove at least εn points in A and B, for an appropriate parameter ε (see [3] for a discussion of other geometric characterizations). Czumaj et al. [4] gave a simple algorithm that, for any ε > 0, can distinguish between the case when A and B do not intersect and the case when at least εn points have to be removed from A and B to make them intersection-free: the algorithm returns the outcome of a test of whether a random sample of O((d/ε) log(d/ε)) points from A intersects with a random sample of O((d/ε) log(d/ε)) points from B.
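Assuming A and B are given only as point sets, the following hedged sketch implements this sampling tester; the sample sizes follow the O((d/ε) log(d/ε)) bound, and disjointness of the convex hulls of the two samples is decided by strict linear separability via an off-the-shelf LP solver.

    import math
    import numpy as np
    from scipy.optimize import linprog

    def samples_separable(A_pts, B_pts):
        # Strict separation: find (w, b) with w.a - b <= -1 for a in A_pts and
        # -w.x + b <= -1 for x in B_pts. Feasible iff the convex hulls of the
        # two samples are disjoint (touching hulls count as intersecting here).
        d = A_pts.shape[1]
        A_ub = np.vstack([np.hstack([A_pts, -np.ones((len(A_pts), 1))]),
                          np.hstack([-B_pts, np.ones((len(B_pts), 1))])])
        b_ub = -np.ones(len(A_pts) + len(B_pts))
        res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (d + 1))
        return res.status == 0        # 0 = a separating hyperplane was found

    def test_intersection(A_pts, B_pts, eps):
        # Report "intersecting" iff random samples from A and B are not separable.
        rng = np.random.default_rng()
        d = A_pts.shape[1]
        s = int((d / eps) * math.log(d / eps + 2)) + 1
        SA = A_pts[rng.choice(len(A_pts), size=min(s, len(A_pts)), replace=False)]
        SB = B_pts[rng.choice(len(B_pts), size=min(s, len(B_pts)), replace=False)]
        return not samples_separable(SA, SB)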

3.6. Sublinear Time Algorithms for Graph Problems: In the previous sections, we introduced the concept of sublinear-time algorithms and presented two basic sublinear-time algorithms for geometric problems. In this section, we discuss sublinear-time algorithms for graph problems, with special emphasis on sparse graphs represented by adjacency lists, where combinatorial algorithms are sought.

3.6.1. Approximating the Average Degree: Assume we have access to the degree distribution of the vertices of an undirected connected graph G = (V,E), i.e., for any vertex v ∈ V we can query its degree. Can we achieve a good approximation of the average degree in G by looking at a sublinear number of vertices? At first sight, this seems to be an impossible task: approximating the average degree seems equivalent to approximating the average of a set of n numbers with values between 1 and n−1, which is not possible in sublinear time. However, Feige [5] proved that one can approximate the average degree in O(√n/ε) time within a factor of 2+ε. The difficulty of approximating the average of a set of n numbers can be illustrated with the following example. Assume that almost all numbers in the input set are 1 and a few of them are n−1. To approximate the average, we need to approximate how many occurrences of n−1 exist. If there is only a constant number of them, we can do this only by looking at Ω(n) numbers in the set. So the problem is that these large numbers can "hide" in the set, and we cannot give a good approximation unless we can find at least some of them. Why is the problem less difficult if, instead of an arbitrary set of numbers, we have a set of numbers that are the vertex degrees of a graph? For example, we could still have a few vertices of degree n−1. The point is that in this case any edge incident to such a vertex can be seen at its other endpoint. Thus, even if we do not sample a vertex with high degree, we will see all its incident edges at other vertices in the graph. Hence, vertices with a large degree cannot hide. We will sketch a proof of a slightly weaker result than that originally proven by Feige [5]. Let d denote the average degree in G = (V,E) and let dS denote the random variable for the average


degree of a set S of s vertices chosen uniformly at random from V. We will show that if we set s ≥ β√n/ε^O(1) for an appropriate constant β, then dS ≥ (1/2 − ε)·d with probability at least 1 − ε/64. Additionally, we observe that the Markov inequality immediately implies that dS ≤ (1+ε)·d with probability at least 1 − 1/(1+ε) ≥ ε/2. Therefore, our algorithm will pick 8/ε sets Si, each of size s, and output the one with the smallest average degree. The probability that all of the sets Si have too high an average degree is at most (1 − ε/2)^(8/ε) ≤ 1/8. The probability that one of them has too small an average degree is at most (8/ε)·(ε/64) = 1/8. Hence, the output value will satisfy both inequalities with probability at least 3/4. Replacing ε by ε/2 yields a (2+ε)-approximation algorithm. Now, our goal is to show that with high probability one does not underestimate the average degree too much. Let H be the set of the √(εn) vertices with highest degree in G and let L = V \ H be the set of the remaining vertices. We first argue that the sum of the degrees of the vertices in L is at least (1/2 − ε) times the sum of the degrees of all vertices. This can easily be seen by distinguishing between edges incident to a vertex from L and edges within H. Every edge incident to a vertex from L contributes at least 1 to the sum of degrees of vertices in L, which is fine, as this is at least 1/2 of its full contribution. So the only edges that may cause problems are the edges within H. However, since |H| = √(εn), there can be at most εn such edges, which is small compared to the overall number of edges (which is at least n−1, since the graph is connected). Now, let dH be the smallest degree of a vertex in H. Since we aim at a lower bound on the average degree of the sampled vertices, we can safely assume that all sampled vertices come from the set L. We know that each vertex in L has a degree between 1 and dH. Let Xi, 1 ≤ i ≤ s, be the random variable for the degree of the i-th vertex from S. Then it follows from the Hoeffding bound that

Pr[ ∑_{i=1}^{s} Xi ≥ (1−ε)·E[∑_{i=1}^{s} Xi] ] ≥ 1 − exp(−2ε²·E[∑_{i=1}^{s} Xi]² / (s·dH²)).

We know that the average degree d is at least dH·|H|/n, because any vertex in H has degree at least dH. Hence, the average degree of a vertex in L is at least (1/2 − ε)·dH·|H|/n, which means E[Xi] ≥ (1/2 − ε)·dH·|H|/n. By linearity of expectation, E[∑_{i=1}^{s} Xi] ≥ s·(1/2 − ε)·dH·|H|/n. This implies that, for our choice of s, with high probability we have dS ≥ (1/2 − ε)·d. Feige [5] showed a somewhat stronger result with respect to the dependence on ε; a sketch of the sampling algorithm described above follows.
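A sketch of the sampling scheme just analyzed, given only a degree oracle; the constant beta and the power of 1/ε below are unspecified details of this sketch rather than the tuned values from [5].

    import math
    import random

    def approx_avg_degree(degree, n, eps):
        # (2+eps)-approximation of the average degree using degree queries only:
        # take 8/eps random vertex sets of size ~ beta*sqrt(n)/eps^2 each and
        # return the smallest sample average.
        beta = 4                                   # hypothetical constant
        s = int(beta * math.sqrt(n) / eps ** 2) + 1
        best = float("inf")
        for _ in range(int(8 / eps) + 1):
            sample = [random.randrange(n) for _ in range(s)]
            best = min(best, sum(degree(v) for v in sample) / s)
        return best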


3.6.2. Minimum Spanning Trees: One of the most fundamental graph problems is to compute a minimum spanning tree. Since the minimum spanning tree is of size linear in the number of vertices, no sublinear algorithm for finding it in sparse graphs can exist. It is also known that no constant-factor approximation algorithm with o(n²) query complexity exists for dense graphs (even in metric spaces) [6]. Given these facts, it is somewhat surprising that it is possible to approximate the cost of a minimum spanning tree in sparse graphs [7], as well as in metric spaces [8], to within a factor of (1 + ε). In the following, we explain the algorithm for sparse graphs by Chazelle et al. [7]. We will prove a slightly weaker result than in [7]. Let G = (V,E) be an undirected connected weighted graph with maximum degree D and integer edge weights from {1, ..., W}. We assume that the graph is given in adjacency list representation, i.e., for every vertex v there is a list of its at most D neighbors, which can be accessed from v. Furthermore, we assume that the vertices are stored in an array such that it is possible to select a vertex uniformly at random. We also assume that the values of D and W are known to the algorithm. The main idea behind the algorithm is to express the cost of a minimum spanning tree as the number of connected components in certain auxiliary subgraphs of G. Then, one runs a randomized algorithm to estimate the number of connected components in each of these subgraphs. To start with basic intuitions, let us assume that W = 2, i.e., the graph has only edges of weight 1 or 2. Let G(1) = (V,E(1)) denote the subgraph that contains all edges of weight (at most) 1, and let c(1) be the number of connected components in G(1). It is easy to see that the minimum spanning tree has to link these connected components by edges of weight 2. Since any connected component in G(1) can be spanned by edges of weight 1, any minimum spanning tree of G has c(1)−1 edges of weight 2 and n−1−(c(1)−1) edges of weight 1. Thus, the weight of a minimum spanning tree is

(n − 1 − (c(1) − 1))·1 + (c(1) − 1)·2 = n − 2 + c(1).

Next, let us consider an arbitrary integer value for W. Defining G(i) = (V,E(i)), where E(i) is the set of edges in G with weight at most i, one can generalize the formula above to express the cost MST of a minimum spanning tree as

MST = n − W + ∑_{i=1}^{W−1} c(i),

where c(i) denotes the number of connected components of G(i).


This gives the following simple algorithm: for each i = 1, ..., W−1, compute an estimate ĉ(i) of c(i), and output n − W + ∑_{i=1}^{W−1} ĉ(i); a code sketch appears after the next paragraph.

Thus, the key question that remains is how to estimate the number of connected components. This is done by the algorithm sketched below.
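A sketch of one standard variant of this estimator (the truncated search and the identity it relies on are discussed in the analysis below; the sample-size constants are our own choice), together with the MST estimator built on top of it:

    import random

    def approx_num_cc(n, neighbors, eps):
        # Estimate the number of connected components to within +- eps*n, using
        # the identity c = sum over vertices v of 1/|C_v|: each sampled vertex
        # explores its component but gives up after 2/eps vertices, which
        # changes the overall estimate by at most eps*n/2.
        r = int(8 / eps ** 2) + 1                  # number of sampled vertices
        cap = int(2 / eps) + 1
        total = 0.0
        for _ in range(r):
            v = random.randrange(n)
            seen, stack = {v}, [v]
            while stack and len(seen) < cap:       # truncated graph search
                u = stack.pop()
                for w in neighbors(u):
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            total += 1.0 / min(len(seen), cap)     # b_i in the analysis below
        return n * total / r                       # the estimator c-hat

    def approx_mst_weight(n, neighbors_at_most, W, eps):
        # MST = n - W + sum_{i=1..W-1} c(i), with each c(i) estimated on the
        # subgraph G(i); neighbors_at_most(v, i) must list the neighbours of v
        # over edges of weight <= i.
        est = n - W
        for i in range(1, W):
            est += approx_num_cc(n, lambda v, i=i: neighbors_at_most(v, i), eps / W)
        return est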

To analyze this algorithm, let us fix an arbitrary connected component C and let |C| denote the number of vertices in the connected component. Let c denote the number of connected components in G, and write C_v for the component containing vertex v. We can write

c = ∑_{v∈V} 1/|C_v|,

so that, up to the truncation of the search, each bi is an estimate of 1/|C_{vi}| for the sampled vertex vi, and E[bi] ≈ c/n.

By linearity of expectation we obtain E[ĉ] = (n/r)·∑_{i=1}^{r} E[bi], which differs from c by at most εn/2 because of the truncation. To show that ĉ is concentrated around its expectation, we apply the Chebyshev inequality. Since each bi takes values in [0, 1], we have Var[bi] ≤ E[bi²] ≤ E[bi].


With this bound on Var[ĉ], we can use the Chebyshev inequality to obtain

Pr[ |ĉ − E[ĉ]| ≥ εn/2 ] ≤ 4·Var[ĉ]/(ε²n²) ≤ 4/(r·ε²),

which is a small constant for r = Θ(1/ε²).

From this it follows that one can approximate the number of connected components within an additive error of εn, in a graph with maximum degree D, in time independent of n (polynomial in D and 1/ε, with an extra log(1/ρ) factor to achieve failure probability ρ), with probability at least 1 − ρ.

The following somewhat stronger result has been obtained in [7]; notice that its running time is independent of the input size n.

3.7. Sublinear Time Approximation Algorithms for Problems in Metric Spaces: One of the most widely considered models in the area of sublinear-time approximation algorithms is the distance oracle model for metric spaces. In this model, the input of an algorithm is a set P of n points in a metric space (P, d). We assume that it is possible to compute the distance d(p, q) between any pair of points p, q in constant time. Equivalently, one could assume that the algorithm is given access to the n × n distance matrix of the metric space, i.e., we have oracle access to the matrix of a weighted undirected complete graph. Since the full description size of this matrix is O(n²), we will call any algorithm with o(n²) running time a sublinear algorithm. Which problems can and cannot be approximated in sublinear time in the distance oracle model? One of the most basic problems is to find (an approximation of) the shortest or the longest pairwise distance in the metric space. It turns out that the shortest distance cannot be approximated. The counterexample is a uniform metric (all distances are 1) with one distance being set to some very small value ε. Obviously, it requires Ω(n²) time to find this single short distance. Hence, no sublinear-time approximation algorithm for the shortest distance problem exists. What about the longest distance? In this case, there is a very simple 1/2-approximation algorithm, first observed by Indyk [6]. The algorithm chooses an arbitrary point p and returns its furthest neighbor q. Let (r, s) be the furthest pair in the metric space. We claim that d(p, q) ≥ d(r, s)/2. By the triangle inequality, we have d(r, p) + d(p, s) ≥ d(r, s). This immediately implies that either d(p, r) ≥ d(r, s)/2 or d(p, s) ≥ d(r, s)/2; since q is the furthest neighbor of p, d(p, q) ≥ max{d(p, r), d(p, s)}, which shows the approximation guarantee. A sketch follows. Below, we present some recent sublinear-time algorithms for a few optimization problems in metric spaces.
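A sketch of Indyk's observation, where dist is the constant-time distance oracle:

    def half_approx_diameter(points, dist):
        # Fix an arbitrary point p and return its furthest neighbour q; by the
        # triangle inequality d(p, q) >= d(r, s)/2 for the true furthest pair
        # (r, s). Uses only n-1 of the n(n-1)/2 distances.
        p = points[0]
        q = max(points[1:], key=lambda x: dist(p, x))
        return p, q, dist(p, q)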

3.7.1. Clustering via Random Sampling: The problem of clustering large data sets into subsets (clusters) of similar characteristics is one


of the most fundamental problems in computer science, operations research, and related fields. Clustering problems arise naturally in various massive-dataset applications, including data mining, bioinformatics, pattern classification, etc. In this section, we discuss uniformly random sampling for clustering problems in metric spaces, as analyzed in two recent papers.

Figure: (a) a set of points in a metric space; (b) its 3-clustering (white points correspond to the centre points); (c) the distances used in the cost of the 3-median.


Let us consider a classical clustering problem known as the k-median problem. Given a finite metric space (P, d), the goal is to find a set C ⊆ P of k centers (points in P) that minimizes

cost(C) = ∑_{p∈P} d(p, C),

where d(p, C) denotes the distance from p to the nearest point in C. The k-median problem has been studied in numerous research papers. It is known to be NP-hard, and there exist constant-factor approximation algorithms running in Õ(nk) time. In two recent papers [9, 10], the authors asked about the quality of the uniformly random sampling approach to k-median, that is, the quality of the following generic scheme:

1. Choose a multiset S of s points, sampled independently and uniformly at random from P.
2. Run an α-approximation algorithm on the sample S to compute a set C of k centers.
3. Return C as the clustering of the whole point set P.

(A code sketch of this scheme follows.)
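A sketch of the generic scheme; the greedy routine below merely stands in for the α-approximation algorithm assumed by the analysis (it carries no guarantee of its own), and k ≤ s is assumed.

    import random

    def kmedian_cost(P, C, dist):
        # cost(C) = sum over p in P of the distance from p to its nearest centre.
        return sum(min(dist(p, c) for c in C) for p in P)

    def kmedian_by_sampling(P, k, s, dist):
        # Step 1: draw a uniform sample S of size s.
        S = random.sample(P, min(s, len(P)))
        # Step 2: run some approximation algorithm on S (placeholder greedy).
        C = []
        for _ in range(k):
            best = min((p for p in S if p not in C),
                       key=lambda p: kmedian_cost(S, C + [p], dist))
            C.append(best)
        # Step 3: the centres computed on S are returned for the whole of P.
        return C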

The goal is to show that already a sublinear-size sample set S suffices to obtain a good approximation guarantee. Furthermore, as observed in [10], in order to have any approximation guarantee, one has to consider the quality of the approximation as a function of the diameter of the metric space. Therefore, we consider a model in which the diameter Δ of the metric space is given, that is, d : P × P → [0, Δ].

Limitations: what cannot be done in sublinear time. The algorithms discussed in the previous sections may suggest that many optimization problems in metric spaces have sublinear-time algorithms. However, it turns out that the problems listed above are more like exceptions than the norm. Indeed, most problems have a trivial lower bound that excludes sublinear-time algorithms. We have already mentioned above that the problem of approximating the cost of the lightest edge in a finite metric space (P, d) requires Ω(n²) time, even if randomization is allowed. The other problems for which no sublinear-time algorithms are possible include estimating the cost of minimum-cost matching, the cost of minimum-cost bi-chromatic matching, the cost of minimum non-uniform facility location, and the cost of k-median for k = n/2; all these problems require Ω(n²) (randomized) time to estimate the cost of their optimal solution to within any constant factor λ. To illustrate the lower bounds, we give two instances of metric spaces which are indistinguishable by any o(n²)-time algorithm, and for which the cost of the minimum-cost matching in one instance is greater than λ times the cost in the other instance (see Figure 3). Consider a metric space (P, d) with 2n points: n points in L and n points in R. Take a random perfect matching M between the points in L and R, and then choose an edge e ∈ M at random. Next, define the distance in (P, d) as follows:

- d(e) is either 1 or B, where we set B = n(λ−1) + 2;


- for every e* ∈ M \ {e}, set d(e*) = 1; and

- for any other pair of points p, q ∈ P not connected by an edge from M, d(p, q) = n³.

It is easy to see that both instances properly define a metric space (P, d). For such problem instances, the cost of the minimum-cost matching depends on the choice of d(e): if d(e) = B then the cost is n − 1 + B > λn, and if d(e) = 1, then the cost is n. Hence, any λ-factor approximation algorithm for the matching problem must distinguish between these two problem instances. However, this requires determining whether there is an edge of length B, and this is known to require Ω(n²) time, even if a randomized algorithm is used.

4. Conclusions:
It would be impossible to present a complete picture of the large body of research in the area of sublinear-time algorithms in such a report. In this report, my main goal was to give some flavor of the area, of the types of results achieved, and of the techniques used. For more details, we refer to the original works listed in the references. Two important areas closely related to sublinear-time algorithms, property testing and data streaming algorithms, were only briefly introduced in Sections 1.2 and 1.3 and not discussed in depth.


References:
[1] A. Czumaj and C. Sohler. Sublinear-time algorithms. Bulletin of the EATCS, 89: 23–47, June 2006.

[2] B. Chazelle, D. Liu, and A. Magen. Sublinear geometric algorithms. SIAM Journal on Computing, 35(3): 627–646, 2006.

[3] A. Czumaj and C. Sohler. Property testing with geometric queries. Proceedings of the 9th Annual European Symposium on Algorithms (ESA), pp. 266–277, 2001.

[4] A. Czumaj, C. Sohler, and M. Ziegler. Property testing in computational geometry. Proceedings of the 8th Annual European Symposium on Algorithms (ESA), pp. 155–166, 2000.

[5] U. Feige. On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM Journal on Computing, 35(4): 964–984, 2006.

[6] P. Indyk. Sublinear time algorithms for metric space problems. Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), pp. 428–434, 1999.

[7] B. Chazelle, R. Rubinfeld, and L. Trevisan. Approximating the minimum spanning tree weight in sublinear time. SIAM Journal on Computing, 34(6): 1370–1379, 2005.

[8] A. Czumaj and C. Sohler. Estimating the weight of metric minimum spanning trees in sublinear-time. Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 175–183, 2004.

[9] A. Czumaj and C. Sohler. Sublinear-time approximation for clustering via random sampling. Proceedings of the 31st Annual International Colloquium on Automata, Languages and Programming (ICALP), pp. 396–407, 2004.

[10] N. Mishra, D. Oblinger, and L. Pitt. Sublinear time approximate clustering. Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 439–447, 2001.

