Lecture 9--Order statistics. 10/11/97.

================================================================

We noted above that the problems of sorting a list and searching for an item in a list are very important in computer science. A related problem is the problem of finding the kth smallest item in a list. Special cases include finding the minimum, the maximum, and the median (k = n / 2) values. As part of your first homework assignment, you developed an algorithm to find the largest element in a list, and you showed it was a "capital theta" (n) algorithm for a list of n elements. Similarly, there is a "capital theta" (n) algorithm to find the smallest element. If the list is sorted, it is easy to find the kth smallest element for 1 <= k <= n. It is not immediately clear how to solve this problem in general if the list is not sorted, however. But it turns out that we can use the divide and conquer technique to develop an algorithm to find the kth smallest element, 1 <= k <= n, in an unsorted list in "capital theta" (n) time.
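For comparison, the sort-first approach mentioned above can be sketched in a few lines of Python. The function name kth_smallest_by_sorting is only an illustrative choice, and k is taken to be 1-based as in the text. Because it sorts the whole list, this sketch does not achieve the linear time bound developed below.

```python
def kth_smallest_by_sorting(k, items):
    """Return the k-th smallest element (k is 1-based) by sorting a copy of items."""
    if not 1 <= k <= len(items):
        raise ValueError("k must satisfy 1 <= k <= len(items)")
    # After sorting, the k-th smallest element sits at index k - 1.
    return sorted(items)[k - 1]


data = [7, 2, 9, 4, 1, 8]
smallest = kth_smallest_by_sorting(1, data)               # minimum
largest = kth_smallest_by_sorting(len(data), data)        # maximum
median = kth_smallest_by_sorting(len(data) // 2, data)    # k = n / 2
```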

We use the notation | S | to denote the cardinality of a set S, i.e., the number of elements in S.

We will use the recursive function SELECT(k,S) to select the kth smallest element from S:

procedure SELECT (k,S)
    if |S| < 50 then
        sort S
        return kth smallest element of S
    else
        divide S into |_ |S| / 5 _| lists of 5 elements each, with up to 4 "leftovers"
        sort each 5-element list
        let M be the list of medians of these 5-element lists
        m = SELECT ( ceiling( |M| / 2 ) , M )
        let S1, S2, S3 be the lists of elements in S that are <, =, > m respectively
        if |S1| >= k return SELECT (k, S1)
        else if ( |S1| + |S2| >= k ) return m
        else return SELECT ( k - |S1| - |S2| , S3 )
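The procedure above can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation: the name select, the use of Python lists, and the 1-based k are assumptions made for the example. The leftover elements are left out of the medians but still take part in the partition around m. Note that the first comparison has to be |S1| >= k, since when exactly k elements are smaller than m the answer lies in S1.

```python
def select(k, s):
    """Return the k-th smallest element of list s (k is 1-based, 1 <= k <= len(s))."""
    if len(s) < 50:
        return sorted(s)[k - 1]
    # Medians of the floor(len(s) / 5) full groups of five; leftovers are skipped here.
    full = len(s) - len(s) % 5
    medians = [sorted(s[i:i + 5])[2] for i in range(0, full, 5)]
    # (len(medians) + 1) // 2 is ceiling(|M| / 2).
    m = select((len(medians) + 1) // 2, medians)
    # Three-way partition of all of s (leftovers included) around m.
    s1 = [x for x in s if x < m]
    s2 = [x for x in s if x == m]
    s3 = [x for x in s if x > m]
    if len(s1) >= k:
        return select(k, s1)
    if len(s1) + len(s2) >= k:
        return m
    return select(k - len(s1) - len(s2), s3)


# Example with repeated values, checked against the sort-first answer.
values = [(i * 37) % 101 for i in range(600)]
agrees_with_sorting = select(300, values) == sorted(values)[299]
```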

Exercise 9.1. Show the steps in the above algorithm if S is an array of 500 integers, with

S[i] = 500 - i for 0 <= i <= 499 and if

a. k = 39.

b. k = 73.

Correctness. We can show that the above algorithm is correct by an induction argument on |S|, the size of S. We omit this proof here.

Running time. Suppose T(n) is the time to select the kth smallest element from a set of size n. Since M contains |_ n / 5 _| <= n / 5 elements, the recursive call SELECT ( ceiling( |M| / 2 ) , M ) requires at most T(n/5) time.

Now how big are S1 and S3? Since m is the median of the list of medians M, and M contains |_ n / 5 _| elements, at least |_ n / 10 _| elements of M are greater than or equal to m. Each of these medians comes from a sorted 5-element list in which two further elements are at least as large as the median, so each accounts for three distinct elements of S that are greater than or equal to m. Hence at least 3 |_ n / 10 _| elements of S are not in S1, so S1 contains at most n - 3 |_ n / 10 _| elements, i.e.,

|S1| <= n - 3 |_ n / 10 _| <= 3n / 4 (for n >= 50).

Similarly we have

|S3| <= 3n / 4.
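The inequality n - 3 |_ n / 10 _| <= 3n / 4 can be spot-checked numerically. The short Python sketch below tests it over a finite range only, so it illustrates the bound rather than proving it; the helper name s1_upper_bound and the upper limit of 100000 are arbitrary choices for the example. The comparison is multiplied through by 4 to stay in integer arithmetic.

```python
def s1_upper_bound(n):
    """Upper bound n - 3 * floor(n / 10) on |S1| (and on |S3|) from the argument above."""
    return n - 3 * (n // 10)


# Values of n >= 50 in the tested range where the bound would exceed 3n / 4.
violations = [n for n in range(50, 100001) if 4 * s1_upper_bound(n) > 3 * n]

# The same check below the cutoff, to see why small inputs are handled by sorting.
small_violations = [n for n in range(1, 50) if 4 * s1_upper_bound(n) > 3 * n]
```

In this range violations comes out empty, while small_violations does not: at n = 49, for example, the bound is 37, which exceeds 3(49) / 4 = 36.75.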

Thus the second recursive call requires at most time T( 3n / 4 ).

All other statements require at most O(n) time.

So we have

T(n) <= cn for n <= 49 (since n is bounded).

T(n) <= T(n / 5) + T (3n / 4) + cn for n >= 50.

Claim: T(n) <= 20cn.

Proof: By induction.

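Step 1. This is true for n <= 49, since T(n) <= cn <= 20cn there. It is also true for n = 50, since T(50) <= T(10) + T(39) + 50c <= 10c + 39c + 50c = 99c <= 3c(50) <= 20c(50).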

Step 2. Assume this is true for all n <= N, for some N >= 50.

Then T(N+1) <= T( (N+1) / 5 ) + T ( (3 (N+1)) / 4 ) + c ( N+1)

<= 20c ( (N+1) / 5 ) + 20 c( (3 (N+1)) / 4) + c (N+1)

= 20c (19 / 20) (N+1) + c (N+1) = 19c (N+1) + c (N+1) = 20c (N+1).

Step 3. Since we have verified steps 1 and 2, we have proved by induction that

T(n) <= 20cn for n >= 1.
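As a sanity check on the claim, the recurrence can be evaluated directly with both inequalities taken as equalities. The Python sketch below makes several assumptions for illustration: c = 1, the arguments n / 5 and 3n / 4 rounded down to integers, and a finite test range up to 20000. The names C and t_bound are hypothetical. It is a numerical illustration of the bound, not a substitute for the induction.

```python
from functools import lru_cache

C = 1  # hypothetical value for the constant c in the recurrence


@lru_cache(maxsize=None)
def t_bound(n):
    """Worst case allowed by the recurrence, with n / 5 and 3n / 4 rounded down."""
    if n <= 49:
        return C * n
    return t_bound(n // 5) + t_bound((3 * n) // 4) + C * n


# Does T(n) <= 20cn hold at every n in the tested range?
claim_holds = all(t_bound(n) <= 20 * C * n for n in range(1, 20001))

# Largest observed ratio T(n) / (cn); the claim says it never exceeds 20.
worst_ratio = max(t_bound(n) / (C * n) for n in range(1, 20001))
```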

Exercise 9.2. (grad) Can the division into sets of size 5 be replaced by division into sets of size 3? What about of size 7?

Exercise 9.3. If S has fewer than 50 elements, then we just sort S and find the kth smallest element. But we will learn in the next few lectures that many sorts take "capital omega" (n log2 n) time. Is this a contradiction? Explain.