Lecture 6--simple searches: array, sorted array, linked list, tree. 10/3/97
===============================================================
Now that we have some tools for analyzing algorithm behavior, let us consider one of the simplest problems: we have a set of N items (integers, names, records, etc.) and we want to see whether a particular item X is in this set. Let us assume that the set can be ordered; for example, it is a set of N integers, or it is a set of N records, each of which is identified by an integer which is its unique "key".
Searching a list for a specific item and sorting a set of data are historically two of the most important types of problems for which computers are used, so many algorithms have been proposed for these problems, and there has been a great deal of analysis of how the proposed algorithms behave. Much current work in algorithms still deals with sorting and searching problems, although typically today researchers are trying to use "intelligent" or "probabilistic" methods to search through and/or sort huge sets of data, and the data may be stored in complex graph or hypergraph structures rather than in simpler data structures such as arrays, trees, or linked lists.
For our simple searching problem, it is clear that we will need at least N memory cells to hold all the N items. It is also clear that if we know nothing at all about how the data is arranged then we will, in the worst case, have to examine all the items in the set, so the number of operations we will need to do in the worst case will be proportional to N.
Example 1. Linear search. Suppose our data is stored in an unsorted array A. Then we will need to examine each element in A until either we find X or we have looked at all elements. It is fairly easy to see that an algorithm which looks at each item in the array will correctly solve this problem. This will take time proportional to N, i.e., T(N) = "capital theta"(N), in the worst case. Similarly, if the data is stored in a linked list L, since we can only start at the head of L and then step through the list in order, the time to search L for X may be as much as "capital theta"(N), and again it is fairly easy to convince ourselves that this procedure correctly determines whether or not X is in L.
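As a minimal sketch of Example 1, the two linear searches might be written in Python as follows. The names Node, linear_search_array, and linear_search_list are hypothetical and chosen only for this illustration, and the keys are assumed to be integers.

```python
from typing import Optional, Sequence


class Node:
    """One cell of a singly linked list (hypothetical helper for this sketch)."""

    def __init__(self, key: int, next: "Optional[Node]" = None) -> None:
        self.key = key
        self.next = next


def linear_search_array(a: Sequence[int], x: int) -> int:
    """Return an index i with a[i] == x, or -1 if x is not in a."""
    for i, item in enumerate(a):
        if item == x:
            return i
    return -1  # every one of the N elements was examined


def linear_search_list(head: Optional[Node], x: int) -> bool:
    """Return True if x occurs in the list that starts at head."""
    node = head
    while node is not None:  # we can only step forward from the head
        if node.key == x:
            return True
        node = node.next
    return False


a = [7, 3, 9, 1]
print(linear_search_array(a, 9))  # 9 is present
print(linear_search_array(a, 4))  # 4 is absent

# Build the linked list 7 -> 3 -> 9 -> 1 by prepending in reverse order.
head = None
for key in reversed(a):
    head = Node(key, head)
print(linear_search_list(head, 1))
print(linear_search_list(head, 4))
```

In both functions an unsuccessful search touches every item once, which is the worst case described above.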
Example 2. If our data is stored in a binary search tree T, then the time to search T may be, in the worst case, "capital theta"(N).
Exercise 6.1. Prove the statement in example 2.
Example 3. Now suppose the data is stored in an array A in sorted order (let us assume it is sorted smallest to largest). Then we can use linear search to look for X, but we can also use a much faster algorithm to see if X is in A, namely, the procedure called binary search:
Algorithm binary search.
Input: A, an array of N keys, with A[0] <= A[1] <= ... <= A[N-1], and X, a key.
Output: An index I for which X = A[I], or -1 if X is not in A.
begin
  return search(0, N-1, X), where search is given below.
end
procedure search (min, max, X)
begin
  if min > max return -1
  else begin
    mid = floor((min + max) / 2)
    if X = A[mid] return mid
    else if X > A[mid]
      return search(mid+1, max, X)
    else return search(min, mid-1, X)
  end
end
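The pseudocode above can be transcribed into Python roughly as follows. This is only a sketch: the names binary_search and _search are our own, the keys are assumed to be integers, and the result of each recursive call is passed back to the caller so that the index reaches the top level.

```python
from typing import Sequence


def binary_search(a: Sequence[int], x: int) -> int:
    """Return some index i with a[i] == x, or -1 if x is not in the sorted a."""
    return _search(a, 0, len(a) - 1, x)


def _search(a: Sequence[int], lo: int, hi: int, x: int) -> int:
    if lo > hi:  # empty range: x cannot be present
        return -1
    mid = (lo + hi) // 2  # integer division rounds down
    if x == a[mid]:
        return mid
    if x > a[mid]:
        return _search(a, mid + 1, hi, x)  # x can only be to the right of mid
    return _search(a, lo, mid - 1, x)  # x can only be to the left of mid


a = [1, 3, 5, 7, 9, 11]
print(binary_search(a, 7))  # 7 is present
print(binary_search(a, 4))  # 4 is absent
```

If the array contains duplicate keys, this version returns the index of whichever copy it happens to probe first, not necessarily the leftmost one.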
Exercise 6.2. Derive a nonrecursive version of binary search by first explicitly encoding the stack management and then removing the stack if possible.
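Is the binary search algorithm correct? At each step we throw away about half of the items we have not yet examined. We do not need to look at these items, since the array is sorted: if X > A[mid], then every item at or below index mid is smaller than X, and if X < A[mid], then every item at or above index mid is larger than X. So X, if it is in A at all, always lies between min and max, and since this range shrinks at every step the procedure must stop. So binary search is a correct algorithm.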
How much time does binary search take in the worst case?
Let T(N) be the time binary search takes to run if an array of length N is input.
Claim: T(1) = c1, T(N) = T( |_ N/2 _| ) + c2.
Exercise 6.3. Explain why this expression for T(N) is valid.
Exercise 6.4. Find a closed form expression for T(N).
Exercise 6.5. Give reasonable values for the constants c1 and c2. Explain how you arrived at these values.
Exercise 6.6. The time to look for item X in a sorted array A using linear search can be expressed as c3N + c4. Give reasonable values for c3 and c4. Based on these values and the values for c1 and c2 calculated in exercise 6.5, for what range of values of N should we use linear search and for what range of values of N would binary search be preferred?
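To get a feeling for how slowly T(N) grows, we can tabulate the recurrence T(1) = c1, T(N) = T( |_ N/2 _| ) + c2 claimed above. The sketch below simply unrolls the recurrence; the function name t and the default values c1 = c2 = 1 are placeholders assumed for illustration only, not recommended answers to the exercises.

```python
def t(n: int, c1: int = 1, c2: int = 1) -> int:
    """Evaluate T(1) = c1, T(N) = T(floor(N/2)) + c2 by unrolling it."""
    if n < 1:
        raise ValueError("n must be at least 1")
    total = c1  # cost of the base case T(1)
    while n > 1:
        n //= 2  # one halving step ...
        total += c2  # ... costs c2
    return total


for n in (1, 2, 10, 100, 1000, 1_000_000):
    print(n, t(n))
```

Running the loop for a few values of N and comparing the results with N itself is a useful check on any closed form you derive.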
Example 4. If our data is stored in a binary search tree T with N nodes, then looking for X can also be done in time proportional to log2(N), if T has minimum height. Consider the simplest case where N = 2^k - 1 for some k and T is a complete binary tree. Then from the properties T1-T3 of trees (Lecture 4) we can conclude that T has height k-1 and so we can search T for X in at most k = "capital theta"(log2(N)) iterations. In the general case we will have
2^(k-1) < N <= 2^k
for some k, and we can with patience calculate the maximum number of iterations needed. We see that it is about log2(N) in the general case as well.
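The claim for the complete tree can be checked on a small case. The sketch below builds a minimum-height binary search tree from sorted keys and counts the nodes examined during a search. The names TreeNode, build_balanced, height, and search are hypothetical, and height is measured in edges, so that a single node has height 0.

```python
from typing import List, Optional, Tuple


class TreeNode:
    """A binary search tree node (hypothetical helper for this sketch)."""

    def __init__(self, key: int,
                 left: "Optional[TreeNode]" = None,
                 right: "Optional[TreeNode]" = None) -> None:
        self.key = key
        self.left = left
        self.right = right


def build_balanced(keys: List[int]) -> Optional[TreeNode]:
    """Build a minimum-height BST from keys sorted in increasing order."""
    if not keys:
        return None
    mid = len(keys) // 2  # middle key becomes the root of this subtree
    return TreeNode(keys[mid],
                    build_balanced(keys[:mid]),
                    build_balanced(keys[mid + 1:]))


def height(t: Optional[TreeNode]) -> int:
    """Height in edges; the empty tree has height -1."""
    if t is None:
        return -1
    return 1 + max(height(t.left), height(t.right))


def search(t: Optional[TreeNode], x: int) -> Tuple[bool, int]:
    """Return (found, number of nodes examined)."""
    steps = 0
    node = t
    while node is not None:
        steps += 1
        if x == node.key:
            return True, steps
        node = node.right if x > node.key else node.left
    return False, steps


k = 4
keys = list(range(1, 2 ** k))  # N = 2^k - 1 = 15 keys
root = build_balanced(keys)
print(height(root))                            # expected to equal k - 1
print(max(search(root, x)[1] for x in keys))   # expected to equal k
```

Changing k, or using a number of keys that is not of the form 2^k - 1, shows how the maximum number of iterations tracks log2(N) in the general case.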
Example 5. Generalizations of the binary tree structure can be used to store large amounts of data and to allow retrieval of data items in approximately logarithmic time. One example is the B-tree structure.