Lecture 18--External sorts. 11/7/97.

================================================================================

Note on the programming assignment: the nonrecursive version of quicksort should explicitly stack pairs of integers (first, last), rather than the array elements themselves.
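As a hedged illustration of what stacking (first, last) pairs can look like, here is a minimal Python sketch. The function name quicksort_iterative and the choice of the last element as pivot are assumptions made for this example, not requirements of the assignment.

```python
def quicksort_iterative(a):
    """Sort the list a in place using an explicit stack of (first, last) index pairs."""
    stack = [(0, len(a) - 1)]          # the stack holds index pairs, never array elements
    while stack:
        first, last = stack.pop()
        if first >= last:              # subarrays of size 0 or 1 are already sorted
            continue
        pivot = a[last]                # assumed pivot choice: the last element
        i = first
        for j in range(first, last):   # partition a[first..last] around the pivot
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[last] = a[last], a[i]  # the pivot is now in its final position i
        stack.append((first, i - 1))   # left subarray, to be sorted later
        stack.append((i + 1, last))    # right subarray, to be sorted next
    return a
```

Each stack entry costs two integers regardless of how large the records are, which is the point of stacking indices rather than elements.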

================================================================================

18.1. One algorithm for sorting large files. The external sorting problem is the problem of sorting a file, stored on tape or on disk, which is so large that not all of its keys fit in main memory at one time. This problem arises often, for example with large database files. Since the file will not fit entirely in memory, it must be sorted in blocks. The time to retrieve or store a block, and the time to seek to a new block, must be taken into account, since access times for external storage devices are typically several orders of magnitude greater than main memory access times. If the file is stored on tape, the access time can be even greater, because tape drives are strictly sequential.

In this section we look at just one example of such an algorithm. Like many algorithms for this problem, it is a variation of the Mergesort algorithm we have studied previously. It is an example of a class of algorithms called Polyphase Merge Sorting algorithms.

Let n be the number of keys to be sorted. Let m (< n) be the number of keys which fit in main memory at one time. The parameter m must be chosen so that the program itself also fits in memory. Note that each key may be associated with a large record, so m may actually be rather small.

The basic algorithm strategy is to arrange the keys in "runs", i.e., ordered subsequences, in two or more files and then to merge the runs. The algorithm described here will use four files, T0, T1, T2, and T3. We assume that initially the keys to be sorted are in T0.

Phase 1. Construct the runs, placing them alternately in T2 and T3.

      j = 2
      while T0 is not empty do
            read up to m records into main memory
            sort them (using an appropriate internal sorting routine)
            append the resulting run to Tj
            if j = 2 then
                  set j = 3
            else
                  set j = 2
      Rewind the tapes or reset the disk files.
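Phase 1 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each file is modelled as an in-memory list of runs rather than a tape or disk file, and the names make_runs, t2, and t3 are hypothetical.

```python
def make_runs(keys, m):
    """Cut keys into sorted runs of at most m keys, alternating between two files."""
    t2, t3 = [], []                 # each "file" is modelled as a list of runs
    outputs = [t2, t3]
    j = 0                           # index of the file that receives the next run
    for start in range(0, len(keys), m):
        # One memory load: at most m keys, sorted by an internal sorting routine.
        run = sorted(keys[start:start + m])
        outputs[j].append(run)
        j = 1 - j                   # alternate between the two output files
    return t2, t3
```

Note that the last run may hold fewer than m keys when m does not divide the number of keys.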

Phase 2. Merge the runs. Initially T2 and T3 will be input files, T0 and T1 output files. These roles will reverse on each pass.

      j = 2; j' = 3
      k = 0; k' = 1
      while there is more than one run do
            repeat
                  merge the next run in Tj with the next run in Tj'
                  put the resulting run in Tk
                  merge the next run in Tj with the next run in Tj'
                  put the resulting run in Tk'
            until Tj and Tj' are empty
            rewind tapes or reset files
            swap (j, k)
            swap (j', k')

If one input file has no run left, the remaining run in the other file is copied unchanged to the output file. When the loop ends, the sorted file is in Tj.
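The merge passes of Phase 2 can be sketched in the same spirit. This is a minimal sketch, not the definitive algorithm: files are again modelled as lists of runs, the name merge_passes is hypothetical, and the standard library function heapq.merge stands in for the two-way merge of a pair of runs.

```python
import heapq

def merge_passes(in_a, in_b):
    """Merge runs pairwise from two input files until at most one run is left.

    Returns the final run and the number of passes that were needed.
    """
    passes = 0
    while len(in_a) + len(in_b) > 1:
        out = [[], []]              # the two output files for this pass
        k = 0                       # output file that receives the next merged run
        for i in range(max(len(in_a), len(in_b))):
            # Pair the i-th run of each input; an unpaired run is merged with
            # an empty run, which simply copies it.
            run_a = in_a[i] if i < len(in_a) else []
            run_b = in_b[i] if i < len(in_b) else []
            out[k].append(list(heapq.merge(run_a, run_b)))
            k = 1 - k
        in_a, in_b = out            # outputs become the inputs of the next pass
        passes += 1
    final = in_a[0] if in_a else (in_b[0] if in_b else [])
    return final, passes
```

Swapping the roles of the input and output lists at the end of each pass plays the part of swap (j, k) and swap (j', k') in the pseudocode.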

Example. (Baase, p. 92). m = 6. (* marks the end of a run.)

initial:  after phase 1:    phase 2, 1st pass:  phase 2, 2nd pass:  final:
T0:       T2:     T3:       T0:     T1:         T2:     T3:         T0:

19        8       2         2       9           2       6           (holds sorted data)
42        13      4         4       11          4       27
13        19      7         7       15          7       29
8         39      12        8       18          8       30
87        42      17        12      24          9       49
39        87      32        13      35          11      58
7         *       *         17      38          12      59
17        9       15        19      44          13      63
4         11      35        32      56          15      65
2         18      38        39      62          17      67
32        24      56        42      71          18      68
12        44      62        87      91          19      74
18        91      71        *       *           24      84
24        *       *         27      6           32      89
91        30      27        30      29          35      96
11        58      49        49      74          38      *
44        65      59        58      *           39      EOF
9         67      63        59      EOF         42
15        89      68        63                  44
62        96      84        65                  56
71        *       *         67                  62
35        6       EOF       68                  71
56        29                84                  87
38        74                89                  91
67        *                 96                  *
89        EOF               *                   EOF
65                          EOF
58
96
30
84
59
27
68
49
63
29
74
6
EOF

Time to execute this algorithm:

We must count key comparisons, passes through the data, and tape rewinds (rewinds can be ignored for disks).

In one execution of the loop in phase 2, each key (and associated record) is transferred once into main memory and once to an output file, and one (simultaneous) tape rewind is done.

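Now after phase 1 there are r = ceil(n / m) runs in total, divided as evenly as possible between T2 and T3. After the jth pass there will be about r / 2^j runs in the output files, so the number of passes in phase 2 will be about log2 r. In the example above, n = 39 and m = 6, so r = 7 and three passes are needed.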
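The pass count can be checked with a few lines of Python. This is a small sketch under the assumption that every pass halves the number of runs, rounding up when a run is left unpaired; the name merge_pass_count is hypothetical.

```python
def merge_pass_count(n, m):
    """Number of phase 2 passes needed for n keys when m keys fit in memory."""
    r = -(-n // m)            # initial number of runs, ceil(n / m)
    passes = 0
    while r > 1:
        r = (r + 1) // 2      # each pass merges runs in pairs
        passes += 1
    return passes
```

For the example, merge_pass_count(39, 6) gives 3, in agreement with the table and with log2 7 rounded up.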

Exercise 18.1. Give an estimate of the total time for the algorithm given above to execute. Explain carefully how you arrived at the estimate.

18.2. Some miscellaneous sorting problems.

Exercise 18.2. Classify each of the algorithms in the programming assignment as stable or not stable. If the algorithm is not stable, give an example showing this. Recall that the definition of stability is given on the midterm.

Exercise 18.3. 6:30 p.m. 11/6. This will be available later this evening.