Lecture 18--External sorts. 11/7/97.
================================================================================
Note: in the programming assignment, the nonrecursive version of quicksort should explicitly stack pairs of integers (first, last), rather than the array elements themselves.
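A minimal Python sketch of what "stacking pairs (first, last)" can look like. This is only an illustration, not the assignment's required code: the function name quicksort_iterative, the choice of the last element as pivot, and the rule of pushing the larger subrange first are all assumptions made here.

```python
def quicksort_iterative(a):
    """Sort list a in place, using an explicit stack of (first, last) index pairs."""
    stack = [(0, len(a) - 1)]          # the stack holds index ranges, never array elements
    while stack:
        first, last = stack.pop()
        if first >= last:              # ranges with fewer than two elements are already sorted
            continue
        # Partition a[first..last] around the pivot a[last] (assumed pivot rule).
        pivot = a[last]
        i = first
        for k in range(first, last):
            if a[k] < pivot:
                a[i], a[k] = a[k], a[i]
                i += 1
        a[i], a[last] = a[last], a[i]  # the pivot is now in its final position i
        left = (first, i - 1)
        right = (i + 1, last)
        # Push the larger subrange first so the smaller one is popped next;
        # this keeps the stack depth logarithmic in len(a).
        if (i - first) > (last - i):
            stack.append(left)
            stack.append(right)
        else:
            stack.append(right)
            stack.append(left)
    return a
```

For example, quicksort_iterative([19, 42, 13, 8, 87, 39]) sorts the list in place and returns it.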
================================================================================
18.1. One algorithm for sorting large files. The external sorting problem is the problem of sorting a file, stored on tape or on disk, that is too large for all of its keys to fit in main memory at one time. This problem arises often, for example with large database files. Since the file will not fit entirely in memory, it must be sorted in blocks. The time to retrieve or store a block, and the time to seek to a new block, must be taken into account, since access times for external storage devices are typically several orders of magnitude greater than main memory access times. If the file is stored on tape, the access time can be even greater, because tape drives are sequential.
In this section we look at just one example of such an algorithm. Like many algorithms for this problem, it is a variation of the Mergesort algorithm we have studied previously. It is an example of a class of algorithms called Polyphase Merge Sorting algorithms.
Let n be the number of keys to be sorted. Let m (< n) be the number of keys which fit in main memory at one time. The parameter m must be chosen so that the sorting program itself also fits in memory. Note that each key may be associated with a large record, so m may actually be rather small.
The basic algorithm strategy is to arrange the keys in "runs", i.e., ordered subsequences, in two or more files and then to merge the runs. The algorithm described here will use four files, T0, T1, T2, and T3. We assume that initially the keys to be sorted are in T0.
Phase 1. Construct the runs, placing them alternately in T2 and T3.
    j = 2
    while T0 is not empty do
        read the next m records (or all that remain, if fewer) into main memory
        sort them (using an appropriate internal sorting routine)
        append the resulting run to Tj
        if j = 2 then
            set j = 3
        else
            set j = 2
    Rewind the tapes or reset the disk files.
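Phase 1 can be sketched in Python as follows. This is a simulation under stated assumptions, not real tape or disk code: each "file" is modelled as an in-memory list of runs, and the name make_runs is hypothetical.

```python
def make_runs(t0, m):
    """Phase 1 sketch: split t0 into sorted runs of at most m keys,
    appending them alternately to t2 and t3 (each modelled as a list of runs)."""
    t2, t3 = [], []
    out = t2                                  # the first run goes to T2
    for start in range(0, len(t0), m):
        run = sorted(t0[start:start + m])     # one memory load, sorted internally
        out.append(run)
        out = t3 if out is t2 else t2         # alternate between T2 and T3
    return t2, t3
```

Called with the 39 keys of the example that follows and m = 6, this yields four runs in T2 (the last one holding only three keys) and three runs in T3.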
Phase 2. Merge the runs. Initially T2 and T3 will be input files, T0 and T1 output files. These roles will reverse on each pass. (If one input file has no run left, the next run of the other is simply copied.)
    j = 2; j' = 3
    k = 0; k' = 1
    while there is more than one run do
        repeat
            merge the next run in Tj with the next run in Tj'
            put the resulting run in Tk
            merge the next run in Tj with the next run in Tj'
            put the resulting run in Tk'
        until Tj and Tj' are empty
        rewind tapes or reset files
        swap (j, k)
        swap (j', k')
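The merge passes can be sketched in the same simulated setting. The assumptions are again that each file is a Python list of sorted runs and that a missing partner run is treated as empty; merge_passes is a hypothetical name, and heapq.merge from the standard library stands in for the two-way merge of a pair of runs.

```python
import heapq

def merge_passes(t2, t3):
    """Phase 2 sketch: repeatedly merge pairs of runs from two input files
    into two output files, swapping roles after each pass.
    Returns (the sorted keys, the number of passes performed)."""
    inputs = (t2, t3)
    passes = 0
    while len(inputs[0]) + len(inputs[1]) > 1:       # more than one run remains
        a, b = inputs
        outputs = ([], [])
        k = 0                                        # index of the output file receiving the next run
        for i in range(max(len(a), len(b))):
            run_a = a[i] if i < len(a) else []       # a missing run is treated as empty,
            run_b = b[i] if i < len(b) else []       # so the other run is simply copied
            outputs[k].append(list(heapq.merge(run_a, run_b)))
            k = 1 - k                                # alternate output files
        inputs = outputs                             # outputs become the next pass's inputs
        passes += 1
    remaining = inputs[0] or inputs[1]
    return (remaining[0] if remaining else []), passes
```

In this sketch, reassigning inputs = outputs plays the role of the two swap statements in the pseudocode.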
Example. (Baase, p. 92). m = 6. An asterisk (*) marks the end of a run.

initial   after phase 1   phase 2, 1st pass   phase 2, 2nd pass
T0        T2     T3       T0     T1           T2     T3
19        8      2        2      9            2      6
42        13     4        4      11           4      27
13        19     7        7      15           7      29
8         39     12       8      18           8      30
87        42     17       12     24           9      49
39        87     32       13     35           11     58
7         *      *        17     38           12     59
17        9      15       19     44           13     63
4         11     35       32     56           15     65
2         18     38       39     62           17     67
32        24     56       42     71           18     68
12        44     62       87     91           19     74
18        91     71       *      *            24     84
24        *      *        27     6            32     89
91        30     27       30     29           35     96
11        58     49       49     74           38     *
44        65     59       58     *            39     EOF
9         67     63       59     EOF          42
15        89     68       63                  44
62        96     84       65                  56
71        *      *        67                  62
35        6      EOF      68                  71
56        29              84                  87
38        74              89                  91
67        *               96                  *
89        EOF             *                   EOF
65                        EOF
58
96
30
84
59
27
68
49
63
29
74
6
EOF

Final: T0 holds the sorted data.

Time to execute this algorithm:
We must count key comparisons, passes through the data, and tape rewinds (rewinds can be ignored for disks).
In one execution of the loop in phase 2, each key (and associated record) is transferred once into main memory and once to an output file, and one (simultaneous) tape rewind is done.
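Now after phase 1 there are r = n / m (rounded up) runs in total, divided as evenly as possible between the files T2 and T3. Each pass merges the runs in pairs, so after the jth pass there will be about r / 2^j runs in the output files. So the number of passes in phase 2 will be about log_2 r. (In the example above, n = 39 and m = 6, so r = 7 and three passes are needed.)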
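The pass count can be checked with a few lines of Python. This sketch assumes that r counts all runs produced by phase 1 and that each pass halves the number of runs, rounding up when a run has no partner; the function name is hypothetical.

```python
def count_runs_and_passes(n, m):
    """Return (number of runs after phase 1, number of merge passes in phase 2)."""
    runs = -(-n // m)          # ceiling of n / m using integer arithmetic
    passes = 0
    r = runs
    while r > 1:
        r = (r + 1) // 2       # pairs of runs are merged; an unpaired run is carried over
        passes += 1
    return runs, passes
```

For the example's n = 39 keys with m = 6, the run count goes 7, 4, 2, 1, so the function reports 7 runs and 3 passes.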
Exercise 18.1. Give an estimate of the total time for the algorithm given above to execute. Explain carefully how you arrived at the estimate.
18.2. Some miscellaneous sorting problems.
Exercise 18.2. Classify each of the algorithms in the programming assignment as stable or not stable. If the algorithm is not stable, give an example showing this. Recall that the definition of stability is given on the midterm.
Exercise 18.3. 6:30 p.m. 11/6. This will be available later this evening.