Lecture 27. More on the PRAM model; realizable parallel machines. 12/1/97.

==================================================================================

27.1. More on the PRAM model. In the last lecture we saw how to use a simple procedure to find the max of n keys in time "capital theta" (log2n) on a theoretical machine, the PRAM, in which all processors communicate through a shared memory.

Clearly even in a theoretical model we need to have protocols for what happens when more than one processor wants to read from one specific memory location or wants to write to one specific memory location. If there are conflicts, how are they to be handled? In the PRAM model we distinguish between four submodels, each of which has different rules for these cases:

EREW--exclusive read, exclusive write. Only one processor is allowed to read from a cell or write to a cell in each time step. This is the most restrictive protocol. The algorithm for finding the max in Lecture 26 would be allowable in this protocol, since each processor P(i) reads from a different memory location M(i + step) and writes to the memory location M(i) in each loop iteration.

CREW--concurrent read, exclusive write. Any number of processors may read the (same) value from a specific memory location in a given time step, but only one processor can write to a specific memory location in a given time step.

(ERCW--exclusive read, concurrent write. Not generally used).

CRCW--concurrent read, concurrent write. Any number of processors may read the (same) value from a specific memory location in a given time step. Any number of processors may (attempt to) write to a specific memory location in a given time step. For concurrent write to produce predictable results, some rule for resolving conflicts must be adhered to. Two such rules which are often used are:

"common write"--several processors are allowed to write to the same memory location as long as they all write the same value

"priority write"--if several processors try to write to one memory cell, the one with the smallest index wins

Example. Computing the OR of n boolean variables X(1), ... , X(n), where X(i) is in M(i).

The result will be in M(1).

Method: let P(i) read X(i) from M(i). If X(i) = 1, P(i) writes 1 in M(1).

In a CRCW PRAM, with either "priority write" or "common write" protocol, this algorithm is legal and correct. It will execute in two time steps, one for the read operation and one for the write operation.
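The two-step OR can be imitated sequentially. The following is only a sketch: the function name crcw_or and the list m standing in for shared memory are illustrative assumptions, and the loop over processors replaces what a real PRAM would do in one time step.

```python
def crcw_or(bits):
    """Simulate the two-step CRCW OR of X(1..n); the result ends up in M(1)."""
    n = len(bits)
    m = [None] + list(bits)          # m[i] stands in for M(i); index 0 is unused
    # Step 1 (read): every processor P(i) reads X(i) from its own cell M(i).
    local = {i: m[i] for i in range(1, n + 1)}
    # Step 2 (write): every P(i) that read a 1 attempts to write 1 into M(1).
    writers = [(i, 1) for i in range(1, n + 1) if local[i] == 1]
    if writers:
        # "common write" is satisfied: all writers agree on the value.
        assert len({value for _, value in writers}) == 1
        # "priority write" gives the same answer: the smallest index wins.
        m[1] = min(writers)[1]
    # If nobody wrote, M(1) still holds X(1), which must be 0.
    return m[1]
```

For example, crcw_or([0, 0, 1, 0]) returns 1 and crcw_or([0, 0, 0]) returns 0. Both write rules give the same result here because every writer writes the value 1.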

In an EREW or CREW PRAM, the algorithm must be modified. For example, the same strategy used for finding the maximum in Lecture 26 can be used (in fact the OR of n bits is the maximum of the bit values). If this strategy is used, then the algorithm will require "capital theta"(log2n) steps.
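A sketch of the EREW version, assuming the Lecture 26 procedure has the form described earlier (P(i) reads M(i + step) and writes M(i), with step doubling each round). The name erew_or and the 0-based list m are assumptions made for illustration.

```python
def erew_or(bits):
    """Tree-style reduction: returns (OR of the bits, number of parallel rounds)."""
    m = list(bits)                   # m[i] stands in for M(i + 1)
    n = len(m)
    step, rounds = 1, 0
    while step < n:
        # Active processors sit at positions 0, 2*step, 4*step, ...
        active = [i for i in range(0, n, 2 * step) if i + step < n]
        # Read phase: P(i) reads M(i + step); all these addresses are distinct.
        read = {i: m[i + step] for i in active}
        # Write phase: P(i) writes only to its own cell, so writes are exclusive.
        for i in active:
            m[i] = max(m[i], read[i])    # the OR of two bits is their maximum
        step *= 2
        rounds += 1
    return m[0], rounds
```

With 8 input bits the sketch uses 3 rounds, in line with the logarithmic step count stated above.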

27.2. A parallel sorting algorithm. One parallel sorting method which works well on the PRAM model is a modification of merge sort. If we recall how merge sort works, we see that we recursively merge smaller and smaller lists. This can be easily parallelized if we start by sorting small lists.

Merge procedure for two sorted lists of k elements, stored in M(1), ... , M(2k).

For clarity label the elements in M(1), ... , M(k) as a(1), ... , a(k), and the elements in M(k+1), ... , M(2k) as b(1), ... , b(k). Also assume that all are distinct.

Search: If P(i) holds a(i), then P(i) does a binary search on b(1), ... , b(k) to find the smallest j with a(i) < b(j) (take j = k + 1 if a(i) is larger than every b). (P(i) will write a(i) in M(i + j - 1) ). Similarly, if P(k + j) holds b(j), P(k + j) does a binary search on a(1), ... , a(k) to find the smallest i such that b(j) < a(i) (take i = k + 1 if there is none). (P(k + j) will write b(j) in M(i + j - 1)).

Write: Each P writes its element into the correct memory location.

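Clearly this can be done in time "capital theta"(log2k + 1) = "capital theta"(log2k): the binary search takes about log2k steps and the write takes one. Since k <= n, this is at most about log2n steps.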
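The search-and-write idea can be sketched sequentially as follows. The function name pram_merge is an assumption, positions are 0-based here rather than 1-based, and each loop iteration stands for the work of one processor.

```python
from bisect import bisect_left

def pram_merge(a, b):
    """Merge two sorted lists of distinct keys by computing each key's final position."""
    out = [None] * (len(a) + len(b))
    # Processor holding a[i]: i keys of a precede it, plus every key of b smaller than it.
    for i, x in enumerate(a):
        out[i + bisect_left(b, x)] = x      # bisect_left is the binary search
    # Processor holding b[j]: j keys of b precede it, plus every key of a smaller than it.
    for j, y in enumerate(b):
        out[j + bisect_left(a, y)] = y
    return out
```

For example, pram_merge([1, 4, 6, 9], [2, 3, 7, 8]) returns [1, 2, 3, 4, 6, 7, 8, 9]. Because the keys are distinct, no two elements are assigned the same output position.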

Sort. Assume n = 2^m for some m. (If not, "pad" the array with some known value).

For count = 1 to m do

listsize = 2^(count-1)

For each i = 1, 1 + 2*listsize, 1 + 4*listsize, ... (all in parallel): P(i), ... , P(i + 2*listsize - 1) merge the two sorted lists of size listsize beginning at M(i).

This loop is executed "capital theta"(log2n) times, so the sorting algorithm requires a total of "capital theta"((log2n)^2) time steps.

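There are no write conflicts, since every element is written to a distinct location. The binary searches, however, may have several processors reading the same cell in one step, so the algorithm as stated needs concurrent reads (CREW or CRCW).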
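A sequential sketch of the whole sort, under the assumptions that n is a power of two and the keys are distinct. The names merge_by_rank and pram_merge_sort are illustrative, and the inner loop over blocks stands for merges that the PRAM would perform simultaneously.

```python
from bisect import bisect_left

def merge_by_rank(a, b):
    """Merge two sorted lists of distinct keys by binary-search ranking."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        out[i + bisect_left(b, x)] = x
    for j, y in enumerate(b):
        out[j + bisect_left(a, y)] = y
    return out

def pram_merge_sort(keys):
    """Return (sorted list, number of merge rounds)."""
    m = list(keys)
    n = len(m)
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    listsize, rounds = 1, 0
    while listsize < n:
        # All blocks of 2*listsize cells are merged in the same round.
        for start in range(0, n, 2 * listsize):
            mid = start + listsize
            end = mid + listsize
            m[start:end] = merge_by_rank(m[start:mid], m[mid:end])
        listsize *= 2
        rounds += 1
    return m, rounds
```

For 8 keys the sketch performs 3 rounds, one per value of count in the loop above.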

Exercise 27.1. Show how a variation of the PRAM algorithm for finding the max can be used to sum n numbers in time "capital theta"(log2n). State explicitly which write protocol(s) would allow your algorithm to execute correctly.

Exercise 27.2. A variation of bubblesort, called even-odd transposition sort, works as follows:

on pass i, if i is even then each even-numbered processor P(j) compares M(j), M(j+1), and switches them if these two elements are not in the right order; if i is odd, all odd-numbered processors do the same.

a. Show the steps to sort 8 3 1 2 7 6 5 0

b. How long will your algorithm take, as a function of the number of keys to be sorted, n, in the worst case? Give an example of "worst case" input.

27.2. Limitations of the PRAM. Some realistic architectures. Clearly the PRAM is an unrealizable machine, since it ignores the cost of communication between each processor and the shared memory locations. It can nevertheless provide useful insight into possible lower bounds. For example, if we prove that finding the maximum of n integers takes time "capital omega"(log2n) for the PRAM model, then we know that at least that much time will be required on a "real" architecture, no matter how clever an algorithm we come up with.

Once we depart from the theoretical PRAM model, we are really in the domain of computer architecture. For a real understanding of this field, it is therefore necessary to consider many architectural issues. Related questions lead into the fields of design of parallel languages and operating systems. Not much can be said in general; rather, specific cases must be analyzed.

Some common architectures often used in parallel processing are shown in Figure P1. These include the array or 1-dimensional mesh, the two-dimensional mesh, the tree, the hypercube, and the butterfly network. The last two are defined as follows:

To get an n-dimensional hypercube, take two copies of n-1 dimensional hypercubes, labeled 0 and 1, and join vertices whose last n-1 bits are identical. Thus, if we have two one-dimensional hypercubes, one with vertices labeled 00,01, and one with vertices labeled 10,11, we form a two-dimensional hypercube by adding edges from 00 to 10 and from 01 to 11 (Figure P2).
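The recursive construction can be sketched directly. The function name hypercube_edges is an assumption, and vertices are represented as integers whose binary digits are the labels (so 00, 01, 10, 11 become 0, 1, 2, 3).

```python
def hypercube_edges(d):
    """Return the edge set of the d-dimensional hypercube on vertices 0 .. 2**d - 1."""
    if d == 0:
        return set()                 # a single vertex, no edges
    lower = hypercube_edges(d - 1)
    half = 1 << (d - 1)
    # Copy "0" keeps its labels; copy "1" gets a leading 1 bit (add half).
    edges = set(lower) | {(u + half, v + half) for u, v in lower}
    # Join the vertices whose last d-1 bits are identical.
    edges |= {(v, v + half) for v in range(half)}
    return edges
```

For d = 2 this yields the edges (0, 1), (2, 3), (0, 2) and (1, 3), matching the example of joining 00 to 10 and 01 to 11. Every edge connects two labels that differ in exactly one bit.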

The butterfly network consists of linear arrays of processors such that at each stage j, processors are connected to those 2^j away in the next stage. It is one example of a class of networks sometimes called "shuffle exchange" networks which have long been used to implement fast sorting techniques. Typically each processor can pass through the data from the preceding stage or switch the inputs and output them in the opposite order.

27.3. Architecture characteristics. Clearly each of the architectures pictured can be viewed as a graph. Several graph characteristics are important in determining how useful an architecture is, for example:

1. degree--what is the degree of each node? What are the maximum and minimum degrees? How does the degree grow as the number of processors, n, grows? The greater the degree, the faster processors can communicate, but the harder it will be to actually build the machine.

2. diameter--what is the maximum, over all pairs of nodes, of the shortest path length between them? This gives some idea of how long it will take for information to be shared among all processors.

3. connectivity--what is the minimum number of processors that need to fail before the network becomes disconnected? This measures reliability and fault tolerance of the network. It also gives some measure of how much parallel communication might be possible.
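Degree and diameter can be computed mechanically for any small network given as an adjacency list. The sketch below uses breadth-first search; the function names and the 6-processor ring used as an example are assumptions for illustration (the ring is not one of the networks in Figure P1), and connectivity is not computed.

```python
from collections import deque

def degrees(adj):
    """Map each node to its number of neighbours."""
    return {v: len(neighbours) for v, neighbours in adj.items()}

def diameter(adj):
    """Longest shortest-path length over all pairs (the graph is assumed connected)."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                     # breadth-first search from src
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

# Hypothetical example network: a ring of 6 processors.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

Here every node of ring has degree 2 and diameter(ring) is 3, since the farthest node is directly opposite on the ring.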

Exercise 27.3. Answer the questions above for each of the architectures in Figure P1. Assume in each case that there are n processors (nodes).

4. Another characteristic which must be defined is how the processors will communicate. The basic modes of communication are classified as:

SISD--single instruction, single data. This is a sequential machine.

SIMD--single instruction, multiple data. There is one control unit and in each time step all the processors do the same instruction (or some do nothing). This is the control method typically used in "vector processors". Many one- and two-dimensional mesh-type architectures use this method. It can be very efficient for problems arising in scientific computation, for example.

MIMD--multiple instruction, multiple data. Each processor has its own program or list of instructions, which may differ from the programs being run on the other processors. Processors communicate by message passing. This protocol has been used on some hypercube-based machines, for example. It matches well with an object-oriented model of computation.

MISD--multiple instruction, single data. This term is usually applied to pipelined machines, where data items are "piped" through multiple computational units, each of which does some small amount of work. Vector processors, such as some CRAY models, have used this protocol effectively.

5. Related to communication is the question of whether processors will share one memory or have their own local memories. (It is assumed that each processor will have at least some small amount of local memory.)

6. Another important question relates to how general the architecture is. For example, if we have a 2-dimensional mesh and an algorithm designed to run on a hypercube, how easy is it to adapt the algorithm to run on the mesh and what loss in performance is there?

27.3. Sorting. We list some sorting algorithms which could be implemented on some of these architectures. In each case we will assume we are sorting n distinct integers. We are interested in the time to do the sort and the number of processors we use (as a function of n).

Method 1. On the one-dimensional mesh we can easily implement even-odd transposition sort (see Exercise 27.2). This requires n processors and n time steps in the worst case.
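A sequential sketch of even-odd transposition sort. The function name is an assumption, processors are numbered from 1, and pass 1 is taken to be an odd pass. Within one pass the compared pairs are disjoint, so the inner loop corresponds to a single parallel time step.

```python
def odd_even_transposition_sort(keys):
    """Sort by n alternating odd and even passes of neighbour comparisons."""
    m = [None] + list(keys)          # m[j] stands in for M(j); index 0 is unused
    n = len(keys)
    for p in range(1, n + 1):        # n passes are enough in the worst case
        first = 2 if p % 2 == 0 else 1
        # P(first), P(first + 2), ... each compare M(j) with M(j + 1).
        for j in range(first, n, 2):
            if m[j] > m[j + 1]:
                m[j], m[j + 1] = m[j + 1], m[j]
    return m[1:]
```

For example, odd_even_transposition_sort([4, 3, 2, 1]) returns [1, 2, 3, 4], using all four passes.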

Method 2. We can sort n elements on a binary tree of 2n-1 processors in time n.

Initially place each of the n elements in a leaf node.

At each time unit:

each processor looks at the elements stored in its 2 children (if any) and marks one as the "smallest"

each processor stores the marked data item from one of its children and passes its data to its parent, if the parent has room available. The root outputs the element it has stored.

(See Figure P3 for example).

This requires 2n-1 processors and n time steps in the worst case.

Method 3. This sort can be accomplished in time "capital theta"((log2n)^2), but uses "capital theta"(n(log2n)^2) processors. We use a "comparator" element defined by:

inputs      outputs
            s        +            -
x           x        min(x,y)     max(x,y)
y           y        max(x,y)     min(x,y)

This sort uses a "perfect shuffle" network, which takes as inputs x(1), ... , x(n) and y(1), ... , y(n) and outputs x(1), y(1), ... , x(n), y(n) (like shuffling a deck of cards). (See Figure P4).

The procedure works by constructing and merging "bitonic" sequences, i.e., sequences which consist of an increasing sequence concatenated with a decreasing sequence.
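The notes do not spell out the procedure, so the following is the standard recursive formulation of bitonic sort, given only as a sketch. The function names are assumptions, n is assumed to be a power of two, and the perfect shuffle wiring between stages is not modelled; only the pattern of comparisons is shown.

```python
def bitonic_merge(seq, ascending):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    lo, hi = [], []
    # One comparator per pair (seq[i], seq[i + half]); all pairs are independent.
    for x, y in zip(seq[:half], seq[half:]):
        small, large = min(x, y), max(x, y)
        lo.append(small if ascending else large)
        hi.append(large if ascending else small)
    # Both halves are again bitonic and can be merged independently.
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

def bitonic_sort(seq, ascending=True):
    """Build a bitonic sequence from two sorted halves, then merge it."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    first = bitonic_sort(seq[:half], True)      # increasing half
    second = bitonic_sort(seq[half:], False)    # decreasing half
    return bitonic_merge(first + second, ascending)
```

For example, bitonic_sort([7, 3, 8, 1, 6, 2, 5, 4]) returns [1, 2, 3, 4, 5, 6, 7, 8]. Each merge of length n has log2n comparator stages, and there are log2n levels of merging, which is where a (log2n)^2 time bound comes from.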

27.4. Performance measures.

We define the speedup of a given algorithm to be the factor by which its running time improves when P processors are used instead of one. The time for processor communication should be included, along with the actual computation time. Formally, if T(P,N) is the time to solve a problem of size N using P processors, then the speedup is

S(P,N) = T(1,N) / T(P,N)

For example, if T(P,N) = N / P + log(P) then S(P,N) = N / (N / P + log(P))

If N is much larger than P, the speedup is essentially P.

We define the efficiency E(P,N) to be S(P,N) / P.
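The two definitions can be evaluated numerically for the example cost model T(P,N) = N / P + log(P). This is a sketch only: the function names are assumptions, and the logarithm is taken to base 2, which the example does not specify.

```python
import math

def run_time(p, n):
    """Example cost model: T(P,N) = N/P + log(P), with log assumed to be base 2."""
    return n / p + math.log2(p)

def speedup(p, n):
    """S(P,N) = T(1,N) / T(P,N)."""
    return run_time(1, n) / run_time(p, n)

def efficiency(p, n):
    """E(P,N) = S(P,N) / P."""
    return speedup(p, n) / p
```

With N = 1,000,000 and P = 16 the speedup is just under 16, illustrating that it is essentially P when N is much larger than P. With N = 16 and P = 16 the speedup is only 3.2 and the efficiency 0.2, because the log(P) term dominates.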

Exercise 27.4. Suppose we implement even-odd transposition sort for N elements on N processors and another version on one processor. What are the speedup and efficiency in this case?

Amdahl's Law. Amdahl's Law (G.M. Amdahl, 1967) gives a quick way to estimate the speedup possible for a problem. Let F be the fraction of the time that a problem can use the parallel enhancement. Suppose

E(O) = execution time without the parallel enhancement;

E(P) = execution time with the parallel enhancement;

S(P) =speedup from parallel enhancement;

S = overall speedup.

Then we have

S = E(O) / E(P) = 1 / { 1 - F + F / S(P) }

Example. Suppose a parallel portion of the algorithm runs 10 times faster than the serial portion but is only usable 40% of the time. What is the overall speedup?

F = 0.4, S(P) = 10, S = 1 / { 0.6 + 0.4/10 } = 1 / 0.64 = 1.56 (approximately)

For example, if 60% of the work is in I/O, and this cannot be parallelized, then the above calculation gives the "true" speedup attainable.
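Amdahl's Law is easy to evaluate directly. The function below is a minimal sketch; its name and parameter names are assumptions, with f standing for F and s_p for S(P).

```python
def amdahl_speedup(f, s_p):
    """Overall speedup when a fraction f of the time is sped up by a factor s_p."""
    if not 0.0 <= f <= 1.0 or s_p <= 0:
        raise ValueError("need 0 <= f <= 1 and s_p > 0")
    # The unenhanced part takes 1 - f of the original time, the enhanced part f / s_p.
    return 1.0 / ((1.0 - f) + f / s_p)
```

With f = 0.4 and s_p = 10 this gives 1.5625, the value worked out in the example. However large s_p becomes, the result stays below 1 / (1 - f), so the serial portion caps the overall speedup.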

Exercise 27.5. Suppose 50% of the time is spent on I/O. What is the speedup if we can increase the speed of the computation portion by a factor of 5?

A theoretical measure. Next quarter we will learn about the problem classes P and NP. Roughly speaking, a problem is in class P if there is an algorithm to solve it which runs (on a sequential machine) in time bounded by a polynomial in the input size. For example, sorting n integers is in P, as are finding the biconnected components of a graph or finding the strongly connected components of a digraph. (One problem which is not known to be in P is determining whether a graph has a Hamiltonian cycle, i.e., a cycle which visits each vertex exactly once. This can easily be solved for some graphs, but there is no known algorithm which will answer the question in polynomial time p(n) for any graph G with n vertices). For parallel computing, a similar problem classification has been made.

We define NC to be the class of problems that can be solved (on a PRAM) by a parallel algorithm where P, the number of processors, is bounded by a polynomial in the input size, and the number of time steps is bounded by a polynomial in the log of the input size. For example, sorting is in NC, since the (PRAM) sort we exhibited above requires n processors and (log2n)^2 time steps to sort n elements. It turns out that this class is the same, even if we make the stricter requirement that every processor in our network has bounded degree.