Friday, December 4, 2009

Revisions

Question posed by Ras Bodik:

Think about and propose how you want to measure the amount of parallelism that your tool has exposed. What level of parallelism will you consider to be a "success"?

As discussed in the original proposal, a sequential program will be parallelized by converting it into a DAG of read-write dependences. To measure the amount of parallelism this achieves, we will take the ratio of the length of the longest path in this DAG (from sources to sinks) to the total number of bytecode operations in the program; the closer this ratio is to zero, the more parallelism we have exposed. Although we had originally wanted to create the DAG before compiling to bytecode, we now think it may be better to create it afterward, because the read and write sets are easier to determine from bytecode instructions. Constructing the DAG at the bytecode level also exposes a finer degree of parallelism than constructing it at the statement level. For instance, consider the following statement.

def x = 1 + 2 + 3 + 4 + y

If the DAG were constructed at the statement level, this statement could not execute until y was available. If the DAG were constructed at the instruction level, on the other hand, the additions 1 + 2 + 3 + 4 could execute in parallel with the computation of y.
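To make this concrete, here is a small Python sketch of the metric; the DAG representation, names, and the instruction-level breakdown of the example are just for illustration, not part of our tool.

from functools import lru_cache

def parallelism_ratio(deps):
    # deps maps each instruction to the set of instructions it depends on.
    @lru_cache(maxsize=None)
    def longest_path_to(node):
        # Length (in instructions) of the longest dependence chain ending at node.
        preds = deps[node]
        if not preds:
            return 1
        return 1 + max(longest_path_to(p) for p in preds)

    critical_path = max(longest_path_to(n) for n in deps)
    return critical_path / len(deps)

# "def x = 1 + 2 + 3 + 4 + y" at the instruction level: the constant
# additions chain together, the computation of y is independent, and
# only the final addition depends on both.
deps = {
    "t1 = 1 + 2":  set(),
    "t2 = t1 + 3": {"t1 = 1 + 2"},
    "t3 = t2 + 4": {"t2 = t1 + 3"},
    "y = ...":     set(),              # whatever computes y
    "x = t3 + y":  {"t3 = t2 + 4", "y = ..."},
}
print(parallelism_ratio(deps))         # 0.8: a critical path of 4 out of 5 ops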

Comments from Ras Bodik:

Your solution seems to execute the program in a dataflow fashion ("The interpreter will then run the program by traversing this DAG starting from all its source nodes."). This is likely to be very expensive in terms of runtime overhead. Instead, try to find large functions that can be executed in parallel without tracking (or traversing) any dependence edges during the execution. Often, divide-and-conquer programs have such a structure, e.g., merge sort. Look at programs that the Cilk project parallelizes. Can you discover automatically that these programs are parallel?

Although, as Ras pointed out, the DAG method may incur runtime overhead that defeats the parallelism it detects, we will stick with this method because it seems to be the most straightforward solution. The DAG can be post-processed to sequentialize the parts where the overhead would outweigh the benefit. However, to simplify the project, we will not implement this post-processing step, or even formulate a test that detects which parts of the DAG have this problem. (Another post-processing step would convert parts of the DAG that look like singly-linked lists into single blocks of instructions, eliminating the runtime overhead of traversing those parts of the DAG; see the sketch below.)

We will also stick with the DAG method because it automatically detects and parallelizes large programs, such as those that implement divide-and-conquer algorithms. For instance, mergesort divides the list to be sorted into two halves and invokes a recursive call on each half. Since the results of these calls are only needed in the "merge" step, the DAG method will run the two calls in parallel. Thus, although naively running a program as a DAG of read-write dependences may carry too much overhead, we feel that the DAG is a useful formulation for automatically detecting parallelism. In real-world applications, this DAG could be compiled into code that keeps the detected parallelism without explicitly traversing dependence edges.
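Here is a rough Python sketch of that chain-collapsing idea, assuming the DAG is stored as a successor map (the representation and names are ours, chosen for illustration):

def collapse_chains(succs):
    # succs maps each node to the set of its successors (sinks map to an
    # empty set). Collapse unbranched runs of nodes -- each with exactly
    # one predecessor and one successor -- into single blocks, so the
    # interpreter never pays per-edge overhead on list-shaped regions.
    preds = {n: set() for n in succs}
    for n, outs in succs.items():
        for m in outs:
            preds[m].add(n)

    blocks, seen = [], set()
    for n in succs:
        # Skip nodes that merely continue a chain; their chain head
        # will absorb them below.
        if n in seen or (len(preds[n]) == 1 and
                         len(succs[next(iter(preds[n]))]) == 1):
            continue
        block = [n]
        seen.add(n)
        # Extend the block while the chain stays unbranched.
        while len(succs[block[-1]]) == 1:
            (nxt,) = succs[block[-1]]
            if len(preds[nxt]) != 1:
                break
            block.append(nxt)
            seen.add(nxt)
        blocks.append(block)
    return blocks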

Comment from Ras Bodik:

You answer the required questions, but in your case you also need to describe how you will discover parallelism. Please find one more complex example and run the algorithm that you intend to implement by hand. Please email us what you come up with. We will regrade your proposal after you resubmit.

As we have discussed, we will detect parallelism by creating a DAG in which each instruction that reads a variable has an arc from the last instruction that wrote that variable. If the program cannot be parallelized at all, this DAG degenerates into a singly-linked list.
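A minimal Python sketch of that construction, assuming each bytecode instruction has already been annotated with its read and write sets (the instruction format here is hypothetical):

def build_dag(instructions):
    # instructions is a list of (reads, writes) pairs of variable-name
    # sets. Scan in program order, drawing an arc to each instruction
    # that reads a variable from the last instruction that wrote it
    # (only read-after-write arcs, as described above).
    last_writer = {}   # variable -> index of its most recent write
    deps = {i: set() for i in range(len(instructions))}
    for i, (reads, writes) in enumerate(instructions):
        for var in reads:
            if var in last_writer:
                deps[i].add(last_writer[var])
        for var in writes:
            last_writer[var] = i
    return deps

# "def x = 1 + 2 + 3 + 4 + y" as hypothetical three-address bytecode:
prog = [
    (set(),       {"t1"}),   # t1 = 1 + 2
    ({"t1"},      {"t2"}),   # t2 = t1 + 3
    ({"t2"},      {"t3"}),   # t3 = t2 + 4
    (set(),       {"y"}),    # y = <some computation>
    ({"t3", "y"}, {"x"}),    # x = t3 + y
]
print(build_dag(prog))  # {0: set(), 1: {0}, 2: {1}, 3: set(), 4: {2, 3}}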

Here is an example of converting a sequential program into a parallel one.

Code:

def_const fib(n) {
  if (n <= 1) {
    1
  } else {
    def x = fib(n-1)
    def y = fib(n-2)
    x + y
  }
}
fib(5)


Pseudo-bytecode with read and write sets (note the arcs drawn from write sets to read sets):

[figure: the fib bytecode annotated with read/write sets and dependence arcs]

DAG given to the interpreter (note that this is exactly the same as the pseudo-bytecode, redrawn):

[figure: the same dependences redrawn as a DAG]
We should be able to detect that the two recursive fib calls are independent, so they can run in parallel. Furthermore, we will detect that "x + y" depends on reading both x and y, so the two fib calls must finish before it executes. The Cilk project uses this same example at http://supertech.csail.mit.edu/cilk/intro.html, but because we initially assume that all variables are passed by value, we should be able to detect automatically where Cilk's "spawn" and "sync" commands would go.
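For illustration, the structure we hope to recover corresponds to something like this Python sketch; the threads stand in for Cilk's spawn/sync (and, in CPython, only illustrate the structure rather than giving real speedup):

import threading

def fib(n):
    if n <= 1:
        return 1
    result = {}
    spawned = threading.Thread(target=lambda: result.update(x=fib(n - 1)))
    spawned.start()       # "spawn": fib(n-1) has no dependence on fib(n-2)
    y = fib(n - 2)        # run fib(n-2) in the current thread meanwhile
    spawned.join()        # "sync": x + y reads both results
    return result["x"] + y

print(fib(5))             # 8, the same answer as the sequential version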

Objects are much trickier. One idea is class-based objects: each class's functions are stored once, and each object is just a struct of fields (we believe this is roughly how object systems are written in C), so that assigning an object to a variable copies over its fields. This, however, is inefficient. We are still thinking about it.
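A minimal Python sketch of that object model, under our pass-by-value assumption (all names here are hypothetical; this is speculation, not a settled design):

import copy

class Point:
    # Methods live on the class, stored once.
    def __init__(self, x, y):
        self.x, self.y = x, y     # the "struct" part: per-object fields

    def translate(self, dx, dy):
        self.x += dx
        self.y += dy

def assign(obj):
    # Pass-by-value assignment: copy the fields into a fresh struct, so
    # two variables never alias the same state and the read/write
    # analysis stays simple. The copy on every assignment is the
    # inefficiency mentioned above.
    return copy.copy(obj)

p = Point(1, 2)
q = assign(p)          # q gets its own copy of p's fields
q.translate(3, 3)
print(p.x, p.y)        # 1 2 -- unaffected by the write to q
print(q.x, q.y)        # 4 5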
