Saturday, December 19, 2009

Final project presentation

The final project PowerPoint is here


Friday, December 18, 2009

Final Project Write-up, Part 4: The Implementation

Frontend

Our language is a modification of the 164 language, so we already had the bytecode interpreter and the grammar. We ended up modifying the bytecode to implement while loops, if statements, and function definitions as dedicated bytecode instructions instead of as syntactic sugar in terms of conds and lambdas. Each program is compiled just like a 164 program, but instead of feeding the bytecode directly into an interpreter, we first build a dependency graph from the bytecode. We then send that graph (a DAG) to a modified 164 interpreter that simulates parallel execution.

The Core Language

Although we used the 164 grammar and a modified version of the bytecode, our final language differs significantly from the 164 language. To simplify the project, we removed some features of the 164 language. First, in order to build the dependency graph statically, we eliminated first-class functions and lambdas. Therefore, while the 164 language implements if statements and while loops as lambdas, we implement them as separate bytecode instructions. In addition, our language is fully functional in the sense that every variable a function refers to must be passed to it as an argument; functions have no free variables. Finally, all mutable variables are passed by value. Lists are passed by reference, since mutation of lists is not allowed.

We did not implement objects or structs.

Internal Representation

The program is converted into a DAG which is sent to our interpreter. It would be better to have the compiler automatically insert thread operations into the program before final compilation, but for this project we think the representation of the program as a DAG is sufficient. Our main goal was to effectively parallelize programs.

Compiler

Each program goes through two stages of compilation: the initial compilation into bytecode (based on the 164 bytecode) and the compilation into a dependency DAG. The first compilation is nearly identical to the 164 compilation, except that we modified the bytecode to include "def-func", "if-else", and "while" instructions. Compilation into the DAG is more involved and tracks five kinds of dependencies:

1. Each read of a variable depends on the instruction that wrote to it last. We call this a W-R dependency.

2. Writes depend on earlier writes to the same variable (a W-W dependency) so that, when we later read the variable, we are sure to read the correct value.

3. A write depends on all previous reads of the same variable (an R-W dependency), because those reads must occur before the write; otherwise they would read an incorrect value.

4. All commands (input/print) must execute in program order. If the user intends to print "1 2 3", we do not want the program to print "3 2 1". To enforce this, we look inside functions to see whether a "print" or "input" statement could ever execute during a call to that function. This involves doing a DFS from a print or input node in a function dependency graph to see if it is in the same connected component as the function. If it is, then the function is treated like a command.

We do a similar search to make sure all the functions that a function f may call are defined before f is called.

5. All instructions must precede the return instruction, because we do not want a block of code to return before the other instructions in the block have completed. In the following code, if the DAG did not have this sink edge, then x could not be guaranteed to be 1 when it is printed.

def x = 0
if (true) {
____def y = 0
____x = 1
____y
}
print x




To create the graph, we walk down the bytecode and compute each instruction's parents from the bytecode we have already seen.
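
In rough Python pseudocode, the construction looks like the sketch below. The Node class, the reads/writes sets, and the is_command flag are illustrative stand-ins for the corresponding pieces of our bytecode, not the actual names in our implementation.

from collections import defaultdict

class Node:
    """One bytecode instruction plus its incoming dependency edges."""
    def __init__(self, instr):
        self.instr = instr
        self.parents = set()   # nodes that must execute before this one

def build_dag(instrs):
    """instrs: bytecode in program order; each instruction is assumed to carry
    .reads and .writes (sets of variable names) and .is_command (True for
    print/input, or for calls that may reach a print/input)."""
    nodes = [Node(i) for i in instrs]
    last_write = {}                        # variable -> node that last wrote it
    reads_since_write = defaultdict(list)  # variable -> reads since that write
    last_command = None
    for node in nodes:
        instr = node.instr
        for var in instr.reads:                  # 1. W-R dependency
            if var in last_write:
                node.parents.add(last_write[var])
        for var in instr.writes:
            if var in last_write:                # 2. W-W dependency
                node.parents.add(last_write[var])
            for read in reads_since_write[var]:  # 3. R-W dependency
                node.parents.add(read)
        if instr.is_command:                     # 4. commands stay in order
            if last_command is not None:
                node.parents.add(last_command)
            last_command = node
        for var in instr.reads:                  # update bookkeeping afterwards
            reads_since_write[var].append(node)
        for var in instr.writes:
            last_write[var] = node
            reads_since_write[var] = []
    if nodes:                                    # 5. everything precedes the sink
        sink = nodes[-1]                         # here: the block's return/last instruction
        for node in nodes[:-1]:
            sink.parents.add(node)
    return nodes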

Interpreter

The interpreter traverses the DAG in a randomized order, executing the instruction at each node while ensuring that every instruction executes only after all of its parent instructions have executed. There is a limitation to our simulation, however: since a function call is executed as a single instruction, the executions of two function bodies never interleave, though they can execute in a different order.
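
A minimal sketch of this execution loop, reusing the Node objects from the sketch above; execute_instruction is a stand-in for the modified 164 interpreter's instruction dispatch:

import random

def simulate_parallel(nodes, execute_instruction):
    """Repeatedly pick a random ready node (all parents finished) and run it."""
    done = set()
    remaining = list(nodes)
    while remaining:
        ready = [n for n in remaining if n.parents <= done]
        node = random.choice(ready)   # randomized order models parallel execution
        execute_instruction(node.instr)
        done.add(node)
        remaining.remove(node)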

Measuring Parallelism

To measure the amount of parallelism this achieves, we take the ratio of the length of the longest source-to-sink path in the DAG (the critical path) to the number of bytecode operations in the program; the smaller the ratio, the more parallelism has been exposed.
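
Computed over the Node objects from the earlier sketch, the measurement is roughly:

def parallelism_ratio(nodes):
    """Length of the longest source-to-sink path divided by the number of
    instructions; a ratio near 1 means the program is essentially sequential."""
    depth = {}
    def longest_path_ending_at(node):
        if node not in depth:
            depth[node] = 1 + max(
                (longest_path_ending_at(p) for p in node.parents), default=0)
        return depth[node]
    return max(longest_path_ending_at(n) for n in nodes) / len(nodes)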

Debugging

We debugged our implementation by using GraphViz to visualize the DAG and, of course, by running programs repeatedly: since the interpreter randomizes the order of the instructions, a missing dependency edge shows up as nondeterministic output.
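
The GraphViz output is just a textual dump of the DAG's edges, roughly like this (again assuming the Node objects from the sketch above):

def to_dot(nodes, path="dag.dot"):
    """Write the dependency DAG in GraphViz dot format for visual inspection."""
    ids = {node: i for i, node in enumerate(nodes)}
    with open(path, "w") as f:
        f.write("digraph dependencies {\n")
        for node in nodes:
            f.write('  n%d [label="%s"];\n' % (ids[node], node.instr))
            for parent in node.parents:
                f.write("  n%d -> n%d;\n" % (ids[parent], ids[node]))
        f.write("}\n")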

Final Project Write-up, Part 3: Supported Features

1. The Domain

Our algorithm for auto-parallelization can be applied to general programs written in a modified version of the 164 language that we developed in the previous project. The approach is ideal for divide-and-conquer algorithms like mergesort. Although the language only allows mutation of primitives and does not make functions first-class, the constructs it provides for operating on strings, lists, and matrices still make it versatile.

2. Programming Model

We sought a language model that is as powerful as possible while remaining simple enough for auto-parallelization to be feasible. Thus, our language follows the structured and functional programming paradigms.

However, functions in the language are not entirely functional: commands such as input and print statements are additional ways for functions to receive input and produce output. (This is the reason functions in our language cannot be first-class.) Such functions must be treated as if they were themselves commands; in particular, the order of commands cannot be rearranged. So that we can statically determine whether a function should be treated as a command, all functions are global, as in C, and must be declared before any other program statements. Functions can thus live in a namespace separate from that of variables, as in Logo.
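
The check itself is the reachability search described in Part 4. A rough Python sketch, where the call-graph format and the function names are illustrative only:

def command_like_functions(call_graph, directly_does_io):
    """call_graph: {function name: set of callee names} (illustrative format).
    directly_does_io: functions whose bodies contain a print or input.
    Returns every function that may perform a command, found by a DFS over
    the call graph starting from each function."""
    result = set()
    for start in call_graph:
        stack, seen = [start], set()
        while stack:
            f = stack.pop()
            if f in seen:
                continue
            seen.add(f)
            if f in directly_does_io:
                result.add(start)
                break
            stack.extend(call_graph.get(f, ()))
    return result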

Besides basic control structures such as if- and while-statements, our language supports operations on strings and lists (called deques, as in C++). Since deques can be nested, they can be thought of as matrices, so our language also allows matrix and element-wise operations. Strings and deques can be concatenated using the addition operator and split using the slice operator, as in Python. Because strings and deques cannot be mutated, these operators are how they are manipulated.
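
Since these operators behave as they do in Python, the mutation-free style looks like the following Python snippet, which mirrors the way merge() below takes sequences apart and rebuilds them:

# Concatenation and slicing build new sequences instead of mutating old ones.
xs = [1, 3, 5]
ys = [2, 4]
combined = xs + ys                        # concatenation with the addition operator
head, tail = combined[0], combined[1:]    # splitting with the slice operator
print(head, tail)                         # 1 [3, 5, 2, 4]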

3. Example Programs

A program exercising all five kinds of dependencies:

def x = 0
if (true) {
____def y = x
____x = 1
____y
}
print x
print "end"




Fibonacci:

# fibonacci

def fib(n){
____if (n < 3) {
________1
____} else {
________def a = fib(n-1)
________def b = fib(n-2)
________a + b
____}
}

def first = fib(1)
def second = fib(2)
def third = fib(11)

print first
print second
print third




Mergesort:

# merge sort

def merge(list1, list2) {
____if (len(list2) == 0) {
________list1
____} else if (len(list1) == 0) {
________list2
____} else if (list2[0] < list1[0]) {
________[list2[0]] + merge(list1, list2[1:])
____} else {
________[list1[0]] + merge(list1[1:], list2)
____}
}

def mergesort(the_list) {
____def length = len(the_list)
____if (length <= 1) {
________the_list
____} else if (length == 2) {
________def first = the_list[0]
________def second = the_list[1]
________if (second < first) {
____________[second, first]
________} else {
____________[first, second]
________}
____} else {
________# recursively call the program
________def mid_index = int(len(the_list) / 2)
________def a = mergesort(the_list[:mid_index])
________def b = mergesort(the_list[mid_index:])
________merge(a, b)
____}
}

def test1 = [4,6,5,8,9,3,1,7,10,2]
print mergesort(test1)

def test2 = [6,2,4,5,1,8,3,10,9,7]
print mergesort(test2)

def test3 = [5,8,3,1,4,9,6,2,10,7]
print mergesort(test3)




Final Project Write-up, Part 2: Existing Solutions

The Cilk project provides a straightforward example of parallelizing the fib() function at http://supertech.csail.mit.edu/cilk/intro.html. Although Cilk simplifies the process of parallel programming, it relies on the programmer to specify how the program is split up. Because we ensure that variables are passed by value or are immutable, we are able to detect automatically how to parallelize this function (see Part 4).

Final Project Write-up, Part 1: The Problem

1. What problem are you addressing?

We address the problem of automatically splitting up a sequential program into components that can be run in parallel.

2. Why this problem is hard.

We can give two reasons why it is difficult to convert sequential programs into parallel programs. First, the programmer's focus shifts from the natural algorithm for solving the problem to how to derive a parallel algorithm; it may be difficult to force parallelism into the abstractions used to solve the problem. Second, the implementation may require a lot of overhead: a programmer has to deal with constructs for parallelism, such as threads and mutexes, and with the issues that arise with them, such as when to join threads or lock mutexes.

3. Why this problem is worth solving.

People have come to expect processor performance to increase exponentially each year, in other words, to follow Moore's Law. However, because processors are beginning to hit the power wall, where the power required to keep doubling performance becomes the limiting factor, the trend in processor design is to parallelize: multiple slower processors working in parallel are used rather than a single fast processor.

However, unlike when writing sequential programs, the hardware currently cannot be abstracted away for programmers writing parallel programs, so taking advantage of parallelism is hard. If converting sequential programs into parallel programs can be done automatically, programmers will be free to concentrate on solving problems with high-level thinking rather than being bogged down by the low-level details of managing threads. Such code would also be easier to maintain, especially when it is adapted to new requirements.

Wednesday, December 9, 2009

Final Project Milestone: Three Programs to Demo


Prompt from Ras Bodik:
Next milestone. To help you bite your teeth into the design problem, we ask that for the next milestone (this Friday 9pm) you post on your blogs three programs that you want to demo at the poster session.
Program 1: Fibonacci

Code:


Bytecode:


Parallel bytecode:


Output:

1
1
89


Program 2: Loops

Code:


Bytecode:


Parallel bytecode:


Output:

0
1
2
3
4
z is 2
0
-1
-2
-3
-4
z is 3
0
-1
-2
-3
-4
z is 4
0
-1
-2
-3
-4
z is 5
0
-1
-2
-3
-4
z is 6
0
-1
-2
-3
-4
5


Program 3: Merge Sort

Code:


Function dependence graph:


Parallel bytecode:


Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Friday, December 4, 2009

Revisions

Question posed by Ras Bodik:

Think and propose how you want to measure the amount of parallelism that your tool have exposed. What level of parallelism will you consider to be a "success"?

As discussed in the original proposal, sequential programs will be parallelized by converting them into a DAG of read-write dependences. To measure the amount of parallelism this achieves, we will take the ratio of the length of the longest source-to-sink path in this DAG to the number of bytecode operations in the program. Although we had wanted to create the DAG before compiling to bytecode, we now think it may be better to create the DAG after compiling to bytecode, because it will be easier to figure out the read and write sets from the bytecode instructions. We will also be able to achieve a finer degree of parallelism at the bytecode level than at the statement level. For instance, consider the following statement.

def x = 1 + 2 + 3 + 4 + y

If the DAG were constructed at the statement level, this statement would not execute until y has been computed. On the other hand, if the DAG were constructed at the instruction level, the additions 1 + 2 + 3 + 4 could be executed in parallel with the process that computes y.
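
To make this concrete, the statement might break down into instruction-level steps with read and write sets roughly as follows; the temporaries t1 through t3 and the instruction forms shown are hypothetical, not our actual bytecode:

# Hypothetical instruction-level breakdown of "def x = 1 + 2 + 3 + 4 + y",
# with the read and write sets the DAG construction would use.
steps = [
    {"op": "t1 = 1 + 2",  "reads": set(),       "writes": {"t1"}},
    {"op": "t2 = t1 + 3", "reads": {"t1"},      "writes": {"t2"}},
    {"op": "t3 = t2 + 4", "reads": {"t2"},      "writes": {"t3"}},
    {"op": "x = t3 + y",  "reads": {"t3", "y"}, "writes": {"x"}},
]
# Only the final step reads y, so the chain t1 -> t2 -> t3 can run in parallel
# with whatever computes y; a statement-level DAG would make the whole statement wait.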

Comments from Ras Bodik:

Your solution seems to execute the program in a dataflow fashion (“The interpreter will then run the program by traversing this DAG starting from all its source nodes.”) This is likely to be very expensive in terms of runtime overhead. Instead, try to find large functions that can be executed in parallel without tracking (or traversing) any dependence edges during the execution. Often, divide and conquer programs have such a structure, eg merge sort. Look at programs that the CIlk project parallelizes. Can you discover automatically that these programs are parallel?

Although, as Ras pointed out, the DAG method may incur runtime overhead that defeats the parallelism it detects, we will stick with this method because it seems to be the most straightforward solution. The DAG can be post-processed to unparallelize the parts that would create too much of this overhead. However, to simplify the project, we will not implement this post-processing procedure or even formulate a test to detect the parts of the DAG that will have this overhead. (Another post-processing method is converting parts of the DAG that look like singly linked lists into single blocks of instructions; this removes the runtime overhead of traversing those parts of the DAG.) We will also stick with the DAG method because it will automatically detect and parallelize large pieces of programs, such as those that implement divide-and-conquer algorithms. For instance, mergesort divides the list to be sorted into two halves and invokes two recursive calls, one for each half of the list. Since the results of these calls are only needed in the "merge" step, the DAG method will have these two processes run in parallel. Thus, although naively running a program as a DAG of read-write dependences may have too much overhead, we feel that the DAG is a useful formulation for automatically detecting parallelism. In real-world applications, this DAG can be converted into code that keeps the detected parallelism, so dependence edges will not be traversed explicitly.
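
A rough Python sketch of that chain-collapsing post-processing, assuming nodes in bytecode (topological) order with parents sets as in the Part 4 sketch above; block-level edges would then be derived from the remaining parents of each block's members:

def collapse_chains(nodes):
    """Group maximal single-parent/single-child chains of the DAG into blocks,
    so the interpreter pays the traversal overhead once per block rather than
    once per instruction. Assumes nodes are listed in topological order."""
    children = {n: set() for n in nodes}
    for n in nodes:
        for p in n.parents:
            children[p].add(n)
    blocks, merged = [], set()
    for n in nodes:
        if n in merged:
            continue
        block, cur = [n], n
        merged.add(n)
        while len(children[cur]) == 1:
            (child,) = children[cur]
            if len(child.parents) != 1:   # child also waits on another branch
                break
            block.append(child)
            merged.add(child)
            cur = child
        blocks.append(block)
    return blocks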

Comment from Ras Bodik:

You answer the required questions but in your case you need to describe also how you will discover parallelism. Please find one more complex example and run the algorithm that you intend to implement by hand. Please email us what you come up with. We will regrade your proposal after you resubmit.

As we have discussed, we will detect parallelism by creating a DAG where instructions that read a variable will have an arc from the last instruction that wrote to the variable. If the program cannot be parallelized at all, this DAG will simply be a singly-linked list.

Here is an example of converting a sequential program into a parallel one.

Code:

def_const fib(n) {
____if (n <= 1) {
________1
____} else {
________def x = fib(n-1)
________def y = fib(n-2)
________x + y
____}
}
fib(5)


Pseudo-bytecode with read and write sets (note the arcs drawn from write sets to read sets):



DAG given to the interpreter (note that this is exactly the same as the pseudo-bytecode but redrawn):



We should be able to detect that the two recursive fib calls are independent, so they can be parallelized. Furthermore, we will detect that "return x+y" depends on reading x and y, so the two fib calls must finish before "return x+y" is executed. The Cilk project has this example at http://supertech.csail.mit.edu/cilk/intro.html, but because we assume from the outset that our variables are passed by value, we should be able to detect automatically where Cilk's "spawn" and "sync" commands would go.

Objects are much trickier. One idea is to use class-based objects: we store each class's functions once, objects are just structs of fields (we think this is roughly how it is done in C), and whenever we assign an object to a variable we copy over its fields. This, however, is inefficient. We are still thinking about it.