DUE: Wednesday, October 11, 2006

Compiling Arithmetic Expressions

Overview: Your assignment for this project is to write a simple compiler from infix arithmetic expressions to an imaginary assembly language. Assembly language is "a human-readable notation for the machine language used to control a specific computer architecture" [wikipedia]. The imaginary assembly language instructions you will be generating here have the form:

add a b ==> c
mul b 3 ==> d

There are four assembly instructions, corresponding to the basic arithmetic operations: mul, div, add, sub.
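
As an illustration of the instruction format, the following minimal sketch maps an operator character to its instruction name and formats one line of output. The names opMnemonic and formatInstruction are hypothetical helpers, not part of the assignment, and the sketch assumes the operator passed in is one of the four valid ones.

```c
#include <stdio.h>

/* Map an operator character to its instruction name; NULL if unknown. */
const char *opMnemonic(char op)
{
    switch (op) {
    case '+': return "add";
    case '-': return "sub";
    case '*': return "mul";
    case '/': return "div";
    default:  return NULL;
    }
}

/* Format one instruction, such as "mul b 3 ==> d", into buf.
   Assumes op is one of + - * / (hypothetical helper). */
void formatInstruction(char *buf, size_t size, char op,
                       const char *left, const char *right, const char *dest)
{
    snprintf(buf, size, "%s %s %s ==> %s", opMnemonic(op), left, right, dest);
}
```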

Your final program should read an expression from standard input (the keyboard) and print the compiled result to standard output. In the sample session below, the text after the "Enter an expression:" prompt is the user's input:

Enter an expression: a + b - 56 * 34 / b3_3f
add a b ==> var000
mul 56 34 ==> var001
div var001 b3_3f ==> var002
sub var000 var002 ==> var003

As you can see, the compiler breaks the infix arithmetic expression into single arithmetic steps, storing each result in a newly generated variable name as it goes. I will provide you with a simple function to generate the variable names.

Algorithm

(This problem is taken from Objects, Abstraction, Data Structures, and Design using Java by Koffman and Wolfgang.)

Assume that the tokens (operators and operands) in the input string are separated by spaces. You will use two stacks in this algorithm. Your program should read in each token and process it as follows:

- If the token is neither an operand nor an operator, display a helpful error message and terminate the program.
- If it is an operand, push it onto the operand stack.
- If it is an operator and the operator stack is empty, or the current operator has higher precedence than the operator on top of the operator stack, push the current operator onto the operator stack.
- If it is an operator with the same or lower precedence than the operator on top of the operator stack, the operator on top of the stack must be evaluated first. Pop that operator off the operator stack along with a pair of operands from the operand stack (the first operand popped is the right-hand operand, the second is the left-hand one) and write a new line of output. Then push the variable selected to hold the result onto the operand stack. Repeat this until the top of the operator stack has lower precedence than the current operator, or until the operator stack is empty. At that point, push the current operator onto the operator stack and examine the next token in the input.

When the end of the input is reached, pop each remaining operator, one at a time, along with its operand pair and output a line. Remember to push the result variable onto the operand stack after each line of output is generated.
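
The two-stack procedure described above might be sketched as follows. This is only one possible arrangement, not a reference solution: the Compiler struct, the fixed stack capacity MAX_STACK, the DELIMS separator set, and the function names are all assumptions made for illustration, and temporary names are stored in a fixed array rather than allocated one by one.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_STACK 128         /* assumed capacity of each stack */
#define DELIMS    " \t\r\n"   /* assumed token separators */

typedef struct {
    const char *operands[MAX_STACK];   /* operand stack */
    char operators[MAX_STACK];         /* operator stack */
    char temps[MAX_STACK][16];         /* storage for "var000"-style names */
    int nOperands, nOperators, nTemps;
} Compiler;

/* 2 for * and /, 1 for + and -, 0 for anything else. */
static int precedence(char op)
{
    if (op == '*' || op == '/') return 2;
    if (op == '+' || op == '-') return 1;
    return 0;
}

static const char *mnemonic(char op)
{
    return op == '+' ? "add" : op == '-' ? "sub" : op == '*' ? "mul" : "div";
}

/* An integer, or a name made of letters, digits, and underscores. */
static int validOperand(const char *tok)
{
    int numeric = isdigit((unsigned char) tok[0]) != 0;
    size_t i;
    for (i = 0; tok[i] != '\0'; i++) {
        unsigned char ch = (unsigned char) tok[i];
        if (numeric ? !isdigit(ch) : (!isalnum(ch) && ch != '_')) return 0;
    }
    return i > 0;
}

/* Pop one operator and two operands, print a line, push the result. */
static int reduce(Compiler *c, FILE *out)
{
    const char *left, *right;
    char *dest;
    char op;
    if (c->nOperators < 1 || c->nOperands < 2 || c->nTemps == MAX_STACK) return -1;
    op = c->operators[--c->nOperators];
    right = c->operands[--c->nOperands];   /* popped first: right-hand operand */
    left = c->operands[--c->nOperands];
    dest = c->temps[c->nTemps];
    sprintf(dest, "var%03d", c->nTemps++);
    fprintf(out, "%s %s %s ==> %s\n", mnemonic(op), left, right, dest);
    c->operands[c->nOperands++] = dest;
    return 0;
}

/* Compile one line (modified in place by strtok); 0 on success, -1 on error. */
int compileLine(char *line, FILE *out)
{
    Compiler c;
    char *tok;
    c.nOperands = c.nOperators = c.nTemps = 0;
    for (tok = strtok(line, DELIMS); tok != NULL; tok = strtok(NULL, DELIMS)) {
        int prec = (tok[1] == '\0') ? precedence(tok[0]) : 0;
        if (prec > 0) {
            /* Evaluate stacked operators of the same or higher precedence first. */
            while (c.nOperators > 0 && precedence(c.operators[c.nOperators - 1]) >= prec)
                if (reduce(&c, out) != 0) return -1;
            if (c.nOperators == MAX_STACK) return -1;
            c.operators[c.nOperators++] = tok[0];
        } else if (validOperand(tok) && c.nOperands < MAX_STACK) {
            c.operands[c.nOperands++] = tok;
        } else {
            fprintf(stderr, "error: cannot use token \"%s\"\n", tok);
            return -1;
        }
    }
    while (c.nOperators > 0)
        if (reduce(&c, out) != 0) return -1;
    return c.nOperands == 1 ? 0 : -1;
}
```

Calling compileLine on a writable copy of the line "a + b - 56 * 34 / b3_3f" with stdout as the second argument prints the four instructions from the sample session above. A malformed expression such as "a + + b" makes the function return -1, because an operator is left without a full pair of operands.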

Notes

Don't make this problem harder than it is. Read the algorithm above and make sure you understand it well enough to be able to carry it out step by step with paper and pencil. The completed program should probably not be longer than this web page. In implementing the algorithm, pay attention to the following programmatic details:

Valid operands are either integers or variable names made up of letters, digits, and the underscore character. Variable names must not start with a digit, though. You may write a little function isOperand that uses functions from the standard library (ctype.h) - isalpha, isalnum, isdigit - to determine if a token is a valid operand.
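One way such an isOperand function might look is sketched below. It makes two assumptions that the rules above leave open: a name may begin with an underscore, and a leading minus sign is not accepted as part of an integer (since - is an operator).

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if tok is an integer or a variable name, 0 otherwise. */
int isOperand(const char *tok)
{
    size_t i;
    if (tok == NULL || tok[0] == '\0')
        return 0;
    if (isdigit((unsigned char) tok[0])) {
        /* Starts with a digit: every character must be a digit. */
        for (i = 1; tok[i] != '\0'; i++)
            if (!isdigit((unsigned char) tok[i]))
                return 0;
        return 1;
    }
    /* Assumption: a name may start with a letter or an underscore. */
    if (!isalpha((unsigned char) tok[0]) && tok[0] != '_')
        return 0;
    for (i = 1; tok[i] != '\0'; i++)
        if (!isalnum((unsigned char) tok[i]) && tok[i] != '_')
            return 0;
    return 1;
}
```

The casts to unsigned char are there because the ctype.h functions have undefined behavior when given a negative char value.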
To generate temporary variable names, you can just keep a small integer for the next variable to use and then use the following function to convert it to a string (using the standard library sprintf function):
```
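#include <stdio.h>   /* sprintf */
#include <stdlib.h>  /* malloc */

/* Returns a newly allocated name such as "var007", or NULL if malloc fails. */
char* tempVarName(short num)
{
  char *varName = (char*) malloc(10);  /* fits "var-32768" plus '\0' */
  if (varName != NULL)
    sprintf(varName, "var%03d", num);
  return varName;
}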
```
Don't worry about freeing the strings that are allocated and returned by this function because it will probably be hard to keep track of them to know when exactly to free them. Although this will technically cause your program to have a "memory leak" it should not affect its execution, practically speaking.
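The "small integer for the next variable" could be kept in a static counter, as in the sketch below. nextTempVar is a hypothetical wrapper, not part of the provided code; tempVarName is repeated here only so that the example compiles on its own.

```c
#include <stdio.h>
#include <stdlib.h>

/* Copy of the handout's helper, with a check for a failed malloc. */
char *tempVarName(short num)
{
    char *varName = (char *) malloc(10);
    if (varName != NULL)
        sprintf(varName, "var%03d", num);
    return varName;
}

/* Hypothetical wrapper: hands out var000, var001, ... in order. */
char *nextTempVar(void)
{
    static short nextNum = 0;   /* number of the next unused variable */
    return tempVarName(nextNum++);
}
```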
To tokenize the input, you can use the strtok function from the standard library (string.h). If your input string is stored in a char* variable called exp, the following example code will just print out the tokens one per line. Look up the description of strtok to understand how it works, what it returns, etc.
```
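#include <stdio.h>   /* printf */
#include <string.h>  /* strtok */
#define WHITESPACE " \t\r\n"  /* assumed delimiter set; not defined in this handout */

/* The statements below belong inside a function where exp holds the input line. */
char *tok = strtok( exp, WHITESPACE );  /* get first token; strtok modifies exp */
while (tok) {
   printf("%s\n", tok);
   tok = strtok( NULL, WHITESPACE ); /* for subsequent tokens pass NULL */
}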
```
Once you have compiled and tested this code, you can replace the printf with the implementation of the compilation algorithm described several paragraphs above.
The getline function defined in the following utility files will allow you to efficiently read in input lines of any length:
- getline.h
- getline.c
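While testing without those files, a stand-in based on the standard fgets function may be enough. The sketch below is not the provided getline and does not share its interface; readLine is a hypothetical name, and unlike the utility above it truncates any line longer than the buffer it is given.

```c
#include <stdio.h>
#include <string.h>

/* Reads one line from in into buf and strips the trailing newline.
   Returns 1 if a line was read, 0 at end of input or on error. */
int readLine(FILE *in, char *buf, int size)
{
    if (fgets(buf, size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}
```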

Extra Credit

For extra credit, extend the algorithm above in some way; for example, support input expressions with parentheses, or a better tokenizing function that doesn't need whitespace between tokens. If you do try adding features to your implementation, document them in a README file and submit that with your source code.
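
As a starting point for the tokenizing idea, here is a minimal sketch of a tokenizer that does not rely on whitespace. The function name nextToken and its cursor-based interface are assumptions made for illustration. It treats any run of letters, digits, and underscores as one token and every other non-space character as a one-character token, leaving it to the caller to decide whether each token is valid.

```c
#include <ctype.h>
#include <stddef.h>

/* Copies the next token starting at *cursor into tok (capacity size >= 1)
   and advances *cursor past it. Returns 1 if a token was found, 0 at the
   end of the input. Tokens longer than size - 1 are truncated. */
int nextToken(const char **cursor, char *tok, size_t size)
{
    const char *p = *cursor;
    size_t len = 0;

    while (isspace((unsigned char) *p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        return 0;
    }
    if (isalnum((unsigned char) *p) || *p == '_') {
        /* Operand candidate: a run of letters, digits, and underscores. */
        while (isalnum((unsigned char) *p) || *p == '_') {
            if (len + 1 < size)
                tok[len++] = *p;
            p++;
        }
    } else {
        /* Anything else (operator, parenthesis, stray symbol) stands alone. */
        if (len + 1 < size)
            tok[len++] = *p;
        p++;
    }
    tok[len] = '\0';
    *cursor = p;
    return 1;
}
```

Unlike strtok, this version leaves the input string unmodified, so the same line can be scanned more than once.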

CSC220A - Fall 2006 - Programming Project #2
