Lab 4
CSCI 320 Performance of Your Ultra
September 18, 2007

Purpose: To determine CPI of a processor. To compute MIPS of a processor. To practice enhancing the speed of a digital system. More exploration of the Verilog HDL notation.

Exercise 1: Using Non-Blocking Assignment Statements (<=)

In Exercise 3 you will modify your Ultra implementation to make it run faster. Up to this point, we specified that each register transfer took one clock cycle; therefore, each of your register transfers has a #clock in front of it. A non-blocking assignment (shown as <=) takes effect at the end of the clock cycle, that is, the destination register is written at the end of the clock cycle. A blocking assignment (shown as =) changes the destination register immediately. Because each register transfer had its own #clock in your implementation, you did not need to think (much) about whether an assignment should be blocking or non-blocking.

The modification in Exercise 3 will have you remove clock statements to make your processor run faster (have a lower CPI).  Therefore, you should go through the control sequence of your current implementation of Ultra3 and change any blocking assignments (=) to non-blocking (<=). However, you should continue to use (=) in the initialization of your machine.
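The difference between the two assignment forms can be seen in a small sketch that tries to swap two registers. The module name nb_demo, the 4-bit registers A and B, and the clock value of 1 are assumptions made for this illustration; none of them come from Ultra3.

```verilog
// Sketch only: nb_demo, A, B and clock = 1 are assumed for illustration.
module nb_demo;
  parameter clock = 1;
  reg [3:0] A, B;

  initial begin
    A = 1;                  // blocking (=) is fine for initialization
    B = 2;
    #clock A <= B;          // both right-hand sides are read now ...
           B <= A;          // ... so B receives the old A: a true swap
    #clock $display("non-blocking: A=%0d B=%0d", A, B);

    A = 1;
    B = 2;
    #clock A = B;           // A changes immediately ...
           B = A;           // ... so B receives the new A: no swap
    #clock $display("blocking:     A=%0d B=%0d", A, B);
  end
endmodule
```

The non-blocking version should report A=2 B=1, while the blocking version should report A=2 B=2. This is why combined register transfers in Exercise 3 need <=.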

Exercise 2: Compute the average CPI and MIPS for Ultra3

Using your final Ultra3 machine from Lab 3, run the following software program (PC starts at 10):

     3  DATA    2
     4  DATA    1
    10  LOADI   R3,2
    11  LOADD   R2,2
    12  LOAD    R1,3
    13  BNE     R1,R2,2
    14  ADD     R1,R1,R2
    15  STORE   R1,5
    16  ADD     R1,R0,R2
    17  BEQ     R1,R2,2
    18  JUMP    10
    19  STORED  R1,6
    20  LOADI   R1,4
    21  BLT     R2,R1,1
    22  JUMPD   21
    23  JUMP    24
    24  SUB     R1,R2,R3
    25  JUMPI   -15

Check to make sure your machine functions properly on the program.

Hand in the working Verilog code and output for the above software program, along with comments that show what it does and why you think it is correct.

Answer the following three questions (to be handed in).

A. What is the number of clock periods per instruction (CPI) for each instruction? That is, how many clock periods does each instruction take? An easy way to find out is to run the above program and count the clock periods needed for each instruction. Note that the CPI includes the time to fetch the instruction.

Use a $display() to print out the start of fetching an instruction and other $display()s to print the name of each instruction.
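One possible way to place these $display() calls is sketched below. The register names follow this lab, but the module name trace_demo, the two-step fetch, the single execute step, the variable start, and the instruction name printed are all placeholders; your own control sequence will differ.

```verilog
// Sketch only: the fetch and execute steps here are placeholders,
// not the actual Ultra3 control sequence.
module trace_demo;
  parameter clock = 1;
  reg [15:0] PC, MA, MD, IR;
  integer start;            // time at which the current fetch began

  initial begin
    PC = 10;
    MD = 0;
    start = $time;
    $display("t=%0d  fetch begins, PC=%0d", $time, PC);
    #clock MA <= PC;
    #clock IR <= MD; PC <= PC + 1;
    $display("t=%0d  ADD", $time);     // print the decoded instruction's name
    #clock ;                           // placeholder for the execute step(s)
    $display("t=%0d  done, %0d clock periods", $time, ($time - start) / clock);
  end
endmodule
```

In a full machine, the difference between two successive "fetch begins" times, divided by clock, gives the CPI of the instruction in between.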

    Instruction   CPI        Frequency (in %)
    Add           ________   20
    BEQ           ________   10
    BLT           ________   5
    BNE           ________   10
    Load          ________   25
    LoadI         ________   10
    LoadD         ________   4
    Store         ________   5
    StoreD        ________   5
    Sub           ________   2
    Jump          ________   2
    JumpD         ________   1
    JumpI         ________   1

B. Assume the instructions have the above frequency of use from a representative workload.
    What is the average CPI? ________________
C. Assume the clock period is 100 nanoseconds. How many MIPS (Millions of Instructions Per Second) does your Ultra3 do? Show work.
    What is the MIPS? ________________
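
Questions B and C rest on the standard definitions of average CPI and MIPS. In the sketch below, the symbols f_i (the frequency of instruction i as a fraction, that is, the table percentage divided by 100), CPI_i (your measured CPI for instruction i) and T (the clock period in seconds) are introduced here only for illustration.

```latex
\[
\mathrm{CPI}_{\text{avg}} = \sum_i f_i \,\mathrm{CPI}_i ,
\qquad
\mathrm{MIPS} = \frac{1}{\mathrm{CPI}_{\text{avg}} \times T \times 10^{6}}
\]
% With T = 100 ns = 10^{-7} s this reduces to
\[
\mathrm{MIPS} = \frac{10}{\mathrm{CPI}_{\text{avg}}}
\]
```

For instance, a purely hypothetical average CPI of 5 would give 2 MIPS; use your own measured values in your answer.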

Exercise 3: Speeding Up Your Ultra3

Copy your Ultra3 to a file called Ultra4.v. Modify your Ultra4 to make it faster by removing instances of "#clock" and combining register transfers.

We strongly urge you to read and study pages 11 and 12 in Realization of Verilog HDL Computation Model on how to speed up Verilog code.

When one removes a #clock, one must analyze the effect of executing the statement in the same clock period as the statement before it, for every path by which control can reach that statement.

Important: After making a speed improvement to your Ultra4, make sure that it still works by running it on VeriWell. We recommend that you initially use #clocks liberally in your designs. Once your design is working properly, remove one (1) #clock at a time and check the behavior of the machine after each removal.

For example, removing the #clock in front of the MA <= PC; register transfer in

    #clock PC <= PC + 1;
    if ( IR[0] & IR[1] )
        #clock MA <= PC;

means that if both IR[0] and IR[1] are 1, the old value of PC is transferred to the MA during the same clock period in which the PC is incremented.
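
The effect can be checked with a small stand-alone sketch. The module name pc_demo, the 2-bit IR, and the initial values are assumptions made for this illustration.

```verilog
// Sketch only: pc_demo, the 2-bit IR and the initial values are assumed.
module pc_demo;
  parameter clock = 1;
  reg [15:0] PC, MA;
  reg [0:1]  IR;

  initial begin
    PC = 10; MA = 0; IR = 2'b11;
    #clock PC <= PC + 1;
    if ( IR[0] & IR[1] )
      MA <= PC;             // #clock removed: reads the old PC (10)
    #clock $display("PC=%0d MA=%0d", PC, MA);
  end
endmodule
```

The run should report PC=11 and MA=10: the MA received the value the PC held before the increment.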

In the control sequence below, the test of the first four bits of the IR uses the old value of the IR. If one intends to test the new value of IR, one needs to add a #clock before the if.

    #clock IR <= MD;
    if ( IR[0:3] == 4'b0000 )
        #clock MA <= IR[8:15];

NOTE: Avoid doing anything like this!

    while ( P[4] )
    begin
        P <= P + 1;
    end

The above is potentially an infinite loop which takes no time periods in the simulation. Why? Because there are no #clocks inside the loop to move time forward. The old value of P[4] is always used; not the updated one. Therefore, if P[4] is 1, a Verilog simulator will go into an infinite loop and just spin with this code.
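
A version of the loop that does terminate might look like the sketch below. The module name loop_demo, the 5-bit width of P, and the starting value are assumptions made for this illustration. The delay is placed after the register transfer so that each test of P[4] sees the updated value.

```verilog
// Sketch only: loop_demo and the 5-bit width of P are assumptions.
module loop_demo;
  parameter clock = 1;
  reg [4:0] P;

  initial begin
    P = 5'b10000;           // P[4] is 1, so the loop is entered
    while ( P[4] )
    begin
      P <= P + 1;
      #clock ;              // time advances, so the next test sees the new P
    end
    $display("loop left at t=%0d with P=%0d", $time, P);
  end
endmodule
```

With these assumed values the loop should end after 16 clock periods, when P wraps around to 0 and P[4] becomes 0.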

You may combine register transfers and do them in the same clock period such as

    #clock A <= B;     changed to     #clock A <= B;
    #clock C <= D;                           C <= D;

as long as they do not interfere. You can't remove a #clock if the second register transfer depends on the result of the first. For example,

    #clock MA <= PC;
    #clock B <= MA + 1;

can't be combined, because the second statement uses the new value of MA. This is called a data dependency.

However, you could rewrite the above to be faster by substituting MA with PC in the second register transfer as shown:

    #clock MA <= PC; B <= PC + 1;
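
The three variants can be compared in a stand-alone sketch. The module name dep_demo and the initial values are assumptions made for this illustration.

```verilog
// Sketch only: dep_demo and the initial values are assumed.
module dep_demo;
  parameter clock = 1;
  reg [15:0] PC, MA, B;

  initial begin
    PC = 10; MA = 0; B = 0;
    #clock MA <= PC;
    #clock B  <= MA + 1;            // two clock periods: B = 11
    #clock $display("separate: MA=%0d B=%0d", MA, B);

    MA = 0; B = 0;
    #clock MA <= PC; B <= PC + 1;   // one clock period, same result
    #clock $display("combined: MA=%0d B=%0d", MA, B);

    MA = 0; B = 0;
    #clock MA <= PC; B <= MA + 1;   // wrong: B uses the old MA (0), so B = 1
    #clock $display("naive:    MA=%0d B=%0d", MA, B);
  end
endmodule
```

The first two cases should both report MA=10 B=11, while the naive combination should report MA=10 B=1, which shows the data dependency.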

Hand In:

Hand in printouts of code and runs of the original Ultra3 and your enhanced Ultra4. On the printout of the enhanced Ultra4, explain your improvements by underlining and adding written comments.

For your enhanced Ultra4, answer the following four questions:

A. What is the CPI for each instruction?

    Instruction   CPI        Frequency (in %)
    Add           ________   20
    BEQ           ________   10
    BLT           ________   5
    BNE           ________   10
    Load          ________   25
    LoadI         ________   10
    LoadD         ________   4
    Store         ________   5
    StoreD        ________   5
    Sub           ________   2
    Jump          ________   2
    JumpD         ________   1
    JumpI         ________   1

B. Assume the instructions have the same frequencies of use as before.
    What is the average CPI? ________________
C. Assume the clock period is 100 nanoseconds. How many MIPS (Millions of Instructions Per Second) does your Ultra4 do? Show work!
    What is the MIPS? ________________

D. Compare the old and the enhanced versions of your machine by computing the speedup.

    What is the speedup of the enhanced Ultra4 over the original Ultra3? ________________
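
Speedup is conventionally the ratio of old to new execution time. The sketch below assumes that both machines run the same program with the same instruction mix and the same 100 nanosecond clock period, so the ratio reduces to a ratio of average CPIs. The subscripted symbols are introduced here only for illustration.

```latex
\[
\text{Speedup}
  = \frac{\text{Time}_{\text{Ultra3}}}{\text{Time}_{\text{Ultra4}}}
  = \frac{\mathrm{CPI}_{\text{avg, Ultra3}}}{\mathrm{CPI}_{\text{avg, Ultra4}}}
  = \frac{\mathrm{MIPS}_{\text{Ultra4}}}{\mathrm{MIPS}_{\text{Ultra3}}}
\]
```

A value greater than 1 means that the Ultra4 is faster than the Ultra3.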

Remember at the end of lab to save the Ultra4.v file for later labs.

End of lab.


A Contest for the Fastest Machine. (Must still work!)

Two prizes will be awarded, one for each category:

Category 1: Fastest Ultra4 in MIPS which adheres to our memory model of always accessing memory through the MA and MD registers.

Category 2: Fastest Ultra4 in MIPS with the above restriction relaxed. Any legal operation of the simulator is allowed but the machine must still work properly.
______________________________

You must state which category you wish to enter. You may submit entries to both categories. An entry must include a listing of Verilog code, a run of the program in Exercise 2, and a correctly computed MIPS. All entries to be considered for judging must be in by 1 PM on Tuesday, September 25, 2007. For each category, the prize for first place is a half dozen chocolate chip cookies. The machine must produce the correct results. Incorrect machines are automatically disqualified. Decisions of the judges are final. Employees of the company that sponsors the contest and their family members are not eligible to enter. Contest is void in states where regulations prohibit such contests.


Page maintained by Dan Hyde, hyde at bucknell dot edu. Last update September 14, 2007