Purpose:
1. To explore instruction look-ahead and its implementation in Verilog HDL.
2. To explore the Verilog HDL fork and join constructs of concurrency.
3. Design a 2-stage instruction pipeline of a CPU. You will need to remove structural hazards, i.e., conflicts in using resources, and handle potential control hazards due to branch instructions. Maybe even data hazards depending on how you implement branches.
Introduction
In the computer designers' quest for faster CPU performance, they have used many techniques. One is a technique called instruction look-ahead (really a 2-stage instruction pipeline). The idea of instruction look-ahead is to overlap the decode and execution of instruction i with the fetch of the next instruction i + 1. Since this is two-stage instruction pipeline, it has a potential speedup of two. Instruction look-ahead was first used on the IBM 7030 (Project Stretch) which was designed in the late 1950s. Even the lowly MC6502 microprocessor of the Apple IIe used the technique.
In this lab, we will implement instruction look-ahead in phases. First, we will balance the time the two portions (the fetch and the execute portions) take. Second, we implement instruction look-ahead ignoring the problems surrounding jumps and branches. Then third, we will fix up the jumps and branches.
This lab requires you to think of algorithms executing in parallel,
i.e., several actions occurring at the same time. Since this may be new to you, please ask if you need help.
Exercise #1 - Balance the Fetch and Execute Portions of Ultra7
Copy the Verilog HDL code of the unspeeded up version of Ultra4 from Lab #4 Exercise #1 to a new file called Ultra7.v.
Run your Ultra7.v using the software program of Lab 4 and check the time to fetch an instruction and the time to decode and execute an instruction. Add a $monitor to display lots of values so that it shows each clock cycle. Add more $display's to your code if you are having trouble tracing execution. You want your machine to take two or three clock periods to fetch the instruction and two or three clock periods to decode and execute. If your machine meets this criteria, save a copy of the machine for later comparison and go on to End of Exercise 1. For many of you, this will be the case.
If your machine is too slow or not balanced, do the following. By removing #clocks and combining control steps, improve the performance of Ultra7.v so that each instruction only takes two or three clock periods to fetch the instruction and only two or three clock periods to decode and execute. The goal is to balance the time for the two stages of the pipeline not to be the fastest machine. For some of you, the faster version of Ultra4 from Lab #4 may be the way to go.
Verify that the improved Ultra7 works and save a copy of the Verilog code for later comparison.
If you have changed your Ultra7, hand in a listing and run of the program in Exercise #1, with $display()s for start of each instruction and comments on output!
End of Exercise 1 - For Everyone
For your Ultra7, how many total clock periods does it take to do the first three Load instructions? _______
Exercise #2 - Instruction look-ahead ignoring jumps and branches
Make a copy of Ultra7 and call it Ultra7B.
The idea of instruction look-ahead is to overlap the decode and execution of instruction i with the fetch of the next instruction i + 1. How to do this in Verilog HDL? You execute two threads of control: one for the steps of the instruction fetch and another for the steps of each instruction's execution. To do this, send the control pulse on both paths with a Verilog fork construct and merge the two control pulses into one with a Verilog join.
fork: two //send control pulse to both begin-endsEach statement between the fork and join, in this case, the two begin-end blocks, is executed concurrently, i.e., capable of being done in parallel. After all the threads complete, the next statement after the join is executed.
begin
//fetch instruction i + 1
end
begin
//decode and execute instruction i
end
join
You must be careful that there is no interference between registers in the two threads, i.e., you must eliminate structural hazards due to conflicts in use of resources. For example, you can't read a word of memory in both threads during the same clock period.
Ideally, we should adhere to our simple memory model, e.g., a memory read is MD <= MEM[MA]. However, having two threads of control and requiring you to read and write only with the MA and MD registers is tricky. Therefore, you may use other registers to read and write MEM, e.g., MD2 <= MEM[MA2].
Hint: use some more registers, if necessary.
A second hint: You should fetch the very first instruction then do a while loop with the fork-join inside.
Draw a "Step Diagram" for your Ultra7 which indicates the computations (register transfers) that happen at each Step for one instruction - this is the non-pipelined case (Steps 1 to 6).
Step 1 Step 2 Step 3 Step 4 Step 5 Step 6Assuming the fetch of the instruction takes 3 clock periods, superimpose Step 1 on Step 4, Step 2 on Step 5, Step 3 on Step 6 for the pipelined case.
Steps 1/4 Steps 2/5 Steps 3/6Analyze the use of the registers and determine any conflicts. One can eliminate register conflicts (structural hazards) by adding or removing #clocks to shift computations around; by adding new registers and renaming registers; or by moving computations earlier or later.
Before you change any Verilog code, it is best to have a plan based on your final pipelined Step Diagram. Just hacking code does not work! This is common when designing systems with parallelism. One must analyze the situation carefully.
Implement instruction look-ahead for all the instructions except the jumps and branches. Just comment out jumps and branches from test program for this exercise. This new machine we call Ultra7B. We will compare the new Ultra7B with Ultra7 based on the first three Load instructions of the software program of Lab 4.
For your Ultra7B, how many total clock periods does it take to do the first three Load instructions? _________ (Don't count the original instruction fetch.)
What kind of speedup over Ultra7 did you get for the first three Loads? ________
Hand in a listing and run of the program in Exercise #2 (With $displays and comments on output!) and answers to question.
Exercise #3 - Fixing up the Jumps and Branches
Copy and modify the newest Ultra7B to implement instruction look-ahead for jumps and branches and call this machine Ultra7C. Note, depending on your approach in the last exercise, branches may or may not work.
The conditional branch instructions can be tricky because before the instruction has executed, you can not tell whether it will branch or not.
State how you solved the problem with the branches. _______________________________________
______________________________________________________________________________
In order to fix the branches you might have to slow other parts of the machine including the Load instructions. For your Ultra7C, how many total clock periods does it take to do the first three Load instructions? _________
What kind of speedup over Ultra7 did you get for the first three Loads? ________ (Don't count the original instruction fetch.)
Assuming the same frequencies as in Lab 4, what is the average CPI for Ultra7C? ______ And MIPS? _______
Hand in a run of the software program of Lab 4 on your newest Ultra7C (With $display()s and comments on output!) and answers to questions.
Exercise #4 - A subtle IssueThis example demonstrates why most machines don't allow self modifying code!
Now change the program, i. e., a software change, by replacing at Mem[12] the Load R1,3 with Store R3,13. Run and analyze the results.
Discuss in a FEW SENTENCES why this software program may cause problems.
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
Does your version of Ultra7C work correctly on this software program? ________________________________________________________________________________
________________________________________________________________________________
If not, what is wrong and how can it be fixed? You do NOT have to fix your machine only discuss how to fix it.
_________________________________________________________________________________
_________________________________________________________________________________
_________________________________________________________________________________
Hand in a run of the program in Exercise #4 (With $display()s and comments on output!) and answers to the above questions.