Reference: Computer Architecture: A Quantitative Approach, by John Hennessy and David Patterson, Morgan Kaufmann, Fourth Edition, 2007, pages C-1 to C-38.
Introduction:
This lab continues the development of the cache memory from last week's lab (Lab 10). In the first part of the lab, you will do timing studies of last week's Verilog HDL program. In the second part, you will attempt to speed up writes by adding a write buffer.
Exercise 1:
Insert $display statements in your Verilog code to display the start of a cache read and the start of a cache write, each with $time. Also insert $display statements to print the end of a read hit, read miss, write hit, or write miss, each with $time. Use $display to print the name of each instruction being executed as well, e.g., LOAD.
In the Verilog code, insert counters that count the four situations (read hit, read miss, write hit, write miss). Also add a counter that is incremented every time a Jump finishes. Use Verilog's integer type to declare the five counters.
For the software program of Lab 10, when the counter for Jump instructions is three, display the four counters for hits and misses.
After you are confident your program is working correctly, remove any $monitor statements and all $displays other than the ones specified.
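As a rough illustration of the counters and $display calls described above, here is a minimal sketch. The module, the task names (end_read, end_jump) and the counter names are all hypothetical; your Lab 10 control sequence will have its own structure, so treat this only as one possible shape. The write counters would be updated the same way in a corresponding write task.

```verilog
// Sketch only: all names here are hypothetical, not taken from Lab 10.
module counter_sketch;
  integer read_hits, read_misses, write_hits, write_misses, jumps;

  // Called by the control sequence when a cache read finishes.
  task end_read(input hit);
    begin
      if (hit) begin
        read_hits = read_hits + 1;
        $display("%0t: end of read hit", $time);
      end else begin
        read_misses = read_misses + 1;
        $display("%0t: end of read miss", $time);
      end
    end
  endtask

  // Called when a Jump finishes; reports the totals after the third Jump.
  task end_jump;
    begin
      jumps = jumps + 1;
      if (jumps == 3)
        $display("%0t: read hits=%0d read misses=%0d write hits=%0d write misses=%0d",
                 $time, read_hits, read_misses, write_hits, write_misses);
    end
  endtask

  initial begin
    // integer variables start as x, so clear them explicitly
    read_hits = 0; read_misses = 0; write_hits = 0; write_misses = 0; jumps = 0;
    $display("%0t: LOAD, start of cache read", $time);
    #20 end_read(1'b0);            // pretend the read missed
    $display("%0t: LOAD, start of cache read", $time);
    #1  end_read(1'b1);            // pretend the read hit
    repeat (3) #1 end_jump;        // three Jumps trigger the report
  end
endmodule
```

The `%0t` and `%0d` formats simply suppress leading padding; plain `%t` and `%d` work as well.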
Hand in the Verilog code and the output. From the output, fill in the following:
Exercise 2:
Change the Store R1,5 instruction to a Store R1,16. This will force a write miss when you execute your machine.
For the modified software program, i.e., with Store R1,16, run the program.
Hand in only the output and fill in the following:
How many total clock periods for the three times through the loop?
________
Using the results from both Exercises 1 and 2 and analyzing your Verilog HDL control sequence, answer the following:
Note: For the Lab, on a write miss, count the time to read the block from memory as 20 clock periods. This is an artifact of the lab.
Exercise 3:
Change the software program to the following with PC starting at 0.
0 Load R1,32
1 Load R2,33
2 Add R1,R1,R2
3 Store R1,34
4 Jump 1
32 Data 2
33 Data 1
Notice that this is the same program as in Exercise 1 but shifted in memory.
Run this program and supply only the output and answer the following:
Observe that a compiler can make a big difference in performance if it carefully generates code to effectively utilize cache. That is, it should generate code like in Exercise 1 and NOT like in Exercise 3.
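To see why shifting the data in memory can hurt, it may help to compute which cache line each address maps to. The sketch below assumes a direct-mapped cache with 8 lines of 4 words each; these numbers and the function name line_of are assumptions for illustration only, so substitute the geometry of your own Lab 10 cache.

```verilog
// Sketch only: the cache geometry below is assumed, not taken from Lab 10.
module index_sketch;
  parameter WORDS_PER_BLOCK = 4;   // assumed block size in words
  parameter NUM_LINES       = 8;   // assumed number of cache lines

  // Cache line that a word address maps to in a direct-mapped cache.
  function integer line_of(input integer addr);
    line_of = (addr / WORDS_PER_BLOCK) % NUM_LINES;
  endfunction

  initial begin
    $display("address  0 -> line %0d", line_of(0));
    $display("address  4 -> line %0d", line_of(4));
    $display("address 32 -> line %0d", line_of(32));
    $display("address 34 -> line %0d", line_of(34));
  end
endmodule
```

Under these assumed parameters, addresses 0, 32 and 34 all map to the same line, so instruction fetches and data accesses would keep evicting each other. Whether that happens in your machine depends on your actual cache parameters.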
Exercise 4:
To speed up both write hits and write misses, we want to add a write buffer to the cache. The idea is to have two threads of control, where the write to memory (which takes 20 clock periods) is done in one thread and the rest of the CPU runs in the other thread. That is, while the memory write is being performed, the rest of the CPU can zoom along until another write request. Since the only place we can have a write request is in a Store, the machine can conceivably execute many instructions before the next Store. Hence, we can overlap the two activities and achieve significant speedup. Note: Since the CPU can now both read and write memory at the same time, this requires a more expensive memory with two ports. However, this has little consequence in our Verilog code.
Add a write buffer to your cache and run the program of Exercise 3, except change the Store R1,34 to Store R1,40.
Discussion:
1. First, add new registers so there are no conflicts between the two threads. This includes a new register to hold a block address and a register that holds a cache line (the so-called write buffer, which in this case is really a one-item queue).
2. Unfortunately, the Verilog fork-join construct we used in the Instruction Lookahead Lab is not appropriate. The fork-join construct waits for all threads to complete, i.e., it synchronizes the threads, then the merged thread continues. For this situation, we want two threads of control which are much less coupled. We don't want the CPU thread to wait unless it needs to perform another STORE instruction. In that case, we must stall the CPU until the memory write is completed.
One way to have two uncoupled threads of control is to have two control units (CU) driven by the same clock. This is done in Verilog with two initial or always constructs. Therefore, move the memory write statement, i.e., #(20*clock) MEM[BA]<=Buffer[??]; to a new always construct at the top level. Use two one-bit registers, write and busy, to signal the start and completion of a memory write. Use Verilog wait statements to control the signaling. Read about the wait statement in the CSCI 320 Verilog HDL Manual, which is available online.
Hand in a listing of the program and a run with output in hex of important registers and the contents of memory address 40. Leave in the $displays from earlier exercises.
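The write/busy signaling described above might be sketched as follows. This is one possible arrangement, not the required solution: the store task, the memory size, the one-word Buffer, and the value of clock are all assumptions made to keep the example self-contained. Here write is driven only by the CPU side and busy only by the memory control unit.

```verilog
// Sketch only: sizes, the store task, and the one-word Buffer are assumptions.
module write_buffer_sketch;
  parameter clock = 1;        // assumed: time units per clock period

  reg [15:0] MEM [0:63];      // assumed memory size and word width
  reg [15:0] Buffer;          // one-item write buffer (one word here for brevity)
  reg [5:0]  BA;              // address of the pending memory write
  reg        write, busy;     // write: set by the CPU; busy: set by the memory CU

  // Second control unit: performs the slow memory write on its own.
  always begin
    wait (write);                      // sleep until the CPU requests a write
    busy = 1;
    #(20*clock) MEM[BA] <= Buffer;     // the 20-clock-period memory write
    busy = 0;
  end

  // CPU side of a Store (hypothetical task): stall only if a write is pending.
  task store(input [5:0] addr, input [15:0] data);
    begin
      wait (!busy);
      BA     = addr;
      Buffer = data;
      write  = 1;
      #clock write = 0;                // the request lasts one clock period
    end
  endtask

  initial begin
    write = 0; busy = 0;
    store(6'd40, 16'h0003);
    $display("%0t: CPU continues while the write is in flight", $time);
    store(6'd40, 16'h0004);            // stalls until busy drops
    $display("%0t: second Store accepted", $time);
    wait (!busy);
    #clock $display("%0t: MEM[40] = %h", $time, MEM[40]);
    $finish;
  end
endmodule
```

Holding write high for one clock period gives the memory control unit time to raise busy before the CPU can reach another Store. The extra #clock before the final $display lets the nonblocking assignment to MEM settle before it is printed.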
A Monkey Wrench? - Notice we don't normally need to stall a read request while waiting for the write buffer to be written because we are reading a different block. BUT what if we have a read request of a word in the block that we are writing in the write buffer? We don't want to read the stale block from memory. Let us explore when this is a problem and how to deal with it.
Since on a read hit the block is in the cache, this situation can only happen on a cache read miss when the block is not in cache. Luckily, since the proper block is in the write buffer, we can move a copy of the write buffer to the proper place in cache. This saves the machine time since we now don't need to read the block from memory and take the 20 clock period miss penalty. Wow! We just turned a potential stall problem into a performance enhancement!
You need to develop a way to detect this situation and implement the necessary Verilog code. Use a $display to print when your program detects this situation.
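One way to detect the situation is to compare the block address of the read miss with the block address held for the pending write while busy is set. The sketch below assumes a direct-mapped cache of 8 lines with four 16-bit words per line, and a hypothetical read_miss task; the arrays CACHE, TAG and VALID are illustrative names, so adapt them to your own design.

```verilog
// Sketch only: cache geometry, array names, and read_miss are assumptions.
module forward_sketch;
  reg [63:0] CACHE [0:7];   // assumed: 8 lines of four 16-bit words
  reg [3:0]  TAG   [0:7];
  reg        VALID [0:7];
  reg [63:0] Buffer;        // write buffer holding one cache line
  reg [6:0]  BA;            // block address of the pending write
  reg        busy;          // 1 while the memory write is in progress

  task read_miss(input [6:0] block);
    reg [2:0] index;
    begin
      index = block[2:0];
      if (busy && (block == BA)) begin
        // The up-to-date block is in the write buffer: copy it, skip memory.
        $display("%0t: read miss on block %0d found in write buffer", $time, block);
        CACHE[index] = Buffer;
      end else begin
        $display("%0t: read miss on block %0d, fetch from memory", $time, block);
        // the 20-clock-period memory read would go here
      end
      TAG[index]   = block[6:3];
      VALID[index] = 1;
    end
  endtask

  initial begin
    busy = 1; BA = 7'd10; Buffer = 64'h0001_0002_0003_0004;
    read_miss(7'd10);       // matches the pending write
    read_miss(7'd11);       // different block, normal miss
    $display("CACHE[2] = %h", CACHE[2]);
  end
endmodule
```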
Run the program of Exercise 3 (with Store R1,40).
Run the program below with PC starting at 2.
1 Data 2
2 Load R1,1
3 Load R2,32
4 Add R1,R1,R2
5 Store R1,0
6 Jump 3
32 Data 1
Hand in a listing of the program and a run with output in hex of important registers and the contents of address 0. Label your changes to the program.