Notice: Constant FORCE_SSL_ADMIN already defined in /nfs/unixspace/linux/accounts/COURSES/cs206/public_html/sp17/wp-config.php on line 94
Lab 14 | CSCI 206 Computer Organization & Programming

Lab 14

Cache Performance

A Level-One (L1) cache is part of the actual processor chip, that is, it is integrated onto the same die as ALUs, control units, and other major components. The collection of performance statistics for the cache (number of hits, number of misses, etc.) is possible with the use of specialized software that allows one to read performance counters that are kept within the processors. This method of analysis is appropriate when a processor has already been produced and marketed, when one wants to investigate and/or fine-tune the performance of particular applications.

It is important, also, that one can understand the performance of a specific processor design before it has gone into production. The analysis at this stage can help one to identify and to correct the worst engineering mistakes, and also to quantify the extent of proposed performance improvements. This analysis has to be done on the designed processor via the application of simulation techniques.

Computer designers experiment with different cache sizes, organizations, replacement policies, etc. using a cache simulator. The simulator takes two kinds of input: (1) a description of the cache design, and (2) address traces that represent memory references made by one or more “typical” applications or benchmarks. The output produced by the experiments for each address trace can include the number of cache misses and even estimates of cache penalties.

Goals

  • Learn how to use a cache simulator to explore how cache parameters effect performance on real applications.
  • Learn about memory access patterns in Matrix Multiply.

Set Up

  • Open a shell, go into your local git repo, and do a “git pull“ to make sure you have the latest feedback from the TAs.
  • Copy the contents of ~cs206/Labs/Lab14 into your working directory at Labs/Lab14.
  • Complete the prelab if you haven’t already.

Exercise 1 – Introduction to matrix multiply

To measure the real world performance of our cache we’re going to need a good application to study. Matrix multiplication may not sound exciting but it does highlight some important features of our memory design. In the provided source files, you will find the file dgemm-naive.c (shown below).

The file is also in the collection you just copied from the ~cs206/Labs/Lab14/ folder.

dgemm is short for double-precision general matrix multiply. It implements the calculation C = C + A * B where A, B, and C are n-by-n matrices. This version does the calculation with no optimizations. The benchmark program (benchmark.c) calls the dgemm routine in solving a random linear system of equations. This computation is common to many scientific application. A Makefile is provided for building the benchmark tool. Type make to build the executable benchmark before proceeding; this will generate two executable files for you: benchmark-naive and benchmark-blocked. You will first execute: ./benchmark-naive 64 where the command line argument indicates the size of the system to solve (the matrices used are n-by-n). After a short time, you should get a result like:

This application gives you two performance measures:

  • The number of floating point operations per second in solving the problem the unit shown is Millions of Floating Point Operations per Second, pronounced megaflops, and
  • The percentage and the average percentage of peak, which are calculated relatively to a reference computer.

Our goal is to experiment with various designs of cache and to optimize its design parameters for this specific application. In order to do that, we need to use the Pin program presented in the pre-lab to create a trace file, which is a log of memory accesses. Because these traces get very large, we will store them on the local machine in the /tmp folder. To keep these files from conflicting with other students add your user (login) ID to the filename. If you completed the setup stage indicated in the prelab, the line below should execute just fine and you should have some idea what it is doing.

$ $PIN_ROOT/pin.sh -t $PIN_TOOL -o /tmp/$USER-naive.out -- ./benchmark-naive 32

Because the traces are large, do not add them to your git repo! If you turn out to add them by mistake, please make sure to remove them, as soon as possible. Also, when you are done working with the trace files in /tmp, please remember to remove them.

Exercise 2: Naive Cache Analysis

Now you should have a memory access trace of the naive algorithm using 32 x 32 matrices. We will now use the dinero cache simulator to compare several hardware design options on this memory access trace. The executable file is located in ~cs206/bin/.

The cache design for each simulation experiment is specified by command line parameters given to dinero at run time. You should take a moment to learn what command line options are available to dinero using the following command:

$ ~cs206/bin/dinero -help

While you don’t need to understand all of these options, this information will help you understand what is being done in the example below. Before you continue, stretch your terminal window as wide as you can (there will be a large table output). dinero expects input from stdin so you can use a pipe as shown below to feed your memory trace (generated by the pin command above) into dinero:

$ ~cs206/bin/dinero -l1-usize 8k -l1-ubsize 8 -l1-uassoc 1 -l1-uccc < /tmp/$USER-naive.out

This should run for a little and then generate a screen full of output.

The options selected were:

  • -l1-usize 8k : the -l1-usize part defines a level-one (letter ‘l’ followed by digit ’1′) unified cache (both data and instructions in the same unified cache) with 8K bytes. The 1 in -l1 is actually a parameter that you can vary. Other possibilities for the type of cache are instruction and data. Note that the size of a cache is defined in bytes; you can use k and m to indicate kilobytes and megabytes, respectively.
  • -l1-ubsize 8 : indicates that the block size for a cache move is 8 bytes (that is, one word on a 64-bit machine). A block size of 32 would mean that each cache move would copy 4 words.
  • -l1-uassoc 1 : indicates that the associativity set size is 1 word per block. This means it is direct-mapped (no associativity). Other possibilities would be a set size of 2/4/8 or larger powers of two.
  • -l1-uccc : print Compulsory/Capacity/Conflict miss statistics for the L1 unified cache.

Create the file answers.txt. You will now use dinero (with different command line options) to answer the questions below. You might want to use a spreadsheet (like Google Docs) as a lab notebook to store results from your experiments. If you create a plot download the file as a pdf and add it to your git repo and reference this plot from your text file.

(2.1) What do you observe when the block size is increased from 8 to 64? What principle of cache operation is responsible for this effect?

(2.2) What do you observe when you double the size of the cache (from 8K to 16K) while keeping the block size at 8? Explain these results.

(2.3) What is the performance of a caches with size 32K, 64K, 128K, 256K, and 512K using a block size of 32? Look at results other than total demand miss rate and comment on the changes.

(2.4) What is the effect of setting 2-way and 4-way set associativity for a unified 8K cache with a block size of 8? Explain your findings.

(2.5) What is the effect of setting 2-way and 4-way set associativity for a unified 128K cache with a block size of 8? Compare to previous results and explain your findings.

When you complete this problem make sure to add answers.txt to your git repo and push to gitlab!

Exercise 3: Blocked Cache Analysis

In the last exercise, we looked at several ways to influence the cache to improve performance. However, many times it is also possible to improve the algorithm. The file dgemm-blocked.c uses a technique called blocking to split the matrix multiply into smaller blocks. Because these smaller blocks fit better in the cache, this can significantly improve performance.

The contents of dgemm-blocked.c are shown below. The value BLOCK_SIZE sets the maximum size of data to operate on. Currently this is set at 10, so there would be three 10×10 matrices of doubles in the cache at any time, for a total of 2,400 bytes (excluding loop control variables, etc).

To generate the memory trace for this algorithm we have to rerun Pin:

$ $PIN_ROOT/pin.sh -t $PIN_TOOL -o /tmp/$USER-blocked.out -- ./benchmark-blocked 32

Recall from the Prelab that the line above generates a .out ASCII file reporting all of the memory reads and writes indicated by running the program “./benchmark-blocked 32.” This output can then be fed into the dinero cache simulator (as in Exercise 2), which reports on the effectiveness of your cache design (specified by the parameters to dinero) on that specific program “./benchmark-blocked 32.”

Your final challenge is to select a cache design that provides a good balance between low miss rate and small size on the blocked cache trace above. In your answers file create a new section for Exercise 3.  Be sure to document your analysis and provide several experiments to support your conclusion using the blocked trace. Clearly demonstrate the effect of each cache parameter. Your goal is to demonstrate that blocking allows us to use a smaller cache to achieve similar performance as the naive algorithm. In your answers.txt file write a few paragraphs summarizing your experiments, results, and conclusions.

When you complete this problem add any additional resources you created (spreadsheets, pdf files, etc) to git and push everything to gitlab.

Cleanup

Remove the files /tmp/$USER-blocked.out and /tmp/$USER-naive.out from the lab computers.

Grading Rubric

  • Prelab: 25 points: ls.out, pwd.out and wc.out added to git repo.
  • Exercise 2: 50 points: 10 points for each question.
  • Exercise 3: 25 points. Documented evidence for various cache size, block size, and associativity for the blocked cache design and made a reasonable argument for a solution.
Print Friendly
Posted in Lab Tagged with: , ,

Leave a Reply

Your email address will not be published. Required fields are marked *

*