39 – GPU Introduction


The supercomputer pioneer Seymour Cray famously asked: “If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?” He was referring to the design of high-speed computing machines. Should they have a few powerful CPUs or a larger number of less powerful processors? Thus far we have focused on making the CPU as efficient and powerful as possible. In this class, we begin to consider how we might design and use a system of many parallel processors, using our CUDA-enabled GPUs.

Watch the video below “An Introduction to CUDA Programming” by Cliff Woolley (posted by RichReport).

Part 1: Exploring the GPU Capabilities

To begin, we need to know what GPU capabilities are present in our system. To do this we will compile and run our first simple CUDA program. You must do this on a lab machine, as the Linux remotes do not have CUDA-capable cards.

You don’t have to download anything; the CUDA toolkit is already set up. To start the environment, just run:

Make a working directory (e.g., mkdir -p ~/csci320/cuda). Recursively copy the files from ~csci320/cuda to your local working directory.

Compile enum_gpu.cu with the command:

If you have any errors, seek help now.

Finally, run enum_gpu and take careful note of the “Max threads per block” and all of the other interesting numbers. 
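The source of enum_gpu.cu is not reproduced on this page. As a rough sketch of how a program of this kind might obtain such numbers, the example below queries each device through the CUDA runtime API (cudaGetDeviceCount and cudaGetDeviceProperties). The file name query_gpu.cu, the helper dims, and the choice of fields to print are assumptions for illustration; the course's program may report different values in a different format.

```cuda
// query_gpu.cu (hypothetical file name, not part of the course files)
// Build (assumed toolchain invocation): nvcc -o query_gpu query_gpu.cu
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Formats a 3-element dimension array as "a x b x c" for printing.
std::string dims(const int d[3]) {
    return std::to_string(d[0]) + " x " + std::to_string(d[1]) + " x " +
           std::to_string(d[2]);
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, i);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        std::printf("Device %d: %s\n", i, prop.name);
        std::printf("  Compute capability:    %d.%d\n", prop.major, prop.minor);
        std::printf("  Global memory (bytes): %zu\n", prop.totalGlobalMem);
        std::printf("  Multiprocessors:       %d\n", prop.multiProcessorCount);
        std::printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
        std::printf("  Max block dimensions:  %s\n", dims(prop.maxThreadsDim).c_str());
        std::printf("  Max grid dimensions:   %s\n", dims(prop.maxGridSize).c_str());
    }
    return 0;
}
```

The per-block thread limit and the grid dimensions are the values that matter most for the launch-parameter questions in the later parts.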

Part 2: GPU Vector Squares

For an introduction to the GPU programming model, watch this video.

Examine the code in the file square.cu. Compile and run it. Try to understand every line.
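For orientation while reading, here is a minimal sketch of the usual shape of such a program: allocate device memory, copy the input to the GPU, launch a kernel with one thread per element, and copy the result back. This is not the course's square.cu; the helper squareOnGpu, the use of float, and starting the sequence at 0 are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread squares one element. With a single block, threadIdx.x is the
// element index.
__global__ void square(float* d_out, const float* d_in) {
    int i = threadIdx.x;
    d_out[i] = d_in[i] * d_in[i];
}

// Hypothetical helper: squares n host values on the GPU using ONE block of
// n threads. Returns false if any CUDA call fails. Assumes n does not exceed
// the device's per-block thread limit.
bool squareOnGpu(const float* h_in, float* h_out, int n) {
    const size_t bytes = n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    // Host -> device, launch, device -> host.
    bool ok = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        square<<<1, n>>>(d_out, d_in);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    const int N = 64;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = float(i);

    if (!squareOnGpu(h_in, h_out, N)) {
        std::fprintf(stderr, "GPU computation failed\n");
        return 1;
    }
    for (int i = 0; i < N; ++i) std::printf("%d^2 = %.0f\n", i, h_out[i]);
    return 0;
}
```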

For a line-by-line discussion of the code, watch these videos. If you fully understand everything, move along.

Part 3: Kernel launch parameters

So now you think you can square numbers in parallel? Copy square.cu to square2048.cu and modify square2048.cu to compute the first 2048 squares. This isn’t as easy as you might think. The original program used 64 threads to compute the first 64 squares. Unfortunately, the GPU supports at most 1024 threads per block (the “Max threads per block” value reported by enum_gpu), so we can’t simply increase the number of threads in the launch. What to do? For a hint, watch the video below. You might want to add CUDA error checking to your code as well; this code snippet should help (put it just after your kernel call):
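The snippet referred to above is not included on this page. A minimal sketch of one common way to check for errors after a kernel call is shown here, using cudaGetLastError, cudaDeviceSynchronize, and cudaGetErrorString from the CUDA runtime API. The helper name checkCuda and the empty kernel noop are hypothetical stand-ins. The second launch deliberately requests 2048 threads in a single block; on a GPU whose limit is 1024 threads per block, that launch should be rejected and the check should report it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the lab's own kernel is not reproduced here.
__global__ void noop() {}

// Call right after a kernel launch. Returns true if no CUDA error is pending;
// otherwise prints the error and returns false.
bool checkCuda(const char* what) {
    // Launch-time errors, such as an unsupported launch configuration.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Errors raised while the kernel was running.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error after %s: %s\n", what,
                     cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    noop<<<1, 64>>>();
    bool ok64 = checkCuda("noop<<<1, 64>>>");

    // More threads in one block than a 1024-thread-per-block GPU allows.
    noop<<<1, 2048>>>();
    bool ok2048 = checkCuda("noop<<<1, 2048>>>");

    std::printf("64 threads per block:   %s\n", ok64 ? "ok" : "failed");
    std::printf("2048 threads per block: %s\n", ok2048 ? "ok" : "failed");
    return 0;
}
```

Without a check like this, a rejected launch is silent: the kernel simply never runs and the output buffer keeps whatever it held before.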

Part 4: Challenge: square_user

Copy square2048.cu to square_user.cu. This program will read a command-line argument providing the number of squares to compute. Your program should allow computing a very large number of squares. Since writing all these values to the screen is less useful, the program should write the squares to a text file called squares.txt with each square on a new line. You will want to consider how many bytes you can cudaMalloc, how many blocks/threads the GPU supports, and how large the results can become (switch to doubles?).
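The three questions at the end of the paragraph above can be checked before any memory is allocated. The sketch below is one possible pre-flight check, not a prescribed design: it parses the requested count, computes how many blocks a one-thread-per-element launch would need, and compares that and the buffer size against what device 0 reports. The helper blocksFor, the use of device 0, and the choice of double are assumptions for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Ceiling division: the number of blocks needed so that
// blocks * threadsPerBlock >= n. Written to avoid overflow for large n.
unsigned long long blocksFor(unsigned long long n, unsigned int threadsPerBlock) {
    return n / threadsPerBlock + (n % threadsPerBlock != 0 ? 1ULL : 0ULL);
}

int main(int argc, char** argv) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <number of squares>\n", argv[0]);
        return 1;
    }
    const unsigned long long n = std::strtoull(argv[1], nullptr, 10);

    cudaDeviceProp prop;
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess ||
        cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) {
        std::fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    const unsigned int threads = (unsigned int)prop.maxThreadsPerBlock;
    const unsigned long long blocks = blocksFor(n, threads);

    // Free memory is an upper bound; a single allocation of exactly this size
    // is not guaranteed to succeed.
    const bool fitsInMemory = n <= freeBytes / sizeof(double);
    const bool fitsInGrid = blocks <= (unsigned long long)prop.maxGridSize[0];

    std::printf("n = %llu, threads per block = %u, blocks = %llu\n",
                n, threads, blocks);
    std::printf("fits in free device memory as doubles: %s\n",
                fitsInMemory ? "yes" : "no");
    std::printf("fits in one 1-D grid: %s\n", fitsInGrid ? "yes" : "no");
    return 0;
}
```

On the question of result size, note that a double represents every integer up to 2^53 exactly, so squares stay exact only while n*n remains below that bound.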

Part 5: Challenge: CUDA Thrust

Thrust is a template library for high-level CUDA programming. Please open up the documentation here and here.

Copy square_user.cu or square2048.cu to square_thrust.cu. Implement the corresponding functionality, but with thrust::host_vector and thrust::device_vector instead. See also thrust::sequence, thrust::transform, thrust::copy, and other library calls that may be useful.
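To show how the library calls named above fit together, here is a small sketch that fills a device vector with 0, 1, 2, ..., squares it in place, and copies the result to the host. The functor Square, the helper squaresOnDevice, the fixed size of 8, and starting at 0 are assumptions for illustration; this is one possible arrangement, not the expected solution.

```cuda
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/copy.h>

// Functor applied to each element by thrust::transform.
struct Square {
    __host__ __device__ double operator()(double x) const { return x * x; }
};

// Hypothetical helper: returns the squares of 0 .. n-1, computed on the GPU.
thrust::host_vector<double> squaresOnDevice(size_t n) {
    thrust::device_vector<double> d(n);               // allocated on the device
    thrust::sequence(d.begin(), d.end());             // 0, 1, 2, ...
    thrust::transform(d.begin(), d.end(), d.begin(),  // in place: x -> x * x
                      Square());

    thrust::host_vector<double> h(n);
    thrust::copy(d.begin(), d.end(), h.begin());      // device -> host
    return h;
}

int main() {
    thrust::host_vector<double> h = squaresOnDevice(8);
    for (size_t i = 0; i < h.size(); ++i) {
        std::printf("%zu^2 = %.0f\n", i, h[i]);
    }
    return 0;
}
```

Notice that there is no explicit cudaMalloc, cudaMemcpy, or kernel launch: the containers manage device memory and the algorithms choose the launch configuration.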


Before the end of class, demonstrate square2048. Challenge programs can be demonstrated for 1 bonus point each on exam 2 anytime on or before the last day of classes.


Much of this material comes from the excellent (and free) Udacity course: Intro to Parallel Programming, Using CUDA to Harness the Power of GPUs.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
