These lecture notes are designed to form the basis for practical exercises in computer labs.
1. Introduction to parallel programming on GPU, a brief history, CUDA
- evolution of graphics processors
- the beginnings of GPU programming, GPGPU
- programmable pipeline
- current architectures
2. CUDA architecture and its integration within a standard C++ project
- key features of selected architecture
- decomposition of an algorithm
- hardware vs. software decomposition
3. Threads and kernel functions
- the thread and its meaning on a GPU
- thread hierarchy, basic thread life cycle, limits
- launching kernel functions, parameters and restrictions
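The points above can be sketched in a minimal CUDA program; the kernel name and sizes here are illustrative, not part of the course material:

```cuda
#include <cstdio>

// A minimal kernel: each thread writes its global index into the output array.
__global__ void fillIndices(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the grid may cover more threads than n
        out[i] = i;
}

int main() {
    const int n = 1024;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    // <<<blocks, threadsPerBlock>>> is the execution configuration;
    // both values are subject to hardware limits (e.g. max threads per block).
    fillIndices<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    cudaFree(d_out);
    return 0;
}
```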
4. CUDA memories, patterns and usage
- global, shared, and constant memory, registers and texture memory
- allocation and deallocation of memory
- memory alignment
- copying data from RAM to VRAM and vice versa
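Allocation, the two copy directions, and deallocation can be summarized in one host-side sketch (the buffer contents are arbitrary illustration):

```cuda
int main() {
    const int n = 256;
    float host[n];                         // buffer in RAM
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev;                            // pointer into VRAM
    cudaMalloc(&dev, n * sizeof(float));   // allocate device memory

    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);  // RAM -> VRAM
    // ... kernels would operate on dev here ...
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // VRAM -> RAM

    cudaFree(dev);                         // deallocate device memory
    return 0;
}
```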
5. Memory bank conflicts
- access optimization
- suitable data structures
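A classic illustration of both bullets is a shared-memory matrix transpose; this sketch assumes the matrix width is a multiple of the tile size:

```cuda
#define TILE 32

// Without the +1 padding, reading the tile column-wise makes all 32 threads
// of a warp hit the same shared-memory bank (a 32-way conflict).
__global__ void transpose(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];  // +1 shifts each row into a different bank

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;    // transposed block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```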
6. Program execution control, distribution of an algorithm
- streams, concurrent launching of kernel functions
- synchronization at several levels – threads, blocks, GPU vs. CPU
- distributing a program across multiple GPUs
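Streams and the synchronization levels can be sketched as follows; the kernel and buffer names are illustrative:

```cuda
__global__ void scale(float *v, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernels launched into different streams may execute concurrently.
    scale<<<(n + 255) / 256, 256, 0, s0>>>(a, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(b, n, 3.0f);

    cudaStreamSynchronize(s0);   // CPU waits for one stream only
    cudaDeviceSynchronize();     // CPU waits for the whole GPU

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```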
7. Algorithm performance with respect to its parallelization on GPU
- case study, experiments with multiple variants of the same program
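Comparing variants requires accurate GPU timing; one common pattern uses CUDA events (the kernel here is only a stand-in for the variant being measured):

```cuda
__global__ void variantUnderTest(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = v[i] * v[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    variantUnderTest<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```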
8. Vectors and matrices
- case study, large data processing
- parallel reduction
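The standard shared-memory tree reduction can be sketched as one kernel; each block produces one partial sum, and the partial sums are combined on the host or in a second pass (this assumes the block size is a power of two):

```cuda
__global__ void reduceSum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];  // sized at launch: blockDim.x * sizeof(float)
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];   // one partial sum per block
}
```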
9. Support library cuBLAS
- introduction to several support libraries for linear algebra
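As a taste of such libraries, a cuBLAS SAXPY (y = alpha*x + y) operates directly on device buffers; the host-side fill is illustrative:

```cuda
#include <cublas_v2.h>

int main() {
    const int n = 1024;
    float hx[n], hy[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemcpy(x, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, x, 1, y, 1);  // y = alpha * x + y, on the GPU

    cublasDestroy(handle);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Linking requires the cuBLAS library (e.g. `-lcublas` with nvcc).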
10. Performance optimization
- case study, image manipulation
- double buffering
- optimization at the level of blocks, registers, etc.
11. Case study
- interesting research topics
- outline of possible solutions
- experiments
- program tuning, debugging with NVIDIA Nsight