

# GKLEE: Concolic Verification and Test Generation for GPUs

Guodong Li \*

Fujitsu Laboratories of America,  
Sunnyvale, CA 94085, USA  
gli@us.fujitsu.com

Peng Li Geof Sawaya

Ganesh Gopalakrishnan

School of Computing,  
University of Utah,  
Salt Lake City, UT 84112, USA  
{peterlee,sawaya,ganesh}@cs.utah.edu

Indradeep Ghosh

Sreeranga P. Rajan

Fujitsu Laboratories of America,  
Sunnyvale, CA 94085, USA  
{ighosh,sree.rajan}@us.fujitsu.com

## Abstract

Programs written for GPUs often contain correctness errors such as races, deadlocks, or may compute the wrong result. Existing debugging tools often miss these errors because of their limited input-space and execution-space exploration. Existing tools based on conservative static analysis or conservative modeling of SIMD concurrency generate false alarms resulting in wasted bug-hunting. They also often do not target performance bugs (non-coalesced memory accesses, memory bank conflicts, and divergent warps). We provide a new framework called GKLEE that can analyze C++ GPU programs, locating the aforesaid correctness and performance bugs. For these programs, GKLEE can also automatically generate tests that provide high coverage. These tests serve as concrete witnesses for every reported bug. They can also be used for downstream debugging, for example to test the kernel on the actual hardware. We describe the architecture of GKLEE, its symbolic virtual machine model, and describe previously unknown bugs and performance issues that it detected on commercial SDK kernels. We describe GKLEE’s test-case reduction heuristics, and the resulting scalability improvement for a given coverage target.

**Categories and Subject Descriptors:** D.2.4 [Software Engineering]: Software/Program Verification—*Validation*

**General Terms:** Reliability, Verification

**Keywords:** GPU, CUDA, Parallelism, Symbolic Execution, Formal Verification, Automatic Test Generation, Virtual Machine

## 1. Introduction

Multicore CPUs and GPUs are making inroads into virtually all aspects of computing, from portable information appliances to supercomputers. Unfortunately, programming multicore systems to achieve high performance often requires many intricate optimizations involving memory bandwidth and the CPU/GPU occupancy. A majority of these optimizations are still being carried out manually. Given the sheer complexity of these optimizations in the context of actual problems, designers routinely introduce correctness and performance bugs. Locating these bugs using today’s commer-

cial debuggers is always a ‘hit-or-miss’ affair: one has to be *lucky* in so many ways, including (i) picking the right test inputs, (ii) ability to observe of data corruption (and be able to reliably attribute it to races), (iii) whether the compiler optimization match programmer assumptions, and (iv) whether the platform masks bugs because of the specific thread/warp scheduling algorithms used. If the execution deadlocks, one has to manually reason out the root-cause.

Recent formal and semi-formal analysis based tools [1–3] have improved the situation in many ways. They, in effect, examine whole classes of inputs and executions, by resorting to symbolic analysis or static analysis methods. They also analyze abstract GPU models without making hardware-specific thread scheduling assumptions. These tools also have many drawbacks. The first problem with predominantly static analysis based approaches is *false alarms*. False alarms waste precious designer time and may dissuade them from using a tool. Another limitation of today’s tools is that they do not help generate tests that achieve high code coverage. Such tests are important for unearthing compiler bugs or “unexpected” bugs that surface during hardware execution. Existing tools also do not cover one new data race category that we identify (we call it *warp-divergence race*). Compilation based approaches can, in many cases, eliminate the drudgery of GPU program optimization; however, their code transformation scripts are seldom separately formally verified.

We present a new tool framework called GKLEE for analyzing GPU programs with respect to important correctness and performance issues (the tool name coming from “GPU” and “KLEE [4]”). GKLEE profits from KLEE’s code base and philosophy of testing a given program using *concrete plus symbolic* (“concolic”) execution. GKLEE is the first concolic verifier and test generator tailored for GPU programs. Concolic verifiers allow designers to declare certain input variables as ‘symbolic’ (the remaining inputs are concrete).

In GKLEE, the execution of a program expression containing symbolic variables results in constraints amongst the program variables, including constraints due to conditionals, and explicit constraints (*assume* statements) on symbolic inputs. Conditionals are resolved by KLEE’s decision procedures (“SMT solvers [5]”) that find solutions for symbolic program inputs. This approach helps concolic verifiers do something beyond bug-hunting: they can automatically enumerate test inputs in a *demand-driven* manner. That is, if there is a control/branch decision that can be affected by some input, a concolic verifier can automatically compute and record the input value in a test which is valuable for downstream debugging. Recent experience shows that formal methods often have the biggest impact when they can compute tests automatically, exposing software defects and vulnerability [6–8].

The architecture of GKLEE is shown in Figure 1. It employs a C/C++ front-end based on LLVM-GCC (with our customized

\* Guodong Li started this project while a student of University of Utah.

extensions for CUDA syntax) to parse CUDA programs. It supports the execution of both CPU code and GPU code. GKLEE employs a new approach to model the symbolic state (recording the execution status of a kernel) with respect to the CUDA memory model.

**Contributions:** Our main contribution is a symbolic virtual machine (VM) to model the execution of GPU programs on open inputs. We detail the construction and operation of this virtual machine, showing exactly how it elegantly integrates error-detection and analysis, while not generating false alarms or missing execution paths when generating concrete tests. This approach also allows one to effect scalability/coverage tradeoffs. The following features are integrated into our symbolic VM approach:

- GPU programs can suffer from several classes of insidious data races. GKLEE finds such races (sometimes even in well-tested GPU kernels).
- GKLEE detects and reports occurrences of divergent thread warps (branches inside SIMD paths), as these can degrade performance. In addition, GKLEE guarantees to find deadlocks caused by divergent warps in which two threads may encounter different sequences of barrier (`__syncthreads()`) calls.
- GKLEE’s symbolic virtual machine can systematically generate concrete tests while also taking into account any input constraints the programmer may have expressed through `assume` statements.
- While tests generated by GKLEE guarantee high coverage, it may lead to test explosion. GKLEE employs powerful heuristics for reducing the number of tests. We evaluate these heuristics on a variety of examples and identify those heuristics that result in high coverage while still only generating fewer tests.
- We can automatically run GKLEE-generated tests on the actual hardware; one such experiment alerted us to the need for a new error-check type, which we have added to GKLEE: *has a volatile declaration been possibly forgotten?* This can help eliminate silent data corruption caused by reads that may pick up stale write values.
- We target two classes of memory access inefficiencies, namely non-coalesced global memory accesses and shared memory accesses that result in bank conflicts, and show how GKLEE can spot these inefficiencies, also “understanding” platform rules (*i.e.*, compute capability 1.x or 2.x). Some kernels originally thought free of these errors are actually not so.
- GKLEE’s VM incorporates the CUDA memory model within its concolic execution framework, while (i) accurately modeling the SIMD concurrency of GPUs, (ii) avoiding interleaving enumeration through an approach based on race checking, and (iii) scaling to large code sizes.
- GKLEE handles many C++/CUDA features including: struct, class, template, pointer, inheritance, CUDA’s variable and function derivatives, and CUDA specific functions.
- GKLEE’s analysis occurs on LLVM byte-codes (also targeted by Fortran and Clang). Byte-code level analysis can help cover pertinent compiler-induced bugs in addition to supporting future work on other binary formats.



**Figure 1.** GKLEE’s architecture.

**Roadmap:** § 2 explains the error-classes covered by GKLEE. § 3 presents GKLEE’s concolic verification: state model, memory

type inference, and concolic execution (§ 3.1) and error checking/analysis (§ 3.2). § 5 presents experimental results, covering issues pertaining to correctness checking/performance (§ 5.1) and test set generation/reduction (§ 5.2). § 6 presents related work and § 7 concludes.

## 2. Examples of our Analysis/Testing Goals

### 2.1 Basics of GPU Programs

GKLEE currently supports the CUDA [9] syntax (with OpenCL [10] to be addressed in future). A CUDA kernel is launched as an 1D or 2D *grid of thread blocks*. The total size of a 2D grid is  $\text{gridDim.x} \times \text{gridDim.y}$ . Each block at location  $\langle \text{blockIdx.x}, \text{blockIdx.y} \rangle$  has dimensions  $\text{blockDim.x}, \text{blockDim.y}$  and  $\text{blockDim.z}$ . Each block contains  $\text{blockDim.x} \times \text{blockDim.y} \times \text{blockDim.z}$  threads, with IDs  $\langle \text{threadIdx.x}, \text{threadIdx.y}, \text{threadIdx.z} \rangle$ . These threads can share information via *shared memory*, and synchronize via *barriers*. Threads belonging to distinct blocks must use the much slower *global memory* to communicate, and may not synchronize using barriers. The values of *gridDim* and *blockDim* determines the *configuration* of the system, *e.g.* the sizes of the grid and each block. For a thread, *blockIdx* and *threadIdx* give its block index in the grid and its thread index in the block respectively. For brevity, we use *gdim* to denote *gridDim*, *bid* for *blockIdx*, *bdim* for *blockDim*, and *tid* for *threadIdx*. The constraints  $\text{bid.*} < \text{gdim.*}$  for  $* \in \{x, y\}$  and  $\text{tid.*} < \text{bdim.*}$  for  $* \in \{x, y, z\}$  always hold. Groups of 32 (a “warp”) consecutively numbered threads within a thread block are scheduled at a time in a Single Instruction Multiple Data (SIMD) fashion.

### 2.2 CUDA Error Classes and Test Generation

#### 2.2.1 Deadlocks

Deadlocks occur when any two threads in a thread block fail to encounter the same *textually aligned* barriers [11], as in kernel `deadlock` below. Here, threads satisfying  $\text{tid.x} + i > 0$  invoke the barrier while the other threads do not:

```
__global__ void deadlock(int i) {
    if (tid.x + i > 0)
        { ...; __syncthreads(); } }
```

Random test input generation does not guarantee path coverage especially when conditionals are deeply embedded, whereas GKLEE’s *directed test generation* based on SMT-solving ensures coverage. While the basic techniques for such test generation have been well researched in the past, GKLEE’s contributions in this area include addressing the CUDA semantics and memory model, and detecting non-textually aligned barriers, a simple example of which is below. Here, the threads encounter different barrier calls if they diverge on the condition  $\text{tid.x} + i > 0$ .

```
if (tid.x + i > 0) { ...; __syncthreads(); }
else { ...; __syncthreads(); }
```

#### 2.2.2 Data Races

There are three broad classes of races: intra-warp races, inter-warp races, and device/CPU memory races. Intra-warp races can be further classified into intra-warp races without warp divergence, and intra-warp races with warp divergence.

**Intra-warp Races Without Warp Divergence:** Given that any two threads within a warp execute the same instruction, an intra-warp race (without involving warp divergence) has to be a write-write race. The following is an example of such a race which GKLEE can successfully report. In this example, writes to shared array `v[]` overlap; *e.g.*, thread 0 and 1 concurrently write four bytes beginning at `v[0]` (in a 32-bit system).

```
--global__ void race()
{ x = tid.x >> 2; v[x] = x + tid.x; }
```

**Intra-warp Races With Warp Divergence:** In a divergent warp, a conditional statement causes some of the threads to execute the *then* part while others execute the *else* part. But because of the SIMD nature, *both* parts are executed with respect to all the threads in *some unspecified order* (undefined in the standard). Thus, in example ‘race’, depending on the hardware platform: (i) the even threads may read *v* first, and then the odd threads write *v*; or (ii) the odd threads may write *v* and then the even threads may read *v*:

```
--global__ void race() {
    if (tid.x % 2) { ... = v; }
    else { v = ...; }
}
```

While on a given machine the results are predictable (either the *then* or the *else* happens first) an unpleasant surprise can result when this code is ported to a future machine where the *else* happens first (think of it as a “*porting race*”—race-like outcome that surfaces when the code is ported). The culprit is of course overlapped accesses across divergent-warp threads, but if *v* is a complicated array expression, this fact is virtually impossible to discern manually. GKLEE’s novel contribution is to detect such overlaps exactly regardless of the complexity of the conditionals or the array accesses. (For simplicity, we do not illustrate a variant of this example where both accesses are updates to *v*.)

This example also covers another check done by GKLEE: it reports the number of occurrences of divergent warps over the whole program.

**Inter-warp Races:** Inter-warp races could be read-write, write-read, or write-write: we illustrate a read-write race below. Here there is the danger that thread 0 and thread *bdim.x* – 1 may access *v[0]* simultaneously while these two threads also belong to different warps in a thread block.

```
--global__ void race() {
    v[tid.x] = v[(tid.x + 1) % bdim.x]; }
```

Testing may fail to reveal this bug because this bug is typically noticed only when the write by one thread occurs before the read by the other thread. However, the execution order of threads in a GPU is non-deterministic depending on the scheduling, and latencies of memory accesses. GKLEE guarantees to expose this type of race.

**Global Memory Races:** GKLEE also detects and reports races occurring on global device variables:

```
--device__ x;
--global__ void race()
{ ...conflicting accesses to x by two threads... }
```

### 2.2.3 Memory Access Inefficiencies

There are two kinds of memory access inefficiencies: bank conflicts and non-coalesced memory accesses. GKLEE reports their severity by reporting the absolute number and the percentage of accesses that suffer from this inefficiency, as described in § 5.1 in detail.

**Shared Memory Bank Conflicts:** Bank conflicts result when the threads in a half warp (for the CUDA compute capability 1.x model) or entire warp (for capability 1.2) access the same memory bank. GKLEE checks for conflicts by symbolically comparing whether two such accesses can fall into a memory bank.

**Non-coalesced Device Memory Accesses:** Non-coalesced memory accesses waste considerable bus bandwidth when fetching data from the device memory. Memory coalescing is achieved by following access rules specific to the GPU compute capability. GKLEE faithfully models all 1.x and 2.x compute capability coalescing rules, and can be run with the compute capability specified as a flag option (illustrates the flexibility to accommodate future such options from other manufacturers).

### 2.2.4 Test Generation

The ability to automatically generate high quality tests and verify kernels over all possible inputs is a unique feature of GKLEE. The BitonicSort (Figure 2) kernel taken from CUDA SDK 2.0 [9] sorts *values*’s elements in an ascending order. The steps taken in this kernel to improve performance (coalescing global memory accesses, minimizing bank conflicts, avoiding redundant barriers, and better address generation through bit operations) unfortunately end up obfuscating the code. Manual testing or random input-based testing does not ensure sufficient coverage. Instead, given a post-condition pertaining to the sortedness of the output array, GKLEE generates targeted tests that help exercise all conditional-guarded flows. Also, running this kernel under GKLEE by keeping all configuration parameters symbolic, we could learn (through GKLEE’s error message) that this kernel works only if *bdim.x* is a power of 2 (an undocumented fact).

Covering all control-flow branches can result in too many tests. GKLEE includes heuristics for test-case minimization, as detailed in § 4.

```
--shared__ unsigned shared[NUM];
inline void swap(unsigned& a, unsigned& b)
{ unsigned tmp = a; a = b; b = tmp; }

--global__ void BitonicKernel(unsigned* values) {
1:   unsigned int tid = tid.x;
2:   // Copy input to shared mem.
3:   shared[tid] = values[tid];
4:   __syncthreads();
5:
6:   // Parallel bitonic sort.
7:   for (unsigned k = 2; k <= bdim.x; k *= 2)
8:     for (unsigned j = k / 2; j > 0; j /= 2) {
9:       unsigned ixj = tid ^ j;
10:      if (ixj > tid) {
11:        if ((tid & k) == 0)
12:          if (shared[tid] > shared[ixj])
13:            swap(shared[tid], shared[ixj]);
14:        else
15:          if (shared[tid] < shared[ixj])
16:            swap(shared[tid], shared[ixj]);
17:      }
18:      __syncthreads();
19:    }
20:
21:   // Write result.
22:   values[tid] = shared[tid];
}
```

Figure 2. The Bitonic Sort Kernel

## 3. Algorithms for Analysis, Test Generation

Given a C++ program, the GKLEE VM (Figure 1) executes the following steps, in order, for each control-flow path pursued during execution (to a first approximation, one can think of a control-flow tree and imagine all the following steps occurring *for each tree path* and *for each barrier interval along the path*). Deadlock checking and test generation occur per path (spanning barrier intervals; the notion of barrier intervals is explained in § 3.2). GKLEE checks for barriers being textually aligned and applies a canonical schedule going from one textually aligned barrier to another one.

- Create the GPU memory objects as per state model; infer memory regions representing GPU memory dynamically (§ 3.1)
- Execute GPU kernel threads via the *canonical schedule* (§ 3.2)

- Fork new states upon non-determinism due to symbolic values, apply search heuristics and path reduction if needed (§ 2.2.4)
- In a state, at the end of the barrier interval or other synchronization points, perform checks for data races, warp divergence, bank conflicts, and non-coalesced memory accesses (§ 3.2)
- When execution path ends, report deadlocks and global memory races (if any), perform test-case selection, and write out a concrete test file (§ 4)

### 3.1 LLVM<sub>cuda</sub>

The front-end compiles a C/C++ kernel program into LLVM bytecode with extensions for CUDA. Figure 3 shows an excerpt of its syntax. One main extension is that a variable is attached with its memory sort indicating which memory it refers to.

|              |             |                                                                                                                                                    |                                                                                                                                                      |
|--------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| $\tau$       | $\coloneqq$ | $\tau_*, \tau_l, \tau_s, \tau_d, \tau_h$                                                                                                           | memory sort                                                                                                                                          |
| $var$        | $\coloneqq$ | $var_{cuda} \mid v : \tau$                                                                                                                         | variable                                                                                                                                             |
| $var_{cuda}$ | $\coloneqq$ | $tid, bid, \dots$                                                                                                                                  | CUDA built-in                                                                                                                                        |
| $lab$        | $\coloneqq$ | $l_1, l_2, \dots$                                                                                                                                  | label                                                                                                                                                |
| $e$          | $\coloneqq$ | $var \mid n$                                                                                                                                       | atomic expression                                                                                                                                    |
| $instr$      | $\coloneqq$ | $br v lab1 lab2$<br>  $br lab$<br>  $store e v$<br>  $v = load v$<br>  $v = binop e e$<br>  $v = alloc n \tau$<br>  $v = getelptr v e$<br>  $sync$ | conditional branch<br>unconditional jump<br>store<br>load<br>binary operation<br>memory allocation<br>address calculation<br>synchronization barrier |

Figure 3. Syntax of LLVM<sub>cuda</sub> (excerpt)

Figure 4 gives a small-step operational semantics of LLVM<sub>cuda</sub> using the following elements. A program is a map from labels to instructions; a value consists of one or more bytes (our model has byte-level accuracy); a memory or store maps variables to values, where each variable is assigned an integer address by the compiler. GKLEE models CUDA’s memory hierarchy in a symbolic state as in Figure 5: each thread has its own local memory and stack (we combine them into a single local state in GKLEE); the threads in a block share the shared memory; and all blocks share the device memory and the CPU memory. Each thread has a program counter (pc) recording the label of the current instruction.

|                 |             |                                                         |
|-----------------|-------------|---------------------------------------------------------|
| Program         | $\coloneqq$ | $\mathbb{L} \subset lab \mapsto instr$                  |
| Value           | $\coloneqq$ | $\mathbb{V} \subset \text{byte}^+$                      |
| Memory, Store   | $\coloneqq$ | $M \subset var \mapsto \mathbb{V}$                      |
| Shared state    | $\coloneqq$ | $\mathbb{M} \subset (bid \mapsto M) \times M \times M$  |
| Local state     | $\coloneqq$ | $\sigma \subset var \mapsto \mathbb{V}$                 |
| Data State      | $\coloneqq$ | $\Sigma \subset (tid \mapsto \sigma) \times \mathbb{M}$ |
| Program counter | $\coloneqq$ | $\mathbb{P} \subset tid \mapsto lab$                    |
| State           | $\coloneqq$ | $\Phi \subset \Sigma \times \mathbb{P}$                 |

A state  $\Phi$  consists of a data state  $\Sigma$  and a PC  $\mathbb{P}$ . Thread  $t$ ’s pc is given by  $\mathbb{P}[t]$ . Notations  $\Sigma[v]$  and  $\Sigma[v \mapsto k]$  indicate reading  $v$ ’s value from  $\Sigma$  and updating  $v$ ’s value in  $\Sigma$  to  $k$  respectively. Notation  $\Sigma \vdash e$  evaluates  $e$ ’s value over  $\Sigma$ , e.g.  $\Sigma \vdash e_1 = e_2$  is true if  $\Sigma[e_1] = \Sigma[e_2]$ . The semantics of an instruction is modeled by a state transition, e.g. the execution of an instruction br  $l'$  at thread  $t$  updates the  $t$ ’s pc to  $l'$  and keeps the data state unchanged. Rule 9 specifies the barrier’s semantics: a thread can proceed to the next instruction only after all the threads in the same block have reached the barrier. As indicated by other rules, non-barrier instructions are executed without synchronizing with other threads (except for lock-step requirement for intra-warp threads).

**Memory Typing.** After a source program is compiled into LLVM bytecode, it is difficult to determine which memory is used when an

1.  $\frac{\mathbb{L}[l] = \text{br } l'}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma, \mathbb{P}[t \mapsto l'])}$
2.  $\frac{\mathbb{L}[l] = \text{br } v \ l_1 \ l_2 \ \Sigma \vdash v}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma, \mathbb{P}[t \mapsto l_1])} \quad \frac{\mathbb{L}[l] = \text{br } v \ l_1 \ l_2 \ \Sigma \vdash \neg v}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma, \mathbb{P}[t \mapsto l_2])}$
3.  $\frac{\mathbb{L}[l] = (v = \text{alloc } n \ \tau)}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma[(v:\tau) \mapsto 0^n], \mathbb{P}[t \mapsto l+1])}$
4.  $\frac{\mathbb{L}[l] = (v_2 = \text{getelptr } v_1 : \tau \ e)}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma[v_2:\tau \mapsto \Sigma[v_1] + \Sigma[e]], \mathbb{P}[t \mapsto l+1])}$
5.  $\frac{\mathbb{L}[l] = (v = \text{binop } e_1 : \tau \ e_2 : \tau_.)}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma[v:\tau \mapsto \text{binop}(\Sigma[e_1], \Sigma[e_2])], \mathbb{P}[t \mapsto l+1])}$
6.  $\frac{\mathbb{L}[l] = (v_2 = \text{load } v_1 : \tau) \ \tau \neq \tau_.}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma[v_2:\tau \mapsto \Sigma[v_1]], \mathbb{P}[t \mapsto l+1])}$
7.  $\frac{\mathbb{L}[l] = (\text{store } e \ v : \tau) \ \tau \neq \tau_.}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma[v:\tau \mapsto \Sigma[e]], \mathbb{P}[t \mapsto l+1])}$
8.  $\frac{v : \tau_-. \ ((v':\tau') \mapsto k) \in \Sigma \quad \Sigma \vdash v' \leq v \leq v' + \text{sizeof}(k)}{v : \tau'}$
9.  $\frac{\mathbb{L}[l] = \text{sync} \quad \forall t' \in \text{blk.of}(t) : \mathbb{P}[t'] \in \{l, l+1\}}{(\Sigma, \mathbb{P}) \rightarrow_t (\Sigma, \mathbb{P}[t \mapsto l+1])}$

Figure 4. Operational semantics of LLVM<sub>cuda</sub> (excerpt)



Figure 5. Components in a Symbolic State.

access is made because the address of this access may be calculated by multiple bytecode instructions. We employ a novel and simple GPU-specific memory sort inference method by computing for each (possibly symbolic) expression a sort  $\tau$  which is either  $\tau_*$  (unknown),  $\tau_l$  (local),  $\tau_s$  (shared),  $\tau_d$  (device), or  $\tau_h$  (host), as per the rules (here we present the simplified version) in Figure 4. In our experience, these rules have been found to be sufficiently precise on all the kernels we have applied GKLEE to.

For example, Rule 4 models *getelptr* which refers to pointer dereferencing where  $v_2$ ’s type is obtained from  $v_1$ ’s type. Rule 6 indicates that a load instruction can be executed only if the address type is known; and the value loaded from memory has unknown type. Rule 8 says that a valid type is found for  $v$  if there exists a memory object associated with  $v'$  such that  $v$ ’s value falls within this object. Basically it searches the memory hierarchy to locate the target memory when the previous analysis fails to find  $v$ ’s type. If  $v$  represents a pointer which can refer to multiple objects (determined by SMT solving), then multiple states are generated, each of which needs to apply this rule. This often reveals memory type related bugs in the source kernel, e.g. mixing up the CPU and GPU memory. We plan to use Clang’s ongoing support for LLVM+CUDA [12] to simplify such inference. More semantics rules (with sort inference) are available in [13].

**State Model.** In a symbolic state in GKLEE, each thread (in a block) has its own stack and local memory; each block has a shared memory; all blocks can access the device memory in the GPU and the main memory in the CPU. Figure 5 gives an example state for a GPU with grid size  $n \times m$  and block size  $32 \times i$ . Each block consists of  $i$  of warps; each warp contains 32 threads. To support test generation, a state also contains a path condition recording the branching decisions made so far.

**CUDA Built-in Variables.** CUDA built-in variables include the block size, block id, thread id, and so on. The executor accesses these variables during the execution. GKLEE sets their values in respective memories before the execution. For example, the variable for the thread id,  $tid$ , is assigned *three* 32 bit words in the local memory of each thread. These words record the  $tid$ 's values in dimension  $x, y$  and  $z$  respectively.

|                             |     |
|-----------------------------|-----|
| tid : $\tau_l$ (96b)        | ... |
| {x : 32b, y : 32b, z : 32b} | ... |

### 3.2 Canonical Scheduling and Race Checking

We now focus on the interleavings of all the threads within a thread block *from one barrier call to another* (global memory accesses across thread blocks are discussed later). Naively interleaving these threads will result in an astronomical number of interleavings. GKLEE employs the following schedule generation approach:

- Pursue just one schedule, namely the canonical schedule shown in Figure 6 where each thread is fully executed within a barrier interval before moving on to another thread.
- During the execution of all the threads in the current barrier interval, build a read-set  $\mathcal{R}$  and a write set  $\mathcal{W}$ , recording in them (respectively) all loads and stores (these will be in mixed symbolic/concrete form) encountered in the execution.
- After the check points (as shown in Figure 6), build all possible conflict pairs, where a pair  $\langle r_1, w_1 \rangle$  or  $\langle w_2, w_1 \rangle$  is any pair that could potentially race or other conflicts.
- Through SMT-solving, decide whether any of these conflicts are races. If none are races (do not overlap in terms of a memory address), then the canonical schedule is equivalent to any other schedule. Thus, we can carry on to the next barrier interval with the next-state calculated as per the canonical schedule.

Canonical scheduling is sound for safety properties (will neither result in omissions or false alarms). The caveats that go with this argument are that C/C++ has no standard shared memory consistency semantics to define safe compiler optimizations, and the CUDA programming guide [14] provides only an informal characterization of CUDA's weak execution semantics. Assume that the instructions within CUDA threads in a barrier interval can be reordered; then under no conflicts (DRF), reordering transformations are sound [15]. This result also stems from [16] where it is shown that race detectors for sequential consistency can detect the *earliest race* even under weak orderings. One can also infer this result directly from [17] where it is shown that under the absence of conflict edges, the *delay set* (set of required program orderings) can be empty. We further elaborate on the soundness of the canonical scheduling method (also considering SIMD execution) in [13].

Consider the following two schedules, we record the writes and reads on  $v_1$  and  $v_2$ , and see whether these accesses overlap at the end point (the check is denoted by a “ $!$ ”). A race occurs in schedule 2 if and only if it also occurs in schedule 1.

$$\begin{aligned} \text{Schedule 1 : } & \frac{W \ v_1}{\cdot} \xrightarrow{t_1} \cdot \frac{W \ v_2}{\cdot} \xrightarrow{t_1} \cdot \frac{R \ v_1}{\cdot} \xrightarrow{t_2} \cdot \rightarrow \cdots \rightarrow \cdot (!) \\ \text{Schedule 2 : } & \frac{R \ v_1}{\cdot} \xrightarrow{t_2} \cdot \frac{W \ v_2}{\cdot} \xrightarrow{t_1} \cdot \frac{W \ v_1}{\cdot} \xrightarrow{t_1} \cdot \rightarrow \cdots \rightarrow \cdot (!) \end{aligned}$$

**Intra-warp scheduling.** A schedule is a sequence of state transitions made by the threads. The threads within a warp are executed in lock-step manner, and if they diverge on a condition, then one side (*e.g.* the “*then*” side) is executed first, with the threads in the other side blocked; and then the other side is executed (this is sound after checking for the absence of intra-warp races). (Note that GKLEE executes LLVM byte-codes, and is therefore able to capture the effect of compiler optimizations.)

In GKLEE, we schedule these threads in a lock-step manner, and provide an option to not execute the two sides sequentially. Now we show that these two scheduling methods are equivalent if no data race occurs. Specifically, the sequence (up to the next joint point)

$$\Phi_0 \xrightarrow{c} \Phi_1 \xrightarrow{c} \Phi_2 \xrightarrow{c} \cdots \xrightarrow{c} \Phi_n \xrightarrow{\neg c} \cdots \xrightarrow{\neg c} \Phi_{2n}$$

can be shuffled into the following one provided that it is race-free. We use  $\xrightarrow{c}_{t_i}$  to indicate that thread  $t_i$  makes the transition with condition  $c$ .

$$\Phi_0 \xrightarrow{c} \Phi_1 \xrightarrow{\neg c} \Phi'_1 \xrightarrow{c} \Phi'_2 \xrightarrow{c} \cdots \xrightarrow{c} \Phi'_{2n-1} \xrightarrow{\neg c} \Phi_{2n}$$

Since  $c$  exclusive-or ( $\oplus$ )  $\neg c$  holds for a thread, the sequence is equivalent to the following one (where  $\Phi''_n = \Phi_{2n}$ ) which GKLEE produces. This is the *canonical schedule for intra-warp steps*.

$$\Phi_0 \xrightarrow{c \oplus \neg c} \Phi''_1 \xrightarrow{c \oplus \neg c} \Phi''_2 \xrightarrow{c \oplus \neg c} \cdots \xrightarrow{c \oplus \neg c} \Phi''_n$$

Hence GKLEE's intra-warp scheduling is an equivalent model of the CUDA hardware's. It eases formal analysis and boosts the performance of GKLEE. Similarly, as in Figure 6 we can reduce a race-free schedule to a canonical one for inter-warps, multi-blocks, and barrier intervals (BIs). These transition relations are represented by  $\rightarrow_w$ ,  $\rightarrow_b$ , and  $\rightarrow_{bi}$  respectively.



**Figure 6.** Canonical scheduling and conflict checking in GKLEE.

**Conflict checking:** Figure 6 indicates that GKLEE supports various conflict checking:

- Intra-warp race (denoted as  $!$ ), checked at the end of a warp. Threads  $t_1$  and  $t_2$  incur such a WW race if they write different values to the same memory location in the same store instruction:  $\exists l : \mathbb{L}[l] = \text{store } e \ v \wedge \mathbb{P}[t_1] = \mathbb{P}[t_2] = l \wedge \Sigma \vdash v_{t_1} = v_{t_2} \wedge e_{t_1} \neq e_{t_2}$  (GKLEE issues a warning if  $e_{t_1} = e_{t_2}$ ). For a diverged warp, RW and WW races are also checked by considering whether the accesses in both sides can conflict (discussed in Section 2.2).
- Inter-warp race (denoted as  $!$ ), checked at the end of a block for each BI. Thread  $t_1$  and  $t_2$  (in different warps) incur such a race if they access the same memory location, and one of them is a write, and different values are written if both accesses are writes. Formally, let  $R\langle t, v, e \rangle$  and  $W\langle t, v, e \rangle$  denote that thread  $t$  reads  $e$  from location  $v$  and writes  $e$  to  $v$  respectively. Then a RW race occurs if  $\exists R\langle t_1, v_1, e_1 \rangle, W\langle t_2, v_2, e_2 \rangle : \Sigma \vdash v_1 = v_2$  (or the case of exchanging  $t_1$  and  $t_2$ ); a WW race occurs if  $\exists W\langle t_1, v_1, e_1 \rangle, W\langle t_2, v_2, e_2 \rangle : \Sigma \vdash v_1 = v_2 \wedge e_1 \neq e_2$  (again GKLEE will prompt for investigation if  $e_{t_1} = e_{t_2}$ ).
- Global race (denoted as  $!$ ), checked at the end of the kernel execution. Similar to inter-warp race but on the device or CPU memory. Deadlocks are also checked at  $!$ .

Conflict checking is performed at the byte level to faithfully model the hardware. Suppose a thread reads  $n_1$  bytes starting from address  $a_1$ , and another thread writes  $n_2$  bytes starting from address  $a_2$ , then a overlap exists iff the following constraint holds.

```

__global__ void histogram64Kernel(unsigned *d_Result,
                                 unsigned *d_Data, int dataN){
    const int threadPos =
        ((threadIdx.x & (~63)) >> 0) |
        ((threadIdx.x & 15) << 2) |
        ((threadIdx.x & 48) >> 4); ...
    __syncthreads();
    for(int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
        pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) {
        unsigned data4 = d_Data[pos]; // top 10 is symb. for t5,
        ...
        addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU); }
    __syncthreads(); ...
}

inline void addData64(unsigned char *s_Hist, int threadPos,
                      unsigned int data) {
    // Race of T5 and T13 with threadPos of 20,52 resp.
    s_Hist[threadPos + IMUL(data, THREAD_N)]++; //<- Race! }

```

**Figure 7.** Write-write race in Histogram64 (SDK 2.0)

$$(a_1 \leq a_2 \wedge a_2 < a_1 + n_1) \vee (a_2 \leq a_1 \wedge a_1 < a_2 + n_2)$$

Without abstracting pointers and arrays, GKLEE inherits KLEE’s methods for handling them: suppose there are  $n$  arrays declared in a program. Then, when  $*p$  is evaluated, for every array the concolic executor will check whether  $p$  can fall within the array, spawning a new state if so (works particularly well for CUDA, where pointers are usually used for indexing array elements).

Note that our method reports accurate results in contrast to static analysis methods such as [18] (where no decision procedures are applied) and [1] (which uses SMT solving but relies heavily on abstractions). The method in [2] uses run-time checking to rule out false alarms produced by its static analyzer; while GKLEE builds all the checks into its VM and produces no false alarms.

### 3.3 Power of Symbolic Analysis

We now present how GKLEE detected a WW race condition in `histogram64Kernel` (Figure 7), a CUDA SDK 2.0 kernel. Since the invocation of this kernel in `main` passes `d_Data` that can be quite large, a user of GKLEE (in this case, us) chose to keep only the first ten locations of this array symbolic, and the rest concrete at value 0. (This is the only manual step needed; without this, GKLEE’s solver will be inundated, trying to enumerate every array location). GKLEE now determines that `addData64` can be called concurrently by two distinct threads. Drilling into this function, GKLEE generates constraints for `s_Hist[threadPos + IMUL(data, THREAD_N)]++` (not marked `atomic`) to race. The SMT solver picks two thread IDs 5 and 13; for this, `threadPos` assumes values 20 and 52, respectively. What flows into `data` is `data4 >> 26 & 0x3FU`, where `data4` obtains the value of `d_Data[pos]`. Since the top 10 elements of `d_Data[DATA_N]` are symbolic, thread 5 assigns a symbolic value denoted by `d_Data[5]` to `data4`, while thread 13 assigns the concrete value of 0 to `d_Data[13]`. The SMT solver now solves  $20 + ((d\_Data[5] \gg 21) \& 2016) = 52 + 0$  ( $\gg 21$  changed to  $\gg 21$  because `THREAD_N` is 32), resulting in `d_Data[5]` obtaining value 0x04040404 which causes a race! The user not only obtains an automatic race alert, but also the concrete input of 0x04040404 to set `d_Data[5]` to, in case they want to study this race through any other means.

## 4. Test Generation

During its symbolic execution, GKLEE’s VM has the ability to *fork* two execution paths whenever it “encounters a non-deterministic situation;” *e.g.* when a conditional is evaluated and both choices are true, or when a symbolic pointer is accessed, and it may point to multiple memory objects. GKLEE organizes the resulting execution states as a tree. The initial state of the GPU kernel forms the root of

this tree. It then searches the state space guided by various search reduction heuristics.

**The essence of the VM executor:** GKLEE can be regarded as a symbolic model checker (for GPU kernels) with the symbolic state modeling the hardware state and the transitions modeling non-determinism due to symbolic inputs.

With this view, it is natural that GKLEE supports facilities such as state caching and search heuristics (*e.g.* depth-first, weighted-random, bump-merging, *etc.*), all of which are inherited from KLEE. The checks discussed in Section 3 are essentially built-in global safety properties examined at each state. In the state space tree, a path from the root to a leaf represents a valid computation with a path condition recording all the branching decisions made by all the threads. At a leaf state, we can generate a test case by solving the satisfiability of this path condition. This ability makes GKLEE a powerful test generator.

**Soundness and completeness of the test generator:** Given a race free kernel with a set of symbolic inputs, GKLEE visits a path if and only if there exists a schedule where the decisions made by threads (recorded in the path condition) are feasible.

Note that the feasibility of a path condition is calculated by SMT solving, which is precise without any approximation. At the first glance, the completeness of test generation may be not be obvious since we consider only one (canonical) schedule, while another schedule may apply the branchings in a different order.

To clarify this, consider the following situation where thread  $t_0$  ( $t_1$ ) branches on conditions  $c_{0,0}$  ( $c_{1,0}$ ):

$$\begin{array}{ll} t_0 & t_1 \\ \text{if } (c_{0,0}) \dots; & \text{if } (c_{1,0}) \dots; \end{array}$$

If  $t_0$  executes before  $t_1$ , then a depth-first search visits 4 paths with path conditions  $c_{0,0} \wedge c_{1,0}$ ,  $c_{0,0} \wedge \neg c_{1,0}$ ,  $\dots$ . If  $t_1$  executes before  $t_0$ , then the 4 path conditions become  $c_{1,0} \wedge c_{0,0}$ ,  $c_{1,0} \wedge \neg c_{0,0}$ ,  $\dots$ . The commutativity of the  $\wedge$  operator ensures, under the race-free constraint, the equivalence of these two path sets. Hence, it suffices to consider only one canonical schedule in test generation as in conflict checking (Section 3).

**Example.** Consider the Bitonic kernel running on one block with 4 threads. Suppose the input *values* is of size 4 and has symbolic value  $v$ . Lines 1-4 copy the input to *shared*:  $\forall i \in [0, 3] : shared[i] = v[i]$ . For thread 0, since lines 7-8 involve no symbolic values, they are executed concretely. In the first iteration of the inner loop, we have  $k = 2$ ,  $j = 1$ , and  $ixj = 1$ . The conditional branch at line 10 is evaluated to be true; so does that at line 11. Then the execution reaches the branch at line 12. GKLEE queries the constraint solver to determine that both branches are possible; it explores both paths and proceeds to the loop’s next iteration. Finally the execution terminates with 28 paths (and test cases).

**Coverage Directed State/Path Reduction.** Given that a kernel is usually executed by a large number of threads, there is a real danger, especially with complex/large kernels, that multiple threads may end up covering some line/branch while no threads visit other lines/branches.<sup>1</sup> We have experimented with several heuristics that help GKLEE achieve *coverage directed* search reduction. Basically, we keep track of whether some feature (line or branch) is covered by all the threads at least once, or some thread at least once. These measurements help GKLEE avoid exploring states/paths that do not result in added coverage.

Another usage of these metrics is to perform *test case selection* which still explores the entire state space, but outputs only a subset of test cases (for downstream debugging use) after the entire execution is over, with no net loss of coverage. Details of these heuristics

are discussed in § 5.2. To the best of our knowledge, coverage measures for SIMD programs have not been previously studied.

## 5. Experimental Results

As described in Section 1, a GPU kernel along with a CPU driver is compiled into LLVM bytecode, which is symbolically executed by GKLEE. Since GKLEE can handle GPU and CPU style code, we can mix the computation of CPU and GPU, *e.g.* execute multiple kernels in a sequence.

```
CPU code; GPU code; CPU code; GPU code; ...
```

**Driver.** The user may give as input a kernel file to test together with a driver representing the main (CPU side) program. To cater for the need of LLVM-GCC, we redefine some CUDA specific directives and functions, *e.g.* we use C attributes to interpret them, as illustrated by the following definition of `__shared__`.

```
#define __shared__  
    __attribute__((section ("__shared__")))  
  
#define cutilSafeCall(f)  
void cudaMalloc(void** devPtr, size_t size) {  
    *devPtr = malloc(size);  
}  
void cudaMemcpy(void* a, void* b, size_t size, ...)  
{ memcpy(a,b,size); }
```

We show below an example driver for the Bitonic Sort kernel. The user specifies what input values should have symbolic values; and may place `assert` assertions anywhere in the code, which will be checked during execution. Particularly, the pre- and post- conditions are specified before and after the GPU code respectively. Function `__begin_GPU(NUM)` (a more general format is `__begin_GPU(bdim.x, bdim.y, bdim.z, gdim.x, gdim.y, gdim.z)`) specifies that the x dimension of the block size is NUM.

```
int main() {  
    int values[NUM];  
    gklee_make_symbolic(values, NUM, "input");  
  
    int* dvalues;  
    cutilSafeCall(cudaMalloc((void**)&dvalues,  
                           sizeof(int)*NUM));  
    cutilSafeCall(cudaMemcpy(dvalues, values,  
                           sizeof(int)*NUM, cudaMemcpyHostToDevice));  
  
    // <<<...>>>(BitonicKernel(dvalues))  
    __begin_GPU(NUM);           // block size = <NUM>  
    BitonicKernel(dvalues);  
    __end_GPU();  
  
    // the post-condition  
    for (int i = 1; i < NUM; i++)  
        assert(dvalues[i-1] <= dvalues[i]);  
  
    cutilSafeCall(cudaFree(dvalues));  
}
```

A concrete GPU configuration can be specified at the command line. For instance, option `-blocksize=[4,2]` indicates that each block is of size  $4 \times 2$ . These values can also be made symbolic so as to reveal configuration limitations.

### 5.1 Results I: Symbolic Identification of Issues

GKLEE supports (through command-line arguments) bank conflict detection for 1.x (memory coalescing checks cover 1.0 & 1.1, and 1.2 & 1.3), as well as 2.x device capabilities. Table 1 presents results from SDK 2.0 kernels while Table 2 presents those from

SDK 4.0 (many of these are written for 2.x). These are widely publicized kernels. Our results are with respect to symbolic inputs.

**Tables (1 and 2):** (#T denoting the number of threads analyzed) asserts that, under valid configurations, (i) all barriers were found to be well synchronized; (ii) the functional correctness is verified (w.r.t the configurations); *but only the canonical schedule is considered for cases with races (marked with \*)* (thus for cases with fatal races, we are unsure of the overall functional correctness); (iii) performance defects (to specific degrees) were found in many kernels; (iv) two races were observed (Histogram64 and RadixSort kernels); and (v) several alerts pertaining to the use of `volatile` declarations were reported. ‘WW’ denotes write-write races; they are marked *benign* (ben.) if the same value is written in our concrete execution trace. The computation is expected to be deterministic.

The race in Radix Sort was within function

`radixSortBlockKeysOnly()` involving `sMem1[0] = key.x` for distinct `key.x` written by two threads. In Histogram64, we mark the race WW<sup>1</sup> as we are unsure whether `s_Hist[. . .]++` of Figure 7 executed by two threads *within one warp* is fatal (apparently, CUDA guarantees<sup>2</sup> a net increment by 1). It is poor coding practice anyhow (we note correctness as ‘Unknown’).

One row result is presented for Bank Conflicts, Memory Coalescing, and Warp Divergence, this row averaging over barrier intervals. The 71% for Scan Best under Bank Conflict (compute capability 2.x) is obtained by: 14 BIs were analyzed, and out of it, 10 had bank conflicts, which is 71%. All other “z%” entries may be read similarly. This sort of a feedback enables a programmer to attempt various optimizations to improve performance. When a kernel’s execution contains multiple paths (states), the average numbers for these paths are reported. Also, with GKLEE’s help, we tried a variety of configurations (*e.g.* symbolic configurations) and discovered undocumented constraints on kernel configurations and inputs.

To show that the numbers reported by GKLEE track CUDA profiler reports, we employed GKLEE-generated concrete test cases and ran selected kernels on the Nvidia GTX 480 hardware. GKLEE includes a utility script, `gklee-replay`, that compiles the kernels using `nvcc`, executes them on the hardware and optionally invokes the NVIDIA command line profiler (which is the back end to their Compute Visual Profiler). We found GKLEE’s findings to be in agreement with that discovered by the profiler. GKLEE’s statistics can be used for early detection of these performance issues on symbolic inputs.

**Volatile Checking Heuristic** GKLEE employs a heuristic to help users check for potentially missed volatile qualifiers. Basically, GKLEE analyzes for data sharings between threads within one warp involving two *distinct* SIMD instructions. The gist of an example (taken from the CUDA SDK 2.0) when it was compiled for device capability of 2.x, was as follows: a sequence ‘`a;b`’ occurred inside a warp where SIMD instruction ‘`a`’ writes a value into addresses  $a_1$  and  $a_2$  on behalf of  $t_0$  and  $t_1$ , respectively; and SIMD instruction ‘`b`’ reads  $a_0$  and  $a_1$  in  $t_0$  and  $t_1$ , respectively. Now  $t_1$  was meant to see the value written into  $a_1$ , but it did not, as the value was held in a register and not written back (a volatile declaration was missing in the SDK 2.0 version of the example). An Nvidia expert confirmed our observation and has updated the example to now have the volatile declaration.

We now provide a few more details on this issue. The SDK 4.0 version of this example *has* the volatile declaration in place. We exposed this bug when we took a newer release of the `nvcc` compiler (released around SDK 4.0 and does volatile optimizations), compiled the SDK 2.0 version of this example (which omits the volatile), ran the program on our GTX 480 hardware, finding incor-

<sup>1</sup> We have extended GKLEE’s symbolic VM to measure statement and branch coverage in terms of LLVM byte-code instructions.

<sup>2</sup> As confirmed through discussions with engineers at Nvidia.

rect results emerging. The solution in GKLEE is to flag for potentially missed volatiles in the aforesaid manner; in future, we hope to extend GKLEE to “understand” compiler optimizations and deal with this issue more thoroughly.

**Table 3** compares the execution times of GKLEE and our functional correctness checking tool PUG [1]. This result shows the pros and cons of a full SMT based static analyzer (like PUG) or a testing based approach (like GKLEE) which is far more scalable. We performed experiments on a laptop with an Intel Core(TM)2 Duo 1.60GHz processor and 2GB memory. Here the GPU times in GKLEE count in sanity checking and test generation. Similar to GKLEE, PUG also sequentializes the threads and unrolls the loops when checking functional correctness. GKLEE outperforms PUG due partially to its various optimizations such as expression rewriting, value concretization, constraint independence, and so on. A more important factor is that GKLEE is a *concolic* tool which simplifies the expressions on-the-fly and puts much less burden to the SMT solver, in addition to generating concrete tests, which PUG does not. Both tools perform poorly on the “Bitonic Sort” kernel since the relation between this kernel’s input elements are complicated, *e.g.* thus GKLEE needs to explore many paths. Section 2.2.4 presents GKLEE’s reduction heuristics to ameliorate this.

As an added check, we tested GKLEE on the same 57 kernels used in [1]. GKLEE found the same 2 real bugs (one deadlock and one WR race). It also revealed that 4 of other kernels contain functional correctness bugs.

## 5.2 Results II: Testing and Coverage

We assess GKLEE with respect to newly proposed coverage measures and coverage directed execution pruning. In Table 4, we attempt to measure the source-code coverage by converting the given kernel into a sequential version (through Perl scripts) and applying the gcov tool (better means are part of future work). The point is that source-code coverage may be deceptively high, as shown (“a/b” means “statements/branches” covered; collectively, we call this a *target*). This is the reason we rely upon only byte-code measures, described in the sequel.

GKLEE first generated tests for the shown kernels covering all feasible paths, and subsequently performed *test case selection*. For example, it first generated the 28 execution paths of Bitonic Sort; then it trimmed back the paths to just 5 because these five tests covered all the statements and branches at the byte-code level.

Four byte-code based target coverage measures were assessed first: (i) avg.  $\text{Cov}^t$  measures the number of targets covered by threads across the whole program, averaged over the threads, (ii) max.  $\text{Cov}^t$  that measures the maximum by any thread, (iii) avg.  $\text{CovBI}^t$  computes  $\text{Cov}^t$  separately for each barrier interval and reports the overall average, and (iv) max.  $\text{CovBI}^t$  is similar to avg.  $\text{CovBI}^t$  except for taking a maximum value. From Table 4, we conclude that the maximum measures give an overly optimistic impression, so we set them aside. We choose avg.  $\text{CovBI}^t$  for our baseline because activities occurring within barrier intervals are closely related, and hence separately measuring target coverage within BIs tracks programmer intent better.

Armed with avg.  $\text{CovBI}^t$  and min #tests, we assess several benchmarks (Table 5) with ‘No Reductions’, and two test reduction schemes. Runs with ‘No Reductions’ and no *test case selection* applied show the total number of paths in the kernels, and the upper limits of target coverage (albeit at the expense of considerable testing time).  $\text{Red}_{TB}$  is a reduction heuristic where we separately keep track of the coverage contributions by different threads. We continue searching till each thread is given a chance to hit a test target. For instance, in one barrier interval, if one target is reachable by all the threads, we continue exploring all these threads; but if the same target is reachable again (say in a loop), we cut off

the search through the loop. In contrast,  $\text{Red}_{BI}$  only looks for some thread reaching each target; once that thread has, subsequent thread explorations to that target are truncated (more aggressive reductions). While the coverage achieved is nearly the same (due to the largely SIMD nature of the computations), it is clear that  $\text{Red}_{TB}$  is a bit more thorough.

The overall conclusion is that to achieve high target coverage (virtually the same coverage as with ‘No Reductions’), reduction heuristics are of paramount importance, as they help contain test explosion. Specifically, the number of paths explored with reductions is much lower than that done with ‘No Reductions.’ A powerful feature of GKLEE is therefore its ability to output these minimized high-quality tests for downstream debugging.

Additional sanity-checking: we generated purely random inputs (as a designer might do); in all cases, GKLEE’s test generation and test reduction heuristics provided far superior coverage with far fewer tests.

## 6. Related Work

Traditional CUDA program debuggers [19–21] do not solve path constraints to home into relevant inputs that can trigger bugs. They examine bugs that occur only within platform executions.

Symbolic techniques for program analysis go back to works such as [22] with concolic versions proposed in [6, 8] and more recently in KLEE [4]. GKLEE’s approach is based on [4] which has inspired many projects [7] similar to ours. Concolic-execution based solvers for special domains also exist. None of these methods incorporate ways to deal with SIMD concurrency in GPUs and look for GPU-specific correctness or performance issues.

Except for GKLEE, there are only few GPU-specific checkers reported in the past. Table 6 gives a comparison of these tools. An instrumentation based technique is reported [3] to find races and shared memory bank conflicts. This is an ad-hoc testing approach, where the program is instrumented with checking code, and only those executions occurring in a platform-specific manner are considered. A similar method [2] is used to find races with the help of a static analysis phase. Static analysis is performed first to locate possible candidates so as to reduce the runtime overheads caused by instrumented code. These runtime methods cannot accept symbolic inputs and verify function correctness on open inputs, not to mention test generation. Moreover GKLEE supports a rich set of C++ language features (including those considered specifically in tools such as [23]) which other tools do not handle. In [24], a static analysis based method for divergence analysis and code optimization is presented.

Aiken and Gay [18] proposed a type system to check global synchronization errors by applying a simple single-value analysis, which may produce false alarms by rejecting correct programs. GKLEE uses SMT solving to compare expressions and is more precise.

While the approach of PUG [1] is SMT-based, it is not very scalable as shown in Table 3. Recently, simple analysis for memory coalescing was added to it [25]. PUG is also a kernel-at-a-time analyzer while GKLEE can analyze whole GPU programs.

Even if we narrow down to race detection on concrete inputs, instrumentation based tools may suffer from performance or extensibility problems because it is hard to implement sophisticated execution controls and decision procedures on the source level, while GKLEE does everything over an optimized symbolic virtual machine. As pointed out by Boyer [3], although it is possible to run an instrumentation based tool on the GPU (thus parallelizing its execution), CUDA only supports useful features (*e.g.* display debugging information, or recording traces in a file) in emulation mode which disables parallelism in GPU. Note that GKLEE supports test case replaying on the GPU. It also supports kernel simulation on

| Kernels                      | Loc   | Race | Func. Corr. | #T | Bank Conflict (↓ perf.) | Coalesced Accesses (↑ perf.) |           |           | Warp Diverg. (↓ perf.) | Volatile Needed |
|------------------------------|-------|------|-------------|----|-------------------------|------------------------------|-----------|-----------|------------------------|-----------------|
|                              |       |      |             |    | 1.x                     | 2.x                          | 1.0 & 1.1 | 1.2 & 1.3 | 2.x                    |                 |
| Bitonic Sort                 | 30    | WW?  | yes         | 4  | 0%                      | 0%                           | 100%      | 100%      | 100%                   | 60%             |
| Scalar Product               | 30    |      | yes         | 64 | 0%                      | 0%                           | 11%       | 100%      | 100%                   | 100%            |
| Matrix Mult                  | 61    |      | yes         | 64 | 0%                      | 0%                           | 100%      | 100%      | 100%                   | 0%              |
| Histogram64 <sup>tb</sup> .  | 69    |      | Unknown     | 32 | 66%                     | 66%                          | 100%      | 100%      | 100%                   | 0%              |
| Reduction (7)                | 231   |      | yes         | 16 | 0%                      | 0%                           | 100%      | 100%      | 100%                   | 16~83%          |
| Scan Best                    | 78    |      | yes         | 32 | 71%                     | 71%                          | 100%      | 100%      | 100%                   | 71%             |
| Scan Naive                   | 28    |      | yes         | 32 | 0%                      | 0%                           | 50%       | 100%      | 100%                   | 85%             |
| Scan Workefficient           | 60    |      | yes         | 32 | 83%                     | 16%                          | 0%        | 100%      | 0%                     | 83%             |
| Scan Large                   | 196   |      | yes         | 32 | 71%                     | 71%                          | 100%      | 100%      | 100%                   | 71%             |
| Radix Sort                   | 750   |      | yes*        | 16 | 3%                      | 0%                           | 0%        | 100%      | 100%                   | 5%              |
| Bisect Small                 | 1,000 | WW   | –           | 16 | 38%                     | 0%                           | 97%       | 100%      | 100%                   | 43%             |
| Bisect Large <sup>tb</sup> . | 1,400 | ben. | –           | 16 | 15%                     | 0%                           | 99%       | 100%      | 100%                   | 53%             |

**Table 1.** SDK 2.0 Kernel results. ‘‘Reduction’’ contains 7 kernels with different implementations; we average the results. Results for ‘‘Histogram64,’’ and ‘‘Bisect Large’’ are time-bounded (tb.) to 20 mins. Func. Corr. results about float values are skipped at –. We checked the integer version of ‘‘Radix Sort’’; and CUDPP library calls involved in ‘‘Radix Sort’’ were not analyzed.

| Kernels                     | Loc   | Race | #T  | Bank Conflict(↓ perf.) |        | Coalesced Accesses (↑ perf.) |           |        | Warp Diverg. | Volatile(N/M) |
|-----------------------------|-------|------|-----|------------------------|--------|------------------------------|-----------|--------|--------------|---------------|
|                             |       |      |     | 1.x                    | 2.x    | 1.0 & 1.1                    | 1.2 & 1.3 | 2.x    |              |               |
| Clock                       | 38    | WW   | 64  | 0%                     | 0%     | 0%                           | 100%      | 100%   | 85%          | no/no         |
| Scalar Product              | 47    |      | 128 | 0%                     | 0%     | 50%                          | 100%      | 100%   | 36%          | no/no         |
| Histogram64 <sup>tb</sup> . | 70    |      | 64  | 0%                     | 33%    | 0%                           | 0%        | 0%     | 0%           | no/no         |
| Scan Short                  | 103   |      | 64  | 0%                     | 0%     | 0%                           | 100%      | 100%   | 0%           | yes/no        |
| Scan Large                  | 226   |      | 64  | 0%                     | 0%     | 0%                           | 67%       | 67%    | 25%          | yes/no        |
| Transpose (8)               | 172   |      | 256 | 0~50%                  | 0~100% | 0~100%                       | 0~100%    | 0~100% | 0%           | no/no         |
| Bisect Small                | 1,000 |      | 16  | 38%                    | 0%     | 97%                          | 100%      | 100%   | 43%          | yes/yes       |

**Table 2.** SDK 4.0 Kernel results. If volatiles needed (N) is ‘‘yes’’ and missed (M) is ‘‘no’’, the code annotation is correct. Examples with both ‘‘yes’’ (missed volatiles) were found. Transpose contains 8 different implementations; we report the results as a range through ‘‘~’’. Kernels having the same results as their SDK 2.0 versions, including Bitonic Sort, MatrixMult and Bisect Large, are not presented.

| Kernels        | #T = 4 |              | #T = 16 |              | #T = 64      |           | #T = 256 | #T = 1,024 |
|----------------|--------|--------------|---------|--------------|--------------|-----------|----------|------------|
|                | PUG    | GKLEE        | PUG     | GKLEE        | GKLEE        | GKLEE     | GKLEE    | GKLEE      |
| Simple Reduct. | 2.8    | < 0.1(< 0.1) | T.O     | < 0.1(< 0.1) | < 0.1(< 0.1) | 0.2(0.3)  | 2.3(2.9) |            |
| Matrix Transp. | 1.9    | < 0.1(< 0.1) | T.O     | < 0.1(0.3)   | < 0.1(3.2)   | < 0.1(63) | 0.9(T.O) |            |
| Bitonic Sort   | 3.7    | 0.9(1)       | T.O     | T.O          | T.O          | T.O       | T.O      |            |
| Scan Large     | –      | < 0.1(< 0.1) | –       | < 0.1(< 0.1) | 0.1(0.2)     | 1.6(3)    | 22(51)   |            |

**Table 3.** Execution times (in seconds) of GKLEE and PUG [1] on some kernels for functional correctness check. #T is the number of threads. Time is reported in the format of GPU time (entire time); T.O means > 5 minutes.

| Kernels                     | src. code coverage | min #test | avg. Cov <sup>t</sup> | max. Cov <sup>t</sup> | avg. CovBI <sup>t</sup> | max. CovBI <sup>t</sup> | exec. time |
|-----------------------------|--------------------|-----------|-----------------------|-----------------------|-------------------------|-------------------------|------------|
| Bitonic Sort                | 100%/100%          | 5         | 78%/76%               | 100%/94%              | 79%/66%                 | 90%/76%                 | 1s         |
| Merge Sort                  | 100%/100%          | 6         | 88%/70%               | 100%/85%              | 93%/86%                 | 100%/100%               | 1.6s       |
| Word Search                 | 100%/100%          | 2         | 100%/81%              | 100%/85%              | 100%/97%                | 100%/100%               | 0.1s       |
| Suffix Tree Match           | 100%/90%           | 7         | 55%/49%               | 98%/66%               | 55%/49%                 | 98%/83%                 | 31s        |
| Histogram64 <sup>tb</sup> . | 100%/100%          | 9         | 100%/75%              | 100%/75%              | 100%/100%               | 100%/100%               | 600s       |

**Table 4.** Cov<sup>t</sup> and CovTB<sup>t</sup> measure bytecode coverage w.r.t threads. min #test tests are obtained by performing test case selection after the execution. Result for ‘‘Histogram64’’ is limited to 600 s. No test reductions used in generating this table. Exec. time on typical workstation.

the CPU as the CUDA debugger does. Last but not least, GKLEE can look for compiler-related bugs due to omitted volatiles.

The KLEE-FP [26] tool extends KLEE to cross-check IEEE 754 floating-point programs and their SIMD-vectorized versions. Two floating-point expressions are equivalent if they can be normalized to the same form. This tool does not address the same class of correctness and performance bugs as GKLEE, neither does it produce concrete test cases. However, its floating-point package can help overcome GKLEE’s current inability to handle float numbers. Recently KLEE-FP has been extended [27]<sup>3</sup> to handle OpenCL code, targeted in particular at crosschecking OpenCL code against an initial scalar sequential version, and on finding races in such code.

<sup>3</sup>This work and that in this paper were concurrent and independent.

**Some Limitations of GKLEE.** GKLEE cannot be used to analyze the functional correctness of CUDA applications that involve floating-point calculations (efficient SMT methods for floating-point arithmetic, when available, will help here). The concolic nature of GKLEE can help ameliorate this drawback by sometimes ‘‘concretizing’’ the floating numbers to integers. All other analyses done by GKLEE are unaffected by floating-point types, as typically variable addresses involve only unsigned integers.

## 7. Concluding Remarks

We presented GKLEE, the first symbolic virtual machine based correctness checker and test generator for GPU programs written in CUDA/C++. It checks several error categories, including one pre-

| Kernels           | No Reductions |                         | Red <sub>TB</sub> |                         | Red <sub>BI</sub> |                         |
|-------------------|---------------|-------------------------|-------------------|-------------------------|-------------------|-------------------------|
|                   | #path         | avg. CovBI <sup>t</sup> | #path             | avg. CovBI <sup>t</sup> | #path             | avg. CovBI <sup>t</sup> |
| Bitonic Sort      | 28            | 79%/66%                 | 5                 | 79%/66%                 | 5                 | 79%/65%                 |
| Merge Sort        | 34            | 93%/86%                 | 4                 | 92%/84%                 | 4                 | 92%/84%                 |
| Word Search       | 8             | 100%/97%                | 2                 | 100%/97%                | 2                 | 94%/85%                 |
| Suffix Tree Match | 31            | 55%/49%                 | 6                 | 55%/49%                 | 6                 | 55%/49%                 |
| Histogram64       | 13            | 100%/100%               | 5                 | 100%/100%               | 5                 | 100%/100%               |

**Table 5.** Reduction Heuristic Comparisons.

| Comparison Categories | GKLEE                                                                                                                                               | PUG [1]                                       | GRace[2]                     | [3]                                |
|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|------------------------------|------------------------------------|
| Methodology           | Concolic Exec. in virtual machine                                                                                                                   | Symbolic Analysis                             | Static Analysis + Dyn. Check | Dynamic Check                      |
| Level of Analysis     | LLVM Bytecode                                                                                                                                       | Source Code                                   | Source Code (Instrument.)    | Source Code (Instrument.)          |
| Bugs Targeted         | Race (intra-/inter- warp, all memory),<br>Warp Divergence, Deadlocks,<br>Memory Coalesce, Bank Conflicts<br>Compilation level bugs (e.g. Volatiles) | Shared Mem. Race,<br>Deadlocks, Bank Conflict | Intra-/Inter- Warp Race      | Shared Mem. race,<br>Bank Conflict |
| False alarm elim.     | SMT-solving, GPU replaying                                                                                                                          | Auto./Manual Refinement                       | Dynamic Execution            | Dynamic Execution                  |
| Test Generation       | Automatic, Hardware Execution,<br>Coverage Measures, Test Reduction                                                                                 | Not supported                                 | Not supported                | Not supported                      |

**Table 6.** Comparison of Formal Verifiers of GPU Programs

viously unidentified race type. We discussed logical errors and performance bottlenecks detected by GKLEE in real-world kernels. For many realistic kernels, finding these issues takes less than a minute on a modern workstation. We propose several novel code coverage measures and show that GKLEE’s test generation and test reduction heuristics achieve high coverage. Several future directions are planned: (i) OpenCL [10] support, (ii) handling formats other than LLVM (e.g., Nvidia’s PTX) using frameworks such as Ocelot [28], (iii) scalability enhancement, including parameterized methods for SIMD programs, and (iv) using static performance analysis results of GKLEE to guide dynamic performance analysis on typical input data sets.

**Acknowledgements:** We thank the authors of [4] for releasing KLEE well designed and documented. The Utah authors were supported by NSF awards CNS-1035658 and CCF-0935858.

## References

- [1] G. Li and G. Gopalakrishnan, “Scalable SMT-based verification of GPU kernel functions,” in *SIGSOFT FSE*, 2010.
- [2] M. Zheng, V. T. Ravi, F. Qin, and G. Agrawal, “GRace: A low-overhead mechanism for detecting data races in GPU programs,” in *PPoPP*, 2011.
- [3] M. Boyer, K. Skadron, and W. Weimer, “Automated dynamic analysis of CUDA programs,” in *Third Workshop on Software Tools for Multi-Core Systems*, 2008.
- [4] C. Cadar, D. Dunbar, and D. R. Engler, “KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in *OSDI, 8th USENIX Symposium*, 2008.
- [5] “SMT-COMP. <http://www.smtcomp.org/2011>.”
- [6] P. Godefroid, N. Klarlund, and K. Sen, “DART: Directed automated random testing,” in *PLDI*, 2005.
- [7] “KLEE open projects,” <http://klee.llvm.org/OpenProjects.html>.
- [8] K. Sen, D. Marinov, and G. Agha, “CUTE: a concolic unit testing engine for C,” in *10th ESEC/FSE*, 2005.
- [9] “CUDA zone. [www.nvidia.com/object/cuda\\_home.html](http://www.nvidia.com/object/cuda_home.html).”
- [10] OpenCL. <http://www.khronos.org/opencl>.
- [11] A. Kamil and K. A. Yelick, “Concurrency Analysis for Parallel Programs with Textually Aligned Barriers,” in *LCPC*, 2005.
- [12] “The LLVM compiler infrastructure. <http://www.llvm.org/>.”
- [13] “GKLEE Technical Report. <http://www.cs.utah.edu/fv/GKLEE>.”
- [14] “Cuda programming guide version 4.0. [http://developer.download.nvidia.com/compute/cuda/4\\_0/toolkit/docs/CUDA\\_C\\_Programming\\_Guide.pdf](http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/docs/CUDA_C_Programming_Guide.pdf).”
- [15] J. Sevcik, “Safe Optimisations for Shared-Memory Concurrent Programs,” in *PLDI*, 2011.
- [16] S. V. Adve, M. D. Hill, B. P. Miller, and R. H. Netzer, “Detecting data races on weak memory systems,” in *ISCA*, 1991.
- [17] D. Shasha and M. Snir, “Efficient and correct execution of parallel programs that share memory,” *ACM TOPLAS*, vol. 10, no. 2, pp. 282–312, 1988.
- [18] A. Aiken and D. Gay, “Barrier inference,” in *POPL*, 1998.
- [19] NVIDIA, “CUDA-GDB,” Jan. 2009, an extension to the GDB debugger for debugging CUDA kernels in the hardware.
- [20] Nvidia, “Parallel Nsight,” Jul. 2010.
- [21] Rogue Wave, “Totalview for CUDA,” Jan. 2010.
- [22] J. M. Cobleigh, L. A. Clarke, and L. J. Osterweil, “Flavers: A finite state verification technique for software systems,” *IBM Systems Journal*, vol. 41, no. 1, 2002.
- [23] S. K. Lahiri, S. Qadeer, and Z. Rakamaric, “Static and precise detection of concurrency errors in systems code using SMT solvers,” in *21st Computer Aided Verification (CAV)*, 2009.
- [24] B. Coutinho, D. Sampaio, F. M. Quintao Pereira, and W. Meira Jr., “Divergence analysis and optimizations,” in *PACT*, 2011.
- [25] J. Lv, G. Li, A. Humphrey, and G. Gopalakrishnan, “Performance degradation analysis of GPU kernels,” in *EC2 Workshop*, 2011.
- [26] P. Collingbourne, C. Cadar, and P. H. J. Kelly, “Symbolic crosschecking of floating-point and SIMD code,” in *EuroSys*, 2011.
- [27] P. Collingbourne, C. Cadar, and P. Kelly, “Symbolic testing of OpenCL code,” in *Haifa Verification Conference (HVC)*, 2011.
- [28] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems,” in *PACT*, 2010.