An introduction to Thrust

A CUDA library of parallel algorithms

Jose Nunez Gonzalez

Michel A. Rivero Corona
What is Thrust?

Thrust is a C++ template library for CUDA.

It requires CUDA 3.0; Thrust is included in CUDA 4.x.
Thrust allows you to program GPUs with
minimal programming effort using an interface
similar to the C++ Standard Template Library.
You just need to #include the appropriate
header files in your .cu file and compile with nvcc.
Thrust is a cohesive collection of algorithms
and data structures in a single package.
Thrust is self-contained and requires no
additional libraries.
Thrust is open-source software.
Thrust has been tested extensively on Linux,
Windows and Mac OS X systems.
Thrust components
Container Classes
Store your data
vector, list, map, ...

Algorithm Classes
Frequently used algorithms
sort, find, binary search, ...

Iterator Classes
Vector containers
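Thrust provides two vector containers: host_vector (stored in host
memory) and device_vector (stored in GPU device memory).

These vector data structures simplify memory
management and transferring data between the
host and device.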
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(void)
{
    // H has storage for 4 integers
    thrust::host_vector<int> H(4);

    // initialize individual elements
    H[0] = 14; H[1] = 20; H[2] = 38; H[3] = 46;

    // resize H (the first two values are kept)
    H.resize(2);

    // copy host_vector H to device_vector D
    thrust::device_vector<int> D = H;

    // elements of D can be modified
    D[0] = 99;
    D[1] = 88;

    // print contents of D
    for (size_t i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    // H and D are automatically deleted when the function returns
    return 0;
}
STL vector: member functions
begin      Return iterator to beginning
end        Return iterator to end

size       Return size
resize     Change size

assign     Assign vector content
push_back  Add element at the end
pop_back   Delete last element
insert     Insert elements
erase      Erase elements
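
The member functions listed above can be tried out on the host with a plain std::vector. The following is only a minimal sketch: the function name vector_members_demo is a hypothetical helper chosen for illustration, and the values are arbitrary. Thrust's host_vector and device_vector offer members with the same names, so the same calls should carry over once the container type is swapped.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: exercises each member function from the table
// on a host-side std::vector and returns the final contents.
std::vector<int> vector_members_demo()
{
    std::vector<int> v;

    v.assign(3, 7);            // v = [7, 7, 7]
    v.push_back(9);            // v = [7, 7, 7, 9]
    v.pop_back();              // v = [7, 7, 7]
    v.insert(v.begin(), 1);    // v = [1, 7, 7, 7]
    v.erase(v.begin() + 1);    // v = [1, 7, 7]
    v.resize(5);               // v = [1, 7, 7, 0, 0], new elements are zero

    // begin() and end() delimit the half-open range [begin, end)
    for (std::vector<int>::iterator it = v.begin(); it != v.end(); ++it)
        *it += 1;              // v = [2, 8, 8, 1, 1]

    assert(v.size() == 5);     // size() reports the element count
    return v;
}
```

Calling vector_members_demo() from main and printing the result is enough to check each step against the comments.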
An important question
Can I create a thrust::device_vector from memory
I've allocated myself?

Answer: No. Instead, wrap your externally
allocated raw pointer with thrust::device_ptr and
pass it to Thrust algorithms.
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <cuda.h>

int main(void)
{
    size_t N = 10;

    // raw pointer to device memory
    int * raw_ptr;
    cudaMalloc((void **) &raw_ptr, N * sizeof(int));

    // wrap raw pointer with a device_ptr
    thrust::device_ptr<int> dev_ptr(raw_ptr);

    // use device_ptr in thrust algorithms
    thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

    // access device memory through device_ptr
    dev_ptr[0] = 1;

    // free memory
    cudaFree(raw_ptr);

    return 0;
}
Thrust algorithms
● Linear Search: find, find_if, ...
● Subsequence Matching: search, find_end, ...
● Counting Elements: count, count_if
● for_each
● Comparing Two Ranges: equal, mismatch, ...
● Generalized Numeric Algorithms: inner_product, adjacent_difference, ...

● Copy Ranges: copy, copy_n, ...
● Swapping Elements: swap, swap_ranges, ...
● Replacing Elements: replace, replace_if, replace_copy, ...
● Permuting Elements
● Others: sort, generate, random

Thrust (STL) algorithms
Approximately 60 standard algorithms
● Search
● Sort
● Transformations
● Numeric

Most functions take the form
Function(Iter_begin, Iter_end, ...)
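
The Function(Iter_begin, Iter_end, ...) form is the same one the C++ standard library uses, so it can be illustrated on the host without a GPU. The sketch below is an assumption-laden example, not part of Thrust: RangeSummary and summarize are hypothetical names, and the algorithms are the std:: versions. Thrust provides algorithms with the same iterator-pair form (for example thrust::reduce, thrust::count, thrust::find and thrust::sort).

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical result type, introduced only for this illustration.
struct RangeSummary {
    int  sum;        // total of all elements
    long sevens;     // how many elements equal 7
    bool found_two;  // whether the value 2 occurs
    int  smallest;   // minimum element, read after sorting
};

// Every call below passes the range as a pair of iterators,
// followed by any extra arguments the algorithm needs.
RangeSummary summarize(std::vector<int> v)
{
    assert(!v.empty());  // front() below needs at least one element

    RangeSummary r;
    r.sum       = std::accumulate(v.begin(), v.end(), 0);
    r.sevens    = std::count(v.begin(), v.end(), 7);
    r.found_two = std::find(v.begin(), v.end(), 2) != v.end();

    std::sort(v.begin(), v.end());  // ascending order
    r.smallest  = v.front();
    return r;
}
```

For the input {3, 7, 2, 5, 7}, summarize yields a sum of 24, two sevens, found_two set to true, and a smallest value of 2.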
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // number of elements to sort (an arbitrary example size)
    const int n = 1 << 20;

    // generate random data on the host
    thrust::host_vector<int> h_vec(n);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device
    thrust::device_vector<int> d_vec = h_vec;

    // sort on device
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(100);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and compute sum
    // (note: adding 100 rand() values may overflow int)
    thrust::device_vector<int> d_vec = h_vec;
    int x = thrust::reduce(d_vec.begin(), d_vec.end(), (int) 0,
                           thrust::plus<int>());
    return 0;
}
Iterators
● Iterators provide a means for accessing data stored in
container classes such as a vector.
● Iterators can be thought of as limited pointers.
● Thrust algorithms (discussed before) use iterators.
● For instance, if you had a Thrust device vector storing
integers, you could create an iterator for it as follows:

thrust::device_vector<int> d_vec;
thrust::device_vector<int>::iterator vecIterator;
Thrust provides constant_iterator, an iterator that returns the
same value whenever it is dereferenced.

#include <thrust/iterator/constant_iterator.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>

// for printing
#include <thrust/copy.h>
#include <iterator>
#include <iostream>

int main(void)
{
    thrust::device_vector<int> data(4);
    data[0] = 3;
    data[1] = 7;
    data[2] = 2;
    data[3] = 5;

    // add 10 to all values in data
    thrust::transform(data.begin(), data.end(), thrust::constant_iterator<int>(10),
                      data.begin(), thrust::plus<int>());

    // data is now [13, 17, 12, 15]

    // print result
    thrust::copy(data.begin(), data.end(), std::ostream_iterator<int>(std::cout, "\n"));

    return 0;
}
Thrust provides zip_iterator, which combines several sequences
into a single sequence of tuples.

#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/tuple.h>
using namespace thrust;

int main(void)
{
    device_vector<int>  A(3);
    device_vector<char> B(3);
    A[0] = 10;  A[1] = 20;  A[2] = 30;
    B[0] = 'x'; B[1] = 'y'; B[2] = 'z';

    // create the zip iterators (auto deduces the type; requires C++11)
    auto begin = make_zip_iterator(make_tuple(A.begin(), B.begin()));
    auto end   = make_zip_iterator(make_tuple(A.end(),   B.end()));

    // begin[0] is tuple(10, 'x')
    // begin[1] is tuple(20, 'y')
    // begin[2] is tuple(30, 'z')

    // maximum of [begin, end)
    maximum< tuple<int,char> > binary_op;
    tuple<int,char> init = begin[0];
    tuple<int,char> result = reduce(begin, end, init, binary_op); // tuple(30, 'z')

    return 0;
}
Make sure that files that #include Thrust have a
.cu extension.
Other extensions (e.g., .cpp) will cause nvcc to
treat the file incorrectly and produce an error.
Some C++ templates may not be supported.

