OpenCL Float4 Argument Essay


This chapter is from the book 

The OpenCL C programming language is used to create programs that describe data-parallel kernels and tasks that can be executed on one or more heterogeneous devices such as CPUs, GPUs, and other processors referred to as accelerators such as DSPs and the Cell Broadband Engine (B.E.) processor. An OpenCL program is similar to a dynamic library, and an OpenCL kernel is similar to an exported function from the dynamic library. Applications directly call the functions exported by a dynamic library from their code. Applications, however, cannot call an OpenCL kernel directly but instead queue the execution of the kernel to a command-queue created for a device. The kernel is executed asynchronously with the application code running on the host CPU.

OpenCL C is based on the ISO/IEC 9899:1999 C language specification (referred to in short as C99) with some restrictions and specific extensions to the language for parallelism. In this chapter, we describe how to write data-parallel kernels using OpenCL C and cover the features supported by OpenCL C.

Writing a Data-Parallel Kernel Using OpenCL C

As described in Chapter 1, data parallelism in OpenCL is expressed as an N-dimensional computation domain, where N = 1, 2, or 3. The N-D domain defines the total number of work-items that can execute in parallel. Let's look at how a data-parallel kernel would be written in OpenCL C by taking a simple example of summing two arrays of floats. A sequential version of this code would perform the sum by summing individual elements of both arrays inside a for loop:

    void scalar_add (int n, const float *a, const float *b, float *result)
    {
        int i;
        for (i=0; i<n; i++)
            result[i] = a[i] + b[i];
    }

A data-parallel version of the code in OpenCL C would look like this:

    kernel void scalar_add (global const float *a,
                            global const float *b,
                            global float *result)
    {
        int id = get_global_id(0);
        result[id] = a[id] + b[id];
    }

The function declaration uses the kernel qualifier to indicate that this is an OpenCL C kernel. Note that the kernel includes only the code to compute the sum of each individual element, that is, the body of the inner loop. The N-D domain will be a one-dimensional domain set to n. The kernel is executed for each of the n work-items to produce the sum of arrays a and b. In order for this to work, each executing work-item needs to know which individual elements from arrays a and b need to be summed. This must be a unique value for each work-item and should be derived from the N-D domain specified when queuing the kernel for execution. The call get_global_id(0) returns the one-dimensional global ID for each work-item. Ignore the global qualifiers specified in the kernel for now; they will be discussed later in this chapter.

Figure 4.1 shows how get_global_id can be used to identify a unique work-item from the list of work-items executing a kernel.
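The relationship between the one-dimensional domain and the kernel body can be imitated in plain Python. This is only a sketch of the execution model, not OpenCL: enqueue_1d is a hypothetical stand-in for the runtime, and it runs the work-items one after another, whereas a real device may run them in parallel and in any order.

```python
def scalar_add_kernel(global_id, a, b, result):
    """Work done by a single work-item.

    global_id plays the role of get_global_id(0) in the OpenCL C kernel.
    """
    result[global_id] = a[global_id] + b[global_id]


def enqueue_1d(kernel, global_size, *args):
    """Hypothetical stand-in for the runtime: call the kernel once per ID
    in a one-dimensional domain of global_size work-items."""
    for global_id in range(global_size):
        kernel(global_id, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
result = [0.0] * len(a)

# The domain size equals the array length, so each element gets one work-item.
enqueue_1d(scalar_add_kernel, len(a), a, b, result)
print(result)  # [11.0, 22.0, 33.0, 44.0]
```

Each call receives a different ID, which is why the kernel needs no loop of its own.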

The OpenCL C language with examples is described in depth in the sections that follow. The language is derived from C99 with restrictions that are described at the end of this chapter.

OpenCL C also adds the following features to C99:

  • Vector data types. A number of OpenCL devices such as Intel SSE, AltiVec for POWER and Cell, and ARM NEON support a vector instruction set. This vector instruction set is accessed in C/C++ code through built-in functions (some of which may be device-specific) or device-specific assembly instructions. In OpenCL C, vector data types can be used in the same way scalar types are used in C. This makes it much easier for developers to write vector code because similar operators can be used for both vector and scalar data types. It also makes it easy to write portable vector code because the OpenCL compiler is now responsible for mapping the vector operations in OpenCL C to the appropriate vector ISA for a device. Vectorizing code also helps improve memory bandwidth because of regular memory accesses and better coalescing of these memory accesses.
  • Address space qualifiers. OpenCL devices such as GPUs implement a memory hierarchy. The address space qualifiers are used to identify a specific memory region in the hierarchy.
  • Additions to the language for parallelism. These include support for work-items, work-groups, and synchronization between work-items in a work-group.
  • Images. OpenCL C adds image and sampler data types and built-in functions to read and write images.
  • An extensive set of built-in functions such as math, integer, geometric, and relational functions. These are described in detail in Chapter 5.

Is there an easy way to pass a float4 or any other vector argument to an OpenCL kernel? A scalar argument (int, float) can be passed directly when calling the kernel. An array argument first has to be copied to the GPU using cl.Buffer(), and then a pointer is passed. It is probably possible to pass a float4 the same way as an array, but I am asking whether there is an easier and clearer way, especially using Python, NumPy, and PyOpenCL.

I tried passing a NumPy array of four float32 values as a float4, but it does not work. Is there some other way to do it?

For example, the kernel:


I got this error:

The other possibility is to pass it as a set of scalar int or float arguments, like:

But this is not very convenient either: it is easy to get lost among the variable names if you want to pass, for example, four float4 and five int3 values to the kernel.
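One possible way to avoid a long list of separate arguments is to gather the values into a single parameter block whose byte layout matches a struct declared in the kernel source. The sketch below only builds such a block with Python's standard struct module; the field names, the PARAMS layout, and the idea of a matching device-side struct of two float4 fields and one int4 field are all assumptions made for illustration, and host byte order is assumed to match the device.

```python
import struct

# Hypothetical parameter block: two float4 fields followed by one int4 field.
# Sticking to 4-component vectors keeps every field at a multiple of 16 bytes,
# so no hidden padding is needed between fields.
PARAMS = struct.Struct("=4f4f4i")


def pack_params(origin, step, grid_size):
    """Pack three 4-component tuples into one contiguous byte string."""
    return PARAMS.pack(*origin, *step, *grid_size)


blob = pack_params(
    origin=(0.0, 0.0, 0.0, 0.0),
    step=(0.5, 0.5, 0.25, 0.0),
    grid_size=(64, 64, 32, 0),  # fourth lane unused
)
print(PARAMS.size)  # 48
```

The resulting bytes could then be handed to the kernel in one piece, either by value or through a small buffer, depending on what the host API in use accepts.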

I think passing 2-, 3-, and 4-component int and float vectors must be quite common in OpenCL, for example for the size of a 3D data grid. So I wonder whether it is really necessary to pass them as pointers using cl.Buffer().
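It may help to look at what a by-value vector argument is at the byte level. A float4 is four 32-bit floats, so 16 bytes, and in OpenCL a 3-component vector such as int3 occupies the same storage as its 4-component counterpart. The sketch below builds those layouts with Python's standard struct module only; pack_float4 and pack_int3 are hypothetical helper names, and host byte order is assumed to match the device. Whether a particular PyOpenCL version accepts such a value for a float4 parameter should be checked against its set_arg documentation. As far as I know, the pyopencl.cltypes module also provides vector dtypes such as float4 for this purpose, but that route is not exercised here.

```python
import struct


def pack_float4(x, y, z, w):
    """Return the 16-byte layout of a float4: four float32 values in host byte order."""
    return struct.pack("=4f", x, y, z, w)


def pack_int3(x, y, z):
    """Return the layout of an int3, which is stored like an int4.

    The fourth 32-bit lane is padding and is set to zero here.
    """
    return struct.pack("=4i", x, y, z, 0)


origin = pack_float4(1.0, 2.0, 3.0, 4.0)
grid_size = pack_int3(64, 64, 32)
print(len(origin), len(grid_size))  # 16 16
```

A NumPy array of three int32 values is only 12 bytes, so a size mismatch of this kind is one plausible reason for an argument being rejected.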

I would also guess that a constant float4 argument is faster than a float* pointer, because it can be shared as a constant by all work-items.

