clFFT  2.0
OpenCL Fast Fourier Transforms (FFTs)

The clFFT library is an OpenCL library implementation of discrete Fast Fourier Transforms. The library:

  • provides a fast and accurate platform for calculating discrete FFTs.
  • works on CPU or GPU backends.
  • supports in-place or out-of-place transforms.
  • supports 1D, 2D, and 3D transforms with a batch size that can be greater than or equal to 1.
  • supports planar (real and complex components are stored in separate arrays) and interleaved (real and complex components are stored as a pair in the same array) formats.
  • supports lengths that are any combination of powers of 2, 3, 5, and 7.
  • supports single and double precision floating point formats.

Introduction to clFFT

The FFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the FFT definition to reduce the mathematical intensity required from O( \(N^2\)) to O( \( N \log N\)) when the sequence length, N, is the product of small prime factors. Currently, there is no standard API for FFT routines. Hardware vendors usually provide a set of high-performance FFTs optimized for their systems: no two vendors employ the same interfaces for their FFT routines. clFFT provides a set of FFT routines that are optimized for AMD graphics processors, and that are also functional across CPU and other compute devices.

Supported radices

clFFT supports transform sizes that are powers of 2, 3, 5, and 7. This means that the vector lengths can be a combination of powers of two, three, five, and seven; examples include \(2^7, 2^1*3^1, 3^2*5^4, 2^2*3^3*5^5\), up to the limit that the device can support.

Transform size limits

Currently, there is an upper bound on the transform size that the library can support for certain transforms. This limit is \(2^{24}\) for real 1D single precision and \(2^{22}\) for real 1D double precision.

Dimensionality

clFFT currently supports FFTs of (up to) three dimensions, given by the enum clfftDim. This enum is a required parameter of clfftCreateDefaultPlan() to create an initial plan, where a plan is the collection of (almost) all the parameters needed to specify an FFT computation. For more information about clFFT plans, see the section clFFT plans. Depending on the dimensionality that the client requests, clFFT uses the following formulations to compute the DFT:

  • For a 1D complex DFT

    \[ {\tilde{x}}_j = {{1}\over{scale}}\sum_{k=0}^{n-1}x_k\exp\left({\pm i}{{2\pi jk}\over{n}}\right)\hbox{ for } j=0,1,\ldots,n-1 \]

    where, \(x_k\) are the complex data to be transformed, \(\tilde{x}_j\) are the transformed data, and the sign \(\pm\) determines the direction of the transform: \(-\) for forward and \(+\) for backward. Note that you must provide the scaling factor. By default, the scale is set to 1 for forward transforms, and \({{1}\over{N}}\) for backward transforms, where N is the size of transform.
  • For a 2D complex DFT

    \[ {\tilde{x}}_{jk} = {{1}\over{scale}}\sum_{q=0}^{m-1}\sum_{r=0}^{n-1}x_{rq}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right) \]

    for \(j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\), where, \(x_{rq}\) are the complex data to be transformed, \(\tilde{x}_{jk}\) are the transformed data, and the sign \(\pm\) determines the direction of the transform. By default, the scale is set to 1 for forwards transforms and \({{1}\over{MN}}\) for backwards transforms, where M and N are the 2D size of the transform.
  • For a 3D complex DFT

    \[ \tilde{x}_{jkl} = {{1}\over{scale}}\sum_{s=0}^{p-1}\sum_{q=0}^{m-1}\sum_{r=0}^{n-1} x_{rqs}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right)\exp\left({\pm i}{{2\pi ls}\over{p}}\right) \]

    for \(j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\hbox{ and } l=0,1,\ldots,p-1\), where \(x_{rqs}\) are the complex data to be transformed, \(\tilde{x}_{jkl}\) are the transformed data, and the sign \(\pm\) determines the direction of the transform. By default, the scale is set to 1 for forward transforms and \({{1}\over{MNP}}\) for backward transforms, where M, N, and P are the 3D size of the transform.

Setup and Teardown of clFFT

clFFT is initialized by the API clfftSetup(), which must be called before any other API of clFFT. This allows the library to create resources needed to manage the plans that you create and destroy. This API also takes a structure clfftInitSetupData() that is initialized by the client to control the behavior of the library.

After you use the library, the clfftTeardown() method must be called. This function instructs clFFT to release all resources allocated internally, and resets acquired references to any OpenCL objects.

Thread safety

The clFFT API is designed to be thread-safe. It is safe to create plans from multiple threads and to destroy those plans in separate threads. Multiple threads can call clfftEnqueueTransform() to place work in a command queue at the same time. clFFT does not provide a single-threaded version of the library. The overhead of the synchronization mechanisms inside a clFFT thread-safe is expected to be minor.

Currently, you must manage the multi-device operation. You can create OpenCL contexts that are associated with multiple devices, but clFFT only uses a single device from that context to transform the data. You can manage a multi-device operation by creating multiple contexts, in which each context contains a different device; you are responsible for scheduling and partitioning the work across multiple devices and contexts.

Row major formats

clFFT expects all multi-dimensional input passed to it to be in row-major format. This is compatible with C-based languages. However, clFFT is very flexible in the organization of the input and output data, and it accepts input data by letting you specify a stride for each dimension. This feature can be used to process data in column major arrays and other non-contiguous data formats. See clfftSetPlanInStride() and clfftSetPlanOutStride().

OpenCL object creation

Your application must allocate and manage OpenCL objects, such as contexts, cl_mem buffers and command queues. All the clFFT interfaces that interact with OpenCL objects take those objects as references through the API. Specifically, the plan creation function clfftCreateDefaultPlan() takes an OpenCL context as a parameter reference, increments the reference count on that object, and keeps the object alive until the corresponding plan is destroyed by the call clfftDestroyPlan().

Flushing of command queues

The clFFT API operates asynchronously; with the exception of thread safety locking with multiple threads, all APIs return immediately. Specifically, the clfftEnqueueTransform() API does not explicitly flush the command queues that are passed by reference to it. It pushes the transform work onto the command queues and returns the modified queues to the client. The client is free to issue its own blocking logic by using OpenCL synchronization mechanisms or push further work onto the queue to continue processing.

Environment variables

The clFFT library looks for definition of two environment variables: CLFFT_CACHE_PATH and CLFFT_REQUEST_LIB_NOMEMALLOC. If the variable CLFFT_CACHE_PATH is defined, the library caches OpenCL binaries. This enables a subsequent run of the application with the same type of transforms to avoid the expensive compilation step. Instead, the stored binaries are loaded and executed. The CLFFT_CACHE_PATH must point to a folder location where the library can store binaries. The other variable CLFFT_REQUEST_LIB_NOMEMALLOC when defined, requests the library to do all computations in-place and avoid allocating extra device memory whenever possible. This feature is experimental and currently works only for certain types of transforms when the library decomposes the input into square matrices or rectangular matrices with dimensions in the ratio 1:2. Currently, it works for 1D complex transforms of size of powers of 2.

clFFT plans

A plan is the collection of (almost) all the parameters needed to specify an FFT computation. A clFFT plan includes the following parameters:

  • The OpenCL context that executes the transform
  • Dimension of the transform (1D, 2D or 3D)
  • Length or extent of data in each dimension
  • Number of datasets that are transformed
  • Precision of the data
  • Scaling factor to the transformed data
  • In-place or Out-of-place transform
  • Format of the input data - interleaved, planar or real
  • Format of the output data - interleaved, planar or real

The clFFT plan does not include the following parameters:

  • The OpenCL handles to the input and output data buffers.
  • The OpenCL handle to a temporary scratch buffer (if needed).
  • Direction of execution of the transform (forward or reverse transform).

These parameters are specified when the plan is executed.

Default plan values

When a new plan is created by calling clfftCreateDefaultPlan(), its parameters are initialized as follows:

  • Dimensions: as provided by the caller
  • Lengths: as provided by the caller
  • Batch size: 1
  • Precision: CLFFT_SINGLE
  • Scaling factors:
    1. for the forward transform, the default is 1.0, or no scale factor is applied
    2. for the reverse transform, the default is 1.0 / P, where P is the product of the FFT lengths
  • Location: CLFFT_INPLACE
  • Input layout: CLFFT_COMPLEX_INTERLEAVED
  • Input strides: the strides of a multidimensional array of the lengths specified, where the data is compactly stored using the row-major convention
  • Output layout: CLFFT_COMPLEX_INTERLEAVED
  • Output strides: same as input strides

Writing client programs that depend on these initial values is not recommended.

Supported memory layouts

There are two main types of Discrete Fourier Transform (DFT) in clFFT:

  1. Transformation of complex data - clFFT supports the following two layouts to store complex numbers:
    • Planar format - where the real and imaginary components are kept in separate arrays:
      Buffer1: RRRRR
      Buffer2: IIIII
    • Interleaved format - where the real and imaginary components are stored as contiguous pairs:
      Buffer1: RIRIRIRIRIRI
  2. Transformation of real to complex data and vice versa - clFFT provides enums to define these formats. For transforms involving real data, there are two possibilities:
    • Real data being subject to forward FFT transform that results in complex data.
    • Complex data being subject to backward FFT transform that results in real data. See the section FFTs of real data.

Strides and Distances

For one-dimensional data, if clStrides[0] = strideX = 1, successive elements in the first dimension are stored contiguously in memory. If strideX is an integral value greater than 1, gaps in memory exist between each element of the vectors. For multi-dimensional data, if clStrides[1] = strideY = LenX for 2 dimensional data and clStrides[2] = strideZ = LenX*LenY for 3 dimensional data, no gaps exist in memory between each element, and all vectors are stored tightly packed in memory. Here, LenX, LenY, and LenZ denote the transform lengths clLengths[0], clLengths[1], and clLengths[2], respectively, which are used to set up the plan.

By specifying non-default strides, it is possible to process either row-major or column-major arrays. Data can be extracted from arrays of structures. Almost any regular data storage pattern can be accommodated.

Distance is the amount of memory that exists between corresponding elements in an FFT primitive in a batch. Distance is measured in units of the FFT primitive; complex data measures in complex units, and real data measures in real units. Stride between tightly packed elements is 1 in either case. Typically, one can measure the distance between any two elements in a batch primitive, be it 1D, 2D, or 3D data. For tightly packed data, the distance between FFT primitives is the size of the FFT primitive, such that dist=LenX for 1D data, dist=LenX*LenY for 2D data, and dist=LenX*LenY*LenZ for 3D data. It is possible to set the distance of a plan to be less than the size of the FFT vector; most often 1 for this case. When computing a batch of 1D FFT vectors, if distance == 1, and strideX == length(vector), a transposed output is produced for a batch of 1D vectors. You must verify that the distance and strides are valid (not intersecting); if not valid, undefined results may occur. A simple example would be to perform a 1D length 4096 on each row of an array of 1024 rows x 4096 columns of values stored in a column-major array, such as a FORTRAN program might provide. (This would be equivalent to a C or C++ program that has an array of 4096 rows x 1024 columns stored in a row-major manner, on which you want to perform a 1-D length 4096 transform on each column.) In this case, specify the strides as [1024, 1].

A more complex example would be to compute a 2D FFT for each 64 x 64 subtile of the grid that has an input buffer with a raster grid of 1024 x 1024 monochrome pixel values. Specifying strides allows you to treat each horizontal band of 1024 x 64 pixels as an array of 16 64 x 64 matrixes, and process an entire band with a single call clfftEnqueueTransform(). (Specifying strides is not quite flexible enough to transform the entire grid of this example with a single kernel execution.) It is possible to create a Plan to compute arrays of 64 x 64 2D FFTs, then specify three strides: [1, 1024, 64]. The first stride, 1, indicates that the rows of each matrix are stored consecutively; the second stride, 1024, gives the distance between rows, and the third stride, 64, defines the distance between two matrices. Then call clfftEnqueueTransform() 16 times – once for each horizontal band of pixels.

Supported precisions in clFFT

Both CLFFT_SINGLE and CLFFT_DOUBLE precisions are supported by the library for all supported radices. For both these enums the math functions of the host computer are used to produce the sine and cosine tables that are used by the OpenCL kernel. Both CLFFT_SINGLE_FAST and CLFFT_DOUBLE_FAST generate faster kernels with reduced accuracy, but are disabled in the current build. See clfftPrecision, clfftSetPlanPrecision(), and clfftGetPlanPrecision().

clfftDirection

For complex transforms, the direction of the transform is not baked into the plan; the same plan can be used to specify both forward and backward transforms. To specify the direction, clfftDirection is passed as a parameter into clfftEnqueueTransform(). For real transforms, the input and output layouts of the plan determine the direction.

In-place and out-of-place transforms

The clFFT API supports both in-place and out-of-place transforms. With in-place transforms, only the input buffers are provided to the clfftEnqueueTransform() API, and the resulting data is written in the same buffer, overwriting the input data. With out-of-place transforms, distinct output buffers are provided to the clfftEnqueueTransform() API, and the input data is preserved. In-place transforms require that the cl_mem objects created by the client application, have both read and write permissions. This is given in the nature of the in-place algorithm. Out-of-place transforms require that the destination buffers have read and write permissions, but input buffers can still be created with read-only permissions. This is a clFFT requirement because internally the algorithms may go back and forth between the destination buffers and internally allocated temp buffers. For out-of-place transforms, clFFT never writes back to input buffers.

Batches

The efficiency of clFFT is improved by utilizing transforms in batches. Sending as much data as possible in a single transform call leverages the parallel compute capabilities of OpenCL devices (and GPU devices in particular), and minimizes the penalty of transfer overhead. It is best to think of an OpenCL device as a high-throughput, high-latency device. Using a networking analogy as an example, this approach is similar to having a massively high-bandwidth pipe with very high ping response times. If the client is ready to send data to the device for compute, it should be sent in as few API calls as possible and this can be done by batching. clFFT plans have a parameter clfftSetPlanBatchSize() to describe the number of transforms being batched, and another parameter clfftSetPlanDistance() to describe how those batches are laid out and spaced in memory. 1D, 2D, or 3D transforms can be batched.

Using clFFT in a client application

To perform FFT calculations using clFFT, the client program must perform the following tasks:

  • Initialize the library by calling clfftSetup().
  • For each distinct type of FFT needed:

    1. Create an FFT Plan object. To create an FFT Plan object, do either of the following.
      • Call the factory function clfftCreateDefaultPlan() and specify the value of the most fundamental parameters, such as plHandle, context, dim, and clLengths, while other parameters assume default values. The OpenCL context must be provided when the plan is created; it cannot be changed.
        Or
      • Call clfftCopyPlan().
      Note: In both the cases, the function returns an opaque handle to the plan object.
    2. Complete the specification of all the Plan parameters by calling various parameter-setting functions that have the prefix clfftSet.
    3. Optionally, "bake" or finalize the plan by calling clfftBakePlan() function. This signals the library the end of the specification phase, and causes it to generate and compile the exact OpenCL kernels that perform the specified FFT on the provided OpenCL device. At this point, all performance-enhancing optimizations are applied, possibly including executing benchmark kernels on the OpenCL device context to maximize the runtime performance.

      Although the previous step is optional, it is recommended to use it so that you can have control on when to do this work. Usually, this time consuming step is done when the application is initialized. If you do not call clfftBakePlan(), this work is done during the first call of clfftEnqueueTransform().

  • Execute the OpenCL FFT kernels as many times as needed.
    1. Call clfftEnqueueTransform(). At this point, specify whether you want to execute a forward or reverse transform; also, provide the OpenCL cl_mem handles for the input buffer(s), output buffer(s)–unless you want the transformed data to overwrite the input buffers, and (optionally) scratch buffer. clfftEnqueueTransform() performs one or more calls to the OpenCL function clEnqueueNDRangeKernel. Like clEnqueueNDRangeKernel, clfftEnqueueTransform() is a non-blocking call. The commands to execute the FFT compute kernel(s) are added to the OpenCL context queue to be executed asynchronously. An OpenCL event handle is returned to the caller. If multiple NDRangeKernel operations are queued, the final event handle is returned.
    2. Add any application OpenCL tasks to the OpenCL context queue. For example, if the next step in the application process is to apply a filter to the transformed data, the application calls clEnqueueNDRangeKernel, and specifies its output buffer(s) as the input to the filter kernel, and provides the transform's event handle to ensure proper synchronization.
    3. If the application accessed the transformed data directly, it calls one of the OpenCL functions for synchronizing the host computer execution with the OpenCL device (for example: clFinish()).
  • Terminate the library by calling clfftTeardown().

FFTs of real data

When real data is subject to DFT transformation, the resulting complex output data follows a special property. About half of the output is redundant because they are complex conjugates of the other half. This is called the Hermitian redundancy. So, for space and performance considerations, it is only necessary to store the non-redundant part of the data. Most FFT libraries use this property to offer specific storage layouts for FFTs involving real data. clFFT provides three enumerated types to deal with real data FFTs:

  • CLFFT_REAL
  • CLFFT_HERMITIAN_INTERLEAVED
  • CLFFT_HERMITIAN_PLANAR

The CLFFT_REAL enum specifies that the data is purely real. This can be used to feed real input or get back real output. The CLFFT_HERMITIAN_INTERLEAVED and CLFFT_HERMITIAN_PLANAR enums are similar to the corresponding full complex enums in the way they store real and imaginary components, but store only about half of the complex output. Client applications can do just a forward transform and analyze the output or they can process the output and do a backward transform to get back real data. This is illustrated in the following figure.

realfft_fwdinv.jpg
Forward and Backward Transform Processes

Let us consider a 1D real FFT of length N. The full output looks as shown in following figure.

realfft_1dlen.jpg
1D Real FFT of Length N

Here, C* denotes the complex conjugate. Since the values at indices greater than N/2 can be deduced from the first half of the array, clFFT stores data only up to the index N/2. This means that the output contains only 1 + N/2 complex elements, where the division N/2 is rounded down. Examples for even and odd lengths are given below.

Example for N = 8 is shown in following figure.

realfft_ex_n8.jpg
Example for N = 8

Example for N = 7 is shown in following figure.

realfft_ex_n7.jpg
Example for N = 7

For length 8, only (1 + 8/2) = 5 of the output complex numbers are stored, with the index ranging from 0 through 4. Similarly for length 7, only (1 + 7/2) = 4 of the output complex numbers are stored, with the index ranging from 0 through 3. For 2D and 3D FFTs, the FFT length along the least dimension is used to compute the (1 + N/2) value. This is because the FFT along the least dimension is computed first and is logically a real-to-hermitian transform. The FFTs along other dimensions are computed afterwards; they are simply 'complex-to-complex' transforms. For example, assuming clLengths[2] is used to set up a 2D real FFT, let N1 = clLengths[1], and N0 = clLengths[0]. The output FFT has N1*(1 + N0/2) complex elements. Similarly, for a 3D FFT with clLengths[3] and N2 = clLengths[2], N1 = clLengths[1], and N0 = clLengths[0], the output has N2*N1*(1 + N0/2) complex elements.

Supported Modes

Out-of-place transforms:

  • CLFFT_REAL to CLFFT_HERMITIAN_INTERLEAVED
  • CLFFT_REAL to CLFFT_HERMITIAN_PLANAR
  • CLFFT_HERMITIAN_INTERLEAVED to CLFFT_REAL
  • CLFFT_ CLFFT_HERMITIAN_PLANAR to CLFFT_REAL

In-place transforms:

  • CLFFT_REAL to CLFFT_HERMITIAN_INTERLEAVED
  • CLFFT_HERMITIAN_INTERLEAVED to CLFFT_REAL

Setting strides

The library currently requires you to explicitly set input and output strides for real transforms. See the following examples to understand what values to use for input and output strides under different scenarios. These examples show typical usages, but you can allocate the buffers and layout data according to your need.

Examples

The following pages provide figures and examples to explain in detail the real FFT features of this library.

realfft_expl_01.jpg
1D FFT - Real to Hermitian
realfft_expl_02.jpg
1D FFT - Real to Hermitian, Example 1
realfft_expl_03.jpg
1D FFT - Real to Hermitian, Example 2
realfft_expl_04.jpg
1D FFT - Real to Hermitian, Example 3
realfft_expl_05.jpg
1D FFT - Hermitian to Real
realfft_expl_06.jpg
1D FFT - Hermitian to Real, Example
realfft_expl_07.jpg
2D FFT - Real to Hermitian In Place
realfft_expl_08.jpg
2D FFT - Real to Hermitian, Example

clFFT Callbacks

The callback feature of clFFT has the ability to invoke user provided OpenCL™ inline functions to pre-process or post-process data from within the FFT kernel. The inline OpenCL callback function is passed as a string to the library. It is then incorporated into the generated FFT kernel. This eliminates the need for an additional kernel launch to carry out the pre/post processing tasks, thus improving overall performance. There are 2 types of callback; Pre-callback and Post-callback. Pre-callback invokes user callback function to perform custom pre-processing of the input data before FFT is executed. Post-callback invokes user callback function to perform custom post-processing of the output data after FFT is executed.

Callback Workflow

The workflow of FFT execution using callback feature of clFFT is as follows:

  1. Create the clFFT Plan and initialize the standard clFFT parameters.
  2. Use clfftSetPlanCallback() API to register the callback function with library
    const char* funcName,
    const char* funcString,
    int localMemSize,
    clfftCallbackType callbackType,
    void *userdata,
    int numUserdataBuffers)
    The library uses the arguments passed to this API, including callback function string, to stitch the callback code into the generated FFT kernel. The arguments for clfftSetPlanCallback are
    • clFFT plan handle
    • Name of the callback function
    • Callback function as character array; the character array can include custom datatype declaration if any custom type is used by callback function
    • Size of local memory requested by callback, if any, in bytes
    • Type of callback: ‘PRECALLBACK’ or ‘POSTCALLBACK’; this is an enumerator
    • Supplementary user data, if any, used by callback function
    • Number of user data buffers; the library currently supports only one user data buffer per callback registration
    Multiple callback registration calls to the same type of callback result in overwriting the previously registered callback function
  3. Invoke Bake Plan step Library inserts the callback code into the main FFT kernel during bake plan and compiles it. If there are any compilation errors caused by syntax or incompatible callback function prototype, the library reports failure.
  4. Enqueue clFFT transform

The caller is responsible to provide a callback function that matches the function prototype based on the type of callback(pre/post), type of transform(real/complex) and whether LDS is used. The bake plan step checks the function prototype.

Callback Function Prototypes

clFFT expects the callback function to be of a specific prototype depending on the type of callback(pre/post), type of transform(real/complex) and whether LDS is used. These are as follows:

Pre-callback Prototypes

FFT Type Function Prototype
C2C/C2R – Interleaved Single Precision Without LDS
float2 <precallback_func> ( __global void *input, uint inoffset, __global void *userdata)
With LDS
float2 <precallback_func> ( __global void *input, uint inoffset, __global void *userdata, __local void *localmem)
C2C/C2R – Interleaved Double Precision Without LDS
double2 <precallback_func> ( __global void *input, uint inoffset, __global void *userdata)
With LDS
double2 <precallback_func> ( __global void *input, uint inoffset, __global void *userdata, __local void *localmem)
C2C – Planar Single Precision Without LDS
float2 <precallback_func> ( __global void *inputRe, __global void *inputIm, uint inoffset, __global void *userdata)
With LDS
float2 <precallback_func> ( __global void *inputRe, __global void *inputIm, int inoffset, __global void *userdata, __local void *localmem)
C2C – Planar Double Precision Without LDS
double2 <precallback_func> ( __global void *inputRe, __global void *inputIm, uint inoffset, __global void *userdata)
With LDS
double2 <precallback_func> ( __global void *inputRe, __global void *inputIm, uint inoffset, __global void *userdata, __local void *localmem)
R2C Single Precision Without LDS
float <precallback_func> ( __global void *input, uint inoffset, __global void *userdata)
With LDS
float <precallback_func> ( __global void *input, uint inoffset, __global void *userdata, __local void *localmem)
R2C Double Precision Without LDS
double <precallback_func> ( __global void *input, uint inoffset, __global void *userdata)
With LDS
double <precallback_func> ( __global void *input, uint inoffset, __global void *userdata, __local void *localmem)

Parameters

  • input : The base pointer of the input buffer for R2C and Interleaved C2C/C2R transforms
  • inputRe : The base pointer of the “Real” input buffer for Planar C2C transforms
  • inputIm : The base pointer of the “Imaginary” part input buffer for Planar C2C transforms
  • inoffset : Index of the current element of the input buffer from the start
  • userdata : Buffer containing optional caller specified data. The userdata pointer is useful for passing any supplementary data to the callback function. For example, buffer having convolution filter data or any scalar value. The userdata can be of any custom data type/structure, in which case, you have to declare the custom data type and include it along with the callback function string.
  • localmem : Pointer to local memory. This memory is allocated by library based on the size you specify and is subjected to local memory availability.

For Planar C2C, the return type of callback is a vector (float2/double2) that contains the Real and Imaginary elements as computed in the callback.

Post-callback Prototypes

FFT Type Function Prototype
C2C/R2C – Interleaved Single Precision Without LDS
void <postcallback_func> ( __global void *output, uint outoffset, __global void *userdata, float2 fftoutput)
With LDS
void <postcallback_func> ( __global void *output, uint outoffset, __global void *userdata, float2 fftoutput, __local void *localmem)
C2C/R2C – Interleaved Double Precision Without LDS
void <postcallback_func> ( __global void *output, uint outoffset, __global void *userdata, double2 fftoutput)
With LDS
void <postcallback_func> ( __global void *output, uint outoffset, __global void *userdata, double2 fftoutput, __local void *localmem)
C2C/R2C – Planar Single Precision Without LDS
void <postcallback_func> ( __global void *outputRe, __global void *outputIm, uint outoffset, __global void *userdata, float fftoutputRe, float fftoutputIm)
With LDS
void <postcallback_func> ( __global void *outputRe, __global void *outputIm, uint outoffset, __global void *userdata, float fftoutputRe, float fftoutputIm, __local void *localmem)
C2C/R2C – Planar Double Precision Without LDS
void <postcallback_func> ( __global void *outputRe, __global void *outputIm, uint outoffset, __global void *userdata, double fftoutputRe, double fftoutputIm)
With LDS
void <postcallback_func> ( __global void *outputRe, __global void *outputIm, uint outoffset, __global void *userdata, double fftoutputRe, double fftoutputIm, __local void *localmem)
C2R Single Precision Without LDS
void <postcallback_func> ( __global void *output, uint outoffset, __global void *userdata, float fftoutput)
With LDS
void <postcallback_func> ( __global void *output, uint outoffset, __global void *userdata, float fftoutput, __local void *localmem)
C2R Double Precision Without LDS
void <postcallback_func> ( __global void *output, uint outoffset, __global void *userdata, double fftoutput)
With LDS
void <postcallback_func> ( __global void *output, uint outoffset, __global void *userdata, double fftoutput, __local void *localmem)

Parameters

  • output : The base pointer of the output buffer for C2R and Interleaved R2C/C2C transforms
  • outputRe : The base pointer of the “Real” output buffer for Planar R2C/C2C transforms
  • outputIm : The base pointer of the “Imaginary” part output buffer for Planar R2C/C2C transforms
  • outoffset : Index of the current element of the output buffer from the start
  • userdata : Buffer containing optional caller specified data. The userdata pointer is useful for passing any supplementary data to the callback function. For example, buffer having convolution filter data or any scalar value. The userdata can be of any custom data type/structure, in which case, you have to declare the custom data type and include it along with the callback function string.
  • fftoutput : The result computed by clFFT for the element corresponding to outoffset argument
  • localmem : Pointer to local memory. This memory is allocated by library based on the size you specify and is subject to local memory availability.

Sample Callback Code

//**************************************************************************
//* Step 1 : Store the pre and post callback function in a string.
//**************************************************************************
const char* precallbackstr = "float2 pre_mulval(__global void* input, \n
uint inoffset, \n
__global void* userdata, \n
__local void* localmem) \n
{ \n
float scalar = *((__global float*)userdata + inoffset); \n
float2 ret = *((__global float2*)input + inoffset) * scalar; \n
return ret; \n
} \n";
const char* postcallbackstr = "void post_mulval(__global void* output, \n
uint outoffset, \n
__global void* userdata, \n
float2 fftoutput, \n
__local void* localmem) \n
{ \n
float scalar = *((__global float*)userdata + outoffset); \n
*((__global float2*)output + outoffset) = fftoutput * scalar; \n
} \n";
//**************************************************************************
//* Step 2 : Initialize arguments, if any, required by the callback.
//**************************************************************************
int h_preuserdata[N] = { };
cl_mem preuserdata = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * N, (void*)h_preuserdata, NULL);
int h_postuserdata[N] = { };
cl_mem postuserdata = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * N, (void*)h_postuserdata, NULL);
//**************************************************************************
//* Step 3 : Register the callback.
//**************************************************************************
status = clfftSetPlanCallback(plan_handle, "pre_mulval", precallbackstr, 0, PRECALLBACK, &preuserdata, 1);
status = clfftSetPlanCallback(plan_handle, "post_mulval", postcallbackstr, 0, POSTCALLBACK, &postuserdata, 1);
//**************************************************************************
//* Step 4 : Bake plan and enqueue transform.
//**************************************************************************
status = clfftBakePlan( plan_handle, 1, &queue, NULL, NULL );
status = clfftEnqueueTransform( plan_handle, dir, 1, &queue, 0, NULL, &outEvent,
&input_buffers[ 0 ], buffersOut, clMedBuffer );

Important Notes on Callback

  1. The caller is responsible to provide a callback function in string form that matches the function prototype based on the type of callback, type of transform(real/complex), and whether LDS is used.
  2. clFFT considers the value returned by the pre-callback function as the new value of the input at the index corresponding to the inoffset argument.
  3. Callback function can request for local memory for its own use. If the requested amount of local memory is available on the device, clFFT passes a pointer to the local memory when it invokes the callback function.
  4. clFFT may invoke the FFT kernels several times depending on the input parameters. However, the pre-callback function provided by the caller is invoked only once for each point in the input. Similarly, it calls the post-callback function only once for each point in the output.
  5. If clFFT is implementing a given FFT in multiple phases, it calls the pre-callback function only from the first phase kernel. Similarly, it calls the post-callback function only from the last phase kernel.