Page Migration: Host Allocations In Prefetch Explained

by Alex Johnson

Introduction

This article examines why prefetch operations end up using host allocations as a consequence of page migration. The issue was reported in the CHIP-SPV and chipStar projects: using zeMemAllocShared or clSharedMemAllocINTEL for managed memory leads to segmentation faults (segfaults) after GPU kernel execution, specifically when a SIGSEGV handler is installed. The root cause is that Intel's driver employs page migration for shared allocations, leaving the host page mappings in a ---s state (no permissions) after GPU access. Understanding the root cause, the workaround, its trade-offs, and the affected tests matters for anyone working with shared memory in heterogeneous computing environments.

Root Cause: Page Migration and Memory Access Permissions

The heart of the problem lies in how Intel's drivers manage shared memory allocations between the host (CPU) and the device (GPU). When you call zeMemAllocShared or clSharedMemAllocINTEL, you ask the system to create a memory region accessible from both the CPU and the GPU. This is useful for performance because it avoids explicit data copies between the two processing units.

However, Intel's drivers employ a technique called page migration to optimize memory usage. With page migration, the physical location of memory pages can change dynamically. Here's how it works:

  1. Initially, the memory pages might reside in host (CPU) memory.
  2. When the GPU needs to access this memory, the driver migrates the necessary pages to the device (GPU) memory. This migration is transparent to the user code.
  3. After the GPU kernel has finished executing, the driver may leave the host page mappings in a ---s state, meaning the host no longer has any permission to access those pages (an example mapping is shown below). This is where the problem arises.
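
For reference, the ---s permission string comes straight from /proc/<pid>/maps: no read, no write, no execute, on a shared mapping. A hypothetical entry for such a region (addresses, device, and inode are made up for illustration) looks like this:

7f3a40000000-7f3a40800000 ---s 00000000 00:0e 12345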

When the host code attempts to access this memory:

  • Without a Signal Handler: The kernel intercepts the page fault. The driver steps in, migrates the data back to the host, and the access succeeds seamlessly. You likely wouldn't even notice anything went wrong.
  • With a Signal Handler: This is where things get tricky. Signal handlers, such as those installed by debuggers (like GDB) or testing frameworks (like Catch2), intercept the SIGSEGV signal before the driver has a chance to service the page fault. The handler reports a segmentation fault, and the process crashes. This interception is by design: debuggers and testing frameworks depend on catching the error signal and inspecting the program's state at the moment of failure, which is fundamental to finding and fixing problems.

This leads to a frustrating situation: code that runs fine under normal conditions crashes as soon as you debug it or run it under a testing framework. That inconsistency makes the underlying issue hard to identify and fix.

Workaround: Host Memory Allocations

Fortunately, there is a workaround: use zeMemAllocHost or clHostMemAllocINTEL instead of shared allocations for managed memory. These functions allocate memory in the host's address space, where it remains accessible from both the host and the device without page migration, which avoids the permission changes that trigger the segfaults.

When memory is allocated with zeMemAllocHost or clHostMemAllocINTEL, it resides in the host's system memory and stays there: both the CPU and the GPU access it in place, so the driver never migrates pages and never leaves the host mappings in a ---s state. Because the memory remains consistently accessible from both sides, the segmentation faults that occur when a signal handler intercepts SIGSEGV before the driver can service the page fault simply cannot happen. This gives more stable, predictable behavior, especially in debugging or testing environments where signal handlers are active.
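
To make this concrete, here is a minimal sketch of the OpenCL variant, assuming the platform, context, and size variables from the reproducer later in this article. Note that clHostMemAllocINTEL takes no device argument, because host allocations are not tied to a particular device:

typedef void* (*clHostMemAllocINTEL_fn)(cl_context, const cl_mem_properties_intel*,
                                        size_t, cl_uint, cl_int*);

clHostMemAllocINTEL_fn clHostMemAllocINTEL = (clHostMemAllocINTEL_fn)
    clGetExtensionFunctionAddressForPlatform(platform, "clHostMemAllocINTEL");

cl_int err;
// Host USM lives in system memory and is never migrated, so the host
// mapping keeps its permissions across kernel launches.
void* buf = clHostMemAllocINTEL(context, NULL /*properties*/, size,
                                0 /*alignment: default*/, &err);

The Level Zero equivalent swaps zeMemAllocShared for zeMemAllocHost (again a sketch, assuming an existing hContext):

ze_host_mem_alloc_desc_t hostDesc = {ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC, NULL, 0};
void* buf = NULL;
zeMemAllocHost(hContext, &hostDesc, size, 0 /*alignment: default*/, &buf);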

Trade-off: Performance Implications

However, this workaround comes with a trade-off: a potential performance penalty. Since the data stays in system memory, it doesn't migrate to the device memory, which is often faster for GPU access. This can lead to increased latency and reduced overall performance, especially for memory-intensive operations.

When considering the performance trade-off, it's important to weigh the benefits of stability and debuggability against the potential performance hit. In scenarios where debugging or testing is the primary focus, using host memory allocations is often the better choice. This is because the ability to reliably identify and fix issues is crucial during development phases, even if it means sacrificing some speed.

On the other hand, in production environments or performance-critical applications, the trade-off might lean towards using shared memory allocations despite the risk of encountering the page migration issue. In these cases, developers might need to implement additional strategies to mitigate the risk of segmentation faults, such as avoiding the use of signal handlers in production code or employing alternative memory management techniques.

It's also worth noting that the performance impact of using host memory allocations can vary depending on the specific hardware and the nature of the application. In some cases, the overhead of data transfer between the host and the device might be minimal, especially if the amount of data being transferred is small or if the application is not heavily reliant on memory bandwidth. Therefore, it's essential to profile and benchmark the application under different memory allocation strategies to determine the optimal approach for a given use case.
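
One low-effort way to make that comparison is OpenCL event profiling. The sketch below reuses context, device, kernel, and globalSize from the reproducer later in this article; the only extra requirement is creating the queue with profiling enabled. Run it once with shared allocations and once with host allocations and compare:

// Create a command queue with profiling enabled.
cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
cl_command_queue pq = clCreateCommandQueueWithProperties(context, device, props, &err);

cl_event ev;
err = clEnqueueNDRangeKernel(pq, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

// Device timestamps are reported in nanoseconds.
cl_ulong t0 = 0, t1 = 0;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
printf("kernel time: %.3f us\n", (t1 - t0) / 1000.0);
clReleaseEvent(ev);

Host-side access latency, which is where the two strategies differ most, can be measured separately by wrapping the verification loop with clock_gettime(CLOCK_MONOTONIC, ...).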

Affected Tests and Scenarios

This issue affects any test that uses hipMallocManaged followed by GPU kernel execution and subsequent host access, especially when run under a framework like Catch2, which installs signal handlers. Such tests are prone to crashing because the SIGSEGV signal is intercepted before the driver can handle the page fault.

Specifically, tests using hipMallocManaged are susceptible because this function allocates memory that is intended to be seamlessly accessible by both the host and the device. This memory management approach relies on the underlying system to handle data migration and synchronization between the CPU and GPU. When the page migration issue occurs, the host's access to this memory is disrupted, leading to the aforementioned segmentation faults.

Furthermore, the problem is exacerbated when these tests are executed within a testing framework like Catch2. Testing frameworks often install signal handlers to catch errors and exceptions during test execution. This is a crucial feature for robust testing, as it allows the framework to identify and report failures accurately. However, in this case, the signal handler's intervention becomes a liability, as it prevents the driver from resolving the page fault, thus causing the test to crash.

To illustrate, consider a test that allocates a managed memory buffer with hipMallocManaged, initializes it with data on the host, launches a GPU kernel to process the data, and finally verifies the results on the host. If the page migration issue arises between the kernel execution and the verification step, the test will likely crash whenever a signal handler is active.
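
A minimal sketch of that pattern in HIP is shown below (the kernel and all names are hypothetical, and error checking is omitted for brevity); under Catch2's signal handlers, the verification step is where the SIGSEGV surfaces:

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main() {
    const int N = 1024;
    int* data = nullptr;
    hipMallocManaged(reinterpret_cast<void**>(&data), N * sizeof(int)); // managed allocation
    for (int i = 0; i < N; i++) data[i] = i;                            // host initialization
    hipLaunchKernelGGL(scale, dim3((N + 255) / 256), dim3(256), 0, 0, data, N);
    hipDeviceSynchronize();                                             // kernel execution
    bool ok = true;
    for (int i = 0; i < N; i++)                                         // host verification:
        ok = ok && (data[i] == 2 * i);                                  // may fault if pages are ---s
    printf("%s\n", ok ? "PASS" : "FAIL");
    hipFree(data);
    return ok ? 0 : 1;
}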

The implications of this issue extend beyond just failing tests. It can also hinder the debugging process, as the crashes might not occur consistently, especially if the code is run outside of a testing environment or without a debugger attached. This inconsistency can make it difficult to pinpoint the root cause of the problem, leading to increased development time and potential instability in the final product.

Reproducer Code: Demonstrating the Issue

The following code provides a reproducer for this bug, showing how the issue manifests when using shared memory allocations with OpenCL. Note that the kernel source uses a C++ raw string literal, so the file must be compiled as C++ (for example with g++) and linked against OpenCL:

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <setjmp.h>

static sigjmp_buf jump_buffer;
static volatile sig_atomic_t got_segfault = 0;

void segfault_handler(int sig) {
    (void)sig;
    got_segfault = 1;
    // siglongjmp (rather than longjmp) restores the signal mask, so SIGSEGV
    // is not left blocked after jumping out of the handler.
    siglongjmp(jump_buffer, 1);
}

#define CHECK_CL(err, msg) do { \
    if (err != CL_SUCCESS) { \
        fprintf(stderr, "OpenCL Error %d at %s:%d - %s\n", err, __FILE__, __LINE__, msg); \
        exit(1); \
    } \
} while(0)

// Minimal kernel that just reads from input and writes to output
const char* kernelSource = R"(
__kernel void copy(__global int* output, __global const int* input, int N) {
    int gid = get_global_id(0);
    if (gid < N) {
        output[gid] = input[gid];
    }
}
)";

typedef void* (*clSharedMemAllocINTEL_fn)(cl_context, cl_device_id,
    const cl_mem_properties_intel*, size_t, cl_uint, cl_int*);
typedef cl_int (*clMemFreeINTEL_fn)(cl_context, void*);

// Print the /proc/<pid>/maps entry covering ptr's page. This greps for the
// page-aligned address, so it reports 'not found' when ptr does not sit at
// the start of its mapping.
void print_mapping(void* ptr) {
    char cmd[256];
    snprintf(cmd, sizeof(cmd),
        "grep %lx /proc/%d/maps 2>/dev/null | head -1 || echo 'not found'",
        ((unsigned long)ptr) >> 12 << 12, getpid());
    system(cmd);
}

int main(int argc, char** argv) {
    int use_svm = (argc > 1 && strcmp(argv[1], "svm") == 0);

    printf("Intel OpenCL USM Shared Memory Bug Reproducer\n");
    printf("==============================================\n");
    printf("Allocation method: %s\n\n", use_svm ? "SVM (clSVMAlloc)" : "USM (clSharedMemAllocINTEL)");

    cl_int err;
    cl_platform_id platform = NULL;
    cl_device_id device = NULL;

    // Find Intel GPU
    cl_uint numPlatforms;
    clGetPlatformIDs(0, NULL, &numPlatforms);
    cl_platform_id* platforms = (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id));
    clGetPlatformIDs(numPlatforms, platforms, NULL);

    for (cl_uint i = 0; i < numPlatforms; i++) {
        char platName[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(platName), platName, NULL);

        cl_uint numDevices;
        err = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
        if (err == CL_SUCCESS && numDevices > 0) {
            platform = platforms[i];
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
            printf("Platform: %s\n", platName);
            break;
        }
    }
    free(platforms);

    if (!device) {
        fprintf(stderr, "No GPU found\n");
        return 1;
    }

    char deviceName[256], driverVersion[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
    clGetDeviceInfo(device, CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, NULL);
    printf("Device: %s\n", deviceName);
    printf("Driver: %s\n\n", driverVersion);

    // Check for USM extension
    char extensions[8192];
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    if (!use_svm && !strstr(extensions, "cl_intel_unified_shared_memory")) {
        fprintf(stderr, "USM extension not supported\n");
        return 1;
    }

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    CHECK_CL(err, "clCreateContext");

    cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, NULL, &err);
    CHECK_CL(err, "clCreateCommandQueue");

    // Build kernel
    cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &err);
    CHECK_CL(err, "clCreateProgramWithSource");
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
        fprintf(stderr, "Build error: %s\n", log);
        return 1;
    }
    cl_kernel kernel = clCreateKernel(program, "copy", &err);
    CHECK_CL(err, "clCreateKernel");

    const int N = 8192;  // Larger allocation to ensure issue manifests
    const size_t size = N * sizeof(int);
    void* input = NULL;
    void* output = NULL;

    clSharedMemAllocINTEL_fn clSharedMemAllocINTEL = NULL;
    clMemFreeINTEL_fn clMemFreeINTEL = NULL;

    // Allocate memory
    printf("Step 1: Allocate memory\n");
    if (use_svm) {
        input = clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);
        output = clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);
        if (!input || !output) {
            fprintf(stderr, "SVM allocation failed\n");
            return 1;
        }
    } else {
        clSharedMemAllocINTEL = (clSharedMemAllocINTEL_fn)
            clGetExtensionFunctionAddressForPlatform(platform, "clSharedMemAllocINTEL");
        clMemFreeINTEL = (clMemFreeINTEL_fn)
            clGetExtensionFunctionAddressForPlatform(platform, "clMemFreeINTEL");

        input = clSharedMemAllocINTEL(context, device, NULL, size, 0, &err);
        CHECK_CL(err, "clSharedMemAllocINTEL (input)");
        output = clSharedMemAllocINTEL(context, device, NULL, size, 0, &err);
        CHECK_CL(err, "clSharedMemAllocINTEL (output)");
    }
    printf("   input=%p output=%p\n", input, output);
    printf("   Memory mapping: "); print_mapping(input);

    // Initialize from host - direct access for USM, mapped access for SVM
    printf("\nStep 2: Initialize memory from host\n");
    int* in = (int*)input;
    int* out = (int*)output;

    if (use_svm) {
        // SVM coarse-grain requires map/unmap
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, input, size, 0, NULL, NULL);
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, output, size, 0, NULL, NULL);
    }
    // USM shared memory can be accessed directly (that's the point of USM)
    for (int i = 0; i < N; i++) {
        in[i] = i;
        out[i] = 0;
    }
    if (use_svm) {
        clEnqueueSVMUnmap(queue, input, 0, NULL, NULL);
        clEnqueueSVMUnmap(queue, output, 0, NULL, NULL);
        clFinish(queue);
    }
    printf("   SUCCESS: Wrote %d elements, in[0]=%d\n", N, in[0]);
    printf("   Memory mapping: "); print_mapping(input);

    // Prefetch to device (like hipMemPrefetchAsync does)
    printf("\nStep 3: Prefetch to device\n");
    {
        const void* ptrs[] = {input};
        const size_t sizes[] = {size};
        cl_event ev;
        err = clEnqueueSVMMigrateMem(queue, 1, ptrs, sizes, 0 /*to device*/, 0, NULL, &ev);
        CHECK_CL(err, "clEnqueueSVMMigrateMem");
        clWaitForEvents(1, &ev);
        clReleaseEvent(ev);
        clFinish(queue);
    }
    printf("   Prefetch completed\n");
    printf("   Memory mapping: "); print_mapping(input);

    // Set kernel arguments and run
    printf("\nStep 4: Run kernel\n");
    if (use_svm) {
        void* ptrs[] = {input, output};
        clSetKernelExecInfo(kernel, CL_KERNEL_EXEC_INFO_SVM_PTRS, sizeof(ptrs), ptrs);
    } else {
        void* ptrs[] = {input, output};
        clSetKernelExecInfo(kernel, 0x4203 /*CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL*/, sizeof(ptrs), ptrs);
        // Also set indirect access flag like chipStar does
        cl_bool indirect = CL_TRUE;
        clSetKernelExecInfo(kernel, 0x4202 /*CL_KERNEL_EXEC_INFO_INDIRECT_SHARED_ACCESS_INTEL*/,
                           sizeof(cl_bool), &indirect);
    }
    clSetKernelArgSVMPointer(kernel, 0, output);
    clSetKernelArgSVMPointer(kernel, 1, input);
    clSetKernelArg(kernel, 2, sizeof(int), &N);

    size_t globalSize = N;
    cl_event ev;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &ev);
    CHECK_CL(err, "clEnqueueNDRangeKernel");
    clWaitForEvents(1, &ev);
    clReleaseEvent(ev);
    clFinish(queue);
    printf("   Kernel completed\n");
    printf("   Memory mapping: "); print_mapping(input);

    // Try to read from host - NO SVM MAP, just direct access like HIP managed memory
    printf("\nStep 5: Read memory from host (direct access, no clEnqueueSVMMap)\n");
    printf("   Memory mapping: "); print_mapping(input);

    // Set up segfault handler
    struct sigaction sa, old_sa;
    sa.sa_handler = segfault_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, &old_sa);

    if (sigsetjmp(jump_buffer, 1) == 0) {
        printf("   Attempting to read input[0]... ");
        fflush(stdout);
        volatile int val = in[0];
        printf("SUCCESS: %d\n", val);
    } else {
        printf("SEGFAULT!\n");
    }

    // Restore handler
    sigaction(SIGSEGV, &old_sa, NULL);

    if (got_segfault) {
        printf("\n*** BUG REPRODUCED: USM memory inaccessible from host after kernel ***\n");
        return 1;
    }

    // Cleanup
    if (use_svm) {
        clSVMFree(context, input);
        clSVMFree(context, output);
    } else {
        clMemFreeINTEL(context, input);
        clMemFreeINTEL(context, output);
    }
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    printf("\nTEST PASSED\n");
    return 0;
}

This code demonstrates the issue by allocating shared memory using clSharedMemAllocINTEL, initializing it, running a simple kernel, and then attempting to access the memory from the host. When a signal handler is installed, the access will likely result in a segfault, illustrating the core problem.

The reproducer code is designed to be a minimal, self-contained example that clearly demonstrates the issue. By running this code, developers can quickly verify whether they are affected by the bug and experiment with the proposed workaround.

Let's break down the code step by step:

  1. Includes and Setup: The code starts by including the headers needed for OpenCL, standard I/O, and signal handling. It also defines a segfault_handler function that is called when a segmentation fault occurs. This handler sets a flag and uses siglongjmp to jump back to a predefined sigsetjmp point in the code.
  2. OpenCL Initialization: The code then initializes the OpenCL environment, including finding an Intel GPU, creating a context, and setting up a command queue. It also checks for the cl_intel_unified_shared_memory extension, which is necessary for using USM (Unified Shared Memory) allocations.
  3. Kernel Compilation: A simple OpenCL kernel is defined that copies data from an input buffer to an output buffer. The code compiles this kernel and creates a cl_kernel object.
  4. Memory Allocation: This is the crucial part of the code. It allocates memory using either clSVMAlloc (for SVM, Shared Virtual Memory) or clSharedMemAllocINTEL (for USM). The choice between these two allocation methods can be controlled by a command-line argument. The code prints the addresses of the allocated memory and their initial memory mappings.
  5. Memory Initialization: The allocated memory is initialized with some data on the host. For USM, the memory can be accessed directly. For SVM, the memory needs to be mapped using clEnqueueSVMMap before it can be accessed and unmapped using clEnqueueSVMUnmap afterwards.
  6. Prefetch to Device: The code then prefetches the input memory to the device using clEnqueueSVMMigrateMem. This step simulates a scenario where the GPU is about to access the memory.
  7. Kernel Execution: The OpenCL kernel is executed, copying data from the input buffer to the output buffer. The code sets the kernel arguments and enqueues the kernel for execution on the device.
  8. Host Access (Triggering the Bug): This is where the bug is triggered. The code attempts to read from the input memory on the host without mapping it first (in the case of SVM) or without any explicit synchronization. This direct access after the kernel has potentially migrated the memory to the device can lead to a segmentation fault.
  9. Signal Handling: The code installs a signal handler for SIGSEGV before attempting to access the memory. If a segmentation fault occurs, segfault_handler sets the got_segfault flag and jumps back to the sigsetjmp call. This allows the code to handle the segmentation fault gracefully and report that the bug has been reproduced.
  10. Cleanup: Finally, the code cleans up all OpenCL resources, including memory allocations, kernels, programs, command queues, and contexts.
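
To try both allocation paths of the reproducer (the file and binary names are placeholders; "svm" is the argument parsed in main):

g++ -O0 usm_repro.cpp -o usm_repro -lOpenCL
./usm_repro        # USM path via clSharedMemAllocINTEL (default)
./usm_repro svm    # SVM path via clSVMAlloc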

By running this reproducer, developers can observe the segmentation fault and confirm that they are experiencing the issue described in this article. They can then experiment with the workaround, which involves using clHostMemAllocINTEL instead of clSharedMemAllocINTEL or clSVMAlloc, to see if it resolves the problem.

Conclusion

In conclusion, the issue of prefetch operations using host allocations due to page migration is a complex one, arising from the interaction between shared memory management, driver optimizations, and signal handling mechanisms. Understanding the root cause, workaround, and trade-offs is crucial for developers working with heterogeneous computing environments. By using host memory allocations when necessary and being mindful of the performance implications, developers can mitigate the risk of segmentation faults and ensure the stability of their applications.

For further reading on memory management in OpenCL and related topics, please visit the Khronos Group website.