by Kirill Evdakimov on May 14, 2024 under Engine

51 minute read

In the old days of OpenGL and Direct3D 11 to extend the engine externally said engine had to expose just a few graphical structures to the game programmer such as: ID3D11Device, ID3D11DeviceContext, ID3D11Texture1D/2D/3D and corresponding access views for DX11 and Texture Index for OpenGL. But everything changed when we got new APIs to work with, they exposed a lot of synchization to the render programmers that initially was done internally by drivers. So now, we have to think of ID3D12CommandQueue and VkQueue as another thread that is actually being executed on a separate device in the unknown time-frame. But what do we do if we want to access our texture on a different Graphics API which also executes at the unknown time-frame? We have to synchronize GPU threads! Since we gathered here to talk about CUDA we will have to synchronize CUDA GPU threads called cudaStream_t with some other GPU thread like ID3D12CommandQueue.

But before synchronizing our data we have to talk about how to exposed said data between Graphics APIs and CUDA? Today I am going to explain the way you can expose Texture data between APIs. Vertex and Index buffers should expose similarly, albeit they are allocated linearly you’d have to accound for that. More on that later in each section. Each API provide own way of importing resources, so I will describe the way you import memory into each API in their corresponding sections.

I divided this post into 3 major sections: Texture and Synchronization Interops

Handles

Another piece of crusial information that we have to know first is the way we export and import said memory addresses. Since I had to do this both for Windows and Linux, I looked into both ways those OSes handle this (no pun intended). They do so by using handles (for Win32) and file descriptors for (POSIX) but basically this is just an OS’s shared address to GPU address of the exported memory. I know it a bit convoluted but we work with what we got.

Windows expects you to supply some security attributes to each shared handle so it can be successfully exposed but it still had it’s quirks, more on them later.

We can fulfill this requirement with this struct:

NOTE

This is required only for Windows builds and doesn’t require to be exposed

#include <aclapi.h>

class WindowsSecurityAttributes
{
public:
	WindowsSecurityAttributes()
	{
		descriptor = (SECURITY_DESCRIPTOR*)calloc(1, SECURITY_DESCRIPTOR_MIN_LENGTH + 2 * sizeof(void **));
		assert(descriptor != (PSECURITY_DESCRIPTOR)nullptr);

		PSID *pp_sid = (PSID *)((PBYTE)descriptor + SECURITY_DESCRIPTOR_MIN_LENGTH);
		PACL *pp_acl = (PACL *)((PBYTE)pp_sid + sizeof(PSID *));

		InitializeSecurityDescriptor(descriptor, SECURITY_DESCRIPTOR_REVISION);

		SID_IDENTIFIER_AUTHORITY sid_identifier_authority = SECURITY_WORLD_SID_AUTHORITY;
		AllocateAndInitializeSid(&sid_identifier_authority, 1, SECURITY_WORLD_RID, 0, 0, 0, 0, 0, 0, 0, pp_sid);

		EXPLICIT_ACCESS explicit_access;
		ZeroMemory(&explicit_access, sizeof(EXPLICIT_ACCESS));
		explicit_access.grfAccessPermissions = STANDARD_RIGHTS_ALL | SPECIFIC_RIGHTS_ALL;
		explicit_access.grfAccessMode = SET_ACCESS;
		explicit_access.grfInheritance = INHERIT_ONLY;
		explicit_access.Trustee.TrusteeForm = TRUSTEE_IS_SID;
		explicit_access.Trustee.TrusteeType = TRUSTEE_IS_WELL_KNOWN_GROUP;
		explicit_access.Trustee.ptstrName = (LPTSTR)*pp_sid;

		SetEntriesInAcl(1, &explicit_access, nullptr, pp_acl);

		SetSecurityDescriptorDacl(descriptor, TRUE, *pp_acl, FALSE);

		attributes = (SECURITY_ATTRIBUTES*)calloc(1, sizeof(SECURITY_ATTRIBUTES));
		attributes->nLength = sizeof(attributes);
		attributes->lpSecurityDescriptor = descriptor;
		attributes->bInheritHandle = TRUE;
	}
	~WindowsSecurityAttributes()
	{
		PSID *pp_sid = (PSID *)((PBYTE)descriptor + SECURITY_DESCRIPTOR_MIN_LENGTH);
		PACL *pp_acl = (PACL *)((PBYTE)pp_sid + sizeof(PSID *));

		if (*pp_sid)
			FreeSid(*pp_sid);
		if (*pp_acl)
			LocalFree(*pp_acl);
		free(descriptor);
		free(attributes);
	}

	_SECURITY_ATTRIBUTES *operator&() { return attributes; }

protected:
	_SECURITY_ATTRIBUTES *attributes;
	_SECURITY_DESCRIPTOR *descriptor;
};

Texture Interopability

Basic outline

The steps we have to perform for each of the APIs are quite similar:

We create resource we wish to export with some kind of SHARED flag
Create shared handle
Import handle into target API
Tell the API or wrapper how it should interpret such handle

Interoping resources with Direct3D 12

This API was most straightforward of all. Spent just a few hour understanding how I can synchronize GPU threads.

Create shared texture

NOTE

This piece of code uses D3D12MA but it will look similarly when using dx12 original functions, you’d have to use CreateCommitedResource.

if (is_shared)
{
	// Basic check that uses SharedResourceCompatibilityTier from D3D12_FEATURE_DATA_D3D12_OPTIONS4
	// to check if format is supported or not
	// 
	// All formats are outlined at MSDN:
	// https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_shared_resource_compatibility_tier
	if (isSharedResourceFormatSupported(texture_desc.Format) == false)
	{
		// Show some error message that we can't create such texture because GPU or driver doesn't support this format
		return false;
	}
}

D3D12MA::ALLOCATION_DESC allocation_desc{};
allocation_desc.HeapType = D3D12_HEAP_TYPE_DEFAULT;
allocation_desc.Flags = D3D12MA::ALLOCATION_FLAG_COMMITTED;
allocation_desc.ExtraHeapFlags = D3D12_HEAP_FLAG_SHARED;

d3d12_allocator->CreateResource(&allocation_desc, &resource_desc, creation_state, nullptr, &allocation, IID_PPV_ARGS(&resource));

Create shared handle

WindowsSecurityAttributes windows_security_attributes;
device->CreateSharedHandle(texture->getResource(), &windows_security_attributes, GENERIC_ALL, L"", &handle);

Import memory into D3D12

This is quite simple, you just have to call ID3D12Device::OpenSharedHandle

device->OpenSharedHandle(handle, IID_PPV_ARGS(&resource));

Had no time to actually check how well it works and it’s not that important right now.

Acquiring VRAM Size

For correct exposure to CUDA we have to specify it correct resource size, we can get one like so:

vram_size = device->GetResourceAllocationInfo(0, 1, &resource_desc).SizeInBytes;

Linear buffers

Very similar to regular textures, albeit a bit easier. Since they work the same way, I won’t go into details here. The only thing being is that exporting textures as linear buffers have major restrictions in DX12, such as what formats can you use and where the allocated memory can reside on the hardware.

Interoping resources with Vulkan

I spent way too much time on this one, since there were a lot of difficulties debugging this particular interop setup for both Linux and Windows. And on top of that infinite waits on GPU threads with no ability to signal semaphores or kill the app without physically pressing power button and rebooting PC this way. Beware.

But alas, we have to do basically same steps as with DX12, albeit with a few quirks.

Firstly set tiling mode in VkImageCreateInfo to VK_IMAGE_TILING_OPTIMAL this is required, since we are going to access memory allocated by the same driver. Also we have to add VkExternalMemoryImageCreateInfo struct and specify handle types we want to access through.

VkExternalMemoryImageCreateInfo vkExternalMemImageCreateInfo{};
vkExternalMemImageCreateInfo.sType = VK_STRUCTURE_TYPE_EXTERNAL_MEMORY_IMAGE_CREATE_INFO;
#ifdef _WIN32
	vkExternalMemImageCreateInfo.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT;
#else
	vkExternalMemImageCreateInfo.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT_KHR;
#endif

image_info.tiling = VK_IMAGE_TILING_OPTIMAL;
image_info.pNext = &vkExternalMemImageCreateInfo;

Device Extensions

Before we can start exporting any memory from vulkan we will have to request some device extensions:

device_extensions_set.append(VK_KHR_EXTERNAL_MEMORY_EXTENSION_NAME);
#ifdef _WIN32
	device_extensions_set.append(VK_KHR_EXTERNAL_MEMORY_WIN32_EXTENSION_NAME);
#else
	device_extensions_set.append(VK_KHR_EXTERNAL_MEMORY_FD_EXTENSION_NAME);
#endif

Shove this one into initialization sequence and we good to go.

Create shared texture (VMA)

I had an opportunity to try both ways so I will present you the both ways to actually integrate this.

We begin by defining flags we want to use. Our hw memory region of interest can be accessed by GPU only and it’s HOST_VISIBLE for some reason. We also specify that we want to allocated dedicated memory for said texture with VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT shoved into allocation flags.

VmaAllocationCreateInfo alloc_info{};
alloc_info.usage = VMA_MEMORY_USAGE_AUTO;
alloc_info.priority = 1.0f;

if (is_shared)
{
	alloc_info.flags = VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT;
	alloc_info.requiredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
	alloc_info.usage = VMA_MEMORY_USAGE_GPU_ONLY;
	alloc_info.memoryTypeBits = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;

Next we define some structs to export memory. We specify windows security attributes, access and handle types. We don’t have to specify access and security attributes for POSIX.

NOTE

VMA doesn’t copy those structs so you will have to either store them somewhere close to your pool if you want to make more allocations using it, or just allocate 1 pool per shared texture. This does look ugly but it can be fixed by writing own allocator that will actually do a better job than VMA ever will.

	VkExportMemoryAllocateInfoKHR vulkanExportMemoryAllocateInfoKHR{ VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO_KHR };

	#ifdef _WIN32
		WindowsSecurityAttributes windows_security_attributes;
		VkExportMemoryWin32HandleInfoKHR vulkanExportMemoryWin32HandleInfoKHR{ VK_STRUCTURE_TYPE_EXPORT_MEMORY_WIN32_HANDLE_INFO_KHR };

		vulkanExportMemoryWin32HandleInfoKHR.pAttributes = &windows_security_attributes;
		vulkanExportMemoryWin32HandleInfoKHR.dwAccess = DXGI_SHARED_RESOURCE_READ | DXGI_SHARED_RESOURCE_WRITE;

		vulkanExportMemoryAllocateInfoKHR.pNext = &vulkanExportMemoryWin32HandleInfoKHR;
		vulkanExportMemoryAllocateInfoKHR.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT;
	#else
		vulkanExportMemoryAllocateInfoKHR.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT_KHR;
	#endif

In order to allocated shared memory through VMA we have to create custom pool with previously defined structs. First we want to find memory type index for our feature allocation.

	uint32_t memory_type_index{};
	vmaFindMemoryTypeIndexForImageInfo(vma_allocator, &info, &alloc_info, &memory_type_index);

We pass it to VmaPoolCreateInfo struct along with our extensions defined previously. Notice that we specify blockSize to 0. This is actually from VMA docs:

NOTE On VMA

New versions of this library support creating dedicated allocations in custom pools. It is supported only when VmaPoolCreateInfo::blockSize = 0.

	VmaPoolCreateInfo pool_info{};
	pool_info.memoryTypeIndex = memory_type_index;
	pool_info.blockSize = 0;
	pool_info.maxBlockCount = 1;
	pool_info.pMemoryAllocateNext = &vulkanExportMemoryAllocateInfoKHR;

	vmaCreatePool(vma_allocator, &pool_info, &pool);

The only thing left is actual allocation and getting correct vram size of it.

	alloc_info.pool = pool;

	vmaCreateImage(vma_allocator, &info, &alloc_info, &image, &allocation, nullptr);

	VkMemoryRequirements memory_req{};
	vkGetImageMemoryRequirements(vulkan_device, image, &memory_req);
	vram_size = memory_req.size;
}

Create shared texture (Native)

Integrating this into own memory allocator was quite easy, if we don’t count the amount of headaches I had due to poor debug info, no documentation notes and other shenanigans.

First we create image and ask it’s size:

vkCreateImage(vulkan_device, &image_info, nullptr, image);

VkMemoryRequirements memory_req{};
vkGetImageMemoryRequirements(vulkan_device, image, &memory_req);
vram_size = memory_req.size;

Steps pretty similar to VMA but we have to add another struct here, so that vulkan knows that we want dedicated allocation:

	VkMemoryDedicatedAllocateInfoKHR dedicatedInfo{ VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO };
	dedicatedInfo.image = image;

Same as before with VMA, but notice that we specify dedicatedInfo into some structs:

	VkExportMemoryAllocateInfoKHR vulkanExportMemoryAllocateInfoKHR{ VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO_KHR };

	#ifdef _WIN32
		WindowsSecurityAttributes windows_security_attributes;
		VkExportMemoryWin32HandleInfoKHR vulkanExportMemoryWin32HandleInfoKHR{ VK_STRUCTURE_TYPE_EXPORT_MEMORY_WIN32_HANDLE_INFO_KHR };
		vulkanExportMemoryWin32HandleInfoKHR.pNext = &dedicatedInfo;
		vulkanExportMemoryWin32HandleInfoKHR.pAttributes = &windows_security_attributes;
		vulkanExportMemoryWin32HandleInfoKHR.dwAccess = DXGI_SHARED_RESOURCE_READ | DXGI_SHARED_RESOURCE_WRITE;

		vulkanExportMemoryAllocateInfoKHR.pNext = &vulkanExportMemoryWin32HandleInfoKHR;
		vulkanExportMemoryAllocateInfoKHR.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT;
	#else
		vulkanExportMemoryAllocateInfoKHR.pNext = &dedicatedInfo;
		vulkanExportMemoryAllocateInfoKHR.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT_KHR;
	#endif

Then we just fill allocate info struct:

VkMemoryAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memory_req.size;
allocInfo.memoryTypeIndex = FindMemoryType(memory_req.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);

And finally allocate memory and then bind it by using vkAllocateMemory and vkBindImageMemory respectively.

WARNING vkGetMemoryWin32HandleKHR return’s nullptr handles with no errors!

vkGetMemoryWin32HandleKHR may return VK_SUCCESS but handle will be nullptr. Please double check if you actually fill and added VkExportMemoryAllocateInfoKHR into pNext chain! Works this way on both NVIDIA and AMD.

Importing memory into Vulkan

For importing memory into Vulkan we can use VkImportMemoryWin32HandleInfoKHR and VkImportMemoryFdInfoKHR for POSIX handles.

VkImportMemoryWin32HandleInfoKHR desc{};
desc.sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_WIN32_HANDLE_INFO_KHR;
desc.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT; // Same as before, just the other way around
desc.handle = my_handle;

They both for the same way:

We create an image to be imported with vkCreateImage.
Specify our Import memory structure into pNext of the VkMemoryAllocateInfo struct.
Allocate image as usual

Linear buffers

Very similar to regular textures, albeit a bit easier. Since they work the same way, I won’t go into details here.

Extra: Interoping resources with Direct3D 11

This one was simple almost the same way as DX12, it took me just about an hour to implement, so i included it since we will probably support this one for a bit longer.

Create shared texture

I only did that for textures since buffers required a bit of a extra hassle with keyed mutexes. And since we’re going to remove DX11 support in the future in favor of DX12 and Vulkan, it wouldn’t really be great to add 2 extra functions to lock/unlock keyed mutex just for the DX11.

But alas, in order to create a DX11 texture that will support shared handles we specify 2 flags, NTHANDLE tells to the API that we want to request NTHandle and the second one tells that we want to be able to share this resource to the external API.

misc_flags |= D3D11_RESOURCE_MISC_SHARED_NTHANDLE | D3D11_RESOURCE_MISC_SHARED;

If we were to use Keyed mutex here, we would replace D3D11_RESOURCE_MISC_SHARED_NTHANDLE with D3D11_RESOURCE_MISC_SHARED_NTHANDLE instead.

Create shared handle

This one requires a bit of a tinkering. First you QueryInterface for the IDXGIResource1 and then you can create shared handle from it.

IDXGIResource1 *dxgi_resource = nullptr;
resource->QueryInterface(__uuidof(IDXGIResource1), (void **)&dxgi_resource);

WindowsSecurityAttributes windows_security_attributes;

DWORD flags = DXGI_SHARED_RESOURCE_READ | DXGI_SHARED_RESOURCE_WRITE;
dxgi_resource->CreateSharedHandle(&windows_security_attributes, flags, L"", &handle);

Import shared handle

I haven’t looked into this yet, but it seems straightforward enough:

device->OpenSharedResource(handle, __uuidof(ID3D11Texture2D), (void **)&texture);

Acquiring VRAM Size

Since old APIs like DX11 allocate things for you, the only way you can check for the size of a resource is by calculating it yourself with a simple expression like so:

texture_size_in_bytes = 0;

for (num_mips)
{
	texture_size_in_bytes += width * height * depth;
	if (width > 1) width >>= 1;
	if (height > 1) height >>= 1;
	if (depth > 1) depth >>= 1;
}

texture_size_in_bytes *= array_size * (isCubeOrCubeArray() ? 6 : 1) * (1 << multisampling_mode) * format_size_in_bits / 8;

Interoping resources with CUDA

Finally the one we want to interop with.

We will have to fill a couple of structs, such as cudaExternalMemoryHandleDesc and cudaExternalMemoryMipmappedArrayDesc.

The first one is pretty easy. Depending on your graphics API that your engine runs on, you will have to choose between a few options of resource types:

cudaExternalMemoryHandleDesc desc{};
switch (Render::GetAPI())
{
	case Render::API::DIRECT3D11: desc.type = cudaExternalMemoryHandleTypeD3D11Resource; break;
	case Render::API::DIRECT3D12: desc.type = cudaExternalMemoryHandleTypeD3D12Resource; break;
	case Render::API::VULKAN:
		#ifdef _WIN32
			desc.type = cudaExternalMemoryHandleTypeOpaqueWin32;
		#else
			desc.type = cudaExternalMemoryHandleTypeOpaqueFd;
		#endif
		break;
}

Specify handles we got from Vulkan or DX11/12

#ifdef _WIN32
	desc.handle.win32.handle = resource->GetHandleWin32();
#else
	desc.handle.fd = resource->GetHandleFd();
#endif

Here we will have to set vram_size that we calculated for our texture previously and tell CUDA that we allocated dedicated memory. Not really sure about the way DX12 allocates memory but when you call CreateCommitedResource it will be dedicated, based on my experiences with debugging vulkan allocations. By default vulkan won’t do dedicated allocations, hence you have to specify some helper structs for that on vulkan side. Obviously you don’t have to do this but the reason I did it, is that so code written for a single Engine’s API will be similar for a different Engine API. Not a lot of the users actually know how memory is allocated behind the scenes so to help with that I tried my best to make it tiny bit easier to work with (: .

desc.size = texture->GetVRAMSize();
desc.flags = cudaExternalMemoryDedicated;

cudaImportExternalMemory(&external_memory, &desc);

cudaExternalMemoryMipmappedArrayDesc asks for texture dimensions, format and some flags. I will go with 2D texture for simplicity:

cudaExternalMemoryMipmappedArrayDesc mip_mapped_array_desc{};
mip_mapped_array_desc.extent = make_cudaExtent(texture->GetWidth(), texture->GetHeight(), 0);

For more info on specifying texture dimensions look up cudaMallocMipmappedArray in the CUDA docs.

Format is going to be RGBA 8 bit per channel.

mip_mapped_array_desc.formatDesc = { 8, 8, 8, 8, cudaChannelFormatKindUnsigned };

I have 1 mip level.

mip_mapped_array_desc.numLevels = 1;

For flags we ask for Load and Store access (basically UAV)

mip_mapped_array_desc.flags = cudaArraySurfaceLoadStore;

And if this texture is ever going to be used as render target via Engine we say that by adding another flag:

mip_mapped_array_desc.flags |= cudaArrayColorAttachment;

Now the rest is easy, get 0th mip map and create surface object so that we can read/write anything from/to the texture using CUDA kernels.

cudaArray_t cu_array{};
cudaGetMipmappedArrayLevel(&cu_array, mip_array, 0);

cudaResourceDesc resource_desc{};
resource_desc.resType = cudaResourceTypeArray;
resource_desc.res.array.array = cu_array;
cudaCreateSurfaceObject(&surface, &resource_desc);

WARNING

If you encounter an invalid argument error at this step, please check that you have specified correct texture size at the previous step and that your external_memory object is not 0x00000000.

Synchronization Interopability

We already half way through our journey! Enter the land of perma-frozen GPU thread-lands and synchronization of the GPU trains! Next stop is - insanity.

Timeline

At first glance we have a very simple syncronization problem we would like to solve. I used API as an umbrella term for DX11/DX12/Vulkan graphics APIs.

But in reality our diagram should look like this:

We still have to account for CPU, this way we can actually see where each of the API call is being made at and debug accordingly. But to make this image a bit simplier to understand let’s put actual calls into perspective:

Direct3D 12

Let’s start with DX12 since it’s easier to grasp.

We need 3 things: A new fence, a handle for that fence and Win32 event:

WindowsSecurityAttributes windows_security_attributes;

fence_value = 0;
device->CreateFence(fence_value, D3D12_FENCE_FLAG_SHARED, D3D12_IID_PPV_ARGS(&fence));
device->CreateSharedHandle(fence, &windows_security_attributes, GENERIC_ALL, L"", &handle);

fence_event = CreateEvent(nullptr, FALSE, FALSE, nullptr);

Win32 event is used for CPU <-> GPU signaling and locking on one of the threads (either CPU or GPU). Where’s a new fence is our sync primitive between GPU threads. Now, let’s start with 4 basic operations that we need to implement:

CPU Wait GPU Before we can destroy any of the resources, and we actually do care when we can destroy our resources, we need to know when we can do it. By far the simpliest and most basic way is to just wait for it.

And since other GPU thread outside of our engine may still want to execute some stuff we have to do this in order:
1. Stop Engine GPU thread you already should have that function, it may be called something like WaitGPU or WaitGraphics we want to modify it a bit:
  1. Signal our shared fence from the CPU
  2. Then the rest is usual
  We want to signal shared fence from the CPU, so it knows that it should do some work, instead of waiting to be fed.
2. Increment fence value
3. Signal our fence on Command Queue with current fence value
4. Wait for the Event, similar to WaitGPU that we already should have in our engine.
5. Increment fence_value one last time.
CPU Signal GPU
```
fence->Signal(fence_value);
```
That is all there is to it (:

GPU Wait

Very simple and elegant

CommandQueue->Wait(fence, fence_value);

GPU Signal

Before we issue our signal, we need to increment our fence_value. Very simple and very elegant!
```
CommandQueue->Signal(fence, fence_value);
```

Just before we call ExecuteCommandLists where our texture is going to be accessed, we need to add some guards: At the frame, when our fence is going to be used for the first time, we skip GPU Wait, all subsequent calls should issue GPU Wait here. After ExecuteCommandLists call always issue GPU Signal.

Then you want to insert an event, callback or something like that for the Engine API, so the other API can insert it’s GPU Wait, Work and GPU Signal here, right before IDXGISwapchain::Present() call.

Vulkan

Vulkan implementation isn’t as elegant as DX12’s is, but it has same points of interest.

Let’s start by creaing Vulkan’s Timeline semaphore, it requires VK_KHR_timeline_semaphore extension. And we also need VK_KHR_external_semaphore and VK_KHR_external_semaphore_fd / VK_KHR_external_semaphore_win32 for POSIX and Win32 respectively. We use Timeline semaphore, since it acts similar to DX12 fence.

VkSemaphoreCreateInfo semaphore_desc{};
semaphore_desc.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;

VkExportSemaphoreCreateInfoKHR export_semaphore_desc{};
export_semaphore_desc.sType = VK_STRUCTURE_TYPE_EXPORT_SEMAPHORE_CREATE_INFO_KHR;
#ifdef _WIN32
	// FIXME: WindowsSecurityAttributes issue with CUDA external semaphore signal
	//
	// For some unknown reason timline semaphores won't signal on Windows via CUDA
	// when we provide VkExportSemaphoreWin32HandleInfoKHR struct
	// This might be avaliable as an optional toggle in feature for CUDA workaround

	//WindowsSecurityAttributes windows_security_attributes;

	//VkExportSemaphoreWin32HandleInfoKHR export_semaphore_handle_desc{};
	//export_semaphore_handle_desc.sType = VK_STRUCTURE_TYPE_EXPORT_SEMAPHORE_WIN32_HANDLE_INFO_KHR;
	//export_semaphore_handle_desc.pAttributes = &windows_security_attributes;
	//export_semaphore_handle_desc.dwAccess = DXGI_SHARED_RESOURCE_READ | DXGI_SHARED_RESOURCE_WRITE;

	//export_semaphore_desc.pNext = &export_semaphore_handle_desc;
	export_semaphore_desc.handleTypes = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT;
#else
	export_semaphore_desc.handleTypes = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT;
#endif

VkSemaphoreTypeCreateInfoKHR timeline_info{};
timeline_info.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
timeline_info.pNext = &export_semaphore_desc;
timeline_info.initialValue = fence_value;
timeline_info.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;

semaphore_desc.pNext = &timeline_info;
vkCreateSemaphore(vk_device, &semaphore_desc, nullptr, &semaphore);

#ifdef _WIN32
	VkSemaphoreGetWin32HandleInfoKHR semaphore_handle_desc{};
	semaphore_handle_desc.sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_WIN32_HANDLE_INFO_KHR;
	semaphore_handle_desc.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT;

	semaphore_handle_desc.semaphore = semaphore;
	vkGetSemaphoreWin32HandleKHR(vk_device, &semaphore_handle_desc, &handle);
#else
	VkSemaphoreGetFdInfoKHR semaphore_handle_desc{};
	semaphore_handle_desc.sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR;
	semaphore_handle_desc.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT;

	semaphore_handle_desc.semaphore = semaphore;
	vkGetSemaphoreFdKHR(vk_device, &semaphore_handle_desc, &handle);
#endif

Another strange bit here is absence of VkExportSemaphoreWin32HandleInfoKHR. For some unknown reason when you export semaphore with this struct and import handle into CUDA. The latter starts to act strangely. GPU Signal doesn’t work and always returns cudaErrorUnknown, while Wait does work as expected. Sync point for the Vulkan implementation is the same as in D3D12, so I will skip it.

All 4 functions that we need are going to be very similar to DX12, albeit not so beautiful and elegant.

CPU Wait GPU

Since we are using timeline semaphores we need to add one more struct to this whole thing, VkTimelineSemaphoreSubmitInfoKHR, it will allow as to specify fence_value we want to signal or wait on.

vk_queue->WaitGPU();

fence_value++;

VkSubmitInfo submit{};
submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.pSignalSemaphores = &semaphore;
submit.signalSemaphoreCount = 1;

VkTimelineSemaphoreSubmitInfoKHR timeline_semaphore_info{};
timeline_semaphore_info.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
timeline_semaphore_info.pSignalSemaphoreValues = &fence_value;
timeline_semaphore_info.signalSemaphoreValueCount = 1;
submit.pNext = &timeline_semaphore_info;

vkQueueSubmit(vk_queue, 1, &submit, VK_NULL_HANDLE);

fence_value++;

Implementation is very similar to DX12.

CPU Signal GPU

Here we can use vkSignalSemaphoreKHR function to signal our semaphore from the CPU.

VkSemaphoreSignalInfoKHR info{};
info.sType = VK_STRUCTURE_TYPE_SEMAPHORE_SIGNAL_INFO_KHR;
info.semaphore = semaphore;
info.value = fence_value;

vkSignalSemaphoreKHR(vk_queue, &info);

GPU Wait

Not so elegant solution by Vulkan:

VkPipelineStageFlags wait_mask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;

VkSubmitInfo submit{};
submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.pWaitSemaphores = &semaphore;
submit.waitSemaphoreCount = 1;
submit.pWaitDstStageMask = &wait_mask;

VkTimelineSemaphoreSubmitInfoKHR timeline_semaphore_info{};
timeline_semaphore_info.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
timeline_semaphore_info.pWaitSemaphoreValues = &fence_value;
timeline_semaphore_info.waitSemaphoreValueCount = 1;
submit.pNext = &timeline_semaphore_info;

vkQueueSubmit(vk_queue, 1, &submit, VK_NULL_HANDLE);

VkSemaphoreWaitInfo wait_info{};
wait_info.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
wait_info.semaphoreCount = 1;
wait_info.flags = VK_SEMAPHORE_WAIT_ANY_BIT;
wait_info.pValues = &fence_value;
wait_info.pSemaphores = &semaphore;

vkWaitSemaphores(vk_device, &wait_info, 0xFFFFFFFF);

Same as CommandQueue->Wait() but not so nice.

GPU Signal

Same as before in GPU Wait, the only difference being is that we use our fence to signal, not to wait on it.

fence_value++;

VkSubmitInfo submit{};
submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.pSignalSemaphores = &semaphore;
submit.signalSemaphoreCount = 1;

VkTimelineSemaphoreSubmitInfoKHR timeline_semaphore_info{};
timeline_semaphore_info.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
timeline_semaphore_info.pSignalSemaphoreValues = &fence_value;
timeline_semaphore_info.signalSemaphoreValueCount = 1;
submit.pNext = &timeline_semaphore_info;

vkQueueSubmit(vk_queue, 1, &submit, VK_NULL_HANDLE);

Extra: Direct3D 11

Remember how Vulkan was similar to DX12? DX11 is no exception here, I’ll quickly glance over 4 basic functions and we will end this section with CUDA Interop.

It has same API calls to create new fence, and requires same 3 things:

WindowsSecurityAttributes windows_security_attributes;

device5->CreateFence(fence_value, D3D11_FENCE_FLAG_SHARED, D3D12_IID_PPV_ARGS(&fence));
fence->CreateSharedHandle(&windows_security_attributes, GENERIC_ALL, L"", &handle);

fence_event = CreateEvent(nullptr, FALSE, FALSE, nullptr);

CPU Wait GPU

Since we have no Command Queue exposed in D3D11 we have to do ID3D11DeviceContext::Flush() here instead. But everything after that is same as before.

context4->Flush();

fence_value++;

context4->Signal(fence, fence_value);
fence->SetEventOnCompletion(fence_value, fence_event);
WaitForSingleObjectEx(fence_event, INFINITE, FALSE);

fence_value++;

CPU Signal GPU

For some reason I couldn’t find anything related to that.

GPU Wait

Very elegant api calls by DirectX 11:

context4->Wait(fence, fence_value);

GPU Signal

Here’s one more:

context4->Signal(fence, fence_value);

Sync Point

Again, since we don’t have Command Queue exposed, we have to go a bit different route in order to simulate same behavior. What currently works for me is to do the same thing as in DX12 implementation but instead of wrapping ExecuteCommandList calls, I wrap my logic around ID3D11DeviceContext::Flush() right before Present calls.

CUDA

Finally we are almost done with this. Since we want our CUDA work to be synchronized with Vulkan/D3D12/D3D11 work we need to have some kind of thread. Luckily for us CUDA has Streams which act as individual Command Queues we can launch our work kernels on and use them to sync-up our GPU work between APIs.

API for that is actually pretty easy, we create CUDA Stream in a single line and immediately call for sync. Why? I am not sure.. But without it may not work for some reason. Might be something along the lines of “threads are always doing some work but you need to sync it before you submit any of your own”. It may require some more testing on my end but I am already at my limit with all of this, so I’ll toss it into the endless list of things I have to look into later.

cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cudaStreamSynchronize(stream);

Now that we have another GPU thread, let’s import a fence from the APIs I described above. The process is remarkably simple and quite intuitive: 1) Specify type and handle of the imported fence 2) Get new CUDA object back

cudaExternalSemaphoreHandleDesc desc{};
#ifdef _WIN32
	switch (Render::GetAPI())
	{
		case Render::API::DIRECT3D11: desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence; break;
		case Render::API::DIRECT3D12: desc.type = cudaExternalSemaphoreHandleTypeD3D12Fence; break;
		case Render::API::VULKAN: desc.type = cudaExternalSemaphoreHandleTypeTimelineSemaphoreWin32; break;
	}

	desc.handle.win32.handle = fence->GetHandleWin32();
#else
	desc.type = cudaExternalSemaphoreHandleTypeTimelineSemaphoreFd;
	desc.handle.fd = fence->GetHandleFd();
#endif

cudaImportExternalSemaphore(external_semaphore, &desc);

The only 2 things left is to make 2 functions for GPU Wait and Signal operations and launch our CUDA kernel.

GPU Wait

Both Wait and Signal functions ask for the CUDA Stream you want them to operate on. We can specify the one we created recently, as for the value, our Engine Fence should have which value we should be interested in.

cudaExternalSemaphoreWaitParams desc{};
desc.params.fence.value = fence->GetValue();

cudaWaitExternalSemaphoresAsync(&external_semaphore, &desc, 1, stream);

GPU Signal

As you may have noticed, each Signal op is preceeded with value incrementation. This trend continues here as well.

fence->IncrementValue();

cudaExternalSemaphoreSignalParams desc{};
desc.params.fence.value = fence->GetValue();
cudaSignalExternalSemaphoresAsync(&external_semaphore, &desc, 1, stream);

Nothing fancy to describe here either.

Sync Point

Lastly we want to start our work for the frame and we can do so by subscribing to specific callback I described earlier in DX12 sync section. It should be right after our Wait/ExecuteCommandList/Signal chain and right before Present calls.

if (have_work_kernel == false)
{
	CUDAWait();
	LaunchKernel();
}
CUDASignal();

Resource Destruction / Re-creation

Okay, it was quite the journey but we still have to tidy up our resources at some point, so we are not done yet. We still have 3 more topics to cover like: 1) Closing Handles 2) Destroying Resources 3) Re-creating Resources

What’s interesting here is that they all rely on the previous ones. We can’t re-create resource handles without destroying the old ones and we can’t destroy old resources without closing their corresponding handles. But fear not everything here is actually more straightforward than you might’ve fought.

Closing Handles

Let’s start with this one. Every time our resource is cleared, destroyed or changed memory location we need to close old handle. On Windows we just have to call CloseHandle for that. But what if the underlying handle is null? Let’s add check for that! But what if the handle isn’t null but it’s invalid for some reason? We can check that with GetHandleInformation! The only thing left is to notify CUDA or other API that we handed this handle(no pun intended) to that it is no longer valid. In the end we get something along the lines of:

DWORD flags;
if (handle != nullptr && GetHandleInformation(handle, &flags))
{
	event_close_handle.Run();
	CloseHandle(handle);
}
handle = nullptr;

Perfect, now let’s do the Linux version of this. Again, closing handle is as simple as calling close function. Checking if handle is valid is the same as checking for return value of fcntl with F_GETFD flag.

if (handle != -1 && fcntl(handle, F_GETFD))
{
	event_close_handle.run();
	close(handle);
}
handle = -1;

Destroying Resources

For the internal version of a resource clear i’ve used destructor of external memory object which automatically closes handle when it is destroyed. And it is destroyed the moment original resource is destroyed or re-created.

For the external one I did only bare minimum so that original samples are not overwhelming. But I left some comments at the destruction of shared resource wrapper:

// NOTE:
// This is not optimal for Mesh Dynamic re-creation between frames
// as it's introduces high-cost GPU <-> CPU Sync point before we can destory old resources.
//
// What you want to do instead is to have double-buffered Mesh Dynamics
// and to have a separate thread which will carry destruction of the current Mesh Dynamic
// while the other one is used to render something.
//
// Also you can use a single fence for all of your shared resources so that you will have to wait
// single time instead of waiting for multiple fences, this also may improve performance
// as far fewver sync Wait/Signal pairs will be issued and waited on.

So, the idea is simple: before we can destroy this resource, we must ensure that it is no longer needed on neither of the GPU Timelines.

cudaStreamSynchronize(stream);
CUDASignal();
fence->WaitGPU(); // Wait for Engine's Queue to finish

Now we certain that resource is no longer needed and we can call cudaFree, cudaDestroySurfaceObject or cudaFreeMipmappedArray. After that we can destroy cudaDestroyExternalMemory and cudaDestroyExternalSemaphore.

Re-creating Resources

If you haven’t skipped other parts of this post, you must’ve noticed event_close_handle before. The only thing we missing is event_create_handle, it should be right after CreateSharedHandle call for external memory and similar in Vulkan. Those events fire correspondingly when we close or create a handle to the resource externally. On the create handle event we can call create_handle function that will make a new CUDA handle for the resource, no things to change here. Closing handle is simple as destroying CUDA’s external memory, since engine already destroyed resource internally and called closeHandle which fired event_close_handle.

Tips & Tricks

This should do it but if you wish to go by the same path as me, i’ve collected a few more Tips that may help you to save some time.

Sync

Use a bit of nsys from time to time, it may help you understand sync timeline a bit better.

Permafreeze

CPU permafreeze

Use procexp64 from Sysinternals Suite to firstly kill some App threads that have been Waiting for eternity, and then try to terminate it.

GPU permafreeze

If you have done something wrong, your App may freeze indefinitely due to lack proper Signal/Wait chain on GPU. Windows TDR won’t handle this at all, especially for Vulkan. Althrough 1 time TDR happend to kill an DX11 GPU permarozen app after 10 or so minutes. But If you want to continue your journey then you have 2 choises: 1) Restart PC by long-press of the hardware button 2) Rename resulting binaries and try again

You can also wrap all of your CUDA calls into some macro that contains an assert:

#define cudaCheck(x) \
{ \
	cudaError_t res = x; \
	if (res != cudaSuccess) \
		Log::Assert("%s: %s Failed with error '%s' (%d)", __FUNCTION__, #x, cudaGetErrorString(res), res); \
}


...

cudaCheck(cudaSignalExternalSemaphoresAsync(...));

So it can, hopefully, prevent permafreezing your App.

Linux

On linux you may want to use cuda-gdb to catch a few strange bugs that may occur.

If for some reason your app can’t call CUDA Driver and CUDA Runtime APIs, try this (you may want to change path to libcuda.so): LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so cuda-gdb -ex=r --args ../bin/app_x64d <my_args_go_here_as_usual>

If everything works fine, check that you don’t have libcuda.so.6 or something like this neither in ../bin/ nor in the $PATH. I copied some random .so way back when i was experimenting at it tripped me for like a whole week or so. I tried everything from re-installing drivers 5 times to other obscure things. Knowing what libs app was loading at the startup would’ve helped a lot, much like VS does in it’s output window.

Memory artifacts

If you are unsure what may cause weird image artifacts then use this kernel to debug root cause of the artifacts:

__device__ unsigned int RGBAFloatToInt(float4 rgba)
{
	rgba.x = __saturatef(rgba.x); // clamp to [0.0, 1.0]
	rgba.y = __saturatef(rgba.y);
	rgba.z = __saturatef(rgba.z);
	rgba.w = __saturatef(rgba.w);
	return  ((unsigned int)(rgba.w * 255.0f) << 24) |
			((unsigned int)(rgba.z * 255.0f) << 16) |
			((unsigned int)(rgba.y * 255.0f) << 8) |
			((unsigned int)(rgba.x * 255.0f));
}

__global__ static void Kernel(cudaSurfaceObject_t surface, int nWidth, int nHeight) {
    int x = (threadIdx.x + blockIdx.x * blockDim.x) * 2;
    int y = (threadIdx.y + blockIdx.y * blockDim.y) * 2;
    if (x + 1 >= nWidth || y + 1 >= nHeight)
        return;
	
    float4 rgba{};
    rgba.x = (x & 0xFF) / 255.0f;
    rgba.y = (y & 0xFF) / 255.0f;
    rgba.z = 0.0f;
    rgba.w = 1.0f;
    int color = RGBAFloatToInt(rgba);
    surf2Dwrite(color, surface, x * sizeof(uchar4), y);
    surf2Dwrite(color, surface, x * sizeof(uchar4), y + 1);
    surf2Dwrite(color, surface, (x + 1) * sizeof(uchar4), y);
    surf2Dwrite(color, surface, (x + 1) * sizeof(uchar4), y + 1);
}

__global__ static void KernelLinear(void *ptr, int nWidth, int nHeight)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x) * 2;
    int y = (threadIdx.y + blockIdx.y * blockDim.y) * 2;
    if (x + 1 >= nWidth || y + 1 >= nHeight)
        return;
	
	uint8_t *pDst = ptr + (x + y * nWidth) * sizeof(uchar4);
	
	float4 rgba{};
	rgba.x = (x & 0xFF) / 255.0f;
	rgba.y = (y & 0xFF) / 255.0f;
	rgba.z = 0.0f;
	rgba.w = 1.0f;
	unsigned int color = RGBAFloatToInt(rgba);
	
	*(uint2 *)pDst = uint2{
		color,
		color,
	};
	*(uint2 *)(pDst + nWidth * sizeof(uchar4)) = uint2{
		color,
		color,
	};
}

void LaunchKernel(cudaStream_t stream, cudaSurfaceObject_t surface, int nWidth, int nHeight) {
    Kernel<<<dim3(nWidth / (16 * 2), nHeight / (8 * 2)), dim3(16, 8), 0, stream>>>(surface, nWidth, nHeight);
}
void LaunchKernelLinear(cudaStream_t stream, void *ptr, int nWidth, int nHeight) {
    KernelLinear<<<dim3(nWidth / (16 * 2), nHeight / (8 * 2)), dim3(16, 8), 0, stream>>>(ptr, nWidth, nHeight);
}

Then produce square image 1024x1024 with this kernel and use artifact table to find the root cause. Your goal is to get this image and to have correct parameters on both sides:

Here’s the artifact table I produced. Top row denotes export memory, left most column describes import memory parameters:

	Non-Dedicated, Linear Tiling	Dedicated, Linear Tiling
Non-Dedicated Texture 2D	check notes	check notes
Dedicated Texture 2D	check notes	check notes
Non-Dedicated Linear Buffer
Dedicated Linear Buffer

Notes:

You can’t do that. It will fail on cudaExternalMemoryGetMappedMipmappedArray

From: CUDA Vulkan VkImage Interop - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums

NOTE

Note from NVIDIA forums The underlying reason there`s this requirement is because if the driver knows that an image uses a dedicated allocation, then it knows there`s only one image in an allocation and that that image has offset 0, which allows it to do different optimizations, including a different image layout in memory.

(Thanks to Vivek Kini for this info).

WARNING

The two APIs must agree on the depth of the image - in particular, one must be careful to use a depth of 0 (instead of 1) for a 2D CUDA image. If it`s 1, then that`s a 3D width x height x 1 texture, and may use a different layout (and will produce incorrect results if accessed using surf2D().

Final thoughts

It was a painful journey but alas I still managed to finish it. What does CUDA actually bring here on the table? Well, a lot! We now have a way to encode and decode videos with CUVID / Video Codec SDK, trace scenes with OptiX if we ever expose Vertex/Index Buffers and easy to access objects in the scene info. And other stuff that can be done via CUDA Toolkit.

I know there is a lot to take in but this is the most feature complete post that I ever seen regarding interopability of CUDA with other APIs. I really hope that this comes handy to some one in search of some shining light at the end of this insanity hell-hole.

There’s still a lot to cover regarding this topic that I had no time to try yet:

NOTE

ID3D12Device3::OpenExistingHeapFromAddress requires VirtualAlloc instead of malloc. AgilitySDK has that fixed with ID3D12Device13::OpenExistingHeapFromAddress1 starting from around 1.610.0

Importing CPU memory into Graphics API via - ID3D12Device3::OpenExistingHeapFromAddress or VkImportMemoryHostPointerInfoEXT api call for Vulkan.
Importing GPU memory into DX11/DX12/Vulkan - I just glanced over this and never had enough time to actually test this.
Allocating shared non-dedicated memory - I probably will have to do this at some point but now assuming that non of the users will ever want to share lots of textures, I skipped this.
Memory allocator toggle so that every texture buffer will be allocated as shared memory - not sure if I’ll do this, only the time will tell.
Sharing memory across GPUs - I am sure it’s a broad topic and involves lots of synchronizations, memory management and other hard stuff to work correctly, so I am only interested in this for some research for now.
Transfering Vulkan’s Queue Ownership - it seems like I never needed it for current implementation at least.
Wrapping DX12/DX11/Vulkan into own DLL, so I can issue some GPU Signal calls and kill the app, instead of rebooting whole PC, so I can continue working.
Win32 named handles
What WindowsSecurityAttributes actually does and how it affects interprocess interops

Theoretically this API now allows me to integrate any graphics API, albeit at the cost of running 2 separate drivers and syncing between them. In theory I can friend any of the APIs below with the internal ones:

DirectX 11 (native Handle)
DirectX 12 (native Handle)
Vulkan (via KHR extensions)
OpenGL (via KHR extensions)
CUDA (native Handle/Posix FD)
OpenCL (not sure about this one, but may work in theory similar to CUDA)

This opens new doors for interopabilities! Now it’s possible to integrate DirectX 11 interops, while the engine still running on DX12 or Vulkan internally! The old way of doing things was to export ID3D11Device, ID3D11Texture2D, ID3D11ShaderResourceView and other similar pointers, while it was really easy and all sync work was done automatically by the drivers, now we have unified interfaces for that that have not that hard interfaces to grasp. Albeit they require a bit of understanding of underlaying memory layouts and allocations.

P.S. Optimization of synchronization path is left to the reader as an exercise.

Engine, DirectX, DirectX 11, DirectX 12, Vulkan, CUDA, C++

I feedback.
Let me know what you think of this article on twitter @Kriegthepshyco or leave a comment below!

Interoping Graphics APIs

Handles

Texture Interopability

Basic outline

Interoping resources with Direct3D 12

Create shared texture

Create shared handle

Import memory into D3D12

Acquiring VRAM Size

Linear buffers

Interoping resources with Vulkan

Device Extensions

Create shared texture (VMA)

Create shared texture (Native)

Importing memory into Vulkan

Linear buffers

Extra: Interoping resources with Direct3D 11

Create shared texture

Create shared handle

Import shared handle

Acquiring VRAM Size

Interoping resources with CUDA

Synchronization Interopability

Timeline

Direct3D 12

Vulkan

CPU Wait GPU

CPU Signal GPU

GPU Wait

GPU Signal

Extra: Direct3D 11

CPU Wait GPU

CPU Signal GPU

GPU Wait

GPU Signal

Sync Point

CUDA

GPU Wait

GPU Signal

Sync Point

Resource Destruction / Re-creation

Closing Handles

Destroying Resources

Re-creating Resources

Tips & Tricks

Sync

Permafreeze

CPU permafreeze

GPU permafreeze

Linux

Memory artifacts

Final thoughts