CUDA编程基础（C++特性）

内存限定符

__device__:
__shared__:
__global__:
__constant__:仅仅可以从host代码通过runtime函数赋值，不能从device侧赋值。
__device__、__shared__、__managed__和__constant__内存空间说明符不允许在：

class、struct和union数据成员上
在host执行的外部函数中的变量

device、constant、managed内存空间说明符在device上不允许在外部或者非静态函数中声明。

device、constant、managed或者shared变量定义不能在非空构造体或者非空析构体的类类型中存在。如果trivial构造体或者满足下面的条件条件：

构造函数已经被定义
构造函数没有参数，初始化列表为空同时函数体也为空。
没有虚函数，没有虚基类或者没有非静态数据成员初始化器。
默认的所有基类构造体可以被认为是空。
对类的所有非静态数据成员默认构造函数可以被当做空。

析构体考虑作为一个空的翻译题，如果析构体不满足下面的条件：

析构函数已经被定义
析构函数体是空
没有虚函数或者没有虚基类
所有基类的析构函数可以被操作空
类的非静态成员可以被党作为空。

使用nvcc完整编译代码的过程中，device、shared、managed和constant变量不能使用extern定义为外部变量。分开变异模式下device、shared、managed和constant变量可以使用extern关键字定义为外部变量，然后如果没有找到外部变量的定义nvlink将报错（除非是动态分配的shared变量）。

`managed`内存空间限定符

__managed__内存空间限定符有如下限制：

managed变量的地址不能是一个常量表达式。
managed变量不应该有常量const限定符。
managed变量不应该有引用类型。
managed变量的地址或者值不应该在CUDA Runtime不可用的状态下使用，包括下面的情况：
static/dynamic 初始化或者析构对象或者线程本地存储。
在exit()之后的代码（例如gcc使用__attribute__((destructor))标记的）
当CUDA Runtime没有被初始化的时候执行代码（例如：被gcc标记为__attribute__((constructor))）
一个managed变量不能用作decltype()表达式的参数。
managed变量对动态分配的managed 内存有一致的行为。
managed变量在多GPU环境下变量仅仅分配一次。
managed 变量声明在host执行的函数中没有外部链接性。
managed变量没有外部或者静态链接被允许函数在设备商执行。

__device__ __managed__ int xxx = 10;         // OK

int *ptr = &xxx;                             // error: use of managed variable
                                             // (xxx) in static initialization
struct S1_t {
  int field;
  S1_t(void) : field(xxx) { };
};
struct S2_t {
  ~S2_t(void) { xxx = 10; }
};

S1_t temp1;                                 // error: use of managed variable
                                            // (xxx) in dynamic initialization

S2_t temp2;                                 // error: use of managed variable
                                            // (xxx) in the destructor of
                                            // object with static storage
                                            // duration

__device__ __managed__ const int yyy = 10;  // error: const qualified type

__device__ __managed__ int &zzz = xxx;      // error: reference type

template <int *addr> struct S3_t { };
S3_t<&xxx> temp;                            // error: address of managed
                                            // variable(xxx) not a
                                            // constant expression

__global__ void kern(int *ptr)
{
  assert(ptr == &xxx);                      // OK
  xxx = 20;                                 // OK
}
int main(void)
{
  int *ptr = &xxx;                          // OK
  kern<<<1,1>>>(ptr);
  cudaDeviceSynchronize();
  xxx++;                                    // OK
  decltype(xxx) qqq;                        // error: managed variable(xxx) used
                                            // as unparenthized argument to
                                            // decltype

  decltype((xxx)) zzz = yyy;                // OK
}

`global`

__global__ 函数参数通过常量内存传入device，在volta架构开始，存储参数限制为32764字节，在更老的架构上是4kb。
__global__ 函数不能有可变参数。
__global__ 函数参数不能传递引用。
当__global__ 函数从device端启动的时候，每个参数都应该是trivial cipyable和trivially destructible的。当__global__函数从host段启动的时候，每个参数类型不允许是Non-traivially copyable的或者non-trivially-copyable，但是处理这样的类型不使用C++标准的模型，用户代码必须确保工作流不影响程序的并发性，工作流分为两个部分：

内存拷贝代替拷贝构造函数
当__global__函数从host代码启动的时候，编译器生成子函数拷贝函数的参数，在最后使用memcpy拷贝参数到device上的__global__函数参数。如果一个参数是non-trivially-copyabley,也许会打断程序。

#include <cassert>
struct S{
    int x;
    int *ptr;
    __host__ __device__ S(){}
    __host__ __device__ S(const S&){ptr=&x;}//拷贝构造函数};
__global__ void foo(S in){
    assert(in.ptr == &in.x);//断言失败，因为编译器生成代码将拷贝in的内容到kernel的参数存放空间，因为拷贝构造函数掉过了，in.ptr没有初始化为&in.x
}
int main(){
    S temp;
    foo<<<1,1>>>(temp);
    cudaDeviceSynchronize();
}

#include <cassert>                                                                                                                                                                                    [0/6351]

__managed__ int counter;
struct S1 {
S1() { }
S1(const S1 &) { ++counter; }
};

__global__ void foo(S1) {

/* this assertion may fail, because
   the compiler generates stub
   functions on the host for a kernel
   launch, and they may copy the
   argument by value more than once.
*/
assert(counter == 1);
}

int main() {
S1 V;
foo<<<1,1>>>(V);
cudaDeviceSynchronize();
}

析构体也需要在__global__ 函数完成之前调用
host端kernel启动时异步的，如果一个__global__函数参数有Non-traivial析构体，析构体在函数执行完成之前执行host代码。这会打断程序执行造成副作用：

struct S {
 int *ptr;
 S() : ptr(nullptr) { }
 S(const S &) { cudaMallocManaged(&ptr, sizeof(int)); }
 ~S() { cudaFree(ptr); }
};

__global__ void foo(S in) {

  //error: This store may write to memory that has already been
  //       freed (see below).
  *(in.ptr) = 4;

}

int main() {
 S V;

 /* The object 'V' is first copied by value to a compiler-generated
  * stub function that does the kernel launch, and the stub function
  * bitwise copies the contents of the argument to kernel parameter
  * memory.
  * However, GPU kernel execution is asynchronous with host
  * execution.
  * As a result, S::~S() will execute when the stub function   returns, releasing allocated memory, even though the kernel may not have finished execution.
  */
 foo<<<1,1>>>(V);
 cudaDeviceSynchronize();
}

`volatile`关键字

编译器会优化全局或者共享内存的读写（例如缓存全局内存上的变量到L1缓存或者寄存器）。volatile关键字可以用来禁用优化，它表示：如果一个全局内存或者是共享内存中的变量被声明为voltile，编译器假设它的值可能被改变或者被任何其他的线程使用，因此任何对这个变量的引用都对映射实际的内存读写指令。

GPU指针

__device__、__shared__、__constant__变量仅仅可以用在设备代码中，__device__或者__constant__变量通过cudaGetSymbolAddress()函数在设备上调用然后获取指针。

`device`和`host`声明

#include <cuda_runtime.h>
#include <iostream>
class Base{
    int x;
    public:
    __host__ __device__ Base(void):x(10){
        printf("x = %d address = %p\n",x,&x);
    }
};
class Derive:public Base{
    int y;
};
class Other:public Base{
    int z;
};
__device__ void foo(){
    Derive D1;
    Other D2;
}
__host__ void bar(void){
    Other D3;
}
__global__ void test(){
    printf("test call device func foo\n");
    foo();
}
void call_test(){
    printf("run call_test");
    test<<<1,1>>>();
}
int main(){
    bar();
    call_test();
}

这里声明的foo()函数声明为在__device__上调用Derive带设备上调用，而bar的和foo分别声明了对__host__和__device__的调用，因此可以从host或者device上调用。如果明确声明虚析构函数的设备空间，则派生类的设备空间以基类为准：

#include <cuda_runtime.h>
struct Base1{
    virtual __host__ __device__ ~Base1(){};
};
// 派生类可以在host和device上访问
struct Derived1 : public Base1{
};
struct Base2{
    virtual __device__ ~Base2();
};
__device__ Base2::~Base2() = default;
// 派生类Derived2可以在Derived2上访问
struct Derived2 : public Base2{
};
int main(){
    Base1 b1;
}

C++中的“non-trivially-copyable type”指的是不符合POD（Plain Old Data）数据类型定义的数据类型。POD类型是指仅包含C++内置数据类型和符合特定要求的自定义类型，且可以使用浅拷贝来进行内存复制的数据类型。非POD类型则需要通过类似拷贝构造函数或移动构造函数等特殊函数进行内存复制，这些类型被称为“non-trivially-copyable type”。在C++11标准中，引入了移动语义和右值引用，使得非POD类型可以更高效地进行传递和复制。但是，仍然需要注意的是，非POD类型在进行内存复制时可能会引发一些问题，比如浅拷贝导致数据的混乱等。
因此，如果需要对非POD类型进行内存复制，应该使用适当的特殊函数进行复制，如拷贝构造函数、移动构造函数、拷贝赋值运算符和移动赋值运算符等。这些函数可以确保在内存复制时正确地处理非POD类型的数据，避免数据混乱等问题。总之，“non-trivially-copyable type”是指不符合POD类型定义的数据类型，需要特殊函数来进行内存复制，以确保数据的正确性。