I don't know why changing how a variable is accessed/stored inside a pthread subroutine drastically improves performance.


I'm still new to multithreaded programming, and I know that if you're not careful there can be some strange side effects, but I didn't expect code I wrote myself to confuse me this much. I'm writing what I thought was an obvious threading starter/test: just summing the numbers from 0 to x (yes, I know about n(n+1)/2 — https://www.reddit.com/r/mathmemes/comments/gq36wb/nn12/ — but what I'm trying to learn is how to use threads, not how to make the program as fast as possible). I use one function call to create the threads, based on a hard-coded number of cores for my system and a "boolean" that says whether the processor has hyper-threading. I divide the work among the threads so that each thread sums over one sub-range; in theory, if all the threads cooperate, I can do numcores*normal_computation, which is pretty exciting, and to my surprise it worked basically the way I expected — until I made a few tweaks.

Before going on, I think some code will help.

These are the preprocessor definitions I use in the basic code:

#define NUM_CORES 4
#define MULTI_THREADED 1 //1 for true, 0 for false
#define BIGVALUE 1000000000UL

I use this struct to pass arguments to the thread-oriented function:

typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
} sum_args;

This is the function that creates the threads:

int64_t SumUpTo_WithThreads(int64_t limit)
{   //counting starts from zero
    const int numthreads = NUM_CORES + (int)(NUM_CORES*MULTI_THREADED*0.25);
    pthread_t threads[numthreads];
    sum_args listofargs[numthreads];
    int64_t offset = limit/numthreads; //careful: integer division drops the fractional part
    int64_t total = 0;
    //i < numthreads-1 because, due to integer division, offset*numthreads may not exactly equal limit
    for (int i = 0; i < numthreads-1; i++)
    {
        listofargs[i] = (sum_args){.start = offset*i, .end = offset*(i+1)};
        pthread_create(&threads[i], NULL, SumBetween, (void *)(&listofargs[i]));
    }
    //handle the leftover range as an edge case
    //limit + 1 because SumBetween() excludes .end, i.e. each loop stops at .end - 1
    listofargs[numthreads-1] = (sum_args){.start = offset*(numthreads-1), .end = limit+1};
    pthread_create(&threads[numthreads-1], NULL, SumBetween, (void *)(&listofargs[numthreads-1]));
    //end of thread creation
    for (int i = 0; i < numthreads; i++)
    {
        pthread_join(threads[i], NULL); //make sure the thread is finished before adding .return_total
        total += listofargs[i].return_total;
    }
    return total;
}
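
With the values above (NUM_CORES = 4, MULTI_THREADED = 1, BIGVALUE = 1000000000), numthreads works out to 4 + 1 = 5 and offset to 200000000, so the first four threads sum [0, 200000000), [200000000, 400000000), [400000000, 600000000), [600000000, 800000000), and the last thread sums [800000000, 1000000001), i.e. everything from 0 through 1000000000 inclusive.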

This is just a "normal" summation implementation, for comparison:

int64_t SumUpTo(int64_t limit)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i <= limit; i++)
        total += i;
    return total;
}

This is the function the threads run. It has "two implementations": one that is fast for some reason and one that is slow for some reason (this is where my confusion lies). Extra side note: I use preprocessor directives only to make it easier to compile the slower and faster versions.

void* SumBetween(void *arg)
{
    #ifdef SLOWER
    ((sum_args *)arg)->return_total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        ((sum_args *)arg)->return_total += i;
    #endif
    #ifdef FASTER
    uint64_t total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        total += i;
    ((sum_args *)arg)->return_total = total;
    #endif
    return NULL;
}

This is my main function:

int main(void)
{
    #ifdef THREADS
    printf("%ld\n", SumUpTo_WithThreads(BIGVALUE));
    #endif
    #ifdef NORMAL
    printf("%ld\n", SumUpTo(BIGVALUE));
    #endif 
    return 0;
}

This is how I compile (I made sure to set the optimization level to 0 so the compiler doesn't optimize the silly summation away entirely; after all, I want to learn how to use threads!):

make faster
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DFASTER -o faster.exe
make slower
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DSLOWER -o slower.exe
clang --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
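
(If you want to look at the per-version assembly without cloning the repo, something like the following should work; clang's -S flag stops after generating assembly:

clang countV2.c -std=c99 -O0 -pthread -DTHREADS -DFASTER -S -o faster.s
clang countV2.c -std=c99 -O0 -pthread -DTHREADS -DSLOWER -S -o slower.s

The repo linked at the bottom also contains the assembly generated on my machine.)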

Here are the results/differences (note that code built with GCC shows the same effect):

slower:
sudo time ./slower.exe 
500000000500000000
14.63user 0.00system 0:03.22elapsed 453%CPU (0avgtext+0avgdata 1828maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps
faster:
sudo time ./faster.exe 
500000000500000000
2.97user 0.00system 0:00.67elapsed 440%CPU (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+83minor)pagefaults 0swaps
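
(Both binaries print 500000000500000000, which is exactly what the closed form n(n+1)/2 gives for n = 1000000000, so the two versions at least agree on the answer; only the time differs.)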

Why is using an extra stack-defined variable so much faster than directly dereferencing the passed-in struct pointer?!

I tried to find the answer to this myself. I ended up running some tests that implement the same basic/naive summation algorithm as my SumUpTo() function, the only difference being how much indirection there is to the data they work on.

Here are the results (a rough sketch of two of these test functions follows them):

Choose a function to execute!
int64_t sum(void) took: 2.207833 (s) //new stack-defined variable, basically a copy of the SumUpTo() function
void sumpoint(int64_t *total) took: 2.467067 (s)
void sumvoidpoint(void *total) took: 2.471592 (s)
int64_t sumstruct(void) took: 2.742239 (s)
void sumstructpoint(numbers *p) took: 2.488190 (s)
void sumstructvoidpoint(void *p) took: 2.486247 (s)
int64_t sumregister(void) took: 2.161722 (s)
int64_t sumregisterV2(void) took: 2.157944 (s)
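
For reference, the sum() and sumstructpoint() variants look roughly like this (simplified sketch; the exact code, including the real numbers struct, is in countV2.c in the repo linked below):

/* assumed layout; the real struct is in the repo */
typedef struct numbers { int64_t total; } numbers;

/* local accumulator, nothing is written through a pointer inside the loop */
int64_t sum(void)
{
    int64_t total = 0;
    for (int64_t i = 0; i <= BIGVALUE; i++)
        total += i;
    return total;
}

/* every iteration reads and writes through the struct pointer */
void sumstructpoint(numbers *p)
{
    p->total = 0;
    for (int64_t i = 0; i <= BIGVALUE; i++)
        p->total += i;
}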

The test results were more or less in line with what I expected, so I concluded it has to be something along those lines.

To add some more information: I'm running Linux, specifically the Mint distribution. My processor info is as follows:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   36 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           42
Model name:                      Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
Stepping:                        7
CPU MHz:                         813.451
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4784.41
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d

If you want to compile the code yourself or look at the assembly generated on my particular machine, take a look at: https://github.com/spaceface102/Weird_Threads
The main source code is "countV2.c", in case you get lost.
Thanks for the help!
/*EOPost*/
