No runtime improvement when using two threads instead of one

Question

13 views · January 26, 2023

Anonymous · January 26, 2023

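I'm fairly new to multithreaded programming in C++. I wrote a simple piece of code, shown below. When I run it on two threads, it finishes barely faster than on a single thread. I found other similar questions, but they differ from my case because I have no shared resource that both threads need to access at the same time.

The code:

#include <chrono>
#include <iostream>
#include <thread>
using namespace std;
typedef unsigned long long ull;
ull oddSum = 0;
ull evenSum = 0;
void addOdd(){
    for(int i = 1; i <= 1999999999; i++){
        if(i % 2 == 1)
            oddSum += i;
    }
}
void addEven(){
    for(int i = 1; i <= 1999999999; i++){
        if(i % 2 == 0)
            evenSum += i;
    }
}
int main(){
    auto startTime = std::chrono::high_resolution_clock::now();
    // Two threads
    std::thread t1(addEven); // start both threads
    std::thread t2(addOdd);
    t1.join();
    t2.join(); // wait for both threads to finish
    // One thread
    //addEven();
    //addOdd();
    auto stopTime = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(stopTime - startTime);
    cout << "Sum of odd numbers: " << oddSum << endl;
    cout << "Sum of even numbers: " << evenSum << endl;
    cout << elapsed.count()/(double)1000000 << endl;
    return 0;
}

With a single thread, the average over 10 runs is about 7.3 seconds.
With two threads, the average over 10 runs is about 6.8 seconds.

Since the loops in the two functions take up almost all of the time, I expected that running two threads in parallel, each executing one function, would cut the time in half.

Note 1: I know the time cannot be exactly halved; a more reasonable guess might be that the two-thread run takes at most 5 seconds. I understand that creating the thread objects has its own overhead.
Note 2: Maybe I'm missing something, but the two threads do not access any shared location.

Any ideas are welcome. I'm familiar with the theory of concurrent programming and am now starting to build up some hands-on experience. I'm using a 4-core Intel i7 processor.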

3 Answers

Anonymous · Answer 1 · 2023-05-21T09:36:56+00:00

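Two threads bring no improvement over one because each thread does almost as much work as the single-threaded version. To fix this, change the code so that the numbers to iterate over are divided between the two threads.

The initial single-threaded solution iterates over 1999999999 numbers in one loop. The code looks like this: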

for(int i = 1; i <= 1999999999; i++){
    if(i % 2 == 1)
        oddSum += i;
    else
        evenSum += i;
}

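Below is the two-thread solution, in which each thread iterates over all of the same numbers, so each thread does almost as much work as the single-threaded loop.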

for(int i = 1; i <= 1999999999; i++){
    if(i % 2 == 1)
        oddSum += i;
}
for(int i = 1; i <= 1999999999; i++){
    if(i % 2 == 0)
        evenSum += i;
}

The optimized solution divides the 1999999999 iterations between the two threads:

for(int i = 1; i <= 1999999999; i += 2){
    oddSum += i;
}
for(int i = 2; i <= 1999999999; i += 2){
    evenSum += i;
}

With this change, each thread performs half as many iterations.
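As a rough sketch of how the two stride-2 loops above could be run on two threads, the snippet below wraps them in one helper. The names `addEverySecond` and `sumOddAndEven` and the `limit` parameter are introduced here for illustration only and are not part of the original answer.

```cpp
#include <functional>
#include <thread>

typedef unsigned long long ull;

// Adds first, first + 2, first + 4, ... (up to and including limit) into sum.
void addEverySecond(ull first, ull limit, ull& sum) {
    for (ull i = first; i <= limit; i += 2) {
        sum += i;
    }
}

// Hypothetical wrapper: one thread sums the odd numbers, the other the even
// numbers, so each thread performs roughly half of the iterations.
void sumOddAndEven(ull limit, ull& oddSum, ull& evenSum) {
    oddSum = 0;
    evenSum = 0;
    std::thread oddThread(addEverySecond, 1ULL, limit, std::ref(oddSum));
    std::thread evenThread(addEverySecond, 2ULL, limit, std::ref(evenSum));
    oddThread.join();   // wait for both threads before the sums are read
    evenThread.join();
}
```

Calling `sumOddAndEven(1999999999ULL, odd, even)` would correspond to the range used in the question. Note that the two sums are whatever variables the caller passes in, so if they sit next to each other in memory, the cache-line issue discussed in the next paragraph still applies.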

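However, the crux of the problem is that the two global variables probably sit in the same cache line, which is why the runtime does not improve noticeably. Running the optimized code on a single thread takes about as long as running the original code on two threads. The fix is therefore to make sure the global variables are in different cache lines.

In short, two threads are no faster than one here because each thread does almost as much work as the single-threaded version. The fix is to divide the work between the threads and to make sure the global variables are in different cache lines.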

Anonymous · Answer 2 · 2023-06-04T11:54:52+00:00

Because you declared the two global variables right next to each other, you are probably running into false sharing.

False sharing is a phenomenon that occurs when multiple threads access different variables that are located on the same cache line. This can lead to decreased performance due to cache invalidation and synchronization issues.

In this case, the two global variables are likely being placed on the same cache line, causing false sharing. When one thread modifies one variable, it invalidates the cache line, forcing the other thread to reload the entire cache line even though it only needs to access the other variable.

To solve this issue, you can use padding to ensure that the two variables are placed on different cache lines. By adding padding between the variables, you can prevent false sharing and improve performance.

Here's an example of how you can add padding to the global variables in C++:

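struct alignas(64) GlobalVariables {
    int variable1;                   // offset 0: start of the first cache line
    char padding[64 - sizeof(int)];  // fills the rest of the first cache line
    int variable2;                   // offset 64: start of the second cache line
};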

Aligning the struct to the cache line size (e.g., 64 bytes) and padding between the members places the two variables on different cache lines, avoiding false sharing.

With this padding, each thread can modify its respective variable without invalidating the cache line of the other variable, leading to improved performance.

In conclusion, the issue of runtime not improving with two threads compared to one thread is likely due to false sharing. By adding padding to ensure that the variables are placed on different cache lines, you can avoid false sharing and improve performance.
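Applied to the two sums from the question, the same idea might look like the sketch below. The struct name `PaddedSums` and the constant `kCacheLine` are made up for this example, and the 64-byte line size is an assumption (it is the value quoted for x86 in this thread). The `static_assert`s let the compiler check the layout instead of trusting it.

```cpp
#include <cstddef>

typedef unsigned long long ull;

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Each member gets its own alignment requirement, so the compiler inserts
// the padding between them; no manual char array is needed.
struct PaddedSums {
    alignas(kCacheLine) ull oddSum = 0;
    alignas(kCacheLine) ull evenSum = 0;
};

// Compile-time checks that the two sums cannot share a cache line.
static_assert(alignof(PaddedSums) == kCacheLine,
              "struct must start on a cache line boundary");
static_assert(offsetof(PaddedSums, evenSum) - offsetof(PaddedSums, oddSum) >= kCacheLine,
              "sums must be at least one cache line apart");
```

One thread would then update `sums.oddSum` and the other `sums.evenSum` on a single shared `PaddedSums` object. The price is memory: the struct occupies two full cache lines to hold 16 bytes of data.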

Anonymous · Answer 3 · 2023-05-01T00:03:25+00:00

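Variables that are accessed frequently from different threads should be placed on different cache lines (64 bytes wide on x86).

You can achieve this by aligning the variables: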

ull oddSum __attribute__((aligned(64))) = 0;
ull evenSum  __attribute__((aligned(64))) = 0;

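Without this, the writes are effectively serialized, because a cache line can be modified by only one CPU at a time.

Aligning the variables can cut the runtime of the multithreaded case by 30%.

As mentioned in the comments, if the compiler supports C++17, this can be done in a portable way: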

#include <new>
alignas(std::hardware_destructive_interference_size) ull oddSum = 0;
alignas(std::hardware_destructive_interference_size) ull evenSum  = 0;
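Not every standard library ships `std::hardware_destructive_interference_size`, even in C++17 mode. One possible way to handle that, sketched below, is to test the feature macro `__cpp_lib_hardware_interference_size` and fall back to a fixed value. The constant name `kSeparation` is introduced here, and the 64-byte fallback is an assumption taken from the x86 figure above.

```cpp
#include <cstddef>
#include <new>

typedef unsigned long long ull;

#ifdef __cpp_lib_hardware_interference_size
// The library provides the constant: use it.
constexpr std::size_t kSeparation = std::hardware_destructive_interference_size;
#else
// Assumption: 64-byte cache lines when the library does not say otherwise.
constexpr std::size_t kSeparation = 64;
#endif

// Each sum starts on its own boundary, so the two cannot share a cache line.
alignas(kSeparation) ull oddSum = 0;
alignas(kSeparation) ull evenSum = 0;
```

The rest of the program from the question can stay unchanged; only the two declarations differ.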

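This solved my problem. I definitely need to read more about this topic. Edit: I also found this post, which explains the issue in more detail: stackoverflow.com/questions/8469427/…

We have had `alignas` since C++11 and `std::hardware_destructive_interference_size` since C++17. You don't need compiler-specific attributes and magic constants.

Which CPU and compiler are you using? And which optimization level?

GCC 4.8, no optimization (which I guess makes the measurements not very meaningful...). With optimization the problem does not occur, because the compiler does not write the variable back to memory on every iteration. Judging by their benchmark numbers, I guess the original poster did not enable optimization either (with optimization enabled, my 10+-year-old Xeon 5150 is much faster than the original numbers).

I'm using g++ 5.1.0, and my Intel i7-8550U has 4 cores. I didn't pass any optimization flag (mainly because I don't know exactly what it does). As I mentioned, I just want to gain some practical experience (learning and understanding why a program behaves a certain way and what the reasons behind it are); my goal is not to make this particular program run as fast as possible. Thank you for your help, it was very helpful.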
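The comment above notes that with optimization the compiler does not write the variable back to memory on every iteration. The same effect can be approximated by hand, as in the sketch below, by accumulating into a local variable and storing to the global only once. The `limit` parameter, the `localSum` variables and the `runBoth` helper are introduced here for illustration; this is a variation on the question's code, not the original poster's program.

```cpp
#include <thread>

typedef unsigned long long ull;

ull oddSum = 0;
ull evenSum = 0;

void addOdd(int limit) {
    ull localSum = 0;            // lives on this thread's own stack
    for (int i = 1; i <= limit; i++) {
        if (i % 2 == 1)
            localSum += i;
    }
    oddSum = localSum;           // a single write to the global
}

void addEven(int limit) {
    ull localSum = 0;
    for (int i = 1; i <= limit; i++) {
        if (i % 2 == 0)
            localSum += i;
    }
    evenSum = localSum;
}

// Hypothetical helper that runs both functions in parallel.
void runBoth(int limit) {
    std::thread t1(addEven, limit);
    std::thread t2(addOdd, limit);
    t1.join();
    t2.join();
}
```

Because each thread touches the globals only once, the two variables can share a cache line without the threads contending for it inside the loop. Aligning the globals and using local accumulators are independent measures and can be combined.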