Why does my code run faster when I use far more threads than CPU cores?

Question

12 views · May 16, 2023

Anonymous · May 16, 2023

I have some timing code that I use to measure how long a given piece of code takes to run:

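#include <array>
#include <chrono>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <locale>
#include <numeric>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Start/end timestamps of one measured run, plus helpers to print them.
struct time_data {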
    std::chrono::steady_clock::time_point start, end;
    auto get_duration() const {
        return end - start;
    }
    void print_data(std::ostream & out) const {
        out << str();
    }
    std::string str() const {
        static const std::locale loc{ "" };
        std::stringstream ss;
        ss.imbue(loc);
        ss << "Start time:  " << std::setw(24) << std::chrono::nanoseconds{ start.time_since_epoch() }.count() << "ns\n";
        ss << "End time:    " << std::setw(24) << std::chrono::nanoseconds{ end.time_since_epoch() }.count() << "ns\n";
        ss << "Duration:    " << std::setw(24) << std::chrono::nanoseconds{ get_duration() }.count() << "ns\n";
        return ss.str();
    }
    friend std::ostream & operator<<(std::ostream & out, const time_data & data) {
        return out << data.str();
    }
};
// Runs func once and records the steady_clock time just before and just after it.
template<typename T>
time_data time_function(T && func) {
    time_data data;
    data.start = std::chrono::steady_clock::now();
    func();
    data.end = std::chrono::steady_clock::now();
    return data;
}
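// Binary operation for the sums: adds to a the number of Collatz steps needed to bring b down to 1.
template<typename T>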
T accumulation_function(T a, T b) {
    T count = 0;
    while (b > 1) {
        if (b % 2 == 0) b /= 2;
        else b = b * 3 + 1;
        ++count;
    }
    return a + count;
}
template<typename IT>
auto std_sum(IT begin, IT end) {
    auto sum = (*begin - *begin);
    sum = std::accumulate(begin, end, sum, accumulation_function<decltype(sum)>);
    return sum;
}
template<typename IT>
auto single_thread_sum(IT begin, IT end) {
    auto sum = (*begin - *begin);
    IT current = begin;
    while (current != end) {
        sum = accumulation_function(sum, *current);
        ++current;
    }
    return sum;
}
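// Splits [begin, end) into N contiguous chunks and sums each chunk on its own thread.
template<typename IT, uint64_t N>
auto N_thread_smart_sum(IT begin, IT end) {
    auto sum = (*begin - *begin);
    std::vector<std::thread> threads;
    std::array<decltype(sum), N> sums;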
    auto dist = std::distance(begin, end);
    for (uint64_t i = 0; i < N; i++) {
        threads.emplace_back([=, &sums] {
            IT first = begin;
            IT last = begin;
            // Same type as the elements, so accumulation_function can deduce a single T.
            decltype(sum) tsum = 0;
            std::advance(first, i * dist / N);
            std::advance(last, (i + 1) * dist / N);
            while (first != last) {
                tsum = accumulation_function(tsum, *first);
                ++first;
            }
            sums[i] = tsum;
        });
    }
    for (std::thread & thread : threads)
        thread.join();
    for (const auto & s : sums) {
        sum += s;
    }
    return sum;
}
template<typename IT>
auto two_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 2>(begin, end);
}
template<typename IT>
auto four_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 4>(begin, end);
}
template<typename IT>
auto eight_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 8>(begin, end);
}
template<typename IT>
auto sixteen_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 16>(begin, end);
}
template<typename IT>
auto thirty_two_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 32>(begin, end);
}
template<typename IT>
auto sixty_four_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 64>(begin, end);
}
int main() {
    std::vector<int64_t> raw_data;
    auto fill_data = time_function([&raw_data] {
        constexpr uint64_t SIZE = 1'000'000'000ull;
        raw_data.resize(SIZE);
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; i++) {
            threads.emplace_back([i, SIZE, &raw_data] {
                uint64_t begin = i * SIZE / 8;
                uint64_t end = (i + 1) * SIZE / 8;
                for (uint64_t index = begin; index < end; index++) {
                    raw_data[index] = begin % (20 + i);
                }
            });
        }
        for (std::thread & t : threads) 
            t.join();
    });
    int64_t sum;
    std::cout << std::setw(25) << "Fill Data" << std::endl;
    std::cout << fill_data << std::endl;
    auto std_data = time_function([&] {
        sum = std_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "STD Sum: " << sum << std::endl;
    std::cout << std_data << std::endl;
    auto single_data = time_function([&] {
        sum = single_thread_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Single Sum: " << sum << std::endl;
    std::cout << single_data << std::endl;
    auto smart_2_data = time_function([&] {
        sum = two_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Two-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_2_data << std::endl;
    auto smart_4_data = time_function([&] {
        sum = four_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Four-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_4_data << std::endl;
    auto smart_8_data = time_function([&] {
        sum = eight_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Eight-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_8_data << std::endl;
    auto smart_16_data = time_function([&] {
        sum = sixteen_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Sixteen-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_16_data << std::endl;
    auto smart_32_data = time_function([&] {
        sum = thirty_two_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Thirty-Two-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_32_data << std::endl;
    auto smart_64_data = time_function([&] {
        sum = sixty_four_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Sixty-Four-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_64_data << std::endl;
    return 0;
}

The program produces the following output:

                Fill Data
Start time:        16,295,979,890,252ns
End time:          16,300,523,805,484ns
Duration:               4,543,915,232ns
                STD Sum: 7750000000
Start time:        16,300,550,212,791ns
End time:          16,313,216,607,890ns
Duration:              12,666,395,099ns
             Single Sum: 7750000000
Start time:        16,313,216,694,522ns
End time:          16,325,774,379,684ns
Duration:              12,557,685,162ns
   Two-Thread-Smart Sum: 7750000000
Start time:        16,325,774,466,014ns
End time:          16,334,441,032,868ns
Duration:               8,666,566,854ns
  Four-Thread-Smart Sum: 7750000000
Start time:        16,334,441,137,913ns
End time:          16,342,188,642,721ns
Duration:               7,747,504,808ns
 Eight-Thread-Smart Sum: 7750000000
Start time:        16,342,188,770,706ns
End time:          16,347,850,908,471ns
Duration:               5,662,137,765ns
Sixteen-Thread-Smart Sum: 7750000000
Start time:        16,347,850,961,597ns
End time:          16,352,187,838,584ns
Duration:               4,336,876,987ns
Thirty-Two-Thread-Smart Sum: 7750000000
Start time:        16,352,187,891,710ns
End time:          16,356,111,411,220ns
Duration:               3,923,519,510ns
Sixty-Four-Thread-Smart Sum: 7750000000
Start time:        16,356,111,471,288ns
End time:          16,359,988,028,812ns
Duration:               3,876,557,524ns

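The first few results aren't surprising: my hand-written accumulation loop takes about as long as `std::accumulate` (on consecutive runs, each of them has come out "fastest", which suggests their implementations are similar). When I switch to two threads and then four, the code gets faster. That makes sense, because I'm using a 4-core Intel processor.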

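The results after that are confusing, though. My CPU has only 4 cores (8 if you count hyperthreading), yet even if I write off the gain at 8 threads as the marginal benefit of hyperthreading, going to 16, 32, and 64 threads still yields further gains. Why is that? Why would extra threads make the code faster when I have already reached the maximum number of threads the CPU can run simultaneously?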

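Note: this is not a duplicate of the linked question, because I'm dealing with a specific use case and specific code, whereas the linked question deals with the general case.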

2 Answers

Anonymous · Answer 1 · 2023-08-20T14:07:55+00:00

Why does my code run faster when I use more threads than CPU cores?

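The cause is that the writes of `tsum` all go into the same `std::array`, `sums`, whose elements occupy contiguous addresses in memory. This leads to cache contention (false sharing), because the writes all touch the same cache lines. With more threads, the writes are spread across more cache lines, so the CPU cores spend less time invalidating and reloading cache lines.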

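The fix is to keep the running total in a local variable `tsum` inside the loop (rather than accumulating through a reference into `sums`), and to write that result to `sums[i]` only once, after the loop has finished.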
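The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the asker's code: `PaddedSlot` and `sum_chunks_local` are hypothetical names, it sums the elements directly instead of calling `accumulation_function`, and the 64-byte cache-line size is an assumption that does not hold on every CPU. It assumes C++17, so that the over-aligned slots are allocated with the requested alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: one result slot per thread, padded to an ASSUMED
// 64-byte cache line so that neighbouring slots do not share a line.
struct alignas(64) PaddedSlot {
    std::int64_t value = 0;
};

// Sums `data` with `n_threads` threads. Each thread accumulates into a local
// variable and writes to its shared slot exactly once, after its loop ends.
inline std::int64_t sum_chunks_local(const std::vector<std::int64_t>& data,
                                     std::size_t n_threads) {
    assert(n_threads > 0);
    std::vector<PaddedSlot> slots(n_threads);
    std::vector<std::thread> threads;
    threads.reserve(n_threads);
    const std::size_t size = data.size();
    for (std::size_t i = 0; i < n_threads; ++i) {
        threads.emplace_back([&data, &slots, i, size, n_threads] {
            // Same contiguous split as in the question: chunk i of n_threads.
            const std::size_t first = i * size / n_threads;
            const std::size_t last = (i + 1) * size / n_threads;
            std::int64_t local = 0;  // private to this thread
            for (std::size_t k = first; k < last; ++k)
                local += data[k];
            slots[i].value = local;  // the only write to shared memory
        });
    }
    for (std::thread& t : threads)
        t.join();
    std::int64_t total = 0;
    for (const PaddedSlot& s : slots)
        total += s.value;
    return total;
}
```

Calling `sum_chunks_local(raw_data, 8)` would correspond to the eight-thread case. Padding the slots is an extra precaution on top of the answer's advice; with only one write per thread, it is unlikely to matter much on its own.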

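Following your suggestion, the multithreaded code got significantly faster, but the behavior is still there: 16, 32, and 64 threads all perform noticeably better than 2, 4, and 8 threads.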

Anonymous · Answer 2 · 2023-07-23T17:50:14+00:00

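This may be because your application's threads make up a larger share of all the active threads on the system, so your application ends up with more time slices. You could check whether the numbers change while other programs are running.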

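If I have a lot of other programs running, every version of the code gets slightly slower, but that doesn't change the behavior that "thread count >>> core count" is faster.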

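A possible fix is to optimize the code and reduce the number of threads so that it matches the number of CPU cores. You could also try using a thread pool to manage the threads, which avoids creating too many of them. In addition, you could try synchronization primitives such as locks, semaphores, and condition variables to reduce contention and conflicts between threads.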
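As a rough illustration of matching the thread count to the hardware, the sketch below starts a fixed number of workers and lets each one repeatedly claim the next block of elements from a shared atomic counter. It is not a full thread pool, and whether it removes the effect described in the question is not established here. `pick_thread_count` and `chunked_sum` are hypothetical names, the fallback of 4 threads is an arbitrary assumption, and the code sums the elements directly instead of calling `accumulation_function`.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: takes the worker count from the hardware hint.
// hardware_concurrency() may return 0 when the value cannot be determined,
// so an (arbitrary, assumed) fallback is used in that case.
inline std::size_t pick_thread_count(std::size_t fallback = 4) {
    assert(fallback > 0);
    const unsigned hint = std::thread::hardware_concurrency();
    return hint != 0 ? hint : fallback;
}

// Sums `data` with a fixed set of workers. Instead of one big chunk per
// thread, each worker keeps claiming the next `chunk` elements until none
// are left.
inline std::int64_t chunked_sum(const std::vector<std::int64_t>& data,
                                std::size_t n_threads, std::size_t chunk) {
    assert(n_threads > 0 && chunk > 0);
    const std::size_t size = data.size();
    // Clamp the step so the shared counter cannot grow far beyond `size`.
    const std::size_t step = std::max<std::size_t>(1, std::min(chunk, size));
    std::atomic<std::size_t> next{0};
    std::atomic<std::int64_t> total{0};
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (std::size_t w = 0; w < n_threads; ++w) {
        workers.emplace_back([&data, &next, &total, step, size] {
            std::int64_t local = 0;  // per-thread partial sum
            for (;;) {
                const std::size_t first = next.fetch_add(step);
                if (first >= size) break;  // no work left
                const std::size_t last = first + std::min(step, size - first);
                for (std::size_t k = first; k < last; ++k)
                    local += data[k];
            }
            total.fetch_add(local);  // one shared update per worker
        });
    }
    for (std::thread& t : workers)
        t.join();
    return total.load();
}
```

A call such as `chunked_sum(raw_data, pick_thread_count(), 1'000'000)` would keep the number of threads at the hardware hint while still dividing the data into many more pieces than there are threads. The block size is a tuning parameter, and the value shown is only an example.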