Why is splitting a string slower in C++ than in Python?

93

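I'm trying to convert some code from Python to C++ in an effort to gain a little speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin turned out to be much faster in Python than in C++ (see yesterday's question). Today I finally figured out how to split a string in C++ with merging delimiters (semantics similar to Python's split()), and I am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude longer, as was the case in yesterday's lesson).

Python code: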

#!/usr/bin/env python
from __future__ import print_function                                            
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print("  Crunch Speed: {0}".format(lps))
else:
    print('')

C++ code:

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

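Note that I tried two different split implementations. One (split1) uses string methods to search for tokens; it merges runs of delimiters and accepts several delimiter characters (it comes from an existing answer). The second (split2) uses getline to read the string as a stream, does not merge delimiters, and supports only a single delimiter character (that one was posted by several StackOverflow users in answers to string-splitting questions).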
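To make the difference in semantics concrete, here is a minimal sketch of both approaches applied to the same input. The names split_merge and split_getline are made up for this illustration; they follow split1 and split2 above but return the token vector instead of filling one passed in.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Illustrative only: mirrors split1 (merges runs of delimiters).
std::vector<std::string> split_merge(const std::string& str,
                                     const std::string& delimiters = " ") {
    std::vector<std::string> tokens;
    // Position of the first character that is not a delimiter.
    std::string::size_type start = str.find_first_not_of(delimiters, 0);
    while (start != std::string::npos) {
        // The token runs up to the next delimiter (or the end of the string).
        std::string::size_type end = str.find_first_of(delimiters, start);
        tokens.push_back(str.substr(start, end - start));
        // Skip the whole run of delimiters that follows.
        start = str.find_first_not_of(delimiters, end);
    }
    return tokens;
}

// Illustrative only: mirrors split2 (one token per delimiter, no merging).
std::vector<std::string> split_getline(const std::string& str, char delim = ' ') {
    std::vector<std::string> tokens;
    std::istringstream ss(str);
    std::string item;
    while (std::getline(ss, item, delim)) {
        tokens.push_back(item);
    }
    return tokens;
}
```

For the input "a  b" (two spaces), split_merge yields two tokens, while split_getline yields three, the middle one being empty. Only the first matches what Python's split() does with no arguments.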

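I ran each program multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, quad core), not that it matters much. I'm testing with a 20M-line text file with three space-separated columns, each line looking similar to this: "foo.bar 127.0.0.1 home.foo.bar"

Results: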

$ /usr/bin/time cat test_lines_double | ./split.py
       15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
       23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
       44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

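What am I doing wrong? Is there a better way to split strings in C++ that does not rely on external libraries (i.e. no Boost), supports merging sequences of delimiters (like Python's split), is thread safe (so no strtok), and performs at least on par with Python?

Edit 1 / Partial solution?:

I tried to make the comparison fairer by having Python reset the dummy list and append to it on every line, in imitation of the C++ code. This still isn't exactly what the C++ code is doing, but it is a bit closer. Basically, the loop is now: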

for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1

Python's performance is now about the same as that of the split1 C++ implementation.

/usr/bin/time cat test_lines_double | ./split5.py
       22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

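I am still surprised that, even if Python is so well optimized for string processing (as Matt Joiner suggested), these C++ implementations are not faster. If anyone has ideas about how to do this in a more optimal way in C++, please share your code. (I think my next step will be to try implementing this in pure C, although I'm not going to trade off programmer productivity to re-implement my whole project in C, so this will just be an experiment in string-splitting speed.)

Thanks to all for your help.

Final edit / solution:

Please see Alf's accepted answer. Since Python deals with strings strictly by reference while STL strings are often copied, the vanilla Python implementation performs better. For comparison, I compiled Alf's code and ran my data through it. Here is the performance on the same machine as all the other runs; it is essentially identical to the naive Python implementation (and faster than the Python implementation that resets and appends to the list, shown in the edit above):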

$ /usr/bin/time cat test_lines_double | ./split6
       15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

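My only small remaining gripe concerns the amount of code needed to get C++ to perform well in this case.

One of the lessons from this issue and from yesterday's stdin line-reading issue (linked above) is that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.

Thanks again to all for your suggestions!

— JJC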

2

How did you compile your C++ program? Do you have optimizations turned on?

— interjay, 2012

2

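@interjay: It's in the last comment of his source: g++ -Wall -O3 -o split1 split_1.cpp. @JJC: How does your benchmark fare when you actually use dummy and spline, respectively? Maybe Python removes the call to line.split() because it has no side effects?

— Eric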

2

What results do you get if you remove the splitting and leave only the reading of lines from stdin?

— interjay, 2012

2

Python is written in C. That means there is an efficient way of doing this in C. Maybe there is a better way to split a string than using the STL?

— ixe013

3

Why do std::string operations perform poorly?

— Matt Joiner

57

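As a guess, Python strings are reference-counted immutable strings, so no strings are copied around in the Python code, while C++ std::string is a mutable value type and is copied at the smallest opportunity.

If the goal is fast splitting, then one would use constant-time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).

The C++ std::string class has one redeeming feature, though: it is standard, so it can be used to pass strings around safely and conveniently where efficiency is not a main consideration. But enough chat. Here is the code, and on my machine it is of course faster than Python, since Python's string handling is implemented in C, which is a subset of C++ (he he):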

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

class StringRef
{
private:
    char const*     begin_;
    int             size_;

public:
    int size() const { return size_; }
    char const* begin() const { return begin_; }
    char const* end() const { return begin_ + size_; }

    StringRef( char const* const begin, int const size )
        : begin_( begin )
        , size_( size )
    {}
};

vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
    vector<StringRef>   result;

    enum State { inSpace, inToken };

    State state = inSpace;
    char const*     pTokenBegin = 0;    // Init to satisfy compiler.
    for( auto it = str.begin(); it != str.end(); ++it )
    {
        State const newState = (*it == delimiter? inSpace : inToken);
        if( newState != state )
        {
            switch( newState )
            {
            case inSpace:
                result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                break;
            case inToken:
                pTokenBegin = &*it;
            }
        }
        state = newState;
    }
    if( state == inToken )
    {
        // One-past-the-end pointer computed without dereferencing str.end().
        result.push_back( StringRef( pTokenBegin, str.data() + str.size() - pTokenBegin ) );
    }
    return result;
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        //spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        //split2(spline, input_line);

        vector<StringRef> const v = split3( input_line );
        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x

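Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, only checked the speed. But I think that even if there is a bug or two, correcting it won't significantly affect the speed.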
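For readers on a newer compiler, the same non-owning idea can be sketched with std::string_view instead of a hand-written StringRef. This is an illustration under the assumption that C++17 is available, not the code that was benchmarked above; split_views is a made-up name.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Sketch assuming C++17. The returned views point into the memory behind
// `str`, so they are valid only while the original string is alive and
// unmodified. Runs of the delimiter are merged, as in Python's split().
std::vector<std::string_view> split_views(std::string_view str, char delimiter = ' ') {
    std::vector<std::string_view> result;
    std::size_t pos = 0;
    while (pos < str.size()) {
        // Skip a run of delimiters.
        pos = str.find_first_not_of(delimiter, pos);
        if (pos == std::string_view::npos) break;
        // The token ends at the next delimiter or at the end of the input.
        std::size_t end = str.find(delimiter, pos);
        if (end == std::string_view::npos) end = str.size();
        result.push_back(str.substr(pos, end - pos));  // no characters are copied
        pos = end;
    }
    return result;
}
```

A token that has to outlive the input line can still be copied explicitly, for example with std::string(view), so the copy is paid only where it is needed.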

— Cheers and hth. - Alf

2

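Yes, Python strings are reference-counted objects, so Python does far less copying. Under the hood they still contain NUL-terminated C strings, though, not (pointer, size) pairs like your code.

— Fred Foo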

13

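In other words: for higher-level work such as text manipulation, stick to a higher-level language, where dozens of developers have put decades of cumulative effort into doing it efficiently. Otherwise, be prepared to work as hard as all those developers did to get something comparable at a lower level.

— jsbueno, 2012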

2

@JJC: For the StringRef, you can copy the substring to a std::string very simply: just string( sr.begin(), sr.end() ).

— Cheers and hth. - Alf

3

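I wish CPython copied strings less. Yes, they are reference counted and immutable, but str.split() allocates a new string for each item using PyString_FromStringAndSize(), which calls PyObject_MALLOC(). So there is no shared-representation optimization that exploits the fact that strings are immutable in Python.

— 2012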

3

Maintainers: please don't introduce bugs by trying to fix known bugs (especially not with reference to cplusplus.com). TIA.

— Cheers and hth. - Alf, 2015

9

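I'm not offering a better solution (at least not performance-wise), but some additional data that could be interesting.

Using strtok_r (the reentrant variant of strtok):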

void splitc1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    char *saveptr;
    char *cpy, *token;

    cpy = (char*)malloc(str.size() + 1);
    strcpy(cpy, str.c_str());

    for(token = strtok_r(cpy, delimiters.c_str(), &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters.c_str(), &saveptr)) {
        tokens.push_back(string(token));
    }

    free(cpy);
}

Additionally, using C character strings for the parameters and fgets for input:

void splitc2(vector<string> &tokens, const char *str,
        const char *delimiters) {
    char *saveptr;
    char *cpy, *token;

    cpy = (char*)malloc(strlen(str) + 1);
    strcpy(cpy, str);

    for(token = strtok_r(cpy, delimiters, &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters, &saveptr)) {
        tokens.push_back(string(token));
    }

    free(cpy);
}

And, for cases where destroying the input string is acceptable:

void splitc3(vector<string> &tokens, char *str,
        const char *delimiters) {
    char *saveptr;
    char *token;

    for(token = strtok_r(str, delimiters, &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters, &saveptr)) {
        tokens.push_back(string(token));
    }
}

The timings for these are as follows (including my results for the other variants from the question and the accepted answer):

split1.cpp:  C++   : Saw 20000000 lines in 31 seconds.  Crunch speed: 645161
split2.cpp:  C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444
split.py:    Python: Saw 20000000 lines in 33 seconds.  Crunch Speed: 606060
split5.py:   Python: Saw 20000000 lines in 35 seconds.  Crunch Speed: 571428
split6.cpp:  C++   : Saw 20000000 lines in 18 seconds.  Crunch speed: 1111111

splitc1.cpp: C++   : Saw 20000000 lines in 27 seconds.  Crunch speed: 740740
splitc2.cpp: C++   : Saw 20000000 lines in 22 seconds.  Crunch speed: 909090
splitc3.cpp: C++   : Saw 20000000 lines in 20 seconds.  Crunch speed: 1000000

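As we can see, the solution from the accepted answer is still the fastest.

For anyone who wants to do further tests, I also put up a Github repository with all the programs from the question, the accepted answer and this answer, plus a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting.

— tobbez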

2

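I made a pull request (github.com/tobbez/string-splitting/pull/2) that makes the test a little more realistic by "using" the data (counting the number of words and characters). With this change, all the C/C++ versions beat the Python versions (except for the one based on Boost's tokenizer that I added), and the real value of the "string view" based methods (like that of split6) shows.

— Dave Johansen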

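You should use memcpy, not strcpy, in case the compiler fails to notice that optimization. strcpy typically uses a slower startup strategy that strikes a balance between being fast for short strings and ramping up to full SIMD for long strings. memcpy knows the size right away and doesn't have to use any SIMD tricks to check for the end of an implicit-length string. (Not a big deal on modern x86.) Creating the std::string objects with the (char*, len) constructor could be faster too, if you can get the length out of saveptr-token. Obviously it would be fastest to just store the char* tokens :P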
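One way to picture the (char*, len) idea is the sketch below. It is not the strtok_r code from the answer: it swaps in std::strspn and std::strcspn so that the token length is known before the std::string is built and the input never has to be copied or modified. The name split_cstr is hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper, shown only to illustrate constructing tokens from a
// (pointer, length) pair. It treats the input as a C string, so an embedded
// NUL character would end the scan early.
std::vector<std::string> split_cstr(const std::string& str, const char* delimiters = " ") {
    std::vector<std::string> tokens;
    const char* p = str.c_str();
    while (true) {
        p += std::strspn(p, delimiters);                 // skip a run of delimiters
        if (*p == '\0') break;                           // nothing left
        std::size_t len = std::strcspn(p, delimiters);   // length of this token
        tokens.emplace_back(p, len);                     // (char*, len) constructor
        p += len;
    }
    return tokens;
}
```

Because no shared state is involved, this is also usable from several threads at once, which was one of the requirements in the question.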

— Peter Cordes

4

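I suspect this is because of the way std::vector gets resized during push_back() calls. If you try using std::list, or std::vector::reserve() to reserve enough space for the sentences, you should get much better performance. Or you could use a combination of both, as below for split1():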

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    list<string> token_list;

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the list
        token_list.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
    tokens.assign(token_list.begin(), token_list.end());
}

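EDIT: The other obvious thing I see is that the Python variable dummy gets assigned each time but is never modified, so it's not a fair comparison against C++. You should try modifying your Python code to initialize it with dummy = [] and then do dummy += line.split(). Can you report the runtime after this?

EDIT2: To make it even fairer, you can change the while loop in the C++ code to: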

    while(cin) {
        getline(cin, input_line);
        std::vector<string> spline; // create a new vector

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

— 猎鹰猎鹰

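Thanks for the idea. I implemented it, and unfortunately this implementation is actually slower than the original split1. I also tried spline.reserve(16) before the loop, but that had no impact on my split1's speed. There are only three tokens per line, and the vector is cleared after each line, so I didn't expect it to help much.

— JJC '02

I tried your edit as well. Please see the updated question. Performance is now on par with split1.

— JJC

I tried your EDIT2. The performance was a bit worse: $ /usr/bin/time cat test_lines_double | ./split7 33.39 real 0.01 user 0.49 sys C++ : Saw 20000000 lines in 33 seconds. Crunch speed: 606060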

— JJC

3

I think the following code is better; it uses some C++17 and C++14 features:

// These codes are un-tested when I write this post, but I'll test it
// When I'm free, and I sincerely welcome others to test and modify this
// code.

// C++17
#include <istream>     // For std::istream.
#include <string_view> // new feature in C++17, sizeof(std::string_view) == 16 in libc++ on my x86-64 debian 9.4 computer.
#include <string>
#include <utility>     // std::move (available since C++11).

template <template <class...> class Container, class Allocator>
void split1(Container<std::string_view, Allocator> &tokens, 
            std::string_view str,
            std::string_view delimiter = " ") 
{
    /* 
     * The model of the input string:
     *
     * (optional) delimiter | content | delimiter | content | delimiter| 
     * ... | delimiter | content 
     *
     * Using std::string::find_first_not_of or 
     * std::string_view::find_first_not_of is a bad idea, because it 
     * actually does the following thing:
     * 
     *     Finds the first character not equal to any of the characters 
     *     in the given character sequence.
     * 
     * Which means it does not treat your delimiters as a whole, but as
     * a group of characters.
     * 
     * This has 2 effects:
     *
     *  1. When your delimiters is not a single character, this function
     *  won't behave as you predicted.
     *
     *  2. When your delimiters is just a single character, the function
     *  may have an additional overhead due to the fact that it has to 
     *  check every character with a range of characters, although 
     * there's only one, but in order to assure the correctness, it still 
     * has an inner loop, which adds to the overhead.
     *
     * So, as a solution, I wrote the following code.
     *
     * The code below will skip the first delimiter prefix.
     * However, if there's nothing between 2 delimiter, this code'll 
     * still treat as if there's sth. there.
     *
     * Note: 
     * Here I use C++ std version of substring search algorithm, but u
     * can change it to Boyer-Moore, KMP(takes additional memory), 
     * Rabin-Karp and other algorithm to speed your code.
     * 
     */

    // Establish the loop invariant 1.
    typename std::string_view::size_type 
        next, 
        delimiter_size = delimiter.size(),  
        pos = str.find(delimiter) ? 0 : delimiter_size;

    // The loop invariant:
    //  1. At pos, it is the content that should be saved.
    //  2. The next pos of delimiter is stored in next, which could be 0
    //  or std::string_view::npos.

    do {
        // Find the next delimiter, maintain loop invariant 2.
        next = str.find(delimiter, pos);

        // Found a token, add it to the container. substr takes a count,
        // not an end position; a count past the end is clamped, so
        // next == npos yields the rest of the string.
        tokens.push_back(str.substr(pos, next - pos));

        // Skip the delimiter, maintain loop invariant 1.
        //
        // When next == std::string_view::npos the sum wraps around
        // (unsigned arithmetic, so no undefined behavior), but the loop
        // terminates before pos is used again.
        pos = next + delimiter_size;
    } while(next != std::string_view::npos);
}   

template <template <class...> class Container, class traits, class Allocator2, class Allocator>
void split2(Container<std::basic_string<char, traits, Allocator2>, Allocator> &tokens, 
            std::istream &stream,
            char delimiter = ' ')
{
    std::basic_string<char, traits, Allocator2> item;

    // Unfortunately, std::getline can only accept a single-character 
    // delimiter.
    while(std::getline(stream, item, delimiter))
        // Move item into token. I haven't checked whether item can be 
        // reused after being moved.
        tokens.push_back(std::move(item));
}

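Choice of container:

1. std::vector.

   Assuming the initial size of the allocated internal array is 1 and the final size is N, you will allocate and deallocate log2(N) times, and you will copy (2^(log2(N) + 1) - 1) = (2N - 1) times. As pointed out in "Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?", this can perform poorly when the size of the vector is unpredictable and could be very large. But if you can estimate its size, this is less of a problem.

2. std::list.

   Every push_back takes constant time, but an individual push_back will probably take longer than on std::vector. Using a per-thread memory pool and a custom allocator can ease this problem.

3. std::forward_list.

   Same as std::list, but it occupies less memory per element. It requires a wrapper class to work, because it lacks a push_back API.

4. std::array.

   If you know the limit of growth, you can use std::array. Of course, you can't use it directly, since it has no push_back API. But you can define a wrapper; I think this is the fastest way, and it can save some memory if your estimate is quite accurate.

5. std::deque.

   This option lets you trade memory for performance. There will be no (2^(N + 1) - 1) copies of elements, just N allocations and no deallocation. You also get constant random access time and the ability to add new elements at both ends.

According to std::deque on cppreference:

On the other hand, deques typically have a large minimal memory cost; a deque holding just one element has to allocate its full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++).

Or you can use a combination of these: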

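1. std::vector< std::array<T, 2 ^ M> >

   This is similar to std::deque, except that this container does not support adding elements at the front. However, because it does not copy the underlying std::array (2^(N + 1) - 1) times, it copies only (2^(N - M + 1) - 1) times, allocating a new array only when the current one is full and never needing to deallocate anything. You also get constant random access time.

2. std::list< std::array<T, ...> >

   Greatly eases the pressure of memory fragmentation. It allocates a new array only when the current one is full and never needs to copy anything. You still pay for an additional pointer compared with combination 1.

3. std::forward_list< std::array<T, ...> >

   Same as 2, but with the same memory cost as combination 1.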
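As a rough illustration of combination 1, here is a minimal sketch of a chunked container. It is an assumption-laden toy, not a drop-in replacement: the name ChunkedVector and its interface are invented here, it stores each chunk behind a std::unique_ptr (a deviation from the literal std::vector< std::array<...> >, chosen so that growing the outer vector moves only pointers), and it requires T to be default-constructible and C++14 for std::make_unique.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical wrapper: fixed-size chunks that never move once allocated.
template <class T, std::size_t ChunkSize = 256>
class ChunkedVector {
public:
    void push_back(const T& value) {
        // Allocate a new chunk only when every existing chunk is full.
        if (size_ == chunks_.size() * ChunkSize) {
            chunks_.push_back(std::make_unique<std::array<T, ChunkSize>>());
        }
        (*chunks_[size_ / ChunkSize])[size_ % ChunkSize] = value;
        ++size_;
    }

    // Constant-time random access: pick the chunk, then the slot inside it.
    const T& operator[](std::size_t i) const {
        return (*chunks_[i / ChunkSize])[i % ChunkSize];
    }

    std::size_t size() const { return size_; }
    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::vector<std::unique_ptr<std::array<T, ChunkSize>>> chunks_;
    std::size_t size_ = 0;
};
```

With ChunkedVector<int, 4>, pushing ten values allocates three chunks; the stored elements themselves are never copied when the container grows.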

— JiaHao Xu

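If you use a std::vector with some reasonable initial size, like 128 or 256 (assuming a growth factor of 2), you avoid copying entirely for sizes up to that limit. You can then shrink the allocation to fit the number of elements you actually used, so it isn't terrible for small inputs. This doesn't help much with the total number of copies for very large N, though. It's too bad std::vector can't use realloc to potentially allow mapping more pages at the end of the current allocation, so it's about 2x slower.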
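The effect of an up-front reserve can be observed directly by counting how often capacity() changes. The helper below is a hypothetical measuring aid, not part of any code in this thread; the exact count without a reserve depends on the standard library's growth factor, so only the "reserved" case has a guaranteed result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how often capacity() changes while appending
// `n` ints, after an optional up-front reserve of `reserve_hint` elements.
std::size_t count_reallocations(std::size_t n, std::size_t reserve_hint) {
    std::vector<int> v;
    if (reserve_hint > 0) v.reserve(reserve_hint);
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {   // the buffer was replaced
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

count_reallocations(100, 128) returns 0, because the reserved capacity already covers all 100 elements. Calling v.shrink_to_fit() afterwards is a non-binding request to give the unused space back.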

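— Peter Cordes

Is stringview::remove_prefix as cheap as just keeping track of your current position in a normal string? std::basic_string::find has an optional second argument, pos = 0, that lets you start searching from an offset.

— Peter Cordes

@Peter Cordes That is correct. I checked the libcxx implementation.

— JiaHao Xu 18-4-29

I also checked the libstdc++ implementation, and it is the same.

— JiaHao Xu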

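Your analysis of vector's performance is off. Consider a vector that has an initial capacity of 1 on first insertion and doubles each time it needs more capacity. If you need to put in 17 items, the first allocation makes room for 1, then 2, then 4, then 8, then 16, and finally 32. That is 6 allocations in total (log2(size - 1) + 2, using an integer logarithm). The first allocation moved 0 strings, the second moved 1, then 2, then 4, then 8, and finally 16, for a total of 31 moves (2^(log2(size - 1) + 1) - 1). This is O(n), not O(2^n). It will greatly outperform std::list.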
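The arithmetic in this comment can be checked with a small simulation. The sketch below models the idealized vector described above (capacity 1 on first insertion, doubling afterwards) rather than any real std::vector, whose growth factor is implementation-defined; the names GrowthCost and doubling_growth_cost are invented for the example.

```cpp
#include <cstddef>

// Result of the simulation: number of buffer allocations and element moves.
struct GrowthCost {
    std::size_t allocations;
    std::size_t moves;
};

// Simulates appending n elements to a buffer whose capacity starts at 1 on
// the first insertion and doubles whenever it is full.
GrowthCost doubling_growth_cost(std::size_t n) {
    GrowthCost cost{0, 0};
    std::size_t size = 0;
    std::size_t capacity = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {
            capacity = (capacity == 0) ? 1 : capacity * 2;
            ++cost.allocations;
            cost.moves += size;   // every existing element moves to the new buffer
        }
        ++size;
    }
    return cost;
}
```

For n = 17 this gives 6 allocations and 0 + 1 + 2 + 4 + 8 + 16 = 31 moves, matching the figures in the comment and staying below 2n.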

— David Stone

2

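You're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?

— Matt Joiner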

4

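I'm not making any claims about overall language performance, only about my particular code, so there are no assumptions here. Thanks for the good pointer to the other question. I'm not sure whether you're saying that this particular C++ implementation is suboptimal (your first sentence) or that C++ is simply slower than Python at string processing (your second sentence). Also, if you know a fast way to do what I'm trying to do in C++, please share it for everyone's benefit. Thanks. Just to clarify: I love Python, but I'm no blind fanboy, which is why I'm trying to learn the fastest way to do this.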

— JJC

1

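@JJC: Given that Python's implementation is faster, I'd say yours is suboptimal. Keep in mind that language implementations can cut corners for you, but ultimately algorithmic complexity and hand optimization win out. In this case, Python has the upper hand for this use case by default.

— Matt Joiner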

2

If you take the split1 implementation and change its signature to match split2's more closely, by changing this:

void split1(vector<string> &tokens, const string &str, const string &delimiters = " ")

to this:

void split1(vector<string> &tokens, const string &str, const char delimiters = ' ')

you get a more dramatic difference between split1 and split2, and a fairer comparison:

split1  C++   : Saw 10000000 lines in 41 seconds.  Crunch speed: 243902
split2  C++   : Saw 10000000 lines in 144 seconds.  Crunch speed: 69444
split1' C++   : Saw 10000000 lines in 33 seconds.  Crunch speed: 303030
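The answer shows only the changed signature. Assuming the body of split1 is otherwise left as in the question, the modified function might look like the sketch below; it is renamed split1_char here purely so that it cannot be confused with the original. The single-character overloads of find_first_not_of and find_first_of are what the calls resolve to.

```cpp
#include <string>
#include <vector>

// Sketch: split1 from the question with a single-character delimiter.
// The body is assumed unchanged apart from the parameter type.
void split1_char(std::vector<std::string>& tokens, const std::string& str,
                 const char delimiter = ' ') {
    // Skip delimiters at the beginning.
    std::string::size_type lastPos = str.find_first_not_of(delimiter, 0);
    // Find the first delimiter after the token start.
    std::string::size_type pos = str.find_first_of(delimiter, lastPos);

    while (std::string::npos != pos || std::string::npos != lastPos) {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip the run of delimiters.
        lastPos = str.find_first_not_of(delimiter, pos);
        // Find the end of the next token.
        pos = str.find_first_of(delimiter, lastPos);
    }
}
```

The output is the same as for split1 with " " as the delimiter string; only the search inside the standard library gets simpler, since it compares against one character instead of a set.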

— 保罗·贝克汉姆

1

void split5(vector<string> &tokens, const string &str, char delim=' ') {

    enum { do_token, do_delim } state = do_delim;
    int idx = 0, tok_start = 0;
    for (string::const_iterator it = str.begin() ; ; ++it, ++idx) {
        switch (state) {
            case do_token:
                if (it == str.end()) {
                    tokens.push_back (str.substr(tok_start, idx-tok_start));
                    return;
                }
                else if (*it == delim) {
                    state = do_delim;
                    tokens.push_back (str.substr(tok_start, idx-tok_start));
                }
                break;

            case do_delim:
                if (it == str.end()) {
                    return;
                }
                if (*it != delim) {
                    state = do_token;
                    tok_start = idx;
                }
                break;
        }
    }
}

— n.m.

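Thanks n.m.! Unfortunately, this seems to run at about the same speed as the original (split1) implementation on my data set and machine: $ /usr/bin/time cat test_lines_double | ./split8 21.89 real 0.01 user 0.47 sys C++ : Saw 20000000 lines in 22 seconds. Crunch speed: 909090

— JJC

On my machine: split1 - 54s, split.py - 35s, split5 - 16s. I have no idea.

— n.m.

Hmm, does your data match the format I described above? I assume you ran each program several times to eliminate transient effects such as the initial population of the disk cache?

— JJC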

0

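I suspect this is related to buffering on sys.stdin in Python, with no buffering in the C++ implementation.

See this post for details on how to change the buffer size, then try the comparison again: Setting smaller buffer size for sys.stdin?

— Alex Collins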

1

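Hmm... I don't follow. Just reading lines (without the split) is faster in C++ than in Python (once the cin.sync_with_stdio(false); line is included). That's the issue I had yesterday, referenced above.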

— JJC '02
