Bentley's coding challenge: the k most common words


18

This is perhaps one of the classic coding challenges, dating back to 1986, when columnist Jon Bentley asked Donald Knuth to write a program that finds the k most common words in a file. Knuth implemented a fast solution using hash tries in an 8-page program, to illustrate his literate programming technique. Douglas McIlroy of Bell Labs criticized Knuth's solution as not even being able to process the full text of the Bible, and replied with this short one-liner, which gets the job done:

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q

In 1987, a follow-up article was published with yet another solution, this time by a Princeton professor. But it couldn't even return a result for a single Bible!

Problem description

The original problem statement:

Given a text file and an integer k, print the k most common words in the file (and their counts of occurrences) in decreasing frequency.

Further clarifications:

  • Knuth defined a word as a run of Latin letters: [A-Za-z]+ (see the sketch after this list)
  • All other characters are ignored
  • Uppercase and lowercase letters are considered equivalent (WoRd == word)
  • No limit on file size or word length
  • Gaps between consecutive words can be arbitrarily large
  • The fastest program is the one that uses the least total CPU time (multithreading probably won't help)
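
To make these rules concrete, here is a minimal Python sketch of the tokenization they imply (the regex and helper name are mine, not part of the challenge). Note that under [A-Za-z]+ a contraction such as "it's" yields the two words "it" and "s", which is why a bare "s" appears among the top words of Ulysses further down.

import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")  # Knuth's definition of a word

def count_words(text):
    # lowercase first, so WoRd and word land in the same bucket
    return Counter(WORD.findall(text.lower()))

print(count_words("WoRd word; it's 42 -- word"))
# Counter({'word': 3, 'it': 1, 's': 1})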

样本测试用例

Test 1: Ulysses by James Joyce concatenated 64 times (96 MB file).

  • Download Ulysses from Project Gutenberg: wget http://www.gutenberg.org/files/4300/4300-0.txt
  • Concatenate it 64 times: for i in {1..64}; do cat 4300-0.txt >> ulysses64; done
  • The most common word is "the" with 968832 occurrences.

Test 2: Specially generated random text, giganovel (about 1 GB).

  • Python 3 generator script here.
  • The text contains 148391 distinct words appearing with natural-language-like frequencies.
  • Most common words: "e" (11309 occurrences) and "ihit" (11290 occurrences).

Generality test: arbitrarily large words, arbitrarily large gaps.

Reference implementations

After looking into Rosetta Code for this problem and realizing that many implementations are incredibly slow (slower than the shell script!), I tested some good implementations here. Below is their performance on ulysses64, together with time complexity:

                                     ulysses64 (s)  Time complexity
C++ (prefix trie + heap)             4.145          O((N + k) log k)
Python (Counter)                     10.547         O(N + k log Q)
AWK + sort                           20.606         O(N + Q log Q)
McIlroy (tr + sort + uniq)           43.554         O(N log N)
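
The Python (Counter) reference program itself is not shown here; a minimal sketch of that approach (my reconstruction, not the actual reference code; the file handling in particular is an assumption) could look like this:

import re
import sys
from collections import Counter

def top_k(path, k):
    # one pass to tokenize and count; Counter.most_common(k) then selects
    # the top k via a heap (heapq.nlargest) rather than a full sort
    text = open(path, encoding="utf-8", errors="ignore").read().lower()
    return Counter(re.findall(r"[a-z]+", text)).most_common(k)

if __name__ == "__main__":
    for word, count in top_k(sys.argv[1], int(sys.argv[2])):
        print(count, word)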

Can you beat that?

Testing

Performance will be evaluated using a 2017 13-inch MacBook Pro with the standard Unix time command ("user" time). Please use modern compilers where possible (e.g., the latest Haskell version rather than a legacy one).

Rankings so far

Timings in seconds, including the reference programs:

                                              k=10                  k=100K
                                     ulysses64      giganovel      giganovel
C++ (trie) by ShreevatsaR            0.671          4.227          4.276
C (trie + bins) by Moogie            0.704          9.568          9.459
C (trie + list) by Moogie            0.767          6.051          82.306
C++ (hash trie) by ShreevatsaR       0.788          5.283          5.390
C (trie + sorted list) by Moogie     0.804          7.076          x
Rust (trie) by Anders Kaseorg        0.842          6.932          7.503
J by miles                           1.273          22.365         22.637
C# (trie) by recursive               3.722          25.378         24.771
C++ (trie + heap)                    4.145          42.631         72.138
APL (Dyalog Unicode) by Adám         7.680          x              x
Python (dict) by movatica            9.387          99.118         100.859
Python (Counter)                     10.547         102.822        103.930
Ruby (tally) by daniero              15.139         171.095        171.551
AWK + sort                           20.606         213.366        222.782
McIlroy (tr + sort + uniq)           43.554         715.602        750.420

Cumulative ranking* (%, best possible score: 300):

#     Program                         Score  Generality
 1  C++ (trie) by ShreevatsaR           300     Yes
 2  C++ (hash trie) by ShreevatsaR      368      x
 3  Rust (trie) by Anders Kaseorg       465     Yes
 4  C (trie + bins) by Moogie           552      x
 5  J by miles                         1248     Yes
 6  C# (trie) by recursive             1734      x
 7  C (trie + list) by Moogie          2182      x
 8  C++ (trie + heap)                  3313      x
 9  Python (dict) by movatica          6103     Yes
10  Python (Counter)                   6435     Yes
11  Ruby (tally) by daniero           10316     Yes
12  AWK + sort                        13329     Yes
13  McIlroy (tr + sort + uniq)        40970     Yes

*Time performance relative to the best program in each of the three tests, summed.
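
The footnote seems to mean that, for each of the three timed tests, a program scores its time divided by the best time on that test, times 100; a program that is fastest everywhere therefore scores 300. A small sketch of that reading (my interpretation, but it reproduces the table, e.g. 465 for the Rust entry):

def cumulative_score(times, best):
    # times, best: per-test seconds for this program and for the fastest
    # program on each test; 100 per test, so the best possible score is 300
    return sum(t / b * 100 for t, b in zip(times, best))

print(round(cumulative_score([0.842, 6.932, 7.503],    # Rust (trie)
                             [0.671, 4.227, 4.276])))  # best times per test
# 465, matching the table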

Best program so far: here (the second solution)


Is the score just the time on Ulysses? It seems implied, but it isn't stated explicitly
Post Rock Garf Hunter

@SriotchilismO'Zaic, for now, yes. But you shouldn't rely on the first test case, because larger test cases may follow. An obvious drawback of ulysses64 is its repetitiveness: no new words appear after 1/64 of the file. So it isn't a very good test case, but it's easy to distribute (or reproduce).
Andriy Makukha

3
I mean the hidden test cases you talked about earlier. If you post their hashes now and reveal the actual texts later, we can be sure the judging is fair to the answers and that you're not a kingmaker. Although I suppose a hash for Ulysses is of some use too.
Post Rock Garf Hunter

1
@tsh That's my understanding: e.g., two words e and g
Moogie

1
@AndriyMakukha Ah, thanks. That was just a bug; I've fixed it.
Anders Kaseorg

Answers:


5

[C]

Runs test 1 in under 1.6 seconds on my 2.8 GHz Xeon W3530. Built using MinGW.org GCC-6.3.0-1 on Windows 7:

It takes two arguments as input (the path to the text file, and k, the number of most frequent words to list).

It simply creates a trie branching on the letters of words, and at the leaf letters it increments a counter. It then checks whether the current leaf counter is greater than the smallest counter in the list of most frequent words (the list size is the number given on the command line). If so, the word represented by the leaf letter is promoted to one of the most frequent words. All this repeats until no more letters are read. After that, the list of most frequent words is output via an inefficient iterated search for the next most frequent word in the list.

Processing time is currently output by default, but for consistency with other submissions, disable the TIMING define in the source code.

Also, I submitted this from my work computer and couldn't download the test 2 text. It should work with test 2 without modification, but the MAX_LETTER_INSTANCES value may need to be increased.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

// increase this if needing to output more top frequent words
#define MAX_TOP_FREQUENT_WORDS 1000

#define false 0
#define true 1
#define null 0

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    char mostFrequentWord;
    struct Letter* parent;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0 || k> MAX_TOP_FREQUENT_WORDS)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n");
        printf("NOTE: upto %d most frequent words can be requested\n\n",MAX_TOP_FREQUENT_WORDS);
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
    fptr = fopen(argv[1], "rb");
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    if (fptr) fclose(fptr);

    unsigned int chr;
    unsigned int i;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], 0, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;

    // pseudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;
    root->mostFrequentWord = false;
    root->count = 0;

    // the next letter to be processed
    Letter* nextLetter = null;

    // store of the top most frequent words
    Letter* topWords[MAX_TOP_FREQUENT_WORDS];

    // initialise the top most frequent words
    for (i = 0; i<k; i++)
    {
        topWords[i]=root;
    }

    unsigned int lowestWordCount = 0;
    unsigned int lowestWordIndex = 0;
    unsigned int highestWordCount = 0;
    unsigned int highestWordIndex = 0;

    // main loop
    for (int j=0;j<dataLength;j++)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, initialise the new letter
            if (nextLetter == null)
            {
                nextLetter = &letters[letterMasterIndex++];
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        // not a letter so this means the current letter is the last letter of a word (if any letters)
        else if (currentLetter!=root)
        {

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // ignore this word if already identified as a most frequent word
            if (!currentLetter->mostFrequentWord)
            {
                // update the list of most frequent words
                // by replacing the most infrequent top word if this word is more frequent
                if (currentLetter->count> lowestWordCount)
                {
                    currentLetter->mostFrequentWord = true;
                    topWords[lowestWordIndex]->mostFrequentWord = false;
                    topWords[lowestWordIndex] = currentLetter;
                    lowestWordCount = currentLetter->count;

                    // update the index and count of the next most infrequent top word
                    for (i=0;i<k; i++)
                    {
                        // if the topword  is root then it can immediately be replaced by this current word, otherwise test
                        // whether the top word is less than the lowest word count
                        if (topWords[i]==root || topWords[i]->count<lowestWordCount)
                        {
                            lowestWordCount = topWords[i]->count;
                            lowestWordIndex = i;
                        }
                    }
                }
            }

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

    // print out the top frequent words and counts
    char string[256];
    char tmp[256];

    while (k > 0 )
    {
        highestWordCount = 0;
        string[0]=0;
        tmp[0]=0;

        // find next most frequent word
        for (i=0;i<k; i++)
        {
            if (topWords[i]->count>highestWordCount)
            {
                highestWordCount = topWords[i]->count;
                highestWordIndex = i;
            }
        }

        Letter* letter = topWords[highestWordIndex];

        // swap the end top word with the found word and decrement the number of top words
        topWords[highestWordIndex] = topWords[--k];

        if (highestWordCount > 0)
        {
            // construct string of letters to form the word
            while (letter != root)
            {
                memmove(&tmp[1],&string[0],255);
                tmp[0]=letter->asciiCode+97;
                memmove(&string[0],&tmp[0],255);
                letter=letter->parent;
            }

            printf("%u %s\n",highestWordCount,string);
        }
    }

    free( data );
    free( letters );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}

For test 1, with the top 10 most frequent words and timing enabled, it should print:

 968832 the
 528960 of
 466432 and
 421184 a
 322624 to
 320512 in
 270528 he
 213120 his
 191808 i
 182144 s

 Time Taken: 1.549155 seconds

Impressive! In the worst case, using a list may make it O(Nk), so it runs slower on giganovel with k = 100,000 than the reference C++ program. But for k << N it's a clear winner.
Andriy Makukha

1
@AndriyMakukha Thanks! I was somewhat surprised that such a simple implementation is this fast. I could handle larger values of k better by keeping the list sorted. (The sorting shouldn't be too expensive, as the list order changes slowly.) But that adds complexity and would probably hurt speed for low values of k. I'll have to experiment.
Moogie

Yes, I was surprised too. It may be that the reference program uses lots of function calls and the compiler fails to optimize them properly.
Andriy Makukha

Another performance benefit probably comes from the semi-static allocation of the letters array, whereas the reference implementation allocates tree nodes dynamically.
Andriy Makukha

mmap-ing should be faster (~5% on my Linux laptop): #include <sys/mman.h>, <sys/stat.h> and <fcntl.h>, replace the file reading with int d=open(argv[1],0); struct stat s; fstat(d,&s); dataLength=s.st_size; data=mmap(0,dataLength,1,1,d,0); and comment out free(data);
ngn

4

Rust

On my computer, this runs giganovel 100000 about 42% faster (10.64 s vs. 18.24 s) than Moogie's C "prefix tree + bins" solution. It also has no predefined limits (unlike the C solution, which predefines limits on word length, unique words, repeated words, etc.).

src/main.rs

use memmap::MmapOptions;
use pdqselect::select_by_key;
use std::cmp::Reverse;
use std::default::Default;
use std::env::args;
use std::fs::File;
use std::io::{self, Write};
use typed_arena::Arena;

#[derive(Default)]
struct Trie<'a> {
    nodes: [Option<&'a mut Trie<'a>>; 26],
    count: u64,
}

fn main() -> io::Result<()> {
    // Parse arguments
    let mut args = args();
    args.next().unwrap();
    let filename = args.next().unwrap();
    let size = args.next().unwrap().parse().unwrap();

    // Open input
    let file = File::open(filename)?;
    let mmap = unsafe { MmapOptions::new().map(&file)? };

    // Build trie
    let arena = Arena::new();
    let mut num_words = 0;
    let mut root = Trie::default();
    {
        let mut node = &mut root;
        for byte in &mmap[..] {
            let letter = (byte | 32).wrapping_sub(b'a');
            if let Some(child) = node.nodes.get_mut(letter as usize) {
                node = child.get_or_insert_with(|| {
                    num_words += 1;
                    arena.alloc(Default::default())
                });
            } else {
                node.count += 1;
                node = &mut root;
            }
        }
        node.count += 1;
    }

    // Extract all counts
    let mut index = 0;
    let mut counts = Vec::with_capacity(num_words);
    let mut stack = vec![root.nodes.iter()];
    'a: while let Some(frame) = stack.last_mut() {
        while let Some(child) = frame.next() {
            if let Some(child) = child {
                if child.count != 0 {
                    counts.push((child.count, index));
                    index += 1;
                }
                stack.push(child.nodes.iter());
                continue 'a;
            }
        }
        stack.pop();
    }

    // Find frequent counts
    select_by_key(&mut counts, size, |&(count, _)| Reverse(count));
    // Or, in nightly Rust:
    //counts.partition_at_index_by_key(size, |&(count, _)| Reverse(count));

    // Extract frequent words
    let size = size.min(counts.len());
    counts[0..size].sort_by_key(|&(_, index)| index);
    let mut out = Vec::with_capacity(size);
    let mut it = counts[0..size].iter();
    if let Some(mut next) = it.next() {
        index = 0;
        stack.push(root.nodes.iter());
        let mut word = vec![b'a' - 1];
        'b: while let Some(frame) = stack.last_mut() {
            while let Some(child) = frame.next() {
                *word.last_mut().unwrap() += 1;
                if let Some(child) = child {
                    if child.count != 0 {
                        if index == next.1 {
                            out.push((word.to_vec(), next.0));
                            if let Some(next1) = it.next() {
                                next = next1;
                            } else {
                                break 'b;
                            }
                        }
                        index += 1;
                    }
                    stack.push(child.nodes.iter());
                    word.push(b'a' - 1);
                    continue 'b;
                }
            }
            stack.pop();
            word.pop();
        }
    }
    out.sort_by_key(|&(_, count)| Reverse(count));

    // Print results
    let stdout = io::stdout();
    let mut stdout = io::BufWriter::new(stdout.lock());
    for (word, count) in out {
        stdout.write_all(&word)?;
        writeln!(stdout, " {}", count)?;
    }

    Ok(())
}

Cargo.toml

[package]
name = "frequent"
version = "0.1.0"
authors = ["Anders Kaseorg <andersk@mit.edu>"]
edition = "2018"

[dependencies]
memmap = "0.7.0"
typed-arena = "1.4.1"
pdqselect = "0.1.0"

[profile.release]
lto = true
opt-level = 3

Usage

cargo build --release
time target/release/frequent ulysses64 10

1
Superb! Excellent performance in all three settings. I was just recently watching one of Carol Nichols' talks on Rust :) Somewhat unusual syntax, but I'm excited to learn the language: it seems to be the only post-C++ systems language that doesn't sacrifice a lot of performance while making developers' lives much easier.
Andriy Makukha

Very fast! I'm impressed! I wonder whether better compiler options for my C (tree + bins) entry would give similar results?
Moogie

@Moogie I'm already testing with -O3, and -Ofast makes no noticeable difference.
Anders Kaseorg

@Moogie, I'm compiling your code like this: gcc -O3 -march=native -mtune=native program.c
Andriy Makukha

@Andriy Makukha Ah. That would explain the big difference in speed between your results and mine: you were already applying optimization flags. I don't think there are many big code optimizations left. I can't test with mmap as others suggested, since mingw doesn't implement it... and it would only give 5% anyway. I think I'll have to yield to Anders' excellent entry. Well done!
Moogie

3

APL (Dyalog Unicode)

The following runs in 8 seconds on my 2.6 GHz i7-4720HQ, using 64-bit Dyalog APL 17.0 on Windows 10:

⎕{m[⍺↑⍒⊢/m←{(⊂⎕UCS⊃⍺),≢⍵}⌸(⊢⊆⍨96∘<∧<∘123)83⎕DR 819⌶80 ¯1⎕MAP⍵;]}⍞

It first prompts for the file name, then for k. Note that a significant part of the running time (about 1 second) is just reading in the file.

To time it, you should be able to pipe the following into your dyalog executable (for the ten most common words):

⎕{m[⍺↑⍒⊢/m←{(⊂⎕UCS⊃⍺),≢⍵}⌸(⊢⊆⍨96∘<∧<∘123)83⎕DR 819⌶80 ¯1⎕MAP⍵;]}⍞
/tmp/ulysses64
10
⎕OFF

It should print:

 the  968832
 of   528960
 and  466432
 a    421184
 to   322624
 in   320512
 he   270528
 his  213120
 i    191808
 s    182144

Very nice! It beats Python. It works best after export MAXWS=4096M. It uses a hash table, I guess? Because reducing the workspace size to 2 GB makes it a whole 2 seconds slower.
Andriy Makukha

@AndriyMakukha Yes, it uses a hash table as per this, and I'm pretty sure it does so internally as well.
Adám

Why O(N log N)? To me it looks more like the Python solution (k times restoring a heap of all unique words) or the AWK solution (sorting only the unique words). Unless you sort all the words, as in McIlroy's shell script, it shouldn't be O(N log N).
Andriy Makukha

@AndriyMakukha It sorts all the counts. Here's what our performance expert wrote to me: the time complexity is O(N log N), unless you believe some theoretically dubious things about hash tables, in which case it's O(N).
Adám

Well, when I run your code against 8, 16, and 32 copies of Ulysses, it slows down linearly. Maybe your performance expert needs to reconsider his views on the time complexity of hash tables :) Also, this code doesn't work for the larger test case: it returns WS FULL even after I increased the workspace to 6 GB.
Andriy Makukha

2

[C] Prefix tree + bins

Note: the compiler used has a significant impact on execution speed! I used gcc 8.2.0 (MinGW.org GCC-8.2.0-3). With the -Ofast switch, the program runs almost 50% faster than with a normal build.

Algorithm complexity

I have since realized that the bin sorting I'm performing is a form of pigeonhole sort, which means I can pin down the big-O complexity of this solution.

I calculate it as:

Worst Time complexity: O(1 + N + k)
Worst Space complexity: O(26*M + N + n) = O(M + N + n)

Where N is the number of words of the data
and M is the number of letters of the data
and n is the range of pigeon holes
and k is the desired number of sorted words to return
and N<=M

The tree construction is equivalent in complexity to the tree traversal: at any level, finding the correct node to traverse to is O(1) (since each letter maps directly to a node and we always traverse only one level of the tree per letter).

Pigeonhole sorting is O(N + n), where n is the range of key values; however, for this problem we don't need to sort all the values, only the top k of them, so the worst case is O(N + k).

Combining yields O(1 + N + k).

The space complexity of the tree construction stems from the fact that the worst case is 26*M nodes if the data consists of one word with M letters and each node has 26 child nodes (one for each letter of the alphabet). Hence O(26*M) = O(M).

For the pigeonhole sort, the space complexity is O(N + n).

Combining yields O(26*M + N + n) = O(M + N + n).
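
As an illustration of the binning idea (a Python sketch of the scheme, not the author's C code below): each word is dropped into a bin indexed by its exact count, and the bins are then walked from the highest count downwards until k words have been emitted.

def top_k_by_bins(counts, k):
    # counts: dict mapping word -> frequency (the trie's leaf counters)
    max_count = max(counts.values(), default=0)
    bins = [[] for _ in range(max_count + 1)]  # bin i holds words seen i times
    for word, c in counts.items():
        bins[c].append(word)
    out = []
    # walk the bins from most frequent down: O(n + k) once the bins are filled
    for c in range(max_count, 0, -1):
        for word in bins[c]:
            if len(out) == k:
                return out
            out.append((c, word))
    return out

print(top_k_by_bins({"the": 5, "of": 3, "a": 3, "x": 1}, 2))
# [(5, 'the'), (3, 'of')]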

Algorithm

It takes two arguments as input (the path to the text file, and k, the number of most frequent words to list).

Compared with my other solutions, this version has a much flatter time-cost slope as k increases. It is noticeably slower for low values of k, but should be much faster for larger values of k.

It creates a trie branching on the letters of words and increments a counter at the leaf letters. Each word is then added to the bin of words with the same count (after first being removed from the bin it previously resided in, if any). All this repeats until no more letters are read. After that, the bins are iterated in reverse, starting from the largest bin, up to k times, and the words of each bin are output.

Processing time is currently output by default, but for consistency with other submissions, disable the TIMING define in the source code.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

// may need to increase if the source text has many repeated words
#define MAX_BINS 1000000

// assume maximum of 20 letters in a word... adjust accordingly
#define MAX_LETTERS_IN_A_WORD 20

// assume maximum of 10 letters for the string representation of the bin number... adjust accordingly
#define MAX_LETTERS_FOR_BIN_NAME 10

// maximum number of bytes of the output results
#define MAX_OUTPUT_SIZE 10000000

#define false 0
#define true 1
#define null 0
#define SPACE_ASCII_CODE 32

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    //char isAWord;
    struct Letter* parent;
    struct Letter* binElementNext;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

struct Bin
{
  struct Letter* word;
};
typedef struct Bin Bin;


int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n\n");
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
    fptr = fopen(argv[1], "rb");
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    if (fptr) fclose(fptr);

    unsigned int chr;
    unsigned int i, j;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], null, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the memory for bins
    Bin* bins = (Bin*) malloc(sizeof(Bin) * MAX_BINS);
    memset(&bins[0], null, sizeof( Bin) * MAX_BINS);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;
    Letter *nextFreeLetter = &letters[0];

    // pseudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;

    // the next letter to be processed
    Letter* nextLetter = null;

    unsigned int sortedListSize = 0;

    // the count of the most frequent word
    unsigned int maxCount = 0;

    // the count of the current word
    unsigned int wordCount = 0;

////////////////////////////////////////////////////////////////////////////////////////////
// CREATING PREFIX TREE
    j=dataLength;
    while (--j>0)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, initialise the new letter
            if (nextLetter == null)
            {
                ++letterMasterIndex;
                nextLetter = ++nextFreeLetter;
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        else
        {
            //currentLetter->isAWord = true;

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

////////////////////////////////////////////////////////////////////////////////////////////
// ADDING TO BINS

    j = letterMasterIndex;
    currentLetter=&letters[j-1];
    while (--j>0)
    {

      // is the letter the leaf letter of word?
      if (currentLetter->count>0)
      {
        i = currentLetter->count;
        if (maxCount < i) maxCount = i;

        // add to bin
        currentLetter->binElementNext = bins[i].word;
        bins[i].word = currentLetter;
      }
      --currentLetter;
    }

////////////////////////////////////////////////////////////////////////////////////////////
// PRINTING OUTPUT

    // the memory for output
    char* output = (char*) malloc(sizeof(char) * MAX_OUTPUT_SIZE);
    memset(&output[0], SPACE_ASCII_CODE, sizeof( char) * MAX_OUTPUT_SIZE);
    unsigned int outputIndex = 0;

    // string representation of the current bin number
    char binName[MAX_LETTERS_FOR_BIN_NAME];
    memset(&binName[0], SPACE_ASCII_CODE, MAX_LETTERS_FOR_BIN_NAME);


    Letter* letter;
    Letter* binElement;

    // starting at the bin representing the most frequent word(s) and then iterating backwards...
    for ( i=maxCount;i>0 && k>0;i--)
    {
      // check to ensure that the bin has at least one word
      if ((binElement = bins[i].word) != null)
      {
        // update the bin name
        sprintf(binName,"%u",i);

        // iterate over the words in the bin
        while (binElement !=null && k>0)
        {
          // stop if we have reached the desired number of output words
          if (k-- > 0)
          {
              letter = binElement;

              // add the bin name to the output
              memcpy(&output[outputIndex],&binName[0],MAX_LETTERS_FOR_BIN_NAME);
              outputIndex+=MAX_LETTERS_FOR_BIN_NAME;

              // construct string of letters to form the word
               while (letter != root)
              {
                // output the letter to the output
                output[outputIndex++] = letter->asciiCode+97;
                letter=letter->parent;
              }

              output[outputIndex++] = '\n';

              // go to the next word in the bin
              binElement = binElement->binElementNext;
          }
        }
      }
    }

    // write the output to std out
    fwrite(output, 1, outputIndex, stdout);
   // fflush(stdout);

   // free( data );
   // free( letters );
   // free( bins );
   // free( output );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}

Edit: now defers populating the bins until after the tree has been constructed, and optimizes the construction of the output.

EDIT2: now uses pointer arithmetic instead of array access, for speed.


Wow! The 100,000 most frequent words from a 1 GB file in 11 seconds... This looks like some kind of magic trick.
Andriy Makukha

No tricks... just trading CPU time for efficient use of memory. I'm surprised by your result... on my old PC it takes over 60 seconds. I've noticed that I'm doing unnecessary comparisons and that I can defer the binning until the file has been processed. That should make it even faster. I'll try it soon and update my answer.
Moogie

@AndriyMakukha I now defer populating the bins until after all the words have been processed and the tree has been constructed. This avoids unnecessary comparisons and bin-element manipulation. I also changed the way the output is constructed, as I found that printing was taking a significant amount of time!
Moogie

On my machine this update doesn't make any noticeable difference. However, it did run ulysses64 very fast once, so it's the current leader now.
Andriy Makukha

Must be an issue unique to my PC then :) I did notice a 5-second speedup with this new output algorithm.
Moogie

2

J

9!:37 ] 0 _ _ _

'input k' =: _2 {. ARGV
k =: ". k

lower =: a. {~ 97 + i. 26
words =: ((lower , ' ') {~ lower i. ]) (32&OR)&.(a.&i.) fread input
words =: ' ' , words
words =: -.&(s: a:) s: words
uniq =: ~. words
res =: (k <. # uniq) {. \:~ (# , {.)/.~ uniq&i. words
echo@(,&": ' ' , [: }.@": {&uniq)/"1 res

exit 0

Run it as a script with jconsole <script> <input> <k>. For example, the output for giganovel with k=100K:

$ time jconsole solve.ijs giganovel 100000 | head 
11309 e
11290 ihit
11285 ah
11260 ist
11255 aa
11202 aiv
11201 al
11188 an
11187 o
11186 ansa

real    0m13.765s
user    0m11.872s
sys     0m1.786s

There are no limits other than the amount of available system memory.


Very fast for the smaller test cases! Nice! However, for arbitrarily large words, it truncates the words in the output. I'm not sure whether there's a limit on the number of characters in a word, or whether it's just to keep the output more concise.
Andriy Makukha

@AndriyMakukha Yes, the ... happens because of per-line output truncation. I added a line at the beginning to disable all truncation. It slows down on giganovel because it uses more memory, there being more unique words.
miles

Great! Now it passes the generality test. And it didn't get slower on my machine. If anything, there was a slight speedup.
Andriy Makukha

2

C++ (à la Knuth)

I was curious how Knuth's program would fare, so I translated his (originally Pascal) program into C++.

Even though Knuth's primary goal was not speed but to illustrate his WEB system of literate programming, the program turns out to be surprisingly competitive, and results in a faster solution than any of the answers so far. Here's my translation of his program (the corresponding "section" numbers of the WEB program are mentioned in comments like "{§24}"):

#include <iostream>
#include <cassert>

// Adjust these parameters based on input size.
const int TRIE_SIZE = 800 * 1000; // Size of the hash table used for the trie.
const int ALPHA = 494441;  // An integer that's approximately (0.61803 * TRIE_SIZE), and relatively prime to T = TRIE_SIZE - 52.
const int kTolerance = TRIE_SIZE / 100;  // How many places to try, to find a new place for a "family" (=bunch of children).

typedef int32_t Pointer;  // [0..TRIE_SIZE), an index into the array of Nodes
typedef int8_t Char;  // We only care about 1..26 (plus two values), but there's no "int5_t".
typedef int32_t Count;  // The number of times a word has been encountered.
// These are 4 separate arrays in Knuth's implementation.
struct Node {
  Pointer link;  // From a parent node to its children's "header", or from a header back to parent.
  Pointer sibling;  // Previous sibling, cyclically. (From smallest child to header, and header to largest child.)
  Count count;  // The number of times this word has been encountered.
  Char ch;  // EMPTY, or 1..26, or HEADER. (For nodes with ch=EMPTY, the link/sibling/count fields mean nothing.)
} node[TRIE_SIZE + 1];
// Special values for `ch`: EMPTY (free, can insert child there) and HEADER (start of family).
const Char EMPTY = 0, HEADER = 27;

const Pointer T = TRIE_SIZE - 52;
Pointer x;  // The `n`th time we need a node, we'll start trying at x_n = (alpha * n) mod T. This holds current `x_n`.
// A header can only be in T (=TRIE_SIZE-52) positions namely [27..TRIE_SIZE-26].
// This transforms a "h" from range [0..T) to the above range namely [27..T+27).
Pointer rerange(Pointer n) {
  n = (n % T) + 27;
  // assert(27 <= n && n <= TRIE_SIZE - 26);
  return n;
}

// Convert trie node to string, by walking up the trie.
std::string word_for(Pointer p) {
  std::string word;
  while (p != 0) {
    Char c = node[p].ch;  // assert(1 <= c && c <= 26);
    word = static_cast<char>('a' - 1 + c) + word;
    // assert(node[p - c].ch == HEADER);
    p = (p - c) ? node[p - c].link : 0;
  }
  return word;
}

// Increment `x`, and declare `h` (the first position to try) and `last_h` (the last position to try). {§24}
#define PREPARE_X_H_LAST_H x = (x + ALPHA) % T; Pointer h = rerange(x); Pointer last_h = rerange(x + kTolerance);
// Increment `h`, being careful to account for `last_h` and wraparound. {§25}
#define INCR_H { if (h == last_h) { std::cerr << "Hit tolerance limit unfortunately" << std::endl; exit(1); } h = (h == TRIE_SIZE - 26) ? 27 : h + 1; }

// `p` has no children. Create `p`s family of children, with only child `c`. {§27}
Pointer create_child(Pointer p, int8_t c) {
  // Find `h` such that there's room for both header and child c.
  PREPARE_X_H_LAST_H;
  while (!(node[h].ch == EMPTY and node[h + c].ch == EMPTY)) INCR_H;
  // Now create the family, with header at h and child at h + c.
  node[h]     = {.link = p, .sibling = h + c, .count = 0, .ch = HEADER};
  node[h + c] = {.link = 0, .sibling = h,     .count = 0, .ch = c};
  node[p].link = h;
  return h + c;
}

// Move `p`'s family of children to a place where child `c` will also fit. {§29}
void move_family_for(const Pointer p, Char c) {
  // Part 1: Find such a place: need room for `c` and also all existing children. {§31}
  PREPARE_X_H_LAST_H;
  while (true) {
    INCR_H;
    if (node[h + c].ch != EMPTY) continue;
    Pointer r = node[p].link;
    int delta = h - r;  // We'd like to move each child by `delta`
    while (node[r + delta].ch == EMPTY and node[r].sibling != node[p].link) {
      r = node[r].sibling;
    }
    if (node[r + delta].ch == EMPTY) break;  // There's now space for everyone.
  }

  // Part 2: Now actually move the whole family to start at the new `h`.
  Pointer r = node[p].link;
  int delta = h - r;
  do {
    Pointer sibling = node[r].sibling;
    // Move node from current position (r) to new position (r + delta), and free up old position (r).
    node[r + delta] = {.link = node[r].link, .sibling = node[r].sibling + delta, .count = node[r].count, .ch = node[r].ch};
    if (node[r].link != 0) node[node[r].link].link = r + delta;
    node[r].ch = EMPTY;
    r = sibling;
  } while (node[r].ch != EMPTY);
}

// Advance `p` to its `c`th child. If necessary, add the child, or even move `p`'s family. {§21}
Pointer find_child(Pointer p, Char c) {
  // assert(1 <= c && c <= 26);
  if (p == 0) return c;  // Special case for first char.
  if (node[p].link == 0) return create_child(p, c);  // If `p` currently has *no* children.
  Pointer q = node[p].link + c;
  if (node[q].ch == c) return q;  // Easiest case: `p` already has a `c`th child.
  // Make sure we have room to insert a `c`th child for `p`, by moving its family if necessary.
  if (node[q].ch != EMPTY) {
    move_family_for(p, c);
    q = node[p].link + c;
  }
  // Insert child `c` into `p`'s family of children (at `q`), with correct siblings. {§28}
  Pointer h = node[p].link;
  while (node[h].sibling > q) h = node[h].sibling;
  node[q] = {.link = 0, .sibling = node[h].sibling, .count = 0, .ch = c};
  node[h].sibling = q;
  return q;
}

// Largest descendant. {§18}
Pointer last_suffix(Pointer p) {
  while (node[p].link != 0) p = node[node[p].link].sibling;
  return p;
}

// The largest count beyond which we'll put all words in the same (last) bucket.
// We do an insertion sort (potentially slow) in last bucket, so increase this if the program takes a long time to walk trie.
const int MAX_BUCKET = 10000;
Pointer sorted[MAX_BUCKET + 1];  // The head of each list.

// Records the count `n` of `p`, by inserting `p` in the list that starts at `sorted[n]`.
// Overwrites the value of node[p].sibling (uses the field to mean its successor in the `sorted` list).
void record_count(Pointer p) {
  // assert(node[p].ch != HEADER);
  // assert(node[p].ch != EMPTY);
  Count f = node[p].count;
  if (f == 0) return;
  if (f < MAX_BUCKET) {
    // Insert at head of list.
    node[p].sibling = sorted[f];
    sorted[f] = p;
  } else {
    Pointer r = sorted[MAX_BUCKET];
    if (node[p].count >= node[r].count) {
      // Insert at head of list
      node[p].sibling = r;
      sorted[MAX_BUCKET] = p;
    } else {
      // Find right place by count. This step can be SLOW if there are too many words with count >= MAX_BUCKET
      while (node[p].count < node[node[r].sibling].count) r = node[r].sibling;
      node[p].sibling = node[r].sibling;
      node[r].sibling = p;
    }
  }
}

// Walk the trie, going over all words in reverse-alphabetical order. {§37}
// Calls "record_count" for each word found.
void walk_trie() {
  // assert(node[0].ch == HEADER);
  Pointer p = node[0].sibling;
  while (p != 0) {
    Pointer q = node[p].sibling;  // Saving this, as `record_count(p)` will overwrite it.
    record_count(p);
    // Move down to last descendant of `q` if any, else up to parent of `q`.
    p = (node[q].ch == HEADER) ? node[q].link : last_suffix(q);
  }
}

int main(int, char** argv) {
  // Program startup
  std::ios::sync_with_stdio(false);

  // Set initial values {§19}
  for (Char i = 1; i <= 26; ++i) node[i] = {.link = 0, .sibling = i - 1, .count = 0, .ch = i};
  node[0] = {.link = 0, .sibling = 26, .count = 0, .ch = HEADER};

  // read in file contents
  FILE *fptr = fopen(argv[1], "rb");
  fseek(fptr, 0L, SEEK_END);
  long dataLength = ftell(fptr);
  rewind(fptr);
  char* data = (char*)malloc(dataLength);
  fread(data, 1, dataLength, fptr);
  if (fptr) fclose(fptr);

  // Loop over file contents: the bulk of the time is spent here.
  Pointer p = 0;
  for (int i = 0; i < dataLength; ++i) {
    Char c = (data[i] | 32) - 'a' + 1;  // 1 to 26, for 'a' to 'z' or 'A' to 'Z'
    if (1 <= c && c <= 26) {
      p = find_child(p, c);
    } else {
      ++node[p].count;
      p = 0;
    }
  }
  node[0].count = 0;

  walk_trie();

  const int max_words_to_print = atoi(argv[2]);
  int num_printed = 0;
  for (Count f = MAX_BUCKET; f >= 0 && num_printed <= max_words_to_print; --f) {
    for (Pointer p = sorted[f]; p != 0 && num_printed < max_words_to_print; p = node[p].sibling) {
      std::cout << word_for(p) << " " << node[p].count << std::endl;
      ++num_printed;
    }
  }

  return 0;
}

Differences from Knuth's program:

  • I combined Knuth's four arrays link, sibling, count, and ch into an array of struct Node (I find it easier to understand this way).
  • I converted the literate-programming (WEB-style) textual inclusion of sections into more conventional function calls (and a couple of macros).
  • We don't need to use standard Pascal's weird I/O conventions/restrictions, so we can use fread and data[i] | 32 - 'a' as in the other answers here, instead of the Pascal workaround.
  • In case the program exceeds its limits while running (runs out of space), Knuth's original program handles it gracefully by dropping later words and printing a message at the end. (It isn't quite right to say that McIlroy "criticized Knuth's solution as not even being able to process a full text of the Bible"; he was only pointing out that frequent words may sometimes occur very late in a text, such as the word "Jesus" in the Bible, so the error condition is not innocuous.) I've taken the noisier (and anyway easier) approach of simply terminating the program.
  • The program declares a constant TRIE_SIZE to control memory usage, which I had to bump up. (The constant of 32767 had been chosen for the original requirements, "a user should be able to find the 100 most frequent words in a twenty-page technical paper (roughly a 50K byte file)", and because Pascal deals well with ranged integer types and packs them optimally. We had to increase it 25x to 800,000 because the test inputs are now 20 million times larger.)
  • For the final printing of strings, we can just walk the trie and do a dumb (potentially even quadratic) string append.

Apart from that, this is pretty much exactly Knuth's program (using his hash trie / packed trie data structure and bucket sort), and it performs pretty much the same operations (as his Pascal program would) while looping through all characters in the input. Note that it uses no external algorithm or data-structure libraries, and that words of equal frequency are printed in alphabetical order.

Timing

Compiled with:

clang++ -std=c++17 -O2 ptrie-walktrie.cc 

When run on the largest test case here (giganovel with 100,000 words requested) and compared against the fastest program posted so far, I find it slightly but consistently faster:

target/release/frequent:   4.809 ±   0.263 [ 4.45.. 5.62]        [... 4.63 ...  4.75 ...  4.88...]
ptrie-walktrie:            4.547 ±   0.164 [ 4.35.. 4.99]        [... 4.42 ...   4.5 ...  4.68...]

(The top line is Anders Kaseorg's Rust solution; the bottom line is the program above. These are timings over 100 runs, with mean, min, max, median, and quartiles.)

Analysis

Why is it faster? It's not that C++ is faster than Rust, or that Knuth's program is the fastest possible; in fact, Knuth's program is slower on insertions (as he mentions) because of the trie packing, which conserves memory. The reason, I suspect, is related to something Knuth complained about in 2008:

A flame about 64-bit pointers

It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache.

The program above uses 32-bit array indices (rather than 64-bit pointers), so the Node struct occupies less memory, more Nodes fit in the cache, and there are fewer cache misses. (In fact, there was some work on this in the form of the x32 ABI, but it seems not to be in a good state even though the idea is obviously useful; see also the recent announcement of pointer compression in V8. Oh well.) On giganovel, this program uses 12.8 MB for its (packed) trie, versus the Rust program's 32.18 MB for its trie. We could scale up by a factor of 1000 (from "giganovel" to, say, "teranovel") and still not exceed 32-bit indices, so this seems a reasonable choice.

A faster variant

We can optimize for speed and forgo the packing, so that we can actually use the (non-packed) trie as in the Rust solution, with indices instead of pointers. This gives something that's faster and has no pre-fixed limits on the number of distinct words, characters, etc.:

#include <iostream>
#include <cassert>
#include <vector>
#include <algorithm>

typedef int32_t Pointer;  // [0..node.size()), an index into the array of Nodes
typedef int32_t Count;
typedef int8_t Char;  // We'll usually just have 1 to 26.
struct Node {
  Pointer link;  // From a parent node to its children's "header", or from a header back to parent.
  Count count;  // The number of times this word has been encountered. Undefined for header nodes.
};
std::vector<Node> node; // Our "arena" for Node allocation.

std::string word_for(Pointer p) {
  std::vector<char> drow;  // The word backwards
  while (p != 0) {
    Char c = p % 27;
    drow.push_back('a' - 1 + c);
    p = (p - c) ? node[p - c].link : 0;
  }
  return std::string(drow.rbegin(), drow.rend());
}

// `p` has no children. Create `p`s family of children, with only child `c`.
Pointer create_child(Pointer p, Char c) {
  Pointer h = node.size();
  node.resize(node.size() + 27);
  node[h] = {.link = p, .count = -1};
  node[p].link = h;
  return h + c;
}

// Advance `p` to its `c`th child. If necessary, add the child.
Pointer find_child(Pointer p, Char c) {
  assert(1 <= c && c <= 26);
  if (p == 0) return c;  // Special case for first char.
  if (node[p].link == 0) return create_child(p, c);  // Case 1: `p` currently has *no* children.
  return node[p].link + c;  // Case 2 (easiest case): Already have the child c.
}

int main(int, char** argv) {
  auto start_c = std::clock();

  // Program startup
  std::ios::sync_with_stdio(false);

  // read in file contents
  FILE *fptr = fopen(argv[1], "rb");
  fseek(fptr, 0, SEEK_END);
  long dataLength = ftell(fptr);
  rewind(fptr);
  char* data = (char*)malloc(dataLength);
  fread(data, 1, dataLength, fptr);
  fclose(fptr);

  node.reserve(dataLength / 600);  // Heuristic based on test data. OK to be wrong.
  node.push_back({0, 0});
  for (Char i = 1; i <= 26; ++i) node.push_back({0, 0});

  // Loop over file contents: the bulk of the time is spent here.
  Pointer p = 0;
  for (long i = 0; i < dataLength; ++i) {
    Char c = (data[i] | 32) - 'a' + 1;  // 1 to 26, for 'a' to 'z' or 'A' to 'Z'
    if (1 <= c && c <= 26) {
      p = find_child(p, c);
    } else {
      ++node[p].count;
      p = 0;
    }
  }
  ++node[p].count;
  node[0].count = 0;

  // Brute-force: Accumulate all words and their counts, then sort by frequency and print.
  std::vector<std::pair<int, std::string>> counts_words;
  for (Pointer i = 1; i < static_cast<Pointer>(node.size()); ++i) {
    int count = node[i].count;
    if (count == 0 || i % 27 == 0) continue;
    counts_words.push_back({count, word_for(i)});
  }
  auto cmp = [](auto x, auto y) {
    if (x.first != y.first) return x.first > y.first;
    return x.second < y.second;
  };
  std::sort(counts_words.begin(), counts_words.end(), cmp);
  const int max_words_to_print = std::min<int>(counts_words.size(), atoi(argv[2]));
  for (int i = 0; i < max_words_to_print; ++i) {
    auto [count, word] = counts_words[i];
    std::cout << word << " " << count << std::endl;
  }

  return 0;
}

This program, despite doing something much cruder for the sorting than the solutions here, uses only 12.2 MB for its trie on giganovel, and manages to be faster. Timings of this program (last line), compared with the earlier timings:

target/release/frequent:   4.809 ±   0.263 [ 4.45.. 5.62]        [... 4.63 ...  4.75 ...  4.88...]
ptrie-walktrie:            4.547 ±   0.164 [ 4.35.. 4.99]        [... 4.42 ...   4.5 ...  4.68...]
itrie-nolimit:             3.907 ±   0.127 [ 3.69.. 4.23]        [... 3.81 ...   3.9 ...   4.0...]

I'd love to see what this (or the hash-trie program) would look like translated into Rust. :-)

Further details

  1. About the data structure used here: an explanation of "packed" tries is given tersely in exercise 4 of Section 6.3 (digital searching, i.e. tries) in Volume 3 of TAOCP, and also in the thesis of Knuth's student Frank Liang about hyphenation in TeX: Word Hy-phen-a-tion by Com-put-er.

  2. The context of Bentley's column, Knuth's program, and McIlroy's review (only a small part of which was about the Unix philosophy) is clearer in light of the previous and later columns, and of Knuth's prior experience, including compilers, TAOCP, and TeX.

  3. There's an entire book, Exercises in Programming Style, showing different approaches to this particular program, etc.

I have an unfinished blog post elaborating on the points above; this answer may be edited once it's done. In the meantime, posting the answer here anyway, on the occasion of Knuth's birthday (January 10). :-)


Awesome! Finally, not only did someone post Knuth's solution (I intended to, but in Pascal) with excellent analysis and performance, beating the previous best post, but another C++ program also set a new speed record! Wonderful.
Andriy Makukha

My only two comments: 1) your second program currently fails with Segmentation fault: 11 on test cases with arbitrarily large words and gaps; 2) although it may look like I sympathize with McIlroy's "criticism", I'm well aware that Knuth's intention was merely to show off his literate programming technique, while McIlroy criticized it from an engineering perspective. McIlroy himself later admitted that it wasn't a fair thing to do.
Andriy Makukha

@AndriyMakukha Oh, that was the recursive word_for; fixed now. Yes, McIlroy, as the inventor of Unix pipes, took the opportunity to evangelize the Unix philosophy of composing small tools. It's a good philosophy, compared to Knuth's frustrating (if you try to read his programs) monolithic approach, but in context it was a bit unfair, also for another reason: today the Unix way is widely available, but in 1986 it was confined to Bell Labs, Berkeley, etc. ("his company makes the best prefabs in the business")
ShreevatsaR

It works! Congrats to the new king :-P As for Unix and Knuth, he didn't seem to like the system very much, because there was and is little unity between different tools (e.g., many tools define regexes differently).
Andriy Makukha

1

Python 3

This implementation using a plain dict is slightly faster than the one using Counter on my system.

def words_from_file(filename):
    import re

    pattern = re.compile('[a-z]+')

    for line in open(filename):
        yield from pattern.findall(line.lower())


def freq(textfile, k):
    frequencies = {}

    for word in words_from_file(textfile):
        frequencies[word] = frequencies.get(word, 0) + 1

    most_frequent = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

    for i, (word, frequency) in enumerate(most_frequent):
        if i == k:
            break

        yield word, frequency


from time import time

start = time()
print('\n'.join('{}:\t{}'.format(f, w) for w,f in freq('giganovel', 10)))
end = time()
print(end - start)

1
I could only test giganovel on my system, and that takes quite a long time (~90 s). gutenbergproject is blocked in Germany for legal reasons...
movatica

Interesting. Either heapq doesn't improve the performance of the Counter.most_common method, or enumerate(sorted(...)) uses heapq internally as well.
Andriy Makukha

I tested with Python 2, and the performance was similar, so I guess sorting works about as fast as Counter.most_common.
Andriy Makukha

Yes, maybe it was just jitter on my system... At least it isn't slower :) But the regex search is much faster than iterating over characters. It seems to be implemented quite well.
movatica

1

[C] Prefix tree + sorted linked list

It takes two arguments as input (the path to the text file, and k, the number of most frequent words to list).

Based on my other entry, this version is much faster for larger values of k, at a small performance cost for lower values of k.

It creates a trie branching on the letters of words, and at the leaf letters it increments a counter. It then checks whether the current leaf counter is greater than the smallest counter in the list of most frequent words (the list size is the number given on the command line). If so, the word represented by the leaf letter is promoted to one of the most frequent words. If it is already a most frequent word and its count is now higher, it is swapped with the next most frequent word, keeping the list sorted. All this repeats until no more letters are read. After that, the list of most frequent words is output.

Processing time is currently output by default, but for consistency with other submissions, disable the TIMING define in the source code.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

#define false 0
#define true 1
#define null 0

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    char isTopWord;
    struct Letter* parent;
    struct Letter* higher;
    struct Letter* lower;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n\n");
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
    fptr = fopen(argv[1], "rb");
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    if (fptr) fclose(fptr);

    unsigned int chr;
    unsigned int i;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], 0, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;

    // pseudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;

    // the next letter to be processed
    Letter* nextLetter = null;
    Letter* sortedWordsStart = null;
    Letter* sortedWordsEnd = null;
    Letter* A;
    Letter* B;
    Letter* C;
    Letter* D;

    unsigned int sortedListSize = 0;


    unsigned int lowestWordCount = 0;
    unsigned int lowestWordIndex = 0;
    unsigned int highestWordCount = 0;
    unsigned int highestWordIndex = 0;

    // main loop
    for (int j=0;j<dataLength;j++)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, initialise the new letter
            if (nextLetter == null)
            {
                nextLetter = &letters[letterMasterIndex++];
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        // not a letter so this means the current letter is the last letter of a word (if any letters)
        else if (currentLetter!=root)
        {

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // is this word not in the top word list?
            if (!currentLetter->isTopWord)
            {
                // first word becomes the sorted list
                if (sortedWordsStart == null)
                {
                  sortedWordsStart = currentLetter;
                  sortedWordsEnd = currentLetter;
                  currentLetter->isTopWord = true;
                  ++sortedListSize;
                }
                // always add words until list is at desired size, or 
                // swap the current word with the end of the sorted word list if current word count is larger
                else if (sortedListSize < k || currentLetter->count> sortedWordsEnd->count)
                {
                    // replace sortedWordsEnd entry with current word
                    if (sortedListSize == k)
                    {
                      currentLetter->higher = sortedWordsEnd->higher;
                      currentLetter->higher->lower = currentLetter;
                      sortedWordsEnd->isTopWord = false;
                    }
                    // add current word to the sorted list as the sortedWordsEnd entry
                    else
                    {
                      ++sortedListSize;
                      sortedWordsEnd->lower = currentLetter;
                      currentLetter->higher = sortedWordsEnd;
                    }

                    currentLetter->lower = null;
                    sortedWordsEnd = currentLetter;
                    currentLetter->isTopWord = true;
                }
            }
            // word is in top list
            else
            {
                // check to see whether the current word count is greater than the supposedly next highest word in the list
                // we ignore the word that is sortedWordsStart (i.e. most frequent)
                while (currentLetter != sortedWordsStart && currentLetter->count> currentLetter->higher->count)
                {
                    B = currentLetter->higher;
                    C = currentLetter;
                    A = B != null ? currentLetter->higher->higher : null;
                    D = currentLetter->lower;

                    if (A !=null) A->lower = C;
                    if (D !=null) D->higher = B;
                    B->higher = C;
                    C->higher = A;
                    B->lower = D;
                    C->lower = B;

                    if (B == sortedWordsStart)
                    {
                      sortedWordsStart = C;
                    }

                    if (C == sortedWordsEnd)
                    {
                      sortedWordsEnd = B;
                    }
                }
            }

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

    // print out the top frequent words and counts
    char string[256];
    char tmp[256];

    Letter* letter;
    while (sortedWordsStart != null )
    {
        letter = sortedWordsStart;
        highestWordCount = letter->count;
        string[0]=0;
        tmp[0]=0;

        if (highestWordCount > 0)
        {
            // construct string of letters to form the word
            while (letter != root)
            {
                memmove(&tmp[1],&string[0],255);
                tmp[0]=letter->asciiCode+97;
                memmove(&string[0],&tmp[0],255);
                letter=letter->parent;
            }

            printf("%u %s\n",highestWordCount,string);
        }
        sortedWordsStart = sortedWordsStart->lower;
    }

    free( data );
    free( letters );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}

For k = 100,000 it returns output that isn't quite sorted: 12 eroilk, 111 iennoa, 10 yttelen, 110 engyt.
Andriy Makukha

I think I have an idea of the cause. My thinking is that when checking the current word against the next highest word in the list, I'll need to iterate through the list swapping, rather than swapping just once. I'll check when I have time.
Moogie

Hmm, it seems the simple fix of changing an if to a while does work, but it also significantly slows the algorithm down for larger values of k. I may have to think of a cleverer solution.
Moogie

1

C#

This should work with recent .NET SDKs.

using System;
using System.IO;
using System.Diagnostics;
using System.Collections.Generic;
using System.Linq;
using static System.Console;

class Node {
    public Node Parent;
    public Node[] Nodes;
    public int Index;
    public int Count;

    public static readonly List<Node> AllNodes = new List<Node>();

    public Node(Node parent, int index) {
        this.Parent = parent;
        this.Index = index;
        AllNodes.Add(this);
    }

    public Node Traverse(uint u) {
        int b = (int)u;
        if (this.Nodes is null) {
            this.Nodes = new Node[26];
            return this.Nodes[b] = new Node(this, b);
        }
        if (this.Nodes[b] is null) return this.Nodes[b] = new Node(this, b);
        return this.Nodes[b];
    }

    public string GetWord() => this.Index >= 0 
        ? this.Parent.GetWord() + (char)(this.Index + 97)
        : "";
}

class Freq {
    const int DefaultBufferSize = 0x10000;

    public static void Main(string[] args) {
        var sw = Stopwatch.StartNew();

        if (args.Length < 2) {
            WriteLine("Usage: freq.exe {filename} {k} [{buffersize}]");
            return;
        }

        string file = args[0];
        int k = int.Parse(args[1]);
        int bufferSize = args.Length >= 3 ? int.Parse(args[2]) : DefaultBufferSize;

        Node root = new Node(null, -1) { Nodes = new Node[26] }, current = root;
        int b;
        uint u;

        using (var fr = new FileStream(file, FileMode.Open))
        using (var br = new BufferedStream(fr, bufferSize)) {
            outword:
                b = br.ReadByte() | 32;
                if ((u = (uint)(b - 97)) >= 26) {
                    if (b == -1) goto done; 
                    else goto outword;
                }
                else current = root.Traverse(u);
            inword:
                b = br.ReadByte() | 32;
                if ((u = (uint)(b - 97)) >= 26) {
                    if (b == -1) goto done;
                    ++current.Count;
                    goto outword;
                }
                else {
                    current = current.Traverse(u);
                    goto inword;
                }
            done:;
        }

        WriteLine(string.Join("\n", Node.AllNodes
            .OrderByDescending(count => count.Count)
            .Take(k)
            .Select(node => node.GetWord())));

        WriteLine("Self-measured milliseconds: {0}", sw.ElapsedMilliseconds);
    }
}

Here's some sample output.

C:\dev\freq>csc -o -nologo freq-trie.cs && freq-trie.exe giganovel 100000
e
ihit
ah
ist
 [... omitted for sanity ...]
omaah
aanhele
okaistai
akaanio
Self-measured milliseconds: 13619

Initially I tried using a dictionary with string keys, but it was way too slow. I think that's because .NET strings are internally represented with a 2-byte encoding, which is wasteful for this application. So I switched to plain bytes and an ugly goto-style state machine. Case conversion is a bitwise OR, and the character range check is done in a single comparison after a subtraction. I spent no effort optimizing the final sort, since I found it uses less than 0.1% of the runtime.

Fixed: the algorithm was essentially correct, but it over-reported total words by counting all prefixes of words. Since the total word count is not a requirement of the problem, I removed that output. I also adjusted the output so that all k words are printed. I eventually settled on using string.Join() and writing the whole list at once. Surprisingly, on my machine this is about a second faster than writing each of the 100k words individually.


1
Very impressive! I like your bitwise tolower and the single-comparison range check. However, I don't understand why your program reports more distinct words than expected. Also, per the original problem description, the program needs to output all k words in decreasing frequency order, so I didn't count your program towards the last test, which requires outputting the 100,000 most frequent words.
Andriy Makukha

@AndriyMakukha: I can see that I'm also counting word prefixes that never occur themselves in the final count. I avoided writing all the output because console output is quite slow on Windows. May I write the output to a file instead?
recursive

Please just print to standard output. For k = 10 it should be fast on any machine. You can also redirect the output into a file from the command line.
Andriy Makukha

@AndriyMakukha: I believe I've addressed all the issues. I found a way to produce all the required output without much runtime cost.
recursive

This outputs fast! Very nice. I modified your program to also print the frequency counts, like the other solutions do.
Andriy Makukha

1

Ruby 2.7.0-preview1 with tally

The latest version of Ruby has a new method called tally. From the release notes:

Enumerable#tally is added. It counts the occurrence of each element.

["a", "b", "c", "b"].tally
#=> {"a"=>1, "b"=>2, "c"=>1}

This almost solves the entire task for us. We just need to read the file first, and find the maxima at the end.

Here's the whole thing:

k = ARGV.shift.to_i

pp ARGF
  .each_line
  .lazy
  .flat_map { @1.scan(/[A-Za-z]+/).map(&:downcase) }
  .tally
  .max_by(k, &:last)

Edit: added k as a command-line argument

It can be run as ruby bentley.rb k input.txt using the 2.7.0-preview1 version of Ruby (see the sample run below). The preview can be downloaded from the various links on the release-notes page, or installed with rbenv using rbenv install 2.7.0-dev.

A sample run on my own aging computer:

$ time ruby bentley.rb 10 ulysses64 
[["the", 968832],
 ["of", 528960],
 ["and", 466432],
 ["a", 421184],
 ["to", 322624],
 ["in", 320512],
 ["he", 270528],
 ["his", 213120],
 ["i", 191808],
 ["s", 182144]]

real    0m17.884s
user    0m17.720s
sys 0m0.142s

1
I installed Ruby from source. It runs about as fast as on your machine (15 seconds vs. 17).
Andriy Makukha