How many characters per character?

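At http://shakespeare.mit.edu/ you can find the full text of each of Shakespeare's plays on a single page (e.g. Hamlet).

Write a script that reads the URL of a play from stdin, such as http://shakespeare.mit.edu/hamlet/full.html, and writes to stdout the number of text characters each play character spoke, sorted by who spoke the most.

The play/scene/act titles obviously do not count as dialogue, nor do the character names. Italicized text and [square-bracketed text] are not actual dialogue, so they should not be counted. Spaces and other punctuation within dialogue should be counted.

(The format of the plays looks very consistent, though I have not looked at all of them. Tell me if I have overlooked anything. Your script does not have to work for the poems.)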

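Example

Here is a simulated section from Much Ado About Nothing to show the output I expect:

Much Ado About Nothing

Scene 0.

Messenger

I will.

BEATRICE

Do.

LEONATO

You will never.

BEATRICE

No.

Expected output: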

LEONATO 15
Messenger 7
BEATRICE 6
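
The counting and sorting rule can be illustrated independently of any HTML parsing. The following is a minimal Python sketch, not a golfed entry; the function name `tally` and the hard-coded `sample` list are assumptions made for illustration, with the speeches taken from the simulated section above.

```python
from collections import Counter

def tally(speeches):
    """Sum the length of every speech per speaker, largest total first."""
    totals = Counter()
    for speaker, text in speeches:
        # Spaces and punctuation inside the dialogue count as characters.
        totals[speaker] += len(text)
    return totals.most_common()

# The simulated section, already reduced to (speaker, dialogue) pairs.
sample = [
    ("Messenger", "I will."),
    ("BEATRICE", "Do."),
    ("LEONATO", "You will never."),
    ("BEATRICE", "No."),
]

for name, count in tally(sample):
    print(name, count)
```

BEATRICE speaks twice, so her two three-character lines are added together before sorting.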

Scoring

This is code golf. The smallest program in bytes wins.

code-golf string counting

— Calvin's Hobbies

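What if someone did this challenge in Shakespeare? It would be amazing if that were even possible...

— fuandon, 2014

Can we assume that we have a list of the characters in the play, or must we infer the characters from the text? The latter is very difficult, given that some characters (e.g. Messenger) have names in mixed case. Others have names in capital letters only (e.g. LEONATO), and some of them are compound names.

— DavidC, 2014

Yes, you should infer the names. They are formatted very differently from the dialogue, so given the HTML, telling them apart should not be too tricky.

— Calvin's Hobbies, 2014

Should "All" be considered a separate character?

— es1024

@es1024 Yes. Any play character with a unique title is considered separate, even if the result does not exactly make sense.

— Calvin's Hobbies, 2014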

Answers:

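PHP (240 characters)

Splits the HTML into chunks (using /bl as the delimiter), then runs a couple of regular expressions to extract the speaker names and the words spoken. Stores the length of the words spoken in an array. Golfed: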

<?@$p=preg_match_all;foreach(explode('/bl',implode(file(trim(fgets(STDIN)))))as$c)if($p('/=s.*?b>(.*?):?</',$c,$m)){$p('/=\d.*?>(.*?)</',$c,$o);foreach($m[1]as$n)@$q[$n]+=strlen(implode($o[1]));}arsort($q);foreach($q as$n=>$c)echo"$n $c\n";

Ungolfed:

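<?php
// Read the URL from stdin and fetch the page as one string.
$html = implode(file(trim(fgets(STDIN))));
// Splitting on '/bl' cuts at every </blockquote>, so each chunk holds
// the speaker heading(s) followed by the lines of that speech.
$arr = explode('/bl',$html);
foreach($arr as $chunk){
    // Speaker names: <A NAME=speech...><b>NAME</b>, dropping a trailing colon.
    if(preg_match_all('/=s.*?b>(.*?):?</',$chunk,$matches)){
        $name = $matches[1];
        // Dialogue lines: anchors whose NAME starts with a digit.
        preg_match_all('/=\d.*?>(.*?)</',$chunk,$matches);
        // Credit the speech length to every speaker named in the chunk.
        foreach($name as $n)
            @$names[$n] += strlen(implode($matches[1]));
    }
}
// Sort by count, descending, keeping the names as keys.
arsort($names);
foreach($names as $name=>$count)
    echo "$name $count\n";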

Note: this treats "All" as a separate character.
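
For readers who do not use PHP, the same split-then-regex idea can be sketched in Python. This is an illustrative port under assumptions, not the answer's code: the function name `count_dialogue` and the `SAMPLE` markup are made up, with the markup shaped after the tags the answers on this page match against (`<A NAME=speechN><b>`, `<A NAME=1.1.1>`, `<blockquote>`). It reads a string rather than fetching a URL.

```python
import re
from collections import Counter

# Made-up markup in the shape the answers' patterns expect.
SAMPLE = """<A NAME=speech1><b>Messenger</b></a>
<blockquote>
<A NAME=1.1.1>I will.</A><br>
<p><i>Exit</i></p>
</blockquote>

<A NAME=speech2><b>BEATRICE</b></a>
<blockquote>
<A NAME=1.1.2>Do.</A><br>
</blockquote>

<A NAME=speech3><b>LEONATO</b></a>
<blockquote>
<A NAME=1.1.3>You will never.</A><br>
</blockquote>

<A NAME=speech4><b>BEATRICE</b></a>
<blockquote>
<A NAME=1.1.4>No.</A><br>
</blockquote>
"""

def count_dialogue(html):
    totals = Counter()
    # '/bl' only occurs in </blockquote>, so each chunk is one speech.
    for chunk in html.split("/bl"):
        names = re.findall(r"=s.*?b>(.*?):?<", chunk)
        if not names:
            continue
        # Anchors whose NAME starts with a digit carry the dialogue lines.
        lines = re.findall(r"=\d.*?>(.*?)<", chunk)
        spoken = len("".join(lines))
        for name in names:
            totals[name] += spoken
    return totals.most_common()

for name, count in count_dialogue(SAMPLE):
    print(name, count)
```

The italic stage direction is skipped because it has no numbered anchor, but bracketed text inside a dialogue line would still be counted, as in the PHP version. Python's `len` counts code points where PHP's `strlen` counts bytes, so totals can differ on non-ASCII text.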

Example:

$ php shakespeare.php <<< "http://shakespeare.mit.edu/hamlet/full.html"
HAMLET 60063
KING CLAUDIUS 21461
LORD POLONIUS 13877
HORATIO 10605
LAERTES 7519
OPHELIA 5916
QUEEN GERTRUDE 5554
First Clown 3701
ROSENCRANTZ 3635
Ghost 3619
MARCELLUS 2350
First Player 1980
OSRIC 1943
Player King 1849
GUILDENSTERN 1747
Player Queen 1220
BERNARDO 1153
Gentleman 978
PRINCE FORTINBRAS 971
VOLTIMAND 896
Second Clown 511
First Priest 499
Captain 400
Lord 338
REYNALDO 330
FRANCISCO 287
LUCIANUS 272
First Ambassador 230
First Sailor 187
Messenger 185
Prologue 94
All 94
Danes 75
Servant 49
CORNELIUS 45

— es1024

Please show some example output.

— DavidC

@DavidCarraher An example has been added.

— es1024, 2014

Rebol- 556 527

t: complement charset"<"d: charset"0123456789."m: map[]parse to-string read to-url input[any[(s: 0 a: copy[])some["<A NAME=speech"some d"><b>"copy n some t</b></a>(append a trim/with n":")some newline]<blockquote>newline any["<A NAME="some d">"copy q some t</a><br>newline(while[f: find q"["][q: remove/part f next find f"]"]s: s + length? trim head q)|<p><i>some t</i></p>newline][</blockquote>|</body>](foreach n a[m/:n: either none? m/:n[s][s + m/:n]])| skip]]foreach[x y]sort/reverse/skip/compare to-block m 2 2[print[x y]]

This could probably be golfed further, but it is unlikely to get below the answer(s) already provided :(

Ungolfed:

t: complement charset "<"
d: charset "0123456789."
m: map []

parse to-string read to-url input [
    any [
        (s: 0 a: copy [])

        some [
            "<A NAME=speech" some d "><b>" copy n some t </b></a>
            (append a trim/with n ":")
            some newline
        ]

        <blockquote> newline
        any [
            "<A NAME=" some d ">" copy q some t </a><br> newline (
                while [f: find q "["] [
                    q: remove/part f next find f "]"
                ]
                s: s + length? trim head q
            )
            | <p><i> some t </i></p> newline
        ]
        [</blockquote> | </body>]
        (foreach n a [m/:n: either none? m/:n [s] [s + m/:n]])

        | skip
    ]
]

foreach [x y] sort/reverse/skip/compare to-block m 2 2 [print [x y]]

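This program removes [square-bracketed text] and also trims surrounding whitespace from the dialogue. Without this, the output would be identical to es1024's answer.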
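
The effect of that clean-up on a single line can be sketched in Python. This is only an illustration of the two steps described (drop every bracketed span, then trim), not a translation of the Rebol parse rule; `spoken_length` is a hypothetical helper name and the input strings are invented.

```python
import re

def spoken_length(line):
    """Length of a dialogue line once [bracketed] spans are removed and the ends are trimmed."""
    without_brackets = re.sub(r"\[[^\]]*\]", "", line)
    return len(without_brackets.strip())

raw = "[Aside] Well said."
print(len(raw))            # counted as-is: 18
print(spoken_length(raw))  # bracket span and leading space dropped: 10
```

Differences of this kind, summed over a whole play, would account for the lower totals in the listing that follows.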

Example:

$ rebol -q shakespeare.reb <<< "http://shakespeare.mit.edu/hamlet/full.html"
HAMLET 59796
KING CLAUDIUS 21343
LORD POLONIUS 13685
HORATIO 10495
LAERTES 7402
OPHELIA 5856
QUEEN GERTRUDE 5464
First Clown 3687
ROSENCRANTZ 3585
Ghost 3556
MARCELLUS 2259
First Player 1980
OSRIC 1925
Player King 1843
GUILDENSTERN 1719
Player Queen 1211
BERNARDO 1135
Gentleman 978
PRINCE FORTINBRAS 953
VOLTIMAND 896
Second Clown 511
First Priest 499
Captain 400
Lord 338
REYNALDO 312
FRANCISCO 287
LUCIANUS 269
First Ambassador 230
First Sailor 187
Messenger 185
Prologue 89
All 76
Danes 51
Servant 49
CORNELIUS 45

— draegtun

Common Lisp - 528

(use-package :plump)(lambda c(u &aux(h (make-hash-table))n r p)(traverse(parse(drakma:http-request u))(lambda(x &aux y)(case p(0(when(and n(not(ppcre:scan"speech"(attribute x"NAME"))))(setf r t y(#1=ppcre:regex-replace-all"aside: "(#1#"^(\\[[^]]*\\] |\\s*)"(text x)"")""))(dolist(w n)(incf(gethash w h 0)(length y)))))(1(if r(setf n()r()))(push(intern(text(aref(children x)0)))n)))):test(lambda(x)(and(element-p x)(setf p(position(tag-name x)'("A""b"):test #'string=)))))(format t"~{~a ~a~^~%~}"(alexandria:hash-table-plist h)))

Explanation

This is a slightly modified version, with added printing of debugging information (see the paste).

(defun c (u &aux
                 (h (make-hash-table)) ;; hash-table
                 n ;; names of the current speaker(s)
                 r p
                 )
      (traverse                 ;; traverse the DOM generated by ...
       (parse                   ;; ... parsing the text string
        (drakma:http-request u) ;; ... resulting from http-request to link U
        )

       ;; call the following lambda for each traversed element
       (lambda (x &aux y)
         (case p
           (0 ;a
            (when(and n(not(alexandria:starts-with-subseq"speech"(attribute x "NAME"))))
              (setf r t)
              (setf y(#1=ppcre:regex-replace-all"aside: "(#1#"^(\\[[^]]*\\] |\\s*)"(text x)"")""))
              (format t "~A ~S~%" n y) ;; debugging
              (dolist(w n)
                (incf
                    (gethash w h 0) ;; get values in hash, with default value 0
                    (length y)))) ;; length of text
            )
           (1 ;b
            (if r(setf n()r()))
            (push (intern (text (aref (children x)0)))n))))

       ;; but only for elements that satisfy the test predicate
       :test
       (lambda(x)
         (and (element-p x) ;; must be an element node
              (setf p(position(tag-name x)'("A""b"):test #'string=)) ;; either <a> or <b>; save result of "position" in p
              )))

        ;; finally, iterate over the elements of the hash table, as a
        ;; plist, i.e. a list of alternating key values (k1 v1 k2 v2 ...),
        ;; and print them as requested. ~{ ~} is an iteration control format.
  (format t "~&~%~%TOTAL:~%~%~{~a ~a~^~%~}" (alexandria:hash-table-plist h)))

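Notes

I remove bracketed text as well as occurrences of "aside: " that are not inside brackets (I also trim whitespace characters). Here is a trace of execution, with the text being matched and the total for each character, for Hamlet.
As in the other answers, All is treated as a character. It could be tempting to add the value of All to every other character, but that would be incorrect, since "All" refers to the characters actually present on stage, which would require keeping track of who is present (following the "exit", "exeunt" and "enter" directions). This is not done.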
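
The two substitutions from the first note can be mirrored in Python to see what they do to a line. This is a sketch under assumptions: `clean` is a hypothetical helper name, the inputs are invented, and the patterns are copied from the Lisp code above rather than checked against the play text.

```python
import re

def clean(text):
    # At the start of the text, drop either "[...] " or leading whitespace.
    text = re.sub(r"^(\[[^\]]*\] |\s*)", "", text)
    # Then drop every literal "aside: " that is left.
    return text.replace("aside: ", "")

print(clean("[Aside] Well said."))  # Well said.
print(clean("  aside: hush"))       # hush
print(clean("Go [Exit] now"))       # unchanged: Go [Exit] now
```

Because the first pattern is anchored with `^`, a bracketed span in the middle of a line is left in place.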

— coredump