Build a readability index


13

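The Flesch-Kincaid readability algorithm depends on the measures of word count and syllable count, neither of which is entirely objective or easily automated on a computer. For example, does "code-golf", with the hyphen, count as one word or two? Is the word "million" two syllables or three? In this task you will need to approximate, since counting exactly would take too much time, space and, most importantly, code.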

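Your task is to build the smallest possible program (i.e. a function), in any language, that takes an English reading passage (assumed to consist of complete sentences) and calculates the Flesch Reading Ease index to within a tolerance of eight points (to account for variations in syllable counting and word counting). The index is calculated as follows: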

FRE = 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word)

Your program must agree with the reference passages below, whose indices were calculated using manual counts:

I would not, could not, in the rain.
Not in the dark, not on a train.
Not in a car, not in a tree.
I do not like them, Sam, you see.
Not in a house, not in a box.
Not with a mouse, not with a fox.
I will not eat them here or there.
I do not like them anywhere!

Index: 111.38 (64 syllables in 62 words in 8 sentences)

It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape
the vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

Index: 65.09 (74 syllables in 55 words in 2 sentences)

When in the Course of human events, it becomes necessary for one people to
dissolve the political bands which have connected them with another, and to
assume among the powers of the earth, the separate and equal station to
which the Laws of Nature and of Nature's God entitle them, a decent respect
to the opinions of mankind requires that they should declare the causes
which impel them to the separation.

Index: 3.70 (110 syllables in 71 words in 1 sentence)

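If you have any other passages for which you have manually counted the syllables and words and calculated the index, you may show them as verification.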
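The formula can be checked against the hand counts listed for the reference passages. The following is a minimal Python sketch; the function name flesch_reading_ease and its parameter names are illustrative choices, not part of the challenge.

```python
def flesch_reading_ease(syllables, words, sentences):
    """Flesch Reading Ease from raw counts, using the formula in the challenge."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hand counts quoted for the three reference passages:
# (syllables, words, sentences, listed index)
reference_counts = [
    (64, 62, 8, 111.38),
    (74, 55, 2, 65.09),
    (110, 71, 1, 3.70),
]

for syllables, words, sentences, listed in reference_counts:
    computed = flesch_reading_ease(syllables, words, sentences)
    print(f"computed {computed:.2f}, listed {listed:.2f}")
```

The second and third passages reproduce the listed indices to within rounding. The counts listed for the first passage give roughly 111.64, a little above the listed 111.38 but far inside the eight-point tolerance.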


Can it be a function? Or does it have to take STDIN?
Brigand

2
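Do you have the syllable counts for the 3 example passages available, or just the index? If you have them, the syllable counts would be handy for comparison.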
Strigoides

It can be a function. In fact, it should be a function.
Joe Z.

Answers:


6

Perl, 120 bytes

#!perl -pa0
s@\w+|([.!?])@$s+=$#-,lc($&)=~s![aeiou]+\B|([aeiouy]$)!$y+=1-$#-/3!ger@ge}
{$_=206.835-1.015*@F/$s-84.6*$y/@F

Sample I/O:

$ perl flesch-kincaid.pl < input1.dat
110.730040322581

$ perl flesch-kincaid.pl < input2.dat
65.6097727272728

$ perl flesch-kincaid.pl < input3.dat
1.71366197183096

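Syllables are counted by assuming that each vowel cluster is a single syllable, except for lone vowels at the end of a word, which count as only two thirds of a syllable. This heuristic seems to be fairly accurate.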
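For readers who do not speak Perl, the heuristic can be sketched in Python. This is a rough port that reuses the same inner regular expression and assumes plain ASCII input containing at least one sentence terminator; the names syllable_weight and flesch_estimate are made up for this illustration.

```python
import re


def syllable_weight(word):
    """Weighted syllable estimate for a single word."""
    total = 0.0
    # First alternative: a run of vowels followed by another word character.
    # Second alternative (group 1): a lone vowel or y at the very end of the word.
    for match in re.finditer(r"[aeiou]+\B|([aeiouy]$)", word.lower()):
        total += 2 / 3 if match.group(1) else 1.0
    return total


def flesch_estimate(text):
    words = len(text.split())                    # whitespace-separated fields
    sentences = len(re.findall(r"[.!?]", text))  # every ., ! or ? counts once
    syllables = sum(syllable_weight(w) for w in re.findall(r"\w+", text))
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables / words
```

With this sketch, "rain" weighs one syllable and "the" weighs two thirds, since its only vowel is the final letter.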


3

K&R C - 188 characters (previously 196, 199, 229)

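Changing the spec to ask for a function lets me drop a lot of the C overhead. I have also switched to Strigoides' syllable counting hack, which is better than my formula tweak, and extended it to deal with the overcounting of words.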

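After I found a shorter way to do the vowel detection (sadly based on stdchr), I had an incentive to squeeze a few more characters out of the bit-twiddling abomination I had been using, so that I didn't have to be boring.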

d,a,v,s,t,w;float R(char*c){for(;*c;++c){s+=*c=='.';if(isalpha(*c)){
w+=!a++;d=(*c&30)>>1;if(*c&1&(d==7|((!(d&1))&(d<6|d>8)))){t+=!v++;}
else v=0;}else v=a=0;}return 206.835-1.*w/s-82.*t/w;}

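The logic here is a simple state machine. It counts sentences by periods only, words by strings of alphabetic characters, and syllables as strings of vowels (including y).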

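To get it to come up with the right figures I had to fudge the constants a little, but I have borrowed Strigoides' trick of simply undercounting the syllables by a fixed fraction.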
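The state machine described above can be sketched in Python for readability. This is an illustration and not the golfed code: it uses a plain set lookup instead of the bit test, assumes ASCII input with at least one period, and the name readability_state_machine is invented here.

```python
import string

VOWELS = set("aeiouyAEIOUY")
LETTERS = set(string.ascii_letters)


def readability_state_machine(text):
    sentences = words = syllables = 0
    in_word = False    # previous character was a letter
    in_vowels = False  # previous character was a vowel
    for ch in text:
        if ch == ".":
            sentences += 1          # only periods end a sentence
        if ch in LETTERS:
            if not in_word:
                words += 1          # first letter after a non-letter starts a word
            in_word = True
            if ch in VOWELS:
                if not in_vowels:
                    syllables += 1  # first vowel of a vowel run starts a syllable
                in_vowels = True
            else:
                in_vowels = False
        else:
            in_word = in_vowels = False
    # Tweaked constants from the golfed function (the formula has 1.015 and 84.6).
    return 206.835 - 1.0 * words / sentences - 82.0 * syllables / words
```

For the input "The cat sat." the sketch counts three words, one sentence and three syllables.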

Ungolfed, with comments and some debugging tools:

#include <ctype.h>
#include <stdlib.h>
#include <stdio.h>
d,a,/* last character was alphabetic */
  v,/* last character was a vowel */
  s, /* sentences counted by periods */
  t, /* syllables counted by non-consecutive vowels */
  w; /* words counted by non-letters after letters */
float R/*eadability*/(char*c){
  for(;*c;++c){
    s+=*c=='.';
    if(isalpha(*c)){ /* a letter might mark the start of a word or a
               vowel string */
      w+=!a++; /* It is only the start of a word if the last character
              wasn't a letter */
      /* Extract the four bits of the character that matter in determining
       * vowelness because a vowel might mark a syllable */
      d=(*c&30)>>1;
      if( *c&1  & ( d==7 | ( (!(d&1)) & (d<6|d>8) ) ) 
      ) { /* These bits 7 or even and not 6, 8 make for a
         vowel */
        printf("Vowel: '%c' (mangled as %d [0x%x]) counts:%d\n",*c,d,d,!v);
        t+=!v++;
      } else v=0; /* Not a vowel so set the vowel flag to zero */
    }else v=a=0; /* this input not alphabetic, so set both the
            alphabet and vowel flags to zero... */
  }
  printf("Syllables: %3i\n",t);
  printf("Words:     %3i       (t/w) = %f\n",w,(1.0*t/w));
  printf("Sentences: %3i       (w/s) = %f\n",s,(1.0*w/s));
  /* Constants tweaked here due to bad counting behavior ...
   * were:       1.015     84.6 */
  return 206.835-1.   *w/s-82. *t/w;
}
main(c){
  int i=0,n=100;
  char*buf=malloc(n);
  /* Suck in the whole input at once, using a dynamic array for storage */
  while((c=getc(stdin))!=-1){
    if(i==n-1){ /* Leave room for the termination */
      n*=1.4;
      buf=realloc(buf,n);
      printf("Reallocated to %d\n",n);
    }
    buf[i++]=c;
    printf("%c %c\n",c,buf[i-1]);
  }
  /* Be sure the string is terminated */
  buf[i]=0;
  printf("'%s'\n",buf);
  printf("%f\n",R/*eadability*/(buf));
}

Output: (using the scaffold from the long version, but the golfed function.)

$ gcc readability_golf.c
readability_golf.c:1: warning: data definition has no type or storage class
$ ./a.out < readability1.txt 
'I would not, could not, in the rain.
Not in the dark, not on a train.
Not in a car, not in a tree.
I do not like them, Sam, you see.
Not in a house, not in a box.
Not with a mouse, not with a fox.
I will not eat them here or there.
I do not like them anywhere!
'
104.074631    
$ ./a.out < readability2.txt
'It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape
the vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.
'
63.044090
$ ./a.out < readability3.txt 
'When in the Course of human events, it becomes necessary for one people to
dissolve the political bands which have connected them with another, and to
assume among the powers of the earth, the separate and equal station to
which the Laws of Nature and of Nature's God entitle them, a decent respect
to the opinions of mankind requires that they should declare the causes
which impel them to the separation.
'
-1.831667

Deficiencies:

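  • The sentence counting logic is wrong, but I get away with it because only one of the inputs has a ! or a ?.
  • The word counting logic will treat contractions as two words.
  • The syllable counting logic will treat those same contractions as one syllable. It probably overcounts on average, though (for instance, there is counted as two syllables, and many words ending in e will be counted one too many), so I have applied a constant correction factor of 96.9%.
  • Assumes an ASCII character set.
  • I believe the vowel detection will admit [ and {, which clearly isn't right.
  • Lots of reliance on K&R semantics makes this ugly, but hey, it's code golf.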

Things to look at:

  • I am (momentarily) ahead of both Python solutions here, even if I am trailing the Perl.

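  • Get a load of the horrible thing I did for detecting vowels. It makes some sense if you write the ASCII representations out in binary and read the comment in the long version.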
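The bit test can be tried out in isolation. Below is a Python transcription of the C expression, offered as a sketch; the helper name is_vowel_bits is hypothetical, and the result is only meaningful for ASCII letters, since the C code guards the test with isalpha.

```python
import string


def is_vowel_bits(ch):
    """Python transcription of the C bit test; meaningful for ASCII letters only."""
    code = ord(ch)
    d = (code & 30) >> 1  # bits 1-4 of the character code
    low_bit = code & 1    # a, e, i, o, u and y all have odd character codes
    pattern = d == 7 or (d % 2 == 0 and (d < 6 or d > 8))
    return bool(low_bit and pattern)


# Low five bits: a=00001, e=00101, i=01001, o=01111, u=10101, y=11001,
# which gives d = 0, 2, 4, 7, 10, 12 respectively.
flagged = "".join(c for c in string.ascii_lowercase if is_vowel_bits(c))
print(flagged)  # prints "aeiouy"
```

Upper-case letters share the same low five bits, so they are classified the same way.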


"I had to alter the formula by hand to get acceptable results." That might be bad form.
Joe Z.

1
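I have now at least followed Strigoides' lead and made the adjustments based on an understanding of where the text analysis goes wrong, rather than tweaking purely ad hoc to bring the three test cases into agreement.
dmckee --- ex-moderator kitten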

2

Python, 167 characters (previously 202, 194, 188, 184, 171)

import re
def R(i):r=re.split;w=len(r(r'[ \n]',i));s=r('\\.',i);y=r('[^aeiou](?i)+',i);return 206.835-1.015*w/(len(s)-s.count('\n'))-84.6*(len(y)-y.count(' ')-2)*.98/w

First, we get the total number of words by splitting along spaces and newlines:

w=len(r(r'[ \n]',i))

Then, the formula. The sentence and syllable counts are each used only once, so they are embedded in this expression.

Sentences are simply the input split along ., with newlines filtered out:

s=r('\\.',i);s=len(s)-s.count('\n')

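Syllables consist of the input split along non-vowels, with spaces removed. This seems to consistently overestimate the number of syllables slightly, so we need to adjust it down (a factor of about .98 seems to do it):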

y=r('[^aeiou](?i)+',i);y=len(y)-y.count(' ')-2;
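The three steps above can be put together in an ungolfed form. This Python 3 sketch is a readable approximation and not a character-for-character equivalent of the golfed function: it counts vowel runs with re.findall instead of splitting on non-vowels, and it drops any whitespace-only piece when counting sentences. The name flesch_split_sketch is invented for this example.

```python
import re


def flesch_split_sketch(text):
    # Words: pieces obtained by splitting on spaces and newlines.
    words = len(re.split(r"[ \n]", text.strip()))
    # Sentences: pieces between periods, ignoring empty or whitespace-only pieces.
    sentences = len([piece for piece in text.split(".") if piece.strip()])
    # Syllables: runs of vowels, scaled down by the .98 fudge factor.
    clusters = len(re.findall(r"[aeiou]+", text, flags=re.IGNORECASE))
    syllables = clusters * 0.98
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables / words
```

For "The cat sat." this counts three words, one sentence and three vowel runs (2.94 syllables after scaling).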

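202 -> 194: len(x)-2 rather than len(x[1:-1]). Removed unnecessary brackets. Made the syllable regex case-insensitive.

194 -> 188: The file was previously saved in DOS rather than Unix file format, causing wc -c to count newlines as two characters. Whoops.

188 -> 184: Got rid of those nasty x for x in ... if x!=... expressions by storing the intermediate result and subtracting x.count(...).

184 -> 171: Removed input/output and converted to a function.

171 -> 167: Inserted the len(x)-x.count(...) expressions into the formula.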


Your answer doesn't have to include the input and output procedures.
Joe Z.

@JoeZeng Oh, okay. I'll turn it into a function then.
Strigoides

1

Python, 380 characters

import re
def t(p):
 q=lambda e: e!=''
 w=filter(q,re.split('[ ,\n\t]',p))
 s=filter(q,re.split('[.?!]',p))
 c=len(w)*1.0
 f=c/len(s)
 return w,f,c
def s(w):
 c= len(re.findall(r'([aeiouyAEIOUY]+)',w))
 v='aeiouAEIOU'
 if len(w)>2 and w[-1]=='e'and w[-2]not in v and w[-3]in v:c-= 1
 return c
def f(p):
 w,f,c=t(p)
 i=0
 for o in w:
  i+=s(o)
 x=i/c
 return 206.835-1.015*f-84.6*x

This is a rather long solution, but it works well enough, at least on the 3 test cases.
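The silent-e adjustment inside s can be looked at on its own. The following Python 3 sketch restates that function with longer names (count_syllables is a name chosen for this illustration) so the rule is easier to follow.

```python
import re

VOWELS = "aeiouAEIOU"


def count_syllables(word):
    # Every run of vowels (y included) starts out as one syllable.
    count = len(re.findall(r"[aeiouyAEIOUY]+", word))
    # A final "e" preceded by a non-vowel that itself follows a vowel
    # is treated as silent, e.g. the "e" in "like" or "house".
    if (len(word) > 2 and word[-1] == "e"
            and word[-2] not in VOWELS and word[-3] in VOWELS):
        count -= 1
    return count
```

Under this rule "like" and "house" count as one syllable each, "the" keeps its single syllable because the letter two places before the final e is not a vowel, and "anywhere" counts as three.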

Test code

def test():
 test_cases=[['I would not, could not, in the rain.\
        Not in the dark, not on a train.\
        Not in a car, not in a tree.\
        I do not like them, Sam, you see.\
        Not in a house, not in a box.\
        Not with a mouse, not with a fox.\
        I will not eat them here or there.\
        I do not like them anywhere!', 111.38, 103.38, 119.38],\
        ['It was a bright cold day in April, and the clocks were striking thirteen.\
        Winston Smith, his chin nuzzled into his breast in an effort to escape\
        the vile wind, slipped quickly through the glass doors of Victory Mansions,\
        though not quickly enough to prevent a swirl of gritty dust from entering\
        along with him.', 65.09, 57.09, 73.09],\
        ["When in the Course of human events, it becomes necessary for one people to\
        dissolve the political bands which have connected them with another, and to\
        assume among the powers of the earth, the separate and equal station to\
        which the Laws of Nature and of Nature's God entitle them, a decent respect\
        to the opinions of mankind requires that they should declare the causes\
        which impel them to the separation.", 3.70, -4.30, 11.70]]
 for case in test_cases:
  fre= f(case[0])
  print fre, case[1], (fre>=case[2] and fre<=case[3])

if __name__=='__main__':
 test()

Results:

elssar@elssar-laptop:~/code$ python ./golf/readibility.py
108.910685484 111.38 True
63.5588636364 65.09 True
-1.06661971831 3.7 True

I used the syllable counter from here as a starting point - Counting syllables

A more readable version is available here


1
if len(w)>2 and w[-1]=='e'and w[-2]not in v and w[-3]in v:c-= 1 Simple-minded, but a good approximation. I like it.
dmckee --- ex-moderator kitten

0

Javascript, 191 bytes

t=prompt(q=[]);s=((t[m="match"](/[!?.]+/g)||q)[l="length"]||1);y=(t[m](/[aeiouy]+/g)||q)[l]-(t[m](/[^aeiou][aeiou][s\s,'.?!]/g)||q)[l]*.33;w=(t.split(/\s+/g))[l];alert(204-1.015*w/s-84.5*y/w)

The first test case gives 112.9 (the correct answer is 111.4, off by 1.5 points)

The second test case gives 67.4 (the correct answer is 65.1, off by 2.3 points)

The third test case gives 1.7 (the correct answer is 3.7, off by 2.0 points)

Licensed under cc by-sa 3.0 with attribution required.