What language is this word?


16

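You should write a program or function that determines the language of a given word.

The task is to recognize some of the 5000 most common words in 4 languages:

  • English
  • German
  • Italian
  • Hungarian

The word lists can be found in this GitHub repository.

You are allowed to make mistakes in 40% of the provided test cases, i.e. you can miscategorize 8000 of the 20000 inputs.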

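Details

  • The lists only contain words made of the lowercase letters a-z, so e.g. won't and möchte are not included.
  • A few words appear in multiple languages, which means your code can't always guess the expected output correctly.
  • For convenience you can download all the test cases as one list. In each line a number indicates the language of the word. (1 for English, 2 for German, 3 for Italian and 4 for Hungarian.)
  • Standard loopholes are disallowed.
  • Using word lists or similar data provided by your programming language is forbidden.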
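As a rough illustration of the 40% error budget, here is a minimal Python sketch of a scorer. It assumes each line of the combined test-case list has the form `word number` separated by whitespace (the exact layout of the file is an assumption), and the names `accuracy`, `sample` and `always_english` are hypothetical. The sample data is a toy list, not taken from the real word lists.

```python
from typing import Callable, Iterable


def accuracy(lines: Iterable[str], classify: Callable[[str], int]) -> float:
    """Return the fraction of test cases that `classify` labels correctly."""
    total = 0
    correct = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, label = parts[0], int(parts[1])
        total += 1
        if classify(word) == label:
            correct += 1
    return correct / total if total else 0.0


# Toy data (assumed format: word, then language number 1-4).
sample = ["the 1", "und 2", "che 3", "egy 4", "in 1", "in 2"]


def always_english(word: str) -> int:
    """Trivial baseline classifier that always answers 1 (English)."""
    return 1


score = accuracy(sample, always_english)
print(f"{score:.2%}", "passes" if score >= 0.6 else "fails")
```

Note that a word listed under two languages (like `in` in the toy data) counts as two test cases, so at most one of them can be answered correctly.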

Input

  • A string containing only lowercase English letters (a-z).
  • A trailing newline is optional.

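Output

  • You can categorize the words by providing a distinct and consistent (always the same) output for each language. (E.g. 1 for English, 2 for German, 3 for Italian and 4 for Hungarian.)

This is code golf, so the shortest program or function wins.

Related code golf question: Is this even a word?

The word lists were taken from wiktionary.org and 101languages.net.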


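Are you sure the lists are correct? I'm pretty sure I have never heard some of these words in German. Does outputting an array with all possible languages count? For example, a word that apparently exists in all the languages would give {1,2,3,4}.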
Eumel

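@Eumel The first few English words may appear somewhere in the other lists, because there can be English phrases in the texts that were used to generate the word lists. You can only categorize an input into one language. (Which is what the question means by "your code can't always guess the expected output correctly".)
randomra, 2016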

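"The lists only contain words with lowercase letters" ... that is not entirely true. The all_languages file contains dozens of capitalized words (Mr, Gutenberg, etc.) as well as the non-words "" (empty string) and "]] |-". I assume it is fine to lowercase the former and delete the latter?
squeamish ossifrage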

@squeamishossifrage Thanks for the catch. Updated the English lists. (There were ~60

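Why remove the diacritics? If the goal is to distinguish languages without diacritics, why not use languages that have no diacritics in the first place?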

Answers:


9

Retina, 51 bytes

.*[aeio]$
1
A`en$|ch|ei|au
^$
2
A`[jkz]|gy|m$
\D+
4

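I came up with the regexes, and @MartinBüttner did the conversion to and golfing in Retina, so... hurray for team effort?

The mapping is 1 -> Italian, 2 -> German, (empty) -> Hungarian, 4 -> English, with the number correctly classified in each category being 4506 + 1852 + 2092 + 3560 = 12010. That is about 60.05% of the 20000 test cases, just above the required 60%.

Try it online! | Modified multiline version

Explanation

First off, the equivalent Python is something like this: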

import re
def f(s):
  if re.search("[aeio]$", s):
    return 1
  if re.search("en$|ch|ei|au", s):
    return 2
  if re.search("[jkz]|gy|m$", s):
    return ""
  return 4

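Let me just say that o$ is an excellent indicator of Italian.

The Retina version is similar, with pairs of lines forming a replacement stage. For example, the first two lines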

.*[aeio]$
1

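replace matches of the first line with the content of the second line.

The next three lines do the same, but using Retina's anti-grep mode. Anti-grep (specified with A`) removes the line if it matches the given regex, and the following two lines are a replacement from an empty line to the desired output.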

A`en$|ch|ei|au
^$
2

The following line uses anti-grep again, but does not replace the resulting empty line, which gives the fixed (empty) output for Hungarian.

A`[jkz]|gy|m$

Finally, the last two lines

\D+
4

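replace a non-empty, non-digit line with 4. Each of these replacements can only happen if none of the previous ones was triggered, which simulates an if/else if chain.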
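To see why the stages behave like an if/else if chain, here is a rough Python sketch that mimics each Retina stage with the `re` module. It is only an approximation of the semantics described above, not Retina itself, and the function name `retina_stages` is made up for illustration. It assumes a single input word with no trailing newline.

```python
import re


def retina_stages(word: str) -> str:
    """Mimic the five Retina stages on a single lowercase word."""
    # Stage 1: replace the whole word with "1" if it ends in a, e, i or o.
    s = re.sub(r".*[aeio]$", "1", word)
    # Stage 2: anti-grep, drop the line (leave it empty) if it matches.
    s = "" if re.search(r"en$|ch|ei|au", s) else s
    # Stage 3: turn the now-empty line into "2".
    s = re.sub(r"^$", "2", s)
    # Stage 4: anti-grep again; the empty line is kept as the Hungarian output.
    s = "" if re.search(r"[jkz]|gy|m$", s) else s
    # Stage 5: anything still made of letters becomes "4".
    s = re.sub(r"\D+", "4", s)
    return s


for w in ["ciao", "machen", "jazz", "word", "auto"]:
    print(w, repr(retina_stages(w)))
```

Once a stage has produced a digit, none of the later regexes can match it, because they all look for letters or for an empty line. An empty line produced by stage 4 is likewise untouched by `\D+`. That is what makes the order of the stages act as a priority order: `auto` contains `au` but is already `1` by the time the German stage runs.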


1

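LabVIEW, 29 LabVIEW Primitives and 148.950 Bytes

Cycles through the languages and puts the iterator into an array if the word is in that language's list. This is checked by the inner loop, which picks the i-th line and does =. In LabVIEW, = only gives true if the strings are exactly the same.

Then the first element of the output array is taken, so English takes priority over the rest.

The output for now is 0 for English, 1 for German, 2 for Italian and 3 for Hungarian.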
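Since LabVIEW code is graphical, the lookup described above can be sketched in Python roughly as follows. This is an illustrative reading of the description, not a transcription of the actual diagram: the function name `first_language`, the toy word lists and the `None` result for unknown words are all assumptions.

```python
from typing import List, Optional


def first_language(word: str, word_lists: List[List[str]]) -> Optional[int]:
    """Return the index of the first list containing `word`, or None.

    `word_lists` is assumed to be ordered English, German, Italian, Hungarian.
    """
    hits = []
    for i, words in enumerate(word_lists):        # outer loop: languages
        for candidate in words:                   # inner loop: lines of the list
            if candidate == word:                 # exact string equality
                hits.append(i)
                break
    return hits[0] if hits else None              # first hit wins, so English has priority


# Toy lists, not the real 5000-word lists.
toy_lists = [["the", "in"], ["und", "in"], ["che"], ["egy"]]
print(first_language("in", toy_lists))
print(first_language("egy", toy_lists))
```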


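I'm not familiar with LabView, but how do you store the values (word lists), and how are they reflected in the LabView Primitives count? The meta entry says: "Constants: strings are 1 LabVIEW Primitive per character". Wouldn't that increase the primitive count considerably?
insertusernamehere

I load a file: directory path + build path with string + load file. Storage is done internally and passed along by the wires.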
Eumel '02

5
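I might be wrong, but I think the core of the challenge is how to compress/store the word lists. So loading from an external file might not be allowed. Will ask the OP. :)
insertusernamehere, 2016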

2
If you use an external file, its size should be added to your code size, since it is part of your solution.
randomra

I was under the impression that those were supposed to be given, but I'll add them, no problem.

1

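Java, 3416 bytes, 62%

This is my solution: I analyze the lists of given words and find the 60 most common bigrams and trigrams for each language. Then I check these n-grams against the word and choose the language with the most n-grams occurring in the word (a trigram hit counts 2 points, a bigram hit 1 point).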

public class Classificator {

    String[][] triGr = {
            {"ing","ion","ent","tio","ted","nce","ter","res","ati","con","ess","ate","pro","ain","est","ons","men","ect","red","rea","com","ere","ers","nte","ine","her","ble","ist","tin","for","per","der","ear","str","ght","pre","ver","int","nde","the","igh","ive","sta","ure","end","enc","ned","ste","dis","ous","all","and","anc","ant","oun","ten","tra","are","sed","cti"},
            {"sch","che","ver","gen","ten","cht","ich","ein","ste","ter","hen","nde","nge","ach","ere","ung","den","sse","ers","and","eit","ier","ren","sen","ges","ang","ben","rei","est","nen","nte","men","aus","der","ent","hei","her","lle","ern","ert","uch","ine","ehe","auf","lie","tte","ige","ing","hte","mme","end","wei","len","hre","rau","ite","bes","ken","cha","ebe"},
            {"ent","are","ato","nte","ett","ere","ion","chi","con","one","men","nti","gli","pre","ess","att","tto","par","per","sta","tra","zio","and","iam","end","ter","res","est","nto","tta","acc","sci","cia","ver","ndo","amo","ant","str","tro","ssi","pro","era","eri","nta","der","ate","ort","com","man","tor","rat","ell","ale","gio","ont","col","tti","ano","ore","ist"},
            {"sze","ere","meg","ett","gye","ele","ond","egy","enn","ott","tte","ete","unk","ban","tem","agy","zer","esz","tet","ara","nek","hal","dol","mon","art","ala","ato","szt","len","men","ben","kap","ent","min","ndo","eze","sza","isz","fog","kez","ind","ten","tam","nak","fel","ene","all","asz","gon","mar","zem","szo","tek","zet","elm","het","eve","ssz","hat","ell"}

                    };
    static String[][] biGr = {
        {"in","ed","re","er","es","en","on","te","ng","st","nt","ti","ar","le","an","se","de","at","ea","co","ri","ce","or","io","al","is","it","ne","ra","ro","ou","ve","me","nd","el","li","he","ly","si","pr","ur","th","di","pe","la","ta","ss","ns","nc","ll","ec","tr","as","ai","ic","il","us","ch","un","ct"},
        {"en","er","ch","te","ge","ei","st","an","re","in","he","ie","be","sc","de","es","le","au","se","ne","el","ng","nd","un","ra","ar","nt","ve","ic","et","me","ri","li","ss","it","ht","ha","la","is","al","eh","ll","we","or","ke","fe","us","rt","ig","on","ma","ti","nn","ac","rs","at","eg","ta","ck","ol"},
        {"re","er","to","ar","en","te","ta","at","an","nt","ra","ri","co","on","ti","ia","or","io","in","st","tt","ca","es","ro","ci","di","li","no","ma","al","am","ne","me","le","sc","ve","sa","si","tr","nd","se","pa","ss","et","ic","na","pe","de","pr","ol","mo","do","so","it","la","ce","ie","is","mi","cc"},
        {"el","en","sz","te","et","er","an","me","ta","on","al","ar","ha","le","gy","eg","re","ze","em","ol","at","ek","es","tt","ke","ni","la","ra","ne","ve","nd","ak","ka","in","am","ad","ye","is","ok","ba","na","ma","ed","to","mi","do","om","be","se","ag","as","ez","ot","ko","or","cs","he","ll","nn","ny"}

                    };

    public int guess(String word) {

        if (word.length() < 3) {
            return 4; // most words shorter than 3 characters on the list are Hungarian
        }
        int score[] = { 0, 0, 0, 0 };
        for (int i = 0; i < 4; i++) {
            for (String s : triGr[i]) {
                if (word.contains(s)) {
                    score[i] = score[i] + 2;
                }
            }
            for (String s : biGr[i]) {
                if (word.contains(s)) {
                    score[i] = score[i] + 1;
                }
            }
        }
        int v = -1;
        int max = 0;
        for (int i = 0; i < 4; i++) {
            if (score[i] > max) {
                max = score[i];
                v = i;
            }
        }
        v++;
        return v==0?Math.round(4)+1:v;
    }
}
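The answer does not show how the n-gram tables were produced. A minimal Python sketch of one way to extract the most common n-grams from a word list might look like this. It assumes every occurrence is counted, including repeats within a word, which may differ from what the author actually did; `top_ngrams` and the toy list are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def top_ngrams(words: Iterable[str], n: int, k: int = 60) -> List[str]:
    """Return the k most frequent substrings of length n across all words."""
    counts: Counter = Counter()
    for word in words:
        # Slide a window of width n over the word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(k)]


# Toy word list, not taken from the real data.
toy = ["singing", "ringing", "bring"]
print(top_ngrams(toy, 3, k=3))
```

Running this once per language with n=2 and n=3 would give tables of the same shape as `biGr` and `triGr` above.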

And this is my test harness:

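import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Test {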

    Map<String, List<Integer>> words = new HashMap<String, List<Integer>>();

    boolean validate(String word, Integer lang) {
        List<Integer> langs = words.get(word);
        return langs.contains(lang);
    }

    public static void main(String[] args) throws FileNotFoundException {

        FileReader reader = new FileReader("list.txt");
        BufferedReader buf = new BufferedReader(reader);
        Classificator cl = new Classificator();
        Test test = new Test();
        buf.lines().forEach(x -> test.process(x));
        int guess = 0, words = 0;
        for (String word : test.words.keySet()) {
            int lang = cl.guess(word);
            if (lang==0){
                continue;
            }
            boolean result = test.validate(word, lang);
            words++;
            if (result) {
                guess++;
            }
        }
        System.out.println(guess+ " "+words+ "    "+(guess*100f/words));
    }

    private void process(String x) {
        String arr[] = x.split("\\s+");
        String word = arr[0].trim();
        List<Integer> langs = words.get(word);
        if (langs == null) {
            langs = new ArrayList<Integer>();
            words.put(word, langs);
        }
        langs.add(Integer.parseInt(arr[1].trim()));

    }

}