Onion, or Not Onion?


11

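The Onion (warning: many articles are NSFW) is a satirical news organization that parodies traditional news media. In 2014, The Onion launched ClickHole (warning: also frequently NSFW), a satirical news website that parodies "clickbait" sites such as BuzzFeed. Thanks to Poe's Law, people often read the headlines of articles from The Onion or ClickHole and believe them to be true, not knowing that they are intended as satire. The converse also happens with ridiculous-sounding real news stories: people often think they are satire when they are not.

This confusion naturally lends itself to a game: given a news headline, try to guess whether or not it is satire. This challenge is about doing exactly that with a program.

Given a news headline (a string consisting of only printable ASCII characters and spaces), output 1 if the headline is satire, or 0 if it is not. Your score will be the number of correct outputs divided by the total number of headlines.

As usual, standard loopholes (especially optimizing for the test cases) are not allowed. To enforce this, I will run your programs on a set of 200 hidden test cases (100 from The Onion, 100 from Not The Onion). To be valid, your solution must score no more than 20 percentage points lower on the hidden test cases than it does on the public test cases.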
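The scoring and validity rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the challenge author's actual harness: the names `accuracy` and `is_valid` are made up here, and `classify` stands for any submission that returns 1 for satire and 0 otherwise.

```python
from typing import Callable, Iterable, Tuple


def accuracy(classify: Callable[[str], int],
             cases: Iterable[Tuple[str, int]]) -> float:
    """Fraction of headlines whose predicted label (1 = satire) is correct."""
    cases = list(cases)
    correct = sum(1 for headline, label in cases if classify(headline) == label)
    return correct / len(cases)


def is_valid(public_score: float, hidden_score: float,
             max_drop: float = 0.20) -> bool:
    """Valid if the hidden score is at most 20 percentage points lower.

    The small tolerance only guards against floating-point rounding
    when the drop is exactly 20 points.
    """
    return public_score - hidden_score <= max_drop + 1e-9
```

With the figures reported in the answers, a public score of 41/50 against a hidden score of 128/200 is valid (an 18-point drop), while 42/50 against 125/200 is not (a 21.5-point drop).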

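Test Cases

To come up with test cases for this challenge, I picked 25 headlines from The Onion subreddit (where articles from The Onion and its child sites, like ClickHole, are posted), and 25 headlines from the Not The Onion subreddit (where real news articles that sound like satire are posted). The only changes I made to the headlines were replacing "fancy" quotes with regular ASCII quotes and standardizing capitalization; everything else is left unchanged from the original article's headline. Each headline is on its own line.

The Onion headlines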

Trump Warns Removing Confederate Statues Could Be Slippery Slope To Eliminating Racism Entirely
'No Way To Prevent This,' Says Only Nation Where This Regularly Happens
My Doctor Told Me I Should Vaccinate My Children, But Then Someone Much Louder Than My Doctor Told Me I Shouldn't
Man At Park Who Set Up Table Full Of Water Cups Has No Idea How Passing Marathon Runners Got Impression They Can Take Them
This Child Would Have Turned 6 Today If His Mother Hadn't Given Birth To Him In October
Incredible Realism: The Campaign In The Next 'Call Of Duty' Will Begin At Your Avatar's High School Cafeteria When He's Being Tricked Into Joining The Military By A Recruiter
'Sometimes Things Have To Get Worse Before They Get Better,' Says Man Who Accidentally Turned Shower Knob Wrong Way
Report: Uttering Phrase 'Easy Does It' Prevents 78% Of Drywall Damage While Moving Furniture
Barbara Bush Passes Away Surrounded By Loved Ones, Jeb
Family Has Way Too Many Daughters For Them Not To Have Been Trying For Son
News: Privacy Win! Facebook Is Adding A 'Protect My Data' Button That Does Nothing But Feels Good To Press
Dalai Lama Announces Next Life To Be His Last Before Retirement
Researchers Find Decline In Facebook Use Could Be Directly Linked To Desire To Be Happy, Fully Functioning Person
Manager Of Combination Taco Bell/KFC Secretly Considers It Mostly A Taco Bell
Trump: 'It's My Honor To Deliver The First-Ever State Of The Union'
Daring To Dream: Jeff Bezos Is Standing Outside A Guitar Center Gazing Longingly At A $200 Billion Guitar
Area Dad Looking To Get Average Phone Call With Adult Son Down To 47.5 Seconds
Experts Warn Beef Could Act As Gateway Meat To Human Flesh
Jeff Bezos Named Amazon Employee Of The Month
Dad Suggests Arriving At Airport 14 Hours Early
Report: Only 3% Of Conversations Actually Need To Happen
Delta Pilot Refuses To Land Until Gun Control Legislation Passed
Family Wishes Dad Could Find Healthier Way To Express Emotions Than Bursting Into Full-Blown Musical Number
New Honda Commercial Openly Says Your Kids Will Die In A Car Crash If You Buy A Different Brand
Teacher Frustrated No One In Beginner Yoga Class Can Focus Chakras Into Energy Blast

Not The Onion headlines

Man Rescued From Taliban Didn't Believe Donald Trump Was President
Nat Geo Hires Jeff Goldblum To Walk Around, Being Professionally Fascinated By Things
Mike Pence Once Ratted Out His Fraternity Brothers For Having A Keg
Reddit CEO Tells User, "We Are Not The Thought Police," Then Suspends That User
Trump Dedicates Golf Trophy To Hurricane Victims
Uber's Search For A Female CEO Has Been Narrowed Down To 3 Men
ICE Director: ICE Can't Be Compared To Nazis Since We're Just Following Orders
Passenger Turned Away From Two Flights After Wearing 10 Layers Of Clothing To Avoid Luggage Fee
Somali Militant Group Al-Shabaab Announces Ban On Single-Use Plastic Bags
UPS Loses Family's $846k Inheritance, Offers To Refund $32 Shipping Fee
Teen Suspended From High School After Her Anti-Bullying Video Hurts Principal's Feelings
Alabama Lawmaker: We Shouldn't Arm Teachers Because Most Are Women
Cat Named After Notorious B.I.G. Shot Multiple Times - And Survives
EPA Head Says He Needs To Fly First Class Because People Are Mean To Him In Coach
Apology After Japanese Train Departs 20 Seconds Early
Justin Bieber Banned From China In Order To 'Purify' Nation
Alcohol Level In Air At Fraternity Party Registers On Breathalyzer
NPR Tweets The Declaration Of Independence, And People Freak Out About A 'Revolution'
Man Who Mowed Lawn With Tornado Behind Him Says He 'Was Keeping An Eye On It.'
After Eating Chipotle For 500 Days, An Ohio Man Says He's Ready For Something New
'El Chapo' Promises Not To Kill Any Jurors From Upcoming Federal Trial
After 4th DWI, Man Argues Legal Limit Discriminates Against Alcoholics
Palestinian Judge Bans Divorce During Ramadan Because 'People Make Hasty Decisions When They're Hungry'
Argentinian Officers Fired After Claiming Mice Ate Half A Ton Of Missing Marijuana
'Nobody Kill Anybody': Murder-Free Weekend Urged In Baltimore

6
"Your score will be the number of correct outputs divided by the total number of headlines" Is byte count a tiebreaker?
Skidsdev

9
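I'm a bit confused. What kind of solution do you expect? Every solution will have to "optimize for the test cases" to some degree, short of writing an AI that understands English and has a sense of humor. For example, Arnauld's solution detects /ly\b/, which works only because the 25 Onion headlines you picked happen to contain more adverbs, but for all I know you could easily trip it up with a different test battery. And who's to say his coefficients weren't chosen to optimize his score? (Why wouldn't he optimize them?)
Lynn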

10
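This test battery does seem a bit unusual. It's like asking for a classifier that can detect dogs in a photograph, but taking your positive test cases from photos of dogs and your negative test cases from a Buzzfeed article titled "25 Photos Of Objects You'll Swear Are Dogs, But Nope, Turns Out They Aren't! (#11 Will Blow Your Mind!)" It makes a hard enough problem harder.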
Sophia Lechner

4
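Not only is the challenge hard, but the difference is also not obvious (to me). If I can't solve it, then of course my program can't solve it either (that is, while convincing me that it doesn't hardcode the test cases).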
user202729

4
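Well, I spent 36+ hours training an artificial neural network using brain.js and LSTM, with the samples in this post plus 100 other samples of each type (from the provided links), but the result wasn't good enough on new headlines that weren't in the training set. I'm done :P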
Night2

Answers:


7

JavaScript (ES7), 39/50 (78%)

63.5% (127/200) on hidden test cases

A simple heuristic based on the length of the title, the number of spaces, and the use of the -ly suffix.

isOnion = str =>
  str.length ** 0.25 +
  str.split(' ').length ** 1.25 * 2 +
  str.split(/ly\b/).length ** 1.75 * 7
  > 76

Try it online!
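For readers who prefer Python, the heuristic can be ported roughly as follows. This is a sketch that aims to mirror the JavaScript above term by term; the function name `is_onion` and the intermediate variable names are introduced here for illustration only.

```python
import re


def is_onion(headline: str) -> bool:
    """Python port of the JavaScript heuristic; True means 'looks like satire'."""
    # Fourth root of the headline length in characters.
    length_term = len(headline) ** 0.25
    # split(' ') with an explicit separator mirrors JavaScript's str.split(' ').
    word_term = len(headline.split(' ')) ** 1.25 * 2
    # Splitting on /ly\b/ yields (number of word-final "ly") + 1 pieces.
    ly_term = len(re.split(r'ly\b', headline)) ** 1.75 * 7
    return length_term + word_term + ly_term > 76
```

The "ly" term dominates the sum: three word-final "ly" matches alone contribute 4 ** 1.75 * 7, roughly 79, which already crosses the threshold of 76, whereas a headline with no such match needs a great many words to get there.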


This is absurdly effective for how simple it is.
Don Thousand

This solution scores 63.5% on the hidden test cases, so it is valid.
Mego

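Not as simple as what was possible at the beginning of the sandbox (100%, by exploiting capitalization differences before they were standardized), but this is really simple.
Zachary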

@Mego Just out of curiosity, does this NSFW version improve the score on the hidden test cases? :)
Arnauld

@Arnauld 66% with that version
Mego

6

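Python 3, 84%

Untested on the hidden test cases.

This uses a Keras LSTM RNN trained on various headlines. To run it you need Keras, the following code, and the model, which I have made available on GitHub: repo link. You will need the model .h5 file; the word/vector mappings are in the .pkl file.

The latest dependencies are: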

import numpy as np
from pickle import load
from keras.preprocessing import sequence, text
from keras.models import Sequential
from keras.layers import Dense, Embedding, SpatialDropout1D, LSTM, Dropout
from keras.regularizers import l2
import re

The settings are:

max_headline_length = 70
word_count = 20740

The model is:

model = Sequential()
model.add(Embedding(word_count, 32, input_length=max_headline_length))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(64, kernel_regularizer=l2(0.005), dropout=0.3, recurrent_dropout=0.3))
model.add(Dropout(0.5))
model.add(Dense(32, kernel_regularizer=l2(0.005)))
model.add(Dropout(0.5))
model.add(Dense(2, kernel_regularizer=l2(0.001), activation='softmax'))

Now load the model and the word embeddings:

model.load_weights('model.h5')
word_to_index = load(open('words.pkl', 'rb'))

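And here is the code to test whether a string comes from 'NotTheOnion' or 'TheOnion'. I've written a quick helper function that converts the string into the corresponding word embeddings: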

def get_words(string):
  words = []
  for word in re.finditer("[a-z]+|[\"'.;/!?]", string.lower()):
    words.append(word.group(0))
  return words

def words_to_indexes(words):
  return [word_to_index.get(word, 0) for word in words]

def format_input(word_indexes):
  return sequence.pad_sequences([word_indexes], maxlen=max_headline_length)[0]

def get_type(string):
  words = words_to_indexes(get_words(string))
  result = model.predict(np.array([format_input(words)]))[0]

  if result[0] > result[1]:
    site = 'NotTheOnion'
  else:
    site = 'TheOnion'

  return site

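Explanation

This code runs a model that analyzes the relationships between words by representing the words as "vectors". You can learn more about word embeddings here.

The model is trained on headlines, but the test cases are excluded.

This process is automated after quite a bit of processing. I've distributed the final processed word list as a .pkl file, but what happens in word embedding is that first we analyze the sentence and isolate the words.

Once we have the words, the next step is to be able to understand the differences and similarities between certain words, e.g. king and queen versus duke and duchess. These embeddings don't operate on the actual words but on numbers representing the words, which is what is stored in the .pkl file. Words that the machine doesn't understand are mapped to a special word <UNK>, which lets us know that there is a word there without knowing exactly what it means.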
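The word isolation and word-to-number steps can be sketched without Keras. The regular expression is the one used in get_words above; everything else is an assumption made for illustration: `toy_vocab` is a hypothetical stand-in for the real mapping in the .pkl file, and `pad_left` imitates the default behaviour of Keras `pad_sequences` (pad and truncate at the front). As in the code above, unknown words fall back to index 0.

```python
import re
from typing import Dict, List

# Same pattern as get_words above: runs of letters, or one punctuation mark.
TOKEN_RE = re.compile(r"[a-z]+|[\"'.;/!?]")


def tokenize(headline: str) -> List[str]:
    """Lowercase the headline and isolate words and selected punctuation."""
    return TOKEN_RE.findall(headline.lower())


def to_indexes(tokens: List[str], vocab: Dict[str, int],
               unknown: int = 0) -> List[int]:
    """Map each token to its vocabulary index, or `unknown` if it is missing."""
    return [vocab.get(token, unknown) for token in tokens]


def pad_left(indexes: List[int], length: int, pad_value: int = 0) -> List[int]:
    """Keep the last `length` items, then left-pad with `pad_value`.

    Assumes `length` is a positive integer.
    """
    trimmed = indexes[-length:]
    return [pad_value] * (length - len(trimmed)) + trimmed


# Hypothetical two-word vocabulary; the real one lives in the .pkl file.
toy_vocab = {"dad": 1, "dog": 2}
```

For instance, `tokenize("Dad's Dog!")` splits the apostrophe and the exclamation mark into their own tokens, and only "dad" and "dog" receive non-zero indexes under the toy vocabulary.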

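Now that the words can be understood, the sequence of words (the headline) needs to be analyzed. This is what the "LSTM" does; an LSTM is a type of "RNN" cell that avoids the vanishing gradient effect. More simply, it takes in a sequence of words and allows us to find relationships between them.

The final layer is Dense, which basically means it behaves like an array, so the output looks like: [probability_is_not_onion, probability_is_onion]. By finding which one is larger, we can choose the most confident result for the given headline.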
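The final decision step can be illustrated in isolation. The sketch below assumes two raw scores coming out of the last layer; `softmax` turns them into probabilities and `pick_label` applies the same comparison as get_type above (ties go to 'TheOnion'). The helper names are made up for this example.

```python
import math
from typing import List, Sequence

LABELS = ("NotTheOnion", "TheOnion")


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(probs: Sequence[float]) -> str:
    """Return the label with the larger probability."""
    return LABELS[0] if probs[0] > probs[1] else LABELS[1]
```

Because softmax preserves the ordering of its inputs, comparing the two probabilities is equivalent to comparing the two raw scores; the probabilities are mainly useful as a rough confidence measure.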


3

Python 3 + Keras, 41/50 = 82%

83% (166/200) on hidden test cases

import json
import keras
import numpy
import re

from keras import backend as K

STRIP_PUNCTUATION = re.compile(r"[^a-z0-9 ]+")


class AttentionWeightedAverage(keras.engine.Layer):
    def __init__(self, return_attention=False, **kwargs):
        self.init = keras.initializers.get("uniform")
        self.supports_masking = True
        self.return_attention = return_attention
        super(AttentionWeightedAverage, self).__init__(**kwargs)

    def build(self, input_shape):
        self.input_spec = [keras.engine.InputSpec(ndim=3)]
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[2], 1),
                                 name="{}_W".format(self.name),
                                 initializer=self.init)
        self.trainable_weights = [self.W]

        super(AttentionWeightedAverage, self).build(input_shape)

    def call(self, x, mask=None):
        logits = K.dot(x, self.W)
        x_shape = K.shape(x)
        logits = K.reshape(logits, (x_shape[0], x_shape[1]))

        ai = K.exp(logits - K.max(logits, axis=-1, keepdims=True))

        if mask is not None:
            mask = K.cast(mask, K.floatx())
            ai = ai * mask

        att_weights = ai / (K.sum(ai, axis=1, keepdims=True) + K.epsilon())
        weighted_input = x * K.expand_dims(att_weights)

        result = K.sum(weighted_input, axis=1)

        if self.return_attention:
            return [result, att_weights]

        return result

    def get_output_shape_for(self, input_shape):
        return self.compute_output_shape(input_shape)

    def compute_output_shape(self, input_shape):
        output_len = input_shape[2]

        if self.return_attention:
            return [(input_shape[0], output_len), (input_shape[0], input_shape[1])]

        return (input_shape[0], output_len)

    def compute_mask(self, input, input_mask=None):
        if isinstance(input_mask, list):
            return [None] * len(input_mask)
        else:
            return None


if __name__ == "__main__":
    model = keras.models.load_model("combined.h5", custom_objects={"AttentionWeightedAverage": AttentionWeightedAverage})
    with open("vocabulary.json", "r") as fh:
        vocab = json.load(fh)

    while True:
        try:
            headline = input()
        except EOFError:
            break

        tokens = STRIP_PUNCTUATION.sub("", headline.lower()).split()

        inp = numpy.zeros((1, 45))

        for i, token in enumerate(tokens):
            try:
                inp[0,i] = vocab[token]
            except KeyError:
                inp[0,i] = 1

        print(model.predict(inp)[0][0] > 0.3)

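combined.h5 and vocabulary.json can be retrieved from here (very large) and here.

This is a fully connected classifier hooked up to DeepMoji, a pre-trained sentiment analysis model that consists of stacked bidirectional LSTMs and an attention mechanism. I froze the DeepMoji layers and took out the final softmax layer, trained only the fully connected layers, then unfroze the DeepMoji layers and trained everything together for fine-tuning. The attention mechanism is taken from https://github.com/bfelbo/DeepMoji/blob/master/deepmoji/attlayer.py (I didn't want to use all of their code as a dependency for one class, especially since it's Python 2 and rather unwieldy to use as a module...)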
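To make the attention step easier to follow, here is a dependency-free sketch of what the `call` method above computes for a single headline: a masked softmax over per-timestep scores, followed by a weighted sum of the timestep vectors. It is an illustration, not the DeepMoji code; the function name is made up, and the default `eps` of 1e-7 is an assumption standing in for `K.epsilon()`.

```python
import math
from typing import List, Optional, Sequence


def attention_weighted_average(x: Sequence[Sequence[float]],
                               w: Sequence[float],
                               mask: Optional[Sequence[float]] = None,
                               eps: float = 1e-7) -> List[float]:
    """Average the timestep vectors in x, weighted by softmax(x . w)."""
    # One score per timestep: dot product of its features with w.
    logits = [sum(xi * wi for xi, wi in zip(step, w)) for step in x]
    top = max(logits)  # subtract the maximum for numerical stability
    scores = [math.exp(logit - top) for logit in logits]
    if mask is not None:
        # Masked (padding) timesteps get a score of zero.
        scores = [s * m for s, m in zip(scores, mask)]
    total = sum(scores) + eps
    weights = [s / total for s in scores]
    n_features = len(x[0])
    return [sum(weights[t] * x[t][f] for t in range(len(x)))
            for f in range(n_features)]
```

With an all-zero `w` every timestep gets the same weight and the result is a plain average; a mask of `[1, 0]` makes the second timestep contribute nothing.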

This performs surprisingly poorly on Mego's test set, considering that it gets >90% on my own, larger validation set. So I'm not done with this yet.


83% on hidden test cases, assuming I ran it correctly
Mego

1

JavaScript (Node.js), 98% (49/50)

96% (192/200) on hidden test cases

const words = require('./words');
const bags = require('./bags');

let W = s => s.replace(/[^A-Za-z0-9 ]/g, '').toLowerCase().split(' ').filter(w => w.length > 3);

let M = b => {
    for (let i = 0; i < bags.length; i++) {
        let f = true;
        for (let j = 0; j < bags[i].length; j++) if (!b.includes(bags[i][j])) {
            f = false;
            break;
        }
        if (f) return true;
    }
    return false;
};

let O = s => {
    let b = [];
    W(s).forEach(w => {
        let p = words.indexOf(w);
        if (p >= 0) b.push(p);
    });
    return (b.length > 0 && M(b));
};

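This requires two large JSON files that I cannot put here or on "TiO". Please download them from the following links and save them under the names words.json and bags.json, in the same folder as the JS file. There is also a link to a JS file with the test cases and result/percentage printing. You can put your hidden test cases in the onions and nonOnions variables.

After saving all 3 files in the same directory, run node onion.js.

The O function returns true if the headline is onion and false if it isn't. It uses a big list of word bags (without order) to detect whether the input string is onion. It is kind of hard-coded, but it works very well on a variety of random test cases.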
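The bag-matching idea can be sketched in Python as follows. This mirrors the structure of the W, M and O functions above but is not a drop-in replacement: `TOY_WORDS` and `TOY_BAGS` are hypothetical stand-ins for the contents of words.json and bags.json, which are not reproduced here, and the function names are made up for the example.

```python
import re
from typing import List, Sequence


def tokenize(s: str) -> List[str]:
    """Strip punctuation, lowercase, and keep only words longer than 3 chars."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]", "", s).lower()
    return [w for w in cleaned.split(" ") if len(w) > 3]


def matches_any_bag(indexes: Sequence[int],
                    bags: Sequence[Sequence[int]]) -> bool:
    """True if every word index of at least one bag occurs in the headline."""
    present = set(indexes)
    return any(all(i in present for i in bag) for bag in bags)


def is_onion(s: str, words: List[str], bags: Sequence[Sequence[int]]) -> bool:
    """Classify a headline as onion if its known words cover some bag."""
    indexes = [words.index(w) for w in tokenize(s) if w in words]
    return len(indexes) > 0 and matches_any_bag(indexes, bags)


# Hypothetical toy data; the real lists are far larger.
TOY_WORDS = ["area", "suggests", "report", "conversations"]
TOY_BAGS = [[0], [2, 3]]
```

Under the toy data, a headline containing "area" matches the single-word bag, while a headline containing "report" matches only if "conversations" appears as well; word order never matters.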


This solution gets 96% on the hidden test cases.
响应

0

Building on Arnauld's solution

JavaScript (ES6), 41/50

64% (128/200) on hidden test cases

isOnion = str =>
  str.includes("Dad") || str.length ** .25 +
  str.split(' ').length ** 1.25 * 2 +
  str.split(/ly\b/).length ** 1.75 * 7
  > 76

JavaScript (ES6), 42/50

62.5% (125/200) on hidden test cases (invalid)

isOnion = str =>
  str.includes("Dad") || str.length ** .25 +
  str.split(' ').length ** 1.25 * 2 +
  str.split(' ').filter(w => w.length > 3 && w.split(/ly/).length > 1).length * 23.54 +
 /\d/.test(str) * 8
 > 76

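The length + word count + "ly" concept works pretty well. I was able to squeeze out a few more points by checking for the word "Dad" (when do real articles talk about people's dads in the third person in the title?), and one more point by changing the "ly" search heuristic and checking for the presence of digits in the title (which might be less valid in the general case outside the test, so I left both solutions).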


I don't know about the dad part... it seems a little like optimizing for the test cases to me...
Don Thousand

And yes, I can find plenty of Not The Onion articles that mention dads
Don Thousand

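There's probably a better way to do it as part of the heuristic, rather than a hard "win" whenever the headline contains "Dad", but I imagine that even outside the test database, talking abstractly about a specific "Dad" is more common on The Onion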
TiKevin83

Your first solution scored 64% on the hidden test cases, so it is valid. Your second solution scored 62.5% on the hidden test cases, so it is not valid.
Mego

@Mego What a close margin...
user202729
Licensed under cc by-sa 3.0 with attribution required.