Regex: Match an egalitarian series


18

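Introduction

I don't see many regex challenges on here, so I would like to offer up this deceptively simple one, which can be done in a number of ways using a number of regex flavours. I hope it provides regex enthusiasts with a bit of fun golfing time.

Challenge

The challenge is to match what I've very loosely dubbed an "egalitarian" series: a series of equal numbers of different characters. This is best described with examples.

Match: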

aaabbbccc
xyz 
iillppddff
ggggggoooooollllllffffff
abc
banana

Don't match:

aabc
xxxyyzzz
iilllpppddff
ggggggoooooollllllfff
aaaaaabbbccc
aaabbbc
abbaa
aabbbc

To generalise, we want to match a subject of the form (c1)^n(c2)^n(c3)^n...(ck)^n for any list of characters c1 to ck, where ci != ci+1 for all i, k > 1, and n > 0.

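Clarifications:

  • Input will not be empty.

  • A character may repeat itself later in the string (e.g. "banana").

  • k > 1, so there will always be at least 2 different characters in the string.

  • You can assume only ASCII characters will be passed as input and no character will be a line terminator.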
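
As a cross-check on the definition and clarifications above, here is a minimal non-regex sketch in Python. The function name `is_egalitarian` is my own and not part of the challenge; it splits the input into maximal runs of identical characters and compares their lengths. The two lists are the example strings from the challenge.

```python
from itertools import groupby


def is_egalitarian(s: str) -> bool:
    """Return True if s consists of k > 1 runs that all share one length n > 0."""
    # groupby yields maximal runs of identical characters, so adjacent runs
    # always use different characters (ci != ci+1 holds by construction).
    run_lengths = [len(list(group)) for _, group in groupby(s)]
    return len(run_lengths) > 1 and len(set(run_lengths)) == 1


should_match = ["aaabbbccc", "xyz", "iillppddff",
                "ggggggoooooollllllffffff", "abc", "banana"]
should_not_match = ["aabc", "xxxyyzzz", "iilllpppddff", "ggggggoooooollllllfff",
                    "aaaaaabbbccc", "aaabbbc", "abbaa", "aabbbc"]

assert all(is_egalitarian(s) for s in should_match)
assert not any(is_egalitarian(s) for s in should_not_match)
```

A checker like this is only a reference oracle for testing candidate regexes; it is not a valid submission, since answers must be a single regex.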

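Rules

(Thank you to Martin Ender for this excellently stated block of rules)

Your answer should consist of a single regex, without any additional code (except, optionally, a list of regex modifiers required to make your solution work). You must not use features of your language's regex flavour that allow you to invoke code in the hosting language (e.g. Perl's e modifier).

You can use any regex flavour which existed before this challenge, but please specify the flavour.

Do not assume that the regex is anchored implicitly, e.g. if you're using Python, assume that your regex is used with re.search and not with re.match. Your regex must match the entire string for valid egalitarian strings and yield no match for invalid strings. You may use as many capturing groups as you wish.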
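
To illustrate the anchoring rule, the following Python sketch uses a deliberately naive placeholder pattern (`a+b+`, chosen only for illustration and not a solution to the challenge). It shows why explicit anchors are needed when the regex is applied with `re.search`.

```python
import re

# Placeholder patterns for illustration only; neither solves the challenge.
unanchored = r"a+b+"
anchored = r"^a+b+$"

subject = "xxaabbyy"

# re.search does not anchor at all, so an unanchored pattern can succeed
# on a mere substring of the subject.
found = re.search(unanchored, subject)
print(found.group(0))                 # the substring "aabb", not the whole subject
print(re.search(anchored, subject))   # None: explicit anchors reject the subject

# A valid answer must behave like the anchored case: a successful match
# has to span the entire input.
whole = re.search(anchored, "aabb")
print(whole.span() == (0, len("aabb")))
```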

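You may assume that the input will always be a string of two or more ASCII characters not containing any line terminators.

This is regex golf, so the shortest regex in bytes wins. If your language requires delimiters (usually /.../) to denote regular expressions, don't count the delimiters themselves. If your solution requires modifiers, add one byte per modifier.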
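
The scoring rule can be sketched as a small Python helper. The name `regex_golf_score` is hypothetical, and counting bytes via UTF-8 is my assumption (for ASCII-only patterns every common encoding gives the same count). The two patterns are the .NET regexes quoted in the answers further down this page.

```python
def regex_golf_score(pattern: str, modifiers: str = "") -> int:
    """Bytes in the bare pattern (no delimiters) plus one byte per modifier."""
    # Assumption: bytes are counted in UTF-8.
    return len(pattern.encode("utf-8")) + len(modifiers)


# Patterns copied from the answers on this page, without delimiters.
positive = r"^(.)\1*((?<=(\5())*(.))(.)(?<-4>\6)*(?!\4|\6))+$"
negative = r"^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+"

print(regex_golf_score(positive))  # 48
print(regex_golf_score(negative))  # 54
```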

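Criteria

This is good ol' fashioned golf, so forget efficiency and just try to get your regex as small as possible.

Please mention which regex flavour you have used and, if possible, include a link showing an online demo of your expression in action.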


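Is this regex golf? You should probably clarify that, along with the rules for it. Most challenges on this site are golfs in assorted programming languages.
LyricLy

@LyricLy Thanks for the advice! Yes, I would like it to be purely regex, i.e. a single regular expression in a regex flavour of the submitter's choice. Are there any other rules I should be mindful of including?
jaytea

I don't understand your definition of "egalitarian" such that banana is egalitarian.
msh210 '17

@msh210 When I came up with the term "egalitarian" to describe the series, I didn't consider that I would be allowing characters to be repeated later in the series (such as in "banana" or "aaabbbcccaaa", etc.). I just wanted a term to represent the idea that every chunk of repeated characters is the same size. Since "banana" has no repeated characters, this definition holds true for it.
jaytea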

Answers:


11

.NET flavour, 48 bytes

^(.)\1*((?<=(\5())*(.))(.)(?<-4>\6)*(?!\4|\6))+$

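Try it online! (using Retina)

Alright, it turns out that not negating the logic is a bit simpler after all. I'm making this a separate answer, because the two approaches are completely different.

Explanation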

^            # Anchor the match to the beginning of the string.
(.)\1*       # Match the first run of identical characters. In principle, 
             # it's possible that this matches only half, a quarter, an 
             # eighth etc of the first run, but that won't affect the
             # result of the match (in other words, if the match fails with 
             # matching this as the entire first run, then backtracking into
             # only matching half of it won't cause the rest of the regex to
             # match either).
(            # Match this part one or more times. Each instance matches one
             # run of identical letters.
  (?<=       #   We start with a lookbehind to record the length
             #   of the preceding run. Remember that the lookbehind
             #   should be read from the bottom up (and so should
             #   my comments).
    (\5())*  #     And then we match all of its adjacent copies, pushing an
             #     empty capture onto stack 4 each time. That means at the
             #     end of the lookbehind, we will have n-1 captures on stack 4,
             #     where n is the length of the preceding run. Due to the 
             #     atomic nature of lookbehinds, we don't have to worry 
             #     about backtracking matching less than n-1 copies here.
    (.)      #     We capture the character that makes up the preceding
             #     run in group 5.
  )
  (.)        #   Capture the character that makes up the next run in group 6.
  (?<-4>\6)* #   Match copies of that character while depleting stack 4.
             #   If the runs are the same length that means we need to be
             #   able to get to the end of the run at the same time we
             #   empty stack 4 completely.
  (?!\4|\6)  #   This lookahead ensures that. If stack 4 is not empty yet,
             #   \4 will match, because the captures are all empty, so the
             #   backreference can't fail. If the stack is empty though,
             #   then the backreference will always fail. Similarly, if we
             #   are not at the end of the run yet, then \6 will match 
             #   another copy of the run. So we ensure that neither \4 nor
             #   \6 are possible at this position to assert that this run
             #   has the same length as the previous one.
)+
$            # Finally, we make sure that we can cover the entire string
             # by going through runs of identical lengths like this.
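
Python's `re` module has no balancing groups, so the pattern above cannot be run there directly. As a rough procedural analogue (my own sketch, not the author's code, with illustrative names), the loop below follows the annotated steps: measure the preceding run and push n-1 markers on a stack, then pop one marker per additional character of the next run, and require that the stack and the run are exhausted at the same time. Unlike the regex, it always consumes the whole first run rather than relying on backtracking.

```python
def runs_match_pairwise(s: str) -> bool:
    """Procedural analogue of the run-by-run check; not the regex itself."""
    i = 0
    # Skip the first run entirely (the role of `(.)\1*` in the pattern).
    while i < len(s) and s[i] == s[0]:
        i += 1
    if i == len(s):
        return False  # the repeated group must match at least one further run

    while i < len(s):
        # Lookbehind: walk back over the preceding run and push n-1 markers.
        prev = s[i - 1]
        j = i - 1
        while j > 0 and s[j - 1] == prev:
            j -= 1
        stack = [None] * (i - 1 - j)

        # `(.)`: take the first character of the next run.
        cur = s[i]
        i += 1
        # `(?<-4>\6)*`: consume further copies while popping markers.
        while stack and i < len(s) and s[i] == cur:
            stack.pop()
            i += 1
        # `(?!\4|\6)`: fail if markers remain or the run continues.
        if stack or (i < len(s) and s[i] == cur):
            return False
    return True


print(runs_match_pairwise("aaabbbccc"))
print(runs_match_pairwise("aabbbc"))
```

The list used as a stack holds only placeholder entries, which corresponds to the empty captures in the pattern: only the number of entries matters, not their content.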

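I love that you're flip-flopping between the two methods! I also thought the negative approach ought to be shorter until I actually tried it and found it much more awkward (even though it feels like it should be simpler). I have 48b in PCRE, and 49b in Perl with a completely different method, and with your 3rd method in .NET at the same size, I'd say this is a pretty cool regex challenge :D
jaytea

@jaytea I'd love to see those. If nobody comes up with anything for a week or so, I hope you'll post them yourself. :) And yeah, agreed, it's nice that the approaches are so close in byte count.
Martin Ender

I just might! Also, the Perl one has been golfed down to 46b ;)
jaytea

So I figured you might want to see these now! Here's the 48b in PCRE: ((^.|\2(?=.*\4\3)|\4(?!\3))(?=\2*+((.)\3?)))+\3$ I was experimenting with \3* in place of (?!\3) to make it 45b, but that fails on "aabbbc" :( The Perl version is easier to understand, and it's down to 45b now: ^((?=(.)\2*(.))(?=(\2(?4)?\3)(?!\3))\2+)+\3+$ - the reason I call it Perl even though it seems to be valid PCRE is that PCRE thinks (\2(?4)?\3) could recurse indefinitely, whereas Perl is more…

@jaytea Ah, those are really neat solutions. You should really post them in a separate answer. :)
Martin Ender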

9

.NET flavour, 54 bytes

^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+

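Try it online! (using Retina)

I'm pretty sure this is suboptimal, but it's the best I'm coming up with for balancing groups right now. I've got one alternative at the same byte count, which is mostly the same: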

^(?!.*(?<=(\3())*(.))(?!\3)(?>(.)(?<-2>\4)*)(\2|\4)).+

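Explanation

The main idea is to invert the problem: match non-egalitarian strings, and put the whole thing in a negative lookahead to negate the result. The benefit is that we don't have to keep track of n across the entire string (due to the nature of balancing groups, you usually consume n when checking it) to check that all runs are of equal length. Instead, we just look for a single pair of adjacent runs which don't have the same length. That way, I only need to use n once.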
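
The inversion can be sketched outside of regex as follows. This is my own Python illustration with hypothetical function names, not a translation of the pattern: it searches for one pair of adjacent runs with different lengths and accepts the string only when no such pair exists.

```python
from itertools import groupby


def has_unequal_neighbours(s: str) -> bool:
    """True if some pair of adjacent runs differs in length (the 'bad' case)."""
    lengths = [len(list(group)) for _, group in groupby(s)]
    return any(a != b for a, b in zip(lengths, lengths[1:]))


def matches_by_negation(s: str) -> bool:
    # Accept exactly when no counterexample exists, in the spirit of
    # a negative lookahead followed by `.+`.
    # This sketch does not check k > 1; it relies on the challenge's
    # guarantee that the input contains at least two different characters.
    return len(s) > 0 and not has_unequal_neighbours(s)


print(matches_by_negation("iillppddff"))
print(matches_by_negation("xxxyyzzz"))
```

Comparing only neighbouring runs is sufficient because equality is transitive: if every adjacent pair has equal length, all runs do.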

Here is a breakdown of the regex.

^(?!.*         # This negative lookahead means that we will match
               # all strings where the pattern inside the lookahead
               # would fail if it were used as a regex on its own.
               # Due to the .* that inner regex can match from any
               # position inside the string. The particular position
               # we're looking for is between two runs (and this
               # will be ensured later).

  (?<=         #   We start with a lookbehind to record the length
               #   of the preceding run. Remember that the lookbehind
               #   should be read from the bottom up (and so should
               #   my comments).
    (\2)*      #     And then we match all of its adjacent copies, capturing
               #     them separately in group 1. That means at the
               #     end of the lookbehind, we will have n-1 captures
               #     on stack 1, where n is the length of the preceding
               #     run. Due to the atomic nature of lookbehinds, we
               #     don't have to worry about backtracking matching
               #     less than n-1 copies here.
    (.)        #     We capture the character that makes up the preceding
               #     run in group 2.
  )
  (?!\2)       #   Make sure the next character isn't the same as the one
               #   we used for the preceding run. This ensures we're at a
               #   boundary between runs.
  (?>          #   Match the next stuff with an atomic group to avoid
               #   backtracking.
    (.)        #     Capture the character that makes up the next run
               #     in group 3.
    (?<-1>\3)* #     Match as many of these characters as possible while
               #     depleting the captures on stack 1.
  )
               #   Due to the atomic group, there are two possible
               #   situations that cause the previous quantifier to stop
               #   matching.
               #   Either the run has ended, or stack 1 has been depleted.
               #   If both of those are true, the runs are the same length,
               #   and we don't actually want a match here. But if the runs
               #   are of different lengths then either the run ended but
               #   the stack isn't empty yet, or the stack was depleted but
               #   the run hasn't ended yet.
  (?(1)|\3)    #   This conditional matches these last two cases. If there's
               #   still a capture on stack 1, we don't match anything,
               #   because we know this run was shorter than the previous
               #   one. But if stack 1 is empty, we want to match another copy of
               #   the character in this run to ensure that this run is 
               #   longer than the previous one.
)
.+             # Finally we just match the entire string to comply with the
               # challenge spec.

I tried to make it fail on: bananaababbbaaannnaaannnaaabbbaaannnaaannnaaaaaaThe Nineteenth Byte11110^(?!.*(?<=(\2)*(.))(?!\2)(?>(.)(?<-1>\3)*)(?(1)|\3)).+bababa. I'm the one who failed. :( +1
Erik the Outgolfer

1
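That moment when you finish up the explanation and then figure out that you can save 1 byte by using the exact opposite approach... I guess I'll make another answer in a bit... :|
Martin Ender

@MartinEnder ...And then realise you can golf this one by 2 bytes, haha :P
Mr. Xcoder '17

@Mr.Xcoder It would have to be 7 bytes now, so I hope I'm safe. ;)
Martin Ender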
Licensed under cc by-sa 3.0 with attribution required.