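Implement PCRE in your language.

Note: After trying this myself, I quickly realized it was a mistake, so I am modifying the rules a bit.

The minimum required functionality:

Character classes (., \w, \W, etc.)
Multipliers (+, *, and ?)
Simple capture groups

Your challenge is to implement PCRE in the language of your choice, subject to the following conditions:

You may not use your language's native RegEx facilities in any way. You may not use a third-party RegEx library either.
Your entry should implement as much of the PCRE spec as possible.
Your program should accept two lines of input:
- the regular expression
- the string to match against
Your program should indicate in its output:
- whether the RegEx matched anywhere in the input string
- the results of any capture groups
The winner will be the entry that implements as much of the spec as possible. In case of a tie, the winner will be the most creative entry, as judged by me.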

Edit: to clarify a few things, here are some examples of input and expected output:

Input:

^\s*(\w+)$
         hello

Output:

Matches: yes
Group 1: 'hello'

Input:

(\w+)@(\w+)(?:\.com|\.net)
sam@test.net

Output:

Matches: yes
Group 1: 'sam'
Group 2: 'test'

code-challenge regular-expression

— Nathan Osman

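Considering the number of features in PCRE, this is a very demanding challenge. Recursion, backtracking, lookahead/assertions, unicode, conditional subpatterns, ...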

— Arnaud Le Blanc, '11

See the PCRE docs; the Perl RE docs; the PHP PCRE docs are great too.

— Arnaud Le Blanc

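@user300: The goal is to implement as much as possible. Obviously, implementing everything would be a bit too hard.

— Nathan Osman

@George: How about listing the features you require and giving some test cases, so that we can all give it our best shot?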

— Marko Dumic

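@George: I think @Marko was after specific features, or rather the minimal subset you want people to implement first. On the whole, PCRE really is far too hard a challenge for a casual coding competition. I suggest changing this to a very small, specific RE subset and posing the challenge of implementing that.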

— MtnViewMark

Python

Since implementing full PCRE is too much, I implemented only an essential subset of it.

Supports |.\.\w\W\s+*(). The input regexp must be correct.

Examples:

$ python regexp.py 
^\s*(\w+)$
   hello
Matches:     hello
Group 1 hello

$ python regexp.py
(a*)+
infinite loop

$ python regexp.py 
(\w+)@(\w+)(\.com|\.net)
sam@test.net
Matches:  sam@test.net
Group 1 sam
Group 2 test
Group 3 .net

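How it works:

For the detailed theory, read "Introduction to Automata Theory, Languages, and Computation".

The idea is to convert the original regular expression into a nondeterministic finite automaton (NFA). Actually, PCRE regular expressions describe at least context-free languages, for which we would need push-down automata, but we restrict ourselves to a subset of PCRE.

Finite automata are directed graphs in which the nodes are states, the edges are transitions, and each transition has a matching input. You start from a predefined start node. Whenever you receive an input that matches one of the transitions, you take that transition to a new state. If you reach a terminal node, the automaton is said to have accepted the input. In our case, the "input" on a transition is a matching function that returns true.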
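To make the graph picture concrete, here is a minimal sketch in Python 3 (the answer's own code is Python 2). It is not taken from the source below: the three-state automaton for the literal pattern "ab" and the names TRANSITIONS, START, TERMINAL and accepts are made up for illustration.

```python
# Hypothetical automaton for the pattern "ab":
# state 0 --'a'--> state 1 --'b'--> state 2 (terminal)
TRANSITIONS = {
    0: [(lambda ch: ch == 'a', 1)],
    1: [(lambda ch: ch == 'b', 2)],
    2: [],
}
START, TERMINAL = 0, 2


def accepts(text):
    """Follow matching transitions; accept once the terminal node is reached."""
    state = START
    for ch in text:
        if state == TERMINAL:
            break
        for matches, dest in TRANSITIONS[state]:
            if matches(ch):
                state = dest
                break
        else:
            return False  # no transition matched this input
    return state == TERMINAL


print(accepts("ab"))  # True
print(accepts("ac"))  # False
```

This toy version never has more than one matching transition per state, so it needs no backtracking.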

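They are called nondeterministic automata because sometimes more than one matching transition can be taken from the same state. In my implementation, all transitions into the same state must match the same thing, so I stored the matching function together with the destination state (states[dest][0]).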
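A small sketch of that layout, written for Python 3. The four-state table and the helper enterable are hypothetical; only the (match, next, name) tuple shape and the lookup via states[dest][0] follow the description above. On the input "a", two successors of the root can be entered, which is exactly the nondeterminism.

```python
def match_nothing(txt, pos):
    return True, pos


def match_dot(txt, pos):
    return pos < len(txt), pos + 1


def match_char(c):
    def matcher(txt, pos):
        return pos < len(txt) and txt[pos] == c, pos + 1
    return matcher


# states[i] = (function that decides whether i may be entered, successors, name)
states = {
    0: (match_nothing, [1, 2], 'root'),
    1: (match_char('a'), [3], 'char_a'),
    2: (match_dot, [3], 'dot'),
    3: (match_nothing, [], 'end'),
}


def enterable(state, txt, pos):
    """List the successors of `state` that can be entered at txt[pos]."""
    result = []
    for dest in states[state][1]:
        ok, npos = states[dest][0](txt, pos)  # match function lives on the destination
        if ok:
            result.append((states[dest][2], npos))
    return result


print(enterable(0, "a", 0))  # [('char_a', 1), ('dot', 1)]
print(enterable(0, "b", 0))  # [('dot', 1)]
```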

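We transform the regexp into a finite automaton using building blocks. A building block has a start node (first) and an end node (last), and matches something from the text (possibly the empty string).

The simplest examples include

matching nothing: True (first == last)
matching a character: c == txt[pos] (first == last)
matching the end of the string: pos == len(txt) (first == last)

You will also need the new position in the text at which to match the next token.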
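As a rough Python 3 illustration of why each matcher returns the new position along with the result, the sketch below threads the position through a list of matchers. The three matchers mirror the ones in the source code; run_sequence is a hypothetical helper that is not part of the answer.

```python
from functools import partial


def match_nothing(txt, pos):
    return True, pos


def match_character(c, txt, pos):
    return pos < len(txt) and txt[pos] == c, pos + 1


def match_end(txt, pos):
    return pos == len(txt), pos


def run_sequence(matchers, txt, pos=0):
    """Apply matchers left to right, feeding each one the position left by the previous."""
    for matcher in matchers:
        ok, pos = matcher(txt, pos)
        if not ok:
            return False, pos  # the position is meaningless after a failure
    return True, pos


hi = [partial(match_character, 'h'), partial(match_character, 'i'), match_end]
print(run_sequence(hi, "hi"))   # (True, 2)
print(run_sequence(hi, "hit"))  # (False, 2)
```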

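More complicated examples are (capital letters stand for blocks):

matching B+:
- create nodes: u, v (matching nothing)
- create transitions: u -> B.first, B.last -> v, v -> u
- when you reach node v you have already matched B. You then have two options: go further, or try to match B again.
matching A|B|C:
- create nodes: u, v (matching nothing)
- create transitions: u -> A.first, u -> B.first, u -> C.first
- create transitions: A.last -> v, B.last -> v, C.last -> v
- from u you can go to any block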
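The two constructions above can be sketched as follows in Python 3. This is a simplified illustration, not the answer's actual code: states are kept in a list of [match, successors] pairs, and new_state, char_block, plus_block, alt_block and the recursive checker reaches are all hypothetical names. reaches assumes the start state matches nothing, which holds for both constructions.

```python
def match_nothing(txt, pos):
    return True, pos


def new_state(states, match=match_nothing):
    """Append a state [match, successors] and return its index."""
    states.append([match, []])
    return len(states) - 1


def char_block(states, c):
    """Block matching one literal character; first == last."""
    s = new_state(states, lambda txt, pos: (pos < len(txt) and txt[pos] == c, pos + 1))
    return s, s


def plus_block(states, b):
    """B+: u -> B.first, B.last -> v, v -> u."""
    u, v = new_state(states), new_state(states)
    states[u][1].append(b[0])
    states[b[1]][1].append(v)
    states[v][1].append(u)
    return u, v


def alt_block(states, *blocks):
    """A|B|C: u -> X.first and X.last -> v for every alternative X."""
    u, v = new_state(states), new_state(states)
    for first, last in blocks:
        states[u][1].append(first)
        states[last][1].append(v)
    return u, v


def reaches(states, state, goal, txt, pos):
    """Recursive backtracking: can `goal` be reached with all of txt consumed?"""
    if state == goal and pos == len(txt):
        return True
    for nxt in states[state][1]:
        ok, npos = states[nxt][0](txt, pos)
        if ok and reaches(states, nxt, goal, txt, npos):
            return True
    return False


# (a|b)+
states = []
first, last = plus_block(states, alt_block(states, char_block(states, 'a'),
                                           char_block(states, 'b')))
print(reaches(states, first, last, "abba", 0))  # True
print(reaches(states, first, last, "abc", 0))   # False
```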

All regexp operators can be transformed like this. Try it yourself for *.
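One possible answer to that exercise, sketched in Python 3 with the same hypothetical list-of-states layout: B* is built with a bypass edge u -> v and a loop edge B.last -> u. This differs in detail from the wiring in the source code below, and it shares its weakness: if B itself can match the empty string, the search can loop forever.

```python
def match_nothing(txt, pos):
    return True, pos


def new_state(states, match=match_nothing):
    states.append([match, []])
    return len(states) - 1


def char_block(states, c):
    s = new_state(states, lambda txt, pos: (pos < len(txt) and txt[pos] == c, pos + 1))
    return s, s


def star_block(states, b):
    """B*: u -> B.first (try B first), u -> v (skip B), B.last -> u (repeat)."""
    u, v = new_state(states), new_state(states)
    states[u][1] += [b[0], v]
    states[b[1]][1].append(u)
    return u, v


def seq_block(states, a, b):
    """AB: A.last -> B.first."""
    states[a[1]][1].append(b[0])
    return a[0], b[1]


def reaches(states, state, goal, txt, pos):
    """Recursive backtracking: can `goal` be reached with all of txt consumed?"""
    if state == goal and pos == len(txt):
        return True
    for nxt in states[state][1]:
        ok, npos = states[nxt][0](txt, pos)
        if ok and reaches(states, nxt, goal, txt, npos):
            return True
    return False


# a*b
states = []
star = star_block(states, char_block(states, 'a'))
first, last = seq_block(states, star, char_block(states, 'b'))
print(reaches(states, first, last, "aab", 0))  # True
print(reaches(states, first, last, "b", 0))    # True
print(reaches(states, first, last, "aa", 0))   # False
```

Every cycle in this graph passes through B, so the search terminates as long as B consumes at least one character.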

The last part is parsing the regexp, which requires a very simple grammar:

 or: seq ('|' seq)*
 seq: empty
 seq: atom seq
 seq: paran seq
 paran: '(' or ')'

Hopefully, implementing a simple grammar (I think it is LL(1), but correct me if I'm wrong) is much easier than building the NFA.
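To illustrate, a recursive-descent parser for exactly this grammar might look like the Python 3 sketch below. It builds a nested tuple tree instead of NFA states, treats every character other than |, ( and ) as an atom, and ignores escapes and repetition operators, since the grammar as written does not mention them. The function names and the tree shape are assumptions for illustration only.

```python
def parse_or(regexp, pos=0):
    """or: seq ('|' seq)*   ->  ('or', [seq, ...])"""
    node, pos = parse_seq(regexp, pos)
    alternatives = [node]
    while pos < len(regexp) and regexp[pos] == '|':
        node, pos = parse_seq(regexp, pos + 1)
        alternatives.append(node)
    return ('or', alternatives), pos


def parse_seq(regexp, pos):
    """seq: empty | atom seq | paran seq   ->  ('seq', [item, ...])"""
    items = []
    while pos < len(regexp) and regexp[pos] not in '|)':
        if regexp[pos] == '(':
            node, pos = parse_paran(regexp, pos)
        else:
            node, pos = ('atom', regexp[pos]), pos + 1
        items.append(node)
    return ('seq', items), pos


def parse_paran(regexp, pos):
    """paran: '(' or ')'   ->  ('group', or_node)"""
    node, pos = parse_or(regexp, pos + 1)
    if pos >= len(regexp) or regexp[pos] != ')':
        raise ValueError('expected ) at position %d' % pos)
    return ('group', node), pos + 1


def parse(regexp):
    tree, pos = parse_or(regexp)
    if pos != len(regexp):
        raise ValueError('unexpected %r at position %d' % (regexp[pos], pos))
    return tree


print(parse('a|b'))
# ('or', [('seq', [('atom', 'a')]), ('seq', [('atom', 'b')])])
```

Each function decides what to do by looking at a single character, which is what makes a one-token-lookahead parser sufficient here.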

Once you have the NFA, you need to backtrack until you reach the terminal node.
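A stripped-down Python 3 sketch of that search, using an explicit stack. It returns the end position of the match instead of printing it, and the hand-built table for "ab|ac" is hypothetical; only the stack-entry layout (state, index of the next successor to try, text position) follows the source code below.

```python
def match_nothing(txt, pos):
    return True, pos


def match_char(c):
    return lambda txt, pos: (pos < len(txt) and txt[pos] == c, pos + 1)


def backtrack(states, start_state, terminal, txt, start):
    """Depth-first search; returns the end position of a match, or None."""
    stack = [[start_state, 0, start]]  # [state, next successor index, text position]
    while stack:
        state, tried, pos = stack[-1]
        if state == terminal:
            return pos
        successors = states[state][1]
        if tried == len(successors):
            stack.pop()        # dead end: return to the previous choice point
            continue
        stack[-1][1] += 1      # remember that this successor has been tried
        nstate = successors[tried]
        ok, npos = states[nstate][0](txt, pos)
        if ok:
            stack.append([nstate, 0, npos])
    return None


# Hand-built NFA for "ab|ac": on the input "ac" the first alternative fails
# after matching 'a', so the search backs up and tries the second one.
states = {
    0: (match_nothing, [1, 3]),
    1: (match_char('a'), [2]),
    2: (match_char('b'), [5]),
    3: (match_char('a'), [4]),
    4: (match_char('c'), [5]),
    5: (match_nothing, []),
}
print(backtrack(states, 0, 5, "ac", 0))  # 2
print(backtrack(states, 0, 5, "ad", 0))  # None
```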

Source code:

from functools import partial

WORDCHAR = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_'


def match_nothing(txt, pos):
  return True, pos

def match_character(c, txt, pos):
  return pos < len(txt) and txt[pos] == c, pos + 1

def match_space(txt, pos):
  return pos < len(txt) and txt[pos].isspace(), pos + 1

def match_word(txt, pos):
  return pos < len(txt) and txt[pos] in WORDCHAR, pos + 1

def match_nonword(txt, pos):
  return pos < len(txt) and txt[pos] not in WORDCHAR, pos + 1

def match_dot(txt, pos):
  return pos < len(txt), pos + 1

def match_start(txt, pos):
  return pos == 0, pos

def match_end(txt, pos):
  return pos == len(txt), pos


def create_state(states, match=None, last=None, next=None, name=None):
  if next is None: next = []
  if match is None: match = match_nothing

  state = len(states)
  states[state] = (match, next, name)
  if last is not None:
    states[last][1].append(state)

  return state


def compile_or(states, last, regexp, pos):
  mfirst = create_state(states, last=last, name='or_first')
  mlast = create_state(states, name='or_last')

  while True:
    pos, first, last = compile_seq(states, mfirst, regexp, pos)
    states[last][1].append(mlast)
    if pos != len(regexp) and regexp[pos] == '|':
      pos += 1
    else:
      assert pos == len(regexp) or regexp[pos] == ')'
      break

  return pos, mfirst, mlast


def compile_paren(states, last, regexp, pos):
  states.setdefault(-2, [])   # stores indexes
  states.setdefault(-1, [])   # stores text

  group = len(states[-1])
  states[-2].append(None)
  states[-1].append(None)

  def match_pfirst(txt, pos):
    states[-2][group] = pos
    return True, pos

  def match_plast(txt, pos):
    old = states[-2][group]
    states[-1][group] = txt[old:pos]
    return True, pos

  mfirst = create_state(states, match=match_pfirst, last=last, name='paren_first')
  mlast = create_state(states, match=match_plast, name='paren_last')

  pos, first, last = compile_or(states, mfirst, regexp, pos)
  assert regexp[pos] == ')'

  states[last][1].append(mlast)
  return pos + 1, mfirst, mlast


def compile_seq(states, last, regexp, pos):
  first = create_state(states, last=last, name='seq')
  last = first

  while pos < len(regexp):
    p = regexp[pos]
    if p == '\\':
      pos += 1
      p += regexp[pos]

    if p in '|)':
      break

    elif p == '(':
      pos, first, last = compile_paren(states, last, regexp, pos + 1)

    elif p in '+*':
      # first -> u ->...-> last -> v -> t
      # v -> first (matches at least once)
      # first -> t (skip on *)
      # u becomes new first
      # first is inserted before u

      u = create_state(states)
      v = create_state(states, next=[first])
      t = create_state(states, last=v)

      states[last][1].append(v)
      states[u] = states[first]
      # keep the (match, next, name) shape used by every other state
      states[first] = (match_nothing, [[u], [u, t]][p == '*'], 'repeat')

      last = t
      pos += 1

    else:  # simple states
      if p == '^':
        state = create_state(states, match=match_start, last=last, name='begin')
      elif p == '$':
        state = create_state(states, match=match_end, last=last, name='end')
      elif p == '.':
        state = create_state(states, match=match_dot, last=last, name='dot')
      elif p == '\\.':
        state = create_state(states, match=partial(match_character, '.'), last=last, name='dot')
      elif p == '\\s':
        state = create_state(states, match=match_space, last=last, name='space')
      elif p == '\\w':
        state = create_state(states, match=match_word, last=last, name='word')
      elif p == '\\W':
        state = create_state(states, match=match_nonword, last=last, name='nonword')
      elif p.isalnum() or p in '_@':
        state = create_state(states, match=partial(match_character, p), last=last, name='char_' + p)
      else:
        assert False

      first, last = state, state
      pos += 1

  return pos, first, last


def compile(regexp):
  states = {}
  pos, first, last = compile_or(states, create_state(states, name='root'), regexp, 0)
  assert pos == len(regexp)
  return states, last


def backtrack(states, last, string, start=None):
  if start is None:
    # try every start position, including the end of the string,
    # so that patterns matching the empty string can succeed
    for i in range(len(string) + 1):
      if backtrack(states, last, string, i):
        return True
    return False

  stack = [[0, 0, start]]   # state, pos in next, pos in text
  while stack:
    state = stack[-1][0]
    pos = stack[-1][2]
    #print 'in state', state, states[state]

    if state == last:
      print 'Matches: ', string[start:pos]
      # states[-1] only exists if the regexp contains a group
      groups = states.get(-1, [])
      for i in xrange(len(groups)):
        print 'Group', i + 1, groups[i]
      return True

    while stack[-1][1] < len(states[state][1]):
      nstate = states[state][1][stack[-1][1]]
      stack[-1][1] += 1

      ok, npos = states[nstate][0](string, pos)
      if ok:
        stack.append([nstate, 0, npos])
        break
      else:
        pass
        #print 'not matched', states[nstate][2]
    else:
      stack.pop()

  return False



# regexp = '(\\w+)@(\\w+)(\\.com|\\.net)'
# string = 'sam@test.net'
regexp = raw_input()
string = raw_input()

states, last = compile(regexp)
backtrack(states, last, string)

— Alexandru

+1 Wow... I tried to do this myself in PHP and failed completely.

— Nathan Osman

TIL (a+b)+ matches abaabaaabaaaab.

— Alexandru