Fastest way to check if a string matches a regexp in Ruby?

96

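What is the fastest way to check if a string matches a regular expression in Ruby?

My problem is that I have to "egrep" through a huge list of strings to find the ones that match a regexp given at run-time. I only care about whether the string matches the regexp, not where it matches, nor what the content of the matching groups is. I hope this assumption can be used to reduce the time my code spends matching regexps.

I load the regexp using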

pattern = Regexp.new(ptx).freeze

I have found that string =~ pattern is slightly faster than string.match(pattern).

Are there other tricks or shortcuts that can be used to make this test even faster?

ruby regex performance

— gioele
source

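If you don't care about the contents of the matching groups, why do you have them? You can make the regexp faster by converting them to non-capturing groups.

— Mark Thomas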

1

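Since the regexp is provided at run-time, I assume it is unconstrained, in which case it may contain internal references to its groups. Converting those groups to non-capturing by modifying the regexp could therefore change the result (unless you also check for internal references, but then the problem becomes increasingly complex). I find it curious that =~ would be faster than string.match.

— djconnel, 2012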
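To make the two comments above concrete, here is a minimal sketch. The patterns and sample strings are hypothetical and only illustrate the point: a capturing and a non-capturing variant give the same yes/no answer, but a pattern that uses a backreference depends on its capturing group.

```ruby
# Hypothetical patterns for illustration; not taken from the question.
capturing     = Regexp.new('(foo|bar)-(\d+)')
non_capturing = Regexp.new('(?:foo|bar)-(?:\d+)')
back_ref      = Regexp.new('(a|b)\1') # \1 refers back to group 1

samples = ["foo-42", "bar-7", "baz-1", "foo-"]

# Both variants agree on whether each sample matches.
same_answers = samples.all? do |s|
  (capturing =~ s).nil? == (non_capturing =~ s).nil?
end

# With a backreference, the group cannot simply be made non-capturing,
# because \1 needs the text captured by group 1.
doubled = !(back_ref =~ "xaay").nil? # "aa" satisfies (a|b)\1
mixed   = !(back_ref =~ "xaby").nil? # "ab" does not

puts same_answers, doubled, mixed
```

Whether the non-capturing form is measurably faster depends on the pattern and the Ruby version, so it is worth benchmarking before rewriting user-supplied patterns.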

What is the benefit of freezing the regexp here?

— Hardik

103

Starting with Ruby 2.4.0, you can use Regexp#match?:

pattern.match?(string)

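Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, because it avoids the object allocations performed by other methods such as Regexp#match and =~: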

Regexp#match?
Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~ to reduce object allocation.
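The difference described in the release note can be observed directly. The following is a minimal sketch, assuming Ruby 2.4 or later; the helper name match_side_effects is hypothetical and exists only for this illustration.

```ruby
# Returns what each call yields and what $~ holds afterwards.
def match_side_effects(pattern, string)
  "" =~ /x/                         # a failed match resets $~ to nil
  via_predicate   = pattern.match?(string)
  after_predicate = $~              # still nil: match? leaves $~ alone
  via_operator    = pattern =~ string
  after_operator  = $~              # MatchData set by =~
  [via_predicate, after_predicate, via_operator, after_operator]
end

pattern = Regexp.new('a+b').freeze
predicate, last_match_before, index, last_match_after =
  match_side_effects(pattern, "aacaabc")

p predicate         # true
p last_match_before # nil
p index             # 3
p last_match_after  # #<MatchData "aab">
```

Because match? neither builds a MatchData object nor updates $~, it suits the case in the question, where only the yes/no answer matters.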

— Wiktor Stribiżew
source

5

Thank you for the suggestion. I have updated the benchmark script, and Regexp#match? is indeed at least 50% faster than the other alternatives.

— gioele

74

Here is a simple benchmark:

require 'benchmark'

"test123" =~ /1/
=> 4
Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
=>   0.610000   0.000000   0.610000 (  0.578133)

"test123"[/1/]
=> "1"
Benchmark.measure{ 1000000.times { "test123"[/1/] } }
=>   0.718000   0.000000   0.718000 (  0.750010)

"test123".match(/1/)
=> #<MatchData "1">
Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
=>   1.703000   0.000000   1.703000 (  1.578146)

So =~ is faster, but it depends on what you want as the return value. If you only want to check whether the text matches the regexp, use =~.

— Dougui
source

2

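As I wrote, I had already found that =~ is faster than match, and that the gain is less dramatic when operating on bigger regexps. What I am wondering is whether there is any strange way to make this check even faster, maybe by exploiting some strange method in Regexp or some weird construct.

— gioele, 2012

I think there are no other solutions.

— Dougui, 2012

What about !("test123" !~ /1/)?

— ma11hew28, 2015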

1

@MattDiPasquale, negating twice should not be faster than "test123" =~ /1/

— Dougui

1

/1/.match?("test123") is faster than "test123" =~ /1/ when you only need to check whether the text matches the regexp.

— noraj

41

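This is the benchmark I ran after finding some articles around the net.

With 2.4.0 the winner is re.match?(str) (as suggested by @wiktor-stribiżew). On earlier versions, re =~ str seems to be the fastest, although str =~ re is almost as fast.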

#!/usr/bin/env ruby
require 'benchmark'

str = "aacaabc"
re = Regexp.new('a+b').freeze

N = 4_000_000

Benchmark.bm do |b|
    b.report("str.match re\t") { N.times { str.match re } }
    b.report("str =~ re\t")    { N.times { str =~ re } }
    b.report("str[re]  \t")    { N.times { str[re] } }
    b.report("re =~ str\t")    { N.times { re =~ str } }
    b.report("re.match str\t") { N.times { re.match str } }
    if re.respond_to?(:match?)
        b.report("re.match? str\t") { N.times { re.match? str } }
    end
end

Results for MRI 1.9.3-p551:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         2.390000   0.000000   2.390000 (  2.397331)
str =~ re         2.450000   0.000000   2.450000 (  2.446893)
str[re]           2.940000   0.010000   2.950000 (  2.941666)
re.match str      3.620000   0.000000   3.620000 (  3.619922)
str.match re      4.180000   0.000000   4.180000 (  4.180083)

Results for MRI 2.1.5:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         1.150000   0.000000   1.150000 (  1.144880)
str =~ re         1.160000   0.000000   1.160000 (  1.150691)
str[re]           1.330000   0.000000   1.330000 (  1.337064)
re.match str      2.250000   0.000000   2.250000 (  2.255142)
str.match re      2.270000   0.000000   2.270000 (  2.270948)

Results for MRI 2.3.3 (there seems to be a regression in regexp matching):

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         3.540000   0.000000   3.540000 (  3.535881)
str =~ re         3.560000   0.000000   3.560000 (  3.560657)
str[re]           4.300000   0.000000   4.300000 (  4.299403)
re.match str      5.210000   0.010000   5.220000 (  5.213041)
str.match re      6.000000   0.000000   6.000000 (  6.000465)

Results for MRI 2.4.0:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re.match? str     0.690000   0.010000   0.700000 (  0.682934)
re =~ str         1.040000   0.000000   1.040000 (  1.035863)
str =~ re         1.040000   0.000000   1.040000 (  1.042963)
str[re]           1.340000   0.000000   1.340000 (  1.339704)
re.match str      2.040000   0.000000   2.040000 (  2.046464)
str.match re      2.180000   0.000000   2.180000 (  2.174691)
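Applied to the original "egrep" use case, the results above suggest preferring match? where it exists and falling back to =~ elsewhere. A minimal sketch follows; the helper name egrep, the sample pattern and the sample lines are assumptions made for this example, not part of the benchmark script.

```ruby
# Hypothetical helper: returns the strings that match the pattern.
def egrep(strings, pattern)
  if pattern.respond_to?(:match?)
    strings.select { |s| pattern.match?(s) }    # Ruby >= 2.4
  else
    strings.select { |s| !(pattern =~ s).nil? } # older rubies
  end
end

ptx = 'a+b' # stands in for the pattern supplied at run-time
pattern = Regexp.new(ptx).freeze
lines = ["aacaabc", "xyz", "ab", "ba"]

matching = egrep(lines, pattern)
p matching # ["aacaabc", "ab"]
```

The respond_to? check runs once per call rather than once per string, so it adds no overhead inside the loop.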

— gioele
source

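Just a note: the literal forms are faster than these, for example /a+b/ =~ str and str =~ /a+b/. This holds even when they are called repeatedly through functions, and I consider the difference large enough to prefer literals over storing and freezing regular expressions in a variable. I tested my script with ruby 1.9.3p547, ruby 2.0.0p481 and ruby 2.1.4p265. It is possible that these improvements were made in later patches, but I have no plans yet to test with earlier versions or patches.

— konsolebox, 2014

I thought !(re !~ str) might be faster, but it isn't.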

— ma11hew28

7

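What about re === str (case compare)?

Since it evaluates to true or false and has no need to store matches, return the match index and so on, I wondered whether it would be an even faster way of matching than =~.

OK, I tested this. =~ is still faster, even if you have multiple capture groups, but === is faster than the other options.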
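For reference, a minimal sketch of what the case-equality form looks like in practice. The pattern, the sample strings and the classify helper are hypothetical; no timing claim is made here beyond what is reported above.

```ruby
re = Regexp.new('a+b').freeze

hit  = re === "aacaabc" # true
miss = re === "xyz"     # false

# case/when calls === on each when-clause, so a regexp works there too.
def classify(str, re)
  case str
  when re then :match
  else :no_match
  end
end

p hit, miss
p classify("aacaabc", re) # :match
p classify("xyz", re)     # :no_match
```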

By the way, what good is freeze? I couldn't measure any performance gain from it.

— 平子
source

The effects of freeze won't show up in the results, because it happens before the benchmark loops and acts on the pattern itself.

— Tin Man

4

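Depending on how complicated your regular expression is, you could possibly just use simple string slicing. I'm not sure how practical this is for your application, or whether it would actually offer any speed improvement.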

'testsentence'['stsen']
=> 'stsen' # evaluates to true
'testsentence'['koala']
=> nil # evaluates to false

— jimmydief
source

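I cannot use string slicing, because the regexp is provided at run-time and I have no control over it.

— gioele, 2012

You can use string slicing, just not slicing with a fixed string. Use a variable instead of a quoted string and it will still work.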

— Tin Man
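A minimal sketch of slicing with a variable, as the comment above suggests. It only applies when the run-time input is a literal substring rather than a pattern, which is an assumption that does not hold for the asker; the variable names and sample values are hypothetical.

```ruby
needle   = 'stsen' # hypothetical run-time value: plain text, not a regexp
haystack = 'testsentence'

sliced   = haystack[needle]          # "stsen", or nil if absent
included = haystack.include?(needle) # true or false

# If the input must be treated literally but a Regexp is required,
# Regexp.escape neutralises the metacharacters first.
literal_re  = Regexp.new(Regexp.escape('a+b'))
literal_hit = !(literal_re =~ 'x a+b y').nil? # true: the text "a+b" is present
pattern_hit = !(literal_re =~ 'aab').nil?     # false: "+" is no longer a quantifier

p sliced, included, literal_hit, pattern_hit
```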

3

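What I am wondering is whether there is any strange way to make this check even faster, maybe by exploiting some strange method in Regexp or some weird construct.

Regexp engines vary in how they implement searches, but in general, anchor your patterns for speed and avoid greedy matches, especially when searching long strings.

Until you're familiar with how a particular engine works, the best thing to do is to run benchmarks while adding and removing anchors, limiting searches, trying wildcards versus explicit matches, and so on.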
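As a starting point for that kind of experiment, here is a minimal sketch that times an anchored and an unanchored pattern with the standard Benchmark module. The haystack and both patterns are hypothetical. Note that the two patterns answer different questions (anywhere in the string versus only at the start), so anchoring is an option only when the anchored meaning is the one you want.

```ruby
require 'benchmark'

# Hypothetical data: a long string that does not start with the search text.
haystack   = ("x" * 10_000) + "needle"
unanchored = /needle/
anchored   = /\Aneedle/ # \A pins the match to the start of the string

n = 1_000
t_unanchored = Benchmark.realtime { n.times { unanchored =~ haystack } }
t_anchored   = Benchmark.realtime { n.times { anchored =~ haystack } }

puts format("unanchored: %.4fs  anchored: %.4fs", t_unanchored, t_anchored)
```

Timings vary between machines and Ruby versions, so treat the printed numbers as something to measure locally rather than a result to rely on.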

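The Fruity gem is very useful for quickly benchmarking things, because it's smart. Ruby's built-in Benchmark code is also useful, although if you're not careful you can write tests that fool you.

I've used both in many answers here on Stack Overflow, so you can search through my answers and you'll find lots of little tricks and results that give you ideas for writing faster code.

The biggest thing to remember is that it's bad to optimize your code prematurely, before you know where the slowdowns occur.

— the Tin Man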
source

0

To complement Wiktor Stribiżew's and Dougui's answers, I would add that /regex/.match?("string") is about as fast as "string".match?(/regex/).

Ruby 2.4.0 (10,000,000 iterations, ~2 s)

2.4.0 > require 'benchmark'
 => true 
2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
 => #<Benchmark::Tms:0x005563da1b1c80 @label="", @real=2.2060338060000504, @cstime=0.0, @cutime=0.0, @stime=0.04000000000000001, @utime=2.17, @total=2.21> 
2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
 => #<Benchmark::Tms:0x005563da139eb0 @label="", @real=2.260814556000696, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=2.2500000000000004, @total=2.2600000000000007>

Ruby 2.6.2 (100,000,000 iterations, ~20 s)

irb(main):001:0> require 'benchmark'
=> true
irb(main):005:0> Benchmark.measure{ 100000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x0000562bc83e3768 @label="", @real=24.60139879199778, @cstime=0.0, @cutime=0.0, @stime=0.010000999999999996, @utime=24.565644999999996, @total=24.575645999999995>
irb(main):004:0> Benchmark.measure{ 100000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x0000562bc846aee8 @label="", @real=24.634255946999474, @cstime=0.0, @cutime=0.0, @stime=0.010046, @utime=24.598276, @total=24.608321999999998>

Note: timings vary. Sometimes /regex/.match?("string") is faster and sometimes "string".match?(/regex/) is; the differences may simply be due to other activity on the machine.

— noraj
source