byte[] array pattern search


74

Does anyone know a good and effective way to search/match for a byte pattern in a byte[] array and then return the positions?

For example:

byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};

byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};

Answers:


54

May I suggest something that doesn't involve creating strings, copying arrays, or unsafe code:

using System;
using System.Collections.Generic;

static class ByteArrayRocks
{    
    static readonly int[] Empty = new int[0];

    public static int[] Locate (this byte[] self, byte[] candidate)
    {
        if (IsEmptyLocate(self, candidate))
            return Empty;

        var list = new List<int>();

        for (int i = 0; i < self.Length; i++)
        {
            if (!IsMatch(self, i, candidate))
                continue;

            list.Add(i);
        }

        return list.Count == 0 ? Empty : list.ToArray();
    }

    static bool IsMatch (byte[] array, int position, byte[] candidate)
    {
        if (candidate.Length > (array.Length - position))
            return false;

        for (int i = 0; i < candidate.Length; i++)
            if (array[position + i] != candidate[i])
                return false;

        return true;
    }

    static bool IsEmptyLocate (byte[] array, byte[] candidate)
    {
        return array == null
            || candidate == null
            || array.Length == 0
            || candidate.Length == 0
            || candidate.Length > array.Length;
    }

    static void Main()
    {
        var data = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
        var pattern = new byte[] { 12, 3, 5, 76, 8, 0, 6, 125 };

        foreach (var position in data.Locate(pattern))
            Console.WriteLine(position);
    }
}

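Edit (by IAbstract) - moving the contents of a post here since it is not an answer

Out of curiosity, I created a small benchmark with the different answers.

Here are the results for one million iterations: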

solution [Locate]:            00:00:00.7714027
solution [FindAll]:           00:00:03.5404399
solution [SearchBytePattern]: 00:00:01.1105190
solution [MatchBytePattern]:  00:00:03.0658212

3
Your solution is slow on large byte arrays.
Tomas

1
Looks good - I changed the Locate method to return IEnumerable<int> and replaced the list.Add bit with yield return, which simplifies the implementation and gets rid of "Empty".
Jeff
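The change described in this comment might look roughly like the sketch below. It is not the commenter's actual code: the class name ByteArrayLazySearch and the method name LocateLazy are made up for illustration, and it is written as a plain static method rather than an extension method.

```csharp
using System;
using System.Collections.Generic;

static class ByteArrayLazySearch
{
    // Lazily yields every index at which candidate occurs in self (overlapping matches included).
    // Hypothetical name; a sketch of the "yield return" variant, not the original author's code.
    public static IEnumerable<int> LocateLazy(byte[] self, byte[] candidate)
    {
        if (self == null || candidate == null || candidate.Length == 0 || candidate.Length > self.Length)
            yield break; // replaces the shared "Empty" array of the original

        int last = self.Length - candidate.Length;
        for (int i = 0; i <= last; i++)
        {
            if (IsMatch(self, i, candidate))
                yield return i;
        }
    }

    // Precondition: position + candidate.Length <= array.Length.
    static bool IsMatch(byte[] array, int position, byte[] candidate)
    {
        for (int i = 0; i < candidate.Length; i++)
            if (array[position + i] != candidate[i])
                return false;
        return true;
    }
}
```

Because the results are produced lazily, a caller that only needs the first match can stop enumerating early and skip the rest of the scan.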

What's wrong with converting it to a string? The OP didn't mention anything about speed/performance.
disklosr 2015

1
You could just implement the KMP algorithm; it's much more efficient.
Alex Zhukovskiy

28

Use the LINQ methods.

public static IEnumerable<int> PatternAt(byte[] source, byte[] pattern)
{
    for (int i = 0; i < source.Length; i++)
    {
        if (source.Skip(i).Take(pattern.Length).SequenceEqual(pattern))
        {
            yield return i;
        }
    }
}

Very simple!


3
But not particularly efficient, so it is suitable for most contexts but not all.
phoog

13

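Originally I posted some old code I had used, but I was curious about Jb Evain's benchmarks. I found that my solution was very slow. It appears that bruno conde's SearchBytePattern is the fastest. I could not figure out why, especially since he uses Array.Copy and an extension method. But the proof is in Jb's tests, so kudos to bruno.

I simplified things even further, so hopefully this will be the clearest and simplest solution. (All the hard work was done by bruno conde.) The enhancements are:

  • Buffer.BlockCopy
  • Array.IndexOf<byte>
  • a while loop instead of a for loop
  • a start index parameter
  • converted to an extension method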

    public static List<int> IndexOfSequence(this byte[] buffer, byte[] pattern, int startIndex)    
    {
       List<int> positions = new List<int>();
       int i = Array.IndexOf<byte>(buffer, pattern[0], startIndex);  
       while (i >= 0 && i <= buffer.Length - pattern.Length)  
       {
          byte[] segment = new byte[pattern.Length];
          Buffer.BlockCopy(buffer, i, segment, 0, pattern.Length);    
          if (segment.SequenceEqual<byte>(pattern))
               positions.Add(i);
          i = Array.IndexOf<byte>(buffer, pattern[0], i + 1);
       }
       return positions;    
    }
    

Note that the last statement in the while block should be i = Array.IndexOf<byte>(buffer, pattern[0], i + 1); rather than i = Array.IndexOf<byte>(buffer, pattern[0], i + pattern.Length);. See Johan's comment. A simple test shows this:

byte[] pattern = new byte[] {1, 2};
byte[] toBeSearched = new byte[] { 1, 1, 2, 1, 12 };

With i = Array.IndexOf<byte>(buffer, pattern[0], i + pattern.Length); nothing is returned. With i = Array.IndexOf<byte>(buffer, pattern[0], i + 1); the correct result is returned.
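To make the two-byte test above easy to reproduce, here is a small sketch that runs the same loop with either advance. The class name StrideDemo and the advanceByPatternLength parameter are illustrative only, and the comparison is done in place instead of through a copied segment.

```csharp
using System;
using System.Collections.Generic;

static class StrideDemo
{
    // Same control flow as IndexOfSequence, but the advance after each candidate
    // position is selectable so both variants can be compared side by side.
    public static List<int> Find(byte[] buffer, byte[] pattern, bool advanceByPatternLength)
    {
        var positions = new List<int>();
        int i = Array.IndexOf<byte>(buffer, pattern[0], 0);
        while (i >= 0 && i <= buffer.Length - pattern.Length)
        {
            // Compare in place; no temporary segment is allocated.
            bool equal = true;
            for (int k = 0; k < pattern.Length && equal; k++)
                equal = buffer[i + k] == pattern[k];
            if (equal)
                positions.Add(i);

            int next = i + (advanceByPatternLength ? pattern.Length : 1);
            if (next >= buffer.Length)
                break;
            i = Array.IndexOf<byte>(buffer, pattern[0], next);
        }
        return positions;
    }
}
```

With buffer { 1, 1, 2, 1, 12 } and pattern { 1, 2 }, advancing by pattern.Length jumps from index 0 to index 2 and never examines index 1, where the only match starts; advancing by 1 finds it.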


5
The line "i = Array.IndexOf<byte>(buffer, pattern[0], i + pattern.Length)" should probably be "i = Array.IndexOf<byte>(buffer, pattern[0], i + 1)". As it stands, data is skipped after a first byte has been found.
2012

12

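Use the efficient Boyer-Moore algorithm.

It is designed to find strings within strings, but it takes little imagination to apply it to byte arrays.

In general, the best answer is: use any string searching algorithm that you like :).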
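As an illustration of carrying a string search algorithm over to bytes, here is a minimal sketch of Boyer-Moore-Horspool, a simplified relative of Boyer-Moore that keeps only the bad-character shift table. It is not full Boyer-Moore, and the names HorspoolSearch and FindAll are invented for this example.

```csharp
using System;
using System.Collections.Generic;

static class HorspoolSearch
{
    // Boyer-Moore-Horspool over bytes. Returns all match positions, overlapping ones included.
    public static List<int> FindAll(byte[] haystack, byte[] needle)
    {
        var result = new List<int>();
        if (haystack == null || needle == null || needle.Length == 0 || needle.Length > haystack.Length)
            return result;

        int m = needle.Length;

        // shift[b] = how far the window may slide when the byte under its last position is b.
        var shift = new int[256];
        for (int b = 0; b < 256; b++)
            shift[b] = m;
        for (int k = 0; k < m - 1; k++)
            shift[needle[k]] = m - 1 - k;

        int pos = 0;
        while (pos <= haystack.Length - m)
        {
            // Compare the window right to left.
            int j = m - 1;
            while (j >= 0 && haystack[pos + j] == needle[j])
                j--;
            if (j < 0)
                result.Add(pos);

            // Slide according to the last byte of the current window.
            pos += shift[haystack[pos + m - 1]];
        }
        return result;
    }
}
```

The 256-entry table is what makes bytes a convenient alphabet here: it is small enough to build for every search, and on typical data most windows are skipped after a single comparison.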


12

Here is my proposal, simpler and faster:

int Search(byte[] src, byte[] pattern)
{
    int c = src.Length - pattern.Length + 1;
    int j;
    for (int i = 0; i < c; i++)
    {
        if (src[i] != pattern[0]) continue;
        for (j = pattern.Length - 1; j >= 1 && src[i + j] == pattern[j]; j--) ;
        if (j == 0) return i;
    }
    return -1;
}

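I don't really understand the logic, but it is faster than some of the above methods I tried.
2016

I just check the first byte and, when it matches, check the rest of the pattern. It could be faster still by comparing integers instead of bytes.
Ing. Gerardo Sánchez 2017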

7

My solution:

class Program
{
    public static void Main()
    {
        byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};

        byte[] toBeSearched = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125};

        List<int> positions = SearchBytePattern(pattern, toBeSearched);

        foreach (var item in positions)
        {
            Console.WriteLine("Pattern matched at pos {0}", item);
        }

    }

    static public List<int> SearchBytePattern(byte[] pattern, byte[] bytes)
    {
        List<int> positions = new List<int>();
        int patternLength = pattern.Length;
        int totalLength = bytes.Length;
        byte firstMatchByte = pattern[0];
        for (int i = 0; i < totalLength; i++)
        {
            if (firstMatchByte == bytes[i] && totalLength - i >= patternLength)
            {
                byte[] match = new byte[patternLength];
                Array.Copy(bytes, i, match, 0, patternLength);
                if (match.SequenceEqual<byte>(pattern))
                {
                    positions.Add(i);
                    i += patternLength - 1;
                }
            }
        }
        return positions;
    }
}

1
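Why the Array.Copy? It just makes things slower. I'm guessing it is only there because you want to use SequenceEqual, but that is a lot of work just to be able to use an extension method. The "i += patternLength - 1;" part is nice!
Davy Landman

4
You shouldn't give -1 to everyone just because their solution isn't perfect. In this situation you should only vote up the solution you think is best.
bruno conde

Won't this miss overlapping patterns? (e.g. BOB will only be found once in BOBOB)
Jeff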
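The BOBOB case from this comment can be checked with a small sketch in which the skip after a match is a flag. OverlapDemo and allowOverlap are illustrative names, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

static class OverlapDemo
{
    // Plain scan; after a match it advances by 1 (overlaps allowed)
    // or by pattern.Length (non-overlapping matches only).
    public static List<int> FindAll(byte[] bytes, byte[] pattern, bool allowOverlap)
    {
        var positions = new List<int>();
        if (pattern.Length == 0)
            return positions;

        int i = 0;
        while (i <= bytes.Length - pattern.Length)
        {
            int j = 0;
            while (j < pattern.Length && bytes[i + j] == pattern[j])
                j++;

            if (j == pattern.Length)
            {
                positions.Add(i);
                i += allowOverlap ? 1 : pattern.Length;
            }
            else
            {
                i++;
            }
        }
        return positions;
    }
}
```

For the bytes of "BOBOB" and the pattern "BOB", the overlapping variant reports positions 0 and 2, while the skipping variant reports only 0. Which behaviour is wanted depends on the use case, so it is worth stating explicitly in the method's documentation.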

It might be sped up a bit if you moved the byte[] allocation in front of the loop, since the pattern length stays the same for the whole loop.
user1132959 2013

4

I was missing a LINQ method/answer :-)

/// <summary>
/// Searches in the haystack array for the given needle using the default equality operator and returns the index at which the needle starts.
/// </summary>
/// <typeparam name="T">Type of the arrays.</typeparam>
/// <param name="haystack">Sequence to operate on.</param>
/// <param name="needle">Sequence to search for.</param>
/// <returns>Index of the needle within the haystack or -1 if the needle isn't contained.</returns>
public static IEnumerable<int> IndexOf<T>(this T[] haystack, T[] needle)
{
    if ((needle != null) && (haystack.Length >= needle.Length))
    {
        for (int l = 0; l < haystack.Length - needle.Length + 1; l++)
        {
            if (!needle.Where((data, index) => !haystack[l + index].Equals(data)).Any())
            {
                yield return l;
            }
        }
    }
}

3

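My version of Foubar's answer above, which avoids searching past the end of the haystack and allows a starting offset to be specified. It assumes the needle is neither empty nor longer than the haystack.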

public static unsafe long IndexOf(this byte[] haystack, byte[] needle, long startOffset = 0)
{ 
    fixed (byte* h = haystack) fixed (byte* n = needle)
    {
        for (byte* hNext = h + startOffset, hEnd = h + haystack.LongLength + 1 - needle.LongLength, nEnd = n + needle.LongLength; hNext < hEnd; hNext++)
            for (byte* hInc = hNext, nInc = n; *nInc == *hInc; hInc++)
                if (++nInc == nEnd)
                    return hNext - h;
        return -1;
    }
}

I used your IndexOf code in another answer (and gave you credit for it). Just thought you might want to know - you can find it here: stackoverflow.com/questions/31364114/...
gymbrall

3

If you are using .NET Core 2.1 or later (or a platform that supports .NET Standard 2.1 or later), you can use the MemoryExtensions.IndexOf extension method with the Span type:

int matchIndex = toBeSearched.AsSpan().IndexOf(pattern);

To find all occurrences, you could use something like:

public static IEnumerable<int> IndexesOf(this byte[] haystack, byte[] needle,
    int startIndex = 0, bool includeOverlapping = false)
{
    int matchIndex = haystack.AsSpan(startIndex).IndexOf(needle);
    while (matchIndex >= 0)
    {
        yield return startIndex + matchIndex;
        startIndex += matchIndex + (includeOverlapping ? 1 : needle.Length);
        matchIndex = haystack.AsSpan(startIndex).IndexOf(needle);
    }
}

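Unfortunately, the implementation in .NET Core 2.1 - 3.0 uses an iterated "optimized single-byte search on the first byte, then check the remainder" approach rather than a fast string search algorithm, but that could change in a future release.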
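For callers that already hold spans, an eager variant that collects the positions into a list may be convenient. The sketch below assumes a runtime where MemoryExtensions.IndexOf is available; the class name SpanSearch is made up, and the empty-needle guard is an addition that the iterator version above does not have.

```csharp
using System;
using System.Collections.Generic;

static class SpanSearch
{
    // Collects all match positions using the built-in span IndexOf.
    // byte[] arguments convert implicitly to ReadOnlySpan<byte>.
    public static List<int> IndexesOf(ReadOnlySpan<byte> haystack, ReadOnlySpan<byte> needle,
        bool includeOverlapping = false)
    {
        var result = new List<int>();
        if (needle.IsEmpty)
            return result; // avoids an endless loop on an empty needle

        int start = 0;
        while (start <= haystack.Length - needle.Length)
        {
            int relative = haystack.Slice(start).IndexOf(needle);
            if (relative < 0)
                break;

            result.Add(start + relative);
            start += relative + (includeOverlapping ? 1 : needle.Length);
        }
        return result;
    }
}
```

Taking ReadOnlySpan<byte> parameters means the same method can search a slice of a larger buffer without copying it; the trade-off is that the method cannot be an iterator, because span parameters are not allowed in iterator methods.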


2

Jb Evain's answer has:

 for (int i = 0; i < self.Length; i++) {
      if (!IsMatch (self, i, candidate))
           continue;
      list.Add (i);
 }

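and then the IsMatch function first checks whether candidate would run past the end of the array being searched.

It would be more efficient if the for loop were coded like this: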

     for (int i = 0, n = self.Length - candidate.Length + 1; i < n; ++i) {
          if (!IsMatch (self, i, candidate))
               continue;
          list.Add (i);
     }

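At this point you could also remove the test at the start of IsMatch, as long as you guarantee via preconditions that it is never called with "illegal" parameters. Note: an off-by-one bug was fixed in 2019.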
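A sketch of that tightened loop, with the length test removed from the match helper, could look as follows. LoopBoundDemo and IsMatchUnchecked are hypothetical names, and the early return for null or empty input is an assumption about how the precondition would be enforced.

```csharp
using System;
using System.Collections.Generic;

static class LoopBoundDemo
{
    public static List<int> Locate(byte[] self, byte[] candidate)
    {
        var list = new List<int>();
        if (self == null || candidate == null || candidate.Length == 0)
            return list;

        // n is the number of start positions at which candidate still fits.
        // It is zero or negative when candidate is longer than self, so the loop is skipped.
        for (int i = 0, n = self.Length - candidate.Length + 1; i < n; ++i)
        {
            if (IsMatchUnchecked(self, i, candidate))
                list.Add(i);
        }
        return list;
    }

    // Precondition (guaranteed by the loop bound): position + candidate.Length <= array.Length.
    static bool IsMatchUnchecked(byte[] array, int position, byte[] candidate)
    {
        for (int i = 0; i < candidate.Length; i++)
            if (array[position + i] != candidate[i])
                return false;
        return true;
    }
}
```

The "+ 1" is the part that is easy to get wrong: without it, a one-byte array searched for an identical one-byte candidate yields no match, which is the case raised in the comments on this answer.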


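The only problem with Stack Overflow is when something is wrong, because what are you going to do about it? I don't know. This has been here for over 10 years, but it has a bug. It is a good optimization, but there is a problem with it: an off-by-one. Imagine self.Length = 1 and candidate.Length = 1; even if they are identical, no match will be found. I will try to change it.
Cameron

@Cameron well spotted - edit approved with a minor change.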
Alnitak

2

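These are the simplest and fastest methods you can use, and there wouldn't be anything faster than these. It is unsafe, but that is what we use pointers for. So here I offer you the extension methods I use to search for a single occurrence and for a list of the indices of all occurrences. I would say this is the cleanest code here.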

    public static unsafe long IndexOf(this byte[] Haystack, byte[] Needle)
    {
        fixed (byte* H = Haystack) fixed (byte* N = Needle)
        {
            long i = 0;
            for (byte* hNext = H, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++)
            {
                bool Found = true;
                for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
                if (Found) return i;
            }
            return -1;
        }
    }
    public static unsafe List<long> IndexesOf(this byte[] Haystack, byte[] Needle)
    {
        List<long> Indexes = new List<long>();
        fixed (byte* H = Haystack) fixed (byte* N = Needle)
        {
            long i = 0;
            for (byte* hNext = H, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++)
            {
                bool Found = true;
                for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
                if (Found) Indexes.Add(i);
            }
            return Indexes;
        }
    }

Benchmarked against Locate, it is 1.2-1.4x faster.


1
It literally is unsafe, though, because it reads past the end of the haystack while comparing against the needle. Please see my version.
Dylan Nicholson 2015

1

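Here is my (not the most performant) solution. It relies on the fact that the bytes/Latin-1 conversion is lossless, which is not true for the bytes/ASCII or bytes/UTF-8 conversions.

Its advantages are that it works for any byte values (some other solutions handle bytes 0x80-0xff incorrectly) and that it can be extended to perform more advanced regex matching.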
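The lossless claim is easy to verify with a round trip over all 256 byte values. This is a small self-check sketch; Latin1RoundTrip and IsLossless are names made up for the example.

```csharp
using System;
using System.Text;

static class Latin1RoundTrip
{
    // Returns true if every byte value 0..255 survives bytes -> string -> bytes unchanged.
    public static bool IsLossless(Encoding encoding)
    {
        var all = new byte[256];
        for (int i = 0; i < all.Length; i++)
            all[i] = (byte)i;

        byte[] back = encoding.GetBytes(encoding.GetString(all));
        if (back.Length != all.Length)
            return false;

        for (int i = 0; i < all.Length; i++)
            if (back[i] != all[i])
                return false;
        return true;
    }
}
```

Calling IsLossless(Encoding.GetEncoding("iso-8859-1")) returns true, while Encoding.ASCII and Encoding.UTF8 both return false: ASCII replaces bytes 0x80-0xff with '?', and UTF-8 replaces invalid sequences with a replacement character that encodes to more than one byte.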

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class C {

  public static void Main() {
    byte[] data = {0, 100, 0, 255, 100, 0, 100, 0, 255};
    byte[] pattern = {0, 255};
    foreach (int i in FindAll(data, pattern)) {
      Console.WriteLine(i);
    }
  }

  public static IEnumerable<int> FindAll(
    byte[] haystack,
    byte[] needle
  ) {
    // bytes <-> latin-1 conversion is lossless
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    string sHaystack = latin1.GetString(haystack);
    string sNeedle = latin1.GetString(needle);
    for (Match m = Regex.Match(sHaystack, Regex.Escape(sNeedle));
         m.Success; m = m.NextMatch()) {
      yield return m.Index;
    }
  }
}

2
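You should not use strings and regular expressions for this kind of thing; it is just abusing them.
Davy Landman

1
Davy, your remark is highly subjective. Regex is the tool for pattern matching, and it is not my fault that the .NET implementation doesn't accept byte arrays directly. By the way, some regex libraries don't have this limitation.
Constantin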

1

I created a new function using the tips from my answer and from Alnitak's answer.

public static List<Int32> LocateSubset(Byte[] superSet, Byte[] subSet)
{
    if ((superSet == null) || (subSet == null))
    {
       throw new ArgumentNullException();
    }
    if ((superSet.Length < subSet.Length) || (superSet.Length == 0) || (subSet.Length == 0))
    {
        return new List<Int32>();
    }
    var result = new List<Int32>();
    Int32 currentIndex = 0;
    Int32 maxIndex =  superSet.Length - subSet.Length;
    while (currentIndex <= maxIndex) // <= so that a match ending at the last byte is still found
    {
         Int32 matchCount = CountMatches(superSet, currentIndex, subSet);
         if (matchCount ==  subSet.Length)
         {
            result.Add(currentIndex);
         }
         currentIndex++;
         if (matchCount > 0)
         {
            currentIndex += matchCount - 1;
         }
    }
    return result;
}

private static Int32 CountMatches(Byte[] superSet, int startIndex, Byte[] subSet)
{
    Int32 currentOffset = 0;
    while (currentOffset < subSet.Length)
    {
        if (superSet[startIndex + currentOffset] != subSet[currentOffset])
        {
            break;
        }
        currentOffset++;
    }
    return currentOffset;
}

The only part I'm not so happy about is the

         currentIndex++;
         if (matchCount > 0)
         {
            currentIndex += matchCount - 1;
         }

part... I would have liked to use an if/else to avoid the -1, but this version gives better branch prediction (although I'm not sure it matters that much).


1

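Why make the simple difficult? This can be done in any language using for loops. Here is one in C#:

using System;
using System.Collections.Generic;

namespace BinarySearch
{
    class Program
    {
        static void Main(string[] args)
        {
            byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
            byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
            List<int> occurences = findOccurences(toBeSearched, pattern);
            foreach (int occurence in occurences)
            {
                Console.WriteLine("Found match starting at 0-based index: " + occurence);
            }
        }

        static List<int> findOccurences(byte[] haystack, byte[] needle)
        {
            List<int> occurences = new List<int>();
            for (int i = 0; i < haystack.Length; i++)
            {
                if (needle[0] == haystack[i])
                {
                    bool found = true;
                    int j, k;
                    for (j = 0, k = i; j < needle.Length; j++, k++)
                    {
                        if (k >= haystack.Length || needle[j] != haystack[k])
                        {
                            found = false;
                            break;
                        }
                    }
                    if (found)
                    {
                        // Record the start of the match (the original added i - 1, which is one too low).
                        occurences.Add(i);
                        // Continue right after the match; the loop's i++ supplies the final step.
                        i = k - 1;
                    }
                }
            }
            return occurences;
        }
    }
}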

Your naive algorithm has a runtime of O(needle.Length * haystack.Length); an optimized algorithm has a runtime of O(needle.Length + haystack.Length).
CodesInChaos
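To make the O(needle.Length * haystack.Length) bound concrete, the sketch below counts the byte comparisons a naive scan performs. NaiveCost and CountComparisons are illustrative names; the worst case shown is a haystack of identical bytes and a needle that differs only in its last byte.

```csharp
using System;

static class NaiveCost
{
    // Counts how many byte comparisons a naive left-to-right scan performs.
    public static long CountComparisons(byte[] haystack, byte[] needle)
    {
        long comparisons = 0;
        for (int i = 0; i + needle.Length <= haystack.Length; i++)
        {
            for (int j = 0; j < needle.Length; j++)
            {
                comparisons++;
                if (haystack[i + j] != needle[j])
                    break; // mismatch: move on to the next start position
            }
        }
        return comparisons;
    }

    // Builds an array of 'length' bytes, all equal to 'fill' except the final byte.
    public static byte[] Filled(int length, byte fill, byte last)
    {
        var a = new byte[length];
        for (int i = 0; i < length; i++)
            a[i] = fill;
        a[length - 1] = last;
        return a;
    }
}
```

With a haystack of 1000 identical bytes and a 10-byte needle whose last byte differs, every one of the 991 start positions costs 10 comparisons, 9910 in total. On random data the inner loop usually stops after one or two comparisons, which is why the naive scan often benchmarks well despite its worst case.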

1

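Thanks for taking the time...

This is the code I was using/testing with before I asked my question... The reason I asked is that I was certain I was not using the optimal code for this... so thanks again for taking the time!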

   private static int CountPatternMatches(byte[] pattern, byte[] bytes)
   {
        int counter = 0;

        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] == pattern[0] && (i + pattern.Length) < bytes.Length)
            {
                for (int x = 1; x < pattern.Length; x++)
                {
                    if (pattern[x] != bytes[x+i])
                    {
                        break;
                    }

                    if (x == pattern.Length -1)
                    {
                        counter++;
                        i = i + pattern.Length;
                    }
                }
            }
        }

        return counter;
    }

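Does anyone see any errors in my code? Is this considered a hackish approach? I have tried almost every sample you have posted, and I seem to get some variation in the match results. I have been running my tests with a ~10 MB byte array as the toBeSearched array.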
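Reading the code above, three things stand out: the test (i + pattern.Length) < bytes.Length rejects a match that ends exactly at the last byte; a one-byte pattern is never counted because the inner loop starts at x = 1 and therefore never runs; and i = i + pattern.Length followed by the loop's i++ skips one extra byte. A possible corrected sketch, counting non-overlapping matches as the original seems to intend (the class name PatternCounter is made up):

```csharp
using System;

static class PatternCounter
{
    // Counts non-overlapping occurrences of pattern in bytes.
    public static int CountPatternMatches(byte[] pattern, byte[] bytes)
    {
        if (pattern == null || bytes == null || pattern.Length == 0)
            return 0;

        int counter = 0;
        int i = 0;
        // <= so that a match ending at the last byte is still considered.
        while (i <= bytes.Length - pattern.Length)
        {
            int x = 0;
            while (x < pattern.Length && bytes[i + x] == pattern[x])
                x++;

            if (x == pattern.Length)
            {
                counter++;
                i += pattern.Length; // continue directly after the match
            }
            else
            {
                i++;
            }
        }
        return counter;
    }
}
```

On the example arrays from the question the second occurrence starts at index 26 and ends at the last byte, so the original counts 1 while this version counts 2.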


1

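I would use a solution that does the matching by converting to a string...

You should write a simple function implementing the Knuth-Morris-Pratt search algorithm. This will be the fastest simple algorithm you can use to find the correct indexes. (You could use Boyer-Moore, but it requires more setup.)

After you have optimized the algorithm, you can look for other kinds of optimizations. But you should start with the basics.

For example, the current "fastest" is the Locate solution by Jb Evain.

If you look at the core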

    for (int i = 0; i < self.Length; i++) {
            if (!IsMatch (self, i, candidate))
                    continue;

            list.Add (i);
    }

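After the sub-algorithm finds a match, it starts looking for the next match at i + 1, but you already know that the first possible match would be at i + candidate.Length. So if you add,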

i += candidate.Length -2; //  -2 instead of -1 because the i++ will add the last index

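it will be faster when you expect many occurrences of the subset in the superset. (Bruno Conde already does this in his solution.)

But this is only half of the KMP algorithm; you should also add an extra parameter to the IsMatch method called numberOfValidMatches, which would be an out parameter.

This would result in the following: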

int validMatches = 0;
if (!IsMatch (self, i, candidate, out validMatches))
{
    // Skip the bytes that already matched; the loop's i++ supplies the final step.
    // Guard against validMatches == 0, otherwise i would never advance.
    if (validMatches > 0)
        i += validMatches - 1;
    continue;
}

static bool IsMatch (byte [] array, int position, byte [] candidate, out int numberOfValidMatches)
{
    numberOfValidMatches = 0;
    if (candidate.Length > (array.Length - position))
            return false;

    for (int i = 0; i < candidate.Length; i++)
    {
            if (array [position + i] != candidate [i])
                    return false;
            numberOfValidMatches++; 
    }

    return true;
}

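With a little refactoring you could use numberOfValidMatches as the loop variable and rewrite the Locate loop with a while to avoid the -2 and -1. But I just wanted to make clear how you could add the KMP algorithm.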
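For comparison, here is a minimal sketch of the complete Knuth-Morris-Pratt algorithm over bytes. Unlike the partial skip above, it uses a precomputed failure table to decide how far to fall back, so it does not miss overlapping matches. The names KmpSearch, FindAll and failure are chosen for this example only.

```csharp
using System;
using System.Collections.Generic;

static class KmpSearch
{
    // Returns every index at which needle occurs in haystack, overlapping matches included.
    public static List<int> FindAll(byte[] haystack, byte[] needle)
    {
        var result = new List<int>();
        if (haystack == null || needle == null || needle.Length == 0)
            return result;

        // failure[k] = length of the longest proper prefix of needle[0..k]
        // that is also a suffix of needle[0..k].
        var failure = new int[needle.Length];
        for (int k = 1, len = 0; k < needle.Length; k++)
        {
            while (len > 0 && needle[k] != needle[len])
                len = failure[len - 1];
            if (needle[k] == needle[len])
                len++;
            failure[k] = len;
        }

        // Scan the haystack once; 'matched' is the number of needle bytes currently matched.
        for (int i = 0, matched = 0; i < haystack.Length; i++)
        {
            while (matched > 0 && haystack[i] != needle[matched])
                matched = failure[matched - 1];
            if (haystack[i] == needle[matched])
                matched++;

            if (matched == needle.Length)
            {
                result.Add(i - needle.Length + 1);
                matched = failure[matched - 1]; // fall back so overlapping matches are found
            }
        }
        return result;
    }
}
```

The haystack index never moves backwards, which is where the O(needle.Length + haystack.Length) bound comes from. A case such as needle { 1, 1, 2 } in haystack { 1, 1, 1, 2 } shows the difference from skipping by the number of matched bytes: the match at index 1 is found here, whereas a skip of two positions after the partial match at index 0 would jump over it.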


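"but you already know that the first possible match would be at i + candidate.Length" - that's not true - the candidate pattern could contain repeats or loops that lead to overlapping matches.
Alnitak

That is the question; in my opinion you would only want complete, non-overlapping matches. Overlap is only possible if one or more bytes at the end of the candidate array match the first bytes of the candidate array.
Davy Landman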

1

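Speed isn't everything. Did you check them for consistency?

I didn't test all the code listed here. I tested my own code (which wasn't totally consistent, I admit) and IndexOfSequence. I found that in many tests IndexOfSequence was quite a bit faster than my code, but with repeated testing I found that it was less consistent. In particular it seems to have the most trouble finding patterns at the end of the array, but it sometimes misses them in the middle of the array too.

My test code isn't designed for efficiency; I just wanted a bunch of random data with some known strings inside it. The test pattern is roughly like a boundary marker in an HTTP form upload stream. That is what I was looking for when I ran across this code, so I figured I would test it with the kind of data I will be searching. It appears that the longer the pattern is, the more often IndexOfSequence misses a value.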

private static void TestMethod()
{
    Random rnd = new Random(DateTime.Now.Millisecond);
    string Pattern = "-------------------------------65498495198498";
    byte[] pattern = Encoding.ASCII.GetBytes(Pattern);

    byte[] testBytes;
    int count = 3;
    for (int i = 0; i < 100; i++)
    {
        StringBuilder TestString = new StringBuilder(2500);
        TestString.Append(Pattern);
        byte[] buf = new byte[1000];
        rnd.NextBytes(buf);
        TestString.Append(Encoding.ASCII.GetString(buf));
        TestString.Append(Pattern);
        rnd.NextBytes(buf);
        TestString.Append(Encoding.ASCII.GetString(buf));
        TestString.Append(Pattern);
        testBytes = Encoding.ASCII.GetBytes(TestString.ToString());

        List<int> idx = IndexOfSequence(ref testBytes, pattern, 0);
        if (idx.Count != count)
        {
            Console.Write("change from {0} to {1} on iteration {2}: ", count, idx.Count, i);
            foreach (int ix in idx)
            {
                Console.Write("{0}, ", ix);
            }
            Console.WriteLine();
            count = idx.Count;
        }
    }

    Console.WriteLine("Press ENTER to exit");
    Console.ReadLine();
}

(Obviously I converted IndexOfSequence from an extension method back into a normal method for this test.)

Here is a sample run of my output:

change from 3 to 2 on iteration 1: 0, 2090,
change from 2 to 3 on iteration 2: 0, 1045, 2090,
change from 3 to 2 on iteration 3: 0, 1045,
change from 2 to 3 on iteration 4: 0, 1045, 2090,
change from 3 to 2 on iteration 6: 0, 2090,
change from 2 to 3 on iteration 7: 0, 1045, 2090,
change from 3 to 2 on iteration 11: 0, 2090,
change from 2 to 3 on iteration 12: 0, 1045, 2090,
change from 3 to 2 on iteration 14: 0, 2090,
change from 2 to 3 on iteration 16: 0, 1045, 2090,
change from 3 to 2 on iteration 17: 0, 1045,
change from 2 to 3 on iteration 18: 0, 1045, 2090,
change from 3 to 1 on iteration 20: 0,
change from 1 to 3 on iteration 21: 0, 1045, 2090,
change from 3 to 2 on iteration 22: 0, 2090,
change from 2 to 3 on iteration 23: 0, 1045, 2090,
change from 3 to 2 on iteration 24: 0, 2090,
change from 2 to 3 on iteration 25: 0, 1045, 2090,
change from 3 to 2 on iteration 26: 0, 2090,
change from 2 to 3 on iteration 27: 0, 1045, 2090,
change from 3 to 2 on iteration 43: 0, 1045,
change from 2 to 3 on iteration 44: 0, 1045, 2090,
change from 3 to 2 on iteration 48: 0, 1045,
change from 2 to 3 on iteration 49: 0, 1045, 2090,
change from 3 to 2 on iteration 50: 0, 2090,
change from 2 to 3 on iteration 52: 0, 1045, 2090,
change from 3 to 2 on iteration 54: 0, 1045,
change from 2 to 3 on iteration 57: 0, 1045, 2090,
change from 3 to 2 on iteration 62: 0, 1045,
change from 2 to 3 on iteration 63: 0, 1045, 2090,
change from 3 to 2 on iteration 72: 0, 2090,
change from 2 to 3 on iteration 73: 0, 1045, 2090,
change from 3 to 2 on iteration 75: 0, 2090,
change from 2 to 3 on iteration 76: 0, 1045, 2090,
change from 3 to 2 on iteration 78: 0, 1045,
change from 2 to 3 on iteration 79: 0, 1045, 2090,
change from 3 to 2 on iteration 81: 0, 2090,
change from 2 to 3 on iteration 82: 0, 1045, 2090,
change from 3 to 2 on iteration 85: 0, 2090,
change from 2 to 3 on iteration 86: 0, 1045, 2090,
change from 3 to 2 on iteration 89: 0, 2090,
change from 2 to 3 on iteration 90: 0, 1045, 2090,
change from 3 to 2 on iteration 91: 0, 2090,
change from 2 to 1 on iteration 92: 0,
change from 1 to 3 on iteration 93: 0, 1045, 2090,
change from 3 to 1 on iteration 99: 0,

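I don't mean to pick on IndexOfSequence; it just happened to be the one I started working with today. At the end of the day I noticed that it seemed to be missing patterns in the data, so I wrote my own pattern matcher tonight. It is not as fast, though. I'm going to tweak it a bit more to see whether I can get it 100% consistent before I post it.

I just wanted to remind everyone that they should test things like this to make sure they give good, repeatable results before trusting them in production code.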
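One way to follow that advice is a differential test: run the implementation under test and a deliberately simple reference on the same random inputs and count disagreements. The sketch below is only an outline; SearchConsistencyCheck, NaiveFindAll and CountMismatches are hypothetical names, the reference reports overlapping matches, and a fixed seed keeps failures reproducible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SearchConsistencyCheck
{
    // Reference implementation: deliberately simple so that it is easy to trust.
    public static List<int> NaiveFindAll(byte[] haystack, byte[] needle)
    {
        var result = new List<int>();
        for (int i = 0; i + needle.Length <= haystack.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
                j++;
            if (j == needle.Length)
                result.Add(i);
        }
        return result;
    }

    // Runs 'candidate' against the reference on random inputs and returns the number of disagreements.
    public static int CountMismatches(Func<byte[], byte[], List<int>> candidate, int rounds, int seed)
    {
        var rnd = new Random(seed);
        int mismatches = 0;
        for (int r = 0; r < rounds; r++)
        {
            // A tiny alphabet (values 0..2) makes partial and overlapping matches frequent.
            var haystack = new byte[rnd.Next(1, 200)];
            for (int i = 0; i < haystack.Length; i++)
                haystack[i] = (byte)rnd.Next(0, 3);

            var needle = new byte[rnd.Next(1, 6)];
            for (int i = 0; i < needle.Length; i++)
                needle[i] = (byte)rnd.Next(0, 3);

            List<int> expected = NaiveFindAll(haystack, needle);
            List<int> actual = candidate(haystack, needle);
            if (!expected.SequenceEqual(actual))
                mismatches++;
        }
        return mismatches;
    }
}
```

A call such as SearchConsistencyCheck.CountMismatches(MySearch, 1000, 12345), where MySearch stands for the method under test, should return 0. Implementations that intentionally skip overlapping matches will disagree with this reference, so the reference has to be adapted to the behaviour that is actually wanted.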


1

I tried various solutions and ended up modifying SearchBytePattern. I tested it on a 30k sequence, and it is fast :)

    static public int SearchBytePattern(byte[] pattern, byte[] bytes)
    {
        int matches = 0;
        for (int i = 0; i < bytes.Length; i++)
        {
            if (pattern[0] == bytes[i] && bytes.Length - i >= pattern.Length)
            {
                bool ismatch = true;
                for (int j = 1; j < pattern.Length && ismatch == true; j++)
                {
                    if (bytes[i + j] != pattern[j])
                        ismatch = false;
                }
                if (ismatch)
                {
                    matches++;
                    i += pattern.Length - 1;
                }
            }
        }
        return matches;
    }

Let me know your thoughts.


1

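Here is a solution I came up with. I included the notes I made along the way during the implementation. It can match forwards and backwards, with different increment/decrement amounts (the direction), starting at any offset in the haystack.

Any input would be great!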

    /// <summary>
    /// Matches a byte array to another byte array
    /// forwards or reverse
    /// </summary>
    /// <param name="a">byte array</param>
    /// <param name="offset">start offset</param>
    /// <param name="len">max length</param>
    /// <param name="b">byte array</param>
    /// <param name="direction">to move each iteration</param>
    /// <returns>true if all bytes match, otherwise false</returns>
    internal static bool Matches(ref byte[] a, int offset, int len, ref byte[] b, int direction = 1)
    {
        #region Only Matched from offset Within a and b, could not differ, e.g. if you wanted to mach in reverse for only part of a in some of b that would not work
        //if (direction == 0) throw new ArgumentException("direction");
        //for (; offset < len; offset += direction) if (a[offset] != b[offset]) return false;
        //return true;
        #endregion
        //Will match if b contains len of a and return a a index of positive value
        return IndexOfBytes(ref a, ref offset, len, ref b, len) != -1;
    }

///Here is the Implementation code

    /// <summary>
    /// Swaps two integers without using a temporary variable
    /// </summary>
    /// <param name="a"></param>
    /// <param name="b"></param>
    internal static void Swap(ref int a, ref int b)
    {
        a ^= b;
        b ^= a;
        a ^= b;
    }

    /// <summary>
    /// Swaps two bytes without using a temporary variable
    /// </summary>
    /// <param name="a"></param>
    /// <param name="b"></param>
    internal static void Swap(ref byte a, ref byte b)
    {
        a ^= b;
        b ^= a;
        a ^= b;
    }

    /// <summary>
    /// Can be used to find whether an array starts with, ends with, matches at a given spot, or completely contains a sub byte array
    /// Set checkLength to the number of bytes from the needle you want to match; start at 0 for forward searches, start at hayStack.Length - 1 for reverse matches
    /// </summary>
    /// <param name="a">Needle</param>
    /// <param name="offset">Start in Haystack</param>
    /// <param name="len">Length of required match</param>
    /// <param name="b">Haystack</param>
    /// <param name="direction">Which way to move the iterator</param>
    /// <returns>Index if found, otherwise -1</returns>
    internal static int IndexOfBytes(ref byte[] needle, ref int offset, int checkLength, ref byte[] haystack, int direction = 1)
    {
        //If the direction is == 0 we would spin forever making no progress
        if (direction == 0) throw new ArgumentException("direction");
        //Cache the length of the needle and the haystack, setup the endIndex for a reverse search
        int needleLength = needle.Length, haystackLength = haystack.Length, endIndex = 0, workingOffset = offset;
        //Allocate a value for the endIndex and workingOffset
        //If we are going forward then the bound is the haystackLength
        if (direction >= 1) endIndex = haystackLength;
        #region [Optomization - Not Required]
        //{

            //I though this was required for partial matching but it seems it is not needed in this form
            //workingOffset = needleLength - checkLength;
        //}
        #endregion
        else Swap(ref workingOffset, ref endIndex);                
        #region [Optomization - Not Required]
        //{ 
            //Otherwise we are going in reverse and the endIndex is the needleLength - checkLength                   
            //I though the length had to be adjusted but it seems it is not needed in this form
            //endIndex = needleLength - checkLength;
        //}
        #endregion
        #region [Optomized to above]
        //Allocate a value for the endIndex
        //endIndex = direction >= 1 ? haystackLength : needleLength - checkLength,
        //Determine the workingOffset
        //workingOffset = offset > needleLength ? offset : needleLength;            
        //If we are doing in reverse swap the two
        //if (workingOffset > endIndex) Swap(ref workingOffset, ref endIndex);
        //Else we are going in forward direction do the offset is adjusted by the length of the check
        //else workingOffset -= checkLength;
        //Start at the checkIndex (workingOffset) every search attempt
        #endregion
        //Save the checkIndex (used after the for loop is done with it to determine if the match was checkLength long)
        int checkIndex = workingOffset;
        #region [For Loop Version]
        ///Optomized with while (single op)
        ///for (int checkIndex = workingOffset; checkIndex < endIndex; offset += direction, checkIndex = workingOffset)
            ///{
                ///Start at the checkIndex
                /// While the checkIndex < checkLength move forward
                /// If NOT (the needle at the checkIndex matched the haystack at the offset + checkIndex) BREAK ELSE we have a match continue the search                
                /// for (; checkIndex < checkLength; ++checkIndex) if (needle[checkIndex] != haystack[offset + checkIndex]) break; else continue;
                /// If the match was the length of the check
                /// if (checkIndex == checkLength) return offset; //We are done matching
            ///}
        #endregion
        //While the checkIndex < endIndex
        while (checkIndex < endIndex)
        {
            for (; checkIndex < checkLength; ++checkIndex) if (needle[checkIndex] != haystack[offset + checkIndex]) break; else continue;
            //If the match was the length of the check
            if (checkIndex == checkLength) return offset; //We are done matching
            //Move the offset by the direction, reset the checkIndex to the workingOffset
            offset += direction; checkIndex = workingOffset;                
        }
        //We did not have a match with the given options
        return -1;
    }

1

I'm a little late to the party. How about using the Boyer-Moore algorithm, but searching for bytes instead of strings? C# code below.

EyeCode Inc.

class Program {
    static void Main(string[] args) {
        byte[] text         =  new byte[] {12,3,5,76,8,0,6,125,23,36,43,76,125,56,34,234,12,4,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,123};
        byte[] pattern      = new byte[] {12,3,5,76,8,0,6,125};

        BoyerMoore tmpSearch = new BoyerMoore(pattern,text);

        Console.WriteLine(tmpSearch.Match());
        Console.ReadKey();
    }

    public class BoyerMoore {

        private static int ALPHABET_SIZE = 256;

        private byte[] text;
        private byte[] pattern;

        private int[] last;
        private int[] match;
        private int[] suffix;

        public BoyerMoore(byte[] pattern, byte[] text) {
            this.text = text;
            this.pattern = pattern;
            last = new int[ALPHABET_SIZE];
            match = new int[pattern.Length];
            suffix = new int[pattern.Length];
        }


        /**
        * Searches the pattern in the text.
        * returns the position of the first occurrence, if found and -1 otherwise.
        */
        public int Match() {
            // Preprocessing
            ComputeLast();
            ComputeMatch();

            // Searching
            int i = pattern.Length - 1;
            int j = pattern.Length - 1;    
            while (i < text.Length) {
                if (pattern[j] == text[i]) {
                    if (j == 0) { 
                        return i;
                    }
                    j--;
                    i--;
                } 
                else {
                  i += pattern.Length - j - 1 + Math.Max(j - last[text[i]], match[j]);
                  j = pattern.Length - 1;
              }
            }
            return -1;    
          }  


        /**
        * Computes the function last and stores its values in the array last.
        * last(Char ch) = the index of the right-most occurrence of the character ch
        *                                                           in the pattern; 
        *                 -1 if ch does not occur in the pattern.
        */
        private void ComputeLast() {
            for (int k = 0; k < last.Length; k++) { 
                last[k] = -1;
            }
            for (int j = pattern.Length-1; j >= 0; j--) {
                if (last[pattern[j]] < 0) {
                    last[pattern[j]] = j;
                }
            }
        }


        /**
        * Computes the function match and stores its values in the array match.
        * match(j) = min{ s | 0 < s <= j && p[j-s]!=p[j]
        *                            && p[j-s+1]..p[m-s-1] is suffix of p[j+1]..p[m-1] }, 
        *                                                         if such s exists, else
        *            min{ s | j+1 <= s <= m 
        *                            && p[0]..p[m-s-1] is suffix of p[j+1]..p[m-1] }, 
        *                                                         if such s exists,
        *            m, otherwise,
        * where p is the pattern and m is its length.
        */
        private void ComputeMatch() {
            /* Phase 1 */
            for (int j = 0; j < match.Length; j++) { 
                match[j] = match.Length;
            } //O(m) 

            ComputeSuffix(); //O(m)

            /* Phase 2 */
            //Uses an auxiliary array, backwards version of the KMP failure function.
            //suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],
            //if there is no such j, suffix[i] = m

            //Compute the smallest shift s, such that 0 < s <= j and
            //p[j-s]!=p[j] and p[j-s+1..m-s-1] is suffix of p[j+1..m-1] or j == m-1}, 
            //                                                         if such s exists,
            for (int i = 0; i < match.Length - 1; i++) {
                int j = suffix[i + 1] - 1; // suffix[i+1] <= suffix[i] + 1
                if (suffix[i] > j) { // therefore pattern[i] != pattern[j]
                    match[j] = j - i;
                } 
                else {// j == suffix[i]
                    match[j] = Math.Min(j - i + match[i], match[j]);
                }
            }

            /* Phase 3 */
            //Uses the suffix array to compute each shift s such that
            //p[0..m-s-1] is a suffix of p[j+1..m-1] with j < s < m
            //and stores the minimum of this shift and the previously computed one.
            if (suffix[0] < pattern.Length) {
                for (int j = suffix[0] - 1; j >= 0; j--) {
                    if (suffix[0] < match[j]) { match[j] = suffix[0]; }
                }
                {
                    int j = suffix[0];
                    for (int k = suffix[j]; k < pattern.Length; k = suffix[k]) {
                        while (j < k) {
                            if (match[j] > k) {
                                match[j] = k;
                            }
                            j++;
                        }
                    }
                }
            }
        }


        /**
        * Computes the values of suffix, which is an auxiliary array, 
        * backwards version of the KMP failure function.
        * 
        * suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],
        * if there is no such j, suffix[i] = m, i.e.
        * p[suffix[i]..m-1] is the longest prefix of p[i..m-1], if suffix[i] < m.
        */
        private void ComputeSuffix() {        
            suffix[suffix.Length-1] = suffix.Length;            
            int j = suffix.Length - 1;
            for (int i = suffix.Length - 2; i >= 0; i--) {  
                while (j < suffix.Length - 1 && !pattern[j].Equals(pattern[i])) {
                    j = suffix[j + 1] - 1;
                }
                if (pattern[j] == pattern[i]) { 
                    j--; 
                }
                suffix[i] = j + 1;
            }
        }

    }

}

1

You can use ORegex:

var oregex = new ORegex<byte>("{0}{1}{2}", x=> x==12, x=> x==3, x=> x==5);
var toSearch = new byte[]{1,1,12,3,5,1,12,3,5,5,5,5};

var found = oregex.Matches(toSearch);

Two matches will be found:

i:2;l:3
i:6;l:3

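Complexity: O(n * m) in the worst case; in practice it is O(n) because of the internal state machine. In some cases it is faster than .NET Regex. It is compact, fast, and designed specifically for array pattern matching.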


0

Here's a little code I wrote using only basic data types (it returns the index of the first occurrence):

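private static int findMatch(byte[] data, byte[] pattern) {
    if (pattern.length == 0 || pattern.length > data.length) {
        return -1;
    }
    // Try every start position at which the whole pattern still fits,
    // so data[i + j] can never run past the end of the array.
    for (int i = 0; i <= data.length - pattern.length; i++) {
        int j = 0;
        // Compare with a separate offset; i itself only advances one step at a time,
        // so a match that begins inside a failed partial match is not skipped.
        while (j < pattern.length && data[i + j] == pattern[j]) {
            j++;
        }
        if (j == pattern.length) {
            System.out.println("Pattern found at : " + i);
            return i;
        }
    }

    return -1;
}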

The beginning of your answer reminded me of a song: Here's a little code I wrote, you might want to see it node for node, don't worry, be happy
Davi Fiamenghi, 2013

0

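Another answer that is easy to follow and fairly efficient for an O(n) type of operation, without using unsafe code or copying parts of the source arrays.

Be sure to test. Some of the suggestions in this thread break on edge cases.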

    static void Main(string[] args)
    {
        //                                                         1   1  1  1  1  1  1  1  1  1  2   2   2
        //                           0  1  2  3  4  5  6  7  8  9  0   1  2  3  4  5  6  7  8  9  0   1   2  3  4  5  6  7  8  9
        byte[] buffer = new byte[] { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 5, 5, 0, 5, 5, 1, 2 };
        byte[] beginPattern = new byte[] { 1, 0, 2 };
        byte[] middlePattern = new byte[] { 8, 9, 10 };
        byte[] endPattern = new byte[] { 9, 10, 11 };
        byte[] wholePattern = new byte[] { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 };
        byte[] noMatchPattern = new byte[] { 7, 7, 7 };

        int beginIndex = ByteArrayPatternIndex(buffer, beginPattern);
        int middleIndex = ByteArrayPatternIndex(buffer, middlePattern);
        int endIndex = ByteArrayPatternIndex(buffer, endPattern);
        int wholeIndex = ByteArrayPatternIndex(buffer, wholePattern);
        int noMatchIndex = ByteArrayPatternIndex(buffer, noMatchPattern);
    }

    /// <summary>
    /// Returns the index of the first occurrence of a byte array within another byte array
    /// </summary>
    /// <param name="buffer">The byte array to be searched</param>
    /// <param name="pattern">The byte array that contains the pattern to be found</param>
    /// <returns>If buffer contains pattern then the index of the first occurrence of pattern within buffer; otherwise, -1</returns>
    public static int ByteArrayPatternIndex(byte[] buffer, byte[] pattern)
    {
        if (buffer != null && pattern != null && pattern.Length > 0 && pattern.Length <= buffer.Length)
        {
            int resumeIndex;
            for (int i = 0; i <= buffer.Length - pattern.Length; i++)
            {
                if (buffer[i] == pattern[0]) // Current byte equals first byte of pattern
                {
                    if (pattern.Length == 1)  // Single-byte pattern: the inner loop below would never run
                        return i;
                    resumeIndex = 0;
                    for (int x = 1; x < pattern.Length; x++)
                    {
                        if (buffer[i + x] == pattern[x])
                        {
                            if (x == pattern.Length - 1)  // Matched the entire pattern
                                return i;
                            else if (resumeIndex == 0 && buffer[i + x] == pattern[0])  // The current byte equals the first byte of the pattern so start here on the next outer loop iteration
                                resumeIndex = i + x;
                        }
                        else
                        {
                            if (resumeIndex > 0)
                                i = resumeIndex - 1;  // The outer loop iterator will increment so subtract one
                            else if (x > 1)
                                i += (x - 1);  // Advance the outer loop variable since we already checked these bytes
                            break;
                        }
                    }
                }
            }
        }
        return -1;
    }

    /// <summary>
    /// Returns the indexes of each occurrence of a byte array within another byte array
    /// </summary>
    /// <param name="buffer">The byte array to be searched</param>
    /// <param name="pattern">The byte array that contains the pattern to be found</param>
    /// <returns>If buffer contains pattern then the indexes of the occurrences of pattern within buffer; otherwise, null</returns>
    /// <remarks>Matches may overlap. For example, searching for 1,2,1 in 1,2,1,2,1 returns 0 and 2.</remarks>
    public static int[] ByteArrayPatternIndexes(byte[] buffer, byte[] pattern)
    {
        if (buffer != null && pattern != null && pattern.Length > 0 && pattern.Length <= buffer.Length)
        {
            List<int> indexes = new List<int>();
            int resumeIndex;
            for (int i = 0; i <= buffer.Length - pattern.Length; i++)
            {
                if (buffer[i] == pattern[0]) // Current byte equals first byte of pattern
                {
                    if (pattern.Length == 1)  // Single-byte pattern: the inner loop below would never run
                    {
                        indexes.Add(i);
                        continue;
                    }
                    resumeIndex = 0;
                    for (int x = 1; x < pattern.Length; x++)
                    {
                        if (buffer[i + x] == pattern[x])
                        {
                            if (x == pattern.Length - 1)  // Matched the entire pattern
                                indexes.Add(i);
                            else if (resumeIndex == 0 && buffer[i + x] == pattern[0])  // The current byte equals the first byte of the pattern so start here on the next outer loop iteration
                                resumeIndex = i + x;
                        }
                        else
                        {
                            if (resumeIndex > 0)
                                i = resumeIndex - 1;  // The outer loop iterator will increment so subtract one
                            else if (x > 1)
                                i += (x - 1);  // Advance the outer loop variable since we already checked these bytes
                            break;
                        }
                    }
                }
            }
            if (indexes.Count > 0)
                return indexes.ToArray();
        }
        return null;
    }
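
Following the advice above to test, here is a minimal sketch of a cross-check against a deliberately naive reference search. The names `PatternSearchChecks`, `NaiveIndexOf` and `Check` are made up for this illustration and do not come from any answer in this thread; the inputs are edge cases that simple hand-written searches commonly get wrong.

```csharp
using System;

static class PatternSearchChecks
{
    // Reference implementation: deliberately naive, used only to cross-check other versions.
    public static int NaiveIndexOf(byte[] buffer, byte[] pattern)
    {
        if (buffer == null || pattern == null || pattern.Length == 0)
            return -1;

        // The loop does not run at all when the pattern is longer than the buffer.
        for (int i = 0; i <= buffer.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && buffer[i + j] == pattern[j])
                j++;
            if (j == pattern.Length)
                return i;
        }
        return -1;
    }

    // Runs the implementation under test and compares it with the reference.
    static void Check(Func<byte[], byte[], int> search, byte[] buffer, byte[] pattern)
    {
        int expected = NaiveIndexOf(buffer, pattern);
        int actual = search(buffer, pattern);
        Console.WriteLine("{0}: expected {1}, got {2}", expected == actual ? "ok" : "FAIL", expected, actual);
    }

    static void Main()
    {
        // Replace NaiveIndexOf with the implementation you want to check.
        Func<byte[], byte[], int> search = NaiveIndexOf;

        Check(search, new byte[] { 1, 1, 1, 2 }, new byte[] { 1, 1, 2 }); // match starts inside a failed partial match
        Check(search, new byte[] { 0, 1 }, new byte[] { 1, 2 });          // partial match runs off the end of the buffer
        Check(search, new byte[] { 5, 6, 7 }, new byte[] { 7 });          // single-byte pattern
        Check(search, new byte[] { 1, 2 }, new byte[] { 1, 2, 3 });       // pattern longer than the buffer
        Check(search, new byte[] { 1, 2, 3 }, new byte[] { 1, 2, 3 });    // pattern equals the whole buffer
    }
}
```

Any line printed with FAIL (or an IndexOutOfRangeException) points at an input the implementation under test does not handle.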

Your solution is not O(n), because you have nested loops!
Amirhossein Yari

0

I tried to understand Sanchez's proposal and make the search faster. The code below performs about the same, but it is easier to understand.

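public int Search3(byte[] src, byte[] pattern)
    {
        if (src == null || pattern == null || pattern.Length == 0)
        {
            return -1;
        }

        // Only start positions where the whole pattern still fits are candidates,
        // so src[i + j] below can never go out of range.
        for (int i = 0; i <= src.Length - pattern.Length; i++)
        {
            if (src[i] != pattern[0])
            {
                continue;
            }

            bool isMatch = true;
            // Compare with a separate offset so that i is not advanced by a partial match;
            // otherwise a match starting inside the partial match would be skipped.
            for (int j = 1; j < pattern.Length; j++)
            {
                if (src[i + j] != pattern[j])
                {
                    isMatch = false;
                    break;
                }
            }
            if (isMatch)
            {
                return i;
            }
        }
        return -1;
    }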

0

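Here's my take on the topic. I used pointers to make sure it is faster on larger arrays. This function returns the first occurrence of the sequence (which is what I needed in my own case).

I am sure you can modify it a little so that it returns a list of all occurrences.

What I do is fairly simple. I loop through the source array (the haystack) until I find the first byte of the pattern (the needle). Once the first byte is found, I check in a separate loop whether the following bytes match the following bytes of the pattern. If they do not, I resume the normal search from the haystack index I was at before I tried to match the needle.

So here's the code: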

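    public unsafe int IndexOfPattern(byte[] src, byte[] pattern)
    {
        // Also keeps &src[0] and &pattern[0] from throwing on empty arrays
        if (src.Length == 0 || pattern.Length == 0 || pattern.Length > src.Length)
            return -1;

        fixed(byte *srcPtr = &src[0])
        fixed (byte* patternPtr = &pattern[0])
        {
            // Stop where the pattern no longer fits, so srcPtr + x + y never reads past the array
            for (int x = 0; x <= src.Length - pattern.Length; x++)
            {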
            {
                byte currentValue = *(srcPtr + x);

                if (currentValue != *patternPtr) continue;

                bool match = false;

                for (int y = 0; y < pattern.Length; y++)
                {
                    byte tempValue = *(srcPtr + x + y);
                    if (tempValue != *(patternPtr + y))
                    {
                        match = false;
                        break;
                    }

                    match = true;
                }

                if (match)
                    return x;
            }
        }
        return -1;
    }

And here is the safe version:

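    public int IndexOfPatternSafe(byte[] src, byte[] pattern)
    {
        if (pattern.Length == 0 || pattern.Length > src.Length)
            return -1;

        // Stop where the pattern no longer fits, so src[x + y] stays in bounds
        for (int x = 0; x <= src.Length - pattern.Length; x++)
        {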
            byte currentValue = src[x];
            if (currentValue != pattern[0]) continue;

            bool match = false;

            for (int y = 0; y < pattern.Length; y++)
            {
                byte tempValue = src[x + y];
                if (tempValue != pattern[y])
                {
                    match = false;
                    break;
                }

                match = true;
            }

            if (match)
                return x;
        }

        return -1;
    }

0

I ran into this problem the other day; try this:

        public static long FindBinaryPattern(byte[] data, byte[] pattern)
        {
            using (MemoryStream stream = new MemoryStream(data))
            {
                return FindBinaryPattern(stream, pattern);
            }
        }
        public static long FindBinaryPattern(string filename, byte[] pattern)
        {
            using (FileStream stream = new FileStream(filename, FileMode.Open))
            {
                return FindBinaryPattern(stream, pattern);
            }
        }
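        public static long FindBinaryPattern(Stream stream, byte[] pattern)
        {
            if (pattern.Length == 0)
                return -1;

            // fallback[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix.
            // It says where to resume after a mismatch, so a match that begins inside a failed
            // partial match (e.g. 1,1,2 in 1,1,1,2) is not skipped.
            int[] fallback = new int[pattern.Length];
            for (int i = 1, k = 0; i < pattern.Length; ++i)
            {
                while (k > 0 && pattern[i] != pattern[k])
                    k = fallback[k - 1];
                if (pattern[i] == pattern[k])
                    ++k;
                fallback[i] = k;
            }

            byte[] buffer = new byte[1024 * 1024];
            int patternIndex = 0;  // Number of pattern bytes matched so far; carries over between reads
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int bufferIndex = 0; bufferIndex < read; ++bufferIndex)
                {
                    // On a mismatch, fall back to the longest prefix that can still match instead of restarting at 0
                    while (patternIndex > 0 && buffer[bufferIndex] != pattern[patternIndex])
                        patternIndex = fallback[patternIndex - 1];

                    if (buffer[bufferIndex] == pattern[patternIndex])
                    {
                        ++patternIndex;
                        if (patternIndex == pattern.Length)
                            return stream.Position - (read - bufferIndex) - pattern.Length + 1;
                    }
                }
            }
            return -1;
        }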

It does nothing fancy and keeps things simple.


-1

You can put the byte arrays into a String and run the match with IndexOf. Or you can at least reuse existing string-matching algorithms.

    [STAThread]
    static void Main(string[] args)
    {
        byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
        byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
        string needle, haystack;

        unsafe 
        {
            fixed(byte * p = pattern) {
                needle = new string((SByte *) p, 0, pattern.Length);
            } // fixed

            fixed (byte * p2 = toBeSearched) 
            {
                haystack = new string((SByte *) p2, 0, toBeSearched.Length);
            } // fixed

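            // Ordinal compares raw char values; the default comparison is culture-sensitive
            int i = haystack.IndexOf(needle, 0, StringComparison.Ordinal);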
            System.Console.Out.WriteLine(i);
        }
    }

Your code will only find the first match, but the question implies all matches...
Mitch Wheat

I am just glad it works. If ASCII covered the whole 8 bits, the code would be cleaner.
Eugene Yokota

No, ASCII does not cover the whole 8 bits; it is 7-bit.
Constantin

Using UTF-8 is a bad idea: 1. Assert.AreNotEqual(new byte[] { 0xc2, 0x00 }, Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(new byte[] { 0xc2, 0x00 }))); 2. You print the index within the string, not within the byte array (multi-byte characters).
Pawel Lesnikowski, 2014
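
The comments above come down to one requirement: every byte must map to exactly one char and back. A minimal sketch of one way to get that, assuming ISO-8859-1 (Latin-1) as the encoding and an ordinal comparison; the class name `Latin1Search` is made up for this illustration, and this is not the code from the answer above.

```csharp
using System;
using System.Text;

static class Latin1Search
{
    // ISO-8859-1 maps each byte value 0..255 to the char with the same code point,
    // so a string index is also a byte index.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        string h = Latin1.GetString(haystack);
        string n = Latin1.GetString(needle);
        // Ordinal compares raw char values; a culture-sensitive comparison
        // may ignore or fold some characters.
        return h.IndexOf(n, StringComparison.Ordinal);
    }

    static void Main()
    {
        byte[] pattern = { 12, 3, 5, 76, 8, 0, 6, 125 };
        byte[] toBeSearched = { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };

        Console.WriteLine(IndexOf(toBeSearched, pattern)); // prints 8
    }
}
```

This still copies both arrays into strings, so it trades memory for the convenience of reusing IndexOf.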

-3

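toBeSearched.Except(pattern) will return the differences; toBeSearched.Intersect(pattern) will produce the set of intersections. In general, you should look into the extension methods within the Linq extensions.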
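
Note that Except and Intersect are set operations: they ignore order and duplicates, so on their own they cannot say where (or whether) the pattern occurs as a contiguous sequence. A minimal sketch of a positional search that stays within LINQ, assuming readability matters more than speed; `LinqSearch` and `Locate` are names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinqSearch
{
    // Yields every start index at which pattern occurs in data (overlapping matches included).
    public static IEnumerable<int> Locate(byte[] data, byte[] pattern)
    {
        if (data == null || pattern == null || pattern.Length == 0 || pattern.Length > data.Length)
            return Enumerable.Empty<int>();

        // For each candidate start, compare the window of pattern.Length bytes with the pattern.
        return Enumerable.Range(0, data.Length - pattern.Length + 1)
            .Where(i => data.Skip(i).Take(pattern.Length).SequenceEqual(pattern));
    }

    static void Main()
    {
        byte[] pattern = { 12, 3, 5 };
        byte[] data = { 5, 3, 12, 0, 12, 3, 5 };

        // The set operation only reports which values the two arrays share.
        Console.WriteLine(string.Join(",", data.Intersect(pattern))); // prints 5,3,12
        // The positional search reports where the sequence actually starts.
        Console.WriteLine(string.Join(",", Locate(data, pattern)));   // prints 4
    }
}
```

This is O(n * m) with per-window enumerator overhead, so it is better suited to small inputs than to the benchmark-style workloads discussed earlier in the thread.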

Licensed under cc by-sa 3.0 with attribution required.