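I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.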
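Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): in most regex flavors, the characters matched by the shorthand character class \w are the same characters that word boundaries treat as word characters. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.)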
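The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not the dash!). In most flavors that support Unicode, \w also includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore, and numeric symbols that are not digits, may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w by default.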
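To make the \w rules concrete, here is a minimal sketch that lists the runs of \w characters Java finds in a string. The class and method names (WordCharDemo, wordRuns) are made up for illustration. The third call uses Pattern.UNICODE_CHARACTER_CLASS, a standard flag available since Java 7 that switches \w to its Unicode definition; it is not something the paragraph above relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCharDemo {
    // Collect every maximal run of \w characters in the input (hypothetical helper).
    static List<String> wordRuns(String input, int flags) {
        List<String> runs = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+", flags).matcher(input);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Underscore and digits are word characters; the dash is not.
        System.out.println(wordRuns("snake_case-2x", 0));   // [snake_case, 2x]
        // By default \w is ASCII-only, so the accented letter is left out.
        System.out.println(wordRuns("caf\u00e9", 0));       // [caf]
        // With UNICODE_CHARACTER_CLASS, \w also covers non-ASCII letters.
        System.out.println(wordRuns("caf\u00e9", Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```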
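Which is why Java-based regex searches for C++, C#, or .NET (even when you remember to escape the period and the pluses) are screwed by the \b. A \b matches only between a word character and a non-word character, and +, #, and . are not word characters, so there is no boundary between, say, the final + of C++ and the space that follows it.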
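A small sketch of that failure, using only ASCII text. BoundaryDemo and found are hypothetical names for this illustration, not part of the code further down.

```java
import java.util.regex.Pattern;

public class BoundaryDemo {
    // True if the regex matches anywhere in the text (hypothetical helper).
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C++ and Java";
        // "Java" starts and ends with word characters, so both boundaries exist.
        System.out.println(found("\\bJava\\b", text));       // true
        // The literal text C++ is present.
        System.out.println(found("C\\+\\+", text));          // true
        // '+' followed by ' ' is non-word next to non-word: no boundary there.
        System.out.println(found("\\bC\\+\\+\\b", text));    // false
        // '+' followed by 'x' is non-word next to word: a boundary, but not the one we wanted.
        System.out.println(found("\\bC\\+\\+\\b", "C++x"));  // true
    }
}
```

The last case shows that the trailing \b is not merely useless after a symbol: it matches exactly when the name runs straight into another word.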
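Note: I'm not sure what to do about mistakes in the text, such as when someone leaves out the space after the period at the end of a sentence. I allowed for it, but I'm not sure that is necessarily the right thing to do.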
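Anyway, in Java, if you are searching text for those oddly named languages, you need to replace the \b with explicit before-and-after patterns that match whitespace and punctuation. For example: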
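import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns every line of the input that contains a match for regexp, joined by newlines.
public static String grep(String regexp, String multiLineStringToSearch) {
    StringBuilder result = new StringBuilder();
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result.append("\n").append(line);
        }
    }
    // trim() drops the leading newline added before the first matching line.
    return result.toString().trim();
}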
Then in your test or main function:
String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
String text = "Programming in C, (C++) C#, Java, and .NET.";
System.out.println("text="+text);
// Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
System.out.println("Should find: grep exactly for C#="+ grep(beforeWord+"C#"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));
System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works Ok for this example, but see below
// Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
System.out.println("text="+text);
System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
// Make sure the first and last cases work OK.
text = "C is a language that should have been named differently.";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
text = "One language that should have been named differently is C";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
//Make sure we don't get false positives
text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
System.out.println("text="+text);
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
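
The beforeWord/afterWord idea can also be wrapped in a helper so that the escaping does not have to be remembered for each name. This is only a sketch: TermSearch, termPattern, and containsTerm are hypothetical names, the delimiter set is the same guess as in beforeWord and afterWord, and Pattern.quote (a standard java.util.regex method) is swapped in for the hand-written backslashes.

```java
import java.util.regex.Pattern;

public class TermSearch {
    // Same delimiters as beforeWord/afterWord: whitespace and . , ! ? ( ) ' "
    // Inside a character class these punctuation characters need no escaping.
    private static final String DELIMS = "[\\s.,!?()'\"]";

    // Build a pattern for a literal term that must sit between delimiters
    // or at the start/end of the input. Pattern.quote escapes the term.
    static Pattern termPattern(String term) {
        return Pattern.compile("(?:^|" + DELIMS + ")" + Pattern.quote(term) + "(?:$|" + DELIMS + ")");
    }

    // True if the term occurs as a standalone token in the text.
    static boolean containsTerm(String term, String text) {
        return termPattern(term).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println(containsTerm("C++", text));   // true
        System.out.println(containsTerm(".NET", text));  // true
        System.out.println(containsTerm("C#", text));    // true
        // The C in "C&O" is followed by '&', which is not in the delimiter set.
        System.out.println(containsTerm("C", "Worked on C&O (Chesapeake and Ohio) Canal"));  // false
    }
}
```

Because the delimiters are consumed as part of the match, this is suited to a yes/no check per line, as grep does above, rather than to extracting the matched name itself.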
P.S. My thanks to http://regexpal.com/, without which the regex world would be very miserable!