RegEx to split camelCase or TitleCase (advanced)

Question 1

I found a brilliant RegEx to extract the parts of a camelCase or TitleCase expression.

 (?<!^)(?=[A-Z])

It works as expected:

value -> value
camelValue -> camel / Value
TitleValue -> Title / Value

For example, with Java:

String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now new String[]{"lorem", "Ipsum"}

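My problem is that it does not work in some cases:

Case 1: VALUE -> V / A / L / U / E
Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext

In my opinion, the result should be:

Case 1: VALUE
Case 2: eclipse / RCP / Ext

In other words, given n uppercase characters:

If the n characters are followed by lowercase characters, the groups should be: (n-1 characters) / (n-th character + lowercase characters)
If the n characters are at the end, the group should be: (n characters).

Any ideas on how to improve this regex?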
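To make the problem easy to reproduce, here is a minimal sketch that runs the regex from the question over the sample inputs. The class name OriginalSplitDemo and the helper split are made up for illustration; only String.split and the regex itself come from the question.

```java
class OriginalSplitDemo {
    // Regex from the question: split before every uppercase letter,
    // except at the very start of the string.
    static final String ORIGINAL = "(?<!^)(?=[A-Z])";

    // Hypothetical helper, only here to keep the demo short.
    static String[] split(String s) {
        return s.split(ORIGINAL);
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String s : samples) {
            // VALUE and eclipseRCPExt come out split letter by letter,
            // which is the behaviour the question wants to avoid.
            System.out.println(s + " -> " + String.join(" / ", split(s)));
        }
    }
}
```

The last two lines of output show the unwanted per-letter splits inside runs of uppercase characters.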

Question 2

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}

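It works by forcing the negative lookbehind to ignore not only matches at the start of the string, but also matches where an uppercase letter is preceded by another uppercase letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" because it does not split between "RCP" and "Ext". That is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every uppercase letter that is followed by a lowercase letter, except at the start of the string.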
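To see what each clause contributes, the sketch below splits the same inputs once with the first clause alone and once with the combined regex. The class and constant names are hypothetical; the two clauses are copied from the regex above.

```java
class TwoClauseSplitDemo {
    // First clause: split before an uppercase letter, unless we are at the
    // start of the string or directly after another uppercase letter.
    static final String FIRST_CLAUSE = "(?<!(^|[A-Z]))(?=[A-Z])";
    // Second clause: split before an uppercase letter that is followed by a
    // lowercase letter, except at the start of the string.
    static final String SECOND_CLAUSE = "(?<!^)(?=[A-Z][a-z])";
    static final String COMBINED = FIRST_CLAUSE + "|" + SECOND_CLAUSE;

    static String[] splitFirstOnly(String s) {
        return s.split(FIRST_CLAUSE);
    }

    static String[] splitCombined(String s) {
        return s.split(COMBINED);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"VALUE", "eclipseRCPExt"}) {
            System.out.println(s + " (first clause) -> " + String.join(" / ", splitFirstOnly(s)));
            System.out.println(s + " (combined)     -> " + String.join(" / ", splitCombined(s)));
        }
    }
}
```

With the first clause alone, eclipseRCPExt stays as eclipse / RCPExt; the second clause adds the missing split before Ext.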

Question 3

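It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter: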

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCPExt

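The only difference from your desired output is with eclipseRCPExt, which I would argue is correctly split here.

Addendum - Improved version

Note: This answer recently got an upvote and I realized that there is a better way...

By adding a second alternative to the above regex, all of the OP's test cases are correctly split.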

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCP / Ext

Edit (2013-08-24): Added the improved version to handle the RCPExt -> RCP / Ext case.
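For anyone who wants to check the two tables above in Java, here is a small sketch that applies both regexes with String.split. The class and method names are invented for this example.

```java
class LookaroundSplitDemo {
    // Split only where a lowercase letter is directly followed by an uppercase one.
    static final String SIMPLE = "(?<=[a-z])(?=[A-Z])";
    // Improved: also split inside an uppercase run, right before the last
    // uppercase letter when that letter starts a capitalized word.
    static final String IMPROVED = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static String[] splitSimple(String s) {
        return s.split(SIMPLE);
    }

    static String[] splitImproved(String s) {
        return s.split(IMPROVED);
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String s : samples) {
            System.out.println(s + " -> " + String.join(" / ", splitImproved(s)));
        }
    }
}
```

Both alternatives are zero-width and need a character on each side, so neither can match at the start or end of the string and no empty pieces are produced.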

Question 4

Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase

Question 5

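I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own, which I've tested and which seems to do exactly what you're looking for: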

((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

Here is an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)

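Here I'm separating each word with a space, so here are some examples of how the string is transformed:

ThisIsATitleCASEString => This Is A Title CASE String
andThisOneIsCamelCASE => and This One Is Camel CASE

The solution above does what the original post asks for, but I also needed a regex to find camel and Pascal case strings that include numbers, so I came up with this variation to include numbers: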

((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

And an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
;   (^[a-z]+)                               Match against any lower-case letters at the start of the string.
;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)

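And here are some examples of how a string with numbers is transformed using this regex:

myVariable123 => my Variable 123
my2Variables => my 2 Variables
3rdVariableIsHere => 3 rdVariable Is Here
12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too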
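Since the question is about Java, here is a rough Java equivalent of the snippet above, offered as a sketch rather than a tested port of the AutoHotkey code. It assumes the pattern behaves the same under java.util.regex; the class name SpaceSeparatedWords and the method separate are hypothetical.

```java
import java.util.regex.Pattern;

class SpaceSeparatedWords {
    // Same pattern as the number-aware variant above, compiled once.
    static final Pattern WORD = Pattern.compile(
        "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))");

    // Append a space after every matched word, then trim the trailing one.
    // Characters that no alternative matches are left where they are.
    static String separate(String camelOrPascal) {
        return WORD.matcher(camelOrPascal).replaceAll("$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(separate("myVariable123"));          // my Variable 123
        System.out.println(separate("my2Variables"));           // my 2 Variables
        System.out.println(separate("ThisIsATitleCASEString")); // This Is A Title CASE String
    }
}
```

Because this replaces matches instead of splitting, lowercase letters that follow a digit (such as the "rd" in "3rdVariable") are not matched by any alternative and stay attached to the next word.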

Question 6

To handle more letters than just `A-Z`:

s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");

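Either:

Split after any lowercase letter that is followed by an uppercase letter.

E.g. parseXML -> parse, XML.

or

Split after any letter that is followed by an uppercase letter and then a lowercase letter.

E.g. XMLParser -> XML, Parser.

In a more readable form: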

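// Imports added for completeness; JUnit 4 is assumed here.
import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals;

import java.util.regex.Pattern;
import org.junit.Test;

public class SplitCamelCaseTest {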

    static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
    static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
        BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
    );

    public static String splitCamelCase(String s) {
        return SPLIT_CAMEL_CASE.splitAsStream(s)
                        .collect(joining(" "));
    }

    @Test
    public void testSplitCamelCase() {
        assertEquals("Camel Case", splitCamelCase("CamelCase"));
        assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
        assertEquals("XML Parser", splitCamelCase("XMLParser"));
        assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
        assertEquals("VALUE", splitCamelCase("VALUE"));
    }    
}
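To illustrate why the Unicode categories matter, the sketch below compares an ASCII-only split regex with the \p{L} version on a word containing non-ASCII letters. The input is my own example (written with Unicode escapes so the source file encoding does not matter), and the class name is hypothetical.

```java
class UnicodeSplitDemo {
    // ASCII-only variant: knows nothing about letters outside a-z / A-Z.
    static final String ASCII_ONLY = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";
    // Unicode-aware variant from this answer.
    static final String UNICODE = "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    // "groesse" + "Aendern" spelled with o-umlaut, sharp s and A-umlaut.
    static final String INPUT = "gr\u00f6\u00dfe\u00c4ndern";

    static String[] splitAscii(String s) {
        return s.split(ASCII_ONLY);
    }

    static String[] splitUnicode(String s) {
        return s.split(UNICODE);
    }

    public static void main(String[] args) {
        // The ASCII-only regex does not see the uppercase A-umlaut, so nothing is split.
        System.out.println(String.join(" / ", splitAscii(INPUT)));
        // The Unicode-aware regex splits before the uppercase A-umlaut.
        System.out.println(String.join(" / ", splitUnicode(INPUT)));
    }
}
```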

Question 7

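Brief

Both top answers here provide code using positive lookbehinds, which are not supported by all regex flavors. The regex below captures both PascalCase and camelCase and can be used in multiple languages.

Note: I do realize this question is about Java; however, this post is also mentioned several times in other questions tagged for different languages, as well as in some comments on this question.

Code

The regex: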

([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Results

Sample Input

eclipseRCPExt

SomethingIsWrittenHere

TEXTIsWrittenHERE

VALUE

loremIpsum

Sample Output

eclipse
RCP
Ext

Something
Is
Written
Here

TEXT
Is
Written
HERE

VALUE

lorem
Ipsum

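Explanation

Match one or more uppercase alpha characters [A-Z]+
Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
Ensure that what follows is an uppercase alpha character [A-Z] or a word boundary \b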
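Because this regex matches the words themselves instead of the gaps between them, it is used with a find loop rather than split. A minimal Java sketch of that, with hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcherDemo {
    // The regex from this answer; \b is written \\b inside a Java string.
    static final Pattern WORD = Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    // Collect every match in order; each match is one word.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));     // [eclipse, RCP, Ext]
        System.out.println(words("TEXTIsWrittenHERE")); // [TEXT, Is, Written, HERE]
    }
}
```

The uppercase run backtracks by one letter when the lookahead fails, which is how TEXTIs ends up as TEXT followed by Is.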

Question 8

You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.

Question 9

You can use the following expression for Java:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
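A short sketch of how this expression could be used with String.split. The class and method names are made up, and it assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element (the third alternative can match at position 0).

```java
class LettersAndDigitsSplitDemo {
    // The expression above, already escaped for a Java string literal.
    static final String REGEX =
        "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])"
        + "|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)";

    static String[] split(String s) {
        return s.split(REGEX);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"eclipseRCPExt", "myVariable123", "my2Variables"}) {
            System.out.println(s + " -> " + String.join(" / ", split(s)));
        }
    }
}
```

The last two alternatives add splits at every boundary between a digit and a non-digit, so my2Variables becomes my / 2 / Variables.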

Question 10

Instead of looking for separators that aren't there, you might also consider finding the name components (those are certainly there):

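// Imports needed by the snippet below.
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String test = "_eclipse福福RCPExt";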

Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);

Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
    // matches should be consecutive
    if (componentMatcher.start() != endOfLastMatch) {
        // do something horrible if you don't want garbage in between

        // we're lenient though, any Chinese characters are lucky and get through as group
        String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
        components.add(startOrInBetween);
    }
    components.add(componentMatcher.group(1));
    endOfLastMatch = componentMatcher.end();
}

if (endOfLastMatch != test.length()) {
    // after find() has returned false, start() would throw, so take the rest of the string
    String end = test.substring(endOfLastMatch);
    components.add(end);
}

System.out.println(components);

This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.

Question 11

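I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels above works with the Microsoft flavour of regex.

I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).

This is able to split strings such as: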

DrivingB2BTradeIn2019Onwards

to

Driving B2B Trade In 2019 Onwards
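The answer above is about the Microsoft regex flavour, but the same pattern is also valid in java.util.regex. The sketch below is a Java illustration only, with an invented class name and an input string chosen for the example; it joins the matched words with spaces.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlphanumericWordsDemo {
    // Digits are treated like uppercase letters, so "B2B" and "2019" stay together.
    static final Pattern WORD = Pattern.compile("([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\\b)");

    // Join every matched word with a single space.
    static String spaced(String s) {
        StringJoiner joined = new StringJoiner(" ");
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            joined.add(m.group(1));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(spaced("DrivingB2BTradeIn2019Onwards"));
    }
}
```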

Question 12

JavaScript solution

/**
 * howToDoThis ===> ["", "how", "To", "Do", "This"]
 * @param word word to be split
 */
export const splitCamelCaseWords = (word: string) => {
    if (typeof word !== 'string') return [];
    return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};