我正在使用Java处理一些Java源代码。我正在提取字符串文字并将其提供给带有String的函数。问题是我需要将String的未转义版本传递给函数(即,这意味着要转换\n
为换行符和\\
单个字符\
,等等)。
Java API内是否有执行此操作的功能?如果没有,我可以从某些库中获得这种功能吗?显然,Java编译器必须执行此转换。
万一有人想知道,我正在尝试对经过反混淆的Java文件中的字符串文字进行反混淆。
Answers:
这里org.apache.commons.lang.StringEscapeUtils.unescapeJava()
给出的另一个答案实际上根本没有什么帮助。
\0
了null。java.util.regex.Pattern.compile()
和一切使用它,其中包括\a
,\e
,特别是\cX
。charAt
接口而不是codePoint
接口,从而散布了char
保证Java保留Unicode字符的错觉。不是。他们之所以无法解决,是因为没有UTF-16代理会寻找他们想要的东西。我写了一个字符串unescaper,它解决了OP的问题,而没有Apache代码的所有烦恼。
/*
*
* unescape_perl_string()
*
* Tom Christiansen <tchrist@perl.com>
* Sun Nov 28 12:55:24 MST 2010
*
* It's completely ridiculous that there's no standard
* unescape_java_string function. Since I have to do the
* damn thing myself, I might as well make it halfway useful
* by supporting things Java was too stupid to consider in
* strings:
*
* => "?" items are additions to Java string escapes
* but normal in Java regexes
*
* => "!" items are also additions to Java regex escapes
*
* Standard singletons: ?\a ?\e \f \n \r \t
*
* NB: \b is unsupported as backspace so it can pass-through
* to the regex translator untouched; I refuse to make anyone
* doublebackslash it as doublebackslashing is a Java idiocy
* I desperately wish would die out. There are plenty of
* other ways to write it:
*
* \cH, \12, \012, \x08 \x{8}, \u0008, \U00000008
*
* Octal escapes: \0 \0N \0NN \N \NN \NNN
* Can range up to !\777 not \377
*
* TODO: add !\o{NNNNN}
* last Unicode is 4177777
* maxint is 37777777777
*
* Control chars: ?\cX
* Means: ord(X) ^ ord('@')
*
* Old hex escapes: \xXX
* unbraced must be 2 xdigits
*
* Perl hex escapes: !\x{XXX} braced may be 1-8 xdigits
* NB: proper Unicode never needs more than 6, as highest
* valid codepoint is 0x10FFFF, not maxint 0xFFFFFFFF
*
* Lame Java escape: \[IDIOT JAVA PREPROCESSOR]uXXXX must be
* exactly 4 xdigits;
*
* I can't write XXXX in this comment where it belongs
* because the damned Java Preprocessor can't mind its
* own business. Idiots!
*
* Lame Python escape: !\UXXXXXXXX must be exactly 8 xdigits
*
* TODO: Perl translation escapes: \Q \U \L \E \[IDIOT JAVA PREPROCESSOR]u \l
* These are not so important to cover if you're passing the
* result to Pattern.compile(), since it handles them for you
* further downstream. Hm, what about \[IDIOT JAVA PREPROCESSOR]u?
*
*/
public final static
String unescape_perl_string(String oldstr) {
/*
* In contrast to fixing Java's broken regex charclasses,
* this one need be no bigger, as unescaping shrinks the string
* here, where in the other one, it grows it.
*/
StringBuffer newstr = new StringBuffer(oldstr.length());
boolean saw_backslash = false;
for (int i = 0; i < oldstr.length(); i++) {
int cp = oldstr.codePointAt(i);
if (oldstr.codePointAt(i) > Character.MAX_VALUE) {
i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
}
if (!saw_backslash) {
if (cp == '\\') {
saw_backslash = true;
} else {
newstr.append(Character.toChars(cp));
}
continue; /* switch */
}
if (cp == '\\') {
saw_backslash = false;
newstr.append('\\');
newstr.append('\\');
continue; /* switch */
}
switch (cp) {
case 'r': newstr.append('\r');
break; /* switch */
case 'n': newstr.append('\n');
break; /* switch */
case 'f': newstr.append('\f');
break; /* switch */
/* PASS a \b THROUGH!! */
case 'b': newstr.append("\\b");
break; /* switch */
case 't': newstr.append('\t');
break; /* switch */
case 'a': newstr.append('\007');
break; /* switch */
case 'e': newstr.append('\033');
break; /* switch */
/*
* A "control" character is what you get when you xor its
* codepoint with '@'==64. This only makes sense for ASCII,
* and may not yield a "control" character after all.
*
* Strange but true: "\c{" is ";", "\c}" is "=", etc.
*/
case 'c': {
if (++i == oldstr.length()) { die("trailing \\c"); }
cp = oldstr.codePointAt(i);
/*
* don't need to grok surrogates, as next line blows them up
*/
if (cp > 0x7f) { die("expected ASCII after \\c"); }
newstr.append(Character.toChars(cp ^ 64));
break; /* switch */
}
case '8':
case '9': die("illegal octal digit");
/* NOTREACHED */
/*
* may be 0 to 2 octal digits following this one
* so back up one for fallthrough to next case;
* unread this digit and fall through to next case.
*/
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7': --i;
/* FALLTHROUGH */
/*
* Can have 0, 1, or 2 octal digits following a 0
* this permits larger values than octal 377, up to
* octal 777.
*/
case '0': {
if (i+1 == oldstr.length()) {
/* found \0 at end of string */
newstr.append(Character.toChars(0));
break; /* switch */
}
i++;
int digits = 0;
int j;
for (j = 0; j <= 2; j++) {
if (i+j == oldstr.length()) {
break; /* for */
}
/* safe because will unread surrogate */
int ch = oldstr.charAt(i+j);
if (ch < '0' || ch > '7') {
break; /* for */
}
digits++;
}
if (digits == 0) {
--i;
newstr.append('\0');
break; /* switch */
}
int value = 0;
try {
value = Integer.parseInt(
oldstr.substring(i, i+digits), 8);
} catch (NumberFormatException nfe) {
die("invalid octal value for \\0 escape");
}
newstr.append(Character.toChars(value));
i += digits-1;
break; /* switch */
} /* end case '0' */
case 'x': {
if (i+2 > oldstr.length()) {
die("string too short for \\x escape");
}
i++;
boolean saw_brace = false;
if (oldstr.charAt(i) == '{') {
/* ^^^^^^ ok to ignore surrogates here */
i++;
saw_brace = true;
}
int j;
for (j = 0; j < 8; j++) {
if (!saw_brace && j == 2) {
break; /* for */
}
/*
* ASCII test also catches surrogates
*/
int ch = oldstr.charAt(i+j);
if (ch > 127) {
die("illegal non-ASCII hex digit in \\x escape");
}
if (saw_brace && ch == '}') { break; /* for */ }
if (! ( (ch >= '0' && ch <= '9')
||
(ch >= 'a' && ch <= 'f')
||
(ch >= 'A' && ch <= 'F')
)
)
{
die(String.format(
"illegal hex digit #%d '%c' in \\x", ch, ch));
}
}
if (j == 0) { die("empty braces in \\x{} escape"); }
int value = 0;
try {
value = Integer.parseInt(oldstr.substring(i, i+j), 16);
} catch (NumberFormatException nfe) {
die("invalid hex value for \\x escape");
}
newstr.append(Character.toChars(value));
if (saw_brace) { j++; }
i += j-1;
break; /* switch */
}
case 'u': {
if (i+4 > oldstr.length()) {
die("string too short for \\u escape");
}
i++;
int j;
for (j = 0; j < 4; j++) {
/* this also handles the surrogate issue */
if (oldstr.charAt(i+j) > 127) {
die("illegal non-ASCII hex digit in \\u escape");
}
}
int value = 0;
try {
value = Integer.parseInt( oldstr.substring(i, i+j), 16);
} catch (NumberFormatException nfe) {
die("invalid hex value for \\u escape");
}
newstr.append(Character.toChars(value));
i += j-1;
break; /* switch */
}
case 'U': {
if (i+8 > oldstr.length()) {
die("string too short for \\U escape");
}
i++;
int j;
for (j = 0; j < 8; j++) {
/* this also handles the surrogate issue */
if (oldstr.charAt(i+j) > 127) {
die("illegal non-ASCII hex digit in \\U escape");
}
}
int value = 0;
try {
value = Integer.parseInt(oldstr.substring(i, i+j), 16);
} catch (NumberFormatException nfe) {
die("invalid hex value for \\U escape");
}
newstr.append(Character.toChars(value));
i += j-1;
break; /* switch */
}
default: newstr.append('\\');
newstr.append(Character.toChars(cp));
/*
* say(String.format(
* "DEFAULT unrecognized escape %c passed through",
* cp));
*/
break; /* switch */
}
saw_backslash = false;
}
/* weird to leave one at the end */
if (saw_backslash) {
newstr.append('\\');
}
return newstr.toString();
}
/*
* Return a string "U+XX.XXX.XXXX" etc, where each XX set is the
* xdigits of the logical Unicode code point. No bloody brain-damaged
* UTF-16 surrogate crap, just true logical characters.
*/
public final static
String uniplus(String s) {
if (s.length() == 0) {
return "";
}
/* This is just the minimum; sb will grow as needed. */
StringBuffer sb = new StringBuffer(2 + 3 * s.length());
sb.append("U+");
for (int i = 0; i < s.length(); i++) {
sb.append(String.format("%X", s.codePointAt(i)));
if (s.codePointAt(i) > Character.MAX_VALUE) {
i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
}
if (i+1 < s.length()) {
sb.append(".");
}
}
return sb.toString();
}
private static final
void die(String foa) {
throw new IllegalArgumentException(foa);
}
private static final
void say(String what) {
System.out.println(what);
}
如果它可以帮助其他人,则欢迎您使用它-不附带任何条件。如果您有所改进,我很乐意将您的增强功能邮寄给我,但您当然不必这样做。
foo\\bar
,它返回了foo\\bar
。我本来希望这样foo\bar
。这是错误还是我误解了该方法背后的想法?
您可以使用String unescapeJava(String)
的方法StringEscapeUtils
从阿帕奇共享郎。
这是一个示例片段:
String in = "a\\tb\\n\\\"c\\\"";
System.out.println(in);
// a\tb\n\"c\"
String out = StringEscapeUtils.unescapeJava(in);
System.out.println(out);
// a b
// "c"
实用程序类具有对Java,Java Script,HTML,XML和SQL进行字符串转义和不转义的方法。它还具有直接写入的重载java.io.Writer
。
看起来像StringEscapeUtils
使用一个u
,而不是八进制的转义,或使用无关的u
s的Unicode转义来处理Unicode转义。
/* Unicode escape test #1: PASS */
System.out.println(
"\u0030"
); // 0
System.out.println(
StringEscapeUtils.unescapeJava("\\u0030")
); // 0
System.out.println(
"\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
); // true
/* Octal escape test: FAIL */
System.out.println(
"\45"
); // %
System.out.println(
StringEscapeUtils.unescapeJava("\\45")
); // 45
System.out.println(
"\45".equals(StringEscapeUtils.unescapeJava("\\45"))
); // false
/* Unicode escape test #2: FAIL */
System.out.println(
"\uu0030"
); // 0
System.out.println(
StringEscapeUtils.unescapeJava("\\uu0030")
); // throws NestableRuntimeException:
// Unable to parse unicode value: u003
来自JLS的报价:
提供八进制转义符是为了与C兼容,但只能
\u0000
通过表示Unicode值\u00FF
,因此通常首选Unicode转义符。
如果您的字符串可以包含八进制转义符,则可能需要先将它们转换为Unicode转义符,或使用其他方法。
无关u
的也记录如下:
Java编程语言指定了一种将Unicode编写的程序转换为ASCII的标准方法,该程序将程序更改为可以由基于ASCII的工具处理的形式。转换涉及通过添加一个额外的字符将程序源文本中的任何Unicode转义转换为ASCII,
u
例如,\uxxxx
变为\uuxxxx
-,同时将源文本中的非ASCII字符转换为每个均包含一个u的Unicode转义。对于Java编程语言,此转换版本同样适用于编译器,并且表示完全相同的程序。稍后,可以通过将每个
u
存在多个的转义序列转换为少一个的Unicode字符序列u
,同时将每个转义序列从一个u
转换为相应的单个Unicode字符,从而从此ASCII格式还原确切的Unicode源。
如果您的字符串可以包含多余的Unicode转义u
,那么您可能还需要在使用前对其进行预处理StringEscapeUtils
。
另外,您可以尝试从头开始编写自己的Java字符串文字unescaper,确保遵循确切的JLS规范。
0-255
那样(请参见引用)。最大的八进制逸出量为\377
。
遇到了类似的问题,对提出的解决方案也不满意,因此自己实施了这个解决方案。
也可以在Github上作为Gist使用:
/**
* Unescapes a string that contains standard Java escape sequences.
* <ul>
* <li><strong>\b \f \n \r \t \" \'</strong> :
* BS, FF, NL, CR, TAB, double and single quote.</li>
* <li><strong>\X \XX \XXX</strong> : Octal character
* specification (0 - 377, 0x00 - 0xFF).</li>
* <li><strong>\uXXXX</strong> : Hexadecimal based Unicode character.</li>
* </ul>
*
* @param st
* A string optionally containing standard java escape sequences.
* @return The translated string.
*/
public String unescapeJavaString(String st) {
StringBuilder sb = new StringBuilder(st.length());
for (int i = 0; i < st.length(); i++) {
char ch = st.charAt(i);
if (ch == '\\') {
char nextChar = (i == st.length() - 1) ? '\\' : st
.charAt(i + 1);
// Octal escape?
if (nextChar >= '0' && nextChar <= '7') {
String code = "" + nextChar;
i++;
if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
&& st.charAt(i + 1) <= '7') {
code += st.charAt(i + 1);
i++;
if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
&& st.charAt(i + 1) <= '7') {
code += st.charAt(i + 1);
i++;
}
}
sb.append((char) Integer.parseInt(code, 8));
continue;
}
switch (nextChar) {
case '\\':
ch = '\\';
break;
case 'b':
ch = '\b';
break;
case 'f':
ch = '\f';
break;
case 'n':
ch = '\n';
break;
case 'r':
ch = '\r';
break;
case 't':
ch = '\t';
break;
case '\"':
ch = '\"';
break;
case '\'':
ch = '\'';
break;
// Hex Unicode: u????
case 'u':
if (i >= st.length() - 5) {
ch = 'u';
break;
}
int code = Integer.parseInt(
"" + st.charAt(i + 2) + st.charAt(i + 3)
+ st.charAt(i + 4) + st.charAt(i + 5), 16);
sb.append(Character.toChars(code));
i += 5;
continue;
}
i++;
}
sb.append(ch);
}
return sb.toString();
}
我知道这个问题很旧,但是我想要一个不涉及JRE6之外的库的解决方案(即Apache Commons是不可接受的),我想出了一个使用内置的简单解决方案java.io.StreamTokenizer
:
import java.io.*;
// ...
String literal = "\"Has \\\"\\\\\\\t\\\" & isn\\\'t \\\r\\\n on 1 line.\"";
StreamTokenizer parser = new StreamTokenizer(new StringReader(literal));
String result;
try {
parser.nextToken();
if (parser.ttype == '"') {
result = parser.sval;
}
else {
result = "ERROR!";
}
}
catch (IOException e) {
result = e.toString();
}
System.out.println(result);
输出:
Has "\ " & isn't
on 1 line.
我在这方面有点迟了,但是我认为我会提供我的解决方案,因为我需要相同的功能。我决定使用Java Compiler API,它会使速度变慢,但会使结果准确。基本上,我现场创建一个类,然后返回结果。方法如下:
public static String[] unescapeJavaStrings(String... escaped) {
//class name
final String className = "Temp" + System.currentTimeMillis();
//build the source
final StringBuilder source = new StringBuilder(100 + escaped.length * 20).
append("public class ").append(className).append("{\n").
append("\tpublic static String[] getStrings() {\n").
append("\t\treturn new String[] {\n");
for (String string : escaped) {
source.append("\t\t\t\"");
//we escape non-escaped quotes here to be safe
// (but something like \\" will fail, oh well for now)
for (int i = 0; i < string.length(); i++) {
char chr = string.charAt(i);
if (chr == '"' && i > 0 && string.charAt(i - 1) != '\\') {
source.append('\\');
}
source.append(chr);
}
source.append("\",\n");
}
source.append("\t\t};\n\t}\n}\n");
//obtain compiler
final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
//local stream for output
final ByteArrayOutputStream out = new ByteArrayOutputStream();
//local stream for error
ByteArrayOutputStream err = new ByteArrayOutputStream();
//source file
JavaFileObject sourceFile = new SimpleJavaFileObject(
URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE) {
@Override
public CharSequence getCharContent(boolean ignoreEncodingErrors) throws IOException {
return source;
}
};
//target file
final JavaFileObject targetFile = new SimpleJavaFileObject(
URI.create("string:///" + className + Kind.CLASS.extension), Kind.CLASS) {
@Override
public OutputStream openOutputStream() throws IOException {
return out;
}
};
//file manager proxy, with most parts delegated to the standard one
JavaFileManager fileManagerProxy = (JavaFileManager) Proxy.newProxyInstance(
StringUtils.class.getClassLoader(), new Class[] { JavaFileManager.class },
new InvocationHandler() {
//standard file manager to delegate to
private final JavaFileManager standard =
compiler.getStandardFileManager(null, null, null);
@Override
public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
if ("getJavaFileForOutput".equals(method.getName())) {
//return the target file when it's asking for output
return targetFile;
} else {
return method.invoke(standard, args);
}
}
});
//create the task
CompilationTask task = compiler.getTask(new OutputStreamWriter(err),
fileManagerProxy, null, null, null, Collections.singleton(sourceFile));
//call it
if (!task.call()) {
throw new RuntimeException("Compilation failed, output:\n" +
new String(err.toByteArray()));
}
//get the result
final byte[] bytes = out.toByteArray();
//load class
Class<?> clazz;
try {
//custom class loader for garbage collection
clazz = new ClassLoader() {
protected Class<?> findClass(String name) throws ClassNotFoundException {
if (name.equals(className)) {
return defineClass(className, bytes, 0, bytes.length);
} else {
return super.findClass(name);
}
}
}.loadClass(className);
} catch (ClassNotFoundException e) {
throw new RuntimeException(e);
}
//reflectively call method
try {
return (String[]) clazz.getDeclaredMethod("getStrings").invoke(null);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
它需要一个数组,因此您可以分批取消转义。因此,以下简单测试成功了:
public static void main(String[] meh) {
if ("1\02\03\n".equals(unescapeJavaStrings("1\\02\\03\\n")[0])) {
System.out.println("Success");
} else {
System.out.println("Failure");
}
}
我遇到了同样的问题,但是我对这里找到的任何解决方案都不感兴趣。因此,我写了一篇使用匹配器遍历字符串的字符以查找和替换转义序列的代码。该解决方案假定输入格式正确。也就是说,它愉快地跳过了无意义的转义,并且解码了换行和回车的Unicode转义(否则,由于此类文字的定义和Java转换阶段的顺序,它们不能出现在字符文字或字符串文字中)资源)。不好意思,为简洁起见,代码有点挤。
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Decoder {
// The encoded character of each character escape.
// This array functions as the keys of a sorted map, from encoded characters to decoded characters.
static final char[] ENCODED_ESCAPES = { '\"', '\'', '\\', 'b', 'f', 'n', 'r', 't' };
// The decoded character of each character escape.
// This array functions as the values of a sorted map, from encoded characters to decoded characters.
static final char[] DECODED_ESCAPES = { '\"', '\'', '\\', '\b', '\f', '\n', '\r', '\t' };
// A pattern that matches an escape.
// What follows the escape indicator is captured by group 1=character 2=octal 3=Unicode.
static final Pattern PATTERN = Pattern.compile("\\\\(?:(b|t|n|f|r|\\\"|\\\'|\\\\)|((?:[0-3]?[0-7])?[0-7])|u+(\\p{XDigit}{4}))");
public static CharSequence decodeString(CharSequence encodedString) {
Matcher matcher = PATTERN.matcher(encodedString);
StringBuffer decodedString = new StringBuffer();
// Find each escape of the encoded string in succession.
while (matcher.find()) {
char ch;
if (matcher.start(1) >= 0) {
// Decode a character escape.
ch = DECODED_ESCAPES[Arrays.binarySearch(ENCODED_ESCAPES, matcher.group(1).charAt(0))];
} else if (matcher.start(2) >= 0) {
// Decode an octal escape.
ch = (char)(Integer.parseInt(matcher.group(2), 8));
} else /* if (matcher.start(3) >= 0) */ {
// Decode a Unicode escape.
ch = (char)(Integer.parseInt(matcher.group(3), 16));
}
// Replace the escape with the decoded character.
matcher.appendReplacement(decodedString, Matcher.quoteReplacement(String.valueOf(ch)));
}
// Append the remainder of the encoded string to the decoded string.
// The remainder is the longest suffix of the encoded string such that the suffix contains no escapes.
matcher.appendTail(decodedString);
return decodedString;
}
public static void main(String... args) {
System.out.println(decodeString(args[0]));
}
}
我应该注意,Apache Commons Lang3似乎没有遭受公认的解决方案中指出的弱点。也就是说,StringEscapeUtils
似乎可以处理八进制转义和u
Unicode转义的多个字符。这意味着除非您出于某些迫切的原因要避免使用Apache Commons,否则您应该使用它而不是我的解决方案(或此处的任何其他解决方案)。
Java 13添加了执行此操作的方法:String#translateEscapes
。
它是Java 13和14中的预览功能,但在Java 15中被提升为完整功能。
org.apache.commons.lang3.StringEscapeUtils
from commons-lang3已被标记为已弃用。您可以org.apache.commons.text.StringEscapeUtils#unescapeJava(String)
改用。它需要附加的Maven依赖项:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.4</version>
</dependency>
并且似乎处理了一些更特殊的情况,例如unescapes:
\\b
,\\n
,\\t
,\\f
,\\r
如果您要从文件中读取Unicode逸出的字符,那么这样做会很困难,因为字符串将被字面读取,并带有反斜杠的转义符:
my_file.txt
Blah blah...
Column delimiter=;
Word delimiter=\u0020 #This is just unicode for whitespace
.. more stuff
在这里,当您从文件中读取第3行时,字符串/行将具有:
"Word delimiter=\u0020 #This is just unicode for whitespace"
字符串中的char []将显示:
{...., '=', '\\', 'u', '0', '0', '2', '0', ' ', '#', 't', 'h', ...}
Commons StringUnescape不会为您取消转义(我尝试过unescapeXml())。您必须按照此处所述手动进行操作。
因此,子字符串“ \ u0020”应该成为1个单字符'\ u0020'
但是,如果您正在使用此“ \ u0020”进行String.split("... ..... ..", columnDelimiterReadFromFile)
内部真正的正则表达式,则它将直接起作用,因为从文件读取的字符串已转义并且非常适合在正则表达式模式中使用!(困惑?)