我有一个未知或混合编码的文本文件。我想查看包含无效UTF-8字节序列的行(通过将文本文件传送到某些程序中)。同样,我想过滤掉有效的UTF-8行。换句话说,我正在寻找。grep [notutf8]
我有一个未知或混合编码的文本文件。我想查看包含无效UTF-8字节序列的行(通过将文本文件传送到某些程序中)。同样,我想过滤掉有效的UTF-8行。换句话说,我正在寻找。grep [notutf8]
Answers:
如果要使用grep
,可以执行以下操作:
grep -axv '.*' file
在UTF-8语言环境中获取至少具有无效UTF-8序列的行(至少与GNU Grep一起使用)。
-a
,POSIX才需要使用它。但是,GNU grep
至少不能发现0x10FFFF以上的UTF-8编码的UTF-16替代非字符或代码点。
-a
是GNU所需要的grep
(我认为它不符合POSIX)。有关,替代区域和上述在0x10FFFF的码点,这是一个错误,然后(这可以解释其)。为此,添加-P
应该与GNU grep
2.21 一起工作(但速度很慢)。至少在Debian grep / 2.20-4中它是错误的。
grep
是一个文本实用程序(仅适用于文本输入),所以我认为GNU grep的行为与此处的行为一样有效。
grep
(其意图是将无效序列视为不匹配)和可能的错误。
我认为您可能想要iconv。它用于在代码集之间进行转换,并且支持多种格式。例如,要删除在UTF-8中无效的所有内容,可以使用:
iconv -c -t UTF-8 < input.txt > output.txt
如果没有-c选项,它将报告转换为stderr时出现的问题,因此可以按照进程方向保存这些列表。另一种方法是剥离非UTF8内容,然后
diff input.txt output.txt
有关进行更改的位置的列表。
iconv -c -t UTF-8 <input.txt | diff input.txt - | sed -ne 's/^< //p'
。(不,这是行不通的管道,不过,因为你需要读取输入两次tee
它可能阻止这取决于有多少缓冲不会做,iconv
而且diff
做的)。
diff <(iconv -c -t UTF-8 <input.txt) input.txt
编辑:我已经修复了正则表达式中的一个错字错误。它需要一个'\ x80`而不是\ 80。
为了严格遵守UTF-8,用于过滤掉无效UTF-8格式的正则表达式如下:
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print'
输出(来自测试1的关键行):
Codepoint
=========
00001000 Test=1 mode=strict valid,invalid,fail=(1000,0,0)
0000E000 Test=1 mode=strict valid,invalid,fail=(D800,800,0)
0010FFFF mode=strict test-return=(0,0) valid,invalid,fail=(10F800,800,0)
问:如何创建测试数据以测试过滤无效Unicode的正则表达式?
A.创建自己的UTF-8测试算法,并打破规则...
Catch-22 .. 但是,然后您如何测试测试算法呢?
上面的正则表达式已iconv
针对从0x00000
到0x10FFFF
。的每个整数值进行了测试(用作参考)。此上限值是Unicode Codepoint的最大整数值
根据此Wikipedia UTF-8页面,。
此numeber(1112064)等同于范围0x000000
于0x10F7FF
,其为0x0800害羞对于最高的Unicode代码点的实际最大整数值的:0x10FFFF
Unicode Codepoints频谱中缺少此整数块,因为需要UTF-16编码通过称为代理对的系统超越其原始设计意图。保留了一个整数块,以供UTF-16使用。.该块的范围是到。这些整数都不是合法的Unicode值,因此不是无效的UTF-8值。 0x0800
0x00D800
0x00DFFF
在测试1中,regex
已经针对Unicode Codepoint范围内的每个数字进行了测试,它与..ie的结果完全匹配iconv
。0x010F7FF有效值和0x000800无效值。
但是,现在出现了以下问题:*正则表达式如何处理超出范围的UTF-8值;上面的代码0x010FFFF
(UTF-8可以扩展到6个字节,最大整数值为0x7FFFFFFF?
为了生成必要的* 非Unicode UTF-8字节值,我使用了以下命令:
perl -C -e 'print chr 0x'$hexUTF32BE
为了测试其有效性(以某种方式),我使用了Gilles'
UTF-8正则表达式...
perl -l -ne '/
^( [\000-\177] # 1-byte pattern
|[\300-\337][\200-\277] # 2-byte pattern
|[\340-\357][\200-\277]{2} # 3-byte pattern
|[\360-\367][\200-\277]{3} # 4-byte pattern
|[\370-\373][\200-\277]{4} # 5-byte pattern
|[\374-\375][\200-\277]{5} # 6-byte pattern
)*$ /x or print'
“ perl的print chr”的输出与Gilles的正则表达式的过滤匹配。.一个增强了另一个的有效性..我不能使用,iconv
因为它仅处理更宽(原始)的UTF-8的有效Unicode标准子集。标准...
所涉及的数字是相当大的,因此我已经测试了范围的顶部,范围的底部,以及以11111、13579、33333、53441之类的增量逐步进行的几次扫描...结果全部匹配,所以现在剩下的就是针对这些超出范围的UTF-8样式值测试regex(对于Unicode无效,因此对于严格的UTF-8本身也无效)。
以下是测试模块:
[[ "$(locale charmap)" != "UTF-8" ]] && { echo "ERROR: locale must be UTF-8, but it is $(locale charmap)"; exit 1; }
# Testing the UTF-8 regex
#
# Tests to check that the observed byte-ranges (above) have
# been accurately observed and included in the test code and final regex.
# =========================================================================
: 2 bytes; B2=0 # run-test=1 do-not-test=0
: 3 bytes; B3=0 # run-test=1 do-not-test=0
: 4 bytes; B4=0 # run-test=1 do-not-test=0
: regex; Rx=1 # run-test=1 do-not-test=0
((strict=16)); mode[$strict]=strict # iconv -f UTF-16BE then iconv -f UTF-32BE beyond 0xFFFF)
(( lax=32)); mode[$lax]=lax # iconv -f UTF-32BE only)
# modebits=$strict
# UTF-8, in relation to UTF-16 has invalid values
# modebits=$strict automatically shifts to modebits=$lax
# when the tested integer exceeds 0xFFFF
# modebits=$lax
# UTF-8, in relation to UTF-32, has no restrictione
# Test 1 Sequentially tests a range of Big-Endian integers
# * Unicode Codepoints are a subset ofBig-Endian integers
# ( based on 'iconv' -f UTF-32BE -f UTF-8 )
# Note: strict UTF-8 has a few quirks because of UTF-16
# Set modebits=16 to "strictly" test the low range
Test=1; modebits=$strict
# Test=2; modebits=$lax
# Test=3
mode3wlo=$(( 1*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
mode3whi=$((10*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
#########################################################################
# 1 byte UTF-8 values: Nothing to do; no complexities.
#########################################################################
# 2 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B2==1)) ; then
echo "# Test 2 bytes for Valid UTF-8 values: ie. values which are in range"
# =========================================================================
time \
for d1 in {194..223} ;do
# bin oct hex dec
# lo 11000010 302 C2 194
# hi 11011111 337 DF 223
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B2b1}${B2b2}"; exit 20; }
#
done
done
echo
# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 2 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
time \
for d1 in {128..193} {224..255} ;do
#for d1 in {128..194} {224..255} ;do # force a valid UTF-8 (needs $B2b2)
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {0..127} {192..255} ;do
#for d2 in {0..128} {192..255} ;do # force a valid UTF-8 (needs $B2b1)
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B2b1}${B2b2}"; exit 21; }
#
done
done
echo
fi
#########################################################################
# 3 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B3==1)) ; then
echo "# Test 3 bytes for Valid UTF-8 values: ie. values which are in range"
# ========================================================================
time \
for d1 in {224..239} ;do
# bin oct hex dec
# lo 11100000 340 E0 224
# hi 11101111 357 EF 239
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {160..191})"
# bin oct hex dec
# lo 10100000 240 A0 160
# hi 10111111 277 BF 191
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {128..159})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10011111 237 9F 159
else
B3b2range="$(echo {128..191})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
fi
#
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 30; }
#
done
done
done
echo
# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 3 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
#
# real 26m28.462s \
# user 27m12.526s | stepping by 2
# sys 13m11.193s /
#
# real 239m00.836s \
# user 225m11.108s | stepping by 1
# sys 120m00.538s /
#
time \
for d1 in {128..223..1} {240..255..1} ;do
#for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3)
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {0..159..1} {192..255..1})"
#B3b2range="$(> {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {0..127..1} {160..255..1})"
#B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
else
B3b2range="$(echo {0..127..1} {192..255..1})"
#B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
fi
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {0..127..1} {192..255..1} ;do
#for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
#
done
done
done
echo
fi
#########################################################################
# Brute force testing in the Astral Plane will take a VERY LONG time..
# Perhaps selective testing is more appropriate, now that the previous tests
# have panned out okay...
#
# 4 Byte UTF-8 values:
if ((B4==1)) ; then
echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
# ==================================================================
# real 58m18.531s \
# user 56m44.317s |
# sys 27m29.867s /
time \
for d1 in {240..244} ;do
# bin oct hex dec
# lo 11110000 360 F0 240
# hi 11110100 364 F4 244 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
B4b1=$(printf "%0.2X" $d1)
#
if [[ $B4b1 == "F0" ]] ; then
B4b2range="$(echo {144..191})" ## f0 90 80 80 to f0 bf bf bf
# bin oct hex dec 010000 -- 03FFFF
# lo 10010000 220 90 144
# hi 10111111 277 BF 191
#
elif [[ $B4b1 == "F4" ]] ; then
B4b2range="$(echo {128..143})" ## f4 80 80 80 to f4 8f bf bf
# bin oct hex dec 100000 -- 10FFFF
# lo 10000000 200 80 128
# hi 10001111 217 8F 143 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
else
B4b2range="$(echo {128..191})" ## fx 80 80 80 to f3 bf bf bf
# bin oct hex dec 0C0000 -- 0FFFFF
# lo 10000000 200 80 128 0A0000
# hi 10111111 277 BF 191
fi
#
for d2 in $B4b2range ;do
B4b2=$(printf "%0.2X" $d2)
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b3=$(printf "%0.2X" $d3)
echo "${B4b1} ${B4b2} ${B4b3} xx"
#
for d4 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b4=$(printf "%0.2X" $d4)
#
echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
#
done
done
done
done
echo "# Test 4 bytes for Valid UTF-8 values: END"
echo
fi
########################################################################
# There is no test (yet) for negated range values in the astral plane. #
# (all negated range values must be invalid) #
# I won't bother; This was mainly for me to ge the general feel of #
# the tests, and the final test below should flush anything out.. #
# Traversing the intire UTF-8 range takes quite a while... #
# so no need to do it twice (albeit in a slightly different manner) #
########################################################################
################################
### The construction of: ####
### The Regular Expression ####
### (de-construction?) ####
################################
# BYTE 1 BYTE 2 BYTE 3 BYTE 4
# 1: [\x00-\x7F]
# ===========
# ([\x00-\x7F])
#
# 2: [\xC2-\xDF] [\x80-\xBF]
# =================================
# ([\xC2-\xDF][\x80-\xBF])
#
# 3: [\xE0] [\xA0-\xBF] [\x80-\xBF]
# [\xED] [\x80-\x9F] [\x80-\xBF]
# [\xE1-\xEC\xEE-\xEF] [\x80-\xBF] [\x80-\xBF]
# ==============================================
# ((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))
#
# 4 [\xF0] [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
# [\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
# [\xF4] [\x80-\x8F] [\x80-\xBF] [\x80-\xBF]
# ===========================================================
# ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))
#
# The final regex
# ===============
# 1-4: (([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))
# 4-1: (((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|([\xC2-\xDF][\x80-\xBF])|([\x00-\x7F]))
#######################################################################
# The final Test; for a single character (multi chars to follow) #
# Compare the return code of 'iconv' against the 'regex' #
# for the full range of 0x000000 to 0x10FFFF #
# #
# Note; this script has 3 modes: #
# Run this test TWICE, set each mode Manually! #
# #
# 1. Sequentially test every value from 0x000000 to 0x10FFFF #
# 2. Throw a spanner into the works! Force random byte patterns #
# 2. Throw a spanner into the works! Force random longer strings #
# ============================== #
# #
# Note: The purpose of this routine is to determine if there is any #
# difference how 'iconv' and 'regex' handle the same data #
# #
#######################################################################
if ((Rx==1)) ; then
# real 191m34.826s
# user 158m24.114s
# sys 83m10.676s
time {
invalCt=0
validCt=0
failCt=0
decBeg=$((0x00110000)) # incement by decimal integer
decMax=$((0x7FFFFFFF)) # incement by decimal integer
#
for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
((D==1)) && echo "=========================================================="
#
# Convert decimal integer '$CPDec' to Hex-digits; 6-long (dec2hex)
hexUTF32BE=$(printf '%0.8X\n' $CPDec) # hexUTF32BE
# progress count
if (((CPDec%$((0x1000)))==0)) ;then
((Test>2)) && echo
echo "$hexUTF32BE Test=$Test mode=${mode[$modebits]} "
fi
if ((Test==1 || Test==2 ))
then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
#
if ((Test==2)) ; then
bits=32
UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
perl -l -ne '/^( [\000-\177]
| [\300-\337][\200-\277]
| [\340-\357][\200-\277]{2}
| [\360-\367][\200-\277]{3}
| [\370-\373][\200-\277]{4}
| [\374-\375][\200-\277]{5}
)*$/x and print' |xxd -p )"
UTF8="${UTF8%0a}"
[[ -n "$UTF8" ]] \
&& rcIco32=0 || rcIco32=1
rcIco16=
elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
bits=16
UTF8="$( echo -n "${hexUTF32BE:4}" |
xxd -p -u -r |
iconv -f UTF-16BE -t UTF-8 2>/dev/null)" \
&& rcIco16=0 || rcIco16=1
rcIco32=
else
bits=32
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \
&& rcIco32=0 || rcIco32=1
rcIco16=
fi
# echo "1 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((${rcIco16}${rcIco32}!=0)) ;then
# 'iconv -f UTF-16BE' failed produce a reliable UTF-8
if ((bits==16)) ;then
((D==1)) && echo "bits-$bits rcIconv: error $hexUTF32BE .. 'strict' failed, now trying 'lax'"
# iconv failed to create a 'srict' UTF-8 so
# try UTF-32BE to get a 'lax' UTF-8 pattern
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \
&& rcIco32=0 || rcIco32=1
#echo "2 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
if ((rcIco32!=0)) ;then
((D==1)) && echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE"
rcIco32=1
fi
fi
fi
# echo "3 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((rcIco16==0 || rcIco32==0)) ;then
# 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern
((D==1)) && echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE"
((D==1)) && if [[ $bits == "16" && $rcIco32 == "0" ]] ;then
echo " .. 'lax' UTF-8 produced a pattern"
else
echo
fi
# regex test
if ((modebits==strict)) ;then
#rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))*$/ or print' )"
rxOut="$(echo -n "$UTF8" |
perl -l -ne '/^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' )"
else
if ((Test==2)) ;then
rx="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ and print')"
[[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut=
rx="$(echo -n "$rx" |sed -e "s/\(..\)/\1 /g")"
else
rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print' )"
fi
fi
if [[ "$rxOut" == "" ]] ;then
((D==1)) && echo " rcRegex: ok"
rcRegex=0
else
((D==1)) && echo -n "bits-$bits rcRegex: error $hexUTF32BE .. 'strict' failed,"
((D==1)) && if [[ "12" == *$Test* ]] ;then
echo # " (codepoint) Test $Test"
else
echo
fi
rcRegex=1
fi
fi
#
elif [[ $Test == 2 ]]
then # Test 2. Throw a randomizing spanner into the works!
# Then test the arbitary bytes ASIS
#
hexLineRand="$(echo -n "$hexUTF32BE" |
sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8/" |
sort -R |
tr -d '\n')"
#
elif [[ $Test == 3 ]]
then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF
#
echo "Test 3 is not properly implemented yet.. Exiting"
exit 99
else
echo "ERROR: Invalid mode"
exit
fi
#
#
if ((Test==1 || Test=2)) ;then
if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
((rcIconv=rcIco16))
else
((rcIconv=rcIco32))
fi
if ((rcRegex!=rcIconv)) ;then
[[ $Test != 1 ]] && echo
if ((rcRegex==1)) ;then
echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} "
else
echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} "
fi
((failCt++));
elif ((rcRegex!=0)) ;then
# ((invalCt++)); echo -ne "$hexUTF32BE exit-codes $${rcIco16}${rcIco32}=,$rcRegex\t: $(printf "%0.8X\n" $invalCt)\t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")\r"
((invalCt++))
else
((validCt++))
fi
if ((Test==1)) ;then
echo -ne "$hexUTF32BE " "mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt)) \r"
else
echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))\r"
fi
fi
done
} # End time
fi
exit
\300\200
(非常糟糕:这是代码点0,没有用空字节表示!)。我认为您的正则表达式正确拒绝了它们。
我发现uconv
(在icu-devtools
Debian 中的软件包中)对检查UTF-8数据很有用:
$ print '\\xE9 \xe9 \u20ac \ud800\udc00 \U110000' |
uconv --callback escape-c -t us
\xE9 \xE9 \u20ac \xED\xA0\x80\xED\xB0\x80 \xF4\x90\x80\x80
(\x
s帮助识别无效字符(除了\xE9
上面用文字自动引入的误报外)。
(大量其他不错的用法)。
recode
可以类似地使用-除了我认为如果要求翻译无效的多字节序列应该失败。不过我不确定。print...|recode u8..u8/x4
例如,它不会失败(它只会像您在上面那样做一个十六进制转储),因为它除了什么都不会做iconv data data
,但它确实会失败,recode u8..u2..u8/x4
因为它会先翻译然后打印。但是我对它的了解还不足以确定-而且有很多可能性。
test.txt
。我应该如何使用您的解决方案找到无效字符?是什么us
在你的代码是什么意思?
us
表示美国,是ASCII的缩写。它将输入转换为ASCII码,将非ASCII字符转换为\uXXXX
符号,将非字符转换为\xXX
。
从2.0版开始,Python就具有内置unicode
功能。
#!/usr/bin/env python2
import sys
for line in sys.stdin:
try:
unicode(line, 'utf-8')
except UnicodeDecodeError:
sys.stdout.write(line)
在Python 3中,unicode
已折叠为str
。它需要传递一个类似字节的对象,这里buffer
是标准描述符的基础对象。
#!/usr/bin/env python3
import sys
for line in sys.stdin.buffer:
try:
str(line, 'utf-8')
except UnicodeDecodeError:
sys.stdout.buffer.write(line)
python 2
一个发生故障,以标志UTF-8编码的UTF-16替代物的非字符(至少2.7.6)。
我遇到了类似的问题(“上下文”部分中的详细信息),并收到以下ftfy_line_by_line.py解决方案:
#!/usr/bin/env python3
import ftfy, sys
with open(sys.argv[1], mode='rt', encoding='utf8', errors='replace') as f:
for line in f:
sys.stdout.buffer.write(ftfy.fix_text(line).encode('utf8', 'replace'))
#print(ftfy.fix_text(line).rstrip().decode(encoding="utf-8", errors="replace"))
我已经使用以下gen_basic_files_metadata.csv.sh脚本收集了基本文件系统元数据的> 10GiB CSV,该脚本实际上是在运行的:
find "${path}" -type f -exec stat --format="%i,%Y,%s,${hostname},%m,%n" "{}" \;
我遇到的麻烦是跨文件系统的文件名编码不一致,导致UnicodeDecodeError
在使用python应用程序进一步处理时会引起问题(更具体地讲csvsql)。
因此,我在ftfy脚本之上应用了代码,
请注意,ftfy速度非常慢,处理> 10GiB所需的时间:
real 147m35.182s
user 146m14.329s
sys 2m8.713s
而sha256sum用于比较:
real 6m28.897s
user 1m9.273s
sys 0m6.210s
在Intel(R)i7-3520M CPU @ 2.90GHz + 16GiB RAM上运行(以及外部驱动器上的数据)