Objective-C / Cocoa Touch中的HTML字符解码


103

首先,我发现了这一点: 目标C HTML转义/转义,但对我不起作用。

我的编码字符(来自RSS feed,顺便说一句)如下所示: &

我在网上搜索了所有内容,并找到了相关的讨论,但没有解决我的特定编码,我认为它们被称为十六进制字符。


3
该评论距原始问题六个月,因此对于那些偶然发现该问题以寻找答案和解决方案的人来说,它的意义更大。最近出现了一个非常类似的问题,我回答了stackoverflow.com/questions/2254862/… 它使用RegexKitLite和Blocks搜索和替换&#...;字符串中的等效字符。
约翰

具体是什么“无效”?我在这个问题中没有看到任何与先前问题不重复的内容。
Peter Hosey 2010年

是十进制。十六进制为8
kennytm 2010年

十进制和十六进制之间的差异是十进制以10为底,而十六进制则以16为底。每个基数中的“ 38”是不同的数字;在基数10中,它是3×10 + 8×1 = 38;而在基数16中,它是3×16 + 8×1 =五十六。较高的数字是基数的较高幂。最低的整数是基数0(= 1),下一个较高的数位是基数 1(= base),下一个数位是基数** 2(= base * base),依此类推。这是工作中的幂。
Peter Hosey

Answers:


46

这些被称为字符实体引用。当采用它们的形式时,&#<number>;它们称为数字实体引用。基本上,它是应该替换的字节的字符串表示形式。在的情况下&#038;,它表示ISO-8859-1字符编码方案中值为38的字符&

必须在RSS中对与号进行编码的原因是它是保留的特殊字符。

你需要做的是分析该字符串和匹配的值的字节替换实体&#;。我不知道在目标C中执行此操作的任何好方法,但是此堆栈溢出问题可能会有所帮助。

编辑:自从两年前回答这个问题以来,有一些很好的解决方案;请参阅下面的@Michael Waterfall答案。


2
+1我正要提交完全相同的答案(包括相同的链接,不少!)
e.James

“基本上,它是应该替换的字节的字符串表示形式。” 更像性格。这是文本,不是数据;在将文本转换为数据时,字符可能会占用多个字节,具体取决于字符和编码。
Peter Hosey,

谢谢回复。您说“它代表ISO-8859-1字符编码方案中值为38的字符,即&”。你确定吗?您是否有指向此类字符表的链接?因为我记得那只是一个引号。
treznik,2009年


以及&amp; 或&copy; 符号?
vokilam 2013年

162

查看我的HTML的NSString类别。以下是可用的方法:

- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;

3
杜德,出色的功能。您的stringByDecodingXMLEntities方法让我很高兴!谢谢!
布莱恩·莫斯考

3
没问题;)很高兴您发现它有用!
迈克尔瀑布

4
经过几个小时的搜索,我知道这是真正可行的唯一方法。对于可以执行此操作的字符串方法,NSString已过期。做得好。
亚当·埃伯巴赫,

1
我发现迈克尔(2)的许可证对我的用例而言过于严格,所以我使用了尼基塔(Nikita)的解决方案。包括来自Google工具箱的三个Apache-2.0许可文件对我来说非常有用。
杰米

10
ARC的代码更新将非常方便。.Xcode在构建时会抛出大量ARC错误和警告
Matej

52

丹尼尔(Daniel)撰写的文章基本上非常好,我在那里修复了一些问题:

  1. 删除了NSSCanner的跳过字符(否则将忽略两个连续实体之间的空格

    [scanner setCharactersToBeSkipped:nil];

  2. 修复了当存在孤立的“&”符号时的解析(我不确定“正确”输出是什么,我只是将它与firefox进行了比较):

例如

    &#ABC DF & B&#39;  & C&#39; Items (288)

这是修改后的代码:

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];

    [scanner setCharactersToBeSkipped:nil];

    NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"];

    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }

            if (gotNumber) {
                [result appendFormat:@"%C", (unichar)charCode];

                [scanner scanString:@";" intoString:NULL];
            }
            else {
                NSString *unknownEntity = @"";

                [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity];


                [result appendFormat:@"&#%@%@", xForHex, unknownEntity];

                //[scanner scanUpToString:@";" intoString:&unknownEntity];
                //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);

            }

        }
        else {
            NSString *amp;

            [scanner scanString:@"&" intoString:&amp];  //an isolated & symbol
            [result appendString:amp];

            /*
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
             */
        }

    }
    while (![scanner isAtEnd]);

finish:
    return result;
}

这应该是对这个问题的肯定答案!谢谢!
boliva 2010年

这很棒。不幸的是,由于ARC问题,收视率最高的答案的代码不再起作用,但是确实如此。
泰德·库尔普

@TedKulp可以正常工作,您只需要禁用每个文件的ARC。stackoverflow.com/questions/6646052/…–
凯尔

如果可以的话,我会两次竖起大拇指。
Kibitz503

2016年以来仍在访问此问题的人们的快速翻译:stackoverflow.com/a/35303635/1153630
Max Chuquimia,2016年

46

从iOS 7开始,您可以使用NSAttributedString带有NSHTMLTextDocumentType属性的来本地解码HTML字符:

NSString *htmlString = @"&#63743; &amp; &#38; &lt; &gt; &trade; &copy; &hearts; &clubs; &spades; &diams;";
NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding];

NSDictionary *options = @{NSDocumentTypeDocumentAttribute:NSHTMLTextDocumentType};
NSAttributedString *decodedString;
decodedString = [[NSAttributedString alloc] initWithData:stringData
                                                 options:options
                                      documentAttributes:NULL
                                                   error:NULL];

解码后的属性字符串现在将显示为:&&<>™©♥♣♦。

注意:仅在主线程上调用时,此方法才有效。


6
如果您不需要支持iOS 6和更早版本,最好的答案
jcesarmobile 2014年

1
不,如果有人想在bg线程上对其进行编码,则不是最好的方法; O
badeleux 2014年

4
这可用于解码实体,但也会弄乱非编码的破折号。
Andrew

这被强制在主线程上发生。因此,如果没有必要,您可能不想这样做。
Keith Smiley,2015年

与UITableView有关的只是挂起GUI。因此,无法正常工作。
2015年

35

似乎没有人提到最简单的选项之一:Mac版Google工具箱
(尽管名称如此,它也适用于iOS。)

https://github.com/google/google-toolbox-for-mac/blob/master/Foundation/GTMNSString%2BHTML.h

/// Get a string where internal characters that are escaped for HTML are unescaped 
//
///  For example, '&amp;' becomes '&'
///  Handles &#32; and &#x32; cases as well
///
//  Returns:
//    Autoreleased NSString
//
- (NSString *)gtm_stringByUnescapingFromHTML;

而且我只需要在项目中包括三个文件:标头,实现和GTMDefines.h


我已经包含了这三个脚本,但是现在如何使用它呢?
2011年

@ borut-t [myString gtm_stringByUnescapingFromHTML]
Nikita Rybak

2
我选择仅包含这三个文件,因此我需要执行此操作以使其与arc兼容:code.google.com/p/google-toolbox-for-mac/wiki/ARC_Compatibility
jaime

我不得不说,这是迄今为止最简单,最轻便的解决方案
lensovet 2013年

我希望我能使它完全正常工作。似乎跳过了我的字符串中的许多字符串。
约瑟夫·多伦多

17

我应该将其发布在GitHub之类的东西上。这属于NSString类别,NSScanner用于实现,并处理十六进制和十进制数字字符实体以及通常的符号实体。

而且,它可以相对优雅地处理格式错误的字符串(当您带有&后跟无效的字符序列时),这在我使用此代码的已发布应用程序中至关重要。

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];
    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }
            if (gotNumber) {
                [result appendFormat:@"%C", charCode];
            }
            else {
                NSString *unknownEntity = @"";
                [scanner scanUpToString:@";" intoString:&unknownEntity];
                [result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);
            }
            [scanner scanString:@";" intoString:NULL];
        }
        else {
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
        }
    }
    while (![scanner isAtEnd]);

finish:
    return result;
}

非常有用的一段代码,但是确实有Walty解决的几个问题。感谢分享!
Michael瀑布2010年

您知道一种通过解码像&micro;这样的XML实体来显示lambda,mu,nu,pi符号的方法吗?...等等????
chinthakad 2012年

您应该避免使用gotos作为其糟糕的代码样式。此时应更换线路goto finish;break;
Stunner

4

这是我使用RegexKitLite框架的方法:

-(NSString*) decodeHtmlUnicodeCharacters: (NSString*) html {
NSString* result = [html copy];
NSArray* matches = [result arrayOfCaptureComponentsMatchedByRegex: @"\\&#([\\d]+);"];

if (![matches count]) 
    return result;

for (int i=0; i<[matches count]; i++) {
    NSArray* array = [matches objectAtIndex: i];
    NSString* charCode = [array objectAtIndex: 1];
    int code = [charCode intValue];
    NSString* character = [NSString stringWithFormat:@"%C", code];
    result = [result stringByReplacingOccurrencesOfString: [array objectAtIndex: 0]
                                               withString: character];      
}   
return result;  

}

希望这会帮助某人。


4

您可以仅使用此功能来解决此问题。

+ (NSString*) decodeHtmlUnicodeCharactersToString:(NSString*)str
{
    NSMutableString* string = [[NSMutableString alloc] initWithString:str];  // #&39; replace with '
    NSString* unicodeStr = nil;
    NSString* replaceStr = nil;
    int counter = -1;

    for(int i = 0; i < [string length]; ++i)
    {
        unichar char1 = [string characterAtIndex:i];    
        for (int k = i + 1; k < [string length] - 1; ++k)
        {
            unichar char2 = [string characterAtIndex:k];    

            if (char1 == '&'  && char2 == '#' ) 
            {   
                ++counter;
                unicodeStr = [string substringWithRange:NSMakeRange(i + 2 , 2)];    
                // read integer value i.e, 39
                replaceStr = [string substringWithRange:NSMakeRange (i, 5)];     //     #&39;
                [string replaceCharactersInRange: [string rangeOfString:replaceStr] withString:[NSString stringWithFormat:@"%c",[unicodeStr intValue]]];
                break;
            }
        }
    }
    [string autorelease];

    if (counter > 1)
        return  [self decodeHtmlUnicodeCharactersToString:string]; 
    else
        return string;
}

2

这是杨永'的答案的Swift版本:

extension String {
    static private let mappings = ["&quot;" : "\"","&amp;" : "&", "&lt;" : "<", "&gt;" : ">","&nbsp;" : " ","&iexcl;" : "¡","&cent;" : "¢","&pound;" : " £","&curren;" : "¤","&yen;" : "¥","&brvbar;" : "¦","&sect;" : "§","&uml;" : "¨","&copy;" : "©","&ordf;" : " ª","&laquo" : "«","&not" : "¬","&reg" : "®","&macr" : "¯","&deg" : "°","&plusmn" : "±","&sup2; " : "²","&sup3" : "³","&acute" : "´","&micro" : "µ","&para" : "¶","&middot" : "·","&cedil" : "¸","&sup1" : "¹","&ordm" : "º","&raquo" : "»&","frac14" : "¼","&frac12" : "½","&frac34" : "¾","&iquest" : "¿","&times" : "×","&divide" : "÷","&ETH" : "Ð","&eth" : "ð","&THORN" : "Þ","&thorn" : "þ","&AElig" : "Æ","&aelig" : "æ","&OElig" : "Œ","&oelig" : "œ","&Aring" : "Å","&Oslash" : "Ø","&Ccedil" : "Ç","&ccedil" : "ç","&szlig" : "ß","&Ntilde;" : "Ñ","&ntilde;":"ñ",]

    func stringByDecodingXMLEntities() -> String {

        guard let _ = self.rangeOfString("&", options: [.LiteralSearch]) else {
            return self
        }

        var result = ""

        let scanner = NSScanner(string: self)
        scanner.charactersToBeSkipped = nil

        let boundaryCharacterSet = NSCharacterSet(charactersInString: " \t\n\r;")

        repeat {
            var nonEntityString: NSString? = nil

            if scanner.scanUpToString("&", intoString: &nonEntityString) {
                if let s = nonEntityString as? String {
                    result.appendContentsOf(s)
                }
            }

            if scanner.atEnd {
                break
            }

            var didBreak = false
            for (k,v) in String.mappings {
                if scanner.scanString(k, intoString: nil) {
                    result.appendContentsOf(v)
                    didBreak = true
                    break
                }
            }

            if !didBreak {

                if scanner.scanString("&#", intoString: nil) {

                    var gotNumber = false
                    var charCodeUInt: UInt32 = 0
                    var charCodeInt: Int32 = -1
                    var xForHex: NSString? = nil

                    if scanner.scanString("x", intoString: &xForHex) {
                        gotNumber = scanner.scanHexInt(&charCodeUInt)
                    }
                    else {
                        gotNumber = scanner.scanInt(&charCodeInt)
                    }

                    if gotNumber {
                        let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt)
                        result.appendContentsOf(newChar)
                        scanner.scanString(";", intoString: nil)
                    }
                    else {
                        var unknownEntity: NSString? = nil
                        scanner.scanUpToCharactersFromSet(boundaryCharacterSet, intoString: &unknownEntity)
                        let h = xForHex ?? ""
                        let u = unknownEntity ?? ""
                        result.appendContentsOf("&#\(h)\(u)")
                    }
                }
                else {
                    scanner.scanString("&", intoString: nil)
                    result.appendContentsOf("&")
                }
            }

        } while (!scanner.atEnd)

        return result
    }
}

1

实际上,Michael Waterfall出色的MWFeedParser框架(请参阅他的回答)是由rmchaara派生的,rmchaara在ARC支持下对其进行了更新!

您可以在Github上找到它在这里

它确实很棒,我使用了stringByDecodingHTMLEntities方法并且完美无瑕。


这解决了ARC问题-但引入了一些警告。我认为忽略它们是安全的吗?
罗伯特·克莱格

0

好像您需要其他解决方案!这很简单,也很有效:

@interface NSString (NSStringCategory)
- (NSString *) stringByReplacingISO8859Codes;
@end


@implementation NSString (NSStringCategory)
- (NSString *) stringByReplacingISO8859Codes
{
    NSString *dataString = self;
    do {
        //*** See if string contains &# prefix
        NSRange range = [dataString rangeOfString: @"&#" options: NSRegularExpressionSearch];
        if (range.location == NSNotFound) {
            break;
        }
        //*** Get the next three charaters after the prefix
        NSString *isoHex = [dataString substringWithRange: NSMakeRange(range.location + 2, 3)];
        //*** Create the full code for replacement
        NSString *isoString = [NSString stringWithFormat: @"&#%@;", isoHex];
        //*** Convert to decimal integer
        unsigned decimal = 0;
        NSScanner *scanner = [NSScanner scannerWithString: [NSString stringWithFormat: @"0%@", isoHex]];
        [scanner scanHexInt: &decimal];
        //*** Use decimal code to get unicode character
        NSString *unicode = [NSString stringWithFormat:@"%C", decimal];
        //*** Replace all occurences of this code in the string
        dataString = [dataString stringByReplacingOccurrencesOfString: isoString withString: unicode];
    } while (TRUE); //*** Loop until we hit the NSNotFound

    return dataString;
}
@end

0

例如@"2318",如果将字符实体引用作为字符串,则可以使用strtoul; 提取具有正确unicode字符的经过重新编码的NSString 。

NSString *unicodePoint = @"2318"
unichar iconChar = (unichar) strtoul(unicodePoint.UTF8String, NULL, 16);
NSString *recoded = [NSString stringWithFormat:@"%C", iconChar];
NSLog(@"recoded: %@", recoded");
// prints out "recoded: ⌘"

0

Swift 3版本的Jugale的答案

extension String {
    static private let mappings = ["&quot;" : "\"","&amp;" : "&", "&lt;" : "<", "&gt;" : ">","&nbsp;" : " ","&iexcl;" : "¡","&cent;" : "¢","&pound;" : " £","&curren;" : "¤","&yen;" : "¥","&brvbar;" : "¦","&sect;" : "§","&uml;" : "¨","&copy;" : "©","&ordf;" : " ª","&laquo" : "«","&not" : "¬","&reg" : "®","&macr" : "¯","&deg" : "°","&plusmn" : "±","&sup2; " : "²","&sup3" : "³","&acute" : "´","&micro" : "µ","&para" : "¶","&middot" : "·","&cedil" : "¸","&sup1" : "¹","&ordm" : "º","&raquo" : "»&","frac14" : "¼","&frac12" : "½","&frac34" : "¾","&iquest" : "¿","&times" : "×","&divide" : "÷","&ETH" : "Ð","&eth" : "ð","&THORN" : "Þ","&thorn" : "þ","&AElig" : "Æ","&aelig" : "æ","&OElig" : "Œ","&oelig" : "œ","&Aring" : "Å","&Oslash" : "Ø","&Ccedil" : "Ç","&ccedil" : "ç","&szlig" : "ß","&Ntilde;" : "Ñ","&ntilde;":"ñ",]

    func stringByDecodingXMLEntities() -> String {

        guard let _ = self.range(of: "&", options: [.literal]) else {
            return self
        }

        var result = ""

        let scanner = Scanner(string: self)
        scanner.charactersToBeSkipped = nil

        let boundaryCharacterSet = CharacterSet(charactersIn: " \t\n\r;")

        repeat {
            var nonEntityString: NSString? = nil

            if scanner.scanUpTo("&", into: &nonEntityString) {
                if let s = nonEntityString as? String {
                    result.append(s)
                }
            }

            if scanner.isAtEnd {
                break
            }

            var didBreak = false
            for (k,v) in String.mappings {
                if scanner.scanString(k, into: nil) {
                    result.append(v)
                    didBreak = true
                    break
                }
            }

            if !didBreak {

                if scanner.scanString("&#", into: nil) {

                    var gotNumber = false
                    var charCodeUInt: UInt32 = 0
                    var charCodeInt: Int32 = -1
                    var xForHex: NSString? = nil

                    if scanner.scanString("x", into: &xForHex) {
                        gotNumber = scanner.scanHexInt32(&charCodeUInt)
                    }
                    else {
                        gotNumber = scanner.scanInt32(&charCodeInt)
                    }

                    if gotNumber {
                        let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt)
                        result.append(newChar)
                        scanner.scanString(";", into: nil)
                    }
                    else {
                        var unknownEntity: NSString? = nil
                        scanner.scanUpToCharacters(from: boundaryCharacterSet, into: &unknownEntity)
                        let h = xForHex ?? ""
                        let u = unknownEntity ?? ""
                        result.append("&#\(h)\(u)")
                    }
                }
                else {
                    scanner.scanString("&", into: nil)
                    result.append("&")
                }
            }

        } while (!scanner.isAtEnd)

        return result
    }
}
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.