从SQL Server中的字符串中删除HTML标签的最佳方法?


112

我在SQL Server 2005中有包含html标记的数据,我想删除所有内容,只在标记之间保留文本。理想的情况下也更换喜欢的东西&lt;<

有没有简单的方法可以做到这一点,或者有人已经获得了一些示例t-sql代码?

我没有添加扩展存储的proc之类的功能,因此更喜欢纯t-sql方法(最好是一种与sql 2000向后兼容的方法)。

我只想使用剥离的html来检索数据,而不是对其进行更新,因此理想情况下,它将被编写为用户定义的函数,以使其易于重用。

因此,例如,将其转换为:

<B>Some useful text</B>&nbsp;
<A onclick="return openInfo(this)"
   href="http://there.com/3ce984e88d0531bac5349"
   target=globalhelp>
   <IMG title="Source Description" height=15 alt="Source Description" 
        src="/ri/new_info.gif" width=15 align=top border=0>
</A>&gt;&nbsp;<b>more text</b></TD></TR>

对此:

Some useful text > more text

Answers:


161

有一个UDF将执行此处描述的操作:

用户定义的功能以剥离HTML

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX) AS
BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT
    SET @Start = CHARINDEX('<',@HTMLText)
    SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
    SET @Length = (@End - @Start) + 1
    WHILE @Start > 0 AND @End > 0 AND @Length > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
        SET @Start = CHARINDEX('<',@HTMLText)
        SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
        SET @Length = (@End - @Start) + 1
    END
    RETURN LTRIM(RTRIM(@HTMLText))
END
GO

编辑:请注意,这是针对SQL Server 2005的,但是如果将关键字MAX更改为4000,则它也将在SQL Server 2000中工作。


9
十分感谢。那里的评论链接到一个改进的版本: lazycoders.blogspot.com/2007/06/…处理更多的html实体。
罗里

4
请注意,作为SQL Server 2005或更高版本中的字符串密集型UDF,这是实现CLR UDF函数以大幅提高性能的理想选择。:做所以这里更多信息stackoverflow.com/questions/34509/...
RedFilter

10
请注意,lazycoders帖子有两个错别字。从具有单引号的CHAR(13) + CHAR(10)两个部分中删除单引号。直到它超过了一个短字段的长度,我才意识到它的细微之处(有趣的是,对我而言,所有替换都比原始字符串短)。
goodeye 2011年

1
html编码的值呢?需要将它们解码。谢谢。
JDPeckham

2
我使用了lazycoders,加上上面@goodeye的拼写错误-效果很好。为了节省时间,在lazycoders博客版是在这里:lazycoders.blogspot.com/2007/06/...
qxotk

17

源自@Goner Doug的答案,并进行了一些更新:
-尽可能使用REPLACE-
转换预定义的实体,例如&eacute;(我选择了所需的实体:-)-
列表标记的一些转换<ul> and <li>

ALTER FUNCTION [dbo].[udf_StripHTML]
--by Patrick Honorez --- www.idevlop.com
--inspired by http://stackoverflow.com/questions/457701/best-way-to-strip-html-tags-from-a-string-in-sql-server/39253602#39253602
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start  int
DECLARE @End    int
DECLARE @Length int

set @HTMLText = replace(@htmlText, '<br>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<br/>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<br />',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<li>','- ')
set @HTMLText = replace(@htmlText, '</li>',CHAR(13) + CHAR(10))

set @HTMLText = replace(@htmlText, '&rsquo;' collate Latin1_General_CS_AS, ''''  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&euro;' collate Latin1_General_CS_AS, '€'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&lt;' collate Latin1_General_CS_AS, '<'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&gt;' collate Latin1_General_CS_AS, '>'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&oelig;' collate Latin1_General_CS_AS, 'oe'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&nbsp;' collate Latin1_General_CS_AS, ' '  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&copy;' collate Latin1_General_CS_AS, '©'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&laquo;' collate Latin1_General_CS_AS, '«'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&reg;' collate Latin1_General_CS_AS, '®'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&plusmn;' collate Latin1_General_CS_AS, '±'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&sup2;' collate Latin1_General_CS_AS, '²'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&sup3;' collate Latin1_General_CS_AS, '³'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&micro;' collate Latin1_General_CS_AS, 'µ'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&middot;' collate Latin1_General_CS_AS, '·'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ordm;' collate Latin1_General_CS_AS, 'º'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&raquo;' collate Latin1_General_CS_AS, '»'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac14;' collate Latin1_General_CS_AS, '¼'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac12;' collate Latin1_General_CS_AS, '½'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac34;' collate Latin1_General_CS_AS, '¾'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Aelig' collate Latin1_General_CS_AS, 'Æ'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ccedil;' collate Latin1_General_CS_AS, 'Ç'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Egrave;' collate Latin1_General_CS_AS, 'È'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Eacute;' collate Latin1_General_CS_AS, 'É'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ecirc;' collate Latin1_General_CS_AS, 'Ê'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ouml;' collate Latin1_General_CS_AS, 'Ö'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&agrave;' collate Latin1_General_CS_AS, 'à'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&acirc;' collate Latin1_General_CS_AS, 'â'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&auml;' collate Latin1_General_CS_AS, 'ä'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&aelig;' collate Latin1_General_CS_AS, 'æ'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ccedil;' collate Latin1_General_CS_AS, 'ç'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&egrave;' collate Latin1_General_CS_AS, 'è'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&eacute;' collate Latin1_General_CS_AS, 'é'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ecirc;' collate Latin1_General_CS_AS, 'ê'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&euml;' collate Latin1_General_CS_AS, 'ë'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&icirc;' collate Latin1_General_CS_AS, 'î'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ocirc;' collate Latin1_General_CS_AS, 'ô'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ouml;' collate Latin1_General_CS_AS, 'ö'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&divide;' collate Latin1_General_CS_AS, '÷'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&oslash;' collate Latin1_General_CS_AS, 'ø'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ugrave;' collate Latin1_General_CS_AS, 'ù'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&uacute;' collate Latin1_General_CS_AS, 'ú'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ucirc;' collate Latin1_General_CS_AS, 'û'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&uuml;' collate Latin1_General_CS_AS, 'ü'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&lsaquo;' collate Latin1_General_CS_AS, '<'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&rsaquo;' collate Latin1_General_CS_AS, '>'  collate Latin1_General_CS_AS)


-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END

4
我已经使用了它并且很喜欢它,但是我确实在顶层组中添加了一个替换:</ p>我也将其更改为char 13 + char 10,因为段落标签的末尾通常表示换行。它的工作完全在我的特殊情况
DR

1
这个答案在大多数情况下都很好用,但是假设您所有的HTML标签都是有效的。就我而言,VARCHAR上传中存在一个截断问题,从而消除了一些结束标记。一个简单的PATINDEX RTrim可以删除其他所有内容。
matt123788

2
除了所做的更改@DR(加上一些更多的在需要回车),我也动了替换结果<,并>到尽头。否则,它们将与标签一起删除。
a_hardin

8

如果您的HTML格式正确,我认为这是一个更好的解决方案:

create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as
begin
    declare @textXML xml
    declare @result varchar(max)
    set @textXML = REPLACE( @text, '&', '' );
    with doc(contents) as
    (
        select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)
    )
    select @result = contents.value('.', 'varchar(max)') from doc
    return @result
end
go

select dbo.StripHTML('This <i>is</i> an <b>html</b> test')

1
这对我有用。+1。但是请您解释一下您的代码,以便开发人员更轻松地理解它?:)
Saeed Neamati 2012年

看起来它将html加载为xml文档,然后从中选择所有值。注意:此代码会在&nbsp;
JDPeckham

2
进行黑客攻击,以免炸毁HTML代码。显然,这只是一个供内部使用或其他用途的快速技巧(与公认的UDF一样)。
dudeNumber4 2013年

它的格式必须正确,因此它不如RedFilter容错。
Micah B.

1
HTML不是XML的子集。XHTML是,但是HTML不再是这条路。
大卫

7

这是此功能的更新版本,该功能将RedFilter答案(Pinal的原始功能)与LazyCoders的附加内容,goodeye错字更正以及我自己的附加内容结合在一起,以处理<STYLE>HTML中的内嵌标签。

ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start  int
DECLARE @End    int
DECLARE @Length int

-- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&amp;amp;')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &lt; with the '<' character
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &gt; with the '>' character
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &amp; with the '&' character
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &nbsp; with the ' ' character
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END

1
就我的信息而言,出于什么原因使用STUFF()代替REPLACE()(哪个IMO更短)?
帕特里克·诺诺雷斯

我还没有真正考虑过。如所示,我只是复制/修改了原件。替换可能是一个更好的选择。我想知道要考虑的两个功能之间是否存在性能比较...
Goner Doug

1
@GonerDoug为此欢呼,正在阅读已接受的评论,就像这样,这确实需要更新。
乔诺

4

这不是一个全新的解决方案,而是对afwebservant解决方案的更正:

--note comments to see the corrections

CREATE FUNCTION [dbo].[StripHTML] (@HTMLText VARCHAR(MAX))  
RETURNS VARCHAR(MAX)  
AS  
BEGIN  
 DECLARE @Start  INT  
 DECLARE @End    INT  
 DECLARE @Length INT  
 --DECLARE @TempStr varchar(255) (this is not used)  

 SET @Start = CHARINDEX('<',@HTMLText)  
 SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))  
 SET @Length = (@End - @Start) + 1  

 WHILE @Start > 0 AND @End > 0 AND @Length > 0  
 BEGIN  
   IF (UPPER(SUBSTRING(@HTMLText, @Start, 4)) <> '<BR>') AND (UPPER(SUBSTRING(@HTMLText, @Start, 5)) <> '</BR>')  
    begin  
      SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')  
      end  
-- this ELSE and SET is important
   ELSE  
      SET @Length = 0;  

-- minus @Length here below is important
   SET @Start = CHARINDEX('<',@HTMLText, @End-@Length)  
   SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))  
-- instead of -1 it should be +1
   SET @Length = (@End - @Start) + 1  
 END  

 RETURN RTRIM(LTRIM(@HTMLText))  
END  

在我使用nvarchar而不是varchar后,这对我有用,因为我在html标签中使用了Unicode字符
Shadi Namrouti

3

试试这个。这是RedFilter发布的版本的修改版本...此SQL删除具有任何随附属性的除BR,B和P之外的所有标签:

CREATE FUNCTION [dbo].[StripHtml] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
 DECLARE @Start  INT
 DECLARE @End    INT
 DECLARE @Length INT
 DECLARE @TempStr varchar(255)

 SET @Start = CHARINDEX('<',@HTMLText)
 SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
 SET @Length = (@End - @Start) + 1

 WHILE @Start > 0 AND @End > 0 AND @Length > 0
 BEGIN
   IF (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '<BR') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<P') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<B') AND (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '</B')
   BEGIN
      SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
   END

   SET @Start = CHARINDEX('<',@HTMLText, @End)
   SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))
   SET @Length = (@End - @Start) - 1
 END

 RETURN RTRIM(LTRIM(@HTMLText))
END

对我不起作用SELECT dbo.StripHtml('<b> somestuff </ b>'); 返回确切的字符串
ladieu

@ladieu,这是预期的。检查答案的第一行(“此SQL除去具有任何附带属性的BR,B和P之外的所有标记”)。
Peter Herdenborg

此SQL函数不正确。请参考以下答案以获得更正的功能。
futureelite7

@ futureelite7使用“以下”和“上方”作为在SO页面上哪里查找答案的参考是荒谬的,因为可以使用顶部的选项卡更改答案顺序(此外,投票可以更改答案顺序)。请使用作者发布它的名称来指定一个答案
凯斯Jard

3

如何将XQuery与一个衬套一起使用:

DECLARE @MalformedXML xml, @StrippedText varchar(max)
SET @MalformedXML = @xml.query('for $x in //. return ($x)//text()')
SET @StrippedText = CAST(@MalformedXML as varchar(max))

这将遍历所有元素,并且仅返回text()。

为避免元素之间的文本串联而没有空格,请使用:

DECLARE @MalformedXML xml, @StrippedText varchar(max)
SET @MalformedXML = @xml.query('for $x in //. return concat((($x)//text())[1]," ")')
SET @StrippedText = CAST(@MalformedXML as varchar(max))

并回答“您如何将此用于列:

  SELECT CAST(html_column.query('for $x in //. return concat((($x)//text()) as varchar(max))
  FROM table

对于以上代码,请确保您html_column的数据类型为xml,否则,您需要将html的强制转换版本另存为xml。我将在加载HTML数据时将其作为一个单独的练习,因为如果SQL发现格式不正确的xml(例如,起始/结束标签不匹配,无效字符),则会引发错误。

当您要构建seachh短语,剥离HTML等时,这些功能非常有用。

只需注意,这将返回xml类型,因此在适当的地方将CAST或COVERT转换为文本。此数据类型的xml版本没有用,因为它不是格式正确的XML。


如果没有从xml强制转换的实际解决方案,我觉得这充其量只是部分解决方案。
丹尼斯·贾赫鲁丁

CAST(@xml为varchar(max))。或CONVERT(xml),@ XML)。假设大多数开发人员都会弄清楚这一点。
阿文·阿米尔

1
假设开发人员知道如何进行转换绝对是合理的,但是请记住,阅读您的答案的人可能不会直接看到“简单”的转换就是所有需要完成的工作。特别是因为提到可以在适当的地方进行投射 。-我并不是想消极,只是希望这可以帮助您创建更容易被识别为有用的答案!
丹尼斯·贾赫鲁丁

那么,列名的哪一部分呢?可以说我有一个表data,该表的列名为html,我想选择该列中的所有值,但要去掉html标记,我该如何使用您的答案来实现这一点?
菲利克斯·夏娃

2

这是一个不需要UDF的版本,即使HTML包含没有匹配结束标记的标记也可以使用。

TRY_CAST(REPLACE(REPLACE(REPLACE([HtmlCol], '>', '/> '), '</', '<'), '--/>', '-->') AS XML).value('.', 'NVARCHAR(MAX)')

1

尽管Arvin Amir的答案接近于完整的一站式解决方案,但您可以在任何地方使用。他的select语句中有一个小错误(缺少该行的结尾),我想处理最常见的字符引用。

我最终要做的是:

SELECT replace(replace(replace(CAST(CAST(replace([columnNameHere], '&', '&amp;') as xml).query('for $x in //. return concat((($x)//text())[1]," ")') as varchar(max)), '&amp;', '&'), '&nbsp;', ' '), '&#x20;', ' ')
FROM [tableName]

没有字符参考代码,可以简化为:

SELECT CAST(CAST([columnNameHere] as xml).query('for $x in //. return concat((($x)//text())[1]," ")') as varchar(max))
FROM [tableName]

0

帕特里克·诺诺雷斯(Patrick Honorez)的代码需要稍作更改。

对于包含&lt;或的html,它返回不完整的结果&gt;

这是因为该部分下面的代码

-删除标签之间的所有内容

实际上会将<>替换为空。解决方法是在末尾应用以下两行:

set @HTMLText = replace(@htmlText, '&lt;' collate Latin1_General_CS_AS, '<'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&gt;' collate Latin1_General_CS_AS, '>'  collate Latin1_General_CS_AS)
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.