Get all text inside a tag in lxml

75

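I'd like to write a code snippet that grabs all of the text inside the <content> tag in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?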

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

python parsing lxml

— Kevin Burke

1

Thanks. I'm trying to write an RSS feed parser that displays everything inside the <content> tag, which includes the HTML markup from the feed provider.

— Kevin Burke

42

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

— albertov

2

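@delnan. Not needed, tostring already handles the recursive case. You made me doubt it, so I tried it on real code and updated the answer with an example. Thanks for pointing it out.

— albertov, 2011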

5

The code is broken and produces duplicated content: >>> stringify_children(lxml.html.fromstring('A<div>B</div>C')) 'A<p>A</p>B<div>B</div>CC'

— hoju

1

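To fix the bug reported by @hoju, add with_tail=False as a parameter to tostring(), that is, tostring(c, with_tail=False). This fixes the problem with the tail text (C). As for the problem with the prefix text (A), that seems to be a bug in tostring(), which adds the <p> tag, so it is not a bug in the OP's code.

— anana, 2015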

1

The second bug can be fixed by removing c.text from the parts list. I have submitted a new answer with these bugs fixed.

— anana

3

You should use tostring(c, encoding=str) to make this run on Python 3.

— Antoine Dusséaux

77

Does text_content() do what you need?

— Ed Summers

6

text_content() removes all markup; the OP wants to keep the markup that is inside the tag.

— benselme, 2013

7

@benselme Why is it that when I use text_content, it says AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'?

— roger

6

@roger text_content() is only available if your tree is HTML (that is, if it was parsed with the methods in lxml.html).

— Louis
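
The difference described in this comment can be sketched as follows. The markup is a made-up example; the same string is parsed once with lxml.html and once with lxml.etree.

```python
import lxml.html
from lxml import etree

markup = "<p>Text outside tag <em>Text inside tag</em></p>"

html_node = lxml.html.fromstring(markup)  # an lxml.html element
xml_node = etree.fromstring(markup)       # a plain lxml.etree element

# text_content() exists on lxml.html elements and drops all markup.
print(html_node.text_content())           # Text outside tag Text inside tag

# Plain etree elements do not have the method at all.
print(hasattr(xml_node, "text_content"))  # False
```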

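@EdSummers Thanks a lot! This is useful when parsing <p> tags. When I used text() in XPath I was missing text (for example from nested links), but your method works for me!

— Sam, 2017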

1

As Louis pointed out, this only works for trees parsed with lxml.html. Arthur Debert's solution with itertext() is universal.

— SergiyKolesnikov

72

Just use the node.itertext() method, as in:

 ''.join(node.itertext())

— Arthur Debert

3

This works great, but it strips out any tags you might want to keep.

— Yablargo, 2014

Shouldn't there be a space in the string? Or am I missing something?

— Private

1

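@Private It depends on your specific needs. For example, I could have the markup <word><pre>con</pre>gregate</word> to mark the prefix in a word. Suppose I want to extract the word without the markup. If I use .join with a space, I get "con gregate", whereas without a space I get "congregate".

— Louis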
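
The example from this comment, written out as a small runnable sketch:

```python
from lxml import etree

node = etree.fromstring("<word><pre>con</pre>gregate</word>")

# itertext() yields the text chunks in document order.
pieces = list(node.itertext())

joined_plain = "".join(pieces)
joined_spaced = " ".join(pieces)

print(pieces)         # ['con', 'gregate']
print(joined_plain)   # congregate
print(joined_spaced)  # con gregate
```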

Although the answer above was accepted, this is what I actually wanted.

— Jason

19

The following snippet, which uses Python generators, works perfectly and is very efficient.

''.join(node.itertext()).strip()

— Sandeep

1

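If the node comes from indented text then, depending on the parser, it will usually contain the indentation text as well, and itertext() interleaves it with the normal text chunks. Depending on the actual setup, the following may be useful: ' '.join(node.itertext('span', 'b')), which uses only the text from <span> and <b> tags and discards the "\n" chunks that come from the indentation.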

— Zoltan K.
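
A minimal sketch of the tag filter mentioned in this comment, on a made-up indented document. One caveat, as far as I can tell from lxml's behaviour: with a tag filter, itertext() still yields the tail text of the matching elements unless with_tail=False is passed, so the sketch passes it explicitly.

```python
from lxml import etree

# Hypothetical indented input; the whitespace between tags becomes text/tail.
node = etree.fromstring("<p>\n  <span>Hello</span>\n  <b>world</b>\n</p>")

# Without a filter, the indentation shows up between the real text chunks.
everything = list(node.itertext())

# Only the text of <span> and <b> elements, without their tails.
selected = " ".join(node.itertext("span", "b", with_tail=False))

print(everything)  # ['\n  ', 'Hello', '\n  ', 'world', '\n']
print(selected)    # Hello world
```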

19

A version of albertov's stringify_children that fixes the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
            (node.tail,)) if chunk)

— anana

6

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

or in one line

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

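The rationale is the same as in this answer: leave the serialization of the child nodes to lxml. The tail part of node is not interesting in this case, since it comes after the end tag. Note that the encoding argument can be changed according to your needs.

Another possible solution is to serialize the node itself and then strip away the start and end tags: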

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

This is somewhat horrible. The code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

— Percival Ulysses

1

node.text if node.text is not None else '' can be just node.text or ''

— yprez

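Playing Lazarus here (resurrection joke... not a good one), but I have looked at this post many times when I couldn't remember exactly what I was doing. Given that node.text only returns the text that is not considered part of the iterator (when iterating directly over the node, which I believe is the same as node.getchildren()), it seems the solution could easily be simplified to: ''.join([node.text or ''] + [etree.tostring(e) for e in node])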

— Tim Alexander

This actually works with Python 3, whereas the most upvoted answer does not.

— Andrey

5

import urllib2
from lxml import etree
url = 'some_url'

Get the URL:

test = urllib2.urlopen(url)
page = test.read()

Get all the HTML code, including the table tag:

tree = etree.HTML(page)

XPath selector:

# xpath() returns a list of matches, so take the first one before serializing
table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)

res is the HTML code of the table; this is what is working for me.

So you can extract the tag's content with xpath_text() and the tag together with its content with tostring():

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)

text = tree.xpath_text("//content")

or text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')

The last line, with the strip method, is not nice, but it just works.

— d3day

For me, this works fine and is admittedly much simpler. I know I have a <details></details> tag every time and can strip it out.

— Yablargo, 2014

1

Has xpath_text been removed from lxml? It says AttributeError: 'lxml.etree._Element' object has no attribute 'xpath_text'

— roger
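
For readers who hit the same AttributeError, here is a sketch of XPath expressions that work through the regular xpath() method. The sample document is an assumption for illustration; text() and string() are standard XPath 1.0.

```python
from lxml import etree

root = etree.fromstring(
    "<root><content>Text outside tag <div>Text inside tag</div></content></root>"
)

# text() selects only the text nodes that are direct children of <content>.
direct_text = root.xpath("//content/text()")

# //text() also descends into nested elements.
descendant_text = root.xpath("//content//text()")

# string() concatenates all descendant text of the first match into one string.
all_text = root.xpath("string(//content)")

print(direct_text)      # ['Text outside tag ']
print(descendant_text)  # ['Text outside tag ', 'Text inside tag']
print(all_text)         # Text outside tag Text inside tag
```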

3

Actually, one of the simplest code snippets that worked for me (as per the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text) is

etree.tostring(html, method="text")

where html is the node/tag whose complete text you are trying to read. Note, though, that it does not get rid of script and style tags.

— Deepan Prabhu Babu
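
A small sketch of the method="text" serialization, using a made-up document. encoding="unicode" is added here so the result is a str rather than bytes on Python 3.

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# method="text" serializes only the text content and drops all markup.
text_only = etree.tostring(node, method="text", encoding="unicode")

print(text_only)  # Text outside tag Text inside tag
```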

2

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to.

— bwingenroth

1

I know this is an old question, but it is a common problem and I have a solution that seems simpler than the ones suggested so far:

import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    # Note: this mutates the node; attributes are dropped so the tag length is predictable
    node.attrib.clear()
    opening_tag = len(node.tag) + 2     # "<" + tag + ">"
    closing_tag = -(len(node.tag) + 3)  # "</" + tag + ">"
    # encoding="unicode" returns str instead of bytes on Python 3
    return lxml.html.tostring(node, encoding="unicode")[opening_tag:closing_tag]

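Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.

— Joshmaker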

1

Just a quick enhancement to the answers already given. If you want to clean up the inner text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()

— Reverse Index
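
To show what the one-liner does on indented input, here is a sketch with a made-up document. The second variant is my own assumption about what "clean" might mean: it skips whitespace-only chunks so they cannot leave doubled spaces in the middle of the result.

```python
from lxml import etree

# Hypothetical indented input with whitespace-only chunks between elements.
node = etree.fromstring("<content>\n  <b>one</b>\n  <b>two</b>\n</content>")

# The one-liner from the answer: strip each chunk, join, strip the ends.
clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()

# Variant (assumption): drop chunks that are empty after stripping.
clean_no_gaps = ' '.join(n.strip() for n in node.itertext() if n.strip())

print(repr(clean_string))   # 'one  two'
print(repr(clean_no_gaps))  # 'one two'
```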

0

Here is a working solution. We can get the content together with the parent tag and then cut the parent tag out of the output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$' 
    content_with_parent = etree.tostring(parent_element)    

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return unichr(int(m.group(1)))

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip() # remove 'white' characters on margins

    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match to end tag while getting content with tags.')

    return content_without_parent

parent_element must be of type Element.

Note that if you want the text content (rather than HTML entities in the text), leave the html_entities parameter as False.

— sergzach

0

lxml has a method for that:

node.text_content()

— Hrabal

2

This answer does not add anything new. It is the same as stackoverflow.com/a/11963661/407651.

— mzjn

-2

If this is a tag, you can try:

node.values()

— David

1

This does not get the text inside the tag; it gets the attributes of the tag.

— Timothy Jurka
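
The point of this comment can be seen in a short sketch. The element and its attributes are made up for illustration.

```python
from lxml import etree

node = etree.fromstring('<a href="http://example.com" class="ext">link text</a>')

# values() returns the attribute values, not the text inside the tag.
attribute_values = node.values()

# The text inside the tag is available through .text (or itertext()).
inner_text = node.text

print(attribute_values)  # ['http://example.com', 'ext']
print(inner_text)        # link text
```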

-2

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

# Serialize to str, then capture everything between the outer start and end tags
print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding="unicode"), re.DOTALL).group(1))

— 和房