将HTML转换为PHP中的纯文本格式以用于电子邮件

Question 1

我使用TinyMCE允许在我的网站中使用最少的文本格式。从生成的HTML中，我想将其转换为纯文本格式以发送电子邮件。我一直在使用一个名为html2text的类，但除其他外，它实际上缺少UTF-8支持。但是，我确实这样做，它将某些HTML标记映射为纯文本格式-就像在HTML之前带有标记的文本周围加下划线。

是否有人使用类似的方法将HTML转换为PHP中的纯文本？如果是这样：您是否推荐我可以使用的任何第三方类？或者您如何最好地解决这个问题？

Question 2

使用html2text（示例HTML到text），该文件已获得Eclipse Public License的许可。它使用PHP的DOM方法从HTML加载，然后遍历生成的DOM以提取纯文本。用法：

// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);

// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);

尽管不完整，但它是开源的，欢迎您提供帮助。

其他转换脚本的问题：

由于html2text（GPL）不兼容EPL。
lkessler的链接（归属）与大多数开源许可证都不兼容。

Question 3

这是另一种解决方案：

$cleaner_input = strip_tags($text);

有关消毒功能的其他变化，请参见：

https://github.com/tazotodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php

Question 4

使用DOMDocument将HTML转换为文本是可行的解决方案。考虑HTML2Text，它需要PHP5：

关于UTF-8，“ howto”页面上的内容指出：

PHP自己对unicode的支持非常差，并且不能始终正确处理utf-8。尽管html2text脚本使用unicode-safe方法（不需要mbstring模块），但是它不能总是应付PHP自己的编码处理。PHP并不真正了解unicode或utf-8之类的编码，而是使用系统的基本编码，该编码通常是ISO-8859家族之一。结果，看起来像文本编辑器中的有效字符（utf-8或单字节）的内容可能会被PHP误解。因此，即使您认为自己将有效字符输入到html2text中，也可能不是这样。

作者提供了几种解决方法，并指出HTML2Text版本2（使用DOMDocument）具有UTF-8支持。

注意商业用途的限制。

Question 5

有值得信赖的strip_tags函数。虽然不漂亮。只会消毒。您可以将其与字符串替换结合使用，以得到下划线。


<?php
// to strip all tags and wrap italics with underscore
strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));

// to preserve anchors...
str_replace("|a", "<a", strip_tags(str_replace("<a", "|a", $text)));

?>

Question 6

您可以将lynx与-stdin和-dump选项一起使用来实现：

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);

$process = proc_open('lynx -stdin -dump 2>&1', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to htmp2txt.log

    $stdin = $pipes[0];
    fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
    );
    fclose($stdin);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}

Question 7

您可以测试此功能

function html2text($Document) {
    $Rules = array ('@<script[^>]*?>.*?</script>@si',
                    '@<[\/\!]*?[^<>]*?>@si',
                    '@([\r\n])[\s]+@',
                    '@&(quot|#34);@i',
                    '@&(amp|#38);@i',
                    '@&(lt|#60);@i',
                    '@&(gt|#62);@i',
                    '@&(nbsp|#160);@i',
                    '@&(iexcl|#161);@i',
                    '@&(cent|#162);@i',
                    '@&(pound|#163);@i',
                    '@&(copy|#169);@i',
                    '@&(reg|#174);@i',
                    '@&#(d+);@e'
             );
    $Replace = array ('',
                      '',
                      '',
                      '',
                      '&',
                      '<',
                      '>',
                      ' ',
                      chr(161),
                      chr(162),
                      chr(163),
                      chr(169),
                      chr(174),
                      'chr()'
                );
  return preg_replace($Rules, $Replace, $Document);
}

Question 8

我找不到适合的任何现有解决方案-简单的HTML电子邮件到简单的纯文本文件。

我已经打开了此存储库，希望对您有所帮助。MIT许可证，顺便说一句:)

https://github.com/RobQuistNL/SimpleHtmlToText

例：

$myHtml = '<b>This is HTML</b><h1>Header</h1><br/><br/>Newlines';
echo (new Parser())->parseString($myHtml);

返回：

**This is HTML**
### Header ###


Newlines

Question 9

如果您要转换HTML特殊字符，而不仅仅是删除它们以及剥离内容并准备纯文本，这是对我有用的解决方案...

function htmlToPlainText($str){
    $str = str_replace('&nbsp;', ' ', $str);
    $str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
    $str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
    $str = html_entity_decode($str);
    $str = htmlspecialchars_decode($str);
    $str = strip_tags($str);

    return $str;
}

$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>'

htmlToPlainText($string);
// "this is ( ) a test. Yes this is! & does it get processed?"`

带有ENT_QUOTES的html_entity_decode | ENT_XML1转换类似' htmlspecialchars_decode转换类似& html_entity_decode转换像'< ，strip_tags删除所有剩余的HTML标记。

Question 10

Markdownify将HTML转换为Markdown，这是该站点上使用的纯文本格式系统。

Question 11

public function plainText($text)
{
    $text = strip_tags($text, '<br><p><li>');
    $text = preg_replace ('/<[^>]*>/', PHP_EOL, $text);

    return $text;
}

$text = "string 1 string 2 <ul><li>string 3</li><li>string 4</li></ul>string 5";

echo planText($text);

输出
字符串1
字符串2
字符串3
字符串4
字符串5

Question 12

我遇到了与OP相同的问题，尝试从上面的最高答案中尝试一些解决方案并不能证明适合我的方案。看看为什么最后。

相反，我发现了这个有用的脚本，为避免混淆，我们称之为html2text_roundcubeGPL下可用的脚本：

https://github.com/mtibben/html2text

它实际上是已经提到的脚本的更新版本- http://www.chuggnutt.com/html2text.php-由RoundCube邮件更新。

用法：

$h2t = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');
echo $h2t->getText(); // prints Hello, "WORLD"

为什么html2text_roundcube证明比其他更好：

http://www.chuggnutt.com/html2text.php对于具有特殊HTML代码/名称（例如ä）或不成对的引号（例如25" Monitor）的情况，脚本无法立即使用。
脚本https://github.com/soundasleep/html2text无法选择在文本末尾隐藏或分组链接，从而使普通的HTML页面在采用纯文本格式时看起来充满了链接。自定义代码以对转换如何进行特殊处理并不像直接在中编辑数组那样简单html2text_roundcube。

Question 13

我刚刚发现一个PHP函数“ strip_tags（）”及其在我的情况下可以正常工作。

我试图转换以下HTML：

<p><span style="font-family: 'Verdana','sans-serif'; color: black; font-size: 7.5pt;">&nbsp;</span>Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry's lackluster performance during this time,  revenue has grown at an average annual rate&nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&nbsp; So despite the downturn, how were we  able to manage growth as an industry?</p>

应用strip_tags（）函数后，我得到以下输出：

&amp;nbsp;Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&amp;nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry&#039;s lackluster performance during this time,  revenue has grown at an average annual rate&amp;nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&amp;nbsp; So despite the downturn, how were we  able to manage growth as an industry?

Question 14

对于utf-8中的文本，它对我有用mb_convert_encoding。要处理所有内容而不管错误，请确保使用“ @”。

我使用的基本代码是：

$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

$body = $dom->getElementsByTagName('body')->item(0);
echo $body->textContent;

如果需要更高级的功能，可以迭代分析节点，但是空白会遇到很多问题。

我已经根据我在这里所说的实现了一个转换器。如果您有兴趣，可以从git https://github.com/kranemora/html2text下载

它可以作为您的参考

您可以像这样使用它：

$html = <<<EOF
<p>Welcome to <strong>html2text<strong></p>
<p>It's <em>works</em> for you?</p>
EOF;

$html2Text = new \kranemora\Html2Text\Html2Text;
$text = $html2Text->convert($html);

Question 15

如果您不想完全剥离标签并将内容保留在标签内，则可以使用DOMDocument并提取textContent根节点的，如下所示：

function html2text($html) {
    $dom = new DOMDocument();
    $dom->loadHTML("<body>" . strip_tags($html, '<b><a><i><div><span><p>') . "</body>");
    $xpath = new DOMXPath($dom);
    $node = $xpath->query('body')->item(0);
    return $node->textContent; // text
}

$p = 'this is <b>test</b>. <p>how are <i>you?</i>. <a href="#">I\'m fine!</a></p>';
print html2text($p);
// this is test. how are you?. I'm fine!

这种方法的一个优点是它不需要任何外部程序包。