Why does the `fi` text get cut when I copy from a PDF or print a document?

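When I copy from an Adobe Reader PDF file that contains

Define an operation

I instead see

Dene an operation

when I paste the text. Why is this?

How can I remedy this annoying problem?

I have also seen this happen in the past when printing a Microsoft Office Word file to my printer.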

windows clipboard

— Tamara Wijsman

Answers:

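This sounds like a font issue. The PDF is probably using the OpenType fi ligature in the word define, and the current font of the destination application is missing that glyph.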

I don't know whether there is an easy way to make Acrobat decompose the ligature on copy.
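
If the pasted text does contain the ligature as a single Unicode character (for example U+FB01, LATIN SMALL LIGATURE FI) that the destination font simply cannot display, it can be decomposed after the fact. Below is a minimal Python sketch under that assumption; the helper name `decompose_ligatures` is made up for illustration. It will not help when the PDF reader dropped the character entirely, and NFKC also rewrites other compatibility characters, so it may change more than the ligatures.

```python
import unicodedata


def decompose_ligatures(text):
    # NFKC compatibility normalization maps presentation-form ligatures
    # (U+FB00 to U+FB04: ff, fi, fl, ffi, ffl) to their plain letters.
    return unicodedata.normalize('NFKC', text)


# Assumed clipboard content: "Define" with U+FB01 in place of "fi"
pasted = 'De\ufb01ne an operation'
print(decompose_ligatures(pasted))
```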

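Your printing problems are probably font related too. Something is probably allowing the printer to substitute the document's font with one of its own built-in fonts, and the printer's version of the font is also missing that particular glyph. You would have to tell Windows to always download fonts to the printer to work around this.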

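Another possibility when printing: UniScribe may not be enabled. MS KB 2642020 discusses this and some possible workarounds (namely, using RAW type printing rather than EMF type printing). Although the context is slightly different from your specific problem, the cause may be the same and the same workarounds may apply.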

— afrazier

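Interesting about the ligatures. I wonder whether it can somehow be configured to behave properly. Maybe I could look at how other PDF readers behave. Where exactly do I configure it so that fonts are sent to the printer?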

— Tamara Wijsman

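From the print dialog of an app: click Properties (or Preferences, depending on the dialog version) for the printer, make sure you are on the Layout or Quality tab, then click the Advanced button. In the Graphic group, change the TrueType Font option to Download as Softfont. This covers most PostScript printers and printers that use the Windows built-in dialogs (I think), but other drivers may have things moved around or missing altogether.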

— afrazier, 2012

You may find MS KB 2642020 of some use. I have edited my answer to include that information.

— afrazier, 2012

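Thanks for describing the problem. I haven't tried to solve it yet, but I will certainly try when I run into the printing problem again. I guess one of the two solutions will solve this very specific problem... :)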

— Tamara Wijsman, 2012

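@afrazier, the solution in your comment that begins "From the print dialog of an app:" worked for me. I suggest putting that text into your answer. (I could edit it in, but I think the decision should be yours.)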

— Alan

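You can replace most of these "broken" words with the originals. It is safe to replace a word when both of the following hold:

- like dene or rey, it is not a real word
- like define or firefly, there is exactly one way to re-add ligature sequences (ff, fi, fl, ffi, or ffl) and get a real word

Most ligature problems fit these criteria. However, you cannot replace:

- us, because it is a real word, even though it may originally have been fluffs
  - the stripped forms of affirm, butterfly, fielders, fortifies, flimflam, misfits, ... are real words too
- cus, because it could become either cuffs or ficus
  - likewise the pairs stiffed/stifled, rifle/riffle, flung/fluffing, ... collapse to the same stripped form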

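In this 496-thousand-word English dictionary, there are 16055 words that contain at least one ff, fi, fl, ffi, or ffl. They turn into 15879 words once their ligatures are removed. Of the 176 words that go missing, 173 collide with another word, as cuffs and ficus do, and the last 3 are because the dictionary contains the words ff, fi, and fl themselves.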

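790 of these "ligature-removed" words are real words, like us, and 15089 are broken words. 14960 of the broken words can safely be replaced with the original word, which means 99.1% of the broken words are fixable and 93.2% of the original words that contain a ligature can be recovered after copy-pasting from a PDF. The remaining 6.8% of words containing ligature sequences are lost to collisions (cus) and sub-words (us), unless you find some way (word or document context?) to pick the best replacement for each word that has no guaranteed replacement.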
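
As a quick sanity check, the arithmetic behind these figures can be reproduced in a few lines. This sketch only re-derives the percentages from the counts quoted above; it assumes the 93.2% figure is 14960 out of 16055, which matches the rounding but is not stated explicitly.

```python
# Counts quoted in the answer
ligature_words = 16055          # dictionary words containing a ligature sequence
ligature_removed_words = 15879  # distinct words left after stripping ligatures
real_words = 790                # stripped forms that are themselves real words
broken_words = 15089            # stripped forms that are not real words
fixable_broken_words = 14960    # broken words with exactly one possible original

# The stripped forms split into real words and broken words
assert ligature_removed_words == real_words + broken_words
# 173 collisions plus the 3 words "ff", "fi" and "fl"
assert ligature_words - ligature_removed_words == 173 + 3

fixable_pct = 100 * fixable_broken_words / broken_words
recoverable_pct = 100 * fixable_broken_words / ligature_words
lost_pct = 100 - recoverable_pct

print(round(fixable_pct, 1), round(recoverable_pct, 1), round(lost_pct, 1))
```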

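Below is the Python script that generated the statistics above. It expects a dictionary text file with one word per line. At the end it writes a CSV file that maps fixable broken words to their original words.

Here is a link to download the CSV: http://www.filedropper.com/brokenligaturewordfixes Combine this mapping with something like a regex replacement script to replace most of the broken words.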

import csv
import itertools
import operator
import re


dictionary_file_path = 'dictionary.txt'
broken_word_fixes_file_path = 'broken_word_fixes.csv'
ligatures = 'ffi', 'ffl', 'ff', 'fi', 'fl'


with open(dictionary_file_path, 'r') as dictionary_file:
    # A set keeps the later "not in dictionary_words" checks fast
    dictionary_words = set(line.strip() for line in dictionary_file)


broken_word_fixes = {}
ligature_words = set()
ligature_removed_words = set()
broken_words = set()
multi_ligature_words = set()


# Find broken word fixes for words with one ligature sequence
# Example: "dene" --> "define"
words_and_ligatures = list(itertools.product(dictionary_words, ligatures))
for i, (word, ligature) in enumerate(words_and_ligatures):
    if i % 50000 == 0:
        print('1-ligature words {percent:.3g}% complete'
              .format(percent=100 * i / len(words_and_ligatures)))
    for ligature_match in re.finditer(ligature, word):
        if word in ligature_words:
            multi_ligature_words.add(word)
        ligature_words.add(word)
        if word == ligature:
            break
        # Skip words that contain a larger ligature
        if (('ffi' in word and ligature != 'ffi') or
                ('ffl' in word and ligature != 'ffl')):
            break
        # Replace ligatures with dots to avoid creating new ligatures
        # Example: "offline" --> "of.ine" to avoid creating "fi"
        ligature_removed_word = (word[:ligature_match.start()] +
                                 '.' +
                                 word[ligature_match.end():])
        # Skip words that contain another ligature
        if any(ligature in ligature_removed_word for ligature in ligatures):
            continue
        ligature_removed_word = ligature_removed_word.replace('.', '')
        ligature_removed_words.add(ligature_removed_word)
        if ligature_removed_word not in dictionary_words:
            broken_word = ligature_removed_word
            broken_words.add(broken_word)
            if broken_word not in broken_word_fixes:
                broken_word_fixes[broken_word] = word
            else:
                # Ignore broken words with multiple possible fixes
                # Example: "cus" --> "cuffs" or "ficus"
                broken_word_fixes[broken_word] = None


# Find broken word fixes for words with multiple ligature sequences
# Example: "rey" --> "firefly"
multi_ligature_words = sorted(multi_ligature_words)
numbers_of_ligatures_in_word = 2, 3
for number_of_ligatures_in_word in numbers_of_ligatures_in_word:
    ligature_lists = itertools.combinations_with_replacement(
        ligatures, r=number_of_ligatures_in_word
    )
    words_and_ligature_lists = list(itertools.product(
        multi_ligature_words, ligature_lists
    ))
    for i, (word, ligature_list) in enumerate(words_and_ligature_lists):
        if i % 1000 == 0:
            print('{n}-ligature words {percent:.3g}% complete'
                  .format(n=number_of_ligatures_in_word,
                          percent=100 * i / len(words_and_ligature_lists)))
        # Skip words that contain a larger ligature
        if (('ffi' in word and 'ffi' not in ligature_list) or
                ('ffl' in word and 'ffl' not in ligature_list)):
            continue
        ligature_removed_word = word
        for ligature in ligature_list:
            ligature_matches = list(re.finditer(ligature, ligature_removed_word))
            if not ligature_matches:
                break
            ligature_match = ligature_matches[0]
            # Replace ligatures with dots to avoid creating new ligatures
            # Example: "offline" --> "of.ine" to avoid creating "fi"
            ligature_removed_word = (
                ligature_removed_word[:ligature_match.start()] +
                '.' +
                ligature_removed_word[ligature_match.end():]
            )
        else:
            # Skip words that contain another ligature
            if any(ligature in ligature_removed_word for ligature in ligatures):
                continue
            ligature_removed_word = ligature_removed_word.replace('.', '')
            ligature_removed_words.add(ligature_removed_word)
            if ligature_removed_word not in dictionary_words:
                broken_word = ligature_removed_word
                broken_words.add(broken_word)
                if broken_word not in broken_word_fixes:
                    broken_word_fixes[broken_word] = word
                else:
                    # Ignore broken words with multiple possible fixes
                    # Example: "ung" --> "flung" or "fluffing"
                    broken_word_fixes[broken_word] = None


# Remove broken words with multiple possible fixes
for broken_word, fixed_word in broken_word_fixes.copy().items():
    if not fixed_word:
        broken_word_fixes.pop(broken_word)


number_of_ligature_words = len(ligature_words)
number_of_ligature_removed_words = len(ligature_removed_words)
number_of_broken_words = len(broken_words)
number_of_fixable_broken_words = len(
    [word for word in set(broken_word_fixes.keys())
     if word and broken_word_fixes[word]]
)
number_of_recoverable_ligature_words = len(
    [word for word in set(broken_word_fixes.values())
     if word]
)
print(number_of_ligature_words, 'ligature words')
print(number_of_ligature_removed_words, 'ligature-removed words')
print(number_of_broken_words, 'broken words')
print(number_of_fixable_broken_words,
      'fixable broken words ({percent:.3g}% fixable)'
      .format(percent=(
          100 * number_of_fixable_broken_words / number_of_broken_words
      )))
print(number_of_recoverable_ligature_words,
      'recoverable ligature words ({percent:.3g}% recoverable) '
      '(for at least one broken word)'
      .format(percent=(
          100 * number_of_recoverable_ligature_words / number_of_ligature_words
      )))


with open(broken_word_fixes_file_path, 'w+', newline='') as broken_word_fixes_file:
    csv_writer = csv.writer(broken_word_fixes_file)
    sorted_broken_word_fixes = sorted(broken_word_fixes.items(),
                                      key=operator.itemgetter(0))
    for broken_word, fixed_word in sorted_broken_word_fixes:
        csv_writer.writerow([broken_word, fixed_word])
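
The regex replacement mentioned above could look roughly like the sketch below. It is an illustration, not the script used for the statistics: the helper names `load_fixes` and `fix_broken_words` are made up, the two sample rows stand in for the real CSV, and restoring only a leading capital letter is a simplification (all-caps words are not handled).

```python
import csv
import io
import re


def load_fixes(csv_file):
    """Read rows of (broken_word, fixed_word) into a dict."""
    return {row[0]: row[1] for row in csv.reader(csv_file) if len(row) == 2}


def fix_broken_words(text, fixes):
    """Replace whole words that appear in the mapping; leave the rest alone."""
    def replace(match):
        word = match.group(0)
        if word in fixes:
            return fixes[word]
        fixed = fixes.get(word.lower())
        if fixed is None:
            return word
        # Restore a leading capital, e.g. "Dene" --> "Define"
        return fixed[0].upper() + fixed[1:] if word[0].isupper() else fixed
    return re.sub(r'[A-Za-z]+', replace, text)


# Stand-in for open('broken_word_fixes.csv', newline='')
sample_csv = io.StringIO('dene,define\r\nrey,firefly\r\n')
fixes = load_fixes(sample_csv)
print(fix_broken_words('Dene an operation for us', fixes))
```

Real words such as us are never in the mapping, so they pass through unchanged; that is the safe behavior, but it also means words like fluffs stay broken.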

— Jan Van Bruggen

The link to the .csv is broken. It would be great if you could upload it again! In any case, thanks for the code.

— MagTun

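@Enora I re-uploaded the CSV at the same link, hope it helps! I also noticed a few issues in the code and results: it uses periods as placeholders even though the new dictionary has periods in some of its words, and it does not lowercase words before comparing them. I believe all the replacements are correct, but take them with a grain of salt and know that better replacements are possible. I recommend automating the replacements with a regex, but then confirming with your own eyes that each replacement is appropriate.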

— Jan Van Bruggen

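As other answers have noted, the issue here is ligatures. However, it is completely unrelated to OpenType. The fundamental problem is that PDF is a pre-print format that concerns itself very little with content and semantics and is instead geared towards faithfully representing a page as it would be printed.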

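Text is laid out not as text but as runs of glyphs from a font at certain positions. So you get something like »Place glyph number 72 there, glyph number 101 there, glyph number 108 there, ...«. At that level there is fundamentally no notion of text at all. It is just a description of how the page looks. There are two problems with extracting meaning from a bunch of glyphs: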

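1. The spatial layout. Since the PDF already contains specific information about where to place each glyph, there is no actual text underlying it as there normally would be. Another side effect is that there are no spaces. Sure, there are if you look at the text, but not in the PDF. Why emit a blank glyph when you could just emit nothing at all? The result is the same, after all. So PDF readers have to carefully piece the text together again, inserting a space whenever they encounter a larger gap between glyphs.
2. PDF renders glyphs, not text. Most of the time the glyph IDs correspond to Unicode code points, or at least to ASCII codes, in the embedded fonts, which means you can often get ASCII or Latin 1 text back well enough, depending on who created the PDF in the first place (some tools garble everything in the process). But often even PDFs that let you get ASCII text out just fine will mangle everything that is not ASCII. This is especially horrible with complex scripts such as Arabic, which after the layout stage contain only ligatures and alternate glyphs, which means Arabic PDFs almost never contain actual text.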
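
The gap-based space insertion described in the first item can be sketched as follows. This is an illustration only: the `Glyph` record, the `rebuild_line` helper, and the `gap_factor` threshold are assumptions made for the example, and real PDF readers use more elaborate heuristics.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str     # character the reader resolved this glyph to
    x: float      # left edge of the glyph on the baseline
    width: float  # advance width of the glyph


def rebuild_line(glyphs, gap_factor=0.3):
    """Join glyphs left to right, inserting a space at wide gaps."""
    pieces = []
    previous = None
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if previous is not None:
            gap = glyph.x - (previous.x + previous.width)
            # Hypothetical threshold: a gap wider than a fraction of the
            # previous glyph's width is treated as a word boundary
            if gap > gap_factor * previous.width:
                pieces.append(' ')
        pieces.append(glyph.char)
        previous = glyph
    return ''.join(pieces)


line = [Glyph('H', 0, 7), Glyph('i', 7, 3),
        Glyph('a', 14, 5), Glyph('l', 19, 3), Glyph('l', 22, 3)]
print(rebuild_line(line))
```

A glyph that the reader cannot map to any character, such as a ligature without a Unicode mapping, simply never makes it into the input list, which is one way "Define" ends up as "Dene".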

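The second problem is the one you are facing. A common culprit here is LaTeX, which uses an estimated 238982375 different fonts (each restricted to 256 glyphs) to achieve its output. Different fonts for normal text, for math (which uses more than one), and so on make things very difficult, especially as Metafont predates Unicode by almost two decades, so there never was a Unicode mapping. Umlauts are also rendered as a diaeresis placed over a letter, so you get »¨a« instead of »ä« when copying from a PDF (and of course you cannot search for it either).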
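
For the umlaut case, copied text can sometimes be repaired afterwards. The sketch below assumes the accent arrives as a spacing diaeresis (U+00A8) immediately before its base letter, as in the »¨a« example; PDFs that place it after the letter or separate it with a space would need a different pattern. The name `recompose_accents` is made up for illustration.

```python
import re
import unicodedata

# Spacing accent --> combining accent (only the diaeresis is covered here)
SPACING_TO_COMBINING = {'\u00a8': '\u0308'}


def recompose_accents(text):
    def fix(match):
        accent, letter = match.group(1), match.group(2)
        # NFC merges base letter + combining mark into one code point
        # whenever a precomposed character exists
        return unicodedata.normalize(
            'NFC', letter + SPACING_TO_COMBINING[accent])
    return re.sub('([\u00a8])([A-Za-z])', fix, text)


print(recompose_accents('M\u00a8adchen'))
```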

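Applications that produce PDFs can opt to include the actual text as metadata. If they don't, you are at the mercy of how embedded fonts are handled and whether the PDF reader can piece the original text together again. But »fi« being copied as a blank or not at all is usually the sign of a LaTeX PDF. You should paint Unicode characters on stones and throw them at the producer, hoping they will switch to XeLaTeX and thus finally arrive in the 1990s of character encodings and font standards.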

— Joey