Python: How to parse the body from a raw email, given that the raw email has no "Body" tag or anything


81

It seems easy to get the

From
To
Subject

etc. via

import email
b = email.message_from_string(a)
bbb = b['from']
ccc = b['to']

assuming that "a" is the raw email string, which looks something like this.

a = """From root@a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo@a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root@a1.local.tld
Subject: oooooooooooooooo
To: ooo@a1.local.tld
Cc: 
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo

--bound1374805739--"""

THE QUESTION

How do you get the Body of this email via Python?

So far this is the only code I am aware of, but I have yet to test it.

if email.is_multipart():
    for part in email.get_payload():
        print part.get_payload()
else:
    print email.get_payload()

Is this the correct way?

Or maybe there is something simpler, such as...

import email
b = email.message_from_string(a)
bbb = b['body']

Answers:


94

Use Message.get_payload

import email

b = email.message_from_string(a)
if b.is_multipart():
    for payload in b.get_payload():
        # if payload.is_multipart(): ...
        print(payload.get_payload())
else:
    print(b.get_payload())
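The commented-out `if payload.is_multipart(): ...` line hints that parts can themselves be multipart. A minimal sketch of handling that with recursion follows; the helper name `collect_leaf_payloads` and the sample message `RAW` are made up for illustration and are not part of the answer above.

```python
import email

# Hypothetical sample: a multipart/mixed message wrapping a multipart/alternative.
RAW = """\
From: root@a1.local.tld
To: ooo@a1.local.tld
Subject: nested example
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain

plain text body

--inner
Content-Type: text/html

<p>html body</p>

--inner--

--outer--
"""


def collect_leaf_payloads(message):
    """Return (content_type, payload) for every non-multipart part, depth first."""
    if message.is_multipart():
        leaves = []
        # For a multipart message, get_payload() returns a list of sub-messages.
        for sub in message.get_payload():
            leaves.extend(collect_leaf_payloads(sub))
        return leaves
    return [(message.get_content_type(), message.get_payload())]


leaves = collect_leaf_payloads(email.message_from_string(RAW))
for ctype, payload in leaves:
    print(ctype, repr(payload))
```

This only flattens the tree; it does not decide which leaf is "the" body, and it does not decode base64 or quoted-printable content.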

111

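To be highly confident that you are working with the actual email body (though it is still possible that you are not parsing the right part), you have to skip attachments and focus on the plain-text or HTML part (depending on your needs) for further processing.

Since attachments can be, and very often are, text/plain or text/html parts themselves, this non-bullet-proof sample skips them by checking the Content-Disposition header: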

b = email.message_from_string(a)
body = ""

if b.is_multipart():
    for part in b.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))

        # skip any text/plain (txt) attachments
        if ctype == 'text/plain' and 'attachment' not in cdispo:
            body = part.get_payload(decode=True)  # decode
            break
# not multipart - i.e. plain text, no attachments, keeping fingers crossed
else:
    body = b.get_payload(decode=True)

BTW, walk() iterates marvelously over the MIME parts, and get_payload(decode=True) does the dirty work of decoding base64 etc. for you.
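One detail worth illustrating: `get_payload(decode=True)` undoes the transfer encoding and returns bytes, so turning the result into text is a separate step. Below is a small sketch under the assumption that falling back to UTF-8 is acceptable when the part declares no charset; the helper `payload_as_text` and the sample message are hypothetical.

```python
import base64
import email

TEXT = "grüße aus Berlin"
# Build the base64 body in code rather than hard-coding it.
ENCODED = base64.b64encode(TEXT.encode("utf-8")).decode("ascii")

RAW = (
    "From: root@a1.local.tld\n"
    "To: ooo@a1.local.tld\n"
    "Subject: encoded body\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n" + ENCODED + "\n"
)


def payload_as_text(part, fallback_charset="utf-8"):
    """Decode the transfer encoding, then decode the bytes using the part's charset."""
    raw_bytes = part.get_payload(decode=True)
    if raw_bytes is None:  # multipart containers have no payload of their own
        return None
    # Assumption: fall back to UTF-8 when no charset is declared.
    charset = part.get_content_charset() or fallback_charset
    return raw_bytes.decode(charset, errors="replace")


msg = email.message_from_string(RAW)
undecoded = msg.get_payload()   # still the base64 text
decoded = payload_as_text(msg)  # the human-readable string
print(undecoded.strip())
print(decoded)
```

A declared charset that Python does not recognise would still raise `LookupError` here, so real code may want to catch that as well.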

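Some background - as I implied, the wonderful world of MIME emails presents a lot of pitfalls for "wrongly" finding the message body. In the simplest case it is in the sole "text/plain" part and get_payload() is very tempting, but we don't live in a simple world - the body is often wrapped in multipart/alternative, related, mixed etc. content. Wikipedia describes it concisely (see MIME), but considering that all of the cases below are valid and common, one has to put safety nets all around:

Very common - pretty much what you get from a normal editor (Gmail, Outlook) when sending formatted text with an attachment: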

multipart/mixed
 |
 +- multipart/related
 |   |
 |   +- multipart/alternative
 |   |   |
 |   |   +- text/plain
 |   |   +- text/html
 |   |      
 |   +- image/png
 |
 +-- application/msexcel

Relatively simple - just alternative representations:

multipart/alternative
 |
 +- text/plain
 +- text/html

For better or worse, this structure is also valid:

multipart/alternative
 |
 +- text/plain
 +- multipart/related
      |
      +- text/html
      +- image/jpeg
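To see which of these shapes a given message actually has, it can help to dump its part tree before deciding where the body lives. A rough sketch follows; the function name `mime_structure` and the sample message are illustrative assumptions. It recurses by hand instead of using walk() because walk() does not report nesting depth.

```python
import email

# Hypothetical message with the third structure shown above.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="alt"

--alt
Content-Type: text/plain

plain version

--alt
Content-Type: multipart/related; boundary="rel"

--rel
Content-Type: text/html

<p>html version</p>

--rel
Content-Type: image/jpeg
Content-Transfer-Encoding: base64

AAAA

--rel--

--alt--
"""


def mime_structure(message):
    """Return one line per MIME part, indented by nesting depth."""
    lines = []

    def visit(part, depth):
        lines.append("  " * depth + part.get_content_type())
        if part.is_multipart():
            for sub in part.get_payload():
                visit(sub, depth + 1)

    visit(message, 0)
    return lines


structure = mime_structure(email.message_from_string(RAW))
print("\n".join(structure))
```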

Hope this helps.

P.S. My point is: don't approach email lightly - it bites when you least expect it :)


5
Thanks for this thorough example and for spelling out the warning - unlike the accepted answer. I think this is the far better/safer approach.
Simon Steinberger

1
Ah, very good! Using .get_payload(decode=True) instead of just .get_payload() has made life much easier, thanks!
Mark

9

There is a very good, properly documented package available for parsing email contents.

import mailparser

mail = mailparser.parse_from_file(f)
mail = mailparser.parse_from_file_obj(fp)
mail = mailparser.parse_from_string(raw_mail)
mail = mailparser.parse_from_bytes(byte_mail)

How to use:

mail.attachments: list of all attachments
mail.body
mail.to

2
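The library is great, but I had to write my own class that inherits from MailParser and overrides the body method, because it joins the parts of the email body with "\n--- mail_boundary ---\n", which was not ideal for me.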
avram

Hi @avram, could you please share the class you wrote?
Amey P Naik

I managed to split the result on "\n--- mail_boundary ---\n".
Amey P Naik

1
@AmeyPNaik Here, I made a quick GitHub gist: gist.github.com/aleksaa01/ccd371869f3a3c7b3e47822d5d78ccdf
avram

1
@AmeyPNaik Its documentation says: mail-parser can parse the Outlook email format (.msg). To use this feature, you need to install the libemail-outlook-message-perl package.
Ciprian Tomoiagă

6

Python 3.6+ provides built-in convenience methods to find and decode the plain-text body, as in @Todor Minakov's answer. You can use the EmailMessage.get_body() and get_content() methods:

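import email
import email.policy  # "import email" alone does not load the policy submodule

# s is the raw email string
msg = email.message_from_string(s, policy=email.policy.default)
body = msg.get_body(('plain',))
if body:
    body = body.get_content()
print(body)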

Note that this will give None if there is no (obvious) plain-text body part.
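If None is not acceptable, one option is to widen the preference list so that an HTML part is returned when no plain-text part exists. This is a sketch of that idea, not part of the answer above; `best_body` and the sample message are made-up names.

```python
import email
import email.policy

# Hypothetical message that has only an HTML body.
HTML_ONLY = """\
From: root@a1.local.tld
To: ooo@a1.local.tld
Subject: html only
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"

<p>Hello</p>
"""


def best_body(raw, preference=("plain", "html")):
    """Return (content_type, text) of the preferred body part, or (None, None)."""
    msg = email.message_from_string(raw, policy=email.policy.default)
    part = msg.get_body(preferencelist=preference)
    if part is None:
        return None, None
    return part.get_content_type(), part.get_content()


plain_only = best_body(HTML_ONLY, preference=("plain",))
with_fallback = best_body(HTML_ONLY)
print(plain_only)
print(with_fallback)
```

The caller still has to check the returned content type, since an HTML body usually needs tag stripping before it can be treated as plain text.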

If you are reading from e.g. an mbox file, you can give the mailbox constructor an EmailMessage factory:

import email
import email.policy
import mailbox

mbox = mailbox.mbox(mboxfile, factory=lambda f: email.message_from_binary_file(f, policy=email.policy.default), create=False)
for msg in mbox:
    ...

Note that you must pass email.policy.default as the policy, since it is not the default :)
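For anyone who wants to try the mbox variant end to end, here is a self-contained sketch that writes a tiny two-message mbox to a temporary directory and reads the bodies back. The file contents, the helper name `read_mbox_bodies` and the addresses are assumptions made up for the example.

```python
import email
import email.policy
import mailbox
import os
import tempfile

# Hypothetical two-message mbox; each message starts with a "From " separator line.
RAW_MBOX = (
    b"From sender@example.com Thu Jul 25 19:28:59 2013\n"
    b"From: sender@example.com\n"
    b"To: rcpt@example.com\n"
    b"Subject: first\n"
    b"\n"
    b"Body of the first message.\n"
    b"\n"
    b"From sender@example.com Thu Jul 25 19:30:00 2013\n"
    b"From: sender@example.com\n"
    b"To: rcpt@example.com\n"
    b"Subject: second\n"
    b"\n"
    b"Body of the second message.\n"
)


def read_mbox_bodies(path):
    """Return a list of (subject, plain-text body or None) for each message."""

    def factory(f):
        # f is a binary file-like object positioned at one message.
        return email.message_from_binary_file(f, policy=email.policy.default)

    box = mailbox.mbox(path, factory=factory, create=False)
    try:
        results = []
        for msg in box:
            part = msg.get_body(preferencelist=("plain",))
            text = part.get_content() if part is not None else None
            results.append((str(msg["Subject"]), text))
        return results
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "wb") as fh:
        fh.write(RAW_MBOX)
    bodies = read_mbox_bodies(path)

for subject, text in bodies:
    print(subject, repr(text))
```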


2
Why isn't email.policy.default the default? It seems like it should be.
PartialOrder

4

There is no b['body'] in Python. You have to use get_payload.

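# mailEntity is assumed to be a parsed message, e.g. email.message_from_string(a)
payload = mailEntity.get_payload()
if isinstance(payload, list):
    for eachPayload in payload:
        # do the things you want here;
        # the real mail body is in eachPayload.get_payload()
        print(eachPayload.get_payload())
else:
    # this means there is only a single (e.g. text/plain) part,
    # so mailEntity.get_payload() is the body
    print(payload)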

Good luck.


0

If emails is the pandas DataFrame and emails.message is the column holding the email text:

## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)

def split_email_addresses(line):
    '''To separate multiple email addresses'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(map(lambda x: x.strip(), addrs))
    else:
        addrs = None
    return addrs 

import email
# Parse the emails into a list email objects
messages = list(map(email.message_from_string, emails['message']))
emails.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails[key] = [doc[key] for doc in messages]
# Parse content from emails
emails['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails['From'] = emails['From'].map(split_email_addresses)
emails['To'] = emails['To'].map(split_email_addresses)

# Extract the root of 'file' as 'user'
emails['user'] = emails['file'].map(lambda x:x.split('/')[0])
del messages

emails.head()

-3

Here is the code that works for me every time (for Outlook emails):

# Read the subjects and bodies of emails in a folder (or subfolder)

import win32com.client  # from the pywin32 package; Windows with Outlook only

# Create the MAPI namespace object
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# Get to the desired folder (MyEmail@xyz.com is my root folder;
# 'Inbox' and 'SubFolderName' are the subfolders)
root_folder = outlook.Folders['MyEmail@xyz.com'].Folders['Inbox'].Folders['SubFolderName']

messages = root_folder.Items

for message in messages:
    if message.Unread:  # gets only 'Unread' emails
        subject_content = message.Subject  # subject line of the mail
        body_content = message.Body        # body of the mail

        print(subject_content)
        print(body_content)

        message.Unread = False  # mark the mail as 'Read'

4
Perhaps state that this is for Outlook on Windows, not for real email.
tripleee
Licensed under cc by-sa 3.0 with attribution required.