Parsing Mails in Python, How Difficult Can It Be?
Challenge accepted and solved.

I told a friend I had to write an email parser in Python.

“Oh, you’re just attempting one of the hardest things ever done in coding”, he laughed.

Challenge accepted.

Research

From my previous foray into Python’s email module I learned two things: the Internet isn’t much help in terms of code ideas, and a close reading of the documentation is beneficial. I would also need to dive into some RFCs.[1] Email is a very flexible format and has been around for a long, long time; it came into existence when things like writing email in Chinese were pure science fiction. The standard is complex and shows its age, especially with regard to MIME types. Along with the RFCs that are already mentioned in Python’s documentation, I found two resources helpful for tackling this beast.

In 2017, Comp Sci professor Bo Waggoner did a statistical analysis on the MIME types of his emails and published his findings in a blog post, “If MIMEs could talk: Email structures in the wild”. The results list the most prevalent email structures and how they relate to the look of the actual message. It essentially maps the theoretical descriptions of the RFCs to emails in the wild, making it a very useful guide.

In addition, figuring out the actual structure of an email can be bewildering. How is the message tree structured when you just get an EmailMessage object? Hidden in the depths of the email documentation is a small debugging helper, email.iterators._structure(), that visualizes the structure according to the content type. You pass it an EmailMessage and there it is, complete with indentation and all.

from email.iterators import _structure

# email_message: email.message.EmailMessage
# _structure() prints the tree itself and returns None, so no print() is needed.
_structure(email_message)

# Prints for example:
# multipart/mixed
#     multipart/alternative
#         text/html
#     application/pdf

Together with the RFCs, these two resources provided the foundation for the parser. I ran the snippet against my own dataset to find the most prevalent patterns before I started coding. But first I had to stamp out one of the ugliest code smells.
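Tallying the structures of a whole dataset could look roughly like the sketch below. It is not the code I actually ran: the helper names structure_signature and tally_structures are made up for illustration, the sample mails are hand-written stand-ins, and the signature is a flat list of content types in walk() order, so it loses the nesting that _structure shows.

```python
from collections import Counter
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def structure_signature(msg: EmailMessage) -> str:
    """Return the content types of all parts in walk() order, joined by ' > '."""
    return " > ".join(part.get_content_type() for part in msg.walk())


def tally_structures(raw_messages: list[bytes]) -> Counter:
    """Count how often each structure signature occurs in a list of raw mails."""
    parser = BytesParser(policy=policy.default)
    return Counter(structure_signature(parser.parsebytes(raw)) for raw in raw_messages)


# Hand-written sample mails stand in for a real dataset (assumption).
plain = b"Subject: hi\r\nContent-Type: text/plain\r\n\r\nhello\r\n"

both = EmailMessage()
both["Subject"] = "two versions"
both.set_content("hello")
both.add_alternative("<p>hello</p>", subtype="html")

counts = tally_structures([plain, plain, bytes(both)])
for signature, count in counts.most_common():
    print(count, signature)
```

Sorting by frequency shows which structures deserve a visit_ method first.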

Implementation

In short, a mail is structured like a tree in which each leaf is either text (HTML or plaintext) or an attachment. The EmailMessage class provides a handy walk() method that visits every node of that tree in depth-first order. How to check which node our tour guide has just passed? The most obvious answer would be if-else statements that check the Content-Type header field to determine what to do - and that would drop a code smell the size of mammoth dung.

Checking each Content-Type would lead to a lot of conditional statements. This “Conditional Complexity” is a code smell: many, many nested if-else statements make code very hard to read. I also call it Conditional Hell, because it throws the reader into an abyss where she has to figure out which of a dozen possible branches is taken at every moment.

A design pattern that I had already used in another project comes to the rescue. That project parses Markdown files using markdown-it-py and walks down each node in a Markdown tree. My implementation for the mail parser is very similar: It adapts the Visitor pattern to be more Pythonic and simpler. The adapted pattern declares a Visitor class that visits each node. It then takes a property of each node and transforms it into the name of a method, which is subsequently called if it exists. In our case, that property is the Content-Type header field.

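Here’s the skeleton implementation with more explanations below. It assumes that email.message and pathlib have been imported, along with policy from the email package and BytesGenerator from email.generator: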

class MailVisitor:
    multipart_alternative_visited = False

    def visit(self, part: email.message.EmailMessage):
        if part.is_attachment():
            self.visit_attachment(part)
        else:
            method_name = part.get_content_type().replace("/", "_")
            meth = getattr(self, f"visit_{method_name}", None)
            if meth is not None:
                meth(part)

    def visit_multipart_alternative(self, part: email.message.EmailMessage):
        self.multipart_alternative_visited = True
        print("I'm a multipart message!")

    def visit_text_html(self, part: email.message.EmailMessage):
        print("I'm in text/html land!")

    def visit_text_plain(self, part: email.message.EmailMessage):
        print("I'm in text/plain land!")

    def visit_attachment(self, part: email.message.EmailMessage):
        print("Taking care of attachments!")

        # Note: both filenames below come from the mail itself; sanitize them
        # before writing to disk in production code.
        if part.get_content_type() == "message/rfc822":
            # An attached email: the payload is a list holding the nested message.
            attached_mail = part.get_payload()[0]
            filename = attached_mail.get("Subject", failobj="attached_mail")
            with pathlib.Path(f"{filename}.eml").open("wb") as fi:
                gen = BytesGenerator(fi, policy=policy.SMTP)
                gen.flatten(attached_mail)

        else:
            # Everything else: images, CSVs, PDFs...
            filename = part.get_filename(failobj="unknown.bin")
            with pathlib.Path(filename).open("wb") as fi:
                fi.write(part.get_payload(decode=True))

def parse_mail_contents(email_message: email.message.EmailMessage):
    visitor = MailVisitor()
    for part in email_message.walk():
        visitor.visit(part)

The visit() method gets the Content-Type of the message part (e.g. text/html), transforms it into text_html (line 8) and then calls visit_text_html() to parse the content of that part (line 11). Python treats functions as first-class citizens, which shines on line 9: getattr() looks the method up by name, and the result can be passed around and called like a regular variable. Content types without a matching visit_ method, such as multipart/mixed, are silently skipped. We have now danced our way out of Conditional Hell and have code that clearly says what it does by way of method names.

Running it on the same message that generated the output of _structure, we get this:

I'm a multipart message!
I'm in text/html land!
Taking care of attachments!
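Note that multipart/mixed produces no output: there is no visit_multipart_mixed() method, so getattr() returns None and the part is skipped. The following self-contained sketch makes that visible. RecordingVisitor is a hypothetical variant of MailVisitor that records what it handled instead of printing or writing files, and the message is built in memory to mirror the structure shown earlier.

```python
from email.message import EmailMessage


class RecordingVisitor:
    """Same name-based dispatch as MailVisitor, but it records instead of printing."""

    def __init__(self) -> None:
        self.handled: list[str] = []
        self.skipped: list[str] = []

    def visit(self, part: EmailMessage) -> None:
        if part.is_attachment():
            self.visit_attachment(part)
            return
        content_type = part.get_content_type()
        # "text/html" becomes "visit_text_html"; getattr() yields None if it is missing.
        meth = getattr(self, "visit_" + content_type.replace("/", "_"), None)
        if meth is None:
            self.skipped.append(content_type)
        else:
            meth(part)

    def visit_multipart_alternative(self, part: EmailMessage) -> None:
        self.handled.append("multipart/alternative")

    def visit_text_html(self, part: EmailMessage) -> None:
        self.handled.append("text/html")

    def visit_attachment(self, part: EmailMessage) -> None:
        self.handled.append("attachment")


# Build multipart/mixed > (multipart/alternative > text/html), application/pdf.
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("<p>Hello</p>", subtype="html")
msg.make_alternative()
msg.add_attachment(b"%PDF-1.4 dummy", maintype="application",
                   subtype="pdf", filename="report.pdf")

visitor = RecordingVisitor()
for part in msg.walk():
    visitor.visit(part)

print(visitor.handled)
print(visitor.skipped)
```

Recording the skipped types is also a cheap way to find out which visit_ methods are still missing for a given dataset.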

Multipart Messages

On line 14 the code sets a flag whenever a multipart/alternative part is visited. According to RFC 2046, the parts of a multipart/alternative message are alternative versions of the same content. The Visitor class will visit each of those versions, but depending on your parsing needs, you may only want one of them, for example the text/html version. The flag can be checked later on and allows you to parse only what you need.
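How the flag might be used is sketched below. BodyCollector and its bodies_to_parse() method are hypothetical names, not part of my parser: the visitor stores one body per text subtype, and once the walk is done, the flag decides whether to keep only the preferred version. The last line shows get_body() from the standard library, which selects a body part by preference on its own.

```python
from email.message import EmailMessage


class BodyCollector:
    """Hypothetical visitor that stores one body per text subtype."""

    def __init__(self) -> None:
        self.multipart_alternative_visited = False
        self.bodies: dict[str, str] = {}

    def visit(self, part: EmailMessage) -> None:
        if part.is_attachment():
            return
        name = "visit_" + part.get_content_type().replace("/", "_")
        meth = getattr(self, name, None)
        if meth is not None:
            meth(part)

    def visit_multipart_alternative(self, part: EmailMessage) -> None:
        self.multipart_alternative_visited = True

    def visit_text_plain(self, part: EmailMessage) -> None:
        self.bodies.setdefault("plain", part.get_content())

    def visit_text_html(self, part: EmailMessage) -> None:
        self.bodies.setdefault("html", part.get_content())

    def bodies_to_parse(self, preferred: str = "html") -> dict[str, str]:
        """Keep only the preferred version if the mail declared alternatives."""
        if self.multipart_alternative_visited and preferred in self.bodies:
            return {preferred: self.bodies[preferred]}
        return dict(self.bodies)


msg = EmailMessage()
msg.set_content("Hello in plain text")
msg.add_alternative("<p>Hello in HTML</p>", subtype="html")

collector = BodyCollector()
for part in msg.walk():
    collector.visit(part)
selected = collector.bodies_to_parse()
print(list(selected))

# The standard library can make a similar choice directly:
stdlib_choice = msg.get_body(preferencelist=("html", "plain"))
```

The visitor variant is more work than get_body(), but it keeps the decision in one place when you also collect attachments during the same walk.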

Dealing with Attachments

You will have noticed the implementation of visit_attachment() starting on line 23. The code saves all attachments to disk. However, attachments require some special care, since an email that is sent as an attachment (Content-Type message/rfc822) must be treated differently from all other attachments. EmailMessage has a handy as_bytes() method, but the documentation notes that it is not always the best choice:

Note that this method is provided as a convenience and may not be the most useful way to serialize messages in your application, especially if you are dealing with multiple messages. See email.generator.BytesGenerator for a more flexible API for serializing messages.

The documentation of BytesGenerator is more specific:

As a convenience, EmailMessage provides the methods as_bytes() and bytes(aMessage), which simplify the generation of a serialized binary representation of a message object. For more detail, see email.message.

Because strings cannot represent binary data, the Generator class must convert any binary data in any message it flattens to an ASCII compatible format, by converting them to an ASCII compatible Content-Transfer_Encoding. Using the terminology of the email RFCs, you can think of this as Generator serializing to an I/O stream that is not “8 bit clean”. In other words, most applications will want to be using BytesGenerator, and not Generator.

In short: “Serialize messages using BytesGenerator, we’ll take care of the dinosaur that is email”. And, since we want to serialize email, the SMTP policy is the best choice, again leaving dinosaur-taming to our friendly Python module:

email.policy.SMTP Suitable for serializing messages in conformance with the email RFCs. Like default, but with linesep set to \r\n, which is RFC compliant.

The code then serializes the email to disk as an .eml file, which can later be deserialized with code from my previous blog post. Eml files are basically plaintext files - you can open them with any text editor, in contrast to Microsoft’s proprietary .msg format.
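A minimal round trip might look like the sketch below. It writes a made-up message with BytesGenerator and the SMTP policy, then reads it back with email.message_from_binary_file(). The reading half is an assumption on my part for this example and not necessarily the code from the previous post; the addresses and the temporary directory are placeholders.

```python
import tempfile
from email import message_from_binary_file, policy
from email.generator import BytesGenerator
from email.message import EmailMessage
from pathlib import Path

# A made-up message standing in for an attached mail.
original = EmailMessage()
original["Subject"] = "Quarterly report"
original["From"] = "alice@example.com"
original["To"] = "bob@example.com"
original.set_content("See you on Monday.")

with tempfile.TemporaryDirectory() as tmp:
    eml_path = Path(tmp) / "attached_mail.eml"

    # Serialize: policy.SMTP writes RFC-compliant \r\n line endings.
    with eml_path.open("wb") as fh:
        BytesGenerator(fh, policy=policy.SMTP).flatten(original)
    raw = eml_path.read_bytes()

    # Deserialize: policy.default gives back an EmailMessage object.
    with eml_path.open("rb") as fh:
        restored = message_from_binary_file(fh, policy=policy.default)

print(restored["Subject"])
print(restored.get_content().strip())
```

Opening the file in binary mode on both sides matters: BytesGenerator needs a binary stream, and reading bytes lets the parser handle encodings itself.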

Parsing the Body of a Message

If the body is of Content-Type text/html, life is easy: Use the excellent BeautifulSoup library to get any content you want.
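As an illustration, here is how links could be pulled out of an HTML body. To keep the sketch free of dependencies it uses the standard library's html.parser instead of BeautifulSoup; LinkCollector is a made-up helper name and the HTML snippet is invented. With BeautifulSoup, the rough equivalent would be calling find_all("a", href=True) on the parsed soup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


# Invented HTML body; in the parser this would come from part.get_content().
html_body = (
    '<p>Visit <a href="https://example.com/a">this page</a> or '
    '<a href="mailto:bob@example.com">mail Bob</a>.</p>'
)

collector = LinkCollector()
collector.feed(html_body)
print(collector.links)
```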

When it’s text/plain, life is harder. For URL parsing, I use Diego Perini’s regex, which had the best results out of a list of insane URL parsers. The regex is also able to parse all those funny punycode URLs that make a blue teamer’s life hard.

Email address parsing is easier, and I use a simple regex I adapted from O’Reilly.[2]
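To give an idea of what regex-based extraction from a plaintext body looks like, here is a deliberately simplified sketch. Neither pattern is the one I actually use: EMAIL_RE is a common simplified address pattern and URL_RE is a naive stand-in for Diego Perini's much more thorough regex. Both will miss or mangle edge cases, for example a URL directly followed by a full stop.

```python
import re

# Simplified patterns for illustration only (assumptions, see the note above).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Invented plaintext body.
text = "Contact alice@example.com or see https://example.com/docs?page=2 for details."

emails = EMAIL_RE.findall(text)
urls = URL_RE.findall(text)

print(emails)
print(urls)
```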

Concluding Note

The-Compiler originally suggested the adapted Visitor pattern for my Markdown project, and I am very grateful for their idea. It is an elegant and adaptable way to get rid of Conditional Hell - thank you!

Image generated via Nightcafé/Dall-E


  1. After having partly implemented TLS 1.3 (RFC 8446) in C as a team-based term project, email-related RFCs look like a walk in the park.

  2. Yes, I’m aware


Last modified on 2023-03-05
