How to extract Information from Raw Email

-2

What I have done so far

I use POP3 package to read all my emails from my mailbox.

I received a raw email from the POP3 function which shown at the example below. (I omitted some information)

Issue

I'm facing the issue to extract the information from it.

I used the mail to extract the information, but unfortunately, this package cannot extract information from the raw email.

Seeking for help

Is there any method or package there which can help me to extract the information from the raw email?


Methods I had tried

// Retrieve all the email from your mailbox
msgs, _, error := connection.ListAll()

// Convert a chunk of integer to raw email
data, _ := connection.Retr(msgs[0])

// Extract Email Address
to, _ := mail.ParseAddress(data)

Error I am facing

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x5819ed]

Raw Email Example

From: 
To: 
Subject: 
Thread-Topic: 
Thread-Index: AdPgWzcFT3FjbSbUT1ycoU2ioB1bKAAAAHuA
X-MS-Exchange-MessageSentRepresentingType: 1
Date: 
Message-ID: 
Accept-Language: en-SG, en-US
Content-Language: en-US
X-MS-Exchange-Organization-AuthAs: Internal
X-MS-Exchange-Organization-AuthMechanism: 04
X-MS-Exchange-Organization-AuthSource: 
X-MS-Has-Attach:
X-MS-Exchange-Organization-Network-Message-Id:
    8c1e141b-c3ba-4471-0526-08d5b3b59967
X-MS-TNEF-Correlator:
Content-Type: multipart/alternative;
    boundary="_002_b1a01aa36e1c4b3d9969a7bbb856bd3b"
MIME-Version: 1.0

--_002_b1a01aa36e1c4b3d9969a7bbb856bd3b
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: base64

PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwiIHhtbG5zOm89InVy
bjpzY2hlbWFzLW1pY3Jvc29mdC1jb206b2ZmaWNlOm9mZmljZSIgeG1sbnM6dz0idXJuOnNjaGVt
YXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6d29yZCIgeG1sbnM6bT0iaHR0cDovL3NjaGVtYXMubWlj
cm9zb2Z0LmNvbS9vZmZpY2UvMjAwNC8xMi9vbW1sIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv
VFIvUkVDLWh0bWw0MCI+DQo8aGVhZD4NCjxtZXRhIGh0dHAtZXF1aXY9IkNvbnRlbnQtVHlwZSIg
Y29udGVudD0idGV4dC9odG1sOyBjaGFyc2V0PXV0Zi04Ij4NCjxtZXRhIG5hbWU9IkdlbmVyYXRv
ciIgY29udGVudD0iTWljcm9zb2Z0IFdvcmQgMTUgKGZpbHRlcmVkIG1lZGl1bSkiPg0KPHN0eWxl
PjwhLS0NCi8qIEZvbnQgRGVmaW5pdGlvbnMgKi8NCkBmb250LWZhY2UNCgl7Zm9udC1mYW1pbHk6
IkNhbWJyaWEgTWF0aCI7DQoJcGFub3NlLTE6MiA0IDUgMyA1IDQgNiAzIDIgNDt9DQpAZm9udC1m
YWNlDQoJe2ZvbnQtZmFtaWx5OkRlbmdYaWFuOw0KCXBhbm9zZS0xOjIgMSA2IDAgMyAxIDEgMSAx
IDE7fQ0KQGZvbnQtZmFjZQ0KCXtmb250LWZhbWlseTpDYWxpYnJpOw0KCXBhbm9zZS0xOjIgMTUg
NSAyIDIgMiA0IDMgMiA0O30NCkBmb250LWZhY2UNCgl7Zm9udC1mYW1pbHk6IlxARGVuZ1hpYW4i
Ow0KCXBhbm9zZS0xOjIgMSA2IDAgMyAxIDEgMSAxIDE7fQ0KLyogU3R5bGUgRGVmaW5pdGlvbnMg
Ki8NCnAuTXNvTm9ybWFsLCBsaS5Nc29Ob3JtYWwsIGRpdi5Nc29Ob3JtYWwNCgl7bWFyZ2luOjBj
bTsNCgltYXJnaW4tYm90dG9tOi4wMDAxcHQ7DQoJZm9udC1zaXplOjExLjBwdDsNCglmb250LWZh
bWlseToiQ2FsaWJyaSIsc2Fucy1zZXJpZjt9DQphOmxpbmssIHNwYW4uTXNvSHlwZXJsaW5rDQoJ
e21zby1zdHlsZS1wcmlvcml0eTo5OTsNCgljb2xvcjojMDU2M0MxOw0KCXRleHQtZGVjb3JhdGlv
bjp1bmRlcmxpbmU7fQ0KYTp2aXNpdGVkLCBzcGFuLk1zb0h5cGVybGlua0ZvbGxvd2VkDQoJe21z
by1zdHlsZS1wcmlvcml0eTo5OTsNCgljb2xvcjojOTU0RjcyOw0KCXRleHQtZGVjb3JhdGlvbjp1
bmRlcmxpbmU7fQ0KcC5tc29ub3JtYWwwLCBsaS5tc29ub3JtYWwwLCBkaXYubXNvbm9ybWFsMA0K
CXttc28tc3R5bGUtbmFtZTptc29ub3JtYWw7DQoJbXNvLW1hcmdpbi10b3AtYWx0OmF1dG87DQoJ
bWFyZ2luLXJpZ2h0OjBjbTsNCgltc28tbWFyZ2luLWJvdHRvbS1hbHQ6YXV0bzsNCgltYXJnaW4t
bGVmdDowY207DQoJZm9udC1zaXplOjEyLjBwdDsNCglmb250LWZhbWlseToiVGltZXMgTmV3IFJv
bWFuIixzZXJpZjt9DQpzcGFuLkVtYWlsU3R5bGUxOA0KCXttc28tc3R5bGUtdHlwZTpwZXJzb25h
bC1jb21wb3NlOw0KCWZvbnQtZmFtaWx5OiJDYWxpYnJpIixzYW5zLXNlcmlmOw0KCWNvbG9yOndp
bmRvd3RleHQ7fQ0KLk1zb0NocERlZmF1bHQNCgl7bXNvLXN0eWxlLXR5cGU6ZXhwb3J0LW9ubHk7
DQoJZm9udC1zaXplOjEwLjBwdDsNCglmb250LWZhbWlseToiQ2FsaWJyaSIsc2Fucy1zZXJpZjt9
DQpAcGFnZSBXb3JkU2VjdGlvbjENCgl7c2l6ZTo2MTIuMHB0IDc5Mi4wcHQ7DQoJbWFyZ2luOjcy
LjBwdCA3Mi4wcHQgNzIuMHB0IDcyLjBwdDt9DQpkaXYuV29yZFNlY3Rpb24xDQoJe3BhZ2U6V29y
ZFNlY3Rpb24xO30NCi0tPjwvc3R5bGU+PCEtLVtpZiBndGUgbXNvIDldPjx4bWw+DQo8bzpzaGFw
ZWRlZmF1bHRzIHY6ZXh0PSJlZGl0IiBzcGlkbWF4PSIxMDI2IiAvPg0KPC94bWw+PCFbZW5kaWZd
LS0+PCEtLVtpZiBndGUgbXNvIDldPjx4bWw+DQo8bzpzaGFwZWxheW91dCB2OmV4dD0iZWRpdCI+
DQo8bzppZG1hcCB2OmV4dD0iZWRpdCIgZGF0YT0iMSIgLz4NCjwvbzpzaGFwZWxheW91dD48L3ht
bD48IVtlbmRpZl0tLT4NCjwvaGVhZD4NCjxib2R5IGxhbmc9IkVOLVNHIiBsaW5rPSIjMDU2M0Mx
IiB2bGluaz0iIzk1NEY3MiI+DQo8ZGl2IGNsYXNzPSJXb3JkU2VjdGlvbjEiPg0KPHAgY2xhc3M9
Ik1zb05vcm1hbCI+PG86cD4mbmJzcDs8L286cD48L3A+DQo8L2Rpdj4NCjxicj4NCjxociBhbGln
bj0ibGVmdCIgc3R5bGU9Im1hcmdpbi1sZWZ0OjA7dGV4dC1hbGlnbjpsZWZ0O3dpZHRoOjUwJTto
ZWlnaHQ6MXB4O2JhY2tncm91bmQtY29sb3I6Z3JheTtib3JkZXI6MHB4OyI+DQo8Zm9udCBzdHls
ZT0iY29sb3I6Z3JheTsiIHNpemU9Ii0xIj5UaGlzIGVtYWlsIHdhcyBzY2FubmVkIGJ5IEJpdGRl
ZmVuZGVyPC9mb250Pg0KPC9ib2R5Pg0KPC9odG1sPg0K

--_002_b1a01aa36e1c4b3d9969a7bbb856bd3b
Content-Type: text/calendar; charset="utf-8"; method=REQUEST
Content-Transfer-Encoding: base64

QkVHSU46VkNBTEVOREFSDQpNRVRIT0Q6UkVRVUVTVA0KUFJPRElEOk1pY3Jvc29mdCBFeGNoYW5n
ZSBTZXJ2ZXIgMjAxMA0KVkVSU0lPTjoyLjANCkJFR0lOOlZUSU1FWk9ORQ0KVFpJRDpTaW5nYXBv
cmUgU3RhbmRhcmQgVGltZQ0KQkVHSU46U1RBTkRBUkQNCkRUU1RBUlQ6MTYwMTAxMDFUMDAwMDAw
DQpUWk9GRlNFVEZST006KzA4MDANClRaT0ZGU0VUVE86KzA4MDANCkVORDpTVEFOREFSRA0KQkVH
SU46REFZTElHSFQNCkRUU1RBUlQ6MTYwMTAxMDFUMDAwMDAwDQpUWk9GRlNFVEZST006KzA4MDAN
ClRaT0ZGU0VUVE86KzA4MDANCkVORDpEQVlMSUdIVA0KRU5EOlZUSU1FWk9ORQ0KQkVHSU46VkVW
RU5UDQpPUkdBTklaRVI7Q049SG8gU2lldyBLZWU6TUFJTFRPOkhPX1NpZXdfS2VlQGlwaS1zaW5n
YXBvcmUub3JnDQpBVFRFTkRFRTtST0xFPVJFUS1QQVJUSUNJUEFOVDtQQVJUU1RBVD1ORUVEUy1B
Q1RJT047UlNWUD1UUlVFO0NOPUtvaCBXZWUgSG8NCiBuZzpNQUlMVE86S09IX1dlZV9Ib25nQGlw
aS1zaW5nYXBvcmUub3JnDQpERVNDUklQVElPTjtMQU5HVUFHRT1lbi1VUzpcblxuX19fX19fX19f
X19fX19fX19fX19fX19fX19fX19fX19cblRoaXMgZW1haWwNCiAgd2FzIHNjYW5uZWQgYnkgQml0
ZGVmZW5kZXJcbg0KVUlEOjA0MDAwMDAwODIwMEUwMDA3NEM1QjcxMDFBODJFMDA4MDAwMDAwMDA0
MDEzMTY0NzlFRTBEMzAxMDAwMDAwMDAwMDAwMDAwDQogMDEwMDAwMDAwMzQ5NDcwQUMzMzQ0NzM0
MDk5QzM0OEE0M0E0M0ZCREMNClNVTU1BUlk7TEFOR1VBR0U9ZW4tVVM6SFIgT3JpZW50YXRpb246
IEtvaCBXZWUgSG9uZyAoQU0gLSBEaWdpdGFsIFBsYXRmb3Jtcw0KICkNCkRUU1RBUlQ7VFpJRD1T
aW5nYXBvcmUgU3RhbmRhcmQgVGltZToyMDE4MDUwN1QxNDAwMDANCkRURU5EO1RaSUQ9U2luZ2Fw
b3JlIFN0YW5kYXJkIFRpbWU6MjAxODA1MDdUMTUzMDAwDQpDTEFTUzpQVUJMSUMNClBSSU9SSVRZ
OjUNCkRUU1RBTVA6MjAxODA1MDdUMDA1ODA2Wg0KVFJBTlNQOk9QQVFVRQ0KU1RBVFVTOkNPTkZJ
Uk1FRA0KU0VRVUVOQ0U6Mw0KTE9DQVRJT047TEFOR1VBR0U9ZW4tVVM6Q29ubmVjdGlvbg0KWC1N
SUNST1NPRlQtQ0RPLUFQUFQtU0VRVUVOQ0U6Mw0KWC1NSUNST1NPRlQtQ0RPLU9XTkVSQVBQVElE
OjEzMDY0MTMwMjYNClgtTUlDUk9TT0ZULUNETy1CVVNZU1RBVFVTOlRFTlRBVElWRQ0KWC1NSUNS
T1NPRlQtQ0RPLUlOVEVOREVEU1RBVFVTOkJVU1kNClgtTUlDUk9TT0ZULUNETy1BTExEQVlFVkVO
VDpGQUxTRQ0KWC1NSUNST1NPRlQtQ0RPLUlNUE9SVEFOQ0U6MQ0KWC1NSUNST1NPRlQtQ0RPLUlO
U1RUWVBFOjANClgtTUlDUk9TT0ZULURJU0FMTE9XLUNPVU5URVI6RkFMU0UNCkJFR0lOOlZBTEFS
TQ0KREVTQ1JJUFRJT046UkVNSU5ERVINClRSSUdHRVI7UkVMQVRFRD1TVEFSVDotUFQxNU0NCkFD
VElPTjpESVNQTEFZDQpFTkQ6VkFMQVJNDQpFTkQ6VkVWRU5UDQpFTkQ6VkNBTEVOREFSDQo=

--_002_b1a01aa36e1c4b3d9969a7bbb856bd3b
go
mime
email-parsing
asked on Stack Overflow Mar 17, 2019 by WeeHong • edited Mar 17, 2019 by kostix

1 Answer

3

The theory

E-mail messages are formatted using basically one of the following two formats.

The oldest one is defined by RFC 5322 (originally it was RFC 822 but it has been updated since then).

This format does not support messages using non-ASCII character encodings and neither does it support what general public calls "attachments".

To rectify the situation, the set of standards commonly known as MIME was invented. The standards in this set define:

  • Ways to use non-ASCII character encodings.
  • Ways to compose multi-part messages.

The two most interesting MIME standards are RFC 2045 and RFC 2046.

The standards incorporated by MIME were specifically devised in a way to make MIME-encoded stuff still compatible with RFC 822—this, among other things, allowed not to change MTAs to support the new message format.

The practice

Libraries implemented in various programming lanugages to deal with various bits defined by MIME usually follow the trend of the MIME itself and are able to transparently handle "plain" RFC 822-formatted mails and MIME-formatted.

To handle MIME-formatted messages, Go offers in its standard library the three packages:

and another one, net/textproto, to deal with MIME-style headers (also used in HTTP, IMAP/POP3 etc).

Together, these packages cover most (or all) of what you need to read and write MIME-formatted e-mail messages.

The basic approach to parsing

The basic approach to parsing an e-mail message is as follows:

  1. Create an instance of bufio.Reader from an io.Reader supplying the data of the e-mail message to be parsed.

  2. Create an instance of net/textproto.Reader from the bufio.Reader made on the previous step.

  3. Use its ReadMIMEHeader() method to read and parse the message's header block.

  4. Check to see whether it contains the MIME-Version field mandated by RFC 2045 to indicate a MIME-formatted content.

    1. If it contains one, verify its value is literally "1.0". If it isn't, this is some MIME format from the future you won't be able to cope with. Barf accordingly; otherwise proceed to the next step.
    2. If there is no such field, the message is a plain old e-mail consisting of a single part. Consider its whole contents as if it were a single MIME-message part (this is oversimplified but would mostly work).
  5. If you have verified you're dealing with a MIME 1.0 message, then

    1. Read the Content-Type header field then use mime.ParseMediaType() to parse it.
      • If the resulting mediatype value will start with the "multipart/" prefix, literally, then be prepared to deal with multiple parts, which require recursive processing as each part is formatted almost like the top-level message—that is, contains the header and the body (see below).
      • Otherwise the content type will indicate some kind of "direct" payload (such as "text/plain" or "text/html" or whatever). In this case, a special "parameter" of such media type (see below) indicates the character encoding used for the part's content, if it's textual. (Also note that it may be "message/rfc822" which actually indicates the payload is another e-mail message which might need to be parsed according to the same rules as the encloding one.)

Dealing with a "leaf" part or non-multi-part message payloads

An importand header field to read next is Content-Transfer-Encoding which defines how the part is physically encoded to be transferred over the wire.

Most of the time it will be "base64" but it also may be "quoted-printable".

Dealing with multi-part messages

First, be prepared to properly deal with the different aspects of what "multi part" is: the parts might be either alternatives to each other—that is, the program which is to render a message to the user might pick any of them—whatever suits the user's preference best, or ask them or something else,—or they may be entities sort-of equal to each other. The former scheme is indicated by the "multipart/alternative" media type and is typically used by MUAs which allow the user to compose a mail message using markup and then encode the result so that it contains two alternative parts—one marked as "text/html" and another one marked as "text/plain" and containing the source content stripped of that markup nonsense. The latter is indicated by "multipart/mixed" and is typically used for attachments: with this scheme the first part is typically (but it's not required to be that) is the message's text in whatever format and the rest of the parts are attachments.

To be able to pick out the individual parts from the message's encoding, the "multipart/whatever" media type most of the time contains a so-called "parameter" named "boundary" and containing a unique string which is used to delimit the parts. The mime.ParseMediaType() function returns the media type's parameters in the form of a map as its second result value.

After extracting that "boundary" media type's parameter, you can use it to create an instance of mime/multipart.Reader from the bufio.Reader instance made on the very first step. You can then read the message parts one-by one and act on them.

Note that, as already indicated, a message part might have the content type "message/rfc822" which means it contains another complete mail message (and it itself might be multi-part and contain other mail messages, and so on).

answered on Stack Overflow Mar 17, 2019 by kostix • edited Apr 22, 2019 by kostix

User contributions licensed under CC BY-SA 3.0