parsing emails

Red Squirrel

No Lifer
May 24, 2003
68,367
12,573
126
www.anyf.ca
I'm having a nightmare writing an email parser because either the clients or the email servers seem to keep adding returns in all the wrong spots. For example, take an email with a paragraph like this:

quote: text text text text text text text text text text text text text text text text text text text text text text text text text text

It should be on one line, but when parsing, it's put on a bunch of separate lines, so there's no way for the parser to know when to stop. Just wondering, is there a reason why this happens, and is it a return, or is it some other special character that can be removed to put it all back on one line?

Has anyone ever written an email parser before to parse specific things? In this case it's news emails for my tech site, but because every single site sends different emails, it's very hard to parse them all, especially when the emails are all formatted differently. I'd say 90% of emails have this return nightmare where returns are randomly put in lines. If anyone knows a solution to this, let me know. Thanks.
 

bofkentucky

Member
Nov 8, 2004
28
0
0
Grab all the text between the Subject: line and the next From:

Strip out any other headers and MIME sections, then strip out any newlines in the remaining text. In Perl (or languages that have Perl regular expressions) it will be relatively easy; not so sure about other languages.
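
The newline-stripping step suggested above could look roughly like this in Python. This is only a sketch under stated assumptions: the function name `unwrap_paragraphs` is made up for illustration, headers and MIME sections are assumed to be gone already, and a blank line is assumed to mark a real paragraph break.

```python
import re

def unwrap_paragraphs(body: str) -> str:
    """Join hard-wrapped lines; keep blank lines as paragraph breaks.

    Hypothetical helper: assumes a blank line always separates paragraphs.
    """
    # Normalise line endings so the regexes only have to deal with "\n".
    body = body.replace("\r\n", "\n").replace("\r", "\n")
    # Split on one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", body.strip())
    # Inside a paragraph, treat every remaining newline as a wrap point.
    return "\n\n".join(re.sub(r"[ \t]*\n[ \t]*", " ", p) for p in paragraphs)
```

Note that this also joins lines that were meant to stay separate (a signature block, or a "Title:" label followed by its value on the next line), so it probably needs per-sender tweaks.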
 

kamper

Diamond Member
Mar 18, 2003
5,513
0
0
Why would you start after the Subject:? There's no reason there couldn't be more headers after it... Isn't the rule that headers end with the first empty line?
 

bofkentucky

It's supposed to be that way, but I have seen some clients/servers munge it. Your headers will all be in the form of :startline:type-of-header: value:newline: so they are fairly easy to parse.
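
For illustration, here is a minimal Python sketch of the two ideas in this exchange: headers end at the first empty line, and each header is a `name: value` line. The function name `split_message` is hypothetical. The sketch also assumes that a line starting with a space or tab continues the previous header (header folding), and it keeps only the last value when a header name repeats.

```python
def split_message(raw: str):
    """Return (headers, body) for a raw message string.

    Hypothetical helper: duplicate header names overwrite each other.
    """
    raw = raw.replace("\r\n", "\n")
    # Everything before the first empty line is the header block.
    head, _, body = raw.partition("\n\n")
    headers = {}
    name = None
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and name is not None:
            # Folded header: continuation of the previous value.
            headers[name] += " " + line.strip()
        else:
            field, sep, value = line.partition(":")
            # Skip lines that are not "name: value" (e.g. an mbox "From " line).
            if sep and field and " " not in field:
                name = field.lower()
                headers[name] = value.strip()
    return headers, body
```

Usage would be along the lines of `headers, body = split_message(raw)` followed by `headers.get("subject")` and `headers.get("from")`.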
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Email parsing is probably one of the most difficult things to do, as you're finding out. You could always run the message through formail first. formail's main job is to force mail into mailbox format, perform 'From ' escaping, generate auto-replying headers, do simple header munging/extracting, or split up a mailbox/digest/articles file. After that you can be sure it's in mbox format.
 

Red Squirrel

Actually the headers aren't an issue, since I look for the first occurrence of \n\n (end of headers) and take note of the position. Then I grab the subject and from (I need those), take a substring from the position of the first \n\n to the end, and add the subject and from back on top.

But the rest of the email is a pain. Here's an example:



Newsposters, editors,

XYZ Computing has just posted a new article=2E A post in your daily news w=
ould be great!

Title:=20
HIS X850/X800 CrossFire Edition Preview at XYZ Computing

Link:
http://www=2Exyzcomputing=2Ecom/index=2Ephp=3Foption=3Dcontent&task=3Dview=
&id=3D357&Itemid=3D26

Snip:
"The Computex conference has revealed a number of exciting developments in=
ATI's CrossFire=2E This is ATI's long-awaited and well overdue answer to =
Nvidia's SLI=2E With the technology in place the burden falls upon the sho=
ulders of the partner companies to make the products which the gamers are =
dying to get their hands on=2E This means not only CrossFire compatible mo=
therboards, but also video cards=2E Unlike Nvidia, ATI does make its one c=
ards, but the partners do play a very large role in product marketing and =
development=2E This article goes over some of the latest news about HIS's =
implementation of ATI's CrossFire system=2E"

Thanks and be sure to send your news to news@xyzcomputing=2Ecom



Best,
SAL CANGELOSO
XYZ COMPUTING






------NEWS PARSE DEBUG INFORMATION------
Title:"BTX'plained"

Link:http://www.xyzcomputing.com/index.php?option=content&task=view&id=357&Itemid=26

Quote:The Computex conference has revealed a number of exciting developments in ATI's CrossFire=2E This is ATI's long-awaited and well overdue answer to Nvidia's SLI=2E With the technology in place the burden falls upon the shoulders of the partner companies to make the products which the gamers are dying to get their hands on=2E This means not only CrossFire compatible motherboards, but also video cards=2E Unlike Nvidia, ATI does make its one cards, but the partners do play a very large role in product marketing and development=2E This article goes over some of the latest news about HIS's implementation of ATI's CrossFire system=2E

Source:www.xyzcomputing.com

Size:3622 Bytes
--- END ---



The debug info is added by my parser. This is actually an easy email compared to others, but what throws off the parser is the unnecessary return in the middle of the link. It seems to be the client or server doing that. All emails seem to have this problem; whether one parses successfully just depends on how long the link is.

The ones with multiple types (e.g. text and HTML) tend to be more iffy, and if worse comes to worst I'll have to strip those down as well. My parser strips HTML as a pre-phase step anyway. The only thing that sometimes messes up is the boundary lines, but I think my last tweak fixed that problem, hopefully.

Also, do all emails use \n returns, or do some use \r\n? If they're all different it could pose a problem.
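
For the multi-type (text plus HTML) messages, one option is to let Python's standard `email` package deal with the boundary lines rather than matching them by hand. The sketch below is an assumption about how that could be wired up, not the parser described in this thread; the function name `plain_text_body` is made up.

```python
from email import message_from_string, policy

def plain_text_body(raw: str) -> str:
    """Return the text/plain part of a message, falling back to text/html.

    Hypothetical helper built on the standard library's email package.
    """
    # policy.default gives an EmailMessage, which provides get_body().
    msg = message_from_string(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    # get_content() undoes the transfer encoding (e.g. quoted-printable)
    # and decodes the bytes using the part's declared charset.
    return part.get_content()
```

If only an HTML part exists, this returns the HTML as-is, so an HTML-stripping pass would still be needed afterwards.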
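
On the \n versus \r\n question: the mail standards specify \r\n as the line terminator in transit, but messages saved to disk by Unix tools often end up with bare \n, so a parser can see either. A small sketch that normalises everything to \n before any other rule runs (the function name `normalize_newlines` is hypothetical):

```python
def normalize_newlines(text: str) -> str:
    """Convert \\r\\n and bare \\r line endings to \\n."""
    # Handle \r\n first so it does not turn into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Running this first means rules such as "find the first \n\n" behave the same for both kinds of input.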



Edit: hey, just realized my parser did in fact apply the =\n removal rule that I set, so those bad returns get killed whenever they use =\n (most of them do). The link in the debug output above is fine, while in the original email the part ending in 26 is cut off onto the next line.
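
The =\n line endings and the =2E, =3D style sequences in the sample are what quoted-printable encoding looks like: a trailing = is a soft line break and =XX is a byte written in hex. Python's standard `quopri` module decodes both in one pass. The sketch below assumes the body really is quoted-printable (normally signalled by a `Content-Transfer-Encoding: quoted-printable` header) and that latin-1 is an acceptable charset guess; the function name `decode_qp` is made up.

```python
import quopri

def decode_qp(body: str, charset: str = "latin-1") -> str:
    """Decode a quoted-printable body: drops "=\\n" soft breaks, expands =XX.

    Hypothetical helper; the latin-1 default charset is an assumption.
    """
    raw = quopri.decodestring(body.encode(charset))
    return raw.decode(charset, errors="replace")

sample = (
    "http://www=2Exyzcomputing=2Ecom/index=2Ephp=3Foption=3Dcontent&task=3Dview=\n"
    "&id=3D357&Itemid=3D26\n"
)
print(decode_qp(sample).strip())
```

Applying this to a body that is not quoted-printable could mangle any literal =XX text, so checking the header first is safer than decoding unconditionally.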
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Some email clients will wrap messages at ~80 chars to make them more readable on text MUAs like mutt. Of course not all do, and it's usually configurable, so there's no guarantee either way. An MTA shouldn't change the message at all; if it does, it's broken. I can see some things like virus or spam scanners adding headers, but they should never change the body of the message.
 

Red Squirrel

No Lifer
May 24, 2003
68,367
12,573
126
www.anyf.ca
Things seem to be working not too badly; the results can be seen on my home page (http://www.iceteks.com). All the news posted by news-bot is parsed as mail comes in and put in a database. A cron job then regularly starts up the bot, which reads the DB, posts the articles, and clears the database. The only issue now is that some clients replace some chars with =[hex value], so as I see these, I add in a replace.
 