On Thu, Mar 04, 2004 at 02:42:02PM -0500, Gregory Hansell said:
> Well...
>
> I'm not sure what you mean. I noticed emails went out without any
> extra characters, [that you would normally see,]
> but that the "=" at the ends of lines, plus some weird character
> codes (e.g. 09, 20, etc) were in the archive, so I assumed they
> were part of the listserv archive formatting (I am frankly clueless
> with some of this). Part of the reason I made this assumption is
> that emails would come through listserv with no extra characters,
> and web interface would not display these characters in the
> archives, but they are in the archive file itself...

Yes, because they're part of the quoted-printable format of sending
email, including plain text email. See:

http://www.faqs.org/rfcs/rfc1521.html

but you probably don't really need to read the whole thing. Look for
"Quoted-Printable Content-Transfer-Encoding". Both Listserv and your
mail client know enough about this to display correctly, as it's been
part of email (you can read from the date of the RFC) for over 10
years now. (Eek, I feel old.)

> But, are you saying that the 09's, ='s, etc., are codes from the
> email clients? If so, is there anyway of preventing these codes (we
> send in Plain Text and still get them) or of stripping them out
> systemically from the archive?

Judging from the:

X-MIME-Autoconverted: from quoted-printable to 8bit by baka.acw.vcu.edu id +i24Hoah21901

(baka is my computer) header that I see in the first message you sent,
you yourself are sending like that. :-)

I'm not sure if it's bad or not that my computer is autoconverting
like that...I suspect it might be. Certainly, if it hadn't, I would
have shown you...oh, wait, I bet I have another copy on another
machine, before it gets to my workstation. Yup. I've appended what
your first email *really* looked like at the bottom. Outlook was
probably "nice" enough to do that for you because of your relatively
long lines.

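To make the quoted-printable point concrete, here is a minimal sketch of undoing the encoding. It is in Python rather than Perl, purely for illustration, and uses the standard library's quopri module. The helper name decode_qp and the sample bytes are made up for this example; the sample only imitates the kind of lines found in the archive.

```python
import quopri


def decode_qp(raw: bytes, charset: str = "us-ascii") -> str:
    """Decode quoted-printable bytes into text (hypothetical helper)."""
    # An "=" at the end of a line is a soft line break and is removed.
    # "=XX" stands for the byte with hex value XX (=3D is "=", =09 is a tab).
    return quopri.decodestring(raw).decode(charset, errors="replace")


# Made-up sample imitating raw archive lines.
sample = b"go to the Listserv =\narchive (i.e. the =3D's and =09tabs)"
print(decode_qp(sample))
```

This only handles a body that is already known to be quoted-printable. Deciding that from the headers is a separate step, and base64 or 8bit bodies would need different handling.
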
Basically, there's a difference between content type (text/plain in
this case) and content transfer encoding: quoted-printable vs 8bit vs
base64. (Base64 makes quoted-printable look normal and sane, if you're
looking at it raw; it's how stuff like graphics files are normally
sent, but it's perfectly possible to send text that way, too.)

> Basically, we grab individual postings from the archive, import them
> into SQL, and that's a major portion of our site's content. So, lots
> of weird codes and characters are a major problem.

If you have something that will display email, it ought to be able to
parse email. I suspect there's all sorts of stuff out there that will
do this correctly for you--I sure wouldn't want to write it from
scratch! I suggested MIME::Parse earlier; looks like I was thinking of
MIME::Tools:

http://search.cpan.org/~eryq/MIME-tools-5.411a/lib/MIME/Tools.pm

which might be overkill for what you need to do...

If you're writing something yourself, you might want to pull the guts
out of one of the open source mail viewers, like, oh, squirrelmail,
the mailman or majordomo archives viewer, something like that.
Assuming doing so wouldn't violate its license.

----------------------------------------
[snip most of the headers up to the MIME stuff...]
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
[snip the rest of the headers...]

Right now our website uses a PERL script to go to the Listserv =
archive/notebook
of some of our lists, grabs all new postings, and then imports the =
postings
into our SQL database, where they are then coded for certain features
and are displayed dynamically based on date and code on our homepage.
The problem is the PERL script has difficultly contending with the =
formating
of the listserv archives (i.e. the =3D's, 09's, etc).
Is there any API or =
control
that listserv or third party has available for accessing the listserv =
archives
in the way that the web interface does, reading the formating of the =
postings
in the archive? We would like to re-write the script, which never really =
worked
that great. Any help you could provide would be great.

Thanks,
Greg
=20

--
Jim Toth
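
The distinction between content type and content transfer encoding, and the advice to let an existing MIME-aware parser do the decoding, can be sketched roughly as follows. This uses Python's standard email package as a stand-in for something like MIME::Tools; it is an illustration, not a drop-in replacement for the Perl script. The message text, the RAW_MESSAGE name and the extract_text helper are all made up for the example, and the sketch assumes a single-part text message.

```python
from email import message_from_string, policy

# Made-up message imitating the headers and body style of the raw email.
RAW_MESSAGE = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "The script has trouble with the =\n"
    "formatting (i.e. the =3D's).\n"
)


def extract_text(raw_message: str) -> str:
    """Return the decoded body of a single-part text message (hypothetical helper)."""
    msg = message_from_string(raw_message, policy=policy.default)
    # get_content() undoes the transfer encoding, then applies the charset.
    return msg.get_content()


msg = message_from_string(RAW_MESSAGE, policy=policy.default)
print(msg.get_content_type())            # what the body is
print(msg["Content-Transfer-Encoding"])  # how the body was packaged for transport
print(extract_text(RAW_MESSAGE))
```

The parser reads the Content-Transfer-Encoding header itself, so the same call would also cope with a base64 or 8bit text body. Multipart messages would need the parts walked individually, which this sketch does not attempt.
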