LSTSRV-L Archives

LISTSERV Site Administrators' Forum

Jim Toth <[log in to unmask]>
Thu, 4 Mar 2004 15:42:13 -0500
On Thu, Mar 04, 2004 at 02:42:02PM -0500, Gregory Hansell
([log in to unmask]) said:
> Well...
>
> I'm not sure what you mean. I noticed emails went out without any
> extra characters,

[that you would normally see,]

> but that the "=" at the ends of lines, plus some weird character
> codes (e.g. 09, 20, etc) were in the archive, so I assumed they
> were part of the listserv archive formatting (I am frankly clueless
> with some of this). Part of the reason I made this assumption is
> that emails would come through listserv with no extra characters,
> and web interface would not display these characters in the
> archives, but they are in the archive file itself...

Yes, because they're part of the quoted-printable format of sending
email, including plain text email.

See:  http://www.faqs.org/rfcs/rfc1521.html
but you probably don't really need to read the whole thing.  Look for
"Quoted-Printable Content-Transfer-Encoding".

Both Listserv and your mail client know enough about this to display
it correctly, as it's been part of email (you can tell from the date of
the RFC) for over 10 years now.  (Eek, I feel old.)
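
To make the "=" and "=3D" business concrete, here is a minimal sketch in
Python (an assumption on my part; the script discussed in this thread is
Perl) using the standard-library quopri module. The sample bytes are made up
to resemble the raw archive text; they are not copied from a real notebook
file.

```python
import quopri

# Hypothetical fragment in the style of the raw archive text.
# A trailing "=" is a soft line break (the line continues),
# "=3D" is an encoded "=", and "=09" is an encoded tab.
raw = (
    b"the =3D's, 09's, etc). Is there any API or =\n"
    b"control\n"
    b"col1=09col2\n"
)

# decodestring() undoes the quoted-printable transfer encoding.
decoded = quopri.decodestring(raw).decode("us-ascii")
print(decoded)
```

The soft line break disappears, so "API or" and "control" end up on the same
line again, which is what the web interface and your mail client are doing
behind the scenes.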

> But, are you saying that the 09's, ='s, etc., are codes from the
> email clients?  If so, is there any way of preventing these codes (we
> send in Plain Text and still get them) or of stripping them out
> systematically from the archive?

Judging from the:

  X-MIME-Autoconverted: from quoted-printable to 8bit by baka.acw.vcu.edu id +i24Hoah21901

(baka is my computer) header that I see in the first message you sent,
you yourself are sending like that.  :-)  I'm not sure if it's bad or
not that my computer is autoconverting like that...I suspect it might
be.  Certainly, if it hadn't, I would have shown you...oh, wait, I bet
I have another copy on another machine, before it gets to my
workstation.  Yup.  I've appended what your first email *really*
looked like at the bottom.  Outlook was probably "nice" enough to do
that for you because of your relatively long lines.

Basically, there's a difference between content type (text/plain in
this case) and content transfer encoding (quoted-printable vs 8bit vs
base64).  Raw base64 makes quoted-printable look normal and sane; it's
how stuff like graphics files are normally sent, but it's perfectly
possible to send text that way, too.
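
As a rough illustration of that distinction, the sketch below (Python,
standard library only; the helper name make_raw and the sample body are
invented for this example) wraps the same text/plain body in two different
transfer encodings. The content type stays the same, the raw bytes look
completely different, and the decoded text comes out identical.

```python
import base64
import quopri
from email import message_from_bytes

# Hypothetical body text, used only for this illustration.
body = b"Plain text with an equals sign = in it.\n"

def make_raw(cte_name, encoded_body):
    """Build a minimal raw message with the given transfer encoding."""
    headers = (
        b"MIME-Version: 1.0\n"
        b'Content-Type: text/plain; charset="us-ascii"\n'
        b"Content-Transfer-Encoding: " + cte_name + b"\n"
        b"\n"
    )
    return headers + encoded_body

raw_qp = make_raw(b"quoted-printable", quopri.encodestring(body))
raw_b64 = make_raw(b"base64", base64.encodebytes(body))

for raw in (raw_qp, raw_b64):
    msg = message_from_bytes(raw)
    # Same content type, different Content-Transfer-Encoding header.
    print(msg.get_content_type(), "|", msg["Content-Transfer-Encoding"])
    # What is actually stored on disk or sent over the wire.
    print(raw.split(b"\n\n", 1)[1])
    # What a MIME-aware reader shows after undoing the encoding.
    print(msg.get_payload(decode=True))
```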

> Basically, we grab individual postings from the archive, import them
> into SQL, and that's a major portion of our site's content. So, lots
> of weird codes and characters are a major problem.

If you have something that will display email, it ought to be able to
parse email.  I suspect there's all sorts of stuff out there that will
do this correctly for you--I sure wouldn't want to write it from
scratch!

I suggested MIME::Parse earlier, but it looks like I was thinking of
MIME::Tools:

  http://search.cpan.org/~eryq/MIME-tools-5.411a/lib/MIME/Tools.pm

which might be overkill for what you need to do...

If you're writing something yourself, you might want to pull the guts
out of one of the open source mail viewers, like, oh squirrelmail, the
mailman or majordomo archives viewer, something like that.  Assuming
doing so wouldn't violate its license.
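
For a sense of how little code "let a library parse it" can take, here is a
hedged sketch using Python's standard email package rather than MIME::Tools
(a swap on my part; a Perl version would look different). The function name
extract_text and the sample message are hypothetical, and splitting a real
Listserv notebook file into individual postings is not covered here.

```python
from email import message_from_string, policy

# Hypothetical single posting, shaped like the raw message appended below.
raw_message = """\
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

The problem is the PERL script has difficultly contending with the =
formating of
the listserv archives (i.e. the =3D's, 09's, etc).
"""

def extract_text(raw):
    """Return the decoded plain-text body of one raw message, or ''."""
    msg = message_from_string(raw, policy=policy.default)
    # get_body() picks the text/plain part, also for multipart messages.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    # get_content() undoes the transfer encoding and the charset.
    return part.get_content()

text = extract_text(raw_message)
print(text)
```

The string that comes back has no soft line breaks or "=3D" sequences left,
so it is the sort of thing you could hand to the SQL import step.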

----------------------------------------
[snip most of the headers up to the MIME stuff...]
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
[snip the rest of the headers...]

Right now our website uses a PERL script to go to the Listserv =
archive/notebook
of some of our lists, grabs all new postings, and then imports the =
postings into
our SQL database, where they are then coded for certain features and are
displayed dynamically based on date and code on our homepage.

The problem is the PERL script has difficultly contending with the =
formating of
the listserv archives (i.e. the =3D's, 09's, etc). Is there any API or =
control
that listserv or third party has available for accessing the listserv =
archives
in the way that the web interface does, reading the formating of the =
postings in
the archive? We would like to re-write the script, which never really =
worked
that great.

Any help you could provide would be great.

Thanks,
Greg


=20


--
Jim Toth
[log in to unmask]
