The manual itself says[0]:
> Often when I read the manual, I think that we should take a collection up to have Lars psycho-analysed.
0: https://www.gnu.org/software/emacs/manual/html_mono/gnus.htm...
Gnus was absolutely delightful back in the day. I moved on around the time I had to start writing non-plaintext emails for work reasons. It's also handy to be using the same general email apps and systems as 99.99% of the rest of the world. I still have a soft spot in my heart for it.
PS: Also, I have no idea whatsoever why someone would downvote you for that. Weird.
I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454
I know it's a hassle for a platform to separate good rants from bad ones, and I decry SO for pushing too hard against them. I truly believe that our industry would benefit from more drunken technical rants.
It also comes from a time in Internet culture when humor was appreciated instead of aggressively downvoted.
This is also the reason why I consider the lack of images in IRC a feature.
The guy (in my reading) appears to be talking about matching an entire HTML document with a regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.
What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.
So yes, while it is an inspired piece of comedic genius as rants go, and sort of informative in that it opens your eyes to the limitations of regexes, it sort of brushes under the rug all the places where those poor maligned regular expressions will be used when parsing HTML.
For example, this is perfectly valid XHTML:
<a href="/" title="<a /> />"></a> <a href="/" title="<a /> />"></a> <!-- Don't count <hr> this! --> but do count <hr> this -->
and <!-- <!-- Ignore <ht> this --> but do count <hr> this -->
Now your regex has to include balanced comment markers. Solve that.
You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.
Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.
I don't suggest writing generic HTML parsers that work with any site, but for custom crawlers they work great.
Not to say that the tools available are the same now as 20 years ago. Today I would probably use puppeteer or some similar tool and query the DOM instead.
So extracting information from this text with regexps often makes perfect sense.
A scraper is already resigned to being brittle and weird. You’re relying not only on the syntax of the data, but an implicit structure beyond that. This structure is unspecified and may change without notice, so whatever robustness you can achieve will come from being loose with what you accept and trying to guess what changes might be made on the other end. Regex is a decent tool for that.
It's a very bad answer. First of all, processing HTML with regex can be perfectly acceptable depending on what you're trying to do. Yes, this doesn't include full-blown "parsing" of arbitrary HTML, but there are plenty of ways in which you might want to process or transform HTML that either don't require producing a parse tree, don't require perfect accuracy, or are operating on HTML whose structure is constrained and known in advance. Second, it doesn't even attempt to explain to OP why parsing arbitrary HTML with regex is impossible or poorly-advised.
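To make that concrete (a hypothetical sketch, not anything from the original question): if you're scraping a page whose markup you've already inspected and that follows a fixed pattern, a regex is often all you need:

  import re

  # Hypothetical snippet with a known, fixed structure.
  html = '<tr><td class="name">Widget</td><td class="price">4.99</td></tr>'

  # Fine when the markup is constrained and known in advance;
  # this is deliberately not a general HTML parser.
  prices = re.findall(r'<td class="price">([^<]*)</td>', html)
  print(prices)  # ['4.99']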
The OP didn't want his post to be taken over by someone hamming it up with an attempt at creative writing. He wanted a useful answer. Yes, this answer is "quirky" and "whimsical" and "fun" but I read those as euphemisms for "trying to conscript unwilling victims into your personal sense of nerd-humor".
I parse HTML that I produce myself, in a context where I fully control the output. That works fine, but parsing other people’s HTML is a lesson in humility. I’ve done that too, but only as a one-time thing: I parsed a snapshot from a specific point in time and refused to change it afterwards.
Like, it’s not a matter of cleverness, either. You can’t code around it. It’s simply not possible.
Why do mail servers care about how long a line is? Why don't they just let the client reading the mail worry about wrapping the lines?
The server needs to parse the message headers, so it can't be an opaque blob. If the client uses IMAP, the server needs to fully parse the message. The only alternative is POP3, where the client downloads all messages as blobs and you can only read your email from one location, which made sense in the year 2000 but not now when everyone has several devices.
POP3 is line-based too, anyway. Maybe you can rsync your maildir?
I use imap on my mobile device, but that’s mostly for recent emails until I get to my computer. Then it’s downloaded and deleted from the server.
IMAP is an interactive protocol that is closer to the interaction between Gmail frontend and backend. It does many things. The client implements a local view of a central source of truth.
I don't have an IMAP account available to check, but AFAIK, you should not have locally the content of any message you've never read before. The whole point of IMAP is that it doesn't download messages, but instead acts like a window into the server.
Given a mechanism for soft line breaks, breaking already at below 80 characters would increase compatibility with older mail software and be more convenient when listing the raw email in a terminal.
This is also why MIME Base64 typically inserts line breaks after 76 characters.
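For anyone curious, Python's standard library shows both behaviours (a quick illustration only):

  import base64

  data = bytes(200)                      # 200 arbitrary bytes
  wrapped = base64.encodebytes(data)     # MIME-style: newline after every 76 output characters
  flat = base64.b64encode(data)          # plain Base64: no line breaks at all
  print(max(len(line) for line in wrapped.splitlines()))  # 76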
For example, the PDP-11 (early 1970s), which was shared among dozens of concurrent users, had 512 kilobytes of RAM. The VAX-11 (late 1970s) might have as much as 2 megabytes.
Programmers were literally counting bytes to write programs.
https://en.wikipedia.org/wiki/BITNET
BITNET connected mainframes, had gateways to the Unix world and was still active in the 90s. And limited line lengths … some may remember SYSIN DD DATA … oh my goodness …
https://www.ibm.com/docs/en/zos/2.1.0?topic=execution-systsi...
telnet smtp.mailserver.com 25
HELO example.com
MAIL FROM: me@foo.com
RCPT TO: you@bar.com
DATA
blah blah blah
how's it going?
talk to you later!
.
QUIT
openssl s_client -connect smtp.mailserver.com:smtps -crlf
220 smtp.mailserver.com ESMTP Postfix (Debian/GNU)
EHLO example.com
250-smtp.mailserver.com
250-PIPELINING
250-SIZE 10240000
250-VRFY
250-ETRN
250-AUTH PLAIN LOGIN
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-DSN
250-SMTPUTF8
250 CHUNKING
MAIL FROM:me@example.com
250 2.1.0 Ok
RCPT TO:postmaster
250 2.1.5 Ok
DATA
354 End data with <CR><LF>.<CR><LF>
Hi
.
250 2.0.0 Ok: queued as BADA579CCB
QUIT
221 2.0.0 Bye
If you were typing into a feedback form powered by something from Matt’s Script Archive, there was about a 95% chance you could trivially get it to send out multiple emails to other parties for every one email sent to the site’s owner.
Edit: wrong.
However, what most mail programs show as sender and recipient is neither of those; they show the From and To headers contained in the message itself.
I suspect this is relevant because Quoted-Printable was only a useful encoding for MIME types like text and HTML (the human-readable email body), not binary (e.g. attachments, images, videos). Mail servers (if they want) can effectively treat the binary types as an opaque blob, while the text types can be read for more efficient transfer of message listings to the client.
Wake up, everyone! Brand new sentence just dropped!
For instance, consider FTP’s text mode, which was primarily a way to accidentally corrupt your download when you forgot to type “bin” first, but was also handy for getting human readable files from one incompatible system to another.
As to the other bits, I think even in the uucp era, email was mostly internal, by volume of mail sent, even though you could clearly talk to remote sites if everything was set up correctly. It was capable of being a worldwide communication system. I bet the local admins responsible for monitoring the telephone bill preferred to keep that in check, though.
> For some reason or other, people have been posting a lot of excerpts from old emails on Twitter over the last few days.
At the risk of having missed the latest meme or social media drama: does anyone know what this "some reason or other" is?
Edit: Question answered.
But not everybody has every single global development or news event IV'd into their veins. Many of us just don’t keep up with global news closely enough to be aware of an event that happened in the last 3 days.
Important news tends to get to me eventually. And there is usually nothing I can do about something personally anyway (at least within a short time horizon), so there is really very little value in trying to stay informed of the absolute latest developments. The signal to noise ratio is far too low, and it also induces a bunch of unnecessary anxiety and stress.
So yes, believe it or not very many people are unaware of this.
It never got too popular, but I had users for a few years and I can honestly say MIME was the bane of my life for most of those years.
I think there is a second possible conclusion, which is that the transformation happened historically. Everyone assumes these emails are an exact dump from Gmail, but isn't it possible that Epstein was syncing emails from Gmail to a third party mail server?
Since the Stackoverflow post details the exact situation in 2011, I think we should be open to the idea that we're seeing data collected from a secondary mail server, not Gmail directly.
Do we have anything to discount this?
(If I'm not mistaken, I think you can also see the "=" issue simply by applying the Quoted-Printable encoding twice, not just by mishandling the line-endings, which also makes me think two mail servers. It also explains why the "=" symbol is retained.)
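That hypothesis is easy to play with using Python's quopri module (a sketch only; the quotetabs flag here just stands in for the trailing-space rule that produces =20 in real mail):

  import quopri

  body = b"Thanks so much!! Talk soon.\n"
  once = quopri.encodestring(body, quotetabs=True)
  print(once)    # spaces become =20: b'Thanks=20so=20much!!=20Talk=20soon.\n'
  twice = quopri.encodestring(once, quotetabs=True)
  print(twice)   # every '=' from the first pass is re-encoded, so =20 turns into =3D20

Running quopri.decodestring twice gets the original back, which is exactly the step that apparently never happened to the published text.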
Did the site get the HN kiss of death?
I, too, was reading about the new Epstein files, wondering what text artifact was causing things to look like that.
https://nitter.net/AFpost/status/2017415163763429779?s=201
Something clearly went wrong in the process.
I'm glad to know the real reason!
I wonder why even have a max line length limit in the first place? I.e. is this for a technical reason or just display related?
I wonder if the person who had the idea of virtualizing the typewriter carriage knew how much trouble they would cause over time.
It would’ve been far less messy to make printers process linefeed like \n acts today, and omit the redundant CR. Then you could still use CR for those overstrike purposes but have a 1-byte universal newline character, which we almost finally have today now that Windows mostly stopped resisting the inevitable.
Now, if you want to use CR by itself for fancy overstriking etc. you'd need to put something else into the character stream, like a space followed by a backspace, just to kill time.
In any event, wouldn't you have to either buffer or use flow-control to pause receiving while a CR was being processed? You wouldn't want to start printing the next line's characters in reverse while the carriage was going back to the beginning.
My suspicion is there was a committee that was more bent on purity than practicality that day, and they were opposed to the idea of having CR for "go to column 0" and newline for "go to column 0 and also advance the paper", even though it seems extremely unlikely you'd ever want "advance the paper without going to column 0" (which you could still emulate with newline + tab or newline + 43 spaces for those exceptional cases).
If you look at the schematics for an ASR-33, there's just 2 transistors in the whole thing (https://drive.google.com/file/d/1acB3nhXU1Bb7YhQZcCb5jBA8cer...). Even the serial decoding is done electromechanically (per https://www.pdp8online.com/asr33/asr33.shtml), and the only "flow control" was that if you sent XON, the teletype would start the paper tape reader -- there was no way, as far as I can tell, for the teletype to ask the sender to pause while it processes a CR.
These things ran at 110 baud. If you can't do flow control, your only option if CR takes more than 1/10th of a second is to buffer... but if you can't do flow control, and the computer continues to send you stuff at 110 baud, you can't get that buffer emptied until the computer stops sending, so each subsequent CR will fill your buffer just a little bit more until you're screwed. You need the character following CR (which presumably takes about 2/10ths of a second) to be a non-printing character... so splitting out LF as its own thing gives you that and allows for the occasional case where doing a linefeed without a carriage return is desirable.
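Back-of-the-envelope, assuming I have the ASR-33 framing right (1 start bit + 7 data bits + 1 parity bit + 2 stop bits = 11 bits per character): 110 baud / 11 bits per character ≈ 10 characters per second, i.e. about 100 ms per character on the wire. Sending LF after CR therefore gives the carriage roughly 100 ms to fly back, and CR + LF + a padding NUL gives it about 200 ms.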
Curious Marc (https://www.curiousmarc.com/mechanical/teletype-asr-33) built a current loop adapter for his ASR-33, and you'll note that one of the features is "Pin #32: Send extra NUL character after CR (helps to not loose first char of new line)" -- so I'd guess that on his old and probably worn-out machine, even sending LF after CR doesn't buy enough time and the next character sometimes gets "lost" unless you send a filler NUL.
Now, I haven't really used serial communications in anger for over a decade, and I've never used a printing terminal, so somebody with actual experience is welcome to come in and tell me I'm wrong.
I've been trying to get Visual Studio to stop mucking with line endings and encodings for years. I've searched and set all the relevant settings I could find, including using a .editorconfig file, but it refuses to be consistent. Someone please tell me I'm wrong and there's a way to force LF and UTF-8 no-BOM for all files all the time. I can't believe how much time I waste on this, mainly so diffs are clean.
How far can you get with setting core.autocrlf on your machine? See https://git-scm.com/book/en/v2/Customizing-Git-Git-Configura...
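One thing worth trying (no promises that Visual Studio honours it everywhere) is to pin the policy in the repo itself rather than per machine: a .gitattributes containing

  * text=auto eol=lf

plus an .editorconfig along the lines of

  [*]
  end_of_line = lf
  charset = utf-8

so every clone normalises to LF and BOM-less UTF-8 regardless of anyone's local core.autocrlf.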
Edit: yes I think that's most likely what it is (and it's SHOULD 78ch; MUST 998ch) - I was forgetting that it also specifies the CRLF usage, it's not (necessarily) related to Windows at all here as described in TFA.
Here it is in my 'notmuch-more' email lib: https://github.com/OJFord/amail/blob/8904c91de6dfb5cba2b279f...
The article doesn't claim that it's Windows related. The article is very clear in explaining that the spec requires =CRLF (3 characters), then mentions (in passing) that CRLF is the typical line ending on Windows, then speculates that someone replaced the two characters CRLF with a one-character newline, as on Unix or other OSs.
It's just so hacky, I can't believe it's a real-life solution.
Consider converting the original text (maintaining the author’s original line wrapping and indentation) to base64. Has anything been “inserted” into the text? I would suggest not. It has been encoded.
Now consider an encoding that leaves most of the text readable, translates some things based on a line length limit, and some other things based on transport limitations (e.g. passing through 7-bit systems.) As long as one follows the correct decoding rules, the original will remain intact - nothing “inserted.” The problem is someone just knowledgeable enough to be aware that email is human readable but not aware of the proper decoding has attempted to “clean up” the email for sharing.
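That decoding step is a one-liner if you have Python handy (illustrative only; the sample line is made up):

  import quopri

  raw = b"Thanks so much!!=20=\nTalk soon"   # '=20' is an encoded space, '=' + newline is a soft line break
  print(quopri.decodestring(raw))            # b'Thanks so much!! Talk soon'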
Infinite line length = infinite buffer. Even worse, QP is 7-bit (because SMTP started out ASCII only), so every byte above 127 gets encoded as three bytes (an equals sign, then two hex digits), so 500 bytes of non-ASCII UTF-8 turn into 1500 bytes on the wire.
It all made sense at the time. Not so much these days when 7-bit pipes only exist because they always have.
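To put rough numbers on that with Python's quopri (exact totals shift a little because of the soft line breaks it inserts):

  import quopri

  line = b"\xe9" * 500                 # 500 bytes, all above 127 (Latin-1 e-acutes, say)
  encoded = quopri.encodestring(line)
  print(len(line), len(encoded))       # roughly 500 -> 1500, plus the '=' + newline soft breaks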
But I agree with the sibling comment: it makes more sense when it's called "encoding" instead of "inserting chars into original stream".
Digital communication is based on the postmen reading, transcribing and copying your letters. There is a reason why digital communication is treated differently than letters by the law, and why the legally mandated secrecy for letters doesn't apply to emails.
cat title | sed 's/anyway/in email/'
would save a click for those already familiar with =20 etc.
On a side note: There are actually products marketed as kosher bacon (it's usually beef or turkey). And secular Jews frequently make jokes like this about our kosher bros who aren't allowed to eat the real stuff for some dumb reason like it has too many toes.
That said, there is a _possibly_ kosher pig: https://en.wikipedia.org/wiki/Babirusa#Relationship_with_hum...
Yeah clearly you guys are the biggest victims in all this... get in there and make it about you!
We’ve become so accustomed to modern libraries handling encoding transparently that when raw data surfaces (like in these dumps), we often lack the 'Digital Archeology' skills to recognize basic Quoted-Printable.
These artifacts (=20, =3D) are effectively fossils of the transport layer. It’s a stark reminder that underneath our modern AI/React/JSON world, the internet is still largely held together by 7-bit ASCII constraints and protocols from the 1980s.
Geezus...
The writer presumably knows that umlauts and other non-ascii characters are functional in many languages. "rock döts" is poking fun at the trend in a certain tranche of anglophone rock/metal to use them in a purely aesthetic way in band names etc.
Back in those days optical scanners were still used.