this post was submitted on 13 Dec 2024

272 points (98.9% liked)

SingleFile: Web Extension for saving a faithful copy of a complete web page in a single HTML file (github.com)

submitted 1 week ago by Zerush@lemmy.ml to c/opensource@lemmy.ml

[–] Sterile_Technique@lemmy.world 65 points 1 week ago* (last edited 1 week ago) (1 children)

Been using this in nursing school - a lot of our content is done on horribly designed websites, and it's pretty common to hit submit and...... Naw it didn't take: all your shit just disappeared.

So, save with single file first, then submit, and if it fucks up then I've got a backup to copy and paste a replacement out of.

[–] Tehdastehdas@lemmy.world 4 points 1 week ago

Clearing the whole form when submitting should be defined as a crime, punishable by fines.

[–] cygnus@lemmy.ca 44 points 1 week ago* (last edited 1 week ago)

Whoa. If anyone else is wondering like I was how the hell this works, it's basically a ZIP file that appears as HTML: https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG
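To make the "ZIP that appears as HTML" idea concrete, here is a minimal sketch of the general polyglot trick, not the project's actual format (the linked repository does considerably more). It relies on one known property of ZIP: readers locate the archive's index from the end of the file, so bytes placed in front of the archive are tolerated. The helper names `make_polyglot` and `read_member` are made up for this illustration.

```python
import io
import zipfile


def make_polyglot(html_prefix: bytes, files: dict[str, bytes]) -> bytes:
    """Build a ZIP archive in memory and put HTML bytes in front of it.

    Hypothetical helper for illustration only; SingleFile's real
    output format is more elaborate than this.
    """
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    # The file now *starts* like an HTML document and *ends* like a ZIP.
    return html_prefix + archive.getvalue()


def read_member(polyglot: bytes, name: str) -> bytes:
    """Read one archived file back out, ignoring the HTML in front."""
    with zipfile.ZipFile(io.BytesIO(polyglot)) as zf:
        return zf.read(name)


if __name__ == "__main__":
    data = make_polyglot(
        b"<!DOCTYPE html><p>hi</p>\n",
        {"index.html": b"<h1>saved page</h1>"},
    )
    print(data[:15])                        # the HTML start of the file
    print(read_member(data, "index.html"))  # the archived page
```

The same bytes therefore pass for both formats: a tool that sniffs the start of the file sees HTML, while a ZIP reader scanning from the end finds a valid archive.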

[–] GenderNeutralBro@lemmy.sdf.org 23 points 1 week ago (2 children)

This is probably the best solution I've found so far.

Unfortunately, even this is no match for the user-hostile design of, say, Microsoft Copilot, because it hides content that is scrolled off screen so it's invisible in the output. That's no fault of this extension. It actually DOES capture the data. It's not the extension's fault that the web site intentionally obscures itself. Funnily enough, if I open the resulting html file in Lynx, I can read the hidden text, no problem. LOL.
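For anyone who wants the hidden text without opening Lynx, here is a rough sketch of the same idea: pull every text node out of the saved HTML and ignore styling entirely, so content hidden by CSS still shows up. It uses only Python's standard `html.parser`; the `TextExtractor` class and `extract_text` function are names invented for this example, and real pages may need more care (entities, whitespace, text injected by scripts).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect all text nodes, skipping only <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # CSS such as display:none is never evaluated here, so text in
        # visually hidden elements is collected like any other text.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    saved = '<div style="display:none">hidden reply</div><p>visible</p>'
    print(extract_text(saved))
```

In practice you would read the file SingleFile saved and pass its contents to `extract_text`.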

[–] Deebster@lemmy.ml 16 points 1 week ago

I was on a site that did that and was confused why my text search wasn't finding much. Thanks devs for breaking basic browser features.

[–] bruce965@lemmy.ml 6 points 1 week ago (1 children)

Actually that might not have been done to deliberately disrupt your flow. Culling elements that are outside of the viewport is a technique used to reduce the amount of memory the browser consumes.

[–] Tehdastehdas@lemmy.world 2 points 1 week ago (1 children)

...which should be used only when the browser is running out of memory.

[–] bruce965@lemmy.ml 1 points 1 week ago

Well... that would make sense. But it's much, much easier to just do it preemptively. The browser APIs to check how much memory is available are quite limited afaik. Also, if there are too many elements, the browser will have to do more work when interacting with the page (i.e. on every rendered frame), thus wasting slightly more power and in extreme cases even lagging.

For what it's worth, I, as a web developer, have done it too on a couple of occasions (in my case it was absolutely necessary when working with a 10K × 10K table, way above what a browser is designed to handle).
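The arithmetic behind this kind of culling is small. Below is a sketch under simplifying assumptions (fixed row height, vertical scrolling only); the function name `visible_range` and the `overscan` parameter are illustrative, not taken from any particular library. Only rows inside the returned range would be kept in the DOM, which is also why text search and page savers cannot see the rest.

```python
def visible_range(scroll_top: int, viewport_height: int, row_height: int,
                  total_rows: int, overscan: int = 5) -> tuple[int, int]:
    """Return the half-open range [first, last) of rows worth rendering.

    Assumes every row has the same height in pixels. `overscan` adds a
    few extra rows above and below so fast scrolling shows no gaps.
    """
    first = scroll_top // row_height - overscan
    # Ceiling division: the row that the bottom edge of the viewport touches.
    last = -(-(scroll_top + viewport_height) // row_height) + overscan
    return max(0, first), min(total_rows, last)


if __name__ == "__main__":
    # 10,000 rows of 20 px each, 600 px viewport, scrolled down 5,000 px.
    first, last = visible_range(5000, 600, 20, 10_000)
    print(first, last, last - first)  # 245 285 40
```

With these numbers only 40 of the 10,000 rows exist as elements at any one time.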

[–] davel@lemmy.ml 10 points 1 week ago (1 children)

I’m a big fan given that MHTML & MAFF were abandoned and WebArchive is Safari-only.

[–] blackbeards_bounty@lemmy.dbzer0.com 3 points 1 week ago

Very sad to learn these were abandoned

[–] N0x0n@lemmy.ml 1 points 1 week ago (1 children)

Not a single self-hosted read-it-later app uses SingleFile in its backend. I wonder why... SingleFile works flawlessly on every site!

The only one that works similarly is linking.

[–] n0x0n@feddit.org 3 points 1 week ago* (last edited 1 week ago) (1 children)

Hey, my nick! I’m not 100% sure but readeck saves into a single file.

[–] N0x0n@lemmy.ml 1 points 1 week ago (1 children)

Haha kinda scary !! XD

Yeah, I tried readeck, and while it does better than the others, it still doesn't use SingleFile (or I missed some configuration?)

I do like it because it even transcribes YouTube videos and that's very neat! However, I work a lot on superuser, stack*, ask* pages, which aren't properly scraped and rendered with their comments.

The only perfect working solution I found was singlefile + Zotero.

[–] n0x0n@feddit.org 1 points 1 week ago

OMG, are you me? I’m also a heavy stack* and ask* user, that’s so funny!

I’m not sure if it really uses SingleFile. I’m a relatively new user and, while looking around the system, saw that it saves its stuff in a single … what was it … gz or zip or tgz. But I'm not sure if it just scrapes everything and puts it into an archive file.

If you use the browser extension, it saves the page as rendered (at least it should). If not, open a bug report, the maintainer seems quite helpful.

You can also somehow script extractors I believe, so it should be possible to correct saving ask* and stack* pages.