I'm still mad there's no straightforward way to convert a PDF into semantic HTML. There's plenty of tools to convert it into HTML that looks the same with pages and such, but I just want the content.

[–] AnimalsDream@slrpnk.net 6 points 12 hours ago (1 children)

Would it work to convert it to a simpler intermediate format like RTF or TXT, and then convert that into HTML? Why HTML anyway? Isn't EPUB more appropriate?
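For what it's worth, the second half of that idea is the easy part. A minimal sketch, assuming the PDF has already been dumped to plain text and that paragraphs are separated by blank lines (a big assumption; real extractor output is often messier). The name `text_to_html` is made up for illustration:

```python
import html
import re


def text_to_html(text: str) -> str:
    """Wrap blank-line-separated paragraphs of plain text in <p> tags."""
    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    parts = []
    for para in paragraphs:
        # Rejoin hard-wrapped lines so each paragraph is a single line.
        joined = " ".join(line.strip() for line in para.splitlines() if line.strip())
        if joined:
            # Escape <, > and & so the text cannot break the markup.
            parts.append(f"<p>{html.escape(joined)}</p>")
    return "\n".join(parts)
```

This only recovers paragraphs. Headings, lists and tables would need heuristics that plain text does not carry.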

[–] JackbyDev@programming.dev 5 points 10 hours ago (1 children)

I just hate two-column paginated layouts. Give me pageless single-column text.

[–] AnimalsDream@slrpnk.net 3 points 8 hours ago

Yeah I get that. I've just gotten used to leaving pdfs the way they are, and choosing to read them on more appropriate devices like laptops or tablets.

[–] WhatYouNeed@lemmy.world 2 points 10 hours ago

Sounds like a good opportunity for a crowdfunded startup.

[–] GroundedGator@lemmy.world 12 points 14 hours ago (1 children)

We absolutely should have more specialized LLMs. That being said, we have dozens of tools that convert documents and data. Also, any engineer worth a nickel should be able to whip something up in an hour or so for most cases.

[–] ByteOnBikes@slrpnk.net 15 points 13 hours ago (1 children)

But I want AI to convert my mp3s to Oggs and vice versa 😭😭😭

Not some stupid "conversion library" or whatever that is

[–] GroundedGator@lemmy.world 4 points 12 hours ago

Oooh do you think that would work on my 96kbps mp3s?

[–] Gsus4@mander.xyz 18 points 22 hours ago* (last edited 15 hours ago) (11 children)

Same for music tools like Suno. I don't need to remix and hallucinate new fusion music, I just need a really good way to effectively search/discover all music that already exists in one place.

[–] SkybreakerEngineer@lemmy.world 210 points 1 day ago (13 children)

You can't parse [X]HTML with LLM. Because HTML can't be parsed by LLM. LLM is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of LLM will not allow you to consume HTML. LLM are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by LLM. LLM queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular LLM as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by LLM. Even Jon Skeet cannot parse HTML using LLM. Every time you attempt to parse HTML with LLM, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with LLM summons tainted souls into the realm of the living. HTML and LLM go together like love, marriage, and ritual infanticide. The cannot hold it is too late. The force of LLM and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with LLM you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-LLM will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
LLM-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures LLM will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using LLM to parse HTML has doomed humanity to an eternity of dread torture and security holes using LLM as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of LLM parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy LLM-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

[–] LiveLM@lemmy.zip 2 points 5 hours ago

HTML and regex go together like love, marriage, and ritual infanticide

I'm dying lmao

[–] jonne@infosec.pub 60 points 1 day ago (13 children)

So I guess it's only us Millennials who know how to convert a PDF properly, and we're just sandwiched in between boomers and Gen Z finding the most ridiculous ways to try to accomplish that task.

[–] bitchkat@lemmy.world 2 points 7 hours ago

I just look for a command named ${src}2${dest} like pdf2html
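Roughly that lookup as a sketch. The function names here are made up, and whether any given `${src}2${dest}` tool is actually installed is anyone's guess; this only checks the PATH:

```python
import shutil
from typing import Optional


def converter_name(src: str, dest: str) -> str:
    """Build the conventional '<src>2<dest>' command name, e.g. 'pdf2html'."""
    return f"{src}2{dest}"


def find_converter(src: str, dest: str) -> Optional[str]:
    """Return the full path of '<src>2<dest>' if it is on PATH, else None."""
    return shutil.which(converter_name(src, dest))
```

`find_converter("pdf", "html")` returns a path only if something with that name happens to be installed.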

[–] dubyakay@lemmy.ca 18 points 21 hours ago (3 children)

As always, GenX just forgotten.

[–] RagingRobot@lemmy.world 7 points 11 hours ago

Who?

[–] shawn1122@lemm.ee 3 points 13 hours ago

Gen X is a mixed bag when it comes to this, but most I've come across are more like Boomers than Millennials here.

[–] jonne@infosec.pub 8 points 21 hours ago

Yeah, I didn't want to shit on you guys. Most of you are all right.

[–] AppleTea@lemmy.zip 13 points 21 hours ago (2 children)

Ctrl a

Ctrl c

alt tab

Ctrl v

Ctrl s

conversion complete

[–] Eyekaytee@aussie.zone 17 points 20 hours ago

real men find a random pdf converter website on the internets and upload our secret documents there

[–] lime@feddit.nu 14 points 20 hours ago* (last edited 10 hours ago) (4 children)

$ pandoc doc.pdf -o doc.txt

Edit: welp, pandoc can't do that. pdftotext it is.
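If you want that in a script, here is a minimal sketch that shells out to `pdftotext`, assuming it is installed and on PATH (it usually ships with poppler-utils). The helper names are made up for illustration:

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path: str, txt_path: str) -> list:
    """Build the argument list for: pdftotext <pdf> <txt>."""
    return ["pdftotext", pdf_path, txt_path]


def pdf_to_text(pdf_path: str, txt_path: str) -> None:
    """Run pdftotext on pdf_path and write the extracted text to txt_path."""
    # Fail early with a clear message if the tool is missing.
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    # check=True raises CalledProcessError if pdftotext exits non-zero.
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)
```

Usage would be `pdf_to_text("doc.pdf", "doc.txt")`. How well the output reads still depends on how the PDF was laid out.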

[–] mexicancartel@lemmy.dbzer0.com 2 points 13 hours ago* (last edited 13 hours ago) (1 children)

magick file.jpg file.html

ImageMagick be converting anything into anything. (Actually, in this case it makes an HTML file and a PNG file; the HTML file references the PNG and the page just displays it.)

[–] lime@feddit.nu 2 points 11 hours ago

not really a good way to get the text out of a pdf though. then again, turns out neither is pandoc.
