I'm still mad there's no straightforward way to convert a PDF into semantic HTML. There are plenty of tools that convert it into HTML that looks the same, with pages and such, but I just want the content.

[–] AnimalsDream@slrpnk.net 6 points 1 day ago (1 children)

Would it work to convert it to a simpler intermediate format like RTF or TXT, and then convert that into HTML? Why HTML anyway? Isn't EPUB more appropriate?
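For what it's worth, the txt-to-html half of that idea could look roughly like the sketch below. It assumes the intermediate file is plain text with blank lines between paragraphs and form feed characters at page breaks (which is what pdftotext emits by default). `text_to_html` is a made-up name, and the hyphen rejoining is naive: it will also glue together real compounds that happen to break at a line end.

```python
import html
import re


def text_to_html(text: str) -> str:
    """Turn plain text into a pageless run of <p> elements (hypothetical helper)."""
    # Treat form feeds (page breaks) as ordinary paragraph breaks.
    text = text.replace("\f", "\n\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Rejoin words hyphenated across a line break (naive heuristic).
        block = re.sub(r"(\w)-\n(\w)", r"\1\2", block.strip())
        # Unwrap hard line breaks and collapse runs of whitespace.
        block = " ".join(block.split())
        if block:
            paragraphs.append(f"<p>{html.escape(block)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(text_to_html("Hello wor-\nld\n\nSecond & last\fThird"))
```

A paragraph that straddles a page break still gets split in two, so this only gets you "pageless" in the loose sense.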

[–] JackbyDev@programming.dev 5 points 1 day ago (1 children)

I just hate two-column paginated layouts. Give me pageless single-column text.

[–] AnimalsDream@slrpnk.net 3 points 1 day ago

Yeah, I get that. I've just gotten used to leaving PDFs the way they are, and choosing to read them on more appropriate devices like laptops or tablets.

[–] WhatYouNeed@lemmy.world 2 points 1 day ago

Sounds like a good opportunity for a crowdfunded startup.

[–] GroundedGator@lemmy.world 14 points 1 day ago (1 children)

We absolutely should have more specialized LLMs. That being said, we have dozens of tools that convert documents and data. Also, any engineer worth a nickel should be able to whip something up in an hour or so for most cases.

[–] ByteOnBikes@slrpnk.net 15 points 1 day ago (1 children)

But I want AI to convert my mp3s to Oggs and vice versa 😭😭😭

Not some stupid "conversion library" or whatever that is

[–] GroundedGator@lemmy.world 4 points 1 day ago

Oooh do you think that would work on my 96kbps mp3s?

[–] SkybreakerEngineer@lemmy.world 217 points 2 days ago (13 children)

You can't parse [X]HTML with LLM. Because HTML can't be parsed by LLM. LLM is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of LLM will not allow you to consume HTML. LLM are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by LLM. LLM queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular LLM as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by LLM. Even Jon Skeet cannot parse HTML using LLM. Every time you attempt to parse HTML with LLM, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with LLM summons tainted souls into the realm of the living. HTML and LLM go together like love, marriage, and ritual infanticide. The cannot hold it is too late. The force of LLM and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with LLM you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-LLM will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
LLM-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures LLM will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using LLM to parse HTML has doomed humanity to an eternity of dread torture and security holes using LLM as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of LLM parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy LLM-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

[–] LiveLM@lemmy.zip 3 points 1 day ago

HTML and regex go together like love, marriage, and ritual infanticide

I'm dying lmao

[–] LeFrog@discuss.tchncs.de 53 points 2 days ago

Oh, the timeless classic. Chef's kiss

[–] Gsus4@mander.xyz 18 points 1 day ago* (last edited 1 day ago) (11 children)

Same for music, like Suno. I don't need to remix and hallucinate new fusion music, I just need a really good way to effectively search/discover all music that already exists in one place.

[–] jonne@infosec.pub 62 points 2 days ago (6 children)

So I guess it's only us Millennials who know how to convert a PDF properly, and we're just sandwiched in between boomers and Gen Z finding the most ridiculous ways to try to accomplish that task.

[–] dubyakay@lemmy.ca 18 points 1 day ago (3 children)

As always, Gen X is just forgotten.

[–] RagingRobot@lemmy.world 7 points 1 day ago

Who?

[–] jonne@infosec.pub 8 points 1 day ago

Yeah, I didn't want to shit on you guys. Most of you are all right.

[–] shawn1122@lemm.ee 3 points 1 day ago

Gen X is a mixed bag when it comes to this, but most I've come across are more like Boomers than Millennials on this.

[–] bitchkat@lemmy.world 2 points 1 day ago

I just look for a command named ${src}2${dest} like pdf2html
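If you wanted to automate that habit, a rough sketch is below. It only checks your PATH for names following the `${src}2${dest}` pattern, plus a `${src}to${dest}` variant that I'm adding as my own assumption (pdftotext is named that way). `candidate_names` and `find_converter` are made-up helper names, and finding a command says nothing about what arguments it takes.

```python
import shutil
from typing import List, Optional


def candidate_names(src: str, dest: str) -> List[str]:
    """Guess command names for a src -> dest converter (naming patterns are assumptions)."""
    return [f"{src}2{dest}", f"{src}to{dest}"]


def find_converter(src: str, dest: str) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidate_names(src, dest):
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    # Prints a path if something like pdf2html or pdftohtml is installed, else None.
    print(find_converter("pdf", "html"))
```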

[–] AppleTea@lemmy.zip 14 points 1 day ago (2 children)

Ctrl a

Ctrl c

alt tab

Ctrl v

Ctrl s

conversion complete

[–] Eyekaytee@aussie.zone 19 points 1 day ago

real men find a random pdf converter website on the internets and upload our secret documents there

[–] lime@feddit.nu 14 points 1 day ago* (last edited 1 day ago) (5 children)

$ pandoc doc.pdf -o doc.txt

Edit: welp, pandoc can't read PDFs as input. pdftotext it is.
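If you'd rather drive pdftotext from a script, something like this sketch should do, assuming pdftotext (it ships with Poppler and Xpdf) is installed and on your PATH. The function names and the `keep_layout` switch are mine; `-layout` is the pdftotext option that tries to preserve the physical layout of the page, which you probably want off if the goal is plain reading text.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, txt_path: str, keep_layout: bool = False) -> List[str]:
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # Ask pdftotext to keep the original physical layout.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def convert(pdf_path: str, txt_path: str, keep_layout: bool = False) -> None:
    """Run pdftotext; raises CalledProcessError if it exits with an error."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)


if __name__ == "__main__":
    # Only builds and prints the command; call convert() to actually run it.
    print(pdftotext_command("doc.pdf", "doc.txt"))
```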

[–] dan@upvote.au 37 points 2 days ago (4 children)

Somehow, millennials ended up being the only generation that at least kind of knows how to use computers.

[–] Saleh@feddit.org 13 points 1 day ago (2 children)

Naah. There are plenty of Gen X, Y, Z who know and plenty of Millennials who don't.

It's just that if you wanted to "do stuff with computers" you had to develop some understanding back then.

Today you can "do stuff" like gaming much easier out of the box. So not everyone who "does stuff" knows his way around.

In the office most colleagues of all generations just know how to do their specific things, mostly in MS Office products.

[–] Tuxman@sh.itjust.works 7 points 1 day ago* (last edited 1 day ago) (1 children)

Oh god… The number of colleagues whose WHOLE job depends on MS Word and who have never heard of "Insert page break"…

Then they complain when inserting an image breaks their whole document…

[–] LiveLM@lemmy.zip 2 points 1 day ago* (last edited 1 day ago)

To be fair, I've had Word absolutely freak out with images even in the simplest documents so I don't blame your colleagues, even without the page breaks

[–] SpacetimeMachine@lemmy.world 6 points 1 day ago

The difference being that a lot of millennials know how to figure out how to do stuff on computers by doing basic research. I've found a lot of my Gen Z friends to be more helpless in that regard.

[–] tsiad_mordecai_miktros@lemmy.world 76 points 2 days ago* (last edited 2 days ago) (3 children)

send me your data and i will parse it for you

it may take me a week to get back to you

[–] pewgar_seemsimandroid@lemmy.blahaj.zone 27 points 2 days ago (1 children)

pdf to brainrot

[–] jdeath@lemm.ee 2 points 1 day ago

PdfTok
