linuxmemes

338

Parsing HTML with regex (lemmy.sdf.org)

submitted 4 months ago by pmjv@lemmy.sdf.org to c/linuxmemes@lemmy.world


cross-posted from: https://lemmy.sdf.org/post/12950329


[–] 2xsaiko@discuss.tchncs.de 9 points 4 months ago (1 children)

None of these examples are for parsing English sentences. They parse completely different formal languages. That it's text is irrelevant; regex usually operates on text.

You cannot write a regex to give you, for example, "the subject of an English sentence", just as you can't write a regex to give you "the contents of a complete div tag", because neither of those is a regular language (HTML is context-free; I'm not sure about English, my guess is it would be considered recursively enumerable).

You can't even write a regex to just consume <div> repeated exactly n times followed by </div> repeated exactly n times, because that is already a context-free language rather than a regular one. In fact, it is the classic example of a minimal context-free language, which Wikipedia also uses.
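To make the <div>^n </div>^n point concrete, here is a minimal Python sketch. The function names `regex_accepts` and `balanced` are hypothetical, made up for illustration. The first function shows the closest a plain regular expression gets: "one or more opens, then one or more closes", with no way to tie the two counts together. The second gets the right answer only by counting outside the pattern, which is exactly the step a regular language cannot express.

```python
import re

# Closest a plain regular expression gets: some opens, then some closes.
# It cannot require that the two counts be equal.
FLAT = re.compile(r"((?:<div>)+)((?:</div>)+)")


def regex_accepts(s: str) -> bool:
    """Return True if s is one or more <div> followed by one or more </div>."""
    return FLAT.fullmatch(s) is not None


def balanced(s: str) -> bool:
    """Return True only for <div> * n followed by </div> * n, with n >= 1.

    The regex only splits the string into two runs; the equality check
    on the counts happens in ordinary code, outside the pattern.
    """
    m = FLAT.fullmatch(s)
    if m is None:
        return False
    opens = len(m.group(1)) // len("<div>")
    closes = len(m.group(2)) // len("</div>")
    return opens == closes


if __name__ == "__main__":
    for sample in ("<div><div></div></div>", "<div><div></div>"):
        print(sample, regex_accepts(sample), balanced(sample))
```

The unbalanced sample is accepted by `regex_accepts` but rejected by `balanced`. Some regex engines add recursion or similar extensions that go beyond regular languages; Python's built-in `re` module is not one of them.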

[–] Blue_Morpho@lemmy.world -5 points 4 months ago* (last edited 4 months ago)

"None of these examples are for parsing English sentences."

Read it again:

"At one point while working on the manuscript for this book, I ran such a tool on what I’d written so far"

The author explicitly stated that he used regex to parse his own book for errors! The example was using regex to parse HTML.

"You can’t even write a regex to just consume <div> repeated exactly n times followed by </div> repeated exactly n times,"

Just because regex can't do everything in all cases doesn't mean it isn't useful for parsing some HTML and English text.

It's like screaming, "You can't build an operating system with C because it doesn't solve the halting problem!"
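As a rough illustration of the "useful for some HTML" case, here is a small Python sketch that pulls href values out of flat anchor tags. The helper name `extract_hrefs` and the sample markup are made up for this example. It assumes double-quoted attributes and would misfire on single-quoted or unquoted attributes, comments, or script content, so it is a sketch of the limited use case, not a general HTML parser.

```python
import re

# Assumes href is double-quoted and the tag contains no '>' before it.
HREF = re.compile(r'<a\s+[^>]*?href="([^"]*)"', re.IGNORECASE)


def extract_hrefs(html: str) -> list[str]:
    """Return the href values of anchor tags found in html, in order."""
    return HREF.findall(html)


if __name__ == "__main__":
    sample = '<p><a href="https://example.org/a">A</a> and <a class="x" href="/b">B</a></p>'
    print(extract_hrefs(sample))
```

This works because the task never requires matching an opening tag to its closing tag; as soon as nesting matters, the limits described earlier in the thread apply.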