this post was submitted on 20 Dec 2023


Child sex abuse images found in dataset training image generators, report says (arstechnica.com)

submitted 10 months ago by SapphireVelvet84839@lemmynsfw.com to c/technology@lemmy.world


The report: https://stacks.stanford.edu/file/druid:kh752sm9123/ml_training_data_csam_report-2023-12-20.pdf

top 12 comments


[–] Supermariofan67@programming.dev 20 points 10 months ago* (last edited 10 months ago) (1 children)

Given that Facebook and Meta's other platforms are among the largest distributors of that material, this is unfortunately not exactly a surprise if they scraped Facebook for data...

[–] Yuki@kutsuya.dev 1 points 10 months ago

That's sad :c

[–] LainOfTheWired@lemy.lol 5 points 10 months ago (2 children)

How are you supposed to train the damn thing to detect something without using that thing though?

[–] SapphireVelvet84839@lemmynsfw.com 8 points 10 months ago

How are you supposed to train the damn thing to detect something without using that thing though?

There are various organizations that have clearance to handle child abuse images, where the data is handled like the plutonium it is and everybody is vetted. I'm sure they've already experimented with developing a bot to detect images.

They even make their traditional hash databases available to server admins who want to check their images against them.

The issue, as the report states, is that nobody will willingly link said bot/database to the training data, because either they don't want a copyright fight or they don't want to acknowledge the issue.
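To make the hash-database idea in this comment concrete, here is a minimal sketch of checking items against a set of known hashes. It assumes exact-match SHA-256 digests purely for illustration. The databases the comment refers to may well use perceptual hashes that survive resizing or re-encoding, which this sketch does not attempt. The names `sha256_hex`, `flag_known`, `items`, and `known_hashes` are hypothetical, and the byte strings are placeholders.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flag_known(items: dict[str, bytes], known_hashes: set[str]) -> list[str]:
    """Return the names of items whose digest appears in known_hashes.

    Assumption: exact byte-for-byte matching. A single changed byte
    produces a different digest, so this only catches identical copies.
    """
    return [name for name, data in items.items()
            if sha256_hex(data) in known_hashes]


# Placeholder byte strings stand in for file contents (hypothetical data).
items = {"a.bin": b"example-one", "b.bin": b"example-two"}

# Hypothetical hash list, as an admin might receive from a vetted organization.
known_hashes = {sha256_hex(b"example-two")}

flagged = flag_known(items, known_hashes)
print(flagged)
```

The appeal of this design is that the party doing the check only ever holds digests, never the flagged material itself.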

[–] GigglyBobble@kbin.social 5 points 10 months ago

It's about generators which certainly should not be trained with such material.

[–] autotldr@lemmings.world 4 points 10 months ago

This is the best summary I could come up with:

More than 1,000 known child sexual abuse materials (CSAM) were found in a large open dataset—known as LAION-5B—that was used to train popular text-to-image generators such as Stable Diffusion, Stanford Internet Observatory (SIO) researcher David Thiel revealed on Wednesday.

His goal was to find out what role CSAM may play in the training process of AI models powering the image generators spouting this illicit content.

"Our new investigation reveals that these models are trained directly on CSAM present in a public dataset of billions of images, known as LAION-5B," Thiel's report said.

But because users were dissatisfied by these later, more filtered versions, Stable Diffusion 1.5 remains "the most popular model for generating explicit imagery," Thiel's report said.

While a YCombinator thread linking to a blog—titled "Why we chose not to release Stable Diffusion 1.5 as quickly"—from Stability AI's former chief information officer, Daniel Jeffries, may have provided some clarity on this, it has since been deleted.

Thiel's report warned that both figures are "inherently a significant undercount" due to researchers' limited ability to detect and flag all the CSAM in the datasets.

The original article contains 837 words, the summary contains 182 words. Saved 78%. I'm a bot and I'm open source!

[–] BetaDoggo_@lemmy.world 1 points 10 months ago (1 children)

Between 0.00002% and 0.00006%

[–] snooggums@kbin.social 9 points 10 months ago (1 children)

Anything > 0 is too many.

[–] KoboldCoterie@pawb.social 8 points 10 months ago* (last edited 10 months ago) (2 children)

While I agree with the sentiment, that's 2-6 in 10,000,000 images. Even if someone were personally reviewing all of the images that went into these datasets, which I strongly doubt, that's a pretty easy mistake to make when looking at that many images.
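For anyone who wants to check the conversion from the percentages quoted upthread to "2-6 in 10,000,000", here is a quick sketch. It uses exact fractions to avoid floating-point rounding. The helper name `per_ten_million` is made up for this example.

```python
from fractions import Fraction


def per_ten_million(percent: str) -> Fraction:
    """Convert a percentage (given as a string) to a count per 10,000,000."""
    # Divide by 100 to turn the percentage into a plain fraction,
    # then scale that fraction to a population of ten million.
    return Fraction(percent) / 100 * 10_000_000


low = per_ten_million("0.00002")
high = per_ten_million("0.00006")
print(low, high)
```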

[–] RecallMadness@lemmy.nz 8 points 10 months ago

“Known CSAM” suggests researchers ran it through automated detection tools which the dataset authors could have used.

[–] SapphireVelvet84839@lemmynsfw.com 1 points 10 months ago (1 children)

They're not looking at the images though. They're scraping. And their own legal defenses rely on them not looking too carefully, or else they cede their position to the copyright holders.

[–] snooggums@kbin.social 4 points 10 months ago

Technically they violated the copyright of the CSAM creators!