Microsoft’s VASA-1 can deepfake a person with one photo and one audio track

Submitted 9 months ago by return2ozma@lemmy.world to technology@lemmy.world

https://arstechnica.com/information-technology/2024/04/microsofts-vasa-1-can-deepfake-a-person-with-one-photo-and-one-audio-track/

Comments

ptz@dubvee.org ⁨9⁩ ⁨months⁩ ago
Also Microsoft…

Microsoft warns deepfake election subversion is disturbingly easy

I know the genie’s out of the bottle, but goddamn.

- Etterra@lemmy.world ⁨9⁩ ⁨months⁩ ago
  Microsoft: I know this will only be used for evil, but I’ll be damned if I’m gonna pass up on the hype-boost to my market share.
  
  Every other big corp: same!
  
slaacaa@lemmy.world ⁨9⁩ ⁨months⁩ ago
“At long last, we have created the Torment Nexus from classic sci-fi novel Don’t Create The Torment Nexus”

brown567@sh.itjust.works ⁨9⁩ ⁨months⁩ ago
Can we maybe stop making these? XD

- NoRodent@lemmy.world ⁨9⁩ ⁨months⁩ ago
  Image
  
  - ThePantser@lemmy.world ⁨9⁩ ⁨months⁩ ago
    This coming from the guy who turned himself into a fly for fun
    
- MysticKetchup@lemmy.world ⁨9⁩ ⁨months⁩ ago
  Like, what even is a legitimate use case for these? It just seems tailor-made for either misinformation or pointless memes, neither of which seems like a good sales pitch.
  
  - Deceptichum@sh.itjust.works ⁨9⁩ ⁨months⁩ ago
    I could see a few uses, but the biggest would probably be advertising. Tailored ads that look like they’re coming from a real person.
    
    Imagine Jake from State Farm addressing you personally about your insurance in an ad.
    
  - Even_Adder@lemmy.dbzer0.com ⁨9⁩ ⁨months⁩ ago
    I think you’re falling for the overblown fearmongering headline, and pointless memes are a great reason to make things.
    
  - chiisana@lemmy.chiisana.net ⁨9⁩ ⁨months⁩ ago
    Say you’re a movie studio director making the next big movie with some big name celebs. Filming is in progress, and one of the actors dies in the most on-brand way possible. Everyone decides that the film must be finished to honor the actor’s legacy, but how can you film someone who is dead? This technology would enable you to create footage the VFX team can lay over the top of a stand-in actor’s face and provide a better experience for your audience.
    
    I’m sure there are other uses, but this one pops to mind as a very legitimate use case that could’ve benefited from the technology.
    
  - frezik@midwest.social ⁨9⁩ ⁨months⁩ ago
    Maybe a historical biopic in the style of photos of the time. Like take pictures of Lincoln, Grant, Lee, etc., use voice actors plus modern reenactors for background characters, and build it into a whole movie.
    
    I dunno, I’m probably reaching.
    
  - fartsparkles@sh.itjust.works ⁨9⁩ ⁨months⁩ ago
    [deleted]
  - Jimmycakes@lemmy.world ⁨9⁩ ⁨months⁩ ago
    Avatars for ugly people who are good at games and want to get into streaming
    
dhork@lemmy.world ⁨9⁩ ⁨months⁩ ago
Vasa? Like, the Swedish ship that sank 10 minutes after it was launched? Who named that project?

- Jimmycakes@lemmy.world ⁨9⁩ ⁨months⁩ ago
  They developed an AI to name all future AI. Ironically, it is unnamed.
  
- dumbass@leminal.space ⁨9⁩ ⁨months⁩ ago
  There are a lot of flying vehicles named after birds who famously plummet to the ground at breakneck speeds.
  
- hakunawazo@lemmy.world ⁨9⁩ ⁨months⁩ ago
  No, like the crispbread.
  
ArmoredThirteen@lemmy.ml ⁨9⁩ ⁨months⁩ ago
These vids are just off enough that I think doing a bunch of mushrooms and watching them would be a deeply haunting experience

- Dozzi92@lemmy.world ⁨9⁩ ⁨months⁩ ago
  So essentially the music video for Drugs by Ratatat.
  
- return2ozma@lemmy.world ⁨9⁩ ⁨months⁩ ago
  In the first video, her bottom teeth shift around.
  
redcalcium@lemmy.institute ⁨9⁩ ⁨months⁩ ago
Combine this with an LLM with speech-to-text input and we could create talking paintings like in the Harry Potter movies. Heck, hang one on a door and hook it up to a smart lock to recreate the dorm doors in Harry Potter and see if people can trick it into opening the door.

- NotMyOldRedditName@lemmy.world ⁨9⁩ ⁨months⁩ ago
  Any sufficiently advanced technology is indistinguishable from magic.
  
  Harry Potter wasn’t a fantasy movie, it was sci-fi and we just didn’t know it.
  
  - venoft@lemmy.world ⁨9⁩ ⁨months⁩ ago
    It was midichlorians all along.
    
- chatokun@lemmy.dbzer0.com ⁨9⁩ ⁨months⁩ ago
  I was actually discussing this very idea with my brother, who went to the Wizarding World of Harry Potter at Universal Studios, Orrrlandooooo recently and while he enjoyed himself, said it felt like not much is new in theme parks nowadays. Adding in AI driven pictures you could actually talk to might spice it up.
  
Maeve@kbin.social ⁨9⁩ ⁨months⁩ ago
A long time ago, someone from a not free country wrote a white paper on why we should care about privacy, because written words can be edited to level false accusations (charges) with false evidence. This chills me to the bone.

- Kiosade@lemmy.ca ⁨9⁩ ⁨months⁩ ago
  This is turning into some Mistborn shit. “Don’t trust writing not written on metal”
  
- Sanctus@lemmy.dbzer0.com ⁨9⁩ ⁨months⁩ ago
  “You shot that man, citizen. Here is video evidence. Put your hands against the wall.” - and more coming to you soon!
  
- tal@lemmy.today ⁨9⁩ ⁨months⁩ ago
  I’d be less-concerned about the impact on not-free countries than free countries. Dictator Bob doesn’t need evidence to have the justice system get rid of you, because he controls the justice system.
  
MeekerThanBeaker@lemmy.world ⁨9⁩ ⁨months⁩ ago
This is why I don’t post my picture online and I never talk to anyone ever, while hiding my head inside a nylon stocking (unrelated).

- unreachable@lemmy.world ⁨9⁩ ⁨months⁩ ago
  Image
  
ReallyActuallyFrankenstein@lemmynsfw.com ⁨9⁩ ⁨months⁩ ago
I mean, I know it’s scary, but I’ll admit it is impressive, even when I watched it with jaded “every day is another AI breakthrough” exhaustion.

The subtle face movements, eyebrow expression, everything seems to correctly infer how the face would articulate those specific words. When you think of how many decades something like this would have stayed in the uncanny valley even with a team of trained people hand-tweaking the image and video, and this is doing it better in nearly every way, automatically, with just an image? Insane.

- kromem@lemmy.world ⁨9⁩ ⁨months⁩ ago
  It’s pretty wild that this is the tech being produced by the trillion-dollar company that has already been granted a patent on creating digital resurrections of dead people from the data they left behind.
  
  So we now already have LLMs that could take what you said and say new things that seem like what you would have said, take a voice sample of you and create new voice synthesis of that text where it sounds a lot like you were actually saying it, and can take a photo of you and make a video where you are synched up saying that voice sample with facial expressions and all.
  
  And this could be done for anyone who has a social media profile with a number of text posts, a profile photo, and a 15 second sample of their voice.
  
  I really don’t get how every single person isn’t just having a daily existential crisis questioning the nature of their present reality given what’s coming.
  
  Do people just think the current trends aren’t going to continue, or just don’t think about the notion that what happens in the future could in fact have been their own nonlocal past?
  
- Speculater@lemmy.world ⁨9⁩ ⁨months⁩ ago
  And you can run it on a single 4090, that’s crazy.
  
  - mPony@lemmy.world ⁨9⁩ ⁨months⁩ ago
    uh, are graphics cards supposed to be 2500 bucks? (I play boardgames)
    
AnAnonymous@lemm.ee ⁨9⁩ ⁨months⁩ ago
Paranoia vibes starting in 3, 2, 1…

venusaur@lemmy.world ⁨9⁩ ⁨months⁩ ago
One photo? That’s incredible.

- clark@midwest.social ⁨9⁩ ⁨months⁩ ago
  Yeah. Incredibly horrific.
  
BetaDoggo_@lemmy.world ⁨9⁩ ⁨months⁩ ago
The “why would they make this” people don’t understand how important this type of research is. It’s important to show what’s possible so that we can be ready for it. There are many bad actors already pursuing similar tools if they don’t have them already. The worst case is being blindsided by something not seen before.

I_Miss_Daniel@lemmy.world ⁨9⁩ ⁨months⁩ ago
Feed it Microsoft Merlin. What will happen?

thefartographer@lemm.ee ⁨9⁩ ⁨months⁩ ago
The pores don’t stretch, but the teeth and irises sure do!

Dasus@lemmy.world ⁨9⁩ ⁨months⁩ ago
The only use of this I’m in favour of is recreating Majel Barrett’s voice as an AI for computer systems.

- kromem@lemmy.world ⁨9⁩ ⁨months⁩ ago
  This project doesn’t recreate or simulate voices at all.
  
  It takes a still photograph and creates a lip-synced video of that person saying the paired full audio clip.
  
  There are other projects that simulate voices.
  
  - trolololol@lemmy.world ⁨9⁩ ⁨months⁩ ago
    Yep it’s part of it to generate the sound track
    
    One of the videos shows the voice changing mid-sentence.
    
autotldr@lemmings.world [bot] ⁨9⁩ ⁨months⁩ ago
This is the best summary I could come up with:

On Tuesday, Microsoft Research Asia unveiled VASA-1, an AI model that can create a synchronized animated video of a person talking or singing from a single photo and an existing audio track.

In the future, it could power virtual avatars that render locally and don’t require video feeds—or allow anyone with similar tools to take a photo of a person found online and make them appear to say whatever they want.

To show off the model, Microsoft created a VASA-1 research page featuring many sample videos of the tool in action, including people singing and speaking in sync with pre-recorded audio tracks.

The examples also include some more fanciful generations, such as Mona Lisa rapping to an audio track of Anne Hathaway performing a “Paparazzi” song on Conan O’Brien.

While the Microsoft researchers tout potential positive applications like enhancing educational equity, improving accessibility, and providing therapeutic companionship, the technology could also easily be misused.

“We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection,” write the researchers.

The original article contains 797 words, the summary contains 183 words. Saved 77%. I’m a bot and I’m open source!

simplejack@lemmy.world ⁨9⁩ ⁨months⁩ ago
Microsoft’s research teams always make some pretty crazy stuff. The problem with Microsoft is that they absolutely suck at translating their lab work into consumer products. Their lab publications are an amazing archive of shit that MS couldn’t get out the door properly or on time. Example: multitouch gesture UIs.

As interesting as this is, I’ll bet MS just ends up using some tech that OpenAI launches before MS’s bureaucratic product team can get their shit together.
