Unless the videos come with proper depth maps and identifiers for objects and actions, they're not going to be as effective as, say, robot-arm surgery data or VR-captured movement and tracking. You're basically adding an extra layer to the learning: first process the video into something usable, then learn from that. Not very efficient, and highly dependent on camera placement and angles.
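
To make that extra layer concrete, here's a minimal sketch (hypothetical names, dummy placeholder estimators, not any particular library) contrasting data that is already structured, like arm telemetry or VR tracking, with raw video that has to pass through a perception stage before any learning can start:

```python
import numpy as np
from dataclasses import dataclass

# Directly usable sample, e.g. from robot-arm telemetry or VR tracking:
# poses and actions are already explicit, so a learner can consume them as-is.
@dataclass
class TrackedSample:
    joint_angles: np.ndarray   # (7,) arm configuration
    gripper_pose: np.ndarray   # (6,) position + orientation
    action: np.ndarray         # (7,) commanded joint deltas

# Raw video sample: just pixels, no depth, no object or action labels.
@dataclass
class VideoSample:
    frames: np.ndarray         # (T, H, W, 3) RGB frames

def perception_stage(video: VideoSample) -> TrackedSample:
    """Hypothetical extra layer: estimate depth, find the objects, and infer
    actions from pixels before learning can happen. The estimators here are
    zero-filled stand-ins; in practice each one is its own model, and each is
    sensitive to camera placement and viewing angle."""
    depth = np.zeros(video.frames.shape[:3])   # stand-in for depth estimation
    joint_angles = np.zeros(7)                 # stand-in for pose estimation from pixels
    gripper_pose = np.zeros(6)
    inferred_action = np.zeros(7)              # stand-in for action inference between frames
    return TrackedSample(joint_angles, gripper_pose, inferred_action)

# With telemetry or VR data the learner gets TrackedSamples directly;
# with video, every sample pays the perception cost (and inherits its errors) first.
video = VideoSample(frames=np.zeros((16, 224, 224, 3), dtype=np.uint8))
sample = perception_stage(video)
```

The point of the sketch is just that the video path has a whole stage the tracked-data path doesn't, and any error it makes (bad depth, missed object, wrong inferred action) gets baked into what the learner sees downstream.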