Can Computer Vision do that?

I was listening to Bach's ‘Passacaglia and Fugue’ as recorded for the BBC Proms Bach Day. In the comments on the YouTube video there was a reference to the flute player. I wanted to see where the video showed the flute player, so I had to step through the whole video (in a somewhat binary-search manner) until I found the clip.

While searching, I began to wonder whether recent technology can solve the problem of finding the part of a video where a certain instrument is being “played”. Note the PLAYED part.

The naive way is to find the frames where, e.g., a flute is shown, then use sound analysis to check whether the flute can be heard during those frames. It appears to be a good solution; however, finding where the flute is being played is not easy when many other instruments are also playing. Secondly, the flute player being shown is not necessarily the one who is actually playing.
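To make the naive way concrete, here is a minimal sketch in Python. It assumes we already have two per-frame yes/no signals: whether the flute is visible (from some object detector) and whether the flute is audible (from some audio classifier). Both detectors are hypothetical and not shown here, and the function name and parameters are my own for illustration. The sketch only intersects the two signals, so it inherits both problems described above.

```python
from typing import List, Tuple


def find_played_segments(
    visible: List[bool],
    audible: List[bool],
    fps: float,
    min_seconds: float = 1.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds where the instrument is
    both seen and heard for at least `min_seconds`.

    `visible` and `audible` are assumed per-frame flags produced by
    hypothetical visual and audio detectors.
    """
    if len(visible) != len(audible):
        raise ValueError("visible and audible must have the same length")

    # Shortest run of frames we accept as a real segment.
    min_frames = max(1, int(round(min_seconds * fps)))
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current run, if any

    for i, (seen, heard) in enumerate(zip(visible, audible)):
        if seen and heard:
            if start is None:
                start = i
        elif start is not None:
            # The run just ended at frame i (exclusive).
            if i - start >= min_frames:
                segments.append((start / fps, i / fps))
            start = None

    # Close a run that lasts until the final frame.
    if start is not None and len(visible) - start >= min_frames:
        segments.append((start / fps, len(visible) / fps))

    return segments


if __name__ == "__main__":
    # Toy data at 1 frame per second, made up for illustration.
    visible = [False, True, True, True, True, False, True, True]
    audible = [True, True, True, False, True, True, True, True]
    print(find_played_segments(visible, audible, fps=1.0, min_seconds=2.0))
    # prints [(1.0, 3.0), (6.0, 8.0)]
```

The `min_seconds` threshold is just a guard against single-frame flickers in either detector. The hard part, of course, is producing trustworthy `visible` and `audible` flags in the first place.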

The question is: can we judge (both with and without using the sound) whether, in a given clip, some instrument is being “PLAYED” or is just being shown?

Try it yourself by watching this video while you enjoy the amazing Bach (from 4:37).