Learning visual speech representations from talking face videos is an important problem for several speech-related tasks, such as lip reading, talking face generation, audio-visual speech separation, ...