David Harwath
Title
Cited by
Year
Unsupervised learning of spoken language with visual context
D Harwath, A Torralba, J Glass
Advances in neural information processing systems 29, 2016
Cited by: 295; Year: 2016
Jointly discovering visual objects and spoken words from raw sensory input
D Harwath, A Recasens, D Surís, G Chuang, A Torralba, J Glass
Proceedings of the European conference on computer vision (ECCV), 649-665, 2018
Cited by: 245; Year: 2018
Deep multimodal semantic embeddings for speech and images
D Harwath, J Glass
2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU …, 2015
Cited by: 192; Year: 2015
Everything at once - multi-modal fusion transformer for video retrieval
N Shvetsova, B Chen, A Rouditchenko, S Thomas, B Kingsbury, RS Feris, ...
Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2022
Cited by: 156; Year: 2022
AVLnet: Learning audio-visual language representations from instructional videos
A Rouditchenko, A Boggust, D Harwath, B Chen, D Joshi, S Thomas, ...
arXiv preprint arXiv:2006.09199, 2020
Cited by: 151; Year: 2020
Contrastive audio-visual masked autoencoder
Y Gong, A Rouditchenko, AH Liu, D Harwath, L Karlinsky, H Kuehne, ...
arXiv preprint arXiv:2210.07839, 2022
Cited by: 133; Year: 2022
A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition
A Jansen, E Dupoux, S Goldwater, M Johnson, S Khudanpur, K Church, ...
2013 IEEE International Conference on Acoustics, Speech and Signal …, 2013
Cited by: 122; Year: 2013
Learning word-like units from joint audio-visual analysis
D Harwath, JR Glass
arXiv preprint arXiv:1701.07481, 2017
Cited by: 119; Year: 2017
Learning hierarchical discrete linguistic units from visually-grounded speech
D Harwath, WN Hsu, J Glass
arXiv preprint arXiv:1911.09602, 2019
Cited by: 102; Year: 2019
MAE-AST: Masked autoencoding audio spectrogram transformer
A Baade, P Peng, D Harwath
arXiv preprint arXiv:2203.16691, 2022
Cited by: 100; Year: 2022
Multimodal clustering networks for self-supervised learning from unlabeled videos
B Chen, A Rouditchenko, K Duarte, H Kuehne, S Thomas, A Boggust, ...
Proceedings of the IEEE/CVF International Conference on Computer Vision …, 2021
Cited by: 99; Year: 2021
Text-free image-to-speech synthesis using learned segmental units
WN Hsu, D Harwath, C Song, J Glass
arXiv preprint arXiv:2012.15454, 2020
Cited by: 80; Year: 2020
Spoken moments: Learning joint audio-visual representations from video descriptions
M Monfort, SY Jin, A Liu, D Harwath, R Feris, J Glass, A Oliva
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2021
Cited by: 70; Year: 2021
Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech
D Harwath, G Chuang, J Glass
2018 IEEE International Conference on Acoustics, Speech and Signal …, 2018
Cited by: 69; Year: 2018
Why is Winoground hard? Investigating failures in visuolinguistic compositionality
A Diwan, L Berry, E Choi, D Harwath, K Mahowald
arXiv preprint arXiv:2211.00768, 2022
Cited by: 50; Year: 2022
Word discovery in visually grounded, self-supervised speech models
P Peng, D Harwath
arXiv preprint arXiv:2203.15081, 2022
Cited by: 46; Year: 2022
Towards visually grounded sub-word speech unit discovery
D Harwath, J Glass
ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and …, 2019
Cited by: 43; Year: 2019
Prompting the hidden talent of web-scale speech models for zero-shot task generalization
P Peng, B Yan, S Watanabe, D Harwath
arXiv preprint arXiv:2305.11095, 2023
Cited by: 40; Year: 2023
VoiceCraft: Zero-shot speech editing and text-to-speech in the wild
P Peng, PY Huang, SW Li, A Mohamed, D Harwath
arXiv preprint arXiv:2403.16973, 2024
Cited by: 37; Year: 2024
Fast-slow transformer for visually grounding speech
P Peng, D Harwath
ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and …, 2022
Cited by: 36; Year: 2022