Let’s think frame by frame with VIP: A video infilling and prediction dataset for evaluating video chain-of-thought. V Himakunthala, A Ouyang, D Rose, R He, A Mei, Y Lu, C Sonar, ... Proceedings of the 2023 Conference on Empirical Methods in Natural Language …, 2023 | 5* | 2023 |