Decoupling semantic understanding and temporal reasoning is essential for efficient scene identification.