We use a spatial stream and a temporal stream, built on VGG-16 and CNN-M respectively, to model video information. LSTMs are stacked on top of the CNNs to model long-term dependencies between video frames. For more information, see these papers:
Two-Stream Convolutional Networks for Action Recognition in Videos
Fusing Multi-Stream Deep Networks for Video Classification
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
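The pipeline above (per-stream CNN features, an LSTM over time, then fusion of the two streams) can be sketched minimally as follows. This is an illustrative sketch, not the repository's actual code: the feature dimensions, the hand-rolled NumPy LSTM, and the averaging fusion are all assumptions standing in for the real VGG-16/CNN-M features and trained weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinimalLSTM:
    """A single-layer LSTM that summarizes a sequence of per-frame features."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One combined weight matrix for the input, forget, cell, and output gates.
        self.W = rng.normal(0.0, 0.1, (4 * hid_dim, in_dim + hid_dim))
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def __call__(self, seq):  # seq: (T, in_dim)
        h = np.zeros(self.hid_dim)
        c = np.zeros(self.hid_dim)
        for x in seq:
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        return h  # final hidden state summarizes the whole clip

# Stand-ins for precomputed per-frame CNN features: spatial stream
# (e.g. VGG-16 on RGB frames) and temporal stream (e.g. CNN-M on optical flow).
# Dimensions here are arbitrary placeholders.
T, spatial_dim, temporal_dim, hid = 16, 128, 64, 32
spatial_feats = np.random.default_rng(1).normal(size=(T, spatial_dim))
temporal_feats = np.random.default_rng(2).normal(size=(T, temporal_dim))

spatial_lstm = MinimalLSTM(spatial_dim, hid)
temporal_lstm = MinimalLSTM(temporal_dim, hid, seed=1)

# Late fusion by averaging the two stream summaries; a classifier head
# would consume this fused vector.
fused = 0.5 * (spatial_lstm(spatial_feats) + temporal_lstm(temporal_feats))
print(fused.shape)  # → (32,)
```

In practice the fusion step is a design choice of its own (averaging, concatenation, or a learned fusion network, as explored in the fusion papers listed above).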