Dynamic action recognition

I’ve developed a CNN that classifies humans and vehicles with a decent accuracy of 80%. Now, when the detected object is a human, I need to classify what the person is doing: walking, sitting, or talking (normal behaviour), or hitting another person or falling down (abnormal behaviour). Can anyone tell me where to start? I have no idea how to train a system using video data.
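
To make the target labels concrete, here is a minimal sketch of the two-level scheme I have in mind. The names (`NORMAL_ACTIONS`, `ABNORMAL_ACTIONS`, `behaviour_of`) are only illustrative and do not come from any existing dataset or library; the sketch covers the labelling only, not the model itself.

```python
# Hypothetical label scheme for illustration; not tied to any dataset or library.
NORMAL_ACTIONS = {"walking", "sitting", "talking"}
ABNORMAL_ACTIONS = {"hitting", "falling"}


def behaviour_of(action: str) -> str:
    """Map a fine-grained action label to its coarse behaviour class."""
    if action in NORMAL_ACTIONS:
        return "normal"
    if action in ABNORMAL_ACTIONS:
        return "abnormal"
    # Any label outside the five actions listed above is rejected explicitly.
    raise ValueError(f"unknown action: {action!r}")
```

So a clip labelled `"falling"` would count as abnormal, and one labelled `"sitting"` as normal.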

