Towards Human Action Understanding In Social Media Videos Using Multimodal Models