Open Access. Powered by Scholars. Published by Universities.®

Engineering Commons

Physical Sciences and Mathematics

Edith Cowan University

Series

2024

Human action recognition

Articles 1 - 2 of 2

Full-Text Articles in Engineering

Mhair: A Dataset Of Audio-Image Representations For Multimodal Human Actions, Muhammad Bilal Shaikh, Douglas Chai, Syed M. S. Islam, Naveed Akhtar Jan 2024

Research outputs 2022 to 2026

The Audio-image representations for multimodal human actions (MHAiR) dataset contains six different image representations of audio signals that capture the temporal dynamics of actions in a compact and informative way. The dataset was extracted from audio recordings taken from an existing video dataset, UCF101. Each sample is approximately 10 s long, and the dataset is split into 4893 training samples and 1944 testing samples. The resulting feature sequences were then converted into images, which can be used for human action recognition and related tasks. These images can …
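The abstract does not specify which six image representations MHAiR uses, but the general idea of turning an audio clip into a compact 2-D image can be sketched with a log-magnitude spectrogram. The function below, its name, and its frame/hop parameters are illustrative assumptions, not the dataset's actual pipeline:

```python
import numpy as np

def audio_to_spectrogram_image(signal, frame_len=512, hop=256):
    """Convert a 1-D audio signal into a 2-D log-magnitude
    spectrogram scaled to 0-255, suitable for saving as an image.
    A generic sketch; not MHAiR's exact representation."""
    # Slice the signal into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowed FFT magnitude per frame (one STFT column each).
    window = np.hanning(frame_len)
    mags = np.abs(np.fft.rfft(frames * window, axis=1))
    # Log compression and min-max scaling to an 8-bit image.
    log_mags = np.log1p(mags)
    img = 255 * (log_mags - log_mags.min()) / (np.ptp(log_mags) + 1e-9)
    return img.astype(np.uint8).T  # frequency x time

# Example: a 10 s synthetic tone at 16 kHz, matching the ~10 s
# sample duration mentioned in the abstract.
sr = 16_000
t = np.arange(10 * sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)
image = audio_to_spectrogram_image(audio)
```

Each such image encodes the clip's temporal dynamics along one axis and frequency content along the other, which is what makes it usable as input to image-based action-recognition models.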


Multimodal Fusion For Audio-Image And Video Action Recognition, Muhammad B. Shaikh, Douglas Chai, Syed M. S. Islam, Naveed Akhtar Jan 2024

Research outputs 2022 to 2026

Multimodal Human Action Recognition (MHAR) is an important research topic in the computer vision and event recognition fields. In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call the Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information using image representations of audio signals and spatial information from the video modality with Convolutional Neural Network (CNN)-based feature extractors, and fuse these features to recognize the respective action classes. We apply a high-level weights assignment algorithm to improve audio-visual interaction and convergence. This proposed fusion-based framework utilizes …
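The abstract does not give MAiVAR's exact fusion rule, but the described scheme of combining CNN features from the two modalities under assigned weights can be sketched as a generic weighted late fusion. The function name, feature dimensions, and weight values below are hypothetical:

```python
import numpy as np

def fuse_features(audio_feat, video_feat, w_audio=0.4, w_video=0.6):
    """Weighted late fusion of audio-image and video CNN features.
    Each modality is L2-normalized so the assigned weights, not raw
    feature magnitudes, control each modality's contribution."""
    a = audio_feat / (np.linalg.norm(audio_feat) + 1e-9)
    v = video_feat / (np.linalg.norm(video_feat) + 1e-9)
    # Concatenate the weighted modality features into one vector
    # that a downstream classifier head would consume.
    return np.concatenate([w_audio * a, w_video * v])

rng = np.random.default_rng(0)
audio_feat = rng.normal(size=512)   # e.g. CNN features of a spectrogram image
video_feat = rng.normal(size=2048)  # e.g. CNN features of video frames
fused = fuse_features(audio_feat, video_feat)
```

Normalizing before weighting is the design choice that makes the weights meaningful: without it, the higher-dimensional video features would dominate the fused vector regardless of the assigned weights.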