Research | Dr Vivek Singh - AI, Computer Vision & Medical Imaging

This project develops adaptive AI methods for understanding human activities in dynamic, real-world video. Most activity-recognition systems work under a closed-world assumption and hence can only recognise actions seen during training. In applications such as public safety, healthcare monitoring, and smart environments, new behaviours emerge continuously and often need to be recognised without rebuilding the system from scratch. This research focuses on developing a Continual Open-Vocabulary Action Detection framework that can learn new activity concepts over time, localise them in video, and retain previously learned knowledge. The aim is to make video understanding algorithms more flexible and useful in changing environments where safety, wellbeing, and timely response matter.

Human activity analysis is a key capability for intelligent systems that need to interpret what is happening in a scene, when it is happening, and why it may matter. Current methods have made strong progress in recognising predefined activities. Recent vision-language models have opened the door to open-vocabulary recognition, where actions can be described as concepts rather than fixed classes. These advances improve generalisation, but they do not fully solve the problem of long-term adaptation: a deployed system may still struggle when environments, routines, risks, or user needs evolve.

This research brings together open-vocabulary learning, continual learning, and spatiotemporal video understanding. The framework in this research explores how models can detect activities from semantic descriptions, learn incrementally from new examples, and avoid catastrophic forgetting of previously learned behaviours. This is especially important for real deployments, where retraining from scratch is expensive, data access may be limited, and the system must remain reliable as new scenarios appear.
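The open-vocabulary idea described above can be illustrated with a minimal sketch, which is not the project's actual framework. It assumes that a vision-language model has already produced an embedding for a video region and an embedding for each action description; the vectors below are hand-written toy values, and the class name `OpenVocabularyScorer` and its methods are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class OpenVocabularyScorer:
    """Hypothetical helper: scores a region embedding against named concepts."""

    def __init__(self):
        self.concepts = {}  # concept name -> text embedding

    def add_concept(self, name, text_embedding):
        # Adding a concept only appends an entry; earlier entries are untouched.
        self.concepts[name] = list(text_embedding)

    def score(self, region_embedding):
        return {name: cosine(region_embedding, emb)
                for name, emb in self.concepts.items()}

    def predict(self, region_embedding):
        scores = self.score(region_embedding)
        return max(scores, key=scores.get)


# Toy embeddings (assumed, not produced by a real model).
scorer = OpenVocabularyScorer()
scorer.add_concept("walking", [1.0, 0.0, 0.0])
scorer.add_concept("sitting", [0.0, 1.0, 0.0])

region = [0.1, 0.2, 0.9]
before = scorer.predict(region)   # best match among the two known concepts

scorer.add_concept("falling", [0.0, 0.0, 1.0])
after = scorer.predict(region)    # the newly added concept can now be matched
```

In this sketch a new concept is added without changing any existing entry, so earlier concepts remain recognisable. It does not address catastrophic forgetting in learned model weights, which is the harder problem the research targets.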

The wider impact is in making activity-aware AI more practical for society-facing applications. In assisted living, it could support safer monitoring while adapting to individual routines. In healthcare and rehabilitation, it could help track clinically relevant movements and behaviours over time. In public safety and smart-city contexts, it could support more responsive systems that recognise emerging events without requiring every possible scenario to be specified in advance. The project is therefore a great opportunity for collaboration between researchers and practitioners in computer vision, healthcare technology, robotics, human behaviour analysis, security, and responsible AI.

This project investigates how multimodal artificial intelligence can support precise and interpretable assessment of Autism Spectrum Disorder (ASD). ASD is a heterogeneous neurodevelopmental condition characterised by differences in social communication, interaction, behaviour, cognition, and sensory processing, and it varies substantially across individuals. Diagnosis currently relies heavily on specialist clinical assessment, which remains essential but can be time-consuming, resource-intensive, and difficult to access in many settings. By integrating behavioural, physiological, and clinical data, this research aims to develop AI tools that help clinicians understand ASD as a spectrum, support more personalised decision-making, and improve outcomes for autistic individuals and their families.

A central challenge in autism research is that no single measurement captures the full diversity of autistic experience. Brain imaging can reveal patterns in neural connectivity, behavioural assessments can describe social and communication profiles, and physiological signals may reflect differences in attention, arousal, or sensory processing. Each data source is incomplete in isolation, but together they can offer a richer representation of the individual. This project uses multimodal learning to connect these complementary views and identify patterns that may be missed by single-modality approaches.

The research explores MRI, behavioural data, and clinically relevant assessment measures. Deep learning models are being developed to identify ASD-related patterns, estimate variation across the spectrum, and investigate how these patterns relate to observed behaviour. Lightweight architectures (e.g. compact vision-transformer models) are also being studied to make analysis more efficient and more feasible for clinical environments with limited computational resources. A major emphasis of the project is generalisability, reliability, and robustness across heterogeneous patient populations. For AI to be useful in healthcare, models should maintain performance across demographic groups, acquisition settings, and underlying data distributions.

This research investigates learning architectures that can estimate relative severity and individual differences using multimodal behavioural and neuroimaging data. The goal is to move towards decision-support tools that complement clinical expertise, guide personalised intervention planning, and support better long-term outcomes. The broader impact of this work lies in building bridges between AI, neuroscience, psychology, and clinical practice. More reliable and interpretable AI tools could help reduce diagnostic uncertainty, support earlier access to services, and improve understanding of ASD heterogeneity.

This project focuses on interpretable artificial intelligence methods that use routine MRI scans to support brain tumour assessment before treatment. Brain tumours are clinically complex, and critical information about tumour grade, biological risk, and likely progression is often confirmed only after surgery, histopathology, or molecular testing. By extracting quantitative imaging patterns from preoperative MRI, this work aims to provide additional non-invasive evidence that can support risk stratification, treatment planning, and personalised decision-making in neuro-oncology. The main goal of this research is to provide transparent AI tools that help clinical teams make more informed decisions earlier in the patient pathway.

AI and MRI workflow for brain tumour assessment

MRI is central to brain tumour diagnosis and monitoring, but many clinically meaningful image patterns are difficult to assess reliably by visual inspection alone. Tumour shape, texture, intensity variation, boundary characteristics, and spatial relationships can contain information about underlying biology. This project uses radiomics, machine learning, deep learning, and vision transformers to convert these image patterns into measurable features that can be analysed alongside clinical information.
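To make the idea of converting image patterns into measurable features concrete, the following minimal sketch computes a few first-order intensity features inside a region of interest. It is an illustration only, not the project's pipeline: the function name `first_order_features`, the fixed four-bin histogram, and the toy 4x4 image and mask are all assumptions, and a real radiomics workflow would also cover shape and texture features.

```python
import math
from collections import Counter


def first_order_features(image, mask, bins=4):
    """Mean, variance, histogram entropy and area of the masked pixels.

    image, mask: 2D lists of equal shape; mask is 1 inside the region.
    """
    values = [pix
              for img_row, mask_row in zip(image, mask)
              for pix, inside in zip(img_row, mask_row) if inside]
    if not values:
        raise ValueError("mask selects no pixels")

    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance

    # Discretise intensities into equal-width bins, then take Shannon entropy.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a flat region
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    return {"mean": mean, "variance": variance, "entropy": entropy, "area": n}


# Toy image and mask (assumed values, not real MRI data).
image = [[0, 0, 0, 0],
         [0, 10, 20, 0],
         [0, 30, 40, 0],
         [0, 0, 0, 0]]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

feats = first_order_features(image, mask)
print(feats)
```

Features of this kind form a fixed-length vector per tumour region, which is what allows them to be combined with clinical variables in a conventional machine learning model.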

One important focus is meningioma, the most common primary intracranial tumour. The research develops machine learning models that combine preoperative MRI radiomics with clinical variables to predict tumour grade and biological risk. A second strand of the work explores advanced deep learning for tumour assessment. Transformer-based models are used to focus attention on tumour-relevant image regions, while prototype-based models aim to make predictions more understandable by comparing a new scan with learned representative examples. This is important because clinical AI must be more than accurate; it must also be inspectable. Researchers and clinicians need to understand what evidence the model used, where that evidence appears in the scan, and whether the explanation is clinically plausible.
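The prototype-based reasoning described above can be sketched in its simplest form as nearest-prototype classification, where the prediction is returned together with the prototype that produced it. This is a simplified stand-in rather than the models used in the research: the function name `nearest_prototype`, the two-dimensional feature vectors, and the labels are hypothetical toy values.

```python
import math


def nearest_prototype(features, prototypes):
    """Classify by the closest prototype and report which one was used.

    prototypes: list of (prototype_id, label, vector) tuples.
    Returns (label, prototype_id, distance).
    """
    best = min(prototypes, key=lambda p: math.dist(features, p[2]))
    return best[1], best[0], math.dist(features, best[2])


# Toy prototypes in an assumed 2D feature space.
prototypes = [
    ("p0", "low grade", [0.0, 0.0]),
    ("p1", "low grade", [1.0, 0.0]),
    ("p2", "high grade", [4.0, 3.0]),
]

label, evidence, distance = nearest_prototype([3.0, 3.0], prototypes)
print(label, evidence, distance)
```

Returning the prototype identifier alongside the label is what makes the decision inspectable: a reviewer can look up the representative example the new case was matched to and judge whether the comparison is plausible.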

The wider value of this research is in developing AI that is reproducible, transparent, and aligned with clinical needs. Non-invasive tumour characterisation could reduce uncertainty before surgery, inform multidisciplinary discussion, and support more personalised care. The methods are also relevant beyond one tumour type, offering a pathway for responsible AI in medical imaging where decisions affect diagnosis, prognosis, and treatment.

This research develops robust, multimodal, and multitask algorithms for surgical scene understanding, with a focus on generalisability across procedures, surgical phases, anatomical contexts, and surgery types. The work aims to build AI systems capable of jointly learning multiple surgical tasks and supporting clinically meaningful reasoning from surgical video data. The central questions of this research are: how well do the developed solutions adapt to procedural changes; how robust are they to variation across centres and surgical setups; and how can they be made computationally efficient enough for real-time deployment. The ultimate aim is to provide support for intraoperative awareness, workflow analysis, and future decision-support tools in the operating theatre.

Surgical video contains rich, complementary information about how an operation progresses, such as which tools are in use, what anatomical structures are visible, which actions are being performed, and how the surgical team responds to changing intraoperative context. Extracting this information automatically is challenging because surgical scenes are visually complex, deformable, and highly variable across patients, surgeons, camera viewpoints, and procedural stages. This research addresses these challenges through dataset design and computer vision methods spanning action detection, semantic scene understanding, domain adaptation, and multitask learning across surgical settings.

SARAS-ESAD surgical action detection dataset overview

SARAS-MESAD real and phantom surgical dataset overview

A major contribution is the SARAS-ESAD dataset, developed for surgical action detection in Robotic-Assisted Radical Prostatectomy (RARP). While centred on RARP, the dataset is designed as a foundation for studying how AI systems can recognise and localise fine-grained surgical actions in complex operative video. It comprises 21 action classes, with annotated frames that may contain one or more simultaneous action instances. Released through the SARAS-ESAD challenge, the dataset is available from the challenge website and dataset download page. Beyond the benchmark itself, the work establishes task definitions, annotation strategies, and evaluation protocols that can inform action understanding across other minimally invasive and robotic procedures.

The SARAS-MESAD dataset extends this by investigating whether training on surgical phantoms can improve action detection on real patient data. MESAD comprises two complementary subsets: MESAD-Real, drawn from real RARP procedures, and MESAD-Phantom, drawn from procedures performed on anatomically relevant training phantoms. This enables research into domain transfer, surgical skill acquisition, and the relationship between controlled training environments and live clinical data. It directly supports the objective of learning representations that remain useful when the data source, operative setting, or procedure type changes. Further details are available from the SARAS-MESAD challenge website and dataset download page.

The modelling work includes a surgical action detection baseline built around a Feature Pyramid Network architecture, in which a ResNet backbone extracts multi-level visual features from surgical frames and the detection head jointly predicts action categories and bounding-box locations. A second strand focuses on surgical tool and organ segmentation. Semantic segmentation is particularly demanding in endoscopic surgery because continuous organ deformation, rapid tool movement, and even small camera motions can substantially alter the visible scene. The segmentation models target semantic understanding of surgical scenes by identifying relevant tools and anatomical structures in a form that can support downstream analysis and safer AI-assisted workflows. Across the broader research programme, action detection and segmentation are treated as complementary modalities of scene understanding, to be integrated with temporal workflow modelling, surgical phase analysis, and procedural metadata.
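As a rough illustration of how a Feature Pyramid Network merges multi-level backbone features, the sketch below implements only the top-down pathway with lateral 1x1 projections, using NumPy. It is not the baseline's implementation: random arrays stand in for ResNet feature maps and learned weights, the function names are hypothetical, and the 3x3 smoothing convolutions and the detection head of a full FPN detector are omitted.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 in height and width."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def fpn_top_down(backbone_feats, lateral_weights):
    """Build pyramid maps from backbone features ordered fine to coarse.

    Each level is assumed to halve the spatial resolution of the previous one.
    """
    # Project every level to a common channel count.
    laterals = [conv1x1(f, w) for f, w in zip(backbone_feats, lateral_weights)]

    # Start from the coarsest map and add upsampled context to finer levels.
    pyramid = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        pyramid.insert(0, lateral + upsample2x(pyramid[0]))
    return pyramid


# Random stand-ins for three backbone levels and their lateral weights.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 16, 16)),
         rng.normal(size=(16, 8, 8)),
         rng.normal(size=(32, 4, 4))]
weights = [rng.normal(size=(4, 8)),
           rng.normal(size=(4, 16)),
           rng.normal(size=(4, 32))]

pyramid = fpn_top_down(feats, weights)
print([p.shape for p in pyramid])
```

Every output level shares the same channel count, which is what lets a single detection head be applied across scales to predict action categories and bounding boxes for both small and large instances.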

The overarching aim is to make surgical AI more interpretable, reusable, and clinically meaningful across procedures, rather than narrowly optimised for individual datasets. Robust multimodal and multitask learning can help surgeons analyse procedural workflow, support objective assessment of surgical performance, and enable comparison across surgical approaches. This research creates opportunities for collaboration across computer vision, robotics, surgical data science, and responsible AI for healthcare.