Reliability diagnostics guiding current research directions.

Reliability-First Deep Learning Systems

Research Theme · AI Reliability · Attention Models

We study how deep learning systems fail in the wild and build guardrails so they do not fail twice. The projects below combine large-scale fault mining, interpretable diagnostics, and automation across attention-heavy architectures to catch silent failures before deployment. The goal is simple: trustworthy models that earn their place in safety-critical software.

Benchmark of classic and ML-era bug localization on DNN vs. traditional faults.

Understanding the Diagnostic Gap of DNNs

Bug Localization · Deep Learning Faults · Empirical Study · Extrinsic Failures

This empirical study evaluates eight widely used fault detection techniques on 2,365 DNN faults and 2,913 traditional software faults, finding that existing methods perform roughly 35% worse on DNN systems. Training faults such as exploding gradients and model faults such as incompatible layer shapes turn out to be fundamentally different failure modes from conventional logic errors: static analysis techniques struggle in particular on tensor-related issues, while dynamic approaches achieve success rates below 50%. The study also finds that DNN systems exhibit four times more extrinsic faults, caused by external factors such as GPU failures, establishing the need for specialized diagnostic approaches.
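To make the "incompatible layer shapes" fault class concrete, here is a minimal sketch of the kind of static shape check that can catch it before training starts. The layer representation and function name are illustrative assumptions, not an API from the study:

```python
def check_layer_shapes(layers):
    """Flag adjacent dense layers whose output/input widths disagree.

    `layers` is a list of (name, in_features, out_features) tuples;
    this representation is an illustrative assumption, not a real API.
    """
    faults = []
    for (n1, _, out1), (n2, in2, _) in zip(layers, layers[1:]):
        if out1 != in2:
            faults.append(f"{n1} outputs {out1} features but {n2} expects {in2}")
    return faults

# A buggy 3-layer MLP: the hidden layer emits 128 features,
# but the classifier head expects 64.
net = [("fc1", 784, 256), ("fc2", 256, 128), ("head", 64, 10)]
print(check_layer_shapes(net))  # ['fc2 outputs 128 features but head expects 64']
```

Dynamic faults like exploding gradients, by contrast, only surface at runtime, which is one reason purely static techniques fare poorly on DNN systems.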

DEFault: hierarchical and explainable classifier combining static and dynamic signals.

Bridging the Diagnostic Gap of Standard DNNs

Fault Detection · Deep Neural Networks · Hierarchical Classification · Explainable AI

Building on the diagnostic-gap findings, this work introduces DEFault, a hierarchical fault diagnosis technique that integrates static and dynamic analysis to address model- and training-related faults in deep neural networks, achieving 11.54% higher performance than four state-of-the-art approaches. DEFault employs a three-stage hierarchical learning approach covering all major fault categories across feedforward, convolutional, and recurrent neural networks, trained on 14,000 DNN programs, and provides interpretable explanations for each detected fault, making it the most comprehensive fault diagnosis solution for standard DNN architectures.
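A hypothetical sketch of the three-stage hierarchical decision such a classifier makes. The feature names and thresholds are invented for illustration; DEFault learns these stages from data rather than hard-coding rules:

```python
def diagnose(features):
    """Three-stage hierarchical decision: healthy vs. faulty,
    then model vs. training fault, then a concrete category."""
    # Stage 1: did any static or dynamic signal raise an alarm?
    if not (features["static_shape_error"] or features["loss_diverged"]):
        return "healthy"
    # Stage 2: static signals implicate the model definition itself.
    if features["static_shape_error"]:
        return "model fault: incompatible layer shapes"
    # Stage 3: refine the training fault using runtime telemetry.
    if features["grad_norm"] > 1e3:
        return "training fault: exploding gradients"
    return "training fault: unstable loss"

run = {"static_shape_error": False, "loss_diverged": True, "grad_norm": 5e4}
print(diagnose(run))  # training fault: exploding gradients
```

Factoring the decision hierarchically is also what makes the output explainable: each stage's verdict names the evidence that drove it.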

Agent-based pipeline editor and telemetry dashboard inside DEEBug.

DEEBug: An LLM Platform for DNN Debugging

Large Language Models · Automated Debugging · Tool Building · Commercialization

As principal investigator, I lead DEEBug, a commercialization-ready platform that orchestrates large language models such as Gemini and DeepSeek to automate fault diagnosis in deep learning workflows. The system provides a modular pipeline editor, agent-based reasoning over telemetry, and real-time dashboards that surface emerging reliability issues before they reach production. Five student developers are co-creating the stack under my supervision, and we are preparing the prototype for Lab2Market with an integrated IP strategy, observability tooling, and safeguards that make responsible AI debugging deployable for research labs and industry teams.

First broad study of attention faults across real projects and frameworks.

Characterizing Attention-Specific Fault Patterns

Attention Mechanisms · Fault Taxonomy · Transformer Architectures · Silent Failures

This empirical study presents the first comprehensive analysis of fault patterns unique to attention-based neural networks (ABNNs), examining 555 real-world faults from 96 projects across 10 frameworks. Over half of ABNN faults fall into seven previously unreported categories arising exclusively from attention mechanisms, and the study identifies 25 distinct root causes, including QKV dimension mismatches and dynamic mask-generation errors, that frequently lead to silent failures occurring twice as often as in standard DNNs. Through systematic analysis with the Apriori algorithm, the study derives four evidence-based diagnostic heuristics that together explain 33% of attention-specific faults, providing the first systematic diagnostic guidance for transformer-based architectures.

Second-order signals reveal QKV head interactions that gradients miss.

Hessian-based Analysis for Attention Models

Hessian Analysis · Second-order Optimization · Attention Mechanisms · Curvature Analysis

This research explores second-order Hessian analysis as a diagnostic approach for attention-based neural networks, addressing a limitation of gradient-based techniques: they capture only first-order parameter sensitivity and cannot model the complex multi-parameter interactions central to attention mechanisms. Through controlled perturbation experiments on Hierarchical Attention Networks, 3D-CNNs, and DistilBERT, the study demonstrates that Hessian-derived metrics, including curvature analysis and parameter-interdependency mapping, localize instability and pinpoint fault sources more effectively than gradient-based approaches alone, establishing second-order analysis as a promising direction for fault diagnosis in complex neural architectures.
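A toy illustration of why curvature carries a signal that gradients miss (the loss function and parameters are invented, not the paper's experiments): at a minimum, first-order gradients vanish for every parameter, while the Hessian diagonal still separates a stiff, instability-prone direction from a benign flat one:

```python
import numpy as np

def loss(w):
    # Strong curvature along w[0], weak along w[1], plus an
    # interaction term of the kind attention heads introduce.
    return 50 * w[0] ** 2 + 0.1 * w[1] ** 2 + 0.5 * w[0] * w[1]

def grad(f, w, eps=1e-5):
    """Central-difference gradient."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

def hessian_diag(f, w, eps=1e-4):
    """Second central difference: diagonal curvature per parameter."""
    h = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        h[i] = (f(w + e) - 2 * f(w) + f(w - e)) / eps ** 2
    return h

w = np.zeros(2)                          # at the minimum
print(np.round(grad(loss, w), 3))        # ~[0. 0.]: gradients see nothing
print(np.round(hessian_diag(loss, w), 2))  # ~[100. 0.2]: w[0] is the stiff axis
```

Full Hessian analysis additionally recovers the off-diagonal interaction term (here 0.5), the multi-parameter coupling that purely first-order diagnostics cannot represent.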

92K reports across three systems. Why dissimilar duplicates evade detectors.

Impacts of Textual Dissimilarity on Duplicate Bug Report Detection

Duplicate Bug Detection · Textual Dissimilarity · Software Maintenance · Natural Language Processing

This empirical investigation analyzes the challenge of textual dissimilarity in duplicate bug report detection through a large-scale study of 92,000 bug reports from three open-source systems, demonstrating that existing techniques perform poorly when duplicate reports exhibit significant textual differences despite describing the same underlying issue. Textually dissimilar duplicates frequently lack essential components such as steps to reproduce, creating vocabulary gaps that confound traditional similarity-based detection methods. The work proposes domain-specific embedding approaches combined with convolutional neural networks to address these limitations, and establishes critical insights into the natural-language variation challenges facing automated duplicate detection systems.
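The vocabulary-gap failure mode is easy to see with a minimal bag-of-words cosine similarity, the backbone of classic duplicate detectors. The two report pairs below are invented examples, not reports from the studied systems:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts under a bag-of-words model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

# Duplicates that share vocabulary score well...
dup_similar = ("app crashes when saving a file",
               "application crashes while saving files")
# ...but duplicates describing the same crash in different terms
# share no tokens, so lexical similarity collapses to zero.
dup_dissimilar = ("app crashes when saving a file",
                  "segfault in the export routine on disk write")

print(round(cosine(*dup_similar), 2))
print(round(cosine(*dup_dissimilar), 2))   # 0.0: invisible to the detector
```

Domain-specific embeddings close this gap by mapping "crashes" and "segfault" to nearby vectors, so semantically matching reports stay detectable even with disjoint vocabularies.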

Research Service

Program Committee Member

  • Software Engineering in Practice track, International Conference on Software Engineering (ICSE 2026)
  • International Workshop on Causal Methods in Software Engineering (CauSE) @ FSE 2026
  • Mining Challenge Track, International Conference on Mining Software Repositories (MSR 2026)
  • Junior PC, International Conference on Mining Software Repositories (MSR 2026)
  • International Workshop on Deep Learning for Testing and Testing for Deep Learning (DeepTest) @ ICSE 2026
  • International Workshop on Validation, Analysis and Evolution of Software Tests (VST) @ SANER 2026

Journal Reviewing

  • Springer Nature Cluster Computing: The Journal of Networks, Software Tools and Applications (2025)

Conference Sub-Reviewing

  • ACM International Conference on the Foundations of Software Engineering (FSE 2025, 2024)
  • IEEE/ACM International Conference on Automated Software Engineering (ASE 2025, 2023)
  • International Conference on Mining Software Repositories (MSR 2022)

Journal Sub-Reviewing

  • ACM Transactions on Software Engineering and Methodology (TOSEM 2025, 2023)
  • Elsevier Journal of Systems and Software (JSS 2025, 2023)
  • Springer Empirical Software Engineering (EMSE 2022)

Editorial & Organizational

  • Web & Publicity Chair, Consortium for Software Engineering Research (CSER Spring 2025)

Volunteer Roles

  • Volunteer, Consortium for Software Engineering Research (CSER Spring 2025)