Robik Shrestha Personal Homepage

Robik Shrestha
LinkedIn GitHub Google Scholar Twitter

Applied Scientist at Amazon AGI

robikshrestha [at] gmail [dot] com

About Me


I am an Applied Scientist at Amazon AGI. I received my PhD in 2023 (advisor: Dr. Christopher Kanan). My research interests include multimodal vision-and-language systems, responsible AI, and generative AI.





Latest News


May 2024: Started as an Applied Scientist at Amazon AGI

Feb 2024: FairRAG, the work from my Amazon internship, has been accepted to CVPR 2024

Feb 2024: Joined the University of Rochester in a temporary role building generative AI models to optimize inertial confinement fusion

Dec 2023: Successfully defended my Ph.D.

May 2023: Started as an Applied Scientist Intern at Amazon AWS AI Labs. I am a part of the Responsible AI team

July 2022: Our paper "OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses" was accepted at ECCV 2022 (Oral)

Publications



FairRAG: Fair Human Generation via Fair Retrieval Augmentation

Existing text-to-image generative models reflect or even amplify societal biases ingrained in their training data. This is especially concerning for human image generation where models are biased against certain demographic groups. Existing attempts to rectify this issue are hindered by the inherent limitations of the pre-trained models and fail to substantially improve demographic diversity. In this work, we introduce Fair Retrieval Augmented Generation (FairRAG), a novel framework that conditions pre-trained generative models on reference images retrieved from an external image database to improve fairness in human generation. FairRAG enables conditioning through a lightweight linear module that projects reference images into the textual space. To enhance fairness, FairRAG applies simple-yet-effective debiasing strategies, providing images from diverse demographic groups during the generative process. Extensive experiments demonstrate that FairRAG outperforms existing methods in terms of demographic diversity, image-text alignment, and image fidelity while incurring minimal computational overhead during inference.
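
For illustration only, a minimal sketch (not the released FairRAG code) of the two ingredients the abstract describes: a lightweight linear module that projects reference-image features into the generator's textual embedding space, and a sampler that draws reference images evenly across demographic groups. The module names, dimensions, and the round-robin sampling rule are assumptions made for this example.

import random
import torch
import torch.nn as nn

class ReferenceProjector(nn.Module):
    # Projects reference-image embeddings into pseudo text tokens for the generator.
    def __init__(self, image_dim=768, text_dim=768, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(image_dim, text_dim * num_tokens)  # single linear layer
        self.num_tokens, self.text_dim = num_tokens, text_dim

    def forward(self, image_emb):                      # (batch, image_dim)
        tokens = self.proj(image_emb)                  # (batch, text_dim * num_tokens)
        return tokens.view(-1, self.num_tokens, self.text_dim)

def sample_balanced_references(bank, k):
    # bank: dict mapping a demographic group name -> list of image-embedding tensors.
    # Round-robin over groups so every group contributes references.
    groups = list(bank.keys())
    picks = [random.choice(bank[groups[i % len(groups)]]) for i in range(k)]
    return torch.stack(picks)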

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)

Paper Bibtex
@inproceedings{shrestha2024fairrag,
  title={FairRAG: Fair Human Generation via Fair Retrieval Augmentation},
  author={Shrestha, Robik and Zou, Yang and Chen, Qiuyu and Li, Zhiheng and Xie, Yusheng and Deng, Siqi},
  booktitle={CVPR},
  year={2024}
}

OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses

Dataset bias and spurious correlations can significantly impair generalization in deep neural networks. Many prior efforts have addressed this problem using either alternative loss functions or sampling strategies that focus on rare patterns. We propose a new direction: modifying the network architecture to impose inductive biases that make the network robust to dataset bias. Specifically, we propose OccamNets, which are biased to favor simpler solutions by design. OccamNets have two inductive biases. First, they are biased to use as little network depth as needed for an individual example. Second, they are biased toward using fewer image locations for prediction. While OccamNets are biased toward simpler hypotheses, they can learn more complex hypotheses if necessary. In experiments, OccamNets outperform or rival state-of-the-art methods run on architectures that do not incorporate these inductive biases. Furthermore, we demonstrate that when the state-of-the-art debiasing methods are combined with OccamNets, results further improve.
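
As a rough illustration of the first inductive bias (using as little depth as needed per example), here is a minimal early-exit sketch; it is not the authors' implementation, and the backbone, exit heads, and confidence threshold are assumptions for the example. In the actual setup the exits would also be trained jointly, and the paper's second bias toward fewer image locations is omitted here.

import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, num_classes=10, width=64, num_blocks=3, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(3 if i == 0 else width, width, 3, padding=1), nn.ReLU())
             for i in range(num_blocks)])
        # A lightweight classifier head ("exit") after every block.
        self.exits = nn.ModuleList(
            [nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, num_classes))
             for _ in range(num_blocks)])
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            logits = exit_head(x)
            confidence = logits.softmax(dim=-1).max(dim=-1).values
            if bool((confidence > self.threshold).all()):
                return logits                          # easy inputs stop at a shallow exit
        return logits                                  # hard inputs use the full depth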

European Conference on Computer Vision (ECCV 2022), Oral

Paper Code Bibtex
@inproceedings{shrestha2022occamnets,
  title={OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses},
  author={Shrestha, Robik and Kafle, Kushal and Kanan, Christopher},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022}
}

An Investigation of Critical Issues in Bias Mitigation Techniques

A critical problem in deep learning is that systems learn inappropriate biases, resulting in their inability to perform well on minority groups. This has led to the creation of multiple algorithms that endeavor to mitigate bias. However, it is not clear how effective these methods are. This is because study protocols differ among papers, systems are tested on datasets that fail to test many forms of bias, and systems have access to hidden knowledge or are tuned specifically to the test set. To address this, we introduce an improved evaluation protocol, sensible metrics, and a new dataset, which enables us to ask and answer critical questions about bias mitigation algorithms. We evaluate seven state-of-the-art algorithms using the same network architecture and hyperparameter selection policy across three benchmark datasets. We introduce a new dataset called Biased MNIST that enables assessment of robustness to multiple bias sources. We use Biased MNIST and a visual question answering (VQA) benchmark to assess robustness to hidden biases. Rather than only tuning to the test set distribution, we study robustness across different tuning distributions, which is critical because for many applications the test distribution may not be known during development. We find that algorithms exploit hidden biases, are unable to scale to multiple forms of bias, and are highly sensitive to the choice of tuning set. Based on our findings, we implore the community to adopt more rigorous assessment of future bias mitigation methods.
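
A schematic sketch of the protocol point above: hyperparameters are selected on several different tuning distributions, and the chosen model is then scored on the test set, making the dependence on the tuning choice explicit. The helper functions and argument names below are hypothetical placeholders, not the paper's code.

def select_and_test(method, hyperparam_grid, tuning_sets, test_set, train_fn, accuracy_fn):
    # tuning_sets: dict mapping a tuning-distribution name -> validation split.
    results = {}
    for tune_name, tune_split in tuning_sets.items():
        best_acc, best_model = -1.0, None
        for hp in hyperparam_grid:
            model = train_fn(method, hp)            # train one candidate per setting
            acc = accuracy_fn(model, tune_split)    # select on this tuning distribution
            if acc > best_acc:
                best_acc, best_model = acc, model
        results[tune_name] = accuracy_fn(best_model, test_set)
    return results  # test accuracy as a function of which tuning distribution was used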

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2022)

Paper Code Bibtex
@inproceedings{shrestha2021investigation,
  title={An Investigation of Critical Issues in Bias Mitigation Techniques},
  author={Shrestha, Robik and Kafle, Kushal and Kanan, Christopher},
  booktitle={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2022}
}


Detecting Spurious Correlations With Sanity Tests for Artificial Intelligence Guided Radiology Systems

Artificial intelligence (AI) has been successful at solving numerous problems in machine perception. In radiology, AI systems are rapidly evolving and show progress in guiding treatment decisions, diagnosing, localizing disease on medical images, and improving radiologists' efficiency. A critical component to deploying AI in radiology is to gain confidence in a developed system's efficacy and safety. The current gold standard approach is to conduct an analytical validation of performance on a generalization dataset from one or more institutions, followed by a clinical validation study of the system's efficacy during deployment. Clinical validation studies are time-consuming, and best practices dictate limited re-use of analytical validation data, so it is ideal to know ahead of time if a system is likely to fail analytical or clinical validation. In this paper, we describe a series of sanity tests to identify when a system performs well on development data for the wrong reasons. We illustrate the sanity tests' value by designing a deep learning system to classify pancreatic cancer seen in computed tomography scans.

Frontiers in Digital Health (2021)

Paper Bibtex
@article{10.3389/fdgth.2021.671015,
  author={Mahmood, Usman and Shrestha, Robik and Bates, David D. B. and Mannelli, Lorenzo and Corrias, Giuseppe and Erdi, Yusuf Emre and Kanan, Christopher},
  title={Detecting Spurious Correlations With Sanity Tests for Artificial Intelligence Guided Radiology Systems},
  journal={Frontiers in Digital Health},
  volume={3},
  year={2021},
  url={https://www.frontiersin.org/article/10.3389/fdgth.2021.671015},
  doi={10.3389/fdgth.2021.671015},
  issn={2673-253X}
}


A Negative Case Analysis of Visual Grounding Methods for VQA

Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.

Annual Meeting of the Association for Computational Linguistics (ACL 2020)

Paper Code Bibtex
@inproceedings{shrestha-etal-2020-negative,
title = "A negative case analysis of visual grounding methods for {VQA}",
author = "Shrestha, Robik  and
  Kafle, Kushal  and
  Kanan, Christopher",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.727",
pages = "8172--8181"
}


On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law

Out-of-distribution (OOD) testing is increasingly popular for evaluating a machine learning system's ability to generalize beyond the biases of a training set. OOD benchmarks are designed to present a different joint distribution of data and labels between training and test time. VQA-CP has become the standard OOD benchmark for visual question answering, but we discovered three troubling practices in its current use. First, most published methods rely on explicit knowledge of the construction of the OOD splits. They often rely on 'inverting' the distribution of labels, e.g. answering mostly 'yes' when the common training answer is 'no'. Second, the OOD test set is used for model selection. Third, a model's in-domain performance is assessed after retraining it on in-domain splits (VQA v2) that exhibit a more balanced distribution of labels. These three practices defeat the objective of evaluating generalization, and put into question the value of methods specifically designed for this dataset. We show that embarrassingly-simple methods, including one that generates answers at random, surpass the state of the art on some question types. We provide short- and long-term solutions to avoid these pitfalls and realize the benefits of OOD evaluation.

Neural Information Processing Systems (NeurIPS 2020)

Paper Bibtex

@inproceedings{teney2020value,
  title={On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law},
  author={Teney, Damien and Kafle, Kushal and Shrestha, Robik and Abbasnejad, Ehsan and Kanan, Christopher and Hengel, Anton van den},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}


REMIND Your Neural Network to Prevent Catastrophic Forgetting

In lifelong machine learning, an agent must be incrementally updated with new knowledge, instead of having distinct train and deployment phases. For incrementally training convolutional neural network models, prior work has enabled replay by storing raw images, but this is memory intensive and not ideal for embedded agents. Here, we propose REMIND, a tensor quantization approach that enables efficient replay with tensors. Unlike other methods, REMIND is trained in a streaming manner, meaning it learns one example at a time rather than in large batches containing multiple classes. Our approach achieves state-of-the-art results for incremental class learning on the ImageNet-1K dataset. We demonstrate REMIND's generality by pioneering multi-modal incremental learning for visual question answering (VQA), which cannot be readily done with comparison models.
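
A toy sketch of the replay idea, with heavy simplifications: REMIND stores product-quantized mid-network feature tensors, whereas the buffer below uses plain uniform 8-bit quantization, and the capacity, eviction rule, and frozen-backbone split are assumptions for the example.

import random
import torch

class FeatureReplayBuffer:
    # Stores compressed intermediate feature tensors instead of raw images.
    def __init__(self, capacity=10000):
        self.store, self.capacity = [], capacity

    def add(self, feats, label):
        lo, hi = feats.min(), feats.max()
        codes = ((feats - lo) / (hi - lo + 1e-8) * 255).round().to(torch.uint8)  # ~4x smaller than float32
        if len(self.store) >= self.capacity:
            self.store.pop(random.randrange(len(self.store)))   # random eviction when full
        self.store.append((codes, lo, hi, label))

    def sample(self, n):
        batch = random.sample(self.store, min(n, len(self.store)))
        feats = torch.stack([codes.float() / 255 * (hi - lo) + lo for codes, lo, hi, _ in batch])
        labels = torch.tensor([label for *_, label in batch])
        return feats, labels

In this sketch, each new example would be compressed, added to the buffer, and mixed with a handful of reconstructed (replayed) features before updating only the layers above the stored representation.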

European Conference on Computer Vision (ECCV 2020)

Paper Code Bibtex
@article{hayes2019remind,
  title={REMIND Your Neural Network to Prevent Catastrophic Forgetting},
  author={Hayes, Tyler L and Kafle, Kushal and Shrestha, Robik and Acharya, Manoj and Kanan, Christopher},
  journal={arXiv preprint arXiv:1910.02509},
  year={2019}
}


Parallel Recurrent Fusion for Chart Question Answering

Chart question answering (CQA) is a newly proposed visual question answering (VQA) task where an algorithm must answer questions about data visualizations, e.g. bar charts, pie charts, and line graphs. Here, we propose a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL). PReFIL first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question. Despite its simplicity, PReFIL greatly surpasses state-of-the-art systems and human baselines on both the FigureQA and DVQA datasets. Additionally, we demonstrate that PReFIL can be used to reconstruct tables by asking a series of questions about a chart.
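
A minimal sketch of a single bimodal fusion branch, assuming the question has already been encoded into a fixed-size vector: the question features are tiled over the chart feature map, concatenated channel-wise, fused with 1x1 convolutions, pooled, and classified. PReFIL itself fuses features at multiple depths and aggregates the fused embeddings recurrently; the dimensions below are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class BimodalFusion(nn.Module):
    def __init__(self, img_channels=256, q_dim=512, hidden=256, num_answers=100):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(img_channels + q_dim, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU())
        self.classify = nn.Linear(hidden, num_answers)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, C, H, W) chart features; q_feats: (B, q_dim) question encoding.
        b, _, h, w = img_feats.shape
        q_map = q_feats[:, :, None, None].expand(b, q_feats.size(1), h, w)
        fused = self.fuse(torch.cat([img_feats, q_map], dim=1))   # bimodal embedding per location
        return self.classify(fused.mean(dim=(2, 3)))              # aggregate, then answer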

IEEE Winter Conference on Applications of Computer Vision (WACV 2020)

Paper Bibtex
@inproceedings{kafle2020answering,
  title={Answering Questions about Data Visualizations using Efficient Bimodal Fusion},
  author={Kafle, Kushal and Shrestha, Robik and Cohen, Scott and Price, Brian and Kanan, Christopher},
  booktitle={The IEEE Winter Conference on Applications of Computer Vision},
  pages={1498--1507},
  year={2020}
}


Challenges and Prospects in Vision and Language Research

Language grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, rather than behaving as visual Turing tests, recent studies have demonstrated state-of-the-art systems are achieving good performance through flaws in datasets and evaluation procedures. We review the current state of affairs and outline a path forward.

Frontiers in Artificial Intelligence - Language and Computation (2019)

Paper Bibtex
@article{kafle2019challenges,
  title={Challenges and Prospects in Vision and Language Research},
  author={Kafle, Kushal and Shrestha, Robik and Kanan, Christopher},
  journal={arXiv preprint arXiv:1904.09317},
  year={2019}
}


Answer Them All! Toward Universal Visual Question Answering Models

Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019)

Paper Code Bibtex
@inproceedings{shrestha2019ramen,
  title={Answer Them All! Toward Universal Visual Question Answering Models},
  author={Shrestha, Robik and Kafle, Kushal and Kanan, Christopher},
  booktitle={CVPR},
  year={2019}
}