Eklavya Sarkar

PhD Candidate in Machine Learning

Final-year PhD student at EPFL and Research Assistant at Idiap Research Institute,
working in the Speech and Audio Processing group, under Dr. Mathew Magimai Doss.

My research focuses on speech and audio methods based on self-supervised
representation learning for analyzing human and non-human vocal communication,
for the wider purpose of studying the evolution of language.

Previously, I worked on computer vision topics such as deepfakes and biometric
spoofing, as well as physics research and development at CERN in the CMS experiment.

Publications

Acoustic Tokens, Discretization, Vector Quantization, Bioacoustics

Leveraging Sequential Structure in Animal Vocalizations

Authors: Eklavya Sarkar, Mathew Magimai-Doss.

Journal, Fine-Tuning, Speech, Bioacoustics

Adaptation of Speech and Bioacoustics Models

Authors: Eklavya Sarkar, Amir Mohammadi, Mathew Magimai-Doss.

Journal, Frequency Response, Feature Representation, Bioacoustics

On Feature Representations for Marmoset Vocal Communication Analysis

Abstract

Authors: Eklavya Sarkar, Mathew Magimai-Doss. The acoustic analysis of marmoset (Callithrix jacchus) vocalizations is often used to understand the evolutionary origins of human language. Currently, the analysis is largely carried out in a manual or semi-manual manner. Thus, there is a need to develop automatic call analysis methods. In that direction, research has been limited to the development of analysis methods with small amounts of data or for specific scenarios. Furthermore, there is a lack of prior knowledge about what type of information is relevant for different call analysis tasks. To address these issues, as a first step, this paper explores different feature representation methods, namely, HCTSA-based hand-crafted features Catch22, pre-trained self-supervised learning (SSL) based features extracted from neural networks trained on human speech and end-to-end acoustic modeling for call-type classification, caller identification and caller sex identification. Through an investigation on three different marmoset call datasets, we demonstrate that SSL-based feature representations and end-to-end acoustic modeling tend to lead to better systems than Catch22 features for call-type and caller classification. Furthermore, we also highlight the impact of signal bandwidth on the obtained task performances.

Accepted at Bioacoustics 📓

Pre-training Domain, Fine-Tuning, Self-Supervised Learning, Bioacoustics

Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing

Abstract

Authors: Eklavya Sarkar, Mathew Magimai-Doss. Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors applicable to a wide range of tasks. Such models pre-trained on human speech have demonstrated high transferability for bioacoustic processing. This paper investigates (i) whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech, and (ii) whether fine-tuning speech-pretrained models on automatic speech recognition (ASR) tasks can enhance bioacoustic classification. We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. Results indicate that pre-training on bioacoustic data provides only marginal improvements over speech-pretrained models, with comparable performance in most scenarios. Fine-tuning on ASR tasks yields mixed outcomes, suggesting that the general-purpose representations learned during SSL pre-training are already well-suited for bioacoustic tasks. These findings highlight the robustness of speech-pretrained SSL models for bioacoustics and imply that extensive fine-tuning may not be necessary for optimal performance.

Accepted at ICASSP 2025 🇮🇳

Foundation Models, Pre-training Domain, Bandwidth, Bioacoustics

On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis

Abstract

Authors: Eklavya Sarkar, Mathew Magimai-Doss. Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. Traditionally analyzed with signal processing-based features, recent approaches have utilized self-supervised models pre-trained on human speech for feature extraction, capitalizing on their ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models remains unclear for marmoset call analysis in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks. Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.

Accepted at Interspeech 2024 🇬🇷

Feature Representation, Call-Type Classification, Bioacoustics

Feature Representations for Automatic Meerkat Vocalization Classification

Abstract

Authors: Imen Ben Mahmoud, Eklavya Sarkar, Marta Manser, Mathew Magimai-Doss. Understanding the evolution of vocal communication in social animals is an important research problem. In that context, beyond humans, there is an interest in analyzing vocalizations of other social animals such as meerkats, marmosets, and apes. While existing approaches address vocalizations of certain species, a reliable method tailored for meerkat calls is lacking. To that extent, this paper investigates feature representations for automatic meerkat vocalization analysis. Both traditional signal processing-based representations and data-driven representations facilitated by advances in deep learning are explored. Call type classification studies conducted on two data sets reveal that feature extraction methods developed for human speech processing can be effectively employed for automatic meerkat call analysis.

Accepted at Interspeech 2024 🇬🇷

Self-Supervised Learning, Speaker ID, Speech

Can Self-Supervised Neural Networks Pre-Trained on Human Speech distinguish Animal Callers?

Abstract

Authors: Eklavya Sarkar, Mathew Magimai-Doss. Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space. This implies that the utility of such representations is not limited to modeling human speech alone. Building on this understanding, this paper explores the cross-transferability of SSL neural representations learned from human speech to analyze bio-acoustic signals. We conduct a caller discrimination analysis and a caller detection study on Marmoset vocalizations using eleven SSL models pre-trained with various pretext tasks. The results show that the embedding spaces carry meaningful caller information and can successfully distinguish the individual identities of Marmoset callers without fine-tuning. This demonstrates that representations pre-trained on human speech can be effectively applied to the bio-acoustics domain, providing valuable insights for future investigations in this field.

Accepted at Interspeech 2023 🇮🇪

Signal Processing, VAD, Speech

Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering

Abstract

Authors: Eklavya Sarkar, RaviShankar Prasad, Mathew Magimai-Doss. Voice activity detection (VAD) is an important pre-processing step for speech technology applications. The task consists of deriving segment boundaries of audio signals which contain voicing information. In recent years, it has been shown that voice source and vocal tract system information can be extracted using zero-frequency filtering (ZFF) without making any explicit model assumptions about the speech signal. This paper investigates the potential of zero-frequency filtering for jointly modeling voice source and vocal tract system information, and proposes two approaches for VAD. The first approach demarcates voiced regions using a composite signal composed of different zero-frequency filtered signals. The second approach feeds the composite signal as input to the rVAD algorithm. These approaches are compared with other supervised and unsupervised VAD methods in the literature, and are evaluated on the Aurora-2 database, across a range of SNRs (20 to -5 dB). Our studies show that the proposed ZFF-based methods perform comparably to state-of-the-art VAD methods and are more invariant to added degradation and different channel characteristics.

Accepted at Interspeech 2022 🇰🇷

StyleGAN2, Face Recognition, Biometrics

Are GAN-based Morphs Threatening Face Recognition?

Abstract

Authors: Eklavya Sarkar, Pavel Korshunov, Laurent Colbois, Sébastien Marcel. Morphing attacks are a threat to biometric systems where the biometric reference in an identity document can be altered. This form of attack presents an important issue in applications relying on identity documents such as border security or access control. Research in generation of face morphs and their detection is developing rapidly, however very few datasets with morphing attacks and open-source detection toolkits are publicly available. This paper bridges this gap by providing two datasets and the corresponding code for four types of morphing attacks: two that rely on facial landmarks based on OpenCV and FaceMorpher, and two that use StyleGAN 2 to generate synthetic morphs. We also conduct extensive experiments to assess the vulnerability of four state-of-the-art face recognition systems, including FaceNet, VGG-Face, ArcFace, and ISV. Surprisingly, the experiments demonstrate that, although visually more appealing, morphs based on StyleGAN 2 do not pose a significant threat to state-of-the-art face recognition systems, as these morphs were outmatched by the simple morphs that are based on facial landmarks.

Accepted at ICASSP 2022 🇸🇬

StyleGAN2, Face Recognition, Biometrics

Vulnerability Analysis of Face Morphing Attacks from Landmarks and Generative Adversarial Networks

Abstract

Authors: Eklavya Sarkar, Pavel Korshunov, Laurent Colbois, Sébastien Marcel. Morphing attacks are a threat to biometric systems where the biometric reference in an identity document can be altered. This form of attack presents an important issue in applications relying on identity documents such as border security or access control. Research in face morphing attack detection is developing rapidly, however very few datasets with several forms of attacks are publicly available. This paper bridges this gap by providing a new dataset with four different types of morphing attacks, based on OpenCV, FaceMorpher, WebMorph and a generative adversarial network (StyleGAN), generated with original face images from three public face datasets. We also conduct extensive experiments to assess the vulnerability of the state-of-the-art face recognition systems, notably FaceNet, VGG-Face, and ArcFace. The experiments demonstrate that VGG-Face, while being a less accurate face recognition system compared to FaceNet, is also less vulnerable to morphing attacks. Also, we observed that naïve morphs generated with a StyleGAN do not pose a significant threat.

Idiap-RR-38-2020

Work Experience

Speech Processing

Research Assistant (PhD Candidate)

Idiap Research Institute

Supervisor: Dr. Mathew Magimai Doss, Speech and Audio Processing Group

Self-Supervised Speech Learning, Representation Learning
SSL, VAD, Diarization, ASWUs, Bioacoustics
Audio Segmentation Methods for Analyzing Vocal Communication: From Humans to Animals.
Low-resource speech and animal vocalization processing.
Working on EvoLang Project, TTF Tech ASR.

March 2021 - Present

Biometrics, ML, DL, GANs

Research Intern

Idiap Research Institute

Supervisor: Dr. Sébastien Marcel, HOD Biometrics Security and Privacy Group

Developed and released StyleGAN2 latent space editing code for morphing.
Implemented different techniques to generate traditional and StyleGAN2-based face morphs.
Investigated vulnerabilities of modern facial recognition systems against morphing attacks.
Currently researching detection techniques for such attacks, with the aim of publishing a paper by November.

May 2020 - Feb 2021
(10 months)

SE, DS, R&D

Intern

CERN

Project Manager: Dr. Archana Sharma, Principal Scientist, CMS Experiment

Contributed to CERN's CMS-GEM-DAQ project's production code: PR1, PR2.
Refined efficiency of production code by implementing requested features on Python scripts.
Improved code used for testing the detector in a QC stand by adding a step-size feature.
Created method for configuring detector’s electrical state with custom values.
Published real-time gas levels of a mixer by writing code to send data to a server via an API.

July 2017 - September 2017
(3 months)

Thesis

DL, ML, 130 Pages

Transferability of Learnt Speech Representations for Decoding Non-Human Vocal Communication

Ph.D., Speech, Bioacoustics, Animal Vocalizations

Investigated learnt speech representations and their transferability to animal vocalizations.
Published 6+ first-author papers at top ML conferences and journals.
Research topics: speech processing, self-supervised learning, bio-acoustics, speaker diarization, voice activity detection, domain adaptation, acoustic sub-word units, and low-resource scenarios.
Supervised interns and Master's students.

2021-25

DL, ML, 160 Pages

Facial Information Extraction

M.Sc., Computer Vision, Convolutional Neural Networks

Attempted to use state-of-the-art deep learning techniques to build models which take an image as input.
Performed facial detection, recognition, and emotion classification on the individuals present in the images.
Achieved 95% test accuracy on facial recognition with convolutional neural networks and hyper-parameter tuning.
Built separate models for tasks such as emotion classification before combining them into an end-to-end model.
Optimised performance with DL best practices: data augmentation, batch-normalisation, cross-validation.

2018-19
Grade: Distinction

SE, ML, 200 Pages

Kohonen Self-Organising Maps

B.Sc., Computer Vision, Pattern Recognition

Implemented unsupervised machine learning neural network from scratch without using any specific ML library.
Trained back-end model on 3 different open-source datasets to test neural network’s efficiency and scalability.
Developed front-end GUI for interactive data visualisation before & after clustering and dimensionality reduction.
Wrote extensive thesis covering all aspects of project such as system design, algorithmic optimisation, scalability.

2017-18
Grade: 90%

DS, 50 Pages

Exoplanets: Discoveries and Prospects

Research, Data Analysis, Literature Review

2019 Update: Didier Queloz has since won the Physics Nobel Prize!
Conducted literature review on Exoplanets, with inputs from Didier Queloz, co-discoverer of the first exoplanet.
Showed correlations between possibly habitable planets and core laws of physics by analyzing open-source DB.
50-page report selected among top 2013 student scientific projects in Geneva canton and Pays de Gex.
Invited to present project at a public ‘Science Sharing’ event at CERN's Univers de Particules museum.

2012-13
Grade: 6/6

Projects and Open-source Contributions

RL, DL, DQN, DDPG

Deep Reinforcement Learning: Flappy Bird

Deep Q-Learning Network, Deep Deterministic Policy Gradient, Experience Replay

Attempted to develop a model that learns to play Flappy Bird and surpasses human-level scores using reinforcement learning techniques. Specifically investigated Deep Q-Learning networks to develop an overview of the problem and a deeper understanding of reinforcement learning techniques. Also aimed to showcase how computer vision and deep neural networks, such as convolutional neural networks, can be used in the context of reinforcement learning.

2019

NLP, ML, DL

Kaggle Competition: Toxic Comment Classification

Multi-Label Classification Problem

Attempted to solve a Kaggle competition in a group of three. Specifically strove for implementations beyond the existing classical ones, and attempted to develop a model well adapted and fine-tuned to the specific problem at hand. Implemented a Naive Bayes bag-of-words model, Random Forest, and Extra Trees, and compared their results with Logistic Regression, Convolutional Neural Network, and Long Short-Term Memory recurrent models.

2019

ML, Bayesian, Stats

Bayesian Machine Learning

Hamiltonian Monte Carlo Stochastic Methods, Automatic Relevance Determination

Used Bayesian modelling methods, specifically Hamiltonian Monte Carlo, to approximate Gaussian posterior distributions on a multivariate regression task, in order to derive a good predictor from the dataset and estimate which of the input variables are relevant for prediction.

2019

NLP, ML

Open Information Extraction

Speech Tagging, Named Entity Recognition, Relation Extraction, Kitchen Sink

Attempted to summarise Jules Verne's '20,000 Leagues Under the Seas' by training a classifier that predicts the part-of-speech tag of each word. The approach was based on Identifying Relations for Open Information Extraction (Fader, Soderland & Etzioni). To this end, GloVe word vectors were used to implement a one-vs-all logistic kitchen sink model, and part-of-speech tagging at the word and sentence levels, named entity resolution, and relation extraction were attempted.

2019

Robotics I

Localisation, Pathfinding, Navigation, Calibration, Object Detection

Wrote a program using the Java LeJOS framework that enables a robot to explore an arena containing a small number of obstacles placed at random locations. A single coloured sheet of paper, which the robot had to detect using its colour sensor, marked the end location, to which the robot then had to navigate back optimally.

2017

Robotics II

Scout, Doctor, Agents, Jason

Wrote a program using the Java LeJOS framework allowing a robot to determine its starting location in the arena, and optimally work its way to the pre-determined ending position using scout and doctor agents while avoiding possible obstacles.

2017

Android, SE

Android Food App

Full stack development

Scran is a user-oriented application that aids in the decision-making process when choosing a restaurant, and more specifically a dish. Scran will maintain, search and track user and restaurant data to help its users to choose the dish they didn’t know they wanted.

2017

Moving Average Filter

Generate, Filter and Display data

Wrote C++ in Xcode to generate random plot and noise values of a sinusoidal function using signal characteristics as parameters, which were then handled by custom event-driven panels and data structures in LabVIEW, and subsequently transferred to MATLAB to be displayed in both filtered and unfiltered states.

2014
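
The generate-then-filter idea behind this project can be sketched in a few lines. The sketch below is in Python rather than the C++, LabVIEW and MATLAB chain used in the project, and it is an illustrative reconstruction, not the original code: the function names, the Gaussian noise model, the causal window, and all parameter values are assumptions.

```python
import math
import random


def noisy_sine(n_samples, freq_hz, sample_rate_hz, noise_std, seed=0):
    """Return n_samples of a unit-amplitude sine with additive Gaussian noise.

    Hypothetical helper: the noise model and parameter names are assumptions.
    """
    rng = random.Random(seed)
    return [
        math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
        + rng.gauss(0.0, noise_std)
        for i in range(n_samples)
    ]


def moving_average(signal, window):
    """Causal moving average: each output is the mean of the last `window` inputs.

    The first window - 1 outputs average only the samples seen so far.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    out = []
    running_sum = 0.0
    for i, x in enumerate(signal):
        running_sum += x
        if i >= window:
            # Drop the sample that just left the window.
            running_sum -= signal[i - window]
        out.append(running_sum / min(i + 1, window))
    return out


if __name__ == "__main__":
    raw = noisy_sine(n_samples=200, freq_hz=5.0, sample_rate_hz=1000.0, noise_std=0.3)
    smooth = moving_average(raw, window=9)
    print(raw[:3])
    print(smooth[:3])
```

A wider window suppresses more noise but also attenuates and delays the underlying sinusoid, which is the trade-off that comparing the filtered and unfiltered signals makes visible.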

Talks

Automatic Speech Segmentation

Hidden Markov Models

Generative Adversarial Networks

Convolutional Neural Networks

Competitions

International Create Challenge

Facebook Hackathon 2015

News

CERN Intern

Exoplanet Project Presentation

Education

Ecole Polytechnique Fédérale de Lausanne

University of Bath

University of Liverpool

Skills

Extra-Curricular

Organizer

President

Interests