Eklavya Sarkar

PhD Candidate in Machine Learning

Third-year PhD student at EPFL and Research Assistant at Idiap Research Institute,
working in the Speech and Audio Processing group, under Dr. Mathew Magimai Doss.

My research focuses on speech and audio methods based on self-supervised
representation learning for analyzing human and non-human vocal communication,
for the wider purpose of studying the evolution of language.

Previously, I worked on computer vision topics such as deepfakes and biometrics
spoofing, as well as physics research and development at CERN in the CMS experiment.

Work Experience

Speech Processing

Research Assistant

Idiap Research Institute

Supervisor: Dr. Mathew Magimai Doss, Speech and Audio Processing Group

Self-Supervised Learning
SSL, VAD, Diarization, ASWUs, Bio-Acoustics
Audio Segmentation Methods for Analyzing Vocal Communication: From Humans to Animals.
Low-resource speech and animal vocalization processing.
Working on EvoLang Project, TTF Tech ASR.

March 2021 - Present

Biometrics, ML, DL, GANs

Research Intern

Idiap Research Institute

Supervisor: Dr. Sébastien Marcel, HOD Biometrics Security and Privacy Group

Developed and released StyleGAN2 latent space editing code for morphing.
Implemented different techniques to generate traditional and StyleGAN2-based face morphs.
Investigated vulnerabilities of modern facial recognition systems against morphing attacks.
Currently researching detection techniques for such attacks to publish paper by November.

May 2020 - Feb 2021
(10 months)

SE, DS, R&D

Intern

CERN

Project Manager: Dr. Archana Sharma, Principal Scientist, CMS Experiment

Contributed to CERN's CMS-GEM-DAQ project's production code: PR1, PR2.
Refined efficiency of production code by implementing requested features on Python scripts.
Improved code used for testing detector in a QC stand by adding a step-size feature.
Created method for configuring detector’s electrical state with custom values.
Published real time gas levels of a mixer by writing code to send data to a server via an API.

July 2017 - September 2017
(3 months)

Publications

Foundation Models, Pre-training Domain, Bandwidth, Bioacoustics

On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis

Abstract

Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. Traditionally analyzed with signal processing-based features, recent approaches have utilized self-supervised models pre-trained on human speech for feature extraction, capitalizing on their ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models remains unclear for marmoset call analysis in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks. Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.

Accepted at Interspeech 2024 🇬🇷

Self-Supervised Learning, Speaker ID, Speech

Can Self-Supervised Neural Networks Pre-Trained on Human Speech distinguish Animal Callers?

Abstract

Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space. This implies that the utility of such representations is not limited to modeling human speech alone. Building on this understanding, this paper explores the cross-transferability of SSL neural representations learned from human speech to analyze bio-acoustic signals. We conduct a caller discrimination analysis and a caller detection study on Marmoset vocalizations using eleven SSL models pre-trained with various pretext tasks. The results show that the embedding spaces carry meaningful caller information and can successfully distinguish the individual identities of Marmoset callers without fine-tuning. This demonstrates that representations pre-trained on human speech can be effectively applied to the bio-acoustics domain, providing valuable insights for future investigations in this field.

Accepted at Interspeech 2023 🇮🇪

Signal Processing, VAD, Speech

Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering

Abstract

Voice activity detection (VAD) is an important pre-processing step for speech technology applications. The task consists of deriving segment boundaries of audio signals which contain voicing information. In recent years, it has been shown that voice source and vocal tract system information can be extracted using zero-frequency filtering (ZFF) without making any explicit model assumptions about the speech signal. This paper investigates the potential of zero-frequency filtering for jointly modeling voice source and vocal tract system information, and proposes two approaches for VAD. The first approach demarcates voiced regions using a composite signal composed of different zero-frequency filtered signals. The second approach feeds the composite signal as input to the rVAD algorithm. These approaches are compared with other supervised and unsupervised VAD methods in the literature, and are evaluated on the Aurora-2 database, across a range of SNRs (20 to -5 dB). Our studies show that the proposed ZFF-based methods perform comparable to state-of-art VAD methods and are more invariant to added degradation and different channel characteristics.

Accepted at Interspeech 2022 🇰🇷

StyleGAN2, Face Recognition, Biometrics

Are GAN-based Morphs Threatening Face Recognition?

Abstract

Morphing attacks are a threat to biometric systems where the biometric reference in an identity document can be altered. This form of attack presents an important issue in applications relying on identity documents such as border security or access control. Research in generation of face morphs and their detection is developing rapidly, however very few datasets with morphing attacks and open-source detection toolkits are publicly available. This paper bridges this gap by providing two datasets and the corresponding code for four types of morphing attacks: two that rely on facial landmarks based on OpenCV and FaceMorpher, and two that use StyleGAN 2 to generate synthetic morphs. We also conduct extensive experiments to assess the vulnerability of four state-of-the-art face recognition systems, including FaceNet, VGG-Face, ArcFace, and ISV. Surprisingly, the experiments demonstrate that, although visually more appealing, morphs based on StyleGAN 2 do not pose a significant threat to state-of-the-art face recognition systems, as these morphs were outmatched by the simple morphs that are based on facial landmarks.

Accepted at ICASSP 2022 🇸🇬

StyleGAN2, Face Recognition, Biometrics

Vulnerability Analysis of Face Morphing Attacks from Landmarks and Generative Adversarial Networks

Abstract

Morphing attacks are a threat to biometric systems where the biometric reference in an identity document can be altered. This form of attack presents an important issue in applications relying on identity documents such as border security or access control. Research in face morphing attack detection is developing rapidly, however very few datasets with several forms of attacks are publicly available. This paper bridges this gap by providing a new dataset with four different types of morphing attacks, based on OpenCV, FaceMorpher, WebMorph and a generative adversarial network (StyleGAN), generated with original face images from three public face datasets. We also conduct extensive experiments to assess the vulnerability of the state-of-the-art face recognition systems, notably FaceNet, VGG-Face, and ArcFace. The experiments demonstrate that VGG-Face, while being a less accurate face recognition system compared to FaceNet, is also less vulnerable to morphing attacks. Also, we observed that naïve morphs generated with a StyleGAN do not pose a significant threat.

Idiap-RR-38-2020

Thesis

DL, ML, 160 Pages

Facial Information Extraction

MSc, Computer Vision, Convolutional Neural Networks

Attempted to use state-of-the-art deep learning techniques to build models which take an image as input.
Performed facial detection, recognition, and emotion classification on the individuals present in the images.
Achieved 95% test accuracy on facial recognition with convolutional neural networks and hyper-parameter tuning.
Built separate models for tasks such as emotion classification before combining them into an end-to-end model.
Optimised performance with DL best practices: data augmentation, batch-normalisation, cross-validation.

2018-19
Grade: Distinction

SE, ML, 200 Pages

Kohonen Self-Organising Maps

BSc, Computer Vision, Pattern Recognition

Implemented unsupervised machine learning neural network from scratch without using any specific ML library.
Trained back-end model on 3 different open-source datasets to test neural network’s efficiency and scalability.
Developed front-end GUI for interactive data visualisation before & after clustering and dimensionality reduction.
Wrote extensive thesis covering all aspects of project such as system design, algorithmic optimisation, scalability.

2017-18
Grade: 90%
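
For illustration only, here is a minimal sketch of how a Kohonen self-organising map can be trained using nothing but the standard library. It is not the thesis code: the function names (`train_som`, `best_matching_unit`), the linear decay schedules, and the default parameters are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - xi) ** 2 for w, xi in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map on a list of equal-length feature vectors.

    Assumed (illustrative) schedule: learning rate and neighbourhood
    radius both decay linearly over the whole run.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # One randomly initialised weight vector per grid node.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 1e-3  # keep the radius positive
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                # Gaussian neighbourhood: nodes near the BMU move the most.
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights
```

After training, `best_matching_unit(weights, x)` maps any input vector to a 2-D grid position, which is what makes the map usable for clustering and dimensionality reduction.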

DS, 50 Pages

Exoplanets: Discoveries and Prospects

Research, Data Analysis, Literature Review

2019 Update: Didier Queloz has since won the Nobel Prize in Physics!
Conducted literature review on Exoplanets, with inputs from Didier Queloz, co-discoverer of the first exoplanet.
Showed correlations between possibly habitable planets and core laws of physics by analyzing open-source DB.
50-page report selected among top 2013 student scientific projects in Geneva canton and Pays de Gex.
Invited to present project at a public ‘Science Sharing’ event at CERN's Universe of Particles museum.

2012-13
Grade: 6/6

Talks

DL, Diarization, NCCR

Automatic Speech Segmentation

14 June 2021

HMM Tutorial Problems

Hidden Markov Models

8 Dec 2020

DL, StyleGAN2, Morphing Attacks, Vulnerability Analysis

Vulnerability Analysis of Face Morphing Attacks from Landmarks and GANs

4 Nov 2020

DL, GANs, VAEs

Generative Adversarial Networks

30 April 2020

DL, CNNs

Convolutional Neural Networks

30 April 2020

Projects and Open-source Contributions

RL, DL, DQN, DDPG

Deep Reinforcement Learning: Flappy Bird

Deep Q-Learning Network, Deep Deterministic Policy Gradient, Experience Replay

Attempted to develop a model that learns to play Flappy Bird and surpasses human-level scores using Reinforcement Learning techniques. Specifically investigated Deep Q-Learning networks to develop an overview of the problem and a deeper understanding of reinforcement learning techniques. Also aimed to showcase how computer vision and deep neural networks such as convolutional neural networks can be used in the context of reinforcement learning.

2019
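
As a rough illustration of the experience replay technique listed for this project, a fixed-size replay buffer might look like the sketch below. This is not the project's code: the `Transition` fields and the `ReplayBuffer` class are hypothetical names, and uniform random sampling is assumed.

```python
import random
from collections import deque, namedtuple

# Hypothetical record of one environment step.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-capacity store of past transitions, sampled uniformly."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self._buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive frames.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

In a Deep Q-Learning loop, each step would be pushed into the buffer and the network would be updated on a sampled mini-batch rather than on the most recent transition alone.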

NLP, ML, DL

Kaggle Competition: Toxic Comment Classification

Multi-Label Classification Problem

Attempted to solve a Kaggle competition in a group of three to the best of our abilities. Specifically strove for implementations beyond the existing classical ones, and attempted to develop a model that is well adapted and fine-tuned to the specific problem at hand. Implemented a Naive-Bayes Bag of Words model, Random Forest, and Extra Trees, and compared their results with Logistic Regression, Convolutional Neural Network, and Long Short-Term Memory Recurrent models.

2019

ML, Bayesian, Stats

Bayesian Machine Learning

Hamiltonian Monte Carlo Stochastic Methods, Automatic Relevance Determination

Used Bayesian modelling methods, specifically Hamiltonian Monte Carlo, to approximate Gaussian posterior distributions on a multivariate regression task to derive a good predictor from the dataset, and estimate which of the input variables are relevant for prediction.

2019

NLP, ML

Open Information Extraction

Speech Tagging, Named Entity Recognition, Relation Extraction, Kitchen Sink

Attempted to summarise Jules Verne's '20,000 Leagues Under the Seas' by training a classifier that predicts the part-of-speech tag of each word. The approach was based on Identifying Relations for Open Information Extraction (Fader, Soderland & Etzioni). To this end, GloVe word vectors were employed to implement a logistic one-vs-all kitchen sink model, and part-of-speech tagging at word and sentence levels, named entity resolution, and relation extraction were attempted.

2019

Robotics I

Localisation, Pathfinding, Navigation, Calibration, Object Detection

Wrote a program using the Java LeJOS framework that enables a robot to explore an arena containing a small number of obstacles placed at random locations. A single coloured sheet of paper, which the robot had to detect using its colour sensor, marked the end location, to which the robot then had to navigate optimally.

2017

Robotics II

Scout, Doctor, Agents, Jason

Wrote a program using the Java LeJOS framework allowing a robot to determine its starting location in the arena, and optimally work its way to the pre-determined ending position using scout and doctor agents while avoiding the possible obstacles.

2017

Android, SE

Android Food App

Full stack development

Scran is a user-oriented application that aids in the decision-making process when choosing a restaurant, and more specifically a dish. Scran will maintain, search and track user and restaurant data to help its users to choose the dish they didn’t know they wanted.

2017

Moving Average Filter

Generate, Filter and Display data

Wrote C++ in Xcode to generate random plot and noise values of a sinusoidal function using signal characteristics as parameters, which would then be handled by the designed event-driven panels and data structures in LabVIEW, and subsequently transferred to MATLAB to be displayed in both filtered and unfiltered states.

2014
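
The original project used C++, LabVIEW and MATLAB. As a language-neutral illustration of the same idea, the sketch below generates a noisy sinusoid and smooths it with a causal moving average in Python. The function names and default parameter values are assumptions made for this example, not taken from the project.

```python
import math
import random


def noisy_sine(n_samples, frequency_hz, sample_rate_hz,
               amplitude=1.0, noise_std=0.2, seed=0):
    """Sample a sinusoid and add Gaussian noise to every sample."""
    rng = random.Random(seed)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate_hz)
        + rng.gauss(0.0, noise_std)
        for i in range(n_samples)
    ]


def moving_average(signal, window):
    """Causal moving average using a running sum.

    The first window-1 outputs average only the samples seen so far.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    filtered = []
    running = 0.0
    for i, x in enumerate(signal):
        running += x
        if i >= window:
            running -= signal[i - window]  # drop the sample leaving the window
        filtered.append(running / min(i + 1, window))
    return filtered


raw = noisy_sine(n_samples=200, frequency_hz=2.0, sample_rate_hz=100.0)
smooth = moving_average(raw, window=9)
```

A wider window removes more noise but also attenuates and delays the underlying sinusoid, which is the trade-off that plotting the filtered and unfiltered signals side by side makes visible.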

Competitions

International Create Challenge

Facebook Hackathon 2015

News

CERN Intern

Exoplanet Project Presentation

Education

Ecole Polytechnique Fédérale de Lausanne

University of Bath

University of Liverpool

Skills

Extra-Curricular

Organizer

President

Interests