N. Tomashenko, B. M. L. Srivastava, X. Wang, E. Vincent, A. Nautsch, J. Yamagishi, Evans, J. Patino, J.-F. Bonastre, P.-G. Noé, M. Todisco
University of Edinburgh, UK
Abstract
The VoicePrivacy initiative aims to promote the development of
privacy preservation tools for speech technology by gathering a
new community to define the tasks of interest and the evaluation
methodology, and benchmarking solutions through a series of
challenges. In this paper, we formulate the voice anonymization
task selected for the VoicePrivacy 2020 Challenge and describe
the datasets used for system development and evaluation. We
also present the attack models and the associated objective and
subjective evaluation metrics. We introduce two anonymization
baselines and report objective evaluation results.
Recent years have seen mounting calls for the preservation of
privacy when treating or storing personal data. This is not
least the result of the European General Data Protection Regulation (GDPR). While there is no legal definition of privacy,
speech data encapsulates a wealth of personal information that
can be revealed by listening or by automated systems. This
includes, e.g., age, gender, ethnic origin, geographical background, health or emotional state, political orientations, and religious beliefs, among others. In addition, speaker
recognition systems can reveal the speaker’s identity. It is
thus no surprise that efforts to develop privacy preservation
solutions for speech technology are starting to emerge. The
VoicePrivacy initiative aims to gather a new community to define the tasks of interest and the evaluation methodology, and to
benchmark these solutions through a series of challenges.
Current methods fall into four categories: deletion, encryption, distributed learning, and anonymization. Deletion methods are meant for ambient sound analysis. They delete or
obfuscate any overlapping speech to the point where no information about it can be recovered. Encryption methods, such as fully homomorphic encryption and secure multiparty computation, support computation on data in the
encrypted domain. They incur significant increases in computational complexity, which require special hardware. Decentralized or federated learning methods aim to learn models from
distributed data without accessing it directly. The derived
data used for learning (e.g., model gradients) may still leak information about the original data.
Privacy preservation is formulated as a game between users who
publish some data and attackers who access this data or data
derived from it and wish to infer information about the users. To protect their privacy, the users publish data that
contain as little personal information as possible while allowing
one or more downstream goals to be achieved. To infer personal
information, the attackers may use additional prior knowledge.
Focusing on speech data, a given privacy preservation scenario is specified by:
(i) the nature of the data: waveform, features, etc.,
(ii) the information seen as personal: speaker identity, traits, spoken contents, etc.,
(iii) the downstream goal(s): human communication, automated processing, model training, etc.,
(iv) the data accessed by the attackers: one or more utterances, derived data or model, etc.,
(v) the attackers’ prior knowledge: previously published data, privacy preservation method applied, etc.
Different specifications lead to different
privacy preservation methods from the users’ point of view and
different attacks from the attackers’ point of view.
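As an illustration, the five items (i) to (v) can be recorded in a small data structure, so that two scenarios can be compared field by field. The following is a minimal Python sketch and not part of the challenge software; the class name, the field names, and the example values are hypothetical and chosen only to mirror the list above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PrivacyScenario:
    """One privacy preservation scenario, following items (i) to (v)."""

    data_nature: str                           # (i) e.g. "waveform" or "features"
    personal_information: Tuple[str, ...]      # (ii) what is seen as personal
    downstream_goals: Tuple[str, ...]          # (iii) what the data must still allow
    attacker_access: Tuple[str, ...]           # (iv) data the attackers can access
    attacker_prior_knowledge: Tuple[str, ...]  # (v) what the attackers already know


# Hypothetical example values, for illustration only.
example_scenario = PrivacyScenario(
    data_nature="waveform",
    personal_information=("speaker identity",),
    downstream_goals=("human communication", "automated processing"),
    attacker_access=("one or more utterances",),
    attacker_prior_knowledge=("previously published data",),
)
```

Changing any single field, for instance the attackers' prior knowledge, yields a different scenario and hence potentially a different anonymization method and a different attack.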
For objective evaluation, we train two systems to assess speaker
verifiability and automatic speech recognition (ASR) decoding error. The first system, denoted
ASVeval, is an automatic speaker verification (ASV) system,
which produces log-likelihood ratio (LLR) scores. The second system, denoted ASReval, is an ASR system whose decoding output is scored by
word error rate (WER). Both are trained on LibriSpeech train-clean-360 using Kaldi.
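To make the two objective quantities concrete, the sketch below computes a WER from a reference and a hypothesis transcript, and an approximate equal error rate (EER) from two lists of verification scores such as LLRs. It is a simplified pure-Python illustration under our own assumptions, not the Kaldi scoring used by the evaluation systems; the function names are hypothetical, and the EER is approximated by sweeping thresholds over the observed scores only.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Approximate EER: operating point where false acceptance and
    false rejection rates are closest, over thresholds at observed scores."""
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

For example, `word_error_rate("the cat sat", "the dog sat")` counts one substitution over three reference words. Higher-resolution EER estimates would interpolate between thresholds, which this sketch deliberately omits.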
Subjective speaker verifiability
To evaluate subjective speaker verifiability, listeners are given
pairs of one anonymized trial utterance and one distinct original
enrollment utterance of the same speaker. They
are instructed to imagine a scenario in which the anonymized
sample is from an incoming telephone call, and to rate the similarity between the voice and the original voice using a scale
of 1 to 10, where 1 denotes ‘different speakers’ and 10 denotes
‘the same speaker’ with highest confidence. The performance
of each anonymization system will be visualized through detection error tradeoff (DET) curves.
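The points of a DET curve can be obtained from such ratings by sweeping a decision threshold over the rating scale. The sketch below is a minimal illustration and rests on an assumption not stated in the text above: that ratings are available both for pairs from the same speaker and for pairs from different speakers. The function name and arguments are hypothetical.

```python
from typing import List, Sequence, Tuple


def det_points(same_speaker_ratings: Sequence[int],
               different_speaker_ratings: Sequence[int],
               low: int = 1,
               high: int = 10) -> List[Tuple[int, float, float]]:
    """Return (threshold, false_alarm_rate, miss_rate) for each threshold.

    A pair is judged 'same speaker' when its rating is at or above the
    threshold. The extra threshold high + 1 rejects every pair.
    """
    if not same_speaker_ratings or not different_speaker_ratings:
        raise ValueError("both rating lists must be non-empty")
    points = []
    for threshold in range(low, high + 2):
        miss = sum(r < threshold for r in same_speaker_ratings) / len(same_speaker_ratings)
        false_alarm = (sum(r >= threshold for r in different_speaker_ratings)
                       / len(different_speaker_ratings))
        points.append((threshold, false_alarm, miss))
    return points


# Hypothetical ratings, for illustration only.
points = det_points([8, 9, 10, 3], [1, 2, 7, 4])
```

DET curves are conventionally plotted with both axes on a normal deviate scale; that warping is left out here, since it only affects the display and not the underlying error rates.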
Subjective speaker linkability
The second subjective metric assesses speaker linkability, i.e.,
the ability to cluster several utterances into speakers. Listeners are asked to place a set of anonymized trial utterances from
different speakers in a 1- or 2-dimensional space according to
speaker similarity. This relies on a graphical interface, where
each utterance is represented as a point in space and the distance
between two points expresses subjective speaker dissimilarity.
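Once listeners have placed the utterances, the subjective dissimilarity between any two of them can be read off as the Euclidean distance between their points. The following minimal sketch shows this step only; the function name and the utterance identifiers are hypothetical, and how the distances are subsequently aggregated or clustered is not specified here.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def pairwise_dissimilarity(
    placements: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Map each unordered pair of utterances to the distance between their points.

    All points must have the same dimension (1 or 2 in the listening test).
    """
    return {
        (a, b): math.dist(placements[a], placements[b])
        for a, b in combinations(sorted(placements), 2)
    }


# Hypothetical 2-dimensional placements, for illustration only.
distances = pairwise_dissimilarity({
    "utt1": (0.0, 0.0),
    "utt2": (3.0, 4.0),
    "utt3": (0.0, 1.0),
})
```

Small distances indicate utterances that the listener perceived as coming from similar or identical speakers.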
Subjective speech intelligibility
Listeners are also asked to rate the intelligibility of individual
samples (anonymized trial utterances or original enrollment utterances) on a scale from 1 (totally unintelligible) to 10 (totally
intelligible). The results can be visualized through DET curves.
Subjective speech naturalness
Finally, the naturalness of the anonymized speech will be evaluated on a scale from 1 (totally unnatural) to 10 (totally natural).
Conclusion
The VoicePrivacy initiative aims to promote the development
of private-by-design speech technology. Our initial event, the
VoicePrivacy 2020 Challenge, provides a complete evaluation
protocol for voice anonymization systems. We formulated the
voice anonymization task as a game between users and attackers, and highlighted three possible attack models. We also designed suitable datasets and evaluation metrics, and we released
two open-source baseline voice anonymization systems. Future work includes evaluating and comparing the participants’
systems using objective and subjective metrics, computing alternative objective metrics relating to, e.g., requirement, and drawing initial conclusions regarding the best
anonymization strategies for a given attack model. A revised,
stronger evaluation protocol is also expected as an outcome.
In this regard, it is essential to realize that the users’ downstream goals and the attack models listed above are not exhaustive. For instance, beyond ASR decoding, anonymization is extremely useful in the context of anonymized data collection for
ASR training. It is also known that the equal error rate (EER) becomes lower
when the attackers generate anonymized training data and retrain ASVeval on this data. In order to assess these aspects,
we will ask volunteer participants to share additional data with
us and run additional experiments in a post-evaluation phase.