
Introducing the VoicePrivacy Initiative

By N. Tomashenko, B. M. L. Srivastava, X. Wang, E. Vincent, A. Nautsch, J. Yamagishi, N. Evans, J. Patino, J.-F. Bonastre, P.-G. Noé, M. Todisco

LIA, Avignon Université, France; Inria, France; NII, Japan; University of Edinburgh, UK; EURECOM, France


Abstract
The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges. In this paper, we formulate the voice anonymization task selected for the VoicePrivacy 2020 Challenge and describe the datasets used for system development and evaluation. We also present the attack models and the associated objective and subjective evaluation metrics. We introduce two anonymization baselines and report objective evaluation results.

Recent years have seen mounting calls for the preservation of privacy when processing or storing personal data, not least as a result of the European General Data Protection Regulation (GDPR). While there is no legal definition of privacy, speech data encapsulates a wealth of personal information that can be revealed by listening or by automated systems: age, gender, ethnic origin, geographical background, health or emotional state, political orientation, and religious beliefs, among others. In addition, speaker recognition systems can reveal the speaker’s identity. It is thus no surprise that efforts to develop privacy preservation solutions for speech technology are starting to emerge. The VoicePrivacy initiative aims to gather a new community to define the tasks of interest and the evaluation methodology, and to benchmark these solutions through a series of challenges.

Current methods fall into four categories: deletion, encryption, distributed learning, and anonymization. Deletion methods are meant for ambient sound analysis: they delete or obfuscate any overlapping speech to the point where no information about it can be recovered. Encryption methods, such as fully homomorphic encryption and secure multiparty computation, support computation on data in the encrypted domain, but they incur a significant increase in computational complexity that typically requires special hardware. Decentralized or federated learning methods aim to learn models from distributed data without accessing it directly; however, the derived data used for learning (e.g., model gradients) may still leak information about the original data. Anonymization methods aim to suppress personal information in the speech signal while leaving other attributes, such as the spoken content, intact; this is the focus of the VoicePrivacy challenge.

Privacy preservation is formulated as a game between users who publish some data and attackers who access this data or data derived from it and wish to infer information about the users. To protect their privacy, the users publish data that contain as little personal information as possible while allowing one or more downstream goals to be achieved. To infer personal information, the attackers may use additional prior knowledge. 

Focusing on speech data, a given privacy preservation scenario is specified by: 

(i) the nature of the data: waveform, features, etc., 

(ii) the information seen as personal: speaker identity, traits, spoken contents, etc., 

(iii) the downstream goal(s): human communication, automated processing, model training, etc., 

(iv) the data accessed by the attackers: one or more utterances, derived data or model, etc., 

(v) the attackers’ prior knowledge: previously published data, privacy preservation method applied, etc. 

Different specifications lead to different privacy preservation methods from the users’ point of view and different attacks from the attackers’ point of view.
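To make the five-point specification above concrete, it can be captured in a small configuration object. The sketch below is purely illustrative: the field names and example values are hypothetical, not part of the challenge definition.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PrivacyScenario:
    """Illustrative container for the five elements that specify a
    speech privacy preservation scenario (field names are hypothetical)."""
    data_nature: str                  # (i) waveform, features, ...
    personal_information: List[str]   # (ii) speaker identity, traits, spoken contents, ...
    downstream_goals: List[str]       # (iii) human communication, automated processing, model training, ...
    attacker_access: str              # (iv) one or more utterances, derived data or model, ...
    attacker_knowledge: List[str]     # (v) previously published data, anonymization method applied, ...

# Rough sketch of the VoicePrivacy 2020 anonymization setting.
voiceprivacy_2020 = PrivacyScenario(
    data_nature="waveform",
    personal_information=["speaker identity"],
    downstream_goals=["automated processing (ASR decoding)"],
    attacker_access="one or more anonymized utterances",
    attacker_knowledge=["original enrollment utterances"],
)
```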

For objective evaluation, we train two systems to assess speaker verifiability and ASR decoding error. The first, denoted ASVeval, is an automatic speaker verification (ASV) system which produces log-likelihood ratio (LLR) scores. The second, denoted ASReval, is an ASR system whose output is scored in terms of word error rate (WER). Both are trained on the LibriSpeech train-clean-360 dataset using Kaldi.
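For concreteness, the equal error rate (EER) commonly derived from ASVeval's LLR scores can be computed as sketched below. This is a minimal illustration assuming two arrays of scores for target (same-speaker) and non-target (different-speaker) trials; it is not the challenge's official scoring code.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate from LLR scores of target and non-target trials.

    Sweeps a decision threshold over all observed scores and returns the
    point where the miss rate and false-alarm rate are (approximately) equal.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # Miss rate: target trials scored below the threshold (falsely rejected).
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: non-target trials scored at or above the threshold (falsely accepted).
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])

    idx = np.argmin(np.abs(miss - fa))
    return (miss[idx] + fa[idx]) / 2.0

# Example with synthetic scores: a higher EER indicates stronger anonymization,
# since the attacker's ASV system can no longer separate speakers.
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500))
print(f"EER = {eer:.3f}")
```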


Subjective speaker verifiability 


To evaluate subjective speaker verifiability, listeners are given pairs consisting of one anonymized trial utterance and one distinct original enrollment utterance of the same speaker. They are then instructed to imagine a scenario in which the anonymized sample comes from an incoming telephone call, and to rate the similarity between this voice and the original speaker's voice on a scale of 1 to 10, where 1 denotes ‘different speakers’ and 10 denotes ‘the same speaker’ with the highest confidence. The performance of each anonymization system will be visualized through detection error tradeoff (DET) curves.
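One way to trace a DET curve from such ratings, under the assumption that each 1–10 rating is treated as a detection score, is sketched below. This is illustrative only and may differ from the challenge's official analysis.

```python
import numpy as np

def det_points_from_ratings(same_speaker_ratings, different_speaker_ratings):
    """Miss and false-alarm rates at each threshold over the 1-10 rating scale.

    A pair is 'accepted' as same-speaker when its rating is >= the threshold;
    true same-speaker pairs rated below it count as misses.
    """
    same = np.asarray(same_speaker_ratings)
    diff = np.asarray(different_speaker_ratings)
    points = []
    for threshold in range(1, 11):
        miss_rate = (same < threshold).mean()          # same-speaker pairs rejected
        false_alarm_rate = (diff >= threshold).mean()  # different-speaker pairs accepted
        points.append((threshold, miss_rate, false_alarm_rate))
    return points

# Example: ratings for anonymized trial vs. original enrollment pairs.
for t, pmiss, pfa in det_points_from_ratings([8, 9, 7, 3, 6], [2, 1, 4, 3, 2]):
    print(f"threshold={t}: miss={pmiss:.2f}, false alarm={pfa:.2f}")
```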

Subjective speaker linkability 


The second subjective metric assesses speaker linkability, i.e., the ability to cluster several utterances into speakers. Listeners are asked to place a set of anonymized trial utterances from different speakers in a 1- or 2-dimensional space according to speaker similarity. This relies on a graphical interface, where each utterance is represented as a point in space and the distance between two points expresses subjective speaker dissimilarity. 
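One plausible way to summarize such placements numerically, shown here purely as an illustration and not as the challenge's official metric, is to cluster the placed points and measure how well the clusters match the true speakers:

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_purity(placements, true_speakers, n_speakers):
    """Cluster listener placements (N x 2 coordinates) into n_speakers groups
    and return the fraction of utterances assigned to the majority speaker of
    their cluster. 1.0 means perfectly linkable; 1/n_speakers is chance level.
    """
    labels = KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(placements)
    true_speakers = np.asarray(true_speakers)
    correct = 0
    for cluster in range(n_speakers):
        members = true_speakers[labels == cluster]
        if len(members) > 0:
            correct += np.bincount(members).max()
    return correct / len(true_speakers)

# Example: 6 anonymized utterances from 2 speakers placed by a listener in 2-D.
placements = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
                       [0.90, 0.80], [0.85, 0.90], [0.80, 0.85]])
print(clustering_purity(placements, [0, 0, 0, 1, 1, 1], n_speakers=2))
```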

Subjective speech intelligibility 


Listeners are also asked to rate the intelligibility of individual samples (anonymized trial utterances or original enrollment utterances) on a scale from 1 (totally unintelligible) to 10 (totally intelligible). The results can be visualized through DET curves. 

Subjective speech naturalness 


Finally, the naturalness of the anonymized speech will be evaluated on a scale from 1 (totally unnatural) to 10 (totally natural).

Conclusion


The VoicePrivacy initiative aims to promote the development of private-by-design speech technology. Our initial event, the VoicePrivacy 2020 Challenge, provides a complete evaluation protocol for voice anonymization systems. We formulated the voice anonymization task as a game between users and attackers, and highlighted three possible attack models. We also designed suitable datasets and evaluation metrics, and we released two open-source baseline voice anonymization systems. Future work includes evaluating and comparing the participants’ systems using objective and subjective metrics, computing alternative objective metrics relating to other requirements, and drawing initial conclusions regarding the best anonymization strategies for a given attack model. A revised, stronger evaluation protocol is also expected as an outcome. In this regard, it is essential to realize that the users’ downstream goals and the attack models listed above are not exhaustive. For instance, beyond ASR decoding, anonymization is extremely useful in the context of anonymized data collection for ASR training. It is also known that the EER becomes lower when the attackers generate anonymized training data and retrain ASVeval on it. To assess these aspects, we will ask volunteer participants to share additional data with us and run additional experiments in a post-evaluation phase.
