Abstract: Privacy protection for deep-learning text representations struggles to balance utility against privacy. To address this problem, the paper proposes RMAT, a privacy-preservation algorithm for text representations based on random masking and adversarial training. The algorithm first randomly masks the original input text sequence, then injects differential-privacy noise, and combines these steps with adversarial training between a simulated attacker and the task classifier to desensitize the deep-learning text representation. A theoretical derivation proves that the algorithm satisfies differential privacy, and experimental results on five public datasets verify that it improves the usability of the desensitized text while providing a complete privacy guarantee. Through this experiment, students gain a clearer understanding of the security risks facing deep-learning text representation models and improve their ability to analyze and solve security problems with deep learning methods.
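The first two steps named in the abstract (random masking of the input sequence, then noise injection on the representation) can be illustrated with a minimal sketch. This is not the paper's RMAT implementation: the mask token, the clipping bound, the L1 sensitivity calculation, and all function names are assumptions chosen for illustration, and the adversarial training between the simulated attacker and the task classifier is omitted entirely.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def random_mask(tokens, mask_prob, rng):
    """Replace each token with MASK_TOKEN independently with probability mask_prob."""
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]


def clip(vec, bound):
    """Clip every coordinate into [-bound, bound] so the sensitivity is bounded."""
    return [max(-bound, min(bound, v)) for v in vec]


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatize(vec, epsilon, bound, rng):
    """Laplace mechanism on a clipped representation vector.

    Assumption: any two clipped vectors differ by at most 2 * bound per
    coordinate, so the L1 sensitivity is 2 * bound * len(vec). The paper's
    own noise calibration may differ.
    """
    clipped = clip(vec, bound)
    sensitivity = 2.0 * bound * len(clipped)
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in clipped]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["the", "patient", "lives", "in", "Beijing"]
    print(random_mask(tokens, mask_prob=0.3, rng=rng))
    # A made-up 4-dimensional "representation" standing in for an encoder output.
    print(privatize([0.2, -0.7, 1.5, 0.1], epsilon=1.0, bound=1.0, rng=rng))
```

A smaller `epsilon` enlarges the noise scale and therefore strengthens privacy at the cost of utility, which is the trade-off the abstract describes.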
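Basic information:
DOI: 10.16791/j.cnki.sjg.2023.08.011
CLC number: TP391.1
Citation:
[1] WU Zhouting, LUO Senlin. Text privacy protection experiment based on random masking and adversarial training[J]. Experimental Technology and Management, 2023, 40(08): 72-76. DOI: 10.16791/j.cnki.sjg.2023.08.011.
Funding:
National 242 Information Security Special Program (2019A021, 2020A065)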
2023-03-31
2023
2023-04-12
2023
1
2023-09-15