(1. School of Computer Science and Technology, Hubei University of Technology, Wuhan 430068, China; 2. School of Materials Science and Engineering, Wuhan University of Technology, Wuhan 430070, China)
Abstract: With the rapid development of big data technology, healthcare data security issues have arisen, and protecting patients' private data from leakage has become a research hot spot. Healthcare data also exhibit the 5V characteristics: Volume, Velocity, Variety, Value, and Veracity. This paper studies patient privacy protection and the associated data analysis problems. Taking the PCA-GRA Datafly algorithm as the research object, and in order to address both the excessive generalization of quasi-identifier (QI) attributes in traditional algorithms and the local-optimum problem of the K-means algorithm, the PCA-GRA-BK algorithm (principal component analysis, grey relational analysis, BiK-means k-anonymity algorithm) is proposed. First, the PCA algorithm is used to reduce the dimensionality of the healthcare data, so that a small number of components reveal the internal relationships among the data, and the QI attributes are selected. Second, the GRA algorithm is used to compute the degree of correlation between each QI attribute and the sensitive attribute, and the generalization levels of the QI attributes are constructed accordingly. Then the elbow method is used to determine the best k value for the clustering algorithm, and the clustering algorithm groups similar records of the healthcare data set into equivalence classes. Finally, the healthcare data are anonymized by applying the K-anonymity algorithm. The Datafly algorithm, the PCA-GRA Datafly algorithm, the PCA-GRA-KK algorithm, and the PCA-GRA-BK algorithm are compared on the anonymization of healthcare data. The results show that the information loss rate is significantly reduced and the running speed is significantly improved, which further demonstrates the effectiveness of the PCA-GRA-BK algorithm proposed in this paper.
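To make the GRA step concrete, the following is a minimal sketch of how a grey relational grade between each QI attribute and the sensitive attribute might be computed. It uses the standard grey relational coefficient with a distinguishing coefficient of 0.5 and min-max normalization; these choices, the function names, the attribute names (`age`, `zip_code`), and the sample values are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean


def min_max_normalize(values):
    """Scale a sequence to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def grey_relational_grades(reference, candidates, rho=0.5):
    """Return {name: grade} for each candidate sequence against the reference.

    reference  -- numeric values of the sensitive attribute (assumed encoding)
    candidates -- dict mapping a QI attribute name to its numeric values
    rho        -- distinguishing coefficient (0.5 is a common default)
    """
    ref = min_max_normalize(reference)
    # Absolute differences between the normalized reference and each candidate.
    deltas = {
        name: [abs(r - c) for r, c in zip(ref, min_max_normalize(seq))]
        for name, seq in candidates.items()
    }
    all_deltas = [d for seq in deltas.values() for d in seq]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0:
        # Every candidate coincides with the reference after normalization.
        return {name: 1.0 for name in deltas}
    # The grade of an attribute is the mean of its relational coefficients.
    return {
        name: mean((d_min + rho * d_max) / (d + rho * d_max) for d in seq)
        for name, seq in deltas.items()
    }


# Hypothetical toy data: five records, one sensitive attribute, two QI attributes.
sensitive = [1, 2, 3, 4, 5]
qi_attributes = {
    "age": [10, 20, 30, 40, 50],
    "zip_code": [5, 1, 4, 2, 3],
}
grades = grey_relational_grades(sensitive, qi_attributes)
# QI attributes sorted from most to least related to the sensitive attribute.
ranking = sorted(grades, key=grades.get, reverse=True)
```

Under this sketch, an attribute with a higher grade is more closely related to the sensitive attribute, and the resulting ranking could inform how the generalization level of each QI attribute is set. How grades map to generalization levels is not specified in the abstract and is left open here.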