Abstract:
The growth of digital technologies has led to the collection and storage of huge amounts of individuals' data. In addition to supporting direct service delivery, this data can be used for other, non-direct activities known as secondary use. These include research, analysis, quality and safety measurement, public health, and marketing. Such activities enhance individuals' service experiences, expand knowledge and support appropriate decision making, strengthen understanding of the effectiveness and efficiency of systems, support public education, and help organizations meet customers' needs.
The collected data may contain person-specific and sensitive information, such as medical and financial records, that may cause privacy breaches if compromised. The process of ensuring an individual's privacy results in information loss, which renders the data less useful. This problem arises wherever data is collected, but it is critical in the healthcare domain because of the sensitive nature of healthcare data and its importance for several secondary uses. Therefore, to increase sharing of the collected data, approaches are needed that ensure an individual's privacy while reducing information loss so that the data remains useful. A number of approaches are used to ensure an individual's privacy, such as removing Personally Identifiable Information (PII), encryption, and statistical databases. However, most existing approaches result in substantial information loss, or the anonymisation level achieved may still allow the individual's sensitive information to be identified.
This research investigates the problem of ensuring an individual's privacy while reducing the amount of information loss. It addresses the question of how data holders, such as hospitals and private and government agencies, can ensure an individual's privacy while sharing data that is still useful. The research proposes an anonymisation algorithm, named kl-redInfo, that ensures an individual's privacy with a reduced amount of information loss so that the data remains useful. The kl-redInfo algorithm achieves the two main privacy requirements, k-anonymity and l-diversity, which together aim to protect an individual against both identity disclosure and sensitive attribute disclosure. Information loss is reduced by three proposed modified approaches that lower the values of the information loss metrics.
These approaches are: systematic incorporation of the remaining records into the group that results in lower information loss; use of both the group-creation part of the anatomization approach and cell-based generalization; and sorting of the records according to the attributes that can be linked to identify an individual, also known as quasi-identifier attributes. The research shows that each of the proposed modified approaches contributes to reducing the amount of information loss, with the systematic incorporation of the remaining records into the group that results in a lower value of the information loss metric being the most important. The research also finds that, although each modification reduces information loss on its own, the reduction is significant when the three modifications are applied in combination. The research therefore uses all three modifications to design the kl-redInfo algorithm. The research shows that the kl-redInfo algorithm significantly reduces information loss compared to widely used privacy-preserving data publishing algorithms that have been shown to result in low information loss. This is indicated by lower values of three information loss metrics: Normalized Certainty Penalty (NCP), Discernibility Penalty (DP), and Kullback-Leibler divergence (KL divergence). The reduction is attributed to the combined use of the three proposed modified approaches listed above.
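To make the two privacy requirements and one of the metrics named in this abstract concrete, the following is a minimal Python sketch and not the kl-redInfo algorithm itself. It assumes records are already generalized, groups them by their quasi-identifier values, and checks k-anonymity (every group has at least k records) and distinct l-diversity (every group has at least l distinct sensitive values). The Discernibility Penalty is assumed here to be the sum of squared group sizes, a common formulation; the research may define it differently. All function names, attribute names, and sample records are hypothetical.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, qi_attrs):
    """Group records that share the same (generalized) quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in qi_attrs)
        groups[key].append(rec)
    return list(groups.values())


def is_k_anonymous(groups, k):
    """k-anonymity: every group contains at least k records."""
    return all(len(g) >= k for g in groups)


def is_l_diverse(groups, sensitive_attr, l):
    """Distinct l-diversity: every group has at least l distinct sensitive values."""
    return all(len({r[sensitive_attr] for r in g}) >= l for g in groups)


def discernibility_penalty(groups):
    """Assumed definition: sum of squared group sizes (lower means less loss)."""
    return sum(len(g) ** 2 for g in groups)


# Hypothetical, already-generalized sample data.
records = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "asthma"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "diabetes"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]

groups = group_by_quasi_identifiers(records, ["age", "zip"])
print(is_k_anonymous(groups, 2))            # True: group sizes are 2 and 3
print(is_l_diverse(groups, "disease", 2))   # True: each group has 2 distinct diseases
print(discernibility_penalty(groups))       # 13 = 2**2 + 3**2
```

In this sketch, merging the two groups into one would still satisfy both requirements but would raise the penalty from 13 to 25, which illustrates why smaller groups that still meet k and l are preferred.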