Preprint Article Version 2 Preserved in Portico This version is not peer-reviewed

First Steps towards Data-driven Adversarial Deduplication

Version 1 : Received: 26 June 2018 / Approved: 26 June 2018 / Online: 26 June 2018 (14:21:02 CEST)
Version 2 : Received: 20 July 2018 / Approved: 23 July 2018 / Online: 23 July 2018 (12:21:00 CEST)

A peer-reviewed article of this Preprint also exists.

Paredes, J.N.; Simari, G.I.; Martinez, M.V.; Falappa, M.A. First Steps towards Data-Driven Adversarial Deduplication. Information 2018, 9, 189. Paredes, J.N.; Simari, G.I.; Martinez, M.V.; Falappa, M.A. First Steps towards Data-Driven Adversarial Deduplication. Information 2018, 9, 189.

Journal reference: Information 2018, 9, 189
DOI: 10.3390/info9080189


In traditional databases, the entity resolution problem (which is also known as deduplication), refers to the task of mapping multiple manifestations of virtual objects to its corresponding real-world entity. When addressing this problem, in both theory and practice, it is widely assumed that such sets of virtual object appear as the result of clerical errors, transliterations, missing or updated attributes, abbreviations, and so forth. In this paper, we address this problem under the assumption that this situation is caused by malicious actors operating in domains in which they do not wish to be identified, such as hacker forums and markets in which the participants are motivated to remain semi-anonymous (though they wish to keep their true identities secret, they find it useful for customers to identify their products and services). We are therefore in the presence of a different, even more challenging problem that we refer to as adversarial deduplication. In this paper, we study this problem via examples that arise from real-world data on malicious hacker forums and markets arising from collaborations with a cyber threat intelligence company focusing on understanding this kind of behavior. We argue that it is very difficult---if not impossible---to find ground truth data on which to build solutions to this problem, and develop a set of preliminary experiments based on training machine learning classifiers that leverage text analysis to detect potential cases of duplicate entities. Our results are encouraging as a first step towards building tools that human analysts can use to enhance their capabilities towards fighting cyber threats.


Adversarial Deduplication; Machine Learning Classifiers; Cyber Threat Intelligence


MATHEMATICS & COMPUTER SCIENCE, Information Technology & Data Management

Comments (0)

We encourage comments and feedback from a broad range of readers. See criteria for comments and our diversity statement.

Leave a public comment
Send a private comment to the author(s)
Views 0
Downloads 0
Comments 0
Metrics 0

Notify me about updates to this article or when a peer-reviewed version is published.

We use cookies on our website to ensure you get the best experience.
Read more about our cookies here.