Vol. 7 No. 1 (2022): Proceedings of Botconf 2021/2022
Conference proceedings

Detect emerging malware on cloud before VirusTotal can see it

Thanh Nguyen
Security Innovation Labs - Alibaba Cloud
Gan Feng
Security Innovation Labs - Alibaba Cloud
Andreas Pfadler
DAMO Academy
Anastasia Poliakova
Security Innovation Labs - Alibaba Cloud
Ali Fakeri-Tabrizi
Security Innovation Labs - Alibaba Cloud
Hongliang Liu
Security Innovation Labs - Alibaba Cloud
Yuriy Yuzifovich
Security Innovation Labs - Alibaba Cloud

Published 2022-08-19


  • Fuzzy hash
  • ssdeep
  • Similarity graph
  • Algorithms

How to Cite

Nguyen, T., Feng, G., Pfadler, A., Poliakova, A., Fakeri-Tabrizi, A., Liu, H., & Yuzifovich, Y. (2022). Detect emerging malware on cloud before VirusTotal can see it. The Journal on Cybercrime and Digital Investigations, 7(1), 7-16. https://doi.org/10.18464/cybin.v7i1.33

In this paper, we present a new methodology for discovering emerging malware. Our general anomaly detection continuously surfaces new malware candidates, and a graph learning system predicts each candidate's behavior and threat family using fuzzy similarity over a correlation knowledge graph. These predictions support further analysis by security researchers as well as automatic enforcement and remediation. The methodology can be applied at large scale to detect and analyze emerging malware while providing rich contextual information.

