The Role of Data in Cybersecurity


If you have been following the news recently, you have probably come across a couple of big stories about cyberattacks that struck economies around the world. Still fresh in the minds of many is the WannaCry ransomware attack in May. Computers running the Microsoft Windows operating system were locked by having their data encrypted, and users were told to pay a ransom in cryptocurrency to regain access. So widespread was this cyberattack that it temporarily paralysed digital infrastructure across Europe, affecting the operations of governmental and corporate entities. Most recently, in fact just a couple of days ago, a new strain of ransomware dubbed “NotPetya” emerged in the mould of WannaCry, once again threatening many European countries.

Ransomware Attacks have been pervasive

Fortunately, Singapore was left largely unscathed by both of these cyberattacks. But with the spectre of other attacks still looming, some have suggested that we turn to data science and machine learning to better prevent and detect future attacks.

Cybersecurity has indeed been highlighted as one of the key ICT skills that the Singapore government is investing in, and a clearer understanding of how malware detection is practised would be helpful. As it stands, data science and machine learning could very well hold the key to more robust cybersecurity systems around the world.

One common way of detecting viruses or malware is to rely on databases of known samples. In this signature-based approach, the patterns of a file are matched against those of previously identified strains, and any match is flagged and eliminated. For this very reason, the signature-based approach cannot prevent the onset of new cyberattacks: it only recognises what has been seen before and lacks the predictive power that data analytics and machine learning could offer.
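To make the limitation concrete, here is a minimal sketch of signature matching in Python. It assumes the simplest possible signature, a SHA-256 hash of the whole file; real products also match byte patterns inside files. The sample bytes, the `SIGNATURE_DB` dictionary and the name `Example.Ransom.A` are all made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical "known malware" sample and signature database (illustrative only).
KNOWN_SAMPLE = b"pretend this is a known malware binary"
SIGNATURE_DB = {sha256_hex(KNOWN_SAMPLE): "Example.Ransom.A"}


def scan(data: bytes):
    """Return the name of the matching known strain, or None if nothing matches."""
    return SIGNATURE_DB.get(sha256_hex(data))


print(scan(KNOWN_SAMPLE))         # matches the stored signature
print(scan(KNOWN_SAMPLE + b"!"))  # a one-byte variant no longer matches
```

The second call shows the weakness: a trivially modified variant produces a different hash and slips past the database until someone adds its signature.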

Another, more current practice relies on characterising malware behaviour. In the case of ransomware, software could look for repeated attempts to lock files by encrypting them. But this approach can also flag ordinary computer behaviour, such as file compression.
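The false-positive problem can be sketched with one common behavioural signal, which is an assumption here rather than something the approach above prescribes: encrypted output looks like random bytes, so a detector might flag files whose byte entropy is very high. The threshold of 7.0 bits per byte and the synthetic samples below are hypothetical, and seeded random bytes stand in for real ciphertext.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 0 for constant data, up to 8 for uniform random."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


ENTROPY_THRESHOLD = 7.0  # assumed cut-off, not a value from any real product


def looks_encrypted(data: bytes) -> bool:
    """Flag data whose byte distribution is close to random."""
    return shannon_entropy(data) > ENTROPY_THRESHOLD


rng = random.Random(0)  # seeded so the sketch is repeatable
alphabet = "abcdefghijklmnopqrstuvwxyz "
plain = "".join(rng.choice(alphabet) for _ in range(50_000)).encode()
compressed = zlib.compress(plain)                                # ordinary file compression
encrypted_like = bytes(rng.getrandbits(8) for _ in range(50_000))  # stand-in for ciphertext

for name, sample in [("plain", plain), ("compressed", compressed), ("encrypted-like", encrypted_like)]:
    print(name, round(shannon_entropy(sample), 2), looks_encrypted(sample))
```

With these samples the plain text stays below the threshold, while both the compressed file and the ciphertext stand-in cross it. A rule based on this signal alone cannot tell a ransomware run from a routine archive job.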

Big data and machine learning can provide greater accuracy here. A data model has to be able to automatically distinguish between benign network traffic and potentially malicious traffic. To do this, the data collected describes groups of features, loosely referred to here as classifiers. These classifiers are chosen based on typical malware behaviours, and the ML algorithm then flags potentially malicious programmes using a probability model (i.e. when scanning a piece of software, on how many classifiers does it exhibit suspicious behaviour?).
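One way to picture such a probability model is a small naive Bayes scorer over binary behavioural features. This is only a sketch of the general idea: the feature names, the eight training rows and the 0.5 cut-off are invented for illustration, and a production system would use far more data and a proper ML library.

```python
import math

# Hypothetical binary features, one per suspicious behaviour (assumed for illustration).
FEATURES = ["mass_file_rewrites", "deletes_backups", "contacts_unknown_host"]

# Tiny made-up training set: (feature vector, label); 1 = malicious, 0 = benign.
TRAINING = [
    ((1, 1, 1), 1),
    ((1, 1, 0), 1),
    ((1, 0, 1), 1),
    ((1, 0, 0), 0),  # e.g. a compression tool rewriting many files
    ((0, 0, 1), 0),
    ((0, 0, 0), 0),
    ((0, 0, 0), 0),
    ((0, 1, 0), 0),
]


def train(samples):
    """Estimate class priors and per-feature likelihoods with Laplace smoothing."""
    model = {}
    for label in (0, 1):
        rows = [x for x, y in samples if y == label]
        prior = len(rows) / len(samples)
        likelihoods = [
            (sum(r[i] for r in rows) + 1) / (len(rows) + 2)
            for i in range(len(FEATURES))
        ]
        model[label] = (prior, likelihoods)
    return model


def malicious_probability(model, x):
    """Return P(malicious | x) under the naive Bayes independence assumption."""
    log_scores = {}
    for label, (prior, likelihoods) in model.items():
        score = math.log(prior)
        for value, p in zip(x, likelihoods):
            score += math.log(p if value else 1 - p)
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    return weights[1] / (weights[0] + weights[1])


model = train(TRAINING)
for sample in [(1, 1, 1), (1, 0, 0), (0, 0, 0)]:
    p = malicious_probability(model, sample)
    print(sample, round(p, 2), "flag" if p > 0.5 else "allow")
```

On this toy data, a programme that shows all three behaviours is flagged, while one that only rewrites many files (the compression case) scores below the cut-off. The more suspicious features a programme exhibits together, the higher its probability of being malicious.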

If it's not already apparent, a comprehensive list of classifiers also calls for a large amount of training data. Training data can be further broken down into positive and negative data. ‘Positive data’ refers to network traffic that has been infected by malware and ‘negative data’ refers to the aforementioned benign traffic. To give a clearer example, in the email context negative data refers to normal emails and positive data to phishing or spam emails.

As summarised by David Lopes Pegna in his exposition on ML and cybersecurity, the problem today lies in the lack of positive data: the abundance of negative data, or “normal network traffic”, skews the ML algorithm, leading it to over-diagnose applications and programmes as benign. The key to making machine learning a bona fide tool for cybersecurity is therefore identifying more positive data.
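A short sketch shows why this imbalance is so damaging. The numbers are hypothetical (10 positive samples against 990 negative ones), and the "model" is deliberately degenerate: it labels everything as benign.

```python
# Hypothetical, heavily imbalanced labels: 1 = positive (malicious), 0 = negative (benign).
labels = [1] * 10 + [0] * 990

# A degenerate "model" that calls every sample benign.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Share of all samples that were labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of truly malicious samples that were caught."""
    caught = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(caught) / len(caught) if caught else 0.0


print(accuracy(labels, predictions))  # 0.99
print(recall(labels, predictions))    # 0.0
```

The model scores 99% accuracy while catching none of the malicious samples, which is why a learner trained mostly on negative data drifts towards calling everything benign unless it is given more positive examples.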


This article was written by Joshua Chan, an intern here at Hackwagon Academy.