The field of cybersecurity, both offensive and defensive, has come a long way since Creeper, often cited as the first computer virus. Created by Bob Thomas in 1971, it moved itself from one host to another across the ARPANET, leaving its trail on each host. The program was a research project, and it was not intended to harm anything.
In today’s era, with a massive spider web of internet-connected systems, the attack surface is huge, and we are no longer talking about simple programs but about advanced persistent threats (APTs). They are advanced because the attackers have a full spectrum of intelligence-gathering techniques at their disposal. They are persistent because they follow a ‘low-and-slow’ approach to intrude into target devices and maintain long-term access to them. They are threats because they have both the capability and the intent to harm.
On the other hand, Data Science, Big Data and Artificial Intelligence (AI), along with sub-domains like Machine Learning and Deep Learning, are making tremendous progress in many business areas such as sales and marketing, product recommendation, image classification, risk analysis, fraud detection and autonomous driving, and the list goes on.
I have been working in the cybersecurity industry for a while, and I have now also gained some knowledge and professional skills in Data Science. I often pause and ponder: where is the perfect intersection of Data Science and security? If you search the internet for articles, research papers and, more importantly, datasets to work with in network or cyber security, the chances of finding what you need are lower than in other areas.
Artificial Intelligence connecting to Security
Now let’s talk about some of the biggest opportunities and challenges for this connection.
Opportunities: Both data science and cybersecurity jobs are in high demand, and combining the two skill sets will make you more marketable. Besides big names like Symantec and McAfee, there are many companies trying to pioneer predictive models in the cybersecurity space. Some of those companies are Cylance (acquired by BlackBerry), Darktrace, CrowdStrike, Carbon Black (acquired by VMware), Obsidian Security, and more. When big companies are acquiring start-ups for billions of dollars, the subject area is undoubtedly hot in the market.
Some of the use cases where machine learning (both supervised and unsupervised approaches) will shine in cyber are: detecting and predicting zero-day vulnerabilities (signature- or rule-based systems can only catch what has been seen in the past); classifying malicious vs. benign activity (spam vs. ham emails, malware vs. benign code, good vs. bad URLs, etc.); preventive or prescriptive maintenance of network devices; automating the repetitive tasks of incident management and security operations centers; detecting anomalous user behavior to prevent insider threats; risk modeling or scoring based on knowledge of the existing infrastructure and past data breaches; and graph analytics (also called network or link analysis) for catching bots.
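To make the spam vs. ham use case concrete, here is a minimal sketch of a supervised classifier: a multinomial Naive Bayes with add-one smoothing, written with only the Python standard library. The training messages and the function names (train_naive_bayes, classify) are made up for illustration, and a real system would train on a large labeled corpus with a proper ML library.

```python
import math
from collections import Counter

# Hypothetical toy training data (invented for this sketch, not a real corpus).
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday", "ham"),
]


def train_naive_bayes(examples):
    """Count words per class and documents per class."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: how common this class is in the training data.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(classify("free prize now", model))
print(classify("project meeting on monday", model))
```

The same pattern (labeled examples in, a scoring function out) applies to malware vs. benign code or good vs. bad URLs; only the features change.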
Challenges: I would like to reiterate that publicly available knowledge sources and datasets to learn from are still very limited. It is somewhat depressing that the KDD Cup 1999 dataset is still among the cleanest and most easily accessible datasets for intrusion detection. One of the biggest challenges is the confidentiality of the data itself: no company likes to share data about its network. Some of the sources I have found useful are: https://vizsec.org/data/ , https://github.com/jivoi/awesome-ml-for-cybersecurity. If you find more, please share the links in the comments. Using techniques like k-NN, it is also possible to generate synthetic data for research. However, synthetic data will not capture the true variance that models need to learn from.
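As a rough illustration of neighbor-based synthetic data, here is a minimal sketch of a SMOTE-style approach, which is one common k-nearest-neighbor technique and only my assumption about what a k-NN-based generator could look like. Each new point is interpolated between a real sample and one of its k nearest neighbors. The feature vectors and the names k_nearest and synthesize are hypothetical.

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda other: math.dist(point, other))[:k]


def synthesize(samples, n_new, k=2, seed=0):
    """Create n_new points, each lying between a real sample and a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        others = [s for s in samples if s is not base]
        neighbor = rng.choice(k_nearest(base, others, k))
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical feature vectors: (packets per second, mean packet size in bytes).
attack_samples = [(900.0, 60.0), (950.0, 64.0), (1000.0, 58.0), (870.0, 62.0)]
new_points = synthesize(attack_samples, n_new=5)
for point in new_points:
    print(point)
```

Every generated point lies on a line segment between two existing samples, so it never leaves the range the real data already covers. That is one way to see why synthetic data of this kind cannot supply variance the original data lacks.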
Another challenge is dealing with a high number of false positives. For example, if we flag good code as malware, that is a false positive. On the flip side, if we fail to flag malware and call it good code, that is a false negative. There is always a tradeoff between false positives and false negatives. In cyber, unlike other fields, it is always hard to decide which one to lower at the expense of the other. If the false positive rate is high, we add more time, resources and human intervention to resolve the alerts; if the false negative rate is high, real threats slip through.
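The tradeoff can be seen by moving the decision threshold of a detector. The sketch below uses made-up malware scores and labels, and the function names confusion_counts and rates are mine, not from any library. It computes the false positive rate and false negative rate at three thresholds.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are flagged as malware."""
    tp = fp = tn = fn = 0
    for score, is_malware in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_malware:
            tp += 1
        elif flagged and not is_malware:
            fp += 1  # good code flagged as malware
        elif not flagged and is_malware:
            fn += 1  # malware that slipped through
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate) at a threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical detector scores: first five files are benign, last five are malware.
scores = [0.05, 0.20, 0.35, 0.55, 0.70, 0.30, 0.60, 0.80, 0.90, 0.95]
labels = [False] * 5 + [True] * 5

for threshold in (0.25, 0.50, 0.75):
    fpr, fnr = rates(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.1f}  FNR={fnr:.1f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 drives the false positive rate from 0.6 down to 0.0 while the false negative rate climbs from 0.0 to 0.4. No threshold makes both rates zero here.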
Another challenge is creating a baseline of network activity for predicting anomalous behavior. In order to call something anomalous, we first need to know what normal behavior is. There are so many data points (logs) feeding into the system that it is not easy to come up with a baseline. What we consider normal behavior in one subnet might not be normal in another subnet or across networks, depending on the situation.
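Here is a minimal sketch of why a baseline has to be per subnet, assuming a very simple model in which "normal" is the mean and standard deviation of past hourly connection counts. The subnets, the counts, and the names build_baselines and is_anomalous are hypothetical, and a real baseline would need far more history and more than one feature.

```python
from statistics import mean, stdev


def build_baselines(history):
    """Map each subnet to the (mean, standard deviation) of its past counts."""
    return {subnet: (mean(counts), stdev(counts)) for subnet, counts in history.items()}


def is_anomalous(subnet, value, baselines, z_threshold=3.0):
    """Flag a value that is more than z_threshold deviations from the subnet's mean."""
    mu, sigma = baselines[subnet]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical hourly connection counts per subnet.
history = {
    "10.0.1.0/24": [100, 110, 95, 105, 90, 100],         # office workstations
    "10.0.2.0/24": [5000, 5200, 4800, 5100, 4900, 5000],  # busy servers
}
baselines = build_baselines(history)

# The same observation is judged against each subnet's own baseline.
print(is_anomalous("10.0.1.0/24", 5000, baselines))
print(is_anomalous("10.0.2.0/24", 5000, baselines))
```

A count of 5000 connections per hour is far outside the workstation subnet's baseline but exactly average for the server subnet, so a single global threshold would either miss the first case or flood analysts with alerts from the second.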
Despite these challenges, I still see some light at the end of the tunnel. The new findings and success stories in deep learning (such as RNNs, and especially LSTMs, since log data is time-series data), along with automated feature engineering, have given new hope to the cyber industry.