In a digital world, software is king.
Massive software development is taking place in every domain, and there is no doubt that software has eaten the world. Company codebases are huge, so the real challenge is maintaining such large codebases efficiently. With pressure for faster software development and delivery, bugs become more likely. And now bugs are eating software.
Bugs are important to fix, not only in mission-critical domains like aerospace or automotive software but everywhere. Small bugs can cost companies millions of dollars, along with the reputational damage that accompanies a security breach. Such bugs can cause serious trust issues among the users of a product and erode investor confidence. It is therefore essential that bugs are identified and fixed just in time.
Static code analysis: the past and the path forward.
Conventional static code analysis is driven by rules you define, which are checked against the codebase to flag violations. Traditional static analysis has existed for many years, but it takes considerable effort to keep up with the many security feeds that are updated daily and to write new rules for them.
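To make the rules-driven approach concrete, here is a minimal sketch of a rule checker: each rule is a pattern matched against every line of source code, and each match becomes a finding. The rule identifiers and patterns are hypothetical, invented for illustration.

```python
import re

# Hypothetical rules: an identifier, a pattern to flag, and a message.
RULES = [
    ("PY-SEC-EVAL", re.compile(r"\beval\s*\("), "use of eval() on untrusted input"),
    ("PY-SEC-PICKLE", re.compile(r"\bpickle\.loads?\s*\("), "unsafe deserialization"),
]

def scan(source: str):
    """Return (line_no, rule_id, message) for every rule violation found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for rule_id, pattern, message in RULES:
            if pattern.search(line):
                findings.append((line_no, rule_id, message))
    return findings

code = "import pickle\nobj = pickle.loads(data)\nresult = eval(user_input)\n"
print(scan(code))
```

Every new vulnerability class needs a new hand-written rule like the ones above, which is exactly the maintenance burden that motivates a data-driven approach.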
We all know the power of data, and the massive body of open-source public code (often referred to as “Big-Code”) is a great data source that can be harnessed for useful insights. With this massive amount of “Big-Code” available, and with advances in deep learning, especially graph neural networks, hierarchical neural networks, and transformer-based models, static analysis can be made data-driven instead of rules-driven.
Data-driven static code analysis: the cutting edge of code quality maintenance.
Applying machine learning to source code is a great idea, but quite challenging for several reasons. First, source code is highly structured, unlike natural language. Second, collecting, identifying, or creating a rich, high-quality dataset is still difficult, largely because of human factors such as how well developers write commit messages or whether they fix one issue per commit. Third, modeling source code is challenging, as we need effective code representation techniques. Fourth, deploying large language models on commodity hardware may require several optimizations, which can affect model performance. Finally, machine learning models are usually black boxes; to convince developers that a given piece of code is vulnerable, a model built for vulnerability detection must be explainable.
While these are significant challenges, once we overcome them, this data-driven technique yields state-of-the-art results. These models can also be continuously trained and improved with new data. Not only is it possible to train them continuously on new open-source code, but they can also be trained on a particular company’s codebase, together with open-source repositories whose code resembles it, to learn patterns specific to that company’s developers.
Vulnerability detection in source code using machine learning can operate at line level, method level, or file level. Generally, machine learning-based techniques for vulnerability detection work at line or method level. This involves collecting vulnerable functions using the GitHub APIs combined with some heuristics, adding abstraction at the token level, and training a model such as a Conv1D or LSTM network. Some newer techniques represent source code as a graph (Abstract Syntax Tree, Control Flow Graph, or Program Dependency Graph) or a combination of graphs (the Code Property Graph), and model these using Graph Neural Networks. (Read more here)
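As a small illustration of the code-as-graph idea, the sketch below uses Python’s standard `ast` module to parse a function and extract parent-to-child edges between syntax-tree nodes. This shows only the graph extraction step; a graph neural network would then consume such nodes and edges as input, which is beyond this sketch.

```python
import ast

def ast_edges(source: str):
    """Parse source into an Abstract Syntax Tree and return the
    parent -> child edges as pairs of node-type names."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

edges = ast_edges("def add(a, b):\n    return a + b\n")
print(edges)
```

A Control Flow Graph or Program Dependency Graph would add different edge types over the same nodes, which is how the Code Property Graph combines them.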
Embold: the best of both worlds.
Embold takes machine learning for static analysis to the next level with its powerful Recommendation Engine (RE). Embold combines traditional rules-based analysis (visit https://rules.embold.io/ for all the rules) with data-driven analysis of codebases to discover new vulnerabilities in the scanned codebase, catch bugs that have been repeated in the same codebase just in time, early in the software development life cycle, and keep technical debt from accumulating over time.
Learning from past bug fixes should not be lost: most new bugs are not actually new; they are repeats of past bugs (around 70%).
Version control platforms like GitHub and Bitbucket, and tools like Jira, keep track of all issues in software development. Usually, bugs are logged in the issue tracker, and developers commit fixes for a bug (or issue) with a mention of the issue ID in the commit message. In fact, with Big-Code learning, it is even possible to detect whether a commit introduces a bug, based on past commit history and some metadata associated with commits; this is termed just-in-time defect detection.
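The commit-to-issue link described above is typically recovered by parsing commit messages. Here is a minimal sketch, assuming two common but hypothetical conventions: Jira-style keys such as PROJ-123 and GitHub-style references such as #42.

```python
import re

# Two alternatives: Jira-style keys (e.g. PROJ-123) or GitHub refs (e.g. #42).
ISSUE_ID = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b|#(\d+)")

def linked_issues(commit_message: str):
    """Extract issue identifiers mentioned in a commit message;
    this link is what lets us label a commit as a bug fix."""
    ids = []
    for jira, gh in ISSUE_ID.findall(commit_message):
        ids.append(jira if jira else f"#{gh}")
    return ids

msg = "Fix NPE in parser (PAY-1042); closes #311"
print(linked_issues(msg))  # ['PAY-1042', '#311']
```

Commits that resolve issues labeled as bugs become positive examples for a defect-detection model; real projects would tune these patterns to their own conventions.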
Embold’s RE learns from the historical issues in a codebase and highlights potential issues that developers can fix before committing the code. The RE considers commit history as well as issue-tracking information to produce its suggestions.
Embold’s RE uses the hierarchical structure of commits (a commit changes multiple files, each with multiple hunks in which certain lines are added or removed), together with information from issue management software, to build a knowledge graph. It applies hierarchical neural networks on top of this graph to learn patterns of code change (read more about representing code changes as vectors here) and predict whether a code change is buggy. It can then suggest fixes from the knowledge graph.
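The commit hierarchy can be sketched as a nested data structure; a hierarchical model would encode it bottom-up, from lines to hunks to files to the commit. All class and field names below are hypothetical, chosen only to illustrate the commit → file → hunk → line structure; this is not Embold’s internal representation.

```python
from dataclasses import dataclass

@dataclass
class Hunk:
    added: list      # lines added in this hunk
    removed: list    # lines removed in this hunk

@dataclass
class FileChange:
    path: str
    hunks: list      # list of Hunk

@dataclass
class Commit:
    message: str
    issue_ids: list  # issue-tracker IDs linked via the commit message
    files: list      # list of FileChange

    def changed_lines(self) -> int:
        """Total lines touched, aggregated up the hierarchy."""
        return sum(len(h.added) + len(h.removed)
                   for f in self.files for h in f.hunks)

commit = Commit(
    message="Fix off-by-one in pager (PAY-1042)",
    issue_ids=["PAY-1042"],
    files=[FileChange("pager.py", [Hunk(added=["i < n"], removed=["i <= n"])])],
)
print(commit.changed_lines())  # 2
```

In a knowledge graph, each level of this hierarchy becomes a node type, with edges from commits to files, files to hunks, and commits to the issues they reference.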
Book a demo here to learn more.