My thesis
SecureSync: Detection of Recurring Software Vulnerabilities
The official website is here
PROBLEM
Software security vulnerabilities are discovered on an almost daily basis and have caused substantial damage. We conducted an empirical study of thousands of vulnerabilities and found that many of them recur because of software reuse. Based on the knowledge gained from the study, we developed SecureSync, an automatic tool that detects recurring software vulnerabilities in systems that reuse source code or libraries.
APPROACH
1. Key philosophy
The key philosophy/hypothesis in SecureSync is “The reuse of design, specification APIs/libraries, frameworks, or source code could cause recurring vulnerabilities, i.e., when a bug or a security vulnerability occurs in one software entity, it likely occurs in the corresponding counterparts of that entity.”
2. Empirical study
We analyzed around 3000 vulnerability reports in four security databases:
- National Vulnerability Database (NVD),
- Open Source Computer Emergency Response Team Advisories (oCERT),
- Mozilla Foundation Security Advisories (MFSA), and
- Apache Security Team (ASF).
We also collected and analyzed the available source code, bug reports, and relevant discussions about those vulnerabilities.
The detailed report and some interesting examples from our empirical study can be found here.
3. Prototype tool
Building on the empirical study of recurring vulnerabilities, we developed SecureSync, a prototype tool that helps developers deal with recurring vulnerabilities. The main task of SecureSync is to identify recurring vulnerabilities that exist across different systems; when a vulnerable system is found, SecureSync suggests a patch for that system by consulting the sample patch. The working process of SecureSync is illustrated in Figure 1.

Figure 1: SecureSync working process
4. Vulnerability Detection
There are four tasks SecureSync needs to perform:
- Representation
- Feature Extraction and Similarity Measure
- Candidate Searching
- Patch Recommendation
4.1 Type 1 Vulnerability Detection
Representation
To detect Type 1 recurring vulnerabilities, SecureSync represents code fragments, including vulnerable and patched code in its knowledge base, via an AST-like structure, which we call extended AST (xAST), that incorporates textual and structural features of code fragments.
Feature Extraction and Similarity Measure
The feature set of a code fragment A is a set of xASTs, each representing one statement of A. The similarity of two fragments is then measured via the similarity of their corresponding feature sets of xASTs.
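As a rough illustration of the idea above, the sketch below builds one feature per statement and compares two fragments by the overlap of their feature sets. It is not the SecureSync implementation: Python's built-in `ast` module stands in for xAST, and Jaccard overlap is an assumed similarity measure chosen only for simplicity. The names `statement_features` and `fragment_similarity` are hypothetical.

```python
import ast

def statement_features(source):
    """Return one feature per statement: a dump of that statement's AST.

    Assumption: Python's ast is used here as a stand-in for xAST.
    """
    tree = ast.parse(source)
    return {ast.dump(node) for node in ast.walk(tree) if isinstance(node, ast.stmt)}

def fragment_similarity(source_a, source_b):
    """Jaccard overlap of the two fragments' statement feature sets (assumed measure)."""
    features_a = statement_features(source_a)
    features_b = statement_features(source_b)
    union = features_a | features_b
    if not union:
        return 1.0  # two empty fragments are treated as identical
    return len(features_a & features_b) / len(union)

# Hypothetical fragments: the candidate reuses both vulnerable statements
# and adds one unrelated statement.
vulnerable = "buf = read(n)\ncopy(dst, buf)\n"
candidate = "buf = read(n)\ncopy(dst, buf)\nlog(buf)\n"
score = fragment_similarity(vulnerable, candidate)
```

Because each statement is a separate feature, a candidate can still score well when the reused statements are interleaved with unrelated code.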
Candidate Searching
Because vulnerable code can be scattered across several non-consecutive statements, SecureSync first searches for candidates at the statement level and merges them later at the method level.
To improve searching, SecureSync uses two filtering techniques:
- Text-based filtering: keep only source files with specific tokens.
- Structure-based filtering: keep only statements whose xAST structure is similar to those in the knowledge base.
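The two filters above could look roughly like the following sketch. Several details are assumptions made for illustration: a file passes the text filter only if it contains all of the given tokens, Python's `ast` again stands in for xAST, and "similar structure" is simplified to an exact match of node-type sequences with identifiers ignored. The function names are hypothetical.

```python
import ast
import re

def text_filter(files, required_tokens):
    """Keep only files whose identifier tokens include every required token."""
    required = set(required_tokens)
    return {name: src for name, src in files.items()
            if required <= set(re.findall(r"[A-Za-z_]\w*", src))}

def shape(stmt):
    """Structural signature of a statement: node type names, identifiers ignored."""
    return tuple(type(node).__name__ for node in ast.walk(stmt))

def structure_filter(source, known_shapes):
    """Return line numbers of statements whose shape matches a knowledge-base shape."""
    tree = ast.parse(source)
    return [node.lineno for node in ast.walk(tree)
            if isinstance(node, ast.stmt) and shape(node) in known_shapes]

# Hypothetical knowledge base: one vulnerable statement of the form `x = f(y)`.
known_shapes = {shape(ast.parse("x = f(y)").body[0])}
files = {"a.py": "buf = read(n)\ncopy(dst, buf)\n", "b.py": "print('hi')\n"}

for name, src in text_filter(files, {"read", "copy"}).items():
    print(name, structure_filter(src, known_shapes))
```

Running the cheap text filter first means the more expensive structural comparison only sees files that mention the relevant tokens.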
Patch Recommendation
Type 1 vulnerabilities arise from the reuse of source code. Therefore, the patch for newly found vulnerable code is highly similar to the one in the knowledge base. For that reason, SecureSync points out the location of the vulnerable statements and the sample patches so that developers can derive the new patches.
4.2 Type 2 Vulnerability Detection
Type 2 vulnerabilities are caused by the misuse or mishandling of APIs. Therefore, we focus on API usages to detect such recurring vulnerabilities. In SecureSync, API usages are represented as graph-based models, in which nodes represent the usages of API function calls, data structures, and control structures, and edges represent the relations or control/data dependencies between them. Our graph-based representation for API usages is called Extended GRaph-based Usage Model (xGRUM).
Feature Extraction and Similarity Measure
First, using a graph alignment algorithm, SecureSync extracts the feature sets F(A) and F(A’), which are the sets of changed nodes in the vulnerable code A (represented as xGRUM G) and the patched code A’ (represented as xGRUM G’), respectively.
Given a candidate code fragment B with corresponding xGRUM H, the alignment algorithm is applied to find the two sets of mapped nodes between F(A) and H. SecureSync then builds two xGRUMs from those two sets of nodes together with their neighbor (dependent) nodes to cover the context of the API usage. The similarity between code fragment B and vulnerable code fragment A is measured via the similarity of those two xGRUMs.
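The following is a simplified sketch of that comparison, not the actual xGRUM algorithm. It assumes a graph is a dictionary from a node label to the set of its successor labels, that matching labels can stand in for graph alignment, that a node counts as "changed" when it is missing from the other graph or its neighbors differ, and that edge overlap (Jaccard) is the similarity measure. All function names are hypothetical.

```python
def neighbors(graph, node):
    """Successors and predecessors of a node in a {label: set(successors)} graph."""
    successors = graph.get(node, set())
    predecessors = {n for n, outs in graph.items() if node in outs}
    return successors | predecessors

def changed_nodes(graph, other):
    """Nodes of `graph` missing from `other` or with a different neighborhood."""
    return {n for n in graph
            if n not in other or neighbors(graph, n) != neighbors(other, n)}

def context_edges(graph, nodes):
    """Edges touching the given nodes, i.e. the nodes plus their dependent neighbors."""
    return {(a, b) for a, outs in graph.items() for b in outs
            if a in nodes or b in nodes}

def usage_similarity(g_vulnerable, g_patched, h_candidate):
    """Compare the changed part of the vulnerable usage with its match in the candidate."""
    f_a = changed_nodes(g_vulnerable, g_patched)
    mapped = f_a & set(h_candidate)  # assumption: equal labels stand in for alignment
    edges_a = context_edges(g_vulnerable, mapped)
    edges_h = context_edges(h_candidate, mapped)
    union = edges_a | edges_h
    return len(edges_a & edges_h) / len(union) if union else 0.0

# Hypothetical usage: the patch inserts a null check between malloc and memcpy.
G = {"malloc": {"memcpy"}, "memcpy": set()}
G_patched = {"malloc": {"check_null"}, "check_null": {"memcpy"}, "memcpy": set()}
H = {"open": {"malloc"}, "malloc": {"memcpy"}, "memcpy": {"close"}, "close": set()}
score = usage_similarity(G, G_patched, H)
```

In this toy example the candidate H keeps the direct malloc-to-memcpy edge, so it shares context with the vulnerable usage, whereas a candidate that already contains the null check would share none of those edges.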
Candidate Searching
To improve searching, SecureSync also uses two filtering techniques:
- Text-based filtering: keep only source files with specific tokens.
- Set-based filtering: keep only graphs containing nodes with the same labels as nodes of graphs in the knowledge base.
The chosen candidates are then compared against a pair of vulnerable and patched code fragments in the knowledge base.
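A minimal sketch of set-based filtering followed by the comparison against a vulnerable/patched pair is shown below. It reuses the assumed graph encoding of a dictionary from node label to successor labels. The `min_shared` threshold, the label-overlap similarity, and the decision rule "closer to the vulnerable sample than to the patched one" are illustrative assumptions rather than the tool's documented behavior.

```python
def set_filter(candidates, kb_graphs, min_shared=1):
    """Keep candidates sharing at least `min_shared` node labels with the knowledge base."""
    kb_labels = set()
    for graph in kb_graphs:
        kb_labels |= set(graph)
    return sorted(name for name, graph in candidates.items()
                  if len(set(graph) & kb_labels) >= min_shared)

def label_similarity(g, h):
    """Jaccard overlap of node labels (a coarse stand-in for xGRUM similarity)."""
    a, b = set(g), set(h)
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_candidates(candidates, vulnerable, patched):
    """Flag filtered candidates that look more like the vulnerable than the patched sample."""
    kept = set_filter(candidates, [vulnerable, patched])
    return [name for name in kept
            if label_similarity(candidates[name], vulnerable)
            > label_similarity(candidates[name], patched)]

# Hypothetical knowledge-base pair and candidates.
vulnerable = {"malloc": {"memcpy"}, "memcpy": set()}
patched = {"malloc": {"check_null"}, "check_null": {"memcpy"}, "memcpy": set()}
candidates = {
    "f": {"malloc": {"memcpy"}, "memcpy": {"free"}, "free": set()},
    "g": {"malloc": {"check_null"}, "check_null": {"memcpy"},
          "memcpy": {"free"}, "free": set()},
    "h": {"printf": set()},
}
flagged = flag_candidates(candidates, vulnerable, patched)
```

Here `h` is dropped by the filter because it shares no labels with the knowledge base, and `g` is kept by the filter but not flagged because it already resembles the patched sample.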
Patch Recommendation
Type 2 vulnerabilities arise from shared API/library usage. SecureSync therefore focuses only on fixing the incorrect usage, by suggesting changes related to APIs/libraries:
- Addition of missing function calls
- Deletion of function calls
- Checking of input/output before/after function calls.
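One simple way to picture these suggestions is to diff the API call sequence of the vulnerable sample against that of the patched sample. The sketch below does this with Python's standard `difflib`. Treating an API usage as a flat list of call names is an assumption made for illustration (SecureSync works on graphs), and an added check simply appears here as an added call. The name `suggest_changes` is hypothetical.

```python
import difflib

def suggest_changes(vulnerable_calls, patched_calls):
    """List ("add" | "delete", call) suggestions that turn one call sequence into the other."""
    suggestions = []
    matcher = difflib.SequenceMatcher(a=vulnerable_calls, b=patched_calls, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            suggestions += [("delete", call) for call in vulnerable_calls[i1:i2]]
        if tag in ("insert", "replace"):
            suggestions += [("add", call) for call in patched_calls[j1:j2]]
    return suggestions

# Hypothetical sample pair: the patch adds a null check before memcpy.
vulnerable_calls = ["malloc", "memcpy", "free"]
patched_calls = ["malloc", "check_null", "memcpy", "free"]
for action, call in suggest_changes(vulnerable_calls, patched_calls):
    print(action, call)
```

The same suggestions can then be shown next to the matching calls in the candidate code, leaving the actual patch to the developer.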
5. Evaluation
The evaluation of the vulnerability detection, with the latest results, can be found here.
Here is the presentation:
—LOG—
- 06/07/2010: My paper “Detection of Recurring Software Vulnerabilities” has been accepted at ASE 2010. Hurray!!!
- I am working on a security-related project. My ambitious, ultimate goal is to detect and fix vulnerabilities in different modules of a system or across different systems. I know it’s a very hard problem. The problem, however, becomes solvable with some relaxation. The work has been submitted to the NIER track of ICSE 2010 and is used for my master’s dissertation. The initial results are very encouraging and I am excited.
- 02/01/2010: I proposed two types of vulnerabilities to detect in my thesis:
->Type 1: The vulnerability occurs in one specific API/library that is used in the same way in different versions of one system or across different systems. The interface of this API/lib may be changed to adapt to specific requirements in different systems.
->Type 2: We focus on vulnerabilities in the usage of APIs/libraries, where several systems use the same (wrong/buggy) API/lib usage to do the same task in different scenarios. This type of vulnerability is harder to detect because these systems share very little similar code (the code that invokes the vulnerable APIs/libs), and that code is scattered among completely different code in different modules. It is important to design a new representation of source code that captures the usage of APIs/libs and to extract those usages.
- 02/04/2010: I have come up with a new graph representation for my security model. The design focuses on the type of vulnerability that reuses APIs in different contexts. The representation is informative enough and strict enough to extract only the relevant information about APIs that are suspected of being used incorrectly. Essentially, I designed a graph representation that captures information such as the order of function calls, the branching points, and the branching conditions.
- 02/09/2010: I have finished the code that builds the new graph representation of source code, called SSGraph, and extracts the subgraphs/subsets of functions (called SSSign) that capture the wrong usage of APIs/libs. As the next step, I will automatically look into other systems, represent their source code as SSGraphs, and find potential candidates in which an SSSign appears.
- 02/10/2010: My paper for ICSE 2010 NIER track has been accepted !!!
Dear Nam, Tung, Hoan, Xinying, Anh and Tien, We are pleased to inform you that your paper, “Detecting Recurring and Similar Software Vulnerabilities”(Paper-ID: xxx) has been accepted for presentation in the ICSE New Ideas and Emergent Results program and for publication in the conference companion proceedings. The competition was strong: only 19 of the 76 submissions were accepted, giving an acceptance rate of 25.0%.
- 02/14/2010: I have finished the second round of coding. There are some interesting features:
-> Automatically represent the source code as an SSGraph and extract “the core” pattern shared by the sample code of a vulnerability.
-> Automatically search other systems (via Google Code Search or in a database) for potentially similar vulnerabilities.
-> Calculate the distance between the candidate and the buggy and patched samples to determine whether it is a vulnerability or not.
- 02/22/2010: I had some trouble coding and debugging the package for detecting Type 1. I needed to create the oracle and the input data again. This time, with a semi-automated tool I wrote, the data preparation is pretty fast. The previous data were not extracted correctly from the bug database. Therefore, the Type 1 detection result (which I expected to be pretty high) is not satisfactory. I plan to finish Type 1 quickly so that I can focus more on Type 2, which is the key contribution. There is, however, a lot of interesting engineering work on Type 1 to make it scalable. The whole database of source code alone is 7.2GB. One possible solution I came up with after discussing with Dr. Jeff Foster is to use an XDR representation to transform source code into an intermediate format for faster parsing.
- 03/19/2010: Finally, I have finished the project and submitted a paper to ASE. The title of the paper is “Detection of Recurring Software Vulnerabilities”. The project website will be up soon. I also intend to use it for my master’s thesis.
- 03/24/2010: My thesis is almost done. I am also building the website for this project. It is still under construction – a lot of details need to be filled in. I aim to finish the website by this week. You can check it out at SecureSync Website
—-LOG—