Google Summer of Code 2021 Final Report
Organization - AboutCode
VulnerableCode is a decentralized Python program that collects data about open source software vulnerabilities from across the internet. My proposal for this year’s Google Summer of Code involved improving the import speed, refactoring existing code, finding points for overall improvement and adding importers.
Improve Import Time
Profiling showed that a lot of time was being wasted on the auto-commits performed by Django. Wrapping the importer in an atomic block avoids a large number of database commits and yields a huge performance improvement. This simple change allows for much faster import times while not drastically changing the code structure:
Alpine: 202.7s -> 50.9s
Archlinux: 2116.6s -> 107.8s
Gentoo: 3176.3s -> 225.8s
Summed over these three importers, that is a 93% reduction in import time (about 14x faster).
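To illustrate why a single atomic block helps, here is a minimal sketch that swaps Django for the standard library's sqlite3 module so that it runs on its own. It is not the VulnerableCode code: the advisory table and the helpers make_db, import_rows and count_rows are hypothetical names made up for this example. In Django itself, the equivalent of the `with conn:` block would be wrapping the import in `transaction.atomic()`.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a single, hypothetical advisory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE advisory (id INTEGER PRIMARY KEY, summary TEXT)")
    conn.commit()
    return conn


def import_rows(conn, summaries, atomic):
    """Insert one row per summary, committing per row or once for the batch."""
    if atomic:
        # One transaction for the whole import: the commit happens once, on exit.
        with conn:
            for summary in summaries:
                conn.execute("INSERT INTO advisory (summary) VALUES (?)", (summary,))
    else:
        # Mimics autocommit: every single insert pays for its own commit.
        for summary in summaries:
            conn.execute("INSERT INTO advisory (summary) VALUES (?)", (summary,))
            conn.commit()


def count_rows(conn):
    """Return the number of rows currently stored in the advisory table."""
    return conn.execute("SELECT COUNT(*) FROM advisory").fetchone()[0]
```

Both modes leave the same rows in the table. The only difference is how many commits are issued, and on a real on-disk database each commit is comparatively expensive, which is where the wasted time came from.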
Speed up upstream tests
VulnerableCode performs upstream tests for all the importers to make sure that any change in the upstream data structure is easily spotted. This allows us to look at failing importers without actually deploying the application.
Earlier, all of the importers were run one by one to verify that they were intact. While this is the obvious and foolproof way to detect anomalies in the imported data schema, it did not work in practice: running all the importers took well over 6 hours, which is the maximum time a GitHub Actions job is allowed to run. With this PR, the updated_advisories method of each importer is expected to create at least one Advisory object. If it does so, the importer is marked as working. While this is not foolproof, it stays well below the allowed resource usage cap. In the end, this is a trade-off between resource usage and data accuracy, and it brings a major speedup to the tests.
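The check described above can be sketched roughly as follows. This is only an illustration under assumptions: the Advisory dataclass, WorkingImporter and BrokenImporter are hypothetical stand-ins, and I assume here that updated_advisories yields Advisory objects one at a time, which may not match the real signature.

```python
from dataclasses import dataclass


@dataclass
class Advisory:
    """Hypothetical stand-in for the project's Advisory object."""
    summary: str


class WorkingImporter:
    """Hypothetical importer whose upstream data still parses."""

    def __init__(self):
        self.produced = 0

    def updated_advisories(self):
        # A real importer could keep yielding for hours.
        for i in range(1_000_000):
            self.produced += 1
            yield Advisory(summary=f"example advisory {i}")


class BrokenImporter:
    """Hypothetical importer whose upstream format changed: nothing is parsed."""

    def updated_advisories(self):
        return iter(())


def importer_is_working(importer) -> bool:
    """Return True as soon as the importer yields a single Advisory."""
    for item in importer.updated_advisories():
        if isinstance(item, Advisory):
            return True  # stop early instead of running the full import
    return False
```

Because the generator is abandoned after the first Advisory, the cost of the test no longer depends on how much data the upstream source holds.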
Improve Docker Configuration
The preferred mode of deployment for VulnerableCode is Docker images. The
earlier Docker configuration was insecure and rudimentary. I took inspiration
for a uniform Docker configuration from the ScanCodeIO project and provided
detailed documentation for installation using a Docker image. The current
configuration uses files like docker.env to supply the container’s environment
and .dockerignore to skip over any files unnecessary for deployment.
Makefile usage is prevalent in sister projects like ScanCodeIO. It gives
VulnerableCode consistent behavior and provides a very friendly interface for
invocations. It also avoids security risks like shipping a default Django
SECRET_KEY, since a key can easily be generated by a make target. I added a
Makefile whose usage is similar to that of ScanCodeIO, replaced all the CI
tests to use make, updated the
relevant part of the documentation and updated settings to reject insecure
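As a rough idea of what generating a key involves, here is a small standard-library sketch. It is not the project's make target: generate_secret_key and ALLOWED_CHARS are hypothetical names, and the character set and default length are modelled on Django's get_random_secret_key() helper.

```python
import secrets

# Lowercase letters, digits and punctuation, modelled on Django's helper.
ALLOWED_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_=+)"


def generate_secret_key(length: int = 50) -> str:
    """Return a random key drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALLOWED_CHARS) for _ in range(length))


if __name__ == "__main__":
    # Print a fresh key so it can be copied into the deployment environment.
    print(generate_secret_key())
```

Generating the key at setup time means no shared default value ever ends up in the repository.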
Separate import and improve operations - WIP
This introduces a new concept: the improver. Earlier, data fetching and
improvement were done as one single process by the importer. This meant that
importers were convoluted and not very modular. The concept of the improver
comes from the idea that an importer should only do one thing - import. Any
further improvement on the data is delegated to the improvers. This allows us
to have multiple ways of improving the data, each with a certain confidence in
the result, making the import and improve operations modular and simpler to
work with. As a bonus, writing importers will become much easier, which should
welcome more contributors to the project. As of writing this report, this
remains a work in progress which will be finished very soon.
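Since this work is still in progress, the following is only a sketch of the split, not the final design. Every name in it (AdvisoryData, Inference, ExampleImporter, KnownVersionsImprover, the 0-100 confidence scale and the placeholder ID "CVE-0000-0001") is hypothetical and chosen just to show an importer that only imports and an improver that refines the data afterwards.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class AdvisoryData:
    """Raw record from an importer, kept as close to upstream as possible."""
    vulnerability_id: str
    affected_range: str  # e.g. "<1.2.0", stored exactly as published


@dataclass(frozen=True)
class Inference:
    """Improved record from an improver, with a confidence score."""
    vulnerability_id: str
    affected_versions: Tuple[str, ...]
    confidence: int  # hypothetical 0-100 scale


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


class ExampleImporter:
    """Only imports: turns upstream records into AdvisoryData, nothing else."""

    def __init__(self, upstream_records):
        self.upstream_records = upstream_records

    def advisory_data(self) -> Iterator[AdvisoryData]:
        for record in self.upstream_records:
            yield AdvisoryData(record["id"], record["range"])


class KnownVersionsImprover:
    """Resolves a "<X" range against a list of known released versions."""

    def __init__(self, known_versions):
        self.known_versions = known_versions

    def get_inferences(self, advisory: AdvisoryData) -> List[Inference]:
        limit = parse_version(advisory.affected_range.lstrip("<"))
        affected = tuple(
            v for v in self.known_versions if parse_version(v) < limit
        )
        # The confidence value is arbitrary here; it only shows where it lives.
        return [Inference(advisory.vulnerability_id, affected, confidence=80)]


importer = ExampleImporter([{"id": "CVE-0000-0001", "range": "<1.2.0"}])
improver = KnownVersionsImprover(["1.0.0", "1.1.5", "1.2.0", "1.10.0"])
inferences = [
    inference
    for advisory in importer.advisory_data()
    for inference in improver.get_inferences(advisory)
]
```

With this shape, a second improver using a different strategy could run over the same AdvisoryData without the importer changing at all.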
helper: split_markdown_front_matter: https://github.com/nexB/vulnerablecode/pull/443
Dump yaml in favor of saneyaml: https://github.com/nexB/vulnerablecode/pull/452
Refactor package_managers: https://github.com/nexB/vulnerablecode/pull/495/commits
Importers bugfix: https://github.com/nexB/vulnerablecode/pull/505
I started to like VulnerableCode as soon as I laid eyes on the project. While exploring the codebase, I realized that there was a lot of room for improvement, so in the early stage I looked for simple improvements and bugs to fix, which were:
Fix istio importer (cleared a huge confusion about the codebase)
Add me to AUTHORS (Should’ve done this a lot earlier)
Fix redhat import failure (This one took a lot of effort to pinpoint)
Post GSoC - Future Plans and what’s left
I wish to carry on with the development of VulnerableCode and implement the ideas suggested by my mentors. Bringing VulnerableCode to a stable point will require a lot of effort. I hope to see VulnerableCode integrated into the ScanCode toolkit in the near future.
Further, if possible, I would like VulnerableCode to interact with other great open source tools like Eclipse Steady and Prospector. VulnerableCode currently works statically, collecting vulnerabilities from different data sources. Meanwhile, there have been some developments with the Prospector project of Eclipse Steady. The project aims to scan the fix-commits of a git repository in order to find out whether the vulnerable part of a library is actually used in a project. A vulnerable library does not necessarily make every project built upon it vulnerable, so it is crucial to identify whether updating the library in use, and dealing with the breaking changes, is worth it. Prospector is undergoing improvements in order to be released as a usable public tool. Project KB (under Eclipse Steady) is also working on “tool support for mining repositories and databases of advisories to establish the (missing) link between vulnerabilities (as described in natural language in the advisories) and the corresponding fix-commits”. When these projects are ready for public use, I would like to add them to VulnerableCode as modules. I hope this will benefit both projects and their downstream users.
Beyond everything mentioned above, writing importers and improvers is still left to do. In my opinion, this should be addressed once VulnerableCode has a stable structure.
I really enjoyed working on the project. There were ups and downs when I met some weird bugs, but every one of them taught me something new about Python, Django and programming in general. The best part of working with my amazing mentors - Philippe and Shivam - was the weekly meetings, where we would together try to figure out how to proceed with the development. I learned something new with every call and interaction we had. Thank you so much to my mentors for providing a very smooth experience, and to Google for showing me the guiding light for participation.
To the reader, I would really like you to read this before Philippe asks you to ;)