How we sped up fetching our Gitlab resources in Python tenfold

Sep 9, 2020

Some time ago, a colleague approached me about a problem with one of our team’s homebrew Python scripts: a handy tool that scans all the past jobs in a Gitlab project, searches for the ones that leaked sensitive data, and can optionally delete them.

His complaint was that the script was too slow and struggled with more than a couple of hundred jobs. So I thought I’d take a look at it, because I like optimization challenges, as the success is always so satisfying :).

Background

In our company, Gitlab is used extensively as an all-in-one solution for source code management, deployment and integration of services. Our workflow is heavily dependent on it, and sometimes we find ourselves needing to automate some of the more bothersome or time-consuming tasks, as one does.

This is where the python-gitlab package comes into play. For those not familiar, python-gitlab is a wrapper for the Gitlab API that lets you work with its resources in a more programmatic way, by representing them as Python objects and letting you make simple function calls instead of raw HTTP requests.

In this particular case, the python-gitlab library was used to first fetch all the jobs belonging to a project (or rather the metadata of these jobs) and then to retrieve the trace (log) of every one of them, in which we search for a keyword. If a match is found, the job is deleted.
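The scan-and-delete loop described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `find_leaky_jobs` is hypothetical, and it assumes the job objects expose python-gitlab's `trace()` and `erase()` methods, with "deleting" a job mapped to erasing its trace and artifacts.

```python
def find_leaky_jobs(jobs, keyword, erase=False):
    """Return the IDs of jobs whose trace contains `keyword`.

    `jobs` is any iterable of objects with `id`, `trace()` and `erase()`,
    as python-gitlab job objects are assumed to provide.
    """
    leaky_ids = []
    for job in jobs:
        trace = job.trace()
        # The trace is expected as raw bytes; decode defensively.
        if isinstance(trace, bytes):
            text = trace.decode("utf-8", errors="replace")
        else:
            text = str(trace)
        if keyword in text:
            leaky_ids.append(job.id)
            if erase:
                # Assumption: erasing the job is what "deleting" means here.
                job.erase()
    return leaky_ids
```

Running it once with `erase=False` gives a dry run that only reports the matching job IDs.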

Debugging

Opening the project was my first encounter with python-gitlab. After a quick rundown of the code I found that the jobs were fetched something like this:
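A minimal sketch of such a fetch, assuming the script used python-gitlab's usual `projects.get` and `jobs.list(all=True)` calls. The function name is hypothetical and the original script may well have differed:

```python
def fetch_jobs_sequentially(gl, project_id):
    """Fetch the metadata of every job in one project, page by page.

    `gl` is assumed to be a python-gitlab session object, e.g.
    gitlab.Gitlab("https://gitlab.example.com", private_token="...").
    """
    project = gl.projects.get(project_id)
    # all=True asks python-gitlab to follow the pagination itself, which
    # means one blocking request per page. Newer releases may name this
    # flag differently (e.g. get_all).
    return project.jobs.list(all=True)
```

Every page is requested only after the previous one has arrived, so the total time grows linearly with the number of pages.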

Once I turned the logging level up to debug and watched the logs as the program ran, the main problem became apparent. Requests were being sent only about twice a second and each retrieved just 20 results per page (the page size can be raised to 100, but the requests are then even slower). Again, this was NOT the content of the jobs that was being retrieved, just metadata such as ID, status, name, etc. Our projects usually have a couple of hundred, sometimes thousands of jobs, making this an unnecessary slowdown. Fetching the IDs of the jobs was taking longer(!) than fetching the contents of the jobs, which is just plain wrong and inefficient.

So the solution becomes obvious: we just need to make the requests asynchronous to cut down on the I/O wait time. This could have been simple enough, but there is one problem. The python-gitlab package represents resources as its own, non-generic objects, and the program works with these objects later on, so just making asynchronous requests to the Gitlab API on our own won’t cut it. We also have to create objects native to python-gitlab so the rest of the program can keep working with them.
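To put rough numbers on this, here is a back-of-the-envelope estimate using the figures above (20 results per page, about two requests a second). The project size of 2000 jobs is purely illustrative, and the function name is made up for this sketch:

```python
import math


def estimated_listing_seconds(total_jobs, per_page=20, requests_per_second=2.0):
    """Estimate how long a strictly sequential listing of job metadata takes."""
    pages = math.ceil(total_jobs / per_page)
    return pages / requests_per_second


# 2000 jobs -> 100 pages -> 50.0 seconds just to list the metadata
print(estimated_listing_seconds(2000))
```

That is close to a minute spent waiting before a single job log has been looked at.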

Solution

The best solution for very slow I/O, as of the time of writing, appears to be asyncio. If you are not familiar with it (I wasn’t either), don’t worry. I won’t get too technical, as I don’t feel competent enough for that, and there are far better resources out there. After reading up on it a bit, I wrote a POC that was able to send out requests, retrieve the JSON from the responses and load it into dicts, which represent the attributes of our Gitlab resource:

Then we need to transform them into objects that python-gitlab can work with. For every resource, there is a specific class and an object manager. First, we create the resource-specific object manager, such as gitlab.v4.objects.ProjectJobManager, which takes our gl session object as a constructor parameter. If our resource has a parent resource, as a job has a project, we need to pass the parent to the manager’s constructor as well. Once the manager is set up, we can start creating objects: we only need to pass the manager and the attributes to the constructor of our resource class, which in this case is gitlab.v4.objects.ProjectJob.
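The idea can be illustrated with a small, self-contained sketch. It is not the POC itself: the HTTP request is simulated with `asyncio.sleep` so the example runs without a Gitlab server (a real version would presumably use an async HTTP client against the jobs endpoint), and `fetch_page`, `fetch_all_pages` and the concurrency cap are all assumptions made for illustration.

```python
import asyncio


async def fetch_page(page, per_page=100, delay=0.05):
    """Stand-in for one HTTP GET of a jobs page; returns a list of dicts."""
    await asyncio.sleep(delay)  # simulates network latency
    first_id = (page - 1) * per_page
    return [{"id": first_id + i, "status": "success"} for i in range(per_page)]


async def fetch_all_pages(total_pages, max_concurrency=10):
    """Request all pages concurrently and flatten them into one list."""
    # Cap the number of in-flight requests to stay polite to the server.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page):
        async with semaphore:
            return await fetch_page(page)

    # gather() returns the results in the order the pages were requested.
    pages = await asyncio.gather(
        *(bounded(page) for page in range(1, total_pages + 1))
    )
    return [job for page in pages for job in page]
```

Calling `asyncio.run(fetch_all_pages(total_pages=20))` returns 2000 dicts, and the waiting time of the individual requests overlaps instead of adding up.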
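The steps above can be sketched as follows. This is a minimal illustration, assuming the `ProjectJobManager(gl, parent=...)` and `ProjectJob(manager, attrs)` constructor signatures of python-gitlab; the helper name `build_project_jobs` is hypothetical.

```python
def build_project_jobs(gl, project, job_dicts):
    """Wrap raw job dicts in python-gitlab ProjectJob objects.

    `gl` is the python-gitlab session and `project` the parent project
    object the jobs belong to.
    """
    # Imported lazily so the sketch only needs python-gitlab when called.
    from gitlab.v4.objects import ProjectJob, ProjectJobManager

    # The manager ties the objects to the session and to their parent project.
    manager = ProjectJobManager(gl, parent=project)
    return [ProjectJob(manager, attrs) for attrs in job_dicts]
```

The resulting objects can then be handed to the rest of the program as if they had come from python-gitlab's own list call.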

Implementation

And we are almost done here! The only thing left to do is to integrate this into our code. For the sake of reusability as well as ease of use, I decided to create a package named aio-gitlab, which you can easily integrate into your own project that uses the python-gitlab package. It’s meant to be used as a replacement for the gitlab module, from which it inherits, while adding an aio attribute with functions for fetching various resources. More information, as well as the source code, can be found on our Github.

So we install the module:

pip install aio-gitlab

or

pip install git+https://github.com/pan-net-security/aio-gitlab.git

And use it:

And here we go! Finally, let’s compare the results!

Before:

and after:

Dominik Bucko
Technical Security