Original article was published on Deep Learning on Medium
Introducing tree-hugger: Source Code Mining for Human
We at CodistAI are working hard to build an AI that is able to understand source code and its associated documentation. As developers ourselves, we have felt the pain of writing and keeping documentation up to date while writing code at the same time (add to that the pressure of delivery and deadlines!). Finding that documentation again when we need it is also a huge problem. And we know we are not the only ones suffering from it.
But to build such a system we needed data. And lots of it. We needed to mine a huge amount of code spanning different languages, and it turns out that data sources are scarce when it comes to code as data. The few main sources we could find are GitHub’s CodeSearchNet challenge data-set, Google BigQuery’s GitHub activity data-set, Py150, and a few others like these.
Mining code files in different languages and gathering important information from them is not a trivial job. We did not want to create new parsers, so great parser generator frameworks such as ANTLR or lex-yacc were not an option for us. What we needed was a good, high-level library that exposes a simple, Pythonic API on top of some kind of universal code parser.
So in the end the choice came down to two options: Babelfish and tree-sitter. Babelfish was the newer kid on the block and came with some nice properties, but we did not really like its uAST (Universal AST), and the API was not that easy either. So tree-sitter was the natural choice (also, Babelfish is no longer maintained).
We were impressed by tree-sitter’s clean design, speed, language coverage, and minimal dependencies. However, we still struggled with the low-level interface its Python binding provides. So we started writing code to create higher-level abstractions on top of it.
Thus, tree-hugger was born.
tree-hugger is a light-weight, extendable, high-level, universal code parser built on top of tree-sitter.
Let’s unpack those words one by one.
- light-weight: tree-hugger aims to be a simple and easy-to-use framework. It gives a developer just enough tools to quickly start mining code-data while it takes care of a lot of the boilerplate, and it makes life easier with a few command line utilities. To that end, it remains very lightweight itself, with few dependencies.
- extendable: tree-hugger aims to be extendable by design, in two main ways. First, queries live in an external source: we read them (as s-expressions) from a yml file (an example can be found here), so they do not need to be written into the code and we can iterate on them very easily. Second, it has a modular structure with the common boilerplate code already supplied for you, which means you can focus on writing the part of the code that actually matters to you.
- high-level: tree-hugger hides the little details of running a query, walking the AST, and the tricky parts of retrieving code from the query result behind a clean, Pythonic API, so you are free to concentrate on the problem at hand.
- universal: We actually leverage the amazing tree-sitter, so by default we are (almost) language agnostic 🙂
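To make the “external source of queries” idea concrete, here is a sketch of what such a yml file might contain. The exact key names and file layout are illustrative assumptions, not tree-hugger’s actual schema; the query itself is a tree-sitter s-expression matching Python function definitions and their docstrings:

```yaml
# Hypothetical queries.yml — key names are illustrative, not tree-hugger's real schema
python:
  all_function_docstrings: >
    (function_definition
      name: (identifier) @function.name
      body: (block
        (expression_statement (string) @function.docstring)))
```

Because the queries live in a file rather than in code, tweaking a pattern is an edit-and-rerun cycle with no code changes at all.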
Use-case : Mine Code and Comments
Let’s say that you want to treat code as data and try to fit Machine Learning models on it (if you want to know more about this, you can check out one of our earlier articles here).
The task at hand is to mine a lot of code (in different languages, such as Python, PHP, C, C++, JS, C#, Java, etc.) and to generate a data-set where each sample looks like this: (f, d), where f stands for the function body and d stands for the associated comment (docstring, if you prefer).
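tree-hugger does this across many languages via tree-sitter. For Python alone, the core idea of producing (f, d) pairs can be sketched with the standard library’s ast module; this is an illustrative stand-in, not tree-hugger’s implementation:

```python
import ast
import textwrap

def mine_pairs(source: str):
    """Extract (function_body, docstring) pairs from Python source."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                # f = the full source text of the function, d = its docstring
                body = ast.get_source_segment(source, node)
                pairs.append((body, doc))
    return pairs

code = textwrap.dedent('''
    def add(a, b):
        """Add two numbers."""
        return a + b
''')
print(mine_pairs(code))
```

The point of a tool like tree-hugger is that you get this kind of output for many languages from one API, instead of hand-rolling a per-language extractor like the one above.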
Once you start working on this, you discover a problem. You need something that lets you read through all of those different language files and extract the data you are interested in, and the data you retrieve should be in a form you can use easily. A framework and library like tree-sitter is useful here.
Once you start using it, you will notice the pain points.
- You need to write queries as s-expressions, and they are embedded in your code, which makes it very hard to read and also hard to debug.
- Although the parser is universal, the internal representation of each language differs, so you need to write, manage, and keep track of similar queries, each written slightly differently, for all of those different languages.
- Once you start getting the data, you will need some post-processing to make it usable in your modeling scheme.
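The second pain point is easy to see with a concrete example. The same logical query, “capture every function name”, has to be written against different node types per language. The node names below follow the public tree-sitter grammars for Python and JavaScript, but they may vary with grammar versions:

```scheme
;; Python grammar: functions are `function_definition` nodes
(function_definition
  name: (identifier) @function.name)

;; JavaScript grammar: the equivalent node is `function_declaration`
(function_declaration
  name: (identifier) @function.name)
```

Multiply this by every language and every kind of query you need, and managing the query set quickly becomes a project of its own.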
All of this takes a huge amount of time, and there is no trivial solution out there.