Catching bugs in code with AI, fully local CLI app
No data leaves your computer.
🤔 Overview • 🪄 Demos • 🔧 Installation • 💻 Usage • 🧠 How it works
This is a CLI application that analyzes a source code file using an AI model and shows you the parts that look suspicious to it.
It does not use rules or static analysis the way a linter would. Instead, the model generates its own code suggestions based on the surrounding context. Check out the How it works section below.
NB: All processing is done on your hardware, and no data is transmitted to the Internet.
Example output:
Here's the output of running the application on its own source files (so meta):

- cli.py — source code → generated output
- render.py — source code → generated output
- sus.py — source code → generated output
There was this post, "AI found a bug in my code", on Hacker News, which was pretty cool. I wanted to try it on my own code, so I went ahead and built my own implementation of the idea.
You can install sus via pip:
```
pip3 install suspicious
```
Or from source:

```
git clone git@github.com:sturdy-dev/suspicious.git
cd suspicious
python -m pip install .
```
You can run the program like this:
```
sus /path/to/file.py
```
Note that when you run this for the first time, the application will need to download the model (~500 MB); see the How it works section for more info.
This will generate and open an .html file with the results:
- grey means the prediction is the same as the original
- light grey means the model had a different prediction, but with super low confidence
- light red means things are looking a little sus
- red means there was a different prediction, and the confidence was higher

Will it actually catch bugs? Unclear. You run sus on a file and skim over the red stuff; maybe it spots something you missed. Ping me on Twitter if you catch something cool with it.
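For illustration, here is a hypothetical sketch of how the color buckets above could be assigned. The `color` function and its thresholds are made up for this example; sus's actual cutoffs may differ.

```python
# Hypothetical color bucketing for a single prediction.
# The probability thresholds are illustrative, not sus's actual values.
def color(original: str, predicted: str, probability: float) -> str:
    if predicted == original:
        return "grey"        # model agrees with your code
    if probability < 0.2:
        return "light-grey"  # different prediction, but very low confidence
    if probability < 0.6:
        return "light-red"   # a little sus
    return "red"             # confidently different, worth a look
```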
In a nutshell, it feeds a tokenized representation of your source text into a Transformer model and asks the model to predict one token at a time using Masked Language Modelling.
For a general overview of Transformer models, check out The Illustrated Transformer by Jay Alammar, which helped me understand the core ideas.
sus uses a model called UniXcoder, which has been trained on the CodeSearchNet dataset. To do the MLM (masked language modelling), we add an lm_head layer on top of it.
When sus processes your code, it first tokenizes the text; a token can be a special character, a programming language keyword, an English word, or part of a word.
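For example, using the Hugging Face tokenizer for the unixcoder-base-nine checkpoint mentioned below (the exact subword splits shown in the comment are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base-nine")
print(tokenizer.tokenize("def add(a, b):"))
# Subword tokens, roughly: ['def', 'Ġadd', '(', 'a', ',', 'Ġb', '):']
# ('Ġ' marks a leading space in byte-level BPE vocabularies)
```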
Before feeding the sequence of token ids to the model, one or more tokens are replaced with a special <mask> token. After feeding the input through the network, we extract just the value at the masked location. This masking is done in a loop for each token, to generate individual predictions.
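A minimal sketch of that loop, assuming a RoBERTa-style checkpoint with a masked-LM head. sus itself wires up its own lm_head on top of the UniXcoder model (as noted above), so treat this as an approximation rather than its exact code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint; any masked-LM checkpoint works for this sketch. If the
# checkpoint ships without a trained MLM head, transformers initializes one
# randomly -- sus attaches its own lm_head instead.
checkpoint = "microsoft/unixcoder-base-nine"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
model.eval()

code = "def add(a, b):\n    return a - b\n"
input_ids = tokenizer(code, return_tensors="pt")["input_ids"][0]

predictions = []
# Skip the first and last positions: they hold the <s> / </s> special tokens.
for i in range(1, len(input_ids) - 1):
    masked = input_ids.clone()
    original_id = int(masked[i])
    masked[i] = tokenizer.mask_token_id                   # hide one token
    with torch.no_grad():
        logits = model(masked.unsqueeze(0)).logits[0, i]  # logits at the masked slot
    probs = logits.softmax(dim=-1)
    predicted_id = int(probs.argmax())
    predictions.append({
        "idx": i,
        "original": tokenizer.decode([original_id]),
        "predicted": tokenizer.decode([predicted_id]),
        "probability": float(probs[predicted_id]),
    })
```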
Since this process is impractically slow, instead of masking one token at a time, sus masks 10% of the tokens at once, making sure that the masked locations are spread out (so that there is sufficient context around each prediction site).
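Here's a sketch of that batched masking, under the assumption that "spread out" simply means every tenth position per pass (the `masked_passes` helper is my invention for this example):

```python
import torch

def masked_passes(input_ids: torch.Tensor, mask_token_id: int, stride: int = 10):
    """Yield (masked_ids, masked_positions) pairs that cover every token once.

    Each pass masks ~10% of the tokens, spaced `stride` apart, so every
    prediction site keeps plenty of unmasked context around it.
    """
    for offset in range(stride):
        positions = list(range(offset, len(input_ids), stride))
        masked = input_ids.clone()
        masked[positions] = mask_token_id  # mask every stride-th token
        yield masked, positions
```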
The output of this entire process is a list of structs that contain the original and predicted values for each token. Example:
```
{
    "idx": 0, // position in the sequence
    "original": "foo", // as originally written in the source file
    "predicted": "bar", // what the model predicted
    "cosine_similarity": 0.23, // how similar the prediction is to the original in vector space
    "probability": 0.92, // how confident the model is in its prediction
}
```
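One plausible way to compute the cosine_similarity field, reusing `model`, `original_id`, and `predicted_id` from the masking sketch above and comparing the two tokens' static input embeddings. sus may well compute this differently (e.g. over contextual embeddings):

```python
import torch.nn.functional as F

# The model's input embedding matrix, shape (vocab_size, hidden_dim).
emb = model.get_input_embeddings().weight
cosine_similarity = float(
    F.cosine_similarity(emb[original_id], emb[predicted_id], dim=0)
)  # 1.0 = same direction in vector space, lower = more different
```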
The list of structs is then fed into an html template to be rendered for the user. Easy-peasy.
sus uses the decoder of UniXcoder, specifically the unixcoder-base-nine checkpoint. What's cool is that it's only 500 MB and ~120M parameters, which means it's quick to download and fast enough to run locally.
Larger models produce higher quality outputs, but you need to run the inference on a server.
You can try sus on any source file, but you can expect the best results with the nine languages the model was trained on: the six CodeSearchNet languages (Go, Java, JavaScript, PHP, Python, and Ruby), plus C, C++, and C#.
sus is meant to be executed locally (aka not sending your code to a server), which puts some constraints on the AI model size. Larger models will produce higher quality results, but they can be tens of GB in size, and without a beefy GPU they could take a long time to generate the output. Because of this, sus uses a modestly sized model.

The model can also only attend to a limited amount of context at a time. sus works around this by batching the input, but as a result, batches are not aware of the 'context' / code that is in other batches. Files are split into batches of 2500 characters, which is super crude and is meant to correspond to roughly 1024 tokens.
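A sketch of that splitting step (my reconstruction of the description above, not sus's verbatim code):

```python
def split_batches(source: str, batch_chars: int = 2500) -> list[str]:
    """Crudely split a file into fixed-size chunks; ~2500 chars ≈ 1024 tokens."""
    return [source[i:i + batch_chars] for i in range(0, len(source), batch_chars)]
```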
sus is distributed under AGPL-3.0-only. For Apache-2.0 exceptions, contact [email protected].