CudaPAD is a PTX/SASS viewer for NVIDIA Cuda kernels and provides an on-the-fly view of the assembly.
Now you can load kernel files written in PTX assembly language and view SASS disassembly on-the-fly for debugging, learning or testing different compiler settings. These files are to be used in PTX or CUBIN format with the CUDA Driver API.
See the Article on CodeProject at http://www.codeproject.com/Articles/999744/CudaPAD for more details.
CudaPAD aids in optimizing and understanding NVidia’s Cuda kernels by displaying an on-the-fly view of the PTX/SASS that makes up the GPU kernel. CudaPAD simply shows the PTX/SASS output; however, it has several visual aids to help understand how minor code tweaks or compiler options affect the PTX/SASS.
What is PTX or SASS anyway? NVidia’s PTX is an intermediate language for NVidia GPUs. It is closely tied to pure GPU assembly (SASS) but slightly abstracted. PTX is less tied to specific hardware or a hardware generation, which makes it more useful in most cases when compared to assembly. One item it abstracts is physical register numbers, which makes it easier to use than assembly. PTX instructions are usually translated into one or more actual SASS hardware instructions. SASS is hardcore assembly. It is what the GPU actually runs and is directly translated into machine code. Viewing SASS code is more difficult but it does show exactly what the GPU will do. As mentioned, SASS code also works with the registers directly, so there is more control over where values are stored, but it’s another item that the programmer needs to keep track of and it makes SASS more difficult to work with.
Often when programming in Cuda, there is a need to view what a kernel’s PTX/SASS might look like and CudaPAD helps with this. There might be a need to view PTX/SASS for debugging, understanding what’s happening, squeezing a little more performance out of a kernel, or just for curiosity. To use the application, simply type or paste a kernel in the left panel and the right panel will display the corresponding disassembly information. Visual aids like Cuda-to-PTX code matching lines, PTX cleanup, WinDiff, and quick register highlighting are built in to help make the PTX easy to follow. Other on-the-fly information is also displayed, like register counts, memory usage, and error information.
With any piece of code, there are often several ways to perform the same thing. Sometimes, just modifying a line or two will lead to different machine instructions with better register and memory usage. Have fun and make some changes to a kernel in the left window and watch how the PTX/SASS changes on the right.
Just as a quick note, CudaPAD does not run any code. CudaPAD is only for viewing PTX, SASS, and register/memory usage.
Like most of my projects, this one grew out of a personal need. For some algorithms I develop, GPU efficiency is important. One way to help with this is by understanding the low-level mechanics and making any necessary adjustments. Before creating this app, I would often get in a loop where I would write a performance-critical kernel and then view the PTX/SASS over and over using command line tools. Doing this repetitively was time consuming so I decided to build a quick C# app that would automate the process.
It started out as a simple app that would take a kernel in the left window and then output the PTX to the right side window. This was accomplished by basically running the same command line tools as before, mainly nvcc.exe, but now in an automated fashion in the background. I got carried away, however, and within a short period of time I started adding several features including automatic re-compiling, WinDiff, visual code line markers, compile errors, and register/memory usage.
AMD used to have a similar tool for Brook+ and this gave me the idea of having the two-window app back in 2009 when I first built it. Basically the tool had a left window where a Brook+ kernel could be added and a right window where the assembly would output to. A button could be clicked to update the output window. AMD has had a couple of these over the years but they have since been replaced with AMD’s CodeXL.
AMD’s CodeXL and NVidia’s NSight have since replaced many tools like these; however, CudaPAD still has its place for quick, on-the-fly viewing of low-level assembly and experimentation. Both CodeXL and NSight are professional-grade free tools and are a must-have for GPU developers.
CudaPAD is simple to use. But before running it, make sure these system requirements are met:
A dedicated GPU is not required since we are only compiling code and not running anything.
If the requirements are met, then simply launch the executable. When CudaPAD loads, it will have a sample kernel. The sample provides a quick place to start playing around or even a starting framework for a new kernel. Whenever the kernel on the left is edited, it will update the PTX or SASS on the right. If there is a compile error, it will show that near the bottom.
There are several features that can be enabled/disabled. All are on by default (also see Features section).
Change the drop down textbox between PTX, SASS or SOURCE views.
PTX view – shows the PTX intermediate language output of the kernel. PTX is close to SASS hardware instructions but is slightly higher level and is less tied to a particular GPU generation. Usually PTX instructions translate directly to SASS however sometimes there are multiple SASS instructions per PTX instruction.
SASS view – These are true assembly instructions. These types of instructions execute directly on the GPU. The amount of visual information supplied when viewing SASS is less than with PTX; for example, the visual code lines do not show.
Raw code view – This view is mostly for debugging CudaPAD itself. Behind the covers, this app does not re-compile after every change. It only re-compiles when the code is modified and not comments or whitespace. The raw code is a stripped down version of the real code. The reason this was added was because I did not want it to keep compiling when I was adding/editing comments or adding/removing whitespace. This would not be resource friendly and would also throw off the WinDiff feature.
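As a rough illustration of the "raw code" idea, here is a minimal Python sketch (CudaPAD itself is a C# app, and this is not its actual implementation). The names raw_code and needs_recompile are hypothetical. The sketch removes comments, trims whitespace line by line, and compares the result to decide whether a recompile is needed. It assumes string literals contain no embedded newlines and ignores corner cases such as a quote inside a character literal.

```python
import re

# Matches a line comment, a block comment, or a string literal.
# String literals are matched so that comment markers inside them are left alone.
_COMMENT_OR_STRING = re.compile(
    r'//[^\n]*'              # line comment
    r'|/\*.*?\*/'            # block comment (may span lines)
    r'|"(?:\\.|[^"\\])*"',   # string literal
    re.DOTALL,
)

def raw_code(source: str) -> str:
    """Hypothetical helper: strip comments and redundant whitespace from source."""
    def replace(match):
        text = match.group(0)
        # Keep string literals untouched, replace comments with one space.
        return text if text.startswith('"') else ' '

    without_comments = _COMMENT_OR_STRING.sub(replace, source)
    # Collapse whitespace runs inside each line and drop empty lines.
    # Line breaks are kept so that preprocessor directives stay on their own line.
    lines = (' '.join(line.split()) for line in without_comments.splitlines())
    return '\n'.join(line for line in lines if line)

def needs_recompile(old_source: str, new_source: str) -> bool:
    """True only when something other than comments or whitespace changed."""
    return raw_code(old_source) != raw_code(new_source)
```

With this kind of check, editing a comment or re-indenting a line leaves the raw code unchanged, so no compile is triggered and the diff output stays stable.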
In the background, CudaPAD simply compiles the kernels with the Cuda tools. The Cuda compiler in turn calls a C++ compiler like the one in Visual Studio. So to run CudaPAD, Cuda needs to be installed and most likely a C++ compiler like Visual Studio as well.
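A background compile of this kind can be sketched in a few lines of Python. This is only an illustration, not CudaPAD's C# code. The flags are the ones this article quotes for nvcc.exe (-keep, -cubin, --generate-line-info); the function names, the kernel file name, and the working directory are assumptions made for the example.

```python
import subprocess

def build_nvcc_command(source_file, extra_flags=()):
    """Hypothetical helper: assemble the nvcc command line as a list of arguments."""
    return ["nvcc", "-keep", "-cubin", "--generate-line-info",
            *extra_flags, str(source_file)]

def compile_kernel(source_file, work_dir):
    """Run nvcc in work_dir and return (succeeded, compiler_output).

    Assumes nvcc is on the PATH. With -keep, the intermediate files
    (including the PTX) are left in the working directory.
    """
    result = subprocess.run(
        build_nvcc_command(source_file),
        cwd=work_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr
```

Capturing the compiler output instead of letting it print makes it easy to show compile errors in a panel, which is the behavior described for the bottom of the CudaPAD window.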
Disabling the auto-compile is useful for making multiple changes before a compile. This can help show the changes in the diff (differencing) output over several changes. To do a manual compile, just click the green ‘start’ in the top right corner.
Let's take a look at how this application works. I will present what happens when the left window is edited. This triggers a recompile and then updates the right PTX/SASS window. Here it is in steps:
nvcc.exe -keep -cubin --generate-line-info ...
This command compiles the Cuda file into a cubin file (device code). We also use the -keep option to keep the PTX files, and the --generate-line-info option so we know the line numbers of the source file and can draw the lines.
The new PTX is then run through a diff algorithm. The final result of the diff function is the new PTX with what changed in the form of comments. I chose to put the change information in comments so that if the text is copied to another program, it will still run.
1. First, the line information is read from the ".loc # ## #" statements. Any line information is then deleted from the PTX so that it is not displayed. For example, line 45 of the PTX might be ".loc 1 20 1". The 20 here would be the source line, so a line would be drawn from line 20 in the source to line 45 in the PTX window.
2. Next, we get the indentation for each line. This is done by counting the whitespace (spaces/tabs) before each word. This is needed so the lines start or end where the code starts instead of just at the beginning of the line.
3. Using the textbox height/width plus the current scroll positions for each window plus the indentation and line number of each line, we then draw the lines.
These lines match up the Cuda source code to the PTX output. They help the programmer quickly identify what Cuda code matches up with what PTX. This function can be enabled or disabled by clicking the lines icon in the top of the PTX window.
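The line-matching steps described above can be sketched roughly as follows, in Python rather than CudaPAD's C#. The function names are hypothetical. The sketch assumes each link points at the first non-empty PTX line that follows a .loc statement once the .loc lines have been removed; the real app may anchor the lines differently. The drawing itself (scroll positions, splines) is left out.

```python
import re

# ".loc <file> <line> <column>": the second number is the source line.
_LOC = re.compile(r'^\s*\.loc\s+(\d+)\s+(\d+)\s+(\d+)')

def split_line_info(ptx: str):
    """Remove .loc statements and return (clean_lines, links).

    links is a list of (source_line, ptx_line) pairs, both 1-based,
    where ptx_line refers to the cleaned-up PTX that gets displayed.
    """
    clean, links = [], []
    pending = None
    for line in ptx.splitlines():
        match = _LOC.match(line)
        if match:
            pending = int(match.group(2))  # remember the source line
            continue                       # hide the .loc line from the output
        clean.append(line)
        if pending is not None and line.strip():
            links.append((pending, len(clean)))
            pending = None
    return clean, links

def indentation(line: str, tab_width: int = 4) -> int:
    """Count leading whitespace in columns so a line can start where the code starts."""
    width = 0
    for ch in line:
        if ch == ' ':
            width += 1
        elif ch == '\t':
            width += tab_width
        else:
            break
    return width
```

Each (source_line, ptx_line) pair, together with the indentation of both lines and the current scroll offsets, is enough to compute the two endpoints of one connecting line.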
When needed, the application will automatically re-generate the PTX code. It does not do this on each text change in the source window but rather when the stuff that matters changes. Many items are stripped from the source text that do not impact the output such as comments or spaces. The Auto Update function can be enabled or disabled by clicking the auto update icon in the top of the PTX window.
Each time the PTX output changes, a differencing algorithm runs automatically. The notes are added in such a way that they do not impact the runnability of the code. I decided to put the diff information inside comments in case the user wants to copy and paste the code. I came up with a system of using // style comments on deleted lines and a /*new*/ comment for new lines. The // comments disable the entire line while the /*new*/ does not.
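To make the comment scheme concrete, here is a small Python sketch that uses the standard difflib module as a stand-in for the diff algorithm in CudaPAD (the article does not say which algorithm the app uses, so this is an assumption). The function name annotate_changes is hypothetical.

```python
import difflib

def annotate_changes(old_ptx: str, new_ptx: str) -> str:
    """Return the new PTX with deleted lines commented out and new lines marked."""
    old_lines = old_ptx.splitlines()
    new_lines = new_ptx.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)

    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            # Removed lines are kept for reference but disabled with //.
            out.extend("// " + line for line in old_lines[i1:i2])
        if tag in ("insert", "replace"):
            # Added lines stay active; /*new*/ is only a marker.
            out.extend("/*new*/ " + line for line in new_lines[j1:j2])
        if tag == "equal":
            out.extend(new_lines[j1:j2])
    return "\n".join(out)
```

Because PTX accepts both // and /* */ comments, the annotated text still describes the same program as the new PTX: the old lines are fully commented out and the /*new*/ marker does not disable the instruction that follows it.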
Just click on any register or word in the PTX window and it will highlight all instances of that item. Click on another and it will highlight those as well with a different color. Click on any highlighted item and it will un-highlight all instances of that item. With just three clicks the following can be achieved:
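The click-to-toggle behavior can be modeled with a small amount of state. The Python sketch below is only an illustration of the idea and not CudaPAD's code; the Highlighter class, its color names, and the token-boundary rules are all assumptions.

```python
import re

class Highlighter:
    """Hypothetical model: each clicked token toggles its own highlight color."""

    def __init__(self, palette=("yellow", "cyan", "lime")):
        self.palette = palette
        self.colors = {}  # token -> assigned color

    def click(self, token: str) -> None:
        if token in self.colors:
            # Clicking a highlighted item removes all of its highlights.
            del self.colors[token]
            return
        used = set(self.colors.values())
        free = [c for c in self.palette if c not in used]
        # Reuse colors only once the palette is exhausted.
        self.colors[token] = free[0] if free else self.palette[len(self.colors) % len(self.palette)]

    def spans(self, text: str):
        """Return sorted (start, end, color) for every whole-token occurrence."""
        result = []
        for token, color in self.colors.items():
            # Boundaries keep %r1 from matching inside %r10.
            pattern = r'(?<![\w%$])' + re.escape(token) + r'(?!\w)'
            for match in re.finditer(pattern, text):
                result.append((match.start(), match.end(), color))
        return sorted(result)
```

A text control would then apply each span as a colored indicator; the boundary checks matter because register names such as %r1 and %r10 share a prefix.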
The ScintillaNET textbox control by Jacob Slusser has some convenient text highlighting abilities that visually help when viewing code. Originally, this started out as a plain textbox, then moved to another 3rd party control and then finally to the ScintillaNET control. This results in more colorful and cleaner looking code.
Besides the text highlighting, the text in the output window is formatted so it’s a little cleaner. Things like compiler information and header information are removed:
Example of highlighted and cleaned up output formatting is as follows:
Often when running across an error, it is helpful to do a quick online search. I found I was often opening a browser and then copying and pasting the error into a search box. This was not efficient so I added a search online function. At the time, I think this was one of the first of its kind but since it was released in 2009, I have seen other IDEs add this.
I had a little fun creating this. This is probably why so much time was put into this.
Getting the code lines to work was exciting for me. I believe the visual code lines might have been one of the first of their kind when I built this in 2009 but I am not sure. This was a wild idea I had and I was not sure if I could get it working. Drawing moving lines on the screen is not that easy as I found out as there always seemed to be some side effects. Drawing the spline was the easy part but all the miscellaneous stuff like cleaning it up was more difficult. Another difficult part was calculating the location in the text box. The textbox line height and line number must be known for each spline drawn. I’m not a graphics developer so I am just happy to get it to work! The visual lines turned out better than expected and are fun to play with.
At the time, I dreamed up many different “line” ideas to help break down the assembly but none of the others have been implemented yet:
Note: These other features have NOT been added to CudaPAD. (at least not at this time)
Here are some advantages of viewing PTX...
Curiosity - This is what I use it most for. Sometimes I just want to see what is going on at the lower levels and how small changes impact the code. This can be a very useful tool for trying to learn PTX/SASS and the Cuda compiler.
Software bug - Trying to figure out that annoying bug. Is it a compiler bug or is it something with my code? Sometimes viewing the machine instructions can aid in understanding an unexpected result.
Changing up a line or two often produces different results. When a kernel needs some performance optimization, toying with different ways of doing the same thing can produce more efficient code. One example that comes to mind: I found that when using a union, the PTX would always use local memory. This was a while ago so it might not be true anymore but here is the example:
.local .align 4 .b8 someLocMem[4];
....
st.local.s32 [someLocMem], someIntReg;
However, when using something like:
"int strangeInt = *(int*) &somefloat;"
the output looks like this:
mov.b32 someFloatReg, someIntReg;
This is easily spotted in CudaPAD because of the quick feedback and visual markers.
Does the code do nothing? Several times in the past, I realized that my kernel had a bug because when I changed or deleted some code nothing changed in the PTX output. I thought to myself, how could this be? The reason the PTX might not change is that the compiler often removes useless code that does not do anything. As I found out, this is more common than I expected because I ran into it a couple of times. This is usually caused by a bug but it could also just be pointless code. In most cases, code that is optimized out should either be removed or fixed. Noticing this can help find some hidden errors in a program.
Just as a word of caution, try not to go optimization crazy. Optimization does have its place for particular functions that get run often; however, optimization can make code less readable, awkward, and more difficult to maintain. Also, time should only be spent on code where a performance increase would have a large impact. There is much more on this subject that I will not get into.
Below is a quick tutorial video. The sub-menu options did not show properly in the video but I explain what I am clicking on so hopefully you can still follow along.
CudaPAD won a poster spot at the 2016 GPU Technology Conference. Even better than that, it was also selected as one of the top 20! At the conference, I gave a short presentation to about 100-150 people on April 4th 2016.
Here are some wish list items I have that may or may not be added in the future: