It is not always worthwhile to parallelize an application using GPUs; this model attempts to predict the cost/benefit of doing so without having to run expensive tests on large amounts of data.
A context-free grammar (CFG) is created that can generate CUDA programs with the following features:
- 1D, 2D, 3D problem sets
- shared memory utilization
- thread syncing
- atomic operations
- calls to __device__ functions
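The generation step described above can be sketched as a small recursive grammar expander. This is a hypothetical illustration, not the project's actual generator; the real grammar covers all of the features listed (1D/2D/3D indexing, shared memory, syncing, atomics, and __device__ calls), of which only a few appear here.

```python
import random
import re

# Toy grammar: nonterminals in <angle brackets>, terminals are CUDA text.
# (Illustrative subset; the real CFG is much larger.)
GRAMMAR = {
    "kernel": ["__global__ void k(float* a, int n) { <body> }"],
    "body": [
        "int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { <stmt> }",
        "__shared__ float s[256]; <stmt> __syncthreads();",
    ],
    "stmt": ["a[i] = a[i] * 2.0f;", "atomicAdd(&a[0], 1.0f);"],
}

def expand(symbol, rng):
    """Pick a random production for `symbol` and expand nested nonterminals."""
    production = rng.choice(GRAMMAR[symbol])
    # Substitute one <nonterminal> placeholder at a time until none remain.
    while re.search(r"<(\w+)>", production):
        production = re.sub(
            r"<(\w+)>", lambda m: expand(m.group(1), rng), production, count=1
        )
    return production

kernel_src = expand("kernel", random.Random(0))
print(kernel_src)
```

Seeding the generator makes each of the 5000 samples reproducible from its index.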
The CFG is then used to generate 5000 different programs, each with a corresponding serialized version and parallelized version. Each program is then compiled and run with several different inputs (matrix size, block size, and grid size); performance is measured, and correctness is checked by comparing the outputs of the serialized versions to those of the parallel ones.
Programs that generate code that is not equivalent at runtime are discarded.
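The equivalence check can be sketched as a tolerance-based comparison of the two output arrays. This is a sketch under assumptions: the tolerance value and function name are illustrative, not taken from the project.

```python
from math import isclose

def outputs_equivalent(serial_out, parallel_out, rel_tol=1e-4):
    """Decide whether a parallel run reproduces the serial result.

    GPU reductions can reorder floating-point operations, so exact
    equality is too strict; a relative tolerance is used instead.
    (rel_tol is an assumed value, not from the original project.)
    """
    if len(serial_out) != len(parallel_out):
        return False
    return all(
        isclose(s, p, rel_tol=rel_tol)
        for s, p in zip(serial_out, parallel_out)
    )

# A tiny rounding difference passes; a real divergence is discarded.
assert outputs_equivalent([1.0, 2.0, 3.0], [1.0, 2.0000001, 3.0])
assert not outputs_equivalent([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
```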
Code can be generated by calling:
python -m nyu.gpu.speedup <num-samples>
This will generate the code snippets, compile and execute them, and record each snippet's source code and runtime in a CSV file called cuda_speedup.csv.
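The recording step can be sketched as timing one run and appending a row to the CSV. This is a minimal sketch: `runner` stands in for compiling and executing a snippet (the real tool shells out to the CUDA toolchain), and the column layout is an assumption.

```python
import csv
import time

def record_sample(path, source_code, runner):
    """Time one generated program and append (source, runtime) to the CSV.

    `runner` is a placeholder callable standing in for compile-and-execute;
    the column order (source, runtime) is assumed, not confirmed.
    """
    start = time.perf_counter()
    runner()
    runtime = time.perf_counter() - start
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([source_code, runtime])

record_sample(
    "cuda_speedup.csv",
    "__global__ void k() {}",
    lambda: sum(range(1000)),  # stand-in workload for the compiled snippet
)
```

Appending row by row means a crash partway through a 5000-sample run loses only the sample in flight.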
The train_model.ipynb notebook can then be used to train the model.
Once the dataset is generated, a pre-trained GPT-Neo model trained on source code is used as a function-embedding featurizer. The resulting embeddings are then fed into a small feed-forward neural network.
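The prediction head can be sketched as a single-hidden-layer network that maps a fixed-length embedding vector to a scalar speedup estimate. Everything here is illustrative: the layer sizes, initialization, and activation are assumptions, and the weights would normally be learned in train_model.ipynb rather than sampled at random.

```python
import random

def feed_forward(embedding, w1, b1, w2, b2):
    """One hidden layer with ReLU, producing a scalar prediction.

    `embedding` plays the role of the GPT-Neo function embedding;
    sizes and weights are illustrative assumptions.
    """
    hidden = [
        max(0.0, sum(e * w for e, w in zip(embedding, row)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(h * w for h, w in zip(hidden, w2)) + b2

rng = random.Random(0)
dim, hid = 8, 4  # assumed sizes; a real GPT-Neo embedding is much wider
w1 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hid)]
b1 = [0.0] * hid
w2 = [rng.uniform(-1, 1) for _ in range(hid)]
prediction = feed_forward([0.1] * dim, w1, b1, w2, 0.0)
```

Freezing the large language model and training only this small head keeps the trainable parameter count tiny relative to the featurizer.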
The following results are reported over a smaller 500-sample dataset with a 25/75 train/validation split; no hyperparameter tuning is done on the model:
- training score: R^2 = 0.99
- test score: R^2 = 0.88
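For reference, the R^2 scores above are the standard coefficient of determination, which can be computed directly (the project presumably uses a library implementation; this function is just the textbook formula):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Perfect predictions give R^2 = 1; predicting the mean gives R^2 = 0.
assert r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 1.0
assert r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]) == 0.0
```

The gap between the train score (0.99) and test score (0.88) suggests some overfitting, which is unsurprising with only 500 samples and no tuning.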