Instead of VDPAU and VA-API, would it be possible to try OpenCL? It should be a standard that adapts to any format and hardware. However, its implementation is likely to be too complex and too intrusive on the code. I wouldn't consider CUDA, because it's closed source.