PyTorch announced version 1.0 in May, the “production ready PyTorch”. The “marriage to Caffe2” and true production-readiness are still ideals on the horizon rather than present realities. Still, the ideological shift points at a larger goal shared by the builders and users of ML infrastructure.
PyTorch is weak at shipping to production, while TensorFlow’s API and interface are complex and inconsistent. The ways the two libraries are changing to address those shortcomings are critical: TensorFlow is moving toward PyTorch by adding define-by-run (eager) execution and removing duplicate libraries, while PyTorch is becoming even more dynamic and building tools to make the backend graph implementation irrelevant.
The ONNX specification and the TVM compiler highlight the importance of PyTorch’s plans. ONNX is a Facebook-led project that formalizes a consistent graph protocol for moving models between ML frameworks. Sharing and cross-translation are good for the community, and could allow trivial conversion of PyTorch models to Caffe2 (or TensorFlow).
If ONNX lets models move between frameworks, the TVM compiler supports a one-way transformation of models to pluggable hardware backends. The project, from Amazon and the University of Washington, includes automatic optimization of its low-level representation and the ability to supplement that work with custom hardware optimizations.
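TVM itself is a large project, but its core idea — lower a framework-neutral graph to a target-specific kernel, then automatically pick the fastest variant — can be sketched in plain Python. Everything below (the graph dict, the kernel variants, the `lower` function) is a hypothetical illustration of that idea, not TVM’s actual API:

```python
import time

# A toy framework-neutral graph: one fused multiply-add over a list.
graph = {"op": "muladd", "a": 2.0, "b": 1.0}

# Two hand-written "kernel" variants implementing the same op.
def muladd_naive(xs, a, b):
    out = []
    for x in xs:
        out.append(a * x + b)
    return out

def muladd_comprehension(xs, a, b):
    return [a * x + b for x in xs]

def lower(graph, candidates, sample):
    """Pick the fastest candidate kernel by timing each on sample data,
    mimicking TVM-style auto-tuning in miniature."""
    a, b = graph["a"], graph["b"]
    best, best_time = None, float("inf")
    for kernel in candidates:
        start = time.perf_counter()
        for _ in range(100):
            kernel(sample, a, b)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = kernel, elapsed
    # Return a "compiled" target-specific callable with constants baked in.
    return lambda xs: best(xs, a, b)

compiled = lower(graph, [muladd_naive, muladd_comprehension], list(range(1000)))
print(compiled([0.0, 1.0, 2.0]))  # [1.0, 3.0, 5.0]
```

The transformation is one-way by construction: the compiled callable has the graph’s constants and kernel choice baked in, and there is no path back to the original graph — the same trade-off TVM makes when it emits code for a hardware backend.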
I think there are two important insights here. The obvious one is that the top frameworks are converging on dynamic, Pythonic interfaces. If every model can be cross-compiled, or sub-compiled and optimized, for various backends, then the choice of framework becomes independent of training speed and ease of production deployment. The best UI will win.
My more interesting hypothesis is that we should deploy models differently. (Summary of current state of model servers here.) TFServing is complete in the same sense that AWS is, but is similarly unapproachable to new users. Furthermore, the ability to convert models to forms like WebAssembly (WASM) opens new options for the kinds of servers that can run inference. Edge deployment (e.g. Cloudflare Workers) can offer lower-latency, higher-throughput, and lower-cost predictions than any centralized servers.
Unfortunately, every step of that process involves engineering hurdles that are beyond my expertise:
PyTorch tracing fails when a dynamic model’s control flow is unpredictable at runtime
ONNX conversions have fluctuating dependencies; I could only get the library working with the official Docker images
TVM’s code and documentation are tough to understand. The only way I could figure out how to extract an executable WASM from it and Emscripten was probably an abuse of the compiler.
Properly building and running a WASM script requires pre- and post-js files (and probably a custom C++ template file) that differ per application.
Loading/caching .bin model weights into a WASM worker seems pretty critical for edge performance, but I can’t quite wrap my head around that lifecycle yet.
The Emscripten compiler and WASM are both under active development, and might not always work as expected.
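The first hurdle — tracing failing on dynamic models — is easy to demonstrate with a toy tracer. This is not PyTorch’s tracer, just a minimal sketch of why record-and-replay breaks on data-dependent control flow:

```python
def model(x):
    # Data-dependent control flow: which branch runs depends on the input.
    if x > 0:
        return x * 2
    return x - 1

def trace(fn, example_input):
    """Record the single code path taken for example_input and replay it
    verbatim, the way a tracing exporter captures a graph."""
    if example_input > 0:
        return lambda x: x * 2   # only this branch was recorded
    return lambda x: x - 1

traced = trace(model, example_input=3)

print(model(5), traced(5))    # 10 10  -> agree on the traced branch
print(model(-5), traced(-5))  # -6 -10 -> traced model is silently wrong
```

The traced version never errors — it just returns wrong answers for inputs that would have taken the other branch, which is exactly what makes this failure mode hard to catch before deployment.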
I have a feeling these technical problems will be smoothed out over time. I also suspect that someone who knows far more than I do about compilers and low-level graph representations of ML models will figure out how to commercialize a more modern form of PyTorch deployment by then.
This article on Julia is interesting, if somewhat tangential.
This is pretty hand-wavy, and I’m not a seasoned C engineer, so I’m open to thoughts/feedback/criticism if anyone feels strongly about these technologies. If anyone knows someone building this as a research project or startup, I’d be interested to hear about it.