Realtime diffusion in the cloud
In December 2023, I implemented a realtime diffusion toolkit with Daito Manabe and Rhizomatiks. The toolkit is based on SDXL Turbo running at 1024x1024, accelerated by stable-fast. Stability AI supported this work by granting us permission to use SDXL Turbo for interactive installations and related applications. We used this image-to-image system to power Transformirror in Osaka, and Generative MV for the Ryuichi Sakamoto retrospective at NTT ICC in Tokyo. We also managed to meet the grueling requirements of broadcast video for a live New Year’s show from Perfume & Daichi Miura.
While video-to-video pipelines have drastically improved in the last year, there have been no significant changes to what is possible for realtime, low-latency processing. StreamDiffusion made 512x512 diffusion more accessible, but due to a mix of model licensing restrictions and hardware requirements, not many people have had a chance to play with this effect in realtime.
Transformirror was curated into KIKK Festival in Belgium this year. Like many arts festivals, the budget was limited. In hopes of reducing the costs and making it easier to tour, I researched solutions for hosting the work in the cloud. This would allow us to have a thin client (a Mac mini) on site with a webcam, speakers, and a fast internet connection—while renting an expensive and heavy multi-GPU machine in the cloud.
The original version of Transformirror in Osaka used two desktop machines networked together, with one RTX 4090 in each machine. I would guess the total value of all this equipment was close to $10,000. Shipping this equipment could cost thousands of dollars, and we would only be able to show the work in one place at a time. On the other hand, an RTX 4090 in the cloud can be as cheap as $0.30/hr. For a four-day festival, this could come out to as little as $100.
I reviewed a dozen cloud GPU providers. My initial requirement was only low latency. The installation was near Brussels, and I wanted under 30ms of network latency: one or two frames on top of the existing 100ms from the diffusion itself. At the edge of this threshold are datacenters in Iceland (38ms ping), Romania (36ms), and Finland (34ms). Anything closer would also work (France, Germany, Netherlands, even the UK).
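For reference, you can roughly estimate the round trip to a candidate region without provisioning anything by timing TCP handshakes against any reachable host there. The hostnames in this sketch are placeholders; substitute a provider's speedtest or SSH endpoint.

```python
import socket
import statistics
import time

def tcp_rtt_ms(host, port=443, samples=5):
    """Approximate round-trip time by timing TCP handshakes (roughly one RTT each)."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        times.append((time.perf_counter() - start) * 1000)
    return statistics.median(times)

# Placeholder hostnames: point these at real endpoints in each candidate region.
for region, host in [("Romania", "ro.test.example.com"),
                     ("Finland", "fi.test.example.com")]:
    print(f"{region}: ~{tcp_rtt_ms(host):.0f} ms")
```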
As I dug in, I realized that affordability was another major factor. I could always use Google Cloud or Amazon AWS, but their GPUs are nearly 5x more expensive than some of their competition.
I learned there are essentially two tiers of cloud GPUs: enterprise GPUs like H100s and A100s, which have much more memory and bandwidth plus advanced features helpful for training, and consumer GPUs like 4090s. The enterprise GPUs cost around 3–4x as much as the consumer GPUs. And since we are only doing inference (not training), the consumer GPUs have similar performance. As far as I understand, NVIDIA has explicitly disallowed consumer-grade GPUs for datacenter use since 2018, but companies like Vast.ai, PureGPU and TensorDock are using them anyway.
The next thing I discovered is that each service has its own quirks around provisioning and network configurations. Some services cater to researchers who want a Jupyter Notebook with a GPU, where performance is not the main concern. Others are designed for large training runs and have restrictive networking policies. Others are meant to run “serverless”, launching a Docker image with a temporary endpoint and then shutting down. I was looking for something closer to bare metal, more persistent, and with an open network so I could send and receive video in realtime.
Before getting too deep into the cloud options, I decided to rewrite the project for my local GPU. I used WebRTC, the same tech powering in-browser video chat. WebRTC uses UDP, so it is low latency. It also automatically scales video quality to match the available bandwidth, and it buffers and de-jitters incoming video. I got a first version working quickly and I was ready to test it.
That’s when I learned that some services block UDP because it’s associated with torrenting. I figured this out by writing a minimal webrtc-bot that simply inverts the colors on your video stream using server-side code, and running this on different cloud servers. This minimal code meant I didn’t need to check NVIDIA driver configurations before learning whether the network supported WebRTC or not.
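The core of that bot is a video track that wraps the incoming track and flips the colors of every frame. Here is a minimal sketch of the idea using aiortc, with the signaling and peer connection setup omitted; the class name is mine, not taken from the actual webrtc-bot.

```python
from aiortc import MediaStreamTrack
from av import VideoFrame

class InvertTrack(MediaStreamTrack):
    """Wrap an incoming WebRTC video track and invert the colors of each frame."""
    kind = "video"

    def __init__(self, track):
        super().__init__()
        self.track = track

    async def recv(self):
        frame = await self.track.recv()
        img = frame.to_ndarray(format="bgr24")
        inverted = 255 - img  # trivial "effect"; swap in diffusion here
        new_frame = VideoFrame.from_ndarray(inverted, format="bgr24")
        new_frame.pts = frame.pts
        new_frame.time_base = frame.time_base
        return new_frame

# Inside the signaling handler, echo the client's video back through the transform:
# @pc.on("track")
# def on_track(track):
#     if track.kind == "video":
#         pc.addTrack(InvertTrack(track))
```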
In the end, the two services that did not support WebRTC were Hyperstack and RunPod. They were also the cheapest options, so instead of giving up on them, I decided to rewrite the tool to stream JPEGs over WebSockets. I probably should have tried this first, since the approach was already proven by the GenDJ project, which had success running my image-to-image code on RunPod.
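Here is a rough sketch of that approach, with each WebSocket message carrying one JPEG frame and a color invert standing in for the diffusion step; the handler and port are placeholders rather than the actual Transformirror code.

```python
import asyncio

import cv2
import numpy as np
import websockets

def process(img):
    # Stand-in for the image-to-image diffusion step.
    return 255 - img

async def handler(ws):
    async for message in ws:
        # Each binary message is one JPEG-encoded frame from the client.
        img = cv2.imdecode(np.frombuffer(message, np.uint8), cv2.IMREAD_COLOR)
        out = process(img)
        ok, jpg = cv2.imencode(".jpg", out, [cv2.IMWRITE_JPEG_QUALITY, 80])
        await ws.send(jpg.tobytes())

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```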
WebSockets worked, with less latency than WebRTC but more jitter, and none of the benefits of bandwidth adaptation.
The final question was where to host it. After a more thorough set of tests, I settled on RunPod and Latitude, mainly because they were both affordable and quite close to Brussels: RunPod had 4090s in Romania for $0.70/hr and L40s in the Netherlands for $1/hr, and Latitude had 8xA100 servers in Frankfurt for $11.50/hr. Hyperstack had a Norway datacenter but no availability.
Latitude was incredibly smooth. You know when you’re watching apt update scrolling over ssh, and it moves like butter? A nicely maintained and over-spec’d server just has a smell to it. If you know you know. There were also measurable differences: Latitude was incredibly fast to load the model weights for SDXL Turbo; it took seconds. On RunPod, the startup was much slower: loading the weights onto the GPU could take a minute, and the first runs would be very slow until the GPU warmed up. Once everything was loaded, both services performed similarly.
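If you want to reproduce that comparison, timing the load is just a matter of wrapping the pipeline construction. This sketch uses the stock diffusers SDXL Turbo pipeline rather than our exact stable-fast setup, so the absolute numbers will differ.

```python
import time

import torch
from diffusers import AutoPipelineForImage2Image

# With a warm download cache, this mostly measures disk speed.
start = time.perf_counter()
pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
print(f"from_pretrained: {time.perf_counter() - start:.1f}s")

# This measures how quickly the weights move onto the GPU.
start = time.perf_counter()
pipe.to("cuda")
print(f"to('cuda'): {time.perf_counter() - start:.1f}s")
```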
RunPod has a very nice feature of letting you stop your server without losing all your storage. RunPod mounts a persistent /workspace directory that is maintained for around $0.13/day. On Latitude, there is no similar option: once you stop your server, the disk is wiped and you have to start over next time. They are just now adding a “Network Volume” feature to address this, but it is in beta and wasn’t working for me yet.
There was one other gotcha I ran into. I was hoping that Lambda Cloud would work for us. I’ve followed their journey ever since they were one of the only companies selling deep learning desktops, and I’ve always appreciated their work. Unfortunately, running some tests on a 1xA100 machine, I was unable to get consistent performance. Inference took anywhere from 120–220ms.
For reference, a RunPod 4090 was always between 70–74ms. For realtime applications, this huge variation kills interactivity, because the effective latency is the worst-case latency: you have to delay every frame to get a properly de-jittered output.
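If you want to check a provider yourself, a rough benchmark along these lines makes the spread easy to see. It reuses the pipe from the earlier sketch, with settings that approximate (but do not exactly match) our configuration.

```python
import statistics
import time

import torch
from PIL import Image

frame = Image.new("RGB", (1024, 1024))  # stand-in for a webcam frame

def run_once():
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt="a mirror", image=frame, num_inference_steps=2,
         strength=0.5, guidance_scale=0.0)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000  # ms

for _ in range(10):  # warmup, lets the GPU clocks and caches settle
    run_once()

times = sorted(run_once() for _ in range(200))
print(f"min {times[0]:.0f}ms  median {statistics.median(times):.0f}ms  "
      f"p99 {times[int(0.99 * len(times))]:.0f}ms  max {times[-1]:.0f}ms")
```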
Most services were similar to RunPod, but would sometimes have tiny glitches. Here’s an example of some runs on Latitude that had a brief but large latency jump, and a small but measurable few-millisecond increase over the few minutes of testing:
Some other services worth mentioning:
- I really love what LeafCloud is doing in Amsterdam. The waste heat from their datacenter is used to heat residential buildings. In the end, they did not have in stock the GPUs I needed to hit the minimum inference latency that Transformirror demands.
- I very briefly tested the new Digital Ocean H100 offering but at the time they were only available in New York, and the pricing was out of this world at $24/hr for an 8xH100 machine. But it did work flawlessly.
- I tried Leader Telecom, which has a very idiosyncratic interface with a lot of UI bugs that even made payment difficult. Then I had trouble provisioning my server: for some reason it took 15 minutes to spin up, and I was charged for it the whole time. Their prices were also much higher, at €6/hr for a 2x4090 machine, around 4x more expensive than RunPod. They refunded me after I contacted them.
- I looked into TensorDock but they had no availability of appropriately fast GPUs within the ping requirement.
- I tried Vast, which runs on Docker images, but I had trouble getting my Docker image with WebRTC to work. I never figured out whether it was because I’m bad with Docker, or because of their network. I did not return to try my WebSocket version.
- I looked into FluidStack but was unable to determine where their servers are actually located.
- I tried OVHCloud but ran into some jurisdiction issues: they wouldn’t let me sign up for an account because I was based in the US, but then when I tried to sign up for an account in the US, they said I already had an account? If you are based in the US and want to try them, maybe use their US site first to avoid this issue.
- I looked into Linode but could not navigate their signup process.
- Vultr seems very promising but I couldn’t find good availability within ping distance.
- Salad claims to be the most affordable and have high availability, but I could not figure out where their servers were located.
- Scaleway seems great but has a complex signup and verification process involving passport verification. I was unable to get verified fast enough to test them. They had 2xH100 machines available in Paris for $6/hr, which would be a pretty good deal if they have high availability.
I keep mentioning “availability”, and that was one of the big lessons I learned: not every service has the GPUs you want available all the time. These services are designed to cater to shorter-term uses of GPUs, like brief fine-tuning runs and serverless GPU endpoints. And they try to price their servers dynamically so that they always have a few GPUs available, but not too many. I originally had this idea that I would start up the server at 8am and shut it down at 6pm for every day of the exhibition, spending less than half of what I would if I were running the server 24 hours a day. After testing this idea for a few days, I started to notice that sometimes I would go to turn on the server and there would only be 2x4090s available on RunPod instead of the 3x4090s that I needed. In the end, I needed to reserve the GPUs for the entire duration by not turning the machine off.
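For what it’s worth, the daytime-only plan would have been a real savings if availability had cooperated. A quick back-of-the-envelope using RunPod’s $0.70/hr 4090 pricing:

```python
rate = 0.70  # USD per hour for one RunPod 4090
gpus = 3     # Transformirror needs 3x4090
days = 4     # festival duration

always_on = rate * gpus * 24 * days  # keep the machine reserved the whole time
daytime = rate * gpus * 10 * days    # 8am to 6pm only

print(f"always on: ${always_on:.2f}, daytime only: ${daytime:.2f}")
# always on: $201.60, daytime only: $84.00
```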
Thank you to Latitude and RunPod, who both provided technical support and credits for us to run this piece. In the end, we ran the piece for two days on Latitude and two days on RunPod and got very similar performance out of both. Additional thanks to LeafCloud for their support.