PyTorch distributed: Experiences on accelerating data
parallel training. Proceedings of the VLDB Endowment,
13(12), 2019.
[24]
R. Liaw, R. Bhardwaj, L. Dunlap, Y. Zou, J. E. Gonzalez,
I. Stoica, and A. Tumanov. HyperSched: Dynamic
resource reallocation for model development on a
deadline. In Proceedings of the ACM Symposium on
Cloud Computing, pages 61–73, 2019.
[25]
R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez,
and I. Stoica. Tune: A research platform for
distributed model selection and training. arXiv preprint
arXiv:1807.05118, 2018.
[26]
H. Liu, Q. Zeng, J. Zhou, A. Bartlett, B.-A. Wang,
P. Berube, W. Tian, M. Kenworthy, J. Altshul, J. R. Nery,
H. Chen, R. G. Castanon, S. Zu, Y. E. Li, J. Lucero, J. K.
Osteen, A. Pinto-Duarte, J. Lee, J. Rink, S. Cho, N. Emerson,
M. Nunn, C. O’Connor, Z. Yao, K. A. Smith, B. Tasic,
H. Zeng, C. Luo, J. R. Dixon, B. Ren, M. M. Behrens,
and J. R. Ecker. Single-cell DNA methylome and 3D
multi-omic atlas of the adult mouse brain. bioRxiv, 2023.
[27]
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen,
O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov.
RoBERTa: A robustly optimized BERT pretraining
approach. CoRR, abs/1907.11692, 2019.
[28]
L. Luo, P. West, P. Patel, A. Krishnamurthy, and L. Ceze.
Srifty: Swift and thrifty distributed neural network
training on the cloud. Proceedings of Machine Learning
and Systems, 4:833–847, 2022.
[29]
A. Marathe, R. Harris, D. K. Lowenthal, B. R. de Supinski,
B. Rountree, and M. Schulz. Exploiting redundancy
and application scalability for cost-effective,
time-constrained execution of HPC applications on Amazon
EC2. IEEE Transactions on Parallel and Distributed
Systems, 27(9):2574–2588, 2015.
[30]
I. Menache, O. Shamir, and N. Jain. On-demand, spot, or
both: Dynamic resource allocation for executing batch
jobs in the cloud. In 11th International Conference
on Autonomic Computing (ICAC 14), pages 177–187,
Philadelphia, PA, June 2014. USENIX Association.
[31]
S. Mitchell, M. O’Sullivan, and I. Dunning. PuLP: A
Linear Programming Toolkit for Python. 2011.
[32]
R. O. Nambiar and M. Poess. The Making of TPC-DS.
In Proceedings of the 32nd International Conference
on Very Large Data Bases, VLDB ’06, pages 1049–1058.
VLDB Endowment, 2006.
[33]
D. Narayanan, K. Santhanam, F. Kazhamiaka, A. Phanishayee,
and M. Zaharia. Analysis and exploitation
of dynamic pricing in the public cloud for ML training.
In Workshop on Distributed Infrastructure, Systems,
Programming, and AI, August 2020.
[34]
M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng,
D. Grangier, and M. Auli. fairseq: A fast, extensible
toolkit for sequence modeling. In Proceedings of the
2019 Conference of the North American Chapter of the
Association for Computational Linguistics (Demonstrations),
pages 48–53, Minneapolis, Minnesota, June 2019.
Association for Computational Linguistics.
[35]
D. Poola, K. Ramamohanarao, and R. Buyya.
Fault-tolerant workflow scheduling using spot instances on
clouds. Procedia Computer Science, 29:523–533,
2014. 2014 International Conference on Computational
Science.
[36]
A. Sergeev and M. Del Balso. Horovod: fast and easy
distributed deep learning in TensorFlow. arXiv preprint
arXiv:1802.05799, 2018.
[37]
J. Thorpe, P. Zhao, J. Eyolfson, Y. Qiao, Z. Jia, M. Zhang,
R. Netravali, and G. H. Xu. Bamboo: Making preemptible
instances resilient for affordable training of
large DNNs. In 20th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 23), pages
497–513, Boston, MA, Apr. 2023. USENIX Association.
[38]
J. Thorpe, P. Zhao, J. Eyolfson, Y. Qiao, Z. Jia, M. Zhang,
R. Netravali, and G. H. Xu. Bamboo: Making preemptible
instances resilient for affordable training of
large DNNs. In 20th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 23), pages
497–513, Boston, MA, Apr. 2023. USENIX Association.
[39]
P. Varshney and Y. Simmhan. Autobot: Resilient and
cost-effective scheduling of a bag of tasks on spot VMs.
IEEE Transactions on Parallel and Distributed Systems,
30(7):1512–1527, 2019.
[40]
M. Wagenländer, L. Mai, G. Li, and P. Pietzuch. Spotnik:
Designing distributed machine learning for transient
cloud resources. In Proceedings of the 12th USENIX
Conference on Hot Topics in Cloud Computing, pages
4–4, 2020.
[41]
S. Wang and M. Casado. The Cost of Cloud, a
Trillion Dollar Paradox.
https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap-cloud-lifecycle-scale-growth-repatriation-optimization.
[42]
F. Yang, B. Pang, J. Zhang, B. Qiao, L. Wang, C. Couturier,
C. Bansal, S. Ram, S. Qin, Z. Ma, Í. Goiri,
E. Cortez, S. Baladhandayutham, V. Rühle, S. Rajmohan,
Q. Lin, and D. Zhang. Spot virtual machine eviction
prediction in Microsoft cloud. In Companion Proceedings
198 21st USENIX Symposium on Networked Systems Design and Implementation USENIX Association