Cloud Platform Engineering for Enterprise AI and Machine Learning Workloads: Optimizing Resource Allocation and Performance

Authors

  • Srinivasan Ramalingam, Highbrow Technology Inc, USA
  • Rama Krishna Inampudi, Independent Researcher, Mexico
  • Manish Tomar, Citibank, USA

Keywords:

cloud platform engineering, enterprise AI

Abstract

Enterprise adoption of artificial intelligence (AI) and machine learning (ML) places substantial demands on cloud platform architecture. As AI/ML workloads proliferate, organizations face challenges in provisioning compute, processing data, and distributing resources efficiently. This work optimizes cloud platform resource allocation and performance for enterprise AI and ML workloads. We assess hybrid cloud architectures, Infrastructure-as-a-Service (IaaS), and Platform-as-a-Service (PaaS) for dynamic, resource-intensive AI/ML applications. The study employs elastic resource management to allocate compute resources dynamically according to workload demand, minimizing operational costs and resource underutilization.
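As an illustration of the elastic resource management described above, the minimal sketch below shows a threshold-style scaling rule of the kind used by Kubernetes' Horizontal Pod Autoscaler, where the desired replica count is proportional to the ratio of observed to target utilization. The function names, parameters, and the 60% target are illustrative assumptions, not the implementation evaluated in the paper.

    import math
    from dataclasses import dataclass

    @dataclass
    class ScalingPolicy:
        target_utilization: float = 0.6   # desired average worker utilization (assumed target)
        min_replicas: int = 1
        max_replicas: int = 32

    def desired_replicas(current_replicas: int,
                         observed_utilization: float,
                         policy: ScalingPolicy) -> int:
        """Proportional scaling rule: scale the replica count by the ratio of
        observed to target utilization, clamped to the allowed range."""
        if current_replicas == 0:
            return policy.min_replicas
        raw = current_replicas * (observed_utilization / policy.target_utilization)
        return max(policy.min_replicas, min(policy.max_replicas, math.ceil(raw)))

    # Example: 4 training workers at 90% utilization against a 60% target -> 6 workers.
    print(desired_replicas(4, 0.90, ScalingPolicy()))

Driving such a rule from workload metrics (queue depth, GPU utilization) is what lets the platform shed idle capacity during quiet periods and scale out ahead of training or inference bursts.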

We deploy and scale ML models using Kubernetes and Docker. By supporting microservices-based, iterative AI application development, these systems improve modularity, version control, and collaboration. Function-as-a-Service (FaaS) and serverless computing reduce overhead for short-lived training jobs and inference tasks. We also evaluate these architectural approaches with respect to fault tolerance, latency, and throughput.
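To make the serverless inference path concrete, the sketch below uses the common handler(event, context) convention found in FaaS platforms such as AWS Lambda: the model is loaded once per warm container and reused across invocations. The load_model() helper, the trivial predictor, and the request fields are assumptions for illustration and do not come from the paper.

    import json

    _MODEL = None  # cached across warm invocations to amortize cold-start cost

    def load_model():
        """Placeholder loader; in practice this would pull model weights
        from object storage or a model registry (assumed)."""
        return lambda features: sum(features)  # trivial stand-in for a real predictor

    def handler(event, context=None):
        """FaaS entry point: parse the request, run inference, return JSON."""
        global _MODEL
        if _MODEL is None:          # cold start: load the model once per container
            _MODEL = load_model()
        features = event.get("features", [])
        prediction = _MODEL(features)
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}

    # Local usage example
    print(handler({"features": [0.2, 0.5, 0.3]}))

Keeping the model in module scope is the standard way to avoid paying the load cost on every request while still letting the platform scale instances to zero when traffic stops.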

Published

09-11-2022

How to Cite

[1]
Srinivasan Ramalingam, Rama Krishna Inampudi, and Manish Tomar, “Cloud Platform Engineering for Enterprise AI and Machine Learning Workloads: Optimizing Resource Allocation and Performance”, J. of Art. Int. Research, vol. 2, no. 2, pp. 405–452, Nov. 2022, Accessed: Jun. 09, 2025. [Online]. Available: https://tsbpublisher.org/jair/article/view/31