Vol. 3 No. 1 (2023): African Journal of Artificial Intelligence and Sustainable Development

Utilizing Machine Learning for Optimizing Kubernetes Scheduler Performance

Babulal Shaik
Cloud Solutions Architect at Amazon Web Services, USA
Jayaram Immaneni
SRE Lead at JP Morgan Chase, USA
Srikanth Bandi
Software Engineer at JP Morgan Chase, USA

Published 22-06-2023

Keywords

  • Machine learning
  • Kubernetes scheduler

How to Cite

[1]
Babulal Shaik, Jayaram Immaneni, and Srikanth Bandi, “Utilizing Machine Learning for Optimizing Kubernetes Scheduler Performance”, African J. of Artificial Int. and Sust. Dev., vol. 3, no. 1, pp. 469–489, Jun. 2023, Accessed: Dec. 29, 2024. [Online]. Available: https://africansciencegroup.com/index.php/AJAISD/article/view/222

Abstract

Kubernetes, a widely adopted open-source platform for managing containerized applications, automates the deployment, scaling, and orchestration of workloads across a cluster of machines. At its core is the scheduler, which decides where each workload unit, or pod, is placed among the available nodes in the cluster. The efficiency of this scheduling process is vital for maintaining system performance, especially as Kubernetes clusters grow in size and complexity. The default scheduler is functional, but it often struggles with large-scale or dynamic workloads that require real-time resource management. Machine learning (ML) can address this gap. With ML techniques, a Kubernetes scheduler can predict resource usage more accurately, optimize pod placement, and make better-informed scheduling decisions. ML models can analyze past usage patterns, anticipate the resource requirements of incoming workloads, and adjust scheduling strategies accordingly. This approach can improve performance, reduce resource contention, and balance load more evenly, all of which contribute to a more efficient and reliable system. Incorporating ML into the Kubernetes scheduler is nonetheless challenging. The integration must work seamlessly with existing scheduling algorithms and must not compromise the stability or predictability of the system. The ML models also introduce overhead and need regular retraining to keep up with evolving workloads. Even so, the potential benefits of ML-enhanced Kubernetes scheduling are substantial, including improved scalability, responsiveness, and resource efficiency. As Kubernetes adoption continues to grow, ML-driven scheduling promises to be a key factor in optimizing the performance of large-scale cloud-native environments.
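
To make the idea of prediction-driven placement concrete, the following is a minimal Python sketch and not the method described in the article. It assumes a deliberately simple model: a least-squares trend line fitted to each node's recent CPU usage and extrapolated one step ahead. The names predict_next, Node, and pick_node, the sample values, and the headroom-based score are all hypothetical. The sketch loosely mirrors the filter-then-score pattern of a scheduler: nodes whose predicted usage leaves no room for the pod are dropped, and the remaining node with the most predicted headroom is chosen.

```python
from dataclasses import dataclass
from typing import List, Optional


def predict_next(history: List[float]) -> float:
    """Fit a least-squares line to the samples and extrapolate one step ahead.

    This stands in for a real ML model; it is an illustrative assumption.
    """
    n = len(history)
    if n == 0:
        raise ValueError("history must not be empty")
    if n == 1:
        return history[0]
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    # The next sample sits at index n on the fitted line.
    return mean_y + slope * (n - mean_x)


@dataclass
class Node:
    """Hypothetical view of a cluster node used only in this sketch."""
    name: str
    cpu_capacity: float       # total cores
    cpu_history: List[float]  # recent usage samples in cores, oldest first


def pick_node(nodes: List[Node], pod_cpu_request: float) -> Optional[str]:
    """Return the node with the most predicted CPU headroom, or None."""
    best_name: Optional[str] = None
    best_headroom = -1.0
    for node in nodes:
        predicted = max(0.0, predict_next(node.cpu_history))
        headroom = node.cpu_capacity - predicted - pod_cpu_request
        if headroom < 0:
            continue  # filter step: the pod is not expected to fit
        if headroom > best_headroom:  # score step: prefer more headroom
            best_name, best_headroom = node.name, headroom
    return best_name


if __name__ == "__main__":
    cluster = [
        Node("node-a", cpu_capacity=4.0, cpu_history=[1.0, 2.0, 3.0]),
        Node("node-b", cpu_capacity=4.0, cpu_history=[2.5, 2.5, 2.5]),
    ]
    print(pick_node(cluster, pod_cpu_request=0.5))
```

In this example node-a has spare capacity at its latest sample, but its rising trend predicts that it will be full at the next step, so the sketch places the pod on node-b. A production scheduler would need a properly trained and regularly retrained model, more resource dimensions than CPU, and safeguards for the overhead and stability concerns the abstract raises.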

