Optimization Strategies for Distributed Deep Learning Training Based on Cloud Computing

Keywords

Cloud computing; distributed deep learning; dynamic task allocation

How to Cite

Huang, S. (2025). Optimization Strategies for Distributed Deep Learning Training Based on Cloud Computing. International Theory and Practice in Humanities and Social Sciences, 2(4), 318–337. https://doi.org/10.70693/itphss.v2i4.394
Received: 2025-02-04
Accepted: 2025-02-28
Published: 2025-04-28

Abstract

With the widespread application of deep learning in various domains, the demand for computational resources and training efficiency is growing exponentially. Cloud computing, with its robust computational power and flexible resource scheduling, has become a crucial platform for distributed deep learning training. However, existing distributed training methods still face challenges such as uneven task scheduling, excessive communication overhead, and low resource utilization. This paper proposes an optimization strategy for distributed deep learning training in cloud environments, encompassing dynamic task allocation, data partitioning and caching optimization, communication efficiency improvement, and heterogeneous resource scheduling methods. Experimental validation on typical cloud computing platforms with various deep learning tasks demonstrates that the proposed strategies significantly reduce training time, improve resource utilization, and effectively minimize communication overhead, providing strong support for the efficient execution of cloud-based deep learning tasks.
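
The full paper details these strategies; as a minimal illustration of the kind of mechanisms the abstract names, the Python sketch below shows (a) throughput-proportional batch allocation, a simple form of dynamic task allocation across heterogeneous workers, and (b) top-k gradient sparsification, a standard way to reduce communication volume. The function names, the 1% sparsity ratio, and the throughput figures are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: two common building blocks behind the strategies
# named in the abstract. Assumed helpers, not the paper's actual method.
import numpy as np

def allocate_batches(global_batch: int, throughputs: list[float]) -> list[int]:
    """Split a global mini-batch across workers in proportion to measured
    samples/sec, so faster nodes receive proportionally more work."""
    total = sum(throughputs)
    shares = [int(round(global_batch * t / total)) for t in throughputs]
    # Absorb rounding drift so the shares sum exactly to the global batch.
    shares[-1] += global_batch - sum(shares)
    return shares

def topk_sparsify(grad: np.ndarray, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries,
    returning (indices, values); the remainder would be accumulated locally."""
    k = max(1, int(grad.size * ratio))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

if __name__ == "__main__":
    # Three workers with unequal measured throughput (samples/sec).
    print(allocate_batches(1024, [310.0, 290.0, 150.0]))  # -> [423, 396, 205]
    idx, vals = topk_sparsify(np.random.randn(10_000), ratio=0.01)
    print(idx.shape, vals.shape)  # -> (100,) (100,)

In this sketch, only the (index, value) pairs from topk_sparsify would be exchanged between workers, which is the basic idea behind the communication-efficiency component described in the abstract.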



This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2025 ShaoBin Huang (Author)
