The paper titled "Target-based Resource Allocation for Deep Learning Applications in a Multi-tenancy System" was accepted by the 23rd IEEE High Performance Extreme Computing Conference (HPEC). HPEC is the largest computing conference in New England, USA, and a premier conference on the convergence of high-performance and embedded computing.
Abstract: Neural-network-based deep learning is the key technology that enables many powerful applications, including self-driving vehicles, computer vision, and natural language processing. Although various algorithms focus on different directions, they generally employ an iterative training and evaluation process. Each iteration aims to find a parameter set that minimizes a loss function defined by the learning model. When the training process completes, the global minimum is achieved with a set of optimized parameters. At this stage, the deep learning application can be shipped with the trained model to provide services. While deep learning applications are reshaping our daily life, obtaining a good learning model is an expensive task. Training deep learning models is usually time-consuming and requires substantial resources, e.g., CPU, GPU, and memory. In a multi-tenancy system, however, limited resources are shared by multiple clients, which leads to severe resource contention. Therefore, a carefully designed resource management plan is required to improve overall performance. In this project, we investigate a target-based scheduling scheme named TRADL. In TRADL, developers have the option to specify a two-tiered target: if the accuracy of the model reaches the first target, the model can already be delivered to clients while training continues to further improve its quality. The experiments show that TRADL is able to significantly reduce the time to reach the target, by as much as 48.4%.
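The two-tiered-target idea from the abstract can be sketched in a few lines. The toy loop below is only an illustration of the concept, not the paper's TRADL implementation: the loss function, learning rate, and target thresholds are made-up choices, and real training would operate on a neural network rather than a scalar parameter.

```python
# Sketch of a two-tiered target: deliver a model snapshot as soon as it
# meets a first quality target, then keep training toward a final target.
# Toy setup: minimize f(w) = (w - 3)^2 by gradient descent.

def train_with_target(w0, lr, target_loss, final_loss, max_steps=10_000):
    """Return (delivered_snapshot, final_w).

    delivered_snapshot: first parameter value whose loss met target_loss
      (the model that could already be shipped to clients);
    final_w: the parameter after training continued to final_loss.
    """
    loss = lambda w: (w - 3.0) ** 2
    grad = lambda w: 2.0 * (w - 3.0)

    w, delivered = w0, None
    for _ in range(max_steps):
        if delivered is None and loss(w) <= target_loss:
            delivered = w              # first-tier target met: ship a copy
        if loss(w) <= final_loss:      # second-tier target met: stop
            break
        w -= lr * grad(w)
    return delivered, w


snapshot, final = train_with_target(w0=0.0, lr=0.1,
                                    target_loss=1e-2, final_loss=1e-6)
```

The point of the scheme is visible even in this toy: the `snapshot` becomes available well before `final` does, so clients get a usable model early while training continues in the background.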