One constant challenge in multicore systems is to fully utilize the abundant resources while ensuring superior performance for individual tasks, particularly in Non-Uniform Memory Access (NUMA) systems, where locality of access is an important factor. Achieving this goal requires rethinking how to exploit parallel data access and I/O-related optimizations. In the context of developing software for high-speed data transfer, we present a novel design based on asynchronous processing and detail the advantages of resource-conscious task scheduling. In our design, multiple sets of threads are allocated to the different stages of the processing pipeline based on the capacity of resources, including storage I/O and network communication operations. The threads in these stages execute asynchronously and communicate efficiently via localized mechanisms in NUMA systems, e.g., task grouping, buffer memory, and locks. This design allows multiple effective optimizations to be integrated seamlessly, in particular to improve the performance and scalability of end-to-end data transfer. To validate the benefits of the design and its optimizations, we conducted extensive experiments on state-of-the-art multicore systems. Our results highlight the performance advantages of our software across a range of typical workloads, compared with the widely adopted data transfer tools GridFTP and BBCP.
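As an illustration only, the staged, asynchronous structure described above can be sketched in a few lines of Python. This is not the software evaluated in this work: the names `run_stage`, `transfer`, `read_block`, and `send_block`, the thread counts, and the queue depth are all hypothetical, the storage and network operations are replaced by trivial stand-ins, and NUMA-aware placement of threads and buffers is not shown. The sketch only demonstrates separately sized thread sets per stage that communicate through bounded buffers.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the stream


def run_stage(work, inbox, outbox, n_threads, n_downstream):
    """Start n_threads workers that apply `work` to each item from inbox.

    The last worker to exit forwards n_downstream sentinels so that every
    consumer of `outbox` eventually receives exactly one.
    """
    remaining = [n_threads]
    lock = threading.Lock()

    def worker():
        while True:
            item = inbox.get()
            if item is _STOP:
                break
            outbox.put(work(item))
        with lock:
            remaining[0] -= 1
            last = remaining[0] == 0
        if last:
            for _ in range(n_downstream):
                outbox.put(_STOP)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    return threads


def transfer(blocks, io_threads=2, net_threads=3, depth=4):
    """Push blocks through a two-stage pipeline (hypothetical stand-ins)."""
    to_read = queue.Queue(maxsize=depth)  # bounded: applies back-pressure
    to_send = queue.Queue(maxsize=depth)  # bounded buffer between stages
    done = queue.Queue()                  # unbounded, drained by the caller

    def read_block(i):
        # Stand-in for a storage read of block i.
        return i, blocks[i]

    def send_block(pair):
        # Stand-in for a network send; reports the number of bytes "sent".
        i, data = pair
        return i, len(data)

    threads = run_stage(read_block, to_read, to_send, io_threads, net_threads)
    threads += run_stage(send_block, to_send, done, net_threads, 1)

    for i in range(len(blocks)):
        to_read.put(i)
    for _ in range(io_threads):
        to_read.put(_STOP)  # one sentinel per I/O worker

    sent = {}
    while True:
        item = done.get()
        if item is _STOP:
            break
        sent[item[0]] = item[1]
    for t in threads:
        t.join()
    return sent
```

In this sketch the per-stage thread counts are plain parameters; in a resource-conscious scheduler they would presumably be derived from the measured capacity of the storage and network devices. Because the stages run concurrently, blocks may complete out of order, which is why the results are keyed by block index.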