Kuaishou Technology CEO Su Hua argues in his new book, “The Power of Being Seen – What is Kuaishou,” that the fundamental principle of happiness lies in the allocation of resources, with attention being the Internet’s most vital resource. Kuaishou’s mission is to leverage warm technology, particularly artificial intelligence (AI), to capture more people’s attention and enhance their unique sense of happiness. From Su Hua’s perspective, we are in a particularly fascinating era in which the Internet transcends geographical limitations, allowing for faster and more convenient connections among people. Kuaishou possesses the capability for large-scale computing and excels in AI and machine learning, skills that many individuals worldwide lack. “We should effectively utilize this capability to assist those who do not have access to such resources, enabling them to thrive in a rapidly changing environment. This represents the progress and efficiency improvements brought about by the technological revolution. I have been contemplating how to channel the increased efficiency back to the people, and I hope we can continue to explore and pursue this goal in the future.”
Powerful computing capabilities and cloud services have significantly accelerated the development of many industries. As a consequence, data centers worldwide consume substantial amounts of electricity, with estimates suggesting that global data centers may use between 350 and 400 TWh annually. At the same time, the demand for high-performance CPUs and GPUs to meet industrial requirements has driven up chip power consumption and heat flux density, and the power density of server racks continues to rise, with projections indicating that power per rack could reach 40 kW in the near future. Most data centers today rely on air cooling as their primary thermal management method, with approximately 40% of a data center's total energy consumption dedicated to cooling servers. Because air cooling is becoming inadequate for the thermal challenges posed by large-scale server racks, there is an urgent need to develop more efficient cooling systems and improve energy efficiency. Liquid coolants are gaining popularity due to their superior cooling capacity and efficiency, particularly in light of national policies aimed at carbon reduction and achieving carbon neutrality within a green economy.
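To put the 40% cooling share in perspective, a back-of-the-envelope Power Usage Effectiveness (PUE) estimate can be sketched as follows; the 5% non-IT overhead is an assumed figure for illustration, not a number from the article.

```python
# Illustrative PUE estimate (assumed energy split, not measured data).
# PUE = total facility energy / IT equipment energy.

cooling_frac = 0.40   # cooling share of total energy (from the text)
other_frac = 0.05     # assumed non-IT overhead (power delivery, lighting)
it_frac = 1.0 - cooling_frac - other_frac   # remaining share is the IT load

pue = 1.0 / it_frac
print(f"PUE ~= {pue:.2f}")   # ~1.82 under these assumptions
```

Every watt shaved off cooling shrinks the numerator directly, which is why more efficient cooling translates straight into a better PUE.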
Liquid cooling solutions have been proposed and studied for many years, and they have recently drawn increasing attention from both industry and academia, driven largely by the growing importance of energy efficiency in data centers. Among the various liquid cooling technologies, single-phase cold plate technology is well established and widely used in modern data centers to address thermal design challenges. This technology attaches a cold plate directly to high-power components, such as CPUs and GPUs, allowing for efficient heat dissipation. The heat generated by these major components is removed by circulating a fluid, such as deionized water, through the attached cold plate, while lower-power components in the server continue to be cooled by air. Research has shown that this hybrid liquid cooling approach for high-performance data centers can reduce cooling costs by up to 45% compared to data centers that rely solely on air cooling. Cold plate liquid cooling has thus proven to be a viable alternative to traditional air cooling for thermal management in data centers.
Achievements
Investigation on Advanced Cold Plate Liquid Cooling Solution for Large Scale Application in Data Center
Recently, Intel and Kuaishou jointly published research on advanced cold plate liquid cooling solutions for large-scale data center applications. With the rapid advancement of computing power and increasingly stringent government regulations worldwide on energy conservation and emission reduction, efficient cooling technology has garnered significant attention from the industry. At the same time, the high-performance CPUs and GPUs needed to meet industrial demands bring higher power consumption and heat flux density in chips.
A hybrid liquid cooling solution is employed to strike an optimal balance among thermal cooling performance, Power Usage Effectiveness (PUE), and Total Cost of Ownership (TCO). In this system, only the CPUs are cooled by the cold plate liquid cooling solution, while the other components continue to be cooled by air. The liquid cooling kit comprises two CPU cold plates, two UQD (universal quick disconnect) couplings, one inlet, one outlet, and the pipes that connect these components. In addition, as a secondary safeguard against leaks, the system is equipped with a leakage detection rope that monitors for any sign of leakage, allowing the system to respond promptly with emergency measures. The kit is illustrated in Figure 1.
The design of the cold plate is crucial to the effectiveness of the liquid cooling solution. In Kuaishou’s approach, the internal flow channel of the cold plate is formed by a skiving process that creates microchannels (more information on the skiving process can be found in the related article at the end of this document), which enhances heat transfer from the cold plate fins to the coolant. Based on computational fluid dynamics (CFD) analysis and empirical test data, the cooling capacity of the cold plate can support a CPU thermal design power (TDP) of 350 W or higher. To reduce the required coolant flow rate, simplify the piping, and improve the heat transfer efficiency of the cooling distribution unit (CDU), the two cold plates are connected in series, as illustrated in Figure 2.
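As a rough sanity check on the series arrangement, the required coolant flow can be estimated from the steady-state energy balance Q = ṁ·cp·ΔT. The sketch below assumes two 350 W CPUs sharing one water loop and an allowed coolant temperature rise of 10 K; both the ΔT and the use of plain water properties are illustrative assumptions, not figures from the paper.

```python
# Illustrative sizing sketch: estimate the coolant flow rate needed for
# two 350 W CPU cold plates connected in series, from Q = m_dot * cp * dT.
# The 10 K coolant temperature rise is an assumed design value.

CP_WATER = 4180.0    # specific heat of water, J/(kg*K)
RHO_WATER = 997.0    # density of water, kg/m^3

def series_flow_rate_lpm(heat_w: float, delta_t_k: float) -> float:
    """Volumetric flow rate (L/min) that absorbs heat_w watts
    with a coolant temperature rise of delta_t_k kelvin."""
    m_dot = heat_w / (CP_WATER * delta_t_k)    # mass flow, kg/s
    return m_dot / RHO_WATER * 1000.0 * 60.0   # convert m^3/s -> L/min

# Two 350 W CPUs in series share a single coolant stream:
total_heat = 2 * 350.0                         # W
print(f"{series_flow_rate_lpm(total_heat, 10.0):.2f} L/min")  # ~1.0 L/min
```

Under these assumptions a single loop of roughly one liter per minute carries away both CPUs' heat, which is why the series layout simplifies the piping without demanding a high pump speed.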
Generally, there are two ways to connect the cold plate modules within the node. The first uses a hard pipe, typically made of metal. The second uses a hose made from a non-metallic flexible material, such as PTFE (polytetrafluoroethylene) bellows, which has an operating temperature range of -190°C to 260°C and can therefore withstand the thermal shocks of a liquid cooling system in extreme environments. Another option is EPDM (ethylene propylene diene monomer) rubber, which generally operates within a temperature range of -40°C to +120°C. Flexible hoses allow a more adaptable system layout; this solution therefore adopts a corrugated hose connection scheme, as illustrated in Figure 3.
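The quoted temperature ratings can be turned into a simple selection check. The sketch below is a hypothetical helper: the ratings come from the text above, while the example loop extremes are made-up values for illustration.

```python
# Hypothetical hose-material check: does a material's rated temperature
# range (quoted above) cover the coolant loop's expected extremes?

HOSE_RATINGS_C = {
    "PTFE": (-190.0, 260.0),   # polytetrafluoroethylene bellows
    "EPDM": (-40.0, 120.0),    # ethylene propylene diene monomer rubber
}

def covers(material: str, loop_min_c: float, loop_max_c: float) -> bool:
    lo, hi = HOSE_RATINGS_C[material]
    return lo <= loop_min_c and loop_max_c <= hi

# Example: a loop expected to stay between 15 C and 65 C fits both ratings.
for material in HOSE_RATINGS_C:
    print(material, covers(material, 15.0, 65.0))
```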
To allow rapid disassembly and assembly without water leakage, the cooling system is connected to the manifold using leak-free quick plugs. The quick plug is a two-way sealed connector designed to prevent leaks. Figure 4 illustrates a typical quick plug, which consists of a male and a female connector. When a server needs to be removed from the rack for operation and maintenance, the male and female connectors are simply disconnected; the process is straightforward and can be done with one hand. This effectively solves the waterway sealing problem and improves operation and maintenance efficiency. The body of the quick plug can be made of stainless steel, copper, aluminum, or resin. To ensure compatibility with the coolant and the other wetted materials, stainless steel quick disconnects were selected for this cooling solution.
There is a potential risk of liquid leakage at the junction between the connecting pipe and the cold plate, as well as at the connection between the pipe and the quick disconnect. If a leak occurs during operation, a warning signal should be sent to the operator promptly for maintenance and replacement. As illustrated in Figure 5, the liquid leakage detection line is wrapped around both the connecting pipe and the cold plate. When a leak occurs, the resistance of the detection line changes, producing a corresponding change in the voltage signal. The baseboard management controller (BMC) detects this voltage change; if the voltage falls below a predetermined threshold, the BMC issues a warning signal to the operator.
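A minimal sketch of the detection logic described above might look like the following; read_rope_voltage(), the 2.5 V threshold, and the 1 s polling interval are all hypothetical placeholders, not details from Kuaishou's firmware.

```python
# Sketch of the BMC-side check described above (assumed logic): poll the
# voltage sensed on the leak detection rope and raise an alert when it
# drops below a threshold.

import time

LEAK_THRESHOLD_V = 2.5   # assumed alarm threshold, volts
POLL_INTERVAL_S = 1.0    # assumed polling period

def read_rope_voltage() -> float:
    """Placeholder for the BMC's ADC read of the detection line.
    A real implementation would sample the rope's sense voltage."""
    return 3.3  # simulated dry-rope reading

def monitor_leak_rope() -> None:
    while True:
        voltage = read_rope_voltage()
        # Liquid on the rope lowers its resistance, pulling the sensed
        # voltage below the threshold.
        if voltage < LEAK_THRESHOLD_V:
            print(f"ALERT: possible coolant leak (rope voltage {voltage:.2f} V)")
        time.sleep(POLL_INTERVAL_S)
```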
Source: https://mp.weixin.qq.com/s/0trwXR9vQ2J-JgdKLdlXdw