DeepSeek V3 paper details: How to bypass the CUDA monopoly!

DeepSeek’s two recently released models, DeepSeek-V3 and DeepSeek-R1, achieve performance comparable to similar models from OpenAI at a much lower cost.

According to media reports, in just two months they trained a 671-billion-parameter MoE language model on a cluster of 2,048 H800 GPUs, a training run reportedly about ten times more efficient than those of leading AI labs.

This breakthrough was not achieved using CUDA, but through a large number of fine-grained optimizations and the use of NVIDIA’s assembly-like PTX (parallel thread execution) programming.

Under hardware constraints, DeepSeek was forced onto a different path from OpenAI and other companies that rely on brute-force computing power, using a series of technical innovations to cut the model's compute requirements while still improving performance.

Some enthusiastic comments from netizens:

“If there is any group of people in this world crazy enough to say ‘CUDA is too slow!’, it would be them.”

Genius geeks fine-tune PTX to maximize GPU performance

NVIDIA PTX (Parallel Thread Execution) is an intermediate instruction set architecture designed for NVIDIA GPUs. It sits between high-level GPU programming languages (such as CUDA C/C++) or other language front ends and the low-level machine code (SASS, NVIDIA's native streaming assembly).

As a low-level instruction set architecture, PTX presents the GPU as a data-parallel computing device and enables fine-grained optimizations, such as register allocation and thread/warp-level tuning, that are not expressible in languages like CUDA C/C++.

When PTX is compiled to SASS, it is optimized for one specific generation of NVIDIA GPUs.
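The toolchain path described above can be seen with a minimal kernel. This is a generic illustrative sketch, not DeepSeek's code: `nvcc --ptx` stops at the intermediate PTX stage, and `ptxas` lowers that PTX to SASS for one specific architecture.

```cuda
// add.cu — a minimal CUDA C++ kernel (illustrative only).
// `nvcc --ptx add.cu -o add.ptx` emits human-readable PTX, and
// `ptxas -arch=sm_90 add.ptx -o add.cubin` lowers it to SASS for Hopper.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guarded elementwise add
}
```

Because the second step targets a single `sm_XX` architecture, the resulting SASS is not portable across GPU generations, which is exactly the trade-off the article describes.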

When training the V3 model, DeepSeek reconfigured the NVIDIA H800 GPU:

Of the 132 streaming multiprocessors (SMs), 20 were dedicated to inter-server communication, mainly data compression and decompression, to work around the H800's restricted interconnect bandwidth and speed up data exchange.

To maximize performance, DeepSeek also implemented advanced pipelining algorithms through additional fine-grained thread/warp-level adjustments.
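What warp-level control looks like in practice can be sketched with inline PTX. This is a generic illustration, not DeepSeek's code; `%laneid` is a real PTX special register giving a thread's lane within its 32-thread warp.

```cuda
// Illustrative sketch of warp-level control via inline PTX (not DeepSeek's code).
// %laneid identifies a thread's position within its 32-thread warp.
__device__ unsigned lane_id() {
    unsigned id;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}

// Example use: let lane 0 of each warp perform a per-warp bookkeeping step
// while the remaining lanes continue with compute.
__global__ void warp_specialized(int* warp_counter) {
    if (lane_id() == 0) {
        atomicAdd(warp_counter, 1);  // one update per warp, not per thread
    }
    // ...warp-wide compute continues here...
}
```

Assigning different roles to different lanes or warps in this way is one common building block of the pipelining schemes the article alludes to.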

These optimizations go far beyond conventional CUDA development and are extremely difficult to maintain; it is precisely this level of optimization that demonstrates the DeepSeek team's outstanding technical capability.

The V3 paper specifically mentions details about PTX

This is because, under the dual pressure of a global GPU shortage and US restrictions, companies such as DeepSeek had to seek innovative solutions.

Fortunately, they have made significant breakthroughs in this area.

One developer argues that “low-level GPU programming is the right direction: the more you optimize, the lower the cost, or the more performance budget is freed up for other progress at no extra expense.”

This breakthrough has had a significant impact on the market, and some investors believe that the new model will reduce the demand for high-performance hardware, which may affect the sales performance of companies such as NVIDIA.

However, industry veterans, including former Intel CEO Pat Gelsinger, believe that AI applications can make full use of all available computing power.

Gelsinger sees this breakthrough by DeepSeek as a new way to embed AI capabilities in low-cost devices for the mass market.

PTX and CUDA

So does the arrival of DeepSeek mean that developing cutting-edge LLMs no longer requires large-scale GPU clusters?

Will the huge investments in computing resources by Google, OpenAI, Meta and xAI ultimately go to waste? The general consensus among AI developers is that this is not the case.

However, it is certain that there is still huge potential to be tapped in terms of data processing and algorithm optimization, and more innovative optimization methods will surely emerge in the future.

DeepSeek has open-sourced the V3 model, and its technical report discloses these details.

The report documents the deep underlying optimizations performed by DeepSeek. In short, the degree of optimization can be summed up as “they have rebuilt the entire system from the ground up.”

As mentioned above, when training V3 on H800 GPUs, DeepSeek customized the GPU's core computing units (streaming multiprocessors, or SMs) to meet specific needs.

Of the total 132 SMs, they specifically allocated 20 to handle inter-server communication tasks rather than computing tasks.

This customization is done at the PTX (parallel thread execution) level, which is the low-level instruction set of the NVIDIA GPU.

PTX operates at a level close to assembly language and enables fine-grained optimizations such as register allocation and thread/warp-level tuning. However, this fine control is complex and hard to maintain.
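One way to picture the SM role split described above is with the `%smid` PTX special register, which reports the physical SM a block is running on. This is a hypothetical sketch only: the V3 report does not publish this code, and the actual mechanism for pinning 20 SMs to communication is not public.

```cuda
// Hypothetical sketch — NOT DeepSeek's implementation, which is not public.
// %smid is a PTX special register holding the id of the SM executing a block.
__device__ unsigned sm_id() {
    unsigned id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void partitioned_worker(/* ...buffers... */) {
    if (sm_id() < 20) {
        // communication role: compress, transmit, decompress activations
    } else {
        // compute role: run the model's matrix-multiply workload
    }
}
```

Note that NVIDIA documents `%smid` as potentially changing during execution, so a production system would need a more robust scheduling mechanism; the sketch only illustrates the kind of compute/communication role split involved.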

This is why developers usually prefer to use high-level programming languages such as CUDA, which provide sufficient performance optimizations for most parallel programming tasks and eliminate the need for low-level optimizations.

However, when it comes to maximizing the efficiency of GPU resources and achieving specific optimization requirements, developers have to resort to PTX.

However, the technical barriers remain

On this point, analyst Ian Cutress said, “DeepSeek's use of PTX does not eliminate the technical barriers of CUDA.”

CUDA is a high-level language. It makes developing libraries and interfaces with NVIDIA GPUs easier and supports rapid iterative development.

CUDA performance can be further tuned by dropping down to the underlying code (i.e. PTX), and its foundational libraries are already mature. Most production-grade software today is built on CUDA.

PTX is closer to the GPU's native assembly language. It works at a low level and allows micro-level optimization.

Choosing to program in PTX means giving up the built-in CUDA libraries mentioned above. It is a very tedious undertaking that requires deep expertise in hardware and runtime behavior.

However, if developers fully understand what they are doing, they can indeed achieve better performance and optimization at runtime.

Currently, the mainstream of the NVIDIA ecosystem is still the use of CUDA.

Developers who want to squeeze an extra 10–20% of performance or power efficiency out of their workloads, such as companies that deploy models in the cloud and sell token services, do optimize from the CUDA level down to the PTX level. They are willing to invest the time because the long-term payoff is worth it.

It should be noted that PTX is usually optimized for a specific hardware generation and is difficult to port across hardware unless adaptation logic is specifically written.
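Such adaptation logic often takes the form of architecture guards around the PTX-specific path. The sketch below is illustrative only: `cp.async` is a real PTX instruction that exists from Ampere (`sm_80`) onward, so older targets need a plain fallback.

```cuda
// Sketch of per-architecture "adaptation logic" (illustrative only).
// cp.async is available only on sm_80+ PTX targets; older GPUs need a fallback.
__device__ void copy16_to_shared(void* smem_dst, const void* gmem_src) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    // Asynchronous 16-byte copy into shared memory via inline PTX.
    unsigned s = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.ca.shared.global [%0], [%1], 16;"
                 :: "r"(s), "l"(gmem_src));
    asm volatile("cp.async.commit_group;");
    asm volatile("cp.async.wait_group 0;");
#else
    // Plain synchronous 16-byte copy on pre-Ampere hardware.
    *static_cast<float4*>(smem_dst) = *static_cast<const float4*>(gmem_src);
#endif
}
```

Every new GPU generation can add, change, or deprecate instructions like this, which is why hand-written PTX carries a standing maintenance cost that plain CUDA C/C++ largely avoids.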

In addition, hand-tuning compute kernels requires great perseverance, courage, and a special ability to stay calm, because the program may hit a memory access error every 5,000 cycles.

Of course, for those scenarios where PTX is really needed, and for those developers who are paid enough to deal with these issues, we express our full understanding and respect.

For everyone else, it is advisable to stick with CUDA or other high-level variants built on CUDA (or MLIR).
