How to make mobile phones run large AI models 4-5 times faster



In the rapidly evolving world of artificial intelligence, demand for running large AI models on edge devices like mobile phones, PCs, and even the Raspberry Pi is growing. However, efficiently deploying these models on resource-constrained devices, especially ones with nothing but a CPU, remains a significant hurdle. Traditionally, dedicated hardware accelerators like NPUs and GPUs have been the go-to solution for this task. But what if we could achieve similar, or even superior, performance using just a CPU? This is where T-MAC, a new technology from Microsoft Research Asia, comes in: it can make large AI models run 4-5 times faster on phones, all on an ordinary CPU.


The Problem: Running Big AI Models on Phones

When we try to run AI models on phones or small PCs, we run into two big issues: memory and power. Large models need plenty of both to work well. To cope, we often use a trick called model “quantization”: shrinking the model by storing its weights with fewer bits. While this saves space, it can slow the model down because of how the math is done. Normally, the low-bit weights have to be converted back to high-precision values (“dequantized”) before the hardware can multiply them, and that extra step costs time and undercuts the speed benefit.
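
To make the trade-off concrete, here is a minimal sketch of symmetric 4-bit quantization and the dequantization step described above, written in NumPy. The function names and the exact scheme are illustrative assumptions, not code from T-MAC:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Illustrative symmetric 4-bit quantization: map float weights
    to integers in [-8, 7] plus a single float scale per tensor."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """The conventional extra step: convert low-bit integers back to
    floats before every matrix multiply. This is the overhead T-MAC
    is designed to avoid."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_4bit(w)
print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())
```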

The Fix: T-MAC Technology

Instead of this slow dequantize-then-multiply routine, T-MAC uses a lookup table (LUT) method to do the math, so the model never needs to convert its low-bit weights back to high precision. This saves time and power, letting the model run faster and use less energy. With T-MAC, phones and small devices can run AI models at speeds that can beat even dedicated hardware like NPUs.

How T-MAC Works: The Innovation Behind the Speed

At the core of T-MAC’s innovation is the use of a lookup table (LUT)-based computing paradigm, which replaces the traditional multiply-accumulate (MAC) approach. This paradigm shift allows T-MAC to perform low-bit calculations directly using lookup tables, thus eliminating the need for the inefficient dequantization operations required by other systems. This reduction in the number of multiplication and addition operations is key to the speed improvements seen with T-MAC.
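
To make the idea tangible, here is a minimal NumPy sketch of LUT-based multiplication for 1-bit weights (each weight is +1 or -1). It follows the principle described above, precomputing a small table per activation group and replacing multiplies with lookups, but it is not T-MAC's actual kernel; the function name and the group size of 4 are illustrative assumptions:

```python
import numpy as np
from itertools import product

def lut_matvec_1bit(w_bits: np.ndarray, x: np.ndarray, g: int = 4) -> np.ndarray:
    """LUT-based matrix-vector multiply for 1-bit weights.
    w_bits: (rows, cols) array of 0/1 (0 means -1), x: (cols,) floats."""
    rows, cols = w_bits.shape
    assert cols % g == 0
    y = np.zeros(rows, dtype=np.float32)
    for start in range(0, cols, g):
        xg = x[start:start + g]
        # Precompute all 2**g possible partial dot products for this
        # activation group -- built once, then reused by every output row.
        table = np.array(
            [sum((1.0 if b else -1.0) * xg[i] for i, b in enumerate(bits))
             for bits in product((0, 1), repeat=g)],
            dtype=np.float32)
        # Pack each row's g weight bits into a table index and look up:
        # no per-weight multiplication is performed anymore.
        idx = np.zeros(rows, dtype=np.int64)
        for i in range(g):
            idx = (idx << 1) | w_bits[:, start + i]
        y += table[idx]
    return y

# Check against a direct +1/-1 multiply.
w_bits = np.random.randint(0, 2, size=(3, 8))
x = np.random.randn(8).astype(np.float32)
print(np.allclose(lut_matvec_1bit(w_bits, x), (w_bits * 2 - 1) @ x, atol=1e-5))
```

With a group size of 4, each 16-entry table serves every row of the weight matrix, so the per-weight work drops from a multiply-accumulate to a single table lookup and an add.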


For instance, when running large models on a Surface AI PC equipped with the latest Qualcomm Snapdragon X Elite chipset, T-MAC demonstrated impressive results: the 3B BitNet-b1.58 model could generate up to 48 tokens per second, the 2-bit 7B llama model could generate up to 30 tokens per second, and the 4-bit 7B llama model could generate up to 20 tokens per second. These figures not only highlight the efficiency of T-MAC but also show that it can outperform NPUs in certain scenarios. For example, when deploying the llama-2-7B-4bit model, the NPU could generate 10.4 tokens per second, while the CPU, using T-MAC, could reach 12.6 tokens per second with just two cores, and up to 22 tokens per second with additional cores.



Technical Details: How T-MAC Optimizes Performance

The efficiency of T-MAC lies in its ability to handle low-bit matrix multiplication calculations from a bit-centric perspective. Unlike traditional methods that require individual customization for different data types, T-MAC designs an optimal data structure for a single bit and then scales it up to higher bit levels through stacking. This approach simplifies the computation process and reduces the complexity associated with mixed-precision operations.
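
A short sketch of that bit-centric decomposition, assuming unsigned integer weights for simplicity (real quantization schemes also carry signs, scales, and zero-points): every bit plane of an n-bit weight matrix is itself a 1-bit matrix, so the n-bit product is a power-of-two weighted sum of 1-bit products, and one optimized 1-bit kernel covers every bit width:

```python
import numpy as np

def bit_serial_matvec(w_int: np.ndarray, x: np.ndarray, bits: int) -> np.ndarray:
    """Split an n-bit unsigned weight matrix into n one-bit planes and
    sum the 1-bit matmuls, weighted by powers of two."""
    y = np.zeros(w_int.shape[0], dtype=np.float64)
    for i in range(bits):
        plane = (w_int >> i) & 1      # the i-th one-bit weight matrix
        y += (2 ** i) * (plane @ x)   # the same 1-bit kernel, reused
    return y

w = np.random.randint(0, 16, size=(3, 8))  # 4-bit unsigned weights
x = np.random.randn(8)
print(np.allclose(bit_serial_matvec(w, x, bits=4), w @ x))  # True
```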

Moreover, T-MAC leverages highly efficient table lookup instructions (TBL/PSHUF) on the CPU, which significantly improves random memory access performance. The technology also optimizes the data flow and memory usage by storing the lookup tables in fast on-chip memory, rearranging weights for better cache hit rates, and designing an optimal matrix tiling method to maximize data reuse.
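
Roughly speaking, those instructions let the CPU resolve a whole register's worth of small-table lookups at once. The NumPy snippet below only emulates their semantics in software, as an illustration of why T-MAC keeps its tables small enough (16 entries) to live in a vector register:

```python
import numpy as np

def simd_style_lookup(table16: np.ndarray, nibbles: np.ndarray) -> np.ndarray:
    """Software emulation of TBL/PSHUF-style lookups: each 4-bit index
    selects one byte from a 16-entry table. On real hardware a single
    instruction performs many such lookups in parallel."""
    assert table16.shape == (16,)
    return table16[nibbles & 0xF]

table = np.arange(100, 116, dtype=np.uint8)    # an illustrative 16-entry LUT
idx = np.array([3, 0, 15, 7], dtype=np.uint8)  # packed 4-bit indices
print(simd_style_lookup(table, idx))           # -> [103 100 115 107]
```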

Performance Benchmarks: T-MAC vs. Traditional Methods

When we compare T-MAC against the conventional approach (as implemented in llama.cpp), the speed gains are clear. Depending on the device, T-MAC runs low-bit matrix multiplication, from 4-bit down to 1-bit, up to 11 times faster than llama.cpp. T-MAC also scales well as the bit count drops: it keeps getting faster as the model uses fewer bits per weight, which the conventional method cannot do.

Even on a low-end device like the Raspberry Pi 5, T-MAC can generate 11 tokens per second with the 3B BitNet-b1.58 model. This shows that T-MAC works well on both high-end PCs and low-end devices, making it a flexible and powerful tool for AI.


Power Efficiency: Reducing Energy Consumption with T-MAC

In addition to its speed advantages, T-MAC offers significant power efficiency benefits. It needs only 1/4 to 1/6 as many CPU cores as traditional methods to reach the same token generation rate, which directly lowers energy consumption. This efficiency is particularly important for mobile and edge devices, where battery life and power budgets are critical considerations.

Conclusion: The Future of AI on Edge Devices

T-MAC is a big step forward for AI on small devices. Its lookup table method makes large AI models run faster and use less power, opening up new ways to use AI on phones, small PCs, and other devices that lack the space or power budget for big GPUs or NPUs.

Microsoft Research Asia has made T-MAC open-source, so anyone can try it out and use it in their own AI work. As AI keeps growing, tools like T-MAC will help bring AI to more places, making it faster and easier to use on all kinds of devices. The future of AI on phones looks bright, with faster speeds and smarter use of power, all thanks to new tech like T-MAC.

