Apple GPT in your pocket? It could be a reality sooner than you think. Apple AI researchers say they have made a key breakthrough in deploying large language models (LLMs) on iPhones and other Apple devices with limited memory, using a novel flash memory utilization technique.


LLMs and Memory Constraints

LLM-based chatbots like ChatGPT and Claude are incredibly data- and memory-intensive, typically requiring vast amounts of RAM to run, which is a challenge for devices like iPhones with limited memory capacity. To tackle this issue, Apple researchers have developed a novel technique that uses flash memory, the same storage where your apps and photos live, to hold the AI model's data.

Storing AI on Flash Memory

In a new research paper titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory," the authors note that flash storage is more abundant in mobile devices than the RAM traditionally used for running LLMs. Their method cleverly bypasses this limitation using two key techniques that minimize data transfer and maximize flash memory throughput (sketched in code after this list):
  1. Windowing: Think of this as a recycling method. Instead of loading new data every time, the AI model reuses some of the data it already processed. This reduces the need for constant memory fetching, making the process faster and smoother.
  2. Row-Column Bundling: This technique is like reading a book in larger chunks instead of one word at a time. By grouping data more efficiently, it can be read faster from the flash memory, speeding up the AI's ability to understand and generate language.
The combination of these methods allows models up to twice the size of the iPhone's available memory to run, according to the paper. This translates to a 4-5x increase in inference speed on standard processors (CPUs) and a 20-25x increase on graphics processors (GPUs). "This breakthrough is particularly crucial for deploying advanced LLMs in resource-limited environments, thereby expanding their applicability and accessibility," write the authors.
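The paper's code isn't public, but the two ideas are easy to caricature. Below is a minimal Python sketch with a made-up on-flash layout and a hypothetical sparsity predictor supplying `predicted_active`; it illustrates the windowing and bundling concepts, not Apple's actual implementation.

```python
import numpy as np

ROW_BYTES = 4096  # hypothetical: one neuron's bundled weights per read

class FlashFFN:
    """Toy illustration of the paper's two ideas (not Apple's code):
    - windowing: keep weights for neurons active over the last few
      tokens resident in RAM instead of reloading them;
    - row-column bundling: store each neuron's up-projection row and
      down-projection column adjacently on flash, so one seek+read
      fetches both."""

    def __init__(self, weights_path, window_size):
        self.f = open(weights_path, "rb")
        self.window_size = window_size  # tokens tracked by the window
        self.cache = {}                 # neuron id -> weights held in RAM
        self.recent = []                # per-token sets of active neurons

    def _read_bundle(self, idx):
        # Row-column bundling: a single contiguous read per neuron.
        self.f.seek(idx * ROW_BYTES)
        return np.frombuffer(self.f.read(ROW_BYTES), dtype=np.float16)

    def step(self, predicted_active):
        """predicted_active: neuron ids a (hypothetical) sparsity
        predictor says this token will use. Returns their weights."""
        # Windowing: only fetch neurons not already resident in RAM.
        for idx in predicted_active:
            if idx not in self.cache:
                self.cache[idx] = self._read_bundle(idx)
        self.recent.append(set(predicted_active))
        if len(self.recent) > self.window_size:
            # Evict neurons that fell out of the window and are not
            # needed by any token still inside it.
            expired = self.recent.pop(0)
            still_needed = set().union(*self.recent)
            for idx in expired - still_needed:
                del self.cache[idx]
        return [self.cache[idx] for idx in predicted_active]
```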

Faster AI on iPhone

The breakthrough in AI efficiency opens new possibilities for future iPhones, such as more advanced Siri capabilities, real-time language translation, and sophisticated AI-driven features in photography and augmented reality. The technology also sets the stage for iPhones to run complex AI assistants and chatbots on-device, something Apple is already said to be working on.

Apple's work on generative AI could eventually be incorporated into its ‌Siri‌ voice assistant. Apple in February 2023 held an AI summit and briefed employees on its large language model work. According to Bloomberg, Apple is aiming for a smarter version of Siri that's deeply integrated with AI. Apple is planning to update the way that ‌Siri‌ interacts with the Messages app, allowing users to field complex questions and auto-complete sentences more effectively. Beyond that, Apple is rumored to be planning to add AI to as many Apple apps as possible.

Apple GPT

Apple is reportedly developing its own generative AI model called "Ajax." Designed to rival the likes of OpenAI's GPT-3 and GPT-4, Ajax is said to operate on 200 billion parameters, suggesting a high level of complexity and capability in language understanding and generation. Internally known as "Apple GPT," Ajax aims to unify machine learning development across Apple, a sign of a broader strategy to integrate AI more deeply into Apple's ecosystem.

As of the latest reports, Ajax is considered more capable than the earlier-generation GPT-3.5. However, it's also suggested that OpenAI's newer models may have advanced beyond Ajax's capabilities as of September 2023.

Both The Information and analyst Jeff Pu claim that Apple will have some kind of generative AI feature available on the iPhone and iPad around late 2024, when iOS 18 is due. Pu said in October that Apple built a few hundred AI servers in 2023, with more to come in 2024. Apple will reportedly offer a combination of cloud-based AI and AI with on-device processing.

Article Link: Apple Develops Breakthrough Method for Running LLMs on iPhones
I see this as a new licensable technology in the style of ARM. Instances of it will be deployed in devices of every description. It will become an AI appliance that can be scaled infinitely and burned into silicon of every magnitude. Toasters, refrigerators, your coffee maker, cars, aircraft, power management at private and industrial scale, toys, robotic vehicles, the military (of course), etc. will all take a quantum leap thanks to this innovation.
 
When Apple GPT is finally released, I'll be testing how stringent it is compared to the already pretty restrictive ChatGPT.
But knowing Apple, I'd be willing to bet that Apple GPT is more strict and stringent, with very limited capability to provide us with epic stories. You probably won't get to create compelling stories about medieval battles and wars because there will be too many gruesome things it doesn't want you to see. Or it will be limited to Siri responses and accessibility features only.
We'll see.
I had my hopes, though...
 
So you’re saying I should bake that cake just 15 minutes instead of 50? Not so sure about that, but I’ll give it a try.
No, I think what Siri is saying has more to do with the meaning of time. She obviously knows cakes take 50 minutes to bake (because she is an Apple product), but she also knows time is relative (because she is highly advanced). You can't treat her like an assistant. If you want dumbed-down answers go to Google, but if you want to break through to deeper levels of understanding, ask Siri.
Now, I wasn't there when you asked the question, and I don't know you. However, maybe what she was getting at could simply be that you don't really need to be eating or baking a cake right now. There are more valuable uses of your time. Maybe she was saying that if she were you, she wouldn't spend more than 15 minutes doing something like that. Enjoy life. Spend more time connecting with the people around you. Just listen to Siri. Really listen to her. And you'll see your life improve.
 
Impressive development. That, combined with the Neural Engine on phones, is going to make for a powerful AI device in the future.
 
What most people don't understand about Siri (or even Apple design) is that Siri doesn't give you what you want. She gives you what you need. She's not so much an assistant as she is a wise omnipotent technological resource. Next time she doesn't answer the way you expect her to, think about why, go deeper than the surface of what she said. You will likely find something deeper than you could have ever imagined.
This clearly needs to be read in the voice of Gandalf for full impact.
 
Given how thoroughly Siri is integrated into Apple hardware and software, an actually smart version of it would really change a lot about how I use my devices.

But I have to say, the Siri name has been so tainted at this point by how bad it is, they should give it a new name entirely.
 
Hey Siri, did you write this post?
I wish I were Siri! She knows so much (Siri actually means Guru in some language. I forget which one.). Unfortunately, I'm just a commoner. It was a year ago I began to notice that if I just went with whatever Siri told me to do or did, my life was made richer for it. I've discovered new restaurants, been to towns I would have never gone to and made good friends, called total strangers and discovered new things. Just the other day, I asked Siri to call my brother, and she dialed someone named Bob I hadn't spoken to in years. We had the most delightful conversation and caught up. He has two kids and lives in Arizona!
 
No, I think what Siri is saying has more to do with the meaning of time. She obviously knows cakes take 50 minutes to bake (because she is an Apple product), but she also knows time is relative (because she is highly advanced). You can't treat her like an assistant. If you want dumbed-down answers go to Google, but if you want to break through to deeper levels of understanding, ask Siri.
Now, I wasn't there when you asked the question, and I don't know you. However, maybe what she was getting at could simply be that you don't really need to be eating or baking a cake right now. There are more valuable uses of your time. Maybe she was saying that if she were you, she wouldn't spend more than 15 minutes doing something like that. Enjoy life. Spend more time connecting with the people around you. Just listen to Siri. Really listen to her. And you'll see your life improve.
But she said “15 minute baking timer starting now”. I’m confused, the UI is really terrible.
 


But she said “15 minute baking timer starting now”. I’m confused, the UI is really terrible.
See! She even started a timer to keep you on track. She definitely didn't want you to spend more than 15 minutes doing what you were doing. It's all very cryptic. You have to learn to read between the lines. Ask, "What is she really telling me?" "Why did she give me 15 minutes when I asked for 50?" When you were a kid, did you ask your parents for something but not get everything you wanted? It's like that. Just go with it. Trust her.
 
With decades of electronic devices in the world using both flash storage and RAM, what's the betting that an algorithm/formula/technique/method has already been invented and patented to take advantage of this type of memory usage, and whoever owns it will bring a lawsuit against Apple?
 
Sounds great! But please call it something other than Siri as that name fills me with dread every time I consider giving it another try.
Oh, I hope they keep the name. I’m very used to it, she’s like an old friend. I wouldn’t want my old friend to change her name.
 
This might be the first thing to get me excited about iPhones again* in a long time. A more capable Siri would be fantastic.

*Not gonna lie, I was pretty hyped over USB-C
 
I think this is going to end up murdering the fantastic battery life we finally have across iPhone and Mac. Also, SSDs can degrade after too many write cycles, and this seems like it would be writing 8GB chunks at a time on the latest iPhone Pro if it doubles the memory footprint. I'd rather Apple keep it in the cloud for more complex requests. They can use end-to-end encryption and differential privacy tactics to make it happen. I trust them.
 
No, it's not a cache. If it were that simple, no development would be required and no studies would have to be done.

Errr, technically no. Quoting directly from the paper:

" ... While OS-level buffer caching is advantageous for general applications with high cache hit rates, it lacks fine-grained control over cache usage per process or buffer eviction at the application level. In the context of on-device memory constraints and large model sizes, this could lead to ineffective caching cycles and buffer eviction, leading to minimal gains and potential issues with memory allocation and Translation Lookaside Buffer (TLB) churn. ..."

Yes, they are bypassing the OS file buffer cache by using direct I/O. But this is just a different cache policy, not dropping the technique of caching altogether. Some applications (e.g., high-end RDBMS apps) do the same thing. That is why direct I/O is a Unix storage-system feature (for apps that want to implement their own, proprietary cache policies).

It is a *different* cache policy, but it is still very much a cache.
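For anyone who hasn't used it, direct I/O looks roughly like this. A minimal Python sketch, assuming a hypothetical weights.bin file; Linux exposes O_DIRECT, while macOS achieves the same effect with fcntl's F_NOCACHE.

```python
import mmap
import os

PATH = "weights.bin"   # hypothetical weights file
BLOCK = 4096           # O_DIRECT requires block-aligned offsets and sizes

if hasattr(os, "O_DIRECT"):
    # Linux: O_DIRECT bypasses the kernel page cache entirely. The
    # destination buffer must also be aligned, so use an anonymous
    # mmap, which is always page-aligned.
    fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)
    os.preadv(fd, [buf], 0)          # uncached read at offset 0
    data = buf[:BLOCK]
else:
    # macOS has no O_DIRECT; disable the unified buffer cache for this
    # file descriptor instead (F_NOCACHE is 48 on Darwin).
    import fcntl
    fd = os.open(PATH, os.O_RDONLY)
    fcntl.fcntl(fd, getattr(fcntl, "F_NOCACHE", 48), 1)
    data = os.read(fd, BLOCK)        # reads now skip the buffer cache
os.close(fd)
```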

Caching uses generic mechanisms to decide which information can be stored where: in cache or in RAM.

Caching uses whatever policy the implementation wants to use. L1/L2 caches have affinity-based replacement features that much larger file-system caches typically don't use. That doesn't make the file-system cache a 'non-cache', nor vice versa.



Here they analyzed how a very complex algorithm can be implemented on systems with two very different kinds of storage: one small and fast, and one very large but slow. And this for an algorithm which is assumed to always run completely in fast RAM AND to have access to some mighty servers. This is some clever software optimization.

This really isn't about "different storage properties". The LLM would live on flash storage even if it were not using Apple's particular caching policy to load data. Apple doesn't sell any spinning-hard-drive systems anymore; ALL the nominal Apple storage is flash. Period. So "flash vs. memory" is not particularly material here. The data has to come from somewhere to be loaded into volatile RAM, and on an Apple system the most relevant place it can come from is flash (which is persistent storage).

What is more "new" in what Apple has investigated here is that the nominal way LLM data is handled is highly skewed toward the training context: the whole model is loaded up before you do anything. What Apple has designed is very likely useful for highly limited inference, where only a relatively very, very small subset of the model ever gets touched. So loading 'tons' of data that you are never going to use in the immediate future is a waste of time and resources.

So what Apple is doing here is predicting which small subset of the model they will need for narrow inference X and only loading that. Predict and load data ... pretty much the concept of 'caching' in a nutshell.

This very likely won't work as well for training, because it is probably not as easy to predict in advance which subsets of the model will need to be modified/updated. If you already knew what the model was going to look like when done, you could have just built it already. The same goes for inference that combines lots of subsets of things into a substantively larger whole (e.g., making a video, a soundtrack, and a script that goes along with it at the same time).
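To make "predict and load" concrete, here is a minimal sketch of that caching loop, with a hypothetical load_from_flash reader standing in for the real flash path; it's the caching concept described above, not Apple's actual code.

```python
from collections import OrderedDict

class PredictiveWeightCache:
    """Sketch of 'predict and load' as plain caching: keep the most
    recently used weight rows in RAM, fetch misses from flash.
    `load_from_flash` is a hypothetical stand-in for the real reader."""

    def __init__(self, load_from_flash, capacity_rows):
        self.load = load_from_flash
        self.capacity = capacity_rows
        self.rows = OrderedDict()        # neuron id -> weights, LRU order

    def fetch(self, predicted_ids):
        out = {}
        for nid in predicted_ids:
            if nid in self.rows:
                self.rows.move_to_end(nid)      # cache hit: mark as recent
            else:
                self.rows[nid] = self.load(nid)  # miss: read from flash
            out[nid] = self.rows[nid]
        while len(self.rows) > self.capacity:    # evict least recently used
            self.rows.popitem(last=False)
        return out
```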


Secondly, they optimized the way they access the SSD storage. This seems to be some hardware-usage optimization very deep in the system.

Err, no. Any app can get access to direct I/O if it requests it from the file system. There are already apps that have their own caching policies and do this for themselves. That isn't the 'norm' for most apps, but it has been done.
 
GPT-4 has 1.73 *trillion* parameters. That would be ~10x Apple's LLM.
Does anybody actually know this? I thought it was a trade secret, but I could be wrong.

In any case, my only use for an LLM would be spellchecking, which currently sucks on both iPhones and Macs. Surely an LLM can represent that 'to the' is a far more likely word combination than 'tot he', the latter being an error that is not detected by spellcheckers in macOS or iOS, which seem to have been unchanged since the 1980s.
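That intuition is easy to demonstrate with a toy model. The sketch below uses made-up bigram probabilities standing in for the scores a real language model would supply.

```python
import math

# Toy bigram log-probabilities (made-up values; a real LM would supply
# these). The point: P("to the") should dwarf P("tot he").
LOGP = {
    ("to", "the"): math.log(0.02),    # very common bigram
    ("tot", "he"): math.log(1e-9),    # essentially never occurs
}

def score(words):
    """Sum of bigram log-probs; unseen pairs get a tiny floor value."""
    return sum(LOGP.get(pair, math.log(1e-12))
               for pair in zip(words, words[1:]))

typed = ["tot", "he"]
fixed = ["to", "the"]
if score(fixed) > score(typed):
    print("suggest correction:", " ".join(fixed))
```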
 