Microsoft just released the Phi-4 Multimodal model earlier today, and I spent the last two hours putting it through its paces. I think it's really, really good. The technical report shows strong benchmark results (top of the OpenASR leaderboard with a 6.14% word error rate, plus strong document OCR and chart understanding).
In this post I’ll walk through setup and my experience testing it.
We will cover:
Text generation
Image understanding and captioning
OCR
Audio transcription
Function calling
Language translation from audio and text
IMO, one of the best small language models (SLMs) out there.
What Makes Phi-4 Multimodal Special
Right off the bat, there are two key things to understand about this model:
Fully multimodal on the input side: The same model can take in text, image, and audio tokens, reason over them, and generate text as output.
Compact yet powerful: At just 5.6 billion parameters, the model is small enough to be optimized for deployment almost anywhere, including on-device. As the official blog post puts it, it "delivers highly efficient low latency inference, all while optimizing for on-device execution and reduced computational overhead."
Note: this model is different from the Phi-4 14B-parameter model, which is significantly larger and was released in December 2024.
Overall, the model supports a wide range of capabilities: text generation, image description, captioning, audio transcription, text translation, function calling, OCR, and more. I've tested all of these capabilities and will share the results below.
Models supporting more modalities is part of the AI agent trends I discussed at the start of the year.
Architecture Overview
Phi-4 Multimodal employs a novel "mixture-of-LoRAs" architecture that allows it to handle text, images, and audio within a single compact model. Here's how it works:
Base Language Model: At its core is a 3.8B parameter Phi-4-Mini language model with 32 Transformer layers, which remains frozen during multimodal training.
Modality-Specific Adapters: Rather than fine-tuning the base model, Phi-4 uses specialized LoRA (Low-Rank Adaptation) modules for each modality:
A 370M parameter Vision LoRA adapter for image processing
A 460M parameter Audio LoRA adapter for speech/audio processing
Efficient Design: The model uses Group Query Attention (GQA) with 24 query heads and 8 key/value heads, which reduces KV cache consumption to one-third of standard size, enabling more efficient long-context processing.
Input Processing:
Images are processed through a SigLIP-400M vision encoder with a dynamic multi-crop strategy for handling diverse resolutions
Audio inputs use 80-dimensional log-Mel filterbank features processed through a specialized audio encoder with 3 convolutional layers and 24 conformer blocks.
This architecture allows Phi-4 Multimodal to maintain strong language capabilities while adding vision and audio understanding without the typical compromises seen in other multimodal models, all within a relatively small 5.6B total parameter footprint.
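A quick back-of-the-envelope check of that one-third KV-cache claim. The sketch below is illustrative only: the head dimension and 16-bit precision are assumptions I'm making for the arithmetic, not published figures.

num_layers = 32        # transformer layers, from the description above
num_query_heads = 24
num_kv_heads = 8
head_dim = 128         # assumed for illustration, not a published figure
bytes_per_value = 2    # fp16/bf16

def kv_cache_bytes_per_token(num_heads: int) -> int:
    # Keys and values are both cached at every layer, hence the factor of 2.
    return 2 * num_layers * num_heads * head_dim * bytes_per_value

mha = kv_cache_bytes_per_token(num_query_heads)  # cache size if all 24 heads were cached
gqa = kv_cache_bytes_per_token(num_kv_heads)     # with GQA, only the 8 KV heads are cached
print(gqa / mha)  # 0.333... -> one-third of the standard size, matching the claim above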
Setup and Dependencies
Before diving into the results, it's worth noting that Phi-4 requires a specific set of dependencies:
torch==2.6.0
flash_attn==2.7.4.post1
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
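If you save the list above to a requirements.txt file, installation is a one-liner. In my experience flash_attn is the trickiest of the set: it needs a CUDA toolchain and usually wants torch installed first, so install it separately if the combined install fails.

pip install -r requirements.txt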
I have code that you can try out here.
Test Results
Using a virtual environment (Conda or venv) is recommended. I performed all testing on a local A6000 GPU (with no other optimizations), but the model should work well on most consumer GPUs. The full code used for this is here!
Text Generation
I started with a simple text-only evaluation. When asked:
Who is Victor Dibia, PhD and what is he known for?
the model confidently responded:
He is the founder and CEO of Nanosys, a company that specializes in the development of nanotechnology-based solutions for various industries, including electronics, energy, and healthcare.
Dibia earned his PhD in materials science and engineering from the University of California, Berkeley, and has held various positions in academia and industry throughout his career. He is also a member of the National Academy of Inventors and has been recognized for his contributions to the field of nanotechnology. ...
None of this is correct! It's a reminder that models this small probably should not be relied on for factual recall on their own.
The right approach here would be to provide context. So I went to my website, copied all the text (Ctrl A + Ctrl C), and added that as context in the first message. Then I asked the same question and got a much better response:
Victor Dibia, PhD, is an expert in Applied Machine Learning and Human-Computer Interaction (HCI). He is a Principal Research Software Engineer at the Human-AI eXperiences (HAX) team, Microsoft Research, where he focuses on Generative AI. He has published research at conferences such as ACL, EMNLP, AAAI, and CHI, and has received multiple best paper awards. He is also an IEEE Senior member, a Google Certified Professional (Data Engineer, Cloud Architect), and a Google Developer Expert in Machine Learning. He holds a PhD in Information Systems from City University of Hong Kong and has worked at Cloudera Fast Forward Labs, IBM Research, MIT Global Startup Labs, and as a founder/lead developer for a startup focused on West African markets. In his free time, he enjoys music, sports, and exploring art. He also writes a monthly newsletter on Applied AI and HCI.
This time, all the information is correct. The model was able to look through the messy, unorganized text from my website and come up with a sensible summary. This took about 11 seconds on my machine.
Note: This type of hallucination is not unique to Phi-4; it is common to most language models, even much larger ones.
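For reference, here is a minimal sketch of the text-only call, based on the usage pattern in the Hugging Face model card. The <|user|>/<|end|>/<|assistant|> tags and processor behavior come from that card and are worth double-checking against your installed version; the context file path is a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",  # needs flash_attn from the dependency list
).cuda()

# Grounding the question with the copied website text avoids the hallucination above.
context = open("website_text.txt").read()  # placeholder file containing the pasted site text
prompt = f"<|user|>{context}\n\nWho is Victor Dibia, PhD and what is he known for?<|end|><|assistant|>"

inputs = processor(text=prompt, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=400, do_sample=False)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)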
Image Understanding and Captioning
Next, I tested the model's image understanding capabilities using illustrations from my blog posts (shown below).
When asked
"What is shown in this image?",
Phi-4 accurately described:
The image depicts a man sitting on the floor with a large battery pack on his head, connected to a power outlet. There are various symbols and numbers floating around him, and a text that reads 'Your Surge Capacity is Depleted'.
The model processed this in about 4 seconds with remarkable accuracy, even capturing the text in the image.
For a second image shown below, the model again provided an accurate description:
The image depicts a man running on a treadmill while papers labeled 'NEW MODEL', 'UPDATE', 'SHO', 'ROATE', 'SPEM', 'VOMATE', and 'UPATE' are flying off the treadmill and onto the floor. The man appears to be in a hurry or stressed.
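The captioning call reuses the model and processor loaded in the text-generation sketch above. The <|image_1|> placeholder is the model card's convention for referencing the attached image; the file path is a placeholder.

from PIL import Image

image = Image.open("blog_illustration.png")  # placeholder path to the illustration
prompt = "<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])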
OCR (Optical Character Recognition)
For OCR testing, I used a screenshot of a smartphone specification page and asked the model to extract the text. The image is below (left: the screenshot; right: the markdown generated by the model).
When instructed to:
"Extract ALL the text (OCR) from this image and render it in a neat markdown format"
Phi-4 successfully extracted nearly all the text, including detailed specifications for the HMD Skyline phone:
This task took about 59 seconds, which is longer than other tasks but still reasonable considering the amount of text extraction involved.
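The call itself follows the same pattern as the captioning sketch above; only the prompt changes (again assuming the <|image_1|> placeholder convention from the model card).

prompt = ("<|user|><|image_1|>Extract ALL the text (OCR) from this image "
          "and render it in a neat markdown format<|end|><|assistant|>")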
Audio Transcription and Multimodal Understanding
For audio testing, I used a clip from an MKBHD video (below) reviewing the HMD Skyline. First, I extracted the MP3 audio from the video and asked the model to transcribe it:
The transcription was really good (a section is below). You can listen to the video and see how close it is.
PSA: This phone that you've probably never heard of has not just one but two features about it that I really like that should be on way more phones. So this thing has flown under a lot of people's radars. This is called the HMD Skyline, which might sound totally foreign, but HMD are the ones that own the Nokia brand. And so now if you squint, you can see it kind of resembles the old Nokia Lumia smartphones from a couple years ago. Flat edges, rounded corners. Shout out to those of you who remember the legendary Lumia 1520. But this 2024/2025 phone has, like I said, two things about it that are awesome that really could be adopted by basically every other smartphone company if they really wanted to. The thing is, they don't really want to. So the first one is pretty obvious: repairability. You see this screw down here at the bottom corner of this phone? Through this one screw, you can get into this phone and do actual repairs yourself. You can replace the battery or the charge port or the back cover glass or even the front display glass. So the front display glass replacement, for example, is $95, and they have a partnership with ifixit, so it's on ifixit's website, plus five extra bucks for the whole toolkit if you don't already have one. And you do actually need the toolkit to get the screw bit to open the phone. So definitely get it the first time you're doing a repair. And I have it right here. I am obviously a complete amateur, or at least I'm not a professional repair technician, but it took me literally a minute to get into the back of the phone and then go ahead and replace the front glass of this phone myself, which is great. Some of you are too young to remember the days when you would drop your phone and the back would fly off and the battery would end up across the room. So for those of you, yeah, yeah, phones used to just have removable backs. You could just pop off the back, take the battery out, replace it, and then just go on with your day. But modern smartphones have not had this for years. They are much more tightly integrated now. And I've gone back and forth on thoughts on kind of both sides of this ...
Transcription takes more time (a couple of minutes) and may not cover the entire clip; the model's max_new_tokens parameter can be increased to help here.
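A minimal transcription sketch, again reusing the model and processor from above. The (array, sampling_rate) tuple and the <|audio_1|> placeholder follow the model card's audio example; the file path is a placeholder.

import soundfile as sf

audio, sample_rate = sf.read("hmd_skyline_review.mp3")  # placeholder path to the extracted audio
prompt = "<|user|><|audio_1|>Transcribe the audio to text.<|end|><|assistant|>"

inputs = processor(text=prompt, audios=[(audio, sample_rate)], return_tensors="pt").to("cuda")
# Increase max_new_tokens if the transcript gets cut off, as noted above.
output_ids = model.generate(**inputs, max_new_tokens=1200)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])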
For a more complex test, I combined this audio processing with image analysis in a multimodal evaluation:
# Example: Combined Multimodal Understanding (Audio + Vision + Text)
# Note: the file names below are placeholders; substitute your own audio clip and screenshot.
import soundfile as sf
from PIL import Image

audio_data, sample_rate = sf.read("hmd_skyline_review.mp3")  # audio extracted from the review video
phone_image = Image.open("hmd_skyline_specs.png")            # screenshot of the spec page

system_prompt = "You are a helpful tech assistant who can analyze product reviews across multiple types of content."
content_list = [
{
'type': 'audio',
'content': (audio_data, sample_rate),
'role': 'user'
},
{
'type': 'image',
'content': phone_image,
'role': 'user'
},
{
'type': 'text',
'content': 'Compare what was mentioned in the audio review with the specifications shown in the image. Then answer these questions: 1) Does the reviewer mention all the key specs? 2) Are there any discrepancies between what the reviewer claims and the actual specs? 3) What feature does the reviewer seem most impressed by?',
'role': 'user'
}
]
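# (content_list and system_prompt are then passed to the generation helper in the linked repo,
#  which assembles these entries into a single multimodal Phi-4 prompt)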
The model successfully analyzed both inputs and provided thoughtful answers to the questions:
1) The reviewer mentions most of the key specs such as the display, camera, and battery, but does not mention the exact resolution or the specific processor.
2) There are no apparent discrepancies between what the reviewer claims and the actual specs. The reviewer seems to have a positive experience with the phone's features.
3) The reviewer seems most impressed by the phone's repairability and the Qi 2.0 wireless charging feature.
What's remarkable here is how the model can process information across different modalities, connecting what it hears in the audio with what it sees in the specifications image, then providing a coherent analysis that draws from both sources. This demonstrates Phi-4 Multimodal's cross-modal reasoning capabilities, despite its relatively compact 5.6B parameter footprint.
Function Calling
Finally, I tested function calling capabilities by defining a simple weather API tool. When asked
"What is the weather like in Paris today?",
the model correctly generated a function call:
<|tool_call|>[{
"name": "get_weather_updates",
"arguments": {
"city": "Paris"
}
}]<|/tool_call|>
This task was quite simple but demonstrates the model is indeed capable of function calling.
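For completeness, here is a sketch of the tool-definition side. The Phi-4 model cards describe passing a JSON list of tools inside the system message between <|tool|> and <|/tool|> tags; those tags, the tool schema, and the function itself are taken from that convention and should be verified against the model card for your release. It reuses the model and processor loaded earlier.

import json

tools = [{
    "name": "get_weather_updates",
    "description": "Fetch the current weather for a city.",
    "parameters": {"city": {"type": "str", "description": "Name of the city."}},
}]

prompt = (
    f"<|system|>You are a helpful assistant with some tools."
    f"<|tool|>{json.dumps(tools)}<|/tool|><|end|>"
    "<|user|>What is the weather like in Paris today?<|end|><|assistant|>"
)

inputs = processor(text=prompt, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
# Expected shape of output: <|tool_call|>[{"name": "get_weather_updates", ...}]<|/tool_call|>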
Conclusion - A Universal Low-Cost Tokenizer
Based on my testing, Phi-4 Multimodal demonstrates remarkable capabilities for a model of its size.
Phi-4 multimodal as a universal, low-cost tokenizer
Most generative AI models today work best when processing natural language text. However, a lot of data is locked away in other formats - PPT, video, images, audio, etc. In practice, most AI or agentic apps perform best when they have access to the right data in text form.
In my opinion, a model like Phi-4 could be heavily optimized to become a "universal tokenizer" - imagine an optimized endpoint where any data source can be passed in and text comes back out. This universal tokenization approach could be especially valuable for:
Making unstructured data accessible for analysis
Enabling more sophisticated AI agents that can work with diverse data sources
Reducing the complexity of multimodal AI systems by standardizing inputs
Notes: This post mostly focuses on Phi-4 Multimodal, and while it works great, there are other cross-modal models, e.g., vision-language models (LLaVA, Qwen-VL, Molmo, etc.) and audio models (Qwen2-Audio), that are also worth exploring. See the Phi-4 Multimodal technical report for more.
And of course, try out the model on your specific use case to verify performance. Chances are that you will need to finetune to get the most out of it.
Next steps: What I have not talked about so far is the Phi-4 Mini model, which was also released and is about 3.8B parameters. Jazzed to try it out and run some fine-tuning experiments.