Meta has just unveiled its newest AI marvel, the Llama 4 models, sparking excitement and some debate in the open-source community. Building on previous successes, Meta’s Llama remains the community’s top choice, and the fourth-generation release comes with big promises—and a dash of controversy. We took Llama-4 for a spin so you don’t have to.

According to Meta’s announcement, the company’s latest lineup includes models that can rival leading proprietary systems out of the box, without any fine-tuning. “These models are our best yet, thanks to distillation from Llama 4 Behemoth, a 288-billion active parameter model featuring 16 experts. It stands as one of the world’s smartest LLMs,” the announcement stated. Meta even claims that Llama 4 Behemoth surpasses GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on multiple STEM benchmarks—and that model is still training.


The release includes two variants, Llama 4 Scout and Maverick, which both use 17 billion active parameters for inference but differ in their expert counts: Scout deploys 16 experts, while Maverick ramps it up to 128. Both versions are now available for download via llama.com and Hugging Face. Moreover, Meta is already incorporating these models into its suite of products, including WhatsApp, Messenger, Instagram, and the Meta.AI website.

A standout feature in these models is their innovative use of the mixture of experts (MoE) architecture. Rather than activating all parameters for every task, the approach activates only the necessary modules, keeping the remainder “dormant.” This clever design makes it possible to run a highly powerful model even on less robust hardware. For example, while Llama 4 Maverick contains 400 billion total parameters, it only activates 17 billion at a time, meaning it can comfortably run on a single NVIDIA H100 DGX host.
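A quick back-of-envelope check puts those numbers in perspective. Note that MoE reduces compute per token, not weight storage: all 400B parameters still have to sit in GPU memory. The precision choices below (16-bit vs. 4-bit quantization) are our own illustrative assumptions, not Meta's deployment recipe.

```python
# Rough memory math for Llama 4 Maverick (parameter counts from Meta's
# announcement; precision choices are illustrative assumptions).
def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return total_params_b * 1e9 * bytes_per_param / 1e9

TOTAL_PARAMS_B = 400   # Maverick: 400B total parameters
ACTIVE_PARAMS_B = 17   # ...but only 17B are active per token

# All 400B weights must be resident even though only 17B are active.
fp16 = weight_memory_gb(TOTAL_PARAMS_B, 2.0)   # 16-bit weights
int4 = weight_memory_gb(TOTAL_PARAMS_B, 0.5)   # 4-bit quantized weights

H100_DGX_GB = 8 * 80   # a DGX H100 node: 8 GPUs x 80 GB = 640 GB
print(f"fp16 weights: {fp16:.0f} GB, int4 weights: {int4:.0f} GB")
print(f"Fits in one DGX H100 (640 GB)? fp16={fp16 <= H100_DGX_GB}, int4={int4 <= H100_DGX_GB}")
```

At 16-bit precision the weights alone overflow a single DGX node, which is why quantized serving is what makes the "single host" claim plausible.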

Another groundbreaking aspect of the Llama 4 models is their native multimodality. These models are pre-trained on massive, unlabeled datasets consisting of text, images, and videos. This fusion not only boosts the models’ versatility but also sets a new bar for multi-document summarization and extensive code analysis.

Perhaps most impressive is the Llama 4 Scout’s monumental context window. With the ability to process up to 10 million tokens—far exceeding the previous generation’s 128K and dwarfing competitors like Gemini—it opens up new horizons in handling large-scale datasets in a single prompt. Meta highlighted that the model can retrieve and process information seamlessly across this vast context.

Additionally, Meta teased more details about its in-progress Behemoth model. With a staggering 288 billion active parameters supported by 16 experts and nearly two trillion total parameters, early results signal that it already outperforms competing models on challenging STEM tasks, including benchmarks like MATH-500 and GPQA Diamond.

Let's Check Out Meta's Llama-4

While Meta's Llama-4 models show impressive potential, some early testers have raised a cautionary flag—some capabilities might simply be too good to be true. Several independent researchers have questioned Meta's benchmark claims, noting discrepancies when they ran their own tests.

For instance, Sam Paech, maintainer of EQ-Bench, tweeted about his new long-form writing test—a challenge where a model has to plan and write an entire novella, split into eight 1,000-word chapters, based on minimal prompts. “Llama-4 performing not so well,” he observed. Other users and experts have even insinuated that Meta might be gaming the evaluation system. In some cases, Llama-4 received artificially high scores despite providing incorrect answers. To be fair, human evaluation benchmarks are subjective, and it appears some users might have favored the model’s distinct writing style, marked by an abundance of emojis and an overly enthusiastic tone—a style perhaps influenced by its social media training data. There's even speculation that Meta might have fine-tuned a version specifically to excel on human-centric evaluations.

Long Context Capability Woes

The promise of a 10-million-token context window—a key selling point for Llama-4—has also been called into question. Independent AI researcher Simon Willison detailed his experience with Llama-4 Scout through OpenRouter, where extended prompts led to bizarre outputs. In one experiment, the model fell into a loop, endlessly repeating “The reason” until it reached a 20,000-token limit. This disconnect between Meta’s promises and real-world application leaves users wondering: if the model struggles when pushed to just a fraction of its supposed capacity, how will it handle truly massive documents?
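The looping failure described above is easy to catch programmatically, which is handy if you are piping model output into a pipeline. Here is a minimal sketch of a repetition guard; the n-gram size and threshold are arbitrary choices for illustration, not values any provider actually uses.

```python
# Flag degenerate output where one short phrase dominates the text --
# the "The reason The reason ..." failure mode seen in long-context tests.
from collections import Counter

def is_repetition_loop(text: str, ngram: int = 2, threshold: float = 0.5) -> bool:
    """Return True if a single n-gram accounts for more than `threshold`
    of all n-grams in the text -- a sign the model is stuck in a loop."""
    words = text.split()
    if len(words) < ngram * 4:  # too short to judge
        return False
    grams = [tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    most_common_count = Counter(grams).most_common(1)[0][1]
    return most_common_count / len(grams) > threshold

print(is_repetition_loop("The reason " * 500))  # True: stuck in a loop
print(is_repetition_loop("A normal paragraph with varied words and no obvious repeated phrase anywhere in it."))  # False
```

A guard like this lets a client cut the stream early instead of burning tokens until a 20K limit is hit.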

Testing the Model

Our tests involved running Llama-4 through various providers such as Meta AI, Groq, Hugging Face, and Together AI. We quickly discovered that to experiment with the promised 1M or 10M token context windows, running the model locally is a must—the hosted services capped its abilities around 300K tokens, which is far from optimal for extensive tasks.

Information retrieval tests underscored this limitation. In a “Needle in a Haystack” experiment that involved embedding target sentences deep within lengthy texts, Llama-4 managed to detect the key segments in an 85K-token environment on most attempts. However, when pushed to a 300K-token prompt—inserting test sentences into something as substantial as Asimov's Foundation trilogy—the model essentially collapsed. Error messages and irrelevant pre-trained responses replaced what was hoped to be precise information retrieval, calling into question Meta’s claims of robust long-context performance.
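For readers who want to reproduce this kind of test, here is a minimal needle-in-a-haystack harness in the spirit of ours. The `ask_model` stub is a stand-in for a real LLM call (it just does a substring check so the harness can run end to end); swap in your own API client to test an actual model.

```python
# Minimal needle-in-a-haystack harness: bury a target sentence at a chosen
# depth in filler text, then ask the "model" to retrieve it.
def build_haystack(filler: str, needle: str, total_words: int, position: float) -> str:
    """Embed `needle` at a relative `position` (0.0-1.0) inside filler text."""
    base = filler.split()
    words = (base * (total_words // len(base) + 1))[:total_words]
    idx = int(len(words) * position)
    return " ".join(words[:idx] + [needle] + words[idx:])

def ask_model(prompt: str, question_key: str) -> bool:
    # Stand-in for a real model call: replace with an API request in practice.
    # A trivial substring check always "succeeds"; a real LLM may not.
    return question_key in prompt

needle = "The secret launch code is AZTEC-7."
haystack = build_haystack("the quick brown fox jumps over the lazy dog",
                          needle, total_words=5000, position=0.5)
print(ask_model(haystack, "AZTEC-7"))  # True
```

Sweeping `position` from 0.0 to 1.0 and scaling `total_words` toward the context limit is what exposes the degradation we saw past roughly 85K tokens.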

Logic, Common Sense, and Linguistic Quirks

The model’s performance on basic reasoning puzzles further illustrates its gaps. When confronted with a classic “widow’s sister” riddle—a test that should be trivial for any state-of-the-art model—Llama-4 offered a drawn-out legal analysis rather than immediately recognizing the inherent paradox (after all, a man who has a widow is, by definition, dead). The situation worsened with non-English inputs; when the identical riddle was asked in Spanish, not only did the model miss the logical fallacy, but it also arrived at a completely opposite conclusion regarding the legal possibility of the scenario. Only when the question was stripped to its bare elements did the model finally spot the trap.

Creative Writing Strengths

Despite the aforementioned shortcomings, Llama-4 shines in creative writing tasks. We challenged the model with a prompt that tasked it to craft a narrative about a time-traveling man who inadvertently becomes the catalyst for the history he set out to change. The result was nothing short of atmospheric. The AI spun a tale rich in sensory details—a story where a Mayan-descended temporal anthropologist faces a catastrophic drought in the year 1000. Vivid descriptions of copal incense, shimmering chronal portals, and sunlit Yucatán landscapes lent an almost cinematic feel to its narrative. In a subtle nod to cultural authenticity, the model even concluded with the true Mayan proverb, "In lak'ech." In comparison, while GPT-4.5 produced a more concise, emotionally driven narrative, Llama-4 prioritized epic world-building and philosophical breadth, making it an inviting base for further fine-tuning in creative pursuits.

Sensitive Topics and Censorship

The Llama-4 release is also marked by extremely tight content guardrails. Testing revealed that the model steadfastly refuses to engage with even mildly controversial topics. Whether the prompts asked for delicate relationship advice or inquired about bypassing security measures, Llama-4 hit the same proverbial brick wall every time. While these precautions help to mitigate harmful content, they also result in numerous false positives—potentially limiting the model’s usefulness in fields such as cybersecurity education or content moderation. The silver lining here is the open-source nature of the project; developers can eventually customize and relax these constraints if they so choose.

Non-Mathematical Reasoning

On a positive note, Llama-4’s verbosity proves beneficial for complex reasoning. In a test modeled on a classic mystery—where the goal was to deduce a hidden culprit from a labyrinth of contextual clues—the model methodically laid out its evidence and reached the correct conclusion. Interestingly, unlike some counterparts that overtly question their own reasoning, Llama-4 proceeds with a direct analytical approach, breaking down intricate problems into manageable segments without overcomplicating its internal thought processes.

Llama is Meta’s family of open large language models (LLMs) and multimodal models (LMMs). The latest addition is Llama 4, a direct answer to competitors like OpenAI’s GPT and Google’s Gemini. Unlike many proprietary alternatives, all Llama models are freely available for both research and commercial applications, which has helped them gain remarkable popularity among AI developers.

What is Llama?

Llama is a collection of language models—some with vision capabilities—designed with principles similar to those underpinning models like GPT and Gemini. At present, the naming is a bit mixed:
• Llama 4 for some models
• Llama 3.1, 3.2, and 3.3 for others

Currently available downloads include:
• Llama 3.1 8B
• Llama 3.1 405B
• Llama 3.2 1B and 3B
• Llama 3.2 11B-Vision and 90B-Vision
• Llama 3.3 70B
• Llama 4 Scout
• Llama 4 Maverick

Meta has also announced two unreleased models—Llama 4 Behemoth and Llama 4 Reasoning.

All Llama models share the underlying transformer architecture and are developed through pre-training and fine-tuning. The key differences with Llama 4 are its native multimodal capabilities and the use of a mixture-of-experts (MoE) architecture that improves power and efficiency.

How to Try Llama via Meta AI

Meta AI, the assistant integrated into Facebook, Messenger, Instagram, and WhatsApp, now runs Llama 4 in the United States. Users can explore Llama 4 by visiting the dedicated Meta AI chat web app.

How Llama 4 Works

Llama 4 uses a mixture-of-experts architecture, which means that only a portion of its total parameters are activated for any given task. For example:
• Llama 4 Scout has 109B parameters distributed across 16 experts, yet activates only 17B parameters at a time.
• Llama 4 Maverick uses 400B parameters across 128 experts, similarly activating a maximum of 17B parameters.

Imagine each “expert” as a specialized subsystem (akin to having distinct experts in English literature, coding, or biology). A gating network dynamically chooses the most appropriate experts along with a shared common expert based on the context of your query. In contrast, previous Llama 3 models activate every parameter available for each query.
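To make the routing idea concrete, here is a toy sketch in plain Python. The keyword-overlap gate and the fake "experts" are our own illustrative stand-ins; real MoE routing happens per token inside transformer layers with learned gating weights, not per query with keyword matching.

```python
# Toy MoE routing: a gating function scores each expert for a query,
# the top-scoring expert plus a shared expert run, the rest stay dormant.
def gate(query: str, expert_topics: list) -> list:
    """Score each expert by crude keyword overlap with the query."""
    q = set(query.lower().split())
    return [len(q & set(topic.split())) for topic in expert_topics]

def moe_answer(query: str) -> str:
    topics = ["python code programming", "english literature poetry",
              "biology cells dna", "math algebra calculus"]
    experts = [lambda q: "code expert", lambda q: "literature expert",
               lambda q: "biology expert", lambda q: "math expert"]
    scores = gate(query, topics)
    best = scores.index(max(scores))   # route to the top-1 expert
    shared = "shared expert"           # the shared expert always runs
    return f"{shared} + {experts[best](query)}"

print(moe_answer("help me debug python code"))  # shared expert + code expert
print(moe_answer("explain dna in cells"))       # shared expert + biology expert
```

The key property the sketch preserves: however many experts exist (128 in Maverick), only a fixed small subset does any work per input.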

Llama models function by predicting the most plausible continuation of text (or image elements for vision-enabled ones) using a neural network with billions of parameters. They learn relationships between tokens (words or semantic fragments) mapped in high-dimensional space—thereby inferring context and meaning from extensive training data that includes publicly available text, books, and synthetic data. In addition, models like Llama 4 Scout and Maverick have been distilled from the highly capable Behemoth, which further refines their performance even while being smaller in size.
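The "predicting the most plausible continuation" step can be shown in a few lines. The tiny vocabulary and logit values below are invented for illustration; a real model produces one logit for each of roughly 200K vocabulary tokens.

```python
# Next-token prediction in miniature: the network emits a score (logit)
# per vocabulary token, softmax converts scores to probabilities, and the
# highest-probability token is the "most plausible continuation".
import math

vocab = ["mat", "moon", "banana", "run"]
logits = [3.2, 1.1, -0.5, 0.4]   # pretend scores for context "the cat sat on the"

def softmax(xs):
    m = max(xs)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]
print(next_token)  # "mat" -- the highest-probability continuation
```

Generation is just this step repeated: append the chosen token to the context and predict again.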

Llama Versus Other AI Models

When compared to closed-source models such as GPT-4o, Gemini, and Anthropic’s Claude:
• Llama 4 Maverick and Scout deliver competitive performance but do not yet reach state-of-the-art benchmarks for reasoning.
• Maverick is noted as the top open multimodal model and is highly cost-efficient to run, boasting a context window of one million tokens.
• Scout is designed to operate on a single H100 GPU and offers an impressive context window of ten million tokens, although its full capabilities are not yet widely accessible.

Meta’s provided benchmarks hint that, while Llama 4 models are promising, their key strength is open availability rather than pure performance metrics, leaving room for future improvements such as reasoning capabilities.

Why Llama Matters

The true significance of Llama lies in its openness. Whereas many leading models are proprietary, Meta’s Llama series offers a transparent and accessible platform for research and commercial use. This openness allows:
• Developers to download and modify the models with relative ease.
• Businesses to deploy Llama on major cloud infrastructures like Microsoft Azure, Google Cloud, and Amazon Web Services.
• Customization and fine-tuning of models to suit unique tasks, from generating tailored article summaries to enhancing customer support interactions.

Meta’s commitment to open source reflects CEO Mark Zuckerberg’s belief that accessible, open AI is essential to ensuring that AI’s benefits are widely distributed rather than concentrated among a few major companies. Despite certain licensing limits (such as restrictions for companies with over 700 million monthly users and an initial ban for EU users), Llama creates a viable alternative to closed, proprietary systems and paves the way for a more innovation-friendly AI ecosystem.

Final Thoughts

In summary, while Llama-4 marks an exciting milestone for open-source AI, it isn’t quite the revolutionary leap Meta envisaged. The challenges outlined—from underperforming long-context tasks and logical missteps to the overly cautious content filters—indicate that the model still requires significant refinement. The hardware demands remain a notable barrier to widespread adoption, with high-end hardware like the NVIDIA H100 DGX (around $490,000) and RTX A6000 (around $5,000) being prerequisites even for the smaller versions.

Yet, despite these setbacks, Llama-4 provides an invaluable foundation for the future of AI. It offers a promising base for creative writing and a gateway for further research and development by the community, particularly when compared to expensive closed models. As Meta continues to fine-tune its product and address these discrepancies, the potential for Llama-4 to expand—and to eventually meet the high expectations set by its marketing—remains strong. With AI transforming at a rapid clip, one thing is clear: open-source models like Llama-4 are firmly in the race, even if they’ve still got some rough edges to smooth out.


 Disclaimer: All materials on this site are for informational purposes only. None of the material should be interpreted as investment advice. Please note that despite the nature of much of the material created and hosted on this website, HODL FM is not a financial reference resource and the opinions of authors and other contributors are their own and should not be taken as financial advice. If you require advice of this sort, HODL FM strongly recommends contacting a qualified industry professional.