Is Llama 2 open source? No – and perhaps we need a new definition of open…

Yesterday Meta released, with great fanfare, their new large language model Llama 2. Trained on over 2 trillion tokens with double the context length of Llama 1, it shows impressive performance on Meta’s own benchmarks. It’s also great to see they have worked on safety, environmental impact and other concerns. We can only applaud the work (and a good few tens of millions of dollars) that has been invested, and it’s going to be exciting to see what search and other applications people build using this open model.

Yes, it’s ‘open’ – you can download it and use it today in your own applications, or rent it via services like HuggingFace, AWS or Microsoft Azure (Microsoft has also announced the new model as part of a partnership with Meta). There’s some great documentation, examples and a research paper. As you might expect this has blown up on social media, with thousands of people thanking Meta for bringing out this wonderful new resource. Except there’s one problem.

Leading figures at Meta, as well as the company’s own press releases and website, have announced that Llama 2 is open source, a claim repeated widely by others, many of whom should know better. It isn’t, by any reasonable definition.

How do we know Llama 2 isn’t open source?

You make something open source by granting rights under an approved license. Meta’s license looks pretty open at first glance, with phrases like “You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.” So far so good: you can do what you like – but watch out for the restrictions, carefully designed to protect Meta’s share of the market:

2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

Sorry Google, or anyone else running a particularly large Internet service with lots of users (someone pointed out that even Snapchat falls under this restriction) – you can’t use Meta’s model. Is this concerning for anyone else? Probably not, but it’s still anti-competitive.
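To make the condition concrete, the clause boils down to a simple threshold check. Here’s a hypothetical sketch: the 700 million figure comes from the license text above, but the function and variable names are invented for illustration.

```python
# Hypothetical illustration of the Llama 2 "Additional Commercial Terms" clause.
# The threshold is taken from the license text; everything else is invented.
MAU_THRESHOLD = 700_000_000

def needs_separate_meta_license(monthly_active_users: int) -> bool:
    """True if a licensee exceeded 700M monthly active users in the month
    preceding the Llama 2 release date, and so must request a separate
    license from Meta (which Meta may grant at its sole discretion)."""
    return monthly_active_users > MAU_THRESHOLD

# A typical startup is fine; a hyperscale service is not automatically licensed.
print(needs_separate_meta_license(50_000))         # False
print(needs_separate_meta_license(2_000_000_000))  # True
```

Note the check applies only as of the release date – a company that grows past the threshold afterwards is unaffected, which makes the clause look targeted at Meta’s existing big-tech rivals rather than at future competitors.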

“v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).”

So you can’t use Llama 2 to generate training data for building or fine-tuning any other model, which is potentially more of a problem, as lots of people might want to do exactly that.

This is Meta’s own license, created by them and not approved by the Open Source Initiative, who are generally accepted to be the (non-profit) authorities on what is open source and what is not. It has similarities to the approaches taken by MongoDB and Elastic to restrict the use of their software by cloud hosting companies like Amazon, as I’ve written about previously.

Why does it matter that Llama 2 isn’t open source?

Firstly, you can’t just call something open source if it isn’t, even if you are Meta or a highly respected researcher in the field like Yann LeCun. You don’t get to decide that your license is open source; the OSI do. Even Elastic didn’t call their new license open source, but used words like ‘open code’ (a far fuzzier definition).

Meta and many other giants have depended for years on the efforts of others who have freely granted the rights to use their software under open source licenses – with no ‘but you can’t do this’ or ‘except people like you’ clauses. That’s what open source is about at its core – freedom to innovate, standing on the shoulders of others – and you simply couldn’t build Facebook or Instagram without using open source software. For Meta to claim Llama 2 is open source is disingenuous and takes advantage of the open source movement, which has become a dominant force in software over the last few decades.

Some other LLMs have been released under truly open source licenses – it’s interesting to note that Falcon changed its license almost overnight once people complained it wasn’t open, perhaps to drive adoption rather than for truly altruistic reasons. Open source is a powerful movement amongst the development community, and in this fast-moving world of AI, with new models appearing (it seems) every week, you may well decide openness is a very good marketing strategy for gaining mindshare.

I suspect that Meta are trying to redefine open source on their own terms – and sadly, this time they may well get away with it. However, there are plenty of people (including me) currently informing them of their error. What is particularly worrying is that this could open the floodgates for other large companies to create their own licenses however it suits them, trumpet them as open source, and find that most people don’t even care, given the huge buzz of excitement around these new AI techniques. How about some new clauses saying a model is openly available for use, except in certain countries? What about some extra text saying you can’t use the model for medical devices, or for this political party? You can download the code of course, so it’s ‘open’, but…

Can we even define how ‘open’ an LLM is?

The trouble with our new AI-powered world is that it’s hard to define ‘openness’. When you release an LLM under any kind of ‘open’ license you’re only providing the results of a long and complex process. You may not want to release the source data the LLM was trained on (which could contain copyrighted works, another hot issue, and which certainly contains various biases), the training code itself, or full data on the costs and harms of the human input involved. Meta’s Llama 2 release includes only “model weights and starting code” according to their GitHub page. One could argue that a fully open LLM would include much more than this: it’s not just about code any more, but about how you derived the huge pile of magic numbers that an LLM actually consists of.

OpenUK, the not-for-profit organisation representing the UK’s Open Technology sector (which I am proud to represent as an Open Source Ambassador), has broadly supported Meta’s release of Llama 2 in the spirit of open innovation, although it has also made clear that this is not open source. Its recent report on AI Openness shows clearly that as the world of AI races ahead, licensing and legislation are at huge risk of lagging behind.

So I asked Llama 2…

OK, I admit it, I didn’t (I’ve not yet been tempted to get an LLM to write part of my blog or a presentation for me – it feels kind of lazy!).

Overall, I feel we’re getting into risky territory here, where large companies bend or even break the open source model for their own commercial purposes and, because we’re so excited about AI, we let them get away with it. We’re in an AI goldrush and the landgrab is intense: look at how VC money, driven by FOMO, is thrown at startups only weeks old. Meta and others are trying to make their LLMs (which sometimes only they and others at their scale can build, given the costs and data required) one of the standard shovels that everyone has to use.

There’s room for open and non-open solutions of course – as in the world of search engines, where there have always been different licensing models and it would be foolish to claim that open is always best. You might be surprised to learn that OpenSource Connections has recommended commercial solutions before, where a client doesn’t have a search team, has search issues easily solved by standard features, or is quite happy to follow a vendor roadmap and give up some control of their search. Similarly for LLMs – if using OpenAI is best for you and you’re happy to pay for it (and send them all your data) then that’s absolutely fine.

But when open really matters, let’s be clear about what it means and what it doesn’t. What constitutes true freedom, as embodied by the open source movement, is important and we should be very wary of anyone who tries to dilute this.

If you need help navigating the new world of AI-powered search, get in touch.

Image from Banner Vectors by Vecteezy