From OpenAI to Open Source AI: a conference at Devoxx France in April 2025

[Conference] From OpenAI to Open Source AI: between commercial ownership and collaborative openness

From OpenAI to Open Source AI by Raphaël Semeteys, Devoxx France 2025

Raphaël: Hello everyone, thank you for coming in such large numbers for yet another talk on GenAI; apparently we still want more. What I'm suggesting here is to take a slightly different point of view: we're not going to talk about the intrinsic abilities of large language models, that kind of thing. And I'm placing myself in the theme of this edition (Cyber Maya, or whatever Devoxx called it), so I feel a bit like Indiana Jones: we're going to explore.

And so, what I suggest is that you come with me on a journey, an expedition, to explore the LLM jungle. Because that's really what it is today: it's dense, it's everywhere, you hear noises and you don't know what they mean, there are big things and small ones. I called the talk From OpenAI to Open Source AI because the angle I want to offer, the map and the compass, is to look at what "open" means for a Large Language Model. There's a lot of buzz, a lot of communication, a lot of open washing and a lot of open bashing, so the idea is to discover, in this jungle, all the variants we can find between commercial ownership and collaborative openness.

Let me introduce myself: my name is Raphaël Semeteys, I have been working at Worldline, and at Atos before that, since 1999. I am based in Paris. I'm responsible for DevRel activities, but I'm an architect, so I'm used to looking at how things are put together and what they are for. I'm also an open source expert, and that matters, because I'm going to use that expertise to explore this jungle. And that's my avatar; you may get the reference, Raphiki.

Oh yes, and I'm also promoting the podcast Projets Libres, because I'm helping a great friend with it; by now I'm almost part of the organizing team. It's in French, and if you want to learn about free software governance, licenses, project feedback and so on, go listen to it, it's great.

So there you have it. What I propose is that you get ready; don't be afraid. The jungle is scary because it's dark and full of weird noises. But take your explorer's hat (I hope it won't mess up my microphone), tighten your belt, put on your boots: we're going to enter the jungle of LLMs.

From simple statistical models to LLMs

That's it, it's starting. The light is getting a little dimmer. Wow! It's dark, you can see things moving in the background. What's going on? Before going deeper into the jungle, let's just turn around and take a point of reference, so that we can find the exit and know where we come from. I'm not going to redo the history of AI; Luc Julia does it much better, and basically "it doesn't exist". Let me just come back to language models. I don't need to go back very far: in the 2010s there were already language models, we were already doing semantics and semantic search, and we were modeling words in vector spaces with this notion of embeddings. But the quality of the results was not great; it hadn't broken through in terms of usage. There were limitations, in fact.

All this changed in 2017-2018 with the publication of a research paper called Attention Is All You Need by Google Research, which introduced the attention mechanism. Basically, without going into details and without offending the data scientists and AI specialists in the room, it allows you to parallelize the processing, with attention that moves through the sentence. That unlocked the limitations we had on language models, and combined with the computing resources we'll come back to, it gave us the Large Language Models, with a transformer-type architecture. And then, for the past four or five years, we have been living through total hype, with ChatGPT: global usage, the technology adopted by humanity in record time. Then multimodality, so models that handle not only text but also voice, images, video, and so on. And then reflections along the lines of "these models are actually powerful, maybe we should start thinking about what we can or cannot do with them": notions of responsibility in the use of these tools, since they are powerful. As Luc Julia says, it's a tool; with a hammer you can hit someone or drive in a nail. So regulatory frameworks are also being put in place, with governments starting to say "we may have to regulate all this".

And then, when I started doing this research on large language models, that was a year and a half ago; I gave the first version of this talk back then, and every time I give it, it changes, because obviously everything moves so fast. At the time I was playing fortune-teller a little, saying: the future will be small language models, we'll go back to smaller, more specialized things that we can distribute, reaching mobile uses; agents will also emerge and we'll be able to start doing architecture with them. And in fact, all of that is today; it's what we are experiencing right now. Then there are the LAMs I was mentioning; maybe that will take off soon. These are the Large Action Models. It's less about language and more about behaviors, about patterns: interacting with the world, doing embedded AI in robots, and so on.

GenAI at its Linux moment

Well, anyway, we can see that all of this is moving a lot. The subject of my talk is really LLMs. And as I said, I'm an open source expert. I discovered open source when I was in engineering school in the 90s, at the same time as the Internet. I saw these two things and thought: "This is crazy, we're connecting everything, and on top of that we're starting to share things. It's going to change the world." Well, that's pretty much what happened.

And so I followed it from the 90s onwards; that's not the very beginning of open source or free software, but I followed the whole adoption. Open source and the Internet are concomitant, and together they changed IT a bit, they changed society, through the way they were adopted. And when I look at what's happening with GenAI, I think: "Wow, this reminds me of what I saw with open source, except it's going much, much faster." But indeed, whether it's open source, the Internet or GenAI, all of it started, obviously, in labs, in universities where research is done.


And what do researchers do? They are used to working in their labs and publishing their research results. They hold colloquia, a bit like us here: "I'm the one who found this thing", and so on. They publish their results and, above all, they reuse each other's.

And so there is a kind of global, collective research going on, where we reuse the work of others, get inspired by it and build on top of it. That's how, collectively, we manage to build things. That's how we made software at the very beginning, before we understood that software has value and AT&T, IBM and the others said: we are going to sell software. That's what created the Free Software Foundation: people who said "no, we want to keep sharing software, we don't want to make it proprietary". And it's also how we did mathematics. If we had put patents on mathematics, or done commercial mathematics, I don't know whether, as humanity, we would have made it to the Moon, if everyone had kept their equations in their corner. So there is this notion of sharing.

And then, when the usage becomes mature and powerful, it leaves the sphere of the labs and reaches individuals, and through individuals, companies. So there is the usage. Here, with GenAI, it's ChatGPT. ChatGPT: did you see that thing? It's unbelievable. There was crazy usage and adoption around ChatGPT and GenAI in general, and it entered the world of business and companies. Now we're immediately talking billions. GenAI is billions and billions of dollars, euros, yen, and so on.

And here we enter what we already knew with open source. Companies say, first of all with this very centralized model (we'll come back to centralization and decentralization): it's expensive to train, it's expensive to run, you need a lot of machines, so you need money. And there is value in the model, so we will make money with it. So postures are starting to change, a bit like with software: "We could sell this thing, we could make money. The researchers are nice, let them keep going, but we're going to build business models on top of that." And if we follow what I've observed with open source (I'm a bit radical when I say this), LLMs will become commodities. That is, they will become something standard. Obviously they will be specialized, there will be some that are open, some that are paid, and so on. But we're going to start building on top of them.

We're going to start innovating on top of them, because the LLM is going to become a standard building block that we put in our architectures. Except that it's going much faster. So I told myself: I recognize patterns here, this reminds me of the open source movement. The reflex I had was to look at the licenses, to look at what "open" means. Because you hear Meta saying "yes, we're open source", that kind of thing. What does "open" mean for a model? It's a good angle for analyzing how we'll be able to use these bricks in the future, since for me they are going to become commodities very soon, if they haven't already. So it's about having more clarity on the licensing, more clarity on the openness of a model.

Defining an open model

So, what did I do? I went to see our researchers in-house. We do research at Worldline: Worldline is mainly in payments, and in payments we have been doing AI for a long time, especially for fraud detection; not necessarily GenAI, but still. They are looking at all this, so I went to see them and told them: "I have a simple question. First, what is an LLM? OK, that's complex. But what do I need to look at in the build or training chain of an LLM to identify the level of openness of the thing?" And they answered me with this diagram. That was a year, a year and a half ago. Since then, other training methods have appeared and we could draw more complex diagrams, but the important principles remain.

They told me the first thing is the model itself. So what is an LLM? It is a neural network model: the description of an architecture that will then be implemented and trained. This training will determine the parameters, the famous weights of the model on the different layers of the neural network. And they told me: "As AI researchers, the code that implements the neural network architecture, we don't really care about, because frankly we know how to write it; it's very well known as long as we know the architecture, and it's not what makes the value of the model. What's interesting is the weights, that is, the trained model. Once I have the weights, then yes, I can reproduce the architecture, load the weights back into it, and basically I have the model and I can run it at home." So they told me: "This is the first thing to look at: are the weights available, and under what conditions?"

The second aspect to look at is the data, since there are no models without data and no weights without data. In fact, the weights, in a way, are the data that have been engrammed into the model. There are several types of datasets used to train the models. First, the pre-training dataset, which is used to create foundation models. These are models with huge general knowledge. To create them, what do we do? We try to collect as much data as possible, basically the whole Internet, everything that is human, while trying to filter out, as Luc Julia said, everything that comes from LLMs themselves, because otherwise it's mad cow disease; and we train the models with that. So: do we have access to these datasets? Do we know what is in them? Are there licenses on them? Can we reuse them, modify them, reuse them to train other models, or not? That is going to be one guarantee of openness.

Then there are the datasets used in the other phases of model training, especially when we want to make specialized, fine-tuned models: a model fine-tuned for a given field, or at least for a given way of processing data. For example, chat-type models, the ones that converse with humans, are fine-tuned versions of foundation models, which have been trained to do lots of things but are not necessarily specialized for chat. And then there is a third category of data we put in, which I label here reinforcement by humans, the RLHF: data that comes from humans to correct and align the model, to fit a particular use even better. OK? There are other techniques now, DPO with preferences and so on, but basically there are other types of data that are used to align the model.

And what they told me is that in this last case, there is an intermediate thing you don't often see, which is the famous reward model. I said the model is reinforced by humans, but the human feedback actually serves to train an intermediate model, which is then used in the training and final alignment of the model.

And that, they told me, is something you often don't get to see. So for us, as researchers, it's important to be able to get hold of it.
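As an aside for readers of this transcript: the reward-model idea can be illustrated with a tiny, generic sketch of the preference loss commonly used to train such models. This is not any specific lab's recipe, and the score tensors below are random placeholders standing in for the reward model's outputs on answers that humans preferred or rejected.

```python
# Generic sketch of reward-model training from human preferences
# (Bradley-Terry style loss). Illustrative only, not a specific lab's recipe.
import torch
import torch.nn.functional as F

batch = 8
# Scalar scores the reward model gives to the human-preferred answer
# and to the rejected answer of each comparison pair (placeholders here).
reward_chosen = torch.randn(batch, requires_grad=True)
reward_rejected = torch.randn(batch, requires_grad=True)

# Push the preferred answer's score above the rejected one's.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()
print(float(loss))
```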

So they told me: check whether that reward model is available. And then there is still some code. Not so much the code of the model itself, the architecture, the implementation; that's not what interests us most. It's the data processing code: do we have all the elements to do the scheduling, organize the data processing, organize the training, retrieve the results, the logs, and so on? If we have that, it can save a lot of time, because we can very easily reproduce the training of the model at home, and therefore modify it and do our own fine-tuning, for example.
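To make the "data processing code" point concrete, here is a minimal, hypothetical fine-tuning sketch using the Hugging Face transformers and datasets libraries (not something shown in the talk). The model name and the one-line corpus are placeholders; the point is simply that when the weights and this kind of glue code are published, reproducing a fine-tuning run locally takes a few dozen lines.

```python
# Hypothetical minimal fine-tuning sketch; model and data are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import Dataset

model_name = "gpt2"  # stand-in for any small open-weight model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy corpus standing in for a real fine-tuning dataset.
texts = ["Question: what is an LLM?\nAnswer: a large language model."]

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True,
                    padding="max_length", max_length=64)
    out["labels"] = out["input_ids"].copy()  # causal LM: labels = inputs
    return out

dataset = Dataset.from_dict({"text": texts}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1, report_to=[]),
    train_dataset=dataset,
)
trainer.train()
```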

So, OK, starting from these five artifacts, these five components of LLM training, I said to myself: this is my map. Now I need a compass before I enter the jungle; I'm not crazy.

Here is my material, so I propose this scale. 0 is completely closed: basically a black box, we have no access to the artifact in question, whether it's the data, the code, the dataset, the weights, etc. 4 is completely open in the free software sense: I have access to it, I can use it, distribute it, modify it and redistribute it, without restriction. In between, 1 means it is described in research papers but we don't really have much more than that; as we have seen, that can still be useful to researchers ("OK, I understand this type of model, I can redo it"). 2 means I can have access to the component but I have to show my credentials: pay, have signed something, be part of a research project, that kind of thing; typically the case for certain data. And 3 means it is open, I can use it, modify it and redistribute it, but there are limitations on the use I can make of the artifact in question. With that, at least for me, we're less afraid this time, and I can go back into the jungle, deep into the jungle, and say: "Come on, I'm going to start mapping things, a bit like an ethnologist, an LLMologist let's say, and try to see what kinds of critters I find in there."
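As an illustration only (this is not the speaker's tooling), the grid can be written down as a small Python sketch: the five artifacts, the 0 to 4 scale, and a fictitious model with made-up scores.

```python
# Sketch of the openness grid: five artifacts rated from 0 (closed) to 4 (open).
# The example scores at the bottom are invented, not real ratings.
from dataclasses import dataclass

SCALE = {
    0: "closed (black box)",
    1: "described in research papers only",
    2: "accessible with credentials (payment, agreement, research project)",
    3: "open, but with restrictions on use",
    4: "fully open (use, modify, redistribute without restriction)",
}

@dataclass
class ModelOpenness:
    name: str
    weights: int
    pretraining_data: int
    finetuning_data: int
    reward_model: int
    data_processing_code: int

    def report(self) -> None:
        for artifact, score in vars(self).items():
            if artifact == "name":
                continue
            print(f"{self.name} / {artifact}: {score} ({SCALE[score]})")

ModelOpenness("SomeLLM", weights=3, pretraining_data=1, finetuning_data=2,
              reward_model=0, data_processing_code=4).report()
```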

Market Leader: OpenAI

Well, when you enter the jungle, the easiest one to find is OpenAI: that one is not complicated, you don't need to go very deep to identify it. And if I apply the grid, it gives me something like this.

First of all, OpenAI, its name: a non-profit, even if at the moment it is more or less changing its status. The stated purpose is a multidisciplinary research project to advance knowledge in AI for the well-being of humanity, and so on. Their GPT family of models, Generative Pre-trained Transformer, so the architecture introduced by Google, starts in that collaborative research mode: they implement it, the model is completely open, it can be used, there are no restrictions, etc. As for the dataset and the code... they are described in research papers, but we don't have much more detail than that. We are in research mode. They don't worry much about the masses of data; what they want is a high-performance model, and to show that we can move towards Artificial General Intelligence, which will never exist, as Luc Julia said.

But that was their goal. Then comes the hype, the LLM hype. Boom: millions of users, and so on. Then Microsoft arrives: "Come on, I'll give you billions, you train the model, we'll do something crazy." Value arrives, and the posture becomes: "we close everything. We're not changing the name, but we are closing the model." It's total: nothing left, black box. We don't know how it works; from versions 3, 4, the o-series and later, everything is closed. There are things described about ChatGPT, a fine-tune of GPT, in research papers, and a little about o1, but I tend to say it's almost marketing: "yes, we've done stuff, we're innovating, we're super strong, we do chain of thought." We don't really have the details. So we see a radical change when we enter this business era: open, collaborative research at the start; then closed, we make money, we do business.

There is another thing that matters, and it's why you have to read the licenses and terms of use: OpenAI's terms say that you cannot use outputs from their models to train competing models. We'll see later why that's important. So: don't train competing models, or come and talk to us first. Another one that's easy to find... Ah, sorry! Yes, this is something I added, because it moves so fast: about a week or two ago, Sam Altman, the CEO of OpenAI, said, "It's been a long time since we did anything a little open, like in the early GPT days. We're thinking about whether we can open something at OpenAI." So maybe we are going to arrive at this notion of open weights, this notion of commodity I was starting to talk about. Maybe we should share. Is it under pressure from American or Chinese competitors? Who knows. What exactly will it be? We don't know. But in any case, we can see OpenAI saying: careful, maybe we should play the openness card a little.

Market Leader: Google

So, as I said, the other one that's easy to find is Google. Let's take a look, with BERT. And in BERT there is Transformer; that makes sense, they are the ones who invented this type of architecture, so they implement it.

It's actually rather open. For the dataset, you don't have access to everything unless you show your credentials, but we are really in this world of collaborative research. Then comes the LLM hype, first with PaLM and PaLM 2, and now Gemini, which is a generic name under which they put a lot of LLMs. First reflex: they also close up. We enter the business arena; they keep publishing things in research papers, but that's it.

And then, shortly after, they go back to a slightly more pragmatic posture and tell themselves: we're still going to do things that are a little open. They release Gemma, which is in fact a family parallel to Gemini, and which allows the reuse of the weights, with limitations, as we'll see. The research has been published, on the datasets in particular, but we don't really have the details, and there is a good chance the same is true of Gemini. On the other hand, they provide a lot of code, tools, documentation, etc. on how to run Gemma locally and especially how to fine-tune it, how to make your own derived versions of the models. Very much the Google way: go ahead, adopt it, play with it. OK? So why a 3 on the openness of the weights?

It's because of this notion of responsible AI I mentioned earlier, which says: "yes, but with Gemma, you can't do evil." So, what is evil? Good question. Some of it is obvious: you shouldn't illegally practice a regulated profession, you shouldn't promote violence, hatred, etc. Clear enough; you shouldn't do just anything with it, that kind of thing. But this restriction contradicts the definition of open source in the strict sense. The moment you put a restriction on use, by the original definition of open source, it is not open, since you are restricting usage. That's just worth noting: when you hear "open source", be careful about what is actually meant.


There you go. So that’s on the Google part.

Other major players

After that, the others, the usual suspects, the ones who are easy to find in the jungle: basically the big guys, the web giants, the ones who benefited from all this networking and captured the data and the usage, data that they monetized. They told themselves they had to make money with this, and they are trying to catch up, or are catching up, or have caught up, along two main axes. One: they create their own models, open or not.

On the open side we have Alibaba with Qwen, and things happening at IBM too; and then there are proprietary ones. Two: if they are not already infrastructure providers, they join forces with infrastructure providers, because in this vision of models you need more and more resources, you need to have the strongest model, and in any case it's a war, you have to show you have the most parameters, the most modalities, and so on. So OpenAI partners with Microsoft, even if the relationship is shifting a little now, and the capital being raised is counted in billions. We have Anthropic with AWS, we have Grok, so xAI and Elon Musk, with Oracle, etc. We can see that things are getting organized and that the traditional web giants are telling themselves: we must not miss this train, we have to position ourselves, because there is value and we will have to capture it.

Market Leader: Meta

There is one I haven't talked about yet, since we're talking about GAFAM, and that's Meta. It's interesting because they did something that helped structure this part of the jungle a little.

Meta started in the open too. In any case, they have had artificial intelligence research labs for a long time; Yann LeCun has been there for a long time as well. RoBERTa, in the pre-hype era of LLMs: given the name, you can see where it comes from. There, they were really in the open model: completely open code. For the dataset, they tried to give even more visibility; you have to dig into it a little, and there may be some limitations depending on the licenses, but overall it is something open.

Then comes the hype, everyone says there is value, and Meta does the same: they release their LLaMA family of models and start publishing things in a research context, but without giving too many details about the datasets or the code. On the other hand, they do something, and they were one of the first of these players to do it: they say "our model, we allow you to use it, to install it and to fine-tune it". They were among the first, and that's why, at the beginning, they said "yes, we're open source, because we authorize all that".

We have seen that "open source" doesn't necessarily apply, because there are already limitations of the "do no harm" kind, including on LLaMA. And then, release after release, they start adding things back into the terms of use. The first is with version 2. The first restriction says: "wait, if you build a service based on LLaMA, in SaaS mode for example, and you have more than 700 million monthly users, then it's no longer open, you have to come and share the cake with us". So there is a protection clause; it's neither good nor bad in itself, but it's there. That's version 2. With version 3, they say: "We would also like attribution. LLaMA is starting to become known and used. So if you make a fine-tuned model, you have to specify that it was built with LLaMA, and you have to put Llama 3 at the beginning of your model's name, so it's clear that it comes from us and which version it was based on". So it becomes a little more restrictive.

And then LLaMA 4: they keep the previous restrictions, but there is a little line in the terms of use saying that if you are based in Europe, you do not have the right to use it. So there you go. Does it have something to do with the AI Act, and with the fact that they don't feel entirely comfortable about how they trained LLaMA and on what type of data, whether they had the consent of Facebook users? There may be something at play there, but for now, at least, that's what is written. So careful: it's open, but not in Europe. You have to read the terms of use carefully, otherwise you can get surprises.


Offspring of LLaMA

Still, they did this, and they were the first player of this type to say: you can download the model and reuse it, even in Europe, at least before LLaMA 4. And right away, in the United States, researchers in universities said to themselves: "great, we're going to make fine-tuned versions".

So Stanford with Alpaca, and Vicuna, from Berkeley I think, another university, I don't remember exactly. They have more or less the same profile. That is, if we look at the level of openness with our compass and scan the model from that angle: for the model and the pre-training, the levels of openness are inherited from the foundation model they are based on, so a 3 and nothing more. What did they add? Code and data to do this fine-tuning, to specialize the model, improve it, etc. The code is under the Apache 2.0 license; that's code, after all, and open source licenses are really designed for code, so that part is clear. At the data level, they used data that came from ShareGPT, a site where prompts and prompt results are shared, so they used outputs from OpenAI. Ah, but there was that limitation at OpenAI: you can't use the outputs to train an AI that competes with us. Well, basically, this is research work.

You can't just build on that and turn it into a business. None of this has gone before a judge, but the original intention was not for you to do that; in any case, they don't want you to. So you can run into problems. It's something to keep in mind.

And then more recently, like last week, there are other players. This one is a Silicon Valley startup that said: "well, I'm going to base myself on LLaMA 3.2 and release an even more advanced version", with chain of thought, models that think about what they are going to do before doing it, that kind of thing. It's called DeepCogito, it's brand new, and we don't have all the details: no details on how they did the fine-tuning, on the code, etc. But we can see that they directly inherit the level of openness of LLaMA 3.2; and by the way, I didn't go and check, I didn't ask them, but normally they should be putting Llama 3 in front of the name, if we go by what is written in the restrictions. There you go.

But all this is to show that this openness immediately creates a dynamic. And there are people, in research but also in business, saying: we're going to create models, we're going to keep going, and so on.

Collaborative foundation LLMs

From there, other players have also emerged, and I'm only mentioning a few of them, who said: "What we would like to do is create more open foundation models, particularly at the dataset level, to boost this dynamic and rediscover the collaboration that comes from the world of research and from the open source world. We're going to do it in a community way." So here are a few examples.

As you can see, these are mostly research or activist efforts. In France, for example, with Linagora and OpenLLM France, the aim is to make things as open as possible. If I start on this side, we have EleutherAI with GPT-J, which is open. There are small limitations at the dataset level, because basically they are still researchers, and there is a little sentence that says: "if you want to reuse the datasets and know exactly what's in them, find out about each dataset yourself". Which means they didn't really do the work; they let you figure it out. That's why I gave it a 3. But otherwise, it's still quite open.

And it's not just in the United States. For example, Falcon comes from the United Arab Emirates. Same thing there: a very open dataset; they created their own dataset, put a clear license on it, and authorize you to use it and reuse it. On the model, as we'll see, they can add some limitations, "do no harm" perhaps, maybe others, and some conditions on the code. Then we have BLOOM, an interesting project because it's a pan-European research project in which France was very present. It got a lot of research labs collaborating, it ran on Jean Zay [the CNRS supercomputer], that kind of thing.

And for the "do no evil" part, they started to organize it with the notion of OpenRAIL, the Open Responsible AI License. There are several types of OpenRAIL, several levels, etc., but it amounts to saying: "you can't do this, you can't do that". From that point of view, it still contradicts the open source definition a little. On the other hand, the dataset is open and the code is completely available. So this is something quite interesting.


OpenLLaMA, same thing: openness on the model and on the dataset, and then some limitations at the code level. And then, more recently, Lucie, a project in France led by people who come from the open source world and who aim for as much openness as possible. So that's interesting.

What should be noted is this notion of responsible use, and the vagueness around the datasets, where you can end up in situations where they say "no, no, it's not on us, we told you to check". But things are appearing. And this is just to show that open source licenses can be altered: take Falcon, the UAE project. They say they are based on the Apache License, and I've seen people go "Ah, it's Apache!" But they also say: "Ah, we changed it! And you had better read what we did." Because what they say is that if you do Falcon-as-a-Service, you actually have to come and give them money. So basically, it's based on Apache, but they have added a restriction on use, a kind of commercial protection clause, a bit like Meta, a bit like other players.

So it’s important to look at the details and the conditions of use of the models as well.

Other open-weight LLMs

We also have other players that I try to put in the category called open weight. You see what it means now: I distribute the weights, I allow others to reuse them, even to fine-tune models from them, but I'm not necessarily going to publish much about the data that went into producing those weights, OK? And I tend to encourage fine-tuning by providing code, documentation, etc., so that others adopt the model(s) and specialize them, OK?

And so we have other companies with this same positioning: Meta, as we have seen; Google with Gemma; and DeepSeek, so China. Which shows that all of this is not limited to the United States. AI has never been limited to the United States; it's the business side that is very concentrated there. So, DeepSeek, which created a buzz; you have all heard of DeepSeek. What do they do? They make a model that is open. There is something not very clear, a bit blurry, on the DeepSeek side. Here I'm talking about DeepSeek-R1, the "intelligent" DeepSeek, the one that knows how to do self-reflection, etc. It is based on a bootstrap version called R1-Zero; in fact, that's how they built their model.

And R1-Zero is itself trained on a DeepSeek-V3 base. V3 had restrictions on usage, but they no longer exist in DeepSeek-R1, which is fully under MIT. So what do we see? It's not all clear, but one thing is: they have started removing restrictions. The reason may be geopolitical, it may be commercial, it may be lots of things, but clearly they said: "We are no longer putting restrictions on the use of the models." On the other hand, for the datasets, no details; and they release the code. They also did something else: distillation. Distillation, basically, means using a large model, here DeepSeek-R1, which has been heavily trained and has its self-reflection capabilities, to do knowledge transfer: it is used to teach a smaller, existing model, to align it even further.
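For those who want to see what distillation looks like mechanically, here is a minimal PyTorch sketch of the usual soft-target loss. It is a generic illustration, not DeepSeek's actual recipe, and the tensors are random placeholders standing in for teacher and student logits on the same batch.

```python
# Generic knowledge-distillation sketch: the student learns to imitate the
# teacher's softened output distribution. Tensors are random placeholders.
import torch
import torch.nn.functional as F

temperature = 2.0
batch, vocab = 4, 32000
teacher_logits = torch.randn(batch, vocab)                      # large "teacher" model
student_logits = torch.randn(batch, vocab, requires_grad=True)  # smaller "student" model

# KL divergence between softened teacher and student distributions.
kd_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2

kd_loss.backward()  # gradients flow only into the student
print(float(kd_loss))
```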

And they did it with open models such as Qwen (so Alibaba) or LLaMA, in versions 3.1 and 3.3. What you have to see in these cases is that the distilled models keep the licenses of the original models. That's why you shouldn't mix things up; sometimes you can get confused between the models and their names. So a distilled LLaMA 3 model, or if one day there is a distilled LLaMA 4, careful: in Europe, for example, that could pose problems. That kind of thing.

And then we have Mistral, cocorico, based in France. It's somewhat the same profile. That is, the models are open, we don't have a lot of information on the datasets, but on the other hand we have a lot of components and help to build fine-tuned models. What's interesting about Mistral is also the innovation at the business level. We see things we knew well in open source: community version, enterprise version. So Mistral will also provide, under a closed commercial license, optimized versions adapted to the enterprise, easy to integrate, and so on.

So we find things there that we knew from the world of open source, with business models that are still searching for themselves, still being invented. They even innovated with this notion of "sustainable openness" on Codestral. Codestral is a model specialized in code generation, for which they created the Mistral AI Non-Production License. Basically it says: you can do what you want for personal use, but if you use it in production, or build APIs or SaaS with it... well, actually no, it's not open. Careful there too; it depends what you're looking at. But what's interesting with Mistral is that they have all these different postures; they are also figuring out how to find their economic model.


Derived LLMs

Well, I'm speeding up a little. Once we have foundation models that are more or less open, what happens? Others take them and build fine-tuned versions on top of them.

That's what happened with Dolly. Dolly is a model made by Databricks, based on EleutherAI's GPT-J, so it inherits GPT-J's openness levels for the pre-training model. Then they created their own fine-tuning dataset: they crowdsourced it internally, within the company, to create this model. They fully allow it to be reused. We have no information on a possible reward model, and the code is available.

BLOOMChat: you can deduce what it is based on, BLOOM, plus the chat orientation. What's interesting is that they apparently used Dolly's dataset to do the fine-tuning. So we are starting to see this logic of collaboration, of collective innovation, being put in place.


And LAION too, datasets that come from communities, in Germany for example. No public information on a reward model either. And for the code, they reuse OpenRAIL, which in any case belongs to the BLOOM ecosystem.

Zephyr is interesting because it's an initiative from Hugging Face, which you know well, who said: "well, we're going to fine-tune Mistral". So they fine-tuned Mistral, but likewise, they used OpenAI outputs to do the fine-tuning. So it's more of a research project, a POC, let's say, with examples, anyway.

And then the last two are interesting, because these are communities that said: we are trying to make models that are as open as possible from the very beginning. We want something open. So, LLM360: it's all in the name; that's the name of the organization, and there are several model versions underneath (Amber and others). There, the weights and the models are completely open. They really list the datasets they use and they pay attention to the licenses: RedPajama, datasets known and used by a lot of other projects, and they reuse datasets from Falcon, mentioned earlier, from StarCoder, etc. We have information about the fine-tuning datasets they used, but it's a bit complicated to really know where you stand; that's why I gave it a 3, because it's a little messy. But still, that's their idea. No information on a potential reward model. The code is available.

And then OLMo, from the Allen Institute for AI in the United States, which is interesting because they really made the effort to create their own datasets, both pre-training and fine-tuning. They put very clear licenses on them, but of the responsible-license type; that's why they get a 3. And they did something else right: the reward model is clearly there, UltraFeedback, under an MIT license, and it can be reused. So now we have something that is becoming more and more open. That's interesting.

Then, again, we find this notion of responsible AI. Here, for example, with OLMo, it's again "don't do evil". But we can see that here, "don't do evil" has another definition. Here it's "no military use", because war is bad. You must state that content was generated by a machine. And if you build something from it, you can't do things related to biometrics, or start making predictions in matters related to the law, etc.

So yes, "evil is bad", that's for sure. But is evil in China and evil in the United States the same as in Europe? That's why you have to look into the details of what "do no evil" means. Because it could very well be "don't do medical stuff". If you are a company, a startup, working in the medical field and wanting to innovate there, you had better check before you start building a whole solution on this kind of component.

And then, to follow, and I'll start accelerating with these last ones: Open R1 and OpenSeek. They are just there to show the ability of the open, collaborative movement to mutate and adapt, as long as there are things that are open. Open R1 is Hugging Face saying: "DeepSeek published everything, how they did it, all their tricks, to make R1 and compete with OpenAI; we are going to redo the same thing, but completely open."

And then there is OpenSeek, the same idea, but in Beijing, at the Beijing Academy of Artificial Intelligence, who say: "We are going to do OpenSeek, and it's the same idea: we are going to try to open things up as much as possible." So, once things start to open up, this community aspect kicks in. And that's what I call the Linux moment, where things really start happening, because openness and transparency promote collective innovation. At some point, someone will pick up a dataset from over there, reuse it, and build their new model.


I have talked a lot about this Linux moment. I analyzed it through this notion of license, so that we could understand the positioning of the different players a little. And we can see that this positioning is changing. We just saw it: now there is OpenAI saying, "Ah, maybe we're going to do some open weights", because otherwise maybe they'll get overtaken. That's interesting.

Other aspects of the Linux moment

There are other aspects, other elements that, for me, are part of what I call the Linux moment, that is, the moment when Linux and open source started to change IT. I've listed three here.

The first is collaborative tools and ecosystems. As soon as there is openness and reuse, collaboration inevitably gets set up, and that brings out ecosystems that are used to working together. These ecosystems either build on existing tools, or create the tools they need to collaborate.

That's something we saw with the rise of the Internet and then of open source in general. I've quoted a few of them here. arXiv was not created by open source or by AI at all, but it has become the reference platform: what is the arXiv reference of your research paper, where can I get the description of what you did?

Jupyter is another example. This notion came a bit from the world of data scientists, but this way of presenting, explaining and demonstrating data science recipes, and GenAI in particular, with notebooks is something that has been completely adopted and has become part of standard practice. And by the way, you can now run things other than Python in notebooks.

Hugging Face is the GitHub of AI. All the communities are there. Where is your model card? On Hugging Face. That's where you need to be.

What's also interesting is that the innovations being made, as we've seen, are not only technical; they also happen at the level of business models, as we saw in open source: a community/enterprise model, a somewhat open-core model, and so on. Things are happening there. And then, all these ecosystems are going to generate, to really innovate.

Then, there are things happening in terms of the optimization and democratization of the models themselves. That's why I talk about community and companies together.

First, at the hardware level, with the chips: the hype exploded in 2020, and the time needed to bring new chips to industrial production is about four years. So we are now entering the period where everyone is going to release AI chips. We will have AI chips in all PCs, AI chips in phones. It's going to become widespread, because GPUs just happened to be good at matrix computation, which is what neural networks need, the same kind of math as graphics cards; but originally they weren't designed for that. So now we are going to create more specialized processors, what I call XPUs, which are even more optimized, will cost less and will be found everywhere. That will certainly help democratize things completely.

There are also things being done at the software level, on the models themselves. First, the small language models, as we have seen, for example with distillation: large models are used to train smaller, more specialized models, so that we can run them on phones, or have uses that are more frugal, let's say.

We also have quantization. What does quantization do? It reduces the precision of the models a little. Basically, the weights are vectors in vector spaces, and rather than storing full floats, we store them with less precision. That gives a smaller model, which may be of slightly lower quality, but which can run on ordinary CPUs, or on mobiles, etc.
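As a toy illustration of the idea (not any particular production scheme such as GPTQ or GGUF), here is what symmetric int8 quantization of a weight matrix looks like in a few lines of NumPy.

```python
# Toy symmetric int8 quantization: store float32 weights as int8 plus a scale,
# trading a little precision for a roughly 4x smaller, CPU-friendly model.
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)     # a float32 weight matrix

scale = np.abs(weights).max() / 127.0                   # map the range onto int8
quantized = np.round(weights / scale).astype(np.int8)   # compact storage
dequantized = quantized.astype(np.float32) * scale      # approximate reconstruction

print("max absolute error:", np.abs(weights - dequantized).max())
```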

And then there is the whole aspect of decentralization, which I believe in a lot, since that's how we built the Internet, open source, etc.: re-decentralizing the training and inference of models. How do we avoid this single vision of "you need the biggest model, with the biggest machine rooms and two nuclear power plants next door to run them"? Do we have other ways of doing things too? And it's in addition to, not instead of, as we saw, for example, with open source.


And then, the last aspect is the tools, frameworks and communities being created around the models, now that they are being democratized. For me, that is also the sign that we are entering commoditization. It's about how we use these bricks in IT: as an architect, when I design solutions, how do I integrate the models into my architectures?

And what do we see? When it's done in this somewhat collaborative mode, it's done the Unix way: do one thing, but do it well. I'm not going to try to build the thing that does everything; I'm going to build something that does one thing really well, because I know someone else is going to build something great next to it, we'll be able to team up, and in the end it becomes an OS, it gets called Linux, and it dominates the world, for example.

But when you do that, if I'm counting on others to integrate with me, standards inevitably emerge, because interoperability is needed. All of this generates interoperability and standards, and that organizes the whole ecosystem a bit.

Here, I've mentioned a few from a somewhat related field, agents, because they are fashionable: MCP, which everyone talks about, or A2A, for getting agents to communicate with each other. We can see that standards and protocols are being put in place. And the good news is that when we started making interoperable protocols, it gave rise to the Internet. So there is hope.

And then I mentioned LangChain, one among many, but a well-known one. It's interesting to me because it shows how LangChain started in Python, because Python is data science, it comes from that world; then there is a JavaScript version, and then LangChain4j. I'm sure there were plenty of meetups about it; I know there are specialists of that domain in the room.

And so we see that it's becoming democratized, that it's coming out of the world of data science and entering the world of IT, at least the one I know, and the systems I'm used to designing.

This is the end of our exploration

So, in conclusion, this is the end of our exploration. I hope you’re a little less afraid. You have keys, a compass, etc. What did we see?

We have seen that between a positioning that is completely closed (black box, paid APIs) and something people are trying to create, which would be free AI, as in freedom and not as in free beer, there are a lot of different positions, and it's all very much in motion. Plenty of players arrive, get created, leave, re-associate, and so on. And a model is emerging that we now call open weight. We don't say "open source" much anymore, it doesn't look trendy enough; we say "open weight". But you understand what it means.

What do we need to remember? What have we learned by observing these wild creatures a little?

  • Be careful with the foundation models you build on, because you will inherit their restrictions or their openness.
  • Be careful with the datasets you use, because these datasets can also come with constraints.
  • And beware of the competition clauses put in place by the players, because this is also business and strategic positioning, and of the restrictions related to responsible AI.

That's it. And if I go back to my curve from the beginning, where I said we are in the Linux moment and I recognized an adoption curve I experienced with open source: indeed, we have gone from open research to, today, a hyper-competitive market where there are constant announcements and a lot of buzz. It's sometimes a little difficult to see clearly, because there is a lot of foliage in the jungle; was it really a puma I saw? I don't know.

In short, we are moving towards a competitive ecosystem that really generates commoditization, because we will cooperate on these commodity bricks and keep innovating on top of them.

Openness does promote reuse and collaboration, and this collaboration is what leads to commoditization. In fact, I shouldn't talk about a Linux moment anymore; by now it has become a movement. It's a groundswell that is coming and that will transform IT a little, the way open source did, which doesn't mean it will replace everything.


Initiatives to evaluate the openness of LLMs

And to finish, one last thing: AI and open source are completely compatible. Since then, other organizations and other initiatives have appeared to study the openness of models. I've listed some of them here.

The OSI (Open Source Initiative) has done something very important. There are controversies about the datasets and so on, but beyond that, someone had to take this on, someone we trust, so the OSI, with an open process, to define what Open Source AI is, because the term is used in the AI Act, for example. The legislator says there is an exception if you are Open Source AI. OK, but what is that exactly? So the OSI did the job, and now the discussions can take place.

The Linux Foundation also contributed, with its own framework for evaluating model openness. There are things in the research world too, but those are batteries of a hundred or so criteria; it's heavy, a bit complicated. There are also things happening at the level of the French government.

So there you go. I hope you see a little more clearly, that you found it interesting, and that when we take a step back, we realize that things are happening in this jungle: it is getting organized, and in the end it's not that chaotic.

Then, I don't know, Raphiki here seems to have stolen some kind of old Mayan or Aztec tablet, I don't really know what it is. If you want to follow me on social networks, or go and find the slides of this talk, I think it's still a QR code, so you can scan it.

There you go, thank you for your attention. I can take this off, by the way; we're not in the jungle anymore. I don't know if we have time for questions; otherwise it will be afterwards. Thank you in any case.

Useful links

  • Video of the Devoxx France 2025 conference and the associated presentation material
  • Video of the same conference presented in October, in English, at Devoxx Belgium 2025 and the associated presentation material (it is a little more up-to-date than the French version that preceded it, mentioning OpenAI's gpt-oss models released in the meantime).

Episode production

  • Live recording at Devoxx France, Paris in April 2025
  • Conference: Raphaël Semeteys
  • Technical resources: Devoxx France
  • Transcript: Walid Nouh

License

This podcast is released under the CC BY-SA 4.0 or later license.
