Opinions – Open Source Initiative

Overcoming barriers to Open Source procurement in the European Union

Jordan Maris — Wed, 12 Mar 2025 13:33:29 +0000

Every year, public authorities in the European Union spend around 14% of annual GDP on services, works and goods. As European governments continue to digitalize, a growing share of this spending is on digital solutions. Sadly, few of these solutions are built on Open Source software, but that might be about to change: the EU has just started a review of European public procurement rules, and has asked for input on potential changes to the rules. We responded, advocating for changes that dismantle the barriers to Open Source procurement and ensure Open Source solutions are fairly considered.

Why public procurement matters for Open Source

Open Source underpins almost 90% of the software used around the world today, but very little of the money used for public procurement reaches Open Source developers. Meanwhile, the Open Source community faces the challenge of achieving sustainability whilst having to comply with a flurry of new laws and regulations, such as the EU’s Cyber Resilience Act.

If the European Union dismantles the barriers to procurement of Open Source software, it could create numerous opportunities for Open Source projects in Europe and around the world, giving them the financial backing and support needed to achieve sustainability, and accelerating their development.

Why Open Source matters to the EU

The European Union and its member states also face new challenges: the current geopolitical context has forced a change in approach, making the continent’s strategic autonomy a consideration in the public procurement of software and digital services.

In our letter to the European Commission, we highlight how the nature of Open Source allows governments and public authorities to achieve autonomy without isolationism, trust the software they use, host it where they want, and avoid vendor lock-in.

We also explained how Open Source solutions foster a more competitive market by preventing vendor lock-in and encouraging suppliers to consistently provide the best solutions at the best prices. The ability to freely modify Open Source software also makes it more flexible and adaptable to specific needs, benefiting both public authorities and the wider community.

But above all, we highlighted the societal benefit of procuring Open Source solutions: public authorities spend once, but the benefit of that spending is multiplied, because the money spent on improving the software benefits everyone, from citizens to businesses and even other public authorities. Done right, it can also create economic benefits and high-quality jobs within the EU, contributing to Europe’s digital independence and technological expertise.

Recognize the barriers to Open Source software procurement

Despite the benefits, there are barriers to the procurement of Open Source solutions. For example, in many countries, governments provide templates for public authorities to prepare their tender requests, but these templates are often designed with proprietary software in mind, meaning they contain conditions and questions that aren’t relevant to Open Source suppliers, or even prevent Open Source suppliers from participating.

There are also issues with what public authorities ask for: often they are already locked into proprietary ecosystems, or only know these ecosystems, and so the conditions of their tender mention specific proprietary solutions or standards, excluding Open Source by default.

One other issue is that many of the advantages of Open Source software (control and trust, interoperability, societal benefit, long-term cost and versatility) are not yet considered in the procurement process, whose focus is on immediate cost.

Finally, one other significant challenge is that public authorities don’t necessarily know which suppliers of Open Source software are actively contributing to the Open Source projects they supply. Helping public authorities evaluate if a company is actually contributing upstream is important, because this unlocks many of the benefits of Open Source for public authorities, while making Open Source projects more sustainable.

How to break down the barriers to Open Source software procurement

Experience shows that simply mandating Open Source doesn’t break down these barriers. In our letter we suggest going further, to address the environmental factors that hinder Open Source procurement, such as the prevalence of proprietary standards and software requirements. By prohibiting the use of patent-encumbered or proprietary standards in defining project needs and focusing on standards that meet Open Standards Requirements, we can make it possible for Open Source projects to compete for tenders.

We also suggest that public authorities should be prohibited from requiring specific proprietary software solutions in their tenders, ensuring that the procurement process is open and fair. Additionally, we propose that interoperability, reusability, vendor lock-in, and digital sovereignty should be considered systemic criteria for procurement. Public authorities should also consider exit strategies and the associated costs, evaluating how challenging it would be to migrate data or switch suppliers. This includes considering the total cost of ownership, encompassing not just the initial procurement cost but also the lifetime expenses, including support, upgrades, and potential migration or exit costs.
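To make the total cost of ownership idea concrete, here is a minimal sketch of how two offers could be compared over a contract period. It is an illustration only: the `Offer` fields, the five-year horizon and every figure are assumptions made up for this example, not numbers from our letter or from any real tender.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical cost breakdown for one tender response (illustrative figures)."""
    name: str
    initial_cost: int      # one-off procurement cost
    annual_support: int    # yearly support and maintenance
    annual_upgrades: int   # yearly upgrade cost
    exit_cost: int         # estimated cost of migrating data or switching supplier


def total_cost_of_ownership(offer: Offer, years: int) -> int:
    """Initial cost, plus recurring costs over `years`, plus the exit cost."""
    if years < 0:
        raise ValueError("years must be non-negative")
    recurring = (offer.annual_support + offer.annual_upgrades) * years
    return offer.initial_cost + recurring + offer.exit_cost


# Two made-up offers: one cheap upfront but costly to leave, one the reverse.
offers = [
    Offer("offer_a", initial_cost=100_000, annual_support=20_000,
          annual_upgrades=10_000, exit_cost=150_000),
    Offer("offer_b", initial_cost=140_000, annual_support=25_000,
          annual_upgrades=5_000, exit_cost=20_000),
]

cheapest_upfront = min(offers, key=lambda o: o.initial_cost)
cheapest_overall = min(offers, key=lambda o: total_cost_of_ownership(o, years=5))
print(cheapest_upfront.name, cheapest_overall.name)
```

With these invented figures, the offer with the lowest initial price is not the one with the lowest cost over five years once the exit cost is counted, which is the effect a focus on immediate cost alone would miss.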

Another core proposal we make is to mandate interoperability through open Application Programming Interfaces (APIs). We believe that this can not only help to prevent lock-in and give public authorities a way out of existing lock-ins, but can also give Open Source solutions a chance to compete with proprietary alternatives.

Finally, we recognize and underline the need to guarantee that the chosen supplier actually contributes upstream, and suggest evaluating a supplier’s contribution to the Open Source project as part of the procurement criteria to further encourage a collaborative and sustainable Open Source ecosystem. By providing public authorities with guidelines on how to assess these contributions, we can foster an environment that is not only cost-effective but also innovative and adaptable to future technological advancements.

Moving forward to defend Open Source in the public sector

The OSI will continue to follow the process, and work to educate lawmakers about the barriers to procurement of Open Source software solutions. We’re committed to ensuring Open Source projects and communities have everything they need to continue to grow sustainably.


Open Data and Open Source AI: Charting a course to get more of both

Stefano Maffulli — Mon, 18 Nov 2024 08:00:00 +0000

While working to define Open Source AI, we realized that data governance is an unresolved issue. The Open Source Initiative organized a workshop to discuss data sharing and governance for AI training. The critical question posed to attendees was “How can we best govern and share data to power Open Source AI?” The main objective of this workshop was to establish specific approaches and strategies for both Open Source AI developers and other stakeholders.

The Workshop: Building bridges across “Open” streams

Held on October 10-11, 2024, and hosted by Linagora’s Villa Good Tech, the OSI workshop brought together 20 experts from diverse fields and regions. Funded by the Alfred P. Sloan Foundation, the event focused on actionable steps to align open data practices with the goals of Open Source AI.

Participants, listed below, comprised academics, civil society leaders, technologists, and representatives from organizations like Mozilla Foundation, Creative Commons, EleutherAI Institute and others.

Ignatius Ezeani: University of Lancaster / Nigeria
Masayuki Hatta: Debian, Open Source Group Japan / Japan
Aviya Skowron: EleutherAI Institute / Poland
Stefano Zacchiroli: Software Heritage / Italy
Ricardo Torres: Digital Public Goods Alliance / Mexico
Kristina Podnar: Data and Trust Alliance / Croatia + USA
Joana Varon: Coding Rights / Brazil
Renata Avila: Open Knowledge Foundation / Guatemala
Alek Tarkowski: Open Future / Poland
Maximilian Gantz: Mozilla Foundation / Germany
Stefaan Verhulst: GovLab / USA + Belgium
Paul Keller: Open Future / Germany
Thom Vaughan: Common Crawl / UK
Julie Hunter: Linagora / USA
Deshni Govender: GIZ FAIR Forward – AI for All / South Africa
Ramya Chandrasekhar: CNRS / India
Anna Tumadóttir: Creative Commons / Iceland
Stefano Maffulli: Open Source Initiative / Italy

Over two days, the group worked to frame a cohesive approach to data governance. Alek Tarkowski and Paul Keller of the Open Future Foundation are working with OSI to complete the white paper summarizing the group’s work. In the meantime, here is a quick “tease”—just a few of the many topics that the group discussed:

The streams of “open” merge, creating waves

AI is where Open Source software, open data, open knowledge, and open science meet in a new way. Since OpenAI released ChatGPT, what once were largely parallel tracks with occasional junctures are now a turbulent merger of streams, creating ripples in all of these disciplines and forcing us to reassess our principles: How do we merge these streams without eroding the principles of transparency and access that define openness?

We discovered in the process of defining Open Source AI that the basic freedoms we’ve put in the Open Source Definition and its foundation, the Free Software Definition, are still good and relevant. Open Source software has had decades to mature into a structured ecosystem with clear rules, tools, and legal frameworks. The same goes for open knowledge and open science: while rooted in age-old traditions, they have seen modern rejuvenation through platforms like Wikipedia and the Open Knowledge Foundation. Open data, however, feels less solid: often serving as a one-way pipeline from public institutions to private profiteers, it is now being dragged into a whole new territory.

How are these principles of “open” interacting with each other? How are we going to merge Open Data with Open Source with Open Science and Open Knowledge in Open Source AI?

The broken social contract of data

Data fuels AI. The sheer scale of data required to train models like ChatGPT reveals not just a technological challenge but also a societal dilemma: much of this data comes from us.

OpenAI, for example, “slurps” all the data it can find, and much of it is what we willingly give: the blogs we write; the code we share; the pictures, emails and address books we keep in “the cloud”; and all the other information we give freely to platforms.

We, the people, make the “data,” but what are we getting in exchange? OpenAI owns and controls the machine built with our data, and it grants us access via API until it changes its mind. We are essentially being strip-mined to build a proprietary system that sells access back to us at a price.

We need a different future, one where data empowers communities, not just corporations. That starts with revisiting the principles of openness that underpin the Open Source, open science, and open knowledge movements. The question is: How do we take back control?

Charting a path forward

We want the machine for ourselves. We want machines that the people can own and control. We need to find a way to swing the pendulum back to our meaning of Open. And it’s all about the “data.”

The OSI’s work on the Open Source AI Definition provides a starting point. An Open Source AI machine is one that the people can meaningfully fork without having to ask for permission. For AI to truly be open, developers need access to the same tools and data as the original creators. That means transparent training processes, open filtering code, and, critically, open datasets.

Group photo of the participants in the workshop on data governance, Paris, Oct 2024.

Next steps

The white paper, expected in December, will synthesize the workshop’s discussions and propose concrete strategies for data governance in Open Source AI. Its goal is to lay the groundwork for an ecosystem where innovation thrives without sacrificing openness or equity.

As the lines between “open” streams continue to blur, the choices we make now will define the future of AI. Will it be a tool controlled by a few, or a shared resource for all?

The answer lies in how we navigate the waves of data and openness. Let’s get it right.

UPDATE: Learn more about the white paper here.

Explaining the concept of Data information

Stefano Maffulli — Fri, 14 Jun 2024 13:53:28 +0000

There seems to be some confusion caused by the concept of Data information included in the draft v0.0.8 of the Open Source AI Definition. Some readers may have seen the original dataset included in the list of optional components and quickly jumped to the wrong conclusions. This post clarifies how the draft arrived at its current state, the design principles behind the Data information concept and the constraints (legal and technical) it operates under.

The objective of the Open Source AI Definition

The objective of the Open Source AI Definition is to replicate in the context of artificial intelligence (AI) the principles of autonomy, transparency, frictionless reuse, and collaborative improvement for end users and developers of AI systems. These are described in the preamble.

Following the preamble is the definition of Open Source AI, an adaptation of the definition of Free Software (also known as “the four freedoms”) to AI nomenclature. The preamble and the four freedoms have been co-designed over several meetings and public discussions, online and in-person, and have not recently received significant comments.

The Free Software definition specifies that a precondition to the freedom to study and modify a program is to have access to the source code. Source code is defined as “the preferred form of the program for making changes in.” Draft v0.0.8 contains a description of what’s necessary to enjoy the freedoms to study and modify an AI system. This new section, titled “Preferred form to make modifications to machine-learning systems,” has generated a heated debate.

What is the preferred form to make modifications

The concept of “preferred form to make modifications” focuses on machine learning systems because these systems require data and training to produce a working system. Other AI systems are more easily classifiable as software and don’t require a special definition.

The system analysis phase of the co-design process revealed that studying and modifying machine learning systems requires data, code for training and inference, and model parameters. For the parameters, there’s no ambiguity: an Open Source AI must make them available under terms that respect the Open Source principles (no field-of-use restrictions, no discrimination against people, etc.). For the data and code requirements, the text in the “preferred form to make modifications” section is longer and harder to parse, generating some confusion.

The intent of the code and data requirements is to ensure that end users, deployers and developers of an Open Source AI system have all the tools and instructions to recreate that AI system from scratch, to satisfy the freedoms to study and modify the system. At first glance, it makes sense to suggest that an AI system must release its training datasets under permissive licenses in order to qualify as Open Source AI.

However, on closer examination, it became clear that sharing the original datasets is full of traps. It actually puts Open Source at a disadvantage compared to opaque and proprietary AI systems.

The issue with data

Data is not software: The legal landscape for data is much wider than copyright. Aggregating large datasets and distributing them internationally is an endless nightmare that includes privacy laws, copyright, sui-generis rights, patents, secrets and more. Without diving deeper into legal issues, let’s focus on practical examples to clarify why the distribution of the training dataset is not spelled out as a requirement in the concept of Data information.

The Pile, the open dataset used to train the very open Pythia models, was taken down after an alleged copyright infringement, currently being litigated in the United States. However, the Pile appears to be legal to share in Japan. It’s also unclear whether it can be legally shared in the European Union.
DOLMA, the open dataset used to train the very open OLMo models, was initially released with a restrictive license. It later switched to a permissive one. On further inspection, DOLMA appears to suffer from the same legal uncertainties as the Pile; however, the Allen Institute has not been sued yet.
Training techniques that preserve privacy like federated learning don’t create datasets.

All these cases show that requiring the original datasets creates vagueness and uncertainty in applying the Open Source AI Definition:

If a dataset is only legal in Japan, is that AI Open Source only in Japan?
If a dataset is initially legally available but later retracted, does the AI go from being Open Source to not?
- If so, what happens to the applications that use such AI?
If no dataset is created, then will any AI trained with such techniques ever be Open Source?

Additionally, there are reasons to believe that proprietary systems from OpenAI, Anthropic and others have been trained on the same questionable data found inside The Pile and DOLMA. Proving that’s the case, though, is much harder and more expensive. This is clearly a disincentive to be open and transparent about data sources, adding a burden to the organizations that try to do the right thing.

As a solution to these questions, draft v0.0.8 introduces the concept of Data information, coupled with code requirements, to obtain the expected result: end users, developers and deployers of AI systems being able to reproduce an Open Source AI.

Understanding the concept of Data Information

Data information, in the draft Open Source AI Definition, is defined as:

Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

Read that from the end: The intention of Data information is to allow developers to recreate a substantially equivalent system using the same or similar data. That means that an Open Source AI must disclose all the ingredients, where they’ve been bought and all the instructions to prepare the dish.

This is a solution that came out of the co-design process, where reviewers didn’t rank the training datasets as high as they ranked the training code and data transparency requirements.

Data information and the code requirements also address all of the questions around the legality of distributing data and datasets, or their absence.

If a dataset is only legal in Japan or becomes illegal later, one should still be able to recreate a dataset suitable to train an equivalent system by replacing the illegal or unavailable pieces with similar ones.

AI systems trained with federated learning (where a dataset isn’t created) can still be Open Source AI if all instructions and code are released so that a new training with different data can generate an equivalent system.

The Data information concept also solves an example (raised on the forum) of an AI system trained on data licensed directly from Reddit. In this case, if the original developers released enough information to allow another AI developer to recreate a substantially equivalent system with Reddit data taken from an existing dataset, like Common Crawl, it would be considered Open Source AI.
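As a rough illustration of how these requirements combine, the sketch below models a release as a checklist. It is a simplification for discussion, not the validation process used for the draft: the `Release` fields, the `unmet_requirements` helper and the three example systems are all hypothetical names invented here.

```python
from dataclasses import dataclass


@dataclass
class Release:
    """Hypothetical summary of what an AI system release publishes."""
    name: str
    parameters_available: bool
    field_of_use_restrictions: bool
    training_code_available: bool
    inference_code_available: bool
    data_information_available: bool
    original_dataset_available: bool = False  # optional in draft v0.0.8


def unmet_requirements(release: Release) -> list[str]:
    """Return the required items this release lacks (an empty list means none)."""
    checks = {
        "model parameters": release.parameters_available,
        "terms without field-of-use restrictions": not release.field_of_use_restrictions,
        "training code": release.training_code_available,
        "inference code": release.inference_code_available,
        "Data information": release.data_information_available,
    }
    # The original dataset is deliberately not checked: the draft lists it as optional.
    return [item for item, satisfied in checks.items() if not satisfied]


# Made-up systems, not real models.
releases = [
    Release("system_a", True, False, True, True, True),
    Release("system_b", True, False, True, True, False),
    Release("system_c", True, True, True, True, True),
]

for release in releases:
    print(release.name, unmet_requirements(release))
```

In this sketch, a system that shares parameters, code and Data information passes even without distributing the original dataset, while one that withholds Data information or restricts fields of use does not.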

The proposed alternatives

While generally well received, draft v0.0.8 has been criticized by a few people on the forum for putting the training dataset in the “optional requirements”. Some suggestions and pushback we’ve received:

Require the use of synthetic data when the training dataset cannot be legally shared: This technique may work in some corner cases, if the technology evolves to be reliable enough. It’s expensive and untested at scale.
Classify as Open Source AI systems where all their components are “open source”: This approach is not rooted in the longstanding practice of the GNU project to accept system library exceptions and other compromises in exchange for more Open Source tools.
Datasets built by crawling the internet are the equivalent of theft, so they shouldn’t be allowed at all, let alone allowed in Open Source AI: This pushback ignores the reality that large data aggregators have already legally acquired the rights to accumulate that same data (through scraping and terms of use) and are trading it, exclusively capturing the economic value of what should be in the commons. Read Towards a Books Data Commons for AI Training for more details. There is no general agreement that text and data mining is equivalent to theft.

These demands and suggestions are hard to accept. We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones. We need a Definition that contains positive examples from the start so we can practically demonstrate positive qualities to policymakers.

The discussion about data, and about how to create incentives for datasets that can be distributed internationally and safely while preserving privacy, is extremely complex. It can be addressed separately from the Open Source AI Definition. In collaboration with Open Future Foundation and others, OSI is designing a series of conferences to tackle the data governance issue. We’ll make an announcement soon.

Have your say now

The concept of Data information and code requirements is hard to grasp at first. But the preliminary results of the validation phase confirm that the draft v0.0.8 works as expected: Pythia and OLMo both would be Open Source AI, while Falcon, Grok, Llama, and Mistral would not (even if they used OSD-compatible licenses) because they don’t share Data information. BLOOM and StarCoder would fail because of field-of-use restrictions in their models.

Data information can be improved but it’s better than other solutions proposed so far. As we get closer to the release of the stable version of the Open Source AI Definition, we need to hear from you: If you support this concept please comment on the forum today. If you don’t support it, please try to propose an alternative that at least covers the practical examples of the Pile, DOLMA and federated learning above. Help the community move the conversation forward.

Contributions of Open Source to AI: a panel discussion at CPDP-ai conference

Stefano Maffulli — Tue, 04 Jun 2024 09:00:00 +0000

I participated as a panelist at the CPDP-ai 2024 conference in Brussels last week where we discussed the significant contributions of Open Source to AI and highlighted the specific properties that differentiate Open Source AI from proprietary solutions. Representing the Open Source Initiative (OSI), the globally recognized non-profit that defines the term Open Source, I emphasized the longstanding principle of granting users full agency and control over technology, which has been proven to deliver extensive social benefits.

Below is a glimpse at the questions and answers posed to me and my fellow panelists:

Question: Stefano, please explain what the contribution to AI from Open Source is, and if there are specific properties of Open Source AI that make a difference for the users and for the people who are confronted with its results.

Response: The Open Source Definition for software has existed for over 25 years, but it doesn’t directly apply to AI. For software, it provides a stable north star for all participants in the digital ecosystem, from small and large companies to citizens and governments.

The basic principle of the Open Source Definition is to grant to the users of any technology full agency and control over the technology itself. This means that users of Open Source technologies have self-sovereignty of the technical solutions.

The Open Source Definition has demonstrated that massive social benefits accrue when you remove the barriers to learning, using, sharing and improving software systems. There is ample evidence that giving users agency, control and self-sovereignty of their technical choices produces a viable ecosystem based on permissionless innovation. Multiple studies by the EU Commission and Harvard researchers have assigned significant economic value to Open Source Software, all based on that single, clear, understood and approved Definition from 26 years ago.

For AI, and especially the most recent machine learning solutions, it’s less clear how society can maintain self-sovereignty of the technology and how to achieve permissionless innovation. Despite the fact that many people talk about Open Source AI, including the AI Act, there is no shared understanding of what that means, yet!

The Open Source Initiative is in the final stages of a global, multi-stakeholder co-design process to find an unequivocal definition of Open Source AI, and we’re finishing it with a vastly increased knowledge of the AI machine learning space. The current draft of the Open Source AI Definition recognizes that in order to study, use, share and modify AI, one needs to refer to an AI system, not an individual component. The global process has identified the components required for society to maintain control of the technology and these are:

Detailed information about the dataset used to train the system and the code so that a skilled person can train a system with similar capabilities
All the libraries and tools used to run training and inference
The model architecture and the parameters, like weights and biases

Having unrestricted access to all these elements is what makes an AI an Open Source AI.

We’re in the final stretch of the process, starting to gather support for the current draft of the definition.

The most controversial part of the discussion is the role of data in the training. To answer your question about the power of big foreign tech companies, putting aside the hardware requirements, the data is where the fight is. There seem to be two views of the world on data when it comes to AI: One thinks that text and data mining is basically strip mining humanity and all accumulation of data without consent of the rights holders must be made illegal. Another view of the world is that text and data mining for the purpose of training Open Source AI is probably the only antidote to the superpowers of large corporations. These camps haven’t found a common position yet. Japan seems to have made up its mind already, legalizing unrestricted text and data mining. We’ll see where the lawsuits in the US go: whether they ever reach a decision in court or, as I suspect, are settled out of court.

In any case, data, competence and to some extent hardware, are the levers to control the development of AI.

Open Source has been leveling the playing field of technologies. We know from past experience with Open Source software that giving people unrestricted access to the means of digital production enables tremendous economic value. This worked in Europe as well as in China. We think that Open Source AI can have the same effect of generating value while leaving control of the technology in the hands of society.

Question: Big tech companies are important for the development of AI. Apart from the purely technological impacts, there is also economic importance. The European Commission has been very concerned about the Digital Single Market recently, and has initiated legislation such as DSA and DMA to improve competition and market access. Will these instruments be sufficient in view of AI roll-out, thinking also of the recently adopted AI Act? Or will additional attention need to be paid?

Response: Open is the best antidote to the concentration of power. That said, I see these laws as the sticks, and they are very necessary. I’d love us to think also about carrots. We don’t want to repeat the mistakes of the past with the early years of the internet. Open Source software was equally available in the US and Europe but despite that, the few European champions of Open Source haven’t grown big enough to have a global impact. And some of the biggest EU companies aren’t exactly friendly with Open Source either.

Chinese companies have taken a different approach. But in Europe we have talent, and we have an attractive quality of life so we can attract even more. Finding money is never an issue. We need to remove the disincentives to grow our companies bigger, widen the access to the internal EU market and support their international expansion, too.

For example, we need to review European Regulation 1025 on standardization to accommodate Open Source. Regulation 1025 was written at a time when Open Source was considered a “business model” and information and communication technology standards were about voltages in a wire. Today, Open Source is between 80% and 90% of all software and “digital elements” comprise some part of every modern product. Even hardware solutions are dominated by “digital elements.” As such, the approach taken by 1025 is out of date and most likely needs a root-and-branch rethink to properly apply to the world today and the world we anticipate tomorrow.

We need to make sure that the standardization rules required by the Cyber Resilience Act are written together with Open Source champions so the rules don’t exclusively favor the cartel of European patent holders who seek rent instead of innovating. Europe has all the means to be at the center of AI innovation; it embodies the right values of diversity and collaboration.

Closing remarks: We think that Open Source is the best antidote to fight market concentration in AI. Data is where the concentration of power is happening now and it’s in the hands of massive corporations: not only Google, Meta, Amazon, Reddit but also Sony, Warner, Netflix, Getty Images, Adobe … All these companies have already gained access to massive amounts of data, legally. These companies basically own our data, legally: Our pictures, the graph of our circles of friends, all the books and movies…

If we don’t write policies that allow text and data mining in exchange for a real Open Source AI (one that society can fully control), we risk leaving the most powerful AI systems in the hands of the oligopoly that can afford to trade money for access to data.

Why datasets built on public domain might not be enough for AI

Stefano Maffulli — Tue, 07 May 2024 10:00:00 +0000

There is tension between copyright laws and large datasets suitable to train large language models. Common Corpus is a dataset that only uses text from copyright-expired sources to bypass the legal issues. It’s a useful achievement, paving the path to research without immediate risk of lawsuits. I also fear that this approach may lead to bad policies, reinforcing the power of copyright holders; not the small creators but large corporations.

A dataset built on public domain sources

In March 2024, Common Corpus was released as an open access dataset for training large language models (LLMs). Announcing the release, the lead developer Pierre-Carl Langlais said “Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.” The dataset contains 500 billion words in multiple European languages and different cultural heritages. It is a project coordinated by the French startup Pleias and supported by organizations committed to open science such as Occiglot, EleutherAI and Nomic AI, and it is partly funded by the French government. The stated intention of Common Corpus is to democratize access to large quality datasets. It has many other positive characteristics, highlighted also by Open Future’s summary of a talk given by Langlais.

The commons needs more data

The debates sparked by the Deep Dive: AI process on the role of training data highlighted that AI practitioners encounter many obstacles assembling datasets. At the same time, we discovered that tech giants have an incredible advantage over researchers and startups. They’ve been slurping data for decades, have the financial means to go to court and can enter into bilateral agreements to license data. These strategies are inaccessible to small competitors and academics. Accepting that the only path to creating open large datasets suitable to train Open Source AI systems is to use sources in the public domain risks cementing the dominant positions of existing large corporations.

The open landscape already faces issues with big tech and their ability to influence legislation. The big corporations have lobbied to extend the duration of copyright, introduced the DMCA, are opposing the right to repair, and have the resources to continue lobbying and sue any new entrant who they deem to get too close. There are plenty of examples showing an unequal advantage in protecting what they think is theirs. The non-profit Fairly Trained certifies companies “willing to prove that they’ve trained their AI models on data that they own, have licensed, or that is in the public domain,” respecting copyright law: who’s going to benefit from this approach?

Unsuitable for public policies

Initiatives like Common Corpus and The Stack (used to train Starcoder2) are important achievements as they allow researchers to develop new AI systems while mitigating the risk of being sued. They also push the technical boundaries of what can be achieved with smaller datasets that don’t require a nuclear power plant to train new models. But I think they mask the underlying issue: AI needs data and limiting open datasets to only public domain sources will never give them a chance to match the size of the proprietary ones. The lobby for copyright maximalists is always looking for ways to expand scope and extend terms for copyright laws, and when they succeed it is a one-way ratchet. It would be a tragedy for society if legislators listened to their sophistry and made new laws doing this based on the apparent consensus that creators need protection from AI.

The role of data for training machine learning systems is a divisive topic and a complex one. Having datasets like Common Corpus is a very useful way for the science of AI to progress with better sources. For policies, we’d be better off pushing for something like the proposal advanced by Open Future and Creative Commons in their paper Towards a Books Data Commons for AI Training.

CRA standards request draft published

Simon Phipps — Thu, 02 May 2024 12:19:03 +0000

The European Commission recently published a public draft of the standards request associated with the Cyber Resilience Act (CRA). Anyone who wants to comment on it has until May 16, after which comments will be considered and a final request to the European Standards Organizations (ESOs) will be issued. This process is all governed by Regulation (EU) No 1025/2012, which will be discussed in a future post.

The publication of this draft is important for every entity that will have duties under the CRA, namely “manufacturers” and “software stewards.” Conformance with the harmonized standards that emerge from this process will allow manufacturers to CE-mark their software on the presumption it complies with the requirements of the CRA, without taking further steps.

For those who depend on incorporating or creating Open Source software, there is an encouraging new development found here. For the first time in a European standards request, there is an express requirement to respect the needs of Open Source developers and users. Recital 10 tells each standards organization the following:

“where relevant, particular account should be given to the needs of the free and open source software community”

That is made concrete in Article 2 which specifies:

“The work programme shall also include the actions to be undertaken to ensure effective participation of relevant stakeholders, such as small and medium enterprises and civil society organizations, including specifically the open source community where relevant”

Article 3 requires proof that effective participation has been facilitated. The community is going to have to step up to help the ESOs satisfy these requirements—or corporations claiming to speak for the community will do it instead.

OSI applauds the Commission’s steps to include the Open Source community and will be pleased to work with the European standards organizations towards that initial goal of effective representation and consultation. Additionally, the OSI will:

Work with our Affiliates to identify additional suitable participants with relevant skills and experience, and make connections between them and the ESOs.
Assist the Commission in validating responses to Article 3.

Our goal is to ensure that the development and use of Open Source software is at best facilitated and at worst not obstructed by any aspect of the standards development process, the resulting harmonized standards, and the access and IPR terms of those standards.

A comparative view of AI definitions as we move toward standardization

Mia Lykou Lund — Fri, 09 Feb 2024 10:54:00 +0000

Discussions of Artificial Intelligence (AI) regulation will be heating up in 2024, with a provisional agreement on the EU AI Act having been reached in December 2023. The EU AI Act is progressing toward a technology-neutral definition of AI to be applied to future AI systems. In the coming months, multiple states will for the very first time agree on precise legal definitions, which reflect moral considerations about the role that AI will and will not be allowed to play in Europe. Formally defining AI, however, remains an ongoing debate.

Precise definitions within a rapidly expanding field are perhaps not the first things that come to mind when asked about pressing issues concerning AI. However, as its influence grows, arriving at a definition seems essential when considering how to regulate it. Agreeing on what AI is, and what it is not, on a transnational level is proving to be increasingly important. Online spaces rarely respect sovereignty, and the role of AI in public life is expected to increase rapidly.

Different countries and organizations have different definitions, though the AI Act is expected to provide some standardization, not only within the EU but also outside of it due to its influence. Beyond providing a framework for businesses to operate within, the definition also shows what legislators anticipate about how and where AI will act and what it will develop towards. Let’s consider how different organizations and states currently define AI systems.

OECD

So far, the AI Act’s definition of AI systems is expected to follow the OECD’s current definition. This currently seems to be the most influential definition and it reads as follows:

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.

Notably, the OECD’s definition has undergone changes from its first draft to the current one above. The removal of “human-based inputs” and the addition of “decisions” when referring to outputs reflects a potential for vastly limiting human-centred decisions and actions. While acknowledging that different systems vary in their autonomy, this change opens up the potential for full autonomy. This can be controversial, to say the least, and can be expected to feed into the growing concerns of AI alignment. As we await the EU AI Act, if they indeed adopt the same or even a similar definition, it will be interesting to see their definition of personhood, considering the removal of “human-based” under inputs.

ISO

The International Organization for Standardization has defined AI systems as follows:

AI:

set of methods or automated entities that together build, optimize and apply a model (3.1.26) so that the system can, for a given set of predefined tasks (3.1.37), compute predictions (3.2.12), recommendations, or decisions

Note 1 to entry: AI systems are designed to operate with varying levels of automation (3.1.7).

Note 2 to entry: Predictions (3.2.12) can refer to various kinds of data analysis or production (including translating text, creating synthetic images or diagnosing a previous power failure). It does not imply anteriority.

study of theories, mechanisms, developments and applications related to artificial intelligence (3.1.2)

AI System:

engineered system featuring AI (3.1.2)

Note 1 to entry: AI systems can be designed to generate outputs such as predictions (3.2.12), recommendations and classifications for a given set of human-defined objectives.

Note 2 to entry: AI systems can be designed to operate with varying levels of automation.

Here, there is a consideration of what kind of system is covered, notably an engineered one. This is interesting as previous definitions have been somewhat ambiguous about what technologies, in fact, will fall under such legislation. There is also a focus on the cooperation of different entities, without specifying whether they are human or otherwise. Notably, they do not mention the origin or kind of input being processed, though from “varying levels of automation” it can be inferred that the definition covers the balance between human and non-human inputs, thus offering varying levels of autonomy.

South Korea

South Korea also adopted their definition of AI system in their 2023 AI Act, and it reads as follows:

Article 2 (Definitions) As used in this Act, the following terms have the following meanings.

1. “Artificial intelligence” refers to the electronic implementation of human intellectual abilities such as learning, reasoning, perception, judgment, and language comprehension.

2. “Artificial intelligence technology” means hardware technology required to implement artificial intelligence, software technology that systematically supports it, or technology for utilizing it.

While not mentioning AI systems, they attribute human attributes, like perception, to an electronic entity. While not mentioning “decisions,” attributing human characteristics perhaps makes that point redundant, as it can be interpreted as an actor, acting on a similar level as humans. Further, they are expansive on what technology is considered AI, as even a cable providing power can, under their current definition, be classified as a piece of AI technology.

US Executive Order

In the last part of 2023, the Biden administration issued an executive order in which it defined an AI system:

“a machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments. Artificial intelligence systems use machine- and human-based inputs to perceive real and virtual environments; abstract such perceptions into models through analysis in an automated manner; and use model inference to formulate options for information or action.”

Here, the Biden administration merges human and machine-based inputs, highlighting the cooperation between the two actors. And while not legally binding, it shows intent. It shows more caution and perhaps skepticism regarding AI acting autonomously than any of the other major actors. Interestingly, the distinction between virtual and “real” (assuming this means physical, though the wording of it remains problematic) environments shows a similar skepticism about the scope and spheres that the Biden administration is interested in AI occupying. This limits the controversial issue of potential autonomy present in previous definitions, though it also limits communication between systems independently of human inputs, which can prove problematic in practice.

Answers we are excited to see

As we enter into an important legislative year for AI, we are looking forward to getting answers to the following questions regarding the legal definitions of AI systems:

What definition of personhood will accompany the AI systems definition in the AI Act? And what does this mean for the intellectual protection of something entirely made by an AI, considering that it allows for large amounts of autonomy? That is, if it indeed follows the same definition as the OECD.
What kind of technology will be considered to be AI? Will it range from Excel spreadsheets to LLMs? Are we considering “machine-based systems,” an “engineered system” or something else?
Will legislation be strong enough, or perhaps broad enough, to encompass the massive changes AI is currently undergoing? And what predictions can we infer that the EU is making on behalf of the future advancements of AI?

A historic view of the practice to delay releasing Open Source software: OSI’s report

Stefano Maffulli — Wed, 10 Jan 2024 15:00:00 +0000

The Open Source Initiative published today a new report that looks at the history of the business practice of delaying the release of code under freedom-respecting licenses. Since the early days of the Open Source movement, companies have experimented with finding a balance between granting their users the basic freedoms guaranteed by Open Source licenses and capitalizing on their investments in software development. One common approach, albeit with many different flavors, is what this report calls “Delayed Open Source Publication” (DOSP): “the practice of distributing or publicly deploying software under a proprietary license at first, then subsequently and in a planned fashion publishing that software’s source code under an Open Source license.”

The new report titled “Delayed Open Source Publication: A Survey of Historical and Current Practices” was authored by the team of Open Tech Strategies (Seth Schoen, James Vasile and Karl Fogel) based on crowdsourced interviews. Their research was made possible through a donation by Sentry and the financial contributions of OSI individual members.

Like the authors, I found that the historical survey revealed numerous surprises, and what I found even more intriguing are the new questions raised (see Section 7) that beg for more dedicated research.

I encourage you to give it a read and share it with others. We encourage feedback from the community: I hold office hours for OSI members and you can discuss this on Mastodon or LinkedIn.

Download the report.

Open Source AI: Establishing a common ground

Stefano Maffulli — Tue, 28 Nov 2023 13:00:00 +0000

The current draft v. 0.0.3 of the Open Source AI Definition borrows wordings from the GNU Manifesto’s golden rule stating:

If I like a program, I must be able to share it with others who like it.
The GNU Manifesto

The GNU Manifesto refers to “program” (not “AI system”), without the need to define it. When it was published in 1985, the definition of a program was pretty clear. Today’s scene around artificial intelligence is not as clear and there are multiple definitions for AI systems floating around.

The process of finding a shared definition of Open Source AI is only in its infancy. I’m fully aware that for many of us here this is trivial and this phase is almost boring.

But the four workshops revealed that a significant number of people in the rooms did not know the 4 Freedoms nor had any idea that OSI has a formal Open Source Definition. And this happened also at two Open Source-focused events!

Which definition of AI system to adopt

I don’t think the Open Source community should write its own definition of an AI system as there are too many dangers with doing that. Most importantly, adopting a vocabulary foreign to the AI world increases the risks of not being understood or accepted. It’s a lot more effective and will be more palatable to use a widely adopted definition.

The OECD definition of AI system

The Organisation for Economic Co-operation and Development (OECD) published one in 2019 and updated it in November 2023. OECD’s definition has been adopted by the United Nations and NIST, and the AI Act may use it too.

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment
Recommendation of the Council on Artificial Intelligence Adopted on: 22/05/2019; Amended on: 08/11/2023

I discovered a 2022 document of the OECD with a definition slightly amended from the one of 2019. The 2022 OECD Framework for the Classification of AI systems removes the words “or decisions” from their previous definition, saying in note 5:

Experts Working Group decided [“or decisions”] should be excluded here to clarify that an AI system does not make an actual decision, which is the remit of human creators and outside the scope of the AI system
2022 OECD Framework for the Classification of AI systems

The updated definition used by the Experts WG is:

An AI system is a machine-based system that is capable of influencing the environment by producing recommendations, predictions or other outcomes for a given set of objectives. It uses machine and/or human-based inputs/data to:

perceive environments;

abstract these perceptions into models; and

use the models to formulate options for outcomes.

AI systems are designed to operate with varying levels of autonomy (OECD, 2019f[2]).
2022 OECD Framework for the Classification of AI systems

Surprisingly, the version amended in November 2023 by the OECD still uses the words “or decisions”.

The definition of AI system for the US National Institute of Standards and Technology (NIST)

The NIST AI Risk Management Framework slightly modifies the OECD definition and includes the word “outputs”:

The AI RMF refers to an AI system as an engineered or machine-based system that can, for a given set of objectives, generate outputs such as predictions, recommendations, or decisions influencing real or virtual environments. AI systems are designed to operate with varying levels of autonomy (Adapted from: OECD Recommendation on AI:2019; ISO/IEC 22989:2022)
AI Risk Management Framework

The definition of AI system in Europe

To complete the picture, I also looked at the EU. In a document from 2019, in the early days of the legislative process, the expert group on AI suggested the following definition (https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence):

Artificial intelligence (AI) systems are software (and possibly also hardware) systems designed by humans that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from this data and deciding the best action(s) to take to achieve the given goal. AI systems can either use symbolic rules or learn a numeric model, and they can also adapt their behaviour by analysing how the environment is affected by their previous actions.

As a scientific discipline, AI includes several approaches and techniques, such as machine learning (of which deep learning and reinforcement learning are specific examples), machine reasoning (which includes planning, scheduling, knowledge representation and reasoning, search, and optimization), and robotics (which includes control, perception, sensors and actuators, as well as the integration of all other techniques into cyber-physical systems).
High-Level expert group on AI: Ethics guidelines for trustworthy AI

It’s worth noting that this definition is not used in the AI Act. The text of the EU Council suggests this one be used:

‘artificial intelligence system’ (AI system) means a system that

receives machine and/or human-based data and inputs,

infers how to achieve a given set of human-defined objectives using learning, reasoning or modelling implemented with the techniques and approaches listed in Annex I, and

generates outputs in the form of content (generative AI systems), predictions, recommendations or decisions, which influence the environments it interacts with;

which seems to be quite similar to the OECD text.

Why we need to adopt a definition of AI system

There is agreement that the Open Source AI Definition needs to cover all AI implementations and not be specific to machine learning, deep learning, computer vision or other branches. That requires using a generic term. For software, the word “program” covers everything, from assembly to interpreted and compiled languages. “AI system” is the equivalent in the context of artificial intelligence.

“Program” is to software as “AI system” is to artificial intelligence.

In the document What is Free Software, the GNU project describes four fundamental freedoms that the “program” must carry to its users. Draft v. 0.0.3 similarly describes four freedoms that the AI system needs to deliver to its users.

In draft v. 0.0.3 there was debate on the wording of freedom 3, the freedom to modify. For software, that’s the freedom to modify the program to better serve users’ needs, fix bugs, etc. Draft v. 0.0.3 says:

Modify the system to change its recommendations, predictions or decisions to adapt to your needs.
Draft v.0.0.3

The intention in specifying the object of the change is to establish the principle that anyone should have the right to modify the behavior of the AI system as a whole. The words “recommendations, predictions or decisions” come from the definition of AI system: what does the “system” do and what would I want to modify?

That’s why it’s important to say what it is we expect to have the right to modify. Tying that to an agreed-upon definition of what an AI system does is a way to make sure that all readers are on the same page.

We can change the wordings for that bullet point but I think the verb “modify” should refer to the whole system, not individual components.

We’re trying to adopt a definition of an AI system that is widely understood and accepted, even though it’s not strictly correct scientifically. The Open Source AI Definition should align with other policy documents because many communities (legal, policy makers and even academia) will have to align too.

The newest definition of AI system from the OECD is the best candidate, without the words “or decisions.”

Next steps

I met with the Digital Public Goods Alliance in Addis Ababa on November 14. I expected to encounter a different assortment of competences than the ones I’ve met so far, and that was true. How far we are from consensus on basic principles is something I’m contemplating before releasing draft v.0.0.4 and moving on to the next phase of public conversations. For 2024 we’re planning a regular cadence of meetings (online and in-person) and a release roadmap leading to a v. 1.0 before the end of the year. More to come.

To trust AI, it must be open and transparent. Period.

Cristin Zegers — Thu, 14 Sep 2023 15:00:00 +0000

[SPONSOR OPINION]

By Heather Meeker, OSS Capital

Machine learning has been around for a long time. But in late 2022, recent advancements in deep learning and large language models started to change the game and come into the public eye. And people started thinking, “We love Open Source software, so, let’s have Open Source AI, too.”

But what is Open Source AI? And the answer is: we don’t know yet.

Machine learning models are not software. Software is written by humans, like me. Machine learning models are trained; they learn on their own automatically, based on the input data provided by humans. When programmers want to fix a computer program, they know what they need: the source code. But if you want to fix a model, you need a lot more: software to train it, data to train it, a plan for training it, and so forth. It is much more complex. And reproducing it exactly ranges from difficult to nearly impossible.

The Open Source Definition, which was made for software, is now in its third decade, and has been a stunning success. There are standard Open Source licenses that everyone uses. Access to source code is a living, working concept that people use every day. But when we try to apply Open Source concepts to AI, we need to first go back to principles.

For something to be “Open Source” it needs to have one overarching quality: transparency. What if an AI is screening you for a job, or for a medical treatment, or deciding a prison sentence? You want to know how it works. But deep learning models right now are a black box. If you look at the output of a model, it’s impossible to tell how or why the model came up with that output. All you can do is look at the inputs to see if its training was correct. And that’s not nearly as straightforward as looking at source code.

AI has the potential to greatly benefit our world. Now is the first time in history we’ve had the information and technology to tackle our biggest problems, like climate change, poverty and war. Some people are saying AI will destroy the world, but I think it contributes to the hope of saving the world.

But first, we need to trust it. And to trust it, it needs to be open and transparent.

As a consumer you should demand that the AI you use is open. As a developer, you should know what rights you have to study and improve AI. As a voter, you should have the right to demand that AI used by the government is open and transparent.

Without transparency, AI is doomed. AI is potentially so powerful and capable that people are already frightened of it. Without transparency, AI risks going the way of crypto–a technology with great potential that gets shut down by distrust. I hope that we will figure out how to guarantee transparency before that happens, because the problems AI can help us solve are urgent, and I believe we can solve them if we work together.

—-

OSI has gathered a group of leaders who will be presenting ideas around the topic of AI and Open Source in our upcoming Deep Dive: Defining Open Source AI Webinar Series. Registration is free and allows you to attend and ask questions at any or all of the sessions taking place between September 26 and October 12, 2023. REGISTER HERE today!