Mia Lykou Lund – Open Source Initiative

Open Source AI Definition – Weekly update September 23

Mia Lykou Lund — Mon, 23 Sep 2024 13:31:52 +0000

Draft v.0.0.9 of the Open Source AI Definition is available for comments

@nemobis points out that the term “skilled person” in the Open Source AI Definition needs clarification, especially when considering different legal systems. He notes that the term could lead to misinterpretations and suggests adjusting the wording to focus on access to data. The term “substantially equivalent system” also requires a more precise definition.
@shujisado adds that in Japan, the term “skilled person” is linked to patent law, which could complicate its interpretation. He proposes using a simpler term, like “person skilled in technology,” to avoid unnecessary debate.
@stefano asks for suggestions for a better alternative to “skilled person,” such as “practitioner” or “AI practitioner.”
@kjetilk jokingly suggests lowering the bar to “any random person with a computer,” emphasizing the importance of accessibility in open source, allowing anyone to engage regardless of formal training.
@samj highlights that byte-for-byte reproducibility is unrealistic, as randomness and hardware variability make exact replication unachievable, similar to how different binaries perform equivalently despite differing checksums.
@samj notes the existence of models like StarCoder2 and OLMo as examples of Open Source AI, refuting the claim that no models meet the standard. He stresses the need for the definition to encourage the development of new models rather than settling for an inadequate status quo.
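
To illustrate @samj’s point about byte-for-byte reproducibility, here is a minimal sketch that is not taken from the forum thread. It assumes a hypothetical helper, `checksum`, and uses a toy three-number sum to stand in for a training computation: changing only the order of floating-point additions, as parallel hardware may do between runs, yields results that are numerically equivalent for practical purposes yet hash differently.

```python
import hashlib
import struct


def checksum(values):
    """Hypothetical helper: SHA-256 over the raw IEEE-754 bytes of the floats."""
    raw = b"".join(struct.pack("<d", v) for v in values)
    return hashlib.sha256(raw).hexdigest()


# The same three numbers, summed in two different orders. Floating-point
# addition is not associative, so the two results differ in the last bit.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)                          # False: not bit-identical
print(abs(a - b) < 1e-12)              # True: equivalent within a tiny tolerance
print(checksum([a]) == checksum([b]))  # False: the checksums differ
```

In a real training run the same effect is compounded across billions of operations, which is why a tolerance-based notion of equivalence is more workable than matching checksums.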

Case-in-Point: Zuckerberg’s blog on Open Source

@kjetilk reflects on Mark Zuckerberg’s blog post about Llama 3.1, where Zuckerberg claims that “Open Source AI Is the Path Forward.” He points out that while it’s easy to agree with Zuckerberg’s sentiment, Llama 3.1 isn’t truly open source and wouldn’t meet the criteria for compliance under the OSAID. This raises important questions about how to engage with Meta: should the open-source community push them away, or guide them toward creating OSAID-compliant models? Furthermore, @kjetilk wonders how this affects perceptions of open source, especially in light of EU legislation and the broader governance issues around open source.
@shujisado responds by noting that the Open Source Initiative (OSI) has already made it clear that Llama 2 (and by extension Llama 3.1) does not meet the Open Source definition, despite Zuckerberg’s claims. He suggests that Zuckerberg might be using a different definition of “open source,” particularly given the unclear legal landscape around AI training data and copyright. In his view, the creation of the Open Source AI Definition (OSAID) is the community’s formal response to Meta’s claims.

Open Source AI Definition Town Hall – September 20, 2024

The seventeenth edition of our town hall meetings was held on the 20th of September. If you missed it, the recording and slides can be found here.

Open Source AI Definition – Weekly update September 16

Mia Lykou Lund — Mon, 16 Sep 2024 23:38:08 +0000

Week 37 summary

Endorse the Open Source AI Definition

OSI invites individuals and organizations to endorse the Open Source AI Definition (OSAID). Endorsers will have their name and affiliation listed in the press release for Release Candidate 1 (RC1), which is expected to be finalized by the end of September. Those endorsing version 0.0.9 will be contacted again to confirm their support if there are any changes leading up to RC1.

Recommended Resources: US Copyright Office Guidance on TDM

@mjbommar encourages reviewing the U.S. Copyright Office’s guidance on text and data mining (TDM) exceptions, which provides clear explanations and limitations, especially focusing on non-commercial, scholarly, and teaching uses. He emphasizes that the TDM guidance operates within narrow parameters that are often misunderstood or overlooked.

Proposal to handle Data Openness in the Open Source AI definition [RFC]

@quaid proposes adding nuance to the Open Source AI (OSAI) Definition by introducing two designations: OSAI D+ (with open data) and OSAI D- (without open data, due to legitimate reasons beyond the creator’s control). He suggests using a dataset certificate of origin (dataset DCO) for self-verification to ensure compliance.
@kjetilk agrees that verification is key but questions whether data information alone is sufficient for verification. He highlights that verifying rights to the data may not always be possible.
@stefano appreciates the quadrant system’s clarity and confirms @quaid’s proposal for OSAI D- to be reserved for those with legitimate reasons for not sharing data.
@thesteve0 expresses skepticism about broadening the “Open Source” label. He argues that without access to both data and code, AI models cannot truly be Open Source and suggests labeling such models as “open weights” instead.
@shujisado notes the importance of data access in AI, pointing out that OSAID requires detailed information about how data is sourced, including provenance and selection criteria. He also discusses potential legal and ethical reasons for not sharing datasets.
@Shamar raises concerns about “openwashing” in AI, where developers might distribute a model with a different dataset, undermining trust. He argues that distinguishing between OSAI D+ and D- risks legal complications for derivative works, suggesting that models without open data should not be considered truly open.
@zack supports the idea of a tiered system (D+ and D-) as an improvement over the current situation, as it incentivizes progress from D- to D+. He is skeptical about verifiability but sees potential in the branding aspect of the proposal.

Welcome diverse approaches to training data within a unified Open Source AI Definition

@stefano asks @arandal about suggested edits, which include renaming data as “source data,” allowing open-source AI developers to require downstream modifications with open data, and permitting downstream developers to use open data to fine-tune models trained on non-public data. He further asks whether @arandal sees training data as being to model weights what source code is to binary code.
@shujisado agrees with @stefano and points out that while many interpret OSD-compliant licenses to include CC4 and CC0, OSI has not officially evaluated Creative Commons licenses for compliance. He highlights concerns about CC0’s patent defense, which could be crucial for datasets.
@mjbommar echoes the concerns about patent defense, noting it as a critical issue in both software and data licensing.
@Shamar supports the first two suggestions but argues that models trained on non-public data cannot meet an “Open Source AI” definition, as they limit the freedom to study and modify, which are core principles of Open Source.

On the current definition of Open Source AI and the state of the data commons

@nick shares an article by Nathan Lambert, reviewed by key figures in the Open Source AI space, discussing the challenges of training data and the current Open Source AI definition. The view of @Percy Liang (on X) is highlighted: he suggests that releasing an entire dataset is neither sufficient nor necessary for Open Source AI. He emphasizes the need for detailed code of the data processing pipeline for transparency, beyond just releasing the dataset.
@shujisado discusses the legal nuances of using U.S. government documents in AI training, emphasizing that while they may be used in the U.S., legal complications arise in other jurisdictions.
@Shamar stresses that Open Source AI should provide all the necessary data and processing information to recreate a system; otherwise, calling it Open Source is “open washing.”

[RFC] Separating concerns between Source Data and Processing Information

@Shamar proposes a clearer distinction between “source data” and “processing information” in the Open Source AI definition to ensure transparency and reproducibility. He suggests source data should be publicly available under the same terms that allowed its original use, while the process used to train the system should be shared under an Open Source license. His formulation aims to prevent loopholes that could lead to open-washing and emphasizes the importance of granting all four freedoms (study, modify, distribute, and use) to qualify as Open Source AI.
@nick disagrees, arguing that @Shamar’s proposal misunderstands the difference between the rights to use data for training and the rights to distribute it. He also challenges the claim that exact replication of AI systems can be guaranteed, even with access to the same data.

Open Source AI Definition Town Hall – September 13, 2024

The sixteenth edition of our town hall meetings was held on the 13th of September. If you missed it, the recording and slides can be found here.

Open Source AI Definition – Weekly update September 9

Mia Lykou Lund — Mon, 09 Sep 2024 17:02:46 +0000

Week 36 summary

Draft v.0.0.9 of the Open Source AI Definition is available for comments

@Shamar agrees with @thesteve0 and emphasizes that AI systems consist of two parts: a virtual machine (architecture) and the weights (the executable software). He argues that while weights are important, they are not sufficient to study or fully understand an AI model. For a system to be truly Open Source, it must provide all the data used to recreate an exact copy of the model, including random values used during the process. Without this, the system should not be labeled Open Source, even if the weights are available under an open-source license. Shamar suggests calling such systems “freeware” instead and ensuring the Open Source AI Definition aligns with the Open Source Definition.
@jberkus questions whether creating an exact copy of an AI system is truly possible, even with access to all the training data, or if slight differences would always exist.
@shujisado explains that under Japan’s copyright law, AI training on publicly available copyrighted works is permissible, but sharing the datasets created during training requires explicit permission from copyright holders. He notes that while AI training within legal limits may be allowed in many jurisdictions, making all training data freely available is unlikely. He adds that the current Open Source AI Definition strikes a reasonable balance given global intellectual property rights but suggests that more specific language might help clarify this further.

Share your thoughts about draft v0.0.9

@marianataglio suggests including hardware specifications, training time, and carbon footprint in the Open Source AI Definition to improve transparency. She believes this would enhance reproducibility, accessibility, and collaboration, while helping practitioners estimate computational costs and optimize models for more efficient training.

Open Source AI Definition Town Hall – September 6, 2024

The fifteenth edition of our town hall meetings was held on the 6th of September. If you missed it, the recording and slides can be found here.

Welcome diverse approaches to training data within a unified Open Source AI Definition

@Alek_Tarkowski agrees with @arandal on the importance of situating Open Source AI within broader open movements like open data. He suggests cooperation with organizations like Creative Commons should go beyond licensing standards to include data governance, which remains an undeveloped area.
@Alek_Tarkowski finds the idea of requiring source data to follow Open Source licenses conceptually interesting, likening it to “upstream copyleft,” but notes traditional copyleft frameworks may not suit AI development.
@arandal clarifies that the proposal is an evolution of software freedom principles, not a direct extension of traditional copyleft, similar to how AGPL addressed gaps left by earlier licenses. They further mention that discussions on these approaches are ongoing across various organizations, though formal publications are limited.

Explaining the concept of Data information

@Senficon highlights a concern from the open science community that, while EU copyright law allows reproductions of protected content for research, it restricts making the research corpus available to third parties. This limits research reproducibility and open access, as it aims to protect rights holders’ revenue.
@kjetilk agrees with the observation but questions the assumption that making content publicly available would significantly harm rights holders’ revenue. He believes such policies should be based on solid evidence from extensive research.

Open Source AI Definition – Weekly update September 2nd

Mia Lykou Lund — Mon, 02 Sep 2024 14:17:42 +0000

Share your thoughts about draft v0.0.9

@mkai added concerns about how OSI will address AI-generated content from both open and closed source models, given current legal rulings that such content cannot be copyrighted. He also suggests clarifying the difference between licenses for AI model parameters and the model itself within the Open Source AI Definition.
@shujisado added that while media coverage of the OSAID v0.0.9 release is encouraging, he is not supportive of the idea of an enforcement mechanism to flag false open source AI. He believes this approach differs from OSI’s traditional stance and suggests it may be a misunderstanding.
@jplorre added that while LINAGORA supports the proposed definition, they propose clarifying the term “equivalent system” to mean systems that produce the same outputs given identical inputs. They also suggest removing the specific reference to “tokenizers” in the definition, as it may not apply to all AI systems.
- @shujisado agreed with the need for clarification on “equivalent system” but noted that identical outputs cannot always be guaranteed in general LLMs. He suggests that this clarification might be better suited for the checklist rather than the OSAID itself.

Draft v.0.0.9 of the Open Source AI Definition is available for comments

@adafruit reconnects with @webmink and proposes updates to the Open Source AI Definition, including adding requirements for prompt transparency and data access during AI training. These updates aim to enhance the ability to audit, replicate, and modify AI models by providing detailed logs, documentation, and public access to prompts used during the training phase.
- @webmink appreciates the proposal but points out that it seems specific to a single approach, suggesting that it may need broader applicability.
@thesteve0 criticizes the current definition, arguing that it does not grant true freedom to modify AI models because the weights, which are essential for using the model, cannot be reproduced without access to both the original data and code. He suggests that models sharing only their weights, especially when built on proprietary data, should be labeled as “open weights” rather than “open source.” He also expresses concern about the misuse of the “open source” label by some AI models, citing specific examples where the term is being abused.

Open-washing and unspoken assumptions of OSS

@pranesh added that it might be helpful to explicitly state that the governance of open-source AI is out of scope for OSAID, but also notes that neither the OSD nor the free software definition explicitly mention governance, so it may not be necessary.
@kjetilk added that while governance issues have traditionally been unspoken, this unspoken nature is a key problem that needs addressing. He suggests that OSI should explicitly declare governance out of scope to allow others to take on this responsibility.
@mjbommar added support for making an official statement that OSI does not intend to control governance, noting concerns that some might fear OSI is moving towards a walled governance approach. He references past regrets about not controlling the “open source” trademark as a means to combat open-washing.
@nick added assurance that OSI has no intention of creating a walled governance garden, reaffirming the organization’s long-standing position against such control.
@shujisado added that there seems to be a consensus within the OSAID process that governance is out of scope, and notes that related statements have already been moved to the FAQ section in recent versions.

Explaining the concept of Data information

@pranesh mentions that, from a legal perspective, the percentage of infringement matters, citing the “de minimis” doctrine and defenses like “fair use” that consider the amount and purpose of infringement. He emphasizes that copyright laws in different jurisdictions vary, and not all recognize the same defenses as in the US.

@mjbommar argues that the scale and nature of AI outputs make the “de minimis” defense irrelevant, especially when AI models generate significant amounts of copyrighted content. He stresses that the economic impact of AI-generated content is a key factor in determining whether it qualifies as transformative or infringes copyright.
@shujisado highlights that in Japan, using copyrighted works for AI training is generally treated as an exception under copyright law, a stance that is also being adopted by neighboring East Asian countries. He suggests that approaches like the EU Directive are unlikely to become mainstream in Asia.
@mjbommar acknowledges the global focus on US/EU laws but points out that many commonly used models are developed by Western organizations. He questions how Japan’s updated copyright laws align with international treaties like WCT/DMCA, expressing concern that they may allow practices that conflict with these agreements.
- @shujisado responds by stating that Japan’s copyright laws, including Article 30-4, were carefully crafted to comply with international standards, such as the Berne Convention and the WIPO Copyright Treaty, ensuring that they meet the required legal frameworks.

Welcome diverse approaches to training data within a unified Open Source AI Definition

@arandal emphasizes the importance of the Open Source Definition (OSD) as a unifying framework that accommodates diverse approaches within the open-source community. She argues that AI models, being a combination of source code and training data, should have their diversity in handling data explicitly recognized in the Open Source AI Definition. She proposes specific text changes to the draft to clarify that while some developers may be comfortable with proprietary data, others may not, and both approaches should be supported to ensure the long-term success of open-source AI.
@mjbommar appreciates the spirit of @arandal’s proposal but adds that the OSI currently lacks specific licenses for data, which is why it is crucial for the OSI to collaborate with Creative Commons. Creative Commons maintains the ecosystem of “data licenses” that would be necessary under the proposed revisions to the Open Source AI Definition.
@arandal agrees with the need for collaboration with organizations like Creative Commons, noting that this coordination is already reflected in checklist v. 0.0.9. She suggests that such collaboration is necessary even without the proposed revisions to ensure the definition accurately addresses data licensing in AI.
@nick acknowledges the importance of working with organizations like Creative Commons and mentions that OSI is in ongoing communication with several relevant organizations, including MLCommons, the Open Future Foundation, and the Data and Trust Alliance. He highlights the recent publication of the Data Provenance Standards by the Data and Trust Alliance as an example of the kind of collaborative work that is being pursued.
@mjbommar reiterates the need for explicit coordination with Creative Commons, arguing that the OSI cannot realistically finalize the Open Source AI Definition without such collaboration. He also suggests that the OSI should explore AI preference signaling and work with Creative Commons and SPDX/LF to establish shared standards, which should be part of the OSAID standard’s roadmap.

Join this week’s town hall to hear the latest developments, give your comments and ask questions.

Open Source AI Definition – Weekly update August 26

Mia Lykou Lund — Mon, 26 Aug 2024 19:33:35 +0000

Week 34 summary

Share your thoughts about draft v0.0.9

As we move toward the release of the first-ever Open Source AI Definition in October at All Things Open, the publication of the 0.0.9 draft brings us one step closer to realizing this goal.

OSAID 0.0.9 draft definition is live!

Changelog includes:
- New Feature: Clarified Open Source Models and Weights
  - Added a new paragraph under “What is Open Source AI” to define “system” as including both models and weights.
  - Clarified that all components of a larger system must meet the standard.
  - Updated paragraph after the “share” bullet to emphasize this point.
- New Section: Open Source Models and Open Source Weights
  - Added descriptions of components for both models and weights in machine learning systems.
  - Edited subsequent paragraphs to eliminate redundancy.
- Training Data: Defined as a Benefit, Not a Requirement
  - Defined open, public, and unshareable non-public training data.
  - Explained the role of training data in studying AI systems and understanding biases.
  - Emphasized extra requirements for data to advance openness, especially in private-first areas like healthcare.
- Separation of Checklist
  - The Checklist is now a separate document from the main Definition.
  - Fully aligned Checklist content with the Model Openness Framework (MOF).
- Terminology Changes
  - Replaced “Model” with “Weights” under “Preferred form to make modifications” for consistency.
- Explicit Reference to Recipients of the Four Freedoms
  - Added specific references to developers, deployers, and end users of AI systems.
- Credits and References
  - Incorporated credit to the Free Software Definition.
  - Added references to conditions of availability of components, referencing the Open Source Definition.

Initial reactions on the forum:
- @shujisado praises the updates in version 0.0.9, particularly the decision to separate the checklist from the main document, which clarifies the intent behind OSAID. He also supports the separation of “code” and “weights,” noting that in Japan, “code” clearly falls under copyright, making this distinction logical. He acknowledges revisions in the checklist that consider the importance of complete datasets, even though he disagrees with making datasets mandatory.

Comments on the draft on HackMD
- @Joshua Gay adds that instead of narrowing the focus to machine-learning systems, the emphasis should be on “parameters” as a whole since weights are just one type of parameter. He suggests a rewrite that highlights making model parameters, such as weights and other settings, available under OSI-approved terms, with examples across various AI models.
  - He further suggests using broader language that covers more AI systems instead of narrower terminology. Specifically, he proposes replacing “Open Source models and Open Source weights” with “Open Source models and Open Source parameters,” and using “AI systems” instead of “machine learning systems.” Additionally, he recommends redefining an AI model to include architecture, parameters like weights and decision boundaries, and inference code, while referring to AI parameters as configuration settings that produce outputs from inputs.
- Under “Open Source models and Open Source weights”, @shujisado adds that the last paragraph titled “Open Source models and Open Source weights” actually explains “AI model” and “AI weights,” leading to a mismatch between the title and content, and notes that these terms are not used elsewhere in the definition.
- Under “Preferred form to make modifications to machine-learning systems”, @shujisado suggests some grammatical corrections.

Next steps
- The OSI has recently presented at the following events:
  - Hong Kong for AI_dev, August 21-23
  - Beijing for Open Source Congress, August 25-27
- Iterate Drafts: Continue refining drafts with feedback from the worldwide roadshow, considering new dissenting opinions.
- Review Licenses: Decide on the best approach for reviewing new licenses for datasets, documentation, and model parameters.
- Enhance FAQ: Continue improving the FAQ to address emerging questions.
- Post-Stable Release Plan: Establish a process for reviewing and updating future versions of the Open Source AI Definition.

Get involved:
- Join the forum and share your opinion.
- Leave a comment on the draft v.0.0.9 with precise feedback.
- Follow the weekly recaps and subscribe to our monthly newsletter.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions, and share your thoughts. The next is on September 6.
- Join the workshops and scheduled conferences.

Explaining the concept of Data information

@Kjetilk points out the legal distinction between using copyrighted works for AI training (reproduction) and incorporating them into publishable datasets, questioning the fairness of allowing exploitative models without compensation while potentially banning those that benefit society.
@Shujisado clarifies that compensation for copyrighted works used in AI training is possible for both open source and closed models, distinguishing it from “royalty,” and notes that Japan’s copyright law exempts such uses for machine learning.
- @Kjetilk reiterates the relevance of “royalty” for compensation in closed, non-published models, suggesting it makes sense under copyright law if required, but if not, it could benefit science and the arts.

Open Source AI Definition Town Hall

The slides and recording from the town hall meeting held on August 23, 2024 are available here.
The next town hall meeting will be held on September 6th. Sign up for the event here.

Open Source AI Definition – Weekly update July 15

Mia Lykou Lund — Mon, 15 Jul 2024 19:26:16 +0000

It has been quiet over the 4th of July weekend on the forums and OSI has been speaking at different events:

@stefano spoke in a panel at the UN event OSPOs for Good. Access the recording here.
@mer is speaking at Open Source Community Africa.
OSI was present at the Linux Foundation hosted AI_dev: Open Source GenAI & ML Summit Europe 2024. Read about the takeaways here.

Why and how to certify Open Source AI

@jberkus expresses concern about the extensive resources required to certify AI systems, estimating that it would take weeks of work per system. This scale makes it impractical for a volunteer committee like License Review.
@shujisado reflects on past controversies over license conformity, noting that Open Source AI has the potential for a greater economic impact than early Open Source. He acknowledges the need for a more robust certification process given this increased significance. He suggests that cooperation from the machine learning community or consortia might be necessary to address technical issues and monitor the certification process neutrally. He offers to help spread the word about OSAID within the Japanese ML/LLM development community.

@jberkus clarifies that the OSI would need full-time paid staff to handle the certifications, as the work cannot be managed by volunteers alone.

Open Source AI Definition – Weekly update July 1

Mia Lykou Lund — Mon, 01 Jul 2024 15:48:07 +0000

An open call to test OpenVLA

Last week @quaid suggested conducting a controlled experiment to determine if data information alone is sufficient to recreate an AI model with fidelity to the original. He shared insights from the OpenVLA project, noting its possible compliance with the requirements of draft v0.0.8 and suggesting a test suite to compare models created with full datasets versus data information.
- To this, @Stefano noted that there are also some master’s students at CMU who are conducting similar experiments to “kick the tires” of the draft definition.
- @quaid proposed more precise criteria for evaluating model similarity, such as “functionally similar” or “practically similar” and further suggested detailing the values sought from open data datasets to improve the experiment’s framework.

Interesting research paper: “Rethinking open source generative AI: open-washing and the EU AI Act”

@hook has shared a research paper they found interesting and relevant, titled “Rethinking open source generative AI: open-washing and the EU AI Act.”
- This paper has been shared before by its author @mark and discussed in the context of whether the OSAID should contain a partially open license. The author argued that doing so would limit open washing, stating that “I think providers and users of LLMs should not be free to create oil spills in our information landscape and I think RAIL provides useful guardrails for that.” This would highlight the “degrees of openness”.
- They too present their findings in a visualization of the degrees of openness of different systems.
  - This is a point we have discussed before and note that the OSAID will not be a partially open license but a binary one. See week 22 summary for the context of this discussion.

Open Source AI Definition Town Hall – June 28, 2024

We held our 12th town hall meeting last week. You can access the recording and slides here if you missed it. The town hall presented some ideas for the next draft of the Definition, making it clear that there is no agreement yet on the data information concept and that part is still subject to change.
A new town hall meeting is scheduled for Friday, July 12.

Open Source AI Definition – Weekly update June 24

Mia Lykou Lund — Mon, 24 Jun 2024 19:36:07 +0000

Explaining the concept of Data information

Following @stefano’s publication regarding why the OSI considers training data to be “optional” under the checklist in Open Source AI Definition, the debate has continued. Here are the main points:

Preferred Form of Modification

@hartmans states that finding agreement on the meaning of “preferred form of modification” depends on the user’s objectives. The disagreement may stem from different priorities in ranking the freedoms associated with open source AI, though he emphasizes prioritizing model weights for practical modifications. He suggested that data information could be more beneficial than raw data for understanding models and urged flexibility in AI definitions.
@shujisado highlighted that training data for machine learning models is a preferred form of modification but questioned if it is the most preferred. He further emphasized the need for a flexible definition for preferred forms of modification in AI.
@quaid supported the idea of conducting controlled experiments to determine if data information alone is sufficient to recreate AI models accurately. He suggested practical steps for testing the effectiveness of data information and encouraged community participation in such experiments.
- @stefano added that some students at CMU will run this kind of experiment (testing whether the full training dataset is needed or whether data information is enough to recreate a model that can be tested for fidelity to the original) to test the definition.
@jberkus raised concerns about the practical assessment of data information and its ability to facilitate the recreation of AI systems. He questioned how to evaluate data information without recreating the AI system.

Practical Applications and Community Insights
- @hartmans proposed practical scenarios where data information could suffice for modifying AI models and suggested that the community’s flexibility in defining the preferred form of modification has been valuable for Debian.
- @quaid shared insights from his research on the OpenVLA project, noting its compliance with OSAID requirements. He further proposed conducting controlled experiments to verify if data information is enough to recreate models with fidelity.

General observations

@shujisado emphasized the need for flexible definitions in AI, drawing from open-source community experiences. He agreed on the complexity of training data issues and supported the flexible approach of OSI in defining the preferred form of modification.
@quaid suggested practical approaches for evaluating data information and its adequacy for recreating AI models and proposed further experiments and community involvement to refine the understanding and application of data information in open-source AI.

Are we evaluating Licenses or Systems?

@jberkus asked whether OSAID will apply to licenses or systems, noting that current drafts focus on systems. He questioned if a certification program for reviewing systems as open source or proprietary is the intended direction.
@shujisado confirmed that discussions are moving towards certifying AI systems and pointed to an existing thread. He emphasized the need for evaluating individual components of AI systems and expressed concern about OSI’s capacity to establish a certification mechanism, highlighting that it would significantly expand OSI’s role.

Open Source AI Definition – Weekly update June 17

Mia Lykou Lund — Mon, 17 Jun 2024 16:52:03 +0000

Explaining the concept of Data information

After much debate regarding training data, @stefano published a summary of the positions expressed and some clarifications about the terminology included in draft v.0.0.8. You can read the rationale about it and share your thoughts on the forum.
Initial thoughts:
- @Senficon (Felix Reda) adds that while the discussion has highlighted the case for data information, it’s crucial to understand the implications of copyright law on AI, particularly concerning access to training data. Open Source software relies on a legal element (copyright licenses) and an access element (availability of source code). However, this framework does not seamlessly apply to AI, as different copyright regimes allow text and data mining (TDM) for AI training but not the redistribution of datasets. This discrepancy means that requiring the publication of training datasets would make Open Source AI models illegal, despite TDM exceptions that facilitate AI development. Also, public domain status is not consistent internationally, complicating the creation of legally publishable datasets. Consequently, a definition of Open Source AI that requires releasing datasets would impede collaborative improvements and limit practical significance. Emphasizing data information can help maintain Open Source principles without legal pitfalls.

Concerns and feedback on anchoring on the Model Openness Framework

@amcasari expresses concern about the usability and neutrality of the “Model Openness Framework” (MOF) for identifying AI systems, suggesting it doesn’t align well with current industry practices and isn’t ready for practical application without further feedback and iteration.
@shujisado points out that the MOF’s classification of components doesn’t depend on the specific IP laws applied, but rather on a general legal framework, and highlights that Japan’s IP law system differs from the US and EU, yet finds discussions based on the OSD consistent.
@stefano emphasizes the importance of having well-thought-out, timeless principles in the Open Source AI Definition document, while viewing the Checklist as a more frequently updated working document. He also supports the call to see practical examples of the framework in use and proposes separating the Checklist from the main document to reduce confusion.

Initial Report on Definition Validation

Reviews of eleven different AI systems have been published. We do these reviews to check existing systems’ compatibility with our current definition. These are the systems in question: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phi-2, Pythia, and T5.
- @mer has set up a review sheet for the Viking model upon request from @merlijn-sebrechts.
- @anatta8538 asks if MLOps is considered within the topic of the Model Openness Framework and whether CLIP, an LMM, would be consistent with the OSAID.
- @nick clarifies that the evaluation focuses on components as described in the Model Openness Framework, which includes development and deployment aspects but does not cover MLOps as a whole.

Why and how to certify Open Source AI

@Alek_Tarkowski agrees that certification of open-source AI will be crucial under the AI Act and highlights the importance of defining what constitutes an Open Source license. He points out the confusion surrounding terms like “free and open source license” and suggests that the issue of responsible AI licensing as a form of Open Source licensing needs resolution. He notes that some restrictive licenses are gaining traction and may need consideration for exemption from regulation, and he therefore urges consensus.

Open Source AI Definition Town Hall – June 14, 2024

Slides and the recording of our previous town hall meeting can be found here.

Open Source AI Definition – Weekly update June 10

Mia Lykou Lund — Tue, 11 Jun 2024 21:40:15 +0000

Open Source AI needs to require data to be viable

With many different discussions happening at once, here are the main points:
- On the issue of training data
  - @mark is concerned that openness in AI is not meaningful without a focus on the training data: “Model weights are the most inscrutable component of current generative AI, and providers that release only [the weights] should not get a free ‘openness’ pass.”
  - @stefano agrees with all of that but questions the criteria used to assign green marks in Mark’s paper, pointing out inconsistencies. He uses the example of Pythia-Chat-Base-7, which relies on a dataset from OpenDataHub with potential issues like non-versioned data and stale links, and so fails to meet the stringent requirements proposed by @juliaferraioli. Similar concerns are raised for other models like OLMo 7B Instruct, which lack specific data versioning details. @stefano also highlights the case of Pythia-7B, which may once have been compliant but is now problematic because its foundational dataset, the Pile, is no longer available. This illustrates how hard it would be to maintain an “open source” status over time if the stringent proposal suggested by @juliaferraioli and the AWS team were adopted.
  - @shujisado adds that while he sympathizes with @juliaferraioli’s request for datasets, @stefano’s arguments in support of the concept of “Data information” are aligned with the OSI principles and are reasonable.
  - @spotaws stresses that “data information” alone is insufficient if the data itself is too vague.
  - @juliaferraioli adds that while replicating AI systems like OLMo or Pythia may seem impractical due to costs and statistical nature, the capability is crucial for broader adoption and consistency. She finds the current definition to be unclear and subjective.
  - @zack recommends reviewing StarCoder2, noting that it would fall in the same category as BLOOM: a system with lots of transparency and a dataset made available, but released with a restrictive license.
  - @Ezequiel_Lanza joined the conversation in support of the concept of Data information, arguing on technical grounds that “sharing the dataset is not necessarily required and may not justify the potential risks associated with making it mandatory.”
- Partially open / restrictive licenses
  - Continuing @mark’s points regarding restrictive licenses (like the ethical licenses), @stefano has added a link to an article highlighting some reasons why OSI is staying away from these licenses.
  - @pchestek further adds that a partially open license would create even more opportunities for open washing, as “open source AI” could have many meanings.
  - @mark clarified that rather than proposing a variety of meanings, they are seeking to highlight the dimensions of openness in their paper, exploring the broader landscape.
  - @stefano adds that in its 26 years, OSI has contended with numerous organizations claiming varying degrees of openness as “open source.” This issue is now mirrored in AI, as companies seek the market value of being labeled Open Source. Open Source is binary: either users have full rights or they don’t, and any system that falls short is not Open Source AI, regardless of how “almost” open it is.
- Field of use/restriction
  - @juliaferraioli believes that OSAID should include prohibitions against field-of-use restrictions.
  - @shujisado adds that OSAID specifies four freedoms as requirements for being considered open source and that this should be understood as amounting to the same thing, since “freedom” means “non-restricted”. The 10 clauses of the OSD have been replaced by the checklist in draft v0.0.8.
  - @juliaferraioli adds that individual components may be covered by their individual licenses, but the overall system may be subject to additional terms, which is why we need this to be explicit.

Initial Report on Definition Validation

@Mer has reported how far along our system analysis is when compared against our current draft definition. Some points that remain incomplete have been highlighted.
Mistral (Mixtral 8x7B) is considered not in alignment with the OSAID because its data pre-processing code is not released under an OSI-approved license.

Can a derivative of non-open-source AI be considered Open Source AI?

@tarek_ziade shares his experience fine-tuning a “small” model (200M parameters) for a Firefox feature to describe images, using a base model for image encoding and text decoding. Despite not having 100% traceability of upstream data, Tarek argues that intentional fine-tuning and transparency make the new fine-tuned model open source. Any issues arising from downstream data can be addressed by the project maintainers, maintaining the model’s open source status.

Town hall recording out

We held our 10th town hall meeting a week and a half ago. You can access the recording here if you missed it.
A new town hall meeting is scheduled for this Friday, June 14.