About Reglab
Reglab is a think tank specializing in research and consultancy that assists companies, business associations and policymakers with data-driven planning and impact analysis. Our focus is on making responsible and strategic decisions, unraveling the regulatory challenges of the media and technology sector.
Our goal is to promote evidence-based research that increases responsibility and establishes meaningful milestones and goals for the ecosystem.
Find out more at www.reglab.com.br
About the Policy Briefs Series
The Policy Briefs Series encompasses studies that evaluate existing or proposed public policies, using qualitative and quantitative data to inform and guide strategic decisions. The objective is to present complex issues in an accessible way, highlighting the main points of analysis, impacts and possible recommendations.
Credits
Executive Director: Pedro Henrique Ramos
Research Coordinator: Marina Garrote
Authors: Pedro Henrique Ramos, Julia de Albuquerque Barreto, Marina Garrote
Researchers: Stephanie Mathias de Souza
Final Layout: Eliza Natsuko Shiroma
Suggested Citation: RAMOS, P. H.; BARRETO, J.; GARROTE, M. Compensation for Copyright in AI: Limits and Implementation Challenges. Policy Briefs Reglab, n. 3. São Paulo: Reglab, 2025.
Executive Summary
The debate on copyright and Generative Artificial Intelligence (IAG) is one of the biggest regulatory issues of the moment, with different perspectives and opinions. Reglab decided to tackle this issue from an undervalued perspective in the regulatory debate: the view of professionals in the STEM areas (science, technology, engineering and mathematics).
To do this, we interviewed computer scientists, software engineers, machine learning experts and university professors, with the aim of understanding how the training of IAG models involves the use of copyrighted content, and what the technical challenges are in enabling remuneration proposals associated with this use.
This is unprecedented research in Brazil. To get an idea of the importance of this data, we examined the 24 hearings of the CTIA – Internal Temporary Commission on Artificial Intelligence in Brazil, which analyzed bill no. 2,338/23, and we observed the low participation of these professionals in the debate on IAG and copyright.
[IMAGE 1 — replace with the corresponding image from the PDF]
Profile of participants and contributions at the CTIA (Internal Temporary Commission on AI)
Word cloud generated using Wordclouds.com, based on information automatically generated by Atlas.ti using the “Concepts” tool.
Among the main findings of the research, we highlight:
- Data selection is a complex and critical process for AI models: not only the quantity, but also the quality and diversity of data have a decisive influence on the performance of the models;
- Although there are technical approaches that allow tracking the flow and origin of data used in training, interviewees indicated that there are still no scalable and reliable solutions to measure the specific contribution of each work in large-scale models. This does not appear to be a market choice, but a structural limitation of current technology, especially in machine learning;
- This happens because machine learning-based models do not store data as a queryable reference bank, but rather as vector patterns, generalized from statistical probabilities, “breaking” data and converting the information obtained into numbers. Determining the exact impact of each work on the final model is therefore, in practice, impossible.
These technical challenges compromise traditional copyright compensation solutions, which essentially depend on quantifying the use of works to establish payments. Without accurate measurement, licensing agreements can end up favoring large rights holders with more legal resources and harming independent creators, who would not be able to measure the use of their works.
Asked about the effects of severe limitations on data availability due to the potential application of licensing and copyright rules in Brazil, experts highlighted that, due to the global nature of the internet, IAG training could easily be carried out in other countries, weakening the national AI ecosystem and rendering local rules ineffective, with negative impacts on the country’s regulatory credibility and competitiveness.
Other impacts highlighted were:
- Reduction in model quality: Models trained with less data may have lower accuracy and less generalization capacity;
- Increased development costs: The need to negotiate individual licensing for each piece of data would make the process more expensive, making it unviable for startups;
- Market concentration: Companies with exclusive access to large datasets would have a competitive advantage, harming open innovation;
- Economic Effects: If “use” is not the main criterion adopted, other measurement methods can generate market distortions, reinforcing structural inequalities in the sector;
- “Flight of AI Centers”: If strict rules are implemented in Brazil, the tendency would be for IAG development centers to leave Brazil for other jurisdictions.
The main contribution of this study is to demonstrate that the technical and realistic understanding of the functioning of IAG models is vital for good regulation, and that it is urgent to expand the participation of STEM professionals in the legislative process.
Introduction
Imagine a person looking for information about Brazil’s economy in the 1990s. Instead of accessing a report or book, they ask an artificial intelligence chatbot. In seconds, the system responds with clear, well-structured and precise text. Although no content is reproduced verbatim, the model was trained with large volumes of texts available on the internet, such as newspaper articles from the time, academic works and Wikipedia entries.
Copyright is a set of rules that protect creators of intellectual works, such as music, texts and images, allowing them to control the use of their creations and to receive remuneration when someone else uses these works. Law 9,610/1998 regulates these rights in Brazil, including their exceptions.
Could the way this chatbot used the texts conflict with the rights of the authors whose texts were used during training? This is a difficult question, and it is part of a much broader debate: the complex relationship between IAG and copyright.
However, these discussions often occur without an analysis of the technical aspects of IAG, and it is with this in mind that we propose this study. Our idea is to translate technical and operational aspects of IAG to offer evidence that can be used in the regulatory debate on IAG and copyright. We do not seek to prioritize political, economic or legal dimensions; after all, it is essential that the debate is guided by multiple perspectives. Still, we believe that the technical dimension, although not the only relevant one, is fundamental for discussions to move forward based on concrete evidence and viable solutions.
What is Generative Artificial Intelligence and why does it matter?
In this work, we consider Generative Artificial Intelligence (IAG) technologies as systems that employ statistical and machine learning techniques to generate new texts, images or other types of content (Daase et al, 2024). Unlike analytical models that interpret, classify and make decisions based on data (Amorim, 2025), IAG systems are capable of generating new data, such as texts and images, using patterns extracted from extensive training databases, the so-called datasets.
Datasets (or training data) are organized sets of data – such as texts, images or videos – used to train IAG systems, and which help the machine to “learn” patterns and improve its responses.
Figure 1. Differences between Analytical and Generative AI models
Analytical AI: input data (entered by the user) → analysis, prediction, descriptive system, etc. → output data (analysis, recommendations, predictions)
Generative AI: commands (user instructions) + training data from language models (LLMs)
Note: IAG systems can also combine analytical AI features.
Source: own elaboration, based on Ramos (2023).
IAG-based systems have seen rapid adoption in the last three years, and their economic impact is significant: a McKinsey study (Chui et al, 2023) projects that the use of these technologies could generate up to 5% additional growth for the global economy over the next five years, directly affecting sectors such as agribusiness, insurance, consumer goods and the pharmaceutical industry.
Figure 2. Future economic impact of generative AI on organizations worldwide in 2023, by economic sector
[IMAGE 2 — replace with the corresponding image from the PDF]
Source: Chui et al (2023).
The IAG economy is not made up of a single actor, but of a structured ecosystem in which different companies perform complementary functions along a chain. We can summarize this in three layers¹:
- infrastructure, made up of hardware manufacturers responsible for high-performance chips and data centers, essential for processing large volumes of information;
- models, made up of companies that develop and license foundational models, such as large language models (LLMs), based on neural networks with billions of parameters trained with vast amounts of data and aimed at text production; and
- applications, the layer in which companies develop and offer software systems that, based on models and infrastructure, provide solutions and services to end users, with chatbots being one of the best-known examples.
¹ Other studies suggest a different division of layers, into four or six (Simmons, A., 2023; Epical, 2024). For didactic purposes, and explicitly based on Benkler’s (2006) model, we chose to simplify it into just three.
When we talk about training data, we are especially concerned with the models layer. But it is important to clarify: the way these datasets are processed by the models is quite different from what happens in applications based on storing or reproducing content, such as streaming services for music or series. (We will see this later, in the presentation of the study results.)
Data Mining, IAG and Copyright
There are at least two debates about copyright and IAG that deserve to be differentiated: one about the protection of works created by AI systems (when AI creates a song, who is the author?), and the other about copyrighted works that are present in the models’ training data. Here we will talk exclusively about this second debate, which started well before IAG (Fill-Flynn et al, 2022).
This is because the practice of data mining, a process that uses statistical methods to identify patterns and correlations between data, began to become popular in the 1990s (Coenen, 2004). The technique of crawling, or data scanning, emerged at the same time: a system scours websites, pages or databases to analyze content, index it and then incorporate it into practices such as data mining.
Crawling, data mining and machine learning are techniques that serve as a basis not only for IAG, but for applications as diverse as internet search engines, price comparison tools, scientific article indexing services and platforms that monitor open government data.
RECAPITULATION:
Crawling: automatic collection of data, such as texts and images, to build databases that will serve as a basis for analysis.
Data mining: process of analyzing large volumes of data to identify patterns and correlations, before or independently of use in training IAG systems.
Data training: stage in which an IAG system learns from data, adjusting parameters to better recognize patterns and generate responses.
Machine Learning: process by which a system continually improves its responses by identifying patterns in data during training.
The importance of these processes is so great that, in recent years, several countries have started to include exceptions for data mining activities in their copyright laws, especially for research or innovation in the public and private sectors:
- In Europe, countries such as the United Kingdom and Germany have broad exceptions for training and data mining, and the European Union recently incorporated specific rules on this topic through the AI Act (Rosati, 2024);
- Japan has stood out for its stance of encouraging the use of data for research and development in the public and private sectors (Ueno, 2025);
- In the United States, the judicial doctrine of fair use has been interpreted as a valid exception for data mining and training; however, recent legal discussions have generated legal uncertainty regarding the interpretation of this concept in the context of IAG, with criticism coming even from important civil society organizations (Noble, 2025);
- In China, there is a scenario of judicial uncertainty similar to that in the USA, although the legislation is somewhat clearer in favor of the exception than US legislation (Karaganis, 2024).
In South America, the legal scenario is very different. Copyright laws in the region have not created specific exceptions for training and data mining, which creates legal uncertainty for investments in data centers in the region and imposes barriers to the development of local technologies (Schirru et al, 2024).
Reliance on models trained in other jurisdictions may limit the ability of Latin American countries to develop technologies aligned with their cultural, linguistic and social realities, and applications in areas such as public health, justice, education or local culture may be especially affected.
Figure 3. Copyright Exceptions for Research, Training and Data Mining
[IMAGE 3 — replace with the corresponding image from the PDF]
Source: Fill-Flynn et al, 2022.
The Moment of Debate in Brazil: PL 2,338/23
The legislative debate on IAG in the country gained new momentum with the approval, by the Federal Senate, of Bill No. 2,338/23 (PL 2,338/23). The bill originated from a draft prepared by a commission of jurists and presented by the president of the Senate, Rodrigo Pacheco, in 2023, and incorporates provisions from seven other legislative proposals, including PL 21/2020, which was approved by the Chamber of Deputies in 2021 but was still being processed in the Senate².
Inspired by the European Union’s AI Act and by normative references in the area of personal data protection, the bill proposes a risk-based obligations regime combined with a set of guarantees for people affected by AI systems. Among the rights guaranteed are access to prior information about interaction with automated systems, the right to privacy and protection of personal data and the right to non-discrimination. For systems classified as high risk, the text also provides for additional safeguards, such as the right to explanation, contestation and human review of automated decisions.
Regarding copyright, PL 2,338/23 adopted a more restrictive approach than the proposal from the European Union and other countries. In summary, articles 62 to 65 of the project:
- create exceptions for AI mining and training only for scientific and educational institutions, museums, archives and libraries, as long as the use is non-commercial and access to the works is legal;
- establish that AI developers must comply with transparency obligations, such as public disclosure of the databases used in training; and
- create remuneration mechanisms, allowing collective or direct negotiation with copyright holders, considering size and economic impact.
The proposal generated immediate reactions. On the one hand, cultural sectors and representatives of creators highlighted the unprecedented nature of the measure and its commitment to guaranteeing copyright protection in the digital age. On the other hand, concerns arose regarding the technical feasibility of the established requirements and the potential impact of the proposal on Brazil’s competitiveness in the global scenario of development and innovation in AI.
Copyright remuneration mechanisms ensure payment for the use of creative works. Some examples include:
- Music: rights holders are paid for public or digital performances. In Brazil, ECAD collects and distributes payments to authors, performers and publishers.
- Audiovisual: screenwriters and directors are paid for the exhibition of works.
“In any type of economic activity there is an input that is fundamental, and whoever coordinates that activity has to pay for it. In the case of artificial intelligence, the main input is creativity, it is what each person was able to create, and that will be mined by the company that will develop the artificial intelligence program, which will also have to pay for it due to the creativity that people insert into their musical, literary production, whatever it may be”
Senator Humberto Costa (PT-PE)³.
“PL 2338 does not follow international trends that seek to achieve a balance between copyright protection and the development of AI. Countries like Singapore and Japan widely allow the training of AI models and systems. The European Union, in turn, adopted more flexible rules that allow computational analysis of publicly available works to enable AI training, recognizing the importance of fostering innovation in this field, while ensuring that rights holders can indicate, through technical means, that they do not allow training on their works.”
ABAG – Brazilian Agribusiness Association⁴.
Figure 4. Comparative table of legislation relating to training AI models
HOW DOES PL 2,338/23 COMPARE WITH THE LEGISLATION OF OTHER COUNTRIES?
| Question | USA | EU | China | Japan | Brazil (PL 2,338/23) |
| --- | --- | --- | --- | --- | --- |
| Can models be trained from publicly available copyrighted works? | Yes, it is “transformative” (fair use doctrine) | Yes, unless the holder has communicated an opt-out | Yes, there is a legal exception for data training, although not an express one | Yes, there is a legal exception for data training | No, training is not an express legal exception if there is a commercial/profit purpose |
| Can copyright holders prevent their works from being used in model training? | No, unless they prove in court that it is not fair use | Yes, technical opt-out (via metadata) or licensing is possible | No, the right to opt out is not provided for by law | No, the right to opt out is not provided for by law | Yes: since the rule is opt-in, companies need to negotiate use before training |
| Can copyright holders demand compensation from companies that use their works in model training? | Uncertain: depends on a court decision | Partial: only if the opt-out is violated | Uncertain: there is ongoing judicial discussion on the topic | Partial: the law provides the right if there is misuse or plagiarism in the output | Partial: the right to compensation exists, but without clear criteria for its calculation |
Source: own elaboration.
-go-to-chamber. Accessed on: 12 May. 2025.
Accessed on: 12 May. 2025.
The Methodological Proposal of this Research
This context reinforces the relevance of this research: at a time when Brazil is discussing its legal framework on AI, public policy makers need to understand in depth the technical dimensions involved. Issues such as the possibility of tracking and quantifying the use of authorial content, attributing its individual contribution to the result of a model, estimating the costs of a possible compensation system and evaluating who would actually benefit from these measures need to be answered based on evidence, before legislative solutions are defined.
This research aims to understand how the training of IAG models involves the use of copyrighted content, and what the technical challenges are related to the feasibility of remuneration proposals associated with this use.
In this study, we combined two methodological approaches. The first is evidence translation, still little explored in digital governance in Brazil, which aims to produce robust and accessible evidence for public decisions (Ingold, 2025).
Whenever we use pink frames, graphs or examples highlighted in the layout, we do so consciously. We know that we run the risk of technical inaccuracies, but we understand that, within the logic of translating complex evidence into applied knowledge, making the content clearer and more accessible is a necessary methodological choice, and a position that we take with transparency.
The second is the qualitative approach. Instead of traditional literature reviews and desk research, we conducted semi-structured interviews to capture the perceptions and experiences of a group often missing from the regulatory debate: STEM professionals (science, technology, engineering and mathematics).
Inspired by reception studies, we seek to understand how these professionals interpret technical challenges regarding the relationship between AI and copyright. Over the course of a month, we conducted eight interviews with experts on the research topic, focusing on senior-level professionals with experience and academic training in the STEM field. The interviews followed pre-defined scripts and confidentiality protocols, and their transcripts and memos were evaluated in the Atlas.ti software using the thematic analysis technique.
Figure 5. Descriptive table of people interviewed
- woman, PhD, data scientist at a large Brazilian company in the software sector
- man, PhD, data scientist and university professor in the field of technology
- man, data scientist at a large Brazilian company in the financial sector
- man, PhD, data scientist and university professor in the field of technology and administration
- man, software engineer and executive at a Brazilian startup
- man, master’s degree, electrical engineer, AI solutions architect at a big tech company
- *woman, artificial intelligence professional at a big tech company
- *man, machine learning consultant at a Brazilian startup
* preliminary interviews
Source: own elaboration.
The complete methodology, with details on the procedures adopted, is at the end of the study.
Main Results
Quantity, quality and diversity of training data impact model performance
Interviewees explained that IAG models are highly dependent on the quality, diversity and quantity of data, and that there is no necessary hierarchy between these factors – this will depend on the objective of a given model.
- Data quality is central: content with errors, biases or incomplete information compromises the model’s inferences and can reproduce distortions or omissions;
- Likewise, data diversity (across languages, cultures, styles and contexts) is essential to ensure inclusive and generalizable responses; and
- The amount of data is also relevant, especially due to the mathematical model adopted: “in neural models, the more data, the better”⁵, said one person interviewed.
However, none of these factors alone guarantees good performance. Models trained with large volumes of homogeneous data, for example, can reproduce biases and present applicability limitations. As one person interviewed said:
“You have to be careful, because quantity is not the number of photos. There’s no point in sending a trillion images if they’re all similar.”
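The interviewee's point about quantity versus variety can be made concrete with a very rough sketch. It assumes, purely for illustration, that each training example is a short text label and that "diversity" is approximated by the number of distinct items; real pipelines would measure similarity between images or documents in far more sophisticated ways. The function name and sample data are hypothetical.

```python
def dataset_summary(items):
    """Return (total items, distinct items) for a list of training examples."""
    return len(items), len(set(items))

# Hypothetical image descriptions standing in for real training images.
large_but_repetitive = ["photo of a beach"] * 1_000_000
small_but_varied = ["photo of a beach", "street market", "night skyline", "forest trail"]

for name, data in [("repetitive", large_but_repetitive), ("varied", small_but_varied)]:
    total, distinct = dataset_summary(data)
    print(f"{name}: {total} items, {distinct} distinct")
```

Counting exact duplicates is only a crude proxy, but it shows why a raw item count says little about how much a dataset can teach a model.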
Some people interviewed mentioned that smaller and more specific models can be even more effective for certain applications, in addition to being more economical. This statement aligns with recent academic experiments that seek to create smaller datasets whose diversity and quality compensate for their limited quantity (Gao et al, 2020; Leffer, 2025).
Interestingly, some interviewees also highlighted that some of the most popular models have already exhausted the public data available on the internet, obtained through crawling. This means that what will differentiate them now will be (i) the technical performance of the models, such as greater processing capacity, innovation in the calculation format and customization, among other factors, or (ii) the incorporation into their databases of datasets that cannot be captured by crawling, but whose quality can be a differentiator. This explains why several companies are seeking to acquire licenses to use historical newspaper collections, which are generally private and not publicly available on the internet (Barcott, 2025).
⁵ In order to preserve the anonymity and confidentiality of research participants, specific changes were made to the quotes presented in this study. In certain circumstances, specific linguistic adaptations were made to preserve the original intention of the interviewees in the textual transcription. The discursive register was maintained whenever possible, respecting the established methodological principles.
These technical findings bring direct lessons to the debate on copyright and IAG, warning that remuneration models based solely on the volume of works used may not capture the real impact of each contribution on the performance of a system. A more balanced approach would need to consider not only the quantity, but also the quality and contextual relevance of the works in training – a huge challenge from a technical point of view, as we will see later.
Not all data can be collected by crawling. This is because many are protected by technical barriers (such as paywalls), require a login and password to access, or have legal restrictions, such as sensitive personal data. Additionally, there is content in formats that are not automatically accessible, such as offline files or private collections. This limits the scope of crawling and requires other forms of access or authorization, which may involve financial agreements between companies.
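Alongside hard barriers such as paywalls and logins, a site can also publish machine-readable rules telling crawlers what they may collect. The sketch below assumes the widely used robots.txt convention as that mechanism, which this study does not name; the bot name, paths and URLs are invented for illustration. It uses Python's standard `urllib.robotparser` module and does not access the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a site; the bot name and paths are invented.
robots_txt = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /subscribers/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def may_crawl(user_agent, url):
    """Return True if the site's robots.txt allows this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(may_crawl("ExampleTrainingBot", "https://example.com/articles/1"))
print(may_crawl("SearchBot", "https://example.com/articles/1"))
print(may_crawl("SearchBot", "https://example.com/subscribers/archive"))
```

Rules of this kind are a request rather than a technical lock: a well-behaved crawler checks them before collecting a page, while a paywall or login blocks access regardless.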
The technical unfeasibility of measuring the contribution of works in IAG
The people interviewed explained that large-scale models do not work by directly indexing the data (as in a library), but operate through statistical patterns extracted from the data. Each work is fragmented into words, which are transformed into billions of vector representations with no direct links to the source files, which are not themselves stored.
Therefore, the attempt to identify how much an individual work contributed to a specific result is technically unfeasible.
This is because, during training, a model analyzes large volumes of data in each dataset to adjust mathematical representations (specific to each model), without necessarily copying or storing the data. In other words: while a music app processes information to reproduce it, IAG systems process information to generalize.
It is because of this generalization that an IAG system trained on millions of images can produce a new visual composition that reproduces common characteristics of works from the 16th century, without replicating any specific work, just incorporating recurring elements from different references.
During training, AI analyzes millions of data points and turns each of them into a mathematical representation, which is also called a vector. These vectors represent characteristics of what was learned. Let’s take an example and see how a word can be transformed into a vector of hundreds (sometimes thousands) of numbers:
dog → [0.2, -1.3, 7.8, -0.4, …]
These numbers have no isolated meaning. What matters is how they relate to other vectors. The word “hot” may not be related to dogs, but the system can attribute some correlation, which will be identified by the repetition of some of the numbers in the vector:
hot → [65.1, -1.12, 32.8, -0.4, …]
Over time, the model learns more and more correlations, that is, how different vectors can connect to each other. It is an intensive process (we’re talking trillions of vectors!) that requires a lot of data processing capacity and extremely complex calculations, carried out by what are known as neural networks, due to their similarity with the functioning of the nervous system of living organisms.
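To illustrate what "relating vectors to each other" can mean in practice, here is a minimal sketch that compares the first four numbers of the two example vectors above. It assumes cosine similarity as the comparison measure, which is a common choice but is not named in this study; the third vector ("rain") is invented as a contrast.

```python
import math

# Toy 4-dimensional vectors; real models use hundreds or thousands of dimensions.
# "dog" and "hot" reuse the illustrative numbers above; "rain" is invented.
vectors = {
    "dog": [0.2, -1.3, 7.8, -0.4],
    "hot": [65.1, -1.12, 32.8, -0.4],
    "rain": [-5.0, 2.0, -1.0, 3.0],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for other in ("hot", "rain"):
    score = cosine_similarity(vectors["dog"], vectors[other])
    print(f"dog vs {other}: {score:.2f}")
```

With these made-up numbers, "dog" ends up closer to "hot" than to "rain". No single number explains the result; only the comparison between whole vectors does.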
Now, let’s imagine that you type, in a chatbot, the following command:
“Complete the following sentence: At lunch, I ordered a dog”
The first thing the IAG model will do is transform this command into vectors, converting words into numerical sequences. Then the model will look for correlations: which of these numbers relate to other numbers that the model already knows. It is a probability calculation: the model does not choose words randomly, but rather picks the most likely next word. It’s as if the model asked itself: “Based on this command, what is the most likely word to come now?”. Let’s simplify and look again at our example vectors:
lunch → [0.2, -1.3, 7.8, -0.4, …]
dog → [65.1, -1.12, 32.8, -0.4, …]
hot → [0.7, -8.3, 7.1, -0.4, …]
The model appears to have found a correlation! If this is the most statistically likely, the model will then respond as follows:
At lunch, I ordered a hot dog.
The most interesting thing is that, even though the model has learned from thousands of sentences and content, this result, as simple as it may seem, is a new combination of words, generated from statistical patterns.
This characteristic distinguishes AI models from other applications, such as streaming services, which work more like digital libraries. In these cases, consumption can be linked to a content unit, making it possible to attribute the use to the result. In IAG techniques, there is no metadata structure or tracking system that allows reconstructing cause-and-effect relationships between input data and output.
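The "most likely next word" step can be sketched in a few lines. This is a simplified illustration, not a description of any specific model: it assumes the model has already produced a raw score for each candidate word (the scores below are invented) and uses the softmax function, a standard way of turning scores into probabilities.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {word: math.exp(s - highest) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical scores a model might assign to candidate words for the
# lunch sentence; the values are invented for illustration.
candidate_scores = {"hot": 4.0, "big": 1.5, "sad": 0.2}

probabilities = softmax(candidate_scores)
most_likely = max(probabilities, key=probabilities.get)
print(most_likely, round(probabilities[most_likely], 2))
```

The output is chosen from probabilities computed at the moment of the request. Nothing in this step looks up or retrieves a stored sentence.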
One of the people interviewed explained this issue:
“I can say that author
Let’s imagine that the model learns to correlate the word “rain” with “sadness”, based on two song lyrics (Song A and Song B) and three books (Book A, Book B and Book C). This correlation is so strong that it will generate a specific vector.
When the model creates the sentence “rain is sadness”, it will be possible to audit and identify that the vector (23;0.4;18.4;80) was used, but it will not be possible to discover which of the 5 data sources contributed to this result, since there was no storage or indexing, only machine learning.
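The loss of provenance described in this example can be shown with a deliberately simplified sketch. It assumes that "training" is nothing more than averaging one small vector per work, which is a stand-in for the far more complex updates real models perform; the works, numbers and function name are all hypothetical.

```python
def train(sources):
    """Average the per-source vectors into one learned vector.

    Only the averaged numbers are kept; the sources themselves are discarded.
    """
    dims = len(next(iter(sources.values())))
    count = len(sources)
    return [sum(vec[i] for vec in sources.values()) / count for i in range(dims)]

# Hypothetical contributions of five works to the "rain ~ sadness" pattern.
run_a = {
    "Song A": [1.0, 4.0], "Song B": [3.0, 2.0],
    "Book A": [2.0, 3.0], "Book B": [2.0, 3.0], "Book C": [2.0, 3.0],
}
# A different split of influence among the same five works.
run_b = {
    "Song A": [2.0, 3.0], "Song B": [2.0, 3.0],
    "Book A": [0.0, 5.0], "Book B": [4.0, 1.0], "Book C": [2.0, 3.0],
}

learned_a = train(run_a)
learned_b = train(run_b)
print(learned_a, learned_b, learned_a == learned_b)
```

Two different distributions of influence produce the same learned vector, so inspecting the vector alone cannot reveal how much each work contributed.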
In other words, the attempt to isolate the influence of a single piece or restricted set of information becomes complex. The idea that it would be possible to calculate the “weight” of an individual work in the performance of a model contradicts the statistical functioning of machine learning systems, which learn through fuzzy patterns and probabilistic recurrences. Another interviewee explained this issue based on the way models understand different languages and transform concepts into vectors:
“In a model, when you write a question in Portuguese or English, the first thing the model will do is take this sentence from the question and turn it into a mathematical representation that is already language agnostic. This is a very beautiful thing. Imagine you take a dog and it will take this to a mathematical vector that means dog in any language. In all languages you will arrive at the same thing.”
The discussion about quality also touches on the idea of cultural value vs. statistical value: A work may have enormous cultural value (e.g. an excerpt from a literary classic), but for the AI model, its specific statistical contribution may be irrelevant – which makes the development of a copyright remuneration system complex. This is the conclusion of a recent experiment carried out by De La Rosa et al (2024), demonstrating that works of fiction are not as decisive in the performance of models.
Let’s imagine that a model was trained with two different datasets: the first has books by author We know that the model used information from both sets to generate the answer. But, as this data was transformed into numbers and statistical patterns during training, there is no way to say which set weighed more in the construction of the final text. Even though both helped, we were unable to measure which was more important. And then the question arises: how can we pay precisely those who contributed the most, if it is not possible to identify the weight of each part?
Finally, some interviewees highlighted that this issue does not seem to be a market choice, but rather a limitation of the current state of the art of the technology.
In this sense, our research showed a recent interest in conferences and academic experiments at universities, such as the working papers by Wang et al. (2024) and Zhang et al. (2025), which use game theory techniques to try to estimate these weights but recognize a series of methodological limitations, such as computational complexity, fragmentation of data across different sources and precise identification of which works are or are not protected by copyright.
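To give a sense of what a game-theoretic estimate of these weights can look like, and why computational complexity is a limitation, the sketch below computes exact Shapley values, a classic cooperative game theory technique for splitting a total payoff among contributors. This is an illustration under stated assumptions, not a reproduction of the methods in the cited papers: the data sources "A" and "B", the quality scores in `TOY_QUALITY` and the function names are all hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values: average each player's marginal contribution
    over every possible order in which players can join the coalition."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical "model quality" obtained when training on each subset of sources.
# In practice, every entry would require training and evaluating a model.
TOY_QUALITY = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.5,
    frozenset({"B"}): 0.3,
    frozenset({"A", "B"}): 0.9,
}


def toy_utility(coalition):
    return TOY_QUALITY[frozenset(coalition)]


values = shapley_values(["A", "B"], toy_utility)
print(values)  # approximately {'A': 0.55, 'B': 0.35}

# With n sources there are 2**n subsets to evaluate and n! orderings to average.
print(2 ** 20)  # 1048576 subsets for only 20 sources
```

The two shares add up to the quality of the full model (0.9), which is what makes the approach attractive for remuneration. The difficulty is scale: an exact computation needs the utility of every subset of sources, which is why practical proposals rely on approximations.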
Data reduction can cause a reduction in the quality of IAG models
It is unanimous among the people interviewed that restrictions on the use of data — due to regulations, costs or legal risks — directly impact the quality of the models. The smaller the database available, the more limited the universe that the model can represent, resulting in products that tend to be poorer in nuance, precision and applicability.
Some people interviewed commented on replacing real data with synthetic data (artificially created to overcome diversity limitations in the database), but were categorical in stating that:
“The accuracy is not as good as if you actually used real data.”
One question that seemed relevant to us in the interviews is that reducing data in Portuguese can cause a significant deterioration in the quality of models for issues of local cultural representation: the Portuguese language represents just under 4% of open content on the internet, while English represents almost half (Statista, 2025) – in other words, there is a real risk that the models will become less relevant to the local audience. As one person interviewed said:
“[If the model] is trained only with data from other countries, because Brazil closes itself off, these models probably won’t be able to work with typical Brazilian problems or with some things that are specific to Brazil. So, if you ask who the champion of the Brazilian championship is, it won’t know, because it can’t use that information, unless it has been published in some other external source that it isn’t using.”
Individual data licensing can make the development of Brazilian models unfeasible
Asked about the impacts of a regulatory system that required mandatory licensing of works for training IAG models, the people interviewed agreed that the impacts would be serious, especially for Brazilian companies. As one interviewee put it:
“The biggest losers would be Brazilian companies, because for us to have an operation in Brazil and carry out this training in Brazil, we wouldn’t be able to do it, it would be unfeasible.”
The concern that arose in the interviews was both in relation to the operationalization and the cost of these licenses. One of the interviewees stated that the diversity of creators and sources on the internet is so great that it would be practically impossible to individually license all content:
“The problem is if I make a law like this, a general law, that I want everyone to compensate, you know? I even doubt how I’m going to compensate everyone. I’m finding a photo there on the internet, another here, another here, which is publicly available and how am I going to compensate each of these people? Like, I think that’s what the problem would be, it would be to make a law that in practice is unworkable.”
The financial impact was also pointed out as a relevant limitation for the emergence of new Brazilian companies, as it would make development “unaffordable for startups”. This scenario compromises the dynamism of the innovation ecosystem, and also creates barriers to the country’s competitiveness on the international stage, as other countries (as we saw in the Introduction) are looking precisely for ways to make the use of training data more flexible. “It would be a problem that could leave a country behind”, as one of the interviewees said.
Market concentration: exclusive access to datasets may benefit only large companies
In a context of strict regulation, companies with large proprietary bases or exclusive access to data could occupy even more dominant positions in the IAG market. According to the people interviewed, this can create distortions on both sides: among content holders and among IAG model developers.
As Barcott (2025) shows, large IAG model companies are already developing exclusive agreements with companies that have large proprietary databases.
The issue, brought up in several interviews, is that the diversity of content on the internet, combined with the difficulty of technical attribution, would make it practically impossible to provide individual financial compensation to small creators, even though their content may be, statistically, even more relevant than that from large databases.
“We are not just talking about niches like media companies, niches like book publishers. I feel like it’s really starting to be a story where almost every website, on the internet now, has the right to charge copyrights if they assume, to be compensated for copyrights, if they assume that the model was trained with that.”
On the development side of IAG technologies, this concentration is already visible internationally, with a few companies dominating the development of models. These companies would be better able to pay the high costs related to licensing, in addition to the costs of the training itself, which are already very high, as one person interviewed highlighted:
“You can provide training, but you also have to invest money, you have to put PhDs working on it, in short, there are a lot of indirect costs.”
“Flight from centers”: a strict rule in Brazil would be easily circumvented, with relevant economic and social effects
Imposing overly restrictive rules on the use of data for AI training can have the side effect of what we call “flight from centers”: the displacement of innovation and investment hubs to countries with more flexible regulations.
People interviewed commented that IAG companies could technically move their activities to jurisdictions where copyright remuneration obligations do not exist. Considering the global and open nature of the internet, this would be very simple to execute – and equally simple to circumvent. As one person interviewed said:
“It’s like you’re going to ban it here, but you’re not going to ban it in the rest of the world (…) so you decide, don’t you want to have the technology here and all your neighbors have it?”
This situation would go against policies encouraging local data centers, and would give a competitive advantage to larger companies, which have greater globally distributed cloud infrastructure and can choose to train models on servers located where the law is more favorable.
“Absolutely. Without a doubt. Think about the following: firstly, where does the training take place? You have different scales (…) [companies today] are training on clouds that exist in several countries. And they have data centers in several countries. So, if the question is, today the technology for training is already in several countries, certainly, there is no doubt that it is and will be even more so.”
Furthermore, if a law is only local, its effectiveness over a foreign entity that makes a model available via the internet is limited. Unless there are total blocks of websites and applications (a serious measure), the foreign company could offer its service to Brazilians anyway – which could affect the credibility of regulation in the country.
“And what I can also do is the following: if I have this care in the United States and I don’t have it in Brazil, because of this restriction, I can send my model to the United States, train the part that I don’t have knowledge of there, bring it back and continue training in Brazil.”
Analysis and Comments
This section analyzes the research results, relating them to academic literature and expert opinions, through the lens of the authors of this work.
After carrying out the interviews and re-examining the text on copyright approved in PL 2338/23, it seemed to us that there was a huge gap between the project proposal and its technical feasibility. What could have motivated this?
It is a difficult question to deduce empirically. However, our exploratory hypothesis is that the legislative debate on AI and copyright in Brazil was conducted without an in-depth understanding of the technology. There are some factors that reinforce this argument.
Firstly, the topic of copyright was not among the most debated topics during the work of the Senate Committee. The analysis of shorthand notes from the 24 sessions of the Internal Temporary Commission on Artificial Intelligence in Brazil of the Federal Senate (CTIA) showed that the debates mainly focused on topics such as protection of personal data, risk classification, definition of systems and impacts on innovation. Although copyright is a relevant dimension of AI regulation, it was discussed significantly less than other topics.
Figure 6. Word cloud from CTIA sessions, using the Atlas.ti software (“Concepts” tool).
[IMAGE 6 — replace with the corresponding image from the PDF]
Source: word cloud generated using the Wordclouds.com software, based on information automatically generated by Atlas.ti, using the “Concepts” tool.
Note: From the automated mapping, a thematic grouping of concepts was carried out, bringing together similar expressions (e.g. legislation and regulation; privacy and personal data). The size of each word reflects the weighted calculation of the frequency of concepts after grouping.
Furthermore, we observed the low presence of STEM professionals in the debate – even smaller when we analyze how many of them discussed, in their speeches, the issue of copyright from their technical perspective:
Figure 7. Profile chart of CTIA participants and contributions
[IMAGE 7 — replace with the corresponding image from the PDF]
Source: own elaboration.
In other words, it seems to us that the absence of technical experts at CTIA may have led to the proposition of measures that do not reflect the reality of how AI models work on the issue of copyright. It also seems to us that this disconnect between regulation and technology is not an isolated problem in Brazil, but a global challenge. To avoid the creation of laws that are inapplicable or that harm the country’s competitiveness, it is essential to institutionalize qualified technical consultation mechanisms based on concrete scientific and economic evidence, guaranteeing feasible guidelines.
Technical challenges in attributing the use of works in IAG systems compromise the economic logic of copyright, which is based on quantifying how protected content is reproduced, distributed or transformed to allocate its remuneration (Watt, 2009). However, this justification, rewarding creators in proportion to the use of their work, falls apart when we observe that IAG systems are unable to reliably track the use of these works.
This disruption has the potential to distort incentives: licensing structures that are not usage-based could exacerbate market concentration, and large rights holders with legal recourse can negotiate licensing deals en masse, while independent creators without bargaining power to prove IAG use can be disadvantaged, marginalizing smaller voices and reducing creative diversity (Martens, 2024).
In the current state of machine learning technology, copyright risks becoming an instrument that excessively harms AI innovation and also inadequately protects human creators.
Addressing these challenges in future studies may require a reconceptualization of both the concept of copyright and the impact of IAG on creative industries through a broader historical perspective. Previous technological disruptions, such as the transition from physical to digital distribution, initially generated controversy but ultimately led to industry transformation rather than decline, fostering new business models and revenue streams that took advantage of new technologies to create additional value (Masnick and Beadon, 2024).
Conclusion
The advancement of IAG raises legitimate questions about how to ensure an ecosystem that balances innovation and social well-being, and regulation emerges as a tool to promote this balance. However, the findings of this study suggest that, for this regulation to be effective, its guidelines must be technically viable.
This study shows that, although it is possible to identify the databases used in model training, there are still no scalable and reliable solutions to measure the specific contribution of each work in large-scale models. Currently, this appears to be a structural limitation of the technology, especially in machine learning.
Therefore, regulatory proposals that do not take this into account may generate arbitrary measurements. Furthermore, the interviews highlighted that excessive restrictions on data usage may create entry barriers for startups, independent researchers and public institutions, favoring market concentration. The possibility of companies relocating their training processes to countries with more permissive regulations should also be considered, as it affects regulatory credibility and the country’s competitiveness.
The findings of this research should not be used in isolation against regulation, or against the appreciation of the rights of creators. On the contrary, they point to the need for evidence-based regulation, which considers the reality of the sector and the technical limits of existing technology.
As a final message, the urgency to also expand the participation of technical experts in the AI regulatory process stands out. The current legislative process demands an effective dialogue with the technical community – not to superimpose its vision on others, but to ensure that public policies reflect the complexity of the systems to be regulated.
Suggestions for Future Studies
This study analyzed the feasibility of copyright compensation systems in AI, but some questions remain open. Below, we list research axes that can deepen the debate and support public policies.
- IAG Economic Impact: There is little evidence on whether IAG generates losses or new opportunities for creators. Research can analyze how different sectors are impacted, evaluate changes in income distribution and test monetization alternatives.
- Creators’ Insights on Using Their Data in AI Training: regulation often ignores creators’ perceptions. Qualitative research with a media reception perspective can explore how creators evaluate the use of their data, their level of acceptance or rejection and their view on the regulatory debate.
- Dynamics of Interest in the Regulatory Debate: Studies can map who the main participants in the legislative process are, how they influence policies, what their agendas of interest are and whether there is a balance in the representation of different sectors.
- Compensation Models: Is There a Viable Path?: Given the lack of traceability, research can evaluate the impacts of estimated compensation models, opt-out, or exceptions to copyright, based on econometric methodologies or assessments of social costs and benefits.
- The Effect of Data Constraints on Innovation and Competitiveness: Studies can measure how restrictions impact the quality of AI models, whether they favor large players and how they encourage the migration of companies to more flexible jurisdictions.
References
AMORIM, P. Analytical AI: A Better Way to Identify the Right AI Projects. Available at: <https://sloanreview.mit.edu/article/analytical-ai-a-better-way-to-identify-the-right-ai-projects/>. Accessed on: 10 May. 2025.
AUDENHOVE, L. V.; DONDERS, K. Talking to People III: Expert Interviews and Elite Interviews. In: VAN DEN BULCK, H.; PUPPIS, M.; DONDERS, K.; VAN AUDENHOVE, L. (Eds.). The Palgrave Handbook of Methods for Media Policy Research. Palgrave Macmillan, 2019.
BARCOTT, B. How the Emerging Market for AI Training Data is Eroding Big Tech’s “Fair Use” Copyright Defense, 2025. Available at: <https://www.techpolicy.press/how-the-emerging-market-for-ai-training-data-is-eroding-big-techs-fair-use-copyright-defense>. Accessed on: 12 May. 2025.
CHUI, M. et al. Economic potential of generative AI. McKinsey, 2023. Available at: <https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier>. Accessed on: 11 May. 2025.
COENEN, F. Data Mining: Past, Present and Future. The Knowledge Engineering Review, vol. 00, p. 0–1, 2004.
DE LA ROSA, J., et al. The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective. arXiv preprint, 2024. Available at: <https://arxiv.org/html/2412.09460v1>. Accessed on: 11 May. 2025.
DAASE, C. et al. On the Current State of Generative Artificial Intelligence: A Conceptual Model of Potentials and Challenges. 26th International Conference on Enterprise Information Systems, 2024.
FIIL-FLYNN, Sean M. et al. Legal reform to enhance global text and data mining research. Science, v. 378, p. 951-953, 2022.
GAO, L. et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint, 2020. Available at: <https://arxiv.org/abs/2101.00027>. Accessed on: 10 May. 2025.
GUEST, G.; BUNCE, A.; JOHNSON, L. How Many Interviews Are Enough? An Experiment with Data Saturation and Variability. Field Methods, 18(1), 59-82, 2006.
HERZOG, C.; HANDKE, C.; HITTERS, E. Analyzing Talk and Text II: Thematic Analysis. In: VAN DEN BULCK, H.; PUPPIS, M.; DONDERS, K.; VAN AUDENHOVE, L. (Eds.). The Palgrave Handbook of Methods for Media Policy Research. Palgrave Macmillan, 2019.
INGOLD, Jo; MONAGHAN, Mark. Evidence translation: an exploration of policy makers’ use of evidence. Policy & Politics, v. 44, no. 2, p. 171-190, 2016.
KARAGANIS, J. Emerging Copyright Governance Frameworks Across the US, China, and Europe. AI, Media & Democracy, 2024. Available at: <https://www.aim4dem.nl/is-ai-training-infringement/>. Accessed on: 12 May. 2025.
LEFFER, L. When It Comes to AI Models, Bigger Isn’t Always Better, 2025. Available at: <https://www.scientificamerican.com/article/when-it-comes-to-ai-models-bigger-isnt-always-better/>. Accessed on: 12 May. 2025.
MARTENS, Bertin. Economic arguments in favor of reducing copyright protection for generative AI inputs and outputs. Working Paper, Bruegel, 2024.
MASNICK, M.; BEADON, L. The Sky Is Rising: A detailed look at the state of the entertainment industries, 2024 Edition. Copia Institute & CCIA Research Center, 2024.
NOBLE, T. AI and Copyright: Expanding Copyright
Hurts Everyone—Here’s What to Do Instead. Electronic Frontier Foundation, 2025.
RAMOS, P. H. (coord). Digital Governance in Focus: strategies for using generative AI in companies. Etech – Law and Technology Study Group. Research Report. São Paulo: Ibmec SP, 2023.
ROSATI, E. Infringing AI: Liability for AI-Generated Outputs under International, EU, and UK Copyright Law. European Journal of Risk Regulation, p. 1–25, 31 Oct. 2024.
SALDAÑA, Johnny. The Coding Manual for Qualitative Researchers. 4th ed. Thousand Oaks: SAGE Publications, 2021.
STATISTA RESEARCH DEPARTMENT. Languages most frequently used for web content, 2025. Available at: <https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/>. Accessed on: 12 May. 2025.
SCHIRRU, L. et al. Text and Data Mining Exceptions in Latin America. IIC – International Review of Intellectual Property and Competition Law, 19 Sep. 2024.
SOBEL, B. L. W. Artificial intelligence’s fair use crisis. Columbia Journal of Law & the Arts, v. 41, p. 45-96, 2017.
UENO, H. Japan’s New Approach to Collaborative International R&D. Issues in Science and Technology. Vol. XLI, Winter, 2025.
WANG, J. T. et al. An Economic Solution to Copyright Challenges of Generative AI. arXiv preprint, 2024. Available at: <https://arxiv.org/abs/2404.13964>. Accessed on: 12 May. 2025.
ZHANG, L. et al. Fairshare Data Pricing for Large Language Models. arXiv preprint, 2025. Available at: <https://arxiv.org/html/2502.00198v1>. Accessed on: 12 May. 2025.
Reglab Methodology Annex
FORMAT: POLICY BRIEF
| Title |
Compensation for Copyright in AI: Limits and Implementation Challenges |
| Research Question |
How does the training of IAG models involve the use of copyrighted content, and what are the technical challenges to the viability of compensation proposals associated with this use? |
| Methodology Summary |
This research adopts a qualitative approach, combining primary data collection through in-depth interviews with experts (expert interviews) and secondary data analysis (documents, literature and practical cases). The methodological choice is based on the exploratory nature of the topic: as it is an emerging subject, with few concrete experiences of remuneration for AI data, it is valuable to capture the perceptions, opinions and knowledge of experts. |
| Data Collection |
Data collection used the expert interview methodology (Audenhove and Donders, 2019), with semi-structured qualitative interviews, of an exploratory nature. The choice of this method is justified by the technical nature of the topic and the lack of systematized data on the problem investigated, making the accumulated knowledge of experts working in the area essential.<br>The sample was defined based on criteria of diversity and representativeness, including: minimum participation of women; presence of representatives from academia or research centers; professionals from Brazilian companies; and experts from large technology companies. The selection combined convenience sampling and snowballing technique.<br>16 people were contacted, of which eight agreed to participate in the research; the others declined due to unavailability. The interviews were carried out between March 12th and 31st, 2025, in an online format (via Teams), with an average duration of between 45 and 60 minutes. Each session was attended by at least two Reglab researchers. The question guide used is attached.<br>Among the interviews, two were conducted on a preliminary basis, with the aim of testing the structure of the script and validating initial hypotheses. These interviews were not included in the coding process, but they contributed substantially to the final collection design. The six interviews analyzed were considered sufficient for theoretical saturation purposes, given that, in qualitative approaches with semi-structured and in-depth interviews, thematic recurrence and analytical density tend to be consolidated with a reduced number of participants (Guest et al., 2006).<br>All interviews were recorded with the participants’ authorization, transcribed in full and accompanied by memos from the interviewers. The material was stored and coded in the Atlas.ti software. The names and institutions of the interviewees were anonymized. |
| Data Analysis |
The data were analyzed using thematic analysis, according to Herzog et al. (2019), with two cycles of inductive coding. The first cycle consisted of open conceptual coding, and the second used pattern coding to group and refine the analytical categories (Saldaña, 2021). The process was carried out using the Atlas.ti software.<br>The choice for thematic analysis is justified by its suitability for exploratory studies that seek to structure and interpret technical information, allowing the identification of conceptual patterns in highly complex contexts. The team adopted a reflective stance throughout the analytical process, recording interpretative memos and systematically discussing potential analytical biases.<br>Themes were defined based on recurrence, conceptual density and relevance to the research objectives. The final categories included, among others: “technical impossibility of attribution”, “remuneration models”, “market concentration”, “traceability limits” and “regulatory impact”. To support critical analysis and triangulation of evidence, Atlas.ti’s visualization, mapping and correlation capabilities were used.<br>The analysis was conducted between April 2 and 15, 2025. |
| Bias Reduction Procedures |
Consolidated theoretical-methodological references: the data collection and analysis techniques adopted in this study followed practices recognized in academic literature. The methodological approach was discussed internally before and after carrying out the preliminary interviews, allowing the incorporation of criticisms and suggestions into the final research design, before the analysis process began.
Open categorization: data coding followed an inductive logic, without pre-defined categories, allowing codes and themes to emerge directly from the empirical material. This methodological choice aimed to minimize interpretative biases arising from previous conceptual impositions.
Triangulation of methods: empirical findings were contrasted with documentary analysis of secondary sources, with the aim of comparing, validating and reinforcing the consistency of interpretations constructed from the interviews. These references were expressly cited throughout the text.
Double validation in critical steps: coding was conducted and reviewed by two researchers in a cross-sectional manner. The final definition of the themes was carried out in a collective discussion between the three authors, ensuring multiple perspectives and control of individual biases in the interpretation of the data.
Recording and methodological transparency: all stages of the analytical process were documented, including successive versions of the files and coding decisions. This practice allows traceability of the methodological path, in accordance with Reglab guidelines for transparency and replicability. |
| Other Methodological Limitations |
Qualitative and non-generalizable scope: the reduced number of interviews prioritized analytical depth, but does not allow statistical inferences.<br>Convenience sampling and contact networks: the selection may have reflected biases in availability and professional circles, despite diversity criteria.<br>Technological and regulatory evolution: the findings reflect the state of the art up to the time of the research and may be impacted by future changes in the sector.<br>Dependence on External Tools: the analysis depended significantly on the use of Atlas.ti software, although it is one of the most consolidated analytical tools in the academic sector. |
| Use of Software |
| SOFTWARE | USE IN RESEARCH |
| MS Office Suite | text editing, spreadsheets and graphs, interviews (Teams) |
| Adobe C Suite | layout and finalization of graphics and illustrations |
| Atlas.ti | organization, coding and analysis of qualitative data |
| Cockatoo | audio-to-text transcription of interviews |
| ChatGPT 4o | brainstorming, information systematization, text review (spelling, grammar, search for synonyms), language adaptation, adaptation to the Reglab Writing Manual |
| Notion AI | text editing and proofreading (spelling and grammar, search for synonyms, language adaptation, translations); organization of research, structuring of schedule |
| Wordclouds | creating word clouds |
| Lex.page | advanced text review (brevity, clichés, readability, passive voice, statements without evidence, repetitions) |
Ethical Guidelines
Research Funding. This publication is part of a series sponsored by the companies Google, Meta and B/Luz, over which Reglab maintains editorial control. Unlike commissioned research, Reglab determined the scope, objectives and methodology of this study with complete autonomy. The authors maintain full professional independence and responsibility for the content and conclusions of this work.
Processing of Personal Data. The research involved the processing of personal data exclusively in the collection and analysis stages, in a limited manner and proportional to the objectives of the study, in accordance with Law No. 13,709/2018 (LGPD).
Legal basis: all participants formally authorized their participation by signing a consent form, being aware of the research objectives and use of data;
Purpose and suitability: the data were used exclusively for the purposes of this research, compatible with the consent obtained, and were not used for other purposes;
Minimization and anonymization: personally identifiable information that was not relevant to the research objectives was anonymized in the transcripts and excluded from the active database;
Secrecy and confidentiality: when presenting the results, the data were kept confidential, and citations were adjusted where necessary to protect the confidentiality of the sources. Only a limited number of researchers directly involved in the project had access to personal data and original documents;
Registration and information security: the files were stored with password access control and in accordance with Reglab’s internal information security policies;
Retention and disposal: the data will be stored for up to 12 months, exclusively for the purposes of methodological auditing and possible replication, and will subsequently be deleted;
Responsible Use of Public Data: Although some of the data analyzed is public, it was used responsibly and ethically, solely for the purposes of independent research.
Methodological Transparency: The research methodology was detailed to ensure transparency and replicability, contributing to scientific integrity and allowing independent validation of results.
Non-discrimination and Respect for Diversity: The research was conducted in a way that respects diversity and avoids any form of discrimination.
ANNEX II – SEMI-STRUCTURED INTERVIEW SCHEDULE
- When we write up the research, how would you like to be referred to? What title can we use?
- To begin with, can you tell us a little about your professional experience and your work in the area of AI/Machine Learning?
- How is data selected and used for training AI models?
- What types of data are most critical to model performance? Are there specific characteristics that make a dataset more valuable?
- If data availability were limited due to data licensing, what would be the technical and practical impacts on model development and performance?
- Thinking about the possibility of data being restricted/unavailable due to copyright restrictions, what would be the impacts for AI models?
- In your experience, is it possible to track and document what specific data was used in training an AI model?
- How is this done in practice?
- Considering the way AI models are trained, how would it be possible to identify and remunerate each work used in the process? What opportunities and challenges are there in this activity?
- Where protected content needs to be used, are there techniques to ensure that AI models do not memorize and reproduce it? Can you explain this issue a little?
- If a mandatory remuneration system for copyrights were implemented, what would be the impacts on companies and startups that develop AI?
- If restrictions on the use of data for AI were implemented in Brazil, could the training of the models simply be transferred to other countries? Does this already happen in other contexts?
- If there were an ideal system to balance the use of data for AI and the protection of copyright, how would it work in your view? Are there viable technical solutions to this problem?
- Is there something we didn’t ask but that you think is important about data use in AI training?