training Archives - AI News

Study claims OpenAI trains AI models on copyrighted data (2 April 2025)

A new study from the AI Disclosures Project has raised questions about the data OpenAI uses to train its large language models (LLMs). The research indicates the GPT-4o model from OpenAI demonstrates a “strong recognition” of paywalled and copyrighted data from O’Reilly Media books.

The AI Disclosures Project, led by technologist Tim O’Reilly and economist Ilan Strauss, aims to address the potentially harmful societal impacts of AI’s commercialisation by advocating for improved corporate and technological transparency. The project’s working paper highlights the lack of disclosure in AI, drawing parallels with financial disclosure standards and their role in fostering robust securities markets.

The study used a legally-obtained dataset of 34 copyrighted O’Reilly Media books to investigate whether LLMs from OpenAI were trained on copyrighted data without consent. The researchers applied the DE-COP membership inference attack method to determine if the models could differentiate between human-authored O’Reilly texts and paraphrased LLM versions.
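For illustration, here is a minimal sketch of how a DE-COP-style quiz can be scored: the model is repeatedly shown the verbatim passage alongside paraphrased alternatives, its hit rate becomes a recognition score, and scores for suspected in-training versus held-out samples are compared via AUROC. The `ask_model` helper, prompt wording, and trial counts below are placeholder assumptions, not details taken from the study.

```python
# Illustrative sketch only; ask_model is a hypothetical helper that returns the
# index of the option the LLM picks. Not the study's actual DE-COP implementation.
import random
from sklearn.metrics import roc_auc_score

def quiz_recognition_score(passage, paraphrases, ask_model, n_trials=8):
    """Fraction of trials in which the model identifies the verbatim passage
    among shuffled paraphrased alternatives."""
    hits = 0
    for _ in range(n_trials):
        options = paraphrases + [passage]
        random.shuffle(options)
        choice = ask_model("Which passage is the verbatim original?", options)
        hits += int(options[choice] == passage)
    return hits / n_trials

def membership_auroc(suspected_member_scores, non_member_scores):
    """AUROC over recognition scores: ~0.5 means no recognition signal,
    higher values mean the model separates in-training from unseen text."""
    labels = [1] * len(suspected_member_scores) + [0] * len(non_member_scores)
    return roc_auc_score(labels, suspected_member_scores + non_member_scores)
```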

Key findings from the report include:

  • GPT-4o shows “strong recognition” of paywalled O’Reilly book content, with an AUROC score of 82%. In contrast, OpenAI’s earlier model, GPT-3.5 Turbo, does not show the same level of recognition (AUROC score just above 50%)
  • GPT-4o exhibits stronger recognition of non-public O’Reilly book content compared to publicly accessible samples (82% vs 64% AUROC scores respectively)
  • GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples than non-public ones (64% vs 54% AUROC scores)
  • GPT-4o Mini, a smaller model, showed no knowledge of public or non-public O’Reilly Media content when tested (AUROC approximately 50%)

The researchers suggest that access violations may have occurred via the LibGen database, as all of the O’Reilly books tested were found there. They also acknowledge that newer LLMs are better at distinguishing human-authored from machine-generated language, but note that this does not reduce the method’s ability to classify data.

The study highlights the potential for “temporal bias” in the results, due to language changes over time. To account for this, the researchers tested two models (GPT-4o and GPT-4o Mini) trained on data from the same period.

The report notes that while the evidence is specific to OpenAI and O’Reilly Media books, it likely reflects a systemic issue around the use of copyrighted data. It argues that uncompensated training data usage could lead to a decline in the internet’s content quality and diversity, as revenue streams for professional content creation diminish.

The AI Disclosures Project emphasises the need for stronger accountability in AI companies’ model pre-training processes. They suggest that liability provisions that incentivise improved corporate transparency in disclosing data provenance may be an important step towards facilitating commercial markets for training data licensing and remuneration.

The EU AI Act’s disclosure requirements could help trigger a positive disclosure-standards cycle if properly specified and enforced. Ensuring that IP holders know when their work has been used in model training is seen as a crucial step towards establishing AI markets for content creator data.

Despite evidence that AI companies may be obtaining data illegally for model training, a market is emerging in which AI model developers pay for content through licensing deals. Companies like Defined.ai facilitate the purchasing of training data, obtaining consent from data providers and stripping out personally identifiable information.

The report concludes that, based on the sample of 34 proprietary O’Reilly Media books, the study provides empirical evidence that OpenAI likely trained GPT-4o on non-public, copyrighted data.

(Image by Sergei Tokmakov)

See also: Anthropic provides insights into the ‘AI biology’ of Claude

AI & Big Data Expo banner, a show where attendees will hear more about issues such as OpenAI allegedly using copyrighted data to train its new models.

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

The post Study claims OpenAI trains AI models on copyrighted data appeared first on AI News.

]]>
https://www.artificialintelligence-news.com/news/study-claims-openai-trains-ai-models-copyrighted-data/feed/ 0
Microsoft and OpenAI probe alleged data theft by DeepSeek https://www.artificialintelligence-news.com/news/microsoft-and-openai-probe-alleged-data-theft-deepseek/ https://www.artificialintelligence-news.com/news/microsoft-and-openai-probe-alleged-data-theft-deepseek/#respond Wed, 29 Jan 2025 15:28:41 +0000 https://www.artificialintelligence-news.com/?p=17009 Microsoft and OpenAI are investigating a potential breach of the AI firm’s system by a group allegedly linked to Chinese AI startup DeepSeek. According to Bloomberg, the investigation stems from suspicious data extraction activity detected in late 2024 via OpenAI’s application programming interface (API), sparking broader concerns over international AI competition. Microsoft, OpenAI’s largest financial […]

Microsoft and OpenAI are investigating a potential breach of the AI firm’s system by a group allegedly linked to Chinese AI startup DeepSeek.

According to Bloomberg, the investigation stems from suspicious data extraction activity detected in late 2024 via OpenAI’s application programming interface (API), sparking broader concerns over international AI competition.

Microsoft, OpenAI’s largest financial backer, first identified the large-scale data extraction and informed the ChatGPT maker of the incident. Sources believe the activity may have violated OpenAI’s terms of service, or that the group may have exploited loopholes to bypass restrictions limiting how much data they could collect.

DeepSeek has quickly risen to prominence in the competitive AI landscape, particularly with the release of its latest model, R1, on 20 January.

Billed as a rival to OpenAI’s ChatGPT in performance but developed at a significantly lower cost, R1 has shaken up the tech industry. Its release triggered a sharp decline in tech and AI stocks that wiped billions from US markets in a single week.

David Sacks, the White House’s newly appointed “crypto and AI czar,” alleged that DeepSeek may have employed questionable methods to achieve its AI’s capabilities. In an interview with Fox News, Sacks noted evidence suggesting that DeepSeek had used “distillation” to train its AI models using outputs from OpenAI’s systems.

“There’s substantial evidence that what DeepSeek did here is they distilled knowledge out of OpenAI’s models, and I don’t think OpenAI is very happy about this,” Sacks told the network.  

Model distillation involves training one AI system using data generated by another, potentially allowing a competitor to develop similar functionality. This method, when applied without proper authorisation, has stirred ethical and intellectual property debates as the global race for AI supremacy heats up.  
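In practice, distillation of this kind boils down to querying a “teacher” model and fine-tuning a “student” on the resulting prompt-and-response pairs. The sketch below is a generic illustration of that loop under assumed placeholder functions; it describes the technique in the abstract, not anything DeepSeek or OpenAI has confirmed doing.

```python
# Generic sketch of output-based ("black-box") distillation; teacher_generate and
# fine_tune are placeholders for whichever teacher API and supervised fine-tuning
# routine a practitioner would actually use.
def build_distillation_dataset(teacher_generate, prompts):
    """Query the teacher model and keep (prompt, completion) pairs as targets."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

def distill(student_model, dataset, fine_tune):
    """Standard supervised fine-tuning of the student on the teacher's outputs,
    so the student learns to imitate the teacher's behaviour."""
    return fine_tune(student_model, dataset)
```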

OpenAI declined to comment specifically on the accusations against DeepSeek but acknowledged the broader risk posed by model distillation, particularly by Chinese companies.  

“We know PRC-based companies — and others — are constantly trying to distill the models of leading US AI companies,” a spokesperson for OpenAI told Bloomberg.  

Geopolitical and security concerns  

Growing tensions around AI innovation now extend into national security. CNBC reported that the US Navy has banned its personnel from using DeepSeek’s products, citing fears that the Chinese government could exploit the platform to access sensitive information.

In an email dated 24 January, the Navy warned its staff against using DeepSeek AI “in any capacity” due to “potential security and ethical concerns associated with the model’s origin and usage.”

Critics have highlighted DeepSeek’s privacy policy, which permits the collection of data such as IP addresses, device information, and even keystroke patterns—a scope of data collection considered excessive by some experts.

Earlier this week, DeepSeek stated it was facing “large-scale malicious attacks” against its systems. A banner on its website informed users of a temporary sign-up restriction.

The growing competition between the US and China, particularly in the AI sector, has underscored wider concerns regarding technological ownership, ethical governance, and national security.

Experts warn that as AI systems advance and become increasingly integral to global economic and strategic planning, disputes over data usage and intellectual property are only likely to intensify. Accusations such as those against DeepSeek amplify alarm over China’s rapid development in the field and its potential quest to bypass US-led safeguards through reverse engineering and other means.  

While OpenAI and Microsoft continue their investigation into the alleged misuse of OpenAI’s platform, businesses and governments alike are paying close attention. The case could set a precedent for how AI developers police model usage and enforce terms of service.

For now, the response from both US and Chinese stakeholders highlights how AI innovation has become not just a race for technological dominance, but a fraught geopolitical contest that is shaping 21st-century power dynamics.

(Image by Mohamed Hassan)

See also: Qwen 2.5-Max outperforms DeepSeek V3 in some benchmarks

Ai2 OLMo 2: Raising the bar for open language models (27 November 2024)

Ai2 is releasing OLMo 2, a family of open-source language models that advances the democratisation of AI and narrows the gap between open and proprietary solutions.

The new models, available in 7B and 13B parameter versions, are trained on up to 5 trillion tokens and demonstrate performance levels that match or exceed comparable fully open models whilst remaining competitive with open-weight models such as Llama 3.1 on English academic benchmarks.

“Since the release of the first OLMo in February 2024, we’ve seen rapid growth in the open language model ecosystem, and a narrowing of the performance gap between open and proprietary models,” explained Ai2.

The development team achieved these improvements through several innovations, including enhanced training stability measures, staged training approaches, and state-of-the-art post-training methodologies derived from their Tülu 3 framework. Notable technical improvements include the switch from nonparametric layer norm to RMSNorm and the implementation of rotary positional embedding.
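For readers unfamiliar with those two components, the snippets below sketch textbook versions of RMSNorm and rotary position embedding in PyTorch. They are generic reference implementations for illustration, not code from the OLMo 2 release, and conventions such as tensor shapes may differ from Ai2’s codebase.

```python
# Generic reference implementations; not taken from the OLMo 2 codebase.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS only, with no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (batch, seq_len, dim), dim even."""
    _, seq_len, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```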

OLMo 2 model training breakthrough

The training process employed a sophisticated two-stage approach. The initial stage utilised the OLMo-Mix-1124 dataset of approximately 3.9 trillion tokens, sourced from DCLM, Dolma, Starcoder, and Proof Pile II. The second stage incorporated a carefully curated mixture of high-quality web data and domain-specific content through the Dolmino-Mix-1124 dataset.

Particularly noteworthy is the OLMo 2-Instruct-13B variant, which is the most capable model in the series. The model demonstrates superior performance compared to Qwen 2.5 14B instruct, Tülu 3 8B, and Llama 3.1 8B instruct models across various benchmarks.

Benchmarks comparing the OLMo 2 open large language model to other models such as Mistral, Qwen, Llama, Gemma, and more.
(Credit: Ai2)

Committing to open science

Reinforcing its commitment to open science, Ai2 has released comprehensive documentation including weights, data, code, recipes, intermediate checkpoints, and instruction-tuned models. This transparency allows for full inspection and reproduction of results by the wider AI community.

The release also introduces an evaluation framework called OLMES (Open Language Modeling Evaluation System), comprising 20 benchmarks designed to assess core capabilities such as knowledge recall, commonsense reasoning, and mathematical reasoning.

OLMo 2 raises the bar in open-source AI development, potentially accelerating the pace of innovation in the field whilst maintaining transparency and accessibility.

(Photo by Rick Barrett)

See also: OpenAI enhances AI safety with new red teaming methods

Industry leaders back open-source AI definition (29 October 2024)

The Open Source Initiative (OSI) has unveiled a definition framework to evaluate whether AI systems can be classified as open-source.

The announcement of the first Open Source AI Definition (OSAID) was made at All Things Open and marks the culmination of a comprehensive global effort spanning multiple years of research, international workshops, and a year-long community design process.

The OSI – widely recognised as the definitive authority on open-source definitions by individuals, organisations, and government bodies worldwide – developed the framework through extensive collaboration with industry stakeholders. The framework defines what open-source AI means, insisting that the same open-source requirements apply whether the subject is a fully functional AI system, a model, weights and parameters, or other structural elements.

An open-source AI system must be made available under terms that grant four essential freedoms:

  • Use the system for any purpose and without having to ask for permission.
  • Study how the system works and inspect its components.
  • Modify the system for any purpose, including to change its output.
  • Share the system for others to use with or without modifications, for any purpose.

These freedoms apply both to a fully functional system and to discrete elements of a system. A precondition to exercising these freedoms is having access to the preferred form to make modifications to the system, which includes detailed data information, complete source code, and model parameters.

“The co-design process that led to version 1.0 of the Open Source AI Definition was well-developed, thorough, inclusive, and fair,” said Carlo Piana, OSI board chair. “The board is confident that the process has resulted in a definition that meets the standards of open-source as defined in the open-source definition and the four essential freedoms.”

One of the framework’s most significant requirements is the mandate for open-source models to provide sufficient information about their training data, ensuring that “a skilled person can recreate a substantially equivalent system using the same or similar data,” according to Ayah Bdeir, who leads AI strategy at Mozilla.

Bdeir acknowledged that whilst this approach might not be perfect, it represents a practical compromise between ideological purity and real-world implementation. She suggested that demanding an unrealistically high standard could prove counterproductive to the initiative’s goals.

The Digital Public Goods Alliance (DPGA) has expressed support for the OSI’s leadership in defining open-source AI. Liv Marte Nordhaug, CEO of the DPGA secretariat, confirmed that her organisation will incorporate this foundational work into updates to their Digital Public Goods Standard for AI applications.

EleutherAI Institute, known for its non-profit work in AI development, has also endorsed the definition.

“The Open Source AI Definition is a necessary step towards promoting the benefits of open-source principles in the field of AI,” stated Stella Biderman, Executive Director of the EleutherAI Institute. “We believe that this definition supports the needs of independent machine learning researchers and promotes greater transparency among the largest AI developers.”

The definition highlights the importance of including data information and code when sharing open-source models and weights. These requirements ensure transparency and the ability to modify the AI system.

OSI Executive Director Stefano Maffulli acknowledged the challenges faced during the development process, noting that despite occasional heated exchanges and differing opinions, the final result aligned with the project’s initial objectives.

“This is a starting point for a continued effort to engage with the communities to improve the definition over time,” he stated.

The OSAID does not require a specific legal mechanism for assuring that model parameters are freely available to all, though it may involve licences or legal instruments. This aspect is expected to become clearer over time as the legal system addresses these open-source AI systems.

See also: President Biden issues first National Security Memorandum on AI

MIT breakthrough could transform robot training (28 October 2024)

MIT researchers have developed a robot training method that reduces time and cost while improving adaptability to new tasks and environments.

The approach – called Heterogeneous Pretrained Transformers (HPT) – combines vast amounts of diverse data from multiple sources into a unified system, effectively creating a shared language that generative AI models can process. This method marks a significant departure from traditional robot training, where engineers typically collect specific data for individual robots and tasks in controlled environments.

Lead researcher Lirui Wang – an electrical engineering and computer science graduate student at MIT – believes that while many cite insufficient training data as a key challenge in robotics, a bigger issue lies in the vast array of different domains, modalities, and robot hardware. Their work demonstrates how to effectively combine and utilise all these diverse elements.

The research team developed an architecture that unifies various data types, including camera images, language instructions, and depth maps. HPT utilises a transformer model, similar to those powering advanced language models, to process visual and proprioceptive inputs.
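At a high level, an architecture like this maps each input modality into a shared token space and fuses the tokens with one transformer trunk. The sketch below is a loose illustration of that idea only; the module names, input shapes, and dimensions are assumptions and do not reflect the MIT team’s actual HPT implementation.

```python
# Toy illustration of a shared-trunk multimodal policy; all sizes are placeholders.
import torch
import torch.nn as nn

class SharedTrunkPolicy(nn.Module):
    """Per-modality encoders ("stems") project camera images and proprioceptive
    state into a shared token space, a common transformer trunk fuses them,
    and a small head predicts actions."""
    def __init__(self, d_model: int = 256, action_dim: int = 7):
        super().__init__()
        self.image_stem = nn.Sequential(            # (B, 3, 64, 64) -> one token
            nn.Conv2d(3, 32, kernel_size=8, stride=8),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, d_model),
        )
        self.proprio_stem = nn.Linear(14, d_model)  # e.g. joint positions + velocities
        trunk_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(trunk_layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        tokens = torch.stack(
            [self.image_stem(image), self.proprio_stem(proprio)], dim=1
        )                                            # (B, 2, d_model)
        fused = self.trunk(tokens)
        return self.action_head(fused.mean(dim=1))   # (B, action_dim)
```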

In practical tests, the system demonstrated remarkable results—outperforming traditional training methods by more than 20 per cent in both simulated and real-world scenarios. This improvement held true even when robots encountered tasks significantly different from their training data.

The researchers assembled an impressive dataset for pretraining, comprising 52 datasets with over 200,000 robot trajectories across four categories. This approach allows robots to learn from a wealth of experiences, including human demonstrations and simulations.

One of the system’s key innovations lies in its handling of proprioception (the robot’s awareness of its position and movement). The team designed the architecture to place equal importance on proprioception and vision, enabling more sophisticated dexterous motions.

Looking ahead, the team aims to enhance HPT’s capabilities to process unlabelled data, similar to advanced language models. Their ultimate vision involves creating a universal robot brain that could be downloaded and used for any robot without additional training.

While acknowledging they are in the early stages, the team remains optimistic that scaling could lead to breakthrough developments in robotic policies, similar to the advances seen in large language models.

You can find a copy of the researchers’ paper here (PDF).

(Photo by Possessed Photography)

See also: Jailbreaking AI robots: Researchers sound alarm over security flaws

xAI breaks records with ‘Colossus’ AI training system (3 September 2024)

Elon Musk’s xAI has unveiled its record-breaking AI training system, dubbed ‘Colossus’.

Musk revealed that the xAI team had successfully brought the Colossus 100k H100 training cluster online after a 122-day process. Not content with its existing capabilities, Musk stated, “over the next couple of months, it will double in size, bringing it to 200k (50k H200s).”

The scale of Colossus is unprecedented, surpassing every other cluster to date. For context, Google uses 90,000 GPUs while OpenAI utilises 80,000 GPUs—both of which have been surpassed by xAI’s creation, even prior to Colossus’ doubling in size over the coming months.

Developed in partnership with Nvidia, Colossus leverages some of the most advanced GPU technology on the market. The system initially employs Nvidia’s H100 chips, with plans to incorporate the newer H200 model in its expansion. This vast array of processing power positions Colossus as the most formidable AI training system currently available.

The H200, while recently superseded by Nvidia’s Blackwell chip unveiled in March 2024, remains a highly sought-after component in the AI industry. It boasts impressive specifications, including 141 GB of HBM3E memory and 4.8 TB/sec of bandwidth. However, the Blackwell chip raises the bar even further, with top-end capacity 36.2% higher than the H200 and a 66.7% increase in total bandwidth.
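Working those percentages through against the quoted H200 figures gives the implied top-end Blackwell numbers; a quick arithmetic check:

```python
h200_memory_gb, h200_bandwidth_tb_s = 141, 4.8
blackwell_memory_gb = h200_memory_gb * 1.362            # ~192 GB
blackwell_bandwidth_tb_s = h200_bandwidth_tb_s * 1.667  # ~8 TB/sec
print(round(blackwell_memory_gb), round(blackwell_bandwidth_tb_s, 1))  # 192 8.0
```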

Nvidia’s response to the Colossus unveiling was one of enthusiasm and support. The company congratulated Musk and the xAI team on their achievement, highlighting that Colossus will not only be the most powerful system of its kind but will also deliver “exceptional gains” in energy efficiency.

Colossus’ processing power could potentially accelerate breakthroughs in various AI applications, from natural language processing to complex problem-solving algorithms. However, the unveiling of Colossus also reignites discussions about the concentration of AI power among a handful of tech giants and well-funded startups.

As companies like xAI push the boundaries of what’s possible in AI training, concerns about the accessibility of such advanced technologies to smaller organisations and researchers may come to the forefront.

As the AI arms race continues to heat up, all eyes will be on xAI and its competitors to see how they leverage these increasingly powerful systems. With Colossus, Musk and his team have thrown down the gauntlet and issued a challenge to rivals to match or exceed their efforts.

See also: Amazon partners with Anthropic to enhance Alexa

Reddit is reportedly selling data for AI training (19 February 2024)

Reddit has negotiated a content licensing deal to allow its data to be used for training AI models, according to a Bloomberg report.

Just ahead of a potential $5 billion initial public offering (IPO) debut in March, Reddit has reportedly signed a $60 million deal with an undisclosed major AI company. This move could be seen as a last-minute effort to showcase potential revenue streams in the rapidly growing AI industry to prospective investors.

Although Reddit has yet to confirm the deal, the decision could have significant implications. If true, it would mean that Reddit’s vast trove of user-generated content – including posts from popular subreddits, comments from both prominent and obscure users, and discussions on a wide range of topics – could be used to train and enhance existing large language models (LLMs) or provide the foundation for the development of new generative AI systems.

However, this decision by Reddit may not sit well with its user base, as the company has faced increasing opposition from its community regarding its recent business decisions.

Last year, when Reddit announced plans to start charging for access to its application programming interfaces (APIs), thousands of Reddit forums temporarily shut down in protest. Days later, a group of Reddit hackers threatened to release previously stolen site data unless the company reversed the API plan or paid a ransom of $4.5 million.

Reddit has recently made other controversial decisions, such as removing years of private chat logs and messages from users’ accounts. The platform also implemented new automatic moderation features and removed the option for users to turn off personalised advertising, fuelling additional discontent among its users.

This latest reported deal to sell Reddit’s data for AI training could generate even more backlash from users, as the debate over the ethics of using public data, art, and other human-created content to train AI systems continues to intensify across various industries and platforms.

(Photo by Brett Jordan on Unsplash)

See also: Amazon trains 980M parameter LLM with ’emergent abilities’

OpenAI: Copyrighted data ‘impossible’ to avoid for AI training (9 January 2024)

OpenAI made waves this week with its bold assertion to a UK parliamentary committee that it would be “impossible” to develop today’s leading AI systems without using vast amounts of copyrighted data.

The company argued that advanced AI tools like ChatGPT require such broad training that adhering to copyright law would be utterly unworkable.

In written testimony, OpenAI stated that between expansive copyright laws and the ubiquity of protected online content, “virtually every sort of human expression” would be off-limits for training data. From news articles to forum comments to digital images, little online content can be utilised freely and legally.

According to OpenAI, attempts to create capable AI while avoiding copyright infringement would fail: “Limiting training data to public domain books and drawings created more than a century ago … would not provide AI systems that meet the needs of today’s citizens.”

While defending its practices as compliant, OpenAI conceded that partnerships and compensation schemes with publishers may be warranted to “support and empower creators.” But the company gave no indication that it intends to dramatically restrict its harvesting of online data, including paywalled journalism and literature.

This stance has opened OpenAI up to multiple lawsuits, including from media outlets like The New York Times alleging copyright breaches.

Nonetheless, OpenAI appears unwilling to fundamentally alter its data collection and training processes—given the “impossible” constraints self-imposed copyright limits would bring. The company instead hopes to rely on broad interpretations of fair use allowances to legally leverage vast swathes of copyrighted data.

As advanced AI continues to demonstrate uncanny abilities emulating human expression, legal experts expect vigorous courtroom battles around infringement by systems intrinsically designed to absorb enormous volumes of protected text, media, and other creative output. 

For now, OpenAI is betting against copyright maximalists in favour of near-boundless copying to drive ongoing AI development.

(Photo by Levart_Photographer on Unsplash)

See also: OpenAI’s GPT Store to launch next week after delays

Nightshade ‘poisons’ AI models to fight copyright theft (24 October 2023)

University of Chicago researchers have unveiled Nightshade, a tool designed to disrupt AI models attempting to learn from artistic imagery.

The tool – still in its developmental phase – allows artists to protect their work by subtly altering pixels in images, rendering them imperceptibly different to the human eye but confusing to AI models.

Many artists and creators have expressed concern over the use of their work in training commercial AI products without their consent.

AI models rely on vast amounts of multimedia data – including written material and images, often scraped from the web – to function effectively. Nightshade offers a potential solution by sabotaging this data.

When integrated into digital artwork, Nightshade misleads AI models, causing them to misidentify objects and scenes.

For instance, Nightshade transformed images of dogs into data that appeared to AI models as cats. After exposure to a mere 100 poison samples, the AI reliably generated a cat when asked for a dog—demonstrating the tool’s effectiveness.

This technique not only confuses AI models but also challenges the fundamental way in which generative AI operates. By exploiting the clustering of similar words and ideas in AI models, Nightshade can manipulate responses to specific prompts and further undermine the accuracy of AI-generated content.
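Conceptually, poisoning of this kind optimises a small, bounded pixel perturbation so that an image’s features drift towards a different target concept while the picture still looks unchanged to a person. The sketch below illustrates that general idea with a generic feature extractor; it is not Nightshade’s actual algorithm, and the extractor, perturbation budget, and optimiser settings are placeholder assumptions.

```python
# Toy illustration of feature-space poisoning (not Nightshade itself).
import torch

def poison_image(image, target_features, feature_extractor,
                 epsilon=8 / 255, steps=200, lr=0.01):
    """Nudge `image` within an L-infinity budget so its features approach
    `target_features` (e.g. features of a 'cat' anchor image), while the
    pixel change stays small enough to be hard to notice."""
    delta = torch.zeros_like(image, requires_grad=True)
    optimiser = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        poisoned = (image + delta).clamp(0.0, 1.0)
        loss = torch.nn.functional.mse_loss(feature_extractor(poisoned), target_features)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)   # keep the perturbation imperceptible
    return (image + delta.detach()).clamp(0.0, 1.0)
```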

Developed by computer science professor Ben Zhao and his team, Nightshade is an extension of their prior product, Glaze, which cloaks digital artwork and distorts pixels to baffle AI models regarding artistic style.

While the potential for misuse of Nightshade is acknowledged, the researchers’ primary objective is to shift the balance of power from AI companies back to artists and discourage intellectual property violations.

The introduction of Nightshade presents a major challenge to AI developers. Detecting and removing images with poisoned pixels is a complex task, given the imperceptible nature of the alterations.

If integrated into existing AI training datasets, these images necessitate removal and potential retraining of AI models, posing a substantial hurdle for companies relying on stolen or unauthorised data.

As the researchers await peer review of their work, Nightshade is a beacon of hope for artists seeking to protect their creative endeavours.

(Photo by Josie Weiss on Unsplash)

See also: UMG files landmark lawsuit against AI developer Anthropic

OpenAI is not currently training GPT-5 (17 April 2023)

Experts calling for a pause on AI development will be glad to hear that OpenAI isn’t currently training GPT-5.

OpenAI CEO Sam Altman spoke remotely at an MIT event and was quizzed about AI by computer scientist and podcaster Lex Fridman.

Altman confirmed that OpenAI is not currently developing a fifth version of its Generative Pre-trained Transformer model and is instead focusing on enhancing the capabilities of GPT-4, the latest version.

Altman was asked about the open letter that urged developers to pause training AI models larger than GPT-4 for six months. While he supported the idea of ensuring AI models are safe and aligned with human values, he believed that the letter lacked technical nuance regarding where to pause.

“An earlier version of the letter claims we are training GPT-5 right now. We are not, and won’t for some time. So in that sense, it was sort of silly,” said Altman.

“We are doing things on top of GPT-4 that I think have all sorts of safety issues that we need to address.”

GPT-4 is a significant improvement over its predecessor, GPT-3, which was released in 2020. 

GPT-3 has 175 billion parameters, making it one of the largest language models in existence. OpenAI has not confirmed GPT-4’s exact number of parameters but it’s estimated to be in the region of one trillion.

OpenAI said in a blog post that GPT-4 is “more creative and collaborative than ever before” and “can solve difficult problems with greater accuracy, thanks to its broader general knowledge and problem-solving abilities.”

In a simulated law bar exam, GPT-3.5 scored around the bottom 10 percent. GPT-4, however, passed the exam among the top 10 percent.

OpenAI is one of the leading AI research labs in the world, and its GPT models have been used for a wide range of applications, including language translation, chatbots, and content creation. However, the development of such large language models has raised concerns about their safety and ethical implications.

Altman’s comments suggest that OpenAI is aware of the concerns surrounding its GPT models and is taking steps to address them.

While GPT-5 may not be on the horizon, the continued development of GPT-4 and the creation of other models on top of it will undoubtedly raise further questions about the safety and ethical implications of such AI models.

(Photo by Victor Freitas on Unsplash)

Related: ​​Italy will lift ChatGPT ban if OpenAI fixes privacy issues

Adobe may train its algorithms with your work unless you opt-out (9 January 2023)

Unless you specifically opt out, Adobe may assume that it’s OK to use your work to train its algorithms.

An eagle-eyed developer at the Krita Foundation noticed that Adobe had automatically opted them into a “content analysis” initiative. The program allows Adobe to “analyze your content using techniques such as machine learning (e.g. for pattern recognition) to develop and improve our products and services.”

The rule was implemented in August 2022 but managed to go unnoticed.

Artists, understandably, have been protesting against AI-generated art as a potential threat to their livelihoods.

While some artists believe AI is a tool for their work rather than a threat, there’s near-unanimous consensus that the method in which generative AI models are often trained is unfair.

Some artists have found their work has been scraped to train generative AI models without their consent or at least being paid royalties. This has raised questions over whether end-users could also unwittingly violate copyright and face legal consequences.

By changing its policy to allow AI models to be trained on the works of its users, Adobe doesn’t have to rely on scraping data from the web. Adobe, it’s worth noting, is also set to begin selling AI-generated stock images.

While Adobe claims that it doesn’t use data on customers’ Creative Cloud accounts to train its experimental generative AI features, the wording provides some legal flexibility.

In the company’s documentation, Adobe quite clearly says “we first aggregate your content with other content and then use the aggregated content to train our algorithms and thus improve our products and services.”

Such data collection should never be enabled by default; it arguably falls foul of regulations like GDPR. If you’re an Adobe user and want to opt out, you can do so here.

(Photo by Emily Bernal on Unsplash)

Related: Adobe to begin selling AI-generated stock images

Devang Sachdev, Snorkel AI: On easing the laborious process of labelling data (30 September 2022)

Correctly labelling training data for AI models is vital to avoid serious problems, as is using sufficiently large datasets. However, manually labelling massive amounts of data is time-consuming and laborious.

Using pre-labelled datasets can be problematic, as evidenced by MIT having to pull its 80 Million Tiny Images datasets. For those unaware, the popular dataset was found to contain thousands of racist and misogynistic labels that could have been used to train AI models.

AI News caught up with Devang Sachdev, VP of Marketing at Snorkel AI, to find out how the company is easing the laborious process of labelling data in a safe and effective way.

AI News: How is Snorkel helping to ease the laborious process of labelling data?

Devang Sachdev: Snorkel Flow changes the paradigm of training data labelling from the traditional manual process—which is slow, expensive, and unadaptable—to a programmatic process that we’ve proven accelerates training data creation 10x-100x.

Users are able to capture their knowledge and existing resources (both internal, e.g., ontologies and external, e.g., foundation models) as labelling functions, which are applied to training data at scale. 

Unlike a rules-based approach, these labelling functions can be imprecise, lack coverage, and conflict with each other. Snorkel Flow uses theoretically grounded weak supervision techniques to intelligently combine the labelling functions and auto-label your training data set en masse using an optimal Snorkel Flow label model. 
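The open-source snorkel research library (a precursor to the commercial Snorkel Flow platform) exposes the same basic pattern, which makes the idea concrete: write a few noisy labelling functions, apply them across a dataframe, then let a label model combine their votes into probabilistic labels. The example below is a minimal sketch using that public library; the spam heuristics and toy data are invented for illustration and are not Snorkel Flow code.

```python
# Minimal sketch using the open-source snorkel library; heuristics and data are invented.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Imprecise heuristic: messages with URLs are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_reply(x):
    # Another noisy signal; it may conflict with other labelling functions.
    return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": ["check out http://spam.example", "thanks!", "great video"]})

applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_reply])
L_train = applier.apply(df=df_train)                 # matrix of LF votes, -1 = abstain

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=0)
probabilistic_labels = label_model.predict_proba(L=L_train)
```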

Using this initial training data set, users train a larger machine learning model of their choice (with the click of a button from our ‘Model Zoo’) in order to:

  1. Generalise beyond the output of the label model.
  2. Generate model-guided error analysis to know exactly where the model is confused and how to iterate. This includes auto-generated suggestions, as well as analysis tools to explore and tag data to identify what labelling functions to edit or add. 

This rapid, iterative, and adaptable process becomes much more like software development than a tedious, manual process that cannot scale. And much like software development, it allows users to inspect and adapt the code that produced training data labels.

AN: Are there dangers to implementing too much automation in the labelling process?

DS: The labelling process can inherently introduce dangers simply for the fact that as humans, we’re fallible. Human labellers can be fatigued, make mistakes, or have a conscious or unconscious bias which they encode into the model via their manual labels.

When mistakes or biases occur—and they will—the danger is the model or downstream application essentially amplifies the isolated label. These amplifications can lead to consequential impacts at scale. For example, inequities in lending, discrimination in hiring, missed diagnoses for patients, and more. Automation can help.

In addition to these dangers—which have major downstream consequences—there are also more practical risks of attempting to automate too much or taking the human out of the loop of training data development.

Training data is how humans encode their expertise to machine learning models. While there are some cases where specialised expertise isn’t required to label data, in most enterprise settings, there is. For this training data to be effective, it needs to capture the fullness of subject matter experts’ knowledge and the diverse resources they rely on to make a decision on any given datapoint.

However, as we have all experienced, having highly in-demand experts label data manually one-by-one simply isn’t scalable. It also leaves an enormous amount of value on the table by losing the knowledge behind each manual label. We must take a programmatic approach to data labelling and engage in data-centric, rather than model-centric, AI development workflows. 

Here’s what this entails: 

  • Elevating how domain experts label training data from tediously labelling one-by-one to encoding their expertise—the rationale behind what would be their labelling decisions—in a way that can be applied at scale. 
  • Using weak supervision to intelligently auto-label at scale—this is not auto-magic, of course; it’s an inherently transparent, theoretically grounded approach. Every training data label that’s applied in this step can be inspected to understand why it was labelled as it was. 
  • Bringing experts into the core AI development loop to assist with iteration and troubleshooting. Using streamlined workflows within the Snorkel Flow platform, data scientists—as subject matter experts—are able to collaborate to identify the root cause of error modes and how to correct them by making simple labelling function updates, additions, or, at times, correcting ground truth or “gold standard” labels that error analysis reveals to be wrong.

AN: How easy is it to identify and update labels based on real-world changes?

DS: A fundamental value of Snorkel Flow’s data-centric approach to AI development is adaptability. We all know that real-world changes are inevitable, whether that’s production data drift or business goals that evolve. Because Snorkel Flow uses programmatic labelling, it’s extremely efficient to respond to these changes.

In the traditional paradigm, if the business comes to you with a change in objectives—say, they were classifying documents three ways but now need a 10-way schema, you’d effectively need to relabel your training data set (often thousands or hundreds of thousands of data points) from scratch. This would mean weeks or months of work before you could deliver on the new objective. 

In contrast, with Snorkel Flow, updating the schema is as simple as writing a few additional labelling functions to cover the new classes and applying weak supervision to combine all of your labelling functions and retrain your model. 

To identify data drift in production, you can rely on your monitoring system or use Snorkel Flow’s production APIs to bring live data back into the platform and see how your model performs against real-world data.

As you spot performance degradation, you’re able to follow the same workflow: using error analysis to understand patterns, apply auto-suggested actions, and iterate in collaboration with your subject matter experts to refine and add labelling functions. 

AN: MIT was forced to pull its ‘80 Million Tiny Images’ dataset after it was found to contain racist and misogynistic labels due to its use of an “automated data collection procedure” based on WordNet. How is Snorkel ensuring that it avoids this labelling problem that is leading to harmful biases in AI systems?

DS: Bias can start anywhere in the system – pre-processing, post-processing, with task design, with modelling choices, etc. And in particular issues with labelled training data.

To understand underlying bias, it is important to understand the rationale used by labellers. This is impractical when every datapoint is hand labelled and the logic behind labelling it one way or another is not captured. Moreover, information about label author and dataset versioning is rarely available. Often labelling is outsourced, or in-house labellers have moved on to other projects or organisations. 

Snorkel AI’s programmatic labelling approach helps discover, manage, and mitigate bias. Instead of discarding the rationale behind each manually labelled datapoint, Snorkel Flow, our data-centric AI platform, captures the labellers’ (subject matter experts, data scientists, and others) knowledge as a labelling function and generates probabilistic labels using theoretical grounded algorithms encoded in a novel label model.

With Snorkel Flow, users can understand exactly why a certain datapoint was labelled the way it is. This process, along with label function and label dataset versioning, allows users to audit, interpret, and even explain model behaviours. This shift from manual to programmatic labelling is key to managing bias.

AN: A group led by Snorkel researcher Stephen Bach recently had their paper on Zero-Shot Learning with Common Sense Knowledge Graphs (ZSL-KG) published. I’d direct readers to the paper for the full details, but can you give us a brief overview of what it is and how it improves over existing WordNet-based methods?

DS: ZSL-KG improves graph-based zero-shot learning in two ways: richer models and richer data. On the modelling side, ZSL-KG is based on a new type of graph neural network called a transformer graph convolutional network (TrGCN).

Many graph neural networks learn to represent nodes in a graph through linear combinations of neighbouring representations, which is limiting. TrGCN uses small transformers at each node to combine neighbourhood representations in more complex ways.
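To make the contrast concrete, here is a rough PyTorch sketch of the general idea: a small transformer attends over a node’s neighbourhood instead of taking a fixed linear combination of neighbour representations. It is a loose illustration of the concept, not the authors’ TrGCN implementation; dimensions, pooling, and the neighbour representation are all assumptions.

```python
# Toy illustration of transformer-based neighbourhood aggregation (not the paper's code).
import torch
import torch.nn as nn

class TransformerNeighbourhoodAggregator(nn.Module):
    """Each node attends over its own neighbourhood with a small transformer,
    rather than combining neighbours through a fixed linear combination."""
    def __init__(self, dim: int = 128, nhead: int = 4, num_layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, node_feats: torch.Tensor, neighbour_lists: list) -> torch.Tensor:
        updated = []
        for node, neighbours in enumerate(neighbour_lists):
            ids = [node] + neighbours                    # include the node itself
            seq = node_feats[ids].unsqueeze(0)           # (1, len(ids), dim)
            encoded = self.encoder(seq)
            updated.append(encoded.mean(dim=1))          # pool the neighbourhood
        return torch.cat(updated, dim=0)                 # (num_nodes, dim)
```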

On the data side, ZSL-KG uses common sense knowledge graphs, which use natural language and graph structures to make explicit many types of relationships among concepts. They are much richer than the typical ImageNet subtype hierarchy.

AN: Gartner designated Snorkel a ‘Cool Vendor’ in its 2022 AI Core Technologies report. What do you think makes you stand out from the competition?

DS: Data labelling is one of the biggest challenges for enterprise AI. Most organisations realise that current approaches are unscalable and often ridden with quality, explainability, and adaptability issues. Snorkel AI not only provides a solution for automating data labelling but also uniquely offers an AI development platform to adopt a data-centric approach and leverage knowledge resources including subject matter experts and existing systems.

In addition to the technology, Snorkel AI brings together 7+ years of R&D (which began at the Stanford AI Lab) and a highly-talented team of machine learning engineers, success managers, and researchers to successfully assist and advise customer development as well as bring new innovations to market.

Snorkel Flow unifies all the necessary components of a programmatic, data-centric AI development workflow—training data creation/management, model iteration, error analysis tooling, and data/application export or deployment—while also being completely interoperable at each stage via a Python SDK and a range of other connectors.

This unified platform also provides an intuitive interface and streamlined workflow for critical collaboration between SME annotators, data scientists, and other roles, to accelerate AI development. It allows data science and ML teams to iterate on both data and models within a single platform and use insights from one to guide the development of the other, leading to rapid development cycles.
