
Cerebras Systems and Barcelona Supercomputing Center Train Industry-Leading Multilingual Spanish Catalan English LLM


Business Wire

Condor Galaxy AI Supercomputer Powers FLOR 6.3B -- a Catalan, Spanish and English Open-Source Model Using Novel Training Techniques

SUNNYVALE, Calif.: Cerebras Systems, the pioneer in accelerating generative AI, today announced that the Barcelona Supercomputing Center (BSC) has completed training FLOR-6.3B, a state-of-the-art English, Spanish, and Catalan large language model. FLOR-6.3B was trained in just 2.5 days on Condor Galaxy 1 (CG-1), the massive AI supercomputer built from 64 Cerebras CS-2 systems by Cerebras and G42. FLOR-6.3B continues Cerebras' work on multilingual models, a trend that started with the introduction of Jais, the leading Arabic-English model.

Catalan and Spanish are low- and mid-resourced languages relative to English, and Catalan in particular has only a fraction of the data typically needed to train a model, so innovative AI training techniques were required. As explained in a recent post, BSC sought to create a model that is stronger for combining the three languages, as each is commonly spoken in Spain. In partnership with Cerebras, the BSC team explored a technique that starts from a fully trained LLM and adjusts its embedding layer, achieving results comparable to those of training on a much larger dataset.
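The embedding-adjustment idea can be illustrated with a small sketch (the toy vocabularies, hidden size, and random initialization here are illustrative placeholders, not BSC's actual setup): vectors for subwords shared by the old and new tokenizers are copied over from the pretrained model, while vectors for new Catalan and Spanish subwords are freshly initialized and then learned during continued pretraining.

```python
import random

random.seed(0)

# Toy stand-ins for the real tokenizer vocabularies (hypothetical subwords).
old_vocab = {tok: i for i, tok in enumerate(["<s>", "the", "ing", "de", "la", "ción"])}
new_vocab = {tok: i for i, tok in enumerate(["<s>", "de", "la", "ción", "nya", "ment"])}

HIDDEN = 8  # real models use a hidden size in the thousands

# Pretrained embedding table: one vector per old-vocabulary subword.
old_emb = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in old_vocab]

new_emb = []
for tok in new_vocab:  # dicts preserve insertion order
    if tok in old_vocab:
        # Shared subword: reuse the pretrained vector unchanged.
        new_emb.append(list(old_emb[old_vocab[tok]]))
    else:
        # New Catalan/Spanish subword: small random init; the vector is
        # learned during continued pretraining on the new languages.
        new_emb.append([random.gauss(0, 0.02) for _ in range(HIDDEN)])
```

This way most of the embedding table starts from pretrained weights rather than from scratch, which is what lets the new model converge with far less data.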

“Even though Spanish is one of the most commonly spoken languages in the world, there is a shortage of data available on the Internet for training – and we’ve found this to be a common problem for many languages beyond English,” said Andrew Feldman, CEO and co-founder of Cerebras. “In collaboration with our partners, we have been committed to developing new methodologies for creating models where training data is underrepresented. We are proud to work with BSC on FLOR 6.3B, which is multilingual at its core and performs significantly better than competing Spanish LLMs thanks to our novel training techniques.”

FLOR is a new family of open-source models, ranging in size from 760M to 6.3B parameters, based on publicly released checkpoints of BLOOM. These checkpoints were previously pre-trained on 341B tokens of multilingual data covering 46 natural languages and 13 programming languages.

BLOOM-7.1B was taken as the initial checkpoint for continued pretraining due to its multilingual nature. To better adapt the model to Catalan and Spanish, a new tokenizer was trained and used throughout the continued pretraining process. The new tokenizer has a reduced vocabulary of 50,257 subwords, of which 66% overlap with the BLOOM vocabulary; the rest are subwords more prevalent in Catalan and Spanish. The smaller vocabulary also gives FLOR-6.3B fewer parameters than BLOOM-7.1B, which directly reduces the cost of inference by more than 10%.
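A back-of-envelope calculation shows where the savings come from. Assuming the published BLOOM-7.1B configuration (hidden size 4,096, roughly 250,880 vocabulary entries, tied input/output embeddings) — figures taken from the public checkpoint, not from this announcement:

```python
# Rough parameter accounting for shrinking the vocabulary.
# HIDDEN and OLD_VOCAB are assumptions based on the public BLOOM-7.1B config.
HIDDEN = 4096
OLD_VOCAB = 250_880
NEW_VOCAB = 50_257
TOTAL_PARAMS = 7.1e9  # approximate BLOOM-7.1B size

old_embedding = OLD_VOCAB * HIDDEN  # ~1.03B parameters
new_embedding = NEW_VOCAB * HIDDEN  # ~0.21B parameters
saved = old_embedding - new_embedding

print(f"saved {saved / 1e9:.2f}B params ({saved / TOTAL_PARAMS:.1%} of the model)")
# prints: saved 0.82B params (11.6% of the model)
```

Dropping roughly 0.8B embedding parameters from a 7.1B model leaves about 6.3B, consistent with the FLOR-6.3B name and with the stated inference savings of more than 10%.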

The FLOR family of models was trained using subsets of the Condor Galaxy 1 AI supercomputer. The smaller models were trained on single Cerebras CS-2 systems, while FLOR-6.3B was trained on 16 CS-2s. Cerebras completed the entire training run of FLOR-6.3B on 140 billion tokens in 2.5 days. FLOR-6.3B is open source and available for both research and commercial applications.
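For context, those figures imply the following aggregate throughput — a derived back-of-envelope number, not one stated by Cerebras:

```python
# Implied training throughput from the announced figures (derived, not official).
TOKENS = 140e9   # total training tokens
DAYS = 2.5       # wall-clock training time
SYSTEMS = 16     # CS-2 systems used for FLOR-6.3B

seconds = DAYS * 24 * 3600
tokens_per_sec = TOKENS / seconds
print(f"~{tokens_per_sec / 1e3:.0f}K tokens/s overall, "
      f"~{tokens_per_sec / SYSTEMS / 1e3:.0f}K tokens/s per CS-2")
# prints: ~648K tokens/s overall, ~41K tokens/s per CS-2
```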

Condor Galaxy is one of the largest AI supercomputers in the world. Built by Cerebras and its strategic partner G42, Condor Galaxy 1 comprises 64 CS-2 systems, creating a 4-exaflop AI supercomputer with standard support for models of up to 600 billion parameters. Condor Galaxy 1 is simple to program and entirely avoids the complexity of distributed computing. This enables customers to train large, ground-breaking models quickly, greatly reducing the time from idea to trained model.

The FLOR family of models continues Cerebras' leadership in multilingual models. In 2023, Cerebras and Core42 co-developed Jais 13B and Jais 30B, the best bilingual Arabic-English models in the world, now available on Azure Cloud. Condor Galaxy has also been used to train BTLM-3B-8K, the leading 3B model on Hugging Face, which offers 7B-parameter performance in a light 3B-parameter model for inference. Med42, developed with M42 and Core42, is a leading clinical LLM, trained on Condor Galaxy 1 in a weekend and surpassing MedPaLM in performance and accuracy.

For more information on the Condor Galaxy AI supercomputer, please visit https://www.cerebras.net/condor-galaxy-1.

About Cerebras Systems

Cerebras Systems is a team of pioneering computer architects, computer scientists, deep learning researchers, and engineers of all types. We have come together to accelerate generative AI by building a new class of computer system. Our flagship product, the CS-2 system, is powered by the world's largest and fastest AI processor, our Wafer-Scale Engine. It makes training large models simple by avoiding the complexity of distributed computing. Cerebras CS-2s are clustered together to make the largest AI supercomputers in the world, which are used by leading corporations for proprietary models and to train open-source models with millions of downloads. Cerebras solutions are available through the Cerebras Cloud and on-premises. For further information, visit https://www.cerebras.net.

