One of the most innovative developments in Natural Language Processing (NLP) was Google's release of BERT (Bidirectional Encoder Representations from Transformers): dedicated models for English and Chinese, plus a single multilingual model covering all other languages.
The BERT models allow anyone to train their own advanced question answering system. However, the multilingual model performs poorly for smaller languages, such as the Nordic languages, because they are underrepresented in its training data.
BotXO, a Danish SaaS company specializing in conversational AI and chatbot technology, decided to develop their own BERT models for the overlooked Nordic languages.
It all started with a Danish BERT model, built to help Danish organizations adopt the latest conversational AI technology, secure their role in the global economy and ultimately take part in the AI race. Norwegian and Swedish models followed soon after, and a Finnish model is currently in progress.
Danish BERT Model
BotXO built their Danish BERT model on top of Google's BERT. The aim was to help Danish companies, educational institutions, NGOs and public organizations with their NLP projects, and more broadly to benefit the Danish AI and machine learning fields. For that reason, the company released the Danish BERT model as open-source code, making it publicly available to everyone who speaks Danish and helping to democratize AI for less widely spoken languages.
One of the biggest challenges of working with AI language models is the massive amount of text needed to build an extensive model. General-purpose language models demand vast corpora, and training for the underrepresented Nordic languages was difficult because so little training data is available. Even so, the Danish BERT model has read 1.6 billion words, corresponding to more than 30,000 books. The model could have learned even more, but it wasn't easy to find much more publicly accessible Danish text.
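As a quick back-of-the-envelope check on those figures, 1.6 billion words spread over roughly 30,000 books implies an average of about 53,000 words per book, a plausible length for a full book:

```python
# Rough sanity check of the corpus-size comparison in the text:
# 1.6 billion words over "more than 30,000 books".
total_words = 1_600_000_000
books = 30_000

words_per_book = total_words // books
print(words_per_book)  # 53333
```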
So how does the Danish BERT model differ from Google's? Google's multilingual BERT model is trained on more than a hundred different languages, so Danish text constitutes only about 1% of the total amount of data. The multilingual model has a vocabulary of 120,000 words, leaving room for only about 1,200 Danish words. By comparison, BotXO's model has a vocabulary of 32,000 Danish words!
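Vocabulary size matters because BERT-style models split unknown words into subword pieces. The sketch below uses tiny invented vocabularies (not the real BERT ones) and a simple greedy longest-match tokenizer to show the effect: a vocabulary with good Danish coverage keeps a Danish word in large chunks, while one with little Danish coverage fragments it:

```python
# Illustrative only: toy vocabularies and a greedy longest-match
# (WordPiece-style) tokenizer, not the actual BERT vocabularies.

def tokenize(word, vocab):
    """Split a word into the longest subword pieces found in vocab,
    falling back to single characters when nothing matches."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # no piece matched: emit one character
            i += 1
    return pieces

# Hypothetical vocabularies, invented for this example.
danish_vocab = {"cykel", "sti"}                # plenty of room for Danish
multilingual_vocab = {"cy", "kel", "s", "ti"}  # Danish squeezed into ~1%

print(tokenize("cykelsti", danish_vocab))        # ['cykel', 'sti']
print(tokenize("cykelsti", multilingual_vocab))  # ['cy', 'kel', 's', 'ti']
```

Fewer, larger pieces per word generally give the model a better starting point for learning what Danish words mean.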
Notably, researchers from the University of Copenhagen and the Alexandra Institute have concluded that BotXO's AI outperforms the AI of Google and Zalando, as well as models built by Stanford University and Stony Brook University. Rasmus Hvingelby, Machine Learning Specialist at the Alexandra Institute, explains that after examining various language models on which they could develop language technologies, they concluded that BotXO's model gave the best performance.
How Can We Use the Danish BERT Model?
The Danish BERT model can be applied to various NLP tasks such as sentiment analysis and entity extraction. For instance, it can detect biases in a text, determine a text's purpose and context, and point out relevant words. This can benefit industries such as e-commerce, finance and tech, as well as the public sector.
How Does the Model Learn from the Text?
First, it reads a sentence, for example, “I like Portugal, especially Lisbon.” Then it hides some of the words from itself: “I like […], especially Lisbon.” Next, it tries to guess the hidden word. If the guess is correct, it retains what it has learned about the text; if it is wrong, it adjusts itself to make a better guess next time. So what does the model learn from this example? It learns that Lisbon is located in Portugal.
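The guess-and-adjust game above can be sketched in miniature. This is not the real BERT training loop (which adjusts millions of neural-network weights); here the “model” is just a table of scores per (context, word) pair, and a wrong guess lowers the guessed word's score while raising the right one:

```python
# Toy sketch of masked-word training: guess the hidden word,
# and adjust the scores whenever the guess is wrong.
from collections import defaultdict

scores = defaultdict(float)  # (context, word) -> learned score

def guess(context, candidates):
    """Pick the candidate word with the highest learned score."""
    return max(candidates, key=lambda w: scores[(context, w)])

def learn(context, hidden_word, candidates):
    g = guess(context, candidates)
    if g != hidden_word:                      # wrong guess: adjust
        scores[(context, g)] -= 1.0
        scores[(context, hidden_word)] += 1.0

context = "I like [MASK], especially Lisbon."
candidates = ["Saturday", "Portugal", "schools"]

for _ in range(5):                            # a few rounds of the game
    learn(context, "Portugal", candidates)

print(guess(context, candidates))             # Portugal
```

After a few rounds the model reliably fills the blank with “Portugal”, which is the toy equivalent of learning that Lisbon belongs with Portugal.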
Later on, the model might read the following sentence in the text: “That’s why I often spend my holidays in some nice place in Portugal”. When presented with a random sentence such as “Schools are closed on Saturday”, the model would figure out that it could not logically follow the first one. However, it knows that “I like Portugal, especially Lisbon“ could fit.
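This second task is known as next-sentence prediction. The real BERT trains a classifier over encoded sentence pairs; the stand-in below uses simple word overlap only to make the idea concrete, picking the candidate sentence that shares the most words with the first one:

```python
# Toy stand-in for next-sentence prediction: score candidate follow-up
# sentences by word overlap with the first sentence (the real model
# learns a far richer notion of "fits" than shared words).

def overlap_score(first, candidate):
    """Count words shared by two sentences, ignoring case and punctuation."""
    clean = lambda s: {w.strip(".,!?").lower() for w in s.split()}
    return len(clean(first) & clean(candidate))

first = "That's why I often spend my holidays in some nice place in Portugal."
candidates = [
    "Schools are closed on Saturday.",
    "I like Portugal, especially Lisbon.",
]

best = max(candidates, key=lambda c: overlap_score(first, c))
print(best)  # I like Portugal, especially Lisbon.
```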
Norwegian and Swedish BERT Models
The release of the Norwegian and Swedish BERT models followed the Danish one. Norwegian has only about 4.6 million speakers and is therefore often overlooked in NLP research. BotXO decided to create the first-ever BERT model trained on Norwegian data, to help data scientists in Norway build state-of-the-art NLP solutions.
The model was trained on a specialized computer chip called a TPU (Tensor Processing Unit), which excels at the tensor operations needed to train deep neural networks. TPUs can only be accessed by renting them from Google, which, as you can guess, is quite expensive, so it is vital to make the algorithms run as fast as possible to keep the project's cost down. To train the model, BotXO used text gathered by the non-profit organization Common Crawl, which collects huge amounts of data from the internet.
Finally, the Swedish BERT model was trained on 25 GB of raw text data, ten times more than the previously largest Swedish BERT. Swedish is more widely spoken than Danish or Norwegian: about ten million people speak it, almost as many as the populations of Denmark and Norway combined. Like the Danish BERT model, the Norwegian and Swedish models are available as open source, so scientists in the respective countries can further develop them to improve their products and/or build new solutions.
Article written by Patrycja Hala Sacan, Marketing Specialist at BotXO