DeepPavlov Ruberta conversion to PyTorch

Source: Deep Learning on Medium

https://www.behance.net/gallery/12125277/From-Russia-with-love

Russian is a very complex language that is hard to fit into a neural network. But it is possible, and DeepPavlov has done an incredible job of making it happen by providing pretrained weights for TensorFlow and Keras. It is an enormous piece of work by a group from MIPT.

Though I understand the beauty of TensorFlow more and more every day, I am more inclined to use PyTorch for initial research. The problem is that DeepPavlov provides weights for TensorFlow and Keras, not for PyTorch.

This turns out not to be a problem at all if you use the HuggingFace library. It ships a nice Python file that does the job.
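As a rough sketch of what calling that file looks like, the snippet below only assembles the command line. The flag names are the ones I know from the HuggingFace script, but check them against its own help output. All paths are placeholders chosen for illustration, not names taken from the DeepPavlov archive.

```python
import shlex


def build_convert_command(tf_checkpoint, bert_config, pytorch_dump,
                          script="convert_bert_original_tf_checkpoint_to_pytorch.py"):
    """Return the argument list for the HuggingFace BERT conversion script."""
    return [
        "python", script,
        "--tf_checkpoint_path", tf_checkpoint,
        "--bert_config_file", bert_config,
        "--pytorch_dump_path", pytorch_dump,
    ]


# Placeholder paths (assumptions); replace them with your own files.
cmd = build_convert_command(
    tf_checkpoint="rubert/bert_model.ckpt.index",
    bert_config="rubert/bert_config.json",
    pytorch_dump="rubert.pt",
)

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the conversion from Python, the same list can be passed to subprocess.run(cmd, check=True).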

Steps to convert Ruberta TensorFlow and Keras weights to PyTorch

  1. Find the file convert_bert_original_tf_checkpoint_to_pytorch.py in the HuggingFace library.
  2. Read its documentation to see which parameters the script takes. Note that the checkpoint path you provide should be the one ending with index.
  3. Run the script and you will receive a file that PyTorch can load as weights.
  4. Copy all files to your working directory.
  5. Execute:
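import torch
from transformers import BertConfig, BertModel

# The vocabulary size must match the checkpoint's vocab.txt.
# Older transformers releases call this argument vocab_size_or_config_json_file.
bertcf = BertConfig(vocab_size=119547)
model = BertModel(bertcf)

# Load the converted weights once and remember the order of their keys.
rubert_state = torch.load('rubert.pt', map_location='cpu')
keys = list(rubert_state.keys())

# Pair model parameters with checkpoint tensors by position.
for index, par_bert_pytorch in enumerate(model.parameters()):
    par_rubert = rubert_state[keys[index]]
    if par_rubert.shape == par_bert_pytorch.shape:
        par_bert_pytorch.data = par_rubert
        print('Executed substitution of parameters.')
        print(index)
    else:
        print('Skipped')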

The TensorFlow weights are now loaded into your PyTorch model. Check the output: any parameter reported as 'Skipped' keeps its random initialization.
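Because the loop pairs tensors purely by position, it can help to do a dry run on the shapes before touching any weights. The following is a minimal sketch of such a check, not part of the original recipe: plan_positional_load is a hypothetical helper, and the names and shapes in the toy lists are made up for illustration.

```python
def plan_positional_load(model_shapes, checkpoint_shapes):
    """Pair entries by position and report which would be copied or skipped.

    Both arguments are ordered lists of (name, shape) tuples.
    """
    plan = []
    for index, (model_name, model_shape) in enumerate(model_shapes):
        if index >= len(checkpoint_shapes):
            # The checkpoint has fewer tensors than the model has parameters.
            plan.append((model_name, None, "missing"))
            continue
        ckpt_name, ckpt_shape = checkpoint_shapes[index]
        action = "copy" if tuple(ckpt_shape) == tuple(model_shape) else "skip"
        plan.append((model_name, ckpt_name, action))
    return plan


# Toy shapes standing in for real tensors (hypothetical values).
model_shapes = [
    ("embeddings.word_embeddings.weight", (119547, 768)),
    ("embeddings.position_embeddings.weight", (512, 768)),
    ("pooler.dense.bias", (768,)),
]
checkpoint_shapes = [
    ("bert.embeddings.word_embeddings.weight", (119547, 768)),
    ("bert.embeddings.position_embeddings.weight", (512, 768)),
    ("cls.predictions.bias", (119547,)),
]

for row in plan_positional_load(model_shapes, checkpoint_shapes):
    print(row)
```

With real objects, the two lists can be built from model.named_parameters() and from the items of the dictionary returned by torch.load, taking tuple(tensor.shape) for each entry. A row marked "copy" only means the shapes agree, so it is still worth reading the paired names to confirm they refer to the same layer.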

Use a preprocessor from DeepPavlov to encode Russian text for your model:

from deeppavlov.models.preprocessors import bert_preprocessor
tokenizer = bert_preprocessor.BertPreprocessor('vocab.txt')

Everything is working. Congratulations!

The same steps can be repeated if you have pretrained weights for your own language.