Some open-source NLP tools in TensorFlow

Recently, I have spent some time sorting out my previous open-source NLP projects and adding documentation for them. I'd like to share some of them with you.

Here are some of the TensorFlow tools I built from scratch for NLP experimentation and development. You can visit my GitHub overview page for more details.

Machine Reading Comprehension (MRC)

  • GitHub: https://github.com/stevezheng23/reading_comprehension_tf
  • Overview: I have re-implemented several MRC models (e.g. QANet, BiDAF, etc.) from scratch and run experiments on the SQuAD task. More details can be found on the GitHub project page.
  • Description: Machine reading comprehension (MRC), a task that asks a machine to read a given context and then answer questions based on its understanding, is considered one of the key problems in artificial intelligence and has attracted significant interest from both academia and industry. Over the past few years, great progress has been made in this field, thanks to various end-to-end trained neural models and high-quality datasets with large numbers of examples.
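
To make the span-extraction formulation concrete, below is a minimal tf.keras sketch, not the repository's actual code: encode the context and question, fuse them with simple dot-product attention (a simplified stand-in for the richer fusion in BiDAF/QANet), and predict the start and end positions of the answer within the context. All vocabulary and layer sizes are made-up placeholders.

    import tensorflow as tf

    # Illustrative sizes only; the real models use tuned hyperparameters.
    VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10000, 100, 128
    CONTEXT_LEN, QUESTION_LEN = 300, 30

    context_ids = tf.keras.Input(shape=(CONTEXT_LEN,), dtype="int32")
    question_ids = tf.keras.Input(shape=(QUESTION_LEN,), dtype="int32")

    embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
    encode = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True))
    context_enc = encode(embed(context_ids))    # (batch, CONTEXT_LEN, 2*HIDDEN_DIM)
    question_enc = encode(embed(question_ids))  # (batch, QUESTION_LEN, 2*HIDDEN_DIM)

    # Each context position attends over the question tokens.
    attended = tf.keras.layers.Attention()([context_enc, question_enc])
    fused = tf.keras.layers.Concatenate()([context_enc, attended])

    # Two distributions over context positions: answer start and answer end.
    start_probs = tf.keras.layers.Softmax()(
        tf.keras.layers.Flatten()(tf.keras.layers.Dense(1)(fused)))
    end_probs = tf.keras.layers.Softmax()(
        tf.keras.layers.Flatten()(tf.keras.layers.Dense(1)(fused)))

    model = tf.keras.Model([context_ids, question_ids], [start_probs, end_probs])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

At prediction time the answer is the context span whose start/end pair maximizes the product of the two distributions (with start ≤ end).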

Sequence Labeling

  • GitHub: https://github.com/stevezheng23/sequence_labeling_tf
  • Overview: I have re-implemented several sequence labeling models (e.g. Bi-LSTM + Char-CNN + Softmax, Bi-LSTM + Char-CNN + CRF, etc.) from scratch and run experiments on different tasks, including NER (e.g. CoNLL2003, OntoNotes5) and POS tagging (e.g. Treebank3). More details can be found on the GitHub project page.
  • Description: Sequence labeling is a task that assigns a categorical label to each element in an input sequence. Many problems can be formalized as sequence labeling tasks, including speech recognition, video analysis, and various problems in NLP (e.g. POS tagging, NER, chunking, etc.). Traditionally, sequence labeling required large amounts of hand-engineered features and domain-specific knowledge, but recently neural approaches have achieved state-of-the-art performance on several sequence labeling benchmarks.
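
As a concrete illustration of "one label per element", here is a minimal tf.keras sketch of a Bi-LSTM + Softmax tagger; the Char-CNN inputs and CRF output layer from the repository are omitted, and all sizes are placeholder assumptions (9 tags matches the CoNLL2003 BIO NER scheme).

    import tensorflow as tf

    # Illustrative sizes only; 9 tags corresponds to the CoNLL2003 BIO NER scheme.
    VOCAB_SIZE, NUM_TAGS = 20000, 9
    EMBED_DIM, HIDDEN_DIM = 100, 128

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)),
        # One softmax per token: every element of the sequence gets a label.
        tf.keras.layers.Dense(NUM_TAGS, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

Replacing the per-token softmax with a CRF layer lets the model score whole tag sequences rather than independent tokens, which is the idea behind the Bi-LSTM + Char-CNN + CRF variant above.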

Sequence-to-Sequence (Seq2Seq)

  • GitHub: https://github.com/stevezheng23/seq2seq_tf
  • Overview: I have re-implemented vanilla and attention-based Seq2Seq models from scratch and run NMT experiments on the IWSLT’15 English-Vietnamese task. More details can be found on the GitHub project page.
  • Description: Sequence-to-Sequence (Seq2Seq) is a general end-to-end framework that maps sequences in a source domain to sequences in a target domain. A Seq2Seq model first reads the source sequence with an encoder to build vector-based ‘understanding’ representations, then passes them through a decoder to generate the target sequence, which is why it is also referred to as the encoder-decoder architecture. Many NLP tasks have benefited from the Seq2Seq framework, including machine translation, text summarization, and question answering.
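
Below is a minimal tf.keras sketch of the vanilla (attention-free) encoder-decoder idea with teacher forcing at training time; it is a simplified stand-in for the repository's models, and all vocabulary and layer sizes are placeholders.

    import tensorflow as tf

    # Illustrative sizes only.
    SRC_VOCAB, TGT_VOCAB, EMBED_DIM, HIDDEN_DIM = 8000, 8000, 128, 256

    # Encoder: read the source sequence into vector-based state.
    src_ids = tf.keras.Input(shape=(None,), dtype="int32")
    src_emb = tf.keras.layers.Embedding(SRC_VOCAB, EMBED_DIM)(src_ids)
    _, state_h, state_c = tf.keras.layers.LSTM(
        HIDDEN_DIM, return_state=True)(src_emb)

    # Decoder: generate the target sequence conditioned on the encoder state.
    tgt_ids = tf.keras.Input(shape=(None,), dtype="int32")
    tgt_emb = tf.keras.layers.Embedding(TGT_VOCAB, EMBED_DIM)(tgt_ids)
    dec_out = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(
        tgt_emb, initial_state=[state_h, state_c])
    logits = tf.keras.layers.Dense(TGT_VOCAB)(dec_out)

    model = tf.keras.Model([src_ids, tgt_ids], logits)
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

The attention-based variant additionally lets each decoder step attend over all encoder outputs instead of relying on the final encoder state alone.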

Language Model

  • GitHub: https://github.com/stevezheng23/language_model_tf
  • Overview: I have re-implemented a bi-directional language model (biLM) from scratch and run LM experiments on a Wikipedia corpus. More details can be found on the GitHub project page.
  • Description: Language modeling is a task that assigns probabilities to sequences of words or other linguistic units (e.g. characters, subwords, sentences, etc.). It is one of the most important problems in modern natural language processing (NLP) and is used in many NLP applications (e.g. speech recognition, machine translation, text summarization, spell correction, auto-completion, etc.). In the past few years, neural approaches have achieved better results than traditional statistical approaches on many language modeling benchmarks.
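
To show what "assigning probabilities to sequences" means in code, here is a minimal tf.keras sketch of a forward (left-to-right) LSTM language model that scores a token sequence; a biLM pairs such a model with a second one run over the reversed sequence. All sizes and the toy token IDs are placeholder assumptions.

    import tensorflow as tf

    # Illustrative sizes only.
    VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10000, 128, 256

    lm = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True),
        tf.keras.layers.Dense(VOCAB_SIZE),  # next-token logits
    ])

    def sequence_log_prob(token_ids):
        """log P(w_1..w_n) = sum over t of log P(w_t | w_1..w_{t-1})."""
        inputs = token_ids[:, :-1]   # condition on the prefix
        targets = token_ids[:, 1:]   # predict each next token
        per_token = -tf.keras.losses.sparse_categorical_crossentropy(
            targets, lm(inputs), from_logits=True)
        return tf.reduce_sum(per_token, axis=-1)

    # Log-probability of a toy 5-token sequence (untrained, so values are arbitrary).
    print(sequence_log_prob(tf.constant([[1, 42, 7, 99, 2]])))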