GluonNLP — Deep Learning Toolkit for Natural Language Processing

Introducing GluonNLP, an open-source toolkit that tackles the reproducibility crisis in NLP research with stable APIs, reusable components, and centralized resources.

The Reproduction Problem in NLP

Let me tell you a story of Alexander, a second-year PhD student. Alexander read the paper “Attention is All You Need” and was fascinated by the Transformer model it introduced. He wanted to reproduce the results and build upon them for his own research. He found Google’s official implementation in the Tensor2Tensor package built on TensorFlow. But when he dug into the code, he discovered the hyperparameters differed significantly from what was published in the paper.

Plan vs. Reality

His advisor told him to “just tweak the parameters a bit, run and see, it should be fine.” After days of experimentation that consumed all available GPU resources, he posted the issue on GitHub — but received no timely resolution, only a response three months later. Others on GitHub faced identical issues with no solutions.

GitHub issue: can not reproduce the result

Half a month passed. Finally, the project maintainer appeared and replied that he would look into it.

Maintainer response

Three months passed. Alexander is still asking helplessly: “Is there any progress?”

Follow-up plea

This is not an isolated incident. Unlike computer vision, NLP involves numerous preprocessing complexities. As someone once put it: “From the loading of the training dataset to the output of the BLEU score on the testing set, there will be a thousand opportunities to do something wrong.” The challenges include:

  • String encoding/decoding and Unicode format
  • Parsing and tokenization
  • Text data from various languages with different grammatical rules
  • Left-to-right or right-to-left reading order
  • Word embedding
  • Input padding
  • Gradient clipping
  • Variable-length input data and states

The Four Problems GluonNLP Solves

Problem 1: Reproducibility

Natural language processing papers are difficult to reproduce. Implementations on GitHub vary widely in quality, and many are abandoned.

Solution: Frequent updates with training scripts, hyperparameters, and runtime logs.

Problem 2: API Stability

Reproduction code is sensitive to API changes and thus has limited shelf-life. Code breaks when deep learning frameworks evolve.

Solution: Continuous Integration testing of all training and evaluation scripts, catching regressions early.

Problem 3: Code Reusability

Developers copy-paste code across projects due to poor interface design and tight deadlines, creating unmaintainable solutions.

Solution: Easy-to-use, extensible interfaces developed from diverse real-world research applications.

I personally accumulated five different Beam Search implementations, each slightly modified, because the interface of the module was rushed due to project deadlines and thus poorly designed. Different use cases prompted copying and tweaking rather than creating a unified, extensible interface.

Problem 4: Resource Fragmentation

NLP resources are scattered. To complete one project you may have to depend on multiple packages.

Solution: Centralized aggregation enabling one-click downloads of pre-trained embeddings, models, and datasets.

Examples of fragmentation:

  • Google’s Word2vec requires gensim as a dependency
  • Salesforce’s AWD language model uses PyTorch and lacks pre-trained models
  • Facebook’s FastText operates as an independent package

”Enough talking, show me the code!”

Loading GloVe embeddings and a pre-trained AWD LSTM language model, then comparing cosine similarity between the words “baby” and “infant”:

import mxnet as mx
import gluonnlp as nlp

# Load GloVe word embeddings
glove = nlp.embedding.create('glove', source='glove.6B.50d')

# Compute 'baby' and 'infant' word embeddings
baby_glove, infant_glove = glove['baby'], glove['infant']

# Load pre-trained AWD LSTM language model and get the embedding
lm_model, lm_vocab = nlp.model.get_model(name='awd_lstm_lm_1150',
                                         dataset_name='wikitext-2',
                                         pretrained=True)
baby_idx, infant_idx = lm_vocab['baby', 'infant']
lm_embedding = lm_model.embedding[0]

# Get the word embeddings of 'baby' and 'infant'
baby_lm, infant_lm = lm_embedding(mx.nd.array([baby_idx, infant_idx]))

# cosine similarity
def cos_similarity(vec1, vec2):
    return mx.nd.dot(vec1, vec2) / (vec1.norm() * vec2.norm())

print(cos_similarity(baby_glove, infant_glove))  # 0.74056691
print(cos_similarity(baby_lm, infant_lm))        # 0.3729561

GloVe embeddings (0.74) better capture the semantic similarity between “baby” and “infant” than the language model embeddings (0.37), which is expected — static word embeddings are trained specifically to encode semantic relationships, while language model embeddings optimize for next-word prediction.

GluonNLP v0.3.2

Features at the time of writing:

  • Over 300 pre-trained word embeddings (GloVe, FastText, Word2vec)
  • 5 language models (AWD, Cache, LSTM)
  • Neural Machine Translation training (Google NMT, Transformer)
  • Word2vec and FastText embedding training with subword interpolation
  • Flexible data pipeline tools and public datasets
  • NLP examples including sentiment analysis
pip install gluonnlp

Team

The GluonNLP team: