Using gobbli for interactive NLP

explore

The explore app requires a dataset. The dataset can be in one of a few formats (note that it must fit in memory):

  • A built-in gobbli dataset (ex. NewsgroupsDataset or IMDBDataset)
  • A text file with one document per line
  • A .csv file with a “text” column and an optional “label” column (see the example below)
  • A .tsv file with a “text” column and an optional “label” column
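
For reference, a minimal labeled .csv might look like this (the two rows are invented examples):

text,label
"What a fantastic movie!",positive
"I want my two hours back.",negative

You'd then point the app at the file with something like: gobbli explore my_dataset.csv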

Some functionality won’t appear for datasets without labels. If you don’t have your own dataset handy, the following invocation will work out of the box (note it will take several minutes to build the first time, as gobbli has to download and unpack the dataset):

gobbli explore NewsgroupsDataset

If everything is installed correctly, you should see the explore app open in your browser.

The explore app pointed at the built-in IMDB dataset.

Here are some general things to know about the gobbli interactive apps:

  • Parameters and user input are kept in the sidebar. The main section is reserved for displaying data and output.
  • Since the entire app re-runs with every input widget change, the apps default to taking a small sample of data so you can tweak parameters without locking up your browser on long-running tasks. You can increase the sample size when you have everything set the way you want.
  • All the normal gobbli output goes to the terminal window running Streamlit. Check the terminal to see the status of long-running tasks that involve a model (embedding generation, prediction, etc.).
  • We cache the results of long-running tasks wherever possible, but changing parameters will often require re-running costly tasks.

Upon opening the app, you’ll be able to read through example documents from the dataset and check the distributions of labels and document lengths. The more involved tasks of topic modeling and embedding generation require some additional inputs.

Topic modeling

The explore app provides an interface to gensim’s LDA model, which allows you to train a topic model that learns latent topics from a bag-of-words representation of your documents. The approach doesn’t incorporate contextual information like a modern neural network, but it can reveal recurring themes in your dataset. To train a topic model, check the “Enable Topic Model” box in the sidebar and click “Train Topic Model”.
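
Under the hood, this is roughly equivalent to fitting gensim’s LdaModel on a bag-of-words corpus yourself. The sketch below is illustrative (not gobbli’s exact preprocessing or parameters) and assumes “texts” is a list of document strings:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Naive whitespace tokenization for illustration; gobbli's actual
# preprocessing may differ.
docs = [text.lower().split() for text in texts]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit the LDA model and print the top 20 words for each learned topic.
lda = LdaModel(corpus, num_topics=10, id2word=dictionary)
for topic_id, words in lda.show_topics(num_topics=10, num_words=20, formatted=False):
    print(topic_id, [word for word, _ in words])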

Results from a topic model.

The explore app displays the coherence score and top 20 words for each learned topic. It also displays the correlation between topics, which helps determine how well-fit the model is, and the correlation between topics and labels, which may help interpret some of the topics.
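
The coherence score can be approximated with gensim’s CoherenceModel, reusing “lda”, “docs”, and “dictionary” from the sketch above (the exact measure gobbli uses may differ):

from gensim.models import CoherenceModel

# "c_v" is one common coherence measure; higher generally means more
# interpretable topics.
coherence = CoherenceModel(
    model=lda, texts=docs, dictionary=dictionary, coherence="c_v"
).get_coherence()
print(f"Coherence: {coherence:.3f}")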

Plotting embeddings

Embeddings are vectors derived from the hidden state of a neural network. They generally aim to quantify the semantics of a document: documents with similar meanings should end up close together in the embedding space, so plotting embeddings can provide a useful “map” of your dataset. gobbli makes this easy. To generate and plot embeddings, check the “Enable Embeddings” checkbox and click the “Generate Embeddings” button.
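
You can also generate embeddings programmatically. This sketch assumes gobbli’s Universal Sentence Encoder wrapper and its EmbedInput API (check the gobbli docs for the exact import paths) and that “texts” is a list of document strings:

from gobbli.io import EmbedInput
from gobbli.model.use import USE

# Build the model container (downloads weights the first time),
# then embed the documents.
model = USE()
model.build()
embed_output = model.embed(EmbedInput(X=texts))
X_embedded = embed_output.X_embedded  # one vector per document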

Results from plotting embeddings.

After some time, you’ll see the embeddings with their dimensionality reduced via UMAP. You can hover over individual points to see the text and label for that document. Points are colored by label.
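
To reproduce the dimensionality reduction outside the app, umap-learn does the heavy lifting. A sketch, reusing “X_embedded” from the earlier snippet (gobbli’s default parameters may differ):

import numpy as np
import umap

# Reduce the document embeddings to 2 dimensions for plotting.
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(np.stack(X_embedded))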

Embeddings from an untrained model can give a preview of how well that model will differentiate between the classes in your dataset. The more separated your classes are in the embeddings plot, the more likely the model will be able to tell them apart. Using the “Model Class” dropdown and the “Model Parameters” JSON input, you can quickly evaluate different model types and parameter combinations on your dataset.

If you have a trained gobbli model, you can also visualize its embeddings (if the model supports embeddings). If you trained the model directly, you’ll need the path returned by calling “.data_dir()” on the model:
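
(A sketch: BERT is used purely as an example here, and any gobbli model class should work the same way.)

from gobbli.model.bert import BERT

# Build the Docker container, train, then grab the model's data directory.
model = BERT()
model.build()
# ... run your training here, e.g. model.train(...) ...
print(model.data_dir())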

If you trained the model using a (non-distributed) experiment, you’ll need the path two directories up from the checkpoint:
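
(A sketch, assuming “results” is the object returned from a gobbli classification experiment and that it exposes the path to the best checkpoint.)

from pathlib import Path

checkpoint = Path(results.best_model_checkpoint)  # assumed attribute name
model_data_dir = checkpoint.parent.parent  # two directories up
print(model_data_dir)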

Pass this path to the explore app to use a trained model:

gobbli explore --model-data-dir <MODEL_DATA_DIR> <DATASET>

You should then see the available checkpoints for the model in the “Embedding” section:

Generating embeddings using a trained gobbli model.

If you’re interested in how well a clustering algorithm groups your documents in a high- or low-dimensional space, you can also apply one (HDBSCAN or K-means) to the embeddings before or after dimensionality reduction and plot the resulting clusters. Check the “Cluster Embeddings” box, set the parameters, and click “Generate Embeddings” again to see the clusters plotted.
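
A rough sketch of the clustering step, using HDBSCAN on the reduced coordinates from the earlier snippet (the parameters are illustrative, not gobbli’s defaults):

import hdbscan

# Cluster the 2-D UMAP coordinates; points labeled -1 are noise.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
cluster_labels = clusterer.fit_predict(coords)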