Practical Text Generation with Tensorflow Serving

In this entry, I discuss how to expose and serve deep learning models with TensorFlow Serving, and showcase my setup for a flexible and practical text generation solution.

By text generation I mean the automated task of generating new, semantically valid pieces of text of variable length, given an optional seed string. The idea is to be able to draw on different models for different use cases (Q&A, chatbot utilities, simplification, next-word suggestion), also based on different types of content (e.g. narrative, scientific, code), sources or authors.

Here is a first preview of the setup in action for sentence suggestion.

Text generation example using seed text (the selected part in the video) for different training sources (i.e. song lyrics, Bible, Darwin)

This recording shows on-demand generation of text using three different models, each trained on the corresponding source of text (song lyrics, the King James Version of the Bible and Darwin’s On the Origin of Species), based on different seed strings. As expected, each model reflects the tone and content of its original source, as any other model would, and shows its potential for specific uses and scenarios, which still depend mainly on the writer’s needs and expectations. The preview also gives an intuitive feel for each model’s performance, hinting at which models would need improvement to meet the defined requirements.

The following paragraphs describe the architectural approach I followed to provision and use the text-generation capabilities behind this first example and the ones that follow.

Architecture Preview

Even if the focus of this showcase is the text generation task, much of the content related to model management can be generalized. For example, we can identify three separate steps that have proven valid for a variety of use cases and situations:

  • Training: what many in the field are most used to; investigate the best approach, define the architecture/algorithm, process data, train and test models.
  • Serving: expose the trained models for consumption.
  • Consuming: use the exposed models to obtain predictions from raw data.

The difference between the last two steps is the subtler one, and strongly depends on the working setup and on what one identifies as a model. For starters, think about how much of the data preprocessing you did for training is not actually embedded in the model you train with Keras (or any other library, for that matter). All this work simply moves to the next architectural block, and needs to be deferred to a consuming middleware, so that we can expose just the basic required capabilities via a clean interface.

Here is a schematic overview of our architectural steps and interactions.


I am not going into the technical details of text-generation training, but if you are interested you can find most of the training code for this article, plus additional pointers and resources, in this Jupyter notebook.

The nice thing about the serving architecture is that training can be largely decoupled from the other components. This allows for fast, simple and transparent delivery of improved versions simply by providing better-performing models. Better training data, a more controlled training procedure, a more sophisticated algorithm or a new architecture are all options that could produce better models, which in turn could seamlessly supersede the ones currently served.

As for the aspects that break what would otherwise be a perfect decoupling, consider the following most pressing dependencies:

  • model signature (input/output shapes and types), which has to be known and followed by the client. In this setup the client does not discover the signature from the serving server, so agreement with the training step needs to be guaranteed “manually” in order to avoid errors. Once a signature is defined you can serve new versions or test new architectures without additional burden, as long as the signature stays consistent across all such models.
  • pre- and post-processing, as performed during training, needs to be “replicated” in the consuming layer.
  • external/additional model data (e.g. word indexes) required by the pre/post-processing needs to be made available to the consuming layer, while guaranteeing correspondence with what was used during training.
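
To make the last two dependencies concrete, here is a minimal sketch of a word index shared between training and consumption. The tokenizer, the `<unk>` token and the function names are assumptions made for illustration, not the code from the notebook; the only point is that the consumer must load exactly the mapping the trainer produced.

```python
import json
import re

UNKNOWN = "<unk>"  # hypothetical placeholder token for out-of-vocabulary words


def build_word_index(corpus):
    """Training side: map each distinct word to an integer id (0 is kept for UNKNOWN)."""
    words = sorted(set(re.findall(r"[a-z']+", corpus.lower())))
    return {UNKNOWN: 0, **{word: i + 1 for i, word in enumerate(words)}}


def encode(text, word_index):
    """Pre-processing: raw text -> list of ids, with the same tokenization used for training."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [word_index.get(token, word_index[UNKNOWN]) for token in tokens]


def decode(ids, word_index):
    """Post-processing: list of ids -> text, using the inverted index."""
    index_word = {i: word for word, i in word_index.items()}
    return " ".join(index_word[i] for i in ids)


# The index is stored next to the model so the consumer loads the very same mapping.
word_index = build_word_index("the origin of species by means of natural selection")
restored = json.loads(json.dumps(word_index))  # stands in for a save/load round trip

print(encode("The origin of whales", restored))   # the unseen word maps to id 0
print(decode(encode("natural selection", restored), restored))
```

If the consumer rebuilt the index from a different corpus, or tokenized differently, the ids sent to the model would silently point to the wrong words, which is why this data has to travel with the model.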


Models are the outputs of our training process. We can categorize them first by their function (e.g. classification, text-generation, Q&A), and then by a version.

For our text-generation case, we might consider models to serve different functions based on the text content they are trained on, even if the basic task is practically the same. In most scenarios the training process and code will actually be the same; what changes is the training data. Versioning is then specific to that model function and content, and could be determined, in the simplest case, by basic snapshots along a multi-epoch training process, or otherwise by the adoption or testing of new algorithms and architectures.

The idea is then to have a central repository of a multitude of models, where newly trained versions can be added as needed.

As a practical example, consider this snapshot of my folder for basic text generation with our three models, each possibly having multiple serving-ready versions, a more granular selection of training checkpoints (or snapshots) and word-to-index data.

partial tree of the models folder
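
As a rough illustration of such a layout, the sketch below builds a throwaway folder and lists the serving-ready versions of a model. The folder and file names (`serving`, `checkpoints`, `word_index.json`) are made up for the example and are not necessarily the ones in my repository; the one convention taken from TensorFlow Serving is that each version lives in a sub-directory named with an integer.

```python
import tempfile
from pathlib import Path


def servable_versions(model_dir):
    """Return the numeric version sub-directories of one model, sorted ascending."""
    return sorted(
        int(child.name)
        for child in Path(model_dir).iterdir()
        if child.is_dir() and child.name.isdecimal()
    )


def latest_version(model_dir):
    """Highest version number, or None when nothing servable has been exported yet."""
    versions = servable_versions(model_dir)
    return versions[-1] if versions else None


# Build a temporary tree shaped like the folder described above (hypothetical names).
root = Path(tempfile.mkdtemp())
for relative in ["darwin/serving/1", "darwin/serving/2", "darwin/checkpoints", "bible/serving/1"]:
    (root / relative).mkdir(parents=True)
(root / "darwin" / "word_index.json").write_text("{}")

print(servable_versions(root / "darwin" / "serving"))
print(latest_version(root / "bible" / "serving"))
```

Keeping checkpoints and word-index data outside the numbered folders means only complete, exported versions are ever visible to the server.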


With models ready, one needs a way to serve them: to make them available for efficient and effective use.

“TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments.” It is a great and immediate approach for those already familiar with TensorFlow and not in the mood to write their own serving architecture.
It includes automated model management based on the associated directory of models and exposes them via gRPC. As a cherry on top, it can be easily Dockerized.

model config example

You can keep a replica of all your models and related versions on the machine running the TensorFlow server, or you can filter them up front based on your needs to get a lighter serving container. You can then specify which model to run directly, or via a model configuration file that is passed when starting the server. The file should list a model configuration for each of the models we plan to expose.
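
For reference, a model configuration file is a short text-format listing of models and their base paths. The sketch below renders one from Python; the model names and paths are assumptions chosen to match this article, while the `model_config_list` / `config` structure is the one TensorFlow Serving reads through its `--model_config_file` option.

```python
def render_model_config(models):
    """Render the body of a model config file from a {name: base_path} mapping."""
    entries = []
    for name, base_path in models.items():
        entries.append(
            "  config {\n"
            f'    name: "{name}"\n'
            f'    base_path: "{base_path}"\n'
            '    model_platform: "tensorflow"\n'
            "  }"
        )
    return "model_config_list {\n" + "\n".join(entries) + "\n}\n"


# Hypothetical names and container paths for the three models of this article.
config_text = render_model_config({
    "lyrics": "/models/lyrics/serving",
    "bible": "/models/bible/serving",
    "darwin": "/models/darwin/serving",
})
print(config_text)
```

Generating the file from the folder structure, rather than editing it by hand, keeps the list of exposed models in step with what has actually been exported.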

TensorFlow Serving will then take care of serving each of the listed models and automatically manage the versioning. A new version is picked up automatically, while adding an entirely new model requires a restart of the serving service.
All this can be done manually, but for a production setup one might prefer to develop a “synchronization” utility that keeps the model data inside the TensorFlow Serving container in sync with whatever external storage hosts the actual results of the training step.


Consider our current use case: we want to get generated text, of a specific type or from a specific source, optionally conditioned on a seed text. Unfortunately, the bare TensorFlow Serving endpoint is far from being that immediate. We not only need to transform the text (in and out), but also have to implement a generation procedure that should be completely transparent in the final interface. All this needs to be deferred to, and implemented by, a consumer middleware.

This is typical of many other data-science scenarios, where the exported model is really just a partial step in the pipeline from raw data to usable predictions. A consumer middleware is needed to fill these pre- and post-processing gaps, as depicted in the initial architectural schema. For our text-generation case, this middleware and the related code are again defined in my GitHub repository. It includes a basic class responsible for text pre- and post-processing, a procedure for text generation (which builds upon multiple model calls and secondary requirements) and a proxy to handle different models.
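
The generation procedure can be pictured as a loop that repeatedly asks the model for next-word probabilities and samples from them. The following is a simplified illustration under stated assumptions, not the code from the repository: `predict_fn`, the window size and the toy model are hypothetical, and in the real setup `predict_fn` would wrap one request to the served model.

```python
import random


def sample_next(probabilities, temperature=1.0, rng=random):
    """Sample a word id; temperature below 1 sharpens the distribution, above 1 flattens it."""
    weights = [p ** (1.0 / temperature) for p in probabilities]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(seed_ids, predict_fn, length, window=3, temperature=1.0, rng=random):
    """Extend seed_ids by `length` ids, calling the model once per generated word."""
    ids = list(seed_ids)
    for _ in range(length):
        context = ids[-window:]              # the model only sees a fixed-size window
        probabilities = predict_fn(context)  # stand-in for one call to the served model
        ids.append(sample_next(probabilities, temperature, rng))
    return ids[len(seed_ids):]


def fake_predict(context):
    """Toy one-hot model over 5 word ids: always predicts (last id + 1) modulo 5."""
    favourite = (context[-1] + 1) % 5 if context else 0
    return [1.0 if i == favourite else 0.0 for i in range(5)]


print(generate([0, 1], fake_predict, length=4))
```

Because every generated word costs one model call, the length of the requested text translates directly into the number of round trips to the server, which is one more reason to hide the loop behind the middleware.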

The proxy can be used directly, assuming the needed dependencies are resolved; otherwise I suggest easing the task even further by exposing everything as a really basic REST API: with Python and Flask, a couple of lines are really all that is needed. Additionally, the modularity of all our components makes it easy to externalize and scale the solution via technologies like Docker and AWS.
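
Flask is itself a WSGI framework, so the shape of such an API can be sketched without any dependency as a plain WSGI callable. Everything here is an assumption made for illustration: the `/generate` route, the JSON fields and `fake_proxy`, which merely echoes the seed where the real proxy would call the served model.

```python
import io
import json


def fake_proxy(model, seed, length):
    """Placeholder for the model proxy: it just repeats the seed `length` times."""
    return " ".join([seed] * length)


def app(environ, start_response):
    """WSGI app exposing POST /generate with a JSON body {model, seed, length}."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/generate":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    size = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(size) or b"{}")
    text = fake_proxy(
        payload.get("model", "darwin"),
        payload.get("seed", ""),
        int(payload.get("length", 1)),
    )
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps({"text": text}).encode("utf-8")]


def call(wsgi_app, path, body):
    """Invoke the WSGI app in-process, without opening a socket."""
    raw = json.dumps(body).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    status = []
    chunks = wsgi_app(environ, lambda code, headers: status.append(code))
    return status[0], json.loads(b"".join(chunks))


print(call(app, "/generate", {"model": "lyrics", "seed": "hello", "length": 2}))
```

To try it over HTTP, the same callable can be handed to `wsgiref.simple_server.make_server` from the standard library; with Flask the routing and JSON handling shrink to a decorated function.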

Showcase Time!

How to actually make use of the setup described so far is purely a matter of imagination and needs (this article nicely explores various forms of machine-assisted writing). The best aspect is precisely that we now have a flexible architecture, reusable for all kinds of scenarios without much additional operational burden.

The first practical example I built, and which I now actively use, is a basic text editor plugin. In my case I rely on Sublime Text 3 and Notepad++. Writing the plugin for the former was pretty trivial once I had the text-generation API up and running; the following is actually all the code needed.
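
The snippet below is not the plugin itself, only a hedged sketch of its client side: it builds the HTTP request an editor command would send with the selected text as seed. The endpoint URL and the JSON field names are hypothetical and have to match whatever the REST layer actually exposes.

```python
import json
import urllib.request

API_URL = "http://localhost:5000/generate"  # hypothetical endpoint of the REST API


def build_request(seed, model, length=20, url=API_URL):
    """Prepare a POST request carrying the seed text and the chosen model name."""
    body = json.dumps({"model": model, "seed": seed, "length": length}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_generated_text(seed, model, length=20, url=API_URL, timeout=10):
    """Send the request and return the generated text (needs the API to be running)."""
    request = build_request(seed, model, length, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["text"]


# Inside an editor plugin, `seed` would be the current selection and the
# returned text would be inserted right after it.
request = build_request("In the beginning", "bible")
print(request.get_method(), request.full_url)
```

Keeping the request building separate from the network call makes the editor-specific part tiny: read the selection, call `fetch_generated_text`, insert the result.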

What is going on here is that I write something myself, select the text I want to use as seed and then call the generation on the preferred model. Here is another demonstration. Notice again that I am relying on three different models, trained respectively on song lyrics, the King James Version of the Bible and Darwin’s On the Origin of Species.

text gen on same seed for different models. Again notice the weaker Bible one.

This “plugin for text apps” allows me to come up with, or take inspiration from, content (be it complete sentences or single words) that would otherwise rarely emerge from my purely spontaneous writing. While the more creative context of narrative and poetry seems to make the best use of this tool, I often find good advice for more formal and “strict” kinds of content too, putting my trust in the source dataset used for training while being confident that “more likely implies correctness”.

Another use I am working on is a web-browser plugin for automated replies in chat and instant messaging, again by simply leveraging the exposed REST API. We’ll see how many of my friends can tell me apart from an RNN.

words suggestion based on the song-lyrics dataset
words suggestion based on Darwin’s On the Origin of Species


The text generation capabilities of deep-learning-based techniques have already been proven for a variety of use cases and scenarios. In this entry, I showed how a basic architectural solution based on TensorFlow can guarantee high flexibility and effectiveness in creating a practical little text-generation toolbox for your inner writer.

I am also confident that we will soon see more structured, granular and democratized access to pre-trained models: a sort of self-service repository, along the lines of other popular services, such that one can easily plug and play new models and embed them directly in whatever context of choice, without having to train them practically from scratch.
In the context of this article, for example, it would be nice to have access to models trained by others for different tasks, text content or authors, as well as being able to share the results of my own training on a common platform or “model hub”.

But now go; make good use of these powerful machines for your productive and creative purposes; it will not be long before they get too smart for the task.

Practical Text Generation with Tensorflow Serving was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.