FeatureUnion: a Time-Saver When Building a Machine Learning Model

Source: Deep Learning on Medium

2. Two vectorizers

Now, what if we want to test another vectorizer, such as TF-IDF? We can’t simply add the second vectorizer as another pipeline step like this:

# This will not work
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('tfic', TfidfVectorizer()),
    ('lr', LogisticRegression())
])

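Why won’t this work? Remember what a pipeline does: it passes the output of each step to the next step as input. In this case, it takes the output of our first step, CountVectorizer, which is a sparse matrix of token counts, and feeds it to TfidfVectorizer, which expects raw text documents rather than a matrix.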
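To see the failure concretely, here is a minimal sketch that fits the chained pipeline on a tiny made-up corpus. The documents, labels, and the names chained and chained_failed are assumptions for illustration only, and the exact exception type and message may differ between scikit-learn versions, so the sketch catches a generic Exception.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and labels, used only for illustration
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the dog barked loudly",
    "cats purr softly",
]
labels = [0, 1, 1, 0]

# Two vectorizers chained one after the other
chained = Pipeline([
    ("cvec", CountVectorizer()),
    ("tfic", TfidfVectorizer()),
    ("lr", LogisticRegression()),
])

try:
    chained.fit(docs, labels)
    chained_failed = False
except Exception as exc:
    # TfidfVectorizer receives a count matrix instead of raw strings
    chained_failed = True
    print(type(exc).__name__, exc)
```

The error is raised inside the second step, because TfidfVectorizer tries to treat each row of the count matrix as a text document.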

Instead, we will use FeatureUnion to accomplish this task.

FeatureUnion:

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors. (Source: scikit-learn documentation)
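The description above can be checked with a small sketch: fit each vectorizer separately, stack the two matrices side by side, and compare the result with the FeatureUnion output. The toy corpus and the variable names (combined, manual, same) are assumptions made for this example, not part of the original article.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpus, used only for illustration
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the dog barked loudly",
    "cats purr softly",
]

# Both vectorizers receive the same raw documents
union = FeatureUnion([
    ("tfic", TfidfVectorizer()),
    ("cvec", CountVectorizer()),
])
combined = union.fit_transform(docs)

# Fit each vectorizer on its own and concatenate the columns manually
tfidf_only = TfidfVectorizer().fit_transform(docs)
counts_only = CountVectorizer().fit_transform(docs)
manual = hstack([tfidf_only, counts_only])

print(combined.shape, tfidf_only.shape, counts_only.shape)

# True if the union output matches the manual concatenation
same = np.allclose(combined.toarray(), manual.toarray())
print(same)
```

The combined matrix has one row per document and as many columns as the two vectorizers produce together.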

Let’s see how to use it:

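from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion

pipe = Pipeline([
    # Both vectorizers receive the raw text; their outputs are concatenated
    ('feats', FeatureUnion([
        ('tfic', TfidfVectorizer()),
        ('cvec', CountVectorizer()),
    ])),
    ('lr', LogisticRegression())
])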

Tuning our hyperparameters will also require a slight change:

pipe_parms = [{
    'feats__tfic__max_features': [100, 500],
    'feats__tfic__ngram_range': [(1, 1), (1, 2)],
    'feats__cvec__max_features': [100, 500],
    'feats__cvec__ngram_range': [(1, 1), (1, 2)],
}]
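As a rough sketch of how this parameter grid could be used, the example below passes it to GridSearchCV together with the pipeline. The toy corpus, the labels, and the choice of cv=2 are assumptions made only to keep the example small; with real data you would use your own training set and a larger number of folds.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion

# Hypothetical toy data: four documents per class
docs = [
    "the cat sat on the mat", "cats purr softly",
    "my cat likes fish", "a kitten chased the cat",
    "the dog barked loudly", "dogs chase cats",
    "my dog likes bones", "a puppy followed the dog",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("feats", FeatureUnion([
        ("tfic", TfidfVectorizer()),
        ("cvec", CountVectorizer()),
    ])),
    ("lr", LogisticRegression()),
])

# Keys follow the pattern <pipeline step>__<union member>__<parameter>
pipe_parms = [{
    "feats__tfic__max_features": [100, 500],
    "feats__tfic__ngram_range": [(1, 1), (1, 2)],
    "feats__cvec__max_features": [100, 500],
    "feats__cvec__ngram_range": [(1, 1), (1, 2)],
}]

# cv=2 only because the toy data set is tiny
grid = GridSearchCV(pipe, param_grid=pipe_parms, cv=2)
grid.fit(docs, labels)

print(grid.best_params_)
```

The double underscores walk down the nesting: feats is the pipeline step, tfic or cvec is the vectorizer inside the FeatureUnion, and the last part is that vectorizer's own parameter.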

Let’s see what FeatureUnion has done:

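FeatureUnion fits each vectorizer independently on the same raw text. At transform time, each vectorizer transforms the data separately, and the resulting feature matrices are concatenated side by side before being fed to the classifier. The vectorizers can also run in parallel when the n_jobs parameter of FeatureUnion is set.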