How dimensionality is reduced by adding features

Source: Deep Learning on Medium

It’s all about the question you are asking

**Disclaimer: this story is based on years of experience looking at influenza virus evolution, with some alterations to the details and numbers to make the concepts easier to explain.**

Two patients with flu-like symptoms (fever, sore muscles, etc.) walk into an emergency room (ER). One, whom we’ll call Jessica, complains that she feels nauseous; the other, whom we’ll call Marian, does not. A flu virus has about 15,000 nucleotides, and one person can host thousands to millions of viruses throughout their infection. A person’s genome has about 6 billion nucleotides. On top of each gene sit epigenetic molecules that add more complexity to the differences between these two flu-infected individuals.

Can we determine the cause of the difference in these two patients’ symptoms?

With just two patients, we can’t. But if we had 100 patients, and 30% of them presented with nausea, like Jessica, while the other 70% did not, like Marian, I think we could. Why? Because nausea is a symptom that only sometimes accompanies a flu infection. For example, it is known to accompany flu more often when the virus comes from an animal reservoir, as with bird or swine flu. Nausea was reported more often during the swine flu pandemic of 2009 than by flu-infected patients the year before. But there were only four pandemics in the last century, and a pandemic strain is very different from the strains that were circulating in humans the year before. Seasonal strains differ less from one another, yet they too sometimes produce detectable differences in symptoms. In my opinion, data that capture these instances are worth the gold it costs to collect them.

On a global level, with hundreds of thousands of people infected each year, recording changes in patient symptoms that arise one year, are gone the next, and only resurface every 5–10 years gives us a better chance of isolating the cause of the symptom. The cause is something that changed on the same timeframe, which narrows the possible culprits to things that also change every 5–10 years. These can broadly be categorized as (1) the patients’ adaptive immune sensors or (2) the virus. By adding a time feature (the date on which each flu-infected person walked into an ER and reported whether or not they felt nauseous), a researcher can focus on the differences between sequences before and after the change in reported symptoms.
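To make the time feature concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the years, the patient counts, the function names, and the 0.15 threshold are assumptions, not real surveillance data or an established method. It groups ER records by year, computes the share of patients reporting nausea, and flags consecutive years where that share shifts sharply.

```python
from collections import defaultdict

# Hypothetical ER records: (year of visit, whether the patient reported nausea).
# The years and counts are invented for illustration only.
records = (
    [(2001, True)] * 5 + [(2001, False)] * 95
    + [(2002, True)] * 30 + [(2002, False)] * 70
    + [(2003, True)] * 6 + [(2003, False)] * 94
)

def nausea_rate_by_year(records):
    """Return {year: fraction of patients reporting nausea}."""
    counts = defaultdict(lambda: [0, 0])  # year -> [nauseous, total]
    for year, nauseous in records:
        counts[year][0] += int(nauseous)
        counts[year][1] += 1
    return {year: n / total for year, (n, total) in sorted(counts.items())}

def years_with_shift(rates, min_change=0.15):
    """Return consecutive year pairs whose nausea rate differs by at least min_change.

    The threshold is an arbitrary assumption, not a statistical test.
    """
    years = sorted(rates)
    return [
        (a, b)
        for a, b in zip(years, years[1:])
        if abs(rates[b] - rates[a]) >= min_change
    ]

rates = nausea_rate_by_year(records)
shifts = years_with_shift(rates)
print(rates)
print(shifts)
```

The flagged year pairs are the boundaries worth sequencing around: viruses sampled just before and just after a shift should differ at far fewer positions than two viruses picked at random from different decades.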

How many differences are there in the nucleotides between the viruses inside Jessica vs those inside Marian? Let’s pretend there are 1,000 differences. This ‘low’ number of differences is because Jessica and Marian were infected in the same year. When you compare the differences between the viruses that infected the 30 nauseous and 70 non-nauseous patients, all during one flu season, you might find that 20 of the mutations were detected in a large portion of the 30 nauseous patients and absent from most of the non-nauseous patients.
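The comparison described above can be sketched as a simple filter. This is an illustration under stated assumptions: the patient records, the mutation labels (`m1` to `m5`), the function name, and both cutoffs are hypothetical, and a real analysis would use a proper statistical test and correct for multiple comparisons. The sketch keeps only mutations that are common among nauseous patients and rare among the rest.

```python
# Hypothetical patients: each has a nausea flag and the set of mutations
# found in their viral sample. Labels like "m1" are placeholders.
patients = [
    {"nauseous": True, "mutations": {"m1", "m2", "m5"}},
    {"nauseous": True, "mutations": {"m1", "m2", "m3"}},
    {"nauseous": True, "mutations": {"m1", "m4"}},
    {"nauseous": False, "mutations": {"m3", "m4"}},
    {"nauseous": False, "mutations": {"m3"}},
    {"nauseous": False, "mutations": {"m4", "m5"}},
    {"nauseous": False, "mutations": {"m3"}},
    {"nauseous": False, "mutations": {"m2"}},
    {"nauseous": False, "mutations": set()},
    {"nauseous": False, "mutations": {"m4"}},
]

def enriched_mutations(patients, min_in_nauseous=0.6, max_in_others=0.2):
    """Return mutations frequent in nauseous patients and rare in the others.

    Both cutoffs are arbitrary assumptions chosen for this toy example.
    """
    nauseous = [p["mutations"] for p in patients if p["nauseous"]]
    others = [p["mutations"] for p in patients if not p["nauseous"]]
    all_mutations = set().union(*nauseous, *others)

    hits = []
    for mutation in sorted(all_mutations):
        frac_nauseous = sum(mutation in s for s in nauseous) / len(nauseous)
        frac_others = sum(mutation in s for s in others) / len(others)
        if frac_nauseous >= min_in_nauseous and frac_others <= max_in_others:
            hits.append(mutation)
    return hits

candidates = enriched_mutations(patients)
print(candidates)
```

In the story, this step is what takes the list from roughly 1,000 differences down to about 20 candidates.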

Now, let’s say you are a postdoc funded by an NIH grant, so you only have time and money to perform a few experiments to identify a cause of nausea. It will help to add more features. For example, we can add a column for each mutation that indicates if that mutation is in a protein or a known regulatory RNA. We could add the name of the protein and go even further to add a picture of the protein with the mutation indicated.

If 5 of your 20 common mutations end up at different locations in a functional site of one protein, say a known binding site, then it makes sense to study how the binding of that protein to its ligand affects nausea. And, if you wanted to develop a drug that reduced nausea, you could target the interaction of that protein with its ligand or any other interactions along that pathway.
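Adding annotation columns and then grouping by them can be sketched in a few lines. The annotations below are invented: `protein_A`, `protein_B`, the site names, and the mutation labels are placeholders, not real influenza proteins or a real annotation database. The sketch counts how many candidate mutations fall in each annotated functional site and ranks the sites, so the most crowded one surfaces first.

```python
from collections import Counter

# Hypothetical annotation table: candidate mutation -> (protein, functional site).
# A site of None means the mutation is not in any known functional site.
annotations = {
    "m01": ("protein_A", "binding_site"),
    "m02": ("protein_A", "binding_site"),
    "m03": ("protein_B", "catalytic_site"),
    "m04": ("protein_A", "binding_site"),
    "m05": ("protein_C", None),
    "m06": ("protein_B", None),
}

def rank_functional_sites(annotations):
    """Count candidate mutations per (protein, site), most crowded site first."""
    counts = Counter(
        (protein, site)
        for protein, site in annotations.values()
        if site is not None
    )
    return counts.most_common()

ranked = rank_functional_sites(annotations)
top_site, top_count = ranked[0]
print(ranked)
print(top_site, top_count)
```

A site that collects several independent candidate mutations is a stronger lead for a limited experimental budget than any single mutation on its own.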

In this story, adding a time feature allowed researchers to focus on things that change over time and to find a point in time where the number of differences (the total feature number) was smallish. Adding further features, including protein name and structure, allowed the researcher to focus more precisely on what was important to pursue experimentally.

What is the equivalent of a time feature in cancer? Cell lineage information. Focusing on specific timepoints or intermediate cell states can significantly reduce the number of features for which secondary metadata should be generated in order to identify features worth further investigation.