The AI behind Search Engines

Original article was published on Artificial Intelligence on Medium

Understanding the AI behind Search Engines

There used to be a time when a group of friends at dinner could ask a question like “is a hot dog a sandwich?” and it would turn into a basic shouting match with lots of gesturing and hypothetical examples. But now, we have access to a LOT of human knowledge in the palm of our hands… so our friends can look up memes and dictionary definitions and pictures of sandwiches to prove that none of them have a connected bun like hot dogs (disappointed).

Search engines are a huge part of modern life. They help us access information, find directions to places, shop, and participate in sandwich arguments. But how does Google find answers to questions?

How are Siri and Alexa so smart but also easily stumped? How did IBM’s Watson beat the best Jeopardy players in the world? Well, search engines are just AI systems that are getting better and better at helping us find what we’re looking for.

Search Engines are just Librarians

When we talk about search engines, we typically think about the AI systems online, like Google, Bing, DuckDuckGo, and Ask Jeeves. But the basic ideas behind non-AI search engines have existed for centuries.

Gather data

For example, when you needed an answer to a question and couldn’t search online, you could go to the library! Libraries gather data in the form of books and newspapers that are stacked neatly on the shelves.

Organization systems

Librarians have organization systems to help you find what you’re looking for. Knowing that magazines are on shelves by the water fountain, while kids’ books are on the second floor, is a kind of organization. Plus, fiction books are sorted by the author’s last name, while nonfiction has the Dewey Decimal System, and so on.

Finding results

Once you (or the librarian) have the resources you need, you’ll be able to find results to your question! Now, rather than looking through books, web search engines look through all the data on the World Wide Web, aka “the Web”. And instead of asking a human librarian where to find information, we ask an AI instead.

1. Web crawler

As with most AI systems, the first step is to gather lots of data. To gather data on the Web, we can use a computer program called a Web crawler, which systematically finds and downloads Web pages.

This is a HUGE task, and it happens before the search engine AI can take any questions. The crawler starts on some Web page that we pick, called a seed, downloads that page, and finds all its links. Then, the crawler downloads each of the linked Web pages and finds their links, and so on… until we’ve crawled the whole Web.
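The crawl described above can be sketched as a breadth-first walk over links. This is only a minimal illustration under stated assumptions: `TOY_WEB` and its page names are made up, and `toy_fetch_links` stands in for the real work of downloading a page and parsing out its links.

```python
from collections import deque

# Hypothetical miniature "Web": each page maps to the pages it links to.
TOY_WEB = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": ["seed.example"],
    "c.example": [],
}


def toy_fetch_links(page):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(page, [])


def crawl(seed, fetch_links):
    """Visit every page reachable from the seed, breadth-first."""
    visited = []           # pages in the order we "downloaded" them
    seen = {seed}          # pages already queued, so none is fetched twice
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl("seed.example", toy_fetch_links)
print(pages)
```

The `seen` set matters because Web pages link back to each other all the time; without it, the crawler would loop forever between `seed.example` and `b.example`.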

2. Indexing

After we have collected all the data, the AI’s next step is to organize it by building an index, which is a kind of lookup system. The kind that’s used for organizing Web pages is called an inverted index, which is like the index in the back of a textbook. For each word, it lists all of the Web pages that contain that word.

Usually, the Web pages are represented by ID numbers so we don’t have a long, messy list of URLs. When Siri says “I found this for you,” the AI is just returning a list of Web pages that contain the same terms as the question.
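To make the inverted index concrete, here is a small sketch. The three pages and their ID numbers are invented for illustration, and splitting text on spaces is a simplifying assumption; real search engines do much more careful text processing.

```python
from collections import defaultdict

# Hypothetical pages, keyed by ID number instead of URL.
PAGES = {
    1: "hot dog recipes",
    2: "is a hot dog a sandwich",
    3: "sandwich recipes",
}


def build_inverted_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the IDs of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


index = build_inverted_index(PAGES)
print(search(index, "hot dog"))
print(search(index, "sandwich recipes"))
```

Looking up a word is just a dictionary lookup, which is why the index is built ahead of time rather than scanning every page for every question.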

3. Ranking Pages

Most search engines include one more step. There are millions of pages online that contain the same terms. So it’s important for search engines to rank Web pages so that the top result is more likely to be relevant than the tenth result or the hundredth. Of course, Google and Bing don’t hire “supervisors” to grade each possible question and answer to help their AI systems learn from training data. That would take forever, and they wouldn’t be able to keep up with all the new content that gets created every day.

Really, regular users like us do this training for free all the time. Every time we use a search engine, our behavior tells the AI whether or not the results answered our question. For example, if we type “who is Genghis Khan” into a search engine and click on a Web page about Star Trek II: The Wrath of Khan, we might be disappointed to find that Genghis Khan isn’t ANYWHERE in that movie. So we’ll bounce back to the search results, and try again until we find a page that answers our question. A bounce indicates a bad result.

But if we click on a Wikipedia article about Genghis Khan and stay for a while reading, that’s a click-through, which probably means that we found what we were looking for… so that indicates a good result. Human behavior like bounces and click-throughs gives AI systems the training data they need to learn how to rank search results and better answer our questions. Data from the Web and data from how we use the Web help make better and better search engines.
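As a rough sketch of how bounces and click-throughs could turn into a ranking, the example below scores each page by its smoothed share of click-throughs. The log entries, page names, and scoring formula are all assumptions made for illustration; real search engines combine many more signals and do not publish their exact methods.

```python
from collections import defaultdict

# Hypothetical behavior log: (query, page, outcome).
LOG = [
    ("who is genghis khan", "wrath-of-khan", "bounce"),
    ("who is genghis khan", "wikipedia-genghis-khan", "click_through"),
    ("who is genghis khan", "wikipedia-genghis-khan", "click_through"),
    ("who is genghis khan", "wrath-of-khan", "bounce"),
    ("who is genghis khan", "wrath-of-khan", "click_through"),
]


def rank_pages(log, query):
    """Order pages for a query by how often visitors stayed on them."""
    good = defaultdict(int)    # click-throughs per page
    total = defaultdict(int)   # all visits per page
    for logged_query, page, outcome in log:
        if logged_query != query:
            continue
        total[page] += 1
        if outcome == "click_through":
            good[page] += 1
    # Add-one smoothing so a page with very few visits is not scored 0 or 1.
    scores = {page: (good[page] + 1) / (total[page] + 2) for page in total}
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_pages(LOG, "who is genghis khan")
print(ranking)
```

Here the Wikipedia page scores 3/4 and the movie page scores 2/5, so the Wikipedia page is ranked first.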

Voice Search Engines

Now, sometimes we ask our smart devices questions and we want actual answers… not links to Web pages. When I say “OK Google, what’s the weather like in Indianapolis?” I don’t want to scroll through results.

For this kind of problem, instead of using an inverted index, AIs rely on knowledge bases. A knowledge base encodes information about the universe as relationships between objects.
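One common way to picture “relationships between objects” is as a set of (subject, relation, object) facts. The sketch below assumes that representation; the `Fact` type, the relation names, and the helper function are hypothetical and only meant to show the shape of the data.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str    # the object the fact is about
    relation: str   # how the two objects are connected
    obj: str        # the object it is connected to


# A tiny hand-written knowledge base (illustrative only).
knowledge_base = {
    Fact("Mozart", "plays", "Piano"),
    Fact("Mozart", "playsGenre", "Classical"),
    Fact("Toni Morrison", "wrote", "The Bluest Eye"),
}


def objects_related_to(kb, subject, relation):
    """List every object linked to `subject` by `relation`."""
    return sorted(f.obj for f in kb if f.subject == subject and f.relation == relation)


print(objects_related_to(knowledge_base, "Mozart", "plays"))
```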

One of the main problems with knowledge bases is that it’s really hard to write down all of the facts in the universe, especially common sense things that humans take for granted but computers need to be told.

Never-Ending Language Learner

Enter AI researcher Tom Mitchell and his team of scientists from Carnegie Mellon University. In 2010, they created a huge knowledge base called the Never-Ending Language Learner or NELL, which was able to extract hundreds of thousands of facts from random Web pages.

NELL starts with some facts provided by a human, for example, the genre of music that Mozart plays is classical.

Then, NELL gets to work and reads through Web pages one by one, looking for words mentioned in those facts.

Maybe it finds the text “Mozart plays the piano.” NELL doesn’t know much about these symbols, but this text matches the same pattern as one of the facts provided by a human, specifically, the “plays” relationship. So NELL learns a new object: Piano. And a new fact: Mozart plays Piano.
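The pattern-matching step can be sketched with plain string handling. This is not NELL’s actual algorithm, just a toy version of the idea: the function name, the article-stripping rule, and the exact sentence format it accepts are assumptions.

```python
def extract_fact(sentence, known_subjects, known_relations):
    """Return (subject, relation, object) if the sentence fits a known pattern."""
    text = sentence.strip().rstrip(".")
    for relation in known_relations:
        marker = f" {relation} "
        if marker not in text:
            continue
        subject, obj = text.split(marker, 1)
        if subject not in known_subjects:
            continue
        # Drop a leading article so "the piano" becomes "piano".
        for article in ("the ", "a ", "an "):
            if obj.lower().startswith(article):
                obj = obj[len(article):]
                break
        # Capitalize the first letter to name the new object.
        return (subject, relation, obj[:1].upper() + obj[1:])
    return None


fact = extract_fact("Mozart plays the piano.", {"Mozart"}, {"plays"})
print(fact)
```

The code never needs to know what a piano is. It only notices that the text has the same shape as a fact it already trusts, which is the point the paragraph above makes.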

By searching over the entire Web, NELL can learn lots of facts based on just the three original ones that humans gave it! Some facts might appear hundreds or thousands of times online.

But NELL might also find facts that are mentioned SOMEWHERE online and extract them as potentially true. Like, for example, Darth Vader plays Kloo Horn. We just don’t know! Just like how we look for multiple sources when writing a paper, NELL uses repetition and multiple sources to build confidence that the facts it’s finding are actually true.
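Building confidence from repetition can be sketched by counting how many different pages mention each fact. The extraction list, the page names, and the threshold of three sources are all made up for illustration; how NELL actually scores confidence is not described here.

```python
from collections import defaultdict

# Hypothetical extractions: (fact, page where it was found).
extractions = [
    (("Mozart", "plays", "Piano"), "page-1"),
    (("Mozart", "plays", "Piano"), "page-2"),
    (("Mozart", "plays", "Piano"), "page-2"),  # same page again: not a new source
    (("Mozart", "plays", "Piano"), "page-3"),
    (("Darth Vader", "plays", "Kloo Horn"), "page-4"),
]


def confident_facts(extractions, min_sources=3):
    """Keep only facts that appear on at least `min_sources` distinct pages."""
    sources = defaultdict(set)
    for fact, page in extractions:
        sources[fact].add(page)
    return {fact for fact, pages in sources.items() if len(pages) >= min_sources}


trusted = confident_facts(extractions)
print(trusted)
```

Counting distinct pages rather than raw mentions keeps one page that repeats itself from looking like several independent sources.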

To consider other relationships, NELL uses the highly confident facts it learned and searches through the Web again. Only this time, NELL is looking for new relationships. Maybe it finds the text “Darth Vader cuts off Luke Skywalker’s hand,” and NELL learns a new (very specific) relationship: cutsOffHand.

Over and over again, NELL will use known relationships to find new objects, and known objects to find new relationships — creating a huge knowledge base.

AI systems can use huge knowledge bases, like this one extracted by NELL, to answer our questions directly.

How Voice Assistants Use Knowledge Bases

Instead of using the words from our questions to search through an inverted index, an AI like Siri can reformulate our questions into incomplete facts and then look for matches in a knowledge base.

Say we ask Siri, “Who wrote The Bluest Eye?” The AI could then reformulate that question into an incomplete fact, replacing “who” with a question mark. If Siri extracted that information earlier, it can find matches in its knowledge base and return the most confident result: Toni Morrison wrote The Bluest Eye!
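Matching an incomplete fact against a knowledge base can be sketched as below. The confidence numbers are invented, the `"?"` placeholder convention is an assumption, and turning a spoken question into the incomplete fact (the hard part) is not shown.

```python
# Hypothetical knowledge base: (subject, relation, object) -> confidence.
KB = {
    ("Toni Morrison", "wrote", "The Bluest Eye"): 0.98,
    ("Mozart", "plays", "Piano"): 0.95,
    ("Darth Vader", "plays", "Kloo Horn"): 0.10,
}


def answer(kb, incomplete_fact):
    """Return the most confident fact that agrees with every filled-in slot."""
    best_fact = None
    best_confidence = 0.0
    for fact, confidence in kb.items():
        # A "?" slot matches anything; other slots must match exactly.
        fits = all(q == "?" or q == f for q, f in zip(incomplete_fact, fact))
        if fits and confidence > best_confidence:
            best_fact, best_confidence = fact, confidence
    return best_fact


# "Who wrote The Bluest Eye?" becomes (?, wrote, The Bluest Eye).
result = answer(KB, ("?", "wrote", "The Bluest Eye"))
print(result)
```

If nothing in the knowledge base fits, `answer` returns `None`, which is roughly the moment a voice assistant says it doesn’t know.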

Search Engine Limitations

Using all these strategies, search engines have become really good at answering common questions. But questions like “How many trees are in Ohio?” or “How many hot dogs are eaten in the South Sandwich Islands annually?” still stump most AI systems, because not enough people ask them and AI hasn’t learned how to answer them well yet.

It’s also important to watch out for search engine answers to questions like “Who invented the time machine?” because AI systems have a tough time with nuance and incomplete data.

And a big, sort of hidden, problem is that search engine AI systems are influenced by any biases in data online. For example, if I ask Google for images of “nurses,” it will mostly show pictures of female nurses.