Entity Extractor from Google SERPs

Dorai Thodla edited this page Nov 22, 2020 · 1 revision

Goal:

Find how well the search terms align with the top keywords and phrases in the result documents

Input:

  1. A search string (for example: "What is AI")
  2. Keyword/phrase count (n): the number of top terms to report

Output:

  1. The top n unigrams, bigrams, and trigrams in the search result pages (aka SERPs), ranked by frequency
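
To make the three term sizes concrete, here is a minimal Python sketch. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration; the page does not prescribe either.

```python
def ngrams(tokens, size):
    """Return every contiguous run of `size` tokens, joined by spaces."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Assumption: simple lowercase, whitespace-separated tokens.
tokens = "what is ai".split()

unigrams = ngrams(tokens, 1)  # ['what', 'is', 'ai']
bigrams = ngrams(tokens, 2)   # ['what is', 'is ai']
trigrams = ngrams(tokens, 3)  # ['what is ai']
```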

Process:

  1. Get the search term/expression
  2. Perform a Google query
  3. Get the results
  4. For each result page, extract text
  5. Split text into sentences
  6. For each sentence, extract terms (unigrams/bigrams/trigrams)
  7. Count the term frequency for each document (search result page)
  8. Store it in a dictionary or database
  9. Create a report of the top n terms in descending order of frequency
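
Steps 4 to 9 could be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the project's implementation: all function and class names are hypothetical, the sentence splitter and tokenizer are deliberately naive regular expressions, and the Google query and page download (steps 1 to 3) are left out, so the sketch starts from HTML that has already been fetched.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Step 4: reduce an HTML page to whitespace-normalized text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def split_sentences(text):
    """Step 5: naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def sentence_terms(sentence, max_size=3):
    """Step 6: yield the unigrams, bigrams, and trigrams of one sentence."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    for size in range(1, max_size + 1):
        for i in range(len(tokens) - size + 1):
            yield " ".join(tokens[i:i + size])


def count_terms(html):
    """Step 7: term frequencies for a single result page."""
    counts = Counter()
    for sentence in split_sentences(extract_text(html)):
        counts.update(sentence_terms(sentence))
    return counts


def top_terms(pages, n, max_size=3):
    """Steps 8 and 9: keep per-page counts in a dict, then report the
    top n terms of each size across all pages, most frequent first."""
    per_page = {url: count_terms(html) for url, html in pages.items()}
    total = Counter()
    for counts in per_page.values():
        total.update(counts)
    report = {}
    for size in range(1, max_size + 1):
        sized = Counter({t: c for t, c in total.items() if len(t.split()) == size})
        report[size] = sized.most_common(n)
    return per_page, report


# Hypothetical input: URL -> HTML that was fetched beforehand.
pages = {
    "https://example.com/a": "<p>AI is great. AI is here!</p><script>var x = 1;</script>",
}
per_page, report = top_terms(pages, n=2)
for size, terms in report.items():
    print(size, terms)
```

Terms are extracted sentence by sentence, so no bigram or trigram spans a sentence boundary. The per-page counters are kept alongside the combined report because the process counts frequency per document; swapping the dictionary for a database would only change `top_terms`.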