Sat Dec 22 14:03:45 PST 2018






Gerard Salton
The phrase vector space model, which search algorithms still heavily rely upon today,
goes back to the 1970s. Gerard Salton was a well-known expert in the field of
information retrieval who pioneered many of today's methods. If you are
interested in learning more about early information retrieval systems, you may want
to read A Theory of Indexing, which is a short book by Salton that describes many of
the common terms and concepts in the information retrieval field.
Mike Grehan's book, Search Engine Marketing: The Essential Best Practices Guide, also
discusses some of the technical aspects of information retrieval in more detail than
this book. My book was created to be a current how-to guide, while his is geared more
toward explaining how information retrieval works.
Parts of a Search Engine
While there are different ways to organize web content, every crawling search
engine has the same basic parts:
• a crawler
• an index (or catalog)
• a search interface
Crawler (or Spider)
The crawler does just what its name implies. It scours the web following links,
updating pages, and adding new pages when it comes across them. Each search
engine has periods of deep crawling and periods of shallow crawling. There is also
a scheduler mechanism to prevent a spider from overloading servers and to tell the
spider what documents to crawl next and how frequently to crawl them.
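The scheduler idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular search engine works: the class name CrawlScheduler, the host_delay setting, and the example URLs are all hypothetical. It keeps a queue of pages ordered by when they are next due, and it refuses to hit the same host again until a minimum delay has passed.

```python
import heapq
from urllib.parse import urlparse


class CrawlScheduler:
    """Toy scheduler: hands out the next due URL without hammering one host."""

    def __init__(self, host_delay=5.0):
        self.host_delay = host_delay  # assumed minimum seconds between hits on a host
        self._queue = []              # heap of (due_time, url, recrawl_interval)
        self._host_ready = {}         # host -> earliest time it may be hit again

    def add(self, url, recrawl_interval, due=0.0):
        """Register a page and how often (in seconds) it should be revisited."""
        heapq.heappush(self._queue, (due, url, recrawl_interval))

    def next_url(self, now):
        """Return a URL to fetch at time `now`, or None if nothing can be fetched."""
        deferred = []
        chosen = None
        while self._queue and self._queue[0][0] <= now:
            due, url, interval = heapq.heappop(self._queue)
            host = urlparse(url).netloc
            if self._host_ready.get(host, 0.0) <= now:
                # The host has rested long enough: fetch, then schedule the next visit.
                self._host_ready[host] = now + self.host_delay
                heapq.heappush(self._queue, (now + interval, url, interval))
                chosen = url
                break
            # The page is due but its host was hit too recently: try again later.
            deferred.append((due, url, interval))
        for item in deferred:
            heapq.heappush(self._queue, item)
        return chosen
```

With two pages on one host and one page on another, the second page on the busy host is held back until the delay has passed, while the page on the other host is served right away.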
Rapidly changing or highly important documents are more likely to get crawled
frequently. The frequency of crawl should typically have little effect on search
relevancy; it simply helps the search engines keep fresh content in their index. The
home page of CNN.com might get crawled once every ten minutes. A popular,
rapidly growing forum might get crawled a few dozen times each day. A static site
with little link popularity and rarely changing content might only get crawled once
or twice a month.
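One simple way to picture how crawl frequency might track how often a page changes is the sketch below. It is an assumed heuristic for illustration only: the function next_interval, the halve-or-double rule, and the bounds (ten minutes and roughly a month, borrowed from the examples above) are hypothetical, and it ignores page importance entirely.

```python
MIN_INTERVAL = 10 * 60            # ten minutes, in seconds (assumed lower bound)
MAX_INTERVAL = 30 * 24 * 60 * 60  # roughly a month, in seconds (assumed upper bound)


def next_interval(current, changed):
    """Return the wait, in seconds, before the next crawl of a page.

    If the page changed since the last visit, come back twice as soon;
    if it did not, wait twice as long. The result stays within the bounds.
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

A page that keeps changing drifts toward the ten-minute floor, while a static page drifts toward the monthly ceiling.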
The main benefit of having a frequently crawled page is that you can get your new
sites, pages, or projects crawled quickly by linking to them from a powerful or
frequently changing page.
