What is a Search Engine?
A software program that retrieves information from a computer system and gives some of it back when you ask nicely. What you get back depends on:
- How the search engine software works
- What has been put into the search engine
- Certain characteristics of your documents (many of which you can control)
How Does a Search Engine Work?
Collection of information:
- The software 'crawls' with a 'spider' or 'bot':
- Collects information by following links and/or directories and finding files
- You can control what is collected from your site. What is collected depends on:
- What it knows is available – some have to be told!
- Site maps or Custom XML documents – you can build your own sitemap to suit Google
- Links to files from files
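To illustrate the site map item above, here is a minimal sketch that builds a sitemap in the sitemaps.org XML format with Python's standard library. The function name build_sitemap and the example.edu addresses are made up for illustration; substitute the pages you want crawled.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in page_urls:
        url_element = ET.SubElement(urlset, "url")
        ET.SubElement(url_element, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, for illustration only.
pages = [
    "https://www.example.edu/courses/",
    "https://www.example.edu/courses/handbook.pdf",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The resulting text would normally be saved as a file on the site so that a crawler can be told where to find it.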
NOTE: you can exclude search engines – very important for sensitive information
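One common way to exclude crawlers is a robots.txt file. The sketch below uses Python's standard urllib.robotparser to check which addresses a given set of rules would allow. The rules, the /staff-only/ directory, and the helper name is_crawlable are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask all crawlers to stay out of /staff-only/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staff-only/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="*"):
    """Return True if the rules above allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_crawlable("https://www.example.edu/courses/"))
print(is_crawlable("https://www.example.edu/staff-only/salaries.html"))
```

Note that robots.txt is a request that well-behaved crawlers honour, not access control, so genuinely sensitive files still need proper protection.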
Building an index, which is a reference to:
- Some of the words in a file and where in the file they are
- Some words are left out eg. the, a, an
- Important parts of a page
- You control what is in the important parts of a page:
- Page title – most important
- Link text – use concrete, informative words
- Headings, eg. H1, H2
- Body of the document, with important content higher up the page – you control the order of information
- Less important: Metadata – larger search engines prefer other page parts, although they may use metadata in the absence of any other usable text
NOTE: There may be a limit set on the amount indexed:
- 101 KB for HTML and 120 KB for PDF in Google
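The idea of an index can be sketched as a mapping from each word to the files it appears in and its positions there. This is a simplified illustration, not how any particular engine stores its index. The stop-word list contains only the three examples given above, and the file names and text are hypothetical.

```python
import re
from collections import defaultdict

# Only the examples from the text; real stop-word lists are longer.
STOP_WORDS = {"the", "a", "an"}


def build_index(documents):
    """Map each indexed word to {document name: [word positions]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word in STOP_WORDS:
                continue  # stop words are left out of the index
            index[word].setdefault(name, []).append(position)
    return index


# Hypothetical documents.
docs = {
    "enrol.html": "How to enrol in a course",
    "fees.html": "The course fees and the payment dates",
}
index = build_index(docs)
print(index["course"])  # {'enrol.html': [5], 'fees.html': [1]}
```

Positions are counted before stop words are dropped, so they still say where in the file each word sits.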
Ranking or weighting with algorithms:
- A set of rules to decide which parts of a page and which kinds of words are most important
- These differ from one engine to another
- Generally, most use the location/frequency method, in which the most important words are:
- Near the top of the page in 'important' page parts
- Mentioned first and more than once – but not TOO often
- You control these
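A rough sketch of the location/frequency idea follows: a word scores more when it appears in a more important page part, and repeated mentions count only up to a cap. The weights, the cap, and the sample pages are assumptions chosen for illustration; real engines use their own rules and do not publish them.

```python
import re

# Assumed weights for illustration only.
PART_WEIGHTS = {"title": 5.0, "headings": 3.0, "body": 1.0}
MAX_COUNTED = 3  # mentions beyond this add nothing ("not TOO often")


def score_page(page, term):
    """Score a page for one search term by where and how often it appears."""
    term = term.lower()
    score = 0.0
    for part, weight in PART_WEIGHTS.items():
        words = re.findall(r"[a-z0-9]+", page.get(part, "").lower())
        mentions = min(words.count(term), MAX_COUNTED)
        score += weight * mentions
    return score


# Hypothetical pages: the same sentence, with and without a useful title and heading.
focused = {
    "title": "Course fees",
    "headings": "Fees and payment dates",
    "body": "Fees are due in March.",
}
buried = {
    "title": "Welcome",
    "headings": "General information",
    "body": "Some text. Fees are due in March.",
}

print(score_page(focused, "fees"))  # 9.0
print(score_page(buried, "fees"))   # 1.0
```

The sketch ignores how far down the body a word appears; a fuller version could also reduce the weight of words found lower on the page.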
Processing queries and display of results:
- Queries and Results:
- Search engines break down the query, and return results based on rank
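As a minimal sketch of query processing, the function below breaks a query into words, drops stop words, adds up a per-word score for each page, and returns the pages in rank order. The WORD_SCORES table and page names are hypothetical stand-ins for whatever an engine's index and ranking rules would supply.

```python
import re

STOP_WORDS = {"the", "a", "an"}

# Hypothetical precomputed scores: word -> {page: score for that word}.
WORD_SCORES = {
    "course": {"enrol.html": 2.0, "fees.html": 5.0},
    "fees": {"fees.html": 8.0},
}


def search(query):
    """Break the query into words and return matching pages, best first."""
    words = [
        word
        for word in re.findall(r"[a-z0-9]+", query.lower())
        if word not in STOP_WORDS
    ]
    totals = {}
    for word in words:
        for page, score in WORD_SCORES.get(word, {}).items():
            totals[page] = totals.get(page, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)


print(search("The course fees"))  # ['fees.html', 'enrol.html']
```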
Page-ranking engines, eg. UTAS search engine:
- Results based on presence of search words in 'important' parts of a page
- You control the important page parts in your site
Site-ranking engines, eg. Google:
- Results based on presence of 'important' words in 'important' sites with lots of inbound links
- Or Money:
- You can always pay for better placement, eg. Google, Yahoo
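The notion of an 'important' site with lots of inbound links can be illustrated by simply counting, for each site, how many other sites link to it. This is a deliberate simplification; real site-ranking engines weigh links in more elaborate ways. The link graph and site names below are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: each site and the sites it links to.
OUTBOUND_LINKS = {
    "site-a.example": ["site-c.example"],
    "site-b.example": ["site-c.example", "site-a.example"],
    "site-c.example": ["site-a.example"],
    "site-d.example": ["site-c.example"],
}


def count_inbound_links(outbound_links):
    """Count how many distinct other sites link to each site."""
    inbound = Counter()
    for source, targets in outbound_links.items():
        for target in set(targets):  # count each linking site once
            if target != source:  # ignore links from a site to itself
                inbound[target] += 1
    return inbound


inbound = count_inbound_links(OUTBOUND_LINKS)
print(inbound.most_common())  # [('site-c.example', 3), ('site-a.example', 2)]
```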
Some Ways to Get Excluded From Most Search Engines:
Google penalises sites for most of these, particularly:
- Anchor SPAM (using the same few words in lots of inbound links)
- Too many trivial changes to content
- Certain parameters in dynamic URLs eg. &id=
Some sites may not be excluded but filtered by 'safe' options eg. SafeSearch in Google filters adult content
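To find pages affected by the dynamic URL item above, a small check like the following can flag addresses that carry a given query parameter. The parameter list holds only "id", the one example named above; the function name and sample addresses are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed list for illustration; the text above names only "id".
RISKY_PARAMETERS = {"id"}


def risky_parameters(url):
    """Return the query parameters in url that appear in RISKY_PARAMETERS."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return sorted(RISKY_PARAMETERS.intersection(query))


print(risky_parameters("https://www.example.edu/page.php?lang=en&id=42"))  # ['id']
print(risky_parameters("https://www.example.edu/courses/fees.html"))       # []
```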
Write Searchable HTML:
- Write a unique page title relevant to the page contents – MOST IMPORTANT
- Add some metadata, particularly the description and keywords
- Use structural elements properly eg.:
- Headings, H1, H2
- Use plain text links, not URLs
- Put keyword-rich text near the top of the page
- Use the inverted pyramid writing style, starting with the conclusion
- Add keyword-rich alt text to images (if it describes the content and remains accessible)
- Prune your pages to remove irrelevant content – moves important content closer to the top of the page!
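Several items in the checklist above can be checked automatically. The sketch below uses Python's standard html.parser to report whether a page has a title, a description, and an H1 heading, and how many images lack alt text. The class name, the sample page, and the choice of checks are assumptions for illustration; it does not judge whether the wording itself is any good.

```python
from html.parser import HTMLParser


class SearchabilityAudit(HTMLParser):
    """Record which page parts are present and count images without alt text."""

    def __init__(self):
        super().__init__()
        self.found = {"title": False, "description": False, "h1": False}
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("title", "h1"):
            self.found[tag] = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            # Only counts if the description actually has content.
            self.found["description"] = bool(attributes.get("content"))
        elif tag == "img" and not attributes.get("alt"):
            # Missing or empty alt attribute.
            self.images_missing_alt += 1


def audit(html):
    """Return (parts found, number of images with no alt text)."""
    checker = SearchabilityAudit()
    checker.feed(html)
    checker.close()
    return checker.found, checker.images_missing_alt


# Hypothetical page.
PAGE = """<html><head><title>Course fees</title>
<meta name="description" content="Fees and payment dates for courses">
</head><body><h1>Course fees</h1>
<img src="campus.jpg"><img src="fees-table.png" alt="Table of course fees">
</body></html>"""

print(audit(PAGE))  # ({'title': True, 'description': True, 'h1': True}, 1)
```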
Write Searchable Documents:
- Fill out the document 'properties' – this text will be displayed in search results
- Use structural elements properly in Word, PDF eg.:
- Headings
- Table of Contents
- Split large documents into sections:
- Get more content indexed
- Documents will load faster
- Provide alt text for images:
- Adds more 'indexable' text to your page
- If done well, this also improves accessibility for people with disabilities
How to Test the Searchability of your Documents:
Using the UTAS search engine:
- Search using 'common language' terms (what do your clients call what you do?)
- See where your document is ranked in results
- Examine higher ranking documents, looking for your search term in the important page parts
- Change your documents and wait for the documents to be reindexed
- You can also make a site-specific search with the UTAS search engine
Using Google (no user support is available):
- Wait 4-8 weeks for the Google collection to be refreshed