Posts tagged seo

What Search.gov Indexes From Your Website

Content

When we think about indexing pages for search, we usually think about indexing the primary content of the page. If a page includes a <main> element, we collect only the content inside it. If the page isn’t structured to tell the search engine where that content is to be found, we collect the contents of the <body> tag and filter out the <nav> and <footer> elements, if present. If <main>, <nav>, and <footer> are all absent, we collect the full contents of the <body> tag. Learn more in our post about aiming search engines at the content you really want to be searchable, using the <main> element.
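For example, in the hypothetical page sketched below, only the heading and paragraph inside <main> would be indexed; the navigation and footer would be filtered out:

<body>
<nav>Site navigation links (filtered out)</nav>
<main>
<h1>Page title</h1>
<p>The content you want to be searchable.</p>
</main>
<footer>Agency-wide footer links (filtered out)</footer>
</body>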

Metadata

You can read more detail on each of the following elements here; a combined sketch follows the list below.

Standard metadata elements

  • title
  • meta description
  • meta keywords
  • locale or language (from the opening <html> tag)
  • url
  • lastmod (collected from XML sitemaps)
  • og:description
  • og:title
  • article:published_time
  • article:modified_time
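Pulled together, a hypothetical page supplying these elements might look like the sketch below (the titles, descriptions, and dates are placeholders). The exceptions are url, which we take from the page’s address, and lastmod, which lives in your XML sitemap rather than on the page itself:

<html lang="en">
<head>
<title>Unique title of the page</title>
<meta name="description" content="Plain language summary of the page content." />
<meta name="keywords" content="keyword one, keyword two" />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Plain language summary of the page content." />
<meta property="article:published_time" content="2018-09-28" />
<meta property="article:modified_time" content="2018-09-28" />
</head>
<body>Page content</body>
</html>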

File formats

In addition to HTML pages with their various file extensions, Search.gov indexes the following file types:

  • PDFs
  • Word docs
  • Excel docs
  • TXT
  • Images can be indexed either using our Flickr integration, or by sending us an MRSS feed. Note that images are not indexed during web page indexing, so you’ll need to use one of these two methods.

Coming soon:

  • PowerPoint

Please note that at this time we cannot index JavaScript-rendered content, similar to most search engines (External link). We recommend that your team add well-crafted, unique description text for each of your pages, or perhaps auto-generate description tag text from the first few lines of the article text. However the text is added, it should include the keywords you want the page to respond to in search, framed in plain language. This will give us, and other search engines, something to work with when we’re matching and ranking results. See our discussion of description metadata for more information.

How Search.gov Ranks Your Search Results

Google and Bing hold their ranking algorithms closely as trade secrets, as a guard against people trying to game the system to ensure their own content comes out on top, regardless of whether that’s appropriate to the search. Search Engine Optimization (SEO) consulting has grown up as an industry to try to help websites get the best possible placement in search results. You may be interested in our webinars on technical SEO and best practices that will help you get your website into better shape for search, and we’re also available to advise federal web teams on particular search issues. Generally speaking, though, SEO is a lot like reading tea leaves.

We at Search.gov share our ranking factors because we want you to game our system. This helps ensure that the best, most appropriate content rises to the top of search results to help the American public find what they need.

This page will be updated as new ranking factors are added.

Guaranteed 1st Place Spot

For any pages you want always to appear in the top of search results, regardless of what the ranking algorithm might decide, use a Best Bet. Like an ad in the commercial engines, Best Bets allow you to pin recommended pages to the top of results. Text Best Bets are for single pages, and Graphics Best Bets allow you to boost a set of related items. Our Match Keywords Only feature allows you to put a tight focus on the terms you want a Best Bet to respond to. Read more here.

Ranking Factors

Each of the following ranking factors is calculated separately, and then multiplied together to create the final ranking score of a given item for a given search.
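In other words, the final score is roughly the product of the individual factors; a schematic sketch (not the exact internal formula) looks like this:

final_score = core_relevance_score × file_type_factor × freshness_factor × popularity_factor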

File Type

We prefer HTML documents over other file types. Non-HTML results are demoted significantly, to prevent, for instance, PDF files from crowding out their respective landing pages.

Freshness

We prefer documents that are fresh. Anything published or updated in the past 30 days is considered fresh. After that, we use a Gaussian decay function to demote documents, so that the older a document is, the more it is demoted. When documents are 5 years old or older, we consider them to be equally old and do not demote further. We use either the article:modified_time on an individual page, or that page’s <lastmod> date from the sitemap, whichever is more recent. If there is only an article:published_time for a given page, we use that date.

Documents with no date metadata at all are considered fresh and are not demoted. Read more about date metadata we collect and why it’s important to add metadata to your files.
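As an illustrative sketch only (the exact parameters are internal to our system, so the scale and floor below are assumptions), the freshness factor behaves roughly like this:

freshness_factor ≈ 1, for documents 30 days old or newer
freshness_factor ≈ exp( −(age − 30 days)² / (2 × scale²) ), for documents between 30 days and 5 years old
freshness_factor ≈ the 5-year floor value, for documents 5 years old or older (no further demotion)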

Page Popularity

We prefer documents that users interact with more. Currently we leverage our own search analytics to track the number of times a URL is clicked on from the results page. The more clicks, the more that URL is promoted, or boosted. We use a logarithmic function to determine how much to boost the relevance score for each URL.
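Schematically, a logarithmic boost means the jump from 10 to 100 clicks matters far more than the jump from 1,000 to 1,090. The exact form and constant below are assumptions for illustration only:

popularity_factor ≈ 1 + k × log(1 + clicks_from_our_results_pages)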

Note: Sites using the search results API to present our results on their own websites will not be able to take advantage of our click data ranking.

Core Ranking Algorithm

Our system is built on Elasticsearch, which itself is built on Apache Lucene. For the first several generations, Elasticsearch used Lucene’s default ranking, the Practical Scoring Function. This function starts with a basic Boolean match for single terms and adds in TF/IDF and a vector space model. Here are some high-level definitions for these technical terms:

  • Boolean matches are the AND / OR / NOT matches you’ve probably heard about.
    • This AND that
    • This OR that
    • This NOT that
    • This AND (that OR foo) NOT bar
    • Note that while the relevance ranking takes these into account, we do not currently use these operators if entered by a searcher. Support for user-entered Boolean operators is coming in 2019.
  • TF/IDF means term frequency / inverse document frequency. It counts the number of times a term appears in a document, and compares that to how many documents in the whole collection contain the term. It aims to identify documents where the query terms appear frequently; terms that are rare across the whole set of documents earn a higher score, while common terms that appear in many documents earn a lower score (see the worked sketch after this list).
    • Lucene also tempers the TF/IDF score with a method called BM25, which balances the scores of documents that are very different in length. If there are ten documents containing rare terms, the longest document with the most instances of those terms would otherwise get a much higher score than a short document with only a few instances. That makes intuitive sense, but when the long document is the full PDF of a report and the short one is the report’s summary, the full report isn’t that much more relevant to the query than the summary is. BM25’s length normalization addresses that issue.
  • The vector space model allows the search engine to weight the individual terms in the query, so a common term in the query would receive a lower match score than a rare term in the query.
  • Read detailed technical documentation here (External link)
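As a rough worked sketch of the classic formula (BM25 then layers on term-frequency saturation and length normalization), for a term t in a document d, with N documents in the index and df(t) of them containing t:

tf-idf(t, d) = tf(t, d) × log( N / df(t) )

For example, a term that appears 5 times in a document but shows up in only 10 of 10,000 documents scores 5 × log(1,000) = 15 (log base 10), while an equally frequent term that shows up in 5,000 of those documents scores only 5 × log(2) ≈ 1.5.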

The latest versions of Elasticsearch take into account the context of terms within the document: whether they are in structured data fields or in unstructured fields, like body text.

  • Structured data fields, like dates, are treated with a Boolean match method - does the field value match, or not?
  • Unstructured data fields, like webpage body content, are considered for how well a document matches a query.
  • Read highly technical documentation here (External link)

Metadata and tags you should include in your website

Search.gov, like other search engines, relies on structured data to help inform how we index your content and how it is presented in search results. You should also read up on the metadata and structured data used by Google (External link) and Bing (External link).

Including the following tags and metadata in each of your pages will improve the quality of your content’s indexing, as well as results ranking. We also encourage you to read about more HTML5 semantic markup (External link) you can include in your websites.

This page will be updated over time as we add more tag-based indexing functions and ranking factors to our service.

<title>
Detail: Unique title of the page. If you want to include the agency or section name, place that after the actual page title.
Used in: Query matching, term frequency scoring

<meta name="description" content="foo" />
Detail: Your well-crafted, plain language summary of the page content. This will often be used by search engines in place of a page snippet. Be sure to include the keywords you want the page to rank well for. Best to limit it to 160 characters so it will not be truncated. Read more here (External link).
Used in: Query matching, term frequency scoring

<meta name="keywords" content="foo bar baz" />
Detail: While not often used by commercial search engines due to keyword stuffing (External link), Search.gov indexes your keywords, if you have added them.
Used in: Query matching, term frequency scoring

<meta property="og:title” content=”Title goes here” />
Detail: Usually duplicative of <title>, we use the og:title property as the result title if it appears to be more substantive than the <title> tag. Note, Open Graph elements are used to display previews of your content in FaceBook and some other social media platforms.
Used in: Query matching, term frequency scoring

<meta property="og:description” content=”Description goes here” />
Detail: Often duplicative of the meta description, we index this field as well, in case it has different content. This field is a good opportunity to include more keywords than you could write into the meta description. Note, Open Graph elements are used to display previews of your content in FaceBook and some other social media platforms.
Used in: Query matching, term frequency scoring

<meta property="article:published_time" content="YYYY-MM-DD" />
Detail: Exact time is optional; read more here (External link).
Used in: Page freshness scoring.

<meta property="article:modified_time" content="YYYY-MM-DD" />
Detail: Exact time is optional; read more here (External link).
Used in: Page freshness scoring.

<meta name="robots" content="..., ..." />
Detail: Use the meta robots tag to block the search engine from indexing a particular page.
Used in: Indexing; does not affect relevance ranking.

<main>
Detail: Allows the search engine to target the actual content of the page and avoid headers, sidebars, and other page content not useful to search. Read more about the <main> element here.
Used in: Query matching, term frequency scoring

<lastmod>
Detail: This field is included in XML sitemaps to signal to search engines when a page was last modified. Search.gov collects this metadata in case there is no article:modified_time data included in the page itself.
Used in: Indexing processing, page freshness scoring.
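In an XML sitemap, <lastmod> sits inside each <url> entry; a minimal sketch (with a placeholder URL and date) looks like this:

<url>
<loc>https://www.example.gov/topic1</loc>
<lastmod>2018-09-28</lastmod>
</url>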


Everything You Need to Know About Indexing with Search.gov

How does all this work?

Domain Level SEO Supports

Page Level SEO Supports

How to get search engines to index the right content for better discoverability

Website structure and content can have a significant impact on the ability of search engines to provide a good search experience. As a result, the Search Engine Optimization industry evolved to provide better understanding of these impacts and close critical gaps. Some elements on your website will actively hinder the search experience, and this post will show you how to target valuable content and exclude distractions.

We’ve written a post about robots.txt files, talking about high level inclusion and exclusion of content from search engines. There are other key tools you will want to employ on your website to further target the content on individual pages:


The <main> element

Targeting particular content on a page

A <main> element allows you to target content you want indexed by search engines. If a <main> element is present, the system will only collect the content inside the element. Be sure that the content you want indexed is inside of this element. If the element is closed too early, important content will not be indexed. Unless the system finds a <main> element demarcating where the primary content of the page is to be found, repetitive content such as headers, footers, and sidebars will be picked up by search engines as part of a page’s content.

The element is implemented as a stand-alone tag:

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>This is your page title</h1>
<p>This is the main text of your page
</main>
Redundant footer code
Various scripts, etc.
</body>

The element can also take the form of a <div> with the role of main, though this approach is now outdated:

<body>
Redundant header code and navigation elements, sidebars, etc.
<div role="main">
<h1>This is your page title</h1>
<p>This is the main text of your page
</div>
Redundant footer code
Various scripts, etc.
</body>

As mentioned above, if no <main> element is present, the entire page will be scraped. Full-page scraping is best reserved for non-HTML file types, though, such as PDFs, DOCs, and PPTs.


Declare the ‘real’ URL for a page

There are two good reasons to declare the ‘real’ URL for a given page: CMS sites can easily become crawler traps, and list views can generate URLs that are unhelpful as search results.

A crawler trap occurs when the engine falls into a loop of visiting, opening, and “discovering” pages that seem new but are really just modifications of existing URLs. These URLs may have appended parameters such as tags, referring pages, Google Tag Manager tokens, page numbers, etc. Crawler traps tend to occur when your site can generate an infinite number of URLs, so the crawler is ultimately unable to determine what constitutes the entirety of the site. A canonical link looks like this:

<link rel="canonical" href="https://www.example.gov/topic1" />

By using a canonical link, shown above, you tell the crawler this is the real URL for the page despite parameters present in the URL when the page is opened. In the example above, even if a crawler opened the page with a URL like https://example.gov/topic1?sortby=desc, only https://www.example.gov/topic1 will be captured by the search engine.

Another important use case for canonical links is the dynamic list. If the example above is a dynamic list of pages about Topic 1, it’s likely there will be pagination at the bottom of the page. This pagination dynamically separates items into distinct pages and generates URLs like https://example.gov/topic1?page=3. As new items are added to or removed from the list, there’s no guarantee that existing items will remain on a particular page. This behavior may frustrate users when a particular page no longer contains the item they want.

Use a canonical link to limit the search engine to indexing only the first page of the list, which the user can then sort or move through as they choose. The individual items on the list are indexed separately and included in search results.
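For example, each paginated view of the hypothetical list above would carry the same canonical link pointing back to the first page. On https://example.gov/topic1?page=3, the <head> would still include:

<link rel="canonical" href="https://www.example.gov/topic1" />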


Robots meta tags

There are individual pages on your websites that do not make good search results. These could be archived event pages, list views such as Recent Blog Posts, etc. Blocking individual pages in the robots.txt file will be difficult if you don’t have easy access to edit the file. Even if edits are easy, it could quickly lead to an unmanageably long robots.txt.

It’s also important to note that search engines will pay attention to Disallow directives in robots.txt when crawling, but may not when accessing your URLs from other sources, like links from other sites or your sitemap. Search.gov will rely on robots meta tags when working off your sitemap to know what content you want searchable, and what you don’t want searchable.

To achieve best results for blocking indexing of particular pages, you’ll want to employ meta robots tags in the <head> of the pages you want to exclude from the search index.

This example says not to index the page, but allows following the links on the page:

<meta name="robots" content="noindex" />

This example says to index the page, but not follow any of the links on the page:

<meta name="robots" content="nofollow" />

This example tells bots not to index the page, and not to follow any of the links on the page:

<meta name="robots" content="noindex, nofollow" />

You can also add an X-Robots-Tag to your HTTP response headers to control indexing for a given page. This requires deeper access to servers than our customers usually have themselves, so if you are interested in learning more, you can do so here (External link).
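As a sketch, the header equivalent of the noindex meta tag above is a single line in the HTTP response for the page:

X-Robots-Tag: noindex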

If you have content that should be indexed when it’s fresh, but needs to be removed from the index once it’s outdated, you’ll want to take a few actions:

  • Once the page’s window of relevance is over, add a <meta name="robots" content="noindex" /> tag to the <head> of the page.
  • Make sure the article:modified_time on the page is updated.
  • Leave the item in the sitemap, so that search engines will see the page was updated, revisit it, and see that the item should be removed from the index.


Sample code structure

Dynamic list 1: Topic landing page

The following code sample is for a dynamically generated list of pages on your site, where you want the landing page for the list to appear in search results.

<head>
<title>Unique title of the page</title>
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. This could be the same or slightly different than the regular meta description." />
<meta property=”article:published_time” content=”2018-09-28” />
<meta property=”article:modified_time” content=”2018-09-28” />
<link rel="canonical" href="https://www.example.gov/topic1" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>Unique title of the page</h1>
<p>This is the introductory text of the page. It tells people what they’ll find here, why the topic is important, etc. This text is within the main element, and so it will be used to retrieve this page in searches.
</main>
Dynamically generated list of relevant pages
Pagination
Redundant footer code
Various scripts, etc.
</body>

Dynamic list 2: Posts tagged XYZ

The following code sample is for a dynamically generated list of pages on your site, where you do not want the list to appear in search results. In the case of pages tagged with a particular term, the pages themselves would be good search results, but the list of them would be just another click between the user and the content.

Note: the description tags are still present in case someone links to this page in another system and that system wants to display a summary with the link.

<head>
<title>Unique title of the page</title>
<meta name="robots" content="noindex" />
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175. This could be the same or slightly different than the regular meta description." />
<meta property=”article:published_time” content=”2018-09-28” />
<meta property=”article:modified_time” content=”2018-09-28” />
<link rel="canonical" href="https://www.example.gov/posts-tagged-xyz" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<h1>Unique title of the page</h1>
Dynamically generated list of relevant pages
Pagination
Redundant footer code
Various scripts, etc.
</body>

Event from last month

In the following example, an event page was published in June, and then updated the day after the event occurred. This update adds the meta robots tag, which declares the page should not be indexed, and links from the page should not be followed in future crawls. Again, the meta descriptions are retained in case of linking from other systems.

<head>
<title>Unique title of the page</title>
<meta name="robots" content="noindex, nofollow" />
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175. This could be the same or slightly different than the regular meta description." />
<meta property=”article:published_time” content=”2018-06-04” />
<meta property=”article:modified_time” content=”2018-08-13” />
<link rel="canonical" href="https://www.example.gov/events/august-12-title-of-event" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>Unique title of the page</h1>
<p>This is the introductory text of the page. It tells people what they’ll find here, why the topic is important, etc. This text is within the main element, and so it will be used to retrieve this page in searches.
Specifics about the event.
</main>
Redundant footer code
Various scripts, etc.
</body>

Resources

Government-managed Domains outside the .Gov and .Mil Top Level Domains

Overview

As the U.S. government’s official web portal, USA.gov searches across all federal, state, local, tribal, and territorial government websites. Most government websites end in .gov or .mil, but many end in .com, .org, .edu, or other top-level domains.

In support of USA.gov and M-17-06 - Policies for Federal Agency Public Websites and Digital Services (External link), Search.gov maintains a list of all government domains that don’t end in .gov or .mil.

How to Update the List

Federal agencies are required (External link) to submit to Search.gov all non-.gov websites for inclusion in the list. This includes subdomains of a second-level domain managed by a third party, and federally controlled subfolders of a domain managed by a third party.

State or local agencies can browse the full list by state or search for their domains, and are welcome to submit any updates or additions to the Search team.

What’s Included in The List?

What’s Not Included in This List?

  • .gov URLs - these are managed by the .gov Registry (External link)
  • .mil URLs - these are managed by DOD (External link)
  • Subdomains or folders that are already covered by a higher-level domain
  • State institutions of higher education or their board of regents
  • K-12 school districts
  • Local fire, library, police, sheriff, etc. departments with separate websites
  • Local chambers of commerce or visitor bureaus
  • Nonprofit municipal leagues or councils of government officials
  • Nonprofit historical societies
  • Transit authorities

Checklist for a Successful Website Redesign

We often receive questions when an agency conducts a major website upgrade, changes content management systems, or both. We created this checklist to help ensure your redesign is successful.

Reindexing

  1. Keep the file name and directory structure the same, if possible. If it isn’t possible, use 301 redirects to send visitors to the appropriate new pages. For more on 301 redirects, read tips from Bing (External link) and Google (External link). Notify other websites that link to you of the changes.
  2. Register for the commercial search engines’ webmaster tools.
  3. Update the XML sitemap on your website and notify the search engines via webmaster tools.
  4. Notify the search engines of content that has been removed via the webmaster tools. Specifically, in Bing, use the Site Move tool (External link).

Search Setup

  1. Within the Search Admin Center:
  2. If you are transitioning your website to HTTPS, we have an additional Preparing Your Site for HTTPS checklist.

If you’ve undergone a redesign, followed these steps, and your site search results are not what you’d expect, send us an email.


Additional Resource:


Search Engine Optimization for Government Websites

On June 10, 2014, the Metrics Community of Practice of the Federal Web Managers Council and DigitalGov University hosted an event to honor the memory of Joe Pagano, a former co-chair of the Web Metrics Sub-Council.

This third lecture honoring Joe focused on search engine optimization (SEO).

While commercial search engines do a remarkable job of helping the public find our government information, as web professionals, it’s also our job to help the public make sense of what they find.

Ammie Farraj Feijoo, our program manager, presented on SEO for government websites and specifically talked about:

  • What SEO is and why it is important;
  • SEO building blocks for writing content;
  • Conducting keyword research; and
  • Eliminating ROT (redundant, outdated, and trivial content).

Download the slide deck [PDF] and visit the resources below to learn more.

Webmaster Tools

A Few (of Many) SEO Resources

