Posts tagged indexing

Checklist for a Successful Website Redesign

We often receive questions when an agency conducts a major website upgrade, changes content management systems, or both. We created this checklist to help ensure your redesign is successful. The stages are:

Ready…

1. Let the Search.gov team know you are launching a new site

2. Develop a reindexing plan

Set…

3. Prepare XML sitemaps and SEO elements

4. Add a Search Page Alert

5. Prepare color scheme updates and new logo to add to Admin Center

6. Prepare updates to your other search features

Go!

7. Flip the new website live, let us know

8. Implement the changes to the search site

9. Results begin to show

Victory lap

10. Alert Google and Bing that your website has been refreshed

[Flow chart: the steps involved in getting the search index ready to go on Search.gov for a website that’s being relaunched.]

Ready…

1. Let the Search.gov team know you are launching a new site

Who: You, the agency web team

What: Send us an email or give us a call; either way, please let us know that you’re working on a redesign of your website. If we know ahead of time, we can help you get your new search experience prepped and in good shape on the day of the relaunch. When you reach out to us, include the planned launch date.

It’s important to plan ahead, because if there are any changes to your site structure, your search results will break, which will lead to frustration for the public as they try to use your new site. This is true for our service, and out on Google and Bing. To avoid an avalanche of 404 not found errors from your search results, wherever possible, use 301 redirects to send visitors from the old pages to the appropriate new pages. For more on 301 redirects, read tips from Bing (External link) and Google (External link). Notify other websites that link to you of the changes.
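
If your platform supports server-side redirects, a 301 rule can be as small as the hedged example below. The paths are hypothetical, and your CMS or hosting platform may offer its own redirect mechanism, so treat this as a sketch rather than a required approach. In an Apache .htaccess file:

# Hypothetical example: permanently redirect a retired page to its new home
Redirect 301 /about/old-page.html https://www.example.gov/about/new-page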

2. Develop a reindexing plan

Who: Search.gov team, in consultation with you, the agency web team

What: We will ask you about your search needs for the new site, including what domains you need to include, how you can generate an XML sitemap, and what SEO supports you’re putting in place in your new templates, like metadata and other structured elements. Read more about our indexing process here.

As part of this discussion, we’ll make some recommendations and likely agree on some action items for your team to consider implementing prior to your launch. If there are any major SEO warning signs in your setup, we’ll let you know.

We’ll also ask you about the timeline for launch, so that we can reserve a time to coordinate Step 8 with you.

Set…

3. Prepare XML sitemaps and SEO elements

Who: You, the agency web team

What: Action items that usually come out of the planning discussions include

  • Ensure that each domain and subdomain you want to be searchable launches with an XML sitemap.
  • Add metadata blocks to the <head> of your page templates, and semantic markup to the <body>.
    • Sometimes these pieces are in place, but need to be modified or moved.
  • Talk with other web teams and ask them to do the above items on their sites, so you can leverage that work when your search includes their sites’ content.

4. Add a Search Page Alert in the Admin Center

Who: You, the agency web team

What: Use our Search Page Alert feature to display a “pardon our dust” type message on your results page. For example:

  • We are launching a new example.gov. If your search does not return the content you expected, please check back soon for updated results.
  • Set the status of the alert to Inactive and wait for the relaunch.

5. Prepare color scheme updates and new logo to add to Admin Center

Who: You, the agency web team

What: Gather your new logo and color palette, if needed. Many sites find it helpful to mock up their redesigned results page in a non-production search site; you can either clone your existing site or just use the Add Site button to create a totally new one.

Don’t implement these changes on your production site ahead of the actual relaunch (that comes in Step 8, below).

6. Prepare updates to your other search features

Who: You, the agency web team

What: When your URL structure changes, this will affect several of our search features. You’ll want to get your updates ready to go, but don’t implement them ahead of the relaunch, or people will end up in the wrong places:

  • Domains: make sure your Domains list includes the domains and subdomains that you want included as the default content to search.
  • Collections: make sure your Collections are searching for the right content in the new location.
  • Best Bets: make sure your Best Bet URLs are correct for the content’s new location.
  • Routed Queries: update the target of your Routed Queries, so searchers will end up on the correct page.
  • RSS feeds should be removed, and re-added from their new locations.

Go!

7. Flip the new website live, and let us know

Who: You, the agency web team

What: When your new website is publicly available, reach out to us by email or phone. This will be our signal to begin our part of Step 8.

8. Implement the changes to the search site

At this point, the work splits into two parallel tracks, with your team and ours working on related items at the same time.

Who: You, the agency web team

What: Add the updates you prepared in Steps 4, 5, and 6 to the Admin Center for your production search site:

  • Set your Search Page Alert to Active
  • Update your colors and logo
  • Update your Domains, Collections, Best Bets, Routed Queries, and RSS Feeds as necessary

Who: The Search.gov team

What: We complete several backend tasks:

  • Switch your production search site to use the new index, which will begin empty for your domain(s).
  • Tell our indexer to begin working on your domain(s)
    • The time it takes to get your content indexed depends on the number of items you have and whether you have a crawl delay declared in your /robots.txt file. Generally speaking, a few hundred items should be done in an hour or two, and a few thousand items in several hours; see the rough math below.
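
To make those estimates concrete, here is the arithmetic with an assumed 10-second crawl delay (purely illustrative; your own delay may be much shorter):

500 items × 10 seconds ≈ 5,000 seconds, or a bit under 1.5 hours
3,000 items × 10 seconds ≈ 30,000 seconds, or roughly 8 hours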

9. Results begin to show

What: Our indexer will first read your sitemap, collect the urls, and then work through them in the order they were collected. We will work at the crawl delay set in your /robots.txt file, or 1 request per second, whichever is slower. This delay is the time after we’ve rendered a page, before requesting the next page to render.

Victory lap

10. Alert Google and Bing that your website has been refreshed.

Who: You, the agency web team

What: Register for the commercial search engines’ webmaster tools, if you haven’t already done so.

If you’ve undergone a redesign, followed these steps, and your site search results are not what you’d expect, send us an email.

How Search.gov Ranks Your Search Results

Google and Bing hold their ranking algorithms closely as trade secrets, as a guard against people trying to game the system to ensure their own content comes out on top, regardless of whether that’s appropriate to the search. Search Engine Optimization (SEO) consulting has grown up as an industry to try to help websites get the best possible placement in search results. You may be interested in our webinars on technical SEO and best practices that will help you get your website into better shape for search, and we’re also available to advise federal web teams on particular search issues. Generally speaking, though, SEO is a lot like reading tea leaves.

We at Search.gov share our ranking factors because we want you to game our system. This helps ensure that the best, most appropriate content rises to the top of search results to help the American public find what they need.

This page will be updated as new ranking factors are added.

Guaranteed 1st Place Spot

For any pages you want always to appear in the top of search results, regardless of what the ranking algorithm might decide, use a Best Bet. Like an ad in the commercial engines, Best Bets allow you to pin recommended pages to the top of results. Text Best Bets are for single pages, and Graphics Best Bets allow you to boost a set of related items. Our Match Keywords Only feature allows you to put a tight focus on the terms you want a Best Bet to respond to. Read more here.

Ranking Factors

Each of the following ranking factors is calculated separately, and then multiplied together to create the final ranking score of a given item for a given search.
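
As an illustrative sketch only (the exact weights are internal to our system), you can picture the final score of a document d for a query q as:

score(d, q) ≈ text_relevance(d, q) × file_type_factor(d) × freshness_factor(d) × popularity_boost(d)

Each of the factors below contributes one term to that product.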

File Type

We prefer HTML documents over other file types. Non-HTML results are demoted significantly, to prevent, for instance, PDF files from crowding out their respective landing pages.

Freshness

We prefer documents that are fresh. Anything published or updated in the past 30 days is considered fresh. After that, we use a Gaussian decay function to demote documents, so that the older a document is, the more it is demoted. When documents are 5 years old or older, we consider them to be equally old and do not demote further. We use either the article:modified_time on an individual page, or that page’s <lastmod> date from the sitemap, whichever is more recent. If there is only an article:published_time for a given page, we use that date.

Documents with no date metadata at all are considered fresh and are not demoted. Read more about date metadata we collect and why it’s important to add metadata to your files.
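
To picture the shape of that decay, here is a hedged sketch of a Gaussian decay over document age; the exact parameters we use are not published, so treat this as descriptive rather than literal:

freshness_factor ≈ exp( −(max(0, age − 30 days))² / (2σ²) )

Documents newer than 30 days keep a factor of 1, the factor shrinks smoothly as age grows, and once a document passes the 5-year mark the factor is held constant rather than shrinking further.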

Page Popularity

We prefer documents that users interact with more. Currently we leverage our own search analytics to track the number of times a URL is clicked on from the results page. The more clicks, the more that URL is promoted, or boosted. We use a logarithmic function to determine how much to boost the relevance score for each URL. For sites new to our service, please expect this ranking factor to take 30 days to fully warm up after your search goes live.
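
As a purely illustrative sketch of a logarithmic boost (not our exact function or coefficients), the effect looks something like:

popularity_boost ≈ 1 + log(1 + clicks)

so the first clicks on a URL matter much more than the thousandth.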

Note: Sites using the search results API to present our results on their own websites will not be able to take advantage of our click data ranking.

Core Ranking Algorithm

Our system is built on Elasticsearch, which itself is built on Apache Lucene. For the first several generations, Elasticsearch used Lucene’s default ranking, the Practical Scoring Function. This function starts with a basic Boolean match for single terms and adds in TF/IDF and a vector space model. Here are some high-level definitions for these technical terms:

  • Boolean matches are the AND / OR / NOT matches you’ve probably heard about.
    • This AND that
    • This OR that
    • This NOT that
    • This AND (that OR foo) NOT bar
    • Note that while the relevance ranking takes these into account, we do not currently use these operators if entered by a searcher. Support for user-entered Boolean operators is coming in 2019.
  • TF/IDF means term frequency / inverse document frequency. It counts the number of times a term appears in a document and compares that to how many documents contain the term. It aims to identify documents where the query terms appear frequently; documents matching terms that are rare across the whole document set receive a higher score, while documents that match only common terms appearing in many documents receive a lower score. (See the sketch after this list.)
    • The TF/IDF score is also tempered with a method called BM25, which balances the scores of documents that are very different in length. If ten documents contain rare terms, the longest document with the most instances of those terms would otherwise get a much higher score than a short document with only a few instances. That makes intuitive sense, but when the two documents are the full PDF of a report and the summary of that report, the full report isn’t that much more relevant to the query than the summary is. BM25’s length ‘normalization’ addresses that issue.
  • The vector space model allows the search engine to weight the individual terms in the query, so a common term in the query would receive a lower match score than a rare term in the query.
  • Read detailed technical documentation here (External link)
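
For reference, the classic textbook forms behind those ideas (shown here for intuition, not as Search.gov-specific tuning) are:

tf-idf(t, d) = tf(t, d) × log( N / df(t) )

BM25(q, d) = Σ over terms t in q of IDF(t) × [ tf(t, d) × (k1 + 1) ] / [ tf(t, d) + k1 × (1 − b + b × |d| / avgdl) ]

Here N is the total number of documents, df(t) is the number of documents containing term t, |d| is the document’s length, avgdl is the average document length, and k1 and b are tuning constants (commonly around 1.2 and 0.75).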

The latest versions of Elasticsearch take into account the context of terms within the document: whether they are in structured data fields or in unstructured fields, like body text.

  • Structured data fields, like dates, are treated with a Boolean match method - does the field value match, or not?
  • Unstructured data fields, like webpage body content, are considered for how well a document matches a query.
  • Read highly technical documentation here (External link)

What Search.gov Indexes From Your Website

Content

When we think about indexing pages for search, we usually think about indexing the primary content of the page. If the page isn’t structured to tell the search engine where that content is to be found, we collect the <body> tag and then filter out the <nav> and <footer> elements, if present. If none of <main>, <nav>, or <footer> are present, we collect the full contents of the <body> tag. Learn more in our post about aiming search engines at the content you really want to be searchable, using the <main> element.

Metadata

You can read more detail on each of the following elements here.

Standard metadata elements

  • title
  • meta description
  • meta keywords
  • locale or language (from the opening <html> tag)
  • url
  • lastmod (collected from XML sitemaps)
  • og:description
  • og:title
  • article:published_time
  • article:modified_time

File formats

In addition to HTML pages with their various file extensions, Search.gov indexes the following file types:

  • PDFs
  • Word docs
  • Excel docs
  • TXT
  • Images can be indexed either using our Flickr integration, or by sending us an MRSS feed. Note that images are not indexed during web page indexing, so you’ll need to use one of these two methods.

Coming soon:

  • PowerPoint

Please note that at this time we cannot index content rendered by JavaScript, similar to most search engines (External link). We recommend your team add well-crafted, unique description text for each of your pages, or perhaps auto-generate description tag text from the first few lines of the article text. However the text is added, it should include the keywords you want the page to respond to in search, framed in plain language. This gives us, and other search engines, something to work with when matching and ranking results. See our discussion of description metadata for more information.

Metadata and tags you should include in your website

Search.gov, like other search engines, relies on structured data to help inform how we index your content and how it is presented in search results. You should also read up on the metadata and structured data used by Google (External link) and Bing (External link).

Including the following tags and metadata in each of your pages will improve the quality of your content’s indexing, as well as results ranking. We also encourage you to read about more HTML5 semantic markup (External link) you can include in your websites.

This page will be updated over time as we add more tag-based indexing functions and ranking factors to our service.

<title>
Detail: Unique title of the page. If you want to include the agency or section name, place that after the actual page title.
Used in: Query matching, term frequency scoring

<meta name="description" content="foo" />
Detail: Your well crafted, plain language summary of the page content. This will often be used by search engines in place of a page snippet. Be sure to include the keywords you want the page to rank well for. Best to limit to 160 characters, so it will not be truncated. Read more here (External link).
Used in: Query matching, term frequency scoring

<meta name="keywords" content="foo bar baz" />
Detail: While not often used by commercial search engines due to keyword stuffing (External link), Search.gov indexes your keywords, if you have added them.
Used in: Query matching, term frequency scoring

<meta property="og:title” content=”Title goes here” />
Detail: Usually duplicative of <title>, we use the og:title property as the result title if it appears to be more substantive than the <title> tag. Note, Open Graph elements are used to display previews of your content in FaceBook and some other social media platforms.
Used in: Query matching, term frequency scoring

<meta property="og:description” content=”Description goes here” />
Detail: Often duplicative of the meta description, we index this field as well, in case it has different content. This field is a good opportunity to include more keywords than you could write into the meta description. Note, Open Graph elements are used to display previews of your content in FaceBook and some other social media platforms.
Used in: Query matching, term frequency scoring

<meta property="article:published_time" content="YYYY-MM-DD" />
Detail: Exact time is optional; read more here (External link).
Used in: Page freshness scoring.

<meta property="article:modified_time" content="YYYY-MM-DD" />
Detail: Exact time is optional; read more here (External link).
Used in: Page freshness scoring.

<meta name="robots" content="..., ..." />
Detail: Use the meta robots tag to block the search engine from indexing a particular page.
Used in: Indexing; does not affect relevance ranking.

<main>
Detail: Allows the search engine to target the actual content of the page and avoid headers, sidebars and other page content not useful to search. Read more about the <main> element here
Used in: Query matching, term frequency scoring

<lastmod>
Detail: This field is included in XML sitemaps to signal to search engines when a page was last modified. Search.gov collects this metadata in case there is no article:modified_time data included in the page itself.
Used in: Indexing processing, page freshness scoring.


Search Site Launch Guide

At Search.gov we aim to provide a self-service, plug and play search solution. This guide will walk you through everything you need to do, and let you know when to reach out to us. The basic steps are:

  1. Add a site
  2. Add Domains
  3. We will select the search index your site will use
  4. Add additional search features
  5. Turn on the search features
  6. Configure the branding of your results page
  7. Connect your website’s search box to your search site

[Flow chart: the steps involved in launching a search site on Search.gov.]

1. Add Site

Who: You, the agency web team

What: After you’ve successfully opened an account with Search.gov, you’ll need to create a search site. A search site is where you configure the search experience for your website. Find the Add Site link at the top of the Admin Center, and enter some basic details about your site. Please note that our service is for publicly accessible, federal government content. More detailed information can be found on our Add Site help page.

Once you’ve created your site, note the actions available on the left-hand navigation of your Admin Center.

The Dashboard is where you can view a Site Overview, manage users, and update your site’s homepage or display name.

Analytics are provided for the past 13 months, reporting your top queries, clicks, and referrers (the pages people were on when they ran their searches), and monthly rollup data.

Content management is where you define what your search experience will include: the default search scope, additional content sources, and alternative search views.

Display management is where you can configure the branding of your search results page.

Preview your search results page to see what your search experience will be like, before you go live.

And finally, the Activate section provides pre-formatted code snippets to help you go live. Don’t be afraid of entering this area; nothing will actually be activated.


2. Add Domains

Who: You, the agency web team

What: In the content management section, the domains list defines the default search scope for your site. You can include one domain or several, or you can focus on particular subdomains of one domain. Read more here.


3. Web Index Selection

Who: Search.gov team, in consultation with you, the agency web team

What: By default, a new search site will be connected to the Bing web index to receive web results. Websites with very low levels of search traffic can continue to use the Bing web index after launching with our service. However, sites that will see more than 150,000 queries per year will need to be indexed directly by our service before going live. We monitor new sites established in our system, and will reach out if we think your site will need to be indexed by us, or if we need more information to make a determination.

Regardless of the index used to support your search, we can only serve publicly accessible content. You will not be able to use our service for secure content, including intranets, and we can never index or serve personally identifiable information (PII) or other confidential data.

(Jump to Step 4. Add Features if you don’t need the details of the indexing process at this time.)

If we will be indexing your content ourselves, we will follow these steps:



Indexing with Search.gov



A. Define Domains and Subdomains

Who: You, the agency web team, in consultation with the Search.gov team

What: The Admin Center Domains list controls what we pull out of our index for a search on your site. But we also need to know what to put in to the index to begin with. We’ll work with you to confirm the domains and subdomains you want discoverable through search. For example, after discussing with you, we may plan to index all of your subdomains, or just a selection of the major sections:

www.example.gov
data.example.gov
archive.example.gov
www.subagencydomainexample.gov 



B. Sitemap for Each Subdomain

Who: You, the agency web team, in consultation with the Search.gov team

What: The easiest way for us to discover what URLs exist on your domain is via an XML sitemap. Each domain identified above will need a separate sitemap. Please read our detailed discussion of XML sitemaps, and let us know if you have any questions. We understand it can be difficult for some legacy systems to generate a sitemap, so if this is the case, please reach out.

We do not crawl websites by default due to the high resource demand of crawling every page on every website all the time. One of the goals of our service is to contain the costs of search government-wide, and a crawling-first model would increase costs significantly.

If you publish your site on Federalist, read these additional instructions.

C. Index Subdomains

Who: The Search.gov team

What: Once sitemaps are posted to your website, our system will index your content. Alert us when the sitemaps are posted, and we’ll add your domains to our list of domains that we monitor. Then, indexing will begin.

By default, we make 1 request per second to a domain. If a Crawl-delay is declared in your /robots.txt file, we will honor that delay while fetching your content for indexing. The time required to index a site, in hours, is roughly (number of items) × (crawl delay in seconds) / 3600.
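
For example, with hypothetical numbers: a site with 18,000 URLs and a 2-second crawl delay would take roughly 18,000 × 2 / 3600 = 10 hours to index.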

If you use a firewall service, it’s possible our indexer will be blocked. We can provide our IP addresses for you to whitelist in your firewall.

Please note, we can only index domains that are publicly accessible. This means that if you have a password-protected staging environment, we will not be able to index it for you as part of your testing process. Please reach out and we can discuss options if you need to test our service pre-production.

D. Test Index

Who: Search.gov Team

What: For search sites switching from Bing: After your content is indexed, we’ll start up a parallel search site using your current site configuration and the new index, and run a number of test queries to ensure the index is performing satisfactorily. Our test will cover your live site’s most popular queries.

E. Review Index

Who: You, the agency web team

What: For sites switching from Bing: After we’re satisfied with the index, we’ll send you a link to the test search site, so you can review and provide feedback.

For brand new sites: You will be able to test the index using your regular search site(s).

F. Ready to Launch

Who: You, the agency web team, in collaboration with Search.gov

What: For brand new sites: Your index is ready to go, you can proceed with the rest of the site launch steps and go live without any further action from our team.

For sites switching from Bing: When you give us the green light to switch to the new index, there is no action needed on your part other than the approval. We will change a setting in our back end, which will point your existing search site’s web results module to our index, and the change is effective immediately. All other elements of your search site remain the same: search features, branding, etc.



4. Add Search Features

Who: You, the agency web team

What: We offer several additional search features you can configure to enhance your search experience.

  • Collections allow you to set up alternative search scopes from the Domains you declare for the main search. Often Collections point at particular subfolders or subdomains of the primary domain for the site. Sometimes they point at a different domain entirely. If you are indexed by Search.gov and you want a Collection to search another domain, check with us to see if we have that content already indexed.
  • Best Bets work like ads in Google, and allow you to pin certain results to the top of your search results. Use Text Best Bets to boost individual items, and Graphics Best Bets to boost a set of related items, such as a form, its instructions page, and other related material.
  • Routed queries allow you to bypass the results page entirely for a given query, when you know exactly which page you want a person to reach after running that query. This is helpful for always getting people to the landing page for a process, rather than having them click to a mid-process page from a search results page.
  • RSS feeds can be indexed and searched either as separate tabs on the search results, or as an inline module promoting your latest content alongside your web results.
  • YouTube videos can also be searched
  • Twitter
  • Flickr
  • Jobs are one of the most frequently searched topics on agency websites. Use our jobs module to show your agency’s postings from USAJOBS in your own website’s search results.
  • Federal Register rules and notices can be added to your search results in a separate module.


5. Toggle Search Features On

Who: You, the agency web team

What: In order to display any of the search features you just added above, you’ll need to toggle ON the display for each one, using the Display Overview page. If you want to show Jobs or Federal Register results and you don’t see those options on the Display Overview page, let us know and we can connect your search site to those features.


6. Configure Results Page

Who: You, the agency web team

What: To make the results page complement your website’s look and feel, upload your logo, set the font style, and customize the page colors to ensure a more seamless experience for your searchers as they move from your site to ours, and back again. You can also add header and footer links to support navigation back to your website. See more details here.

Masking the domain for your results page is another way you can provide continuity to your searchers as they move back and forth between your site and our system.


7. Connect Your Search Box to Search.gov

Who: You, the agency web team, in collaboration with your deploy team, if different

What: Once you’re ready to go live with your search site, take a look at the Go-Live Checklist to make sure you’ve covered all your bases. Then you will need to modify the form code for the search box on your website. We provide simple pre-formatted code in the Admin Center, or you can include these same parameters in another style of search box. Read more and see required parameters here.
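
For illustration, a search box that hands queries to a hosted results page generally looks like the sketch below. The action URL, affiliate value, and parameter names here are placeholders, so copy the exact snippet from the Activate section of the Admin Center rather than this example:

<!-- Hypothetical search box; use the pre-formatted code from the Admin Center -->
<form accept-charset="UTF-8" action="https://search.example.gov/search" method="get">
  <input name="affiliate" type="hidden" value="your-site-handle" />
  <label for="query">Search</label>
  <input name="query" type="text" id="query" autocomplete="off" />
  <button type="submit">Search</button>
</form>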

If you publish your site on Federalist, read these alternative instructions.

You’re now live with Search.gov!

XML Sitemaps

An XML sitemap  (External link) is an XML formatted file containing a list of URLs on a website. An XML sitemap provides information that allows a search engine to index your website more intelligently, and to keep its search index up to date.

Sitemaps tell search engines what URLs are on a website, and, if URLs are added as they are published, they tell the engines what new content needs to be picked up. They may also provide additional metadata about each URL, such as the last modified date, which signals to the engine to update the index record for that page.

Search.gov uses sitemaps to tell us what URLs should be in our index and when a URL has been updated. Sitemaps are used in a similar way by Google  (External link), Bing, and other search engines. Having an XML sitemap will improve your Google SEO (search engine optimization).

Example: https://search.gov/sitemap.xml

What content should be in an XML sitemap?

Some sitemaps are comprehensive, but for very large sites you may need to publish several sitemaps. Each sitemap should be no more than 50MB or 50,000 URLs, whichever comes first. You do not need to add URLs of content you want to remain unsearchable.

Note that an HTML-formatted file listing the pages of a site is more akin to an index page, and is not the same as an XML sitemap. HTML files are human friendly, but not machine friendly; search engines need an XML-formatted file in order to leverage the information for indexing work.

More than one web platform? Use multiple sitemaps.

It’s common for agencies to use more than one platform to publish their websites. For instance, a CMS was launched, but some content is still on the legacy site’s platform. In this case, use available plugins for the CMSs in your environment to auto-generate sitemaps for that content, and manually generate a sitemap for any static content. You can publish a sitemap index file  (External link) that lists the locations of all your specific sitemaps, or you can list all your sitemaps in your robots.txt file.

How do search engines find my sitemap(s)?

Sitemaps (or the sitemap index  (External link)) should be listed in your site’s robots.txt file, for example:
Sitemap: https://www.agency.gov/sitemap_1.xml
Sitemap: https://www.agency.gov/sitemap_2.xml

List the appropriate sitemap(s) for the domain or subdomain. www.exampleagency.gov/robots.txt would list sitemaps for content in the www subdomain, while forms.exampleagency.gov/robots.txt would list sitemaps for the forms subdomain.

Read more about robots.txt files, and take a look at ours: https://search.gov/robots.txt

What should my XML sitemap look like?

Please refer to the official sitemaps protocol  (External link) for full information on how a sitemap should be structured.

When publishing your sitemap, be sure it begins with an XML declaration, and that each URL is enclosed in opening and closing tags. To take a simplified example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset>
<url>
<loc>https://exampleagency.gov/blog/file1.html</loc>
<lastmod>2018-03-19T00:00:00+00:00</lastmod>
</url>
<url>
<loc>https://exampleagency.gov/policy/new-policy.html</loc>
<lastmod>2018-03-27T00:00:00+00:00</lastmod>
</url>
</urlset>

If you use multiple sitemaps, then you’ll need to use a sitemap index  (External link), along these lines:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex>
<sitemap>https://exampleagency.gov/sitemap.xml?page=1</sitemap>
<sitemap>https://exampleagency.gov/sitemap.xml?page=2</sitemap>
</sitemapindex>

Importantly, be sure that any special characters in your URLs are escaped  (External link) so the search engines will know how to read them.

What metadata does Search.gov require for each XML sitemap URL?

The sitemap protocol defines required and optional XML tags  (External link) for each URL. We recommend including the <lastmod> value (the date of last modification of the file) whenever possible, to indicate when a file has been updated and needs to be re-indexed.

We do not have plans to support the <priority> tag, which is no longer used  (External link) by search engines like Google. We may support the <changefreq> tag in the future, but the <lastmod> tag is more accurate and supported by more search engines.

How can I create an XML sitemap?

Most content management systems provide tools to generate a sitemap and keep it updated. Below are some tools that we recommend:

Drupal

XML Sitemap Module  (External link)

WordPress

Yoast SEO Plugin  (External link)

Google Sitemap Plugin  (External link)

Wagtail

Sitemap Generator  (External link)

GitHub Pages (Jekyll)

Jekyll Sitemap gem  (External link)

Online generators

(Note: free online generators often have a limit to the number of URLs they will include, and do not always generate the most accurate sitemaps. Use them only as a last resort.)

Free Sitemap Generator  (External link)

Web Sitemap  (External link)

Sitemap checklist

1. One or more sitemaps have been created

2. The URLs in the sitemap have been reviewed (clean URLs, only includes URLs that should be searchable)

3. Each sitemap’s XML format has been validated  (External link)

4. Each sitemap (or a sitemap index) is listed in the site’s robots.txt file

Additional Resources:

Official Documentation from Sitemaps.org  (External link)

Google’s guide to building a sitemap  (External link)

Sitemap validator  (External link)

More questions?

If you have questions that aren’t answered here, email us. We’ll also keep updating this page over time.

How a Page on a Sitemap Becomes a Search Result

We often get questions about how sitemaps control the search results for a given site. The answer is, they don’t! This page describes the relationship between sitemaps, search indexes, and the search experiences you create through the Admin Center.

A frame for the relationships described below

Imagine a big lake. There are any number of tributaries feeding into the lake. There are fishing boats out on the lake, each loaded up with the gear they need and a guide to the kinds of fish they’re trying to catch.

The Big Search.gov Index: the Lake

Like a lake with its fish, the common search index has all the content from all the sites we index, ready to be brought up by any number of different search site configurations.

The main difference in the search site setup process is the source of the web results. Like Google and Bing, when we index your content, we collect every site’s web pages into a big, common index. All search sites using our index reference this same common data pool.

Sitemaps: the Tributaries

XML Sitemaps are like tributaries feeding into a lake. They do not feed into sitemap-specific indexes connected to particular search sites.

Sitemaps list the content available on websites in a machine-friendly format, so that search engines will know what to collect from the site. The content indexed from your website goes into the big index mentioned above, along with the content from all other websites. You can, in theory, pull content from any website we have indexed into your search experience. This supports portal search experiences.

Search Site Setup: the Fishing Boats

Like a fishing boat on the water, you’ve decided what fish you’re going after, you know what corners of the lake to go to, and you’ve collected the gear you need to get the fish.

Search.gov used to rely on the Bing web index for our main search results. Customers would log in to the Admin Center and use the Domains list to include the content they wanted to pull from Bing. Now that we’re building our index in house, all this remains the same. You log in to the Admin Center and configure what you want your search to return on the results page.

Tying it all together

We use sitemaps to inform what we index into our system. You use the Admin Center to determine what results will come out of the index when people search on your website. Tributaries feed into a lake, and fishers can go out to any part of the lake to get the particular kinds of fish that they want.

Following a particular page through this cycle looks like this:

  1. A page is posted to a website
  2. Its URL is added to the sitemap
  3. Search.gov’s indexer reads the sitemap and picks up the URL
  4. Search.gov’s indexer visits the page and scrapes the content
  5. The content is added to the index. Meanwhile, the search site had already been configured to include this content within the index.
  6. A member of the public searches on the website
  7. The query matches the page’s content
  8. The page is returned as a search result
  9. The searcher clicks on the URL on the results page
  10. The searcher is brought to the page on the website

[Diagram: a large circle represents the Search.gov index. To its left, an array of small blocks represents individual sitemaps, with arrows pointing from the sitemaps to the circle. To its right, a set of pentagons represents search sites, and beyond them a vertical bar represents the public. Arrows flow from the circle, through the pentagons, to the bar, representing the flow of search results from the central Search.gov index through the search sites to the members of the public who are searching.]

Robots.txt Files

A /robots.txt file is a text file that instructs automated web bots on how to crawl and/or index a website. Web teams use them to provide information about what site directories should or should not be crawled, how quickly content should be accessed, and which bots are welcome on the site.

What should my robots.txt file look like?

Please refer to the robots.txt protocol  (External link) for detailed information on how and where to create your robots.txt. Key points to keep in mind:

  • The file must be located at the root of the domain, and each subdomain needs its own file.
  • The robots.txt protocol is case sensitive.
  • It’s easy to accidentally block crawling of everything
    • Disallow: / means disallow everything
    • Disallow: means disallow nothing, thus allowing everything
    • Allow: / means allow everything
    • Allow: means allow nothing, thus disallowing everything
  • The instructions in robots.txt are guidance for bots, not binding requirements.

How can I optimize my robots.txt for Search.gov?

Crawl delay

A robots.txt file may specify a Crawl-delay directive for one or more user agents, which tells a bot how quickly it can request pages from a website. For example, a crawl delay of 10 specifies that a crawler should not request a new page more than once every 10 seconds. We recommend a crawl delay of 2 seconds for our usasearch user agent, and setting a higher crawl delay for all other bots. The lower the crawl delay, the faster Search.gov will be able to index your site. In the robots.txt file, it would look like this:

User-agent: usasearch  
Crawl-delay: 2

User-agent: *
Crawl-delay: 10

XML Sitemaps

Your robots.txt file should also list one or more of your XML sitemaps. For example:

Sitemap: https://www.exampleagency.gov/sitemap.xml
Sitemap: https://www.exampleagency.gov/independent-subsection-sitemap.xml
  • Only list sitemaps for the domain matching where the robots.txt file is. A different subdomain’s sitemap should be listed on that subdomain’s robots.txt.

Allow only the content that you want searchable

We recommend disallowing any directories or files that should not be searchable. For example:

Disallow: /archive/
Disallow: /news-1997/
Disallow: /reports/duplicative-page.html
  • Note that if you disallow a directory after it’s been indexed by a search engine, this may not trigger a removal of that content from the index. You’ll need to go into the search engine’s webmaster tools to request removal.
  • Also note that search engines may index individual pages within a disallowed folder if the search engine learns about the URL from a non-crawl method, like a link from another site or your sitemap. To ensure a given page is not searchable, set a robots meta tag on that page.

Customize settings for different bots

You can set different permissions for different bots. For example, if you want us to index your archived content but don’t want Google or Bing to index it, you can specify that:

User-agent: usasearch  
Crawl-delay: 2
Allow: /archive/

User-agent: *
Crawl-delay: 10
Disallow: /archive/

Robots.txt checklist

1. A robots.txt file has been created in the site’s root directory (https://exampleagency.gov/robots.txt)

2. The robots.txt file disallows any directories and files that automated bots should not crawl

3. The robots.txt file lists one or more XML sitemaps

4. The robots.txt file format has been validated  (External link)

Additional Resources

Yoast SEO’s Ultimate Guide to Robots.txt  (External link)

Google’s “Learn about robots.txt files”  (External link)




How to get search engines to index the right content for better discoverability

Website structure and content can have a significant impact on the ability of search engines to provide a good search experience. As a result, the Search Engine Optimization industry evolved to provide better understanding of these impacts and close critical gaps. Some elements on your website will actively hinder the search experience, and this post will show you how to target valuable content and exclude distractions.

We’ve written a post about robots.txt files, talking about high level inclusion and exclusion of content from search engines. There are other key tools you will want to employ on your website to further target the content on individual pages:


The <main> element

Targeting particular content on a page

A <main> element allows you to target content you want indexed by search engines. If a <main> element is present, the system will only collect the content inside the element. Be sure that the content you want indexed is inside of this element. If the element is closed too early, important content will not be indexed. Unless the system finds a <main> element demarcating where the primary content of the page is to be found, repetitive content such as headers, footers, and sidebars will be picked up by search engines as part of a page’s content.

The element is implemented as its own HTML tag:

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>This is your page title</h1>
<p>This is the main text of your page
</main>
Redundant footer code
Various scripts, etc.
</body>

The element can also take the form of a <div> with the role of main, though this approach is now outdated:

<body>
Redundant header code and navigation elements, sidebars, etc.
<div role="main">
<h1>This is your page title</h1>
<p>This is the main text of your page
</div>
Redundant footer code
Various scripts, etc.
</body>

As mentioned above, if no <main> element is present, the entire page will be scraped. Full-page scraping is best reserved for non-HTML file types, though, such as PDFs, DOCs, and PPTs.


Declare the ‘real’ URL for a page

There are two good reasons to declare the ‘real’ URL for a given page: CMS sites can easily become crawler traps, and list views can generate URLs that are unhelpful as search results.

A crawler trap occurs when the engine falls into a loop of visiting, opening, and “discovering” pages that seem new, but are modifications of existing URLs. These URLs may have appended parameters such as tags, referring pages, Google Tag Manager tokens, page numbers, etc. Crawler traps tend to occur when your site can generate an infinite number of URLs, leaving the crawler unable to determine what constitutes the entirety of the site.

<link rel="canonical" href="https://www.example.gov/topic1" />

By using a canonical link, shown above, you tell the crawler this is the real URL for the page despite parameters present in the URL when the page is opened. In the example above, even if a crawler opened the page with a URL like https://example.gov/topic1?sortby=desc, only https://www.example.gov/topic1 will be captured by the search engine.

Another important use-case for canonical links is the dynamic list. If the example above is a dynamic list of pages about Topic 1, it’s likely there will be pagination at the bottom of the page. This pagination dynamically separates items into distinct pages and generates URLs like: https://example.gov/topic1?page=3. As new items are added to or removed from the list, there’s no guarantee that existing items will remain on a particular page. This behavior may frustrate users when a particular page no longer contains the item they want.

Use a canonical link to limit the search engine to indexing only the first page of the list, which the user can then sort or move through as they choose. The individual items on the list are indexed separately and included in search results.


Robots meta tags

There are individual pages on your websites that do not make good search results: archived event pages, list views such as Recent Blog Posts, etc. Blocking individual pages in the robots.txt file will be difficult if you don’t have easy access to edit the file, and even if edits are easy, it could quickly lead to an unmanageably long robots.txt.

It’s also important to note that search engines will pay attention to Disallow directives in robots.txt when crawling, but may not when accessing your URLs from other sources, like links from other sites or your sitemap. Search.gov will rely on robots meta tags when working off your sitemap to know what content you want searchable, and what you don’t want searchable.

To achieve best results for blocking indexing of particular pages, you’ll want to employ meta robots tags in the <head> of the pages you want to exclude from the search index.

This example says not to index the page, but allows following the links on the page:

<meta name="robots" content="noindex" />

This example says to index the page, but not follow any of the links on the page:

<meta name="robots" content="nofollow" />

This example tells bots not to index the page, and not to follow any of the links on the page:

<meta name="robots" content="noindex, nofollow" />

You can also add an X-Robots-Tag to your HTTP response headers to control indexing for a given page. This requires deeper access to servers than our customers usually have themselves, so if you are interested in learning more, you can do so here  (External link).
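
For example, a page that should stay out of the index could return this header along with its normal response:

X-Robots-Tag: noindex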

If you have content that should be indexed when it’s fresh, but needs to be removed from the index once it’s outdated, you’ll want to take a few actions:

  • Once the page’s window of relevance is over, add a <meta name="robots" content="noindex" /> tag to the <head> of the page.
  • Make sure the modified_time on the page is updated.
  • Leave the item in the sitemap, so that search engines will see the page was updated, revisit it, and see that the item should be removed from the index (see the example entry below).
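
As a sketch of that last step, the sitemap entry for the page would keep its URL but carry the new modification date (the URL and date here are hypothetical):

<url>
  <loc>https://www.example.gov/events/expired-event</loc>
  <lastmod>2018-09-28</lastmod>
</url>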


Sample code structure

Dynamic list 1: Topic landing page

The following code sample is for a dynamically generated list of pages on your site, where you want the landing page for the list to appear in search results.

<head>
<title>Unique title of the page</title>
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. This could be the same or slightly different than the regular meta description." />
<meta property="article:published_time" content="2018-09-28" />
<meta property="article:modified_time" content="2018-09-28" />
<link rel="canonical" href="https://www.example.gov/topic1" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>Unique title of the page</h1>
<p>This is the introductory text of the page. It tells people what they’ll find here, why the topic is important, etc. This text is within the main element, and so it will be used to retrieve this page in searches.
</main>
Dynamically generated list of relevant pages
Pagination
Redundant footer code
Various scripts, etc.
</body>

Dynamic list 2: Posts tagged XYZ

The following code sample is for a dynamically generated list of pages on your site, where you do not want the list to appear in search results. In the case of pages tagged with a particular term, the pages themselves would be good search results, but the list of them would be just another click between the user and the content.

Note: the description tags are still present in case someone links to this page in another system and that system wants to display a summary with the link.

<head>
<title>Unique title of the page</title>
<meta name="robots" content="noindex" />
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175. This could be the same or slightly different than the regular meta description." />
<meta property="article:published_time" content="2018-09-28" />
<meta property="article:modified_time" content="2018-09-28" />
<link rel="canonical" href="https://www.example.gov/posts-tagged-xyz" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<h1>Unique title of the page</h1>
Dynamically generated list of relevant pages
Pagination
Redundant footer code
Various scripts, etc.
</body>

Event from last month

In the following example, an event page was published in June, and then updated the day after the event occurred. This update adds the meta robots tag, which declares the page should not be indexed, and links from the page should not be followed in future crawls. Again, the meta descriptions are retained in case of linking from other systems.

<head>
<title>Unique title of the page</title>
<meta name="robots" content="noindex, nofollow" />
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175. This could be the same or slightly different than the regular meta description." />
<meta property="article:published_time" content="2018-06-04" />
<meta property="article:modified_time" content="2018-08-13" />
<link rel="canonical" href="https://www.example.gov/events/august-12-title-of-event" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>Unique title of the page</h1>
<p>This is the introductory text of the page. It tells people what they’ll find here, why the topic is important, etc. This text is within the main element, and so it will be used to retrieve this page in searches.
Specifics about the event.
</main>
Redundant footer code
Various scripts, etc.
</body>

Resources