Posts tagged indexing

Robots.txt Files

A /robots.txt file is a text file that instructs automated web bots on how to crawl and/or index a website. Web teams use these files to indicate which site directories should or should not be crawled, how quickly content should be requested, and which bots are welcome on the site.

What should my robots.txt file look like?

Please refer to the robots.txt protocol  (External link) for detailed information on how and where to create your robots.txt. Key points to keep in mind:

  • The file must be located at the root of the domain, and each subdomain needs its own file.
  • The robots.txt protocol is case sensitive.
  • It’s easy to accidentally block crawling of everything (see the example after this list):
    • Disallow: / means disallow everything
    • Disallow: means disallow nothing, thus allowing everything
    • Allow: / means allow everything
    • Allow: means allow nothing, thus disallowing everything
  • The instructions in robots.txt are guidance for bots, not binding requirements.
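
To make that pitfall concrete, here is a minimal sketch of two hypothetical robots.txt files that differ by a single character. The first allows crawling of the entire site, while the second blocks it entirely:

# File 1: the empty Disallow value disallows nothing, so everything may be crawled
User-agent: *
Disallow:

# File 2: the single slash matches every URL, so nothing may be crawled
User-agent: *
Disallow: /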

How can I optimize my robots.txt for Search.gov?

Crawl delay

A robots.txt file may specify a “crawl delay” directive for one or more user agents, which tells a bot how quickly it can request pages from a website. For example, a crawl delay of 10 specifies that a crawler should not request a new page more than every 10 seconds. We recommend a crawl-delay of 2 seconds for our usasearch user agent, and setting a higher crawl delay for all other bots. The lower the crawl delay, the faster Search.gov will be able to index your site. In the robots.txt file, it would look like this:

User-agent: usasearch  
Crawl-delay: 2

User-agent: *
Crawl-delay: 10

XML Sitemaps

Your robots.txt file should also list one or more of your XML sitemaps. For example:

Sitemap: https://www.exampleagency.gov/sitemap.xml
Sitemap: https://www.exampleagency.gov/independent-subsection-sitemap.xml
  • Only list sitemaps for the same domain the robots.txt file lives on. A different subdomain’s sitemaps should be listed in that subdomain’s own robots.txt.

Allow only the content that you want searchable

We recommend disallowing any directories or files that should not be searchable. For example:

Disallow: /archive/
Disallow: /news-1997/
Disallow: /reports/duplicative-page.html
  • Note that if you disallow a directory after it’s been indexed by a search engine, this may not trigger a removal of that content from the index. You’ll need to go into the search engine’s webmaster tools to request removal.
  • Also note that search engines may index individual pages within a disallowed folder if they learn about the URL through a non-crawl method, like a link from another site or your sitemap. To ensure a given page is not searchable, set a robots meta tag on that page, as shown in the example below.
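
For reference, the robots meta tag mentioned above goes in the <head> of the specific page you want kept out of search results; the title here is just a placeholder, and the tag itself is covered in more detail in the robots meta tags post below:

<head>
<title>Duplicative report page</title>
<meta name="robots" content="noindex" />
</head>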

Customize settings for different bots

You can set different permissions for different bots. For example, if you want us to index your archived content but don’t want Google or Bing to index it, you can specify that:

User-agent: usasearch  
Crawl-delay: 2
Allow: /archive/

User-agent: *
Crawl-delay: 10
Disallow: /archive/

Robots.txt checklist

1. A robots.txt file has been created in the site’s root directory (https://exampleagency.gov/robots.txt)

2. The robots.txt file disallows any directories and files that automated bots should not crawl

3. The robots.txt file lists one or more XML sitemaps

4. The robots.txt file format has been validated  (External link)

Additional Resources

Yoast SEO’s Ultimate Guide to Robots.txt  (External link)

Google’s “Learn about robots.txt files”  (External link)



How to get search engines to index the right content for better discoverability

Website structure and content can have a significant impact on the ability of search engines to provide a good search experience. As a result, the Search Engine Optimization industry evolved to provide better understanding of these impacts and close critical gaps. Some elements on your website will actively hinder the search experience, and this post will show you how to target valuable content and exclude distractions.

We’ve written a post about robots.txt files, talking about high level inclusion and exclusion of content from search engines. There are other key tools you will want to employ on your website to further target the content on individual pages:


The <main> element

Targeting particular content on a page

A <main> element allows you to target content you want indexed by search engines. If a <main> element is present, the system will only collect the content inside the element. Be sure that the content you want indexed is inside of this element. If the element is closed too early, important content will not be indexed. Unless the system finds a <main> element demarcating where the primary content of the page is to be found, repetitive content such as headers, footers, and sidebars will be picked up by search engines as part of a page’s content.

The element is implemented as a simple wrapper around your primary content:

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>This is your page title</h1>
<p>This is the main text of your page.</p>
</main>
Redundant footer code
Various scripts, etc.
</body>

The element can also take the form of a <div> with the role of main, though this approach is now outdated:

<body>
Redundant header code and navigation elements, sidebars, etc.
<div role="main">
<h1>This is your page title</h1>
<p>This is the main text of your page.</p>
</div>
Redundant footer code
Various scripts, etc.
</body>

As mentioned above, if no <main> element is present, the entire page will be scraped. Full-page scraping is best reserved for non-HTML file types, such as PDFs, DOCs, and PPTs.


Declare the ‘real’ URL for a page

There are two good reasons to declare the URL for a given page: CMS sites can easily become crawler traps, and list views can generate URLs that are unhelpful as search results.

A crawler trap occurs when the engine falls into a loop of visiting, opening, and “discovering” pages that seem new, but are really modifications of existing URLs. These URLs may have appended parameters such as tags, referring pages, Google Tag Manager tokens, page numbers, etc. Crawler traps tend to occur when your site can generate an infinite number of URLs, leaving the crawler unable to determine what constitutes the entirety of the site. To escape this, declare a canonical link in the page’s <head>:

<link rel="canonical" href="https://www.example.gov/topic1" />

By using a canonical link, shown above, you tell the crawler this is the real URL for the page despite parameters present in the URL when the page is opened. In the example above, even if a crawler opened the page with a URL like https://example.gov/topic1?sortby=desc, only https://www.example.gov/topic1 will be captured by the search engine.

Another important use-case for canonical links is the dynamic list. If the example above is a dynamic list of pages about Topic 1, it’s likely there will be pagination at the bottom of the page. This pagination dynamically separates items into distinct pages and generates URLs like https://example.gov/topic1?page=3. As new items are added to or removed from the list, there’s no guarantee that existing items will remain on a particular page. This behavior may frustrate users when a particular page no longer contains the item they want.

Use a canonical link to limit the search engine to indexing only the first page of the list, which the user can then sort or move through as they choose. The individual items on the list are indexed separately and included in search results.
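
For instance, sticking with the hypothetical URLs above, each paginated view of the Topic 1 list would point back to the first page of the list as its canonical URL:

<!-- In the <head> of https://example.gov/topic1?page=3 -->
<link rel="canonical" href="https://www.example.gov/topic1" />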


Robots meta tags

There are individual pages on your website that do not make good search results, such as archived event pages or list views like Recent Blog Posts. Blocking individual pages in the robots.txt file is difficult if you don’t have easy access to edit the file, and even if edits are easy, it could quickly lead to an unmanageably long robots.txt.

It’s also important to note that search engines will pay attention to Disallow directives in robots.txt when crawling, but may not when accessing your URLs from other sources, like links from other sites or your sitemap. Search.gov will rely on robots meta tags when working off your sitemap to know what content you want searchable, and what you don’t want searchable.

To achieve the best results when blocking indexing of particular pages, you’ll want to employ robots meta tags in the <head> of the pages you want to exclude from the search index.

This example says not to index the page, but allows following the links on the page:

<meta name="robots" content="noindex" />

This example says to index the page, but not follow any of the links on the page:

<meta name="robots" content="nofollow" />

This example tells bots not to index the page, and not to follow any of the links on the page:

<meta name="robots" content="noindex, nofollow" />

You can also add an X-Robots-Tag to your HTTP response headers to control indexing for a given page. This requires deeper server access than our customers usually have themselves, so if you are interested in learning more, you can do so here  (External link).
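
For reference, the header itself is a single line in the HTTP response, such as X-Robots-Tag: noindex, nofollow. As one hedged sketch, a team running an Apache server with mod_headers enabled might keep PDF files out of search indexes like this (other web servers have equivalent settings):

# Apache example: sends "X-Robots-Tag: noindex, nofollow" with every PDF response
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>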

If you have content that should be indexed when it’s fresh, but needs to be removed from the index once it’s outdated, you’ll want to take a few actions:

  • Once the page’s window of relevance is over, add a <meta name="robots" content="noindex" /> tag to the <head> of the page.
  • Make sure the modified_time on the page is updated.
  • Leave the item in the sitemap, so that search engines will see the page was updated, revisit it, and see that the item should be removed from the index.
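
Putting those steps together, here is a rough sketch of what the page’s <head> and its sitemap entry might look like after the update (the URL and dates are placeholders):

<!-- In the <head> of the outdated page -->
<meta name="robots" content="noindex" />
<meta property="article:modified_time" content="2018-08-13" />

<!-- The page stays in the sitemap, with lastmod matching the update -->
<url>
<loc>https://www.example.gov/events/august-12-title-of-event</loc>
<lastmod>2018-08-13T00:00:00+00:00</lastmod>
</url>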


Sample code structure

Dynamic list 1: Topic landing page

The following code sample is for a dynamically generated list of pages on your site, where you want the landing page for the list to appear in search results.

<head>
<title>Unique title of the page</title>
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. This could be the same or slightly different than the regular meta description." />
<meta property="article:published_time" content="2018-09-28" />
<meta property="article:modified_time" content="2018-09-28" />
<link rel="canonical" href="https://www.example.gov/topic1" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>Unique title of the page</h1>
<p>This is the introductory text of the page. It tells people what they’ll find here, why the topic is important, etc. This text is within the main element, and so it will be used to retrieve this page in searches.</p>
</main>
Dynamically generated list of relevant pages
Pagination
Redundant footer code
Various scripts, etc.
</body>

Dynamic list 2: Posts tagged XYZ

The following code sample is for a dynamically generated list of pages on your site, where you do not want the list to appear in search results. In the case of pages tagged with a particular term, the pages themselves would be good search results, but the list of them would be just another click between the user and the content.

Note: the description tags are still present in case someone links to this page in another system and that system wants to display a summary with the link.

<head>
<title>Unique title of the page</title>
<meta name="robots" content="noindex" />
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175. This could be the same or slightly different than the regular meta description." />
<meta property="article:published_time" content="2018-09-28" />
<meta property="article:modified_time" content="2018-09-28" />
<link rel="canonical" href="https://www.example.gov/posts-tagged-xyz" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<h1>Unique title of the page</h1>
Dynamically generated list of relevant pages
Pagination
Redundant footer code
Various scripts, etc.
</body>

Event from last month

In the following example, an event page was published in June, and then updated the day after the event occurred. This update adds the meta robots tag, which declares the page should not be indexed, and links from the page should not be followed in future crawls. Again, the meta descriptions are retained in case of linking from other systems.

<head>
<title>Unique title of the page</title>
<meta name="robots" content="noindex, nofollow" />
<meta name="description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175." />
<meta property="og:title" content="Unique title of the page" />
<meta property="og:description" content="Some multi-sentence description of various things a person will find on this page. This is a great place to use different terms for the same thing, which is hopefully both plain language and keyword stuffing at the same time. Recommended max characters is 175. This could be the same or slightly different than the regular meta description." />
<meta property="article:published_time" content="2018-06-04" />
<meta property="article:modified_time" content="2018-08-13" />
<link rel="canonical" href="https://www.example.gov/events/august-12-title-of-event" />
</head>

<body>
Redundant header code and navigation elements, sidebars, etc.
<main>
<h1>Unique title of the page</h1>
<p>This is the introductory text of the page. It tells people what they’ll find here, why the topic is important, etc. This text is within the main element, and so it will be used to retrieve this page in searches.</p>
Specifics about the event.
</main>
Redundant footer code
Various scripts, etc.
</body>


Six Months In: Lessons Learned in the Transition to Search.gov

In September 2017 we announced that we would be moving away from using commercial search engines as our primary source of web results. This was driven by Google’s announcement that they would be sunsetting their Site Search API at the end of March 2018, and our desire to have more control over the quality, coverage, and cost of our web results than we were able to achieve with Bing.

We’d like to update you on our progress and share some lessons learned. Over the past months we’ve worked with many of you and listened to the challenges you face in your particular environments. In the fall we believed that encouraging agencies toward using our i14y indexing API was the best way to go. Having lived with that approach, we now know that we need to focus on indexing content directly off of your websites, leveraging structured lists of URLs known as XML sitemaps. The lessons here are presented in the order we learned them, and we’ve updated the indexing FAQs to reflect the current approach.

Lessons Learned

1. The API-first focus we began with isn’t viable as a standard solution

  • Many agencies use content management systems for which we are unable to provide an integration, and struggle to create one on their own.
  • Even though many agencies use Drupal, and may have been able to leverage our module, distributed site management and quarantine-like firewalling make it highly difficult to get i14y indexing up and running via the module.
  • Mixed platforms for content mean that most sites will need other indexing support beyond just i14y.
  • The number of indexes created by our API model doesn’t scale. Even though Elasticsearch, the search solution our system is built on, is designed for its indexes to expand to great size with ease, it doesn’t perform well with a high number of relatively small indexes.
  • And finally, many agencies still have static sites, from which it’s close to impossible to export even a clean list of URLs for indexing, much less to push content out to an API over time.

2. Crawling is prohibitively resource-intense

  • To assist some static sites in our early-transition group, we incorporated a crawler into our stack to facilitate content discovery on those websites. The crawler does a good job, but it’s still a manual process, and automating it for continuous discovery would require significant processing power.
  • Crawl delays, as set in robots.txt files, have a serious impact on indexing speed. At a crawl delay of 10, it would take over two weeks to crawl a site of 150,000 pages (150,000 pages × 10 seconds ≈ 17 days).

3. Relevancy ranking is easiest to manage if all the content is in the same index

  • In addition to holding data from pages and files, a search index can also hold indicator data showing how a given piece of content should be ranked relative to other pieces of content. For sites that have sent web content into i14y drawer indexes and also have content that we have indexed from static files, each index will have its own ranking indicators, but it’s hard for a search system to know how to compare the different indicators when blending results together.
  • We’ll be able to add more ranking indicators to the index that we build than we can offer in the i14y drawer indexes built by agencies via API.
  • If an agency is able to send 100% of the content they want searched, with full text, to a single i14y drawer, then their relevance ranking would be easy to determine. While most agencies can’t do this, we will continue to support i14y for the agencies that have already launched with it.

4. XML sitemaps are great - really helpful for search engines and pretty easy for agencies to implement

  • An XML sitemap is a machine-friendly list of the contents of a website. While no one can know for sure, the consensus in the SEO industry is that Google and Bing use XML sitemaps as part of how they monitor sites for new or updated content. Making it easy for them to find your content is thought to give a ranking boost to your site’s content  (External link).
  • Most content management systems have plugins that will generate an XML sitemap for the content in the CMS. Static content can either be added to the CMS-generated sitemap, or listed in a separate sitemap file.
  • We feel it’s a better use of time for agency teams to work on implementing good sitemaps that will help them out in Google and Bing, as well as with the Search.gov system, than to invest a lot of time in an integration that only works with us.

So where are we now?

We built the new index model and released that in December 2017, along with our core indexing job that grabs data from pages - HTML, PDF, and the other major file types.

We added a crawler in January 2018, to facilitate URL discovery on a given website.

We added the ability to index content from XML sitemaps in February. We follow the sitemaps protocol  (External link), which relies in part on a site also having a good robots.txt file, and we have posted explainer pages about XML sitemaps and robots.txt files to give you the most essential information.

We removed our connection to the Google Site Search API in March, and are now serving results from our own index for those cases where we had previously used Google.

We started with a small representative set of sites, and have moved into working on our highest traffic customers. Over 90 search sites now bear the Powered by Search.gov mark on their search results pages, including SSA.gov, TSA.gov, Medicare.gov, and many more.

Where to next?

Unlike with the Google API sunset, there is no hard deadline for our moving sites off Bing and into our own index. Our timelines around Bing are driving toward having well over half of our search traffic going to our own indexes this fiscal year. We’ll continue to work on high traffic sites, large Department website searches, and the many agency component websites that combine to create the parent website’s search.

If your search site is low traffic and you haven’t heard from us yet, you can expect to remain on Bing for the foreseeable future. We will reach out as the time draws near.

In the meantime, we encourage all sites to invest some time in developing and maintaining a good XML sitemap. As mentioned above, this will help us maintain a good index of your content, and it will also give you a Google boost, so it’s really win-win. Part of having a good sitemap is having a good robots.txt file as well. Read over our new explainer posts and reach out with any questions.

Learn more about XML sitemaps

Learn more about robots.txt files

XML Sitemaps

An XML sitemap  (External link) is an XML formatted file containing a list of URLs on a website. An XML sitemap provides information that allows a search engine to index your website more intelligently, and to keep its search index up to date.

Sitemaps tell search engines what URLs are on a website, and, if URLs are added as they are published, they tell the engines what new content needs to be picked up. They may also provide additional metadata about each URL, such as the last modified date, which signals to the engine to update the index record for that page.

Search.gov uses sitemaps to tell us what URLs should be in our index and when a URL has been updated. Sitemaps are used in a similar way by Google  (External link), Bing, and other search engines. Having an XML sitemap will improve your Google SEO (search engine optimization).

Example: https://search.gov/sitemap.xml

What content should be on my XML sitemap?

Some sitemaps are comprehensive, but for very large sites you may need to publish several sitemaps. Each sitemap should be no more than 50MB or 50,000 URLs, whichever comes first. You do not need to add URLs of content you want to remain unsearchable.

Note that an HTML-formatted file listing the pages of a site is more akin to an index page, and is not the same as an XML sitemap. HTML files are human friendly, but not machine friendly, and search engines need an XML-formatted file in order to leverage the information for indexing work.

More than one web platform? Use multiple sitemaps.

It’s common for agencies to use more than one platform to publish their websites. For instance, a new CMS was launched, but some content is still on the legacy site’s platform. In this case, use available plugins for the CMSs in your environment to auto-generate sitemaps for that content. Manually generate a sitemap for any static content. You can publish a sitemap index file  (External link) that lists the locations of all your specific sitemaps, or you can list all your sitemaps on your robots.txt file.

How do search engines find my sitemap(s)?

Sitemaps (or the sitemap index  (External link)) should be listed in your site’s robots.txt file, for example:
Sitemap: https://www.agency.gov/sitemap_1.xml
Sitemap: https://www.agency.gov/sitemap_2.xml

List the appropriate sitemap(s) for the domain or subdomain. www.exampleagency.gov/robots.txt would list sitemaps for content in the www subdomain, while forms.exampleagency.gov/robots.txt would list sitemaps for the forms subdomain.
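
As a sketch using those hypothetical hostnames, each robots.txt file lists only the sitemaps hosted on its own subdomain:

# https://www.exampleagency.gov/robots.txt
Sitemap: https://www.exampleagency.gov/sitemap.xml

# https://forms.exampleagency.gov/robots.txt
Sitemap: https://forms.exampleagency.gov/sitemap.xml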

Read more about robots.txt files, and take a look at ours: https://search.gov/robots.txt

What should my XML sitemap look like?

Please refer to the official sitemaps protocol  (External link) for full information on how a sitemap should be structured.

When publishing your sitemap, be sure it begins with an XML declaration, and that each URL is enclosed in opening and closing <url> and <loc> tags. To take a simplified example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://exampleagency.gov/blog/file1.html</loc>
<lastmod>2018-03-19T00:00:00+00:00</lastmod>
</url>
<url>
<loc>https://exampleagency.gov/policy/new-policy.html</loc>
<lastmod>2018-03-27T00:00:00+00:00</lastmod>
</url>
</urlset>

If you use multiple sitemaps, then you’ll need to use a sitemap index  (External link), along these lines:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://exampleagency.gov/sitemap.xml?page=1</loc>
</sitemap>
<sitemap>
<loc>https://exampleagency.gov/sitemap.xml?page=2</loc>
</sitemap>
</sitemapindex>

Importantly, be sure that any special characters in your URLs are escaped  (External link) so the search engines will know how to read them.
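
For example, an ampersand in a URL must be written as its XML entity, &amp;, inside the <loc> tag; the URL below is just a placeholder:

<url>
<loc>https://exampleagency.gov/forms?type=w4&amp;lang=es</loc>
</url>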

What metadata does Search.gov require for each XML sitemap URL?

The sitemap protocol defines required and optional XML tags  (External link) for each URL. We recommend including the <lastmod> value (the date of last modification of the file) whenever possible, to indicate when a file has been updated and needs to be re-indexed.

We do not have plans to support the <priority> tag, which is no longer used  (External link) by search engines like Google. We may support the <changefreq> tag in the future, but the <lastmod> tag is more accurate and supported by more search engines.

How can I create an XML sitemap?

Most content management systems provide tools to generate a sitemap and keep it updated. Below are some tools that we recommend:

Drupal

XML Sitemap Module  (External link)

WordPress

Yoast SEO Plugin  (External link)

Google Sitemap Plugin  (External link)

Wagtail

Sitemap Generator  (External link)

Github Pages (Jekyll)

Jekyll Sitemap gem  (External link)

Online generators

(Note: free online generators often have a limit to the number of URLs they will include, and do not always generate the most accurate sitemaps. Use them only as a last resort.)

Free Sitemap Generator  (External link)

Web Sitemap  (External link)

Sitemap checklist

1. One or more sitemaps have been created

2. The URLs in the sitemap have been reviewed (clean URLs, only includes URLs that should be searchable)

3. Each sitemap’s XML format has been validated  (External link)

4. Each sitemap (or a sitemap index) is listed in the site’s robots.txt file

Additional Resources:

Official Documentation from Sitemaps.org  (External link)

Google’s guide to building a sitemap  (External link)

Sitemap validator  (External link)

More questions?

If you have questions that aren’t answered here, email us. We’ll also keep updating this page over time.