Highlights

  • We continue to make good progress towards our indexing system, and continues to be highly focused on the back end of our system. See below for more details.

Chores

  • We created back-end interfaces allowing the Search.gov team to manage indexed domains & urls.
  • We added a delay method to SearchGov Domain, to honor the crawl delay settings in a given site’s robots.txt file.
  • We created a SearchGov Domain Indexer job that will enqueue urls in need of fetching, to allow bulk indexing tasks to be automated without overloading anyone’s servers, and we added support for resque-scheduler to our configuration baseline.
  • We set the sitemap indexer to reject urls from other domains to avoid erroneous attempts to index content from beta sites, old domains, etc.
  • We now check the protocol of a domain, and whether the site is responding to us. We also set our url fetcher to throw an error if the domain is unavailable or blocking our indexer.
  • We re-indexed the searchgov indices.
  • We upgraded mySQL in demo environments, and streamlined the scenario data for our test suite.

Bug Fixes

  • We fixed bug that sent searchers back to page 1 results when changing the time scope in a Collection search.
  • We mitigated SSL certificate problems with some sites.
  • We made our redirection check more strict to avoid filling our database and indexes with domains and web pages that don’t need to be searchable.