We continue to make good progress towards our indexing system, and continues to be highly focused on the back end of our system. See below for more details.
We created back-end interfaces allowing the Search.gov team to manage indexed domains & urls.
We added a delay method to SearchGov Domain, to honor the crawl delay settings in a given site’s robots.txt file.
We created a SearchGov Domain Indexer job that will enqueue urls in need of fetching, to allow bulk indexing tasks to be automated without overloading anyone’s servers, and we added support for resque-scheduler to our configuration baseline.
We set the sitemap indexer to reject urls from other domains to avoid erroneous attempts to index content from beta sites, old domains, etc.
We now check the protocol of a domain, and whether the site is responding to us. We also set our url fetcher to throw an error if the domain is unavailable or blocking our indexer.
We re-indexed the searchgov indices.
We upgraded mySQL in demo environments, and streamlined the scenario data for our test suite.
We fixed bug that sent searchers back to page 1 results when changing the time scope in a Collection search.
We mitigated SSL certificate problems with some sites.
We made our redirection check more strict to avoid filling our database and indexes with domains and web pages that don’t need to be searchable.