Posts tagged release-notes
- We’ve added click counts as a ranking factor for customers indexed by Search.gov. We look at the URLs that represent 75% of all clicks on search results and give those a boost (see the sketch after this list). This is the “fat head” that comes before the “long tail.” As always, following best practices will help results stay relevant:
- Use Best Bets to promote frequently visited pages that are not bubbling to the top of results on their own
- Periodically review and update Best Bets
- Maintain an up-to-date and complete sitemap, with accurate last-modified dates
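For the curious, here’s a minimal sketch of the “fat head” idea in Ruby. The 75% threshold is real; the method and data shapes are illustrative, not our actual implementation.

```ruby
# Illustrative only: find the "fat head", the smallest set of URLs that
# together account for 75% of all clicks on search results.
FAT_HEAD_SHARE = 0.75

# clicks: a hash of url => click count
def fat_head(clicks, share: FAT_HEAD_SHARE)
  total = clicks.values.sum.to_f
  return [] if total.zero?

  covered = 0.0
  clicks.sort_by { |_url, count| -count } # most-clicked first
        .take_while do |_url, count|
          keep = covered < share * total  # stop once the target share is covered
          covered += count
          keep
        end
        .map(&:first)
end

clicks = {
  "https://example.gov/passports"  => 500,
  "https://example.gov/visas"      => 300,
  "https://example.gov/contact"    => 150,
  "https://example.gov/newsletter" => 50,
}
fat_head(clicks) # => the pages that would get a boost
```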
Fixes, Upgrades, Misc
- Fixed an issue with the type-ahead feature on customer search boxes
- Fixed an issue with disappearing search icon on search boxes
- Continued the Rails upgrade for our applications
- After integrating directly with the new USAJOBS API, we worked on additional tuning of trigger words to avoid false positive job-related searches
- We have improved search performance by caching repeat queries made to our data store
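As a rough illustration of the caching change, here’s a minimal time-to-live cache in plain Ruby. Our actual implementation differs, and `expensive_data_store_query` is a hypothetical stand-in for the real query path.

```ruby
# Illustrative sketch of caching repeat queries with a time-to-live.
class QueryCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(ttl_seconds: 300)
    @ttl   = ttl_seconds
    @store = {}
  end

  # Return the cached value for key, or compute, store, and return it.
  def fetch(key)
    entry = @store[key]
    return entry.value if entry && entry.expires_at > Time.now

    value = yield
    @store[key] = Entry.new(value, Time.now + @ttl)
    value
  end
end

cache = QueryCache.new(ttl_seconds: 300)
# The first call hits the data store; repeats within 5 minutes come from memory.
results = cache.fetch("passport renewal") { expensive_data_store_query("passport renewal") }
```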
- We updated our content parser to accept some non-standard HTML tags, and to ignore any content within them
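To illustrate the parser change: with an HTML parser like Nokogiri (which our stack already pulls in via Loofah), ignoring a tag and everything inside it can look like this. The tag names here are made up.

```ruby
require "nokogiri"

# Illustrative tag list; the real set of ignored tags differs.
IGNORED_TAGS = %w[noindex custom-nav].freeze

def extract_text(html)
  doc = Nokogiri::HTML(html)
  # Remove each ignored tag along with all of its contents.
  IGNORED_TAGS.each { |tag| doc.css(tag).each(&:remove) }
  doc.text.squeeze(" ").strip
end

extract_text("<p>Keep this.</p><noindex><p>Skip this.</p></noindex>")
# => "Keep this."
```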
Fixes, Upgrades, Misc
- We upgraded Ruby on our applications
- We increased the processing power on the servers that support our primary web index
- We reindexed our primary index into more Elasticsearch shards
- We decreased the cookie timeout for Admin Center sessions
- We made the failed password reset alert language more ambiguous, so people will no longer be able to tell whether the email address has an account
- We fixed a bug in our MRSS photo indexer
- We integrated with Bing v7 and transitioned our customers to this newer version.
- We now will index content on a domain even if the root of that domain lists a different domain as the canonical domain. For example, https://publications.sampleagency.gov may list https://www.sampleagency.gov as the canonical domain, but still serve content from https://publications.sampleagency.gov/reports/first_report.pdf. We can now index https://publications.sampleagency.gov/reports/first_report.pdf.
- We updated our job search location feature to show more job openings, and cleaned up how we send job queries to the USAJOBS API to get more results.
- We now automatically review URLs for reindexing, checking for 404s and 301s. We’re doing this every 30 days to begin with, and will adjust that timeframe as needed.
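A simplified sketch of that recheck, using a HEAD request per URL; what we actually do with each status code is more nuanced than the symbols returned here.

```ruby
require "net/http"
require "uri"

# Illustrative: decide what to do with a previously indexed URL.
def recheck(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end

  case response
  when Net::HTTPNotFound         then :remove_from_index      # 404: the page is gone
  when Net::HTTPMovedPermanently then :index_the_new_location # 301: follow the redirect
  else :keep
  end
end
```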
- We upgraded the Ruby version on the repo for our search.gov website, and asis, our image indexing repo.
- We upgraded the activejob Ruby gem across repos.
- We made several updates to our Chef cookbooks to further harden our operating system, including backend password policies, package configuration, and OS configuration.
- We shifted our model for supporting domain masks for hosted search results pages to leverage CAA records.
- We fixed a gnarly bug in Elasticsearch that made queries containing very common words, like “the”, behave as if there were no results.
- We integrated directly with the new USAJOBS API. This means that we:
- are now querying their system at query time, rather than building an index of their job postings within our own system and querying that
- have reconfigured what information our jobs searches include in the full query that we send to USAJOBS
- have increased the geographic radius we’ll look at when a user searches on a jobs-related term. The radius is now 75 miles from the user’s general location.
- are now always providing a link to USAJOBS.gov if someone has searched for a jobs-related term, even if there are no jobs located near the searcher.
- We now support indexing TXT files. There are more TXT files on government websites than you would have thought!
- We fixed a link in the Jobs module that led searchers to a broken USAJOBS.gov page.
- We now deduplicate sitemap URLs so we will not try to index the same content more than once.
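De-duplication is mostly a normalization problem. A minimal sketch, assuming we only normalize the parts of a URL that can vary without changing the content:

```ruby
require "uri"

# Illustrative: collapse URL variants that point at the same content.
def dedupe(urls)
  urls.map { |raw|
    uri = URI.parse(raw)
    uri.fragment = nil            # #fragments never change the page content
    uri.host = uri.host&.downcase # hostnames are case-insensitive
    uri.to_s
  }.uniq
end

dedupe(["https://Example.gov/report#top", "https://example.gov/report"])
# => ["https://example.gov/report"]
```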
- We updated Ruby gems: Loofah, Rack, FFI.
- We upgraded Ruby.
- We began work on using click data in our relevancy ranking (see the sketch after this list), starting with:
- Recording the domain of the clicked-on URL separately, so we can manage all the clicks for a particular domain.
- Calculating the top N clicked-on URLs for a given domain.
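Here’s what those two steps look like in miniature; the `Click` record is a stand-in for our actual click data.

```ruby
# Illustrative click record: the domain is stored separately at click time,
# so per-domain aggregation doesn't require re-parsing every URL.
Click = Struct.new(:url, :domain)

# The top N most-clicked URLs within one domain.
def top_clicked(clicks, domain, n)
  clicks.select { |c| c.domain == domain }
        .group_by(&:url)
        .transform_values(&:size)
        .max_by(n) { |_url, count| count }
        .map(&:first)
end
```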
- We indexed a lot of content for agencies.
- We got our new developers set up and ready to work on great stuff.
- We began work on leveraging click data in our relevancy scoring. This will allow us to use the relative popularity of pages as a ranking signal.
- We now record the domain of a URL that has been clicked in addition to recording the full click. This way we can compare the click volume of URLs within a given domain.
- We resolved security vulnerabilities in the grape and sprockets gems.
- We configured RSpec to run specs in random order.
- We added support for XML sitemaps that are located in non-standard locations within a domain.
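One common reason a sitemap lives somewhere other than /sitemap.xml is that the site declares it in robots.txt instead. A minimal discovery sketch, assuming that’s where we look:

```ruby
require "net/http"
require "uri"

# Illustrative: find sitemap URLs declared in a site's robots.txt.
def sitemap_urls(domain)
  robots = Net::HTTP.get(URI("https://#{domain}/robots.txt"))
  robots.scan(/^\s*sitemap:\s*(\S+)/i).flatten
end

sitemap_urls("www.example.gov")
# => e.g. ["https://www.example.gov/publications/sitemap_index.xml"]
```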
- We added sort_by support to our Results API
- We finished migrating to CircleCI for our continuous integration monitoring.
- We improved our internal tracking of queries to the Bing API.
- We improved how we handle indexing domains that time out.
- We began indexing the last-modified date of a page, if provided
- Our SitemapIndexer now processes one sitemap at a time, and we created an automated queue for indexing jobs and URL fetching.
- We improved the management of Searchgov domain states. Now each Searchgov domain has an “indexing activity”. States might include: indexing sitemaps, fetching new URLs (such as after bulk import), and crawling.
- We now follow client-side redirects.
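Client-side redirects are the `<meta http-equiv="refresh">` kind, which a plain HTTP client won’t follow on its own. A minimal detection sketch:

```ruby
require "nokogiri"

# Illustrative: extract the target of a client-side (meta refresh) redirect,
# or nil if the page doesn't have one.
def client_side_redirect(html)
  doc  = Nokogiri::HTML(html)
  meta = doc.css("meta[http-equiv]").find { |m| m["http-equiv"].casecmp?("refresh") }
  return nil unless meta

  # The content attribute typically looks like "0; url=https://example.gov/new-page"
  meta["content"].to_s[/url\s*=\s*(\S+)/i, 1]
end

client_side_redirect('<meta http-equiv="refresh" content="0; url=https://example.gov/new">')
# => "https://example.gov/new"
```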
- We improved our ability to avoid certain crawler traps.
- We now index documents up to 15 MB in size. The previous limit was 10 MB.
- We finalized our compliance with BOD 18-01.
- We cleaned up how we handle temp files during indexing.
- We tidied up our internal errors on indexing jobs, as well as our test suite.
- We fixed a bug that prevented diacritics from displaying properly in non-English searches.
- We continue to make good progress on our indexing system, and our work remains highly focused on the back end. See below for more details.
- We created back-end interfaces allowing the Search.gov team to manage indexed domains & urls.
- We added a delay method to SearchGov Domain, to honor the crawl-delay settings in a given site’s robots.txt file.
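A rough sketch of what honoring crawl-delay involves; real robots.txt parsing is per-user-agent and more involved than this.

```ruby
require "net/http"
require "uri"

DEFAULT_DELAY = 1 # seconds between fetches when no Crawl-delay is set

# Illustrative: read a site's Crawl-delay directive, if any.
def crawl_delay(domain)
  robots = Net::HTTP.get(URI("https://#{domain}/robots.txt"))
  delay  = robots[/^\s*crawl-delay:\s*(\d+)/i, 1]
  delay ? delay.to_i : DEFAULT_DELAY
end

# A fetch loop can then sleep(crawl_delay(domain)) between requests.
```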
- We created a SearchGov Domain Indexer job that will enqueue URLs in need of fetching, to allow bulk indexing tasks to be automated without overloading anyone’s servers, and we added support for resque-scheduler to our configuration baseline.
- We set the sitemap indexer to reject URLs from other domains, to avoid erroneous attempts to index content from beta sites, old domains, etc.
- We now check the protocol of a domain, and whether the site is responding to us. We also set our URL fetcher to throw an error if the domain is unavailable or blocking our indexer.
- We re-indexed the searchgov indices.
- We upgraded MySQL in demo environments, and streamlined the scenario data for our test suite.
- We fixed a bug that sent searchers back to page 1 of results when changing the time scope in a Collection search.
- We mitigated SSL certificate problems with some sites.
- We made our redirection check more strict to avoid filling our database and indexes with domains and web pages that don’t need to be searchable.
- We’re making good progress on our indexing system, but all our work in April was in the back end of our system. See below for more information.
- We have updated the jQuery version.
- We configured our analytics alerts to send emails via SES instead of Mandrill.
- We upgraded Ruby to version 2.3.7.
- We computed filename extensions for documents in our primary index.
- We improved how we handle email bounces and complaints for our notifications.
- We fixed an error with our S3 backups for Logstash.