Please stand by for realtime captions.

>> Hello everyone, welcome to Search.gov's first webinar of 2019, Indexing with Search.gov. Search.gov is a free solution that federal agencies can use to provide a high-quality search experience to their site users. It is a service of the General Services Administration, supporting search boxes on over 2,000 websites across 30% of federal domains. Today during the webinar we will walk you through our indexing process, with a focus on a new package of indexing documentation addressing the most common indexing questions we have received over the past year. The webinar is also an excellent opportunity to ask questions. You can ask questions by posting them in the chat box. If for some reason you cannot create a profile, you can send us your questions by email at search@support.digitalgov.gov. Today our instructor is Dawn. She will be guiding you through the webinar. She is a professional librarian on a mission to help people find what they are looking for. She is also the program manager for Search.gov, where she works to improve agency users' experience with the service and the public's experience when searching on government websites.

>> Dawn?

>> Thanks. We are excited to be getting together with all of you today to go over the indexing process. As she mentioned, we have received similar questions over the past year, and in response we developed a set of help documentation which we hope will illuminate the process quite a bit. There is a lot of interest in the inner workings of our system, which is fantastic, and we hope to be able to satisfy some of that curiosity today.

>> Since we have a mix of current users and new folks attending today, I want to take a moment to orient everyone to what we are talking about. Search.gov supports the search function of about 30% of federal domains, and within those domains about 2,000 different websites. Agencies create user accounts in our system to configure the search experience and connect their websites to our service, pointing their search boxes at us rather than at an internal target. In addition to the indexing documentation, we publish a search site launch guide that is linked from our home page and from the help manual page.

>> The launch guide walks you through everything you want to do to get a search experience ready to support your website. The middle section is all about indexing, and that is the information we will be focusing on. Before we dive into that, I want to show you the larger package of information. Our goal is to be as much of a self-service tool as possible. These documents should support you getting into our index with only a couple of emails to our team; so far it has been taking quite a few more. The indexing section is linked from the homepage and from the help manual, and it lists all of the information about indexing.

>> At the top, we have information on how search engines work and, specifically, how we work, so you can wrap your mind around the concepts and techniques. We are trying to demystify everything for you. Search engines are just computers. There is a webinar I did last year about how search engines work, and the slides include links and examples, so that should be worth a glance. We also have a set of new documents responding to the questions we have been receiving: what do we index, how do we rank results, how does the indexing process work, and, perhaps my favorite, how site maps control the content of the search site results.
Just kidding, they do not do that at all.

>> The next sections are about SEO. These are posts about site maps that we published last year, so you may have seen them already; we will talk about them during the discussion of the indexing process. For page-level considerations, we have pages we published a while ago and are bringing together under this roof so you can find them easily: the metadata elements we encourage you to include in your pages, as well as other signals you can add so that search engines are better able to index the real substance of your websites. Now let's take a look at the new documents.

>> When we index your site, you can think about us collecting your content. If a page includes a main element, we will skip all the rest of the body (the headers, sidebars, footer, anything outside of the main element) and collect the real content of the page. If there is no main element, we will collect the full body tag and filter out the nav and footer elements. If the page does not include any of these, we will index the full body. If it is a file, a PDF or an Excel document for example, we will index the full content. We also collect a bunch of metadata, including standard elements and Open Graph elements. We chose Open Graph elements because we found they were already implemented across government, and they are easy to include with plug-ins for various content management systems. If you know you have Open Graph elements, it would be good to check on them, because sites often post a few of the fields but not the date fields that you see mentioned here.

>> We collect all the standard file formats, but I want to point out that, at this time, we cannot index JavaScript-rendered page content. Google is still the only search engine that renders JavaScript content, and even they said just last week that it takes longer to process those pages and get them into their index. If your pages render their content with JavaScript, we recommend you pay extra attention to the description metadata that you are including in the head of your pages. Make sure the descriptions include the keywords that you want the page to be searchable for.
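As a quick, hypothetical illustration of those page-level signals, here is a small self-check you could run against a saved copy of one of your pages. It only looks for the elements mentioned above (a main element, a meta description, and a few Open Graph fields); the file name is a placeholder and this is a sketch, not a Search.gov tool.

```python
# Minimal sketch: check a saved page for the signals discussed above.
# "page.html" is a placeholder; this is an illustration, not a Search.gov tool.
from html.parser import HTMLParser

OG_FIELDS = {"og:title", "og:description", "article:modified_time"}

class SignalCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.has_main = False          # is the real content wrapped in <main>?
        self.found = set()             # which metadata fields are present

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "main":
            self.has_main = True
        elif tag == "meta":
            if attrs.get("name") == "description":
                self.found.add("description")
            if attrs.get("property") in OG_FIELDS:
                self.found.add(attrs["property"])

checker = SignalCheck()
with open("page.html", encoding="utf-8") as f:
    checker.feed(f.read())

print("main element present:", checker.has_main)
print("metadata missing:", ({"description"} | OG_FIELDS) - checker.found)
```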
>> Next, ranking factors. Unlike the major search engines, we are publishing our ranking factors so you can understand the different inputs and, hopefully, improve the structure and descriptions on your websites to get the better search performance you are looking for, and also because a lot of people have asked.

>> On Google, if you want to make sure that something will appear at the top of the results, you have to pay for an ad. Similarly, in our system you have to set up a Best Bet in the Admin Center. For ranking regular results, we supplemented the core algorithm of our underlying search engine, which is Elasticsearch, with attributes that are useful to the government sites we support. Each document is scored for each of the factors, and all of those component scores are multiplied together to make up the final relevance score. The ranking is the order of the items matching the query, listed by score from highest to lowest.

>> The first feedback we heard from sites that started using our index early on was that there were way too many PDFs showing up on page one, crowding out their HTML landing pages and bogging things down. So we added a preference for HTML-format content; it does not matter what the file extension is, if it is HTML we will prefer it. The next feedback we got was about how much old content was showing up at the top of the results. We added a demotion for older content, which starts when a document is 30 days old, and the demotion gets stronger as the document gets older. Pages filed with no date metadata at all are considered permanently fresh, so they do not get demoted. The most recent thing we added was a consideration of how popular a page is. Right now we are measuring this in terms of how many times the URL has been clicked on from the search results.

>> Finally, I included information here about the core Elasticsearch algorithm, which in rough terms looks at how many matches there are between the query terms and each document, balanced against how common or rare the terms are and how long the document is. You can imagine how, under this core model, a PDF from eight years ago might get scored higher than a webpage about Congressional testimony from a month ago. Now that we also consider file type, freshness, and popularity, we can elevate the newer, shorter page to where it is visible to searchers.
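To make that multiplication a little more concrete, here is a toy sketch of product-of-factors scoring in the spirit of what was just described. Every number and formula in it is invented for illustration; these are not Search.gov's actual weights.

```python
# Toy sketch of multiplicative ranking, as described above.
# All numbers and factor shapes are invented for illustration only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Doc:
    title: str
    core_text_score: float    # stand-in for the Elasticsearch relevance score
    is_html: bool
    age_days: Optional[int]   # None = no date metadata, treated as fresh
    clicks: int

def score(doc: Doc) -> float:
    file_boost = 1.2 if doc.is_html else 1.0
    if doc.age_days is None or doc.age_days <= 30:
        freshness = 1.0
    else:
        freshness = 30 / doc.age_days           # older pages get demoted
    popularity = 1.0 + 0.01 * doc.clicks        # clicked results get a nudge
    return doc.core_text_score * file_boost * freshness * popularity

docs = [
    Doc("Old PDF report", 2.0, False, 8 * 365, 1),
    Doc("Recent testimony page", 1.5, True, 25, 40),
]
for d in sorted(docs, key=score, reverse=True):
    print(round(score(d), 3), d.title)
```

Under these made-up numbers, the recent HTML page outranks the eight-year-old PDF even though the PDF has a higher core text score, which is the behavior described above.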
>> The last one I want to talk about is the relationship between site maps and search results. We have had requests from customers along the lines of: index the content on site map A and attach it to one search site, and attach overlapping content from another site map to a second search site. They are working on the understanding that the list of URLs on the site map itself is what will get searched from the related search site. That is not how it works, though. I came up with a metaphor to describe it; I will not read the whole page to you, but imagine a lake with a bunch of tributaries feeding into it, and fishing boats out on the open water. The lake is the index, the tributaries are the site maps flowing in, and all your various search sites are the fishing boats, set up to pull the right fish out of the water. I made this diagram. I am not a graphic designer. Hopefully it helps communicate the relationship between the inputs, the site maps flowing into the index, and the outputs, which the search sites control while the public is searching your websites.

>> Now would be a good time to stop for questions. Have we gotten any questions?

>> There is a question relevant to the section. When you take your service, do you have a sandbox environment we can use to test it out?

>> Yes, and also no. In our system, when you establish a search site, whether it is the production one you will use connected to your website, or a playground area for you, all of it is in production in our environment. The difference is whether or not a search site is receiving queries from your website. If you want to have a sandbox area to try things out, there are two ways you can do it. One is to set up two search sites: one is the future search site that will be attached to your search box, and the other is an equivalent sandbox site, with its own site handle, that does not get attached at all, and you can play around in that. The other way: if you are brand-new to the service, the search site that you are setting up to attach to your website is essentially a preproduction environment, because it is not live until you connect it to your search box.

>> Are there any other questions?

>> Not at this time.

>> Let's turn to the indexing process itself. I mentioned earlier that the process is documented within the site launch guide. In that documentation I have tried to point out the differences in the experience brand-new customers will have versus those that are already live with us. The process is basically the same except when it comes time to shift your results onto our index. The first thing we do is look at your current search traffic as a guide for what index we will use to support your site. For many years our service was able to use the Bing index at no cost. Since our service is available to agencies at no cost, that was an easy model for expanding our coverage of the government webspace, especially with budgets getting tighter. For the past few years, though, GSA has been paying for the Bing results, and since one of our goals is to control costs, as you can imagine it is not sustainable to have rapidly increasing market share drive increasing costs within our own budget, which is under continuing resolutions just like yours.

>> So a year and a half ago we started building and improving our own web index to support all but the smallest sites and the couple of sites that truly have full government search scopes, like USA.gov. In those cases it is still more cost-effective to use the Bing index rather than index and keep up to date however many potential small sites might come our way. The threshold we have set is 150,000 queries per year. Some of the biggest sites using our service see this level of traffic in a week, so 150,000 per year is pretty small. If you are under this level and not really happy with how your results are looking, or if you have a site relaunch coming and your site needs to be reindexed anyway, we can index you directly. Please reach out.

>> If you are above the threshold and we have not talked to you yet, or if we have talked to you and it has been a while, don't worry, we will reach out. We have not forgotten you, and we are incorporating your feedback. Once we have determined that we will be indexing your site, we will work with you to determine which domains you want us to index and track over time. Lots of times this is as simple as getting a list of your subdomains. Some search sites, though, are set up to provide a portal search experience, and the agency web team we are working with doesn't always manage all the websites they want to search, so it gets more complex. In our old model you could list the second-level domain, example.gov, or subagency.example.gov. We would send those domains off to Bing's big lake of an index and fish out whatever they might have for you that matches the query.

>> In the new model, we need to make sure we have the right fish in our lake, which means we need to make sure we have the right tributaries connecting and flowing into the lake. In this example, we are going to track the www, data, and archive subdomains, plus the subagency.example.gov domain. This is what would be searched when someone uses the search box for this search site. If your domain has hundreds of subdomains and your main website acts as a conduit to the subdomain websites, it is probably just as effective from a search and wayfinding perspective to search against just the largest subdomains and line office subdomains as it is to search all of the subdomains of the websites and sub-websites. It is certainly more efficient from an indexing perspective.

>> Major agencies do do this, by the way: NASA has about 3,000 subdomains and they search against 15 of them. It is possible. After we have determined the subdomains, the next thing we do is research each of the subdomains we have agreed on to see what platform it is published on, and whether it has a comprehensive, up-to-date site map or not.
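For the curious, that research step can be approximated from your own desk, since a domain's robots.txt will usually advertise its site maps. A minimal sketch, with a placeholder domain (this is not our internal tooling):

```python
# Sketch: list the site maps a domain advertises in its robots.txt.
# www.example.gov is a placeholder domain.
import urllib.request

def sitemaps_from_robots(domain: str) -> list:
    with urllib.request.urlopen(f"https://{domain}/robots.txt") as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    return [line.split(":", 1)[1].strip()
            for line in lines
            if line.lower().startswith("sitemap:")]

print(sitemaps_from_robots("www.example.gov"))
```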
As I mentioned, monitoring site maps is our way of indexing your sites and keeping them up to date. We do have a crawler, but we do not use it as part of our automated process, because the processing power and time required to get all the way through even one large site is super high. It can take weeks to get through the largest websites.

>> We have our help page about site maps linked here; this is the same one that is available from the indexing landing page. This page will take you through all of the particulars. At a high level, a site map is a machine-friendly XML inventory of the content of your domain. I say domain here rather than website because any given site map needs to contain URLs from only one domain, and a website might span several subdomains, depending on how yours is set up. You can have multiple site maps for a given domain, which is helpful if you have content coming from multiple platforms, or have more items than the 50,000 URL maximum for a single site map. If you have more than one site map, you need to list them in your robots.txt file or use a site map index format, so check out this page to see which is appropriate for you, maybe both.

>> If for some reason you are not able to list a site map in your robots.txt, or you had to put your site map in a nonstandard location, let us know via email where it is and we will enter the location in the back end of our system. There is not a place in the Admin Center to submit a site map at this time. Hopefully your content management system has a plug-in that will automatically generate and maintain a site map for you. One thing to watch out for is whether PDFs and other static files you have uploaded to your CMS will be included. They may be included by default, or you may need to switch something on in order to include them, or they may not be able to be included automatically at all, so that you have to enter them one by one, which is a nonstarter for everybody.

>> If they cannot be added easily and automatically, you may need a second site map solution. We have recommendations at the bottom of the site map page, including some generators and some tools you can run locally. Hopefully your content management system has a plug-in, but if you need to use a site map generator, it is especially helpful for sites without content management systems at all. If you need help picking out one that is right for your setup, you can reach out to us. If you do use a generator, we recommend that you set it to ignore query strings, the versions of URLs that have a question mark and some other information that changes the page display based on the criteria that follow the question mark. We recommend you set the generator to ignore query strings when it is crawling your site, unless the query strings are truly necessary for displaying your page content. That is to avoid things like "list of interesting things, page 1, page 2, page 3," or "page 2 sorted by date," or other kinds of sorting and display parameters. If a query string calls a document ID and that is the only way your page will display the appropriate content, then by all means include those query strings in the list you develop for your site map.

>> Lastly, somewhat apart from site maps: if you have static content that does not change very often, you can send us a flat list of URLs by email and we can index from that. If you do not have site maps already, we encourage you to stand one up, not only because it will help us support you in the future, but because the commercial search engines work with them too, and it should improve your index coverage out there.
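If you end up scripting a site map yourself rather than relying on a plug-in or generator, here is a minimal sketch of the points above: one domain per file, a lastmod value for each URL, and query-string URLs skipped. The URLs are placeholders, and the skip rule is the simple case; keep query strings if they are the only way to reach the content.

```python
# Minimal sketch: build a sitemap.xml for one domain, skipping query-string URLs.
# The URLs are placeholders; adjust the skip rule if query strings are required.
from urllib.parse import urlparse
from xml.etree import ElementTree as ET

pages = [
    ("https://www.example.gov/about", "2019-01-15"),
    ("https://www.example.gov/reports/annual-report.pdf", "2018-11-02"),
    ("https://www.example.gov/news?page=2&sort=date", "2019-01-20"),  # skipped
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in pages:
    if urlparse(loc).query:                       # drop paginated/sorted views
        continue
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod  # lets daily checks spot updates

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```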
>> So, you have a comprehensive and current site map posted for each of the domains we will be indexing. What happens now? We tell our system to start indexing each of the domains, and the system captures all the URLs from your site maps. One of the benefits of working with us is that we understand government content and coverage needs better than the commercial engines. You may have noticed that, with some of your drier, repetitive content, Googlebot and Bingbot do not pick up the individual files. Their official policy is to pick up what they determine to be the most important and useful information on the web. But since we are within government, even if your tens of thousands of nearly identical PDFs are foreign agent registration forms, they need to be retrievable through search, so we are not going to make any decisions about what URLs to include or exclude from your search experience beyond agreeing with you on the subdomains.

>> If there are URLs you want to exclude from indexing, we have a help page available from the indexing landing page; that is the one about how to get search engines to index the right content for better discoverability. After we collect the URLs, we fetch your content. There are a few stages to this. We visit each URL, we render the HTML, and we gather all the content we talked about earlier on the what-we-index-from-your-website page. When we come to your site, we will make one request per second, or follow the crawl-delay setting in your robots.txt file, whichever is slower. We will always come from the same IP addresses with the same user agent, so you can easily identify us. There are some firewall protections that trigger a block on us even though we work slowly and respectfully, so there is a chance we may need to ask you to whitelist our IP addresses. Startup indexing is always the heaviest load of requests; daily updates are very light, catching whatever couple score of pages may be new or updated. We visit site maps daily at 4 a.m. Eastern to check for new or updated pages. Ideally, your site map will include a lastmod date that gets updated when a page gets updated, and we will pick up the changes on a daily basis. For any pages that have not been updated in a while, we do a monthly sweep to check for 404s or 301s and update the index accordingly, removing or replacing items. If there is something you need removed right away, you can email or call us and we can take care of it manually.
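Here is a sketch of that politeness behavior described above (one request per second, or your crawl-delay if that is slower), using Python's standard robots.txt parser. The user agent and URLs are placeholders, not Search.gov's actual values.

```python
# Sketch: honor a site's crawl-delay, defaulting to one request per second.
# "examplebot" and the URLs are placeholders, not Search.gov's real user agent.
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://www.example.gov/robots.txt")
rp.read()

delay = rp.crawl_delay("examplebot") or 1.0   # robots.txt crawl-delay, if any
delay = max(delay, 1.0)                       # whichever is slower

for url in ("https://www.example.gov/", "https://www.example.gov/about"):
    if rp.can_fetch("examplebot", url):
        # a real fetch of `url` would go here
        time.sleep(delay)
```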
>> After the content is indexed, we move into testing and review. For brand-new customers, the following steps will probably take place in the search site that will be your production site. For sites transitioning from Bing, our team will test the new index using your popular queries to make sure things look okay. We do this in a clone of your production search site, so we can see the new index in the context of all of the features you have set up for your searchers: YouTube videos, etc. When we think it looks good, we will share the test site with you, and your team will then make sure your relevant stakeholders have a chance to review, provide feedback, and ask questions. Together we will iterate. For instance, with one agency, we ran our test plan and shared the review site with them. When they looked at it, they had concerns about old content ranking high and also about irrelevant pages being returned. When we looked into it, they had not added date metadata yet, they had only the other Open Graph fields, and their main tag was enclosing the side rails, so we were matching on the side rail menu text. We were able to identify the issues based on their feedback. They were then able to add the date fields and move the location of their main tag, we re-fetched their pages to pick up the changes, and we saw a marked improvement in how the results were being returned.

>> When the search experience gets a green light, we are ready to launch the index for you. As I said earlier, for new customers your search site will already be attached to the index, so you can proceed with the rest of the steps for going live with our service. For sites transitioning from Bing, the last thing you need to do is give the okay. We will disconnect your existing search site from Bing by changing a setting. All your other search features and branding remain the same; just the portion of the results page that used to show Bing results will now be powered by Search.gov, and you will see the attribution at the bottom. The change is effective immediately, so there is no delay for propagation.

>> To quickly review the steps, we have our process. There are several steps involved, but I hope it sounds pretty straightforward. First, we decide whether we are going to index your site ourselves or support you on the Bing index, and we will probably index you ourselves. We agree on the domains and subdomains we will be indexing for you. We research the domains and work with you to make sure there is a site map in place for each of them; as I mentioned, putting site maps up is definitely a win-win when it comes to working with us and with the commercial engines. We do the work of indexing the subdomains. We test it, we review the index with you, you review it with internal stakeholders, and when you are ready you give a final green light and we send the index live with the back-end settings changed. That is it. Easy as pie.

>> It is a little bit more labor-intensive than making a pie. That is the end of the presentation portion, and we can take more questions now if we have any.

>> We have three questions. Will you index our [ inaudible ] demo content?

>> We get this request a fair amount, which makes sense. We have sites that are developing a relaunched site in a staging environment, and it is full of [ indiscernible ] content. If the test site, the demo server, is publicly available, yes, we can index it. If it is password-protected or has any sort of IP restrictions on it, we are not supposed to index it; we are not supposed to include any secure content. You could argue your demo staging content is not really secure content, but still, we are not supposed to. If you are able to open up the server so it is publicly available, you could put up a robots.txt file to keep it from being indexed by Google and Bing. Then we can index it, but it will not get into their indexes. It does have to be publicly available.

>> You mentioned there were some other questions?

>> The second question we have: how do you reindex our content if I relaunch my website and I don't want the old content to show?

>> This is another question that comes up a bit. It is always great when the question comes ahead of the relaunch, but we do have people call us with this question after the relaunch: "Help, my site relaunched yesterday and all of the search results still have the old domain or the old folder structure in them, everything 404s, what do we do?"
It does take time to rebuild a search index after a relaunch. What we recommend is that you contact us ahead of time. In this case, most people relaunching will currently be using the Bing index, so we will have a parallel path going on where we work with you to figure out what the domains and subdomains are. What we would then need to do is work with you to coordinate an indexing of the content after a soft launch of your website: you open it up, you don't tell anybody, we come in and index, and then you have your big release the next day and the search results are ready to go. In the meantime, we will have switched your web results over. The moment you are ready to make the switch from the old site to the new site is the moment we want to switch the index supporting the search, so the change comes right around the change in your platform. If that is complicated, there is another measure we can take: you can use a feature called a search page alert. You can use that to post a message like "welcome to our relaunched site, we are rebuilding the search index, thank you for your patience, it will be ready for you soon." Or you can do a combination of the two. Even though that might sound less than ideal, it is still better than entirely incorrect search results, or waiting it out to see when Bing will crawl through your site. We have seen major sites relaunch and have it take six weeks. That is unacceptable. That is another thing we are happy about in controlling our own index: we can manage the timing with you.

>> Any other questions?

>> We have another question. It is asking, how do we know newly added content is also indexed?

>> We do not have a way for you to view the list of URLs we have indexed for you. From your side of things, one of the easiest things to do would be to find the most unique phrase on the page and search for it. If it comes up, great, it is indexed; if it does not, reach out to us, and we may want to do some investigation as to why we have not gotten the page yet. In the case where our daily site map check has already come through and you post something that you want to make sure is indexed and available right away, the best thing to do is to make a Best Bet for it, which will post it at the top of the results. Or you can send an email or give us a call to say "I have this URL I need indexed right away," and we can manually add an individual URL to the index. We can also check on individual URLs to see their status. We are always happy to help.

>> We do not have any further questions at this time.

>> I think that brings us to the end of our time.

>> Thank you very much, everybody, for participating today. You can send us your questions at search@support.digitalgov.gov. The recording will be available on the event page on digital.gov, and we will be posting it on our website.

>> Thanks, everybody.

>> Have a good afternoon.

>> [ Event concluded ]