The information here is direct and easy to assimilate. The syntax of these files is very simple, yet any mistake, however small, can cause the spiders not to crawl the pages the way we want. In the best case, they will keep visiting URLs that only waste their crawl time. In the worst case, the opposite happens: content that we actually want to appear in the search engine will not be crawled and indexed.
We start by declaring, on one line, the user-agent (the name of the system that is browsing or crawling the site) that we want to affect, and after this line we indicate the accesses we permit and prohibit.
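As a quick sketch (the paths are only placeholders), each block starts with the user-agent it targets, followed by that agent's own rules:

    # Rules for every crawler
    User-agent: *
    Disallow: /tmp/

    # Rules only for Googlebot
    User-agent: Googlebot
    Disallow: /drafts/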
We use the “Allow: {expression}” and “Disallow: {expression}” directives to grant access or remove it. By default, a robot has access to every URL on the site (the equivalent of declaring “Allow: /”), but even though this is already the case from the start, many decide to state it explicitly and then add their prohibitions from there. So, even though it is unnecessary, it should not seem strange to see a robots.txt that begins this way.
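For example, a file that states the default explicitly and then restricts from there could begin like this (the disallowed paths are just placeholders):

    User-agent: *
    Allow: /
    Disallow: /search/
    Disallow: /private/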
We can specify our sitemap.xml if we want (Sitemap: sitemap.xml). This matters little to Google if we manage Google Search Console properly; however, it can help other robots find and read it, so declaring it will not hurt us.
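As a hedged example, the directive is usually written with the full address of the sitemap (the domain here is made up):

    Sitemap: https://www.example.com/sitemap.xml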
One interesting thing about this sitemap file is that it can even be hosted on a domain other than our own. This can be useful if, for example, we need to upload changes to the file from time to time and the website we work on does not let us update it that quickly. An example that covers everything we have just mentioned would be the following.
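Something like this, with purely illustrative paths and a sitemap hosted on another (made-up) domain:

    User-agent: *
    Allow: /
    Disallow: /private/
    Disallow: /search/

    # Sitemap served from an external domain
    Sitemap: https://cdn.example.com/sitemap.xml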