The first thing a search bot does when it arrives at your site is look for and read the robots.txt file. What is this file? It is a set of instructions for a search engine.

It is a plain text file with a .txt extension located in the root directory of the site. This set of instructions tells the search robot which pages and files on the site to index and which not to. It also indicates the main mirror of the site and tells the robot where to find the sitemap.

What is the robots.txt file for? For proper indexing of your site: so that the search results do not contain duplicate pages or various service pages and documents. Once you correctly configure the directives in robots.txt, you will save your site from many problems with indexing and site mirroring.

How to create the correct robots.txt

It is quite easy to create robots.txt: open a text document in standard Windows Notepad and write the directives for search engines in this file. Then save the file with the name “robots” and the extension “txt”. Now upload it to your hosting, into the root folder of the site. Please note that you can create only one “robots” document per site. If this file is not on the site, the bot automatically “decides” that everything may be indexed.
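For example, the very first version of the file can simply contain a minimal “allow everything” sketch like this:

User-agent: *
Disallow: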

Since there is only one such file, it contains instructions for all search engines. Moreover, you can write both separate instructions for each search engine and a general set for all of them at once. The separation of instructions for different search bots is done through the User-agent directive. We will talk more about this below.

Robots.txt directives

The file “for robots” can contain the following directives for managing indexing: User-agent, Disallow, Allow, Sitemap, Host, Crawl-delay, Clean-param. Let's look at each instruction in more detail.

User-agent directive

The User-agent directive indicates which search engine (more precisely, which specific bot) the instructions are for. If there is a “*”, the instructions are intended for all robots. If a specific bot is specified, such as Googlebot, the instructions are intended only for Google's main indexing robot. Moreover, if there are instructions both for Googlebot specifically and for all other bots, Google will read only its own instructions and ignore the general ones. The Yandex bot will do the same. Let's look at an example of writing the directive.

User-agent: YandexBot - instructions only for the main Yandex indexing bot
User-agent: Yandex - instructions for all Yandex bots
User-agent: * - instructions for all bots

Disallow and Allow directives

The Disallow and Allow directives give instructions on what to index and what not to. Disallow tells the robot not to index a page or an entire section of the site. Allow, on the contrary, indicates what should be indexed.

Disallow: / - prohibits indexing the entire site
Disallow: /papka/ - prohibits indexing the entire contents of the folder
Disallow: /files.php - prohibits indexing the files.php file

Allow: /cgi-bin – allows cgi-bin pages to be indexed

It is possible and often simply necessary to use special characters in the Disallow and Allow directives. They are needed to specify regular expressions.

The special character * replaces any sequence of characters. It is implied by default at the end of each rule: even if you do not write it, the search engines will append it themselves. Usage example:

Disallow: /cgi-bin/*.aspx – prohibits indexing all files with the .aspx extension
Disallow: /*foto - prohibits indexing of files and folders containing the word foto

The special character $ cancels the effect of the special character “*” at the end of a rule. For example:

Disallow: /example$ - prohibits indexing '/example', but does not prohibit '/example.html'

And if you write it without the special symbol $, then the instruction will work differently:

Disallow: /example - disables both '/example' and '/example.html'

Sitemap Directive

The Sitemap directive tells the search engine robot where the sitemap is located on the hosting. The sitemap must be in the sitemaps.xml format. A sitemap is needed for faster and more complete indexing of the site. Moreover, a sitemap is not necessarily a single file; there can be several of them. Directive format:

Sitemap: http://site/sitemaps1.xml
Sitemap: http://site/sitemaps2.xml

Host directive

The Host directive indicates to the robot the main mirror of the site. If your site has mirrors, you should always specify this directive. If you do not, the Yandex robot will index at least two versions of the site, with and without www, until the mirror robot glues them together. Example entry:

Host: www.site.ru
Host: site.ru

In the first case, the robot will index the version with www, in the second case, without. You are allowed to specify only one Host directive in the robots.txt file. If you enter several of them, the bot will process and take into account only the first one.

A valid Host directive must contain the following data:
- the connection protocol (HTTP or HTTPS);
- a correctly written domain name (you cannot enter an IP address);
- the port number, if necessary (for example, Host: site.com:8080).

Directives made incorrectly will simply be ignored.
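For instance, Host entries that satisfy these requirements might look like this (the domains are placeholders):

Host: https://www.site.ru
Host: site.com:8080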

Crawl-delay directive

The Crawl-delay directive allows you to reduce the load on the server. It is needed in case your site begins to buckle under the onslaught of various bots. The Crawl-delay directive tells the search bot the minimum waiting time between the end of downloading one page and the start of downloading the next page of the site. The directive must come immediately after the “Disallow” and/or “Allow” directive entries. The Yandex search robot can read fractional values, for example 1.5 (one and a half seconds).
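As a sketch, such a block might look like this (the /search/ path is just a placeholder):

User-agent: Yandex
Disallow: /search/
Crawl-delay: 1.5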

Clean-param directive

The Clean-param directive is needed by sites whose page addresses contain dynamic parameters that do not affect the content of the pages. This is various service information: session identifiers, user IDs, referrers, and so on. To avoid duplicates of such pages, this directive is used. It tells the search engine not to re-download this duplicated information, which also reduces the load on the server and the time the robot spends crawling the site.

Clean-param: s /forum/showthread.php

This entry tells the search engine that the s parameter is to be considered insignificant for all URLs that start with /forum/showthread.php. The maximum entry length is 500 characters.
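For instance, with this rule hypothetical addresses such as the following would be treated as the same page:

/forum/showthread.php?s=681498b9648949605&t=8243
/forum/showthread.php?s=1e71c4427317a117a&t=8243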

We've sorted out the directives, let's move on to setting up our robots file.

Setting up robots.txt

Let's proceed directly to setting up the robots.txt file. It must contain at least two entries:

User-agent: indicates which search engine the instructions below are for.
Disallow: specifies which part of the site should not be indexed. It can block from indexing both a single page and entire sections of the site.

Moreover, you can indicate that these directives are intended for all search engines or for one specifically. This is set in the User-agent directive. If you want all bots to read the instructions, put an asterisk:

User-agent: *

If you want to write instructions for a specific robot, you must specify its name.

User-agent: YandexBot

A simplified example of a correctly composed robots file would be like this:

User-agent: *
Disallow: /files.php
Disallow: /section/
Host: site.ru

Here * indicates that the instructions are intended for all search engines;
Disallow: /files.php prohibits indexing of the file files.php;
Disallow: /section/ prohibits indexing of the entire “section” directory with all nested files;
Host: site.ru tells robots which mirror to index.

If you don’t have pages on your site that need to be closed from indexing, then your robots.txt file should be like this:

User-agent: *
Disallow:
Host: site.ru

Robots.txt for Yandex

To indicate that these instructions are intended for the Yandex search engine, you must specify User-agent: Yandex. Moreover, if we enter “Yandex”, then all Yandex robots will index the site, and if we specify “YandexBot”, this will be a command only for the main indexing robot.

It is also necessary to specify the “Host” directive, indicating the main mirror of the site. As I wrote above, this is done to prevent duplicate pages. Your correct robots.txt for Yandex will look like this.
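A minimal sketch of such a file (with a placeholder domain) might be:

User-agent: Yandex
Disallow:
Host: site.ru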

Every robot requires instructions for its work, and search engines are no exception to this rule, which is why a special file called robots.txt was invented. This file should be located in the root folder of your site, or it can be virtual, but it must open upon request at: www.yoursite.ru/robots.txt

Search engines have long since learned to distinguish necessary html files from the internal scripts of your CMS, or rather, they have learned to recognize links to content articles and all sorts of rubbish. Therefore, many webmasters forget to make a robots file for their sites, thinking that everything will be fine anyway. They are 99% right, because if your site does not have this file, search engines are unrestricted in their hunt for content, but there are nuances whose errors can be taken care of in advance.

If you have any problems with this file on your site, write a comment on this article and I will quickly help you with it, absolutely free. Very often webmasters make minor mistakes in it, which results in poor indexing of the site, or even exclusion from the index.

What is robots.txt for?

The robots.txt file is created to configure the correct indexing of the site by search engines. That is, it contains rules allowing and prohibiting the crawling of certain paths of your site or types of content. But it is not a panacea. The rules in a robots file are not strict directives to be followed exactly, but simply recommendations for search engines. Google, for example, writes:

You cannot use the robots.txt file to hide a page from Google Search results. Other pages can link to it, and it will still be indexed.

Search robots themselves decide what to index and what not, and how to behave on the site. Each search engine has its own tasks and functions. No matter how much we want, this is not a way to tame them.

But there is one trick that is not directly related to the topic of this article. To completely prevent robots from indexing a page and showing it in search results, you need to write:
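For example, this can be done with the robots meta tag in the page's <head> section, or with the equivalent X-Robots-Tag HTTP header sent by the server:

<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex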

Let's return to robots. The rules in this file can block or allow access to the following types of files:

  • Non-graphic files. Basically these are html files that contain some information. You can close duplicate pages, or pages that do not carry any useful information (pagination pages, calendar pages, archive pages, profile pages, etc.).
  • Graphic files. If you want site images not to be displayed in searches, you can set this in the robots.
  • Resource files. Using robots, you can also block the indexing of various scripts, CSS style files and other unimportant resources. But you should not block resources that are responsible for the visual part of the site for visitors (for example, if you close the CSS and JS of the site that draw beautiful blocks or tables, the search robot will not see them and will complain about it).

To show clearly how robots.txt works:

A search robot arriving at a site looks at the indexing rules and then indexes according to the recommendations of the file.
Depending on how the rules are configured, the search engine knows what can be indexed and what cannot.

Robots.txt file syntax

To write rules for search engines, directives with various parameters are used in the robots file, and the robots follow them. Let's start with the very first and probably the most important directive:

User-agent directive

User-agent - with this directive you specify the name of the robot that should use the recommendations in the file. There are officially 302 such robots on the Internet. Of course, you can write rules for each of them separately, but if you don't have time for that, just write:

User-agent: *

The * in this example means “all”. That is, your robots.txt file should start by stating “who exactly” the file is intended for. In order not to bother with the names of all the robots, just write an asterisk in the user-agent directive.

I will give you detailed lists of robots of popular search engines:

Google - Googlebot - the main robot

Other Google robots

Googlebot-News - news search robot
Googlebot-Image - image search robot
Googlebot-Video - video robot
Googlebot-Mobile - mobile version robot
AdsBot-Google - landing page quality checking robot
Mediapartners-Google - AdSense service robot

Yandex - YandexBot - the main indexing robot;

Other Yandex robots

Disallow and Allow directives

Disallow is the most basic rule in robots.txt; it is with this directive that you prohibit certain places on your site from being indexed. The directive is written like this:

Disallow:

Very often you can see an empty Disallow: directive, i.e. one that supposedly tells the robot that nothing on the site is prohibited - index whatever you want. Be careful! If you put / in Disallow, you will completely close the site from indexing.

Therefore, the most standard version of robots.txt, which “allows indexing of the entire site for all search engines,” looks like this:

User-Agent: *
Disallow:

If you don't know what to write in robots.txt but have heard about it somewhere, just copy the code above, save it in a file called robots.txt and upload it to the root of your site. Or don't create anything at all, because even without it robots will index everything on your site. Or read the article to the end, and you will understand what to close on the site and what not to.

According to the robots.txt rules, the Disallow directive is mandatory.

This directive can prohibit both a folder and an individual file.

If you want to block a folder, you should write:

Disallow: /folder/

If you want to block a specific file:

Disallow: /images/img.jpg

If you want to block certain file types:

Disallow: /*.png$

Regular expressions are not supported by many search engines, but Google does support these wildcards.

Allow is the permitting directive in robots.txt. It allows the robot to index a specific path or file inside a prohibited directory. Until recently it was used only by Yandex; Google caught up and started using it too. For example:

Allow: /content
Disallow: /

These directives prevent all site content from being indexed except for the content folder. Or here are some other directives that have become popular lately:

Allow: /template/*.js
Allow: /template/*.css
Disallow: /template

These rules allow the CSS and JS files in your template folder to be indexed, but do not allow anything else in the template folder to be indexed. Over the past year, Google has sent many letters to webmasters with the following content:

Googlebot can't access CSS and JS files on the site

And the corresponding comment: We have discovered an issue on your site that may be preventing it from being crawled. Googlebot cannot process the JavaScript code and/or CSS files due to restrictions in the robots.txt file. This data is needed to evaluate how the site works. Therefore, if access to these resources is blocked, it may worsen the position of your site in Search.

If you add to your robots.txt the two Allow directives written in the last code snippet, you will not see such messages from Google.

The use of special characters in robots.txt

Now about the characters used in directives. The basic special characters used in prohibiting or allowing rules are /, * and $.

About forward slash “/”

The slash is very deceptive in robots.txt. I have observed an interesting situation several dozen times when, out of ignorance, the following was added to robots.txt:

User-Agent: *
Disallow: /

This happens because they read somewhere about site structure and copied it onto their own site. But in this case you prohibit indexing of the entire site. To prohibit indexing of a particular directory with all its contents, you definitely need to put / at the end. If, for example, you write Disallow: /seo, then absolutely all links on your site that contain the word seo will not be indexed: the folder /seo/, the category /seo-tool/, the article /seo-best-of-the-best-soft.html, none of it will be indexed.

Check every / in your robots.txt carefully.

Always put / at the end of directory names. If you put / in Disallow, you will prevent the entire site from being indexed, but if you do not put / in Allow, you will likewise prevent the entire site from being indexed. In a sense, / means “everything that follows the directive”.

About asterisks * in robots.txt

The special character * means any (including empty) sequence of characters. You can use it anywhere in robots like this:

User-agent: *
Disallow: /papka/*.aspx
Disallow: /*old

This prohibits all files with the aspx extension in the papka directory, and also prohibits not only the /old folder but any path containing “old”, such as /papka/old. Tricky? That's why I don't recommend playing around with the * symbol in your robots file.

By default, the indexing and blocking rules in robots.txt have an implied * at the end of every directive!

About the special character $

The $ special character in robots ends the effect of the * special character. For example:

Disallow: /menu$

This rule prohibits '/menu' but does not prohibit '/menu.html'; i.e. the rule blocks only the exact path /menu for search engines and cannot block all files whose URL contains the word menu.

The host directive

The Host rule works only in Yandex, so it is optional; it determines the main domain among your site's mirrors, if there are any. For example, you have the domain dom.com, but the domains dom2.com, dom3.com and dom4.com have also been purchased and configured, and they redirect to the main domain dom.com.

To help Yandex quickly determine which of them is the main site (host), write the Host directive in your robots.txt:

Host: dom.com

If your site does not have mirrors, you don't have to set this rule. But first check your site by IP address: perhaps your home page opens by IP as well, and then you should specify the main mirror. Or perhaps someone has copied all the information from your site and made an exact copy; if your robots.txt was stolen along with everything else, the Host entry in it will also help you here.

There should be only one Host entry and, if necessary, it should include the port (Host: site.com:8080).

Crawl-delay directive

This directive was created to remove the possibility of overloading your server. Search engine bots can make hundreds of requests to your site at the same time, and if your server is weak this can cause minor glitches. To prevent this, the Crawl-delay rule for robots was invented: it is the minimum period between page loads on your site. The recommended standard value for this directive is 2 seconds. In robots.txt it looks like this:

Crawl-delay: 2

This directive works for Yandex. In Google, you can set the crawl frequency in the webmaster panel, in the Site Settings section, via the “gear” in the upper right corner.

Clean-param directive

This parameter is also only for Yandex. If site page addresses contain dynamic parameters that do not affect their content (for example: session identifiers, users, referrers, etc.), you can describe them using the Clean-param directive.

Using this information, the Yandex robot will not repeatedly reload duplicate information. This will increase the efficiency of crawling your site and reduce the load on the server.
For example, the site has pages:

www.site.com/some_dir/get_book.pl?ref=site_1&book_id=123

The ref parameter is used only to track which resource the request was made from and does not change the content; the same page with the book book_id=123 will be shown at all three addresses. Then, if you specify the directive as follows:

User-agent: Yandex
Disallow:
Clean-param: ref /some_dir/get_book.pl

The Yandex robot will reduce all page addresses to one:
www.site.com/some_dir/get_book.pl?ref=site_1&book_id=123,
If a page without parameters is available on the site:
www.site.com/some_dir/get_book.pl?book_id=123
then everything will come down to it when it is indexed by the robot. Other pages on your site will be crawled more often since there is no need to refresh pages:
www.site.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.site.com/some_dir/get_book.pl?ref=site_3&book_id=123

#for addresses like:
www.site1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.site1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243
#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s /forum/showthread.php

Sitemap Directive

With this directive you simply specify the location of your sitemap.xml. The robot remembers it, “says thank you,” and then constantly analyzes it along the given path. It looks like this:

Sitemap: http://site/sitemap.xml

Now let's look at the general questions that arise when composing a robots file. There are many such topics on the Internet, so we will analyze the most relevant and most common ones.

Correct robots.txt

There is a lot hidden in the word “correct”, because for one site on one CMS a file will be correct, while on another CMS it will produce errors. A “correctly configured” file is individual for each site. In robots.txt you need to close from indexing those sections and files that are not needed by users and provide no value to search engines. The simplest and most correct version of robots.txt:

User-Agent: *
Disallow:
Sitemap: http://site/sitemap.xml

User-agent: Yandex
Disallow:
Host: site.com

This file contains the following rules: a block of rules for all search engines (User-Agent: *), indexing of the entire site is fully allowed (“Disallow:”, or you can specify “Allow: /”), the main mirror host for Yandex is specified (Host: site.com), and the location of your Sitemap.xml is given (Sitemap: http://site/sitemap.xml).

Robots.txt for WordPress

Again, there are many questions here: one site might be an online store, another a blog, a third a landing page, a fourth a business card site for a company, and all of them can run on the WordPress CMS, yet the robots rules will be completely different. Here is my robots.txt for this blog:

User-Agent: *
Allow: /wp-content/uploads/
Allow: /wp-content/*.js$
Allow: /wp-content/*.css$
Allow: /wp-includes/*.js$
Allow: /wp-includes/*.css$
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /template.html
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /category
Disallow: /archive
Disallow: */trackback/
Disallow: */feed/
Disallow: /?feed=
Disallow: /job
Disallow: /?.net/sitemap.xml

There are a lot of settings here, let's look at them together.

Allow in WordPress. The first, allowing rules are for content that users need (the pictures in the uploads folder) and that robots need (the CSS and JS for rendering pages). It is CSS and JS that Google often complains about, so we left them open. I could have used a catch-all such as “/*.css$”, but the Disallow lines for the specific folders where these files live would not have let them be indexed, so I had to spell out the path inside each blocked folder in full.

Allow always points to a path inside content blocked by Disallow. If something is not blocked, there is no point writing an Allow for it, as if you were giving search engines a push: “Come on, here's a URL for you, index it faster.” It doesn't work that way.

Disallow in WordPress. A WP CMS has a lot that needs to be blocked: lots of different plugins, lots of settings and themes, lots of scripts and various pages that carry no useful information. But I went further and completely prohibited indexing of everything on my blog except the articles themselves (posts) and the pages (About the author, Services). I even closed the blog categories; I will open them when they are optimized for queries and each of them has a text description, but for now they are just duplicate post previews that search engines do not need.

Well, Host and Sitemap are standard directives. I just needed to specify the host separately for Yandex, but I didn't bother about it. That's probably it for robots.txt for WP.

How to create robots.txt

It's not as difficult as it seems at first glance. You just need to take regular Notepad and copy into it the data for your site according to the settings from this article. But if that is difficult for you, there are resources on the Internet that let you generate robots files for your sites.

No one will tell you more about your robots.txt than these comrades. After all, it is for them that you create your “forbidden file”.

Now let's talk about some minor errors that may exist in robots.

  • Empty line - it is unacceptable to put an empty line inside the User-agent directive block.
  • When two directives with prefixes of the same length conflict, the Allow directive takes precedence.
  • Only one Host directive is processed per robots.txt file. If several directives are specified in the file, the robot uses the first one.
  • The Clean-param directive is cross-sectional, so it can be specified anywhere in the robots.txt file. If several directives are specified, all of them will be taken into account by the robot.
  • Six Yandex robots do not follow the rules of robots.txt (YaDirectFetcher, YandexCalendar, YandexDirect, YandexDirectDyn, YandexMobileBot, YandexAccessibilityBot). To keep them off the site, you should make a separate User-agent block for each of them (see the example after this list).
  • The User-agent directive must always be written above the prohibiting directives.
  • One line for one directory. You cannot write multiple directories on one line.
  • The file name must be exactly robots.txt. No Robots.txt, ROBOTS.txt and so on - only lowercase letters in the name.
  • In the Host directive you should write the path to the domain without http and without slashes. Incorrect: Host: http://www.site.ru/, correct: Host: www.site.ru
  • When the site uses the secure https protocol, the Host directive (for the Yandex robot) must include the protocol: Host: https://www.site.ru
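For example, to keep one of these robots (here YandexCalendar) off the entire site, a dedicated block like this can be used:

User-agent: YandexCalendar
Disallow: /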

This article will be updated as interesting questions and nuances become available.

I was with you, lazy Staurus.

Hello, dear readers of the “Webmaster’s World” blog!

The robots.txt file is a very important file that directly affects the quality of indexing of your site, and therefore its search promotion.

That is why you must be able to format robots.txt correctly, so as not to accidentally block important documents of your Internet project from being included in the index.

How to format the robots.txt file, what syntax to use, and how to allow and deny documents for the index will be discussed in this article.

About the robots.txt file

First, let's find out in more detail what kind of file this is.

The robots file is a file that shows search engines which pages and documents of a site can be added to the index and which cannot. It is necessary because by default search engines try to index the entire site, and this is not always correct. For example, if you are building a site on an engine (WordPress, Joomla, etc.), you will have folders that organize the work of the administrative panel. It is clear that the information in these folders should not be indexed, and in this case the robots.txt file is used to restrict access for search engines.
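For example, a minimal sketch of such a restriction (assuming the typical WordPress /wp-admin/ and Joomla /administrator/ service folders) might be:

User-agent: *
Disallow: /wp-admin/
Disallow: /administrator/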

The robots.txt file also contains the address of the site map (it improves indexing by search engines), as well as the main domain of the site (the main mirror).

A mirror is an exact copy of a site, i.e. when one site is available at two addresses, one of them is said to be the main domain and the other its mirror.

Thus, the file has quite a lot of functions, and important ones at that!

Robots.txt file syntax

The robots file contains blocks of rules that tell a particular search engine what can be indexed and what cannot. There can be one block of rules (for all search engines), but there can also be several of them - for some specific search engines separately.

Each such block begins with a “User-Agent” operator, which indicates which search engine these rules apply to.

User-Agent: A
(rules for robot “A”)

User-Agent: B
(rules for robot “B”)

The example above shows that the “User-Agent” operator has a parameter - the name of the search engine robot to which the rules are applied. I will list the main ones below:

After “User-Agent” there are other operators. Here is their description:

All operators have the same syntax, i.e. operators should be used as follows:

Operator1: parameter1

Operator2: parameter2

Thus, first we write the name of the operator (it does not matter whether in capital or small letters), then we put a colon and, separated by a space, indicate the parameter of this operator. Then, starting on a new line, we describe the second operator in the same way.

Important!!! An empty line will mean that the block of rules for this search engine is complete, so do not separate statements with an empty line.

Example robots.txt file

Let's look at a simple example of a robots.txt file to better understand the features of its syntax:

User-agent: Yandex
Allow: /folder1/
Disallow: /file1.html
Host: www.site.ru

User-agent: *
Disallow: /document.php
Disallow: /folderxxx/
Disallow: /folderyyy/folderzzz
Disallow: /feed/

Sitemap: http://www.site.ru/sitemap.xml

Now let's look at the described example.

The file consists of three blocks: the first for Yandex, the second for all search engines, and the third contains the sitemap address (applied automatically for all search engines, so there is no need to specify “User-Agent”). We allowed Yandex to index the folder “folder1” and all its contents, but prohibited it from indexing the document “file1.html” located in the root directory on the hosting. We also indicated the main domain of the site to Yandex. The second block is for all search engines. There we banned the document "document.php", as well as the folders "folderxxx", "folderyyy/folderzzz" and "feed".

Note that in the second block of commands we did not prohibit the entire “folderyyy” folder from the index, but only the folder inside it, “folderzzz”. That is, we specified the full path to “folderzzz”. This should always be done when we prohibit a document located not in the root directory of the site but somewhere inside other folders.

Creating such a file will take you less than two minutes.

The created robots file can be checked for correctness in the Yandex webmaster panel. If errors are found in the file, Yandex will show them.

Be sure to create a robots.txt file for your site if you don't have one yet. This will help your site develop in search engines. You can also read our other article about closing pages from indexing using meta tags and .htaccess.

The purpose of this guide is to help webmasters and administrators use robots.txt.

Introduction

The robots exclusion standard is very simple at its core. In short, it works like this:

When a robot that follows the standard visits a site, it first requests a file called “/robots.txt”. If such a file is found, the robot searches it for instructions prohibiting indexing of certain parts of the site.

Where to place the robots.txt file

The robot simply requests the URL “/robots.txt” on your site; the site in this case is a specific host on a specific port.

Site URL Robots.txt file URL
http://www.w3.org/ http://www.w3.org/robots.txt
http://www.w3.org:80/ http://www.w3.org:80/robots.txt
http://www.w3.org:1234/ http://www.w3.org:1234/robots.txt
http://w3.org/ http://w3.org/robots.txt

There can only be one file “/robots.txt” on the site. For example, you should not place the robots.txt file in user subdirectories - robots will not look for them there anyway. If you want to be able to create robots.txt files in subdirectories, then you need a way to programmatically collect them into a single robots.txt file located at the root of the site. You can use .

Remember that URLs are case sensitive and the file name “/robots.txt” must be written entirely in lowercase.

Wrong locations of robots.txt:
http://www.w3.org/admin/robots.txt - the file is not located at the root of the site
http://www.w3.org/~timbl/robots.txt - the file is not located at the root of the site
ftp://ftp.w3.com/robots.txt - robots do not index ftp
http://www.w3.org/Robots.txt - the file name is not in lowercase

As you can see, the robots.txt file should be placed exclusively at the root of the site.

What to write in the robots.txt file

The robots.txt file usually contains something like:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, indexing of three directories is prohibited.

Note that each directory is listed on a separate line - you cannot write “Disallow: /cgi-bin/ /tmp/”. You also cannot split a single Disallow or User-agent statement across several lines, because line breaks are used to separate instructions from each other.

Regular expressions and wildcards cannot be used either. The “asterisk” (*) in the User-agent instruction means “any robot”. Instructions like “Disallow: *.gif” or “User-agent: Ya*” are not supported.

The specific instructions in robots.txt depend on your site and what you want to prevent from being indexed. Here are some examples:

Block the entire site from being indexed by all robots

User-agent: *
Disallow: /

Allow all robots to index the entire site

User-agent: *
Disallow:

Or you can simply create an empty file “/robots.txt”.

Block only a few directories from indexing

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Prevent site indexing for only one robot

User-agent: BadBot
Disallow: /

Allow one robot to index the site and deny all others

User-agent: Yandex
Disallow:

User-agent: *
Disallow: /

Deny all files except one from indexing

This is quite difficult, because there is no “Allow” statement in the standard. Instead, you can move all files except the one you want to allow for indexing into a subdirectory and prevent that subdirectory from being indexed:

User-agent: *
Disallow: /docs/

Or you can explicitly disallow each of the files you do not want indexed:

User-agent: *
Disallow: /private.html
Disallow: /foo.html
Disallow: /bar.html

First, I’ll tell you what robots.txt is.

Robots.txt is a file located in the root folder of a site where special instructions for search robots are written. These instructions are needed so that, when entering the site, the robot does not take a given page or section into account; in other words, we close pages from indexing.

Why do we need robots.txt?

The robots.txt file is considered a key requirement for the SEO optimization of absolutely any website. The absence of this file can negatively affect the load created by robots and slow down indexing; even more, the site will not be indexed completely. Accordingly, users will not be able to reach some pages through Yandex and Google.

How does robots.txt affect search engines?

Search engines (especially Google) will index the site even without a robots.txt file, but, as I said, not all of its pages. If such a file exists, the robots are guided by the rules specified in it. Moreover, there are several types of search robots; some may take a rule into account while others ignore it. In particular, the GoogleBot robot does not take into account the Host and Crawl-delay directives, the YandexNews robot has recently stopped taking the Crawl-delay directive into account, and the YandexDirect and YandexVideoParser robots ignore the generally accepted directives in robots.txt (but do take into account those written specifically for them).

The site is loaded most by robots that download content from it. By telling the robot which pages to index and which to ignore, as well as at what time intervals to load content from the pages (this applies more to large sites that have more than 100,000 pages in the search engine index), we make it much easier for the robot to index and load content from the site.


Files that are unnecessary for search engines include files that belong to the CMS, for example, in WordPress - /wp-admin/. In addition, the ajax and json scripts responsible for pop-up forms, banners, captcha output, and so on.
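As a rough sketch for a WordPress-style site, assuming its typical service paths /wp-admin/ and /wp-json/, such a block could look like:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-json/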

For most robots I also recommend blocking all JavaScript and CSS files from indexing. But for GoogleBot and Yandex it is better to leave such files open for indexing, since search engines use them to analyze the usability of the site and its ranking.

What is a robots.txt directive?



Directives are the rules for search robots. The first standard for writing robots.txt appeared in 1994, and the extended standard in 1996. However, as you already know, not all robots support every directive. Therefore, below I have described what the main robots are guided by when indexing website pages.

What does User-agent mean?

This is the most important directive that determines which search robots will follow further rules.

For all robots:

User-agent: *

For a specific bot:

User-agent: Googlebot

Case does not matter in robots.txt: you can write either Googlebot or googlebot.

Google search robots







Yandex search robots

  • the main indexing robot;
  • the robot used in the Yandex.Images service;
  • the robot used in the Yandex.Video service;
  • the robot for multimedia data;
  • the blog search robot;
  • the robot that accesses a page when it is added through the “Add URL” form;
  • the robot that indexes website icons (favicons);
  • the Yandex.Direct robot;
  • the Yandex.Metrica robot;
  • the robot used in the Yandex.Catalog service;
  • the robot used in the Yandex.News service;
  • YandexImageResizer - the mobile services search robot.

Search robots Bing, Yahoo, Mail.ru, Rambler

Disallow and Allow directives

Disallow blocks sections and pages of your site from indexing. Accordingly, Allow, on the contrary, opens them.

There are some peculiarities.

First, there are the additional operators *, $ and #. What are they used for?

“*” - any number of characters, including none. It is implied at the end of every rule by default, so there is no point in adding it there again.

“$” - indicates that the character before it must be the last one.

“#” - a comment; the robot ignores everything that comes after this symbol.
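As a quick illustration of $ and # together (the /page path is hypothetical):

Disallow: /page$ # blocks exactly /page, but not /page.html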

Examples of using Disallow:

Disallow: *?s=

Disallow: /category/

Accordingly, the search robot will close from indexing pages whose URLs contain “?s=” or begin with /category/, for example site.ru/?s=phone or site.ru/category/news/.

But pages that do not match these patterns, for example site.ru/shop/phone/, will remain open for indexing.

Now you need to understand how nested rules are applied. The order in which directives are written is not decisive in itself; what matters is which directories the rules cover: if we want to block a page or document from indexing, it is enough to write a directive that covers its path. Let's look at an example.

Suppose our robots.txt file contains:

Disallow: /template/

In this case neither the /template/ directory itself nor any document nested inside it will be indexed.

Sitemap directive in robots.txt

The Sitemap directive can be specified anywhere in the file, and several sitemap files can be specified.

Host directive in robots.txt

This directive is needed to indicate the main mirror of the site (often with or without www). Note that the Host value is specified without the http:// protocol, but with the https:// protocol if the site runs over https. The directive is taken into account only by the Yandex and Mail.ru search robots; other robots, including GoogleBot, will not take the rule into account. Host should be specified once in the robots.txt file.

Example with http://

Host: website.ru

Example with https://

Host: https://website.ru

Crawl-delay directive

Sets the time interval for a search robot to index the site's pages. The value is specified in seconds; fractional values are allowed.

Example (a hypothetical value of three seconds):

Crawl-delay: 3

It is used mostly on large online stores, information sites and portals with traffic of 5,000 visits per day or more. It asks the search robot to make indexing requests only within a certain period of time. If this directive is not specified, robots can create a serious load on the server.

The optimal crawl-delay value is different for each site. For the Mail.ru, Bing and Yahoo search engines the value can be set to a minimum of 0.25 or 0.3, since these robots may crawl your site once a month or once every two months (very rarely). For Yandex, it is better to set a higher value.


If the load on your site is minimal, then there is no point in specifying this directive.

Clean-param directive

The rule is interesting because it tells the crawler that pages with certain parameters do not need to be indexed. Two arguments are given: the page URL and the parameter. This directive is supported by the Yandex search engine.

Example:

User-agent: *
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: *sort=
Disallow: *view=

User-agent: GoogleBot
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: *sort=
Disallow: *view=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif

User-agent: Yandex
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: *sort=
Disallow: *view=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif
Clean-Param: utm_source&utm_medium&utm_campaign

In the example, we wrote down the rules for 3 different bots.

Where to add robots.txt?

The file is added to the root folder of the site, so that it can be opened via a link of the form: www.yoursite.ru/robots.txt

How to check robots.txt?

Yandex Webmaster

On the Tools tab, select Robots.txt analysis and then click Check.

Google Search Console

On the Crawl tab, select the robots.txt file inspection tool and then click Check.

Conclusion:

The robots.txt file must be present on every promoted website, and only its correct configuration will give you the indexing you need.

And finally, if you have any questions, ask them in the comments under the article. I am also curious: how do you write your robots.txt?

