Allowing a view/function in robots.txt

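I've read a bit about robots.txt, and I understand that I should disallow all folders in my web application. However, I want to allow bots to read the main page and one view (URL, for example: www.mywebapp/searchresults; this is a CodeIgniter route, called from application/controller/function).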

Folder structure, for example:

-index.php (should be readable by bots)
-application
  -controllers
    -controller (contains a function that loads the view)
  -views
-public

Should I create robots.txt like this?

User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /application/controllers/function

Or using the route, like this:

User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /www.mywebapp/searchresults

Or using the view?

User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /application/views/search/index.php

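Thanks!

Answering my own old question:

When we want to let bots read certain pages, we need to use our URL (the route), not the controller or view file. The path in robots.txt is written relative to the host root, without the domain name, so in this case:

Allow: /searchresults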
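
One way to sanity-check rules like these before deploying them is Python's standard urllib.robotparser. The sketch below is only an illustration: it assumes the route is served at the path /searchresults on the host www.mywebapp, and the controller and image paths are hypothetical. Note that this parser applies the first matching rule and does not implement Google's wildcard extensions, so it may not mirror every crawler.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; every path is relative to the host root.
ROBOTS_TXT = """\
User-agent: *
Allow: /searchresults
Disallow: /application/
Disallow: /public/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
base = "http://www.mywebapp"  # host taken from the question

# Hypothetical paths used only to exercise the rules.
for path in (
    "/",
    "/searchresults",
    "/application/controllers/search.php",
    "/public/images/logo.png",
):
    print(path, parser.can_fetch("*", base + path))
```

The route and the main page come back as fetchable, while anything under /application/ or /public/ does not.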

In some cases we can also keep certain pages out of the index with an HTML tag (added to the page's <head>):

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

When we want to block certain folders, for example images, just use:

Disallow: /public/images

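Don't block the view files, because crawlers cannot access them directly. You need to block the URL that is used to access the view.

The robots.txt file must be placed in the document root of the host. It won't work in any other location.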

If your host is www.example.com, it needs to be accessible at http://www.example.com/robots.txt
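
As a small illustration of that rule, the robots.txt location can be derived from any page URL by keeping the scheme and host and replacing everything else. This is a sketch; the helper name robots_txt_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the only URL at which crawlers look for this host's robots.txt."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/searchresults?q=test"))
```

A robots.txt placed in a subfolder (for example /application/robots.txt) would never be requested by a crawler.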

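To remove directories or individual pages of your website, you can place a robots.txt file at the root of your server. When creating your robots.txt file, keep the following in mind: when deciding which pages to crawl on a particular host, Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot". If no such entry exists, it will obey the first entry with a User-agent of "*". Additionally, Google has made the robots.txt file standard more flexible through the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name.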

To remove all pages under a particular directory (for example, listings), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /listings
To remove all files of a specific file type (for example, .gif), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /*.gif$ 
To remove dynamically generated pages, you'd use this robots.txt entry:
User-agent: Googlebot
Disallow: /*? 
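
To see how the "*" and "$" extensions in the three entries above behave, here is a rough Python sketch that translates such a pattern into a regular expression. It is an approximation written for illustration, not Googlebot's actual matcher; the function names and the sample paths are assumptions.

```python
import re


def compile_rule(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    "*" matches any sequence of characters; a trailing "$" anchors the
    pattern to the end of the URL. Without "$" the rule is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for each "*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """True if the path (including any query string) is covered by the rule."""
    return compile_rule(pattern).match(path) is not None


# Hypothetical paths checked against the three patterns quoted above.
for rule, path in [
    ("/listings", "/listings/42"),
    ("/*.gif$", "/images/photo.gif"),
    ("/*.gif$", "/images/photo.gif?size=2"),
    ("/*?", "/search?q=robots"),
    ("/*?", "/search"),
]:
    print(rule, path, rule_matches(rule, path))
```

Because of the "$" anchor, /*.gif$ covers /images/photo.gif but not the same URL with a query string appended, while /*? covers only URLs that contain a "?".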
Option 2: Meta tags
Another standard, which can be more convenient for page-by-page use, involves adding a <META> tag to an HTML page to tell robots not to index the page. This standard is described at http://www.robotstxt.org/wc/exclusion.html#meta.
To prevent all robots from indexing a page on your site, you'd place the following meta tag into the <HEAD> section of your page:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
To allow other robots to index the page on your site, preventing only Google's robots from indexing the page, you'd use the following tag:
<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
To allow robots to index the page on your site but instruct them not to follow outgoing links, you'd use the following tag:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
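
To check which of these directives a page actually carries, one option is to read the tags back with Python's standard html.parser. The class below is a minimal sketch; its name and the sample page are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect robots directives from <meta name="robots|googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # e.g. {"robots": {"noindex", "nofollow"}}

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        name = attributes.get("name", "").lower()
        if name not in ("robots", "googlebot"):
            return
        values = {
            item.strip().lower()
            for item in attributes.get("content", "").split(",")
            if item.strip()
        }
        self.directives.setdefault(name, set()).update(values)


# Sample page, assumed for illustration only.
SAMPLE_PAGE = """
<html><head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<title>Search results</title>
</head><body></body></html>
"""

meta_parser = RobotsMetaParser()
meta_parser.feed(SAMPLE_PAGE)
print(meta_parser.directives)
```

Keep in mind that a crawler can only see a meta tag on a page it is allowed to fetch, so a page that is also disallowed in robots.txt may never have its tag read.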

Further reference:

https://www.elegantthemes.com/blog/tips-tricks/how-to-create-and-configure-your-robots-txt-file