Author: neubau
Description:
Google has released a protocol for sitemaps. It is in an experimental stage
right now and Google does not guarantee anything.
The announcement is at
http://googleblog.blogspot.com/2005/06/webmaster-friendly.html.
The protocol is explained at
https://www.google.com/webmasters/sitemaps/docs/en/protocol.html
The FAQ for this project is at
https://www.google.com/webmasters/sitemaps/docs/en/faq.html.
Currently, Wikipedia's pages do not get updated by Google "fast enough". I
monitored the article [[Sarah Kane]] at de.wikipedia when it was first written
and well-linked. It took several weeks until the URL appeared and several weeks
more until the content was indexed. With more than 2 million Wikipedia articles
and many more pages such as user pages, talk pages and other namespaces, simply
crawling the site might not be the best solution.
MediaWiki could automatically provide sitemaps. The protocol supports gzip
compression. A single file may not be larger than 10MB (uncompressed) and may
not contain more than 50,000 URLs. It is allowed to have more than one sitemap
file and link them in a sitemap index file. Sitemap index files may not list
more than 1,000 sitemaps. Under these limits, even the largest individual
MediaWiki installations (such as en.wikipedia and de.wikipedia) should not run
into protocol limits in the near future.
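The limits above can be sketched in a few lines of Python. This is only an
illustration of the splitting logic, not a proposed MediaWiki implementation:
the function names, the example URLs and the namespace URI in SITEMAP_NS are
assumptions and should be checked against the protocol page.

```python
import gzip
from xml.sax.saxutils import escape

# Limits as stated in this report.
MAX_URLS_PER_SITEMAP = 50_000
MAX_BYTES_PER_SITEMAP = 10 * 1024 * 1024  # 10MB, uncompressed
MAX_SITEMAPS_PER_INDEX = 1_000

# Assumed namespace URI; take the real value from the protocol page.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Return the uncompressed XML text of one sitemap file (loc only)."""
    if len(urls) > MAX_URLS_PER_SITEMAP:
        raise ValueError("too many URLs for one sitemap file")
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="%s">' % SITEMAP_NS]
    for url in urls:
        lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")
    xml = "\n".join(lines) + "\n"
    if len(xml.encode("utf-8")) > MAX_BYTES_PER_SITEMAP:
        raise ValueError("sitemap exceeds 10MB uncompressed")
    return xml


def build_index(sitemap_urls):
    """Return the XML text of a sitemap index listing the given files."""
    if len(sitemap_urls) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("too many sitemaps for one index file")
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="%s">' % SITEMAP_NS]
    for url in sitemap_urls:
        lines.append("  <sitemap><loc>%s</loc></sitemap>" % escape(url))
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


def compress(xml):
    """Gzip the XML text; the size limit applies before compression."""
    return gzip.compress(xml.encode("utf-8"))
```

With these limits a single index file can cover 50,000 x 1,000 = 50 million
URLs, and 2 million articles would need 40 sitemap files.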
The XML-DTD contains several tags which would have to be filled with content:
- changefreq: Enumerated list. Valid values are "always", "hourly", "daily",
"weekly", "monthly", "yearly" and "never". Suggestion: Take the current date
minus the date on which the article was created and divide the result by the
number of revisions. A finer solution might be to monitor only the frequency of
edits within the last 2 months of that article, to better reflect "current
event" articles.
- lastmod: Time that the URL was last modified. We already have that information
in the cur table.
- loc: URL for that page. Obvious.
- priority: Optional. The priority of a particular URL relative to other pages
on the same site. The value for this tag is a number between 0.0 and 1.0, where
0.0 identifies the lowest priority page(s) on your site and 1.0 identifies the
highest priority page(s) on your site. It would be simple to give all the
articles a 0.7 and other namespaces significantly lower priorities. One might
consider using a more sophisticated approach based on, for example, the number
of backlinks.
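The changefreq and priority suggestions above could look roughly like the
following Python sketch. The interval thresholds, the 0.3 value for non-article
namespaces and the function names are assumptions made for illustration only;
the report itself only proposes the average-interval idea and the 0.7 value for
articles.

```python
from datetime import datetime, timedelta

# Assumed cut-offs for mapping an average edit interval to a changefreq
# value; the report does not specify any thresholds.
THRESHOLDS = [
    (timedelta(hours=1), "hourly"),
    (timedelta(days=1), "daily"),
    (timedelta(days=7), "weekly"),
    (timedelta(days=31), "monthly"),
]


def changefreq(created, now, revisions):
    """Average time between revisions, mapped to a changefreq value."""
    if revisions < 1:
        raise ValueError("an article has at least one revision")
    interval = (now - created) / revisions
    for limit, label in THRESHOLDS:
        if interval <= limit:
            return label
    return "yearly"


def priority(namespace):
    """0.7 for articles (namespace 0), an assumed 0.3 for everything else."""
    return 0.7 if namespace == 0 else 0.3
```

This sketch never returns "always" or "never". The finer variant would pass
the start of the last two months as `created` and count only the revisions
made since then.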
Version: unspecified
Severity: enhancement
URL: https://www.google.com/webmasters/sitemaps/docs/en/protocol.html