Introduce page creation log
Open, Low, Public

Description

Problem statement

We currently have a "Page deletion log", but not a "Page creation log". This proposal is to add a create/create log type, and log an event with that type upon the creation of a new page (e.g. first revision).
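As a rough illustration of the proposal, the sketch below appends a create/create entry whenever a revision with no parent is saved. It is a toy model in Python, not MediaWiki code: the `LogEntry` class, the `record_revision` function, and the in-memory list are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEntry:
    """Hypothetical stand-in for one row of the logging table."""
    log_type: str
    log_action: str
    user: str
    title: str
    timestamp: str


def record_revision(log: List[LogEntry], title: str, user: str,
                    timestamp: str, parent_id: Optional[int]) -> None:
    """Append a create/create entry only for a page's first revision."""
    if not parent_id:  # None or 0 means the revision has no parent
        log.append(LogEntry("create", "create", user, title, timestamp))


log: List[LogEntry] = []
record_revision(log, "Example", "Alice", "20180103100000", parent_id=0)  # creation
record_revision(log, "Example", "Bob", "20180103110000", parent_id=1)    # later edit
print(len(log), log[0].user)
```

Only the first call produces a log entry; the later edit has a parent revision and is ignored.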

Original description

Adds an article creation log to Special:Log

There was discussion on making a page deletion log, and it came down to a bunch of indexes and such being added and things being changed around and all-around confusion.

Thus, I decided to kill two birds with one stone and write this nifty little gadget. I figured if there was a "Deletion log" there should be a "Creation log" as well. This is a patch against MediaWiki 1.10.0, so that each time someone creates a page, it gets added to the log. This way, if the page gets deleted, it gets redlinked, and if it's alive, it's bluelinked. I figure it's a hack to the concept of a "deleted pages" log, but it's most definitely an enhancement to fishing through revisions to find the original page creator. Anyhooo...

There's a caveat. Since I don't know half of the languages that MW supports, there's going to be a problem. Adding a creation log requires a couple of edits (check out the patch) to add full multi-language support; otherwise, it'll just turn up as "createpg", which is very user-unfriendly. So, right now it only supports English out of the box. Sorry. :(

There's another caveat: it's semi-not backwards compatible to your current database. That is, the patch only works from installation onward in that entries in the creation log will only appear once someone creates a new page after you apply the patch. So, in order to get a full page creation log, either you (or someone else) will need to write a script to add the appropriate entries. Otherwise, it will work fine with your existing installation.

Instructions:

  1. Grab the patch, save it into your brand spankin' new mediawiki root directory.
  2. Run patch -p0 < createpg.patch
  3. If your installation's language is not primarily English, translate the 'createpglogtext', 'createdarticle', and 'createpglogpage' lines of languages/messages/MessagesEn.php into your native language.

    Tested on MediaWiki 1.10.0, php 5.2.3 (fcgi, debug).

    If you have any questions, comments, concerns, or if I totally botched something, please feel free to contact me.

    Cheers,

    Kurt Radwanski irc: slakr@freenode or galaxynet. en.wp: Slakr

    -------

    Attached: See also: T44135: Add page creator index to MediaWiki core

Details

Reference
bz10331

cnit wrote:

In r76679, 'ru' locale, updated from 1.15, when I try to upload the file, the following error is generated:

Обнаружена ошибка синтаксиса запроса к базе данных. Это может означать ошибку в программном обеспечении. Последний запрос к базе данных:

(SQL запрос скрыт)

произошёл из функции «LogPage::saveContent». База данных возвратила ошибку «1054: Unknown column 'log_user_text' in 'field list' (localhost)».

Brief translation:

Database syntax error (SQL query is hidden)

occurred in function "LogPage::saveContent". Database has returned an error "1054: Unknown column 'log_user_text' in 'field list' (localhost)"

Also, I've tried to use $wgDebugDumpSql = true;
in LocalSettings.php, yet the query is hidden anyway.

In the log there is the query:

Query 49 (slave): INSERT /* LogPage::saveContent Syntone */ INTO wiki_logging (log_id,log_type,log_action,log_timestamp,log_user,log_user_text,log_namespace,log_title,log_page,log_comment,log_params) VALUES (NULL,'upload','overwrite','20101115100409','1','Sdv','6','Myfile.jpg','0','Овечки','')

cnit wrote:

Sorry. I just forgot to re-run php update.php after re-importing the dump from 1.15.

Reopening. No obvious fix in place.

  • Bug 42026 has been marked as a duplicate of this bug.
  • Bug 29730 has been marked as a duplicate of this bug.
Restricted Application added a subscriber: Aklapper. · Jan 7 2016, 7:49 PM
daniel added a subscriber: daniel.

I think we should revive this. Should be possible to handle this easily via the logging table and Special:Log.

For the record, we do have log_user_text now, so recording page creation by anons is no longer a problem.

daniel updated the task description. · Feb 4 2016, 11:20 AM
daniel updated the task description.

> For the record, we do have log_user_text now, so recording page creation by anons is no longer a problem.

We also have a "Only show edits that are page creations" checkbox at https://en.wikipedia.org/wiki/Special:Contributions/Daniel now.

daniel added a comment. · Feb 4 2016, 5:36 PM

> We also have a "Only show edits that are page creations" checkbox at https://en.wikipedia.org/wiki/Special:Contributions/Daniel now.

I suppose that relies on rev_parent_id = 0, then. When I try this, I see a lot of page moves.

Hm, I suppose revision tags would also solve this. Creations should just have a "creation" tag.

Yes, page moves are really cluttering that list on my contributions too. For the same reason, it would be handy if creation of redirects could optionally be omitted from the list.

A log seems overkill to me. Do we really want log entries remaining after pages are deleted? A revision tag seems better to me, IMO.

brion added a subscriber: brion. · Mar 2 2016, 10:11 PM

My inclination is that revision tag sounds good too, as long as it doesn't introduce UI clutter. Not sure of the current state of tag visibility in UI.

Taking this on as a shepherd. I want to have an IRC meeting about this soon, to find out if we want this, what speaks against it, and what alternatives we have.

daniel lowered the priority of this task from Normal to Low. · Mar 2 2016, 10:19 PM
Akeron added a subscriber: Akeron. · Aug 29 2016, 10:27 PM

I could easily add a page creation tag as part of https://gerrit.wikimedia.org/r/#/c/194458/ (for T73236).

I don't think this would be really helpful. As a tag, it would effectively be lost after 30 days on any page where it would have more functionality than the new tag. The reason this log entry is wanted is that it is resistant to deletion and, together with the deletion and move log entries, gives a complete summary of the page history.

This would be helpful for contributions, we don't have any built-in way to check non-redirect page creations by a user.
I'm not saying this would replace a log, but this would be easier to implement.

daniel moved this task from Under discussion to Backlog on the TechCom-RFC board. · Feb 8 2017, 9:37 PM

> A log seems overkill to me. Do we really want log entries remaining after pages are deleted? A revision tag seems better to me, IMO.

@Legoktm: That's actually one of the main reasons the community wants a creation log: to be able to easily see who created deleted pages. Right now it shows the deletion, protection, and move logs when you look at a deleted page, but not who created it. This lack of information makes it harder to track down paid PR editors. See the discussion at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#List_of_previous_creators_of_an_article for elaboration.

Change 399897 had a related patch set uploaded (by Kaldari; owner: kaldari):
[mediawiki/core@master] Record a log entry on page creation

https://gerrit.wikimedia.org/r/399897

Even though the log won't be complete and won't solve the "show all pages created by this user ever" issue, I think adding a log entry for new page creations makes sense. The current situation where we log page moves and page deletions (and file uploads), but not page creations, is weird. Page creations are important enough to warrant explicit log entries.

MGChecker added a comment. · Edited · Sun, Dec 24, 12:49 AM

> Even though the log won't be complete and won't solve the "show all pages created by this user ever" issue, (…)

Are you sure it won't? Sure, there won't be any record of past page creations, but from the day this patch is merged onward it would be possible to track all page creations by a specific user with the "executing user" filter, wouldn't it? Am I missing some point here?

Anyone have any idea why my patch causes all the tests in ApiQueryWatchlistIntegrationTest to fail??

@WMDE-leszek fixed it!

@daniel: Do you still want to have an IRC meeting about this? There's clearly desire for this from the community.

daniel added a comment. · Edited · Wed, Jan 3, 8:41 PM

@kaldari If there is commitment (read: resourcing) for this, we can have an RFC about the technical details.

Do I understand correctly that this is half-done? We log page creation, but can't query it efficiently per user? Actually... why can't we? What's missing?

Note that this has consequences for suppression/oversight: if a user creates a page with a title that contains bad/private information, then right now you have to delete the page and suppress the deletion log entry to hide the page name. With this change, you would also have to suppress the creation log entry.

It might be nice to provide a nice UI to do this automatically when a person with suppression rights deletes a page.

Krinkle renamed this task from Introduce article creation log to Introduce page creation log. · Edited · Wed, Jan 3, 10:07 PM
Krinkle removed a project: Patch-For-Review.
Krinkle updated the task description.
Krinkle removed a subscriber: wikibugs-l-list.
Krinkle added a subscriber: Krinkle.

The idea of adding a log event for page creation does not seem particularly cross-cutting from a technical perspective.

In the TechCom meeting today we decided that this is mostly a product decision with various impacts that need to be considered with regard to database performance and community workflows (e.g. deleting a page with a sensitive title would also require suppressing the page-create event), but that's not something TechCom would normally organise or do beforehand. We and/or other developers can do that as part of code review.

Given it was already tagged, however, we decided to proceed with the process.

TechCom proposes an outcome of Approved after a Last Call period of 2 weeks, starting today. Decision to be made on January 17.

Krinkle moved this task from Backlog to Last Call on the TechCom-RFC board. · Wed, Jan 3, 10:07 PM

> In the TechCom meeting today we decided that this is mostly a product decision with various impacts that need to be considered with regard to database performance...

@jcrespo: Any concerns from a DBA perspective? This change would add new entries into the logging table for each page creation. There is no plan to backfill for previous creations, so the database impact would be (roughly):

  • enwiki: +7000-8000 log entries per day
  • eswiki, frwiki, ruwiki, itwiki: +1000-2000 log entries per day
  • most other wikis: +<1000 log entries per day
  • wikidatawiki: ?

The code is here: https://gerrit.wikimedia.org/r/#/c/399897/6/includes/page/WikiPage.php

jcrespo added a comment. · Edited · Thu, Jan 4, 7:57 AM

Please research the size (in bytes) of wikidatawiki in a worst-case scenario, and give the total size (in bytes) of the table for wikidata, commons and enwiki now and in a year, with and without the feature. Please also research the number of extra IOPS on s3 (800 wikis). Tall tables are not an issue; the issue is large ones. We may not have the disk available, or it may become larger than we can handle for future schema changes, requiring logical partitioning. We may or may not have the disk to handle extra write operations.

kaldari added a comment. · Edited · Thu, Jan 4, 8:37 PM

The current sizes for reference:

enwiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |   84080214 |          14876147712 |           29504831488 |                176 |
+-----------+------------+----------------------+-----------------------+--------------------+

commonswiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |  248227776 |          61615374336 |           95083823104 |                248 |
+-----------+------------+----------------------+-----------------------+--------------------+

wikidatawiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |  606370402 |         107029200896 |          152214437888 |                176 |
+-----------+------------+----------------------+-----------------------+--------------------+

With those sizes, if they are proportional, it means there will be 70000 new records per day on wikidata. I am not sure about the others, but I am sure wikidata logging-related queries will stop working or the servers will run out of space and require extra provisioning.

I still have no data from s3, which would be worrying in terms of new iops.

kaldari added a comment. · Edited · Fri, Jan 5, 10:50 PM

> With those sizes, if they are proportional, it means there will be 70000 new records per day on wikidata. I am not sure about the others, but I am sure wikidata logging-related queries will stop working or the servers will run out of space and require extra provisioning.

Yeah, I'm pretty amazed how big the wikidata table is (especially since it's only 5 years old). Your estimate of 70,000 per day sounds like a reasonable ball park. I'll need to get access to the analytics server to give you a more exact number, so stay tuned. Given the astronomical growth of the wikidata logging table, do you think it's going to run out of space regardless? i.e. should we consider removing some logging options there and/or pruning the existing table? I have no idea why it is so huge and it already seems to be difficult to query. Even running a simple indexed timestamp query can take several minutes!

wikiadmin@db1106(wikidatawiki)>explain select * from logging where log_timestamp > 20151026200509 LIMIT 1;
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows      | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
|    1 | SIMPLE      | logging | ALL  | times         | NULL | NULL    | NULL | 606682265 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)

wikiadmin@db1106(wikidatawiki)>select * from logging where log_timestamp > 20151026200509 LIMIT 1;
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
| log_id    | log_type | log_action | log_timestamp  | log_user | log_user_text | log_namespace | log_title | log_page | log_comment | log_params                                                                      | log_deleted |
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
| 262985215 | patrol   | patrol     | 20151026200510 |   201281 | Liangent-bot  |             0 | Q14279462 | 15947340 |             | a:3:{s:8:"4::curid";i:262851054;s:9:"5::previd";i:130223534;s:7:"6::auto";i:1;} |           0 |
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
1 row in set (2 min 29.59 sec)

It would be good to see which types of logging actions are most heavily represented there, but I'm scared to try running any more complicated queries against it.
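For reference, the breakdown asked about here is a plain GROUP BY over log_type. The sketch below runs it against a tiny in-memory SQLite table with invented rows, only to show the shape of the query; on a production logging table of this size it would presumably be a full scan and would need to run somewhere that can tolerate it.

```python
import sqlite3

# Toy stand-in for the logging table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logging ("
    "log_id INTEGER PRIMARY KEY, log_type TEXT, log_action TEXT)"
)
sample = [("patrol", "patrol")] * 8 + [("delete", "delete"), ("move", "move")]
conn.executemany(
    "INSERT INTO logging (log_type, log_action) VALUES (?, ?)", sample
)

# Count entries per log type, most frequent first.
rows = conn.execute(
    "SELECT log_type, COUNT(*) AS n FROM logging "
    "GROUP BY log_type ORDER BY n DESC, log_type"
).fetchall()

for log_type, n in rows:
    print(log_type, n)
```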

> I still have no data from s3, which would be worrying in terms of new iops.

There's a bit of a chicken and egg problem here. It's not easy to get this data without the log already existing. I'm going to try to get access to the analytics server and see if I can track down some relevant EventLogging data there.

kaldari added a comment. · Edited · Fri, Jan 5, 11:38 PM

Here's the average number of extra logging table insertions we could expect on enwiki, commons, and wikidata:

  • enwiki: 6952 inserts/day
  • commonswiki: 15,511 inserts/day
  • wikidatawiki: 76,316 inserts/day

Getting the data for all of s3 will take a while...
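These insert rates can be combined with the table sizes posted earlier in the thread to get a back-of-the-envelope yearly growth figure. The sketch below assumes, as a simplification, that each new row costs the same data plus index bytes as the current average row; the `LoggingStats` class and `yearly_growth` function are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggingStats:
    table_rows: int
    data_length: int      # bytes
    index_length: int     # bytes
    inserts_per_day: int  # estimated extra creation log entries


# Figures copied from the comments in this thread.
STATS = {
    "enwiki": LoggingStats(84080214, 14876147712, 29504831488, 6952),
    "commonswiki": LoggingStats(248227776, 61615374336, 95083823104, 15511),
    "wikidatawiki": LoggingStats(606370402, 107029200896, 152214437888, 76316),
}


def yearly_growth(s: LoggingStats, days: int = 365) -> dict:
    """Estimate one year of extra rows and bytes from creation logging."""
    new_rows = s.inserts_per_day * days
    # Assumption: new rows cost the same data + index bytes as existing ones.
    bytes_per_row = (s.data_length + s.index_length) / s.table_rows
    return {
        "new_rows": new_rows,
        "extra_bytes": round(new_rows * bytes_per_row),
        "relative_growth": new_rows / s.table_rows,
    }


for wiki, stats in STATS.items():
    g = yearly_growth(stats)
    print(f"{wiki}: +{g['new_rows']:,} rows/year, "
          f"~{g['extra_bytes'] / 1e9:.1f} GB, "
          f"{g['relative_growth']:.1%} of current rows")
```

Under these assumptions, the new entries would add roughly 2 to 5 percent to each table's current row count per year. This says nothing about IOPS or about s3, which are still open questions.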

MusikAnimal added a comment. · Edited · Sun, Jan 7, 1:21 PM

It'd be nice to also log pages created from a redirect, like the PageTriage extension does. Currently XTools and similar tools are unable to report these creations. It is up to the user to manually keep track of them. I'm not sure how many more inserts a day that would result in, probably not that much for Wikidata and Commons but on enwiki this scenario is common.

jcrespo added a comment. · Edited · Mon, Jan 8, 9:04 AM

@kaldari Have you talked to the people wanting this feature? Maybe the people that want it want it only for "frwiki" and "enwikivoyage", and not for wikidata, so it can be enabled only on wikis where it is required. Maybe people only want page creations with certain restrictions, and not "all pages created, period", and the logging can be trimmed somehow; for example, maybe a flag/row on recentchanges can be added instead, so we only get the new pages in the last month. Basically the question is, what is the "user story"? I am trying to comprehend that to provide the best idea for the implementation.

Note this sounds very similar to "adding wikidata to recentchanges", which without thinking, ended up filling up 90% of recentchanges rows on many wikis and causing watchlist and recentchanges issues- that is why I am asking what is the final goal- logging is a heavily indexed table with a lot of storage overhead. And of course I am not saying this cannot be done, I am just saying we need more information to know which is the best way to do it- otherwise, if logging should grow no matter what, we should be starting to work on a logical partitioning/sharding framework for mediawiki (or probably, integrating an existing one like http://vitess.io/ ).

kaldari added a comment. · Edited · Mon, Jan 8, 10:10 PM

@jcrespo: There was a discussion on English Wikipedia at the village pump. The main use case was to help identify PR/paid accounts and sockpuppets/long-term abuse. I don't think this would be as much of an issue on Wikidata, as no one creates Wikidata items for PR or SEO purposes (and rarely for abuse/vandalism). I imagine it would also be useful on Commons for identifying repeat copyvio offenders. Thus I don't think a flag in recent changes would meet the use case. I'm all for trimming the logging, but I'm also wondering which logs would make the most sense to trim. Like I wonder if the gazillion logs in Wikidata are all just automatic patrol actions by auto-patrolled bots, in which case we could probably just delete all the entries that are older than 30 days and reduce the size of the table by 90% (this is just a guess though).

kaldari added a comment. · Edited · Mon, Jan 8, 10:19 PM

Yeah, it looks like the vast majority of log actions are patrolling:

wikiadmin@db1090(itwiki)>select count(*) from logging;
+----------+
| count(*) |
+----------+
| 48016968 |
+----------+
1 row in set (8.85 sec)

wikiadmin@db1090(itwiki)>select count(*) from logging where log_type = 'patrol';
+----------+
| count(*) |
+----------+
| 43029327 |
+----------+
1 row in set (11.29 sec)

Which is dumb, since that data is only useful for 30 days AFAIK, and is already stored as a flag in recent changes. Why do we even log patrol actions??
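To put a number on "vast majority", the share can be worked out from the two itwiki counts above. This is plain arithmetic on the posted figures, with variable names chosen for this example.

```python
# Counts copied from the two itwiki queries above.
total_rows = 48016968
patrol_rows = 43029327

patrol_share = patrol_rows / total_rows
remaining_rows = total_rows - patrol_rows

print(f"patrol share: {patrol_share:.1%}, other entries: {remaining_rows:,}")
```

That is just under 90% for itwiki, in line with the earlier guess; whether Wikidata shows the same ratio is not established by these queries.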

@kaldari I see a relatively "simple" solution (at least for the short term): log creation events only when a page has been deleted. If the page exists and has not been deleted, we can gather that from the smaller "page" table (or revision?); if the page has been deleted afterwards, it will be on logging, added only when it is deleted. This is far from ideal, and it requires 2 queries, one to page and one to logging, but it would avoid information duplication while making it "easier" to track vandals' actions (logging pages that have been deleted afterwards). Do you think that, or that with some changes, would be interesting? Most pages will never be deleted, so it could work?
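A minimal sketch of the two-step lookup this proposal implies, assuming the creator of a live page can be read from the page/revision data and that a creation entry is written to the log only at deletion time. The dictionaries and function names are hypothetical stand-ins, not MediaWiki APIs, and the sketch keeps only one creator per title.

```python
from typing import Dict, Optional

# title -> creator; stands in for what page/revision can answer for live pages
live_pages: Dict[str, str] = {"Existing page": "Alice"}
# title -> creator; stands in for creation entries written to logging on delete
creation_log: Dict[str, str] = {}


def delete_page(title: str) -> None:
    """Delete a page and, only now, record who created it."""
    creator = live_pages.pop(title)
    creation_log[title] = creator


def find_creator(title: str) -> Optional[str]:
    """First query: live pages. Second query: the deletion-time log."""
    if title in live_pages:
        return live_pages[title]
    return creation_log.get(title)


delete_page("Existing page")
print(find_creator("Existing page"))  # prints Alice
```

The creator stays discoverable after deletion, at the cost of the second lookup and of creation entries only appearing after the fact.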

@jcrespo That's definitely an interesting idea! It would solve the main use case, although it might be a bit confusing having creation logs only show up after the fact. Certainly worth considering though.

In my opinion, it should be possible to look at the logs (i.e., Special:Log/Page_title) of a wiki page in MediaWiki and see a chronology of "major" actions taken to the page. For a standard page, this would include page creation, page renaming, page protections, and page patrolling. For certain pages, this would also include page deletion. We're already doing most of this logging, we're just not including page creation in the logs, somewhat inexplicably. I think we should address this omission in this task.

The issue of Wikidata's logging table mentioned in T12331#3884542 is very interesting, but seems pretty off-topic here. A separate task to discuss whether auto-patrol logs on Wikidata are needed would be nice.

> The issue of Wikidata's logging table mentioned in T12331#3884542 is very interesting, but seems pretty off-topic here. A separate task to discuss whether auto-patrol logs on Wikidata are needed would be nice.

I created a separate task at T184485.

@MZMcBride As long as it appears on Special:Log/Page_title, things are OK. My proposal is compatible with that; it would only affect how things are stored internally at the database level. Nobody here is discussing what should happen, but how it should be implemented internally. The trivial way, adding more records every time, may break other features or, literally, we may not have the resources, which means we would need to buy newer machines just to implement this, at least for wikidatawiki and/or s3. A smarter implementation, which could have the same user-facing results, could be more efficient, allowing faster operation when we query those resources. Or it could be enabled on only some wikis (e.g. skip wikidatawiki) until we have the money and time to purchase those resources. That is the discussion here: implementing things so we do not break existing features, something that has happened in the past because of not being careful.

jcrespo added a comment. · Edited · Tue, Jan 9, 7:50 AM

@MZMcBride If your answer is related to my question of "what is the user story?", we need to dig deeper. Do you need every single wiki to do that, right now? Which wikis need it earlier? For the smaller wikis that is very easy; for the larger ones we may hit a performance barrier. Can we cover 90% of the needs quickly so we do not have to wait for more available resources? Is it needed for vandalism fighting? Maybe an equivalent functionality can be implemented that provides equivalent (or even more useful) information while having a lower performance hit. Those are the kinds of questions we need to ask ourselves and each other.

I think I can answer a few of those questions... There's no pressing need for page creation logs. We've lived without them since the beginning of Wikipedia and waiting a bit longer won't kill anyone. The Wikipedias have a pretty strong use case. Commons has a pretty strong use case. The other projects' use cases aren't as strong, but it would still be useful (for example, quickly looking up all the pages you've created or figuring out how many pages were created on a certain day). Since page creation is such an important action, it just makes sense for it to be in the logs. Personally, I favor fixing T184485 and then just adding page creation logging (in a straightforward implementation) to all the wikis. But if it looks like that's not going to happen, I would want to consider your alternative proposal more seriously.

Knowing that, I would suggest implementing it conditionally, and we configure pretty much every wiki except enwiki, commonswiki, and wikidatawiki (obviously with an escalated deployment, to check for regressions). These three would need more thinking and care due to their edit volume, but that could be done later.
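A conditional rollout of this kind could be as simple as a per-wiki switch. The sketch below is a Python illustration only; the `DEFERRED_WIKIS` set and `creation_log_enabled` function are hypothetical names and do not correspond to an existing MediaWiki setting.

```python
# Wikis where creation logging would be held back pending more analysis.
DEFERRED_WIKIS = frozenset({"enwiki", "commonswiki", "wikidatawiki"})


def creation_log_enabled(wiki_id: str) -> bool:
    """Return True when the creation log should be written for this wiki."""
    return wiki_id not in DEFERRED_WIKIS


for wiki in ("frwiki", "enwikivoyage", "wikidatawiki"):
    print(wiki, creation_log_enabled(wiki))
```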

Given that any idiot vandal can come along and permanently add multiple rows to the revision table (and thousands of them regularly do!) and given that we're already dealing with other massively large database tables such as pagelinks or categorylinks, it's pretty difficult for me to care about logging growing very moderately to include page creations. I understand and appreciate that disk space and other resources are finite and that large tables can require more maintenance, but this seems like a particularly arbitrary place to draw a line.

> In my opinion, it should be possible to look at the logs (i.e., Special:Log/Page_title) of a wiki page in MediaWiki and see a chronology of "major" actions taken to the page. For a standard page, this would include page creation, page renaming, page protections, and page patrolling. For certain pages, this would also include page deletion. We're already doing most of this logging, we're just not including page creation in the logs, somewhat inexplicably. I think we should address this omission in this task.

+1. The creation log is the last missing piece to a coherent persistent page history sketch.

> @MZMcBride If your answer is related to my question of "what is the user story?", we need to dig deeper. Do you need every single wiki to do that, right now? Which wikis need it earlier? For the smaller wikis that is very easy; for the larger ones we may hit a performance barrier. Can we cover 90% of the needs quickly so we do not have to wait for more available resources? Is it needed for vandalism fighting? Maybe an equivalent functionality can be implemented that provides equivalent (or even more useful) information while having a lower performance hit. Those are the kinds of questions we need to ask ourselves and each other.

To be honest, I really don't like the idea that there are common use case core logs that exist only in some installations of MediaWiki of a given version.