Introduce page creation log
Closed, Resolved (Public)

Description

Problem statement

We currently have a "Page deletion log", but not a "Page creation log". This proposal is to add a create/create log type, and log an event with that type upon the creation of a new page (e.g. first revision).

Original description

Adds an article creation log to Special:Log

There was discussion on making a page deletion log, and it came down to a bunch of indexes and such being added, things being changed around, and all-around confusion.

Thus, I decided to kill two birds with one stone and write this nifty little gadget. I figured if there was a "Deletion log" there should be a "Creation log" as well. This is a patch against MediaWiki 1.10.0, so that each time someone creates a page, it gets added to the log. This way, if the page gets deleted, it gets redlinked, and if it's alive, it's bluelinked. I figure it's a hack to the concept of a "deleted pages" log, but it's most definitely an enhancement to fishing through revisions to find the original page creator. Anyhooo...

There's a caveat. Since I don't know half of the languages that MW supports, there's going to be a problem. Adding a creation log requires a couple of edits (check out the patch) to add full multi-language support; otherwise, it'll just turn up as "createpg", which is very user-unfriendly. So, right now it only supports English out of the box. Sorry. :(

There's another caveat: it's semi-not backwards compatible to your current database. That is, the patch only works from installation onward in that entries in the creation log will only appear once someone creates a new page after you apply the patch. So, in order to get a full page creation log, either you (or someone else) will need to write a script to add the appropriate entries. Otherwise, it will work fine with your existing installation.
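A backfill script of the kind mentioned above could be sketched roughly as follows. This is only an illustration: the `Revision` and `LogEntry` shapes are hypothetical stand-ins rather than MediaWiki's actual schema, and it covers only the revisions handed to it (revisions of already-deleted pages would have to be supplied separately).

```python
from typing import Iterable, NamedTuple


class Revision(NamedTuple):
    page_title: str
    user: str
    timestamp: str  # YYYYMMDDHHMMSS, so string order is chronological


class LogEntry(NamedTuple):
    log_type: str
    title: str
    user: str
    timestamp: str


def backfill_creation_entries(revisions: Iterable[Revision]) -> list[LogEntry]:
    """Return one creation entry per page, taken from its earliest revision."""
    first: dict[str, Revision] = {}
    for rev in revisions:
        seen = first.get(rev.page_title)
        if seen is None or rev.timestamp < seen.timestamp:
            first[rev.page_title] = rev
    # "createpg" is the log type name used by the patch described above.
    return [
        LogEntry("createpg", r.page_title, r.user, r.timestamp)
        for r in sorted(first.values(), key=lambda r: r.timestamp)
    ]
```

The resulting entries would still need to be written into the log table by whatever means the installation uses.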

Instructions:

  1. Grab the patch, save it into your brand spankin' new mediawiki root directory.
  2. Run patch -p0 < createpg.patch
  3. If your installation's language is not primarily English, translate the 'createpglogtext', 'createdarticle', and 'createpglogpage' lines of languages/messages/MessagesEn.php into your native language.

    Tested on MediaWiki 1.10.0, php 5.2.3 (fcgi, debug).

    If you have any questions, comments, concerns, or if I totally botched something, please feel free to contact me.

    Cheers,

    Kurt Radwanski irc: slakr@freenode or galaxynet. en.wp: Slakr

    -------

    Attached: See also: T44135: Add page creator index to MediaWiki core

Details

Reference
bz10331
Akeron added a subscriber: Akeron.Aug 29 2016, 10:27 PM

I could easily add a page creation tag as part of https://gerrit.wikimedia.org/r/#/c/194458/ (for T73236).

I think this wouldn't be really helpful. As a tag, they would effectively be lost after 30 days in any page where they would have more functionality than the new tag. The reason why this log entry is wanted is that it is resistant to deletion and, together with the deletion and move log entries, gives a complete summary of the page history.

This would be helpful for contributions, we don't have any built-in way to check non-redirect page creations by a user.
I'm not saying this would replace a log, but this would be easier to implement.

daniel moved this task from Under discussion to Backlog on the TechCom-RFC board.Feb 8 2017, 9:37 PM

A log seems overkill to me. Do we really want log entries remaining after pages are deleted? A revision tag seems better IMO.

@Legoktm: That's actually one of the main reasons the community wants a creation log: to be able to easily see who created deleted pages. Right now it shows the deletion, protection, and move logs when you look at a deleted page, but not who created it. This lack of information makes it harder to track down paid PR editors. See the discussion at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#List_of_previous_creators_of_an_article for elaboration.

Change 399897 had a related patch set uploaded (by Kaldari; owner: kaldari):
[mediawiki/core@master] Record a log entry on page creation

https://gerrit.wikimedia.org/r/399897

Even though the log won't be complete and won't solve the "show all pages created by this user ever" issue, I think adding a log entry for new page creations makes sense. The current situation where we log page moves and page deletions (and file uploads), but not page creations, is weird. Page creations are important enough to warrant explicit log entries.

MGChecker added a comment.EditedDec 24 2017, 12:49 AM

Even though the log won't be complete and won't solve the "show all pages created by this user ever" issue, (…)

Are you sure it won't? Sure, there won't be any record of past page creations, but from the day this patch is merged it would be possible to track all page creations by a specific user with the "executing user" filter, wouldn't it? Am I missing some point here?

Anyone have any idea why my patch causes all the tests in ApiQueryWatchlistIntegrationTest to fail??

@WMDE-leszek fixed it!

@daniel: Do you still want to have an IRC meeting about this? There's clearly desire for this from the community.

daniel added a comment.EditedJan 3 2018, 8:41 PM

@kaldari If there is commitment (read: resourcing) for this, we can have an RFC about the technical details.

Do I understand correctly that this is half-done? We log page creation, but can't query it efficiently per user? Actually... why can't we? What's missing?

Note that this has consequences for suppression/oversight: if a user creates a page with a title that contains bad/private information, then right now you have to delete the page and suppress the deletion log entry to hide the page name. With this change, you would also have to suppress the creation log entry.

It might be nice to provide a nice UI to do this automatically when a person with suppression rights deletes a page.

Krinkle renamed this task from Introduce article creation log to Introduce page creation log.EditedJan 3 2018, 10:07 PM
Krinkle removed a project: Patch-For-Review.
Krinkle updated the task description.
Krinkle removed a subscriber: wikibugs-l-list.
Krinkle added a subscriber: Krinkle.

The idea of adding a log event for page creation does not seem particularly cross-cutting from a technical perspective.

In the TechCom meeting today we decided that this is mostly a product decision with various impacts that need to be considered with regards to database performance and community workflows (e.g. deleting a sensitive page title, would also require suppression of the page-create event), but that's not something TechCom would normally organise or do beforehand. We and/or other developers, can do that as part of code review.

Given it was already tagged, however, we decided to proceed with the process.

TechCom proposes an outcome of Approved after a Last Call period of 2 weeks, starting today. Decision to be made on January 17.

Krinkle moved this task from Backlog to Last Call on the TechCom-RFC board.Jan 3 2018, 10:07 PM

In the TechCom meeting today we decided that this is mostly a product decision with various impacts that need to be considered with regards to database performance...

@jcrespo: Any concerns from a DBA perspective? This change would add new entries into the logging table for each page creation. There is no plan to backfill for previous creations, so the database impact would be (roughly):

  • enwiki: +7000-8000 log entries per day
  • eswiki, frwiki, ruwiki, itwiki: +1000-2000 log entries per day
  • most other wikis: +<1000 log entries per day
  • wikidatawiki: ?

The code is here: https://gerrit.wikimedia.org/r/#/c/399897/6/includes/page/WikiPage.php

jcrespo added a comment.EditedJan 4 2018, 7:57 AM

Please research the size (in bytes) of wikidatawiki in a worst-case scenario, and give the total size (in bytes) of the table for wikidata, commons and enwiki now and in a year, with and without the feature. Please also research the number of extra IOPS on s3 (800 wikis). Tall tables are not an issue; the issue is large ones. We may not have the disk available, or the table may become larger than we can handle for future schema changes, requiring logical partitioning. We may or may not have the disk to handle extra write operations.

kaldari added a comment.EditedJan 4 2018, 8:37 PM

The current sizes for reference:

enwiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |   84080214 |          14876147712 |           29504831488 |                176 |
+-----------+------------+----------------------+-----------------------+--------------------+

commonswiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |  248227776 |          61615374336 |           95083823104 |                248 |
+-----------+------------+----------------------+-----------------------+--------------------+

wikidatawiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |  606370402 |         107029200896 |          152214437888 |                176 |
+-----------+------------+----------------------+-----------------------+--------------------+

With those sizes, if they are proportional, it means there will be 70000 new records per day on wikidata. I am not sure about the others, but I am sure wikidata logging-related queries will stop working or the servers will run out of space and require extra provisioning.

I still have no data from s3, which would be worrying in terms of new iops.

kaldari added a comment.EditedJan 5 2018, 10:50 PM

With those sizes, if they are proportional, it means there will be 70000 new records per day on wikidata. I am not sure about the others, but I am sure wikidata logging-related queries will stop working or the servers will run out of space and require extra provisioning.

Yeah, I'm pretty amazed how big the wikidata table is (especially since it's only 5 years old). Your estimate of 70,000 per day sounds like a reasonable ball park. I'll need to get access to the analytics server to give you a more exact number, so stay tuned. Given the astronomical growth of the wikidata logging table, do you think it's going to run out of space regardless? i.e. should we consider removing some logging options there and/or pruning the existing table? I have no idea why it is so huge and it already seems to be difficult to query. Even running a simple indexed timestamp query can take several minutes!

wikiadmin@db1106(wikidatawiki)>explain select * from logging where log_timestamp > 20151026200509 LIMIT 1;
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows      | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
|    1 | SIMPLE      | logging | ALL  | times         | NULL | NULL    | NULL | 606682265 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)

wikiadmin@db1106(wikidatawiki)>select * from logging where log_timestamp > 20151026200509 LIMIT 1;
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
| log_id    | log_type | log_action | log_timestamp  | log_user | log_user_text | log_namespace | log_title | log_page | log_comment | log_params                                                                      | log_deleted |
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
| 262985215 | patrol   | patrol     | 20151026200510 |   201281 | Liangent-bot  |             0 | Q14279462 | 15947340 |             | a:3:{s:8:"4::curid";i:262851054;s:9:"5::previd";i:130223534;s:7:"6::auto";i:1;} |           0 |
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
1 row in set (2 min 29.59 sec)

It would be good to see which types of logging actions are most heavily represented there, but I'm scared to try running any more complicated queries against it.

I still have no data from s3, which would be worrying in terms of new iops.

There's a bit of a chicken and egg problem here. It's not easy to get this data without the log already existing. I'm going to try to get access to the analytics server and see if I can track down some relevant EventLogging data there.

kaldari added a comment.EditedJan 5 2018, 11:38 PM

Here's the average number of extra logging table insertions we could expect on enwiki, commons, and wikidata:

  • enwiki: 6952 inserts/day
  • commonswiki: 15,511 inserts/day
  • wikidatawiki: 76,316 inserts/day

Getting the data for all of s3 will take a while...
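
For reference, the yearly impact of those insert rates can be estimated with a back-of-the-envelope calculation like the one below. It is only a sketch: it assumes new rows have the same average row length as the current table (figures from the size tables earlier in the thread) and ignores index overhead, which for these tables is currently larger than the data itself.

```python
# Figures copied from this thread: current logging-table stats and the
# estimated extra inserts per day from page creations.
WIKIS = {
    "enwiki": {"rows": 84_080_214, "avg_row_bytes": 176, "inserts_per_day": 6_952},
    "commonswiki": {"rows": 248_227_776, "avg_row_bytes": 248, "inserts_per_day": 15_511},
    "wikidatawiki": {"rows": 606_370_402, "avg_row_bytes": 176, "inserts_per_day": 76_316},
}


def yearly_growth(stats, days=365):
    """Estimate added rows, added data bytes, and relative row growth."""
    new_rows = stats["inserts_per_day"] * days
    return {
        "new_rows": new_rows,
        # Assumption: new rows are as large as the current average row.
        "data_bytes": new_rows * stats["avg_row_bytes"],
        "row_growth": new_rows / stats["rows"],
    }


for name, stats in WIKIS.items():
    g = yearly_growth(stats)
    print(
        f"{name}: {g['new_rows']:,} rows, "
        f"{g['data_bytes'] / 1e9:.2f} GB data, "
        f"{g['row_growth']:.1%} more rows"
    )
```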

MusikAnimal added a comment.EditedJan 7 2018, 1:21 PM

It'd be nice to also log pages created from a redirect, like the PageTriage extension does. Currently XTools and similar tools are unable to report these creations. It is up to the user to manually keep track of them. I'm not sure how many more inserts a day that would result in, probably not that much for Wikidata and Commons but on enwiki this scenario is common.

jcrespo added a comment.EditedJan 8 2018, 9:04 AM

@kaldari Have you talked to the people wanting this feature? Maybe the people that want it want it only for "frwiki" and "enwikivoyage", and not for wikidata, so it can be enabled only on the wikis where it is required. Maybe people only want page creations with certain restrictions, and not "all pages created, period", and the logging can be trimmed somehow; for example, maybe a flag/row on recentchanges can be added instead, so we only get the new pages in the last month. Basically the question is, what is the "user story"? I am trying to comprehend that to provide the best idea for the implementation.

Note this sounds very similar to "adding wikidata to recentchanges", which without thinking, ended up filling up 90% of recentchanges rows on many wikis and causing watchlist and recentchanges issues- that is why I am asking what is the final goal- logging is a heavily indexed table with a lot of storage overhead. And of course I am not saying this cannot be done, I am just saying we need more information to know which is the best way to do it- otherwise, if logging should grow no matter what, we should be starting to work on a logical partitioning/sharding framework for mediawiki (or probably, integrating an existing one like http://vitess.io/ ).

kaldari added a comment.EditedJan 8 2018, 10:10 PM

@jcrespo: There was a discussion on English Wikipedia at the village pump. The main use case was to help identify PR/paid accounts and sockpuppets/long-term abuse. I don't think this would be as much of an issue on Wikidata, as no one creates Wikidata items for PR or SEO purposes (and rarely for abuse/vandalism). I imagine it would also be useful on Commons for identifying repeat copyvio offenders. Thus I don't think a flag in recent changes would meet the use case. I'm all for trimming the logging, but I'm also wondering which logs would make the most sense to trim. Like I wonder if the gazillion logs in Wikidata are all just automatic patrol actions by auto-patrolled bots, in which case we could probably just delete all the entries that are older than 30 days and reduce the size of the table by 90% (this is just a guess though).

kaldari added a comment.EditedJan 8 2018, 10:19 PM

Yeah, it looks like the vast majority of log actions are patrolling:

wikiadmin@db1090(itwiki)>select count(*) from logging;
+----------+
| count(*) |
+----------+
| 48016968 |
+----------+
1 row in set (8.85 sec)

wikiadmin@db1090(itwiki)>select count(*) from logging where log_type = 'patrol';
+----------+
| count(*) |
+----------+
| 43029327 |
+----------+
1 row in set (11.29 sec)

Which is dumb, since that data is only useful for 30 days AFAIK, and is already stored as a flag in recent changes. Why do we even log patrol actions??
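
The share implied by the two counts above (itwiki only) works out as follows. Note this is an upper bound on what pruning could save, since the 30-day cutoff would keep recent patrol rows.

```python
total_rows = 48_016_968   # select count(*) from logging
patrol_rows = 43_029_327  # ... where log_type = 'patrol'

patrol_share = patrol_rows / total_rows
non_patrol_rows = total_rows - patrol_rows

print(f"patrol share: {patrol_share:.1%}")
print(f"non-patrol rows: {non_patrol_rows:,}")
```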

@kaldari I see a relatively "simple" solution (at least for the short term): log creation events only when a page has been deleted. If the page exists and has not been deleted, we can gather that from the smaller "page" table (or revision?); if the page has been deleted afterwards, it will be in logging, added only when it is deleted. This is far from ideal, and it requires 2 queries (to page and to logging), but it would avoid information duplication while making it "easier" to track vandals' actions (logging pages that have been deleted afterwards). Do you think that, or that with some changes, would be interesting? Most pages will never be deleted, so it could work?
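
A minimal sketch of that "log the creation only at deletion time" idea, using an in-memory SQLite database. The two-table schema here is a deliberately simplified, hypothetical stand-in (the real page table does not store the creator, which would come from the first revision); it only illustrates the two-query lookup.

```python
import sqlite3


def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE page (title TEXT PRIMARY KEY, creator TEXT, created TEXT);
        CREATE TABLE logging (title TEXT, action TEXT, actor TEXT, ts TEXT);
        """
    )
    return db


def create_page(db, title, user, ts):
    # No log row is written at creation time under this scheme.
    db.execute("INSERT INTO page VALUES (?, ?, ?)", (title, user, ts))


def delete_page(db, title, admin, ts):
    row = db.execute(
        "SELECT creator, created FROM page WHERE title = ?", (title,)
    ).fetchone()
    if row is None:
        raise KeyError(title)
    creator, created = row
    # Copy the creation record into the log only now, as the page goes away.
    db.execute(
        "INSERT INTO logging VALUES (?, 'create', ?, ?)", (title, creator, created)
    )
    db.execute("INSERT INTO logging VALUES (?, 'delete', ?, ?)", (title, admin, ts))
    db.execute("DELETE FROM page WHERE title = ?", (title,))


def find_creators(db, title):
    # Query 1: the live page, if any.
    live = db.execute(
        "SELECT creator, created FROM page WHERE title = ?", (title,)
    ).fetchall()
    # Query 2: creation rows logged for earlier, deleted incarnations.
    logged = db.execute(
        "SELECT actor, ts FROM logging WHERE title = ? AND action = 'create'",
        (title,),
    ).fetchall()
    return sorted(logged + live, key=lambda r: r[1])
```

The trade-off is visible in `find_creators`: the user-facing answer is the same as with a full creation log, but it takes two lookups instead of one.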

@jcrespo That's definitely an interesting idea! It would solve the main use case, although it might be a bit confusing having creation logs only show up after the fact. Certainly worth considering though.

In my opinion, it should be possible to look at the logs (i.e., Special:Log/Page_title) of a wiki page in MediaWiki and see a chronology of "major" actions taken to the page. For a standard page, this would include page creation, page renaming, page protections, and page patrolling. For certain pages, this would also include page deletion. We're already doing most of this logging, we're just not including page creation in the logs, somewhat inexplicably. I think we should address this omission in this task.

The issue of Wikidata's logging table mentioned in T12331#3884542 is very interesting, but seems pretty off-topic here. A separate task to discuss whether auto-patrol logs on Wikidata are needed would be nice.

The issue of Wikidata's logging table mentioned in T12331#3884542 is very interesting, but seems pretty off-topic here. A separate task to discuss whether auto-patrol logs on Wikidata are needed would be nice.

I created a separate task at T184485.

@MZMcBride As long as it appears on Special:Log/Page_title, things are OK. My proposal is compatible with that; it would only affect how things are stored internally at the database level. Nobody here is discussing what should happen, but how that should be implemented internally. The trivial way (more records every time) may break other features or, literally, we may not have the resources, which means we need to buy newer machines just to implement this, at least for wikidatawiki and/or s3. A smarter implementation, which could have the same user-facing results, could be more efficient, allowing faster operation when we query those resources. Or it could be enabled on only some wikis (e.g. skip wikidatawiki) until we have the money and time to purchase those resources. That is the discussion here: implementing things so we do not break existing features, something that has happened in the past for not being careful.

jcrespo added a comment.EditedJan 9 2018, 7:50 AM

@MZMcBride If your answer is related to my question of "what is the user story?", we need to dig deeper. Do you need every single wiki to do that, right now? Which wikis need that earlier? For the smaller wikis that is very easy; for the larger ones we may hit a performance barrier. Can we cover 90% of the needs quickly so we do not have to wait for more available resources? Is it needed for vandalism fighting? Maybe an equivalent functionality can be implemented that provides equivalent (or even more useful) information while having a lower performance hit. Those are the kinds of questions we need to ask ourselves and each other.

I think I can answer a few of those questions... There's no pressing need for page creation logs. We've lived without them since the beginning of Wikipedia and waiting a bit longer won't kill anyone. The Wikipedias have a pretty strong use case. Commons has a pretty strong use case. The other projects' use cases aren't as strong, but it would still be useful (for example, quickly looking up all the pages you've created or figuring out how many pages were created on a certain day). Since page creation is such an important action, it just makes sense for it to be in the logs. Personally, I favor fixing T184485 and then just adding page creation logging (in a straightforward implementation) to all the wikis. But if it looks like that's not going to happen, I would want to consider your alternative proposal more seriously.

Knowing that, I would suggest implementing it conditionally, and we configure pretty much every wiki except enwiki, commonswiki, and wikidatawiki (obviously with an escalated deployment, to check for regressions). These three would need more thinking and care due to their edit volume, but that could be done later.

Given that any idiot vandal can come along and permanently add multiple rows to the revision table (and thousands of them regularly do!) and given that we're already dealing with other massively large database tables such as pagelinks or categorylinks, it's pretty difficult for me to care about logging growing very moderately to include page creations. I understand and appreciate that disk space and other resources are finite and that large tables can require more maintenance, but this seems like a particularly arbitrary place to draw a line.

In my opinion, it should be possible to look at the logs (i.e., Special:Log/Page_title) of a wiki page in MediaWiki and see a chronology of "major" actions taken to the page. For a standard page, this would include page creation, page renaming, page protections, and page patrolling. For certain pages, this would also include page deletion. We're already doing most of this logging, we're just not including page creation in the logs, somewhat inexplicably. I think we should address this omission in this task.

+1. The creation log is the last missing piece to a coherent persistent page history sketch.

@MZMcBride If your answer is related to my question of "what is the user story?", we need to dig deeper. Do you need every single wiki to do that, right now? Which wikis need that earlier? For the smaller wikis that is very easy; for the larger ones we may hit a performance barrier. Can we cover 90% of the needs quickly so we do not have to wait for more available resources? Is it needed for vandalism fighting? Maybe an equivalent functionality can be implemented that provides equivalent (or even more useful) information while having a lower performance hit. Those are the kinds of questions we need to ask ourselves and each other.

To be honest, I really don't like the idea that there are common use case core logs that exist only in some installations of MediaWiki of a given version.

As quite a few issues were raised during the Last Call period of this RFC, it is not approved for implementation for the time being. It should remain in the "under discussion" stage until agreement is reached on the issues raised. Participants should feel free to request an RFC meeting if they feel it would be helpful.

Nirmos added a subscriber: Nirmos.Jan 31 2018, 12:58 PM
daniel added a comment.Feb 7 2018, 6:43 PM

@kaldari Do you think having an IRC meeting on this soon would be useful? Or do you think the current discussion here is sufficient to move forward?

@daniel: It seems that this task is basically blocked by T49415 (other than some kind of partial roll-out). Eventually, lots of things are going to be blocked by T49415. I think having an IRC meeting about T49415 would be more useful.

daniel added a comment.Feb 8 2018, 8:13 PM

@kaldari would that be solved by T184485: Stop logging autopatrol actions? This has gone on Last Call, and if all goes well, it's approved in two weeks. Which raises the question - who would actually implement that?

Since T49415 is resolved, this should no longer be blocked. The only remaining hurdle to merging https://gerrit.wikimedia.org/r/#/c/399897/ is that we need to prevent page creation events from creating 2 different entries in recentchanges (one for the edit event and one for the creation log event). The consensus seems to be to not record the creation log event in recentchanges. Skimming through the logging code, it wasn't obvious how to do this, but I haven't had time to really investigate. Apparently, the patrolling action is also logged but doesn't insert into recentchanges, so we should be able to do whatever it's doing.

I think that can be done analogous to the implementation of the $wgAutopromoteOnceLogInRC configuration setting.

The patch is ready for reviewing/merging now: https://gerrit.wikimedia.org/r/#/c/399897/.

@jcrespo: I put the new page logging behind a feature flag which is set to false by default. That will allow us to not enable it on wikidatawiki (due to the volume concerns). If that sounds good to you, would appreciate a +1 on the patch :)
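
The intended behaviour (entry only for a page's first revision, gated by a flag that defaults to off, and not duplicated into recent changes) can be modelled roughly like this. It is a toy model, not the merged patch: the class, the flag name `page_creation_log_enabled`, and the tuple layouts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Wiki:
    # Hypothetical flag name; off by default, as described above.
    page_creation_log_enabled: bool = False
    revisions: dict = field(default_factory=dict)  # title -> [(user, text), ...]
    log: list = field(default_factory=list)
    recent_changes: list = field(default_factory=list)

    def save(self, title, user, text):
        is_new = title not in self.revisions
        self.revisions.setdefault(title, []).append((user, text))
        # The edit itself always shows up in recent changes, exactly once.
        self.recent_changes.append(("new" if is_new else "edit", title, user))
        if is_new and self.page_creation_log_enabled:
            # Log type/action "create"/"create"; deliberately not mirrored
            # into recent_changes, to avoid a second entry for one event.
            self.log.append(("create", "create", title, user))
```

With the flag left at its default, `save` behaves exactly as before, which is what allows a wiki with heavy creation volume to opt out.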

@kaldari Hello.

Is this log compatible with RevisionDelete and Suppression? It is important for us to be able to remove/hide nasty page titles there.

Also, will page titles remove themselves if and when the page is deleted with suppression (deleted with the checkbox "suppress data from administrators..." marked)?

Regards.

@MarcoAurelio: I believe this should work the same as existing page move logs.

Change 399897 merged by jenkins-bot:
[mediawiki/core@master] Record a log entry on page creation

https://gerrit.wikimedia.org/r/399897

Are there any plans to activate page creation logs on wmf servers? I would really appreciate it.

@MGChecker: My plan is to activate it on Test Wikipedia this week, and then all WMF projects except Wikidata and Commons.

I'm guessing this will not log creations from redirects, as described in T12331#3881196 ? Maybe we could take on that next, as this is really hard to query for currently (even with the new mw-new-redirect tag). Perhaps a create/fromredirect log type? Relevant task at T184305

I would really like this addition with its own log action.

kaldari closed this task as Resolved.Jun 28 2018, 6:36 AM

This is now live on all the wikis except Wikidata and Commons.

Good idea, but could you, please, add filtering by namespace? Thank you.

Good idea, but could you, please, add filtering by namespace? Thank you.

I suppose that would be T16711.