
Missing observations from eswikiquote
Closed, Resolved · Public

Description

It looks like a lot of pages were deleted on eswikiquote. This changed our labeled dataset substantially. Is this a problem? What happened?

E.g., these are the counts for the "damaging" label:

Old:
	counts (n=11732):
		label        n         ~True    ~False
		-------  -----  ---  -------  --------
		True      1019  -->      827       192
		False    10713  -->     1003      9710
New:
	counts (n=9758):
		label       n         ~True    ~False
		-------  ----  ---  -------  --------
		True      849  -->      687       162
		False    8909  -->      750      8159
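To put rough numbers on the change, here is a minimal sketch that recomputes a few summary figures from the two tables above. It assumes the `~True` / `~False` columns are predicted labels (so each row is one row of a confusion matrix); the `summarize` helper and the dictionary layout are made up for illustration and are not part of any existing tooling.

```python
# Counts copied from the "damaging" tables above.
# Layout (an assumption for this sketch): actual label -> (n, ~True, ~False),
# where ~True / ~False are taken to be the predicted labels.
old = {"True": (1019, 827, 192), "False": (10713, 1003, 9710)}
new = {"True": (849, 687, 162), "False": (8909, 750, 8159)}


def summarize(counts):
    """Return total n, share of True labels, and recall on the True class."""
    n_total = sum(n for n, _, _ in counts.values())
    n_true, predicted_true, _ = counts["True"]
    return {
        "n": n_total,
        "prevalence": n_true / n_total,
        "recall": predicted_true / n_true,
    }


old_s, new_s = summarize(old), summarize(new)
lost = old_s["n"] - new_s["n"]

print(f"observations lost: {lost} ({lost / old_s['n']:.1%})")
print(f"prevalence of True: {old_s['prevalence']:.2%} -> {new_s['prevalence']:.2%}")
print(f"recall on True:     {old_s['recall']:.2%} -> {new_s['recall']:.2%}")
```

By these counts, 1974 observations (about 16.8%) are gone, while the share of "damaging" labels stays close to 8.7% in both sets, so the deletions do not appear to have skewed the label balance much.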

Event Timeline

Halfak created this task. Jun 8 2020, 4:56 PM
Restricted Application added a project: artificial-intelligence. Jun 8 2020, 4:56 PM
Restricted Application added a subscriber: Aklapper.

From T177762: Edit quality campaign for es.wikiquote, it's not clear who we originally talked to about setting up this wiki, but @MarcoAurelio originally set up the task. Maybe he can comment.

For what it's worth, I don't have any immediate concerns about the model. We are still able to learn something from the observations that were not deleted. I'd mostly like to get a sense for what happened so we can know what kind of effects it *might* have had on the models.

@Halfak It looks like es.wikiquote is undertaking a massive copyright cleanup campaign, which has led to the deletion of many pages that were copyright violations inserted in the days when nobody was active in the project. Given that they have few sysops over there, I'm not sure how fast they'll finish the cleanup. Please let me know if I can assist with anything.

Halfak added a comment. Jun 8 2020, 6:17 PM

Thanks for the notes. I don't think we'll need to change anything here. We might consider extending our feature extractor to be able to extract features from deleted pages in order to preserve the value of observations like these, but for the time being, I don't think this will be a problem.

Noting that I got a reply from the admin Cookie. She said the cleanup round is not yet complete.

Halfak closed this task as Resolved. Jul 13 2020, 4:41 PM
Halfak claimed this task.

Thanks for the information @MarcoAurelio. Given that this is an expected deletion of data, we're going to resolve this.