
Transclusion doesn't work after deletion of source file
Open, Needs Triage | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  1. File has been uploaded to Commons - https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AКрасный+библиотекарь+%28журнал%29%2C+1923%2C+№+1.pdf
  2. Index has been created in Wikisource - https://ru.wikisource.org/wiki/Индекс:Красный_библиотекарь_(журнал),_1923,_№_1.pdf
  3. Text has been recognised
  4. Index has been transcluded into pages with <pages> tag - https://ru.wikisource.org/wiki/Служебная:Ссылки_сюда?target=Индекс%3AКрасный+библиотекарь+%28журнал%29%2C+1923%2C+№+1.pdf&namespace=&hidelinks=1
  5. File has been deleted from Commons - https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AКрасный+библиотекарь+%28журнал%29%2C+1923%2C+№+1.pdf
  6. Index file and recognised text still available in Wikisource - https://ru.wikisource.org/wiki/Индекс:Красный_библиотекарь_(журнал),_1923,_№_1.pdf
  7. Transclusion doesn't work after deletion of source file. For example https://ru.wikisource.org/w/index.php?title=Работа_детских_библиотек_в_Москве_(Каптерева)&oldid=4627379

What happens?:
<pages> doesn't return anything

What should have happened instead?:
<pages> should add text from Index even if source file is absent
(optional for this bug) add tracking category for indexes without source files

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):
MediaWiki 1.43.0-wmf.4 (2111e6d) - https://ru.wikisource.org/wiki/Служебная:Версия

Other information (browser name/version, screenshots, etc.):

Event Timeline

Reedy renamed this task from "Transclusion doesn't work atfer deletion of source file" to "Transclusion doesn't work after deletion of source file". Sun, May 12, 3:07 PM

This is expected: if there is no file backing an index, transclusions won't work.

I agree—this is expected behavior. Moving forward there are a few options:

  1. Re-upload the file. Since it was deleted at Commons for copyright reasons, I recommend against doing that at Commons (though that would solve your technical problem, albeit briefly). That said, since WMF servers are in the USA (including Wikisource Russian), you might ultimately run into issues even if you re-uploaded it to s:ru:Файл:Красный библиотекарь (журнал), 1923, № 1.pdf instead of c:File:Красный библиотекарь (журнал), 1923, № 1.pdf
  2. Use Шаблон:Страница to transclude the pages instead of <pages/> since the individual Страница: wikitext still exists for each page.

In any event, based upon c:Commons:Deletion requests/Красный библиотекарь (журнал), it looks like you are running into a legal issue of copyrights. You should probably give up on hosting that at WMF until the copyrights have expired from a USA point of view (or you can get the copyright holders to publish their content under a WMF acceptable free license such as CC-BY, CC-BY-SA, GPL, LGPL, FAL/LAL, ODC, GFDL, etc.).

Even the second option I listed above is likely not a long-term solution. But you might be able to use such a method to reconstruct the target(s) long enough to download and store the transcriptions locally before they ultimately get deleted for copyright reasons (by an admin/sysop at Wikisource Russian).

As an example, I did Служебная:Изменения/5130429/prev as can be seen at: Библиотека и учащиеся (Смушкова).

The case above is just an example. This can be any file, deleted for any reason. After the file is deleted, the text is no longer displayed even though it still exists.

The case above is just an example. This can be any file, deleted for any reason. After the file is deleted, the text is no longer displayed even though it still exists.

In other situations where there are no legal or other policy issues, I recommend just re-uploading the file (but Commons does not usually delete things without ample cause).

This is expected: if there is no file backing an index, transclusions won't work.

This shouldn't happen because the <pages/> mechanism works with text in Page and Index NS. Whereas the mediafile is in File NS. These are independent spaces.

  1. Use Шаблон:Страница to transclude the pages instead of <pages/> since the individual Страница: wikitext still exists for each page.
  • The "Страница" template is not recommended for use, as is the English Template:Page, which has the warnings "This template has been deprecated. Please see Help:Transclusion instead..", "Use of this template is discouraged. Users should preferentially be using the <pages /> syntax as detailed at Help:Transclusion".
  • This requires replacing the existing <pages/> with a mass of “Страница” templates in a different format of arguments. Moreover, there can be many dozens of lines with this template, as many as there were index pages in <pages/>, and all this on the mass of pages that were associated with this index. So, this is impossible to do manually in practice (for only one issue of a journal or book - this is tens and many hundreds of index pages, on many pages of the Main NS). (As in the mentioned example >120 index pages and >35 related pages.)
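
To give a sense of what a scripted version of this conversion might look like, here is a minimal sketch. It rests on several assumptions that are not confirmed anywhere in this thread: it transcludes each page directly as `{{Страница:…/N}}` rather than through Шаблон:Страница (whose argument format is not given here), it relies on page headers and footers being wrapped in `<noinclude>`, and it only handles a `<pages/>` tag with numeric `from`/`to` attributes in that order. The function name `expand_pages_tag` is hypothetical.

```python
import re

# Matches only the simplest form: <pages index="..." from=N to=M />
# Section attributes (fromsection, tosection, etc.) are deliberately not handled.
PAGES_TAG = re.compile(
    r'<pages\s+index="(?P<index>[^"]+)"'
    r'\s+from="?(?P<start>\d+)"?'
    r'\s+to="?(?P<end>\d+)"?\s*/>'
)


def expand_pages_tag(wikitext: str, page_ns: str = "Страница") -> str:
    """Replace each simple <pages/> tag with one direct transclusion per page.

    Assumption: direct transclusion of Page-namespace titles is an acceptable
    substitute for the local template; this is an illustration only.
    """

    def replace(match: "re.Match[str]") -> str:
        index = match.group("index")
        start = int(match.group("start"))
        end = int(match.group("end"))
        lines = [f"{{{{{page_ns}:{index}/{n}}}}}" for n in range(start, end + 1)]
        return "\n".join(lines)

    return PAGES_TAG.sub(replace, wikitext)


if __name__ == "__main__":
    print(expand_pages_tag('<pages index="Book.pdf" from=3 to=5 />'))
```

Even with such a script, every affected main-namespace page would still need a bot or manual edit, so this reduces the typing but not the review effort.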

In addition, there is no tracking category for such errors. We found out about this completely by accident. It is not known how many similar ones exist in the project and Wikisource in all other languages.

In any event, based upon c:Commons:Deletion requests/Красный библиотекарь (журнал), it looks like you are running into a legal issue of copyrights.

This example is a complicated situation. It was removed due to copyright reasons of the magazine editor. However, according to the law, editors are not recognized as authors, so deletion for this reason is not legal. But it is useless to prove on Wikimedia Commons, since this is a separate project with independent administrators who often delete some files out of caution, while a lot of others with gross copyright violations and false license templates remain there and are multiplied daily.
Also, several articles (a dozen pages of the magazine) violate copyright. This is the reason to delete the file.

But this is a bug that all the pages of other authors of this magazine with <pages/> are now broken.

As an example, I did Служебная:Изменения/5130429/prev as can be seen at: Библиотека и учащиеся (Смушкова).

Here is a similar example from there. Sorry, I deleted your example before I read this message, because it is a violation of copyright ("Смушкова").

This shouldn't happen because the <pages/> mechanism works with text in Page and Index NS. Whereas the mediafile is in File NS. These are independent spaces.

But they aren't independent because the Index and Page pages depend on the File (and the Page pages also depend on the Index).

  • The "Страница" template is not recommended for use, as is the English Template:Page, which has the warnings "This template has been deprecated. Please see Help:Transclusion instead..", "Use of this template is discouraged. Users should preferentially be using the <pages /> syntax as detailed at Help:Transclusion".
  • This requires replacing the existing <pages/> with a mass of “Страница” templates in a different format of arguments. Moreover, there can be many dozens of lines with this template, as many as there were index pages in <pages/>, and all this on the mass of pages that were associated with this index. So, this is impossible to do manually in practice (for only one issue of a journal or book - this is tens and many hundreds of index pages, on many pages of the Main NS). (As in the mentioned example >120 index pages and >35 related pages.)

I never suggested it was easy, fun or recommended—just possible.

In addition, there is no tracking category for such errors. We found out about this completely by accident. It is not known how many similar ones exist in the project and Wikisource in all other languages.

I agree this could use considerably better error reporting/tracking.

This example is a complicated situation. It was removed due to copyright reasons of the magazine editor. However, according to the law, editors are not recognized as authors, so deletion for this reason is not legal. But it is useless to prove on Wikimedia Commons, since this is a separate project with independent administrators who often delete some files out of caution, while a lot of others with gross copyright violations and false license templates remain there and are multiplied daily.
Also, several articles (a dozen pages of the magazine) violate copyright. This is the reason to delete the file.

Well, it is up to Wikisource Russian whether to allow the files to be hosted locally there; however, WMF does have some say when it comes to legal issues like copyright.

Some ways around this are to censor the pages of the PDFs replacing them with blanks for the copyright violation issues and then re-upload them somewhere (either locally or Commons).

Alternatively you could break the pages apart into images like JPEGs and upload all the individual pages, skipping the ones still under copyright. I am sure you are aware of the ability to create Index pages across such individual page images.

It would be nice if the guys at Commons somehow notified the wikis consuming their media when they make such deletions, etc.

But this is a bug that all the pages of other authors of this magazine with <pages/> are now broken.

I won't argue that things are not broken. I would argue this is expected behavior and not a bug.

Here is a similar example from there. Sorry, I deleted your example before I read this message, because it is a violation of copyright ("Смушкова").

That is not a problem. I do not really know Russian so I was sort of shooting in the dark anyway (I had just randomly picked a main space article that depended on the Index and some of the Page pages from the ones related to this discussion).

This shouldn't happen because the <pages/> mechanism works with text in Page and Index NS. Whereas the mediafile is in File NS. These are independent spaces.

But they aren't independent because the Index and Page pages depend on the File (and the Page pages also depend on the Index).

No, <pages/> only uses existing text in Page NS; the file is not needed for this and is not used. Page and Index NS only use the file during index page creation and to visually assist users in proofreading. So <pages/> should continue to display that text, without breaking pages, when the file is deleted.

  • The "Страница" template is not recommended for use, as is the English Template:Page, which has the warnings "This template has been deprecated. Please see Help:Transclusion instead..", "Use of this template is discouraged. Users should preferentially be using the <pages /> syntax as detailed at Help:Transclusion".
  • This requires replacing the existing <pages/> with a mass of “Страница” templates in a different format of arguments. Moreover, there can be many dozens of lines with this template, as many as there were index pages in <pages/>, and all this on the mass of pages that were associated with this index. So, this is impossible to do manually in practice (for only one issue of a journal or book - this is tens and many hundreds of index pages, on many pages of the Main NS). (As in the mentioned example >120 index pages and >35 related pages.)

I never suggested it was easy, fun or recommended—just possible.

This is impossible to do manually in practice. You will not fix the pages of this journal, and neither I nor anyone else in our small project will convert them to {{страница}}, because of the technical complexity. The pages remain broken. There is no reason to delete them, because they are proofread in Page NS and are legal.

But this is a bug that all the pages of other authors of this magazine with <pages/> are now broken.

I won't argue that things are not broken. I would argue this is expected behavior and not a bug.

This is a massive bug, unfortunately. A lot of pages with free licenses are broken. Like this, this, this, etc. Users and administrators don’t even know about it, and corrections are technically impossible.
What then is a “bug”, if not the breakdown of a lot of pages with the disappearance of text that is present in the system and is legal?

I agree the lack of useful error reporting to know about when this happens is certainly a bug but Proofread Page (PRP) is specifically for media-backed transcriptions. As far as I know it has never supported transcription without being backed by some File media. So in that way I would argue this is not a bug. Most Wikisource sites do support other forms of transcription (and even translation, etc.) not involving PRP (in addition to supporting PRP-based).

I disagree that <pages/> should be able to emit Page content without the supporting File media. It might be possible to make it do so, but that would be a feature request and not a bug (and I am not even convinced that is the right thing to do). It could only be a bug if it supported that in the past, and so far as I know it never has. Your wanting it to support such a feature does not make it a bug (otherwise I could make ludicrous claims that it is a bug because it does not provide me monetary income or some other random feature, etc.).

This shouldn't happen because the <pages/> mechanism works with text in Page and Index NS. Whereas the mediafile is in File NS. These are independent spaces.

But they aren't independent because the Index and Page pages depend on the File (and the Page pages also depend on the Index).

No, <pages/> only uses existing text in Page NS; the file is not needed for this and is not used. Page and Index NS only use the file during index page creation and to visually assist users in proofreading. So <pages/> should continue to display that text, without breaking pages, when the file is deleted.

That's incorrect, ProofreadPage depends on the File namespace to build the pagelist, figure out valid Page: namespace pages and determine the transclusion order of the pages.
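
The dependency described here can be illustrated with a toy model. This is not ProofreadPage source code; the names `MediaFile`, `build_pagelist` and `transclude` are hypothetical, and the only thing the sketch assumes is what the comment above states: the list of valid pages and their order is derived from the file. Under that assumption, stored page text becomes unreachable once the file is gone, even though nothing was deleted from the Page namespace.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MediaFile:
    """Hypothetical stand-in for a multi-page file such as a PDF."""
    name: str
    page_count: int


def build_pagelist(file: Optional[MediaFile], page_ns: str = "Page") -> List[str]:
    """Derive the ordered list of valid page titles from the file.

    With no backing file there is nothing to derive the list from,
    so the result is empty.
    """
    if file is None:
        return []
    return [f"{page_ns}:{file.name}/{n}" for n in range(1, file.page_count + 1)]


def transclude(pagelist: List[str], start: int, end: int,
               stored_text: Dict[str, str]) -> str:
    """Concatenate stored text for pages start..end (1-based, inclusive)."""
    wanted = pagelist[start - 1:end]
    return "".join(stored_text.get(title, "") for title in wanted)


if __name__ == "__main__":
    stored = {"Page:J.pdf/1": "first ", "Page:J.pdf/2": "second"}
    print(repr(transclude(build_pagelist(MediaFile("J.pdf", 2)), 1, 2, stored)))
    # The text is still stored, but the empty page list yields no output.
    print(repr(transclude(build_pagelist(None), 1, 2, stored)))
```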

But this is a bug that all the pages of other authors of this magazine with <pages/> are now broken.

I won't argue that things are not broken. I would argue this is expected behavior and not a bug.

This is a massive bug, unfortunately. A lot of pages with free licenses are broken. Like this, this, this, etc. Users and administrators don’t even know about it, and corrections are technically impossible.
What then is a “bug”, if not the breakdown of a lot of pages with the disappearance of text that is present in the system and is legal?

The lack of error reporting could be construed as a bug, the breakdown of pages is not a bug and is intended behavior. You're trying to start a car with water and then claiming that it is a manufacturing defect.

The correct approach in this kind of scenario is to upload the file to your local wiki (if the local/global policies allow it) with the same name as the now deleted file on commons. If the local/global policies don't allow it, then you are out of luck and Phabricator is not the correct place to dispute that.

That's incorrect, ProofreadPage depends on the File namespace to build the pagelist, figure out valid Page: namespace pages and determine the transclusion order of the pages.

Even in situations where the pagelist is entirely determined by the Index (e.g., loose non-multiple page media like a pile of images), I believe there is a dependency on File media.

The lack of error reporting could be construed as a bug, the breakdown of pages is not a bug and is intended behavior. You're trying to start a car with water and then claiming that it is a manufacturing defect.

I agree. This is definitely a bug, but only for the lack of proper and useful error reporting under such circumstances, which are actually fairly common and quite hard to detect.

The correct approach in this kind of scenario is to upload the file to your local wiki (if the local/global policies allow it) with the same name as the now deleted file on commons. If the local/global policies don't allow it, then you are out of luck and Phabricator is not the correct place to dispute that.

Again I agree. Short of complete removal of all related content (which is always an option but it has its down sides), I believe there are basically three long term approaches:

  1. Modify the PDFs to replace the content still under copyrights with blank pages/sections and then re-upload them either locally or at Commons
  2. Break the pages apart and re-upload the pages with non-copyright-infringing material as separate image pages. There are a number of image formats, and again the media can be hosted either locally or at Commons
  3. Forsake the File media and do not use PRP to transcribe the non-copyright infringing material, thereby hosting the content without being media-backed.
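
One practical point about option one: replacing restricted pages with blanks, rather than removing them, keeps the page count and page numbers of the file unchanged, so the existing Page titles ending in /N would still line up with the re-uploaded file. A minimal planning sketch follows; the helper name `blanking_plan` and the page ranges are hypothetical, and the actual PDF editing is left to whatever tool is used.

```python
from typing import Iterable, List, Tuple


def blanking_plan(total_pages: int,
                  restricted_ranges: Iterable[Tuple[int, int]]) -> List[Tuple[int, str]]:
    """Return (page_number, action) for every page, 1-based.

    action is "blank" for pages inside a restricted range and "keep"
    otherwise. No page is dropped, so page numbering stays stable.
    """
    restricted = set()
    for first, last in restricted_ranges:
        if not (1 <= first <= last <= total_pages):
            raise ValueError(f"invalid range: {first}-{last}")
        restricted.update(range(first, last + 1))
    return [(n, "blank" if n in restricted else "keep")
            for n in range(1, total_pages + 1)]


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the journal discussed here.
    for page, action in blanking_plan(5, [(2, 3)]):
        print(page, action)
```

Option two, by contrast, would presumably need a new Index with its own page list, since the individual images no longer share one file name and page sequence.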

I personally prefer option one because, as parts of the original document fall out of copyright over time, the media can be re-uploaded repeatedly, slowly adding back those parts until the complete publication can eventually be hosted as is.

If option one does not seem optimal for Wikisource Russian, then I recommend option three and finally option two. Ultimately it is up to the Wikisource Russian community to decide what is best for it (be that via community leadership or consensus), etc.

FYI: In terms of the error reporting bug from this issue, the following seems to be applicable:

Change #1031631 had a related patch set uploaded (by Sohom Datta; author: Sohom Datta):

[mediawiki/extensions/ProofreadPage@master] Add a tracking category for pagelist tags without associated Files

https://gerrit.wikimedia.org/r/1031631

It seems like that patch, when merged, might solve the issue with finding <pagelist/> usage when the File media cannot be properly processed (in the case of that issue/task it was related to media hosting issues that could be rectified with purges and null edits, but it should also apply when the backing media is deleted).

Change #1031631 had a related patch set uploaded (by Sohom Datta; author: Sohom Datta):

[mediawiki/extensions/ProofreadPage@master] Add a tracking category for pagelist tags without associated Files

https://gerrit.wikimedia.org/r/1031631

The correct approach in this kind of scenario is to upload the file ...

Without tracking tools, nobody knows which pages are broken. How can we find all pages, across all Wikisource instances, that need a new file uploaded? Right now they can only be found by chance.
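
Until a tracking category exists, one stopgap might be to ask the MediaWiki Action API whether the file behind each Index still exists. The sketch below is only an outline under stated assumptions: it uses `action=query` with `prop=imageinfo` and `formatversion=2`, and it relies on the `imagerepository` field being empty when neither the local wiki nor the shared repository has the file, which is my understanding of that response format rather than something confirmed in this thread. The function names are hypothetical, the list of file titles has to come from elsewhere (for example, from the Index titles), and no network request is made here.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def imageinfo_query_url(api_url: str, file_titles: List[str]) -> str:
    """Build an Action API URL asking for file information on the given titles."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "imageinfo",
        "titles": "|".join(file_titles),
    }
    return f"{api_url}?{urlencode(params)}"


def titles_without_media(response: Dict[str, Any]) -> List[str]:
    """Return titles whose file exists neither locally nor in the shared repository.

    Assumption: an empty or absent "imagerepository" value means no file.
    """
    pages = response.get("query", {}).get("pages", [])
    return [page["title"] for page in pages if not page.get("imagerepository")]


if __name__ == "__main__":
    print(imageinfo_query_url("https://ru.wikisource.org/w/api.php",
                              ["File:Example.pdf"]))
    # Hand-written sample in the assumed response shape, not real API output.
    sample = {"query": {"pages": [
        {"title": "File:A.pdf", "missing": True, "known": True,
         "imagerepository": "shared"},
        {"title": "File:B.pdf", "missing": True, "imagerepository": ""},
    ]}}
    print(titles_without_media(sample))
```

The API limits how many titles one request may carry, so a real run would need batching and continuation handling, which this sketch omits.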