
Regex to list all pages without internal links inside it while ignoring any [[Category:*]] links inside pages
Closed, Declined · Public

Description

Hi. I use CirrusSearch regex search phrases to find articles that need maintenance, and it has worked for me every time. I'm looking for a regex that returns all pages that don't have any internal links in them. I use this regex for that purpose:

-insource:/\[\[.*/

But this regex also counts categories, so if a page contains no links but does have categories, it will not appear in the results. To my knowledge, there is no way to exclude categories from the count.

So I tried this regex to exclude categories, but it doesn't work for me either.

-insource:/\[\[[^c][^a][^t][^e][^g][^o][^r][^e][^y][^:].{3,}/

How can I exclude categories from the count of strings beginning with "[["?

Thanks in advance.

Event Timeline

ASammour created this task. Dec 1 2018, 4:31 AM
Restricted Application added a project: Discovery-Search. Dec 1 2018, 4:31 AM
Restricted Application added subscribers: alanajjar, Aklapper.

Hmm, why does -insource:/\[\[[^c][^a][^t][^e][^g][^o][^r][^y][^:].*/ not work? Maybe I misunderstand?
Do you have a specific search example with full URLs so it's easier to see what you're seeing, like which page you expected to [not] be listed and what happens instead?
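Separately from how the search backend behaves, there is a pattern-level issue with the regex quoted above: a run of negated classes such as [^c][^a][^t]... is not the same as "does not start with category:". It rejects any link whose first character is c, or whose second character is a, and so on, and it also requires at least nine characters after "[[". The following is a minimal sketch of that effect. It assumes Python's `re` module as a stand-in for the regex syntax used by insource: (negated character classes behave the same way in both), and the helper name `matches_per_char` is made up for illustration.

```python
import re

# Python's `re` is only a stand-in here for the regex engine behind insource:.
# The pattern is the English per-character version quoted in the comment above,
# without the trailing ".*", which does not change whether a match exists.
PER_CHAR = re.compile(r"\[\[[^c][^a][^t][^e][^g][^o][^r][^y][^:]")


def matches_per_char(wikitext: str) -> bool:
    """Return True if the per-character pattern finds a '[[' link in the text."""
    return PER_CHAR.search(wikitext) is not None


if __name__ == "__main__":
    samples = [
        "[[Category:Months]]",  # rejected, as intended (second character is 'a')
        "[[Physics]]",          # accepted: no character collides with its position
        "[[Mathematics]]",      # rejected too: second character is 'a'
    ]
    for text in samples:
        print(text, matches_per_char(text))
```

In this sketch an ordinary link like [[Mathematics]] is treated the same as a category link, so a page containing only that link would still be reported as having no links. This does not explain every result in the thread, but it suggests the per-character approach would be unreliable even if the query ran to completion.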

Thanks for the reply. I use the same regex on arwiki, but of course I change the characters from "category" to تصنيف, which is the equivalent of the category keyword on arwiki.

These are the search results I got when I entered the previous regex. All returned results are supposed to have no internal links, but if you go to the first result, "جمادى الأولى", you will find that it contains an internal link. On the other hand, if you go to the 8th result, "خربة العمور", you will see that it doesn't contain any internal links.

If this regex cannot be run, are there any other solutions that meet this goal (getting pages without internal links)?

Aklapper renamed this task from Bug in cirrusSearch regex to Regex to list all pages without internal links inside it while ignoring any [[Category:*]] links inside pages. Dec 1 2018, 3:01 PM

Thanks, that is helpful! :) Let me summarize, plus simplify the two testcases by also using intitle:. (Note to myself: Category = تصنيف.)

1st result جمادى الأولى which has an internal link should not be listed for intitle:"جمادى الأولى" -insource:/\[\[[^ت][^ص][^ن][^ي][^ف][^:].{3,}/

8th result خربة العمور without internal links should be listed for intitle:"خربة العمور" -insource:/\[\[[^ت][^ص][^ن][^ي][^ف][^:].{3,}/

Exactly, this is the problem.

I'm not sure regex search can solve this problem. Due to the way we accelerate regexes in the backend, it will be impossible for the suggested regexes to ever finish (basically, the acceleration does nothing for these queries).

Unfortunately, what you are looking for doesn't match the wikitext parser's definition of not having links, or it would be relatively easy to find. We keep an index of the outgoing links for every page, but as far as the wikitext parser is concerned, your example page with 0 outgoing links actually has 451 outgoing links (from the templates). Sadly, I'm not aware of any data source that is queryable online to extract this information.
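For anyone who already has the raw wikitext of the pages in hand (for example from an offline copy), the check described in this task can be sketched outside of search. This is only an illustration under stated assumptions: the function name `has_internal_link` and the prefix list are hypothetical, the pattern relies on a negative lookahead in Python's `re` module (no claim is made that insource: accepts it), and templates are not expanded, so only links written literally in the page source are seen.

```python
import re

# Hypothetical list of category prefixes to ignore (English and arwiki keywords).
CATEGORY_PREFIXES = ("category", "تصنيف")

# Match "[[" unless it is immediately followed by a category prefix and a colon.
# The negative lookahead (?!...) is a Python `re` feature.
NON_CATEGORY_LINK = re.compile(
    r"\[\[(?!\s*(?:" + "|".join(map(re.escape, CATEGORY_PREFIXES)) + r")\s*:)",
    re.IGNORECASE,
)


def has_internal_link(wikitext: str) -> bool:
    """Return True if the raw wikitext contains a '[[' link that is not a category tag."""
    return NON_CATEGORY_LINK.search(wikitext) is not None


if __name__ == "__main__":
    pages = {
        "only categories": "Plain text.\n[[Category:Months]]",
        "real link": "See [[Mathematics]].\n[[Category:Months]]",
        "arwiki categories": "نص\n[[تصنيف:شهور]]",
    }
    for title, text in pages.items():
        print(title, has_internal_link(text))
```

Note that this sketch counts anything else starting with "[[" as a link, including file and interlanguage links, so the prefix list would need extending if those should be ignored as well.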

EBjune closed this task as Declined. Dec 11 2018, 6:38 PM
EBjune added a subscriber: EBjune.

Declining, as per EBernhardson's answer above.