
20c of thoughts for globalsearch tool
Open, Needs Triage · Public

Description

Some thoughts

  • A regex search: is that case-sensitive or not? Here I am doing phrase searches quickly.
    • If it is case-sensitive, can one easily flick between the two?
  • Might be worth adding a note that the main ns is ns:0, or simply adding that as a tick box for when someone wants to quickly check that space.
  • There is a lag between search results and deletions.
  • Searching for spam can need a sort/display function. When I am working out which spambot it is, it is usually time-related, so it would be useful to see search results sorted by last date edited (see previous search example).
  • Logout quiescence between searches: could it be a bit longer?
  • Sanity check for regex prior to running ... my search term ([hH]ogan|[Vv]alentino|[Ff]erragamo) ([Ss]hoes|[Ff]ootwear|[Oo]utlet|[Bb]elts), when I forgot to tick the regex box, gave me 5000 results out of 177k in the User: ns (doh!)
  • If this becomes a more permanent fixture, I would like to see some integration with COIBot, as if the DBs are up to date, the ability to quickly search based on URLs has advantages.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · May 26 2019, 4:05 AM
Billinghurst renamed this task from 20c of thoughts to 20c of thoughts for globalsearch tool. · May 26 2019, 4:05 AM
Billinghurst updated the task description. · May 26 2019, 4:58 AM
Billinghurst updated the task description. · May 26 2019, 5:15 AM
Billinghurst updated the task description. · May 26 2019, 5:28 AM

@Billinghurst Thanks for the thorough feedback!

A regex search: is that case-sensitive or not? Here I am doing phrase searches quickly.
If it is case-sensitive, can one easily flick between the two?

It is case-sensitive, as regex would be by default. I need to do testing, but you should be able to prepend your regex with (?i) to make it case-insensitive. I will add some messaging to explain this to the user. I plan to have a separate option for exact (non-regex) searches; see T224359: Use regex for exact searches.
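For anyone unfamiliar with the flag, here is a minimal sketch of what (?i) does in a PCRE-style engine, shown with Python's re module; whether the Elasticsearch regex backend behind this tool honours inline flags is exactly the part that still needs testing.

```
import re

pattern = r"(?i)hogan (shoes|outlet)"

# With the inline flag, matching ignores case entirely.
print(bool(re.search(pattern, "Hogan Shoes cheap outlet")))     # True
print(bool(re.search(pattern, "HOGAN OUTLET")))                  # True

# Without it, the same pattern is case-sensitive.
print(bool(re.search(r"hogan (shoes|outlet)", "Hogan Shoes")))   # False
```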

Might be worth adding a note that the main ns is ns:0, or simply adding that as a tick box for when someone wants to quickly check that space.

Sure, we can add a help icon (?) that will explain that the main namespace is 0, as you suggest. I wanted a dropdown multi-select kind of input for namespaces, but the issue is this applies to all wikis and the namespaces are inconsistent. For instance, the "Draft" namespace only exists on some wikis.
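As background for such a selector, each wiki's namespace list can be read from its action API, which would let the tool show only the namespaces a given wiki actually has. A rough Python sketch (the siteinfo endpoint is standard MediaWiki; the helper itself is hypothetical, not part of the tool):

```
import requests

def get_namespaces(api_url):
    """Return {id: canonical name} for one wiki via the siteinfo API."""
    resp = requests.get(api_url, params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
    })
    resp.raise_for_status()
    namespaces = resp.json()["query"]["namespaces"]
    return {int(ns["id"]): ns.get("canonical") or "(Main)"
            for ns in namespaces.values()}

# "Draft" (118) exists on English Wikipedia but not on most other wikis.
enwiki = get_namespaces("https://en.wikipedia.org/w/api.php")
dewikisource = get_namespaces("https://de.wikisource.org/w/api.php")
print(enwiki.get(118), dewikisource.get(118))
```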

There is a lag between search results and deletions.

https://tools.wmflabs.org/global-search/?q=gay+porno+&regex=1&namespaces=
ky.wikipedia Free Porn Blockers - 3 Things They Will Not Do
9 May 2019 Billinghurst talk contribs block deleted page Free Porn Blockers - 3 Things They Will Not Do

@EBernhardson Any idea why this is? In this case the content was removed over two weeks ago!

Searching for spam can need a sort/display function. When I am working out which spambot it is, it is usually time-related, so it would be useful to see search results sorted by last date edited (see previous search example).

It unfortunately wouldn't be possible to identify which edits added the content, and sort by the time those edits were made. It is possible to sort by the time the most recent edit was made to that page, but I think this might end up being very expensive. Would this still be helpful? If so I will investigate :)

Logout quiescence between searches: could it be a bit longer?

I have this same problem with other tools that use OAuth. The cookie is set to expire an entire year after login... so I don't know why it logs you out so often. I've created T224382: Investigate why tools do not stay logged in for the duration of the session cookie.

Sanity check for regex prior to running ... my search term ([hH]ogan|[Vv]alentino|[Ff]erragamo) ([Ss]hoes|[Ff]ootwear|[Oo]utlet|[Bb]elts), when I forgot to tick the regex box, gave me 5000 results out of 177k in the User: ns (doh!)

It will be difficult to detect regex from a keyword search that happens to include reserved regex characters. For instance {{foo|bar could be regex or a keyword search for Template:Foo with "bar" as the first parameter. I will brainstorm solutions, but no promises!
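One hypothetical approach (not a commitment, and the helper below is illustrative only, not the tool's code) is not to guess intent at all, but to warn when a non-regex query contains regex metacharacters and let the user confirm:

```
REGEX_METACHARACTERS = set("[](){}|?*+^$\\")

def might_be_regex(query: str) -> bool:
    """Heuristic only: flag keyword queries containing regex metacharacters.

    This cannot tell {{foo|bar (a template call) apart from a real regex,
    so the result should drive a confirmation prompt, not an automatic switch.
    """
    return any(ch in REGEX_METACHARACTERS for ch in query)

print(might_be_regex("([hH]ogan|[Vv]alentino) ([Ss]hoes|[Oo]utlet)"))  # True: warn the user
print(might_be_regex("Enterobius vermicularis"))                        # False
```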

If this becomes a more permanent fixture, I would like to see some integration with COIBot, as if the DBs are up to date, the ability to quickly search based on URLs has advantages.

I'm not sure what this integration would entail, but I'm open to it! As I understand it, the CloudElastic service is still experimental. We are not sure how it will scale yet, so I'm averse to allowing any bots to query it right now. If you were talking about the other direction -- where Global Search queries the COIBot database -- then that may be more realistic in the near term. I'll need more information on the requirements, though.

This comment was removed by MusikAnimal.

A regex search: is that case-sensitive or not? Here I am doing phrase searches quickly.
If it is case-sensitive, can one easily flick between the two?

It is case-sensitive, as regex would be by default. I need to do testing, but you should be able to prepend your regex with (?i) to make it case-insensitive. I will add some messaging to explain this to the user. I plan to have a separate option for exact (non-regex) searches; see T224359: Use regex for exact searches.

I may be missing your expected regex syntax; however, wouldn't a ? just make the previous character optional?
ENterobius vermicularis?i

https://tools.wmflabs.org/global-search/?q=ENterobius+vermicularis%3Fi&regex=1&namespaces=

It doesn't find "Enterobius vermicularis".

Might be worth adding a note that the main ns is ns:0, or simply adding that as a tick box for when someone wants to quickly check that space.

Sure, we can add a help icon (?) that will explain that the main namespace is 0, as you suggest. I wanted a dropdown multi-select kind of input for namespaces, but the issue is this applies to all wikis and the namespaces are inconsistent. For instance, the "Draft" namespace only exists on some wikis.

Yes, we face that issue just within the Wikisources for the ProofreadPage extension, and have been slowly moving towards standardisation.

[snip]

Searching for spam can need a sort/display function. When I am working out which spambot it is, it is usually time-related, so it would be useful to see search results sorted by last date edited (see previous search example).

It unfortunately wouldn't be possible to identify which edits added the content, and sort by the time those edits were made. It is possible to sort by the time the most recent edit was made to that page, but I think this might end up being very expensive. Would this still be helpful? If so I will investigate :)

For me, I am (selfishly?) looking at following abuse (spam or CoI), which is usually something new within the timespan of recent changes if reaction is done well, so for spambots it is primarily new pages. So I am looking at recent activity, be it the actual spam or the addition of a deletion template. The time of the actual edit is not pertinent.

COIBot is generally pretty good at finding URL domains through time (though not perfect with all wikis, as some of the small wikis, and Commons, seem to slip through). [I will make separate commentary about URLs]

[snip]

It will be difficult to detect regex from a keyword search that happens to include reserved regex characters. For instance {{foo|bar could be regex or a keyword search for Template:Foo with "bar" as the first parameter. I will brainstorm solutions, but no promises!

If this becomes a more permanent fixture, I would like to see some integration with COIBot, as if the DBs are up to date, the ability to quickly search based on URLs has advantages.

I'm not sure what this integration would entail, but I'm open to it! As I understand it, the CloudElastic service is still experimental. We are not sure how it will scale yet, so I'm averse to allowing any bots to query it right now. If you were talking about the other direction -- where Global Search queries the COIBot database -- then that may be more realistic in the near term. I'll need more information on the requirements, though.

Not sure exactly what I am thinking; I am just seeing myself going back and forth from COIBot reports and from the abuse log, so I can definitely see some templates and an easy means of feeding components into a global search. I haven't overly thought this through, and won't whilst we are still playing with the basic concepts.

Billinghurst added a subscriber: MER-C. · Edited · May 27 2019, 2:47 AM

One of the biggest issues that I have in spam management is the limitations on looking at LinkSearches, especially as we have the unresolved protocol differences of http- and https-type URLs that require separate searches.

COIBot does a really good job, though when it is forced to do a review it does miss some components, especially in residue edits, and seems to have occasional blind spots, missing edits when you tell it to do a full retrospective review. It also can sometimes have issues with not seeing Commons URLs. [I will note that it is about to get some forced system changes due to T224154, and I am unsure of the consequences.]

And @MER-C has a tool that looks at linksearches (linksearch.jsp), though it focuses upon the bigger wikis, so the little wikis and the sisters can sit with spambot activity (past and present) undiagnosed for an extended time. These tools don't help when certain small sister wikis weirdly become sporadic spambot targets, often in sets, so it can take a bit of digging to hunt them out.

Supplementary: with this tool, when running a search on a URL as a general search, the period in the URL is presumably ignored, so the parts of the URL become separate searches; something like casertadeluxe.com becomes a search for casertadeluxe and com. So I suppose I am asking whether there is a means to fine-tune for URLs in the searches.

Supplementary: with this tool, when running a search on a URL as a general search, the period in the URL is presumably ignored, so the parts of the URL become separate searches; something like casertadeluxe.com becomes a search for casertadeluxe and com. So I suppose I am asking whether there is a means to fine-tune for URLs in the searches.

I just deployed T224359. You can now simply wrap any query in double quotes and you will get only exact matches, e.g. https://tools.wmflabs.org/global-search/?q="casertadeluxe.com"
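For the curious, the difference between the two behaviours boils down to tokenised versus phrase matching in the underlying Elasticsearch query. A sketch of the two request bodies as Python dicts (the source_text field name is an assumption about the schema, not a statement of what the tool actually sends):

```
# Tokenised keyword search: "casertadeluxe.com" is analysed into the terms
# "casertadeluxe" and "com", so either term can match on its own.
keyword_query = {"query": {"match": {"source_text": "casertadeluxe.com"}}}

# Quoted / exact search: the terms must appear adjacent and in order,
# which keeps the domain together as one unit.
exact_query = {"query": {"match_phrase": {"source_text": "casertadeluxe.com"}}}
```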

Supplementary: with this tool, when running a search on a URL as a general search, the period in the URL is presumably ignored, so the parts of the URL become separate searches; something like casertadeluxe.com becomes a search for casertadeluxe and com. So I suppose I am asking whether there is a means to fine-tune for URLs in the searches.

I just deployed T224359. You can now simply wrap any query in double quotes and you will get only exact matches, e.g. https://tools.wmflabs.org/global-search/?q="casertadeluxe.com"

Excellent news! That was the style of my very first search yesterday. :-) Thanks.

A thought bubble of what may be some nice search isolations, if it is easy:

  • Search by sister, e.g. just show me results from Wikimedia domains, or Wiktionary domains.
  • Search by language, e.g. an easy means to limit a search by the set natural language of the wikis: just show me results from German-language wikis.

Trying a search for Charles Darwin in ns:102, which is the Author: ns at enWS, the Page: ns at deWS, and obviously other namespaces elsewhere.

https://tools.wmflabs.org/global-search/?q=%22Charles+Darwin%22&namespaces=102

Total results: 1,506

Though it only displays about 100 results (presumably it is 100) and then stops. First time I have seen foreshortened search results (though I have been running targeted searches, so had not been testing that output).

EBernhardson added a comment. · Edited · May 28 2019, 4:36 PM

@Billinghurst Thanks for the thorough feedback!

A regex search: is that case-sensitive or not? Here I am doing phrase searches quickly.
If it is case-sensitive, can one easily flick between the two?

It is case-sensitive, as regex would be by default. I need to do testing, but you should be able to prepend your regex with (?i) to make it case-insensitive. I will add some messaging to explain this to the user. I plan to have a separate option for exact (non-regex) searches; see T224359: Use regex for exact searches.

I don't think this will work. With the regex query you are using (https://github.com/wikimedia/search-extra/blob/master/docs/source_regex.md) there is a case_sensitive field that has to be passed in the Elasticsearch query itself.
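For reference, the linked docs describe a query along these lines, with case sensitivity as a flag on the query rather than inside the regex string. The field names below are a sketch taken from those docs, not necessarily what Global Search sends today:

```
# Rough shape of a trigram-accelerated source_regex query (search-extra plugin),
# expressed as a Python dict for the Elasticsearch request body.
source_regex_query = {
    "query": {
        "source_regex": {
            "regex": "enterobius vermicularis",
            "field": "source_text",
            "ngram_field": "source_text.trigram",
            "case_sensitive": False,
        }
    }
}
```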

Might be worth adding a note that the main ns is ns:0, or simply adding that as a tick box for when someone wants to quickly check that space.

Sure, we can add a help icon (?) that will explain that the main namespace is 0, as you suggest. I wanted a dropdown multi-select kind of input for namespaces, but the issue is this applies to all wikis and the namespaces are inconsistent. For instance, the "Draft" namespace only exists on some wikis.

There is a lag between search results and deletions.

https://tools.wmflabs.org/global-search/?q=gay+porno+&regex=1&namespaces=
ky.wikipedia Free Porn Blockers - 3 Things They Will Not Do
9 May 2019 Billinghurst talk contribs block deleted page Free Porn Blockers - 3 Things They Will Not Do

@EBernhardson Any idea why this is? In this case the content was removed over two weeks ago!

Currently only group0 is live-updating; the rest has a snapshot of the production indices as of April 29th. This will hopefully soon include group1, and then group2 (everything), but it is currently delayed on getting LVS (load balancing) set up in front of the servers. The load balancer is tracked in T224324; turning on replication is T220625. Essentially, we don't want to start sending full update rates without the ability to easily depool servers.

Searching for spam can need a sort/display function. When I am working out which spambot it is, it is usually time-related, so it would be useful to see search results sorted by last date edited (see previous search example).

It unfortunately wouldn't be possible to identify which edits added the content, and sort by the time those edits were made. It is possible to sort by the time the most recent edit was made to that page, but I think this might end up being very expensive. Would this still be helpful? If so I will investigate :)

Sorting by the last edited date should be reasonable. See the query dump at https://www.mediawiki.org/wiki/?search=edit&fulltext=1&sort=last_edit_desc&cirrusDumpQuery to see how Cirrus does that sorting; the tl;dr is: "sort" => ["timestamp" => "desc"]
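To make that concrete, the clause sits alongside the query in the request body roughly like this (sketched as a Python dict; the match_phrase part is illustrative, not the tool's actual query):

```
query_sorted_by_last_edit = {
    "query": {"match_phrase": {"source_text": "Charles Darwin"}},
    # Newest-edited pages first, equivalent to CirrusSearch's sort=last_edit_desc.
    "sort": [{"timestamp": {"order": "desc"}}],
}
```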

Logout quiescence between searches: could it be a bit longer?

I have this same problem with other tools that use OAuth. The cookie is set to expire an entire year after login... so I don't know why it logs you out so often. I've created T224382: Investigate why tools do not stay logged in for the duration of the session cookie.

Sanity check for regex prior to running ... my search term ([hH]ogan|[Vv]alentino|[Ff]erragamo) ([Ss]hoes|[Ff]ootwear|[Oo]utlet|[Bb]elts), when I forgot to tick the regex box, gave me 5000 results out of 177k in the User: ns (doh!)

It will be difficult to detect regex from a keyword search that happens to include reserved regex characters. For instance {{foo|bar could be regex or a keyword search for Template:Foo with "bar" as the first parameter. I will brainstorm solutions, but no promises!

If this becomes a more permanent fixture, I would like to see some integration with COIBot, as if the DBs are up to date, the ability to quickly search based on URLs has advantages.

I'm not sure what this integration would entail, but I'm open to it! As I understand it, the CloudElastic service is still experimental. We are not sure how it will scale yet, so I'm averse to allowing any bots to query it right now. If you were talking about the other direction -- where Global Search queries the COIBot database -- then that may be more realistic in the near term. I'll need more information on the requirements, though.

If we are looking for URL-based search, Elasticsearch can probably help here, but we might need to adjust the mappings to further expose it. For example, we have a list of the external URLs linked to by a page in its own field, but if you want to search on that field we need to understand the use cases so it can be analyzed appropriately (for example, matching only domain names? Matching domain names by domain prefix? Matching words in the URL?). Today these URLs are only indexed for exact keyword lookup, so they are not directly useful to anything.
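To illustrate those choices, the external-links field could be exposed with sub-fields analysed in different ways. The mapping below is purely a sketch (the field and analyzer names are made up for the example, and it is not the production schema):

```
external_links_mapping = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Splits "https://casertadeluxe.com/shoes" into letter runs
                # (https, casertadeluxe, com, shoes) for word-in-URL matching.
                "url_parts": {
                    "type": "custom",
                    "tokenizer": "letter",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "external_link": {
                # Exact keyword lookup: roughly what exists today.
                "type": "keyword",
                "fields": {
                    # The same value analysed into parts for word-level matching.
                    # Domain-only or domain-prefix matching would want its own
                    # analyzer (e.g. path_hierarchy or edge n-grams) instead.
                    "parts": {"type": "text", "analyzer": "url_parts"}
                },
            }
        }
    },
}
```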

Thanks for the detailed replies, EBernhardson!

@Billinghurst I have deployed a "Case-insensitive" option for regex searches, e.g. https://tools.wmflabs.org/global-search/?q=FooBar&regex=1&ignorecase=1

I have created T224513: Add sort options.