He7d3r (Helder)
Research

User Details

User Since
Oct 6 2014, 11:25 PM (320 w, 2 d)
Availability
Available
IRC Nick
he7d3r
LDAP User
He7d3r
MediaWiki User
He7d3r [ Global Accounts ]

Recent Activity

Oct 21 2020

He7d3r claimed T176711: jQueryMsg should generate external links with 'external' CSS class.
Oct 21 2020, 9:12 AM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), Growth-Team, MediaWiki-extensions-GuidedTour, good first task, JavaScript, I18n

Oct 20 2020

He7d3r added a comment to T264490: ContentTranslation adds duplicate 'Category:' prefix.

Still happening: https://pt.wikipedia.org/wiki/Simetria_de_reflex%C3%A3o?diff=59627837#footer

Oct 20 2020, 9:49 AM · Language-Team (Language-2020-October-December), ContentTranslation
He7d3r created T265985: "Uncaught TypeError: item is null" when clicking on category translation.
Oct 20 2020, 9:43 AM · JavaScript, ContentTranslation

Oct 13 2020

He7d3r added a comment to T264940: Track metrics on ptwiki relating to IP-editing turn off.

@Danilo made these tools for that:
https://ptwikis.toolforge.org/FiltroIP
https://ptwikis.toolforge.org/Filtros:180

Oct 13 2020, 12:11 PM · Product-Analytics (Kanban), Anti-Harassment

Oct 12 2020

He7d3r created T265270: ContentTranslation2 adds invalid categories (Localized ns+English ns+Category name).
Oct 12 2020, 10:38 AM · ContentTranslation

Oct 10 2020

He7d3r added a comment to T264940: Track metrics on ptwiki relating to IP-editing turn off.

Number of edits: "Edits per day" graph tool. I created that tool with a query similar to the one I used to get the active users. The graph shows that IPs used to make approximately 1700 edits per day. After registration became mandatory, edits by new users rose by approximately 700 per day (from ~700 to ~1400), which suggests that about 700 of the edits previously made by IPs are now being made by newly registered users and about 1000 are no longer being made.
(...)
Quality of edits with ORES: https://quarry.wmflabs.org/query/48860. I used the ORES damaging model to estimate the proportion of damaging edits. The data shows that it has decreased from approx. 18% to approx. 7%. That suggests that the approx. 1000 edits per day that are no longer being made by IPs were worse than the approx. 700 that are now being made by newly registered users.
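
The arithmetic behind the "700 moved, 1000 gone" estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the rounded figures quoted above; the variable names are illustrative and do not come from the graph tool or the Quarry query.

```python
# Rounded daily figures quoted in the comment above (all approximate).
ip_edits_before = 1700         # edits/day made by IPs before the change
new_user_edits_before = 700    # edits/day by new accounts before the change
new_user_edits_after = 1400    # edits/day by new accounts after the change

# Edits that appear to have moved from IPs to newly registered users.
migrated = new_user_edits_after - new_user_edits_before

# Former IP edits that nobody seems to be making any more.
lost = ip_edits_before - migrated

print(f"migrated: ~{migrated}/day, no longer made: ~{lost}/day")
```

The estimate assumes that the whole increase in new-user edits comes from former IP editors, which the data alone cannot confirm.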

Oct 10 2020, 11:35 AM · Product-Analytics (Kanban), Anti-Harassment

Oct 5 2020

He7d3r added a comment to T264622: $wgAbuseFilterEmergencyDisableThreshold is ignored.

There you go, the stats were reset and the filter was throttled. Likely some caching issue. T264629 could help, probably.

Indeed, now the main page says it is "Enabled, throttled":
https://pt.wikipedia.org/wiki/Special:AbuseFilter?offset=179&limit=1&uselang=en

Oct 5 2020, 5:54 PM · AbuseFilter
He7d3r updated subscribers of T13664: Add User Preference Option to hide reverted edits from Watchlist and Page History.

I wonder if this deserves a higher priority now, given that https://gerrit.wikimedia.org/r/609773 implemented some kind of metadata (T254074: Implement the reverted edit tag). This feature should help with part of the concerns raised at

...
As for Huggle, not only is there a chronic problem of people willing to waste their volunteer time operating it instead of creating content, but it's a bad solution as well, since it fills up the history of the articles with revert after revert, polluting it and making it much less readable.
...

Oct 5 2020, 5:39 PM · MediaWiki-User-preferences
He7d3r awarded T254074: Implement the reverted edit tag a Like token.
Oct 5 2020, 5:29 PM · User-notice, MW-1.35-notes (1.35.0-wmf.41; 2020-07-14), Patch-For-Review, Product-Analytics, MediaWiki-Page-editing
He7d3r awarded T13664: Add User Preference Option to hide reverted edits from Watchlist and Page History a Like token.
Oct 5 2020, 5:24 PM · MediaWiki-User-preferences
He7d3r created T264622: $wgAbuseFilterEmergencyDisableThreshold is ignored.
Oct 5 2020, 2:24 PM · AbuseFilter
He7d3r awarded T261133: Ban IP edits on pt.wiki a Dislike token.
Oct 5 2020, 12:39 PM · Growth-Team, Anti-Harassment, Wikimedia-Site-requests

Oct 3 2020

Krinkle awarded T256732: [[Special:Notifications]] uses deprecated $.trimByteLength an Orange Medal token.
Oct 3 2020, 10:51 PM · MW-1.36-notes (1.36.0-wmf.12; 2020-10-05; NEVER DEPLOYED), Technical-Debt, Growth-Team, JavaScript, Notifications

Oct 1 2020

Quiddity awarded T63547: Make [[Special:WhatLinksHere]] and [[Special:RecentChangesLinked]] work with links which use [[Special:MyLanguage]] a Doubloon token.
Oct 1 2020, 4:58 PM · I18n, MediaWiki-General

Sep 19 2020

DannyS712 awarded T157218: Special:Log should display all logs a user has the rights to see (instead of only public logs) a Like token.
Sep 19 2020, 5:58 PM · Platform Engineering, MediaWiki-Logevents, AbuseFilter, SpamBlacklist, TitleBlacklist

Aug 27 2020

He7d3r added a watcher for Outreach-Programs-Projects: He7d3r.
Aug 27 2020, 11:32 PM

Aug 9 2020

Pppery awarded T63547: Make [[Special:WhatLinksHere]] and [[Special:RecentChangesLinked]] work with links which use [[Special:MyLanguage]] a Like token.
Aug 9 2020, 12:35 AM · I18n, MediaWiki-General

Jul 28 2020

Yair_rand awarded T63547: Make [[Special:WhatLinksHere]] and [[Special:RecentChangesLinked]] work with links which use [[Special:MyLanguage]] a Doubloon token.
Jul 28 2020, 9:32 PM · I18n, MediaWiki-General

Jul 22 2020

He7d3r updated the task description for T253938: Future proof addPortletLink and work towards a standard mw-portlet class for all menus across all skins.
Jul 22 2020, 2:30 PM · MW-1.36-notes (1.36.0-wmf.12; 2020-10-05; NEVER DEPLOYED), Patch-For-Review, Readers-Web-Backlog (Kanbanana-FY-2020-21), MediaWiki-Core-Skin-Architecture, Timeless, Vector

Jul 16 2020

He7d3r created T258149: Show source article quality at Special:ContentTranslation's "translations in progress", "suggestions" and "for later" lists.
Jul 16 2020, 10:52 AM · ORES, Machine Learning Platform, ContentTranslation

Jul 15 2020

He7d3r awarded T254352: Each filter should have a talk page a Love token.
Jul 15 2020, 9:22 PM · AbuseFilter

Jul 7 2020

He7d3r merged task T61688: Dissappeared content of categories cannot be gathered again into T6366: Category history should show past members.
Jul 7 2020, 6:10 PM · MediaWiki-Categories
He7d3r merged task T36269: (un)categorization actions should get logged into T6366: Category history should show past members.
Jul 7 2020, 6:10 PM · MediaWiki-Categories
He7d3r merged task T36597: Category history into T6366: Category history should show past members.
Jul 7 2020, 6:10 PM · MediaWiki-Categories
He7d3r merged task T7484: include page categorization/decategorization event in the related category watch list into T6366: Category history should show past members.
Jul 7 2020, 6:10 PM · MediaWiki-Categories
He7d3r merged task T7526: It should be possible to see the chronology of the additions and removals of articles in a given category into T6366: Category history should show past members.
Jul 7 2020, 6:10 PM · MediaWiki-Categories
He7d3r merged tasks T7526: It should be possible to see the chronology of the additions and removals of articles in a given category, T7484: include page categorization/decategorization event in the related category watch list, T36597: Category history, T36269: (un)categorization actions should get logged, T61688: Dissappeared content of categories cannot be gathered again into T6366: Category history should show past members.
Jul 7 2020, 6:10 PM · MediaWiki-Categories

Jun 30 2020

He7d3r created T256732: [[Special:Notifications]] uses deprecated $.trimByteLength.
Jun 30 2020, 10:37 AM · MW-1.36-notes (1.36.0-wmf.12; 2020-10-05; NEVER DEPLOYED), Technical-Debt, Growth-Team, JavaScript, Notifications

Jun 27 2020

He7d3r added a comment to T256534: "Uncaught Error: Syntax error, unrecognized expression: ." when clicking on paragraph.

Jun 27 2020, 2:18 PM · Language-Team (Language-2020-July-September), MW-1.36-notes (1.36.0-wmf.2; 2020-07-28), Patch-For-Review, ContentTranslation
He7d3r created T256534: "Uncaught Error: Syntax error, unrecognized expression: ." when clicking on paragraph.
Jun 27 2020, 2:11 PM · Language-Team (Language-2020-July-September), MW-1.36-notes (1.36.0-wmf.2; 2020-07-28), Patch-For-Review, ContentTranslation

Jun 26 2020

He7d3r awarded T134681: ContentTranslation should not validate single sections against abuse filters intended for full pages a Heartbreak token.
Jun 26 2020, 9:15 PM · WorkType-Maintenance, AbuseFilter, ContentTranslation
He7d3r awarded T134678: ContentTranslation should generate an AbuseFilter log whenever it shows a warning for the users a Heartbreak token.
Jun 26 2020, 9:12 PM · WorkType-NewFunctionality, AbuseFilter, ContentTranslation

Jun 23 2020

He7d3r updated subscribers of T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

@Danilo generated the following table comparing articlequality scores for the latest version of all articles to the scores which would be produced by the Python script which is/was used to make bot assessments:

MariaDB [s51206__ptwikis]> SELECT pe_qualidade, SUM(pe_qores = 0) ORES_0, SUM(pe_qores = 1) ORES_1, SUM(pe_qores = 2) ORES_2, SUM(pe_qores = 3) ORES_3, SUM(pe_qores = 4) ORES_4, SUM(pe_qores = 5) ORES_5, SUM(pe_qores = 6) ORES_6 FROM page_extra GROUP BY pe_qualidade ORDER BY pe_qualidade;
+--------------+--------+--------+--------+--------+--------+--------+--------+
| pe_qualidade | ORES_0 | ORES_1 | ORES_2 | ORES_3 | ORES_4 | ORES_5 | ORES_6 |
+--------------+--------+--------+--------+--------+--------+--------+--------+
|            0 |  68218 |      0 |      0 |      0 |      0 |      0 |      0 |
|            1 |      3 | 618819 | 204187 |  27523 |   1847 |   3562 |   3261 |
|            2 |      0 |   5565 |  69323 |  24777 |   1390 |   7496 |    350 |
|            3 |      0 |     71 |    472 |  14361 |   1861 |   7412 |    572 |
|            4 |      0 |      5 |     10 |   2948 |   2361 |   2978 |   1208 |
|            5 |      0 |      0 |     16 |     59 |    136 |   1056 |    161 |
|            6 |      0 |      0 |      0 |     35 |    190 |    188 |    782 |
+--------------+--------+--------+--------+--------+--------+--------+--------+
7 rows in set (3.70 sec)

(the label is set to zero if the quality is unknown, possibly due to the page being deleted)
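
One way to summarize the table is the share of pages on its diagonal, where both scores agree. The sketch below copies the counts from the table and assumes that rows are pe_qualidade and columns are the ORES prediction (pe_qores); the function name is made up for illustration.

```python
# Rows: pe_qualidade 0..6; columns: ORES prediction 0..6.
# Counts copied from the table above.
matrix = [
    [68218, 0, 0, 0, 0, 0, 0],
    [3, 618819, 204187, 27523, 1847, 3562, 3261],
    [0, 5565, 69323, 24777, 1390, 7496, 350],
    [0, 71, 472, 14361, 1861, 7412, 572],
    [0, 5, 10, 2948, 2361, 2978, 1208],
    [0, 0, 16, 59, 136, 1056, 161],
    [0, 0, 0, 35, 190, 188, 782],
]


def agreement(rows):
    """Fraction of pages for which both scores are identical."""
    total = sum(sum(row) for row in rows)
    same = sum(rows[i][i] for i in range(len(rows)))
    return same / total


# Drop label 0 (unknown quality) to look only at pages with a known level.
rated = [row[1:] for row in matrix[1:]]

print(f"all pages:        {agreement(matrix):.3f}")
print(f"known levels 1-6: {agreement(rated):.3f}")
```

Exact agreement is a strict measure: it treats a 1-versus-2 disagreement the same as a 1-versus-6 one, so it is only a first look at the table.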

Jun 23 2020, 8:57 PM · Machine Learning Platform (Current), ORES, artificial-intelligence

Jun 19 2020

He7d3r added a comment to T157271: Web-based AutoWikiBrowser alternative.

See also: https://en.wikipedia.org/wiki/User:Joeytje50/JWB

Jun 19 2020, 7:33 PM · Wikimedia-Hackathon-2017, Community-Wishlist-Survey-2016, AutoWikiBrowser

Jun 18 2020

He7d3r added a comment to T255796: Twinkle gadget broken on Telugu Wikipedia..

Possibly related to https://github.com/azatoth/twinkle/commit/8f7b2f367276c6cf8e0ef78b82d9957415221780

Jun 18 2020, 5:19 PM · Reading-Web-Local-Wiki-Issues
He7d3r updated the task description for T252447: Notify gadget users to update Vector scripts and styles.
Jun 18 2020, 11:22 AM · Tech-Ambassadors, Readers-Web-Backlog (Tracking), Desktop Improvements, Vector (Vector (Tracking)), User-notice

Jun 14 2020

He7d3r added a comment to T255367: Global script is not loaded on debug=true.

I confirmed this by replacing the contents of my global.js with console.log( 'Started global.js.' ); and then loading
https://pt.wikipedia.org/wiki/Special:BlankPage?debug=true
There should be a log in the console, but it was not there.

Jun 14 2020, 2:01 PM · Performance-Team, MediaWiki-ResourceLoader, GlobalCssJs

Jun 5 2020

He7d3r added a comment to T246668: Create follow-up edit quality campaign for ptwikipedia.

Progress (100% done, -19 labels left):


https://labels.wmflabs.org/stats/ptwiki/93

Jun 5 2020, 2:02 PM · Machine Learning Platform (Current), editquality-modeling, Wikilabels, artificial-intelligence

May 23 2020

He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

I've submitted https://github.com/wikimedia/articlequality/pull/132

May 23 2020, 10:59 AM · Machine Learning Platform (Current), ORES, artificial-intelligence

May 22 2020

He7d3r committed rOWC05578ef2334f: Build new ptwiki model with data since 2014 (authored by He7d3r).
Build new ptwiki model with data since 2014
May 22 2020, 11:44 PM
He7d3r committed rOWC340b621c5ac0: Update class sizes and pop-rates (authored by He7d3r).
Update class sizes and pop-rates
May 22 2020, 9:04 PM
He7d3r committed rOWC1346f67c6478: Update Makefile to remove revisions older than 2014 (authored by He7d3r).
Update Makefile to remove revisions older than 2014
May 22 2020, 8:00 PM
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

Updated info (as of commit c3a66b0 plus the specific changes which define each of the tests):

accuracy (micro=0.8, macro=0.861):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.781  0.827  0.877  0.899  0.875  0.908
$ cat datasets/ptwiki.labelings.20200301.remove_bots.json | json2tsv wp10 | sort | uniq -c
 145657 1
  32807 2
   6177 3
   2346 4
   1646 5
   1542 6
$ cat datasets/ptwiki.balanced_labelings.9k_2020.remove_bots.json | json2tsv wp10 | sort | uniq -c
   1500 1
   1500 2
   1500 3
   1500 4
   1500 5
   1328 6
accuracy (micro=0.81, macro=0.875):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.799  0.806  0.867  0.915  0.928  0.933
$ cat datasets/ptwiki.labelings.20200301.since_2014.json | json2tsv wp10 | sort | uniq -c
   7537 1
   3346 2
   1276 3
    690 4
    653 5
    684 6
$ cat datasets/ptwiki.balanced_labelings.9k_2020.since_2014.json | json2tsv wp10 | sort | uniq -c
   1500 1
   1500 2
   1276 3
    690 4
    653 5
    684 6
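
The macro figures in the two reports above are consistent with an unweighted mean of the six per-class values. The sketch below checks that, assuming this is what "macro" means here; the list names are illustrative and simply follow the order in which the reports appear.

```python
from statistics import mean

# Per-class accuracy values copied from the two reports above.
first_report = [0.781, 0.827, 0.877, 0.899, 0.875, 0.908]
second_report = [0.799, 0.806, 0.867, 0.915, 0.928, 0.933]


def macro_average(per_class):
    """Unweighted mean of the per-class values, rounded to 3 decimals."""
    return round(mean(per_class), 3)


print(macro_average(first_report))
print(macro_average(second_report))
```

Because every class counts equally in this average, the small classes (4, 5 and 6) influence it as much as the much larger class 1.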
May 22 2020, 5:45 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r created T253388: Automatically create task on Phabricator based on Issues from Github repositories.
May 22 2020, 3:59 PM · Technical-Tool-Request, User-Majavah

May 21 2020

He7d3r committed rOWCa75e9327a258: Remove bots assessments from dataset (authored by He7d3r).
Remove bots assessments from dataset
May 21 2020, 8:50 PM
He7d3r committed rOWCae7bacd2d60c: Fix AttributeError when revision.user is None (authored by He7d3r).
Fix AttributeError when revision.user is None
May 21 2020, 6:08 PM

May 20 2020

He7d3r committed rOWCd45172394f86: Convert page id to string explicitly (authored by He7d3r).
Convert page id to string explicitly
May 20 2020, 8:44 PM
He7d3r committed rOWCeb97707eee3a: Convert page id to string explicitly (authored by He7d3r).
Convert page id to string explicitly
May 20 2020, 8:38 PM
He7d3r committed rOWC4a5095cff48c: Remove unused user (authored by He7d3r).
Remove unused user
May 20 2020, 2:31 PM
He7d3r committed rOWC4b7381456f0d: Add user to tests (authored by He7d3r).
Add user to tests
May 20 2020, 2:26 PM
He7d3r committed rOWC8c9042633b58: Remove bots assessments from dataset (authored by He7d3r).
Remove bots assessments from dataset
May 20 2020, 10:08 AM

May 19 2020

He7d3r committed rOWCac1c8df235e4: Remove bots assessments from dataset (authored by He7d3r).
Remove bots assessments from dataset
May 19 2020, 9:45 PM
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

@Halfak: Oops... I missed the -v flag when I used grep to remove the bot assessments. So, instead of considering only human assessments, I extracted only the bot assessments! Once I add that flag, the number of assessments by humans seems more reasonable:

$ cat datasets/ptwiki.labelings.20200301.user.json |grep -v -P '"user": "[^"]*([Bb][Oo][Tt]|[Rr][Oo][Bb][ÔôOo])[^"]*"' | json2tsv wp10 | sort | uniq -c
  28403 1
  13343 2
   5329 3
   2209 4
   1458 5
   1281 6

In this case, the likely explanation for such a high accuracy is that the bots' assessments are very predictable (they are hardcoded in their DNA, that is, their code ;-).
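
The same filter can be sketched in Python, which makes the keep/drop direction harder to get wrong than a missing -v. This is a rough equivalent rather than the script used for the numbers above: it assumes one JSON object per line with "user" and "wp10" fields (the names used in the commands above), and the helper names are hypothetical.

```python
import json
import re
from collections import Counter

# Same idea as the grep pattern: user names containing "bot" or
# "robô"/"robo", in any letter case.
BOT_NAME = re.compile(r"bot|rob[ôo]", re.IGNORECASE)


def is_bot(user):
    """True if the user name looks like a bot account."""
    return bool(user) and BOT_NAME.search(user) is not None


def count_human_labels(lines):
    """Count wp10 labels in JSON-lines input, skipping bot users."""
    counts = Counter()
    for line in lines:
        observation = json.loads(line)
        if not is_bot(observation.get("user")):
            counts[observation["wp10"]] += 1
    return counts
```

Usage would be along the lines of `count_human_labels(open("datasets/ptwiki.labelings.20200301.user.json"))`. Like the grep version, a name match is only a heuristic: it can miss bots with unusual names and drop humans whose names happen to contain "bot".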

May 19 2020, 8:41 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T209387: Update documentation for ArticleQuality.js.

For future reference: there is now a translation at https://pt.wikipedia.org/wiki/User:EpochFail/ArticleQuality

May 19 2020, 10:57 AM · artificial-intelligence, articlequality-modeling, Machine Learning Platform (Current)

May 18 2020

He7d3r added a comment to T246667: Build draft quality model for ptwikipedia.

@GoEThe: in case you have any suggestions for better images for this purpose, we can try changing them. @Halfak suggested https://commons.wikimedia.org/wiki/Category:OOUI_icons as a good source of icons we could use.

May 18 2020, 10:28 PM · Machine Learning Platform (Current), editquality-modeling, Wikilabels, artificial-intelligence
He7d3r added a comment to T246667: Build draft quality model for ptwikipedia.

@GoEThe : I see you've installed the version of the script I mentioned at T246667#6079484. Did you have the chance to test it on Special:Newpages? Is it good enough for us to publicize it for other users?

May 18 2020, 9:01 PM · Machine Learning Platform (Current), editquality-modeling, Wikilabels, artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

PS: I didn't change the thresholds in the Makefile, so the samples were not as balanced as might be wanted:
(Note: By mistake, I forgot the -v flag in the grep above, so the results for the first case are inverted, that is, they contain bot_only, instead of no_bots)

$ cat datasets/ptwiki.balanced_labelings.9k_2020.no_bots.json | json2tsv wp10 | sort | uniq -c
   1500 1
   1500 2
    759 3
     20 4
     95 5
    203 6
$ cat datasets/ptwiki.balanced_labelings.9k_2020.since_2014.json | json2tsv wp10 | sort | uniq -c
   1500 1
   1500 2
   1247 3
    674 4
    630 5
    654 6
May 18 2020, 5:51 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

Wow. This is really awesome. I wonder what would happen if we retrained the models on recent data only. In enwiki we found that the definition of quality changed over time. There should be plenty of observations after 2014 to give us a good signal.

May 18 2020, 5:49 PM · Machine Learning Platform (Current), ORES, artificial-intelligence

May 15 2020

He7d3r reassigned T250704: Internal links on comment/summary point to Wikilabels instead of the target wiki from He7d3r to Halfak.
May 15 2020, 11:13 AM · Machine Learning Platform, Wikilabels
He7d3r closed T250704: Internal links on comment/summary point to Wikilabels instead of the target wiki, a subtask of T252280: Improve Wikilabels UI, as Resolved.
May 15 2020, 11:12 AM · Machine Learning Platform (Current), Wikilabels, Wikimedia-Hackathon-2020
He7d3r closed T250704: Internal links on comment/summary point to Wikilabels instead of the target wiki as Resolved.

Halfak fixed this in https://github.com/wikimedia/wikilabels/pull/263

May 15 2020, 11:12 AM · Machine Learning Platform, Wikilabels

May 12 2020

He7d3r added a comment to T252441: Wikilabels: SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY.

See https://github.com/wikimedia/wikilabels/pull/264

May 12 2020, 12:09 PM · Machine Learning Platform (Current), Wikilabels

May 11 2020

He7d3r added a comment to T252441: Wikilabels: SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY.

I've tried to do this:

diff --git a/Dockerfile b/Dockerfile
index 5fc2d0d..19f39a5 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -5,7 +5,8 @@ RUN apt-get update && apt-get install -y \
     g++ \
     python3-dev \
     libmemcached-dev \
-    libz-dev
+    libz-dev \
+    memcached
May 11 2020, 7:21 PM · Machine Learning Platform (Current), Wikilabels
He7d3r created T252441: Wikilabels: SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY.
May 11 2020, 6:51 PM · Machine Learning Platform (Current), Wikilabels

May 10 2020

He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

After patching¹ the extractor to also collect user names, I found that these are the top 10 users who added/modified the most assessments:

FMTbot            91366
Rei-bot           25155
BotStats          15830
Fabiano Tatsch    14829
Leandro Drudo      4660
GoEThe             3172
Burmeister         3128
Rei-artur          2965
FilRBot            2444
VítoR Valente      1895

Then I produced² the following graphs showing the number of labels added/modified by bots³ by year, for each of the six quality levels. There are many quality 1 and 2 assessments made by bots.

May 10 2020, 10:22 PM · Machine Learning Platform (Current), ORES, artificial-intelligence

May 9 2020

He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

Here are some graphs showing the evolution of the assessments extracted from ptwiki:

May 9 2020, 5:45 PM · Machine Learning Platform (Current), ORES, artificial-intelligence

May 8 2020

He7d3r added a comment to T251171: Add `words_to_watch` to articlequality and draftquality models in ptwiki.

It occurred to me that some of these expressions are also used by Salebot¹, with the difference that in the bot config² users assign a score to each word/regex indicating how much it contributes towards classifying an edit as needing to be reverted. This allows it to "ignore" words which are common in good edits, unless there are too many of them.
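
The scoring idea can be illustrated with a small sketch. It is not Salebot's code or configuration: the patterns, scores and threshold below are invented for the example, and only the general mechanism (each match adds its score, and the total is compared with a threshold) follows the description above.

```python
import re

# Hypothetical (pattern, score) pairs; real lists are maintained on-wiki.
SCORED_PATTERNS = [
    (re.compile(r"\bbest\b", re.IGNORECASE), 1),     # common in good edits too
    (re.compile(r"\bbuy now\b", re.IGNORECASE), 5),  # rarely legitimate
    (re.compile(r"(.)\1{5,}"), 3),                   # same character 6+ times
]


def suspicion_score(text):
    """Sum the score of every pattern match found in the text."""
    total = 0
    for pattern, score in SCORED_PATTERNS:
        total += score * len(pattern.findall(text))
    return total


def looks_bad(text, threshold=5):
    """True once the accumulated score reaches the threshold."""
    return suspicion_score(text) >= threshold
```

With these made-up values a single "best" is ignored, while five of them in one edit reach the threshold, which is the "unless there are too many of them" behaviour described above.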

May 8 2020, 12:08 PM · Machine Learning Platform (Current), artificial-intelligence

May 7 2020

He7d3r added a comment to T252152: Extracted labels might not be accurate when there are multiple reverts.

See https://github.com/wikimedia/articlequality/pull/127 for a possible solution.

May 7 2020, 6:58 PM · Machine Learning Platform (Current), artificial-intelligence, articlequality-modeling
He7d3r created T252152: Extracted labels might not be accurate when there are multiple reverts.
May 7 2020, 6:53 PM · Machine Learning Platform (Current), artificial-intelligence, articlequality-modeling

May 6 2020

He7d3r added a comment to T158916: Store/read informals, badwords, stopwords and other language assets on a wiki page.

It is not uncommon for a good-faith edit to add a new expression (or a badly written regex) to such lists and thereby break (to some extent) the tools which use them (e.g. by increasing their false positives).

May 6 2020, 6:30 PM · artificial-intelligence, revscoring, Machine Learning Platform
He7d3r added a comment to T158916: Store/read informals, badwords, stopwords and other language assets on a wiki page.

Here are some examples of existing lists, of varying quality and formats, used by other tools:

May 6 2020, 6:24 PM · artificial-intelligence, revscoring, Machine Learning Platform
He7d3r added a comment to T251608: Text fetched by articlequality's `fetch_text` might not match the talk page label (for moved pages).

I've updated the patch.

May 6 2020, 6:09 PM · Machine Learning Platform (Current), artificial-intelligence, articlequality-modeling
He7d3r added a comment to T158916: Store/read informals, badwords, stopwords and other language assets on a wiki page.

Is this still wanted nowadays?

May 6 2020, 5:44 PM · artificial-intelligence, revscoring, Machine Learning Platform

May 5 2020

He7d3r created T251904: New Wikitext Editor: Unable to add new group of tools to VisualEditor's toolbar .
May 5 2020, 2:43 PM · VisualEditor
He7d3r added a comment to T251608: Text fetched by articlequality's `fetch_text` might not match the talk page label (for moved pages).

While the dumps are processed, we could store the <id> of the talk pages instead of their <title>s. Then, an API query such as
https://pt.wikipedia.org/w/api.php?action=query&format=json&prop=info&pageids=18363&formatversion=2&inprop=subjectid
will return the <id> of the associated subject page (the one whose text we are interested in). This should work when pages are moved, since page moves do not change the pageid (but it is not guaranteed if the page is deleted and restored).
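
A minimal sketch of that lookup, without performing the request itself: the query parameters are the ones from the URL above, while the function names and the assumed response layout (with formatversion=2, query.pages as a list whose entries carry a "subjectid" key) are assumptions to be verified against a real response.

```python
from urllib.parse import urlencode

API_URL = "https://pt.wikipedia.org/w/api.php"


def subject_query_url(talk_page_id):
    """Build the query URL for the subject page of a talk page id."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "info",
        "pageids": talk_page_id,
        "formatversion": 2,
        "inprop": "subjectid",
    }
    return API_URL + "?" + urlencode(params)


def subject_id_from_response(data):
    """Return the subject page id, or None if it is missing.

    Assumes the formatversion=2 layout, where query.pages is a list.
    """
    pages = data.get("query", {}).get("pages", [])
    if not pages:
        return None
    return pages[0].get("subjectid")
```

Returning None instead of raising keeps the deleted-page case explicit, so the caller can decide whether to skip the label or fall back to the title.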

May 5 2020, 11:12 AM · Machine Learning Platform (Current), artificial-intelligence, articlequality-modeling

May 1 2020

He7d3r created T251608: Text fetched by articlequality's `fetch_text` might not match the talk page label (for moved pages).
May 1 2020, 3:00 PM · Machine Learning Platform (Current), artificial-intelligence, articlequality-modeling
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

Could the number of labels per article have a negative impact on the quality of the model?
These are the frequencies of the number of labels per page in the full set and in the 9k sample:

$ cat ptwiki.labelings.20200301.json | json2tsv page_title | sort | uniq -c | cut -c-8 | sort |uniq -c
 181477       1 
   3042       2 
    517       3 
    100       4 
     19       5 
      2       6 
      2       7 
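
The same "count of counts" can be computed in Python, which avoids relying on the column width passed to cut. A minimal sketch, assuming the page titles have already been read from the dataset; the function name is illustrative.

```python
from collections import Counter


def labels_per_page_histogram(page_titles):
    """Map 'number of labels on a page' to 'number of such pages'."""
    per_page = Counter(page_titles)      # title -> number of labels
    return Counter(per_page.values())    # number of labels -> number of pages


# Toy input: A and D have one label each, B has two, C has three.
histogram = labels_per_page_histogram(["A", "B", "B", "C", "C", "C", "D"])
for labels, pages in sorted(histogram.items()):
    print(f"{pages:7d} {labels:7d}")
```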
May 1 2020, 12:10 PM · Machine Learning Platform (Current), ORES, artificial-intelligence

Apr 28 2020

He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

The following pull request is related to improving the articlequality model: https://github.com/wikimedia/articlequality/pull/122

Apr 28 2020, 11:49 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T251171: Add `words_to_watch` to articlequality and draftquality models in ptwiki.

See https://github.com/wikimedia/articlequality/pull/122 for another possible explanation for the problem:

It does include the new articles matched beyond the ER# tags. Could it be possible that we're not matching the features effectively? Maybe we could generate a sample of articles and the values of the features for the spam and vandalism articles. Maybe there's a bug in the extraction that is hard to see.

I didn't re-extract the data or retrain the model after the change to verify its impact on the metrics, but I believe it might help by improving the dataset quality.

Apr 28 2020, 11:41 PM · Machine Learning Platform (Current), artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

@Halfak do you have a quick way to get how many assessments were made by each user in the dataset ptwiki.balanced_labelings.*_2020.json which was used for the articlequality model? Are we getting labels from a diverse set of users or mostly from just a few users?

Apr 28 2020, 5:40 PM · Machine Learning Platform (Current), ORES, artificial-intelligence

Apr 27 2020

Krinkle awarded T121516: Add opt-out statistics for default gadgets listed at Special:GadgetUsage an Orange Medal token.
Apr 27 2020, 11:40 PM · Patch-For-Review, MediaWiki-extensions-Gadgets
He7d3r updated subscribers of T251171: Add `words_to_watch` to articlequality and draftquality models in ptwiki.

It could be. For example, @Darwinius noticed that images loaded from Wikidata are not counted:
https://www.mediawiki.org/wiki/ORES/Issues/Article_quality?diff=3804470
I wouldn't be surprised if there was some obscure problem with feature extraction.

Apr 27 2020, 9:56 PM · Machine Learning Platform (Current), artificial-intelligence
He7d3r updated subscribers of T251171: Add `words_to_watch` to articlequality and draftquality models in ptwiki.

@GoEThe Correct me if I'm mistaken, but I believe a reasonable number of new articles containing vandalism or spam would include expressions such as the words_to_watch mentioned by Halfak. For reference, the expressions are listed at
https://github.com/wikimedia/revscoring/blob/76c737f2998bbba5b5dd942823f43383f1a4b47e/revscoring/languages/portuguese.py#L153-L189

Apr 27 2020, 9:46 PM · Machine Learning Platform (Current), artificial-intelligence
He7d3r added a comment to T251171: Add `words_to_watch` to articlequality and draftquality models in ptwiki.

That is odd. Does this tuning report reflect only the changes in the ptwiki features, or does it also include the other articles added to the dataset, as mentioned at T246667#6067366?

Apr 27 2020, 9:42 PM · Machine Learning Platform (Current), artificial-intelligence

Apr 23 2020

He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

@Halfak: what should be considered as a true positive in these multi-class classification problems? (when filling the template misclassification report at mw:ORES/Issues/Article quality)
Would a "featured article" be a "positive" case for the articlequality model, and any other level is a "negative"? Or something else?
What about the draftquality model? (In this case there does not even seem to be any implicit order between the classes, e.g. nothing like OK < spam < unsuitable.)

Apr 23 2020, 10:37 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

@He7d3r has updated the message. What do you think about it now? I think we should see some input from the community on the model (articlequality) soon. In the meantime what can we achieve for draftquality?

Apr 23 2020, 1:46 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T246667: Build draft quality model for ptwikipedia.

Here is an updated version of @Halfak's script, css, and loader code:

Apr 23 2020, 1:34 PM · Machine Learning Platform (Current), editquality-modeling, Wikilabels, artificial-intelligence

Apr 22 2020

He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

(...)
When we don't, that probably indicates that there are many articles with similar qualities ("features" in the ML literature) that fell across a wide spectrum of classes.

Apr 22 2020, 4:57 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

How should we interpret the different weighted sum values (shown in parentheses) for articles such as
https://pt.wikipedia.org/wiki/Ambientalismo
and
https://pt.wikipedia.org/wiki/Mesas_girantes
which have predictions 6 (5.74) and 6 (4.88) respectively?
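
One possible reading, offered as an assumption rather than a confirmed description of the gadget: the value in parentheses is the probability-weighted mean of the class levels, so two articles can share the most likely class while differing in how concentrated the probabilities are. The probability distributions below are invented for illustration and are not the real ORES output for those pages.

```python
def weighted_sum(probabilities):
    """Probability-weighted mean of the class levels (1..6)."""
    return sum(level * p for level, p in probabilities.items())


def predicted_class(probabilities):
    """Class with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical distributions: both peak at 6, one far more sharply.
confident = {1: 0.0, 2: 0.0, 3: 0.02, 4: 0.04, 5: 0.12, 6: 0.82}
hesitant = {1: 0.02, 2: 0.05, 3: 0.10, 4: 0.13, 5: 0.20, 6: 0.50}

for name, probs in (("confident", confident), ("hesitant", hesitant)):
    print(name, predicted_class(probs), round(weighted_sum(probs), 2))
```

Under this reading, a lower value next to the same prediction would mean that the model also gives noticeable probability to lower quality levels.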

Apr 22 2020, 4:02 PM · Machine Learning Platform (Current), ORES, artificial-intelligence

Apr 21 2020

He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

Out of the 974 good articles (quality 5) on ptwiki:

  • 12 (1.2%) are predicted as having quality 3
  • 32 (3.3%) are predicted as having quality 4
  • 796 (81.7%) are predicted as having quality 5 (Good article)
  • 134 (13.8%) are predicted as having quality 6 (Featured article)

This table shows the specific articles:
https://pt.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:Artigos_bons/Conte%C3%BAdo&oldid=58092346

Apr 21 2020, 8:12 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

Out of the 951 featured articles (quality 6) on ptwiki:

  • 3 (0.3%) are predicted as having quality 3
  • 37 (3.9%) are predicted as having quality 4
  • 138 (14.5%) are predicted as having quality 5 (Good article)
  • 773 (81.3%) are predicted as having quality 6 (Featured article)

This table shows the specific articles:
https://pt.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:P%C3%A1gina_de_testes/1&oldid=58092216

Apr 21 2020, 7:57 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

Also, it would be interesting to generate a list of articles where the ORES quality prediction differs from the current automatic assessment provided by the local Lua module
https://pt.wikipedia.org/wiki/Module:Avalia%C3%A7%C3%A3o

Apr 21 2020, 3:30 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

If I'm not mistaken, we can get the earliest revision of the article using something like this:
https://pt.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=revisions&list=&titles=Gita%20Ramjee&rvlimit=1&rvdir=newer
and then compare the timestamp with some (configurable) delta from the current time.
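
The comparison step could look like the sketch below. It assumes the timestamp of the earliest revision has already been fetched and is in the usual MediaWiki ISO 8601 form (e.g. 2020-04-01T12:00:00Z); the function names and the 30-day default are placeholders, not an agreed value.

```python
from datetime import datetime, timedelta, timezone


def parse_mw_timestamp(timestamp):
    """Parse a timestamp such as '2020-04-01T12:00:00Z' as UTC."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_new_article(first_revision_timestamp, max_age=timedelta(days=30), now=None):
    """True if the first revision is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - parse_mw_timestamp(first_revision_timestamp) <= max_age
```

Passing `now` explicitly makes the check easy to test; in the gadget itself the current time would be used.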

Apr 21 2020, 3:02 PM · Machine Learning Platform (Current), ORES, artificial-intelligence
He7d3r added a comment to T250809: Review model performance for ptwiki 'articlequality' and 'draftquality'.

Maybe only display the draftquality if the article was created no more than X days ago? Or has no more than Y revisions? I don't know if there is something like an "isNewArticle" flag available to us.

Apr 21 2020, 2:54 PM · Machine Learning Platform (Current), ORES, artificial-intelligence

Apr 20 2020

He7d3r added a watcher for artificial-intelligence: He7d3r.
Apr 20 2020, 5:20 PM
He7d3r updated the task description for T250704: Internal links on comment/summary point to Wikilabels instead of the target wiki.
Apr 20 2020, 3:20 PM · Machine Learning Platform, Wikilabels
He7d3r created T250704: Internal links on comment/summary point to Wikilabels instead of the target wiki.
Apr 20 2020, 3:06 PM · Machine Learning Platform, Wikilabels
He7d3r added a comment to T163099: lastmodifiedat shows the time of the last edit on the page itself, but it should be affected by templates or completely removed.

Maybe mention both pieces of information somehow? E.g.:
This page was last edited on 12 December 2019, at 20:04, and some of its templates were edited on 20 April 2020, at 20:04.
or
This page was last edited directly on 12 December 2019, at 20:04, but it might show more recent content from other pages.

Apr 20 2020, 11:57 AM · patch-welcome, MediaWiki-Interface, I18n
He7d3r updated the task description for T63007: Allow specifying when a gadget should load (conditional, page title, action or namespace).
Apr 20 2020, 11:43 AM · Patch-For-Review, Wikimedia-Israel-Hackers, MediaWiki-extensions-Gadgets

Apr 19 2020

He7d3r added a project to T250635: Scoring example no longer works: revscoring.
Apr 19 2020, 10:29 PM · Machine Learning Platform (Current), Documentation, artificial-intelligence, revscoring