Currently, if an anonymous IP user creates a new article on Wikipedia, the CopyPatrol interface will show the text "editor not found". Instead, we should show the user's IP address and provide appropriate links. Don't show edit count for IPs.
Description
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | None | T116957 Plagiarism detection tools for text (tracking) | |||
Resolved | • TBolliger | T120435 Improve the plagiarism detection bot | |||
Resolved | • TBolliger | T131583 Epic: Make a tool labs interface for Plagiabot aka Eranbot | |||
Resolved | MusikAnimal | T137382 Have CopyPatrol output editor's IP address if they are anonymous |
Event Timeline
I can't imagine this being very useful, honestly. It'll take another join on the Revision table because the user table does not contain information about IP addresses, making the tool even slower.
And we do need the join on the user table to get the editcount.
Aren't we already querying against the revision table to get the user ID? or is that something returned from Eranbot's database? If we're already querying against the revision table, it should be pretty simple: just include the rev_user_text in the query, and if the user ID is 0, output the user text.
Aren't we already querying against the revision table to get the user ID? or is that something returned from Eranbot's database? If we're already querying against the revision table, it should be pretty simple: just include the rev_user_text in the query, and if the user ID is 0, output the user text.
You're right, I am already using the revision table but weirdly, it doesn't seem to contain IP address information.
MariaDB [enwiki_p]> SELECT rev_user_text, rev_user FROM revision LIMIT 10\G; *************************** 1. row *************************** rev_user_text: rev_user: 0 *************************** 2. row *************************** rev_user_text: rev_user: 0 *************************** 3. row *************************** rev_user_text: rev_user: 0 *************************** 4. row *************************** rev_user_text: rev_user: 0 *************************** 5. row *************************** rev_user_text: rev_user: 856 *************************** 6. row *************************** rev_user_text: rev_user: 62 *************************** 7. row *************************** rev_user_text: rev_user: 188 *************************** 8. row *************************** rev_user_text: rev_user: 0 *************************** 9. row *************************** rev_user_text: rev_user: 90 *************************** 10. row *************************** rev_user_text: rev_user: 3903 10 rows in set (0.00 sec)
I'll investigate this a bit.
Hmm, I believe it should be storing the IP addresses in the rev_user_text field, but I could be wrong. Also, you may need to do a LEFT OUTER JOIN in the getUserDetails query (rather than a simple JOIN) so that it includes the records where there is no corresponding user table entry (which will be the case with anon IP users).
OK, I'm super confused. If I do...
MariaDB [enwiki_p]> select * FROM revision where rev_user = 0 LIMIT 10; +-----------+----------+-------------+----------------------------------+----------+-------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+ | rev_id | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text | rev_timestamp | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1 | rev_content_model | rev_content_format | +-----------+----------+-------------+----------------------------------+----------+-------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+ | 519207843 | 0 | 0 | /* Beschreibung */ Kleinigkeiten | 0 | Der-wuppertaler | 20121004222920 | 0 | 0 | 2836 | 0 | dt0qefs4iu6bysy4g3elu33z0w07vn7 | NULL | NULL | | 519207844 | 0 | 0 | /* Auszeichnungen */ | 0 | Der-wuppertaler | 20121004222956 | 0 | 0 | 2837 | 0 | 1s7bjkqw61zqgsc5j2lfxjvbc9hr0te | NULL | NULL | | 862220 | 10 | 0 | Automated conversion | 0 | Conversion script | 20020225154311 | 1 | 0 | 35 | 233192 | i8pwco22fwt12yp12x29wc065ded2bh | NULL | NULL | | 381200179 | 10 | 0 | | 0 | 76.28.186.133 | 20100826222351 | 0 | 0 | 1875 | 133452289 | im247wfm4kfzwwu24bgw3hnxkjd1gvv | NULL | NULL | | 18201 | 12 | 0 | Automated conversion | 0 | Conversion script | 20020225150022 | 1 | 0 | 9546 | 332419362 | 07sqam7073877kptdznnip3viznphpy | NULL | NULL | | 19746 | 12 | 0 | * | 0 | 140.232.153.45 | 20020225154311 | 0 | 0 | 11279 | 18201 | px5ovjydixhpbysqmqiw7floem9ii5i | NULL | NULL | | 19749 | 12 | 0 | * | 0 | 24.188.31.147 | 20020227173409 | 0 | 0 | 11394 | 19746 | ir7scjzn4w7y93it75q41tamjrfd3nd | NULL | NULL | | 20514 | 12 | 0 | * | 0 | 24.188.31.147 | 20020227173641 | 0 | 0 | 11486 | 19749 | o139okh8ak7f6x7f6hwdogun10kucpy | NULL | NULL | | 42733 | 12 | 0 | term in bold | 0 | 213.253.39.175 | 20020301001317 | 0 | 0 | 11488 | 20514 | jmq3csxbr1t85ko3qkt0i8d3dtjlc2v | NULL | NULL | | 42738 | 12 | 0 | * | 0 | 206.82.16.35 | 20020402095125 | 0 | 0 | 11546 | 42733 | q0yfrgc9zuz5vcsjsb40301rn49emow | NULL | NULL | +-----------+----------+-------------+----------------------------------+----------+-------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
...but...
MariaDB [enwiki_p]> select rev_user_text, rev_user FROM revision where rev_user = 0 LIMIT 10; +---------------+----------+ | rev_user_text | rev_user | +---------------+----------+ | | 0 | | | 0 | | | 0 | | | 0 | | | 0 | | | 0 | | | 0 | | | 0 | | | 0 | | | 0 | +---------------+----------+
Probably there is a covering index for the second query, and using that results in different ordering. (Remember, SQL does not give you any guarantee of sorting consistency if you don't use ORDER BY.) Rows with empty rev_user_text might be the residue of some old bug or maybe that column didn't even exist at the time the first few revisions have been made.
Fixed with this commit, included in the all mighty Wikimania branch!
Indeed, the revision table has what we want, so we just need to do a LEFT JOIN to get the rows from revision that don't match ON r.rev_user = u.user_id (since anons don't have a user_id). No noticeable affect on runtime.
For performance, we should probably be fetching data about the revisions in a single query (so 100 at a time) like WHERE r.rev_id IN (1234, 1235, 1236, ...); I bet that would speed it up quite a bit. You could do the same for the API queries, which have a 50-title limit