Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | leila | T171224 [Objective 9.1.1] Article expansion recommendations
Resolved | | diego | T190770 Improve section translation classifier
Resolved | | diego | T190771 Improve section synonym classifier
Resolved | | diego | T203044 Output 1.2: Section recommendation algorithm in many languages
Resolved | | diego | T190772 Build the first version of section recommender by fusing the synonym and translator models
Event Timeline
Hi @bmansurov ,
I'm cleaning up my code and found that my parser produces duplicated output: each row appears twice. The two copies are not adjacent, meaning that line 1 is not repeated on line 2, but on some line X with X > 2. Can you please have a look here and try to guess what I am doing wrong?
Of course I can add a post-filter, but I would love to understand what is happening.
Thanks
@diego I looked at your code briefly and tested it with lang=uz, and the output JSON didn't contain any duplicate rows. Can you paste one of the duplicate rows from ruwiki maybe?
@bmansurov, interesting. I've tried with 'uz' and also don't see anything repeated. Given that the 'uz' current dump is a single file, that makes me think it is something related to the parallelization.
For Russian, if you parse the dump from 20180801 you will see the repeated lines shown below. The first number is the line number, which I think can change from run to run due to the parallelization.
To check for duplicated rows I'm doing:
$ wc -l sections-articles_ru.json
6842290 sections-articles_ru.json
$ sort -u sections-articles_ru.json | wc -l
3421145
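The two counts are exactly 2:1, which is consistent with every row appearing twice. As a rough sketch (not the code used in this task), the same check can be done in Python while also recording on which lines each copy appears. The function name `find_duplicate_rows` is hypothetical, and the approach keeps every distinct row in memory, which is assumed to be acceptable for a one-off check.

```python
from collections import defaultdict


def find_duplicate_rows(lines):
    """Map each repeated row to the 1-based line numbers where it occurs."""
    positions = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # Strip only the trailing newline so rows compare by content.
        positions[line.rstrip("\n")].append(number)
    # Keep only the rows that were seen more than once.
    return {row: nums for row, nums in positions.items() if len(nums) > 1}


# Tiny in-memory example; any iterable of lines works, including an open file.
sample = ['[1007, "A", {}]\n', '[1008, "B", {}]\n', '[1007, "A", {}]\n']
print(find_duplicate_rows(sample))  # {'[1007, "A", {}]': [1, 3]}
```

To run it on the real output, pass an open file handle, for example `find_duplicate_rows(open("sections-articles_ru.json", encoding="utf-8"))`.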
Here is an example of repeated lines:
2966167:[1852270, "\u0423\u043b\u0438\u0446\u0430 \u0410\u0443\u0448\u0440\u043e\u0441 \u0412\u0430\u0440\u0442\u0443", {"\u041f\u0440\u0438\u043c\u0435\u0447\u0430\u0442\u0435\u043b\u044c\u043d\u044b\u0435 \u0437\u0434\u0430\u043d\u0438\u044f": {"links": ["Q6877", "Q4527630", "Q170292", "Q4168072", "Q2479493", "-1", "Q4314885", "Q51879601", "Q2043", "-1", "-1", "-1", "-1", "Q2094", "Q189164", "Q243472", "-1", "Q948392", "-1", "Q77430", "-1", "Q1341702", "Q458245", "Q2088144", "-1", "-1", "Q1987", "-1", "Q46825", "-1", "Q1984809", "Q336754", "Q6082", "Q336754", "Q128399", "Q6204", "Q712444", "-1", "Q7008", "Q948392", "Q361", "Q362", "Q170463", "Q7017", "Q6955", "Q6785", "-1", "Q6955", "-1", "Q6994", "Q2088144", "Q79822", "Q458245", "-1", "Q1961083", "-1", "Q2663670", "Q362", "-1", "-1", "-1", "-1", "Q7018", "Q7017", "Q3142", "-1", "-1", "Q23444", "Q23445", "Q7017", "Q4168072", "-1", "Q2478", "-1", "-1", "Q6642", "Q77430", "-1", "-1", "Q7621", "-1", "Q7015", "-1", "-1", "-1", "Q1361961", "Q6955", "Q1984809", "-1", "-1", "-1", "Q49683", "-1", "-1", "Q6951", "-1", "Q123868", "-1", "Q34", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "Q854429", "-1", "-1", "-1", "-1", "Q189548", "Q175112", "-1", "Q2045313", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "Q186277", "-1", "Q6821", "-1", "Q7017", "-1", "Q6299", "Q6204", "Q134194", "Q41484", "Q668988", "-1", "Q49683", "-1", "-1", "Q6918", "-1", "Q18743", "Q192664", "Q175112", "-1", "Q285268", "-1", "Q397", "-1", "Q7016", "Q216", "Q37", "Q7017", "Q7610", "Q7650", "Q36", "Q6819"], "pos": 2, "size": 13105}, "\u041e\u0431\u0449\u0430\u044f \u0445\u0430\u0440\u0430\u043a\u0442\u0435\u0440\u0438\u0441\u0442\u0438\u043a\u0430": {"links": ["-1", "Q2280", "Q722836", "Q1916517", "Q4390852", "-1", "-1", "-1", "Q1640381", "Q2011885", "-1", "-1", "-1"], "pos": 1, "size": 1637}, "\u0421\u0441\u044b\u043b\u043a\u0438": {"links": ["Q8819406", "Q8479441"], "pos": 5, "size": 594}, "\u041b\u0438\u0442\u0435\u0440\u0430\u0442\u0443\u0440\u0430": {"links": [], "pos": 4, "size": 1282}, "\u041f\u0440\u0438\u043c\u0435\u0447\u0430\u043d\u0438\u044f": {"links": [], "pos": 3, "size": 15}}]
4302040:[1852270, "\u0423\u043b\u0438\u0446\u0430 \u0410\u0443\u0448\u0440\u043e\u0441 \u0412\u0430\u0440\u0442\u0443", {"\u041f\u0440\u0438\u043c\u0435\u0447\u0430\u0442\u0435\u043b\u044c\u043d\u044b\u0435 \u0437\u0434\u0430\u043d\u0438\u044f": {"links": ["Q6877", "Q4527630", "Q170292", "Q4168072", "Q2479493", "-1", "Q4314885", "Q51879601", "Q2043", "-1", "-1", "-1", "-1", "Q2094", "Q189164", "Q243472", "-1", "Q948392", "-1", "Q77430", "-1", "Q1341702", "Q458245", "Q2088144", "-1", "-1", "Q1987", "-1", "Q46825", "-1", "Q1984809", "Q336754", "Q6082", "Q336754", "Q128399", "Q6204", "Q712444", "-1", "Q7008", "Q948392", "Q361", "Q362", "Q170463", "Q7017", "Q6955", "Q6785", "-1", "Q6955", "-1", "Q6994", "Q2088144", "Q79822", "Q458245", "-1", "Q1961083", "-1", "Q2663670", "Q362", "-1", "-1", "-1", "-1", "Q7018", "Q7017", "Q3142", "-1", "-1", "Q23444", "Q23445", "Q7017", "Q4168072", "-1", "Q2478", "-1", "-1", "Q6642", "Q77430", "-1", "-1", "Q7621", "-1", "Q7015", "-1", "-1", "-1", "Q1361961", "Q6955", "Q1984809", "-1", "-1", "-1", "Q49683", "-1", "-1", "Q6951", "-1", "Q123868", "-1", "Q34", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "Q854429", "-1", "-1", "-1", "-1", "Q189548", "Q175112", "-1", "Q2045313", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "Q186277", "-1", "Q6821", "-1", "Q7017", "-1", "Q6299", "Q6204", "Q134194", "Q41484", "Q668988", "-1", "Q49683", "-1", "-1", "Q6918", "-1", "Q18743", "Q192664", "Q175112", "-1", "Q285268", "-1", "Q397", "-1", "Q7016", "Q216", "Q37", "Q7017", "Q7610", "Q7650", "Q36", "Q6819"], "pos": 2, "size": 13105}, "\u041e\u0431\u0449\u0430\u044f \u0445\u0430\u0440\u0430\u043a\u0442\u0435\u0440\u0438\u0441\u0442\u0438\u043a\u0430": {"links": ["-1", "Q2280", "Q722836", "Q1916517", "Q4390852", "-1", "-1", "-1", "Q1640381", "Q2011885", "-1", "-1", "-1"], "pos": 1, "size": 1637}, "\u0421\u0441\u044b\u043b\u043a\u0438": {"links": ["Q8819406", "Q8479441"], "pos": 5, "size": 594}, "\u041b\u0438\u0442\u0435\u0440\u0430\u0442\u0443\u0440\u0430": {"links": [], "pos": 4, "size": 1282}, "\u041f\u0440\u0438\u043c\u0435\u0447\u0430\u043d\u0438\u044f": {"links": [], "pos": 3, "size": 15}}]
I think here's why it's happening. You'll see that articles appear in both current.xml and current[N].xml. Here's an example:
(1007, 'СССР', {}, '/mnt/data/xmldatadumps/public/ruwiki/20180801/ruwiki-20180801-pages-meta-current1.xml-p4p204179.bz2')
(1007, 'СССР', {}, '/mnt/data/xmldatadumps/public/ruwiki/20180801/ruwiki-20180801-pages-meta-current.xml.bz2')
I think you should exclude current.xml from your list and process the remaining items. It seems to be the combination of the other XML files.
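One way to apply that suggestion is sketched below. It is only an illustration, not the fix that was actually committed: the function name `select_dump_files` and the regular expression are assumptions based on the two file names quoted above. Because a small wiki such as uz ships only the single combined file, the sketch falls back to the full list when no numbered chunks are found.

```python
import re

# Assumption: chunked dumps look like "...-pages-meta-current<N>.xml-p...bz2",
# while the combined dump is "...-pages-meta-current.xml.bz2" (no number).
CHUNK_PATTERN = re.compile(r"-pages-meta-current\d+\.xml")


def select_dump_files(paths):
    """Return only the numbered chunks, or every path if there are none."""
    chunks = [path for path in paths if CHUNK_PATTERN.search(path)]
    # Wikis with a single combined file have no chunks; keep that file.
    return chunks if chunks else list(paths)


base = "/mnt/data/xmldatadumps/public/ruwiki/20180801/"
paths = [
    base + "ruwiki-20180801-pages-meta-current1.xml-p4p204179.bz2",
    base + "ruwiki-20180801-pages-meta-current.xml.bz2",
]
for path in select_dump_files(paths):
    print(path)
```

With the two ruwiki paths above, only the `current1` chunk is kept, so each article is parsed once.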
This has been done and was tracked in T215348.
Documentation can be found here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation