Build the first version of section recommender by fusing the synonym and translator models
Closed, Resolved (Public)

Description

The brainstorming on this task can start now. The detailed work can start when T190771 and T190770 are done.

Event Timeline

leila triaged this task as High priority.
leila updated the task description.
leila subscribed.

Hi @bmansurov ,

I'm cleaning up my code and found that my parser produces duplicated output: each row appears twice. The two copies are not adjacent, i.e. line 1 is not repeated on line 2 but on some line X with X > 2. Can you please have a look here and try to guess what I am doing wrong?
I can certainly add a post-filter, but I would love to understand what is happening.

Thanks

@diego I looked at your code briefly and tested it with lang=uz, and the output JSON didn't contain any duplicate rows. Can you paste one of the duplicate rows from ruwiki maybe?

@bmansurov, interesting. I've tried with 'uz' and don't see anything repeated either. Given that the 'uz' current dump is a single file, that makes me think it is something related to the parallelization.

For Russian, if you parse the dump from 20180801 you will see the repeated lines below. The first number is the line number, which I think can change from run to run due to the parallelization.

To check for duplicated rows I'm doing:

$ wc -l sections-articles_ru.json 
6842290 sections-articles_ru.json

$ sort -u sections-articles_ru.json |wc -l
3421145

Here is an example of repeated lines:

2966167:[1852270, "\u0423\u043b\u0438\u0446\u0430 \u0410\u0443\u0448\u0440\u043e\u0441 \u0412\u0430\u0440\u0442\u0443", {"\u041f\u0440\u0438\u043c\u0435\u0447\u0430\u0442\u0435\u043b\u044c\u043d\u044b\u0435 \u0437\u0434\u0430\u043d\u0438\u044f": {"links": ["Q6877", "Q4527630", "Q170292", "Q4168072", "Q2479493", "-1", "Q4314885", "Q51879601", "Q2043", "-1", "-1", "-1", "-1", "Q2094", "Q189164", "Q243472", "-1", "Q948392", "-1", "Q77430", "-1", "Q1341702", "Q458245", "Q2088144", "-1", "-1", "Q1987", "-1", "Q46825", "-1", "Q1984809", "Q336754", "Q6082", "Q336754", "Q128399", "Q6204", "Q712444", "-1", "Q7008", "Q948392", "Q361", "Q362", "Q170463", "Q7017", "Q6955", "Q6785", "-1", "Q6955", "-1", "Q6994", "Q2088144", "Q79822", "Q458245", "-1", "Q1961083", "-1", "Q2663670", "Q362", "-1", "-1", "-1", "-1", "Q7018", "Q7017", "Q3142", "-1", "-1", "Q23444", "Q23445", "Q7017", "Q4168072", "-1", "Q2478", "-1", "-1", "Q6642", "Q77430", "-1", "-1", "Q7621", "-1", "Q7015", "-1", "-1", "-1", "Q1361961", "Q6955", "Q1984809", "-1", "-1", "-1", "Q49683", "-1", "-1", "Q6951", "-1", "Q123868", "-1", "Q34", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "Q854429", "-1", "-1", "-1", "-1", "Q189548", "Q175112", "-1", "Q2045313", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "Q186277", "-1", "Q6821", "-1", "Q7017", "-1", "Q6299", "Q6204", "Q134194", "Q41484", "Q668988", "-1", "Q49683", "-1", "-1", "Q6918", "-1", "Q18743", "Q192664", "Q175112", "-1", "Q285268", "-1", "Q397", "-1", "Q7016", "Q216", "Q37", "Q7017", "Q7610", "Q7650", "Q36", "Q6819"], "pos": 2, "size": 13105}, "\u041e\u0431\u0449\u0430\u044f \u0445\u0430\u0440\u0430\u043a\u0442\u0435\u0440\u0438\u0441\u0442\u0438\u043a\u0430": {"links": ["-1", "Q2280", "Q722836", "Q1916517", "Q4390852", "-1", "-1", "-1", "Q1640381", "Q2011885", "-1", "-1", "-1"], "pos": 1, "size": 1637}, "\u0421\u0441\u044b\u043b\u043a\u0438": {"links": ["Q8819406", "Q8479441"], "pos": 5, "size": 594}, 
"\u041b\u0438\u0442\u0435\u0440\u0430\u0442\u0443\u0440\u0430": {"links": [], "pos": 4, "size": 1282}, "\u041f\u0440\u0438\u043c\u0435\u0447\u0430\u043d\u0438\u044f": {"links": [], "pos": 3, "size": 15}}]

4302040:[1852270, "\u0423\u043b\u0438\u0446\u0430 \u0410\u0443\u0448\u0440\u043e\u0441 \u0412\u0430\u0440\u0442\u0443", {"\u041f\u0440\u0438\u043c\u0435\u0447\u0430\u0442\u0435\u043b\u044c\u043d\u044b\u0435 \u0437\u0434\u0430\u043d\u0438\u044f": {"links": ["Q6877", "Q4527630", "Q170292", "Q4168072", "Q2479493", "-1", "Q4314885", "Q51879601", "Q2043", "-1", "-1", "-1", "-1", "Q2094", "Q189164", "Q243472", "-1", "Q948392", "-1", "Q77430", "-1", "Q1341702", "Q458245", "Q2088144", "-1", "-1", "Q1987", "-1", "Q46825", "-1", "Q1984809", "Q336754", "Q6082", "Q336754", "Q128399", "Q6204", "Q712444", "-1", "Q7008", "Q948392", "Q361", "Q362", "Q170463", "Q7017", "Q6955", "Q6785", "-1", "Q6955", "-1", "Q6994", "Q2088144", "Q79822", "Q458245", "-1", "Q1961083", "-1", "Q2663670", "Q362", "-1", "-1", "-1", "-1", "Q7018", "Q7017", "Q3142", "-1", "-1", "Q23444", "Q23445", "Q7017", "Q4168072", "-1", "Q2478", "-1", "-1", "Q6642", "Q77430", "-1", "-1", "Q7621", "-1", "Q7015", "-1", "-1", "-1", "Q1361961", "Q6955", "Q1984809", "-1", "-1", "-1", "Q49683", "-1", "-1", "Q6951", "-1", "Q123868", "-1", "Q34", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "Q854429", "-1", "-1", "-1", "-1", "Q189548", "Q175112", "-1", "Q2045313", "-1", "-1", "-1", "-1", "-1", "-1", "-1", "Q186277", "-1", "Q6821", "-1", "Q7017", "-1", "Q6299", "Q6204", "Q134194", "Q41484", "Q668988", "-1", "Q49683", "-1", "-1", "Q6918", "-1", "Q18743", "Q192664", "Q175112", "-1", "Q285268", "-1", "Q397", "-1", "Q7016", "Q216", "Q37", "Q7017", "Q7610", "Q7650", "Q36", "Q6819"], "pos": 2, "size": 13105}, "\u041e\u0431\u0449\u0430\u044f \u0445\u0430\u0440\u0430\u043a\u0442\u0435\u0440\u0438\u0441\u0442\u0438\u043a\u0430": {"links": ["-1", "Q2280", "Q722836", "Q1916517", "Q4390852", "-1", "-1", "-1", "Q1640381", "Q2011885", "-1", "-1", "-1"], "pos": 1, "size": 1637}, "\u0421\u0441\u044b\u043b\u043a\u0438": {"links": ["Q8819406", "Q8479441"], "pos": 5, "size": 594}, 
"\u041b\u0438\u0442\u0435\u0440\u0430\u0442\u0443\u0440\u0430": {"links": [], "pos": 4, "size": 1282}, "\u041f\u0440\u0438\u043c\u0435\u0447\u0430\u043d\u0438\u044f": {"links": [], "pos": 3, "size": 15}}]
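The `wc -l` versus `sort -u` check above only shows that duplicates exist (6842290 is exactly twice 3421145). A minimal Python sketch that also reports where each copy sits might look like the following. The function name `find_duplicate_line_numbers` is hypothetical and not part of the parser under discussion; lines are keyed by a SHA-1 digest only to keep memory use down on a file of this size.

```python
import hashlib
from collections import defaultdict


def find_duplicate_line_numbers(lines):
    """Return the 1-based line numbers of every line that occurs more than once.

    Hypothetical helper: each line is keyed by its SHA-1 digest so that the
    full (very long) JSON rows do not have to be kept in memory.
    """
    positions = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        digest = hashlib.sha1(line.rstrip("\n").encode("utf-8")).digest()
        positions[digest].append(number)
    # Keep only the groups with more than one occurrence, ordered by first hit.
    return sorted(nums for nums in positions.values() if len(nums) > 1)


# Tiny in-memory sample standing in for sections-articles_ru.json.
sample = ['[1, "a", {}]\n', '[2, "b", {}]\n', '[1, "a", {}]\n']
print(find_duplicate_line_numbers(sample))
```

To run it on the real output, pass an open file object instead of `sample`, for example `open("sections-articles_ru.json", encoding="utf-8")`.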

I think here's why it's happening. You'll see that articles appear in both current.xml and current[N].xml. Here's an example:

(1007, 'СССР', {}, '/mnt/data/xmldatadumps/public/ruwiki/20180801/ruwiki-20180801-pages-meta-current1.xml-p4p204179.bz2')
(1007, 'СССР', {}, '/mnt/data/xmldatadumps/public/ruwiki/20180801/ruwiki-20180801-pages-meta-current.xml.bz2')

I think you should exclude current.xml from your list and process the remaining items. It seems to be the combination of the other XML files.
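One way to apply this suggestion is to filter the list of dump paths before handing it to the parser. The sketch below is only an illustration: `select_dump_files` and the regular expression are assumptions based on the two file names quoted above, not code from the actual parser. It falls back to the full list when no numbered chunks exist, which is the situation described earlier for 'uz', where the current dump is a single file.

```python
import re

# Assumption: chunked dumps are named "...pages-meta-current<N>.xml-p...bz2",
# while the combined dump is "...pages-meta-current.xml.bz2".
CHUNK_PATTERN = re.compile(r"pages-meta-current\d+\.xml")


def select_dump_files(paths):
    """Prefer numbered chunk dumps; fall back to all paths if there are none."""
    chunks = [path for path in paths if CHUNK_PATTERN.search(path)]
    return chunks if chunks else list(paths)


ru_paths = [
    "/mnt/data/xmldatadumps/public/ruwiki/20180801/"
    "ruwiki-20180801-pages-meta-current1.xml-p4p204179.bz2",
    "/mnt/data/xmldatadumps/public/ruwiki/20180801/"
    "ruwiki-20180801-pages-meta-current.xml.bz2",
]
print(select_dump_files(ru_paths))
```

With the two ruwiki paths above, only the `current1` chunk is kept, so each article is parsed once.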

BTW, try adding ensure_ascii=False to json.dumps for easier debugging.
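As a small illustration of that tip, the standard library's `json.dumps` escapes non-ASCII characters by default, which is why the rows pasted above are full of `\uXXXX` sequences. The row below is a made-up example shaped like the parser output, using the article title from the paths above.

```python
import json

# Hypothetical row: page id, title, and an empty sections dict.
row = [1007, "СССР", {}]

escaped = json.dumps(row)                       # default: ensure_ascii=True
readable = json.dumps(row, ensure_ascii=False)  # keeps Cyrillic as-is

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation changes. When writing the readable form to a file, it is safest to open the file with an explicit `encoding="utf-8"`.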