
Some tables show wrong or incomplete information on Current Events dashboard
Closed, Resolved · Public · BUG REPORT

Description

Problem:

  • Logic dictates that all 6h should be included in 24h, all 24h should be included in 48h, and all 48h should be included in 72h. This is still not the case, so something *must* be wrong here.
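
For illustration, this containment can be written as a simple invariant check (a sketch with hypothetical names and item IDs, not the dashboard's code):

```python
# Sketch: every item in a shorter window's table must also appear in every
# longer window's table (6h ⊆ 24h ⊆ 48h ⊆ 72h).

def check_window_inclusion(windows: dict[str, set[str]]) -> list[str]:
    """Return a description of each violation of the containment invariant."""
    order = ["6h", "24h", "48h", "72h"]
    violations = []
    for shorter, longer in zip(order, order[1:]):
        missing = windows[shorter] - windows[longer]
        if missing:
            violations.append(f"{shorter} items missing from {longer}: {sorted(missing)}")
    return violations

# Example: Q64 appears in the 6h table but not in 24h -> invariant violated.
tables = {
    "6h": {"Q42", "Q64"},
    "24h": {"Q42"},
    "48h": {"Q42"},
    "72h": {"Q42"},
}
print(check_window_inclusion(tables))
```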

Acceptance criteria:

  • Error fixed.
  • Loading time does not noticeably increase.

Event Timeline

Logic dictates that all 6h should be included in 24h, all 24h should be included in 48h, and all 48h should be included in 72h. This is still not the case, so something *must* be wrong here.

I have restarted the update engine on the test server and cleared all data. I will be monitoring the accumulation of data across the specified intervals (6h, 24h, 48h, and 72h) to find out why this is happening.

Logic dictates that all 6h should be included in 24h, all 24h should be included in 48h, and all 48h should be included in 72h. This is still not the case, so something *must* be wrong here.

This anomaly is not present in the restarted test environment (yet). Still monitoring; complete code evaluation of the dashboard's update engine will take place following the implementation of changes requested in T294983.

The anomaly is now observed in the test environment; a complete code evaluation of the dashboard's update engine will take place following the implementation of the changes requested in T294983.

@Manuel

  • Found one possible (and most probable) cause of the anomaly;
  • fixed;
  • restarted the update engine in the test environment;
  • monitoring.

@Manuel

This is possibly fixed in the test environment, but we need another decision now:

  • in order to fix this obvious and annoying bug, I removed all additional filtering of items that have recently received edits by two or more editors;
  • in the test environment you will see that the system now again produces large tables (i.e. all items that have recently received edits by two or more editors are found there);
  • thus, we need a new criterion for which items to filter out (say: anything with fewer than N edits is filtered out, or something similar).

Please advise.

@Manuel To clarify and help you follow the developments here:

I think the whole problem was related to filtering out items based on their revision-frequency ranks rather than their absolute revision frequencies.

The revision-frequency ranks could have differed across the 72h, 48h, 24h, and 6h tables.

Now we need a new criterion to filter out items (either by the number of editors who work on them or by the absolute number of revisions they receive) in order to avoid serving overly long item lists under Current Events.
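
To illustrate the difference, here is a sketch with synthetic numbers (not the dashboard's actual code): a per-window rank cutoff can drop an item from the busy 72h window while it survives in the quiet 6h window, whereas an absolute threshold preserves the containment, because an item's revision count can only grow with the window size.

```python
# Synthetic illustration (not dashboard code): rank-based cutoffs are
# window-relative, so an item can make the cut in the 6h window but
# miss it in the busier 72h window, breaking 6h ⊆ 72h.

def top_k_by_revisions(counts: dict[str, int], k: int) -> set[str]:
    """Keep the k items with the highest revision counts (rank-based filter)."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:k])

def at_least(counts: dict[str, int], n: int) -> set[str]:
    """Keep items with at least n revisions (absolute filter)."""
    return {item for item, c in counts.items() if c >= n}

revs_6h  = {"Q1": 5, "Q2": 4}                       # quiet window: Q2 ranks 2nd
revs_72h = {"Q1": 40, "Q2": 9, "Q3": 30, "Q4": 25}  # busy window: Q2 ranks last

print(top_k_by_revisions(revs_6h, 2))   # {'Q1', 'Q2'}
print(top_k_by_revisions(revs_72h, 2))  # {'Q1', 'Q3'} -> Q2 dropped: violation

print(at_least(revs_6h, 4))             # {'Q1', 'Q2'}
print(at_least(revs_72h, 4))            # {'Q1', 'Q2', 'Q3', 'Q4'} -> superset
```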

Yes, the error seems to be fixed now. The quality of the data has improved dramatically as a result! \o/

Now we need a new criterion to filter out items

Loading times are a nightmare now... Can you test cutting off each table after the top 100 entries? This would work for now, but the better alternative would be to implement the label lookup in the backend.

either by number of editors who work on them

Yes, only that criterion is relevant!

@Manuel

Loading times are a nightmare now... Can you test cutting off each table after the top 100 entries?

To Do

This would work for now, but the better alternative would be to implement the label lookup in the backend.

Please elaborate on this idea.

This would work for now, but the better alternative would be to implement the label lookup in the backend.

Just what we discussed: The code for label lookup can live in the client JS or on the server. The latter could cache this info so that lookup would need no time on page load.
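
For context, a minimal sketch of what the server-side variant could look like, assuming a Python backend and the public wbgetentities API (the dashboard's actual stack may differ):

```python
# A sketch of server-side label lookup with caching (assumptions: Python
# backend, `requests` available). Labels are fetched once from the Wikidata
# API and cached in memory, so repeated page loads skip the network round trip.

from functools import lru_cache

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

@lru_cache(maxsize=10_000)
def item_label(qid: str, lang: str = "en") -> str:
    """Fetch (and cache) one item's label via the wbgetentities API."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels",
        "languages": lang,
        "format": "json",
    }, timeout=10)
    resp.raise_for_status()
    entity = resp.json()["entities"][qid]
    return entity.get("labels", {}).get(lang, {}).get("value", qid)

print(item_label("Q42"))  # first call hits the API; later calls are cached
```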

But as I mentioned: Time is running out on this, so anything will do to finally get this operational.

@Manuel @Tobi_WMDE_SW

The code for label lookup can live in the client JS or on the server.

Who at WMDE is our JS specialist, so that I can get in touch with them? I don't do JS, as you know; I am a Data Scientist.

Loading times are a nightmare now... Can you test cutting off each table after the top 100 entries?

But that would contradict your earlier point:

Logic dictates that all 6h should be included in 24h, all 24h should be included in 48h, and all 48h should be included in 72h. This is still not the case, so something *must* be wrong here.

from the ticket description.

Namely, if you keep the top 100 entities, then the entities from the 72h dataset (to take just one example) would not necessarily encompass the entities in the 6h dataset.

I think the best way to proceed would be for you to formulate the exact sorting criteria that would best fit our users' needs. To clarify:

  • in each of the following datasets (6h, 24h, 48h, 72h) we have:
    • the number of editors who edited a particular item,
    • and the number of edits for those items;
  • we already keep only the items that were edited by two or more editors (see the sketch at the end of this comment).

Please specify exactly what filtering/sorting you have in mind.

Besides this open question, everything else is ready for production.
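
To make the setup concrete, a sketch of the per-window aggregation described above (illustrative names, not the dashboard's actual code):

```python
# Sketch: from raw (item, editor) revision events within one window to one
# row per item with its editor count and edit count, keeping only items
# that were edited by two or more editors.

from collections import defaultdict

def aggregate(revisions: list[tuple[str, str]]) -> dict[str, tuple[int, int]]:
    """revisions: (item_id, editor) pairs within one window (6h, 24h, ...).
    Returns {item_id: (n_editors, n_edits)} for items with >= 2 editors."""
    editors = defaultdict(set)
    edits = defaultdict(int)
    for item, editor in revisions:
        editors[item].add(editor)
        edits[item] += 1
    return {item: (len(editors[item]), edits[item])
            for item in edits if len(editors[item]) >= 2}

window_6h = [("Q42", "A"), ("Q42", "B"), ("Q42", "B"), ("Q64", "A")]
print(aggregate(window_6h))  # {'Q42': (2, 3)}; Q64 has one editor, filtered out
```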

@Manuel @Tobi_WMDE_SW

To clarify a bit:

  • if we keep the top 100 items per number of editors, then the top 100 in the 72h dataset might encompass all items with at least, say, 7 editors, while in the 6h dataset, at the same time, no items might have reached 7 editors yet;
  • that would contradict the idea of: ... all 6h should be included in 24h, all 24h should be included in 48h, and all 48h should be included in 72h....

We need to understand exactly what filtering we want to implement. Let me know your thoughts and I will implement whatever filtering you find most suitable for this system.

all 6h should be included in 24h, all 24h should be included in 48h, and all 48h should be included in 72h

This was the logic that I followed for debugging the tool, nothing more. I explained it only so that you would understand why I thought that the data shown made no sense at that time. It is not relevant beyond that.

if we keep the top 100 items per number of editors, then the top 100 in the 72h dataset might encompass all items with at least, say, 7 editors, while in the 6h dataset, at the same time, no items might have reached 7 editors yet;

Yes, this is no problem.

Sorting:

  • primary: number of editors (high numbers first)
  • secondary: number of edits (high numbers first)

Filters:

  • min 2 editors
  • max 100 lines per table from top
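
Taken together, a minimal sketch of these criteria (illustrative names; the production code may differ):

```python
# Sketch of the agreed criteria: keep items with >= 2 editors, sort by editor
# count and then edit count (both descending), and truncate at 100 rows.

def build_table(items: dict[str, tuple[int, int]], max_rows: int = 100):
    """items: {item_id: (n_editors, n_edits)} for one window."""
    kept = [(qid, ed, rv) for qid, (ed, rv) in items.items() if ed >= 2]
    kept.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return kept[:max_rows]

window = {"Q1": (7, 40), "Q2": (2, 9), "Q3": (7, 55), "Q4": (1, 12)}
print(build_table(window))  # [('Q3', 7, 55), ('Q1', 7, 40), ('Q2', 2, 9)]
```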

I don't do JS, as you know; I am a Data Scientist.

The important part is that the cut-off works for now, so please don't bother with this. When I wrote my comment, I did not know that we had only limited time together, so I mentioned the optimal alternative in case you wanted to play with it.

Please let me know if you need any more information.

@Manuel

Filters:
  • min 2 editors
  • max 100 lines per table from top

Please be reminded that we already keep only items with min 2 editors, and if we go for max 100 lines per table, then we will end up filtering out items that did have min 2 editors but were excluded merely because of their row number in the table.

I remember when I started working on this dashboard and how I realised that the problem of which items to present proved more complicated than pure intuition would suggest! Would you have some time for a very concise 1:1 on this, perhaps in the next two hours or so? Thank you. We can figure out the criteria; we just need to be fully aware of the consequences of each (non-ideal) option that we have.

@Manuel

Following our 1:1

  • the criterion to keep the top 100 items per table is in force now in the test environment;
  • the table loading times have significantly improved;
  • I am putting the system in production now.
Manuel claimed this task.

Thank you! \o/