User Details
- User Since: Oct 8 2014, 5:48 PM (402 w, 4 d)
- Availability: Available
- IRC Nick: Milimetric
- LDAP User: Milimetric
- MediaWiki User: Milimetric (WMF) [ Global Accounts ]
Thu, Jun 16
Sorry, Andre, I didn't even know there was a Gerrit tag. I'm marking this as resolved for now. If we ever come up with a different way of handling inactive repositories, I'll circle back and apply it here.
Fri, Jun 10
Ok, jobs ran, the dashboard looks ok again. I think it's solved; ping me again if anything seems weird.
The logs showed consistent errors since 2021-03, but I think it was just because this file had a trailing half-empty row (just the date and no output). So I reran the jobs and they seem ok... so weird. I think this means the data will be fixed soon; I'll move this to Done if I'm right.
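For reference, a minimal sketch of the kind of cleanup described above, i.e. stripping a trailing half-empty row (a date with no output value) from a TSV report before rerunning the jobs. The file name, separator, and column count here are assumptions for illustration, not the actual report:

```python
# Hypothetical sketch: file layout and column count are assumptions,
# not the real report format.
from pathlib import Path

def strip_incomplete_tail(path, expected_columns):
    """Drop trailing rows that have fewer tab-separated fields than
    expected (e.g. a date with no output value)."""
    lines = Path(path).read_text().splitlines()
    while lines and len(lines[-1].split("\t")) < expected_columns:
        lines.pop()
    Path(path).write_text("\n".join(lines) + "\n")
```

A rerun that appends a fresh complete row would then leave the file consistent again.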
Thu, Jun 9
Yeah, it looks like the queries have been failing and the data Dashiki is trying to load is corrupted. But I ran the queries manually and they don't fail. So I'll take this as a bug and work on it as soon as I can. It's weird :)
Tue, Jun 7
I added my draft at https://wikitech.wikimedia.org/wiki/User:Milimetric/Notebook/MediaWiki_History. Shall we edit there before moving it to DataHub, or shall we edit on DataHub? I don't think it's useful to craft text on DataHub, since that pollutes the history, but let me know what you think.
Mon, Jun 6
I'm not sure how we can remove it, https://www.mediawiki.org/wiki/Gerrit/Inactive_projects seems to say we just mark repositories as "Read Only". Is this enough? Does someone know if we have a more permanent removal? It's indeed just a repo that was never really used. @Aklapper: any advice?
Wed, Jun 1
Emil is still having a problem authenticating. When he logs in, his username doesn't have the groups that I add for user echetty.
Tue, May 31
drop table event_sanitized.gettingstartedredirectimpression;
drop table event.gettingstartedredirectimpression;
drop table event_sanitized.uploadwizarderrorflowevent;
drop table event_sanitized.uploadwizardexceptionflowevent;
drop table event_sanitized.uploadwizardflowevent;
drop table event_sanitized.uploadwizardstep;
drop table event_sanitized.uploadwizardtutorialactions;
drop table event_sanitized.uploadwizarduploadflowevent;
I've reviewed everything above and it can all be safely deleted. An admin needs to do this with cumin; see the instructions (ping @Ottomata). The HDFS and Hive stuff is done; I took care of it.
====== stat1004 ======
total 513244
drwxr-xr-x  2 26051 wikidev      4096 Jul 20  2021 hdfs-namenode-fsimage
-rw-rw-r--  1 26051 wikidev   1245367 Jan 10 16:42 part.txt
-rw-r--r--  1 26051 wikidev      3155 Oct 28  2020 razzi-key.txt
drwxrwxr-x 11 26051 wikidev      4096 Mar 16  2021 refinery
-rw-r--r--  1 root  root    524288000 May 18  2021 test.img
drwxrwxr-x  6 26051 wikidev      4096 Dec  7  2020 venv
drwxrwxr-x  6 26051 wikidev      4096 Dec  7  2020 venv3
May 27 2022
@EChetty: how does this get prioritized though? Is this a bug affecting users? (I think it is, but not sure how we're defining that)
@Jasonkhanlar: thanks very much, I hadn't heard of that before. I'll consider it for our wikistats refactor, and maybe @egardner would be interested when he gets back.
@Tsevener is right, and that's the access that @RhinosF1 pointed to. @Dmantena: unfortunately, due to how authentication and authorization work more broadly at WMF, this is the only way that we can manage access right now. Desiree Abad is leading an effort to improve that; you can connect with her for more details. But I totally agree with you that there should be a way to get this access without all the other implications. For your peace of mind, you can read the User Responsibilities section. You'll notice that you're very unlikely to get in trouble if you're going through the use case you describe here.
Seems like a bug to me. If lowercasing is a requirement of the system, it should just happen transparently for the user.
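As a sketch of what "transparently" could mean here: normalize the login name once at the boundary so case differences never leak into group lookups. The function and its use are hypothetical, not the actual auth code:

```python
# Illustrative only: names are hypothetical, not the real login path.
def normalize_username(raw: str) -> str:
    """Normalize a login name so case and stray whitespace
    can't produce two distinct account keys."""
    return raw.strip().lower()

# Different spellings resolve to the same account key:
assert normalize_username("EChetty") == normalize_username(" echetty")
```

With something like this applied before any group membership check, the user would never need to know the system is case-sensitive underneath.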
I'm very intrigued, @Milimetric, by your comment about reinstrumenting pageviews in a declarative way (that sounds like it could help with some of our work around differential privacy too), though I assume that's a very large project.
May 25 2022
Jobs are up for review at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/63, tested in prod
May 19 2022
In short: it would be very hard. There's a complicated data pipeline leading to the UI. It depends on how much value you would get out of such a tool. It's not a priority for us to make this generic beyond the scope of WMF projects, but it's not an inflexible piece of code.
May 17 2022
Indeed, RhinosF1 is right. Take a look at that link; I believe you need analytics-privatedata-users to run queries and access Presto-backed dashboards.
May 11 2022
@Mayakp.wiki I think we should build all new line charts using Apache ECharts (Time Series Line Chart in this case). Whenever the migration CLI is ready, we can use it. Until then, ECharts seems strictly better (let me know if I'm wrong). So maybe by the time the migration CLI comes out, we'll be naturally migrated anyway.
@Kipala & @TheresNoTime: I recently updated the language here as part of another task, can you take a look and see if it makes more sense? If not, please feel free to suggest a change and I can incorporate it: https://stats.wikimedia.org/#/sw.wikipedia.org/reading/total-page-views/normal|bar|1-year|~total|monthly
@Htriedman: I know you're talking to @EChetty about this, we're triaging it to this column which is like a task "incubator". Once this is fully formed and we know what the pipeline looks like, we can help you expand this into the necessary tasks. When you're done, you can move this back to incoming to effectively ping us.
May 10 2022
TODO: validate with @EChetty that the description here is what we want to evaluate (it looks more like what we want to know about schemas). And if not, see what else we need to understand about descriptions.
Ok, got a sense for how this works:
@BTullis: I'm doling out these tasks per our grooming session today, just to expedite the process. We decided there's only a few of us and we can stay in a tight loop. This was the top infrastructure thing we needed to look into. Emil said he validated that users without ssh access can login to datahub, but that it's confusing knowing which username to use. I guess maybe some clarity on the approach here, like a simple wiki article that we can link to, would be useful? Ping me if you want to brainbounce.
@EChetty: how should we do this? Do you want to draft a policy and set up a meeting to discuss? Would you like me to have a first draft? Your call, I'm happy either way.
I will work on this in parallel with the schema spike since repeated ingestion will tell us about both.
I will work on this first, using my hive database, milimetric, and reporting findings here.
This looks to be available behind the scenes but just not surfaced in the UI yet? https://datahubproject.io/docs/dev-guides/timeline/
Quick stats check on revision sizes and diff sizes:
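The numbers themselves aren't shown above. As a sketch of the kind of summary such a check produces, here is a small helper that computes distribution stats over a list of byte sizes; in practice the input would come from a Hive/Presto query over revision and diff sizes (the exact source table and columns are not specified here, so none are assumed):

```python
# Sketch only: the real input would be revision/diff sizes pulled
# from the data lake; this just shows the shape of the summary.
import statistics

def size_summary(byte_sizes):
    """Return basic distribution stats for a list of sizes in bytes."""
    s = sorted(byte_sizes)
    return {
        "count": len(s),
        "min": s[0],
        "median": statistics.median(s),
        "p95": s[int(0.95 * (len(s) - 1))],
        "max": s[-1],
    }
```

Running this once for revision sizes and once for diff sizes would give a quick, comparable check of both distributions.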
May 3 2022
@BBlack: this was never our pipeline. It looks like @dr0ptp4kt's original idea was to remove wprov so it doesn't fragment the cache. We don't particularly care one way or another; it doesn't affect our datasets directly. But obviously if the mechanism chosen here creates duplicate data, we should consider what we could add to the duplicate requests so they can be filtered out later. Personally, I think it's way overdue that we just instrument pageviews in a declarative way instead of parsing them out of webrequest.
I was wondering if we could disable the Line Chart type, since it's deprecated, and did some digging, but it doesn't seem to be easy to do. So this is a good workaround until we can get a Superset build without the buggy charts and replace all the existing dashboards. @Iflorez, let us know what you think.
Apr 18 2022
When I get back I'll write an airflow job that does the ingestion on a regular basis. For now I'd just like @EChetty and @odimitrijevic to take a look and let me know their thoughts on the set of databases we chose to ingest (event, event_sanitized, wmf, wmf_raw, canonical_data), the frequency that we think we want to do this at, and anything else that comes to mind.