Fri, Dec 2
Wed, Nov 30
Fri, Nov 18
I took a closer look at Grafana's "Logs" panel type. It's interesting and could be useful here. We can feed it logs from a database, and it works with the panel time settings. When you select a time range on a graph, you get the same range in the logs chart. See https://frmon.wikimedia.org/d/xyo7y7d4z/jeff-test?orgId=1
Thu, Nov 17
The thing that stands out to me is that the civi failure message was always within 5-7 seconds of the start of the job. If the timeout is really that short, an easy fix could be to increase it to 15+ seconds. On the other hand, if it's already that long, then maybe it's a different issue.
As far as I can tell, this is done.
Thu, Nov 10
We have the process-control status, load, queue size and tons of other metrics in Prometheus already. Here's an attempt at putting a bunch of potentially useful metrics in one Grafana dashboard: https://frmon.wikimedia.org/d/9VG5JhvVk/p-c-vs-other-things . I don't think it's feasible to expose logs directly in the same place (both in terms of sensitive information and usability), but we have tools to scrape and quantify logs based on regular expressions that could be used to collect metrics.
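To illustrate the idea of quantifying logs with regular expressions, here is a minimal Python sketch. It is not the tooling we actually run. The pattern names, the regular expressions, and the metric name `log_pattern_matches_total` are all made up for the example. It counts how many log lines match each pattern and prints the counts in the Prometheus text exposition format, so only numbers leave the host and the log text itself stays private.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; real ones would be
# tuned to the actual log format.
PATTERNS = {
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "error": re.compile(r"\bERROR\b"),
}


def count_matches(lines, patterns=PATTERNS):
    """Count how many lines match each named pattern."""
    # Start every pattern at zero so it is reported even with no matches.
    counts = Counter({name: 0 for name in patterns})
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def to_prometheus_text(counts, metric="log_pattern_matches_total"):
    """Render the counts in the Prometheus text exposition format."""
    out = [f"# TYPE {metric} counter"]
    for name in sorted(counts):
        out.append(f'{metric}{{pattern="{name}"}} {counts[name]}')
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = [
        "job started",
        "ERROR: connection timed out",
        "Timeout waiting for lock",
    ]
    print(to_prometheus_text(count_matches(sample)), end="")
```

Output like this could be written to a file for a textfile-style collector or served over HTTP, depending on how the metrics are scraped.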
Wed, Nov 9
@Eileenmcnaughton take a look at https://frmon.wikimedia.org/d/9VG5JhvVk/p-c-vs-other-things -- it's a mix of mysql- and prometheus-backed charts. You can use the hover/crosshairs to see how timings line up across the charts. Note I just picked a few example metrics that seemed plausibly useful. Also note that while the timings of the process-control jobs are accurate to within a second, the other metrics could be offset by up to a minute due to the way prometheus metrics are collected.
Nov 4 2022
Nov 3 2022
Here's a first try at a more precise way of reporting this in Grafana.
Nov 2 2022
There's definitely room for improvement. Below is what we have now.
Nov 1 2022
Oct 26 2022
Oct 21 2022
Oct 17 2022
Oct 14 2022
Oct 13 2022
Oct 11 2022
Oct 6 2022
Oct 4 2022
Marking "Unbreak Now!" because this is blocking us from investigating why frdb2001 failed to recover from a reboot.
Oct 3 2022
Logs from the payments server are at frpm1001:/tmp/T319203-*
Sep 27 2022
Update for posterity: we have reactivated a Dmarcian test account and provided access, and are looking at where to go from there.
Sep 26 2022
@RLewis I'm going to close this task since the part we can do on our end is done. If you decide to change the Endowment URL to a fundraising.wikimedia.org one, please reopen the task and we'll take care of that part too.
Sep 23 2022
The annual appeal URL is all set. The endowment one is on a separate website and has to go through a separate process.
Sep 13 2022
;; ANSWER SECTION:
w.wiki._report._dmarc.wikimedia.org. 3600 IN TXT "v=DMARC1;"
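For reference, the record name in that answer follows the external destination verification convention from RFC 7489: the domain sending mail, then `._report._dmarc.`, then the domain receiving the reports. A small Python sketch of how the name is built and how the TXT value can be checked follows. The function names are invented for this example, and the value check is deliberately loose (it only looks for a leading `v=DMARC1` tag).

```python
def edv_record_name(sender_domain, report_domain):
    """Build the DNS name a report receiver publishes to accept
    DMARC reports about sender_domain (RFC 7489, section 7.1)."""
    sender = sender_domain.strip().rstrip(".").lower()
    receiver = report_domain.strip().rstrip(".").lower()
    return f"{sender}._report._dmarc.{receiver}."


def is_edv_authorization(txt_value):
    """Loose check: the TXT value must start with the v=DMARC1 tag."""
    cleaned = txt_value.strip().strip('"').replace(" ", "").lower()
    return cleaned.startswith("v=dmarc1")


if __name__ == "__main__":
    print(edv_record_name("w.wiki", "wikimedia.org"))
    print(is_edv_authorization('"v=DMARC1;"'))
```

The resulting name can be queried with `dig TXT` to confirm that the receiving domain has published the authorization.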
Sep 12 2022
@cmooney Both interfaces show no-carrier, can you confirm that the switch ports are enabled?
Sep 9 2022
There's a nice summary of this issue here https://dmarcian.com/what-is-external-destination-verification/
Sep 6 2022
Sep 2 2022
I was able to log into the "wikimedia" dmarcian account and determine that the subscription expired, so I sent a query to firstname.lastname@example.org to see how much it would cost per month or per year to re-subscribe.
T86209 is relevant, we do appear to still be sending dmarc-ruf@ to dmarcian, although I don't know who has access to it.
Aug 31 2022
Done, and it seems to have worked.
Aug 29 2022
Aug 24 2022
Aug 23 2022
Aug 22 2022