I've manually removed a bunch of old refinery artifact jar versions from the refinery deploy on notebook hosts to free up space.
Just came across this ticket after reading the TechCom radar email.
For posterity, this is being done in https://github.com/wikimedia/jsonschema-tools with json-schema-ref-parser.
Fri, Jul 12
Thu, Jul 11
I believe this happens on an-coord1001 and notebook* hosts because their /srv partitions are relatively small. When the disk fills up during a scap deploy, the deploy aborts without removing old cached deploys. Future successful deploys do remove them.
@DStrine, I need some help from someone who knows how to make changes to centralnotice campaigns to test that these events work after migration. Who should I ask?
It's weird that you can't send a null value for a non-required field (and it complicates the instrumentation code a bit too).
I don't think it is weird, but I understand that it might not be obvious. null is a JSON datatype, just like string and number. If you wanted to allow a field to have both null or string, you'd have to use a union type. Please don't do that though! We don't support union types!
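To make the distinction concrete, here is a small sketch using the Python `jsonschema` library. The field name `comment` is hypothetical; the point is that a plain `"type": "string"` schema rejects `null`, while accepting it would require the union type `["string", "null"]`, which the Event Platform guidelines disallow.

```python
# Sketch: null is its own JSON type, so a string-typed field rejects it.
# The 'comment' field and these schemas are illustrative, not real schemas.
from jsonschema import ValidationError, validate

string_only = {"type": "object", "properties": {"comment": {"type": "string"}}}
union_type = {"type": "object", "properties": {"comment": {"type": ["string", "null"]}}}

event = {"comment": None}

try:
    validate(event, string_only)
    print("string-only schema accepted null")
except ValidationError:
    print("string-only schema rejected null")

# A union type would accept null, but union types are not supported here,
# so the practical advice is: omit optional fields instead of sending null.
validate(event, union_type)  # passes
print("union schema accepted null")
```

In practice, the instrumentation code should simply leave the key out of the event rather than setting it to `null`.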
Here are the top 10 offenders:
@EBernhardson analytics-search user should now be able to access the auth file
Wed, Jul 10
Tue, Jul 9
Eric needs the analytics-search user to be able to access the swift auth file so his Oozie jobs can upload to swift.
Ah, hm ok.
Mon, Jul 8
Oops I duplicated. Thanks for closing. :)
We need to upgrade the OSes to Debian Buster.
+1 for 1 eqiad and 1 codfw
Fri, Jul 5
Wed, Jul 3
Please decommission the current servers to the spare role.
Ok will do. I'll downtime the hostnames in icinga when I do.
Also, do we still need to rsync that data?
Ya, I believe so: https://dumps.wikimedia.org/other/pageviews/2019/2019-07/
Also related: T111433
@Nuria, see comment https://phabricator.wikimedia.org/T205319#5300239. I'm trying to isolate stream config use cases from the larger problem of data governance. Part of the upcoming projects will include use cases from this as well as T201063: Modern Event Platform: Schema Registry, including things like schema UIs. Atlas has a 'schema' UI and a search engine for schema and dataset discovery. There's overlap with stream config, but I'm not sure if stream config itself fits into something like Atlas...maybe we could use it for the UI components of stream config? Really not sure.
Tue, Jul 2
FYI, I wanted to know more about Apache Atlas, so I set up a standalone instance on stat1004 and ran the Hive import process for the wmf and event databases. I added some glossary terms for 'user_agent' and 'ip', classified them as PII, tagged related fields, etc.
This also often affects other hosts with relatively small /srv partitions, like notebook* hosts.
I like an-conf. Also gives us the option to colocate something else on them if we need to one day.
Feel free to reassign
Ah still wrong. Full details.
- As a product manager/analyst/engineer, I want to set the privacy whitelist settings of a stream's event fields so that I can retain non-PII data for longer than 90 days.
- As a product manager/analyst/engineer, I want to set and discover the ownership of schemas and streams so I can track governance over time and know when a stream can be decommissioned.
Hm, I just noticed there are more eventstreams processors than I had thought. There are 6 scb nodes in codfw and 4 nodes in eqiad, for a total of 208 processors between them. eventstreams is configured to spawn one worker per processor. As is, this won't help keep the varnish connection pool from filling up.
When we refine, we get the schema of the first record we find, and we always assume backwards compatibility of schemas.
For EventLogging Hive, we actually use the latest schema. The latest schema is used to read the data, so if the data has fields that are not in the latest schema, they will not be read. Removing fields is a backwards incompatible change.
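A minimal sketch of this schema-on-read behavior (field and value names are made up for illustration): the stored record is projected onto the latest schema's fields, so anything removed from the schema silently disappears from reads.

```python
# Hypothetical latest schema: suppose 'session_id' was removed from it.
latest_schema = ["uuid", "wiki", "timestamp"]

# Old stored data still contains the removed field.
stored_record = {
    "uuid": "abc-123",
    "wiki": "enwiki",
    "timestamp": "2019-07-01T00:00:00Z",
    "session_id": "deadbeef",  # no longer in the latest schema
}

# Reading with the latest schema drops 'session_id', which is why removing
# a field is backwards incompatible for consumers that relied on it.
read_record = {f: stored_record.get(f) for f in latest_schema}
print(read_record)  # 'session_id' is not read
```

This is also why additive-only schema evolution is the safe default.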
Hm, I don't think T183303 affected any encryption status of logs. The Avro logs we migrated to event gate just do an HTTP post to EventGate, and EventGate produces to Kafka unencrypted.
+1 I think this alarm should alert SRE.
Mon, Jul 1
Ok, all patches ready to go. Deployed in beta and looks good there. It is near the end of my day now, so I'll wait until tomorrow to deploy to production eventstreams.
Thank you! I think that should be fine.
BTW, I don't need 'admin' rights, I think just 'manager' rights. I want to be able to create repositories.
I think that having the client send as simple and immediately useful a message as possible should be a goal.
Indeed, as we will likely also import these events into Hadoop/Hive for longer term querying and analysis. The fewer transformations we have to do, the better.
Not sure, but we are also waiting for Buster to upgrade Spark. When I asked Moritz before, he said Buster would be ready in about a month.
As each event is migrated to the new Event Platform format in T211248, the timestamps will use the 'Z' suffix.
Looks good, I can take the k8s task. I can start the schema but we'll need to bikeshed that one together.
Not sure what's the right place for that: client, EventGate, logstash filter?
I think client is the right place. We need to bikeshed the actual error schema we will use. EventGate will only validate that the incoming events conform to the schema.
Fri, Jun 28
To hold us over through the weekend, I've manually blacklisted the offending IP in EventStreams code and deployed. We'll work on a better solution next week.
Collected some info about which IPs were connecting on scb1001. Over a period of about 40 minutes:
EventStreams is hitting its concurrent connection limits of about 200 connections. We think this is probably due to a single client starting many connections, but aren't yet 100% sure about that. We are looking into it!
Oh ok, will do!
Great! @fgiunchedi you said 'that is something we'd have to deploy first'. Can I use this now?
Hm @herron, today we experienced T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap, which I think is caused by the fact that the eventstreams service has service::node auto_refresh => false. I forgot about this. eventstreams should be depooled, puppet run, and restarted for each new server. Same goes for change-prop, and possibly change-prop-job-queue. Sorry for not catching this when I reviewed the migration plan.
Thu, Jun 27
Gerrit manager sounds fine!