Change Details

To close the firewall hole between analytics and the elasticsearch clusters mjolnir needs another way to query elasticsearch. We implemented one method and used it with relforge previously. The method is essentially to genreate elasticsearch queries in analytics, push them into kafka, and then read their response back on a second topic. As one can imagine there is some extra trickery in knowing when we are done. Kafka isn't really made for this purpose, it's mostly a hack. This works fine for it's purpose, collecting feature logs for smaller data sets being experimented on. This was never scaled up enough to work for the use case of collecting for a full batch for training production models. Additionally the volume of data generated by this process is quite arbitrary. The typical monthly job to build production search models will be a fairly consistent ~100M (query,page) pairs, but any experimentation from an engineer will generate unknown amounts of data completely depending on whatever is being experimented on. Some problems with the current deployment: [ ] The query normalization procedure also queries elasticsearch, but was never ported to the kafka based query [ ] I don't have any hard numbers, but I imagine the data volume is significant while having little long term value. The topic used for query/response should probably have a limited history [x] ~The daemon should fan out requests to multiple hosts. We wont likely have too many partitions, which will focus the load of query parsing/coordination on a few hosts. It would be better if we could spread that around the cluster. When running against relforge with a single partition topic the load difference between the two servers in the cluster is significant.~ We can instead use 20 or 30 partitions in kafka which will spread the load similarly with less complexity. [ ] We currently check the cluster load at the beginning of collection against production clusters to verify they are indeed idle and we aren't about to add a ton of load to the cluster serving production traffic. Without a direct connection to the cluster that's not too easy. Potential solutions: * Query normalization simply needs to be ported to use the kafka transport. It has all the typical problems though. * If we want to reduce the data volume we would need to move more smarts into the daemon. Right now it has to be verbose as we are filling topics with elasticsearch requests and responses. If the daemon was smarter it could receive a much simpler requests and send a simpler response. * Fanning out shouldn't be too hard. If we switch over to the elasticsearch-py client, already used by the bulk updates daemon, it has some built in smarts for this. * Cluster load is a tricky one, but I think we can postpone dealing with it. Codfw is de-facto the spare cluster, we probably shouldn't even install the msearch daemon on the eqiad side to prevent mistakes. Secondly the process is currently only triggered by a human, actually kicking it off requires interaction. It's not too far fetched to expect the human to understand this always runs against the spare cluster, and if we have failed over production traffic to the spare cluster it shouldn't be run. This is far from ideal, but it gives us time to find a better solution without blocking current work. There is probably more I havn't thought of.

To close the firewall hole between analytics and the elasticsearch clusters mjolnir needs another way to query elasticsearch. We implemented one method and used it with relforge previously. The method is essentially to genreate elasticsearch queries in analytics, push them into kafka, and then read their response back on a second topic. As one can imagine there is some extra trickery in knowing when we are done. Kafka isn't really made for this purpose, it's mostly a hack. This works fine for it's purpose, collecting feature logs for smaller data sets being experimented on. This was never scaled up enough to work for the use case of collecting for a full batch for training production models. Additionally the volume of data generated by this process is quite arbitrary. The typical monthly job to build production search models will be a fairly consistent ~100M (query,page) pairs, but any experimentation from an engineer will generate unknown amounts of data completely depending on whatever is being experimented on. Some problems with the current deployment: [x] The query normalization procedure also queries elasticsearch, but was never ported to the kafka based query. https://gerrit.wikimedia.org/r/451217 [ ] I don't have any hard numbers, but I imagine the data volume is significant while having little long term value. The topic used for query/response should probably have a limited history [ ] ~~The daemon should fan out requests to multiple hosts. We wont likely have too many partitions, which will focus the load of query parsing/coordination on a few hosts. It would be better if we could spread that around the cluster. When running against relforge with a single partition topic the load difference between the two servers in the cluster is significant. ~~ We can instead use 20 or 30 partitions in kafka which will spread the load similarly with less complexity. [ ] We currently check the cluster load at the beginning of collection against production clusters to verify they are indeed idle and we aren't about to add a ton of load to the cluster serving production traffic. Without a direct connection to the cluster that's not too easy. https://gerrit.wikimedia.org/r/452605 Potential solutions: * Query normalization simply needs to be ported to use the kafka transport. It has all the typical problems though. * If we want to reduce the data volume we would need to move more smarts into the daemon. Right now it has to be verbose as we are filling topics with elasticsearch requests and responses. If the daemon was smarter it could receive a much simpler requests and send a simpler response. * Fanning out shouldn't be too hard. If we switch over to the elasticsearch-py client, already used by the bulk updates daemon, it has some built in smarts for this. * All consumers in both datacenters should subscribe to the same topic on the same cluster with the same group id. The consumer will monitor qps of the `full_text` stat group against a canary index (enwiki_content). The consumer will only subscribe to kafka when qps is below some threshold, and must stop consuming within a minute of traffic rising above the threshold. In this way only consumers in the idle DC will consume. There is probably more I havn't thought of.

To close the firewall hole between analytics and the elasticsearch clusters mjolnir needs another way to query elasticsearch. We implemented one method and used it with relforge previously. The method is essentially to genreate elasticsearch queries in analytics, push them into kafka, and then read their response back on a second topic. As one can imagine there is some extra trickery in knowing when we are done. Kafka isn't really made for this purpose, it's mostly a hack. This works fine for it's purpose, collecting feature logs for smaller data sets being experimented on. This was never scaled up enough to work for the use case of collecting for a full batch for training production models. Additionally the volume of data generated by this process is quite arbitrary. The typical monthly job to build production search models will be a fairly consistent ~100M (query,page) pairs, but any experimentation from an engineer will generate unknown amounts of data completely depending on whatever is being experimented on. Some problems with the current deployment: [x] The query normalization procedure also queries elasticsearch, ] The query normalization procedure also queries elasticsearch,but was never ported to the kafka based query. but was never ported to the kafka based queryhttps://gerrit.wikimedia.org/r/451217 [ ] I don't have any hard numbers, but I imagine the data volume is significant while having little long term value. The topic used for query/response should probably have a limited history [x] ] ~~The daemon should fan out requests to multiple hosts. We wont likely have too many partitions, which will focus the load of query parsing/coordination on a few hosts. It would be better if we could spread that around the cluster. When running against relforge with a single partition topic the load difference between the two servers in the cluster is significant. ~~ We can instead use 20 or 30 partitions in kafka which will spread the load similarly with less complexity. [ ] We currently check the cluster load at the beginning of collection against production clusters to verify they are indeed idle and we aren't about to add a ton of load to the cluster serving production traffic. Without a direct connection to the cluster that's not too easy. https://gerrit.wikimedia.org/r/452605 Potential solutions: * Query normalization simply needs to be ported to use the kafka transport. It has all the typical problems though. * If we want to reduce the data volume we would need to move more smarts into the daemon. Right now it has to be verbose as we are filling topics with elasticsearch requests and responses. If the daemon was smarter it could receive a much simpler requests and send a simpler response. * Fanning out shouldn't be too hard. If we switch over to the elasticsearch-py client, already used by the bulk updates daemon, it has some built in smarts for this. * Cluster load is a tricky one, but I think we can postpone dealing with it. Codfw is de-facto the spare cluster, we probably shouldn't even install the msearch daemon on the eqiad side to prevent mistakes. Secondly the process is currently only triggered by a human,All consumers in both datacenters should subscribe to the same topic on the same cluster with the same group id. actually kicking it off requires interactionThe consumer will monitor qps of the `full_text` stat group against a canary index (enwiki_content). It's not too far fetched to expect the humanThe consumer will only subscribe to understand this always runs against the spare clusterkafka when qps is below some threshold, and if we have failed over production traffic to the spare cluster it shouldn't be run. This is far from ideal,must stop consuming within a minute of traffic rising above the threshold. but it gives us time to find a better solution without blocking current work. In this way only consumers in the idle DC will consume. There is probably more I havn't thought of.