[05:41:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:46:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:03:46] Traffic, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 5 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (Aklapper) All patches merged. Is this still an issue? Should this still remain o...
[08:02:25] (HAProxyEdgeTrafficDrop) firing: 61% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:07:25] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:03:56] Traffic, Prod-Kubernetes, SRE, serviceops, and 2 others: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (JMeybohm) Open→Resolved a:JMeybohm This is now used by miscweb and documented at https://wikitech.wikimedia....
[09:29:25] (HAProxyEdgeTrafficDrop) firing: 28% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:31:05] netops, Infrastructure-Foundations, SRE: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (dcaro) Hey @cmooney, is there any input needed from WMCS on this?
(just want to make sure you are not blocked)
[09:59:25] (HAProxyEdgeTrafficDrop) resolved: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:05:25] (HAProxyEdgeTrafficDrop) firing: (2) 60% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:10:25] (HAProxyEdgeTrafficDrop) resolved: (2) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:50:47] Traffic: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (Vgutierrez)
[10:50:56] Traffic: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (Vgutierrez) p:Triage→High
[10:57:16] netops, Infrastructure-Foundations, SRE, observability: Grafana posting to http://wpt-graphite.wmftest.org:8080/ - https://phabricator.wikimedia.org/T307445 (jbond)
[10:59:25] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[11:04:25] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop -
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:42:05] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (Jgreen) >>! In T295691#7894680, @Papaul wrote: > @Jgreen hello do you think this can be done on May the 16th? @Papaul, yes that sounds good. We can plan for downt...
[12:47:00] netops, Infrastructure-Foundations, SRE, observability: Grafana posting to http://wpt-graphite.wmftest.org:8080/ - https://phabricator.wikimedia.org/T307445 (jbond) this is likely related to https://wikitech.wikimedia.org/wiki/Performance/Graphite/Synthetic_Instance
[12:59:43] Traffic, SRE: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (Vgutierrez) p:High→Medium Lowering the priority as after downgrading text we aren't experiencing more issues
[13:08:43] vgutierrez: bblack: someone asked me about the sre position in traffic (https://boards.greenhouse.io/wikimedia/jobs/3659197?gh_src=7c0a18b71us) and wanted to know how important this requirement is "Experience with C, C++, Golang or Rust" and how much experience is required i.e. experience reading and making small changes to $lang vs experience working on big projects and in depth knowledge of
[13:08:49] language structure and advanced use cases
[13:09:08] also for my own benefit are there some rust projects in the works?
[13:09:26] that's the position for the service ops manager
[13:09:36] wrong link https://boards.greenhouse.io/wikimedia/jobs/3891672?gh_src=1aa374291us
[13:10:11] right..
so the candidate ideally should be comfy patching and/or debugging varnish/ATS/HAProxy
[13:10:36] of course the LB work is going to require writing C++ code
[13:10:44] and we got some small tools written in go
[13:11:41] perfect thanks vgutierrez
[14:20:33] netops, Infrastructure-Foundations, SRE, observability: Grafana posting to http://wpt-graphite.wmftest.org:8080/ - https://phabricator.wikimedia.org/T307445 (fgiunchedi) Yes valid indeed, see also {T231870} and {T304583}
[14:30:49] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) >>! In T306181#7892221, @Ottomata wrote: >> perhaps this is a client browser opening a connection but send...
[14:51:14] jbond: I'd add - it's probably slightly less-important the language (if someone knows a few, they can probably pick up another), but the level of experience developing some kind of software, in a context where knowledge of things like syscalls and thread scaling performance, etc actually matters.
[14:53:44] bblack: that makes sense, thanks
[14:53:58] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) It possible that the request aborted errors are actually requests being terminated mid-flight by the clie...
[15:04:30] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (jhathaway) @Dzahn I mentioned over email, but I t...
[15:15:52] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (Dzahn) @jhathaway Yes and no. What I definitely d...
[15:18:29] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (jbond) >>! In T300977#7836272, @Volans wrote: > If I may add my use case too, I would like to be able to restrict the a...
[15:32:42] Traffic, SRE, Upstream: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (Vgutierrez) reported to upstream: https://github.com/haproxy/haproxy/issues/1684
[15:45:57] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (jhathaway) @Dzahn that makes sense, so I assume i...
[15:51:57] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) >>! In T306181#7899762, @Ottomata wrote: > It possible that the request aborted errors are actually reques...
[15:53:11] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (akosiaris) >>! In T306649#7883435, @elukey wrote: > We have discussed this issue in the #serviceops channel yesterday, and the i...
[15:59:58] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Vgutierrez) to be accurate, the remote client talks to HAProxy over a TLS connection and HAProxy handles the traffi...
[16:05:16] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) It does seem that a 400 bad request is being sent to the client. I think that perhaps the 500 reported b...
[16:34:34] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (Dzahn) The part that we don't (can't actually) re...
[16:34:47] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Thanks @Vgutierrez for the clarification on that. I hadn't picked up on the progress of the HAProxy migrat...
[21:29:01] Traffic, Data-Engineering, SRE, Trust-and-Safety, serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (Dzahn) But what isn't is that there seems to be a...