[00:20:20] (ProbeDown) firing: (2) Service otrs1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:20:26] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312609 (phaultfinder)
[00:28:53] ^ silenced again (they expired)
[00:29:06] and the fix from earlier has not fixed it just yet
[00:29:27] not a real issue with VRTS in any way
[01:35:14] (ProbeDown) firing: (2) Service gitlab1003:443 has failed probes (http_gitlab_replica_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:36:36] sigh. https://gerrit.wikimedia.org/r/c/operations/puppet/+/812144/
[01:50:14] (ProbeDown) resolved: (2) Service gitlab1003:443 has failed probes (http_gitlab_replica_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:10:29] (ProbeDown) firing: (2) Service otrs1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:24:04] merged another follow-up fix. solved one issue, ran into next
[03:50:29] (ProbeDown) resolved: (2) Service otrs1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:50:54] <_joe_> I am not convinced having this additional noise in this channel is productive
[05:51:11] <_joe_> I'd like to keep the alerts, esp ones flapping, to #operations cc mutante jelto
[08:03:58] serviceops, MediaWiki-extensions-Score, SRE, Shellbox, Sustainability (Incident Followup): Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (Legoktm)
[08:06:00] serviceops, MediaWiki-extensions-Score, SRE, Shellbox, Sustainability (Incident Followup): Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (Legoktm) p:Triage→High I'm triaging this as high priority because it is causing temporary outages of Shel...
[08:42:56] serviceops: mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (tstarling) I can't find any incident documentation for an incident on 2020-06-08, and I'm unclear on what problem was caused by mcrouter flapping. Was mc1029 slow, able to serve VERSION probes, but unable to s...
[09:01:39] _joe_: I agree. I'll bring this up in our next Monday meeting
[10:41:31] serviceops, Performance-Team (Radar): Properly support php7.4 across the observability stack - https://phabricator.wikimedia.org/T312634 (Joe)
[10:41:46] serviceops, Performance-Team (Radar): Properly support php7.4 across the observability stack - https://phabricator.wikimedia.org/T312634 (Joe) p:Triage→High
[11:23:51] serviceops, Parsoid, Performance-Team (Radar): Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638 (Joe)
[13:06:06] _joe_: Managed to get etcdmirror ported to python3 and now building it for bullseye
[13:06:08] https://gerrit.wikimedia.org/r/c/operations/software/etcd-mirror/+/812306
[13:06:36] it isn't gonna break existing clusters (buster, stretch) so I am gonna merge. It's very simple changes anyway
[13:30:51] <_joe_> akosiaris: aren't you expanding the eqiad cluster?
[13:49:54] serviceops, serviceops-collab, GitLab (CI & Job Runners), Patch-For-Review: DNS/networking not working on Trusted Runners - https://phabricator.wikimedia.org/T311241 (Jelto) I'm seeing dns issues for jobs on Trusted Runners again: (example here https://gitlab.wikimedia.org/repos/releng/gitlab-tru...
[14:10:16] _joe_: yup. But also saving us the trouble from having to do this again next year
[14:10:21] when buster is deprecated
[14:11:51] <_joe_> yeah sure, I was asking if you're just expanding the eqiad cluster, then you just need to verify it still works on conf2005
[14:46:41] serviceops, Performance-Team (Radar): Add "php 7.4" option to the Wikimedia Debug extension - https://phabricator.wikimedia.org/T312653 (Joe)
[15:41:08] serviceops, Patch-For-Review: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407 (akosiaris) Get etcd-mirror packaged for bullseye just got fixed. https://gerrit.wikimedia.org/r/c/operations/software/etcd-mirror/+/812306 and https://gerrit.wikimedia.org/r/c/operations/software/e...
[16:48:53] not doing it now but on Monday I'd like to change the weights in pybal for thumbor to reduce the weights of the older servers, newer ones have more capacity and the older ones are creaking a bit
[18:01:52] serviceops, Performance-Team (Radar): Add "php 7.4" option to the Wikimedia Debug extension - https://phabricator.wikimedia.org/T312653 (Reedy)
[19:52:27] serviceops, GitLab, Release-Engineering-Team, serviceops-collab, User-dduvall: Changes to modules/gitlab_runner/templates/config-template.toml.erb have no effect on existing runners - https://phabricator.wikimedia.org/T311746 (dduvall) p:Triage→Medium a:dduvall
[20:34:26] hnowlan: sounds reasonable, we did the same for appservers
[20:34:42] or rather raise the ones for newer servers
[20:42:29] replacing puppet compiler names with new "pcc-worker" names in https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource%3APuppet-diffs%2FDocumentation&type=revision&diff=1995470&oldid=1936327 and attempting to sync puppet facts.. so that we can compile changes in devtools/cloud
[21:42:06] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312609 (Dzahn) Open→Invalid These were partially false positives and partially deliberate tests. They were not real alerts but alerting is still being worked on. There were a couple follow-ups but we will activate it a...
[21:45:04] serviceops, WikimediaDebug, Performance-Team (Radar): Add "php 7.4" option to the Wikimedia Debug extension - https://phabricator.wikimedia.org/T312653 (Krinkle)
[21:47:01] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312609 (Dzahn) The nice part here is: hey, automatic task creation based on alerting :)
[23:09:06] serviceops: mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (RLazarus) >>! In T255511#8064532, @tstarling wrote: > I can't find any incident documentation for an incident on 2020-06-08, and I'm unclear on what problem was caused by mcrouter flapping. Was mc1029 slow, ab...