[00:05:33] <jinxer-wm>	 (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:10:33] <jinxer-wm>	 (DatasourceError) resolved: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:13:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P52573 and previous config saved to /var/cache/conftool/dbconfig/20230922-001316-arnaudb.json
[00:15:02] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:35] <wikibugs>	 (PS3) Cathal Mooney: Support configuration of EVPN anycast GW on switches [homer/public] - https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938)
[00:19:16] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P52574 and previous config saved to /var/cache/conftool/dbconfig/20230922-002823-arnaudb.json
[00:29:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:30:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:38:29] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958983
[00:38:31] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958983 (owner: TrainBranchBot)
[00:43:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T343198)', diff saved to https://phabricator.wikimedia.org/P52575 and previous config saved to /var/cache/conftool/dbconfig/20230922-004330-arnaudb.json
[00:43:32] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[00:43:34] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[00:43:38] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:44:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:53:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:53:10] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/958983 (owner: TrainBranchBot)
[00:54:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:54:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:55:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:04:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:12:37] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:00:54] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:37] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:00] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:22:37] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:12] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:35:02] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2001:688:0:4::2d4)
[02:35:18] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[02:35:22] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[02:40:28] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.39 ms
[02:40:44] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 67.66 ms
[02:40:44] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 83 probes of 770 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:40:48] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.92 ms
[02:42:37] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:58] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 10 probes of 770 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:47:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:49:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:52:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:53:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:58:28] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:49:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[04:51:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:53:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:54:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:56:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:57:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:59:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:59:50] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: deploy kserve 0.11 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/959797 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[05:00:39] <wikibugs>	 (Merged) jenkins-bot: ml-services: deploy kserve 0.11 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/959797 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[05:02:44] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good (for selecting the partition layout we already use a apt* globbing pattern in netboot.cfg which also matches the staging host)" [puppet] - https://gerrit.wikimedia.org/r/959807 (owner: EoghanGaffney)
[05:07:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "I can confirm that based on discussion with Miriam and Martin back in June Aisha's old access was simply temporarily put on hold until the" [puppet] - https://gerrit.wikimedia.org/r/959771 (https://phabricator.wikimedia.org/T346796) (owner: Cwhite)
[05:09:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:10:40] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good (the access request also mentions analytics access), but that can still happen in a followup when/if clarified what is needed" [puppet] - https://gerrit.wikimedia.org/r/958982 (https://phabricator.wikimedia.org/T342535) (owner: Cwhite)
[05:13:10] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[05:13:16] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[05:13:22] <wikibugs>	 (CR) Muehlenhoff: durum: Select the custom nginx provider with no additional modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[05:21:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:23:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:30:02] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:34:14] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:34:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:36:09] <wikibugs>	 (CR) Muehlenhoff: sshd: Disable keyboard-interactive authentication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956983 (owner: Tim Starling)
[05:49:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:50:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:52:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:53:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:55:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:57:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230922T0600)
[06:08:43] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) (owner: Elukey)
[06:13:00] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] "Applied:" [grafana-grizzly] - https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) (owner: Elukey)
[06:24:24] <wikibugs>	 (CR) Elukey: [C: +2] Delete the fastapi-app chart [deployment-charts] - https://gerrit.wikimedia.org/r/959757 (owner: Elukey)
[06:25:00] <wikibugs>	 (Abandoned) Elukey: Set ores.wikimedia.org as CNAME for ores-legacy.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/957690 (owner: Elukey)
[06:25:44] <wikibugs>	 (PS1) Marostegui: Revert "control-mariadb-10.4-bullseye: Bump version" [software] - https://gerrit.wikimedia.org/r/959779
[06:26:30] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "control-mariadb-10.4-bullseye: Bump version" [software] - https://gerrit.wikimedia.org/r/959779 (owner: Marostegui)
[06:27:00] <wikibugs>	 (Merged) jenkins-bot: Revert "control-mariadb-10.4-bullseye: Bump version" [software] - https://gerrit.wikimedia.org/r/959779 (owner: Marostegui)
[06:27:26] <wikibugs>	 (PS1) Elukey: ml-services: remove special resource settings for eswiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/959893 (https://phabricator.wikimedia.org/T346445)
[06:32:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P52576 and previous config saved to /var/cache/conftool/dbconfig/20230922-063212-root.json
[06:34:36] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: remove special resource settings for eswiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/959893 (https://phabricator.wikimedia.org/T346445) (owner: Elukey)
[06:36:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1132', diff saved to https://phabricator.wikimedia.org/P52577 and previous config saved to /var/cache/conftool/dbconfig/20230922-063617-root.json
[06:36:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:39:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:40:12] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:10] <wikibugs>	 (PS2) Phedenskog: alertmanager: setup QTE mailing group. [puppet] - https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870)
[06:43:10] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:43:45] <wikibugs>	 (CR) Phedenskog: alertmanager: setup QTE mailing group. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: Phedenskog)
[06:43:50] <wikibugs>	 (PS1) Muehlenhoff: ssh: Disable ChallengeResponseAuthentication for cloud [puppet] - https://gerrit.wikimedia.org/r/959894
[06:43:52] <wikibugs>	 (PS1) Muehlenhoff: ssh: Disable ChallengeResponseAuthentication [puppet] - https://gerrit.wikimedia.org/r/959895
[06:43:54] <wikibugs>	 (PS1) Muehlenhoff: Remove config option for challenge response auth [puppet] - https://gerrit.wikimedia.org/r/959896
[06:44:28] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:45:06] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:22] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:34] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:57:58] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[07:00:06] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230922T0700)
[07:04:32] <wikibugs>	 (PS2) Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025)
[07:05:16] <wikibugs>	 (CR) CI reject: [V: -1] modules: add base.statsd [deployment-charts] - https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: Giuseppe Lavagetto)
[07:06:39] <moritzm>	 !log installing mutt security updates
[07:06:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:24] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, this will work. Passing -O2 to CPPFLAGS is a little odd, I think the fully correct way to resolve this would be to re-export C" [debs/mcrouter] - https://gerrit.wikimedia.org/r/860584 (owner: TK-999)
[07:20:31] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, thanks" [puppet] - https://gerrit.wikimedia.org/r/959739 (owner: Majavah)
[07:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:24:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:25:51] <wikibugs>	 SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (MoritzMuehlenhoff) >>! In T344164#9186835, @Urbanecm wrote: > If needed, we can also start with a different part of the on...
[07:27:47] <wikibugs>	 (CR) Volans: Define a script in charge of checking the kafka broker in sync status (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[07:28:21] <wikibugs>	 (PS3) Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025)
[07:29:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:30:08] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:57] <hashar>	 I am restarting Gerrit to apply a configuration setting
[07:34:22] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:33] <wikibugs>	 (CR) Majavah: [C: +2] firewall: add 'none' provider [puppet] - https://gerrit.wikimedia.org/r/959739 (owner: Majavah)
[07:36:36] <hashar>	 !log Restarting Gerrit to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/953967 "Link account creation to IDM"   # T345226
[07:36:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:43] <stashbot>	 T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226
[07:40:49] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations, wikitech.wikimedia.org, Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (hashar) I have restarted Gerrit and the {nav Sign up} link now points to https://idm.wikimedia.org/signup/
[07:44:14] <wikibugs>	 (CR) Majavah: [C: +1] "looks fine" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/959827 (owner: Jbond)
[07:45:40] <hashar>	 !log Upgrading CI Jenkins from 2.401.3 to 2.414.2
[07:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:48:23] <wikibugs>	 (CR) Brouberol: Define a script in charge of checking the kafka broker in sync status (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[07:50:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:51:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:52:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:53:21] <wikibugs>	 (CR) EoghanGaffney: [C: +2] sync-gitlab-group-with-ldap: Use --yes flag [puppet] - https://gerrit.wikimedia.org/r/959881 (owner: Ahmon Dancy)
[07:53:33] <wikibugs>	 (PS1) Giuseppe Lavagetto: thumbor: use base.statsd [deployment-charts] - https://gerrit.wikimedia.org/r/959948
[07:57:05] <wikibugs>	 (PS4) Cathal Mooney: Support configuration of EVPN anycast GW on switches [homer/public] - https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938)
[08:01:30] <wikibugs>	 (PS19) Brouberol: Define a script in charge of checking the kafka broker in sync status [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741)
[08:02:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:02:16] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (MGerlach) >>! In T346796#9189454, @colewhite wrote: > @MGerlach is there an expiry date for this contract renewal?  The contract ends June 30, 2024.
[08:03:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:06:28] <wikibugs>	 (PS9) Stevemunene: admin: Create analytics-wmde system user and airflow admin group [puppet] - https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648)
[08:08:24] <wikibugs>	 (PS2) Muehlenhoff: Add profile::firewall::provider: none for roles where P:firewall is not applied [puppet] - https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497)
[08:09:06] <wikibugs>	 (CR) Majavah: [C: +1] Add profile::firewall::provider: none for roles where P:firewall is not applied [puppet] - https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[08:09:22] <wikibugs>	 (CR) Muehlenhoff: Add profile::firewall::provider: none for roles where P:firewall is not applied (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[08:09:41] <wikibugs>	 (CR) Klausman: [C: +1] profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: Elukey)
[08:15:06] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Thank you for the heads up Ben! Idea and script LGTM" [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[08:20:49] <wikibugs>	 (CR) Filippo Giunchedi: alertmanager: setup QTE mailing group. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: Phedenskog)
[08:21:23] <wikibugs>	 (PS3) Filippo Giunchedi: alertmanager: setup QTE mailing group. [puppet] - https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: Phedenskog)
[08:21:37] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] alertmanager: setup QTE mailing group. [puppet] - https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: Phedenskog)
[08:26:38] <wikibugs>	 (CR) Vgutierrez: [C: +1] profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: Elukey)
[08:30:07] <wikibugs>	 (CR) Volans: Define a script in charge of checking the kafka broker in sync status (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[08:32:02] <wikibugs>	 (PS1) Filippo Giunchedi: sre: add jaeger query/collector alerts [alerts] - https://gerrit.wikimedia.org/r/959950 (https://phabricator.wikimedia.org/T345712)
[08:34:20] <wikibugs>	 (CR) Brouberol: Define a script in charge of checking the kafka broker in sync status (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[08:34:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add profile::firewall::provider: none for roles where P:firewall is not applied [puppet] - https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[08:34:25] <wikibugs>	 (PS1) Filippo Giunchedi: o11y: remove redundant '0m' time spec from prometheus alerts [alerts] - https://gerrit.wikimedia.org/r/959952
[08:36:01] <wikibugs>	 (PS2) Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/959179
[08:38:10] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/959807 (owner: EoghanGaffney)
[08:40:23] <wikibugs>	 (PS1) Muehlenhoff: LVS: Set profile::firewall::provider: none [puppet] - https://gerrit.wikimedia.org/r/959954
[08:40:25] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959179 (owner: Muehlenhoff)
[08:40:28] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/959966 (https://phabricator.wikimedia.org/T347140)
[08:41:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] o11y: remove redundant '0m' time spec from prometheus alerts [alerts] - https://gerrit.wikimedia.org/r/959952 (owner: Filippo Giunchedi)
[08:43:56] <wikibugs>	 SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (Urbanecm) >>! In T344164#9189953, @MoritzMuehlenhoff wrote: >>>! In T344164#9186835, @Urbanecm wrote: >> If needed, we can...
[08:46:57] <wikibugs>	 (PS1) Majavah: Fix puppet on cloudvirt-wdqs* until they have been moved [puppet] - https://gerrit.wikimedia.org/r/959955 (https://phabricator.wikimedia.org/T346948)
[08:47:00] <wikibugs>	 (PS1) Majavah: P:wmcs::cloud_private_subnet: fix typos [puppet] - https://gerrit.wikimedia.org/r/959956
[08:48:23] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Fix puppet on cloudvirt-wdqs* until they have been moved [puppet] - https://gerrit.wikimedia.org/r/959955 (https://phabricator.wikimedia.org/T346948) (owner: Majavah)
[08:48:35] <wikibugs>	 (CR) Majavah: [C: +2] Fix puppet on cloudvirt-wdqs* until they have been moved [puppet] - https://gerrit.wikimedia.org/r/959955 (https://phabricator.wikimedia.org/T346948) (owner: Majavah)
[08:50:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:51:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 20 hosts with reason: Schema change
[08:51:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 20 hosts with reason: Schema change
[08:51:55] <Amir1>	 !log dbmaint on s4@eqiad (T343198)
[08:52:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:02] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[08:54:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:55:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:56:29] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] P:wmcs::cloud_private_subnet: fix typos [puppet] - https://gerrit.wikimedia.org/r/959956 (owner: Majavah)
[08:57:59] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, ChangeProp, Prod-Kubernetes, serviceops, and 2 others: Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (JMeybohm)
[08:58:27] <wikibugs>	 (CR) Majavah: [C: +2] P:wmcs::cloud_private_subnet: fix typos [puppet] - https://gerrit.wikimedia.org/r/959956 (owner: Majavah)
[08:59:40] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for looking into this!" [puppet] - https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: Cwhite)
[09:06:46] <moritzm>	 !log installing perf updates on buster hosts
[09:06:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:47] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 2 others: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (JMeybohm)
[09:07:51] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm)
[09:08:26] <wikibugs>	 (PS1) Sohom Datta: Make sure different key values are handled while submitting [extensions/PageTriage] (wmf/1.41.0-wmf.27) - https://gerrit.wikimedia.org/r/959986 (https://phabricator.wikimedia.org/T345496)
[09:09:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:11:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:11:34] <wikibugs>	 (CR) Majavah: [C: +1] "We've had this on Toolforge for a while and it hasn't caused any problems there. Let's not deploy this on a Friday though." [puppet] - https://gerrit.wikimedia.org/r/959894 (owner: Muehlenhoff)
[09:12:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 15 hosts with reason: Schema change
[09:12:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 15 hosts with reason: Schema change
[09:12:50] <wikibugs>	 (CR) Muehlenhoff: ssh: Disable ChallengeResponseAuthentication for cloud (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959894 (owner: Muehlenhoff)
[09:13:41] <moritzm>	 !log installing perf updates on bookworm hosts
[09:13:44] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:13] <wikibugs>	 (Abandoned) Filippo Giunchedi: thanos: bump max open files for query/rule/compact [puppet] - https://gerrit.wikimedia.org/r/959674 (https://phabricator.wikimedia.org/T346950) (owner: Filippo Giunchedi)
[09:14:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:18:22] <Amir1>	 !log dbmaint on s6@eqiad (T343198)
[09:18:26] <stashbot>	 Amir1: Failed to log message to wiki. Somebody should check the error logs.
[09:18:27] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[09:18:45] <Amir1>	 !log dbmaint on s6@eqiad (T343198)
[09:18:48] <stashbot>	 Amir1: Failed to log message to wiki. Somebody should check the error logs.
[09:19:39] <Amir1>	 ah, is it because wikitech is outside production and technically pooled?
[09:21:42] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: bump store max open files [puppet] - https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950)
[09:21:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 16 hosts with reason: Schema change
[09:22:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 16 hosts with reason: Schema change
[09:22:59] <Amir1>	 !log dbmaint on s2@eqiad (T343198)
[09:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:19] <wikibugs>	 (CR) Muehlenhoff: [C: +1] firewall: add 'none' provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959739 (owner: Majavah)
[09:25:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (VRiley-WMF) @RKemper @bking   Hey there! I was wondering if you please verify the racking proposal. It is listed to have the racking locations...  wdqs1006 (row A) to be repl...
[09:27:35] <wikibugs>	 (CR) Brouberol: Define a script in charge of checking the kafka broker in sync status (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[09:27:53] <wikibugs>	 (CR) Brouberol: [C: +2] Define a script in charge of checking the kafka broker in sync status [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[09:28:32] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/959896 (owner: Muehlenhoff)
[09:29:20] <wikibugs>	 (CR) Jbond: [C: +1] ssh: Disable ChallengeResponseAuthentication for cloud [puppet] - https://gerrit.wikimedia.org/r/959894 (owner: Muehlenhoff)
[09:29:26] <wikibugs>	 (CR) Jbond: [C: +1] ssh: Disable ChallengeResponseAuthentication [puppet] - https://gerrit.wikimedia.org/r/959895 (owner: Muehlenhoff)
[09:30:44] <wikibugs>	 (CR) Jbond: [C: +1] LVS: Set profile::firewall::provider: none [puppet] - https://gerrit.wikimedia.org/r/959954 (owner: Muehlenhoff)
[09:32:00] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (Stevemunene) Hi @odimitrijevic , Requesting approval for adding the `analytics-wmde` user to analtyics-...
[09:34:13] <wikibugs>	 (CR) Stevemunene: admin: Create analytics-wmde system user and airflow admin group (2 comments) [puppet] - https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[09:36:27] <wikibugs>	 (PS2) Filippo Giunchedi: thanos: don't manage limitnofile for thanos-store [puppet] - https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950)
[09:43:23] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet
[09:43:48] <logmsgbot>	 !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudcumin1001.eqiad.wmnet
[09:45:08] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet
[09:45:11] <logmsgbot>	 !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudcumin1001.eqiad.wmnet
[09:48:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:49:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:50:04] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet
[09:52:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:53:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:53:38] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet
[09:55:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:55:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:56:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:57:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:00:37] <fabfur>	 !log repool cp1090 (T346874)
[10:00:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:51] <stashbot>	 T346874: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874
[10:04:59] <wikibugs>	 (PS1) Muehlenhoff: firewall: Default provider to none [puppet] - https://gerrit.wikimedia.org/r/960011 (https://phabricator.wikimedia.org/T336497)
[10:06:35] <wikibugs>	 (PS3) Filippo Giunchedi: thanos: don't manage limitnofile for thanos-store [puppet] - https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950)
[10:07:55] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/960011 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[10:08:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:21:16] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Define a script in charge of checking the kafka broker in sync status (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[10:22:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1001.eqiad.wmnet
[10:26:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1001.eqiad.wmnet
[10:27:21] <wikibugs>	 (CR) EoghanGaffney: [C: +2] Add new apt-staging host to insetup role [puppet] - https://gerrit.wikimedia.org/r/959807 (owner: EoghanGaffney)
[10:41:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:42:27] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (VRiley-WMF) wdqs1017 D 2. U 38. CableID 230304500154. Port 24 wdqs1018 E 2. U 40. CableID 230304500260. Port  32 wdqs1019 F 2. U 39. CableID 230304500198 Port 32
[10:42:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:43:44] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:46:11] <wikibugs>	 (PS1) Muehlenhoff: Move os-reports to the puppetdb host(s) [puppet] - https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214)
[10:50:21] <wikibugs>	 (PS3) Jbond: templates/diffs: escape parameters [software/puppet-compiler] - https://gerrit.wikimedia.org/r/959827 (https://phabricator.wikimedia.org/T346216)
[10:50:23] <wikibugs>	 (PS1) Jbond: 2.5.7: prepare releaser [software/puppet-compiler] - https://gerrit.wikimedia.org/r/960016 (https://phabricator.wikimedia.org/T346216)
[10:50:28] <wikibugs>	 (CR) Jbond: templates/diffs: escape parameters (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/959827 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[10:51:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (VRiley-WMF) Hey @ssingh   If possible, please let us know the racking configuration for these devices.   Currently, all the NVMe SSD have been installed into these servers and awaiting...
[10:55:03] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: Muehlenhoff)
[10:56:46] <wikibugs>	 (CR) Jbond: [C: -1] "lgtm but need to update the connect() params" [puppet] - https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: Muehlenhoff)
[10:57:07] <wikibugs>	 (CR) Jbond: [C: +2] templates/diffs: escape parameters [software/puppet-compiler] - https://gerrit.wikimedia.org/r/959827 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[10:57:10] <wikibugs>	 (CR) Jbond: [C: +2] 2.5.7: prepare releaser [software/puppet-compiler] - https://gerrit.wikimedia.org/r/960016 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[10:59:54] <wikibugs>	 (Merged) jenkins-bot: templates/diffs: escape parameters [software/puppet-compiler] - https://gerrit.wikimedia.org/r/959827 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[10:59:56] <wikibugs>	 (Merged) jenkins-bot: 2.5.7: prepare releaser [software/puppet-compiler] - https://gerrit.wikimedia.org/r/960016 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[11:01:41] <wikibugs>	 (CR) Muehlenhoff: Move os-reports to the puppetdb host(s) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: Muehlenhoff)
[11:01:45] <wikibugs>	 (PS2) Muehlenhoff: Move os-reports to the puppetdb host(s) [puppet] - https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214)
[11:03:25] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: Muehlenhoff)
[11:03:53] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: Muehlenhoff)
[11:04:31] <wikibugs>	 (Abandoned) Jbond: do not merge: test change for pcc [puppet] - https://gerrit.wikimedia.org/r/959831 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[11:09:16] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host apt-staging2001.codfw.wmnet with OS bookworm
[11:16:41] <wikibugs>	 (PS1) Jbond: puppet_compiler: bump version to 2.5.7 [puppet] - https://gerrit.wikimedia.org/r/960022 (https://phabricator.wikimedia.org/T346216)
[11:17:13] <wikibugs>	 (PS1) Majavah: P:wmcs::prometheus: fix openstack-exporter [puppet] - https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439)
[11:17:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:18:01] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (Stevemunene)
[11:18:07] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (Stevemunene)
[11:18:32] <wikibugs>	 (CR) Majavah: P:wmcs::prometheus: fix openstack-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) (owner: Majavah)
[11:18:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:20:37] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: bump version to 2.5.7 [puppet] - https://gerrit.wikimedia.org/r/960022 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[11:21:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:21:36] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43463/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) (owner: Majavah)
[11:22:07] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Move os-reports to the puppetdb host(s) [puppet] - https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: Muehlenhoff)
[11:22:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:25:28] <wikibugs>	 (PS1) Jbond: DO NOT MERGE: test author [puppet] - https://gerrit.wikimedia.org/r/960024
[11:25:41] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on apt-staging2001.codfw.wmnet with reason: host reimage
[11:26:30] <wikibugs>	 (Abandoned) Jbond: DO NOT MERGE: test author [puppet] - https://gerrit.wikimedia.org/r/960024 (owner: Jbond)
[11:28:53] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apt-staging2001.codfw.wmnet with reason: host reimage
[11:30:42] <wikibugs>	 SRE, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, Growth-Team (Current Sprint), Performance Issue: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Joe) @Urbanecm_WMF as I said on IRC, there's two main differences when running...
[11:34:27] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43464/console" [puppet] - https://gerrit.wikimedia.org/r/959831 (https://phabricator.wikimedia.org/T346216) (owner: Jbond)
[11:37:23] <wikibugs>	 (PS2) Muehlenhoff: dragonfly::dfdaemon: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/951079
[11:37:52] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43465/console" [puppet] - https://gerrit.wikimedia.org/r/957254 (https://phabricator.wikimedia.org/T200616) (owner: Majavah)
[11:41:34] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[11:41:47] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[11:42:13] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apt-staging2001.codfw.wmnet with OS bookworm
[11:45:05] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) (owner: Majavah)
[11:46:24] <wikibugs>	 (PS8) Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741)
[11:46:35] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] P:wmcs::prometheus: fix openstack-exporter [puppet] - https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) (owner: Majavah)
[11:49:10] <wikibugs>	 (CR) CI reject: [V: -1] [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: Brouberol)
[11:49:47] <wikibugs>	 (PS1) Majavah: cr-cloud: Drop cloudmetrics excemptions [homer/public] - https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266)
[11:49:52] <wikibugs>	 (PS1) Majavah: hieradata: drop dmz_cidr excemptions for cloudmetrics1003/4 [puppet] - https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266)
[11:50:27] <wikibugs>	 (PS9) Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741)
[11:52:31] <wikibugs>	 (PS1) Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] - https://gerrit.wikimedia.org/r/960029
[11:54:12] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] Merge branch 'master' into 2.x [software/puppet-compiler] - https://gerrit.wikimedia.org/r/960029 (owner: Jbond)
[11:54:49] <wikibugs>	 (Abandoned) Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] - https://gerrit.wikimedia.org/r/960029 (owner: Jbond)
[11:56:58] <wikibugs>	 (PS1) Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/960031
[11:57:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:57:52] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951079 (owner: Muehlenhoff)
[11:58:16] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad
[11:58:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:58:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:58:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Merge branch 'master' into 2.x [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/960031 (owner: 10Jbond)
[12:00:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:00:28] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] cr-cloud: Drop cloudmetrics excemptions [homer/public] - 10https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah)
[12:04:29] <wikibugs>	 (03PS9) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373)
[12:06:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/960011 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:08:59] <wikibugs>	 (03PS1) 10Muehlenhoff: firewall: Also support Stdlib::Port::Unprivileged in Ferm::Port [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497)
[12:10:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373) (owner: 10Jbond)
[12:11:29] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:12:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:13:50] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet
[12:13:59] <wikibugs>	 (03PS1) 10Jelto: gitlab: use one sshkey for gitlab and remove suffix [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107)
[12:14:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:15:04] <wikibugs>	 (03PS1) 10Jbond: pcc: move pcc1004 to pcc version 3 [puppet] - 10https://gerrit.wikimedia.org/r/960035 (https://phabricator.wikimedia.org/T236373)
[12:15:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pcc: move pcc1004 to pcc version 3 [puppet] - 10https://gerrit.wikimedia.org/r/960035 (https://phabricator.wikimedia.org/T236373) (owner: 10Jbond)
[12:16:14] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43466/console" [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto)
[12:16:23] <wikibugs>	 (03PS1) 10Muehlenhoff: webperf: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960036
[12:16:36] <wikibugs>	 (03PS2) 10Muehlenhoff: webperf: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960036
[12:17:36] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet
[12:18:39] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:18:40] <wikibugs>	 (03PS1) 10Jbond: test pcc: [puppet] - 10https://gerrit.wikimedia.org/r/960037
[12:21:48] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960036 (owner: 10Muehlenhoff)
[12:23:12] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad
[12:29:09] <wikibugs>	 (03PS10) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741)
[12:35:37] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (10ayounsi) I'm going to briefly re-purpose that host as ganeti-test2004 for some tests. cc T345602 let me know if there is any issue.
[12:36:09] <wikibugs>	 (03Abandoned) 10Muehlenhoff: os-reports: Stop configuring a puppetdb server and switch to discovery record [puppet] - 10https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff)
[12:36:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] webperf: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960036 (owner: 10Muehlenhoff)
[12:43:12] <wikibugs>	 (03PS11) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741)
[12:45:25] <wikibugs>	 (03CR) 10Jbond: "did you see an issue some where.  this should not be needed as Stdlib::Port covers Stdlib::Port::UnPrivileged" [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:45:51] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[12:47:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10BBlack)
[12:48:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10BBlack) @VRiley-WMF - Sukhbir's out right now, but I've updated the racking plan on his behalf!
[12:48:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway)
[12:50:04] <wikibugs>	 (03PS1) 10JMeybohm: prometheus::k8s: Discover calico-felix targets from k8s api [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915)
[12:52:00] <wikibugs>	 (03CR) 10Muehlenhoff: firewall: Also support Stdlib::Port::Unprivileged in Ferm::Port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:54:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:56:13] <wikibugs>	 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10JAllemandou) It feels wrong to me to be willing to return all page views on a date: the result set would be enormous and wouldn't b...
[12:59:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[13:00:07] <wikibugs>	 (03PS2) 10JMeybohm: prometheus::k8s: Discover calico-felix targets from k8s api [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915)
[13:00:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10BBlack) Reading a little deeper on this, I think we still have a hostnames issue.  If those other 8 hosts are indeed being brought from ulsfo+eqsin.  Those 8 hosts, I presume, would be...
[13:03:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10BBlack) Adding to the confusion: historically, we once used the hostname `cp1099` back in 2015 for a one-off host: T96873 - therefore that name already exists in both phab and git hist...
[13:07:41] <jnuche>	 _joe_: sorry for the bad timing yesterday
[13:07:51] <jnuche>	 Clément mentioned you had an issue with Scap running an explicit helm rollback when it's not required
[13:07:58] <jnuche>	 do you have logs or a task with details?
[13:08:22] <_joe_>	 jnuche: not yet, it's been a quite hectic week
[13:09:16] <jnuche>	 _joe_: no worries, just wanted to get a better sense of the problem
[13:09:33] <jnuche>	 please tag me on the task when you get a chance
[13:09:52] <jnuche>	 and hope things are calming down a bit!
[13:11:49] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43468/console" [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm)
[13:13:44] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:14:18] <wikibugs>	 (03PS1) 10JMeybohm: prometheus::k8s: Drop puppet class names [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915)
[13:14:46] <wikibugs>	 (03PS1) 10Filippo Giunchedi: otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712)
[13:15:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I _think_ this is the minimal configuration to get Prometheus to scrape otel-coll, please let me know what you think" [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[13:19:08] <wikibugs>	 (03CR) 10JMeybohm: otel-coll: enable prometheus scraping (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[13:19:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[13:19:46] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43469/console" [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm)
[13:19:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Nice cleanup, not tested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm)
[13:19:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::k8s: Discover calico-felix targets from k8s api [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm)
[13:20:09] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:21:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::k8s: Drop puppet class names [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm)
[13:22:29] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey)
[13:29:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10taavi)
[13:29:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10taavi)
[13:31:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) 05In progress→03Resolved
[13:31:19] <wikibugs>	 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10cmooney)
[13:32:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:37:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:38:25] <wikibugs>	 (03PS2) 10Filippo Giunchedi: otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712)
[13:39:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: otel-coll: enable prometheus scraping (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi)
[13:39:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED
[13:44:05] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Makes sense thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950) (owner: 10Filippo Giunchedi)
[13:45:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:45:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED
[13:46:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:48:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: thanos: remove thanos components from thanos::frontend role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[13:50:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: don't manage limitnofile for thanos-store [puppet] - 10https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950) (owner: 10Filippo Giunchedi)
[13:50:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:53:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:02:30] <wikibugs>	 (03PS1) 10Hashar: envoyproxy: remove skip_install from tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/960062
[14:04:14] <wikibugs>	 (03PS1) 10Hashar: mcrouter: remove skip_install from tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/960063
[14:08:44] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:55] <wikibugs>	 (03PS1) 10Hashar: Remove minversion=1.6 from tox.ini files [puppet] - 10https://gerrit.wikimedia.org/r/960064
[14:10:18] <wikibugs>	 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Eevans) >>! In T343855#9190791, @JAllemandou wrote: > It feels wrong to me to be willing to return all page views on a date: the re...
[14:15:42] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Make wikifunctionswiki a multilingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960066 (https://phabricator.wikimedia.org/T342857)
[14:18:44] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:22:06] <wikibugs>	 (03PS1) 10Hashar: tox.ini: remove skipsdist [software/conftool] - 10https://gerrit.wikimedia.org/r/960068 (https://phabricator.wikimedia.org/T346238)
[14:23:39] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[14:28:39] <jinxer-wm>	 (KeyholderUnarmed) resolved: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[14:41:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:42:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:44:32] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/945872/43471/vrts1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth)
[14:45:17] <wikibugs>	 (03PS36) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[14:45:44] <wikibugs>	 (03PS1) 10Urbanecm: listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120)
[14:49:25] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED
[14:49:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc1015']
[14:51:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jclark-ctr)
[14:54:42] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED
[14:56:40] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:56:54] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:57:12] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:57:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED
[14:57:33] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:57:39] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] tox.ini: remove skipsdist [software/conftool] - 10https://gerrit.wikimedia.org/r/960068 (https://phabricator.wikimedia.org/T346238) (owner: 10Hashar)
[14:58:06] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:58:30] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:58:39] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pc1015']
[15:00:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:01:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:02:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:03:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm)
[15:03:45] <wikibugs>	 (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend 14th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139)
[15:05:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:08:48] <wikibugs>	 (03PS1) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[15:08:50] <wikibugs>	 (03PS1) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076
[15:09:03] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, I trust you know better than me that this is no longer needed :).  Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah)
[15:09:19] <wikibugs>	 (03PS2) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[15:09:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:09:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:11:42] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED
[15:12:57] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1247']
[15:13:09] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:13:28] <wikibugs>	 (03PS37) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[15:13:50] <logmsgbot>	 !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 23.9.1 - T346737
[15:13:59] <logmsgbot>	 !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 23.9.1 - T346737 (duration: 00m 09s)
[15:14:05] <denisse>	 !log upgrading LibreNMS to 23.9.1
[15:15:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:16:27] <denisse>	 !log upgrading LibreNMS in codfw
[15:17:49] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-check-services.service,librenms-discovery-new.service,librenms-poll-billing.service,librenms-poller-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:20:37] <icinga-wm>	 RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:21:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:21:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks! Would you please review ops/homer/public.git and see if there is a pending cleanup related to this?" [puppet] - 10https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah)
[15:21:41] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1247']
[15:21:46] <wikibugs>	 (03CR) 10Majavah: "Thanks all! I plan to deploy this on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah)
[15:23:09] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:24:52] <denisse>	 !log upgrading LibreNMS in eqiad
[15:24:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-cloud: Drop cloudmetrics excemptions [homer/public] - 10https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah)
[15:24:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jclark-ctr)
[15:26:03] <wikibugs>	 (03PS1) 10Bking: dse-k8s: Manually restore flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/960080 (https://phabricator.wikimedia.org/T342149)
[15:27:21] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] dse-k8s: Manually restore flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/960080 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking)
[15:27:52] <wikibugs>	 (03PS2) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[15:27:54] <wikibugs>	 (03PS3) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[15:28:01] <wikibugs>	 (03CR) 10Bking: [C: 03+2] dse-k8s: Manually restore flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/960080 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking)
[15:28:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:28:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:30:49] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[15:31:02] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[15:33:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jclark-ctr)
[15:34:25] <wikibugs>	 (03PS4) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[15:34:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:34:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:36:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:38:18] <wikibugs>	 (03PS1) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960087 (https://phabricator.wikimedia.org/T342149)
[15:38:55] <wikibugs>	 (03PS3) 10Cwhite: Restore access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/959771 (https://phabricator.wikimedia.org/T346796)
[15:39:47] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960087 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking)
[15:39:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:40:14] <wikibugs>	 (03PS1) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960090 (https://phabricator.wikimedia.org/T342149)
[15:41:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:41:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:41:43] <wikibugs>	 (03PS2) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960087 (https://phabricator.wikimedia.org/T342149)
[15:41:59] <wikibugs>	 (03CR) 10Jbond: "ill move the current puppetdbs to insetup next week and then we can merge this" [puppet] - 10https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff)
[15:41:59] <wikibugs>	 10SRE, 10Traffic: Allow Varnish to be called on UDS from services other than HAProxy - https://phabricator.wikimedia.org/T347059 (10Fabfur) 05Open→03Declined
[15:42:13] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] Restore access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/959771 (https://phabricator.wikimedia.org/T346796) (owner: 10Cwhite)
[15:43:10] <wikibugs>	 (03Abandoned) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960090 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking)
[15:43:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding ganeti-test server to codfw - jhancock@cumin2002"
[15:43:18] <wikibugs>	 (03CR) 10Bking: [C: 03+2] dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960087 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking)
[15:43:23] <wikibugs>	 (03PS3) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[15:43:25] <wikibugs>	 (03PS5) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[15:43:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:44:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding ganeti-test server to codfw - jhancock@cumin2002"
[15:44:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:44:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:44:19] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10colewhite) 05Open→03Resolved a:03colewhite Restored the level of access held before last contract expired.  Please feel free to reopen if yo...
[15:45:46] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[15:45:54] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[15:46:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:48:00] <wikibugs>	 (03PS4) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[15:48:02] <wikibugs>	 (03PS6) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[15:48:25] <wikibugs>	 (03PS2) 10Cwhite: admin: add mabualruz to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/958982 (https://phabricator.wikimedia.org/T342535)
[15:48:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:48:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[15:48:43] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) p:05Triage→03Medium
[15:49:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:50:09] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:50:12] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] admin: add mabualruz to deployment group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958982 (https://phabricator.wikimedia.org/T342535) (owner: 10Cwhite)
[15:51:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:51:25] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10colewhite) 05Open→03Resolved The group membership change has been deployed.  Please feel free to reopen if you encounter any...
[15:53:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:54:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:09] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:59:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:01:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:35:12] <wikibugs>	 (03PS2) 10Jdlrobson: WIP: Wordmarks for Wikinews projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959876 (https://phabricator.wikimedia.org/T341258)
[16:36:44] <wikibugs>	 (03CR) 10FNegri: "I cherry-picked this change on top of my patch and it compiles successfully. Feel free to merge this one and I'll rebase my patch on top." [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999)
[16:46:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:48:12] <wikibugs>	 (03PS5) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[16:48:14] <wikibugs>	 (03PS7) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[16:48:31] <wikibugs>	 (03PS5) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938)
[16:48:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[16:49:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[16:49:51] <wikibugs>	 (03PS5) 10Jdlrobson: Logos for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242)
[16:50:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:52] <wikibugs>	 (03PS6) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[16:53:54] <wikibugs>	 (03PS8) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[16:54:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[16:54:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[16:58:36] <wikibugs>	 10SRE, 10Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (10Fabfur)
[16:58:39] <wikibugs>	 10SRE, 10Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (10Fabfur)
[17:02:55] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:08:04] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for dr0ptp4kt - https://phabricator.wikimedia.org/T347110 (10dr0ptp4kt) Thanks @colewhite , I checked in a few places like mwlog1002, deploy1002, and mwmaint1002 and it looks to be in working order. Have a great weekend!
[17:09:52] <wikibugs>	 (03PS1) 10Cathal Mooney: Temporarily adjust EVPN outbound policy to CRs to block existing nets [homer/public] - 10https://gerrit.wikimedia.org/r/960109 (https://phabricator.wikimedia.org/T347191)
[17:12:33] <wikibugs>	 (03CR) 10Peter Fischer: "All SUP-config-related PRs are merged, thank you! Do you think you could adapt the chart?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson)
[17:13:44] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[17:14:21] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 1.595 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[17:15:49] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 54549 bytes in 6.165 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[17:17:55] <wikibugs>	 (03PS1) 10Fabfur: vanish: allow PURGE requests only from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192)
[17:20:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:20:38] <wikibugs>	 (03CR) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:21:08] <wikibugs>	 (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney)
[17:22:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:24:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:27:26] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10dr0ptp4kt) I checked on, for example, an-launcher1002, and look to be in place. Thanks here as well @colewhite!
[17:29:54] <wikibugs>	 (03PS7) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[17:29:56] <wikibugs>	 (03PS9) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[17:30:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:30:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:32:24] <logmsgbot>	 !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@a30e944]: (no justification provided)
[17:32:34] <logmsgbot>	 !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@a30e944]: (no justification provided) (duration: 00m 09s)
[17:33:51] <wikibugs>	 (03PS1) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463)
[17:34:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking)
[17:34:27] <wikibugs>	 (03PS8) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[17:34:29] <wikibugs>	 (03PS10) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[17:34:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:35:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:35:54] <wikibugs>	 (03PS2) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463)
[17:36:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking)
[17:38:51] <wikibugs>	 (03PS3) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463)
[17:39:21] <wikibugs>	 (03CR) 10jenkins-bot: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking)
[17:39:51] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43484/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:42:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:43:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:43:32] <wikibugs>	 (03PS9) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[17:43:34] <wikibugs>	 (03PS11) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[17:44:03] <wikibugs>	 (03PS4) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463)
[17:44:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:48:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43485/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:51:11] <wikibugs>	 (03PS10) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[17:51:13] <wikibugs>	 (03PS12) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[17:51:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:53:10] <wikibugs>	 (03PS11) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[17:53:12] <wikibugs>	 (03PS13) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[17:56:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:56:54] <wikibugs>	 (03PS14) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[17:57:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[17:59:33] <wikibugs>	 (03PS15) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:06:28] <wikibugs>	 (03PS12) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[18:06:30] <wikibugs>	 (03PS16) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:07:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:07:28] <wikibugs>	 (03PS17) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:12:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43488/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:17:56] <wikibugs>	 (03PS13) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[18:17:58] <wikibugs>	 (03PS18) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:18:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:18:44] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:20:02] <wikibugs>	 (03PS14) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[18:20:04] <wikibugs>	 (03PS19) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:23:22] <wikibugs>	 (03PS20) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:23:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43489/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:24:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:28:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43490/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:33:49] <wikibugs>	 (03PS15) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[18:33:51] <wikibugs>	 (03PS21) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:38:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:39:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43491/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:47:50] <wikibugs>	 (03PS16) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[18:47:52] <wikibugs>	 (03PS22) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:52:10] <wikibugs>	 (03PS17) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[18:52:12] <wikibugs>	 (03PS23) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:52:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:56:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[18:57:07] <wikibugs>	 (03PS6) 10Jdlrobson: Icons for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242)
[18:57:09] <wikibugs>	 (03PS1) 10Jdlrobson: WIP: Special wiki wordmarks and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250)
[18:57:18] <wikibugs>	 (03PS18) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[18:57:20] <wikibugs>	 (03PS24) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[18:57:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:00:50] <wikibugs>	 (03PS19) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[19:00:52] <wikibugs>	 (03PS25) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[19:02:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43495/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[19:03:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:04:56] <wikibugs>	 (PS1) Sharvaniharan: New stream for apps event schema [mediawiki-config] - https://gerrit.wikimedia.org/r/960123
[19:06:13] <wikibugs>	 (CR) CI reject: [V: -1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[19:06:17] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43496/console" [puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[19:06:57] <wikibugs>	 (Abandoned) Sharvaniharan: New stream for apps event schema [mediawiki-config] - https://gerrit.wikimedia.org/r/960123 (owner: Sharvaniharan)
[19:08:55] <wikibugs>	 (CR) Jbond: [V: +1] "hi all are you able to review this changes, the diff in the most recent pcc is related to sorting, with this new version being a i think a" [puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[19:10:05] <wikibugs>	 (PS1) Sharvaniharan: New donor experience stream for apps event schema [mediawiki-config] - https://gerrit.wikimedia.org/r/960124
[19:12:33] <wikibugs>	 (PS1) Jbond: prometheus::class_config: switch to wmflib::get_config [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373)
[19:13:32] <wikibugs>	 (PS3) Jdlrobson: Fix white background for Wikibooks wordmarks [mediawiki-config] - https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T341251) (owner: Pikne)
[19:13:34] <wikibugs>	 (PS2) Jdlrobson: Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251)
[19:16:56] <wikibugs>	 (PS2) Jbond: prometheus::class_config: switch to wmflib::get_config [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373)
[19:20:35] <wikibugs>	 (PS2) Cwhite: prometheus: add option to configure probe-specific params [puppet] - https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893)
[19:21:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:29:00] <wikibugs>	 (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/958981/43499/" [puppet] - https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: Cwhite)
[19:29:11] <wikibugs>	 (PS20) Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[19:29:13] <wikibugs>	 (PS26) Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[19:29:15] <wikibugs>	 (PS3) Jbond: prometheus: switch to wmflib::get_config [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373)
[19:32:21] <wikibugs>	 (PS4) Jbond: prometheus: switch to wmflib::get_config [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373)
[19:33:43] <wikibugs>	 (CR) CI reject: [V: -1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[19:33:59] <wikibugs>	 (PS1) Jbond: get_clusters: rmove legacy functions [puppet] - https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373)
[19:42:39] <wikibugs>	 (PS21) Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373)
[19:42:41] <wikibugs>	 (PS27) Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373)
[19:43:14] <wikibugs>	 (PS5) Jbond: prometheus: switch to wmflib::get_config [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373)
[19:43:23] <wikibugs>	 (PS2) Jbond: get_clusters: rmove legacy functions [puppet] - https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373)
[19:50:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:53:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:55:16] <wikibugs>	 (CR) Bking: "This change is ready for review." (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: Bking)
[19:59:03] <wikibugs>	 (CR) Jbond: "similar to the last i think the only diff is ordering" [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[20:05:32] <wikibugs>	 (Abandoned) Bking: elastic: introduce jbod-related config [puppet] - https://gerrit.wikimedia.org/r/959854 (https://phabricator.wikimedia.org/T231010) (owner: Bking)
[21:07:23] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (odimitrijevic) approved
[21:09:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:13:44] <jinxer-wm>	 (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[21:25:13] <wikibugs>	 (PS4) Ebernhardson: Pull some flink config down into the chart [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315)
[21:25:15] <wikibugs>	 (PS12) Ebernhardson: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960
[21:25:17] <wikibugs>	 (CR) Ebernhardson: cirrus streaming updater service (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (owner: Ebernhardson)
[21:29:30] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:58:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:59:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:09:40] <wikibugs>	 (PS1) Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257)
[22:09:44] <wikibugs>	 (PS1) Jdlrobson: WIP: Logos for Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/960148 (https://phabricator.wikimedia.org/T341257)
[22:10:25] <wikibugs>	 (CR) CI reject: [V: -1] WIP: Logos for Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/960148 (https://phabricator.wikimedia.org/T341257) (owner: Jdlrobson)
[22:18:44] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:55:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:56:13] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[22:56:26] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[22:56:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:09:49] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:09:53] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:15:35] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:21:15] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:22:03] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:52:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:53:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase