[00:05:33] (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:10:33] (DatasourceError) resolved: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:13:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P52573 and previous config saved to /var/cache/conftool/dbconfig/20230922-001316-arnaudb.json [00:15:02] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:35] (03PS3) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) [00:19:16] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:28:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P52574 and previous config saved to /var/cache/conftool/dbconfig/20230922-002823-arnaudb.json [00:29:08] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): 
/en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:34] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/958983 [00:38:31] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/958983 (owner: 10TrainBranchBot) [00:43:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T343198)', diff saved to https://phabricator.wikimedia.org/P52575 and previous config saved to /var/cache/conftool/dbconfig/20230922-004330-arnaudb.json [00:43:32] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [00:43:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [00:43:38] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [00:44:30] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:53:06] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:53:10] (03Merged) 
10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/958983 (owner: 10TrainBranchBot) [00:54:30] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:54:32] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:55:52] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:04:30] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [01:12:37] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [02:00:54] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:37] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:00] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of 
the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:22:37] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:12] RECOVERY - Check systemd state on dumpsdata1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:35:02] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2001:688:0:4::2d4) [02:35:18] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [02:35:22] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100% [02:40:28] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.39 ms [02:40:44] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 67.66 ms [02:40:44] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 83 probes of 770 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:40:48] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.92 ms [02:42:37] (JobUnavailable) firing: (7) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:58] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 10 probes of 770 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:47:46] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:49:12] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:52:02] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:53:28] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:28] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:49:30] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [04:51:38] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 
503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:53:04] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:54:44] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:56:08] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:57:46] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:59:12] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:59:50] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy kserve 0.11 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/959797 (https://phabricator.wikimedia.org/T346446) (owner: 10Ilias Sarantopoulos) [05:00:39] (03Merged) 10jenkins-bot: ml-services: deploy kserve 0.11 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/959797 (https://phabricator.wikimedia.org/T346446) (owner: 10Ilias Sarantopoulos) [05:02:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (for selecting the partition layout we already use a apt* globbing pattern in netboot.cfg which also matches the staging host)" [puppet] - 
10https://gerrit.wikimedia.org/r/959807 (owner: 10EoghanGaffney) [05:07:39] (03CR) 10Muehlenhoff: [C: 03+1] "I can confirm that based on discussion with Miriam and Martin back in June Aisha's old access was simply temporarily put on hold until the" [puppet] - 10https://gerrit.wikimedia.org/r/959771 (https://phabricator.wikimedia.org/T346796) (owner: 10Cwhite) [05:09:30] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [05:10:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (the access request also mentions analytics access), but that can still happen in a followup when/if clarified what is needed" [puppet] - 10https://gerrit.wikimedia.org/r/958982 (https://phabricator.wikimedia.org/T342535) (owner: 10Cwhite) [05:13:10] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [05:13:16] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[05:13:22] (03CR) 10Muehlenhoff: durum: Select the custom nginx provider with no additional modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959749 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [05:21:46] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:23:12] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:30:02] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:33:22] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:34:14] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:34:46] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:36:09] (03CR) 10Muehlenhoff: sshd: Disable keyboard-interactive authentication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956983 (owner: 10Tim Starling) [05:49:54] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to 
accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:50:18] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:52:44] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:53:10] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:55:44] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:57:10] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230922T0600) [06:08:43] (03CR) 10Elukey: [V: 03+2 C: 03+2] slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) (owner: 10Elukey) [06:13:00] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Applied:" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 
(https://phabricator.wikimedia.org/T346144) (owner: 10Elukey) [06:24:24] (03CR) 10Elukey: [C: 03+2] Delete the fastapi-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959757 (owner: 10Elukey) [06:25:00] (03Abandoned) 10Elukey: Set ores.wikimedia.org as CNAME for ores-legacy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957690 (owner: 10Elukey) [06:25:44] (03PS1) 10Marostegui: Revert "control-mariadb-10.4-bullseye: Bump version" [software] - 10https://gerrit.wikimedia.org/r/959779 [06:26:30] (03CR) 10Marostegui: [C: 03+2] Revert "control-mariadb-10.4-bullseye: Bump version" [software] - 10https://gerrit.wikimedia.org/r/959779 (owner: 10Marostegui) [06:27:00] (03Merged) 10jenkins-bot: Revert "control-mariadb-10.4-bullseye: Bump version" [software] - 10https://gerrit.wikimedia.org/r/959779 (owner: 10Marostegui) [06:27:26] (03PS1) 10Elukey: ml-services: remove special resource settings for eswiki-damaging [deployment-charts] - 10https://gerrit.wikimedia.org/r/959893 (https://phabricator.wikimedia.org/T346445) [06:32:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P52576 and previous config saved to /var/cache/conftool/dbconfig/20230922-063212-root.json [06:34:36] (03CR) 10Elukey: [C: 03+2] ml-services: remove special resource settings for eswiki-damaging [deployment-charts] - 10https://gerrit.wikimedia.org/r/959893 (https://phabricator.wikimedia.org/T346445) (owner: 10Elukey) [06:36:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1132', diff saved to https://phabricator.wikimedia.org/P52577 and previous config saved to /var/cache/conftool/dbconfig/20230922-063617-root.json [06:36:48] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:39:40] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:40:12] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:10] (03PS2) 10Phedenskog: alertmanager: setup QTE mailing group. [puppet] - 10https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) [06:43:10] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:43:45] (03CR) 10Phedenskog: alertmanager: setup QTE mailing group. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: 10Phedenskog) [06:43:50] (03PS1) 10Muehlenhoff: ssh: Disable ChallengeResponseAuthentication for cloud [puppet] - 10https://gerrit.wikimedia.org/r/959894 [06:43:52] (03PS1) 10Muehlenhoff: ssh: Disable ChallengeResponseAuthentication [puppet] - 10https://gerrit.wikimedia.org/r/959895 [06:43:54] (03PS1) 10Muehlenhoff: Remove config option for challenge response auth [puppet] - 10https://gerrit.wikimedia.org/r/959896 [06:44:28] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:06] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:22] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: 
sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:34] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:57:58] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230922T0700) [07:04:32] (03PS2) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [07:05:16] (03CR) 10CI reject: [V: 04-1] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [07:06:39] !log installing mutt security updates [07:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, this will work. 
Passing -O2 to CPPFLAGS is a little odd, I think the fully correct way to resolve this would be to re-export C" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999) [07:20:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/959739 (owner: 10Majavah) [07:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:24:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:25:51] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10MoritzMuehlenhoff) >>! In T344164#9186835, @Urbanecm wrote: > If needed, we can also start with a different part of the on... 
[07:27:47] (03CR) 10Volans: Define a script in charge of checking the kafka broker in sync status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [07:28:21] (03PS3) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [07:29:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:30:08] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:57] I am restarting Gerrit to apply a configuration setting [07:34:22] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:33] (03CR) 10Majavah: [C: 03+2] firewall: add 'none' provider [puppet] - 10https://gerrit.wikimedia.org/r/959739 (owner: 10Majavah) [07:36:36] !log Restarting Gerrit to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/953967 "Link account creation to IDM" # T345226 [07:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:43] T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 [07:40:49] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10hashar) I have restarted Gerrit and the 
{nav Sign up} link now points to https://idm.wikimedia.org/signup/ [07:44:14] (03CR) 10Majavah: [C: 03+1] "looks fine" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/959827 (owner: 10Jbond) [07:45:40] !log Upgrading CI Jenkins from 2.401.3 to 2.414.2 [07:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:48:23] (03CR) 10Brouberol: Define a script in charge of checking the kafka broker in sync status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [07:50:38] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:51:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:52:04] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:53:21] (03CR) 10EoghanGaffney: [C: 03+2] sync-gitlab-group-with-ldap: Use --yes flag [puppet] - 10https://gerrit.wikimedia.org/r/959881 (owner: 10Ahmon Dancy) [07:53:33] (03PS1) 10Giuseppe Lavagetto: thumbor: use base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959948 [07:57:05] (03PS4) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) [08:01:30] (03PS19) 10Brouberol: Define a script in charge of checking the kafka broker in sync status [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) [08:02:14] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:02:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10MGerlach) >>! In T346796#9189454, @colewhite wrote: > @MGerlach is there an expiry date for this contract renewal? The contract ends June 30, 2024. 
[08:03:40] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:06:28] (03PS9) 10Stevemunene: admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) [08:08:24] (03PS2) 10Muehlenhoff: Add profile::firewall::provider: none for roles where P:firewall is not applied [puppet] - 10https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) [08:09:06] (03CR) 10Majavah: [C: 03+1] Add profile::firewall::provider: none for roles where P:firewall is not applied [puppet] - 10https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:09:22] (03CR) 10Muehlenhoff: Add profile::firewall::provider: none for roles where P:firewall is not applied (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:09:41] (03CR) 10Klausman: [C: 03+1] profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [08:15:06] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for the heads up Ben! Idea and script LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [08:20:49] (03CR) 10Filippo Giunchedi: alertmanager: setup QTE mailing group. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: 10Phedenskog) [08:21:23] (03PS3) 10Filippo Giunchedi: alertmanager: setup QTE mailing group. 
[puppet] - 10https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: 10Phedenskog) [08:21:37] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] alertmanager: setup QTE mailing group. [puppet] - 10https://gerrit.wikimedia.org/r/959157 (https://phabricator.wikimedia.org/T346870) (owner: 10Phedenskog) [08:26:38] (03CR) 10Vgutierrez: [C: 03+1] profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [08:30:07] (03CR) 10Volans: Define a script in charge of checking the kafka broker in sync status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [08:32:02] (03PS1) 10Filippo Giunchedi: sre: add jaeger query/collector alerts [alerts] - 10https://gerrit.wikimedia.org/r/959950 (https://phabricator.wikimedia.org/T345712) [08:34:20] (03CR) 10Brouberol: Define a script in charge of checking the kafka broker in sync status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [08:34:22] (03CR) 10Muehlenhoff: [C: 03+2] Add profile::firewall::provider: none for roles where P:firewall is not applied [puppet] - 10https://gerrit.wikimedia.org/r/959759 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:34:25] (03PS1) 10Filippo Giunchedi: o11y: remove redundant '0m' time spec from prometheus alerts [alerts] - 10https://gerrit.wikimedia.org/r/959952 [08:36:01] (03PS2) 10Muehlenhoff: profile::cumin::cloud_target: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/959179 [08:38:10] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959807 (owner: 10EoghanGaffney) [08:40:23] (03PS1) 10Muehlenhoff: LVS: Set profile::firewall::provider: none [puppet] - 10https://gerrit.wikimedia.org/r/959954 [08:40:25] (03CR) 
10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959179 (owner: 10Muehlenhoff) [08:40:28] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/959966 (https://phabricator.wikimedia.org/T347140) [08:41:23] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: remove redundant '0m' time spec from prometheus alerts [alerts] - 10https://gerrit.wikimedia.org/r/959952 (owner: 10Filippo Giunchedi) [08:43:56] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) >>! In T344164#9189953, @MoritzMuehlenhoff wrote: >>>! In T344164#9186835, @Urbanecm wrote: >> If needed, we can... [08:46:57] (03PS1) 10Majavah: Fix puppet on cloudvirt-wdqs* until they have been moved [puppet] - 10https://gerrit.wikimedia.org/r/959955 (https://phabricator.wikimedia.org/T346948) [08:47:00] (03PS1) 10Majavah: P:wmcs::cloud_private_subnet: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/959956 [08:48:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Fix puppet on cloudvirt-wdqs* until they have been moved [puppet] - 10https://gerrit.wikimedia.org/r/959955 (https://phabricator.wikimedia.org/T346948) (owner: 10Majavah) [08:48:35] (03CR) 10Majavah: [C: 03+2] Fix puppet on cloudvirt-wdqs* until they have been moved [puppet] - 10https://gerrit.wikimedia.org/r/959955 (https://phabricator.wikimedia.org/T346948) (owner: 10Majavah) [08:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:51:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 20 hosts with reason: Schema change [08:51:35] !log ladsgroup@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 20 hosts with reason: Schema change [08:51:55] !log dbmaint on s4@eqiad (T343198) [08:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:02] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [08:54:30] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [08:55:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:56:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:wmcs::cloud_private_subnet: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/959956 (owner: 10Majavah) [08:57:59] 10SRE-Sprint-Week-Sustainability-March2023, 10ChangeProp, 10Prod-Kubernetes, 10serviceops, and 2 others: Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (10JMeybohm) [08:58:27] (03CR) 10Majavah: [C: 03+2] P:wmcs::cloud_private_subnet: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/959956 (owner: 10Majavah) [08:59:40] (03CR) 10Filippo Giunchedi: "Thank you for looking into this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [09:06:46] !log installing perf updates on buster hosts [09:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (10JMeybohm) [09:07:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10JMeybohm) [09:08:26] (03PS1) 10Sohom Datta: Make sure different key values are handled while submitting [extensions/PageTriage] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959986 (https://phabricator.wikimedia.org/T345496) [09:09:44] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:11:08] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:11:34] (03CR) 10Majavah: [C: 03+1] "We've had this on Toolforge for a while and it hasn't caused any problems there. Let's not deploy this on a Friday though." 
[puppet] - 10https://gerrit.wikimedia.org/r/959894 (owner: 10Muehlenhoff) [09:12:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 15 hosts with reason: Schema change [09:12:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 15 hosts with reason: Schema change [09:12:50] (03CR) 10Muehlenhoff: ssh: Disable ChallengeResponseAuthentication for cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959894 (owner: 10Muehlenhoff) [09:13:41] !log installing perf updates on bookworm hosts [09:13:44] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [09:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:13] (03Abandoned) 10Filippo Giunchedi: thanos: bump max open files for query/rule/compact [puppet] - 10https://gerrit.wikimedia.org/r/959674 (https://phabricator.wikimedia.org/T346950) (owner: 10Filippo Giunchedi) [09:14:30] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [09:18:22] !log dbmaint on s6@eqiad (T343198) [09:18:26] Amir1: Failed to log message to wiki. Somebody should check the error logs. [09:18:27] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:18:45] !log dbmaint on s6@eqiad (T343198) [09:18:48] Amir1: Failed to log message to wiki. Somebody should check the error logs. [09:19:39] ah, is it because wikitech is outside production and technically pooled? 
[09:21:42] (03PS1) 10Filippo Giunchedi: thanos: bump store max open files [puppet] - 10https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950) [09:21:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 16 hosts with reason: Schema change [09:22:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 16 hosts with reason: Schema change [09:22:59] !log dbmaint on s2@eqiad (T343198) [09:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:19] (03CR) 10Muehlenhoff: [C: 03+1] firewall: add 'none' provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959739 (owner: 10Majavah) [09:25:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (10VRiley-WMF) @RKemper @bking Hey there! I was wondering if you please verify the racking proposal. It is listed to have the racking locations... wdqs1006 (row A) to be repl... 
[09:27:35] (03CR) 10Brouberol: Define a script in charge of checking the kafka broker in sync status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [09:27:53] (03CR) 10Brouberol: [C: 03+2] Define a script in charge of checking the kafka broker in sync status [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [09:28:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/959896 (owner: 10Muehlenhoff) [09:29:20] (03CR) 10Jbond: [C: 03+1] ssh: Disable ChallengeResponseAuthentication for cloud [puppet] - 10https://gerrit.wikimedia.org/r/959894 (owner: 10Muehlenhoff) [09:29:26] (03CR) 10Jbond: [C: 03+1] ssh: Disable ChallengeResponseAuthentication [puppet] - 10https://gerrit.wikimedia.org/r/959895 (owner: 10Muehlenhoff) [09:30:44] (03CR) 10Jbond: [C: 03+1] LVS: Set profile::firewall::provider: none [puppet] - 10https://gerrit.wikimedia.org/r/959954 (owner: 10Muehlenhoff) [09:32:00] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (10Stevemunene) Hi @odimitrijevic , Requesting approval for adding the `analytics-wmde` user to analtyics-... 
[09:34:13] (03CR) 10Stevemunene: admin: Create analytics-wmde system user and airflow admin group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [09:36:27] (03PS2) 10Filippo Giunchedi: thanos: don't manage limitnofile for thanos-store [puppet] - 10https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950) [09:43:23] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet [09:43:48] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudcumin1001.eqiad.wmnet [09:45:08] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet [09:45:11] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudcumin1001.eqiad.wmnet [09:48:10] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:49:34] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:50:04] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet [09:52:10] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:53:34] RECOVERY - restbase endpoints health on 
restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:53:38] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet [09:55:00] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:55:14] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:56:38] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:57:52] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:00:37] !log repool cp1090 (T346874) [10:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:51] T346874: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874 [10:04:59] (03PS1) 10Muehlenhoff: firewall: Default provider to none [puppet] - 10https://gerrit.wikimedia.org/r/960011 (https://phabricator.wikimedia.org/T336497) [10:06:35] (03PS3) 10Filippo Giunchedi: thanos: don't manage limitnofile for thanos-store [puppet] - 10https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950) [10:07:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/960011 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:08:39] (KeyholderUnarmed) firing: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:21:16] (03CR) 10Filippo Giunchedi: [C: 03+1] Define a script in charge of checking the kafka broker in sync status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959162 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [10:22:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1001.eqiad.wmnet [10:26:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1001.eqiad.wmnet [10:27:21] (03CR) 10EoghanGaffney: [C: 03+2] Add new apt-staging host to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/959807 (owner: 10EoghanGaffney) [10:41:30] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:42:27] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10VRiley-WMF) wdqs1017 D 2. U 38. CableID 230304500154. Port 24 wdqs1018 E 2. U 40. CableID 230304500260. Port 32 wdqs1019 F 2. U 39. 
CableID 230304500198 Port 32 [10:42:56] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:43:44] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:46:11] (03PS1) 10Muehlenhoff: Move os-reports to the puppetdb host(s) [puppet] - 10https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) [10:50:21] (03PS3) 10Jbond: templates/diffs: escape parameters [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/959827 (https://phabricator.wikimedia.org/T346216) [10:50:23] (03PS1) 10Jbond: 2.5.7: prepare releaser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/960016 (https://phabricator.wikimedia.org/T346216) [10:50:28] (03CR) 10Jbond: templates/diffs: escape parameters (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/959827 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [10:51:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) Hey @ssingh If possible, please let us know the racking configuration for these devices. Currently, all the NVMe SSD have been installed into these servers and awaiting... 
[10:55:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [10:56:46] (03CR) 10Jbond: [C: 04-1] "lgtm but need to update the connect() params" [puppet] - 10https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [10:57:07] (03CR) 10Jbond: [C: 03+2] templates/diffs: escape parameters [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/959827 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [10:57:10] (03CR) 10Jbond: [C: 03+2] 2.5.7: prepare releaser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/960016 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [10:59:54] (03Merged) 10jenkins-bot: templates/diffs: escape parameters [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/959827 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [10:59:56] (03Merged) 10jenkins-bot: 2.5.7: prepare releaser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/960016 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [11:01:41] (03CR) 10Muehlenhoff: Move os-reports to the puppetdb host(s) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [11:01:45] (03PS2) 10Muehlenhoff: Move os-reports to the puppetdb host(s) [puppet] - 10https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) [11:03:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [11:03:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [11:04:31] (03Abandoned) 10Jbond: do not merge: test change for pcc [puppet] - 10https://gerrit.wikimedia.org/r/959831 
(https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [11:09:16] !log eoghan@cumin1001 START - Cookbook sre.hosts.reimage for host apt-staging2001.codfw.wmnet with OS bookworm [11:16:41] (03PS1) 10Jbond: puppet_compiler: bump version to 2.5.7 [puppet] - 10https://gerrit.wikimedia.org/r/960022 (https://phabricator.wikimedia.org/T346216) [11:17:13] (03PS1) 10Majavah: P:wmcs::prometheus: fix openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) [11:17:16] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:18:01] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10Stevemunene) [11:18:07] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10Stevemunene) [11:18:32] (03CR) 10Majavah: P:wmcs::prometheus: fix openstack-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [11:18:40] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:20:37] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version to 2.5.7 [puppet] - 10https://gerrit.wikimedia.org/r/960022 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [11:21:10] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk 
page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:21:36] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43463/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [11:22:07] (03CR) 10Muehlenhoff: [C: 03+2] Move os-reports to the puppetdb host(s) [puppet] - 10https://gerrit.wikimedia.org/r/960015 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [11:22:36] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:25:28] (03PS1) 10Jbond: DO NOT MERGE: test author [puppet] - 10https://gerrit.wikimedia.org/r/960024 [11:25:41] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on apt-staging2001.codfw.wmnet with reason: host reimage [11:26:30] (03Abandoned) 10Jbond: DO NOT MERGE: test author [puppet] - 10https://gerrit.wikimedia.org/r/960024 (owner: 10Jbond) [11:28:53] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apt-staging2001.codfw.wmnet with reason: host reimage [11:30:42] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10Performance Issue: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Joe) @Urbanecm_WMF as I said on IRC, there's two main differences when running... 
[11:34:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43464/console" [puppet] - 10https://gerrit.wikimedia.org/r/959831 (https://phabricator.wikimedia.org/T346216) (owner: 10Jbond) [11:37:23] (03PS2) 10Muehlenhoff: dragonfly::dfdaemon: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/951079 [11:37:52] (03CR) 10Majavah: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43465/console" [puppet] - 10https://gerrit.wikimedia.org/r/957254 (https://phabricator.wikimedia.org/T200616) (owner: 10Majavah) [11:41:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [11:41:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [11:42:13] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apt-staging2001.codfw.wmnet with OS bookworm [11:45:05] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [11:46:24] (03PS8) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) [11:46:35] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs::prometheus: fix openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/960023 (https://phabricator.wikimedia.org/T346439) (owner: 10Majavah) [11:49:10] (03CR) 10CI reject: [V: 04-1] [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) (owner: 10Brouberol) [11:49:47] (03PS1) 10Majavah: cr-cloud: Drop cloudmetrics excemptions [homer/public] - 
10https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266) [11:49:52] (03PS1) 10Majavah: hieradata: drop dmz_cidr excemptions for cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) [11:50:27] (03PS9) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) [11:52:31] (03PS1) 10Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/960029 [11:54:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] Merge branch 'master' into 2.x [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/960029 (owner: 10Jbond) [11:54:49] (03Abandoned) 10Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/960029 (owner: 10Jbond) [11:56:58] (03PS1) 10Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/960031 [11:57:29] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:57:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951079 (owner: 10Muehlenhoff) [11:58:16] !log brouberol@cumin1001 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [11:58:53] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:58:53] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany 
page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:58:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] Merge branch 'master' into 2.x [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/960031 (owner: 10Jbond) [12:00:17] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:00:28] (03CR) 10Ayounsi: [C: 03+1] cr-cloud: Drop cloudmetrics excemptions [homer/public] - 10https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [12:04:29] (03PS9) 10Jbond: puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373) [12:06:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/960011 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:08:59] (03PS1) 10Muehlenhoff: firewall: Also support Stdlib::Port::Unprivileged in Ferm::Port [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) [12:10:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet: switch to puppet7 command [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/936321 (https://phabricator.wikimedia.org/T236373) (owner: 10Jbond) [12:11:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:12:57] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the 
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:13:50] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet [12:13:59] (03PS1) 10Jelto: gitlab: use one sshkey for gitlab and remove suffix [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) [12:14:23] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:15:04] (03PS1) 10Jbond: pcc: move pcc1004 to pcc version 3 [puppet] - 10https://gerrit.wikimedia.org/r/960035 (https://phabricator.wikimedia.org/T236373) [12:15:18] (03CR) 10Jbond: [C: 03+2] pcc: move pcc1004 to pcc version 3 [puppet] - 10https://gerrit.wikimedia.org/r/960035 (https://phabricator.wikimedia.org/T236373) (owner: 10Jbond) [12:16:14] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43466/console" [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [12:16:23] (03PS1) 10Muehlenhoff: webperf: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960036 [12:16:36] (03PS2) 10Muehlenhoff: webperf: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960036 [12:17:36] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet [12:18:39] (KeyholderUnarmed) firing: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:18:40] (03PS1) 10Jbond: test pcc: [puppet] - 10https://gerrit.wikimedia.org/r/960037 [12:21:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/960036 (owner: 10Muehlenhoff) [12:23:12] !log brouberol@cumin1001 END 
(PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [12:29:09] (03PS10) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) [12:35:37] 10SRE, 10ops-codfw, 10decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (10ayounsi) I'm going to briefly re-purpose that host as ganeti-test2004 for some tests. cc T345602 let me know if there is any issue. [12:36:09] (03Abandoned) 10Muehlenhoff: os-reports: Stop configuring a puppetdb server and switch to discovery record [puppet] - 10https://gerrit.wikimedia.org/r/959696 (https://phabricator.wikimedia.org/T342214) (owner: 10Muehlenhoff) [12:36:57] (03CR) 10Filippo Giunchedi: [C: 03+1] webperf: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960036 (owner: 10Muehlenhoff) [12:43:12] (03PS11) 10Brouberol: [Kafka] Use broker in-sync status as a gate between broker restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/959720 (https://phabricator.wikimedia.org/T346741) [12:45:25] (03CR) 10Jbond: "did you see an issue some where. this should not be needed as Stdlib::Port covers Stdlib::Port::UnPrivileged" [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:45:51] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [12:47:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10BBlack) [12:48:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10BBlack) @VRiley-WMF - Sukhbir's out right now, but I've updated the racking plan on his behalf! 
[12:48:55] (03CR) 10Muehlenhoff: [C: 03+1] ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [12:50:04] (03PS1) 10JMeybohm: prometheus::k8s: Discover calico-felix targets from k8s api [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) [12:52:00] (03CR) 10Muehlenhoff: firewall: Also support Stdlib::Port::Unprivileged in Ferm::Port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960033 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:54:49] PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:57] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:56:13] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10JAllemandou) It feels wrong to me to be willing to return all page views on a date: the result set would be enormous and wouldn't b... 
[12:59:30] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:00:07] (03PS2) 10JMeybohm: prometheus::k8s: Discover calico-felix targets from k8s api [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) [13:00:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10BBlack) Reading a little deeper on this, I think we still have a hostnames issue. If those other 8 hosts are indeed being brought from ulsfo+eqsin. Those 8 hosts, I presume, would be... [13:03:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10BBlack) Adding to the confusion: historically, we once used the hostname `cp1099` back in 2015 for a one-off host: T96873 - therefore that name already exists in both phab and git hist... [13:07:41] _joe_: sorry for the bad timing yesterday [13:07:51] Clément mentioned you had an issue with Scap running an explicit helm rollback when it's not required [13:07:58] do you have logs or a task with details? [13:08:22] <_joe_> jnuche: not yet, it's been a quite hectic week [13:09:16] _joe_: no worries, just wanted to get a better sense of the problem [13:09:33] please tag me on the task when you get a chance [13:09:52] and hope things are calming down a bit! 
[13:11:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43468/console" [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm) [13:13:44] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [13:14:18] (03PS1) 10JMeybohm: prometheus::k8s: Drop puppet class names [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915) [13:14:46] (03PS1) 10Filippo Giunchedi: otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) [13:15:49] (03CR) 10Filippo Giunchedi: "I _think_ this is the minimal configuration to get Prometheus to scrape otel-coll, please let me know what you think" [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [13:19:08] (03CR) 10JMeybohm: otel-coll: enable prometheus scraping (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [13:19:30] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:19:46] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43469/console" [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm) [13:19:50] (03CR) 10Filippo Giunchedi: "Nice cleanup, not tested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/960049 
(https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm) [13:19:54] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::k8s: Discover calico-felix targets from k8s api [puppet] - 10https://gerrit.wikimedia.org/r/960049 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm) [13:20:09] RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:15] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:21:19] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::k8s: Drop puppet class names [puppet] - 10https://gerrit.wikimedia.org/r/960055 (https://phabricator.wikimedia.org/T346915) (owner: 10JMeybohm) [13:22:29] (03CR) 10Ilias Sarantopoulos: [C: 03+1] profile::trafficserver::backend: switch ores traffic to ores-legacy [puppet] - 10https://gerrit.wikimedia.org/r/959762 (https://phabricator.wikimedia.org/T341696) (owner: 10Elukey) [13:29:02] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10taavi) [13:29:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10taavi) [13:31:11] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) 05In progress→03Resolved [13:31:19] 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10cmooney) [13:32:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on 
k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:37:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:38:25] (03PS2) 10Filippo Giunchedi: otel-coll: enable prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) [13:39:00] (03CR) 10Filippo Giunchedi: otel-coll: enable prometheus scraping (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/960056 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [13:39:51] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [13:44:05] (03CR) 10Herron: [C: 03+1] "Makes sense thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950) (owner: 10Filippo Giunchedi) [13:45:13] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:45:55] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED [13:46:37] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:48:43] (03CR) 10Filippo Giunchedi: thanos: remove thanos components from thanos::frontend role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [13:50:43] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: don't manage limitnofile for thanos-store [puppet] - 10https://gerrit.wikimedia.org/r/960008 (https://phabricator.wikimedia.org/T346950) (owner: 10Filippo Giunchedi) [13:50:45] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:35] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:02:30] (03PS1) 10Hashar: envoyproxy: remove skip_install from tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/960062 [14:04:14] (03PS1) 10Hashar: mcrouter: remove skip_install from 
tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/960063 [14:08:44] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:55] (03PS1) 10Hashar: Remove minversion=1.6 from tox.ini files [puppet] - 10https://gerrit.wikimedia.org/r/960064 [14:10:18] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Eevans) >>! In T343855#9190791, @JAllemandou wrote: > It feels wrong to me to be willing to return all page views on a date: the re... [14:15:42] (03PS1) 10Lucas Werkmeister (WMDE): Make wikifunctionswiki a multilingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960066 (https://phabricator.wikimedia.org/T342857) [14:18:44] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:06] (03PS1) 10Hashar: tox.ini: remove skipsdist [software/conftool] - 10https://gerrit.wikimedia.org/r/960068 (https://phabricator.wikimedia.org/T346238) [14:23:39] (KeyholderUnarmed) firing: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:28:39] (KeyholderUnarmed) resolved: (2) 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:41:13] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} 
(Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:42:39] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:44:32] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/945872/43471/vrts1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [14:45:17] (03PS36) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [14:45:44] (03PS1) 10Urbanecm: listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) [14:49:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [14:49:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc1015'] [14:51:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jclark-ctr) [14:54:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED [14:56:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:56:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:57:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:57:21] !log jclark@cumin1001 START - Cookbook 
sre.hosts.provision for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED [14:57:33] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:57:39] (03CR) 10CDanis: [C: 03+1] tox.ini: remove skipsdist [software/conftool] - 10https://gerrit.wikimedia.org/r/960068 (https://phabricator.wikimedia.org/T346238) (owner: 10Hashar) [14:58:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:58:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:58:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pc1015'] [15:00:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:01:31] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:02:57] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:03:03] (03CR) 10CI reject: [V: 04-1] listTaskCounts: Do not expect tasks key to be present [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959987 (https://phabricator.wikimedia.org/T347120) (owner: 10Urbanecm) [15:03:45] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend 14th round of 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960074 (https://phabricator.wikimedia.org/T308139) [15:05:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:08:48] (03PS1) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [15:08:50] (03PS1) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 [15:09:03] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, I trust you know better than me that this is no longer needed :). Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [15:09:19] (03PS2) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [15:09:21] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:09:52] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:11:42] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1247.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:57] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1247'] [15:13:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:28] (03PS37) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [15:13:50] !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 23.9.1 - T346737 [15:13:59] !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 23.9.1 - T346737 (duration: 00m 09s) [15:14:05] !log upgrading LibreNMS to 23.9.1 [15:15:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:16:27] !log upgrading LibreNMS in codfw [15:17:49] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-check-services.service,librenms-discovery-new.service,librenms-poll-billing.service,librenms-poller-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:20:37] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:21:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks! Would you please review ops/homer/public.git and see if there is a pending cleanup related to this?" [puppet] - 10https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [15:21:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1247'] [15:21:46] (03CR) 10Majavah: "Thanks all! I plan to deploy this on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/960028 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [15:23:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:24:52] !log upgrading LibreNMS in eqiad [15:24:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-cloud: Drop cloudmetrics excemptions [homer/public] - 10https://gerrit.wikimedia.org/r/960027 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [15:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jclark-ctr) [15:26:03] (03PS1) 10Bking: dse-k8s: Manually restore flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/960080 (https://phabricator.wikimedia.org/T342149) [15:27:21] (03CR) 10DCausse: [C: 03+1] dse-k8s: Manually restore flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/960080 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [15:27:52] (03PS2) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 
10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [15:27:54] (03PS3) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [15:28:01] (03CR) 10Bking: [C: 03+2] dse-k8s: Manually restore flink-app [deployment-charts] - 10https://gerrit.wikimedia.org/r/960080 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [15:28:27] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:28:34] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:30:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:31:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:33:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jclark-ctr) [15:34:25] (03PS4) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [15:34:59] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:34:59] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 
10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:36:23] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:38:18] (03PS1) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960087 (https://phabricator.wikimedia.org/T342149) [15:38:55] (03PS3) 10Cwhite: Restore access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/959771 (https://phabricator.wikimedia.org/T346796) [15:39:47] (03CR) 10DCausse: [C: 03+1] dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960087 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [15:39:53] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:40:14] (03PS1) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960090 (https://phabricator.wikimedia.org/T342149) [15:41:02] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:41:17] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:41:43] (03PS2) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960087 (https://phabricator.wikimedia.org/T342149)
[15:41:59] (03CR) 10Jbond: "ill move the current puppetdbs to insetup next week and then we can merge this" [puppet] - 10https://gerrit.wikimedia.org/r/959754 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [15:41:59] 10SRE, 10Traffic: Allow Varnish to be called on UDS from services other than HAProxy - https://phabricator.wikimedia.org/T347059 (10Fabfur) 05Open→03Declined [15:42:13] (03CR) 10Cwhite: [C: 03+2] Restore access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/959771 (https://phabricator.wikimedia.org/T346796) (owner: 10Cwhite) [15:43:10] (03Abandoned) 10Bking: dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960090 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [15:43:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding ganeti-test server to codfw - jhancock@cumin2002" [15:43:18] (03CR) 10Bking: [C: 03+2] dse-k8s: Trigger flink-app savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/960087 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [15:43:23] (03PS3) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [15:43:25] (03PS5) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [15:43:56] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:44:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding ganeti-test server to codfw - jhancock@cumin2002" [15:44:00] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:03] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:44:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10colewhite) 05Open→03Resolved a:03colewhite Restored the level of access held before last contract expired. Please feel free to reopen if yo... [15:45:46] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:45:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:46:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:48:00] (03PS4) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [15:48:02] (03PS6) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [15:48:25] (03PS2) 10Cwhite: admin: add mabualruz to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/958982 (https://phabricator.wikimedia.org/T342535) [15:48:37] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:48:40] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:48:43] 10SRE, 10ops-codfw, 
10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) p:05Triage→03Medium [15:49:43] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:50:12] (03CR) 10Cwhite: [C: 03+2] admin: add mabualruz to deployment group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958982 (https://phabricator.wikimedia.org/T342535) (owner: 10Cwhite) [15:51:09] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:51:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10colewhite) 05Open→03Resolved The group membership change has been deployed. Please feel free to reopen if you encounter any... 
[15:53:35] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:54:59] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:37] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:01:03] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:35:12] (03PS2) 10Jdlrobson: WIP: Wordmarks for Wikinews projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959876 (https://phabricator.wikimedia.org/T341258) [16:36:44] (03CR) 10FNegri: "I cherry-picked this change on top of my patch and it compiles successfully. Feel free to merge this one and I'll rebase my patch on top."
[debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999) [16:46:23] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:48:12] (03PS5) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [16:48:14] (03PS7) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [16:48:31] (03PS5) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) [16:48:53] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [16:49:02] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [16:49:51] (03PS5) 10Jdlrobson: Logos for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) [16:50:39] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:52] (03PS6) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) 
[16:53:54] (03PS8) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [16:54:28] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [16:54:30] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [16:58:36] 10SRE, 10Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (10Fabfur) [16:58:39] 10SRE, 10Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (10Fabfur) [17:02:55] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:30] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:08:04] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for dr0ptp4kt - https://phabricator.wikimedia.org/T347110 (10dr0ptp4kt) Thanks @colewhite , I checked in a few places like mwlog1002, deploy1002, and mwmaint1002 and it looks to be in working order. Have a great weekend! [17:09:52] (03PS1) 10Cathal Mooney: Temporarily adjust EVPN outbound policy to CRs to block existing nets [homer/public] - 10https://gerrit.wikimedia.org/r/960109 (https://phabricator.wikimedia.org/T347191) [17:12:33] (03CR) 10Peter Fischer: "All SUP-config-related PRs are merged, thank you! Do you think you could adapt the chart?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [17:13:44] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [17:14:21] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 250 bytes in 1.595 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [17:15:49] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard1003 is OK: HTTP OK: HTTP/1.1 200 OK - 54549 bytes in 6.165 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [17:17:55] (03PS1) 10Fabfur: vanish: allow PURGE requests only from dedicated socket [puppet] - 10https://gerrit.wikimedia.org/r/960112 (https://phabricator.wikimedia.org/T347192) [17:20:05] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:20:38] (03CR) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:21:08] (03CR) 10Cathal Mooney: Support configuration of EVPN anycast GW on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/959873 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [17:22:59] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:24:30] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:27:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10dr0ptp4kt) I checked on, for example, an-launcher1002, and look to be in place. Thanks here as well @colewhite! [17:29:54] (03PS7) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [17:29:56] (03PS9) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [17:30:32] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:30:34] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:32:24] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@a30e944]: (no justification provided) [17:32:34] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@a30e944]: (no justification provided) (duration: 00m 09s) [17:33:51] (03PS1) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) [17:34:20] (03CR) 10CI reject: [V: 04-1] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:34:27] (03PS8) 10Jbond: wmflib::get_clusters: create a puppet version of 
get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [17:34:29] (03PS10) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [17:34:57] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:35:06] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:35:54] (03PS2) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) [17:36:24] (03CR) 10CI reject: [V: 04-1] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:38:51] (03PS3) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) [17:39:21] (03CR) 10jenkins-bot: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:39:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43484/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:42:01] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:43:25] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:43:32] (03PS9) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [17:43:34] (03PS11) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [17:44:03] (03PS4) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/960114 (https://phabricator.wikimedia.org/T342463) [17:44:13] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:48:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43485/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:51:11] (03PS10) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [17:51:13] (03PS12) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [17:51:54] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:53:10] (03PS11) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 
(https://phabricator.wikimedia.org/T341373) [17:53:12] (03PS13) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [17:56:46] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:56:54] (03PS14) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [17:57:20] (03CR) 10CI reject: [V: 04-1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:59:33] (03PS15) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:06:28] (03PS12) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [18:06:30] (03PS16) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:07:17] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:07:28] (03PS17) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:12:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43488/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 
(https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:17:56] (03PS13) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [18:17:58] (03PS18) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:18:37] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:18:44] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:02] (03PS14) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [18:20:04] (03PS19) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:23:22] (03PS20) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:23:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43489/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:24:27] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:28:47] (03CR) 
10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43490/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:33:49] (03PS15) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [18:33:51] (03PS21) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:38:08] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:39:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43491/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:47:50] (03PS16) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [18:47:52] (03PS22) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:52:10] (03PS17) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [18:52:12] (03PS23) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:52:16] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 
(https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:56:36] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:57:07] (03PS6) 10Jdlrobson: Icons for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) [18:57:09] (03PS1) 10Jdlrobson: WIP: Special wiki wordmarks and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960121 (https://phabricator.wikimedia.org/T341250) [18:57:18] (03PS18) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [18:57:20] (03PS24) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [18:57:59] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [18:57:59] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:00:50] (03PS19) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [19:00:52] (03PS25) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) 
[19:02:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43495/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:03:49] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:04:56] (03PS1) 10Sharvaniharan: New stream for apps event schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960123 [19:06:13] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:06:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43496/console" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:06:57] (03Abandoned) 10Sharvaniharan: New stream for apps event schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960123 (owner: 10Sharvaniharan) [19:08:55] (03CR) 10Jbond: [V: 03+1] "hi all are you able to review this changes, the diff in the most recent pcc is related to sorting, with this new version being a i think a" [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:10:05] (03PS1) 10Sharvaniharan: New donor experience stream for apps event schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960124 [19:12:33] (03PS1) 10Jbond: prometheus::class_config: switch to wmflib::get_config [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) [19:13:32] (03PS3) 10Jdlrobson: Fix white background for Wikibooks wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957908 
(https://phabricator.wikimedia.org/T341251) (owner: 10Pikne) [19:13:34] (03PS2) 10Jdlrobson: Provide wordmarks/taglines for Wikibooks projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959872 (https://phabricator.wikimedia.org/T341251) [19:16:56] (03PS2) 10Jbond: prometheus::class_config: switch to wmflib::get_config [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) [19:20:35] (03PS2) 10Cwhite: prometheus: add option to configure probe-specific params [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) [19:21:15] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:29:00] (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/958981/43499/" [puppet] - 10https://gerrit.wikimedia.org/r/958981 (https://phabricator.wikimedia.org/T346893) (owner: 10Cwhite) [19:29:11] (03PS20) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [19:29:13] (03PS26) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [19:29:15] (03PS3) 10Jbond: prometheus: switch to wmflib::get_config [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) [19:32:21] (03PS4) 10Jbond: prometheus: switch to wmflib::get_config [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) [19:33:43] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 
(https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:33:59] (03PS1) 10Jbond: get_clusters: rmove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373) [19:42:39] (03PS21) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusteres [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) [19:42:41] (03PS27) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) [19:43:14] (03PS5) 10Jbond: prometheus: switch to wmflib::get_config [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) [19:43:23] (03PS2) 10Jbond: get_clusters: rmove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373) [19:50:55] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:53:43] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:55:16] (03CR) 10Bking: "This change is ready for review." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: 10Bking) [19:59:03] (03CR) 10Jbond: "similar to the last i think the only diff is ordering" [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [20:05:32] (03Abandoned) 10Bking: elastic: introduce jbod-related config [puppet] - 10https://gerrit.wikimedia.org/r/959854 (https://phabricator.wikimedia.org/T231010) (owner: 10Bking) [21:07:23] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE, 10Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (10odimitrijevic) approved [21:09:30] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:13:44] (PuppetConstantChange) firing: (2) Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [21:25:13] (03PS4) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) [21:25:15] (03PS12) 10Ebernhardson: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [21:25:17] (03CR) 10Ebernhardson: cirrus streaming updater service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [21:29:30] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:58:19] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: 
/en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:59:43] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:09:40] (03PS1) 10Jdlrobson: wordmarks/taglines for Wiktionary projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960147 (https://phabricator.wikimedia.org/T341257) [22:09:44] (03PS1) 10Jdlrobson: WIP: Logos for Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960148 (https://phabricator.wikimedia.org/T341257) [22:10:25] (03CR) 10CI reject: [V: 04-1] WIP: Logos for Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960148 (https://phabricator.wikimedia.org/T341257) (owner: 10Jdlrobson) [22:18:44] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:55:21] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:56:13] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [22:56:26] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 
0:00:00 on db2100.codfw.wmnet with reason: Maintenance [22:56:47] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:09:49] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:09:53] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:15:35] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:21:15] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:22:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:52:29] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:53:55] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase