[00:00:29] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1016371 (owner: 10TrainBranchBot) [00:03:32] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2003.codfw.wmnet with reason: host reimage [00:07:12] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:07:16] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:13:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:13:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:17:35] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:17:39] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:23:15] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:23:19] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:25:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:25:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:25:56] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2003.codfw.wmnet with OS bookworm [00:30:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:30:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:36:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:37:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:43:35] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) 
for host logging-hd2002.codfw.wmnet with OS bookworm [00:44:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:44:09] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:46:07] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logging-hd2002.codfw.wmnet with OS bookworm [01:00:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:00:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:05:31] (03CR) 10Tim Starling: "Amir says" [puppet] - 10https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [01:06:14] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:06:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:09:20] (03PS1) 10TChin: [WIP] Add datasets-config helm chart and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [01:10:00] (03CR) 10CI reject: [V:04-1] [WIP] Add datasets-config helm chart and helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [01:15:59] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd2002.codfw.wmnet with reason: host reimage [01:19:03] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd2002.codfw.wmnet with reason: host reimage [01:40:39] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd2002.codfw.wmnet with OS bookworm [01:44:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 834.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:49:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 869.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:51:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:51:27] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:58:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:58:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:32:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:37:22] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:58] (03PS1) 10Krinkle: codesearch: Set CODESEARCH_HOUND_BASE for codesearch-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1016480 [02:45:04] (03PS2) 10Krinkle: codesearch: Set CODESEARCH_HOUND_BASE for codesearch-frontend [puppet] - 10https://gerrit.wikimedia.org/r/1016480 [02:45:07] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016480 
(owner: Krinkle) [02:50:41] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:02:22] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:53:48] (PS3) Krinkle: [WIP] codesearch: Set CODESEARCH_HOUND_BASE for codesearch-frontend [puppet] - https://gerrit.wikimedia.org/r/1016480 [04:03:44] (CR) Krinkle: "Based on the below test, I believe this would not work currently." [puppet] - https://gerrit.wikimedia.org/r/1016480 (owner: Krinkle) [04:32:03] (PS2) Tim Starling: WMCS: Read from the new block/block_target tables [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) [04:32:03] (CR) Tim Starling: "I tested it locally using I0d9afa97a4566e9c9fd8cd812b5fcb8698eaf4f9. Now I'm moderately confident and ready for it to be merged." 
[puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling) [04:35:46] SRE, LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665 (AndyRussG) NEW [04:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:09:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [05:10:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [05:10:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:10:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:10:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T356166)', diff saved to https://phabricator.wikimedia.org/P59238 and previous config saved to /var/cache/conftool/dbconfig/20240403-051029-marostegui.json [05:10:32] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [05:11:21] (PS1) Marostegui: db1222: Upgrade to Bookworm and MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1016489 (https://phabricator.wikimedia.org/T361543) [05:11:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1222 T361543', diff saved to https://phabricator.wikimedia.org/P59239 and previous config saved to /var/cache/conftool/dbconfig/20240403-051149-root.json [05:11:53] T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543 
[05:12:26] (03CR) 10Marostegui: [C:03+2] db1222: Upgrade to Bookworm and MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1016489 (https://phabricator.wikimedia.org/T361543) (owner: 10Marostegui) [05:13:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1222.eqiad.wmnet with OS bookworm [05:16:13] (03PS1) 10Marostegui: Revert "db1222: Upgrade to Bookworm and MariaDB 10.6" [puppet] - 10https://gerrit.wikimedia.org/r/1016506 [05:25:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage [05:28:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage [05:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:42:52] (03PS1) 10Marostegui: db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1016491 [05:43:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2148 T361543', diff saved to https://phabricator.wikimedia.org/P59240 and previous config saved to /var/cache/conftool/dbconfig/20240403-054310-root.json [05:43:14] T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543 [05:43:51] (03CR) 10Marostegui: [C:03+2] db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1016491 (owner: 10Marostegui) [05:44:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2148.codfw.wmnet with OS bookworm [05:46:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59241 and previous config saved to /var/cache/conftool/dbconfig/20240403-054641-root.json [05:47:02] (03CR) 10Marostegui: [C:03+2] Revert "db1222: Upgrade to Bookworm and 
MariaDB 10.6" [puppet] - 10https://gerrit.wikimedia.org/r/1016506 (owner: 10Marostegui) [05:48:21] (03PS1) 10Marostegui: Revert "db2148: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1016507 [05:49:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1222.eqiad.wmnet with OS bookworm [05:50:15] (03PS1) 10Marostegui: installserver: Do not format es2037 [puppet] - 10https://gerrit.wikimedia.org/r/1016492 [05:50:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:51:03] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:53:23] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es2037 [puppet] - 10https://gerrit.wikimedia.org/r/1016492 (owner: 10Marostegui) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T0600) [06:01:10] jouncebot: next [06:01:10] In 0 hour(s) and 58 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T0700) [06:01:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2148.codfw.wmnet with reason: host reimage [06:01:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59242 and previous config saved to /var/cache/conftool/dbconfig/20240403-060147-root.json [06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2148.codfw.wmnet 
with reason: host reimage [06:05:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T356166)', diff saved to https://phabricator.wikimedia.org/P59243 and previous config saved to /var/cache/conftool/dbconfig/20240403-061055-marostegui.json [06:11:00] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [06:13:38] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:13:42] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:16:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59244 and previous config saved to /var/cache/conftool/dbconfig/20240403-061653-root.json [06:23:52] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:23:57] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:24:11] (03CR) 10Marostegui: [C:03+2] Revert "db2148: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1016507 (owner: 10Marostegui) [06:24:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59245 and previous 
config saved to /var/cache/conftool/dbconfig/20240403-062436-root.json [06:25:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2148.codfw.wmnet with OS bookworm [06:26:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P59246 and previous config saved to /var/cache/conftool/dbconfig/20240403-062602-marostegui.json [06:31:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59247 and previous config saved to /var/cache/conftool/dbconfig/20240403-063159-root.json [06:32:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:39:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59248 and previous config saved to /var/cache/conftool/dbconfig/20240403-063941-root.json [06:41:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P59249 and previous config saved to /var/cache/conftool/dbconfig/20240403-064110-marostegui.json [06:47:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59250 and previous config saved to /var/cache/conftool/dbconfig/20240403-064704-root.json [06:54:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59251 and previous config saved to /var/cache/conftool/dbconfig/20240403-065447-root.json [06:56:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1167 (T356166)', diff saved to https://phabricator.wikimedia.org/P59252 and previous config saved to /var/cache/conftool/dbconfig/20240403-065617-marostegui.json [06:56:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:56:20] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [06:56:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:56:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [06:56:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [06:57:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T356166)', diff saved to https://phabricator.wikimedia.org/P59253 and previous config saved to /var/cache/conftool/dbconfig/20240403-065706-marostegui.json [06:59:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T356166)', diff saved to https://phabricator.wikimedia.org/P59254 and previous config saved to /var/cache/conftool/dbconfig/20240403-065923-marostegui.json [07:00:04] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:02:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59255 and previous config saved to /var/cache/conftool/dbconfig/20240403-070212-root.json [07:02:26] (RoutinatorRRDPErrors) firing: Routinator RRDP fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RRDP_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRRDPErrors [07:09:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2125 T361543', diff saved to https://phabricator.wikimedia.org/P59256 and previous config saved to /var/cache/conftool/dbconfig/20240403-070946-root.json [07:09:50] T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543 [07:09:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59257 and previous config saved to /var/cache/conftool/dbconfig/20240403-070953-root.json [07:10:42] (03PS1) 10Marostegui: db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1016620 [07:11:20] (03CR) 10Marostegui: [C:03+2] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1016620 (owner: 10Marostegui) [07:11:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2125.codfw.wmnet with OS bookworm [07:11:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2202.codfw.wmnet with OS bookworm [07:14:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59258 and previous config saved to /var/cache/conftool/dbconfig/20240403-071431-marostegui.json [07:16:53] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2100 after memory failure [puppet] - 10https://gerrit.wikimedia.org/r/1015463 (https://phabricator.wikimedia.org/T361584) (owner: 10Arnaudb) [07:17:18] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1222 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59259 and previous config saved to /var/cache/conftool/dbconfig/20240403-071718-root.json [07:18:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2100.codfw.wmnet [07:20:46] (03PS1) 10Slyngshede: SSH keys: Provide feedback on actions. [software/bitu] - 10https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) [07:22:07] (03CR) 10Slyngshede: "I apparently lost the patch for adding messages to the user on key operations, so I had to redo it." [software/bitu] - 10https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) (owner: 10Slyngshede) [07:22:26] (RoutinatorRRDPErrors) firing: (2) Routinator RRDP fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RRDP_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRRDPErrors [07:24:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59260 and previous config saved to /var/cache/conftool/dbconfig/20240403-072459-root.json [07:25:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:26:40] (03PS1) 10Muehlenhoff: Record updated contract end for rkhan [puppet] - 10https://gerrit.wikimedia.org/r/1016702 (https://phabricator.wikimedia.org/T361527) [07:27:26] (RoutinatorRRDPErrors) firing: (2) Routinator RRDP fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RRDP_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRRDPErrors [07:27:46] !log arnaudb@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on db2202.codfw.wmnet with reason: host reimage [07:28:26] (03CR) 10Ryan Kemper: [C:04-1] "Given some context I'm seeing in the tickets (about spicerack using curator; I don't yet understand why), this feels like a risky change. " [puppet] - 10https://gerrit.wikimedia.org/r/1016425 (https://phabricator.wikimedia.org/T354670) (owner: 10Ryan Kemper) [07:28:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2125.codfw.wmnet with reason: host reimage [07:29:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P59261 and previous config saved to /var/cache/conftool/dbconfig/20240403-072938-marostegui.json [07:30:04] (03CR) 10Muehlenhoff: [C:03+2] Record updated contract end for rkhan [puppet] - 10https://gerrit.wikimedia.org/r/1016702 (https://phabricator.wikimedia.org/T361527) (owner: 10Muehlenhoff) [07:32:26] (RoutinatorRRDPErrors) resolved: (2) Routinator RRDP fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RRDP_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRRDPErrors [07:32:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: host reimage [07:35:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2125.codfw.wmnet with reason: host reimage [07:37:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) (owner: 10Slyngshede) [07:37:45] (03CR) 10Slyngshede: [C:03+2] SSH keys: Provide feedback on actions. [software/bitu] - 10https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) (owner: 10Slyngshede) [07:39:00] (03Merged) 10jenkins-bot: SSH keys: Provide feedback on actions. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1016621 (https://phabricator.wikimedia.org/T360966) (owner: 10Slyngshede) [07:40:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59262 and previous config saved to /var/cache/conftool/dbconfig/20240403-074004-root.json [07:44:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T356166)', diff saved to https://phabricator.wikimedia.org/P59263 and previous config saved to /var/cache/conftool/dbconfig/20240403-074446-marostegui.json [07:44:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:44:50] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [07:45:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:45:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T356166)', diff saved to https://phabricator.wikimedia.org/P59264 and previous config saved to /var/cache/conftool/dbconfig/20240403-074509-marostegui.json [07:47:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T356166)', diff saved to https://phabricator.wikimedia.org/P59265 and previous config saved to /var/cache/conftool/dbconfig/20240403-074727-marostegui.json [07:53:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2202.codfw.wmnet with OS bookworm [07:55:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2148 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59266 and previous config saved to /var/cache/conftool/dbconfig/20240403-075510-root.json [07:56:24] (03PS1) 10Majavah: hieradata: Upgrade clouddb2002-dev to MariaDB 10.6 [puppet] - 
10https://gerrit.wikimedia.org/r/1016705 [07:56:28] (03PS1) 10Marostegui: common.yaml: Add cu_useragent to private tables [puppet] - 10https://gerrit.wikimedia.org/r/1016706 (https://phabricator.wikimedia.org/T361673) [07:58:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2125.codfw.wmnet with OS bookworm [07:59:08] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1789/console" [puppet] - 10https://gerrit.wikimedia.org/r/1016705 (owner: 10Majavah) [08:00:04] jnuche and jeena: Time to do the MediaWiki train - Utc-0+Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T0800). [08:00:25] morning, train deploy in a few minutes [08:00:47] (03PS2) 10Majavah: Upgrade clouddb2002-dev to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1016705 (https://phabricator.wikimedia.org/T361666) [08:01:13] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox [08:02:02] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59267 and previous config saved to /var/cache/conftool/dbconfig/20240403-080235-marostegui.json [08:04:32] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016707 (https://phabricator.wikimedia.org/T360157) [08:04:33] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.42.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016707 (https://phabricator.wikimedia.org/T360157) (owner: 
10TrainBranchBot) [08:04:58] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1790/co" [puppet] - 10https://gerrit.wikimedia.org/r/1016705 (https://phabricator.wikimedia.org/T361666) (owner: 10Majavah) [08:05:32] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016707 (https://phabricator.wikimedia.org/T360157) (owner: 10TrainBranchBot) [08:05:39] (03CR) 10Majavah: [V:03+1 C:03+2] Upgrade clouddb2002-dev to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1016705 (https://phabricator.wikimedia.org/T361666) (owner: 10Majavah) [08:07:02] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:06] (03PS1) 10Marostegui: Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1016512 [08:09:16] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2100.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:10:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2100.codfw.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002" [08:10:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:10:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2100.codfw.wmnet [08:10:45] (03CR) 10Dreamy Jazz: [C:03+1] common.yaml: Add cu_useragent to private tables [puppet] - 10https://gerrit.wikimedia.org/r/1016706 
(https://phabricator.wikimedia.org/T361673) (owner: 10Marostegui) [08:11:26] (03CR) 10Marostegui: [C:03+2] common.yaml: Add cu_useragent to private tables [puppet] - 10https://gerrit.wikimedia.org/r/1016706 (https://phabricator.wikimedia.org/T361673) (owner: 10Marostegui) [08:11:40] (03CR) 10Marostegui: [C:03+2] Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1016512 (owner: 10Marostegui) [08:11:47] 10ops-codfw, 06DBA, 10decommission-hardware, 13Patch-For-Review: decommission db2100.codfw.wmnet - https://phabricator.wikimedia.org/T361584#9683193 (10ABran-WMF) 05In progress→03Open a:05ABran-WMF→03None [08:12:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59268 and previous config saved to /var/cache/conftool/dbconfig/20240403-081207-root.json [08:12:22] (03PS1) 10Ayounsi: Routed Ganeti: fix v6 route install [puppet] - 10https://gerrit.wikimedia.org/r/1016708 (https://phabricator.wikimedia.org/T300152) [08:14:16] (03PS2) 10Ayounsi: Routed Ganeti: fix v6 route install [puppet] - 10https://gerrit.wikimedia.org/r/1016708 (https://phabricator.wikimedia.org/T300152) [08:14:33] jouncebot: next [08:14:33] In 1 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1000) [08:15:42] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: bump ops prometheus retention_size [puppet] - 10https://gerrit.wikimedia.org/r/1016304 (https://phabricator.wikimedia.org/T360537) (owner: 10Filippo Giunchedi) [08:15:50] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster1002.eqiad.wmnet [08:16:45] !log roll-restart prometheus/ops in codfw/eqiad to apply new retention settings - T360537 [08:17:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P59269 and previous config saved 
to /var/cache/conftool/dbconfig/20240403-081742-marostegui.json [08:19:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:20:00] (ProbeDown) firing: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:20:24] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.25 refs T360157 [08:21:54] (03CR) 10Volans: [C:03+1] "LGTM, typo inline" [puppet] - 10https://gerrit.wikimedia.org/r/1016708 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:23:55] (03CR) 10Fabfur: [C:03+2] cp3067: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1015969 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh) [08:24:12] !log depool cp3067 for reimage (T360430) [08:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:15] T360430: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430 [08:24:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:24:24] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:24:25] (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: fix v6 route install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1016708 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:25:42] 
XioNoX: I have a change ready to be merged on puppetmaster with also yours [08:25:43] (03PS1) 10Majavah: Bind mariadb on clouddb2002-dev to the IPv4 address [puppet] - 10https://gerrit.wikimedia.org/r/1016712 [08:25:46] it's ok for you? [08:25:54] fabfur: yep [08:25:55] thx [08:26:07] ahaha sorry, switching from one channel to another [08:26:09] I'll go [08:27:03] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3067.esams.wmnet [08:27:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59270 and previous config saved to /var/cache/conftool/dbconfig/20240403-082712-root.json [08:28:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:29:44] (03CR) 10Majavah: [C:03+2] Bind mariadb on clouddb2002-dev to the IPv4 address [puppet] - 10https://gerrit.wikimedia.org/r/1016712 (owner: 10Majavah) [08:29:46] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp3067.esams.wmnet with OS bullseye [08:29:56] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9683295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3067.esams.wmnet with OS bullseye [08:30:00] (ProbeDown) resolved: (2) Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:30:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:30:58] !log arnaudb@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:31:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:31:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:31:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T360332)', diff saved to https://phabricator.wikimedia.org/P59271 and previous config saved to /var/cache/conftool/dbconfig/20240403-083123-arnaudb.json [08:31:29] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [08:31:42] (03PS1) 10Slyngshede: Add svg files to packages. [software/bitu] - 10https://gerrit.wikimedia.org/r/1016713 [08:32:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T356166)', diff saved to https://phabricator.wikimedia.org/P59272 and previous config saved to /var/cache/conftool/dbconfig/20240403-083249-marostegui.json [08:32:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [08:32:53] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [08:33:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:33:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:33:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster1002.eqiad.wmnet [08:33:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [08:33:10] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093#9683311 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster1002.eqiad.wmnet` - puppetmaster10... [08:33:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T356166)', diff saved to https://phabricator.wikimedia.org/P59273 and previous config saved to /var/cache/conftool/dbconfig/20240403-083313-marostegui.json [08:33:25] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.42.0-wmf.25 refs T360157 (duration: 13m 00s) [08:33:28] T360157: 1.42.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T360157 [08:33:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T360332)', diff saved to https://phabricator.wikimedia.org/P59274 and previous config saved to /var/cache/conftool/dbconfig/20240403-083343-arnaudb.json [08:35:14] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1016713 (owner: 10Slyngshede) [08:35:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T356166)', diff saved to https://phabricator.wikimedia.org/P59275 and previous config saved to /var/cache/conftool/dbconfig/20240403-083530-marostegui.json [08:35:46] (03CR) 10Slyngshede: [C:03+2] Add svg files to packages. [software/bitu] - 10https://gerrit.wikimedia.org/r/1016713 (owner: 10Slyngshede) [08:36:18] !log stop sanitarium codfw hosts T361673 [08:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:20] T361673: Filter cu_useragent on sanitarium - https://phabricator.wikimedia.org/T361673 [08:36:54] (03Merged) 10jenkins-bot: Add svg files to packages. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1016713 (owner: 10Slyngshede) [08:40:39] (03PS1) 10Ayounsi: Add routed Ganeti to Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1016714 (https://phabricator.wikimedia.org/T300152) [08:42:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59276 and previous config saved to /var/cache/conftool/dbconfig/20240403-084218-root.json [08:48:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P59278 and previous config saved to /var/cache/conftool/dbconfig/20240403-084851-arnaudb.json [08:50:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59279 and previous config saved to /var/cache/conftool/dbconfig/20240403-085037-marostegui.json [08:51:25] (SystemdUnitFailed) firing: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:42] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3067.esams.wmnet with reason: host reimage [08:52:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1016714 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:55:12] (03CR) 10Ayounsi: [C:03+2] Add routed Ganeti to Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1016714 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:55:58] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3067.esams.wmnet with reason: host reimage [08:56:25] (SystemdUnitFailed) firing: (2) 
httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:57:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59280 and previous config saved to /var/cache/conftool/dbconfig/20240403-085723-root.json [09:00:10] !log Upgraded Bitu / idm.wikimedia.org to version 0.0.6-2 [09:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:03:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P59281 and previous config saved to /var/cache/conftool/dbconfig/20240403-090358-arnaudb.json [09:04:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:05:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P59282 and previous config saved to /var/cache/conftool/dbconfig/20240403-090545-marostegui.json [09:06:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:30] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db2125 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59283 and previous config saved to /var/cache/conftool/dbconfig/20240403-091229-root.json [09:12:44] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1016372 (https://phabricator.wikimedia.org/T361682) [09:13:44] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): 14Decommission puppetmaster1002 - 14https://phabricator.wikimedia.org/T357093#9683402 (10MoritzMuehlenhoff) 05Open→03Resolved 14puppetmaster1002 has been decommissioned. [09:14:50] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: 14Connection errors from puppetmaster1002 to puppetdb - 14https://phabricator.wikimedia.org/T358187#9683417 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff 14We never got to the bottom of this error, it was likely a hardwa... [09:17:17] 10ops-eqiad, 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Decommission puppetmaster1002 - https://phabricator.wikimedia.org/T357093#9683422 (10MoritzMuehlenhoff) 05Resolved→03Open a:05MoritzMuehlenhoff→03Jclark-ctr [09:18:30] (03PS1) 10Muehlenhoff: Remove puppetmaster1002 from puppetdb ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1016716 (https://phabricator.wikimedia.org/T357093) [09:18:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:19:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:19:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T360332)', diff saved to https://phabricator.wikimedia.org/P59284 and previous config saved to /var/cache/conftool/dbconfig/20240403-091906-arnaudb.json [09:19:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [09:19:09] T360332: Make the cupe_actor column nullable 
on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:19:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [09:19:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59285 and previous config saved to /var/cache/conftool/dbconfig/20240403-091929-arnaudb.json [09:19:45] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3067.esams.wmnet with OS bullseye [09:19:46] (03PS1) 10Muehlenhoff: Remove puppetmaster1002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1016717 (https://phabricator.wikimedia.org/T357093) [09:19:56] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9683438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3067.esams.wmnet with OS bullseye completed: - cp3067 (**PASS**)... 
[09:20:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T356166)', diff saved to https://phabricator.wikimedia.org/P59286 and previous config saved to /var/cache/conftool/dbconfig/20240403-092053-marostegui.json [09:20:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:20:56] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [09:21:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:21:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T356166)', diff saved to https://phabricator.wikimedia.org/P59287 and previous config saved to /var/cache/conftool/dbconfig/20240403-092116-marostegui.json [09:21:48] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster1002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1016717 (https://phabricator.wikimedia.org/T357093) (owner: 10Muehlenhoff) [09:21:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59288 and previous config saved to /var/cache/conftool/dbconfig/20240403-092149-arnaudb.json [09:21:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: 14Investigate Ganeti in routed mode - 14https://phabricator.wikimedia.org/T300152#9683434 (10ayounsi) 05Open→03Resolved 14We can consider this task completed with success. Next step is to discuss the next steps and ope... 
[09:23:09] (03CR) 10JMeybohm: [V:03+1 C:03+2] k8s/apiserver: Add option to configure audit logging [puppet] - 10https://gerrit.wikimedia.org/r/1015354 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:23:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T356166)', diff saved to https://phabricator.wikimedia.org/P59289 and previous config saved to /var/cache/conftool/dbconfig/20240403-092334-marostegui.json [09:24:07] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9683452 (10Fabfur) [09:24:19] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3067.esams.wmnet [09:27:12] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1037.eqiad.wmnet with OS bookworm [09:27:25] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683460 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1037.eqiad.wmnet... 
[09:27:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59290 and previous config saved to /var/cache/conftool/dbconfig/20240403-092735-root.json [09:27:37] !log Restart sanitarium db1155 T361673 [09:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:40] T361673: Filter cu_useragent on sanitarium - https://phabricator.wikimedia.org/T361673 [09:27:43] (03PS1) 10David Caro: containerd: export the crictl endpoint in profile.d [puppet] - 10https://gerrit.wikimedia.org/r/1016719 [09:28:07] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016720 (https://phabricator.wikimedia.org/T319184) [09:29:34] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1016719 (owner: 10David Caro) [09:30:20] (03CR) 10David Caro: [C:03+1] cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016720 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [09:31:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:31:36] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:31:40] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:31:57] (03PS1) 10JMeybohm: k8s/apiserver: Fix parameter syntax for --audit-log-maxsize [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507) [09:32:04] (03PS1) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 
(https://phabricator.wikimedia.org/T350507) [09:32:25] (03CR) 10CI reject: [V:04-1] k8s/apiserver: Fix parameter syntax for --audit-log-maxsize [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:32:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016720 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [09:32:53] (03CR) 10CI reject: [V:04-1] mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [09:32:55] (03PS2) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) [09:33:52] (03PS2) 10JMeybohm: k8s/apiserver: Fix parameter syntax for --audit-log-maxsize [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507) [09:33:56] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:34:36] (03PS1) 10Muehlenhoff: debmonitor: Remove obsolete discovery certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016723 (https://phabricator.wikimedia.org/T357750) [09:34:50] (03PS1) 10Slavina Stefanova: harbor: upgrade from 2.9.0 to 2.10.1 [puppet] - 10https://gerrit.wikimedia.org/r/1016724 (https://phabricator.wikimedia.org/T354507) [09:34:55] (03CR) 10JMeybohm: [C:03+2] k8s/apiserver: Fix parameter syntax for --audit-log-maxsize [puppet] - 10https://gerrit.wikimedia.org/r/1016721 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [09:35:50] (03PS2) 10Muehlenhoff: 
debmonitor: Remove obsolete discovery certificate [puppet] - 10https://gerrit.wikimedia.org/r/1016723 (https://phabricator.wikimedia.org/T357750) [09:36:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016723 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [09:36:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P59291 and previous config saved to /var/cache/conftool/dbconfig/20240403-093657-arnaudb.json [09:37:34] (03PS3) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) [09:38:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P59292 and previous config saved to /var/cache/conftool/dbconfig/20240403-093842-marostegui.json [09:38:47] jouncebot: next [09:38:48] In 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1000) [09:39:30] !log roll-restart prometheus/k8s in codfw/eqiad to apply new retention settings - T360537 [09:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:33] T360537: Bump prometheus instances allocated space - https://phabricator.wikimedia.org/T360537 [09:39:42] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: bump k8s prometheus retention_size [puppet] - 10https://gerrit.wikimedia.org/r/1016305 (https://phabricator.wikimedia.org/T360537) (owner: 10Filippo Giunchedi) [09:40:20] (03PS4) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) [09:41:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:35] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1037 [09:42:39] (03PS5) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) [09:42:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59293 and previous config saved to /var/cache/conftool/dbconfig/20240403-094241-root.json [09:43:00] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1037 [09:44:05] (03CR) 10Majavah: [C:03+2] php82-sssd: add php-yaml [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle) [09:44:40] jouncebot: next [09:44:40] In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1000) [09:44:43] (03Merged) 10jenkins-bot: php82-sssd: add php-yaml [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: 10Krinkle) [09:44:51] (03PS5) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 [09:45:13] !log Doing security deploy for T361293 [09:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:31] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1037.eqiad.wmnet with reason: host reimage [09:48:15] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudvirt1037.eqiad.wmnet with reason: host reimage [09:48:30] (03PS6) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) [09:49:28] (03PS1) 10Muehlenhoff: Remove dummy cert for debmonitor [labs/private] - 10https://gerrit.wikimedia.org/r/1016726 (https://phabricator.wikimedia.org/T357750) [09:50:16] (03PS1) 10Mvolz: Update zotero to node18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016728 (https://phabricator.wikimedia.org/T349118) [09:51:25] (SystemdUnitFailed) resolved: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:55] (03CR) 10David Caro: [C:03+2] containerd: export the crictl endpoint in profile.d [puppet] - 10https://gerrit.wikimedia.org/r/1016719 (owner: 10David Caro) [09:52:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P59294 and previous config saved to /var/cache/conftool/dbconfig/20240403-095204-arnaudb.json [09:52:27] (03CR) 10David Caro: [C:03+2] "Tested in tools" [puppet] - 10https://gerrit.wikimedia.org/r/1016719 (owner: 10David Caro) [09:53:21] (03CR) 10Jgiannelos: "This is the missing config section to enable caching in PCS staging. 
From the CI output it looks like the template generates whats expecte" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [09:53:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P59295 and previous config saved to /var/cache/conftool/dbconfig/20240403-095349-marostegui.json [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1000) [10:06:37] !log dreamyjazz Deployed security patch for T361293 [10:07:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T360332)', diff saved to https://phabricator.wikimedia.org/P59296 and previous config saved to /var/cache/conftool/dbconfig/20240403-100712-arnaudb.json [10:07:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:07:15] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:07:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:07:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59297 and previous config saved to /var/cache/conftool/dbconfig/20240403-100735-arnaudb.json [10:08:29] (03CR) 10Hnowlan: [C:03+1] "lgtm - it might be nice to add a .fixture entry to show this feature being enabled for testing purposes." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [10:08:31] (03PS2) 10Muehlenhoff: analytics_cluster::coordinator: Configure Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619) [10:08:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T356166)', diff saved to https://phabricator.wikimedia.org/P59298 and previous config saved to /var/cache/conftool/dbconfig/20240403-100857-marostegui.json [10:08:58] !log installing util-linux security updates [10:08:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [10:09:00] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [10:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [10:09:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T356166)', diff saved to https://phabricator.wikimedia.org/P59299 and previous config saved to /var/cache/conftool/dbconfig/20240403-100919-marostegui.json [10:10:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59300 and previous config saved to /var/cache/conftool/dbconfig/20240403-100959-arnaudb.json [10:10:29] !log Restart sanitarium db1154 T361673 [10:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:32] T361673: Filter cu_useragent on sanitarium - https://phabricator.wikimedia.org/T361673 [10:11:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T356166)', diff saved to 
https://phabricator.wikimedia.org/P59301 and previous config saved to /var/cache/conftool/dbconfig/20240403-101137-marostegui.json [10:14:04] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:14:15] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:14:27] (03CR) 10Volans: "Code looks sane, I would love to see it in action, but if you tested in your lab that's enough for me. One question and minor nits/suggest" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [10:17:19] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1037.eqiad.wmnet with OS bookworm [10:17:34] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1037.eqiad.wmnet with... 
[10:18:49] (03PS7) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) [10:19:15] (03PS8) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) [10:19:51] (03CR) 10Volans: [C:03+1] "LGTM, better explicit than implicit and we could split it" [puppet] - 10https://gerrit.wikimedia.org/r/1016456 (owner: 10Scott French) [10:20:00] !log dreamyjazz Deployed security patch for T361293 [10:20:40] (03PS9) 10Jgiannelos: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) [10:24:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 841.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:25:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P59302 and previous config saved to /var/cache/conftool/dbconfig/20240403-102507-arnaudb.json [10:25:21] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683651 (10aborrero) [10:25:30] (03CR) 10Hnowlan: [C:03+1] mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [10:26:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to 
https://phabricator.wikimedia.org/P59303 and previous config saved to /var/cache/conftool/dbconfig/20240403-102644-marostegui.json [10:27:15] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [10:28:07] (03Merged) 10jenkins-bot: mobileapps: Caching config for pregenerated content [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016722 (https://phabricator.wikimedia.org/T350507) (owner: 10Jgiannelos) [10:29:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 813.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:29:41] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:29:45] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:35:41] (03PS7) 10Stevemunene: Decommission an-coord100[12] The change includes removal of an-coord100[1-2] mentions in comments and references. [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [10:37:43] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [10:38:41] (03CR) 10CI reject: [V:04-1] Decommission an-coord100[12] The change includes removal of an-coord100[1-2] mentions in comments and references. 
[puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [10:38:44] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1016457 (owner: 10Scott French) [10:39:40] (03CR) 10Volans: [C:03+1] "Nice! We could also keep it for the migration." [puppet] - 10https://gerrit.wikimedia.org/r/1016458 (owner: 10Scott French) [10:40:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P59304 and previous config saved to /var/cache/conftool/dbconfig/20240403-104014-arnaudb.json [10:40:42] (03PS1) 10Stevemunene: Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1016741 (https://phabricator.wikimedia.org/T353774) [10:41:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P59305 and previous config saved to /var/cache/conftool/dbconfig/20240403-104152-marostegui.json [10:45:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 837.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:47:03] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1038.eqiad.wmnet with OS bookworm [10:47:37] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:50:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 837.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:55:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59306 and previous config saved to /var/cache/conftool/dbconfig/20240403-105522-arnaudb.json [10:55:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [10:55:31] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [10:55:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [10:55:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T360332)', diff saved to https://phabricator.wikimedia.org/P59307 and previous config saved to /var/cache/conftool/dbconfig/20240403-105545-arnaudb.json [10:57:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T356166)', diff saved to https://phabricator.wikimedia.org/P59308 and previous config saved to /var/cache/conftool/dbconfig/20240403-105659-marostegui.json [10:57:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [10:57:05] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [10:57:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [10:57:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T356166)', diff saved to https://phabricator.wikimedia.org/P59309 and previous config saved to /var/cache/conftool/dbconfig/20240403-105722-marostegui.json [10:57:45] 
!log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:57:57] (03CR) 10CI reject: [V:04-1] Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1016741 (https://phabricator.wikimedia.org/T353774) (owner: 10Stevemunene) [10:58:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T360332)', diff saved to https://phabricator.wikimedia.org/P59310 and previous config saved to /var/cache/conftool/dbconfig/20240403-105804-arnaudb.json [10:58:46] (03Abandoned) 10Stevemunene: Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1016741 (https://phabricator.wikimedia.org/T353774) (owner: 10Stevemunene) [10:59:01] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683722 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1038.eqiad.wmnet... [10:59:24] (03PS8) 10Stevemunene: Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [10:59:28] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016743 (https://phabricator.wikimedia.org/T319184) [10:59:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T356166)', diff saved to https://phabricator.wikimedia.org/P59311 and previous config saved to /var/cache/conftool/dbconfig/20240403-105940-marostegui.json [11:00:05] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1100). 
[11:00:48] (03CR) 10Majavah: [C:03+1] cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016743 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:01:26] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1037: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016743 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:01:53] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Updating my +1 after testing the latest changes!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [11:02:55] (03CR) 10Brouberol: [C:03+1] Remove now obsolete site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/1016293 (https://phabricator.wikimedia.org/T341895) (owner: 10Muehlenhoff) [11:04:01] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage [11:05:12] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:05:14] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:06:26] (03CR) 10Mvolz: [C:03+2] Update zotero to node18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016728 (https://phabricator.wikimedia.org/T349118) (owner: 10Mvolz) [11:07:07] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: host reimage [11:07:33] (03Merged) 10jenkins-bot: Update zotero to node18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016728 (https://phabricator.wikimedia.org/T349118) (owner: 10Mvolz) [11:08:01] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:08:03] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:09:51] !log fab@deploy1002 Started deploy [airflow-dags/research@75163c7]: (no 
justification provided) [11:10:23] !log fab@deploy1002 Finished deploy [airflow-dags/research@75163c7]: (no justification provided) (duration: 00m 32s) [11:11:25] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:11:50] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:13:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P59312 and previous config saved to /var/cache/conftool/dbconfig/20240403-111312-arnaudb.json [11:13:27] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:13:59] (03CR) 10Muehlenhoff: [C:03+2] Remove now obsolete site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/1016293 (https://phabricator.wikimedia.org/T341895) (owner: 10Muehlenhoff) [11:14:07] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:14:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P59313 and previous config saved to /var/cache/conftool/dbconfig/20240403-111447-marostegui.json [11:15:30] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:16:03] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:16:07] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1038 [11:16:31] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1038 [11:17:54] (03CR) 10Volans: [C:03+1] "LGTM" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1008944 (owner: 10Scott French) [11:19:30] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016374 [11:19:54] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1016374 (owner: 10PipelineBot) [11:20:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:21:00] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016374 (owner: 10PipelineBot) [11:22:46] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015457 (owner: 10PipelineBot) [11:23:45] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:23:55] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015457 (owner: 10PipelineBot) [11:23:57] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:24:04] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014058 (owner: 10PipelineBot) [11:24:23] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:25:14] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:25:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:27:09] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:27:33] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:28:06] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:28:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P59314 and previous config saved to /var/cache/conftool/dbconfig/20240403-112819-arnaudb.json [11:28:43] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:29:15] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:29:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P59315 and previous config saved to /var/cache/conftool/dbconfig/20240403-112955-marostegui.json [11:30:02] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:30:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:33:03] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1038.eqiad.wmnet with OS bookworm [11:35:17] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1038.eqiad.wmnet with... 
[11:35:31] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683893 (10aborrero) [11:35:54] (03CR) 10Alexandros Kosiaris: [C:03+2] ores: Remove old ORES DNS entries [dns] - 10https://gerrit.wikimedia.org/r/1016389 (owner: 10Alexandros Kosiaris) [11:36:00] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks for the +1" [dns] - 10https://gerrit.wikimedia.org/r/1016389 (owner: 10Alexandros Kosiaris) [11:37:14] (03CR) 10Majavah: [C:03+1] "typo inline, otherwise LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans) [11:38:17] (03PS2) 10Volans: puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) [11:38:56] (03CR) 10Volans: "fixed typo" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans) [11:43:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T360332)', diff saved to https://phabricator.wikimedia.org/P59317 and previous config saved to /var/cache/conftool/dbconfig/20240403-114327-arnaudb.json [11:43:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:43:32] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [11:43:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:43:51] !log installing imagemagick security updates [11:43:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T360332)', diff saved to https://phabricator.wikimedia.org/P59318 and previous config saved to /var/cache/conftool/dbconfig/20240403-114350-arnaudb.json [11:43:52] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T356166)', diff saved to https://phabricator.wikimedia.org/P59319 and previous config saved to /var/cache/conftool/dbconfig/20240403-114502-marostegui.json [11:45:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [11:45:10] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [11:45:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [11:45:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T356166)', diff saved to https://phabricator.wikimedia.org/P59320 and previous config saved to /var/cache/conftool/dbconfig/20240403-114525-marostegui.json [11:46:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T360332)', diff saved to https://phabricator.wikimedia.org/P59321 and previous config saved to /var/cache/conftool/dbconfig/20240403-114611-arnaudb.json [11:47:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T356166)', diff saved to https://phabricator.wikimedia.org/P59322 and previous config saved to /var/cache/conftool/dbconfig/20240403-114743-marostegui.json [11:47:59] (03CR) 10Slavina Stefanova: "tested on toolsbeta" [puppet] - 10https://gerrit.wikimedia.org/r/1016724 (https://phabricator.wikimedia.org/T354507) (owner: 10Slavina Stefanova) [11:50:39] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1039.eqiad.wmnet with OS bookworm [11:50:57] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683960 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1039.eqiad.wmnet... [11:52:47] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:52:52] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:53:26] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1039: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016751 (https://phabricator.wikimedia.org/T319184) [11:54:36] (03CR) 10David Caro: [C:03+1] cloudvirt1039: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016751 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:54:49] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1039: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016751 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:55:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:55:54] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9683981 (10aborrero) [11:57:01] (03PS2) 10Filippo Giunchedi: hieradata: add logstash_oidc client [puppet] - 10https://gerrit.wikimedia.org/r/1016301 (https://phabricator.wikimedia.org/T337818) [11:57:18] (03PS10) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [11:57:21] (03PS4) 10Filippo Giunchedi: Use oauth2-proxy for opensearch dashboards [puppet] - 10https://gerrit.wikimedia.org/r/1015045 (https://phabricator.wikimedia.org/T337818) [11:57:21] (03CR) 10Filippo Giunchedi: "Thank you, I've added the vhost at Ib82d2a93" [puppet] - 10https://gerrit.wikimedia.org/r/1015045 
(https://phabricator.wikimedia.org/T337818) (owner: 10Filippo Giunchedi) [11:58:42] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1039 [11:59:39] (03PS11) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [12:01:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P59323 and previous config saved to /var/cache/conftool/dbconfig/20240403-120118-arnaudb.json [12:02:15] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:02:19] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:02:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P59324 and previous config saved to /var/cache/conftool/dbconfig/20240403-120251-marostegui.json [12:05:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:07:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:30] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1039 [12:07:42] (03CR) 10Ladsgroup: [C:03+1] WMCS: Read from the new block/block_target tables [puppet] - 10https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [12:08:17] (03PS1) 10JMeybohm: k8s: Enable audit logging in staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1016753 
(https://phabricator.wikimedia.org/T273507) [12:08:45] (03CR) 10CI reject: [V:04-1] k8s: Enable audit logging in staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:08:58] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage [12:09:05] (03PS2) 10JMeybohm: k8s: Enable audit logging in staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507) [12:10:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:11:40] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage [12:11:58] (03CR) 10Ayounsi: Netbox: add functions to get and set device name (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [12:14:01] (03CR) 10Ayounsi: [C:03+1] puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans) [12:14:34] (03CR) 10Majavah: [C:03+1] puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans) [12:15:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:16:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P59325 and previous config saved to /var/cache/conftool/dbconfig/20240403-121626-arnaudb.json [12:16:47] (03CR) 10Brouberol: [C:03+1] analytics_cluster::coordinator: Configure Puppet 7 on the role level [puppet] - 
10https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:17:51] (03CR) 10Volans: [C:03+2] puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans) [12:17:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P59326 and previous config saved to /var/cache/conftool/dbconfig/20240403-121759-marostegui.json [12:20:08] (03PS1) 10Fabfur: benthos: add BENTHOS_SOURCE envvar [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) [12:20:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:26:06] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:26:07] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [12:27:02] (03Merged) 10jenkins-bot: puppet: PuppetServer.destroy improvement [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016438 (https://phabricator.wikimedia.org/T360293) (owner: 10Volans) [12:27:08] (03CR) 10Volans: [C:04-1] "I spot few minor corner cases to cover." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [12:31:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T360332)', diff saved to https://phabricator.wikimedia.org/P59327 and previous config saved to /var/cache/conftool/dbconfig/20240403-123133-arnaudb.json [12:31:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance [12:31:37] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:31:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance [12:31:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T360332)', diff saved to https://phabricator.wikimedia.org/P59328 and previous config saved to /var/cache/conftool/dbconfig/20240403-123156-arnaudb.json [12:32:19] I am going to upgrade the CI Jenkins [12:33:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T356166)', diff saved to https://phabricator.wikimedia.org/P59329 and previous config saved to /var/cache/conftool/dbconfig/20240403-123306-marostegui.json [12:33:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [12:33:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:33:17] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [12:33:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [12:33:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T356166)', diff saved to https://phabricator.wikimedia.org/P59330 and previous config 
saved to /var/cache/conftool/dbconfig/20240403-123329-marostegui.json [12:34:40] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1039.eqiad.wmnet with OS bookworm [12:34:50] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1039.eqiad.wmnet with... [12:35:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:36:40] (03CR) 10Gmodena: [C:03+1] benthos: add BENTHOS_SOURCE envvar [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [12:36:46] (03CR) 10JMeybohm: [C:03+1] "Looks a bit hackish - but I trust your thesis on how blubber would copy everything around again potentially. So this LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [12:37:33] (03CR) 10FNegri: [C:03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [12:41:53] * hashar !log Upgrading CI Jenkins # T360759 [12:42:14] * Lucas_WMDE confused at !log in /me message [12:42:20] OH MY [12:42:22] well spotted [12:42:25] :D [12:42:26] (03CR) 10JMeybohm: [V:03+1 C:03+2] k8s: Enable audit logging in staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1016753 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:42:29] !log Upgrading CI Jenkins # T360759 [12:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:32] T360759: Jenkins core security advisory - 2024-03-20 - https://phabricator.wikimedia.org/T360759 [12:42:49] well hmm [12:42:52] apparently it managed to start [12:43:13] * hashar claims success [12:45:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T356166)', diff saved to https://phabricator.wikimedia.org/P59332 and previous config saved to /var/cache/conftool/dbconfig/20240403-124550-marostegui.json [12:45:54] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [12:47:55] (03CR) 10Brouberol: [C:03+1] "LG thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [12:51:23] (03CR) 10Elukey: [V:03+2 C:03+2] Remove profile::pki::client's specific hiera config [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [12:52:12] (03CR) 10Majavah: "No, this is needed for PCC runs for wikiproduction hosts..." 
[labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [12:52:21] (03CR) 10Muehlenhoff: [C:03+2] analytics_cluster::coordinator: Configure Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1016310 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:53:14] (03CR) 10Elukey: [V:03+2 C:03+2] "There is already a value in common.yaml, it should be fine to just use that one, no? I think it is confusing to keep two values.." [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [12:55:05] (03CR) 10Brouberol: [C:03+2] Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [12:55:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T360332)', diff saved to https://phabricator.wikimedia.org/P59333 and previous config saved to /var/cache/conftool/dbconfig/20240403-125521-arnaudb.json [12:55:28] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [12:55:39] (03CR) 10Majavah: "I don't think namespaced keys are looked up from common.yaml in production, but I might be wrong?" [labs/private] - 10https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: 10Elukey) [12:55:46] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9684225 (10MoritzMuehlenhoff) [12:58:35] (03CR) 10Elukey: [V:03+2 C:03+2] "I was convinced they were, but then I discovered https://phabricator.wikimedia.org/T209265. 
This task unveils horrible holes in my puppet " [labs/private] - https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[12:58:42] jouncebot: next
[12:58:42] In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1300)
[12:59:06] Considering there is no patches in the window, I want to do a security deploy.
[12:59:13] SRE-swift-storage, Observability-Metrics: Capacity planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747#9684238 (fgiunchedi) Moving off Q4 board since we have hw in capex spreadsheet and it'll be coming next FY
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1300).
[13:00:05] No Gerrit patches in the queue for this window AFAICS.
[13:00:18] \o
[13:00:21] Dreamy_Jazz: go ahead
[13:00:23] Thanks.
[13:00:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P59334 and previous config saved to /var/cache/conftool/dbconfig/20240403-130058-marostegui.json
[13:02:23] (PS1) Elukey: profile::pki::client: re-introduce fake auth token [labs/private] - https://gerrit.wikimedia.org/r/1016764 (https://phabricator.wikimedia.org/T360595)
[13:03:22] (PS2) Fabfur: benthos: add BENTHOS_SOURCE envvar [puppet] - https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109)
[13:03:27] (CR) Muehlenhoff: [C:+1] Decommission an-coord100[12] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: Brouberol)
[13:03:32] (CR) Elukey: [V:+2 C:+2] profile::pki::client: re-introduce fake auth token [labs/private] - https://gerrit.wikimedia.org/r/1016764 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[13:04:07] taavi: ok now I am going to stop messing with deployment-prep I promise, thanks for the patience
[13:05:04] (CR) Gmodena: [C:+1] benthos: add BENTHOS_SOURCE envvar [puppet] - https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur)
[13:05:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:06:25] (PS1) Filippo Giunchedi: sre: disable pint promql/series for EnvoyRuntimeAdminOverrides [alerts] - https://gerrit.wikimedia.org/r/1016786 (https://phabricator.wikimedia.org/T359633)
[13:06:26] I have two security patches to deploy. I will say once I'm done.
[13:10:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P59335 and previous config saved to /var/cache/conftool/dbconfig/20240403-131029-arnaudb.json [13:12:53] (03PS2) 10Alexandros Kosiaris: changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483) [13:13:06] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891#9684341 (10fnegri) It's very slow for me as well, I hadn't opened it in a while but it was barely usable both yesterday and today. ` ~ $ curl -o /... [13:14:08] (03PS3) 10Alexandros Kosiaris: changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483) [13:15:38] !log installing tiff security updates [13:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P59336 and previous config saved to /var/cache/conftool/dbconfig/20240403-131606-marostegui.json [13:16:33] (KubernetesCalicoDown) firing: (78) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:16:55] (ProbeDown) firing: Service miscweb1003:30443 has failed probes (http_dbtree_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:17:29] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:17:40] (CalicoTyphaDown) firing: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [13:17:47] * sukhe here for the eventual pag.e I guess! [13:17:57] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:01] ha [13:18:09] here [13:18:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:18:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:18:20] ACKed [13:18:27] here too, thank you sukhe [13:18:45] calico in trouble maybe? 13:16 -jinxer-wm:#wikimedia-operations- (KubernetesCalicoDown) firing: (78) [13:19:00] here..ugh [13:19:31] yeah calico or typha [13:19:34] (03PS9) 10Elukey: Rework the amd-pytorch22's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) [13:20:26] wow zero typha containers running? 
[13:20:27] calico pods are in crashloopbackoff [13:20:29] yeahhhh [13:20:51] (SwaggerProbeHasFailures) firing: (5) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:21:00] not sure what the next best action here is? [13:21:03] sharp increase in 5xx [13:21:15] bird/confd is not live: Service confd is not running. [13:21:33] ouch [13:21:33] (KubernetesCalicoDown) firing: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:21:38] jayme, akosiaris: are you about? [13:21:55] (ProbeDown) firing: (12) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:02] here too [13:22:15] (MediaWikiLatencyExceeded) firing: (2) p75 latency high: eqiad mw-api-ext (k8s) 21.31s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:22:16] people lets take this to -sre [13:22:25] as it looks very very very bad [13:22:30] (ProbeDown) firing: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:53] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[13:22:56] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [13:22:57] (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:06] (MediaWikiEditFailures) firing: (2) Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:23:15] (PHPFPMTooBusy) firing: (5) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:23:20] !incidents [13:23:20] 4556 (ACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [13:23:56] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:24:43] (VarnishUnavailable) firing: (2) varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [13:24:44] (HaproxyUnavailable) firing: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:24:49] this is a fun one [13:25:02] ACKed all [13:25:16] Currently deploying a security fix but my internet went out. [13:25:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P59337 and previous config saved to /var/cache/conftool/dbconfig/20240403-132536-arnaudb.json [13:25:42] arnaudb@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:25:51] Dreamy_Jazz: not sure how far your backscroll go, but there is an incident ATM [13:25:51] (SwaggerProbeHasFailures) firing: (11) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:25:56] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [13:26:01] Oh. I see. 
[13:26:21] I'm not sure if my console is actually still connected, so no idea if the security deploy has errored out or is still continuing. [13:26:24] Dreamy_Jazz: what were you deploying? [13:26:29] A security patch [13:26:30] I don’t see a running scap, at least [13:26:33] (CalicoKubeControllersDown) firing: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [13:26:33] (KubernetesCalicoDown) firing: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:26:53] Using deploy_security.py [13:26:55] (ProbeDown) firing: (12) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:56] nor a login session in `who` [13:27:15] (MediaWikiLatencyExceeded) firing: (3) p75 latency high: eqiad mw-api-ext (k8s) 8.301s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:27:20] Dreamy_Jazz: task # ? I doubt its related but.. 
[13:27:30] (ProbeDown) firing: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:27:31] but we should probably back off for the incident anyway [13:27:36] https://phabricator.wikimedia.org/T361479 [13:27:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [13:27:50] !incidents [13:27:50] 4556 (ACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [13:27:50] 4557 (ACKED) [2x] VarnishUnavailable global sre (varnish-text) [13:27:50] 4558 (ACKED) [2x] HaproxyUnavailable cache_text global sre () [13:27:51] 4559 (UNACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [13:27:54] !ack 4559 [13:27:55] 4559 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [13:27:57] (ProbeDown) firing: (16) Service eventgate-analytics:4592 has failed probes (http_eventgate-analytics_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:30] (MediaWikiHighErrorRate) firing: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:28:52] My internet went out after I saw messages related to the incident, so I don't think it is related. 
[13:28:56] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:29:40] 10ops-codfw, 06SRE, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9684390 (10Jhancock.wm) [13:29:51] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:30:04] !incidents [13:30:04] 4556 (ACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [13:30:05] 4557 (ACKED) [2x] VarnishUnavailable global sre (varnish-text) [13:30:05] 4558 (ACKED) [2x] HaproxyUnavailable cache_text global sre () [13:30:05] 4559 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [13:30:05] 4560 (UNACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [13:30:08] !ack 4560 [13:30:08] 4560 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [13:30:51] (SwaggerProbeHasFailures) firing: (19) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:31:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T356166)', diff saved to https://phabricator.wikimedia.org/P59338 and previous config saved to /var/cache/conftool/dbconfig/20240403-133113-marostegui.json [13:31:15] (MediaWikiMemcachedHighErrorRate) firing: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:31:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:31:18] marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [13:31:19] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [13:31:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:31:33] (KubernetesCalicoDown) firing: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:31:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance [13:31:48] (KubernetesCalicoDown) firing: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:31:51] (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:31:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance [13:31:55] (ProbeDown) firing: (12) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:32:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T356166)', diff saved to https://phabricator.wikimedia.org/P59339 and previous config saved to /var/cache/conftool/dbconfig/20240403-133200-marostegui.json [13:32:15] (MediaWikiLatencyExceeded) firing: (4) p75 latency high: eqiad mw-api-ext (k8s) 2.451s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:32:30] (ProbeDown) resolved: (30) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:32:36] (GatewayBackendErrorsHigh) firing: (3) rest-gateway: elevated 5xx errors from page-analytics_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [13:32:40] (CalicoTyphaDown) resolved: Too few (1) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [13:32:46] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9684402 (10Reedy) [13:32:57] (ProbeDown) resolved: (18) Service citoid:4003 has failed probes (http_citoid_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:33:15] (PHPFPMTooBusy) resolved: (5) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 10.33% idle - https://bit.ly/wmf-fpmsat - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:33:30] (MediaWikiHighErrorRate) firing: (7) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:34:43] (VarnishUnavailable) resolved: (2) varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [13:34:44] (HaproxyUnavailable) resolved: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:34:51] (ATSBackendErrorsHigh) firing: (9) ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:35:51] (SwaggerProbeHasFailures) firing: (18) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:35:56] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [13:35:56] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: ... 
[13:36:02] Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:36:15] (MediaWikiMemcachedHighErrorRate) resolved: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:36:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T356166)', diff saved to https://phabricator.wikimedia.org/P59340 and previous config saved to /var/cache/conftool/dbconfig/20240403-133619-marostegui.json [13:36:22] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [13:36:23] (03CR) 10Alexandros Kosiaris: [C:03+2] changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483) (owner: 10Alexandros Kosiaris) [13:36:25] Is the issue related to Thumbor specifically? 
[13:36:28] no [13:36:33] (CalicoKubeControllersDown) resolved: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [13:36:33] (KubernetesCalicoDown) resolved: (174) kubemaster1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:36:51] (ATSBackendErrorsHigh) firing: (5) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:36:55] (ProbeDown) resolved: (12) Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:15] (MediaWikiLatencyExceeded) resolved: (4) p75 latency high: eqiad mw-api-ext (k8s) 1.496s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:37:18] Okay. Thanks. 
The MediaModeration dashboard suggested issues since 7am today [13:37:21] https://grafana.wikimedia.org/d/STSXVVdSk/mediamoderation-photodna-stats?orgId=1&refresh=5m&var-wiki=commonswiki [13:37:25] (03Merged) 10jenkins-bot: changeprop: Remove ORES functionality from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016391 (https://phabricator.wikimedia.org/T361483) (owner: 10Alexandros Kosiaris) [13:37:36] (GatewayBackendErrorsHigh) firing: (3) rest-gateway: elevated 5xx errors from page-analytics_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [13:37:56] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [13:38:06] (MediaWikiEditFailures) resolved: (2) Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:38:30] (MediaWikiHighErrorRate) resolved: (6) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:38:56] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:39:04] (03CR) 10Alexandros Kosiaris: "Thanks for 
this, let us know when you are ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:39:51] (ATSBackendErrorsHigh) resolved: (9) ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:40:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T360332)', diff saved to https://phabricator.wikimedia.org/P59341 and previous config saved to /var/cache/conftool/dbconfig/20240403-134044-arnaudb.json [13:40:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [13:40:47] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [13:40:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [13:40:51] (SwaggerProbeHasFailures) firing: (17) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:40:56] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:41:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance [13:41:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance [13:41:34] It seems my security deploy is half deployed [13:41:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T360332)', diff saved to 
https://phabricator.wikimedia.org/P59342 and previous config saved to /var/cache/conftool/dbconfig/20240403-134136-arnaudb.json [13:41:49] The code is applied but the patch file isn't listed in /srv/patches [13:41:51] (ATSBackendErrorsHigh) resolved: (5) ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:43:47] I'm not sure if retrying the deploy again will error out or just do the bits which were left — I guess leaving it in its current state is safe enough whilst things are a bit unstable [13:43:56] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:44:15] yeah, IMHO it’s best not to do anything right now until the other incident is resolved [13:44:47] (HelmReleaseBadStatus) firing: (4) Helm release mw-api-ext/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:45:12] Dreamy_Jazz: yes, please do not do deploy right now without coordinating with #-sre [13:45:23] Okay. 
[13:45:51] (SwaggerProbeHasFailures) firing: (15) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:47:53] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:50:51] (SwaggerProbeHasFailures) firing: (9) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:51:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P59343 and previous config saved to /var/cache/conftool/dbconfig/20240403-135126-marostegui.json
[13:55:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1400)
[14:00:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:00:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:05:30] ops-eqiad, SRE, observability, SRE Observability (FY2023/2024-Q4): titan100[12] ram/ssd upgrade coordination - https://phabricator.wikimedia.org/T361251#9684511 (lmata)
[14:06:03] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw
[14:06:30] 10ops-codfw, 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#9684516 (10lmata) [14:06:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P59344 and previous config saved to /var/cache/conftool/dbconfig/20240403-140634-marostegui.json [14:07:32] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 2527 [14:08:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 2527 [14:08:46] (03PS1) 10Hnowlan: calico-typha: double memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 [14:09:22] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:09:29] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:11:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [14:11:30] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [14:15:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:16:46] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS bookworm [14:17:02] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1040.eqiad.wmnet... 
[14:17:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1013968 (owner: 10Majavah) [14:17:36] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1040: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016795 (https://phabricator.wikimedia.org/T319184) [14:17:40] (03CR) 10Elukey: [C:03+1] "500 would also be ok, but 600 is fine for me as well, we can always revisit later on." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan) [14:17:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [14:17:56] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:18:53] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:18:55] (03CR) 10Jelto: ""Bug: T361706" could be added to the commit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan) [14:19:22] (03CR) 10Hnowlan: [C:03+2] calico-typha: double memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan) [14:20:42] (03CR) 10David Caro: [C:03+1] cloudvirt1040: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016795 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [14:20:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures 
[14:21:01] (03CR) 10Effie Mouzeli: [C:03+1] calico-typha: double memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan) [14:21:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T356166)', diff saved to https://phabricator.wikimedia.org/P59345 and previous config saved to /var/cache/conftool/dbconfig/20240403-142142-marostegui.json [14:21:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:21:46] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [14:21:53] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1040: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1016795 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [14:21:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:22:22] (03Merged) 10jenkins-bot: calico-typha: double memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016794 (owner: 10Hnowlan) [14:22:56] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:13] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1040 [14:24:37] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1040 [14:24:55] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684602 (10aborrero) 
[14:26:03] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:26:33] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:27:02] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:27:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T360332)', diff saved to https://phabricator.wikimedia.org/P59346 and previous config saved to /var/cache/conftool/dbconfig/20240403-142709-arnaudb.json [14:27:12] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:27:27] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:30:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:31:18] 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9684629 (10andrea.denisse) a:03andrea.denisse [14:31:38] (03CR) 10Fabfur: [V:03+1 C:03+2] benthos: add BENTHOS_SOURCE envvar [puppet] - 10https://gerrit.wikimedia.org/r/1016760 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:31:49] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [14:32:09] (03CR) 10Muehlenhoff: [C:03+2] aqs: Remove ferm service [puppet] - 10https://gerrit.wikimedia.org/r/1013323 (https://phabricator.wikimedia.org/T360522) (owner: 10Muehlenhoff) [14:34:37] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [14:37:22] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:49] (PuppetDisabled) firing: Puppet disabled on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-internal&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:40:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:41:45] (03PS1) 10Elukey: role::builder: add the somebody user's UID [puppet] - 10https://gerrit.wikimedia.org/r/1016798 (https://phabricator.wikimedia.org/T360638) [14:42:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P59347 and previous config saved to /var/cache/conftool/dbconfig/20240403-144217-arnaudb.json [14:44:00] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1795/co" [puppet] - 10https://gerrit.wikimedia.org/r/1016798 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [14:44:01] !log dreamyjazz@deploy1002 Started scap: (no justification provided) [14:44:58] Didn't provide a reason, but this is related to deploying security patch T361479 [14:45:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:46:58] (03PS4) 10Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) [14:46:58] (03PS5) 10Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) [14:46:58] (03PS1) 10Andrew
Bogott: role:cinder_backups: include full env scripts [puppet] - 10https://gerrit.wikimedia.org/r/1016799 [14:46:59] (03PS1) 10Andrew Bogott: Revert "wmcs-backup: use novaobserver instead of novaadmin" [puppet] - 10https://gerrit.wikimedia.org/r/1016800 [14:50:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:52:33] (03PS2) 10Andrew Bogott: role:cinder_backups: include full env scripts [puppet] - 10https://gerrit.wikimedia.org/r/1016799 [14:52:34] (03PS2) 10Andrew Bogott: Revert "wmcs-backup: use novaobserver instead of novaadmin" [puppet] - 10https://gerrit.wikimedia.org/r/1016800 [14:52:34] (03PS5) 10Andrew Bogott: cinder backups: move schedule config from a template into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) [14:52:34] (03PS6) 10Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) [14:54:23] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016799 (owner: 10Andrew Bogott) [14:57:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P59349 and previous config saved to /var/cache/conftool/dbconfig/20240403-145725-arnaudb.json [14:59:54] (03CR) 10Andrew Bogott: [C:03+2] role:cinder_backups: include full env scripts [puppet] - 10https://gerrit.wikimedia.org/r/1016799 (owner: 10Andrew Bogott) [15:00:09] (03CR) 10Andrew Bogott: [C:03+2] Revert "wmcs-backup: use novaobserver instead of novaadmin" [puppet] - 10https://gerrit.wikimedia.org/r/1016800 (owner: 10Andrew Bogott) [15:01:23] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync [15:01:34] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudvirt1040.eqiad.wmnet with OS bookworm [15:01:45] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync [15:01:47] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9684744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1040.eqiad.wmnet with... [15:02:22] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:49] !incidents [15:02:50] 4559 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [15:02:50] 4561 (RESOLVED) [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [15:02:50] 4560 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [15:02:50] 4558 (RESOLVED) [2x] HaproxyUnavailable cache_text global sre () [15:02:51] 4557 (RESOLVED) [2x] VarnishUnavailable global sre (varnish-text) [15:02:51] 4556 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [15:02:56] !log dreamyjazz@deploy1002 Finished scap: (no justification provided) (duration: 18m 54s) [15:03:48] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2098.codfw.wmnet with reason: restart of mysqld [15:04:02] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2098.codfw.wmnet with reason: restart of mysqld [15:04:51] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [15:06:17] (03CR) 10Elukey: [V:03+1 C:03+2] role::builder: add the somebody user's UID [puppet] - 
10https://gerrit.wikimedia.org/r/1016798 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [15:06:48] (03CR) 10Elukey: [V:03+2 C:03+2] Rework the amd-pytorch22's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015530 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [15:07:36] (GatewayBackendErrorsHigh) resolved: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [15:08:25] 10ops-codfw, 06SRE, 06DBA, 10decommission-hardware, 13Patch-For-Review: 14decommission db2100.codfw.wmnet - 14https://phabricator.wikimedia.org/T361584#9684775 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:11:04] (03PS6) 10Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) [15:11:48] (03PS6) 10Ayounsi: Netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 [15:12:15] (03CR) 10Ayounsi: "Thanks, addressed" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [15:12:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T360332)', diff saved to https://phabricator.wikimedia.org/P59350 and previous config saved to /var/cache/conftool/dbconfig/20240403-151233-arnaudb.json [15:12:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [15:12:37] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:12:38] !log arnaudb@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [15:12:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:13:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:13:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [15:13:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [15:13:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2114 (T360332)', diff saved to https://phabricator.wikimedia.org/P59351 and previous config saved to /var/cache/conftool/dbconfig/20240403-151349-arnaudb.json [15:16:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T360332)', diff saved to https://phabricator.wikimedia.org/P59352 and previous config saved to /var/cache/conftool/dbconfig/20240403-151614-arnaudb.json [15:17:47] (03PS7) 10Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) [15:22:15] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:22:35] !log Starting MediaModeration scanning script again - It crashed due to the outage [15:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:55] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:25:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:26:54] (03PS6) 10Andrew Bogott: cinder backups: move schedule config from a template into 
hiera [puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) [15:26:54] (03PS7) 10Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) [15:26:54] (03PS1) 10Andrew Bogott: wmcs-backup.py: replace image_id with image_info in a few more places [puppet] - 10https://gerrit.wikimedia.org/r/1016806 (https://phabricator.wikimedia.org/T359192) [15:27:54] (03PS1) 10Elukey: amd-pytorch22: move comments to a README file [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1016807 (https://phabricator.wikimedia.org/T360638) [15:30:13] (03CR) 10Ilias Sarantopoulos: [C:03+1] amd-pytorch22: move comments to a README file [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1016807 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [15:30:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:30:54] (03CR) 10CI reject: [V:04-1] wmcs-backup.py: replace image_id with image_info in a few more places [puppet] - 10https://gerrit.wikimedia.org/r/1016806 (https://phabricator.wikimedia.org/T359192) (owner: 10Andrew Bogott) [15:31:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P59353 and previous config saved to /var/cache/conftool/dbconfig/20240403-153121-arnaudb.json [15:31:31] (03PS8) 10Ilias Sarantopoulos: Add new version for amd-pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) [15:32:07] (03PS2) 10Andrew Bogott: wmcs-backup.py: replace image_id with image_info in a few more places [puppet] - 10https://gerrit.wikimedia.org/r/1016806 (https://phabricator.wikimedia.org/T359192) [15:32:07] (03PS7) 10Andrew 
Bogott: cinder backups: move schedule config from a template into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) [15:32:07] (03PS8) 10Andrew Bogott: Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) [15:32:49] (PuppetDisabled) resolved: Puppet disabled on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-internal&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [15:33:22] !log jiji@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [15:33:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [15:35:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:36:26] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1016807 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [15:37:32] (03CR) 10Elukey: [V:03+2 C:03+2] amd-pytorch22: move comments to a README file [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1016807 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [15:42:01] (03CR) 10Tchanders: [C:03+1] "Looks good - adding +1 for when the -2 is removed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) (owner: 10Dreamy Jazz) [15:42:44] (03CR) 10Volans: [C:04-1] "Sorry last minute bug spotted, not your fault" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [15:44:06] (03CR) 10JHathaway: [C:03+1] wmcs puppetservers: stop 
pulling hiera from /etc/puppet/secrets [puppet] - 10https://gerrit.wikimedia.org/r/1015392 (owner: 10Andrew Bogott) [15:44:25] (03CR) 10FNegri: [C:03+1] "LGTM, sorry for not spotting this side effect of my change!" [puppet] - 10https://gerrit.wikimedia.org/r/1016806 (https://phabricator.wikimedia.org/T359192) (owner: 10Andrew Bogott) [15:45:43] (03CR) 10JHathaway: "that looks right, do you folks have a cfssl server in fund raising tech, or can you reach out to ours?" [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [15:45:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:46:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P59354 and previous config saved to /var/cache/conftool/dbconfig/20240403-154628-arnaudb.json [15:48:39] !depool mw-web-ro in eqiad [15:48:39] for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done [15:53:42] !log jiji@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-web-ro,name=eqiad [16:01:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T360332)', diff saved to https://phabricator.wikimedia.org/P59355 and previous config saved to /var/cache/conftool/dbconfig/20240403-160136-arnaudb.json [16:01:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [16:01:50] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:01:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [16:02:00] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Depooling db2124 (T360332)', diff saved to https://phabricator.wikimedia.org/P59356 and previous config saved to /var/cache/conftool/dbconfig/20240403-160159-arnaudb.json [16:04:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T360332)', diff saved to https://phabricator.wikimedia.org/P59357 and previous config saved to /var/cache/conftool/dbconfig/20240403-160425-arnaudb.json [16:05:11] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:05:25] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:07:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:41] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:08:16] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [16:09:41] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [16:12:57] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [16:14:44] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:14:47] (HelmReleaseBadStatus) firing: (4) Helm release mw-api-ext/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:16:05] (03CR) 10Andrew Bogott: [C:03+2] cinder backups: move schedule config from a template into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1016446 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [16:16:10] (03CR) 10Andrew Bogott: [C:03+2] 
Make cloudbackup200[12]-dev into codfw1dev cinder backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1016447 (https://phabricator.wikimedia.org/T358855) (owner: 10Andrew Bogott) [16:19:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P59358 and previous config saved to /var/cache/conftool/dbconfig/20240403-161933-arnaudb.json [16:19:47] (HelmReleaseBadStatus) firing: (4) Helm release mw-api-ext/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:24:47] (HelmReleaseBadStatus) resolved: (4) Helm release mw-api-ext/main on k8s@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:26:02] !log jiji@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-web-ro,name=eqiad [16:26:07] !log jayme@deploy1002 Started scap: (no justification provided) [16:26:22] !log pooling back mw-web-ro in eqiad [16:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [16:29:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [16:29:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:29:42] !log jayme@deploy1002 Finished scap: (no justification provided) (duration: 03m 34s) [16:29:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with 
reason: Maintenance [16:30:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T356166)', diff saved to https://phabricator.wikimedia.org/P59359 and previous config saved to /var/cache/conftool/dbconfig/20240403-163004-marostegui.json [16:30:09] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [16:30:30] (03CR) 10Dwisehaupt: [V:03+1] "We do not have a cfssl server in our area. However, this community-crm host will live on a prod vps host (cloudvps for the testing host). " [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:30:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:32:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P59360 and previous config saved to /var/cache/conftool/dbconfig/20240403-163249-root.json [16:34:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P59361 and previous config saved to /var/cache/conftool/dbconfig/20240403-163440-arnaudb.json [16:35:36] (03CR) 10Dzahn: "are you not going to use envoy to do the TLS termination and keep apache on http? 
that's now the pattern that prod services use when they " [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:36:19] (03CR) 10Dzahn: "if that was the case you would have something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014605/3/hieradata/role/common/mi" [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:36:58] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling restart_daemons on A:maps-master [16:37:57] (03CR) 10BryanDavis: "For Striker's Docker deployment on the cloudweb* hosts we use the `service::docker` wrapper with `host_network => true` so that the code i" [puppet] - 10https://gerrit.wikimedia.org/r/1016480 (owner: 10Krinkle) [16:38:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=0) rolling restart_daemons on A:maps-master [16:41:04] (03CR) 10Cwhite: [C:03+2] logstash: provision and commission logging-hd200[123] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1016368 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [16:42:22] (JobUnavailable) firing: Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:45:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:47:22] (JobUnavailable) firing: (2) Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:47:55] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db1167 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P59362 and previous config saved to /var/cache/conftool/dbconfig/20240403-164754-root.json [16:49:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T360332)', diff saved to https://phabricator.wikimedia.org/P59363 and previous config saved to /var/cache/conftool/dbconfig/20240403-164948-arnaudb.json [16:49:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [16:49:51] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [16:50:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [16:50:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T360332)', diff saved to https://phabricator.wikimedia.org/P59364 and previous config saved to /var/cache/conftool/dbconfig/20240403-165011-arnaudb.json [16:52:22] (JobUnavailable) resolved: (2) Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:52:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T360332)', diff saved to https://phabricator.wikimedia.org/P59365 and previous config saved to /var/cache/conftool/dbconfig/20240403-165234-arnaudb.json [16:52:45] (03PS1) 10Volans: tests: fix typos in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1016814 [16:54:36] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:54:43] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:56:25] (SystemdUnitFailed) firing: 
mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:47] jouncebot: next [16:56:47] In 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1700) [16:56:56] jouncebot: nowandnext [16:56:56] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [16:56:56] In 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1700) [16:59:13] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:59:21] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1700) [17:00:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:01:25] (SystemdUnitFailed) resolved: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:03:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P59366 and previous config saved to /var/cache/conftool/dbconfig/20240403-170300-root.json [17:03:26] (03CR) 10Brouberol: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [17:03:42] (JobUnavailable) firing: Reduced 
availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:04:43] !log performing rolling memory upgrades on prometheus100[56] T360687 [17:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:46] T360687: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687 [17:05:40] as a result of ^^ expect to see gaps on dashboards [17:07:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P59367 and previous config saved to /var/cache/conftool/dbconfig/20240403-170741-arnaudb.json [17:08:42] (JobUnavailable) resolved: Reduced availability for job thanos-sidecar in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:10:29] (03CR) 10Krinkle: "Aye, so I did consider that in PS1, but I noticed it also affects the ports being exported. 
There is no longer port mapping in that case, " [puppet] - 10https://gerrit.wikimedia.org/r/1016480 (owner: 10Krinkle) [17:10:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:15:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:16:18] (03PS1) 10Jforrester: Centralize API calls in api.js mixin and fix error handling [extensions/WikiLambda] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016778 (https://phabricator.wikimedia.org/T361598) [17:17:22] (03CR) 10Dzahn: [C:03+2] logstash_checker.py: Fix _mwdeploy_query for k8s-less realm [puppet] - 10https://gerrit.wikimedia.org/r/1016436 (owner: 10Ahmon Dancy) [17:18:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P59368 and previous config saved to /var/cache/conftool/dbconfig/20240403-171806-root.json [17:19:40] (03CR) 10Dzahn: "@Urbanecm Could we reboot the stewards machines any time or is something running we should look for?" [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [17:22:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P59369 and previous config saved to /var/cache/conftool/dbconfig/20240403-172249-arnaudb.json [17:24:27] 10ops-eqiad, 06SRE, 10Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9685401 (10VRiley-WMF) worked with @herron and added the 32Gig DDR4 2666 to the requested slots. Both servers came back up and reported the correct sizes as expected. Closing... 
[17:24:37] 10ops-eqiad, 06SRE, 10Observability-Metrics: 14Memory upgrade request for prometheus100[56] - 14https://phabricator.wikimedia.org/T360687#9685402 (10VRiley-WMF) 05Open→03Resolved [17:25:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:27:28] (03CR) 10Urbanecm: "Absolutely, reboot is okay at any time." [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [17:30:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:33:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P59370 and previous config saved to /var/cache/conftool/dbconfig/20240403-173312-root.json [17:35:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:37:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T360332)', diff saved to https://phabricator.wikimedia.org/P59371 and previous config saved to /var/cache/conftool/dbconfig/20240403-173756-arnaudb.json [17:38:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [17:38:00] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [17:38:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [17:38:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:38:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: 
Maintenance [17:38:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59372 and previous config saved to /var/cache/conftool/dbconfig/20240403-173835-arnaudb.json [17:39:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59373 and previous config saved to /var/cache/conftool/dbconfig/20240403-173958-arnaudb.json [17:42:50] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685463 (10Bethany) This request is approved on my side [17:45:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:48:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P59374 and previous config saved to /var/cache/conftool/dbconfig/20240403-174817-root.json [17:51:58] (03PS7) 10Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) [17:52:34] I noticed citoid had some downtime twice today, which is unusual, and I did a deploy this morning :/ [17:52:41] also the endpoint is not happy [17:54:11] mvolz: hey o/ I did not reach out directly as I thought you were offline, sorry [17:54:18] mvolz: just created https://phabricator.wikimedia.org/T361728 [17:54:21] I just got back [17:54:30] jouncebot: nowandnext [17:54:30] For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1700) [17:54:30] In 0 hour(s) and 5 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800) [17:54:30] In 0 hour(s) and 5 minute(s): 
MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800) [17:55:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P59375 and previous config saved to /var/cache/conftool/dbconfig/20240403-175505-arnaudb.json [17:55:52] mvolz: the error seems flaky, so maybe it's something more related to load - but I have not gotten around to taking a closer look yet [17:57:11] and I gtg unfortunately. If those failures are "real" I'd suggest rolling back for now [17:57:17] jayme: it's a little odd because for citoid itself it was just a package-lock update. It could also be Zotero, but theoretically citoid should function without it, and for that it was just switching to node 18 [17:57:47] but yeah I can roll back and see how it goes. might do just citoid to start and watch it for a while? [17:58:51] i can see in grafana there are actually two actual downtimes, and the rest of the time "flaky". [17:59:16] yeah. If it still fails after the rollback it might as well be that the actual problem is zotero - maybe there is something useful in logs as well, I did not check at all tbh [18:00:04] jnuche and jeena: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800). [18:00:04] jnuche and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800). 
[18:00:04] (03CR) 10Eevans: [C:03+2] restbase: remove decommissioned hosts restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/1016003 (https://phabricator.wikimedia.org/T354561) (owner: 10Eevans) [18:00:26] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:00:38] Ok, I'm going to rollback just citoid rn [18:01:01] mvolz: I gtg. Please feel free to reach out to the US colleagues in #wikimedia-serviceops if you need help [18:02:03] (03PS1) 10Mvolz: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016781 (https://phabricator.wikimedia.org/T361728) [18:02:37] (03CR) 10Mvolz: [C:03+2] Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016781 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [18:03:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P59377 and previous config saved to /var/cache/conftool/dbconfig/20240403-180323-root.json [18:03:31] (03Merged) 10jenkins-bot: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016781 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [18:04:51] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [18:04:56] (03CR) 10Scott French: "Many thanks for the review, Riccardo!" 
[software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [18:05:13] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [18:05:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:06:06] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:06:41] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [18:07:09] !log eevans@cumin1002 START - Cookbook sre.hosts.decommission for hosts restbase[1019-1027].eqiad.wmnet [18:07:36] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [18:08:10] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [18:09:03] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [18:10:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P59378 and previous config saved to /var/cache/conftool/dbconfig/20240403-181013-arnaudb.json [18:13:20] !log dreamyjazz Deployed security patch for T361479 [18:14:21] (03CR) 10Dwisehaupt: [V:03+1] "Thanks! I'll have a look at this and test it out." [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:15:44] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685551 (10RLazarus) @AndyRussG Welcome back! - With the information above, I can set up your LDAP access. For your shell access I'll also need the information on [[ https://phabricator.wikimedia.org/m... 
[18:15:50] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685552 (10RLazarus) p:05Triage→03Medium a:03RLazarus [18:24:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 1.165s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:25:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T360332)', diff saved to https://phabricator.wikimedia.org/P59379 and previous config saved to /var/cache/conftool/dbconfig/20240403-182520-arnaudb.json [18:25:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:25:30] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [18:25:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:25:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T360332)', diff saved to https://phabricator.wikimedia.org/P59380 and previous config saved to /var/cache/conftool/dbconfig/20240403-182543-arnaudb.json [18:28:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T360332)', diff saved to https://phabricator.wikimedia.org/P59381 and previous config saved to /var/cache/conftool/dbconfig/20240403-182806-arnaudb.json [18:29:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 951.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:30:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:31:06] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:34:03] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [18:35:57] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[1019-1027].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [18:36:06] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:37:01] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase[1019-1027].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1002" [18:37:01] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:37:02] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase[1019-1027].eqiad.wmnet [18:40:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:43:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P59382 and previous config saved to /var/cache/conftool/dbconfig/20240403-184313-arnaudb.json [18:43:19] (03PS1) 10Eevans: site.pp: cleanup 
restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/1016829 (https://phabricator.wikimedia.org/T354561) [18:45:33] (03CR) 10Eevans: [C:03+2] site.pp: cleanup restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/1016829 (https://phabricator.wikimedia.org/T354561) (owner: 10Eevans) [18:49:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:49:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:50:04] 10ops-eqiad, 10decommission-hardware: decommission restbase10[19-27] - https://phabricator.wikimedia.org/T361372#9685608 (10Eevans) [18:52:15] The citoid rollback doesn't seem to have fixed things, so I'm going to rollback Zotero. [18:53:05] jouncebot: nowandnext [18:53:05] For the next 0 hour(s) and 6 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800) [18:53:05] For the next 1 hour(s) and 6 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T1800) [18:53:05] In 1 hour(s) and 6 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T2000) [18:53:32] Does anyone care if I do that now? 
[18:57:42] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:57:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:57:57] (03PS1) 10Mvolz: Revert "Update zotero to node18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016783 (https://phabricator.wikimedia.org/T361728) [18:58:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P59383 and previous config saved to /var/cache/conftool/dbconfig/20240403-185821-arnaudb.json [18:58:39] (03CR) 10Mvolz: [C:03+2] Revert "Update zotero to node18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016783 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [18:59:35] (03Merged) 10jenkins-bot: Revert "Update zotero to node18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016783 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [19:00:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:01:54] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [19:02:09] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [19:02:32] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [19:03:21] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [19:03:58] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [19:04:31] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [19:05:51] (SwaggerProbeHasFailures) firing: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:06:09] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:06:13] !log 
@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:10:51] (SwaggerProbeHasFailures) resolved: (2) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:13:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T360332)', diff saved to https://phabricator.wikimedia.org/P59384 and previous config saved to /var/cache/conftool/dbconfig/20240403-191328-arnaudb.json [19:13:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [19:13:32] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [19:13:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [19:13:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59385 and previous config saved to /var/cache/conftool/dbconfig/20240403-191351-arnaudb.json [19:16:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59386 and previous config saved to /var/cache/conftool/dbconfig/20240403-191615-arnaudb.json [19:16:33] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:16:38] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:16:55] 06SRE, 06Infrastructure-Foundations, 10vm-requests: 14eqiad: (1) VM for MySQL Orchestrator - 14https://phabricator.wikimedia.org/T332718#9685691 (10jhathaway) 05Open→03Declined 14part of bookworm upgrade sprint week, but I ran out of time, not currently prioritizing this work. 
[19:18:22] (03CR) 10Dzahn: [C:03+2] stewards: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [19:18:27] (03PS2) 10Dzahn: stewards: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013649 [19:23:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:23:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:25:38] (03CR) 10Dzahn: "yea, it would be more in line with the way other services do this. I am happy to show examples to follow and creating certs is much simple" [puppet] - 10https://gerrit.wikimedia.org/r/1016018 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:27:06] (03CR) 10Dzahn: [V:03+2 C:03+2] stewards: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [19:29:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:29:12] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:31:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P59387 and previous config saved to /var/cache/conftool/dbconfig/20240403-193122-arnaudb.json [19:31:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:31:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:33:39] (03CR) 10Elukey: "Almost ready to go, let's remove config.yaml and rebase to see if everything looks good." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1015297 (https://phabricator.wikimedia.org/T357986) (owner: 10Ilias Sarantopoulos) [19:35:50] 06SRE, 10SRE-Access-Requests: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742 (10AndyRussG) 03NEW [19:38:44] !log stewards2001 - reboot to switch from iptables to nftables [19:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:02] 06SRE, 10SRE-Access-Requests: Requesting access to shell access to analytics client servers for AndyRussG - https://phabricator.wikimedia.org/T361742#9685860 (10AndyRussG) [19:39:03] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685861 (10AndyRussG) [19:45:19] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for AndyRussG - https://phabricator.wikimedia.org/T361665#9685872 (10AndyRussG) >>! In T361665#9685550, @RLazarus wrote: > @AndyRussG Welcome back! Heyyy thanks so much!!!! :) :) > - With the information above, I can set up your LDAP access. For your shell... [19:46:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P59388 and previous config saved to /var/cache/conftool/dbconfig/20240403-194630-arnaudb.json [19:51:55] (03CR) 10Dzahn: [V:03+2] "root@stewards2001:/# nft list table inet base" [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T2000). [20:00:05] phuedx and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:15] * James_F waves. [20:00:24] !log stewards1001 - rebooting to switch from iptables to nftables [20:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:33] Hi I'm here [20:01:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T360332)', diff saved to https://phabricator.wikimedia.org/P59390 and previous config saved to /var/cache/conftool/dbconfig/20240403-200137-arnaudb.json [20:01:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [20:01:47] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [20:01:53] hi hi [20:01:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [20:02:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59391 and previous config saved to /var/cache/conftool/dbconfig/20240403-200201-arnaudb.json [20:02:05] I can deploy if needed. [20:02:16] James_F: i was just going to ask that [20:02:20] (03CR) 10Dzahn: [V:03+2 C:03+2] "machines rebooted, confirmed with "nft list table inet base" the base rules are there and "lsmod | grep tables" shows after reboot there a" [puppet] - 10https://gerrit.wikimedia.org/r/1013649 (owner: 10Dzahn) [20:02:42] But verification would be best done by phuedx. [20:02:44] Eh. [20:02:48] Let's do my one, at least. [20:03:02] I think David can verify the config patch [20:03:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016778 (https://phabricator.wikimedia.org/T361598) (owner: 10Jforrester) [20:03:14] Ack. [20:03:15] Yes [20:04:19] cool - thanks! 
[20:04:21] (Except to be honest I'm not sure anymore what the verification step entails) [20:04:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59392 and previous config saved to /var/cache/conftool/dbconfig/20240403-200425-arnaudb.json [20:06:13] * James_F twiddles thumbs waiting for merge. [20:06:54] Note that there is currently a Merge conflict on our patch [20:07:18] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:08:04] (03Merged) 10jenkins-bot: Centralize API calls in api.js mixin and fix error handling [extensions/WikiLambda] (wmf/1.42.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1016778 (https://phabricator.wikimedia.org/T361598) (owner: 10Jforrester) [20:08:52] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1016778|Centralize API calls in api.js mixin and fix error handling (T361598 T315432)]] [20:09:01] T361598: Adapt front-end to understand new errors after returning HTTP error codes - https://phabricator.wikimedia.org/T361598 [20:09:02] T315432: Consolidate all in-Vue API calls into our mixins/api.js file - https://phabricator.wikimedia.org/T315432 [20:10:31] urbanecm: I think instead of wikidev we can do one better and use the group "stewards-users" [20:10:38] uid=13367(urbanecm) gid=500(wikidev) groups=500(wikidev),751(stewards-users) [20:10:40] (03PS8) 10Jforrester: Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [20:10:40] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:10:44] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [20:11:18] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1016778|Centralize API calls in api.js mixin and fix error handling (T361598 T315432)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:12:19] !log jforrester@deploy1002 jforrester: Continuing with sync [20:12:47] dmartin-WMF: OK, the API change is going out now, so I'll be able to sling out the metrics config change in ~5 minutes' time. [20:14:01] (03Abandoned) 10Jforrester: testwikis wikis to 1.42.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1016069 (https://phabricator.wikimedia.org/T360157) (owner: 10TrainBranchBot) [20:15:10] (03CR) 10Jforrester: Set "s3" as the default section name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909763 (owner: 10Aaron Schulz) [20:15:24] (03PS2) 10Dzahn: stewards: let puppet create /srv/exports [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) [20:16:15] (03PS2) 10Jforrester: component: Add SandboxLink to Portuguese Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: 10Ederporto) [20:16:19] (03CR) 10Jforrester: component: Add SandboxLink to Portuguese Wikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: 10Ederporto) [20:16:30] (03CR) 10Dzahn: "Amended! But I think we can do better than keep using the old wikidev "hack" and use the proper group "stewards-users" that we already hav" [puppet] - 10https://gerrit.wikimedia.org/r/1016439 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [20:18:45] Of course, as soon as I say '5 mins' scap then just stops responding. 
[20:19:06] Right [20:19:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P59393 and previous config saved to /var/cache/conftool/dbconfig/20240403-201933-arnaudb.json [20:19:46] Meh, 5 mins just to update the mw-k8s main pods. [20:21:16] mutante: using that group works as well for me. I suggested wikidev, as that's what we use for the repo with the app itself. [20:22:30] Using stewards-users might cause problems if a root changes something there, as they'd have to use sudo (and the file might be easily owned by other group) [20:23:50] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1016778|Centralize API calls in api.js mixin and fix error handling (T361598 T315432)]] (duration: 14m 58s) [20:23:54] T361598: Adapt front-end to understand new errors after returning HTTP error codes - https://phabricator.wikimedia.org/T361598 [20:23:55] T315432: Consolidate all in-Vue API calls into our mixins/api.js file - https://phabricator.wikimedia.org/T315432 [20:24:47] Finally. 
[20:25:03] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: Santiago Faci) [20:25:49] (Merged) jenkins-bot: Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: Santiago Faci) [20:26:20] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:992223|Update the WikiLambda instrumentation to use core interaction events (T350497)]] [20:26:29] T350497: Update the WikiLambda instrumentation to use core interaction events - https://phabricator.wikimedia.org/T350497 [20:28:52] !log jforrester@deploy1002 sfaci and jforrester: Backport for [[gerrit:992223|Update the WikiLambda instrumentation to use core interaction events (T350497)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:22] dmartin-WMF: OK, it's live on the debug servers – can you test if it works from your end? [20:29:58] Sorry, but please remind me what I should do to verify a change of this sort (only involving ext-EventStreamConfig.php) [20:30:08] You mean to generate an event in our UI? [20:31:02] Yes, it seems to not be erroring at least. [20:31:17] But how to tell if they're going into the metrics platform? [20:31:42] I don't know, sorry [20:32:29] Has the new instruments patch been deployed? I didn't think so [20:33:10] dmartin-WMF: It's on the debug server and holding until we can verify. [20:34:22] OK, it seems good enough for me; in the absence of Sam, I'll continue.
[20:34:23] !log jforrester@deploy1002 sfaci and jforrester: Continuing with sync [20:34:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P59394 and previous config saved to /var/cache/conftool/dbconfig/20240403-203440-arnaudb.json [20:34:43] Good; thanks [20:45:24] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:992223|Update the WikiLambda instrumentation to use core interaction events (T350497)]] (duration: 19m 03s) [20:45:27] T350497: Update the WikiLambda instrumentation to use core interaction events - https://phabricator.wikimedia.org/T350497 [20:45:44] All right, all done. [20:46:02] Excellent. Thanks again James! [20:47:05] (PS3) Jforrester: component: Add SandboxLink to Portuguese Wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: Ederporto) [20:47:13] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: Ederporto) [20:47:56] (Merged) jenkins-bot: component: Add SandboxLink to Portuguese Wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1015649 (https://phabricator.wikimedia.org/T361447) (owner: Ederporto) [20:48:25] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1015649|component: Add SandboxLink to Portuguese Wikiquote (T361447)]] [20:48:28] T361447: Add SandboxLink to ptwikiquote - https://phabricator.wikimedia.org/T361447 [20:49:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T360332)', diff saved to https://phabricator.wikimedia.org/P59395 and previous config saved to /var/cache/conftool/dbconfig/20240403-204949-arnaudb.json [20:49:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2214.codfw.wmnet with reason:
Maintenance [20:49:54] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [20:50:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance [20:50:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T360332)', diff saved to https://phabricator.wikimedia.org/P59396 and previous config saved to /var/cache/conftool/dbconfig/20240403-205014-arnaudb.json [20:50:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:50:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:50:53] !log jforrester@deploy1002 ederporto and jforrester: Backport for [[gerrit:1015649|component: Add SandboxLink to Portuguese Wikiquote (T361447)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:51:57] !log jforrester@deploy1002 ederporto and jforrester: Continuing with sync [20:51:57] SRE, Infrastructure-Foundations, vm-requests: Site: eqiad, codfw 1 VM request for postfix mta-out - https://phabricator.wikimedia.org/T361750 (jhathaway) NEW [20:52:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T360332)', diff saved to https://phabricator.wikimedia.org/P59397 and previous config saved to /var/cache/conftool/dbconfig/20240403-205240-arnaudb.json [20:58:52] SRE, Infrastructure-Foundations, vm-requests: Site: eqiad, codfw 2 VM request for postfix mta-out - https://phabricator.wikimedia.org/T361750#9686110 (jhathaway) a:jhathaway [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240403T2100) [21:01:31] SRE, Infrastructure-Foundations, vm-requests: Site: eqiad, codfw 2 VM request for postfix mta-out - https://phabricator.wikimedia.org/T361750#9686125 (jhathaway) p:Triage→Medium [21:02:44] !log
jforrester@deploy1002 Finished scap: Backport for [[gerrit:1015649|component: Add SandboxLink to Portuguese Wikiquote (T361447)]] (duration: 14m 18s) [21:02:47] T361447: Add SandboxLink to ptwikiquote - https://phabricator.wikimedia.org/T361447 [21:04:02] SRE, Infrastructure-Foundations, vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9686143 (jhathaway) [21:05:43] (PS3) Dzahn: stewards: puppetize steward-onboarder config file and paths [puppet] - https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) [21:07:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P59398 and previous config saved to /var/cache/conftool/dbconfig/20240403-210747-arnaudb.json [21:15:28] (CR) Dzahn: stewards: puppetize steward-onboarder config file and paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1016441 (https://phabricator.wikimedia.org/T351202) (owner: Dzahn) [21:22:32] (CR) Dzahn: [C:+2] aphlict: switch envoy cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [21:22:38] (PS2) Dzahn: aphlict: switch envoy cert provider to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413) [21:22:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P59399 and previous config saved to /var/cache/conftool/dbconfig/20240403-212255-arnaudb.json [21:26:20] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1013416 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [21:37:26] (PS2) Cwhite: spicerack: update logging-eqiad host to logging-hd1001 [puppet] - https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517)
[21:38:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T360332)', diff saved to https://phabricator.wikimedia.org/P59400 and previous config saved to /var/cache/conftool/dbconfig/20240403-213802-arnaudb.json [21:38:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [21:38:06] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [21:38:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [21:38:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T360332)', diff saved to https://phabricator.wikimedia.org/P59401 and previous config saved to /var/cache/conftool/dbconfig/20240403-213825-arnaudb.json [21:38:42] (CR) Cwhite: [C:+2] spicerack: update logging-eqiad host to logging-hd1001 [puppet] - https://gerrit.wikimedia.org/r/1016369 (https://phabricator.wikimedia.org/T352517) (owner: Cwhite) [21:40:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T360332)', diff saved to https://phabricator.wikimedia.org/P59402 and previous config saved to /var/cache/conftool/dbconfig/20240403-214048-arnaudb.json [21:48:15] (PS2) Dzahn: delete aphlict.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013417 (https://phabricator.wikimedia.org/T360413) [21:48:30] (CR) Dzahn: [V:+2 C:+2] delete aphlict.discovery dummy key, migrated to cfssl [labs/private] - https://gerrit.wikimedia.org/r/1013417 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [21:50:26] (PS2) Dzahn: ssl: delete aphlict.discovery ssl cert, migrated to cfssl [puppet] - https://gerrit.wikimedia.org/r/1013415 (https://phabricator.wikimedia.org/T360413) [21:51:39] (CR) Dzahn: [C:+2] ssl: delete aphlict.discovery ssl cert, migrated to cfssl
[puppet] - https://gerrit.wikimedia.org/r/1013415 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [21:53:16] (PS1) Bking: WIP: remove elasticsearch-curator dependency [software/spicerack] - https://gerrit.wikimedia.org/r/1016855 (https://phabricator.wikimedia.org/T361647) [21:55:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P59403 and previous config saved to /var/cache/conftool/dbconfig/20240403-215555-arnaudb.json [21:57:41] SRE, collaboration-services, Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9686377 (Dzahn) [21:57:58] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9686379 (Dzahn) [21:58:05] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:58:26] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:58:39] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:59:13] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:59:26] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:59:35] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9686403 (bking) [21:59:50] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:00:05] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:00:06] !log pfischer@deploy1002 helmfile [eqiad]
DONE helmfile.d/services/cirrus-streaming-updater: apply [22:00:28] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:00:41] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:05:31] SRE, collaboration-services, Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9686416 (Dzahn) @eoghan I have continued with aphlict because I already had the patches uploaded anyways. But Phabricator is left if you still wanted to re-s... [22:05:59] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:06:11] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:06:13] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:06:34] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:06:34] (PS1) Ebernhardson: cirrus: Increase taskmanager parallelism and reduce batch size [deployment-charts] - https://gerrit.wikimedia.org/r/1016858 [22:06:35] (PS1) Ebernhardson: cirrus: Report container log output on backfilling failure [deployment-charts] - https://gerrit.wikimedia.org/r/1016859 [22:06:36] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:06:44] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:09:29] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:09:38] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:09:46] !log pfischer@deploy1002 helmfile
[eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:11:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P59404 and previous config saved to /var/cache/conftool/dbconfig/20240403-221103-arnaudb.json [22:19:56] (PS2) Ebernhardson: cirrus: Tune resource usage of consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1016858 [22:20:42] (PS1) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1016861 (https://phabricator.wikimedia.org/T356933) [22:22:39] (PS3) Ebernhardson: cirrus: Tune resource usage of consumer [deployment-charts] - https://gerrit.wikimedia.org/r/1016858 [22:26:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T360332)', diff saved to https://phabricator.wikimedia.org/P59405 and previous config saved to /var/cache/conftool/dbconfig/20240403-222610-arnaudb.json [22:26:15] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [22:26:31] (PS2) Peter Fischer: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1016861 (https://phabricator.wikimedia.org/T356933) [22:26:46] (CR) Peter Fischer: [C:+2] Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1016861 (https://phabricator.wikimedia.org/T356933) (owner: Peter Fischer) [22:27:38] (Merged) jenkins-bot: Search update pipeline: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1016861 (https://phabricator.wikimedia.org/T356933) (owner: Peter Fischer) [22:34:13] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:34:46] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:34:50] !log pfischer@deploy1002 helmfile [eqiad] START
helmfile.d/services/cirrus-streaming-updater: apply [22:35:09] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:45:32] SRE, MediaWiki-General, MediaWiki-libs-Stats, observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685#9686467 (colewhite) [23:01:13] (PS1) Scott French: Improve etcdmirror shutdown behavior [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762) [23:06:02] (CR) Tim Starling: [C:+2] WMCS: Read from the new block/block_target tables [puppet] - https://gerrit.wikimedia.org/r/1016066 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling) [23:19:49] (PS2) Scott French: Improve etcdmirror shutdown behavior [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762) [23:21:53] (CR) Scott French: "I ran into this while testing out the migration for T358636.
It's a fairly simple fix and would make the procedure a bit less stressful :)" [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1016862 (https://phabricator.wikimedia.org/T361762) (owner: Scott French) [23:22:57] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:27:57] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:33:21] !log on clouddb1021 ran maintain-views for enwiki [23:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1016376 [23:38:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1016376 (owner: TrainBranchBot) [23:44:59] (PS4) Krinkle: codesearch: Enable network=host and set CODESEARCH_HOUND_BASE [puppet] - https://gerrit.wikimedia.org/r/1016480