[00:36:33] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [00:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1010250 [00:38:40] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1010250 (owner: 10TrainBranchBot) [01:01:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1010250 (owner: 10TrainBranchBot) [01:04:37] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9622245 (10phaultfinder) [01:19:09] RECOVERY - Host elastic2088 is UP: PING OK - Packet loss = 0%, RTA = 30.55 ms [01:25:33] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:39:07] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:47:11] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T0200) [02:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.22 [core] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1010251 (https://phabricator.wikimedia.org/T354440) [02:08:08] (03CR) 10TrainBranchBot: [C: 03+2] Branch 
commit for wmf/1.42.0-wmf.22 [core] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1010251 (https://phabricator.wikimedia.org/T354440) (owner: 10TrainBranchBot) [02:22:54] (03PS1) 10Tim Starling: SwiftTooManyMediaUploads: reduce severity [alerts] - 10https://gerrit.wikimedia.org/r/1010347 [02:23:21] (03CR) 10Tim Starling: "Done in followup." [alerts] - 10https://gerrit.wikimedia.org/r/1008590 (owner: 10Tim Starling) [02:24:28] (03CR) 10CI reject: [V: 04-1] SwiftTooManyMediaUploads: reduce severity [alerts] - 10https://gerrit.wikimedia.org/r/1010347 (owner: 10Tim Starling) [02:27:51] RECOVERY - Host ripe-atlas-ulsfo is UP: PING WARNING - Packet loss = 77%, RTA = 1.77 ms [02:27:51] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.22 [core] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1010251 (https://phabricator.wikimedia.org/T354440) (owner: 10TrainBranchBot) [02:34:15] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [02:37:14] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T0300) [03:03:27] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.19 (duration: 03m 24s) [03:04:46] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010348 (https://phabricator.wikimedia.org/T354440) [03:04:47] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010348 (https://phabricator.wikimedia.org/T354440) (owner: 10TrainBranchBot) [03:05:34] 
(03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010348 (https://phabricator.wikimedia.org/T354440) (owner: 10TrainBranchBot) [03:06:01] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.22 refs T354440 [03:06:16] T354440: 1.42.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T354440 [03:17:14] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:29:35] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:37:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:37:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:39:35] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. 
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:41:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:41:51] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:46:01] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [03:47:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:47:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:50:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:58:35] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:58:41] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:02:48] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.22 refs T354440 (duration: 56m 47s) [04:02:52] T354440: 1.42.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T354440 [04:06:10] (03PS1) 10Tim Starling: migrateBlocks.php: Skip existing IDs [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1010218 (https://phabricator.wikimedia.org/T355034) [04:06:20] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:06:26] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:10:31] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:10:38] !log @deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [04:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:16:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [04:53:05] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2024-03-11-120258-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010169 (https://phabricator.wikimedia.org/T350773) (owner: 10KartikMistry) [04:54:11] (03Merged) 10jenkins-bot: Update cxserver to 2024-03-11-120258-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010169 (https://phabricator.wikimedia.org/T350773) (owner: 10KartikMistry) [05:01:23] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:01:52] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:06:01] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:07:16] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:09:46] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:10:22] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:11:34] !log Updated cxserver to 2024-03-11-120258-production (T350773) [05:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:39] T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773 [05:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads 
[05:23:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:24:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:28:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:28:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:39:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:44:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:44:21] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:47:11] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:50:25] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:55:55] (SystemdUnitFailed) firing: 
rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:56:11] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 53 probes of 803 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T0600) [06:00:05] kormat, marostegui, Amir1, and arnaudb: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T0600). [06:01:15] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 7 probes of 804 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:35:30] !log hashar@deploy2002 Started deploy [integration/docroot@a13474b]: Link to Cite docs - T358641 [06:35:36] T358641: Cite js docs should be published to doc.wikimedia.org - https://phabricator.wikimedia.org/T358641 [06:35:37] !log hashar@deploy2002 Finished deploy [integration/docroot@a13474b]: Link to Cite docs - T358641 (duration: 00m 06s) [07:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T0700) [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:05:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:05:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:05:51] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1020.eqiad.wmnet with reason: Decommissioning — T354561 [07:06:05] T354561: Decommission restbase10[19-27] - https://phabricator.wikimedia.org/T354561 [07:06:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1020.eqiad.wmnet with reason: Decommissioning — T354561 [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:17:29] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:17:31] (03PS1) 10Marostegui: data.yaml: Add tjones to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1010453 (https://phabricator.wikimedia.org/T359092) [07:18:04] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.03.04 - 2024.03.24), 13Patch-For-Review: Requesting access to kubernetes deployment for tjones - 
https://phabricator.wikimedia.org/T359092#9622567 (10Marostegui) [07:19:11] (03CR) 10CI reject: [V: 04-1] data.yaml: Add tjones to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1010453 (https://phabricator.wikimedia.org/T359092) (owner: 10Marostegui) [07:20:44] (03PS2) 10Marostegui: data.yaml: Add tjones to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1010453 (https://phabricator.wikimedia.org/T359092) [07:42:13] (03PS2) 10Ilias Sarantopoulos: httpbb: add ores-legacy tests [puppet] - 10https://gerrit.wikimedia.org/r/1010245 (https://phabricator.wikimedia.org/T359871) [07:44:10] (03PS3) 10Ilias Sarantopoulos: httpbb: add ores-legacy tests [puppet] - 10https://gerrit.wikimedia.org/r/1010245 (https://phabricator.wikimedia.org/T359871) [07:58:34] (03PS1) 10Ilias Sarantopoulos: ml-services: update readability to identify cgroupsv2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010483 (https://phabricator.wikimedia.org/T353461) [08:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T0800) [08:17:35] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications for db2205/6/8 [puppet] - 10https://gerrit.wikimedia.org/r/1009956 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [08:20:49] 06SRE: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#9622592 (10Peachey88) [08:21:49] 06SRE, 06collaboration-services: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#9622593 (10Jelto) [08:21:51] 06SRE, 06collaboration-services: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#9622594 (10Peachey88) [08:22:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 1%: Post clone', diff saved to https://phabricator.wikimedia.org/P58717 and previous config saved to /var/cache/conftool/dbconfig/20240312-082202-arnaudb.json [08:22:04] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 1%: Post clone', diff saved to https://phabricator.wikimedia.org/P58718 and previous config saved to /var/cache/conftool/dbconfig/20240312-082203-arnaudb.json [08:22:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 1%: Post clone', diff saved to https://phabricator.wikimedia.org/P58719 and previous config saved to /var/cache/conftool/dbconfig/20240312-082203-arnaudb.json [08:28:29] 06SRE, 10SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9622611 (10Marostegui) [08:37:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 2%: Post clone', diff saved to https://phabricator.wikimedia.org/P58720 and previous config saved to /var/cache/conftool/dbconfig/20240312-083709-arnaudb.json [08:37:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 2%: Post clone', diff saved to https://phabricator.wikimedia.org/P58721 and previous config saved to /var/cache/conftool/dbconfig/20240312-083710-arnaudb.json [08:37:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 2%: Post clone', diff saved to https://phabricator.wikimedia.org/P58722 and previous config saved to /var/cache/conftool/dbconfig/20240312-083710-arnaudb.json [08:38:07] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications for db2209/10/11 [puppet] - 10https://gerrit.wikimedia.org/r/1010246 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [08:40:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 1%: Post clone', diff saved to https://phabricator.wikimedia.org/P58723 and previous config saved to /var/cache/conftool/dbconfig/20240312-084015-arnaudb.json [08:40:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 1%: Post clone', diff saved to https://phabricator.wikimedia.org/P58724 and previous config saved to 
/var/cache/conftool/dbconfig/20240312-084016-arnaudb.json [08:40:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 1%: Post clone', diff saved to https://phabricator.wikimedia.org/P58725 and previous config saved to /var/cache/conftool/dbconfig/20240312-084016-arnaudb.json [08:41:30] (03CR) 10Klausman: [C: 03+2] httpbb: add ores-legacy tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1010245 (https://phabricator.wikimedia.org/T359871) (owner: 10Ilias Sarantopoulos) [08:41:42] (03CR) 10Klausman: [C: 03+1] httpbb: add ores-legacy tests [puppet] - 10https://gerrit.wikimedia.org/r/1010245 (https://phabricator.wikimedia.org/T359871) (owner: 10Ilias Sarantopoulos) [08:50:12] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:50:19] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:52:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 4%: Post clone', diff saved to https://phabricator.wikimedia.org/P58726 and previous config saved to /var/cache/conftool/dbconfig/20240312-085214-arnaudb.json [08:52:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 4%: Post clone', diff saved to https://phabricator.wikimedia.org/P58727 and previous config saved to /var/cache/conftool/dbconfig/20240312-085214-arnaudb.json [08:52:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 4%: Post clone', diff saved to https://phabricator.wikimedia.org/P58728 and previous config saved to /var/cache/conftool/dbconfig/20240312-085215-arnaudb.json [08:55:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 2%: Post clone', diff saved to https://phabricator.wikimedia.org/P58729 and previous config saved to /var/cache/conftool/dbconfig/20240312-085520-arnaudb.json [08:55:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 2%: Post clone', diff saved to 
https://phabricator.wikimedia.org/P58730 and previous config saved to /var/cache/conftool/dbconfig/20240312-085521-arnaudb.json [08:55:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 2%: Post clone', diff saved to https://phabricator.wikimedia.org/P58731 and previous config saved to /var/cache/conftool/dbconfig/20240312-085522-arnaudb.json [08:59:15] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:59:22] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:07:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 8%: Post clone', diff saved to https://phabricator.wikimedia.org/P58732 and previous config saved to /var/cache/conftool/dbconfig/20240312-090719-arnaudb.json [09:07:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 8%: Post clone', diff saved to https://phabricator.wikimedia.org/P58733 and previous config saved to /var/cache/conftool/dbconfig/20240312-090720-arnaudb.json [09:08:56] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:09:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:10:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 4%: Post clone', diff saved to https://phabricator.wikimedia.org/P58734 and previous config saved to /var/cache/conftool/dbconfig/20240312-091025-arnaudb.json [09:10:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 4%: Post clone', diff saved to https://phabricator.wikimedia.org/P58735 and previous config saved to /var/cache/conftool/dbconfig/20240312-091026-arnaudb.json [09:10:36] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on gitlab1004.wikimedia.org with reason: Silencing alerts for switchover prep [09:10:39] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
gitlab1004.wikimedia.org with reason: Silencing alerts for switchover prep [09:10:59] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gitlab1004.wikimedia.org with reason: Silencing alerts for switchover prep [09:11:02] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab1004.wikimedia.org with reason: Silencing alerts for switchover prep [09:15:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:16:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:17:06] hashar: bonjour, are you around for the train? [09:20:24] 06SRE, 10ops-eqiad, 06Data-Engineering: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T359702#9622742 (10Jclark-ctr) Ticket canceled by dell SR186677718 with no reason 14 hours after creating request. Resubmitted ticket request SR186760900. 
[09:22:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 16%: Post clone', diff saved to https://phabricator.wikimedia.org/P58737 and previous config saved to /var/cache/conftool/dbconfig/20240312-092224-arnaudb.json [09:22:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 16%: Post clone', diff saved to https://phabricator.wikimedia.org/P58738 and previous config saved to /var/cache/conftool/dbconfig/20240312-092225-arnaudb.json [09:23:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:23:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.248 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:25:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 8%: Post clone', diff saved to https://phabricator.wikimedia.org/P58739 and previous config saved to /var/cache/conftool/dbconfig/20240312-092530-arnaudb.json [09:25:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 8%: Post clone', diff saved to https://phabricator.wikimedia.org/P58740 and previous config saved to /var/cache/conftool/dbconfig/20240312-092531-arnaudb.json [09:25:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 8%: Post clone', diff saved to https://phabricator.wikimedia.org/P58741 and previous config saved to /var/cache/conftool/dbconfig/20240312-092531-arnaudb.json [09:27:40] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:27:47] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:32:38] it looks like hashar became a victim of the daylight saving time change [09:32:48] I'm going to deploy the train to group0 before the deployment window is over [09:33:41] (03PS1) 10TrainBranchBot: group0 wikis to 
1.42.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010488 (https://phabricator.wikimedia.org/T354440) [09:33:43] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010488 (https://phabricator.wikimedia.org/T354440) (owner: 10TrainBranchBot) [09:34:25] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010488 (https://phabricator.wikimedia.org/T354440) (owner: 10TrainBranchBot) [09:37:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 32%: Post clone', diff saved to https://phabricator.wikimedia.org/P58742 and previous config saved to /var/cache/conftool/dbconfig/20240312-093730-arnaudb.json [09:37:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 32%: Post clone', diff saved to https://phabricator.wikimedia.org/P58743 and previous config saved to /var/cache/conftool/dbconfig/20240312-093729-arnaudb.json [09:40:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 16%: Post clone', diff saved to https://phabricator.wikimedia.org/P58744 and previous config saved to /var/cache/conftool/dbconfig/20240312-094035-arnaudb.json [09:40:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 16%: Post clone', diff saved to https://phabricator.wikimedia.org/P58745 and previous config saved to /var/cache/conftool/dbconfig/20240312-094036-arnaudb.json [09:40:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 16%: Post clone', diff saved to https://phabricator.wikimedia.org/P58746 and previous config saved to /var/cache/conftool/dbconfig/20240312-094036-arnaudb.json [09:41:02] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1041 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:42:18] 
!log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:42:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:47:16] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.22 refs T354440 [09:47:20] T354440: 1.42.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T354440 [09:51:21] (03PS2) 10Ilias Sarantopoulos: ml-services: update readability to identify cgroupsv2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010483 (https://phabricator.wikimedia.org/T353461) [09:51:56] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 50%: Post clone', diff saved to https://phabricator.wikimedia.org/P58747 and previous config saved to /var/cache/conftool/dbconfig/20240312-095235-arnaudb.json [09:52:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 50%: Post clone', diff saved to https://phabricator.wikimedia.org/P58748 and previous config saved to /var/cache/conftool/dbconfig/20240312-095235-arnaudb.json [09:52:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 50%: Post clone', diff saved to https://phabricator.wikimedia.org/P58749 and previous config saved to /var/cache/conftool/dbconfig/20240312-095236-arnaudb.json [09:55:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 32%: Post clone', diff saved to https://phabricator.wikimedia.org/P58750 and previous config saved to /var/cache/conftool/dbconfig/20240312-095540-arnaudb.json [09:55:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 32%: Post clone', diff saved to 
https://phabricator.wikimedia.org/P58751 and previous config saved to /var/cache/conftool/dbconfig/20240312-095541-arnaudb.json [09:55:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 32%: Post clone', diff saved to https://phabricator.wikimedia.org/P58752 and previous config saved to /var/cache/conftool/dbconfig/20240312-095541-arnaudb.json [09:56:10] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:11] (03PS1) 10Arnaudb: mariadb: candidate master add for x1 [puppet] - 10https://gerrit.wikimedia.org/r/1010252 (https://phabricator.wikimedia.org/T358642) [09:59:42] (03CR) 10Marostegui: [C: 03+1] mariadb: candidate master add for x1 [puppet] - 10https://gerrit.wikimedia.org/r/1010252 (https://phabricator.wikimedia.org/T358642) (owner: 10Arnaudb) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T1000) [10:00:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:00:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:00:32] (03CR) 10Arnaudb: [C: 03+2] mariadb: candidate master add for x1 [puppet] - 10https://gerrit.wikimedia.org/r/1010252 (https://phabricator.wikimedia.org/T358642) (owner: 10Arnaudb) [10:02:41] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2196 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1010253 (https://phabricator.wikimedia.org/T359919) [10:02:46] (03PS1) 10Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1010254 (https://phabricator.wikimedia.org/T359919) [10:04:02] (03PS1) 10Cwhite: beta-logs: remove java11 enforcement [puppet] - 
10https://gerrit.wikimedia.org/r/1010255 (https://phabricator.wikimedia.org/T352517) [10:07:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: Post clone', diff saved to https://phabricator.wikimedia.org/P58753 and previous config saved to /var/cache/conftool/dbconfig/20240312-100740-arnaudb.json [10:07:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 75%: Post clone', diff saved to https://phabricator.wikimedia.org/P58754 and previous config saved to /var/cache/conftool/dbconfig/20240312-100740-arnaudb.json [10:07:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 75%: Post clone', diff saved to https://phabricator.wikimedia.org/P58755 and previous config saved to /var/cache/conftool/dbconfig/20240312-100741-arnaudb.json [10:09:07] (03CR) 10Cwhite: [C: 03+2] beta-logs: remove java11 enforcement [puppet] - 10https://gerrit.wikimedia.org/r/1010255 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [10:10:22] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:10:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:10:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 50%: Post clone', diff saved to https://phabricator.wikimedia.org/P58756 and previous config saved to /var/cache/conftool/dbconfig/20240312-101045-arnaudb.json [10:10:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 50%: Post clone', diff saved to https://phabricator.wikimedia.org/P58757 and previous config saved to /var/cache/conftool/dbconfig/20240312-101046-arnaudb.json [10:11:02] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1041 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:14:45] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:14:52] !log 
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:22:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: Post clone', diff saved to https://phabricator.wikimedia.org/P58758 and previous config saved to /var/cache/conftool/dbconfig/20240312-102245-arnaudb.json [10:22:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 100%: Post clone', diff saved to https://phabricator.wikimedia.org/P58759 and previous config saved to /var/cache/conftool/dbconfig/20240312-102245-arnaudb.json [10:25:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 75%: Post clone', diff saved to https://phabricator.wikimedia.org/P58760 and previous config saved to /var/cache/conftool/dbconfig/20240312-102550-arnaudb.json [10:25:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 75%: Post clone', diff saved to https://phabricator.wikimedia.org/P58761 and previous config saved to /var/cache/conftool/dbconfig/20240312-102551-arnaudb.json [10:25:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 75%: Post clone', diff saved to https://phabricator.wikimedia.org/P58762 and previous config saved to /var/cache/conftool/dbconfig/20240312-102551-arnaudb.json [10:36:40] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 80%, RTA = 87.56 ms [10:40:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2209 (re)pooling @ 100%: Post clone', diff saved to https://phabricator.wikimedia.org/P58763 and previous config saved to /var/cache/conftool/dbconfig/20240312-104055-arnaudb.json [10:40:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 100%: Post clone', diff saved to https://phabricator.wikimedia.org/P58764 and previous config saved to /var/cache/conftool/dbconfig/20240312-104056-arnaudb.json [10:40:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2211 (re)pooling @ 100%: Post clone', diff saved to 
https://phabricator.wikimedia.org/P58765 and previous config saved to /var/cache/conftool/dbconfig/20240312-104056-arnaudb.json [10:42:19] (03PS4) 10Ilias Sarantopoulos: httpbb: add ores-legacy tests [puppet] - 10https://gerrit.wikimedia.org/r/1010245 (https://phabricator.wikimedia.org/T359871) [10:42:33] (03CR) 10Ilias Sarantopoulos: httpbb: add ores-legacy tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1010245 (https://phabricator.wikimedia.org/T359871) (owner: 10Ilias Sarantopoulos) [10:43:04] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [11:16:05] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:16:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:17:29] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:32:28] (03PS1) 10Majavah: P:toolforge::proxy: drop grid engine dynamicproxy support [puppet] - 10https://gerrit.wikimedia.org/r/1010503 (https://phabricator.wikimedia.org/T314664) [11:32:29] (03PS1) 10Majavah: dynamicproxy: cleanup after removing toolforge support [puppet] - 10https://gerrit.wikimedia.org/r/1010504 (https://phabricator.wikimedia.org/T314664) [11:32:31] (03PS1) 10Majavah: dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) [11:32:36] (03PS1) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [11:33:42] (03CR) 10CI reject: [V: 04-1] P:toolforge::proxy: drop grid engine dynamicproxy support [puppet] - 10https://gerrit.wikimedia.org/r/1010503 
(https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [11:33:56] (03CR) 10CI reject: [V: 04-1] dynamicproxy: cleanup after removing toolforge support [puppet] - 10https://gerrit.wikimedia.org/r/1010504 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [11:34:19] (03CR) 10CI reject: [V: 04-1] dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [11:34:36] (03CR) 10CI reject: [V: 04-1] dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [11:37:15] (03PS2) 10Majavah: P:toolforge::proxy: drop grid engine dynamicproxy support [puppet] - 10https://gerrit.wikimedia.org/r/1010503 (https://phabricator.wikimedia.org/T314664) [11:37:16] (03PS2) 10Majavah: dynamicproxy: cleanup after removing toolforge support [puppet] - 10https://gerrit.wikimedia.org/r/1010504 (https://phabricator.wikimedia.org/T314664) [11:37:18] (03PS2) 10Majavah: dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) [11:37:24] (03PS2) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [11:38:55] (03CR) 10CI reject: [V: 04-1] dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [11:41:32] (03CR) 10CI reject: [V: 04-1] P:toolforge::proxy: drop grid engine dynamicproxy support [puppet] - 10https://gerrit.wikimedia.org/r/1010503 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [11:41:53] (03CR) 10CI reject: [V: 04-1] dynamicproxy: cleanup after removing toolforge support [puppet] - 10https://gerrit.wikimedia.org/r/1010504 
(https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [11:42:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:38] (03CR) 10CI reject: [V: 04-1] dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [11:44:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:46:33] (03PS1) 10Arturo Borrero Gonzalez: spicerack: require pyyaml > 6.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1010507 [11:46:40] (03PS1) 10KartikMistry: Update cxserver to 2024-03-12-113634-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010508 (https://phabricator.wikimedia.org/T350773) [11:50:00] (03PS3) 10Majavah: P:toolforge::proxy: drop grid engine dynamicproxy support [puppet] - 10https://gerrit.wikimedia.org/r/1010503 (https://phabricator.wikimedia.org/T314664) [11:50:01] (03PS3) 10Majavah: dynamicproxy: cleanup after removing toolforge support [puppet] - 10https://gerrit.wikimedia.org/r/1010504 (https://phabricator.wikimedia.org/T314664) [11:50:03] (03PS3) 10Majavah: dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) [11:50:10] (03PS3) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [11:50:18] (03CR) 10CI reject: [V: 04-1] spicerack: require pyyaml > 6.0.1 [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1010507 (owner: 10Arturo Borrero Gonzalez) [11:54:48] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:54:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:55:28] (03CR) 10CI reject: [V: 04-1] dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [11:55:34] (03PS4) 10Majavah: P:toolforge::proxy: drop grid engine dynamicproxy support [puppet] - 10https://gerrit.wikimedia.org/r/1010503 (https://phabricator.wikimedia.org/T314664) [11:55:38] (03PS4) 10Majavah: dynamicproxy: cleanup after removing toolforge support [puppet] - 10https://gerrit.wikimedia.org/r/1010504 (https://phabricator.wikimedia.org/T314664) [11:55:46] (03PS4) 10Majavah: dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) [11:55:54] (03PS4) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [11:55:55] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:09] (03PS5) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [11:57:12] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1634/co" [puppet] - 10https://gerrit.wikimedia.org/r/1010504 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) 
[11:58:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:59:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:59:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T1200) [12:01:30] (03CR) 10CI reject: [V: 04-1] dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [12:02:50] (03PS6) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [12:02:54] (03CR) 10CI reject: [V: 04-1] dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) (owner: 10Majavah) [12:02:58] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:03:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:03:38] (03PS2) 10Arturo Borrero Gonzalez: spicerack: avoid pyyaml5.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1010507 [12:03:56] (03PS3) 10Arturo Borrero Gonzalez: spicerack: avoid pyyaml 5.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1010507 [12:07:31] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:07:37] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:08:39] (03CR) 10CI reject: [V: 04-1] spicerack: avoid 
pyyaml 5.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1010507 (owner: 10Arturo Borrero Gonzalez) [12:11:38] (03PS5) 10Majavah: dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) [12:11:39] (03PS7) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [12:11:41] (03PS1) 10Majavah: dynamicproxy: add spec test for API [puppet] - 10https://gerrit.wikimedia.org/r/1010509 [12:13:48] (03CR) 10CI reject: [V: 04-1] dynamicproxy: add spec test for API [puppet] - 10https://gerrit.wikimedia.org/r/1010509 (owner: 10Majavah) [12:14:56] (03PS2) 10Majavah: dynamicproxy: add spec test for API [puppet] - 10https://gerrit.wikimedia.org/r/1010509 [12:16:32] (03CR) 10CI reject: [V: 04-1] dynamicproxy: add spec test for API [puppet] - 10https://gerrit.wikimedia.org/r/1010509 (owner: 10Majavah) [12:16:59] (03PS3) 10Majavah: dynamicproxy: add spec test for API [puppet] - 10https://gerrit.wikimedia.org/r/1010509 [12:19:00] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:19:07] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:19:55] (03PS3) 10Majavah: O:toolforge: add role for grid-less bastions [puppet] - 10https://gerrit.wikimedia.org/r/990703 (https://phabricator.wikimedia.org/T314665) [12:23:33] (03PS4) 10Majavah: O:toolforge: add role for grid-less bastions [puppet] - 10https://gerrit.wikimedia.org/r/990703 (https://phabricator.wikimedia.org/T314665) [12:25:18] (03PS6) 10Physikerwelt: Enable natvive math rendering options by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1002639 (https://phabricator.wikimedia.org/T358803) [12:26:06] (03PS5) 10Majavah: P:toolforge::proxy: drop grid engine dynamicproxy support [puppet] - 10https://gerrit.wikimedia.org/r/1010503 
(https://phabricator.wikimedia.org/T314664) [12:26:06] (03PS5) 10Majavah: dynamicproxy: cleanup after removing toolforge support [puppet] - 10https://gerrit.wikimedia.org/r/1010504 (https://phabricator.wikimedia.org/T314664) [12:26:08] (03PS6) 10Majavah: dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) [12:26:11] !log eoghan@cumin1002 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [12:26:14] (03PS8) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [12:26:22] (03PS4) 10Majavah: dynamicproxy: add spec test for API [puppet] - 10https://gerrit.wikimedia.org/r/1010509 [12:30:02] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:30:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:33:03] (03PS6) 10Majavah: P:toolforge::proxy: drop grid engine dynamicproxy support [puppet] - 10https://gerrit.wikimedia.org/r/1010503 (https://phabricator.wikimedia.org/T314664) [12:33:04] (03PS6) 10Majavah: dynamicproxy: cleanup after removing toolforge support [puppet] - 10https://gerrit.wikimedia.org/r/1010504 (https://phabricator.wikimedia.org/T314664) [12:33:06] (03PS7) 10Majavah: dynamicproxy: add support for per-project zones [puppet] - 10https://gerrit.wikimedia.org/r/1010505 (https://phabricator.wikimedia.org/T342398) [12:33:11] (03PS9) 10Majavah: dynamicproxy: allow specifying different certs for each zone [puppet] - 10https://gerrit.wikimedia.org/r/1010506 (https://phabricator.wikimedia.org/T342398) [12:33:19] (03PS5) 10Majavah: dynamicproxy: add spec test for API [puppet] - 10https://gerrit.wikimedia.org/r/1010509 [12:34:46] (03PS7) 10MdsShakil: Add `suppressredirect` right to pagemover and filemover user groups 
in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009729 (https://phabricator.wikimedia.org/T359614) [12:36:09] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:36:16] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:37:14] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:38:21] (03CR) 10Dzahn: [C: 03+1] "lgtm, has approval from Tyler, user has existing shell access" [puppet] - 10https://gerrit.wikimedia.org/r/1010453 (https://phabricator.wikimedia.org/T359092) (owner: 10Marostegui) [12:42:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T1300). [13:00:04] MdsShakil and physikerwelt: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:22] Hello :)
[13:00:25] Hi
[13:03:12] (PS1) Dzahn: admin: add rkhan to group 'restricted' (mwmaint access) [puppet] - https://gerrit.wikimedia.org/r/1010514 (https://phabricator.wikimedia.org/T359490)
[13:04:42] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9623330 (Dzahn) Is this with or without shell access and with or without Kerberos?
[13:08:22] dancy: thanks!
[13:08:50] (CR) Marostegui: [C: +2] data.yaml: Add tjones to deployment group [puppet] - https://gerrit.wikimedia.org/r/1010453 (https://phabricator.wikimedia.org/T359092) (owner: Marostegui)
[13:09:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[13:09:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:09:56] SRE, SRE-Access-Requests, Data-Platform-SRE (2024.03.04 - 2024.03.24), Patch-For-Review: Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092#9623344 (Marostegui) Open→Resolved a:Marostegui This has been deployed. Please give it 30-45 minut...
[13:11:21] (CR) Marostegui: [C: +1] "This looks good, please note that the manager approval hasn't happened yet."
[puppet] - 10https://gerrit.wikimedia.org/r/1010514 (https://phabricator.wikimedia.org/T359490) (owner: 10Dzahn) [13:14:10] (03CR) 10David Caro: "I think this is the same root issue as https://phabricator.wikimedia.org/T345337" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1010507 (owner: 10Arturo Borrero Gonzalez) [13:15:30] (03CR) 10Elukey: [C: 03+1] ml-services: update readability to identify cgroupsv2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010483 (https://phabricator.wikimedia.org/T353461) (owner: 10Ilias Sarantopoulos) [13:16:57] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update readability to identify cgroupsv2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010483 (https://phabricator.wikimedia.org/T353461) (owner: 10Ilias Sarantopoulos) [13:17:52] (03Merged) 10jenkins-bot: ml-services: update readability to identify cgroupsv2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010483 (https://phabricator.wikimedia.org/T353461) (owner: 10Ilias Sarantopoulos) [13:18:16] oh right the window is earlier now because daylight confusion time [13:18:25] I can deploy, I guess [13:18:42] great [13:22:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009729 (https://phabricator.wikimedia.org/T359614) (owner: 10MdsShakil) [13:23:01] (03Merged) 10jenkins-bot: Add `suppressredirect` right to pagemover and filemover user groups in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009729 (https://phabricator.wikimedia.org/T359614) (owner: 10MdsShakil) [13:23:41] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . 
[13:24:01] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1009729|Add `suppressredirect` right to pagemover and filemover user groups in azwiki (T359614)]] [13:24:16] T359614: Add `suppressredirect` right to pagemover and filemover user groups in azwiki - https://phabricator.wikimedia.org/T359614 [13:24:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9623402 (10Dzahn) a:03ANakanishi_WMF [13:26:31] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and mdsshakil: Backport for [[gerrit:1009729|Add `suppressredirect` right to pagemover and filemover user groups in azwiki (T359614)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:26:53] MdsShakil: please test :) [13:27:03] https://az.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&format=json&formatversion=2 looks good to me, at least [13:27:18] Lucas_WMDE LGTM [13:27:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and mdsshakil: Continuing with sync [13:30:17] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:30:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:34:40] PROBLEM - Check whether ferm is active by checking the default input chain on mw2294 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:35:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:35:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:38:14] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1009729|Add `suppressredirect` right to pagemover and filemover user groups in azwiki (T359614)]] (duration: 14m 13s) [13:38:19] 
T359614: Add `suppressredirect` right to pagemover and filemover user groups in azwiki - https://phabricator.wikimedia.org/T359614
[13:38:21] (PS7) Lucas Werkmeister (WMDE): Enable natvive math rendering options by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1002639 (https://phabricator.wikimedia.org/T358803) (owner: Physikerwelt)
[13:38:25] (PS8) Lucas Werkmeister (WMDE): Enable native math rendering options by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1002639 (https://phabricator.wikimedia.org/T358803) (owner: Physikerwelt)
[13:39:08] physikerwelt: I’m a bit confused by the task description of T358803 – you write that it’s currently enabled in Germany, but the config change looks like it was previously limited to testwiki?
[13:39:08] T358803: Enable native MathML rendering mode as a rendering option everywhere - https://phabricator.wikimedia.org/T358803
[13:39:45] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[13:39:52] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:40:49] (otherwise that change looks okay to me)
[13:40:50] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1002639/8/wmf-config/InitialiseSettings.php the dewiki line stands for Germany
[13:41:02] ohhh
[13:41:05] I missed that line somehow
[13:41:19] (also, don’t let the Austrian and Swiss wikipedians hear that ;))
[13:41:31] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1002639 (https://phabricator.wikimedia.org/T358803) (owner: Physikerwelt)
[13:41:43] (PS1) Cwhite: beta-logs: unset java home override [puppet] - https://gerrit.wikimedia.org/r/1010259 (https://phabricator.wikimedia.org/T352517)
[13:42:13] (Merged) jenkins-bot: Enable native math rendering options by default [mediawiki-config] -
10https://gerrit.wikimedia.org/r/1002639 (https://phabricator.wikimedia.org/T358803) (owner: 10Physikerwelt) [13:42:25] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:38] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1002639|Enable native math rendering options by default (T358803)]] [13:45:00] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and physikerwelt: Backport for [[gerrit:1002639|Enable native math rendering options by default (T358803)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:45:05] T358803: Enable native MathML rendering mode as a rendering option everywhere - https://phabricator.wikimedia.org/T358803 [13:45:11] alright, please test :) [13:46:46] thank you that works [13:47:02] ok! [13:47:03] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and physikerwelt: Continuing with sync [13:48:04] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:49:34] huh, we’re only left with 4 bare-metal canaries now? 
[13:49:38] (if I read the scap output correctly)
[13:51:56] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:52:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[13:52:25] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:54:09] PROBLEM - Host db1246 #page is DOWN: PING CRITICAL - Packet loss = 100%
[13:54:23] hu
[13:55:13] here if needed
[13:55:17] hmm
[13:55:20] on it, don't worry
[13:55:23] depooling the host
[13:55:28] thanks arnaudb
[13:55:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1246', diff saved to https://phabricator.wikimedia.org/P58768 and previous config saved to /var/cache/conftool/dbconfig/20240312-135537-arnaudb.json
[13:56:59] RECOVERY - Host db1246 #page is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[13:57:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:57:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Down', diff saved to https://phabricator.wikimedia.org/P58769 and previous config saved to /var/cache/conftool/dbconfig/20240312-135715-ladsgroup.json
[13:57:43] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1002639|Enable native math rendering options by default (T358803)]] (duration: 15m 05s)
[13:57:47] T358803: Enable native MathML rendering mode as a rendering option everywhere - https://phabricator.wikimedia.org/T358803
[13:57:50] !log
ladsgroup@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P58770 and previous config saved to /var/cache/conftool/dbconfig/20240312-135750-ladsgroup.json [13:58:07] !log UTC afternoon backport+config window done [13:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:51] thank you Lucas_WMDE [13:59:24] PROBLEM - SSH on db1246 is CRITICAL: connect to address 10.64.48.172 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:00:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: db1246 depooled [14:00:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: db1246 depooled [14:02:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:04:40] RECOVERY - Check whether ferm is active by checking the default input chain on mw2294 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:11:13] (03CR) 10Majavah: "PCC: https://puppet-compiler.wmflabs.org/output/1010504/1634/" [puppet] - 10https://gerrit.wikimedia.org/r/1010503 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [14:12:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P58772 and previous config saved to /var/cache/conftool/dbconfig/20240312-141255-ladsgroup.json [14:13:53] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010524 (https://phabricator.wikimedia.org/T356492) [14:14:24] !log @deploy2002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [14:14:31] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:14:33] (03PS2) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010524 (https://phabricator.wikimedia.org/T356492) [14:14:58] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010524 (https://phabricator.wikimedia.org/T356492) (owner: 10Kosta Harlan) [14:16:12] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1010524 (https://phabricator.wikimedia.org/T356492) (owner: 10Kosta Harlan) [14:16:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: db1246 depooled [14:16:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1246.eqiad.wmnet with reason: db1246 depooled [14:16:56] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:04] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [14:18:36] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [14:20:10] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [14:20:37] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [14:21:02] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:21:09] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:21:16] RECOVERY - Host elastic2088 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:22:32] !log kharlan@deploy2002 helmfile 
[codfw] START helmfile.d/services/ipoid: apply [14:22:55] !log kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [14:22:58] 10ops-eqiad, 06DC-Ops, 06Data-Persistence: hw troubleshooting: Unidentified for db1246.eqiad.wmnet - https://phabricator.wikimedia.org/T359940 (10ABran-WMF) 03NEW [14:26:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:26:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:26:58] (03PS1) 10Marostegui: db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1010530 (https://phabricator.wikimedia.org/T359940) [14:27:40] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P58773 and previous config saved to /var/cache/conftool/dbconfig/20240312-142801-ladsgroup.json [14:28:34] (03CR) 10Arnaudb: [C: 03+2] db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1010530 (https://phabricator.wikimedia.org/T359940) (owner: 10Marostegui) [14:31:23] (03CR) 10Cwhite: [C: 03+2] beta-logs: unset java home override [puppet] - 10https://gerrit.wikimedia.org/r/1010259 (https://phabricator.wikimedia.org/T352517) (owner: 10Cwhite) [14:34:53] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:35:00] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:38:04] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:57] 06SRE, 10Data Pipelines, 06Data-Engineering, 06Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9623717 (10mpopov) +1 to Isaac's proposed 
solution of carrying `wprov` forward as `wprov` but also setting `rprov=1` in case of a redirect to simplify analy... [14:42:25] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P58774 and previous config saved to /var/cache/conftool/dbconfig/20240312-144307-ladsgroup.json [14:49:24] 06SRE, 06serviceops: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#9623780 (10Dzahn) [14:51:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 49.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:51:55] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab] Failover test of gitlab replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1009298 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [14:56:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 49.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:58:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbprov2006.codfw.wmnet with OS bullseye [14:58:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host 
dbprov2005.codfw.wmnet with OS bullseye [14:59:19] (03PS1) 10EoghanGaffney: Revert "[gitlab] Failover test of gitlab replica hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1010224 [15:00:06] eoghan, jelto, and arnoldokoth: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T1500) [15:12:50] (03PS1) 10Elukey: Add fake Docker secret config for Dragonfly on ml-serve k8s [labs/private] - 10https://gerrit.wikimedia.org/r/1010534 (https://phabricator.wikimedia.org/T359416) [15:15:18] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake Docker secret config for Dragonfly on ml-serve k8s [labs/private] - 10https://gerrit.wikimedia.org/r/1010534 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [15:16:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 48.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:16:16] (03PS1) 10Elukey: Add Dragonfly 2p2 cache to ml-serve k8s [puppet] - 10https://gerrit.wikimedia.org/r/1010535 (https://phabricator.wikimedia.org/T359416) [15:18:35] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1635/co" [puppet] - 10https://gerrit.wikimedia.org/r/1010535 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [15:19:44] (03CR) 10Elukey: [V: 03+1] "Will deploy this after some tests in staging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1010535 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [15:20:26] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1010224 (owner: 10EoghanGaffney) [15:21:01] (03CR) 10EoghanGaffney: [C: 03+2] Revert "[gitlab] Failover test of gitlab replica hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1010224 (owner: 10EoghanGaffney) [15:21:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 48.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:22:14] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:58] (03CR) 10Dzahn: [C: 03+1] Revert "[gitlab] Failover test of gitlab replica hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1010224 (owner: 10EoghanGaffney) [15:32:40] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [15:55:51] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for GeorgeMikesell - https://phabricator.wikimedia.org/T358922#9624001 (10Jrbranaa) I approve [15:59:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:05] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:15:15] (03PS4) 10Krinkle: Support cookies in XWikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:15:40] (03CR) 10Krinkle: [C: 03+1] Support cookies in XWikimediaDebug (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:15:56] (03CR) 10CI reject: [V: 04-1] Support cookies in XWikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:20:34] (03PS1) 10EoghanGaffney: [gitlab] Fix progress_bars parameter (should be print_progress_bars) [cookbooks] - 10https://gerrit.wikimedia.org/r/1010559 (https://phabricator.wikimedia.org/T358559) [16:28:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:33:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:40:03] 06SRE, 06Data-Persistence, 06Infrastructure-Foundations: Re-IP db servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T354878#9624160 (10cmooney) To confirm all the new networks in codfw are within 10.192.0.0/16, which I believe should be ok in terms of grants. As I under... 
[16:40:54] (03PS5) 10Krinkle: Support cookies in XWikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:40:55] (03PS1) 10Krinkle: tests: Fix PHP 8.2 warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010562 [16:40:57] (03PS1) 10Krinkle: tests: Convert XWikimediaDebug cookie test to data provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010563 [16:43:39] (03CR) 10CI reject: [V: 04-1] Support cookies in XWikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:45:24] (03CR) 10CI reject: [V: 04-1] tests: Convert XWikimediaDebug cookie test to data provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010563 (owner: 10Krinkle) [16:49:23] (03PS6) 10Krinkle: Support cookies in XWikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:49:24] (03PS2) 10Krinkle: tests: Fix PHP 8.2 warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010562 [16:49:30] (03PS2) 10Krinkle: tests: Convert XWikimediaDebug cookie test to data provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010563 [16:51:08] (03CR) 10CI reject: [V: 04-1] tests: Convert XWikimediaDebug cookie test to data provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010563 (owner: 10Krinkle) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T1700) [17:09:01] (03PS1) 10Ahmon Dancy: modules/scap/files/foreachwiki: Fix check for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1010590 (https://phabricator.wikimedia.org/T357877) [17:10:58] (03PS3) 10Krinkle: tests: Convert XWikimediaDebug cookie test to data provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010563 [17:13:24] (03CR) 
10Dzahn: [C: 03+1] "verified on https://doc.wikimedia.org/spicerack/master/api/spicerack.remote.html" [cookbooks] - 10https://gerrit.wikimedia.org/r/1010559 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [17:20:03] (03CR) 10Jforrester: [C: 03+1] "Nice." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010562 (owner: 10Krinkle) [17:25:03] 06SRE, 06serviceops, 07Wikimedia-production-error: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#9624373 (10Reedy) [17:29:07] (03PS3) 10Ahmon Dancy: mw-xml.sh: Update maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/1009784 (https://phabricator.wikimedia.org/T359643) [17:29:37] (03CR) 10Krinkle: [C: 03+1] Support cookies in XWikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000307 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:31:33] (03PS4) 10Krinkle: tests: Convert XWikimediaDebug cookie test to data provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010563 [17:33:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:34:02] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:34:42] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 60%, RTA = 33.54 ms [17:41:06] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [17:59:25] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:04:55] 
(SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:17:11] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:27:28] (03PS2) 10Jforrester: Be able to disable MobileFrontend and drop the secondary domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010268 (https://phabricator.wikimedia.org/T349408) [18:27:29] (03PS2) 10Jforrester: [BETA CLUSTER] Disable MobileFrontend for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010269 (https://phabricator.wikimedia.org/T358329) [18:27:31] (03PS2) 10Jforrester: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) [18:29:33] (03PS1) 10Jforrester: WikiModule: Fix data structure when preloading title info [core] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1010568 (https://phabricator.wikimedia.org/T359939) [19:22:14] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:23:06] jouncebot: nowandnext [19:23:06] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [19:23:07] In 0 hour(s) and 36 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T2000) [19:23:33] (03CR) 10Hashar: [C: 03+2] WikiModule: Fix data structure when preloading title info 
[core] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1010568 (https://phabricator.wikimedia.org/T359939) (owner: 10Jforrester) [19:25:35] jouncebot: refresh [19:25:36] I refreshed my knowledge about deployments. [19:25:40] jouncebot: nowandnext [19:25:40] No deployments scheduled for the next 0 hour(s) and 34 minute(s) [19:25:40] In 0 hour(s) and 34 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T2000) [19:25:47] well hmm something like that [19:44:19] (03Merged) 10jenkins-bot: WikiModule: Fix data structure when preloading title info [core] (wmf/1.42.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1010568 (https://phabricator.wikimedia.org/T359939) (owner: 10Jforrester) [19:44:47] good [19:45:31] !log hashar@deploy2002 Started scap: Backport for [[gerrit:1010568|WikiModule: Fix data structure when preloading title info (T359939)]] [19:45:36] T359939: PHP Warning: Illegal string offset 'page_len' - https://phabricator.wikimedia.org/T359939 [19:46:49] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:46:55] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:47:57] !log hashar@deploy2002 hashar and jforrester: Backport for [[gerrit:1010568|WikiModule: Fix data structure when preloading title info (T359939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:48:03] !log hashar@deploy2002 hashar and jforrester: Continuing with sync [19:51:13] * hashar yawns at helm/k8s [19:52:38] PROBLEM - Check whether ferm is active by checking the default input chain on mw2282 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:53:02] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1010 is CRITICAL: ERROR ferm input drop default policy not set, ferm 
might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:53:32] PROBLEM - Check whether ferm is active by checking the default input chain on mw1431 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:54:04] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1046 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:58:46] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:1010568|WikiModule: Fix data structure when preloading title info (T359939)]] (duration: 13m 14s) [19:58:50] T359939: PHP Warning: Illegal string offset 'page_len' - https://phabricator.wikimedia.org/T359939 [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240312T2000). [20:00:04] matmarex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:02:01] * TheresNoTime is unable to deploy this evening [20:04:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:04:29] I have deployed the patch [20:06:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:34] (03PS1) 10Hashar: Revert "Rakefile: remove useless files from generated docs" [puppet] - 10https://gerrit.wikimedia.org/r/1010570 (https://phabricator.wikimedia.org/T358507) [20:09:08] (03CR) 10Hashar: "It is no more needed as of yard 0.9.36 ;)" [puppet] - 10https://gerrit.wikimedia.org/r/1010570 (https://phabricator.wikimedia.org/T358507) (owner: 10Hashar) [20:15:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:20:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.35% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:22:38] RECOVERY - Check whether ferm is active by checking the default input chain on mw2282 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:23:02] RECOVERY - Check whether ferm is active by 
checking the default input chain on kubernetes1010 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:23:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw1431 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:24:04] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1046 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:49:07] (03PS1) 10GergesShamon: Set ShowRollbackConfirmationDefaultUserOptions on arwiki to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010645 [20:53:03] (03PS2) 10GergesShamon: Set ShowRollbackConfirmationDefaultUserOptions on arwiki to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010645 [21:00:04] (03PS3) 10GergesShamon: Set ShowRollbackConfirmationDefaultUserOptions on arwiki to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010645 [21:04:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:06:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:04] Hi [21:39:31] (03PS1) 10Tim Starling: Fix broken vim modelines [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1010664 [21:43:00] (03PS2) 10Tim Starling: Fix broken vim modelines [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1010664 [21:52:41] (03PS1) 10Tim Starling: Add procps to base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1010690 
[21:54:44] (03CR) 10Gergő Tisza: [C: 03+1] tests: Convert XWikimediaDebug cookie test to data provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010563 (owner: 10Krinkle) [22:04:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:04:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:05:10] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:06:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:06:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.320 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:17:12] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:24:14] 10SRE-swift-storage, 06Commons, 06MediaWiki-Engineering, 10MediaWiki-File-management, 13Patch-For-Review: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9625181 (10Krinkle) Tagging MwEng group for visibility, given unowned code. 
[22:25:21] (03PS1) 10Jdlrobson: Disable night mode on history pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010698 (https://phabricator.wikimedia.org/T359183) [22:39:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1010218 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [22:41:34] (03CR) 10BryanDavis: "https://vimdoc.sourceforge.net/htmldoc/options.html#modeline" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1010664 (owner: 10Tim Starling) [22:50:01] !log removing 2 files for legal compliance [22:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:20] (03Merged) 10jenkins-bot: migrateBlocks.php: Skip existing IDs [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1010218 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [22:57:45] !log tstarling@deploy2002 Started scap: Backport for [[gerrit:1010218|migrateBlocks.php: Skip existing IDs (T355034)]] [22:57:50] T355034: Deploy new block_target schema - https://phabricator.wikimedia.org/T355034 [23:00:05] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1010218|migrateBlocks.php: Skip existing IDs (T355034)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:02:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:03:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:04:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490#9625219 (10ANakanishi_WMF) I'm Riddy's manager and approve his access, which is essential for his job. Thank you! 
[23:05:11] !log removing 3 files for legal compliance [23:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:17] !log removing 1 file for legal compliance [23:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:29] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:29:43] !log tstarling@deploy2002 tstarling: Continuing with sync [23:30:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:30:51] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:40:22] !log tstarling@deploy2002 Finished scap: Backport for [[gerrit:1010218|migrateBlocks.php: Skip existing IDs (T355034)]] (duration: 42m 36s) [23:40:27] T355034: Deploy new block_target schema - https://phabricator.wikimedia.org/T355034 [23:48:08] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:48:15] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply