[00:10:20] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:26:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:38:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112422 [00:38:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112422 (owner: 10TrainBranchBot) [00:54:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:57:27] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1112422 (owner: 10TrainBranchBot) [00:59:44] FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:02:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:06:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10474935 (10phaultfinder) [01:08:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112438 [01:08:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112438 (owner: 10TrainBranchBot) [01:09:34] RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:20:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:22:34] RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:30:01] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1112438 (owner: 10TrainBranchBot) [02:08:20] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:44:56] (03PS1) 10Kevin Bazira: ml-services: update article-country deployment image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112449 (https://phabricator.wikimedia.org/T382295) [04:54:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:56:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:59:44] FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 
- https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:01:22] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/ops for Mr ChanMP - https://phabricator.wikimedia.org/T384168 (10MrChanMP) 03NEW [05:11:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10475002 (10phaultfinder) [05:30:38] (03PS1) 10Kevin Bazira: EventStreamConfig: Add mediawiki.page_article_country_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) [05:32:14] PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:35:48] PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:40:53] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/ops for Mr ChanMP - https://phabricator.wikimedia.org/T384168#10475013 (10MrChanMP) [05:43:14] RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:45:05] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/ops for Mr ChanMP - https://phabricator.wikimedia.org/T384168#10475014 (10MrChanMP) 05Open→03In progress p:05Triage→03Unbreak! 
[05:52:48] RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:03:03] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for administrators of Indonesian projects - https://phabricator.wikimedia.org/T384135#10475018 (10SpartacksCompatriot) Thanks for taking this, Ladsgroup. Can this be wiki-id-admins@lists.wikimedia.org instead? We also plan to include non-Wikipedia Indonesian admin... [06:05:46] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:27:52] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:28:52] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:32:46] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:43:38] 10SRE-Access-Requests: Access to superset tool https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169 (10DSantamaria) 03NEW [06:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [06:58:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:59:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:59:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:59:54] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:05:27] (03PS1) 10Marostegui: pc1017: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1112566 (https://phabricator.wikimedia.org/T383235) [07:05:47] (03CR) 10Muehlenhoff: [C:03+2] Add missing Hiera settings for new bookworm master roles [puppet] - 10https://gerrit.wikimedia.org/r/1112218 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:06:28] (03PS6) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [07:06:47] (03CR) 10Marostegui: [C:03+2] pc1017: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1112566 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [07:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:11:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:36:19] (03PS1) 10Marostegui: pc1017,pc2017: Create pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112676 (https://phabricator.wikimedia.org/T383235) [07:36:52] (03CR) 10Marostegui: [C:03+2] 
pc1017,pc2017: Create pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112676 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [07:36:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: reorganizing pc7 [07:42:20] (03PS1) 10Marostegui: site.pp: Create pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112677 (https://phabricator.wikimedia.org/T383235) [07:43:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [07:43:58] (03CR) 10Marostegui: [C:03+2] site.pp: Create pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112677 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [07:51:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [07:52:43] (03PS1) 10Muehlenhoff: Remove LDAP access for jaycano [puppet] - 10https://gerrit.wikimedia.org/r/1112678 [07:54:44] (03CR) 10Slyngshede: "LGTM, wasn't sure if this was the correct person, but the username matches." [puppet] - 10https://gerrit.wikimedia.org/r/1112678 (owner: 10Muehlenhoff) [07:54:49] (03CR) 10Slyngshede: [C:03+1] Remove LDAP access for jaycano [puppet] - 10https://gerrit.wikimedia.org/r/1112678 (owner: 10Muehlenhoff) [07:59:28] (03CR) 10Muehlenhoff: "I was a little confused as well, but the nickname is also mentioned there:" [puppet] - 10https://gerrit.wikimedia.org/r/1112678 (owner: 10Muehlenhoff) [07:59:30] (03CR) 10Muehlenhoff: [C:03+2] Remove LDAP access for jaycano [puppet] - 10https://gerrit.wikimedia.org/r/1112678 (owner: 10Muehlenhoff) [08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T0800). 
[08:00:05] awight, cjming, and ihurbain: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:15] o/ I can self-deploy my patch now [08:00:33] o/ i can self-deploy mine too [08:01:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [extensions/Cite] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111215 (https://phabricator.wikimedia.org/T382310) (owner: 10Awight) [08:01:06] awight: can you ping me when you're done? [08:01:16] (03CR) 10Awight: Switch to explicit numbering for Parsoid footnote markers [extensions/Cite] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111215 (https://phabricator.wikimedia.org/T382310) (owner: 10Awight) [08:01:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [extensions/Cite] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111215 (https://phabricator.wikimedia.org/T382310) (owner: 10Awight) [08:01:28] cjming: will do! [08:01:34] ty! 
[08:01:40] o/ i can't self deploy, i could do with one of you getting mine after theirs :) [08:01:59] ihurbain: I'm happy to, after cjming is done [08:02:05] awight: thank you :) [08:03:39] i can ping you both when i'm done [08:03:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2027.codfw.wmnet to cluster codfw and group A [08:04:45] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2027.codfw.wmnet to cluster codfw and group A [08:05:38] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update article-country deployment image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112449 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [08:12:00] hi, I am around as well [08:15:25] !log Deploy schema change on x1 codfw master db2196 with replication dbmaint T384176 [08:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:29] T384176: Make gemm_mentee_is_active a mwtinyint in the per-wiki x1 databases - https://phabricator.wikimedia.org/T384176 [08:15:43] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2024 Nov-Dec), 07Unplanned-Sprint-Work: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10475172 (10Nikerabbit) [08:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:19:55] marostegui: thank you! you're quick :) [08:20:09] urbanecm: Some of the tables already have that definition [08:20:11] Is that expected? 
[08:21:15] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:30:24] (03Merged) 10jenkins-bot: Switch to explicit numbering for Parsoid footnote markers [extensions/Cite] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111215 (https://phabricator.wikimedia.org/T382310) (owner: 10Awight) [08:30:43] !log installing python-aiohttp security updates [08:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:19] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:31:32] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1111215|Switch to explicit numbering for Parsoid footnote markers (T382310)]] [08:31:36] T382310: Remove built-in Cite CSS numbering for Parsoid, if possible - https://phabricator.wikimedia.org/T382310 [08:31:54] sorry, those tests were *slow* today [08:34:21] !log installing Linux 6.1.124 on Bookworm hosts [08:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:44] FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:37:21] (03PS5) 10Clare Ming: Enable the text experiment on testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) [08:38:27] okay, considering timing, i'll move my deploy to later, it's not going to fit :) [08:41:30] ihurbain: yeah it's looking like that---although config patches should be much faster than this extension patch CI, one hopes. 
[08:41:58] awight: faster, probably, faster than 10 minutes, probably not [08:42:59] that's been my experience -- backports with most extensions run 25 minutes on average [08:43:13] for CI to finish ^^ [08:43:38] 40 minutes and we're only just now getting to the test servers today. I chewed a limb off already. [08:43:55] chomp chomp [08:44:13] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to analytics-privatedata-users to access https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10475251 (10Aklapper) [08:44:27] testing... [08:44:31] !log awight@deploy2002 awight: Backport for [[gerrit:1111215|Switch to explicit numbering for Parsoid footnote markers (T382310)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:44:36] T382310: Remove built-in Cite CSS numbering for Parsoid, if possible - https://phabricator.wikimedia.org/T382310 [08:45:26] !log awight@deploy2002 awight: Continuing with sync [08:49:44] FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:52:11] 06SRE, 10SRE-swift-storage: wikipedia-commons-local-thumb.4b corrupted causing 401 - https://phabricator.wikimedia.org/T384128#10475285 (10MatthewVernon) p:05Triage→03High [08:54:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:55:11] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111215|Switch to explicit numbering for Parsoid footnote markers (T382310)]] (duration: 23m 38s) [08:55:15] T382310: 
Remove built-in Cite CSS numbering for Parsoid, if possible - https://phabricator.wikimedia.org/T382310 [08:55:28] cjming: all the remaining time is yours :-/ [08:55:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10475302 (10MoritzMuehlenhoff) [08:55:56] thanks! [08:56:01] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: serve apache vhost on localhost too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112186 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [08:56:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [08:56:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [08:56:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10475304 (10ops-monitoring-bot) Draining ganeti2023.codfw.wmnet of running VMs [08:57:12] 06SRE, 10SRE-swift-storage: wikipedia-commons-local-thumb.4b corrupted causing 401 - https://phabricator.wikimedia.org/T384128#10475315 (10MatthewVernon) Can confirm the codfw db for this container is missing, and there are quarantined versions. [08:58:07] 06SRE, 10SRE-swift-storage: wikipedia-commons-local-thumb.4b corrupted causing 401 - https://phabricator.wikimedia.org/T384128#10475316 (10MatthewVernon) Of interest/concern is that ms-be2068 has quarantined this before: ` mvernon@ms-be2068:~$ sudo ls -l /srv/swift-storage/sdb3/quarantined/containers total 0 d... 
[08:58:20] (03Merged) 10jenkins-bot: Enable the text experiment on testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [08:58:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [08:58:36] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1112105|Enable the text experiment on testwiki only (T373715)]] [08:58:41] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [09:03:20] !log cjming@deploy2002 cjming: Backport for [[gerrit:1112105|Enable the text experiment on testwiki only (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:03:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2005.codfw.wmnet to drbd [09:04:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10475327 (10ops-monitoring-bot) VM kubestagemaster2005.codfw.wmnet switching disk type to drbd [09:04:26] (03PS2) 10Filippo Giunchedi: prometheus: recording rules for mw edit rates [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) [09:05:01] !log cjming@deploy2002 cjming: Continuing with sync [09:05:03] (03CR) 10Filippo Giunchedi: "Doh of course, rates is what we should do here also to deal properly with underlying counters resets" [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [09:08:03] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112235 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [09:09:40] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/ops for Mr ChanMP - 
https://phabricator.wikimedia.org/T384168#10475329 (10Aklapper) [09:11:52] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update articletopic outlink image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: 10Gkyziridis) [09:13:09] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112105|Enable the text experiment on testwiki only (T373715)]] (duration: 14m 32s) [09:13:12] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [09:13:14] (03Merged) 10jenkins-bot: ml-services: update articletopic outlink image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: 10Gkyziridis) [09:14:26] jouncebot: nowandnext [09:14:27] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [09:14:27] In 1 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1100) [09:14:53] !log end of UTC morning backport window [09:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:05] (03CR) 10Jelto: [C:03+1] peopleweb: request timeout to allow downloading larger files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112056 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [09:15:07] I'm going to deploy now if that's all good. [09:15:34] I just finished my patch - all yours Dreamy_Jazz [09:15:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2212', diff saved to https://phabricator.wikimedia.org/P72151 and previous config saved to /var/cache/conftool/dbconfig/20250120-091545-marostegui.json [09:16:48] Thanks. 
[09:16:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2212.codfw.wmnet with reason: rebuilding [09:18:10] (03CR) 10Volans: [C:03+1] "Looks sane to me, thanks for the follow up" [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [09:18:11] (03CR) 10Jelto: [C:03+1] "lgtm, the `request_timeout` added in I3ca0a13ab9c35a6935ca3c9759f2ff2759874699 was probably also just noop because the parameter was remov" [puppet] - 10https://gerrit.wikimedia.org/r/1112205 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [09:18:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112220 (owner: 10Dreamy Jazz) [09:19:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2005.codfw.wmnet to drbd [09:19:27] PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100% [09:19:40] (03Merged) 10jenkins-bot: Revert^2 "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112220 (owner: 10Dreamy Jazz) [09:19:55] RECOVERY - Host kubestagemaster2005 is UP: PING OK - Packet loss = 0%, RTA = 44.02 ms [09:19:57] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1112220|Revert^2 "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false"]] [09:19:57] (03PS2) 10Jelto: trafficserver: add dedicated mapping for querybuilder [puppet] - 10https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) [09:20:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [09:20:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10475371 
(10ops-monitoring-bot) Draining ganeti2023.codfw.wmnet of running VMs [09:20:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [09:21:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [09:21:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2005.codfw.wmnet to plain [09:21:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10475372 (10ops-monitoring-bot) VM kubestagemaster2005.codfw.wmnet switching disk type to plain [09:22:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2005.codfw.wmnet to plain [09:22:32] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10475376 (10ops-monitoring-bot) Draining ganeti2023.codfw.wmnet of running VMs [09:23:33] (03PS1) 10Muehlenhoff: Switch ganeti2023 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1112693 [09:24:29] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1112220|Revert^2 "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:24:36] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [09:27:36] (03PS1) 10Filippo Giunchedi: Revert^2 "prometheus: scrape otelcol metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1112695 [09:30:54] (03CR) 10Filippo Giunchedi: [C:03+2] Revert^2 "prometheus: scrape otelcol metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1112695 (owner: 10Filippo Giunchedi) [09:31:42] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112220|Revert^2 "Pin wgCheckUserEnableTempAccountsOnboardingDialog as false"]] (duration: 
11m 45s) [09:32:29] Finished my deploys [09:32:39] (03PS1) 10Marostegui: parsercachepurging.pp: Add pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112697 (https://phabricator.wikimedia.org/T383235) [09:35:30] (03CR) 10Marostegui: [C:03+2] parsercachepurging.pp: Add pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112697 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [09:44:26] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10475441 (10MoritzMuehlenhoff) [09:46:06] 06SRE, 10SRE-swift-storage: wikipedia-commons-local-thumb.4b corrupted causing 401 - https://phabricator.wikimedia.org/T384128#10475446 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon The error from the sqlite integrity check for each copy was `wrong # of entries in index ix_object_deleted_name... [09:47:39] (03PS1) 10Slyngshede: P:idp theme support [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) [09:48:45] (03CR) 10Giuseppe Lavagetto: [C:03+2] aptrepo: allow importing conftool from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1111945 (owner: 10Giuseppe Lavagetto) [09:51:02] !log installing intel-microcode security updates [09:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:38] (03PS1) 10Filippo Giunchedi: thanos: further reduce trace sampling [puppet] - 10https://gerrit.wikimedia.org/r/1112700 (https://phabricator.wikimedia.org/T383963) [09:53:11] (03PS2) 10Filippo Giunchedi: thanos: further reduce trace sampling [puppet] - 10https://gerrit.wikimedia.org/r/1112700 (https://phabricator.wikimedia.org/T378190) [09:53:42] (03PS3) 10Filippo Giunchedi: thanos: further reduce trace sampling [puppet] - 10https://gerrit.wikimedia.org/r/1112700 (https://phabricator.wikimedia.org/T378190) [09:54:23] (03PS1) 10Marostegui: sections.yaml: Add pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112701 
(https://phabricator.wikimedia.org/T383235) [09:56:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [09:56:34] (03PS1) 10Marostegui: wmnet: Add pc7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1112702 (https://phabricator.wikimedia.org/T383235) [10:02:19] (03PS1) 10Filippo Giunchedi: prometheus: rename k8s rules groups [puppet] - 10https://gerrit.wikimedia.org/r/1112703 (https://phabricator.wikimedia.org/T369607) [10:03:09] RECOVERY - Disk space on prometheus1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops [10:07:20] (03PS2) 10Slyngshede: P:idp theme support [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) [10:08:48] (03PS3) 10Slyngshede: P:idp theme support [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) [10:09:43] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4817/co" [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:10:58] (03PS4) 10Slyngshede: P:idp theme support [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) [10:11:43] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4818/co" [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:12:45] (03CR) 
10Arnaudb: "Thanks for the review. Interesting! `request_timeout` is also used in other files, this might be worth highlighting? I'll merge it early i" [puppet] - 10https://gerrit.wikimedia.org/r/1112205 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [10:14:01] (03PS5) 10Slyngshede: P:idp theme support [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) [10:14:44] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4819/console" [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:15:47] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4820/console" [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:19:48] (03PS6) 10Slyngshede: P:idp theme support [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) [10:20:52] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4821/co" [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:22:45] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4822/console" [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:27:40] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1245.eqiad.wmnet with reason: os upgrade [10:28:45] (03PS1) 10Slyngshede: P:idp missing airflow secret [labs/private] - 
10https://gerrit.wikimedia.org/r/1112706 [10:29:35] (03CR) 10Slyngshede: [V:03+2 C:03+2] P:idp missing airflow secret [labs/private] - 10https://gerrit.wikimedia.org/r/1112706 (owner: 10Slyngshede) [10:30:59] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4824/console" [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:32:24] (03PS1) 10Clare Ming: Add dedicated experimentation lab test module [extensions/WikimediaEvents] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112707 (https://phabricator.wikimedia.org/T373715) [10:32:32] (03CR) 10Ilias Sarantopoulos: [C:03+1] EventStreamConfig: Add mediawiki.page_article_country_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [10:34:11] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1245.eqiad.wmnet with OS bookworm [10:35:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112707 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [10:36:13] (03CR) 10Ladsgroup: [C:03+1] wmnet: Add pc7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1112702 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [10:36:42] (03CR) 10Ladsgroup: [C:03+1] sections.yaml: Add pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112701 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [10:37:24] (03CR) 10Lucas Werkmeister (WMDE): Add known-good regexes for WikibaseQualityConstraints (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 
(https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven) [10:41:58] (03CR) 10Ladsgroup: [C:03+1] parsercachepurging.pp: Add pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112697 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [10:49:38] (03CR) 10Audrey Penven: Add known-good regexes for WikibaseQualityConstraints (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven) [10:50:47] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1245.eqiad.wmnet with reason: host reimage [10:51:08] (03PS2) 10Audrey Penven: Add known-good regexes for WikibaseQualityConstraints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) [10:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [10:54:20] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1245.eqiad.wmnet with reason: host reimage [10:59:37] (03CR) 10Jelto: [C:03+2] trafficserver: add dedicated mapping for querybuilder [puppet] - 10https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1100) [11:00:25] FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:30] (03PS1) 10Hnowlan: dsh: empty scap proxy list [puppet] - 10https://gerrit.wikimedia.org/r/1112714 
(https://phabricator.wikimedia.org/T384196) [11:05:25] RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:44] (03PS1) 10Vgutierrez: Add pki.goog CAA record on unified cert domains [dns] - 10https://gerrit.wikimedia.org/r/1112715 (https://phabricator.wikimedia.org/T376459) [11:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [11:10:41] (03CR) 10Cathal Mooney: [C:03+2] Add WMCS cloud-private eqiad ranges to private6 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/1112273 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [11:11:15] (03Merged) 10jenkins-bot: Add WMCS cloud-private eqiad ranges to private6 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/1112273 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [11:13:49] (03PS2) 10Vgutierrez: Add pki.goog CAA record on unified cert domains [dns] - 10https://gerrit.wikimedia.org/r/1112715 (https://phabricator.wikimedia.org/T376459) [11:14:05] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2023.codfw.wmnet with reason: remove from cluster for reimage [11:14:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10475759 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=41026d46-bc2c-40b2-9cae-afdfd40f2459) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [11:16:12] (03PS1) 10Majavah: hieradata: Bump striker to 2025-01-20-105216-production [puppet] - 10https://gerrit.wikimedia.org/r/1112716 [11:16:32] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1245.eqiad.wmnet with OS bookworm [11:21:57] (03CR) 10Marostegui: [C:03+2] wmnet: Add pc7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1112702 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [11:22:04] !log marostegui@dns1006 START - running authdns-update [11:22:35] (03CR) 10Marostegui: [C:03+2] sections.yaml: Add pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112701 (https://phabricator.wikimedia.org/T383235) (owner: 10Marostegui) [11:23:36] (03CR) 10Majavah: [C:03+2] hieradata: Bump striker to 2025-01-20-105216-production [puppet] - 10https://gerrit.wikimedia.org/r/1112716 (owner: 10Majavah) [11:23:51] !log marostegui@dns1006 END - running authdns-update [11:27:48] (03CR) 10Clément Goubert: mediawiki: Add mwcron feature (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:28:03] (03PS13) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [11:28:30] (03PS1) 10Zabe: Fix logo issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112720 [11:29:36] (03PS2) 10Zabe: Fix logo issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112720 [11:30:13] (03CR) 10Zabe: [C:03+2] Fix logo issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112720 (owner: 10Zabe) [11:31:02] (03Merged) 10jenkins-bot: Fix logo issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112720 (owner: 10Zabe) [11:31:30] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1112720|Fix logo issues]] 
[11:32:08] (03CR) 10Lucas Werkmeister (WMDE): Add known-good regexes for WikibaseQualityConstraints (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven) [11:34:44] FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:34:47] (03PS1) 10Cathal Mooney: Remove 'vrf_id' parameter from common.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1112722 (https://phabricator.wikimedia.org/T310715) [11:36:32] !log zabe@deploy2002 zabe: Backport for [[gerrit:1112720|Fix logo issues]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:36:37] !log zabe@deploy2002 zabe: Continuing with sync [11:39:44] FIRING: [16x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 11645085 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:39:57] (03CR) 10Volans: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1112722 (https://phabricator.wikimedia.org/T310715) (owner: 10Cathal Mooney) [11:40:22] (03CR) 10Vgutierrez: "can we move forward with this one?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [11:41:14] (03CR) 10Cathal Mooney: [C:03+2] Remove 'vrf_id' parameter from common.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1112722 (https://phabricator.wikimedia.org/T310715) (owner: 10Cathal Mooney) [11:41:47] (03Merged) 10jenkins-bot: Remove 'vrf_id' parameter from common.yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1112722 (https://phabricator.wikimedia.org/T310715) (owner: 10Cathal Mooney) [11:41:55] !log depooling db2131 as per T384001 [11:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:59] T384001: decommission db2131.codfw.wmnet - https://phabricator.wikimedia.org/T384001 [11:43:00] (03PS1) 10Brouberol: airflow: de-indent environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112723 (https://phabricator.wikimedia.org/T383430) [11:43:01] (03PS1) 10Brouberol: airflow: refactor env injection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112724 (https://phabricator.wikimedia.org/T383430) [11:43:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2131 T384001', diff saved to https://phabricator.wikimedia.org/P72153 and previous config saved to /var/cache/conftool/dbconfig/20250120-114306-fceratto.json [11:43:55] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112720|Fix logo issues]] (duration: 12m 25s) [11:44:01] (03CR) 10Btullis: [C:03+1] airflow: de-indent environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112723 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [11:44:22] (03CR) 10CI reject: [V:04-1] airflow: de-indent environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112723 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [11:45:34] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1112724 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [11:45:50] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@663a2f0]: 202412 Backfill: Fix ExternalTaskSensor missing filters [11:46:26] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@663a2f0]: 202412 Backfill: Fix ExternalTaskSensor missing filters (duration: 00m 35s) [11:48:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2131.codfw.wmnet with reason: Downtime db2131 [11:50:21] zabe: ok for me to deploy something? [11:52:37] (03CR) 10Jelto: [C:03+1] "I was looking here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/tlsp" [puppet] - 10https://gerrit.wikimedia.org/r/1112205 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [11:56:07] (03PS1) 10Volans: puppet-microservice: compatibility with Cumin 5+ [puppet] - 10https://gerrit.wikimedia.org/r/1112726 [11:56:54] (03CR) 10Volans: "RElated change in cumin is:" [puppet] - 10https://gerrit.wikimedia.org/r/1112726 (owner: 10Volans) [11:59:04] urbanecm: yep [11:59:07] (03PS1) 10Federico Ceratto: instances.yaml: Remove db2131 [puppet] - 10https://gerrit.wikimedia.org/r/1112727 (https://phabricator.wikimedia.org/T384001) [11:59:14] (from my view) [12:00:52] (03PS1) 10Hnowlan: httpbb: use k8s jobrunners for healthchecking [puppet] - 10https://gerrit.wikimedia.org/r/1112728 (https://phabricator.wikimedia.org/T383317) [12:02:46] (03CR) 10Santiago Faci: [V:03+1 C:03+1] "Looks good and it has been tested locally. It works fine!" 
[extensions/WikimediaEvents] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112707 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [12:03:17] (03CR) 10Marostegui: [C:03+1] instances.yaml: Remove db2131 [puppet] - 10https://gerrit.wikimedia.org/r/1112727 (https://phabricator.wikimedia.org/T384001) (owner: 10Federico Ceratto) [12:05:58] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: Remove db2131 [puppet] - 10https://gerrit.wikimedia.org/r/1112727 (https://phabricator.wikimedia.org/T384001) (owner: 10Federico Ceratto) [12:13:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2131 from dbctl T384001', diff saved to https://phabricator.wikimedia.org/P72154 and previous config saved to /var/cache/conftool/dbconfig/20250120-121318-fceratto.json [12:13:23] T384001: decommission db2131.codfw.wmnet - https://phabricator.wikimedia.org/T384001 [12:13:43] (03PS1) 10Cathal Mooney: Update routing policy to introduce BACKUP_PATH community [homer/public] - 10https://gerrit.wikimedia.org/r/1112734 (https://phabricator.wikimedia.org/T354839) [12:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:10] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112740 [12:18:41] (03PS1) 10Federico Ceratto: db2131.yaml: remove file [puppet] - 10https://gerrit.wikimedia.org/r/1112741 (https://phabricator.wikimedia.org/T384001) [12:24:59] (03PS1) 10Federico Ceratto: site.pp: Remove db2131 [puppet] - 10https://gerrit.wikimedia.org/r/1112745 (https://phabricator.wikimedia.org/T384001) [12:25:17] (03CR) 10Federico Ceratto: [C:03+2] db2131.yaml: remove file [puppet] - 10https://gerrit.wikimedia.org/r/1112741 (https://phabricator.wikimedia.org/T384001) (owner: 10Federico 
Ceratto) [12:25:39] (03CR) 10Marostegui: [C:03+1] site.pp: Remove db2131 [puppet] - 10https://gerrit.wikimedia.org/r/1112745 (https://phabricator.wikimedia.org/T384001) (owner: 10Federico Ceratto) [12:25:49] (03CR) 10Marostegui: [C:03+1] db2131.yaml: remove file [puppet] - 10https://gerrit.wikimedia.org/r/1112741 (https://phabricator.wikimedia.org/T384001) (owner: 10Federico Ceratto) [12:26:17] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Remove db2131 [puppet] - 10https://gerrit.wikimedia.org/r/1112745 (https://phabricator.wikimedia.org/T384001) (owner: 10Federico Ceratto) [12:34:27] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2023 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1112693 (owner: 10Muehlenhoff) [12:34:37] !log fceratto@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2131.codfw.wmnet [12:34:53] (03CR) 10Cathal Mooney: [C:03+2] reimage: change check to find physical interface if IP is on bridge [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [12:35:21] federico3: I'll puppet-merge your two patches along, ok? 
[12:37:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Deploy pc7 in both eqiad and codfw T383235', diff saved to https://phabricator.wikimedia.org/P72155 and previous config saved to /var/cache/conftool/dbconfig/20250120-123741-marostegui.json [12:37:46] T383235: Introduce pc7 and move one spare per dc to it - https://phabricator.wikimedia.org/T383235 [12:39:12] going ahead since these seem safe [12:39:13] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [12:39:23] moritzm: go for it yes [12:39:27] moritzm: yes please [12:40:11] ack, now merged [12:40:39] (03Merged) 10jenkins-bot: reimage: change check to find physical interface if IP is on bridge [cookbooks] - 10https://gerrit.wikimedia.org/r/1112175 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [12:43:23] (03PS1) 10Slyngshede: Escape filter character [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1112750 [12:47:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2023.codfw.wmnet with OS bookworm [12:47:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10475955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2023.codfw.wmnet with OS bookworm [12:48:57] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2131.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002" [12:49:39] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2131.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002" [12:49:39] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:49:40] !log fceratto@cumin1002 END (FAIL) - Cookbook 
sre.hosts.decommission (exit_code=1) for hosts db2131.codfw.wmnet [12:52:09] (03PS2) 10Brouberol: airflow: de-indent environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112723 (https://phabricator.wikimedia.org/T383430) [12:52:09] (03PS2) 10Brouberol: airflow: refactor env injection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112724 (https://phabricator.wikimedia.org/T383430) [12:52:51] (03CR) 10Hnowlan: [C:03+1] Don't setup database config for tilerator on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1112167 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:54:28] (03CR) 10Brouberol: [C:03+2] airflow: de-indent environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112723 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [12:54:30] (03CR) 10Brouberol: [C:03+2] airflow: refactor env injection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112724 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [12:54:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2282:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2282 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:55:46] !log Removing db2131 from zarcillo T384001 [12:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:50] T384001: decommission db2131.codfw.wmnet - https://phabricator.wikimedia.org/T384001 [12:56:26] (03Merged) 10jenkins-bot: airflow: de-indent environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112723 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [12:56:28] (03Merged) 10jenkins-bot: airflow: refactor env injection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112724 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [12:57:33] 06SRE, 
06Infrastructure-Foundations, 13Patch-For-Review: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10475975 (10cmooney) 05Open→03Resolved [12:57:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:58:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [12:58:20] (03PS1) 10Marostegui: orchestrator.conf: Add Federico [puppet] - 10https://gerrit.wikimedia.org/r/1112752 (https://phabricator.wikimedia.org/T384001) [12:59:12] jouncebot: nowandnext [12:59:12] No deployments scheduled for the next 1 hour(s) and 0 minute(s) [12:59:12] In 1 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1400) [12:59:17] (03CR) 10Urbanecm: [C:03+2] [Growth] Remove tybanner campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112207 (https://phabricator.wikimedia.org/T380405) (owner: 10Urbanecm) [12:59:18] (03CR) 10Urbanecm: [C:03+2] [Growth] Add fundraising- as a prefix for fundraising campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112208 (https://phabricator.wikimedia.org/T380405) (owner: 10Urbanecm) [12:59:59] (03Merged) 10jenkins-bot: [Growth] Remove tybanner campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112207 (https://phabricator.wikimedia.org/T380405) (owner: 10Urbanecm) [13:00:12] (03Merged) 10jenkins-bot: [Growth] Add fundraising- as a prefix for fundraising campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112208 (https://phabricator.wikimedia.org/T380405) (owner: 10Urbanecm) [13:00:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#10475985 (10cmooney) >>! 
In T354839#10271470, @Vgutierrez wrote: > Gven the limitations to run pybal and liberica on t... [13:00:57] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1112207|[Growth] Remove tybanner campaigns (T380405)]], [[gerrit:1112208|[Growth] Add fundraising- as a prefix for fundraising campaign (T380405)]] [13:01:00] (03PS2) 10Federico Ceratto: orchestrator.conf: Add Federico [puppet] - 10https://gerrit.wikimedia.org/r/1112752 (https://phabricator.wikimedia.org/T384001) (owner: 10Marostegui) [13:01:01] T380405: Generic Campaign parameter: New Editor Recruitment as part of the Donor Thank You page - https://phabricator.wikimedia.org/T380405 [13:01:37] (03CR) 10Federico Ceratto: [C:03+2] orchestrator.conf: Add Federico [puppet] - 10https://gerrit.wikimedia.org/r/1112752 (https://phabricator.wikimedia.org/T384001) (owner: 10Marostegui) [13:02:40] (03PS2) 10Kamila Součková: wikikube: rename mw146[4-9] -> wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1112200 (https://phabricator.wikimedia.org/T365571) [13:03:37] (03CR) 10Federico Ceratto: [C:03+1] orchestrator.conf: Add Federico [puppet] - 10https://gerrit.wikimedia.org/r/1112752 (https://phabricator.wikimedia.org/T384001) (owner: 10Marostegui) [13:03:44] (03CR) 10Marostegui: [C:03+2] orchestrator.conf: Add Federico [puppet] - 10https://gerrit.wikimedia.org/r/1112752 (https://phabricator.wikimedia.org/T384001) (owner: 10Marostegui) [13:04:01] (03CR) 10Kamila Součková: wikikube: rename mw146[4-9] -> wikikube-worker* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112200 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [13:05:41] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1112207|[Growth] Remove tybanner campaigns (T380405)]], [[gerrit:1112208|[Growth] Add fundraising- as a prefix for fundraising campaign (T380405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:06:49] !log urbanecm@deploy2002 urbanecm: 
Continuing with sync [13:07:35] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename mw146[4-9] -> wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1112200 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [13:07:38] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1464-1469].eqiad.wmnet [13:10:45] PROBLEM - MariaDB Replica SQL: s2 #page on db2189 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:11:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1464-1469].eqiad.wmnet [13:11:09] * Emperor here [13:11:27] can you depool ? [13:11:36] here too [13:11:39] !incidents [13:11:39] 5611 (UNACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [13:11:40] I will take care of it [13:11:43] !ack 5612 [13:11:44] Attempt to ack incident 5612 failed. [13:11:47] !ack 5611 [13:11:47] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [13:11:50] yes, I type good [13:11:55] marostegui: anything we can do? [13:11:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2189', diff saved to https://phabricator.wikimedia.org/P72157 and previous config saved to /var/cache/conftool/dbconfig/20250120-131157-marostegui.json [13:12:01] marostegui: OK, thanks. Need anything from oncall? 
[13:12:03] effie: nah, an index corruption [13:12:06] I will take care of it [13:12:08] alright [13:12:14] effie: I type fast and good 🤦 [13:12:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2189.codfw.wmnet with reason: rebuilding [13:12:41] Emperor: or I lag and I am slow haha [13:13:01] :) [13:13:49] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112207|[Growth] Remove tybanner campaigns (T380405)]], [[gerrit:1112208|[Growth] Add fundraising- as a prefix for fundraising campaign (T380405)]] (duration: 12m 52s) [13:13:53] T380405: Generic Campaign parameter: New Editor Recruitment as part of the Donor Thank You page - https://phabricator.wikimedia.org/T380405 [13:14:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: rebuilding index [13:15:16] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1464 to wikikube-worker1117 [13:15:37] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:16:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:16:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:17:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [13:17:22] e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:17:26] PROBLEM - BGP status on 
cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [13:17:26] e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:18:27] (03PS1) 10Marostegui: db2189: Disable notications [puppet] - 10https://gerrit.wikimedia.org/r/1112753 (https://phabricator.wikimedia.org/T384202) [13:19:16] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1464 to wikikube-worker1117 - kamila@cumin1002" [13:20:59] (03PS4) 10Brouberol: airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651) [13:21:15] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1465 to wikikube-worker1118 [13:21:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1464 to wikikube-worker1117 - kamila@cumin1002" [13:21:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:21:18] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1117 [13:21:35] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:22:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1117 [13:23:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1464 to 
wikikube-worker1117 [13:25:11] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1466 to wikikube-worker1119 [13:25:58] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1465 to wikikube-worker1118 - kamila@cumin1002" [13:26:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1465 to wikikube-worker1118 - kamila@cumin1002" [13:26:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:26:21] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:26:21] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1118 [13:27:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1118 [13:28:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1465 to wikikube-worker1118 [13:28:18] (03CR) 10Elukey: [C:03+1] "LGTM, I'd have loved the possibility to log when the new condition is true (so when json["query"] gets overridden), but I don't see a lot " [puppet] - 10https://gerrit.wikimedia.org/r/1112726 (owner: 10Volans) [13:28:19] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10476056 (10MoritzMuehlenhoff) [13:28:32] (03PS1) 10Klausman: home/klausman: move tmuxp config files to right location [puppet] - 10https://gerrit.wikimedia.org/r/1112756 [13:28:34] (03CR) 10Klausman: [C:03+2] home/klausman: move tmuxp config files to right location [puppet] - 10https://gerrit.wikimedia.org/r/1112756 (owner: 10Klausman) [13:29:28] (03CR) 10Volans: "The microservice is used also by other scripts around the infra without using cumin, so that's a pretty much normal behaviour. 
If you feel" [puppet] - 10https://gerrit.wikimedia.org/r/1112726 (owner: 10Volans) [13:29:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on mw1467:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:30:12] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1466 to wikikube-worker1119 - kamila@cumin1002" [13:30:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1466 to wikikube-worker1119 - kamila@cumin1002" [13:30:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:30:17] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1119 [13:30:42] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1467 to wikikube-worker1120 [13:31:03] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:31:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1119 [13:32:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1466 to wikikube-worker1119 [13:32:53] (03CR) 10Elukey: [C:03+1] "Nono go ahead, I was thinking out loud, no real/big concern :)" [puppet] - 10https://gerrit.wikimedia.org/r/1112726 (owner: 10Volans) [13:33:41] (03CR) 10JMeybohm: [C:03+1] prometheus: rename k8s rules groups [puppet] - 10https://gerrit.wikimedia.org/r/1112703 (https://phabricator.wikimedia.org/T369607) (owner: 10Filippo Giunchedi) [13:33:52] (03CR) 10Volans: [C:03+2] puppet-microservice: compatibility with Cumin 5+ [puppet] - 10https://gerrit.wikimedia.org/r/1112726 (owner: 10Volans) [13:34:46] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 
2024 Nov-Dec), 07Unplanned-Sprint-Work: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10476088 (10elukey) >>! In T335491#10469212, @KartikMistry wrote: > Assigning this to m... [13:35:59] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1467 to wikikube-worker1120 - kamila@cumin1002" [13:36:03] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1468 to wikikube-worker1121 [13:36:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1467 to wikikube-worker1120 - kamila@cumin1002" [13:36:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:18] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1120 [13:36:24] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:36:32] (03CR) 10Federico Ceratto: [C:03+1] db2189: Disable notications [puppet] - 10https://gerrit.wikimedia.org/r/1112753 (https://phabricator.wikimedia.org/T384202) (owner: 10Marostegui) [13:36:49] (03CR) 10Marostegui: [C:03+2] db2189: Disable notications [puppet] - 10https://gerrit.wikimedia.org/r/1112753 (https://phabricator.wikimedia.org/T384202) (owner: 10Marostegui) [13:37:18] (03CR) 10JMeybohm: "Why is it beneficial to have two metrics (entry_ and wiki_) instead of just one which keeps both labels intact?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [13:37:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1120 [13:38:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1467 to wikikube-worker1120 [13:39:10] (03CR) 10Brouberol: [C:03+2] airflow: bypass DNS resolution for the PG URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112206 (https://phabricator.wikimedia.org/T383651) (owner: 10Brouberol) [13:39:58] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1468 to wikikube-worker1121 - kamila@cumin1002" [13:40:13] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1469 to wikikube-worker1122 [13:40:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1468 to wikikube-worker1121 - kamila@cumin1002" [13:40:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:40:14] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1121 [13:40:33] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:41:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1121 [13:41:50] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2131.codfw.wmnet - https://phabricator.wikimedia.org/T384001#10476101 (10FCeratto-WMF) [13:42:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1468 to wikikube-worker1121 [13:42:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:43:06] 10ops-codfw, 06DBA, 06DC-Ops, 
10decommission-hardware: decommission db2131.codfw.wmnet - https://phabricator.wikimedia.org/T384001#10476105 (10FCeratto-WMF) This is ready for DCOps. The host shutdown step in the decommissioning failed but I'm advised by @elukey that the decommissioning can proceed. [13:43:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:43:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2023.codfw.wmnet with reason: host reimage [13:44:06] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1469 to wikikube-worker1122 - kamila@cumin1002" [13:44:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1469 to wikikube-worker1122 - kamila@cumin1002" [13:44:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:23] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1122 [13:46:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2023.codfw.wmnet with reason: host reimage [13:47:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1122 [13:47:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1469 to wikikube-worker1122 [13:47:50] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1117.eqiad.wmnet wikikube-worker1118.eqiad.wmnet wikikube-worker1119.eqiad.wmnet wikikube-worker1120.eqiad.wmnet wikikube-worker1121.eqiad.wmnet wikikube-worker1122.eqiad.wmnet on all recursors [13:47:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1117.eqiad.wmnet wikikube-worker1118.eqiad.wmnet 
wikikube-worker1119.eqiad.wmnet wikikube-worker1120.eqiad.wmnet wikikube-worker1121.eqiad.wmnet wikikube-worker1122.eqiad.wmnet on all recursors [13:49:59] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1117.eqiad.wmnet with OS bookworm [13:50:03] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1117 [13:50:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1117 [13:50:06] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1118.eqiad.wmnet with OS bookworm [13:50:10] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1118 [13:50:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1118 [13:50:13] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1119.eqiad.wmnet with OS bookworm [13:50:17] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1119 [13:50:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1119 [13:50:21] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1120.eqiad.wmnet with OS bookworm [13:50:24] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1120 [13:50:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1120 [13:50:28] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1121.eqiad.wmnet with OS bookworm [13:50:31] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1121 [13:50:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1121 [13:50:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1122.eqiad.wmnet with OS bookworm [13:50:37] 
!log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1122 [13:50:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1122 [13:52:27] (03CR) 10Filippo Giunchedi: "The main goal is to have a low cardinality edit rate metric, with useful labels. Just "entry" is in the order of 10 metrics, which I'm exp" [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [13:52:27] (03CR) 10Vgutierrez: [C:03+1] Update routing policy to introduce BACKUP_PATH community [homer/public] - 10https://gerrit.wikimedia.org/r/1112734 (https://phabricator.wikimedia.org/T354839) (owner: 10Cathal Mooney) [13:52:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:53:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:54:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [13:55:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [13:55:17] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:55:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [13:56:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [13:56:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [13:57:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [13:57:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [13:58:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [14:02:25] <_joe_> jouncebot: next [14:02:25] In 2 hour(s) and 27 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1630) [14:02:37] <_joe_> uh was the deployment window cancelled? [14:02:41] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:02:45] (03CR) 10Muehlenhoff: [C:03+2] Don't setup database config for tilerator on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1112167 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:03:30] <_joe_> uhh nope, [14:03:35] <_joe_> jouncebot: now [14:03:35] For the next 0 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1400) [14:03:46] <_joe_> ^_^ but no announcement [14:04:47] Lucas_WMDE: is there deployment window right now? [14:05:08] <_joe_> sfaci: there should be, I can get out of the way with my change now [14:05:16] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[14:06:02] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1117.eqiad.wmnet with reason: host reimage [14:06:05] (03CR) 10AikoChou: [C:03+1] ml-services: update article-country deployment image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112449 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:06:07] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1118.eqiad.wmnet with reason: host reimage [14:06:09] _joe_ it I'm also here to deploy a change [14:06:15] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1122.eqiad.wmnet with reason: host reimage [14:06:24] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1119.eqiad.wmnet with reason: host reimage [14:06:28] _joe_ it seems Lucas_WMDE is the only deployer who is connected right now [14:06:30] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1121.eqiad.wmnet with reason: host reimage [14:06:31] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1120.eqiad.wmnet with reason: host reimage [14:06:59] <_joe_> yeah I'm going to deploy my change first by myself, is what I was saying [14:07:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [14:07:25] o/ [14:07:35] Ah! sorry. 
I didn't understand you [14:07:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2023.codfw.wmnet with OS bookworm [14:07:41] Oh, it seems Lucas is here [14:07:41] just got back to the keyboard [14:07:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10476242 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2023.codfw.wmnet with OS bookworm completed: - ganeti202... [14:07:46] any idea why jouncebot didn’t announce the window? [14:08:00] <_joe_> nope [14:08:03] i also have a patch for this window, if possible (and i need a deployer :) ) [14:08:11] (03CR) 10Cathal Mooney: [C:03+2] Update routing policy to introduce BACKUP_PATH community [homer/public] - 10https://gerrit.wikimedia.org/r/1112734 (https://phabricator.wikimedia.org/T354839) (owner: 10Cathal Mooney) [14:09:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1117.eqiad.wmnet with reason: host reimage [14:09:11] _joe_: I assume you’re self-serving your config change? [14:09:17] I also need deployer [14:09:18] I can deploy afterwards [14:09:18] <_joe_> Lucas_WMDE: already did, yes [14:09:37] <_joe_> yeah I started early as my change needs a slight amount of nuance to be tested [14:10:45] <_joe_> looks like zuul is malfunctioning or something, as many checks are still "pending" [14:10:55] <_joe_> so it's not even merged yet :/ [14:10:56] (03Merged) 10jenkins-bot: Use a bespoke database configuration for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [14:11:06] <_joe_> see? 
I scared it :P [14:11:15] !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1109109|Use a bespoke database configuration for dumps (T382947)]] [14:11:19] T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947 [14:11:23] ^^ [14:11:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1121.eqiad.wmnet with reason: host reimage [14:12:03] Lucas_WMDE: _joe_: can you ping me when done? I also have a patch or two [14:12:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet [14:12:38] ihurbain, taavi: can you add your changes to the deployment calendar? [14:12:43] and then we can maybe prioritize them [14:12:55] mmh i thought i had, let me check [14:13:01] I don’t see it yet [14:13:06] ah, I think you added it to the evening window [14:13:09] urgh [14:13:18] indeed [14:13:22] well i'll do that tomorrow. 
[14:13:27] (03PS7) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [14:13:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72158 and previous config saved to /var/cache/conftool/dbconfig/20250120-141340-root.json [14:14:03] (T381237 would’ve helped in this case I guess) [14:14:04] T381237: Allow scheduling for current backport window - https://phabricator.wikimedia.org/T381237 [14:14:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1118.eqiad.wmnet with reason: host reimage [14:14:27] !rolling out policy changes to L3 switches and CRs to support new BACKUP_PATH community T354839 [14:14:28] nah i made a mess all on my own :D [14:14:30] T354839: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 [14:14:47] done [14:16:01] 06SRE, 10SRE-swift-storage: wikipedia-commons-local-thumb.4b corrupted causing 401 - https://phabricator.wikimedia.org/T384128#10476266 (10MatthewVernon) Logs: ` Jan 17 20:15:29 ms-be2081 container-server: Quarantined /srv/swift-storage/accounts1/containers/17692/ea3/451cd039c82e3456c69a99f613023ea3 to /sr... 
[14:16:18] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:1109109|Use a bespoke database configuration for dumps (T382947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:16:22] T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947 [14:17:01] (03PS1) 10DCausse: wdqs: bump image version to 0.3.152 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112762 (https://phabricator.wikimedia.org/T374919) [14:17:01] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "[Special:ListUsers/oauthadmin](https://wikitech.wikimedia.org/wiki/Special:ListUsers/oauthadmin) shows no remaining users with that right," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112333 (https://phabricator.wikimedia.org/T384122) (owner: 10Majavah) [14:17:02] (03PS1) 10DCausse: wdqs: enable new event stream api config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112763 (https://phabricator.wikimedia.org/T374919) [14:17:06] thx [14:17:09] (03CR) 10Ssingh: [C:03+1] "Looks good! Verified CAA record and affected zones." [dns] - 10https://gerrit.wikimedia.org/r/1112715 (https://phabricator.wikimedia.org/T376459) (owner: 10Vgutierrez) [14:17:17] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country deployment image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112449 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:17:28] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "No users left in [Special:ListUsers/oathauth](https://wikitech.wikimedia.org/wiki/Special:ListUsers/oathauth), should be okay." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112335 (https://phabricator.wikimedia.org/T384123) (owner: 10Majavah) [14:18:12] <_joe_> testinbg my change, I should be done in a couple minutes hopefully [14:18:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1122.eqiad.wmnet with reason: host reimage [14:19:06] (03Merged) 10jenkins-bot: ml-services: update article-country deployment image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112449 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:19:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "I think on bare metal this backport would potentially problematic (e.g. if a request sees the new `extension.json` before the other new fi" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112707 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [14:19:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:20:24] (03CR) 10AikoChou: EventStreamConfig: Add mediawiki.page_article_country_prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:20:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet [14:21:46] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: rename k8s rules groups [puppet] - 10https://gerrit.wikimedia.org/r/1112703 (https://phabricator.wikimedia.org/T369607) (owner: 10Filippo Giunchedi) [14:22:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2023.codfw.wmnet to cluster codfw and group A [14:22:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1119.eqiad.wmnet with reason: host reimage 
[14:22:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2023.codfw.wmnet to cluster codfw and group A [14:23:00] !log oblivian@deploy2002 oblivian: Continuing with sync [14:23:20] (03PS3) 10Vgutierrez: Add pki.goog CAA record on unified cert domains [dns] - 10https://gerrit.wikimedia.org/r/1112715 (https://phabricator.wikimedia.org/T376459) [14:26:14] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10476356 (10MoritzMuehlenhoff) [14:27:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1120.eqiad.wmnet with reason: host reimage [14:27:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1117.eqiad.wmnet with OS bookworm [14:28:26] !log upgraded cumin to v5.0.0 on cumin2002 [14:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] (03CR) 10JMeybohm: [C:03+1] "Sounds reasonable, thanks 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [14:28:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72159 and previous config saved to /var/cache/conftool/dbconfig/20250120-142846-root.json [14:29:44] !log volans@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1002.eqiad.wmnet with reason: testing cumin [14:30:02] !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109109|Use a bespoke database configuration for dumps (T382947)]] (duration: 18m 47s) [14:30:06] T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947 [14:30:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host wikikube-worker1121.eqiad.wmnet with OS bookworm [14:31:40] _joe_: can I continue with the other deployments? [14:32:17] (03CR) 10AikoChou: [C:03+1] EventStreamConfig: Add mediawiki.page_article_country_prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:33:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1118.eqiad.wmnet with OS bookworm [14:34:10] <_joe_> Lucas_WMDE: yes please [14:34:15] ok, thanks! [14:34:27] <_joe_> sorry, I was still testing stuff that was nonblocking [14:34:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112707 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [14:34:38] ah, ok [14:34:40] hope it went well ^^ [14:34:44] <_joe_> yes [14:36:12] (03CR) 10Lucas Werkmeister (WMDE): "I *think* if you mark this change as Depends-On: Idd45d24104a387f2788275e43bd3608eef9463c7 then `scap backport` will even warn (or error?)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [14:36:16] (03Merged) 10jenkins-bot: Add dedicated experimentation lab test module [extensions/WikimediaEvents] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112707 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [14:36:19] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1112707|Add dedicated experimentation lab test module (T373715)]] [14:36:20] (03CR) 10Vgutierrez: [C:03+2] Add pki.goog CAA record on unified cert domains [dns] - 10https://gerrit.wikimedia.org/r/1112715 (https://phabricator.wikimedia.org/T376459) (owner: 10Vgutierrez) [14:36:23] T373715: MPIC (aka EPIC): 
Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [14:36:24] (03PS2) 10Kevin Bazira: EventStreamConfig: Add mediawiki.article_country_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) [14:36:32] !log vgutierrez@dns1004 START - running authdns-update [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:56] (03CR) 10Kevin Bazira: EventStreamConfig: Add mediawiki.article_country_prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:38:31] !log vgutierrez@dns1004 END - running authdns-update [14:40:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1122.eqiad.wmnet with OS bookworm [14:40:46] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, cjming: Backport for [[gerrit:1112707|Add dedicated experimentation lab test module (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:41:01] sfaci: can you test the change on WikimediaDebug? [14:41:12] Lucas_WMDE: regarding the change you are deploying right now, I was here to test it but I'm afraid that a related piece that doesn't belong to this change is not working properly and I won't be able to test it right now. Anyway, it's safe to deploy as it is and we will have the opportunity to test it later. So, just having the change deployed is ok [14:41:22] ok [14:41:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, cjming: Continuing with sync [14:41:44] Anyway, the test is producing events and look at hive is they have been produced. 
I would take more than 30 minutes. We will test later when we can [14:41:48] Thank you very much Lucas_WMDE !!! [14:42:24] np :) [14:42:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1119.eqiad.wmnet with OS bookworm [14:43:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72160 and previous config saved to /var/cache/conftool/dbconfig/20250120-144351-root.json [14:45:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1120.eqiad.wmnet with OS bookworm [14:47:27] (03CR) 10AikoChou: [C:03+1] EventStreamConfig: Add mediawiki.article_country_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:48:31] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112707|Add dedicated experimentation lab test module (T373715)]] (duration: 12m 12s) [14:48:36] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [14:48:45] taavi: all yours [14:48:50] thanks [14:48:57] I also have an IRC question if you have a second ^^ [14:49:09] sure :D [14:49:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112333 (https://phabricator.wikimedia.org/T384122) (owner: 10Majavah) [14:49:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112335 (https://phabricator.wikimedia.org/T384123) (owner: 10Majavah) [14:50:08] (DM) [14:50:16] (03Merged) 10jenkins-bot: wikitech: Drop obsolete oauthadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112333 (https://phabricator.wikimedia.org/T384122) (owner: 10Majavah) 
[14:50:19] (03Merged) 10jenkins-bot: wikitech: Drop oathauth group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112335 (https://phabricator.wikimedia.org/T384123) (owner: 10Majavah) [14:50:34] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1112333|wikitech: Drop obsolete oauthadmin group (T384122)]], [[gerrit:1112335|wikitech: Drop oathauth group (T384123)]] [14:50:40] T384122: Drop local 'OAuth administrators' group from Wikitech - https://phabricator.wikimedia.org/T384122 [14:50:40] T384123: Remove OATH validators user group in Wikitech - https://phabricator.wikimedia.org/T384123 [14:50:46] (03CR) 10Kevin Bazira: EventStreamConfig: Add mediawiki.article_country_prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112451 (https://phabricator.wikimedia.org/T382295) (owner: 10Kevin Bazira) [14:51:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet [14:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [14:51:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10476575 (10ops-monitoring-bot) Draining ganeti2024.codfw.wmnet of running VMs [14:52:59] (03PS1) 10Muehlenhoff: Switch ganeti2024 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1112767 [14:53:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [14:54:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet [14:55:12] !log taavi@deploy2002 taavi: Backport for [[gerrit:1112333|wikitech: Drop obsolete oauthadmin group (T384122)]], 
[[gerrit:1112335|wikitech: Drop oathauth group (T384123)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:55:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10476595 (10kamila) [14:56:04] !log taavi@deploy2002 taavi: Continuing with sync [14:56:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10476596 (10ops-monitoring-bot) Draining ganeti2024.codfw.wmnet of running VMs [14:57:28] there's a pile of uncommitted changes in homer, some look like traffic stuff and some like DBs... can someone please look? [14:57:33] (03CR) 10Isabelle Hurbain-Palatin: "Oh neat!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [14:57:57] (I was trying to commit a few host renames, those can be safely committed anytime) [14:58:46] topranks: ^^^ [14:58:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72161 and previous config saved to /var/cache/conftool/dbconfig/20250120-145856-root.json [14:58:58] kamila_: yes I'm pushing changes to all network devices right now [14:59:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [14:59:08] ok, thanks topranks! [14:59:16] kamila_: is there a particular router you are trying to update? [14:59:25] cr*eqiad [14:59:42] ok, I'll push to those now for you [14:59:43] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1112172 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [14:59:46] thank you! [15:02:05] (03PS1) 10Phuedx: testwiki: Enable Metrics Platform configs fetching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112770 (https://phabricator.wikimedia.org/T381853) [15:02:56] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112333|wikitech: Drop obsolete oauthadmin group (T384122)]], [[gerrit:1112335|wikitech: Drop oathauth group (T384123)]] (duration: 12m 21s) [15:03:01] T384122: Drop local 'OAuth administrators' group from Wikitech - https://phabricator.wikimedia.org/T384122 [15:03:02] T384123: Remove OATH validators user group in Wikitech - https://phabricator.wikimedia.org/T384123 [15:03:12] ok I'm done deploying [15:03:38] jouncebot: nowandnext [15:03:38] No deployments scheduled for the next 1 hour(s) and 26 minute(s) [15:03:38] In 1 hour(s) and 26 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1630) [15:03:48] (03PS3) 10Brouberol: airflow: restore all airflow resources for the migrated instances except the systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1112769 (https://phabricator.wikimedia.org/T377601) [15:04:02] !log UTC afternoon backport+config window done [15:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:12] (03CR) 10CI reject: [V:04-1] airflow: restore all airflow resources for the migrated instances except the systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1112769 (https://phabricator.wikimedia.org/T377601) (owner: 10Brouberol) [15:04:44] (03PS4) 10Brouberol: airflow: restore airflow resources except the systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1112769 (https://phabricator.wikimedia.org/T377601) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:01] actually, I have one backport I’d like to deploy if that’s okay with everyone else [15:07:22] (dcausse: did you want to deploy something?) [15:07:41] Lucas_WMDE: please go ahead I have something but it's not directly related to MW [15:07:44] kamila_: cr*eqiad* are updated now [15:07:51] ok sure [15:08:20] thank you topranks! [15:08:26] (03PS1) 10Lucas Werkmeister (WMDE): Make known-good regex check strict [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112771 (https://phabricator.wikimedia.org/T380751) [15:08:35] (03CR) 10Btullis: [C:03+1] "Looks good. Should we disable puppet on all airflow hosts and just try it on an-test-client1002 first?" [puppet] - 10https://gerrit.wikimedia.org/r/1112769 (https://phabricator.wikimedia.org/T377601) (owner: 10Brouberol) [15:08:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112771 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [15:09:06] I’ve started the backport, if anyone wants me to not deploy you have ~15 minutes to tell me before the gate-and-submit build is done :) [15:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:09] (03CR) 10DCausse: "related changes where the new config vars were renamed: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1078991/11/streaming-updater" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1112762 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [15:13:51] (03CR) 10Brouberol: "Yep!" [puppet] - 10https://gerrit.wikimedia.org/r/1112769 (https://phabricator.wikimedia.org/T377601) (owner: 10Brouberol) [15:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72162 and previous config saved to /var/cache/conftool/dbconfig/20250120-151402-root.json [15:14:30] (03PS1) 10Fabfur: acmecerts: new param to use tmpfs storage for certificates [puppet] - 10https://gerrit.wikimedia.org/r/1112773 (https://phabricator.wikimedia.org/T384227) [15:15:56] (03CR) 10Brouberol: [C:03+2] airflow: restore airflow resources except the systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1112769 (https://phabricator.wikimedia.org/T377601) (owner: 10Brouberol) [15:17:22] (03CR) 10Gmodena: [C:03+1] wdqs: bump image version to 0.3.152 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112762 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [15:17:55] (03CR) 10DCausse: [C:03+2] wdqs: bump image version to 0.3.152 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112762 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [15:18:14] !log issues power off via mgmt UI for db2131 (failed to power off during decommissioning) [15:18:14] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2282.codfw.wmnet with reason: decommissioning host [15:18:16] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112773 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [15:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:53] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2282.codfw.wmnet [15:18:54] !log jelto@cumin1002 END (FAIL) - Cookbook 
sre.k8s.pool-depool-node (exit_code=99) depool for host mw2282.codfw.wmnet [15:19:10] (03Merged) 10jenkins-bot: wdqs: bump image version to 0.3.152 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112762 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [15:21:33] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:21:55] !log jelto@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw2282.codfw.wmnet [15:22:19] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:25:07] 10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10476793 (10elukey) p:05Triage→03Medium [15:25:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10476794 (10elukey) p:05Triage→03Medium [15:26:31] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [15:28:52] (03Merged) 10jenkins-bot: Make known-good regex check strict [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112771 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [15:29:11] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1112771|Make known-good regex check strict (T380751)]] [15:29:15] T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751 [15:29:54] whee [15:31:56] nothing to test for this change btw [15:32:10] the code in question is unreachable until a config change (probably tomorrow) [15:32:45] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1112771|Make known-good regex check strict (T380751)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [15:32:48] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [15:36:18] (03PS1) 10Brouberol: airflow: restore kerberos keytab in task pod containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112777 (https://phabricator.wikimedia.org/T378441) [15:36:31] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw2282.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jelto@cumin1002" [15:37:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw2282.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jelto@cumin1002" [15:37:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:37:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2282.codfw.wmnet [15:39:46] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112771|Make known-good regex check strict (T380751)]] (duration: 10m 35s) [15:39:50] T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751 [15:40:00] * Lucas_WMDE done deploying [15:40:05] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:42:21] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Add known-good regexes for WikibaseQualityConstraints (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1112261 (https://phabricator.wikimedia.org/T380751) (owner: 10Audrey Penven) [15:43:18] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:43:20] !log brouberol@deploy2002 Started deploy [airflow-dags/analytics_test@516e8f2]: (no justification provided) [15:43:32] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:43:36] !log brouberol@deploy2002 Finished deploy [airflow-dags/analytics_test@516e8f2]: (no justification provided) (duration: 00m 21s) [15:45:05] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10decommission-hardware: decommission mw2282.codfw.wmnet - https://phabricator.wikimedia.org/T384226#10476878 (10Jelto) a:05Jelto→03None [15:45:43] !log installing python-tornado security updates [15:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:45] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/wmde AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:47:24] (03CR) 10Btullis: [C:03+1] airflow: restore kerberos keytab in task pod containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112777 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [15:47:29] (03CR) 10Brouberol: [C:03+2] airflow: restore kerberos keytab in task pod containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112777 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [15:48:12] (03PS4) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [15:48:56] FIRING: [2x] ProbeDown: Service 
mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:50:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:50:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1112750 (owner: 10Slyngshede) [15:50:41] (03PS5) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [15:53:35] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:53:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:59] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:55:58] (03PS14) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [15:55:58] (03PS1) 10JMeybohm: Revert "Create certificates for Typha/Felix mTLS" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) [15:56:25] (03Abandoned) 10JMeybohm: calico: Add support for Typha/Felix mTLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112235 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [15:56:34] (03Abandoned) 10JMeybohm: Update 
calico to 0.2.11 in staging-codfw and enable mTLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112236 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [16:01:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:01:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:01:26] (03PS6) 10JMeybohm: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) [16:01:26] (03PS7) 10JMeybohm: Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) [16:01:26] (03PS3) 10JMeybohm: admin_ng: Install VAPs instead of PSPs on k8s >= 1.24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112183 (https://phabricator.wikimedia.org/T341984) [16:01:26] (03PS8) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) [16:01:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [16:02:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [16:02:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [16:02:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [16:03:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [16:03:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [16:04:48] !log brouberol@deploy2002 Started deploy [airflow-dags/search@8c96899]: (no justification provided) [16:05:14] !log brouberol@deploy2002 Finished 
deploy [airflow-dags/search@8c96899]: (no justification provided) (duration: 00m 31s) [16:05:17] (03PS6) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [16:12:58] (03PS1) 10Daimona Eaytoy: beta: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112795 (https://phabricator.wikimedia.org/T380817) [16:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:40] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to analytics-privatedata-users to access https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10477049 (10jcrespo) Hello, SRE on clinic duty this week. I am trying the understand well the request. Do you currently acc... 
[16:17:53] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to analytics-privatedata-users to access https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10477051 (10jcrespo) p:05Triage→03High [16:18:01] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239 (10isarantopoulos) 03NEW [16:19:49] * urbanecm just took test.wikipedia.org down [16:19:55] (I'll fix it) [16:20:58] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1117-1122].eqiad.wmnet [16:21:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1117-1122].eqiad.wmnet [16:21:12] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/research AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:21:55] (03PS1) 10Hnowlan: rest-gateway: add page/lint rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112797 (https://phabricator.wikimedia.org/T384216) [16:23:02] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%): /tmp 0 MB (0% inode=86%): /var/tmp 0 MB (0% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [16:24:31] (03PS1) 10Hnowlan: trafficserver: route /lint to rest-gateway for testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1112800 (https://phabricator.wikimedia.org/T384216) [16:28:58] (03PS1) 10Jcrespo: dbbackups: Remove set 
user permissions from m1 backup user grants [puppet] - 10https://gerrit.wikimedia.org/r/1112802 (https://phabricator.wikimedia.org/T383902) [16:29:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:30:05] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1630). [16:30:23] (03PS1) 10Hnowlan: httpbb: use k8s jobrunners for healthchecking [puppet] - 10https://gerrit.wikimedia.org/r/1112728 (https://phabricator.wikimedia.org/T383317) [16:30:47] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10477164 (10isarantopoulos) Related patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1109414 [16:31:10] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10477168 (10isarantopoulos) Approved! [16:32:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10477177 (10phaultfinder) [16:34:31] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to analytics-privatedata-users to access https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10477188 (10jcrespo) A quick search tells me you don't have wmf staff rights to start with (unless I am not finding your LD... 
[16:36:23] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to analytics-privatedata-users to access https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10477203 (10jcrespo) [16:38:34] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to analytics-privatedata-users to access https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10477205 (10jcrespo) Please confirm on the request summary up here your LDAP account: UID, cname or mail account, and we sh... [16:39:11] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10477206 (10jcrespo) [16:39:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:40:36] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112805 (https://phabricator.wikimedia.org/T128546) [16:41:10] (03PS2) 10Fabfur: acmecerts: new param to use tmpfs storage for certificates [puppet] - 10https://gerrit.wikimedia.org/r/1112773 (https://phabricator.wikimedia.org/T384227) [16:42:05] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112805 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:42:58] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112805 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:45:18] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1112773 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [16:50:44] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to WMF group 
for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10477259 (10jcrespo) a:03DSantamaria [16:54:40] (03Abandoned) 10Phuedx: testwiki: Enable Metrics Platform configs fetching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112770 (https://phabricator.wikimedia.org/T381853) (owner: 10Phuedx) [16:58:18] (03PS1) 10Jgiannelos: kartotherian: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112808 [16:59:27] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10477273 (10jcrespo) To try to speed up confirmations, 'restricted' is documented at data.yml to require @thcipriani approval. So asking in advance (although totally ok if you prefer... [16:59:45] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1112805| Bumping portals to master (T128546)]] (duration: 13m 40s) [16:59:49] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [17:02:28] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1112805| Bumping portals to master (T128546)]] (duration: 02m 42s) [17:03:21] (03CR) 10Clément Goubert: [C:03+1] httpbb: use k8s jobrunners for healthchecking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112728 (https://phabricator.wikimedia.org/T383317) (owner: 10Hnowlan) [17:06:59] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10477305 (10jcrespo) @gkyziridis please fill in above your Developer account... 
[17:07:30] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10477307 (10jcrespo) [17:07:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:08:15] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10477308 (10jcrespo) p:05Triage→03High [17:08:22] !incidents [17:08:23] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [17:08:23] 5613 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [17:08:23] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [17:08:26] !ack 5611 [17:08:26] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [17:08:32] !ack 5612 [17:08:33] Attempt to ack incident 5612 failed. 
[17:08:39] oh weird ordering [17:08:40] !ack 5613 [17:08:41] 5613 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [17:11:01] It's the NTT transit circuit triggering the ping [17:15:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS [17:15:14] v6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:15:56] (03CR) 10Jcrespo: [C:04-1] "Blocked on analytics approvals and other checkboxes at https://phabricator.wikimedia.org/T384239" [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (owner: 10Klausman) [17:20:00] (03PS1) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [17:22:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [17:22:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:23:41] (03PS1) 10Jgiannelos: kartotherian: Fix outgoing MW api req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112813 [17:24:42] 
10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10477341 (10phaultfinder) [17:25:18] (03CR) 10Elukey: "+1 on all the blockers, but in theory analytics-privatedata-users should not require DE's approval: https://wikitech.wikimedia.org/w/index" [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (owner: 10Klausman) [17:26:00] (03CR) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [17:26:47] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Georgios Kyziridis - https://phabricator.wikimedia.org/T384239#10477343 (10Aklapper) The LDAP account can also be found on https://phabricat... [17:27:27] (03PS1) 10Vgutierrez: secret: Add dummy pki.goog staging private key [labs/private] - 10https://gerrit.wikimedia.org/r/1112814 [17:27:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:27:44] !ack 5614 [17:27:45] 5614 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [17:28:03] (03CR) 10Vgutierrez: [V:03+2 C:03+2] secret: Add dummy pki.goog staging private key [labs/private] - 10https://gerrit.wikimedia.org/r/1112814 (owner: 10Vgutierrez) [17:28:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:29:54] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4835/co" [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [17:30:44] (03PS1) 10Hnowlan: trafficserver: route all linting traffic via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1112815 (https://phabricator.wikimedia.org/T384216) [17:32:05] !ack 5615 [17:32:05] 5615 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [17:32:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:34:07] (03PS5) 10Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) [17:34:49] (03PS2) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [17:37:18] (03PS7) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) [17:38:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:38:42] (03PS3) 10Vgutierrez: hiera: Add pki.goog staging account to acmechief-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/1112812 (https://phabricator.wikimedia.org/T384195) [17:39:15] (03CR) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository (031 comment) [alerts] - 
10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [17:40:02] (03CR) 10Andrea Denisse: "While awaiting input from WMCS regarding the thresholds, I’ve updated my patch to trigger a Warning alert if it persists for 10 minutes an" [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [17:42:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:42:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:42:41] !ack 5616 [17:42:42] 5616 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [17:46:44] (03PS2) 10Jgiannelos: kartotherian: Fix outgoing MW api req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112813 [17:47:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:47:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:48:55] (03CR) 10Elukey: [C:03+1] kartotherian: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112808 (owner: 10Jgiannelos) [17:49:07] (03CR) 10Elukey: [C:03+1] kartotherian: Fix outgoing MW api req template 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1112813 (owner: 10Jgiannelos) [17:50:50] (03CR) 10Jgiannelos: [C:03+2] kartotherian: Fix outgoing MW api req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112813 (owner: 10Jgiannelos) [17:50:59] (03CR) 10Jgiannelos: [C:03+2] kartotherian: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112808 (owner: 10Jgiannelos) [17:52:17] (03Merged) 10jenkins-bot: kartotherian: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112808 (owner: 10Jgiannelos) [17:52:22] (03Merged) 10jenkins-bot: kartotherian: Fix outgoing MW api req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112813 (owner: 10Jgiannelos) [17:59:22] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1800) [18:00:05] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T1800). 
[18:00:09] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [18:09:30] FIRING: Processor usage over 85%: Alert for device cr1-eqiad.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [18:14:30] RESOLVED: Processor usage over 85%: Device cr1-eqiad.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [18:14:39] !incidents [18:14:39] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [18:14:39] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [18:14:40] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [18:14:40] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [18:14:40] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [18:14:40] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [18:14:41] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [18:20:49] PROBLEM - MariaDB Replica SQL: s2 #page on db2148 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:21:03] 06SRE, 06Infrastructure-Foundations, 10netops: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253 (10cmooney) 03NEW p:05Triage→03Medium [18:21:14] hello new p.age [18:21:15] !incidents [18:21:15] 5611 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [18:21:16] 5618 (UNACKED) db2148 (paged)/MariaDB Replica SQL: s2 (paged) [18:21:16] 5617 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [18:21:16] 5616 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [18:21:16] 5615 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr2-eqord.wikimedia.org) [18:21:17] 5614 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-eqiad.wikimedia.org) [18:21:17] 5613 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [18:21:17] 5612 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [18:21:25] !ack 5618 [18:21:25] 5618 (ACKED) db2148 (paged)/MariaDB Replica SQL: s2 (paged) [18:21:42] I take care of this one [18:21:45] <3 [18:21:49] 06SRE, 06Infrastructure-Foundations, 10netops: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10477517 (10cmooney) [18:23:09] 06SRE, 06Infrastructure-Foundations, 10netops: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10477523 (10cmooney) [18:26:04] it should recover now [18:26:49] RECOVERY - MariaDB Replica SQL: s2 #page on db2148 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:29:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10477535 (10phaultfinder) [18:44:18] 
(03PS1) 10Kamila Součková: wikikube: rename mw147[0-5] -> wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1112828 (https://phabricator.wikimedia.org/T365571) [18:46:38] PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [18:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [18:51:48] oh [18:59:35] (03PS1) 10Cathal Mooney: TE for traffic to ATT & Verizon in Eqiad + new generic policy [homer/public] - 10https://gerrit.wikimedia.org/r/1112832 (https://phabricator.wikimedia.org/T384253) [19:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [19:42:52] (03PS1) 10ZhaoFJx: cawiki: Create templateeditor & protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) [19:45:11] !log dropping blobs table where it's empty (T376627) [19:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:15] T376627: Drop ad-hoc or obsolete tables in production - https://phabricator.wikimedia.org/T376627 [19:50:47] 06SRE, 
06Infrastructure-Foundations, 10netops: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258 (10cmooney) 03NEW p:05Triage→03Medium [19:51:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10477776 (10VRiley-WMF) [19:54:15] 06SRE, 06Infrastructure-Foundations, 10netops: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477780 (10cmooney) [19:54:32] 06SRE, 06Infrastructure-Foundations, 10netops: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477781 (10ssingh) Thanks for filing this task and looking into it! Just one more data point: this seems to have started Friday Jan 17 a... [19:57:36] 06SRE, 06Infrastructure-Foundations, 10netops: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477783 (10ssingh) Might be a red herring: The only thing I see that might be close is https://sal.toolforge.org/log/h5lbdZQBKFqumxvtiNp... [19:57:41] 06SRE, 06Infrastructure-Foundations, 10netops: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477784 (10cmooney) >>! In T384258#10477781, @ssingh wrote: > Thanks for filing this task and looking into it! Just one more data point:... 
[20:00:53] (03PS2) 10Cathal Mooney: TE for traffic to ATT & Verizon in Eqiad + new generic policy [homer/public] - 10https://gerrit.wikimedia.org/r/1112832 (https://phabricator.wikimedia.org/T384253) [20:01:39] (03PS3) 10Cathal Mooney: TE for traffic to ATT & Verizon in Eqiad + new generic policy [homer/public] - 10https://gerrit.wikimedia.org/r/1112832 (https://phabricator.wikimedia.org/T384253) [20:02:25] (03PS2) 10ZhaoFJx: cawiki: Create templateeditor & protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112838 (https://phabricator.wikimedia.org/T384145) [20:04:33] (03PS4) 10Cathal Mooney: TE for traffic to ATT & Verizon in Eqiad + new generic policy [homer/public] - 10https://gerrit.wikimedia.org/r/1112832 (https://phabricator.wikimedia.org/T384253) [20:08:28] (03CR) 10Cathal Mooney: [C:03+2] TE for traffic to ATT & Verizon in Eqiad + new generic policy [homer/public] - 10https://gerrit.wikimedia.org/r/1112832 (https://phabricator.wikimedia.org/T384253) (owner: 10Cathal Mooney) [20:08:33] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477785 (10cmooney) >>! In T384258#10477783, @ssingh wrote: > Might be a red herring: The only thing I see that might... 
[20:11:46] (03Merged) 10jenkins-bot: TE for traffic to ATT & Verizon in Eqiad + new generic policy [homer/public] - 10https://gerrit.wikimedia.org/r/1112832 (https://phabricator.wikimedia.org/T384253) (owner: 10Cathal Mooney) [20:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:35] (03CR) 10Pppery: "Adding some reviewers that it seems like reviewer-bot should have added per the rules at https://www.mediawiki.org/wiki/Git/Reviewers#oper" [puppet] - 10https://gerrit.wikimedia.org/r/1112123 (owner: 10Pppery) [20:23:58] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#10477814 (10cmooney) 05Open→03Resolved Config is applied across the network now. Backup PyBal routes (where MED=100) are now gettin... 
[20:24:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110866 (https://phabricator.wikimedia.org/T383161) (owner: 10Pppery) [20:26:50] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477829 (10cmooney) [20:31:04] 06SRE, 06Infrastructure-Foundations, 10netops: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10477835 (10cmooney) [20:35:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10477840 (10VRiley-WMF) [20:43:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112795 (https://phabricator.wikimedia.org/T380817) (owner: 10Daimona Eaytoy) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T2100). [21:00:05] Nemoralis, Pppery, and cmelo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] here [21:00:13] o/ [21:00:36] o/ [21:20:15] no deployer? [21:20:40] I was about to ask the same [21:26:37] i can deploy today [21:26:44] thank you!!! [21:26:55] Pppery: Nemoralis: still around? 
[21:27:00] yes [21:27:02] (03CR) 10Urbanecm: [C:03+2] beta: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112795 (https://phabricator.wikimedia.org/T380817) (owner: 10Daimona Eaytoy) [21:27:13] (But wasn't watching IRC super closely so thanks for the ping) [21:27:44] (03Merged) 10jenkins-bot: beta: Enable $wgCampaignEventsEnableEventTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112795 (https://phabricator.wikimedia.org/T380817) (owner: 10Daimona Eaytoy) [21:27:46] (03CR) 10Urbanecm: [C:03+2] Add simplewiki to mobile-anon-talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110866 (https://phabricator.wikimedia.org/T383161) (owner: 10Pppery) [21:28:30] (03Merged) 10jenkins-bot: Add simplewiki to mobile-anon-talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110866 (https://phabricator.wikimedia.org/T383161) (owner: 10Pppery) [21:30:34] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1110866|Add simplewiki to mobile-anon-talk (T383161)]], [[gerrit:1112795|beta: Enable $wgCampaignEventsEnableEventTopics (T380817)]] [21:30:40] T383161: Enable Talk tabs at top of page on simple Wikipedia - https://phabricator.wikimedia.org/T383161 [21:30:41] T380817: Enable the event topics feature in beta - https://phabricator.wikimedia.org/T380817 [21:35:27] !log urbanecm@deploy2002 pppery, daimona, urbanecm: Backport for [[gerrit:1110866|Add simplewiki to mobile-anon-talk (T383161)]], [[gerrit:1112795|beta: Enable $wgCampaignEventsEnableEventTopics (T380817)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:35:34] Pppery: can you test your patch at mwdebug, please? [21:35:38] looking [21:35:40] cmelo: your patch is deployed to beta [21:36:06] Thanks, testing it now [21:36:31] urbanecm: yes, I am here [21:36:44] My patch looks good [21:36:51] you probably remember my patch. 
I want to recheck what was wrong with that [21:36:55] !log urbanecm@deploy2002 pppery, daimona, urbanecm: Continuing with sync [21:37:02] Pppery: thanks, proceeding [21:37:42] Thanks urbanecm mine is tested and ok, thank you so much!!! [21:38:16] Nemoralis: can you clarify a bit why are we deploying the same patch with no changes? I see it was previously reverted, because it "did not work", but I don't remember what that means [21:38:25] cmelo: awesome! thanks for confirming. [21:41:05] urbanecm: I couldn't figure out why it didn't work last time. Now I've checked my code again and I see that I don't need to change anything. I changed wgSitename and wgMetanamespace [21:41:31] Nemoralis: can you clarify what "didn't work" means, please? what were you observing when we last tried this? [21:41:42] The commit message is wrong - it says it's for uzwiki not uzwiktionary [21:42:00] plus, that [21:42:09] (which is completely orthogonal to Urbanecm's other concerns, and trivial to fix, though) [21:42:16] (03PS6) 10NMW03: Update uzwiktionary project namespace and site name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079619 (https://phabricator.wikimedia.org/T362620) [21:43:55] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1110866|Add simplewiki to mobile-anon-talk (T383161)]], [[gerrit:1112795|beta: Enable $wgCampaignEventsEnableEventTopics (T380817)]] (duration: 13m 21s) [21:44:00] T383161: Enable Talk tabs at top of page on simple Wikipedia - https://phabricator.wikimedia.org/T383161 [21:44:01] T380817: Enable the event topics feature in beta - https://phabricator.wikimedia.org/T380817 [21:44:34] The IRC logs for the first deployment attempt give no detail - Nemoralis just says it didn't work. 
The patch looks fine to me too [21:44:52] thats the part I don't understand too [21:45:06] https://wm-bot.wmcloud.org/logs/%23wikimedia-operations/20240424.txt [21:45:16] [13:39:35] [21:45:29] I think most likely some cache needed to be purged but wasn't during the first deployment attempt. [21:45:44] Pppery: my best bet is a client side cache [21:46:13] or maybe you need to purge the page in order for it to render the new title rather than get the old one from parser cache? I remember something similar happening during the cleanupTitles runs [21:46:27] I remembered that when I tested the changes, I saw that there were no update on the namespace name, and I was in a hurry at the time so I didn't have a chance to look in detail [21:46:33] gotcha [21:46:40] that makes cache a possible explanation [21:46:48] let's try again then [21:46:59] It seems unlikely, but could Mediawiki have considered these two comma symbols to be the same? [21:47:20] ‘ vs ʻ [21:47:20] No [21:47:42] The two symbols redirect to different pages on enwiki: https://en.wikipedia.org/wiki/‘ versus https://en.wikipedia.org/wiki/ʻ [21:48:12] (03CR) 10Urbanecm: "22:38 Nemoralis: can you clarify a bit why are we deploying the same patch with no changes? 
I see it was previously reverted, b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079619 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [21:48:18] (03CR) 10Urbanecm: [C:03+2] Update uzwiktionary project namespace and site name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079619 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [21:48:23] it looks like first one is apostrophe, second one is U+02BB [21:49:03] (03Merged) 10jenkins-bot: Update uzwiktionary project namespace and site name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079619 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [21:50:19] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1079619|Update uzwiktionary project namespace and site name (T362620)]] [21:50:23] T362620: Namespace changes on uzwiktionary - https://phabricator.wikimedia.org/T362620 [21:50:48] can you ctrl + f on the https://uz.wiktionary.org/wiki/Maxsus:NamespaceInfo ? [21:50:59] my browser says it is same symbol lol [21:51:14] Nemoralis: browsers typically merge visually-similar characters together [21:51:21] but mediawiki treats them separately [21:51:30] at this point, it's most likely it is a cache of some sort [21:51:33] I wonder what will api response say [21:51:40] Try checking "match diacritics" [21:51:55] Pppery: chrome doesn't have that option :) [21:52:02] Pppery: ah yes! thanks! [21:53:13] https://uz.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases [21:53:19] api response treats them as a unicode [21:53:51] https://www.compart.com/en/unicode/U+2018 [21:54:03] !log urbanecm@deploy2002 nmw03, urbanecm: Backport for [[gerrit:1079619|Update uzwiktionary project namespace and site name (T362620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:54:10] so it should say 2bb [21:54:31] Nemoralis: can you test at mwdebug, please? [21:54:36] okay it worked lol! 
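Note on the ‘ vs ʻ question above: the two characters can be told apart by code point rather than by eye. The following is a minimal sketch, not part of the original conversation. It assumes the first character is U+2018 (per the compart link posted in the channel) and the second is U+02BB (as stated in the channel). The helper names `describe`, `same_after_nfkc` and `decode_json_string` are made up for illustration.

```python
import json
import unicodedata

# Assumption: these are the two look-alike characters from the discussion.
LEFT_QUOTE = "\u2018"    # ‘
TURNED_COMMA = "\u02bb"  # ʻ


def describe(ch: str) -> str:
    """Return code point, Unicode name and general category of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


def same_after_nfkc(a: str, b: str) -> bool:
    """Check whether two strings become equal under NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def decode_json_string(fragment: str) -> str:
    """Decode a JSON string literal, e.g. an escaped value copied from an API reply."""
    return json.loads(fragment)


if __name__ == "__main__":
    print(describe(LEFT_QUOTE))
    print(describe(TURNED_COMMA))
    print("equal after NFKC:", same_after_nfkc(LEFT_QUOTE, TURNED_COMMA))
    # Escaped form of a namespace name as it might appear in a JSON response.
    print(decode_json_string(r'"Vikilug\u02bbat"'))
```

The characters are distinct code points in different general categories, and normalization does not fold one into the other, which is consistent with software treating them as different even when a browser's find-in-page matches both.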
[21:54:43] let me check Special:NamespaceInfo to [21:54:46] *too [21:55:16] I tested it myself, and indeed a page in the namespace still showed using the old name until I purged it with X-Wikimedia-Debug on and then it uses the new name [21:55:22] LGTM [21:55:32] you can also check https://uz.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases [21:55:40] it says "Vikilug\u02bbat" [21:55:48] I know, but I wanted to integration test the most stuff possible [21:55:58] makes sense [21:55:59] thanks both [21:56:01] proceeding [21:56:03] !log urbanecm@deploy2002 nmw03, urbanecm: Continuing with sync [21:56:04] <3 [22:00:05] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250120T2200) [22:00:57] finishing a deployment [22:03:04] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079619|Update uzwiktionary project namespace and site name (T362620)]] (duration: 12m 45s) [22:03:08] T362620: Namespace changes on uzwiktionary - https://phabricator.wikimedia.org/T362620 [22:03:12] Nemoralis: should be deployed [22:03:14] anything else? [22:03:17] thanks! [22:03:18] nope [22:03:21] let me also run namespace dupes [22:04:12] so many things to fix [22:07:10] over 3.5M of logs. [22:08:40] !log Run mwscript-k8s -f namespaceDupes.php -- --wiki=uzwiktionary --fix # T362620 # logs are at P72163 [22:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:45] T362620: Namespace changes on uzwiktionary - https://phabricator.wikimedia.org/T362620 [22:10:09] thanks [22:11:16] > 41506 links to fix, 41491 were resolvable, 15 were deleted. 
[22:11:24] Yikes that's a lot of links given the size of the wiki [22:12:26] And I was going to ask you to rerun it again with a prefix but it looks like you already did [22:14:07] Pppery: yeah, i did it in two batches, so that i could get the important logs alone [22:14:21] !log [urbanecm@deploy2002 ~]$ mwscript-k8s -f namespaceDupes.php -- --wiki=uzwiktionary --fix --add-prefix=BROKEN # T362620, logs posted to the task [22:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:25] T362620: Namespace changes on uzwiktionary - https://phabricator.wikimedia.org/T362620 [22:14:42] posted at https://phabricator.wikimedia.org/T362620#10477971 [22:14:55] Nemoralis: please check the pages listed in my comment and delete/rename as appropriate. [22:15:22] thanks! I know someone from uzbek community, I will let them know [22:15:25] ty [22:15:30] * urbanecm done [22:17:30] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [22:19:20] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29349 bytes in 0.927 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:19:23] Filed https://phabricator.wikimedia.org/T384263 as fallout from this [22:20:22] good idea :) [22:30:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:51:31] FIRING: [6x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [23:10:30] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:22] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10478045 (10Andrew) On @CDanis's suggestion, 'static wikitech-static' can now be built in a docker container using https://gitlab.wikimedia.org/repos/sre/wikitech-static-... [23:36:05] PROBLEM - MariaDB Replica SQL: s2 #page on db2207 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:37:27] !log run OPTIMIZE TABLE recentchanges on db2207 [23:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:05] RECOVERY - MariaDB Replica SQL: s2 #page on db2207 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:39:51] thanks jynus