[00:50:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:52:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:35:36] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:37:46] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:48:16] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [02:07:16] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [02:30:54] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [02:37:06] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [04:23:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [04:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [04:42:40] (03PS1) 104nn1l2: Set 'WP' namespace alias to NS_PROJECT in mnw.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742279 (https://phabricator.wikimedia.org/T296606) [04:57:25] (03PS1) 10RLazarus: Packaging fixes: [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/742280 [04:58:35] (03CR) 10jerkins-bot: [V: 04-1] Packaging fixes: 
[docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/742280 (owner: 10RLazarus) [05:04:39] (03PS2) 10RLazarus: Packaging fixes: [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/742280 [05:36:02] (03PS1) 10Marostegui: pc2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/742282 (https://phabricator.wikimedia.org/T295965) [05:37:09] (03CR) 10Marostegui: [C: 03+2] pc2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/742282 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [05:39:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2014.codfw.wmnet with OS bullseye [05:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:14] RECOVERY - snapshot of s3 in eqiad on alert1001 is OK: Last snapshot for s3 at eqiad (db1102.eqiad.wmnet:3313) taken on 2021-11-29 04:14:59 (1166 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [05:58:44] RECOVERY - MariaDB Replica SQL: s1 on db1139 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:11:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2014.codfw.wmnet with OS bullseye [06:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:27] RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2139.codfw.wmnet:3313) taken on 2021-11-29 03:48:31 (1150 GB) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [07:02:53] (03CR) 10MMandere: [C: 03+2] admin: Add user rosalie-wmde to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/742152 (https://phabricator.wikimedia.org/T295765) (owner: 10MMandere) [07:09:56] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, and 2 others: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10MMandere) 05Open→03Resolved @Rosalie_WMDE you now have access to `releasers-wikibase` . Please feel free to reach... [07:21:29] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for - https://phabricator.wikimedia.org/T295898 (10Aklapper) 05Resolved→03Invalid a:05CGlenn→03None Unfortunately no reply, thus closing as invalid as it seems no actions were performed. If this is still wanted, please feel free to reop... [07:28:20] (03PS1) 10Urbanecm: foundationwiki: Disable hard redirects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742413 [07:28:22] (03PS1) 10Urbanecm: foundationwiki: Clear group add/remove declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742414 [07:28:24] (03PS1) 10Urbanecm: foundationwiki: Do not enable wmgUsePageViewInfo explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742415 [07:28:40] jouncebot: nowandnext [07:28:40] For the next 0 hour(s) and 31 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211128T0800) [07:28:40] In 4 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T1200) [07:31:12] !log elukey@deploy1002 Started deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 - (second attempt, no git update submodules the first time) [07:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:16] !log elukey@deploy1002 Finished deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 - (second attempt, no git update submodules the first time) (duration: 00m 04s) [07:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:46] (on the node the dependency seems being deployed since yesterday, but just to be sure) [07:32:26] 10Puppet, 10Infrastructure-Foundations, 10Project-Admins, 10PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (10Aklapper) @joanna_borun: Could you please answer? Thanks in advance! 
:) [07:44:03] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10wiki_willy) a:03Cmjohnson [07:44:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10wiki_willy) a:03Cmjohnson [07:44:59] 10SRE, 10ops-eqiad, 10serviceops: Kubernetes1018's eth negotiated speed is 10MB/s - https://phabricator.wikimedia.org/T296369 (10wiki_willy) a:03Cmjohnson [07:48:01] (03PS3) 10Kosta Harlan: GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) [07:53:35] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [07:57:12] (03PS1) 10Marostegui: misc_multiinstance.my.cnf.erb: Change innodb_checksum_algorithm [puppet] - 10https://gerrit.wikimedia.org/r/742418 (https://phabricator.wikimedia.org/T287244) [07:58:29] (03PS1) 10Muehlenhoff: Fix access date for bumeh-ctr [puppet] - 10https://gerrit.wikimedia.org/r/742419 [07:59:32] (03CR) 10Marostegui: [C: 03+2] misc_multiinstance.my.cnf.erb: Change innodb_checksum_algorithm [puppet] - 10https://gerrit.wikimedia.org/r/742418 (https://phabricator.wikimedia.org/T287244) (owner: 10Marostegui) [07:59:47] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [08:00:01] !log elukey@deploy1002 Started deploy [ores/deploy@69ed061]: Upgrade of mwparserfromhell - T296563 [08:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:32] !log Restart db2078 and db1117 [08:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:02] !log elukey@deploy1002 Finished deploy [ores/deploy@69ed061]: Upgrade of mwparserfromhell - T296563 (duration: 07m 01s) [08:07:05] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:32] goooood [08:08:49] (03CR) 10Muehlenhoff: [C: 03+2] Fix access date for bumeh-ctr [puppet] - 10https://gerrit.wikimedia.org/r/742419 (owner: 10Muehlenhoff) [08:15:32] (03PS1) 10Muehlenhoff: Add library hint for libntlm [puppet] - 10https://gerrit.wikimedia.org/r/742420 [08:19:19] !log instaling libntlm security updates [08:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:55] !log installing libvpx security updates [08:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:09] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:33:04] !log installing bluez security updates [08:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:23] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [08:35:40] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10MoritzMuehlenhoff) >>! In T295767#7529602, @ayounsi wrote: > All 3 VMs got rebuilt with larger disks, but with the default Debian Buster. > > @MoritzMuehlenh... 
[08:35:58] (03PS3) 10Muehlenhoff: Add ownership annotations for more Service SRE services [puppet] - 10https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) [08:41:57] (03PS1) 10Kosta Harlan: Fix error handling in SuggestedEdits::getActionData() [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742254 (https://phabricator.wikimedia.org/T296366) [08:49:29] (03CR) 10Muehlenhoff: [C: 03+2] Add ownership annotations for more Service SRE services [puppet] - 10https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [08:52:51] (03PS1) 10Vgutierrez: site: Reimage cp2041 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/742422 (https://phabricator.wikimedia.org/T290005) [08:54:55] !log installing ICU security updates on buster [08:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:40] !log depool cp2041 to be reimaged as cache::text_haproxy - T290005 [08:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:44] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:57:57] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2041 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/742422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [08:59:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS buster [08:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:14] (03CR) 10David Caro: openstack: refactor puppetmaster access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [08:59:21] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2041.codfw.wmnet with OS buster [09:00:56] 10SRE, 10Release-Engineering-Team, 10cloud-services-team (Kanban): Consider removing labnet-users group - https://phabricator.wikimedia.org/T296574 (10MoritzMuehlenhoff) For context/clarification: This group currently permits the unprivileged login to cloudnet* servers. Tagging the current members of the lab... [09:01:10] (03PS2) 10Jelto: charts: fix affinity indentation in charts and scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 [09:07:09] (03Abandoned) 10Filippo Giunchedi: swift: let mount_filesystem fail on unmountable fs [puppet] - 10https://gerrit.wikimedia.org/r/269980 (https://phabricator.wikimedia.org/T126574) (owner: 10Filippo Giunchedi) [09:09:30] 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10Marostegui) a:03jcrespo Assigning to Jaime to reflect current status as I believe he's working on it... 
[09:13:22] (03CR) 10Sergio Gimeno: [C: 03+1] GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) (owner: 10Kosta Harlan) [09:20:29] (03PS3) 10Jelto: charts: fix affinity indentation in charts and scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 [09:22:17] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) (owner: 10Kosta Harlan) [09:23:33] (03PS7) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) [09:23:35] (03PS1) 10Majavah: openstack: enc: properly fail on server error [puppet] - 10https://gerrit.wikimedia.org/r/742424 [09:23:41] jouncebot: nowandnext [09:23:41] No deployments scheduled for the next 2 hour(s) and 36 minute(s) [09:23:41] In 2 hour(s) and 36 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T1200) [09:23:45] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Disable hard redirects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742413 (owner: 10Urbanecm) [09:24:06] (03CR) 10Majavah: openstack: refactor puppetmaster access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [09:25:09] (03Merged) 10jenkins-bot: foundationwiki: Disable hard redirects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742413 (owner: 10Urbanecm) [09:27:49] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c3f47dc55b67d2b53ec27bb610978ff8165aa6ca: foundationwiki: Disable hard redirects (duration: 00m 57s) [09:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:02] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Clear group add/remove declarations 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/742414 (owner: 10Urbanecm) [09:28:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:51] (03Merged) 10jenkins-bot: foundationwiki: Clear group add/remove declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742414 (owner: 10Urbanecm) [09:29:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:19] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 786313c06188d5d63700d7e46384ef99a9297b57: foundationwiki: Clear group add/remove declarations (duration: 00m 55s) [09:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:40] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Do not enable wmgUsePageViewInfo explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742415 (owner: 10Urbanecm) [09:32:30] (03Merged) 10jenkins-bot: foundationwiki: Do not enable wmgUsePageViewInfo explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742415 (owner: 10Urbanecm) [09:32:45] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:48] !log [urbanecm@mwmaint1002 ~]$ mwscript emptyUserGroup.php --wiki=foundationwiki 'inactive' # removing nonexistent group; backup left at P17888 [09:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:16] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: NOOP: 3a892860b2e1e2ac7b60fc1c4dbdb2035d6af950: foundationwiki: Do not enable wmgUsePageViewInfo explicitly (duration: 00m 55s) 
[09:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:27] !log rolling restart of mediawiki canaries to pick up ICU security updates [09:34:27] * urbanecm done [09:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:55] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:36:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:33] (03CR) 10Btullis: Add initial personal dotfiles and one script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731403 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [09:44:09] (03CR) 10Btullis: "Many thanks and apologies for the confusion." [puppet] - 10https://gerrit.wikimedia.org/r/742172 (https://phabricator.wikimedia.org/T285754) (owner: 10Jcrespo) [09:44:11] (03CR) 10Btullis: [C: 03+2] admin: Fix path of btullis' dotfiles and one script [puppet] - 10https://gerrit.wikimedia.org/r/742172 (https://phabricator.wikimedia.org/T285754) (owner: 10Jcrespo) [09:46:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[09:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:40] !log pool cp2041 with HAProxy as TLS terminator - T290005 [09:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:48] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:52:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2041.codfw.wmnet with OS buster [09:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:10] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2041.codfw.wmnet with OS buster c... [09:54:04] (03CR) 10Jcrespo: "Remember to check the files are there after deploy 0:-) :-P." [puppet] - 10https://gerrit.wikimedia.org/r/742172 (https://phabricator.wikimedia.org/T285754) (owner: 10Jcrespo) [09:56:41] (03CR) 10Kormat: [C: 03+1] "LGTM. Just test with care on one host :)" [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [09:56:43] (03PS1) 10Muehlenhoff: Update a few Stretch target dates [puppet] - 10https://gerrit.wikimedia.org/r/742428 [09:58:00] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [10:00:34] (03PS1) 10Vgutierrez: site: Reimage cp3064 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/742429 (https://phabricator.wikimedia.org/T290005) [10:00:58] (03CR) 10Elukey: "LGTM, asked some follow up questions just to double check some assumptions." 
[puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:01:08] !log depool cp3064 to be reimaged as cache::text_haproxy - T290005 [10:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:13] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:01:53] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp3064 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/742429 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:02:27] (03CR) 10Muehlenhoff: [C: 03+2] Update a few Stretch target dates [puppet] - 10https://gerrit.wikimedia.org/r/742428 (owner: 10Muehlenhoff) [10:02:33] (03PS2) 10Muehlenhoff: Update a few Stretch target dates [puppet] - 10https://gerrit.wikimedia.org/r/742428 [10:02:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3064.esams.wmnet with OS buster [10:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:59] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3064.esams.wmnet with OS buster [10:03:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32691/console" [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:07:43] (03PS19) 10Elukey: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:07:56] (03PS5) 10Elukey: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 
(https://phabricator.wikimedia.org/T291905) [10:08:04] (03PS5) 10Elukey: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:10:07] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10cloud-services-team (Kanban): Consider removing labnet-users group - https://phabricator.wikimedia.org/T296574 (10hashar) The purpose was for Releng to monitor the status of OpenStack when we used #nodepool (a system that dynamic... [10:12:19] (03CR) 10Elukey: "John if possible I'd move this change before the ryslog one, just to leave the biggest use case for last (there are also other bits that I" [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [10:31:33] 10SRE, 10Patch-Needs-Improvement: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574 (10fgiunchedi) 05Open→03Invalid Went another route, namely having proper permissions on the mount directory [10:39:11] (03CR) 10Arturo Borrero Gonzalez: "LGTM, but I don't know enough the codebase to be able to detect breaking changes. Please collect +1 from someone else." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [10:39:58] (03PS4) 10Jelto: charts: fix affinity indentation in charts and scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 [10:43:17] (03PS1) 10Btullis: Correct typo in .profile for btullis [puppet] - 10https://gerrit.wikimedia.org/r/742434 [10:43:36] (03CR) 10jerkins-bot: [V: 04-1] Correct typo in .profile for btullis [puppet] - 10https://gerrit.wikimedia.org/r/742434 (owner: 10Btullis) [10:44:39] (03CR) 10Arturo Borrero Gonzalez: ":-) This patch is awesome. Thanks for taking the time to dig this up." 
[puppet] - 10https://gerrit.wikimedia.org/r/742273 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott) [10:45:02] !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp3064.esams.wmnet with OS buster [10:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:15] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3064.esams.wmnet with OS buster e... [10:45:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3064.esams.wmnet with OS buster [10:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:06] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3064.esams.wmnet with OS buster [10:47:44] (03CR) 10David Caro: cli: add --fail-fast flag and behavior (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [10:49:25] (03PS2) 10Urbanecm: Disable Growth IP research survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742268 (https://phabricator.wikimedia.org/T294568) [10:49:33] jouncebot: now [10:49:34] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [10:49:39] (03CR) 10Urbanecm: [C: 03+2] Disable Growth IP research survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742268 (https://phabricator.wikimedia.org/T294568) (owner: 10Urbanecm) [10:50:03] (03PS1) 10Urbanecm: Fix "Mark entries as bot entries" feature [extensions/CentralAuth] (wmf/1.38.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/742256 (https://phabricator.wikimedia.org/T296297) [10:50:10] (03CR) 10Urbanecm: [C: 03+2] "backporting" [extensions/CentralAuth] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742256 (https://phabricator.wikimedia.org/T296297) (owner: 10Urbanecm) [10:50:26] (03Merged) 10jenkins-bot: Disable Growth IP research survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742268 (https://phabricator.wikimedia.org/T294568) (owner: 10Urbanecm) [10:51:01] (03Abandoned) 10Btullis: Correct typo in .profile for btullis [puppet] - 10https://gerrit.wikimedia.org/r/742434 (owner: 10Btullis) [10:51:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) [10:52:02] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) 05Open→03Resolved All VMs have been restarted to enable the machine type. [10:52:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d01652ec22f6cb3413b419a3c9b0a7a08d79960f: Disable Growth IP research survey (T294568) (duration: 00m 56s) [10:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:18] T294568: deploy quicksurvey for editors on eswiki and arwiki (for Growth IP editors research) - https://phabricator.wikimedia.org/T294568 [10:53:14] (03Merged) 10jenkins-bot: Fix "Mark entries as bot entries" feature [extensions/CentralAuth] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742256 (https://phabricator.wikimedia.org/T296297) (owner: 10Urbanecm) [10:54:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:23] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/CentralAuth/includes/Special/SpecialMultiLock.php: 5fc6aaa73202a1bf2aa58998d2671d5f4a6255bc: Fix "Mark entries as bot entries" feature (1/2; T296297) (duration: 00m 56s) [10:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:27] T296297: Special:MultiLock 'Mark entries on Recent changes as bot entries' is no longer working - https://phabricator.wikimedia.org/T296297 [10:58:14] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: ceph: refresh bootstrap auth [labs/private] - 10https://gerrit.wikimedia.org/r/742133 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:58:18] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/CentralAuth/includes/CentralAuthUser.php: 5fc6aaa73202a1bf2aa58998d2671d5f4a6255bc: Fix "Mark entries as bot entries" feature(2/2; T296297) (duration: 00m 55s) [10:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:28] * urbanecm done [11:00:00] (03CR) 10Jelto: "I came across some inconsistency around affinity settings in our charts. Some charts produce invalid yaml due to wrong formatting and inde" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 (owner: 10Jelto) [11:01:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:15] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:03:32] i hope that's not me [11:04:45] it lines up pretty well though :/ [11:04:59] :/ [11:05:23] waiting a while in case it's temp, and if not, I'll rv [11:05:59] hm, seems like parsoid cluster went up around the same time, while api_appserver didn’t change at all [11:06:14] (though parsoid also looks more variable in general) [11:06:39] yeah, i touched things that are miles away from parsoid [11:06:41] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [11:09:21] it now went slightly down (~390ms), but nowhere near the ~200 ms it was before :/ [11:10:26] went down ~270ms [11:11:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:12:01] and recovery [11:12:20] so it looks my patch either was not related or it was, but caused a temporary issue [11:12:44] urbanecm: memcached rps metrics went up at the same time, what was the change about? Maybe it was related to a cache problem? [11:13:11] urbanecm: that was the CA logging fix, right? 
[11:13:20] majavah: correct [11:13:32] 95th percentile is still not great [11:14:25] elukey: it's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/742256/, and it's supposed to give some entries in RC the is_bot flag [11:14:34] if you want me to, i can try to revert it anyway [11:14:50] did you deploy anything else recently? [11:15:34] majavah: at 10:52, disabled a survey, and at 09:30-09:34, few foundationwiki config changes (disable hard redirects + no-op changes) [11:16:32] I doubt foundationwiki gets enough traffic to affect global metrics this much [11:16:47] yeah [11:17:39] maybe the survey? that would at least match that it's appservers + parsoid but not api [11:18:07] i heard of enabling surveys making metrics worse, never disabling [11:18:37] (surveys are all clientside, ftr) [11:19:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cache_haproxy_tls site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:21:50] (03PS1) 10Muehlenhoff: Enable component/ganeti216 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/742438 [11:24:59] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:25:07] elukey: in the meantime, did the metrics you talked about improve? if not, should i revert the CA patch? [11:25:12] (also cc majavah in case you have any opinion) [11:26:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742438 (owner: 10Muehlenhoff) [11:27:48] I still suspect the survey patch as it invalidates RL caches [11:28:08] majavah: do you want me to RV it? [11:28:09] elukey: do we have response time stats per-wiki? 
[11:29:10] afaik the webrequest_sampled_128 data in turnilo is delayed, which does not work for this purpose [11:29:11] urbanecm: the stats are a little better but not great [11:29:23] majavah: not that I know [11:29:46] :/ [11:29:48] or per endpoint (index.php/load.php/etc)? [11:30:03] i _think_ kafkacat'ing the right topic should be realtime [11:30:15] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [11:30:35] 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10Kormat) Split the alias, have a new `db-backup-source` alias. [11:30:51] I probably can't do that [11:31:19] i'm curious if the request rate increase was on load.php only or other endpoints [11:31:33] majavah: `kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -o -1 -t webrequest_text` from a host that can access kafka [11:33:13] I don't think any of the cloud hosts can/should do that (or bast/peopleweb) [11:33:34] yeah, unfortunately [11:33:43] The only thing that matches with the p95 increase is the memcached traffic [11:33:46] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=26&from=now-3h&orgId=1&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 [11:33:53] if we look at the hours before, there are worse spikes [11:34:01] * urbanecm briefly afk, will be back soon [11:35:54] hmm, actually https://grafana.wikimedia.org/d/000000066/resourceloader?orgId=1&from=now-3h&to=now does not support my theory [11:38:37] * urbanecm waves again [11:38:59] also it seems like the requests started rising before the deployments [11:40:05] majavah: if that'd help you, i can keep the kafkacat running for a while and send you a sample [11:40:11] (03PS1) 10Muehlenhoff: Retire labnet-users group [puppet] - 10https://gerrit.wikimedia.org/r/742440
(https://phabricator.wikimedia.org/T296574) [11:40:23] (of course, it wouldn't have pre-deployment data -- turnilo is simplest for that) [11:40:45] probably not that much, https://grafana.wikimedia.org/d/000000066/resourceloader?orgId=1&from=now-3h&to=now is enough to disprove the RL theory [11:40:59] ack [11:41:44] I have to go now, the situation is not burning so we can leave things as they are and verify performance later on (say one hour or a little more). If the perf regression is still there, I'd revert [11:42:22] I don't have time right now to check in depth, but it smells like a perf regression due to more traffic to memcached [11:43:34] ack, sounds good. [11:43:53] I'm happy to revert things, although I'm not sure _what_ to revert (ie. should the patch from 09:30 go too?) [11:44:18] (03PS2) 10Muehlenhoff: Retire labnet-users group [puppet] - 10https://gerrit.wikimedia.org/r/742440 (https://phabricator.wikimedia.org/T296574) [11:45:04] urbanecm: I thought there was one that matched (more or less) the regression time [11:45:09] I'd start from it [11:45:31] okay [11:46:01] thanks a lot :) [11:46:04] * elukey afk [11:46:38] i should be irc-reachable for the whole day (more or less), so feel free to ping me if a rv is needed after the hour [11:59:23] hey everyone, I need to do a service deploy today, outside the regular deployment window (tomorrow). Is that okay? Are there any questions or concerns? The service is wikifeeds for an Android Campaign that should go out this week [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T1200). [12:00:05] kostajh and majavah: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:09] around [12:00:13] * urbanecm waves [12:00:16] i can deploy today [12:00:17] \o hi [12:00:22] 10SRE, 10Infrastructure-Foundations, 10netops: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 (10cmooney) Ok yes it seems to be the loopback filter alright, testing the change on asw1-b13-drmrs adding a new term as advised in the KB article fixed it: ` cmooney@asw1-b13-drmrs... [12:00:30] (unless kostajh wishes to try to lead a B&C window!) [12:00:34] o/ [12:01:22] Hi [12:01:35] hey nn1l2, i don't see you on the list? :-) am i missing sth? [12:01:41] oh, very last-minute addition [12:01:58] starting... [12:01:59] (03CR) 10Muehlenhoff: [C: 03+2] Enable component/ganeti216 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/742438 (owner: 10Muehlenhoff) [12:02:10] (03CR) 10Urbanecm: [C: 03+2] Fix error handling in SuggestedEdits::getActionData() [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742254 (https://phabricator.wikimedia.org/T296366) (owner: 10Kosta Harlan) [12:02:25] kostajh: do we want to wait for the backport before enabling the add an image experiment?
[12:03:00] urbanecm: it should be fine to enable first, but I think we might as well wait for the backport to be extra safe [12:03:21] I'm waiting then, proceeding with other config patches in the meantime [12:03:27] (03PS2) 10Urbanecm: Remove search.wikimedia.org files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah) [12:03:32] (03CR) 10Urbanecm: [C: 03+2] Remove search.wikimedia.org files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah) [12:04:16] (03Merged) 10jenkins-bot: Remove search.wikimedia.org files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah) [12:04:21] my patch can't be really tested on mwdebug, the point being that search.wm.o is no longer served on the main appservers and testing that it still works on mwdebug would be pointless [12:04:33] (and we already removed the apache config for it) [12:04:48] majavah: acknowledged [12:05:41] (03PS1) 10Ladsgroup: noc: Make colors consistent with WikimediaUI style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742443 [12:06:01] indeed, already says 404 not found at mwdebug [12:06:13] (curl --connect-to is helpful) [12:06:15] syncing [12:07:05] nn1l2: i can't find any downloadable material at the site you mention [12:07:13] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review, and 2 others: Split search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 (10Majavah) 05Open→03Resolved [12:07:14] can you link me some? [12:07:26] https://planet4589.org/space/gcat/index.html [12:07:27] !log urbanecm@deploy1002 Synchronized docroot/: 4662224229cb4083b8b01de436ccd65e8c00e7dd: Remove search.wikimedia.org files (T289224) (duration: 00m 56s) [12:07:30] thanks! 
[12:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:32] T289224: Split search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 [12:07:46] majavah: any time! [12:07:57] nn1l2: and what's the downloadable bit there? [12:08:09] ie. what would i put into commons' "URL bar" when uploading by URL [12:08:16] Some have uploaded to Commons [12:08:17] https://commons.wikimedia.org/wiki/File:Kosmos-1408_orbit_decay.jpg [12:08:24] https://commons.wikimedia.org/wiki/File:Kosmos-1408_vs_ISS.jpg [12:08:48] https://commons.wikimedia.org/w/index.php?search=insource%3Aplanet4589.org&title=Special:MediaSearch&go=Go&type=image [12:09:00] The Graph [12:09:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:41] (03PS3) 10Urbanecm: Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741980 (https://phabricator.wikimedia.org/T296136) (owner: 104nn1l2) [12:09:46] (03CR) 10Urbanecm: [C: 03+2] Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741980 (https://phabricator.wikimedia.org/T296136) (owner: 104nn1l2) [12:09:48] fair enough then [12:09:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3064.esams.wmnet with OS buster [12:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:53] https://planet4589.org/space/gcat/images/models/map2a.jpg [12:10:02] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
vgutierrez@cumin1001 for host cp3064.esams.wmnet with OS buster c... [12:10:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:33] (03Merged) 10jenkins-bot: Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741980 (https://phabricator.wikimedia.org/T296136) (owner: 104nn1l2) [12:10:49] nn1l2: please test at mwdebug1001 [12:11:29] !log pool cp3064 (text) using HAProxy as TLS terminator - T290005 [12:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:33] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:11:46] LGTM https://commons.wikimedia.org/wiki/File:General_Catalog_of_Artificial_Space_Objects_Map2a.jpg [12:11:53] syncing [12:11:56] Uploaded by URL [12:12:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cli: add --fail-fast flag and behavior (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [12:13:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7fdea3e71e4fd9e85c30efbc17f94c0711deb252: Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T296136) (duration: 00m 56s) [12:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:09] T296136: Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T296136 [12:13:15] nn1l2: it's live [12:13:18] anything else? [12:13:30] no, thanks [12:13:33] great :) [12:14:27] waiting for CI... [12:16:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:44] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:17:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:23:30] (03Merged) 10jenkins-bot: Fix error handling in SuggestedEdits::getActionData() [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742254 (https://phabricator.wikimedia.org/T296366) (owner: 10Kosta Harlan) [12:25:02] (03PS31) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [12:25:15] kostajh: pulled the backport at mwdebug1001 [12:25:17] can you test? 
[12:25:20] urbanecm: yes, testing [12:25:23] thanks [12:25:55] urbanecm: it works :) [12:25:59] great, i can confirm that too [12:26:00] syncing [12:26:09] (03PS4) 10Urbanecm: GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) (owner: 10Kosta Harlan) [12:26:19] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) (owner: 10Kosta Harlan) [12:26:21] (03PS32) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [12:27:19] (03Merged) 10jenkins-bot: GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) (owner: 10Kosta Harlan) [12:28:17] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Daimona) @Dzahn On second thought, I'd rather create a separate WMF account, since I didn't realize that my wikimedia.org email would also be used for gerrit and the cloud. I need to create a new WMF a... [12:28:31] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/32695/" [puppet] - 10https://gerrit.wikimedia.org/r/742132 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:28:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:40] Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 352 hosts is taking a while this time [12:29:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:52] (03PS33) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [12:30:21] (03PS34) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [12:32:21] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/HomepageModules/SuggestedEdits.php: 05704407395fbf227eec47cf716393dc60a36a35: Fix error handling in SuggestedEdits::getActionData (T296366) (duration: 05m 37s) [12:32:24] finally [12:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:25] T296366: GrowthExperiments: Call to undefined method StatusValue::getTotalCount() - https://phabricator.wikimedia.org/T296366 [12:32:33] !log depool cp3064 - T290005 [12:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:37] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:32:48] kostajh: config patch available for testing [12:32:59] urbanecm: thanks. mwdebug1001? [12:33:01] yes [12:33:37] having a look [12:33:46] (03PS35) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [12:34:02] thanks [12:35:42] urbanecm: looks good to me [12:35:49] urbanecm: wait [12:35:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:52] waiting [12:35:52] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10MoritzMuehlenhoff) >>! In T295993#7528528, @Daimona wrote: > As I said, if my wikimedia email needs to be in the puppet file, that's fine. I do prefer not to use my real name publicly, but I believe th... 
[12:36:15] (03PS2) 10Arturo Borrero Gonzalez: profile: ceph: cleanup firewall config [puppet] - 10https://gerrit.wikimedia.org/r/742174 [12:36:18] urbanecm: sorry, spotted an issue, give me a moment [12:36:24] sure, take your time [12:36:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:15] (03PS1) 10Kormat: cumin: Split out backup sources from db-store alias [puppet] - 10https://gerrit.wikimedia.org/r/742453 (https://phabricator.wikimedia.org/T296285) [12:39:34] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC: (NOOP) https://puppet-compiler.wmflabs.org/compiler1001/32697/" [puppet] - 10https://gerrit.wikimedia.org/r/742174 (owner: 10Arturo Borrero Gonzalez) [12:40:39] urbanecm: let's not deploy the config patch for now. We need a backport to address an issue with GrowthExperiments code first. [12:40:48] reverting [12:40:49] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [12:41:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/742440 (https://phabricator.wikimedia.org/T296574) (owner: 10Muehlenhoff) [12:41:33] (03CR) 10Jakob: [C: 03+1] Update termbox to 2021-11-26-093451-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [12:41:37] (03PS1) 10Urbanecm: Revert "GrowthExperiments: Start imagerecommendation variant experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742454 (https://phabricator.wikimedia.org/T294737) [12:41:39] (03CR) 10Urbanecm: [C: 03+2] Revert "GrowthExperiments: Start imagerecommendation variant experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742454 (https://phabricator.wikimedia.org/T294737) (owner: 
10Urbanecm) [12:42:37] (03Merged) 10jenkins-bot: Revert "GrowthExperiments: Start imagerecommendation variant experiment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742454 (https://phabricator.wikimedia.org/T294737) (owner: 10Urbanecm) [12:42:53] kostajh: reverted [12:42:57] not syncing, as it never got past mwdebug [12:43:18] (03CR) 10Jbond: [C: 03+1] cli: add --fail-fast flag and behavior [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [12:43:57] (03PS20) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [12:45:09] urbanecm: thx [12:45:14] any time [12:48:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:34] (03CR) 10Jbond: P:base::certificates: update support for trusted CA (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [12:49:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:50] (03PS2) 10Arturo Borrero Gonzalez: ceph: auth: introduce new parameter 'import_to_ceph' [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) [12:49:52] (03PS3) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [12:51:20] !log upgrading ganeti codfw cluster to 2.16 backport T296622 [12:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:25] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [12:51:40] (03PS6) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) [12:51:59] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:52:17] (03PS6) 10Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [12:52:19] (03PS21) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [12:52:21] (03PS7) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) [12:52:23] (03CR) 10Arturo Borrero Gonzalez: "PCC NOOP https://puppet-compiler.wmflabs.org/compiler1003/32698/" [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:55:21] (03PS4) 10Arturo Borrero Gonzalez: ceph: 
migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [12:55:58] (03PS8) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) [12:57:08] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:58:09] (03CR) 10David Caro: cli: add --fail-fast flag and behavior (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [12:59:37] (03CR) 10Jbond: "PCC currently on error needs investigating, all other hosts have a big diff but this is just a change to the parameter structure none of t" [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [13:01:07] (03CR) 10Majavah: [C: 03+1] Retire labnet-users group [puppet] - 10https://gerrit.wikimedia.org/r/742440 (https://phabricator.wikimedia.org/T296574) (owner: 10Muehlenhoff) [13:01:28] (03CR) 10Kormat: [V: 03+2 C: 03+2] .gitignore: Ignore __pycache__ dirs. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/742140 (owner: 10Kormat) [13:03:54] (03PS22) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [13:03:56] (03PS7) 10Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:03:58] (03PS9) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) [13:05:00] (03PS23) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [13:05:02] (03PS10) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) [13:05:04] (03PS8) 10Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:05:41] (03PS3) 10Arturo Borrero Gonzalez: ceph: auth: introduce new parameter 'import_to_ceph' [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) [13:05:43] (03PS5) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [13:08:28] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 53.35 ms [13:08:35] (03PS9) 10Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:09:11] 10SRE, 10ops-codfw: logstash2028.mgmt flapping - https://phabricator.wikimedia.org/T296540 
(10Papaul) a:03Papaul [13:09:18] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:09:33] 10SRE, 10ops-codfw: logstash2028.mgmt flapping - https://phabricator.wikimedia.org/T296540 (10Papaul) p:05Triage→03Medium [13:09:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32703/console" [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:11:41] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libntlm [puppet] - 10https://gerrit.wikimedia.org/r/742420 (owner: 10Muehlenhoff) [13:12:23] (03PS10) 10Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:13:01] (03PS11) 10Jbond: P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:13:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32704/console" [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:13:45] (03PS4) 10Arturo Borrero Gonzalez: ceph: auth: introduce new parameter 'import_to_ceph' [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) [13:13:47] (03PS6) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [13:14:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32705/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/739917 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [13:15:55] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:16:52] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [13:17:43] (03CR) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [13:18:23] (03CR) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [13:19:33] (03PS1) 10Jelto: profile::gitlab-runner add hieradata for protected GitLab Runners [puppet] - 10https://gerrit.wikimedia.org/r/742458 (https://phabricator.wikimedia.org/T295481) [13:20:46] (03PS7) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [13:21:41] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:26:46] (03PS1) 10Cathal Mooney: Modified loopback4 filter to allow NTP commands to run [homer/public] - 10https://gerrit.wikimedia.org/r/742460 (https://phabricator.wikimedia.org/T296623) [13:37:01] (03PS8) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [13:37:58] (03PS36) 10Jbond: monitoring: refactor class [puppet] - 
10https://gerrit.wikimedia.org/r/725045 [13:38:00] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:45:56] (03PS9) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [13:47:56] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:49:34] (03CR) 10Ayounsi: Modified loopback4 filter to allow NTP commands to run (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/742460 (https://phabricator.wikimedia.org/T296623) (owner: 10Cathal Mooney) [13:50:27] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10Gehel) 05Open→03Resolved [13:51:14] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Daimona) >>! In T295993#7533601, @MoritzMuehlenhoff wrote: >>>! In T295993#7528528, @Daimona wrote: >> As I said, if my wikimedia email needs to be in the puppet file, that's fine. I do prefer not to u... 
[13:51:20] 10SRE, 10Traffic, 10observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10Gehel) [13:51:41] (03PS37) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [13:52:30] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Deprecation of U2F API in Chrome / Enable web auth in CAS - https://phabricator.wikimedia.org/T296629 (10MoritzMuehlenhoff) [13:53:46] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:54:05] (03CR) 10jerkins-bot: [V: 04-1] monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [13:55:42] (03PS1) 10Cathal Mooney: Add drmrs loopbacks and interconnect range to ntp allowed config [puppet] - 10https://gerrit.wikimedia.org/r/742462 (https://phabricator.wikimedia.org/T295672) [13:55:46] !log repool cp3064 - T290005 [13:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:51] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:57:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 (10cmooney) ^^ apologies ignore above used incorrect task ref. [13:57:25] hey elukey, if you're around again, did the metrics we were discussing a while ago improve? Or should I do the revert? 
[13:57:29] (03CR) 10Ayounsi: Add drmrs loopbacks and interconnect range to ntp allowed config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742462 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [13:57:31] (03PS2) 10Cathal Mooney: Add drmrs loopbacks and interconnect range to ntp allowed config [puppet] - 10https://gerrit.wikimedia.org/r/742462 (https://phabricator.wikimedia.org/T296623) [14:00:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) ^^ ignore above - pasted wrong task ID. and sorry for spam. [14:05:18] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10kai.nissen) [14:05:28] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [14:06:47] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10kai.nissen) [14:09:20] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10kai.nissen) [14:09:36] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10kai.nissen) [14:10:38] urbanecm: they look way much better, I don't think a rollback is needed [14:10:53] good news! thanks elukey. Letting that out of my head then :). [14:13:48] (03PS1) 10MSantos: wikifeeds: bump to 2021-11-25-114706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/742468 [14:13:57] (03CR) 10Elukey: [C: 03+1] "Nice thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [14:14:12] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:15:29] (03PS1) 10Marostegui: phabricator_instance.my.cnf.erb: Change innodb_checksum_algorithm [puppet] - 10https://gerrit.wikimedia.org/r/742469 (https://phabricator.wikimedia.org/T287244) [14:16:25] (03PS38) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [14:17:43] (03CR) 10Marostegui: [C: 03+2] phabricator_instance.my.cnf.erb: Change innodb_checksum_algorithm [puppet] - 10https://gerrit.wikimedia.org/r/742469 (https://phabricator.wikimedia.org/T287244) (owner: 10Marostegui) [14:19:19] (03PS39) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [14:19:47] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2021-11-25-114706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/742468 (owner: 10MSantos) [14:21:08] (03PS40) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [14:21:48] (03PS1) 10Marostegui: phabricator.my.cnf.erb: Change innodb_checksum_algorithm [puppet] - 10https://gerrit.wikimedia.org/r/742471 (https://phabricator.wikimedia.org/T287244) [14:22:29] (03CR) 10Marostegui: [C: 03+2] phabricator.my.cnf.erb: Change innodb_checksum_algorithm [puppet] - 10https://gerrit.wikimedia.org/r/742471 (https://phabricator.wikimedia.org/T287244) (owner: 10Marostegui) [14:23:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32710/console" [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [14:23:52] (03Merged) 10jenkins-bot: wikifeeds: bump to 2021-11-25-114706-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/742468 (owner: 10MSantos) 
[14:24:39] (03CR) 10Jbond: [V: 03+1] "PCC for alerts1001 https://puppet-compiler.wmflabs.org/compiler1003/32710/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [14:24:45] (03PS41) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [14:25:57] (03CR) 10Jbond: "ready for review see comments for pcc. the show a lot of diffs but only the parameter names and how they are passed has really changed. " [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [14:27:31] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Deprecation of U2F API in Chrome / Enable web auth in CAS - https://phabricator.wikimedia.org/T296629 (10jbond) [14:27:36] (03CR) 10Elukey: P:base::certificates: update support for trusted CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [14:32:17] (03PS24) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [14:33:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32711/console" [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [14:34:05] (03CR) 10Jbond: [V: 03+1] P:base::certificates: update support for trusted CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [14:34:14] (03PS11) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) [14:34:18] (03PS12) 10Jbond: P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 
10Elukey) [14:36:57] (03CR) 10Lucas Werkmeister (WMDE): Update termbox to 2021-11-26-093451-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [14:37:32] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [14:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:42] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [14:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] ^ deployment out of the window in order to deliver Android and iOS fundraising campaign [14:40:15] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [14:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:16] (03PS2) 10Jforrester: Add templateeditor group and protection level at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154) (owner: 104nn1l2) [14:41:30] (03PS2) 10Jforrester: Set 'WP' namespace alias to NS_PROJECT in mnw.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742279 (https://phabricator.wikimedia.org/T296606) (owner: 104nn1l2) [14:45:28] (03CR) 10Elukey: P:base::certificates: update support for trusted CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [14:47:39] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Deprecation of U2F API in Chrome / Enable web auth in CAS - https://phabricator.wikimedia.org/T296629 (10Volans) [14:49:20] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32712/console" [puppet] - 
10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [14:51:57] (03CR) 10Andrew Bogott: cinder.conf: Tune settings for the backup agent. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742273 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott) [15:15:23] !log gnt-cluster renew-crypto --new-cluster-certificate for codfw Ganeti cluster T296622 [15:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:29] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [15:15:54] (03PS25) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [15:16:35] (03PS12) 10Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) [15:16:46] (03PS13) 10Jbond: P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:17:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32713/console" [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [15:20:23] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10cmooney) Seems like a sane proposal. The use of sflow and a different pipeline will keep a clean separation between it and data fr... 
[15:20:24] (03CR) 10Jakob: [C: 03+1] Update termbox to 2021-11-26-093451-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [15:22:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32714/console" [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [15:24:39] (03CR) 10Elukey: [V: 03+1 C: 03+1] "Thanks a lot John!" [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [15:26:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base::certificates: update support for trusted CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [15:26:38] (03CR) 10Jbond: [C: 03+2] P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - 10https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [15:29:50] 10SRE, 10ops-eqiad, 10serviceops: Kubernetes1018's eth negotiated speed is 10MB/s - https://phabricator.wikimedia.org/T296369 (10Cmjohnson) 05Open→03Resolved replaced the cable. Good to go now cmjohnson@kubernetes1018:~$ sudo ethtool eno1 | grep Speed Speed: 1000Mb/s [15:30:57] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [15:31:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10Cmjohnson) a:05Cmjohnson→03wiki_willy @jcrespo db1102 is out of warranty.... 
[15:35:14] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10Cmjohnson) a:05Cmjohnson→03wiki_willy @wiki_willy @RobH this server is out of warranty, they have a 1.6TB SSD that has failed. I recommend buying a new SSD. [15:37:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah) [15:40:06] (03PS1) 10Majavah: add phab task for role::doc stretch deprecation [puppet] - 10https://gerrit.wikimedia.org/r/742480 [15:40:32] (03PS1) 10Jbond: O:sslcert::trusted_ca: fix trying to join struct to array [puppet] - 10https://gerrit.wikimedia.org/r/742481 [15:41:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32715/console" [puppet] - 10https://gerrit.wikimedia.org/r/742481 (owner: 10Jbond) [15:41:44] (03CR) 10Majavah: [C: 03+1] O:sslcert::trusted_ca: fix trying to join struct to array [puppet] - 10https://gerrit.wikimedia.org/r/742481 (owner: 10Jbond) [15:41:56] (03CR) 10Elukey: [C: 03+1] O:sslcert::trusted_ca: fix trying to join struct to array [puppet] - 10https://gerrit.wikimedia.org/r/742481 (owner: 10Jbond) [15:42:12] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:sslcert::trusted_ca: fix trying to join struct to array [puppet] - 10https://gerrit.wikimedia.org/r/742481 (owner: 10Jbond) [15:42:25] (03CR) 10RhinosF1: [C: 04-1] add phab task for role::doc stretch deprecation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742480 (owner: 10Majavah) [15:42:44] majavah: spell check hates you [15:42:54] (03PS2) 10Majavah: add phab task for role::doc stretch deprecation [puppet] - 10https://gerrit.wikimedia.org/r/742480 [15:43:04] vim does not have a spell checker by default [15:43:25] (03PS3) 10Majavah: add phab task for role::doc stretch deprecation [puppet] - 10https://gerrit.wikimedia.org/r/742480 [15:43:44] Anyone using 
production right now? I want to deploy a Beta Cluster patch that'll need no sync, just a pull. [15:43:51] (03PS4) 10Jforrester: [BETA CLUSTER] Create wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740789 (https://phabricator.wikimedia.org/T284162) [15:43:56] (03CR) 10Majavah: add phab task for role::doc stretch deprecation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742480 (owner: 10Majavah) [15:44:10] (03CR) 10RhinosF1: [C: 03+1] add phab task for role::doc stretch deprecation [puppet] - 10https://gerrit.wikimedia.org/r/742480 (owner: 10Majavah) [15:44:43] majavah: ty [15:45:06] (03CR) 10Jforrester: [C: 03+2] "Let's go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740789 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [15:45:13] I'll take that as no. ;-) [15:45:27] (03Abandoned) 10Jforrester: [BETA CLUSTER] Configure wikifunctionswiki in wikiversions-labs.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740790 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [15:45:48] (03Merged) 10jenkins-bot: [BETA CLUSTER] Create wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740789 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [15:47:02] (03PS1) 10Jbond: P:cache::kafka::Webrequest: always include profile::cache::kafka::certificate [puppet] - 10https://gerrit.wikimedia.org/r/742482 (https://phabricator.wikimedia.org/T291905) [15:47:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:21] !log power down logstash2028 for IDRAC reset [15:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32716/console" [puppet] - 10https://gerrit.wikimedia.org/r/742482 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [15:48:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:cache::kafka::Webrequest: always include profile::cache::kafka::certificate [puppet] - 10https://gerrit.wikimedia.org/r/742482 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [15:50:44] PROBLEM - Host logstash2028 is DOWN: PING CRITICAL - Packet loss = 100% [15:51:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2069.codfw.wmnet with OS buster [15:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2069.codfw.wmnet with OS buster [15:51:26] !log Running mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=enwiki en wikimedia wikifunctionswiki wikifunctions.beta.wmflabs.org in Beta Cluster for T284162 [15:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:29] T284162: Create a Beta Cluster version of Wikifunctions.org - https://phabricator.wikimedia.org/T284162 [15:52:47] !log depool cp3064 - T290005 [15:52:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:53] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:54:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:44] James_F: invalid hostname :( [15:54:46] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10Papaul) mgmt cable was loosed on elastic2069 [15:55:51] RhinosF1: Yes, debugging what went wrong now. See T296644. 
[15:55:52] T296644: addWiki.php failed in Beta Cluster, trying to run Flagged Revs code where none was enabled - https://phabricator.wikimedia.org/T296644 [15:56:04] Seen [16:00:52] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.24 ms [16:03:28] RECOVERY - Host logstash2028 is UP: PING WARNING - Packet loss = 90%, RTA = 33.20 ms [16:04:27] !log sudo gnt-cluster upgrade --to 2.16 for Ganeti codfw cluster [16:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:50] (03PS1) 10Filippo Giunchedi: pontoon: fix trusted_certs [puppet] - 10https://gerrit.wikimedia.org/r/742484 [16:07:14] (03PS1) 10Elukey: update-wmf-ca-certificates: add group/other read flags to cert bundle [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742485 [16:08:48] PROBLEM - ganeti-mond running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [16:08:51] jbond: found an issue with the latest refactor of trusted_certs, though the fix is simple https://gerrit.wikimedia.org/r/c/operations/puppet/+/742484 [16:09:12] PROBLEM - ganeti-noded running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:09:30] PROBLEM - ganeti-wconfd running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [16:09:32] ^ ganeti-mond is related to the gnt-cluster upgrade, it's restarting daemons [16:10:42] 10SRE, 10ops-codfw: logstash2028.mgmt flapping - https://phabricator.wikimedia.org/T296540 (10Papaul) 05Open→03Resolved Reset IDRAC and upgrade IDRAC mgmt is back up. 
[16:11:26] godog: thanks looking [16:11:35] (03PS1) 10Elukey: Bump debian/changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742486 [16:11:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:59] godog: do you have a pontoon server to test. i had thought the default would work ok with pontoon [16:12:17] jbond: sure, sec [16:12:20] (03CR) 10Jdlrobson: [C: 03+1] enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T47955) (owner: 10Inductiveload) [16:13:05] godog: in theory with the last refactor the puppet code should access profile::base::certificates::trusted_ca_path, that for pontoon envs (by default) will point to the Puppet CA cert bundle path [16:13:20] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:13:21] (say any daemon/client that needs to use it) [16:13:24] PROBLEM - ganeti-confd running on ganeti2024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:13:24] PROBLEM - ganeti-noded running on ganeti2023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:14:08] jbond elukey you can see the error at pontoon-log-01.monitoring.eqiad1.wikimedia.cloud [16:14:32] or https://phabricator.wikimedia.org/P17893 [16:14:43] thx looking [16:15:26] RECOVERY - ganeti-confd running on ganeti2024 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:15:27] RECOVERY - ganeti-noded 
running on ganeti2023 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:16:58] godog: looks like that should be a simple fix one sec [16:17:17] (03PS1) 10Ladsgroup: rdbms: Add DB host to TransactionProfiler logging and fix time fields [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742259 (https://phabricator.wikimedia.org/T295706) [16:18:04] (03PS1) 10Jbond: sslcert: use correct default of undef [puppet] - 10https://gerrit.wikimedia.org/r/742487 [16:18:07] godog: i think ^^ should fix things [16:18:45] (03CR) 10Filippo Giunchedi: [C: 03+1] sslcert: use correct default of undef [puppet] - 10https://gerrit.wikimedia.org/r/742487 (owner: 10Jbond) [16:19:17] jbond: yep lgtm! what gets trusted with 'undef'? [16:19:48] I'd assume nothing but wanted to double check [16:19:56] nothing additional that is [16:20:01] godog: facts['puppet_config']['localcacert'] which should be the puppet ca [16:20:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] sslcert: use correct default of undef [puppet] - 10https://gerrit.wikimedia.org/r/742487 (owner: 10Jbond) [16:20:24] nice, I'll try right now [16:20:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2069.codfw.wmnet with OS buster [16:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2069.codfw.wmnet with OS buster comp... 
[16:21:00] ack it should be merged now, let me know if it doesn't work as expected [16:21:12] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:30] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10Papaul) [16:22:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10Papaul) 05Open→03Resolved This is complete [16:23:33] jbond: thank you, different error https://phabricator.wikimedia.org/P17894 looks like a typo for 'default' [16:23:42] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:24:37] godog: sorry just on phone will check in a sec [16:25:06] jbond: sure no rush [16:25:32] (03CR) 10Cwhite: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/742147 (owner: 10Filippo Giunchedi) [16:25:38] (03CR) 10Ottomata: [C: 03+2] analytics/systemd/sqoop: Add daily sqoop [puppet] - 10https://gerrit.wikimedia.org/r/739923 (https://phabricator.wikimedia.org/T290516) (owner: 10Milimetric) [16:25:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:54] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: log additional alert labels [puppet] - 10https://gerrit.wikimedia.org/r/742147 (owner: 10Filippo Giunchedi) [16:27:20] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10RobH) This server is 5 years old in May 2022, do we want to throw new hardware in a host going away in less than 6 months or just move the refresh to Q3? [16:30:04] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T1630). [16:32:19] (03Abandoned) 10Filippo Giunchedi: pontoon: fix trusted_certs [puppet] - 10https://gerrit.wikimedia.org/r/742484 (owner: 10Filippo Giunchedi) [16:33:14] RECOVERY - ganeti-mond running on ganeti2019 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [16:33:34] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:08] 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10Papaul) 05Open→03Resolved Not seeing any memory errors on the server since the swap so closing this task for now. In case we do see the issue again. We can re-open. 
[16:35:32] RECOVERY - ganeti-noded running on ganeti2019 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:37:16] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 (10Papaul) @MatthewVernon I requested for the disk replacement . I am still waiting for it. [16:37:46] RECOVERY - ganeti-wconfd running on ganeti2019 is OK: PROCS OK: 1 process with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [16:44:49] (03PS1) 10Majavah: sslcert::trusted_ca: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/742490 [16:44:57] (03PS1) 10David Caro: trusted_ca: Fix typo defaut->default [puppet] - 10https://gerrit.wikimedia.org/r/742491 [16:45:53] (03CR) 10Filippo Giunchedi: [C: 03+1] trusted_ca: Fix typo defaut->default [puppet] - 10https://gerrit.wikimedia.org/r/742491 (owner: 10David Caro) [16:46:03] dcaro: thanks re: ^ ! 
ran into the same [16:46:18] ack [16:46:23] (03Abandoned) 10Majavah: sslcert::trusted_ca: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/742490 (owner: 10Majavah) [16:46:51] lol the timing majavah dcaro [16:47:05] just noticed it is the same fix [16:47:37] (03CR) 10David Caro: [C: 03+2] trusted_ca: Fix typo defaut->default [puppet] - 10https://gerrit.wikimedia.org/r/742491 (owner: 10David Caro) [16:47:51] yep xd, waiting for the tests to come back [16:48:37] (03PS1) 10Urbanecm: Revert "foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742492 (https://phabricator.wikimedia.org/T205347) [16:49:01] (03CR) 10Urbanecm: [C: 03+2] Revert "foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742492 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [16:49:53] (03Merged) 10jenkins-bot: Revert "foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742492 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [16:51:09] thanks godog dcaro [16:51:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:11] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 567f2a9d4883c9a98a3251f153ea0ad58d7774c6: Revert "foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily" (T205347) (duration: 00m 56s) [16:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:15] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [16:52:28] the puppet fix worked, the alerts should start clearing out by themselves [16:52:42] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [16:52:54] (03PS2) 10Urbanecm: Make foundationwiki a standard CentralAuth wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735443 (https://phabricator.wikimedia.org/T205347) [16:53:00] and of course i had a patch for this prepared [16:53:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:16] (03CR) 10Urbanecm: [C: 03+2] Make foundationwiki a standard CentralAuth wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735443 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [16:54:15] (03PS1) 10Urbanecm: foundationwiki: Remove explicit wmgUseOAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742493 [16:54:24] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Remove explicit wmgUseOAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742493 (owner: 10Urbanecm) [16:54:29] (03Merged) 10jenkins-bot: Make foundationwiki a standard CentralAuth wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735443 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [16:55:33] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) Oddly, despite having run "sudo gnt-cluster renew-crypto --new-cluster-certificate", "gnt-cluster verify still complains about the certs of two nodes: ` Mon Nov 29 16:... [16:55:56] (03Merged) 10jenkins-bot: foundationwiki: Remove explicit wmgUseOAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742493 (owner: 10Urbanecm) [16:56:10] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: bad34ed8d86b30eb4c240da0498ddfb44af30ea7: Make foundationwiki a standard CentralAuth wiki (T205347) (duration: 00m 56s) [16:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:23] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 06d8d25f6e89be0b1692d017bdbc2c9524372c0b: foundationwiki: Remove explicit wmgUseOAuth (duration: 00m 57s) [16:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:29] * urbanecm done [17:00:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:32] (03PS1) 10Ebernhardson: Move CirrusSearch traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742497 (https://phabricator.wikimedia.org/T295705) [17:14:22] I’ll deploy an update to the termbox k8s service in a few minutes if that’s alright with everyone https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/742167 [17:15:19] Lucas_WMDE: Sure; OK if I do a prod-touching Beta Cluster deploy at the same time (just a sync of IS and CS with our new WikiLambda variable)? 
[17:15:29] should be fine, yeah [17:15:42] (03PS4) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: I - IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740791 (https://phabricator.wikimedia.org/T289315) [17:15:47] (03CR) 10Jforrester: [C: 03+2] Initial Beta Cluster deployment of Wikifunctions: I - IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740791 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [17:16:13] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:16:30] ({meme, src=itshappening}) [17:16:36] (03Merged) 10jenkins-bot: Initial Beta Cluster deployment of Wikifunctions: I - IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740791 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [17:17:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Update termbox to 2021-11-26-093451-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [17:18:07] (03PS2) 10Lucas Werkmeister (WMDE): Update termbox to 2021-11-26-093451-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) [17:18:12] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "(forgot to rebase first)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [17:18:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] update-wmf-ca-certificates: add group/other read flags to cert bundle [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742485 (owner: 10Elukey) [17:18:26] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Initial Beta Cluster deployment of Wikifunctions: I - IS for T289315 (duration: 00m 55s) [17:18:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:30] T289315: Work out how we're going to have "production-like" versions of the wikifunctions evaluator and orchestrator services in Beta Cluster - https://phabricator.wikimedia.org/T289315 [17:18:50] (03PS4) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: II - Services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740792 (https://phabricator.wikimedia.org/T289315) [17:19:25] (03CR) 10Jforrester: [C: 03+2] Initial Beta Cluster deployment of Wikifunctions: II - Services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740792 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [17:20:08] (03Merged) 10jenkins-bot: Initial Beta Cluster deployment of Wikifunctions: II - Services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740792 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [17:21:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:57] (03Merged) 10jenkins-bot: Update termbox to 2021-11-26-093451-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/742167 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [17:22:22] !log jforrester@deploy1002 Synchronized wmf-config/ProductionServices.php: Initial Beta Cluster deployment of Wikifunctions: II - Services for T289315 (duration: 00m 55s) [17:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:18] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[17:25:18] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [17:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:56] (03PS1) 10Muehlenhoff: Disable cluster rebalances temporarily [puppet] - 10https://gerrit.wikimedia.org/r/742499 (https://phabricator.wikimedia.org/T284811) [17:26:02] (03CR) 10Andrew Bogott: [C: 03+1] Retire labnet-users group [puppet] - 10https://gerrit.wikimedia.org/r/742440 (https://phabricator.wikimedia.org/T296574) (owner: 10Muehlenhoff) [17:26:16] (03PS2) 10Muehlenhoff: Disable cluster rebalances temporarily [puppet] - 10https://gerrit.wikimedia.org/r/742499 (https://phabricator.wikimedia.org/T284811) [17:27:18] oh good, my `helmfile test` failed [17:28:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:51] (03CR) 10Andrew Bogott: [C: 03+1] "Assuming you've already tested this on the proxy VMs I'm happy to merge it." [puppet] - 10https://gerrit.wikimedia.org/r/742267 (https://phabricator.wikimedia.org/T129800) (owner: 10Majavah) [17:28:57] jouncebot: nowandnext [17:28:57] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [17:28:57] In 0 hour(s) and 31 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T1800) [17:29:21] (03CR) 10Ladsgroup: [C: 03+2] rdbms: Add DB host to TransactionProfiler logging and fix time fields [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742259 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [17:29:26] Amir1: ... 
[17:29:28] (typing) [17:29:35] I’m currently trying to figure out a termbox deployment [17:29:50] Lucas_WMDE: mine will take at least twenty minutes to merge [17:29:51] (03CR) 10Andrew Bogott: [C: 03+2] puppetmaster::gitsync: Replace cron job with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/732991 (https://phabricator.wikimedia.org/T273673) (owner: 10Majavah) [17:29:53] ok [17:29:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:34] (03PS5) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 (https://phabricator.wikimedia.org/T289315) [17:34:02] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10wiki_willy) ++ @nskaggs & @Andrew - since this server is scheduled for a refresh in Q4 (line 130 on the procurement doc), are you guys ok with not fixing this, and just sticking w... 
[17:34:05] (03CR) 10Jforrester: [C: 03+2] Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [17:34:24] (03CR) 10Jbond: [C: 03+1] "lgtm, but see nit/question thanks" [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742485 (owner: 10Elukey) [17:34:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742486 (owner: 10Elukey) [17:34:49] (03Merged) 10jenkins-bot: Initial Beta Cluster deployment of Wikifunctions: III - CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740793 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [17:36:24] (03CR) 10Andrew Bogott: [C: 03+2] Replace wmflabs-project with wmcs-project in various scripts [puppet] - 10https://gerrit.wikimedia.org/r/740900 (owner: 10Majavah) [17:36:38] (03PS5) 10Arturo Borrero Gonzalez: ceph: auth: introduce new parameter 'import_to_ceph' [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) [17:36:40] (03PS10) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [17:38:21] (03PS11) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [17:38:46] !log pool cp3064 - T290005 [17:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:51] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [17:39:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10wiki_willy) Hi @jcrespo & 
@LSobanski - it looks like this machine is due to be r... [17:40:00] (03CR) 10Andrew Bogott: [C: 03+1] acme_chief: add -rw to ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/739283 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [17:40:45] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Initial Beta Cluster deployment of Wikifunctions: III - CS for T289315 (duration: 00m 55s) [17:40:47] (03PS5) 10Jforrester: Initial Beta Cluster deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) [17:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:49] T289315: Work out how we're going to have "production-like" versions of the wikifunctions evaluator and orchestrator services in Beta Cluster - https://phabricator.wikimedia.org/T289315 [17:40:56] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [17:41:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:16] anyone around who’s familiar with kubernetes deployments and helmfile stuff? [17:41:26] I think I messed up termbox a bit (not too badly at the moment, as far as I can tell) [17:42:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:42:07] (03CR) 10Jforrester: [C: 03+2] Initial Beta Cluster deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [17:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:58] (03PS12) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [17:43:14] (03Merged) 10jenkins-bot: Initial Beta Cluster deployment of Wikifunctions: IV - IS-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740794 (https://phabricator.wikimedia.org/T289315) (owner: 10Jforrester) [17:44:00] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [17:44:31] (03Abandoned) 10Arturo Borrero Gonzalez: cloud: cinder-backups: use main ceph cinder keyring [puppet] - 10https://gerrit.wikimedia.org/r/740551 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [17:45:54] (03CR) 10Elukey: update-wmf-ca-certificates: add group/other read flags to cert bundle (031 comment) [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742485 (owner: 10Elukey) [17:46:31] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10Andrew) @wiki_willy we can probably live without it for a few months; I'll do a bit of research to figure out what would be a good replacement (it was one in a set of three handli... 
[17:47:15] (I'm clear on prod., BTW) [17:47:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] Disable cluster rebalances temporarily [puppet] - 10https://gerrit.wikimedia.org/r/742499 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [17:47:23] jelto might know (see Lucas_WMDE's message) [17:47:33] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10Andrew) a:05wiki_willy→03Andrew [17:48:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:21] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10nskaggs) @wiki_willy Is there an advantage to getting the normal refresh order in now? Would you recommend it? I know prices and lead times have been an issue this year. [17:49:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:33] (03Merged) 10jenkins-bot: rdbms: Add DB host to TransactionProfiler logging and fix time fields [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742259 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [17:50:15] Mine can wait a bit, it makes prod dirty for a bit but nbd [17:50:29] let me know when I can proceed [17:50:44] Amir1: right now I’m just twiddling thumbs tbh [17:50:52] how long will yours take?
[17:51:10] (if I don’t hear back from anyone I’ll eventually roll out the termbox deploy to the other two clusters, the change itself seems to be okay) [17:51:17] not much, I need to test it [17:51:22] (I’m just worried about leaving termbox-staging-service-checker in a bad state) [17:51:27] Amir1: I’d say go ahead [17:52:08] sure, my knowledge of k8s is not great otherwise I would have helped :( [17:52:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:54:41] Lucas_WMDE: I'm afk currently, I can take a look in ~30minutes. So helm tests during a helmfile sync of termbox in staging failed or what is the problem? [17:55:06] I ran `helmfile -e staging -l name=staging test --cleanup` and it failed [17:55:12] (i.e. manual `helmfile test`) [17:55:29] the regular sync worked fine as far as I can tell and I probably just shouldn’t have run the test [17:55:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:55:35] but I’m not sure how to clean it up myself [17:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:01] jelto: should I sync eqiad+codfw in the meantime, and we’ll look at the failed test in 30 minutes? [17:56:18] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10EBernhardson) Another round of import tests completed, nothing fell over. Calling this done for now. [17:56:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:56:29] (as far as I can tell from a manual test, the new version in staging is working fine) [17:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:40] (03PS1) 10Kosta Harlan: SuggestedEdits: Drop isActivated() check in getJsData [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742260 (https://phabricator.wikimedia.org/T296626) [17:57:00] works fine, moving forward [17:57:54] (03PS1) 10Kosta Harlan: Revert "Revert "GrowthExperiments: Start imagerecommendation variant experiment"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742261 [17:57:57] (03CR) 10Dduvall: [C: 04-1] mediawiki: Install yaml extension for use by SettingsBuilder (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [17:58:00] (03PS2) 10Kosta Harlan: Revert "Revert "GrowthExperiments: Start imagerecommendation variant experiment"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742261 [17:58:09] (03PS3) 10Kosta Harlan: Revert "Revert "GrowthExperiments: Start imagerecommendation variant experiment"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742261 [17:58:23] LucasWMDE: I'll check that helm tests in staging in a moment. Maybe this is also due to some helmfile changes last week. [17:58:30] (03PS4) 10Kosta Harlan: GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742261 [17:58:33] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10wiki_willy) Hi @nskaggs - sure, we can totally order the "refresh of cloudvirt101[6-8]" now, and have it to arrive in early Q3 if you want. The lead times with vendors have been... [17:58:36] alright, thanks. 
standing by [17:58:58] I'm a bit concerned about proceeding because the test 2 hours ago went fine [17:58:58] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:742259|rdbms: Add DB host to TransactionProfiler logging and fix time fields (T295706)]] (duration: 00m 56s) [17:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:03] T295706: Improve TransactionProfiler as replacement for tendril's slow queries - https://phabricator.wikimedia.org/T295706 [17:59:23] yeah, the termbox-*test*-service-checker pod looks fine (but is older than the staging one) [18:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T1800). [18:04:40] (03PS2) 10Cathal Mooney: Modified loopback4 filter to allow NTP commands to run [homer/public] - 10https://gerrit.wikimedia.org/r/742460 (https://phabricator.wikimedia.org/T296623) [18:06:04] Lucas_WMDE: ah okay, thanks for the hint. I would guess the --name test in your helmfile test command caused this error.
I'll take a look in a moment [18:06:12] jelto: for all I know, it’s possible that the test for the staging release in the staging cluster never worked [18:06:25] IIUC, the *test* release in the staging cluster is used for Test Wikidata [18:06:32] but I don’t know if the staging release in there is properly configured [18:06:56] (03PS13) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [18:07:06] (03PS1) 10Jforrester: [Beta Cluster] Also declare wgCanonicalServer for wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742503 [18:07:27] (the kubectl logs indicate a connection error due to Name or service not known')': /?spec) [18:08:53] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [18:08:55] (03PS1) 10Urbanecm: uzwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742504 (https://phabricator.wikimedia.org/T294245) [18:11:11] PROBLEM - exim queue on mx2001 is CRITICAL: CRITICAL: 4316 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim [18:14:34] (03PS1) 10Ayounsi: Enable DHCP relay on mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/742505 [18:15:40] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:16:14] yo [18:16:20] Wikis are slow [18:16:21] paged [18:16:21] Lucas_WMDE: my uneducated guess is that this logic is breaking down https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/termbox/templates/tests/test-service-checker.yaml#L12 [18:16:25] (03CR) 10Ebernhardson: [C: 03+1] cirrussearch: s/sanitizer/saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/740711 (https://phabricator.wikimedia.org/T295705) (owner: 10Ryan Kemper) [18:16:44] things are not loading for me [18:16:48] same here [18:16:49] * jbond here [18:16:58] (mobile, lmk if I can help still) [18:17:01] looks like ddos [18:17:05] Yeah nothing here [18:17:10] hi [18:17:45] here too [18:17:58] !log depool cp3064 - T290005 [18:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:03] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [18:18:09] -> discussion in _security [18:18:21] 10SRE, 10DNS, 10Traffic, 10WMF-Communications: Setup subdomain for Foundation messaging site - https://phabricator.wikimedia.org/T296570 (10Varnent) We will be sharing this site with all staff around December 1. Domain is not necessary per se as we have a temporary domain - but do we have a sense of when i... 
[18:18:34] Do we have / need a task for tracking [18:18:41] Also can someone set topic [18:19:17] o/ [18:19:43] Amir1: --> _security [18:19:52] PROBLEM - LVS ncredir esams port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:20:05] urbanecm: can you use your topic privs? [18:20:08] (03PS2) 10Ebernhardson: Move CirrusSearch traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742497 (https://phabricator.wikimedia.org/T295705) [18:20:16] PROBLEM - LVS ncredir esams port 80/tcp - Non canonical domains redirect service IPv4 #page on ncredir-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:20:16] sure [18:20:20] what should i use it for? [18:20:27] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 14.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:20:37] Phab down, or just me? [18:20:37] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenConfirm - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:20:44] MatmaRex: ddos [18:20:49] (03CR) 10jerkins-bot: [V: 04-1] SuggestedEdits: Drop isActivated() check in getJsData [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742260 (https://phabricator.wikimedia.org/T296626) (owner: 10Kosta Harlan) [18:20:50] MatmaRex: known, _security [18:20:56] i see [18:20:59] urbanecm: say we're down? 
[18:21:09] thanks legoktm [18:21:13] I'm not sure how wide impacting but definitely EU [18:21:30] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 33%, RTA = 298.26 ms [18:21:32] mirrored in -tech too [18:23:00] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742260 (https://phabricator.wikimedia.org/T296626) (owner: 10Kosta Harlan) [18:24:24] PROBLEM - Host ncredir-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [18:25:38] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [18:25:48] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [18:26:10] RECOVERY - LVS ncredir esams port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 5.227 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:26:32] RECOVERY - LVS ncredir esams port 80/tcp - Non canonical domains redirect service IPv4 #page on ncredir-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 3.489 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:26:40] RECOVERY - Host ncredir-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 71%, RTA = 333.71 ms [18:26:44] PROBLEM - LVS ncredir-https esams port 443/tcp - Non canonical redirect service IPv4 #page on ncredir-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:26:56] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 33%, RTA = 152.71 ms [18:27:25] PROBLEM - Restbase edge esams 
on text-lb.esams.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ProtocolError(Connection broken: ConnectionResetError(104, Connection reset by peer), ConnectionResetError(104, Connection reset by peer)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [18:27:53] I can get on enwp now [18:28:09] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3062.esams.wmnet are marked down but pooled: testlb_443: Servers cp3058.esams.wmnet, cp3052.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3058.esams.wmnet, cp3052.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:28:44] PROBLEM - LVS ncredir-https esams port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:28:45] RECOVERY - LVS ncredir-https esams port 443/tcp - Non canonical redirect service IPv4 #page on ncredir-lb.esams.wikimedia.org is OK: OK - Certificate wikipedia.com will expire on Tue 15 Feb 2022 08:01:25 AM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:28:51] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [18:28:56] Or not [18:30:44] RECOVERY - LVS ncredir-https esams port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is OK: OK - Certificate wikipedia.com will expire on Tue 15 Feb 2022 08:01:25 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:33:18] PROBLEM - LVS text esams port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:33:43] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [18:35:18] RECOVERY - LVS text esams port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 609 bytes in 2.285 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:35:26] PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:35:29] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 439, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:36:36] (03PS1) 10Legoktm: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/742507 [18:37:24] RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is OK: OK - Certificate *.wikipedia.org will expire on Thu 17 Nov 2022 11:59:59 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:37:26] (03CR) 10jerkins-bot: [V: 04-1] Depool esams [dns] - 10https://gerrit.wikimedia.org/r/742507 (owner: 10Legoktm) [18:38:47] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: ncredirlb6_80: Servers ncredir3001.esams.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir3001.esams.wmnet are marked down but pooled: ncredirlb6_443: Servers ncredir3001.esams.wmnet are mar [18:38:47] but pooled: textlb_443: Servers cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet are marked down but pooled: testlb6_443: Servers cp3050.esams.wmnet, cp3054.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.esams.wmnet, cp3062.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet are marked down but pooled https:/ [18:38:47] h.wikimedia.org/wiki/PyBal [18:39:27] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 48.95 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:40:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Depool esams [dns] - 10https://gerrit.wikimedia.org/r/742507 (owner: 10Legoktm) [18:40:22] (03PS2) 10Legoktm: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/742507 [18:40:45] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:40:46] (03PS1) 10Majavah: add checkdoh-map to esams-offline [dns] - 10https://gerrit.wikimedia.org/r/742508 [18:40:49] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: 
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ProtocolError(Connection aborted., ConnectionResetError(104, Connection reset by peer)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [18:40:54] PROBLEM - LVS ncredir esams port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:41:07] (03PS1) 10BBlack: Add checkdough to esams-offline map [dns] - 10https://gerrit.wikimedia.org/r/742509 [18:41:44] (03CR) 10Legoktm: [C: 03+2] Depool esams [dns] - 10https://gerrit.wikimedia.org/r/742507 (owner: 10Legoktm) [18:42:00] (03PS2) 10BBlack: Add checkdough to esams-offline map [dns] - 10https://gerrit.wikimedia.org/r/742509 [18:42:07] !log depooling esams [18:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:29] (03CR) 10BBlack: [C: 03+2] Add checkdough to esams-offline map [dns] - 10https://gerrit.wikimedia.org/r/742509 (owner: 10BBlack) [18:43:29] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [18:45:04] PROBLEM - fastnetmon is alerting #page on netflow3001 is CRITICAL: CRITICAL: fastnetmon is alerting for 91.198.174.192 https://bit.ly/wmf-fastnetmon https://w.wiki/8oU [18:45:06] RECOVERY - LVS ncredir esams port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 1.209 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:45:18] (03Abandoned) 10Majavah: add checkdoh-map to esams-offline [dns] - 
10https://gerrit.wikimedia.org/r/742508 (owner: 10Majavah) [18:45:29] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.esams.wikimedia.org, port=443): Read timed out. (read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [18:45:45] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 55.11 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:46:50] (03PS1) 10BBlack: Revert "Depool esams" [dns] - 10https://gerrit.wikimedia.org/r/742510 [18:46:52] (03PS1) 10BBlack: Depool esams via esams-offline map [dns] - 10https://gerrit.wikimedia.org/r/742511 [18:47:02] PROBLEM - LVS text esams port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:47:53] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 102.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:48:18] (03CR) 10BBlack: [C: 03+2] Revert "Depool esams" [dns] - 10https://gerrit.wikimedia.org/r/742510 (owner: 10BBlack) [18:48:22] (03CR) 10BBlack: [C: 03+2] Depool esams via esams-offline map [dns] - 10https://gerrit.wikimedia.org/r/742511 (owner: 10BBlack) [18:48:52] !log esams: shifting depool method to esams-offline (now that its config is fixed) [18:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:58] RECOVERY - LVS text 
esams port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 623 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:49:13] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3062.esams.wmnet are marked down but pooled: testlb_443: Servers cp3050.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir3001.esams.wmnet are marked down but pooled: textlb_443: Servers cp3058.esams.wmnet, cp3052.esams.wmnet are marked down but pooled: testlb6_443: Ser [18:49:13] 054.esams.wmnet, cp3064.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.esams.wmnet, cp3062.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:49:21] PROBLEM - Disk space on lvs3005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.20.0.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [18:51:17] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is OK: All metrics within thresholds. 
https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [18:51:32] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [homer/public] - 10https://gerrit.wikimedia.org/r/742505 (owner: 10Ayounsi) [18:53:47] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:53:48] (03PS1) 10BBlack: Revert "Depool esams via esams-offline map" [dns] - 10https://gerrit.wikimedia.org/r/742513 [18:53:51] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [18:53:51] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [18:54:07] enwp down for me [18:54:13] (03CR) 10jerkins-bot: [V: 04-1] Revert "Depool esams via esams-offline map" [dns] - 10https://gerrit.wikimedia.org/r/742513 (owner: 10BBlack) [18:54:14] East coast [18:54:16] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:54:25] perryprog: known-ish, thanks [18:54:26] yeah that's eqiad [18:55:16] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "Depool esams via esams-offline map" [dns] - 10https://gerrit.wikimedia.org/r/742513 (owner: 10BBlack) [18:55:43] !log repooling esams [18:55:56] PROBLEM - LVS ncredir eqiad port 80/tcp - 
Non canonical domains redirect service IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:56:16] bblack: Failed to log message to wiki. Somebody should check the error logs. [18:56:44] PROBLEM - LVS ncredir-https eqiad port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:57:30] PROBLEM - LVS ncredir eqiad port 80/tcp - Non canonical domains redirect service IPv4 #page on ncredir-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:57:49] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs1013 is CRITICAL: cpu={0,10,12,14,2,4,6,8} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1013&var-datasource=eqiad+prometheus/ops [18:58:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 36.5 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:58:25] PROBLEM - Host phab.wmfusercontent.org is DOWN: PING CRITICAL - Packet loss = 100% [18:58:33] PROBLEM - Host commons.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [18:58:33] PROBLEM - Host en.wikibooks.org is DOWN: PING CRITICAL - Packet loss = 100% [18:58:49] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [18:59:02] RECOVERY - LVS ncredir-https eqiad port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is OK: OK - Certificate wikipedia.com will expire on Tue 15 Feb 2022 08:01:25 AM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:00:09] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 29.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:00:32] RECOVERY - LVS ncredir eqiad port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 7.166 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:00:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:00:39] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [19:00:40] RoanKattouw and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC evening backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T1900). [19:00:40] kostajh and ebernhardson: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:52] RECOVERY - LVS ncredir eqiad port 80/tcp - Non canonical domains redirect service IPv4 #page on ncredir-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:00:53] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: OK - Certificate *.wikipedia.org will expire on Thu 10 Feb 2022 08:02:21 AM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:01:00] i imagine we are skipping this deploy window for now :) [19:01:00] we're currently in a (partial?) outage, so B&C paused until further notice [19:01:09] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:01:19] ebernhardson: very likely. Sorry! :) [19:01:21] yes please [19:01:31] RECOVERY - Host commons.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [19:01:37] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:01:39] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [19:03:01] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 100 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:03:45] RECOVERY - Host phab.wmfusercontent.org is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [19:03:53] RECOVERY - Host en.wikibooks.org is UP: PING OK - Packet loss = 0%, RTA = 0.12 ms [19:03:59] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 95.11 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:03:59] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs1013 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1013&var-datasource=eqiad+prometheus/ops [19:04:15] welcome back Wikipedia? [19:05:38] seems like it [19:05:44] yup i'm back online [19:06:00] I'm back [19:06:11] (Netherlands) [19:09:47] (03PS1) 10Clare Ming: Provide fallback for config variable when not present [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742517 [19:09:55] RECOVERY - Disk space on lvs3005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [19:10:03] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10Andrew) I just did some digging and some thinking and I don't see a reason why we need to rush this refresh. Most likely these hosts will be replaced with thinvirts which won't me... [19:10:51] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:12:45] (03PS1) 10Gergő Tisza: AddImage: Refresh user's task feed after undecided rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742262 (https://phabricator.wikimedia.org/T296491) [19:14:11] Also now that we are back, does someone know if there is a phab project for klaxon? 
[19:15:11] Asartea: just the catchall #sre [19:15:30] RECOVERY - fastnetmon is alerting #page on netflow3001 is OK: OK: no fastnetmon alerts https://bit.ly/wmf-fastnetmon https://w.wiki/8oU [19:15:39] majavah thanks; someone pointed out on Discord that the FAQ link really should link to static [19:15:55] Asartea: I should make one :) [19:16:02] ah, yeah, that's an easy fix [19:16:04] I will do [19:16:12] you can patch it here: https://gerrit.wikimedia.org/r/admin/repos/operations/software/klaxon [19:16:17] there's also more coming Soon™ in that department [19:16:55] !log pool cp3064 - T290005 [19:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:00] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [19:17:12] cdanis still want the phab task or can you just easily do it? [19:17:32] perryprog you greatly overestimate my tech knowledge [19:17:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:17:52] I'll write the patch now [19:18:02] Asartea: can you link me to this discord? [19:18:06] (03PS1) 10Ahmon Dancy: wmf-beta-update-databases.py: Print error in a better way [puppet] - 10https://gerrit.wikimedia.org/r/742519 [19:18:14] (assuming it is open ofc) [19:18:21] cdanis https://en.wikipedia.org/wiki/Wikipedia:Discord [19:18:24] Asartea well I didn't expect cdanis to appear out of nowhere when I was grabbing the link :P [19:18:38] Sounds like the outage is under control. Do you think there will be a backport window soon? I'd start merging the extension backports then, they can take 30-40 min. 
[19:18:48] (03PS1) 10Majavah: link to wikitech-static [software/klaxon] - 10https://gerrit.wikimedia.org/r/742520 [19:19:16] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [19:19:21] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 49.16 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:20:32] (03CR) 10CDanis: [C: 03+2] "thank you!" [software/klaxon] - 10https://gerrit.wikimedia.org/r/742520 (owner: 10Majavah) [19:20:42] Asartea: {{done}} [19:20:50] majavah thanks a lot [19:21:53] (03Merged) 10jenkins-bot: link to wikitech-static [software/klaxon] - 10https://gerrit.wikimedia.org/r/742520 (owner: 10Majavah) [19:23:48] tgr: I think that would be fine now [19:23:51] (03PS3) 10Ryan Kemper: Move CirrusSearch traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742497 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [19:24:18] (03CR) 10Ryan Kemper: [C: 03+1] "We're ready to cut over whenever" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742497 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [19:24:37] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 101.1 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:28:53] RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [19:29:03] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 32.05 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:30:08] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10wiki_willy) Thanks @Andrew, sounds good. I'll leave it on the schedule for a Q4 refresh for now, but let me know if you end up needing it earlier, and we can always adjust and pu... [19:31:08] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Also declare wgCanonicalServer for wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742503 (owner: 10Jforrester) [19:31:52] (03Merged) 10jenkins-bot: [Beta Cluster] Also declare wgCanonicalServer for wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742503 (owner: 10Jforrester) [19:32:24] tgr: i take it you'll do the Growth backports? [19:32:33] (I'm clear.) [19:33:06] (03CR) 10Gergő Tisza: [C: 03+2] AddImage: Refresh user's task feed after undecided rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742262 (https://phabricator.wikimedia.org/T296491) (owner: 10Gergő Tisza) [19:33:26] (03CR) 10Gergő Tisza: [C: 03+2] SuggestedEdits: Drop isActivated() check in getJsData [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742260 (https://phabricator.wikimedia.org/T296626) (owner: 10Kosta Harlan) [19:33:36] urbanecm: yes [19:33:42] okay, great :) [19:33:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:33:44] should I do the others too? 
[19:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:58] ebernhardson usually self-serves, if my memory serves :)) [19:34:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:37] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:36:13] (03PS1) 10Clare Ming: Enable scroll tracking for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742524 (https://phabricator.wikimedia.org/T292586) [19:37:09] (03CR) 10jerkins-bot: [V: 04-1] Enable scroll tracking for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742524 (https://phabricator.wikimedia.org/T292586) (owner: 10Clare Ming) [19:38:41] (03PS2) 10Clare Ming: Enable scroll tracking for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742524 (https://phabricator.wikimedia.org/T292586) [19:40:35] (03PS1) 10Jforrester: Fix wgWikiLambdaOrchestratorLocation service pointer typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742525 [19:42:37] !log uploaded php-yaml_2.2.1+2.1.0+2.0.4+1.3.2-2+wmf1~buster1_amd64.changes to apt.wm.o (T296331) [19:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:42] T296331: Install php-yaml for use by SettingsLoader - https://phabricator.wikimedia.org/T296331 [19:46:23] (03PS1) 10SBassett: Fix special page displaying unescaped user input [extensions/FileImporter] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742263 (https://phabricator.wikimedia.org/T296605) [19:50:14] legoktm: Are you doing php-yaml for php72 too? 
[19:50:56] totally didn't forget about that [19:51:00] will do now [19:51:00] :-D [19:51:06] Was building it for CI locally. [19:51:46] As `php7.2-yaml`? [19:51:46] ooh are we getting php-yaml for CI? [19:51:57] Nikerabbit: We're getting it everywhere in production, so yes. [19:53:09] James_F: no, it'll be just `php-yaml` [19:53:14] legoktm: Ack. [19:53:28] But php7.4-yaml per https://apt-browser.toolforge.org/buster-wikimedia/component/php74/ ? [19:53:43] yes [19:54:10] (03CR) 10Jdlrobson: [C: 03+1] Enable scroll tracking for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742524 (https://phabricator.wikimedia.org/T292586) (owner: 10Clare Ming) [19:55:58] (03PS4) 10Ebernhardson: Move CirrusSearch traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742497 (https://phabricator.wikimedia.org/T295705) [19:56:03] I should also make the PHP 8.1 images. Meh. [19:56:32] (03CR) 10Ebernhardson: [C: 03+2] Move CirrusSearch traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742497 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [19:57:17] (03Merged) 10jenkins-bot: AddImage: Refresh user's task feed after undecided rejection [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742262 (https://phabricator.wikimedia.org/T296491) (owner: 10Gergő Tisza) [19:57:19] (03Merged) 10jenkins-bot: Move CirrusSearch traffic back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742497 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [19:57:28] (03Merged) 10jenkins-bot: SuggestedEdits: Drop isActivated() check in getJsData [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742260 (https://phabricator.wikimedia.org/T296626) (owner: 10Kosta Harlan) [19:57:41] * ebernhardson didn't realize that patch from earlier was still pending [20:00:00] legoktm: deployment-deploy01 is still stretch, I guess that means we need 
the packages for stretch too [20:00:15] wut [20:00:17] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T295705 Move CirrusSearch traffic back to eqiad (duration: 00m 56s) [20:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:21] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [20:01:06] I'll look after lunch but the stretch packages are already out of date IIRC [20:01:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:45] or switch over to deploy03, it's otherwise ready but needs someone to set it up on jenkins [20:02:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:23] (03PS5) 10Gergő Tisza: GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742261 (owner: 10Kosta Harlan) [20:08:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:17] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742261 (owner: 10Kosta Harlan) [20:12:05] (03Merged) 10jenkins-bot: GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742261 (owner: 10Kosta Harlan) [20:12:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10herron) [20:13:28] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10herron) [20:15:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:01] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10herron) Hi @odimitrijevic @Ottomata could you please review/approve this request? Thanks in advance! [20:19:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10herron) [20:20:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10herron) Hi @odimitrijevic @Ottomata could you please review/approve this request? Thanks in advance! 
[20:21:16] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/HomepageModules/SuggestedEdits.php: Backport: [[gerrit:742260|SuggestedEdits: Drop isActivated() check in getJsData (T296626)]] (duration: 00m 56s) [20:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:20] T296626: TypeError: this.config.gateConfig[taskType] is undefined - https://phabricator.wikimedia.org/T296626 [20:23:21] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/AddImageSubmissionHandler.php: Backport: [[gerrit:742262|AddImage: Refresh user's task feed after undecided rejection (T296491)]] (duration: 00m 56s) [20:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:25] T296491: Add an image: Investigate removing tasks from a user's queue after they have rejected them - https://phabricator.wikimedia.org/T296491 [20:26:39] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742261|GrowthExperiments: Start imagerecommendation variant experiment]] (duration: 00m 55s) [20:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:30] (03PS1) 10Ebernhardson: [DNM] check puppetdb output for expanded selector [puppet] - 10https://gerrit.wikimedia.org/r/742528 [20:27:34] !log UTC evening deploys done [20:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:19] PROBLEM - Check systemd state on cumin1001 
is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:19] ^ checking [20:36:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:40:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:40:38] jouncebot: next [20:40:38] In 0 hour(s) and 19 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T2100) [20:44:11] (03CR) 10Jforrester: [C: 03+2] Fix wgWikiLambdaOrchestratorLocation service pointer typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742525 (owner: 10Jforrester) [20:44:56] (03Merged) 10jenkins-bot: Fix wgWikiLambdaOrchestratorLocation service pointer typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742525 (owner: 10Jforrester) [20:46:32] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Fix wgWikiLambdaOrchestratorLocation service pointer typo (duration: 00m 55s) [20:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:16] (03PS1) 10Ayounsi: Turn on prepending for esams and eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/742535 [20:48:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:50:03] (03CR) 10Ayounsi: [C: 03+2] Enable DHCP relay on mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/742505 (owner: 10Ayounsi) [20:50:20] (03CR) 10Ayounsi: [C: 03+2] Turn on prepending for esams and eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/742535 (owner: 10Ayounsi) [20:51:11] (03Merged) 10jenkins-bot: Enable DHCP relay on mr1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/742505 (owner: 10Ayounsi) [20:51:15] (03Merged) 10jenkins-bot: Turn on prepending for esams and eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/742535 (owner: 10Ayounsi) [20:54:32] majavah: Other than just updating the Jenkins config, is there anything else that needs to be done to switch from deployment-deploy01 to deployment-deploy03? 'Cos I can do that in about five seconds. :-) [20:54:40] (03CR) 10Accraze: [C: 03+1] "Nice catch elukey!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/741844 (owner: 10Elukey) [20:56:21] jenkins config + switching relevant hiera values should be enough for mediawiki [20:56:43] Ack. [20:56:48] Want me to just do it? [20:56:51] the local apt repo for scap is a special case, but I can deal with it afterwards [20:57:39] sure. thanks! [20:58:16] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) 05Open→03In progress a:05MMandere→03RobH stealing to update the checklists for bios updates, receiving in, and everything else. These have already been imaged and are calling into... 
[20:59:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10odimitrijevic) Approved [21:00:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10odimitrijevic) Approved [21:00:04] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T2100). [21:00:57] majavah: Done. [21:01:12] majavah: You doing the hiera values or should I? [21:01:25] I can do it [21:03:20] (03PS1) 10Majavah: hieradata: Swap deployment-prep deploy host [puppet] - 10https://gerrit.wikimedia.org/r/742538 (https://phabricator.wikimedia.org/T278689) [21:03:55] (03PS3) 10Cathal Mooney: Add drmrs public prefix to ntp allowed config [puppet] - 10https://gerrit.wikimedia.org/r/742462 (https://phabricator.wikimedia.org/T296623) [21:04:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:05:43] (03CR) 10Jforrester: [C: 03+1] hieradata: Swap deployment-prep deploy host [puppet] - 10https://gerrit.wikimedia.org/r/742538 (https://phabricator.wikimedia.org/T278689) (owner: 10Majavah) [21:06:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:12:34] 10SRE, 10foundation.wikimedia.org, 10serviceops, 10User-Urbanecm: Investigate and restore foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10RLazarus) p:05Triage→03Medium [21:13:43] 10SRE, 10foundation.wikimedia.org, 10serviceops, 10User-Urbanecm_WMF (GovWiki): Investigate and restore foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10Urbanecm_WMF) a:05Urbanecm→03Urbanecm_WMF Reassigning with my contractor hat :). [21:14:30] urbanecm: oops, scuse me :) [21:15:49] (03PS1) 10RLazarus: httpbb: Temporarily remove foundationwiki 302 httpbb test [puppet] - 10https://gerrit.wikimedia.org/r/742543 (https://phabricator.wikimedia.org/T296687) [21:17:07] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) 05In progress→03Open a:05RobH→03MMandere [21:17:09] rzl: no problem at all :). [21:17:46] (03CR) 10RLazarus: [C: 03+2] httpbb: Temporarily remove foundationwiki 302 httpbb test [puppet] - 10https://gerrit.wikimedia.org/r/742543 (https://phabricator.wikimedia.org/T296687) (owner: 10RLazarus) [21:20:00] (03CR) 10Ayounsi: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/742462 (https://phabricator.wikimedia.org/T296623) (owner: 10Cathal Mooney) [21:20:50] PASS: 112 requests sent to mw1418.eqiad.wmnet. All assertions passed. 
[21:21:03] 👍 the alert will self-resolve on the next hourly run [21:27:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:27:27] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) [21:28:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:29:17] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:02] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) [21:45:31] (03PS1) 10Ebernhardson: query_service: Collect wdqs and wcqs jmx metrics separately [puppet] - 10https://gerrit.wikimedia.org/r/742566 (https://phabricator.wikimedia.org/T280008) [21:45:54] (03Abandoned) 10Ebernhardson: [DNM] check puppetdb output for expanded selector [puppet] - 10https://gerrit.wikimedia.org/r/742528 (owner: 10Ebernhardson) [21:51:10] Hey all - I'd like to start a couple of sec deploys now (a few mins before the "official" window). [21:51:51] (03CR) 10Ebernhardson: "PCC seems to think this will work, but it doesn't look complete. 
Particularly the config output by PCC only collects from wdqs{1003,1011," [puppet] - 10https://gerrit.wikimedia.org/r/742566 (https://phabricator.wikimedia.org/T280008) (owner: 10Ebernhardson) [21:52:18] (03CR) 10Ebernhardson: "i suppose i forgot the pcc results link: https://puppet-compiler.wmflabs.org/compiler1003/32720/" [puppet] - 10https://gerrit.wikimedia.org/r/742566 (https://phabricator.wikimedia.org/T280008) (owner: 10Ebernhardson) [21:52:22] (03CR) 10SBassett: [C: 03+2] Fix special page displaying unescaped user input [extensions/FileImporter] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742263 (https://phabricator.wikimedia.org/T296605) (owner: 10SBassett) [22:00:05] Reedy and sbassett: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211129T2200). [22:05:39] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) [22:14:21] (03Merged) 10jenkins-bot: Fix special page displaying unescaped user input [extensions/FileImporter] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742263 (https://phabricator.wikimedia.org/T296605) (owner: 10SBassett) [22:20:13] !log sbassett@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/FileImporter/src/Remote/MediaWiki/HttpApiLookup.php: Backport: [[gerrit:742263|SECURITY: Fix special page displaying unescaped user input (T296605)]] (duration: 00m 56s) [22:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:47] (03CR) 10Jdlrobson: [C: 03+1] Provide fallback for config variable when not present [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742517 (owner: 10Clare Ming) [22:25:41] (03PS1) 10Jgreen: flip fundraisingdb-read.wmnet back from frdb1004 to frdb1002 [dns] - 10https://gerrit.wikimedia.org/r/742571 [22:29:20] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) 05Open→03Resolved All hosts have been imaged already, and bios/idrac firmware updates completed. I've filled out the checkboxes, and now this task is resolved. [22:31:29] (03CR) 10Dwisehaupt: [C: 03+1] "Looks good." [dns] - 10https://gerrit.wikimedia.org/r/742571 (owner: 10Jgreen) [22:32:47] !log sbassett@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/EntitySchema/src/MediaWiki/Specials/SetEntitySchemaLabelDescriptionAliases.php: Deploy security patch for T296578 (duration: 00m 55s) [22:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:43] (03CR) 10Jgreen: [C: 03+2] flip fundraisingdb-read.wmnet back from frdb1004 to frdb1002 [dns] - 10https://gerrit.wikimedia.org/r/742571 (owner: 10Jgreen) [22:55:38] (03CR) 10Cwhite: [C: 03+2] hieradata: Swap deployment-prep deploy host [puppet] - 10https://gerrit.wikimedia.org/r/742538 (https://phabricator.wikimedia.org/T278689) (owner: 10Majavah) [23:00:53] (03CR) 10Ebernhardson: cirrussearch: fix grafana dashboard links (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740708 (owner: 10Ryan Kemper) [23:01:00] (03CR) 10Ebernhardson: [C: 03+1] cirrussearch: fix grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/740708 (owner: 10Ryan Kemper) [23:17:13] (03PS3) 10Legoktm: Set $wgMaxImageArea = false; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725101 (https://phabricator.wikimedia.org/T291014) 
[23:33:28] (03PS2) 10RLazarus: Initial deb package [docker-images/imagecatalog] (debian) - 10https://gerrit.wikimedia.org/r/738500 [23:33:30] (03PS1) 10RLazarus: Merge tag '0.0.1' into debian [docker-images/imagecatalog] (debian) - 10https://gerrit.wikimedia.org/r/742573 [23:41:55] (03CR) 10RLazarus: Initial deb package (033 comments) [docker-images/imagecatalog] (debian) - 10https://gerrit.wikimedia.org/r/738500 (owner: 10RLazarus)