[00:00:02] dontpanic: done, thanks for keeping the configuration organized
[00:00:19] :)
[00:00:38] !log west coast evening deploys done
[00:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:36] SRE, Wikimedia-Mailing-lists: Disable "Unblock-pt-l" - https://phabricator.wikimedia.org/T293591 (Dzahn) Interestingly enough I could post: ` Prezado(a), O e-mail para solicitações de desbloqueio mudou para unblock-ptwiki@wikimedia.org Por favor, encaminhe seu pedido novamente para esse novo endereço...
[00:12:44] SRE, Wikimedia-Mailing-lists: Disable "Unblock-pt-l" - https://phabricator.wikimedia.org/T293591 (Dzahn) {F34698922}
[00:13:23] SRE, Wikimedia-Mailing-lists: Disable "Unblock-pt-l" - https://phabricator.wikimedia.org/T293591 (Dzahn) In progress→Resolved
[00:17:48] (CR) RLazarus: [C: +1] mediawiki::appserver: fetch additional MaxMind databases on all appservers [puppet] - https://gerrit.wikimedia.org/r/732099 (https://phabricator.wikimedia.org/T288844) (owner: Dzahn)
[00:54:30] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[01:15:04] PROBLEM - Check systemd state on analytics1066 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@116.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:25] (PS2) Krinkle: [Beta Cluster] mc-labs.php: Remove onHostRoutingPrefix for WAN cache [mediawiki-config] - https://gerrit.wikimedia.org/r/731817 (https://phabricator.wikimedia.org/T264604)
[01:20:34] (CR) Krinkle: [C: +2] [Beta Cluster] mc-labs.php: Remove onHostRoutingPrefix for WAN cache [mediawiki-config] - https://gerrit.wikimedia.org/r/731817 (https://phabricator.wikimedia.org/T264604) (owner: Krinkle)
[01:21:17] (Merged) jenkins-bot: [Beta Cluster] mc-labs.php: Remove onHostRoutingPrefix for WAN cache [mediawiki-config] - https://gerrit.wikimedia.org/r/731817 (https://phabricator.wikimedia.org/T264604) (owner: Krinkle)
[01:28:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:05] SRE, serviceops, Patch-For-Review: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (Jdforrester-WMF)
[03:07:46] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[03:11:54] RECOVERY - Check systemd state on analytics1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:46] PROBLEM - Check systemd state on analytics1066 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@116.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:36] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[04:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:29] (PS5) Legoktm: Enable $wgLocalHTTPProxy on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/731862 (https://phabricator.wikimedia.org/T288848)
[04:37:11] (CR) Legoktm: [C: +2] Enable $wgLocalHTTPProxy on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/731862 (https://phabricator.wikimedia.org/T288848) (owner: Legoktm)
[04:37:54] (Merged) jenkins-bot: Enable $wgLocalHTTPProxy on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/731862 (https://phabricator.wikimedia.org/T288848) (owner: Legoktm)
[04:40:02] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable $wgLocalHTTPProxy on group0 wikis (T288848) (duration: 01m 05s)
[04:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:09] T288848: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848
[04:42:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[04:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[04:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:52:53] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[04:54:51] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[04:58:47] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:04:47] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.88 ms
[05:26:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2112.codfw.wmnet with OS buster
[05:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:44] SRE, ops-codfw, DBA: Upgrade db2112 firmware/BIOS - https://phabricator.wikimedia.org/T293740 (Marostegui) Unfortunately the firmware upgrades didn't fix the installer issue, the host gets stuck at: ` boot: Loading debian-installer/amd64/linux... ok Loading debian-installer/amd64/initrd.gz...ok...
[05:32:29] SRE, ops-codfw, DBA: Upgrade db2112 firmware/BIOS - https://phabricator.wikimedia.org/T293740 (Marostegui) Spoke too soon, it took a while but it finally got into the installer!
[05:32:45] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 234, down: 2, dormant: 0, excluded: 2, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:42:37] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:47:52] SRE, DBA, observability, Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic query killer - https://phabricator.wikimedia.org/T293531 (Marostegui) >>! In T293531#7442058, @herron wrote: > Does/could the query killer itself write an additional log to sys...
[05:58:02] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix rsyslog template [deployment-charts] - https://gerrit.wikimedia.org/r/731988 (owner: Giuseppe Lavagetto)
[05:59:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2112.codfw.wmnet with OS buster
[05:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:39] SRE, ops-codfw, DBA: Upgrade db2112 firmware/BIOS - https://phabricator.wikimedia.org/T293740 (Marostegui) Installed all fine! Thanks everyone!
[06:02:31] (Merged) jenkins-bot: mediawiki: fix rsyslog template [deployment-charts] - https://gerrit.wikimedia.org/r/731988 (owner: Giuseppe Lavagetto)
[06:05:32] !log put transport link between ulsfo and eqsin in service - T273308
[06:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:38] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:25] PROBLEM - Hadoop DataNode on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:11:59] (PS1) Marostegui: db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/732115 (https://phabricator.wikimedia.org/T290868)
[06:12:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 (s8) for upgrade', diff saved to https://phabricator.wikimedia.org/P17549 and previous config saved to /var/cache/conftool/dbconfig/20211020-061202-marostegui.json
[06:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:31] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[06:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:40] (CR) Marostegui: [C: +2] db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/732115 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui)
[06:14:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1126.eqiad.wmnet with OS buster
[06:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:40] (PS1) Marostegui: dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/732118 (https://phabricator.wikimedia.org/T290865)
[06:20:24] (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/732118 (https://phabricator.wikimedia.org/T290865) (owner: Marostegui)
[06:21:11] !log Depool clouddb1013 for upgrade
[06:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:47] (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/732072
[06:24:46] (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/732072 (owner: Marostegui)
[06:25:13] RECOVERY - Hadoop DataNode on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:28:34] !log reboot analytics1066 - OS showing CPU soft lockups, tons of defunct processes (including node manager) and high CPU usage
[06:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:00] !log restarting blazegraph on wdqs1012
[06:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:12] (CR) Gergő Tisza: [C: +1] GrowthExperiments: Add campaign pattern for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/731928 (https://phabricator.wikimedia.org/T293699) (owner: Kosta Harlan)
[06:34:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 (s1) for upgrade', diff saved to https://phabricator.wikimedia.org/P17550 and previous config saved to /var/cache/conftool/dbconfig/20211020-063431-marostegui.json
[06:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:24] !log Upgrade db1106
[06:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106 (s1) after upgrade', diff saved to https://phabricator.wikimedia.org/P17551 and previous config saved to /var/cache/conftool/dbconfig/20211020-063926-marostegui.json
[06:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:09] (PS1) Marostegui: Revert "es2021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732073
[06:40:53] (PS1) Giuseppe Lavagetto: mediawiki: fix comma [deployment-charts] - https://gerrit.wikimedia.org/r/732246
[06:41:10] (CR) jerkins-bot: [V: -1] mediawiki: fix comma [deployment-charts] - https://gerrit.wikimedia.org/r/732246 (owner: Giuseppe Lavagetto)
[06:41:18] RECOVERY - Check systemd state on analytics1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:28] (PS2) Giuseppe Lavagetto: mediawiki: fix comma [deployment-charts] - https://gerrit.wikimedia.org/r/732246
[06:41:44] (CR) Marostegui: [C: +2] Revert "es2021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/732073 (owner: Marostegui)
[06:41:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1126.eqiad.wmnet with OS buster
[06:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:18] (PS1) Marostegui: db1118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/732248 (https://phabricator.wikimedia.org/T290865)
[06:45:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118 (s1) for reimage T290865', diff saved to https://phabricator.wikimedia.org/P17552 and previous config saved to /var/cache/conftool/dbconfig/20211020-064529-marostegui.json
[06:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:36] T290865: Upgrade s1 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290865
[06:46:08] (CR) Marostegui: [C: +2] db1118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/732248 (https://phabricator.wikimedia.org/T290865) (owner: Marostegui)
[06:47:26] (CR) Filippo Giunchedi: [C: +1] kafka_shipper: map site->kafka cluster name & point codfw to codfw brokers [puppet] - https://gerrit.wikimedia.org/r/731976 (https://phabricator.wikimedia.org/T293439) (owner: Herron)
[06:47:28] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix comma [deployment-charts] - https://gerrit.wikimedia.org/r/732246 (owner: Giuseppe Lavagetto)
[06:49:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1118.eqiad.wmnet with OS buster
[06:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:18] (CR) Filippo Giunchedi: [C: +2] icinga: read monitoring groups from wikimedia_clusters [puppet] - https://gerrit.wikimedia.org/r/731943 (https://phabricator.wikimedia.org/T286467) (owner: Filippo Giunchedi)
[06:52:01] (Merged) jenkins-bot: mediawiki: fix comma [deployment-charts] - https://gerrit.wikimedia.org/r/732246 (owner: Giuseppe Lavagetto)
[06:54:17] SRE, Infrastructure-Foundations, netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (elukey) Hello folks! Not sure if already scheduled but it seems that the current icinga checks for the codfw ripe atlas are getting a 410 gone, do we need to update the `ripeatlas_measuremen...
[07:08:04] (PS1) Marostegui: dbproxy1018: Depool clouddb1018 [puppet] - https://gerrit.wikimedia.org/r/732251 (https://phabricator.wikimedia.org/T293855)
[07:09:06] (PS2) Marostegui: dbproxy1018: Depool clouddb1020 [puppet] - https://gerrit.wikimedia.org/r/732251 (https://phabricator.wikimedia.org/T293855)
[07:09:40] (CR) Marostegui: [C: +2] dbproxy1018: Depool clouddb1020 [puppet] - https://gerrit.wikimedia.org/r/732251 (https://phabricator.wikimedia.org/T293855) (owner: Marostegui)
[07:09:55] (CR) Muehlenhoff: [C: +1] "Looks good (the spec test failure seems unrelated)" [puppet] - https://gerrit.wikimedia.org/r/732097 (https://phabricator.wikimedia.org/T293449) (owner: Legoktm)
[07:09:56] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:27] (CR) Muehlenhoff: aptrepo: Add component/php74 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732096 (https://phabricator.wikimedia.org/T293449) (owner: Legoktm)
[07:10:58] (CR) Muehlenhoff: [C: +2] Add ownership annotations for IF services [puppet] - https://gerrit.wikimedia.org/r/731934 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[07:12:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[07:13:40] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 24, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:21] (CR) Muehlenhoff: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/731935 (owner: Ema)
[07:16:24] SRE, MW-on-K8s, SRE Observability, serviceops, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe)
[07:16:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1118.eqiad.wmnet with OS buster
[07:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:53] SRE, Analytics, SRE Observability (FY2021/2022-Q2): statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (fgiunchedi) Thank you for the quick followup everyone! Please note that this work isn't super urgent on our (o11y) end, although graphite/statsd are in "support mo...
[07:23:46] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:38] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/730863 (owner: Dzahn)
[07:27:46] RECOVERY - Device not healthy -SMART- on analytics1066 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1066&var-datasource=eqiad+prometheus/ops
[07:27:47] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org
[07:36:07] (PS1) Ayounsi: Update ripeatlas_measurements for codfw [puppet] - https://gerrit.wikimedia.org/r/732252 (https://phabricator.wikimedia.org/T267714)
[07:37:16] (PS1) Marostegui: Revert "dbproxy1018: Depool clouddb1020" [puppet] - https://gerrit.wikimedia.org/r/732074
[07:37:23] (PS1) Elukey: kubernetes: add revscoring-draftquality settings for ml-serve [puppet] - https://gerrit.wikimedia.org/r/732253 (https://phabricator.wikimedia.org/T293858)
[07:38:00] (CR) Marostegui: [C: +2] Revert "dbproxy1018: Depool clouddb1020" [puppet] - https://gerrit.wikimedia.org/r/732074 (owner: Marostegui)
[07:40:48] (CR) Filippo Giunchedi: [C: +1] "LGTM. Feel free to +2 (and CI will merge) or merge yourself. The alerts will be deployed at the next puppet run." [alerts] - https://gerrit.wikimedia.org/r/731919 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[07:44:10] (PS1) Bartosz Dziewoński: Make reply tool available as opt-out on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/732254 (https://phabricator.wikimedia.org/T293687)
[07:45:40] (PS1) Elukey: kubernetes: add fake settings for revscoring-draftquality for ml-serve [labs/private] - https://gerrit.wikimedia.org/r/732255 (https://phabricator.wikimedia.org/T293858)
[07:46:09] (CR) Elukey: [V: +2 C: +2] kubernetes: add fake settings for revscoring-draftquality for ml-serve [labs/private] - https://gerrit.wikimedia.org/r/732255 (https://phabricator.wikimedia.org/T293858) (owner: Elukey)
[07:46:52] (CR) Elukey: [C: +2] kubernetes: add revscoring-draftquality settings for ml-serve [puppet] - https://gerrit.wikimedia.org/r/732253 (https://phabricator.wikimedia.org/T293858) (owner: Elukey)
[07:52:00] (PS1) Elukey: hemfile.d: add the revscoring-draftquality namespace for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/732256 (https://phabricator.wikimedia.org/T293858)
[07:54:28] SRE, Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (MoritzMuehlenhoff)
[07:59:28] (CR) Elukey: [C: +2] hemfile.d: add the revscoring-draftquality namespace for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/732256 (https://phabricator.wikimedia.org/T293858) (owner: Elukey)
[08:01:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:01:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:39] (PS1) Muehlenhoff: Add ownership annotations for Data Engineering services [puppet] - https://gerrit.wikimedia.org/r/732257 (https://phabricator.wikimedia.org/T216088)
[08:16:22] (CR) JMeybohm: "I don't think we h" [deployment-charts] - https://gerrit.wikimedia.org/r/731917 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi)
[08:23:00] (PS2) David Caro: start_instance_with_prefix: Group options in a dataclass [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/731908
[08:23:02] (PS2) David Caro: start_instance_with_prefix: allow integer-suffixed prefixes [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/731910
[08:23:04] (PS2) David Caro: start_instance_with_prefix: fix next instance counter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/731911 (https://phabricator.wikimedia.org/T292465)
[08:23:06] (PS2) David Caro: start_instance_with_prefix: add tries parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/731912 (https://phabricator.wikimedia.org/T292465)
[08:23:08] (PS2) David Caro: start_instance_with_prefix: work around extra stderr message [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/731913
[08:23:10] (PS4) David Caro: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465)
[08:23:23] (Abandoned) David Caro: InstanceCreationOpts: Add a way to genera cli args [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/731909 (owner: David Caro)
[08:23:36] (Abandoned) David Caro: grid: Added a couple cookbooks to add a new webgrid generic node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/731914 (owner: David Caro)
[08:24:52] (PS2) Filippo Giunchedi: mwdebug: fix statsd network policy [deployment-charts] - https://gerrit.wikimedia.org/r/731917 (https://phabricator.wikimedia.org/T247963)
[08:25:21] (CR) Filippo Giunchedi: mwdebug: fix statsd network policy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/731917 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi)
[08:27:16] (CR) jerkins-bot: [V: -1] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[08:34:25] (PS1) Urbanecm: emailuser ratelimit: Use user-global rather than user [mediawiki-config] - https://gerrit.wikimedia.org/r/732260 (https://phabricator.wikimedia.org/T293866)
[08:34:40] SRE, MW-on-K8s, serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (JMeybohm) Open→Resolved
[08:34:46] SRE, MW-on-K8s, serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (JMeybohm)
[08:36:16] (CR) Jbond: sre: add contool aware SREBatchRunnerBase (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/731153 (owner: Jbond)
[08:36:19] (PS5) Jbond: sre: add contool aware SREBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/731153
[08:38:52] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:30] (CR) Volans: [C: +1] "LGTM, totally ok to merge as is as first iteration. I've a more generic question inline." [cookbooks] - https://gerrit.wikimedia.org/r/731153 (owner: Jbond)
[08:49:44] (PS1) Kosta Harlan: CreateAccountCampaign: Support for recurring donors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.5) - https://gerrit.wikimedia.org/r/732076 (https://phabricator.wikimedia.org/T293699)
[08:50:12] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Schema change s6 T277116
[08:50:15] (PS1) Kosta Harlan: CreateAccountCampaign: Support for recurring donors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - https://gerrit.wikimedia.org/r/732077 (https://phabricator.wikimedia.org/T293699)
[08:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:19] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116
[08:50:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Schema change s6 T277116
[08:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:28] (CR) Elukey: [C: +1] "LGTM, this is a partial list though (but I suppose that you are adding tags in batches so more will come)." [puppet] - https://gerrit.wikimedia.org/r/732257 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[08:58:42] (CR) Elukey: [C: +1] Update ripeatlas_measurements for codfw [puppet] - https://gerrit.wikimedia.org/r/732252 (https://phabricator.wikimedia.org/T267714) (owner: Ayounsi)
[09:04:37] SRE, Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (MoritzMuehlenhoff) The tests happened with the ad hoc test cluster setup a few months ago (new hardware for a proper 3 node test cluster was ordered, but 10G NICs had a lead time of...
[09:04:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Schema change s5 T277116
[09:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:04] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116
[09:05:06] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Schema change s5 T277116
[09:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:49] (CR) Jbond: [C: +1] builder/systemtap: convert role::systemtap::devserver to profile [puppet] - https://gerrit.wikimedia.org/r/730863 (owner: Dzahn)
[09:07:40] (CR) Muehlenhoff: "This was in fact added in batches, but this is the last of them picking up the crumbs: The other clusters like Hadoop or Kafka Jumbo are a" [puppet] - https://gerrit.wikimedia.org/r/732257 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[09:12:21] (CR) Elukey: [C: +1] Add ownership annotations for Data Engineering services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732257 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[09:12:37] (PS2) Seddon: Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) (owner: Eric Gardner)
[09:13:22] SRE, DBA, Platform Engineering, Sustainability (Incident Followup): Lower automatic query killing threshold to 55 seconds - https://phabricator.wikimedia.org/T293533 (Marostegui)
[09:17:19] (CR) Elukey: Add the first data-engineering team alert to Alertmanager (1 comment) [alerts] - https://gerrit.wikimedia.org/r/731919 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[09:17:33] (PS6) Jbond: sre: add contool aware SREBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/731153
[09:17:58] (PS7) Jbond: sre: add contool aware SREBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/731153
[09:18:14] (CR) Jbond: "update thanks" [cookbooks] - https://gerrit.wikimedia.org/r/731153 (owner: Jbond)
[09:23:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:25:42] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:27:42] Amir1: wouldn't it make sense to renew in advance of the Critical?
[09:33:19] SRE, ops-codfw, DC-Ops, observability, User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (fgiunchedi) I spent some time today with this, I can successfully `snmpwalk` the device, yet librenms refuses to add it. So on the device's end things seem to be...
[09:34:17] SRE, MW-on-K8s, SRE Observability, serviceops, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe) The php-fpm logs are output to stderr, which goes to logstash at the moment using the physical node rsyslog, but it's under a different se...
[09:36:39] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/731153 (owner: Jbond)
[09:47:23] (PS1) Muehlenhoff: Add remaining ownership annotations for ML services [puppet] - https://gerrit.wikimedia.org/r/732268 (https://phabricator.wikimedia.org/T216088)
[09:47:32] (CR) Jbond: [C: +2] sre: add contool aware SREBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/731153 (owner: Jbond)
[09:48:41] (CR) Elukey: [C: +1] "No idea about the ORES poolcounter works but it should fall on our Radar indeed :)" [puppet] - https://gerrit.wikimedia.org/r/732268 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[09:49:12] (PS8) Jbond: sre: add conftool aware SREBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/731153
[09:49:54] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 91.94% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[09:52:01] SRE, ops-codfw, DC-Ops, observability, User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (fgiunchedi) Ok device added (with a temporary password), apparently librenms with zero `v3` configuration specified won't work
[09:52:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 11 hosts with reason: Schema change s7 T277116
[09:52:35] (CR) Btullis: Add the first data-engineering team alert to Alertmanager (1 comment) [alerts] - https://gerrit.wikimedia.org/r/731919 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[09:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:42] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116
[09:52:43] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 11 hosts with reason: Schema change s7 T277116
[09:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:55] (CR) Btullis: [C: +2] Add the first data-engineering team alert to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/731919 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[09:58:29] (CR) Btullis: [C: +1] Add ownership annotations for Data Engineering services [puppet] - https://gerrit.wikimedia.org/r/732257 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[09:58:57] (Merged) jenkins-bot: Add the first data-engineering team alert to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/731919 (https://phabricator.wikimedia.org/T293399) (owner: Btullis)
[09:59:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Schema change s2 T277116
[09:59:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Schema change s2 T277116
[09:59:31] SRE, ops-codfw, DC-Ops, observability, User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (fgiunchedi) @papaul the device is now collecting data, however AFAICS only outlets are discovered for now. Let's see when more data is accumulated if the numbers...
[09:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:32] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [09:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 13 hosts with reason: Schema change s4 T277116 [10:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:07] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [10:13:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Schema change s4 T277116 [10:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:35] (03PS1) 10Filippo Giunchedi: install_server: use standard recipe for all graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/732273 (https://phabricator.wikimedia.org/T247963) [10:18:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Private Data Users for Naray-ctr - https://phabricator.wikimedia.org/T293810 (10TAndic) Hi @Dzahn! @NaRay should need both Kerberos and shell access for her scope of work. Good call on following @KCVelaga_WMF's requests -- the... [10:51:18] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, and 2 others: The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Joe) 05Open→03Resolved [10:51:24] 10SRE, 10MW-on-K8s, 10serviceops: Make all httpbb tests pass on the mwdebug deployment. 
- https://phabricator.wikimedia.org/T285298 (10Joe) [10:51:50] (03PS1) 10Giuseppe Lavagetto: httpbb: change headers test for /static/current [puppet] - 10https://gerrit.wikimedia.org/r/732280 (https://phabricator.wikimedia.org/T285298) [10:57:09] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2005.codfw.wmnet [10:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:57] (03CR) 10Volans: [C: 03+1] "submitting" [cookbooks] - 10https://gerrit.wikimedia.org/r/731153 (owner: 10Jbond) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T1100). [11:00:05] urbanecm: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] i'll self-service [11:00:17] ok [11:00:21] (03CR) 10Urbanecm: [C: 03+2] CreateAccountCampaign: Support for recurring donors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732076 (https://phabricator.wikimedia.org/T293699) (owner: 10Kosta Harlan) [11:00:25] (03CR) 10Urbanecm: [C: 03+2] CreateAccountCampaign: Support for recurring donors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/732077 (https://phabricator.wikimedia.org/T293699) (owner: 10Kosta Harlan) [11:00:32] (03PS2) 10Urbanecm: GrowthExperiments: Add campaign pattern for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731928 (https://phabricator.wikimedia.org/T293699) (owner: 10Kosta Harlan) [11:00:36] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Add campaign pattern for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731928 (https://phabricator.wikimedia.org/T293699) (owner: 10Kosta Harlan) [11:00:48] Lucas_WMDE: 
unless you want to try out a full scap :)) [11:00:53] no thanks :D [11:00:57] okay okay :) [11:02:32] (03Merged) 10jenkins-bot: GrowthExperiments: Add campaign pattern for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731928 (https://phabricator.wikimedia.org/T293699) (owner: 10Kosta Harlan) [11:07:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:35] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. - btullis@cumin1001 [11:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:26] PROBLEM - MariaDB Replica Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 434.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:15:32] RECOVERY - MariaDB Replica Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:15:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e520fc57411bb19123766192cd636396ea6fc59d: GrowthExperiments: Add campaign pattern for enwiki (T293699) (duration: 01m 22s) [11:15:56] waiting on CI [11:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:57] T293699: Donors to newcomers: recurring donor landing page - https://phabricator.wikimedia.org/T293699 [11:16:55] (03PS1) 10Majavah: debian: drop the upstart script [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/732287 [11:20:03] (03PS1) 10Majavah: debian: drop 
help2man [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/732288 [11:20:44] (03CR) 10Majavah: [C: 03+2] debian: Rename package to toolforge-webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/731220 (owner: 10Majavah) [11:20:51] (03PS6) 10Jbond: interfaces: remove ethtool configueration [puppet] - 10https://gerrit.wikimedia.org/r/662699 (https://phabricator.wikimedia.org/T236208) [11:21:31] !log installing ffmpeg security updates [11:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:08] (03Merged) 10jenkins-bot: debian: Rename package to toolforge-webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/731220 (owner: 10Majavah) [11:23:37] (03PS1) 10Majavah: d/changelog: Prepare for 0.78 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/732289 [11:24:46] (03Merged) 10jenkins-bot: CreateAccountCampaign: Support for recurring donors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732076 (https://phabricator.wikimedia.org/T293699) (owner: 10Kosta Harlan) [11:25:21] waiting for wmf.4 too [11:27:58] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (10Joe) [11:29:05] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, and 2 others: The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Joe) 05Resolved→03Open Sadly I found a problem with our current approach: any file under static/current that is... 
[11:32:25] (03Merged) 10jenkins-bot: CreateAccountCampaign: Support for recurring donors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/732077 (https://phabricator.wikimedia.org/T293699) (owner: 10Kosta Harlan) [11:32:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2005.codfw.wmnet [11:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:37] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2005.codfw.wmnet` - testvm2005.codfw.wmnet (**PASS**) - Downtimed host on Icing... [11:37:11] !log urbanecm@deploy1002 Started scap: 802d3b7: e4f7f85: CreateAccountCampaign: Support for recurring donors (T293699) [11:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:17] T293699: Donors to newcomers: recurring donor landing page - https://phabricator.wikimedia.org/T293699 [11:37:19] so, let's see how long this will take [11:37:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. 
- btullis@cumin1001 [11:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:02] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2007.codfw.wmnet [11:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:44] (03PS1) 10Majavah: P::toolforge: rename toollabs-webservice package name [puppet] - 10https://gerrit.wikimedia.org/r/732291 [11:41:03] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: varnishmtail metric loss due to performance issues - https://phabricator.wikimedia.org/T293879 (10ema) [11:41:10] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: varnishmtail metric loss due to performance issues - https://phabricator.wikimedia.org/T293879 (10ema) p:05Triage→03High [11:46:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2007.codfw.wmnet [11:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:12] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2007.codfw.wmnet` - testvm2007.codfw.wmnet (**WARN**) - //Host not found on Ici... [11:47:35] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: varnishmtail metric loss due to performance issues - https://phabricator.wikimedia.org/T293879 (10ema) [11:49:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:36] (03CR) 10Jgiannelos: Configure event stream for map tiles state change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [11:52:42] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: varnishmtail metric loss due to performance issues - https://phabricator.wikimedia.org/T293879 (10ema) [11:55:58] (03PS5) 10Jgiannelos: Configure event stream for map tiles state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) [11:57:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T1200) [12:01:08] (03CR) 10David Caro: [C: 03+1] d/changelog: Prepare for 0.78 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/732289 (owner: 10Majavah) [12:02:02] (03CR) 10David Caro: [C: 03+1] P::toolforge: rename toollabs-webservice package name [puppet] - 10https://gerrit.wikimedia.org/r/732291 (owner: 10Majavah) [12:02:30] !log urbanecm@deploy1002 Finished scap: 802d3b7: e4f7f85: CreateAccountCampaign: Support for recurring donors (T293699) (duration: 25m 19s) [12:02:38] didn't take that long [12:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:41] T293699: Donors to newcomers: recurring donor landing page - https://phabricator.wikimedia.org/T293699 [12:03:22] (03CR) 10Muehlenhoff: [C: 03+1] "One more down, excellent :-)" [puppet] - 10https://gerrit.wikimedia.org/r/732273 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [12:06:37] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: use 
standard recipe for all graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/732273 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [12:09:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:51] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10akosiaris) Wow, that's a very detailed writeup. Thanks! Couple of comments inline: > This will require a restart of all instances (via gnt-instance reboot FOO, not from within the... [12:10:58] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for 0.78 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/732289 (owner: 10Majavah) [12:11:01] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (10ema) [12:11:05] (03CR) 10Ayounsi: [C: 03+2] Update ripeatlas_measurements for codfw [puppet] - 10https://gerrit.wikimedia.org/r/732252 (https://phabricator.wikimedia.org/T267714) (owner: 10Ayounsi) [12:12:11] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.78 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/732289 (owner: 10Majavah) [12:14:08] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 92.78% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:14:39] 10SRE, 10Traffic, 10Performance-Team (Radar), 10User-ema: Package and deploy Varnish 6.0.8 - https://phabricator.wikimedia.org/T292290 (10ema) 05Open→03Resolved a:03ema All hosts upgraded. 
[12:17:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:16] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:23:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10ayounsi) >>! In T267714#7443286, @elukey wrote: > Hello folks! Not sure if already scheduled but it seems that the current icinga checks for the codfw ripe atlas are ge... [12:24:34] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:26:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10cmooney) 05In progress→03Resolved Cool, thanks @ayounsi. Good insight into how those alerts are configured. I'll know for the next time to update them too :) [12:28:16] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:28:30] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) >>! In T265435#7443667, @fgiunchedi wrote: > Ok device added (with a temporary password), apparently librenms with zero `v3` configuration specified... 
[12:30:20] (03PS1) 10Btullis: Remove the alluxio user and group [puppet] - 10https://gerrit.wikimedia.org/r/732296 (https://phabricator.wikimedia.org/T266641) [12:31:44] (03PS1) 10Jbond: cas 6.4.2: merge in upstream changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732297 [12:32:58] (03PS1) 10Filippo Giunchedi: librenms: add snmp v3 dummy section [puppet] - 10https://gerrit.wikimedia.org/r/732298 (https://phabricator.wikimedia.org/T265435) [12:33:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P::toolforge: rename toollabs-webservice package name [puppet] - 10https://gerrit.wikimedia.org/r/732291 (owner: 10Majavah) [12:36:38] (03PS1) 10Filippo Giunchedi: move data-engineering to team-data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/732300 [12:36:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 14 hosts with reason: Schema change s1 T277116 [12:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:05] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [12:37:09] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 14 hosts with reason: Schema change s1 T277116 [12:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:39] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10JMeybohm) >>! In T288851#7443633, @Joe wrote: > One possible solution to all of our problems would be: > * let php-fpm log to two files, both in... 
[12:38:37] (03CR) 10Ssingh: [C: 03+2] anycast_monitoring: add check for durum [puppet] - 10https://gerrit.wikimedia.org/r/731399 (owner: 10Ssingh) [12:40:36] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 80%, RTA = 6158.75 ms [12:42:36] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:42:48] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:43:22] (03PS1) 10Kormat: mariadb: Remove absented alias file [puppet] - 10https://gerrit.wikimedia.org/r/732302 (https://phabricator.wikimedia.org/T291352) [12:44:08] (03CR) 10Kormat: [C: 03+2] mariadb: Remove absented alias file [puppet] - 10https://gerrit.wikimedia.org/r/732302 (https://phabricator.wikimedia.org/T291352) (owner: 10Kormat) [12:46:26] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:46:32] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 33%, RTA = 1404.15 ms [12:48:04] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:48:16] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:49:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:49:07] 
(03CR) 10Btullis: [C: 03+1] "Ah yes I see, sorry about that." [alerts] - 10https://gerrit.wikimedia.org/r/732300 (owner: 10Filippo Giunchedi) [12:50:39] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) Thanks for doublechecking the steps! >>! In T284811#7443983, @akosiaris wrote: > We don't have to wait for the update. We can do `sudo gnt-cluster modify --hyper... [12:51:38] !log cp3062: bump vsl_space from 80M (default) to 512M T293879 - varnish restart needed [12:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:44] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 [12:53:10] (03PS2) 10Jbond: cas 6.4.2: merge in upstream changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732297 [12:54:26] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3062 is CRITICAL: connect to address 10.20.0.62 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [12:54:44] PROBLEM - Webrequests Varnishkafka log producer on cp3062 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:54:52] this is me, the service is depooled ^ [12:57:10] PROBLEM - statsv Varnishkafka log producer on cp3062 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:57:38] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3062 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:58:04] RECOVERY - Webrequests Varnishkafka log producer on cp3062 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S 
/etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:58:20] RECOVERY - statsv Varnishkafka log producer on cp3062 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:00:04] hashar and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T1300). [13:00:35] (03PS1) 10Muehlenhoff: Add ownership annotations for WMCS services [puppet] - 10https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) [13:01:41] (03CR) 10Muehlenhoff: Add ownership annotations for WMCS services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:04:11] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3062.esams.wmnet,service=varnish-fe [13:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:20] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3062.esams.wmnet,service=ats-tls [13:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:09] (03CR) 10Muehlenhoff: [C: 03+1] "Let's do it :-) One nit inline" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732297 (owner: 10Jbond) [13:08:05] (03PS3) 10Jbond: cas 6.4.2: merge in upstream changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732297 (https://phabricator.wikimedia.org/T293186) [13:10:49] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (10ema) Trying the lowest possible hanging fruit first, namely rising vsl_space. I've first tried setting it to 512M as mentioned in the SAL... 
[13:10:58] (03PS1) 10Jbond: puppetboard::ng: fix apache type Location vs Directory [puppet] - 10https://gerrit.wikimedia.org/r/732311 [13:11:31] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 7 hosts with reason: Schema change s3 T277116 [13:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 7 hosts with reason: Schema change s3 T277116 [13:11:37] T277116: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 [13:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:54] (03CR) 10Jbond: [C: 03+2] puppetboard::ng: fix apache type Location vs Directory [puppet] - 10https://gerrit.wikimedia.org/r/732311 (owner: 10Jbond) [13:14:37] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (10ema) [13:16:24] good afternoon [13:16:30] I am going to promote group 1 [13:16:36] (03PS1) 10Jbond: pdev-puppetboard: update cas settings to use cloud idp [puppet] - 10https://gerrit.wikimedia.org/r/732315 [13:16:49] (03CR) 10Jbond: [V: 03+2 C: 03+2] pdev-puppetboard: update cas settings to use cloud idp [puppet] - 10https://gerrit.wikimedia.org/r/732315 (owner: 10Jbond) [13:17:24] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10akosiaris) >>! In T284811#7444095, @MoritzMuehlenhoff wrote: >>> Since we have a few instances running in plain mode (etcd nodes), we'll need to apply a similar scheme when reimagi... 
[13:18:05] (03PS1) 10Hashar: group1 wikis to 1.38.0-wmf.5 refs T281169 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732316 [13:18:07] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.38.0-wmf.5 refs T281169 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732316 (owner: 10Hashar) [13:19:18] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.5 refs T281169 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732316 (owner: 10Hashar) [13:20:51] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.5 refs T281169 [13:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:59] T281169: 1.38.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T281169 [13:21:53] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10akosiaris) > We also have another problem: how to treat and collect php slow logs. Right now I'm sending them to stderr but that gets us a lot o... [13:21:54] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.5 refs T281169 (duration: 01m 02s) [13:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:05] (03CR) 10Filippo Giunchedi: [C: 03+2] move data-engineering to team-data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/732300 (owner: 10Filippo Giunchedi) [13:24:27] (03PS6) 10Ottomata: Configure event stream for map tiles state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [13:24:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:56] (03CR) 10Ottomata: [C: 03+1] "+1, perhaps you could add a comment explaining that maps.tile_change will be removed asap?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [13:26:28] PROBLEM - Disk space on cp3062 is CRITICAL: DISK CRITICAL - free space: /var/lib/varnish 14 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3062&var-datasource=esams+prometheus/ops [13:27:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:38] (03PS1) 10Jbond: puppetboard-ng.erb: add uwsgi_port [puppet] - 10https://gerrit.wikimedia.org/r/732318 [13:28:02] (03CR) 10Jbond: [C: 03+2] puppetboard-ng.erb: add uwsgi_port [puppet] - 10https://gerrit.wikimedia.org/r/732318 (owner: 10Jbond) [13:30:01] ACKNOWLEDGEMENT - Disk space on cp3062 is CRITICAL: DISK CRITICAL - free space: /var/lib/varnish 14 MB (2% inode=99%): Ema Ongoing experiment, see T293879 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3062&var-datasource=esams+prometheus/ops [13:30:05] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) >>! In T288851#7444291, @akosiaris wrote: >> We also have another problem: how to treat and collect php slow logs. Right now I'm sending th... 
[13:31:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732297 (https://phabricator.wikimedia.org/T293186) (owner: 10Jbond) [13:31:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10DAbad) I am Emil's manager and this request is approved [13:31:47] not much beside some deprecation [13:32:52] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) >>! In T284811#7444271, @akosiaris wrote: >> But with the reimage, just shutting them down means we'd lose the VMs? So I think we can either briefly transition th... [13:33:08] (03PS7) 10Jgiannelos: Configure event stream for map tiles state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) [13:36:12] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10akosiaris) >>! In T288851#7443633, @Joe wrote: > The php-fpm logs are output to stderr, which goes to logstash at the moment using the physical... [13:37:50] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) >>! In T288851#7444368, @akosiaris wrote: >>>! In T288851#7443633, @Joe wrote: >> The php-fpm logs are output to stderr, which goes to log... 
[13:39:22] (03Abandoned) 10Herron: kafka_shipper: map site -> brokers centrally & point codfw to site local brokers [puppet] - 10https://gerrit.wikimedia.org/r/731774 (https://phabricator.wikimedia.org/T293439) (owner: 10Herron) [13:39:36] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10akosiaris) >>! In T288851#7444372, @Joe wrote: >>>! In T288851#7444368, @akosiaris wrote: >>>>! In T288851#7443633, @Joe wrote: >>> The php-fpm... [13:40:27] !log installing apache2 security updates on buster [13:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:10] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:42:47] hashar: We should just revert https://github.com/wikimedia/mediawiki/commit/ffb0dfc87bcf1b5b20e6d7f7891c47412199ba99 [13:43:26] uh, not that one [13:43:36] (03PS1) 10Jbond: puppetboard-ng.erb: add uwsgi_port (also to the template) [puppet] - 10https://gerrit.wikimedia.org/r/732319 [13:43:55] (03CR) 10Jbond: [C: 03+2] puppetboard-ng.erb: add uwsgi_port (also to the template) [puppet] - 10https://gerrit.wikimedia.org/r/732319 (owner: 10Jbond) [13:44:03] https://github.com/wikimedia/mediawiki/commit/5b515cfaf019720333292884676c6ed4ece26f59 [13:47:14] Reedy: I am poking the content-transformers team about it [13:47:48] it is similar to how platform deprecates a bunch of methods, they do their best to do preemptive patches and we catch the rest during the train [13:47:49] I just created a revert for .5 in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/732085 [13:47:58] the patches will follow [13:49:26] (03PS1) 10Majavah: cr-cloud: add tls ports for openstack services [homer/public] - 10https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194) [13:54:26] (03CR) 10Hashar: [C: 04-1] "It is similar
to how platform engineering team deprecates bunch of methods, they do their best to do preemptive patches and we catch the r" [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732085 (owner: 10Reedy) [13:57:35] (03CR) 10Reedy: ">Hard deprecation MUST NOT be applied to code still used in Wikimedia maintained code. Such usage MUST be removed first." [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732085 (owner: 10Reedy) [13:59:32] (03PS1) 10Jbond: service::uwsgi: drop python for ge bullseye [puppet] - 10https://gerrit.wikimedia.org/r/732324 [14:00:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31773/console" [puppet] - 10https://gerrit.wikimedia.org/r/732324 (owner: 10Jbond) [14:01:04] (03CR) 10Jbond: [V: 03+1 C: 03+2] service::uwsgi: drop python for ge bullseye [puppet] - 10https://gerrit.wikimedia.org/r/732324 (owner: 10Jbond) [14:01:59] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10cmooney) IRC update from Brandon. Traffic are checking if option 2B is viable with management. > Brandon Black > topranks: question_mark is going to talk with f... [14:04:19] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, and 2 others: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi thank you for getting the monitoring part up and running . 
[14:05:47] (03PS1) 10Ayounsi: drmrs initial prep [homer/public] - 10https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) [14:06:31] (03CR) 10jerkins-bot: [V: 04-1] drmrs initial prep [homer/public] - 10https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [14:09:41] (03PS1) 10Kevin Bazira: add enwiki-draftquality inference service to LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/732347 (https://phabricator.wikimedia.org/T293858) [14:10:30] (03CR) 10jerkins-bot: [V: 04-1] Revert "Hard deprecate the renamed ParserOutput::*Property() methods" [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732085 (owner: 10Reedy) [14:12:33] !log installing ruby2.3 security updates [14:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:18] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31774/console" [puppet] - 10https://gerrit.wikimedia.org/r/732296 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [14:18:41] is anyone around to review a patch to unbreak all of CI? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/732320/ [14:18:52] hashar: Reedy: perhaps you could? ^ [14:19:12] Lucas_WMDE: for sure! :) [14:19:26] * Lucas_WMDE looks in a second [14:19:48] thanks [14:20:09] commute time & [14:20:50] +2ed [14:20:57] (03CR) 10Elukey: [V: 03+1 C: 03+1] Remove the alluxio user and group [puppet] - 10https://gerrit.wikimedia.org/r/732296 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [14:21:23] (03CR) 10Elukey: [C: 03+2] add enwiki-draftquality inference service to LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/732347 (https://phabricator.wikimedia.org/T293858) (owner: 10Kevin Bazira) [14:24:08] Lucas_WMDE: thanks! 
[14:27:38] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:45] !log cp3062: test higher vsl_space values T293879 [14:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:50] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 [14:29:28] RECOVERY - Disk space on cp3062 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3062&var-datasource=esams+prometheus/ops [14:30:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:32:04] PROBLEM - Confd vcl based reload on cp3062 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [14:35:11] !log installing commons-io security updates on Buster [14:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:37] 10SRE, 10Observability-Logging, 10Traffic, 10User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (10fgiunchedi) [14:35:46] (03PS1) 10Ayounsi: Add drmrs network to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/732351 (https://phabricator.wikimedia.org/T283050) [14:35:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:36:06] RECOVERY - Confd vcl based reload on cp3062 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:37:55] (03PS2) 10Ayounsi: drmrs initial prep [homer/public] - 10https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) [14:38:32] (03CR) 10jerkins-bot: [V: 04-1] drmrs initial prep [homer/public] - 10https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [14:42:22] (03PS3) 10Ayounsi: drmrs initial prep [homer/public] - 10https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) [14:42:46] 10SRE, 10SRE Observability (FY2021/2022-Q2): Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (10fgiunchedi) [14:42:52] (03CR) 10jerkins-bot: [V: 04-1] drmrs initial prep [homer/public] - 10https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [14:44:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[14:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:31] (03PS4) 10Ayounsi: drmrs initial prep [homer/public] - 10https://gerrit.wikimedia.org/r/732346 (https://phabricator.wikimedia.org/T283050) [14:45:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:46:31] (03CR) 10Ottomata: [C: 03+1] Configure event stream for map tiles state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730848 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [14:46:51] !log installing irssi security updates on Buster [14:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:09] !log installing xmlgraphics-commons security updates on Buster [14:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:22] (03CR) 10Ahmon Dancy: [C: 03+1] httpbb: change headers test for /static/current [puppet] - 10https://gerrit.wikimedia.org/r/732280 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [14:52:50] is there any special place where errors go that happen during job execution? (e.g. 
PHP warnings) [14:53:02] so far I haven’t found much in logstash or in JobExecutor.log on mwlog1002 [14:57:50] !log installing modsecurity-crs security updates on Buster [14:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:03] (03PS1) 10Jbond: P:puppetboard::ng: update uwsgi to use an actual python module [puppet] - 10https://gerrit.wikimedia.org/r/732358 [15:10:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:11:09] (03CR) 10Jbond: [C: 03+2] P:puppetboard::ng: update uwsgi to use an actual python module [puppet] - 10https://gerrit.wikimedia.org/r/732358 (owner: 10Jbond) [15:13:24] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) [15:14:54] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Marostegui) [15:19:49] (03PS1) 10C. Scott Ananian: LqtDiscussionPager: Remove deprecated usage of setProperty [extensions/LiquidThreads] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732331 (https://phabricator.wikimedia.org/T293895) [15:21:04] (03PS1) 10C. Scott Ananian: Update deprecated calls in ShortDescHandler [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732332 (https://phabricator.wikimedia.org/T293860) [15:23:58] (03PS1) 10C. Scott Ananian: Replace use of deprecated ParserOutput:getProperty() [extensions/GeoCrumbs] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732333 (https://phabricator.wikimedia.org/T293894) [15:24:19] (03CR) 10C. 
Scott Ananian: "Should be made unnecessary by I70706415ae657f6783b7ae78e2dbe1b96ca31dbe, I78b15a6e12c462eb7f3c75b2549801e8324bc591, and I728b8aa566b23c602" [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732085 (owner: 10Reedy) [15:30:50] (03PS1) 10Jbond: puppetboard::ng: enable shared JS libraries [puppet] - 10https://gerrit.wikimedia.org/r/732364 [15:32:31] (03CR) 10Jbond: [C: 03+2] puppetboard::ng: enable shared JS libraries [puppet] - 10https://gerrit.wikimedia.org/r/732364 (owner: 10Jbond) [15:32:40] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [15:38:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) a:05aborrero→03ayounsi [15:39:00] !log volans@cumin2002 START - Cookbook sre.dns.netbox [15:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:42:06] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:43:56] (03PS1) 10Jbond: P:puppetboard:ng: pass through additional variables [puppet] - 10https://gerrit.wikimedia.org/r/732367 [15:43:58] (03PS1) 10Jbond: O:puppetboard::ng: add new role [puppet] - 10https://gerrit.wikimedia.org/r/732368 [15:44:34] (03CR) 10Jbond: [C: 03+2] P:puppetboard:ng: pass through additional variables [puppet] - 10https://gerrit.wikimedia.org/r/732367 (owner: 10Jbond) [15:49:12] (03PS2) 10Jbond: O:puppetboard::ng: add new role [puppet] - 10https://gerrit.wikimedia.org/r/732368 [15:52:28] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:52:54] (03Abandoned) 10Bernard Wang: Add WMEDesktopWebUIActionsTrackingOversampleLoggedInUsers config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731827 (https://phabricator.wikimedia.org/T292588) (owner: 10Bernard Wang) [15:54:18] (03PS1) 10Ottomata: [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - 10https://gerrit.wikimedia.org/r/732369 [15:54:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [15:56:39] (03PS2) 10Ottomata: [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - 10https://gerrit.wikimedia.org/r/732369 [15:58:51] (03CR) 10Jbond: [V: 03+2 C: 03+2] cas 6.4.2: merge in upstream changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732297 (https://phabricator.wikimedia.org/T293186) (owner: 10Jbond) [16:00:15] (03PS3) 10Ottomata: [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - 10https://gerrit.wikimedia.org/r/732369 [16:01:59] (03PS1) 10Jbond: changelog: bump version [software/cas-overlay-template] - 
10https://gerrit.wikimedia.org/r/732370 [16:02:15] (03CR) 10Jbond: [V: 03+2 C: 03+2] changelog: bump version [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732370 (owner: 10Jbond) [16:02:49] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31778/console" [puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [16:04:29] (03PS1) 10Jbond: gradle.properties: remove config for building locally [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732371 [16:05:12] (03PS4) 10Ottomata: [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - 10https://gerrit.wikimedia.org/r/732369 [16:05:27] (03CR) 10Jbond: [V: 03+2 C: 03+2] gradle.properties: remove config for building locally [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/732371 (owner: 10Jbond) [16:05:29] (03PS1) 10Lucas Werkmeister (WMDE): Remove dispatchViaJobs-related Wikibase settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828) [16:07:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "DNM until wmf.6 is safely rolled out to all wikis (or we’ve backported the relevant Wikibase changes to all other deployed branches)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/732372 (https://phabricator.wikimedia.org/T291828) (owner: 10Lucas Werkmeister (WMDE)) [16:08:15] (03PS1) 10Volans: sre.hosts.reimage: don't fail on new DC [cookbooks] - 10https://gerrit.wikimedia.org/r/732373 [16:08:46] (03PS1) 10JMeybohm: Add basic ingress support to chart scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 [16:09:47] 10SRE, 10Instrument-ClientError, 10MediaWiki-extensions-WikimediaEvents, 10observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (10colewhite) [16:09:52] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31779/console" [puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [16:13:13] (03CR) 10Ottomata: [V: 03+1] "No op! Whatchya think?" [puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [16:13:38] !log upload cas_6.4.2-1_amd64.deb [16:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:40] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) a ticket has been created with Dell, I entered a lot of explanation and troubleshooting in the ticket so hopefully, they will not push back. You have... 
[16:19:33] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install ganeti102[56] - https://phabricator.wikimedia.org/T293909 (10RobH) [16:19:48] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install ganeti102[56] - https://phabricator.wikimedia.org/T293909 (10RobH) a:03Jclark-ctr [16:19:52] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: don't fail on new DC [cookbooks] - 10https://gerrit.wikimedia.org/r/732373 (owner: 10Volans) [16:26:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Cmjohnson) [16:28:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Cmjohnson) 05Open→03Resolved @Jgreen the network switch has been updated, set to stage in netbox. They're all yours! [16:29:00] (03PS5) 10Ottomata: [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - 10https://gerrit.wikimedia.org/r/732369 [16:30:32] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Papaul) [16:30:40] (03CR) 10Hashar: [C: 03+2] "I will hot deploy it. Thank you" [extensions/GeoCrumbs] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732333 (https://phabricator.wikimedia.org/T293894) (owner: 10C. Scott Ananian) [16:31:52] (03CR) 10Hashar: [C: 03+2] LqtDiscussionPager: Remove deprecated usage of setProperty [extensions/LiquidThreads] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732331 (https://phabricator.wikimedia.org/T293895) (owner: 10C. 
Scott Ananian) [16:33:49] (03Merged) 10jenkins-bot: Replace use of deprecated ParserOutput:getProperty() [extensions/GeoCrumbs] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732333 (https://phabricator.wikimedia.org/T293894) (owner: 10C. Scott Ananian) [16:34:36] (03CR) 10Hashar: [C: 03+2] "Will deploy it as well as the other related patches for LiquidThreads and GeoCrumbs. Thank you!" [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732332 (https://phabricator.wikimedia.org/T293860) (owner: 10C. Scott Ananian) [16:34:45] (03Merged) 10jenkins-bot: LqtDiscussionPager: Remove deprecated usage of setProperty [extensions/LiquidThreads] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732331 (https://phabricator.wikimedia.org/T293895) (owner: 10C. Scott Ananian) [16:34:56] I will deploy a few hotfixes for ParserOutput::getProperty that got deprecated. That spams the logs T293860 T293895 T293894 [16:34:56] T293860: PHP Deprecated: Use of ParserOutput::setProperty was deprecated in MediaWiki 1.38. [Called from Wikibase\Client\Hooks\ShortDescHandler::doHandle] - https://phabricator.wikimedia.org/T293860 [16:34:57] T293895: PHP Deprecated: Use of ParserOutput::getProperty was deprecated in MediaWiki 1.38. [Called from LqtDiscussionPager::getPageLimit] - https://phabricator.wikimedia.org/T293895 [16:34:57] T293894: PHP Deprecated: Use of ParserOutput::getProperty was deprecated in MediaWiki 1.38.
[Called from GeoCrumbsHooks::makeTrail] - https://phabricator.wikimedia.org/T293894 [16:36:23] !log deploy refinery change for https://phabricator.wikimedia.org/T287084 [16:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:22] !log razzi@deploy1002 Started deploy [analytics/refinery@9e3295f]: Regular analytics weekly train [analytics/refinery@9e3295f] [16:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:22] (03CR) 10Legoktm: aptrepo: Add component/php74 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732096 (https://phabricator.wikimedia.org/T293449) (owner: 10Legoktm) [16:42:03] (03PS1) 10Legoktm: aptrepo: Don't add component/php74 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/732378 [16:42:26] (03PS2) 10Legoktm: aptrepo: Don't add component/php74 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/732378 [16:42:48] jouncebot: now [16:42:48] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [16:42:49] (03PS3) 10Legoktm: aptrepo: Don't add component/php74 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/732378 [16:43:15] (03PS1) 10Ayounsi: Add drmrs to DNS (Netbox generated records) [dns] - 10https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) [16:44:09] (03CR) 10jerkins-bot: [V: 04-1] Add drmrs to DNS (Netbox generated records) [dns] - 10https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) (owner: 10Ayounsi) [16:44:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:29] (03CR) 10Legoktm: [C: 03+2] aptrepo: Don't add component/php74 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/732378 (owner: 10Legoktm) [16:45:14] deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GeoCrumbs/+/732333 [16:47:01] (03PS2) 10Ayounsi: Add drmrs to DNS (Netbox generated records) [dns] - 10https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) [16:47:04] 10Puppet, 10SRE, 10Infrastructure-Foundations: package_builder puppet tests failing - https://phabricator.wikimedia.org/T293912 (10Legoktm) [16:47:08] (03CR) 10Klausman: [C: 03+1] Add remaining ownership annotations for ML services [puppet] - 10https://gerrit.wikimedia.org/r/732268 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [16:47:28] (03PS6) 10Legoktm: package_builder: Add hook for building PHP 7.4 packages [puppet] - 10https://gerrit.wikimedia.org/r/732097 (https://phabricator.wikimedia.org/T293449) [16:47:40] (03CR) 10Klausman: [C: 03+1] hemlfile.d: add the inference service to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/730965 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:48:20] (03CR) 10jerkins-bot: [V: 04-1] package_builder: Add hook for building PHP 7.4 packages [puppet] - 10https://gerrit.wikimedia.org/r/732097 (https://phabricator.wikimedia.org/T293449) (owner: 10Legoktm) [16:48:57] (03CR) 10Legoktm: [V: 03+2 C: 03+2] "Bypassing CI failure, filed as T293912" [puppet] - 10https://gerrit.wikimedia.org/r/732097 (https://phabricator.wikimedia.org/T293449) (owner: 10Legoktm) [16:49:18] !log hashar@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/GeoCrumbs: Replace use of deprecated ParserOutput:getProperty() - T293894 (duration: 01m 09s) [16:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:25] T293894: PHP Deprecated: Use of ParserOutput::getProperty 
was deprecated in MediaWiki 1.38. [Called from GeoCrumbsHooks::makeTrail] - https://phabricator.wikimedia.org/T293894 [16:50:26] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10dcaro) @Cmjohnson Thanks for the update! [16:52:45] (03CR) 10Volans: "LGTM, but seems to be missing the inclusion of netbox/drmrs.wmnet and netbox/mgmt.drmrs.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) (owner: 10Ayounsi) [16:53:04] !log hashar@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/LiquidThreads/pages/LqtDiscussionPager.php: Remove deprecated usage of setProperty - T293895 (duration: 01m 03s) [16:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:10] T293895: PHP Deprecated: Use of ParserOutput::getProperty was deprecated in MediaWiki 1.38. [Called from LqtDiscussionPager::getPageLimit] - https://phabricator.wikimedia.org/T293895 [16:53:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:54] (03Merged) 10jenkins-bot: Update deprecated calls in ShortDescHandler [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732332 (https://phabricator.wikimedia.org/T293860) (owner: 10C. 
Scott Ananian) [16:57:56] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:59:22] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:59:46] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:59:53] (03PS1) 10Ottomata: Bump eventgate-main image version to get maps.tiles_change schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/732382 (https://phabricator.wikimedia.org/T293366) [17:00:15] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Papaul) [17:00:47] !log hashar@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase/client: Update deprecated calls to ParserOutput in ShortDescHandler - T293860 (duration: 01m 03s) [17:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:54] T293860: PHP Deprecated: Use of ParserOutput::setProperty was deprecated in MediaWiki 1.38. 
[Called from Wikibase\Client\Hooks\ShortDescHandler::doHandle] - https://phabricator.wikimedia.org/T293860 [17:01:04] !log razzi@deploy1002 Finished deploy [analytics/refinery@9e3295f]: Regular analytics weekly train [analytics/refinery@9e3295f] (duration: 23m 42s) [17:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:26] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:02:02] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:02:21] (03Abandoned) 10Hashar: Revert "Hard deprecate the renamed ParserOutput::*Property() methods" [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732085 (owner: 10Reedy) [17:03:54] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:05:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:05:27] (03PS1) 10Legoktm: package_builder: Add hook to stop rebuilding man-db [puppet] - 10https://gerrit.wikimedia.org/r/732383 (https://phabricator.wikimedia.org/T276632) [17:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:07] (03CR) 10jerkins-bot: [V: 04-1] package_builder: Add hook to stop rebuilding man-db [puppet] - 10https://gerrit.wikimedia.org/r/732383 (https://phabricator.wikimedia.org/T276632) (owner: 10Legoktm) [17:09:15] 10SRE, 10serviceops, 10Patch-For-Review: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) a:03Legoktm [17:10:46] (03CR) 10Arturo Borrero Gonzalez: "Thanks for the patch!" [homer/public] - 10https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [17:11:02] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [17:11:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] debian: drop the upstart script [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/732287 (owner: 10Majavah) [17:12:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "let me know if you need me to merge this, or even better, if you need +2 on this repo for doing it yourself." 
[software/tools-manifest] - 10https://gerrit.wikimedia.org/r/732287 (owner: 10Majavah) [17:12:55] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) [17:14:23] (03PS2) 10Majavah: cr-cloud: add tls ports for openstack services [homer/public] - 10https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194) [17:14:57] (03CR) 10Majavah: cr-cloud: add tls ports for openstack services (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/732321 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [17:17:02] (03CR) 10Majavah: [C: 03+2] debian: drop the upstart script [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/732287 (owner: 10Majavah) [17:18:32] (03Merged) 10jenkins-bot: debian: drop the upstart script [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/732287 (owner: 10Majavah) [17:22:51] (03CR) 10Dzahn: [C: 03+2] "Has approval and it was confirmed kerberos is needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/732038 (https://phabricator.wikimedia.org/T293810) (owner: 10Dzahn) [17:27:57] !log [krb1001:~] $ sudo manage_principals.py create statwithlatte --email_address=naray-ctr@wikimedia.org [17:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:17] !log [krb1001:~] $ sudo manage_principals.py create statwithlatte --email_address=naray-ctr@wikimedia.org - T293810 [17:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:54] come on, phab bot [17:29:40] (03PS1) 10Legoktm: mediawiki: Remove tidy binary [puppet] - 10https://gerrit.wikimedia.org/r/732386 [17:29:42] (03PS1) 10Legoktm: mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) [17:32:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Private Data Users for Naray-ctr - https://phabricator.wikimedia.org/T293810 (10Dzahn) Hi @TAndic ! Thanks for confirming. I merged the code change and access is granted. I followed what KVVelaga got and ran the same command Raz... [17:32:44] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Private Data Users for Naray-ctr - https://phabricator.wikimedia.org/T293810 (10Dzahn) 05In progress→03Resolved a:05NaRay→03Dzahn [17:33:13] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Private Data Users for Naray-ctr - https://phabricator.wikimedia.org/T293810 (10Dzahn) Feel free to reopen this if you have any questions or run into problems. [17:35:09] mutante: can I pm you for a quick sanity check? 
[17:36:50] 10Puppet, 10SRE, 10Infrastructure-Foundations: package_builder puppet tests failing - https://phabricator.wikimedia.org/T293912 (10Dzahn) a:03Dzahn I'll take a look [17:36:53] dontpanic: yea [17:55:32] PROBLEM - Host contint2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:55:54] ^ should be firmware upgrades that finally fix the flapping :) [17:57:02] RECOVERY - Host contint2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [18:00:05] RoanKattouw and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T1800). [18:00:05] No Gerrit patches in the queue for this window AFAICS. [18:00:05] hashar and dancy: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T1800). [18:00:21] oh bot is off [18:01:18] It's right. now is train log triage. 
[18:04:22] (03PS2) 10Ottomata: Bump eventgate-main image version to get maps.tiles_change schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/732382 (https://phabricator.wikimedia.org/T293366) [18:04:33] (03PS1) 10Dzahn: apt: use ensure_resource for exec[apt-get update] to avoid duplicate defs [puppet] - 10https://gerrit.wikimedia.org/r/732391 (https://phabricator.wikimedia.org/T293912) [18:05:23] (03CR) 10Dzahn: "We already use ensure_resource in various places to avoid these" [puppet] - 10https://gerrit.wikimedia.org/r/732391 (https://phabricator.wikimedia.org/T293912) (owner: 10Dzahn) [18:06:43] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31780/console" [puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [18:06:47] (03PS2) 10Dduvall: gitlab: Allow configuration of gitlab-runner concurrency [puppet] - 10https://gerrit.wikimedia.org/r/732093 (https://phabricator.wikimedia.org/T293833) [18:06:49] (03PS1) 10Dduvall: gitlab: Refactor docker volume parameters to use cinder [puppet] - 10https://gerrit.wikimedia.org/r/732392 (https://phabricator.wikimedia.org/T293835) [18:09:40] (03PS1) 10Dzahn: pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) [18:09:42] (03CR) 10Ottomata: [C: 03+2] Bump eventgate-main image version to get maps.tiles_change schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/732382 (https://phabricator.wikimedia.org/T293366) (owner: 10Ottomata) [18:10:43] (03CR) 10jerkins-bot: [V: 04-1] pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: 10Dzahn) [18:11:12] (03CR) 10Dzahn: "pretending to add PHP 8 to test edits to pbuilder_hook and the tests" [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: 10Dzahn) [18:11:35] (03PS1) 
10Majavah: Update webservice package name on stretch and newer [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/732395 [18:11:53] (03CR) 10Majavah: [C: 03+2] Update webservice package name on stretch and newer [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/732395 (owner: 10Majavah) [18:13:09] (03PS6) 10Ottomata: [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - 10https://gerrit.wikimedia.org/r/732369 [18:13:30] (03PS2) 10Dzahn: pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) [18:13:39] (03CR) 10Razzi: [C: 03+2] Add analytics purge for Gobblin old files [puppet] - 10https://gerrit.wikimedia.org/r/724413 (https://phabricator.wikimedia.org/T287084) (owner: 10Joal) [18:14:31] (03CR) 10jerkins-bot: [V: 04-1] pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: 10Dzahn) [18:14:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:15:20] another one where LE cert renewal is 7 days but monitoring threshold is also 7 days [18:15:22] (03PS3) 10Dzahn: pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) [18:15:44] will probably resolve in a couple minutes [18:15:55] (03CR) 10jerkins-bot: [V: 04-1] pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: 10Dzahn) [18:16:34] (03CR) 10Ahmon Dancy: [C: 03+1] "Looks reasonable to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/732093 (https://phabricator.wikimedia.org/T293833) (owner: 10Dduvall) [18:16:56] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:17:31] (03PS4) 10Dzahn: pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) [18:17:32] there it is [18:19:04] 10SRE, 10Discovery-Search, 10Traffic, 10observability: cloudelastic icinga TLS cert alerts - https://phabricator.wikimedia.org/T293826 (10Dzahn) same thing happened today for lists.wikimedia.org, it alerted and then recovered 2 minutes later. In general we have renewal = 7 days and alerting = 7 days. we... [18:19:17] (03CR) 10jerkins-bot: [V: 04-1] pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: 10Dzahn) [18:19:19] (03Merged) 10jenkins-bot: Update webservice package name on stretch and newer [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/732395 (owner: 10Majavah) [18:19:55] 10SRE, 10Discovery-Search, 10Traffic, 10observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10Dzahn) [18:20:15] (03PS1) 10Andrew Bogott: openStack:haproxy add tls termination for openstack [puppet] - 10https://gerrit.wikimedia.org/r/732397 (https://phabricator.wikimedia.org/T267194) [18:24:07] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[18:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:36] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [18:24:41] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Reedy) Are we still installing `php-mongodb`? I can't see it obviously in puppet... If it is/was still in use, I'm guessing it was potentially xhgui stuff from #performance-team. `php-tidy` probably ca... [18:25:05] (03PS3) 10Ayounsi: Add drmrs to DNS (Netbox generated records) [dns] - 10https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) [18:25:22] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31782/console" [puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [18:26:28] (03CR) 10Ayounsi: Add drmrs to DNS (Netbox generated records) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/732380 (https://phabricator.wikimedia.org/T282787) (owner: 10Ayounsi) [18:27:08] (03CR) 10Ottomata: [V: 03+1] "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31782/console" [puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [18:27:13] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Dzahn) >>! In T293449#7445553, @Reedy wrote: > Are we still installing `php-mongodb`? I can't see it obviously in puppet... If it is/was still in use, I'm guessing it was potentially xhgui stuff from #p... 
[18:28:10] (03PS1) 10Andrew Bogott: openstack:haproxy add tls for nova metadata service [puppet] - 10https://gerrit.wikimedia.org/r/732398 (https://phabricator.wikimedia.org/T267194) [18:30:30] (03PS2) 10Andrew Bogott: openStack:haproxy add tls termination for openstack in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/732397 (https://phabricator.wikimedia.org/T267194) [18:30:32] (03PS2) 10Andrew Bogott: openstack:haproxy add tls for nova metadata service [puppet] - 10https://gerrit.wikimedia.org/r/732398 (https://phabricator.wikimedia.org/T267194) [18:30:56] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [18:30:56] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [18:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:23] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Dzahn) >>! In T293449#7445553, @Reedy wrote: > `php-tidy` probably can be answered by the Parsing people, but I don't think we're still installing it (at least, explicitly) either based on puppet T2164... [18:32:19] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) I copied the list out of what is currently packaged in the php72 component: https://apt-browser.toolforge.org/buster-wikimedia/component/php72/ >>! In T293449#7445553, @Reedy wrote: > Are we s... 
[18:32:46] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) [18:35:35] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10RobH) [18:38:48] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10RobH) [18:39:09] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10RobH) a:03Jclark-ctr [18:40:16] (03PS5) 10Dzahn: pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) [18:40:24] (03PS1) 10Ottomata: Use profile::mariadb_multiinstance for analytics multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/732400 (https://phabricator.wikimedia.org/T284150) [18:41:06] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Majavah) [18:41:14] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [18:41:14] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[18:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:02] (03CR) 10jerkins-bot: [V: 04-1] pbuilder: test edit for T293912 [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: 10Dzahn) [18:42:41] 10ops-eqiad, 10Analytics, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10RobH) [18:43:58] (03PS2) 10Ottomata: Use profile::mariadb_multiinstance for analytics multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/732400 (https://phabricator.wikimedia.org/T284150) [18:44:04] (03CR) 10Dzahn: "just some more attempts to resolve the duplicate declaration by using $name in the resource title because apt::pin is a defined type.. oh " [puppet] - 10https://gerrit.wikimedia.org/r/732393 (https://phabricator.wikimedia.org/T293912) (owner: 10Dzahn) [18:45:12] 10SRE, 10Discovery-Search, 10Traffic, 10observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10Legoktm) this sounds like https://etbe.coker.com.au/2021/10/20/strange-apache-reload-issue/ which I read yesterday via Planet Debian [18:45:53] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31784/console" [puppet] - 10https://gerrit.wikimedia.org/r/732400 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [18:47:55] (03CR) 10Ottomata: [V: 03+1] "Almost a no-op, but I'd say the changes that it does make are appropriate." 
[puppet] - 10https://gerrit.wikimedia.org/r/732400 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [18:51:54] (03CR) 10Ottomata: [V: 03+1] [WIP] Make profile::mariadb::dbstore_multiinstance more generic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [18:57:59] PROBLEM - Host contint2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:00:05] hashar and dancy: That opportune time is upon us again. Time for a MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T1900). [19:00:10] (03PS1) 10Dzahn: pontoon: disable puppetmaster trying to pull geoip databases [puppet] - 10https://gerrit.wikimedia.org/r/732405 [19:01:47] (03PS7) 10Ottomata: [WIP] Make profile::mariadb::dbstore_multiinstance more generic [puppet] - 10https://gerrit.wikimedia.org/r/732369 [19:02:04] (03PS3) 10Ottomata: Use profile::mariadb_multiinstance for analytics multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/732400 (https://phabricator.wikimedia.org/T284150) [19:03:37] (03PS4) 10Ottomata: Use profile::mariadb_multiinstance for analytics multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/732400 (https://phabricator.wikimedia.org/T284150) [19:04:19] (03CR) 10Dzahn: "Yea, we can go 2 routes here, either we skip the entire geoip update class (enable_geoip: false) or we can use it but not try to use the l" [puppet] - 10https://gerrit.wikimedia.org/r/732405 (owner: 10Dzahn) [19:05:18] (03CR) 10Dzahn: pontoon: disable puppetmaster trying to pull geoip databases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732405 (owner: 10Dzahn) [19:06:28] (03PS2) 10Dzahn: pontoon: disable puppetmaster trying to pull geoip databases [puppet] - 10https://gerrit.wikimedia.org/r/732405 [19:06:48] (03PS3) 10Dzahn: pontoon: disable puppetmaster trying to pull _private_ geoip databases [puppet] - 10https://gerrit.wikimedia.org/r/732405 
[19:07:39] (03CR) 10jerkins-bot: [V: 04-1] pontoon: disable puppetmaster trying to pull _private_ geoip databases [puppet] - 10https://gerrit.wikimedia.org/r/732405 (owner: 10Dzahn) [19:08:32] jouncebot: next [19:08:32] In 0 hour(s) and 51 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T2000) [19:08:37] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Krinkle) [19:08:53] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Krinkle) [19:09:01] !log disabling puppet on mw* for a minute to deploy a change [19:09:04] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Krinkle) [19:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:32] (03CR) 10Dzahn: [C: 03+2] mediawiki::appserver: fetch additional MaxMind databases on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/732099 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [19:13:01] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Krinkle) [19:15:13] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 902.66 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:15:29] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:46] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-26 (UTC) Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (10Krinkle) [19:24:00] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@985a139]: bulk_daemon: detect cross-cluste config from old and new locations [19:24:04] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:46] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@985a139]: bulk_daemon: detect cross-cluste config from old and new locations (duration: 00m 46s) [19:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:54] 10SRE, 10Discovery-Search, 10Traffic, 10observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10herron) >>! In T293826#7445653, @Legoktm wrote: > this sounds like https://etbe.coker.com.au/2021/10/20/strange-apache-reload-issue/ which I... [19:29:52] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [19:37:05] (03PS1) 10Dzahn: cumin: drop tor_relay from aliases [puppet] - 10https://gerrit.wikimedia.org/r/732410 [19:42:28] (03PS1) 10Dzahn: cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 [19:51:42] (03CR) 10Dduvall: [C: 03+1] "Tested successfully on runner1002.gitlab-runners.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/732093 (https://phabricator.wikimedia.org/T293833) (owner: 10Dduvall) [19:57:35] 10SRE-tools, 10Observability-Logging, 10Spicerack: Create a cookbook for managing Logstash cluster restarts - https://phabricator.wikimedia.org/T293929 (10colewhite) [20:00:05] hashar and dancy: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T1900). [20:00:05] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T2000). 
[20:02:58] (03PS1) 10Dzahn: cumin: add an alias for new pki roles and add to misc-others [puppet] - 10https://gerrit.wikimedia.org/r/732425 [20:03:12] (03PS8) 10Cwhite: profile: fork elasticsearch profile into opensearch::server [puppet] - 10https://gerrit.wikimedia.org/r/721388 (https://phabricator.wikimedia.org/T288618) [20:05:10] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab: Refactor docker volume parameters to use cinder [puppet] - 10https://gerrit.wikimedia.org/r/732392 (https://phabricator.wikimedia.org/T293835) (owner: 10Dduvall) [20:06:46] (03PS2) 10Dzahn: cumin: add alias for gitlab and add that to misc-releng [puppet] - 10https://gerrit.wikimedia.org/r/732414 [20:09:31] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/31786/deneb.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [20:11:16] (03PS4) 10Dzahn: builder/systemtap: convert role::systemtap::devserver to profile [puppet] - 10https://gerrit.wikimedia.org/r/730863 [20:11:32] (03CR) 10Dzahn: [C: 03+2] builder/systemtap: convert role::systemtap::devserver to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [20:13:33] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/31787/deneb.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [20:14:19] (03CR) 10Cwhite: [C: 03+2] profile: fork elasticsearch profile into opensearch::server [puppet] - 10https://gerrit.wikimedia.org/r/721388 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [20:14:41] (03CR) 10Dzahn: "on deneb: Notice: /Stage[main]/Motd/File[/etc/update-motd.d/05-role-systemtap--devserver]/ensure: removed and nothing else" [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [20:16:47] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/721391 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [20:17:46] (03CR) 10Dzahn: [C: 03+2] "ACK, thanks for the bug link, cloud-only and tested, yea, I'll go ahead" [puppet] - 10https://gerrit.wikimedia.org/r/732093 (https://phabricator.wikimedia.org/T293833) (owner: 10Dduvall) [20:19:30] (03CR) 10Dzahn: "Arnold, fyi" [puppet] - 10https://gerrit.wikimedia.org/r/732093 (https://phabricator.wikimedia.org/T293833) (owner: 10Dduvall) [20:19:40] (03CR) 10Dduvall: [C: 03+1] "Tested successfully on runner-1002.gitlab-runners.eqiad1.wikimedia.cloud. Note that some manual intervention is needed for existing runner" [puppet] - 10https://gerrit.wikimedia.org/r/732392 (https://phabricator.wikimedia.org/T293835) (owner: 10Dduvall) [20:19:55] (03PS2) 10Dzahn: gitlab: Refactor docker volume parameters to use cinder [puppet] - 10https://gerrit.wikimedia.org/r/732392 (https://phabricator.wikimedia.org/T293835) (owner: 10Dduvall) [20:22:21] (03CR) 10Dzahn: [C: 03+2] gitlab: Refactor docker volume parameters to use cinder [puppet] - 10https://gerrit.wikimedia.org/r/732392 (https://phabricator.wikimedia.org/T293835) (owner: 10Dduvall) [20:22:37] (03CR) 10Dzahn: "cloud-only" [puppet] - 10https://gerrit.wikimedia.org/r/732392 (https://phabricator.wikimedia.org/T293835) (owner: 10Dduvall) [20:23:39] mutante: thank you! [20:24:34] (03CR) 10Dzahn: "DDuval: do you need someone to do the "manual intervention is needed for existing runners" on existing nodes or you got that?" [puppet] - 10https://gerrit.wikimedia.org/r/732392 (https://phabricator.wikimedia.org/T293835) (owner: 10Dduvall) [20:25:37] (03CR) 10Ottomata: "Hm, curious. In mariadb::config, server_id is set to a fixed value based on the IP address." 
[puppet] - 10https://gerrit.wikimedia.org/r/732369 (owner: 10Ottomata) [20:25:45] (03CR) 10Dduvall: gitlab: Refactor docker volume parameters to use cinder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732392 (https://phabricator.wikimedia.org/T293835) (owner: 10Dduvall) [20:25:47] dduvall: np, let's keep the development "rapid" in this case where it's a new service in cloud [20:25:52] !log uploaded php7.4 on buster to apt.wm.o (T293449) [20:25:57] :) [20:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:58] T293449: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 [20:31:46] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [20:32:41] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) [20:42:05] jouncebot: now [20:42:05] For the next 0 hour(s) and 17 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T1900) [20:42:05] For the next 0 hour(s) and 17 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T2000) [20:45:16] (03PS1) 10Urbanecm: Promote Growth features out of darkmode on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732437 (https://phabricator.wikimedia.org/T291826) [20:45:36] (03PS6) 10Cwhite: hiera: add minimal logstash-beta-next hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) [20:46:05] (03CR) 10jerkins-bot: [V: 04-1] hiera: add minimal logstash-beta-next hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [20:46:26] (03PS7) 10Cwhite: hiera: add minimal logstash-beta-next hiera 
configuration [puppet] - 10https://gerrit.wikimedia.org/r/723619 (https://phabricator.wikimedia.org/T288618) [20:49:04] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [20:53:02] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/727626 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [20:57:30] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10RobH) [20:57:55] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10RobH) [20:58:19] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10RobH) a:03Jclark-ctr [20:59:30] (03PS1) 10Cwhite: profile: logstash: add production logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618) [21:00:02] (03CR) 10jerkins-bot: [V: 04-1] profile: logstash: add production logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [21:00:46] (03PS2) 10Cwhite: profile: logstash: add production logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618) [21:00:48] (03PS1) 10Dzahn: global: remove all "filtertags" lines [puppet] - 10https://gerrit.wikimedia.org/r/732439 [21:01:35] (03CR) 10jerkins-bot: [V: 04-1] global: remove all "filtertags" lines [puppet] - 10https://gerrit.wikimedia.org/r/732439 (owner: 10Dzahn) [21:02:13] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:02:18] (03PS2) 10Dzahn: global: remove all "filtertags" lines [puppet] - 10https://gerrit.wikimedia.org/r/732439 [21:04:17] jouncebot: now [21:04:17] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [21:04:31] (03CR) 10Urbanecm: [C: 03+2] Promote Growth features out of darkmode on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732437 (https://phabricator.wikimedia.org/T291826) (owner: 10Urbanecm) [21:05:40] (03Merged) 10jenkins-bot: Promote Growth features out of darkmode on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732437 (https://phabricator.wikimedia.org/T291826) (owner: 10Urbanecm) [21:05:54] (03PS4) 10Cwhite: logstash: duplicate MediaWiki error and exception logs to ECS test [puppet] - 10https://gerrit.wikimedia.org/r/730897 (https://phabricator.wikimedia.org/T234565) [21:06:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:06:38] again... [21:07:31] yea, when that exact same pattern happened the other day on cloudelastic.. that's the point when I made a ticket [21:07:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:41] ACKNOWLEDGEMENT - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). 
daniel_zahn https://phabricator.wikimedia.org/T293826 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:07:59] originally I thought we just need to lower threshold to 1 hour under 7 days [21:08:28] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b9cf996a38d82fdd67e600a5a951e88423957e8d: Promote Growth features out of darkmode on several wikis (T291826, T255037, T287878) (duration: 01m 04s) [21:08:31] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:08:31] (03CR) 10Andrew Bogott: [C: 03+1] global: remove all "filtertags" lines [puppet] - 10https://gerrit.wikimedia.org/r/732439 (owner: 10Dzahn) [21:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:36] T287878: Deploy Growth features on Kazakh Wikipedia - https://phabricator.wikimedia.org/T287878 [21:08:36] T255037: Deploy Growth features on Italian Wikipedia - https://phabricator.wikimedia.org/T255037 [21:08:36] T291826: Deploy Growth features on Gan, Inuktitut and Tajik Wikipedia - https://phabricator.wikimedia.org/T291826 [21:09:42] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) db2078.mgmt mw2253.mgmt [21:09:52] ACKNOWLEDGEMENT - SSH on db2078.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:10:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:41] (03PS1) 10Dzahn: mediawiki::appserver: fetch additional MaxMind databases on API servers [puppet] - 10https://gerrit.wikimedia.org/r/732440 (https://phabricator.wikimedia.org/T288844) [21:14:59] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10RobH) [21:16:08] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10RobH) [21:18:51] (03PS1) 10Jdlrobson: Restore title to mobile skin without logo [skins/MinervaNeue] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732336 (https://phabricator.wikimedia.org/T290525) [21:19:04] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10RobH) [21:19:25] 10ops-eqiad, 10Analytics, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10RobH) @Jclark-ctr This is a spare system we already have in netbox. It just needs to relocate from row D, as its being allocated into service as redundant to a server in D3... [21:19:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:32] (03CR) 10Dzahn: "I am doing it this way and not using role::mediawiki::common because that is also jobrunners and mwmaint etc and there is currently discus" [puppet] - 10https://gerrit.wikimedia.org/r/732440 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [21:33:28] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Dzahn) 05Open→03Resolved No replies so far. I am calling this done and will reopen it if that changes. [21:37:53] PROBLEM - snapshot of s8 in codfw on alert1001 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2021-10-17 21:04:56 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [21:39:05] (03CR) 10Dzahn: [C: 03+2] mediawiki::appserver: fetch additional MaxMind databases on API servers [puppet] - 10https://gerrit.wikimedia.org/r/732440 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [21:44:12] (03CR) 10Cwhite: [C: 03+2] logstash: duplicate MediaWiki error and exception logs to ECS test [puppet] - 10https://gerrit.wikimedia.org/r/730897 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:46:49] legoktm: we have duplicate lists.wikimedia.org cert alert because "check archives" and "check listinfo" get the same thing when the cert triggers them. (check_https -S auto-includes it) [21:47:08] eh, check_http Icinga check command when used with -S [21:47:37] is that a bad thing? 
[21:48:09] there are two alerts because those are provided by two different areas of the mailman code (postorius vs hyperkitty) [21:48:10] not much but a little bit, duplicate alerts and notifications for the same thing [21:48:33] but there is much worse, like when we duplicate them on a dozen hosts using the same role, not this :) [21:48:50] just wanted to point out thats also why we saw it again [21:49:03] ah, gotcha [21:50:23] !log Testing a series of one-file scap sync-file runs [21:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:52] ACKNOWLEDGEMENT - SSH on mw2253.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:52:52] !log dancy@deploy1002 Synchronized README: (no justification provided) (duration: 01m 03s) [21:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:58] 10SRE: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) [21:54:29] 10SRE: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) [21:54:48] 10SRE, 10serviceops: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) [21:54:57] !log dancy@deploy1002 Synchronized README: testing (2) (duration: 01m 02s) [21:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:02] 10SRE, 10serviceops: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) 05Open→03Stalled [21:59:09] 10SRE, 10serviceops: rename OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) please don't upload patches, i want to use this as an example in a kind of workshop [21:59:51] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 29.22 le 60 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:00:02] !log dancy@deploy1002 Synchronized README: testing (3/4) (duration: 02m 57s) [22:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:20] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (10Dzahn) [22:10:07] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) The new database files are now rolled out to all production app and API servers (mediawiki::canary_appserv... [22:10:52] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) @phuedx I think for your purposes this should be solved now. On our side we have to discuss how to do this... 
[22:11:11] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) 05In progress→03Resolved [22:12:56] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) [22:13:46] !log dancy@deploy1002 Synchronized README: testing (4/4) (duration: 02m 52s) [22:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:24:33] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:26:15] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:29:50] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) > It would be great if we managed to build the packages so that php 7.2 and php 7.4 can coexist on the same application server, like debian tries to do. For PHP itself and the core extensions,... [23:00:02] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) [23:00:04] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211020T2300). [23:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:03:26] o/ [23:03:47] thcipriani RoanKattouw: urbanecm are either of you available to clear out a deploy blocker? [23:06:01] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [23:11:36] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10RobH) [23:13:19] PROBLEM - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Search%23Administration [23:17:16] Jdlrobson: I can if you're still around [23:17:31] RECOVERY - WMF Cloud -Psi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 26 Dec 2021 07:00:29 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [23:17:38] thcipriani: great! 
[23:17:39] (03PS12) 10Tim Starling: Temporarily disable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [23:17:50] Don't want to be that guy on the train tracks holding the train :) [23:18:47] should maybe re-kick the thing this backport is cherry-picked from: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/732436 [23:19:25] (03CR) 10Thcipriani: [C: 03+2] "utc late backport" [skins/MinervaNeue] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732336 (https://phabricator.wikimedia.org/T290525) (owner: 10Jdlrobson) [23:21:21] thcipriani: there's a CI problem [23:21:26] That's a tomorrow problem :) [23:21:34] fun :) [23:21:43] It relates to the `npm run doc` job so should not affect production [23:22:07] that's good [23:22:43] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [23:23:31] (03CR) 10Tim Starling: [C: 03+2] "PS12: rebase only." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [23:24:27] (03PS1) 10Jeena Huneidi: Update wikiversions-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/732454 [23:24:31] (03Merged) 10jenkins-bot: Temporarily disable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [23:27:21] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: fawiki require login to edit main namespace T291018 (duration: 01m 04s) [23:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:29] T291018: Temporarily disable article editing by anonymous users on fawiki - https://phabricator.wikimedia.org/T291018 [23:27:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:59] (03CR) 10Jeena Huneidi: [C: 03+1] Update wikiversions-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/732454 (owner: 10Jeena Huneidi) [23:29:25] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: fawiki require login for creation of pages in the draft namespace T291018 (duration: 01m 02s) [23:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:33] (03CR) 10Huji: "Thanks Tim! Confirming that it works as expected." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [23:30:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:50] (03Merged) 10jenkins-bot: Restore title to mobile skin without logo [skins/MinervaNeue] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/732336 (https://phabricator.wikimedia.org/T290525) (owner: 10Jdlrobson) [23:38:06] yay [23:38:08] ok [23:38:11] finallyy [23:38:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 28 Oct 2021 09:00:44 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:21] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:41:07] Jdlrobson: live on mwdebug1002 (sans l10n updates), check please [23:42:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:42:15] testing [23:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:47] thcipriani: logo has returned. Although it's a bit longer than expected. [23:42:54] https://usercontent.irccloud-cdn.com/file/rggKvzu3/Screen%20Shot%202021-10-20%20at%204.42.50%20PM.png [23:43:24] This is likely a better situation to be in [23:43:34] since branding is important from a legal perspective :) [23:43:38] I was going to ask :) [23:43:41] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) [23:43:42] so I'd suggest we sync. [23:43:51] I'll look into the message separately. 
[23:43:52] * thcipriani does [23:44:12] I don't know how long sync-world takes these days [23:44:18] Who is the person to ask about all things meta? :) [23:44:35] !log thcipriani@deploy1002 Started scap: Backport: [[gerrit:732336|Restore title to mobile skin without logo (T290525)]] [23:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:41] T290525: Generate Minerva search HTML from SkinMustache data - https://phabricator.wikimedia.org/T290525 [23:44:50] That is a question I don't know the answer to [23:45:10] haha. Okay. I'm going to fix this on wiki with some mobile magic [23:45:39] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [23:45:41] oh: if it's a l10n message you're missing that's because I hadn't run the full sync-world just yet [23:46:02] that's running now, I just pulled the code over to mwdebug1002 [23:48:07] we're good to sync [23:48:38] yep, I started syncing, it's pulling to canaries now [23:50:20] the l10n build will mean this takes a bit longer than our normal 2 minute sync, but it seems faster than the last time I did this and it took 20 minutes. Already done with proxy sync, now just doing apache sync. [23:50:43] sounds good [23:55:18] 96%.... 
[23:55:25] * thcipriani taps foot [23:56:16] !log thcipriani@deploy1002 Finished scap: Backport: [[gerrit:732336|Restore title to mobile skin without logo (T290525)]] (duration: 11m 41s) [23:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:22] T290525: Generate Minerva search HTML from SkinMustache data - https://phabricator.wikimedia.org/T290525 [23:56:26] ^ Jdlrobson should be live everywhere now [23:57:05] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Legoktm) [23:57:31] (aside -- this page just rick-rolled me: https://deploy-commands.toolforge.org/bacc/732336 ... well played.) [23:59:20] yep! Looks good [23:59:24] thanks thcipriani for getting this unblocked