[00:01:25] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:43] PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 27007 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[00:15:55] RECOVERY - Disk space on urldownloader2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=urldownloader2002&var-datasource=codfw+prometheus/ops
[00:26:01] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:32] (PS1) RLazarus: icinga: Add --services flag to icinga-status [puppet] - https://gerrit.wikimedia.org/r/708384 (https://phabricator.wikimedia.org/T285803)
[00:31:11] (CR) RLazarus: "Consider this a strawman proposal -- I'm open to other ideas about both the interface and the implementation, but this seems like a place " [puppet] - https://gerrit.wikimedia.org/r/708384 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[00:33:12] (PS4) Labdajiwa: Set the project namespace and sitename for Javanese Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/708206 (https://phabricator.wikimedia.org/T287437)
[00:59:41] SRE, serviceops, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (Legoktm) Open→Resolved a:Legoktm A recap blog post was published a few days ago: https://techblog.wikimedia.org/2021/07/23/june-2021-data-cent...
[01:01:55] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:31] SRE, CommRel-Specialists-Support, Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (Legoktm)
[01:25:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:26:31] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:17] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:52] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org
[01:36:52] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org
[01:51:52] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org
[01:56:52] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org
[02:02:25] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:06] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Legoktm) >>! In T287362#7238970, @Krassotkin wrote: > @Joe You are the third...
[02:05:27] (CR) Legoktm: "I left a note at https://meta.wikimedia.org/wiki/Stewards%27_noticeboard#ruwikinews_will_lose_global_abuse_filters_tomorrow" [mediawiki-config] - https://gerrit.wikimedia.org/r/708364 (owner: Legoktm)
[02:13:48] SRE, DynamicPageList (Wikimedia), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Legoktm) >>! In T287380#7238964, @Bawolff wrote: > ...and make the feature similar in performance to a large watchlists. This gave me t...
[02:16:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:21:37] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:22:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:26:45] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:56] !log on mwmaint2002 fixing T286273 broken files using eval.php
[02:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:05] T286273: Image size is not determined for new PNG files with (partially) corrupt metadata - https://phabricator.wikimedia.org/T286273
[03:01:53] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:40] (PS1) Legoktm: Add a tracking category to pages using the tag [extensions/intersection] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/708224 (https://phabricator.wikimedia.org/T287380)
[03:48:54] (PS1) Legoktm: Add a tracking category to pages using the tag [extensions/intersection] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708225 (https://phabricator.wikimedia.org/T287380)
[04:01:05] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:05:42] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Firestar464) >>! In T287362#7239066, @Firestar464 wrote: > Can we stop whini...
[04:11:31] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:23] SRE, ops-eqiad, DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Marostegui) Yes please
[04:26:31] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Bawolff) For reference, DPL queries can be grouped into the following four performance ca...
[04:26:55] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:25] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Bawolff) Note: extension:googlenewssitemap which powers the rss feeds on wikinews ( https...
[04:46:26] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Marostegui) >>! In T287380#7241779, @Legoktm wrote: >>>! In T287380#7238964, @Bawolff wro...
[05:02:15] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:08:28] (CR) Marostegui: [C: +1] Stop enabling DPL on new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380) (owner: Legoktm)
[05:10:31] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Legoktm) >>! In T287380#7241917, @Marostegui wrote: >>>! In T287380#7241779, @Legoktm wro...
[05:18:01] (CR) Ladsgroup: [C: +1] "Has my virtual blessing." [mediawiki-config] - https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380) (owner: Legoktm)
[05:26:43] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:35:26] (PS1) Marostegui: Revert "db1129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/708408
[05:36:14] (CR) Marostegui: [C: +2] Revert "db1129: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/708408 (owner: Marostegui)
[05:38:34] (PS1) Marostegui: install_server: Reimage db1122 to Buster [puppet] - https://gerrit.wikimedia.org/r/708397 (https://phabricator.wikimedia.org/T287230)
[05:39:19] (CR) Marostegui: [C: +2] install_server: Reimage db1122 to Buster [puppet] - https://gerrit.wikimedia.org/r/708397 (https://phabricator.wikimedia.org/T287230) (owner: Marostegui)
[05:48:01] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Legoktm) >>! In T287380#7236632, @Legoktm wrote: > I think moving their workflows over to...
[06:02:39] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:11:53] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[06:11:53] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[06:12:00] wuut
[06:12:12] Hi
[06:12:22] Dallas peering port
[06:12:23] https://librenms.wikimedia.org/device/device=93/tab=port/port=8200/
[06:12:36] checking netflow
[06:12:56] can someone ack the page?
[06:13:15] Acked
[06:13:42] AS15169
[06:14:19] can someone check the logs to know which IP or UA is being nasty?
[06:14:41] let's move to _security
[06:23:27] (PS1) Ayounsi: Attack + clouds_shutdown [puppet] - https://gerrit.wikimedia.org/r/708457
[06:25:11] SRE, Wikimedia-Mailing-lists: Admin - https://phabricator.wikimedia.org/T287554 (Aklapper) @JOAN: Hi, see "Lost list administrator passwords" on https://meta.wikimedia.org/wiki/Mailing_lists/Administration
[06:25:25] SRE, Wikimedia-Mailing-lists: Reset admin password for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (Aklapper)
[06:27:44] (PS1) Legoktm: upload: Block UA making too many requests [puppet] - https://gerrit.wikimedia.org/r/708458
[06:28:30] (CR) Ayounsi: [C: +1] upload: Block UA making too many requests [puppet] - https://gerrit.wikimedia.org/r/708458 (owner: Legoktm)
[06:29:21] (PS2) Legoktm: upload: Block UA making too many requests [puppet] - https://gerrit.wikimedia.org/r/708458
[06:29:33] (CR) Legoktm: [V: +2 C: +2] upload: Block UA making too many requests [puppet] - https://gerrit.wikimedia.org/r/708458 (owner: Legoktm)
[06:32:42] (Abandoned) Ayounsi: Attack + clouds_shutdown [puppet] - https://gerrit.wikimedia.org/r/708457 (owner: Ayounsi)
[06:36:53] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[06:36:53] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[06:38:55] wonderful
[06:38:59] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (DonSimon) @Legoktm, this is a Phabricator, not Wikinews. This a not a proper...
[06:42:37] !log remove obsolete user.log.manual-rotation from centrallog1001 to free disk space
[06:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:45] SRE, Wikimedia-Mailing-lists: Reset admin password for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (Ladsgroup) Oh actually this is another case of people missing the announcement of mailman3 upgrade. There we explicitly said there won't be "admin password" anymore. They have to create an em...
[06:45:34] (PS2) Ladsgroup: microsites: Add Query Builder subpage to wdqs gui [puppet] - https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703)
[06:51:11] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:37] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Krassotkin) @Legoktm How do you plan to call this bot? Are you planning to continuously...
[06:52:46] (PS3) Ladsgroup: microsites: Add Query Builder subpage to wdqs gui [puppet] - https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703)
[06:56:59] SRE, cloud-services-team (Kanban): node-exporter syslog spam filling up centrallog - https://phabricator.wikimedia.org/T287559 (fgiunchedi)
[07:01:31] (CR) Elukey: "Thanks for the explanations Ben, since the profile.d setting is limited to a test node it is fine to proceed in my opinion. I just wanted " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: Btullis)
[07:01:44] (PS1) Filippo Giunchedi: prometheus: temp disable node-pinger [puppet] - https://gerrit.wikimedia.org/r/708462 (https://phabricator.wikimedia.org/T287559)
[07:02:31] can I get a quick +1 on ^ ?
[07:02:33] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:13] SRE, Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (MoritzMuehlenhoff) >>! In T275873#7216558, @MoritzMuehlenhoff wrote: >> +1 on disabling the collector (on >= bullseye, since it's been introduced in node-exporter 1.0.0) > > Given tha...
[07:03:59] (CR) Filippo Giunchedi: [C: +2] prometheus: temp disable node-pinger [puppet] - https://gerrit.wikimedia.org/r/708462 (https://phabricator.wikimedia.org/T287559) (owner: Filippo Giunchedi)
[07:04:21] RECOVERY - puppet last run on ml-serve2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:04:25] godog: looking
[07:04:45] moritzm: thanks, done already
[07:04:49] straightforward enough though
[07:06:16] !log remove node_pinger.prom from node-pinger hosts
[07:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:38] (PS1) Ladsgroup: miscweb: Add CSP headers for query builder [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761)
[07:07:39] !log remove cloud*/syslog.log from centrallog2001 - T287559
[07:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:46] T287559: node-exporter syslog spam filling up centrallog - https://phabricator.wikimedia.org/T287559
[07:09:42] dcaro: FYI ^
[07:12:47] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/global
[07:13:30] SRE, cloud-services-team (Kanban): node-exporter syslog spam filling up centrallog - https://phabricator.wikimedia.org/T287559 (fgiunchedi) AFAICT the `exec_start_pre` option of `systemd::timer::job` is never rendered either in the `.service` or (which wouldn't work afaik) in the `.timer` units
[07:14:12] (CR) Elukey: "The change as it is will not work in my opinion, I left a comment about the workers' truststores. John can you review it when you have a m" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: Btullis)
[07:16:09] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:21] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/global
[07:17:03] godog: sup?
[07:17:11] (CR) Elukey: [C: +1] netboot: make an-masters reimage without confirmation [puppet] - https://gerrit.wikimedia.org/r/705782 (https://phabricator.wikimedia.org/T278423) (owner: Razzi)
[07:17:57] dcaro: node-exporter was spamming syslog due to node-pinger appending to its .prom file :( https://phabricator.wikimedia.org/T287559
[07:18:14] godog: I'm getting a bunch of alarms now
[07:18:17] I'm looking at the puppet code and I must be doing something wrong because I can't figure it out
[07:18:18] systemd is failing
[07:18:35] yeah I disabled node pinger as a shitty mitigation
[07:19:08] (CR) Muehlenhoff: Enable kerberos ticket auto-renewal for a test client (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: Btullis)
[07:19:10] getting paged
[07:19:25] oh wow that paged you, sorry about that
[07:20:15] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 6 hosts with reason: T287559
[07:20:17] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 6 hosts with reason: T287559
[07:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:22] T287559: node-exporter syslog spam filling up centrallog - https://phabricator.wikimedia.org/T287559
[07:20:23] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 40 hosts with reason: T287559
[07:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:37] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 40 hosts with reason: T287559
[07:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:46] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 29 hosts with reason: T287559
[07:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:57] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 29 hosts with reason: T287559
[07:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:27] RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[07:22:15] still can't figure out why exec_start_pre isn't rendered in the .service
[07:22:28] looking
[07:22:51] (CR) Elukey: "Hi! Thanks a lot for the patch, I opened https://phabricator.wikimedia.org/T287561 to track the work to review it and in case, package + d" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/708094 (owner: R4q3NWnUx2CEhVyr)
[07:23:28] yeah I'm sure I'm missing sth obvious, exec_start_pre in systemd::timer::job is there, and it is in the template
[07:24:22] I think I found it, we have messy templates that make it difficult to see where the ifs are
[07:24:41] aahh of course, yeah that explains it
[07:25:36] (PS1) David Caro: systemd.timer_service: fix missing exec_start_pre [puppet] - https://gerrit.wikimedia.org/r/708465 (https://phabricator.wikimedia.org/T287559)
[07:25:55] (PS1) Marostegui: db1122: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/708466 (https://phabricator.wikimedia.org/T287230)
[07:26:31] (CR) Filippo Giunchedi: [C: +1] systemd.timer_service: fix missing exec_start_pre [puppet] - https://gerrit.wikimedia.org/r/708465 (https://phabricator.wikimedia.org/T287559) (owner: David Caro)
[07:27:19] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:25] (PS1) Muehlenhoff: Extend access for jbol [puppet] - https://gerrit.wikimedia.org/r/708467
[07:27:40] (CR) David Caro: [C: +2] systemd.timer_service: fix missing exec_start_pre [puppet] - https://gerrit.wikimedia.org/r/708465 (https://phabricator.wikimedia.org/T287559) (owner: David Caro)
[07:28:52] (CR) Marostegui: [C: +2] db1122: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/708466 (https://phabricator.wikimedia.org/T287230) (owner: Marostegui)
[07:29:27] moritzm: I have an email from spatton that jbol likely doesn't need access to the data anymore, and we can remove it, I was going to do that tomorrow
[07:29:38] godog: that worked, I'll enable the service again
[07:29:53] dcaro: neat, thank you! sorry about the pages :(
[07:31:03] (PS1) David Caro: Revert "prometheus: temp disable node-pinger" [puppet] - https://gerrit.wikimedia.org/r/708468 (https://phabricator.wikimedia.org/T287559)
[07:31:34] godog: no problem
[07:31:45] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Krassotkin) @Bawolff Maybe you can try replacing database queries with CirrusSearch queri...
[07:34:40] (CR) David Caro: [C: +2] Revert "prometheus: temp disable node-pinger" [puppet] - https://gerrit.wikimedia.org/r/708468 (https://phabricator.wikimedia.org/T287559) (owner: David Caro)
[07:36:14] (PS1) Giuseppe Lavagetto: trafficserver: limit mw on k8s to testwikis [puppet] - https://gerrit.wikimedia.org/r/708469
[07:37:31] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/global
[07:37:41] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/global
[07:40:00] _joe_: I wouldn't call mw.o a testwiki in a similar sense to test.wikipedia.org, mw.o still has actual content that actual readers want to read
[07:40:27] <_joe_> majavah: yeah i just kept the commit message short
[07:40:36] SRE, Patch-For-Review, cloud-services-team (Kanban): node-exporter syslog spam filling up centrallog - https://phabricator.wikimedia.org/T287559 (dcaro) Fix deployed and running: ` root@cloudcephosd1001:~# wc /var/lib/prometheus/node.d/node_pinger.prom 22 44 2046 /var/lib/prometheus/node.d/node_p...
[07:40:46] <_joe_> i should've written "group0 wikis"
[07:41:13] <_joe_> majavah: i don't think there is any real risk of significant cache poisoning
[07:41:51] SRE, Patch-For-Review, cloud-services-team (Kanban): node-exporter syslog spam filling up centrallog - https://phabricator.wikimedia.org/T287559 (dcaro) Open→Resolved a:dcaro
[07:42:15] fair
[07:43:04] can we include test2 too? it has a bunch of edge cases (flaggedrevs etc) that testwiki doesn't, even if it's on group1
[07:44:02] <_joe_> oh sure, please comment on the patch
[07:44:41] (CR) Muehlenhoff: [C: +2] Extend access for jbol [puppet] - https://gerrit.wikimedia.org/r/708467 (owner: Muehlenhoff)
[07:45:22] (PS1) DCausse: rdf-streaming-updater: Disable hostname verif from the k8s client [deployment-charts] - https://gerrit.wikimedia.org/r/708471 (https://phabricator.wikimedia.org/T287443)
[07:46:40] (CR) Majavah: "Can we include test2.wikipedia.org too? Some things that have caused issues in the past (flaggedrevs for example) are configured very diff" [puppet] - https://gerrit.wikimedia.org/r/708469 (owner: Giuseppe Lavagetto)
[07:49:01] (CR) Filippo Giunchedi: [C: +2] "Going ahead with this since it doesn't require an haproxy restart (and thus no coordination)" [puppet] - https://gerrit.wikimedia.org/r/708106 (owner: Filippo Giunchedi)
[07:49:08] (PS3) Filippo Giunchedi: haproxy: remove sleep 10 [puppet] - https://gerrit.wikimedia.org/r/708106
[07:49:26] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (matmarex) @DonSimon Please re-read the previous message.
[07:50:26] (CR) DCausse: [C: +2] rdf-streaming-updater: Disable hostname verif from the k8s client [deployment-charts] - https://gerrit.wikimedia.org/r/708471 (https://phabricator.wikimedia.org/T287443) (owner: DCausse)
[07:50:29] moritzm: merged your change too
[07:50:33] ack, thx
[07:50:55] SRE, User-Ladsgroup, Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Legoktm) >>! In T287362#7241988, @DonSimon wrote: > @Legoktm, this is a Phab...
[07:53:15] (Merged) jenkins-bot: rdf-streaming-updater: Disable hostname verif from the k8s client [deployment-charts] - https://gerrit.wikimedia.org/r/708471 (https://phabricator.wikimedia.org/T287443) (owner: DCausse)
[07:53:35] !log installing aspell security updates on stretch
[07:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:33] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 106 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:57:42] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 18 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:59:36] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Legoktm) >>! In T287380#7242030, @Krassotkin wrote: > @Legoktm How do you plan to call th...
[08:01:25] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:42] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[08:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:41] (PS2) Giuseppe Lavagetto: trafficserver: limit mw on k8s to group0/test wikis [puppet] - https://gerrit.wikimedia.org/r/708469
[08:04:23] (CR) Majavah: [C: +1] trafficserver: limit mw on k8s to group0/test wikis [puppet] - https://gerrit.wikimedia.org/r/708469 (owner: Giuseppe Lavagetto)
[08:05:00] (PS1) Muehlenhoff: Add library hint for aspell [puppet] - https://gerrit.wikimedia.org/r/708472
[08:06:23] (CR) Filippo Giunchedi: "Thank you for the reviews!" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/708108 (owner: Filippo Giunchedi)
[08:06:38] (PS3) Filippo Giunchedi: haproxy: bullseye support [puppet] - https://gerrit.wikimedia.org/r/708105
[08:06:40] (PS2) Filippo Giunchedi: haproxy: read config directory natively [puppet] - https://gerrit.wikimedia.org/r/708108
[08:06:42] (PS1) Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - https://gerrit.wikimedia.org/r/708473
[08:06:55] (PS2) Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - https://gerrit.wikimedia.org/r/708473
[08:07:11] (PS3) Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - https://gerrit.wikimedia.org/r/708473
[08:08:15] (CR) Muehlenhoff: [C: +2] Add library hint for aspell [puppet] - https://gerrit.wikimedia.org/r/708472 (owner: Muehlenhoff)
[08:08:36] (CR) Jcrespo: "I found some dbs incorrectly classified on misc while researching for this change, please also double-check those fixes." [puppet] - https://gerrit.wikimedia.org/r/708473 (owner: Jcrespo)
[08:10:15] (PS4) Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442)
[08:13:31] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:35] (CR) Hashar: "I looked at it quickly. I am quite happy John figured out the Gerrit API :]" [puppet] - https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: Jbond)
[08:18:02] (CR) Filippo Giunchedi: "I think setting cluster: makes more sense in the relevant roles, also you'll need to add the cluster first (unfortunately to two files ATM" [puppet] - https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[08:19:50] SRE, Gerrit, Infrastructure-Foundations, CAS-SSO, and 3 others: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (hashar) a:jbond
[08:20:42] (CR) Filippo Giunchedi: [C: +2] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/708369 (https://phabricator.wikimedia.org/T281359) (owner: Jdlrobson)
[08:24:26] (CR) Jcrespo: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[08:24:50] (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/708369 (https://phabricator.wikimedia.org/T281359) (owner: Jdlrobson)
[08:26:47] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:52] !log running several long-running queries against pc1007
[08:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:04] (PS1) Giuseppe Lavagetto: admin_ng: allow conditionally including the knative releases [deployment-charts] - https://gerrit.wikimedia.org/r/708475
[08:31:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1122.eqiad.wmnet with reason: REIMAGE
[08:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:57] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: labstore1006, ganeti2025, ganeti2026, thanos-be1003, an-test-coord1001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:33:16] (PS1) Filippo Giunchedi: icinga: remove services Grafana alerts [puppet] - https://gerrit.wikimedia.org/r/708476 (https://phabricator.wikimedia.org/T281359)
[08:33:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1122.eqiad.wmnet with reason: REIMAGE
[08:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:44] (Abandoned) Hashar: Group0 to 1.37.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/705832 (https://phabricator.wikimedia.org/T281156) (owner: Hashar)
[08:39:37] (CR) Filippo Giunchedi: "+Petr and Eric as they likely have context on whether we can ditch these altogether" [puppet] - https://gerrit.wikimedia.org/r/708476 (https://phabricator.wikimedia.org/T281359) (owner: Filippo Giunchedi)
[08:39:52] (PS1) DCausse: flink-session-cluster: use kubernetesApiEnv when available [deployment-charts] - https://gerrit.wikimedia.org/r/708477 (https://phabricator.wikimedia.org/T287443)
[08:40:16] (CR) jerkins-bot: [V: -1] flink-session-cluster: use kubernetesApiEnv when available [deployment-charts] - https://gerrit.wikimedia.org/r/708477 (https://phabricator.wikimedia.org/T287443) (owner: DCausse)
[08:41:04] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[08:42:05] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[08:43:06] (PS1) Elukey: Add a simple ORES cookbook to roll restart its daemons [cookbooks] - https://gerrit.wikimedia.org/r/708478
[08:43:54] (PS2) Elukey: Add a simple ORES cookbook to roll restart its daemons [cookbooks] - https://gerrit.wikimedia.org/r/708478
[08:45:08] (PS5) Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - https://gerrit.wikimedia.org/r/708473
[08:45:36] (PS6) Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442)
[08:45:44] (PS7) Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442)
[08:46:19] (Abandoned) Elukey: admin_ng:
add knative-serving in bases list [deployment-charts] - 10https://gerrit.wikimedia.org/r/707408 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [08:46:33] (03PS2) 10DCausse: flink-session-cluster: use kubernetesApiEnv when available [deployment-charts] - 10https://gerrit.wikimedia.org/r/708477 (https://phabricator.wikimedia.org/T287443) [08:51:26] (03CR) 10JMeybohm: [C: 03+1] "This looks pretty nice! I did not render it, but it seems solid to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/708475 (owner: 10Giuseppe Lavagetto) [08:51:59] (03CR) 10Lucas Werkmeister (WMDE): Disable mobile contributions simplifications on Wikidata and Commons (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [08:53:19] (03PS8) 10Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) [08:53:39] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1018 - https://phabricator.wikimedia.org/T285799 (10dcaro) @Cmjohnson ping [08:54:16] (03CR) 10David Caro: global: add a simple requires.txt (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/707256 (owner: 10David Caro) [08:54:28] (03PS3) 10Giuseppe Lavagetto: trafficserver: limit mw on k8s to group0/test wikis [puppet] - 10https://gerrit.wikimedia.org/r/708469 [08:57:06] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: limit mw on k8s to group0/test wikis [puppet] - 10https://gerrit.wikimedia.org/r/708469 (owner: 10Giuseppe Lavagetto) [08:57:24] (03CR) 10Jforrester: Stop enabling DPL on new wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380) (owner: 10Legoktm) [08:57:49] (03Abandoned) 10David Caro: doc: Introduce a code reviewing guideline [software/spicerack] - 
10https://gerrit.wikimedia.org/r/666601 (owner: 10David Caro) [08:58:12] (03Abandoned) 10David Caro: WIP step_by_step: Added cli option to ask confirmation before each command [software/spicerack] - 10https://gerrit.wikimedia.org/r/667170 (owner: 10David Caro) [08:58:51] 10SRE, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Add logout.d script for Wikitech - https://phabricator.wikimedia.org/T287566 (10Majavah) [08:59:37] (03PS1) 10Majavah: Add logoutd script for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) [08:59:41] (03PS2) 10Elukey: admin_ng: allow conditionally including the knative releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/708475 (owner: 10Giuseppe Lavagetto) [09:00:14] (03CR) 10jerkins-bot: [V: 04-1] Add logoutd script for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) (owner: 10Majavah) [09:00:26] (03Abandoned) 10David Caro: icinga: allow clearing a downtime for a host [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro) [09:01:06] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: labstore1006, ganeti2025, thanos-be1003, ganeti2026, an-test-coord1001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [09:01:33] (03Abandoned) 10David Caro: wmsc: add role to the hiera hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: 10David Caro) [09:01:43] (03Abandoned) 10David Caro: WIP wmcs.enc: Add role of the machine [puppet] - 10https://gerrit.wikimedia.org/r/680329 (https://phabricator.wikimedia.org/T280324) (owner: 10David Caro) [09:04:06] (03CR) 10Elukey: [C: 03+2] "Fixed the knative-serving's path under bases (s/_/-), the rest looks really good, thanks a lot! 
Merging and testing :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/708475 (owner: 10Giuseppe Lavagetto) [09:04:26] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] global: ran flake8 on the code [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706477 (owner: 10David Caro) [09:04:34] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] global: ran black and isort [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706476 (owner: 10David Caro) [09:04:44] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] global: add .gitreview file [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706478 (owner: 10David Caro) [09:05:02] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: Add team tags matcher file support [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: 10David Caro) [09:05:20] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] global: add a simple requires.txt [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/707256 (owner: 10David Caro) [09:06:10] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10jcrespo) [09:06:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: limit mw on k8s to group0/test wikis [puppet] - 10https://gerrit.wikimedia.org/r/708469 (owner: 10Giuseppe Lavagetto) [09:11:51] (03PS2) 10Majavah: Add logoutd script for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) [09:11:55] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Krassotkin) @Legoktm 2-3 articles can be added by hand. We don't need a robot for this.... 
[09:16:32] (03PS5) 10Muehlenhoff: Add ganeti2025/2026 to Ganeti test cluster [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) [09:17:58] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:46] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: use kubernetesApiEnv when available [deployment-charts] - 10https://gerrit.wikimedia.org/r/708477 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [09:20:01] (03PS6) 10Muehlenhoff: Add ganeti2025/2026 to Ganeti test cluster [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) [09:21:04] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: thanos-be1003, an-test-coord1001, ganeti2026, labstore1006, ganeti2025 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [09:21:20] (03Merged) 10jenkins-bot: flink-session-cluster: use kubernetesApiEnv when available [deployment-charts] - 10https://gerrit.wikimedia.org/r/708477 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [09:22:05] (03CR) 10Btullis: Update sre.cassandra.roll-restart cookbook to use new spicerack API (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [09:24:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:24:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[09:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:24] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:23] (03PS1) 10DCausse: flink-session-cluster: Updater chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/708482 [09:31:54] (03PS2) 10DCausse: flink-session-cluster: Update chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/708482 [09:35:31] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: Update chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/708482 (owner: 10DCausse) [09:36:04] (03PS1) 10JMeybohm: Add debian directory [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 [09:37:14] (03PS1) 10Elukey: Fix usage of Release.Name in knative-serving's helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/708484 [09:37:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [09:38:01] (03Merged) 10jenkins-bot: flink-session-cluster: Update chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/708482 (owner: 10DCausse) [09:40:12] !log Start server-side upload for 1 video file 
(T287482) [09:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:19] T287482: Upload 2.3 GB video to Wikimedia Commons - https://phabricator.wikimedia.org/T287482 [09:40:56] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [09:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:27] (03PS2) 10JMeybohm: Add debian directory [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) [09:47:17] (03CR) 10Elukey: [C: 03+2] Fix usage of Release.Name in knative-serving's helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/708484 (owner: 10Elukey) [09:51:39] (03CR) 10Btullis: "> Patch Set 11:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [09:52:58] (03CR) 10Muehlenhoff: "Looks good, a few nits inline" (033 comments) [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:53:42] (03PS1) 10DCausse: rdf-streaming-updater: Declare kubernetesApi for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/708486 (https://phabricator.wikimedia.org/T287443) [09:55:19] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: Declare kubernetesApi for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/708486 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [09:58:11] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/708384 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [09:58:18] (03PS1) 10Phuedx: beta: Enable IP address copy action instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708488 (https://phabricator.wikimedia.org/T279540) [10:00:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [10:01:04] (03PS1) 10JMeybohm: deployment_server: Add defaults for kubernetes apiserver [puppet] - 10https://gerrit.wikimedia.org/r/708489 (https://phabricator.wikimedia.org/T287443) [10:01:06] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/30375/" [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [10:01:32] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:10] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-icinga-am.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=icinga-am site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:02:46] (03CR) 10DCausse: [C: 03+1] deployment_server: Add defaults for kubernetes apiserver [puppet] - 10https://gerrit.wikimedia.org/r/708489 (https://phabricator.wikimedia.org/T287443) (owner: 10JMeybohm) [10:04:33] (03Abandoned) 10DCausse: rdf-streaming-updater: Declare kubernetesApi for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/708486 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [10:05:26] PROBLEM - Alertmanager has not been receiving alerts on alert1001 is CRITICAL: 0.1875 le 1 https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [10:06:11] checking ^ [10:09:03] (03PS1) 10JMeybohm: common:common_templates: Add wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708492 [10:09:37] (03PS2) 10JMeybohm: 
common_templates: Add wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708492 [10:09:39] !log temp fix prometheus-icinga-am on alert1001 [10:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:46] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:49] (03CR) 10JMeybohm: [C: 03+2] deployment_server: Add defaults for kubernetes apiserver [puppet] - 10https://gerrit.wikimedia.org/r/708489 (https://phabricator.wikimedia.org/T287443) (owner: 10JMeybohm) [10:10:14] (03Abandoned) 10Jgiannelos: maps: Disable tilerator on maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/708264 (owner: 10Jgiannelos) [10:10:56] (03PS7) 10Muehlenhoff: Add ganeti2025/2026 to Ganeti test cluster [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) [10:11:01] (03PS10) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [10:11:06] RECOVERY - Alertmanager has not been receiving alerts on alert1001 is OK: (C)1 le (W)2 le 2.15 https://wikitech.wikimedia.org/wiki/Alertmanager%23Alerts https://grafana.wikimedia.org/d/eea-9_sik/alertmanager [10:11:57] !log installing remaining nginx security updates on stretch [10:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:46] (03PS2) 10David Caro: wmcs.ceph: remove unused backup role [puppet] - 10https://gerrit.wikimedia.org/r/702653 [10:13:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:14:26] (03PS1) 10Jgiannelos: tegola-vector-tiles: Connect staging to read replica instead of master [deployment-charts] - 10https://gerrit.wikimedia.org/r/708494 [10:16:42] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krassotkin) @Legoktm Please pay attention to the words > at least at first... [10:18:57] (03CR) 10Jgiannelos: "When we first deployed tegola on staging we only had a master node Postgres that was using the new maps schema available. Even though ever" [deployment-charts] - 10https://gerrit.wikimedia.org/r/708494 (owner: 10Jgiannelos) [10:19:04] (03PS1) 10DCausse: flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) [10:19:38] (03CR) 10jerkins-bot: [V: 04-1] flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [10:20:27] (03PS2) 10DCausse: flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) [10:20:50] (03CR) 10jerkins-bot: [V: 04-1] flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [10:22:08] (03PS1) 10Filippo Giunchedi: am: fix team_tag_matcher vs team_tags_matcher [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/708497 [10:25:01] (03CR) 10Hashar: [C: 04-1] gerrit: disabled patchset level comments [puppet] - 
10https://gerrit.wikimedia.org/r/708124 (https://phabricator.wikimedia.org/T287385) (owner: 10Hashar) [10:25:17] (03CR) 10Muehlenhoff: "Looks good, two comments inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [10:25:48] (03CR) 10Jgiannelos: "From a quick look:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/708494 (owner: 10Jgiannelos) [10:26:07] (03PS3) 10Jbond: P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) [10:26:12] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:36] (03CR) 10Jbond: P:gerrit: Add logoutd script for gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [10:26:53] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10Wikidata-Campsite, and 3 others: 🛑 Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) [10:27:55] (03PS2) 10Jgiannelos: tegola-vector-tiles: Connect staging to read replica postgres node [deployment-charts] - 10https://gerrit.wikimedia.org/r/708494 [10:28:46] (03PS1) 10Muehlenhoff: Add ganeti_test to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/708500 (https://phabricator.wikimedia.org/T286206) [10:29:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30378/console" [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [10:30:19] (03CR) 10Zabe: [C: 03+1] Move ruwikinews to large wikis dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708364 (owner: 10Legoktm) [10:31:12] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10Wikidata-Campsite, and 3 others: 🛑 Deploy WDQS 
query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) [10:31:31] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10Wikidata-Campsite, and 3 others: 🛑 Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) 05Stalled→03Open [10:32:22] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10Wikidata-Campsite, and 3 others: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Michael) [10:32:25] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Firestar464) Basically, what this is all about is you accusing everyone else... [10:36:40] (03Abandoned) 10Hashar: gerrit: disabled patchset level comments [puppet] - 10https://gerrit.wikimedia.org/r/708124 (https://phabricator.wikimedia.org/T287385) (owner: 10Hashar) [10:40:32] (03CR) 10Elukey: "LGTM! 
If you can run pcc with the following nodes it would be great:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [10:43:45] (03CR) 10Dzahn: [C: 03+2] miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [10:43:56] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 3 others: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Addshore) [10:43:58] (03CR) 10jerkins-bot: [V: 04-1] miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [10:44:05] (03PS1) 10Hashar: gerrit: explicitly set jgit receive.autogc=false [puppet] - 10https://gerrit.wikimedia.org/r/708502 (https://phabricator.wikimedia.org/T262241) [10:46:47] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Ladsgroup) a:03Ladsgroup I'm doing this already. Slow and steady. 
[10:48:31] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10Addshore) [10:48:48] (03PS3) 10DCausse: flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) [10:49:02] (03PS8) 10Dzahn: miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) [10:50:02] (03PS1) 10Jbond: debian::autostart: drop this function [puppet] - 10https://gerrit.wikimedia.org/r/708503 [10:50:04] (03PS1) 10Jbond: systemd::preset: add system::preset define [puppet] - 10https://gerrit.wikimedia.org/r/708504 [10:50:51] (03CR) 10jerkins-bot: [V: 04-1] systemd::preset: add system::preset define [puppet] - 10https://gerrit.wikimedia.org/r/708504 (owner: 10Jbond) [10:51:24] (03CR) 10Dzahn: "looks good, it's just blocked by dcops finishing setup" [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [10:53:32] (03CR) 10DCausse: [C: 03+1] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/708492 (owner: 10JMeybohm) [10:55:04] (03CR) 10Dzahn: [C: 03+2] miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [10:57:13] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10matmarex) @Krassotkin I think you would also benefit from re-reading the pre... 
[10:57:51] (03Merged) 10jenkins-bot: miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [10:59:42] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10Joe) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:11] o/ [11:00:18] indeed, nothing to do [11:00:57] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10Joe) As far as I know, we already generate an image for every +2 in mediawiki-config, so I'll assume that part is al... 
[11:01:14] (03PS4) 10Dzahn: DHCP: remove mw1285 through mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) [11:01:52] (03CR) 10jerkins-bot: [V: 04-1] DHCP: remove mw1285 through mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [11:02:26] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:17] (03PS5) 10Dzahn: DHCP: remove mw1285 through mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) [11:03:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [11:03:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [11:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:10] (03CR) 10Jbond: [C: 03+1] "lgtm, optional nit inline. 
its a shame there is no way to query the user however the main use case for now is logging a user out so i thi" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) (owner: 10Majavah) [11:08:49] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [11:11:47] (03CR) 10Jbond: [C: 03+1] "LGTM but do you also need to add an entry to hieradata/common/monitoring.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/708500 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:12:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708108 (owner: 10Filippo Giunchedi) [11:12:33] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) @wiki_willy You guys can remove all old mw appservers from eqiad rack A5 and rack A8 already, they are decom... [11:13:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:14:41] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) p:05Triage→03High [11:14:43] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) @wiki_willy Here it would be great for us if next someone could finish the setup of mw1447 through mw1450 and take a look at special case mw1444 which shoul... 
[11:14:45] (03CR) 10Jforrester: trafficserver: limit mw on k8s to group0/test wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708469 (owner: 10Giuseppe Lavagetto) [11:14:57] (03PS2) 10Muehlenhoff: Add ganeti_test to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/708500 (https://phabricator.wikimedia.org/T286206) [11:15:18] (03PS3) 10Muehlenhoff: Add ganeti_test to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/708500 (https://phabricator.wikimedia.org/T286206) [11:15:42] (03CR) 10Dzahn: "@jelto rebased this, there are not that many left in DHCP now. I think we can go ahead and remove those now, what do you think?" [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [11:16:37] (03CR) 10Muehlenhoff: "Good catch, thanks, amended the patch." [puppet] - 10https://gerrit.wikimedia.org/r/708500 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:18:04] !log installing nginx security updates on sodium (mirrors.wikimedia.org) [11:18:05] (03CR) 10Jbond: [C: 03+1] Update TLS configuration for analytics-test-presto (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/708500 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:20:05] (03PS2) 10Jbond: systemd::preset: add system::preset define [puppet] - 10https://gerrit.wikimedia.org/r/708504 [11:22:07] (03PS3) 10Majavah: Add logoutd script for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) [11:22:35] (03CR) 10Majavah: Add logoutd script for wikitech (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) (owner: 10Majavah) 
[11:23:03] 10SRE, 10Wikimedia-Mailing-lists: Reset admin password for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10Dzahn) @JOAN Hola, desde que actualizamos a la versión 3 de mailman, ya no necesita una "contraseña de administrador". Debe crear una cuenta conectada a su dirección de correo electrónico... [11:23:29] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10ArielGlenn) May I remind everyone that this task has been closed, as the spe... [11:25:24] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Aklapper) [11:26:53] !log installing nginx security updates on thumbor* [11:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:04] urbanecm: o/ Would you mind if I merged a Beta Cluster-only patch during this otherwise empty backport window? [11:28:24] phuedx: go ahead -- just don't forget to fetch it to the deployment host once it merges ;) [11:28:32] Will do :) [11:28:33] 10SRE, 10Wikimedia-Mailing-lists: Reset admin password for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10Aklapper) 05Open→03Invalid Closing per last comments. [11:28:54] thanks! 
[11:29:07] FTR the change is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708488 [11:29:40] (03CR) 10Dzahn: [C: 03+2] "confirmed per " settings are applied only if Gerrit is started as the container process through Gerrit’s 'gerrit.sh' rc.d compatible wrapp" [puppet] - 10https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [11:29:48] (03CR) 10Phuedx: [C: 03+2] "BACKPORT WINDOW!!1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708488 (https://phabricator.wikimedia.org/T279540) (owner: 10Phuedx) [11:30:07] jouncebot: now [11:30:07] For the next 0 hour(s) and 29 minute(s): European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T1100) [11:30:20] (03CR) 10Dzahn: [C: 03+1] "submitting after deployment is over" [puppet] - 10https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [11:30:40] (03Merged) 10jenkins-bot: beta: Enable IP address copy action instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708488 (https://phabricator.wikimedia.org/T279540) (owner: 10Phuedx) [11:32:15] (03CR) 10Jbond: [C: 03+1] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) (owner: 10Majavah) [11:32:16] I fetched the change onto the deployment host [11:32:20] Done [11:34:24] urbanecm: can I also complain about the two spaces in front of the topic [11:34:38] thx [11:34:42] majavah: fixed :). [11:38:59] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10whym) This seems to overlap with T124841. 
[11:39:22] (03CR) 10Jelto: [C: 03+1] "lgtm and I think it's fine to go ahead" [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [11:42:24] (03CR) 10Muehlenhoff: [C: 03+2] Add ganeti_test to wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/708500 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:43:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [11:44:30] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Aklapper) [11:44:43] (03PS8) 10Muehlenhoff: Add ganeti2025/2026 to Ganeti test cluster [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) [11:46:06] (03CR) 10MSantos: [C: 03+1] tegola-vector-tiles: Connect staging to read replica postgres node [deployment-charts] - 10https://gerrit.wikimedia.org/r/708494 (owner: 10Jgiannelos) [11:46:39] Hrrm. 
That config change hasn't stuck [11:46:49] (03CR) 10MSantos: [C: 03+1] tegola-vector-tiles: Disable debugging on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/706405 (owner: 10Jgiannelos) [11:51:57] (03PS1) 10Jbond: P:pki::get_cert: make discovery the default CA [puppet] - 10https://gerrit.wikimedia.org/r/708510 [11:53:07] I think the key name I used in the change should've had a - in front of it [11:53:08] brb [11:53:19] (03CR) 10jerkins-bot: [V: 04-1] P:pki::get_cert: make discovery the default CA [puppet] - 10https://gerrit.wikimedia.org/r/708510 (owner: 10Jbond) [11:53:29] (03CR) 10Muehlenhoff: "Updated PCC: https://puppet-compiler.wmflabs.org/compiler1003/30380/" [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [11:53:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30381/console" [puppet] - 10https://gerrit.wikimedia.org/r/708510 (owner: 10Jbond) [11:56:36] (03PS2) 10Jbond: P:pki::get_cert: make discovery the default CA [puppet] - 10https://gerrit.wikimedia.org/r/708510 (https://phabricator.wikimedia.org/T285850) [11:58:31] (03CR) 10Jbond: [C: 03+2] P:pki::get_cert: make discovery the default CA [puppet] - 10https://gerrit.wikimedia.org/r/708510 (https://phabricator.wikimedia.org/T285850) (owner: 10Jbond) [12:05:01] jouncebot: now [12:05:02] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [12:05:10] (03CR) 10Dzahn: [C: 03+2] gerrit: remove unused settings from [container] [puppet] - 10https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [12:07:34] (03PS3) 10Dzahn: gerrit: remove unused container.javaOptions values [puppet] - 10https://gerrit.wikimedia.org/r/708104 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [12:08:08] (03CR) 10Dzahn: [C: 03+2] gerrit: remove unused container.javaOptions values [puppet] - 
10https://gerrit.wikimedia.org/r/708104 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [12:09:56] (03CR) 10DCausse: [C: 03+2] common_templates: Add wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708492 (owner: 10JMeybohm) [12:09:58] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [12:10:10] (03CR) 10DCausse: [C: 04-2] flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [12:10:55] (03PS4) 10DCausse: flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) [12:11:56] (03PS1) 10Phuedx: beta: Correctly enable IP address copy action instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708514 [12:12:04] (03CR) 10Dzahn: [C: 03+2] "ACK per https://www.eclipse.org/lists/jgit-dev/msg03736.html" [puppet] - 10https://gerrit.wikimedia.org/r/708502 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [12:12:14] 10SRE, 10Wikimedia-Mailing-lists: Reset admin password for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10JOAN) 05Invalid→03Open Hola. La mayoría son nuevos integrantes del capítulo Wikimedia. Yo tengo el rol de "moderador" pero no sabemos quién es el "administrador". ¿Qué podemos hacer en es... [12:12:16] (03CR) 10Muehlenhoff: [C: 03+2] Add ganeti2025/2026 to Ganeti test cluster [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [12:12:34] urbanecm: Would you mind checking https://gerrit.wikimedia.org/r/708514? 
The previously BC-only config change that I deployed didn't take [12:12:42] (03Merged) 10jenkins-bot: common_templates: Add wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708492 (owner: 10JMeybohm) [12:13:36] (03CR) 10Dzahn: [C: 03+2] DHCP: remove mw1285 through mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:20:01] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [12:22:02] mutante: nice! I am back from lunch ;) [12:22:50] (03Merged) 10jenkins-bot: flink-session-cluster: Use the wmf.kubernetes.ApiEnv template [deployment-charts] - 10https://gerrit.wikimedia.org/r/708495 (https://phabricator.wikimedia.org/T287443) (owner: 10DCausse) [12:24:13] mutante: if you feel adventurous we can change gerrit to listen on all addressed and firewall out the host ips ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/706049 ) [12:27:58] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [12:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:19] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Arbnos) >>! In T287362#7242601, @Firestar464 wrote: > Basically, what this i... [12:40:41] 10SRE, 10Wikimedia-Mailing-lists: Sort out who is admin for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10Aklapper) [12:41:21] (03CR) 10David Caro: [C: 03+1] "🎉" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/708497 (owner: 10Filippo Giunchedi) [12:42:13] phuedx: sorry, i missed your ping. 
Can I still help somehow? [12:42:57] 10SRE, 10Wikimedia-Mailing-lists: Sort out who is admin for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10Dzahn) @Ladsgroup Can you promote a moderator to admin? [12:54:35] urbanecm: I haven't merged the second change yet. I think I know what the problem is. It's just bothersome having to write 2 patches ;0 [12:54:49] (03CR) 10Ppchelko: [C: 03+1] "I never look at this and I do not think this works." [puppet] - 10https://gerrit.wikimedia.org/r/708476 (https://phabricator.wikimedia.org/T281359) (owner: 10Filippo Giunchedi) [12:56:40] phuedx: I'm not sure I understand the symptoms of the problem. `$wgWMEIPAddressCopyActionEnabled` sounds to have the expected value in beta, see below [12:56:44] https://www.irccloud.com/pastebin/so2bTSHM/ [12:58:06] phuedx: and since the default value in `extensions/WikimediaEvents/extension.json` is false, your patch apparently did something [12:58:12] or am i missing something? [12:59:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/708503 (owner: 10Jbond) [13:00:05] twentyafterfour and hashar: Time to snap out of that daydream and deploy MediaWiki train - American+European Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T1300). [13:01:19] urbanecm: You aren't missing something. Apparently, I'm just impatient :/ [13:01:43] (03Abandoned) 10Phuedx: beta: Correctly enable IP address copy action instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708514 (owner: 10Phuedx) [13:02:17] I now see the value reflected in the code being shipped to the browser. My apologies for the pings [13:02:32] phuedx: no problem at all :). Hopefully it works. [13:03:05] I would like an "Hopefully it works" sticker [13:03:23] (03CR) 10Muehlenhoff: "LGTM, one comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708504 (owner: 10Jbond) [13:03:54] (03CR) 10Ottomata: "Can we merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:05:32] (03CR) 10Ottomata: admin::user: add support for nonexistent home directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:06:31] (03PS1) 10David Caro: prometheus.icinga-exporter-am: support --labels.team.config-file [puppet] - 10https://gerrit.wikimedia.org/r/708521 [13:08:25] (03PS3) 10Elukey: Add a simple ORES cookbook to roll restart its daemons [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 [13:08:27] !log installing python3.5 security updates on stretch [13:08:27] (03CR) 10Elukey: Add a simple ORES cookbook to roll restart its daemons (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [13:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:00] (03PS1) 10Elukey: Use uid for the nobody user in knative-serving's Dockerfiles [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/708523 (https://phabricator.wikimedia.org/T278194) [13:15:17] (03CR) 10Jbond: admin::user: add support for nonexistent home directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:18:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:18:31] 10SRE, 10CommRel-Specialists-Support, 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Elitre) @sgrabarczuk @Trizek-WMF looking forward to seeing how you're going to use the primary and secondary owner fields here! 
[13:18:48] (03CR) 10Ottomata: [C: 03+1] admin::user: add support for nonexistent home directory [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:18:53] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Elitre) [13:19:19] (03CR) 10Jbond: [C: 03+2] admin::user: add support for nonexistent home directory [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:19:29] (03CR) 10Ottomata: admin::user: add support for nonexistent home directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708288 (https://phabricator.wikimedia.org/T287063) (owner: 10Jbond) [13:23:11] (03PS14) 10Ottomata: Use admin module to manage system user for use by human users [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) [13:23:35] (03CR) 10Jbond: [C: 03+2] debian::autostart: drop this function [puppet] - 10https://gerrit.wikimedia.org/r/708503 (owner: 10Jbond) [13:24:03] (03PS15) 10Ottomata: Use admin module to manage system user for use by human users [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) [13:24:35] (03PS3) 10Jbond: systemd::preset: add system::preset define [puppet] - 10https://gerrit.wikimedia.org/r/708504 [13:24:42] (03PS1) 10Jbond: C:nagios_common: Command definition for posting to a client auth site [puppet] - 10https://gerrit.wikimedia.org/r/708524 (https://phabricator.wikimedia.org/T285762) [13:24:44] (03PS1) 10Jbond: P:pki::multirootca: Add checks for each signer [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) [13:24:55] (03CR) 10Jbond: "thanks updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708504 (owner: 10Jbond) [13:26:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, 
one final typo inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708504 (owner: 10Jbond) [13:27:19] (03PS16) 10Ottomata: Use admin module to manage system user for use by human users [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) [13:27:35] (03PS17) 10Ottomata: Use admin module to manage system user for use by human users [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) [13:27:38] (03PS4) 10Jbond: systemd::preset: add system::preset define [puppet] - 10https://gerrit.wikimedia.org/r/708504 [13:27:44] (03PS2) 10Jbond: P:pki::multirootca: Add checks for each signer [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) [13:28:02] (03CR) 10Jbond: systemd::preset: add system::preset define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708504 (owner: 10Jbond) [13:28:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30385/console" [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) (owner: 10Jbond) [13:29:26] !log installing python2.7 security updates on stretch [13:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:58] (03CR) 10Jbond: [C: 03+2] systemd::preset: add system::preset define [puppet] - 10https://gerrit.wikimedia.org/r/708504 (owner: 10Jbond) [13:31:17] (03PS1) 10Dzahn: site/conftool: convert mw134-mw1436 from API to app servers [puppet] - 10https://gerrit.wikimedia.org/r/708526 (https://phabricator.wikimedia.org/T279309) [13:31:29] (03CR) 10Ottomata: [C: 03+2] "PCC looks good, merging." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [13:31:38] jouncebot: now [13:31:38] For the next 1 hour(s) and 28 minute(s): MediaWiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T1300) [13:31:47] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) |Host |Row |Host iface |switch iface| |lvs2007|**A**|ens2f0np0|xe-2/0/45| |lvs2008|A|ens2f1np1|xe-7/0/45| |lvs2009|A|ens2f1np1|xe-2/0/43| |lvs20... [13:32:56] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw143[4-6].eqiad.wmnet [13:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:11] (03PS2) 10Dzahn: site/conftool: convert mw1434-mw1436 from API to app servers [puppet] - 10https://gerrit.wikimedia.org/r/708526 (https://phabricator.wikimedia.org/T279309) [13:35:10] (03PS3) 10Dzahn: site/conftool: convert mw1434-mw1436 from API to app servers [puppet] - 10https://gerrit.wikimedia.org/r/708526 (https://phabricator.wikimedia.org/T279309) [13:37:07] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: remove unused backup role [puppet] - 10https://gerrit.wikimedia.org/r/702653 (owner: 10David Caro) [13:38:11] (03CR) 10Btullis: [C: 04-1] "Setting back to WIP following discussion with ServiceOps." 
[puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:39:41] (03Abandoned) 10Jbond: P:tlsproxy::instance: update to use debian::autostart('nginx', false) [puppet] - 10https://gerrit.wikimedia.org/r/701539 (owner: 10Jbond) [13:39:48] (03Abandoned) 10Jbond: C:trafficserver: use debian::autostart to prevent auto service start [puppet] - 10https://gerrit.wikimedia.org/r/701545 (owner: 10Jbond) [13:39:54] (03Abandoned) 10Jbond: systemd::mask: refactor systemd::mask [puppet] - 10https://gerrit.wikimedia.org/r/701546 (owner: 10Jbond) [13:40:04] (03Abandoned) 10Jbond: systemd::umask: drop systemd::umask [puppet] - 10https://gerrit.wikimedia.org/r/701547 (owner: 10Jbond) [13:41:47] (03Abandoned) 10Muehlenhoff: Explicitly document the semantics of debian::autostart for different OSes [puppet] - 10https://gerrit.wikimedia.org/r/708041 (owner: 10Muehlenhoff) [13:42:51] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Tested locally, all working!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/708523 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [13:43:28] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) mw1434 has an issue with IPMI ` Remote IPMI failed for mgmt 'mw1434.mgmt.eqiad.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'mw1434.mgmt.eqiad.wmn... 
[13:44:05] (03PS4) 10Elukey: Add a simple ORES cookbook to roll restart its daemons [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 [13:44:18] (03CR) 10Dzahn: [C: 03+2] site/conftool: convert mw1434-mw1436 from API to app servers [puppet] - 10https://gerrit.wikimedia.org/r/708526 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [13:44:58] (03PS9) 10Btullis: Enable kerberos ticket auto-renewal for a test client [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [13:45:21] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw1435.eqiad.wmnet', 'mw1436.eqiad.wmnet'] ` The log can... [13:45:37] (03CR) 10Elukey: "Moritz: I tried to use the confctl module (change_and_revert function) that in theory should do the trick, lemme know what you think about" [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [13:46:56] (03PS1) 10DCausse: flink-session-cluster: fix main_app app label... [deployment-charts] - 10https://gerrit.wikimedia.org/r/708528 (https://phabricator.wikimedia.org/T264006) [13:47:08] (03CR) 10jerkins-bot: [V: 04-1] flink-session-cluster: fix main_app app label... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/708528 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [13:47:34] (03CR) 10Jbond: [C: 03+1] "> Patch Set 11: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:47:51] (03PS1) 10Elukey: Change docker images used for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/708529 (https://phabricator.wikimedia.org/T278192) [13:51:58] (03CR) 10JMeybohm: Add debian directory (031 comment) [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [13:52:00] (03PS1) 10Jelto: icinga::monitor::gitlab add alerts for https and ssh for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/708530 [13:52:45] (03PS2) 10DCausse: flink-session-cluster: fix main_app app label... [deployment-charts] - 10https://gerrit.wikimedia.org/r/708528 (https://phabricator.wikimedia.org/T264006) [13:53:09] (03CR) 10jerkins-bot: [V: 04-1] flink-session-cluster: fix main_app app label... [deployment-charts] - 10https://gerrit.wikimedia.org/r/708528 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [13:54:40] (03PS3) 10DCausse: flink-session-cluster: fix main_app app label... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/708528 (https://phabricator.wikimedia.org/T264006) [13:55:47] (03PS2) 10Elukey: Change docker images used for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/708529 (https://phabricator.wikimedia.org/T278192) [13:55:57] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30386/console" [puppet] - 10https://gerrit.wikimedia.org/r/708530 (owner: 10Jelto) [13:56:16] (03CR) 10Muehlenhoff: Add debian directory (031 comment) [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [13:56:44] (03PS3) 10Ottomata: Add system users and groups for for Airflow for Research and Platform Eng [puppet] - 10https://gerrit.wikimedia.org/r/708159 (https://phabricator.wikimedia.org/T284225) [13:57:58] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30387/console" [puppet] - 10https://gerrit.wikimedia.org/r/708159 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [13:58:28] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [13:59:51] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01109 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:00:41] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:01:01] (03PS3) 10Elukey: Change docker images used for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/708529 (https://phabricator.wikimedia.org/T278192) [14:01:49] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:01:54] seems a single spike, recovering [14:01:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1435.eqiad.wmnet with reason: REIMAGE [14:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:04] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: fix main_app app label... [deployment-charts] - 10https://gerrit.wikimedia.org/r/708528 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [14:02:47] is it possible the analytics/data deploy had gone bad, ottomata ? [14:03:06] jynus: from yesterday? 
[14:03:27] I see a lot of puppet failures on an- servers [14:03:35] but maybe I am misinterpreting the alert [14:03:38] oh [14:03:53] i did just merge a system user thing in data.yaml that declares users everywhere [14:03:57] that's a bit of a refactor [14:03:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1436.eqiad.wmnet with reason: REIMAGE [14:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:09] (03CR) 10Elukey: [C: 03+2] Change docker images used for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/708529 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [14:04:09] i ran puppet on a few hosts and had a few issues because of running processes and homedir changes [14:04:11] but i fixed those [14:04:13] lets seee [14:04:22] ottomata, I am checking, maybe it got fixed already [14:04:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1435.eqiad.wmnet with reason: REIMAGE [14:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:32] no i didn't touch these ones [14:04:33] (03Merged) 10jenkins-bot: flink-session-cluster: fix main_app app label... [deployment-charts] - 10https://gerrit.wikimedia.org/r/708528 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [14:04:34] could be another issue [14:04:36] checking too [14:04:57] could be unrelated, e.g. someone updating a package or something [14:04:59] * ottomata has not looked at puppetboard before.... 
[14:05:01] wow [14:05:03] I cannot replicate [14:05:29] ok i think that most will fix themselves [14:05:33] except an-test-client [14:05:36] i'll fix that manually [14:05:36] ok, no problem, then [14:05:44] puppet is changing the homedir of some of these system users [14:05:48] I just saw a lot of alerts and wanted to ping/research [14:05:48] and if there is a running proc [14:05:50] it fails [14:05:54] (03CR) 10Jelto: [V: 03+1] "I would like to add basic alerting for the active GitLab instance (https and ssh) similar to Gerrit. Could you please take a look?" [puppet] - 10https://gerrit.wikimedia.org/r/708530 (owner: 10Jelto) [14:05:56] the workers run batch jobs [14:05:57] :-) [14:06:00] the jobs are shortlived [14:06:05] so puppet hopefully will succeed [14:06:10] but there is a daemon on test-client [14:06:11] fixing that now [14:06:20] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [14:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:27] (03PS2) 10Jelto: icinga::monitor::gitlab add alerts for https and ssh for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/708530 (https://phabricator.wikimedia.org/T275170) [14:06:32] yeah, no issue, just making sure someone hadn't hack us or something :-D [14:06:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:06:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:06:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1436.eqiad.wmnet with reason: REIMAGE [14:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:39] thanks for the ping [14:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:31] PROBLEM - mediawiki-installation DSH group on mw1434 is CRITICAL: Host mw1434 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:16:16] (03PS1) 10JMeybohm: Create dragonfly user via systemd-sysusers [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708534 (https://phabricator.wikimedia.org/T286054) [14:16:33] mutante: expected? ^ [14:17:45] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.17; 2021-08-02), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Bawolff) >>! In T287380#7242110, @Krassotkin wrote: > @Bawolff Maybe you can try replacin... [14:18:59] (03PS3) 10JMeybohm: Add debian directory [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) [14:19:01] (03PS2) 10JMeybohm: Create dragonfly user via systemd-sysusers [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708534 (https://phabricator.wikimedia.org/T286054) [14:19:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[14:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:00] (03Abandoned) 10Muehlenhoff: configcluster: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698984 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:20:29] (03CR) 10JMeybohm: Add debian directory (033 comments) [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:26:02] (03CR) 10Filippo Giunchedi: [C: 03+2] "Excellent, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/708476 (https://phabricator.wikimedia.org/T281359) (owner: 10Filippo Giunchedi) [14:26:54] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1435.eqiad.wmnet', 'mw1436.eqiad.wmnet'] ` and were **ALL** successful. [14:26:56] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01108 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:27:56] (03CR) 10RLazarus: [C: 03+1] "LGTM - in a perfect world[tm] we'd be able to use httpbb to verify nginx-light is sufficient, but writing tests for a conf server isn't ex" [puppet] - 10https://gerrit.wikimedia.org/r/702117 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:32:36] (03CR) 10Cwhite: [C: 03+1] am: fix team_tag_matcher vs team_tags_matcher [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/708497 (owner: 10Filippo Giunchedi) [14:32:52] (03CR) 10Filippo Giunchedi: [C: 03+2] am: fix team_tag_matcher vs team_tags_matcher [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/708497 (owner: 10Filippo Giunchedi) [14:32:55] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: fix team_tag_matcher vs team_tags_matcher [debs/prometheus-icinga-exporter] - 
10https://gerrit.wikimedia.org/r/708497 (owner: 10Filippo Giunchedi) [14:33:15] rzl: sorry, yes, expected. depooled for reimaging but then had an IPMI issue. fixing [14:33:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw1434.eqiad.wmnet with reason: known issue [14:33:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw1434.eqiad.wmnet with reason: known issue [14:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:03] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30389/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708521 (owner: 10David Caro) [14:38:00] mutante: cool, figured it was something along those lines [14:38:11] (03CR) 10Filippo Giunchedi: [V: 03+1] prometheus.icinga-exporter-am: support --labels.team.config-file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708521 (owner: 10David Caro) [14:39:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:38] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw1434.eqiad.wmnet'] ` The log can be found in `/var/log/... [14:41:50] (03CR) 10Filippo Giunchedi: [C: 03+1] C:nagios_common: Command definition for posting to a client auth site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708524 (https://phabricator.wikimedia.org/T285762) (owner: 10Jbond) [14:41:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:44:59] (03CR) 10JMeybohm: "recheck" [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708534 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:46:16] (03CR) 10Ottomata: [C: 03+1] Remove IR schema config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708188 (owner: 10Sharvaniharan) [14:47:04] (03CR) 10jerkins-bot: [V: 04-1] Create dragonfly user via systemd-sysusers [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708534 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:48:35] (03CR) 10JMeybohm: "recheck" [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708483 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:48:43] (03PS3) 10Jbond: P:pki::multirootca: Add checks for each signer [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) [14:49:44] (03CR) 10jerkins-bot: [V: 04-1] P:pki::multirootca: Add checks for each signer [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) (owner: 10Jbond) [14:52:06] (03PS1) 10Filippo Giunchedi: syslog: expose centrallog retention in hiera [puppet] - 10https://gerrit.wikimedia.org/r/708540 [14:54:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30390/console" [puppet] - 10https://gerrit.wikimedia.org/r/708540 (owner: 10Filippo Giunchedi) [14:56:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1434.eqiad.wmnet with reason: REIMAGE [14:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1434.eqiad.wmnet with reason: REIMAGE [14:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:59:40] (03PS2) 10Jbond: C:nagios_common: Command definition for posting to a client auth site [puppet] - 10https://gerrit.wikimedia.org/r/708524 (https://phabricator.wikimedia.org/T285762) [15:00:04] (03CR) 10Jbond: [C: 03+2] "thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708524 (https://phabricator.wikimedia.org/T285762) (owner: 10Jbond) [15:01:58] (03PS4) 10Jbond: P:pki::multirootca: Add checks for each signer [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) [15:02:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30391/console" [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) (owner: 10Jbond) [15:03:06] (03CR) 10Btullis: [C: 03+1] "This looks fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/708159 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [15:03:32] (03PS5) 10Jbond: P:pki::multirootca: Add checks for each signer [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) [15:05:02] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:17] (03CR) 10Herron: syslog: expose centrallog retention in hiera (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708540 (owner: 10Filippo Giunchedi) [15:06:39] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: Add checks for each signer [puppet] - 10https://gerrit.wikimedia.org/r/708525 (https://phabricator.wikimedia.org/T285762) (owner: 10Jbond) [15:08:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:25] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Connect staging to read replica postgres node [deployment-charts] - 10https://gerrit.wikimedia.org/r/708494 
(owner: 10Jgiannelos) [15:11:28] (03PS2) 10Filippo Giunchedi: syslog: expose centrallog retention in hiera [puppet] - 10https://gerrit.wikimedia.org/r/708540 [15:11:48] (03CR) 10Filippo Giunchedi: syslog: expose centrallog retention in hiera (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708540 (owner: 10Filippo Giunchedi) [15:12:18] (03Merged) 10jenkins-bot: tegola-vector-tiles: Connect staging to read replica postgres node [deployment-charts] - 10https://gerrit.wikimedia.org/r/708494 (owner: 10Jgiannelos) [15:12:33] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Disable debugging on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/706405 (owner: 10Jgiannelos) [15:13:02] (03PS2) 10Jgiannelos: tegola-vector-tiles: Disable debugging on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/706405 [15:13:07] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:14] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Add system users and groups for for Airflow for Research and Platform Eng [puppet] - 10https://gerrit.wikimedia.org/r/708159 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [15:17:17] (03CR) 10Cwhite: [C: 03+1] syslog: expose centrallog retention in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708540 (owner: 10Filippo Giunchedi) [15:18:21] (03PS3) 10JMeybohm: Create dragonfly user via systemd-sysusers [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708534 (https://phabricator.wikimedia.org/T286054) [15:18:43] (03PS3) 10Jgiannelos: tegola-vector-tiles: Disable debugging on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/706405 [15:19:21] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1434.eqiad.wmnet'] ` 
and were **ALL** successful. [15:19:27] 10SRE, 10Traffic: DNS Discovery for active/passive failover within a data centre - https://phabricator.wikimedia.org/T287584 (10Legoktm) p:05Triage→03Medium [15:19:42] 10SRE, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Add logout.d script for Wikitech - https://phabricator.wikimedia.org/T287566 (10Legoktm) p:05Triage→03Medium [15:20:17] (03CR) 10Herron: [C: 03+1] syslog: expose centrallog retention in hiera [puppet] - 10https://gerrit.wikimedia.org/r/708540 (owner: 10Filippo Giunchedi) [15:21:47] (03CR) 10Filippo Giunchedi: [C: 03+2] syslog: expose centrallog retention in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708540 (owner: 10Filippo Giunchedi) [15:24:31] (03PS1) 10Elukey: knative-serving: override KUBERNETES_SERVICE_HOST [deployment-charts] - 10https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) [15:28:25] (03CR) 10JMeybohm: "A wmf helper template emerged just today https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708492" [deployment-charts] - 10https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [15:29:28] (03PS1) 10Jbond: P:pki::multirootca: add CA name to description [puppet] - 10https://gerrit.wikimedia.org/r/708547 [15:31:37] jayme: lol [15:31:42] thanks for the link, I'll try to use it [15:32:31] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: add CA name to description [puppet] - 10https://gerrit.wikimedia.org/r/708547 (owner: 10Jbond) [15:33:25] (03CR) 10Cwhite: [C: 04-1] prometheus.icinga-exporter-am: support --labels.team.config-file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708521 (owner: 10David Caro) [15:34:37] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Disable debugging on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/706405 (owner: 10Jgiannelos) [15:37:10] (03Merged) 10jenkins-bot: 
tegola-vector-tiles: Disable debugging on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/706405 (owner: 10Jgiannelos) [15:42:53] (03CR) 10Klausman: [C: 03+1] "I'm fine with both leaving this as is, or switching to using the templates Janis mentioned." [deployment-charts] - 10https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [15:43:47] (03CR) 10RLazarus: [C: 03+2] "Thanks! Going ahead and merging since volans is out, but happy to take post-hoc comments on this and make followup changes if desired." [puppet] - 10https://gerrit.wikimedia.org/r/708384 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [15:47:47] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:57] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.77`. Pre-deploy tests passing on canary `wdqs1003` [15:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:08] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@26273d8]: 0.3.77 [15:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:55] !log [WDQS Deploy] Tests passing following deploy of `0.3.77` on canary `wdqs1003`; proceeding to rest of fleet [15:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:54] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw143[4-6].eqiad.wmnet [15:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:04] (03PS1) 10Jbond: nagios_common: fix typo string vs strings [puppet] - 10https://gerrit.wikimedia.org/r/708554 [15:53:22] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw143[4-6].eqiad.wmnet [15:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:58] !log 
mw1434,mw1435,mw1436: scap pull, repooled, reimaged, converted from API to appserver for balancing (T279309) [15:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:05] T279309: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 [15:54:22] (03CR) 10Jdlrobson: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/708369 (https://phabricator.wikimedia.org/T281359) (owner: 10Jdlrobson) [15:54:37] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [15:55:11] (03CR) 10Jbond: [C: 03+2] nagios_common: fix typo string vs strings [puppet] - 10https://gerrit.wikimedia.org/r/708554 (owner: 10Jbond) [15:55:34] (03PS1) 10Jbond: P:pki::multirootca::monitoring: update sudo rules to drop sudo command [puppet] - 10https://gerrit.wikimedia.org/r/708555 [15:55:37] (03CR) 10Elukey: "Adding also Ben since he has been working on cookbooks a lot recently :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [15:57:03] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@26273d8]: 0.3.77 (duration: 08m 55s) [15:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:34] (03CR) 10Dzahn: [C: 03+1] "this looks good to me! 
just make sure when you merge this to manually run puppet on alert* hosts and then run "icinga -v /etc/icinga/icing" [puppet] - 10https://gerrit.wikimedia.org/r/708530 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [15:57:36] (03PS10) 10Btullis: Enable kerberos ticket auto-renewal for a test client [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [15:58:40] !log T287112 [WDQS] Re-pooled `wdqs2002` [15:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:47] T287112: hw troubleshooting: SSH failure for wdqs2002.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T287112 [15:59:54] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [15:59:58] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [15:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:00] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca::monitoring: update sudo rules to drop sudo command [puppet] - 10https://gerrit.wikimedia.org/r/708555 (owner: 10Jbond) [16:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [16:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:46] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01458 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:02:36] PROBLEM - Disk space on puppetdb2002 is CRITICAL: DISK CRITICAL - free space: 
/var/lib/puppetdb/stockpile/cmd/q 4 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=puppetdb2002&var-datasource=codfw+prometheus/ops [16:04:47] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30393/console" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [16:06:02] <_joe_> can someone look into puppetdb2002? [16:06:10] <_joe_> it's causing the puppet failures I guess [16:06:22] (03PS1) 10Ahmon Dancy: Generate mediawiki-multiversion-debug image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708559 (https://phabricator.wikimedia.org/T287495) [16:06:57] <_joe_> jbond: still around? [16:07:32] yes im here looking [16:07:41] (03PS2) 10Ahmon Dancy: Generate mediawiki-multiversion-debug image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708559 (https://phabricator.wikimedia.org/T287495) [16:07:59] _joe_: what made you think it was puppetdb, there is a change otto rolled out which is causing some puppet errors [16:08:21] <_joe_> jbond: just the correspondence between the two errors [16:08:27] <_joe_> but it might be the opposite [16:08:33] ack looking eitherway [16:08:34] <_joe_> the puppet failures filling up the db [16:09:05] (03CR) 10Elukey: "Ben https://puppet-compiler.wmflabs.org/compiler1002/30393/stat1008.eqiad.wmnet/index.html is not completely ok at a first pass, it says t" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [16:09:08] i'll fix that after standup here sorry yall [16:09:18] running procs keeping puppet from changing the homedir of a system user [16:09:50] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005831 https://puppetboard.wikimedia.org/nodes?status=failed 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:11:39] (03CR) 10Elukey: Enable kerberos ticket auto-renewal for a test client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [16:12:03] fyi puppetdb is having issues also, fixing now [16:12:25] there may be some other puppet failures for servers that use codfw while i fix [16:12:28] <_joe_> yeah it wasn't processing the facts/resources/etc AIUI since 1 hour and some [16:12:36] let me know if there's anything I can help with [16:12:36] yeah I was about to say, the analytics nodes are just a few, there is a mapreduce job ongoing, when it finishes they will be ok [16:12:55] <_joe_> elukey: yeah that's not what caused the issues to puppetdb [16:13:18] * elukey nods [16:13:35] <_joe_> now john is trying to remove stuff from the queue and save it, then restart puppetdb on 2002 [16:13:41] yes still looking around but looks like as joe said nothing has been getting submitted and so the q dir is filled up [16:14:21] exactly (first just taking a backup of the queue) [16:15:01] <_joe_> I missed jvm stacktraces [16:15:50] ok it's back up now [16:15:54] I know that you love them [16:16:19] <_joe_> so it seems it was taking a very long time to replace catalogs [16:16:58] ok, fixed my failures, thanks yall, sorry about that [16:17:18] <_joe_> jbond: it looks like it's still very slow though [16:17:24] yes seems to be this but got worse https://phabricator.wikimedia.org/T263578 [16:18:37] _joe_: unfortunately 500ms is about average for codfw as it has to submit to eqiad [16:18:44] (03PS5) 10Jdlrobson: Disable mobile contributions simplifications on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) [16:19:15] <_joe_> yeah but I saw replace catalogs taking almost half an hour [16:19:42] yes we seem to have strange spikes where it takes a long time. 
30 minutes is longer than i have seen. [16:20:04] my working theory is it's because of the facts set, specifically k8s hosts which have lots of interfaces [16:20:38] <_joe_> ugh that's going to get worse when we convert mediawiki appservers to be kubernetes workers [16:20:41] i think it also gets worse when puppetdb performs its garbage collection [16:21:03] if puppet(db) has been down for an hour, should this be considered an incident? [16:21:19] yes i think i need to just filter out these interfaces from being submitted to puppetdb but it's not so simple [16:21:50] legoktm: i'm not sure puppetdb was down for an hour, it's failed to submit reports for an hour but was probably only down for about 5-10 mins [16:22:09] <_joe_> jbond: I'm perplexed because postgres doesn't seem to be doing much either [16:22:21] ok, is it still down? [16:22:24] <_joe_> no [16:22:49] _joe_: yes these are just working theories, i haven't been able to track down anything concrete and no real error signalling [16:23:18] RECOVERY - Disk space on puppetdb2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=puppetdb2002&var-datasource=codfw+prometheus/ops [16:23:23] there is some stuff about facts exploding the db when you submit as it has to do a row merge (beyond my dba skills) [16:24:02] which is where that came from, we filtered some facts and it made a bit of difference but not enough to convince me to also filter the interface facts [16:24:32] but this is worse than i have ever seen, the q is already at 600mb again [16:25:35] <_joe_> yeah :/ [16:25:53] <_joe_> jbond: why is it so much worse in codfw though [16:26:45] well there is the obvious cross DC latency hit but other than that i'm not sure [16:26:50] <_joe_> is the eqiad-codfw network link having capacity issues? [16:27:43] eqiad does have some peaks but its XioNoX topranks ^^? 
[16:27:51] * topranks looking [16:28:08] <_joe_> I mean I can hardly imagine a 30 ms latency causing this; replacing a catalog going from 1 s to 245 s is hardly explained that way [16:28:55] <_joe_> and it seems the effect is larger for larger catalogs, which makes me think it could potentially have to do with bandwidth limits [16:29:03] _joe_: fyi the puppetdb1002 does have some spikes but it's more like going from <50ms to ~800ms. whereas puppetdb2002 goes from 500ms -> 30s (with your 30 minute one being by far the largest) [16:29:12] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Add logout.d script for lists.wikimedia.org - https://phabricator.wikimedia.org/T286906 (10Legoktm) Ack. I need to verify that we store sessions in the database, in which case https://stackoverflow.com/questions/953879/how-to-force-a-user-logout... [16:29:31] ack sounds plausible [16:29:50] could also be down to the ganeti host [16:29:57] <_joe_> replace catalog right now in codfw takes between 150 and 300 s [16:30:02] <_joe_> yes [16:30:34] Transport links between eqiad and codfw both look ok and healthy ~40% usage on each. [16:30:37] yes store report is also taking a long time [16:31:03] I'll double check a few more things to see if I can spot any other network factor that might be having an effect. [16:31:36] thanks [16:37:56] !log [WDQS Deploy] Deploy complete. 
Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [16:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:45] (03PS1) 10Legoktm: upload: Add explanatory comment to the hasty UA block from earlier [puppet] - 10https://gerrit.wikimedia.org/r/708564 [16:42:11] (03PS1) 10Legoktm: configmaster: Add mwdebug to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/708565 [16:43:01] jbond: I assume you'd like me to wait before merging my puppet patches [16:45:56] legoktm: you should be fine to merge as long as it's not related to puppetdb [16:46:13] ok, thanks [16:47:20] herron: cwhite: wonder if you are around to brainstorm puppetdb issues? [16:49:26] (03Abandoned) 10Legoktm: configmaster: Add mwdebug to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/708565 (owner: 10Legoktm) [16:49:41] (03PS2) 10Legoktm: configmaster: Add mwdebug to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/705877 (owner: 10Effie Mouzeli) [16:49:55] (03PS3) 10Legoktm: configmaster: Add mwdebug to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/705877 (owner: 10Effie Mouzeli) [16:51:00] hey jbond will be around for ~10m before a meeting [16:51:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Disable mobile contributions simplifications on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [16:52:07] herron: currently puppetdb2002 is processing submissions very slowly and the queue is filling up. 
it's writing about 7GB of queue data every hour then falling over [16:52:21] (03CR) 10Legoktm: [C: 03+2] configmaster: Add mwdebug to disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/705877 (owner: 10Effie Mouzeli) [16:52:44] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005831 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:52:57] (03PS1) 10Ahmon Dancy: DevServices.php: Add placeholder for push-notifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708568 [16:53:26] (03CR) 10Ahmon Dancy: [C: 03+2] DevServices.php: Add placeholder for push-notifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708568 (owner: 10Ahmon Dancy) [16:53:57] (03PS2) 10Legoktm: upload: Add explanatory comment to the hasty UA block from earlier [puppet] - 10https://gerrit.wikimedia.org/r/708564 [16:54:13] o/ [16:54:40] (03Merged) 10jenkins-bot: DevServices.php: Add placeholder for push-notifications [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708568 (owner: 10Ahmon Dancy) [16:54:41] cwhite: currently puppetdb2002 is processing submissions very slowly and the queue is filling up. 
it's writing about 7GB of queue data every hour then falling over [16:55:47] (03CR) 10Legoktm: [C: 03+2] upload: Add explanatory comment to the hasty UA block from earlier [puppet] - 10https://gerrit.wikimedia.org/r/708564 (owner: 10Legoktm) [16:57:47] (03PS3) 10Legoktm: varnish: Allow wikimedia.it to use maps tiles [puppet] - 10https://gerrit.wikimedia.org/r/703929 (https://phabricator.wikimedia.org/T261694) (owner: 10AntiCompositeNumber) [16:57:54] (03PS4) 10Legoktm: varnish: Allow wikimedia.it to use maps tiles [puppet] - 10https://gerrit.wikimedia.org/r/703929 (https://phabricator.wikimedia.org/T261694) (owner: 10AntiCompositeNumber) [16:59:32] fwiw I'm looking at dashboard on port 8080 via ssh tunnel and yeah can see the command queue just growing (at 1.3k now), do you have any suspicions as to why it's growing yet? am also looking at the puppetdb log, but I have to head into a meeting in a min [17:00:10] (03CR) 10Legoktm: [C: 03+2] varnish: Allow wikimedia.it to use maps tiles [puppet] - 10https://gerrit.wikimedia.org/r/703929 (https://phabricator.wikimedia.org/T261694) (owner: 10AntiCompositeNumber) [17:00:16] herron: it's growing because the submissions are taking 30s to 30 minutes [17:01:00] 10SRE, 10Wikimedia-Mailing-lists: Sort out who is admin for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10JOAN) >>! In T287554#7242901, @Dzahn wrote: > @Ladsgroup Can you promote a moderator to admin? Yes, please! Mi email is joanwikimedia@gmail.com (moderator) or comunicaciones@wikimediacolomb... 
[17:02:43] jbond: Postgres looks like it's not very performant [17:03:33] locking may be contributing to the slowness [17:04:18] fyi I'm running puppet on A:cp-upload via cumin to sync out the above two puppet changes [17:05:07] cwhite: could be postgres however puppetdb1001 seems to be submitting much faster [17:05:19] at least M1s [17:05:29] < 1s (which is still slow) [17:06:42] jbond: puppetdb1002: https://grafana-rw.wikimedia.org/d/000000469/postgres?viewPanel=1&orgId=1&var-dc=eqiad%20prometheus%2Fops [17:08:11] cwhite: oh wow that does look bad [17:08:58] going to restart puppetdb on puppetdb1001 just to see if it helps [17:09:23] cwhite: any idea what could be causing that [17:13:05] cwhite: did you just do something? [17:13:16] nope, haven't done anything yet [17:13:32] just had a lot of reports submit with 0ms (on codfw) [17:15:22] lots of command ignored afaict [17:15:28] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01108 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:15:37] logs on puppetdb1002 still showing long durations [17:15:47] ^^ that will be me restarting puppetdb, should clear [17:16:36] great it just kicked off gc that will help :( [17:17:10] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Legoktm) I merged @AntiCompositeNumber's patch and tiles now work on https://barriere.wikimedia.it/ - sorry about the dela... 
[17:21:20] jbond: locks just dropped off significantly [17:21:48] jbond: eh, they're back up again [17:22:45] cwhite: that seems to be in line with restarting puppetdb on codfw [17:23:19] (03PS1) 10Legoktm: admin: Remove jbol's access [puppet] - 10https://gerrit.wikimedia.org/r/708571 [17:23:25] the bigger drop at ~16:10 is when puppetdb2002 was down as the queue mount was out of space [17:25:35] (03PS1) 10Phuedx: Implement STV algorithm [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708413 (https://phabricator.wikimedia.org/T283728) [17:25:49] cwhite: the locking also corresponds with https://grafana.wikimedia.org/d/000000477/puppetdb?viewPanel=7&orgId=1 [17:31:29] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10Legoktm) [17:45:23] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10wiki_willy) Thanks @Dzahn! >>! In T280203#7242737, @Dzahn wrote: > @wiki_willy You guys can remove all old mw app... [17:46:42] cwhite: i'm looking at the locks and it seems there are a few autovacuum jobs which i'm guessing are causing the issue [17:47:26] jbond: same, but those vacuums have only been running for 1.5 hours [17:48:00] the high locking issue has been hanging around for around 4 hours [17:48:23] yes i know, i wonder if it first vacuumed the catalogs, then reports and now facts [17:48:53] the other locks all seem quite short lived [17:51:49] cwhite: what do you think about killing those autovacuum pids? 
[17:53:45] ahh the facts one just finished [17:53:47] jbond: vacuum is a fairly important process and it will likely start back up soonish [17:55:04] most the other locks we are seeing seem to now be with the edges table [17:55:48] (03PS6) 10Jdlrobson: Disable mobile contributions simplifications on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) [17:55:57] (03CR) 10Jdlrobson: [C: 03+1] "Scheduled for 4pm PST backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [17:57:33] edges table is done [17:58:01] factset started again [17:58:37] <_joe_> I doubt the problem is the db [17:58:48] <_joe_> given it's slow-ish in eqiad but 15x slower in codfw [17:58:52] _joe_: https://grafana.wikimedia.org/d/000000469/postgres?viewPanel=1&orgId=1&var-dc=eqiad%20prometheus%2Fops&from=now-6h&to=now [17:58:59] https://grafana.wikimedia.org/d/000000477/puppetdb?viewPanel=7&orgId=1 [17:59:08] _joe_: codfw is a read-replica [17:59:20] <_joe_> yeah I'm talking about puppetdb the app [17:59:23] <_joe_> not postgres [17:59:35] <_joe_> the two installations share the postgres instance AIUI? [17:59:59] puppetdb app was restarted not too long ago [18:00:02] <_joe_> so write operations being so slow only on one of the two seems suspicious [18:00:05] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T1800). [18:00:05] legoktm: A patch you scheduled for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:05] twentyafterfour and hashar: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T1800). [18:00:34] _joe_: its definetly slower on both [18:00:37] I'm going to be doing a full scap [18:00:48] legoktm: ok [18:00:58] which is now called sync-world apparently [18:01:02] see but yes it is a lot slower in codfw [18:01:04] <_joe_> but yeah the high amount of locks is a new thing also worth investigating [18:01:09] <_joe_> so I [18:01:19] <_joe_> I'm wondering what could cause this [18:01:21] (03CR) 10Legoktm: [C: 03+2] Add a tracking category to pages using the tag [extensions/intersection] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708225 (https://phabricator.wikimedia.org/T287380) (owner: 10Legoktm) [18:01:23] (03CR) 10Legoktm: [C: 03+2] Add a tracking category to pages using the tag [extensions/intersection] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/708224 (https://phabricator.wikimedia.org/T287380) (owner: 10Legoktm) [18:01:32] and it aligns pretty perfectly with the proccessing time [18:01:57] <_joe_> still doesn't explain the discrepancy [18:02:15] <_joe_> did you check if the ganeti hosts are network-saturated? 
[18:02:22] <_joe_> the physical hosts I mean [18:02:36] yes and they are not, only puppetdb is using the network on that host [18:02:49] <_joe_> ok [18:03:07] <_joe_> I'm sorry I just can't build a mental model that justifies this additional slowness [18:03:08] its running on ganeti2023 thgh if you wanted to check i could have missed something [18:03:45] <_joe_> unless there is some timeout somewhere so that 2002 rolls back transactions when they exceed some threshold [18:04:17] <_joe_> also, from what I remember, alex and I set up puppetdb to only perform GC from eqiad [18:04:26] <_joe_> I mean vacuuming the db [18:04:45] in normal operations i have seen puppetdb2002 run at average of 500ms but randome peakes of 30seconds -> 6minutes so i think there is some strange error handleing when something fails [18:04:46] <_joe_> is that still the case, or was that option dropped in later versions maybe [18:05:16] i have not touched any of the postgress stuff and not seen a ps since my time here [18:05:25] so gussing its the same [18:05:46] (03Merged) 10jenkins-bot: Add a tracking category to pages using the tag [extensions/intersection] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708225 (https://phabricator.wikimedia.org/T287380) (owner: 10Legoktm) [18:06:25] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:06:32] (03Merged) 10jenkins-bot: Add a tracking category to pages using the tag [extensions/intersection] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/708224 (https://phabricator.wikimedia.org/T287380) (owner: 10Legoktm) [18:08:46] !log legoktm@deploy1002 Started scap: Add a tracking category to pages using the tag [18:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:23] fyi the mean processing time has drop quite a bit since the edges vacume finished [18:10:45] mean processing time the p99 is still high [18:12:48] before, public.factsets was vacuum analyze right? [18:13:01] if so, it's on the actual vacuum step now? [18:13:25] cwhite: i think so yes, i also saw the same with the edges table first analyses then the vacume took about 10 mins [18:14:14] !log manually cleared out the puppetdb2002 queue [18:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:07] OTOH, locks are continuing to fall, slowly [18:16:16] im tempted to say lets keep an eye on it and see what happens when the vacume clears. 
with a follow up task to look at making the vacume more preformant [18:17:03] earlier in the postgres log, there were errors about duplicate key violates unique constraint [18:18:12] that seems to happen somewhat regularly though :/ [18:19:51] it was most frequent here in the last day though [18:20:10] fyi locks and processing have gone back up [18:20:37] (03PS3) 10Razzi: netboot: make an-masters reimage without confirmation [puppet] - 10https://gerrit.wikimedia.org/r/705782 (https://phabricator.wikimedia.org/T278423) [18:22:27] (03CR) 10Razzi: [C: 03+2] netboot: make an-masters reimage without confirmation [puppet] - 10https://gerrit.wikimedia.org/r/705782 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [18:23:00] jbond: did you see the last long-running query "UPDATE factsets..."? It's huge... [18:24:17] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10razzi) [18:24:45] cwhite no what are you looking at? [18:25:07] puppetdb1002:/var/log/postgresql/postgresql-11-main.log [18:25:16] 'volatile' column [18:29:30] cwhite: yes this si one of the things that i think causes issues. on the kubernetes hosts there are lots of interfaces and mount points which really bloats the factset. 
the numa fact (the random array of tuple ints) is also quite big and nosisy [18:30:21] https://phabricator.wikimedia.org/T263578#6492250 [18:31:56] we allready added this but we shold look at also filtering some of the structured facts https://gerrit.wikimedia.org/r/c/operations/puppet/+/634043/5/hieradata/role/common/puppetmaster/puppetdb.yaml [18:32:55] (03PS1) 10Jdlrobson: wgSkipSkins: Update defaults, hide modern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708581 (https://phabricator.wikimedia.org/T287616) [18:34:48] (03PS1) 10Dduvall: pipeline: Make blubberfile definitions slightly more coherent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708582 [18:36:02] !log legoktm@deploy1002 Finished scap: Add a tracking category to pages using the tag (duration: 27m 16s) [18:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:08] (03CR) 10Dduvall: "Is this any more understandable? I tried to at least remove the repetition of `runs:` and `lives:` configuration." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708582 (owner: 10Dduvall) [18:37:13] yay [18:41:06] sigh, I forgot a message [18:41:18] will let it roll out with next week's train [18:41:23] twentyafterfour: I'm all done [18:41:38] legoktm: thanks [18:41:56] (03PS1) 10Ottomata: Install hadoop client on an-airflow1002 [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) [18:43:41] _joe_: cwhite: not sure if you are still looking but im going to step away for a bit and get some food [18:44:00] <_joe_> jbond: I'm not, it's quite too late for me too [18:44:04] * cwhite is still looking [18:44:08] jbond: go for it [18:44:25] ack thanks [18:45:09] finished with my meeting just now, need any help with puppetdb? [18:47:07] (03Abandoned) 10Legoktm: admin: Remove jbol's access [puppet] - 10https://gerrit.wikimedia.org/r/708571 (owner: 10Legoktm) [18:54:50] did anyone look into GCs on the puppetdb2002 jvm already? 
looks to have increased since a few hours ago https://usercontent.irccloud-cdn.com/file/WrVjU52w/Screen%20Shot%202021-07-28%20at%202.54.01%20PM.png [18:56:08] might be worth trying to increase the heap as a stopgap [18:57:38] !log mwmaint2002$ foreachwikiindblist wikimania refreshLinks.php - to start populating DPL tracking category [18:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] twentyafterfour and hashar: Time to snap out of that daydream and deploy MediaWiki train - American+European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T1900). [19:01:23] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) >>! In T257066#7234490, @Beeswaxcandle wrote: >>>! In T257066#7233536, @Legoktm wrote: >> OK, we're... [19:09:36] !log Preparing to deploy 1.37.0-wmf.16 to group1 wikis [19:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:33] (03PS1) 1020after4: group1 wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708585 [19:13:35] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708585 (owner: 1020after4) [19:14:17] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708585 (owner: 1020after4) [19:15:40] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.16 refs T281157 [19:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:48] T281157: 1.37.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T281157 [19:16:47] !log twentyafterfour@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.16 refs T281157 
(duration: 01m 06s) [19:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Jclark-ctr) @RKemper can you confirm these need to be 10g space is limited in eqiad for 10g and these would be o... [19:29:56] (03PS2) 10Ottomata: Install hadoop client on an-airflow1002 [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) [19:34:41] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Legoktm) [19:49:50] herron: cwhi.te: noticed we had some db lockig starting at a simlar time as that https://grafana.wikimedia.org/d/000000469/postgres?viewPanel=1&orgId=1&var-dc=eqiad%20prometheus%2Fops&from=now-6h&to=now [19:50:09] which corrosponds to the high processing time as well https://grafana.wikimedia.org/d/000000469/postgres?viewPanel=1&orgId=1&var-dc=eqiad%20prometheus%2Fops&from=now-6h&to=now [19:50:43] seems that there are a bunch of vacuum jobs going on which may be causing contension [19:51:14] im hopping things get better when they finish, and tomorrow ill look at some more rebost fixes [19:55:56] of course any ideas defently welcome [20:00:04] twentyafterfour and hashar: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T1900). [20:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T2000). Please do the needful. 
[20:12:15] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) i noticed that when we get the high store reports we get a corresponding entry in the postgres log ` 2021-07-28 20:06:52.013 GMT [db:puppetdb,sess:6... [20:21:29] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Jclark-ctr) disk has been replaced @Marostegui [20:21:36] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Jclark-ctr) 05Open→03Resolved [20:34:24] jbond: the size of the edges table is incredibly large (7894 MB), and most of the queries that are blocking are trying to manipulate that table (DELETE, INSERT) [20:35:11] and the edges table appears to have increased by 12MB in the last 30 minutes or so [20:36:47] cwhite: do we have any stats on how it's grown in the last ~12 hours [20:37:04] looks like something happened at 13:30 today [20:37:11] I'm not sure [20:42:48] can't see anything obvious myself [20:44:35] cwhite: have you done any looking at the table, e.g. #rows, ideally #rows per certname? [20:45:05] my sql foo is not good enough to know how to perform a query that won't make the issue worse [20:46:47] I haven't tried to analyze the table structure yet.
[20:47:13] the DELETE FROM statements are the ones that appear the most as waiting for locks [20:47:25] wondering if some new hosts or a puppet change has made the edges suddenly explode [20:48:38] (03PS3) 10Ottomata: Install hadoop client on an-airflow1002 [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) [20:49:12] (03PS1) 10Ottomata: Add dummy keytabs for analytics-research and analytics-platform-eng airflow [labs/private] - 10https://gerrit.wikimedia.org/r/708590 [20:49:40] (03PS2) 10Ottomata: Add dummy keytabs for analytics-research and analytics-platform-eng airflow [labs/private] - 10https://gerrit.wikimedia.org/r/708590 (https://phabricator.wikimedia.org/T284225) [20:49:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10RKemper) **Networking** Yes we need these to be in 10G rows; we now have enough elastic hosts with 10G NICs to ge...
[20:50:24] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add dummy keytabs for analytics-research and analytics-platform-eng airflow [labs/private] - 10https://gerrit.wikimedia.org/r/708590 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [20:50:47] cwhite: estamet of rows 1.53744e+07 [20:51:12] SELECT reltuples AS estimate FROM pg_class where relname = 'edges'; [20:51:13] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30398/console" [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [20:53:47] afaict, edges are the linked edges of the graph [20:54:03] yes exactly the relationships between resources [20:54:35] there's 1718 distinct certnames [20:55:29] yes that relates to about the amount of nodes we have live [20:55:33] and one host (logstash1023, for example), has 8318 edges [20:57:28] sretest1001 has 6856 alert1001 has 14798 (they are probably the min/max) [20:57:55] hard to tell if the edges db is bloated or not. there do not appear to be any purged nodes bloating the table (rules out PDB-3515) [20:58:58] no the nunmber of hosts seem about right [20:59:18] (03PS4) 10Ottomata: Install hadoop client on an-airflow1002 [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) [20:59:20] the DELETE from query is `DELETE FROM edges WHERE certname=$1 and source=$2::bytea and target=$3::bytea and type=$4` [20:59:59] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30399/console" [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [21:00:11] which feels table-scanny to me. I would guess there's an index on certname, maybe one on type, but certainly not on source and target. 
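A minimal sketch of the "#rows per certname" count asked about above, assuming an `edges` table with the columns named in the DELETE statement (certname, source, target, type). It runs against an in-memory SQLite stand-in with made-up host names and rows, not against the real PuppetDB Postgres instance; the function names are hypothetical. The same GROUP BY should work on Postgres, where a plain SELECT only reads and does not take the row locks the DELETE/INSERT statements wait on, though a full count over ~15M rows still costs I/O, so pointing it at the read replica is probably the safer choice.

```python
import sqlite3


def edges_per_certname(conn, limit=5):
    """Return (certname, edge_count) pairs, largest first.

    Read-only: a single GROUP BY over the edges table.
    """
    cur = conn.execute(
        "SELECT certname, count(*) AS n FROM edges "
        "GROUP BY certname ORDER BY n DESC, certname LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


def demo_connection():
    """Build a tiny stand-in edges table (hypothetical data, SQLite not Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edges (certname TEXT, source BLOB, target BLOB, type TEXT, "
        "UNIQUE (certname, source, target, type))"
    )
    rows = [
        ("host-a", b"\x01", b"\x02", "contains"),
        ("host-a", b"\x02", b"\x03", "before"),
        ("host-a", b"\x03", b"\x04", "required-by"),
        ("host-b", b"\x01", b"\x02", "contains"),
        ("host-b", b"\x02", b"\x03", "before"),
        ("host-c", b"\x01", b"\x02", "contains"),
    ]
    conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for certname, n in edges_per_certname(demo_connection()):
        print(certname, n)
```

Sorting by count first makes outliers (a host whose edges suddenly exploded) show up at the top, which is the question being asked in the channel.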
[21:00:30] Indexes: "edges_certname_source_target_type_unique_key" UNIQUE CONSTRAINT, btree (certname, source, target, type) [21:00:34] Foreign-key constraints: "edges_certname_fkey" FOREIGN KEY (certname) REFERENCES certnames(certname) ON DELETE CASCADE [21:01:41] is it still growing btw? from how i understand it it shold not really change between puppet runs [21:01:58] unless someone commits some new code [21:02:10] or we add a host etc [21:02:21] jbond: yep, it's still growing (7908 MB now) [21:02:40] (03PS5) 10Ottomata: Set up airflow-research instance on an-airflow1002 [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) [21:04:21] so thats seems to be about 24MB/hour [21:04:46] wonder if thats because more deletes are bing queued then inserts?? [21:06:53] btw i dont see any vacume process anymore [21:07:24] rowcount appears to be accumulating: `select count(*) from edges;` (15340018 -> 15350754) [21:07:44] it feels like the delete statements aren't doing anything? [21:09:51] (03PS1) 10Michaelcochez: Added the PropertySuggester event logging to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 [21:09:53] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (owner: 10Michaelcochez) [21:11:37] cwhite: the following is the query i see holding locks and also showing up in the log as taking a long time to complete; the entries we see in the postgres log also line up with the long submission times [21:11:41] UPDATE certnames SET latest_report_id = $1, latest_report_timestamp = $2 WHERE certname = $3 AND ( latest_report_timestamp < $4 OR latest_report_timestamp is NULL ) | 00:02:26.768174 | 10907 [21:12:11] this is the query im running https://wikitech.wikimedia.org/wiki/User:Jbond/debuging#display_locks [21:13:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10RKemper) **Current state** *Before* accounting for new 10G switches opened up by https://phabricator.wikimedia.o... [21:13:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Jclark-ctr) due to constraint with open spaces for racking i will hold till i get confirmation on proposed rackin...
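The edges growth figures quoted earlier in the channel (12MB in roughly 30 minutes, 7908 MB at 15350754 rows) can be checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical, and it assumes the "MB" reported by Postgres means 1024*1024 bytes.

```python
def growth_mb_per_hour(delta_mb, minutes):
    """Extrapolate a size change observed over `minutes` to an hourly rate."""
    return delta_mb * 60.0 / minutes


def bytes_per_row(size_mb, rows):
    """Average on-disk bytes per row, assuming 1 MB = 1024 * 1024 bytes."""
    return size_mb * 1024 * 1024 / rows


if __name__ == "__main__":
    # 12MB in ~30 minutes, as observed at 20:35
    print(growth_mb_per_hour(12, 30))
    # 7908 MB at 15350754 rows, as observed shortly after 21:00
    print(round(bytes_per_row(7908, 15350754)))
```

The first figure matches the "about 24MB/hour" estimate given in the channel; the second is only a rough average, since the table size includes dead tuples and free space as well as live rows.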
[21:13:57] jbond: added another query I'm using to that section [21:14:14] great thanks <3 [21:14:38] (03CR) 10Michaelcochez: "I followed the instructions from https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#Deployment and mimicked the oth" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (owner: 10Michaelcochez) [21:16:39] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30400/console" [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [21:18:41] cwhite: things have definitely got better since the vacuum stopped. puppetdb2002 is managing to almost keep up. its queue is now flapping between 8-20M (before it was growing at ~3-7G per hour) [21:19:34] locks are dropping [21:20:34] i also notice that we have the size of the edges table from about 10 months ago and it was ' 8.3642e+06 | 4454 MB' (rows, size) [21:20:42] https://phabricator.wikimedia.org/T263578 [21:27:12] cwhite: ok it looks like things may have stabilised a bit now, as such im going to call it a night and pick this up again tomorrow. will be around for an hour or so, so feel free to ping if there are other issues. also please add anything else you have found or find to the task above, and thanks for all the help :) [21:28:35] jbond: sounds good, have a good night [21:29:59] cwhite: thanks, heres hoping it stays stable and have a nice day :) [21:36:17] edges has stopped growing [21:43:15] (03CR) 10Btullis: [V: 03+1] "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [21:57:27] that's not the case.
it paused growing for a while [22:08:26] RECOVERY - snapshot of s2 in eqiad on alert1001 is OK: Last snapshot for s2 at eqiad (db1102.eqiad.wmnet:3312) taken on 2021-07-28 20:54:04 (1048 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [22:08:34] (03CR) 10Btullis: "> Patch Set 10:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [22:17:04] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10KFrancis) @RLazarus The agreement was just sent for signatures. I'll confirm when complete. Thanks for your patience! [22:35:45] (03CR) 10Ahmon Dancy: gitlab: Provide profile for docker based GitLab runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210728T2300). [23:00:05] Jdlrobson: A patch you scheduled for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:01] present [23:06:50] thcipriani: around? [23:11:37] urbanecm: ? [23:14:14] it is too late to ask for a backport? We just got the tickets +2ed and it's merging now [23:16:55] Tran: I can't find anyone to backport unfortunately [23:19:20] Ah that's okay. Thank you for the heads up! 🙇‍♂️ [23:26:00] thcipriani: we really need to get that ping fixed. I have no idea who else to ping right now :) ^ [23:42:39] went afk for a bit, what's up? 
[23:42:44] is this the evening backport window? [23:43:07] Jdlrobson: still around? [23:43:41] i am [23:43:47] this is the evening backport window yeh [23:43:54] 15 mins left.. [23:44:07] I think I can do it in 15 if you're around for 15 [23:44:07] I have 2 config changes [23:44:11] I can be sure [23:44:17] should only take 15 mins [23:44:24] thanks! [23:44:45] (03CR) 10Thcipriani: [C: 03+2] Disable mobile contributions simplifications on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [23:44:57] (03CR) 10Thcipriani: [C: 03+2] wgSkipSkins: Update defaults, hide modern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708581 (https://phabricator.wikimedia.org/T287616) (owner: 10Jdlrobson) [23:45:36] (03Merged) 10jenkins-bot: Disable mobile contributions simplifications on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [23:46:05] Jdlrobson: first one is on mwdebug2002, check please [23:46:16] checking [23:48:06] thcipriani: yep that ones good [23:48:10] going [23:49:22] (03PS2) 10Thcipriani: wgSkipSkins: Update defaults, hide modern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708581 (https://phabricator.wikimedia.org/T287616) (owner: 10Jdlrobson) [23:49:34] thcipriani: Tran and I would also like to get a couple patches in if you have time? [23:49:41] They're on the calendar. 
[23:49:46] * thcipriani refreshes [23:49:58] (03CR) 10Thcipriani: wgSkipSkins: Update defaults, hide modern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708581 (https://phabricator.wikimedia.org/T287616) (owner: 10Jdlrobson) [23:50:02] (03CR) 10Thcipriani: [C: 03+2] wgSkipSkins: Update defaults, hide modern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708581 (https://phabricator.wikimedia.org/T287616) (owner: 10Jdlrobson) [23:50:37] !log thcipriani@deploy1002 Synchronized wmf-config: Config: [[gerrit:708158|Disable mobile contributions simplifications on Wikidata and Commons (T283988)]] (duration: 01m 58s) [23:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:46] T283988: Raw edit summary is shown in Wikidata user contribution page in mobile view - https://phabricator.wikimedia.org/T283988 [23:50:51] ^ Jdlrobson first one is live [23:51:07] (03Merged) 10jenkins-bot: wgSkipSkins: Update defaults, hide modern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708581 (https://phabricator.wikimedia.org/T287616) (owner: 10Jdlrobson) [23:51:09] thcipriani: sweet checking again.. [23:51:14] Niharika: do these need a full scap? [23:51:22] looks like there's some l10n? [23:51:34] (03CR) 10Ottomata: "Looks mostly good! Some comments." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (owner: 10Michaelcochez) [23:51:58] thcipriani: Oh shoot, the second one does. [23:52:12] Jdlrobson: your 2nd patch is on mwdebug2002 when you get a chance [23:52:25] thcipriani: How long does full scap take lately? [23:52:32] working beautifully! Thanks thcipriani [23:53:03] Niharika: heh, about the same amount of time, I'd prefer to do that when I didn't get such a late start on the window if it's not urgent [23:53:39] thcipriani: No worries. We'll tackle it tomorrow then. Thanks. [23:53:46] <3 thank you [23:53:53] Niharika: do you still need the first one? 
[23:54:15] I can get that one done probably [23:54:27] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/704454 [23:55:24] Jdlrobson: sorry, were you saying that the 2nd patch looked good on mwdebug2002? [23:55:26] thcipriani: We need the second one to test the first one, I think. Better to go together. What do you think Tran? [23:56:05] thcipriani: yep please sync :) [23:56:13] * thcipriani does [23:56:16] Together otherwise we can't test the first one easily [23:57:49] :( sorry I got such a late start on the window [23:57:53] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:708581|wgSkipSkins: Update defaults, hide modern (T287616)]] (duration: 01m 06s) [23:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:59] ^ Jdlrobson live now [23:58:01] T287616: Hide Modern in skin preferences - https://phabricator.wikimedia.org/T287616 [23:58:10] thcipriani: testing [23:58:30] Works ! great! [23:58:40] ack, thanks for testing [23:59:00] thcipriani: No problem! [23:59:01] thanks thcipriani for the quick turn around :)