[05:00:29] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:34:55] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:25:37] (03CR) 10Joal: "I have not looked at the logic of the code - some comments about parameter settings and minor nits." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: 10Mforns)
[08:45:14] (03PS1) 10Btullis: fix(standalone-consumers) Removes Solr from spring boot application config [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/830098
[09:09:53] (03CR) 10Btullis: [C: 03+2] fix(standalone-consumers) Removes Solr from spring boot application config [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/830098 (owner: 10Btullis)
[09:13:37] 10Quarry, 10Documentation-Review-Board, 10Key docs update 2021-22: Quarry docs - https://phabricator.wikimedia.org/T307011 (10KBach) 05Resolved→03In progress
[09:13:43] 10Quarry, 10Documentation-Review-Board, 10Key docs update 2021-22: Quarry docs - https://phabricator.wikimedia.org/T307011 (10KBach) 05In progress→03Resolved To me, this task is complete. @apaskulin, @Aklapper - please let me know if you have any comments. If not, I'll resolve this one in the coming weeks.
[09:26:25] (03CR) 10Joal: [V: 03+2] Update cassandra hql loading file [analytics/refinery] - 10https://gerrit.wikimedia.org/r/828518 (https://phabricator.wikimedia.org/T311507) (owner: 10Joal)
[09:35:02] (03Merged) 10jenkins-bot: fix(standalone-consumers) Removes Solr from spring boot application config [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/830098 (owner: 10Btullis)
[09:55:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4022 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4022%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:55:50] !log merged and deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/821695
[09:55:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:00:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4022 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4022%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:00:57] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 01): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (10EChetty)
[10:01:02] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 01), 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10EChetty)
[10:01:12] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 01): Create conda-base-env with last pyspark - https://phabricator.wikimedia.org/T309227 (10EChetty)
[10:33:48] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10gmodena) > Adding idea discussed with @Ottomata earlier on. It's probably interesting to separate strea...
[10:40:52] 10Analytics, 10Analytics-Wikistats, 10Data Engineering Planning: Get visibility which pages are being heavily edited, plundered, which need patrolling - https://phabricator.wikimedia.org/T315196 (10EChetty)
[10:41:04] 10Analytics, 10Analytics-Wikistats, 10Data Engineering Planning: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (10EChetty)
[10:42:25] 10Analytics-Radar, 10Data Engineering Planning, 10Event-Platform Value Stream, 10Platform Team Workboards (MW Expedition): Decouple EventBus and EventFactory - https://phabricator.wikimedia.org/T292121 (10EChetty)
[10:42:29] 10Analytics-Radar, 10Data Engineering Planning, 10Metrics-Platform, 10CSS: Schema code samples popup appears under the JSON table - https://phabricator.wikimedia.org/T272857 (10EChetty)
[10:42:49] 10Data-Engineering-Kanban, 10Data Engineering Planning: Investigate Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (10EChetty)
[10:42:53] 10Analytics-Kanban, 10Data Engineering Planning, 10Pageviews-Anomaly: Article on Carles Puigdemont has inflated pageviews in many projects - https://phabricator.wikimedia.org/T263908 (10EChetty)
[10:42:59] 10Quarry: test tox on PR - https://phabricator.wikimedia.org/T317092 (10rook)
[10:43:03] 10Analytics-Radar, 10Data Engineering Planning, 10Event-Platform Value Stream, 10Internet-Archive, 10The-Wikipedia-Library: Store page-links-change data in a database table and make available through a Special page - https://phabricator.wikimedia.org/T221397 (10EChetty)
[10:43:07] 10Analytics, 10Data Engineering Planning, 10Event-Platform Value Stream, 10Platform Engineering: EventStreams sending same data over and over (page links change) - https://phabricator.wikimedia.org/T290211 (10EChetty)
[10:43:11] 10Analytics-Radar, 10Data Engineering Planning, 10MediaWiki-extensions-EventLogging: SearchSatisfaction has validation errors for event.query - https://phabricator.wikimedia.org/T257331 (10EChetty)
[10:43:29] 10Analytics, 10Data Engineering Planning, 10Event-Platform Value Stream: mediawiki/page/properties-change schema should use map type for added and removed page properties - https://phabricator.wikimedia.org/T281483 (10EChetty)
[10:43:33] 10Analytics, 10Data Engineering Planning, 10Event-Platform Value Stream, 10Platform Team Workboards (Clinic Duty Team): Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (10EChetty)
[10:43:41] 10Analytics, 10Data Engineering Planning, 10Event-Platform Value Stream, 10Patch-For-Review: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10EChetty)
[10:43:49] 10Analytics, 10Data Engineering Planning, 10Event-Platform Value Stream: Refine event pipeline at this time refines data in hourly partitions without knowing if the partition is complete - https://phabricator.wikimedia.org/T252585 (10EChetty)
[10:43:53] 10Analytics, 10Data Engineering Planning, 10Metrics-Platform, 10Product-Infrastructure-Team-Backlog, 10Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (10EChetty)
[10:44:05] 10Analytics-Radar, 10Data Engineering Planning, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10EChetty)
[10:44:11] 10Analytics, 10Data Engineering Planning, 10Event-Platform Value Stream, 10Goal: Event Platform: Stream Connectors - https://phabricator.wikimedia.org/T214430 (10EChetty)
[10:44:33] 10Analytics-Kanban, 10Data Engineering Planning, 10Event-Platform Value Stream, 10MediaWiki-extensions-EventLogging, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10EChetty)
[10:44:39] 10Analytics-Wikistats, 10Data Engineering Planning: Non-mobile UAs on mobile (2g/gprs, etc) IP-blocks - https://phabricator.wikimedia.org/T58628 (10EChetty)
[10:44:59] 10Analytics, 10Data Engineering Planning, 10Event-Platform Value Stream, 10Metrics-Platform: Client-side error logging should use Elastic Common Schema (ECS) fields when possible - https://phabricator.wikimedia.org/T267602 (10EChetty)
[10:46:42] 10Analytics, 10Analytics-Wikistats, 10Data Engineering Planning, 10Data Pipelines: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (10EChetty)
[11:05:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[11:10:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4034 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4034%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[12:29:30] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10JAllemandou) > Do we already have a set of use cases for this layout?
[12:36:42] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10Ottomata) > This sounds like a separate thread though. Maybe we can spike some work on it? +1, just wan...
[12:36:56] o/
[12:37:06] good morning ottomata
[12:50:06] 10Data-Engineering, 10Equity-Landscape: Editorship Input Metrics - https://phabricator.wikimedia.org/T309274 (10ntsako) Added tables: ` SELECT country_code, metric_value growth_rate_unique_devices_column_ab, year FROM ntsako.georeadership_input_metrics WHERE year = 2021 AND metric...
[12:50:48] 10Data-Engineering, 10Equity-Landscape: Editorship Input Metrics - https://phabricator.wikimedia.org/T309274 (10ntsako) Hi @JAnstee_WMF, Please can you review this.
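On the MegaRAID flap at 05:00/05:34 above: an-worker1079 reports WriteThrough on all 13 logical drives, which on these controllers often means write caching was dropped because the battery-backed cache (BBU) is degraded or in a relearn cycle. A minimal sketch of the kind of check behind that alert, assuming the MegaCli binary is on the PATH (it may be installed as megacli or MegaCli64) and that it prints one cache-policy line per logical drive:

    import subprocess

    # Sketch only: -LDGetProp -Cache -LAll -aALL asks the controller for the
    # current cache policy of every logical drive on every adapter.
    out = subprocess.run(
        ["megacli", "-LDGetProp", "-Cache", "-LAll", "-aALL"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Lines look roughly like "...: Cache Policy:WriteBack, ReadAdaptive, ..."
    write_through = [line for line in out.splitlines() if "WriteThrough" in line]
    if write_through:
        print(f"CRITICAL: {len(write_through)} LD(s) not using WriteBack")
    else:
        print("OK: all logical drives using WriteBack")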
[12:50:57] 10Data-Engineering, 10Equity-Landscape: Editorship Input Metrics - https://phabricator.wikimedia.org/T309274 (10ntsako) a:05ntsako→03JAnstee_WMF
[12:52:57] 10Data-Engineering, 10Equity-Landscape: Readership input metrics - https://phabricator.wikimedia.org/T309273 (10ntsako) Added tables: ` SELECT country_code, metric_value growth_rate_unique_devices_column_ab, year FROM ntsako.georeadership_input_metrics WHERE year = 2021 AND metric...
[13:05:19] btullis: meeting?
[13:07:23] 10Data-Engineering, 10Equity-Landscape: Editorship Input Metrics - https://phabricator.wikimedia.org/T309274 (10ntsako) Added tables: ` SELECT country_code, commons commons_column_ja, mediawiki mediawiki_column_jb, wikidata wikidata_column_jc, wikipedia...
[13:07:45] 10Data-Engineering, 10Equity-Landscape: Readership input metrics - https://phabricator.wikimedia.org/T309273 (10ntsako) a:05ntsako→03JAnstee_WMF
[13:20:45] 10Data-Engineering-Kanban, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: [BUG] jsonschema-tools materializes fields in yaml in a different order than in json files - https://phabricator.wikimedia.org/T308450 (10gmodena) >>! In T308450#8161644, @Ottomata wrote: > @JAllemandou @Milimetric @phu...
[13:25:57] 10Analytics, 10Data-Engineering, 10Event-Platform Value Stream, 10Patch-For-Review, 10User-Elukey: Port architecture of irc-recentchanges to Kafka - https://phabricator.wikimedia.org/T234234 (10Ottomata)
[13:28:52] milimetric: good morning! I think https://phabricator.wikimedia.org/T314578 is on track to get deployed today but please do let me know if anything else is needed from me
[13:32:16] (03PS1) 10Gerrit maintenance bot: Add tl.wikiquote to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830165 (https://phabricator.wikimedia.org/T317113)
[13:32:48] (03PS1) 10Gerrit maintenance bot: Add az.wikimedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830167 (https://phabricator.wikimedia.org/T317119)
[13:42:24] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for today's deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830165 (https://phabricator.wikimedia.org/T317113) (owner: 10Gerrit maintenance bot)
[13:43:06] (03PS2) 10Joal: Add az.wikimedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830167 (https://phabricator.wikimedia.org/T317119) (owner: 10Gerrit maintenance bot)
[13:43:22] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for today's deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830167 (https://phabricator.wikimedia.org/T317119) (owner: 10Gerrit maintenance bot)
[14:27:28] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 01): Create conda-base-env with last pyspark - https://phabricator.wikimedia.org/T309227 (10EChetty) 05Open→03Resolved
[14:32:40] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 01), 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10Antoine_Quhen) Last resolution about this ticket: * forget about complete automation (puppet or ci) * add doc + a sh...
[14:44:33] 10Data-Engineering-Operations, 10Data Engineering Planning, 10Mail, 10SRE: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10jbond) p:05Triage→03Medium
[14:57:16] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 01): Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (10EChetty)
[16:08:56] (03CR) 10Bearloga: [C: 03+2] movement_metrics: Update global market active editors query [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/826911 (https://phabricator.wikimedia.org/T316398) (owner: 10Mayakpwiki)
[16:36:14] joal: joining us?
[16:36:23] 10Quarry: test tox on PR - https://phabricator.wikimedia.org/T317092 (10rook) https://github.com/toolforge/quarry/pull/3
[16:36:36] 10Quarry: test irc integration - https://phabricator.wikimedia.org/T316961 (10rook) https://github.com/toolforge/quarry/pull/2
[16:36:36] andrewbogott: Heya - I'm in meeting with my team now - can I follow on IRC? :S
[16:36:44] 10Quarry: build container on PR - https://phabricator.wikimedia.org/T316958 (10rook) https://github.com/toolforge/quarry/pull/1
[16:37:03] sure, or we can delay 30 minutes
[16:37:29] I'll be in interview in 30 - so better now for me :)
[16:40:11] ok :)
[16:42:57] I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/828102 as soon as CI is ready
[16:44:05] over here, apergos :)
[16:45:18] I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/828102. Now forcing puppet runs on some toolforge nodes as spot checks...
[16:45:38] ack
[16:46:35] hello to the fourth member of the cabal!
[16:46:49] yaay!
[16:47:02] heya :)
[16:47:52] joal: can you pick one of your nfs clients and do a puppet refresh to make sure things are still OK?
[16:48:10] andrewbogott: I don't have root :(
[16:48:18] then just tell me the name of one :)
[16:48:24] btullis: could you please help with that --^ ?
[16:48:39] an-launcher1002.eqiad.wmnet would be one andrewbogott
[16:48:51] thx!
[16:49:35] Running puppet on an-launcher1002 now.
[16:49:46] oops I beat you to it :)
[16:49:54] thanks btullis :)
[16:50:25] andrewbogott: I can read nfs stuff, no issue
[16:50:27] my minimal cloud-vps/toolforge spot checks seem fine
[16:50:38] btullis: it's still mid-move so I'll want you to check again in 5
[16:50:55] ack
[16:51:46] ok, it's done. can you check again, and tell me which path you're checking?
[16:52:21] andrewbogott: I cd into /mnt/data/xmldatadumps/public/other/pageviews
[16:52:42] andrewbogott: I can cd into subfolders, view files etc
[16:52:53] nice. that's using the new mount so I think we're good.
[16:53:02] \o/
[16:53:27] so joal I think all that's left is for you to keep an eye on whether the dumps remain up-to-date in the next few days. And also make sure y'all aren't relying on the old mount points /mnt/nfs/dumps-labstore1007.wikimedia.org and /mnt/nfs/dumps-labstore1006.wikimedia.org
[16:53:29] similarly andrewbogott, I can also access the analytics dumps data from the internet, so that seems to be working as well
[16:53:39] I'll surely do that andrewbogott
[16:53:46] apergos: I'm feeling lucky, ok if I apply that dns change too?
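The spot checks above (16:47 to 16:53) can be scripted. A minimal sketch, assuming the mount paths quoted in the conversation; this is an illustration, not an existing tool:

    import os

    # Paths taken from the conversation above.
    new_mount = "/mnt/data/xmldatadumps/public/other/pageviews"
    old_mounts = [
        "/mnt/nfs/dumps-labstore1006.wikimedia.org",
        "/mnt/nfs/dumps-labstore1007.wikimedia.org",
    ]

    # The migrated mount should exist and be listable, which is what the
    # manual "cd in and look around" check verifies.
    entries = os.listdir(new_mount)
    print(f"{new_mount}: {len(entries)} entries, readable")

    # Clients should no longer depend on the legacy labstore mount points.
    for old in old_mounts:
        if os.path.ismount(old):
            print(f"warning: legacy mount still active: {old}")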
[16:54:10] yes I think so
[16:54:41] noting two places that still have labstore related names in puppet:
[16:54:43] hieradata/codfw/profile/openstack/codfw1dev/networktests.yaml: DUMPSFILE: /mnt/nfs/dumps-labstore1006.wikimedia.org/index.html
[16:54:50] hieradata/eqiad/profile/openstack/eqiad1/networktests.yaml: DUMPSFILE: /mnt/nfs/dumps-labstore1006.wikimedia.org/index.html
[16:55:04] I assume neither of these are big deals but they will need to be fixed up I guess
[16:55:34] apergos: oh thanks! I'm pretty much the only one who runs that script lately but I will update it.
[16:55:38] dns change is rolling out now
[16:55:44] crossing fingers
[16:56:45] 1 hour
[16:56:51] that's going to be a long wait, ugh
[16:57:05] hm... how do I know if my browser is using the new cname? I guess maybe by waiting an hour :/
[16:57:23] wait an hour, try nslookup or dig from the command line, see what your isp does, heh
[16:57:34] dig on my laptop already shows the new one
[16:57:55] https://www.irccloud.com/pastebin/SrTMWbpQ/
[16:57:57] just tried it on an internal host
[16:58:01] looks good
[16:58:03] But I don't necessarily trust the browser to be using that.
[16:58:11] oh, nice.
[16:58:19] same on laptop
[16:58:21] Someone did a very nice, thorough job of puppetizing those hosts.
[16:58:28] so we know that the syntax is right and the name is being served
[16:58:56] a few different people over time worked on those manifests
[16:59:01] most recent was probably Brooke
[16:59:37] I mean, reproducibility is the goal with puppet but the fact that I was able to rebuild all the same functionality on Bullseye is very pleasing.
[17:00:06] OK folks, I think we're done for now apart from waiting for the other shoe to drop. Please ping me here immediately if you find any surprises in the next 24h or so.
[17:00:20] thank you all!
[17:00:29] it's 8 pm for me so I won't be great about notifications for the next
[17:00:33] Hey andrewbogott, apergos and JustHannah - I'm going into interview mode - I'll be back in 1 hour :) Thanks a lot to all of you for the migration :)
[17:00:34] well 12-13 hours probably
[17:00:57] feel free to ping me (and Hannah) here or maybe better in the usual sre or security channel
[17:01:03] if something comes up.
[17:01:16] thanks for all the work!
[17:02:28] So far I'm just coasting on work that other folks did :) The hard bit was the hdfs port which I suspect caused btullis to cry tears of blood.
[17:02:57] I watched those package builds move along day by painful day, but it got done in the end, kudos!
[17:04:31] andrewbogott: switching the active nfs host broke toolforge and paws k8s containers, because they don't have the volume defined
[17:04:49] aww crap
[17:04:59] taavi: tell me more? I left all the old pieces in place so I'd expect them to just coast along on that
[17:05:15] I guess they get the new name from someplace?
[17:06:47] andrewbogott: there are symlinks in /public/dumps which are defined by puppet to point to "/mnt/nfs/dumps-${dumps_active_server}/${stuff}"
[17:06:57] you switched dumps_active_server to something that's not mounted on the containers yet
[17:07:21] those are the ones you use, yeah
[17:07:33] Ah, and those mounts are internal to the container rather than on the host VM
[17:08:02] the containers only mount the host paths they're told to mount, and the new dumps hosts are not included
[17:08:09] hmmmm
[17:08:27] That's not fixable at runtime, right? only by building fresh containers and restarting everything?
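taavi's explanation at 17:06-17:08 is the crux: a pod only gets the hostPath volumes that were declared in its spec at creation time, so an NFS mount added on the node afterwards stays invisible until the pod is recreated with an updated spec. A sketch of how one could list which /mnt/nfs paths running pods actually mount, using the official Kubernetes Python client; the tool-example namespace is hypothetical:

    from kubernetes import client, config

    # Requires a valid kubeconfig; "tool-example" is a made-up namespace
    # (on Toolforge each tool runs in its own namespace).
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod("tool-example").items:
        for vol in pod.spec.volumes or []:
            # hostPath volumes are fixed at pod creation, which is why the
            # dumps switch broke containers that predate the new mount.
            if vol.host_path and vol.host_path.path.startswith("/mnt/nfs"):
                print(pod.metadata.name, vol.host_path.path)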
[17:09:14] you don't need to build new containers, but yeah, adjusting some config (which is very painful on toolforge) and restarting stuff
[17:09:51] ok. so time to revert, yes?
[17:10:19] for toolforge the painful part comes from the fact that we hardcode those paths in the PodSecurityPolicy objects, and there's one (or two, don't remember) of those for each tool
[17:10:26] dumps_dist_active_vps needs to be reverted, everything else can stay
[17:11:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/830218
[17:11:24] Oh, hm... ok, let's just do that then
[17:11:28] * andrewbogott updates that revert
[17:11:38] yeah because otherwise we also have the dns patch
[17:14:09] https://gerrit.wikimedia.org/r/c/operations/puppet/+/830218
[17:14:14] let's wait for jenkins given its whine a minute ago
[17:14:34] heh, and it turns out the dns patch was 'wrong' inasmuch as dns was already pointing to 1006 even though 1007 was marked as the web server in puppet :/
[17:14:44] So my revert is trying to sort that out as well.
[17:15:14] 1007 was active for web? really? huh
[17:16:30] so that might mean that web logs copied over to analytics all this time from the active web server... weren't very useful
[17:16:32] huh
[17:16:36] ummm no I think I'm confused
[17:16:48] I think it was 1006, I just made a mistake in my original patch
[17:16:55] ah ok!
[17:16:55] anyway after that 'revert' patch things should be consistent
[17:17:07] good good
[17:17:28] jenkins likes it, shall I +1?
[17:17:41] sure
[17:17:51] done
[17:18:51] 10Data-Engineering, 10Product-Analytics, 10wmfdata-python: Support importing a Parquet file into HDFS using wmfdata-python - https://phabricator.wikimedia.org/T273196 (10nshahquinn-wmf) p:05Medium→03Low
[17:20:42] taavi: can you check https://phabricator.wikimedia.org/T317144 for accuracy? And then, when you have time, add details about how to actually change all that :/
[17:21:46] I went to subscribe to it and found I already was :-)
[17:22:25] I suppose the old mounts need to go away before the old hosts can be taken out of service too
[17:22:50] yeah, although maybe in k8s we can just replace rather than add
[17:23:00] oh, no we can't. Dang
[17:23:22] oh, correction, in the PSPs we permit all of /mnt/nfs instead of the individual hosts. this makes it much less painful!
[17:24:08] so you'd need to modify volume-admission-controller config files and something in PAWS too
[17:24:19] thank you for the catch, btw taavi. Are things back to working OK?
[17:26:34] not yet, but I think that's just puppet not running everywhere yet
[17:27:47] at least some of this ought to be !logged I guess
[17:27:50] running puppet manually on a single host looks good
[17:43:31] joal: sorry for the delay, I will deploy but later, do you still have time today to sync?
[17:45:15] Yes milimetric, in 15 minutes
[18:07:16] milimetric: heya - wanna chat now?
[18:07:36] joal: yes give me one min
[18:07:40] sure
[18:07:47] somehow didn't see your ping
[18:10:29] ok joal batcave?
[18:10:37] OMW!
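On the 16:57 question of whether the browser had picked up the new CNAME: asking the resolver directly sidesteps any browser-level DNS cache. A minimal sketch; dumps.wikimedia.org is an assumption here, since the log never names the record being switched:

    import socket

    # getaddrinfo follows the CNAME and returns the addresses the name
    # currently resolves to, via the system resolver rather than the browser.
    name = "dumps.wikimedia.org"  # assumed record; not named in the log
    addrs = {info[4][0] for info in socket.getaddrinfo(name, 443)}
    print(name, "->", sorted(addrs))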
[18:28:47] !log weekly deployment train starting
[18:28:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:32:13] (03CR) 10Milimetric: Fix Array UDFs (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/828566 (owner: 10Nmaphophe)
[18:32:36] Starting build #111 for job analytics-refinery-maven-release-docker
[18:45:46] Project analytics-refinery-maven-release-docker build #111: 09SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/111/
[18:45:53] 10Data-Engineering, 10API Platform, 10Platform Engineering Roadmap, 10User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (10codebug)
[18:47:20] Starting build #70 for job analytics-refinery-update-jars-docker
[18:47:54] Project analytics-refinery-update-jars-docker build #70: 09SUCCESS in 33 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/70/
[18:47:55] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.6 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830237
[18:48:27] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add refinery-source jars for v0.2.6 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830237 (owner: 10Maven-release-user)
[18:49:12] !log finished refinery-source 0.2.6 deploy, waiting 5 minutes and starting refinery deploy
[18:49:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:53:57] (03CR) 10Ottomata: Fix Array UDFs (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/828566 (owner: 10Nmaphophe)
[19:48:49] ottomata: some more errors scap deploying refinery, stat1006 this time:
[19:48:52] https://www.irccloud.com/pastebin/VbwhBXEE/
[19:49:34] IOError: [Errno 28] No space left on device\nerror: external filter 'git-fat filter-clean' failed
[19:49:37] yeah, no space
[20:08:19] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[20:28:15] (03PS1) 10Milimetric: Delete unused jars [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830260
[20:29:10] I think I need SRE access to make space, ottomata, help?
[20:29:31] an-launcher is in trouble
[20:29:46] (btw, I made https://gerrit.wikimedia.org/r/c/analytics/refinery/+/830260 to just delete these old unused jars)
[20:35:15] did someone clear the space? I'm confused
[20:35:21] I guess I'll try deploying again...
[20:36:06] maybe the rollback deleted them... but there should be space then...
[20:46:19] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[20:48:55] milimetric: hi sorry!
[20:48:56] here!
[20:49:16] still problems? how can I help asap?
[20:49:21] sok, I think something auto-recovered on an-launcher
[20:49:34] or someone was fixing it behind the scenes or something
[20:49:43] it was out of space and now it's not
[20:49:53] the deploy was broken but I deployed -f again and it seems to work now
[20:50:21] thanks ottomata, I'll ping if anything breaks again... and thanks to whoever freed up space on an-launcher
[20:50:40] hm.
[20:50:41] okay
[20:50:49] and stat1006?
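The git-fat ENOSPC on stat1006 and the full /srv on an-launcher1002 are the same class of failure: the deploy started without checking free space. A minimal preflight sketch; the 5 GiB threshold is an arbitrary example, not a scap feature:

    import shutil

    GIB = 1024 ** 3

    def check_free(path: str, need_gib: float = 5.0) -> None:
        # shutil.disk_usage reports the filesystem backing `path`, the thing
        # git-fat exhausted above ("IOError: [Errno 28] No space left on device").
        usage = shutil.disk_usage(path)
        free = usage.free / GIB
        print(f"{path}: {free:.1f} GiB free of {usage.total / GIB:.1f} GiB")
        if free < need_gib:
            raise SystemExit(f"refusing to deploy: {path} below {need_gib} GiB free")

    check_free("/srv")  # the mount that hit 0 MB on an-launcher1002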
[20:51:07] seemed fine after deploy -f too
[20:51:13] oh okay, yeah there is lots of space free there
[20:51:14] but that one never threw a disk space alarm, just failed
[20:51:21] oh okay
[20:51:56] I deleted like 90% of the jars in the artifacts directory, not just old jars, like all versions we don't use
[20:52:01] (not merged yet)
[20:52:18] that would be nice to merge sometime
[20:52:53] milimetric: let's do it tomorrow
[20:53:17] maybe thursday, tomorrow's kinda nuts for me
[21:06:56] k
[21:07:04] 10Quarry, 10GitLab (Project Migration): Move quarry to gitlab or github - https://phabricator.wikimedia.org/T308978 (10rook)
[21:07:36] (03PS1) 10Milimetric: Fix groupby (hive is unfortunately like mysql here) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830263
[21:07:53] argh, I gotta redeploy, there's a bug in that query ^
[21:07:54] 10Quarry, 10GitLab (Project Migration): Move quarry to gitlab or github - https://phabricator.wikimedia.org/T308978 (10rook)
[21:08:00] 10Quarry: test tox on PR - https://phabricator.wikimedia.org/T317092 (10rook)
[21:08:06] 10Quarry: test irc integration - https://phabricator.wikimedia.org/T316961 (10rook)
[21:08:12] 10Quarry: build container on PR - https://phabricator.wikimedia.org/T316958 (10rook)
[21:08:15] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Fix groupby (hive is unfortunately like mysql here) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/830263 (owner: 10Milimetric)
[21:32:42] 10Data-Engineering, 10Product-Analytics: [REQUEST] Add new Fundraising dimensions to druid.pageviews_daily & druid.pageviews_hourly - https://phabricator.wikimedia.org/T304571 (10Mayakp.wiki) Per discussion in today's Board Refinement meeting moving this task to Tracking for Product Analytics as the scope of t...
[21:37:47] 10Data-Engineering: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (10fkaelin)
[21:45:13] !log cleared logs earlier than September 1st from an-launcher1002:/srv/airflow-analytics/logs/scheduler
[21:45:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:48:56] 10Data-Engineering: Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (10Ottomata) Context: - https://wikitech.wikimedia.org/wiki/Analytics/Web_publication - https://github.com/wikimedia/puppet/blob/production/modules/statistics/manifests/rsync/published.pp...
[21:57:29] !log finished cleaning up bad state and re-deploying refinery
[21:57:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:57:58] it took 3.5 HOURS!!! :( :(
[22:18:16] !log restarted webrequest load bundle
[22:18:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:18:25] !log restarted referrer daily coordinator
[22:18:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:18:35] !log restarted webrequest druid daily and hourly jobs
[22:18:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
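On the 21:07 patch "Fix groupby (hive is unfortunately like mysql here)": the query itself is not shown in the log, so the exact bug is unknown; one Hive behaviour that mirrors MySQL and commonly needs this kind of fix is positional GROUP BY, where grouping keys bind to SELECT-list positions instead of column names, so editing the SELECT list silently changes the grouping. A sketch in PySpark, which accepts the same syntax (spark.sql.groupByOrdinal is on by default); the data is made up:

    from pyspark.sql import SparkSession

    # Illustrative only: the actual refinery query is not in the log.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    spark.createDataFrame(
        [("en.wikipedia", "desktop", 10), ("en.wikipedia", "mobile", 5)],
        ["project", "access", "views"],
    ).createOrReplaceTempView("pageviews")

    # GROUP BY 1, 2 binds to whatever happens to be first and second in the
    # SELECT list, so reordering or inserting a column changes the grouping
    # keys without any error and only shows up as wrong aggregates downstream.
    spark.sql("""
        SELECT project, access, SUM(views) AS views
        FROM pageviews
        GROUP BY 1, 2
    """).show()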