[{"content":" Background I was recently asked to help, pro bono, with analysing some life science data. From the outset I expected a steady stream of feedback along the lines of can you show this like that, and that prospect pulled me back to my early career days working as a researcher and analyst for various outfits. What those years taught me is that the bottleneck is rarely the analysis itself; it is the loop of sharing a result, hearing how someone would prefer to see it, and turning that around quickly. A small, self-service tool that lets people poke at the data themselves removes most of that friction.\nSo the goal here is deliberately narrow: run a lightweight instance of Apache Superset on a Debian laptop, let a handful of users on the local network upload their own CSV files, and give them enough to build basic dashboards without my involvement. Setting this up turned out to be surprisingly straightforward.\nThis is not a production setup, and it does not pretend to be. It is a pragmatic, LAN-only system that delivers most of the value with very little overhead. The approach mirrors how I tend to structure quick technical experiments: keep the dependencies simple, validate each step before moving on, and only harden the parts that genuinely need it.\nArchitecture The setup is intentionally minimal. Every choice below errs towards the smallest thing that works:\nComponent | Choice\nBI tool | Apache Superset\nMetadata DB | PostgreSQL\nData storage | PostgreSQL (same instance)\nRuntime | Python virtual environment\nAccess | Local network only\nNo Docker, no orchestration, no reverse proxy: just enough to get a working system that I can reason about end to end. 
Using a single PostgreSQL instance for both Superset\u0026rsquo;s metadata and the uploaded datasets keeps the moving parts to a minimum, which matters more than tidiness on a machine that is only ever reached from the LAN.\nSystem Setup Starting from a fresh Debian machine, the first job is to install the build toolchain and the libraries Superset\u0026rsquo;s Python dependencies compile against, alongside PostgreSQL and Redis:\nsudo apt update -y -V\nsudo apt install -y -V \\\n  build-essential \\\n  libssl-dev \\\n  libffi-dev \\\n  python3-dev \\\n  python3-venv \\\n  python3-pip \\\n  libpq-dev \\\n  postgresql \\\n  redis-server\nPostgreSQL serves double duty here, holding both Superset\u0026rsquo;s own metadata and the datasets people upload. Redis is not strictly required for a setup this small, but it is cheap to have in place for caching and async queries should the workload grow.\nDatabase Configuration With PostgreSQL running, the next step is to create a dedicated database and user for Superset:\nsudo -u postgres psql\nCREATE DATABASE superset;\nCREATE USER superset_user WITH PASSWORD \u0026#39;***\u0026#39;;\nGRANT ALL PRIVILEGES ON DATABASE superset TO superset_user;\nThere is one easily missed step. Since PostgreSQL 15, roles no longer receive CREATE on the public schema by default, and GRANT ALL on the database does not cover schema-level privileges, so these have to be granted explicitly. Run the statements while connected to the superset database (\\c superset in psql), because schema privileges are set per database:\nGRANT USAGE, CREATE ON SCHEMA public TO superset_user;\nALTER SCHEMA public OWNER TO superset_user;\nSkip this and Superset fails partway through initialisation with a permission error that is far less obvious than its cause.\nSuperset Installation Superset installs cleanly into a Python virtual environment, which keeps it isolated from the system Python and makes it trivial to remove later. 
I also pull in psycopg2-binary so SQLAlchemy can talk to PostgreSQL:\n1 2 3 4 5 6 mkdir -p ~/superset cd ~/superset python3 -m venv venv source venv/bin/activate pip install --upgrade pip setuptools wheel pip install apache-superset psycopg2-binary Superset reads its configuration from a superset_config.py file. At a minimum it needs to know where its metadata database lives and a secret key for signing sessions:\n1 2 SQLALCHEMY_DATABASE_URI = \u0026#34;postgresql://\u0026lt;user\u0026gt;:\u0026lt;password\u0026gt;@localhost/superset\u0026#34; SECRET_KEY = \u0026#34;***\u0026#34; Generate the secret key with something like openssl rand -base64 42 rather than inventing one by hand; Superset refuses to start with the default placeholder. With the config in place, point Superset at it and run the one-time initialisation — migrating the metadata schema, creating an admin account, and loading the default roles:\n1 2 3 4 export SUPERSET_CONFIG_PATH=~/superset/superset_config.py superset db upgrade superset fab create-admin superset init Running Superset (properly) Superset\u0026rsquo;s built-in development server is fine for a first smoke test, but it prints a warning telling you not to rely on it, and rightly so — it is single-threaded and not built for sustained use. Gunicorn with gevent workers is the recommended way to run it:\n1 pip install gunicorn gevent Bind it to all interfaces so the rest of the LAN can reach it, with a handful of workers to handle concurrent users:\n1 2 3 4 5 gunicorn \\ --workers 4 \\ --worker-class gevent \\ --bind 0.0.0.0:8088 \\ \u0026#34;superset.app:create_app()\u0026#34; At this point Superset is reachable from any machine on the network at:\n1 http://\u0026lt;machine-ip\u0026gt;:8088 Running it by hand is fine for testing, but it dies the moment the terminal closes or the laptop reboots. 
The next step makes it survive both.\nSystemd Service Wrapping the Gunicorn command in a systemd unit makes Superset start on boot, restart if it crashes, and wait for PostgreSQL and Redis to be ready first. Adjust the User and paths to match your own account:\n1 sudo nvim /etc/systemd/system/superset.service 1 2 3 4 5 6 7 8 9 10 11 12 13 [Unit] Description=Apache Superset After=network.target postgresql.service redis-server.service [Service] User=jan WorkingDirectory=/home/jan/superset Environment=\u0026#34;SUPERSET_CONFIG_PATH=/home/jan/superset/superset_config.py\u0026#34; ExecStart=/home/jan/superset/venv/bin/gunicorn --workers 4 --worker-class gevent --bind 0.0.0.0:8088 \u0026#34;superset.app:create_app()\u0026#34; Restart=always [Install] WantedBy=multi-user.target Reload systemd so it picks up the new unit, then enable and start it:\n1 2 3 sudo systemctl daemon-reload sudo systemctl enable superset sudo systemctl start superset systemctl status superset and journalctl -u superset -f are then the two commands worth remembering whenever something misbehaves.\nNetwork Access Binding to 0.0.0.0:8088 means Superset listens on every interface, which is convenient but worth fencing in. Even on a home network I would rather the service be unreachable from anything outside the local subnet, so I let ufw enforce that explicitly:\n1 2 3 4 sudo apt install -y ufw sudo ufw default deny incoming sudo ufw allow from 192.168.50.0/24 to any port 8088 proto tcp sudo ufw enable Substitute your own subnet for 192.168.50.0/24. With this in place the dashboard is available to everyone on the LAN and to nobody beyond it, which is exactly the boundary I want for a tool holding someone else\u0026rsquo;s data.\nAdding Data Superset will not expose a database until you tell it to, and CSV uploads are off by default for good reason — they let any authorised user write tables into your warehouse. 
Both are enabled through the same connection settings:\nAdd the PostgreSQL instance as a database connection. Under that connection\u0026rsquo;s advanced settings, tick Allow file uploads to database and set the target schema (I use a dedicated analytics schema rather than public, so uploaded data stays separate from Superset\u0026rsquo;s own tables). Users then load their own files from Settings → Upload file to database, picking the CSV, the destination schema, and a table name. This is the part that makes the setup genuinely self-service: once it works, people stop asking me to ingest data for them.\nData Modelling CSV uploads are convenient but blunt: every column tends to arrive as text, and Superset works best when it can lean on real types and a unique key. A couple of small fixes, defined once as virtual datasets in SQL Lab, cover most of what an uploaded file needs.\nThe first is casting columns to their proper types so that, for instance, a date behaves like a date in time-series charts and filters:\n1 2 SELECT date::date AS date, * FROM analytics.sample_transactions; The second is giving each row a stable identifier. Superset\u0026rsquo;s table views and certain chart types are much happier when a dataset has a unique key, and an uploaded CSV rarely comes with one:\n1 2 3 4 SELECT row_number() OVER () AS id, * FROM analytics.sample_transactions; Note the empty OVER () — row_number() is a window function, and PostgreSQL will reject it without an OVER clause even when you want no partitioning or ordering at all.\nVisualisation Notes The one conceptual hurdle worth flagging is aggregation grain. Superset charts can behave in surprising ways if the data is aggregated too early, because some visualisations need the underlying distribution rather than a summary statistic.\nBox plots are the clearest example: they need the raw values to compute quartiles and whiskers, so feeding them an AVG(amount) collapses the very spread they are meant to show. 
The fix is to be deliberate about grain: keep the dataset at row level where the chart needs the distribution, and aggregate to a controlled grain such as per day only where a summary is genuinely what you want. The surrogate key from the previous section helps here, since it gives Superset a row-level handle to work with.\nUsers and Permissions Users are created through the UI under Settings → List Users. For the people sharing this instance I assign two built-in roles rather than the catch-all Admin:\nAlpha: full access to datasets, charts, and dashboards, but without the ability to manage other users or change global settings.\nsql_lab: access to SQL Lab so they can write the kind of casting and keying queries described above.\nTogether these let collaborators create datasets, explore with SQL, and build dashboards, which is everything the workflow needs. Admin stays reserved for me: it can edit security, database connections, and other users, which is more rope than a casual user should be handed on a shared machine.\nBackups Because everything (metadata, dashboards, and uploaded data) lives in one PostgreSQL database, a single nightly pg_dump is enough to capture the whole system. A line in cron handles it:\nsudo crontab -e\n0 2 * * * sudo -u postgres pg_dump superset \u0026gt; /home/jan/backups/superset_$(date +\\%F).sql\nThe %F is escaped because cron treats a bare % as a newline. It is worth confirming the backups directory exists and, every so often, that the dumps actually restore; an untested backup is only a hopeful one.\nObservations A few things stood out by the time the setup had settled into daily use:\nSuperset is powerful but firmly UI-driven, and SQL Lab is the escape hatch that makes it workable; most of the real shaping of data happened there. PostgreSQL permissions were by far the most common failure point, and the schema-level grants in particular caused the only genuinely confusing error of the whole exercise. 
Charts reward an understanding of aggregation behaviour; getting the grain right up front saved more time than any other single decision. Systemd and Gunicorn together turned a fragile, terminal-bound process into something stable enough that I stopped thinking about it. What I Did Not Do It is worth being explicit about what I left out, since the omissions are choices rather than oversights:\nDocker — a single laptop with one tenant does not need containerisation, and a virtual environment is easier to inspect and tear down. A reverse proxy (nginx) — there is nothing in front of Gunicorn to terminate or route, because nothing needs to be. HTTPS — on a trusted LAN with no external exposure, certificate management would be effort spent on a threat that is not present here. External exposure — the firewall rule above is precisely so the service never leaves the local network. Each of these would matter for a production deployment. For a home setup they add complexity and maintenance for little real benefit, and the discipline is in knowing which corners are safe to cut.\nSummary What I took away from the process is how little it takes to stand up something genuinely useful. On a single Debian laptop the result gives a small group multi-user access, self-service CSV ingestion, SQL exploration, and dashboard building — the whole feedback loop that used to slow this kind of work down, now handled by the people closest to the data.\nIt is not production-grade, and it does not try to be. As a lightweight environment for experimentation, personal analytics, or small shared use, though, it does everything I needed and nothing I did not. 
As with most tooling setups, the goal was never completeness but sufficiency — and on that measure it has more than earned its place.\n","permalink":"https://www.thefinalartefact.xyz/post/building-apache-superset-home-setup/","summary":"\u003cp\u003e\u003ca href=\"/post/building-apache-superset-home-setup/images/apache-superset-chart-editor.png\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n  \u003cimg alt=\"Apache Superset chart editor showing a stacked area chart for synthetic transaction data\" loading=\"lazy\" src=\"/post/building-apache-superset-home-setup/images/apache-superset-chart-editor.png\"\u003e\n\u003c/a\u003e\u003c/p\u003e\n\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eI was recently asked to help, pro bono, with analysing some life science data. From the outset I expected a steady stream of feedback along the lines of \u003cem\u003ecan you show this like that\u003c/em\u003e, and that prospect pulled me back to my early career days working as a researcher and analyst for various outfits. What those years taught me is that the bottleneck is rarely the analysis itself; it is the loop of sharing a result, hearing how someone would prefer to see it, and turning that around quickly. A small, self-service tool that lets people poke at the data themselves removes most of that friction.\u003c/p\u003e","title":"Building Apache Superset Home Setup"},{"content":"After a small change in my son\u0026rsquo;s schedule, I wanted to keep everything consistent while making it easy to share. That kicked off a small calendar-planning tool with a few requirements: a clean, printable calendar view, no paid app or subscription for a one-off need, and a way to subscribe or export the calendar so it stays visible across all of my devices.\nBackground I first looked at Apple Calendar, but its printing options are limited and a presentable, readable calendar was non-negotiable. 
Fantastical would cover everything I needed, yet I was not keen to buy a license for a one-off need. I also considered using something like InDesign with scripting to build a polished calendar, but that felt like overkill and would have required learning additional tooling and writing a fairly complex calendar builder. Google Calendar had similar printing limitations. Given all of that, building a small software package felt like the right choice: I could script the visuals and color coding, export to ICS, and control schedule changes via versioned JSON files.\nSince this was a fairly informal, one-off event, I decided to vibe-code it with Codex (see-no-evil icon).\nWhat is Codex Codex is OpenAI\u0026rsquo;s coding agent that works inside the terminal, taking natural language requests and then reading and editing files or running commands to complete tasks. In practice it feels like a pair programmer with tool access, so I can describe the outcome and let it stitch together code and assets quickly. By comparison, Crush is a terminal app from Charmbracelet that wires your tools, code, and workflows into the LLM of your choice, with multi-model support, session context, LSP-backed signals, and MCP extensibility. I\u0026rsquo;ve been using Crush a lot and I\u0026rsquo;ve been happy with how it handles small-to-medium complexity projects, mostly for macOS and iOS hobby app development. For this project, I was looking for an opportunity to give Codex a proper ride. Codex is more of a focused, guided workflow, while Crush is a configurable TUI that you can point at whichever models you prefer.\nDevelopment For testing and experimentation, I decided to make the development fully vibe-coded. I only occasionally opened files in Neovim to check the changes, but since the project was informal, I wanted to take the vibe-coding approach all the way.\nPrompt-led milestones I started with a small set of prompts to define the foundations and basic principles for the tool. 
A few of the early prompts (cleaned up for readability) were:\n\u0026ldquo;Create a ReportLab-based script to print a calendar to PDF.\u0026rdquo;\n\u0026ldquo;This project is uv-managed, so add dependencies with uv add. Split calendar rendering into its own module and use Click for the CLI.\u0026rdquo;\n\u0026ldquo;Default the output path to ./out/\u0026lt;calendar-YYYY-MM.pdf\u0026gt; and create out/ if it is missing.\u0026rdquo;\n\u0026ldquo;Parse JSON events with weekly recurrence and exceptions, and render them clearly in the PDF.\u0026rdquo;\nPrompt-led refinements Once the basics worked, I used follow-on prompts to push the tool toward the real-world workflow I needed:\n\u0026ldquo;Add blocks that shade calendar cells with colors and a legend defined in JSON.\u0026rdquo;\n\u0026ldquo;Fix the missing Sunday in alternating weekend blocks by anchoring recurrence to the configured week start, and make Monday the default.\u0026rdquo;\n\u0026ldquo;Support month ranges like 1-3 and render a single multi-page PDF, with page numbers for multi-month output.\u0026rdquo;\n\u0026ldquo;Add header and footer strings, bold event labels, and allow custom fonts for readability on screens.\u0026rdquo;\n\u0026ldquo;Export a consolidated ICS file alongside the PDF, with \u0026ndash;out-ics and \u0026ndash;no-ics switches.\u0026rdquo;\nProblem The schedule problem was not just \u0026ldquo;put events in a calendar.\u0026rdquo; It was an alternating-week routine with exceptions and handoffs that needed to stay consistent, and the usual calendar apps are fine for reminders but not great at producing a clean, printable, month-at-a-glance plan with blocks of time clearly marked and a legend that other people can read without the app.\nI needed a calendar that could do three things at once: show the larger weekly cadence (A/B weeks), include specific timed events, and be exportable so I could subscribe to it on my phone. 
Printing was the non-negotiable constraint, so I wanted a PDF that looked like something I\u0026rsquo;d actually tape on the fridge.\nSolution I built a tiny Python CLI called calendar-planning that takes a JSON file as the source of truth. The JSON holds two concepts:\nevents: timed or all-day items, optionally recurring weekly with exceptions. blocks: colored date ranges that define the week patterns (e.g., Schedule A vs Schedule B). From that single file, the tool generates a PDF calendar (single month or a multi-month PDF) and an optional .ics file for calendar subscriptions. Updating the schedule is just editing JSON and re-running the command. The code is published at github.com/konradzdeb/codex-calendar-planning.\nI ran it via uv run calendar-planning ... while the project was uv-managed. The default output is out/calendar-YYYY-MM.pdf, and the .ics file is written next to the PDF unless you pass --no-ics or override with --out-ics.\nExample run (month range, with a header, and a Monday-start week):\n1 2 3 4 5 6 uv run calendar-planning 2025 1-3 \\ --events data/schedule.json \\ --page-size A4 \\ --week-start MON \\ --header \u0026#34;2025 Schedule\u0026#34; \\ --footer \u0026#34;Generated from JSON\u0026#34; Example input structure:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 { \u0026#34;events\u0026#34;: [ { \u0026#34;title\u0026#34;: \u0026#34;Basketball practice\u0026#34;, \u0026#34;start\u0026#34;: \u0026#34;2025-01-07 18:00\u0026#34;, \u0026#34;duration\u0026#34;: { \u0026#34;hours\u0026#34;: 1, \u0026#34;minutes\u0026#34;: 30 }, \u0026#34;recurrence\u0026#34;: { \u0026#34;frequency\u0026#34;: \u0026#34;weekly\u0026#34;, \u0026#34;interval\u0026#34;: 2, \u0026#34;byweekday\u0026#34;: [\u0026#34;TUE\u0026#34;], \u0026#34;until\u0026#34;: \u0026#34;2025-04-30\u0026#34;, \u0026#34;except\u0026#34;: [\u0026#34;2025-02-11\u0026#34;] } } ], \u0026#34;blocks\u0026#34;: [ { \u0026#34;title\u0026#34;: \u0026#34;Schedule 
A\u0026#34;, \u0026#34;color\u0026#34;: \u0026#34;#E8F2FF\u0026#34;, \u0026#34;start_date\u0026#34;: \u0026#34;2025-01-06\u0026#34;, \u0026#34;end_date\u0026#34;: \u0026#34;2025-03-31\u0026#34;, \u0026#34;recurrence\u0026#34;: { \u0026#34;frequency\u0026#34;: \u0026#34;weekly\u0026#34;, \u0026#34;interval\u0026#34;: 2, \u0026#34;byweekday\u0026#34;: [\u0026#34;MON\u0026#34;, \u0026#34;TUE\u0026#34;, \u0026#34;WED\u0026#34;, \u0026#34;THU\u0026#34;] } } ] } Codex stitched the CLI and PDF layout together quickly, which meant I could stay focused on the data model and the printed layout instead of wiring up a GUI. That was the whole goal: a small, disposable tool that does one job well.\nImplementation notes The tool is deliberately small: Click handles the CLI, ReportLab handles PDF rendering, and a thin JSON parser expands recurrences into concrete dates for the months you ask for. Events are displayed inside each day cell; blocks are drawn as a soft background tint with a legend.\nEvent durations are optional. A timed event without a duration defaults to 15 minutes, and the same durations flow into the ICS export as explicit end times. Bi-weekly rules are just interval: 2.\nThe block logic is what makes the two-week pattern readable. It anchors each recurring block to a week start so an every-other-week pattern stays aligned even when a month begins mid-week. I hit an edge case where Sunday landed in the wrong \u0026ldquo;week bucket\u0026rdquo;; anchoring to the configured week start fixed it.\n1 2 3 4 5 6 7 8 9 10 11 def block_dates_for_month(block, year, month, week_start): # Anchor to the first valid week so bi-weekly rules stay consistent. anchor_week_start = ( block_anchor_week_start(block.start_date, rule.byweekday, week_start) if rule is not None else week_start_date(block.start_date, week_start) ) ... 
weeks_since_anchor = (current_date - anchor_week_start).days // 7 if weeks_since_anchor % rule.interval != 0: continue # Skip off-weeks for alternating schedules.\nPDF rendering stays intentionally conservative: draw the grid, shade background blocks, then place text. When the day cell gets crowded, the layout collapses the overflow into a summary line to keep the page readable.\ndef build_event_lines(pdf, occurrences, max_lines, max_width, font_name, font_size):\n    # Render as many wrapped lines as fit, then add a \u0026#34;+N more\u0026#34; summary.\n    lines = []\n    shown = 0\n    for occurrence in occurrences:\n        wrapped = wrap_text(pdf, occurrence.label(), max_width, font_name, font_size)\n        if len(lines) + len(wrapped) \u0026gt; max_lines:\n            break\n        lines.extend(wrapped)\n        shown += 1\n    remaining = len(occurrences) - shown\n    if remaining \u0026gt; 0:\n        summary = f\u0026#34;+{remaining} more\u0026#34;\n        if len(lines) \u0026lt; max_lines:\n            lines.append(summary)  # Room left: add the summary line.\n        elif lines:\n            lines[-1] = summary  # Cell full: reuse the last line.\n    return lines\nThe ICS export is also simple on purpose: it expands only the months you request and writes one VEVENT per occurrence. That keeps the calendar export bounded and avoids the surprises that come with complex recurrence rules.\nResults The tool produced exactly what I wanted: a clean A4 PDF with clearly colored week blocks and legible event labels, plus an .ics file I could subscribe to on my devices. It turned the schedule into a small data file I could edit in seconds, re-render, and share.\nHere is a sample output generated from the sample_events.json fixture:\nThe main lesson was that for one-off planning tasks, a small CLI and a good data model are better than wrestling a heavyweight calendar app. 
I did not need perfect recurrence support or time zone handling; I needed a printable artifact and a repeatable process, and this delivered both with minimal overhead.\nRepository with the code: github.com/konradzdeb/codex-calendar-planning.\n","permalink":"https://www.thefinalartefact.xyz/post/codex-calendar-app/","summary":"\u003cp\u003eAfter a small change in my son\u0026rsquo;s schedule, I wanted to keep everything consistent while making it easy to share. That kicked off a small calendar-planning tool with a few requirements: a clean, printable calendar view, no paid app or subscription for a one-off need, and a way to subscribe or export the calendar so it stays visible across all of my devices.\u003c/p\u003e\n\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eI first looked at Apple Calendar, but its printing options are limited and a presentable, readable calendar was non-negotiable. Fantastical would cover everything I needed, yet I was not keen to buy a license for a one-off need. I also considered using something like InDesign with scripting to build a polished calendar, but that felt like overkill and would have required learning additional tooling and writing a fairly complex calendar builder. Google Calendar had similar printing limitations. Given all of that, building a small software package felt like the right choice: I could script the visuals and color coding, export to ICS, and control schedule changes via versioned JSON files.\u003c/p\u003e","title":"Codex Calendar Planning Tool"},{"content":"This article shows how to use Xcode scheme pre-/post-actions (Build and Run) to observe how your app creates and mutates defaults, by logging changes outside the app process.\nWhat are Defaults? In Swift and macOS development, defaults (via UserDefaults and the defaults CLI) are the lightweight persistence layer for user preferences, feature toggles, and other small pieces of state that need to survive app relaunches. 
They sit between in-memory settings and heavier storage options, letting you read and write simple values keyed by domain so the same code works in app code, Xcode schemes, and shell scripts. Because defaults are global to a domain, careful naming and clean-up are essential to avoid collisions and stale settings during development.\nFor a deeper dive, Fatbobman’s “UserDefaults and Observation in SwiftUI” is a solid blueprint: Xu Yang shows why Observation alone misses external changes, then patches the gap with a @ObservableDefaults macro that keeps SwiftUI views in sync with UserDefaults regardless of where writes originate. It’s an excellent, disciplined approach that centralizes keys, responds to external mutations, and favors lightweight persistence over ephemeral state. I use this approach for the rest of the article.\nChallenge When building a macOS application that stores defaults, I wanted to see what the app writes to the defaults store—specifically, what actually ends up on disk.\nSolution My preferred solution is to create a dedicated Xcode scheme that I use to track the defaults lifecycle for a single run. I created a basic macOS application and a set of defaults following guidance in “UserDefaults and Observation in SwiftUI”. First, I define a set of defaults that hold a few inputs from the application via an ObservableObject.\nThe App Structure 1 2 3 4 5 6 7 8 import SwiftUI import Combine class Defaults: ObservableObject { @AppStorage(\u0026#34;custom_toggle\u0026#34;, store: .group) var customToggle: Bool = false @AppStorage(\u0026#34;custom_integer\u0026#34;, store: .group) var customInteger: Int = 1 @AppStorage(\u0026#34;custom_string\u0026#34;, store: .group) var customString: String = \u0026#34;abc\u0026#34; } Adding a .group store is good practice: it enables convenient sharing of default values across targets (and extensions). 
I define my shared UserDefaults suite like this:\nimport Foundation\nextension UserDefaults {\n    static let group: UserDefaults = {\n        let staticString: String = \u0026#34;group.kz.public.swift.examples.DefaultsExperiment\u0026#34;\n        if let teamIdentifierPrefix = Bundle.main.object(forInfoDictionaryKey: \u0026#34;Team Identifier Prefix\u0026#34;) as? String {\n            return UserDefaults(\n                suiteName: teamIdentifierPrefix + staticString\n            ) ?? UserDefaults.standard\n        } else {\n            return UserDefaults(suiteName: staticString) ?? UserDefaults.standard\n        }\n    }()\n}\nWith those basics in place, I add a simple SwiftUI view so I can modify the stored preferences.\nimport SwiftUI\nstruct ContentView: View {\n    @StateObject var defaults = Defaults()\n    var body: some View {\n        VStack(alignment: .leading, spacing: 16) {\n            Toggle(\u0026#34;Custom Toggle: \\(defaults.customToggle ? \u0026#34;On\u0026#34; : \u0026#34;Off\u0026#34;)\u0026#34;, isOn: defaults.$customToggle)\n            Stepper(\u0026#34;Custom Integer: \\(defaults.customInteger)\u0026#34;, value: defaults.$customInteger)\n            VStack(alignment: .leading, spacing: 4) {\n                Text(\u0026#34;Custom String:\u0026#34;)\n                TextField(\u0026#34;abc\u0026#34;, text: defaults.$customString)\n            }\n        }\n        .padding()\n    }\n}\nObserving the Observable Defaults Now the key element: how do we conveniently observe changes to defaults? In Xcode you can use standard approaches like logging (via os.Logger / OSLog) or breakpoints. However, Apple’s implementation deliberately decouples mutation from persistence. In practice, writes go through an in-memory cache and persistence to disk is deferred and opportunistic. 
A write to disk usually happens under one of the following conditions:\nRun loop idle: the process is idle and the cache flushes.\nGraceful app termination (Cmd + Q): on a normal quit, pending preferences are written to disk.\nApp lifecycle transitions (background/foreground).\nMemory pressure: the cache is flushed to reclaim memory.\nPeriodic internal timers (undocumented heuristics).\nGiven that, I wanted to observe changes to the actual file independently, without relying on logging or breakpoints. A convenient solution is to use Xcode’s pre-run and post-run actions. These can be configured in your Xcode scheme via Product \u0026gt; Scheme \u0026gt; Edit Scheme. To achieve the desired outcome, configure one pre-run and one post-run action.\nPre-run Action Most of the work for the observation pipeline happens in the pre-run action. First, create a few variables and ensure the two tools you need are available. The fswatch path below is Homebrew\u0026rsquo;s default on Apple Silicon; adjust it if yours lives elsewhere:\nset -euo pipefail\nFSWATCH=\u0026#34;/opt/homebrew/bin/fswatch\u0026#34;\nPLUTIL=\u0026#34;/usr/bin/plutil\u0026#34;\ntest -x \u0026#34;${FSWATCH}\u0026#34; || { echo \u0026#34;fswatch not found\u0026#34;; exit 1; }\ntest -x \u0026#34;${PLUTIL}\u0026#34; || { echo \u0026#34;plutil not found\u0026#34;; exit 1; }\nNote: Xcode can continue the build even if a pre-action fails. If you need failures to be hard-stops, move the checks into a build phase.\nNext, define a few paths so the action is easy to compose. 
In practice, we want the preference plist path plus some log/snapshot locations:\n1 2 3 4 5 6 7 8 9 APP_GROUP_ID=\u0026#34;group.kz.public.swift.examples.DefaultsExperiment\u0026#34; GROUP_CONTAINER=\u0026#34;$HOME/Library/Group Containers/${APP_GROUP_ID}\u0026#34; PLIST=\u0026#34;${GROUP_CONTAINER}/Library/Preferences/${APP_GROUP_ID}.plist\u0026#34; LOG_DIR=\u0026#34;${SRCROOT}/.xcode-logs\u0026#34; mkdir -p \u0026#34;$LOG_DIR\u0026#34; LOG_FILE=\u0026#34;${LOG_DIR}/defaults-watch.log\u0026#34; PID_FILE=\u0026#34;${LOG_DIR}/defaults-watch.pid\u0026#34; SNAPSHOT=\u0026#34;${LOG_DIR}/defaults-current.xml\u0026#34; PREV=\u0026#34;${LOG_DIR}/defaults-prev.xml\u0026#34; These variables have the following responsibilities:\nAPP_GROUP_ID, GROUP_CONTAINER, and PLIST build the path to the binary plist. LOG_DIR and LOG_FILE keep a timestamped record of diffs between plist revisions. PID_FILE stores the background watcher PID so the post-run action can stop it. SNAPSHOT and PREV store XML snapshots of the current and previous plist for diffing. The final part of the script spins up an fswatch loop that runs in the background and listens for filesystem events on the plist itself. fswatch --latency 0.2 coalesces writes so the loop gets one line per change; each line becomes a trigger that rotates the previous XML snapshot, re-renders the binary plist to XML via plutil, and writes both a timestamp and a unified diff to the log so you see exactly what changed. 
The subshell is backgrounded and its PID is written out so the post-run action can kill the watcher cleanly when the scheme finishes.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 echo \u0026#34;=== defaults watch start: $(date -Iseconds) plist=${PLIST} ===\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; # Initialise snapshot if file exists if [[ -f \u0026#34;$PLIST\u0026#34; ]]; then \u0026#34;$PLUTIL\u0026#34; -convert xml1 -o \u0026#34;$SNAPSHOT\u0026#34; \u0026#34;$PLIST\u0026#34; fi ( \u0026#34;$FSWATCH\u0026#34; --latency 0.2 \u0026#34;$PLIST\u0026#34; | while read -r _; do ts=\u0026#34;$(date -Iseconds)\u0026#34; # Rotate snapshots [[ -f \u0026#34;$SNAPSHOT\u0026#34; ]] \u0026amp;\u0026amp; mv \u0026#34;$SNAPSHOT\u0026#34; \u0026#34;$PREV\u0026#34; \u0026#34;$PLUTIL\u0026#34; -convert xml1 -o \u0026#34;$SNAPSHOT\u0026#34; \u0026#34;$PLIST\u0026#34; echo \u0026#34;[$ts] plist changed\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; if [[ -f \u0026#34;$PREV\u0026#34; ]]; then echo \u0026#34;---- diff begin ----\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; diff -u \u0026#34;$PREV\u0026#34; \u0026#34;$SNAPSHOT\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; || true echo \u0026#34;---- diff end ----\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; else echo \u0026#34;(initial snapshot)\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; fi done ) \u0026amp; echo $! 
\u0026gt; \u0026#34;$PID_FILE\u0026#34; The full pre-run action setup looks as follows:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 set -euo pipefail FSWATCH=\u0026#34;/opt/homebrew/bin/fswatch\u0026#34; PLUTIL=\u0026#34;/usr/bin/plutil\u0026#34; test -x \u0026#34;${FSWATCH}\u0026#34; || { echo \u0026#34;fswatch not found\u0026#34;; exit 1; } test -x \u0026#34;${PLUTIL}\u0026#34; || { echo \u0026#34;plutil not found\u0026#34; ; exit 1; } APP_GROUP_ID=\u0026#34;group.kz.public.swift.examples.DefaultsExperiment\u0026#34; GROUP_CONTAINER=\u0026#34;$HOME/Library/Group Containers/${APP_GROUP_ID}\u0026#34; PLIST=\u0026#34;${GROUP_CONTAINER}/Library/Preferences/${APP_GROUP_ID}.plist\u0026#34; LOG_DIR=\u0026#34;${SRCROOT}/.xcode-logs\u0026#34; mkdir -p \u0026#34;$LOG_DIR\u0026#34; LOG_FILE=\u0026#34;${LOG_DIR}/defaults-watch.log\u0026#34; PID_FILE=\u0026#34;${LOG_DIR}/defaults-watch.pid\u0026#34; SNAPSHOT=\u0026#34;${LOG_DIR}/defaults-current.xml\u0026#34; PREV=\u0026#34;${LOG_DIR}/defaults-prev.xml\u0026#34; echo \u0026#34;=== defaults watch start: $(date -Iseconds) plist=${PLIST} ===\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; # Initialise snapshot if file exists if [[ -f \u0026#34;$PLIST\u0026#34; ]]; then \u0026#34;$PLUTIL\u0026#34; -convert xml1 -o \u0026#34;$SNAPSHOT\u0026#34; \u0026#34;$PLIST\u0026#34; fi ( \u0026#34;$FSWATCH\u0026#34; --latency 0.2 \u0026#34;$PLIST\u0026#34; | while read -r _; do ts=\u0026#34;$(date -Iseconds)\u0026#34; # Rotate snapshots [[ -f \u0026#34;$SNAPSHOT\u0026#34; ]] \u0026amp;\u0026amp; mv \u0026#34;$SNAPSHOT\u0026#34; \u0026#34;$PREV\u0026#34; \u0026#34;$PLUTIL\u0026#34; -convert xml1 -o \u0026#34;$SNAPSHOT\u0026#34; \u0026#34;$PLIST\u0026#34; echo \u0026#34;[$ts] plist changed\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; if [[ -f \u0026#34;$PREV\u0026#34; ]]; then echo \u0026#34;---- diff begin ----\u0026#34; 
\u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; diff -u \u0026#34;$PREV\u0026#34; \u0026#34;$SNAPSHOT\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; || true echo \u0026#34;---- diff end ----\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; else echo \u0026#34;(initial snapshot)\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; fi done ) \u0026amp; echo $! \u0026gt; \u0026#34;$PID_FILE\u0026#34; Post-run Action After the run completes, the post-run action uses the PID captured in defaults-watch.pid to shut down the background watcher started in the pre-run step. The script checks that the PID file exists, verifies the process is still alive, then sends a termination signal (and attempts to terminate any child process) so fswatch doesn’t keep running after Xcode stops the app. Finally, it removes the PID file to avoid leaking stale state into the next scheme run.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 set -euo pipefail LOG_DIR=\u0026#34;${SRCROOT}/.xcode-logs\u0026#34; LOG_FILE=\u0026#34;${LOG_DIR}/defaults-watch.log\u0026#34; PID_FILE=\u0026#34;${LOG_DIR}/defaults-watch.pid\u0026#34; if [[ -f \u0026#34;$PID_FILE\u0026#34; ]]; then pid=\u0026#34;$(cat \u0026#34;$PID_FILE\u0026#34;)\u0026#34; if kill -0 \u0026#34;$pid\u0026#34; 2\u0026gt;/dev/null; then echo \u0026#34;=== defaults watch stop: $(date -Iseconds) pid=${pid} ===\u0026#34; \u0026gt;\u0026gt; \u0026#34;$LOG_FILE\u0026#34; kill \u0026#34;$pid\u0026#34; 2\u0026gt;/dev/null || true pkill -P \u0026#34;$pid\u0026#34; 2\u0026gt;/dev/null || true fi rm -f \u0026#34;$PID_FILE\u0026#34; fi Summary By wiring a pre-run action that snapshots your app group’s preferences plist and starts an fswatch loop, and a post-run action that stops it, you can capture real on-disk UserDefaults mutations during a single Xcode run. 
The script converts the binary plist to XML with plutil and appends a timestamped unified diff to a log, giving a lightweight, debugger-free way to see exactly which keys your app writes and when they eventually flush to disk.\n","permalink":"https://www.thefinalartefact.xyz/post/build-pre-post-actions-observe-default/","summary":"\u003cp\u003eThis article shows how to use Xcode scheme pre-/post-actions (Build and Run) to observe how your app creates and mutates defaults, by logging changes outside the app process.\u003c/p\u003e\n\u003ch2 id=\"what-are-defaults\"\u003eWhat are Defaults?\u003c/h2\u003e\n\u003cp\u003eIn Swift and macOS development, defaults (via \u003ccode\u003eUserDefaults\u003c/code\u003e and the \u003ccode\u003edefaults\u003c/code\u003e CLI) are the lightweight persistence layer for user preferences, feature toggles, and other small pieces of state that need to survive app relaunches. They sit between in-memory settings and heavier storage options, letting you read and write simple values keyed by domain so the same code works in app code, Xcode schemes, and shell scripts. Because defaults are global to a domain, careful naming and clean-up are essential to avoid collisions and stale settings during development.\u003c/p\u003e","title":"Using Xcode Pre- and Post-actions to Observe Changes to Defaults"},{"content":"As a data science lead, I see it as my responsibility to guide junior data scientists on training and professional development. The field is broad, and I often see two common profiles: those with strong mathematical and statistical foundations but limited software engineering experience (e.g., object-oriented programming, unit testing, CI/CD), and those with solid computer science backgrounds but less exposure to the mathematical side. To make informed recommendations, I regularly complete courses and exercises myself and make a habit of daily practice. 
In this post, I review DataCamp’s Machine Learning Engineering course, which I recently completed.\nCourse The course is organised into theoretical and practical modules, plus independent projects. It starts with core ML in scikit-learn—classification, regression, model fine-tuning, and preprocessing with pipelines—then moves into MLOps: taking models from notebooks to production (introduction, design and development, deployment, and maintenance) and a modern lifecycle view (develop for deployment, deploy and run, monitor and maintain). The practical modules focus on specific tools and workflows—for example, experiment tracking and governance with MLflow (tracking, models, registry, projects), data versioning and pipeline reproducibility with DVC, data quality with Great Expectations, and ETL-focused data pipelines (from basics through advanced techniques and operations). Platform skills are covered through the Unix command line (files and data manipulation, combining tools, batch processing, creating simple utilities), Docker (running containers, writing images, securing images), and CI/CD for ML with GitHub Actions and DVC (YAML basics, actions, CI for training runs, and comparing runs/hyperparameter tuning). Two short projects tie the content together: a supervised learning project in agriculture and a time-series regression project forecasting London temperatures.\nThings I Liked The theoretical content is, in my view, the track’s strongest asset. Modules on model management and the ML lifecycle are grounded in established engineering practice. The Monitoring Machine Learning Concepts module explains the challenges of monitoring models in production, including data and concept drift, and outlines methods to address model degradation. I also appreciated how the lifecycle-focused MLOps modules (from “MLOps in a Nutshell” through “Monitor and Maintain”) map cleanly to real-world phases. 
The MLflow and DVC coverage reinforces reproducibility and governance, which are recurring pain points on production teams.\nMixed Views I have mixed views on the practical components. Some modules—such as those on using MLflow for model deployment and DVC for data version control—are well placed and genuinely useful. Others are more debatable. For example, ETL/ELT is demonstrated using Pandas. While Pandas is widely used and helpful for many data engineering tasks, relying on it alone is often insufficient for production-grade pipelines; the ETL course does a good job of introducing concepts and advanced techniques, but it would benefit from more emphasis on orchestration, dependency management, and backfills. Similarly, the CI/CD module with GitHub Actions is a solid introduction, though additional depth on environments, secrets, and release gating would make it more directly applicable to enterprise setups.\nThings I Didn\u0026rsquo;t Like Two modules could be stronger. The Introduction to Shell module progresses from simple commands (e.g., cp, grep) to basic scripts. The idea is sound, but the delivery feels uneven: it emphasises memorising commands over practical workflows such as piping, redirection, error handling, and reproducible scripting. A greater focus on real-world tasks—data inspection, batch processing, and automation—would make it more effective. The ETL/ELT module also feels underpowered for production use: centring pipelines on Pandas limits scalability and reliability; more on orchestration, data contracts, schema evolution, and recovery patterns would improve its relevance.\nVerdict Overall, the track explains the core theoretical concepts well and reinforces them with practical—if basic—examples. The end-to-end flow from scikit-learn fundamentals through MLOps, deployment, and monitoring is coherent, so learners can grasp how the pieces fit together. 
That said, a few modules feel light on depth: the Introduction to Shell is very basic and prioritises command recall over real workflows, and the ETL content centred on Pandas is not representative of typical production pipelines, which are usually far more complex, integrate multiple systems, and often rely on distributed engines such as Spark. Treated as an introduction to ideas and tooling, the track is a solid starting point; plan to supplement the “light” modules with more production-focused resources.\n","permalink":"https://www.thefinalartefact.xyz/post/review-mlend/","summary":"\u003cp\u003eAs a data science lead, I see it as my responsibility to guide junior data scientists on training and professional development. The field is broad, and I often see two common profiles: those with strong mathematical and statistical foundations but limited software engineering experience (e.g., object-oriented programming, unit testing, CI/CD), and those with solid computer science backgrounds but less exposure to the mathematical side. To make informed recommendations, I regularly complete courses and exercises myself and make a habit of daily practice. In this post, I review DataCamp’s Machine Learning Engineering course, which I recently completed.\u003c/p\u003e","title":"Review of the Machine Learning Engineer Datacamp Course"},{"content":" Integrating Python-based machine learning models into iOS applications can be challenging, particularly when converting models into a Swift-compatible format. This example will demonstrate a simple image classification task using the Fashion-MNIST dataset and CoreML conversion tools. The goal is to illustrate the effort required to deploy small-to-medium complexity ML models within iOS applications. 
The demonstration is based on a Convolutional Neural Network (CNN) built with PyTorch, but the concepts apply broadly to other Python-based models as well.\nModel Development For demonstration purposes, we\u0026rsquo;ll create a basic machine learning model in Python. To classify images, I\u0026rsquo;ll build a simple Convolutional Neural Network (CNN) using PyTorch. The model will be trained on the Fashion-MNIST dataset, comprising 70,000 grayscale images of fashion items in 10 categories. We\u0026rsquo;ll begin by importing a standard set of Python packages required for model development.\n\u0026#34;\u0026#34;\u0026#34;Train a classifier on Fashion-MNIST and print performance summary.\u0026#34;\u0026#34;\u0026#34; import torch import torch.mps import torch.nn as nn import torch.optim as optim from torch.utils.data import DataLoader import coremltools as ct from sklearn.metrics import classification_report from torchvision import datasets, transforms The model takes a fairly unsophisticated approach to the image classification task. Naturally, in a production setting you will want a more sophisticated solution, one that handles complex data and scenarios involving distorted images (low lighting, different angles, etc.). The provided CNN implementation is fairly basic but sufficient for the purpose of this demonstration. It consists of two convolutional layers, followed by fully connected layers, and uses ReLU activation functions. 
The model is trained using the Adam optimizer and cross-entropy loss function.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 root=\u0026#34;./data\u0026#34;, train=True, download=True, transform=transform, ) test_set = datasets.FashionMNIST( root=\u0026#34;./data\u0026#34;, train=False, download=True, transform=transform, ) ## Use the class labels from the dataset FASHION_LABELS = train_set.classes class SimpleCNN(nn.Module): def __init__(self): super().__init__() self.conv = nn.Sequential( nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2) ) self.fc = nn.Sequential( nn.Flatten(), nn.Linear(64 * 7 * 7, 128), nn.ReLU(), nn.Linear(128, 10) ) def forward(self, x): x = self.conv(x) return self.fc(x) train_loader = DataLoader(train_set, batch_size=64, shuffle=True) val_loader = DataLoader(test_set, batch_size=64, shuffle=False) device = torch.device(\u0026#34;mps\u0026#34;) model = SimpleCNN().to(device) criterion = nn.CrossEntropyLoss() optimizer = optim.Adam(model.parameters(), lr=0.001) for epoch in range(5): model.train() for images, labels in train_loader: images, labels = images.to(device), labels.to(device) optimizer.zero_grad() outputs = model(images) loss = criterion(outputs, labels) loss.backward() optimizer.step() model.eval() all_preds = [] all_labels = [] with torch.no_grad(): for images, labels in val_loader: images = images.to(device) outputs = model(images) preds = outputs.argmax(dim=1).cpu().numpy() all_preds.extend(preds) all_labels.extend(labels.numpy()) print(classification_report(all_labels, all_preds, target_names=FASHION_LABELS)) Additional Testing In addition to evaluating model performance, we\u0026rsquo;ll also test its ability to handle images provided as flat files. 
The tests will run against several publicly available images.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 \u0026#34;\u0026#34;\u0026#34;Test model on a few sample images.\u0026#34;\u0026#34;\u0026#34; import os import pytest from PIL import Image, ImageOps import coremltools as ct from tabulate import tabulate @pytest.fixture(scope=\u0026#34;module\u0026#34;) def model(): model_path = os.path.join(os.path.dirname(__file__), \u0026#34;../FashionMNISTClassifier.mlpackage\u0026#34;) return ct.models.MLModel(model_path) def preprocess_image(image_path): with Image.open(image_path) as img: if img.mode != \u0026#34;L\u0026#34;: img = img.convert(\u0026#34;L\u0026#34;) img = ImageOps.invert(img) img = img.resize((28, 28)) return img results = [] ## Create parametrized test for different image files @pytest.mark.parametrize(\u0026#34;filename\u0026#34;, [\u0026#34;t-shirt.jpeg\u0026#34;, \u0026#34;pullover.jpg\u0026#34;, \u0026#34;bag.jpeg\u0026#34;]) def test_model_prediction(filename, model): fixtures_dir = os.path.join(os.path.dirname(__file__), \u0026#34;fixtures\u0026#34;) img_path = os.path.join(fixtures_dir, filename) arr = preprocess_image(img_path) input_data = {\u0026#34;image\u0026#34;: arr} expected_label = os.path.splitext(filename)[0] output = model.predict(input_data) predicted_label = str(output[\u0026#34;classLabel\u0026#34;]) match = expected_label.lower() in predicted_label.lower() results.append((filename, expected_label, predicted_label, \u0026#34;✅\u0026#34; if match else \u0026#34;❌\u0026#34;)) assert match, f\u0026#34;{filename}: expected {expected_label}, got {predicted_label}\u0026#34; def pytest_sessionfinish(session, exitstatus): if results: print(\u0026#34;\\n\\nModel Prediction Results:\\n\u0026#34;) print(tabulate(results, headers=[\u0026#34;Filename\u0026#34;, \u0026#34;Expected\u0026#34;, \u0026#34;Predicted\u0026#34;, \u0026#34;Match\u0026#34;])) Converting 
to Core ML A key challenge is converting and integrating the model into the Swift-based iOS application. We\u0026rsquo;ll export the model into the .mlpackage format using available conversion tools. It\u0026rsquo;s critical to ensure our model can correctly handle the required input format—in this case, images—by defining the input_features and output_features.\nProper definition of these objects is crucial when converting models (including scikit-learn) to Core ML format using coremltools. In this example, the input features are defined as input_features = [(\u0026quot;image\u0026quot;, ct.models.datatypes.Array(1, 28, 28))]. This configuration means the Core ML model expects a single-channel (grayscale) image of size 28x28 as input, matching the Fashion-MNIST images. This alignment ensures correct image processing within your iOS application.\nWhy is this important? If input features do not match the expected model shape, conversion will fail, or the resulting Core ML model may not function correctly in your app.\n1 2 3 4 5 6 7 8 9 10 11 classifier_config = ct.ClassifierConfig(class_labels=FASHION_LABELS) mlmodel = ct.convert( traced, inputs=[ct.ImageType(name=\u0026#34;image\u0026#34;, shape=(1, 1, 28, 28), scale=1/255.0, color_layout=ct.colorlayout.GRAYSCALE)], classifier_config=classifier_config ) mlmodel.save(\u0026#34;FashionMNISTClassifier.mlpackage\u0026#34;) print(\u0026#34;Exported CoreML model to FashionMNISTClassifier.mlpackage\u0026#34;) Use in Swift First, we need to import the model into our Xcode project by dragging and dropping the .mlpackage file into the Xcode project navigator. After importing, the model becomes available as a Swift class sharing its .mlpackage file name—FashionMNISTClassifier in this example. 
Inference is performed using the straightforward predict method, with most heavy lifting managed by the FashionMNISTClassifierInput class.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 // Created by Konrad on 30/06/2025. // import CoreML import Foundation import UIKit class ModelViewModel: ObservableObject { @Published var predictedLabel: String = \u0026#34;No prediction yet\u0026#34; private let model: FashionMNISTClassifier init?() { guard let model = try? FashionMNISTClassifier( configuration: .init()) else { return nil } self.model = model } func predict(from image: UIImage) { guard let resized = ImagePreprocessor.preprocess(image) else { predictedLabel = \u0026#34;Preprocessing failed\u0026#34; return } let input = FashionMNISTClassifierInput(image: resized) guard let result = try? model.prediction(input: input) else { predictedLabel = \u0026#34;Prediction failed\u0026#34; return } predictedLabel = result.classLabel } } Image Pre-processing The Swift ImagePreprocessor struct provides a static method to convert a UIImage into a 28×28 grayscale-formatted CVPixelBuffer. The method resizes the image, converts it to grayscale, and produces a pixel buffer ready for inference with Core ML.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 // Created by Konrad on 30/06/2025. // import UIKit import CoreImage import CoreML struct ImagePreprocessor { static func preprocess(_ image: UIImage, size: CGSize = CGSize(width: 28, height: 28)) -\u0026gt; CVPixelBuffer? { let width = Int(size.width) let height = Int(size.height) var pixelBuffer: CVPixelBuffer? 
let attrs = [ kCVPixelBufferCGImageCompatibilityKey: true, kCVPixelBufferCGBitmapContextCompatibilityKey: true ] as CFDictionary let status = CVPixelBufferCreate( kCFAllocatorDefault, width, height, kCVPixelFormatType_OneComponent8, attrs, \u0026amp;pixelBuffer ) guard status == kCVReturnSuccess, let buffer = pixelBuffer else { return nil } CVPixelBufferLockBaseAddress(buffer, .readOnly) guard let context = CGContext( data: CVPixelBufferGetBaseAddress(buffer), width: width, height: height, bitsPerComponent: 8, bytesPerRow: CVPixelBufferGetBytesPerRow(buffer), space: CGColorSpaceCreateDeviceGray(), bitmapInfo: CGImageAlphaInfo.none.rawValue ) else { CVPixelBufferUnlockBaseAddress(buffer, .readOnly) return nil } guard let cgImage = image.cgImage else { CVPixelBufferUnlockBaseAddress(buffer, .readOnly) return nil } context.draw(cgImage, in: CGRect(x: 0, y: 0, width: width, height: height)) CVPixelBufferUnlockBaseAddress(buffer, .readOnly) return buffer } } Final Considerations I\u0026rsquo;ve used a recent version of PyTorch to leverage Metal Performance Shaders (MPS)—Apple’s framework enabling GPU acceleration on Apple Silicon and Intel Macs. Although my chosen PyTorch version wasn\u0026rsquo;t officially tested with Core ML Tools, it functioned without issue. However, for robustness, ensure compatibility between PyTorch and Core ML library versions.\nIf your use case involves image classification, consider exploring Apple’s Vision Foundation Models. These models are optimized for on-device performance and simplify common image classification tasks significantly. Alternatively, if sticking with PyTorch is important, consider using PyTorch Mobile. 
PyTorch Mobile lets you run PyTorch models natively on-device, offering enhanced control with minimal translation between training and inference environments.\nThe entire project, including training scripts, conversion logic, and Swift application code, is available through the GitHub repository: https://github.com/konradzdeb/SwiftPythonML.\n","permalink":"https://www.thefinalartefact.xyz/post/python-models-app/","summary":"\u003cp\u003e\u003ca href=\"/post/python-models-app/images/phonedemo.gif\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n  \u003cimg alt=\"Phone Model Demo\" loading=\"lazy\" src=\"/post/python-models-app/images/phonedemo.gif\"\u003e\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eIntegrating Python-based machine learning models into iOS applications can be challenging, particularly when converting models into a Swift-compatible format. This example will demonstrate a simple image classification task using the Fashion-MNIST dataset and CoreML conversion tools. The goal is to illustrate the effort required to deploy small-to-medium complexity ML models within iOS applications. The demonstration is based on a Convolutional Neural Network (CNN) built with PyTorch, but the concepts apply broadly to other Python-based models as well.\u003c/p\u003e","title":"Bring your Python ML Model to iOS App in under Three Minutes"},{"content":"One of the key benefits of modern editors like NVim, Vim, or Emacs is the rich plugin ecosystem. After years with Vim, I switched to NVim and was immediately impressed by its plugin landscape. The Lazy plugin manager—available for NVim \u0026gt; 0.8—quickly became my favourite. Lazy simplifies plugin discovery and management. It offers an intuitive interface and powerful commands that make it easy to add, remove, or update plugins.\nAuto-Updating Plugins As I added more plugins, Lazy frequently prompted me to update them. I found myself running updates manually almost every session. 
Naturally, I wanted to automate this.\nLazy supports execution of custom commands. Its configuration allows binding actions to various events. One useful event is VeryLazy, but more generally, NVim\u0026rsquo;s autocommand system provides a robust mechanism for triggering code on specific actions. Inspired by a gist by dtr2300, I chose to trigger plugin updates on the FocusLost event. This means NVim will automatically check for updates whenever I switch to another window—an ideal, non-disruptive moment.\nHere\u0026rsquo;s the Lua code I use:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 -- File: $XDG_CONFIG_HOME/nvim/lua/config/autocmds.lua local function augroup(name) return vim.api.nvim_create_augroup(\u0026#34;lazyvim_\u0026#34; .. name, { clear = true }) end vim.api.nvim_create_autocmd(\u0026#34;FocusLost\u0026#34;, { group = augroup(\u0026#34;autoupdate\u0026#34;), callback = function() if require(\u0026#34;lazy.status\u0026#34;).has_updates() then vim.notify(\u0026#34;🔄 Updating plugins...\u0026#34;, vim.log.levels.INFO) require(\u0026#34;lazy\u0026#34;).update({ show = false, -- This callback runs after update is done concurrency = nil, lockfile = nil, wait = true, plugins = nil, pin = nil, exclude = nil, callback = function(updated) if updated and #updated \u0026gt; 0 then vim.notify(\u0026#34;✅ Lazy updated \u0026#34; .. #updated .. \u0026#34; plugin(s).\u0026#34;, vim.log.levels.INFO) else vim.notify(\u0026#34;✅ Lazy check complete. No plugins updated.\u0026#34;, vim.log.levels.INFO) end end, }) end end, }) Note that some plugins may require a manual reload after updating. Lazy.nvim now supports an experimental :Lazy reload plugin_name command, but this is not automatically triggered after updates. Reloading works best with plugins that implement a proper deactivate function or are explicitly marked as safe to reload. 
For others, re-running config functions can cause issues, so use this feature cautiously and check each plugin’s documentation or behavior before relying on it.\nThe configuration lives in $XDG_CONFIG_HOME/nvim/lua/config/autocmds.lua. If $XDG_CONFIG_HOME is unset, it defaults to $HOME/.config.\n","permalink":"https://www.thefinalartefact.xyz/post/nvim-autoupdate/","summary":"\u003cp\u003eOne of the key benefits of modern editors like NVim, Vim, or Emacs is the rich plugin ecosystem. After years with Vim, I switched to NVim and was immediately impressed by its plugin landscape. The \u003ca href=\"https://github.com/folke/lazy.nvim\"\u003eLazy\u003c/a\u003e plugin manager—available for NVim \u0026gt; 0.8—quickly became my favourite. \u003ca href=\"https://github.com/folke/lazy.nvim\"\u003eLazy\u003c/a\u003e simplifies plugin discovery and management. It offers an intuitive interface and powerful commands that make it easy to add, remove, or update plugins.\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"/post/nvim-autoupdate/images/lazy_image.png\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n  \u003cimg alt=\"Lazy Plugin in Actions\" loading=\"lazy\" src=\"/post/nvim-autoupdate/images/lazy_image.png\"\u003e\n\u003c/a\u003e\u003c/p\u003e","title":"Automatically Refreshing NVim plugins"},{"content":" Setting up a robust data science development environment takes time, and it\u0026rsquo;s a process that’s rarely ever finished. If you’re the type who likes to get the most out of your tools, you’ll likely enjoy tweaking, optimising, and layering your workspace with productivity enhancements. That might mean refining your Python setup to easily manage multiple language versions and dependencies, or expanding your text editor with plugins for linting, code suggestions, unit test execution, and CI/CD integration.\nThe only constant is that your environment is always evolving. 
I recently moved from vim to nvim and rewrote much of my VimL-based configuration in Lua, something I’d never touched before. If you’re experimenting, learning, and building in parallel, having a clean way to isolate and test changes becomes invaluable.\nManaging tools without installing them In most cases, I use Homebrew to manage system components on my machine, and it works well. But there are situations where installing something locally feels excessive, especially if I only need it temporarily.\nSolution The examples below all work on the same basis: the code and commands are executed within disposable Docker containers, so nothing has to be installed on the local machine.\nExample: Python AST across versions Suppose you’re working with a simple Python script using the ast module, which allows you to parse and analyse Python code as an abstract syntax tree. This is commonly used in tools like linters or code formatters, but also in more advanced metaprogramming scenarios.\nHere’s a minimal script that parses the assignment x = 42 and checks whether the literal 42 is represented using ast.Num.\nIntuitively, we might expect this to return True; after all, 42 is a number, and ast.Num seems appropriate. But that\u0026rsquo;s not always the case.\nNow we run it across various Python versions using Docker.\nimport ast print(type(ast.parse(\u0026#34;x = 42\u0026#34;).body[0].value) is ast.Num) I will store this script as /tmp/check_ast.py. 
Using a Docker one-liner, I will execute the script in multiple versions of Python.\nfor version in 2.7 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 3.13; do echo \u0026#34;Python ${version}:\u0026#34;; docker run --rm -v /tmp/check_ast.py:/check_ast.py python:$version python /check_ast.py; done ## Python 2.7: ## True ## Python 3.5: ## True ## Python 3.6: ## True ## Python 3.7: ## True ## Python 3.8: ## False ## Python 3.9: ## False ## Python 3.10: ## False ## Python 3.11: ## False ## Python 3.12: ## /check_ast.py:2: DeprecationWarning: ast.Num is deprecated and will be removed in Python 3.14; use ast.Constant instead ## print(type(ast.parse(\u0026#34;x = 42\u0026#34;).body[0].value) is ast.Num) ## False ## Python 3.13: ## False ## /check_ast.py:2: DeprecationWarning: ast.Num is deprecated and will be removed in Python 3.14; use ast.Constant instead ## print(type(ast.parse(\u0026#34;x = 42\u0026#34;).body[0].value) is ast.Num) You’ll notice the results vary. Versions up to 3.7 return True; 3.8 and later return False. This reflects the evolution of the ast module. Since Python 3.8, ast.parse() emits ast.Constant for all literals (numbers, strings, and constants such as None), so the identity check type(...) is ast.Num fails. However, ast.Num, ast.Str, and related nodes were not immediately retired; they were kept as deprecated compatibility classes.\nAs a result, isinstance(..., ast.Num) still returned True for a numeric constant in those versions, even though the parser no longer produced ast.Num nodes. Python 3.12 started emitting a DeprecationWarning when the old names are used, and the warning shown above announces their removal in Python 3.14. This is a good example of how language-level changes may be rolled out gradually, and why it’s useful to test behaviour directly rather than rely on changelog summaries alone.\nOther interesting uses For quick evaluation it is possible to jump directly into an IPython console. 
I find this particularly useful when I want to run some code interactively in a specific version of Python.\ndocker run -it --rm python:3.8 bash -c \u0026#34;pip install ipython \u0026amp;\u0026amp; ipython\u0026#34; Another trick I find useful in those scenarios is installing packages from within an interactive Python session by calling pip through subprocess, as shown below:\nimport subprocess import sys subprocess.run([sys.executable, \u0026#34;-m\u0026#34;, \u0026#34;pip\u0026#34;, \u0026#34;install\u0026#34;, \u0026#34;pandas\u0026#34;], check=True) import pandas as pd print(pd.__version__) Summary One-line Docker commands are a lightweight, repeatable, and isolated way to test and explore code across environments without cluttering your system. They’re particularly useful when comparing behaviour across language versions or running quick experiments in tools you don’t use daily.\n","permalink":"https://www.thefinalartefact.xyz/post/one-line-docker/","summary":"\u003cp\u003e\u003ca href=\"/post/one-line-docker/images/ast_docker.png\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n  \u003cimg alt=\"Executing Python command across multiple versions\" loading=\"lazy\" src=\"/post/one-line-docker/images/ast_docker.png\"\u003e\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eSetting up a robust data science development environment takes time, and it\u0026rsquo;s a process that’s rarely ever finished. If you’re the type who likes to get the most out of your tools, you’ll likely enjoy tweaking, optimising, and layering your workspace with productivity enhancements. That might mean refining your Python setup to easily manage multiple language versions and dependencies, or expanding your text editor with plugins for linting, code suggestions, unit test execution, and CI/CD integration.\u003c/p\u003e","title":"One Line Docker Commands"},{"content":" What are .dotfiles? 
Dotfiles are hidden configuration files on Unix-like systems.\nTheir filenames start with a dot (.), making them hidden by default.\nThey store preferences and settings for programs like shells, text editors, and version control systems.\nMany modern Linux applications follow the XDG Base Directory Specification.\nThis guideline recommends placing user-specific configuration files in ~/.config (or $XDG_CONFIG_HOME).\nUsing this standard reduces clutter in home directories and simplifies managing configurations across systems.\nImplementation To version control dotfiles with Git, we create a special repository setup.\nWe’ll alias this setup to a convenient config command, making it easy to run regular Git operations on our configurations.\nFor example, version controlling my Emacs configuration (~/.emacs.d/init.el) looks like this:\nconfig switch -c configs/emacs_install config add ~/.emacs.d/init.el config commit -as -m \u0026#34;Initial Emacs config\u0026#34; The config command The config command is in effect an alias for a git command with predefined --work-tree and --git-dir arguments. For this to work we need to complete a few steps:\nCreate a directory in which to store the dotfiles; in my case, mkdir -v $HOME/.dotfiles\nAs a next step we create a bare Git repository. A bare repository has no working tree; it contains only the version control data that would normally live in the .git folder.\ngit init --bare $HOME/.dotfiles/ Finally, we want to enable the config command; it is simply an aliased git command with the --work-tree and --git-dir parameters, and looks as follows:\nalias config=\u0026#39;/usr/bin/git --git-dir=$HOME/.dotfiles/ --work-tree=$HOME\u0026#39; What happens here:\nThe argument --work-tree points to where the actual files reside The argument --git-dir specifies where we store the repository data There is one more snag: in the present configuration, calling the config command would show a vast number of untracked files. 
We can solve that problem by using the following git configuration\n1 config config --local status.showUntrackedFiles no If this solution is working for you, you can add the line with the alias command to your ~/.bashrc (or other depending on the shell) to make it permanently available.\nPractical example Git-based version control for dotfiles allows easy tracking and experimentation with configurations. Git’s branches simplify testing complex setups without risking the main configuration.\nConsider my Neovim (NVim) setup. It’s structured into multiple Lua configuration files managed by the Lazy plugin. The current structure looks like this:\n1 tree ~/.config/nvim -P \u0026#39;*.lua\u0026#39; --prune 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 ## /Users/konrad/.config/nvim ## ├── init.lua ## └── lua ## ├── config ## │ ├── autocmds.lua ## │ └── lazy.lua ## └── plugins ## ├── autolist.lua ## ├── autopairs.lua ## ├── autosession.lua ## ├── code-runner.lua ## ├── codecompletion.lua ## ├── conform.lua ## ├── dial.lua ## ├── gruvbox.lua ## ├── kanagawa.lua ## ├── lsp.lua ## ├── lualine.lua ## ├── mason.lua ## ├── neogen.lua ## ├── neogit.lua ## ├── neotree.lua ## ├── nvim-ts-autotag.lua ## ├── nvimlint.lua ## ├── orgmode.lua ## ├── R.lua ## ├── snippets.lua ## ├── startup.lua ## ├── telescope-undo.lua ## ├── telescope.lua ## ├── templates.lua ## ├── treesitter.lua ## └── wilder.lua ## ## 4 directories, 29 files Practical Example: Adding R support Suppose I want to add R support to Neovim using the R.nvim plugin.\nThis plugin enables running R code directly from Neovim, managing code completion, and improving workflows.\nModifying multiple files Owing to the structure of my NVim configuration, in order to enable R.nvim support I would need to edit the following files:\nR.nvim - in this file I will keep the key plugin configuration codecompletion.lua - This file stores code completion configuration treesitter.lua - 
Treesitter, parser generator tool, is required to enable some of the key R.nvim functionalities Updating multiple files at once To implement R.nvim, several files need updates:\nR.nvim: core plugin settings codecompletion.lua: code completion integration treesitter.lua: required parser configurations for R Additionally, adjustments might be needed in init.lua or keymap definitions.\nIf I later change my mind and prefer RStudio, reversing these changes manually could be cumbersome.\nHowever, managing dotfiles with Git makes reverting simple.\nBefore starting the new setup, I create a separate branch using my aliased Git command:\n1 config switch -c configs/nvim-r-test I would then work through the configuration changes, test NVim and play with any settings desired. Depending on what I decide to do I would commit files and merge to the master branch or abandon the work (possibly committing changes to the branch) and switch to the master branch.\nAlternatives Another approach uses symbolic links (symlinks) to a configuration-only repository.\nHowever, Git treats symlinks as regular text files unless core.symlinks is explicitly enabled, as explained in this StackOverflow discussion.\nUsing hard links is similarly problematic since Git does not store inode numbers.\nImplementing hard links within Git requires third-party tools and has significant limitations, discussed in this StackOverflow answer.\n","permalink":"https://www.thefinalartefact.xyz/post/git-dotfiles/","summary":"\u003cp\u003e\u003ca href=\"/post/git-dotfiles/images/imageGitStatus.png\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n  \u003cimg alt=\"Using git to version control dotfiles\" loading=\"lazy\" src=\"/post/git-dotfiles/images/imageGitStatus.png\"\u003e\n\u003c/a\u003e\u003c/p\u003e\n\u003ch2 id=\"what-are-dotfiles\"\u003eWhat are .dotfiles?\u003c/h2\u003e\n\u003cp\u003eDotfiles are hidden configuration files on Unix-like systems.\u003cbr\u003e\nTheir filenames start with a dot 
(\u003ccode\u003e.\u003c/code\u003e), making them hidden by default.\u003cbr\u003e\nThey store preferences and settings for programs like shells, text editors, and version control systems.\u003c/p\u003e\n\u003cp\u003eMany modern Linux applications follow the XDG Base Directory Specification.\u003cbr\u003e\nThis guideline recommends placing user-specific configuration files in \u003ccode\u003e~/.config\u003c/code\u003e (or \u003ccode\u003e$XDG_CONFIG_HOME\u003c/code\u003e).\u003cbr\u003e\nUsing this standard reduces clutter in home directories and simplifies managing configurations across systems.\u003c/p\u003e","title":"Version Control your Dotfiles"},{"content":"Why Swift? Data science is dominated by Python and R, with some usage of Julia, Scala, Java, and C++. While Swift may not be the most popular choice, it offers several notable benefits—especially for developers already invested in the Apple ecosystem.\nKey Advantages Performance Considerations\nAs a compiled language, Swift often runs faster than languages like Python or R. This can be especially beneficial when handling large datasets or complex computations.\nSafety \u0026amp; Reliability\nSwift’s strong type system, optional handling, and memory safety features help you write more robust and secure code with fewer runtime errors.\nEcosystem \u0026amp; Tooling\nSeveral libraries and frameworks support data science in Swift, such as Swift Numerics for numerical computing and Swift for TensorFlow for machine learning.\nIntegration with Existing Codebases\nSwift integrates smoothly with existing iOS and macOS projects or C/C++ libraries. This allows teams to unify app logic and data science components under one language and codebase.\nCommercialization Potential For projects intended for the Apple ecosystem, Swift-based development can streamline the path from prototype to product. Reusing large parts of your data science pipeline directly within an iOS or macOS app reduces development overhead. 
This level of end-to-end integration is often more cumbersome when using non-Swift languages, making Swift an attractive option for commercial applications.\nInitial Configuration In data science workflows, the Swift REPL (Read-Eval-Print Loop) provides an interactive environment that runs Swift code line by line, making it easy to test ideas and quickly prototype. In this blog post, I will use the Swift REPL within an R Markdown document by leveraging the knitr package. This setup allows me to execute Swift code blocks directly while seamlessly incorporating the output into the rendered document, streamlining both experimentation and content creation.\nAdding Swift as engine to knitr The first step is to integrate a custom Swift engine into knitr. We have several goals for this setup:\nMaintain a cumulative namespace across chunks.\nIf a variable varA is defined in the first chunk and varB in the second, the environment for the second chunk should include both varA and varB.\nLimit scope retroactively.\nNewly defined variables remain isolated to the chunk in which they appear, so a variable varC defined in the third chunk does not retroactively affect the first or second.\nReduce redundant output.\nMinimize repeated print statements, ideally preserving only the final output for clarity.\nThis can be done using the below command 1.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 ## Wrap code chunks knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 80), tidy = TRUE) ## Define Swift as engine knitr::knit_engines$set(swift = function(options) { # Get all Swift chunks swift_chunk_names \u0026lt;- knitr::all_labels(engine == \u0026#34;swift\u0026#34;) # Preceding chunks prior_chunk_names \u0026lt;- swift_chunk_names[seq_len(Position(\\(x) x == knitr::opts_current$get(\u0026#34;label\u0026#34;), swift_chunk_names))] # All Swift code collected_swift_code \u0026lt;- Reduce(\\(x, y) { c(x, knitr::knit_code$get(y)) }, prior_chunk_names, init = 
\u0026#34;\u0026#34;) # Filter Swift code Identify print statement lines print_lines \u0026lt;- grep(\u0026#34;^print\u0026#34;, collected_swift_code) # Keep only the last print statement if (length(print_lines) \u0026gt; 1) { filtered_swift_code \u0026lt;- collected_swift_code[-print_lines[-length(print_lines)]] # Remove all but the last print } else { filtered_swift_code \u0026lt;- collected_swift_code # Keep everything as is } # Run the collected Swift code out \u0026lt;- system2(command = \u0026#34;swift\u0026#34;, args = \u0026#34;repl\u0026#34;, input = filtered_swift_code, stdout = TRUE, stderr = TRUE) knitr::engine_output(options, options$code, out) }) What happens here:\nFunction knitr::knit_engines$set registers a new engine. The engine is defined as a function and registered under the name swift. The call swift_chunk_names[seq_len(Position( \\(x) x == knitr::opts_current$get(\u0026quot;label\u0026quot;), swift_chunk_names ))] locates the current chunk and ensures that all preceding Swift chunks, together with the current one, are passed to the engine. The functional Position returns the index of the first element meeting the criterion. Notation \\(x) was introduced in R 4.1.0 and is a shorthand for function(x), e.g. \\(x) x + 1 is parsed as function(x) x + 1. Call Reduce( \\(x, y) { c(x, knitr::knit_code$get(y)) }, prior_chunk_names, init = \u0026quot;\u0026quot;) concatenates the code of those chunks into a single character vector. The subsequent lines subset that vector to drop every print statement except the last one. 
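The filtering step is easier to see in isolation. Below is a minimal sketch of the same idea, written in Python rather than R; keep_last_print and the sample chunks are hypothetical names made up for this illustration and are not part of knitr or of the engine above.

```python
import re


def keep_last_print(lines):
    # Indices of lines that start with print, mirroring grep('^print', ...).
    print_idx = [i for i, line in enumerate(lines) if re.match('print', line)]
    # Every print except the final one is dropped; other lines keep their order.
    drop = set(print_idx[:-1])
    return [line for i, line in enumerate(lines) if i not in drop]


# Two made-up chunks standing in for consecutive Swift chunks.
chunks = [
    ['let a = 1', 'print(a)'],
    ['let b = a + 1', 'print(b)'],
]
# Flatten everything seen so far, as the Reduce call does for the Swift code.
collected = [line for chunk in chunks for line in chunk]
filtered = keep_last_print(collected)
```

Here filtered keeps both let statements but only the print from the second chunk, which is why each rendered chunk shows a single final line of printed output.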
Testing Let\u0026rsquo;s attempt to evaluate a trivial statement\nimport Foundation let helloText: String = \u0026#34;Hello from Swift REPL\u0026#34; print(helloText) ## helloText: String = \u0026#34;Hello from Swift REPL\u0026#34; ## Hello from Swift REPL Let\u0026rsquo;s see if we can define new variables and reuse the variable defined in the previous chunk\nlet punctuationMark: String = \u0026#34;!\u0026#34; let helloTwo:String = helloText + punctuationMark print(helloTwo) ## helloText: String = \u0026#34;Hello from Swift REPL\u0026#34; ## punctuationMark: String = \u0026#34;!\u0026#34; ## helloTwo: String = \u0026#34;Hello from Swift REPL!\u0026#34; ## Hello from Swift REPL! Conclusion By setting up a custom Swift engine in knitr, you can seamlessly execute Swift REPL commands within an R Markdown document and capture the output for immediate display. This allows for rapid experimentation, straightforward debugging, and convenient sharing of code alongside explanatory text, qualities essential for any data science workflow. With just a few lines of configuration, Swift’s performance and safety become accessible in an interactive environment, letting you prototype data manipulation, statistical analysis, or even machine learning models right in your R Markdown files.\nThe original code was contributed via a StackOverflow discussion; I rewrote it using R\u0026rsquo;s functional programming tools and reduced the code to a few lines.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://www.thefinalartefact.xyz/post/swift-data-science/","summary":"\u003ch2 id=\"why-swift\"\u003eWhy Swift?\u003c/h2\u003e\n\u003cp\u003eData science is dominated by Python and R, with some usage of Julia, Scala, Java, and C++. 
While Swift may not be the most popular choice, it offers several notable benefits—especially for developers already invested in the Apple ecosystem.\u003c/p\u003e\n\u003ch2 id=\"key-advantages\"\u003eKey Advantages\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003ePerformance Considerations\u003c/strong\u003e\u003cbr\u003e\nAs a compiled language, Swift often runs faster than languages like Python or R. This can be especially beneficial when handling large datasets or complex computations.\u003c/p\u003e","title":"Using Swift for Data Science Workflows"},{"content":"Vim provides a wide range of functions for file formatting, starting with basic features such as reindent.\nVimL Implementation Creating a function within Vim to process the file is likely the most straightforward approach. The primary purpose of this function is to pass the filename to an external command for formatting. Leveraging the rich ecosystem of Python formatting tools available from the command line allows the function to efficiently and consistently format files, tapping into powerful, pre-existing solutions for code aesthetics and standardization. In effect, the role of the function is to pass the filename to the call below:\n1 autopep8 --verbose --in-place --aggressive --aggressive ${our_python_file} This is the simplest implementation, where the file path is passed into the formatting function, and the formatted file is then read back into the buffer.\n1 2 3 4 5 6 7 8 \u0026#34; Aggressively and quickly format Python file function! FormatThisPythonFile() let filename = expand(\u0026#34;%\u0026#34;) let cmd = \u0026#34;autopep8 --verbose --in-place --aggressive --aggressive \u0026#34; . filename let result = system(cmd) execute(\u0026#39;:edit! \u0026#39; . filename) echo result endfunction Define Function: The function FormatThisPythonFile is defined to format the current Python file. 
Get Filename: let filename = expand(\u0026quot;%\u0026quot;) retrieves the name of the current file being edited in Vim. Construct Command: let cmd = \u0026quot;autopep8 --verbose --in-place --aggressive --aggressive \u0026quot; . filename constructs the command to run autopep8 with the aggressive formatting options on the file. Execute Command: let result = system(cmd) executes the constructed command using the system shell and stores the result. Reload File: execute(':edit! ' . filename) reloads the formatted file into the Vim buffer to reflect the changes made by autopep8. Output Result: echo result outputs the result of the autopep8 command to the user. This implementation has a few major drawbacks:\nTo pass the most recent content of the file into autopep8, the file must be saved. More importantly, the file\u0026rsquo;s history is lost at this stage. An alternative implementation avoids these issues by not replacing the file content directly. Instead, it updates the buffer with the formatted content. This function could look as follows:\n1 2 3 4 5 6 7 8 9 function! FormatThisPythonFile() let filename = expand(\u0026#34;%\u0026#34;) let cmd = \u0026#34;autopep8 --aggressive --aggressive \u0026#34; . filename let result = system(cmd) execute \u0026#34;%d\u0026#34; put =result exec \u0026#34;1,1d\u0026#34; endfunction command FormatThisPythonFile call FormatThisPythonFile() In this implementation:\nThe filename is retrieved using expand(\u0026quot;%\u0026quot;). The autopep8 command is constructed and executed, with its output stored in the result variable. The current buffer content is deleted with execute \u0026quot;%d\u0026quot;. The formatted content from result is inserted back into the buffer. The first line, which may be an empty line due to the put command, is deleted with exec \u0026quot;1,1d\u0026quot;. The command FormatThisPythonFile is created to call the function. 
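The buffer-based variant boils down to one pattern: hand text to an external formatter and take back whatever it prints, without rewriting the file on disk. A minimal sketch of that pattern in Python follows; format_text and demo_command are hypothetical names for this illustration, and the stand-in formatter simply upper-cases its input. With autopep8 installed, the command could be something like autopep8 --aggressive --aggressive - (the trailing dash asks it to read from standard input), but that is an assumption the sketch does not exercise.

```python
import subprocess
import sys


def format_text(source, command):
    # Feed the text to the formatter on stdin and capture its stdout,
    # so the file on disk is never rewritten.
    result = subprocess.run(command, input=source, capture_output=True,
                            text=True, check=True)
    return result.stdout


# Stand-in formatter (an assumption for this sketch): upper-cases its input.
demo_command = [sys.executable, '-c',
                'import sys; sys.stdout.write(sys.stdin.read().upper())']
formatted = format_text('x=1', demo_command)
```

Passing the text on standard input would also remove the need to save the file first, which is the other drawback noted above.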
Example The GIF below demonstrates the changes between the formatted and unformatted versions of the file after using the function, along with the ability to switch between the historical and formatted versions of the file.\nConclusion Vim\u0026rsquo;s extensibility makes it an excellent tool for developers who want to integrate powerful command-line utilities into their workflow. By leveraging Python formatting tools like autopep8, you can maintain clean and consistent code effortlessly. The ability to create custom Vim functions to automate these tasks further enhances productivity and ensures that your code adheres to best practices with minimal effort. This seamless integration of command-line tools within Vim highlights its flexibility and power, making it an invaluable asset for any developer\u0026rsquo;s toolkit.\nNotable mentions vim-autopep8, maintained by tell-k, offers an even more comprehensive implementation. It handles details such as maintaining cursor position and the ability to format selected parts of the file.\n","permalink":"https://www.thefinalartefact.xyz/post/aggressively-formatting-your-python-files/","summary":"\u003cp\u003eVim provides a wide range of functions for file formatting, starting with basic features such as \u003ccode\u003ereindent\u003c/code\u003e.\u003c/p\u003e\n\u003ch2 id=\"viml-implementation\"\u003eVimL Implementation\u003c/h2\u003e\n\u003cp\u003eCreating a function within Vim to process the file is likely the most straightforward approach. The primary purpose of this function is to pass the filename to an external command for formatting. Leveraging the rich ecosystem of Python formatting tools available from the command line allows the function to efficiently and consistently format files, tapping into powerful, pre-existing solutions for code aesthetics and standardization. 
In effect, the role of the function is to pass the filename to the call below:\u003c/p\u003e","title":"Aggressively formatting your Python files"},{"content":"Background For a computer, an image is a collection of interpretable instructions that amount to a visual representation. Raster images are composed of pixels, each holding a colour, whereas vector images keep track of points and the equations that join them. In \\(\\LaTeX\\) PGF/TikZ is used to generate vector graphics from algebraic descriptions. TikZ is mostly used to conveniently draw various scientific figures. ChatGPT is capable of generating computer code in the majority of popular languages. I wanted to test how far I could get with ChatGPT generating illustrations using TikZ.\nChatGPT Conversation Let\u0026rsquo;s jump right in and see how ChatGPT deals with a simple request to draw a class diagram. For the request:\nCan you draw an example of a class \u0026ldquo;Person\u0026rdquo; with the properties \u0026ldquo;name\u0026rdquo; and \u0026ldquo;age\u0026rdquo; using TikZ?\nChatGPT returns the following code.\n\\begin{tikzpicture} \\node[draw, class] (Person) at (0,0) {Person}; \\node[draw, attribute] (name) at (-2.5,-1.5) {name}; \\node[draw, attribute] (age) at (2.5,-1.5) {age}; \\draw[-\u0026gt;] (name) -- (-0.5,-1.5) -- (-0.5,0.1) -- (Person); \\draw[-\u0026gt;] (age) -- (0.5,-1.5) -- (0.5,0.1) -- (Person); \\end{tikzpicture} There are mistakes in that code: it refers to attribute and class styles that are not defined. We can amend that by giving ChatGPT a more specific set of instructions. 
The amended code compiles, but the result looks poor.\n\\begin{tikzpicture}[ class/.style={draw, rectangle, minimum width=2cm, minimum height=1cm}, attribute/.style={draw, rectangle, minimum width=2cm, minimum height=0.5cm} ] \\node[class] (Person) at (0,0) {Person}; \\node[attribute] (name) at (-2.5,-1.5) {name}; \\node[attribute] (age) at (2.5,-1.5) {age}; \\draw[-\u0026gt;] (name) -- (-0.5,-1.5) -- (-0.5,0.1) -- (Person); \\draw[-\u0026gt;] (age) -- (0.5,-1.5) -- (0.5,0.1) -- (Person); \\end{tikzpicture} Figure 1: Sample Class\nThe diagram is ugly, but by providing ChatGPT with additional instructions we would likely be able to arrive at an acceptable solution. It makes sense to test how ChatGPT handles a more complex request.\nCould you create a flow chart illustrating a recursive implementation for computing Fibonacci numbers?\n\\usetikzlibrary{shapes.geometric, arrows} \\begin{tikzpicture}[node distance=2cm] \\tikzstyle{startstop} = [rectangle, rounded corners, minimum width=3cm, minimum height=1cm,text centered, draw=black, fill=red!30] \\tikzstyle{process} = [rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=orange!30] \\tikzstyle{arrow} = [thick,-\u0026gt;,\u0026gt;=stealth] \\node (start) [startstop] {Start}; \\node (decide) [process, below of=start] {Is n $\\leq$ 1?}; \\node (result) [process, below of=decide, yshift=-0.5cm] {Return n}; \\node (calc1) [process, below left of=decide, xshift=-2cm] {Calculate Fibonacci(n-1)}; \\node (calc2) [process, below right of=decide, xshift=2cm] {Calculate Fibonacci(n-2)}; \\node (add) [process, below of=calc1] {Return Fibonacci(n-1) + Fibonacci(n-2)}; \\node (stop) [startstop, below of=add] {Stop}; \\draw [arrow] (start) -- (decide); \\draw [arrow] (decide) -- node[anchor=east] {Yes} (result); \\draw [arrow] (decide) -- node[anchor=south] {No} (calc1); \\draw [arrow] (decide) -- 
node[anchor=south] {No} (calc2); \\draw [arrow] (calc1) -- (add); \\draw [arrow] (calc2) -- (add); \\draw [arrow] (add) -- (stop); \\end{tikzpicture} Figure 2: Fibonacci Flow Chart\nClearly, ChatGPT struggles to attend to the small details of a proper flow chart: nodes overlap and arrows cut across nodes, both of which are undesirable. Nevertheless, in a broad sense the flow chart contains the elements we would expect to find in a professionally produced diagram.\nDrawing Real Life Object As demonstrated, ChatGPT can handle fairly involved instructions and deliver usable results. The outputs require further tweaking, which we could achieve by issuing more detailed instructions to ChatGPT or by manually adjusting the placement of nodes and arrows. The final goal was to check how ChatGPT would handle drawing real-life objects using TikZ, so I asked it the following:\nCan you draw a German Shepherd using TikZ?\nChatGPT correctly recognised the impossibility of this task:\nCreating a detailed drawing of a German Shepherd using TikZ would be a complex and time-consuming task, as TikZ is primarily a vector graphics tool for creating diagrams and shapes. 
It\u0026rsquo;s not ideal for highly detailed illustrations like a realistic dog.\nStill, it was able to produce the following result:\n\\begin{tikzpicture} % Dog body \\draw[gray!50, fill=gray!30] (0,0) ellipse (1.5 and 0.7); % Dog head \\draw[gray!50, fill=gray!30] (2.2,0) circle (0.8); % Dog ears \\fill[gray!50] (3,0.7) ellipse (0.3 and 0.4); \\fill[gray!50] (3,-0.7) ellipse (0.3 and 0.4); % Dog eyes \\fill[white] (2.5,0.2) ellipse (0.15 and 0.25); \\fill[white] (2.5,-0.2) ellipse (0.15 and 0.25); \\fill[black] (2.5,0.2) circle (0.05); \\fill[black] (2.5,-0.2) circle (0.05); % Dog nose \\fill[black] (2.7,0) circle (0.08); % Dog tail \\draw[line width=2pt] (-0.8,0) to[out=70,in=250] (-1.3,1.2); \\end{tikzpicture} Figure 3: German Shepherd\n","permalink":"https://www.thefinalartefact.xyz/post/drawing-chat-gpt/","summary":"\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eFor a computer, an image is a collection of interpretable instructions that amount to a visual representation. Raster images are composed of pixels, each holding a colour, whereas vector images keep track of points and the equations that join them.  In \u003ccode\u003e\\(\\LaTeX\\)\u003c/code\u003e PGF/TikZ is used to generate vector graphics from algebraic descriptions. TikZ is mostly used to conveniently draw various scientific figures. ChatGPT is capable of generating computer code in the majority of popular languages. I wanted to test how far I could get with ChatGPT generating illustrations using TikZ.\u003c/p\u003e","title":"Showing ChatGPT How to Draw"},{"content":"Types of Projects The term Python project can be somewhat misleading. While languages like Swift are designed for specific purposes such as generating macOS/iOS apps, components, and frameworks, Python is used in a much more versatile manner. 
A Python project might range from an analytical solution developed across multiple Jupyter notebooks to a standalone script querying a database API and extracting results to an application or package facilitating the deployment of models. Each of these projects has its own key usability requirements. For instance, if end-users will utilize our project through a command-line interface, we will focus on argument parsing and other elements facilitating user-friendly execution.\nData Science, Model, Tool, and Hybrid In business practice, I have come across a few common types of projects. The distinction I draw between pure data science and model projects is somewhat arbitrary and frequently blurred but useful.\nTime Horizon The time horizon approach to project structure is hugely beneficial as it enables us to narrow down the gap between the complexity of the project scaffolding and the time horizon. If visualized, it might look something like this:\nSet-up File structure While in certain scenarios we may get away with a single-file setup, any wider collaboration is usually easier if the project structure spans multiple files.\nTesting When structuring Python projects, it\u0026rsquo;s essential to understand the difference between setup.py and setup.cfg, as both are used for packaging and distribution but serve slightly different purposes.\nsetup.py setup.py is a Python script traditionally used for configuring a Python package. It contains a setup() function from the setuptools module, which defines the metadata and dependencies of the project.\nPros Flexibility: Being a Python script, setup.py allows for complex logic to determine the package\u0026rsquo;s configuration. Wide Usage: It has been the standard for many years, meaning it is widely supported and understood within the Python community. Dynamic Configuration: You can dynamically set values, run additional code, or perform custom setup steps. 
Cons Readability: Because it can contain arbitrary code, it might be harder to read and understand compared to a static configuration file. Complexity: For simple projects, the added flexibility might be overkill and lead to unnecessary complexity. Example of setup.py 1 2 3 4 5 6 7 8 9 10 11 from setuptools import setup, find_packages setup( name=\u0026#39;example_package\u0026#39;, version=\u0026#39;0.1\u0026#39;, packages=find_packages(), install_requires=[ \u0026#39;numpy\u0026#39;, \u0026#39;pandas\u0026#39; ], ) setup.cfg setup.cfg is an INI file used to configure a Python package. It serves the same purpose as setup.py but in a declarative manner. This means you can specify the metadata and dependencies of the project in a static, easy-to-read format.\nPros for setup.cfg Readability: The INI format is more readable and easier to understand, especially for those who are not familiar with Python. Simplicity: For many projects, especially simpler ones, setup.cfg provides all the needed configuration without the complexity of a script. Separation of Concerns: Keeps configuration separate from code, adhering to the principle of separating configuration from the implementation. Cons for setup.cfg Limited Flexibility: Being a static file, it cannot perform dynamic operations or complex logic. Less Dynamic: If you need to compute values or execute code during setup, setup.cfg cannot handle these scenarios. Example of setup.cfg 1 2 3 4 5 6 7 8 9 [metadata] name = example_package version = 0.1 [options] packages = find: install_requires = numpy pandas Summary Structuring Python projects effectively requires understanding the different types of projects and their unique needs. Whether it\u0026rsquo;s a data science project, a model, a tool, or a hybrid, each has its own requirements for usability and collaboration. The choice between using setup.py and setup.cfg for project configuration also plays a significant role. 
While setup.py offers flexibility and dynamic configuration through Python scripting, setup.cfg provides a more readable, simpler, and declarative approach. Each option has its pros and cons, and the right choice depends on the specific needs and complexity of the project. By carefully considering these factors, you can ensure your Python projects are well-organized, maintainable, and easy to collaborate on.\n","permalink":"https://www.thefinalartefact.xyz/post/python-project-structure/","summary":"\u003ch2 id=\"types-of-projects\"\u003eTypes of Projects\u003c/h2\u003e\n\u003cp\u003eThe term \u003cem\u003ePython project\u003c/em\u003e can be somewhat misleading. While languages like Swift are designed for specific purposes such as generating macOS/iOS apps, components, and frameworks, Python is used in a much more versatile manner. A \u003cem\u003ePython project\u003c/em\u003e might range from an analytical solution developed across multiple Jupyter notebooks to a standalone script querying a database API and extracting results to an application or package facilitating the deployment of models. Each of these projects has its own key usability requirements. For instance, if end-users will utilize our project through a command-line interface, we will focus on argument parsing and other elements facilitating user-friendly execution.\u003c/p\u003e","title":"On Structuring Python Projects"},{"content":"Most frequently, users tend to undertake common R installation and management tasks from within the R session. Frequently making use of commands, like install.packages, update.packages or old.packages to obtain or update packages or update/verify the existing packages. Those common tasks can also be accomplished via the GUI offered within RStudio, which provides an effortless mechanism for undertaking basic package management tasks. 
This approach is usually sufficient for the vast majority of cases; however, there are situations in which working within the REPL^[REPL stands for Read Eval Print Loop and is usually delivered in the form of an interactive shell. While working in Python, users would commonly access the REPL by running python or ipython, more details.] to accomplish common installation tasks is not hugely convenient.\nFor example, we may be utilising multiple library paths and our intention may be to update only one of the available libraries, which we use as a baseline for all new projects^[Projects like renv and Rocker offer sophisticated ways of managing dependencies. Users intending to use R in production can definitely benefit from those developments.]. If our .libPaths() returns content similar to the output below, we may be interested in updating only the first library.\n.libPaths() ## [1] \u0026#34;/Users/thisUserName/Library/R/4/library\u0026#34; ## [2] \u0026#34;/usr/local/Cellar/r/4.1.2/lib/R/library\u0026#34; This outcome can be achieved using update.packages(lib.loc = .libPaths()[1]) but if our intention is to pass more arguments to the call, running this command frequently becomes tedious. Python users will be familiar with Conda and pip and, while working in R, may be longing for a solution offering a similar, convenient command line mechanism for basic installation and package management tasks. For instance, in Conda we would simply run conda install package-name=2.3.4 to install a specific version of a package. Similarly, we can run conda update --all -y to update all packages unprompted, or pass further arguments to update packages within a specific environment, and so forth. Pip offers a number of similar convenience features.\n\u0026ldquo;Requirements\u0026rdquo; I was interested in an R-based solution that would offer a comparable level of convenience. 
However, R does not offer a package management mechanism that is accessible from the command line out of the box.\nSolution R offers two interfaces for running scripts and commands without starting an interactive session: R CMD and Rscript^[The lesser-known littler project offers some excellent functionality that is worth exploring for users keen on exploiting R\u0026rsquo;s command line front end capabilities.]. R CMD is an older interface facilitating command and script execution via the command line. Rscript came later and is, in general, more flexible. Readers interested in the subject should start their research from this StackOverflow discussion, which provides a good primer on the key differences.\nUpdating Packages In the process of updating packages we will usually be interested in achieving two goals:\nIdentifying any outdated packages that may benefit from updating Running an update for outdated packages every so often. This is a relatively good habit, comparable to running brew upgrade with some frequency, so we can avoid our production environment falling far behind current stable releases. Outdated packages For a start, let\u0026rsquo;s attempt to construct a data frame containing the outdated packages using Rscript. This is achieved in the following manner:\nRscript --no-save --no-restore --no-init-file \\ -e \u0026#39;as.data.frame(old.packages(repos = \u0026#34;https://cran.rstudio.com\u0026#34;))[,-c(1,6)]\u0026#39; If we intend to execute this command frequently it may be useful to wrap it in a shell function. As we are not intending to pass any arguments to the function this is trivial:\nfunction routdated () { Rscript --no-save --no-restore --no-init-file \\ -e \u0026#39;as.data.frame(old.packages(repos = \u0026#34;https://cran.rstudio.com\u0026#34;))[,-c(1,6)]\u0026#39; } routdated Outdated packages explanation Let\u0026rsquo;s break this down. Rscript can be run with multiple switches:\n--no-init-file skips reading of the .Rprofile files. 
As I keep my library path stored within the Renviron file, I\u0026rsquo;m skipping processing of the .Rprofile file. Traditionally, .Rprofile is used to configure default repositories but I prefer to specify the repository directly in the call, as I want to update against RStudio\u0026rsquo;s mirror. --no-save prevents Rscript from saving the workspace on exit. -e is used to specify the expression that Rscript is expected to execute. A point of note: if you specify your library paths using the .Rprofile file you may need to remove the --no-init-file switch.\nInstalling packages Installing packages from the console is useful if we want to undertake those tasks outside the ongoing workflow. Let\u0026rsquo;s say that we are working on a project and looking for a package facilitating quick generation of tabular summaries or some relevant visuals. We could use install.packages from the REPL session we are in, but if the package requires compilation we get quite a lot of output that is not pertinent to the task at hand. Another common approach would be to start another session and run the install command there. 
RStudio also facilitates running background processes but that\u0026rsquo;s a little bit of an overkill for a simple task.\nA sensible middle ground would be to run the package installation via Rscript; we could conveniently call it as a background process or just start it in a new tab when working with any popular terminal client supporting this functionality.\n## Function body function rinst () { declare pkgnme=$1 Rscript --vanilla -e \u0026#34;install.packages(\u0026#39;$pkgnme\u0026#39;, dependencies = TRUE, repos = \u0026#39;https://cloud.r-project.org/\u0026#39;, lib = \u0026#39;/Users/konrad/Library/R/4/library\u0026#39;)\u0026#34; } Explanation As I want this function to install only to a specific library and from a specific repository, I\u0026rsquo;m using the --vanilla switch to let Rscript know that I don\u0026rsquo;t need it to process .Rprofile and other files in this particular call, as I\u0026rsquo;m storing the relevant arguments within the function.\nSummary In all likelihood, for more sophisticated projects, solutions like renv and Rocker are the way to go. Nevertheless, a combination of Rscript and bash can prove very efficient in quickly accomplishing routine maintenance tasks.\n","permalink":"https://www.thefinalartefact.xyz/post/rscriptt-for-r-managment/","summary":"\u003cp\u003eMost frequently, users tend to undertake common R installation and management tasks from within the R session. Frequently making use of commands, like \u003ccode\u003einstall.packages\u003c/code\u003e, \u003ccode\u003eupdate.packages\u003c/code\u003e or \u003ccode\u003eold.packages\u003c/code\u003e to obtain or update packages or update/verify the existing packages. Those common tasks can also be  accomplished via the GUI offered within RStudio, which provides an effortless mechanism for undertaking basic package management tasks. 
This approach is usually sufficient for the vast majority of cases; however, there are situations in which working within the REPL^[REPL stands for \u003cstrong\u003eR\u003c/strong\u003eead \u003cstrong\u003eE\u003c/strong\u003eval \u003cstrong\u003eP\u003c/strong\u003erint \u003cstrong\u003eL\u003c/strong\u003eoop and is usually delivered in the form of an interactive shell. While working in Python, users would commonly access the REPL by running \u003ccode\u003epython\u003c/code\u003e or \u003ccode\u003eipython\u003c/code\u003e, \u003ca href=\"https://pythonprogramminglanguage.com/repl/\"\u003emore details\u003c/a\u003e.] to accomplish common installation tasks is not hugely convenient.\u003c/p\u003e","title":"Using RScript for R Installation Management"},{"content":" Big-O The purpose of this is not to provide yet another primer on the Big-O/$\\Omega$/$\\Theta$ notation but to share my enduring appreciation for working with R. I will introduce Big-O only briefly to provide context but I would refer all of those who are interested to the linked materials.\nWhat is Big-sth notation When analysing functions, we may be interested in knowing how fast a function grows. For instance, for function \\(T(n)=4n^2-2n+2\\), after ignoring constants, we would say that \\(T(n)\\) grows at the order of \\(n^2\\). With respect to the Big-O notation we would write \\(T(n)=O(n^2)\\)^[MIT. (2021, December 9). Big O notation. Introduction to Computers and Programming. Retrieved December 26, 2021, from https://web.mit.edu/16.070/www/lecture/big_o.pdf]. Most commonly, in computer science, we would differentiate between Big O, Big Theta \\((\\Theta)\\) and Big Omega \\((\\Omega)\\). 
In a nutshell, the differences between those common notations can be summarised as follows:\n| Notation | Growth Rate | Running Case | Comparison | | Big-O | Equivalent to \\(\\le\\), growth less than or equal to a specific value | Worst case | Longest amount of time it could take for the algorithm to complete | | Big-Omega \\((\\Omega)\\) | Equivalent to \\(\\ge\\), growth greater than or equal to a specific value | Best case | Compared to Big-O, reflects a smaller amount of time | | Big-Theta \\((\\Theta)\\) | Equivalent to \\(=\\), growth equal to a specific value | Average case | Reflects the lower and upper bound of the running time |\nThe task at hand So I wanted to compare the speed of some algorithms, say \\(n*\\log(n)\\), \\(\\log(n)\\), \\(n^2\\) and so forth. 
I needed a quick visual aid where I could plot some of the most common running times and quickly add more to compare how the common implementations perform against the algorithm I was working on.\nFunctions So I\u0026rsquo;ve defined a few functions that correspond to the frequently-used Big-O curves that can be found in the majority of introductory literature on the subject, with the intention of quickly producing a visual base for the subject.\nO_1 \u0026lt;- function(n) {(rep(1,length(n)))} O_log_n \u0026lt;- function(n) {log(n)} O_n \u0026lt;- function(n) {n} O_n_log_n \u0026lt;- function(n) {n * log(n)} O_n_n \u0026lt;- function(n) {n^2} O_2_to_n \u0026lt;- function(n) {2^n} O_fac \u0026lt;- function(n) {factorial(n)} The \u0026ldquo;requirements\u0026rdquo; After defining the above I wanted a quick way of plotting the data. The easiest approach would be to define a table in which I could execute all of the functions. Let\u0026rsquo;s say for \\(n = 10\\) we could do the following:\nn \u0026lt;- 10 suppressPackageStartupMessages(library(\u0026#34;tidyverse\u0026#34;)) tibble(.rows = n) %\u0026gt;% mutate(O_1 = map_dbl(1:n, O_1), O_log_n = map_dbl(1:n, O_log_n)) ## # A tibble: 10 × 2 ## O_1 O_log_n ## \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; ## 1 1 0 ## 2 1 0.693 ## 3 1 1.10 ## 4 1 1.39 ## 5 1 1.61 ## 6 1 1.79 ## 7 1 1.95 ## 8 1 2.08 ## 9 1 2.20 ## 10 1 2.30 The problem with this approach is apparent. I didn\u0026rsquo;t want to type O_this_and_that x number of times to name columns, call functions and then keep adding and removing the functions I\u0026rsquo;ve already tested. I needed R to:\nAutomatically identify all of the relevant functions Call each function on a sequence 1:n Pack everything in a neat table so I can play with it by plotting, etc. Solution The actual solution can be condensed further but I have broken it down for easier readability. 
First, I started by identifying the functions.\nFun_names \u0026lt;- ls(pattern = \u0026#34;^O_\u0026#34;) Funs_O \u0026lt;- mget(Fun_names) # Create a list of functions head(Funs_O, n = 2) # Preview ## $O_1 ## function (n) ## { ## (rep(1, length(n))) ## } ## \u0026lt;bytecode: 0x7fd522b8c0f8\u0026gt; ## ## $O_2_to_n ## function (n) ## { ## 2^n ## } Now I needed to call each of those functions and pack the results into a tibble or a data.frame. This is achieved by the code below.\nmap_df(Funs_O, ~ map_dbl(1:n, ~ .x(.x))) ## # A tibble: 10 × 7 ## O_1 O_2_to_n O_fac O_log_n O_n O_n_log_n O_n_n ## \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; ## 1 1 2 1 0 1 0 1 ## 2 1 4 2 0.693 2 1.39 4 ## 3 1 8 6 1.10 3 3.30 9 ## 4 1 16 24 1.39 4 5.55 16 ## 5 1 32 120 1.61 5 8.05 25 ## 6 1 64 720 1.79 6 10.8 36 ## 7 1 128 5040 1.95 7 13.6 49 ## 8 1 256 40320 2.08 8 16.6 64 ## 9 1 512 362880 2.20 9 19.8 81 ## 10 1 1024 3628800 2.30 10 23.0 100 I reckon this requires some explanation. The map_* functions are purrr\u0026rsquo;s take on R\u0026rsquo;s *apply family: they apply a function to each element passed to them. Starting from the inside, the call map_dbl(1:n, ~ .x(.)) calls the function .x on the current element of the sequence, which is passed as . but another .x also works, as in the code above!! This is because R skips non-function objects when it looks up a name used in call position, so the called .x resolves to the closure from the outer lambda while the argument .x is the current number. The map_df(Funs_O, ...) call is more cheeky, as the list is an actual collection of functions gathered via mget. So instead of iterating over numbers, strings or data frames we are iterating over a collection of closures. In that case our .x becomes a closure. 
The full solution for \\(n = 1000\\) is provided below.\nn \u0026lt;- 1e3 # This is a bad habit of mine but while in R, I\u0026#39;ve a strong preference for using 1e6 for a million, 1e2 for hundred and so forth dta_big_o \u0026lt;- map_df(mget(Fun_names), ~ map_dbl(1:n, ~ .x(.x))) Preview For \\(n = 1000\\) we obtain (first and last six rows):\nO_1 O_2_to_n O_fac O_log_n O_n O_n_log_n O_n_n 1 2.0e+00 1 0.00 1 0.0 1 1 4.0e+00 2 0.69 2 1.4 4 1 8.0e+00 6 1.10 3 3.3 9 1 1.6e+01 24 1.39 4 5.5 16 1 3.2e+01 120 1.61 5 8.1 25 1 6.4e+01 720 1.79 6 10.8 36 1 3.3e+299 Inf 6.90 995 6868.2 990025 1 6.7e+299 Inf 6.90 996 6876.1 992016 1 1.3e+300 Inf 6.90 997 6884.0 994009 1 2.7e+300 Inf 6.91 998 6891.9 996004 1 5.4e+300 Inf 6.91 999 6899.9 998001 1 1.1e+301 Inf 6.91 1000 6907.8 1000000 Plotting Plotting the functions becomes trivial with the use of dplyr and ggplot2. As shown below, we can leverage the previous solution using mget to dynamically pull all of the refreshed functions and keep on re-generating the plot to get a view on the running time of the relevant algorithm. Here, I\u0026rsquo;m using Fun_names to avoid unnecessary verbosity but this could be replaced with the ls call.\nmap_df(mget(Fun_names), ~ map_dbl(1:n, ~ .x(.x))) %\u0026gt;% rownames_to_column(\u0026#34;n_val\u0026#34;) %\u0026gt;% mutate(n_val = as.integer(n_val)) %\u0026gt;% pivot_longer(!n_val, names_to = \u0026#34;O_fun\u0026#34;, values_to = \u0026#34;n_t\u0026#34;) %\u0026gt;% ggplot(aes(x = n_val, y = n_t, group = O_fun, colour = O_fun)) + geom_line() + scale_y_continuous(trans = \u0026#34;log10\u0026#34;) + theme_minimal() Remarks In a production setting, we probably wouldn\u0026rsquo;t rely on ls(pattern = ...) and mget sourcing objects by name. Even when providing the mode argument for mget and limiting results to functions, there is a risk that our code could capture an undesired object that, for one reason or another, appears in the environment that ls happens to be searching. 
Finally, quickly plotting trivial functions, like \\(y=\\log(x)^2\\), can be easily achieved with the use of dedicated software. Mac users will already have Grapher installed, which meets that requirement. Personally, I appreciate use cases demonstrating R\u0026rsquo;s flexibility, as functionals and meta-programming are actual efficiency powerhouses.\n","permalink":"https://www.thefinalartefact.xyz/post/beauty-of-r-and-big-o/","summary":"\u003cscript src=\"index_files/kePrint/kePrint.js\"\u003e\u003c/script\u003e\n\u003clink href=\"index_files/lightable/lightable.css\" rel=\"stylesheet\" /\u003e\n\u003ch2 id=\"big-o\"\u003eBig-O\u003c/h2\u003e\n\u003cp\u003eThe purpose of this is not to provide yet another primer on the Big-O/$\\Omega$/$\\Theta$ notation but to share my enduring appreciation for working with R. I will introduce Big-O only briefly to provide context but I would refer all of those who are interested to the linked materials.\u003c/p\u003e\n\u003ch2 id=\"what-is-big-sth-notation\"\u003eWhat is Big-sth notation\u003c/h2\u003e\n\u003cp\u003eWhen analysing functions, we may be interested in knowing how fast a function grows. For instance, for function \u003ccode\u003e\\(T(n)=4n^2-2n+2\\)\u003c/code\u003e, after ignoring constants, we would say that \u003ccode\u003e\\(T(n)\\)\u003c/code\u003e grows at the order of \u003ccode\u003e\\(n^2\\)\u003c/code\u003e. With respect to the \u003cem\u003eBig-O\u003c/em\u003e notation we would write \u003ccode\u003e\\(T(n)=O(n^2)\\)\u003c/code\u003e^[MIT. (2021, December 9). Big O notation. Introduction to Computers and Programming. Retrieved December 26, 2021, from \u003ca href=\"https://web.mit.edu/16.070/www/lecture/big_o.pdf\"\u003ehttps://web.mit.edu/16.070/www/lecture/big_o.pdf\u003c/a\u003e]. Most commonly, in computer science, we would differentiate between Big O, Big Theta \u003ccode\u003e\\((\\Theta)\\)\u003c/code\u003e and Big Omega \u003ccode\u003e\\((\\Omega)\\)\u003c/code\u003e. 
In a nutshell, the differences between those common notations can be summarised as follows:\u003c/p\u003e","title":"Beauty of R and Big-O"},{"content":"Background Handling Hive/CSV interaction is a common reality of many analytical and data environments. The question of exporting data from Hive to CSV and other formats is frequently raised on online forums, with answers frequently suggesting the use of sed, which, combined with nifty regular expressions, pipes Hive output into flat CSV files as an exporting solution. Import of large amounts of data is best handled by suitable tools like Apache Flume. That is fine for simpler tables but may prove problematic for tables with a large amount of unstructured text. Frequently, analysts and data scientists are faced with the challenge of storing data in Hive on an irregular or semi-regular basis. For instance, a job may produce new forecasting scenarios that we may want to make available through Hive tables.\nRelying on the Spark API may prove fiddly when our requirements for the resulting Hive table are non-standard. We may want detailed control over column types or to dynamically store comments providing relevant table context. If our modelling/data creation efforts take place outside a Spark session, kicking off a Spark connection only to store some data in Hive feels like an unnecessarily onerous approach. Generating lengthy HiveQL syntax may also appear unnecessarily tedious if we deal with numerous tables of variable structures. The meta-programming paradigm offers a neat solution to that challenge. Using R we can quickly generate HiveQL code that we use to land the CSV data in Hive with minimal overhead and dependencies. 
The article provides an opinionated overview of how meta-programming could be employed to quickly generate detailed HiveQL for inserting data into a Hive table, controlling for column types/names and automatically generating comments.\nTrial Data Before progressing with the examples and outlining difficulties associated with the export I\u0026rsquo;m going to create some sample data. I\u0026rsquo;m using the docker-hive Docker image provided by Big Data Europe to run the example. The wakefield package available for R provides access to a number of functions useful for synthetic data generation. In this example the goal is to generate a data set that will have columns of diverse types, such as free text, dates, numbers of various formats, etc. The wakefield package comes with great examples and using them is generally a good idea. I have increased the size of the data to a hundred thousand rows so partitioning it will be more realistic.\nlibrary(\u0026#34;wakefield\u0026#34;) set.seed(123) test_data \u0026lt;- r_data_frame(n = 1e5, id, dob, animal, grade, death, dummy, grade_letter, gender, paragraph, sentence ) I will add a few columns to create partitions so our trial data is more \u0026ldquo;realistic\u0026rdquo;.\nsuppressPackageStartupMessages(library(\u0026#34;tidyverse\u0026#34;)) suppressPackageStartupMessages(library(\u0026#34;magrittr\u0026#34;)) # For %\u0026lt;\u0026gt;% test_data %\u0026lt;\u0026gt;% mutate(part_year = lubridate::year(DOB), part_mnth = lubridate::month(DOB), part_day = lubridate::day(DOB)) %\u0026gt;% mutate(across(starts_with(\u0026#34;part\u0026#34;), as.integer)) The trial data looks good; we have generated a number of messy variables that will be a \u0026ldquo;pleasure\u0026rdquo; to deal with during import/export operations.\n## Rows: 100,000 ## Columns: 13 ## $ ID \u0026lt;chr\u0026gt; 
\u0026#34;00001\u0026#34;, \u0026#34;00002\u0026#34;, \u0026#34;00003\u0026#34;, \u0026#34;00004\u0026#34;, \u0026#34;00005\u0026#34;, \u0026#34;00006\u0026#34;, \u0026#34;00… ## $ DOB \u0026lt;date\u0026gt; 2008-01-31, 2008-03-19, 2007-06-09, 2008-05-21, 2007-06-… ## $ Animal \u0026lt;fct\u0026gt; Field Spaniel, Akita, Pool Frog, Akita, Akita, Newfoundla… ## $ Grade \u0026lt;dbl\u0026gt; 83.1, 87.4, 92.5, 93.7, 90.2, 86.4, 89.0, 86.1, 85.1, 84.… ## $ Death \u0026lt;lgl\u0026gt; FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FA… ## $ Dummy \u0026lt;dbl\u0026gt; 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, … ## $ Grade_Letter \u0026lt;ord\u0026gt; B+, B+, B+, B, B+, A-, B, B+, A-, B, B, B, B+, A-, A-, B+… ## $ Gender \u0026lt;fct\u0026gt; Male, Female, Female, Male, Female, Male, Female, Female,… ## $ Paragraph \u0026lt;chr\u0026gt; \u0026#34;Lorem ipsum dolor sit amet, leo pretium amet eget. Orci … ## $ Sentence \u0026lt;chr\u0026gt; \u0026#34;N.\u0026#34;, \u0026#34;As Commander in Chief, I will maintain the stronge… ## $ part_year \u0026lt;int\u0026gt; 2008, 2008, 2007, 2008, 2007, 2007, 2007, 2007, 2007, 200… ## $ part_mnth \u0026lt;int\u0026gt; 1, 3, 6, 5, 6, 4, 10, 7, 8, 12, 12, 10, 8, 8, 11, 3, 11, … ## $ part_day \u0026lt;int\u0026gt; 31, 19, 9, 21, 25, 9, 7, 29, 13, 26, 21, 7, 5, 6, 20, 13,… Metaprogramming paradigm Metaprogramming paradigm assumes using computer code to generate more computer code. In effect metaprogramming gives other programs ability to treat code like data. As stated by Levy^[L. S. Levy, \u0026ldquo;A metaprogramming method and its economic justification,\u0026rdquo; in IEEE Transactions on Software Engineering, vol. SE-12, no. 2, pp. 272-277, Feb. 
1986, doi: 10.1109/TSE.1986.6312943.]:\nMetaprogramming, defined as creating application programs by writing programs that produce programs, is presented as the basis of a method for reducing software costs and improving software quality.\nThe question is how this relates to our task of importing relatively messy \u0026ldquo;real-life\u0026rdquo; data into a nice, partitioned Hive table that will be a pleasure to work with. Let\u0026rsquo;s consider the simplest way to import the data. We would use CREATE TABLE:\nCREATE TABLE IF NOT EXISTS blog.test_data ( id int, dob date, age int COMMENT \u0026#39;This column was renamed or something else happened\u0026#39;, gender string ... ) COMMENT \u0026#39;Our sample data\u0026#39; PARTITIONED BY (txn_date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY \u0026#39;,\u0026#39;; We would then follow with the load statement, which would look more or less like this:\nLOAD DATA INPATH \u0026#39;/user/hive/data/data.csv\u0026#39; INTO TABLE blog.test_data; The first observation that emerges is that generating all of the relevant code will be particularly onerous and verbose. In the following scenario we may want to:\nComment on the columns we are sourcing, reflecting the original name and type, and summarising any other transformations we have applied. Table comments could reflect details on the source, such as the location of the file or any other characteristics that we can gather through file.info. In the context of importing wide tables this will prove particularly painful. If we are at risk of undertaking that type of exercise more frequently, reflecting on a process that would enable us to automate or part-automate the data load can prove beneficial.\nGenerating code This is where R\u0026rsquo;s flexibility and data structures come into play. 
We can use R to generate our HiveQL code and come up with a generic function that could be easily deployed against data frames that we would like to make permanent in Hive. The most substantial element will be concerned with generating the code responsible for provisioning column names. We will also have to generate the initial opening code and the final elements that describe table characteristics.\nImporting data Before uploading the data we devise a table structure that will be used to hold it. Following the displayed variable types this can be done in the following manner. From a development perspective, breaking functions into smaller, manageable pieces will make them easier to develop.\nGenerating column names For this particular function we would like to obtain a string that would form the column part of the table creation query:\nCREATE TABLE IF NOT EXISTS blog.test_data ( -- \u0026lt;! Function output starts\u0026gt; id int, dob date, age int COMMENT \u0026#39;This column was renamed or something else happened\u0026#39;, gender string ... -- \u0026lt;! Function output ends\u0026gt; ) COMMENT \u0026#39;Our sample data\u0026#39; PARTITIONED BY (txn_date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY \u0026#39;,\u0026#39;; This implies the following requirements:\nWe want to come up with syntactically valid column names that can be used in HiveQL. As stated in the official documentation, Hive is fairly permissive in acceptable column names: In Hive 0.13 and later, column names can contain any Unicode character (see HIVE-6013), however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see HIVE-10120). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. 
Backtick quotation also enables the use of reserved keywords for table and column identifiers.\nWe can assume a more rigorous approach reflecting the preferences that, in all likelihood, the majority of us would endorse:\nWe don\u0026rsquo;t want the columns to start with a number. We want the function to replace spaces with _. We would like to remove non-letter characters. For currency characters we would like the function to replace those with a three-letter currency code. With the exception of the three-letter currency code, the column name should be in lowercase. Hive is case-insensitive but, pedantically, I would prefer the column to be named price_GBP instead of price_gbp. As this is a matter of preference we may make this operation optional. We would like the function to remove common short words; for instance price_item is sufficiently clear and we don\u0026rsquo;t need price_for_an_item. Pedantically, we may even be tempted to reverse common words so the order reads more naturally; item_price reads better than price_item (although there is no loss in meaning). #\u0026#39; Clean Column Names #\u0026#39; #\u0026#39; Arbitrary set of rules transforming string so passed results are consistent #\u0026#39; with HiveQL rules on syntactically correct column names. #\u0026#39; #\u0026#39; @details Manual replacements are useful for when we may be dealing with #\u0026#39; columns where automatic application of rules is not desired. For instance #\u0026#39; if our intention is to rename column \u0026#34;Super Important Column\u0026#34; to #\u0026#39; \u0026#34;to_delete\u0026#34; we would use the \\code{manual_replacement} argument to implement #\u0026#39; that change. 
#\u0026#39; #\u0026#39; @param x A character vector with column names to transform #\u0026#39; @param lowercase_currency_names A logical, defaults to \\code{FALSE}, whether to #\u0026#39; convert the currency names to lower-case. #\u0026#39; @param manual_replacement A named vector defining manual replacement for #\u0026#39; column names can be of format c(\u0026#34;Super Important Column\u0026#34; = \u0026#34;to_delete\u0026#34;) or #\u0026#39; c(\u0026#34;3\u0026#34; = \u0026#34;to_delete\u0026#34;) for column positions. #\u0026#39; @param remove_words A character vector with words we like to remove, #\u0026#39; defaults to commonly occurring words in column name like \u0026#34;per, by, for\u0026#34; #\u0026#39; #\u0026#39; @return A character vector #\u0026#39; #\u0026#39; @export #\u0026#39; #\u0026#39; @examples #\u0026#39; dirty_col_names \u0026lt;- c(\u0026#34;Date of Birth\u0026#34;, \u0026#34;12-important-column\u0026#34;, #\u0026#39; \u0026#34;14-crucial-column\u0026#34;, \u0026#34; user inputs\u0026#34;, \u0026#34;price in (£)\u0026#34;) #\u0026#39; clean_column_names(x = dirty_col_names) clean_column_names \u0026lt;- function(x, lowercase_currency_names = FALSE, manual_replacement = NULL, remove_words = c(\u0026#34;per\u0026#34;, \u0026#34;by\u0026#34;, \u0026#34;for\u0026#34;)) { x \u0026lt;- stringi::stri_trim_both(x) x \u0026lt;- stringi::stri_trans_tolower(x) x \u0026lt;- stringi::stri_replace_all_regex(x, \u0026#34;^[^a-z]*\u0026#34;, \u0026#34;\u0026#34;) # Column name should start with a letter x \u0026lt;- stringi::stri_replace_all_charclass(x, \u0026#34;\\\\p{WHITE_SPACE}\u0026#34;, \u0026#34;_\u0026#34;) x } dirty_col_names \u0026lt;- c(\u0026#34;Date of Birth\u0026#34;, \u0026#34;12-important-column\u0026#34;, \u0026#34;14-crucial-column\u0026#34;, \u0026#34; user inputs\u0026#34;, \u0026#34;price in 
(£)\u0026#34;) print(clean_column_names(x = dirty_col_names)) ## [1] \u0026#34;date_of_birth\u0026#34; \u0026#34;important-column\u0026#34; \u0026#34;crucial-column\u0026#34; \u0026#34;user_inputs\u0026#34; ## [5] \u0026#34;price_in_(£)\u0026#34; Conclusion In the subsequent articles I will tackle the remaining challenges, such as automatically generating meaningful strings for the column comments and R to Hive type conversion. The function discussed above will be made available through a dedicated R package that I\u0026rsquo;m publishing on GitHub at konradzdeb/HiveQLgenerator.\n","permalink":"https://www.thefinalartefact.xyz/post/importing-csv-to-hive/","summary":"\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eHandling Hive/CSV interaction is a common reality of many analytical and data environments. The question of exporting data from Hive to CSV and other formats is frequently raised on online forums, with answers frequently suggesting the use of \u003ca href=\"https://en.wikipedia.org/wiki/Sed\"\u003e\u003ccode\u003esed\u003c/code\u003e\u003c/a\u003e, which, combined with nifty regular expressions, pipes Hive output into flat CSV files as an exporting solution. Import of large amounts of data is best handled by suitable tools like \u003ca href=\"https://flume.apache.org\"\u003eApache Flume\u003c/a\u003e. That is fine for simpler tables but may prove problematic for tables with a large amount of unstructured text. Frequently, analysts and data scientists are faced with the challenge of storing data in Hive on an irregular or semi-regular basis. 
For instance, a job may produce new forecasting scenarios that we may want to make available through Hive tables.\u003c/p\u003e","title":"R-based metaprogramming strategies for handling Hive/CSV interaction (Part I, imports)"},{"content":"Recently, I came across an interesting discussion on Stack Overflow^[SO discussion on: Fuzzy Join with Partial String Match in R] on approaches to fuzzy matching tables in R. A good answer, contributed by one of the most resilient and excellent contributors, to whom I owe a lot of thanks, suggested relying on regular expressions, combined with basic string removal and transformations like toupper, to deterministically match the tables. The solution solved the problem and was accepted.\nSo what\u0026rsquo;s wrong With this particular problem/solution pair, there is absolutely nothing wrong. On numerous occasions I have seen people running UPPER(REGEXP_REPLACE( my_value, '[[:space:]]', '' )) only for someone to realise later that they actually need REGEXP_REPLACE(COLUMN,'[^[:ascii:]]','') and for someone else to then offer SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || ']','')^[The example originates from an SO discussion on removing non-ASCII characters; this is actually [one of the better answers](https://stackoverflow.com/a/18234629/1655567) offered.]. Those are all good solutions that solve the particular challenge but they (very) seldom stand the test of time.\nThe real problem lies not in finding the most robust approach but in deciding how the challenge is approached from a priority perspective. 
Those of us who endlessly twist a regex until it resembles the one below, which was actually created to match email addresses^[This is taken from a closed question on validating email addresses; SO hosts another, longer discussion that covers the regex approach to validating emails in detail], aspire to match deterministically at any price and are frequently destined to fail, especially when working with data reflecting human input.\n^(?:(?!.*?[.]{2})[a-zA-Z0-9](?:[a-zA-Z0-9.+!%-]{1,64}|)|\\\u0026#34;[a-zA-Z0-9.+!% -]{1,64}\\\u0026#34;)@[a-zA-Z0-9][a-zA-Z0-9.-]+(.[a-z]{2,}|.[0-9]{1,})$ The nature of regex matching is the source of the problem. Regex-based matching will always result in a binary outcome: strings either match or they do not. By its very definition, regex-based matching is not fuzzy matching.\nParadigm shift (just because we don\u0026rsquo;t mention this phrase often enough) When you attempt to match things fuzzily you are signing up to a few assumptions:\nYour matching reflects certain probabilistic assumptions or, in other words, your match is not expected to be 100% accurate. There is no \u0026ldquo;one right\u0026rdquo; method to do the matching; ways of calculating string distances may be more or less suitable for a given problem. Computationally expensive procedures may yield excellent results but prove impractical from the implementation perspective, and so on. For regex, by contrast, there is a clear direction of improvement: it can (almost) always be tweaked further to account for one more \u0026lsquo;edge case\u0026rsquo;. Example Let\u0026rsquo;s generate an example that shows the inefficiency of regex when contrasted with a string matching approach. The most obvious example is one that shows mismatches on misspelled words. Let\u0026rsquo;s consider the following example. 
Two data sets contain a set of strings reflecting car manufacturers, as shown below.\ndata_A \u0026lt;- tibble::tribble( ~ manufacturer, ~ price, \u0026#34;Ford Focus\u0026#34;, 100, \u0026#34;Ford Mondeo\u0026#34;, 120, \u0026#34; Ford Corsa\u0026#34;, 30, \u0026#34;Mercedes W205 C-Class\u0026#34;, 90, \u0026#34;Mecedes X156 GLA-Class\u0026#34;, 10 ) data_B \u0026lt;- tibble::tribble( ~ manufacturer, ~ subjective_rating, \u0026#34;Frod\u0026#34;, \u0026#34;B\u0026#34;, \u0026#34;mercedes\u0026#34; , \u0026#34;A\u0026#34; ) Let\u0026rsquo;s say that we want to bring the subjective_rating column into data_A. The first, common-sense step would be to match on the actual manufacturer. One could be tempted to take the first word from the manufacturer column and use it for matching, so that we match on the brand name only. For that purpose the most straightforward implementation would be to delete everything from the first space onwards. As shown below, the first problem we see is that \u0026#34; Ford Corsa\u0026#34; was reduced to an empty string as it starts with a space.\nsuppressPackageStartupMessages(library(\u0026#34;tidyverse\u0026#34;)) data_A |\u0026gt; mutate(manufacturer = str_remove(manufacturer, \u0026#34;\\\\s.*\u0026#34;)) |\u0026gt; left_join(data_B) ## Joining with `by = join_by(manufacturer)` ## # A tibble: 5 × 3 ## manufacturer price subjective_rating ## \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;chr\u0026gt; ## 1 \u0026#34;Ford\u0026#34; 100 \u0026lt;NA\u0026gt; ## 2 \u0026#34;Ford\u0026#34; 120 \u0026lt;NA\u0026gt; ## 3 \u0026#34;\u0026#34; 30 \u0026lt;NA\u0026gt; ## 4 \u0026#34;Mercedes\u0026#34; 90 \u0026lt;NA\u0026gt; ## 5 \u0026#34;Mecedes\u0026#34; 10 \u0026lt;NA\u0026gt; We could solve that problem by running trimws or str_trim first, so let\u0026rsquo;s do that. For better visibility the step is inserted as a separate call but we could easily wrap that call in the existing transformation. 
However, the results are still unsatisfactory.\nsuppressPackageStartupMessages(library(\u0026#34;tidyverse\u0026#34;)) data_A |\u0026gt; mutate(manufacturer = str_trim(manufacturer), manufacturer = str_remove(manufacturer, \u0026#34;\\\\s.*\u0026#34;)) |\u0026gt; left_join(data_B) ## Joining with `by = join_by(manufacturer)` ## # A tibble: 5 × 3 ## manufacturer price subjective_rating ## \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;chr\u0026gt; ## 1 Ford 100 \u0026lt;NA\u0026gt; ## 2 Ford 120 \u0026lt;NA\u0026gt; ## 3 Ford 30 \u0026lt;NA\u0026gt; ## 4 Mercedes 90 \u0026lt;NA\u0026gt; ## 5 Mecedes 10 \u0026lt;NA\u0026gt; The next step could be addressing the upper case / lower case challenge. After doing this, some records start to match but the approach proves inefficient. We can tweak our matching further using regex. If we want to match directly on a regex column we can use regex_left_join from the fuzzyjoin package. At this point the tables start to match but the overall conclusion is quite disappointing: if we want to progress in that direction, we face endlessly tweaking our strings through regex or standalone transformations to arrive at a suitable match. We are also exposing ourselves to the risk that an added row with a misspelled name will break that logic. 
In an actual production environment there is a risk that a substantial inflow of data will break our logic for a significant number of records.\nsuppressPackageStartupMessages(library(\u0026#34;tidyverse\u0026#34;)) data_A |\u0026gt; mutate(manufacturer = str_trim(manufacturer), manufacturer = str_to_lower(manufacturer), manufacturer = str_remove(manufacturer, \u0026#34;\\\\s.*\u0026#34;)) |\u0026gt; fuzzyjoin::regex_left_join(data_B, by = \u0026#34;manufacturer\u0026#34;) ## # A tibble: 5 × 4 ## manufacturer.x price manufacturer.y subjective_rating ## \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;chr\u0026gt; ## 1 ford 100 \u0026lt;NA\u0026gt; \u0026lt;NA\u0026gt; ## 2 ford 120 \u0026lt;NA\u0026gt; \u0026lt;NA\u0026gt; ## 3 ford 30 \u0026lt;NA\u0026gt; \u0026lt;NA\u0026gt; ## 4 mercedes 90 mercedes A ## 5 mecedes 10 \u0026lt;NA\u0026gt; \u0026lt;NA\u0026gt; A \u0026ldquo;pragmatic programmer\u0026rdquo; approach At this juncture it\u0026rsquo;s beneficial to revise our initial assumptions. After working with this trivial sample data we can conclude that:\nThese tables cannot be expected to match 100%; if we code for every single eventuality there is a strong chance that new data will throw this approach out of the window by introducing new spelling mistakes, spaces and so forth\u0026hellip;\nThere is no consistency in the mistakes: the data may contain superfluous spaces or odd capitalisation, so we can\u0026rsquo;t form an assumption about the main problem to fix here.\nA solution to this challenge would be to approach the matching problem from a probabilistic perspective. A first option would be to use a function assessing how dissimilar our strings are. Jaro-Winkler string distance is a metric commonly used in computer science and computational statistics to assess how distant given strings are. The comparator package offers a convenient implementation of that function. 
For this example, I\u0026rsquo;m leaving the default options and only setting the comparison to ignore case.\nlibrary(\u0026#34;comparator\u0026#34;) jw \u0026lt;- JaroWinkler(ignore_case = TRUE) jw(\u0026#34;Mercedes\u0026#34;, \u0026#34;mecedes\u0026#34;) ## [1] 0.9666667 The function returns a similarity score for two strings and, as expected, jw returns quite a high score for common spelling mistakes. The score for dissimilar words will be much lower.\njw(\u0026#34;Mercedes\u0026#34;, \u0026#34;opel corsa zxc\u0026#34;) ## [1] 0.5119048 We can use this function in our matching by leveraging the functions available in the fuzzyjoin package. The fuzzy_left_join requires the matching function to return a boolean output, so we can wrap the actual results in an anonymous function^[Anonymous functions simply have no name and are usually used within other calls, like sapply, etc. For a discussion refer to the article on R-Bloggers.]\nfuzzyjoin::fuzzy_left_join(x = data_A, y = data_B, by = \u0026#34;manufacturer\u0026#34;, match_fun = function(x, y) { jw(x, y) \u0026gt;= 0.65} ) ## # A tibble: 5 × 4 ## manufacturer.x price manufacturer.y subjective_rating ## \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;chr\u0026gt; ## 1 \u0026#34;Ford Focus\u0026#34; 100 Frod B ## 2 \u0026#34;Ford Mondeo\u0026#34; 120 Frod B ## 3 \u0026#34; Ford Corsa\u0026#34; 30 Frod B ## 4 \u0026#34;Mercedes W205 C-Class\u0026#34; 90 mercedes A ## 5 \u0026#34;Mecedes X156 GLA-Class\u0026#34; 10 mercedes A Without any string transformations we achieve a reasonable match. The key word here is reasonable. The proposed approach only makes sense if we agree that we are not looking for a perfect match but are willing to accept reasonably good output.\nSo what First, be honest about what you need: can you live with a few mismatched records? 
If you are building data to predict trends reflecting substantial populations, the likely answer is yes; if you are building a data set to email personalised marketing communication to existing customers, the likely answer is no.\n","permalink":"https://www.thefinalartefact.xyz/post/why-regex-is-not-fuzzy-matching/","summary":"\u003cp\u003eRecently, I came across an interesting discussion on Stack Overflow^[SO discussion on: \u003ca href=\"https://stackoverflow.com/a/68182330/1655567\"\u003e\u003cem\u003eFuzzy Join with Partial String Match in R\u003c/em\u003e\u003c/a\u003e] on approaches to fuzzy matching tables in R. A good answer, contributed by one of the most resilient and excellent contributors, to whom I owe a lot of thanks, suggested relying on regular expressions, combined with basic string removal and transformations like \u003ccode\u003etoupper\u003c/code\u003e, to deterministically match the tables. The solution solved the problem and was accepted.\u003c/p\u003e","title":"Why regex is not fuzzy matching"},{"content":"Problem Solving challenges on Project Euler or HackerRank is a good pastime. For folks working in the wider analytical / data science field, places like Project Euler provide an excellent opportunity to work with academic programming concepts that do not frequently appear in real life. I was looking at a common problem:\nYou are given an unordered array consisting of consecutive integers [1, 2, 3, \u0026hellip;, n] without any duplicates. You are allowed to swap any two elements. 
Find the minimum number of swaps required to sort the array in ascending order.\nExample Perform the following steps:\ni arr swap (indices) 0 [7, 1, 3, 2, 4, 5, 6] swap (0,3) 1 [2, 1, 3, 7, 4, 5, 6] swap (0,1) 2 [1, 2, 3, 7, 4, 5, 6] swap (3,4) 3 [1, 2, 3, 4, 7, 5, 6] swap (4,5) 4 [1, 2, 3, 4, 5, 7, 6] swap (5,6) 5 [1, 2, 3, 4, 5, 6, 7] It took 5 swaps to sort the array.\nSolution First attempt After completing the problem, I like to explore the literature and search or post on CodeReview for feedback. My initial solution simply involved restarting the pass over the array after every swap and counting the swaps. As the arrays consist of unordered consecutive integers ∈ [1, 2, 3, \u0026hellip;, n], the easiest solution is to swap each misplaced value into the index equal to that value:\ndef minimumSwaps(arr): # Add zero to avoid the need of shifting the index so all the loops # are working fine arr.insert(0, 0) num_sorts = 0 while arr != sorted(arr): for idx, val in enumerate(arr): if idx != val: arr[idx], arr[val] = arr[val], arr[idx] num_sorts += 1 break return num_sorts The solution returns the correct results but times out for 6 cases that consist of bigger arrays.\nPassed tests: 8/14 Second attempt My first thought was to quickly optimise the existing code by looking for quick wins. The sorting operation in while arr != sorted(arr) could be optimised by storing the sorted object (as each array has only one sorted order that meets the criteria).\ndef minimumSwaps(arr): # Add zero to avoid the need of shifting the index so all the loops # are working fine arr.insert(0, 0) num_sorts = 0 array_sorted = sorted(arr) # First optimisation while arr != array_sorted: for idx, val in enumerate(arr): if idx != val: arr[idx], arr[val] = arr[val], arr[idx] num_sorts += 1 break return num_sorts The line array_sorted = sorted(arr) makes the solution time out only once.\nPassed tests: 13/14 Third attempt Understandably, there is unnecessary computation taking place. 
The for loop is restarted after every swap and always iterates over the elements until it finds one that is not in its correct place. This is computationally expensive. One approach would involve working with the loop to make the iteration less expensive. However, before attempting this I wanted to try another solution. Arrays that are simply sorted in reverse order can be sorted with a number of swaps equal to half of the array length. In those cases, the computation and array swapping are unnecessary. This is implemented in the example below.\ndef minimumSwaps(arr): # Check if dealing with simple reverse array, as sorting reverse array # will be equivalent to half of length if arr == sorted(arr, reverse=True): return int(len(arr) / 2) # Add zero to avoid the need of shifting the index so all the loops # are working fine arr.insert(0, 0) num_sorts = 0 array_sorted = sorted(arr) # First optimisation while arr != array_sorted: for idx, val in enumerate(arr): if idx != val: arr[idx], arr[val] = arr[val], arr[idx] num_sorts += 1 break return num_sorts Passed tests: 14/14 Conclusions I don\u0026rsquo;t remember where I\u0026rsquo;ve seen this statement but someone once said:\nRAM is cheap, thinking is expensive\nIt\u0026rsquo;s always easy and tempting to start working through a concrete example instead of approaching the problem algebraically.\nFirst, a good program computes what it has to and only that. The task is to return the number of steps, not to sort. Sorting implies actual movement of elements in the array and will potentially have a different (likely worse) memory footprint than a purely algebraic solution. The proposed solution does not adhere to this principle fully, as a purely algebraic solution is applied only to reversed arrays. Second, premature optimisation is a frequent source of bugs; also, the algorithm is often the source of poor performance. 
In the proposed solution, there is little focus on the actual performance, and optimisation is achieved mostly through cheap / common-sense steps, like storing the sorted object or skipping computation for specific cases. The main learning point is that thinking about how to achieve the outcome is frequently easier than building a detailed mental image of the process that lends itself well to estimating the computational effort.\n","permalink":"https://www.thefinalartefact.xyz/post/fun-with-sortting-arrays/","summary":"\u003ch2 id=\"problem\"\u003eProblem\u003c/h2\u003e\n\u003cp\u003eSolving challenges on \u003ca href=\"https://projecteuler.net\"\u003eProject Euler\u003c/a\u003e  or \u003ca href=\"https://www.hackerrank.com/\"\u003eHackerRank\u003c/a\u003e is a good pastime. For folks working in the wider analytical / data science field, places like Project Euler provide an excellent opportunity to work with academic programming concepts that do not frequently appear in real life. I was looking at a common problem:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003eYou are given an unordered array consisting of consecutive integers  [1, 2, 3, \u0026hellip;, n] without any duplicates. You are allowed to swap any two elements. Find the minimum number of swaps required to sort the array in ascending order.\u003cbr\u003e\nExample\nPerform the following steps:\u003c/p\u003e","title":"On Sorting Arrays...or why it's good to read the actual assignment"},{"content":"Challenge File manipulation is a frequent task unavoidable in almost every IT business process. Traditionally, file manipulation tasks are accomplished within the confines of specific tools native to a given system. As such, one may consider writing and scheduling shell scripts to undertake frequent file operations, using more specific purpose-built tools like logrotate to archive logs, or using tools like Kafka to build streaming-data pipelines. 
R is usually thought of as a statistical programming language or as an environment for statistical analysis. The fact that R is a mature programming language able to successfully accomplish a wide array of traditional tasks is frequently ignored. What constitutes a programming language is a valid question. Wikipedia offers a somewhat broad definition:\nA programming language is a formal language comprising a set of instructions that produce various kinds of output. Programming languages are used in computer programming to implement algorithms.\nWikipedia\nPros and Cons of Using R as an ETL solution The article assumes that whether to utilise R within a team for pure ETL tasks is usually considered in a context where the R language is already utilised, to a greater or lesser degree, to facilitate data modelling, analytical or statistical work. R wasn’t designed to replace ETL1 processes; nevertheless, like every other well-developed programming language, R offers a set of robust tools for accomplishing file manipulation, analysis and creation. In particular, the R package ecosystem offers a layer that provides a clean, unified interface for accomplishing a variety of tasks which, traditionally, would be solved using system tools or bespoke applications. Pros The ability to migrate our production process across systems and configurations increases. Let’s consider a simple example of generating temporary files. Utilising the base R function tempfile, or the equivalent call from the fs package, fs::file_temp(), allows for swift creation of temporary files. Achieving that on MS Windows using PowerShell could be done using a specific cmdlet called New-TemporaryFile: $tmp = New-TemporaryFile. 
When working in the command prompt we would leverage the %TEMP% and %RANDOM% environment variables to come up with something like set \u0026#34;uniqueFileName=%tmp%\\bat~%RANDOM%.tmp\u0026#34;, as discussed at length in the related Stack Overflow answer, which suggests the following: @echo off setlocal EnableExtensions rem get unique file name :uniqLoop set \u0026#34;uniqueFileName=%tmp%\\bat~%RANDOM%.tmp\u0026#34; if exist \u0026#34;%uniqueFileName%\u0026#34; goto :uniqLoop The point is that exercises like that incur additional maintenance cost. By contrast, tempfile is straightforward, well documented and understood by every R user; more importantly, we achieve a uniform cross-system implementation, which will make our project easier to maintain.\nThanks to the Rocker Project, R plays exceptionally well with Docker. This is conducive to reducing future maintenance costs and enhancing portability. A decent article covering the technicalities of using Docker with R was provided by Colin Fay.2\nR has a rich ecosystem offering API connectivity. Thanks to Plumber, R users can, with relatively little effort, generate their own APIs. Packages like wbstats and eurostat provide convenient wrappers for sourcing data from publicly available repositories (the World Bank and Eurostat, respectively). Those can be an immense time saver if we consider building processes that rely on external demographic/macroeconomic background data being incorporated on a regular basis. The subject of R / API integration merits a separate article.\nFile format conversion. Through packages like haven, R offers a rich interface to access data in different formats (Stata, SPSS, etc.). When working with bloated legacy processes that may require handling input originating from heterogeneous sources (think SAS binary, an in-house Oracle database and some periodical figures produced by the World Bank), R\u0026rsquo;s flexibility may prove an actual life saver. 
Still, we should be careful not to use R as an excuse not to refactor and clean up old production processes.\nCons Using R for file manipulation is not suitable for processes that consist overwhelmingly of ETL tasks. If our intention is to monitor a location for change and then conditionally copy files, building an R-based process will incur an unnecessary dependency, whereas more tools would\nAchieving fine-grained control over common operations may be more difficult in R. Let’s consider rsync. The Internet is full of examples of how to achieve properly defined goals using rsync; it is also possible to find a Python implementation of the algorithm3. At the time of writing this article, the attempts to bring rsync functionality to R relied on the command line tool.4\nMulti-threading. The thinking around multi-threading and asynchronous computation in R revolves around finding more efficient ways to accomplish computational effort. Packages like r-lib/async or r-lib/processx offer approaches to running processes in the background or performing asynchronous I/O operations, but those solutions are applicable to more complex processes. Parallel rsync execution can be easily achieved using xargs, as found on Stack Overflow5:\nls /srv/mail | xargs -n1 -P4 -I% rsync -Pa % myserver.com:/srv/mail/ Packages worth looking at When discussing efficiencies derivable from R in the context of file manipulation it’s worth mentioning the suitable packages available:\nPackage Description fs Cross-platform, uniform interface for file manipulation processx Executing and controlling system processes zip Cross-platform zip compression in R rsync R wrapper around rsync sftp SFTP for R eurostat, wbstats For accessing publicly available data Conclusion R-based file manipulation processes are only efficient, from a production perspective, when applied to processes containing a substantial analytical/statistical component. 
R also lends itself exceptionally well to handling processes with heterogeneous inputs and outputs. By leveraging R we can bring order to processes that previously had to utilise multiple tools to access disparate databases and ingest binary outputs. If our production process utilises a common publicly available data repository, there is a strong chance that the R package ecosystem can provide a convenient wrapper that makes regular querying and refreshing a breeze. R-based ETL exercises may prove unnecessarily onerous if our intention is to develop multi-threaded, asynchronous solutions. If our sole purpose is to run multi-threaded rsync, using R packages to arrive at comparable end results would introduce a lot of unnecessary complexity. In effect, whether to use R as an ETL backend depends on the nature of the project. In the majority of cases, building ETL pipelines using R will make sense if our project already uses R to undertake some analytical/statistical effort or there is merit in bringing in R to do some statistical work. In that case it is likely more efficient to extend our R project by coding the additional ETL in R than to rely on external tools to deliver data into and out of R.\nReferences TechTerms definition of programming language\nExtract, Transform, Load\nColin Fay; An Introduction to Docker for R Users.\nAn interesting implementation was offered by Tyler Cipriani in his blog.\nR package providing a convenient wrapper around rsync is available through the GitHub repo: INWTlab/rsync.\nThe answer to the question on Speed up rsync with Simultaneous/Concurrent File Transfers? 
is available in this Stack Overflow post: https://stackoverflow.com/a/25532027/1655567.\n","permalink":"https://www.thefinalartefact.xyz/post/file-manipulation-in-r/","summary":"\u003ch2 id=\"challenge\"\u003eChallenge\u003c/h2\u003e\n\u003cp\u003eFile manipulation is a frequent task unavoidable in almost every IT business process. Traditionally, file manipulation tasks are accomplished within the confines of specific tools native to a given system. As such, one may consider writing and scheduling shell scripts to undertake frequent file operations, using more specific purpose-built tools like \u003ca href=\"https://linux.die.net/man/8/logrotate\"\u003e\u003ccode\u003elogrotate\u003c/code\u003e\u003c/a\u003e to archive logs, or using tools like \u003ca href=\"https://kafka.apache.org/20/documentation.html\"\u003eKafka\u003c/a\u003e to build streaming-data pipelines. R is usually thought of as a statistical programming language or as an environment for statistical analysis. The fact that R is a mature programming language able to successfully accomplish a wide array of traditional tasks is frequently ignored. What constitutes a \u003cem\u003eprogramming language\u003c/em\u003e is a valid question. Wikipedia offers a somewhat broad definition:\u003c/p\u003e","title":"Using R for File Manipulation"},{"content":"Rationale Maintaining partitioned Hive tables is a frequent practice in business. Properly structured tables are conducive to achieving robust performance by speeding up query execution (see Costa, Costa, and Santos 2019). Frequent use cases pertain to creating tables with a hierarchical partition structure. In the context of data that is refreshed daily, the frequently utilised partition structure reflects years, months and days.\nCreating partitioned table In HiveQL we would create a table with this structure using the syntax below. 
In order to keep the development tidy, I’m creating a separate database on Hive which I will use for creating the tables for this article.\n-- Initially test database is created to keep the development tidy CREATE DATABASE blog COMMENT \u0026#39;Blog article samples, can be deleted.\u0026#39;; -- Example table is created CREATE TABLE blog.sample_partitioned_table ( value_column_a FLOAT COMMENT \u0026#39;Column will hold 4-byte number\u0026#39;, value_column_b DOUBLE COMMENT \u0026#39;8-byte double precision\u0026#39;, value_column_c CHAR(1) COMMENT \u0026#39;Fixed length varchar\u0026#39;) COMMENT \u0026#39;Sample partitioned table stored as text file\u0026#39; PARTITIONED BY ( part_year SMALLINT COMMENT \u0026#39;Data load year, partition\u0026#39;, part_month TINYINT COMMENT \u0026#39;Data load month, partition\u0026#39;, part_day TINYINT\tCOMMENT \u0026#39;Data load day, partition\u0026#39;) ROW FORMAT DELIMITED FIELDS TERMINATED BY \u0026#39;\\t\u0026#39; LINES TERMINATED BY \u0026#39;\\n\u0026#39; STORED AS TEXTFILE; The code snippet above achieves the following:\nTable sample_partitioned_table is created within the newly created database blog. Three value columns are defined, of FLOAT, DOUBLE and CHAR(1) types. Hive offers a fairly rich set of data types and it’s worth studying the official documentation to ensure that the selection of types is optimal for the data we want to feed into the table. If we don’t have this clarity, the wise solution may be to use more common types like INT. The blog.sample_partitioned_table is stored as a text file with fields separated by tabs and rows separated by newlines. The table defines three columns used to partition the data: the tinyint type holds values from -128 to 127, so it can be used to store day and month values, and the smallint type holds values from -32,768 to 32,767, so it’s suitable for storing years. 
For more substantial tables with frequent usage further consideration should be given to the Hive file formats as well as wider storage strategy aspects.\nInserting data from R Inserting data using packages like glue in R is trivial, and enables us to deliver highly readable code that will be easy to maintain.\nSample data In an actual production setting, we would expect that our run will generate a data consistent with the table structure that should be saved as one of partitions. A common scenario could reflect summary events data generated for specific day, in business that structure would be frequently used to develop views on periodical business activity. For the purpose of example, I’m generating some sample data in R.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 suppressPackageStartupMessages(library(tidyverse)) suppressPackageStartupMessages(library(lubridate)) ## Generating two months of data dates \u0026lt;- seq.Date( from = as.Date(\u0026#34;01-01-2010\u0026#34;, format = \u0026#34;%d-%m-%Y\u0026#34;), to = as.Date(\u0026#34;28-02-2010\u0026#34;, format = \u0026#34;%d-%m-%Y\u0026#34;), by = \u0026#34;day\u0026#34; ) ## Each data will contain two rows of values corresponding to the column types ## that were previously defined in Hive sample_data \u0026lt;- map_dfr(.x = dates, ~ tibble(val_a = runif(2), val_b = runif(2), val_c = sample(letters, 2), update_year = year(.x), update_month = month(.x), update_day = day(.x))) The created data looks as follows:\n1 head(sample_data) 1 2 3 4 5 6 7 8 9 ## # A tibble: 6 × 6 ## val_a val_b val_c update_year update_month update_day ## \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;int\u0026gt; ## 1 0.566 0.0228 c 2010 1 1 ## 2 0.0319 0.0220 k 2010 1 1 ## 3 0.338 0.993 a 2010 1 2 ## 4 0.794 0.517 o 2010 1 2 ## 5 0.934 0.727 y 2010 1 3 ## 6 0.0883 0.385 j 2010 1 3 Following the successful creation of the dummy data we are in position to easily leverage 
the desired data structure. Using the sparklyr package I’m creating a local connection.\nsuppressPackageStartupMessages(library(sparklyr)) sc \u0026lt;- spark_connect(master = \u0026#34;local\u0026#34;) For the purpose of the article I’ve also executed the HiveQL provided above via Spark, so that structurally equivalent data structures are available and the example code executes smoothly. Naturally, in a production setting, we would seldom look to create a new Hive schema from an R script layer. Similarly, core tables storing results would usually be established outside regular production processes.\n## \u0026lt;DBISparkResult\u0026gt; ## SQL DROP TABLE IF EXISTS blog.sample_partitioned_table ## ROWS Fetched: 0 [complete] ## Changed: 0 ## \u0026lt;DBISparkResult\u0026gt; ## SQL DROP DATABASE IF EXISTS blog ## ROWS Fetched: 0 [complete] ## Changed: 0 res_DBI_data \u0026lt;- DBI::dbSendQuery(sc, \u0026#34;CREATE DATABASE blog COMMENT \u0026#39;Blog article samples, can be deleted.\u0026#39;\u0026#34;) res_DBI_tble \u0026lt;- DBI::dbSendQuery(sc, \u0026#34;CREATE TABLE blog.sample_partitioned_table ( value_column_a FLOAT COMMENT \u0026#39;Column will hold 4-byte number\u0026#39;, value_column_b DOUBLE COMMENT \u0026#39;8-byte double precision\u0026#39;, value_column_c CHAR(1) COMMENT \u0026#39;Fixed length varchar\u0026#39;) COMMENT \u0026#39;Sample partitioned table stored as text file\u0026#39; PARTITIONED BY ( part_year SMALLINT COMMENT \u0026#39;Data load year, partition\u0026#39;, part_month TINYINT COMMENT \u0026#39;Data load month, partition\u0026#39;, part_day TINYINT\tCOMMENT \u0026#39;Data load day, partition\u0026#39;) ROW FORMAT DELIMITED FIELDS TERMINATED BY \u0026#39;\\t\u0026#39; LINES TERMINATED BY \u0026#39;\\n\u0026#39; STORED AS TEXTFILE\u0026#34;) Insert mechanism Where the DBI package comes in extremely handy is in inserting data into partitions. 
In the context of our sample data we will want to populate every single partition with the respective modelling results. Courtesy of the map_dfr function the “results” are available in one table, but the proposed method can be easily modified and applied across other object structures, like lists. As a first step we will copy the existing sample data into Spark.\ntbl_sprk \u0026lt;- copy_to(sc, sample_data, \u0026#34;spark_sample_data\u0026#34;) In Spark our data is visible as spark_sample_data; we will use that table to insert our partition elements into permanent storage.\nsuppressPackageStartupMessages(library(DBI)) suppressPackageStartupMessages(library(glue)) res_pmap \u0026lt;- pmap( .l = select(sample_data, update_year, update_month, update_day), .f = ~ DBI::dbSendQuery(sc, glue(\u0026#34;INSERT INTO TABLE blog.sample_partitioned_table PARTITION (part_year={..1}, part_month={..2}, part_day={..3}) SELECT val_a, val_b, val_c FROM spark_sample_data WHERE update_year={..1} AND update_month={..2} AND update_day={..3}\u0026#34;))) Let’s unpack the code above. Our key goals are:\nOur aim is to populate partitions in our permanent Hive table blog.sample_partitioned_table, hence the statement INSERT INTO TABLE blog.sample_partitioned_table We are working with some modelling/analytical data that currently sits in our Spark session as spark_sample_data and we want the relevant results in the data to land in the prescribed partitions on Spark What happens in the process is as follows:\nWe are generating a list of vectors with partition identifiers to iterate over. As I’ve created this sample data in memory in the current session I can just refer to those items using select, in the following manner: select(sample_data, update_year, update_month, update_day) I’m interested in iterating over each column simultaneously and the pmap function is excellent for that. 
Using ~ notation offered in pmap I will be looking to refer to first object as ..1 to the second as ..2 and so on. Glue package is used to insert strings with partitions identifier into the respective partition names. SELECT runs on spark RDD and also uses partition identifiers to get only subset of the data we are interested in and insert that subset into the desired partition. Results Following the operation above we can now explore the populated storage table. Sparklyr’s sdf_num_partitions can be used to get a number of existing partitions. Tibble’s glimpse can be used against the Spark data to get the preview of the created table.\n1 2 tbl_perm \u0026lt;- tbl(sc, \u0026#34;blog.sample_partitioned_table\u0026#34;) sdf_num_partitions(tbl_perm) 1 ## [1] 118 1 glimpse(tbl_perm) 1 2 3 4 5 6 7 8 9 ## Rows: ?? ## Columns: 6 ## Database: spark_connection ## $ value_column_a \u0026lt;dbl\u0026gt; 0.93434298, 0.08828931, 0.93434298, 0.08828931, 0.78982… ## $ value_column_b \u0026lt;dbl\u0026gt; 0.72664445, 0.38482719, 0.72664445, 0.38482719, 0.56651… ## $ value_column_c \u0026lt;chr\u0026gt; \u0026#34;y\u0026#34;, \u0026#34;j\u0026#34;, \u0026#34;y\u0026#34;, \u0026#34;j\u0026#34;, \u0026#34;f\u0026#34;, \u0026#34;e\u0026#34;, \u0026#34;f\u0026#34;, \u0026#34;e\u0026#34;, \u0026#34;y\u0026#34;, \u0026#34;j\u0026#34;, \u0026#34;y\u0026#34;, … ## $ part_year \u0026lt;int\u0026gt; 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2… ## $ part_month \u0026lt;int\u0026gt; 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1… ## $ part_day \u0026lt;int\u0026gt; 3, 3, 3, 3, 27, 27, 27, 27, 14, 14, 14, 14, 24, 24, 24,… Summary Convenient and flexible functions facilitating string manipulations available in R make metaprogramming1 in R easy. Generating and manipulating Hive statements as strings may not be the most efficient strategy in the light of the API’s offered via sparklyr or dbplyr. 
Nevertheless, it is possible to spot instances where R code makes those coding challenges particularly easy to solve and to maintain.\nReferences Costa, Eduarda, Carlos Costa, and Maribel Yasmina Santos. 2019. “Evaluating Partitioning and Bucketing Strategies for Hive‐based Big Data Warehousing Systems.” Journal of Big Data 6 (34): 1–38. https://doi.org/10.1186/s40537-019-0196-1.\nMetaprogramming is a programming paradigm that treats other programs as data. In a business BI setting, metaprogramming is frequently used to generate efficiencies in routine data handling tasks, such as automating generation of SQL statements for importing data.\n","permalink":"https://www.thefinalartefact.xyz/post/inserting-data-into-partitioned-table/","summary":"\u003ch2 id=\"rationale\"\u003eRationale\u003c/h2\u003e\n\u003cp\u003eMaintaining partitioned Hive tables is a frequent practice in a business. Properly structured tables are conducive to achieving robust performance through speeding up query execution (see Costa, Costa, and Santos 2019). Frequent use cases pertain to creating tables with hierarchical partition structure. In context of a data that is refreshed daily, the frequently utilised partition structure reflects years, months and dates.\u003c/p\u003e\n\u003ch2 id=\"creating-partitioned-table\"\u003eCreating partitioned table\u003c/h2\u003e\n\u003cp\u003eIn HiveQL we would create the table with the following structure using the syntax below. In order to keep the development tidy, I’m creating a separate database on Hive which I will use for the purpose of creating tables for this article.\u003c/p\u003e","title":"Inserting Data into Partitioned Table"},{"content":"Introduction This article draws on the past post concerned with utilisation of golem for robust deployment of analytical and reporting solutions. 
For this article, we will assume that we are working with defined working requirements that utilise some of the Labour Market Statistics disseminated through the nomis portal.\nChange Plan What we have Reporting requirements Past scripts we used to create reports with accompanying instructions What we want Stronger business continuity - we want to be able to give someone access to this project and not be concerned with missing files, outdated or unavailable documentation and questions on how to produce updated reports. We want a self-contained entity that takes care of its technical requirements and user-interaction^[Good parallel can be drawn between this approach and manuals available with life-saving equipment. Equipment delivers technical capacity and manual ensures operational capacity. In case of an inexperienced user one is not useful without the other. We want to ensure that user with minimum required capacity can use the tools correctly.] Better reproducibility - Easier way to re-run reports on custom parameters Improved efficiency - We want to have a possibility of quickly creating updated and re-running past reports using the app. Better development: We want to ensure that any change requests to our reporting/analytical stack won\u0026rsquo;t break crucial functionalities. We want to modularise development so new corporate branding or visualisation requirements can be applied with no (or minimal) integration in analytical function Framework Package Future robust development owes a lot to solid foundations. As the aim is to capitalise on the robust R package architecture, we will look to leverage available supporting packages. As a first step, we will construct a new Shiny/R package infrastructure using golem.\ngolem::create_golem(path = \u0026#34;nomisReports\u0026#34;) Running the function with default options gave us the following folder structure:\n. 
├── DESCRIPTION ├── dev │ ├── 01_start.R │ ├── 02_dev.R │ ├── 03_deploy.R │ └── run_dev.R ├── inst │ ├── app │ │ └── www │ │ └── favicon.ico │ └── golem-config.yml ├── man │ └── run_app.Rd ├── NAMESPACE ├── nomisReports.Rproj ├── R │ ├── app_config.R │ ├── app_server.R │ ├── app_ui.R │ └── run_app.R └── sessionInfoLog 7 directories, 15 files The structure corresponds to what we would expect to find in traditional package development. There are a number of files specific to golem, which are explained later.\nUsing usethis The package structure is a scaffolding that can be used to develop package functionality and integrate the remaining elements. Subsequent articles focus on leveraging robust and tested methods for assembling that scaffolding, such as the usethis package, which offers a variety of commands for adding package elements.\n","permalink":"https://www.thefinalartefact.xyz/post/sample-analytical-app-with-shiny/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eThis article draws on the past post concerned with utilisation of \u003ca href=\"https://github.com/ThinkR-open/golem\"\u003e\u003ccode\u003egolem\u003c/code\u003e\u003c/a\u003e for robust deployment of analytical and reporting solutions. 
For this article, we will assume that we are working with defined working requirements that utilise some of the Labour Market Statistics disseminated through the \u003ca href=\"https://www.nomisweb.co.uk\"\u003e\u003cem\u003enomis\u003c/em\u003e\u003c/a\u003e portal.\u003c/p\u003e\n\u003ch2 id=\"change-plan\"\u003eChange Plan\u003c/h2\u003e\n\u003ch3 id=\"what-we-have\"\u003eWhat we have\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eReporting requirements\u003c/li\u003e\n\u003cli\u003ePast scripts we used to create reports with accompanying instructions\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"what-we-want\"\u003eWhat we want\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cem\u003eStronger business continuity\u003c/em\u003e - we want to be able to give someone access to this project and not be concerned with missing files, outdated or unavailable documentation and questions on how to produce updated reports. We want a self-contained entity that takes care of its technical requirements and user-interaction^[Good parallel can be drawn between this approach and manuals available with life-saving equipment. Equipment delivers technical capacity and manual ensures operational capacity. In case of an inexperienced user one is not useful without the other. 
We want to ensure that user with minimum required capacity can use the tools correctly.]\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eBetter reproducibility\u003c/em\u003e - Easier way to re-run reports on custom parameters\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eImproved efficiency\u003c/em\u003e - We want to have a possibility of quickly creating updated and re-running past reports using the app.\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eBetter development:\u003c/em\u003e\n\u003cul\u003e\n\u003cli\u003eWe want to ensure that any change requests to our reporting/analytical stack won\u0026rsquo;t break crucial functionalities.\u003c/li\u003e\n\u003cli\u003eWe want to modularise development so new corporate branding or visualisation requirements can be applied with no (or minimal) integration in analytical function\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"framework\"\u003eFramework\u003c/h2\u003e\n\u003ch2 id=\"package\"\u003ePackage\u003c/h2\u003e\n\u003cp\u003eFuture robust development owes a lot to solid foundations. As the aim is to capitalise on the robust R package architecture, we will look to leverage available supporting packages. As a first step, we will construct a new Shiny/R package infrastructure using \u003ccode\u003egolem\u003c/code\u003e.\u003c/p\u003e","title":"Poor Man's Robust Shiny App Deployment (Part II)"},{"content":"Not so uncommon problem RStudio Connect and more modest Shiny Proxy come to mind as most obvious solutions for deploying Shiny applications in production. Application servers are ideal for deploying applications that are to be consumed on a regular basis by larger audiences. In addition to serving the application, managing dependencies and user access or logging user activity are common tasks we would expect for a publishing platform to address. Frequently, however, deployment of Shiny application is directed at smaller audiences and less frequent usage. 
In such a situation, availability, accessibility and user access management requirements will often be more modest. Commonly, in business a modelling or analytical solution can be packaged in a Shiny application facilitating periodical re-runs of models with different parameters and updated data sets. Such solutions can be conveniently utilised to facilitate development of monthly or quarterly reports. If the app is used once per month/quarter by a narrow user group, the need to deploy it on a server is not well articulated. In that particular case we are mostly interested in ensuring that we can:\nProtect dependencies of our Shiny application Ensure that the relevant documentation is easily accessible Embed a testing mechanism so we can be confident that we can safely re-run our model on new data or with changed parameters It is worth remarking that we can likely live without some of the things that proper deployment solutions offer. If the app is being utilised on an infrequent basis, we can very likely accommodate a little bit of occasional effort in putting the application into production. In the majority of cases, an additional 5 - 15 minutes of work will seldom cause a significant impediment in the case of an analytical product that is used on a monthly or quarterly basis. Within a small user group it is also likely that we don\u0026rsquo;t have a strong need to manage complex user access and permissions. In effect, we are mostly interested in packaging our modelling and reporting solution for future, infrequent, usage.\nWhat R community has to offer If we remove the Shiny application words from the opening paragraph, the majority of criteria and our key deliverables correspond to what we would usually expect to find in an R package. R packages have a robust way of storing documentation, and facilitate comprehensive unit testing and dependency management. 
On the basic level the dependency management is well addressed by the DESCRIPTION file that constitutes an integral part of each R package.\nThe job of the DESCRIPTION file is to store important metadata about your package. When you first start writing packages, you’ll mostly use these metadata to record what packages are needed to run your package. However, as time goes by and you start sharing your package with others, the metadata file becomes increasingly important because it specifies who can use it (the license) and whom to contact (you!) if there are any problems.\n--- Hadley Wickham R ecosystem offers a number of sophisticated alternatives to managing dependencies that may be more suitable for a collaborative project progressed in an organisational and/or corporate environment.^[Useful list is mentioned in this discussion at Stack Overflow.] As we are looking at a solution that would be usable from the perspective of an advanced analyst, it is safe to assume that wise use of the DESCRIPTION file will suffice. The same goes for testing and documentation development/storage. R package architecture facilitates storing and sharing of documentation, whereas tools like roxygen2 facilitate seamless documentation development. R community offers a number of approaches to unit testing, with the testthat package being one of the most popular ones.\nWhat about Shiny? The tools briefly described above will be familiar to anyone who has had some exposure to developing R packages. The key question, which still remains unanswered, pertains to any potential way we could leverage this architecture in order to facilitate deployment and sharing of a Shiny application. Fortunately, R community offers an excellent solution facilitating distribution of Shiny applications using R package architecture. 
Developed by ThinkR, the golem package facilitates development of R packages that focus on embedding Shiny applications.\nUnderbelly Authors of golem did an excellent job in creating robust in-package documentation and providing vignettes on development, deployment and configuration as well as a comprehensive book^[Fay, C, Rochette, S, Guyader, V, Girard C 2020, Engineering Production-Grade Shiny Apps https://engineering-shiny.org/]. In order to understand how golem works it\u0026rsquo;s necessary to recognise that a Shiny Application is in effect a collection of ordinary R functions, some of which will be concerned with translating R code into user interface. An excellent outline of the Shiny Application structure is provided in the Mastering Shiny book by Hadley Wickham.\nDevelopment strategy The article makes an assumption that our Shiny Application is to serve as a wrapper for an existing modelling or analytical solution. In that case we would consider that we are working with a set of scripts akin to a Drake pipeline.^[Drake is an R package facilitating development of reproducible research projects, more information can be found on the official GitHub page.] In practice, this would imply that we have a set of functions that we are ready to package into a production-ready solution. Usually we would have a set of functions that we may use to query corporate databases via DBI or a similar solution, some pre-processing and modelling work and artefact-generating functions, such as wrappers for ggplot2 graphics and tables. Our final product would usually be in the form of a report and/or presentation. In practice, our quarterly report updating cycle will likely look like the one illustrated below.\n[Figure: quarterly report updating cycle]\nIn practical terms, yellow boxes usually imply communication with consumers of the report and seeking clarification on usefulness of the provided analysis. 
It may be that availability of new data induced us to ask questions that weren\u0026rsquo;t previously observable. For the sake of efficiency, we would strive to automate the remaining processes as much as possible. If we had sufficient foresight, we will have embedded most of our modelling and reporting work in manageable functions. Embedding a Shiny application on that layer gives us another possibility of further automating the process of report delivery and re-running analysis.\n","permalink":"https://www.thefinalartefact.xyz/post/poor-man-s-robust-shiny-app-deployment/","summary":"\u003ch2 id=\"not-so-uncommon-problem\"\u003eNot so uncommon problem\u003c/h2\u003e\n\u003cp\u003e\u003ca href=\"https://rstudio.com/products/connect/\"\u003eRStudio Connect\u003c/a\u003e and more modest \u003ca href=\"https://www.shinyproxy.io\"\u003eShiny Proxy\u003c/a\u003e come to mind as most obvious solutions for deploying Shiny applications in production. Application servers are ideal for deploying applications that are to be consumed on a regular basis by larger audiences. In addition to serving the application, managing dependencies and user access or logging user activity are common tasks we would expect for a publishing platform to address. Frequently, however, deployment of Shiny application is directed at smaller audiences and less frequent usage. In such a situation, availability, accessibility and user access management requirements will often be more modest. Commonly, in business a modelling or analytical solution can be packaged in a Shiny application facilitating periodical re-runs of models with different parameters and updated data sets. Such solutions can be conveniently utilised to facilitate development of monthly or quarterly reports. If the app is used once per month/quarter by a narrow user group, the need to deploy it on a server is not well articulated. 
In that particular case we are mostly interested in ensuring that we can:\u003c/p\u003e","title":"Poor Man's Robust Shiny App Deployment"},{"content":"Is there a merit for a three-way operator in R?\nBackground In C++20 revision added \u0026ldquo;spaceship operator\u0026rdquo;, which is defined as follows:\n1 2 3 (a \u0026lt;=\u0026gt; b) \u0026lt; 0 # if lhs \u0026lt; rhs (a \u0026lt;=\u0026gt; b) \u0026gt; 0 # if lhs \u0026gt; rhs (a \u0026lt;=\u0026gt; b) == 0 # if lhs and rhs are equal/equivalent. R implementation The behaviour can be achieved in R in multiple ways. A one straightforward approach would involve making use of the ifelse statement\nifelse implementation Basic approach would involve comparing the two figures and respectively returning -1 or 1 consistently with the definition above.\n1 2 3 a \u0026lt;- 1 b \u0026lt;- 2 ifelse(a \u0026lt; b, -1, 1) 1 ## [1] -1 The shortcoming of this approach is that adhering to (a \u0026lt;=\u0026gt; b) == 0 condition would require extra ifelse statement.\n1 2 3 a \u0026lt;- 1 b \u0026lt;- 1 ifelse(a \u0026lt; b, -1, 1) 1 ## [1] 1 The result above is wrong as consistently with the definition the operator should return zero. This can be achieved with the following solution.\n1 2 3 a \u0026lt;- 1 b \u0026lt;- 1 ifelse(a == b, 0, ifelse(a \u0026lt; b, -1, 1)) 1 ## [1] 0 1 2 3 a \u0026lt;- 1 b \u0026lt;- 2 ifelse(a == b, 0, ifelse(a \u0026lt; b, -1, 1)) 1 ## [1] -1 1 2 3 a \u0026lt;- 2 b \u0026lt;- 1 ifelse(a == b, 0, ifelse(a \u0026lt; b, -1, 1)) 1 ## [1] 1 Challenges There are few challenges pertaining to the implementation above. The one that is particularly stark pertains to type conversion. For a being \u0026quot;a\u0026quot; R returns TRUE.\n1 2 3 a \u0026lt;- \u0026#34;a\u0026#34; b \u0026lt;- 1 ifelse(a \u0026gt; b, TRUE, FALSE) 1 ## [1] TRUE This is due to the implementation of comparison on atomic vectors. 
As stated in ?Comparison documentation:\nIf the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.\nThis behaviour may be confusing, especially to those expecting comparison operators to act as strict equality.^[Relevant StackOverflow discussion on the subject.] Developers au fait with JavaScript may not be surprised by R\u0026rsquo;s behaviour as they will be familiar with == and ===, where the former compares variables after type conversion and the latter also checks the type of the variable. In R the behaviour of === can be obtained with use of ?identical.\nThe other interesting challenge is concerned with vectorisation. For vectors of unequal sizes we get a warning but also an answer.\na \u0026lt;- c(2,2,2) b \u0026lt;- c(1,1) ifelse(a == b, 0, ifelse(a \u0026lt; b, -1, 1)) ## Warning in a == b: longer object length is not a multiple of shorter object ## length ## Warning in a \u0026lt; b: longer object length is not a multiple of shorter object ## length ## [1] 1 1 1 Finally, the syntax of our nested ifelse statement is not convenient. We could wrap the whole thing in a simple function\nthree_way \u0026lt;- function(a, b) { ifelse(a == b, 0, ifelse(a \u0026lt; b, -1, 1)) } Still this is not as convenient as calling this operator in-line lhs \u0026lt;=\u0026gt; rhs in C++ fashion.\nInfix implementation A fairly neat solution can be obtained with use of an infix operator. Infix operators are common and frequently used in R. For instance, the statement\n5 + 2 ## [1] 7 is equivalent to\n`+`(5,2) ## [1] 7 User-defined infix functions can be created by defining functions whose names start and end with %. 
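As a minimal sketch of that naming rule (the operator name %plus% is hypothetical and chosen purely for illustration), a user-defined infix function is an ordinary two-argument function assigned to a backtick-quoted name:

```r
# Hypothetical operator, for illustration only: %plus% simply wraps `+`
`%plus%` <- function(lhs, rhs) {
  lhs + rhs
}

# In-line call, equivalent to `%plus%`(5, 2)
result <- 5 %plus% 2
print(result)
## [1] 7
```

The value on the left is passed as the first argument and the value on the right as the second, which is the mechanism the spaceship-style operator in this post relies on.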
Let\u0026rsquo;s assume that we want to achieve the following objectives:\nControl for types of compared vectors Control for size of compared vectors Have control over the outcome: Staying strict with the provided C++ implementation Modifying this behaviour, by for instance, returning the bigger object This can be quickly achieved using the following implementation.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 `%\u0026lt;=\u0026gt;%` \u0026lt;- function(lhs, rhs) { if (typeof(lhs) != typeof(rhs)) { warning(\u0026#34;Left and right hand operators are not of identical types.\u0026#34;) } # Single comparison function f_check \u0026lt;- function(lhs, rhs) { if (lhs \u0026gt; rhs) { lhs } else if (lhs \u0026lt; rhs) { rhs } else if (lhs == rhs) { 0 } } # Run on each element of vector purrr::modify2(.x = lhs, .y = rhs, .f = f_check) } In effect, the results obtained through the first set of ifelse statements can be easily achieved using max. The only - albeit very minor - advantage of the implementation is that it would be easily to modify it to behave in a manner consistent with the original implementation. Instead of returning lhs or rhs we would look to return -1 and 1 as in the example below.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 `%\u0026lt;=\u0026gt;%` \u0026lt;- function(lhs, rhs) { if (typeof(lhs) != typeof(rhs)) { warning(\u0026#34;Left and right hand operators are not of identical types.\u0026#34;) } # Single comparison function f_check \u0026lt;- function(lhs, rhs) { if (lhs \u0026gt; rhs) { 1 } else if (lhs \u0026lt; rhs) { -1 } else if (lhs == rhs) { 0 } } # Run on each element of vector purrr::modify2(.x = lhs, .y = rhs, .f = f_check) } Vectorisation and type checking I like purrr::modify2 due to consistent error messages it gives. Let\u0026rsquo;s say that we mistakenly compare vectors that are of different lengths. 
This would result in a following error message.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 `%\u0026lt;=\u0026gt;%` \u0026lt;- function(lhs, rhs) { if (typeof(lhs) != typeof(rhs)) { warning(\u0026#34;Left and right hand operators are not of identical types.\u0026#34;) } # Single comparison function f_check \u0026lt;- function(lhs, rhs) { if (lhs \u0026gt; rhs) { 1 } else if (lhs \u0026lt; rhs) { -1 } else if (lhs == rhs) { 0 } } # Run on each element of a vector purrr::modify2(.x = lhs, .y = rhs, .f = f_check) } a \u0026lt;- c(1,2,3) b \u0026lt;- c(1,2) a %\u0026lt;=\u0026gt;% b 1 2 ## Error in `map2()`: ## ! Can\u0026#39;t recycle `.x` (size 3) to match `.y` (size 2). Summary The need for the actual %\u0026lt;=\u0026gt;% is scant as ifelse and max efficiently fulfil that role. Nevertheless, creating spaceship operator in R is trivial and demonstrates flexibility of the language very well. I have a similar implementation in a package that I use to store KEmisc package that I use to store, trivial, handy functions.\n","permalink":"https://www.thefinalartefact.xyz/post/three-way-operator-in-r/","summary":"\u003cp\u003eIs there a merit for a three-way operator in R?\u003c/p\u003e\n\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eIn C++20 revision added \u0026ldquo;spaceship operator\u0026rdquo;, \u003ca href=\"https://en.cppreference.com/w/cpp/language/operator_comparison#Three-way_comparison\"\u003ewhich is defined as follows\u003c/a\u003e:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cdiv class=\"chroma\"\u003e\n\u003ctable class=\"lntable\"\u003e\u003ctr\u003e\u003ctd class=\"lntd\"\u003e\n\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode\u003e\u003cspan class=\"lnt\"\u003e1\n\u003c/span\u003e\u003cspan class=\"lnt\"\u003e2\n\u003c/span\u003e\u003cspan class=\"lnt\"\u003e3\n\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/td\u003e\n\u003ctd class=\"lntd\"\u003e\n\u003cpre tabindex=\"0\" 
class=\"chroma\"\u003e\u003ccode class=\"language-r\" data-lang=\"r\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003ea\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026lt;=\u0026gt;\u003c/span\u003e \u003cspan class=\"n\"\u003eb\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026lt;\u003c/span\u003e \u003cspan class=\"m\"\u003e0\u003c/span\u003e  \u003cspan class=\"c1\"\u003e# if lhs \u0026lt; rhs\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003ea\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026lt;=\u0026gt;\u003c/span\u003e \u003cspan class=\"n\"\u003eb\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026gt;\u003c/span\u003e \u003cspan class=\"m\"\u003e0\u003c/span\u003e  \u003cspan class=\"c1\"\u003e# if lhs \u0026gt; rhs\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003ea\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026lt;=\u0026gt;\u003c/span\u003e \u003cspan class=\"n\"\u003eb\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e \u003cspan class=\"o\"\u003e==\u003c/span\u003e \u003cspan class=\"m\"\u003e0\u003c/span\u003e \u003cspan class=\"c1\"\u003e# if lhs and rhs are equal/equivalent.\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/table\u003e\n\u003c/div\u003e\n\u003c/div\u003e\u003ch2 id=\"r-implementation\"\u003eR implementation\u003c/h2\u003e\n\u003cp\u003eThe behaviour can be achieved in R in multiple ways. 
A one straightforward approach would involve making use of the \u003ccode\u003eifelse\u003c/code\u003e statement\u003c/p\u003e\n\u003ch2 id=\"ifelse-implementation\"\u003e\u003ccode\u003eifelse\u003c/code\u003e implementation\u003c/h2\u003e\n\u003cp\u003eBasic approach would involve comparing the two figures and respectively returning \u003ccode\u003e-1\u003c/code\u003e or \u003ccode\u003e1\u003c/code\u003e consistently with the definition above.\u003c/p\u003e","title":"Three-Way Operator in R"},{"content":"Background The post covers installation of Hortonworks Sandbox (HD) on Mac using Docker. In software development, sandbox describes a testing environment that can be used to isolate untested code changes from a production code. Hortonworks Sandbox provides such an environment with the Hortonworks Data Platform installed. Hortonworks Data Platform is an open source framework facilitating distributed storage and processing large volumes of data.\nDeploying system for distributed processing within a single computer may seem like a counter-intuitive idea but it\u0026rsquo;s actually a very common practice. Most frequent use cases involve various learning / professional development activities where one may be interested in learning new technology or simply exploring available interfaces. Other frequent use case pertains to various demos, where there may be a need to demonstrate product capabilities and accessing proper, production environment could be cumbersome.\nInstallation Docker Docker, which is now used by millions of developers, is and open platform for developing, shipping and running applications. More technically, Docker uses OS-level virtualisation to deliver software packages called containers. 
Before installing docker let\u0026rsquo;s check for the existing installation.\nwhich -a docker ## /usr/local/bin/docker docker -v ## Docker version 20.10.2, build 2291f61 Assuming that the commands above did not produce results we can install docker with use of Homebrew. To install Homebrew:\n/bin/bash -c \u0026#34;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\u0026#34; We can then proceed with installing docker. The command will install the \u0026lsquo;cask\u0026rsquo; version of Docker. In Homebrew lingo, \u0026lsquo;casks\u0026rsquo; are applications with Graphical User Interfaces. It is possible to install Docker without GUI by running brew install docker.\nbrew install --cask docker HDP The HDP can be installed with the provided script. The script can be downloaded from\n","permalink":"https://www.thefinalartefact.xyz/post/installing-hortonworks-sanbox-deployment-hdp-on-docker-mac/","summary":"\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eThe post covers installation of Hortonworks Sandbox (HD) on Mac using Docker. In software development, \u003cem\u003esandbox\u003c/em\u003e describes a testing environment that can be used to isolate untested code changes from a production code. Hortonworks Sandbox provides such an environment with the Hortonworks Data Platform installed. Hortonworks Data Platform is an open source framework facilitating distributed storage and processing large volumes of data.\u003c/p\u003e\n\u003cp\u003eDeploying system for distributed processing \u003cem\u003ewithin\u003c/em\u003e a single computer may seem like a counter-intuitive idea but it\u0026rsquo;s actually a very common practice. Most frequent use cases involve various learning / professional development activities where one may be interested in learning new technology or simply exploring available interfaces. 
Another frequent use case pertains to various demos, where there may be a need to demonstrate product capabilities and accessing a proper production environment could be cumbersome.\u003c/p\u003e","title":"Installing Hortonworks Sandbox on Mac with Docker"},{"content":"TL;DR If you want to see the implemented solution, please refer to: GitHub repo.\nContext Shiny is a widely popular web application framework for R. In simple terms, it enables any R programmer to develop and deploy a web application. This application could be simple, such as an interactive document consisting of a few charts and tables, or a complex \u0026ldquo;behemoth\u0026rdquo; with multiple functionalities enabling end users to run models, query external data, and generate exportable reports and sophisticated visuals.\nIn business, it is frequently efficient to capitalise on an existing solution and develop / upgrade existing products. In a business intelligence web application context this may mean adding modules and functionalities to an existing application so a wider audience can be served and more user needs can be met.\nThis has certain advantages, mostly:\nThe deployment cycle is faster, as we are working on an existing product and adding new functionalities We are leveraging the existing dissemination mechanism, authentication, server and data connections. Where possible we can use tested and working solutions. This is conducive to stability. Testing will be easier as we have an established user group that is familiar with the product Challenges The challenges are more interesting, and they are what I\u0026rsquo;m going to discuss. The challenges can be grouped into two main themes:\nDevelopment User experience Development Let\u0026rsquo;s address the development challenges first. Traditionally, a Shiny application would consist of server.R and ui.R, each file containing the function definitions for the server and the user interface respectively. More recently, both are often saved in a single file, as in the examples provided with the Shiny package. 
To see the available examples, run:\nlist.files( path = system.file(\u0026#34;examples\u0026#34;, package = \u0026#34;shiny\u0026#34;), pattern = glob2rx(\u0026#34;*.R\u0026#34;), recursive = TRUE, full.names = TRUE ) Now this won\u0026rsquo;t fly. Best case scenario, we will end up with spaghetti code; most likely we will end up with nothing.\nModules Modules solve that problem by breaking down the application architecture into\n","permalink":"https://www.thefinalartefact.xyz/post/interactive-module/","summary":"\u003ch2 id=\"tldr\"\u003eTL;DR\u003c/h2\u003e\n\u003cp\u003eIf you want to see the implemented solution, please refer to:\nGitHub repo.\u003c/p\u003e\n\u003ch2 id=\"context\"\u003eContext\u003c/h2\u003e\n\u003cp\u003eShiny is a widely popular web application framework for R. In simple terms, it enables any R programmer to develop and deploy a web application. This application could be simple, such as an interactive document consisting of a few charts and tables, or a complex \u0026ldquo;behemoth\u0026rdquo; with multiple functionalities enabling end users to run models, query external data, and generate exportable reports and sophisticated visuals.\u003c/p\u003e","title":"Interactively Loading Shiny Modules"},{"content":"In Stata, the plot command can be used to draw a simple scatter plot in the Stata console. As of Stata 8, plot is no longer supported but remains a useful tool for quickly exploring relationships between variables. Using plot on the auto data provides the following results:\nNow the question is: can we achieve the same level of convenience in R? Of course. The txtplot package authored by Bjoern Bornkamp provides similar functionality. Executing the code below will generate a nice text plot straight in the R console:\n","permalink":"https://www.thefinalartefact.xyz/post/ascii-charts-in-r/","summary":"\u003cp\u003eIn Stata, the plot command can be used to draw a simple scatter plot in the Stata console. 
As of Stata 8, plot is no longer supported but remains a useful tool for quickly exploring relationships between variables. Using plot on the auto data provides the following results:\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"/post/ascii-charts-in-r/images/stata_text_plot.png\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n  \u003cimg alt=\"Stata Textual Plot\" loading=\"lazy\" src=\"/post/ascii-charts-in-r/images/stata_text_plot.png\"\u003e\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eNow the question is: can we achieve the same level of convenience in R? Of course. The txtplot package authored by Bjoern Bornkamp provides similar functionality. Executing the code below will generate a nice text plot straight in the R console:\u003c/p\u003e","title":"ASCII charts in R"},{"content":"After developing the Shiny app sourcing live labour market data from NOMIS, I wanted a convenient way of managing rows in the legend. In particular, I wanted to account for the situation where the end user may select a number of geographies that will only conveniently fit into two or more rows. After transposing the data to long format, working out the number of elements in the legend is relatively simple, as it will correspond to the number of unique geographies passed via the subset command.\ng \u0026lt;- guide_legend(title = \u0026#34;Geography\u0026#34;, title.position = \u0026#39;top\u0026#39;, nrow = (if(length(unique(dta.chrt$GEOGRAPHY_NAME)) \u0026gt; 5) 2 else 1), title.theme = element_text(size = 14, face = \u0026#39;bold\u0026#39;, angle = 360)) For a small number of values we can conveniently get a small legend with one row:\nwhereas a legend with a large number of geographies is conveniently resized:\n","permalink":"https://www.thefinalartefact.xyz/post/managing-rows-in-the-ggplot-legend/","summary":"\u003cp\u003eAfter developing the Shiny app sourcing live labour market data from NOMIS, 
I wanted a convenient way of managing rows in the legend. In particular, I wanted to account for the situation where the end user may select a number of geographies that will only conveniently fit into two or more rows. After transposing the data to long format, working out the number of elements in the legend is relatively simple, as it will correspond to the number of unique geographies passed via the subset command.\u003c/p\u003e","title":"Managing rows in the ggplot legend"},{"content":"In the unlikely scenario that beautiful Shiny apps do not meet your analytical requirements and developing a full-blown user interface in RGtk2 seems a little too much, there is a third, often overlooked solution: the svDialogs package by Philippe Grosjean. The package makes it convenient to create various interface gadgets. For example, the code:\nrequire(svDialogs) ## Let\u0026#39;s keep some data in one place user_figure \u0026lt;- svDialogs::dlg_input() would result in the following window being presented to the user:\nIn this case the code will return the following object:\nstr(user_figure) ## Classes \u0026#39;nativeGUI\u0026#39;, \u0026#39;textCLI\u0026#39;, \u0026#39;gui\u0026#39;, \u0026#39;environment\u0026#39; \u0026lt;environment: 0x11872b648\u0026gt; user_figure ## The default SciViews GUI (.GUI) ## using widgets from: nativeGUI, textCLI ## * Last call: dlg_input(gui = .GUI) ## * Last widgets used: nativeGUI ## * Last status: ok ## * Last result: ## [1] \u0026#34;100\u0026#34; The post was inspired by an amusing discussion on Stack Overflow.\n","permalink":"https://www.thefinalartefact.xyz/post/amusing-way-to-get-user-input-windows-in-r/","summary":"\u003cp\u003eIn the unlikely scenario that beautiful \u003ca href=\"http://shiny.rstudio.com/\" target=\"_blank\" rel=\"noopener\"\u003eShiny \u003c/a\u003eapps do not meet your analytical requirements and developing a full-blown user interface 
in \u003ca href=\"http://www.ggobi.org/rgtk2/\"\u003eRGtk2\u003c/a\u003e may seem to be a little too much, there is a third, often overlooked solution: the \u003ca href=\"https://cran.r-project.org/web/packages/svDialogs/index.html\"\u003esvDialogs\u003c/a\u003e package by Philippe Grosjean. The package makes it convenient to create various interface gadgets. For example, the code:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cdiv class=\"chroma\"\u003e\n\u003ctable class=\"lntable\"\u003e\u003ctr\u003e\u003ctd class=\"lntd\"\u003e\n\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode\u003e\u003cspan class=\"lnt\"\u003e1\n\u003c/span\u003e\u003cspan class=\"lnt\"\u003e2\n\u003c/span\u003e\u003cspan class=\"lnt\"\u003e3\n\u003c/span\u003e\u003cspan class=\"lnt\"\u003e4\n\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/td\u003e\n\u003ctd class=\"lntd\"\u003e\n\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-r\" data-lang=\"r\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nf\"\u003erequire\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003esvDialogs\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e## Let\u0026#39;s keep some data in one place\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003euser_figure\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026lt;-\u003c/span\u003e \u003cspan class=\"n\"\u003esvDialogs\u003c/span\u003e\u003cspan class=\"o\"\u003e::\u003c/span\u003e\u003cspan class=\"nf\"\u003edlg_input\u003c/span\u003e\u003cspan 
class=\"p\"\u003e()\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/table\u003e\n\u003c/div\u003e\n\u003c/div\u003e\u003cp\u003ewould result in the following window being presented to the user:\u003c/p\u003e","title":"Amusing way to get user input windows in R"}]