Bash on The Final Artefact

Using RScript for R Installation Management

Mon, 03 Jan 2022 00:00:00 +0000

Most frequently, users tend to undertake common R installation and management tasks from within the R session. Frequently making use of commands, like install.packages, update.packages or old.packages to obtain or update packages or update/verify the existing packages. Those common tasks can also be accomplished via the GUI offered within RStudio, which provides an effortless mechanism for undertaking basic package management tasks. This is approach is usually sufficient for the vast majority of cases; however, there are some examples when working within REPL^[REPL stands for Read Eval Print Loop and is usually delivered in a form of an interactive shell. While working in Python users would commonly access REPLY by running python or ipython, more details.] to accomplish common installation tasks is not hugely convenient.

R-based metaprogramming strategies for handling Hive/CSV interaction (Part I, imports)

Fri, 13 Aug 2021 00:00:00 +0000

Background

Handling Hive/CSV interaction is a common reality of many analytical and data environments. The question on exporting data from Hive to CSV and other formats is frequently raised on online forums with answers frequently suggestring making use of sed that combined with nifty regular expressions pipes Hive output into a flat CSV files as an exporting solution. Import of large amounts of data is best handled by suitable tools like Apache Flume. That is fine for simpler tables but may prove problematic for tables with a large amount of unstructured text. Frequently analysts and data scientists are faced with a challenge with storing data Hive on a irregular semi-regular basis. For instance, a job may produce new forecastring scenarios that we may want to make available through a Hive tables.

Installing Hortonworks Sanbox on Mac with Docker

Sat, 23 Feb 2019 00:00:00 +0000

Background

The post covers installation of Hortonworks Sandbox (HD) on Mac using Docker. In software development, sandbox describes a testing environment that can be used to isolate untested code changes from a production code. Hortonworks Sandbox provides such an environment with the Hortonworks Data Platform installed. Hortonworks Data Platform is an open source framework facilitating distributed storage and processing large volumes of data.

Deploying system for distributed processing within a single computer may seem like a counter-intuitive idea but it’s actually a very common practice. Most frequent use cases involve various learning / professional development activities where one may be interested in learning new technology or simply exploring available interfaces. Other frequent use case pertains to various demos, where there may be a need to demonstrate product capabilities and accessing proper, production environment could be cumbersome.