Abstract

This short tutorial addresses only the case where one works stand-alone, i.e. on a laptop or desktop, and not together with others on a server. The aim, however, is to shape the working environment in such a way that cooperation and transfer to/between a centralised server-based environment is easy (in principle). An important boundary condition to be satisfied is proper environment and package management, to prevent potential dependency clashes.

Operating system wars

The preferred operating system is any Linux distribution. For ease of use, any Debian/Ubuntu-related edition is advised. Pop!OS by System76 deserves special mention: it is not merely a re-badged Ubuntu but has several crucial optimisations which are especially useful for data-science applications. The default version comes with an Nvidia-optimised kernel and excellent battery and/or performance optimisation out of the box.

macOS is a (very) good second because of its similarity to Linux, its FreeBSD heritage and POSIX compliance.

My (dated) limited (bad) experience with MS Windows prevents me from listing it as a serious option. In my (very biased) opinion this operating system is fit only for managers and secretaries (and perhaps gamers?).

Weapons of math-destruction

Currently the go-to choice for data-science applications is Python. There are other good choices as well, most notably R and Julia. Just a couple of years ago R would have been the top choice, but in recent years Python has gained the upper hand. One major advantage is the general applicability and usefulness of Python coding knowledge.

Almost anything you can think of in the data-science arena can be done with, or driven from, Python. For general-purpose programming Python is the engineer's and scientist's proverbial Swiss-army knife. So it is a good all-round choice, whether for data science or some other application.

Environmentalism

When developing in Python, Perl or many other languages, one mostly needs to install modules and libraries to get the job done. These can conflict either with similar modules and libraries installed by the operating system, or even with each other. It is therefore good practice to make use of so-called virtual environments. A virtual environment is not much more than a way to separate the modules and libraries used for development work from the rest of the system. Usually this is accomplished by setting environment variables and paths accordingly, and by using a package manager that takes care of dependencies and potential conflicts.
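
As a minimal sketch of the idea, using Python's built-in venv module (the environment name demo-env is purely illustrative):

python3 -m venv demo-env        # create an isolated environment under ./demo-env
source demo-env/bin/activate    # put its bin/ directory first on PATH
pip install requests            # installs into demo-env, not system-wide
deactivate                      # restore the original shell environment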

There are several options to accomplish the above-mentioned isolation between the OS and the various development environments. For Python the pip installer is often used; to manage dependencies and conflicts, pipenv, pip-tools or poetry can be employed. However, these are specific to the Python eco-system exclusively: dependencies on non-Python entities, for instance, cannot be handled by pip and friends. If one codes exclusively in Python and does not need much else, this is fine and probably sufficient. For applications which require interaction with specifically optimised libraries for data-science work (CUDA anyone?) it is much better to employ a package manager that can handle these as well. Conda, the package manager behind the Anaconda distribution (by Anaconda Inc., formerly Continuum Analytics), is such a tool. In fact Anaconda grew and evolved into a full-blown data-science platform, available as an open-source product. The full Anaconda distribution is rather large, weighing in at around 6-7 GB, and includes a GUI which is not really needed. The command-line edition, Miniconda, which ships just the conda package manager and Python, is more than sufficient to manage your Python packages and related materials. It is a mere 1-2 GB in size.
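
To illustrate the difference, a conda sketch that pulls in a non-Python, binary dependency alongside the usual Python libraries (the environment and package names are examples; exact package names depend on the channel):

conda create -n ds-lab python=3.11 numpy pandas   # Python plus libraries in one go
conda activate ds-lab
conda install -c conda-forge cudatoolkit          # binary CUDA libraries, no pip involved
conda env export > environment.yml                # snapshot of the env for transfer or cooperation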

Coding IDE or Web interface

After installing conda and activating an environment other than base (the procedure is described on the conda website), it is time to create the coding environment. One can choose to install an IDE such as Spyder, or something else like Eclipse, Visual Studio Code or PyCharm. Personally I like to use GNU Emacs, or Spyder when coding exclusively in Python.
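
A minimal sketch of such a setup (the environment name analysis and the Python version are purely illustrative):

conda create -n analysis python=3.11   # a fresh environment for this project
conda activate analysis
conda install spyder                   # the Spyder IDE, installed via conda
spyder                                 # launch the IDE from within the environment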

However, when practising data science it is important to be able to communicate (and present) your work, as well as to make it easy to transfer. For this purpose Jupyter Lab is excellent!

Jupyter Lab is a web-browser-based environment which combines markdown, maths (it understands LaTeX), code cells (not just Python) and output visualisation, with a dozen possible graphing libraries. The cell-oriented Jupyter notebooks can be designed to be run by relatively untrained users. Jupyter Lab makes it quite easy to present notebooks as slides for conferences or symposia. Extensions offer added functionality, for instance an interface to git, and make it very powerful indeed. Binder can be used to turn a git repository into interactive notebooks.
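
Getting Jupyter Lab running inside a conda environment is equally straightforward; a minimal sketch, reusing the illustrative environment from above:

conda activate analysis
conda install -c conda-forge jupyterlab   # Jupyter Lab itself
jupyter lab                               # start the server; the interface opens in the browser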

References

  1. Pop!OS
  2. The R Project for Statistical Computing
  3. Python
  4. The Julia Programming Language
  5. Spyder IDE
  6. Jupyter website
  7. Jupyter Lab documentation
  8. Jupyter Lab demonstration on Binder
  9. Anaconda
  10. Anaconda Nvidia support (1)
  11. Anaconda Nvidia support (2)
  12. pipenv
  13. pip-tools (1)
  14. pip-tools (2)
  15. poetry
  16. GNU Emacs (1)
  17. GNU Emacs (2)
  18. GNU Emacs (3)
  19. git (1)
  20. git (2)


Relevant information gathered for future reference:

# Make an existing conda environment available as a Jupyter kernel
# (qr-ballot is the name of the environment in question):
conda activate qr-ballot
conda install ipykernel
python -m ipykernel install --user --name=qr-ballot

# Create a fresh environment that carries its own pip, then install
# the dependencies listed in requirements.txt into it:
conda create -n yourenv pip
conda activate yourenv
pip install -r requirements.txt