1. THIS IS IN PROGRESS, INCOMPLETE (as of Dec 20, 2017)

2. Introduction

The vast majority of program installations onto a Linux system will be for your own PC, and you’ll almost certainly use the system software installer to do so. Virtually all of the >100 Linux distributions have graphical software managers that make it trivial to install repository software in binary form onto your PC. This usually includes a large amount of popular scientific software as well: R, Python/SciPy, Jupyter, gnuplot, Octave, Scilab, etc.

However, if you’re reading this, you’re probably involved in research, and as such, the software you want to use has almost certainly not reached your Linux distribution’s repositories yet. (It’s RESEARCH, fergadsake). Also, for reasons of speed, RAM, and disk, you may very well be analyzing your data on a shared computing platform like a cluster where you do not have root permissions and the sysadmins may very well be straight out of a BOFH episode. ie: they may not be responsive, polite, caring, sober, or even awake. If this is the case, there may be a long time between a request to install a piece of software and its fulfillment. The rest of this document will briefly review the mechanisms for installing software from repositories if you have root permissions, but will spend most of its content describing how to install software without being root. Sometimes the approaches apply to both.

3. Font Conventions

  • italic fonts → reserved for text emphasis (Arghhhh)

  • bold fonts → programs or utilities (ldconfig, nm)

  • bold & italic → Environment variables (LD_LIBRARY_PATH)

  • underline font → something meaningful (like this example)

  • Red text → a notable user (root, postgres)

  • Green text → files or paths in the text body (/usr/include, /data/users/hmangala)

  • and sometimes I’ll ignore all the above to inject emphasis on non-Linux terms.

4. How programs are distributed

Standard utilities and applications are made available to the various Linux Distributions (distros) in different forms, but they all have similar functionality. All popular distros have graphical software installation managers, sometimes multiple ones, often related to the Desktop system that you have selected (GNOME, KDE, Cinnamon, MATE, etc). Using these so-called Graphical Software Managers (GSMs), you can search the repository databases (repos) for regular expression patterns of files, applications, utilities, etc. The search utility will return the package name and then you can use the same GSM to download, unpack, install, and often configure the package for use. All the top-level package managers also resolve dependencies, so they will download and install all the requirements of the end-target application as well.

These GSMs are usually wrappers around the core Commandline Package Managers (CPMs) that handle the actual manipulation of the constituent binary packages. This includes yum for RedHat-derived distros and apt for Debian ones. These CPMs usually have a separate and specific Commandline Package Installer (CPI) (rpm for RedHat, dpkg for Debian) to install the specific packages. Generally users should never have to use those lower-level installers, but they can be quite powerful if you need to do specific gymnastics with an installation or fix a bungled installation.

So the model is: Graphical Software Managers (GSMs) → Commandline Package Managers (CPMs) → Commandline Package Installer (CPI) (for those with OCD).

Also, the applications that these package managers install are binary installations that are pre-compiled (much like a Windows package or Macintosh DMG). You will never have to compile a package downloaded in this way, altho sometimes the installation will need to do a small compilation to integrate it into your specific platform (unless you insist on using Gentoo Linux, which takes the masochistic POV that everyone should experience the sweet torture of compiling every single package that is installed on your system).

4.1. Distribution-Specific

There are at least 100 different Linux Distributions, but most of them are variants of about 4 major lineages:

  • Debian: → Ubuntu, Mint, Kali, Elementary, etc

  • RedHat: → CentOS, Fedora, Scientific Linux

  • Arch: → Manjaro, Antergos, etc

  • openSUSE: → GeckoLinux, etc

I’ll provide some information about the first 2 families since we currently use CentOS on our HPC cluster at UCI, and Debian variants are the most popular for personal use, especially Ubuntu and the derived Mint.

The native mechanisms are preferred for personal installation since they can be used to install scripts, programs, libraries, and configuration files. However, they often cannot be used on large multi-user systems since they require root permissions that a normal user won’t have.

4.1.1. Debian-derived

(Debian, Ubuntu, Mint)

  • the different distros often use different GSMs (Ubuntu’s Software Center, Mint’s mintinstall Software Manager), but they all manipulate the underlying package managers to do the same thing (described above).

  • apt (similar to but different from apt-get) - is the CPM used to search for, download, and install applications, by default resolving dependencies along the way, such that a single command can install very complex applications such as R/rstudio (a statistical and data science-oriented language and its GUI) and various add-ons such as Bioconductor (a large R package for bioinformatics and biostatistics).

  • dpkg is part of the apt ecosystem that actually installs/removes the individual deb packages, as Debian packages are called. If you are a normal user, you would probably never use dpkg, but if you’re installing individual, non-repository packages, you might (a short example follows this list).

  • GUI and text-based front-ends to the above:

  • synaptic - full GTK GUI that wraps apt (requires an X11 graphics screen)

  • aptitude - a curses/text-based UI that wraps apt (works in a terminal)
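
If you ever do need dpkg directly, e.g. to install a single deb that you downloaded outside the repos, a minimal sketch looks like this (the package file name is hypothetical):

# install a locally downloaded deb; dpkg does NOT resolve dependencies itself
$ sudo dpkg -i somepackage_1.2.3_amd64.deb

# if dpkg complains about missing dependencies, let apt pull them in and finish the install
$ sudo apt-get install -f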

Here’s how you would search for and install xemacs, a powerfully weird text editor, on an Ubuntu/Mint system. First, find the package names that include the functionality you want.

 $ apt search xemacs
p   xemacs21                      - highly customizable text editor
v   xemacs21:i386                 -
p   xemacs21-basesupport          - Editor and kitchen sink -- compiled elisp support files
p   xemacs21-basesupport-el       - Editor and kitchen sink -- source elisp support files
p   xemacs21-bin                  - highly customizable text editor -- support binaries
p   xemacs21-bin:i386             - highly customizable text editor -- support binaries
p   xemacs21-mule                 - highly customizable text editor -- Mule binary
p   xemacs21-mule:i386            - highly customizable text editor -- Mule binary
p   xemacs21-mule-canna-wnn       - highly customizable text editor -- Mule binary compiled with Canna and Wn
p   xemacs21-mule-canna-wnn:i386  - highly customizable text editor -- Mule binary compiled with Canna and Wn
p   xemacs21-mulesupport          - Editor and kitchen sink -- Mule elisp support files
p   xemacs21-mulesupport-el       - Editor and kitchen sink -- source elisp support files
p   xemacs21-nomule               - highly customizable text editor -- Non-mule binary
p   xemacs21-nomule:i386          - highly customizable text editor -- Non-mule binary
p   xemacs21-support              - highly customizable text editor -- architecture independent support files
p   xemacs21-supportel            - highly customizable text editor -- non-required library files

Then once you’ve identified the package (xemacs21 in this case), install it using the apt command.

$ sudo apt install xemacs21
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  libcompfaceg1 xemacs21-basesupport xemacs21-bin xemacs21-mule xemacs21-mulesupport xemacs21-support
Suggested packages:
  xemacs21-supportel
The following NEW packages will be installed:
  libcompfaceg1 xemacs21 xemacs21-basesupport xemacs21-bin xemacs21-mule xemacs21-mulesupport xemacs21-support

  ... <much informational text deleted>

  Install pydb for xemacs21
Install systemtap-common for xemacs21
install/systemtap-common: Ignoring unsupported flavor xemacs21
Install dictionaries-common for xemacs21
install/dictionaries-common: Byte-compiling for emacsen flavour xemacs21
Compiling /usr/share/xemacs21/site-lisp/dictionaries-common/debian-ispell.el...
Wrote /usr/share/xemacs21/site-lisp/dictionaries-common/debian-ispell.elc
Compiling /usr/share/xemacs21/site-lisp/dictionaries-common/ispell.el...
Wrote /usr/share/xemacs21/site-lisp/dictionaries-common/ispell.elc
Compiling /usr/share/xemacs21/site-lisp/dictionaries-common/flyspell.el...
Wrote /usr/share/xemacs21/site-lisp/dictionaries-common/flyspell.elc
Done
Setting up xemacs21 (21.4.22-14ubuntu1) ...
update-alternatives: using /usr/bin/xemacs21 to provide /usr/bin/xemacs (xemacs) in auto mode

# xemacs is entirely installed and configured.

4.1.2. RedHat-derived

(RHEL, CentOS, Fedora)

  • yum (from 'Yellowdog Updater, Modified') functions similarly to apt above, but is typically a bit slower.

  • rpm is the rough RedHat equivalent of dpkg (a short example appears after the note below).

NB: the Debian repos are larger by a factor of 2-10 than the RedHat repos (depending on which repos you query). This is especially significant for researchers who want access to more packages, and also for developers who need access to more, and more recent, libraries and development tools.
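
As with dpkg, you’d rarely call rpm directly, but installing a single downloaded rpm might look like the following sketch (the package file name is hypothetical):

# install a local rpm directly; rpm itself will not fetch missing dependencies
$ sudo rpm -ivh somepackage-1.2.3.el6.x86_64.rpm

# or let yum resolve and install the dependencies for you
$ sudo yum localinstall somepackage-1.2.3.el6.x86_64.rpm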

Here’s how you would search for and install xemacs on a CentOS system. First use yum to search for the appropriate package, and note that the packages are usually named differently on RedHat- and Debian-derived systems (xemacs21 on the Debian system above, xemacs on the CentOS system below).

$ yum search xemacs
==================== N/S Matched: xemacs ===========================
flim-xemacs.noarch : Basic library for handling email messages for XEmacs
xemacs-common.x86_64 : Byte-compiled lisp files and other common files for XEmacs
xemacs-devel.i686 : Development files for XEmacs
xemacs-devel.x86_64 : Development files for XEmacs
xemacs-el.x86_64 : Emacs lisp source files for XEmacs
xemacs-erlang.noarch : Compiled elisp files for erlang-mode under XEmacs
xemacs-erlang-el.noarch : Elisp source files for erlang-mode under XEmacs
xemacs-filesystem.x86_64 : XEmacs filesystem layout
xemacs-info.x86_64 : XEmacs documentation in GNU texinfo format
xemacs-packages-base.noarch : Base lisp packages for XEmacs
xemacs-packages-base-el.noarch : Emacs lisp source files for the base lisp packages for XEmacs
xemacs-packages-extra.noarch : Collection of XEmacs lisp packages
xemacs-packages-extra-el.noarch : Emacs lisp source files for XEmacs packages collection
xemacs-packages-extra-info.noarch : XEmacs packages documentation in GNU texinfo format
xemacs-w3m.noarch : Compiled elisp files to run Emacs-w3m Under XEmacs
xemacs-w3m-el.noarch : Elisp source files for Emacs-w3m under XEmacs
xemacs.i686 : Different version of Emacs
xemacs.x86_64 : Different version of Emacs
xemacs-nox.x86_64 : Different version of Emacs built without X Windows support
xemacs-xft.x86_64 : Different version of Emacs built with Xft/fontconfig support

  Name and summary matches only, use "search all" for everything.

And now use yum to install it

 $ yum install xemacs
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package xemacs.x86_64 0:21.5.31-5.el6 will be installed

... <much informational text deleted>

  Verifying  : xemacs-filesystem-21.5.31-5.el6.x86_64                            2/7
  Verifying  : xemacs-packages-base-20100727-1.el6.noarch                        3/7
  Verifying  : Canna-libs-3.7p3-28.el6.x86_64                                    4/7
  Verifying  : xemacs-21.5.31-5.el6.x86_64                                       5/7
  Verifying  : xemacs-common-21.5.31-5.el6.x86_64                                6/7
  Verifying  : compface-1.5.2-11.el6.x86_64                                      7/7

Installed:
  xemacs.x86_64 0:21.5.31-5.el6

Dependency Installed:
  Canna-libs.x86_64 0:3.7p3-28.el6           compface.x86_64 0:1.5.2-11.el6
  neXtaw.x86_64 0:0.15.1-14.el6              xemacs-common.x86_64 0:21.5.31-5.el6
  xemacs-filesystem.x86_64 0:21.5.31-5.el6   xemacs-packages-base.noarch 0:20100727-1.el6

Complete!

The above utilities require root access and can only install applications from configured repositories.

4.1.3. Cross-Distro Utilities

  • alien is a utility that can convert among a number of foreign package formats, most commonly to Debian-friendly debs. It currently supports Red Hat rpm, Debian deb, Stampede slp, Slackware tgz, and Solaris pkg formats, interconverting in both directions, altho it’s not recommended to use it on base system packages.
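
For example, converting a downloaded rpm to a deb (or the reverse) might look like this sketch (file names are hypothetical):

# convert an rpm into a deb (the resulting .deb can then be installed with dpkg -i)
$ sudo alien --to-deb somepackage-1.2.3.x86_64.rpm

# or go the other way, deb -> rpm
$ sudo alien --to-rpm somepackage_1.2.3_amd64.deb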

4.2. Distribution-Independent

There are a number of application sources that are not linked to a particular distribution, but have been invented strictly as a porting mechanism.

4.2.1. Linuxbrew

Linuxbrew is a silly name for a very impressive package manager that was derived from the MacOSX Homebrew. It supports a surprising number of scientific packages, some more recent than even those in the Debian apt repositories. brew is fairly easy to install and manage, altho it dives fairly deep into the guts of how Linux applications are executed.
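
Once bootstrapped (see the Linuxbrew site for the current one-line installer), day-to-day use looks much like Homebrew on a Mac; a minimal sketch, with the package name only as an example:

# search the brew formulae and install into your own linuxbrew tree (no root needed)
$ brew search samtools
$ brew install samtools

# list what you've installed so far
$ brew list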

4.2.2. pkgsrc

pkgsrc was originally part of the BSD Unix platform and also supports IllumOS (a Solaris spinoff) and MacOSX, but has been ported to Linux as well. It emphasizes a build-from-source approach but there are binary distributions as well. It claims to support about 17K packages. However it is not a native application for any of the Linux distros (ie. you can’t yet apt install pkgsrc) and needs to be manually bootstrapped. It also requires builds to stay within the pkgsrc source tree for reliability, so it’s a fairly separate application tree. It is starting to become stable enough to use for large projects, but it’s not for beginners. Jason Bacon maintains a very good pkgsrc document at UW Madison.

4.2.3. easybuild

easybuild is another build system specifically for scientific and research software. Written in Python, it is similar to pkgsrc but more tilted toward research applications. It tries to break the software build process into formal chunks (easyblocks) and use such formalisms to make it easier to compile otherwise very complex software systems.

4.2.4. spack

Spack is a package manager for predominantly scientific software and can compile software trees for both Linux and MacOSX. spack can be installed with a simple git clone & configure operation, so it’s relatively simple for such a system. spack is a relatively new system and while it builds most mainline packages well, we are still seeing numerous failures in day-to-day operations.
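
A minimal sketch of getting started with spack; the package name is just an example:

# clone spack and load its shell setup
$ git clone https://github.com/spack/spack.git
$ . spack/share/spack/setup-env.sh

# build and install a package (and all of its dependencies) from source
$ spack install zlib

# see what's installed
$ spack find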

4.2.5. Bitnami

Bitnami packages large application stacks in an easy-to-install superpackage and makes the configuration easier than a manual installation. The superpackages are typically web stacks such as Content Management Systems like WordPress & Joomla and eCommerce stacks such as PrestaShop and Magento; there is not much there for scientific applications, altho Bitnami does also provide infrastructure for Linux Containers and some popular databases.

4.3. Linux Containers

Containers can generically be considered very lightweight Virtual Machines (VMs) that share the kernel of the host OS but little else; in shedding that overhead, they are usually considerably smaller than a true VM. Because they share the host kernel, they can only run variants of that OS. For example, you could run a Debian container on a CentOS base OS, as long as the Debian OS used the same kernel, but you COULD NOT run a Windows container on a Linux system of any kind. However you COULD run a complete Windows VM on a Linux base OS (and vice versa) since the VM brings with it the entire kernel as well as utilities, etc.

I'll be referring only to Linux containers going forward.

This sounds like (and is) an odd feature until you realize that for HPC, there is one OS that rules them all (Linux) but zillions of variants (some of which are referenced above). In reality, many complex scientific codes (NeuroDebian, BioLinux, TVB, mriqc) are developed and released for one of these distributions, and if your cluster runs the wrong distribution (or even the wrong version of that distribution), it can require a huge amount of work to host the new software. Containers allow you to use a released package much more easily, and especially to keep a defined version of that software easily accessible and re-usable.

Like Linux itself, there are various flavors of containers and container orchestration tools, but the main container systems I’ll touch on are Docker, LXC (Linux Containers), and Singularity.

Docker is the overwhelming leader in Container technology and has a number of things going for it. It has a git-like organization and it has a lot of infrastructure for use, search, and utility. It’s also free enough that it has been integrated into the software repositories for most Linux Distributions.

However, in the HPC environment, it has some problems related to root escalation and other security issues. Singularity is an alternative container system which mostly addresses those problems and can even consume Docker images, providing a more secure way to run them in HPC or multiuser environments.
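
For example, pulling a Docker image from Docker Hub and running it under Singularity might look like the sketch below; the image name is only an example, and the exact commands and output file name vary with the Singularity version:

# pull a Docker image and convert it to a Singularity image file
$ singularity pull docker://ubuntu:16.04

# run a command inside the resulting image (file name depends on the Singularity version)
$ singularity exec ubuntu-16.04.simg cat /etc/os-release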

In the UCI HPC environment, we are just starting to enable Singularity images to be run by users. If you have a Containerized software system that you want to run, by all means talk to us and we’ll try to arrange for it to be made into a Singularity image or we can also probably install it natively. One advantage of running a large, multi-user system is that we have many dependencies already available.

5. Installing Software

As noted above, a lot of the basic software you’ll need is already available to you via the distro’s native installation mechanism. However, when you need to install software that is outside of those repositories, you’ll be responsible for doing the installation yourself. Below, I describe some of the more popular (or, more accurately, frequently available) mechanisms for installing software.

5.1. By tarball

In the past and still quite often, software was made available in what we refer to as tarballs or tarchives. These terms describe collections of files, often with a single root directory, which contain all the code necessary to install or compile a piece of software. They are made available in the tar format (derived from the ancient term tape archive) and are essentially a concatenation of all the files into one file. That tar file is usually compressed as well, by a variety of formats (gzip → .gz, bzip2 → .bz2, xz → .xz, etc). Sometimes, especially if the software is multi-platform, it may also be packaged as a zip archive, which came from the PC/Windows world.
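
For reference, the traditional unpacking commands for each format are shown below (file names are just examples); the unpacking process itself is covered in more detail later:

$ tar -xzvf package-1.0.tar.gz    # z = gzip-compressed tarball
$ tar -xjvf package-1.0.tar.bz2   # j = bzip2-compressed tarball
$ tar -xJvf package-1.0.tar.xz    # J = xz-compressed tarball
$ unzip package-1.0.zip           # zip archive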

Such tarballs (as well as the zip format) can contain a few different formats for distributing the software.

5.1.1. Precompiled executables

Binary executable packages usually contain only the executable application plus some documentation. The executables are generally statically compiled to run on the widest possible range of Linux distributions. When these tarballs are unpacked (see below), all that has to be done is to (a short example follows this list):

  • chown the executable to the right ownership (if there are overlaps in User IDs from the packaging system to the unpacking system, and your umask is incorrectly set, the package may become owned by another user.)

  • chmod it to be executable.

  • and then mv it to a dir on your PATH.
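
Concretely, for a hypothetical pre-compiled binary called someapp, those steps might look like:

# only needed if the ownership came out wrong (and requires root unless you already own the file)
$ chown $USER:$USER someapp

# make it executable and move it onto your PATH
$ chmod +x someapp
$ mv someapp ~/bin/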

5.1.2. Scripts

Software based on interpreted scripts such as Python or Perl generally needs to be treated like the binary executables directly above. A difference that you may run into is that poorly packaged scripts may have a hard-coded shebang line that points to either the developer’s preferred interpreter or an optional one you don’t have. As described here, change the shebang line to a universal one.
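
For example, a script whose first line is hard-coded to the developer’s interpreter can be edited to use the portable env form, which picks up whichever interpreter is first on your PATH (the paths below are hypothetical):

# hard-coded shebang, pointing at the developer's own interpreter:
#!/home/developer/anaconda3/bin/python

# portable replacement:
#!/usr/bin/env python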

5.1.3. Source Code installs

A source code installation is often much more involved than a binary or interpreted script installation and usually needs the expertise described in the document How Programs Work on Linux. I’ll describe an idealized installation that goes over most of the steps needed to install a source code package, but certainly not all. I’ll demonstrate using a package called tacg.

Get the tarball.

The archive will usually be available via the Internet. Use wget to obtain the tarball:

export CT=compiletest
mkdir $CT
cd $CT
wget http://moo.nac.uci.edu/~hjm/tacg-4.6.0-src.tar.bz2
ls -l
total 960
-rw-rw-r-- 1 hjm hjm 631039 Nov  3 13:50 tacg-4.6.0-src.tar.bz2

Don’t immediately unpack the tarball. The correct/polite way to create a tarball is to have it rooted in a top-level directory so that when you unpack it, the files don’t explode in the current dir. However, perhaps 5% of such tarballs are created so that everything is unpacked in the current dir, creating a huge mess that has to be cleaned up. So check beforehand by using the -t (mnemonic tell) option to tar to show what’s in the tarball BEFORE you untar it to disk.

# we're going to use the following options:
#  t = tell
#  j = use the bzip2 decompression ('j' calls the same compression when 'creating')
#  v = be verbose when operating
#  f = use the following file as input (tacg-4.6.0-src.tar.bz2 in this case)
#  the suffix (| head) only shows us the 1st 10 lines of output

$ tar -tjvf tacg-4.6.0-src.tar.bz2 | head
tacg-4.6.0-src/
tacg-4.6.0-src/Data/
tacg-4.6.0-src/Data/codon.data
tacg-4.6.0-src/Data/matrix.data
tacg-4.6.0-src/Data/rebase.dam+dcm.data
tacg-4.6.0-src/Data/rebase.dam.data
tacg-4.6.0-src/Data/rebase.data
tacg-4.6.0-src/Data/rebase.dcm.data
tacg-4.6.0-src/Data/regex.data
tacg-4.6.0-src/Data/rules.data
...

The above shows us that the tarball is rooted in a separate dir (tacg-4.6.0-src) so we can go ahead and unpack it using a similar command:

# the only change is t -> x (for extract)
$ tar -xjvf tacg-4.6.0-src.tar.bz2
tacg-4.6.0-src/
tacg-4.6.0-src/Data/
tacg-4.6.0-src/Data/codon.data
tacg-4.6.0-src/Data/matrix.data
...
# now look at what we have: the original tarball and the extracted dir.
$ ls
tacg-4.6.0-src/  tacg-4.6.0-src.tar.bz2

Now to compile it.

$ cd tacg-4.6.0-src

$ ls
AUTHORS             INSTALL        ReadEnzFile.c  config.guess*   seqio.c
COPYING             Makefile.am    ReadMatrix.c   config.sub*     seqio.h
COPYRIGHT           Makefile.in    ReadRegex.c    configure*      tacg.c
ChangeLog           MatrixMatch.c  RecentFuncs.c  configure.in    tacg.h
Cutting.c           NEWS           SeqFuncs.c     control*        tacgi4/
Data/               ORF.c          Seqs/          install-sh*     test/
Docs/               Proximity.c    SetFlags.c     missing
GelLadSumFrgSits.c  README         SlidWin.c      mkinstalldirs*

The above is a fairly simple project - some configuration files (config.guess, config.sub, Makefile.am, Makefile.in, install-sh), some identifying files (AUTHORS, COPYING, COPYRIGHT), some info files (README, Docs dir), the C source code (*.c, *.h), and some other dirs that contain auxiliary information.

In any such layout, the 1st thing to do is to actually READ the README file. It’s usually useful, in this case simply telling what tacg is good for. After you absorb the README and if you still want to compile the code, read the INSTALL file, which should tell you how to do just that. In this case, it’s pretty precise, but the main thing it says is to use the included configure script to generate a Makefile out of the Makefile.in template. Unless the author has subverted the ./configure script, it will also have a useful --help function. Be sure to try that option first to see if there are any gotchas or special options you should be aware of.

Also, there is one ./configure option that is critical to installing software on a shared system where you don’t have root permissions: the --prefix option, which tells the Makefile where to install the program once it’s compiled. In the example below, I’m going to choose to install it in my HOME dir, so executables will go into /home/hjm/bin, header files into /home/hjm/include, libs into /home/hjm/lib, manuals into /home/hjm/man, and so on.

$ ./configure --help  | less   # how to compile & where to install the program

# then if nothing alarming or surprising to address, start it off with
# the appropriate options

$ ./configure --prefix=/home/hjm  # install it in my home dir
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
... etc
# checks for all the libs, utilities, compilers, etc it will need to compile tacg
# if it finds something wrong, it will stop with a fatal error and tell you how to fix it.

Now everything is ready to compile - if you look at the files again, you’ll notice 3 new ones.

$  ls -lt | head
total 2084
-rw-rw-r-- 1 hjm hjm  25883 Nov  3 14:22 config.log
-rw-rw-r-- 1 hjm hjm  14494 Nov  3 14:22 Makefile
-rwxrwxr-x 1 hjm hjm  29839 Nov  3 14:22 config.status*
...
  • config.log is the log from the configure script. If something went wrong, it will be described somewhere in that log.

  • Makefile is the input to Gnu make which will actually drive the compilation.

  • config.status is a shell script that will regenerate the current configuration; rarely needed.

Now we can initiate the actual make

$ make -j4  # '-j4' builds in 4 separate parallel processes

gcc -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"tacg\" -DVERSION=\"4.6.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STRUCT_STAT_ST_BLKSIZE=1 -DHAVE_ST_BLKSIZE=1 -DHAVE_LIBM=1 -DHAVE_LIBPCRE=1 -DHAVE_DIRENT_H=1 -DSTDC_HEADERS=1 -DHAVE_FCNTL_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STDLIB_H=1 -DHAVE_UNISTD_H=1 -DHAVE_SYS_PARAM_H=1 -DHAVE_GETPAGESIZE=1 -DHAVE_MMAP=1 -DHAVE_STRFTIME=1 -DHAVE_VPRINTF=1 -DHAVE_ALLOCA_H=1 -DHAVE_ALLOCA=1 -DHAVE_PUTENV=1 -DHAVE_STRSEP=1 -DHAVE_STRDUP=1 -DHAVE_STRSPN=1 -DHAVE_STRSTR=1 -DHAVE_UNAME=1 -I. -I. -DBUILD_DATE=\"Fri\ Nov\ \ 3\ 14:45:50\ PDT\ 2017\" -DUNAME=\"Linux\ stunted\ 4.4.0-21-generic\ #37-Ubuntu\ SMP\ Mon\ Apr\ 18\ 18:33:37\ UTC\ 2016\ x86_64\ x86_64\ x86_64\ GNU/Linux\" -DGCC_VER=\"gcc\ Ubuntu\ 5.4.0-6ubuntu1~16.04.5\ 5.4.0\ 20160609\"     -g -O2 -Wall -c seqio.c
... etc

# depending on how the Makefile is written, it may show or hide much of the process shown above
...
gcc  -g -O2 -Wall  -o tacg  seqio.o Cutting.o Proximity.o GelLadSumFrgSits.o ReadEnzFile.o SeqFuncs.o MatrixMatch.o ReadMatrix.o SetFlags.o ORF.o ReadRegex.o SlidWin.o tacg.o RecentFuncs.o  -lpcre -lm

The last line above shows the link step where all the newly compiled object files are linked together, with the required math (-lm) and regular expression (-lpcre) libraries, finally creating an application called tacg.

$  ls -lt | head
total 5980
-rwxrwxr-x 1 hjm hjm 1484792 Nov  3 14:45 tacg*
-rw-rw-r-- 1 hjm hjm 1189296 Nov  3 14:45 seqio.o
-rw-rw-r-- 1 hjm hjm  144752 Nov  3 14:45 tacg.o
-rw-rw-r-- 1 hjm hjm   94304 Nov  3 14:45 RecentFuncs.o
... <etc>

You’ll notice that there are also object files ending in .o that correspond to each C source code file. Don’t delete these yet. If you need to recompile the source files, only the source code files that are newer than the object files will be re-compiled. This can be quite significant in large projects.

If we run ldd on tacg, we’ll see what libraries it needs to run:

$ ldd tacg
        linux-vdso.so.1 =>  (0x00007ffd69500000)
        libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007fa06ddd7000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa06dace000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa06d703000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa06d4e6000)
        /lib64/ld-linux-x86-64.so.2 (0x0000556f8b73f000)

So in addition to the explicit libraries that it linked to (libm & libpcre), it also uses:

  • linux-vdso.so.1 → the "virtual dynamic shared object", a small library the kernel maps into every process to speed up certain system calls

  • libc.so.6 → the GNU C library which is the core of the Linux OS

  • libpthread.so.0 → the GNU threading library (odd, bc tacg doesn’t use explicit threading)

  • ld-linux-x86-64.so.2 → the dynamic linker/loader that loads the shared libraries above and starts the compiled application

So now, let’s see what it does. A well-designed Linux application will tell you what it does if there’s no input or output specified

$ ./tacg
type 'tacg -h' or 'man tacg' for more help on program use or type:
tacg -n6 -slLc -S -F2 < [your.input.file]
for an example of what tacg can do.

Hmmm, let’s try man tacg to see if we can get more documentation:

$ man tacg
No manual entry for tacg
See 'man 7 undocumented' for help when manual pages are not available.

OK, that’s because we haven’t installed tacg yet, so the man pages are in the document tree but not available, since they’re not on the MANPATH.

We can either point man directly to the manpage or modify MANPATH to include the dir where the man page is. The former is easier.

$ man Docs/tacg.1

# shows the following:
   ..............................................................................
tacg(1)                    General Commands Manual                    tacg(1)

NAME
       tacg  -  finds short patterns and specific combinations of patterns in
       nucleic acids, translates DNA <-> protein.

SYNOPSIS
       tacg -flag [option] -flag [option] ... <input.file  >output.file  tacg
       takes input from a file (--infile) or via stdin (| or <); spits output
       to screen (default), >file, | next command

       etc.
   ..............................................................................

Now we want to test tacg to make sure it’s built correctly. The make test (sometimes make check) functionality is not always provided in source packages, but sometimes it is. Try both, if it’s not made explicit in the README or INSTALL files.

The last step, once you’ve checked the compiled application, is to install it. If you used the --prefix option during the ./configure stage, all you have to do is run make install.

If the compilation does not include an installation step, you will have to copy the critical files into place yourself.

This usually involves copying any executables and scripts into your $HOME/bin (if on a multi-user system) or into /usr/local/bin if you’re using your own laptop and have root access for the install.
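
For tacg, a minimal manual install into your home directory might look like this sketch (it assumes ~/bin is already on your PATH):

# copy the executable and man page into your own tree
$ mkdir -p ~/bin ~/man/man1
$ cp tacg ~/bin/
$ cp Docs/tacg.1 ~/man/man1/

# so that 'man tacg' works, make sure ~/man is on your MANPATH
$ export MANPATH=$HOME/man:$MANPATH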

5.2. By git

git is the world’s most popular Version Control or Source Code Management (SCM) system and github is a large, low-cost internet site that uses git as well as other web technologies to provide a kind of universal software nexus. There are many other SCMs (CVS, Subversion, Bitkeeper, mercurial), but git and github are among the most popular ones.

Extracting the source code from Github or other remote git-based site is slightly different from the source tarball mechanism described above. git is especially notable because when you "download" a git-based project, you’re not only obtaining the source code and docs, but you’re also establishing a local mirror repository of the original site. The command used to do this is highly descriptive and accurate: git clone.

We’ll use a different project called fpart to demonstrate the git approach.

When you view a github project page like this one for fpart, one of the distinctive features is a green button with Clone or Download text on it. Clicking it allows you to copy the git repository URL to your computer’s clipboard. So now we’re ready.

# prep the directory
$ cd
$ export GD=git_test
$ mkdir $GD
$ cd $GD
$ pwd
/home/hjm/git_test

# now clone the git repository, where the last string (starting with https://)
# was copied from the github site.
$ git clone https://github.com/martymac/fpart.git
Cloning into 'fpart'...
remote: Counting objects: 1391, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 1391 (delta 4), reused 8 (delta 3), pack-reused 1370
Receiving objects: 100% (1391/1391), 302.58 KiB | 0 bytes/s, done.
Resolving deltas: 100% (917/917), done.
Checking connectivity... done.

$ ls
fpart/

$ cd fpart

$ ls -lat
total 84
drwxrwxr-x 7 hjm hjm     0 Nov  3 16:00 ./
drwxrwxr-x 8 hjm hjm     0 Nov  3 16:00 .git/     < .git dir tracks checkouts, changes, etc
drwxrwxr-x 2 hjm hjm     0 Nov  3 16:00 src/
drwxrwxr-x 2 hjm hjm     0 Nov  3 16:00 tools/
-rw-rw-r-- 1 hjm hjm  4223 Nov  3 16:00 Changelog
-rw-rw-r-- 1 hjm hjm   110 Nov  3 16:00 Makefile.am
-rw-rw-r-- 1 hjm hjm 11739 Nov  3 16:00 README
-rw-rw-r-- 1 hjm hjm  1453 Nov  3 16:00 TODO
-rw-rw-r-- 1 hjm hjm  2451 Nov  3 16:00 configure.ac
drwxrwxr-x 3 hjm hjm     0 Nov  3 16:00 contribs/
drwxrwxr-x 2 hjm hjm     0 Nov  3 16:00 man/
-rw-rw-r-- 1 hjm hjm  1322 Nov  3 16:00 COPYING
drwxrwxr-x 3 hjm hjm     0 Nov  3 16:00 ../

In a git source checkout, there’s one special dir called .git which contains a number of files and dirs that help track the files, their versions, file differences with the original repository, and a number of other variables. We’re not going to worry about it except where it helps us sync with the parent repository.

In the listing of files above, you can see a few differences and similarities between this git repo and the tarball installation described above.

  • the source code is tucked into its own src dir.

  • there is no configure script, altho there is a configure.ac file

  • there is a README file which provides the same kind of information as the README and INSTALL files do in the tarball installation, altho this layout is much more flexible than the more formal GNU toolchain format that tacg uses.

  • there is another dir called tools (specific to fpart) which has some additional utilities in it.

The consolidation of source code files into its own src dir is a personal choice, altho it does help segregate the source code from the supporting files. Similarly, adding additional dirs to hold configuration files, extra utilities, files for testing build integrity, etc is completely normal.

It’s only the first part of the git build that is slightly different. Because of the missing configure script, we have to build it anew using another GNU autoconf tool called autoreconf.

# now we're in the fpart top-level dir
$ pwd
/home/hjm/git_test/fpart

# now launch autoreconf

$ autoreconf -i
configure.ac:7: installing './compile'
configure.ac:30: installing './config.guess'
configure.ac:30: installing './config.sub'
configure.ac:4: installing './install-sh'
configure.ac:4: installing './missing'
src/Makefile.am: installing './depcomp'

# now we're set to go since we now have a 'configure' script..

From this point on, it’s just like the compile process described above, substituting project and file names as needed.

5.2.1. Updating a git repo

There is one aspect of git that is different from a tarball-based install: when the parent repository is updated, you can sync your copy with the parent with a simple git pull executed in the main git dir (the one that contains the .git dir).

$ pwd
/home/hjm/git_test/fpart

$ git pull                          # after a few days
remote: Counting objects: 65, done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 65 (delta 35), reused 56 (delta 26), pack-reused 0
Unpacking objects: 100% (65/65), done.
From https://github.com/martymac/fpart
   9dd3784..71ccfab  master     -> origin/master
 * [new tag]         fpart-1.0.0 -> fpart-1.0.0
Updating 9dd3784..71ccfab
Fast-forward
 Changelog                                 |   3 +-
 README                                    | 211 +++++++++++++++++++++++++++++++++---------------
 TODO                                      |   3 +
 configure.ac                              |   2 +-
 contribs/package/rpm/fpart.spec           |   6 +-
 docs/Solving_the_final_pass_challenge.txt | 261 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/fpart.1                               |   2 +-
 man/fpsync.1                              |  28 ++++++-
 src/fpart.c                               |   2 +-
 src/fpart.h                               |   2 +-
 tools/fpsync                              |  14 ++--
 11 files changed, 450 insertions(+), 84 deletions(-)
 create mode 100644 docs/Solving_the_final_pass_challenge.txt

# and immediately afterwards,
$ git pull
 Already up-to-date.  # no more changes to sync

6. Installing personal software

As noted in the introduction, there are times when you need to install software on shared computing platforms where you can’t act as root. This section describes how to do that, first in general, and then a brief description for several popular programming platforms.

6.1. In general

Since the only place you can write on most shared platforms is $HOME, that’s the place where you should root your installation. This means if you use the command:

$ ./configure --prefix=$HOME/sw <etc>

in the configuration, after building the project, make install will copy the pieces relative to your $HOME. ie:

  • executables in ~/sw/bin (and then add that directory to your PATH; see the example after this list)

  • include (*.h) files in ~/sw/include

  • libraries in ~/sw/lib

  • documents in ~/sw/share

  • man pages in ~/sw/man/manX (X depending on what kind of package it is)
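
To make your shell, the runtime linker, and man find those pieces, you would typically add something like the following to your ~/.bashrc:

# make a $HOME/sw installation visible
export PATH=$HOME/sw/bin:$PATH
export LD_LIBRARY_PATH=$HOME/sw/lib:$LD_LIBRARY_PATH
export MANPATH=$HOME/sw/man:$MANPATH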

Since you can allow or prevent others from accessing the different parts of your $HOME, you can share or shield these components via the chmod command.
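
For example, to let other users read and execute everything under ~/sw, or to shut them out again:

# share: let others traverse dirs and read/execute files under ~/sw
# (your $HOME itself must also be traversable: chmod o+x $HOME)
$ chmod -R o+rX ~/sw

# shield: remove all access for other users
$ chmod -R o-rwx ~/sw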

6.2. R

The core R packages will usually be installed in a central location, often accessed by the use of the module command (ie: module load R/3.4.2). If that’s the case and you need to install a library to complete an analysis, first load the R version you need, to set up all the environment variables.

After that, there are 2 ways to install the software:

  • outside of R, using R CMD INSTALL, after downloading an R package (in this case Rmpi):
    Note the format for passing in the configure options.

NB: note especially the line --configure-args="--prefix=$HOME" which directs installation into a dir that YOU own and can write into.

R CMD INSTALL --configure-vars='LIBS=-L/apps/openmpi/1.4.2/lib' \
--configure-args="--prefix=$HOME" \
--configure-args="--with-Rmpi-type=OPENMPI \
--with-Rmpi-libpath=/apps/openmpi/1.4.2/lib \
--with-mpi=/apps/openmpi/1.4.2 \
--with-Rmpi-include=/apps/openmpi/1.4.2/include" \
Rmpi_0.5-9.tar.gz
  • inside of R, using install.packages

# as root if installing for whole platform
$ R
...
> install.packages("<R package name>", dependencies = TRUE, repos="http://cran.cnr.Berkeley.edu")
# eg:
> install.packages("ggplot2", dependencies=TRUE)
> install.packages("BiodiversityR", dependencies=TRUE)

However, if you’re not root, then the installation will detect your inability to install into the system areas and offer an alternative:

# as a non-root user
$ R
...
> install.packages("ggplot2", dependencies=TRUE, repos="http://cran.cnr.Berkeley.edu")
Warning in install.packages("ggplot2", dependencies = TRUE, repos = "http://cran.cnr.Berkeley.edu") :
  'lib = "/data/apps/R/3.1.2/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n) y
Would you like to create a personal library
~/R/x86_64-unknown-linux-gnu-library/3.1
to install packages into?  (y/n)
also installing the dependencies 'openssl', 'backports', 'Rcpp', 'viridisLite', 'rlang', 'rex', 'httr', 'crayon', 'sp', 'praise', 'knitr', 'yaml', 'htmltools', 'evaluate', 'rprojroot', 'stringr', 'gdtools', 'scales', 'tibble', 'covr', 'ggplot2movies', 'hexbin', 'mapproj', 'maps', 'maptools', 'testthat', 'rmarkdown', 'svglite'

trying URL 'http://cran.cnr.Berkeley.edu/src/contrib/openssl_0.9.9.tar.gz'
Content type 'application/x-gzip' length 1112927 bytes (1.1 Mb)
opened URL
==================================================
downloaded 1.1 Mb

   ...
<lots of output deleted>
>  library("ggplot2")

# and there you are...

In the above scenario, the other libraries are also installed in your local R/VERSION lib dir and unless you explicitly make them readable by other users (and they modify their environment variables to point to your installation), no one else will be able to make use of your R libraries.
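
If you want to point R at a personal library in a non-default location (or at someone else’s shared one), you can do so explicitly; a sketch, with the path taken from the example above:

# in your shell (or ~/.bashrc): tell R where your personal library lives
$ export R_LIBS_USER=$HOME/R/x86_64-unknown-linux-gnu-library/3.1

# then inside R, check which library paths are active and load the package
> .libPaths()
> library("ggplot2")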

If you’re using a shared resource like a cluster, it’s worthwhile to ask your sysadmins to install popular packages rather than everyone installing them by themselves.

6.3. Python

Python (both the generic version and the specific distributions described below) can use 2 main installation mechanisms: the generic pip utility, which installs packages from the PyPI repositories, and the distribution-specific installers such as conda.
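
With the generic pip, for example, you can install packages into your home directory without root by adding the --user flag (the package name is just an example):

# installs under ~/.local instead of the system Python tree; no root needed
$ pip install --user numpy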

6.3.1. Problems with packages with binary shared libs

In both cases, if the package includes pre-compiled shared libs (ie: tensorflow), your laptop’s distribution is probably recent enough to provide the GLIBC version that the compiled libs require. However, machines running much older distributions, such as cluster nodes, may not be, so even if you can install the package without error, when you try to run it you’ll hit a GLIBC error:

$ python
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
Traceback (most recent call last):
  File "/data/apps/anaconda/3.6-4.3.1/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/data/apps/anaconda/3.6-4.3.1/lib/python3.6/site-packages/tensorflow/python/

  <much deleted>


ImportError: /lib64/libc.so.6: version `GLIBC_2.16' not found (required by /data/apps/anaconda/3.6-4.3.1/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)

6.3.2. Freemium Pythons

Like many Open Source tools, Python has been Freemiumed into multiple distributions. The 2 most popular ones are Anaconda and Enthought Canopy.

In addition, these specific Python distributions have their own installation tools which query their own repositories rather than PyPI. Opinions vary on which is better for what. They provide:

Anaconda and conda

Anaconda Python is a science-targeted distribution and as such has a lot of performance-improved libs. It comes with its own installer which should be used preferentially over the generic installation tools (altho they can be used as well).

The conda utility is used very much like pip but has a slightly different syntax and searches the Anaconda repositories rather than the PyPI repositories.
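
A typical conda session might look like the following sketch; the package and environment names are just examples:

# search the Anaconda repos and install a package
$ conda search biopython
$ conda install biopython

# or create a separate named environment for a project and switch into it
$ conda create -n myproject python=3.6 numpy scipy
$ source activate myproject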


6.3.3. Enthought Python / Canopy

6.4. MATLAB

6.5. Perl

Like the rest of these systems, Perl is usually installed via the distro-specific tools mentioned above (yum, apt). You can search for the appropriate set of modules and install them easily into your own system Perl installation. However Perl also has its own installation tools, most based on CPAN, the Comprehensive Perl Archive Network.

6.5.1. by CPAN

The commandline cpan utility has an internal shell that allows you to search for and install Perl modules, resolving dependencies like any other well-written tool.

The 3 most useful cpan commands are h (help), i (info), and install.

$ cpan

cpan shell -- CPAN exploration and modules installation (v2.18)
Enter 'h' for help.

cpan[1]> h

Display Information                                                  (ver 2.18)
 command  argument          description
 a,b,d,m  WORD or /REGEXP/  about authors, bundles, distributions, modules
 <deletia>

Now search for what you want:

 cpan[2]> i /regex/  # where 'regex' is any regular expression
 cpan[2]> i /samtool/
Reading '/root/.local/share/.cpan/Metadata'
  Database was generated on Tue, 28 Nov 2017 02:17:03 GMT
Distribution    HARTZELL/Alien-SamTools-0.002.tar.gz
Distribution    LDS/Bio-SamTools-1.39.tar.gz
Distribution    LDS/Bio-SamTools-1.43.tar.gz
Module  < Alien::SamTools        (HARTZELL/Alien-SamTools-0.002.tar.gz)
Module  < Bio::Tools::Run::Samtools (CJFIELDS/BioPerl-Run-1.007002.tar.gz)
Module  < Bio::Tools::Run::Samtools::Config (CJFIELDS/BioPerl-Run-1.007002.tar.gz)
Module  < Bio::Tradis::Samtools  (AJPAGE/Bio-Tradis-1.3.3.tar.gz)
7 items found

And now install it:

cpan[4]> install Bio::Tools::Run::Samtools
<lots of text, info, warnings deleted>

6.5.2. cpanminus

cpanminus (aka cpanm) is a very handy utility that you can install (via cpan) which allows you to install modules easily from the commandline without entering the cpan shell.
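
A typical cpanm invocation, installing a module (and its dependencies) into a personal directory without root, might look like this sketch; the module name is just an example:

# install into ~/perl5 instead of the system Perl tree; cpanm will print
# the PERL5LIB settings you need to add to your shell startup file
$ cpanm --local-lib=$HOME/perl5 Bio::SamTools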