1. Font Conventions

  • italic fonts → reserved for text emphasis (Arghhhh)

  • bold fonts → programs or utilities (ldconfig, nm)

  • bold & italic → Environment variables (LD_LIBRARY_PATH)

  • underline font → something meaningful (like this example)

  • Red text → a notable user (root, postgres)

  • Green text → files or paths in the text body (/usr/include, /data/users/hmangala)

  • and sometimes I’ll ignore all the above to inject emphasis on non-Linux terms.

2. Introduction

This document is for (non-Computer Science) graduate students who are using Linux to do large-scale data analysis, but who have little-to-no experience in computer programming. These students often need to know how programs work in order to install programs written by others, to fix common installation or run-time problems, to bypass root requirements on large systems, and to accelerate their own analyses without compromising the security of the larger system.

You would not be reading this document as a first-time Linux user, but as a Linux user who has some experience using programs and is now trying to fix or install a recalcitrant one, whether it be a Perl or C++ program.

As such, you need to have a crude understanding of a Linux system, the permission system, the difference between a normal user and root, and what programs are expected to do.

Hence this is a meta-document about programming: not about How to Program (there’s almost nothing about programming in it), but about how programs are built and work on Linux systems, and how to fix common problems associated with running them.

3. Why install your own programs?

Why indeed? Well, altho you should always try to be as lazy as possible, sometimes the utilities and programs that are available on your system don’t do the thing that you want. And the longer you program, the more you realize that there probably is an included utility that does pretty much what you want; you just have to find out about it - see Google.

That said, this is research, and quite often you do need a program that does something different and for that you can either beg someone else to do it, or do it yourself. "By yourself" can mean anything from copying and modifying a friend’s program to a gigantic hairball of an installation that requires a number of other dependencies, libraries, versions, etc - aka dependency hell.

So you need to install a program. Well …

4. What is a program?

A program is a set of instructions (the source code) that tells the computer what to do. It can be a human-readable set of instructions, like the simple Perl program below:

#!/usr/bin/env perl   # defines the interpreter to use;
while (<>) {  # consume the input file on STDIN, line by line, until it ends
  # split the incoming line ($_, the default variable on which to operate) on space (/ /) tokens
  # into elements of the array @A and store the token count in $N
  my $N = my @A = split (/ /);
  # loop thru the words, printing them each on their own line
  for (my $i=0; $i<$N;$i++){ print "Element [$i] = $A[$i]\n";}
  sleep 1;  # pause for 1s between lines.
}

A Perl (or Python) program has to be dynamically translated (or interpreted) into machine code by an interpreter (like perl, python, or R, often named by the #!/usr/bin/env perl line in the example above). Other source code (as for C, C++, Ada, & Fortran) is instead compiled into a set of machine instructions once, by a compiler.

If you try to view compiled C++/C/Fortran code in a text editor or pager like less, you’ll see something like this:

^?ELF^A^A^A^@^@^@^@^@^@^@^@^@^B^@^C^@^A^@^@^@P95^D^H4^@^@^@d<F2>^L^@^@^@^@^@4^@ ^@^H^@(^@&^@#^@^F^@
^H^D<AF>
^H<FC>^E^@^@^H6^@^@^F^@^@^@^@^P^@^@^B^@^@^@^X^_^F^@^X<AF>
^H^X<AF>
^H<D8>^@^@^@<D8>^@^@^@^F^@^@^@^D^@^@^@^D^@^@^@H^A^@^@H81^D^HH81^D^HD^@^@^@D^@^@^@^D^@^@^@^D^@^@^@
^H^D<AF>
^H<FC>^@^@^@<FC>^@^@^@^D^@^@^@^A^@^@^@/lib/ld-linux.so.2^@^@^D^@^@^@^P^@^@^@^A^@^@^@GNU^@^@^@^@^@^B^@
^H^@^@^@^@^P^@<F1><FF><A8>^A^@^@ <B5>
^H^D^@^@^@^Q^@^X^@N^@^@^@<C0><8F>^D^H^@^@^@^@^R^@^K^@1^@^@^@l82      ^@^@^@^@^R^@^N^@^V^B^@^@@<B5>
^H^D^@^@^@^Q^@^X^@^@libpcre.so.3^@__gmon_start__^@_Jv_RegisterClasses^@_fini^@pcre_exec^@pcre_compile
<etc>

ie, unlike the source code, the compiled code gives little indication of the action of the program.

4.1. What makes a program executable?

For Linux, one thing that makes a file a program is that it has its execute bit set. Without the execute permission bit set, even a compiled program will not be executed by the Operating System. ie:

$ ls -l tacg
-rwxr-xr-x 2 hjm hjm 1495148 Oct 27 20:20 tacg*
   ^  ^  ^
the '^' above indicates the execute bits for the owner, group, and other.

a permission line like this:
-rwxr--r-- 2 hjm hjm 1495148 Oct 27 20:20 tacg*
   ^  allows only the owner to execute it ('group' and 'other' can't)

In an interpreted script, what makes it executable is the presence of the shebang line pointing to the interpreter that you want to process the program. In the Perl example above, the shebang line was the 1st line, which is prefixed by a hash mark (#) and an exclamation point (a bang, in geek-speak). Hence shebang.

Often, files will be chmod'ed to be executable, even when they aren’t valid programs. Libraries are a frequent example of this contagious execution bit spread.

A file can’t be executed simply because the execute bit IS SET, but NOT having the execute bit set will prevent a program from being run by calling its name. However, you CAN still execute a script that does not have the execute bit set by prefixing its name with the appropriate interpreter.

Suppose irvinepines.pl is an otherwise functional perl program that emits some text when run:

#!/usr/bin/env perl

print "The Irvine pines are sublime..\n";
sleep 1;
print "Until they burst into flames.\n";

Note that whether it is executable depends on the execution bits as well as whether it’s a valid program.

# note no execute bits are set below
$ ls -l irvinepines.pl
-rw-rw-r-- 1 hjm hjm 116 Sep 19 15:54 irvinepines.pl

# so when we try to execute it, we can't
$ ./irvinepines.pl
bash: ./irvinepines.pl: Permission denied

# even tho 'file' identifies it as:
$ file ./irvinepines.pl
./irvinepines.pl: a /usr/bin/env perl script, ASCII text executable

# however, if we prefix a non-executable perl script with the interpreter name
$ perl irvinepines.pl
The Irvine pines are sublime..
Until they burst into flames.

# or we can make it executable
$ chmod +x irvinepines.pl
$ ls -l irvinepines.pl
-rwxrwxr-x 1 hjm hjm 118 Sep 19 15:56 irvinepines.pl*
#  ^  ^  ^  now everyone can execute it

# now we can execute it simply by calling its name.
$ ./irvinepines.pl
The Irvine pines are sublime..
Until they burst into flames.

The above example holds true for most interpreted scripts, but not for compiled programs. You can’t cause a compiled program called tustintrees to execute simply by prefixing the name with a compiler. ie: gcc tustintrees will not work.

5. Difference between scripts and programs

In this document, we’re going to call programs that require a separate interpreter a script (see Interpreters below) and those programs that are compiled (see Compilers below) into independent executable code a program.

The difference is apparent if you run file on them:

# this is a 'program'
$ file /bin/ls
/bin/ls: ELF 64-bit LSB  executable, x86-64, version 1 (SYSV), dynamically linked
(uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=
bd39c07194a778ccc066fc963ca152bdfaa3f971, stripped

# this is a 'script'
$ file ~/bin/clusterfork_1.75.pl
/home/hjm/bin/clusterfork_1.75.pl: Perl script, ASCII text executable

6. Interpreters

Interpreters are used by languages like Perl, Python, R, Java, Julia, etc. An interpreter is a large program that ingests human-readable source code, translates it into executable code, and then executes it directly, as opposed to a compiler (see below), which compiles source code into a file of executable code only once. Because of this repeated, real-time translation, interpreted code almost always runs slower than compiled code, but often the difference is trivial, especially since many interpreted languages have shortcuts or toolboxes of functionality that let the user generate useful code faster than with a compiled language (especially if the compiled language requires you to manage your own memory). Especially with one-off programs or those which deal with small datasets, the speed of development overwhelms the speed of execution.

Java (and Microsoft’s C#) have an advantage over more traditional interpreted languages; instead of going thru the traditional convert to machine code each time and then execute cycle, they compile (with javac in the case of Java) their source code to an intermediate, platform-independent byte-code which is then executed by a platform-specific interpreter. This can speed up Java to the point where it is only a little slower than compiled programs for many things. Here’s a longer, but clear explanation.

All modern interpreted languages come with large libraries of extended functionality, and often those libraries include compiled code, so that any execution that uses them runs at compiled speeds. This is why Python, a relatively slow-to-execute language, is being used in many scientific areas: libraries like SciPy and NumPy enable compiled-speed execution times because they ARE compiled code, wrapped in a thin sheet of Python. One of Python’s attractions is that it is very easy to wrap compiled libraries from many languages and call them from Python. Here’s an example of how to Pythonize Fortran.

6.1. Scripts

The Perl (or Python or R) script is a file containing human-readable ASCII characters that the interpreter can translate dynamically into executable code.

"Hello World" in Perl

This program simply prints "Hello World" and ends.

#!/usr/bin/env perl
print "Hello World\n";

Conditional line printer in Perl

This program reads lines from STDIN, splits each line into tokens on whitespace, stuffs the tokens into the @A array, counts them into $N, and then prints the line if the number of tokens is greater than 4.

#!/usr/bin/env perl
while (<>){
  my $N = my @A = split;
  if ($N > 4) {
    print $_;
  }
}

Note
Use /usr/bin/env

Often, the shebang line will be explicit, like #!/usr/bin/perl or #!/usr/bin/python. To make your program more portable, use the more flexible #!/usr/bin/env interpreter format, ie #!/usr/bin/env perl. This causes the interpreter to be chosen via your PATH and the other environment variables in effect when you try to execute the program. This allows the program to be run both by the system default perl interpreter as well as by another perl, perhaps loaded by a module command.

NB: typed directly into the bash shell (without the #! prefix, which bash would treat as a comment), /usr/bin/env perl has almost no apparent effect, since perl by itself just waits for a script to act on, while /usr/bin/env python drops you into the python shell if there’s no code to interpret.
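
As a quick check of which interpreter env will actually choose, you can ask the shell; the module-provided path below is just an example, and yours will differ:

$ which -a perl       # list all the perls on your PATH, in search order
/data/apps/perl/5.26.1/bin/perl
/usr/bin/perl

# '#!/usr/bin/env perl' will run the first one found (here, the one in /data/apps),
# while a hard-coded '#!/usr/bin/perl' would always run the system one.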

In the same way, you can define the interpreters for Python, R, Julia, Java, Octave, etc

  • #!/usr/bin/env python

  • #!/usr/bin/env Rscript (for R, the script-processing front-end is Rscript, not R itself)

  • etc

7. Compilers

Compilers are used by languages like C, C++, Ada, Go, Fortran, etc. There are many compilers: Intel’s icc/icpc/ifort and PGI’s pgcc/pgc++/pgfortran are very good (and expensive) commercial compilers, and there are also many free compilers (LLVM, which Julia uses, for example), but I’m going to stick to the free GNU compilers; the approach is similar for all of them.

gcc stands for the GNU Compiler Collection and it is an astonishingly sophisticated toolkit that allows many languages to be converted into inputs to the same compiler engine, turned into chunks of machine object code, and then linked to the required libraries of functions to produce an executable file. (This can be confusing, since gcc can act as both compiler and linker, carrying out the various functions depending on how it is called.)

While they are not formally part of the gcc application, there are a number of tools that are associated with the process of writing and compiling code.

7.1. Source code to compiled programs

Creating a compiled program starts with writing the source code, which looks similar to a script except that it doesn’t begin with a shebang line. Here’s a very short C program:

Hello World in C
#include <stdio.h>
#define STRING "Hello World"
int main(void)
{
  /* Using the macro defined above to print 'Hello World'*/
  printf(STRING);
  return 0;
}

There are more lines than in the Perl script above and there’s much more to the creation of the executable code, but rather than duplicating Google, let me reference a page that describes it well, altho you probably don’t need to know this now.
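
For instance, if the Hello World source above were saved as hello.c, the entire build could be as simple as this (a minimal sketch; since the printf above has no trailing \n, the output runs into the next prompt):

$ gcc -o hello hello.c   # preprocess, compile, assemble, & link in one step
$ ./hello
Hello World
# or in 2 stages, the way Makefiles usually do it:
$ gcc -c hello.c         # compile to object code (hello.o)
$ gcc -o hello hello.o   # link the object file into an executable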

7.2. 32bit vs 64bit programs

Linux started out running on 32bit processors. The 32bits denoted the width of the CPU address registers and therefore the amount of RAM that the CPU could address: 2^32 = 4294967296 addresses and, at one byte per address, an address space of 4GB. This seems like a large amount of RAM, but with BigData and bloating programs, it often is not. Hence the jump to 64bit CPUs, processors that can nominally address 2^64 = 1.84467440737e+19 bytes, or … more RAM than we’ll ever see in a machine. Since we will never see that much RAM in a physical machine, most 64bit CPUs use a restricted address space of only 48bits, which can still reference 2.81474976711e+14 addresses, a very large number. The 64bit architecture also refers to the width of the registers and data paths that feed the computational machinery, which also increases the overall bandwidth of data through the CPU. You can simply think of it as the CPU being able to consume, process, and move data in much larger chunks.
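
A quick way to check which flavor your own system is:

$ getconf LONG_BIT   # the word size of the native ABI
64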

Most modern CPUs, even cell phone CPUs, are 64bit, but there are a lot of CPUs that are still 32bit, and many of the programs that have been developed and distributed were compiled on 32bit machines.

It turns out that the 64bit CPUs in modern computers can run 32bit programs just fine, IF they have the 32bit compatibility libraries installed. IF they don’t, they can’t. However, even if a 64bit CPU can run the 32bit programs, it can’t magically give them the capability of addressing more than 4GB of RAM.

However, if you try to run a 64bit program on a 32bit CPU, you’re out of luck. It just won’t work.

Obviously, CPU evolution involves a lot more than RAM addressing. Modern 64bit CPUs have much larger registers, multi-level caches, specialized instructions, and millions of transistors dedicated to optimizing them. You won’t have to worry about this level of detail until you start writing your own (fairly sophisticated) code, but be aware that the family of CPU that you compile your code on AND run your code on may be significant. ie: if you compiled your code on a new CPU with all the optimizations turned on, and you’re running it on a grotacious old cluster, you may get an illegal instruction error since the old CPU may not understand the instructions the compiler emitted. There are ways to make your code more transportable, but typically at the cost of making it run slower.

Note that you should never have this kind of problem with interpreted code, since it is re-translated at each invocation by the interpreter, itself a compiled program that should be well-matched to the OS and CPU it runs on.

8. Utilities to build programs

There are typically several sets of tools used to build programs. Most of the ones described below are for compiling programs, but some are also used with interpreted programs, especially when they need a compiled library to provide functionality. Some languages, notably Java, are hybrids, composed of portable byte-code and OS- and machine-specific interpreters.

8.1. Configuration tools

8.1.1. autogen & autoconf

These are the tools that take simple templates and generate the files that drive the actual compilation, usually the Makefile templates and configure scripts described below. They are part of the GNU autotools toolchain and are very useful (if initially confusing) in providing a formal process for keeping nontrivial code projects under control. Here is a good description of how they work.

8.1.2. the configure script

The configure script is often supplied as part of a software distribution (either complete as configure, or in template form as configure.ac - see below). It queries the system for the locations of libraries, tests those libraries to see if they provide the functions it needs, and, based on those results and the directives provided by the user, creates a Makefile that directs the compiler(s) to create the necessary libraries and applications. It often has many options, which can be viewed by supplying the --help option to it:

./configure --help

When running the configure script, you almost always have to start it as ./configure (note the leading "./", which tells the OS that the file to run is in the current directory, not in the dirs listed in PATH).

If the application is fairly simple, running ./configure by itself may often be enough to generate a workable Makefile. However, a complex configure script might look like this:

./configure --prefix=/data/apps/octave/4.2.1  --with-openssl=auto \
 --with-java-homedir=/data/apps/java/jdk1.8.0_111 \
 --with-java-includedir=/data/apps/java/jdk1.8.0_111/include \
 --with-java-libdir=/data/apps/java/jdk1.8.0_111/jre/lib/amd64/server \
 --enable-jit \
 --with-lapack \
 --with-blas \
 --with-x --with-qt=5 \
 --with-OSMesa-includedir=/usr/include/GL/ \
 --with-OSMesa-libdir=/usr/lib64/ \
 --with-blas --with-lapack \
 --with-hdf5-includedir=/data/apps/hdf5/1.8.13/include \
 --with-hdf5-libdir=/data/apps/hdf5/1.8.13/lib \
 --with-fftw3-includedir=/data/apps/fftw/3.3.4-no-mpi/include \
 --with-fftw3-libdir=/data/apps/fftw/3.3.4-no-mpi/lib \
 --with-curl-includedir=/data/apps/curl/7.52.1/include \
 --with-curl-libdir=/data/apps/curl/7.52.1/lib \
 --with-magick=/data/apps/curl/7.52.1/lib \
 --with-openssl=no

8.1.3. pkg-config

Many well-written programs install a pkg-config file (a small text file ending in .pc) into a particular system dir on the filesystem to enable other configuration programs to figure out where their components live and how to use them. The locations of these dirs are contained in the environment variable PKG_CONFIG_PATH, and if it is deleted or modified by another program, what should be a straightforward configuration may devolve into madness. There are a number of standard locations that the pkg-config program checks, but a mistaken modification of the variable can make it miss even standard system utilities. You can check the current value via the usual:

printenv PKG_CONFIG_PATH
/usr/lib64/pkgconfig:/data/apps/gcc/5.3.0/lib/pkgconfig

The dirs in the line above include a standard one (/usr/lib64/pkgconfig), but sometimes the pkg-config utility can become confused, especially if it’s called in the depths of a complex configure script. If you know that a particular .pc file exists, you can add its dir explicitly to the PKG_CONFIG_PATH environment variable in the usual way, making sure to use the export prefix so that the variable survives shell transitions.

export PKG_CONFIG_PATH=/usr/lib64/pkgconfig
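
Once the path is set, you (or a configure script) can ask pkg-config for the compile and link flags for a package. Assuming pcre had installed its libpcre.pc file into one of those dirs, the output would look something like:

$ pkg-config --cflags --libs libpcre
-I/data/apps/pcre/8.40/include -L/data/apps/pcre/8.40/lib -lpcre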

8.1.4. Generating the Makefile and configure scripts

Often, especially in projects git cloned from a github repository, the end-user configure and Makefile scripts don’t exist yet. Instead, the repository provides the template files configure.ac and Makefile.am that have to be converted into the usable scripts. If those precursor files do exist, the usual approach is to run autoreconf -i in that dir.

(The following is a long, briefly annotated history of using a git distributed source tree to build a compiled program.)

# Using 'fpart' utility as an example

$ git clone https://github.com/martymac/fpart.git
Cloning into 'fpart'...
remote: Counting objects: 1370, done.
remote: Total 1370 (delta 0), reused 0 (delta 0), pack-reused 1370
Receiving objects: 100% (1370/1370), 263.69 KiB | 0 bytes/s, done.
Resolving deltas: 100% (913/913), done.
Checking connectivity... done.

$ cd fpart
$ ls
COPYING  Changelog  Makefile.am  README  TODO  configure.ac  contribs/  man/  src/  tools/

$ autoreconf -i
configure.ac:7: installing './compile'
configure.ac:30: installing './config.guess'
configure.ac:30: installing './config.sub'
configure.ac:4: installing './install-sh'
configure.ac:4: installing './missing'
src/Makefile.am: installing './depcomp'

# note that the templates have been converted to 'Makefile.in' & 'configure' for the next stage
$ ls
COPYING      README           compile*       configure.ac  man/
Changelog    TODO             config.guess*  contribs/     missing*
Makefile.am  aclocal.m4       config.sub*    depcomp*      src/
Makefile.in  autom4te.cache/  configure*     install-sh*   tools/


$ ./configure  # now run the configure script
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
...
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: creating tools/Makefile
config.status: creating man/Makefile
config.status: executing depfiles commands

# ..which generates the Makefile
$ ls
COPYING      Makefile.in  autom4te.cache/  config.status*  contribs/    missing*
Changelog    README       compile*         config.sub*     depcomp*     src/
Makefile     TODO         config.guess*    configure*      install-sh*  tools/
Makefile.am  aclocal.m4   config.log       configure.ac    man/

$ make
Making all in src
make[1]: Entering directory '/home/hjm/Downloads/fpart/fpart/src'
gcc -DPACKAGE_NAME=\"fpart\" -DPACKAGE_TARNAME=\"fpart\" -DPACKAGE_VERSION=\"0.9.4\" -DPACKAGE_STRING=\"fpart\ 0.9.4\" -DPACKAGE_BUGREPORT=\"ganael.laplanche@martymac.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"fpart\" -DVERSION=\"0.9.4\" -DHAVE_LIBM=1 -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 ...
...

make[1]: Leaving directory '/home/hjm/Downloads/fpart/fpart/src'
Making all in tools
make[1]: Entering directory '/home/hjm/Downloads/fpart/fpart/tools'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/hjm/Downloads/fpart/fpart/tools'
Making all in man
make[1]: Entering directory '/home/hjm/Downloads/fpart/fpart/man'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/hjm/Downloads/fpart/fpart/man'
make[1]: Entering directory '/home/hjm/Downloads/fpart/fpart'
make[1]: Nothing to be done for 'all-am'.
make[1]: Leaving directory '/home/hjm/Downloads/fpart/fpart'

# 'src' is often where the source code is located and where the executable is left if successful
$ ls src
Makefile     file_entry.c        fpart-fpart.o      fpart.h    partition.c
Makefile.am  file_entry.h        fpart-options.o    fts.c      partition.h
Makefile.in  fpart*              fpart-partition.o  fts.h      types.h
dispatch.c   fpart-dispatch.o    fpart-utils.o      options.c  utils.c
dispatch.h   fpart-file_entry.o  fpart.c            options.h  utils.h

# note the 'fpart*' executable

$ file src/fpart  # it's a 'real', compiled application
src/fpart: ELF 64-bit LSB executable, x86-64, version 1 (SYSV),
dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32,
BuildID[sha1]=39c88d115a6aa623694c0bf72ff08687d161aeb0, not stripped

$ ldd src/fpart   # it uses some std Linux shared libs
        linux-vdso.so.1 =>  (0x00007ffdb39fd000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f7790f77000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7790bad000)
        /lib64/ld-linux-x86-64.so.2 (0x0000564e0a400000)

# the fact that it's 'not stripped' means that you can see all the function information with 'nm'
$ nm src/fpart
0000000000608e18 d _DYNAMIC
0000000000609000 d _GLOBAL_OFFSET_TABLE_
0000000000405d20 R _IO_stdin_used
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 w _Jv_RegisterClasses
0000000000407f40 r __FRAME_END__
00000000004074f4 r __GNU_EH_FRAME_HDR
0000000000608e10 d __JCR_END__
...
<much more info omitted>

8.2. make

There are a number of utilities that aim to make building programs easier. The most widely used are based on the make utility, a system that calculates dependencies and allows you to re-run only those parts of the build whose dependencies have changed or failed. In addition, it can run these builds in parallel, which can tremendously speed up large compiles, especially when you’re in the debugging stages. See the section above for more info on generating usable Makefiles.

Note
Alternative uses of Make

make and Makefiles can also be used to automate and calculate the dependencies for any large dependency-bound system, such as an analytical tree for RNASeq or other arbitrary analysis, which allows for exact replication of analyses. See this search result for other takes on the approach.

There are several systems that use Makefiles. The 2 most widespread are GNU make and cmake, tho they use Makefiles in different ways: GNU make consumes Makefiles (typically produced by the autoconf toolchain described above), while cmake produces Makefiles, which GNU make then consumes like any others.

8.2.1. GNU make

GNU make is a core component of every Linux system and is a build system: it takes Makefiles and directs the compilers to generate, test, and install code. If a Makefile was built correctly by configure or cmake, or was supplied already with an application, you can type make in the Makefile’s directory and make will direct the appropriate compiler (defined in the Makefile or systemwide) to build the application or library.
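
A minimal hand-written Makefile for the Hello World program above might look like the following sketch (real Makefiles generated by configure are far more elaborate). Note that the command lines below MUST begin with a TAB character, not spaces:

CC = gcc
CFLAGS = -O2

hello: hello.o         # 'hello' depends on 'hello.o'...
	$(CC) -o hello hello.o

hello.o: hello.c       # ...which depends on 'hello.c'
	$(CC) $(CFLAGS) -c hello.c

clean:
	rm -f hello hello.o

Typing make then rebuilds only what is out of date; if hello.c hasn’t changed since hello.o was built, make does nothing.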

8.2.2. CMake

cmake is a system for building multiplatform build systems (a higher level than GNU make), so if you were creating code that you wanted to run on Windows, Macs, and Linux, cmake would be a better choice than GNU make.

cmake does generate Makefiles that GNU make can use, so from a Linux POV, you could think of cmake as roughly equivalent to the configure script described above.

You would typically use cmake to generate the Makefile and then have GNU make read that Makefile to generate the code, altho often cmake will do all that for you.
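
The cmake equivalent of the Makefile above would be a CMakeLists.txt like this minimal sketch:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(hello C)
add_executable(hello hello.c)

Running cmake . in that dir generates the Makefile; a subsequent make builds the program as usual.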

8.2.3. Java-specific "Makes"

Because Java is a hybrid system (usually partly compiled, partly interpreted), it uses tools from both the interpreter world and the compiler world. Ant, Maven, Gradle, and Gant are systems that work like make/cmake but are specific to Java, most often because of the multi-file aspect of Java and the startup time of the Java compiler javac. C/C++/Fortran projects tend to have a flatter structure with fewer files and therefore work well with make/cmake. Java projects, however, tend to have lots of small files, and calling javac on hundreds of small files one by one has a substantial overhead. Ant/Maven and friends discover all the dependencies and requirements and call javac on all the files at the same time, sparing the file-by-file startup times.

8.2.4. Rake for Ruby

Rake is a make system for Ruby, another interpreted language.

8.3. the Preprocessor (cpp)

The preprocessor resolves the preprocessor directives in various languages (the statements in C/C++ prefixed by "#", such as #define (see the small C program above), #include, #undef, etc), notably the #include statements that point to the header files (*.h) that define the interfaces to external functions.
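
You can watch the preprocessor work by telling gcc to stop after that stage. Using the Hello World source from above (output trimmed; exact spacing may vary):

$ gcc -E hello.c | tail -6   # -E stops after preprocessing
int main(void)
{

  printf("Hello World");
  return 0;
}

Note that the #define has been resolved (STRING has been replaced by "Hello World") and the comment has been stripped; the hundreds of lines above these, not shown, are the expanded contents of stdio.h.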

8.4. the Lexer (lex/flex)

A lexer is a program that generates tokens: it scans a series of characters, extracts meaningful chunks, and assigns values to them as part of determining how a program should operate. For example, the process of determining the options fed to a program would require such lexing. See Wikipedia for more examples.

8.5. the Parser generator (yacc/bison)

The parser often operates with the lexer to automatically generate the code to process the lexer’s tokens. Because of what it does, it’s often referred to as a compiler compiler - it generates the structures and formalism to process an arbitrary language into computer code. The lexer and parser are often used in option processing, and especially to generate the parsing for an internal language or command set that a complex program might use. gnuplot, R, and all such programs that have internal command languages require such functionality. ie, in gnuplot, the lexer/parser is what lets the program differentiate between plot and plod and respond correctly (plot is defined in the lexer; plod is not).

8.6. the Linker / Loader (ld)

The linker ld (often called as part of the compiler actions) is the program that resolves all the function calls to either code you wrote or to functions in the system libraries, and imports either the entire function (in a static linkage) or a reference or symbol to it (in a shared or dynamic library linkage). Read more at Wikipedia. The linker also supplies the compiled program with its loader, which becomes part of the compiled program and is responsible for starting it: loading it into RAM, notifying the kernel that a new process is running, resolving symbols & functions, requesting memory to run in, and making sure that the process obeys the rules for execution.

8.7. List Dynamic Dependencies (ldd)

ldd helps to debug program failures due to missing shared libraries. If a program cannot find a necessary library, it will fail; ldd will identify the missing library and possibly provide hints as to where it should be. (See also RPATH below.)

9. Static vs Dynamic Linking

There are 2 types of compiled programs: statically linked and dynamically linked (aka shared lib or simply shared). A statically linked program has all the libraries and functions included in the binary that you execute, resulting in a much larger file on disk; however, this dramatically increases the probability that it will run on any given Linux system. A dynamically linked program only contains the calls to the shared libraries, relying on the OS to provide the libraries and the user to provide the appropriate search paths to them (via LD_LIBRARY_PATH or RPATH).

So (assuming you cared) how do you tell whether a program is shared or static? You point the ldd tool at the program and it will tell you:

  • whether the program is static or dynamic

  • whether the immediate library requirements of a dynamic program are met by the current environment (if a shared library requires a further library, ldd will also identify it).

A dynamic linking is like going on vacation to Brazil and packing only your personal clothes, assuming that everything else will be provided for you. A static linking assumes that you need your clothes of course, but also the towels, sheets, kitchen utensils, furniture, and your car. Instead of traveling with a suitcase, you travel with a shipping container.
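
You can see the difference directly by compiling the same Hello World both ways (the sizes below are illustrative, and the -static build assumes the static version of libc, e.g. glibc-static, is installed):

$ gcc -o hello hello.c                 # dynamically linked (the default)
$ gcc -static -o hello_static hello.c  # statically linked

$ ls -lh hello hello_static            # the static binary is ~100x larger
-rwxrwxr-x 1 hjm hjm 8.2K Oct 27 20:20 hello*
-rwxrwxr-x 1 hjm hjm 825K Oct 27 20:20 hello_static*

$ ldd hello_static
        not a dynamic executable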

10. Shell wrappers

Sometimes, especially with complex programs, the program you think you’re executing is not the actual program. For a number of reasons, the actual executable is CALLED BY a wrapper script, often written in bash (the de facto command language of Linux). This allows the wrapper script to set a number of parameters based on how the program is being called, determine the exit status of the program, and do various supporting jobs based on how the job exited. If you’ve ever seen a popup window that says something to the effect of "Sorry. The program salmonspots has crashed. Would you like to send a crash report back to the developers?", then you’re probably dealing with a program that has been called via a wrapper script (or is communicating on an application bus like dbus).

So should you run ldd on such a wrapper, you’ll get:

ldd /data/apps/R/3.4.1/bin/R
        not a dynamic executable

In this case, the only way to determine where the actual executable lives is to page thru the wrapper script itself. Doing so, we see:

#!/bin/sh
# Shell wrapper for R executable.

R_HOME_DIR=/data/apps/R/3.4.1/lib64/R
if test "${R_HOME_DIR}" = "/data/apps/R/3.4.1/lib64/R"; then
   case "linux-gnu" in
   linux*)
     run_arch=`uname -m`
     case "$run_arch" in
        x86_64|mips64|ppc64|powerpc64|sparc64|s390x)
          libnn=lib64
          libnn_fallback=lib
<etc>
And far, far below, we find that the actual R executable is buried in:

  R_binary="${R_HOME}/bin/exec${R_ARCH}/R"

# which in this case means:
 /data/apps/R/3.4.1/lib64/R/bin/exec/R

If we run file and then ldd on this final R executable, we find:

$ cd /data/apps/R/3.4.1/lib64/R/bin/exec
$ file R
R: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, not stripped


$ ldd R
        linux-vdso.so.1 =>  (0x00007ffd5ed90000)
        libgfortran.so.3 => /data/apps/gcc/5.3.0/lib64/libgfortran.so.3 (0x00007fe1274b0000)
        libgomp.so.1 => /data/apps/gcc/5.3.0/lib64/libgomp.so.1 (0x00007fe127288000)
        libR.so => /data/apps/R/3.2.3/lib64/R/lib/libR.so (0x00007fe126c30000)
        libmpi.so.1 => /data/apps/mpi/openmpi-1.8.8/gcc/5.3.0/lib/libmpi.so.1 (0x00007fe126940000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe126720000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fe126388000)
        libquadmath.so.0 => /data/apps/gcc/5.3.0/lib/../lib64/libquadmath.so.0 (0x00007fe126148000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fe125ec0000)
        libgcc_s.so.1 => /data/apps/gcc/5.3.0/lib/../lib64/libgcc_s.so.1 (0x00007fe125ca8000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fe125aa0000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fe125898000)
        libblas.so.3 => /usr/lib64/libblas.so.3 (0x00007fe125640000)
        libreadline.so.6 => /lib64/libreadline.so.6 (0x00007fe1253f8000)
        libicuuc.so.42 => /usr/lib64/libicuuc.so.42 (0x00007fe1250a0000)
        libicui18n.so.42 => /usr/lib64/libicui18n.so.42 (0x00007fe124d08000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fe1277d0000)
        libopen-rte.so.7 => /data/apps/mpi/openmpi-1.8.8/gcc/5.3.0/lib/libopen-rte.so.7 (0x00007fe124a80000)
        libopen-pal.so.6 => /data/apps/mpi/openmpi-1.8.8/gcc/5.3.0/lib/libopen-pal.so.6 (0x00007fe124798000)
        libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007fe124588000)
        libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007fe124378000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007fe124170000)
        libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00007fe123f48000)
        libicudata.so.42 => /usr/lib64/libicudata.so.42 (0x00007fe122df8000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007fe122af0000)

The file command above shows that it is dynamically linked (uses shared libs) and the ldd command shows that all the required shared libs were resolved during the build process. If they had not been resolvable, you would see an entry like this:

        ..
        libicudata.so.42 => not found
        ..

which would send you off on a quest to find out how the damn thing had been built in the first place and where the missing library has gone. (In many cases, the reason is that the module supplying the missing library wasn’t loaded by the responsible module file.)

This is a case which will make you wish that you (or the author) had built it as a static executable.

11. Environment Variables

Environment Variables (envars) are shell variables, set at login or during an interactive session, that define or change the behavior of your shell and of the programs started from it. You can set envars to be available to subshells (using the prefix export) or to be restricted to the local shell by omitting export. In both cases, the values of GOOBERDIR and VEGGIEDIR below will vanish when you log out & log in again unless:

  • you have made them permanent by editing them into your ~/.bashrc or other startup file.

  • you used byobu, screen, tmux, or x2go, which maintain the session when you quit.

export GOOBERDIR=/home/hjm/nuts/goober
# GOOBERDIR is now available to subshells and to programs started from those subshells


VEGGIEDIR=/home/hjm/veg/carrots
# VEGGIEDIR is only available to THE CURRENT shell, not to subshells
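
You can verify the difference by starting a subshell, which inherits only the exported variable:

$ bash -c 'echo "GOOBERDIR=[$GOOBERDIR] VEGGIEDIR=[$VEGGIEDIR]"'
GOOBERDIR=[/home/hjm/nuts/goober] VEGGIEDIR=[]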

Here are some critical envars that will change the behavior of programs that you try to execute.

11.1. PATH

PATH defines where the OS looks for executables. A default PATH is set when you log in, typically something like:

  /home/hjm/bin:/usr/local/bin:/bin:/usr/bin:/usr/sbin:/usr/X11R6/bin

which prepends my (hjm) private bin dir in front of anything else. PATH can be expanded and modified arbitrarily. Here’s what my laptop PATH looks like.

/home/hjm/bin:/home/hjm/eclipse:/usr/NX/bin:/usr/local/sbin:\
/usr/local/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin:\
/home/hjm/intel/bin:/linux86/bin

It can also be changed programmatically to point to alternative applications, especially in large systems that might use something like the environment modules or lmod systems, which also manipulate the following variables to the same end.

11.2. LD_LIBRARY_PATH

LD_LIBRARY_PATH defines the directories thru which the dynamic linker/loader will search at run time to find shared libraries. The following demonstrates the changes a module load can have on LD_LIBRARY_PATH.

$ printenv LD_LIBRARY_PATH

# nothing shown above

$ module load R/3.4.1
# ...
R is a language optimized for statistics and mathematics, and with the
BioConductor package (installed), Bioinformatics and genomics.
# ...

$ printenv LD_LIBRARY_PATH
/data/apps/R/3.4.1/lib64/R/lib:/data/apps/cern_root/5.34.36/lib:\
/data/apps/hdf5/1.8.11/lib:/data/apps/curl/7.52.1/lib:/data/apps/pcre/8.40/lib:\
/data/apps/xz/5.2.3/lib:/data/apps/bzip2/1.0.6/lib:/data/apps/zlib/1.2.8/lib:\
/data/apps/fftw/3.3.4-no-mpi/lib:/data/apps/tcl-tk/8.6.4/lib:/data/apps/gcc/5.3.0/lib64

Once the R module is loaded, all the paths noted above will be searched for required libraries. And if you’re using the module system, once you unload or purge the module, those envars will be cleared.

11.3. RPATH

RPATH is a search path hard-coded into a binary or library when it’s built, pointing to the libraries needed to resolve missing symbols. Because it’s hard-coded, it’s a fairly fragile mechanism for resolving such symbols, and it is generally better to use LD_LIBRARY_PATH to point to library locations. If you have to use RPATH, you can define the envar LD_RUN_PATH to be read by the linker ld, or supply the path explicitly at link time with -rpath=/path/to/libraries (passed thru gcc as -Wl,-rpath=/path/to/libraries).

It’s rare to have to use this because the LD_LIBRARY_PATH is generally a better approach. Usually, the only reason to use it is to explicitly link in a lib from the native compiler tree to satisfy a requirement. ie the app doesn’t need the compiler anymore, but only one of its libs. This approach allows apps compiled by different compilers to be more compatible without having to do module load/unload/ordering gymnastics. Note that it’s difficult to tell if this is the case before the fact unless you’re watching the compile output carefully, so this is generally revealed when you’re trying to compile something that requires a lib built with a different compiler and THAT lib has a requirement for the native compiler lib. If you find an instance where this is the case, it’s worthwhile to go back to the 1st lib and recompile with the RPATH set to point to the native compiler lib so it doesn’t happen again.
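
If you do end up needing it, the usual invocation passes the flag thru gcc to the linker with the -Wl, prefix, and you can verify the result with readelf (the pcre paths below are just examples; note that newer linkers may record RUNPATH instead of RPATH):

$ gcc -o myapp myapp.o -L/data/apps/pcre/8.40/lib -lpcre \
      -Wl,-rpath=/data/apps/pcre/8.40/lib

$ readelf -d myapp | grep -iE 'rpath|runpath'
 0x000000000000000f (RPATH)   Library rpath: [/data/apps/pcre/8.40/lib]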

Clear as mud? Good. I’ve done my job.

11.4. LDFLAGS,LIBS

LDFLAGS is an envar that often contains both the -lname and -Llocation of the libraries that need to be found in order to satisfy the compilation (and so can partly replace the 'LD_LIBRARY_PATH' in the compilation phase, but NOT in the execution phase).

The name is constructed as -lname, where name is the distinguishing part of the library’s filename. So if the library is named libgomp.so.4.5, the LDFLAGS abbreviation would be -lgomp. The specification of -lname is also often placed in the envar LIBS instead, depending on who wrote the build scripts.

The location of the library is specified in LDFLAGS with the prefix -L/full/path/to/lib/dir

# if the libraries of interest were the bzip2 and pcre libs, the envar would be set:

export LDFLAGS+="-L/data/apps/bzip2/1.0.6/lib -lbz2 -L/data/apps/pcre/8.40/lib -lpcre"

# the above line adds the locations and libnames of libbz2.so and libpcre.so to the existing LDFLAGS envar

11.5. CPPFLAGS

CPPFLAGS is used to provide paths to header (*.h) files that are not on the standard include path (usually /usr/include). The format used is similar to the LDFLAGS above:

# if the libraries of interest were the bzip2 and pcre libs, the envar would be set:
export CPPFLAGS+="-I/data/apps/bzip2/1.0.6/include -I/data/apps/pcre/8.40/include"

# the above line adds the locations of the relevant header files to the CPPFLAGS envar

11.6. MANPATH

MANPATH is simply the path that the man program should search in order to find the man pages for an entry.

# if you needed to add a specific path to find the man pages for bzip2 and pcre, the envar would be set:
export MANPATH=":/data/apps/bzip2/1.0.6/man:/data/apps/pcre/8.40/man"

# the above line adds the locations of the relevant man pages to the MANPATH envar
# note that the string starts with ':/data/apps...'  That syntax appends the given
# MANPATH to the already defined one

12. Fixing Missing Libraries

Quite often, especially when you’ve copied a shared application from another Linux distribution, you will find that it complains about missing libs, even when you know that you already have that library, or have a similar lib a version higher or lower than the specific one demanded.

12.1. Modifying LD_LIBRARY_PATH

If you know that you have the missing lib, your current LD_LIBRARY_PATH is probably unset or misconfigured. You can determine if this is so by first locating the missing lib and then checking the value of LD_LIBRARY_PATH. If the missing lib is called libpcre.so.3.12.1, try:

$ locate libpcre.so
/lib/i386-linux-gnu/libpcre.so.3
/lib/i386-linux-gnu/libpcre.so.3.13.2
/lib/x86_64-linux-gnu/libpcre.so.3
/lib/x86_64-linux-gnu/libpcre.so.3.13.2
/ohome/hjm/.singularity-cache/tacg/c/lib/x86_64-linux-gnu/libpcre.so.3
/ohome/hjm/Downloads/kdirstat-2.5.3/kdirstatlibs/libpcre.so.3
/ohome/hjm/Downloads/kdirstat-2.5.3/kdirstatlibs/libpcre.so.3.12.1
/usr/lib/x86_64-linux-gnu/libpcre.so

$ printenv LD_LIBRARY_PATH
<nothing>

The envar LD_LIBRARY_PATH is not set, so the system is relying on the default set of paths listed in the files in /etc/ld.so.conf.d. These files are editable by root, so if you have your libs in a non-standard location, you can include it by editing those files and then running ldconfig, which will add the libraries therein to the cache.
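
For example, if root wanted to make a hypothetical /opt/mylibs dir visible system-wide:

# as root:
$ echo "/opt/mylibs" > /etc/ld.so.conf.d/mylibs.conf
$ ldconfig                  # rebuild the shared lib cache
$ ldconfig -p | grep pcre   # verify that the lib is now in the cache
        libpcre.so.3 (libc6,x86-64) => /opt/mylibs/libpcre.so.3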

If you wanted to add the libpcre.so library above to the LD_LIBRARY_PATH, you can do so by explicitly setting it:

export LD_LIBRARY_PATH+=/ohome/hjm/Downloads/kdirstat-2.5.3/kdirstatlibs

The application would then be able to find the missing lib and execute (at least to the next failure).

Note
LD_LIBRARY_PATH vs LIBRARY_PATH

To make matters even more confusing, in rare cases when compiling with gcc, the similar-looking environment variable LIBRARY_PATH provides much the same functionality at compile/link time as the more commonly used (run-time) LD_LIBRARY_PATH.

According to the GCC manual:

The value of LIBRARY_PATH is a colon-separated list of directories, much like PATH. When configured as a native compiler, GCC tries the directories thus specified when searching for special linker files, if it can't find them using GCC_EXEC_PREFIX. Linking using GCC also uses these directories when searching for ordinary libraries for the -l option (but directories specified with -L come first).

So if the libs on the LD_LIBRARY_PATH don’t seem to be satisfying gcc, try this:

export LIBRARY_PATH=$LD_LIBRARY_PATH

and try the makefile again.

12.2. Symlinking close matches

You can often fake a fix if you don’t have the specific lib the application is looking for. Linux libs tend to be very conservative about interface changes, so if you need libpcre.so.3.13.2 and you have libpcre.so.3.12.1, the application may very well work just fine. All you need to do is provide a symlink from the name it wants to the older lib.

ln -s /path/to/libpcre.so.3.12.1 /path/to/libpcre.so.3.13.2  # lie to the application

12.3. GLIBC problems

GLIBC is the GNU standard C library. It’s part of every Linux distribution and is very stable. However, if you build a program on one Linux system (especially a modern one, such as your laptop) and then copy that application to an older system (a cluster or larger system that may not be as up-to-date as your laptop), you’ll often see this error:

./myprog-install: /lib/tls/libc.so.6: version `GLIBC_2.4' not found (required by ./myprog-install)

This is a GLIBC compatibility problem. GLIBC is backward compatible (so you can always run an old program on a newer system) but not forward compatible, so running a new program on an old system will often yield this error. There are 2 somewhat easy solutions and one more difficult one. The easiest is to recompile your program on your laptop as a static executable, which forces your program to carry with it all the functionality it will ever need. The other somewhat easy alternative is to re-compile your program on the old system, which will resolve all the symbols against the older GLIBC. The harder alternative is to provide your current shared program with the correct GLIBC version by running it in a chroot environment that has the correct GLIBC.
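
To see exactly which GLIBC versions a binary demands (and therefore how old a system it can still run on), you can inspect its versioned symbols; the versions below are just examples:

$ objdump -T ./myprog-install | grep -o 'GLIBC_[0-9.]*' | sort -uV
GLIBC_2.2.5
GLIBC_2.4
GLIBC_2.14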

13. References

  • Eric Raymond’s The Art of Unix Programming (http://www.catb.org/~esr/writings/taoup/html/). Deeply researched, historical, profound, and … free. If you want to really know how Unix works, read this.

  • The O’Reilly library, free for UC Irvine faculty and staff via Safari. If you’re not in a UCI domain, you’ll have to access it via the VPN or by ssh’ing into a UCI machine.

  • Search Engines. This book is barely more than a synthesis of my own Google searches. The Internet was built on Linux, so everything Linux-related is available online.

14. License & Availability

This document is released under the GNU Free Documentation License.