In many domains, scientific and otherwise, you end up generating lots of tables of data. Many people keep such tables in spreadsheets (mostly Microsoft Excel), where they are easily viewed in columns, sorted, and cut/pasted into other spreadsheets, but they remain 2D data: it’s hard to relate one set of data to another.

There is a data structure that has been optimised for making and manipulating such data-relations. It’s called, not surprisingly, a relational database and it’s even sorta-kinda-fairly easy to set up. The problem that many people have is creating the queries that allow you to pull the desired data from the database. In the following few paragraphs, I’ll show how to go about creating the flat files that we’ll use to populate the database, creating the database itself, importing the files, and then trying to extract some useful data relations from the database.

There might be tens of such input files totalling 10s to 100s of millions of data points. The question is: how do you relate these variables to each other without spending a lifetime pointing & clicking & sorting & re-sorting columns?

This is NOT a tutorial on database Best Practices. It is definitely aimed at quick & dirty data manipulation so that you can get on with your work. Sort of a Commando Database Tutorial for biologists (and others but the target is biologists).

Note
A comment about research and the command line

You’ve heard this before, but it’s worth repeating: if you’re going to do digital biology (and by that I mean High Throughput or Large Scale ANYTHING), you’re going to have to know how to do things on the command line. If you don’t - if you want an easy-to-use GUI to do pointy-clicky analysis - you’ll end up 2 or 3 years behind the curve and you’ll know a LOT less about the process. Read about this elsewhere, say at Software Carpentry.

We’re going to use a relational database engine called SQLite. It differs from more familiar database engines in that it’s not a database server. There’s no network security to worry about because there is no network access. It provides only the relational guts - tiny, very easy to set up, hardly any administration, fairly fast, and quite adherent to SQL standards. If you want a database engine for Amazon, it’s not going to be SQLite, but for what we want, it’s perfect.

SQLite is quite easy to set up - essentially a single-file install. I’ll leave the details to others to describe, but almost any Linux distribution will have it available from the distribution repository: try

sudo apt-get install sqlite3     # Debian/Ubuntu-based
#  or
sudo yum install sqlite          # RedHat/CentOS-based
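Once installed, you can sanity-check the binary with a throwaway database (the temp directory and table name here are just scratch names):

```shell
# Create a scratch database, insert a row, and read it back.
tmpdir=$(mktemp -d)
out=$(sqlite3 "$tmpdir/test.db" "create table t (x integer); insert into t values (42); select x from t;")
echo "$out"
rm -rf "$tmpdir"
```

If that prints 42, everything's working.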

1. The input flatfiles

Typically you’ll have multiple large tables, either in a proprietary spreadsheet format (Excel) or, more typically, ASCII text flatfiles whose fields are separated by a common delimiter (commas, tabs/whitespace), like this:

    Line  Columns (up to 100 - 200) ---------------------------------------->
      1        0           1        2         3         4             5
      2    Probe        Id_1   GeneID       Set        ID  stress_ko_n1
      3 10344614        NULL     NULL  138.0018  163.8001      140.7875
      4 10344616        NULL    YHE72  7.355559  7.886098      8.340074
      5 10344620     ITE9266    Adult  101.1456  84.57693      81.36497
      6 10344622    EG620009       A1  718.7775  627.4826      690.6035
      7 10344624      Lypla1      Mus  1132.786  1062.253      1121.712
      8 10344633       Tcea1    (SII)  1677.235  1660.797      1727.314
      9 10344637     Atp6v1h       V1  2735.426  2467.097      2579.454
     10 10344653       Oprk1        1  185.5172  208.5798      250.5823
     11 10344658      Rb1cc1       A2  1618.653  1580.918      1540.812
      <continue for 35,000 lines>
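Before doing anything with such a file, it’s worth checking that every row has the same number of delimited columns. A sketch with awk, using a tiny stand-in for input_file:

```shell
# Build a 3-line, 3-column tab-delimited stand-in for 'input_file'.
printf 'Probe\tId_1\tGeneID\n10344614\tNULL\tNULL\n10344616\tNULL\tYHE72\n' > input_file

# Row count, then the set of distinct per-row column counts
# (a clean file yields exactly one value from the second command).
wc -l < input_file
awk -F'\t' '{print NF}' input_file | sort -u
```

If the awk line prints more than one number, some rows have missing or extra fields and will misalign on import.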

Alternatively, you might have a single huge flatfile that has a large number of columns and 30,000 or more rows in a typical gene expression data set.

In addition to the Probe, Id_1, and GeneID columns shown above, there might be 10s of identifier columns associated with the numeric data in a very large matrix. This makes it particularly difficult to figure out what the relationships are from within a spreadsheet application.

I’m going to demonstrate with an input file called input_file, composed of 35,558 rows & 29 columns - about 1M values in 15.6MB - containing the data from 4 experiments that needed to be sliced into 4 separate files to submit to an analysis. (It would have been possible to submit the original data file, but uploading 16MB multiple times over a relatively slow internet connection inhibited the process. It was fairly easy to slice the original data file into 4 much smaller files of 2.2MB each to feed to the analysis site.)

The free utilities I used for doing this are called scut and cols, and their use is described in that link. The 4 upload files were generated using the Probe ID tag (column 0), which provided a common reference key.

For example, the first command below defines the input delimiter as a TAB and selects the columns to be extracted as 0 4 5 6 1 2 3, which slices them out and writes them in that order (cols 4 5 6 are the wt data and cols 1 2 3 are the ko data). The following commands process the same input file and generate the other three data sets. This is specific to the CyberT analysis, but by varying the column specification it is portable to any slice/dice operation.

scut --id1='TAB' --c1='0 4 5 6 1 2 3'    <  input_file > wt_stress__ko_stress.upload
scut --id1='TAB' --c1='0 7 8 9 1 2 3'    <  input_file > ko_ctrl__ko_stress.upload
scut --id1='TAB' --c1='0 10 11 12 4 5 6' <  input_file > wt_ctrl__wt_stress.upload
scut --id1='TAB' --c1='0 10 11 12 7 8 9' <  input_file > wt_ctrl__ko_ctrl.upload
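If you don’t have scut, the same slice can be approximated with awk; note that scut’s columns are 0-based while awk’s fields are 1-based, so scut column 0 becomes awk’s $1, and so on. A sketch of the first command above, run on a toy 7-column row:

```shell
# Toy 7-column tab-delimited row standing in for one line of input_file.
printf 'p1\ta\tb\tc\td\te\tf\n' > input_file

# awk equivalent of: scut --id1='TAB' --c1='0 4 5 6 1 2 3'
awk -F'\t' -v OFS='\t' '{print $1, $5, $6, $7, $2, $3, $4}' \
    < input_file > wt_stress__ko_stress.upload
cat wt_stress__ko_stress.upload
```

The output row is the input columns reordered: column 0 first, then columns 4-6, then 1-3.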

So we have now ended up with subsets of the input data in identically formatted files. These files are uploaded to the CyberT gene expression analysis website for analysis.

For each upload file above, we now have a CyberT output file that consists of even MORE data.

An example is below:

                   0         1         2         3         4         5 .. -> 33 cols of data
R version R 2.2.1 (2         -         -         -         -         -
Cyber-T version 8.01         -         -         -         -         -
hdarray library vers         -         -         -         -         -
Date of this analysi         -         -         -         -         -
              "Row#"   "Lab_0"     "C_1"     "C_2"     "C_3"     "E_4"
             "26796"  10598071  2145.323  1793.633  1724.169  361.2002
             "26793"  10598064  3754.996   2901.02  3028.524  754.0856
             "21826"  10553092  2845.467  3001.603  2941.748  1885.066
             "29145"  10339319  664.0526  658.5493  621.1962  1090.339

       <continues for ~35,000 rows>

The CyberT output expands each of the 4 upload files (~250K values each) into a results file containing >1M values. Here is where we have to do some triage to reduce the number of variables we’re carrying along. Data reduction, or triage, is a general part of analysis; as long as you keep copies of the complete output, or can reiterate the process to regenerate the data, you’re fine.

The CyberT output contains the columns shown in the table definitions below; knowing their names, order, and types is critical to be able to suck them into the database table we will define for them. We also have to cut down the compound label names, i.e. (label // label // label // label), to a single label.

create table wtc_koc (
row           integer not null,
probeid   integer,
label     varchar(100),
CMn       float,
EMn       float,
Bayes_p   float,
fold      float,
PPDE_lt_p float,
PPDE_p    float
);

.import sql.input wtc_koc

# to pull CMn,EMn,fold,Bayes_p,PPDE_lt_p where etc.
select probeid, label, CMn, EMn, fold, Bayes_p, PPDE_lt_p from wtc_koc where CMn != 'NA' and Bayes_p < 0.05 and PPDE_lt_p > 0.9 limit 7;

# to count how many records get returned
SELECT COUNT(fold) FROM (SELECT DISTINCT fold FROM wtc_koc WHERE CMn != 'NA' AND PPDE_lt_p > 0.95);
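The create/import/query steps above can also be run non-interactively by feeding a script to sqlite3 on stdin. Here is a self-contained sketch with a one-row stand-in for sql.input (the values are made up for the demo):

```shell
# One tab-separated row standing in for the real sql.input.
printf '1\t101\tgeneA\t2.0\t4.0\t0.01\t2.0\t0.99\t0.99\n' > sql.input

sqlite3 mini.db <<'EOF'
create table wtc_koc (
  row integer not null, probeid integer, label varchar(100),
  CMn float, EMn float, Bayes_p float, fold float,
  PPDE_lt_p float, PPDE_p float
);
.separator "\t"
.import sql.input wtc_koc
select probeid, fold from wtc_koc where Bayes_p < 0.05;
EOF
```

Driving sqlite3 from a script like this makes the whole load/query step repeatable, which matters once you start re-slicing the data.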

# prep the raw data to go into CyberT
# use the Probe Set ID as the identifier, not the label.
# We'll need another table to correlate the probe ID with the labels, etc.,
# and then a set of tables to carry the data from CyberT.
# (In the table sketch below, the left-hand numbers are the source-file
# column indices used in the scut column specification.)


   create table ident (
0      probeid            integer,
13     gene_acc           varchar(50),
14     gene_sym           varchar(50),
15     gene_desc          varchar(50),
16     cytoband           varchar(20),
17     mrna_desc          varchar(50),
18     unigene            varchar(20),
19     chromo             varchar(10),
20     go_biopr_id        varchar(50),
21     go_biopr_term      varchar(50),
22     go_cellcomp_id     varchar(50),
23     go_cellcomp_term   varchar(50),
24     go_molfunc_id      varchar(50),
25     go_molfunc_term    varchar(50),
28     tot_probes         integer
   );

# create the db 'em1'
sqlite3 em1

to create the load file for this table,
tail -n +2 0410f-07_borre_plier.data|  scut  --id1='TAB' --c1='0 13:25 28' > ident.sql.load

# suck it into the sqlite table with:
sqlite3 em1
sqlite> .separator "\t"
# create the table
create table ident (
    probeid            integer,
    gene_acc           varchar(50),
    gene_sym           varchar(50),
    gene_desc          varchar(50),
    cytoband           varchar(20),
    mrna_desc          varchar(50),
    unigene            varchar(20),
    chromo             varchar(10),
    go_biopr_id        varchar(50),
    go_biopr_term      varchar(50),
    go_cellcomp_id     varchar(50),
    go_cellcomp_term   varchar(50),
    go_molfunc_id      varchar(50),
    go_molfunc_term    varchar(50),
    tot_probes         integer
);

Make sure that the file is intact at top and bottom. From line 35521 to the bottom (35558) there were no IDs besides the Probe Set ID, so I deleted those lines.
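Rather than deleting trailing lines by hand, head can keep only the first N good lines (here that would be the first 35520 of 35558). A toy example keeping 3 of 5 lines:

```shell
# Keep only the first 3 lines, dropping the trailing junk.
printf 'a\nb\nc\njunk1\njunk2\n' > ident.raw
head -n 3 ident.raw > ident.trimmed
wc -l < ident.trimmed
```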

# import the ident data from ident.sql.load
.import "ident.sql.load"  ident
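After an .import, it’s worth checking that the table’s row count matches the load file’s line count. A self-contained sketch with a 2-row toy file and a cut-down 2-column ident schema (em1_demo.db is a scratch database name):

```shell
# Two tab-separated rows standing in for ident.sql.load.
printf '1\tNM_001\n2\tNM_002\n' > ident.sql.load

sqlite3 em1_demo.db <<'EOF'
create table ident (probeid integer, gene_acc varchar(50));
.separator "\t"
.import ident.sql.load ident
EOF

# Compare file line count against table row count.
file_lines=$(wc -l < ident.sql.load | tr -d ' ')
db_rows=$(sqlite3 em1_demo.db "select count(*) from ident;")
echo "$file_lines $db_rows"
```

If the counts differ, some rows were rejected (usually due to a field-count mismatch) and need fixing before you go on.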

# OK, so far so good.
# Now we have to create the CyberT tables for each of the comparisons and populate them with the correct data.
# Use the probeid for the identifier, not any gene ID.  NB: the CyberT output has 33 columns; we don't need them all.
# This table layout is specific to 3 replicates each of C (control) and E (experimental).



scut --id1='TAB' --c1='0 10 11 12 4 5 6' < 0410f-07_borre_plier.data > wt_ctrl__wt_stress.data
# cybert with wt_ctrl__wt_stress.data
wget <the ALLGENES URL>
mv ALLGENES wt_ctrl__wt_stress.ALLGENES
# checked the cols  - ppde_p is >0

scut --id1='TAB' --c1='0 10 11 12 7 8 9' < 0410f-07_borre_plier.data > wt_ctrl__ko_ctrl.data
#cybert with wt_ctrl__ko_ctrl.data
mv ALLGENES wt_ctrl__ko_ctrl.ALLGENES
# checked cols and the ppdes are all zeros (expected?: ctrl vs ctrl)

scut --id1='TAB' --c1='0 4 5 6 1 2 3' < 0410f-07_borre_plier.data  > wt_stress__ko_stress.data
#cybert with wt_stress__ko_stress.data
mv ALLGENES wt_stress__ko_stress.ALLGENES
# checked cols and the ppdes are all zeros (expected? stress vs stress)

scut --id1='TAB' --c1='0 7 8 9 1 2 3' < 0410f-07_borre_plier.data > ko_ctrl__ko_stress.data
#cybert with ko_ctrl__ko_stress.data
mv ALLGENES ko_ctrl__ko_stress.ALLGENES
# checked the cols  - ppde_p is >0

drop previous tables:
drop table ko_c__ko_s;
drop table ko_s__wt_s;
drop table wt_c__ko_c;
drop table wt_c__wt_s;

# create the data tables again (not the ident table) with the output from the .schema dump below.

(omitted - see below)

# 'tail -n +6' starts output at line 6, trimming the top 5 lines.  sqlite can't take the headers (or the "s that are in them)
tail -n +6 ko_ctrl__ko_stress.ALLGENES   | scut --id1='tab' --c1='1:7 10:22 28:32' >  ko_ctrl__ko_stress.sqlinput
tail -n +6 wt_stress__ko_stress.ALLGENES | scut --id1='tab' --c1='1:7 10:22 28:32' >  wt_stress__ko_stress.sqlinput
tail -n +6 wt_ctrl__ko_ctrl.ALLGENES     | scut --id1='tab' --c1='1:7 10:22 28:32' >  wt_ctrl__ko_ctrl.sqlinput
tail -n +6 wt_ctrl__wt_stress.ALLGENES   | scut --id1='tab' --c1='1:7 10:22 28:32' >  wt_ctrl__wt_stress.sqlinput
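Since the four commands differ only in the file stem, they can be collapsed into a loop. In this sketch awk stands in for scut (scut columns are 0-based, awk fields 1-based), and toy 33-column .ALLGENES files are generated so the loop runs end to end:

```shell
for f in ko_ctrl__ko_stress wt_stress__ko_stress \
         wt_ctrl__ko_ctrl wt_ctrl__wt_stress; do
    # Toy stand-in: 5 header lines plus one 33-column data row.
    { printf 'hdr\n%.0s' 1 2 3 4 5
      awk 'BEGIN{for(i=1;i<=33;i++)printf "%d%s",i,(i<33?"\t":"\n")}'
    } > "$f.ALLGENES"

    # tail starts at line 6; awk slices scut cols 1:7 10:22 28:32.
    tail -n +6 "$f.ALLGENES" |
    awk -F'\t' -v OFS='\t' '{
        out = $2
        for (i = 3;  i <= 8;  i++) out = out OFS $i   # scut cols 1:7
        for (i = 11; i <= 23; i++) out = out OFS $i   # scut cols 10:22
        for (i = 29; i <= 33; i++) out = out OFS $i   # scut cols 28:32
        print out }' > "$f.sqlinput"
done
```

Each .sqlinput row ends up with 25 fields, matching the 25 columns of the comparison tables defined below.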

# now suck them into the DB with the '.import' command:

.import "ko_ctrl__ko_stress.sqlinput"    ko_c__ko_s
.import "wt_stress__ko_stress.sqlinput"  wt_s__ko_s
.import "wt_ctrl__ko_ctrl.sqlinput"      wt_c__ko_c
.import "wt_ctrl__wt_stress.sqlinput"    wt_c__wt_s


    create table yadda (
 1      probeid    integer,
 2      c1         float,
 3      c2         float,
 4      c3         float,
 5      e1         float,
 6      e2         float,
 7      e3         float,
10      cmn        float,
11      emn        float,
12      csd        float,
13      esd        float,
14      clnmn      float,
15      elnmn      float,
16      clnsd      float,
17      elnsd      float,
18      cbaysd     float,
19      ebaysd     float,
20      cbaylnsd   float,
21      ebaylnsd   float,
22      bay_t      float,
28      bay_p      float,
29      bay_lnp    float,
30      fold       float,
31      ppde_lt_p  float,
32      ppde_p     float
    );

need cols: 1:7 10:22 28:32 from:

ko_ctrl__ko_stress.ALLGENES
wt_stress__ko_stress.ALLGENES
wt_ctrl__ko_ctrl.ALLGENES
wt_ctrl__wt_stress.ALLGENES





# spot-check the first rows of the converted output:
tail -n +6 wt_ctrl__ko_ctrl.ALLGENES     | head -22 | scut --id1='tab' --c1='1:7 10:22 28:32' | cols | less
tail -n +6 wt_ctrl__wt_stress.ALLGENES   | head -22 | scut --id1='tab' --c1='1:7 10:22 28:32' | cols | less


# now we have to create the 4 tables in sqlite3 to hold the separate comparisons.  Link to the 'ident' table via the 'probeid' column, which is essentially a foreign key.

so the 4 tables will be:

ko_c__ko_s
wt_s__ko_s
wt_c__ko_c
wt_c__wt_s

create table wt_c__wt_s (
    probeid    integer,
    c1         float,
    c2         float,
    c3         float,
    e1         float,
    e2         float,
    e3         float,
    cmn        float,
    emn        float,
    csd        float,
    esd        float,
    clnmn      float,
    elnmn      float,
    clnsd      float,
    elnsd      float,
    cbaysd     float,
    ebaysd     float,
    cbaylnsd   float,
    ebaylnsd   float,
    bay_t      float,
    bay_p      float,
    bay_lnp    float,
    fold       float,
    ppde_lt_p  float,
    ppde_p     float
);
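Since the four comparison tables share one schema, the CREATE statements can be generated in a loop instead of pasting four copies (a sketch; cyb_demo.db is a scratch database name):

```shell
# Create all four identically shaped comparison tables in one loop.
for t in ko_c__ko_s wt_s__ko_s wt_c__ko_c wt_c__wt_s; do
    sqlite3 cyb_demo.db "create table $t (
        probeid integer,
        c1 float, c2 float, c3 float,
        e1 float, e2 float, e3 float,
        cmn float,    emn float,    csd float,      esd float,
        clnmn float,  elnmn float,  clnsd float,    elnsd float,
        cbaysd float, ebaysd float, cbaylnsd float, ebaylnsd float,
        bay_t float, bay_p float, bay_lnp float,
        fold float, ppde_lt_p float, ppde_p float
    );"
done
sqlite3 cyb_demo.db ".tables"   # list the tables just created
```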

then import the equivalent files to tables, in sqlite:

.import "ko_ctrl__ko_stress.sqlinput"    ko_c__ko_s
.import "wt_stress__ko_stress.sqlinput"  wt_s__ko_s
.import "wt_ctrl__ko_ctrl.sqlinput"      wt_c__ko_c
.import "wt_ctrl__wt_stress.sqlinput"    wt_c__wt_s

# seems to have gone OK

.schema
CREATE TABLE ko_c__ko_s (
    probeid    integer,
    c1         float,
    c2         float,
    c3         float,
    e1         float,
    e2         float,
    e3         float,
    cmn        float,
    emn        float,
    csd        float,
    esd        float,
    clnmn      float,
    elnmn      float,
    clnsd      float,
    elnsd      float,
    cbaysd     float,
    ebaysd     float,
    cbaylnsd   float,
    ebaylnsd   float,
    bay_t      float,
    bay_p      float,
    bay_lnp    float,
    fold       float,
    ppde_lt_p  float,
    ppde_p     float
);
CREATE TABLE wt_s__ko_s (
    probeid    integer,
    c1         float,
    c2         float,
    c3         float,
    e1         float,
    e2         float,
    e3         float,
    cmn        float,
    emn        float,
    csd        float,
    esd        float,
    clnmn      float,
    elnmn      float,
    clnsd      float,
    elnsd      float,
    cbaysd     float,
    ebaysd     float,
    cbaylnsd   float,
    ebaylnsd   float,
    bay_t      float,
    bay_p      float,
    bay_lnp    float,
    fold       float,
    ppde_lt_p  float,
    ppde_p     float
);
CREATE TABLE wt_c__ko_c (
    probeid    integer,
    c1         float,
    c2         float,
    c3         float,
    e1         float,
    e2         float,
    e3         float,
    cmn        float,
    emn        float,
    csd        float,
    esd        float,
    clnmn      float,
    elnmn      float,
    clnsd      float,
    elnsd      float,
    cbaysd     float,
    ebaysd     float,
    cbaylnsd   float,
    ebaylnsd   float,
    bay_t      float,
    bay_p      float,
    bay_lnp    float,
    fold       float,
    ppde_lt_p  float,
    ppde_p     float
);
CREATE TABLE wt_c__wt_s (
    probeid    integer,
    c1         float,
    c2         float,
    c3         float,
    e1         float,
    e2         float,
    e3         float,
    cmn        float,
    emn        float,
    csd        float,
    esd        float,
    clnmn      float,
    elnmn      float,
    clnsd      float,
    elnsd      float,
    cbaysd     float,
    ebaysd     float,
    cbaylnsd   float,
    ebaylnsd   float,
    bay_t      float,
    bay_p      float,
    bay_lnp    float,
    fold       float,
    ppde_lt_p  float,
    ppde_p     float
);

CREATE TABLE ident (
    probeid            integer,
    gene_acc           varchar(50),
    gene_sym           varchar(50),
    gene_desc          varchar(50),
    cytoband           varchar(20),
    mrna_desc          varchar(50),
    unigene            varchar(20),
    chromo             varchar(10),
    go_biopr_id        varchar(50),
    go_biopr_term      varchar(50),
    go_cellcomp_id     varchar(50),
    go_cellcomp_term   varchar(50),
    go_molfunc_id      varchar(50),
    go_molfunc_term    varchar(50),
    tot_probes         integer
);

The tables now in the db:

ident
ko_c__ko_s (ppde_p all ZERO)
wt_s__ko_s (ppde_p all ZERO)
wt_c__ko_c (ppde_p > 0 but probeid is forked)
wt_c__wt_s (ppde_p > 0 but probeid is forked)


So let’s try a few queries:

select probeid from wt_s__ko_s where ppde_p > 0.2 and bay_p < 0.15 limit 5;   -- nothing
select probeid from wt_c__ko_c where ppde_p > 0.2 and bay_p < 0.15 limit 5;   -- nothing
select probeid from ko_c__ko_s where ppde_p > 0.2 and bay_p < 0.15 limit 5;   -- 5 probeids
select probeid from wt_c__wt_s where ppde_p > 0.2 and bay_p < 0.15 limit 5;   -- 5 probeids

select probeid from wt_s__ko_s where bay_p < 0.05 limit 5;   -- yes, returns rows

select d.probeid,d.bay_p,i.chromo,i.mrna_desc from wt_s__ko_s d, ident i where d.bay_p <0.05 and i.probeid = d.probeid  limit 5;

select d.probeid,d.bay_p,d.ppde_p,i.mrna_desc from ko_c__ko_s d, ident i where d.bay_p < 0.01 and d.ppde_p > 0.98 and i.probeid = d.probeid limit 5;
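The implicit joins above can also be written with explicit JOIN ... ON syntax, which makes the linking column obvious. A self-contained sketch against a tiny scratch database (the inserted values are made up):

```shell
sqlite3 join_demo.db <<'EOF'
create table ident (probeid integer, mrna_desc varchar(50));
create table ko_c__ko_s (probeid integer, bay_p float, ppde_p float);
insert into ident values (10344614, 'made-up description');
insert into ko_c__ko_s values (10344614, 0.005, 0.99);

-- same filter as above, written as an explicit join
select d.probeid, d.bay_p, d.ppde_p, i.mrna_desc
from ko_c__ko_s d
join ident i on i.probeid = d.probeid
where d.bay_p < 0.01 and d.ppde_p > 0.98
limit 5;
EOF
```

Both forms produce the same rows; SQLite treats the comma-separated FROM list and the explicit JOIN identically here.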

select probeid from wt_s__ko_s limit 5;
select probeid from wt_c__ko_c limit 5;
select probeid from wt_c__wt_s limit 5;

# run the cybert; filling out the columns as needed.

# take the cybert output ALLGENES and save as 'something_ALLGENES.txt'.
# process that ALLGENES file to take only the relevant data:

# get rid of the embedded " characters
perl -p -i.bak -e 's/"//g' sql.input

# have to line up the variables that come out of cybert and the order in which they're sucked into the sqlite db



comparisons (all use a sliding window of 101 and a confidence value of 9; 'x' = done):
x wt_ctrl__ko_ctrl
x wt_stress__ko_stress
x ko_ctrl__ko_stress
  wt_ctrl__wt_stress

For reference, the column layout of the original input file:

           0             1             2             3             4             5             6           7           8           9          10          11          12
Probe Set ID  stress_ko_n1  stress_ko_n2  stress_ko_n3  stress_wt_n1  stress_wt_n2  stress_wt_n3  ko_ctrl_o1  ko_ctrl_o2  ko_ctrl_o3  wt_ctrl_o1  wt_ctrl_o2  wt_ctrl_o3