Make Your Data Relational
=========================
by Harry Mangalam v1.03, Oct 28, 2015
:icons:

//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// this file is converted to HTML via the command:
// export fileroot="/home/hjm/nacs/MakeYourDataRelational"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.[ht]* moo:~/public_html;
// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html

In many domains, scientific and otherwise, you end up generating lots of
tables of data.  Many people keep such tables in spreadsheets (mostly
Microsoft Excel), where they are easily viewed in columns, sorted, and
cut/pasted into other spreadsheets, but they still remain 2D data.  It's
hard to 'relate' one set of data to another.

There is a data structure that has been optimised for making and
manipulating such data relations.  It's called, not surprisingly, a
'relational database', and it's even sorta-kinda-fairly easy to set up.  The
problem that many people have is creating the queries that let you pull the
desired data out of the database.  In the following few paragraphs, I'll
show how to go about creating the flat files that we'll use to populate the
database, creating the database itself, importing the files, and then
extracting some useful data relations from the database.

There might be tens of such 'input_files', totalling 10s to 100s of millions
of data points.  The question is: how to relate these variables to each
other without spending a lifetime pointing & clicking & sorting & re-sorting
columns.

This is NOT a tutorial on database 'Best Practices'.  It is definitely aimed
at 'quick & dirty' data manipulation so that you can get on with your work -
sort of a 'Commando Database Tutorial' for biologists (and others, but the
target is biologists).
.A comment about research and the command line
[NOTE]
=======================================================================
You've heard this before, but it's worth repeating: if you're going to do
digital biology (and by that I mean High Throughput or Large Scale
ANYTHING), you're going to have to know how to do things on the command
line.  If you don't - if you want an easy-to-use GUI to do 'pointy-clicky'
analysis - you'll end up 2 or 3 years behind the curve and you'll know a LOT
less about the process.  Read about this elsewhere, say at
http://www.software-carpentry.org[Software Carpentry].
=======================================================================

We're going to use a relational database engine called
http://www.sqlite.org[SQLite].  It differs from more familiar database
engines in that it's not a database 'server' - there's no internet security
to worry about because there is no internet access.  It provides only the
relational guts: tiny, very easy to set up, hardly any administration,
fairly fast, and quite adherent to the SQL standards.  If you want a
database engine for Amazon, it's not going to be SQLite, but for what we
want, it's perfect.
SQLite is quite easy to set up - essentially a single-file install.  I'll
leave the details to others to describe, but almost any Linux distribution
will have it available from the distribution repository; try:

-------------------------------------------------------------------------
sudo apt-get install sqlite3    # Debian/Ubuntu-based
# or
sudo yum install sqlite         # RedHat/CentOS-based
-------------------------------------------------------------------------

The input flatfiles
-------------------
Typically you'll have multiple large tables, either in a proprietary
spreadsheet format (Excel) or, more typically, ASCII text flatfiles in which
alphanumeric strings are separated by a common delimiter (commas,
tabs/whitespace), like this:

----------------------------------------------------------------------------
Line    Columns (up to 100 - 200) ---------------------------------------->
 1      0          1          2        3          4          5
 2      Probe      Id_1       GeneID   Set ID     stress_ko_n1
 3      10344614   NULL       NULL     138.0018   163.8001   140.7875
 4      10344616   NULL       YHE72    7.355559   7.886098   8.340074
 5      10344620   ITE9266    Adult    101.1456   84.57693   81.36497
 6      10344622   EG620009   A1       718.7775   627.4826   690.6035
 7      10344624   Lypla1     Mus      1132.786   1062.253   1121.712
 8      10344633   Tcea1      (SII)    1677.235   1660.797   1727.314
 9      10344637   Atp6v1h    V1       2735.426   2467.097   2579.454
10      10344653   Oprk1      1        185.5172   208.5798   250.5823
11      10344658   Rb1cc1     A2       1618.653   1580.918   1540.812
----------------------------------------------------------------------------

Alternatively, you might have a single huge flatfile with a large number of
columns and 30,000 or more rows, as in a typical gene expression data set.
In addition to the 'Probe', 'ID_1', and 'GeneID' columns shown above, there
might be 10s of identifier columns associated with the numeric data in a
very large matrix.  This makes it particularly difficult to figure out what
the relationships are from within a spreadsheet application.
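Once installed, a quick smoke test from the shell looks like this.  The
database file, the 'probes' table, and the values are throwaway examples
invented for illustration, not part of the workflow below:

```shell
# Throwaway smoke test: make a database, add a table, query it, delete it.
db=/tmp/sqlite_smoke_$$.db
sqlite3 "$db" <<'EOF'
create table probes (probeid integer, label varchar(100), expr1 float);
insert into probes values (10344616, 'YHE72',   7.355559);
insert into probes values (10344620, 'ITE9266', 101.1456);
EOF
n=$(sqlite3 "$db" 'select count(*) from probes;')
echo "rows loaded: $n"
rm -f "$db"
```

If this prints 'rows loaded: 2', sqlite3 is installed and working.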
I'm going to demonstrate with an input file called 'input_file' composed of
35,558 rows & 29 columns, for a total of ~1M values in 15.6MB, containing
the data from 4 experiments that needed to be sliced into 4 separate files
to submit to an analysis.  (It would have been formally possible to submit
the original data file, but uploading 16MB multiple times over a relatively
slow internet connection inhibited the process.  It was fairly easy to slice
the original data file into 4 much smaller data files of 2.2 MB each to feed
to the analysis site.)

The free utilities I used for doing this are called
http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[scut and cols] and their
use is described in that link.  The 4 files to upload were generated using
the 'Probe' ID tag (column 0), which provided a common reference key.  For
example, the first command below defines the input delimiter as a TAB and
selects the columns to be extracted as '0 4 5 6 1 2 3', which slices them
out and writes them in that order (cols '1 2 3' are the 'wt' data and cols
'4 5 6' are the 'ko' data).  The following commands process the same input
file and generate the other 3 sets of data.  This is specific to the CyberT
analysis, but by varying the column specification, it is portable to any
slice/dice operation.

----------------------------------------------------------------------------
scut --id1='TAB' --c1='0 4 5 6 1 2 3'    < input_file > wt_stress__ko_stress.upload
scut --id1='TAB' --c1='0 7 8 9 1 2 3'    < input_file > ko_ctrl__ko_stress.upload
scut --id1='TAB' --c1='0 10 11 12 4 5 6' < input_file > wt_ctrl__wt_stress.upload
scut --id1='TAB' --c1='0 10 11 12 7 8 9' < input_file > wt_ctrl__ko_ctrl.upload
----------------------------------------------------------------------------

So we have now subset the input data, creating 4 identically formatted
files.  These files are uploaded to the
http://cybert.microarray.ics.uci.edu/[CyberT Gene Expression analysis
website] for analysis.
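If you don't have scut installed, awk can do the same column shuffle.  A
rough stand-in for the first command above (remember scut counts columns
from 0 and awk from 1, so scut's '0 4 5 6 1 2 3' becomes awk's
$1,$5,$6,$7,$2,$3,$4; the one-line input here is a toy - a real run would
read input_file):

```shell
# awk stand-in for: scut --id1='TAB' --c1='0 4 5 6 1 2 3'
line=$(printf 'p1\ta\tb\tc\td\te\tf\n' |
       awk -F'\t' 'BEGIN{OFS="\t"} {print $1,$5,$6,$7,$2,$3,$4}')
echo "$line"
```

This prints 'p1  d  e  f  a  b  c' (tab-separated) - the ID column first,
then the two 3-column blocks swapped, just as scut would write them.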
For each 'upload' file above, we now have a CyberT output file that consists
of even MORE data.  An example is below:

----------------------------------------------------------------------------
   0            1          2          3          4          5      .. -> 33 cols of data
R version  R 2.2.1 (2     -          -          -          -
Cyber-T version 8.01      -          -          -          -
hdarray library vers      -          -          -          -
Date of this analysi      -          -          -          -
"Row#"     "Lab_0"    "C_1"      "C_2"      "C_3"      "E_4"
"26796"    10598071   2145.323   1793.633   1724.169   361.2002
"26793"    10598064   3754.996   2901.02    3028.524   754.0856
"21826"    10553092   2845.467   3001.603   2941.748   1885.066
"29145"    10339319   664.0526   658.5493   621.1962   1090.339
  |
  v   35K rows
----------------------------------------------------------------------------

CyberT has expanded each of the 4 upload files (~250K values each) into a
results file containing >1M values.  Here is where we have to do some triage
to reduce the number of variables we're carrying along.  Data reduction or
triage is a general process in analysis; as long as you keep copies of the
complete output, or can reiterate the process to regenerate the data, we're
fine.

We need to know which columns the CyberT output contains - this is critical
to be able to suck them into the database table we will define for them.  We
also have to cut down on the number of label names, ie (label // label //
label // label).

-------------------------------------------------------------------------
create table wtc_koc (
  row        integer not null,
  probeid    integer,
  label      varchar(100),
  CMn        float,
  EMn        float,
  Bayes_p    float,
  fold       float,
  PPDE_lt_p  float,
  PPDE_p     float
);

.import "sql.input" wtc_koc

-- to pull CMn, EMn, fold, Bayes_p, PPDE_lt_p where etc.
select probeid, label, CMn, EMn, fold, Bayes_p, PPDE_lt_p from wtc_koc
  where CMn != "NA" and Bayes_p < 0.05 and PPDE_lt_p > 0.9 limit 7;

-- to count how many records get returned
SELECT COUNT(fold) FROM
  (SELECT DISTINCT fold FROM wtc_koc where CMn != "NA" and PPDE_lt_p > 0.95);

-- prep the raw data to go into CyberT;
-- use the Probe Set ID as the identifier, not the label.
-------------------------------------------------------------------------
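As a sanity check, here is the triage query above run against a two-row toy
version of 'wtc_koc' (invented values, and only the columns the query
actually touches):

```shell
# Toy wtc_koc with made-up values: only the first row passes all 3 filters.
db=/tmp/wtc_demo_$$.db
sqlite3 "$db" <<'EOF'
create table wtc_koc (probeid integer, CMn float, Bayes_p float, PPDE_lt_p float);
insert into wtc_koc values (10344614, 138.0018, 0.01, 0.99);
insert into wtc_koc values (10344616, 7.355559, 0.20, 0.50);
EOF
hits=$(sqlite3 "$db" \
  'select count(*) from wtc_koc where CMn != "NA" and Bayes_p < 0.05 and PPDE_lt_p > 0.9;')
echo "passing probes: $hits"
rm -f "$db"
```

Note that SQLite quietly treats the double-quoted "NA" as the string 'NA'
when no column of that name exists, which is why the filter works as
written.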
We will need another table to correlate the probe ID with the labels, etc.,
and then a set of tables to carry the data from CyberT.  The numbers in the
comments below are the source columns in the original data file that each
field comes from:

-------------------------------------------------------------------------
create table ident (
  probeid           integer,      --  0
  gene_acc          varchar(50),  -- 13
  gene_sym          varchar(50),  -- 14
  gene_desc         varchar(50),  -- 15
  cytoband          varchar(20),  -- 16
  mrna_desc         varchar(50),  -- 17
  unigene           varchar(20),  -- 18
  chromo            varchar(10),  -- 19
  go_biopr_id       varchar(50),  -- 20
  go_biopr_term     varchar(50),  -- 21
  go_cellcomp_id    varchar(50),  -- 22
  go_cellcomp_term  varchar(50),  -- 23
  go_molfunc_id     varchar(50),  -- 24
  go_molfunc_term   varchar(50),  -- 25
  tot_probes        integer       -- 28
);
-------------------------------------------------------------------------

To create the load file for this table, slice out those columns, skipping
the header line:

-------------------------------------------------------------------------
tail -n +2 0410f-07_borre_plier.data | scut --id1='TAB' --c1='0 13:25 28' > ident.sql.load
-------------------------------------------------------------------------

Make sure that the file is intact at top and bottom.  (From line 35521 to
the bottom at 35558 there were no IDs besides the Probe Set ID, so I deleted
those lines.)  Then create the database 'em1', create the table with the SQL
above, and suck in the data:

-------------------------------------------------------------------------
sqlite3 em1
sqlite> .separator "\t"
sqlite> -- (paste the 'create table ident ...' statement from above)
sqlite> .import "ident.sql.load" ident
-------------------------------------------------------------------------

OK, so far so good.  Now we have to create the CyberT tables for each of the
comparisons and populate them with the correct data, using the 'probeid' as
the identifier, not any gene ID.  NB: the CyberT output has 33 columns; we
don't need them all.
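The .separator/.import step can be tried end-to-end on a tiny tab-separated
file.  This sketch uses a 2-column stand-in for 'ident' (invented probe/gene
pairs) rather than the full 15-column table:

```shell
# Toy version of the ident load: 2 columns instead of 15.
load=/tmp/ident_demo_$$.load
db=/tmp/ident_demo_$$.db
printf '10344616\tYHE72\n10344620\tITE9266\n' > "$load"
sqlite3 "$db" <<EOF
create table ident (probeid integer, gene_sym varchar(50));
.separator "\t"
.import $load ident
EOF
n=$(sqlite3 "$db" 'select count(*) from ident;')
echo "ident rows: $n"
rm -f "$load" "$db"
```

The same pattern scales to the real ident.sql.load - only the create-table
statement and the filenames change.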
Each comparison table will be specific for 3 replicates of Control vs
Experimental (C/E).  Generate the 4 comparison files, run each through
CyberT, fetch the result (wget), and save the downloaded ALLGENES output
under a matching name:

-------------------------------------------------------------------------
scut --id1='TAB' --c1='0 10 11 12 4 5 6' < 0410f-07_borre_plier.data > wt_ctrl__wt_stress.data
# run CyberT with wt_ctrl__wt_stress.data
mv ALLGENES wt_ctrl__wt_stress.ALLGENES
# checked the cols - ppde_p is > 0

scut --id1='TAB' --c1='0 10 11 12 7 8 9' < 0410f-07_borre_plier.data > wt_ctrl__ko_ctrl.data
# run CyberT with wt_ctrl__ko_ctrl.data
mv ALLGENES wt_ctrl__ko_ctrl.ALLGENES
# checked cols and the ppdes are all zeros (expected?: ctrl vs ctrl)

scut --id1='TAB' --c1='0 4 5 6 1 2 3' < 0410f-07_borre_plier.data > wt_stress__ko_stress.data
# run CyberT with wt_stress__ko_stress.data
mv ALLGENES wt_stress__ko_stress.ALLGENES
# checked cols and the ppdes are all zeros (expected? stress vs stress)

scut --id1='TAB' --c1='0 7 8 9 1 2 3' < 0410f-07_borre_plier.data > ko_ctrl__ko_stress.data
# run CyberT with ko_ctrl__ko_stress.data
mv ALLGENES ko_ctrl__ko_stress.ALLGENES
# checked the cols - ppde_p is > 0
-------------------------------------------------------------------------

If you have stale data tables from a previous attempt, drop them first (not
the 'ident' table), then re-create them with the schema shown below:

-------------------------------------------------------------------------
drop table ko_c__ko_s;
drop table ko_s__wt_s;
drop table wt_c__ko_c;
drop table wt_c__wt_s;
-------------------------------------------------------------------------
SQLite can't take the headers (or the '"'s that are in them), so trim them
off first.  'tail -n +6' starts output at line 6, trimming the top 5 lines
(the R/Cyber-T banner lines and the quoted header row); scut then keeps only
the columns we need (1:7, 10:22, 28:32):

-------------------------------------------------------------------------
tail -n +6 ko_ctrl__ko_stress.ALLGENES   | scut --id1='tab' --c1='1:7 10:22 28:32' > ko_ctrl__ko_stress.sqlinput
tail -n +6 wt_stress__ko_stress.ALLGENES | scut --id1='tab' --c1='1:7 10:22 28:32' > wt_stress__ko_stress.sqlinput
tail -n +6 wt_ctrl__ko_ctrl.ALLGENES     | scut --id1='tab' --c1='1:7 10:22 28:32' > wt_ctrl__ko_ctrl.sqlinput
tail -n +6 wt_ctrl__wt_stress.ALLGENES   | scut --id1='tab' --c1='1:7 10:22 28:32' > wt_ctrl__wt_stress.sqlinput
-------------------------------------------------------------------------

Each of the 4 comparison tables uses the same schema; the numbers in the
comments are the CyberT output columns that each field comes from:

-------------------------------------------------------------------------
create table yadda (
  probeid    integer,  --  1
  c1         float,    --  2
  c2         float,    --  3
  c3         float,    --  4
  e1         float,    --  5
  e2         float,    --  6
  e3         float,    --  7
  cmn        float,    -- 10
  emn        float,    -- 11
  csd        float,    -- 12
  esd        float,    -- 13
  clnmn      float,    -- 14
  elnmn      float,    -- 15
  clnsd      float,    -- 16
  elnsd      float,    -- 17
  cbaysd     float,    -- 18
  ebaysd     float,    -- 19
  cbaylnsd   float,    -- 20
  ebaylnsd   float,    -- 21
  bay_t      float,    -- 22
  bay_p      float,    -- 28
  bay_lnp    float,    -- 29
  fold       float,    -- 30
  ppde_lt_p  float,    -- 31
  ppde_p     float     -- 32
);
-------------------------------------------------------------------------
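To convince yourself of exactly what 'tail -n +6' keeps, here's a toy
8-line stand-in for an ALLGENES file (5 header lines, 3 data lines):

```shell
# tail -n +6 prints from line 6 onward, i.e. drops the first 5 lines.
f=/tmp/allgenes_demo_$$
printf 'hdr1\nhdr2\nhdr3\nhdr4\nhdr5\nrow1\nrow2\nrow3\n' > "$f"
kept=$(tail -n +6 "$f")
echo "$kept"
rm -f "$f"
```

Only 'row1', 'row2', and 'row3' survive - the '+N' form means "start at
line N", not "drop N lines", which is an easy off-by-one to make.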
To eyeball an excerpt of an .sqlinput file before committing to the import
(these pipelines are for inspection only and write nothing):

-------------------------------------------------------------------------
tail -n +6 wt_ctrl__ko_ctrl.ALLGENES   | head -22 | scut --id1='tab' --c1='1:7 10:22 28:32' | cols | less
tail -n +6 wt_ctrl__wt_stress.ALLGENES | head -22 | scut --id1='tab' --c1='1:7 10:22 28:32' | cols | less
-------------------------------------------------------------------------

Now we have to create the 4 tables in sqlite3 to hold the separate
comparisons, linked to the 'ident' table via the 'probeid' column, which is
essentially a foreign key.  The 4 tables will be ko_c__ko_s, wt_s__ko_s,
wt_c__ko_c, and wt_c__wt_s.  Each gets the same schema; for example:

-------------------------------------------------------------------------
create table wt_c__wt_s (
  probeid integer,
  c1 float,  c2 float,  c3 float,
  e1 float,  e2 float,  e3 float,
  cmn float,  emn float,  csd float,  esd float,
  clnmn float,  elnmn float,  clnsd float,  elnsd float,
  cbaysd float,  ebaysd float,  cbaylnsd float,  ebaylnsd float,
  bay_t float,  bay_p float,  bay_lnp float,
  fold float,  ppde_lt_p float,  ppde_p float
);
-------------------------------------------------------------------------

Then import the equivalent files into their tables; in sqlite:

-------------------------------------------------------------------------
.import "ko_ctrl__ko_stress.sqlinput"   ko_c__ko_s
.import "wt_stress__ko_stress.sqlinput" wt_s__ko_s
.import "wt_ctrl__ko_ctrl.sqlinput"     wt_c__ko_c
.import "wt_ctrl__wt_stress.sqlinput"   wt_c__wt_s
-------------------------------------------------------------------------

That seems to have gone OK - '.schema' confirms that all 5 tables (the 4
comparison tables plus 'ident') now exist with the intended columns.  So the
tables we have to work with are:

-------------------------------------------------------------------------
ident
ko_c__ko_s   (ppde_p all ZERO)
ko_s__wt_s   (ppde_p all ZERO)
wt_c__ko_c   (ppde_p > 0, but probeid is forked)
wt_c__wt_s   (ppde_p > 0, but probeid is forked)
-------------------------------------------------------------------------

So let's try a few queries:

-------------------------------------------------------------------------
select probeid from wt_s__ko_s where ppde_p > .2 and bay_p < 0.15 limit 5;  -- Nothing
select probeid from wt_c__ko_c where ppde_p > .2 and bay_p < 0.15 limit 5;  -- Nothing
select probeid from ko_c__ko_s where ppde_p > .2 and bay_p < 0.15 limit 5;  -- 5 probeids
select probeid from wt_c__wt_s where ppde_p > .2 and bay_p < 0.15 limit 5;  -- 5 probeids
select probeid from wt_s__ko_s where bay_p < 0.05 limit 5;                  -- yes
-------------------------------------------------------------------------
Joining a data table to the 'ident' table gets us annotated results:

-------------------------------------------------------------------------
select d.probeid, d.bay_p, i.chromo, i.mrna_desc
  from wt_s__ko_s d, ident i
  where d.bay_p < 0.05 and i.probeid = d.probeid limit 5;

select d.probeid, d.bay_p, d.ppde_p, i.mrna_desc
  from ko_c__ko_s d, ident i
  where d.bay_p < 0.01 and d.ppde_p > 0.98 and i.probeid = d.probeid limit 5;

-- sanity checks on each table
select probeid from ko_s__wt_s limit 5;
select probeid from wt_c__ko_c limit 5;
select probeid from wt_c__wt_s limit 5;
-------------------------------------------------------------------------

To recap the workflow: run CyberT, filling out the columns as needed; take
the CyberT output ALLGENES and save it as 'something_ALLGENES.txt'; then
process that ALLGENES file to take only the relevant data.  To get rid of
the '"' characters before import:

-------------------------------------------------------------------------
perl -p -i.bak -e 's/"//g' sql.input
-------------------------------------------------------------------------

Remember that the variables that come out of CyberT have to line up with the
order in which they're sucked into the SQLite table.

The comparisons, all using a sliding window of 101 and a confidence value
of 9 (an 'x' marks those already run):

-------------------------------------------------------------------------
x  ko_control vs wt_control
x  ko_stress  vs wt_stress
x  ko_ctrl    vs ko_stress
   wt_ctrl    vs wt_stress
-------------------------------------------------------------------------

For reference, the columns of the original data file:

-------------------------------------------------------------------------
 0  Probe Set ID
 1  stress_ko_n1    2  stress_ko_n2    3  stress_ko_n3
 4  stress_wt_n1    5  stress_wt_n2    6  stress_wt_n3
 7  ko_ctrl_o1      8  ko_ctrl_o2      9  ko_ctrl_o3
10  wt_ctrl_o1     11  wt_ctrl_o2     12  wt_ctrl_o3
-------------------------------------------------------------------------
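The two-table join pattern above can be demonstrated end-to-end on toy data.
The table names match the text, but the rows (probe IDs, p-values,
annotations) are invented for the sketch:

```shell
# Miniature wt_s__ko_s + ident join: only probe 101 passes the bay_p filter.
db=/tmp/join_demo_$$.db
sqlite3 "$db" <<'EOF'
create table wt_s__ko_s (probeid integer, bay_p float);
create table ident (probeid integer, chromo varchar(10), mrna_desc varchar(50));
insert into wt_s__ko_s values (101, 0.01);
insert into wt_s__ko_s values (102, 0.50);
insert into ident values (101, 'chr1', 'made-up annotation A');
insert into ident values (102, 'chr2', 'made-up annotation B');
EOF
res=$(sqlite3 "$db" \
  'select d.probeid, i.chromo from wt_s__ko_s d, ident i
     where d.bay_p < 0.05 and i.probeid = d.probeid;')
echo "$res"
rm -f "$db"
```

This prints '101|chr1' - the implicit-join WHERE clause is doing exactly
what the 'foreign key' description above promises: each surviving data row
picks up its annotation from 'ident' by probeid.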