Make Your Data Relational
=========================
by Harry Mangalam v1.03, Oct 28, 2015
:icons:

//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// this file is converted to HTML via the command:
// export fileroot="/home/hjm/nacs/MakeYourDataRelational"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.[ht]* moo:~/public_html;
// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html

In many domains, scientific and otherwise, you end up generating lots of
tables of data.  Many people keep such tables in spreadsheets (mostly
Microsoft Excel), where they are easily viewed in columns, sorted, and
cut/pasted into other spreadsheets, but they still remain 2D data.  It's
hard to 'relate' one set of data to another.

There is a data structure that has been optimised for making and
manipulating such data relations.  It's called, not surprisingly, a
'relational database', and it's even sorta-kinda-fairly easy to set up.  The
problem that many people have is creating the queries that let you pull the
desired data out of the database.  In the following few paragraphs, I'll
show how to go about creating the flat files that we'll use to populate the
database, creating the database itself, importing the files, and then
extracting some useful data relations from the database.

There might be tens of such 'input_files', totalling 10s to 100s of millions
of data points.  The question is: how to relate these variables to each
other without spending a lifetime pointing & clicking & sorting & re-sorting
columns.

This is NOT a tutorial on database 'Best Practices'.  It is definitely aimed
at 'quick & dirty' data manipulation so that you can get on with your work -
sort of a 'Commando Database Tutorial' for biologists (and others, but the
target is biologists).
.A comment about research and the command line
[NOTE]
=======================================================================
You've heard this before, but it's worth repeating: if you're going to do
digital biology (and by that I mean High Throughput or Large Scale
ANYTHING), you're going to have to know how to do things on the command
line.  If you don't - if you want an easy-to-use GUI to do 'pointy-clicky'
analysis - you'll end up 2 or 3 years behind the curve and you'll know a LOT
less about the process.  Read about this elsewhere, say at
http://www.software-carpentry.org[Software Carpentry].
=======================================================================

We're going to use a relational database engine called
http://www.sqlite.org[SQLite].  It differs from more familiar database
engines in that it's not a database 'server' - there's no internet security
to worry about because there is no internet access.  It provides only the
relational guts: tiny, very easy to set up, hardly any administration,
fairly fast, and quite adherent to the SQL standards.  If you want a
database engine for Amazon, it's not going to be SQLite, but for what we
want, it's perfect.
SQLite is quite easy to set up - essentially a single-file install.  I'll
leave the details to others to describe, but almost any Linux distribution
will have it available from the distribution repository; try:

-------------------------------------------------------------------------
sudo apt-get install sqlite3    # Debian/Ubuntu-based
# or
sudo yum install sqlite         # RedHat/CentOS-based
-------------------------------------------------------------------------

The input flatfiles
-------------------
Typically you'll have multiple large tables, either in a proprietary
spreadsheet format (Excel) or, more typically, ASCII text flatfiles in which
alphanumeric strings are separated by a common delimiter (commas,
tabs/whitespace), like this:

----------------------------------------------------------------------------
Line    Columns (up to 100 - 200) ---------------------------------------->
 1      0          1          2        3          4          5
 2      Probe      Id_1       GeneID   Set ID     stress_ko_n1
 3      10344614   NULL       NULL     138.0018   163.8001   140.7875
 4      10344616   NULL       YHE72    7.355559   7.886098   8.340074
 5      10344620   ITE9266    Adult    101.1456   84.57693   81.36497
 6      10344622   EG620009   A1       718.7775   627.4826   690.6035
 7      10344624   Lypla1     Mus      1132.786   1062.253   1121.712
 8      10344633   Tcea1      (SII)    1677.235   1660.797   1727.314
 9      10344637   Atp6v1h    V1       2735.426   2467.097   2579.454
10      10344653   Oprk1      1        185.5172   208.5798   250.5823
11      10344658   Rb1cc1     A2       1618.653   1580.918   1540.812
----------------------------------------------------------------------------

Alternatively, you might have a single huge flatfile with a large number of
columns and 30,000 or more rows, as in a typical gene expression data set.
In addition to the 'Probe', 'ID_1', and 'GeneID' columns shown above, there
might be 10s of identifier columns associated with the numeric data in a
very large matrix.  This makes it particularly difficult to figure out what
the relationships are from within a spreadsheet application.
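Once installed, a quick smoke test from the shell looks like this.  The
database file, the 'probes' table, and the values are throwaway examples
invented for illustration, not part of the workflow below:

```shell
# Throwaway smoke test: make a database, add a table, query it, delete it.
db=/tmp/sqlite_smoke_$$.db
sqlite3 "$db" <<'EOF'
create table probes (probeid integer, label varchar(100), expr1 float);
insert into probes values (10344616, 'YHE72',   7.355559);
insert into probes values (10344620, 'ITE9266', 101.1456);
EOF
n=$(sqlite3 "$db" 'select count(*) from probes;')
echo "rows loaded: $n"
rm -f "$db"
```

If this prints 'rows loaded: 2', sqlite3 is installed and working.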
I'm going to demonstrate with an input file called 'input_file' composed of
35,558 rows & 29 columns, for a total of ~1M values in 15.6MB, containing
the data from 4 experiments that needed to be sliced into 4 separate files
to submit to an analysis.  (It would have been formally possible to submit
the original data file, but uploading 16MB multiple times over a relatively
slow internet connection inhibited the process.  It was fairly easy to slice
the original data file into 4 much smaller data files of 2.2 MB each to feed
to the analysis site.)

The free utilities I used for doing this are called
http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[scut and cols] and their
use is described in that link.  The 4 files to upload were generated using
the 'Probe' ID tag (column 0), which provided a common reference key.  For
example, the first command below defines the input delimiter as a TAB and
selects the columns to be extracted as '0 4 5 6 1 2 3', which slices them
out and writes them in that order (cols '1 2 3' are the 'wt' data and cols
'4 5 6' are the 'ko' data).  The following commands process the same input
file and generate the other 3 sets of data.  This is specific to the CyberT
analysis, but by varying the column specification, it is portable to any
slice/dice operation.

----------------------------------------------------------------------------
scut --id1='TAB' --c1='0 4 5 6 1 2 3'    < input_file > wt_stress__ko_stress.upload
scut --id1='TAB' --c1='0 7 8 9 1 2 3'    < input_file > ko_ctrl__ko_stress.upload
scut --id1='TAB' --c1='0 10 11 12 4 5 6' < input_file > wt_ctrl__wt_stress.upload
scut --id1='TAB' --c1='0 10 11 12 7 8 9' < input_file > wt_ctrl__ko_ctrl.upload
----------------------------------------------------------------------------

So we have now subset the input data, creating 4 identically formatted
files.  These files are uploaded to the
http://cybert.microarray.ics.uci.edu/[CyberT Gene Expression analysis
website] for analysis.
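If you don't have scut installed, awk can do the same column shuffle.  A
rough stand-in for the first command above (remember scut counts columns
from 0 and awk from 1, so scut's '0 4 5 6 1 2 3' becomes awk's
$1,$5,$6,$7,$2,$3,$4; the one-line input here is a toy - a real run would
read input_file):

```shell
# awk stand-in for: scut --id1='TAB' --c1='0 4 5 6 1 2 3'
line=$(printf 'p1\ta\tb\tc\td\te\tf\n' |
       awk -F'\t' 'BEGIN{OFS="\t"} {print $1,$5,$6,$7,$2,$3,$4}')
echo "$line"
```

This prints 'p1  d  e  f  a  b  c' (tab-separated) - the ID column first,
then the two 3-column blocks swapped, just as scut would write them.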
For each 'upload' file above, we now have a CyberT output file that consists
of even MORE data.  An example is below:

----------------------------------------------------------------------------
   0            1          2          3          4          5      .. -> 33 cols of data
R version  R 2.2.1 (2     -          -          -          -
Cyber-T version 8.01      -          -          -          -
hdarray library vers      -          -          -          -
Date of this analysi      -          -          -          -
"Row#"     "Lab_0"    "C_1"      "C_2"      "C_3"      "E_4"
"26796"    10598071   2145.323   1793.633   1724.169   361.2002
"26793"    10598064   3754.996   2901.02    3028.524   754.0856
"21826"    10553092   2845.467   3001.603   2941.748   1885.066
"29145"    10339319   664.0526   658.5493   621.1962   1090.339
  |
  v   35K rows
----------------------------------------------------------------------------

CyberT has expanded each of the 4 upload files (~250K values each) into a
results file containing >1M values.  Here is where we have to do some triage
to reduce the number of variables we're carrying along.  Data reduction or
triage is a general process in analysis; as long as you keep copies of the
complete output, or can reiterate the process to regenerate the data, we're
fine.

We need to know which columns the CyberT output contains - this is critical
to be able to suck them into the database table we will define for them.  We
also have to cut down on the number of label names, ie (label // label //
label // label).

-------------------------------------------------------------------------
create table wtc_koc (
  row        integer not null,
  probeid    integer,
  label      varchar(100),
  CMn        float,
  EMn        float,
  Bayes_p    float,
  fold       float,
  PPDE_lt_p  float,
  PPDE_p     float
);

.import "sql.input" wtc_koc

-- to pull CMn, EMn, fold, Bayes_p, PPDE_lt_p where etc.
select probeid, label, CMn, EMn, fold, Bayes_p, PPDE_lt_p from wtc_koc
  where CMn != "NA" and Bayes_p < 0.05 and PPDE_lt_p > 0.9 limit 7;

-- to count how many records get returned
SELECT COUNT(fold) FROM
  (SELECT DISTINCT fold FROM wtc_koc where CMn != "NA" and PPDE_lt_p > 0.95);

-- prep the raw data to go into CyberT;
-- use the Probe Set ID as the identifier, not the label.
-------------------------------------------------------------------------
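As a sanity check, here is the triage query above run against a two-row toy
version of 'wtc_koc' (invented values, and only the columns the query
actually touches):

```shell
# Toy wtc_koc with made-up values: only the first row passes all 3 filters.
db=/tmp/wtc_demo_$$.db
sqlite3 "$db" <<'EOF'
create table wtc_koc (probeid integer, CMn float, Bayes_p float, PPDE_lt_p float);
insert into wtc_koc values (10344614, 138.0018, 0.01, 0.99);
insert into wtc_koc values (10344616, 7.355559, 0.20, 0.50);
EOF
hits=$(sqlite3 "$db" \
  'select count(*) from wtc_koc where CMn != "NA" and Bayes_p < 0.05 and PPDE_lt_p > 0.9;')
echo "passing probes: $hits"
rm -f "$db"
```

Note that SQLite quietly treats the double-quoted "NA" as the string 'NA'
when no column of that name exists, which is why the filter works as
written.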
We will need another table to correlate the probe ID with the labels, etc.,
and then a set of tables to carry the data from CyberT.  The numbers in the
comments below are the source columns in the original data file that each
field comes from:

-------------------------------------------------------------------------
create table ident (
  probeid           integer,      --  0
  gene_acc          varchar(50),  -- 13
  gene_sym          varchar(50),  -- 14
  gene_desc         varchar(50),  -- 15
  cytoband          varchar(20),  -- 16
  mrna_desc         varchar(50),  -- 17
  unigene           varchar(20),  -- 18
  chromo            varchar(10),  -- 19
  go_biopr_id       varchar(50),  -- 20
  go_biopr_term     varchar(50),  -- 21
  go_cellcomp_id    varchar(50),  -- 22
  go_cellcomp_term  varchar(50),  -- 23
  go_molfunc_id     varchar(50),  -- 24
  go_molfunc_term   varchar(50),  -- 25
  tot_probes        integer       -- 28
);
-------------------------------------------------------------------------

To create the load file for this table, slice out those columns, skipping
the header line:

-------------------------------------------------------------------------
tail -n +2 0410f-07_borre_plier.data | scut --id1='TAB' --c1='0 13:25 28' > ident.sql.load
-------------------------------------------------------------------------

Make sure that the file is intact at top and bottom.  (From line 35521 to
the bottom at 35558 there were no IDs besides the Probe Set ID, so I deleted
those lines.)  Then create the database 'em1', create the table with the SQL
above, and suck in the data:

-------------------------------------------------------------------------
sqlite3 em1
sqlite> .separator "\t"
sqlite> -- (paste the 'create table ident ...' statement from above)
sqlite> .import "ident.sql.load" ident
-------------------------------------------------------------------------

OK, so far so good.  Now we have to create the CyberT tables for each of the
comparisons and populate them with the correct data, using the 'probeid' as
the identifier, not any gene ID.  NB: the CyberT output has 33 columns; we
don't need them all.
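The .separator/.import step can be tried end-to-end on a tiny tab-separated
file.  This sketch uses a 2-column stand-in for 'ident' (invented probe/gene
pairs) rather than the full 15-column table:

```shell
# Toy version of the ident load: 2 columns instead of 15.
load=/tmp/ident_demo_$$.load
db=/tmp/ident_demo_$$.db
printf '10344616\tYHE72\n10344620\tITE9266\n' > "$load"
sqlite3 "$db" <<EOF
create table ident (probeid integer, gene_sym varchar(50));
.separator "\t"
.import $load ident
EOF
n=$(sqlite3 "$db" 'select count(*) from ident;')
echo "ident rows: $n"
rm -f "$load" "$db"
```

The same pattern scales to the real ident.sql.load - only the create-table
statement and the filenames change.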
Each comparison table will be specific for 3 replicates of Control vs
Experimental (C/E).  Generate the 4 comparison files, run each through
CyberT, fetch the result (wget), and save the downloaded ALLGENES output
under a matching name:

-------------------------------------------------------------------------
scut --id1='TAB' --c1='0 10 11 12 4 5 6' < 0410f-07_borre_plier.data > wt_ctrl__wt_stress.data
# run CyberT with wt_ctrl__wt_stress.data
mv ALLGENES wt_ctrl__wt_stress.ALLGENES
# checked the cols - ppde_p is > 0

scut --id1='TAB' --c1='0 10 11 12 7 8 9' < 0410f-07_borre_plier.data > wt_ctrl__ko_ctrl.data
# run CyberT with wt_ctrl__ko_ctrl.data
mv ALLGENES wt_ctrl__ko_ctrl.ALLGENES
# checked cols and the ppdes are all zeros (expected?: ctrl vs ctrl)

scut --id1='TAB' --c1='0 4 5 6 1 2 3' < 0410f-07_borre_plier.data > wt_stress__ko_stress.data
# run CyberT with wt_stress__ko_stress.data
mv ALLGENES wt_stress__ko_stress.ALLGENES
# checked cols and the ppdes are all zeros (expected? stress vs stress)

scut --id1='TAB' --c1='0 7 8 9 1 2 3' < 0410f-07_borre_plier.data > ko_ctrl__ko_stress.data
# run CyberT with ko_ctrl__ko_stress.data
mv ALLGENES ko_ctrl__ko_stress.ALLGENES
# checked the cols - ppde_p is > 0
-------------------------------------------------------------------------

If you have stale data tables from a previous attempt, drop them first (not
the 'ident' table), then re-create them with the schema shown below:

-------------------------------------------------------------------------
drop table ko_c__ko_s;
drop table ko_s__wt_s;
drop table wt_c__ko_c;
drop table wt_c__wt_s;
-------------------------------------------------------------------------
SQLite can't take the headers (or the '"'s that are in them), so trim them
off first.  'tail -n +6' starts output at line 6, trimming the top 5 lines
(the R/Cyber-T banner lines and the quoted header row); scut then keeps only
the columns we need (1:7, 10:22, 28:32):

-------------------------------------------------------------------------
tail -n +6 ko_ctrl__ko_stress.ALLGENES   | scut --id1='tab' --c1='1:7 10:22 28:32' > ko_ctrl__ko_stress.sqlinput
tail -n +6 wt_stress__ko_stress.ALLGENES | scut --id1='tab' --c1='1:7 10:22 28:32' > wt_stress__ko_stress.sqlinput
tail -n +6 wt_ctrl__ko_ctrl.ALLGENES     | scut --id1='tab' --c1='1:7 10:22 28:32' > wt_ctrl__ko_ctrl.sqlinput
tail -n +6 wt_ctrl__wt_stress.ALLGENES   | scut --id1='tab' --c1='1:7 10:22 28:32' > wt_ctrl__wt_stress.sqlinput
-------------------------------------------------------------------------

Each of the 4 comparison tables uses the same schema; the numbers in the
comments are the CyberT output columns that each field comes from:

-------------------------------------------------------------------------
create table yadda (
  probeid    integer,  --  1
  c1         float,    --  2
  c2         float,    --  3
  c3         float,    --  4
  e1         float,    --  5
  e2         float,    --  6
  e3         float,    --  7
  cmn        float,    -- 10
  emn        float,    -- 11
  csd        float,    -- 12
  esd        float,    -- 13
  clnmn      float,    -- 14
  elnmn      float,    -- 15
  clnsd      float,    -- 16
  elnsd      float,    -- 17
  cbaysd     float,    -- 18
  ebaysd     float,    -- 19
  cbaylnsd   float,    -- 20
  ebaylnsd   float,    -- 21
  bay_t      float,    -- 22
  bay_p      float,    -- 28
  bay_lnp    float,    -- 29
  fold       float,    -- 30
  ppde_lt_p  float,    -- 31
  ppde_p     float     -- 32
);
-------------------------------------------------------------------------
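To convince yourself of exactly what 'tail -n +6' keeps, here's a toy
8-line stand-in for an ALLGENES file (5 header lines, 3 data lines):

```shell
# tail -n +6 prints from line 6 onward, i.e. drops the first 5 lines.
f=/tmp/allgenes_demo_$$
printf 'hdr1\nhdr2\nhdr3\nhdr4\nhdr5\nrow1\nrow2\nrow3\n' > "$f"
kept=$(tail -n +6 "$f")
echo "$kept"
rm -f "$f"
```

Only 'row1', 'row2', and 'row3' survive - the '+N' form means "start at
line N", not "drop N lines", which is an easy off-by-one to make.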
To eyeball an excerpt of an .sqlinput file before committing to the import
(these pipelines are for inspection only and write nothing):

-------------------------------------------------------------------------
tail -n +6 wt_ctrl__ko_ctrl.ALLGENES   | head -22 | scut --id1='tab' --c1='1:7 10:22 28:32' | cols | less
tail -n +6 wt_ctrl__wt_stress.ALLGENES | head -22 | scut --id1='tab' --c1='1:7 10:22 28:32' | cols | less
-------------------------------------------------------------------------

Now we have to create the 4 tables in sqlite3 to hold the separate
comparisons, linked to the 'ident' table via the 'probeid' column, which is
essentially a foreign key.  The 4 tables will be ko_c__ko_s, wt_s__ko_s,
wt_c__ko_c, and wt_c__wt_s.  Each gets the same schema; for example:

-------------------------------------------------------------------------
create table wt_c__wt_s (
  probeid integer,
  c1 float,  c2 float,  c3 float,
  e1 float,  e2 float,  e3 float,
  cmn float,  emn float,  csd float,  esd float,
  clnmn float,  elnmn float,  clnsd float,  elnsd float,
  cbaysd float,  ebaysd float,  cbaylnsd float,  ebaylnsd float,
  bay_t float,  bay_p float,  bay_lnp float,
  fold float,  ppde_lt_p float,  ppde_p float
);
-------------------------------------------------------------------------

Then import the equivalent files into their tables; in sqlite:

-------------------------------------------------------------------------
.import "ko_ctrl__ko_stress.sqlinput"   ko_c__ko_s
.import "wt_stress__ko_stress.sqlinput" wt_s__ko_s
.import "wt_ctrl__ko_ctrl.sqlinput"     wt_c__ko_c
.import "wt_ctrl__wt_stress.sqlinput"   wt_c__wt_s
-------------------------------------------------------------------------

That seems to have gone OK - '.schema' confirms that all 5 tables (the 4
comparison tables plus 'ident') now exist with the intended columns.  So the
tables we have to work with are:

-------------------------------------------------------------------------
ident
ko_c__ko_s   (ppde_p all ZERO)
ko_s__wt_s   (ppde_p all ZERO)
wt_c__ko_c   (ppde_p > 0, but probeid is forked)
wt_c__wt_s   (ppde_p > 0, but probeid is forked)
-------------------------------------------------------------------------

So let's try a few queries:

-------------------------------------------------------------------------
select probeid from wt_s__ko_s where ppde_p > .2 and bay_p < 0.15 limit 5;  -- Nothing
select probeid from wt_c__ko_c where ppde_p > .2 and bay_p < 0.15 limit 5;  -- Nothing
select probeid from ko_c__ko_s where ppde_p > .2 and bay_p < 0.15 limit 5;  -- 5 probeids
select probeid from wt_c__wt_s where ppde_p > .2 and bay_p < 0.15 limit 5;  -- 5 probeids
select probeid from wt_s__ko_s where bay_p < 0.05 limit 5;                  -- yes
-------------------------------------------------------------------------
Joining a data table to the 'ident' table gets us annotated results:

-------------------------------------------------------------------------
select d.probeid, d.bay_p, i.chromo, i.mrna_desc
  from wt_s__ko_s d, ident i
  where d.bay_p < 0.05 and i.probeid = d.probeid limit 5;

select d.probeid, d.bay_p, d.ppde_p, i.mrna_desc
  from ko_c__ko_s d, ident i
  where d.bay_p < 0.01 and d.ppde_p > 0.98 and i.probeid = d.probeid limit 5;

-- sanity checks on each table
select probeid from ko_s__wt_s limit 5;
select probeid from wt_c__ko_c limit 5;
select probeid from wt_c__wt_s limit 5;
-------------------------------------------------------------------------

To recap the workflow: run CyberT, filling out the columns as needed; take
the CyberT output ALLGENES and save it as 'something_ALLGENES.txt'; then
process that ALLGENES file to take only the relevant data.  To get rid of
the '"' characters before import:

-------------------------------------------------------------------------
perl -p -i.bak -e 's/"//g' sql.input
-------------------------------------------------------------------------

Remember that the variables that come out of CyberT have to line up with the
order in which they're sucked into the SQLite table.

The comparisons, all using a sliding window of 101 and a confidence value
of 9 (an 'x' marks those already run):

-------------------------------------------------------------------------
x  ko_control vs wt_control
x  ko_stress  vs wt_stress
x  ko_ctrl    vs ko_stress
   wt_ctrl    vs wt_stress
-------------------------------------------------------------------------

For reference, the columns of the original data file:

-------------------------------------------------------------------------
 0  Probe Set ID
 1  stress_ko_n1    2  stress_ko_n2    3  stress_ko_n3
 4  stress_wt_n1    5  stress_wt_n2    6  stress_wt_n3
 7  ko_ctrl_o1      8  ko_ctrl_o2      9  ko_ctrl_o3
10  wt_ctrl_o1     11  wt_ctrl_o2     12  wt_ctrl_o3
-------------------------------------------------------------------------
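The two-table join pattern above can be demonstrated end-to-end on toy data.
The table names match the text, but the rows (probe IDs, p-values,
annotations) are invented for the sketch:

```shell
# Miniature wt_s__ko_s + ident join: only probe 101 passes the bay_p filter.
db=/tmp/join_demo_$$.db
sqlite3 "$db" <<'EOF'
create table wt_s__ko_s (probeid integer, bay_p float);
create table ident (probeid integer, chromo varchar(10), mrna_desc varchar(50));
insert into wt_s__ko_s values (101, 0.01);
insert into wt_s__ko_s values (102, 0.50);
insert into ident values (101, 'chr1', 'made-up annotation A');
insert into ident values (102, 'chr2', 'made-up annotation B');
EOF
res=$(sqlite3 "$db" \
  'select d.probeid, i.chromo from wt_s__ko_s d, ident i
     where d.bay_p < 0.05 and i.probeid = d.probeid;')
echo "$res"
rm -f "$db"
```

This prints '101|chr1' - the implicit-join WHERE clause is doing exactly
what the 'foreign key' description above promises: each surviving data row
picks up its annotation from 'ident' by probeid.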