TCGA breast cancer data analysis

For all breast cancer patients, we have genome-wide amplification and transcript level data. Patients and samples are identified by their TCGA “barcodes”. There is relevant meta-data embedded in these barcodes.

Amplification and transcript levels

Our collaborator Riku Louhimo has pre-processed the genome-wide data available from TCGA for all breast cancer samples and provided two sets of data. Both sets have one gene on every row, with each field representing one sample.

The first set contains DNA amplification calls: a matrix of 1 (Amplified), 0 (Not amplified), or NA (Not available). The other set contains mRNA transcript levels, a matrix of floating-point values or NA (Not available).

For both files, the rows are labelled with the Ensembl ID of each gene and the columns are labelled with TCGA barcodes representing patients or samples.

Listing 1: rawdata.size

21M	raw/ampl.tsv
187M	raw/mrna.tsv
207M	total

Raw data file sizes. The size of each raw data file, in a human-readable format.

C1: ⟪ :make ⟫ ≡
root chunk
definition continued in ↓ C2 + ⇣ C15 + ⇣ C16 + ⇣ C19 + ⇣ C22 + ⇣ C47 + ⇣ C60 + ⇣ C61 + ⇣ C62 + ⇣ C63 + ⇣ C64 + ⇣ C75 + ⇣ C77 + ⇣ C82 + ⇣ C87 + ⇣ C89 + ⇣ C90 + ⇣ C97

rawdata.size: raw/ampl.tsv raw/mrna.tsv
	du --dereference --human-readable --total $^ > $@

⇩ C2

Both files are large and stored as plain text, so parsing them on every run is slow. To avoid this, we load them once in R and save the data as a native R object that is both smaller and much faster to load.

Listing 2: input-data.Rsave.size

82M	input-data.Rsave

Size of the native R data object. Once a tab-separated plain-text data matrix has been loaded and saved as a native R object, it becomes much smaller.

C2: ⟪ :make ⟫ ≡
⇧ C1

input-data.Rsave.size: input-data.Rsave
	du --human-readable $^ > $@

⇩ C15

To keep the intermediate data object, we declare it as SECONDARY in the Makefile.

C3: ⟪ :make The R saved state ⟫ ≡
root chunk

.SECONDARY: input-data.Rsave
input-data.Rsave: input-data.R \
                  ampl.rownames \
                  ampl.colnames \
                  ampl.vals \
                  mrna.rownames \
                  mrna.colnames \
                  mrna.vals
	Rscript --vanilla $^ $@

◼

This script reads the plain-text data matrices and saves them as native R objects. It takes the names of all files as command-line arguments.

C4: ⟪ input-data.R ⟫ ≡
root chunk

args <- commandArgs(trailingOnly = TRUE)
names(args) <- c("ampl.rownames",
                 "ampl.colnames",
                 "ampl.vals",
                 "mrna.rownames",
                 "mrna.colnames",
                 "mrna.vals",
                 "save.file")

⟪ Load the amplification matrix ⟫
⟪ Load the mRNA matrix ⟫

save(ampl, mrna, file=args["save.file"])

◼

In the chunk below, the second line sets all NA values to 0 (no amplification). The third line converts the amplification calls to truth values: FALSE for basal level and TRUE for amplification of the gene.

C5: ⟪ Load the amplification matrix ⟫ ≡
Appears in C4

ampl <- scan(file=args["ampl.vals"], what=integer())
ampl[is.na(ampl)] <- 0
ampl <- as.logical(ampl)
ampl.rownames <- scan(file=args["ampl.rownames"], what=character())
ampl.colnames <- scan(file=args["ampl.colnames"], what=character())

ampl <- matrix(ampl,
               nrow=length(ampl.rownames), ncol=length(ampl.colnames),
               dimnames=list(ampl.rownames, ampl.colnames),
               byrow=TRUE)

◼

C6: ⟪ Load the mRNA matrix ⟫ ≡
Appears in C4

mrna <- scan(file=args["mrna.vals"], what=numeric())
mrna.rownames <- scan(file=args["mrna.rownames"], what=character())
mrna.colnames <- scan(file=args["mrna.colnames"], what=character())

mrna <- matrix(mrna,
               nrow=length(mrna.rownames), ncol=length(mrna.colnames),
               dimnames=list(mrna.rownames, mrna.colnames),
               byrow=TRUE)

◼

C7: ⟪ :make Matrix values and row and column labels ⟫ ≡
root chunk

%.vals: get-vals.bash raw/%.tsv
	bash $^ > $@

%.rownames: get-rownames.bash raw/%.tsv
	bash $^ > $@

%.colnames: get-colnames.bash raw/%.tsv
	bash $^ > $@

◼

C8: ⟪ get-vals.bash ⟫ ≡
root chunk

< "$1" ⟪ Drop first row ⟫ | ⟪ Drop first column ⟫

◼

C9: ⟪ get-rownames.bash ⟫ ≡
root chunk

< "$1" ⟪ Drop first row ⟫ | ⟪ Get first column ⟫

◼

C10: ⟪ get-colnames.bash ⟫ ≡
root chunk

< "$1" ⟪ Get first row ⟫ | ⟪ Drop first column ⟫

◼

C11: ⟪ Drop first row ⟫ ≡
Appears in C8, C9, C62, C63, C65, C89

sed '1d'

◼

C12: ⟪ Get first row ⟫ ≡
Appears in C10, C60, C61, C87

sed '1q'

◼

C13: ⟪ Drop first column ⟫ ≡
Appears in C8, C10, C65

cut --complement --fields=1

◼

C14: ⟪ Get first column ⟫ ≡
Appears in C9, C15

cut --fields=1

◼
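
As an illustration of how the three extraction scripts fit together, here is a minimal sketch on a made-up matrix. The file content, the row labels G1 and G2, and the column labels P1 and P2 are assumptions for this example only, and the shell functions are hypothetical stand-ins for get-vals.bash, get-rownames.bash, and get-colnames.bash. The sketch uses `sed '1d'`, `sed '1q'`, and `cut -f`, which should behave like the chunks above on tab-separated input.

```shell
# Toy matrix in the same layout as the raw files: a header row with
# column labels, then one gene per row. All labels and values are made up.
toy=$(mktemp)
trap 'rm -f "$toy"' EXIT
printf 'gene\tP1\tP2\nG1\t1\t0\nG2\tNA\t1\n' > "$toy"

# Hypothetical stand-ins for the three helper scripts.
get_vals()     { sed '1d' "$1" | cut -f 2-; }  # drop first row, drop first column
get_rownames() { sed '1d' "$1" | cut -f 1; }   # drop first row, keep first column
get_colnames() { sed '1q' "$1" | cut -f 2-; }  # keep first row, drop first column

get_vals "$toy"
get_rownames "$toy"
get_colnames "$toy"
```

Because the values are written row by row, the R script has to fill its matrix with byrow=TRUE, as it does.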

Mapping gene IDs to names

All genes are identified by their Ensembl IDs. It would be convenient to be able to select genes based on the canonical names of their protein products. To do that, we downloaded a table that maps Ensembl gene IDs to a short description of the gene and the associated gene name. This was done using the “Customize your download” feature on the Ensembl downloads page, with the following settings:

Attributes: “Ensembl Gene ID”, “Description”, and “Associated Gene Name”.

The results of the query were downloaded as a tab-separated (TSV) file with a header and saved as ensembl/id_descr_name.tsv.

We can now check whether any gene ID in either of the two data matrices is missing from this list.

Listing 3: mystery-counts

 1729 mystery-genes-ampl
 1815 mystery-genes-mrna
 3544 total

Number of genes without names. There are quite a few genes without names; about 10% of the total.

C15: ⟪ :make ⟫ ≡
⇧ C2

mystery-counts: mystery-genes-ampl mystery-genes-mrna
	wc --lines $^ > $@

mystery-genes-%: sorted-genes-% sorted-genes-known
	comm -2 -3 $^ > $@

sorted-genes-%: %.rownames
	sort --unique $^ > $@

sorted-genes-known: ensembl/id_descr_name.tsv
	⟪ Get first column ⟫ $< | sort --unique > $@

⇩ C16
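
The comparison behind the mystery-genes-% rule can be tried on two small lists. This is only a sketch: the IDs are made up, and the function name `mystery` is hypothetical. It relies on `comm -23`, which prints the lines that occur only in the first of two sorted files.

```shell
# comm needs both inputs sorted in the same collation order.
export LC_ALL=C

known=$(mktemp)
matrix=$(mktemp)
trap 'rm -f "$known" "$matrix"' EXIT

# Made-up gene IDs: one list from the ID table, one from a data matrix.
printf 'ENSG1\nENSG2\nENSG3\n' | sort -u > "$known"
printf 'ENSG2\nENSG9\nENSG1\n' | sort -u > "$matrix"

# Lines unique to the first file: IDs in the matrix without a known name.
mystery() { comm -23 "$1" "$2"; }

mystery "$matrix" "$known"
```

Here only ENSG9 is reported, since it is the one ID in the matrix list that does not appear in the list of known genes.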

We can already do a sanity check of the data. We compare two genes that we know should be co-amplified in almost all samples and whose transcript levels should correlate very well: ERBB2 and STARD3. We look up their IDs in the file ensembl/id_descr_name.tsv and use these IDs to pick out the amplification and mRNA data from the saved R data object input-data.Rsave. This serves as an informal validation of all work so far.

Figure 1: sanity-check.svg

Sanity check. As expected, STARD3 and ERBB2 are almost always co-amplified and their transcript levels correlate very strongly.

C16: ⟪ :make ⟫ ≡
⇧ C15

sanity-check.ids: sanity-check.bash ensembl/id_descr_name.tsv
	bash $^ > $@

⇩ C19

C17: ⟪ sanity-check.bash ⟫ ≡
root chunk

awk -v name='^STARD3$' '⟪ ID from name .awk ⟫' "$1"
awk -v name='^ERBB2$' '⟪ ID from name .awk ⟫' "$1"

◼

C18: ⟪ ID from name .awk ⟫ ≡
Appears in C17

BEGIN { FS = "\t"; OFS = "\t" }
$3 ~ name {
    print $3, $1
    exit
}

◼
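
The anchors in the patterns '^STARD3$' and '^ERBB2$' matter, because the Awk program prints the first row whose name matches and then exits. A small sketch shows the difference; the table rows and the IDs ID_A and ID_B are made up, and the function `id_from_name` is a hypothetical wrapper around the same Awk program.

```shell
# Made-up excerpt in the layout of the ID table: ID, description, name.
table=$(mktemp)
trap 'rm -f "$table"' EXIT
printf 'Ensembl Gene ID\tDescription\tAssociated Gene Name\n' > "$table"
printf 'ID_A\tmade-up description\tERBB2IP\n' >> "$table"
printf 'ID_B\tmade-up description\tERBB2\n' >> "$table"

# Print name and ID of the first row whose third field matches the pattern.
id_from_name() {
    awk -v name="$1" 'BEGIN { FS = "\t"; OFS = "\t" }
                      $3 ~ name { print $3, $1; exit }' "$2"
}

id_from_name '^ERBB2$' "$table"  # anchored: matches only the exact name
id_from_name 'ERBB2' "$table"    # unanchored: the first partial match wins
```

With the anchored pattern the row for ERBB2 is returned; without the anchors the earlier row for ERBB2IP would be picked up instead.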

C19: ⟪ :make ⟫ ≡
⇧ C16

sanity-check.svg: sanity-check.R input-data.Rsave sanity-check.ids
	Rscript --vanilla $^ $@

⇩ C22

C20: ⟪ sanity-check.R ⟫ ≡
root chunk

args.names <- c("save.file", "ids.file", "fig.file")
⟪ Read and name R command line arguments ⟫

ids.table <- read.delim(args["ids.file"], header=F, row.names=1)
ensembl.id <- as.character(ids.table[,1])
names(ensembl.id) <- rownames(ids.table)

load(args["save.file"])

ampl[ensembl.id["STARD3"],] -> stard3.ampl
mrna[ensembl.id["STARD3"],] -> stard3.mrna
ampl[ensembl.id["ERBB2"],] -> erbb2.ampl
mrna[ensembl.id["ERBB2"],] -> erbb2.mrna

chisq.test(erbb2.ampl, stard3.ampl) -> chisq.test.result

svg(filename=args["fig.file"], width=7, height=4.7, pointsize=11)
layout(matrix(c(1,2), nrow=1, ncol=2, byrow=T),
       widths=c(2,3),
       heights=c(1,1))

# amplification
ampl.c <- hcl(c(140,320), l=80)
barplot(t(chisq.test.result$observed), beside=T, col=ampl.c,
        xlab="ERBB2 ampl",
        ylab="number of patients")
legend("topright",
       title="STARD3 ampl",
       legend=c("FALSE", "TRUE"),
       fill=ampl.c, border=ampl.c,
       bty="n",
       cex=.9)

# mRNA
plot(erbb2.mrna, stard3.mrna,
     xlab="ERBB2 mRNA level", ylab="STARD3 mRNA level",
     bty="n")

dev.off() -> foo

◼

C21: ⟪ Read and name R command line arguments ⟫ ≡
Appears in C20, C49, C52, C76, C78, C83, C94, C98, C100

args <- commandArgs(trailingOnly=T)
names(args) <- args.names

◼

Meta data: TCGA, Ensembl

Listing 4: tcga.db.schema

CREATE TABLE patient_survival (
  "Barcode" TEXT,
  "Event" TEXT,
  "Days" INTEGER
);
CREATE TABLE mirna (
  mirna TEXT,
  gene TEXT,
  refseqid TEXT,
  sum INTEGER
);
CREATE TABLE ampl_rownames(
  "Ensembl Gene ID" TEXT
);
CREATE TABLE mrna_rownames(
  "Ensembl Gene ID" TEXT
);
CREATE TABLE portion_analyte(
  "Code" TEXT,
  "Definition" TEXT
);
CREATE TABLE sample_type(
  "Code" TEXT,
  "Definition" TEXT,
  "Short Letter Code" TEXT
);
CREATE TABLE tissue_source_site(
  "TSS Code" TEXT,
  "Source Site" TEXT,
  "Study Name" TEXT,
  "BCR" TEXT
);
CREATE TABLE id_descr_name(
  "Ensembl Gene ID" TEXT,
  "Description" TEXT,
  "Associated Gene Name" TEXT
);
CREATE TABLE ampl_cols(
  "Barcode" TEXT,
  "TSS Code" TEXT,
  "Patient" TEXT
);
CREATE TABLE mrna_cols(
  "Barcode" TEXT,
  "TSS Code" TEXT,
  "Patient" TEXT,
  "Sample" TEXT,
  "Vial" TEXT,
  "Portion" TEXT,
  "Analyte" TEXT
);

The complete schema so far.

This is the complete schema for the database, generated automatically from the database file.

C22: ⟪ :make ⟫ ≡
⇧ C19

tcga.db.schema: tcga.db
	echo '.schema' | sqlite3 $< > $@

⇩ C47

To insert tables into the database, we directly import tab-separated files with headers into a table named after the file, with the .tsv extension stripped. This uses a useful command of the SQLite shell, .import. When no table with that name has been defined yet, the header line of the imported file provides the column names.

C23: ⟪ create-table.bash ⟫ ≡
root chunk

NAME=$(basename --suffix=.tsv "$1")
echo 'DROP TABLE IF EXISTS "'"$NAME"'";'
echo '.separator "\t"'
echo ".import '$1' $NAME"

◼

This defines an implicit rule that generates the SQL code for inserting a table, using ⟪ create-table.bash ⟫ defined above, if the appropriate .tsv file exists.

C24: ⟪ :make Implicit rule for SQL tables from files with headers ⟫ ≡
root chunk

%.create-table.sql: create-table.bash %.tsv
	bash $^ > $@

◼
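
To see what such a generated SQL file contains, the script can be imitated with a shell function. This is a sketch under assumptions: `create_table_sql` is a hypothetical stand-in for create-table.bash, the path tcga/sample_type.tsv is assumed from the table name in the schema, and `printf` replaces `echo` so that the backslash in "\t" is passed through unchanged in any POSIX shell.

```shell
# Emit the SQLite shell commands that (re)create a table from a TSV file.
# The file does not need to exist; only its name is used here.
create_table_sql() {
    name=$(basename "$1" .tsv)
    printf '%s\n' 'DROP TABLE IF EXISTS "'"$name"'";'
    printf '%s\n' '.separator "\t"'
    printf '%s\n' ".import '$1' $name"
}

create_table_sql tcga/sample_type.tsv
```

The three lines of output drop any old copy of the table, set the field separator to a tab, and import the file into a table named after the file.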

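This is the rule that builds and updates the database. Because the recipe uses the automatic variable $?, which expands to the prerequisites that are newer than the target, only the SQL files that changed since the last build are applied to the database.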

C25: ⟪ :make Populate the relational database ⟫ ≡
root chunk

tcga.db:
	cat $? | sqlite3 $@

◼

In the amplification data matrix, column labels denote patients. In the transcript level matrix, column labels denote samples. Both patients and samples are labeled by a TCGA barcode. Those barcodes contain meta-information: for example, using the barcode, we can separate solid tumor samples from healthy tissue control samples in the transcript level data matrix.

The different identifiers within a barcode are described in “code table reports” available for download. Three code table reports are relevant for us; they end up in the database as the tables portion_analyte, sample_type, and tissue_source_site:

C26: ⟪ :make Code table reports to the database ⟫ ≡
root chunk

tcgatables := $(patsubst %.tsv,%.create-table.sql,$(wildcard tcga/*.tsv))
tcga.db: $(tcgatables)

◼

C27: ⟪ tumors.query.sql ⟫ ≡
root chunk

SELECT * FROM sample_type
WHERE "Definition" LIKE '%tumor%';

◼

Table 1: tumors.sql-result

Code	Definition	Short Letter Code
01	Primary solid Tumor	TP
02	Recurrent Solid Tumor	TR
08	Human Tumor Original Cells	THOC

Tumor sample types. We are most interested in the type “Primary solid Tumor”.

C28: ⟪ :make Run a query on the relational database ⟫ ≡
root chunk

%.sql-result: tcga.db sql-result.bash %.query.sql
	bash sql-result.bash $*.query.sql \
	    | sqlite3 tcga.db \
	    > $@

◼

A script that runs a query so that the results are returned as an HTML table with headers.

C29: ⟪ sql-result.bash ⟫ ≡
root chunk

echo '.mode html'
echo '.headers on'
cat "$1"

◼

All that is left is to link the barcodes used in the two data matrices to these tables. To do that, we use two tables that have the complete barcode as a unique identifier, and the fields of the barcode as the rest of the columns.

C30: ⟪ ampl.colnames.awk ⟫ ≡
root chunk

BEGIN {
    FS = "-"; OFS = "\t"
    print "Barcode", "TSS Code", "Patient"
}
{ print $0, $2, $3 }

◼

C31: ⟪ mrna.colnames.awk ⟫ ≡
root chunk

BEGIN {
    FS = "-"; OFS = "\t"
    print "Barcode", "TSS Code", "Patient",
          "Sample", "Vial", "Portion", "Analyte"
}
{
    sample_type = substr($4, 1, 2);
    vial = substr($4, 3, 1);
    portion = substr($5, 1, 2);
    analyte = substr($5, 3, 1);
    print $0, $2, $3,
          sample_type, vial, portion, analyte;
}

◼
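
The field extraction can be checked on a single barcode. The barcode below is made up but follows the layout that the Awk program assumes: dash-separated fields, with the sample type and vial packed into the fourth field and the portion and analyte into the fifth. The function `split_barcode` is a hypothetical one-line version of mrna.colnames.awk without the header.

```shell
# Split a sample barcode into the columns of the mrna_cols table.
split_barcode() {
    awk 'BEGIN { FS = "-"; OFS = "\t" }
         {
             print $0, $2, $3,
                   substr($4, 1, 2), substr($4, 3, 1),
                   substr($5, 1, 2), substr($5, 3, 1)
         }'
}

# Made-up barcode: TSS code XX, patient 0001, sample 01, vial A,
# portion 11, analyte R.
printf 'TCGA-XX-0001-01A-11R-0000-07\n' | split_barcode
```

The sample type code in the fourth column (here 01) is what later joins against the "Code" column of the sample_type table.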

C32: ⟪ :make Matrix column labels to files ⟫ ≡
root chunk

%_cols.tsv: %.colnames %.colnames.awk
	< $*.colnames ⟪ Tab-separated fields to lines ⟫  \
	    | awk -f $*.colnames.awk \
	    > $@

◼

C33: ⟪ Tab-separated fields to lines ⟫ ≡
Appears in C32, C61, C65, C87

tr "\t" "\n"

◼

C34: ⟪ :make Matrix column label files to the database ⟫ ≡
root chunk

tcga.db: ampl_cols.create-table.sql mrna_cols.create-table.sql

◼

C35: ⟪ samples.query.sql ⟫ ≡
root chunk

SELECT
  count("Sample") as n,
  "Definition"
FROM mrna_cols
  INNER JOIN sample_type
    ON mrna_cols."Sample" = sample_type."Code"
GROUP BY "Definition"
ORDER BY n DESC;

◼

Table 2: samples.sql-result

n	Definition
524	Primary solid Tumor
59	Solid Tissue Normal

Sample types in transcript levels matrix. Most of the samples are from tumors, but some have associated normal tissue controls.

C36: ⟪ patients-in-mrna.query.sql ⟫ ≡
root chunk

SELECT count(*) as "Number of patients" FROM (
  SELECT DISTINCT "TSS Code", "Patient"
  FROM mrna_cols
);

◼

Table 3: patients-in-mrna.sql-result

Number of patients
526

C37: ⟪ patients-with-control.query.sql ⟫ ≡
root chunk

SELECT count(*) AS n FROM (
  SELECT x."TSS Code", x."Patient"
  FROM mrna_cols AS x
    INNER JOIN mrna_cols AS y
      ON x."TSS Code" = y."TSS Code"
        AND x."Patient" = y."Patient"
        AND x."Sample" < y."Sample"
);

◼

Table 4: patients-with-control.sql-result

n
57

Number of patients with control sample. This counts all patients that have both a tumor and a normal solid tissue sample in the transcript level matrix.

How many patients from the amplification matrix also have a tumor sample in the transcript level matrix?

C38: ⟪ patients-tumors.query.sql ⟫ ≡
root chunk

SELECT count(*) AS 'Cancer patient count'
FROM ( ⟪ ampl-mrna.fragment.sql ⟫ );

◼

This SQL query fragment creates a table with two columns. The ampl column holds the patient barcodes, and the mrna column holds the solid tumor sample barcodes of the same patients.

C39: ⟪ ampl-mrna.fragment.sql ⟫ ≡
Appears in C38, C48, C72, C73

SELECT
  ampl_cols."Barcode" AS ampl,
  mrna_cols."Barcode" AS mrna
FROM ampl_cols
  INNER JOIN mrna_cols
    USING ("TSS Code", "Patient")
  INNER JOIN sample_type
    ON mrna_cols."Sample" = sample_type."Code"
WHERE sample_type."Definition" = 'Primary solid Tumor'

◼

Table 5: patients-tumors.sql-result

Cancer patient count
516

Useful number of patients. These are the patients that have an amplification status, and a transcript level from a solid tumor sample.

Finally, for convenience, we add the Ensembl ID to Gene name mapping downloaded from Ensembl to the database, too. Even though we are able to query this file with the help of Awk and Bash, as demonstrated by ⟪ sanity-check.bash ⟫ and ⟪ ID from name .awk ⟫, it is unnecessarily hacky.

C40: ⟪ :make Ensembl IDs and names to the database ⟫ ≡
root chunk

tcga.db: ensembl/id_descr_name.create-table.sql

◼

C41: ⟪ erbb2-stard3-ids.query.sql ⟫ ≡
root chunk

SELECT
  "Associated Gene Name",
  "Ensembl Gene ID"
FROM id_descr_name
WHERE "Associated Gene Name" IN ( 'ERBB2', 'STARD3' );

◼

Table 6: erbb2-stard3-ids.sql-result

Associated Gene Name	Ensembl Gene ID
STARD3	ENSG00000131748
ERBB2	ENSG00000141736

Ensembl IDs of ERBB2 and STARD3.

Gene association

We will try to find a correlation of the amplification status or the transcript levels between pairs of this set of genes:

C42: ⟪ Genes of interest ⟫ ≡
Appears in C43, C44

"STARD3", "ERBB2", "NDRG1", "LAPTM4B", "LAPTM4A"

◼

Pairs of genes

C43: ⟪ :dummy Generate pairs ⟫ ≡
root chunk

list = { ⟪ Genes of interest ⟫ };
for (i = 0; i < length(list); ++i)
    for (j = i+1; j < length(list); ++j)
        compare(list[i], list[j]);

◼

C44: ⟪ genes-of-interest-pairs.query.sql ⟫ ≡
root chunk

SELECT DISTINCT
  x."Associated Gene Name" AS name1,
  y."Associated Gene Name" AS name2,
  x."Ensembl Gene ID" AS id1,
  y."Ensembl Gene ID" AS id2
FROM id_descr_name AS x, id_descr_name AS y
WHERE name1 IN ( ⟪ Genes of interest ⟫ )
  AND name2 IN ( ⟪ Genes of interest ⟫ )
  AND name1 < name2
ORDER BY name1;

◼

This implicit rule keeps the headers, so the resulting *.tsv-result file can be loaded from R using read.delim with the default parameters.

C45: ⟪ :make Results of a query to a file ⟫ ≡
root chunk

%.tsv-result: tcga.db sql-tsv.bash %.query.sql
	bash sql-tsv.bash $*.query.sql \
	    | sqlite3 tcga.db \
	    > $@

◼

C46: ⟪ sql-tsv.bash ⟫ ≡
root chunk

echo '.mode tabs'
echo '.headers on'
cat "$1"

◼

Association results

Using this rule, we can write the query results to a file, read that file from R, and use it to extract the relevant genes (rows) from the amplification status and transcript level matrices.

C47: ⟪ :make ⟫ ≡
⇧ C22

genes-of-interest.Rsave: genes-of-interest.R \
                         genes-of-interest-pairs.tsv-result \
                         columns-of-interest.tsv-result \
                         input-data.Rsave
	Rscript --vanilla $^ $@

fig.genes-of-interest.%: plot-genes-of-interest.R \
                         genes-of-interest.Rsave
	Rscript --vanilla $^ $@ $*

⇩ C60

C48: ⟪ columns-of-interest.query.sql ⟫ ≡
root chunk

⟪ ampl-mrna.fragment.sql ⟫;

◼

To summarize some of the useful information in this subset of the data, we perform a χ² test on the amplification status from both genes, and a Pearson’s correlation test on the transcript levels from both genes.

C49: ⟪ genes-of-interest.R ⟫ ≡
root chunk

args.names <- c("from.file", "cols.file", "save.file", "result.file")
⟪ Read and name R command line arguments ⟫
⟪ Load genes of interest data from files ⟫
⟪ Keep only the overlapping columns ⟫

correlation <- function(x) {
    n1 <- x[1]
    n2 <- x[2]
    a1 <- ampl[x[3],]
    a2 <- ampl[x[4],]
    t1 <- mrna[x[3],]
    t2 <- mrna[x[4],]

    chisq.test(a1, a2) -> chisq.r

    ⟪ Color mRNA scatter plot R function ⟫
    scatter.plot.col(a1, a2) -> mrna.colors
    cor.test(t1, t2, method="pearson") -> cor.r

    list(name.a=n1, name.b=n2,
         chisq.result=chisq.r,
         transcript.a=t1, transcript.b=t2,
         mrna.col=mrna.colors,
         pearson.estimate=cor.r$estimate)
}

apply(pairs, 1, correlation) -> result
save(result, file=args["result.file"])

◼

Figure 2: fig.genes-of-interest.svg

The results for all possible pairs. The first two plots are a positive control: STARD3 and ERBB2 are on the same amplicon and are known to be co-amplified and co-overexpressed in breast cancer patient samples.

There is no correlation between NDRG1 and STARD3. They are not co-amplified and the transcript levels do not correlate.

LAPTM4B and STARD3 are not co-amplified. On the transcript level however there is a moderate positive correlation between the two.

LAPTM4B is co-amplified with NDRG1. This is reflected in the correlation of the transcript levels of the two genes.

C50: ⟪ Load genes of interest data from files ⟫ ≡
Appears in C49

as.matrix(read.delim(args["from.file"])) -> pairs
as.matrix(read.delim(args["cols.file"])) -> cols
load(args["save.file"])

◼

C51: ⟪ Keep only the overlapping columns ⟫ ≡
Appears in C49, C94

ampl <- ampl[,cols[,"ampl"]]
mrna <- mrna[,cols[,"mrna"]]

◼

Plotting the results

The plotting script uses the list of values computed above to draw summary plots. The names label the axes, the transcript levels are drawn as a scatter plot, and the point colors are the ones generated above.

C52: ⟪ plot-genes-of-interest.R ⟫ ≡
root chunk

args.names <- c("result.file", "fig.file", "format")
⟪ Read and name R command line arguments ⟫
load(args["result.file"])

⟪ Open graphics device R function ⟫
r <- length(result)
open.device(args["format"], 
            args["fig.file"],
            width=5.6, height=r*3.8,
            pointsize=14)
⟪ Define layout for correlation plots R function ⟫
cor.plots.layout(r)

⟪ Draw correlation plots R function ⟫
lapply(result, draw.cor.plots) -> foo

dev.off() -> foo

◼

Unfortunately, I could not figure out an easy way to choose the graphics device at run-time. Instead, I am using this hack to open the correct device based on a character argument:

C53: ⟪ Open graphics device R function ⟫ ≡
Appears in C52, C78, C83, C100

open.device <- function(format, ...) {
    if (format == "pdf") {
        pdf(...)
    } else if (format == "svg") {
        svg(...)
    }
}

◼

This defines a layout with as many rows as items in the result variable, and splits each row in two columns.

C54: ⟪ Define layout for correlation plots R function ⟫ ≡
Appears in C52, C100

cor.plots.layout <- function(r) {
    layout(matrix(1:(2*r), nrow=r, ncol=2, byrow=T),
           widths=c(4,7),
           heights=rep(1, r))
}

◼

C55: ⟪ Generate a subtitle ⟫ ≡
Appears in C56

make.subtitle.str <- function(str, limit, prec, x) {
    if (x < limit) {
        subtitle.str <- paste(str, "<", limit)
    } else {
        subtitle.str <- paste(str, "=", round(x, prec))
    }
}

◼

C56: ⟪ Draw correlation plots R function ⟫ ≡
Appears in C52, C100

draw.cor.plots <- function(x) {
    ⟪ Generate a subtitle ⟫
    ampl.col <- hcl(c(150,330), l=67)
    ⟪ Draw a barplot to visualize amplification status correlation ⟫
    ⟪ Draw a scatter plot to visualize transcript level correlation ⟫
}

◼

This uses the observed 2 × 2 contingency table from the χ² test to plot each value as a bar height.

C57: ⟪ Draw a barplot to visualize amplification status correlation ⟫ ≡
Appears in C56

subtitle.str <- make.subtitle.str("p-val", 0.005, 3, x$chisq.result$p.value)
cont.table <- t(x$chisq.result$observed)
foo <- c("basal", "amplified")
dimnames(cont.table) <- list(foo, foo)

barplot(cont.table, beside=T,
        col=ampl.col,
        main="Amplification",
        sub=subtitle.str,
        xlab=paste(x$name.a),
        ylab="number of patients")

legend("topright",
       title=paste(x$name.b),
       legend=foo,
       fill=ampl.col, border=ampl.col,
       bty="n")

◼

C58: ⟪ Draw a scatter plot to visualize transcript level correlation ⟫ ≡
Appears in C56

subtitle.str <- make.subtitle.str("cor", 0.1, 2, x$pearson.estimate)
plot(x$transcript.a, x$transcript.b,
     col=x$mrna.col,
     main=paste("Transcript levels"),
     sub=subtitle.str,
     xlab=x$name.a, ylab=x$name.b,
     bty="n")

legend("topleft",
    title = "Amplification",
    legend = c("amplified", "basal", "different"),
    pch = c(1, 1, 1),
    col=c("darkred", "darkblue", "olivedrab"),
    bty = "n")

◼

In addition to the actual transcript levels of the two genes, the scatter plot also uses the amplification status of both genes in each patient to draw the circle in blue (both negative), red (both positive), or green (different). This implementation works because the amplification statuses of both genes are represented as logical vectors, and we know that all NA values have been substituted with FALSE. When used in ordinary arithmetic, logical vectors are coerced into numeric vectors, with FALSE = 0 and TRUE = 1.

C59: ⟪ Color mRNA scatter plot R function ⟫ ≡
Appears in C49, C100

scatter.plot.col <- function(a1, a2) {
    c("darkblue", "olivedrab", "darkred")[1+a1+a2]
}

◼
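
The same indexing trick can be sketched outside R. Awk arrays filled by `split` start at index 1, like R vectors, so the expression 1 + a1 + a2 selects the color in the same way. The function name `status_color` and the 0/1 input encoding are assumptions for this example.

```shell
# Map two 0/1 amplification flags per line to a color name:
# sum 0 (both basal) -> darkblue, 1 (different) -> olivedrab,
# 2 (both amplified) -> darkred.
status_color() {
    awk '{
        split("darkblue olivedrab darkred", col, " ")
        print col[1 + $1 + $2]
    }'
}

printf '0 0\n1 0\n0 1\n1 1\n' | status_color
```

The two mixed cases map to the same color, which is why a single sum is enough and no nested conditions are needed.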

Survival

The next question is whether there is a correlation between gene amplification and transcription levels and patient survival.

Clinical patient data

The data on patient survival is extracted from a table with many clinical parameters.

Listing 5: clinical-column-number

Number of columns. Each column in this matrix is one clinical parameter of that patient. The majority are not interesting at this stage.

C60: ⟪ :make ⟫ ≡
⇧ C47

clinical-column-number: clinical/clinical_patient_all_brca.txt
	< $< ⟪ Get first row ⟫ | wc --words > $@

⇩ C61

Which could be the relevant columns? Time is likely to be measured in days or months.

Listing 6: clinical-relevant-columns

24:days_to_birth
25:days_to_death
26:days_to_initial_pathologic_diagnosis
27:days_to_last_followup
28:days_to_last_known_alive

Columns that could be relevant. We need the days to death (25), and last followup (27) or last known alive (28), whichever is longer.

C61: ⟪ :make ⟫ ≡
⇧ C60

clinical-relevant-columns: clinical/clinical_patient_all_brca.txt
	< $< ⟪ Get first row ⟫ \
	    | ⟪ Tab-separated fields to lines ⟫ \
	    | grep --line-number 'days\|months' \
	    > $@

⇩ C62

Extract the relevant columns only: Patient barcode (1), and days to death (25), to last followup (27), and to last known alive (28).

C62: ⟪ :make ⟫ ≡
⇧ C61

patient-days: clinical/clinical_patient_all_brca.txt
	< $^ ⟪ Drop first row ⟫ | cut --fields=1,25,27,28 > $@

⇩ C63

Note that column 26, “Days to initial pathologic diagnosis”, has a value of 0 for every one of the 863 patients:

Listing 7: days-to-initial

    863 0

A useless column.

C63: ⟪ :make ⟫ ≡
⇧ C62

days-to-initial: clinical/clinical_patient_all_brca.txt
	⟪ Drop first row ⟫ $< | cut --fields=26 | sort | uniq -c > $@

⇩ C64

Are there field values that are not numbers? Which are they? How many of each?

Listing 8: patient-days-nans

   1706 [Not Available]

Field values that are not numbers. There is only one value that is not a number: the string “[Not Available]”.

C64: ⟪ :make ⟫ ≡
⇧ C63

patient-days-nans: nan-fields.sh patient-days
	bash $^ > $@

⇩ C75

C65: ⟪ nan-fields.sh ⟫ ≡
root chunk

⟪ Drop first row ⟫ "$1" \
    | ⟪ Drop first column ⟫ \
    | ⟪ Tab-separated fields to lines ⟫ \
    | sed -n '/^[0-9.]\+$/!p' \
    | sort \
    | uniq --count

◼
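
The filter for non-numeric fields can be tried on a few made-up rows. This sketch assumes the input has already lost its header and first column; it uses `grep -E -v` in place of the `sed` expression, which should select the same lines, and the function name `non_numeric_counts` is hypothetical.

```shell
# Put every field on its own line, keep the ones that are not plain
# numbers, and count how often each distinct value occurs.
non_numeric_counts() {
    tr '\t' '\n' \
        | grep -E -v '^[0-9.]+$' \
        | sort \
        | uniq -c
}

# Made-up values: two numbers and two missing entries.
printf '12\t[Not Available]\n[Not Available]\t3.5\n' | non_numeric_counts
```

The output is a single line with the count 2 followed by the string “[Not Available]”.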

C66: ⟪ :make Normalized patient survival file ⟫ ≡
root chunk

patient_survival.rows-tsv: patient_survival.awk patient-days
	awk -f $^ > $@

◼

To prepare this table for the database, we normalize it first. This means that there will be one record in the database for each Patient-Value combination. All “[Not Available]” values will be skipped.

C67: ⟪ patient_survival.awk ⟫ ≡
root chunk

BEGIN {
    FS = "\t" ; OFS = "\t"
    col[2] = "Death"
    col[3] = "Last followup"
    col[4] = "Last known alive"
}
{
    for (i = 2; i <= 4; ++i) {
        if ($i != "[Not Available]") {
            print $1, col[i], $i
        }
    }
}

◼
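
A short run on made-up rows shows the effect of the normalization. The patient labels P1 and P2 and the day counts are invented, and `normalize_survival` is a hypothetical wrapper around the same Awk logic as patient_survival.awk.

```shell
# Turn one wide row per patient into one row per available value.
normalize_survival() {
    awk 'BEGIN {
             FS = "\t"; OFS = "\t"
             col[2] = "Death"
             col[3] = "Last followup"
             col[4] = "Last known alive"
         }
         {
             for (i = 2; i <= 4; ++i)
                 if ($i != "[Not Available]")
                     print $1, col[i], $i
         }'
}

# Columns: patient, days to death, to last followup, to last known alive.
printf 'P1\t[Not Available]\t1200\t1300\nP2\t400\t[Not Available]\t[Not Available]\n' \
    | normalize_survival
```

P1 yields two records (last followup and last known alive) and P2 yields one (death); the missing values produce no records at all.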

C68: ⟪ :make Patient survival to the database ⟫ ≡
root chunk

tcga.db: patient_survival.define-insert-table.sql

◼

Because the last column of this table is a number, it is better to define the schema of the table that is going to hold this data manually.

C69: ⟪ patient_survival.schema.sql ⟫ ≡
root chunk

CREATE TABLE patient_survival (
  "Barcode" TEXT,
  "Event" TEXT,
  "Days" INTEGER
);

◼

C70: ⟪ :make Implicit rule for SQL tables with an explicit schema ⟫ ≡
root chunk

%.define-insert-table.sql: define-insert-table.bash %.schema.sql %.rows-tsv
	bash $^ > $@

◼

Unlike ⟪ create-table.bash ⟫, which takes the column names from the header of the imported file, this script takes an explicitly defined schema as its first argument.

C71: ⟪ define-insert-table.bash ⟫ ≡
root chunk

NAME=$(basename --suffix=.schema.sql "$1")
echo 'DROP TABLE IF EXISTS "'"$NAME"'";'
cat "$1"
echo '.separator "\t"'
echo ".import '$2' $NAME"

◼

Long-term survival

We already know from ⟪ :table patients-tumors.sql-result ⟫ that there are 516 patients for whom we have both amplification and transcript level data. Now we can ask, for how many of them do we have an event (patient dies) within the first 5 years?

C72: ⟪ five-year-events.query.sql ⟫ ≡
root chunk

WITH ampl_mrna AS ( ⟪ ampl-mrna.fragment.sql ⟫ )
SELECT count(*) AS "Number of events"
FROM patient_survival
  INNER JOIN ampl_mrna
    ON ampl_mrna.ampl = "Barcode"
WHERE "Event" = 'Death'
  AND "Days" < 365*5;

◼

Number of events
39

And finally, we need a table that we can use for the survival statistics in R. At the very least, this table needs two columns: one with the follow-up times, and one with the status indicator, with 0 for alive and 1 for dead. We will also need the patient barcodes for the amplification dataset and the sample barcodes for the transcript level dataset.

C73: ⟪ survival-table.query.sql ⟫ ≡
root chunk

WITH
  ampl_mrna AS ( ⟪ ampl-mrna.fragment.sql ⟫ ),
  patient_event_time AS (
    SELECT
      "Barcode" AS barcode,
      "Event" AS event,
      MAX("Days"/365.25) AS years
    FROM patient_survival
    GROUP BY barcode
  )
SELECT
  ampl AS ampl_patient,
  mrna AS mrna_sample,
  CASE
    WHEN event = 'Death' AND years <= 5 THEN 1
    WHEN event <> 'Death' OR years > 5 THEN 0
  END AS status,
  CASE
    WHEN years <= 5 THEN years
    ELSE 5
  END AS time
FROM patient_event_time
  INNER JOIN ampl_mrna ON ampl_patient = barcode;

◼
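
The two CASE expressions censor every patient at five years. The sketch below applies the same rules to a few made-up (event, years) pairs with Awk; it only illustrates the censoring logic and is not part of the pipeline. The function name `censor_at_five` is hypothetical.

```shell
# Input: event and time in years, tab-separated.
# Output: status (1 = death within five years, 0 = censored)
# and the follow-up time capped at five years.
censor_at_five() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
         {
             status = ($1 == "Death" && $2 <= 5) ? 1 : 0
             t = ($2 <= 5) ? $2 : 5
             print status, t
         }'
}

printf 'Death\t2.5\nDeath\t7\nLast followup\t3\nLast followup\t9\n' \
    | censor_at_five
```

A death after seven years is treated like a patient who was still alive at the five-year mark: status 0 and time 5.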

C74: ⟪ survival-genes.query.sql ⟫ ≡
root chunk

SELECT
  "Associated Gene Name" AS name,
  "Ensembl Gene ID" AS id
FROM id_descr_name
WHERE name IN ('STARD3', 'LAPTM4B', 'NDRG1', 'SERPINA1');

◼

In order to avoid loading the complete data matrices multiple times, we will load them once and save only the four genes of interest, and only the columns we need.

C75: ⟪ :make ⟫ ≡
⇧ C64

survival-genes.Rsave: survival-genes.R \
                      input-data.Rsave \
                      survival-table.tsv-result \
                      survival-genes.tsv-result
	Rscript --vanilla $^ $@

⇩ C77

C76: ⟪ survival-genes.R ⟫ ≡
root chunk

args.names <- c("input.file",
                "clinical.file",
                "genes.file",
                "save.file")
⟪ Read and name R command line arguments ⟫

read.delim(args["clinical.file"]) -> clinical
attach(clinical)
as.character(ampl_patient) -> ampl.patient
as.character(mrna_sample) -> mrna.sample

load(args["input.file"])
ampl[,ampl.patient] -> ampl
mrna[,mrna.sample] -> mrna

as.matrix(read.delim(args["genes.file"])) -> genes

ampl.survival <- ampl[genes[,"id"],]
rownames(ampl.survival) <- genes[,"name"]
mrna.survival <- mrna[genes[,"id"],]
rownames(mrna.survival) <- genes[,"name"]

save(status, time, ampl.survival, mrna.survival, file=args["save.file"])

◼

Effects on long-term patient survival

Now we determine the correlation between the amplification status, transcript levels, and patient survival. First, we split patients in two groups. For the amplification data, the two groups are “basal” and “amplified”. For the transcript level data, the two groups are simply levels below and above the median within this gene.

C77: ⟪ :make ⟫ ≡
⇧ C75

%-survival-curve.svg: survival-curve.R survival-genes.Rsave
	Rscript --vanilla $^ $* $@ svg

%-survival-curve.pdf: survival-curve.R survival-genes.Rsave
	Rscript --vanilla $^ $* $@ pdf

⇩ C82

Figure 3: SERPINA1-survival-curve.svg

Survival curve for SERPINA1. SERPINA1 amplification status and transcript levels are positively correlated with patient survival, as reported previously.

Figure 5: NDRG1-survival-curve.svg

Survival curve for NDRG1. NDRG1 amplification and upregulation of transcript levels both have a weak negative effect on breast cancer patient survival.

Figure 6: LAPTM4B-survival-curve.svg

Survival curve for LAPTM4B. LAPTM4B amplification has a statistically significant effect on the survival of breast cancer patients. Upregulation of transcript levels exhibits a similar tendency, but the difference is not statistically significant.

C78: ⟪ survival-curve.R ⟫ ≡
root chunk

args.names <- c("save.file", "gene.name", "fig.file", "format")
⟪ Read and name R command line arguments ⟫
⟪ Load survival saved state ⟫
⟪ Survival statistics ⟫

⟪ Open graphics device R function ⟫
open.device(args["format"],
            args["fig.file"], width=8, height=4.4, pointsize=11)
par(mfrow=c(1,2))
⟪ Plot survival R function ⟫
plot.survival(list(fit=survfit.ampl, diff=survdiff.ampl),
              main.text="Amplification",
              title.text=paste(gn, "amplification"),
              legend.text=c("basal", "amplified"))
plot.survival(list(fit=survfit.mrna, diff=survdiff.mrna),
              main.text="Transcript levels",
              title.text=paste(gn, "transcript"),
              legend.text=c("below median", "above median"))
dev.off() -> foo

◼

C79: ⟪ Load survival saved state ⟫ ≡
Appears in C78

load(args["save.file"])
args["gene.name"] -> gn
ampl.survival[gn,] -> ampl
mrna.survival[gn,] -> mrna

◼

C80: ⟪ Survival statistics ⟫ ≡
Appears in C78

mrna > median(mrna, na.rm=T) -> mrna.high

library(survival)

Surv(time, status) -> so
survfit(so ~ ampl) -> survfit.ampl
survdiff(so ~ ampl) -> survdiff.ampl
survfit(so ~ mrna > median(mrna, na.rm=T)) -> survfit.mrna
survdiff(so ~ mrna > median(mrna, na.rm=T)) -> survdiff.mrna

◼

The p-value is calculated from the χ² test statistic with one degree of freedom (this is always the case, since we always have two groups, and both have patients in them). A table with the “numbers at risk” is added directly to the plot.
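
A minimal sketch of that calculation, with a made-up test statistic standing in for x$diff$chisq (the value is an assumption, not a result from the data). With two groups, the statistic is compared against a χ² distribution with 2 − 1 = 1 degree of freedom.

```r
# Hypothetical value of the test statistic (an assumption, not a result).
chisq <- 3.841459

# Upper tail of the chi-squared distribution, one degree of freedom.
pval <- pchisq(chisq, df=1, lower.tail=FALSE)

# The same quantity, written as in plot.survival.
pval.alt <- 1 - pchisq(chisq, 1)

print(signif(pval, digits=2))      # 0.05
```

Both forms agree here. Asking for the upper tail directly only makes a difference for very small p-values, where subtracting from 1 loses precision.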

C81: ⟪ Plot survival R function ⟫ ≡
Appears in C78, C83

plot.survival <- function(x, main.text, title.text, legend.text) {
    pval <- 1 - pchisq(x$diff$chisq, 1)
    cols <- c("darkblue", "darkred")
    plot(x$fit,
         main=main.text,
         ylab="Survival", xlab="Years",
         xlim=c(-0.4,5.3), ylim=c(0.5,1),
         col=cols,
         bty="n")
    legend("bottomleft",
           legend=legend.text,
           lty=1,
           bty="n",
           col=cols)
    legend("bottomright",
           bty="n",
           legend=paste("p=", signif(pval, digits=2), sep=""))
    text(x=-0.3, y=0.71, labels="Numbers at risk", pos=4, offset=0)
    text(x=0:5,
         y=rep(c(0.67,0.63), each=6),
         col=rep(cols, each=6),
         labels=summary(x$fit, times=0:5)$n.risk)
}

◼

Is there an additive (negative) effect of LAPTM4B and NDRG1 on patient survival?

C82: ⟪ :make ⟫ ≡
⇧ C77

%-survival-curve-combined.svg: survival-curve-combined.R \
                               survival-genes.Rsave
	Rscript --vanilla $^ $* $@ svg

%-survival-curve-combined.pdf: survival-curve-combined.R \
                               survival-genes.Rsave
	Rscript --vanilla $^ $* $@ pdf

⇩ C87

Figure 7: NDRG1-LAPTM4B-survival-curve-combined.svg

Combined effect of NDRG1 and LAPTM4B on survival. On the left-hand side, the patients are separated into two groups: in the group “both high” (in red), the transcript levels of both genes are above the median. The other group, “different” (in blue), contains all other patients. On the right-hand side, patients with different mRNA expression levels for the two transcripts were removed from the data set. The “high” group (in red) contains the patients with both transcript levels high, while the “low” group (in blue) contains the patients with both transcript levels low.
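
The grouping behind the two panels can be tried on a toy matrix. This is a sketch under assumptions: the gene labels geneA and geneB, the patient labels p1 to p6 and all values are made up for illustration.

```r
# Hypothetical transcript levels: two genes (rows), six patients (columns).
mrna <- rbind(geneA=c(1, 5, 2, 6, 3, 4),
              geneB=c(9, 8, 2, 7, 1, 6))
colnames(mrna) <- paste0("p", 1:6)

# One median per gene; the comparison recycles the medians down each
# column, so row i is compared with median i.
mrna.median <- apply(mrna, 1, median, na.rm=TRUE)
both.high <- apply(mrna > mrna.median, 2, all)
both.low <- apply(mrna < mrna.median, 2, all)

# Left panel: "both high" (p2, p4) against all other patients.
print(both.high)
# Right panel: only patients in one of the two extreme groups remain,
# so p1 and p6 (one gene high, the other low) are dropped.
filtered <- both.high | both.low
print(both.high[filtered])
```

Note that the right-hand panel works with fewer patients than the left-hand one, which by itself makes a small p-value harder to reach.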

C83: ⟪ survival-curve-combined.R ⟫ ≡
root chunk

args.names <- c("save.file", "gene1.gene2.names", "fig.file", "format")
⟪ Read and name R command line arguments ⟫
⟪ Names and transcript levels of the two genes ⟫
apply(mrna, 1, median, na.rm=T) -> mrna.median
mrna > mrna.median -> is.high
apply(is.high, 2, all) -> both.high

library(survival)
⟪ Patient survival: both high vs. different ⟫ -> high.different
⟪ Patient survival: both high vs. both low ⟫ -> high.low

⟪ Open graphics device R function ⟫
open.device(args["format"],
            args["fig.file"], width=8, height=4.4, pointsize=11)
par(mfrow=c(1,2))
⟪ Plot survival R function ⟫
plot.survival(high.different,
              main.text="Different vs. High",
              legend.text=c("different", "both high"))
plot.survival(high.low,
              main.text="Low vs. High",
              legend.text=c("both low", "both high"))
dev.off() -> foo

◼

C84: ⟪ Names and transcript levels of the two genes ⟫ ≡
Appears in C83

load(args["save.file"])
strsplit(args["gene1.gene2.names"],
         split="-", fixed=T)[[1]] -> gene.names
mrna.survival[gene.names,] -> mrna

◼

C85: ⟪ Patient survival: both high vs. different ⟫ ≡
Appears in C83

Surv(time, status) -> so
list(fit=survfit(so ~ both.high), diff=survdiff(so ~ both.high))

◼

C86: ⟪ Patient survival: both high vs. both low ⟫ ≡
Appears in C83

mrna < mrna.median -> is.low
apply(is.low, 2, all) -> both.low
both.high | both.low -> filtered
time[filtered] -> time.f
status[filtered] -> status.f
both.high[filtered] -> both.high.f

Surv(time.f, status.f) -> so.f
list(fit=survfit(so.f ~ both.high.f), diff=survdiff(so.f ~ both.high.f))

◼

Protein complexes regulated by micro-RNA

Is there a correlation between the predicted targets of the same miRNA? This question is motivated by the known phenomenon of a miRNA targeting a complex of functionally related proteins. The predicted targets are collected using a short list of experimentally validated micro-RNAs. Then, the predicted targets for each gene are correlated with the gene using the available mRNA levels from patient solid tumors.

Micro-RNAs and predicted targets

To answer this question, we will try to find experimentally validated micro-RNAs that target the genes of interest, as listed in ⟪ Genes of interest ⟫. We drop STARD3 from that list, as it correlates very strongly with ERBB2 and is far less studied experimentally; we also drop LAPTM4A, as it showed no correlation with any of the other genes, and is also poorly studied. We are left with ERBB2, NDRG1, and LAPTM4B. The main criterion for choosing a miRNA that targets each gene is that it downregulates the gene in the context of cancer progression.

For ERBB2, hsa-miR-155-5p (MIMAT0000646) was shown to downregulate ErbB2 and suppress ErbB2-induced malignant transformation of breast epithelial cells by two distinct mechanisms: repressing ErbB2 transcription, and directly targeting ErbB2 via a regulatory element (doi:10.1038/onc.2016.132).

For NDRG1, miR-769-3p (MIMAT0003887) was found to down-regulate NDRG1 in MCF-7 cells during reoxygenation (doi:10.1038/srep05908).

For LAPTM4B, miR-188-5p (MIMAT0000457) was found to inhibit tumor growth and metastasis by repressing LAPTM4B expression, acting as a tumor suppressor (doi:10.18632/oncotarget.3341).

Listing 9: microrna/table-header

     1	miRNA
     2	MIMATid
     3	Gene
     4	EntrezID
     5	RefseqID
     6	miRWalk
     7	miRanda
     8	miRDB
     9	RNA22
    10	Targetscan
    11	SUM

Column labels.

C87: ⟪ :make ⟫ ≡
⇧ C82

microrna/table-header: microrna/3utr-comparative.tsv
	⟪ Get first row ⟫ $^ \
	    | ⟪ Tab-separated fields to lines ⟫ \
	    | nl \
	    > $@

⇩ C89

For now, we are interested in the miRNA identifier (1), the predicted target gene (3), the IDs of the mRNA sequences being targeted (5), and the number of positive predictions for that gene, commonly used as an indicator of the prediction strength (11). We will add a table with these columns to the relational database.

C88: ⟪ mirna.schema.sql ⟫ ≡
root chunk

CREATE TABLE mirna (
  mirna TEXT,
  gene TEXT,
  refseqid TEXT,
  sum INTEGER
);

◼

C89: ⟪ :make ⟫ ≡
⇧ C87

tcga.db: mirna.define-insert-table.sql
mirna.rows-tsv: microrna/3utr-comparative.tsv
	⟪ Drop first row ⟫ $^ \
	    | cut --fields=1,3,5,11 \
	    > $@

⇩ C90

To find the intersection of the genes that are among the predicted targets and the genes in the transcript level matrix, we will insert the row labels of the two matrices into the relational database.

C90: ⟪ :make ⟫ ≡
⇧ C89

tcga.db: ampl_rownames.create-table.sql \
         mrna_rownames.create-table.sql
%_rownames.tsv: %.rownames
	echo "Ensembl Gene ID" > $@
	cat $^ >> $@

⇩ C97

To be able to calculate the transcript level correlation for each of these genes, we need the Ensembl Gene IDs for the predictions for each of the three miRNAs. Note the reuse of columns-of-interest.tsv-result, generated in ⟪ columns-of-interest.query.sql ⟫.

C91: ⟪ :make Calculate and save the mRNA correlation coefficients ⟫ ≡
root chunk

mirna-mrna.Rsave: mirna-mrna.R \
                  input-data.Rsave \
                  columns-of-interest.tsv-result \
                  mirna-genes.tsv-result \
                  hsa-miR-155-5p.mirna-predictions.tsv \
                  hsa-miR-188-5p.mirna-predictions.tsv \
                  hsa-miR-769-3p.mirna-predictions.tsv
	Rscript --vanilla $^ $@

mirna-counts.txt: hsa-miR-155-5p.mirna-predictions.tsv \
                  hsa-miR-188-5p.mirna-predictions.tsv \
                  hsa-miR-769-3p.mirna-predictions.tsv
	wc --lines $^ > $@

%.mirna-predictions.tsv: mirna-predictions.bash tcga.db
	bash $^ $* > $@

◼

Listing 10: mirna-counts.txt

  957 hsa-miR-155-5p.mirna-predictions.tsv
  826 hsa-miR-188-5p.mirna-predictions.tsv
  770 hsa-miR-769-3p.mirna-predictions.tsv
 2553 total

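Number of lines in the prediction file for each miRNA. Each file starts with a header line, so the number of predicted targets is one less than the line count.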

C92: ⟪ mirna-genes.query.sql ⟫ ≡
root chunk

WITH gene_mirna ("Associated Gene Name", mirna) AS (
  VALUES
    ('ERBB2', 'hsa-miR-155-5p'),
    ('NDRG1', 'hsa-miR-769-3p'),
    ('LAPTM4B', 'hsa-miR-188-5p')
)
SELECT
  "Associated Gene Name" AS name,
  "Ensembl Gene ID" AS id,
  mirna
FROM id_descr_name
  NATURAL INNER JOIN gene_mirna;

◼

C93: ⟪ mirna-predictions.bash ⟫ ≡
root chunk

sqlite3 "$1" << end_of_file
.header on
.separator "\t"
WITH gene_count_sum AS (
  SELECT
    gene,
    count(refseqid) AS count,
    sum
  FROM mirna
  WHERE mirna = '$2'
  GROUP BY gene, sum
)
SELECT DISTINCT
  gene,
  "Ensembl Gene ID" AS id
FROM gene_count_sum
  INNER JOIN id_descr_name
    ON gene = "Associated Gene Name"
  NATURAL INNER JOIN ampl_rownames
  NATURAL INNER JOIN mrna_rownames
WHERE sum >= 4
  OR (sum = 3 AND count >= 2);
end_of_file

◼
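
The WHERE clause of this query can be mirrored in R on a toy table. This is a sketch under assumptions: the gene names G1 to G4 and the identifiers NM_1 to NM_5 are made up, and sum is read as the number of prediction tools with a positive prediction. A target is kept when at least four tools agree, or when three agree and at least two transcripts of the gene carry that score.

```r
# Hypothetical rows shaped like the mirna table for a single miRNA:
# one row per targeted RefSeq transcript.
mirna <- data.frame(gene=c("G1", "G2", "G2", "G3", "G4"),
                    refseqid=c("NM_1", "NM_2", "NM_3", "NM_4", "NM_5"),
                    sum=c(4, 3, 3, 3, 2))

# Number of transcripts per (gene, sum) pair, like GROUP BY gene, sum.
count <- ave(seq_along(mirna$gene), mirna$gene, mirna$sum, FUN=length)

# Same condition as the WHERE clause of the query.
keep <- mirna$sum >= 4 | (mirna$sum == 3 & count >= 2)
print(unique(mirna$gene[keep]))
```

In this toy table G1 passes on the first condition and G2 on the second; G3 has the score 3 for a single transcript only, and G4 scores too low.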

C94: ⟪ mirna-mrna.R ⟫ ≡
root chunk

args.names <- c("input",
                "cols",
                "genes",
                "one",
                "two",
                "three",
                "result")
⟪ Read and name R command line arguments ⟫

as.matrix(read.delim(args["genes"])) -> mirna.gene.id
mirna.gene.id[,c("name","id")] -> gene.id
mirna.gene.id[,"mirna"] -> rownames(gene.id)

⟪ Load miRNA targets R function ⟫
lapply(list("one", "two", "three"),
       load.mirna.targets) -> mirna.targets
lapply(mirna.targets,
       function(x) { x$gene["name"] }) -> names(mirna.targets)

load(args["input"])
as.matrix(read.delim(args["cols"])) -> cols
⟪ Keep only the overlapping columns ⟫

pearson.cor <- function(x, y) {
    cor.test(x, y, method="pearson")$estimate
}

⟪ Shortlist highly correlated targets R function ⟫
lapply(mirna.targets, target.cor) -> gene.complex

save(gene.complex, file=args["result"])

◼

C95: ⟪ Load miRNA targets R function ⟫ ≡
Appears in C94

load.mirna.targets <- function(arg.name) {
    strsplit(args[arg.name],
             split=".", fixed=T)[[1]][1] -> mirna.name
    as.matrix(read.delim(args[arg.name])) -> mirna.targets
    mirna.targets[,"id"] -> target.ids
    mirna.targets[,"gene"] -> names(target.ids)
    list(mirna=mirna.name,
         gene=gene.id[mirna.name,],
         targets=target.ids)
}

◼

C96: ⟪ Shortlist highly correlated targets R function ⟫ ≡
Appears in C94

target.cor <- function(x) {
    mrna[x$gene["id"],] -> gene.mrna
    mrna[x$targets,] -> targets.mrna
    names(x$targets) -> rownames(targets.mrna)
    apply(targets.mrna, 1, pearson.cor, gene.mrna) -> target.cors
    sort(target.cors[target.cors > 0.33], decreasing=T) -> shortlist
    targets.mrna[names(shortlist),] -> targets.mrna
    ampl[x$gene["id"],] -> gene.ampl
    ampl[x$targets,] -> targets.ampl
    names(x$targets) -> rownames(targets.ampl)
    targets.ampl[names(shortlist),] -> targets.ampl
    
    list(mirna=x$mirna,
         hits=shortlist,
         gene.ampl=gene.ampl,
         ampl=targets.ampl,
         gene.mrna=gene.mrna,
         mrna=targets.mrna)
}

◼
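
The shortlisting step of target.cor can be tried on a small deterministic example. This is a sketch under assumptions: the target labels T1 to T3 and all values are made up, and cor is called directly instead of going through cor.test, which reports the same Pearson estimate.

```r
# Hypothetical transcript levels of the gene of interest in six samples.
gene.mrna <- c(1, 2, 3, 4, 5, 6)

# Three hypothetical predicted targets (rows) in the same six samples.
targets.mrna <- rbind(T1=c(1.1, 1.9, 3.2, 3.8, 5.1, 6.0),  # follows the gene
                      T2=c(1, -1, 1, -1, 1, -1),            # unrelated
                      T3=c(6, 5, 4, 3, 2, 1))               # mirrors the gene

# Pearson correlation of every target with the gene of interest.
target.cors <- apply(targets.mrna, 1, cor, gene.mrna, method="pearson")

# Same one-sided cut-off as in target.cor: only r > 0.33 is kept.
shortlist <- sort(target.cors[target.cors > 0.33], decreasing=TRUE)
print(names(shortlist))
```

Only T1 survives. The cut-off is one-sided, so a strongly anti-correlated target such as T3 is dropped together with the uncorrelated one.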

Listing 11: mirna-mrna-result.txt

Gene:  ERBB2 
miRNA:  hsa-miR-155-5p 
Targets:
[1] "RAB27B" "GPD1L" 

Gene:  LAPTM4B 
miRNA:  hsa-miR-188-5p 
Targets:
 [1] "LAPTM4B"  "YWHAZ"    "JMJD6"   
 [4] "SLMO2"    "TXNRD1"   "C1orf106"
 [7] "RBL1"     "CDC25B"   "PVR"     
[10] "TMEM194A" "PCMT1"    "UQCRB"   
[13] "UCK2"     "FBXO45"   "OSGIN2"  
[16] "USP31"    "SNX22"    "AQP9"    
[19] "SPAST"   

Gene:  NDRG1 
miRNA:  hsa-miR-769-3p 
Targets:
 [1] "NDRG1"    "C11orf86" "SLC6A2"  
 [4] "KIF1B"    "TRIM2"    "CDCA8"   
 [7] "STAC"     "HIC2"     "FOXK2"   
[10] "KLF11"    "ASAP1"    "PHKG1"   
[13] "MFGE8"    "RBM38"

The predicted targets with high correlation on the transcript level.

C97: ⟪ :make ⟫ ≡
⇧ C90

mirna-mrna-result.txt: mirna-mrna-result.txt.R mirna-mrna.Rsave
	Rscript --vanilla $^ $@

◼

C98: ⟪ mirna-mrna-result.txt.R ⟫ ≡
root chunk

args.names <- c("data", "result")
⟪ Read and name R command line arguments ⟫
load(args["data"])

print.gene.complex <- function(i) {
    cat("Gene: ", names(gene.complex)[[i]], "\n")
    cat("miRNA: ", gene.complex[[i]]$mirna, "\n")
    cat("Targets:\n")
    print(names(gene.complex[[i]]$hits))
    cat("\n")
}

sink(file=args["result"])
options(width=40)
lapply(seq_along(gene.complex), print.gene.complex) -> foo
sink(file=NULL)

◼

For LAPTM4B, there are several interesting hits. One is PVR, encoding the Poliovirus receptor protein, which mediates natural killer (NK) cell adhesion and triggers NK cell effector functions. It might provide tumors with a mechanism of immunoevasion, and it plays a role in mediating tumor cell invasion and migration.

Another interesting gene is SNX22, encoding the Sorting nexin-22 protein, which may be involved in several stages of intracellular trafficking (By similarity). It interacts with membranes containing phosphatidylinositol 3-phosphate (PtdIns(3P)).

Finally, SPAST (Spastin) is a protein that is required for membrane traffic from the ER to the Golgi.

Now, visualize the amplification status and transcript level correlation between LAPTM4B and each of these three genes, as available in the patient solid tumor samples.

C99: ⟪ :make Draw correlation plots for miRNA targets ⟫ ≡
root chunk

%-mirna.svg: plot-mirna-target.R mirna-mrna.Rsave
	Rscript --vanilla $^ $@ $* svg

%-mirna.pdf: plot-mirna-target.R mirna-mrna.Rsave
	Rscript --vanilla $^ $@ $* pdf

◼

C100: ⟪ plot-mirna-target.R ⟫ ≡
root chunk

args.names <- c("data", "plot", "genes", "format")
⟪ Read and name R command line arguments ⟫
load(args["data"])
strsplit(args["genes"], split="-", fixed=T)[[1]] -> gene.names
gene.names[1] -> gene.of.interest
gene.names[2] -> mirna.target
gene.complex[[gene.of.interest]] -> g
a1 <- g$gene.ampl
a2 <- g$ampl[mirna.target,]

⟪ Color mRNA scatter plot R function ⟫
col <- scatter.plot.col(a1, a2)

⟪ Open graphics device R function ⟫
open.device(args["format"], args["plot"],
            width=7.5, height=5.2, pointsize=12)

⟪ Define layout for correlation plots R function ⟫
cor.plots.layout(1)

⟪ Draw correlation plots R function ⟫

draw.cor.plots(list(name.a=gene.of.interest,
                    name.b=mirna.target,
                    chisq.result=chisq.test(a1, a2),
                    transcript.a=g$gene.mrna,
                    transcript.b=g$mrna[mirna.target,],
                    mrna.col=col,
                    pearson.estimate=g$hits[mirna.target]))

dev.off() -> foo

◼